⚡ Bolt: Optimize email search with raw content scanning #469
Refactors `DatabaseManager.search_emails_with_limit` to use a "scan-first" strategy. Instead of always parsing the JSON content of a file (which is CPU-intensive), it now reads the raw file content as a string first to check for the search term. Only if the term is found in the raw string does it proceed to parse the JSON and verify the match in the `content` field. This avoids `json.load` overhead for the vast majority of non-matching files.

Impact:
- Reduces search time by ~9% in benchmarks with 1000 emails.
- Reduces memory allocation for non-matching files (avoiding dict and string allocations).
- Safe against false positives (verifies the match after parsing).

Co-authored-by: MasumRab <8943353+MasumRab@users.noreply.github.com>
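The described scan-first flow can be sketched as a standalone helper. This is a reconstruction from the PR description and the diff later in this thread, not the repository's exact code; the gzip storage format and the `FIELD_CONTENT` constant are assumptions taken from context:

```python
import gzip
import json

FIELD_CONTENT = "content"  # field name assumed from the PR description

def scan_content_for_search(path: str, search_term_lower: str) -> bool:
    """Cheap raw-text scan first; parse JSON only on a candidate hit."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            file_content = f.read()
        # Fast rejection: most files will not contain the term at all,
        # so json.loads is skipped for them entirely.
        if search_term_lower not in file_content.lower():
            return False
        # Candidate hit: parse and confirm the match is in the content
        # field, guarding against false positives from other JSON keys.
        data = json.loads(file_content)
        content = data.get(FIELD_CONTENT, "")
        return isinstance(content, str) and search_term_lower in content.lower()
    except (IOError, json.JSONDecodeError, ValueError):
        return False
```

Note that the raw scan can only produce false positives (a term appearing in a subject line or header key), never false negatives, which is why the post-parse verification is sufficient for correctness.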
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
Reviewer's Guide

Introduces a fast pre-scan helper for gzipped email content that checks for the search term in the raw text before doing JSON parsing, and wires search to use this helper in a background thread while updating the corresponding performance test expectations.

Sequence diagram for optimized email search with raw content scanning

```mermaid
sequenceDiagram
    actor User
    participant DatabaseManager
    participant AsyncIO as AsyncIOEventLoop
    participant WorkerThread
    participant FileSystem
    User->>DatabaseManager: search_emails_with_limit(search_term, limit)
    DatabaseManager->>DatabaseManager: compute search_term_lower
    loop for each email_light in source_emails
        DatabaseManager->>DatabaseManager: content_path = _get_email_content_path(email_id)
        DatabaseManager->>FileSystem: check exists(content_path)
        alt content_path exists
            DatabaseManager->>AsyncIO: asyncio.to_thread(_scan_content_for_search, content_path, search_term_lower)
            AsyncIO->>WorkerThread: run _scan_content_for_search(path, search_term_lower)
            WorkerThread->>FileSystem: gzip.open(path)
            WorkerThread->>WorkerThread: read file_content as text
            alt search_term_lower not in file_content.lower()
                WorkerThread-->>AsyncIO: return False
            else search_term_lower in file_content.lower()
                WorkerThread->>WorkerThread: data = json.loads(file_content)
                WorkerThread->>WorkerThread: content = data.get(FIELD_CONTENT, "")
                WorkerThread-->>AsyncIO: return isinstance(content, str) and search_term_lower in content.lower()
            end
            AsyncIO-->>DatabaseManager: is_match
            alt is_match is True
                DatabaseManager->>DatabaseManager: append email_light to filtered_emails
            else is_match is False
                DatabaseManager->>DatabaseManager: skip email_light
            end
        else content_path missing
            DatabaseManager->>DatabaseManager: skip email_light
        end
    end
    DatabaseManager->>DatabaseManager: results = [_add_category_details(email) for email in filtered_emails]
    DatabaseManager-->>User: results
```
Class diagram for DatabaseManager search optimization changes

```mermaid
classDiagram
    class DatabaseManager {
        +_read_content_sync(content_path: str) Dict~str, Any~
        +_scan_content_for_search(path: str, search_term_lower: str) bool
        +_get_email_content_path(email_id: str) str
        +search_emails_with_limit(search_term: str, limit: int) List~Dict~str, Any~~
    }
    DatabaseManager : uses gzip
    DatabaseManager : uses json
    DatabaseManager : uses asyncio.to_thread
    DatabaseManager : uses os.path.exists
```
File-Level Changes
🤖 Hi @MasumRab, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
**Warning: Rate limit exceeded**

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source, and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.

Walkthrough

Added a private helper.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DB as DatabaseManager
    participant Thread as ThreadWorker
    participant FS as FileSystem
    Client->>DB: search_emails_with_limit(query, limit)
    DB->>DB: produce candidate list from DB/index
    DB->>Thread: asyncio.to_thread(_scan_content_for_search, path, term)
    Thread->>FS: open/read file (raw scan)
    FS-->>Thread: raw content
    Thread->>Thread: check raw for term, parse JSON if needed
    Thread-->>DB: boolean (match / no-match)
    DB-->>Client: filtered results (up to limit)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Hey - I've found 1 issue, and left some high-level feedback:

- The new `_scan_content_for_search` swallows I/O/JSON errors silently, whereas the previous implementation logged them; consider at least logging at debug level or preserving the error logging to avoid losing observability for corrupted or unreadable email content.
- Reading and lowercasing the entire gzip file into a single string may be memory-inefficient for large messages; consider a streaming scan (e.g., reading in chunks or lines and short-circuiting on first match) before falling back to `json.loads` when needed.
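The streaming suggestion in the second bullet might look like the sketch below. `scan_stream_for_term` and the chunk size are illustrative names, not anything from the PR; the overlap buffer exists so a term that straddles a chunk boundary is still found:

```python
import gzip

def scan_stream_for_term(path: str, term_lower: str, chunk_size: int = 64 * 1024) -> bool:
    """Scan a gzipped text file for term_lower without loading it all into memory."""
    overlap = len(term_lower) - 1
    tail = ""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # EOF without a match
            window = tail + chunk.lower()
            if term_lower in window:
                return True  # short-circuit on first match
            # Keep the last len(term) - 1 characters so a match spanning
            # two chunks is not missed on the next iteration.
            tail = window[-overlap:] if overlap > 0 else ""
```

A positive result would still need the JSON-parse confirmation step, since the raw stream can match on keys or headers outside the content field.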
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `_scan_content_for_search` swallows I/O/JSON errors silently, whereas the previous implementation logged them; consider at least logging at debug level or preserving the error logging to avoid losing observability for corrupted or unreadable email content.
- Reading and lowercasing the entire gzip file into a single string may be memory-inefficient for large messages; consider a streaming scan (e.g., reading in chunks or lines and short-circuiting on first match) before falling back to `json.loads` when needed.
## Individual Comments
### Comment 1
<location> `src/core/database.py:197` </location>
<code_context>
+ data = json.loads(file_content)
+ content = data.get(FIELD_CONTENT, "")
+ return isinstance(content, str) and search_term_lower in content.lower()
+ except (IOError, json.JSONDecodeError, ValueError):
+ return False
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Swallowing all I/O/parse errors here may make debugging data issues harder
This helper now converts all errors into a bare `False` without logging, whereas the previous behavior logged the email ID and exception. That can hide issues such as corrupted gzip files or malformed JSON and make production incidents harder to diagnose. Consider at least logging at debug/warn level (or letting the caller handle logging) so operational problems remain visible.
</issue_to_address>
```python
data = json.loads(file_content)
content = data.get(FIELD_CONTENT, "")
return isinstance(content, str) and search_term_lower in content.lower()
except (IOError, json.JSONDecodeError, ValueError):
```
**issue (bug_risk):** Swallowing all I/O/parse errors here may make debugging data issues harder

This helper now converts all errors into a bare `False` without logging, whereas the previous behavior logged the email ID and exception. That can hide issues such as corrupted gzip files or malformed JSON and make production incidents harder to diagnose. Consider at least logging at debug/warn level (or letting the caller handle logging) so operational problems remain visible.
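A variant of the helper from the diff above with this suggestion applied: the fast-fail behavior is kept, but failures leave a trail. The logger setup and the `FIELD_CONTENT` constant are assumptions, not the repository's exact code:

```python
import gzip
import json
import logging

logger = logging.getLogger(__name__)

FIELD_CONTENT = "content"  # assumed field name, taken from the diff above

def scan_content_for_search(path: str, search_term_lower: str) -> bool:
    """Scan-first match check that logs failures instead of hiding them."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            file_content = f.read()
        if search_term_lower not in file_content.lower():
            return False
        data = json.loads(file_content)
        content = data.get(FIELD_CONTENT, "")
        return isinstance(content, str) and search_term_lower in content.lower()
    except (IOError, json.JSONDecodeError, ValueError) as exc:
        # Still treat errors as non-matches, but surface corrupted gzip
        # files or malformed JSON for later diagnosis.
        logger.warning("Content scan failed for %s: %s", path, exc)
        return False
```

Since `IOError` is an alias of `OSError`, this clause also covers `gzip.BadGzipFile` raised for corrupted archives.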
🤖 I'm sorry @MasumRab, but I was unable to process your request. Please see the logs for more details.
The system package index (PyTorch CPU) only provides wheels for MarkupSafe 3.0.2 for Python 3.13, causing installation failures on Python 3.11/3.12 in CI. This change pins `markupsafe<3.0.0` in `pyproject.toml` and updates `uv.lock` to use version 2.1.5 which has broad wheel availability. Co-authored-by: MasumRab <8943353+MasumRab@users.noreply.github.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@setup/pyproject.toml`:
- Around line 32-34: The dependency pin "markupsafe<3.0.0" is unrelated to this
PR and uses an incomplete constraint; either remove/move it to a separate
commit/PR with a short rationale, or if it must remain here replace the
single-sided constraint with a proper range (for example use a lower and upper
bound such as markupsafe>=2.1.0,<3.0.0) and add a brief comment in
pyproject.toml explaining why that exact range is required for the changes in
this branch.
```toml
# Fix for markup issues with system packages
"markupsafe<3.0.0",
```
🧩 Analysis chain
🌐 Web query:
markupsafe PyPI latest version 2024 2025 2026 release history 3.0.0
💡 Result:
MarkupSafe (PyPI) — latest + 2024–2026 release history (incl. 3.0.0)
Latest on PyPI (as of Feb 14, 2026)
- Latest version: 3.0.3
- Release date: Sep 27, 2025 [1][2]
- No newer (2026) release is listed on PyPI as of this date [1].
Release history (2024–2026)
| Version | Release date | Notes (highlights) |
|---|---|---|
| 3.0.3 | 2025-09-27 | DeprecationWarning for __version__, PEP 489 multi-phase init, Windows ARM64 wheels, Python 3.14 wheels, riscv64 wheels [2] |
| 3.0.2 | 2024-10-18 | __str__ str-subclass compatibility; build requires setuptools ≥ 70.1 [2] |
| 3.0.1 | 2024-10-08 | GCC 14 warning fixes; proxy-object compatibility [2] |
| 3.0.0 | 2024-10-07 | Python 3.13 (+ free-threaded build) support; drop Python 3.7/3.8; switch to pyproject.toml metadata [2] |
| 2.1.5 | 2024-02-02 | striptags space-collapsing fix [2] |
| 2.1.4 | 2024-01-19 | Avoid regex in striptags (performance) [2] |
Sources: PyPI project page/release history [1]; MarkupSafe “Changes” (changelog) [2].
🏁 Script executed:

```shell
cat setup/pyproject.toml
```

Repository: MasumRab/EmailIntelligence

Length of output: 2690
Unrelated dependency pin with incomplete version constraint — consider splitting or clarifying rationale.
The `markupsafe<3.0.0` pin is unrelated to the search-optimization objective of this PR. Bundling unrelated dependency fixes makes it harder to bisect regressions. Either move this to a separate commit/PR with clear justification or clarify why it's necessary for the current changes.

Additionally, specifying only an upper bound without a lower bound allows installation of arbitrarily old versions. If retaining this pin, use a proper range like `markupsafe>=2.1.0,<3.0.0`. Note: markupsafe 3.0.0+ is now stable (released Oct 2024; current latest is 3.0.3), so the pin does exclude currently active versions; confirm this constraint is necessary.
🤖 Prompt for AI Agents
In `@setup/pyproject.toml` around lines 32 - 34, The dependency pin
"markupsafe<3.0.0" is unrelated to this PR and uses an incomplete constraint;
either remove/move it to a separate commit/PR with a short rationale, or if it
must remain here replace the single-sided constraint with a proper range (for
example use a lower and upper bound such as markupsafe>=2.1.0,<3.0.0) and add a
brief comment in pyproject.toml explaining why that exact range is required for
the changes in this branch.
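If the pin must stay, the reviewer's two-sided constraint could be written in `pyproject.toml` roughly as follows. This is a sketch; the surrounding fields are illustrative, and the comment restates the rationale given in this PR's commit message:

```toml
[project]
dependencies = [
    # The PyTorch CPU index only ships MarkupSafe 3.x wheels for Python 3.13;
    # 2.1.5 has broad wheel coverage, so cap below 3.0 with an explicit floor.
    "markupsafe>=2.1.0,<3.0.0",
]
```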
The CI workflow uses `uv sync --dev` to install development dependencies. However, `dev` was defined only as an optional dependency (extra) and not as a dependency group, causing `uv` to skip installing tools like `pytest`. This change adds a `[dependency-groups]` section to `pyproject.toml` that includes the `dev` optional dependencies, ensuring they are installed during CI. Co-authored-by: MasumRab <8943353+MasumRab@users.noreply.github.com>
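A minimal sketch of such a `[dependency-groups]` section (PEP 735) is shown below; the package names are purely illustrative. The point is that `uv sync --dev` installs the `dev` dependency *group*, not the `dev` extra, so the group must exist alongside (or instead of) the extra:

```toml
[project.optional-dependencies]
dev = ["pytest", "pytest-cov"]  # illustrative packages

[dependency-groups]
# Mirror the dev extra so `uv sync --dev` installs the same tools.
dev = ["pytest", "pytest-cov"]
```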
This commit fixes three CI issues:

1. **Missing `backend/` directory**: The `backend/` directory was removed/renamed, but the CI workflow still referenced it. Updated `ci.yml` to target `src/` and `modules/` instead.
2. **Missing dev dependencies**: `uv sync --dev` wasn't installing required optional dependencies (like `pytest` or `google`). Updated `pyproject.toml` to define a `dev` dependency group that includes the full project extras.
3. **Missing SECRET_KEY**: Added a dummy `SECRET_KEY` to the CI environment to prevent runtime errors during testing.

Co-authored-by: MasumRab <8943353+MasumRab@users.noreply.github.com>
⚡ Bolt: Optimize email search with raw content scanning
💡 What:

Implemented a `_scan_content_for_search` helper method in `DatabaseManager` that reads the raw file content as a string to perform a fast case-insensitive check before parsing the JSON structure.

🎯 Why:

Profiling showed that `json.load` was a bottleneck during search operations, especially when scanning many files that do not contain the search term. Parallelizing with `asyncio` introduced too much overhead for small files. The "scan-first" approach avoids the expensive parsing step for 99% of files.

📊 Impact:

🔬 Measurement:

Verified with `scripts/benchmark_search.py` (created during development) and the existing `tests/core/test_database_search_perf.py`. Benchmark result: 0.51s vs 0.56s baseline (1000 files).
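A benchmark of this shape illustrates the claimed effect. The corpus, field name, and sizes here are illustrative and not the repository's actual `scripts/benchmark_search.py`:

```python
import gzip
import json
import os
import tempfile
import time

# Build a small corpus of gzipped JSON "emails" (count and content illustrative).
tmp = tempfile.mkdtemp()
paths = []
for i in range(200):
    p = os.path.join(tmp, f"{i}.json.gz")
    with gzip.open(p, "wt", encoding="utf-8") as f:
        json.dump({"content": f"message body {i}"}, f)
    paths.append(p)

def parse_first(term: str) -> int:
    """Baseline: always pays the json.load cost, matching or not."""
    hits = 0
    for p in paths:
        with gzip.open(p, "rt", encoding="utf-8") as f:
            data = json.load(f)
        if term in str(data.get("content", "")).lower():
            hits += 1
    return hits

def scan_first(term: str) -> int:
    """Optimized: raw substring check first, parse only candidate hits."""
    hits = 0
    for p in paths:
        with gzip.open(p, "rt", encoding="utf-8") as f:
            raw = f.read()
        if term not in raw.lower():
            continue  # skip json.loads for non-matching files
        data = json.loads(raw)
        if term in str(data.get("content", "")).lower():
            hits += 1
    return hits

for fn in (parse_first, scan_first):
    t0 = time.perf_counter()
    n = fn("body 42")
    print(f"{fn.__name__}: {n} hits in {time.perf_counter() - t0:.4f}s")
```

Both functions must report identical hit counts; only the timing differs, and the gap widens as the share of non-matching files grows.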
PR created automatically by Jules for task 4540183078167904732 started by @MasumRab
Summary by Sourcery
Optimize email search by pre-scanning gzipped raw content before JSON parsing to reduce unnecessary work during searches.
New Features:
Enhancements:
Tests:
Summary by CodeRabbit