feat: replace pdf-extract with pdf_oxide#1435
feat: replace pdf-extract with pdf_oxide#1435ztsalexey wants to merge 2 commits intonearai:stagingfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly upgrades the PDF text extraction capabilities by migrating from the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request successfully replaces the pdf-extract library with pdf_oxide for PDF text extraction. The changes include updating Cargo.toml to reflect the new dependency and modifying the extract_pdf function in src/document_extraction/extractors.rs to utilize pdf_oxide's API, specifically PdfDocument::from_bytes and to_markdown_all. The Cargo.lock file has been updated accordingly, showing the removal of old dependencies and the addition of new ones required by pdf_oxide, along with several minor version bumps for other packages. The documentation comment in src/document_extraction/mod.rs has also been updated to highlight the new library's layout-aware markdown extraction capability. The implementation appears correct and aligns with the stated goals of the pull request.
There was a problem hiding this comment.
Pull request overview
Replaces the PDF text extraction backend in the document extraction pipeline from pdf-extract to pdf_oxide, switching PDF output to layout-aware markdown intended to improve reading order/structure for LLM consumption.
Changes:
- Swap dependency from
pdf-extracttopdf_oxide(v0.3.x). - Update
extract_pdf()to parse bytes withPdfDocument::from_bytesand emit markdown viato_markdown_all. - Refresh documentation and lockfile to reflect the new extraction backend and transitive dependencies.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/document_extraction/mod.rs | Updates module docs to reflect pdf_oxide as the PDF extraction backend. |
| src/document_extraction/extractors.rs | Reworks PDF extraction implementation to produce markdown via pdf_oxide. |
| Cargo.toml | Replaces pdf-extract dependency with pdf_oxide. |
| Cargo.lock | Updates resolved dependency graph after swapping PDF extraction crates. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let mut doc = pdf_oxide::PdfDocument::from_bytes(data.to_vec()) | ||
| .map_err(|e| format!("PDF extraction failed: {e}"))?; | ||
| let options = pdf_oxide::converters::ConversionOptions::default(); | ||
| doc.to_markdown_all(&options) |
There was a problem hiding this comment.
PdfDocument::from_bytes(data.to_vec()) copies the full PDF bytes. Since callers already have the PDF as a Vec<u8> (e.g., attachments hold Vec<u8>), this adds an extra allocation/copy compared to the previous pdf_extract::extract_text_from_mem(&[u8]) path and can double memory for near-10MB PDFs. Consider plumbing ownership through for the PDF path (e.g., accept Vec<u8> for PDFs or provide an extract_pdf(Vec<u8>) helper) so this function can avoid to_vec() when the data is already owned.
| doc.to_markdown_all(&options) | ||
| .map(|t| t.trim().to_string()) | ||
| .map_err(|e| format!("PDF extraction failed: {e}")) | ||
| .map_err(|e| format!("PDF markdown conversion failed: {e}")) |
There was a problem hiding this comment.
This change switches PDF extraction from plain text to layout-aware markdown (to_markdown_all), but the existing PDF test only asserts text.contains("Hello"), which won’t catch regressions in reading order/structure (the main motivation for the swap). Add a targeted test + fixture (e.g., multi-column or table PDF) that asserts markdown structure/order expectations so we have coverage for the new behavior.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d136a993d8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let options = pdf_oxide::converters::ConversionOptions::default(); | ||
| doc.to_markdown_all(&options) |
There was a problem hiding this comment.
Disable default image embedding in PDF extraction
ConversionOptions::default() enables pdf_oxide's image output in the markdown path, so PDFs with logos, figures, or scanned pages can inject markdown image payloads (often data URIs) into extracted_text instead of just text. That interacts badly with DocumentExtractionMiddleware's 100k-character cap in src/document_extraction/mod.rs: an image-heavy PDF can burn most of the budget on image blobs and leave the model with little or none of the actual document text. For attachment text extraction we should explicitly turn image output off, or switch to a plain-text converter, so illustrated PDFs don't regress into truncated junk.
Useful? React with 👍 / 👎.
e8bc56a to
778b45d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pdf_oxide produces markdown with headings, is faster (0.8ms mean), and has zero non-Rust dependencies. Closes nearai#1311
Address review feedback: - Set include_images: false to prevent data URIs from consuming the 100k-char extraction budget on image-heavy PDFs - Accept Vec<u8> in extract_pdf/extract_text to avoid an extra .to_vec() copy (pdf_oxide requires owned data) - Add test asserting no markdown images or data URIs in output Closes nearai#1311
778b45d to
e45bf28
Compare
Summary
pdf-extractwithpdf_oxide(v0.3) for PDF text extractionpdf_oxideproduces layout-aware markdown with headings, is faster (0.8ms mean), and has zero non-Rust dependenciesextract_pdf()to usePdfDocument::from_bytes+to_markdown_allCloses #1311
Test plan
cargo fmt --check— passescargo clippy --all --all-features— zero warningscargo test --lib document_extraction::tests::extracts_pdf_text— passespdf-extractno longer inCargo.lock🤖 Generated with Claude Code