Skip to content

feat: replace pdf-extract with pdf_oxide#1435

Open
ztsalexey wants to merge 2 commits intonearai:stagingfrom
ztsalexey:fix/pdf-oxide-extraction
Open

feat: replace pdf-extract with pdf_oxide#1435
ztsalexey wants to merge 2 commits intonearai:stagingfrom
ztsalexey:fix/pdf-oxide-extraction

Conversation

@ztsalexey
Copy link
Copy Markdown
Contributor

Summary

  • Replace pdf-extract with pdf_oxide (v0.3) for PDF text extraction
  • pdf_oxide produces layout-aware markdown with headings, is faster (0.8ms mean), and has zero non-Rust dependencies
  • Updated extract_pdf() to use PdfDocument::from_bytes + to_markdown_all

Closes #1311

Test plan

  • cargo fmt --check — passes
  • cargo clippy --all --all-features — zero warnings
  • cargo test --lib document_extraction::tests::extracts_pdf_text — passes
  • pdf-extract no longer in Cargo.lock

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 19, 2026 22:19
@github-actions github-actions Bot added scope: dependencies Dependency updates size: XL 500+ changed lines risk: medium Business logic, config, or moderate-risk modules contributor: experienced 6-19 merged PRs labels Mar 19, 2026
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the PDF text extraction capabilities by migrating from the pdf-extract library to pdf_oxide. This change not only enhances the quality of extracted text by providing layout-aware markdown but also boosts performance and reduces external dependencies. The core logic for PDF processing has been refactored to integrate seamlessly with the new library's API, ensuring a more robust and efficient document processing pipeline.

Highlights

  • PDF Text Extraction Library Change: The pdf-extract library has been replaced with pdf_oxide (v0.3) for PDF text extraction, improving performance and output quality.
  • Improved PDF Output: The new pdf_oxide library produces layout-aware markdown with headings, offering a more structured and readable text output from PDFs.
  • Performance and Dependencies: The switch to pdf_oxide results in faster extraction times (0.8ms mean) and eliminates non-Rust dependencies, streamlining the project's dependency tree.
  • Code Modernization: The extract_pdf() function was updated to utilize PdfDocument::from_bytes and to_markdown_all methods from the new library, along with refined error handling.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request successfully replaces the pdf-extract library with pdf_oxide for PDF text extraction. The changes include updating Cargo.toml to reflect the new dependency and modifying the extract_pdf function in src/document_extraction/extractors.rs to utilize pdf_oxide's API, specifically PdfDocument::from_bytes and to_markdown_all. The Cargo.lock file has been updated accordingly, showing the removal of old dependencies and the addition of new ones required by pdf_oxide, along with several minor version bumps for other packages. The documentation comment in src/document_extraction/mod.rs has also been updated to highlight the new library's layout-aware markdown extraction capability. The implementation appears correct and aligns with the stated goals of the pull request.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Replaces the PDF text extraction backend in the document extraction pipeline from pdf-extract to pdf_oxide, switching PDF output to layout-aware markdown intended to improve reading order/structure for LLM consumption.

Changes:

  • Swap dependency from pdf-extract to pdf_oxide (v0.3.x).
  • Update extract_pdf() to parse bytes with PdfDocument::from_bytes and emit markdown via to_markdown_all.
  • Refresh documentation and lockfile to reflect the new extraction backend and transitive dependencies.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/document_extraction/mod.rs Updates module docs to reflect pdf_oxide as the PDF extraction backend.
src/document_extraction/extractors.rs Reworks PDF extraction implementation to produce markdown via pdf_oxide.
Cargo.toml Replaces pdf-extract dependency with pdf_oxide.
Cargo.lock Updates resolved dependency graph after swapping PDF extraction crates.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/document_extraction/extractors.rs Outdated
Comment on lines +68 to +71
let mut doc = pdf_oxide::PdfDocument::from_bytes(data.to_vec())
.map_err(|e| format!("PDF extraction failed: {e}"))?;
let options = pdf_oxide::converters::ConversionOptions::default();
doc.to_markdown_all(&options)
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PdfDocument::from_bytes(data.to_vec()) copies the full PDF bytes. Since callers already have the PDF as a Vec<u8> (e.g., attachments hold Vec<u8>), this adds an extra allocation/copy compared to the previous pdf_extract::extract_text_from_mem(&[u8]) path and can double memory for near-10MB PDFs. Consider plumbing ownership through for the PDF path (e.g., accept Vec<u8> for PDFs or provide an extract_pdf(Vec<u8>) helper) so this function can avoid to_vec() when the data is already owned.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Comment on lines +71 to +73
doc.to_markdown_all(&options)
.map(|t| t.trim().to_string())
.map_err(|e| format!("PDF extraction failed: {e}"))
.map_err(|e| format!("PDF markdown conversion failed: {e}"))
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change switches PDF extraction from plain text to layout-aware markdown (to_markdown_all), but the existing PDF test only asserts text.contains("Hello"), which won’t catch regressions in reading order/structure (the main motivation for the swap). Add a targeted test + fixture (e.g., multi-column or table PDF) that asserts markdown structure/order expectations so we have coverage for the new behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d136a993d8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/document_extraction/extractors.rs Outdated
Comment on lines +70 to +71
let options = pdf_oxide::converters::ConversionOptions::default();
doc.to_markdown_all(&options)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Disable default image embedding in PDF extraction

ConversionOptions::default() enables pdf_oxide's image output in the markdown path, so PDFs with logos, figures, or scanned pages can inject markdown image payloads (often data URIs) into extracted_text instead of just text. That interacts badly with DocumentExtractionMiddleware's 100k-character cap in src/document_extraction/mod.rs: an image-heavy PDF can burn most of the budget on image blobs and leave the model with little or none of the actual document text. For attachment text extraction we should explicitly turn image output off, or switch to a plain-text converter, so illustrated PDFs don't regress into truncated junk.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pdf_oxide produces markdown with headings, is faster (0.8ms mean),
and has zero non-Rust dependencies.

Closes nearai#1311
Address review feedback:
- Set include_images: false to prevent data URIs from consuming
  the 100k-char extraction budget on image-heavy PDFs
- Accept Vec<u8> in extract_pdf/extract_text to avoid an extra
  .to_vec() copy (pdf_oxide requires owned data)
- Add test asserting no markdown images or data URIs in output

Closes nearai#1311
@ztsalexey ztsalexey force-pushed the fix/pdf-oxide-extraction branch from 778b45d to e45bf28 Compare April 3, 2026 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: experienced 6-19 merged PRs risk: medium Business logic, config, or moderate-risk modules scope: dependencies Dependency updates size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(extraction): replace pdf-extract with pdf_oxide for layout-aware PDF extraction

2 participants