feat: replace pdf-extract with pdf_oxide by ztsalexey · Pull Request #1435 · nearai/ironclaw

ztsalexey · 2026-03-19T22:19:29Z

Summary

Replace pdf-extract with pdf_oxide (v0.3) for PDF text extraction
pdf_oxide produces layout-aware markdown with headings, is faster (0.8ms mean), and has zero non-Rust dependencies
Updated extract_pdf() to use PdfDocument::from_bytes + to_markdown_all

Closes #1311

Test plan

cargo fmt --check — passes
cargo clippy --all --all-features — zero warnings
cargo test --lib document_extraction::tests::extracts_pdf_text — passes
pdf-extract no longer in Cargo.lock

🤖 Generated with Claude Code

gemini-code-assist · 2026-03-19T22:19:51Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the PDF text extraction capabilities by migrating from the pdf-extract library to pdf_oxide. This change not only enhances the quality of extracted text by providing layout-aware markdown but also boosts performance and reduces external dependencies. The core logic for PDF processing has been refactored to integrate seamlessly with the new library's API, ensuring a more robust and efficient document processing pipeline.

Highlights

PDF Text Extraction Library Change: The pdf-extract library has been replaced with pdf_oxide (v0.3) for PDF text extraction, improving performance and output quality.
Improved PDF Output: The new pdf_oxide library produces layout-aware markdown with headings, offering a more structured and readable text output from PDFs.
Performance and Dependencies: The switch to pdf_oxide results in faster extraction times (0.8ms mean) and eliminates non-Rust dependencies, streamlining the project's dependency tree.
Code Modernization: The extract_pdf() function was updated to utilize PdfDocument::from_bytes and to_markdown_all methods from the new library, along with refined error handling.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request successfully replaces the pdf-extract library with pdf_oxide for PDF text extraction. The changes include updating Cargo.toml to reflect the new dependency and modifying the extract_pdf function in src/document_extraction/extractors.rs to utilize pdf_oxide's API, specifically PdfDocument::from_bytes and to_markdown_all. The Cargo.lock file has been updated accordingly, showing the removal of old dependencies and the addition of new ones required by pdf_oxide, along with several minor version bumps for other packages. The documentation comment in src/document_extraction/mod.rs has also been updated to highlight the new library's layout-aware markdown extraction capability. The implementation appears correct and aligns with the stated goals of the pull request.

Copilot

Pull request overview

Replaces the PDF text extraction backend in the document extraction pipeline from pdf-extract to pdf_oxide, switching PDF output to layout-aware markdown intended to improve reading order/structure for LLM consumption.

Changes:

Swap dependency from pdf-extract to pdf_oxide (v0.3.x).
Update extract_pdf() to parse bytes with PdfDocument::from_bytes and emit markdown via to_markdown_all.
Refresh documentation and lockfile to reflect the new extraction backend and transitive dependencies.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File	Description
src/document_extraction/mod.rs	Updates module docs to reflect `pdf_oxide` as the PDF extraction backend.
src/document_extraction/extractors.rs	Reworks PDF extraction implementation to produce markdown via `pdf_oxide`.
Cargo.toml	Replaces `pdf-extract` dependency with `pdf_oxide`.
Cargo.lock	Updates resolved dependency graph after swapping PDF extraction crates.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-19T22:22:31Z

+    let mut doc = pdf_oxide::PdfDocument::from_bytes(data.to_vec())
+        .map_err(|e| format!("PDF extraction failed: {e}"))?;
+    let options = pdf_oxide::converters::ConversionOptions::default();
+    doc.to_markdown_all(&options)


PdfDocument::from_bytes(data.to_vec()) copies the full PDF bytes. Since callers already have the PDF as a Vec<u8> (e.g., attachments hold Vec<u8>), this adds an extra allocation/copy compared to the previous pdf_extract::extract_text_from_mem(&[u8]) path and can double memory for near-10MB PDFs. Consider plumbing ownership through for the PDF path (e.g., accept Vec<u8> for PDFs or provide an extract_pdf(Vec<u8>) helper) so this function can avoid to_vec() when the data is already owned.

Copilot · 2026-03-19T22:22:32Z

+    doc.to_markdown_all(&options)
        .map(|t| t.trim().to_string())
-        .map_err(|e| format!("PDF extraction failed: {e}"))
+        .map_err(|e| format!("PDF markdown conversion failed: {e}"))


This change switches PDF extraction from plain text to layout-aware markdown (to_markdown_all), but the existing PDF test only asserts text.contains("Hello"), which won’t catch regressions in reading order/structure (the main motivation for the swap). Add a targeted test + fixture (e.g., multi-column or table PDF) that asserts markdown structure/order expectations so we have coverage for the new behavior.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d136a993d8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-19T22:24:05Z

+    let options = pdf_oxide::converters::ConversionOptions::default();
+    doc.to_markdown_all(&options)


Disable default image embedding in PDF extraction

ConversionOptions::default() enables pdf_oxide's image output in the markdown path, so PDFs with logos, figures, or scanned pages can inject markdown image payloads (often data URIs) into extracted_text instead of just text. That interacts badly with DocumentExtractionMiddleware's 100k-character cap in src/document_extraction/mod.rs: an image-heavy PDF can burn most of the budget on image blobs and leave the model with little or none of the actual document text. For attachment text extraction we should explicitly turn image output off, or switch to a plain-text converter, so illustrated PDFs don't regress into truncated junk.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pdf_oxide produces markdown with headings, is faster (0.8ms mean), and has zero non-Rust dependencies. Closes nearai#1311

Address review feedback: - Set include_images: false to prevent data URIs from consuming the 100k-char extraction budget on image-heavy PDFs - Accept Vec<u8> in extract_pdf/extract_text to avoid an extra .to_vec() copy (pdf_oxide requires owned data) - Add test asserting no markdown images or data URIs in output Closes nearai#1311

Copilot AI review requested due to automatic review settings March 19, 2026 22:19

github-actions Bot added scope: dependencies Dependency updates size: XL 500+ changed lines risk: medium Business logic, config, or moderate-risk modules contributor: experienced 6-19 merged PRs labels Mar 19, 2026

Copilot started reviewing on behalf of ztsalexey March 19, 2026 22:19 View session

gemini-code-assist Bot reviewed Mar 19, 2026

View reviewed changes

Copilot AI reviewed Mar 19, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Mar 19, 2026

View reviewed changes

This was referenced Mar 20, 2026

🦞 OpenClaw 生态日报 2026-03-20 gsscsd/big_model_radar#66

Open

🦞 Bản tin hàng ngày hệ sinh thái OpenClaw 2026-03-20 compasify/agents-radar#61

Open

Copilot AI review requested due to automatic review settings March 20, 2026 20:23

ztsalexey force-pushed the fix/pdf-oxide-extraction branch from e8bc56a to 778b45d Compare March 20, 2026 20:23

Copilot started reviewing on behalf of ztsalexey March 20, 2026 20:23 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

github-actions Bot mentioned this pull request Mar 21, 2026

🦞 OpenClaw 生态日报 2026-03-21 gsscsd/big_model_radar#71

Open

ztsalexey added 2 commits April 3, 2026 13:09

feat: replace pdf-extract with pdf_oxide for layout-aware PDF extraction

ae36e74

pdf_oxide produces markdown with headings, is faster (0.8ms mean), and has zero non-Rust dependencies. Closes nearai#1311

ztsalexey force-pushed the fix/pdf-oxide-extraction branch from 778b45d to e45bf28 Compare April 3, 2026 19:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: replace pdf-extract with pdf_oxide#1435

feat: replace pdf-extract with pdf_oxide#1435
ztsalexey wants to merge 2 commits intonearai:stagingfrom
ztsalexey:fix/pdf-oxide-extraction

ztsalexey commented Mar 19, 2026

Uh oh!

gemini-code-assist Bot commented Mar 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 19, 2026

Uh oh!

ztsalexey Mar 20, 2026

Uh oh!

Copilot AI Mar 19, 2026

Uh oh!

ztsalexey Mar 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Uh oh!

ztsalexey Mar 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		let options = pdf_oxide::converters::ConversionOptions::default();
		doc.to_markdown_all(&options)

Conversation

ztsalexey commented Mar 19, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot commented Mar 19, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

ztsalexey Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

ztsalexey Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

ztsalexey Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants