Bugfix and optimize GetStartXrefPosition #1036

Merged
BobLd merged 2 commits into UglyToad:master from ricflams:getstartxrefposition-use-sliding-window
May 13, 2025

@ricflams
Contributor

@ricflams ricflams commented May 8, 2025

The bugfix was the important part but the optimization is pretty nice too.

  • Bugfix: If the search had to go so far back that actualStartOffset ended up being set to 0 (e.g. when startxref sits at the very beginning of the file, which can be the case for Linearized PDFs), the loop would exit immediately without actually searching that part.
  • Optimization: GetStartXrefPosition would search for startxref in the last 2048 bytes and then keep doubling the search range (looking back 4096, 8192, etc. bytes) until the entire file had been searched. This was rather inefficient, since each step searched the same parts over and over again. It now searches chunks that still grow in size but don't overlap. On a test of 5000 PDFs this reduced load time by 10%.
  • Change: The exception no longer says that startxref couldn't be found "in the last 2048 characters", since the entire file was searched anyway.
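
To illustrate the idea behind the bullets above, here is a minimal Python sketch of a backwards search over growing, non-overlapping chunks. It is not the PdfPig implementation (which is C#): the function name `find_last_startxref`, the `first_chunk` parameter, and the small look-ahead of `len(NEEDLE) - 1` bytes for keywords that straddle a chunk boundary are assumptions made for this example only.

```python
NEEDLE = b"startxref"


def find_last_startxref(data: bytes, first_chunk: int = 2048) -> int:
    """Return the offset of the last b'startxref' in data (hypothetical helper)."""
    end = len(data)      # exclusive upper bound of the region not yet searched
    chunk = first_chunk  # size of the next chunk; doubled after every miss
    while end > 0:
        start = max(0, end - chunk)
        # Assumption for this sketch: read len(NEEDLE) - 1 bytes past the chunk
        # so a keyword straddling the boundary with the previous chunk is seen.
        window_end = min(len(data), end + len(NEEDLE) - 1)
        pos = data.rfind(NEEDLE, start, window_end)
        if pos != -1:
            return pos
        # The chunk [start, end) has now been searched, including the case
        # start == 0. The loop ends only after offset 0 was actually covered.
        end = start
        chunk *= 2
    raise ValueError("startxref not found")
```

Because the loop condition is checked on the not-yet-searched region rather than on the computed start offset, a chunk that begins at offset 0 is still searched before the loop ends, which is the situation the bugfix describes. Each byte is visited roughly once, instead of once per doubling step.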

ricflams added 2 commits May 9, 2025 00:01
@BobLd BobLd merged commit c3c477a into UglyToad:master May 13, 2025
2 of 3 checks passed
@ricflams ricflams deleted the getstartxrefposition-use-sliding-window branch May 13, 2025 20:18
@ricflams
Contributor Author

Wonderful, thanks @BobLd 👍

@BobLd
Collaborator

BobLd commented May 13, 2025

@ricflams if you have document samples that were failing before your fix, feel free to share them here; I'm happy to add tests. Or feel free to open a PR with tests.

@ricflams
Contributor Author

@BobLd, here's one example of a file with some trailing HTML that would cause the former backwards search to fail to find the startxref: it would wrongly skip searching roughly the first 50K of the file.
doc-000145072406f43133cbd65ae8b76c55.pdf
