Bugfix and optimize GetStartXrefPosition by ricflams · Pull Request #1036 · UglyToad/PdfPig

ricflams · 2025-05-08T22:02:31Z

The bugfix was the important part but the optimization is pretty nice too.

Bugfix: If startxref was located so far back (e.g. at the very beginning, which can be the case for linearized PDFs) that we ended up setting actualStartOffset to 0, the loop would exit immediately without actually searching that part.
Optimization: GetStartXrefPosition would search for startxref in the last 2048 bytes and then keep doubling the search range (looking back 4096, 8192, etc. bytes) until the entire file had been searched. This was rather inefficient, since each step searched the same parts over and over again. It now searches (still increasingly larger) chunks that don't overlap. On a test of 5000 PDFs this reduced load time by 10%.
Change: The exception no longer says that startxref couldn't be found "in the last 2048 characters", since the entire file is searched anyway.
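
For readers who want to see the idea in isolation, here is a minimal Python sketch of a backwards search over growing chunks that are each scanned only once. It is an illustration written from the description above, not the PdfPig (C#) implementation: the function name `find_last_startxref`, the `initial_chunk` parameter, and the choice to extend each chunk by `len(keyword) - 1` bytes so that a keyword straddling a chunk boundary is still seen are all assumptions made for this example.

```python
KEYWORD = b"startxref"


def find_last_startxref(data: bytes, initial_chunk: int = 2048) -> int:
    """Return the offset of the last b"startxref" in data, or -1 if absent.

    Illustrative sketch only; names and details are assumptions.
    """
    # A keyword may begin in the current chunk and end in the chunk that was
    # already searched, so the window reaches a few bytes past `end`.
    overlap = len(KEYWORD) - 1
    end = len(data)  # exclusive upper bound of the region not yet searched
    chunk = initial_chunk

    while end > 0:
        start = max(0, end - chunk)
        window_end = min(len(data), end + overlap)
        pos = data.rfind(KEYWORD, start, window_end)
        if pos != -1:
            return pos
        # The chunk [start, end) has now been searched, even when start == 0,
        # so the loop only stops after the head of the file was examined.
        end = start
        chunk *= 2  # still grow the chunk, but never rescan earlier bytes

    return -1


if __name__ == "__main__":
    # Synthetic input: keyword near the start, followed by a long tail.
    sample = b"%PDF-1.7\nstartxref\n0\n%%EOF\n" + b"x" * 100000
    print(find_last_startxref(sample))
```

The loop condition is on `end` rather than on the clamped start offset, which is what lets the final chunk beginning at offset 0 be searched before the loop terminates.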

ricflams · 2025-05-13T20:19:22Z

Wonderful, thanks @BobLd 👍

BobLd · 2025-05-13T21:32:59Z

@ricflams if you have sample documents that were failing before your fix, feel free to share them here; I'm happy to add tests. Or feel free to open a PR with tests.

ricflams · 2025-05-14T10:40:56Z

@BobLd, here's one example of a file with some trailing HTML that would cause the former backwards search to fail to find the startxref: it would wrongly skip searching roughly the first 50K of the file.
doc-000145072406f43133cbd65ae8b76c55.pdf
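
As a rough illustration of that failure mode, the Python sketch below builds a synthetic stand-in for such a file (it is not the attached document) and runs it through a hypothetical reconstruction of the described bug. The function `search_doubling_buggy` is an assumption based only on the PR description (a doubling look-back window whose loop stops as soon as the start offset is clamped to 0); the actual former PdfPig code may have differed in its details.

```python
KEYWORD = b"startxref"


def search_doubling_buggy(data: bytes, initial: int = 2048) -> int:
    """Hypothetical reconstruction of the described bug, not PdfPig's code."""
    window = initial
    start = max(0, len(data) - window)
    while start > 0:  # bug: gives up once the offset has been clamped to 0
        pos = data.rfind(KEYWORD, start)  # rescans the whole tail every time
        if pos != -1:
            return pos
        window *= 2
        start = max(0, len(data) - window)
    return -1


# Synthetic sample: a tiny PDF-like head followed by about 50 KB of HTML.
head = b"%PDF-1.4\n...objects...\nstartxref\n1234\n%%EOF\n"
tail = b"<html>" + b"<p>junk</p>" * 4700 + b"</html>"
sample = head + tail

if __name__ == "__main__":
    print(search_doubling_buggy(sample))  # -1: the head was never searched
    print(sample.rfind(KEYWORD))          # 23: a whole-file search finds it
```

A file like this could serve as a small regression input: any fixed search should report the same offset as the whole-file `rfind`.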

ricflams added 2 commits May 9, 2025 00:01

Merge branch 'master' into getstartxrefposition-use-sliding-window

61c0941

BobLd approved these changes May 13, 2025

BobLd merged commit c3c477a into UglyToad:master May 13, 2025
2 of 3 checks passed

ricflams deleted the getstartxrefposition-use-sliding-window branch May 13, 2025 20:18

BobLd merged 2 commits into UglyToad:master from ricflams:getstartxrefposition-use-sliding-window
