Use zlib decode to properly use window size and checksum in flate filter #1186

Merged
BobLd merged 1 commit into UglyToad:master from rhuijben:fix/use-zlib-for-flatefilter-decode
Oct 15, 2025

@rhuijben
Contributor

rhuijben commented Oct 14, 2025

Use the zlib framing (header and Adler-32 checksum) to verify that the compressed data is not corrupted. This avoids quite a few cases of processing data that is known to be corrupted.

For .NET 6+ we can use ZLibStream to do this in a single step, but we need a bit more work to stay .NET Framework compatible.
As we can't use the zlib framing in that case, we need to calculate the length of the uncompressed data ourselves. This PR fixes a few related bugs to make that possible.
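
As a rough illustration of the framing involved (written in Python with its standard zlib module rather than the C# used in PdfPig, and not the code from this PR): a zlib stream is a 2-byte header, a raw deflate body, and a 4-byte big-endian Adler-32 of the uncompressed data. When only a raw deflate decoder is available, the checksum has to be located and compared by hand. The name inflate_with_adler_check is hypothetical, and the sketch assumes the stream has no preset dictionary.

```python
import struct
import zlib


def inflate_with_adler_check(data: bytes) -> bytes:
    """Decode a zlib stream with a raw-deflate decoder plus a manual Adler-32 check."""
    # wbits=-15 selects raw deflate: the decoder handles no zlib header or trailer.
    raw = zlib.decompressobj(wbits=-15)
    # Skip the 2-byte zlib header (CMF, FLG). Assumption: no preset dictionary.
    out = raw.decompress(data[2:]) + raw.flush()
    # Bytes the deflate decoder did not consume start right after the final block.
    trailer = raw.unused_data
    if len(trailer) < 4:
        raise ValueError("missing Adler-32 trailer")
    (expected,) = struct.unpack(">I", trailer[:4])
    if zlib.adler32(out) != expected:
        raise ValueError("Adler-32 mismatch: data is corrupted")
    return out
```

Any bytes after the four checksum bytes, such as a stray end-of-line, are simply ignored by this sketch.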

@rhuijben
Contributor Author

Looks like it changes some behavior on strange/bad PDFs. It now detects corrupted zlib data via the Adler-32 checksum, which prevents this bad data from being interpreted later.
I think I have some managed Adler-32 code somewhere in another GitHub project. Perhaps we can also fix the non-.NET 6+ cases with that.

@BobLd
Collaborator

BobLd commented Oct 14, 2025

@rhuijben thanks a lot for the PR, let me know if I can help

@rhuijben
Contributor Author

@BobLd I fixed the issues at the FlateFilter level, but I think there is an issue in the code where the streams are detected. I think we should drop the final newline ("\r\n", "\r" or "\n") from the Memory object we hand down.
Otherwise every filter type has to handle these exceptions itself.

rhuijben force-pushed the fix/use-zlib-for-flatefilter-decode branch 5 times, most recently from b6d8247 to 2b38d6b, October 15, 2025 10:10

@BobLd
Collaborator

BobLd commented Oct 15, 2025

@rhuijben do you mind giving more details about "drop the final newline ("\r\n", "\r" or "\n") from the Memory object we hand down."?

Are you referring to the following in FlateFilter?

       if (length > 0 && length < input.Length)
       {
           // Truncates final "\r\n" or "\n" from source data if any. Fixes detecting where the adler checksum is. (Zlib uses framing for this)
           input = input.Slice(0, length);
       }

Also, do you have document samples that were failing before this change and now succeed?

@rhuijben
Contributor Author

There are quite a few cases in the test suite. Do you want some references for the individual failures?

I just found that the length may also be available before the first filter (sometimes it is per filter). Perhaps that already trims the data enough that we don't need it in the flate filter.

The code relied quite a bit on the deflate implementation doing the right thing. It usually does, but it is hard to integrate because it always reads more data from the underlying stream than it should. Luckily it doesn't touch that data.

@rhuijben
Contributor Author

rhuijben commented Oct 15, 2025

fseprd1102849.pdf from the WordCount test is one such example. (I think there are at least a hundred in the tests, some with length parameter info, some without.)
Some have '\r\n' (Windows/network style), some '\n' (Linux) and some '\r' (legacy Mac style).

Zlib has an internal end-of-stream marker that now just hides the problem. But other stream/file processors may have other issues with this trailing data.

[This one is fixed by either the fix in Decode() or the low-level detection of "\r\n". But the same issue would go undetected if the data went to other filters, unless they have their own handling for trailing data.]
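
To illustrate how the end-of-stream marker masks trailing bytes, here is a small sketch using Python's standard zlib module (an illustration only, not PdfPig code; split_trailing is a hypothetical helper). The decoder stops at the end of the zlib stream and reports whatever follows as unused input, so a stray end-of-line never reaches the output. A filter without such a marker would see those bytes as data.

```python
import zlib
from typing import Tuple


def split_trailing(stream: bytes) -> Tuple[bytes, bytes]:
    """Return (decoded payload, bytes left over after the zlib stream ended)."""
    decoder = zlib.decompressobj()
    payload = decoder.decompress(stream)
    if not decoder.eof:
        # The end-of-stream marker was never reached.
        raise ValueError("zlib stream is truncated")
    return payload, decoder.unused_data


content = b"q 1 0 0 1 0 0 cm Q"
for eol in (b"\r\n", b"\n", b"\r"):
    payload, extra = split_trailing(zlib.compress(content) + eol)
    # The payload is intact; the end-of-line shows up only as leftover input.
    assert payload == content and extra == eol
```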

@BobLd
Collaborator

BobLd commented Oct 15, 2025

@rhuijben thanks for the details. I was looking for documents (not necessarily in the current tests) that used to fail to open before adding input.Slice(0, length); and now succeed.

But anyway, let me know when you are good with the changes (no rush), I'll review them

BobLd self-requested a review October 15, 2025 12:04

@rhuijben
Contributor Author

Last few changes were already cleaning up the PR.

Most things 'worked' because the zlib magic was triggered. But if zlib reports that the data is corrupted, it almost always is, and I think at that point it is safer not to process it. Decompression can turn a small amount of junk data into gigabytes of junk output. This changes behavior for some of the tests that needed other ways to work around bad data, but you'll find this in the review.
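
A minimal sketch of capping decompressed output, using Python's standard zlib module (illustrative only; inflate_bounded and its limit parameter are hypothetical and not how PdfPig does it). A few kilobytes of deflate data can expand to many megabytes, so the decoder is asked for at most limit + 1 bytes and anything beyond the limit is rejected.

```python
import zlib


def inflate_bounded(data: bytes, limit: int) -> bytes:
    """Decode a zlib stream but refuse to produce more than `limit` bytes."""
    decoder = zlib.decompressobj()
    # max_length caps how many bytes this call may return.
    out = decoder.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed data exceeds limit")
    if not decoder.eof:
        # Input ran out before the end-of-stream marker.
        raise ValueError("zlib stream is truncated")
    return out
```

If the checksum does not match, the decompress call itself raises zlib.error, so corrupted input is rejected before it is returned.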

In my opinion using the length when available is a good thing. I'm still trying to find out why the top level parser would introduce this data while it is clearly not part of the actual stream. The checks inside the adler handling fix the tests to all pass but I find this an ugly hack. The length filters in the higher layers fix most (but not all!) of these cases too.

The reason I started looking was that I hoped to find the root cause of #1183. But it looks like this additional data really is encoded the way it should be. (It could be used to hide information, but there are so many other ways to hide data in a PDF that I don't think this is very relevant.)

@rhuijben
Contributor Author

If you want things separated, feel free to ask (or do it yourself if that is easier).
I tried to minimize warnings and test failures on my system.

@BobLd
Collaborator

BobLd commented Oct 15, 2025

Thanks for offering to split things. I usually prefer that but I believe it'll be fine.

I've done a first pass for the review, have a look when you have time.

I'll have a 2nd look later on.

Also, do you mind squashing your commits into a single one?

rhuijben force-pushed the fix/use-zlib-for-flatefilter-decode branch from 7bacaf2 to 3569245, October 15, 2025 14:41

@EliotJones
Member

So I'm just looking into this because it looks like it causes page content to be skipped here #1235

In PDFBox the default is to use a more tolerant parser for these corrupted streams: https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/filter/FlateFilterDecoderStream.java#L47 which references this issue https://issues.apache.org/jira/browse/PDFBOX-1232.

I've run the first 2500 files of the test corpus but didn't detect any additional files affected by this. If we revert to DeflateStream, what would the consequences be? Perhaps we can use checksum validation when lenient mode is off but keep support for corrupted streams when lenient parsing is on (the default)?
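
One way the proposed split could look, sketched with Python's standard zlib module (an illustration under assumptions, not PdfPig or PDFBox code; decode_flate and its lenient flag are hypothetical names). Strict mode lets zlib validate the header and Adler-32 and propagates any error. Lenient mode falls back to decoding the raw deflate body in small chunks and keeps whatever was produced before the first error.

```python
import zlib


def decode_flate(data: bytes, lenient: bool = True) -> bytes:
    """Strict: full zlib validation. Lenient: salvage what raw deflate can decode."""
    try:
        # Validates the zlib header and the Adler-32 trailer.
        return zlib.decompress(data)
    except zlib.error:
        if not lenient:
            raise
    # Fallback: raw deflate (wbits=-15), so no checksum is verified.
    # Assumption: the first two bytes are a zlib header and can be skipped.
    raw = zlib.decompressobj(wbits=-15)
    body = data[2:]
    out = bytearray()
    try:
        for start in range(0, len(body), 256):
            out += raw.decompress(body[start:start + 256])
            if raw.eof:
                break  # final deflate block reached; the rest is trailer bytes
    except zlib.error:
        pass  # keep the output of the chunks decoded before the error
    return bytes(out)
```

The trade-off is the one raised in this thread: the lenient path recovers content from streams whose checksum is wrong, but it may also return data that really is corrupted.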

@BobLd
Collaborator

BobLd commented Feb 15, 2026

@EliotJones I'd say the priority is to be able to extract text. I'd need to have a look, but I'm not against reverting on my end if it breaks stuff...

EDIT: Would need to double check but might be related to #1243

Ellerbach added a commit to Ellerbach/azure-ai-search-simulator that referenced this pull request Feb 16, 2026
Updated [PdfPig](https://github.com/UglyToad/PdfPig) from 0.1.9 to
0.1.13.

<details>
<summary>Release notes</summary>

_Sourced from [PdfPig's
releases](https://github.com/UglyToad/PdfPig/releases)._

## 0.1.13

## What's Changed
* Increment version to 0.1.13 by @​BobLd in
UglyToad/PdfPig#1207
* Simply order by offset also when not doing brute force to fix #​1208
by @​BobLd in UglyToad/PdfPig#1210
* Ensure no key end up missing in ResolveInternal and fix #​1209 by
@​BobLd in UglyToad/PdfPig#1211
* update release logic to check out master before commit by @​EliotJones
in UglyToad/PdfPig#1212
* Return empty glyph in ReadCompositeGlyph when glyphIndex is out of
range and fix #​1213 by @​BobLd in
UglyToad/PdfPig#1215
* Handling of optional content group names without proper name by
@​carlokok in UglyToad/PdfPig#1216
* Minor Type1FontParser optimisations by @​BobLd in
UglyToad/PdfPig#1221
* Use file header offset when doing brute force find and fix #​1223 by
@​BobLd in UglyToad/PdfPig#1224
* Do not return glyph bbox and path in Type1Font if character name is
'.notdef' by @​BobLd in UglyToad/PdfPig#1229

## New Contributors
* @​carlokok made their first contribution in
UglyToad/PdfPig#1216

**Full Changelog**:
UglyToad/PdfPig@v0.1.12...unreleased

## 0.1.12

## What's Changed
* add nullability to core project by @​EliotJones in
UglyToad/PdfPig#1111
* Fix usage of List.Contains by @​theolivenbaum in
UglyToad/PdfPig#1112
* allow missing catalog type definition for catalog dictionary by
@​EliotJones in UglyToad/PdfPig#1113
* Performance improvements and .Net 9 support by @​chuckbeasley in
UglyToad/PdfPig#1116
* Update run_integration_tests.yml by @​BobLd in
UglyToad/PdfPig#1117
* Add global.json in tools by @​BobLd in
UglyToad/PdfPig#1118
* Update run_integration_tests.yml by @​BobLd in
UglyToad/PdfPig#1119
* Update run_integration_tests.yml by @​BobLd in
UglyToad/PdfPig#1120
* Update run_common_crawl_tests.yml by @​BobLd in
UglyToad/PdfPig#1121
* Update nightly_release.yml by @​BobLd in
UglyToad/PdfPig#1123
* Increase FlateFilter multiplier when preventing malicious OOM and fix
#​1125 by @​BobLd in UglyToad/PdfPig#1126
* Update build_and_test_macos.yml by @​BobLd in
UglyToad/PdfPig#1129
* Update build_and_test_macos.yml by @​BobLd in
UglyToad/PdfPig#1130
* Prevent StackOverflow in ParseTrailer and fix #​1122 by @​BobLd in
UglyToad/PdfPig#1127
* Lower max search depth in preventing StackOverflow in ParseTrailer by
@​BobLd in UglyToad/PdfPig#1131
* add container node support for BookmarksProvider.cs by @​migeyusu in
UglyToad/PdfPig#1133
* move file parsing to single-pass static methods by @​EliotJones in
UglyToad/PdfPig#1102
* Add early version of IOSSystemFontLister by @​BobLd in
UglyToad/PdfPig#1143
* File buffering read stream investigation by @​EliotJones in
UglyToad/PdfPig#1140
* Draft release on master build by @​EliotJones in
UglyToad/PdfPig#1145
* First create the StreamInputBytes in PdfDocument.Open() to check the
stream CanRead and CanSeek by @​BobLd in
UglyToad/PdfPig#1147
* Fix font matrix issues by @​BobLd in
UglyToad/PdfPig#1150
* Properly fix #​1148 by always parsing optional tables in
TrueTypeFontParser and remove Type 0 font hack by @​BobLd in
UglyToad/PdfPig#1151
* copy other parser behavior by treating end of stream as valid end
inline image by @​EliotJones in
UglyToad/PdfPig#1152
* add test jobs for common crawl 0000 to 0007 by @​EliotJones in
UglyToad/PdfPig#1153
* handle case where xobjects use same key as fonts by @​EliotJones in
UglyToad/PdfPig#1154
* read last line of ignore file by @​EliotJones in
UglyToad/PdfPig#1155
* Use correct font matrix when transforming the width in Type 0 font and
fix #​1156 by @​BobLd in UglyToad/PdfPig#1157
* Add initial support to process CFF fonts contained inside a TrueType
font by @​BobLd in UglyToad/PdfPig#1159
* Handle non seekable stream by copying it into a memory stream and fix
#​1146 by @​BobLd in UglyToad/PdfPig#1158
* handle case where offsets are out of range by @​EliotJones in
UglyToad/PdfPig#1160
* Use record struct in FileHeaderOffset by @​BobLd in
UglyToad/PdfPig#1161
* Expose letter's font via GetFont(), make Font property as obsolete and
use FontDetails instead by @​BobLd in
UglyToad/PdfPig#1166
* Add GetDescent() and GetAscent() to IFont and loose bounding box to
letter by @​BobLd in UglyToad/PdfPig#1167
* Use pageFactoryCache.Clear() in Pages dispose and fix #​1170 by
@​BobLd in UglyToad/PdfPig#1174
* Bugfix: xref-streams were not added by @​ricflams in
UglyToad/PdfPig#1173
* Guard against circular references in XRef tables/streams by @​ricflams
in UglyToad/PdfPig#1175
* Add more tests to NearestNeighbourWordExtractorTests by @​BobLd in
UglyToad/PdfPig#1180
* Feature/improve group indexes by @​BobLd in
UglyToad/PdfPig#1181
* Trim excess in long lived font collections by @​BobLd in
UglyToad/PdfPig#1184
* Set Type 3 font ascent to Top instead of Height, see #​1164 by @​BobLd
in UglyToad/PdfPig#1185
* Only apply RemoveStridePadding() when bytes per pixel is one and fix
#​1183 by @​BobLd in UglyToad/PdfPig#1187
* Use zlib decode to properly use window size and checksum in flate
filter by @​rhuijben in UglyToad/PdfPig#1186
* Avoid doing a true file seek for simple peeking in the token parser by
@​rhuijben in UglyToad/PdfPig#1188
* Fix regression introduced in 3592fc8 where slicing the stream to the
length breaks decoding by @​BobLd in
UglyToad/PdfPig#1192
* Update NameToUnicodeConvertAglSpecification to test what was intended
by @​rhuijben in UglyToad/PdfPig#1191
* Add CMap caching at document level and add MurmurHash3 hashing
function by @​BobLd in UglyToad/PdfPig#1193
* Avoid reading ahead and then seeking back by @​rhuijben in
UglyToad/PdfPig#1189
* Do not slice the stream to the length breaks decoding in FlateDecode
by @​BobLd in UglyToad/PdfPig#1194
 ... (truncated)

## 0.1.11

Welcome to version 0.1.11. The changes in this version have mainly
focused on stability. There is a breaking API change.

We have also started to run tests against a larger corpus of documents
from Common Crawl allowing us to find bugs and malformed files
proactively. This release is screened against 6000 additional files.

- Improvements to content and font parsing detected by fuzzing inputs.
- Improvements and resiliency for finding the `startxref` location when
parsing a file.
- Adds build and tests for Mac OS as well as retrieving system fonts on
iPad (Mac Catalyst).
- Support clipping when rendering XObjects.
- Prevent malformed files leading to an out-of-memory when decompressing
streams.
- Make `IGraphicsStateOperationFactory` and
`ReflectionGraphicsStateOperationFactory` public.
- Softmask support for images.
- Performance improvements using `Span` and `ReadOnlyMemory` where
available.
- Handle corrupt files where the stream contains comment tokens.
- Improvements to copying from existing files when using
`PdfDocumentBuilder`, fixes some bugs with copying fonts and dictionary
tokens referenced indirectly.
- Handle corrupt files with double `endstream` definitions.
- More tolerant parsing for a number of invalid PDFs, including invalid
UCS2 input, CMAP formats, CFF fonts, missing font subtypes, invalid
`xref` table positions, missing `/FirstChar` entry for font dictionaries
and corrupt ASCII 85 encoded data.
- Fix an issue where adding content to an existing PDF using
`PdfDocumentBuilder` could result in upside-down or wrongly positioned
text due to global transforms in the source PDF.
- New option to completely skip annotations when building a document.
- Prevent infinite loops in certain documents #​1096.
- Improved performance when tokenizing numbers, this should provide a
minor speed improvement.
- When adding a page from an existing PDF to a `PdfDocumentBuilder` any
external link annotations should be preserved.

### Breaking changes

The method on `PdfDocumentBuilder`:

```
public PdfPageBuilder AddPage(PdfDocument document, int pageNumber, Func<PdfAction, PdfAction?>? copyLink)
```

Has been changed to wrap the `copyLink` parameter in an options object
to support the `KeepAnnotations` option:

```
public PdfPageBuilder AddPage(PdfDocument document, int pageNumber, AddPageOptions options)
```

You can just set the `CopyLinkFunc` property in the options object if
you need to access this functionality.

## Auto generated change log

* Bump version to 0.1.11-alpha001 by @​BobLd in
UglyToad/PdfPig#1009
* Improve Jpeg2000Helper to support J2K codec and add test by @​BobLd in
UglyToad/PdfPig#1010
* Add SetStrokeDetails() and SetFillDetails() to PdfPath and tidy up
ContentStreamProcessor by @​BobLd in
UglyToad/PdfPig#1014
* Implement clipping in ProcessFormXObject() by @​BobLd in
UglyToad/PdfPig#1015
* Fix #​1017 by @​lofcz in UglyToad/PdfPig#1018
* Fix PatternColor Equals() method and fix #​1016 by @​BobLd in
UglyToad/PdfPig#1019
* Feature/image mask by @​BobLd in
UglyToad/PdfPig#1012
* Update README.md by @​BobLd in
UglyToad/PdfPig#1020
* Fix bug where FormXObject bbox needs to be normalised by @​BobLd in
UglyToad/PdfPig#1021
* Add MacOS test pipeline and fix failing tests by @​BobLd in
UglyToad/PdfPig#1025
 ... (truncated)

## 0.1.10

## What's Changed
* Fix GetTextOrientation by cleanly checking if rotation is divisible by
90 and fix #​913 by @​BobLd in
UglyToad/PdfPig#914
* Add early version of BrowserSystemFontLister by @​BobLd in
UglyToad/PdfPig#920
* Remove list from FileTrailerParser.GetStartXrefPosition() by @​BobLd
in UglyToad/PdfPig#922
* Reorganise Filters and make them public by @​BobLd in
UglyToad/PdfPig#925
* Support decrypting V4/R4 files with AESV2 and no Length property by
@​Greybird in UglyToad/PdfPig#924
* Use pdfScanner in ReadVerticalDisplacements and fix #​693 and return
0… by @​BobLd in UglyToad/PdfPig#928
* Default page number to 0 in ExplicitDestination when the Dest has no
page number and fix #​736 by @​BobLd in
UglyToad/PdfPig#930
* Move Paths, GetAnnotations() and GetOptionalContents() outside of
ExperimentalAccess and mark Experimental class and reference as obsolete
by @​BobLd in UglyToad/PdfPig#931
* Upgrade tests project NuGet packages by @​BobLd in
UglyToad/PdfPig#932
* Optimize cross reference object offset validation by avoiding nested
loop by @​madelson in UglyToad/PdfPig#935
* Revive trimming/AOT analysis by @​madelson in
UglyToad/PdfPig#939
* Stop treating Warnings as Errors by @​BobLd in
UglyToad/PdfPig#941
* Handle alternate Unicode name representation cXXX and fix #​943 by
@​BobLd in UglyToad/PdfPig#944
* Handle odd ligatures names and fix #​945 by @​BobLd in
UglyToad/PdfPig#946
* Update additional glyph list to latest from PDFBox by @​BobLd in
UglyToad/PdfPig#948
* New GetText() option: NegativeGapAsWhitespace by @​Kizaemon in
UglyToad/PdfPig#952
* Fix for IndexOutOfRangeException exception by @​GrabzIt in
UglyToad/PdfPig#955
* Fix "Nightly Release" pipeline following csproj changes by @​BobLd in
UglyToad/PdfPig#957
* Do not throw exception when lenient parsing in ON in
CrossReferenceParser and fix #​959 by @​BobLd in
UglyToad/PdfPig#961
* Improve UnwrapIndexedColorSpaceBytes by @​BobLd in
UglyToad/PdfPig#962
* Fix out of range exception in AnnotationProvider by @​BobLd in
UglyToad/PdfPig#963
* Return a copy of the ArrayPoolBufferWriter buffer in Ascii85, AsciiHex
and RunLength filters and fix #​964 by @​BobLd in
UglyToad/PdfPig#965
* Make ColorSpaceDetails.BaseNumberOfColorComponents public to allow for
external image factories by @​BobLd in
UglyToad/PdfPig#966
* Improve GlyphList by @​BobLd in
UglyToad/PdfPig#967
* Properly handle ZapfDingbats font for TrueTypeSimpleFont and add tests
by @​BobLd in UglyToad/PdfPig#969
* Execute RemoveStridePadding in place when possible by @​BobLd in
UglyToad/PdfPig#968
* Add HexToken case in OptionalContent parsing by @​simonedd in
UglyToad/PdfPig#971
* Update UglyToad.PdfPig.ConsoleRunner target framework to net8 by
@​BobLd in UglyToad/PdfPig#972
* Do not throw error on Pop when stack size is 1 in lenient mode and fix
#​973 by @​BobLd in UglyToad/PdfPig#974
* Fix warnings about "type 'K' cannot be used as type parameter 'TKey'
in the generic type or method 'Dictionary<TKey, TValue>'" by @​BobLd in
UglyToad/PdfPig#976
* Refactor XObjectFactory by @​BobLd in
UglyToad/PdfPig#977
* Update UnpackComponents() to account for 1bpc + DeviceGray (hack for
Jbig2) by @​BobLd in UglyToad/PdfPig#978
* CcittFaxDecodeFilter: do not check for input length, invert bitmap
with ref byte and fix #​982 by @​BobLd in
UglyToad/PdfPig#983
* Add JPX bits per component decoding by @​BobLd in
UglyToad/PdfPig#986
* Issues/987 by @​BobLd in UglyToad/PdfPig#990
* Make DecodeParameterResolver class public by @​BobLd in
UglyToad/PdfPig#993
* Update Microsoft and SkiaSharp NuGet packages by @​BobLd in
UglyToad/PdfPig#994
* Update Microsoft NuGet packages for UglyToad.PdfPig.Package by @​BobLd
in UglyToad/PdfPig#996
* Resolve image data (implementation from @​kasperdaff) by @​BobLd in
UglyToad/PdfPig#998
* Pass IFilterProvider to IFilter.Decode() and handle null in
PdfExtensions.Resolve() by @​BobLd in
UglyToad/PdfPig#999
* Improve GetExtendedGraphicsStateDictionary() and
StackDictionary.TryGetValue() by @​BobLd in
UglyToad/PdfPig#1004
* Better handle integer overflow in DocstrumBoundingBoxes by @​BobLd in
UglyToad/PdfPig#1005
* version 0.1.10 by @​BobLd in
UglyToad/PdfPig#1006
* Update run_integration_tests.yml by @​BobLd in
UglyToad/PdfPig#1007

## New Contributors
* @​madelson made their first contribution in
UglyToad/PdfPig#935
* @​Kizaemon made their first contribution in
UglyToad/PdfPig#952
* @​GrabzIt made their first contribution in
UglyToad/PdfPig#955
 ... (truncated)

Commits viewable in [compare
view](UglyToad/PdfPig@v0.1.9...0.1.13).
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Laurent Ellerbach <laurelle@microsoft.com>