Make brotli stream matcher more robust to uncompressed binary and ASCII data files #45
Conversation
Pull Request Overview
This PR enhances the Brotli stream matcher to better distinguish uncompressed ASCII data from valid Brotli-compressed streams by adding a more robust detection function and extensive fuzzy tests.
- Consolidate repetitive quality-level tests into a loop and add `TestBrotli_Fuzzy_Both` for deterministic ASCII vs. compressed checks
- Introduce `isValidBrotliStream` with an ASCII-only filter and helper functions (`isASCII`, `isASCIIByte`)
- Update `Match` to call the new stream-based validation without breaking backward compatibility
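Based on the helper names in the overview, the ASCII-only filter might look roughly like this minimal sketch (the accepted byte set and exact signatures are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// isASCIIByte reports whether b is printable ASCII or common whitespace.
// The exact accepted set here is an assumption; the PR's implementation
// may differ.
func isASCIIByte(b byte) bool {
	return (b >= 0x20 && b <= 0x7E) || b == '\n' || b == '\r' || b == '\t'
}

// isASCII reports whether every byte in data passes isASCIIByte. An
// all-ASCII prefix is very unlikely to be a Brotli stream, so a matcher
// can reject such input early.
func isASCII(data []byte) bool {
	for _, b := range data {
		if !isASCIIByte(b) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII([]byte("plain text file\n"))) // true
	fmt.Println(isASCII([]byte{0x1B, 0x02, 0x00}))    // false
}
```

The point of the filter is the cheap early exit: plain text data that happens to start with bytes resembling a Brotli header is rejected before any decompression is attempted.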
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| brotli_test.go | Refactor per-quality tests into a loop, add comprehensive fuzzy testing suite |
| brotli.go | Implement isValidBrotliStream method with ASCII detection, wire it into Match |
Comments suppressed due to low confidence (1)
brotli_test.go:58
- The fuzzy tests cover ASCII and Brotli-compressed data, but they don’t verify how pure non-ASCII binary data is handled. Consider adding a test case for random non-ASCII binary inputs to ensure they aren’t misidentified as Brotli-compressed.
func TestBrotli_Fuzzy_Both(t *testing.T) {
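The suggested test case could generate deterministic non-ASCII input along these lines (a hypothetical sketch; `randomBinary` is not part of the PR, and the real test would feed the data to the matcher and assert it is not identified as brotli):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomBinary returns n pseudo-random bytes that are all guaranteed to
// be non-ASCII (0x80..0xFF), suitable as "definitely not text" matcher
// input. A fixed seed keeps the test deterministic across runs.
func randomBinary(n int) []byte {
	r := rand.New(rand.NewSource(42))
	data := make([]byte, n)
	for i := range data {
		data[i] = byte(r.Intn(128) + 128) // always >= 0x80, never ASCII
	}
	return data
}

func main() {
	sample := randomBinary(16)
	fmt.Println(len(sample)) // 16
}
```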
That's weird, I have not enabled automatic review from Copilot.
Must be from a setting I have enabled, my bad! I thought it would only happen on my own repositories. I find it now provides helpful feedback about 10-15% of the time as of the last few weeks; prior to that, it was about 0.01% of the time. Although obviously it still gets a lot wrong, as shown here.
I added some fuzzy binary data generation as a new test for #36. All of the compressed data is being accurately identified; however, a number of uncompressed binary data tests are being incorrectly identified as brotli, in both the main branch and this PR. So far I have been unsuccessful in finding a fix for that case. If we could pass a configuration to archives which allows us to specify which of the registered formats we actually want enabled, that could serve as a workaround. Kind of crazy that such a widely used compression format has no reliable method of detection??
I tried adding a final check which reads significantly more data from the stream, decompresses it, and compares the compressed versus decompressed output sizes. The assumption being that if it's a legitimate compressed stream, the output would be some reasonable ratio larger. However, I couldn't get it to work reliably.
I encountered the same issue as well. There were no uncompressed files, and yet br compression was detected via the byte stream.
No worries, surprising to me too! Thanks for the improvements. It looks like the tests are failing though. We'll need to get them to pass before we can merge this. |
Yes, the uncompressed binary case is proving tricky. I have tried a number of different approaches to catch that edge case, but so far have been unsuccessful. I will keep trying periodically when I have some spare time.
OK, I was able to get something working that passes all existing tests as well as a bunch of new ones I added, and I have updated the PR description. What are your thoughts on adding configuration which disables brotli stream matching, perhaps via a new |
mholt left a comment
Very nice -- thanks for the effort on this!
Let's merge this and try it out. As for a new exported API to customize formats in the Identify routine, I'm open to a new API (though probably a little different) -- but first let's see how this goes. If it still gets too many false positives, then I'm down for a refactoring/additional API.
Closes #36