Make brotli stream matcher more robust to uncompressed binary and ASCII data files #45
Conversation
Pull Request Overview
This PR enhances the Brotli stream matcher to better distinguish uncompressed ASCII data from valid Brotli-compressed streams by adding a more robust detection function and extensive fuzzy tests.
- Consolidate repetitive quality-level tests into a loop and add `TestBrotli_Fuzzy_Both` for deterministic ASCII vs. compressed checks
- Introduce `isValidBrotliStream` with an ASCII-only filter and helper functions (`isASCII`, `isASCIIByte`)
- Update `Match` to call the new stream-based validation without breaking backward compatibility
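Based on the helper names in the overview, the ASCII-only filter might look roughly like this minimal sketch (the accepted byte set and exact signatures are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// isASCIIByte reports whether b is printable ASCII or common whitespace.
// The exact accepted set here is an assumption; the PR's implementation
// may differ.
func isASCIIByte(b byte) bool {
	return (b >= 0x20 && b <= 0x7E) || b == '\n' || b == '\r' || b == '\t'
}

// isASCII reports whether every byte in data passes isASCIIByte. An
// all-ASCII prefix is very unlikely to be a Brotli stream, so a matcher
// can reject such input early.
func isASCII(data []byte) bool {
	for _, b := range data {
		if !isASCIIByte(b) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII([]byte("plain text file\n"))) // true
	fmt.Println(isASCII([]byte{0x1B, 0x02, 0x00}))    // false
}
```

The point of the filter is the cheap early exit: plain text data that happens to start with bytes resembling a Brotli header is rejected before any decompression is attempted.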
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| brotli_test.go | Refactor per-quality tests into a loop, add comprehensive fuzzy testing suite |
| brotli.go | Implement isValidBrotliStream method with ASCII detection, wire it into Match |
Comments suppressed due to low confidence (1)
brotli_test.go:58
- The fuzzy tests cover ASCII and Brotli-compressed data, but they don’t verify how pure non-ASCII binary data is handled. Consider adding a test case for random non-ASCII binary inputs to ensure they aren’t misidentified as Brotli-compressed.
func TestBrotli_Fuzzy_Both(t *testing.T) {
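The suggested test case could generate deterministic non-ASCII input along these lines (a hypothetical sketch; `randomBinary` is not part of the PR, and the real test would feed the data to the matcher and assert it is not identified as brotli):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomBinary returns n pseudo-random bytes that are all guaranteed to
// be non-ASCII (0x80..0xFF), suitable as "definitely not text" matcher
// input. A fixed seed keeps the test deterministic across runs.
func randomBinary(n int) []byte {
	r := rand.New(rand.NewSource(42))
	data := make([]byte, n)
	for i := range data {
		data[i] = byte(r.Intn(128) + 128) // always >= 0x80, never ASCII
	}
	return data
}

func main() {
	sample := randomBinary(16)
	fmt.Println(len(sample)) // 16
}
```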
That's weird, I have not enabled automatic review from Copilot.
Must be from a setting I have enabled, my bad! I thought it would only happen on my own repositories. I find it now provides helpful feedback about 10-15% of the time as of the last few weeks; prior to that, it was about 0.01% of the time. Although obviously it still gets a lot wrong, as shown here.
I added some fuzzy binary data generation as a new test for #36. All of the compressed data is being accurately identified; however, a number of uncompressed binary data tests are being incorrectly identified as brotli, in both the main branch and this PR. So far I have been unsuccessful in finding a fix for that case. If we could pass a configuration to archives which allows us to specify which of the registered formats we actually want enabled, that could serve as a workaround. Kind of crazy that such a widely used compression format has no reliable method of detection??
I tried adding a final check which reads significantly more data from the stream, decompresses it, and compares the compressed versus decompressed output sizes. The assumption being that if it's a legitimate compressed stream, the output would be some reasonable ratio larger. However, I couldn't get it to work reliably.
I encountered the same issue as well. There were no uncompressed files, and yet br compression was detected via the byte stream.
No worries, surprising to me too! Thanks for the improvements. It looks like the tests are failing though. We'll need to get them to pass before we can merge this. |
Yes, the uncompressed binary case is proving tricky. I have tried a number of different approaches to catch that edge case, but so far have been unsuccessful. I will keep trying periodically when I have some spare time.
OK, I was able to get something working that passes all existing tests as well as a bunch of new ones I added, and I have updated the PR description. What are your thoughts on adding configuration which disables brotli stream matching, perhaps via a new |
mholt left a comment
Very nice -- thanks for the effort on this!
Let's merge this and try it out. As for a new exported API to customize formats in the Identify routine, I'm open to a new API (though probably a little different) -- but first let's see how this goes. If it still gets too many false positives, then I'm down for a refactoring/additional API.
Closes #36