Skip to content

fix:handle compound aliases like .tgz when cataloging archives#4421

Merged
spiffcs merged 9 commits intoanchore:mainfrom
VictorHuu:archiver-compound-aliases
Dec 2, 2025
Merged

fix:handle compound aliases like .tgz when cataloging archives#4421
spiffcs merged 9 commits intoanchore:mainfrom
VictorHuu:archiver-compound-aliases

Conversation

@VictorHuu
Copy link
Contributor

@VictorHuu VictorHuu commented Dec 1, 2025

Description

see: #4416 (comment)

In the near future, it's the library archives who should fix the logic to support compound extension names like .tgz

Summary of magic bytes vs extension for archive detection:

update:
file_source.go:213 - switch to stream-based (already opens file later anyway)
tar_file_traversal.go:23 - opens the file on line 17, so could pass tarReader

defer detection and use new map:
unknowns_tasks.go:64 only has coords.RealPath, would need to open files (potential perf hit for many files)
model.go:159 isArchive() is a helper - opening files could be more expensive here

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

Signed-off-by: Yuntao Hu <victorhu493@gmail.com>
@spiffcs
Copy link
Contributor

spiffcs commented Dec 1, 2025

Wow TY so much @VictorHuu for the PR. I think there are some other sections where we use Identify and want this fix. I'm going to push some changes here and we'll use this branch as the fix.

Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
@spiffcs
Copy link
Contributor

spiffcs commented Dec 1, 2025

I'm also going to quickly explore an approach suggested by @willmurphyscode which would pass the reader. This might give us better automatic detection here where we don't have to implement/maintain this map 😄

update:
file_source.go:213 - switch to stream-based (already opens file later anyway)
tar_file_traversal.go:23 - opens the file on line 17, so could pass tarReader

defer:
unknowns_tasks.go:64 only has coords.RealPath, would need to open files (potential perf hit for many files)
model.go:159 isArchive() is a helper - opening files could be more expensive here

Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
@spiffcs spiffcs marked this pull request as ready for review December 2, 2025 16:50
Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
@spiffcs
Copy link
Contributor

spiffcs commented Dec 2, 2025

Looks like I need to check the CLI tests now with the new split approach

@willmurphyscode willmurphyscode changed the title fix:handle compound aliases like tar.gz when cataloging archives fix:handle compound aliases like .tgz when cataloging archives Dec 2, 2025
Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
//
// This function is a drop-in replacement for archives.Identify that centralizes
// the compound alias handling logic in one place.
func IdentifyArchive(ctx context.Context, path string) (archives.Format, io.Reader, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, do we ever already have a file open when calling archives.Identify? Should this accept a reader rather than trying to open the file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better here if the function handles opening and closing it's own content here. If we pass a reader and it gets used then there might be instances in the code where we expect it to still be available after the function is called and is no longer available. I can check to see if we don't do this, but I thought a more defensive approach would be to create/cleanup its own resources here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below - I thought we had discussed that actually reading bytes from the file to check whether its an archive was a behavior change we didn't necessarily want in this bugfix release.

I don't think it's correct to os.Open here, but I'm not clear on when we reach line 42.

Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
@spiffcs spiffcs merged commit afe28a2 into anchore:main Dec 2, 2025
12 checks passed
@spiffcs
Copy link
Contributor

spiffcs commented Dec 2, 2025

TY @VictorHuu for the help on this one

VictorHuu added a commit to VictorHuu/syft that referenced this pull request Dec 5, 2025
…nchore#4421)

---------
Signed-off-by: Yuntao Hu <victorhu493@gmail.com>
Signed-off-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
Co-authored-by: Christopher Phillips <32073428+spiffcs@users.noreply.github.com>
Signed-off-by: Yuntao Hu <victorhu493@gmail.com>
@VictorHuu VictorHuu deleted the archiver-compound-aliases branch December 6, 2025 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v1.38.0 generates empty sbom for tgz sources

3 participants