
Fix TarReader: implement full GNU sparse format 1.0 (PAX) reading with DataStream expansion#125283

Open
Copilot wants to merge 33 commits into main from copilot/fix-gnu-sparse-format-handling

Conversation

Contributor

Copilot AI commented Mar 6, 2026

TarReader was not handling GNU sparse format 1.0 PAX entries, causing ~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs built on macOS/APFS) to expose internal placeholder paths like GNUSparseFile.0/real-file.dll, incorrect sizes, and corrupted extracted content.

Changes

Added read-only support for GNU sparse format 1.0 (PAX). When TarReader encounters PAX extended attributes GNU.sparse.major=1 and GNU.sparse.minor=0, it resolves the real file name from GNU.sparse.name, reports the expanded size from GNU.sparse.realsize, and wraps the raw data stream with GnuSparseStream which presents the expanded virtual file content (zeros for holes, packed data at correct offsets).

The sparse map embedded in the data section is parsed lazily on first Read, so _dataStream remains unconsumed during entry construction. This allows TarWriter.WriteEntry to round-trip the condensed sparse data correctly for both seekable and non-seekable source archives.

Older GNU sparse formats (0.0, 0.1) and write support are not addressed.

Segment search improvements in GnuSparseStream:

  • Sequential reads (the common case) use the cached _currentSegmentIndex and scan forward — O(1) amortized.
  • Backward seeks (e.g., random access) now use binary search — O(log n) instead of a linear scan from zero.
  • Segment validation in InitializeFromParsedMap uses overflow-safe arithmetic (offset > _realSize || length > _realSize - offset) to reject malformed maps that would cause integer overflow.
```csharp
// Before: entry.Name == "GNUSparseFile.0/dotnet.dll", entry.Length == 512
// After:  entry.Name == "dotnet.dll", entry.Length == 1048576
using var reader = new TarReader(archiveStream);
TarEntry entry = reader.GetNextEntry();
entry.DataStream.ReadExactly(content); // correctly expanded virtual file
```
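The forward-scan-plus-binary-search strategy in the bullets above can be sketched as follows. This is an illustrative model only, not the actual TarReader code; `SegmentLookupDemo`, `s_offsets`, and `FindSegment` are hypothetical names.

```csharp
// Sketch of the segment-lookup strategy: scan forward from a cached hint for
// sequential reads (O(1) amortized), fall back to binary search only when the
// requested position is before the cached segment (a backward seek).
using System;

class SegmentLookupDemo
{
    // Segment start offsets within the expanded virtual file, ascending.
    private static readonly long[] s_offsets = { 0, 4096, 65536 };
    private static int s_currentSegmentIndex; // cached hint from the last lookup

    public static int FindSegment(long position)
    {
        if (position >= s_offsets[s_currentSegmentIndex])
        {
            // Sequential (forward) case: advance the hint one step at a time.
            while (s_currentSegmentIndex + 1 < s_offsets.Length &&
                   position >= s_offsets[s_currentSegmentIndex + 1])
            {
                s_currentSegmentIndex++;
            }
        }
        else
        {
            // Backward seek: binary search for the last offset <= position.
            int lo = 0, hi = s_currentSegmentIndex;
            while (lo < hi)
            {
                int mid = (lo + hi + 1) / 2;
                if (s_offsets[mid] <= position) lo = mid; else hi = mid - 1;
            }
            s_currentSegmentIndex = lo;
        }
        return s_currentSegmentIndex;
    }

    static void Main()
    {
        Console.WriteLine(FindSegment(5000)); // forward scan -> segment 1
        Console.WriteLine(FindSegment(100));  // backward seek -> segment 0
    }
}
```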

Testing

All existing tests pass. New TarReader.SparseFile.Tests.cs covers:

  • Parameterized sparse layouts (single segment, holes, multiple segments, all-holes) × copyData × sync/async
  • Corrupted sparse map handling (non-numeric values, truncated maps, buffer overflow) × sync/async
  • Wrong sparse version detection (missing minor, wrong major)
  • Seekable random access, partial reads, advance-past-entry correctness
  • Round-trip copy through TarWriter with seekable/non-seekable source × copyData
  • Golang corpus archives (pax-nil-sparse-data, pax-nil-sparse-hole, pax-sparse-big)


Copilot AI and others added 2 commits March 6, 2026 22:52
…rse.name and GNU.sparse.realsize

Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
…sertions for data stream integrity

Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix TarReader to handle GNU sparse format 1.0 correctly" to "Fix TarReader handling of GNU sparse format 1.0 (PAX) — resolve GNU.sparse.name and GNU.sparse.realsize" on Mar 6, 2026
@lewing lewing requested a review from rzikm March 9, 2026 01:52
@rzikm
Member

rzikm commented Mar 9, 2026

@copilot What happens when we attempt to extract such archives? If we are to transparently handle the real name and size, then we should also expand the contents when reading the entry's DataStream

Contributor Author

Copilot AI commented Mar 9, 2026

@copilot What happens when we attempt to extract such archives? If we are to transparently handle the real name and size, then we should also expand the contents when reading the entry's DataStream

Extracting GNU sparse format 1.0 PAX entries with the current code produces a file with the correct name (from our fix) but incorrect content.

Here's what happens: the DataStream for a sparse 1.0 PAX entry contains the raw archive data section, which has this layout:

  1. A text-based sparse map header (newline-delimited: number of segments, then pairs of offset\nnumbytes\n)
  2. The non-zero data segments packed together
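As a concrete illustration of that layout, the following standalone sketch parses the newline-delimited sparse map text that prefixes the data section. `SparseMapDemo` and `ParseMap` are hypothetical names invented for this example; they are not part of System.Formats.Tar.

```csharp
// Parses a GNU sparse 1.0 map: the segment count, then (offset, numbytes)
// pairs, each as a decimal number terminated by '\n'. In a real archive this
// text sits at the start of the entry's data section, padded to a 512-byte
// block boundary before the packed data begins.
using System;
using System.Collections.Generic;

class SparseMapDemo
{
    public static List<(long Offset, long Length)> ParseMap(string text, out int consumed)
    {
        var segments = new List<(long, long)>();
        int pos = 0;
        long ReadNumber()
        {
            int nl = text.IndexOf('\n', pos);
            long value = long.Parse(text.AsSpan(pos, nl - pos));
            pos = nl + 1;
            return value;
        }
        long count = ReadNumber();          // number of segments
        for (long i = 0; i < count; i++)
        {
            long offset = ReadNumber();     // offset within the expanded file
            long length = ReadNumber();     // bytes of packed data at that offset
            segments.Add((offset, length));
        }
        consumed = pos;                     // map text length, before padding
        return segments;
    }

    static void Main()
    {
        // Two segments: 512 bytes at offset 0 and 512 bytes at offset 1024;
        // everything in between and after is a hole (reads as zeros).
        var map = ParseMap("2\n0\n512\n1024\n512\n", out int consumed);
        Console.WriteLine($"{map.Count} segments, map text is {consumed} bytes");
    }
}
```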

When ExtractAsRegularFile calls DataStream.CopyTo(fs), it copies those raw bytes verbatim — so the extracted file contains the sparse map text followed by packed data, rather than the properly expanded file (zeros for holes, data at the correct offsets).

Properly expanding the DataStream would require:

  • Parsing the sparse map text from the start of the data section
  • Creating a wrapper stream that serves zeros for sparse holes and actual bytes at the right offsets

This is significantly more complex than the current fix. The current PR is scoped to fix the metadata issues (Name and Length reporting) that were the primary pain point. Full sparse expansion is a meaningful follow-up that should be tracked separately.

I can revert the _gnuSparseRealSizeLength change if the inconsistency between the reported length and the DataStream content is considered too misleading. However, I'd argue knowing the real size (even when extraction doesn't yet expand the content) is still more useful than reporting the stored sparse data size.

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

@rzikm
Member

rzikm commented Mar 9, 2026

@copilot Simply fixing the exposed metadata does not address the originally reported issue; you need to implement complete handling of the PAX 1.0 sparse format (reading only, no need to be able to produce such archives).

I think it is okay to avoid implementing the older sparse file formats though.

Copilot AI and others added 2 commits March 9, 2026 21:29
…SparseStream

Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
…ed offsets, fix CanSeek order, fix async allocation

Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 9, 2026

@copilot simply fixing the exposed metadata does not address the original reported issue, you need to implement complete handling of the pax 1.0 sparse format (only for reading, no need to be able to ...

Done in commit 1015043a. The PR now fully handles GNU sparse format 1.0 PAX reading:

  • GnuSparseStream wraps the raw data section and expands it: it parses the embedded sparse map (numSegs, offset/length pairs), then serves zeros for sparse holes and packed data for real segments, presenting a virtual file of realsize bytes.
  • TarEntry.Name returns the real name from GNU.sparse.name (not the GNUSparseFile.0/… placeholder).
  • TarEntry.Length and DataStream.Length both report the expanded real size.
  • ExtractToFile will write the correctly expanded content.
  • Older GNU sparse formats (0.0, 0.1) are not addressed — those use different PAX attribute schemes and are a separate follow-up.

Code review fixes applied: a MaxSparseSegments = 1_000_000 DoS cap, O(1) precomputed packed-offset prefix sums, corrected CanSeek short-circuit order, and the async byte[1] scratch buffer is now allocated once per parse instead of once per line read.
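The precomputed packed-offset prefix sums mentioned above can be sketched like this. The names (`PackedOffsetDemo`, `s_packedStart`, `PackedPosition`) and the sample map are illustrative, not the actual implementation.

```csharp
// s_packedStart[i] holds the sum of the lengths of segments 0..i-1, i.e. how
// many packed bytes precede segment i in the condensed data. With that table,
// translating any virtual position inside a segment to its packed-stream
// position is a single O(1) addition instead of a per-read summation.
using System;

class PackedOffsetDemo
{
    // (offset, length) pairs from a hypothetical sparse map, ascending by offset.
    static readonly (long Offset, long Length)[] s_segments =
        { (0, 512), (4096, 1024), (65536, 256) };

    static readonly long[] s_packedStart = BuildPrefixSums();

    static long[] BuildPrefixSums()
    {
        var sums = new long[s_segments.Length];
        long total = 0;
        for (int i = 0; i < s_segments.Length; i++)
        {
            sums[i] = total;                // packed bytes before segment i
            total += s_segments[i].Length;
        }
        return sums;
    }

    // Packed-data position of virtual position `pos`, which lies in segment i.
    public static long PackedPosition(int i, long pos) =>
        s_packedStart[i] + (pos - s_segments[i].Offset);

    static void Main()
    {
        Console.WriteLine(PackedPosition(1, 4100)); // 512 + (4100 - 4096) = 516
    }
}
```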

Copilot AI changed the title from "Fix TarReader handling of GNU sparse format 1.0 (PAX) — resolve GNU.sparse.name and GNU.sparse.realsize" to "Fix TarReader: implement full GNU sparse format 1.0 (PAX) reading with DataStream expansion" on Mar 9, 2026
Copilot finished work on behalf of rzikm March 9, 2026 21:37
Member

@rzikm rzikm left a comment


@copilot Please review my comments

Rent the 1024-byte parsing buffer from ArrayPool<byte>.Shared instead
of allocating a new array on each call, reducing GC pressure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rzikm and others added 2 commits March 11, 2026 18:09
Reject segments that extend past realSize or are not in ascending
offset order during InitializeFromParsedMap. Add test cases for
out-of-bounds and misordered segments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…c and async

Replace InlineData with MemberData generator that yields every
corrupted map string with both useAsync=false and useAsync=true,
ensuring full sync/async coverage for all 11 corruption scenarios.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@karelz karelz assigned rzikm and unassigned lewing and Copilot Mar 12, 2026
rzikm and others added 3 commits March 16, 2026 10:07
…eStream

- GnuSparseStream.Dispose now disposes the underlying raw stream
- Replace Debug.Assert with runtime check for negative skipBytes on
  non-seekable streams in ReadFromPackedData/ReadFromPackedDataAsync
- Restore AdvanceStreamAsync in the async ProcessDataBlock path
  (was incorrectly changed to sync AdvanceStream)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rzikm
Member

rzikm commented Mar 18, 2026

@copilot Address the code review comments.

For the segments binary search, preserve the behavior that we don't perform unnecessary searches when doing a sequential read of the entire entry.

…tial scan for forward reads; fix overflow check

Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 18, 2026

@copilot Address the code review comments.

For the segments binary search, preserve the behavior that we don't perform unnecessary searches when doing a sequential read of the entire entry.

Addressed in 6967f5fb:

  • Overflow check fixed: offset + length > _realSize replaced with the overflow-safe offset > _realSize || length > _realSize - offset
  • Binary search added via BinarySearchSegment helper; sequential read optimization preserved — forward scan from the _currentSegmentIndex hint is unchanged for sequential reads (O(1) amortized), binary search is only used when the requested position is before the current hint
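The motivation for that rewrite can be shown in a minimal, self-contained form. `OverflowCheckDemo` and `IsValidSegment` are hypothetical names for this sketch; only the two-part comparison mirrors the check described above.

```csharp
// With long arithmetic, a hostile sparse map can make `offset + length` wrap
// negative, so the naive `offset + length > realSize` check would accept it.
// The two-part comparison never performs an addition that can overflow.
using System;

class OverflowCheckDemo
{
    public static bool IsValidSegment(long offset, long length, long realSize) =>
        offset >= 0 && length >= 0 &&
        !(offset > realSize || length > realSize - offset);

    static void Main()
    {
        long realSize = 1048576;
        // offset + length wraps to a negative value here, which the naive
        // check would treat as "within bounds"; the safe check rejects it.
        Console.WriteLine(IsValidSegment(1, long.MaxValue, realSize)); // False
        Console.WriteLine(IsValidSegment(4096, 512, realSize));        // True
    }
}
```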

}
}

protected override void Dispose(bool disposing)
Member commented on this diff:

There is no DisposeAsync override.

// _size retains the archive data section length needed for correct stream positioning.
if (TarHelpers.TryGetStringAsBaseTenLong(ExtendedAttributes, PaxEaGnuSparseRealSize, out long gnuSparseRealSize))
{
_gnuSparseRealSize = gnuSparseRealSize;
Member commented on this diff:

can the size be negative here?
