
Fix TarReader: implement full GNU sparse format 1.0 (PAX) reading with DataStream expansion#125283

Open
Copilot wants to merge 33 commits into main from copilot/fix-gnu-sparse-format-handling

Conversation

Contributor

Copilot AI commented Mar 6, 2026

TarReader was not handling GNU sparse format 1.0 PAX entries, causing ~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs built on macOS/APFS) to expose internal placeholder paths like GNUSparseFile.0/real-file.dll, incorrect sizes, and corrupted extracted content.

Changes

Added read-only support for GNU sparse format 1.0 (PAX). When TarReader encounters PAX extended attributes GNU.sparse.major=1 and GNU.sparse.minor=0, it resolves the real file name from GNU.sparse.name, reports the expanded size from GNU.sparse.realsize, and wraps the raw data stream with GnuSparseStream which presents the expanded virtual file content (zeros for holes, packed data at correct offsets).

The sparse map embedded in the data section is parsed lazily on first Read, so _dataStream remains unconsumed during entry construction. This allows TarWriter.WriteEntry to round-trip the condensed sparse data correctly for both seekable and non-seekable source archives.

Older GNU sparse formats (0.0, 0.1) and write support are not addressed.

Segment search improvements in GnuSparseStream:

  • Sequential reads (the common case) use the cached _currentSegmentIndex and scan forward — O(1) amortized.
  • Backward seeks (e.g., random access) now use binary search — O(log n) instead of a linear scan from zero.
  • Segment validation in InitializeFromParsedMap uses overflow-safe arithmetic (offset > _realSize || length > _realSize - offset) to reject malformed maps that would cause integer overflow.
```csharp
// Before: entry.Name == "GNUSparseFile.0/dotnet.dll", entry.Length == 512
// After:  entry.Name == "dotnet.dll", entry.Length == 1048576
using var reader = new TarReader(archiveStream);
TarEntry entry = reader.GetNextEntry();
entry.DataStream.ReadExactly(content); // correctly expanded virtual file
```
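The forward-scan-plus-binary-search strategy in the bullets above can be sketched as follows. This is an illustrative model only, not the actual TarReader code; `SegmentLookupDemo`, `s_offsets`, and `FindSegment` are hypothetical names.

```csharp
// Sketch of the segment-lookup strategy: scan forward from a cached hint for
// sequential reads (O(1) amortized), fall back to binary search only when the
// requested position is before the cached segment (a backward seek).
using System;

class SegmentLookupDemo
{
    // Segment start offsets within the expanded virtual file, ascending.
    private static readonly long[] s_offsets = { 0, 4096, 65536 };
    private static int s_currentSegmentIndex; // cached hint from the last lookup

    public static int FindSegment(long position)
    {
        if (position >= s_offsets[s_currentSegmentIndex])
        {
            // Sequential (forward) case: advance the hint one step at a time.
            while (s_currentSegmentIndex + 1 < s_offsets.Length &&
                   position >= s_offsets[s_currentSegmentIndex + 1])
            {
                s_currentSegmentIndex++;
            }
        }
        else
        {
            // Backward seek: binary search for the last offset <= position.
            int lo = 0, hi = s_currentSegmentIndex;
            while (lo < hi)
            {
                int mid = (lo + hi + 1) / 2;
                if (s_offsets[mid] <= position) lo = mid; else hi = mid - 1;
            }
            s_currentSegmentIndex = lo;
        }
        return s_currentSegmentIndex;
    }

    static void Main()
    {
        Console.WriteLine(FindSegment(5000)); // forward scan -> segment 1
        Console.WriteLine(FindSegment(100));  // backward seek -> segment 0
    }
}
```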

Testing

All existing tests pass. New TarReader.SparseFile.Tests.cs covers:

  • Parameterized sparse layouts (single segment, holes, multiple segments, all-holes) × copyData × sync/async
  • Corrupted sparse map handling (non-numeric values, truncated maps, buffer overflow) × sync/async
  • Wrong sparse version detection (missing minor, wrong major)
  • Seekable random access, partial reads, advance-past-entry correctness
  • Round-trip copy through TarWriter with seekable/non-seekable source × copyData
  • Golang corpus archives (pax-nil-sparse-data, pax-nil-sparse-hole, pax-sparse-big)


Copilot AI and others added 2 commits March 6, 2026 22:52
…rse.name and GNU.sparse.realsize

Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
…sertions for data stream integrity

Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix TarReader to handle GNU sparse format 1.0 correctly" to "Fix TarReader handling of GNU sparse format 1.0 (PAX) — resolve GNU.sparse.name and GNU.sparse.realsize" on Mar 6, 2026
@lewing lewing requested a review from rzikm March 9, 2026 01:52
@rzikm
Member

rzikm commented Mar 9, 2026

@copilot What happens when we attempt to extract such archives? If we are to transparently handle the real name and size, then we should also expand the contents when reading the entry's DataStream

Contributor Author

Copilot AI commented Mar 9, 2026

@copilot What happens when we attempt to extract such archives? If we are to transparently handle the real name and size, then we should also expand the contents when reading the entry's DataStream

Extracting GNU sparse format 1.0 PAX entries with the current code produces a file with the correct name (from our fix) but incorrect content.

Here's what happens: the DataStream for a sparse 1.0 PAX entry contains the raw archive data section, which has this layout:

  1. A text-based sparse map header (newline-delimited: number of segments, then pairs of offset\nnumbytes\n)
  2. The non-zero data segments packed together
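As a concrete illustration of that layout, the following standalone sketch parses the newline-delimited sparse map text that prefixes the data section. `SparseMapDemo` and `ParseMap` are hypothetical names invented for this example; they are not part of System.Formats.Tar.

```csharp
// Parses a GNU sparse 1.0 map: the segment count, then (offset, numbytes)
// pairs, each as a decimal number terminated by '\n'. In a real archive this
// text sits at the start of the entry's data section, padded to a 512-byte
// block boundary before the packed data begins.
using System;
using System.Collections.Generic;

class SparseMapDemo
{
    public static List<(long Offset, long Length)> ParseMap(string text, out int consumed)
    {
        var segments = new List<(long, long)>();
        int pos = 0;
        long ReadNumber()
        {
            int nl = text.IndexOf('\n', pos);
            long value = long.Parse(text.AsSpan(pos, nl - pos));
            pos = nl + 1;
            return value;
        }
        long count = ReadNumber();          // number of segments
        for (long i = 0; i < count; i++)
        {
            long offset = ReadNumber();     // offset within the expanded file
            long length = ReadNumber();     // bytes of packed data at that offset
            segments.Add((offset, length));
        }
        consumed = pos;                     // map text length, before padding
        return segments;
    }

    static void Main()
    {
        // Two segments: 512 bytes at offset 0 and 512 bytes at offset 1024;
        // everything in between and after is a hole (reads as zeros).
        var map = ParseMap("2\n0\n512\n1024\n512\n", out int consumed);
        Console.WriteLine($"{map.Count} segments, map text is {consumed} bytes");
    }
}
```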

When ExtractAsRegularFile calls DataStream.CopyTo(fs), it copies those raw bytes verbatim — so the extracted file contains the sparse map text followed by packed data, rather than the properly expanded file (zeros for holes, data at the correct offsets).

Properly expanding the DataStream would require:

  • Parsing the sparse map text from the start of the data section
  • Creating a wrapper stream that serves zeros for sparse holes and actual bytes at the right offsets

This is significantly more complex than the current fix. The current PR is scoped to fix the metadata issues (Name and Length reporting) that were the primary pain point. Full sparse expansion is a meaningful follow-up that should be tracked separately.

I can revert the _gnuSparseRealSizeLength change if the inconsistency between the reported length and the DataStream content is considered too misleading. However, I'd argue knowing the real size (even when extraction doesn't yet expand the content) is still more useful than reporting the stored sparse data size.

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

@rzikm
Member

rzikm commented Mar 9, 2026

@copilot Simply fixing the exposed metadata does not address the originally reported issue; you need to implement complete handling of the PAX 1.0 sparse format (reading only, no need to be able to produce such archives).

I think it is okay to avoid implementing the older sparse file formats though.

Copilot AI and others added 2 commits March 9, 2026 21:29
…SparseStream

Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
…ed offsets, fix CanSeek order, fix async allocation

Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 9, 2026

@copilot simply fixing the exposed metadata does not address the original reported issue, you need to implement complete handling of the pax 1.0 sparse format (only for reading, no need to be able to ...

Done in commit 1015043a. The PR now fully handles GNU sparse format 1.0 PAX reading:

  • GnuSparseStream wraps the raw data section and expands it: it parses the embedded sparse map (numSegs, offset/length pairs), then serves zeros for sparse holes and packed data for real segments, presenting a virtual file of realsize bytes.
  • TarEntry.Name returns the real name from GNU.sparse.name (not the GNUSparseFile.0/… placeholder).
  • TarEntry.Length and DataStream.Length both report the expanded real size.
  • ExtractToFile will write the correctly expanded content.
  • Older GNU sparse formats (0.0, 0.1) are not addressed — those use different PAX attribute schemes and are a separate follow-up.

Code review fixes applied: a MaxSparseSegments = 1_000_000 DoS cap, O(1) precomputed packed-offset prefix sums, corrected CanSeek short-circuit order, and the async byte[1] scratch buffer is now allocated once per parse instead of once per line read.
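The precomputed packed-offset prefix sums mentioned above can be sketched like this. The names (`PackedOffsetDemo`, `s_packedStart`, `PackedPosition`) and the sample map are illustrative, not the actual implementation.

```csharp
// s_packedStart[i] holds the sum of the lengths of segments 0..i-1, i.e. how
// many packed bytes precede segment i in the condensed data. With that table,
// translating any virtual position inside a segment to its packed-stream
// position is a single O(1) addition instead of a per-read summation.
using System;

class PackedOffsetDemo
{
    // (offset, length) pairs from a hypothetical sparse map, ascending by offset.
    static readonly (long Offset, long Length)[] s_segments =
        { (0, 512), (4096, 1024), (65536, 256) };

    static readonly long[] s_packedStart = BuildPrefixSums();

    static long[] BuildPrefixSums()
    {
        var sums = new long[s_segments.Length];
        long total = 0;
        for (int i = 0; i < s_segments.Length; i++)
        {
            sums[i] = total;                // packed bytes before segment i
            total += s_segments[i].Length;
        }
        return sums;
    }

    // Packed-data position of virtual position `pos`, which lies in segment i.
    public static long PackedPosition(int i, long pos) =>
        s_packedStart[i] + (pos - s_segments[i].Offset);

    static void Main()
    {
        Console.WriteLine(PackedPosition(1, 4100)); // 512 + (4100 - 4096) = 516
    }
}
```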

Copilot AI changed the title from "Fix TarReader handling of GNU sparse format 1.0 (PAX) — resolve GNU.sparse.name and GNU.sparse.realsize" to "Fix TarReader: implement full GNU sparse format 1.0 (PAX) reading with DataStream expansion" on Mar 9, 2026
Copilot finished work on behalf of rzikm March 9, 2026 21:37
Member

@rzikm rzikm left a comment


@copilot Please review my comments

Rent the 1024-byte parsing buffer from ArrayPool<byte>.Shared instead
of allocating a new array on each call, reducing GC pressure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rzikm and others added 2 commits March 11, 2026 18:09
Reject segments that extend past realSize or are not in ascending
offset order during InitializeFromParsedMap. Add test cases for
out-of-bounds and misordered segments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…c and async

Replace InlineData with MemberData generator that yields every
corrupted map string with both useAsync=false and useAsync=true,
ensuring full sync/async coverage for all 11 corruption scenarios.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@karelz karelz assigned rzikm and unassigned lewing and Copilot Mar 12, 2026
rzikm and others added 3 commits March 16, 2026 10:07
…eStream

- GnuSparseStream.Dispose now disposes the underlying raw stream
- Replace Debug.Assert with runtime check for negative skipBytes on
  non-seekable streams in ReadFromPackedData/ReadFromPackedDataAsync
- Restore AdvanceStreamAsync in the async ProcessDataBlock path
  (was incorrectly changed to sync AdvanceStream)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rzikm
Member

rzikm commented Mar 18, 2026

@copilot Address the code review comments.

For the segments binary search, preserve the behavior that we don't perform unnecessary searches when doing a sequential read of the entire entry.

…tial scan for forward reads; fix overflow check

Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 18, 2026

@copilot Address the code review comments.

For the segments binary search, preserve the behavior that we don't perform unnecessary searches when doing a sequential read of the entire entry.

Addressed in 6967f5fb:

  • Overflow check fixed: offset + length > _realSize replaced with the overflow-safe offset > _realSize || length > _realSize - offset
  • Binary search added via BinarySearchSegment helper; sequential read optimization preserved — forward scan from the _currentSegmentIndex hint is unchanged for sequential reads (O(1) amortized), binary search is only used when the requested position is before the current hint
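The motivation for that rewrite can be shown in a minimal, self-contained form. `OverflowCheckDemo` and `IsValidSegment` are hypothetical names for this sketch; only the two-part comparison mirrors the check described above.

```csharp
// With long arithmetic, a hostile sparse map can make `offset + length` wrap
// negative, so the naive `offset + length > realSize` check would accept it.
// The two-part comparison never performs an addition that can overflow.
using System;

class OverflowCheckDemo
{
    public static bool IsValidSegment(long offset, long length, long realSize) =>
        offset >= 0 && length >= 0 &&
        !(offset > realSize || length > realSize - offset);

    static void Main()
    {
        long realSize = 1048576;
        // offset + length wraps to a negative value here, which the naive
        // check would treat as "within bounds"; the safe check rejects it.
        Console.WriteLine(IsValidSegment(1, long.MaxValue, realSize)); // False
        Console.WriteLine(IsValidSegment(4096, 512, realSize));        // True
    }
}
```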

}
}

protected override void Dispose(bool disposing)
Member commented on this diff:

There is no DisposeAsync override.

// _size retains the archive data section length needed for correct stream positioning.
if (TarHelpers.TryGetStringAsBaseTenLong(ExtendedAttributes, PaxEaGnuSparseRealSize, out long gnuSparseRealSize))
{
_gnuSparseRealSize = gnuSparseRealSize;
Member commented on this diff:

can the size be negative here?
