Add Parquet variant shredding support #332
CurtHagenlocher wants to merge 4 commits into apache:main
Conversation
Pull request overview
Adds end-to-end Parquet shredded-variant support (reader + producer) under Apache.Arrow.Operations.Shredding, with supporting enhancements to Arrow Variant scalar/array APIs and conformance fixtures converted to Arrow IPC for CI.
Changes:
- Introduces Apache.Arrow.Operations.Shredding types (e.g., ShredType, ShredOptions, and shared helpers) to represent and operate on shredded typed_value layouts.
- Extends Variant scalar tooling with cross-metadata transcoding support (VariantValueWriter.CopyValue) and a metadata prepass helper (VariantMetadataBuilder.CollectFieldNames).
- Adds a regeneration script and checks in Arrow IPC fixtures converted from the Parquet shredded-variant corpus.
Reviewed changes
Copilot reviewed 29 out of 166 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/shredded_variant_ipc/regen.py | Script to regenerate Arrow IPC fixtures from the parquet-testing shredded-variant corpus. |
| test/shredded_variant_ipc/case-*.arrow (many files) | Checked-in Arrow IPC fixtures generated from the shredded-variant Parquet test corpus. |
| src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs | Adds CopyValue(VariantReader) to transcode values while re-resolving field IDs against a target metadata dictionary. |
| src/Apache.Arrow.Scalars/Variant/VariantValue.cs | Adds FromDecimal16(SqlDecimal) to preserve Decimal16 intent and support values beyond decimal range. |
| src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs | Adds CollectFieldNames(VariantReader) for two-pass encode workflows. |
| src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs | Adds shared helper to construct per-row ShreddedVariant slots from element-group structs. |
| src/Apache.Arrow.Operations/Shredding/ShredType.cs | Defines the shredding type system for typed_value columns (primitive + object/array). |
| src/Apache.Arrow.Operations/Shredding/ShredOptions.cs | Adds schema inference tuning options (depth, frequency, type consistency). |
| src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj | Adds a project reference to Apache.Arrow to support shredding operations over Arrow arrays/types. |
```csharp
    SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
    return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
}
return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
```
FromDecimal16(SqlDecimal) converts to decimal via value.Value when value.Data[3] == 0. SqlDecimal.Value can still throw for values that aren't representable as System.Decimal (e.g., scale/precision beyond decimal’s limits) even when the magnitude fits in 96 bits. Consider storing the SqlDecimal instance in those cases (or using a try/catch fallback) so Decimal16 materialization can’t unexpectedly overflow.
Suggested change:
```diff
-return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+try
+{
+    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+}
+catch (OverflowException)
+{
+    SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
+    return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
+}
```
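For context, here's a minimal repro of the failure mode the review comment describes, assuming standard System.Data.SqlTypes behavior: System.Decimal supports a maximum scale of 28, so a SqlDecimal whose mantissa fits in 96 bits (Data[3] == 0) can still fail to convert when its scale is larger.

```csharp
using System;
using System.Data.SqlTypes;

class SqlDecimalOverflowRepro
{
    static void Main()
    {
        // Scale 32: the mantissa is 1, which fits in 96 bits (Data[3] == 0),
        // but the scale exceeds System.Decimal's maximum of 28.
        SqlDecimal tiny = SqlDecimal.Parse("0." + new string('0', 31) + "1");

        try
        {
            decimal d = tiny.Value; // conversion to System.Decimal overflows
            Console.WriteLine(d);
        }
        catch (OverflowException e)
        {
            Console.WriteLine($"Overflowed as expected: {e.Message}");
        }
    }
}
```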
I think in a followup change I'm going to always store a Decimal16 as a SqlDecimal and vice versa. The current "convert to decimal if it fits" strategy is unnecessarily complicated. Filed #33 to cover this.
```csharp
{
    StructType elementGroupType = (StructType)elementGroup.Data.DataType;
    int valueIdx = elementGroupType.GetFieldIndex("value");
    int typedIdx = elementGroupType.GetFieldIndex("typed_value");
```
These should probably be cached; need to take a second look.
(Shouldn't affect the public API, so can be done as a followup.)
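One possible shape for that followup, purely as an illustrative sketch (the type and field names here are hypothetical, not from this PR): resolve the indices once per element group and reuse them across rows.

```csharp
using Apache.Arrow.Types;

// Hypothetical: caches the struct-field indices once per element group so
// they aren't re-resolved by name on every row.
internal readonly struct ElementGroupLayout
{
    public readonly int ValueIdx;
    public readonly int TypedIdx;

    public ElementGroupLayout(StructType elementGroupType)
    {
        ValueIdx = elementGroupType.GetFieldIndex("value");
        TypedIdx = elementGroupType.GetFieldIndex("typed_value");
    }
}
```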
What's Changed
Implements the Parquet variant shredding spec end-to-end in a new Apache.Arrow.Operations.Shredding namespace, alongside minor changes to the base scalar and array types.
Operations.Shredding reader side:
- ShreddedVariant / ShreddedObject / ShreddedArray: a ref-struct trio exposing typed columns and residual bytes side-by-side.
- VariantArrayShreddingExtensions adds GetShreddedVariant(i) and GetLogicalVariantValue(i) on VariantArray.
- ShredSchema.FromArrowType derives a shredding schema from an Arrow typed_value type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).
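A hedged usage sketch of the reader-side API just listed; signatures are inferred from the names above and may differ from the actual code:

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Scalars;

static void ReadRow(VariantArray array, int i)
{
    // Typed columns and residual bytes, side by side.
    ShreddedVariant shredded = array.GetShreddedVariant(i);

    // Or reassemble the full logical value regardless of how it was shredded.
    VariantValue logical = array.GetLogicalVariantValue(i);
}
```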
Operations.Shredding producer side:
- VariantShredder decomposes a column of VariantValues against a ShredSchema into shared metadata + per-row ShredResults.
- ShreddedVariantArrayBuilder assembles those into a shredded VariantArray with a typed_value Arrow tree matching the schema.
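And the producer side as a sketch of the intended flow; the description names the types and the data flow but not their members, so every constructor and method name below is a guess:

```csharp
using System.Collections.Generic;
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Scalars;

static VariantArray BuildShredded(IReadOnlyList<VariantValue> column, ShredSchema schema)
{
    var shredder = new VariantShredder(schema);            // ctor assumed
    var builder = new ShreddedVariantArrayBuilder(schema); // ctor assumed

    foreach (VariantValue value in column)
    {
        ShredResult result = shredder.Shred(value); // typed parts + residual; name assumed
        builder.Append(result);                     // name assumed
    }

    return builder.Build();                         // name assumed
}
```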
Apache.Arrow changes:
- VariantExtensionDefinition accepts struct<metadata, value?, typed_value?> layouts in addition to the plain unshredded form.
- VariantType gains IsShredded / HasValueColumn / HasTypedValueColumn / TypedValueField properties.
- VariantArray.GetVariantValue and GetVariantReader throw on shredded columns with a pointer to the Operations.Shredding extensions.
- The public VariantArray(IArrowArray) constructor now infers the VariantType (shredded or not) from the storage shape.
- Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.
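In practice this lets consumers branch on the new properties before touching the base accessors. A sketch, assuming the array's VariantType is reachable through its Data.DataType (an assumption, not confirmed by the PR text):

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Scalars;
using Apache.Arrow.Types;

static VariantValue ReadValue(VariantArray array, int i)
{
    var variantType = (VariantType)array.Data.DataType; // assumed location of the type

    // GetVariantValue throws on shredded columns and points at the
    // Operations.Shredding extensions instead.
    return variantType.IsShredded
        ? array.GetLogicalVariantValue(i)
        : array.GetVariantValue(i);
}
```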
Apache.Arrow.Scalars changes:
- VariantValueWriter.CopyValue(VariantReader source) transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.
- VariantMetadataBuilder.CollectFieldNames(VariantReader source) is the two-pass companion that accumulates source field names into the target metadata builder.
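A sketch of the two-pass merge workflow these two enable. How the finished metadata feeds the writer is an assumption (both are passed in here), and it also assumes readers can be re-read, or recreated, between passes:

```csharp
using Apache.Arrow.Scalars;

static void MergeTwo(VariantValueWriter writer,
                     VariantMetadataBuilder metadataBuilder,
                     VariantReader first, VariantReader second)
{
    // Pass 1: accumulate every field name from every source into the target
    // metadata dictionary so field IDs can be resolved up front.
    metadataBuilder.CollectFieldNames(first);
    metadataBuilder.CollectFieldNames(second);

    // Pass 2: transcode each value, re-resolving its field IDs against the
    // merged dictionary.
    writer.CopyValue(first);
    writer.CopyValue(second);
}
```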
Validation:
- Conformance fixtures come from apache/parquet-testing (test/parquet-testing/shredded_variant/).
- test/shredded_variant_ipc/regen.py converts each case-NNN.parquet to an Arrow IPC file via pyarrow; the 137 resulting .arrow files are checked in so CI needs no Python.
- All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.