Add Parquet variant shredding support #332
CurtHagenlocher wants to merge 4 commits into apache:main
Conversation
Pull request overview
Adds end-to-end Parquet shredded-variant support (reader + producer) under Apache.Arrow.Operations.Shredding, with supporting enhancements to Arrow Variant scalar/array APIs and conformance fixtures converted to Arrow IPC for CI.
Changes:
- Introduces Apache.Arrow.Operations.Shredding types (e.g., ShredType, ShredOptions, and shared helpers) to represent and operate on shredded typed_value layouts.
- Extends Variant scalar tooling with cross-metadata transcoding support (VariantValueWriter.CopyValue) and a metadata prepass helper (VariantMetadataBuilder.CollectFieldNames).
- Adds a regeneration script and checks in Arrow IPC fixtures converted from the Parquet shredded-variant corpus.
Reviewed changes
Copilot reviewed 29 out of 166 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/shredded_variant_ipc/regen.py | Script to regenerate Arrow IPC fixtures from the parquet-testing shredded-variant corpus. |
| test/shredded_variant_ipc/case-*.arrow (many files) | Checked-in Arrow IPC fixtures generated from the shredded-variant Parquet test corpus. |
| src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs | Adds CopyValue(VariantReader) to transcode values while re-resolving field IDs against a target metadata dictionary. |
| src/Apache.Arrow.Scalars/Variant/VariantValue.cs | Adds FromDecimal16(SqlDecimal) to preserve Decimal16 intent and support values beyond decimal range. |
| src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs | Adds CollectFieldNames(VariantReader) for two-pass encode workflows. |
| src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs | Adds shared helper to construct per-row ShreddedVariant slots from element-group structs. |
| src/Apache.Arrow.Operations/Shredding/ShredType.cs | Defines the shredding type system for typed_value columns (primitive + object/array). |
| src/Apache.Arrow.Operations/Shredding/ShredOptions.cs | Adds schema inference tuning options (depth, frequency, type consistency). |
| src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj | Adds a project reference to Apache.Arrow to support shredding operations over Arrow arrays/types. |
```csharp
    SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
    return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
}
return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
```
FromDecimal16(SqlDecimal) converts to decimal via value.Value when value.Data[3] == 0. SqlDecimal.Value can still throw for values that aren't representable as System.Decimal (e.g., scale/precision beyond decimal’s limits) even when the magnitude fits in 96 bits. Consider storing the SqlDecimal instance in those cases (or using a try/catch fallback) so Decimal16 materialization can’t unexpectedly overflow.
Suggested change:
```diff
-return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+try
+{
+    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
+}
+catch (OverflowException)
+{
+    SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
+    return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
+}
```
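For context, here's a minimal repro of the failure mode the review comment describes, assuming standard System.Data.SqlTypes behavior: System.Decimal supports a maximum scale of 28, so a SqlDecimal whose mantissa fits in 96 bits (Data[3] == 0) can still fail to convert when its scale is larger.

```csharp
using System;
using System.Data.SqlTypes;

class SqlDecimalOverflowRepro
{
    static void Main()
    {
        // Scale 32: the mantissa is 1, which fits in 96 bits (Data[3] == 0),
        // but the scale exceeds System.Decimal's maximum of 28.
        SqlDecimal tiny = SqlDecimal.Parse("0." + new string('0', 31) + "1");

        try
        {
            decimal d = tiny.Value; // conversion to System.Decimal overflows
            Console.WriteLine(d);
        }
        catch (OverflowException e)
        {
            Console.WriteLine($"Overflowed as expected: {e.Message}");
        }
    }
}
```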
I think in a followup change I'm going to always store a Decimal16 as a SqlDecimal and vice versa. The current "convert to decimal if it fits" strategy is unnecessarily complicated. Filed #33 to cover this.
```csharp
{
    StructType elementGroupType = (StructType)elementGroup.Data.DataType;
    int valueIdx = elementGroupType.GetFieldIndex("value");
    int typedIdx = elementGroupType.GetFieldIndex("typed_value");
```
These should probably be cached; need to take a second look.
(Shouldn't affect the public API, so can be done as a followup.)
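One possible shape for that followup, purely as an illustrative sketch (the type and field names here are hypothetical, not from this PR): resolve the indices once per element group and reuse them across rows.

```csharp
using Apache.Arrow.Types;

// Hypothetical: caches the struct-field indices once per element group so
// they aren't re-resolved by name on every row.
internal readonly struct ElementGroupLayout
{
    public readonly int ValueIdx;
    public readonly int TypedIdx;

    public ElementGroupLayout(StructType elementGroupType)
    {
        ValueIdx = elementGroupType.GetFieldIndex("value");
        TypedIdx = elementGroupType.GetFieldIndex("typed_value");
    }
}
```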
What's Changed
Implements the Parquet variant shredding spec end-to-end in a new Apache.Arrow.Operations.Shredding namespace, alongside minor changes to the base scalar and array types.
Operations.Shredding reader side:
- ShreddedVariant / ShreddedObject / ShreddedArray: a ref-struct trio exposing typed columns and residual bytes side-by-side.
- VariantArrayShreddingExtensions adds GetShreddedVariant(i) and GetLogicalVariantValue(i) on VariantArray.
- ShredSchema.FromArrowType derives a shredding schema from an Arrow typed_value type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).
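A hedged usage sketch of the reader-side API just listed; signatures are inferred from the names above and may differ from the actual code:

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Scalars;

static void ReadRow(VariantArray array, int i)
{
    // Typed columns and residual bytes, side by side.
    ShreddedVariant shredded = array.GetShreddedVariant(i);

    // Or reassemble the full logical value regardless of how it was shredded.
    VariantValue logical = array.GetLogicalVariantValue(i);
}
```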
Operations.Shredding producer side:
- VariantShredder decomposes a column of VariantValues against a ShredSchema into shared metadata + per-row ShredResults.
- ShreddedVariantArrayBuilder assembles those into a shredded VariantArray with a typed_value Arrow tree matching the schema.
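And the producer side as a sketch of the intended flow; the description names the types and the data flow but not their members, so every constructor and method name below is a guess:

```csharp
using System.Collections.Generic;
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Scalars;

static VariantArray BuildShredded(IReadOnlyList<VariantValue> column, ShredSchema schema)
{
    var shredder = new VariantShredder(schema);            // ctor assumed
    var builder = new ShreddedVariantArrayBuilder(schema); // ctor assumed

    foreach (VariantValue value in column)
    {
        ShredResult result = shredder.Shred(value); // typed parts + residual; name assumed
        builder.Append(result);                     // name assumed
    }

    return builder.Build();                         // name assumed
}
```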
Apache.Arrow changes:
- VariantExtensionDefinition accepts struct<metadata, value?, typed_value?> layouts in addition to the plain unshredded form.
- VariantType gains IsShredded / HasValueColumn / HasTypedValueColumn / TypedValueField properties.
- VariantArray.GetVariantValue and GetVariantReader throw on shredded columns with a pointer to the Operations.Shredding extensions.
- The public VariantArray(IArrowArray) constructor now infers the VariantType (shredded or not) from the storage shape.
- Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.
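In practice this lets consumers branch on the new properties before touching the base accessors. A sketch, assuming the array's VariantType is reachable through its Data.DataType (an assumption, not confirmed by the PR text):

```csharp
using Apache.Arrow;
using Apache.Arrow.Operations.Shredding;
using Apache.Arrow.Scalars;
using Apache.Arrow.Types;

static VariantValue ReadValue(VariantArray array, int i)
{
    var variantType = (VariantType)array.Data.DataType; // assumed location of the type

    // GetVariantValue throws on shredded columns and points at the
    // Operations.Shredding extensions instead.
    return variantType.IsShredded
        ? array.GetLogicalVariantValue(i)
        : array.GetVariantValue(i);
}
```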
Apache.Arrow.Scalars changes:
- VariantValueWriter.CopyValue(VariantReader source) transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.
- VariantMetadataBuilder.CollectFieldNames(VariantReader source) is the two-pass companion that accumulates source field names into the target metadata builder.
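A sketch of the two-pass merge workflow these two enable. How the finished metadata feeds the writer is an assumption (both are passed in here), and it also assumes readers can be re-read, or recreated, between passes:

```csharp
using Apache.Arrow.Scalars;

static void MergeTwo(VariantValueWriter writer,
                     VariantMetadataBuilder metadataBuilder,
                     VariantReader first, VariantReader second)
{
    // Pass 1: accumulate every field name from every source into the target
    // metadata dictionary so field IDs can be resolved up front.
    metadataBuilder.CollectFieldNames(first);
    metadataBuilder.CollectFieldNames(second);

    // Pass 2: transcode each value, re-resolving its field IDs against the
    // merged dictionary.
    writer.CopyValue(first);
    writer.CopyValue(second);
}
```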
Validation:
- Conformance fixtures come from apache/parquet-testing (test/parquet-testing/shredded_variant/).
- test/shredded_variant_ipc/regen.py converts each case-NNN.parquet to an Arrow IPC file via pyarrow; the 137 resulting .arrow files are checked in so CI needs no Python.
- All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.