Add Parquet variant shredding support #332

Open

CurtHagenlocher wants to merge 4 commits into apache:main from CurtHagenlocher:VariantShredding

Conversation

@CurtHagenlocher
Contributor

What's Changed

Implements the Parquet variant shredding spec end-to-end in a new Apache.Arrow.Operations.Shredding namespace, alongside minor changes to the base scalar and array types.

Operations.Shredding reader side:

  • ShreddedVariant / ShreddedObject / ShreddedArray ref-struct trio exposing typed columns and residual bytes side-by-side.
  • VariantArrayShreddingExtensions adds GetShreddedVariant(i) and GetLogicalVariantValue(i) on VariantArray.
  • ShredSchema.FromArrowType derives a shredding schema from an Arrow typed_value type, rejecting unsupported types (uint32, fixed-size-binary(N≠16)).
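To make the reader-side surface concrete, here is a hypothetical C# sketch. The type and method names are taken from this PR's description, but the exact signatures are assumptions and may differ:

```csharp
// Illustrative sketch only; not runnable against a released Apache.Arrow package.
VariantArray column = GetShreddedColumnFromSomewhere(); // hypothetical helper

for (int i = 0; i < column.Length; i++)
{
    // Typed columns and residual bytes, exposed side by side.
    ShreddedVariant shredded = column.GetShreddedVariant(i);

    // Or reassemble the full logical value, regardless of how it was shredded.
    VariantValue logical = column.GetLogicalVariantValue(i);
}
```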

Operations.Shredding producer side:

  • VariantShredder decomposes a column of VariantValues against a ShredSchema into shared metadata + per-row ShredResults.
  • ShreddedVariantArrayBuilder assembles those into a shredded VariantArray with a typed_value Arrow tree matching the schema.
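A hypothetical producer-side sketch of how these pieces might compose. Only FromArrowType, VariantShredder, and ShreddedVariantArrayBuilder are named in the PR text, so the constructor and method shapes below are assumptions:

```csharp
// Illustrative sketch only.
ShredSchema schema = ShredSchema.FromArrowType(typedValueArrowType); // desired typed_value Arrow tree
var shredder = new VariantShredder();                                // constructor shape assumed
var builder = new ShreddedVariantArrayBuilder(schema);               // constructor shape assumed

// Decompose each row against the schema, then assemble the shredded array.
foreach (VariantValue value in values)                               // 'values' is a hypothetical input column
{
    builder.Append(shredder.Shred(value, schema));                   // method names assumed
}
VariantArray shredded = builder.Build();                             // method name assumed
```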

Apache.Arrow changes:

  • VariantExtensionDefinition accepts struct<metadata, value?, typed_value?> layouts in addition to the plain unshredded form.

  • VariantType gains IsShredded / HasValueColumn / HasTypedValueColumn / TypedValueField properties.

  • VariantArray.GetVariantValue and GetVariantReader throw on shredded columns with a pointer to the Operations.Shredding extensions.

  • The public VariantArray(IArrowArray) constructor now infers the VariantType (shredded or not) from the storage shape.

  • Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.
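For orientation, the storage shapes involved, in informal notation mirroring the PR text (per the Parquet variant shredding spec, value and typed_value are each optional in the shredded form):

```
// Plain unshredded storage (previously the only accepted form):
struct<metadata: binary, value: binary>

// Shredded storage (newly accepted):
struct<metadata: binary, value: binary?, typed_value: T?>
```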

Apache.Arrow.Scalars changes:

  • VariantValueWriter.CopyValue(VariantReader source) transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.

  • VariantMetadataBuilder.CollectFieldNames(VariantReader source) is the two-pass companion that accumulates source field names into the target metadata builder.
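Together these enable a two-pass merge of multiple variants into one shared metadata dictionary. A hypothetical sketch follows; CollectFieldNames and CopyValue are from the PR, while the constructor shapes are assumptions:

```csharp
// Illustrative sketch only.
var metadataBuilder = new VariantMetadataBuilder();

// Pass 1: accumulate every source's field names into the target metadata.
foreach (VariantReader source in sources)
{
    metadataBuilder.CollectFieldNames(source);
}

// Pass 2: transcode each value, re-resolving field IDs against the merged dictionary.
var writer = new VariantValueWriter(metadataBuilder); // constructor shape assumed
foreach (VariantReader source in sources)
{
    writer.CopyValue(source);
}
```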

Validation:

  • Conformance tests run against the Iceberg shredded-variant corpus in apache/parquet-testing (test/parquet-testing/shredded_variant/). test/shredded_variant_ipc/regen.py converts each case-NNN.parquet to an Arrow IPC file via pyarrow; 137 resulting .arrow files are checked in so CI needs no Python. All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.
  • Additional round-trip, reader-style, and builder tests were implemented.

Copilot AI left a comment

Pull request overview

Adds end-to-end Parquet shredded-variant support (reader + producer) under Apache.Arrow.Operations.Shredding, with supporting enhancements to Arrow Variant scalar/array APIs and conformance fixtures converted to Arrow IPC for CI.

Changes:

  • Introduces Apache.Arrow.Operations.Shredding types (e.g., ShredType, ShredOptions, and shared helpers) to represent and operate on shredded typed_value layouts.
  • Extends Variant scalar tooling with cross-metadata transcoding support (VariantValueWriter.CopyValue) and a metadata prepass helper (VariantMetadataBuilder.CollectFieldNames).
  • Adds a regeneration script and checks in Arrow IPC fixtures converted from the Parquet shredded-variant corpus.

Reviewed changes

Copilot reviewed 29 out of 166 changed files in this pull request and generated 1 comment.

Summary per file:

  • test/shredded_variant_ipc/regen.py: Script to regenerate Arrow IPC fixtures from the parquet-testing shredded-variant corpus.
  • test/shredded_variant_ipc/case-*.arrow (many files): Checked-in Arrow IPC fixtures generated from the shredded-variant Parquet test corpus.
  • src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs: Adds CopyValue(VariantReader) to transcode values while re-resolving field IDs against a target metadata dictionary.
  • src/Apache.Arrow.Scalars/Variant/VariantValue.cs: Adds FromDecimal16(SqlDecimal) to preserve Decimal16 intent and support values beyond decimal range.
  • src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs: Adds CollectFieldNames(VariantReader) for two-pass encode workflows.
  • src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs: Adds shared helper to construct per-row ShreddedVariant slots from element-group structs.
  • src/Apache.Arrow.Operations/Shredding/ShredType.cs: Defines the shredding type system for typed_value columns (primitive + object/array).
  • src/Apache.Arrow.Operations/Shredding/ShredOptions.cs: Adds schema inference tuning options (depth, frequency, type consistency).
  • src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj: Adds a project reference to Apache.Arrow to support shredding operations over Arrow arrays/types.


Referenced code (src/Apache.Arrow.Scalars/Variant/VariantValue.cs):

        SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
        return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
    }
    return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
Copilot AI commented on Apr 26, 2026:

FromDecimal16(SqlDecimal) converts to decimal via value.Value when value.Data[3] == 0. SqlDecimal.Value can still throw for values that aren't representable as System.Decimal (e.g., scale/precision beyond decimal’s limits) even when the magnitude fits in 96 bits. Consider storing the SqlDecimal instance in those cases (or using a try/catch fallback) so Decimal16 materialization can’t unexpectedly overflow.

Suggested change:

    - return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
    + try
    + {
    +     return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
    + }
    + catch (OverflowException)
    + {
    +     SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
    +     return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
    + }

@CurtHagenlocher (Contributor, Author) commented on Apr 26, 2026:

I think in a followup change I'm going to always store a Decimal16 as a SqlDecimal and vice versa. The current "convert to decimal if it fits" strategy is unnecessarily complicated. Filed #33 to cover this.

Referenced code:

    {
        StructType elementGroupType = (StructType)elementGroup.Data.DataType;
        int valueIdx = elementGroupType.GetFieldIndex("value");
        int typedIdx = elementGroupType.GetFieldIndex("typed_value");
@CurtHagenlocher (Contributor, Author) commented:

These should probably be cached; need to take a second look.

@CurtHagenlocher (Contributor, Author) commented:

(Shouldn't affect the public API, so can be done as a followup.)
