GH-455: Add Variant specification docs #456
Conversation
rdblue
left a comment
+1 for getting this PR in with the basics so that we can start working on smaller, more focused PRs to get the shredding spec into a usable form. There are definitely some changes to make, but I'd prefer not holding up the initial addition waiting for them.
minor formatting Co-authored-by: Ryan Blue <blue@apache.org>
|
+1. Thanks @gene-db for working on it. So we will include the preliminary shredding spec as well? I'm fine with that. |
Add license
|
@rdblue I updated the PR to add licenses to the docs. I think that should make the tests pass. |
julienledem
left a comment
This looks good to me. I have left some comments.
As a follow up, it would be nice to have more explanations of the rationale for the decisions in this spec. If the spec is precise, it doesn't always explain why it is that way.
| - The length of the i-th string can be computed as `offset[i+1] - offset[i]`.
| - The offset of the first string is always equal to 0 and is therefore redundant. It is included in the spec to simplify in-memory processing.
| - `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3-byte offsets, and 3 indicates 4-byte offsets.
| - If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
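The offset and `sorted_strings` rules in the quoted hunk can be sketched as below. This is a non-normative illustration of the arithmetic, not an implementation of the binary format; the helper names are hypothetical.

```python
# Sketch (not normative): working with the dictionary offsets described above.
# `offsets` is assumed to hold dictionary_size + 1 entries with offsets[0] == 0.

def offset_width(offset_size_minus_one: int) -> int:
    # 0 -> 1-byte, 1 -> 2-byte, 2 -> 3-byte, 3 -> 4-byte offsets
    return offset_size_minus_one + 1

def string_lengths(offsets: list[int]) -> list[int]:
    # Length of the i-th string is offsets[i+1] - offsets[i].
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]

def check_sorted_strings(strings: list[bytes]) -> bool:
    # With sorted_strings == 1, strings must be unique and in lexicographic
    # order (byte-wise comparison of UTF-8, i.e. by Unicode code point).
    return all(a < b for a, b in zip(strings, strings[1:]))

lengths = string_lengths([0, 3, 3, 8])  # strings of length 3, 0, and 5
```

Note that the strict `<` comparison covers both conditions at once: equal neighbors (duplicates) and out-of-order neighbors both fail it.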
does this assume any kind of encoding or is it byte-wise?
All strings are UTF-8, but I think it's a follow-up to clarify that.
Yes, they are all UTF-8. I can add a follow up to clarify that point.
So lexicographic order is defined by Unicode code points.
| # Shredding Semantics
|
| Reconstruction of Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
| For example, the order of fields in the binary may change, as may the physical representation of scalar values.
is the order of fields going to change? If we use the same order in the Parquet schema, then the order should be maintained, no?
Also it seems that we can add metadata to the parquet footer to make sure we can have identity preserving round trip. That seems like an important property to have to verify correctness.
In a Variant object, the field ids and field offsets have a strict ordering defined by the specification, but the field data (what the offsets are pointing to) do not have to be in the same order. Therefore, reconstruction may not preserve the same order of the field data as the original binary.
We can validate correctness by recursively inspecting Variant values (and checking that field ids/offsets are valid according to the spec), rather than bitwise comparing the results.
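The "compare logically, not bitwise" idea above can be sketched as follows. This assumes decoded Variant values are represented as plain Python objects (dicts for objects, lists for arrays, scalars otherwise), which is a hypothetical decoded model, not the binary format itself.

```python
# Sketch: recursive logical comparison of two decoded Variant values.
# Field order in the binary may differ between the original and the
# reconstruction, so objects are compared by field name, not position.

def variants_equal(a, b) -> bool:
    if isinstance(a, dict) and isinstance(b, dict):
        # Objects: same field names, and each field's value equal recursively.
        return a.keys() == b.keys() and all(variants_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        # Arrays: element order is significant.
        return len(a) == len(b) and all(variants_equal(x, y) for x, y in zip(a, b))
    # Scalars: require matching type so e.g. 1 and "1" don't compare equal.
    return type(a) is type(b) and a == b

# Same logical value even if the writer reordered the field data:
variants_equal({"x": 1, "y": [2, 3]}, {"y": [2, 3], "x": 1})  # True
```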
Co-authored-by: Julien Le Dem <julien@apache.org>
|
@julienledem Thanks! I clarified some of the comments, and I will address them in a followup PR. |
|
Does anyone know of parquet implementations that implement the variant type? I would like to try and organize getting this into the Rust implementation (see apache/arrow-rs#6736) but I couldn't find any example data / implementations while writing that up |
| ```
|            7     6   5   4   3             0
|           +-------+---+---+---------------+
|  header   |       |   |   |    version    |
it looks like bit 5 is unused? can we specify that it should always be zero?
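Unpacking the header byte in the diagram above can be sketched like this. The field positions (bits 3-0 = version, bit 4 = `sorted_strings`, bits 7-6 = `offset_size_minus_one`, bit 5 unused) are my reading of the diagram and should be checked against the spec text.

```python
# Sketch: extracting the metadata header fields with bit masks,
# assuming the layout shown in the diagram above.

def parse_header(header: int) -> dict:
    return {
        "version": header & 0x0F,                      # bits 3-0
        "sorted_strings": (header >> 4) & 0x1,         # bit 4
        "unused": (header >> 5) & 0x1,                 # bit 5 (readers could require 0)
        "offset_size_minus_one": (header >> 6) & 0x3,  # bits 7-6
    }

h = parse_header(0b10010001)  # version=1, sorted_strings=1, 3-byte offsets
```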
| As a result, offsets will not necessarily be listed in ascending order.
|
| An implementation may rely on this field ID order in searching for field names.
| E.g. a binary search on field IDs (combined with metadata lookups) may be used to find a field with a given field.
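The binary-search-on-field-IDs lookup described in the quoted hunk could look roughly like this. The flat `field_ids` list and `dictionary` list are a hypothetical decoded model, not the wire layout.

```python
# Sketch: finding a field by name, given that an object's field IDs are
# sorted and the metadata dictionary maps IDs to names.
import bisect

def find_field(field_ids: list[int], dictionary: list[str], name: str):
    # Resolve the name to its dictionary ID. With sorted_strings == 1 the
    # dictionary itself is sorted, so this too could be a binary search;
    # index() is used here for brevity.
    if name not in dictionary:
        return None
    target_id = dictionary.index(name)
    # Binary search the object's sorted field IDs for that ID.
    i = bisect.bisect_left(field_ids, target_id)
    if i < len(field_ids) and field_ids[i] == target_id:
        return i  # position of the field within the object
    return None

find_field([0, 2, 5], ["a", "b", "c", "d", "e", "f"], "c")  # -> 1
```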
Should this be "field with a given field name"?
|
| Field names are case-sensitive.
| Field names are required to be unique for each object.
| It is an error for an object to contain two fields with the same name, whether or not they have distinct dictionary IDs.
modulo case differences (I know it is stated above that names are case-sensitive, but I just want to make sure I am parsing this correctly)?
@alamb I think something might live in Spark; this was merged as a fork from Spark, and we are trying to address it at a compatibility layer. |
Rationale for this change
The Spark and Parquet communities have agreed to move the Spark Variant spec to Parquet.
What changes are included in this PR?
Added the Variant specification docs.
Do these changes have PoC implementations?
Closes #455