GH-455: Add Variant specification docs #456
Conversation
rdblue
left a comment
+1 for getting this PR in with the basics so that we can start working on smaller, more focused PRs to get the shredding spec into a usable form. There are definitely some changes to make, but I'd prefer not holding up the initial addition waiting for them.
minor formatting Co-authored-by: Ryan Blue <blue@apache.org>
|
+1. Thanks @gene-db for working on it. So we will include the preliminary shredding spec as well? I'm fine with that. |
Add license
|
@rdblue I updated the PR to add licenses to the docs. I think that should make the tests pass. |
julienledem
left a comment
This looks good to me. I have left some comments.
As a follow up, it would be nice to have more explanations of the rationale for the decisions in this spec. If the spec is precise, it doesn't always explain why it is that way.
| - The length of the i-th string can be computed as `offset[i+1] - offset[i]`.
| - The offset of the first string is always equal to 0 and is therefore redundant. It is included in the spec to simplify in-memory processing.
| - `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3-byte offsets, and 3 indicates 4-byte offsets.
| - If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
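The offset and `sorted_strings` rules in the quoted hunk can be sketched as below. This is a non-normative illustration of the arithmetic, not an implementation of the binary format; the helper names are hypothetical.

```python
# Sketch (not normative): working with the dictionary offsets described above.
# `offsets` is assumed to hold dictionary_size + 1 entries with offsets[0] == 0.

def offset_width(offset_size_minus_one: int) -> int:
    # 0 -> 1-byte, 1 -> 2-byte, 2 -> 3-byte, 3 -> 4-byte offsets
    return offset_size_minus_one + 1

def string_lengths(offsets: list[int]) -> list[int]:
    # Length of the i-th string is offsets[i+1] - offsets[i].
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]

def check_sorted_strings(strings: list[bytes]) -> bool:
    # With sorted_strings == 1, strings must be unique and in lexicographic
    # order (byte-wise comparison of UTF-8, i.e. by Unicode code point).
    return all(a < b for a, b in zip(strings, strings[1:]))

lengths = string_lengths([0, 3, 3, 8])  # strings of length 3, 0, and 5
```

Note that the strict `<` comparison covers both conditions at once: equal neighbors (duplicates) and out-of-order neighbors both fail it.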
does this assume any kind of encoding or is it byte-wise?
All strings are UTF-8, but I think it's a follow-up to clarify that.
Yes, they are all UTF-8. I can add a follow up to clarify that point.
So lexicographic order is defined by Unicode code points.
| # Shredding Semantics
|
| Reconstruction of Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
| For example, the order of fields in the binary may change, as may the physical representation of scalar values.
is the order of fields going to change? If we use the same order in the Parquet schema, then the order should be maintained, no?
Also it seems that we can add metadata to the parquet footer to make sure we can have identity preserving round trip. That seems like an important property to have to verify correctness.
In a Variant object, the field ids and field offsets have a strict ordering defined by the specification, but the field data (what the offsets are pointing to) do not have to be in the same order. Therefore, reconstruction may not preserve the same order of the field data as the original binary.
We can validate correctness by recursively inspecting Variant values (and checking that field ids/offsets are valid according to the spec), rather than bitwise comparing the results.
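The "compare logically, not bitwise" idea above can be sketched as follows. This assumes decoded Variant values are represented as plain Python objects (dicts for objects, lists for arrays, scalars otherwise), which is a hypothetical decoded model, not the binary format itself.

```python
# Sketch: recursive logical comparison of two decoded Variant values.
# Field order in the binary may differ between the original and the
# reconstruction, so objects are compared by field name, not position.

def variants_equal(a, b) -> bool:
    if isinstance(a, dict) and isinstance(b, dict):
        # Objects: same field names, and each field's value equal recursively.
        return a.keys() == b.keys() and all(variants_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        # Arrays: element order is significant.
        return len(a) == len(b) and all(variants_equal(x, y) for x, y in zip(a, b))
    # Scalars: require matching type so e.g. 1 and "1" don't compare equal.
    return type(a) is type(b) and a == b

# Same logical value even if the writer reordered the field data:
variants_equal({"x": 1, "y": [2, 3]}, {"y": [2, 3], "x": 1})  # True
```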
Co-authored-by: Julien Le Dem <julien@apache.org>
|
@julienledem Thanks! I clarified some of the comments, and I will address them in a followup PR. |
|
Does anyone know of parquet implementations that implement the variant type? I would like to try and organize getting this into the Rust implementation (see apache/arrow-rs#6736) but I couldn't find any example data / implementations while writing that up |
| ```
|            7     6   5   4   3             0
|           +-------+---+---+---------------+
|  header   |       |   |   |    version    |
it looks like bit 5 is unused? can we specify that it should always be zero?
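Unpacking the header byte in the diagram above can be sketched like this. The field positions (bits 3-0 = version, bit 4 = `sorted_strings`, bits 7-6 = `offset_size_minus_one`, bit 5 unused) are my reading of the diagram and should be checked against the spec text.

```python
# Sketch: extracting the metadata header fields with bit masks,
# assuming the layout shown in the diagram above.

def parse_header(header: int) -> dict:
    return {
        "version": header & 0x0F,                      # bits 3-0
        "sorted_strings": (header >> 4) & 0x1,         # bit 4
        "unused": (header >> 5) & 0x1,                 # bit 5 (readers could require 0)
        "offset_size_minus_one": (header >> 6) & 0x3,  # bits 7-6
    }

h = parse_header(0b10010001)  # version=1, sorted_strings=1, 3-byte offsets
```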
| As a result, offsets will not necessarily be listed in ascending order.
|
| An implementation may rely on this field ID order in searching for field names.
| E.g. a binary search on field IDs (combined with metadata lookups) may be used to find a field with a given field.
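The binary-search-on-field-IDs lookup described in the quoted hunk could look roughly like this. The flat `field_ids` list and `dictionary` list are a hypothetical decoded model, not the wire layout.

```python
# Sketch: finding a field by name, given that an object's field IDs are
# sorted and the metadata dictionary maps IDs to names.
import bisect

def find_field(field_ids: list[int], dictionary: list[str], name: str):
    # Resolve the name to its dictionary ID. With sorted_strings == 1 the
    # dictionary itself is sorted, so this too could be a binary search;
    # index() is used here for brevity.
    if name not in dictionary:
        return None
    target_id = dictionary.index(name)
    # Binary search the object's sorted field IDs for that ID.
    i = bisect.bisect_left(field_ids, target_id)
    if i < len(field_ids) and field_ids[i] == target_id:
        return i  # position of the field within the object
    return None

find_field([0, 2, 5], ["a", "b", "c", "d", "e", "f"], "c")  # -> 1
```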
Should this be "field with a given field name"?
|
| Field names are case-sensitive.
| Field names are required to be unique for each object.
| It is an error for an object to contain two fields with the same name, whether or not they have distinct dictionary IDs.
modulo case differences (I know it is stated above that names are case-sensitive, but I just want to make sure I am parsing this correctly)?
@alamb I think something might live in Spark; this was merged as a fork from Spark, and we are trying to address it at a compatibility layer. |
Rationale for this change
The Spark and Parquet communities have agreed to move the Spark Variant spec to Parquet.
What changes are included in this PR?
Added the Variant specification docs.
Do these changes have PoC implementations?
Closes #455