
Conversation

@zeroshade
Member

Rationale for this change

This tests out the variant implementation here against parquet-java using the test cases generated in https://github.com/apache/parquet-testing/pull/90/files. Overall, it confirms that our implementation is generally compatible for reading parquet files written by parquet-java, with some caveats.

What changes are included in this PR?

A new testing suite in parquet/pqarrow/variant_test.go which uses the test cases defined in parquet-testing, attempts to read the parquet files, and compares the resulting variants against the expected ones.

Some issues were found that I believe are issues with parquet-java and the test cases rather than issues with the Go implementation; as such, discussion is needed on the following:

  • The parquet test files are missing the Variant logical type annotation. Currently I've worked around that for testing purposes, but not in a way that can be merged or that is sustainable. As such, the files need to be re-generated with the Variant logical type annotation before these tests can be enabled.
  • Several test cases test variations on situations where the value column is missing. Based on my reading of the spec this seems to be an invalid scenario. The specific case is that the spec states the typed_value field may be omitted when not shredding elements as a specific type, but it says nothing about allowing omission of the value field. Currently, the Go implementation will error if this field is missing, as per my reading of the spec, meaning those test cases fail. (A sketch of the schema shape in question follows this list.)
  • Test case 43, testPartiallyShreddedObjectMissingFieldConflict, seems to have a conflict between what is expected and what is in the spec. The b field exists within the value field while also being a shredded field; the test appears to assume the data in the value field would be ignored, but the spec says that value must never contain fields represented by the shredded fields. This needs clarification on the desired behavior and result.
  • Test case 84, testShreddedObjectWithOptionalFieldStructs, tests the scenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on whether this is a valid test case.
  • Test case 38, testShreddedObjectMissingTypedValue, tests the case where the typed_value field is missing. This is allowed by the spec, except that the spec states that in this scenario the value field must be required. The test case uses optional in this scenario, causing the Go implementation to fail. Clarification is needed here.
  • Test case 125, testPartiallyShreddedObjectFieldConflict, again tests the case of a field existing in both the value and the shredded column, which the spec states is invalid and will lead to inconsistent results. Thus it is not valid to have this test case assert a specific result according to the spec unless the spec is amended to state that the shredded field takes precedence in this case.
  • One thing that makes the tests a bit difficult is that when we un-shred back into variants, the current variant code in some libraries will automatically downcast to the smallest viable precision (downcasting an int32 into an int16, for example, if it fits). This is worked around by testing the values rather than the types, but it is worth mentioning, particularly in the case of decimal values.
  • A couple of error test cases verify that particular types are not supported, such as UINT32 or Fixed Len Byte Array(4). However, nothing in the spec says that an implementation couldn't just upcast a uint32 -> int64 or treat a fixed len byte array shredded column as a binary column. So is it meaningful to explicitly error on those cases rather than allow them, since they are trivially convertible to valid variant types?
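
For reference, here is a minimal sketch of the shredded-object schema shape these cases exercise (the field and type names are illustrative only, not taken from the test files):

    optional group var (VARIANT) {
      required binary metadata;
      optional binary value;           // leftover fields of a partially shredded object
      optional group typed_value {     // one child group per shredded field
        required group b {             // spec: shredded field groups must be required
          optional binary value;
          optional int32 typed_value;  // shredded type chosen for field "b"
        }
      }
    }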

@zeroshade
Member Author

CC @aihuaxu @emkornfield @rdblue

@zeroshade requested review from kou, lidavidm and wgtmac on July 28, 2025 20:48
@aihuaxu

aihuaxu commented Jul 30, 2025

@zeroshade I regenerated the files with Variant logical type (apache/parquet-testing#91). Can you retest it?

@zeroshade
Member Author

I'm away for the rest of this week, but I can retest on Tuesday. The regeneration of the tests wouldn't fix the rest of the issues I listed, right?

@aihuaxu

aihuaxu commented Jul 30, 2025

Several test cases test variations on situations where the value column is missing. Based on my reading of the spec this seems to be an invalid scenario. The specific case is that the spec states the typed_value field may be omitted when not shredding elements as a specific type, but says nothing about allowing omission of the value field. Currently, the Go implementation will error if this field is missing as per my reading of the spec, meaning those test cases fail.

Both value and typed_value are optional per the spec, and value can be missing as I understand it.

Test case 43, testPartiallyShreddedObjectMissingFieldConflict, seems to have a conflict between what is expected and what is in the spec. The b field exists within the value field while also being a shredded field; the test appears to assume the data in the value field would be ignored, but the spec says that value must never contain fields represented by the shredded fields. This needs clarification on the desired behavior and result.

The value column of a partially shredded object must never contain fields represented by the Parquet columns in typed_value (shredded fields). Readers may always assume that data is written correctly and that shredded fields in typed_value are not present in value. This test case is to prove that the reader will only read from typed_value and ignore the one from value. That means the reader is not responsible for validating the duplicate key and will read from typed_value.

Test case 84, testShreddedObjectWithOptionalFieldStructs, tests the scenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on whether this is a valid test case.

Erroring out here seems reasonable to me, to enforce that it's required in the schema:

    required group email {
      optional binary value;
      optional binary typed_value (STRING);
    }

Test case 38, testShreddedObjectMissingTypedValue, tests the case where the typed_value field is missing. This is allowed by the spec, except that the spec states that in this scenario the value field must be required. The test case uses optional in this scenario, causing the Go implementation to fail. Clarification is needed here.

We will generate the schema first, which will have both value and typed_value optional. But if a value is to be shredded, the value column may be required. Do we fail in Go when the value schema is optional?

Test case 125, testPartiallyShreddedObjectFieldConflict again tests the case of a field existing in both the value and the shredded column which the spec states is invalid and will lead to inconsistent results. Thus it is not valid to have this test case assert a specific result according to the spec unless the spec is amended to state that the shredded field takes precedence in this case.

This is the same as test case 43. My understanding is that if the writer writes wrong data, the reader may read only the typed_value.

@zeroshade
Member Author

Both value and typed_value are optional per the spec, and value can be missing as I understand it.

While the spec states that typed_value may be omitted, it does not say the same about value. If the intent is that either can be omitted, the spec should be updated with that wording.

The value column of a partially shredded object must never contain fields represented by the Parquet columns in typed_value (shredded fields). Readers may always assume that data is written correctly and that shredded fields in typed_value are not present in value. This test case is to prove that the reader will only read from typed_value and ignore the one from value. That means the reader is not responsible for validating the duplicate key and will read from typed_value.

The section you quoted states that the partially shredded object must never contain the fields and that a reader may assume that shredded fields aren't present in the value field. It also states that the reason they must never be written that way is that it can result in inconsistent reader behavior. If the intent is for a reader to always read from only the typed_value field in the case of a conflict like this, then the language in the spec should be updated accordingly instead of the current "may" language.

We will generate the schema first, which will have both value and typed_value optional. But if a value is to be shredded, the value column may be required. Do we fail in Go when the value schema is optional?

Correct, the spec states that if the typed_value field is omitted, then the value field must be required, so Go errors if it is optional when the typed_value field is omitted, causing this test case to fail.
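
For concreteness, the shape the Go reader expects wherever typed_value is omitted looks roughly like this (a sketch only; the field name a is illustrative):

    required group a {
      required binary value;  // "When typed_value is omitted, value must be required";
                              // test case 38 declares this field optional instead
    }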

This is the same as test case 43. My understanding is that if the writer writes wrong data, the reader may read only the typed_value.

The spec says that's a valid thing to do, but it also says that this must never happen and doesn't definitively state what the behavior in this case should be, only that it may be inconsistent. As I said above, if the intent is that the data in the typed_value field is given precedence, the spec should be updated to say that.

@aihuaxu

aihuaxu commented Jul 31, 2025

Test case 84, testShreddedObjectWithOptionalFieldStructs, tests the scenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on whether this is a valid test case.

@rdblue Erroring out here seems reasonable to me, to enforce that it's required in the schema. Do you remember why we had such a positive case?

@cashmand

Correct, the spec states that if the typed_value field is omitted, then the value field must be required, so Go errors if it is optional when the typed_value field is omitted causing this test case to fail.

My preference would be to relax the spec for this issue. It doesn't seem like there's much benefit to enforcing it on the read side, and it's easy to imagine a writer failing to enforce it in some cases where it usually adds a typed_value to the schema, but not always.

@wgtmac
Member

wgtmac commented Jul 31, 2025

My preference would be to relax the spec for this issue. It doesn't seem like there's much benefit to enforcing it on the read side, and it's easy to imagine a writer failing to enforce it in some cases where it usually adds a typed_value to the schema, but not always.

+1

IIRC, these wordings were intentional, so as not to complicate the reader-side implementation w.r.t. reading values and consuming column stats directly from typed_value.

@aihuaxu

aihuaxu commented Jul 31, 2025

My preference would be to relax the spec for this issue. It doesn't seem like there's much benefit to enforcing it on the read side, and it's easy to imagine a writer failing to enforce it in some cases where it usually adds a typed_value to the schema, but not always.

+1

IIRC, these wordings were intentional, so as not to complicate the reader-side implementation w.r.t. reading values and consuming column stats directly from typed_value.


I think we can update the wording from "When typed_value is omitted, value must be required." to "When typed_value is omitted, value must be present." Both the typed_value and value fields should always be optional, but we are saying it's required and that is causing the confusion.
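
To illustrate the distinction being proposed (a sketch of the two readings, not spec text):

    // "value must be required": a constraint on the field's repetition
    // (it must be declared required in the Parquet schema)
    required binary value;

    // "value must be present": the column merely exists in the schema;
    // it could still be declared optional
    optional binary value;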

@emkornfield
Contributor

I think we can update the wording from "When typed_value is omitted, value must be required." to "When typed_value is omitted, value must be present." Both the typed_value and value fields should always be optional, but we are saying it's required and that is causing the confusion.

I think the spec tried to be careful with wording, but there is a lot of semantic overlap between required/missing/optional. Having a glossary for these terms and doing an audit of the spec to make sure they are used consistently would help.

@aihuaxu

aihuaxu commented Aug 1, 2025

Test case 84, testShreddedObjectWithOptionalFieldStructs, tests the scenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on whether this is a valid test case.

@rdblue Erroring out here seems reasonable to me, to enforce that it's required in the schema. Do you remember why we had such a positive case?

Another approach is to update the wording in the spec so the reader doesn't need to do such a check.

@aihuaxu

aihuaxu commented Aug 3, 2025

I made a small change to the spec to clarify (apache/parquet-format#512). I feel the rest should be aligned with what the spec describes.

@rdblue

rdblue commented Aug 4, 2025

@zeroshade, sorry for the confusion here. You're right about a lot of those test cases. They are not allowed by the spec. The implementation I generated these cases from is defensive and tries to read if it can rather than producing errors. I'd recommend doing the same thing to handle outside-of-spec cases.

For instance, most of the time if a column is missing, most implementations will allow you to project a column of nulls. Extending this idea to Variant, it's reasonable to assume that a missing value column indicates a column of nulls and read accordingly instead of failing. The other cases are similar.

The most confusing one is where there is a field in the value of a struct that is also a shredded field. The rationale here was that the shredded value should always take precedence because the shredded value may be read without the rest of the struct (if you're projecting extract_value(var, "$['b']", ...)) and that the behavior should not change based on the Parquet column projection.
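
For illustration, with a hypothetical shredded column var, projecting only extract_value(var, "$['b']") would read roughly these leaves (a sketch, not taken from the spec or the test files):

    optional group var (VARIANT) {
      required binary metadata;                  // read: needed to decode any value bytes
      optional binary value;                     // NOT read when only $['b'] is projected
      optional group typed_value {
        required group b {
          optional binary value;                 // read
          optional binary typed_value (STRING);  // read
        }
        // other shredded fields are not read
      }
    }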

The behavior in these cases was debated when we were working on the spec. We ultimately decided to disallow writers from producing them, but I think it is a best practice to ensure that the read behavior is predictable, accepts even slightly malformed cases, and has consistent behavior depending on the projected columns.

@zeroshade
Member Author

The behavior in these cases was debated when we were working on the spec. We ultimately decided to disallow writers from producing them, but I think it is a best practice to ensure that the read behavior is predictable, accepts even slightly malformed cases, and has consistent behavior depending on the projected columns.

@rdblue I think it would be worthwhile to add wording to the spec which defines the consistent behavior desired for readers in these malformed cases rather than it becoming just a "de facto" standard based on convention. The wording can specify that while it's invalid for a writer to produce these that in those cases the spec defines what the reader behavior should be.

@alamb
Contributor

alamb commented Aug 6, 2025

The behavior in these cases was debated when we were working on the spec. We ultimately decided to disallow writers from producing them, but I think it is a best practice to ensure that the read behavior is predictable, accepts even slightly malformed cases, and has consistent behavior depending on the projected columns.

@rdblue I think it would be worthwhile to add wording to the spec which defines the consistent behavior desired for readers in these malformed cases rather than it becoming just a "de facto" standard based on convention. The wording can specify that while it's invalid for a writer to produce these that in those cases the spec defines what the reader behavior should be.

Here is a proposal for how to make it clearer in the tests that this is an invalid file:

@zeroshade
Member Author

zeroshade commented Aug 6, 2025

@rdblue @aihuaxu @emkornfield @wgtmac @cashmand

I've been thinking about this a bit more over the past few days and had the following thoughts. Please let me know what you all think.

The implementation I generated these cases from is defensive and tries to read if it can rather than producing errors. I'd recommend doing the same thing to handle outside-of-spec cases.

The problem I see here is multi-faceted. While I understand the idea and reasons for doing this, allowing outside-of-spec things on read but not on write causes a few problems:

  • It explicitly creates scenarios that cannot be round-tripped. You can read things which don't conform to the spec, sure, but then you have to write a spec-conforming result.
  • There's no incentive for bad writers to fix their issues. If the convention becomes that these invalid scenarios are allowed to be read, despite the spec not actually saying it should be allowed, then any writers which currently write invalid, non-spec-conforming files have no incentive to fix their issues.
  • It significantly increases the complexity of implementing the spec. It puts the onus on implementors to deduce that they have to allow these cases despite the lack of language in the spec, not to mention the actual difficulty for any Arrow integration with Parquet, which would have to either allow these invalid constructions or perform conversions and casts to handle the required/optional differences and whether or not specific fields exist.

For instance, most of the time if a column is missing, most implementations will allow you to project a column of nulls. Extending this idea to Variant, it's reasonable to assume that a missing value column indicates a column of nulls and read accordingly instead of failing. The other cases are similar.

This is true for compute engines, but not necessarily true for regular Parquet implementations. Generally, the idea of "projection" is an engine-level concept. Most Parquet implementations will simply error if you try to read a column that doesn't exist, letting the level above the file interactions decide upon projection.

The behavior in these cases was debated when we were working on the spec. We ultimately decided to disallow writers from producing them, but I think it is a best practice to ensure that the read behavior is predictable, accepts even slightly malformed cases, and has consistent behavior depending on the projected columns.

I agree with this sentiment. We definitely want read behavior to be predictable and have consistent behavior. If the desired consistent behavior isn't actually written into the spec, how would one know what that behavior should be?

My preference would be to relax the spec for this issue. It doesn't seem like there's much benefit to enforcing it on the read side, and it's easy to imagine a writer failing to enforce it in some cases where it usually adds a typed_value to the schema, but not always.

Honestly, this touches on my ultimate point. My conclusion is that one of two things should happen:

  1. The spec needs to be updated to explicitly state what the behavior should be in all of these test cases I have identified, and have the wording updated accordingly (fields being present vs. required, expressly stating whether certain fields must be required, must be optional, or are allowed to be either, and so on).
  2. The test cases that fall outside the scope of the spec should not exist in parquet-testing as integration tests. The integration tests should only test what the spec actually states, not the conventions of a particular implementation. If some behavior is expected to be consistent, then it should be written in the spec explicitly so that there's no confusion or ambiguity about what the behavior is expected to be.

@alamb's suggestion is also a possible idea: to simply identify and mark which test cases are testing things which are outside the spec and invalid so that implementations can either choose to support those cases despite the spec, or can skip those cases.

@lidavidm
Member

lidavidm commented Aug 6, 2025

Just chiming in as an arrow-go (but not parquet) contributor: I find the idea of test cases that are invalid and yet presented as real test cases rather unsettling; it feels like implementation- or vendor-specific behavior snuck its way into what is supposed to be the official spec.

We definitely want read behavior to be predictable and have consistent behavior. If the desired consistent behavior isn't actually written into the spec, how would one know what that behavior should be?

I think this is the crux of the issue: if Parquet does want to allow slightly out-of-spec files to be read, that needs to be clearly defined in the spec.

@zeroshade
Member Author

@aihuaxu any luck / possibility on getting updates to the parquet variant spec per my and @lidavidm's comments?

@alamb
Contributor

alamb commented Aug 13, 2025

@aihuaxu any luck / possibility on getting updates to the parquet variant spec per my and @lidavidm's comments?

I think at this point, where we are trying to finalize the Variant spec, the most expedient path would be to update the test cases rather than re-open a discussion that seems to have already been had at length during the initial spec design.

Just my $0.02

@aihuaxu

aihuaxu commented Aug 13, 2025

@aihuaxu any luck / possibility on getting updates to the parquet variant spec per my and @lidavidm's comments?

I think at this point, where we are trying to finalize the Variant spec, the most expedient path would be to update the test cases rather than re-open a discussion that seems to have already been had at length during the initial spec design.

Just my $0.02

That's also my understanding; those have been discussed during the design phase. We can update the test cases to indicate that engines can have different behavior.

BTW: @zeroshade There is an ask to create the test files from the Go implementation and then test them out from the Parquet-Java side. Can you help generate the same set of test files?

@zeroshade
Member Author

@alamb I'm fine with that too; I just wanted to get some confirmation one way or the other.

If @rdblue or @aihuaxu are willing to update the test cases (either by removing them or by using your suggestion to flag tests that are invalid constructions), then that's fine with me. Mostly I just want to get this figured out, and we shouldn't have integration tests which are testing behavior that doesn't exist in the spec.

@zeroshade
Member Author

There is an ask to create the test files from the Go implementation and then test them out from the Parquet-Java side. Can you help generate the same set of test files?

I can do so, except for the constructions which aren't valid according to the spec, as the Go implementation doesn't allow for them.

At a minimum, this includes the confusion over required vs. optional vs. present, etc., possibly taking @emkornfield's suggestion here to at least add a glossary of terms or otherwise simply clarify the language based on the intent.

Is there a script that was used to generate the test cases? Or do I need to parse the values in the JSON to figure out what should go in the files?

In addition, where should I put the files?

@aihuaxu

aihuaxu commented Aug 13, 2025

There is an ask to create the test files from the Go implementation and then test them out from the Parquet-Java side. Can you help generate the same set of test files?

I can do so, except for the constructions which aren't valid according to the spec as the Go implementation doesn't allow for it.

This includes the confusion over required vs optional vs present etc. at a minimum, possibly taking @emkornfield's suggestion here to at least add a glossary of terms or otherwise simply clarify the language based on the intent.

Is there a script that was used to generate the test cases? Or do I need to parse the values in the JSON to figure out what should go in the files?

In addition, where should I put the files?

We generated those test files from Iceberg unit tests (apache/iceberg#13654), not from a script. What I can think of is to expand this PR to read those files in tests and then write them back out through the Go variant writer logic. We can put them in the https://github.com/apache/parquet-testing repository. If we can have the same layout as apache/parquet-testing#90, that would be great and would make testing from the Parquet-Java side easier.

BTW: can we consider merging apache/parquet-testing#90? I will then address updating the test cases in apache/parquet-testing#91 to add the Variant logical type and update the test descriptions.

@aihuaxu

aihuaxu commented Aug 14, 2025

I'm updating the test descriptions for the ones mentioned in apache/parquet-testing#91. @alamb and @zeroshade Can you take a look?

@zeroshade
Member Author

@aihuaxu It appears all of the parquet files with Decimal values that you regenerated are inconsistent with the test cases. The values in the parquet files don't match the expected values in the variant (for example, containing 123456789 instead of 1234567890, along with some other rounding differences); the original files had the correct values but were just missing the Variant logical type. Can you fix them please?

@zeroshade
Member Author

Also, some cases haven't been given the new notes/descriptions, such as the cases which test for a missing value column (the spec wording still does not allow for the value column to be omitted).

@aihuaxu

aihuaxu commented Aug 15, 2025

@aihuaxu It appears all of the parquet files with Decimal values that you regenerated are inconsistent with the test cases. The values in the parquet files don't match the expected values in the variant (for example, containing 123456789 instead of 1234567890 and some other rounding differences. The original files had the correct values, but were just missing the Variant logical type.) Can you fix them please?

For decimal values, the changes are expected to align with the spec: the original 1234567890 should be treated as decimal8 rather than decimal4, since only values with precision 1-9 are stored as decimal4 (1234567890 has 10 digits).

Also, some cases haven't been given the new notes/descriptions, such as the cases which test for a missing value column (the spec wording still does not allow for the value column to be omitted).

I didn't update these test cases since the value column can be omitted to indicate the value is null (I would like to clarify this in the spec: apache/parquet-format#512).

Please take another look at apache/parquet-testing#91. I consolidated the files into one place; sorry that I didn't notice that.

@aihuaxu

aihuaxu commented Aug 15, 2025

@zeroshade I'm clarifying the missing value field in apache/parquet-format#512. Can you also take a look?

@zeroshade
Member Author

@aihuaxu I've updated the files and this PR with the relevant changes. The files generated by Go are in apache/parquet-testing#94

Please take a look at both the files and the code and let me know what you think. Thanks!

I'll wait until the relevant PRs for parquet-testing are merged before this can get merged.

@zeroshade
Member Author

@alamb @lidavidm @wgtmac @emkornfield Can I get a review from any of you? Once the parquet-testing PR is merged, this can get merged and I'll update the docs for an arrow variant extension type.

@zeroshade changed the title "ci(parquet/pqarrow): integration tests for reading shredded variants" -> "DO NOT MERGE YET ci(parquet/pqarrow): integration tests for reading shredded variants" Aug 22, 2025
@lidavidm
Member

oops sorry, I missed this - I'll try to take a look

@zeroshade merged commit a970499 into apache:main on Aug 27, 2025
29 of 31 checks passed
@zeroshade deleted the shredded-variant-tests branch on August 27, 2025 15:04