GH-68: Match language from parquet-format after merge of PARQUET-2139 #69

Merged
wgtmac merged 3 commits into apache:production from etseidl:parquet-2139-merge
Jul 8, 2024

Conversation

@etseidl
Contributor

@etseidl etseidl commented Jul 4, 2024

Closes #68

Collaborator

@alamb alamb left a comment

Thank you @etseidl -- I think this is a significant improvement

---
There are three types of metadata: file metadata, column (chunk) metadata and page
header metadata. All thrift structures are serialized using the TCompactProtocol.
There are two types of metadata: file metadata, and page header metadata. All
Collaborator

I recommend providing a link to precisely what these terms are referring to

I think "file metadata" refers to FileMetaData https://github.com/apache/parquet-format/blob/ed66e87da9b2d79d6e9262fe37d5eae045c6a639/src/main/thrift/parquet.thrift#L1141

I am not sure what "page header metadata" refers to. Is it DataPageHeader https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L580 ?

If so, maybe we could update this document to use the same terms: FileMetaData rather than file metadata, and DataPageHeader rather than page header
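
For readers unsure where "file metadata" physically lives, here is a minimal sketch, assuming an unencrypted Parquet file whose footer is laid out as: serialized file metadata, then a 4-byte little-endian length, then the `PAR1` magic. The function name `footer_metadata_span` is hypothetical and not part of any Parquet library. It only locates the serialized bytes; decoding them would need a Thrift compact-protocol reader.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def footer_metadata_span(data: bytes) -> Tuple[int, int]:
    """Return (start, end) byte offsets of the serialized file metadata.

    Hypothetical helper for illustration only. It locates the bytes and
    does not decode the Thrift structure they contain.
    """
    # Smallest possible file: leading magic + length field + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the metadata length.
    (length,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - length
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end
```

Page headers, by contrast, are stored inline immediately before each page's data rather than in the footer, so they are not covered by this span.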

Contributor Author

It makes more sense when viewed with the image (which has an ERD of the metadata). And this is copied verbatim from the parquet-format README.md. But I am in agreement that the parquet-site could provide more information than the format, which is kept terse for a reason. I'll wordsmith this up some.

Collaborator

@alamb alamb left a comment

😍

Thank you @etseidl

@wgtmac wgtmac merged commit a407d81 into apache:production Jul 8, 2024

Development

Successfully merging this pull request may close these issues.

Update Format section to match format changes from PARQUET-2139
