-
Notifications
You must be signed in to change notification settings - Fork 82
feat(parquet/examples): enhance library examples #429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Instead of standalone examples like this, can we use Go testable examples that will show up in the docs and be tested to ensure they don't get out of date? |
I did it like here, what the difference? |
|
My preference would be to follow this pattern Here's a good explanation of it: https://go.dev/blog/examples |
|
okay, it should be in the path: arrow-go/parquet/example_writing_parquet_test.go? |
|
That path works great |
zeroshade
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! It looks great, just a few nitpicks
| // Create arrow writer props to store the schema in the parquet file | ||
| arrowWriterProps := pqarrow.NewArrowWriterProperties( | ||
| pqarrow.WithStoreSchema(), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we elaborate in the comment about the benefit to storing the Arrow schema in the metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, what about this comment?
// WithStoreSchema embeds the original Arrow schema into the Parquet file metadata,
// allowing it to be accurately restored when reading. This ensures correct handling
// of advanced types like dictionaries and preserves full type fidelity across write/readThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that looks good! Perhaps also mention that some of the other Parquet libraries also support this so it helps ensure cross-language type consistency too during reading.
| // Create a builder for each field | ||
| intFieldIdx := schema.FieldIndices("intField")[0] | ||
| stringFieldIdx := schema.FieldIndices("stringField")[0] | ||
| listFieldIdx := schema.FieldIndices("listField")[0] | ||
|
|
||
| intFieldBuilder := recordBuilder.Field(intFieldIdx).(*array.Int32Builder) | ||
| stringFieldBuilder := recordBuilder.Field(stringFieldIdx).(*array.StringBuilder) | ||
| listFieldBuilder := recordBuilder.Field(listFieldIdx).(*array.ListBuilder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do the FieldIndices calls when we know what the schema is and the indices are since we created the schema? There might be benefits to keeping this as simple as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, sure, I'll specify the indices explicitly for simplicity:
intFieldBuilder := recordBuilder.Field(0).(*array.Int32Builder)
stringFieldBuilder := recordBuilder.Field(1).(*array.StringBuilder)
listFieldBuilder := recordBuilder.Field(2).(*array.ListBuilder)|
|
||
| // IMPORTANT: Close the writer to finalize the file | ||
| if err := writer.Close(); err != nil { | ||
| log.Printf("Failed to close parquet writer: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Fatalf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will fix it
| colIndices[idx] = schema.FieldIndices(colNames[idx])[0] | ||
| } | ||
|
|
||
| // Get the current record from the reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Get the current record from the reader | |
| // Get a record reader from the file to iterate over |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
| for recordReader.Next() { | ||
| // Create a record | ||
| record := recordReader.Record() | ||
| record.Retain() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need to call retain if we need to utilize the record beyond and outside of this loop. Since we only interact with the record in the loop and aren't storing it for use beyond the loop we don't need to call Retain and Release. Might be better to simplify this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for clarification
| // Get columns | ||
| intCol := record.Column(colIndices[0]).(*array.Int32) | ||
| stringCol := record.Column(colIndices[1]).(*array.String) | ||
| listCol := record.Column(colIndices[2]).(*array.List) | ||
| listValueCol := listCol.ListValues().(*array.Float32) | ||
|
|
||
| // Iterate over the rows within the current record | ||
| for idx := range int(record.NumRows()) { | ||
| // For the list column, get the start and end offsets for the current row | ||
| start, end := listCol.ValueOffsets(idx) | ||
|
|
||
| fmt.Printf("%d %s %v\n", intCol.Value(idx), stringCol.Value(idx), listValueCol.Float32Values()[start:end]) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to grab each column separately like this instead of just calling fmt.Println(record) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? If I write it like this:
for recordReader.Next() {
// Create a record
record := recordReader.Record()
fmt.Println(record)
}The output will be like this:
record:
schema:
fields: 3
- intField: type=int32
metadata: ["PARQUET:field_id": "-1"]
- stringField: type=utf8
metadata: ["PARQUET:field_id": "-1"]
- listField: type=list<element: float32, nullable>
metadata: ["PARQUET:field_id": "-1"]
rows: 2
col[0][intField]: [38 13]
col[1][stringField]: ["val1" "val2"]
col[2][listField]: [[1 2 4 8] [1 2 4 8]]
record:
schema:
fields: 3
- intField: type=int32
metadata: ["PARQUET:field_id": "-1"]
- stringField: type=utf8
metadata: ["PARQUET:field_id": "-1"]
- listField: type=list<element: float32, nullable>
metadata: ["PARQUET:field_id": "-1"]
rows: 2
col[0][intField]: [53 93]
col[1][stringField]: ["val3" "val4"]
col[2][listField]: [[1 2 4 8] [1 2 4 8]]
record:
schema:
fields: 3
- intField: type=int32
metadata: ["PARQUET:field_id": "-1"]
- stringField: type=utf8
metadata: ["PARQUET:field_id": "-1"]
- listField: type=list<element: float32, nullable>
metadata: ["PARQUET:field_id": "-1"]
rows: 1
col[0][intField]: [66]
col[1][stringField]: ["val5"]
col[2][listField]: [[1 2 4 8]]
Instead of:
38 val1 [1 2 4 8]
13 val2 [1 2 4 8]
53 val3 [1 2 4 8]
93 val4 [1 2 4 8]
66 val5 [1 2 4 8]
Or did I misunderstand you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope you didn't misunderstand, perhaps change the batch size to be 3 instead of 2 so that it only outputs two records?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s best to leave it as is, since it demonstrates how to access cell values for each row. Working with records isn’t always obvious for beginners.
Like here:
arrow-go/parquet/file/file_reader_test.go
Lines 687 to 696 in 084371e
| for rr.Next() { | |
| rec := rr.Record() | |
| for i := 0; i < int(rec.NumRows()); i++ { | |
| col := rec.Column(0).(*array.Int64) | |
| val := col.Value(i) | |
| require.Equal(t, val, int64(totalRows+i)) | |
| } | |
| totalRows += int(rec.NumRows()) | |
| } |
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough, we can leave it as is.
zeroshade
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this. It will significantly help people with interacting with this library!
Rationale for this change
Enhancing the library examples
What changes are included in this PR?
2 new examples on how to use the
arrow-golibrary:Are these changes tested?
Not applicable
Are there any user-facing changes?
No, just enhancing examples