feat(parquet/examples): enhance library examples #429

Merged

zeroshade merged 5 commits into apache:main from milden6:main

Jul 2, 2025

Contributor

milden6 commented Jun 28, 2025

Rationale for this change

Enhancing the library examples

What changes are included in this PR?

2 new examples on how to use the arrow-go library:

Writing parquet file
Reading parquet file

Are these changes tested?

Not applicable

Are there any user-facing changes?

No, just enhancing examples


          Add writing and reading parquet file examples

84ae3ab

milden6 requested a review from zeroshade as a code owner

June 28, 2025 11:22

milden6 mentioned this pull request

Learning Curve of the API #269

Open

Member

zeroshade commented Jun 28, 2025

Instead of standalone examples like this, can we use Go testable examples that will show up in the docs and be tested to ensure they don't get out of date?

Contributor Author

milden6 commented Jun 28, 2025

Instead of standalone examples like this, can we use Go testable examples that will show up in the docs and be tested to ensure they don't get out of date?

I followed the same pattern as the existing example here. What is the difference?
https://github.com/apache/arrow-go/tree/main/arrow/examples/table_creation

Member

zeroshade commented Jun 28, 2025

My preference would be to follow this pattern

Here's a good explanation of it: https://go.dev/blog/examples
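For readers unfamiliar with the pattern, here is a minimal sketch of a Go testable example. `FormatRow` is a hypothetical helper made up for illustration and is not part of arrow-go. In a real package the `Example` function would live in a `_test.go` file, where `go test` compares what it prints against the `// Output:` comment and the generated docs show it next to the function it is named after. The `main` function is only there so the sketch also runs on its own.

```go
package main

import "fmt"

// FormatRow is a hypothetical helper (not part of arrow-go) that renders
// one row of values as a single line.
func FormatRow(id int32, name string, values []float32) string {
	return fmt.Sprintf("%d  %s  %v", id, name, values)
}

// ExampleFormatRow follows the testable-example naming convention:
// "Example" plus the name of the function being documented. When placed
// in a _test.go file, the test passes only if stdout matches the
// Output comment below.
func ExampleFormatRow() {
	fmt.Println(FormatRow(38, "val1", []float32{1, 2, 4, 8}))
	// Output:
	// 38  val1  [1 2 4 8]
}

// main is only here so the sketch can be run directly.
func main() {
	ExampleFormatRow()
}
```

Because the expected output is checked on every test run, an example written this way fails loudly if the API it demonstrates changes.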

Contributor Author

milden6 commented Jun 30, 2025

Okay, so should it go at the path arrow-go/parquet/example_writing_parquet_test.go?

Member

zeroshade commented Jun 30, 2025

That path works great

Daniil Mileev added 3 commits

July 1, 2025 11:26


          Merge remote-tracking branch 'upstream/main'

b2f9f2c

Merge changes from upstream


          Rewrite examples after review. Add Godoc examples

b25a59d


          Remove old examples

fc11c39

zeroshade requested changes

Member

zeroshade left a comment

Thanks for this! It looks great, just a few nitpicks

parquet/example_write_read_pq_test.go Outdated

Comment on lines 62 to 65

+              	// Create arrow writer props to store the schema in the parquet file
+              	arrowWriterProps := pqarrow.NewArrowWriterProperties(
+              		pqarrow.WithStoreSchema(),
+              	)

Member

zeroshade Jul 1, 2025

Should we elaborate in the comment about the benefit to storing the Arrow schema in the metadata?

Contributor Author

milden6 Jul 1, 2025

Yes, you are right, what about this comment?

// WithStoreSchema embeds the original Arrow schema into the Parquet file metadata,
// allowing it to be accurately restored when reading. This ensures correct handling
// of advanced types like dictionaries and preserves full type fidelity across write/read

Member

zeroshade Jul 1, 2025

That looks good! Perhaps also mention that some of the other Parquet libraries support this too, so it helps ensure cross-language type consistency when reading.

parquet/example_write_read_pq_test.go Outdated

Comment on lines 76 to 83

+              	// Create a builder for each field
+              	intFieldIdx := schema.FieldIndices("intField")[0]
+              	stringFieldIdx := schema.FieldIndices("stringField")[0]
+              	listFieldIdx := schema.FieldIndices("listField")[0]
+              	intFieldBuilder := recordBuilder.Field(intFieldIdx).(*array.Int32Builder)
+              	stringFieldBuilder := recordBuilder.Field(stringFieldIdx).(*array.StringBuilder)
+              	listFieldBuilder := recordBuilder.Field(listFieldIdx).(*array.ListBuilder)

Member

zeroshade Jul 1, 2025

Do we need the FieldIndices calls when we already know the schema and the indices, since we created the schema ourselves? There might be benefits to keeping this as simple as possible.

Contributor Author

milden6 Jul 1, 2025

Yep, sure, I'll specify the indices explicitly for simplicity:

	intFieldBuilder := recordBuilder.Field(0).(*array.Int32Builder)
	stringFieldBuilder := recordBuilder.Field(1).(*array.StringBuilder)
	listFieldBuilder := recordBuilder.Field(2).(*array.ListBuilder)

parquet/example_write_read_pq_test.go Outdated

+              	// IMPORTANT: Close the writer to finalize the file
+              	if err := writer.Close(); err != nil {
+              		log.Printf("Failed to close parquet writer: %v", err)

Member

zeroshade Jul 1, 2025

log.Fatalf?

Contributor Author

milden6 Jul 1, 2025

Thanks, I will fix it

parquet/example_write_read_pq_test.go Outdated

+              		colIndices[idx] = schema.FieldIndices(colNames[idx])[0]
+              	}
+              	// Get the current record from the reader

Member

zeroshade Jul 1, 2025

Suggested change

      
-              	// Get the current record from the reader
+              	// Get a record reader from the file to iterate over

Contributor Author

milden6 Jul 1, 2025

Thanks

parquet/example_write_read_pq_test.go Outdated

+              	for recordReader.Next() {
+              		// Create a record
+              		record := recordReader.Record()
+              		record.Retain()

Member

zeroshade Jul 1, 2025

We only need to call Retain if we need to use the record outside of this loop. Since we only interact with the record inside the loop and aren't storing it for later use, we don't need to call Retain and Release. It might be better to simplify this.

Contributor Author

milden6 Jul 1, 2025

Okay, thanks for the clarification
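The rule above can be illustrated with a toy reference-counting model. This is a sketch, not arrow-go code: `record` and `reader` are hypothetical types, and the model assumes the reader releases its current record when `Next` is called again, so a record is only valid until that call unless the caller retained it.

```go
package main

import "fmt"

// record is a toy stand-in for a reference-counted record.
type record struct {
	refs int
	rows []int
}

// Retain adds a reference so the record outlives the reader's own one.
func (r *record) Retain() { r.refs++ }

// Release drops one reference and frees the data when none remain.
func (r *record) Release() {
	r.refs--
	if r.refs == 0 {
		r.rows = nil
	}
}

// reader hands out one record per batch and releases the previous one
// on every call to Next.
type reader struct {
	batches [][]int
	cur     *record
}

func (rd *reader) Next() bool {
	if rd.cur != nil {
		rd.cur.Release()
		rd.cur = nil
	}
	if len(rd.batches) == 0 {
		return false
	}
	rd.cur = &record{refs: 1, rows: rd.batches[0]}
	rd.batches = rd.batches[1:]
	return true
}

func (rd *reader) Record() *record { return rd.cur }

func main() {
	rd := &reader{batches: [][]int{{38, 13}, {53, 93}, {66}}}
	var kept *record
	for rd.Next() {
		rec := rd.Record()
		fmt.Println(rec.rows) // used only inside the loop: no Retain needed
		if kept == nil {
			rec.Retain() // stored for use after the loop: Retain is required
			kept = rec
		}
	}
	fmt.Println("kept after loop:", kept.rows)
	kept.Release() // balance the Retain once done
}
```

In this model only the first record is retained, so it is the only one whose data is still available after the loop ends.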

parquet/example_write_read_pq_test.go

Comment on lines 160 to 172

+              		// Get columns
+              		intCol := record.Column(colIndices[0]).(*array.Int32)
+              		stringCol := record.Column(colIndices[1]).(*array.String)
+              		listCol := record.Column(colIndices[2]).(*array.List)
+              		listValueCol := listCol.ListValues().(*array.Float32)
+              		// Iterate over the rows within the current record
+              		for idx := range int(record.NumRows()) {
+              			// For the list column, get the start and end offsets for the current row
+              			start, end := listCol.ValueOffsets(idx)
+              			fmt.Printf("%d  %s  %v\n", intCol.Value(idx), stringCol.Value(idx), listValueCol.Float32Values()[start:end])
+              		}

Member

zeroshade Jul 1, 2025

Do we need to grab each column separately like this instead of just calling fmt.Println(record) ?

Contributor Author

milden6 Jul 1, 2025

Are you sure? If I write it like this:

for recordReader.Next() {
	// Create a record
	record := recordReader.Record()
	fmt.Println(record)
}

The output will be like this:

record:
  schema:
  fields: 3
    - intField: type=int32
          metadata: ["PARQUET:field_id": "-1"]
    - stringField: type=utf8
             metadata: ["PARQUET:field_id": "-1"]
    - listField: type=list<element: float32, nullable>
           metadata: ["PARQUET:field_id": "-1"]
  rows: 2
  col[0][intField]: [38 13]
  col[1][stringField]: ["val1" "val2"]
  col[2][listField]: [[1 2 4 8] [1 2 4 8]]

record:
  schema:
  fields: 3
    - intField: type=int32
          metadata: ["PARQUET:field_id": "-1"]
    - stringField: type=utf8
             metadata: ["PARQUET:field_id": "-1"]
    - listField: type=list<element: float32, nullable>
           metadata: ["PARQUET:field_id": "-1"]
  rows: 2
  col[0][intField]: [53 93]
  col[1][stringField]: ["val3" "val4"]
  col[2][listField]: [[1 2 4 8] [1 2 4 8]]

record:
  schema:
  fields: 3
    - intField: type=int32
          metadata: ["PARQUET:field_id": "-1"]
    - stringField: type=utf8
             metadata: ["PARQUET:field_id": "-1"]
    - listField: type=list<element: float32, nullable>
           metadata: ["PARQUET:field_id": "-1"]
  rows: 1
  col[0][intField]: [66]
  col[1][stringField]: ["val5"]
  col[2][listField]: [[1 2 4 8]]

Instead of:

38  val1  [1 2 4 8]
13  val2  [1 2 4 8]
53  val3  [1 2 4 8]
93  val4  [1 2 4 8]
66  val5  [1 2 4 8]

Or did I misunderstand you?

Member

zeroshade Jul 1, 2025

Nope, you didn't misunderstand. Perhaps change the batch size to 3 instead of 2 so that it only outputs two records?

Contributor Author

milden6 Jul 1, 2025

I think it’s best to leave it as is, since it demonstrates how to access cell values for each row. Working with records isn’t always obvious for beginners.

Like here:

arrow-go/parquet/file/file_reader_test.go

Lines 687 to 696 in 084371e

    
	for rr.Next() {
		rec := rr.Record()
		for i := 0; i < int(rec.NumRows()); i++ {
			col := rec.Column(0).(*array.Int64)
			val := col.Value(i)
			require.Equal(t, val, int64(totalRows+i))
		}
		totalRows += int(rec.NumRows())
	}

What do you think?

Member

zeroshade Jul 2, 2025

fair enough, we can leave it as is.


          Change after review

16f8f7d

zeroshade approved these changes

Member

zeroshade left a comment

Thanks so much for this. It will significantly help people interact with this library!

zeroshade merged commit 45e47c7 into apache:main

23 checks passed
