Skip to content

Conversation

@milden6
Copy link
Contributor

@milden6 milden6 commented Jun 28, 2025

Rationale for this change

Enhancing the library examples

What changes are included in this PR?

2 new examples on how to use the arrow-go library:

  • Writing parquet file
  • Reading parquet file

Are these changes tested?

Not applicable

Are there any user-facing changes?

No, just enhancing examples

@milden6 milden6 requested a review from zeroshade as a code owner June 28, 2025 11:22
@zeroshade
Copy link
Member

Instead of standalone examples like this, can we use Go testable examples that will show up in the docs and be tested to ensure they don't get out of date?

@milden6
Copy link
Contributor Author

milden6 commented Jun 28, 2025

Instead of standalone examples like this, can we use Go testable examples that will show up in the docs and be tested to ensure they don't get out of date?

I did it like here, what the difference?
https://github.com/apache/arrow-go/tree/main/arrow/examples/table_creation

@zeroshade
Copy link
Member

zeroshade commented Jun 28, 2025

My preference would be to follow this pattern

Here's a good explanation of it: https://go.dev/blog/examples

@milden6
Copy link
Contributor Author

milden6 commented Jun 30, 2025

okay, it should be in the path: arrow-go/parquet/example_writing_parquet_test.go?

@zeroshade
Copy link
Member

That path works great

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! It looks great, just a few nitpicks

Comment on lines 62 to 65
// Create arrow writer props to store the schema in the parquet file
arrowWriterProps := pqarrow.NewArrowWriterProperties(
pqarrow.WithStoreSchema(),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we elaborate in the comment about the benefit to storing the Arrow schema in the metadata?

Copy link
Contributor Author

@milden6 milden6 Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, what about this comment?

// WithStoreSchema embeds the original Arrow schema into the Parquet file metadata,
// allowing it to be accurately restored when reading. This ensures correct handling
// of advanced types like dictionaries and preserves full type fidelity across write/read

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks good! Perhaps also mention that some of the other Parquet libraries also support this so it helps ensure cross-language type consistency too during reading.

Comment on lines 76 to 83
// Create a builder for each field
intFieldIdx := schema.FieldIndices("intField")[0]
stringFieldIdx := schema.FieldIndices("stringField")[0]
listFieldIdx := schema.FieldIndices("listField")[0]

intFieldBuilder := recordBuilder.Field(intFieldIdx).(*array.Int32Builder)
stringFieldBuilder := recordBuilder.Field(stringFieldIdx).(*array.StringBuilder)
listFieldBuilder := recordBuilder.Field(listFieldIdx).(*array.ListBuilder)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do the FieldIndices calls when we know what the schema is and the indices are since we created the schema? There might be benefits to keeping this as simple as possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sure, I'll specify the indices explicitly for simplicity:

	intFieldBuilder := recordBuilder.Field(0).(*array.Int32Builder)
	stringFieldBuilder := recordBuilder.Field(1).(*array.StringBuilder)
	listFieldBuilder := recordBuilder.Field(2).(*array.ListBuilder)


// IMPORTANT: Close the writer to finalize the file
if err := writer.Close(); err != nil {
log.Printf("Failed to close parquet writer: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.Fatalf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will fix it

colIndices[idx] = schema.FieldIndices(colNames[idx])[0]
}

// Get the current record from the reader
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Get the current record from the reader
// Get a record reader from the file to iterate over

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

for recordReader.Next() {
// Create a record
record := recordReader.Record()
record.Retain()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to call retain if we need to utilize the record beyond and outside of this loop. Since we only interact with the record in the loop and aren't storing it for use beyond the loop we don't need to call Retain and Release. Might be better to simplify this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for clarification

Comment on lines 160 to 172
// Get columns
intCol := record.Column(colIndices[0]).(*array.Int32)
stringCol := record.Column(colIndices[1]).(*array.String)
listCol := record.Column(colIndices[2]).(*array.List)
listValueCol := listCol.ListValues().(*array.Float32)

// Iterate over the rows within the current record
for idx := range int(record.NumRows()) {
// For the list column, get the start and end offsets for the current row
start, end := listCol.ValueOffsets(idx)

fmt.Printf("%d %s %v\n", intCol.Value(idx), stringCol.Value(idx), listValueCol.Float32Values()[start:end])
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to grab each column separately like this instead of just calling fmt.Println(record) ?

Copy link
Contributor Author

@milden6 milden6 Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? If I write it like this:

for recordReader.Next() {
	// Create a record
	record := recordReader.Record()
	fmt.Println(record)
}

The output will be like this:

record:
  schema:
  fields: 3
    - intField: type=int32
          metadata: ["PARQUET:field_id": "-1"]
    - stringField: type=utf8
             metadata: ["PARQUET:field_id": "-1"]
    - listField: type=list<element: float32, nullable>
           metadata: ["PARQUET:field_id": "-1"]
  rows: 2
  col[0][intField]: [38 13]
  col[1][stringField]: ["val1" "val2"]
  col[2][listField]: [[1 2 4 8] [1 2 4 8]]

record:
  schema:
  fields: 3
    - intField: type=int32
          metadata: ["PARQUET:field_id": "-1"]
    - stringField: type=utf8
             metadata: ["PARQUET:field_id": "-1"]
    - listField: type=list<element: float32, nullable>
           metadata: ["PARQUET:field_id": "-1"]
  rows: 2
  col[0][intField]: [53 93]
  col[1][stringField]: ["val3" "val4"]
  col[2][listField]: [[1 2 4 8] [1 2 4 8]]

record:
  schema:
  fields: 3
    - intField: type=int32
          metadata: ["PARQUET:field_id": "-1"]
    - stringField: type=utf8
             metadata: ["PARQUET:field_id": "-1"]
    - listField: type=list<element: float32, nullable>
           metadata: ["PARQUET:field_id": "-1"]
  rows: 1
  col[0][intField]: [66]
  col[1][stringField]: ["val5"]
  col[2][listField]: [[1 2 4 8]]

Instead of:

38  val1  [1 2 4 8]
13  val2  [1 2 4 8]
53  val3  [1 2 4 8]
93  val4  [1 2 4 8]
66  val5  [1 2 4 8]

Or did I misunderstand you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope you didn't misunderstand, perhaps change the batch size to be 3 instead of 2 so that it only outputs two records?

Copy link
Contributor Author

@milden6 milden6 Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it’s best to leave it as is, since it demonstrates how to access cell values for each row. Working with records isn’t always obvious for beginners.

Like here:

for rr.Next() {
rec := rr.Record()
for i := 0; i < int(rec.NumRows()); i++ {
col := rec.Column(0).(*array.Int64)
val := col.Value(i)
require.Equal(t, val, int64(totalRows+i))
}
totalRows += int(rec.NumRows())
}

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, we can leave it as is.

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this. It will significantly help people with interacting with this library!

@zeroshade zeroshade merged commit 45e47c7 into apache:main Jul 2, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants