[data] Fix errors with concatenation with mixed pyarrow native and extension types#57566
Conversation
| table = pa.concat_tables( | ||
| subset_blocks, promote_options=arrow_promote_types_mode | ||
| ) | ||
| return {col_name: table.column(col_name) for col_name in table.schema.names} |
There was a problem hiding this comment.
Bug: Schema Mismatch in PyArrow Table Concatenation
The _concat_cols_with_native_pyarrow_types helper selects only existing columns from each block, failing to null-fill missing columns. This creates inconsistent schemas across blocks for native PyArrow types, causing pyarrow.concat_tables to fail with schema mismatch errors.
There was a problem hiding this comment.
Code Review
This pull request refactors the concat function in transform_pyarrow.py by breaking it down into smaller, more manageable helper functions for different column types. This is a great improvement for code readability and maintainability. I've found a couple of opportunities to improve performance in the new implementation.
| subset_blocks = [] | ||
| for block in blocks: | ||
| cols_to_select = [ | ||
| col_name for col_name in col_names if col_name in block.schema.names |
There was a problem hiding this comment.
For better performance, consider converting block.schema.names to a set for the membership check. Checking for an item's presence in a list has a time complexity of O(n), whereas for a set it's O(1) on average. Since this check is inside a loop, this optimization can be beneficial, especially with schemas containing many columns.
| col_name for col_name in col_names if col_name in block.schema.names | |
| col_name for col_name in col_names if col_name in set(block.schema.names) |
| col = pa.chunked_array(chunks_to_concat) | ||
| col_chunked_arrays = [] | ||
| for block in blocks: | ||
| if col_name in block.schema.names: |
There was a problem hiding this comment.
To improve efficiency, you can check for column existence directly on the schema object (block.schema) instead of its names attribute. col_name in block.schema is an O(1) operation on average, as pyarrow.Schema uses a more efficient lookup mechanism than a list search (in block.schema.names), which is O(n). This is particularly important here, as it's inside nested loops.
| if col_name in block.schema.names: | |
| if col_name in block.schema: |
Why are these changes needed?
Cherry-pick #56811
Original description:
If we had an execution where we needed to concatenate native pyarrow types and pyarrow extension types, we would get errors like the following:
This PR adds a test that replicates this and fixes the underlying issue by concatenating extension types and native types separately before rejoining them.
Related issue number
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.