[Data] Fix reading from written parquet for numpy with NaNs#59172
Conversation
There was a problem hiding this comment.
Code Review
This pull request fixes an issue with reading Parquet files containing NumPy arrays with NaNs. The main changes are:
- In
TensorDtype.__from_arrow__, it now correctly handles Arrow arrays with null values by disabling zero-copy when converting to a NumPy array. This prevents failures when nulls are present. - In
TensorArray.isna, the definition of a missing value is changed. Now, for non-object types,NaNis treated as a valid value, not a missing value. This aligns with the goal of correctly handling NaNs.
My review focuses on improving the implementation of these changes. I've suggested a simplification in __from_arrow__ and a bug fix in isna for object-dtype arrays that is exposed by the changes in this PR. Specifically, the isna method for object arrays should handle np.nan values, which can be introduced during the Arrow-to-NumPy conversion.
b58d11e to
16cc17b
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request aims to fix an issue with isna for numpy arrays with NaNs after a Parquet roundtrip. The change addresses a bug where np.isnan could be called on non-float dtypes, causing a TypeError. However, the current fix of returning np.zeros for all non-object dtypes is too broad and incorrectly disables NaN detection for float tensors. I've provided a critical review comment with a suggested implementation that correctly handles different dtypes (float, string, and others) to ensure isna behaves as expected without reintroducing the original bug.
16cc17b to
6691812
Compare
| continue | ||
|
|
||
| if not col.notna().any(): | ||
| # If there are only null-values, coerce column to Arrow's `NullType` |
There was a problem hiding this comment.
Why removing the comments? Let's keep them
| # The column should NOT be converted to null type | ||
| assert not pa.types.is_null( | ||
| arrow_table.schema.field("foo").type | ||
| ), "TensorDtype column with all-NaN values should not be converted to null type" |
There was a problem hiding this comment.
Let's assert that the type is tensor
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
…ect#59172) ## Description Currently when transforming pandas to arrow block, we will convert it into `pa.null` if all the value in TensorArray is NaN or empty string, however all NaN might actually represent something and could actually happen in transformation. https://github.com/ray-project/ray/blob/8b4a1ee98fd92a972d07e9c08af19f02829dd40f/python/ray/air/util/tensor_extensions/pandas.py#L926-L958 And could triggered error when we tried to convert that arrow block back to pandas (try running the repro script from issue) This PR skip coerced to `pa.null` if the`col.dtype` is `TensorDtype`, which should tolerate the all NaN or all empty string table. ## Related issues Closes ray-project#59087 --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Currently when transforming pandas to arrow block, we will convert it into
pa.nullif all the value in TensorArray is NaN or empty string, however all NaN might actually represent something and could actually happen in transformation.ray/python/ray/air/util/tensor_extensions/pandas.py
Lines 926 to 958 in 8b4a1ee
And could triggered error when we tried to convert that arrow block back to pandas (try running the repro script from issue)
This PR skip coerced to
pa.nullif thecol.dtypeisTensorDtype, which should tolerate the all NaN or all empty string table.Related issues
Closes #59087