[Data] Fix reading from written parquet for numpy with NaNs by owenowenisme · Pull Request #59172 · ray-project/ray

owenowenisme · 2025-12-04T15:25:03Z

Description

Currently, when converting a pandas block to an Arrow block, we coerce a column to pa.null if every value in the TensorArray is NaN or an empty string. However, an all-NaN tensor can be meaningful data and can legitimately arise during a transformation.

ray/python/ray/air/util/tensor_extensions/pandas.py

Lines 926 to 958 in 8b4a1ee

    
    def isna(self) -> "TensorArray":
        """
        A 1-D array indicating if each value is missing.

        Returns
        -------
        na_values : Union[np.ndarray, ExtensionArray]
            In most cases, this should return a NumPy ndarray. For
            exceptional cases like ``SparseArray``, where returning
            an ndarray would be expensive, an ExtensionArray may be
            returned.

        Notes
        -----
        If returning an ExtensionArray, then

        * ``na_values._is_boolean`` should be True
        * `na_values` should implement :func:`ExtensionArray._reduce`
        * ``na_values.any`` and ``na_values.all`` should be implemented
        """
        if self._tensor.dtype.type is np.object_:
            # Avoid comparing with __eq__ because the elements of the tensor
            # may do something funny with that operation.
            return np.array(
                [self._tensor[i] is None for i in range(len(self))], dtype=bool
            )
        elif self._tensor.dtype.type is np.str_:
            return np.all(self._tensor == "", axis=tuple(range(1, self._tensor.ndim)))
        else:
            return np.all(
                np.isnan(self._tensor), axis=tuple(range(1, self._tensor.ndim))
            )

This can trigger an error when we try to convert that Arrow block back to pandas (try running the repro script from the issue).
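To see why an all-NaN tensor column ends up treated as entirely missing, here is a minimal NumPy-only sketch of the float branch of the `isna` method quoted above. The names `rowwise_isna` and `would_coerce_*` are illustrative and not part of Ray; the last two assignments imitate the `not col.notna().any()` check on a plain boolean mask, which is an assumption about how the mask is consumed.

```python
import numpy as np


def rowwise_isna(tensor: np.ndarray) -> np.ndarray:
    """Mark a row as missing when every element of that row is NaN.

    Mirrors the `else` branch of the quoted `isna`; hypothetical helper.
    """
    # Reduce over every axis except the first (the row axis).
    reduce_axes = tuple(range(1, tensor.ndim))
    return np.all(np.isnan(tensor), axis=reduce_axes)


# Three 2x2 tensors, all filled with NaN.
all_nan = np.full((3, 2, 2), np.nan)

# Same data, but the middle tensor has one real value.
mixed = all_nan.copy()
mixed[1, 0, 0] = 1.0

all_nan_mask = rowwise_isna(all_nan)
mixed_mask = rowwise_isna(mixed)

# Equivalent of `not col.notna().any()`: True means "only null values".
would_coerce_all_nan = not (~all_nan_mask).any()
would_coerce_mixed = not (~mixed_mask).any()
```

With this definition, a column whose tensors are all NaN satisfies the "only null values" condition, while a single non-NaN element in any tensor is enough to avoid it.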

This PR skips the coercion to pa.null if col.dtype is TensorDtype, so tables whose tensor columns are all NaN or all empty strings are tolerated.
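The control flow of the fix can be sketched without pandas or Arrow. Everything below is a stand-in: `Column`, its `is_tensor` flag, and `columns_to_coerce_to_null` are hypothetical names used only to illustrate the "skip tensor columns, coerce other all-null columns" rule, not the actual code in `pandas_block.py`.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Column:
    """Hypothetical stand-in for a pandas column."""

    values: np.ndarray
    is_tensor: bool  # stands in for `isinstance(col.dtype, TensorDtype)`


def columns_to_coerce_to_null(columns: dict) -> list:
    """Return the names of columns that would be coerced to a null type."""
    coerced = []
    for name, col in columns.items():
        # The fix: never coerce tensor columns, even if every value is NaN.
        if col.is_tensor:
            continue
        # Columns of other dtypes containing only nulls are still coerced.
        if np.all(np.isnan(col.values)):
            coerced.append(name)
    return coerced


columns = {
    "tensor": Column(np.full((2, 3), np.nan), is_tensor=True),
    "scalar": Column(np.array([np.nan, np.nan]), is_tensor=False),
    "filled": Column(np.array([1.0, np.nan]), is_tensor=False),
}
result = columns_to_coerce_to_null(columns)
```

Only the all-NaN non-tensor column is selected; the all-NaN tensor column keeps its type.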

Related issues

Closes #59087

gemini-code-assist

Code Review

This pull request fixes an issue with reading Parquet files containing NumPy arrays with NaNs. The main changes are:

In TensorDtype.__from_arrow__, it now correctly handles Arrow arrays with null values by disabling zero-copy when converting to a NumPy array. This prevents failures when nulls are present.
In TensorArray.isna, the definition of a missing value is changed. Now, for non-object types, NaN is treated as a valid value, not a missing value. This aligns with the goal of correctly handling NaNs.

My review focuses on improving the implementation of these changes. I've suggested a simplification in __from_arrow__ and a bug fix in isna for object-dtype arrays that is exposed by the changes in this PR. Specifically, the isna method for object arrays should handle np.nan values, which can be introduced during the Arrow-to-NumPy conversion.
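As a rough illustration of the object-dtype point, the sketch below treats both `None` and a float NaN as missing. `object_isna` is a hypothetical helper, not the reviewer's actual suggestion, and it assumes a 1-D object array.

```python
import numpy as np


def object_isna(tensor: np.ndarray) -> np.ndarray:
    """Flag elements of a 1-D object array that are None or a float NaN."""
    return np.array(
        [
            # Identity check for None; NaN is only tested on float elements,
            # because np.isnan would fail on arbitrary objects such as str.
            x is None or (isinstance(x, float) and np.isnan(x))
            for x in tensor
        ],
        dtype=bool,
    )


values = np.array([None, np.nan, 1.0, "a"], dtype=object)
mask = object_isna(values)
```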

python/ray/air/util/tensor_extensions/pandas.py

owenowenisme · 2025-12-05T08:11:10Z

/gemini review

gemini-code-assist

Code Review

This pull request aims to fix an issue with isna for numpy arrays with NaNs after a Parquet roundtrip. The change addresses a bug where np.isnan could be called on non-float dtypes, causing a TypeError. However, the current fix of returning np.zeros for all non-object dtypes is too broad and incorrectly disables NaN detection for float tensors. I've provided a critical review comment with a suggested implementation that correctly handles different dtypes (float, string, and others) to ensure isna behaves as expected without reintroducing the original bug.
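A dtype-aware `isna` along the lines the review describes might look like the following. This is a sketch under assumptions, not the reviewer's suggested code: `isna_sketch` is a hypothetical free function operating on a plain `np.ndarray`, and it treats only floating dtypes as NaN-capable.

```python
import numpy as np


def isna_sketch(tensor: np.ndarray) -> np.ndarray:
    """Return a 1-D boolean mask with one entry per row of `tensor`."""
    reduce_axes = tuple(range(1, tensor.ndim))
    if tensor.dtype.type is np.object_:
        # Identity check, as in the original method.
        return np.array([tensor[i] is None for i in range(len(tensor))], dtype=bool)
    if tensor.dtype.type is np.str_:
        return np.all(tensor == "", axis=reduce_axes)
    if np.issubdtype(tensor.dtype, np.floating):
        # Only float tensors can hold NaN, so only they are checked.
        return np.all(np.isnan(tensor), axis=reduce_axes)
    # Any other dtype: nothing is missing, and np.isnan is never called.
    return np.zeros(len(tensor), dtype=bool)


float_mask = isna_sketch(np.array([[np.nan, np.nan], [1.0, np.nan]]))
str_mask = isna_sketch(np.array([["", ""], ["a", ""]]))
int_mask = isna_sketch(np.array([[1, 2], [3, 4]]))

# A bytes tensor is one dtype on which np.isnan raises TypeError.
bytes_tensor = np.array([[b"a", b"b"]])
try:
    np.isnan(bytes_tensor)
    isnan_raised = False
except TypeError:
    isnan_raised = True
bytes_mask = isna_sketch(bytes_tensor)
```

Keeping the float branch preserves NaN detection for float tensors, while the final fallback avoids the `TypeError` for dtypes that `np.isnan` does not support.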

python/ray/air/util/tensor_extensions/pandas.py

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>


alexeykudinkin · 2025-12-05T21:12:22Z

python/ray/data/_internal/pandas_block.py

+                continue
+
            if not col.notna().any():
-                # If there are only null-values, coerce column to Arrow's `NullType`


Why remove the comments? Let's keep them.

python/ray/data/_internal/pandas_block.py

alexeykudinkin · 2025-12-05T21:22:54Z

python/ray/data/tests/test_pandas_block.py

+    # The column should NOT be converted to null type
+    assert not pa.types.is_null(
+        arrow_table.schema.field("foo").type
+    ), "TensorDtype column with all-NaN values should not be converted to null type"


Let's assert that the type is tensor


owenowenisme added the "go" label (add ONLY when ready to merge, run all tests) Dec 4, 2025

gemini-code-assist bot reviewed Dec 4, 2025


owenowenisme force-pushed the data/fix-reading-from-writing-parquet-with-NaNs branch from b58d11e to 16cc17b Compare December 5, 2025 08:08

gemini-code-assist bot reviewed Dec 5, 2025


update

6691812

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

owenowenisme force-pushed the data/fix-reading-from-writing-parquet-with-NaNs branch from 16cc17b to 6691812 Compare December 5, 2025 08:39

Merge branch 'master' into data/fix-reading-from-writing-parquet-with-NaNs

5c29b09

owenowenisme marked this pull request as ready for review December 5, 2025 17:32

owenowenisme requested a review from a team as a code owner December 5, 2025 17:32

ray-gardener bot added the "data" label (Ray Data-related issues) Dec 5, 2025

alexeykudinkin reviewed Dec 5, 2025


owenowenisme and others added 3 commits December 6, 2025 12:00

Update python/ray/data/_internal/pandas_block.py

c732e65

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>

update

5431e09

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

update

ec55228

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>

alexeykudinkin approved these changes Dec 8, 2025


alexeykudinkin merged commit 5decfbd into ray-project:master Dec 8, 2025
6 checks passed

alexeykudinkin merged 5 commits into ray-project:master from owenowenisme:data/fix-reading-from-writing-parquet-with-NaNs


