[Data] - schema() handle pd.ArrowDtype -> pyarrow type conversion by goutamvenkat-anyscale · Pull Request #57057 · ray-project/ray

goutamvenkat-anyscale · 2025-09-30T22:53:52Z

Why are these changes needed?

When a dataset's schema contains pd.ArrowDtype columns, the existing pa.from_numpy_dtype(dtype) call in the schema function fails, because pd.ArrowDtype is a pandas extension dtype rather than a numpy dtype.

Related issue number

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Note

Schema.types now converts pandas ArrowDtype to pyarrow types (including within TensorDtype), with unit tests validating dtype conversion.

- Schema/types conversion
  - Add _convert_to_pa_type to map pandas.ArrowDtype and numpy dtypes to pyarrow types.
  - Use the helper for both generic column dtypes and TensorDtype._dtype (works with ArrowTensorType/ArrowTensorTypeV2).
  - Import pandas to detect pd.ArrowDtype.
- Tests
  - Add a parametrized test ensuring Schema.types returns the correct pyarrow types for pd.ArrowDtype and numpy dtypes.
  - Update minor test imports (e.g., pyarrow, Schema).

Written by Cursor Bugbot for commit 243cdd6. This will update automatically on new commits.

Signed-off-by: Goutam V. <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request correctly addresses an issue where pd.ArrowDtype was not handled properly when determining a dataset's schema types. The introduction of the _convert_to_pa_type helper function is a clean solution, and the accompanying test effectively validates the fix. I've added one suggestion to make the new helper function even more robust by handling raw pyarrow.DataType instances, which seems to be a possibility based on existing code patterns.

gemini-code-assist · 2025-09-30T22:56:17Z

python/ray/data/dataset.py

+        def _convert_to_pa_type(dtype: Union[np.dtype, pd.ArrowDtype]) -> pa.DataType:
+            if isinstance(dtype, pd.ArrowDtype):
+                return dtype.pyarrow_dtype
+            return pa.from_numpy_dtype(dtype)


This function correctly handles pd.ArrowDtype. To make it more robust, consider also handling raw pyarrow.DataType instances. It appears TensorDtype can sometimes be constructed with a pyarrow.DataType, which would cause a TypeError here as pa.from_numpy_dtype does not accept it. This error is then silently caught by the generic except Exception block, and the type becomes None, which can hide underlying issues. Explicitly handling pyarrow.DataType would prevent this.

Suggested change

-        def _convert_to_pa_type(dtype: Union[np.dtype, pd.ArrowDtype]) -> pa.DataType:
-            if isinstance(dtype, pd.ArrowDtype):
-                return dtype.pyarrow_dtype
-            return pa.from_numpy_dtype(dtype)
+        def _convert_to_pa_type(dtype: Union[np.dtype, pd.ArrowDtype, "pa.DataType"]) -> "pa.DataType":
+            if isinstance(dtype, pd.ArrowDtype):
+                return dtype.pyarrow_dtype
+            if isinstance(dtype, pa.DataType):
+                return dtype
+            return pa.from_numpy_dtype(dtype)

[Data] - schema() handle pd.ArrowDtype -> pyarrow type

243cdd6

Signed-off-by: Goutam V. <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner September 30, 2025 22:53

iamjustinhsu approved these changes Sep 30, 2025

gemini-code-assist bot reviewed Sep 30, 2025

goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Sep 30, 2025

goutamvenkat-anyscale changed the title ~~[Data] - schema() handle pd.ArrowDtype -> pyarrow type~~ [Data] - schema() handle pd.ArrowDtype -> pyarrow type conversion Sep 30, 2025

alexeykudinkin approved these changes Sep 30, 2025

alexeykudinkin enabled auto-merge (squash) September 30, 2025 23:16

alexeykudinkin merged commit 2d9d528 into ray-project:master Oct 1, 2025
7 checks passed

goutamvenkat-anyscale deleted the goutam/handle_pd_arrow_dtype branch October 1, 2025 00:12
