[Data] - schema() handle pd.ArrowDtype -> pyarrow type conversion #57057
Signed-off-by: Goutam V. <goutam@anyscale.com>
Code Review
This pull request correctly addresses an issue where pd.ArrowDtype was not handled properly when determining a dataset's schema types. The introduction of the _convert_to_pa_type helper function is a clean solution, and the accompanying test effectively validates the fix. I've added one suggestion to make the new helper function even more robust by handling raw pyarrow.DataType instances, which seems to be a possibility based on existing code patterns.
def _convert_to_pa_type(dtype: Union[np.dtype, pd.ArrowDtype]) -> pa.DataType:
    if isinstance(dtype, pd.ArrowDtype):
        return dtype.pyarrow_dtype
    return pa.from_numpy_dtype(dtype)
This function correctly handles pd.ArrowDtype. To make it more robust, consider also handling raw pyarrow.DataType instances. It appears TensorDtype can sometimes be constructed with a pyarrow.DataType, which would cause a TypeError here as pa.from_numpy_dtype does not accept it. This error is then silently caught by the generic except Exception block, and the type becomes None, which can hide underlying issues. Explicitly handling pyarrow.DataType would prevent this.
-def _convert_to_pa_type(dtype: Union[np.dtype, pd.ArrowDtype]) -> pa.DataType:
-    if isinstance(dtype, pd.ArrowDtype):
-        return dtype.pyarrow_dtype
-    return pa.from_numpy_dtype(dtype)
+def _convert_to_pa_type(dtype: Union[np.dtype, pd.ArrowDtype, "pa.DataType"]) -> "pa.DataType":
+    if isinstance(dtype, pd.ArrowDtype):
+        return dtype.pyarrow_dtype
+    if isinstance(dtype, pa.DataType):
+        return dtype
+    return pa.from_numpy_dtype(dtype)
…y-project#57057)

## Why are these changes needed?

When the schema contains `pd.ArrowDtype` datatypes, the existing `pa.from_numpy_dtype(dtype)` in the schema function will fail.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---

> [!NOTE]
> Schema.types now converts pandas ArrowDtype to pyarrow types (including within TensorDtype), with unit tests validating dtype conversion.
>
> - **Schema/types conversion**
>   - Add `_convert_to_pa_type` to map `pandas.ArrowDtype` and `numpy dtype` to `pyarrow` types.
>   - Use helper for both generic column dtypes and `TensorDtype._dtype` (works with ArrowTensorType/ArrowTensorTypeV2).
>   - Import `pandas` to detect `pd.ArrowDtype`.
> - **Tests**
>   - Add parametric test ensuring `Schema.types` returns correct `pyarrow` types for `pd.ArrowDtype` and `numpy` dtypes.
>   - Minor test imports updated (e.g., `pyarrow`, `Schema`).
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 243cdd6. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>