[Data] Restoring ray.air.util.tensor_extensions.arrow class aliases to fix deserialization of existing datasets#59818
Conversation
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Code Review
This pull request introduces a backward compatibility fix for deserializing older Ray datasets by restoring class aliases for tensor extensions. The changes are well-structured, including an improved error message for easier debugging and dedicated CI tests to validate the fix. My review includes a suggestion to refactor the new test configuration in release/release_data_tests.yaml to improve maintainability by reducing code duplication.
byod:
  # NOTE: Image classification have to pin Pyarrow to 19.0 due to dataset using
  # previous tensor extension type inheriting from ``pyarrow.PyExtensionType``
  # that is removed in Pyarrow 21.0
  python_depset: image_classification_py3.10.lock
To improve maintainability and avoid duplicating this configuration, you can define this byod block as a YAML anchor. You can then reuse this definition in the image_classification_chaos test configuration (lines 529-533) with an alias like byod: *pyarrow_pin.
byod: &pyarrow_pin
  # NOTE: Image classification have to pin Pyarrow to 19.0 due to dataset using
  # previous tensor extension type inheriting from ``pyarrow.PyExtensionType``
  # that is removed in Pyarrow 21.0
  python_depset: image_classification_py3.10.lock
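If the anchor suggestion above is adopted, later test entries can reuse the pinned configuration via a YAML alias. A minimal sketch, assuming hypothetical `name` keys and that the surrounding entries otherwise keep their existing structure (note that aliases copy only the mapping data, not the comments):

```yaml
# First occurrence defines the anchor:
- name: image_classification
  cluster:
    byod: &pyarrow_pin
      # NOTE: Image classification have to pin Pyarrow to 19.0 due to dataset using
      # previous tensor extension type inheriting from ``pyarrow.PyExtensionType``
      # that is removed in Pyarrow 21.0
      python_depset: image_classification_py3.10.lock

# Later entries reuse the same byod block via the alias:
- name: image_classification_chaos
  cluster:
    byod: *pyarrow_pin
```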
f"to read data written with an older version of Ray. Reading data "
f"written with older versions of Ray might expose you to arbitrary code "
f"execution. To try reading the data anyway, "
f"preset `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on *all* nodes."
Missing space between concatenated string literals in error message
The error message has adjacent string literals where line 167 ends with "on *all* nodes." and line 168 starts with "To learn more...". When concatenated, this produces "...on *all* nodes.To learn more..." without a space between the sentences, making the error message harder to read.
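A minimal illustration of the issue: Python concatenates adjacent string literals with nothing in between, so a missing trailing space glues the two sentences together. The second literal below is a hypothetical stand-in for the message's following line.

```python
# Adjacent string literals are concatenated verbatim, so a missing trailing
# space fuses the end of one sentence with the start of the next.
broken = (
    "preset `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on *all* nodes."
    "To learn more, see the documentation."
)
assert "nodes.To" in broken  # the sentences run together

# Adding the trailing space to the first literal fixes the message.
fixed = (
    "preset `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on *all* nodes. "
    "To learn more, see the documentation."
)
assert "nodes. To" in fixed
```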
Force-pushed from c936a4f to 93258ff.
group: data-batch-inference

cluster:
  byod:
Wait, is this change orthogonal because we were always using pyarrow 21.0.0 before? Or were we previously always using 19.0.0?
  - name: DEFAULTS
    python: "3.10"
-   group: multimodal-inference-benchmarks
+   group: data-multimodal-inference-benchmarks
Was this a drive-by fix?
… to fix deserialization of existing datasets (#59828)

## Description
Follow-up for #59818:
1. Fixing serde for `ArrowPythonObjectType`
2. Missing `__init__.py` files making packages omitted at build time

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
… to fix deserialization of existing datasets (ray-project#59818)

## Description
Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

ray-project#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That broke deserialization of datasets written with the older Ray Data implementation of these extensions, which inherits from `pyarrow.PyExtensionType`:
1. `PyExtensionType` pickles a class reference into the metadata when writing the data (in that case, e.g., `ray.air.util.tensor_extensions.arrow.ArrowTensorType`)
2. Upon reading the data it tries to unpickle that reference and now fails because these classes were moved.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
Description
Context: Slack

#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That broke deserialization of datasets written with the older Ray Data implementation of these extensions, which inherits from `pyarrow.PyExtensionType`:
1. `PyExtensionType` pickles a class reference into the metadata when writing the data (in that case, e.g., `ray.air.util.tensor_extensions.arrow.ArrowTensorType`)
2. Upon reading the data it tries to unpickle that reference and now fails because these classes were moved.

Related issues
Additional information
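The breakage and the fix can be sketched with plain `pickle`: classes pickle *by reference* (module path plus qualified name), so moving a class breaks old payloads, and restoring an alias at the old path repairs them. The module and class names below are hypothetical stand-ins for the real `ray.air.util.tensor_extensions.arrow.ArrowTensorType`:

```python
import pickle
import sys
import types

# Create a stand-in "old" module where the extension class originally lived.
legacy = types.ModuleType("legacy_tensor_ext")

class ArrowTensorType:
    pass

# Make the class look like it is defined in the legacy module, and register it.
ArrowTensorType.__module__ = "legacy_tensor_ext"
legacy.ArrowTensorType = ArrowTensorType
sys.modules["legacy_tensor_ext"] = legacy

# Writing data: pickle stores only a reference "legacy_tensor_ext.ArrowTensorType".
payload = pickle.dumps(ArrowTensorType)

# Simulate the refactor: the old module path no longer resolves.
del sys.modules["legacy_tensor_ext"]
try:
    pickle.loads(payload)
    broke = False
except Exception:
    broke = True  # old datasets can no longer be deserialized

# The fix: restore an alias at the old path pointing at the (moved) class.
sys.modules["legacy_tensor_ext"] = legacy
assert broke
assert pickle.loads(payload) is ArrowTensorType
```

This is why restoring the `ray.air.util.tensor_extensions.arrow` aliases is enough: unpickling only needs the old import path to resolve to the same class objects again.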