[Data] Restoring ray.air.util.tensor_extensions.arrow class aliases to fix deserialization of existing datasets#59818

Merged
bveeramani merged 5 commits into master from ak/btch-inf-rel-tst-fix on Jan 2, 2026

Conversation

@alexeykudinkin alexeykudinkin commented Jan 2, 2026

Description

Context: Slack

#59420 moved Ray Data's Arrow tensor extensions from ray.air.util.tensor_extensions to ray.data._internal.tensor_extensions.

That broke deserialization of datasets written with the older Ray Data implementation of these extensions, which inherited from pyarrow.PyExtensionType:

  1. PyExtensionType pickles a class reference into the metadata when writing the data (in this case, ray.air.util.tensor_extensions.arrow.ArrowTensorType, for example).
  2. Upon reading the data, it tries to unpickle that reference and now fails because these classes were moved.
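
The failure mode in the two steps above can be reproduced without Ray or PyArrow. The following is a minimal sketch using only the standard library: the module names `legacy_tensor_extensions_arrow` and `relocated_tensor_extensions_arrow` and the toy `ArrowTensorType` class are hypothetical stand-ins, not the real Ray classes. It assumes only that the class reference is pickled by module path, as described above, and shows that re-registering the old path as an alias makes old payloads loadable again.

```python
import pickle
import sys
import types

# Hypothetical stand-ins for the old and new import paths. Flat names are used
# so that no parent packages have to be simulated.
OLD_MODULE = "legacy_tensor_extensions_arrow"
NEW_MODULE = "relocated_tensor_extensions_arrow"


def install_module(name, **attrs):
    """Register a synthetic module in sys.modules so pickle can resolve it."""
    module = types.ModuleType(name)
    module.__dict__.update(attrs)
    sys.modules[name] = module
    return module


class ArrowTensorType:
    """Toy stand-in for an extension type; not the real Ray class."""

    def __init__(self, shape):
        self.shape = shape


# Pickle stores classes by (module, qualified name); pin the qualified name so
# the snippet behaves the same wherever it is executed.
ArrowTensorType.__qualname__ = "ArrowTensorType"

# 1. "Old release": the class lives at the old path and an instance is pickled.
ArrowTensorType.__module__ = OLD_MODULE
install_module(OLD_MODULE, ArrowTensorType=ArrowTensorType)
payload = pickle.dumps(ArrowTensorType((2, 2)))

# 2. "New release": the class moves and the old import path disappears.
del sys.modules[OLD_MODULE]
ArrowTensorType.__module__ = NEW_MODULE
install_module(NEW_MODULE, ArrowTensorType=ArrowTensorType)

try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    # The payload still names the old module, which no longer exists.
    load_failed = True

# 3. The fix sketched here: restore the old path as an alias of the moved class.
install_module(OLD_MODULE, ArrowTensorType=ArrowTensorType)
restored = pickle.loads(payload)

print(load_failed, restored.shape, type(restored).__module__)
```

The alias leaves the class itself at its new location; only the lookup path recorded in previously written data is made resolvable again.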

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin requested review from a team as code owners January 2, 2026 20:55
@alexeykudinkin alexeykudinkin added the "go" label (add ONLY when ready to merge, run all tests) Jan 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a backward compatibility fix for deserializing older Ray datasets by restoring class aliases for tensor extensions. The changes are well-structured, including an improved error message for easier debugging and dedicated CI tests to validate the fix. My review includes a suggestion to refactor the new test configuration in release/release_data_tests.yaml to improve maintainability by reducing code duplication.

Comment on lines +503 to +507
    byod:
      # NOTE: Image classification have to pin Pyarrow to 19.0 due to dataset using
      #       previous tensor extension type inheriting from ``pyarrow.PyExtensionType``
      #       that is removed in Pyarrow 21.0
      python_depset: image_classification_py3.10.lock


medium

To improve maintainability and avoid duplicating this configuration, you can define this byod block as a YAML anchor. You can then reuse this definition in the image_classification_chaos test configuration (lines 529-533) with an alias like byod: *pyarrow_pin.

    byod: &pyarrow_pin
      # NOTE: Image classification have to pin Pyarrow to 19.0 due to dataset using
      #       previous tensor extension type inheriting from ``pyarrow.PyExtensionType``
      #       that is removed in Pyarrow 21.0
      python_depset: image_classification_py3.10.lock

Comment on lines +529 to +533
    byod:
      # NOTE: Image classification have to pin Pyarrow to 19.0 due to dataset using
      #       previous tensor extension type inheriting from ``pyarrow.PyExtensionType``
      #       that is removed in Pyarrow 21.0
      python_depset: image_classification_py3.10.lock


medium

To avoid configuration duplication, you can use a YAML alias here to reference the byod block defined in the image_classification_{{scaling}} test above. This makes the file more maintainable.

    byod: *pyarrow_pin

    f"to read data written with an older version of Ray. Reading data "
    f"written with older versions of Ray might expose you to arbitrary code "
    f"execution. To try reading the data anyway, "
    f"preset `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on *all* nodes."

Missing space between concatenated string literals in error message

The error message has adjacent string literals where line 167 ends with "on *all* nodes." and line 168 starts with "To learn more...". When concatenated, this produces "...on *all* nodes.To learn more..." without a space between the sentences, making the error message harder to read.
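
The issue described in this comment comes from Python's implicit concatenation of adjacent string literals, which joins them with no separator. A minimal sketch follows; the second fragment is kept as the elided "To learn more..." quoted in the comment, since the full text is not shown here.

```python
# Adjacent string literals are joined exactly as written, with no separator.
without_space = (
    "preset `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on *all* nodes."
    "To learn more..."
)

# A trailing space on the first fragment keeps the sentences apart.
with_space = (
    "preset `RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE=1` on *all* nodes. "
    "To learn more..."
)

print(without_space)
print(with_space)
```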

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin force-pushed the ak/btch-inf-rel-tst-fix branch from c936a4f to 93258ff Compare January 2, 2026 21:13

@iamjustinhsu iamjustinhsu left a comment

👍

    group: data-batch-inference

    cluster:
      byod:


Wait, is this change orthogonal because we were always using pyarrow 21.0.0 before? Or were we previously always using 19.0.0?

@bveeramani bveeramani enabled auto-merge (squash) January 2, 2026 21:29
      - name: DEFAULTS
        python: "3.10"
    -   group: multimodal-inference-benchmarks
    +   group: data-multimodal-inference-benchmarks


Was this a drive-by fix?

@bveeramani bveeramani merged commit b46b7fd into master Jan 2, 2026
6 of 7 checks passed
@bveeramani bveeramani deleted the ak/btch-inf-rel-tst-fix branch January 2, 2026 22:23
alexeykudinkin added a commit that referenced this pull request Jan 4, 2026
… to fix deserialization of existing datasets (#59828)

## Description

Follow-up for #59818

 1. Fixing serde for `ArrowPythonObjectType`
 2. Adding missing `__init__.py` files, whose absence caused packages to be omitted at build time
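
To illustrate the second item: package discovery at build time commonly treats only directories containing `__init__.py` as packages. The sketch below is a simplified, standard-library re-implementation of that rule, not the actual Ray build logic; the helper `find_regular_packages` and the `pkg/core` / `pkg/compat` layout are hypothetical.

```python
import os
import tempfile


def find_regular_packages(root):
    """Return dotted names of directories under `root` that contain __init__.py.

    A directory without __init__.py is skipped together with everything
    beneath it, which is how a package can silently drop out of a build.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue
        if "__init__.py" not in filenames:
            dirnames[:] = []  # do not descend into non-packages
            continue
        rel = os.path.relpath(dirpath, root)
        found.append(rel.replace(os.sep, "."))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: `pkg/compat` lacks __init__.py, `pkg/core` has one.
    for path in ("pkg", "pkg/core", "pkg/compat"):
        os.makedirs(os.path.join(root, path))
    for path in ("pkg/__init__.py", "pkg/core/__init__.py", "pkg/compat/arrow.py"):
        open(os.path.join(root, path), "w").close()

    packages = find_regular_packages(root)

print(packages)
```

Under this rule, `pkg.compat` is absent from the result even though it contains a module, so adding the missing `__init__.py` is what brings it back into the discovered set.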


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
… to fix deserialization of existing datasets (ray-project#59818)

AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
… to fix deserialization of existing datasets (ray-project#59828)

lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
… to fix deserialization of existing datasets (ray-project#59818)

lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
… to fix deserialization of existing datasets (ray-project#59828)

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
… to fix deserialization of existing datasets (ray-project#59818)

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
… to fix deserialization of existing datasets (ray-project#59828)

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
… to fix deserialization of existing datasets (ray-project#59818)

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
… to fix deserialization of existing datasets (ray-project#59828)
