Skip to content

[Data] Move extension types to ray.data#59420

Merged
bveeramani merged 1 commit intoray-project:masterfrom
zzchun:move_type_extensions
Dec 23, 2025
Merged

[Data] Move extension types to ray.data#59420
bveeramani merged 1 commit intoray-project:masterfrom
zzchun:move_type_extensions

Conversation

@zzchun
Copy link
Contributor

@zzchun zzchun commented Dec 13, 2025

Description

Relocate extension types from ray.air.util to ray.data._internal to reduce module dependencies. This includes moving object_extensions and tensor_extensions, updating imports codebase-wide, fixing import errors and test failures, and planning further cleanups to fully decouple ray.data from ray.air

The changes mainly included:

ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions
ray.air.util.object_extensions -> ray.data._internal.object_extensions
ray.air._internal.torch_utils -> ray.data.util.torch_utils
python/ray/air/util/transform_pyarrow.py -> ray/data/_internal/utils/transform_pyarrow.py
ray.air.util -> ray.data.util

Related issues

Closes #59418

@zzchun zzchun requested review from a team as code owners December 13, 2025 04:20
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request moves extension types from ray.air to ray.data._internal, which is a significant and clean refactoring. The changes primarily involve moving files and updating import paths across the codebase. The core logic seems unchanged.

My main concern is that the newly created packages, ray.data._internal.object_extensions and ray.data._internal.tensor_extensions, have empty __init__.py files. This will lead to ImportErrors in various modules that now attempt to import symbols directly from these packages. I've added comments with suggestions to populate these __init__.py files to correctly expose the necessary symbols from their respective submodules.

Additionally, I've identified a few places where multiple import statements from the same module could be consolidated for improved code clarity and maintainability.

Overall, this is a good refactoring, and once the import issues are addressed, it will be a solid improvement.

@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Dec 13, 2025
@zzchun zzchun force-pushed the move_type_extensions branch from 907cfe4 to 6ec4c87 Compare December 13, 2025 08:20
@zzchun zzchun force-pushed the move_type_extensions branch 2 times, most recently from 27e930f to 6711390 Compare December 15, 2025 15:16
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Dec 15, 2025
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ty!

Overall LGTM! Just a couple minor comments about the docstrings

@bveeramani bveeramani changed the title [Data] Move extension types to ray.data [Data] Move extension types to ray.data Dec 16, 2025

from ray.air.util.data_batch_conversion import _unwrap_ndarray_object_type_if_needed
from ray.air.util.tensor_extensions.arrow import get_arrow_extension_tensor_types
from ray.data._internal.tensor_extensions.arrow import get_arrow_extension_tensor_types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move these air modules to data as well? Not sure if this introduces a circular dependency...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think to do this properly:

  1. ray/air should NOT depend on ray.data
  2. ray.data should NOT depend on ray.air

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move these air modules to data as well?

We should, though I don't think we should move all of the modules at once. I think it'll be hard to review.

+1 I think to do this properly:

  1. ray/air should NOT depend on ray.data
  2. ray.data should NOT depend on ray.air

Yeah, the end state I want to work towards is ray.air doesn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect it's fairly manageable... ray air also has various train things, we just need to move the data stuff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that Ray's train and tune modules already depend on air, moving air to data would create a dependency chain where train/tune depend on data.

Does this dependency structure make sense for our codebase?

cc @richardliaw @bveeramani @matthewdeng

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't want to move air to data. We just want to rip out the relevant parts that data depends on in air to data, so that data doesn't depend on air.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So concretely, there should be 63 places in ray/data that need to be changed.

  • 34 of those are ray.air.util.tensor_extensions
  • 6 of those are ray.air._internal.{torch/tensorflow}_utils
  • 10 of those are ray.air.util.data_batch_conversion
  • 4 of those are ray.air.util.transform_pyarrow
  • 3 of them are ray.air.constants
  • 4 is ray.air.util.object_extensions

You'll want to copy all of these parts out of ray.air so that we don't have a dependency from data to air.

Then, some parts of these will become dead, and we can remove it as well without any harm done to other libraries.

(There's 2 missing references but it'll be obvious after the above is addressed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardliaw @bveeramani Could you please help review this pull request when you have a moment? Thank you!

@zzchun zzchun requested review from a team as code owners December 18, 2025 07:21
@zzchun zzchun force-pushed the move_type_extensions branch from 3ac5d50 to a6af757 Compare December 18, 2025 07:30
@aslonnie aslonnie removed the request for review from a team December 18, 2025 12:32
@zzchun zzchun force-pushed the move_type_extensions branch from 89adc30 to 6fd82b2 Compare December 18, 2025 15:08
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing imports break test conftest file

Several critical imports were removed from this file but are still used: import copy (used at line 233 for copy.deepcopy), import os (used at lines 178, 179, 221 for os.path.join and os.mkdir), import time (used at lines 422, 619, 625 for time.perf_counter() and time.sleep()), import ray (used extensively throughout for ray.data, ray.get, @ray.remote, etc.), and from ray._common.test_utils import wait_for_condition (used at lines 639, 659, 712). The from ray.tests.conftest import * does not re-export these module imports, so the test file will fail with NameError at runtime.

python/ray/data/tests/conftest.py#L1-L21

import posixpath
from collections import defaultdict
import numpy as np
import pandas as pd
import pyarrow as pa
import pytest
from ray._private.arrow_utils import get_pyarrow_version
from ray._private.internal_api import get_memory_info_reply, get_state_from_address
from ray.data._internal.tensor_extensions.arrow import ArrowTensorArray
from ray.data.block import BlockExecStats, BlockMetadata
from ray.data.constants import TENSOR_COLUMN_NAME
from ray.data.context import DEFAULT_TARGET_MAX_BLOCK_SIZE, DataContext, ShuffleStrategy
from ray.data.tests.mock_server import * # noqa
# Trigger pytest hook to automatically zip test cluster logs to archive dir on failure
from ray.tests.conftest import * # noqa
from ray.tests.conftest import _ray_start
from ray.util.debug import reset_log_once
from ray.util.state import list_actors

Fix in Cursor Fix in Web


Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Removal of pre-serialization sanitization may cause actor deserialization failures

The removal of DataContextMetadata.from_data_context(data_context) before sending to StatsActor may reintroduce a serialization issue. The old code explicitly converted DataContext to DataContextMetadata "to avoid module dependency issues during Ray's cloudpickle serialization" - this handled cases where DataContext contains custom exception classes from modules unavailable in StatsActor's environment. Now DataContext is sent directly without sanitization, which could cause deserialization failures. The test test_data_context_with_custom_classes_serialization was specifically written to verify this fix works.

python/ray/data/_internal/stats.py#L896-L904

"""
get_or_create_stats_actor().register_dataset.remote(
ray.get_runtime_context().get_job_id(),
dataset_tag,
operator_tags,
topology,
data_context,
)

python/ray/data/_internal/metadata_exporter.py#L150-L161

start_time: The timestamp when the dataset is registered.
data_context: The DataContext attached to the dataset.
execution_start_time: The timestamp when the dataset execution starts.
execution_end_time: The timestamp when the dataset execution ends.
state: The state of the dataset.
"""
job_id: str
topology: Topology
dataset_id: str
start_time: float
data_context: DataContext

Fix in Cursor Fix in Web


@zzchun zzchun force-pushed the move_type_extensions branch 2 times, most recently from 95379fe to b6e8cc3 Compare December 19, 2025 02:34
data_context=DataContextMetadata.from_data_context(
DataContext.get_current()
),
data_context=DataContext.get_current(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Test passes wrong type to register_dataset method

The test file removes the DataContextMetadata import and changes calls from DataContextMetadata.from_data_context(DataContext.get_current()) to just DataContext.get_current(). However, the _StatsActor.register_dataset method signature expects data_context: DataContextMetadata, and creates a DatasetMetadata object that also requires this type. When export_dataset_metadata tries to access data_context.config, it will fail because DataContext doesn't have a config attribute—only DataContextMetadata does.

Additional Locations (2)

Fix in Cursor Fix in Web

@zzchun zzchun force-pushed the move_type_extensions branch from 4a6541f to fc60e97 Compare December 19, 2025 03:54
ray.air.util.objection_extensions -> ray.data._internal.object_extensions
ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions

Signed-off-by: will <zzchun8@gmail.com>
@zzchun zzchun force-pushed the move_type_extensions branch from 4eb671e to d7a4f76 Compare December 22, 2025 11:54
@zzchun
Copy link
Contributor Author

zzchun commented Dec 23, 2025

/cc @richardliaw @bveeramani

👋 Could you please take a look at this PR when you have time?

The PR has been facing frequent merge conflicts with recently merged changes

Thank you!

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! ty for the contribution!

from ray.data._internal.pandas_block import PandasBlockSchema


def convert_ndarray_to_tf_tensor(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is now duplicated across ray.air and ray.data, but its not dead code.

In a follow-up, I think we should remove this and update the reference in tensorflow_predictor to use the one defined in ray.data.

return result


def convert_ndarray_batch_to_torch_tensor_batch(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we use this anywhere in Ray Data. I think we should remove this in a follow-up

return batch


def load_torch_model(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

# Not present in torch<=1.7.0
# Adapted from https://github.com/pytorch/pytorch/blob/\
# c18da597e0bb1c1aecc97c77a73fed1849057fa4/torch/nn/modules/utils.py
def consume_prefix_in_state_dict_if_present_not_in_place(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@bveeramani
Copy link
Member

@zzchun would you mind updating the PR description with the latest changes?

return data
elif pyarrow is not None and isinstance(data, pyarrow.Table):
from ray.air.util.tensor_extensions.arrow import (
from ray.data._internal.arrow_ops import transform_pyarrow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this still importing from data? Should this util also be moved to data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you already created a duplicate in python/ray/data, should we just delete this file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, the reason this is here is because this utility is still dependent from ray.train.predictors, so this is a duplicate. How about let's just address this in a later PR when we remove predictors?

@richardliaw

This comment was marked as off-topic.

@richardliaw
Copy link
Contributor

Actually nevermind, let's just review here.

@bveeramani bveeramani merged commit 8ca063f into ray-project:master Dec 23, 2025
6 checks passed
@bveeramani
Copy link
Member

@zzchun I merged the PR to avoid merge conflicts, but pls update the PR description when you get a chance so it's representative of the actual diff!

@zzchun zzchun deleted the move_type_extensions branch December 23, 2025 09:02
@zzchun
Copy link
Contributor Author

zzchun commented Dec 23, 2025

@bveeramani Thanks for the review. I've updated the description.

I will submit a follow-up PR shortly to address all reviewer's feedback.

@zzchun zzchun mentioned this pull request Dec 23, 2025
bveeramani added a commit that referenced this pull request Dec 27, 2025
## Description

Remove dead code in `python/ray/data/util/data_batch_conversion.py` and
`python/ray/data/util/torch_utils.py`,
which is related to PR: #59420 


## Related issues

Related to #59420.

Signed-off-by: will <zzchun8@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
bveeramani pushed a commit that referenced this pull request Jan 2, 2026
… to fix deserialization of existing datasets (#59818)

## Description

Context:
[Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

#59420 moved Ray Data's Arrow tensor extensions from
`ray.air.util.tensor_extensions` to
`ray.data._internal.tensor_extensions`.

That actually broke deserialization of the datasets written with older
Ray Data implementation of these extensions inheriting from
`pyarrow.PyExtensionType`:

1. `PyEtensionType` pickles class-ref into the metadata when writing the
data (in that case it's
`ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex)
2. Upon reading the data it tries to unpickle it and now fails b/c these
classes were moved.

## Related issues
> Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
## Description

Move extension types to ray.data

ray.air.util.objection_extensions ->
ray.data._internal.object_extensions
ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions

## Related issues
Closes ray-project#59418

Signed-off-by: will <zzchun8@gmail.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
## Description

Remove dead code in `python/ray/data/util/data_batch_conversion.py` and
`python/ray/data/util/torch_utils.py`,
which is related to PR: ray-project#59420

## Related issues

Related to ray-project#59420.

Signed-off-by: will <zzchun8@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
… to fix deserialization of existing datasets (ray-project#59818)

## Description

Context:
[Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

ray-project#59420 moved Ray Data's Arrow tensor extensions from
`ray.air.util.tensor_extensions` to
`ray.data._internal.tensor_extensions`.

That actually broke deserialization of the datasets written with older
Ray Data implementation of these extensions inheriting from
`pyarrow.PyExtensionType`:

1. `PyEtensionType` pickles class-ref into the metadata when writing the
data (in that case it's
`ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex)
2. Upon reading the data it tries to unpickle it and now fails b/c these
classes were moved.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
## Description

Move extension types to ray.data

ray.air.util.objection_extensions ->
ray.data._internal.object_extensions
ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions

## Related issues
Closes ray-project#59418

Signed-off-by: will <zzchun8@gmail.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
## Description

Remove dead code in `python/ray/data/util/data_batch_conversion.py` and
`python/ray/data/util/torch_utils.py`,
which is related to PR: ray-project#59420 


## Related issues

Related to ray-project#59420.

Signed-off-by: will <zzchun8@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
… to fix deserialization of existing datasets (ray-project#59818)

## Description

Context:
[Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

ray-project#59420 moved Ray Data's Arrow tensor extensions from
`ray.air.util.tensor_extensions` to
`ray.data._internal.tensor_extensions`.

That actually broke deserialization of the datasets written with older
Ray Data implementation of these extensions inheriting from
`pyarrow.PyExtensionType`:

1. `PyEtensionType` pickles class-ref into the metadata when writing the
data (in that case it's
`ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex)
2. Upon reading the data it tries to unpickle it and now fails b/c these
classes were moved.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
… to fix deserialization of existing datasets (ray-project#59818)

## Description

Context:
[Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

ray-project#59420 moved Ray Data's Arrow tensor extensions from
`ray.air.util.tensor_extensions` to
`ray.data._internal.tensor_extensions`.

That actually broke deserialization of the datasets written with older
Ray Data implementation of these extensions inheriting from
`pyarrow.PyExtensionType`:

1. `PyEtensionType` pickles class-ref into the metadata when writing the
data (in that case it's
`ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex)
2. Upon reading the data it tries to unpickle it and now fails b/c these
classes were moved.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
## Description

Move extension types to ray.data

ray.air.util.objection_extensions ->
ray.data._internal.object_extensions
ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions

## Related issues
Closes ray-project#59418

Signed-off-by: will <zzchun8@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
## Description

Remove dead code in `python/ray/data/util/data_batch_conversion.py` and
`python/ray/data/util/torch_utils.py`,
which is related to PR: ray-project#59420

## Related issues

Related to ray-project#59420.

Signed-off-by: will <zzchun8@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
… to fix deserialization of existing datasets (ray-project#59818)

## Description

Context:
[Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

ray-project#59420 moved Ray Data's Arrow tensor extensions from
`ray.air.util.tensor_extensions` to
`ray.data._internal.tensor_extensions`.

That actually broke deserialization of the datasets written with older
Ray Data implementation of these extensions inheriting from
`pyarrow.PyExtensionType`:

1. `PyEtensionType` pickles class-ref into the metadata when writing the
data (in that case it's
`ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex)
2. Upon reading the data it tries to unpickle it and now fails b/c these
classes were moved.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
… to fix deserialization of existing datasets (ray-project#59818)

## Description

Context:
[Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189)

ray-project#59420 moved Ray Data's Arrow tensor extensions from
`ray.air.util.tensor_extensions` to
`ray.data._internal.tensor_extensions`.

That actually broke deserialization of the datasets written with older
Ray Data implementation of these extensions inheriting from
`pyarrow.PyExtensionType`:

1. `PyEtensionType` pickles class-ref into the metadata when writing the
data (in that case it's
`ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex)
2. Upon reading the data it tries to unpickle it and now fails b/c these
classes were moved.

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Move type extensions from ray.air to ray.data

4 participants