[Data] Move extension types to ray.data#59420
Conversation
There was a problem hiding this comment.
Code Review
This pull request moves extension types from ray.air to ray.data._internal, which is a significant and clean refactoring. The changes primarily involve moving files and updating import paths across the codebase. The core logic seems unchanged.
My main concern is that the newly created packages, ray.data._internal.object_extensions and ray.data._internal.tensor_extensions, have empty __init__.py files. This will lead to ImportErrors in various modules that now attempt to import symbols directly from these packages. I've added comments with suggestions to populate these __init__.py files to correctly expose the necessary symbols from their respective submodules.
Additionally, I've identified a few places where multiple import statements from the same module could be consolidated for improved code clarity and maintainability.
Overall, this is a good refactoring, and once the import issues are addressed, it will be a solid improvement.
907cfe4 to
6ec4c87
Compare
27e930f to
6711390
Compare
bveeramani
left a comment
There was a problem hiding this comment.
Ty!
Overall LGTM! Just a couple minor comments about the docstrings
ray.data
|
|
||
| from ray.air.util.data_batch_conversion import _unwrap_ndarray_object_type_if_needed | ||
| from ray.air.util.tensor_extensions.arrow import get_arrow_extension_tensor_types | ||
| from ray.data._internal.tensor_extensions.arrow import get_arrow_extension_tensor_types |
There was a problem hiding this comment.
Should we move these air modules to data as well? Not sure if this introduces a circular dependency...
There was a problem hiding this comment.
+1 I think to do this properly:
- ray/air should NOT depend on ray.data
- ray.data should NOT depend on ray.air
There was a problem hiding this comment.
Should we move these air modules to data as well?
We should, though I don't think we should move all of the modules at once. I think it'll be hard to review.
+1 I think to do this properly:
- ray/air should NOT depend on ray.data
- ray.data should NOT depend on ray.air
Yeah, the end state I want to work towards is ray.air doesn't exist.
There was a problem hiding this comment.
I suspect it's fairly manageable... ray air also has various train things, we just need to move the data stuff
There was a problem hiding this comment.
Given that Ray's train and tune modules already depend on air, moving air to data would create a dependency chain where train/tune depend on data.
Does this dependency structure make sense for our codebase?
There was a problem hiding this comment.
No, we don't want to move air to data. We just want to rip out the relevant parts that data depends on in air to data, so that data doesn't depend on air.
There was a problem hiding this comment.
So concretely, there should be 63 places in ray/data that need to be changed.
- 34 of those are
ray.air.util.tensor_extensions - 6 of those are
ray.air._internal.{torch/tensorflow}_utils - 10 of those are
ray.air.util.data_batch_conversion - 4 of those are
ray.air.util.transform_pyarrow - 3 of them are
ray.air.constants - 4 is
ray.air.util.object_extensions
You'll want to copy all of these parts out of ray.air so that we don't have a dependency from data to air.
Then, some parts of these will become dead, and we can remove it as well without any harm done to other libraries.
(There's 2 missing references but it'll be obvious after the above is addressed)
There was a problem hiding this comment.
@richardliaw @bveeramani Could you please help review this pull request when you have a moment? Thank you!
3ac5d50 to
a6af757
Compare
89adc30 to
6fd82b2
Compare
There was a problem hiding this comment.
Bug: Missing imports break test conftest file
Several critical imports were removed from this file but are still used: import copy (used at line 233 for copy.deepcopy), import os (used at lines 178, 179, 221 for os.path.join and os.mkdir), import time (used at lines 422, 619, 625 for time.perf_counter() and time.sleep()), import ray (used extensively throughout for ray.data, ray.get, @ray.remote, etc.), and from ray._common.test_utils import wait_for_condition (used at lines 639, 659, 712). The from ray.tests.conftest import * does not re-export these module imports, so the test file will fail with NameError at runtime.
python/ray/data/tests/conftest.py#L1-L21
ray/python/ray/data/tests/conftest.py
Lines 1 to 21 in 6fd82b2
There was a problem hiding this comment.
Bug: Removal of pre-serialization sanitization may cause actor deserialization failures
The removal of DataContextMetadata.from_data_context(data_context) before sending to StatsActor may reintroduce a serialization issue. The old code explicitly converted DataContext to DataContextMetadata "to avoid module dependency issues during Ray's cloudpickle serialization" - this handled cases where DataContext contains custom exception classes from modules unavailable in StatsActor's environment. Now DataContext is sent directly without sanitization, which could cause deserialization failures. The test test_data_context_with_custom_classes_serialization was specifically written to verify this fix works.
python/ray/data/_internal/stats.py#L896-L904
ray/python/ray/data/_internal/stats.py
Lines 896 to 904 in 67dc8f8
python/ray/data/_internal/metadata_exporter.py#L150-L161
ray/python/ray/data/_internal/metadata_exporter.py
Lines 150 to 161 in 67dc8f8
95379fe to
b6e8cc3
Compare
| data_context=DataContextMetadata.from_data_context( | ||
| DataContext.get_current() | ||
| ), | ||
| data_context=DataContext.get_current(), |
There was a problem hiding this comment.
Bug: Test passes wrong type to register_dataset method
The test file removes the DataContextMetadata import and changes calls from DataContextMetadata.from_data_context(DataContext.get_current()) to just DataContext.get_current(). However, the _StatsActor.register_dataset method signature expects data_context: DataContextMetadata, and creates a DatasetMetadata object that also requires this type. When export_dataset_metadata tries to access data_context.config, it will fail because DataContext doesn't have a config attribute—only DataContextMetadata does.
Additional Locations (2)
4a6541f to
fc60e97
Compare
ray.air.util.objection_extensions -> ray.data._internal.object_extensions ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions Signed-off-by: will <zzchun8@gmail.com>
4eb671e to
d7a4f76
Compare
|
👋 Could you please take a look at this PR when you have time? The PR has been facing frequent merge conflicts with recently merged changes Thank you! |
bveeramani
left a comment
There was a problem hiding this comment.
LGTM! ty for the contribution!
| from ray.data._internal.pandas_block import PandasBlockSchema | ||
|
|
||
|
|
||
| def convert_ndarray_to_tf_tensor( |
There was a problem hiding this comment.
This function is now duplicated across ray.air and ray.data, but its not dead code.
In a follow-up, I think we should remove this and update the reference in tensorflow_predictor to use the one defined in ray.data.
| return result | ||
|
|
||
|
|
||
| def convert_ndarray_batch_to_torch_tensor_batch( |
There was a problem hiding this comment.
I don't think we use this anywhere in Ray Data. I think we should remove this in a follow-up
| return batch | ||
|
|
||
|
|
||
| def load_torch_model( |
| # Not present in torch<=1.7.0 | ||
| # Adapted from https://github.com/pytorch/pytorch/blob/\ | ||
| # c18da597e0bb1c1aecc97c77a73fed1849057fa4/torch/nn/modules/utils.py | ||
| def consume_prefix_in_state_dict_if_present_not_in_place( |
|
@zzchun would you mind updating the PR description with the latest changes? |
| return data | ||
| elif pyarrow is not None and isinstance(data, pyarrow.Table): | ||
| from ray.air.util.tensor_extensions.arrow import ( | ||
| from ray.data._internal.arrow_ops import transform_pyarrow |
There was a problem hiding this comment.
Why is this still importing from data? Should this util also be moved to data?
There was a problem hiding this comment.
Yeah I think you already created a duplicate in python/ray/data, should we just delete this file
There was a problem hiding this comment.
oh, the reason this is here is because this utility is still dependent from ray.train.predictors, so this is a duplicate. How about let's just address this in a later PR when we remove predictors?
This comment was marked as off-topic.
This comment was marked as off-topic.
|
Actually nevermind, let's just review here. |
|
@zzchun I merged the PR to avoid merge conflicts, but pls update the PR description when you get a chance so it's representative of the actual diff! |
|
@bveeramani Thanks for the review. I've updated the description. I will submit a follow-up PR shortly to address all reviewer's feedback. |
… to fix deserialization of existing datasets (#59818) ## Description Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189) #59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That actually broke deserialization of the datasets written with older Ray Data implementation of these extensions inheriting from `pyarrow.PyExtensionType`: 1. `PyEtensionType` pickles class-ref into the metadata when writing the data (in that case it's `ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex) 2. Upon reading the data it tries to unpickle it and now fails b/c these classes were moved. ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description Move extension types to ray.data ray.air.util.objection_extensions -> ray.data._internal.object_extensions ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions ## Related issues Closes ray-project#59418 Signed-off-by: will <zzchun8@gmail.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
## Description Remove dead code in `python/ray/data/util/data_batch_conversion.py` and `python/ray/data/util/torch_utils.py`, which is related to PR: ray-project#59420 ## Related issues Related to ray-project#59420. Signed-off-by: will <zzchun8@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
… to fix deserialization of existing datasets (ray-project#59818) ## Description Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189) ray-project#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That actually broke deserialization of the datasets written with older Ray Data implementation of these extensions inheriting from `pyarrow.PyExtensionType`: 1. `PyEtensionType` pickles class-ref into the metadata when writing the data (in that case it's `ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex) 2. Upon reading the data it tries to unpickle it and now fails b/c these classes were moved. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
## Description Move extension types to ray.data ray.air.util.objection_extensions -> ray.data._internal.object_extensions ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions ## Related issues Closes ray-project#59418 Signed-off-by: will <zzchun8@gmail.com>
## Description Remove dead code in `python/ray/data/util/data_batch_conversion.py` and `python/ray/data/util/torch_utils.py`, which is related to PR: ray-project#59420 ## Related issues Related to ray-project#59420. Signed-off-by: will <zzchun8@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
… to fix deserialization of existing datasets (ray-project#59818) ## Description Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189) ray-project#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That actually broke deserialization of the datasets written with older Ray Data implementation of these extensions inheriting from `pyarrow.PyExtensionType`: 1. `PyEtensionType` pickles class-ref into the metadata when writing the data (in that case it's `ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex) 2. Upon reading the data it tries to unpickle it and now fails b/c these classes were moved. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
… to fix deserialization of existing datasets (ray-project#59818) ## Description Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189) ray-project#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That actually broke deserialization of the datasets written with older Ray Data implementation of these extensions inheriting from `pyarrow.PyExtensionType`: 1. `PyEtensionType` pickles class-ref into the metadata when writing the data (in that case it's `ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex) 2. Upon reading the data it tries to unpickle it and now fails b/c these classes were moved. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description Move extension types to ray.data ray.air.util.objection_extensions -> ray.data._internal.object_extensions ray.air.util.tensor_extensions -> ray.data._internal.tensor_extensions ## Related issues Closes ray-project#59418 Signed-off-by: will <zzchun8@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
## Description Remove dead code in `python/ray/data/util/data_batch_conversion.py` and `python/ray/data/util/torch_utils.py`, which is related to PR: ray-project#59420 ## Related issues Related to ray-project#59420. Signed-off-by: will <zzchun8@gmail.com> Co-authored-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
… to fix deserialization of existing datasets (ray-project#59818) ## Description Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189) ray-project#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That actually broke deserialization of the datasets written with older Ray Data implementation of these extensions inheriting from `pyarrow.PyExtensionType`: 1. `PyEtensionType` pickles class-ref into the metadata when writing the data (in that case it's `ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex) 2. Upon reading the data it tries to unpickle it and now fails b/c these classes were moved. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
… to fix deserialization of existing datasets (ray-project#59818) ## Description Context: [Slack](https://anyscaleteam.slack.com/archives/C04FMM4NPQ9/p1767322231131189) ray-project#59420 moved Ray Data's Arrow tensor extensions from `ray.air.util.tensor_extensions` to `ray.data._internal.tensor_extensions`. That actually broke deserialization of the datasets written with older Ray Data implementation of these extensions inheriting from `pyarrow.PyExtensionType`: 1. `PyEtensionType` pickles class-ref into the metadata when writing the data (in that case it's `ray.air.util.tensor_extensions.arrow.ArrowTensorType` for ex) 2. Upon reading the data it tries to unpickle it and now fails b/c these classes were moved. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Relocate extension types from
ray.air.utiltoray.data._internalto reduce module dependencies. This includes movingobject_extensionsandtensor_extensions, updating imports codebase-wide, fixing import errors and test failures, and planning further cleanups to fully decoupleray.datafromray.airThe changes mainly included:
ray.air.util.tensor_extensions->ray.data._internal.tensor_extensionsray.air.util.object_extensions->ray.data._internal.object_extensionsray.air._internal.torch_utils->ray.data.util.torch_utilspython/ray/air/util/transform_pyarrow.py->ray/data/_internal/utils/transform_pyarrow.pyray.air.util -> ray.data.utilRelated issues
Closes #59418