[data] Support Arrow-based transformations for preprocessors by cem-anyscale · Pull Request #59810 · ray-project/ray

cem-anyscale · 2026-01-02T17:54:21Z

Add Arrow-based transformation support in preprocessor base class
Implement Arrow-based transformations for OrdinalEncoder
Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing


gemini-code-assist

Code Review

This pull request introduces support for Arrow-based transformations in preprocessors, which is a great step towards improving performance. The changes are well-structured, with updates to the Preprocessor base class, an Arrow-native implementation for OrdinalEncoder, and a refactoring of the statistics computation logic to handle Arrow formats efficiently. The addition of comprehensive tests, including parameterized tests for both pandas and Arrow paths, is commendable. I've found one minor issue related to some seemingly copy-pasted code that should be addressed. Otherwise, the changes look solid.

gemini-code-assist · 2026-01-02T17:55:41Z

python/ray/data/preprocessors/encoder.py

+    def _encode_list_element(self, element: list, *, column_name: str):
+        ordinal_map = self.stats_[f"unique_values({column_name})"]
+        # If encoding lists, entire column is flattened, hence we map individual
+        # elements inside the list element (of the column)
+        if self.encode_lists:
+            return [ordinal_map.get(x) for x in element]
+
+        return ordinal_map.get(tuple(element))


It seems the _encode_list_element method has been copied from OrdinalEncoder into MultiHotEncoder. This method appears to be unused within MultiHotEncoder and its logic is incorrect for multi-hot encoding.

Specifically:

It uses self.encode_lists, which is an attribute of OrdinalEncoder but not MultiHotEncoder.

The implementation returns a list of ordinal mappings, which is the behavior of OrdinalEncoder, not MultiHotEncoder.

This looks like dead code from a copy-paste and could be confusing. I recommend removing this method from MultiHotEncoder.
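To make the behavioral difference concrete, here is a minimal pure-Python sketch. It is not Ray's implementation: the `stats` dictionary, the `genre` column, and both helper functions are hypothetical, and the multi-hot variant assumes a simplified encoding with one count slot per known category.

```python
from collections import Counter

# Hypothetical fitted stats, keyed the same way as in the diff above.
stats = {"unique_values(genre)": {"comedy": 0, "drama": 1, "sci-fi": 2}}


def ordinal_encode_element(element, *, column_name, encode_lists=True):
    """Mirror of the copied method: map each item to its ordinal index."""
    ordinal_map = stats[f"unique_values({column_name})"]
    if encode_lists:
        return [ordinal_map.get(x) for x in element]
    return ordinal_map.get(tuple(element))


def multi_hot_encode_element(element, *, column_name):
    """Assumed multi-hot behavior: one count slot per known category."""
    ordinal_map = stats[f"unique_values({column_name})"]
    counts = Counter(element)
    categories = sorted(ordinal_map, key=ordinal_map.get)  # order by index
    return [counts.get(category, 0) for category in categories]


element = ["drama", "comedy", "drama"]
ordinal_result = ordinal_encode_element(element, column_name="genre")
multi_hot_result = multi_hot_encode_element(element, column_name="genre")
print(ordinal_result)    # [1, 0, 1]
print(multi_hot_result)  # [1, 2, 0]
```

The ordinal output has one entry per input item, while the multi-hot output has one entry per known category, so the copied method cannot serve as a multi-hot encoder as written.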

cursor · 2026-01-02T18:03:29Z

python/ray/data/preprocessors/encoder.py

+        if self.encode_lists:
+            return [ordinal_map.get(x) for x in element]
+
+        return ordinal_map.get(tuple(element))


MultiHotEncoder method references undefined attribute

The newly added MultiHotEncoder._encode_list_element method references self.encode_lists at line 592, but MultiHotEncoder doesn't define this attribute in its __init__. This appears to be copy-pasted from OrdinalEncoder without adaptation. While the method is currently dead code (not called by _transform_pandas), it would raise an AttributeError if ever invoked. The OrdinalEncoder defines encode_lists at line 139, but MultiHotEncoder has no such attribute.
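The failure mode described here can be reproduced in isolation. The sketch below uses two hypothetical stand-in classes (`OrdinalLike` and `MultiHotLike`, not Ray's encoders) to show that a method referencing an attribute the class never sets stays silent until it is actually called.

```python
class OrdinalLike:
    """Stand-in for an encoder that defines `encode_lists` in __init__."""

    def __init__(self, encode_lists: bool = True):
        self.encode_lists = encode_lists
        self.stats_ = {"unique_values(tags)": {"a": 0, "b": 1}}

    def _encode_list_element(self, element: list, *, column_name: str):
        ordinal_map = self.stats_[f"unique_values({column_name})"]
        if self.encode_lists:
            return [ordinal_map.get(x) for x in element]
        return ordinal_map.get(tuple(element))


class MultiHotLike:
    """Stand-in for an encoder whose __init__ never sets `encode_lists`."""

    def __init__(self):
        self.stats_ = {"unique_values(tags)": {"a": 0, "b": 1}}

    # Same function body as OrdinalLike, reused without adaptation.
    _encode_list_element = OrdinalLike._encode_list_element


encoder = MultiHotLike()  # construction succeeds; nothing reads encode_lists yet

try:
    encoder._encode_list_element(["a", "b"], column_name="tags")
    error_message = None
except AttributeError as exc:
    # The error only surfaces on invocation, which is why dead code hides it.
    error_message = str(exc)

print(error_message)
```

Because the error appears only at call time, tests that never exercise the method will not catch it, which supports removing the method instead of leaving it in place.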

raulchen · 2026-01-02T18:14:46Z

python/ray/data/preprocessor.py

+                self._transform_arrow,
+                batch_format="pyarrow",
+                zero_copy_batch=True,
+                **kwargs,


nit: update the error message below to include arrow format.
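As a rough illustration of what this nit asks for, the sketch below picks a batch format from whichever transform methods a subclass implements and names all of them in the error message. `SketchPreprocessor`, `pick_batch_format`, and the priority order are assumptions made for this example; only the `_transform_arrow` name and the `"pyarrow"` batch format come from the diff above.

```python
# Hypothetical mapping from batch format to the method that handles it.
# Dict order doubles as the assumed priority: Arrow first, then pandas, numpy.
_METHOD_BY_FORMAT = {
    "pyarrow": "_transform_arrow",
    "pandas": "_transform_pandas",
    "numpy": "_transform_numpy",
}


class SketchPreprocessor:
    """Illustrative base class; not Ray's Preprocessor."""

    def pick_batch_format(self) -> str:
        for batch_format, method_name in _METHOD_BY_FORMAT.items():
            if callable(getattr(self, method_name, None)):
                return batch_format
        # List every supported method, including the Arrow one.
        expected = ", ".join(f"`{name}`" for name in _METHOD_BY_FORMAT.values())
        raise NotImplementedError(
            f"None of {expected} are implemented for {type(self).__name__}."
        )


class ArrowOnly(SketchPreprocessor):
    def _transform_arrow(self, table):
        return table


class NoTransforms(SketchPreprocessor):
    pass


chosen_format = ArrowOnly().pick_batch_format()

try:
    NoTransforms().pick_batch_format()
    missing_message = None
except NotImplementedError as exc:
    missing_message = str(exc)
```

Building the message from the same mapping used for dispatch keeps the error text from drifting out of date when another format is added.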

raulchen · 2026-01-02T18:28:50Z

python/ray/data/preprocessors/encoder.py

+    def _encode_list_element(self, element: list, *, column_name: str):
+        ordinal_map = self.stats_[f"unique_values({column_name})"]
+        # If encoding lists, entire column is flattened, hence we map individual
+        # elements inside the list element (of the column)
+        if self.encode_lists:
+            return [ordinal_map.get(x) for x in element]
+
+        return ordinal_map.get(tuple(element))


yeah, this seems unused.

kyuds · 2026-01-12T09:42:56Z

just a quick question: what is the rationale behind removing the post_key_fn for AggregateStatSpec? I am currently trying to migrate OrdinalEncoders to use Unique aggregators, but as the post_key_fn is gone, it is impossible for me to differentiate aggregation results between different columns.

…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>

…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com>

…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>

cem-anyscale requested a review from a team as a code owner January 2, 2026 17:54

gemini-code-assist bot reviewed Jan 2, 2026

cursor bot reviewed Jan 2, 2026

fix docs

5e6e3e5

Signed-off-by: cem <cem@anyscale.com>

cem-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Jan 2, 2026

raulchen approved these changes Jan 2, 2026

ray-gardener bot added the "data" label (Ray Data-related issues) Jan 2, 2026

fix circular dependency

426f780

Signed-off-by: cem <cem@anyscale.com>

cem-anyscale force-pushed the cem/arrow branch from c65e2c7 to 426f780 on January 2, 2026 19:02

cem-anyscale added 2 commits January 2, 2026 22:04

fix test

7526f9a

Signed-off-by: cem <cem@anyscale.com>

update test zero_copy_batch is enabled

9aa94fa

Signed-off-by: cem <cem@anyscale.com>

raulchen merged commit e94a52c into master Jan 2, 2026
6 checks passed

raulchen deleted the cem/arrow branch January 2, 2026 21:18

raulchen merged 5 commits into master from cem/arrow

3 participants
