Skip to content

[data] Support Arrow-based transformations for preprocessors#59810

Merged
raulchen merged 5 commits intomasterfrom
cem/arrow
Jan 2, 2026
Merged

[data] Support Arrow-based transformations for preprocessors#59810
raulchen merged 5 commits intomasterfrom
cem/arrow

Conversation

@cem-anyscale
Copy link
Contributor

  • Add Arrow-based transformation support in preprocessor base class
  • Implement Arrow-based transformations for OrdinalEncoder
  • Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing

- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing

Signed-off-by: cem <cem@anyscale.com>
@cem-anyscale cem-anyscale requested a review from a team as a code owner January 2, 2026 17:54
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Arrow-based transformations in preprocessors, which is a great step towards improving performance. The changes are well-structured, with updates to the Preprocessor base class, an Arrow-native implementation for OrdinalEncoder, and a refactoring of the statistics computation logic to handle Arrow formats efficiently. The addition of comprehensive tests, including parameterized tests for both pandas and Arrow paths, is commendable. I've found one minor issue related to some seemingly copy-pasted code that should be addressed. Otherwise, the changes look solid.

Comment on lines +588 to +595
def _encode_list_element(self, element: list, *, column_name: str):
ordinal_map = self.stats_[f"unique_values({column_name})"]
# If encoding lists, entire column is flattened, hence we map individual
# elements inside the list element (of the column)
if self.encode_lists:
return [ordinal_map.get(x) for x in element]

return ordinal_map.get(tuple(element))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It seems the _encode_list_element method has been copied from OrdinalEncoder into MultiHotEncoder. This method appears to be unused within MultiHotEncoder and its logic is incorrect for multi-hot encoding.

Specifically:

  • It uses self.encode_lists, which is an attribute of OrdinalEncoder but not MultiHotEncoder.
  • The implementation returns a list of ordinal mappings, which is the behavior of OrdinalEncoder, not MultiHotEncoder.

This looks like dead code from a copy-paste and could be confusing. I recommend removing this method from MultiHotEncoder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this seems unused.

if self.encode_lists:
return [ordinal_map.get(x) for x in element]

return ordinal_map.get(tuple(element))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiHotEncoder method references undefined attribute

The newly added MultiHotEncoder._encode_list_element method references self.encode_lists at line 592, but MultiHotEncoder doesn't define this attribute in its __init__. This appears to be copy-pasted from OrdinalEncoder without adaptation. While the method is currently dead code (not called by _transform_pandas), it would raise an AttributeError if ever invoked. The OrdinalEncoder defines encode_lists at line 139, but MultiHotEncoder has no such attribute.

Fix in Cursor Fix in Web

Signed-off-by: cem <cem@anyscale.com>
@cem-anyscale cem-anyscale added the go add ONLY when ready to merge, run all tests label Jan 2, 2026
self._transform_arrow,
batch_format="pyarrow",
zero_copy_batch=True,
**kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: update the error message below to include arrow format.

Comment on lines +588 to +595
def _encode_list_element(self, element: list, *, column_name: str):
ordinal_map = self.stats_[f"unique_values({column_name})"]
# If encoding lists, entire column is flattened, hence we map individual
# elements inside the list element (of the column)
if self.encode_lists:
return [ordinal_map.get(x) for x in element]

return ordinal_map.get(tuple(element))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this seems unused.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Jan 2, 2026
Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: cem <cem@anyscale.com>
@raulchen raulchen merged commit e94a52c into master Jan 2, 2026
6 checks passed
@raulchen raulchen deleted the cem/arrow branch January 2, 2026 21:18
@kyuds
Copy link
Member

kyuds commented Jan 12, 2026

just a quick question: what is the rationale behind removing the post_key_fn for AggregateStatSpec? I am currently trying to migrate OrdinalEncoders to use Unique aggregators, but as the post_key_fn is gone, it is impossible for me to differentiate aggregation results between different columns.

AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
…ject#59810)

- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for
Arrow post-processing

---------

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
…ject#59810)

- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for
Arrow post-processing

---------

Signed-off-by: cem <cem@anyscale.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
…ject#59810)

- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for
Arrow post-processing

---------

Signed-off-by: cem <cem@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ject#59810)

- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for
Arrow post-processing

---------

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ject#59810)

- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for
Arrow post-processing

---------

Signed-off-by: cem <cem@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants