Conversation
cem-anyscale
commented
Jan 2, 2026
- Add Arrow-based transformation support in preprocessor base class
- Implement Arrow-based transformations for OrdinalEncoder
- Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing
- Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing Signed-off-by: cem <cem@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces support for Arrow-based transformations in preprocessors, which is a great step towards improving performance. The changes are well-structured, with updates to the Preprocessor base class, an Arrow-native implementation for OrdinalEncoder, and a refactoring of the statistics computation logic to handle Arrow formats efficiently. The addition of comprehensive tests, including parameterized tests for both pandas and Arrow paths, is commendable. I've found one minor issue related to some seemingly copy-pasted code that should be addressed. Otherwise, the changes look solid.
| def _encode_list_element(self, element: list, *, column_name: str): | ||
| ordinal_map = self.stats_[f"unique_values({column_name})"] | ||
| # If encoding lists, entire column is flattened, hence we map individual | ||
| # elements inside the list element (of the column) | ||
| if self.encode_lists: | ||
| return [ordinal_map.get(x) for x in element] | ||
|
|
||
| return ordinal_map.get(tuple(element)) |
There was a problem hiding this comment.
It seems the _encode_list_element method has been copied from OrdinalEncoder into MultiHotEncoder. This method appears to be unused within MultiHotEncoder and its logic is incorrect for multi-hot encoding.
Specifically:
- It uses
self.encode_lists, which is an attribute ofOrdinalEncoderbut notMultiHotEncoder. - The implementation returns a list of ordinal mappings, which is the behavior of
OrdinalEncoder, notMultiHotEncoder.
This looks like dead code from a copy-paste and could be confusing. I recommend removing this method from MultiHotEncoder.
| if self.encode_lists: | ||
| return [ordinal_map.get(x) for x in element] | ||
|
|
||
| return ordinal_map.get(tuple(element)) |
There was a problem hiding this comment.
MultiHotEncoder method references undefined attribute
The newly added MultiHotEncoder._encode_list_element method references self.encode_lists at line 592, but MultiHotEncoder doesn't define this attribute in its __init__. This appears to be copy-pasted from OrdinalEncoder without adaptation. While the method is currently dead code (not called by _transform_pandas), it would raise an AttributeError if ever invoked. The OrdinalEncoder defines encode_lists at line 139, but MultiHotEncoder has no such attribute.
| self._transform_arrow, | ||
| batch_format="pyarrow", | ||
| zero_copy_batch=True, | ||
| **kwargs, |
There was a problem hiding this comment.
nit: update the error message below to include arrow format.
| def _encode_list_element(self, element: list, *, column_name: str): | ||
| ordinal_map = self.stats_[f"unique_values({column_name})"] | ||
| # If encoding lists, entire column is flattened, hence we map individual | ||
| # elements inside the list element (of the column) | ||
| if self.encode_lists: | ||
| return [ordinal_map.get(x) for x in element] | ||
|
|
||
| return ordinal_map.get(tuple(element)) |
Signed-off-by: cem <cem@anyscale.com>
c65e2c7 to
426f780
Compare
Signed-off-by: cem <cem@anyscale.com>
|
just a quick question: what is the rationale behind removing the |
…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com>
…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com>
…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…ject#59810) - Add Arrow-based transformation support in preprocessor base class - Implement Arrow-based transformations for OrdinalEncoder - Add batch_format parameter to StatComputationPlan.add_aggregator for Arrow post-processing --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>