[Data] UDF Expression Support for with_column#55788
[Data] UDF Expression Support for with_column#55788richardliaw merged 33 commits intoray-project:masterfrom
Conversation
Signed-off-by: Goutam V <goutam@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a significant enhancement to Ray Data by adding support for User-Defined Functions (UDFs) within with_column expressions. The implementation is well-designed, introducing a UDFExpr for representing UDFs and a DataType class for better type handling. The logic to dynamically switch between a Project operator and map_batches based on the presence of a UDF and a batch_size is a smart approach. The tests are thorough and cover a wide range of scenarios.
My review includes a few suggestions for improvement:
- The docstring for
with_columnshould be updated to document the new functionality. - A potential correctness issue in the
__hash__implementation of the newDataTypeclass needs to be addressed. - An enhancement is proposed for the
DataTypeclass to improve Python-to-Arrow type conversion.
Overall, this is a valuable contribution that greatly increases the power and flexibility of Ray Data's expression API.
| if "a" in data[0] and "b" in data[0]: | ||
| ds_with_udf = ds.with_column(column_name, udf_fn(col("a"), col("b"))) | ||
| elif "x" in data[0] and "y" in data[0]: | ||
| ds_with_udf = ds.with_column(column_name, udf_fn(col("x"), col("y"))) | ||
| else: # first/last name scenario | ||
| ds_with_udf = ds.with_column(column_name, udf_fn(col("first"), col("last"))) |
There was a problem hiding this comment.
The logic to apply the UDF based on column names inside test_with_column_udf_multi_column can be simplified. Consider moving the UDF application logic into the test_scenario parametrization. For example, you could add a columns key to your test_scenario dicts (e.g., "columns": ["a", "b"]) and then simplify this block to cols_to_use = [col(c) for c in test_scenario["columns"]]; ds_with_udf = ds.with_column(column_name, udf_fn(*cols_to_use)). This would make the test body cleaner and the different test cases more explicit.
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
omatthew98
left a comment
There was a problem hiding this comment.
Small comment, overall lgtm.
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
…scale/ray into goutam/udf_expr
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
| assert set(ds.schema().names) == {"id", "plus_one", "times_two", "ten_minus_id"} | ||
|
|
||
|
|
||
| @pytest.mark.skipif( |
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Gang Zhao <gang@gang-JQ62HD2C37.local>
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Original PR #55788 by goutamvenkat-anyscale Original: ray-project/ray#55788
Merged from original PR #55788 Original: ray-project/ray#55788
Original PR #55788 by goutamvenkat-anyscale Original: ray-project/ray#55788
Merged from original PR #55788 Original: ray-project/ray#55788
Original PR #55788 by goutamvenkat-anyscale Original: ray-project/ray#55788
Merged from original PR #55788 Original: ray-project/ray#55788
Why are these changes needed?
This adds support for UDF as expressions into
with_column.Since this UDFExpr is designed for batches of data each parameter represents a
pyarrow.Array.Example usage:
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.