[Data] UDF Expression Support for with_column #55788

Merged
richardliaw merged 33 commits into ray-project:master from
goutamvenkat-anyscale:goutam/udf_expr
Aug 29, 2025

@goutamvenkat-anyscale goutamvenkat-anyscale commented Aug 20, 2025

Why are these changes needed?

This adds support for using UDFs as expressions in with_column.

Because UDFExpr is designed to operate on batches of data, each parameter of the UDF represents a pyarrow.Array.

Example usage:

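import pyarrow as pa
import pyarrow.compute as pc
import ray
from ray.data.expressions import col, udf


# Each parameter receives a whole batch column as a pyarrow.Array.
@udf()
def add_one(x: pa.Array) -> pa.Array:
    return pc.add(x, 1)

ds = ray.data.range(5)  # sample dataset with a single "id" column
ds = ds.with_column("id_plus_1", add_one(col("id")))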

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Goutam V <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner August 20, 2025 18:31
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant enhancement to Ray Data by adding support for User-Defined Functions (UDFs) within with_column expressions. The implementation is well-designed, introducing a UDFExpr for representing UDFs and a DataType class for better type handling. The logic to dynamically switch between a Project operator and map_batches based on the presence of a UDF and a batch_size is a smart approach. The tests are thorough and cover a wide range of scenarios.
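
The switching logic mentioned above is not reproduced in this thread, so the following is only a sketch of how such a dispatch could look. Every name in it (`Col`, `BinaryOp`, `UDFCall`, `contains_udf`, `choose_operator`) is hypothetical, and the rule itself (use map_batches only when the expression contains a UDF and a batch_size is given) is an assumption, not the merged implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class BinaryOp:
    left: object
    right: object


@dataclass(frozen=True)
class UDFCall:
    args: Tuple[object, ...]


def contains_udf(expr: object) -> bool:
    """Walk the (hypothetical) expression tree looking for a UDF node."""
    if isinstance(expr, UDFCall):
        return True
    if isinstance(expr, BinaryOp):
        return contains_udf(expr.left) or contains_udf(expr.right)
    return False


def choose_operator(expr: object, batch_size: Optional[int]) -> str:
    # ASSUMED rule: a UDF plus an explicit batch_size goes through map_batches;
    # everything else stays on the Project operator.
    if contains_udf(expr) and batch_size is not None:
        return "map_batches"
    return "project"


plain = BinaryOp(Col("a"), Col("b"))
with_udf = BinaryOp(Col("a"), UDFCall((Col("b"),)))
print(choose_operator(plain, None), choose_operator(with_udf, 32))
```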

My review includes a few suggestions for improvement:

  • The docstring for with_column should be updated to document the new functionality.
  • A potential correctness issue in the __hash__ implementation of the new DataType class needs to be addressed.
  • An enhancement is proposed for the DataType class to improve Python-to-Arrow type conversion.

Overall, this is a valuable contribution that greatly increases the power and flexibility of Ray Data's expression API.
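
The thread does not reproduce the __hash__ problem flagged above, so the following is only a sketch of the general invariant a hashable type wrapper has to satisfy: instances that compare equal must also hash equal. `SketchDataType` is a hypothetical stand-in, not Ray's DataType, and it wraps a type-name string rather than an Arrow type.

```python
class SketchDataType:
    """Hypothetical wrapper; equality and hashing share one normalized key."""

    def __init__(self, type_name: str):
        self.type_name = type_name

    def _key(self) -> str:
        # Single source of truth for both __eq__ and __hash__.
        return self.type_name.lower()

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SketchDataType):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self) -> int:
        return hash(self._key())


a = SketchDataType("Int64")
b = SketchDataType("int64")
lookup = {a: "first"}
print(a == b, hash(a) == hash(b), lookup[b])
```

Deriving both methods from the same key is one way to keep them from drifting apart when the class later gains fields.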

Comment on lines +2503 to +2508
if "a" in data[0] and "b" in data[0]:
    ds_with_udf = ds.with_column(column_name, udf_fn(col("a"), col("b")))
elif "x" in data[0] and "y" in data[0]:
    ds_with_udf = ds.with_column(column_name, udf_fn(col("x"), col("y")))
else:  # first/last name scenario
    ds_with_udf = ds.with_column(column_name, udf_fn(col("first"), col("last")))

medium

The logic to apply the UDF based on column names inside test_with_column_udf_multi_column can be simplified. Consider moving the UDF application logic into the test_scenario parametrization. For example, you could add a columns key to your test_scenario dicts (e.g., "columns": ["a", "b"]) and then simplify this block to cols_to_use = [col(c) for c in test_scenario["columns"]]; ds_with_udf = ds.with_column(column_name, udf_fn(*cols_to_use)). This would make the test body cleaner and the different test cases more explicit.
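
The suggested refactor can be sketched in plain Python, with dicts standing in for dataset rows and ordinary functions standing in for the UDFs. The scenario contents and the helper name `apply_udf` are hypothetical; in the real test the helper body would be the one-liner from the comment, `ds.with_column(column_name, udf_fn(*cols_to_use))`, and the scenario list would feed the test's parametrization.

```python
# Each scenario names the columns its UDF consumes, so the test body
# no longer needs to inspect the data to decide which columns to pass.
TEST_SCENARIOS = [
    {
        "data": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
        "columns": ["a", "b"],
        "udf": lambda x, y: x + y,
        "expected": [3, 7],
    },
    {
        "data": [{"x": 2, "y": 5}],
        "columns": ["x", "y"],
        "udf": lambda x, y: x * y,
        "expected": [10],
    },
    {
        "data": [{"first": "Ada", "last": "Lovelace"}],
        "columns": ["first", "last"],
        "udf": lambda first, last: f"{first} {last}",
        "expected": ["Ada Lovelace"],
    },
]


def apply_udf(scenario: dict) -> list:
    """Row-wise stand-in for applying udf_fn to the scenario's columns."""
    columns = scenario["columns"]
    fn = scenario["udf"]
    return [fn(*(row[c] for c in columns)) for row in scenario["data"]]


for scenario in TEST_SCENARIOS:
    assert apply_udf(scenario) == scenario["expected"]
```

With the column names carried by the data, adding a new scenario means adding one dict rather than another elif branch.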

Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Aug 20, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Aug 21, 2025
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
@omatthew98 omatthew98 left a comment

Small comment, overall lgtm.

Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
Signed-off-by: Goutam V <goutam@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) August 29, 2025 01:16
Signed-off-by: Goutam V <goutam@anyscale.com>
@github-actions github-actions bot disabled auto-merge August 29, 2025 02:37
assert set(ds.schema().names) == {"id", "plus_one", "times_two", "ten_minus_id"}


@pytest.mark.skipif(

Nice test coverage!

@richardliaw richardliaw enabled auto-merge (squash) August 29, 2025 03:25
@richardliaw richardliaw merged commit e9c9a8f into ray-project:master Aug 29, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/udf_expr branch August 29, 2025 08:08
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
gangsf pushed a commit to gangsf/ray that referenced this pull request Sep 2, 2025
Signed-off-by: Gang Zhao <gang@gang-JQ62HD2C37.local>
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
snorkelopstesting1-a11y pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_55788_887d1dfb-e023-4cda-9c8a-9c2ff221de76 that referenced this pull request Oct 11, 2025
Original PR #55788 by goutamvenkat-anyscale
Original: ray-project/ray#55788
snorkelopstesting1-a11y added a commit to snorkel-marlin-repos/ray-project_ray_pr_55788_887d1dfb-e023-4cda-9c8a-9c2ff221de76 that referenced this pull request Oct 11, 2025
snorkelopstesting1-a11y pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_55788_0e1206db-0f8d-4ea8-b1e2-585bbfadbe7e that referenced this pull request Oct 11, 2025
Original PR #55788 by goutamvenkat-anyscale
Original: ray-project/ray#55788
snorkelopstesting1-a11y added a commit to snorkel-marlin-repos/ray-project_ray_pr_55788_0e1206db-0f8d-4ea8-b1e2-585bbfadbe7e that referenced this pull request Oct 11, 2025
snorkelopsstgtesting1-spec pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_55788_8b85ab43-54f2-4ffe-943d-1cc210b877ac that referenced this pull request Oct 22, 2025
Original PR #55788 by goutamvenkat-anyscale
Original: ray-project/ray#55788
snorkelopstesting1-a11y added a commit to snorkel-marlin-repos/ray-project_ray_pr_55788_8b85ab43-54f2-4ffe-943d-1cc210b877ac that referenced this pull request Oct 22, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

5 participants