
[data/docs] Add more education around transformations #59415

Merged
richardliaw merged 9 commits into ray-project:master from richardliaw:data-update-transforms-page
Dec 19, 2025

Conversation

@richardliaw
Contributor

@richardliaw richardliaw commented Dec 12, 2025

Adds documentation around

  • Expressions
  • Resource configuration
  • Async UDFs
  • Placement Groups / Distributed UDFs

It also refines text around key concepts.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw requested a review from a team as a code owner December 12, 2025 23:56
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the documentation for data transformations in Ray Data. It adds new sections on expressions, resource configuration, async UDFs, and distributed UDFs with placement groups. The changes improve clarity and provide more comprehensive examples. I've identified a few areas in the new documentation that could be improved for correctness and clarity, including an incorrect statement about ActorPoolStrategy, a confusing example for TaskPoolStrategy, a minor phrasing issue, and a missing import in a code snippet. Overall, these are great additions to the documentation.

You can specify the concurrency of the transformation by using the ``compute`` parameter.

For functions, use ``compute=ray.data.TaskPoolStrategy(size=n)`` to cap the number of concurrent tasks. By default, Ray Data will automatically determine the number of concurrent tasks.
For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed size actor pool of ``n`` workers. Currently, this is required to be specified.
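To make the function-vs-class distinction concrete without requiring a Ray cluster, here is a plain-Python analogy (stdlib only, not the Ray API): a task pool runs a stateless function on each batch, while an actor pool keeps stateful workers alive and reuses them across batches, the way ``TaskPoolStrategy`` and ``ActorPoolStrategy`` do conceptually.

```python
from concurrent.futures import ThreadPoolExecutor

def double_batch(batch):
    # Stateless function: the task-pool style. Nothing survives between calls.
    return [x * 2 for x in batch]

class Doubler:
    # Stateful worker: the actor-pool style. Expensive setup (e.g. loading
    # a model) would happen once in __init__, then be reused for every batch.
    def __init__(self):
        self.calls = 0

    def __call__(self, batch):
        self.calls += 1
        return [x * 2 for x in batch]

batches = [[0, 1], [2, 3], [4, 5]]

# Task-pool style: max_workers=2 caps concurrency, like TaskPoolStrategy(size=2).
with ThreadPoolExecutor(max_workers=2) as pool:
    task_results = list(pool.map(double_batch, batches))

# Actor-pool style: a fixed pool of 2 stateful workers, like ActorPoolStrategy(size=2).
workers = [Doubler() for _ in range(2)]
actor_results = [workers[i % 2](b) for i, b in enumerate(batches)]

print(task_results)   # [[0, 2], [4, 6], [8, 10]]
print(actor_results)  # [[0, 2], [4, 6], [8, 10]]
```

The payoff of the actor style is that per-worker state (here, `calls`; in practice, a loaded model) persists across batches.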
Contributor

medium

The statement "Currently, this is required to be specified" is incorrect. If compute is not specified for a class-based transform, Ray Data defaults to an autoscaling actor pool (ActorPoolStrategy()). It's good practice to specify it for resource control, but it's not strictly required. Please clarify this to avoid confusion.

Suggested change
For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed size actor pool of ``n`` workers. Currently, this is required to be specified.
For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed size actor pool of ``n`` workers. If ``compute`` is not specified, an autoscaling actor pool is used by default.

Member

+1

Member

We added a default in a recent version

Comment on lines +399 to +405
ds = ray.data.range(10).map_batches(lambda x: x * 2, compute=ray.data.TaskPoolStrategy(size=2))
ds.take_all()

.. testoutput::
   :options: +MOCK

   [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}, {'id': 8}, {'id': 9}]
Contributor

medium

The example for `TaskPoolStrategy` seems incorrect. The lambda `lambda x: x * 2` will raise a `TypeError` when applied to a batch, which is a dictionary of NumPy arrays. Also, the `testoutput` shows the original data, not the transformed data.

To make this a useful and correct example, I suggest updating the lambda to perform a valid transformation and updating the output to reflect that transformation.

Suggested change

ds = ray.data.range(10).map_batches(lambda x: x * 2, compute=ray.data.TaskPoolStrategy(size=2))
ds.take_all()

.. testoutput::
   :options: +MOCK

   [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}, {'id': 8}, {'id': 9}]

ds = ray.data.range(10).map_batches(lambda batch: {"id": batch["id"] * 2}, compute=ray.data.TaskPoolStrategy(size=2))
ds.take_all()

.. testoutput::
   :options: +MOCK

   [{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}, {'id': 10}, {'id': 12}, {'id': 14}, {'id': 16}, {'id': 18}]
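The core of the suggested fix is that a `map_batches` UDF receives a batch as a dict of columns and must return one. Here is a self-contained plain-Python sketch of that contract (in Ray the column would by default be a NumPy array, where `* 2` is elementwise; plain lists stand in so this runs anywhere):

```python
def double_ids(batch):
    # Mirrors `lambda batch: {"id": batch["id"] * 2}` from the suggestion:
    # take a dict of columns, return a dict of columns of the same length.
    return {"id": [x * 2 for x in batch["id"]]}

batch = {"id": list(range(10))}
result = double_ids(batch)
print(result["id"])  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```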

Advanced: Distributed UDFs with Placement Groups
================================================

While all transformations are automatically parallelized across your Ray cluster, often times these transformations themselves can themselves be distributed. For example, if you're using
Contributor

medium

The phrasing "often times these transformations themselves can themselves be distributed" is a bit repetitive. Consider rephrasing for clarity.

Suggested change
While all transformations are automatically parallelized across your Ray cluster, often times these transformations themselves can themselves be distributed. For example, if you're using
While all transformations are automatically parallelized across your Ray cluster, these transformations can often be distributed themselves. For example, if you're using

Comment on lines +578 to +581
from ray.data.expressions import col, udf
import pyarrow as pa
import pyarrow.compute as pc
import ray
Contributor

medium

The example for `udf` in expressions uses `DataType` but doesn't import it. This will cause a `NameError` if a user tries to run this code. Please add the necessary import.

Suggested change

from ray.data.expressions import col, udf
import pyarrow as pa
import pyarrow.compute as pc
import ray

from ray.data.expressions import col, udf
from ray.data.datatype import DataType
import pyarrow as pa
import pyarrow.compute as pc
import ray
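For readers unfamiliar with column expressions like `col`, here is a minimal, hypothetical sketch of the underlying idea (this is not Ray's implementation; all class names here are invented for illustration): an expression builds a lazy tree at construction time, which is only evaluated later against a batch of columns.

```python
class Expr:
    # Base class: operators build a tree instead of computing immediately.
    def __mul__(self, other):
        return BinOp(self, other, lambda a, b: a * b)

class Col(Expr):
    # A reference to a named column; resolved only at evaluation time.
    def __init__(self, name):
        self.name = name
    def eval(self, batch):
        return batch[self.name]

class BinOp(Expr):
    # A deferred binary operation over a column and a scalar or column.
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op
    def eval(self, batch):
        right = self.right if isinstance(self.right, (int, float)) else self.right.eval(batch)
        return [self.op(v, right) for v in self.left.eval(batch)]

def col(name):
    return Col(name)

# Building the expression does no work; eval() runs it against a batch.
expr = col("id") * 2
batch = {"id": [1, 2, 3]}
print(expr.eval(batch))  # [2, 4, 6]
```

This lazy-tree structure is what lets an engine inspect, optimize, and vectorize expressions before running them.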

@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues labels Dec 13, 2025
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Dec 15, 2025

Ray Data uses a *streaming execution model* to efficiently process large datasets.
Ray Data can leverage a *streaming execution model* to efficiently process large datasets.
Member

Nit: when do we not leverage a streaming execution model? Don't we always leverage it?

Contributor Author

updated the wording; basically when you have read -> shuffle -> write, it isn't really "streaming"
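The read -> shuffle -> write point can be illustrated with a stdlib-only sketch (not Ray code): a map stage can process one block at a time and is streaming-friendly, while a global shuffle/sort must materialize every block before emitting anything, which is why that pipeline "isn't really streaming".

```python
def read_blocks():
    # Pretend source: yields blocks lazily instead of loading everything.
    for start in (0, 3, 6):
        yield list(range(start, start + 3))

def map_stage(blocks):
    # Streaming-friendly: transforms each block as it arrives.
    for block in blocks:
        yield [x * 2 for x in block]

def shuffle_stage(blocks):
    # Streaming-breaking: needs ALL rows before it can emit anything,
    # because a global ordering depends on the entire dataset.
    all_rows = [x for block in blocks for x in block]
    all_rows.sort(reverse=True)
    return all_rows

streamed = list(map_stage(read_blocks()))
print(streamed)  # [[0, 2, 4], [6, 8, 10], [12, 14, 16]]

shuffled = shuffle_stage(read_blocks())
print(shuffled)  # [8, 7, 6, 5, 4, 3, 2, 1, 0]
```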


{'id': 9}]


Expressions (Alpha)
Member

Should this be above the "Advanced" stuff? Feel like this is more applicable to most users

Contributor Author

Right now since it's early, I'm putting it lower. I think in the next release once we flesh this out more it'll go much higher.

@richardliaw richardliaw enabled auto-merge (squash) December 19, 2025 03:54
@richardliaw richardliaw merged commit c2a7f92 into ray-project:master Dec 19, 2025
7 checks passed
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025

Adds documentation around
* Expressions
* Resource configuration
* Async UDFs
* Placement Groups / Distributed UDFs

And also refine text around key concepts.

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026