[data/docs] Add more education around transformations #59415
richardliaw merged 9 commits into ray-project:master from
Conversation
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Code Review
This pull request significantly enhances the documentation for data transformations in Ray Data. It adds new sections on expressions, resource configuration, async UDFs, and distributed UDFs with placement groups. The changes improve clarity and provide more comprehensive examples. I've identified a few areas in the new documentation that could be improved for correctness and clarity, including an incorrect statement about ActorPoolStrategy, a confusing example for TaskPoolStrategy, a minor phrasing issue, and a missing import in a code snippet. Overall, these are great additions to the documentation.
> You can specify the concurrency of the transformation by using the ``compute`` parameter.
>
> For functions, use ``compute=ray.data.TaskPoolStrategy(size=n)`` to cap the number of concurrent tasks. By default, Ray Data will automatically determine the number of concurrent tasks.
> For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed size actor pool of ``n`` workers. Currently, this is required to be specified.
The statement "Currently, this is required to be specified" is incorrect. If compute is not specified for a class-based transform, Ray Data defaults to an autoscaling actor pool (ActorPoolStrategy()). It's good practice to specify it for resource control, but it's not strictly required. Please clarify this to avoid confusion.
Suggested change:
- For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed size actor pool of ``n`` workers. Currently, this is required to be specified.
+ For classes, use ``compute=ray.data.ActorPoolStrategy(size=n)`` to use a fixed size actor pool of ``n`` workers. If ``compute`` is not specified, an autoscaling actor pool is used by default.
We added a default in some recent version
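For readers skimming the thread, the effect of capping concurrency with a fixed-size pool can be pictured with a standard-library analogy. This is only an illustration of the pooling concept; Ray's task and actor pools schedule work across a distributed cluster, not threads in one process:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Track how many workers run at once so the cap is observable.
lock = threading.Lock()
active = 0
peak = 0

def work(x):
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    result = x * 2  # stands in for the user-defined transformation
    with lock:
        active -= 1
    return result

# max_workers=2 plays the role of TaskPoolStrategy(size=2): no matter
# how many inputs arrive, at most two are processed concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(work, range(10)))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The same shape applies to ``ActorPoolStrategy(size=n)``, except the pool members are long-lived actors that can hold state (for example, a loaded model) between batches.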
> ds = ray.data.range(10).map_batches(lambda x: x * 2, compute=ray.data.TaskPoolStrategy(size=2))
> ds.take_all()
>
> .. testoutput::
>    :options: +MOCK
>
>    [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}, {'id': 8}, {'id': 9}]
The example for ``TaskPoolStrategy`` seems incorrect. The lambda ``lambda x: x * 2`` will raise a ``TypeError`` when applied to a batch, which is a dictionary of NumPy arrays. Also, the ``testoutput`` shows the original data, not the transformed data.
To make this a useful and correct example, I suggest updating the lambda to perform a valid transformation and updating the output to reflect that transformation.
Suggested change:
- ds = ray.data.range(10).map_batches(lambda x: x * 2, compute=ray.data.TaskPoolStrategy(size=2))
+ ds = ray.data.range(10).map_batches(lambda batch: {"id": batch["id"] * 2}, compute=ray.data.TaskPoolStrategy(size=2))
  ds.take_all()

  .. testoutput::
     :options: +MOCK

- [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}, {'id': 6}, {'id': 7}, {'id': 8}, {'id': 9}]
+ [{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}, {'id': 10}, {'id': 12}, {'id': 14}, {'id': 16}, {'id': 18}]
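The reviewer's point is easy to verify outside of Ray: a ``map_batches`` batch is a dictionary mapping column names to column arrays, so multiplying the whole batch raises a ``TypeError``, while rebuilding the dictionary column by column works. A minimal sketch, using plain Python lists in place of the NumPy arrays Ray Data actually passes:

```python
# A stand-in for a Ray Data batch: a dict mapping column names to
# column values (Ray Data passes NumPy arrays; plain lists behave
# the same way for this demonstration).
batch = {"id": [0, 1, 2, 3]}

# The original lambda fails: a dict cannot be multiplied by an int.
try:
    batch * 2
except TypeError as exc:
    print(f"TypeError: {exc}")

# The corrected UDF rebuilds the dict, transforming each column.
def double_ids(batch):
    return {"id": [v * 2 for v in batch["id"]]}

print(double_ids(batch))  # {'id': [0, 2, 4, 6]}
```

With real NumPy arrays, ``batch["id"] * 2`` is a single vectorized operation rather than a list comprehension, which is why ``map_batches`` UDFs are typically written against whole columns.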
> Advanced: Distributed UDFs with Placement Groups
> ================================================
>
> While all transformations are automatically parallelized across your Ray cluster, often times these transformations themselves can themselves be distributed. For example, if you're using
The phrasing "often times these transformations themselves can themselves be distributed" is a bit repetitive. Consider rephrasing for clarity.
Suggested change:
- While all transformations are automatically parallelized across your Ray cluster, often times these transformations themselves can themselves be distributed. For example, if you're using
+ While all transformations are automatically parallelized across your Ray cluster, these transformations can often be distributed themselves. For example, if you're using
> from ray.data.expressions import col, udf
> import pyarrow as pa
> import pyarrow.compute as pc
> import ray
The example for ``udf`` in expressions uses ``DataType`` but doesn't import it. This will cause a ``NameError`` if a user tries to run this code. Please add the necessary import.
Suggested change:
  from ray.data.expressions import col, udf
+ from ray.data.datatype import DataType
  import pyarrow as pa
  import pyarrow.compute as pc
  import ray
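For readers unfamiliar with expression APIs, the idea behind ``col`` and ``udf`` can be sketched with a toy evaluator: expressions are built lazily and only evaluated later against a batch. This is an illustration of the general pattern only, not Ray Data's actual implementation:

```python
# Toy model of a column-expression API. Expressions are plain
# callables built up front and evaluated later against a batch
# (a dict of column lists). NOT Ray Data's implementation.

def col(name):
    """Return an expression that selects a column from a batch."""
    return lambda batch: batch[name]

def udf(fn):
    """Wrap a function so it composes with other expressions."""
    def wrapper(*exprs):
        return lambda batch: fn(*(e(batch) for e in exprs))
    return wrapper

@udf
def add_one(values):
    return [v + 1 for v in values]

expr = add_one(col("id"))        # nothing is computed yet
print(expr({"id": [1, 2, 3]}))   # [2, 3, 4]
```

The laziness is the point: because the expression is just a description of the computation, an engine can inspect it, optimize it, and push it down to a vectorized backend (such as ``pyarrow.compute``) before any data is touched.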
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
doc/source/data/key-concepts.rst
Outdated
  -------------------------

- Ray Data uses a *streaming execution model* to efficiently process large datasets.
+ Ray Data can leverage a *streaming execution model* to efficiently process large datasets.
Nit: when do we not leverage a streaming execution model? Don't we always leverage it?
updated the wording; basically when you have read -> shuffle -> write, it isn't really "streaming"
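The distinction the maintainer is drawing, between a pipeline that streams data through read/map/write and one that must materialize everything at a shuffle barrier, can be pictured with a toy generator pipeline. This is an analogy only; Ray Data's streaming executor operates on distributed blocks, not Python generators:

```python
# Streaming: each stage is a generator, so items flow through the
# pipeline one at a time and memory stays bounded.
def read():
    for i in range(5):
        yield i

def transform(items):
    for x in items:
        yield x * 2

streamed = list(transform(read()))

# A shuffle is a barrier: every item must be materialized before
# any output can be produced, so execution is no longer streaming.
def shuffle_barrier(items):
    return sorted(items, reverse=True)  # needs ALL items first

barriered = shuffle_barrier(transform(read()))
print(streamed)    # [0, 2, 4, 6, 8]
print(barriered)   # [8, 6, 4, 2, 0]
```

That is why a read -> shuffle -> write job behaves differently from a pure read -> map -> write job: the shuffle stage forces the whole intermediate dataset to exist at once.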
>     {'id': 9}]
>
> Expressions (Alpha)
Should this be above the "Advanced" stuff? Feel like this is more applicable to most users
Right now since it's early, I'm putting it lower. I think in the next release once we flesh this out more it'll go much higher.
Adds documentation around
And also refines text around key concepts.