[Data] Make all map ops zero-copy by default #58285
[Data] Make all map ops zero-copy by default #58285alexeykudinkin merged 6 commits intoray-project:masterfrom
Conversation
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request changes the default value of zero_copy_batch to True for map operations, which is a sensible change to optimize for the common use case and improve performance. My review focuses on the correctness and documentation impact of this change.
I've identified a critical issue in Dataset.add_column where zero_copy_batch is incorrectly set to True. Since the add_column operation mutates the batch, this will lead to runtime errors. I've suggested reverting this specific change.
Additionally, I've pointed out that the docstring for Dataset.map_batches needs to be updated to reflect the new default value of zero_copy_batch and to provide clearer guidance to users.
Overall, the direction of the PR is good, but the identified issues should be addressed before merging.
| compute=compute, | ||
| concurrency=concurrency, | ||
| zero_copy_batch=False, | ||
| zero_copy_batch=True, |
There was a problem hiding this comment.
The add_column UDF mutates the batch in-place (e.g., batch.loc[:, col] = column). Setting zero_copy_batch=True will provide a read-only batch, which will cause a runtime error when the UDF attempts to modify it. This should be set to False to ensure a writable copy of the batch is provided.
| zero_copy_batch=True, | |
| zero_copy_batch=False, |
| compute: Optional[ComputeStrategy] = None, | ||
| batch_format: Optional[str] = "default", | ||
| zero_copy_batch: bool = False, | ||
| zero_copy_batch: bool = True, |
There was a problem hiding this comment.
With this change, the default for zero_copy_batch is now True. The docstring should be updated to reflect this.
- On line 614,
Default is ``False``should be changed toDefault is ``True``. - The tip on lines 485-487 is now outdated. It suggests setting
zero_copy_batch=Truefor performance, but this is now the default. It would be more helpful to rephrase it to warn users about in-place mutation, for example: "By default,map_batchesuses zero-copy batching to maximize performance. If your function mutates the batch, you must setzero_copy_batch=Falseto avoid errors."
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
|
Can you add a test where the UDF modifies data while zero_copy is True? Make sure the error message clearly tells users to tune the zero_copy flag. |
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There's already such a test |
## Description Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors. However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be `zero_copy_batch=True` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors. However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be `zero_copy_batch=True` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors. However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be `zero_copy_batch=True` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
## Description Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors. However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be `zero_copy_batch=True` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
## Description Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors. However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be `zero_copy_batch=True` by default. ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Historically, the intention was to avoid failures upon attempts to modify provided batch in-place when, for ex, using Torch tensors.
However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of scenarios. As such, we're flipping this setting to be
zero_copy_batch=Trueby default.Related issues
Additional information