[Data] Don't push limit past map_batches by default #60448
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Code Review
This pull request changes the default value of udf_modifying_row_count in map_batches to True, making the API safer by default by disabling limit pushdown for map_batches unless explicitly enabled. This is a great improvement for correctness. The docstring update is clear, and the new test correctly validates the new behavior.
One side effect is a minor performance regression for internal methods like add_column and drop_columns that use map_batches without modifying row counts. These will now default to the safer, non-optimized path. A follow-up to explicitly set udf_modifying_row_count=False for these specific calls would be beneficial to restore optimal performance.
Overall, the change is a solid improvement to the API's robustness.
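The correctness issue behind this change can be illustrated with a minimal pure-Python sketch. It does not use Ray; `apply_batches`, `limit`, and `filter_even` are hypothetical stand-ins for `map_batches`, the `Limit` operator, and a UDF that changes the row count.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[int]


def apply_batches(
    batches: Iterable[Batch], udf: Callable[[Batch], Batch]
) -> Iterator[Batch]:
    """Toy stand-in for map_batches: apply the UDF to every batch."""
    for batch in batches:
        yield udf(batch)


def limit(batches: Iterable[Batch], n: int) -> Iterator[Batch]:
    """Toy stand-in for Limit: keep only the first n rows across batches."""
    remaining = n
    for batch in batches:
        if remaining <= 0:
            break
        taken = batch[:remaining]
        remaining -= len(taken)
        yield taken


def rows(batches: Iterable[Batch]) -> List[int]:
    """Flatten batches into a single list of rows."""
    return [row for batch in batches for row in batch]


def filter_even(batch: Batch) -> Batch:
    """A UDF that changes the row count by dropping odd values."""
    return [x for x in batch if x % 2 == 0]


data = [[0, 1, 2, 3], [4, 5, 6, 7]]

# Plan as written by the user: map first, then limit to 3 rows.
correct = rows(limit(apply_batches(data, filter_even), 3))

# Plan after an (incorrect) pushdown: limit first, then map.
pushed = rows(apply_batches(limit(data, 3), filter_even))

print(correct)  # [0, 2, 4]
print(pushed)   # [0, 2]
```

The two plans return different results, which is why a limit should only move past an operator that is known to preserve the row count.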
alexeykudinkin
left a comment
This is a breaking change
This is a breaking change from 2.52 and 2.53, but it's both the correct behavior and the behavior that's consistent with Ray <2.52. We made a breaking change between 2.51 and 2.52.
    ds1 = ray.data.range(n, override_num_blocks=2)
    ds1 = ds1.map_batches(lambda x: x, batch_size=target_rows)
    ds1 = ds1.map_batches(lambda x: x, batch_size=target_rows)
    ds1 = ds1.map_batches(
Why do we need to manually set udf_modifying_row_count=False here?
Because of this logic here:
ray/python/ray/data/_internal/logical/rules/operator_fusion.py
Lines 652 to 666 in e32a2f2
By default, we don't know whether a map_batches call will modify the number of rows, so the user needs to set udf_modifying_row_count explicitly.
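As a rough illustration of that default, here is a toy pushdown rule in plain Python. It is a sketch under stated assumptions, not Ray's optimizer: `Op`, `may_modify_row_count`, and `push_limit_down` are hypothetical names, and the flag only mirrors the idea behind `udf_modifying_row_count`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    name: str
    # Hypothetical flag: assume the op may change the row count
    # unless the caller says otherwise.
    may_modify_row_count: bool = True


def push_limit_down(plan: List[Op]) -> List[Op]:
    """Move a trailing Limit upstream past ops that preserve row counts."""
    if not plan or plan[-1].name != "Limit":
        return plan
    limit_op, rest = plan[-1], plan[:-1]
    i = len(rest)
    # Walk upstream while the preceding op is known to keep the row count.
    while i > 0 and not rest[i - 1].may_modify_row_count:
        i -= 1
    return rest[:i] + [limit_op] + rest[i:]


limit_op = Op("Limit", may_modify_row_count=False)

# Default: nothing is known about the UDF, so the limit stays put.
default_plan = push_limit_down([Op("Read"), Op("MapBatches"), limit_op])

# Explicit opt-in: the caller declares the UDF preserves the row count.
opt_in_plan = push_limit_down(
    [Op("Read"), Op("MapBatches", may_modify_row_count=False), limit_op]
)

print([op.name for op in default_plan])  # ['Read', 'MapBatches', 'Limit']
print([op.name for op in opt_in_plan])   # ['Read', 'Limit', 'MapBatches']
```

Defaulting the flag to "may modify" trades some optimization for correctness: the limit only moves when the caller has explicitly vouched for the UDF.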
    ds1 = ds1.map_batches(
        lambda x: x, batch_size=target_rows, udf_modifying_row_count=False
    )
    ds1 = ds1.map_batches(
        lambda x: x, batch_size=target_rows, udf_modifying_row_count=False
    )
We can't break fusion like that. We'd have to relax the conditional you're referencing above.
alexeykudinkin
left a comment
Discussed offline:
- We'll be fixing fusion in the follow-up
In Ray 2.52, we made a breaking change that allowed limits to incorrectly get pushed down past `map_batches` calls that produce a different number of output rows than input rows. This can happen, for example, if you're using `map_batches` to filter or download data. This PR reverts the breaking change to the (correct) behavior of Ray <2.52.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
This PR updates the operator fusion rule to fuse `MapBatches` even if they modify the row counts. The intention of this PR is to preserve the historical operator fusion behavior and avoid introducing regressions. For more details, see the timeline below.

---

### Timeline of Changes

| Date | Event | Description |
| :--- | :--- | :--- |
| **June 8, 2023** | **Limit pushdown added** | Added limit pushdown and a property to `MapBatches` incorrectly stating it doesn't modify row counts. (#35950) |
| **June 27, 2023** | **Limit pushdown disabled** | Rule disabled because it incorrectly pushed limits past UDFs that modified row counts. (#36831) |
| **April 28, 2025** | **Fusion restricted** | Added logic to stop fusing operators that modify row counts when the downstream has a batch size. `MapBatches` stayed fused only because of its incorrect property (#52570). |
| **July 8, 2025** | **Limit pushdown re-enabled with special case** | Re-enabled with a special case to prevent pushing limits past `MapBatches`. ([#39486](#39486)) |
| **Oct 24, 2025** | **Special case removed** | Special case removed, re-introducing the bug where limits are pushed past `MapBatches`. ([#57880](#57880)) |
| **Feb 2, 2026** | **Property Fix** | Updated `MapBatches` to correctly report it modifies rows by default. This fixed the pushdown bug but broke fusion logic. ([PR #60448](#60448)) |
| **Feb 4, 2026** | (This PR) | Add a special-case to preserve the historical `MapBatches` fusion behavior |

---

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
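The last two fusion-related rows of the timeline can be pictured with a toy predicate. This is only a model of the timeline's wording, not the code in `operator_fusion.py`: `MapOp` and `can_fuse` are hypothetical names, and checking the upstream operator's kind for the special case is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MapOp:
    kind: str  # e.g. "MapBatches" or "Filter"
    modifies_row_count: bool
    batch_size: Optional[int] = None


def can_fuse(up: MapOp, down: MapOp) -> bool:
    """Toy fusion check for an upstream/downstream pair of map-like ops."""
    # Special case (assumed shape): keep fusing MapBatches even though it
    # now reports that it may modify the row count.
    if up.kind == "MapBatches":
        return True
    # Restriction from the "Fusion restricted" row: don't fuse an op that
    # modifies row counts into a downstream op that has a batch size.
    if up.modifies_row_count and down.batch_size is not None:
        return False
    return True


batched = MapOp("MapBatches", modifies_row_count=True, batch_size=4)
unbatched = MapOp("MapRows", modifies_row_count=False)
filter_op = MapOp("Filter", modifies_row_count=True)

print(can_fuse(batched, batched))      # True
print(can_fuse(filter_op, batched))    # False
print(can_fuse(filter_op, unbatched))  # True
```

Without the special case, the corrected `MapBatches` property would make the first pair fall into the restricted branch, which is the fusion regression the timeline describes.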
This PR updates `map_groups` to assume that the UDF might change the row count. This change is necessary to fix a bug where `Limit` gets incorrectly pushed past `map_groups` (fixes #60872).

For more context, see:
* #60448
* #60756

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
In Ray 2.52, we made a breaking change that allowed limits to incorrectly get pushed down past `map_batches` calls that produce a different number of output rows than input rows. This can happen, for example, if you're using `map_batches` to filter or download data.

This PR reverts the breaking change to the (correct) behavior of Ray <2.52.