[Data] Don't push limit past map_batches by default #60448

Merged
bveeramani merged 7 commits into master from fix-limit-pushdown
Feb 2, 2026

Member

@bveeramani bveeramani commented Jan 23, 2026

In Ray 2.52, we made a breaking change that allowed limits to be incorrectly pushed down past map_batches calls whose output row count differs from their input row count. This can happen, for example, if you're using map_batches to filter or download data.

This PR reverts that change and restores the (correct) behavior of Ray <2.52.
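
To make the failure mode concrete, here is a minimal pure-Python sketch. It does not use Ray; `keep_even`, `apply_in_batches`, and the sample data are hypothetical names chosen for illustration. It compares taking a limit after a row-filtering batch function with taking the same limit before it, which is what the pushdown effectively did.

```python
from itertools import islice


def keep_even(batch):
    """Hypothetical batch UDF that filters rows, so output size != input size."""
    return [row for row in batch if row % 2 == 0]


def apply_in_batches(rows, fn, batch_size):
    """Apply `fn` to consecutive batches of `rows`, yielding the output rows."""
    rows = list(rows)
    for start in range(0, len(rows), batch_size):
        yield from fn(rows[start:start + batch_size])


data = range(10)

# Intended semantics: map first, then take the first 3 output rows.
limit_after_map = list(islice(apply_in_batches(data, keep_even, 4), 3))

# Incorrect pushdown: take the first 3 input rows, then map.
limit_before_map = list(apply_in_batches(islice(data, 3), keep_even, 4))

print(limit_after_map)   # [0, 2, 4]
print(limit_before_map)  # [0, 2]
```

The two results differ because the function drops rows, so a limit on its input is not equivalent to a limit on its output.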

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner January 23, 2026 07:41
@bveeramani bveeramani marked this pull request as draft January 23, 2026 07:41
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request changes the default value of udf_modifying_row_count in map_batches to True, making the API safer by default by disabling limit pushdown for map_batches unless explicitly enabled. This is a great improvement for correctness. The docstring update is clear, and the new test correctly validates the new behavior.

One side effect is a minor performance regression for internal methods like add_column and drop_columns that use map_batches without modifying row counts. These will now default to the safer, non-optimized path. A follow-up to explicitly set udf_modifying_row_count=False for these specific calls would be beneficial to restore optimal performance.

Overall, the change is a solid improvement to the API's robustness.
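
As a rough illustration of the behavior described in this review, the sketch below models a limit-pushdown rule that only moves a limit past operators that declare they preserve row counts. It is a toy model under stated assumptions: the `Op` class, the `may_modify_row_count` field, and `push_limit_down` are hypothetical and are not Ray's optimizer code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    """Toy logical operator; not Ray's actual operator class."""
    name: str
    # Conservative default, mirroring the new default described in this PR.
    may_modify_row_count: bool = True


def push_limit_down(plan: List[Op]) -> List[Op]:
    """Move a trailing 'Limit' upstream past ops that preserve row counts."""
    plan = list(plan)
    i = len(plan) - 1
    if i < 0 or plan[i].name != "Limit":
        return plan
    while i > 0 and not plan[i - 1].may_modify_row_count:
        plan[i - 1], plan[i] = plan[i], plan[i - 1]
        i -= 1
    return plan


limit = Op("Limit")
read = Op("Read")  # treated as a barrier in this toy model

default_plan = [read, Op("MapBatches"), limit]
opted_in_plan = [read, Op("MapBatches", may_modify_row_count=False), limit]

print([op.name for op in push_limit_down(default_plan)])
# ['Read', 'MapBatches', 'Limit']
print([op.name for op in push_limit_down(opted_in_plan)])
# ['Read', 'Limit', 'MapBatches']
```

In this model, internal helpers that never change row counts would opt in explicitly, which corresponds to the follow-up suggested above for add_column and drop_columns.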

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Contributor

@alexeykudinkin alexeykudinkin left a comment

This is a breaking change

@bveeramani
Member Author

> This is a breaking change

This is a breaking change relative to Ray 2.52 and 2.53, but it is both the correct behavior and the behavior of Ray <2.52. The original breaking change was the one we made between 2.51 and 2.52.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani marked this pull request as ready for review January 30, 2026 18:37
@ray-gardener ray-gardener bot added the "data" (Ray Data-related issues) label Jan 30, 2026
ds1 = ray.data.range(n, override_num_blocks=2)
ds1 = ds1.map_batches(lambda x: x, batch_size=target_rows)
ds1 = ds1.map_batches(lambda x: x, batch_size=target_rows)
ds1 = ds1.map_batches(
Contributor

Why do we need to manually set udf_modifying_row_count=False here?

Member Author

Because of this logic here:

# Do not fuse Map operators in case:
#
# - Upstream could (potentially) drastically modify number of rows, while
# - Downstream has `min_rows_per_input_bundle` specified
#
# Fusing such transformations is not desirable as it could
#
# - Drastically reduce parallelism for the upstream up (for ex, if
# fusing ``Read->MapBatches(batch_size=...)`` with large enough batch-size
# could drastically reduce parallelism level of the Read op)
#
# - Potentially violate batching semantic by fusing
# ``Filter->MapBatches(batch_size=...)``
#
if (
.

By default, we don't know whether a map_batches UDF will modify the number of rows, so the user needs to set udf_modifying_row_count explicitly.
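
The quoted comment can be read as a simple predicate. The sketch below is only a toy interpretation: the actual condition is elided in the snippet above, and `MapOp`, `may_modify_row_count`, and `can_fuse` are hypothetical names rather than Ray's internals. `min_rows_per_input_bundle` is borrowed from the quoted comment and is assumed here to be set whenever a batch size is given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MapOp:
    """Toy stand-in for a map operator; field names are illustrative."""
    name: str
    # Unknown UDFs are assumed to change row counts.
    may_modify_row_count: bool = True
    # Assumed to be set when the user passes a batch size.
    min_rows_per_input_bundle: Optional[int] = None


def can_fuse(upstream: MapOp, downstream: MapOp) -> bool:
    """Refuse fusion when upstream may change row counts and downstream batches."""
    return not (
        upstream.may_modify_row_count
        and downstream.min_rows_per_input_bundle is not None
    )


default_up = MapOp("MapBatches", min_rows_per_input_bundle=5)
declared_up = MapOp(
    "MapBatches", may_modify_row_count=False, min_rows_per_input_bundle=5
)
down = MapOp("MapBatches", min_rows_per_input_bundle=5)

print(can_fuse(default_up, down))   # False
print(can_fuse(declared_up, down))  # True
```

Under this reading, two chained map_batches calls with a batch size stop fusing once the default becomes "may modify row counts", which is why the test needed the explicit flag.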

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) January 31, 2026 02:46
@github-actions github-actions bot disabled auto-merge January 31, 2026 02:46
@github-actions github-actions bot added the "go" (add ONLY when ready to merge, run all tests) label Jan 31, 2026
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Comment on lines +814 to 820
ds1 = ds1.map_batches(
    lambda x: x, batch_size=target_rows, udf_modifying_row_count=False
)
ds1 = ds1.map_batches(
    lambda x: x, batch_size=target_rows, udf_modifying_row_count=False
)

Contributor

We can't break fusion like that. We'd have to relax the conditional you're referencing above.

Contributor

@alexeykudinkin alexeykudinkin left a comment

Discussed offline:

  • We'll be fixing fusion in the follow-up

@bveeramani bveeramani merged commit e151a0a into master Feb 2, 2026
6 checks passed
@bveeramani bveeramani deleted the fix-limit-pushdown branch February 2, 2026 20:44
bveeramani added a commit that referenced this pull request Feb 4, 2026
This PR updates the operator fusion rule to fuse `MapBatches` even if
they modify the row counts. The intention of this PR is to preserve the
historical operator fusion behavior and avoid introducing regressions.

For more details, see the timeline below.

---

### Timeline of Changes

| Date | Event | Description |
| :--- | :--- | :--- |
| **June 8, 2023** | **Limit pushdown added** | Added limit pushdown and a property to `MapBatches` incorrectly stating it doesn't modify row counts. (#35950) |
| **June 27, 2023** | **Limit pushdown disabled** | Rule disabled because it incorrectly pushed limits past UDFs that modified row counts. (#36831) |
| **April 28, 2025** | **Fusion restricted** | Added logic to stop fusing operators that modify row counts when the downstream has a batch size. `MapBatches` stayed fused only because of its incorrect property. (#52570) |
| **July 8, 2025** | **Limit pushdown re-enabled with special case** | Re-enabled with a special case to prevent pushing limits past `MapBatches`. (#39486) |
| **Oct 24, 2025** | **Special case removed** | Special case removed, re-introducing the bug where limits are pushed past `MapBatches`. (#57880) |
| **Feb 2, 2026** | **Property fix** | Updated `MapBatches` to correctly report it modifies rows by default. This fixed the pushdown bug but broke fusion logic. (#60448) |
| **Feb 4, 2026** | **(This PR)** | Added a special case to preserve the historical `MapBatches` fusion behavior. |

---

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
bveeramani added a commit that referenced this pull request Feb 9, 2026
This PR updates `map_groups` to assume that the UDF might change the row
count. This change is necessary to fix a bug where `Limit` gets
incorrectly pushed past the `map_groups` (fixes
#60872).

For more context, see:
* #60448
* #60756

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
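
The same reasoning applies to grouped UDFs. The following pure-Python sketch (no Ray involved; `summarize`, this local `map_groups` helper, and the sample rows are hypothetical) shows a group function that collapses each group into one row, so limiting the input gives a different answer than limiting the output.

```python
from itertools import groupby, islice


def summarize(group_rows):
    """Hypothetical group UDF: collapses each group into one summary row."""
    rows = list(group_rows)
    return [{"key": rows[0]["key"], "count": len(rows)}]


def map_groups(rows, fn):
    """Sort by key, apply `fn` to each group, and concatenate the outputs."""
    ordered = sorted(rows, key=lambda r: r["key"])
    out = []
    for _, group in groupby(ordered, key=lambda r: r["key"]):
        out.extend(fn(group))
    return out


rows = [{"key": k} for k in ["a", "a", "a", "b", "b", "c"]]

# Intended semantics: group first, then keep the first 2 output rows.
limit_after = list(islice(map_groups(rows, summarize), 2))

# Incorrect pushdown: keep the first 2 input rows, then group.
limit_before = map_groups(list(islice(rows, 2)), summarize)

print(limit_after)   # [{'key': 'a', 'count': 3}, {'key': 'b', 'count': 2}]
print(limit_before)  # [{'key': 'a', 'count': 2}]
```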