
[Serve] Make batching work with multiplexing#59334

Merged
abrarsheikh merged 7 commits into master from 56633-abrar-batch
Dec 18, 2025

Conversation


@abrarsheikh abrarsheikh commented Dec 10, 2025

fixes #56633

  • Add documentation
  • Update `get_multiplexed_model_id` to check whether we are in a batch context first
  • Update the batching logic
  • Add tests
  • Does not introduce any backwards incompatibility: previously the system provided no guarantee about the contents of a batch; now we add a constraint that guarantees each batch contains requests for the same model
  • Execute sub-batches concurrently

The thing I dislike about this implementation is that it does not fill the batch when the replica is responsible for more than two models and incoming traffic is distributed equally among them, because the current implementation fills the batch first and then divides it.

| Metric | Baseline (42,905 reqs) | Master (27,526 reqs) | Δ Change (Master − Baseline) |
| --- | --- | --- | --- |
| Requests | 42,905 | 27,526 | −15,379 |
| Fails | 0 | 0 | 0 |
| Median (ms) | 290 | 300 | +10 ms |
| 95%ile (ms) | 560 | 570 | +10 ms |
| 99%ile (ms) | 620 | 640 | +20 ms |
| Average (ms) | 327.41 | 332.96 | +5.55 ms |
| Min (ms) | 61 | 80 | +19 ms |
| Max (ms) | 764 | 802 | +38 ms |
| Avg Size (bytes) | 13 | 13 | 0 |
| Current RPS | 299 | 293 | −6 |
| Current Failures/s | 0 | 0 | 0 |

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested review from a team as code owners December 10, 2025 04:26

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Dec 10, 2025
Signed-off-by: abrar <abrar@anyscale.com>
@harshit-anyscale harshit-anyscale left a comment


lgtm, except model_1.pt file is added but has no changes

@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Dec 10, 2025
Signed-off-by: abrar <abrar@anyscale.com>

## Using model multiplexing with batching

You can combine model multiplexing with the `@serve.batch` decorator for efficient batched inference. When you use both features together, Ray Serve automatically splits batches by model ID to ensure each batch contains only requests for the same model. This prevents issues where a single batch would contain requests targeting different models.
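The combination described above can be illustrated with a minimal deployment sketch. `serve.multiplexed`, `serve.batch`, and `serve.get_multiplexed_model_id` are real Ray Serve APIs; the loader body and the specific parameter values here are hypothetical placeholders:

```python
from ray import serve
from starlette.requests import Request


@serve.deployment
class BatchedMultiplexedModel:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Hypothetical loader: return a callable for the given model ID.
        # A real deployment would load weights from storage here.
        return lambda x: f"{model_id}:{x}"

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict(self, inputs: list):
        # With this change, every request in a batch targets the same
        # model, so a single model lookup serves the whole batch.
        model = await self.get_model(serve.get_multiplexed_model_id())
        return [model(x) for x in inputs]

    async def __call__(self, request: Request):
        return await self.predict(await request.json())


app = BatchedMultiplexedModel.bind()
# Clients select a model via the "serve_multiplexed_model_id" request header.
```

This is a sketch rather than a runnable benchmark: it assumes a running Ray cluster and omits deployment via `serve.run(app)`.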
A reviewer commented:
The way I understand this description is that Serve will treat each model's batch independently, i.e. wait to reach `max_batch_size` or the timeout before firing for each model. In reality, it waits for `max_batch_size` or the timeout across all models. For example, if `max_batch_size=8`, Serve will process sub-batches of size [1, 4, 3] instead of waiting for each model to accumulate 8 requests.
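The behavior in question (fill one shared batch first, then split it by model ID) can be sketched in plain Python. This is an illustrative sketch only, not Serve's internal code; `split_batch_by_model` is a hypothetical helper:

```python
from collections import defaultdict


def split_batch_by_model(batch):
    """Split an already-filled batch into per-model sub-batches.

    Each request is a (model_id, payload) pair. The batch is filled to
    max_batch_size (or the wait timeout) across ALL models first, and only
    then divided by model ID, preserving arrival order within each model.
    """
    sub_batches = defaultdict(list)
    for model_id, payload in batch:
        sub_batches[model_id].append(payload)
    return dict(sub_batches)


# A full batch of 8 requests spread across three models...
batch = [("a", 1), ("b", 2), ("b", 3), ("c", 4),
         ("b", 5), ("c", 6), ("b", 7), ("c", 8)]
# ...yields sub-batches of sizes 1, 4, and 3 rather than one batch of 8.
sizes = {m: len(reqs) for m, reqs in split_batch_by_model(batch).items()}
print(sizes)  # {'a': 1, 'b': 4, 'c': 3}
```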

The author replied:

you are right.

@abrarsheikh abrarsheikh merged commit 1599fb7 into master Dec 18, 2025
6 checks passed
@abrarsheikh abrarsheikh deleted the 56633-abrar-batch branch December 18, 2025 21:33
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Successfully merging this pull request may close these issues.

[Serve] model multiplexing and batching does not work together
