[Serve] Make batching work with multiplexing #59334
Merged
abrarsheikh merged 7 commits into master on Dec 18, 2025
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
harshit-anyscale approved these changes on Dec 10, 2025
Contributor harshit-anyscale left a comment:
lgtm, except model_1.pt file is added but has no changes
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
harshit-anyscale approved these changes on Dec 16, 2025
Signed-off-by: abrar <abrar@anyscale.com>
harshit-anyscale approved these changes on Dec 18, 2025
akyang-anyscale approved these changes on Dec 18, 2025
## Using model multiplexing with batching

You can combine model multiplexing with the `@serve.batch` decorator for efficient batched inference. When you use both features together, Ray Serve automatically splits batches by model ID to ensure each batch contains only requests for the same model. This prevents issues where a single batch would contain requests targeting different models.
Contributor
The way I understand this description, Serve treats each model's batch independently, i.e., it waits to reach max_batch_size or the timeout separately for each model before firing. In reality, it waits for max_batch_size or the timeout across all models. For example, if max_batch_size=8, Serve will process sub-batches of sizes [1, 4, 3] instead of waiting for each model to have 8 requests.
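For context, here is a minimal sketch of how the two features are typically combined in a deployment. The deployment name and the `load_model_from_storage` loader are hypothetical placeholders, not code from this PR:

```python
from ray import serve


async def load_model_from_storage(model_id: str):
    # Hypothetical loader: in practice this would fetch and deserialize
    # the weights for `model_id` (for example, from S3).
    return lambda x: f"{model_id}:{x}"


@serve.deployment
class MultiplexedBatcher:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        return await load_model_from_storage(model_id)

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def batched_predict(self, inputs: list):
        # With this change, each batch contains requests for a single model,
        # so the multiplexed model ID can be read inside the batched method.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return [model(x) for x in inputs]

    async def __call__(self, request):
        return await self.batched_predict(await request.json())


app = MultiplexedBatcher.bind()
```

Clients then select a model per request, for example by setting the `serve_multiplexed_model_id` header on HTTP requests or `handle.options(multiplexed_model_id=...)` when calling through a deployment handle.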
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request on Dec 22, 2025
fixes ray-project#56633

- [x] Add documentation
- [x] Update `get_multiplexed_model_id` to check whether we are in a batch context first
- [x] Update logic
- [x] Add tests
- [x] Does not introduce any backwards incompatibility: previously the system did not provide any guarantee about the contents of a batch, and now we add a constraint that guarantees each batch contains requests for the same model
- [x] Execute sub-batches concurrently

The thing I dislike about this implementation is that it does not fill the batch in the case where the replica is responsible for > 2 models and incoming traffic is distributed equally between those models, because the current implementation fills the batch first and then divides it.

| Metric | Baseline (42905 reqs) | Master (27526 reqs) | Δ Change (Master − Baseline) |
| -- | -- | -- | -- |
| Requests | 42,905 | 27,526 | −15,379 |
| Fails | 0 | 0 | 0 |
| Median (ms) | 290 | 300 | +10 ms |
| 95%ile (ms) | 560 | 570 | +10 ms |
| 99%ile (ms) | 620 | 640 | +20 ms |
| Average (ms) | 327.41 | 332.96 | +5.55 ms |
| Min (ms) | 61 | 80 | +19 ms |
| Max (ms) | 764 | 802 | +38 ms |
| Avg Size (bytes) | 13 | 13 | 0 |
| Current RPS | 299 | 293 | −6 |
| Current Failures/s | 0 | 0 | 0 |

Signed-off-by: abrar <abrar@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026 (same commit message as above; Signed-off-by: abrar <abrar@anyscale.com>, peterxcli <peterxcli@gmail.com>)
fixes #56633

- [x] Update `get_multiplexed_model_id` to check whether we are in a batch context first

The thing I dislike about this implementation is that it does not fill the batch in the case where the replica is responsible for > 2 models and incoming traffic is distributed equally between those models, because the current implementation fills the batch first and then divides it.
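To make the "fill first, then divide" behavior concrete, here is an illustrative sketch (not the actual Serve internals) of splitting an already-filled batch by multiplexed model ID and running the per-model sub-batches concurrently. `handle_sub_batch` is a hypothetical per-model handler:

```python
import asyncio
from collections import defaultdict


async def run_split_batches(batch, handle_sub_batch):
    # `batch` is a list of (model_id, request) pairs collected up to
    # max_batch_size or the batch timeout, across all models.
    sub_batches = defaultdict(list)
    for model_id, request in batch:
        sub_batches[model_id].append(request)

    # With max_batch_size=8 and three models, this can yield sub-batches of
    # sizes [1, 4, 3] rather than one full batch of 8 per model.
    return await asyncio.gather(
        *(
            handle_sub_batch(model_id, requests)
            for model_id, requests in sub_batches.items()
        )
    )
```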