[data][llm] Prefer uniproc executor over mp executor when world_size==1 #60403

kouroshHakha merged 2 commits into ray-project:master
Conversation
kouroshHakha left a comment:

The changes in the vllm stage make sense. Can you re-review the changes to the benchmark and keep only those that were absolutely necessary? For example, enforcing eager is actually not good for the benchmark by default; it will slow down decoding.
```python
"enable_prefix_caching": True,
"enable_chunked_prefill": True,
"max_num_batched_tokens": 4096,
"enforce_eager": True,
```
Why change these? Enforcing eager will slow down the benchmarks for decoding.

Just trying to reproduce the numbers I had earlier. Your suggestion makes sense. I will remove enforce_eager and update the numbers.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Force-pushed 00b536e to 4ef92f7
```python
"enable_prefix_caching": True,
"enable_chunked_prefill": True,
```

`enable_prefix_caching` and `enable_chunked_prefill` are enabled by default in vLLM.
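Putting the two review comments together, the benchmark's engine kwargs can be trimmed like this. This is a hypothetical reconstruction of the reviewed hunk, not the exact contents of the benchmark file:

```python
# Engine kwargs as originally proposed for the benchmark (reconstructed
# from the review hunk above; not the exact file contents).
proposed = {
    "enable_prefix_caching": True,   # already the vLLM default
    "enable_chunked_prefill": True,  # already the vLLM default
    "max_num_batched_tokens": 4096,
    "enforce_eager": True,           # disables CUDA graphs, slowing decoding
}

# After review: drop the redundant defaults and enforce_eager,
# keeping only the setting that actually changes vLLM's behavior here.
redundant = {"enable_prefix_caching", "enable_chunked_prefill", "enforce_eager"}
trimmed = {k: v for k, v in proposed.items() if k not in redundant}
```

With `enforce_eager` removed, vLLM captures CUDA graphs as usual, so decode throughput in the benchmark is not artificially depressed.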
Description
Why are these changes needed?
We observed suboptimal performance when using `mp` as the distributed executor backend.

Changes

1. When `world_size == 1`, use `uni` (uniproc executor) by default to avoid spawning additional processes and IPC overhead.
2. When `world_size > 1`, prefer `ray` over `mp` as the distributed executor backend for the following reasons:
   a. Improved resource cleanup: using `mp` as the vLLM backend is known to leave dangling processes during engine shutdown, whereas `ray` provides more reliable lifecycle management.
   b. Unified execution path: cross-node PP requires `ray` as the distributed backend. To maintain a consistent code path across different TP/PP configurations, we standardize on `ray`, which supports all TP/PP combinations.
   c. Advanced placement control: with `ray` as the distributed backend, users can explicitly control placement via `vLLMEngineProcessorConfig.placement_group_config`, allowing fine-grained scheduling of Ray Data actors.

Related issues
N/A
Additional information
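For clarity, the backend-selection rule described under Changes can be sketched as follows. This is a hypothetical helper written for illustration; the real logic lives in Ray Data's vLLM stage:

```python
def choose_executor_backend(tensor_parallel_size: int = 1,
                            pipeline_parallel_size: int = 1) -> str:
    """Sketch of this PR's selection rule (illustrative, not the actual code)."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size == 1:
        # Single worker: the uniproc executor avoids spawning extra
        # processes and the associated IPC overhead.
        return "uni"
    # Multiple workers: prefer Ray for reliable cleanup, cross-node PP
    # support, and explicit placement control.
    return "ray"
```

For example, TP=1/PP=1 selects `uni`, while any TP/PP combination with more than one worker selects `ray`.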
Classification workload benchmark results
Setup

- Model: `HuggingFaceTB/fineweb-edu-classifier`
- Command:

```
python benchmark_processor.py --mode classify --batch-size 2048 --concurrency 1 --num-prompts 102400 --model HuggingFaceTB/fineweb-edu-classifier
```

Results
Note: With the changes in this PR, the uniproc executor backend is now used by default, whereas mp was the default prior to these changes. Therefore, run the repro script with and without these changes to obtain the results above.
Generation workload benchmark results
- Model: `facebook/opt-1.3b` or `facebook/opt-350m`
- Command (`uni`, the default after this PR):

```
python benchmark_processor.py --mode vllm_engine --batch-size 512 --concurrency 1 --num-prompts 2560 --model facebook/opt-350m --max-tokens 1
```

- Command (`mp`):

```
python benchmark_processor.py --mode vllm_engine --batch-size 512 --concurrency 1 --num-prompts 2560 --model facebook/opt-350m --max-tokens 1 --distributed-executor-backend mp
```

Results (`facebook/opt-1.3b`)

Results (`facebook/opt-350m`)

Observations
As the GPU work per step shrinks, the performance gap between `mp` and `uni` becomes more evident. `uni` consistently outperforms `mp` because it avoids spawning additional processes and the associated IPC overhead, even though the data crossing process boundaries is relatively small (limited to sampled tokens and scheduler metadata).

Conclusions
When GPU work is small, IPC overhead becomes more pronounced, leading to degraded performance with the `mp` backend. In contrast, as model size and decode length increase, the relative overhead diminishes and the performance gap between `uni` and `mp` narrows.
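Point 2c of the Changes mentions explicit placement via `vLLMEngineProcessorConfig.placement_group_config`. A minimal sketch of what such a configuration might look like is below; the field names (`bundles`, `strategy`) follow Ray placement-group conventions, but the exact schema accepted by `vLLMEngineProcessorConfig` is an assumption here and should be checked against the Ray Data LLM documentation:

```python
# Illustrative shape only: field names for placement_group_config are
# assumptions modeled on Ray placement-group conventions, not a verified
# vLLMEngineProcessorConfig schema.
tp_size = 2
processor_config_kwargs = {
    "model_source": "facebook/opt-1.3b",
    "engine_kwargs": {
        "tensor_parallel_size": tp_size,
        # world_size > 1, so the ray backend is selected per this PR.
        "distributed_executor_backend": "ray",
    },
    "placement_group_config": {
        # One GPU bundle per TP worker, packed onto as few nodes as possible.
        "bundles": [{"GPU": 1}] * tp_size,
        "strategy": "PACK",
    },
}
```

This kind of explicit bundle layout is what lets users co-locate (or spread) the vLLM workers relative to other Ray Data actors, which the `mp` backend cannot express.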