[data][llm] Prefer uniproc executor over mp executor when world_size==1 #60403

Merged
kouroshHakha merged 2 commits into ray-project:master from jeffreywang-anyscale:uni-proc on Jan 25, 2026
Conversation

jeffreywang-anyscale (Contributor) commented on Jan 22, 2026

Description

Why are these changes needed?

We observed suboptimal performance when using mp as the distributed executor backend.

Changes

  1. When the world size is 1, use uni (uniproc executor) by default to avoid spawning additional processes and IPC overhead.
  2. When the world size is greater than 1, prefer ray over mp as the distributed executor backend for the following reasons:
    a. Improved resource cleanup: Using mp as the vLLM backend is known to leave dangling processes during engine shutdown, whereas ray provides more reliable lifecycle management.
    b. Unified execution path: Cross-node PP requires ray as the distributed backend. To maintain a consistent code path across different TP/PP configurations, we standardize on ray, which supports all TP/PP combinations.
    c. Advanced placement control: With ray as the distributed backend, users can explicitly control placement via vLLMEngineProcessorConfig.placement_group_config, allowing fine-grained scheduling of Ray Data actors.
  3. Adapted tests and benchmark script.
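
The selection heuristic in points 1 and 2 can be sketched as follows (a minimal illustration; the function and parameter names here are hypothetical, not the actual Ray Data implementation):

```python
def choose_executor_backend(tensor_parallel_size: int,
                            pipeline_parallel_size: int) -> str:
    """Pick a vLLM distributed executor backend from the engine's world size.

    Hypothetical sketch of the heuristic described in this PR, not the
    actual Ray Data code.
    """
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size == 1:
        # Single worker: run in-process ("uni") to avoid spawning extra
        # processes and paying IPC overhead.
        return "uni"
    # Multiple workers: prefer "ray" over "mp" for reliable cleanup on
    # shutdown, a single code path for all TP/PP combinations (cross-node
    # PP requires "ray"), and placement-group control.
    return "ray"


print(choose_executor_backend(1, 1))  # → uni
print(choose_executor_backend(2, 2))  # → ray
```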

Related issues

N/A

Additional information

Classification workload benchmark results

  • Setup

    • Model: HuggingFaceTB/fineweb-edu-classifier
    • Repro script: python benchmark_processor.py --mode classify --batch-size 2048 --concurrency 1 --num-prompts 102400 --model HuggingFaceTB/fineweb-edu-classifier
  • Results

| Distributed executor backend | Throughput (rows/s) | Δ vs. uni |
| --- | --- | --- |
| uni | 1361.45 | 0.0% |
| mp | 1144.76 | -15.9% |

Note: With the changes in this PR, the uniproc executor backend is now used by default, whereas mp was the default prior to these changes. Therefore, run the repro script with and without these changes to obtain the results above.
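
The Δ column is the relative throughput change versus the uni baseline; as a quick check on the numbers above:

```python
def delta_vs_uni(throughput: float, uni_baseline: float) -> float:
    """Percent throughput change relative to the uni backend."""
    return (throughput / uni_baseline - 1.0) * 100.0

# Classification workload: mp (1144.76 rows/s) vs. uni (1361.45 rows/s).
print(round(delta_vs_uni(1144.76, 1361.45), 1))  # → -15.9
```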

Generation workload benchmark results

  • Purpose: Observe the throughput difference between executor backends across model sizes and decode lengths.
  • Setup
    • Model: facebook/opt-1.3b or facebook/opt-350m
    • Repro script
      • uniproc executor backend (default): python benchmark_processor.py --mode vllm_engine --batch-size 512 --concurrency 1 --num-prompts 2560 --model facebook/opt-350m --max-tokens 1
      • mp/ray executor backend: python benchmark_processor.py --mode vllm_engine --batch-size 512 --concurrency 1 --num-prompts 2560 --model facebook/opt-350m --max-tokens 1 --distributed-executor-backend mp
    • Note: Vary the decode length by adjusting the max-tokens CLI argument.
  • Results (Model: facebook/opt-1.3b)
| Decode Length | Distributed Executor Backend | Throughput (rows/s) | Δ vs. uni |
| --- | --- | --- | --- |
| 1 | uni | 62.84 | 0.0% |
| 1 | mp | 53.41 | -15.0% |
| 1 | ray | 52.68 | -16.2% |
| 50 | uni | 33.23 | 0.0% |
| 50 | mp | 32.54 | -2.1% |
| 50 | ray | 32.31 | -2.8% |
  • Results (Model: facebook/opt-350m)
| Decode Length | Distributed Executor Backend | Throughput (rows/s) | Δ vs. uni |
| --- | --- | --- | --- |
| 1 | uni | 80.38 | 0.0% |
| 1 | mp | 65.78 | -18.2% |
| 1 | ray | 64.68 | -19.5% |
| 50 | uni | 56.84 | 0.0% |
| 50 | mp | 49.34 | -13.2% |
| 50 | ray | 47.80 | -15.9% |

Observations

  1. When model size and decode length shrink, the gap between mp and uni becomes more evident.
  2. Regardless of workload type, uni consistently outperforms mp because it avoids spawning additional processes and the associated IPC overhead, even though the data crossing process boundaries is relatively small (limited to sampled tokens and scheduler metadata).

Conclusions

When GPU work is small, IPC overhead becomes more pronounced, leading to degraded performance when using the mp backend. In contrast, as model size and decode length increase, the relative overhead diminishes and the performance gap between uni and mp narrows.


jeffreywang-anyscale added the "go" label (add ONLY when ready to merge, run all tests) on Jan 22, 2026
ray-gardener bot added the "data" (Ray Data-related issues) and "llm" labels on Jan 22, 2026

kouroshHakha (Contributor) left a comment:

The changes in the vLLM stage make sense. Can you re-review the changes to the benchmark and only keep those that are absolutely necessary? For example, enforcing eager is actually not good for the benchmark by default; it will slow down decoding.

"enable_prefix_caching": True,
"enable_chunked_prefill": True,
"max_num_batched_tokens": 4096,
"enforce_eager": True,
Why change these? Enforcing eager will slow down the benchmarks for decoding.

jeffreywang-anyscale (Author) replied:
Just trying to reproduce the numbers I had earlier. Your suggestion makes sense. Will remove enforce_eager and update the numbers.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Comment on lines -55 to -56
"enable_prefix_caching": True,
"enable_chunked_prefill": True,
jeffreywang-anyscale (Author) commented:
enable_prefix_caching and enable_chunked_prefill are enabled by default in vLLM.

cursor bot left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@kouroshHakha kouroshHakha enabled auto-merge (squash) January 25, 2026 07:47
@kouroshHakha kouroshHakha merged commit cde2545 into ray-project:master Jan 25, 2026
7 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Jan 26, 2026
…=1 (ray-project#60403)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
…=1 (ray-project#60403)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
…=1 (ray-project#60403)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: 400Ping <jiekaichang@apache.org>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…=1 (ray-project#60403)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>