[data][llm] Prefer uniproc executor over mp executor when world_size==1 #60403

kouroshHakha merged 2 commits into ray-project:master
Conversation
kouroshHakha left a comment:

The changes in the vllm stage make sense. Can you re-review the changes to the benchmark and keep only those that were absolutely necessary? For example, enforcing eager is actually not good for the benchmark by default; it will slow down decoding.
```python
"enable_prefix_caching": True,
"enable_chunked_prefill": True,
"max_num_batched_tokens": 4096,
"enforce_eager": True,
```
Why change these? Enforcing eager will slow down the benchmarks for decoding.

Just trying to reproduce the numbers I had earlier. Your suggestion makes sense. I will remove enforce_eager and update the numbers.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Force-pushed 00b536e to 4ef92f7
```python
"enable_prefix_caching": True,
"enable_chunked_prefill": True,
```

`enable_prefix_caching` and `enable_chunked_prefill` are enabled by default in vLLM.
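Putting the two review comments together, the benchmark's engine kwargs can be trimmed like this. This is a hypothetical reconstruction of the reviewed hunk, not the exact contents of the benchmark file:

```python
# Engine kwargs as originally proposed for the benchmark (reconstructed
# from the review hunk above; not the exact file contents).
proposed = {
    "enable_prefix_caching": True,   # already the vLLM default
    "enable_chunked_prefill": True,  # already the vLLM default
    "max_num_batched_tokens": 4096,
    "enforce_eager": True,           # disables CUDA graphs, slowing decoding
}

# After review: drop the redundant defaults and enforce_eager,
# keeping only the setting that actually changes vLLM's behavior here.
redundant = {"enable_prefix_caching", "enable_chunked_prefill", "enforce_eager"}
trimmed = {k: v for k, v in proposed.items() if k not in redundant}
```

With `enforce_eager` removed, vLLM captures CUDA graphs as usual, so decode throughput in the benchmark is not artificially depressed.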
Description
Why are these changes needed?
We observed suboptimal performance when using `mp` as the distributed executor backend.

Changes

1. When `world_size == 1`, use `uni` (uniproc executor) by default to avoid spawning additional processes and IPC overhead.
2. When `world_size > 1`, prefer `ray` over `mp` as the distributed executor backend for the following reasons:
   a. Improved resource cleanup: using `mp` as the vLLM backend is known to leave dangling processes during engine shutdown, whereas `ray` provides more reliable lifecycle management.
   b. Unified execution path: cross-node PP requires `ray` as the distributed backend. To maintain a consistent code path across different TP/PP configurations, we standardize on `ray`, which supports all TP/PP combinations.
   c. Advanced placement control: with `ray` as the distributed backend, users can explicitly control placement via `vLLMEngineProcessorConfig.placement_group_config`, allowing fine-grained scheduling of Ray Data actors.

Related issues
N/A
Additional information
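For clarity, the backend-selection rule described under Changes can be sketched as follows. This is a hypothetical helper written for illustration; the real logic lives in Ray Data's vLLM stage:

```python
def choose_executor_backend(tensor_parallel_size: int = 1,
                            pipeline_parallel_size: int = 1) -> str:
    """Sketch of this PR's selection rule (illustrative, not the actual code)."""
    world_size = tensor_parallel_size * pipeline_parallel_size
    if world_size == 1:
        # Single worker: the uniproc executor avoids spawning extra
        # processes and the associated IPC overhead.
        return "uni"
    # Multiple workers: prefer Ray for reliable cleanup, cross-node PP
    # support, and explicit placement control.
    return "ray"
```

For example, TP=1/PP=1 selects `uni`, while any TP/PP combination with more than one worker selects `ray`.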
Classification workload benchmark results
Setup

- Model: `HuggingFaceTB/fineweb-edu-classifier`
- Command:

```
python benchmark_processor.py --mode classify --batch-size 2048 --concurrency 1 --num-prompts 102400 --model HuggingFaceTB/fineweb-edu-classifier
```

Results
Note: With the changes in this PR, the uniproc executor backend is now used by default, whereas mp was the default prior to these changes. Therefore, run the repro script with and without these changes to obtain the results above.
Generation workload benchmark results
- Model: `facebook/opt-1.3b` or `facebook/opt-350m`
- Command (`uni`, the default after this PR):

```
python benchmark_processor.py --mode vllm_engine --batch-size 512 --concurrency 1 --num-prompts 2560 --model facebook/opt-350m --max-tokens 1
```

- Command (`mp`):

```
python benchmark_processor.py --mode vllm_engine --batch-size 512 --concurrency 1 --num-prompts 2560 --model facebook/opt-350m --max-tokens 1 --distributed-executor-backend mp
```

Results (`facebook/opt-1.3b`)

Results (`facebook/opt-350m`)

Observations
As the GPU work per step shrinks, the performance gap between `mp` and `uni` becomes more evident. `uni` consistently outperforms `mp` because it avoids spawning additional processes and the associated IPC overhead, even though the data crossing process boundaries is relatively small (limited to sampled tokens and scheduler metadata).

Conclusions
When GPU work is small, IPC overhead becomes more pronounced, leading to degraded performance with the `mp` backend. In contrast, as model size and decode length increase, the relative overhead diminishes and the performance gap between `uni` and `mp` narrows.
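Point 2c of the Changes mentions explicit placement via `vLLMEngineProcessorConfig.placement_group_config`. A minimal sketch of what such a configuration might look like is below; the field names (`bundles`, `strategy`) follow Ray placement-group conventions, but the exact schema accepted by `vLLMEngineProcessorConfig` is an assumption here and should be checked against the Ray Data LLM documentation:

```python
# Illustrative shape only: field names for placement_group_config are
# assumptions modeled on Ray placement-group conventions, not a verified
# vLLMEngineProcessorConfig schema.
tp_size = 2
processor_config_kwargs = {
    "model_source": "facebook/opt-1.3b",
    "engine_kwargs": {
        "tensor_parallel_size": tp_size,
        # world_size > 1, so the ray backend is selected per this PR.
        "distributed_executor_backend": "ray",
    },
    "placement_group_config": {
        # One GPU bundle per TP worker, packed onto as few nodes as possible.
        "bundles": [{"GPU": 1}] * tp_size,
        "strategy": "PACK",
    },
}
```

This kind of explicit bundle layout is what lets users co-locate (or spread) the vLLM workers relative to other Ray Data actors, which the `mp` backend cannot express.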