[vllm] fix: enable multi-node vllm replicas with expert parallelism #5339
Open
guillemgt wants to merge 2 commits into verl-project:main
Conversation
Contributor
Code Review
This pull request correctly removes a restrictive assertion to enable multi-node vLLM deployments with expert parallelism where tensor parallelism spans multiple nodes. The logic to handle tensor_model_parallel_size > gpus_per_node is a good addition. However, I've identified a critical missing validation check that could lead to runtime errors or hangs with specific configurations. I've also suggested some minor improvements to the new assertion messages to enhance clarity for future debugging.
- Add critical validation that `tensor_model_parallel_size` is divisible by `gpus_per_node` when tensor parallelism spans multiple nodes
- Improve assertion messages for clarity by removing misleading "or vice versa" text
- Fix extraneous asterisks in assertion error messages

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
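For reference, a minimal sketch of the kind of validation described in the first bullet, assuming the variable names used elsewhere in this PR (`tensor_model_parallel_size`, `gpus_per_node`); the actual assertion wording in the patch may differ:

```python
# Example values for illustration only.
gpus_per_node = 8
tensor_model_parallel_size = 16

# When TP spans multiple nodes, each TP group must own a whole number
# of nodes, so TP size must be an exact multiple of gpus_per_node.
if tensor_model_parallel_size > gpus_per_node:
    assert tensor_model_parallel_size % gpus_per_node == 0, (
        f"tensor_model_parallel_size ({tensor_model_parallel_size}) must be "
        f"divisible by gpus_per_node ({gpus_per_node}) when tensor "
        "parallelism spans multiple nodes"
    )
```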
What does this PR do?
Removes an overly restrictive assertion that prevented using multi-node vLLM servers when Expert Parallelism (EP) is enabled and tensor parallelism spans multiple nodes. The underlying functionality already worked; this change simply allows it to be used by handling the case where `tensor_model_parallel_size > gpus_per_node`.

Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (this will be checked by the CI). Here: modules = `rollout`, type = `fix`.

Test
The previous assertion failure occurred when launching multi-node vLLM servers with:

- expert parallelism enabled (`expert_parallel_size > 1`)
- tensor parallelism spanning multiple nodes (`tensor_model_parallel_size > gpus_per_node`)

Testing approach: validated with multi-node vLLM deployments using EP and cross-node TP configurations.
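As a concrete illustration, here is a hypothetical configuration of the kind described above (example numbers, not taken from the PR):

```python
# Hypothetical example values; 8 GPUs per node is an assumption.
gpus_per_node = 8
tensor_model_parallel_size = 16  # one TP group spans two nodes
expert_parallel_size = 16        # expert parallelism enabled

# The removed assertion effectively required TP to fit on a single node:
print(gpus_per_node % tensor_model_parallel_size == 0)  # False -> AssertionError before this PR
```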
API and Usage Example
No API changes. Internal fix only.
Design & Code Changes
The original code assumed `gpus_per_node % tensor_model_parallel_size == 0`, which fails when TP spans multiple nodes. This PR adds conditional logic to compute `data_parallel_start_rank` correctly for both cases (see the sketch after this list):

- TP fits within a node (`data_parallel_size_local > 0`): existing behavior
- TP spans multiple nodes (`data_parallel_size_local == 0`): compute the rank using `node_rank // nodes_per_data_parallel_worker`
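A minimal sketch of that two-case logic, assuming the variable names used in the description above (`node_rank`, `gpus_per_node`, `data_parallel_size_local`, `nodes_per_data_parallel_worker`); the actual code in the patch may differ in detail:

```python
def compute_data_parallel_start_rank(node_rank: int,
                                     gpus_per_node: int,
                                     tensor_model_parallel_size: int) -> int:
    # Case 1: TP fits within a node, so each node hosts one or more
    # data-parallel workers (existing behavior).
    data_parallel_size_local = gpus_per_node // tensor_model_parallel_size
    if data_parallel_size_local > 0:
        return node_rank * data_parallel_size_local
    # Case 2: one TP group (one DP worker) spans several whole nodes,
    # so data-parallel ranks are counted in node-sized units.
    nodes_per_data_parallel_worker = tensor_model_parallel_size // gpus_per_node
    return node_rank // nodes_per_data_parallel_worker

# With 8 GPUs per node and TP=16: nodes 0-1 form DP rank 0, nodes 2-3 DP rank 1.
assert compute_data_parallel_start_rank(0, 8, 16) == 0
assert compute_data_parallel_start_rank(2, 8, 16) == 1
```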
Checklist Before Submitting

- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI via the `ci-request` channel in the `verl` Slack workspace.
- If the PR changes the `recipe` submodule, also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`. (N/A)