[vllm] fix: enable multi-node vllm replicas with expert parallelism#5339

Open
guillemgt wants to merge 2 commits into verl-project:main from guillemgt:fix-multinode-vllm-ep

Conversation

@guillemgt (Contributor)

What does this PR do?

Removes an overly restrictive assertion that prevented using multi-node vLLM servers when Expert Parallelism (EP) is enabled and tensor parallelism spans multiple nodes. The underlying functionality already worked; this change simply allows it to be used by handling the case where tensor_model_parallel_size > gpus_per_node.


Test

The previous assertion failure occurred when launching multi-node vLLM servers with:

  • Expert Parallelism enabled (expert_parallel_size > 1)
  • Tensor parallelism spanning multiple nodes (tensor_model_parallel_size > gpus_per_node)

Testing approach: Validated with multi-node vLLM deployments using EP and cross-node TP configurations.
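
As a concrete illustration (hypothetical numbers, not from the actual test runs), here is the arithmetic that tripped the old assertion when TP spans two 8-GPU nodes:

```python
# Hypothetical failing configuration: one TP group spans two 8-GPU nodes, EP on.
gpus_per_node = 8
tensor_model_parallel_size = 16  # > gpus_per_node, so TP spans two nodes
expert_parallel_size = 8         # EP enabled

# The old code assumed gpus_per_node % tensor_model_parallel_size == 0,
# which cannot hold here (8 % 16 == 8), so the assertion fired even though
# the underlying multi-node setup worked.
print(gpus_per_node % tensor_model_parallel_size)  # -> 8, not 0
```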

API and Usage Example

No API changes. Internal fix only.

Design & Code Changes

The original code assumed gpus_per_node % tensor_model_parallel_size == 0, which fails when TP spans multiple nodes. This PR adds conditional logic to compute data_parallel_start_rank correctly for both cases (see the sketch after this list):

  1. TP fits within a node (data_parallel_size_local > 0): existing behavior
  2. TP spans multiple nodes (data_parallel_size_local == 0): compute the rank as node_rank // nodes_per_data_parallel_worker, where nodes_per_data_parallel_worker = tensor_model_parallel_size // gpus_per_node
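
A minimal sketch of this two-case computation, assuming the case-1 formula (node_rank * data_parallel_size_local) from the description; this illustrates the logic, not the exact patch:

```python
def data_parallel_start_rank(node_rank: int,
                             gpus_per_node: int,
                             tensor_model_parallel_size: int) -> int:
    # DP workers that fit on one node; 0 when a single TP group spans nodes.
    data_parallel_size_local = gpus_per_node // tensor_model_parallel_size
    if data_parallel_size_local > 0:
        # Case 1: TP fits within a node (existing behavior, assumed form).
        return node_rank * data_parallel_size_local
    # Case 2: TP spans multiple nodes; several nodes serve one DP worker.
    nodes_per_data_parallel_worker = tensor_model_parallel_size // gpus_per_node
    return node_rank // nodes_per_data_parallel_worker
```

For example, with gpus_per_node = 8 and tensor_model_parallel_size = 16, nodes 0 and 1 both map to data-parallel rank 0, and nodes 2 and 3 map to rank 1.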

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation. (N/A - internal fix only)
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: Multi-node EP configurations require multi-GPU infrastructure not available in CI.
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace.
  • If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main. (N/A)

@guillemgt changed the title from "[rollout] fix: enable multi-node vllm replicas with expert parallelism" to "[vllm] fix: enable multi-node vllm replicas with expert parallelism" on Feb 17, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request correctly removes a restrictive assertion to enable multi-node vLLM deployments with expert parallelism where tensor parallelism spans multiple nodes. The logic to handle tensor_model_parallel_size > gpus_per_node is a good addition. However, I've identified a critical missing validation check that could lead to runtime errors or hangs with specific configurations. I've also suggested some minor improvements to the new assertion messages to enhance clarity for future debugging.

- Add critical validation that tensor_model_parallel_size is divisible by gpus_per_node when tensor parallelism spans multiple nodes (see the sketch below)
- Improve assertion messages for clarity by removing misleading "or vice versa" text
- Fix extraneous asterisks in assertion error messages

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
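
The first of these points corresponds to a check along the following lines (a sketch under the assumption that these variables are in scope at the validation site; not the exact diff):

```python
# Assumed form of the divisibility check requested in the review; names follow
# the PR text, not necessarily the final patch.
if tensor_model_parallel_size > gpus_per_node:
    assert tensor_model_parallel_size % gpus_per_node == 0, (
        f"tensor_model_parallel_size ({tensor_model_parallel_size}) must be "
        f"divisible by gpus_per_node ({gpus_per_node}) when tensor "
        f"parallelism spans multiple nodes"
    )
```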
