[AMD][ROCm] MoRI EP: a high-performance all2all backend#28664
tjtanaa merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request introduces MoRI-EP as a high-performance all2all backend for Mixture-of-Experts models on ROCm platforms. The changes are extensive, touching configuration, device communicators, and MoE layer implementations to integrate the new backend. The implementation introduces MoriAll2AllManager for communication and AiterExperts for the expert computation path. Overall, the changes are well-structured and seem to correctly integrate the new backend. However, I found a critical issue in the logic for handling shared experts in the DeepSeek V2 model, which could lead to incorrect behavior.
CC @sunway513 @mgoin @houseroad @robertgshaw2-redhat @HAIAI Please help review this one. Thanks!
    from aiter import QuantType, get_hip_quant

    if quant_config.is_block_quantized:
        quant_func = get_hip_quant(QuantType.per_1x128)
Is this part of the code included in the CUDA graph?
I'm not sure I understand the question. This code does the FP8 quantization before dispatch so that we can reduce communication overhead.
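To make that step concrete, here is an illustrative pure-Python sketch of per-1x128 block quantization (the rounding and integer codes are stand-ins; the real AITER `get_hip_quant(QuantType.per_1x128)` path does this on-device with FP8 tensors):

```python
# Illustrative sketch of per-1x128 block quantization for FP8 dispatch.
# Not AITER's kernel: get_hip_quant(QuantType.per_1x128) does this on-device.
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def quant_per_1x128(row):
    """Quantize one hidden-state row: one scale per 128-value block."""
    assert len(row) % 128 == 0
    qvals, scales = [], []
    for start in range(0, len(row), 128):
        block = row[start:start + 128]
        amax = max(abs(v) for v in block) or 1.0  # avoid div-by-zero
        scale = amax / FP8_E4M3_MAX
        scales.append(scale)
        # integers stand in for fp8 codes; real kernels cast to fp8 here
        qvals.extend(round(v / scale) for v in block)
    return qvals, scales

q, s = quant_per_1x128([0.5] * 128 + [2.0] * 128)
print(len(s), max(q))  # 2 448
```

Dispatching 1-byte codes plus one scale per 128 values roughly halves the payload relative to bf16 activations, which is the communication saving referred to above.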
A test in

Great suggestion! Will do

Documentation preview: https://vllm--28664.org.readthedocs.build/en/28664/
SageMoore left a comment:
This looks good. Will accept once the test is added. Thanks for the contribution!
This pull request has merge conflicts that must be resolved before it can be merged.

@alexsun07 any updates on this? It would be great to get this merged.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
@alexsun07 @tjtanaa can we focus on getting this PR merged? Thanks.
Hi @alexsun07, the pre-commit checks have failed. Please run:

`uv pip install pre-commit`
`pre-commit install`
`pre-commit run --all-files`

Then commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
@alexsun07 The community landed a refactor PR. Could you help resolve the merge conflict?
Signed-off-by: Alex Sun <alex.s@amd.com>
resolved

Thanks @sunway513. Merged now by @tjtanaa.
…#28664) Signed-off-by: Alex Sun <alex.s@amd.com> Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
…#28664) Signed-off-by: Alex Sun <alex.s@amd.com> Signed-off-by: 陈建华 <1647430658@qq.com>
…#28664) Signed-off-by: Alex Sun <alex.s@amd.com>
Purpose
This PR integrates MoRI-EP, a high-performance all2all communication kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graphs.
This PR follows the design of vLLM's Fused MoE Modular Kernel, which is composed of the following components:
[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]
For the MoRI+AITER path, the high-performance practice from AMD, the pipeline is:
[Router] → [Quantize-Dispatch] → [Experts] → [Combine]
Two new classes are introduced: MoriAll2AllManager for communication, and AiterExperts for the expert computation path.
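As a toy illustration of how these two roles compose (class and method names below are placeholders, not vLLM's actual modular-kernel interfaces):

```python
# Toy sketch of the [Router] -> [Dispatch] -> [Experts] -> [Combine] split.
# Names are placeholders, not vLLM's real modular-kernel API.

class ToyPrepareFinalize:
    """Plays the dispatch/combine role (MoriAll2AllManager in this PR)."""

    def prepare(self, tokens, topk_ids):
        # "dispatch": group (token_index, value) pairs by routed expert
        per_expert = {}
        for i, (tok, experts) in enumerate(zip(tokens, topk_ids)):
            for e in experts:
                per_expert.setdefault(e, []).append((i, tok))
        return per_expert

    def finalize(self, partial, num_tokens):
        # "combine": sum each token's per-expert contributions
        out = [0.0] * num_tokens
        for contribs in partial.values():
            for i, val in contribs:
                out[i] += val
        return out


class ToyExperts:
    """Plays the expert-computation role (AiterExperts in this PR)."""

    def __init__(self, expert_fns):
        self.expert_fns = expert_fns

    def apply(self, per_expert):
        return {e: [(i, self.expert_fns[e](x)) for i, x in toks]
                for e, toks in per_expert.items()}


pf = ToyPrepareFinalize()
experts = ToyExperts({0: lambda x: 2 * x, 1: lambda x: x + 1})
routed = pf.prepare([1.0, 3.0], topk_ids=[[0, 1], [1]])
out = pf.finalize(experts.apply(routed), num_tokens=2)
print(out)  # [4.0, 4.0]
```

On the real MoRI+AITER path, `prepare`/`finalize` map to the all2all dispatch/combine, and `apply` maps to the fused AITER expert kernels with no separate permute/unpermute stage.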
Summary of the performance comparison between MoRI-EP and the naive backend (bs=128 per DP rank):
How to install MoRI
See https://github.com/ROCm/mori
Test Plan
Test platform: MI300X + CX7
Accuracy
Serve on DeepSeek-V3/R1 (Block scale quant)
Serve on DeepSeek-R1-PTPC (per token per channel quant)
See here for more info about PTPC.
Evaluate with gsm8k.
Performance
Test EP8 and EP16 performance and compare with the naive all2all backend.
EP8 with mori backend

EP8 with naive backend: replace `--all2all-backend mori` with `--all2all-backend naive`.

EP16 with mori backend

EP16 with naive backend: replace `--all2all-backend mori` with `--all2all-backend naive`, and use `--enforce-eager`.

Benchmark:
Use `--random-input-len 1 --random-prefix-len 1023` because we want to simulate PD disaggregation and test decode performance without prefill.
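As a rough back-of-envelope model of why a targeted dispatch beats the naive backend (an assumption about the backends for illustration only: the naive backend is modeled as gathering every token on every rank, while a targeted dispatch sends each token only to the ranks owning its top-k experts):

```python
# Back-of-envelope model, not vLLM code: count token copies one rank sends
# during MoE dispatch under the two exchange strategies.

def token_copies_sent(num_tokens, topk, num_ranks, naive):
    """Worst-case token copies a single rank must send during dispatch."""
    if naive:
        # modeled as: every token goes to every other rank
        return num_tokens * (num_ranks - 1)
    # targeted: at most one copy per selected expert's rank
    return num_tokens * min(topk, num_ranks - 1)

# DeepSeek-style top-8 routing, bs=128 per DP rank, EP16
naive_cost = token_copies_sent(128, topk=8, num_ranks=16, naive=True)
mori_cost = token_copies_sent(128, topk=8, num_ranks=16, naive=False)
print(naive_cost, mori_cost)  # 1920 1024
```

The gap widens as EP size grows past top-k, which is consistent with testing both EP8 and EP16 above.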
Test Result
Accuracy
MoRI-EP with DeepSeek-V3
MoRI-EP with DeepSeek-R1-PTPC
Decode Performance
Summary
EP8 mori all2all backend
EP8 naive all2all backend
EP16 mori all2all backend
EP16 naive all2all backend
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.