
[AMD][ROCm] MoRI EP: a high-performance all2all backend#28664

Merged
tjtanaa merged 4 commits into vllm-project:main from alexsun07:mori_ep
Jan 22, 2026

Conversation

@alexsun07 (Contributor) commented Nov 13, 2025

Purpose

This PR integrates MoRI-EP, a high-performance all2all communication kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graph capture.

This PR follows the design of vLLM's Fused MoE Modular Kernel, which is composed of the following stages:
[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]

For the MoRI+AITER path, AMD's high-performance configuration, the pipeline becomes:
[Router] → [Quantize-Dispatch] → [Experts] → [Combine]

Two new classes are introduced:

  • MoriPrepareAndFinalize: performs the [Quantize-Dispatch] and [Combine] stages
  • AiterExperts: performs the [Experts] stage without any permute/unpermute step
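
To make the composition concrete, here is a minimal, illustrative sketch of how the two stages chain together. This is not vLLM's actual API; the class names, method names, and the toy "expert" math are simplified stand-ins for the real MoriPrepareAndFinalize / AiterExperts classes.

```python
# Illustrative sketch only -- not vLLM's actual modular-kernel API.
# It mimics the [Quantize-Dispatch] -> [Experts] -> [Combine] composition
# described above using plain Python containers.

class MoriPrepareAndFinalizeSketch:
    """Stand-in for MoriPrepareAndFinalize: [Quantize-Dispatch] + [Combine]."""

    def prepare(self, tokens, topk_ids):
        # The real kernel FP8-quantizes tokens and all2all-dispatches them
        # to the EP ranks owning the selected experts; here we just bucket.
        buckets = {}
        for tok, expert in zip(tokens, topk_ids):
            buckets.setdefault(expert, []).append(tok)
        return buckets

    def finalize(self, expert_outputs):
        # The real kernel all2all-combines results back to the source ranks;
        # here we simply flatten outputs in expert order.
        out = []
        for expert in sorted(expert_outputs):
            out.extend(expert_outputs[expert])
        return out


class AiterExpertsSketch:
    """Stand-in for AiterExperts: [Experts] with no permute/unpermute."""

    def apply(self, buckets):
        # A toy "expert MLP": expert e scales its tokens by (e + 1).
        return {e: [t * (e + 1) for t in toks] for e, toks in buckets.items()}


def fused_moe_sketch(tokens, topk_ids):
    pf, experts = MoriPrepareAndFinalizeSketch(), AiterExpertsSketch()
    return pf.finalize(experts.apply(pf.prepare(tokens, topk_ids)))
```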

Summary of the performance comparison between the MoRI-EP and naive backends (bs=128 per DP rank):

| all2all       | EP size | Mean TPOT (ms) | Output tok/s per node | perf  |
|---------------|--------:|---------------:|----------------------:|------:|
| naive         |       8 |         128.42 |               7119.64 | 1.00x |
| mori          |       8 |          94.14 |               9439.57 | 1.33x |
| naive (eager) |      16 |         305.36 |               2740.34 | 1.00x |
| mori          |      16 |         110.87 |               7343.28 | 2.68x |
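
The perf column is the per-node output-throughput ratio against the naive baseline at the same EP size; a quick arithmetic check of the reported ratios:

```python
# Sanity-check the "perf" column: output tok/s per node, mori vs. naive,
# at the same EP size (values copied from the summary table).
tps_per_node = {
    ("naive", 8): 7119.64,
    ("mori", 8): 9439.57,
    ("naive", 16): 2740.34,
    ("mori", 16): 7343.28,
}
ep8_speedup = round(tps_per_node[("mori", 8)] / tps_per_node[("naive", 8)], 2)
ep16_speedup = round(tps_per_node[("mori", 16)] / tps_per_node[("naive", 16)], 2)
# ep8_speedup == 1.33, ep16_speedup == 2.68
```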

How to install MoRI

See https://github.com/ROCm/mori

Test Plan

Test platform: MI300X + CX7

Accuracy

Serve on DeepSeek-V3/R1 (Block scale quant)

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
vllm serve deepseek-ai/DeepSeek-V3 \
    -tp 1 \
    -dp 8 \
    --port 30000 \
    --all2all-backend mori \
    --enable-expert-parallel

Serve on DeepSeek-R1-PTPC (per-token-per-channel quant);
see here for more info about PTPC.

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
vllm serve EmbeddedLLM/deepseek-r1-FP8-Dynamic \
    -tp 1 \
    -dp 8 \
    --port 30000 \
    --all2all-backend mori \
    --enable-expert-parallel

Evaluate by gsm8k

lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=<model_path>,base_url=http://localhost:30000/v1/completions,num_concurrent=256,max_retries=3,tokenized_requests=False 

Performance

Test EP8 and EP16 performance and compare against the naive all2all backend.

EP8 with mori backend

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
VLLM_MOE_DP_CHUNK_SIZE=512 \
vllm serve EmbeddedLLM/deepseek-r1-FP8-Dynamic \
    -tp 1 \
    -dp 8 \
    --port 30000 \
    --all2all-backend mori \
    --max-num-seqs 128 \
    --enable-expert-parallel \
    --cudagraph-capture-sizes 1 2 4 8 16 32 64 128

EP8 with naive backend:
replace --all2all-backend mori with --all2all-backend naive.

EP16 with mori backend

# node0
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
VLLM_MOE_DP_CHUNK_SIZE=512 \
vllm serve /nfs/DeepSeek-R1-PTPC \
    -dp 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address <node-0-ip> --data-parallel-rpc-port <node-0-port> \
    --enable-expert-parallel \
    --all2all-backend mori \
    --port 30000 \
    --max-num-seqs 128 \
    --cuda-graph-sizes 1 2 4 8 16 32 64 128 \
    --trust-remote-code 

# node1
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
VLLM_MOE_DP_CHUNK_SIZE=512 \
vllm serve /nfs/DeepSeek-R1-PTPC \
    -dp 16 \
    --headless \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address <node-0-ip> --data-parallel-rpc-port <node-0-port> \
    --enable-expert-parallel \
    --all2all-backend mori \
    --port 30000 \
    --max-num-seqs 128 \
    --cuda-graph-sizes 1 2 4 8 16 32 64 128 \
    --trust-remote-code

EP16 with naive backend:
replace --all2all-backend mori with --all2all-backend naive, and use --enforce-eager.

Benchmark:
use --random-input-len 1 --random-prefix-len 1023 to simulate PD disaggregation and measure decode performance without prefill load.

vllm bench serve \
    --max-concurrency <1024 * node_num> \
    --num-prompts <4096 * node_num> \
    --model <model_path> \
    --port 30000 \
    --ignore-eos \
    --trust-remote-code \
    --dataset-name random \
    --seed 2025 \
    --random-input-len 1 \
    --random-prefix-len 1023 \
    --random-output-len 500

Test Result

Accuracy

MoRI-EP with DeepSeek-V3

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9469|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|

MoRI-EP with DeepSeek-R1-PTPC

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9530|±  |0.0058|

Decode Performance

Summary

| all2all       | EP size | Mean TPOT (ms) | Output tok/s per node | perf  |
|---------------|--------:|---------------:|----------------------:|------:|
| naive         |       8 |         128.42 |               7119.64 | 1.00x |
| mori          |       8 |          94.14 |               9439.57 | 1.33x |
| naive (eager) |      16 |         305.36 |               2740.34 | 1.00x |
| mori          |      16 |         110.87 |               7343.28 | 2.68x |

EP8 mori all2all backend

============ Serving Benchmark Result ============
Successful requests:                     4096      
Failed requests:                         0         
Maximum request concurrency:             1024      
Benchmark duration (s):                  216.96    
Total input tokens:                      4190208   
Total generated tokens:                  2048000   
Request throughput (req/s):              18.88     
Output token throughput (tok/s):         9439.57   
Peak output token throughput (tok/s):    13171.00  
Peak concurrent requests:                1152.00   
Total Token throughput (tok/s):          28752.92  
---------------Time to First Token----------------
Mean TTFT (ms):                          3079.99   
Median TTFT (ms):                        1172.27   
P99 TTFT (ms):                           14658.47  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.14     
Median TPOT (ms):                        95.69     
P99 TPOT (ms):                           98.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.46    
Median ITL (ms):                         84.14     
P99 ITL (ms):                            503.41    
==================================================
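
As a quick consistency check on the EP8 mori log above (pure arithmetic; the log rounds the duration, so throughput is matched with a small tolerance):

```python
# Token counts and throughput from the EP8 mori run above.
num_prompts, prefix_len, output_len = 4096, 1023, 500
duration_s = 216.96

assert num_prompts * prefix_len == 4190208   # Total input tokens
assert num_prompts * output_len == 2048000   # Total generated tokens
# Reported output throughput (9439.57 tok/s) matches tokens / duration
# to within rounding of the logged duration.
assert abs(2048000 / duration_s - 9439.57) < 1.0
```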

EP8 naive all2all backend

============ Serving Benchmark Result ============
Successful requests:                     4096      
Failed requests:                         0         
Maximum request concurrency:             1024      
Benchmark duration (s):                  287.65    
Total input tokens:                      4190208   
Total generated tokens:                  2048000   
Request throughput (req/s):              14.24     
Output token throughput (tok/s):         7119.64   
Peak output token throughput (tok/s):    10230.00  
Peak concurrent requests:                1152.00   
Total Token throughput (tok/s):          21686.42  
---------------Time to First Token----------------
Mean TTFT (ms):                          3118.80   
Median TTFT (ms):                        1093.97   
P99 TTFT (ms):                           15430.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          128.42    
Median TPOT (ms):                        129.82    
P99 TPOT (ms):                           137.51    
---------------Inter-token Latency----------------
Mean ITL (ms):                           133.46    
Median ITL (ms):                         112.55    
P99 ITL (ms):                            513.15    
==================================================

EP16 mori all2all backend

============ Serving Benchmark Result ============
Successful requests:                     8192
Failed requests:                         0
Maximum request concurrency:             2048
Benchmark duration (s):                  278.89
Total input tokens:                      8380416
Total generated tokens:                  4096000
Request throughput (req/s):              29.37
Output token throughput (tok/s):         14686.55
Peak output token throughput (tok/s):    20942.00
Peak concurrent requests:                2271.00
Total Token throughput (tok/s):          44735.22
---------------Time to First Token----------------
Mean TTFT (ms):                          10838.91
Median TTFT (ms):                        7431.13
P99 TTFT (ms):                           34603.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          110.87
Median TPOT (ms):                        111.76
P99 TPOT (ms):                           127.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           209.21
Median ITL (ms):                         94.86
P99 ITL (ms):                            864.02
==================================================

EP16 naive all2all backend

============ Serving Benchmark Result ============
Successful requests:                     8192
Failed requests:                         0
Maximum request concurrency:             2048
Benchmark duration (s):                  747.35
Total input tokens:                      8380416
Total generated tokens:                  4096000
Request throughput (req/s):              10.96
Output token throughput (tok/s):         5480.68
Peak output token throughput (tok/s):    9665.00
Peak concurrent requests:                2187.00
Total Token throughput (tok/s):          16694.17
---------------Time to First Token----------------
Mean TTFT (ms):                          10112.99
Median TTFT (ms):                        7514.72
P99 TTFT (ms):                           35132.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          305.36
Median TPOT (ms):                        305.49
P99 TPOT (ms):                           317.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           328.70
Median ITL (ms):                         297.74
P99 ITL (ms):                            857.16
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces MoRI-EP as a high-performance all2all backend for Mixture-of-Experts models on ROCm platforms. The changes are extensive, touching configuration, device communicators, and MoE layer implementations to integrate the new backend. The implementation introduces MoriAll2AllManager for communication and AiterExperts for the expert computation path. Overall, the changes are well-structured and seem to correctly integrate the new backend. However, I found a critical issue in the logic for handling shared experts in the DeepSeek V2 model, which could lead to incorrect behavior.

@alexsun07 (Contributor, Author) commented

CC @sunway513 @mgoin @houseroad @robertgshaw2-redhat @HAIAI
The previous #27273 was force-closed by GitHub and cannot be reopened, so I will use this PR instead. Sorry for the trouble.

Please help review this one. Thanks!

from aiter import QuantType, get_hip_quant

if quant_config.is_block_quantized:
    quant_func = get_hip_quant(QuantType.per_1x128)
A collaborator commented:

Is this part of the code included in the CUDA graph?

@alexsun07 (Contributor, Author) replied Nov 14, 2025

I’m not sure I understand the question. This code performs the FP8 quantization before dispatch so that we can reduce communication overhead.
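
For readers unfamiliar with the scheme: per-1x128 block quantization gives each 128-element block its own FP8 scale, so dispatched activations travel as 8-bit values. The sketch below is purely illustrative (plain Python, not AITER's kernel); 448.0 is the float8 e4m3 representable maximum.

```python
# Illustrative per-1x128 block FP8 quantization (not AITER's implementation).
# Each 128-element block gets one scale so values fit the e4m3 range,
# roughly halving all2all traffic versus BF16 activations.

FP8_E4M3_MAX = 448.0

def quant_per_1x128(row):
    """Quantize one activation row in blocks of 128; returns (q, scales)."""
    q, scales = [], []
    for i in range(0, len(row), 128):
        block = row[i:i + 128]
        amax = max(abs(x) for x in block) or 1.0  # avoid divide-by-zero
        scale = amax / FP8_E4M3_MAX
        scales.append(scale)
        # Real kernels cast to float8_e4m3; integer rounding stands in here.
        q.extend(round(x / scale) for x in block)
    return q, scales
```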

@HAIAI HAIAI self-assigned this Nov 13, 2025
@bnellnm (Collaborator) commented Nov 13, 2025

A test in tests/kernels/moe would be good. It probably wouldn't be too hard to add the new Mori kernels to tests/kernels/moe/test_modular_kernel_combinations.py by registering the new prepare/finalize and experts classes in tests/kernels/moe/modular_kernel_tools/mk_objects.py.

@alexsun07 (Contributor, Author) replied

> A test in tests/kernels/moe would be good. It probably wouldn't be too hard to add the new Mori kernels to tests/kernels/moe/test_modular_kernel_combinations.py by registering the new prepare/finalize and experts classes in tests/kernels/moe/modular_kernel_tools/mk_objects.py.

Great suggestion! Will do

@Duyi-Wang Duyi-Wang requested a review from gshtras as a code owner November 18, 2025 03:05

mergify bot commented Nov 18, 2025

Documentation preview: https://vllm--28664.org.readthedocs.build/en/28664/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Nov 18, 2025
@SageMoore (Contributor) left a comment

This looks good. Will accept once the test is added. Thanks for the contribution!


mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexsun07.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Dec 15, 2025
@SageMoore (Contributor) commented

@alexsun07 any updates on this? It would be great to get this merged.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

@sunway513 commented

@alexsun07 @tjtanaa can we focus on getting this PR merged? Thanks.


mergify bot commented Jan 20, 2026

Hi @alexsun07, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@alexsun07 alexsun07 requested review from HAIAI and tjtanaa January 21, 2026 07:01

mergify bot commented Jan 21, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexsun07.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 21, 2026
@tjtanaa (Collaborator) commented Jan 21, 2026

@alexsun07 The community landed a refactor PR. Could you help resolve the merge conflict?

Signed-off-by: Alex Sun <alex.s@amd.com>
@alexsun07 (Contributor, Author) replied

> @alexsun07 The community landed a refactor PR. Could you help resolve the merge conflict?

resolved

@mergify mergify bot removed the needs-rebase label Jan 22, 2026
@tjtanaa tjtanaa merged commit 49a1262 into vllm-project:main Jan 22, 2026
61 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 22, 2026
@alexsun07 (Contributor, Author) commented

> @alexsun07 @tjtanaa can we focus on getting this PR merged? Thanks.

Thanks @sunway513

Merged now by @tjtanaa

monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ci/build · deepseek (Related to DeepSeek models) · documentation (Improvements or additions to documentation) · nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Done

8 participants