
[AMD][ROCm] MoRI EP: a high-performance all2all backend#28664

Merged
tjtanaa merged 4 commits into vllm-project:main from alexsun07:mori_ep
Jan 22, 2026

Conversation

@alexsun07 (Contributor) commented Nov 13, 2025

Purpose

This PR integrates MoRI-EP, a high-performance all2all communication kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graph capture.

This PR follows the design of vLLM's Fused MoE Modular Kernel, which is composed of the following stages:
[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]

For the MoRI+AITER path, AMD's high-performance configuration, the pipeline becomes:
[Router] → [Quantize-Dispatch] → [Experts] → [Combine]

Two new classes are introduced:

  • MoriPrepareAndFinalize: performs the [Quantize-Dispatch] and [Combine] stages
  • AiterExperts: performs the [Experts] stage without any permute/unpermute step
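
To make the composition concrete, here is a minimal, illustrative sketch of how the two stages chain together. This is not vLLM's actual API; the class names, method names, and the toy "expert" math are simplified stand-ins for the real MoriPrepareAndFinalize / AiterExperts classes.

```python
# Illustrative sketch only -- not vLLM's actual modular-kernel API.
# It mimics the [Quantize-Dispatch] -> [Experts] -> [Combine] composition
# described above using plain Python containers.

class MoriPrepareAndFinalizeSketch:
    """Stand-in for MoriPrepareAndFinalize: [Quantize-Dispatch] + [Combine]."""

    def prepare(self, tokens, topk_ids):
        # The real kernel FP8-quantizes tokens and all2all-dispatches them
        # to the EP ranks owning the selected experts; here we just bucket.
        buckets = {}
        for tok, expert in zip(tokens, topk_ids):
            buckets.setdefault(expert, []).append(tok)
        return buckets

    def finalize(self, expert_outputs):
        # The real kernel all2all-combines results back to the source ranks;
        # here we simply flatten outputs in expert order.
        out = []
        for expert in sorted(expert_outputs):
            out.extend(expert_outputs[expert])
        return out


class AiterExpertsSketch:
    """Stand-in for AiterExperts: [Experts] with no permute/unpermute."""

    def apply(self, buckets):
        # A toy "expert MLP": expert e scales its tokens by (e + 1).
        return {e: [t * (e + 1) for t in toks] for e, toks in buckets.items()}


def fused_moe_sketch(tokens, topk_ids):
    pf, experts = MoriPrepareAndFinalizeSketch(), AiterExpertsSketch()
    return pf.finalize(experts.apply(pf.prepare(tokens, topk_ids)))
```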

Summary of the performance comparison between the MoRI-EP and naive backends (bs=128 per DP rank):

| all2all       | EP size | Mean TPOT (ms) | Output tok/s per node | perf  |
|---------------|--------:|---------------:|----------------------:|------:|
| naive         |       8 |         128.42 |               7119.64 | 1.00x |
| mori          |       8 |          94.14 |               9439.57 | 1.33x |
| naive (eager) |      16 |         305.36 |               2740.34 | 1.00x |
| mori          |      16 |         110.87 |               7343.28 | 2.68x |
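
The perf column is the per-node output-throughput ratio against the naive baseline at the same EP size; a quick arithmetic check of the reported ratios:

```python
# Sanity-check the "perf" column: output tok/s per node, mori vs. naive,
# at the same EP size (values copied from the summary table).
tps_per_node = {
    ("naive", 8): 7119.64,
    ("mori", 8): 9439.57,
    ("naive", 16): 2740.34,
    ("mori", 16): 7343.28,
}
ep8_speedup = round(tps_per_node[("mori", 8)] / tps_per_node[("naive", 8)], 2)
ep16_speedup = round(tps_per_node[("mori", 16)] / tps_per_node[("naive", 16)], 2)
# ep8_speedup == 1.33, ep16_speedup == 2.68
```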

How to install MoRI

See https://github.com/ROCm/mori

Test Plan

Test platform: MI300X + CX7

Accuracy

Serve on DeepSeek-V3/R1 (Block scale quant)

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
vllm serve deepseek-ai/DeepSeek-V3 \
    -tp 1 \
    -dp 8 \
    --port 30000 \
    --all2all-backend mori \
    --enable-expert-parallel

Serve on DeepSeek-R1-PTPC (per-token-per-channel quant);
see here for more info about PTPC.

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
vllm serve EmbeddedLLM/deepseek-r1-FP8-Dynamic \
    -tp 1 \
    -dp 8 \
    --port 30000 \
    --all2all-backend mori \
    --enable-expert-parallel

Evaluate by gsm8k

lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=<model_path>,base_url=http://localhost:30000/v1/completions,num_concurrent=256,max_retries=3,tokenized_requests=False 

Performance

Test EP8 and EP16 performance and compare against the naive all2all backend.

EP8 with mori backend

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
VLLM_MOE_DP_CHUNK_SIZE=512 \
vllm serve EmbeddedLLM/deepseek-r1-FP8-Dynamic \
    -tp 1 \
    -dp 8 \
    --port 30000 \
    --all2all-backend mori \
    --max-num-seqs 128 \
    --enable-expert-parallel \
    --cudagraph-capture-sizes 1 2 4 8 16 32 64 128

EP8 with naive backend:
replace --all2all-backend mori with --all2all-backend naive.

EP16 with mori backend

# node0
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
VLLM_MOE_DP_CHUNK_SIZE=512 \
vllm serve /nfs/DeepSeek-R1-PTPC \
    -dp 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address <node-0-ip> --data-parallel-rpc-port <node-0-port> \
    --enable-expert-parallel \
    --all2all-backend mori \
    --port 30000 \
    --max-num-seqs 128 \
    --cuda-graph-sizes 1 2 4 8 16 32 64 128 \
    --trust-remote-code 

# node1
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=1 \
VLLM_USE_V1=1 \
VLLM_MOE_DP_CHUNK_SIZE=512 \
vllm serve /nfs/DeepSeek-R1-PTPC \
    -dp 16 \
    --headless \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address <node-0-ip> --data-parallel-rpc-port <node-0-port> \
    --enable-expert-parallel \
    --all2all-backend mori \
    --port 30000 \
    --max-num-seqs 128 \
    --cuda-graph-sizes 1 2 4 8 16 32 64 128 \
    --trust-remote-code

EP16 with naive backend:
replace --all2all-backend mori with --all2all-backend naive, and use --enforce-eager.

Benchmark:
use --random-input-len 1 --random-prefix-len 1023 to simulate PD disaggregation and measure decode performance without prefill load.

vllm bench serve \
    --max-concurrency <1024 * node_num> \
    --num-prompts <4096 * node_num> \
    --model <model_path> \
    --port 30000 \
    --ignore-eos \
    --trust-remote-code \
    --dataset-name random \
    --seed 2025 \
    --random-input-len 1 \
    --random-prefix-len 1023 \
    --random-output-len 500

Test Result

Accuracy

MoRI-EP with DeepSeek-V3

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9469|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|

MoRI-EP with DeepSeek-R1-PTPC

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9530|±  |0.0058|

Decode Performance

Summary

| all2all       | EP size | Mean TPOT (ms) | Output tok/s per node | perf  |
|---------------|--------:|---------------:|----------------------:|------:|
| naive         |       8 |         128.42 |               7119.64 | 1.00x |
| mori          |       8 |          94.14 |               9439.57 | 1.33x |
| naive (eager) |      16 |         305.36 |               2740.34 | 1.00x |
| mori          |      16 |         110.87 |               7343.28 | 2.68x |

EP8 mori all2all backend

============ Serving Benchmark Result ============
Successful requests:                     4096      
Failed requests:                         0         
Maximum request concurrency:             1024      
Benchmark duration (s):                  216.96    
Total input tokens:                      4190208   
Total generated tokens:                  2048000   
Request throughput (req/s):              18.88     
Output token throughput (tok/s):         9439.57   
Peak output token throughput (tok/s):    13171.00  
Peak concurrent requests:                1152.00   
Total Token throughput (tok/s):          28752.92  
---------------Time to First Token----------------
Mean TTFT (ms):                          3079.99   
Median TTFT (ms):                        1172.27   
P99 TTFT (ms):                           14658.47  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.14     
Median TPOT (ms):                        95.69     
P99 TPOT (ms):                           98.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.46    
Median ITL (ms):                         84.14     
P99 ITL (ms):                            503.41    
==================================================
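
As a quick consistency check on the EP8 mori log above (pure arithmetic; the log rounds the duration, so throughput is matched with a small tolerance):

```python
# Token counts and throughput from the EP8 mori run above.
num_prompts, prefix_len, output_len = 4096, 1023, 500
duration_s = 216.96

assert num_prompts * prefix_len == 4190208   # Total input tokens
assert num_prompts * output_len == 2048000   # Total generated tokens
# Reported output throughput (9439.57 tok/s) matches tokens / duration
# to within rounding of the logged duration.
assert abs(2048000 / duration_s - 9439.57) < 1.0
```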

EP8 naive all2all backend

============ Serving Benchmark Result ============
Successful requests:                     4096      
Failed requests:                         0         
Maximum request concurrency:             1024      
Benchmark duration (s):                  287.65    
Total input tokens:                      4190208   
Total generated tokens:                  2048000   
Request throughput (req/s):              14.24     
Output token throughput (tok/s):         7119.64   
Peak output token throughput (tok/s):    10230.00  
Peak concurrent requests:                1152.00   
Total Token throughput (tok/s):          21686.42  
---------------Time to First Token----------------
Mean TTFT (ms):                          3118.80   
Median TTFT (ms):                        1093.97   
P99 TTFT (ms):                           15430.33  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          128.42    
Median TPOT (ms):                        129.82    
P99 TPOT (ms):                           137.51    
---------------Inter-token Latency----------------
Mean ITL (ms):                           133.46    
Median ITL (ms):                         112.55    
P99 ITL (ms):                            513.15    
==================================================

EP16 mori all2all backend

============ Serving Benchmark Result ============
Successful requests:                     8192
Failed requests:                         0
Maximum request concurrency:             2048
Benchmark duration (s):                  278.89
Total input tokens:                      8380416
Total generated tokens:                  4096000
Request throughput (req/s):              29.37
Output token throughput (tok/s):         14686.55
Peak output token throughput (tok/s):    20942.00
Peak concurrent requests:                2271.00
Total Token throughput (tok/s):          44735.22
---------------Time to First Token----------------
Mean TTFT (ms):                          10838.91
Median TTFT (ms):                        7431.13
P99 TTFT (ms):                           34603.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          110.87
Median TPOT (ms):                        111.76
P99 TPOT (ms):                           127.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           209.21
Median ITL (ms):                         94.86
P99 ITL (ms):                            864.02
==================================================

EP16 naive all2all backend

============ Serving Benchmark Result ============
Successful requests:                     8192
Failed requests:                         0
Maximum request concurrency:             2048
Benchmark duration (s):                  747.35
Total input tokens:                      8380416
Total generated tokens:                  4096000
Request throughput (req/s):              10.96
Output token throughput (tok/s):         5480.68
Peak output token throughput (tok/s):    9665.00
Peak concurrent requests:                2187.00
Total Token throughput (tok/s):          16694.17
---------------Time to First Token----------------
Mean TTFT (ms):                          10112.99
Median TTFT (ms):                        7514.72
P99 TTFT (ms):                           35132.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          305.36
Median TPOT (ms):                        305.49
P99 TPOT (ms):                           317.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           328.70
Median ITL (ms):                         297.74
P99 ITL (ms):                            857.16
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces MoRI-EP as a high-performance all2all backend for Mixture-of-Experts models on ROCm platforms. The changes are extensive, touching configuration, device communicators, and MoE layer implementations to integrate the new backend. The implementation introduces MoriAll2AllManager for communication and AiterExperts for the expert computation path. Overall, the changes are well-structured and seem to correctly integrate the new backend. However, I found a critical issue in the logic for handling shared experts in the DeepSeek V2 model, which could lead to incorrect behavior.

@alexsun07 (Contributor, Author) commented

CC @sunway513 @mgoin @houseroad @robertgshaw2-redhat @HAIAI
The previous #27273 was force-closed by GitHub and cannot be reopened, so I will use this PR instead. Sorry for the trouble.

Please help review this one. Thanks!

from aiter import QuantType, get_hip_quant

if quant_config.is_block_quantized:
    quant_func = get_hip_quant(QuantType.per_1x128)
A collaborator commented:

Is this part of the code included in the CUDA graph?

@alexsun07 (Contributor, Author) replied Nov 14, 2025

I’m not sure I understand the question. This code performs the FP8 quantization before dispatch so that we can reduce communication overhead.
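
For readers unfamiliar with the scheme: per-1x128 block quantization gives each 128-element block its own FP8 scale, so dispatched activations travel as 8-bit values. The sketch below is purely illustrative (plain Python, not AITER's kernel); 448.0 is the float8 e4m3 representable maximum.

```python
# Illustrative per-1x128 block FP8 quantization (not AITER's implementation).
# Each 128-element block gets one scale so values fit the e4m3 range,
# roughly halving all2all traffic versus BF16 activations.

FP8_E4M3_MAX = 448.0

def quant_per_1x128(row):
    """Quantize one activation row in blocks of 128; returns (q, scales)."""
    q, scales = [], []
    for i in range(0, len(row), 128):
        block = row[i:i + 128]
        amax = max(abs(x) for x in block) or 1.0  # avoid divide-by-zero
        scale = amax / FP8_E4M3_MAX
        scales.append(scale)
        # Real kernels cast to float8_e4m3; integer rounding stands in here.
        q.extend(round(x / scale) for x in block)
    return q, scales
```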

@HAIAI HAIAI self-assigned this Nov 13, 2025
@bnellnm (Collaborator) commented Nov 13, 2025

A test in tests/kernels/moe would be good. It probably wouldn't be too hard to add the new Mori kernels to tests/kernels/moe/test_modular_kernel_combinations.py by registering the new prepare/finalize and experts classes in tests/kernels/moe/modular_kernel_tools/mk_objects.py.

@alexsun07 (Contributor, Author) replied

> A test in tests/kernels/moe would be good. It probably wouldn't be too hard to add the new Mori kernels to tests/kernels/moe/test_modular_kernel_combinations.py by registering the new prepare/finalize and experts classes in tests/kernels/moe/modular_kernel_tools/mk_objects.py.

Great suggestion! Will do

@Duyi-Wang Duyi-Wang requested a review from gshtras as a code owner November 18, 2025 03:05

mergify bot commented Nov 18, 2025

Documentation preview: https://vllm--28664.org.readthedocs.build/en/28664/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Nov 18, 2025
@SageMoore (Contributor) left a comment

This looks good. Will accept once the test is added. Thanks for the contribution!


mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexsun07.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Dec 15, 2025
@SageMoore (Contributor) commented

@alexsun07 any updates on this? It would be great to get this merged.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

@sunway513 commented

@alexsun07 @tjtanaa can we focus on getting this PR merged? Thanks.


mergify bot commented Jan 20, 2026

Hi @alexsun07, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@alexsun07 alexsun07 requested review from HAIAI and tjtanaa January 21, 2026 07:01

mergify bot commented Jan 21, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexsun07.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 21, 2026
@tjtanaa (Collaborator) commented Jan 21, 2026

@alexsun07 The community landed a refactor PR. Could you help resolve the merge conflict?

Signed-off-by: Alex Sun <alex.s@amd.com>
@alexsun07 (Contributor, Author) replied

> @alexsun07 The community landed a refactor PR. Could you help resolve the merge conflict?

resolved

@mergify mergify bot removed the needs-rebase label Jan 22, 2026
@tjtanaa tjtanaa merged commit 49a1262 into vllm-project:main Jan 22, 2026
61 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 22, 2026
@alexsun07 (Contributor, Author) commented

> @alexsun07 @tjtanaa can we focus on getting this PR merged? Thanks.

Thanks @sunway513

Merged now by @tjtanaa

monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ci/build · deepseek (Related to DeepSeek models) · documentation (Improvements or additions to documentation) · nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Done

8 participants