Conversation
It seems that the main improvement was made to the recv phase of the dispatch LL kernel.
Yes, the perf gains come from the recv phase, mainly the XGMI part. I guess it should also improve CX7 + MI300X.
|
cc @isytwu |
|
Dispatch & combine staging buffer copy (accum) optimization results compared to the last post: average perf and best perf. Need to check E2E accuracy and stability. The optimizations before this one have been tested.
|
EP16 E2E test OK. EP32 perf optimization results: FP8 average perf, BF16 average perf.
Pull request overview
This PR optimizes the EPV1 (Expert Parallelism V1) dispatch and combine kernels by refactoring the data copy operations into separate kernels and improving the parallelization strategy. The changes include extracting staging buffer copy logic into a dedicated EpDispatchCopyToStaging kernel, introducing a separate EpCombineAll kernel for final combination, and adding low-latency variants for both dispatch and combine operations.
Changes:
- Separated staging buffer copy operations into a dedicated `EpDispatchCopyToStaging` kernel for better parallelism
- Added an `EpCombineAll` kernel to separate the final token combination from inter-node combine operations
- Introduced `EpCombineInterNodeV1KernelLowLatency` with new low-latency implementations (`DispatchInterNodeLLRecv`, `CombineInterNodeLL`, `CombineIntraNodeLL`)
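The core idea of the refactor above is that splitting a fused kernel into separate stages must preserve results while letting each stage be launched and parallelized independently. The following is a minimal CPU-side sketch of that invariant; the function names mirror the PR summary, but the buffer arithmetic is purely illustrative, not the actual HIP kernel logic.

```python
# Hypothetical CPU model of the dispatch refactor: the fused path does the
# staging copy and dispatch work in one "kernel"; the split path runs a
# dedicated EpDispatchCopyToStaging-style step first. Both must agree.

def fused_dispatch(tokens):
    # Old path: copy-to-staging and dispatch fused into one kernel.
    staging = [t * 2 for t in tokens]   # stand-in for the staging copy
    return [s + 1 for s in staging]     # stand-in for the dispatch work

def ep_dispatch_copy_to_staging(tokens):
    # New path, step 1: dedicated staging-copy kernel.
    return [t * 2 for t in tokens]

def ep_dispatch(staging):
    # New path, step 2: dispatch, now schedulable independently of the copy.
    return [s + 1 for s in staging]

tokens = [1, 2, 3]
assert fused_dispatch(tokens) == ep_dispatch(ep_dispatch_copy_to_staging(tokens))
```

The separation buys nothing in this toy model, but on the GPU it lets the two stages use different grid sizes and overlap with other work, which is where the recv-phase gains come from.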
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ops/dispatch_combine/internode_v1.hpp | Added kernel function declarations for new optimization kernels |
| src/ops/dispatch_combine/internode_v1.cpp | Refactored dispatch/combine logic, extracted staging copy, added low-latency variants, includes large commented code blocks |
| src/ops/dispatch_combine/dispatch_combine.cpp | Updated kernel launches to use new separate kernels and added multiprocessor count initialization |
| include/mori/utils/hip_helper.hpp | New utility header for querying GPU multiprocessor count (has critical linking issue) |
| include/mori/ops/dispatch_combine/dispatch_combine.hpp | Added cuCount member variable to store multiprocessor count |
| examples/ops/dispatch_combine/test_dispatch_combine_internode.py | Added sweep benchmark functionality and updated token count handling |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
```python
    [max - min for max, min in zip(comb_lat_max_list, comb_lat_min_list)],
    label="Combine Max-Min",
)
plt.xticks([i * 16 for i in range(max_tokens // 16)])
```
The xticks calculation `range(max_tokens // 16)` may not align with the actual `max_token_list` values: when `max_tokens` is a multiple of 16, the final tick at `max_tokens` itself is dropped. Consider using `range(max_tokens // 16 + 1)`, or calculating the ticks from the actual `max_token_list`, to ensure proper tick placement on the x-axis.
Suggested change:
```python
max_x = max(max_token_list) if max_token_list else 0
plt.xticks([i * 16 for i in range(max_x // 16 + 1)])
```
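To make the off-by-one concrete (assuming `max_tokens` is a multiple of 16, e.g. 64):

```python
# With max_tokens = 64, range(max_tokens // 16) is range(4), i.e. i = 0..3,
# so the tick at 64 itself never appears; adding + 1 restores it.
max_tokens = 64
ticks_orig = [i * 16 for i in range(max_tokens // 16)]       # [0, 16, 32, 48]
ticks_fixed = [i * 16 for i in range(max_tokens // 16 + 1)]  # [0, 16, 32, 48, 64]

assert max_tokens not in ticks_orig   # last tick is missing
assert ticks_fixed[-1] == max_tokens  # fixed version reaches max_tokens
```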



















