Conversation
It seems that the main improvement was made to the recv phase of the dispatch LL kernel.
Yes, the perf gains come from the recv phase, mainly the XGMI part. I guess it should also improve CX7 + MI300X.
|
cc @isytwu |
|
Dispatch & combine staging buffer copy (accum) optimization results compared to the last post: average perf and best perf. Need to check E2E accuracy and stability. The optimizations before this one have been tested.
|
EP16 E2E test OK. EP32 perf optimization results: FP8 average perf, BF16 average perf.
Pull request overview
This PR optimizes the EPV1 (Expert Parallelism V1) dispatch and combine kernels by refactoring the data copy operations into separate kernels and improving the parallelization strategy. The changes include extracting staging buffer copy logic into a dedicated EpDispatchCopyToStaging kernel, introducing a separate EpCombineAll kernel for final combination, and adding low-latency variants for both dispatch and combine operations.
Changes:
- Separated staging buffer copy operations into a dedicated `EpDispatchCopyToStaging` kernel for better parallelism
- Added an `EpCombineAll` kernel to separate the final token combination from inter-node combine operations
- Introduced `EpCombineInterNodeV1KernelLowLatency` with new low-latency implementations (`DispatchInterNodeLLRecv`, `CombineInterNodeLL`, `CombineIntraNodeLL`)
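The core idea of the refactor above is that splitting a fused kernel into separate stages must preserve results while letting each stage be launched and parallelized independently. The following is a minimal CPU-side sketch of that invariant; the function names mirror the PR summary, but the buffer arithmetic is purely illustrative, not the actual HIP kernel logic.

```python
# Hypothetical CPU model of the dispatch refactor: the fused path does the
# staging copy and dispatch work in one "kernel"; the split path runs a
# dedicated EpDispatchCopyToStaging-style step first. Both must agree.

def fused_dispatch(tokens):
    # Old path: copy-to-staging and dispatch fused into one kernel.
    staging = [t * 2 for t in tokens]   # stand-in for the staging copy
    return [s + 1 for s in staging]     # stand-in for the dispatch work

def ep_dispatch_copy_to_staging(tokens):
    # New path, step 1: dedicated staging-copy kernel.
    return [t * 2 for t in tokens]

def ep_dispatch(staging):
    # New path, step 2: dispatch, now schedulable independently of the copy.
    return [s + 1 for s in staging]

tokens = [1, 2, 3]
assert fused_dispatch(tokens) == ep_dispatch(ep_dispatch_copy_to_staging(tokens))
```

The separation buys nothing in this toy model, but on the GPU it lets the two stages use different grid sizes and overlap with other work, which is where the recv-phase gains come from.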
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ops/dispatch_combine/internode_v1.hpp | Added kernel function declarations for new optimization kernels |
| src/ops/dispatch_combine/internode_v1.cpp | Refactored dispatch/combine logic, extracted staging copy, added low-latency variants, includes large commented code blocks |
| src/ops/dispatch_combine/dispatch_combine.cpp | Updated kernel launches to use new separate kernels and added multiprocessor count initialization |
| include/mori/utils/hip_helper.hpp | New utility header for querying GPU multiprocessor count (has critical linking issue) |
| include/mori/ops/dispatch_combine/dispatch_combine.hpp | Added cuCount member variable to store multiprocessor count |
| examples/ops/dispatch_combine/test_dispatch_combine_internode.py | Added sweep benchmark functionality and updated token count handling |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
```python
    [max - min for max, min in zip(comb_lat_max_list, comb_lat_min_list)],
    label="Combine Max-Min",
)
plt.xticks([i * 16 for i in range(max_tokens // 16)])
```
The xticks calculation `range(max_tokens // 16)` may not align with the actual `max_token_list` values: when `max_tokens` is a multiple of 16, the final tick at `max_tokens` itself is dropped. Consider using `range(max_tokens // 16 + 1)`, or calculating the ticks from the actual `max_token_list`, to ensure proper tick placement on the x-axis.
Suggested change:
```python
max_x = max(max_token_list) if max_token_list else 0
plt.xticks([i * 16 for i in range(max_x // 16 + 1)])
```
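To make the off-by-one concrete (assuming `max_tokens` is a multiple of 16, e.g. 64):

```python
# With max_tokens = 64, range(max_tokens // 16) is range(4), i.e. i = 0..3,
# so the tick at 64 itself never appears; adding + 1 restores it.
max_tokens = 64
ticks_orig = [i * 16 for i in range(max_tokens // 16)]       # [0, 16, 32, 48]
ticks_fixed = [i * 16 for i in range(max_tokens // 16 + 1)]  # [0, 16, 32, 48, 64]

assert max_tokens not in ticks_orig   # last tick is missing
assert ticks_fixed[-1] == max_tokens  # fixed version reaches max_tokens
```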



















