Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer #314
Merged
Yael-X merged 4 commits into sgl-project:main on Jan 19, 2026
acbef90 to
54273ed
Compare
a7a85f7 to
63663ac
Compare
5ab45af to
783c82c
Compare
Yael-X
previously approved these changes
Jan 16, 2026
c926819 to
db43a2b
Compare
db43a2b to
985e74a
Compare
Yael-X
approved these changes
Jan 19, 2026
zzx-study
added a commit
to zzx-study/sgl-kernel-npu
that referenced
this pull request
Jan 20, 2026
…into addEnv
* 'main' of https://github.com/sgl-project/sgl-kernel-npu:
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
1329009851
added a commit
to 1329009851/sgl-kernel-npu
that referenced
this pull request
Jan 20, 2026
* 'main' of https://github.com/1329009851/sgl-kernel-npu:
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
Yael-X
added a commit
to Yael-X/sgl-kernel-npu
that referenced
this pull request
Jan 26, 2026
* 'main' of https://github.com/sgl-project/sgl-kernel-npu: (24 commits)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
  Modify contribution guide (sgl-project#315)
  fix bmm transpose in cann 8.5 (sgl-project#316)
  fix little batchsize and int8 quant on ci (sgl-project#302)
  ...
zhuyutong332
added a commit
to zhuyutong332/sgl-kernel-npu
that referenced
this pull request
Jan 27, 2026
* upstream/main:
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
Motivation
The Combine Ant Moving feature supports long sequences while using a smaller HCCL buffer. In the single-operator test, performance with Combine Ant Moving enabled matches performance with it disabled. Once integrated into the framework, however, throughput drops significantly (2100 tps -> 1200 tps). This PR improves the performance of Combine Ant Moving and, at the same time, reduces HCCL buffer usage.
Modification
Optimize the performance of the Combine Ant Moving function using double buffering.
Profiling shows that the execution time of each stage within the operator does not degrade significantly. The slowdown comes instead from the full-card synchronization that the combine "ant moving" operation requires after each round of communication: the model cannot guarantee that tokens are evenly distributed across experts, so cards finish a round at different times and the faster ones wait at the synchronization point. This PR addresses the issue by double buffering across the rounds of communication, so that communication overlaps with full-card synchronization, which greatly reduces the synchronization overhead. The only drawback of this approach is that it requires an additional buffer of size perRoundTokens * topK * hiddenSize. However, since perRoundTokens is typically set to 1024, this only results in an additional 112MB of buffer usage.
Optimize the performance of the Combine Ant Moving function by removing unnecessary PipeBarrier<PIPE_ALL> calls.
Optimize the HCCL normal operator by removing unnecessary notify_dispatch memory usage.
The HCCL buffer usage of notify_dispatch is reduced from $204 MB \times 2$ to $102 MB \times 2$.
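The overlap gained from double buffering can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the operator's actual implementation: the function names, the per-round cost lists, and the rule that round i reuses the buffer of round i - 2 are all hypothetical and chosen purely for illustration.

```python
def single_buffer_time(comm, sync):
    """One buffer: every round communicates, then blocks on the full-card sync."""
    return sum(c + s for c, s in zip(comm, sync))


def double_buffer_time(comm, sync):
    """Two buffers: round i uses buffer i % 2, so the sync of round i
    can overlap with the communication of round i + 1."""
    comm_end = []
    sync_end = []
    for i, (c, s) in enumerate(zip(comm, sync)):
        prev_comm_end = comm_end[i - 1] if i >= 1 else 0
        # Buffer i % 2 was last used by round i - 2 (assumed reuse rule):
        # it is free only once that round's sync has completed.
        buffer_free = sync_end[i - 2] if i >= 2 else 0
        start = max(prev_comm_end, buffer_free)
        comm_end.append(start + c)
        # Syncs are serialized and cannot start before their own round's data is sent.
        prev_sync_end = sync_end[i - 1] if i >= 1 else 0
        sync_end.append(max(comm_end[i], prev_sync_end) + s)
    return sync_end[-1] if sync_end else 0


# Hypothetical costs in arbitrary time units, four rounds.
comm = [1, 1, 1, 1]
sync = [1, 1, 1, 1]
print(single_buffer_time(comm, sync))  # 8
print(double_buffer_time(comm, sync))  # 5
```

In this model the synchronization cost is hidden behind the next round's communication for every round except the last, which is the effect described above.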
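As a sanity check on the 112MB figure, the extra buffer size can be worked out directly. The PR does not state topK, hiddenSize, or the element width; the values below (topK = 8, hiddenSize = 7168, 2 bytes per element as for bf16) are assumptions that happen to reproduce the quoted number, and the helper name is hypothetical.

```python
def extra_buffer_bytes(per_round_tokens, top_k, hidden_size, bytes_per_elem):
    """Size of the second buffer: perRoundTokens * topK * hiddenSize elements."""
    return per_round_tokens * top_k * hidden_size * bytes_per_elem


# Assumed values (not given in the PR): topK=8, hiddenSize=7168, 2-byte elements.
size = extra_buffer_bytes(per_round_tokens=1024, top_k=8, hidden_size=7168, bytes_per_elem=2)
print(size)          # 117440512
print(size / 2**20)  # 112.0
```

A model with a smaller hidden size or topK would need proportionally less extra buffer under the same formula.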
Benchmarking and Profiling
Our own single-operator benchmark
HCCL_BUFFSIZE=4000 python3 test_intranode.py --num-tokens=<bs>
HCCL_BUFFSIZE=900 DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 DEEPEP_NORMAL_LONG_SEQ_ROUND=32 DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 python3 test_intranode.py --num-tokens=<bs>
Qwen 235B prefill