Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer by oagniqgnat · Pull Request #314 · sgl-project/sgl-kernel-npu

oagniqgnat · 2026-01-13T04:08:42Z

Motivation

The combine ant moving feature supports long sequences while using a smaller HCCL buffer. In the single-operator test, performance with combine ant moving enabled matches performance with it disabled, but after integration into the framework throughput drops sharply (2100 tps -> 1200 tps). This PR improves the performance of combine ant moving and reduces HCCL buffer usage at the same time.

Modification

Optimize the performance of the Combine Ant Moving function using double buffer.
Profiling shows that the execution time of each stage within the operator does not degrade significantly, so the slowdown comes from elsewhere. The combine ant moving operation requires a full-card synchronization after each round of communication, and because the model cannot guarantee that tokens are evenly distributed across experts, these synchronizations add significant overhead. This PR uses double buffering across the communication rounds so that communication overlaps with full-card synchronization, which greatly reduces that overhead. The only drawback is that it requires an additional buffer of size perRoundTokens * topK * hiddenSize. Since perRoundTokens is typically set to 1024, this adds only about 112 MB of buffer usage.
Optimize the performance of the Combine Ant Moving function by removing unnecessary PipeBarrier<PIPE_ALL>.
Optimize the HCCL normal operator by removing unnecessary notify_dispatch memory usage.
The HCCL buffer usage of notify_dispatch is reduced from $204\,\text{MB} \times 2$ to $102\,\text{MB} \times 2$.
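
The buffer cost and the overlap idea above can be sketched with a small toy model. This is only an illustration under stated assumptions: topK = 8, hiddenSize = 7168 and 2-byte elements are not given in the PR (they are chosen because they reproduce the quoted 112 MB), the per-round timings are made up, and the function names are hypothetical rather than taken from the operator code.

```python
# Toy model only. Function names and per-round timings are illustrative
# assumptions, not values or APIs from this PR.

def extra_buffer_bytes(per_round_tokens, top_k, hidden_size, bytes_per_elem=2):
    """Size of the second buffer: perRoundTokens * topK * hiddenSize elements."""
    return per_round_tokens * top_k * hidden_size * bytes_per_elem


def single_buffer_time(comm, sync):
    """Each round communicates, then waits for its full-card synchronization."""
    return sum(c + s for c, s in zip(comm, sync))


def double_buffer_time(comm, sync):
    """Round i communicates into the other buffer while round i-1 synchronizes."""
    total = comm[0]
    for i in range(1, len(comm)):
        # The slower of the two overlapping activities sets the pace.
        total += max(comm[i], sync[i - 1])
    return total + sync[-1]


if __name__ == "__main__":
    # Assumed: topK = 8, hiddenSize = 7168, 2 bytes per element.
    print(extra_buffer_bytes(1024, 8, 7168) / 2**20, "MiB")

    comm = [1.0] * 4  # hypothetical communication time per round
    sync = [0.8] * 4  # hypothetical full-card sync time per round
    print("single buffer:", single_buffer_time(comm, sync))
    print("double buffer:", double_buffer_time(comm, sync))
```

In this model the synchronization cost of every round except the last is hidden behind the next round's communication, which is why the benefit grows with the number of rounds.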

Benchmarking and Profiling

Our own single-operator benchmark

disable ant moving: HCCL_BUFFSIZE=4000 python3 test_intranode.py --num-tokens=<bs>
enable ant moving: HCCL_BUFFSIZE=900 DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 DEEPEP_NORMAL_LONG_SEQ_ROUND=32 DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 python3 test_intranode.py --num-tokens=<bs>
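
To sweep both configurations over several batch sizes, the two commands above could be wrapped in a small helper. This is a hedged sketch: the environment variables and the test_intranode.py invocation are copied from the commands above, while the helper names build_run and run are hypothetical and the script is assumed to be started from the directory that contains test_intranode.py.

```python
import os
import subprocess


def build_run(num_tokens, ant_moving):
    """Return (env overrides, argv) for one benchmark run."""
    env = {"HCCL_BUFFSIZE": "900" if ant_moving else "4000"}
    if ant_moving:
        env.update({
            "DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ": "1",
            "DEEPEP_NORMAL_LONG_SEQ_ROUND": "32",
            "DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS": "1024",
        })
    argv = ["python3", "test_intranode.py", f"--num-tokens={num_tokens}"]
    return env, argv


def run(num_tokens, ant_moving):
    """Launch one run with the overrides layered on the current environment."""
    env, argv = build_run(num_tokens, ant_moving)
    return subprocess.run(argv, env={**os.environ, **env}, check=True)


if __name__ == "__main__":
    # Print the planned commands; call run(...) instead to execute them.
    for bs in (4096, 8192, 16384, 32768):
        for ant_moving in (False, True):
            env, argv = build_run(bs, ant_moving)
            prefix = " ".join(f"{k}={v}" for k, v in env.items())
            print(prefix, " ".join(argv))
```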

bs	disabled (GB/s)	enabled, before optimization (GB/s)	enabled, after optimization (GB/s)
4096	101.59	98.38	109.01
8192	101.30	96.72	113.56
16384	99.30	97.55	113.00
32768	-	98.24	114.52
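
The relative change between the "before" and "after" columns can be read off with a few lines of arithmetic. The sketch below only restates the table values; the dictionary layout and the gain_percent helper are illustrative, not part of the benchmark script.

```python
# Bandwidth in GB/s from the table above:
# bs -> (disabled, enabled before optimization, enabled after optimization).
# None marks the entry shown as "-" in the table.
results = {
    4096: (101.59, 98.38, 109.01),
    8192: (101.30, 96.72, 113.56),
    16384: (99.30, 97.55, 113.00),
    32768: (None, 98.24, 114.52),
}


def gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


if __name__ == "__main__":
    for bs, (disabled, before, after) in results.items():
        line = f"bs={bs}: {gain_percent(before, after):.1f}% vs. before optimization"
        if disabled is not None:
            line += f", {gain_percent(disabled, after):.1f}% vs. disabled"
        print(line)
```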

Qwen 235B prefill

disabled (token/s)	enabled, before optimization (token/s)	enabled, after optimization (token/s)
2040	1200	2106

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-13T04:08:45Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


Optimize the performance of the Combine Ant Moving function using double buffer.

54273ed

oagniqgnat force-pushed the combine_long_seq_optimize branch from acbef90 to 54273ed Compare January 13, 2026 04:09

oagniqgnat added 2 commits January 13, 2026 12:36

Optimize the performance of the Combine Ant Moving function.

8350bb8

lint

fed38e6

oagniqgnat changed the title ~~Optimize the performance of the Combine Ant Moving function using double buffer.~~ Optimize the performance of the Combine Ant Moving function Jan 13, 2026

oagniqgnat force-pushed the combine_long_seq_optimize branch 2 times, most recently from a7a85f7 to 63663ac Compare January 13, 2026 11:13

oagniqgnat changed the title ~~Optimize the performance of the Combine Ant Moving function~~ Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer Jan 13, 2026

oagniqgnat force-pushed the combine_long_seq_optimize branch 2 times, most recently from 5ab45af to 783c82c Compare January 14, 2026 03:43

Yael-X previously approved these changes Jan 16, 2026

Review comment on csrc/deepep/ops/op_host/cam_moe_combine_normal_tiling.cc (outdated, resolved)

oagniqgnat dismissed Yael-X’s stale review via cd64028 January 16, 2026 06:45

oagniqgnat force-pushed the combine_long_seq_optimize branch 3 times, most recently from c926819 to db43a2b Compare January 16, 2026 08:52

Optimized the use of HCCL by the normal operator.

985e74a

oagniqgnat force-pushed the combine_long_seq_optimize branch from db43a2b to 985e74a Compare January 19, 2026 01:16

Yael-X approved these changes Jan 19, 2026


Yael-X merged commit a0619a7 into sgl-project:main Jan 19, 2026
4 checks passed

Yael-X merged 4 commits into sgl-project:main from oagniqgnat:combine_long_seq_optimize
