Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer#314

Merged
Yael-X merged 4 commits into sgl-project:main from oagniqgnat:combine_long_seq_optimize
Jan 19, 2026

@oagniqgnat oagniqgnat commented Jan 13, 2026

Motivation

The combine ant moving feature makes it possible to support long sequences with a smaller HCCL buffer. In the single-operator test, performance with combine ant moving enabled roughly matches performance without it. After integration into the framework, however, throughput drops significantly (2100 tps -> 1200 tps). This PR improves the performance of combine ant moving and, at the same time, reduces HCCL buffer usage.

Modification

  1. Optimize the performance of the Combine Ant Moving function using double buffering.
    Profiling shows that the execution time of each stage inside the operator does not degrade significantly, so the slowdown does not come from the stages themselves. Instead, the combine ant moving operation requires a full-card synchronization after each round of communication. Because the model cannot guarantee that tokens are evenly distributed across experts, waiting on that synchronization adds significant overhead. This PR uses double buffering across the rounds of communication so that communication overlaps with the full-card synchronization, which greatly reduces that overhead. The only drawback is an additional buffer of size perRoundTokens * topK * hiddenSize. Since perRoundTokens is typically set to 1024, this costs only about 112 MB of additional buffer.

  2. Optimize the performance of the Combine Ant Moving function by removing unnecessary PipeBarrier<PIPE_ALL>.

  3. Optimize the HCCL normal operator by removing unnecessary notify_dispatch memory usage.
    The HCCL buffer usage of notify_dispatch is reduced from $204\,\mathrm{MB} \times 2$ to $102\,\mathrm{MB} \times 2$.
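
To make the double-buffering argument in item 1 concrete, here is a minimal toy model in Python. It is not the AscendC kernel from this PR. It only models each round as a communication phase followed by a full-card synchronization, and compares a single buffer (every round waits for its own sync) with two buffers (the next round communicates into the other buffer while the previous round is still syncing). The function names and the timing values are hypothetical. The values `top_k=8`, `hidden_size=7168`, and 2 bytes per element are assumptions chosen because they reproduce the 112 MB figure; the PR does not state them.

```python
"""Toy timeline model of multi-round combine with one or two buffers."""

BYTES_PER_ELEMENT = 2  # ASSUMPTION: 16-bit activations (e.g. bf16)


def extra_buffer_bytes(per_round_tokens, top_k, hidden_size,
                       bytes_per_element=BYTES_PER_ELEMENT):
    """Size of the second buffer: perRoundTokens * topK * hiddenSize elements."""
    return per_round_tokens * top_k * hidden_size * bytes_per_element


def single_buffer_time(comm, sync):
    """One buffer: every round communicates, then waits for its full-card sync."""
    return sum(c + s for c, s in zip(comm, sync))


def double_buffer_time(comm, sync):
    """Two buffers: round i communicates while round i-1 is still syncing.

    Round i may start communicating once round i-1 has finished communicating
    and the buffer it reuses (last used by round i-2) has been synchronized.
    """
    comm_end = []
    sync_end = []
    for i, (c, s) in enumerate(zip(comm, sync)):
        prev_comm = comm_end[i - 1] if i >= 1 else 0.0
        buffer_free = sync_end[i - 2] if i >= 2 else 0.0
        comm_end.append(max(prev_comm, buffer_free) + c)
        prev_sync = sync_end[i - 1] if i >= 1 else 0.0
        sync_end.append(max(comm_end[i], prev_sync) + s)
    return sync_end[-1] if sync_end else 0.0


if __name__ == "__main__":
    # Hypothetical per-round costs in arbitrary time units.
    comm = [4, 4, 4, 4]
    sync = [3, 3, 3, 3]
    print("single buffer:", single_buffer_time(comm, sync))
    print("double buffer:", double_buffer_time(comm, sync))
    size = extra_buffer_bytes(per_round_tokens=1024, top_k=8, hidden_size=7168)
    print("extra buffer (MiB):", size / (1024 * 1024))
```

In this model, all but the last synchronization are hidden behind the next round's communication whenever a sync is no longer than a communication phase, which is the overlap the PR relies on.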

Benchmarking and Profiling

Our own single-operator benchmark

  • disable ant moving: `HCCL_BUFFSIZE=4000 python3 test_intranode.py --num-tokens=<bs>`
  • enable ant moving: `HCCL_BUFFSIZE=900 DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ=1 DEEPEP_NORMAL_LONG_SEQ_ROUND=32 DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS=1024 python3 test_intranode.py --num-tokens=<bs>`

| bs | disable (GB/s) | enable, before optimization (GB/s) | enable, after optimization (GB/s) |
| --- | --- | --- | --- |
| 4096 | 101.59 | 98.38 | 109.01 |
| 8192 | 101.30 | 96.72 | 113.56 |
| 16384 | 99.30 | 97.55 | 113.00 |
| 32768 | - | 98.24 | 114.52 |
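
The two benchmark configurations listed above can be scripted. The sketch below builds the environment and argument list for each run, using only the variables and the `test_intranode.py` command shown in this PR. The helper names are hypothetical. `long_seq_token_capacity` rests on an assumption that the PR does not state: that the number of rounds times the tokens per round bounds the tokens handled in one combine call. With the values above this gives 32 * 1024 = 32768, which matches the largest batch size in the table.

```python
import os

# Environment overrides taken from the two commands in this PR.
BASELINE_ENV = {"HCCL_BUFFSIZE": "4000"}
LONG_SEQ_ENV = {
    "HCCL_BUFFSIZE": "900",
    "DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ": "1",
    "DEEPEP_NORMAL_LONG_SEQ_ROUND": "32",
    "DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS": "1024",
}


def benchmark_invocation(num_tokens, long_seq):
    """Return (env, argv) for one single-operator benchmark run."""
    env = dict(os.environ)
    # Drop inherited long-sequence settings so the baseline run is clean.
    for key in LONG_SEQ_ENV:
        env.pop(key, None)
    env.update(LONG_SEQ_ENV if long_seq else BASELINE_ENV)
    argv = ["python3", "test_intranode.py", f"--num-tokens={num_tokens}"]
    return env, argv


def long_seq_token_capacity(env):
    """ASSUMPTION: rounds * tokens per round bounds the tokens per call."""
    rounds = int(env["DEEPEP_NORMAL_LONG_SEQ_ROUND"])
    per_round = int(env["DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS"])
    return rounds * per_round


if __name__ == "__main__":
    env, argv = benchmark_invocation(32768, long_seq=True)
    print(" ".join(argv))
    print("assumed token capacity:", long_seq_token_capacity(env))
```

To actually launch a run, pass the result to `subprocess.run(argv, env=env, check=True)` from the directory that contains `test_intranode.py`.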

Qwen 235B prefill

| disable (token/s) | enable, before optimization (token/s) | enable, after optimization (token/s) |
| --- | --- | --- |
| 2040 | 1200 | 2106 |
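
For a quick read of the numbers above, this short sketch computes the ratio of the optimized ant-moving results to the earlier ant-moving results and to the run with ant moving disabled. The data is copied from the two tables in this PR; the variable and function names are illustrative only.

```python
# (disable, enable before optimization, enable after optimization)
SINGLE_OP_GBPS = {
    4096: (101.59, 98.38, 109.01),
    8192: (101.30, 96.72, 113.56),
    16384: (99.30, 97.55, 113.00),
    32768: (None, 98.24, 114.52),  # no result reported with ant moving disabled
}
QWEN_PREFILL_TPS = (2040, 1200, 2106)


def ratio(value, reference):
    """Return value / reference, or None when the reference is missing."""
    if reference is None:
        return None
    return value / reference


def fmt(x):
    """Format a ratio for printing, keeping missing entries as '-'."""
    return "-" if x is None else f"{x:.3f}x"


if __name__ == "__main__":
    for bs, (disable, before, after) in SINGLE_OP_GBPS.items():
        print(f"bs={bs}: vs before {fmt(ratio(after, before))}, "
              f"vs disabled {fmt(ratio(after, disable))}")
    disable, before, after = QWEN_PREFILL_TPS
    print(f"Qwen 235B prefill: vs before {fmt(ratio(after, before))}, "
          f"vs disabled {fmt(ratio(after, disable))}")
```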

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@oagniqgnat force-pushed the combine_long_seq_optimize branch from acbef90 to 54273ed on January 13, 2026 04:09
@oagniqgnat changed the title from "Optimize the performance of the Combine Ant Moving function using double buffer." to "Optimize the performance of the Combine Ant Moving function" on Jan 13, 2026
@oagniqgnat force-pushed the combine_long_seq_optimize branch 2 times, most recently from a7a85f7 to 63663ac, on January 13, 2026 11:13
@oagniqgnat changed the title from "Optimize the performance of the Combine Ant Moving function" to "Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer" on Jan 13, 2026
@oagniqgnat force-pushed the combine_long_seq_optimize branch 2 times, most recently from 5ab45af to 783c82c, on January 14, 2026 03:43
Yael-X previously approved these changes on Jan 16, 2026
@oagniqgnat force-pushed the combine_long_seq_optimize branch 3 times, most recently from c926819 to db43a2b, on January 16, 2026 08:52
@oagniqgnat force-pushed the combine_long_seq_optimize branch from db43a2b to 985e74a on January 19, 2026 01:16
@Yael-X merged commit a0619a7 into sgl-project:main on Jan 19, 2026
4 checks passed
zzx-study added a commit to zzx-study/sgl-kernel-npu that referenced this pull request Jan 20, 2026
…into addEnv

* 'main' of https://github.com/sgl-project/sgl-kernel-npu:
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
1329009851 added a commit to 1329009851/sgl-kernel-npu that referenced this pull request Jan 20, 2026
* 'main' of https://github.com/1329009851/sgl-kernel-npu:
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
Yael-X added a commit to Yael-X/sgl-kernel-npu that referenced this pull request Jan 26, 2026
* 'main' of https://github.com/sgl-project/sgl-kernel-npu: (24 commits)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
  Modify contribution guide (sgl-project#315)
  fix bmm transpose in cann 8.5 (sgl-project#316)
  fix little batchsize and int8 quant on ci (sgl-project#302)
  ...
zhuyutong332 added a commit to zhuyutong332/sgl-kernel-npu that referenced this pull request Jan 27, 2026
* upstream/main:
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  【Doc】add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)