
Fix the bug that the layout kernel crashed when the num of experts is no less than 384 #383

Merged
Yael-X merged 2 commits into sgl-project:main from luanyundu:update
Feb 27, 2026

Conversation

Contributor

@luanyundu luanyundu commented Feb 26, 2026

  • This change fixes a bug where the layout kernel crashed when the number of experts was 384 or more. It also disables the layout output verification to shorten the internode test duration, since correctness is already covered by the dispatch-combine result verification.
  • Performance results are below:
    • --num-experts=256
      Server 0
      [tuning] Dispatch (BF16) 36.19 GB/s (HCCS), 9.06 GB/s (RDMA), avg_t: 12916.85 us, notify_t: 3747.06 us
      [tuning] Combine 53.96 GB/s (HCCS), 13.51 GB/s (RDMA), avg_t: 8662.11 us
      Server 1
      [tuning] Dispatch (BF16) 35.43 GB/s (HCCS), 8.81 GB/s (RDMA), avg_t: 13284.70 us, notify_t: 3209.65 us
      [tuning] Combine 54.56 GB/s (HCCS), 13.56 GB/s (RDMA), avg_t: 8628.18 us

    • --num-experts=384
      Server 0
      [tuning] Dispatch (BF16) 33.80 GB/s (HCCS), 8.44 GB/s (RDMA), avg_t: 13857.38 us, notify_t: 3880.77 us
      [tuning] Combine 52.96 GB/s (HCCS), 13.23 GB/s (RDMA), avg_t: 8844.66 us
      Server 1
      [tuning] Dispatch (BF16) 34.40 GB/s (HCCS), 8.52 GB/s (RDMA), avg_t: 13735.61 us, notify_t: 4084.84 us
      [tuning] Combine 53.49 GB/s (HCCS), 13.25 GB/s (RDMA), avg_t: 8832.54 us

    • --num-experts=512
      Server 0
      [tuning] Dispatch (BF16) 37.66 GB/s (HCCS), 9.42 GB/s (RDMA), avg_t: 12425.59 us, notify_t: 3593.53 us
      [tuning] Combine 52.35 GB/s (HCCS), 13.10 GB/s (RDMA), avg_t: 8938.52 us
      Server 1
      [tuning] Dispatch (BF16) 42.45 GB/s (HCCS), 10.68 GB/s (RDMA), avg_t: 10955.64 us, notify_t: 5023.46 us
      [tuning] Combine 51.75 GB/s (HCCS), 13.02 GB/s (RDMA), avg_t: 8986.22 us

    • In all cases, the average duration of the layout kernel is less than 0.35 ms.
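The reported crash is consistent with a layout buffer that was sized per expert rather than per top-k slot, so it overflowed once the expert count crossed a threshold. As a rough illustration only (the token count, top-k value, and element size below are made up, not taken from the kernel), sizing the send-token-index buffer by num_topk keeps it bounded no matter how many experts are configured:

```python
def send_token_idx_bytes(num_tokens: int, slots_per_token: int, itemsize: int = 4) -> int:
    """Bytes needed for a (num_tokens, slots_per_token) int32 index buffer."""
    return num_tokens * slots_per_token * itemsize

num_tokens, num_topk = 4096, 8  # illustrative values only

old_size = send_token_idx_bytes(num_tokens, 384)        # grew with num_experts
new_size = send_token_idx_bytes(num_tokens, num_topk)   # bounded by the top-k fan-out
```

Since each token is dispatched to at most num_topk experts, the per-top-k layout loses no information while staying independent of num_experts.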

@gemini-code-assist
Contributor

Summary of Changes

Hello @luanyundu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a layout kernel crash that occurred with a high number of experts. The solution involves a comprehensive refactoring of the expert dispatching mechanism, primarily by correcting how data sizes and indices are calculated and managed within the deep_ep and dispatch_layout_a2 components. A new tokenIdx is introduced to improve the precision of token-to-expert mapping, and the overall data flow within the kernels has been streamlined for correctness and efficiency under varying expert counts. Corresponding test adjustments ensure compatibility with these fundamental changes.

Highlights

  • Layout Kernel Bug Fix: Addressed a critical bug in the layout kernel that caused crashes when the number of experts (num_experts) was 384 or more, by correcting buffer sizing and indexing logic.
  • Data Structure Refinement: Modified the notify_send_data_size calculation and introduced a new tokenIdx to accurately track which expert each token is sent to, replacing previous num_experts-based indexing with num_topk-based indexing for dispatching.
  • Kernel Logic Update: Refactored the DispatchLayoutA2 and NotifyDispatchA2 kernels to align with the new data structures, including adjusting buffer allocations, removing redundant temporary buffers, and revising the token-to-expert mapping logic.
  • Test Adjustments: Updated Python test cases (test_internode.py) to reflect the changes in data layout and indexing, specifically adapting send_token_idx dimensions and offsets to use num_topk.
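The num_topk-based indexing described in the highlights can be sketched as a plain-Python reference model (a hypothetical illustration, not the kernel code): for each token, record its ordinal position among all tokens routed to the same expert, which needs only num_topk slots per token rather than num_experts:

```python
import numpy as np

def build_send_token_idx(topk_idx: np.ndarray, num_experts: int) -> np.ndarray:
    """For each (token, k) pair, record this token's ordinal position among
    all tokens routed to the same expert. Output shape: (num_tokens, num_topk)."""
    num_tokens, num_topk = topk_idx.shape
    counts = np.zeros(num_experts, dtype=np.int64)          # tokens seen per expert so far
    send_token_idx = np.zeros((num_tokens, num_topk), dtype=np.int64)
    for t in range(num_tokens):
        for k in range(num_topk):
            e = int(topk_idx[t, k])
            send_token_idx[t, k] = counts[e]
            counts[e] += 1
    return send_token_idx
```

For example, with `topk_idx = [[0, 1], [0, 2], [1, 2]]` the result is `[[0, 0], [1, 0], [1, 1]]`: token 1 is the second token sent to expert 0, so its first slot holds 1.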


Changelog
  • csrc/deepep/deep_ep.cpp
    • Updated notify_send_data_size calculation to correctly incorporate num_topk.
    • Modified the description of notify send data parameters to include a new item for token-to-expert mapping and re-indexed subsequent items.
  • csrc/deepep/ops2/op_kernel/dispatch_layout_a2.h
    • Introduced tokenIdxOffset and tokenIdx32AlignIntLen_ for managing token indices.
    • Adjusted sendTokenIdx32AlignIntLen_ calculation to use numTopk_ instead of numExperts_.
    • Removed unused tempExpertBuf_, intermediateExpertBuf_, and intermediateServerBuf_ variables.
    • Modified buffer initialization to include tokenIdxBuf_.
    • Refactored the logic for calculating sendTokenIdxTensor to directly map tokens to experts based on topkIdxTensor.
    • Added a Cast operation to populate tokenIdxTensor from topkIdxTensor.
    • Updated SyncFunc calls for event synchronization.
    • Adjusted sendSize calculation for sendTokenIdxDataCopyParams.
    • Added tokenIdxGM_ and tokenIdxBuf_ member variables.
    • Removed tempExpertBuf_, intermediateExpertBuf_, intermediateServerBuf_ member variables.
    • Added tokenIdx32AlignIntLen_ member variable.
  • csrc/deepep/ops2/op_kernel/notify_dispatch_a2.h
    • Refactored BuildExpandIdxData to use topkNum for data copying and introduced expertIdxTensor and lastLt for processing.
    • Changed loop iteration count from numExperts to MAX_BS in BuildExpandIdxData.
    • Updated Sub operation to use lastLt and numExperts.
    • Adjusted tarOffset and DataCopyPad parameters in BuildExpandIdxData.
    • Modified BuildOffsetInnerData to use copyExpertsParams and copyTopkParams, and introduced expertIdxTensor for processing.
    • Updated recvOffset calculation in BuildOffsetInnerData to use topkNum.
    • Adjusted dataOffset calculation in BuildRecvData to account for MAX_BS * topkNum * 2.
  • tests/python/deepep/test_internode.py
    • Disabled enable_a2_test by default with a clarifying comment.
    • Modified send_token_idx tensor size from num_tokens * num_experts to num_tokens * num_topk.
    • Updated the calculation of send_token_idx to use num_topk.
    • Adjusted slice offsets for ref_send_token_idx and ref_expert_rank_token_idx to match the new data layout.
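The offset bookkeeping touched throughout the changelog (e.g. the recvOffset calculation in BuildOffsetInnerData) is, at heart, an exclusive prefix sum over per-expert token counts. A minimal sketch with made-up counts, making no claim to match the kernel's exact layout:

```python
import numpy as np

def recv_offsets(tokens_per_expert) -> np.ndarray:
    """Exclusive prefix sum: the start offset of each expert's tokens
    in a packed receive buffer."""
    counts = np.asarray(tokens_per_expert, dtype=np.int64)
    return np.concatenate(([0], np.cumsum(counts)[:-1]))
```

With counts `[3, 1, 2]` this yields offsets `[0, 3, 4]`, so expert 1's single token lands right after expert 0's three.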
Activity
  • The pull request was created by luanyundu to fix a layout kernel crash.
  • The description is marked as 'waiting modify...', indicating potential further updates from the author.
  • Significant changes were made across C++ kernel files and Python test files to address the bug and adapt to new data handling logic.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a crash in the layout kernel when dealing with a large number of experts. The core of the fix involves correcting buffer size calculations and indexing logic, replacing num_experts with the more appropriate num_topk in several key areas. This change is propagated through the C++ kernels and the corresponding Python tests.

While the primary fix appears correct, my review has identified several critical issues where the refactoring has removed necessary bounds checks, potentially leading to new out-of-bounds memory access bugs when handling invalid expert indices. I have provided suggestions to reintroduce these checks. Additionally, I've noted that a relevant test case has been disabled and recommended re-enabling it to serve as a regression test.
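The reviewer's concern about removed bounds checks can be illustrated in Python (a hypothetical helper; the real fix belongs in the C++ kernels): invalid or padding expert indices, commonly encoded as -1, must be skipped rather than used to index into a buffer:

```python
def accumulate_expert_counts(topk_idx, num_experts: int) -> list:
    """Count tokens per expert, skipping invalid or padding indices."""
    counts = [0] * num_experts
    for row in topk_idx:
        for e in row:
            if 0 <= e < num_experts:  # bounds check: ignore e == -1 and out-of-range ids
                counts[e] += 1
    return counts
```

Without the range check, a padding entry of -1 (or a corrupted index >= num_experts) would read or write outside the counts buffer, which is exactly the class of out-of-bounds bug the review warns about.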

@Yael-X Yael-X merged commit c636882 into sgl-project:main Feb 27, 2026
6 checks passed
1329009851 added a commit to 1329009851/sgl-kernel-npu that referenced this pull request Feb 27, 2026
…-npu into sgl-cmake2

* 'sgl-cmake2' of https://github.com/1329009851/sgl-kernel-npu:
  Fix the bug that the layout kernel crashed when the num of experts is no less than 384 (sgl-project#383)
  adapt sglang (sgl-project#357)
  GLM5 optimize (sgl-project#382)
  Update layernorm_gated.py (sgl-project#378)
  support qwen3.5 (sgl-project#377)
zzx-study pushed a commit to zzx-study/sgl-kernel-npu that referenced this pull request Feb 28, 2026
… no less than 384 (sgl-project#383)

* Fix the bug that the layout kernel crashed when the num of experts is no less than 384

* Modify review suggestions
