[Feature] Add tile-based method of supporting large VPT in moe_fused_gate kernel #6946
ltaodream wants to merge 62 commits into sgl-project:main from
Conversation
Hello @ltaodream, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes. This PR, authored by @ltaodream, aims to address a limitation in the moe_fused_gate kernel which currently only supports a maximum of 32 experts per group. This limitation prevents models like kimi-vl, which use 64 experts per group, from utilizing the fused gate operator. The core change involves increasing this limit to 128 by implementing a tiling approach within the CUDA kernel.
Highlights
- Increase MoE Expert Group Limit: The primary goal is to raise the maximum number of experts per group supported by the `moe_fused_gate` kernel from 32 to 128. This is necessary for compatibility with models requiring a higher number of experts per group.
- Implement Tiling in CUDA Kernel: The CUDA kernel (`moe_fused_gate.cu`) has been modified to process the expert data in smaller chunks (tiling). This allows the kernel to handle the increased `MAX_VPT` (experts per group) by iterating through tiles of data rather than processing the entire `VPT` in a single unrolled loop.
- Update Python-side Check: The Python code (`topk.py`) that checks eligibility for using the `moe_fused_gate` kernel has been updated to reflect the new maximum limit of 128 experts per group.

Changelog
- `python/sglang/srt/layers/moe/topk.py`
  - Updated the condition for using the `moe_fused_gate` kernel to allow up to 128 experts per group (changed `<= 32` to `<= 128`).
  - Updated the associated comment to reflect the new maximum supported limit of 128.
- `sgl-kernel/csrc/moe/moe_fused_gate.cu`
  - Increased `static constexpr int MAX_VPT` from 32 to 128, with a comment noting that tiling is used to handle the increased limit.
  - Refactored the `moe_fused_gate_impl` kernel to process expert data in tiles of 32, allowing it to handle `params.VPT` up to `MAX_VPT = 128`.
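The tiling idea described above can be sketched in plain C++. This is an illustrative host-side sketch only, not the kernel code from `moe_fused_gate.cu` (which works on warp-local registers); the names `TILE_SIZE` and `tiled_row_max` are assumptions. The point is the shape of the loop: instead of one fully unrolled pass capped at 32 values, the row is walked in fixed-size tiles, so the supported `VPT` is no longer bounded by the unroll width.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical tile width; the PR uses 32, matching the old MAX_VPT.
constexpr int TILE_SIZE = 32;

// Scan one row of gate values tile by tile and return the maximum.
// The last tile may be partial when VPT is not a multiple of TILE_SIZE.
float tiled_row_max(const std::vector<float>& row) {
    const int vpt = static_cast<int>(row.size());
    const int num_tiles = (vpt + TILE_SIZE - 1) / TILE_SIZE;  // ceiling division
    float best = -std::numeric_limits<float>::infinity();
    for (int t = 0; t < num_tiles; ++t) {
        const int begin = t * TILE_SIZE;
        const int end = std::min(begin + TILE_SIZE, vpt);
        for (int i = begin; i < end; ++i) {
            best = std::max(best, row[i]);
        }
    }
    return best;
}
```

In the real kernel each tile pass would additionally track indices and run warp reductions, but the tile loop plus partial-last-tile handling is the core of the change.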
Code Review
This pull request successfully extends the capability of the moe_fused_gate operator to support up to 128 experts per group, up from 32, by implementing a tiling mechanism in the CUDA kernel. This change is crucial for supporting models like kimi-vl.
The core logic in the CUDA kernel appears sound, and the tiling approach is well-implemented. The corresponding update in the Python code is also correct.
A few points for consideration:
- Clarity of Tiling Constant: The CUDA kernel uses the number 32 for tiling. Clarifying its origin or using a named constant could improve maintainability.
- Performance Benchmarks: It would be beneficial to see benchmark results, especially to confirm that there is no performance regression for cases where the number of experts per group (`VPT`) was already within the old limit (`<= 32`), and to understand the performance characteristics with the new `VPT` up to 128.
- PR Title Typo: There is a small typo in the PR title: "tilling" should be "tiling".
Overall, good work on implementing this feature enhancement!
Summary of Findings
- CUDA Kernel Tiling Constant: The CUDA kernel uses a hardcoded 32 for tiling. Using a named constant like `WARP_SIZE` (if applicable) or `PROCESSING_TILE_SIZE` could improve code clarity and maintainability.
- Performance Benchmarking: The introduction of tiling in the CUDA kernel might affect performance. It is important to benchmark this change, especially for the previously supported cases (`VPT <= 32`) and for the newly supported larger `VPT` values, to ensure no regressions and to understand the new performance characteristics.
- Clarity of Ceiling Division: In `sgl-kernel/csrc/moe/moe_fused_gate.cu`, the expression `(params.VPT + 31) / 32` is used for ceiling division. While common, a helper function or a comment could slightly improve readability for developers less familiar with this pattern. (Not commented due to review settings: low severity)
Merge Readiness
The pull request implements an important feature enhancement. The core logic seems correct. However, before merging, it would be ideal to address the point about the magic number 32 for better maintainability and to provide benchmark results to confirm the performance implications of the tiling mechanism. I am requesting changes to encourage addressing these points. As a reviewer, I am not authorized to approve the pull request; other maintainers should review and approve the changes before merging.
The constant 32 is used here for the tile size and in the calculation of the number of tiles (params.VPT + 31) / 32. This value is also WARP_SIZE.
Is the choice of 32 specifically tied to WARP_SIZE or another hardware characteristic? If so, would it be clearer to use the WARP_SIZE constant or define a new named constant (e.g., PROCESSING_TILE_SIZE = 32) for this purpose? This could enhance readability and make it easier to understand the rationale behind this value if it needs to be adjusted in the future.
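As an illustration of the reviewer's suggestion (the names below are hypothetical, not code from the PR), the magic number could be replaced with a named constant and a small ceiling-division helper:

```cpp
// Illustrative only: name the tile width instead of the literal 32,
// and wrap the (a + b - 1) / b ceiling-division idiom in a helper.
constexpr int PROCESSING_TILE_SIZE = 32;  // equals the CUDA warp size here

constexpr int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

// In the kernel this would replace (params.VPT + 31) / 32, e.g.:
// int num_tiles = ceil_div(params.VPT, PROCESSING_TILE_SIZE);
```

Because both are `constexpr`, the compiler can still fully resolve the tile count for statically specialized kernels, so the rename costs nothing at runtime.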
The introduction of tiling is a key change to support MAX_VPT = 128. The PR description checklist mentions providing throughput/latency benchmark results.
Could you share any benchmark data for this new implementation? Specifically, it would be interesting to see:
- Performance for `params.VPT` values that were previously supported (e.g., `VPT <= 32`) to ensure no significant regressions.
- Performance for the new, larger `params.VPT` values (e.g., 64, 128) to understand the characteristics of the tiled approach.
This information would be valuable for assessing the impact of these changes.
```cpp
} else if (cmp_gt(val, global_max_val_second)) {
  global_max_val_second = val;
  global_max_second_idx = global_idx;
}
```
I cannot see the difference between these two blocks.
Agree, there is a lot of duplicated code here that could be refactored.
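One way to remove the duplication the reviewers point at is to factor the repeated first/second-maximum bookkeeping into a single helper. This is a sketch with assumed names; the real kernel's `cmp_gt` is a dtype-aware comparison, modeled here with plain `>` on `float`:

```cpp
// Sketch: factor the duplicated top-2 update blocks into one helper type.
// -1e30f stands in for the kernel's "negative infinity" sentinel.
struct Top2 {
    float first_val  = -1e30f;
    float second_val = -1e30f;
    int   first_idx  = -1;
    int   second_idx = -1;

    // One call replaces both duplicated if / else-if blocks in the kernel.
    void update(float val, int idx) {
        if (val > first_val) {
            second_val = first_val;
            second_idx = first_idx;
            first_val  = val;
            first_idx  = idx;
        } else if (val > second_val) {
            second_val = val;
            second_idx = idx;
        }
    }
};
```

Each tile pass would then just call `update(val, global_idx)` once per element, so the two near-identical branches live in exactly one place.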
Please extend the unit tests and present a performance report comparing this PR against the previous implementation, thanks!
… to 384 implemented with tiling
```cpp
switch (num_experts) {
  case 384:
    if (num_expert_group == 1)
      // Kimi K2 config: VPT = 384/1 = 384, ROWS_PER_WARP = 32/1 = 32
```
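The dispatch pattern in the snippet above, stripped of the kernel-launch details, amounts to mapping a `(num_experts, num_expert_group)` pair to a statically known `VPT`. A hypothetical host-side sketch (function name and the set of supported configurations are illustrative, taken from the model configs mentioned in this PR):

```cpp
#include <stdexcept>

// Map (num_experts, num_expert_group) to the per-group VPT handled by a
// statically specialized kernel. Unsupported combinations fall through.
int vpt_for(int num_experts, int num_expert_group) {
    switch (num_experts) {
        case 384:
            if (num_expert_group == 1) return 384;  // Kimi K2: VPT = 384/1
            break;
        case 64:
            if (num_expert_group == 1) return 64;   // kimi-vl: VPT = 64/1
            break;
    }
    throw std::invalid_argument("unsupported expert configuration");
}
```

Keeping the dispatch in one switch means each specialization compiles with a compile-time `VPT`, while the runtime check rejects configurations no specialization covers.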
[Fix] Modify tile-based method of supporting large VPT in moe_fused_gate kernel
@ttaohe @Misaka9468 The three of us completed the PR "Add tile-based method of supporting large VPT in moe_fused_gate kernel" together. MoE Fused Gate: add tiled path and static specializations for large VPT (64/384), unify switch-case dispatch, and provide multi-dtype benchmarking.
Results of kimi-vl and kimi-k2

bench_hf result of kimi-vl: MMMU benchmark on the Kimi-VL-A3B-Instruct model.

Test results: test_vlm_models.py [#4491] — overall accuracy on the MMMU validation set: 0.5244.
Detailed scores by category:

```json
{
  "Overall-Art and Design": { "num": 120, "acc": 0.7 },
  "Art": { "num": 30, "acc": 0.7 },
  "Art_Theory": { "num": 30, "acc": 0.83333 },
  "Design": { "num": 30, "acc": 0.8 },
  "Music": { "num": 30, "acc": 0.46667 },
  "Overall-Business": { "num": 150, "acc": 0.47333 },
  "Accounting": { "num": 30, "acc": 0.5 },
  "Economics": { "num": 30, "acc": 0.5 },
  "Finance": { "num": 30, "acc": 0.36667 },
  "Manage": { "num": 30, "acc": 0.4 },
  "Marketing": { "num": 30, "acc": 0.6 },
  "Overall-Science": { "num": 150, "acc": 0.47333 },
  "Biology": { "num": 30, "acc": 0.56667 },
  "Chemistry": { "num": 30, "acc": 0.36667 },
  "Geography": { "num": 30, "acc": 0.63333 },
  "Math": { "num": 30, "acc": 0.36667 },
  "Physics": { "num": 30, "acc": 0.43333 },
  "Overall-Health and Medicine": { "num": 150, "acc": 0.47333 },
  "Basic_Medical_Science": { "num": 30, "acc": 0.4 },
  "Clinical_Medicine": { "num": 30, "acc": 0.5 },
  "Diagnostics_and_Laboratory_Medicine": { "num": 30, "acc": 0.36667 },
  "Pharmacy": { "num": 30, "acc": 0.53333 },
  "Public_Health": { "num": 30, "acc": 0.56667 },
  "Overall-Humanities and Social Science": { "num": 120, "acc": 0.65833 },
  "History": { "num": 30, "acc": 0.6 },
  "Literature": { "num": 30, "acc": 0.83333 },
  "Sociology": { "num": 30, "acc": 0.56667 },
  "Psychology": { "num": 30, "acc": 0.63333 },
  "Overall-Tech and Engineering": { "num": 210, "acc": 0.45714 },
  "Agriculture": { "num": 30, "acc": 0.56667 },
  "Architecture_and_Engineering": { "num": 30, "acc": 0.46667 },
  "Computer_Science": { "num": 30, "acc": 0.53333 },
  "Electronics": { "num": 30, "acc": 0.36667 },
  "Energy_and_Power": { "num": 30, "acc": 0.46667 },
  "Materials": { "num": 30, "acc": 0.53333 },
  "Mechanical_Engineering": { "num": 30, "acc": 0.26667 },
  "Overall": { "num": 900, "acc": 0.52444 }
}
```
Cool, maybe this needs a rebase onto master.
@FlamingoPg Hi, I’ve rebased my branch onto the latest main. Since the old branch had a long chain of merge commits from the original main, the history is now completely rewritten and GitHub can’t update PR #6946 cleanly. Would you prefer that I close #6946 and open a fresh PR with the single squash commit, or do you have another way you’d like to handle it?
Maybe you can open a new PR.
The original PR has been closed. New PR: [Feature] Add tile-based method of supporting large VPT in moe_fused_gate kernel by ltaodream · Pull Request #9579 · sgl-project/sglang




Motivation
kimi-vl (supported in #5383) and the kimi-k2 model cannot use the fused MoE gate operator, because it currently does not support more than 32 experts per group. kimi-vl has 64 experts per group and kimi-k2 has 384, so neither model can use this operator at present.
Modifications
@ttaohe @Misaka9468 The three of us completed the PR "Add tile-based method of supporting large VPT in moe_fused_gate kernel" together.
Checklist