
[Feature] Add tile-based method of supporting large VPT in moe_fused_gate kernel#6946

Closed
ltaodream wants to merge 62 commits intosgl-project:mainfrom
ltaodream:main

Conversation


@ltaodream ltaodream commented Jun 7, 2025

Motivation

The kimi-vl model (supported in #5383) and the kimi-k2 model cannot use the fused MoE gate operator. The operator currently does not support cases where the number of experts per group is greater than 32: kimi-vl has 64 experts per group and kimi-k2 has 384, so neither model can use it at present.

Modifications

This PR, "Add tile-based method of supporting large VPT in moe_fused_gate kernel", was completed jointly by @ttaohe, @Misaka9468, and me.

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @ltaodream, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone quickly understand the changes. This PR, authored by @ltaodream, aims to address a limitation in the moe_fused_gate kernel which currently only supports a maximum of 32 experts per group. This limitation prevents models like kimi-vl, which use 64 experts per group, from utilizing the fused gate operator. The core change involves increasing this limit to 128 by implementing a tiling approach within the CUDA kernel.

Highlights

  • Increase MoE Expert Group Limit: The primary goal is to raise the maximum number of experts per group supported by the moe_fused_gate kernel from 32 to 128. This is necessary for compatibility with models requiring a higher number of experts per group.
  • Implement Tiling in CUDA Kernel: The CUDA kernel (moe_fused_gate.cu) has been modified to process the expert data in smaller chunks (tiling). This allows the kernel to handle the increased MAX_VPT (experts per group) by iterating through tiles of data rather than processing the entire VPT in a single unrolled loop.
  • Update Python-side Check: The Python code (topk.py) that checks the eligibility for using the moe_fused_gate kernel has been updated to reflect the new maximum limit of 128 experts per group.
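The tiled loop described in the highlights can be sketched on the host side as follows. This is an illustrative sketch of the technique, not the kernel's actual code: `TILE_SIZE`, `vpt`, and `tiled_row_max` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the tiling idea: instead of one loop unrolled over the full VPT,
// process the row in fixed TILE_SIZE chunks, so VPT can exceed the old limit.
constexpr int TILE_SIZE = 32;

// Reduce one row (assumed non-empty) tile by tile; the real kernel does the
// same traversal pattern for its top-k selection across experts in a group.
float tiled_row_max(const std::vector<float>& row) {
  const int vpt = static_cast<int>(row.size());
  const int num_tiles = (vpt + TILE_SIZE - 1) / TILE_SIZE;  // ceiling division
  float best = row[0];
  for (int t = 0; t < num_tiles; ++t) {
    const int begin = t * TILE_SIZE;
    const int end = std::min(begin + TILE_SIZE, vpt);  // last tile may be partial
    for (int i = begin; i < end; ++i) best = std::max(best, row[i]);
  }
  return best;
}
```

With VPT = 128 this runs four tiles of 32, which is how the kernel can keep a small unrolled inner loop while supporting the larger limit.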

Changelog

  • python/sglang/srt/layers/moe/topk.py
    • Updated the condition for using the moe_fused_gate kernel to allow up to 128 experts per group (changed <= 32 to <= 128).
    • Updated the associated comment to reflect the new maximum supported limit of 128.
  • sgl-kernel/csrc/moe/moe_fused_gate.cu
    • Increased the static constexpr int MAX_VPT from 32 to 128.
    • Added a comment indicating that tiling is used to handle the increased limit.
    • Refactored the moe_fused_gate_impl kernel to process expert data in tiles of 32, allowing it to handle params.VPT up to MAX_VPT=128.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully extends the capability of the moe_fused_gate operator to support up to 128 experts per group, up from 32, by implementing a tiling mechanism in the CUDA kernel. This change is crucial for supporting models like kimi-vl.

The core logic in the CUDA kernel appears sound, and the tiling approach is well-implemented. The corresponding update in the Python code is also correct.

A few points for consideration:

  1. Clarity of Tiling Constant: The CUDA kernel uses the number 32 for tiling. Clarifying its origin or using a named constant could improve maintainability.
  2. Performance Benchmarks: It would be beneficial to see benchmark results, especially to confirm that there's no performance regression for cases where the number of experts per group (VPT) was already within the old limit (<= 32), and to understand the performance characteristics with the new VPT up to 128.
  3. PR Title Typo: There's a small typo in the PR title: "tilling" should be "tiling".

Overall, good work on implementing this feature enhancement!

Summary of Findings

  • CUDA Kernel Tiling Constant: The CUDA kernel uses a hardcoded 32 for tiling. Using a named constant like WARP_SIZE (if applicable) or PROCESSING_TILE_SIZE could improve code clarity and maintainability.
  • Performance Benchmarking: The introduction of tiling in the CUDA kernel might affect performance. It's important to benchmark this change, especially for cases previously supported (VPT <= 32) and for the newly supported larger VPT values, to ensure no regressions and understand new performance characteristics.
  • Clarity of Ceiling Division: In sgl-kernel/csrc/moe/moe_fused_gate.cu, the expression (params.VPT + 31) / 32 is used for ceiling division. While common, a helper function or a comment could slightly improve readability for developers less familiar with this pattern. (Not commented due to review settings: low severity)

Merge Readiness

The pull request implements an important feature enhancement. The core logic seems correct. However, before merging, it would be ideal to address the point about the magic number 32 for better maintainability and to provide benchmark results to confirm the performance implications of the tiling mechanism. I am requesting changes to encourage addressing these points. As a reviewer, I am not authorized to approve the pull request; other maintainers should review and approve the changes before merging.

Contributor


Severity: medium

The constant 32 is used here for the tile size and in the calculation of the number of tiles (params.VPT + 31) / 32. This value is also WARP_SIZE.

Is the choice of 32 specifically tied to WARP_SIZE or another hardware characteristic? If so, would it be clearer to use the WARP_SIZE constant or define a new named constant (e.g., PROCESSING_TILE_SIZE = 32) for this purpose? This could enhance readability and make it easier to understand the rationale behind this value if it needs to be adjusted in the future.
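The reviewer's suggestion can be sketched as follows: name the magic number and wrap the `(x + 31) / 32` pattern in a helper. `PROCESSING_TILE_SIZE` and `ceil_div` are hypothetical names for illustration, not identifiers from the kernel.

```cpp
// Name the tile size instead of scattering a bare 32, and document its origin.
constexpr int PROCESSING_TILE_SIZE = 32;  // matches the CUDA warp size

// Ceiling division: replaces the (params.VPT + 31) / 32 idiom with a named helper.
constexpr int ceil_div(int x, int divisor) {
  return (x + divisor - 1) / divisor;
}

// e.g. the number of tiles for a given VPT:
//   int num_tiles = ceil_div(vpt, PROCESSING_TILE_SIZE);
```

If the value is genuinely tied to the warp size, reusing an existing `WARP_SIZE` constant would make that dependency explicit; a separately named tile-size constant would instead signal that it can be tuned independently.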

Comment on lines 97 to 176
Contributor


Severity: medium

The introduction of tiling is a key change to support MAX_VPT = 128. The PR description checklist mentions providing throughput/latency benchmark results.

Could you share any benchmark data for this new implementation? Specifically, it would be interesting to see:

  1. Performance for params.VPT values that were previously supported (e.g., VPT <= 32) to ensure no significant regressions.
  2. Performance for new larger params.VPT values (e.g., 64, 128) to understand the characteristics of the tiled approach.

This information would be valuable for assessing the impact of these changes.

} else if (cmp_gt(val, global_max_val_second)) {
  global_max_val_second = val;
  global_max_second_idx = global_idx;
}
Collaborator


I cannot see the difference between these two blocks.

Collaborator


Agreed, there is a lot of duplicated code here that could be refactored.
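One way to address the duplication the reviewers point at: both code paths update the (max, second-max) pair the same way, so a single helper could replace them. This is a hypothetical refactoring sketch; `Top2`, `update_top2`, and the `cmp_gt` placeholder stand in for the kernel's actual dtype-aware comparison and state.

```cpp
// Placeholder for the kernel's dtype-aware greater-than comparison.
inline bool cmp_gt(float a, float b) { return a > b; }

// Running top-2 state for one row of gate scores.
struct Top2 {
  float max_val, second_val;
  int max_idx, second_idx;
};

// Factor the duplicated bookkeeping into one function: a new value either
// displaces the max (demoting it to second place) or only the second max.
inline void update_top2(Top2& t, float val, int idx) {
  if (cmp_gt(val, t.max_val)) {
    t.second_val = t.max_val;
    t.second_idx = t.max_idx;
    t.max_val = val;
    t.max_idx = idx;
  } else if (cmp_gt(val, t.second_val)) {
    t.second_val = val;
    t.second_idx = idx;
  }
}
```

Each tile's inner loop would then call `update_top2` instead of repeating the two nearly identical if/else blocks.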

Collaborator

Alcanderian commented Jun 7, 2025

Please extend the unit tests and present a performance report comparing this PR with the previous implementation, thanks!

switch (num_experts) {
  case 384:
    if (num_expert_group == 1)
      // Kimi K2 config: VPT = 384/1 = 384, ROWS_PER_WARP = 32/1 = 32
Collaborator


// Kimi K2 config:

@ltaodream ltaodream changed the title [Feature] add support the number of group expert from the original 32 to 128 implemented with tilling [Feature] Add tile-based method of supporting large VPT in moe_fused_gate kernel Aug 12, 2025
Author

ltaodream commented Aug 12, 2025

This PR, "Add tile-based method of supporting large VPT in moe_fused_gate kernel", was completed jointly by @ttaohe, @Misaka9468, and me.

MoE Fused Gate: add tiled path and static specializations for large VPT (64/384), unify switch-case dispatch, and provide multi-dtype benchmarking.

  • Added a tiled implementation for large VPT and two static specializations for THREADS_PER_ROW=1: (num_experts=64, group=1) and (num_experts=384, group=1). These are exposed via consistent switch-case dispatch using LAUNCH_MOE_GATE_TILED_CONFIG.
  • Kept existing template fast paths for small VPT (e.g., 128/256), and routed all other large-VPT cases to the generic tiled kernel.
  • Refactored tiled declarations/macros into moe_fused_gate_tiled.h to keep moe_fused_gate.cu clean.
  • Added a benchmark (bf16/fp16/fp32) comparing Original (eager, compile-static, compile-dynamic) vs SGL Kernel on representative large-VPT configs.
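The dispatch shape described in the bullets can be sketched as follows. The function name and string labels are illustrative stand-ins for the actual launch macros (e.g. LAUNCH_MOE_GATE_TILED_CONFIG), under the assumption that known (num_experts, num_expert_group) pairs get static specializations and everything else falls through by VPT.

```cpp
#include <string>

// Hypothetical sketch of the switch-case dispatch: static specializations for
// the two THREADS_PER_ROW=1 configs, template fast paths for small VPT, and a
// generic tiled kernel for all other large-VPT cases.
std::string dispatch_moe_gate(int num_experts, int num_expert_group) {
  switch (num_experts) {
    case 64:
      if (num_expert_group == 1) return "tiled_static_64x1";   // kimi-vl-like config
      break;
    case 384:
      if (num_expert_group == 1) return "tiled_static_384x1";  // kimi-k2-like config
      break;
    default:
      break;
  }
  const int vpt = num_experts / num_expert_group;  // experts per group
  return vpt <= 32 ? "template_fast_path" : "generic_tiled";
}
```

Static specializations let the compiler fully unroll the known shapes, while the generic tiled path keeps every other large-VPT configuration functional.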

Results of kimi-vl and kimi-k2 (screenshot omitted)

MMMU benchmark result of kimi-vl (screenshot omitted)

Bench_hf result of kimi-vl

python benchmark/mmmu/bench_hf.py --model-path moonshotai/Kimi-VL-A3B-Instruct

answers saved to: moonshotai/Kimi-VL-A3B-Instruct_val_hf.json
Evaluating...
{'Accounting': {'acc': 0.4, 'num': 30},
 'Agriculture': {'acc': 0.562, 'num': 16},
 'Architecture_and_Engineering': {'acc': 0.367, 'num': 30},
 'Art': {'acc': 0.5, 'num': 30},
 'Art_Theory': {'acc': 0.3, 'num': 30},
 'Basic_Medical_Science': {'acc': 0.667, 'num': 30},
 'Biology': {'acc': 0.333, 'num': 30},
 'Chemistry': {'acc': 0.2, 'num': 30},
 'Clinical_Medicine': {'acc': 0.4, 'num': 30},
 'Computer_Science': {'acc': 0.6, 'num': 30},
 'Design': {'acc': 0.533, 'num': 30},
 'Diagnostics_and_Laboratory_Medicine': {'acc': 0.367, 'num': 30},
 'Economics': {'acc': 0.533, 'num': 30},
 'Electronics': {'acc': 0.333, 'num': 30},
 'Energy_and_Power': {'acc': 0.4, 'num': 30},
 'Finance': {'acc': 0.4, 'num': 30},
 'Geography': {'acc': 0.533, 'num': 30},
 'History': {'acc': 0.5, 'num': 30},
 'Literature': {'acc': 0.828, 'num': 29},
 'Manage': {'acc': 0.333, 'num': 30},
 'Marketing': {'acc': 0.6, 'num': 30},
 'Materials': {'acc': 0.367, 'num': 30},
 'Math': {'acc': 0.4, 'num': 30},
 'Mechanical_Engineering': {'acc': 0.467, 'num': 30},
 'Music': {'acc': 0.267, 'num': 30},
 'Overall': {'acc': 0.44, 'num': 885},
 'Overall-Art and Design': {'acc': 0.4, 'num': 120},
 'Overall-Business': {'acc': 0.453, 'num': 150},
 'Overall-Health and Medicine': {'acc': 0.453, 'num': 150},
 'Overall-Humanities and Social Science': {'acc': 0.563, 'num': 119},
 'Overall-Science': {'acc': 0.353, 'num': 150},
 'Overall-Tech and Engineering': {'acc': 0.434, 'num': 196},
 'Pharmacy': {'acc': 0.467, 'num': 30},
 'Physics': {'acc': 0.3, 'num': 30},
 'Psychology': {'acc': 0.367, 'num': 30},
 'Public_Health': {'acc': 0.367, 'num': 30},
 'Sociology': {'acc': 0.567, 'num': 30}}
eval out saved to moonshotai/Kimi-VL-A3B-Instruct_val_hf.json
Overall accuracy: 0.44

Summary of the MMMU benchmark on the Kimi-VL-A3B-Instruct model:
[bench_sglang.py]: 0.512
[bench_hf.py]: 0.44


ttaohe commented Aug 12, 2025

> (quoting @ltaodream's summary and MMMU results above in full)

Test Results: test_vlm_models.py [#4491] (Overall Acc: 0.5244)

Summary Results

The overall accuracy on the MMMU validation set is 0.5244.

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmmu_val | 0 | none | 0 | mmmu_acc | 0.5244 | ± N/A |

Detailed scores by category:
{
    "Overall-Art and Design": { "num": 120, "acc": 0.7 },
    "Art": { "num": 30, "acc": 0.7 },
    "Art_Theory": { "num": 30, "acc": 0.83333 },
    "Design": { "num": 30, "acc": 0.8 },
    "Music": { "num": 30, "acc": 0.46667 },
    "Overall-Business": { "num": 150, "acc": 0.47333 },
    "Accounting": { "num": 30, "acc": 0.5 },
    "Economics": { "num": 30, "acc": 0.5 },
    "Finance": { "num": 30, "acc": 0.36667 },
    "Manage": { "num": 30, "acc": 0.4 },
    "Marketing": { "num": 30, "acc": 0.6 },
    "Overall-Science": { "num": 150, "acc": 0.47333 },
    "Biology": { "num": 30, "acc": 0.56667 },
    "Chemistry": { "num": 30, "acc": 0.36667 },
    "Geography": { "num": 30, "acc": 0.63333 },
    "Math": { "num": 30, "acc": 0.36667 },
    "Physics": { "num": 30, "acc": 0.43333 },
    "Overall-Health and Medicine": { "num": 150, "acc": 0.47333 },
    "Basic_Medical_Science": { "num": 30, "acc": 0.4 },
    "Clinical_Medicine": { "num": 30, "acc": 0.5 },
    "Diagnostics_and_Laboratory_Medicine": { "num": 30, "acc": 0.36667 },
    "Pharmacy": { "num": 30, "acc": 0.53333 },
    "Public_Health": { "num": 30, "acc": 0.56667 },
    "Overall-Humanities and Social Science": { "num": 120, "acc": 0.65833 },
    "History": { "num": 30, "acc": 0.6 },
    "Literature": { "num": 30, "acc": 0.83333 },
    "Sociology": { "num": 30, "acc": 0.56667 },
    "Psychology": { "num": 30, "acc": 0.63333 },
    "Overall-Tech and Engineering": { "num": 210, "acc": 0.45714 },
    "Agriculture": { "num": 30, "acc": 0.56667 },
    "Architecture_and_Engineering": { "num": 30, "acc": 0.46667 },
    "Computer_Science": { "num": 30, "acc": 0.53333 },
    "Electronics": { "num": 30, "acc": 0.36667 },
    "Energy_and_Power": { "num": 30, "acc": 0.46667 },
    "Materials": { "num": 30, "acc": 0.53333 },
    "Mechanical_Engineering": { "num": 30, "acc": 0.26667 },
    "Overall": { "num": 900, "acc": 0.52444 }
}

@ltaodream ltaodream requested a review from BBuf August 13, 2025 02:08

ttaohe commented Aug 23, 2025

cc @BBuf @yuan-luo

@FlamingoPg
Collaborator

Cool, maybe this needs a rebase onto master.

Author

ltaodream commented Aug 25, 2025

Cool, maybe this needs a rebase onto master.

@FlamingoPg Hi, I’ve rebased my branch onto the latest main. Since the old branch had a long chain of merge commits from the original main, the history is now completely rewritten and GitHub can’t update PR #6946 cleanly. Would you prefer that I close #6946 and open a fresh PR with the single squash commit, or do you have another way you’d like to handle it?

@FlamingoPg
Collaborator

> @FlamingoPg Hi, I’ve rebased my branch onto the latest main. […]

Maybe you can reopen a new PR

@ltaodream
Author

The original PR link has been closed. New PR link: [Feature] Add tile-based method of supporting large VPT in moe_fused_gate kernel by ltaodream · Pull Request #9579 · sgl-project/sglang

@ltaodream ltaodream closed this Aug 25, 2025
@ltaodream ltaodream reopened this Aug 25, 2025
@ltaodream
Author

> (quoting the rebase discussion above)

New PR link: [Feature] Add tile-based method of supporting large VPT in moe_fused_gate kernel by ltaodream · Pull Request #9579

@ltaodream ltaodream closed this Aug 25, 2025
