[Fix] Modify tile-based method of supporting large VPT in moe_fused_gate kernel#1

Merged
ltaodream merged 12 commits into ltaodream:main from ttaohe:main
Aug 12, 2025
Collaborator

@ttaohe ttaohe commented Aug 12, 2025

MoE Fused Gate: add tiled path and static specializations for large VPT (64/384), unify switch-case dispatch, and provide multi-dtype benchmarking.

  • Added a tiled implementation for large VPT and two static specializations for THREADS_PER_ROW=1: (num_experts=64, group=1) and (num_experts=384, group=1). These are exposed via consistent switch-case dispatch using LAUNCH_MOE_GATE_TILED_CONFIG.
  • Kept existing template fast paths for small VPT (e.g., 128/256), and route all other large-VPT cases to the generic tiled kernel.
  • Refactored tiled declarations/macros into moe_fused_gate_tiled.h to keep moe_fused_gate.cu clean.
  • Added a benchmark (bf16/fp16/fp32) comparing Original (eager, compile-static, compile-dynamic) vs SGL Kernel on representative large-VPT configs.
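The routing described in the bullets above can be sketched on the host side roughly as follows. This is a minimal illustration, not the PR's code: `GatePath`, `select_gate_path`, `tiled_argmax`, and `kMaxFastVpt` are hypothetical names, and the sketch assumes that VPT means `num_experts / num_expert_group` and that the template fast paths cover VPT up to 32, neither of which is stated in this description. The second function only illustrates the general idea of scanning a wide row in fixed-size tiles on the CPU; the actual kernel lives in CUDA and is dispatched through `LAUNCH_MOE_GATE_TILED_CONFIG`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical labels for the three kinds of path named in the description.
enum class GatePath { TemplateFastPath, StaticTiled, GenericTiled };

// Assumption (not stated in the PR): VPT = num_experts / num_expert_group,
// and the template fast paths handle VPT <= kMaxFastVpt.
constexpr int kMaxFastVpt = 32;

// Choose a path for a given configuration. Both arguments must be positive.
GatePath select_gate_path(int num_experts, int num_expert_group) {
  const int vpt = num_experts / num_expert_group;
  if (vpt <= kMaxFastVpt) {
    return GatePath::TemplateFastPath;  // small VPT keeps the existing path
  }
  if (num_expert_group == 1) {  // corresponds to THREADS_PER_ROW=1
    switch (num_experts) {
      case 64:
      case 384:
        return GatePath::StaticTiled;  // the two static specializations
      default:
        break;
    }
  }
  return GatePath::GenericTiled;  // every other large-VPT case
}

// CPU illustration of tiling: scan one row of expert scores tile by tile,
// carrying the running best across tiles. Returns -1 for an empty row or a
// zero tile size.
int tiled_argmax(const std::vector<float>& row, std::size_t tile) {
  if (tile == 0) {
    return -1;
  }
  int best = -1;
  float best_val = 0.0f;
  for (std::size_t start = 0; start < row.size(); start += tile) {
    const std::size_t end = std::min(start + tile, row.size());
    for (std::size_t i = start; i < end; ++i) {
      if (best < 0 || row[i] > best_val) {
        best = static_cast<int>(i);
        best_val = row[i];
      }
    }
  }
  return best;
}
```

Under these assumptions, `select_gate_path(64, 1)` and `select_gate_path(384, 1)` pick the static specializations, while a configuration such as `(512, 1)` falls through to the generic tiled path.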

Results of kimi-vl and kimi-k2
image

MMMU benchmark result of kimi-vl
image

@ltaodream ltaodream merged commit 29ed296 into ltaodream:main Aug 12, 2025
@ltaodream ltaodream assigned ltaodream and ttaohe and unassigned ltaodream and ttaohe Aug 12, 2025
@ltaodream ltaodream self-requested a review August 12, 2025 10:43
2 participants