Optimization Items
- MoE TopK kernel fusion @BBuf
- Opt kimi_k2_thinking biased topk module #13150 @BBuf
- [opt kimi k2 1 / n] Add kimi k2 moe fused gate #13287 @BBuf
- [opt kimi k2 2/n] apply kimi k2 thinking moe_fused_gate #13332 @BBuf
- [opt kimi k2 3/n] opt kimi_k2 moe_fused_gate kernel #13374 @BBuf
- [opt kimi k2 4 / n] Delete useless pad kernel in sgl_moe_align_block_size #13587 @BBuf
- [kimi k2 thinking] Avoid useless torch.zeros_ #13596 @BBuf
- Fix MoE tuning bug @BBuf: [Kernel] Simplify fused_marlin_moe by removing config tuning logic #13723; Apply new moe wna16 marlin gemm #14125
- Fix IMA in large batch and long seq_length (sync vLLM Marlin MoE) @BBuf: [Hot fix] Fix Kimi k2 thinking ima #13717; [Opt Kimi k2 thinking] Fix shared memory allocation in Marlin MoE kernel for large block sizes #13902; Add new moe wna16 marlin gemm #14122; Apply new moe wna16 marlin gemm #14125
- Opt moe align block size @BBuf: Opt moe align block size kernel #14133; Apply new moe align block size kernel #14134
- Optimize reduce-sum kernel after MoE: Apply moe_reduce_sum kernel for fused_marlin_moe #12888; Apply back moe_sum_reduce for fused_marlin_moe #14829
- EP support for Marlin MoE @BBuf: Add Expert Parallelism (EP) support for kimi-k2-thinking #13725
- DeepEP MoE support (all-to-all) @BBuf: [DeepEP Support] Support kimi-k2-thinking deepep #13789; [DeepEP] Add SGLANG_DEEPEP_BF16_DISPATCH env var in Normal mode #13787
- Support piecewise CUDA graph @b8zhong: [Piecewise CUDA Graph] Support Kimi-K2 (non-Thinking) #13466
- FlashInfer TRT-LLM Marlin kernel for SM100 @b8zhong: feat: MxInt4 x Bf16 TRT-LLM Gen MoE support flashinfer-ai/flashinfer#2159 (WIP: depends on FlashInfer bump)
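For context on the MoE TopK kernel fusion items above, here is a minimal unfused reference of a biased top-k gate. This is an illustrative sketch only: it assumes sigmoid scoring with a selection-only expert bias (DeepSeek-V3-style routing) and omits details such as grouped expert selection and routed scaling. It is not the SGLang moe_fused_gate kernel, which collapses the scoring, bias add, top-k, and renormalization steps below into a single launch.

```python
import torch

def biased_topk_gate_reference(hidden, gate_weight, expert_bias, top_k):
    """Unfused reference of a biased top-k router (illustrative names/shapes).

    hidden:      [num_tokens, hidden_dim]
    gate_weight: [num_experts, hidden_dim]
    expert_bias: [num_experts] load-balancing bias, used for selection only
    """
    # Per-expert routing scores (sigmoid scoring is an assumption here).
    scores = torch.sigmoid(hidden.float() @ gate_weight.float().t())  # [tokens, experts]

    # Pick experts with the bias added, but keep the unbiased scores as combine weights.
    _, topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    topk_weights = torch.gather(scores, dim=1, index=topk_idx)

    # Renormalize the selected weights to sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_idx
```

A fused gate kernel avoids materializing the intermediate [tokens, experts] tensors and the extra kernel launches between these steps, which is where the wins in the PRs above come from.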
Related resources
https://huggingface.co/moonshotai/Kimi-K2-Thinking
Profiling command example:
export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
python -m sglang.launch_server --model-path moonshotai/Kimi-K2-Thinking --tp 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
# bs1
python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 5 --profile
# bs32
python3 -m sglang.bench_serving --model moonshotai/Kimi-K2-Thinking --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 32 --num-prompts 32 --profile
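With SGLANG_TORCH_PROFILER_DIR set as above and --profile passed to bench_serving, the server records torch profiler traces for the benchmarked requests into that directory; the resulting chrome-trace files can typically be inspected in Perfetto or chrome://tracing.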