overlap shared + routed expert computation in kimi linear #12660
Conversation
Pull Request Overview
This PR adds dual-stream CUDA execution to the Kimi Linear model's MoE (Mixture of Experts) layer, improving performance by overlapping the shared-expert and routed-expert computations on separate CUDA streams. The implementation mirrors the pattern already used in other MoE models such as DeepSeek V2, GLM4 MoE, Qwen2 MoE, and Bailing MoE.
- Adds dual-stream support, using an alternative CUDA stream to run the shared-expert and routed-expert computations in parallel
- Applies the dual-stream path only during CUDA graph capture mode and only for small batches (≤1024 tokens)
- Creates a single shared alternative stream at the model level and passes it down to the decoder layers and MoE modules (see the sketch after this list)
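The stream choreography is roughly the following. This is a minimal sketch, not the PR's actual code: the class name `DualStreamMoE`, the `shared_experts`/`routed_experts` submodules, and the `capture_mode` flag are illustrative assumptions; only the overlap pattern and the gating conditions come from the description above.

```python
import torch
import torch.nn as nn


class DualStreamMoE(nn.Module):
    """Sketch: overlap shared and routed experts on two CUDA streams."""

    def __init__(self, shared_experts: nn.Module, routed_experts: nn.Module,
                 alt_stream: "torch.cuda.Stream | None" = None):
        super().__init__()
        self.shared_experts = shared_experts
        self.routed_experts = routed_experts
        # One alternative stream is created at the model level and shared
        # by every decoder layer, so streams are not allocated per layer.
        self.alt_stream = alt_stream

    def forward(self, hidden_states: torch.Tensor, capture_mode: bool = False):
        num_tokens = hidden_states.shape[0]
        use_dual_stream = (
            self.alt_stream is not None
            and capture_mode          # only while capturing CUDA graphs
            and num_tokens <= 1024    # only for small batches
        )
        if use_dual_stream:
            current_stream = torch.cuda.current_stream()
            # The alternative stream must see hidden_states fully written.
            self.alt_stream.wait_stream(current_stream)
            with torch.cuda.stream(self.alt_stream):
                shared_out = self.shared_experts(hidden_states)
            # Routed experts run concurrently on the main stream.
            routed_out = self.routed_experts(hidden_states)
            # Rejoin before combining the two results.
            current_stream.wait_stream(self.alt_stream)
        else:
            shared_out = self.shared_experts(hidden_states)
            routed_out = self.routed_experts(hidden_states)
        return shared_out + routed_out
```

Restricting the overlap to CUDA graph capture with small batches is the conservative choice: at small batch sizes both expert paths underutilize the GPU, so running them concurrently hides latency, while large batches would contend for the same SMs and gain little.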
yizhang2077 left a comment:
LGTM
@Fridge003 Do you think it's ready?
@b8zhong Yeah, should be ready.
Benchmark: `python3 -m sglang.test.send_one`, run before and after the change.
Speedup: around 6%.
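As a complement to the end-to-end test, the per-step decode latency can be compared directly with CUDA events. A minimal sketch, where `run_decode_step` is a hypothetical callable wrapping one decode forward pass (not part of the PR):

```python
import torch


def time_decode_step(run_decode_step, warmup: int = 10, iters: int = 100) -> float:
    """Return average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        run_decode_step()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        run_decode_step()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```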