overlap shared + routed expert computation in kimi linear #12660
Conversation
Pull Request Overview
This PR adds dual-stream CUDA execution to the Kimi Linear model's MoE (Mixture of Experts) layer, improving performance by overlapping the shared-expert and routed-expert computations on separate CUDA streams. The implementation mirrors the pattern already used in other MoE models such as DeepSeek V2, GLM4 MoE, Qwen2 MoE, and Bailing MoE.
- Adds dual-stream support, using an alternative CUDA stream to run the shared-expert and routed-expert computations in parallel
- Applies the dual-stream path only during CUDA graph capture mode and only for small batches (≤1024 tokens)
- Creates a single shared alternative stream at the model level and passes it down to the decoder layers and MoE modules (see the sketch after this list)
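The stream choreography is roughly the following. This is a minimal sketch, not the PR's actual code: the class name `DualStreamMoE`, the `shared_experts`/`routed_experts` submodules, and the `capture_mode` flag are illustrative assumptions; only the overlap pattern and the gating conditions come from the description above.

```python
import torch
import torch.nn as nn


class DualStreamMoE(nn.Module):
    """Sketch: overlap shared and routed experts on two CUDA streams."""

    def __init__(self, shared_experts: nn.Module, routed_experts: nn.Module,
                 alt_stream: "torch.cuda.Stream | None" = None):
        super().__init__()
        self.shared_experts = shared_experts
        self.routed_experts = routed_experts
        # One alternative stream is created at the model level and shared
        # by every decoder layer, so streams are not allocated per layer.
        self.alt_stream = alt_stream

    def forward(self, hidden_states: torch.Tensor, capture_mode: bool = False):
        num_tokens = hidden_states.shape[0]
        use_dual_stream = (
            self.alt_stream is not None
            and capture_mode          # only while capturing CUDA graphs
            and num_tokens <= 1024    # only for small batches
        )
        if use_dual_stream:
            current_stream = torch.cuda.current_stream()
            # The alternative stream must see hidden_states fully written.
            self.alt_stream.wait_stream(current_stream)
            with torch.cuda.stream(self.alt_stream):
                shared_out = self.shared_experts(hidden_states)
            # Routed experts run concurrently on the main stream.
            routed_out = self.routed_experts(hidden_states)
            # Rejoin before combining the two results.
            current_stream.wait_stream(self.alt_stream)
        else:
            shared_out = self.shared_experts(hidden_states)
            routed_out = self.routed_experts(hidden_states)
        return shared_out + routed_out
```

Restricting the overlap to CUDA graph capture with small batches is the conservative choice: at small batch sizes both expert paths underutilize the GPU, so running them concurrently hides latency, while large batches would contend for the same SMs and gain little.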
yizhang2077 left a comment:
LGTM
@Fridge003 Do you think it's ready?
@b8zhong Yeah, should be ready.
Benchmark: `python3 -m sglang.test.send_one`, run before and after the change.
Speedup: around 6%.
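As a complement to the end-to-end test, the per-step decode latency can be compared directly with CUDA events. A minimal sketch, where `run_decode_step` is a hypothetical callable wrapping one decode forward pass (not part of the PR):

```python
import torch


def time_decode_step(run_decode_step, warmup: int = 10, iters: int = 100) -> float:
    """Return average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        run_decode_step()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        run_decode_step()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```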