Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 #6835
Xu-Wenqing wants to merge 17 commits into sgl-project:main
Conversation
Signed-off-by: Xu-Wenqing <xwq391974@alibaba-inc.com>
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Hello @Xu-Wenqing, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR aims to improve the performance of the fused Mixture-of-Experts (MoE) Triton kernel specifically for the DeepSeek-R1-0528 model when running on NVIDIA H20-3e GPUs. The core change involves adding a new configuration file containing optimized tuning parameters for the kernel across various sequence lengths. The PR description includes benchmark results demonstrating the performance benefits of applying these configurations, showing improvements in throughput and latency metrics.
Highlights
- Add MoE Kernel Tuning Configs: This pull request introduces a new JSON configuration file containing optimized parameters for the fused MoE Triton kernel. These parameters were generated through a tuning process to find the best settings for performance.
- Target Hardware and Model: The added tuning configurations are specifically tailored for the DeepSeek-R1-0528 model running on NVIDIA H20-3e GPUs, utilizing the fp8_w8a8 data type.
- Performance Improvement: Benchmark results provided in the PR description show that applying these new tuning configurations leads to improved request throughput, total token throughput, and reduced end-to-end latency and inter-token latency compared to running without the specific configs.
Changelog
- python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128, 128].json
- Added a new JSON file containing optimized kernel tuning parameters for the fused MoE Triton kernel on H20-3e GPUs.
- Includes tuning parameters (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages) for various sequence lengths (1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096).
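The changelog above describes the shape of such a config file: a map from a token-count bucket to Triton kernel parameters, from which the kernel picks the entry closest to the actual number of tokens. As a hedged sketch (the parameter values below are illustrative placeholders, not copied from the PR's tuned file), the lookup typically works like this:

```python
# Illustrative entries only -- the real values come from the tuning run in this PR.
EXAMPLE_CONFIG = {
    "1":    {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64":   {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 3},
    "4096": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 32, "num_warps": 8, "num_stages": 4},
}

def pick_config(config: dict, num_tokens: int) -> dict:
    """Select the tuned entry whose token-count bucket is closest to
    num_tokens, mirroring how fused-MoE kernels typically consume
    these JSON config files."""
    best_key = min(config.keys(), key=lambda k: abs(int(k) - num_tokens))
    return config[best_key]

print(pick_config(EXAMPLE_CONFIG, 48))  # closest bucket is "64"
```

This is why the file carries entries for many sequence lengths: each bucket gets its own tile sizes and pipeline depth, and intermediate batch sizes fall back to the nearest tuned bucket.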
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Kernels tuned just right,
On H20-3e they fly,
Faster tokens stream.
Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request adds new MoE kernel tuning configurations for the DeepSeek-R1-0528 model on NVIDIA H20-3e GPUs. The changes are well-motivated, and the provided benchmark results clearly demonstrate a performance improvement with the new configurations, which is excellent work!
The added JSON configuration file is well-structured and its filename follows the established conventions. The parameters within the configuration appear to be within reasonable ranges for Triton kernel tuning.
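As a rough illustration of what "reasonable ranges" means for these parameters (the bounds below are assumptions for illustration, not validation logic from sglang or Triton), one could sanity-check a config entry like this:

```python
def looks_reasonable(entry: dict) -> bool:
    """Heuristic bounds for Triton fused-MoE tuning parameters.
    The limits below are illustrative assumptions, not enforced by sglang."""
    pow2 = lambda x: x > 0 and (x & (x - 1)) == 0
    return (
        pow2(entry["BLOCK_SIZE_M"]) and 16 <= entry["BLOCK_SIZE_M"] <= 256
        and pow2(entry["BLOCK_SIZE_N"]) and 16 <= entry["BLOCK_SIZE_N"] <= 256
        and pow2(entry["BLOCK_SIZE_K"]) and 32 <= entry["BLOCK_SIZE_K"] <= 256
        and entry["GROUP_SIZE_M"] >= 1
        and entry["num_warps"] in (4, 8)
        and 2 <= entry["num_stages"] <= 5
    )

print(looks_reasonable({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128,
                        "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 32,
                        "num_warps": 8, "num_stages": 4}))  # True
```

Block sizes are powers of two because Triton tiles map onto warp-level matrix units; `num_stages` controls software-pipelining depth and trades shared-memory usage for latency hiding.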
One minor suggestion for overall repository maintainability (outside the scope of this diff, but relevant to the context of this PR):
- The `README.md` file in `python/sglang/srt/layers/moe/fused_moe_triton/configs/` currently mentions that "The example configurations provided are for the Mixtral model...". It might be beneficial to update this `README` to reflect that configurations for other models and hardware (like DeepSeek-R1 and H20-3e as added in this PR) are also present, or to generalize the statement about examples.
Overall, the changes in this PR are clear and beneficial. Good job!
Summary of Findings
- Code Quality: The added JSON configuration file is well-structured, and the parameters are consistent with typical Triton kernel tuning configurations.
- Performance: The benchmark results provided in the pull request description show a clear performance improvement with the new MoE configurations, validating their effectiveness.
- Documentation (Outside Diff): A minor suggestion was made in the general comments to update the `README` in the `configs` directory to reflect the growing set of models/hardware supported by the tuning configurations. This was not commented on directly as it's outside the diff.
Merge Readiness
The code changes in this pull request appear to be of high quality and demonstrate a positive performance impact. Based on the review of the provided diff, there are no critical or high-severity issues identified. The PR seems ready for merging from a code quality perspective, pending any further internal checks or discussions. As a language model, I am not authorized to approve pull requests; this decision should be made by the repository maintainers.
@Xu-Wenqing Thank you for your contribution. We are going to update our implementation for shared experts fusion to reduce redundant memory usage (#6736). Could you please further fine-tune the configuration for E=257? Thank you very much.
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
@ch-wan Sure. Added E=257 config.
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
@ch-wan Removed the E=264 config and updated the benchmark result for E=257. Since the H20-3e has 141 GB of memory, there's no need to tune tp=16 for the DeepSeek-V3/R1 models.
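For context on the switch from E=264 to E=257: the config filename encodes the expert count E alongside the intermediate size N, device, dtype, and quantization block shape. DeepSeek-R1/V3 routes over 256 experts, and appending a single fused shared expert under the updated fusion scheme (#6736) gives E=257. The helper below is hypothetical (not sglang code) and just sketches the naming convention:

```python
def moe_config_filename(E, N, device_name, dtype, block_shape):
    """Build the JSON filename convention used by the fused MoE config
    directory, matching the file added in this PR. Hypothetical helper
    for illustration only."""
    return (
        f"E={E},N={N},device_name={device_name},"
        f"dtype={dtype},block_shape={block_shape}.json"
    )

# 256 routed experts + 1 fused shared expert under the updated fusion scheme.
print(moe_config_filename(257, 256, "NVIDIA_H20-3e", "fp8_w8a8", [128, 128]))
```

Because the expert count is baked into the filename, a config tuned for E=264 would simply not be found once the fusion change lands, which is why re-tuning for E=257 was requested.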
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Motivation
The H20-3e is an H20 GPU with 141 GB of memory. This PR adds fused MoE kernel tuning configs for DeepSeek-R1/V3.
Modifications
Tuning command:
Deploy DeepSeek-R1-0528:
Benchmark:
Result (without MoE config):
Result (with MoE config):
Checklist