Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 #6835
Xu-Wenqing wants to merge 17 commits into sgl-project:main
Conversation
Signed-off-by: Xu-Wenqing <xwq391974@alibaba-inc.com>
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Hello @Xu-Wenqing, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR aims to improve the performance of the fused Mixture-of-Experts (MoE) Triton kernel specifically for the DeepSeek-R1-0528 model when running on NVIDIA H20-3e GPUs. The core change involves adding a new configuration file containing optimized tuning parameters for the kernel across various sequence lengths. The PR description includes benchmark results demonstrating the performance benefits of applying these configurations, showing improvements in throughput and latency metrics.
Highlights
- Add MoE Kernel Tuning Configs: This pull request introduces a new JSON configuration file containing optimized parameters for the fused MoE Triton kernel. These parameters were generated through a tuning process to find the best settings for performance.
- Target Hardware and Model: The added tuning configurations are specifically tailored for the DeepSeek-R1-0528 model running on NVIDIA H20-3e GPUs, utilizing the fp8_w8a8 data type.
- Performance Improvement: Benchmark results provided in the PR description show that applying these new tuning configurations leads to improved request throughput, total token throughput, and reduced end-to-end latency and inter-token latency compared to running without the specific configs.
Changelog
- python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128, 128].json
- Added a new JSON file containing optimized kernel tuning parameters for the fused MoE Triton kernel on H20-3e GPUs.
- Includes tuning parameters (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages) for various sequence lengths (1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096).
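The changelog above describes the shape of such a config file: a map from a token-count bucket to Triton kernel parameters, from which the kernel picks the entry closest to the actual number of tokens. As a hedged sketch (the parameter values below are illustrative placeholders, not copied from the PR's tuned file), the lookup typically works like this:

```python
# Illustrative entries only -- the real values come from the tuning run in this PR.
EXAMPLE_CONFIG = {
    "1":    {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64":   {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 3},
    "4096": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 32, "num_warps": 8, "num_stages": 4},
}

def pick_config(config: dict, num_tokens: int) -> dict:
    """Select the tuned entry whose token-count bucket is closest to
    num_tokens, mirroring how fused-MoE kernels typically consume
    these JSON config files."""
    best_key = min(config.keys(), key=lambda k: abs(int(k) - num_tokens))
    return config[best_key]

print(pick_config(EXAMPLE_CONFIG, 48))  # closest bucket is "64"
```

This is why the file carries entries for many sequence lengths: each bucket gets its own tile sizes and pipeline depth, and intermediate batch sizes fall back to the nearest tuned bucket.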
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Kernels tuned just right,
On H20-3e they fly,
Faster tokens stream.
Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request adds new MoE kernel tuning configurations for the DeepSeek-R1-0528 model on NVIDIA H20-3e GPUs. The changes are well-motivated, and the provided benchmark results clearly demonstrate a performance improvement with the new configurations, which is excellent work!
The added JSON configuration file is well-structured and its filename follows the established conventions. The parameters within the configuration appear to be within reasonable ranges for Triton kernel tuning.
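As a rough illustration of what "reasonable ranges" means for these parameters (the bounds below are assumptions for illustration, not validation logic from sglang or Triton), one could sanity-check a config entry like this:

```python
def looks_reasonable(entry: dict) -> bool:
    """Heuristic bounds for Triton fused-MoE tuning parameters.
    The limits below are illustrative assumptions, not enforced by sglang."""
    pow2 = lambda x: x > 0 and (x & (x - 1)) == 0
    return (
        pow2(entry["BLOCK_SIZE_M"]) and 16 <= entry["BLOCK_SIZE_M"] <= 256
        and pow2(entry["BLOCK_SIZE_N"]) and 16 <= entry["BLOCK_SIZE_N"] <= 256
        and pow2(entry["BLOCK_SIZE_K"]) and 32 <= entry["BLOCK_SIZE_K"] <= 256
        and entry["GROUP_SIZE_M"] >= 1
        and entry["num_warps"] in (4, 8)
        and 2 <= entry["num_stages"] <= 5
    )

print(looks_reasonable({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128,
                        "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 32,
                        "num_warps": 8, "num_stages": 4}))  # True
```

Block sizes are powers of two because Triton tiles map onto warp-level matrix units; `num_stages` controls software-pipelining depth and trades shared-memory usage for latency hiding.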
One minor suggestion for overall repository maintainability (outside the scope of this diff, but relevant to the context of this PR):
- The `README.md` file in `python/sglang/srt/layers/moe/fused_moe_triton/configs/` currently mentions that "The example configurations provided are for the Mixtral model...". It might be beneficial to update this `README` to reflect that configurations for other models and hardware (like DeepSeek-R1 and H20-3e as added in this PR) are also present, or to generalize the statement about examples.
Overall, the changes in this PR are clear and beneficial. Good job!
Summary of Findings
- Code Quality: The added JSON configuration file is well-structured, and the parameters are consistent with typical Triton kernel tuning configurations.
- Performance: The benchmark results provided in the pull request description show a clear performance improvement with the new MoE configurations, validating their effectiveness.
- Documentation (Outside Diff): A minor suggestion was made in the general comments to update the `README` in the `configs` directory to reflect the growing set of models/hardware supported by the tuning configurations. This was not commented on directly as it's outside the diff.
Merge Readiness
The code changes in this pull request appear to be of high quality and demonstrate a positive performance impact. Based on the review of the provided diff, there are no critical or high-severity issues identified. The PR seems ready for merging from a code quality perspective, pending any further internal checks or discussions. As a language model, I am not authorized to approve pull requests; this decision should be made by the repository maintainers.
@Xu-Wenqing Thank you for your contribution. We are going to update our implementation for shared experts fusion to reduce redundant memory usage (#6736). Could you please further fine-tune the configuration for E=257? Thank you very much.
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
@ch-wan Sure. Added E=257 config.
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
@ch-wan Removed the E=264 config and updated the benchmark result for E=257. Since the H20-3e has 141 GB of memory, there's no need to tune tp=16 for the DeepSeek-V3/R1 models.
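For context on the switch from E=264 to E=257: the config filename encodes the expert count E alongside the intermediate size N, device, dtype, and quantization block shape. DeepSeek-R1/V3 routes over 256 experts, and appending a single fused shared expert under the updated fusion scheme (#6736) gives E=257. The helper below is hypothetical (not sglang code) and just sketches the naming convention:

```python
def moe_config_filename(E, N, device_name, dtype, block_shape):
    """Build the JSON filename convention used by the fused MoE config
    directory, matching the file added in this PR. Hypothetical helper
    for illustration only."""
    return (
        f"E={E},N={N},device_name={device_name},"
        f"dtype={dtype},block_shape={block_shape}.json"
    )

# 256 routed experts + 1 fused shared expert under the updated fusion scheme.
print(moe_config_filename(257, 256, "NVIDIA_H20-3e", "fp8_w8a8", [128, 128]))
```

Because the expert count is baked into the filename, a config tuned for E=264 would simply not be found once the fusion change lands, which is why re-tuning for E=257 was requested.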
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Motivation
The H20-3e is an H20 GPU with 141 GB of memory. This PR adds fused MoE kernel tuning configs for DeepSeek-R1/V3.
Modifications
Tuning command:
Deploy DeepSeek-R1-0528:
Benchmark:
Result (without MoE config):
Result (with MoE config):
Checklist