
Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 #6835

Closed
Xu-Wenqing wants to merge 17 commits into sgl-project:main from Xu-Wenqing:dev/add_h20_3e_moe


Conversation

Contributor

@Xu-Wenqing Xu-Wenqing commented Jun 3, 2025

Motivation

The H20-3e is an H20 GPU with 141 GB of memory. This PR adds fused MoE kernel tuning configs for DeepSeek-R1.

Modifications

Tuning command:

python3 /mnt/data/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py --model /mnt/data/DeepSeek-R1-0528 --tp-size 8 --dtype fp8_w8a8 --tune

Deploy DeepSeek-R1-0528:

python3 -m sglang.launch_server --model-path /mnt/data/DeepSeek-R1-0528 --disable-radix-cache --host 0.0.0.0 --port 8000 --tp 8 --trust-remote-code --enable-metrics --served-model-name DeepSeek-R1-0528

Benchmark:

python3 -m sglang.bench_serving --tokenizer /mnt/data/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --backend sglang --dataset-name random --random-input 1024 --random-output 512 --max-concurrency 8 --num-prompt 200

Result (without MoE config):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     200       
Benchmark duration (s):                  200.99    
Total input tokens:                      103005    
Total generated tokens:                  53590     
Total generated tokens (retokenized):    53520     
Request throughput (req/s):              1.00      
Input token throughput (tok/s):          512.49    
Output token throughput (tok/s):         266.63    
Total token throughput (tok/s):          779.12    
Concurrency:                             7.78      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7820.23   
Median E2E Latency (ms):                 7870.32   
---------------Time to First Token----------------
Mean TTFT (ms):                          191.25    
Median TTFT (ms):                        155.90    
P99 TTFT (ms):                           903.01    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.58     
Median ITL (ms):                         25.83     
P95 ITL (ms):                            29.02     
P99 ITL (ms):                            112.89    
Max ITL (ms):                            1471.31   
==================================================

Result (with MoE config):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     200       
Benchmark duration (s):                  176.85    
Total input tokens:                      103005    
Total generated tokens:                  53590     
Total generated tokens (retokenized):    53431     
Request throughput (req/s):              1.13      
Input token throughput (tok/s):          582.44    
Output token throughput (tok/s):         303.02    
Total token throughput (tok/s):          885.46    
Concurrency:                             7.78      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   6878.36   
Median E2E Latency (ms):                 6753.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          170.69    
Median TTFT (ms):                        148.23    
P99 TTFT (ms):                           668.37    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.13     
Median ITL (ms):                         22.56     
P95 ITL (ms):                            24.54     
P99 ITL (ms):                            116.88    
Max ITL (ms):                            259.50    
==================================================
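The two benchmark runs above can be compared directly. The snippet below is a small sketch that derives the headline improvements from the reported numbers (the inputs are copied from the result blocks; the percentages are computed, not measured separately):

```python
# Numbers copied from the two "Serving Benchmark Result" blocks above.
without_cfg = {"total_tok_s": 779.12, "mean_e2e_ms": 7820.23, "duration_s": 200.99}
with_cfg = {"total_tok_s": 885.46, "mean_e2e_ms": 6878.36, "duration_s": 176.85}

# Relative improvement in total token throughput.
throughput_gain = 100 * (with_cfg["total_tok_s"] - without_cfg["total_tok_s"]) / without_cfg["total_tok_s"]
# Relative reduction in mean end-to-end latency.
latency_drop = 100 * (without_cfg["mean_e2e_ms"] - with_cfg["mean_e2e_ms"]) / without_cfg["mean_e2e_ms"]

print(f"Total token throughput: +{throughput_gain:.1f}%")  # ~ +13.6%
print(f"Mean E2E latency:       -{latency_drop:.1f}%")     # ~ -12.0%
```

So the tuned configs buy roughly a 13.6% throughput gain and a 12% latency reduction at this concurrency level.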

Checklist

Xu-Wenqing and others added 4 commits May 30, 2025 21:08
Signed-off-by: Xu-Wenqing <xwq391974@alibaba-inc.com>
Signed-off-by: Xu-Wenqing <xwq391974@alibaba-inc.com>
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @Xu-Wenqing, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR aims to improve the performance of the fused Mixture-of-Experts (MoE) Triton kernel specifically for the DeepSeek-R1-0528 model when running on NVIDIA H20-3e GPUs. The core change involves adding a new configuration file containing optimized tuning parameters for the kernel across various sequence lengths. The PR description includes benchmark results demonstrating the performance benefits of applying these configurations, showing improvements in throughput and latency metrics.

Highlights

  • Add MoE Kernel Tuning Configs: This pull request introduces a new JSON configuration file containing optimized parameters for the fused MoE Triton kernel. These parameters were generated through a tuning process to find the best settings for performance.
  • Target Hardware and Model: The added tuning configurations are specifically tailored for the DeepSeek-R1-0528 model running on NVIDIA H20-3e GPUs, utilizing the fp8_w8a8 data type.
  • Performance Improvement: Benchmark results provided in the PR description show that applying these new tuning configurations leads to improved request throughput, total token throughput, and reduced end-to-end latency and inter-token latency compared to running without the specific configs.

Changelog

  • python/sglang/srt/layers/moe/fused_moe_triton/configs/E=264,N=256,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128, 128].json
    • Added a new JSON file containing optimized kernel tuning parameters for the fused MoE Triton kernel on H20-3e GPUs.
    • Includes tuning parameters (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages) for various sequence lengths (1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096).
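Configs of this kind map a token count (the keys listed above) to a set of Triton launch parameters. The sketch below shows the general shape of such a file; the parameter names and batch-size keys come from the changelog entry above, but the specific values are illustrative, not the tuned values added by this PR:

```json
{
  "1":    {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
  "64":   {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 3},
  "4096": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 32, "num_warps": 8, "num_stages": 4}
}
```

At runtime the kernel looks up the entry for the nearest batch size, so covering the listed range from 1 to 4096 tokens keeps both decode and prefill shapes tuned.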
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Kernels tuned just right,
On H20-3e they fly,
Faster tokens stream.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds new MoE kernel tuning configurations for the DeepSeek-R1-0528 model on NVIDIA H20-3e GPUs. The changes are well-motivated, and the provided benchmark results clearly demonstrate a performance improvement with the new configurations, which is excellent work!

The added JSON configuration file is well-structured and its filename follows the established conventions. The parameters within the configuration appear to be within reasonable ranges for Triton kernel tuning.

One minor suggestion for overall repository maintainability (outside the scope of this diff, but relevant to the context of this PR):

  • The README.md file in python/sglang/srt/layers/moe/fused_moe_triton/configs/ currently mentions that "The example configurations provided are for the Mixtral model...". It might be beneficial to update this README to reflect that configurations for other models and hardware (like DeepSeek-R1 and H20-3e as added in this PR) are also present, or to generalize the statement about examples.

Overall, the changes in this PR are clear and beneficial. Good job!

Summary of Findings

  • Code Quality: The added JSON configuration file is well-structured, and the parameters are consistent with typical Triton kernel tuning configurations.
  • Performance: The benchmark results provided in the pull request description show a clear performance improvement with the new MoE configurations, validating their effectiveness.
  • Documentation (Outside Diff): A minor suggestion was made in the general comments to update the README in the configs directory to reflect the growing set of models/hardware supported by the tuning configurations. This was not commented on directly as it's outside the diff.

Merge Readiness

The code changes in this pull request appear to be of high quality and demonstrate a positive performance impact. Based on the review of the provided diff, there are no critical or high-severity issues identified. The PR seems ready for merging from a code quality perspective, pending any further internal checks or discussions. As a language model, I am not authorized to approve pull requests; this decision should be made by the repository maintainers.

Collaborator

ch-wan commented Jun 3, 2025

@Xu-Wenqing Thank you for your contribution. We are going to update our implementation of shared-experts fusion to reduce redundant memory usage (#6736). Could you please further fine-tune the configuration for E=257? Thank you very much.

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Contributor Author

> @Xu-Wenqing Thank you for your contribution. We are going to update our implementation of shared-experts fusion to reduce redundant memory usage (#6736). Could you please further fine-tune the configuration for E=257? Thank you very much.

@ch-wan Sure. Added E=257 config.

Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
Contributor Author

Xu-Wenqing commented Jun 4, 2025

@ch-wan @zhyncs PR #6735 removed --n-share-experts-fusion and PR #6736 added --disable-shared-experts-fusion. It seems "E=264" will no longer be used; either "E=256" or "E=257" will be used depending on "disable_shared_experts_fusion". Should I remove "E=264" here?

@Xu-Wenqing Xu-Wenqing changed the title Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1-0528 Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 Jun 5, 2025
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
Contributor Author

Xu-Wenqing commented Jun 5, 2025

@ch-wan Removed the "E=264" config and updated the benchmark result for "E=257". Since the H20-3e has 141 GB of memory, there's no need to tune tp=16 for the DeepSeek-V3/R1 models.

Contributor Author

@ch-wan @zhyncs Could you please review again? This PR adds DeepSeek-R1/V3 MoE kernel configs for the H20-3e (the H20 GPU with 141 GB of memory).

Xu-Wenqing and others added 2 commits June 9, 2025 17:49
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
@Xu-Wenqing Xu-Wenqing closed this Oct 1, 2025