
Conversation

@kaixih
Collaborator

@kaixih kaixih commented Nov 14, 2025

Motivation

When SGLANG_ENABLE_FLASHINFER_GEMM is enabled, the accuracy test fails (reported by @gracehonv).
The root cause is that model execution switches to e8m0 scales on Blackwell GPUs even when DeepGEMM is not in use.
This PR fixes the issue by ensuring the e8m0 scaling mode is applied only when DeepGEMM is actually in use.
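For context, e8m0 is an exponent-only scale format (8 exponent bits, no mantissa), so it can only represent powers of two. Requantizing fp32 block scales into it is lossy, so a backend that is not built around e8m0 scales will see degraded accuracy. A minimal sketch of such a requantization, assuming the common ceil-to-power-of-two rounding (the function name and rounding choice are illustrative, not SGLang's actual implementation):

```python
import math

def requant_scale_ue8m0(scale: float) -> float:
    """Round a positive fp32 dequant scale up to the nearest power of two,
    i.e. a value representable by an exponent-only e8m0 scale.

    Illustrative sketch only; the exact rounding mode used in practice
    is an assumption here.
    """
    assert scale > 0.0
    return 2.0 ** math.ceil(math.log2(scale))
```

For example, a scale of 0.3 becomes 0.5. A kernel that expects the original fp32 scales would thus see each scale inflated by up to roughly 2x, consistent with an accuracy drop when e8m0 scales are fed to a non-DeepGEMM backend.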

Modifications

Accuracy Tests

script:

set -x
model_str=/model/deepseek-ai-DeepSeek-R1-0528/models--deepseek-ai--DeepSeek-R1-0528/snapshots/4236a6af538feda4548eca9ab308586007567f52

if [[ "$1" == "server" ]]; then
  export SGLANG_ENABLE_FLASHINFER_GEMM=true
  #export SGLANG_SUPPORT_CUTLASS_BLOCK_FP8=true

  python3 -m sglang.launch_server \
    --trust-remote-code \
    --disable-radix-cache \
    --max-running-requests 512 \
    --chunked-prefill-size 8192 \
    --mem-fraction-static 0.9 \
    --cuda-graph-max-bs 128 \
    --max-prefill-tokens 8192 \
    --kv-cache-dtype fp8_e4m3 \
    --quantization fp8 \
    --model-path ${model_str} \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size=4 \
    --data-parallel-size=1 \
    --expert-parallel-size=1 \
    --scheduler-recv-interval 10 \
    --stream-interval 10 \
    --tokenizer-path ${model_str} \
    --attention-backend trtllm_mla \
    --moe-runner-backend flashinfer_trtllm \
    --enable-symm-mem

fi

if [[ "$1" == "bench" ]]; then
  python3 /sgl-workspace/sglang/benchmark/gsm8k/bench_sglang.py \
    --num-questions 2000 \
    --parallel 2000 \
    --num-shots 8 \
    --port 8000
fi

if [[ "$1" == "chat" ]]; then
  # Break out of the single-quoted JSON so ${model_str} actually expands.
  curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"${model_str}"'",
      "messages": [{"role": "user", "content": "What is 37 * 42?"}],
      "extra_body": {
        "chat_template_kwargs": {
          "thinking": true
        }
      }
    }'
fi

result:

100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [04:30<00:00,  4.87it/s]
Accuracy: 0.966
Invalid: 0.000
Latency: 271.168 s
Output throughput: 523.439 token/s
+ [[ bench == \c\h\a\t ]]

Benchmarking and Profiling

Checklist

@kaixih
Collaborator Author

kaixih commented Nov 14, 2025

@gracehonv @Fridge003

@gemini-code-assist
Contributor

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical accuracy regression that occurred when the SGLANG_ENABLE_FLASHINFER_GEMM flag was active. The root cause was an unintended application of e8m0 scaling, which is typically associated with DeepGEMM, even when DeepGEMM itself was not engaged. The changes introduce a precise conditional check within the model's weight loading process, ensuring that this specific scaling mode is only activated under the correct circumstances, thereby restoring the expected model accuracy and behavior.

Highlights

  • Accuracy Fix: Resolved an issue where model accuracy degraded when SGLANG_ENABLE_FLASHINFER_GEMM was enabled, caused by incorrect application of e8m0 scaling.
  • Conditional Scaling Logic: Modified the post_load_weights function to ensure that e8m0 scaling is only applied when DeepGEMM is explicitly in use, preventing its activation when FLASHINFER_GEMM is enabled without DeepGEMM.
  • Dependency Import: Added an import for ENABLE_FLASHINFER_GEMM to facilitate the new conditional logic in the deepseek_v2 model implementation.
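The gating described above can be sketched as a pure predicate. This is a hypothetical helper, not the actual post_load_weights code; the parameter names stand in for SGLang's DeepGEMM state and the ENABLE_FLASHINFER_GEMM / DEEPGEMM_SCALE_UE8M0 flags:

```python
def should_apply_ue8m0(
    deepgemm_in_use: bool,
    deepgemm_scale_ue8m0: bool,
    flashinfer_gemm_enabled: bool,
) -> bool:
    """e8m0 requantization is a DeepGEMM-specific requirement: apply it only
    when DeepGEMM is in use, never merely because FlashInfer GEMM is enabled.
    (Illustrative sketch of the condition this PR introduces.)"""
    return deepgemm_in_use and deepgemm_scale_ue8m0 and not flashinfer_gemm_enabled
```

Before the fix, the condition on Blackwell effectively ignored whether DeepGEMM was in use, so enabling FlashInfer GEMM still triggered the e8m0 scale conversion.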

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request correctly addresses the reported accuracy issue when SGLANG_ENABLE_FLASHINFER_GEMM is enabled. By adding not ENABLE_FLASHINFER_GEMM to the condition for applying DEEPGEMM_SCALE_UE8M0, the model now correctly applies the scaling mode only when DeepGEMM is in use, preventing unintended e8m0 scales. The changes are concise and directly resolve the problem statement.

@Fridge003 Fridge003 added the format Auto Format Code label Nov 14, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 14, 2025
@Fridge003
Collaborator

@kaixih Can you please test again and see whether it works?

@kaixih
Collaborator Author

kaixih commented Nov 14, 2025

100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [04:11<00:00,  5.25it/s]
Accuracy: 0.964
Invalid: 0.000
Latency: 251.736 s
Output throughput: 562.976 token/s

@Fridge003 Fridge003 merged commit 5ae0ac4 into sgl-project:main Nov 14, 2025
46 of 78 checks passed

Labels

deepseek documentation Improvements or additions to documentation format Auto Format Code run-ci
