
Conversation

@kaixih
Collaborator

@kaixih kaixih commented Nov 14, 2025

Motivation

When SGLANG_ENABLE_FLASHINFER_GEMM is enabled, the accuracy test fails (reported by @gracehonv).
The root cause is that model execution switches to e8m0 scales on Blackwell GPUs even when DeepGEMM is not in use.
This PR fixes the issue by ensuring the e8m0 scaling mode is applied only when DeepGEMM is actually in use.
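For context, e8m0 is an exponent-only scale format (8 exponent bits, no mantissa), so it can only represent powers of two. Requantizing fp32 block scales into it is lossy, so a backend that is not built around e8m0 scales will see degraded accuracy. A minimal sketch of such a requantization, assuming the common ceil-to-power-of-two rounding (the function name and rounding choice are illustrative, not SGLang's actual implementation):

```python
import math

def requant_scale_ue8m0(scale: float) -> float:
    """Round a positive fp32 dequant scale up to the nearest power of two,
    i.e. a value representable by an exponent-only e8m0 scale.

    Illustrative sketch only; the exact rounding mode used in practice
    is an assumption here.
    """
    assert scale > 0.0
    return 2.0 ** math.ceil(math.log2(scale))
```

For example, a scale of 0.3 becomes 0.5. A kernel that expects the original fp32 scales would thus see each scale inflated by up to roughly 2x, consistent with an accuracy drop when e8m0 scales are fed to a non-DeepGEMM backend.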

Modifications

Accuracy Tests

script:

set -x
model_str=/model/deepseek-ai-DeepSeek-R1-0528/models--deepseek-ai--DeepSeek-R1-0528/snapshots/4236a6af538feda4548eca9ab308586007567f52

if [[ "$1" == "server" ]]; then
  export SGLANG_ENABLE_FLASHINFER_GEMM=true
  #export SGLANG_SUPPORT_CUTLASS_BLOCK_FP8=true

  python3 -m sglang.launch_server \
    --trust-remote-code \
    --disable-radix-cache \
    --max-running-requests 512 \
    --chunked-prefill-size 8192 \
    --mem-fraction-static 0.9 \
    --cuda-graph-max-bs 128 \
    --max-prefill-tokens 8192 \
    --kv-cache-dtype fp8_e4m3 \
    --quantization fp8 \
    --model-path ${model_str} \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size=4 \
    --data-parallel-size=1 \
    --expert-parallel-size=1 \
    --scheduler-recv-interval 10 \
    --stream-interval 10 \
    --tokenizer-path ${model_str} \
    --attention-backend trtllm_mla \
    --moe-runner-backend flashinfer_trtllm \
    --enable-symm-mem

fi

if [[ "$1" == "bench" ]]; then
  python3 /sgl-workspace/sglang/benchmark/gsm8k/bench_sglang.py \
    --num-questions 2000 \
    --parallel 2000 \
    --num-shots 8 \
    --port 8000
fi

if [[ "$1" == "chat" ]]; then
  # Break out of the single-quoted JSON so ${model_str} actually expands.
  curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "'"${model_str}"'",
      "messages": [{"role": "user", "content": "What is 37 * 42?"}],
      "extra_body": {
        "chat_template_kwargs": {
          "thinking": true
        }
      }
    }'
fi

result:

100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [04:30<00:00,  4.87it/s]
Accuracy: 0.966
Invalid: 0.000
Latency: 271.168 s
Output throughput: 523.439 token/s
+ [[ bench == \c\h\a\t ]]

Benchmarking and Profiling

Checklist

@kaixih
Collaborator Author

kaixih commented Nov 14, 2025

@gracehonv @Fridge003

@gemini-code-assist
Contributor

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical accuracy regression that occurred when the SGLANG_ENABLE_FLASHINFER_GEMM flag was active. The root cause was an unintended application of e8m0 scaling, which is typically associated with DeepGEMM, even when DeepGEMM itself was not engaged. The changes introduce a precise conditional check within the model's weight loading process, ensuring that this specific scaling mode is only activated under the correct circumstances, thereby restoring the expected model accuracy and behavior.

Highlights

  • Accuracy Fix: Resolved an issue where model accuracy degraded when SGLANG_ENABLE_FLASHINFER_GEMM was enabled, caused by incorrect application of e8m0 scaling.
  • Conditional Scaling Logic: Modified the post_load_weights function to ensure that e8m0 scaling is only applied when DeepGEMM is explicitly in use, preventing its activation when FLASHINFER_GEMM is enabled without DeepGEMM.
  • Dependency Import: Added an import for ENABLE_FLASHINFER_GEMM to facilitate the new conditional logic in the deepseek_v2 model implementation.
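The gating described above can be sketched as a pure predicate. This is a hypothetical helper, not the actual post_load_weights code; the parameter names stand in for SGLang's DeepGEMM state and the ENABLE_FLASHINFER_GEMM / DEEPGEMM_SCALE_UE8M0 flags:

```python
def should_apply_ue8m0(
    deepgemm_in_use: bool,
    deepgemm_scale_ue8m0: bool,
    flashinfer_gemm_enabled: bool,
) -> bool:
    """e8m0 requantization is a DeepGEMM-specific requirement: apply it only
    when DeepGEMM is in use, never merely because FlashInfer GEMM is enabled.
    (Illustrative sketch of the condition this PR introduces.)"""
    return deepgemm_in_use and deepgemm_scale_ue8m0 and not flashinfer_gemm_enabled
```

Before the fix, the condition on Blackwell effectively ignored whether DeepGEMM was in use, so enabling FlashInfer GEMM still triggered the e8m0 scale conversion.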

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request correctly addresses the reported accuracy issue when SGLANG_ENABLE_FLASHINFER_GEMM is enabled. By adding not ENABLE_FLASHINFER_GEMM to the condition for applying DEEPGEMM_SCALE_UE8M0, the model now correctly applies the scaling mode only when DeepGEMM is in use, preventing unintended e8m0 scales. The changes are concise and directly resolve the problem statement.

@Fridge003 Fridge003 added the format Auto Format Code label Nov 14, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 14, 2025
@Fridge003
Collaborator

@kaixih Can you please test again and see whether it works?

@kaixih
Collaborator Author

kaixih commented Nov 14, 2025

100%|███████████████████████████████████████████████████████████████████████████| 1319/1319 [04:11<00:00,  5.25it/s]
Accuracy: 0.964
Invalid: 0.000
Latency: 251.736 s
Output throughput: 562.976 token/s

@Fridge003 Fridge003 merged commit 5ae0ac4 into sgl-project:main Nov 14, 2025
46 of 78 checks passed

Labels

deepseek documentation Improvements or additions to documentation format Auto Format Code run-ci
