[NVIDIA] Change default quant method for model_opt #11991

Merged
Fridge003 merged 7 commits into sgl-project:main from kaixih:fix_default_modelopt_quant on Oct 26, 2025

Conversation

@kaixih (Collaborator) commented Oct 23, 2025

Motivation

While working on Qwen + Eagle, I found it’s not compatible with NVIDIA’s released Eagle heads (nvidia/Qwen3-235B-A22B-Eagle3). The root cause is that we default quant_method to fp8 for ModelOpt checkpoints, but this assumption doesn’t always hold; for example, this model doesn’t specify any quantization method.

Modifications

This PR updates the logic to set the default quant_method to None when it is not explicitly defined in hf_quant_config.json.
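
For context, here is a minimal sketch of the new default behavior. The helper name detect_modelopt_quant_method and the exact hf_quant_config.json layout are assumptions for illustration, not the actual SGLang implementation:

import json
import os
from typing import Optional

def detect_modelopt_quant_method(checkpoint_dir: str) -> Optional[str]:
    """Hypothetical helper illustrating the changed default.

    Previously a checkpoint shipping hf_quant_config.json without a
    quant_algo was assumed to be FP8; with this change the method
    defaults to None instead.
    """
    cfg_path = os.path.join(checkpoint_dir, "hf_quant_config.json")
    if not os.path.isfile(cfg_path):
        return None
    with open(cfg_path) as f:
        hf_quant_cfg = json.load(f)
    quant_algo = hf_quant_cfg.get("quantization", {}).get("quant_algo")
    # New behavior: do not fall back to "fp8" when nothing is specified.
    return quant_algo.lower() if quant_algo else None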

Accuracy Tests

python3 -m sglang.launch_server \
    --model Qwen/Qwen3-235B-A22B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path nvidia/Qwen3-235B-A22B-Eagle3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32 \
    --mem-fraction-static 0.75 \
    --tp 8 \
    --ep 8 \
    --context-length 8192 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
+ python benchmark/gsm8k/bench_sglang.py --port 30000 --num-questions 10000 --parallel 10000 --num-shots 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [03:24<00:00,  6.44it/s]
Accuracy: 0.951
Invalid: 0.000
Latency: 204.845 s

Benchmarking and Profiling

NA

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical compatibility issue preventing the proper loading of certain pre-trained models, specifically NVIDIA's Eagle heads, due to an incorrect assumption about their default quantization method. By modifying the system to no longer implicitly default to FP8 quantization when no method is explicitly defined, it ensures broader compatibility and smoother integration with a wider range of models.

Highlights

  • Quantization Method Default: The default quantization method for ModelOpt checkpoints has been updated. Previously, if no method was specified, it would default to fp8. This has been changed to None to prevent compatibility issues with models that do not explicitly define a quantization method, such as NVIDIA's Eagle heads.

@kaixih changed the title from "Change default quant method for model_opt" to "[NVIDIA] Change default quant method for model_opt" on Oct 23, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request correctly changes the default quantization method for ModelOpt checkpoints. Previously, it defaulted to fp8, which caused issues with models that don't specify a quantization method. By changing the default to None, the code now correctly handles such cases, preventing incorrect quantization assumptions. The change is logical and well-motivated. I've added one minor suggestion regarding type hinting to improve code clarity and maintainability.

@kaixih (Collaborator, Author) commented Oct 23, 2025

I notice the original code mentions "Default to FP8 for backward compatibility", which suggests there might be model checkpoints released by ModelOpt that use FP8 quantization but don't specify it in the quant_algo field. If that's the case, we should work around it explicitly. @Fridge003 can you help confirm? If we know the model name, I can help add some special rules for it.

@kaixih force-pushed the fix_default_modelopt_quant branch from b89d93f to 705aa00 on October 23, 2025 02:02
@Edwardf0t1 (Collaborator) commented:

@kaixih Could you please check whether, with your changes, we can still run a ModelOpt FP8 checkpoint with --quantization modelopt? For example,

python -m sglang.launch_server \
  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
  --trust-remote-code \
  --quantization modelopt

@kaixih force-pushed the fix_default_modelopt_quant branch from 7c411c4 to cf7a52b on October 24, 2025 05:36
@kaixih (Collaborator, Author) commented Oct 24, 2025

@Edwardf0t1, it’s fixed now. I tried your command and it works. Using --quantization modelopt_fp8 also works.

In short, for ModelOpt FP8 checkpoints, users can pass either modelopt or modelopt_fp8. If modelopt is used, the system will read hf_quant_config.json (looking for "quant_algo": "FP8") and automatically override it to modelopt_fp8.

For non-quantized ModelOpt checkpoints, users should just use --quantization modelopt.
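
To make the described routing concrete, here is a minimal sketch of that override behavior. The function name resolve_modelopt_quant and the hf_quant_config.json layout are assumptions for illustration; this is not the actual SGLang override_quantization_method:

from typing import Optional

def resolve_modelopt_quant(hf_quant_cfg: Optional[dict], user_quant: Optional[str]) -> Optional[str]:
    """Hypothetical sketch of the described routing, not the SGLang code.

    --quantization modelopt plus a config declaring "quant_algo": "FP8"
    resolves to modelopt_fp8; a config with no quant_algo keeps the
    user's choice (an unquantized ModelOpt checkpoint).
    """
    if user_quant != "modelopt" or not hf_quant_cfg:
        return user_quant
    quant_algo = hf_quant_cfg.get("quantization", {}).get("quant_algo")
    return "modelopt_fp8" if quant_algo == "FP8" else user_quant

# FP8 checkpoint config resolves to modelopt_fp8; an Eagle-head config
# with no quant_algo stays at modelopt.
print(resolve_modelopt_quant({"quantization": {"quant_algo": "FP8"}}, "modelopt"))
print(resolve_modelopt_quant({"quantization": {}}, "modelopt"))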

@Edwardf0t1 (Collaborator) commented, quoting the reply above:

Thanks for the test. By "non-quantized ModelOpt checkpoints", do you mean the original model? Users can specify modelopt_fp8 or modelopt_fp4 to quantize the original BF16 model.

A collaborator left a review comment on this diff hunk:

def override_quantization_method(cls, hf_quant_cfg, user_quant) -> Optional[str]:
    if (
        user_quant == "modelopt"
        and hf_quant_cfg.get("quant_method", "") == "modelopt_fp8"

We already have an override_quantization_method function in L158.

@kaixih (Collaborator, Author) replied:

Fixed. PTAL.

@kaixih (Collaborator, Author) commented Oct 24, 2025

I mean the ModelOpt-released Eagle heads, e.g. https://huggingface.co/nvidia/Qwen3-235B-A22B-Eagle3/blob/main/hf_quant_config.json. They are BF16 with no quantization.

@Edwardf0t1 (Collaborator) left a review:

LGTM, left a suggestion.

@Fridge003 merged commit ff60406 into sgl-project:main on Oct 26, 2025
68 of 71 checks passed