[NVIDIA] Change default quant method for model_opt #11991

Merged
Fridge003 merged 7 commits into sgl-project:main from kaixih:fix_default_modelopt_quant on Oct 26, 2025

Conversation

@kaixih (Collaborator) commented Oct 23, 2025

Motivation

While working on Qwen + Eagle, I found it’s not compatible with NVIDIA’s released Eagle heads (nvidia/Qwen3-235B-A22B-Eagle3). The root cause is that we default quant_method to fp8 for ModelOpt checkpoints, but this assumption doesn’t always hold; for example, this model doesn’t specify any quantization method.

Modifications

This PR updates the logic to set the default quant_method to None when it is not explicitly defined in hf_quant_config.json.
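
For context, here is a minimal sketch of the new default behavior. The helper name detect_modelopt_quant_method and the exact hf_quant_config.json layout are assumptions for illustration, not the actual SGLang implementation:

import json
import os
from typing import Optional

def detect_modelopt_quant_method(checkpoint_dir: str) -> Optional[str]:
    """Hypothetical helper illustrating the changed default.

    Previously a checkpoint shipping hf_quant_config.json without a
    quant_algo was assumed to be FP8; with this change the method
    defaults to None instead.
    """
    cfg_path = os.path.join(checkpoint_dir, "hf_quant_config.json")
    if not os.path.isfile(cfg_path):
        return None
    with open(cfg_path) as f:
        hf_quant_cfg = json.load(f)
    quant_algo = hf_quant_cfg.get("quantization", {}).get("quant_algo")
    # New behavior: do not fall back to "fp8" when nothing is specified.
    return quant_algo.lower() if quant_algo else None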

Accuracy Tests

python3 -m sglang.launch_server \
    --model Qwen/Qwen3-235B-A22B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path nvidia/Qwen3-235B-A22B-Eagle3 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32 \
    --mem-fraction-static 0.75 \
    --tp 8 \
    --ep 8 \
    --context-length 8192 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
+ python benchmark/gsm8k/bench_sglang.py --port 30000 --num-questions 10000 --parallel 10000 --num-shots 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [03:24<00:00,  6.44it/s]
Accuracy: 0.951
Invalid: 0.000
Latency: 204.845 s

Benchmarking and Profiling

NA

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical compatibility issue preventing the proper loading of certain pre-trained models, specifically NVIDIA's Eagle heads, due to an incorrect assumption about their default quantization method. By modifying the system to no longer implicitly default to FP8 quantization when no method is explicitly defined, it ensures broader compatibility and smoother integration with a wider range of models.

Highlights

  • Quantization Method Default: The default quantization method for ModelOpt checkpoints has been updated. Previously, if no method was specified, it would default to fp8. This has been changed to None to prevent compatibility issues with models that do not explicitly define a quantization method, such as NVIDIA's Eagle heads.

@kaixih changed the title from "Change default quant method for model_opt" to "[NVIDIA] Change default quant method for model_opt" on Oct 23, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request correctly changes the default quantization method for ModelOpt checkpoints. Previously, it defaulted to fp8, which caused issues with models that don't specify a quantization method. By changing the default to None, the code now correctly handles such cases, preventing incorrect quantization assumptions. The change is logical and well-motivated. I've added one minor suggestion regarding type hinting to improve code clarity and maintainability.

@kaixih (Collaborator, Author) commented Oct 23, 2025

I notice the original code mentions "Default to FP8 for backward compatibility", which suggests there might be model checkpoints released by ModelOpt that use FP8 quantization but don't specify it in the quant_algo field. If that's the case, we should work around it explicitly. @Fridge003 can you help confirm? If we know the model name, I can help add some special rules for it.

@kaixih force-pushed the fix_default_modelopt_quant branch from b89d93f to 705aa00 on October 23, 2025 02:02
@Edwardf0t1 (Collaborator) commented:

@kaixih Could you please check whether, with your changes, we can still run a ModelOpt FP8 checkpoint with --quantization modelopt? For example,

python -m sglang.launch_server \
  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
  --trust-remote-code \
  --quantization modelopt

@kaixih force-pushed the fix_default_modelopt_quant branch from 7c411c4 to cf7a52b on October 24, 2025 05:36
@kaixih (Collaborator, Author) commented Oct 24, 2025

@Edwardf0t1, it’s fixed now. I tried your command and it works. Using --quantization modelopt_fp8 also works.

In short, for ModelOpt FP8 checkpoints, users can pass either modelopt or modelopt_fp8. If modelopt is used, the system will read hf_quant_config.json (looking for "quant_algo": "FP8") and automatically override it to modelopt_fp8.

For non-quantized ModelOpt checkpoints, users should just use --quantization modelopt.
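
To make the described routing concrete, here is a minimal sketch of that override behavior. The function name resolve_modelopt_quant and the hf_quant_config.json layout are assumptions for illustration; this is not the actual SGLang override_quantization_method:

from typing import Optional

def resolve_modelopt_quant(hf_quant_cfg: Optional[dict], user_quant: Optional[str]) -> Optional[str]:
    """Hypothetical sketch of the described routing, not the SGLang code.

    --quantization modelopt plus a config declaring "quant_algo": "FP8"
    resolves to modelopt_fp8; a config with no quant_algo keeps the
    user's choice (an unquantized ModelOpt checkpoint).
    """
    if user_quant != "modelopt" or not hf_quant_cfg:
        return user_quant
    quant_algo = hf_quant_cfg.get("quantization", {}).get("quant_algo")
    return "modelopt_fp8" if quant_algo == "FP8" else user_quant

# FP8 checkpoint config resolves to modelopt_fp8; an Eagle-head config
# with no quant_algo stays at modelopt.
print(resolve_modelopt_quant({"quantization": {"quant_algo": "FP8"}}, "modelopt"))
print(resolve_modelopt_quant({"quantization": {}}, "modelopt"))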

@Edwardf0t1 (Collaborator) commented, quoting the reply above:

Thanks for the test. By "non-quantized ModelOpt checkpoints", do you mean the original model? Users can specify modelopt_fp8 or modelopt_fp4 to quantize the original BF16 model.

A collaborator left a review comment on this diff hunk:

def override_quantization_method(cls, hf_quant_cfg, user_quant) -> Optional[str]:
    if (
        user_quant == "modelopt"
        and hf_quant_cfg.get("quant_method", "") == "modelopt_fp8"

We already have an override_quantization_method function in L158.

@kaixih (Collaborator, Author) replied:

Fixed. PTAL.

@kaixih (Collaborator, Author) commented Oct 24, 2025

I mean the ModelOpt-released Eagle heads, e.g. https://huggingface.co/nvidia/Qwen3-235B-A22B-Eagle3/blob/main/hf_quant_config.json. They are BF16 with no quantization.

@Edwardf0t1 (Collaborator) left a review:

LGTM, left a suggestion.

@Fridge003 merged commit ff60406 into sgl-project:main on Oct 26, 2025
68 of 71 checks passed