[NVIDIA] Change default quant method for model_opt #11991
Fridge003 merged 7 commits into sgl-project:main
Conversation
Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a compatibility issue that prevented certain pre-trained models, specifically NVIDIA's Eagle heads, from loading properly due to an incorrect assumption about their default quantization method. By no longer implicitly defaulting to FP8 quantization when no method is explicitly defined, it ensures broader compatibility and smoother integration with a wider range of models.
Code Review
This pull request correctly changes the default quantization method for ModelOpt checkpoints. Previously, it defaulted to fp8, which caused issues with models that don't specify a quantization method. By changing the default to None, the code now correctly handles such cases, preventing incorrect quantization assumptions. The change is logical and well-motivated. I've added one minor suggestion regarding type hinting to improve code clarity and maintainability.
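The suggested type hints themselves are not shown in this excerpt; a hypothetical sketch of what such annotations could look like (the class name and parameter types are assumptions, not the reviewer's actual suggestion):

```python
from typing import Optional


class ModelOptFp8Config:
    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg: Optional[dict], user_quant: Optional[str]
    ) -> Optional[str]:
        # Hypothetical annotations only; hf_quant_cfg stands for the parsed
        # hf_quant_config.json contents, if any.
        ...
```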
I notice the original code mentions …
Force-pushed from b89d93f to 705aa00.
@kaixih Could you please check, with your changes, whether we can still run a ModelOpt FP8 ckpt with …
Force-pushed from 7c411c4 to cf7a52b.
@Edwardf0t1, it’s fixed now. I tried your command and it works. Using …

In short, for ModelOpt FP8 checkpoints, users can pass either … For non-quantized ModelOpt checkpoints, users should just use …
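A hedged sketch of the two cases just described (the `modelopt` / `modelopt_fp8` strings come from the diff in this PR; the standalone helper is illustrative, not the actual implementation):

```python
from typing import Optional


def resolve_quant_method(
    hf_quant_cfg: Optional[dict], user_quant: Optional[str]
) -> Optional[str]:
    """Illustrative only: how FP8 vs. non-quantized checkpoints resolve."""
    ckpt_method = (hf_quant_cfg or {}).get("quant_method")
    if ckpt_method == "modelopt_fp8":
        # FP8 checkpoint: either user flag lands on the FP8 path.
        if user_quant in ("modelopt", "modelopt_fp8"):
            return "modelopt_fp8"
        return ckpt_method
    # No quant_method declared: nothing is assumed, load unquantized.
    return None


assert resolve_quant_method({"quant_method": "modelopt_fp8"}, "modelopt") == "modelopt_fp8"
assert resolve_quant_method({}, None) is None
```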
Thanks for the test. By "non-quantized ModelOpt checkpoints", do you mean the original model? Users can specify …
```python
def override_quantization_method(cls, hf_quant_cfg, user_quant) -> Optional[str]:
    if (
        user_quant == "modelopt"
        and hf_quant_cfg.get("quant_method", "") == "modelopt_fp8"
    ):
        ...  # rest of the diff not shown in this excerpt
```
We already have an override_quantization_method function at L158.
I mean these ModelOpt-released Eagle heads: https://huggingface.co/nvidia/Qwen3-235B-A22B-Eagle3/blob/main/hf_quant_config.json. They are bf16 with no quantization.
Edwardf0t1 left a comment:

LGTM, left a suggestion.
Motivation
While working on Qwen + Eagle, I found it’s not compatible with NVIDIA’s released Eagle heads (nvidia/Qwen3-235B-A22B-Eagle3). The root cause is that we default quant_method to fp8 for ModelOpt checkpoints, but this assumption doesn’t always hold: this model, for example, doesn’t specify any quantization method.
Modifications
This PR updates the logic to set the default quant_method to None when it is not explicitly defined in hf_quant_config.json.
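For illustration, a minimal sketch of the new default behavior (the quant_method lookup follows the diff snippet quoted above; the standalone helper and its name are assumptions, not the actual code):

```python
from typing import Optional


def get_default_quant_method(hf_quant_cfg: Optional[dict]) -> Optional[str]:
    """Sketch only: default quant_method resolution after this PR."""
    if not hf_quant_cfg:
        return None
    # Before: a missing "quant_method" key effectively fell back to "fp8".
    # After: a checkpoint that declares no quant_method (e.g. the bf16
    # Eagle heads) yields None and is loaded without quantization.
    return hf_quant_cfg.get("quant_method")


assert get_default_quant_method({"quant_method": "modelopt_fp8"}) == "modelopt_fp8"
assert get_default_quant_method({}) is None
```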
Accuracy Tests
```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-235B-A22B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Qwen3-235B-A22B-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 32 \
  --mem-fraction-static 0.75 \
  --tp 8 \
  --ep 8 \
  --context-length 8192 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000 \
  --dtype bfloat16
```

Benchmarking and Profiling
NA
Checklist