[NVIDIA] Fix use case of SGLANG_ENABLE_FLASHINFER_GEMM #13274
Summary of Changes
Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical accuracy regression that occurred when the
Highlights
Code Review
The pull request correctly addresses the reported accuracy issue when SGLANG_ENABLE_FLASHINFER_GEMM is enabled. By adding not ENABLE_FLASHINFER_GEMM to the condition for applying DEEPGEMM_SCALE_UE8M0, the model now correctly applies the scaling mode only when DeepGEMM is in use, preventing unintended e8m0 scales. The changes are concise and directly resolve the problem statement.
@kaixih Can you please test again and see whether it works?
Motivation
When SGLANG_ENABLE_FLASHINFER_GEMM is enabled, the accuracy test fails (observed by @gracehonv).
The root cause is that the model execution switches to use e8m0 scales for Blackwell GPUs even when DeepGEMM is not in use.
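This PR fixes the issue by applying the DeepGEMM e8m0 scaling mode (DEEPGEMM_SCALE_UE8M0) only when ENABLE_FLASHINFER_GEMM is not set, so that the correct scaling mode is used on the FlashInfer GEMM path.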
Modifications
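The condition change can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual diff: the helper names `env_flag` and `use_ue8m0_scales` are hypothetical, and the way the flags are read from the environment is an assumption. Only the two flag names and the added `not ENABLE_FLASHINFER_GEMM` term come from this PR.

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    """Hypothetical helper: parse a boolean-like environment variable."""
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes", "on")


def use_ue8m0_scales(deepgemm_scale_ue8m0: bool, enable_flashinfer_gemm: bool) -> bool:
    """Hypothetical helper: decide whether e8m0 scales should be applied.

    Before the fix (as described in this PR), the decision effectively
    depended on DEEPGEMM_SCALE_UE8M0 alone. The fix adds the
    `not ENABLE_FLASHINFER_GEMM` term so that e8m0 scales are skipped
    when the FlashInfer GEMM path is selected.
    """
    return deepgemm_scale_ue8m0 and not enable_flashinfer_gemm


if __name__ == "__main__":
    # Assumed wiring: the FlashInfer flag is read from the environment.
    ENABLE_FLASHINFER_GEMM = env_flag("SGLANG_ENABLE_FLASHINFER_GEMM")
    # Assumed value for illustration only (e.g. DeepGEMM on a Blackwell GPU).
    DEEPGEMM_SCALE_UE8M0 = True
    print(use_ue8m0_scales(DEEPGEMM_SCALE_UE8M0, ENABLE_FLASHINFER_GEMM))
```

With SGLANG_ENABLE_FLASHINFER_GEMM set, this sketch returns False regardless of DEEPGEMM_SCALE_UE8M0, which matches the intended behavior described in the Motivation section.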
Accuracy Tests
script:
result:
Benchmarking and Profiling
Checklist