Ignore the DeepGEMM check when the model weight is nvfp4 and the MoE backend is flashinfer cutedsl #12782

Conversation
Summary of Changes

Hello @rainj-me, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a compatibility fix that allows FP4-quantized models, such as DeepSeek R1, to be deployed successfully on GB200 infrastructure. It specifically addresses an issue where the combination of nvfp4-quantized weights and the flashinfer-cutedsl MoE backend incorrectly triggered the DeepGEMM check.
Code Review
This pull request disables the DeepGEMM check for a specific configuration involving nvfp4 quantization and the flashinfer_cutedsl MoE backend. This change is necessary to support model deployment on GB200 hardware. The logic seems correct, but I've suggested a minor refactoring to improve the readability of the conditional statement.
```python
if (
    self.deepep_mode.enable_low_latency()
    and not _is_npu
    and not (
        get_moe_runner_backend().is_flashinfer_cutedsl()
        and self.quant_config.get_name() == "modelopt_fp4"
    )
):
```
For better readability and maintainability, consider extracting the new condition into a separate boolean variable. This makes the if statement's intent clearer at a glance.
```diff
-if (
-    self.deepep_mode.enable_low_latency()
-    and not _is_npu
-    and not (
-        get_moe_runner_backend().is_flashinfer_cutedsl()
-        and self.quant_config.get_name() == "modelopt_fp4"
-    )
-):
+is_fp4_flashinfer_cutedsl = (
+    get_moe_runner_backend().is_flashinfer_cutedsl()
+    and self.quant_config.get_name() == "modelopt_fp4"
+)
+if (
+    self.deepep_mode.enable_low_latency()
+    and not _is_npu
+    and not is_fp4_flashinfer_cutedsl
+):
```
Ignore the DeepGEMM check when the model weight is nvfp4 and the MoE backend is flashinfer cutedsl
Motivation
Support deploying the DeepSeek R1 FP4 model to GB200 with the DeepEP and flashinfer-cutedsl MoE backends.
Modifications
Skip the DeepGEMM check when the model weights are FP4-quantized and the MoE backend is flashinfer-cutedsl (see the sketch below).
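For illustration only, here is a minimal, self-contained sketch of the relaxed condition. The function `needs_deepgemm_check` and its boolean parameters are hypothetical stand-ins for the objects used in the actual diff above (`deepep_mode`, `_is_npu`, `get_moe_runner_backend()`, `quant_config`); this is not the real implementation.

```python
# Sketch of the relaxed DeepGEMM guard, assuming the semantics shown in the
# review diff above. All names below are illustrative stand-ins.

def needs_deepgemm_check(
    low_latency_enabled: bool,      # self.deepep_mode.enable_low_latency()
    is_npu: bool,                   # _is_npu
    moe_backend_is_cutedsl: bool,   # get_moe_runner_backend().is_flashinfer_cutedsl()
    quant_name: str,                # self.quant_config.get_name()
) -> bool:
    """Return True when the DeepGEMM availability check should still run."""
    # FP4 (modelopt_fp4) weights served through the flashinfer-cutedsl MoE
    # runner do not go through DeepGEMM, so the check is skipped for that
    # combination.
    is_fp4_flashinfer_cutedsl = (
        moe_backend_is_cutedsl and quant_name == "modelopt_fp4"
    )
    return low_latency_enabled and not is_npu and not is_fp4_flashinfer_cutedsl


# DeepEP low-latency on GPU with FP4 weights and the cutedsl backend no longer
# triggers the DeepGEMM check; other configurations behave as before.
assert needs_deepgemm_check(True, False, True, "modelopt_fp4") is False
assert needs_deepgemm_check(True, False, False, "fp8") is True
```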
Accuracy Tests
For the decode node:
Rank 0
Rank 1
Without this PR, the logs are
With this PR, the logs are
Benchmarking and Profiling
Checklist