
ignore the deepgemm check when the model weight with nvfp4 and moe backend is flashinfer cutedsl #12782

Merged
Fridge003 merged 3 commits into sgl-project:main from bytedance-iaas:dev/fp4-flashinfer-cutedsl on Nov 6, 2025

Conversation

@rainj-me (Collaborator) commented Nov 6, 2025


Motivation

Support deploying the DeepSeek R1 FP4 model on GB200 with DeepEP and the flashinfer-cutedsl MoE backend.

Modifications

Relax the DeepGEMM check when the model weights are FP4-quantized (nvfp4) and the MoE runner backend is flashinfer-cutedsl.
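
For context, a simplified sketch of the relaxed gate (names follow the snippet quoted in the review comment below; the assertion message is taken from the failing traceback, and the exact surrounding code may differ):

# Sketch only: skip the DeepGEMM requirement when the flashinfer-cutedsl
# MoE kernels serve an nvfp4 (modelopt_fp4) checkpoint, since that path
# does not go through DeepGEMM.
is_fp4_flashinfer_cutedsl = (
    get_moe_runner_backend().is_flashinfer_cutedsl()
    and self.quant_config.get_name() == "modelopt_fp4"
)
if self.deepep_mode.enable_low_latency() and not _is_npu and not is_fp4_flashinfer_cutedsl:
    assert (
        deep_gemm_wrapper.ENABLE_JIT_DEEPGEMM
    ), f"DeepEP {self.deepep_mode} mode requires deep_gemm"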

Accuracy Tests

For the decode nodes (a two-node deployment, ranks 0 and 1):

Rank 0

NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 NCCL_SOCKET_IFNAME=eth0 NCCL_SOCKET_FAMILY=AF_INET GLOO_SOCKET_IFNAME=eth0 SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 SGLANG_ENABLE_JIT_DEEPGEMM=0 SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 python3 -m sglang.launch_server --model-path /data01/models/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --dist-init-addr 192.168.0.1:21000 --nnodes 2 --node-rank 0 --disaggregation-mode decode --tp-size 8 --ep-size 8 --dp-size 8 --enable-dp-attention --enable-dp-lm-head --device cuda --host 0.0.0.0 --port 28000 --mem-fraction-static 0.7 --attention-backend trtllm_mla --moe-runner-backend flashinfer_cutedsl --moe-a2a-backend deepep --deepep-mode low_latency --quantization modelopt_fp4 --chunked-prefill-size 131072 --page-size 64 --disable-radix-cache --max-running-requests 1536 --prefill-round-robin-balance --disaggregation-transfer-backend nixl --cuda-graph-max-bs 192 --ep-num-redundant-experts 32 --watchdog-timeout 1200

Rank 1

NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 NCCL_SOCKET_IFNAME=eth0 NCCL_SOCKET_FAMILY=AF_INET GLOO_SOCKET_IFNAME=eth0 SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 SGLANG_ENABLE_JIT_DEEPGEMM=0 SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 python3 -m sglang.launch_server --model-path /data01/models/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --dist-init-addr 192.168.0.1:21000 --nnodes 2 --node-rank 1 --disaggregation-mode decode --tp-size 8 --ep-size 8 --dp-size 8 --enable-dp-attention --enable-dp-lm-head --device cuda --host 0.0.0.0 --port 28000 --mem-fraction-static 0.7 --attention-backend trtllm_mla --moe-runner-backend flashinfer_cutedsl --moe-a2a-backend deepep --deepep-mode low_latency --quantization modelopt_fp4 --chunked-prefill-size 131072 --page-size 64 --disable-radix-cache --max-running-requests 1536 --prefill-round-robin-balance --disaggregation-transfer-backend nixl --cuda-graph-max-bs 192 --ep-num-redundant-experts 32 --watchdog-timeout 1200

Without this PR, startup fails with:

[2025-11-06 18:59:07 DP3 TP3 EP3] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2025-11-06 18:59:07 DP3 TP3 EP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  ...
    self.experts = get_moe_impl_class(quant_config)(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/layer.py", line 118, in __init__
    deep_gemm_wrapper.ENABLE_JIT_DEEPGEMM
AssertionError: DeepEP DeepEPMode.LOW_LATENCY mode requires deep_gemm

With this PR, the server starts cleanly:

[2025-11-06 19:04:21] INFO:     Started server process [2015574]
[2025-11-06 19:04:21] INFO:     Waiting for application startup.
[2025-11-06 19:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.95}
[2025-11-06 19:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.95}
[2025-11-06 19:04:21] INFO:     Application startup complete.
[2025-11-06 19:04:21] INFO:     Uvicorn running on http://0.0.0.0:28000 (Press CTRL+C to quit)
[2025-11-06 19:04:22] INFO:     127.0.0.1:55904 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-06 19:04:22] Start of pd disaggregation warmup ...
[2025-11-06 19:04:22] INFO:     127.0.0.1:55908 - "POST /generate HTTP/1.1" 200 OK
[2025-11-06 19:04:22] End of prefill disaggregation mode warmup with status 200, resp .....
[2025-11-06 19:04:22] The server is fired up and ready to roll!
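
Once the server reports ready, a quick smoke test against the native /generate endpoint (the same endpoint exercised by the warmup above) confirms generation works end to end. A minimal sketch, assuming the decode server above on port 28000; the prompt and sampling parameters are illustrative, not part of the PR's test setup:

import requests

# Hypothetical smoke test: send one generation request to the server
# launched above and print the status code and response body.
resp = requests.post(
    "http://0.0.0.0:28000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.6, "max_new_tokens": 32},
    },
)
print(resp.status_code, resp.json())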

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @rainj-me, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial compatibility fix, allowing for the successful deployment of FP4 quantized models, such as DeepSeek R1, on GB200 infrastructure. It specifically addresses an issue where the combination of flashinfer-cutedsl as the MoE backend and DeepEP's low-latency mode would incorrectly enforce a DeepGEMM dependency. By adjusting this check, the PR ensures these advanced configurations can operate as intended, broadening the range of supported model deployments.

Highlights

  • DeepSeek R1 FP4 Model Support: Enables the deployment of DeepSeek R1 FP4 quantized models on GB200 systems when using DeepEP and the flashinfer-cutedsl MoE backend.
  • Conditional DeepGEMM Check Bypass: Modifies the DeepGEMM assertion to be skipped under specific conditions: when modelopt_fp4 quantization is active and flashinfer_cutedsl is used as the MoE runner backend in DeepEP's low-latency mode.
  • Resolves Assertion Error: Fixes an AssertionError that previously occurred, preventing these specific configurations from running due to an unmet deep_gemm requirement.

@gemini-code-assist bot left a comment


Code Review

This pull request disables the DeepGEMM check for a specific configuration involving nvfp4 quantization and the flashinfer_cutedsl MoE backend. This change is necessary to support model deployment on GB200 hardware. The logic seems correct, but I've suggested a minor refactoring to improve the readability of the conditional statement.

Comment on lines +115 to +122
if (
    self.deepep_mode.enable_low_latency()
    and not _is_npu
    and not (
        get_moe_runner_backend().is_flashinfer_cutedsl()
        and self.quant_config.get_name() == "modelopt_fp4"
    )
):

Severity: medium

For better readability and maintainability, consider extracting the new condition into a separate boolean variable. This makes the if statement's intent clearer at a glance.

Suggested change
-if (
-    self.deepep_mode.enable_low_latency()
-    and not _is_npu
-    and not (
-        get_moe_runner_backend().is_flashinfer_cutedsl()
-        and self.quant_config.get_name() == "modelopt_fp4"
-    )
-):
+is_fp4_flashinfer_cutedsl = (
+    get_moe_runner_backend().is_flashinfer_cutedsl()
+    and self.quant_config.get_name() == "modelopt_fp4"
+)
+if (
+    self.deepep_mode.enable_low_latency()
+    and not _is_npu
+    and not is_fp4_flashinfer_cutedsl
+):

@rainj-me added the run-ci label Nov 6, 2025
@rainj-me requested a review from Fridge003 as a code owner November 6, 2025 22:05
@Fridge003 merged commit a119363 into sgl-project:main Nov 6, 2025
61 of 77 checks passed
@HanHan009527 deleted the dev/fp4-flashinfer-cutedsl branch December 16, 2025 16:19