
ignore the deepgemm check when the model weight with nvfp4 and moe backend is flashinfer cutedsl #12782

Merged
Fridge003 merged 3 commits into sgl-project:main from bytedance-iaas:dev/fp4-flashinfer-cutedsl on Nov 6, 2025

Conversation

@rainj-me (Collaborator) commented Nov 6, 2025


Motivation

Support deploying the DeepSeek R1 FP4 model on GB200 with DeepEP and the flashinfer-cutedsl MoE backend.

Modifications

Relax the DeepGEMM check when the model weights are FP4-quantized (nvfp4) and the MoE runner backend is flashinfer-cutedsl.
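
For context, a simplified sketch of the relaxed gate (names follow the snippet quoted in the review comment below; the assertion message is taken from the failing traceback, and the exact surrounding code may differ):

# Sketch only: skip the DeepGEMM requirement when the flashinfer-cutedsl
# MoE kernels serve an nvfp4 (modelopt_fp4) checkpoint, since that path
# does not go through DeepGEMM.
is_fp4_flashinfer_cutedsl = (
    get_moe_runner_backend().is_flashinfer_cutedsl()
    and self.quant_config.get_name() == "modelopt_fp4"
)
if self.deepep_mode.enable_low_latency() and not _is_npu and not is_fp4_flashinfer_cutedsl:
    assert (
        deep_gemm_wrapper.ENABLE_JIT_DEEPGEMM
    ), f"DeepEP {self.deepep_mode} mode requires deep_gemm"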

Accuracy Tests

For the decode nodes (a two-node deployment, ranks 0 and 1):

Rank 0

NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 NCCL_SOCKET_IFNAME=eth0 NCCL_SOCKET_FAMILY=AF_INET GLOO_SOCKET_IFNAME=eth0 SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 SGLANG_ENABLE_JIT_DEEPGEMM=0 SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 python3 -m sglang.launch_server --model-path /data01/models/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --dist-init-addr 192.168.0.1:21000 --nnodes 2 --node-rank 0 --disaggregation-mode decode --tp-size 8 --ep-size 8 --dp-size 8 --enable-dp-attention --enable-dp-lm-head --device cuda --host 0.0.0.0 --port 28000 --mem-fraction-static 0.7 --attention-backend trtllm_mla --moe-runner-backend flashinfer_cutedsl --moe-a2a-backend deepep --deepep-mode low_latency --quantization modelopt_fp4 --chunked-prefill-size 131072 --page-size 64 --disable-radix-cache --max-running-requests 1536 --prefill-round-robin-balance --disaggregation-transfer-backend nixl --cuda-graph-max-bs 192 --ep-num-redundant-experts 32 --watchdog-timeout 1200

Rank 1

NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 NCCL_SOCKET_IFNAME=eth0 NCCL_SOCKET_FAMILY=AF_INET GLOO_SOCKET_IFNAME=eth0 SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 SGLANG_ENABLE_JIT_DEEPGEMM=0 SGLANG_DEEPEP_BF16_DISPATCH=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 python3 -m sglang.launch_server --model-path /data01/models/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --dist-init-addr 192.168.0.1:21000 --nnodes 2 --node-rank 1 --disaggregation-mode decode --tp-size 8 --ep-size 8 --dp-size 8 --enable-dp-attention --enable-dp-lm-head --device cuda --host 0.0.0.0 --port 28000 --mem-fraction-static 0.7 --attention-backend trtllm_mla --moe-runner-backend flashinfer_cutedsl --moe-a2a-backend deepep --deepep-mode low_latency --quantization modelopt_fp4 --chunked-prefill-size 131072 --page-size 64 --disable-radix-cache --max-running-requests 1536 --prefill-round-robin-balance --disaggregation-transfer-backend nixl --cuda-graph-max-bs 192 --ep-num-redundant-experts 32 --watchdog-timeout 1200

Without this PR, startup fails with:

[2025-11-06 18:59:07 DP3 TP3 EP3] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2025-11-06 18:59:07 DP3 TP3 EP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  ...
    self.experts = get_moe_impl_class(quant_config)(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/layer.py", line 118, in __init__
    deep_gemm_wrapper.ENABLE_JIT_DEEPGEMM
AssertionError: DeepEP DeepEPMode.LOW_LATENCY mode requires deep_gemm

With this PR, the server starts cleanly:

[2025-11-06 19:04:21] INFO:     Started server process [2015574]
[2025-11-06 19:04:21] INFO:     Waiting for application startup.
[2025-11-06 19:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.95}
[2025-11-06 19:04:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.95}
[2025-11-06 19:04:21] INFO:     Application startup complete.
[2025-11-06 19:04:21] INFO:     Uvicorn running on http://0.0.0.0:28000 (Press CTRL+C to quit)
[2025-11-06 19:04:22] INFO:     127.0.0.1:55904 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-06 19:04:22] Start of pd disaggregation warmup ...
[2025-11-06 19:04:22] INFO:     127.0.0.1:55908 - "POST /generate HTTP/1.1" 200 OK
[2025-11-06 19:04:22] End of prefill disaggregation mode warmup with status 200, resp .....
[2025-11-06 19:04:22] The server is fired up and ready to roll!
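
Once the server reports ready, a quick smoke test against the native /generate endpoint (the same endpoint exercised by the warmup above) confirms generation works end to end. A minimal sketch, assuming the decode server above on port 28000; the prompt and sampling parameters are illustrative, not part of the PR's test setup:

import requests

# Hypothetical smoke test: send one generation request to the server
# launched above and print the status code and response body.
resp = requests.post(
    "http://0.0.0.0:28000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.6, "max_new_tokens": 32},
    },
)
print(resp.status_code, resp.json())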

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @rainj-me, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial compatibility fix, allowing for the successful deployment of FP4 quantized models, such as DeepSeek R1, on GB200 infrastructure. It specifically addresses an issue where the combination of flashinfer-cutedsl as the MoE backend and DeepEP's low-latency mode would incorrectly enforce a DeepGEMM dependency. By adjusting this check, the PR ensures these advanced configurations can operate as intended, broadening the range of supported model deployments.

Highlights

  • DeepSeek R1 FP4 Model Support: Enables the deployment of DeepSeek R1 FP4 quantized models on GB200 systems when using DeepEP and the flashinfer-cutedsl MoE backend.
  • Conditional DeepGEMM Check Bypass: Modifies the DeepGEMM assertion to be skipped under specific conditions: when modelopt_fp4 quantization is active and flashinfer_cutedsl is used as the MoE runner backend in DeepEP's low-latency mode.
  • Resolves Assertion Error: Fixes an AssertionError that previously occurred, preventing these specific configurations from running due to an unmet deep_gemm requirement.

@gemini-code-assist bot left a comment


Code Review

This pull request disables the DeepGEMM check for a specific configuration involving nvfp4 quantization and the flashinfer_cutedsl MoE backend. This change is necessary to support model deployment on GB200 hardware. The logic seems correct, but I've suggested a minor refactoring to improve the readability of the conditional statement.

Comment on lines +115 to +122
if (
    self.deepep_mode.enable_low_latency()
    and not _is_npu
    and not (
        get_moe_runner_backend().is_flashinfer_cutedsl()
        and self.quant_config.get_name() == "modelopt_fp4"
    )
):

Severity: medium

For better readability and maintainability, consider extracting the new condition into a separate boolean variable. This makes the if statement's intent clearer at a glance.

Suggested change
-if (
-    self.deepep_mode.enable_low_latency()
-    and not _is_npu
-    and not (
-        get_moe_runner_backend().is_flashinfer_cutedsl()
-        and self.quant_config.get_name() == "modelopt_fp4"
-    )
-):
+is_fp4_flashinfer_cutedsl = (
+    get_moe_runner_backend().is_flashinfer_cutedsl()
+    and self.quant_config.get_name() == "modelopt_fp4"
+)
+if (
+    self.deepep_mode.enable_low_latency()
+    and not _is_npu
+    and not is_fp4_flashinfer_cutedsl
+):

@rainj-me added the run-ci label Nov 6, 2025
@rainj-me requested a review from Fridge003 as a code owner November 6, 2025 22:05
@Fridge003 merged commit a119363 into sgl-project:main Nov 6, 2025
61 of 77 checks passed
@HanHan009527 deleted the dev/fp4-flashinfer-cutedsl branch December 16, 2025 16:19