
Support piecewise cuda graph for dsv3 fp4#15531

Merged
ispobock merged 6 commits into sgl-project:main from ispobock:dsv3-pcg
Dec 21, 2025

Conversation

@ispobock (Collaborator)

Motivation

#11490

Accuracy Tests

python3 -m sglang.launch_server --model nvidia/DeepSeek-R1-0528-FP4-v2 --tp 8 --trust-remote-code --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' --enable-piecewise-cuda-graph --quantization modelopt_fp4
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1400 --parallel 1400

Accuracy: 0.947
Invalid: 0.000
Latency: 25.469 s
Output throughput: 5467.202 token/s

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ispobock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates piecewise CUDA graph capabilities for DeepSeek V3 FP4 models, aiming to optimize performance by allowing graph capture for specific computational segments. It involves adapting the attention and MoE layers to work seamlessly with this new execution mode, ensuring proper fallback mechanisms and preventing conflicts with Torch Dynamo. The changes are validated through new dedicated test suites covering accuracy and speed.

Highlights

  • Piecewise CUDA Graph Support: Introduced support for piecewise CUDA graphs specifically for DeepSeek V3 FP4 models, enabling more efficient execution by leveraging graph capture for parts of the computation.
  • Attention Backend Adaptation: Modified the TRT-LLM MLA attention backend to fallback to FlashInfer MLA when operating within a piecewise CUDA graph, as the latter is designed for this specific forward mode.
  • MoE Layer Integration: Updated the Mixture-of-Experts (MoE) layer to utilize a new custom operation, flashinfer_fp4_moe_forward_piecewise_cuda_graph_impl, when piecewise CUDA graphs are active, ensuring compatibility with FlashInfer FP4 MoE.
  • FP4 GEMM Operation Standardization: Standardized the FP4 GEMM calls from an internal _sglang_fp4_gemm function to torch.ops.sglang.fp4_gemm, likely indicating a move towards a more formal custom operator registration.
  • Torch Dynamo Compatibility: Adjusted the expert distribution recorder's context management to use nullcontext when piecewise CUDA graphs are enabled, preventing potential graph breaks with Torch Dynamo.
  • New Test Cases: Added comprehensive test cases for DeepSeek V3 FP4 with piecewise CUDA graphs, including accuracy tests on GSM8K and single-batch speed benchmarks, to validate the new functionality.
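The attention-backend fallback in the highlights above reduces to simple dispatch logic. A minimal sketch of the idea (function and backend names here are illustrative assumptions, not the actual sglang code):

```python
# Minimal sketch of the attention-backend fallback described above.
# The selector function and backend name strings are assumptions for
# illustration; the real sglang dispatch lives in the attention backends.
def select_attention_backend(configured: str, in_piecewise_cuda_graph: bool) -> str:
    # TRT-LLM MLA is not designed for capture inside a piecewise CUDA
    # graph region, so fall back to FlashInfer MLA there.
    if configured == "trtllm_mla" and in_piecewise_cuda_graph:
        return "flashinfer_mla"
    return configured

print(select_attention_backend("trtllm_mla", True))   # flashinfer_mla
print(select_attention_backend("trtllm_mla", False))  # trtllm_mla
```

The same shape applies to the MoE path, where the piecewise-graph branch routes through a registered custom op instead of a different backend.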




@ispobock (Collaborator, Author)

/tag-and-rerun-ci

@gemini-code-assist (bot) left a comment
Code Review

This pull request adds support for piecewise CUDA graphs for DeepSeek V3 FP4 models, which is a great enhancement for performance. The changes are well-structured and include necessary fallbacks, custom operators for CUDA graph compatibility, and workarounds for torch.dynamo limitations. The addition of a dedicated test suite for this feature is also a good practice. I have one minor suggestion to remove a redundant assertion to improve code clarity.

Comment on lines +1196 to +1199

    assert TopKOutputChecker.format_is_bypassed(
        topk_output
    ), "Only bypassed topk output is supported for flashinfer fp4 moe"

@gemini-code-assist (Contributor), severity: medium

This assertion is redundant because self.forward_impl is called in both branches of the if is_in_piecewise_cuda_graph(): statement (indirectly via the custom op in the if branch), and forward_impl already contains the same assertion. To avoid duplication, we can remove this one.
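The reviewer's point can be illustrated with a stripped-down shape of the control flow (a sketch only; the dict-based `topk_output` and function bodies are assumptions standing in for the real sglang code):

```python
# Stripped-down shape of the flow the review comment describes: both
# branches of the piecewise-CUDA-graph check end up in forward_impl,
# which already performs the format assertion, so an extra assert
# before the branch only duplicates it.
def forward_impl(topk_output):
    assert topk_output["format"] == "bypassed", (
        "Only bypassed topk output is supported for flashinfer fp4 moe"
    )
    return "ok"

def forward(topk_output, in_piecewise_cuda_graph):
    if in_piecewise_cuda_graph:
        # In the real code this path goes through a registered custom
        # op, which in turn calls forward_impl.
        return forward_impl(topk_output)
    return forward_impl(topk_output)

print(forward({"format": "bypassed"}, True))  # ok
```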

    w = layer.weight_packed.T
    w_blockscale = layer.weight_scale.T

    out = _sglang_fp4_gemm(
Collaborator

QQ: we can probably just delete this wrapper now, right?

Collaborator Author

Yes, we can check the usage of it.

Collaborator Author

All the CI passed: https://github.com/sgl-project/sglang/actions/runs/20395182671?pr=15531. We can check and remove it in another PR.

Collaborator Author

I think it cannot be removed, since it's registered as the sglang::fp4_gemm op:

    @torch.library.custom_op("sglang::fp4_gemm", mutates_args=())

@ispobock ispobock merged commit 8fe3e37 into sgl-project:main Dec 21, 2025
314 of 329 checks passed
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

blackwell (SM100/SM120), deepseek, quant (LLM Quantization), run-ci
