
Support piecewise cuda graph for dsv3 fp4#15531

Merged
ispobock merged 6 commits into sgl-project:main from ispobock:dsv3-pcg
Dec 21, 2025

Conversation

@ispobock (Collaborator)

Motivation

#11490

Accuracy Tests

python3 -m sglang.launch_server --model nvidia/DeepSeek-R1-0528-FP4-v2 --tp 8 --trust-remote-code --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' --enable-piecewise-cuda-graph --quantization modelopt_fp4
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1400 --parallel 1400

Accuracy: 0.947
Invalid: 0.000
Latency: 25.469 s
Output throughput: 5467.202 token/s

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ispobock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates piecewise CUDA graph capabilities for DeepSeek V3 FP4 models, aiming to optimize performance by allowing graph capture for specific computational segments. It involves adapting the attention and MoE layers to work seamlessly with this new execution mode, ensuring proper fallback mechanisms and preventing conflicts with Torch Dynamo. The changes are validated through new dedicated test suites covering accuracy and speed.

Highlights

  • Piecewise CUDA Graph Support: Introduced support for piecewise CUDA graphs specifically for DeepSeek V3 FP4 models, enabling more efficient execution by leveraging graph capture for parts of the computation.
  • Attention Backend Adaptation: Modified the TRT-LLM MLA attention backend to fallback to FlashInfer MLA when operating within a piecewise CUDA graph, as the latter is designed for this specific forward mode.
  • MoE Layer Integration: Updated the Mixture-of-Experts (MoE) layer to utilize a new custom operation, flashinfer_fp4_moe_forward_piecewise_cuda_graph_impl, when piecewise CUDA graphs are active, ensuring compatibility with FlashInfer FP4 MoE.
  • FP4 GEMM Operation Standardization: Standardized the FP4 GEMM calls from an internal _sglang_fp4_gemm function to torch.ops.sglang.fp4_gemm, likely indicating a move towards a more formal custom operator registration.
  • Torch Dynamo Compatibility: Adjusted the expert distribution recorder's context management to use nullcontext when piecewise CUDA graphs are enabled, preventing potential graph breaks with Torch Dynamo.
  • New Test Cases: Added comprehensive test cases for DeepSeek V3 FP4 with piecewise CUDA graphs, including accuracy tests on GSM8K and single-batch speed benchmarks, to validate the new functionality.
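The attention-backend fallback in the highlights above reduces to simple dispatch logic. A minimal sketch of the idea (function and backend names here are illustrative assumptions, not the actual sglang code):

```python
# Minimal sketch of the attention-backend fallback described above.
# The selector function and backend name strings are assumptions for
# illustration; the real sglang dispatch lives in the attention backends.
def select_attention_backend(configured: str, in_piecewise_cuda_graph: bool) -> str:
    # TRT-LLM MLA is not designed for capture inside a piecewise CUDA
    # graph region, so fall back to FlashInfer MLA there.
    if configured == "trtllm_mla" and in_piecewise_cuda_graph:
        return "flashinfer_mla"
    return configured

print(select_attention_backend("trtllm_mla", True))   # flashinfer_mla
print(select_attention_backend("trtllm_mla", False))  # trtllm_mla
```

The same shape applies to the MoE path, where the piecewise-graph branch routes through a registered custom op instead of a different backend.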




@ispobock (Collaborator, Author)

/tag-and-rerun-ci

@gemini-code-assist (bot) left a comment
Code Review

This pull request adds support for piecewise CUDA graphs for DeepSeek V3 FP4 models, which is a great enhancement for performance. The changes are well-structured and include necessary fallbacks, custom operators for CUDA graph compatibility, and workarounds for torch.dynamo limitations. The addition of a dedicated test suite for this feature is also a good practice. I have one minor suggestion to remove a redundant assertion to improve code clarity.

Comment on lines +1196 to +1199

    assert TopKOutputChecker.format_is_bypassed(
        topk_output
    ), "Only bypassed topk output is supported for flashinfer fp4 moe"

@gemini-code-assist (Contributor), severity: medium

This assertion is redundant because self.forward_impl is called in both branches of the if is_in_piecewise_cuda_graph(): statement (indirectly via the custom op in the if branch), and forward_impl already contains the same assertion. To avoid duplication, we can remove this one.
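The reviewer's point can be illustrated with a stripped-down shape of the control flow (a sketch only; the dict-based `topk_output` and function bodies are assumptions standing in for the real sglang code):

```python
# Stripped-down shape of the flow the review comment describes: both
# branches of the piecewise-CUDA-graph check end up in forward_impl,
# which already performs the format assertion, so an extra assert
# before the branch only duplicates it.
def forward_impl(topk_output):
    assert topk_output["format"] == "bypassed", (
        "Only bypassed topk output is supported for flashinfer fp4 moe"
    )
    return "ok"

def forward(topk_output, in_piecewise_cuda_graph):
    if in_piecewise_cuda_graph:
        # In the real code this path goes through a registered custom
        # op, which in turn calls forward_impl.
        return forward_impl(topk_output)
    return forward_impl(topk_output)

print(forward({"format": "bypassed"}, True))  # ok
```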

    w = layer.weight_packed.T
    w_blockscale = layer.weight_scale.T

    out = _sglang_fp4_gemm(
Collaborator

QQ: we can probably just delete this wrapper now, right?

Collaborator Author

Yes, we can check the usage of it.

Collaborator Author

All the CI passed: https://github.com/sgl-project/sglang/actions/runs/20395182671?pr=15531. We can check and remove it in another PR.

Collaborator Author

I think it cannot be removed, since it's registered as the sglang::fp4_gemm op:

    @torch.library.custom_op("sglang::fp4_gemm", mutates_args=())

@ispobock ispobock merged commit 8fe3e37 into sgl-project:main Dec 21, 2025
314 of 329 checks passed
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

blackwell (SM100/SM120), deepseek, quant (LLM Quantization), run-ci
