Motivation
With PR #10062 merged, we have implemented the foundational framework for piecewise CUDA Graph and torch.compile backend support.
In this issue, we aim to outline the key follow-up tasks and enhancements planned for future iterations.
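For readers new to the approach, here is a minimal sketch of the piecewise idea using only plain PyTorch CUDA graph APIs. This is an illustration of the general technique, not SGLang's actual implementation: graph-safe segments of the model are captured as separate CUDA graphs, and graph-unsafe operations (e.g. attention with dynamic shapes) run eagerly between replays.

```python
# Minimal piecewise CUDA graph sketch (illustrative, not SGLang's code).
import torch

device = "cuda"
piece1 = torch.nn.Linear(1024, 1024, device=device)
piece2 = torch.nn.Linear(1024, 1024, device=device)

# Static buffers: CUDA graphs replay on fixed memory addresses.
static_in = torch.zeros(8, 1024, device=device)
static_mid = torch.zeros(8, 1024, device=device)

# Warm up on a side stream before capture, as CUDA graph capture requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    piece1(static_in)
    piece2(static_mid)
torch.cuda.current_stream().wait_stream(s)

# Capture each graph-safe segment as its own CUDA graph.
g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    static_out1 = piece1(static_in)
with torch.cuda.graph(g2):
    static_out2 = piece2(static_mid)

def forward(x):
    static_in.copy_(x)
    g1.replay()                     # captured piece 1
    mid = torch.relu(static_out1)   # stand-in for a graph-unsafe op, run eagerly
    static_mid.copy_(mid)
    g2.replay()                     # captured piece 2
    return static_out2.clone()

print(forward(torch.randn(8, 1024, device=device)).shape)
```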
Todo List
- Support eager compiler @Oasis-Git Eager Compiler for Torch Compile #11803
- Fix B200 being unable to use piecewise CUDA graph. @BBuf [b200] fix piecewise cuda graph launch bug #12067
- Fix TP illegal memory issue @zyksir [Fix] fix allreduce bug in Piecewise Graph #12106 Custom All Reduce for Piecewise Cuda Graph #15356
- Fix accuracy issue under high concurrency @ispobock @BBuf @Oasis-Git 34f9ac7
- DeepSeek V3 Architecture Support: Add support for MLA. @ispobock Support piecewise cuda graph for MLA #11812
- Sgl-kernel Update: Move the JIT script of `weak_ref_tensor` to `sgl_kernel`. @BBuf Migrate weak_ref_tensor to sgl-kernel #12505
- `custom_ops` Integration: Enable custom operator support for `cuda_forward`. @BBuf We do this by setting the compile backend to `torch` (a general sketch of the custom-op technique follows this list).
- Quantization Support: Extend the backend to support quantized model execution.
- FP8 @BBuf, @hebiao064 Support FP8 Per Token Quant Piecewise #13272 ModelOpt FP8 @b8zhong [Piecewise CUDA Graph] Support ModelOpt FP8 #13094
- INT8 @BBuf [Piecewise CUDA Graph] Support INT8 #14918
- W4A8 @b8zhong [Piecewise CUDA Graph] Support W4A8 #13179
- FP4 + DeepSeek (TRTLLM MLA + Flashinfer-TRTLLM MoE) @ispobock Support piecewise cuda graph for dsv3 fp4 #15531
- ModelOpt FP4 @b8zhong [Piecewise CUDA Graph] Support ModelOpt FP4 #13101
- GPTQ, AWQ @BBuf [PieceWise CUDA Graph] Support awq/gptq model in piecewise cudagraph #12518
- Other Attention Support
- Verify compatibility with the remaining MLA backends (FlashMLA, TRT-LLM/Cutlass MLA) @b8zhong
- SWA (GPT-OSS) @Oasis-Git Piecewise Cuda Graph Support for gpt-oss model #13045
- Hybrid Linear Attention (Qwen3-Next, Kimi-Linear) @Chen-0210 Support piecewise cuda graph for Qwen3-next #13081
- LogProbs Support: Support `LogProbs` return value. @narutolhy support more model in piecewise cuda graph #11745
- Data Parallelism Attention: Make the backend compatible with DP attention.
- Expert Parallelism (EP): Implement full EP functionality and integration with the existing compile graph backend. @Oasis-Git EP Support for Piecewise Cuda Graph #14164
- Pipeline Parallelism (PP): Implement full PP functionality and integration with the existing compile graph backend. @baonudesifeizhai Enable Pipeline Parallelism support for Piecewise CUDA Graph #14515 #14547
- Inductor Backend: Support the Inductor compiler
- PassManager Support: Better fusion kernels with torch.compile enabled. @yuan-luo [1/N] Support PassManager Framework and Fusion Pass #11830
- Turn on piecewise CUDA graph by default
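As referenced in the `custom_ops` item above, here is a hedged sketch of the general technique for making custom kernels visible to torch.compile: registering an op through `torch.library` lets the compiler treat it as a single opaque node that can still be traced and captured. The op name and body below are illustrative placeholders, not SGLang's actual kernels.

```python
# Custom-op registration sketch (illustrative; "demo::scaled_silu" is hypothetical).
import torch

@torch.library.custom_op("demo::scaled_silu", mutates_args=())
def scaled_silu(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Placeholder for a real CUDA kernel (e.g. one living in sgl-kernel).
    return torch.nn.functional.silu(x) * scale

@scaled_silu.register_fake
def _(x, scale):
    # Shape/dtype propagation so the compiler can trace without running CUDA.
    return torch.empty_like(x)

@torch.compile(backend="eager")  # "eager" is a built-in debugging backend
def f(x):
    return scaled_silu(x, 2.0)

print(f(torch.randn(4, 8)).shape)
```

Because the fake (meta) implementation gives the compiler shape information without executing the kernel, the op compiles cleanly instead of forcing a graph break.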
Below we list all the models that cannot yet use piecewise CUDA graph. Contributors and users can report additional models in this issue.
Model List
- Qwen/Qwen3-235B-A22B @BBuf piecewise cuda graph support qwen3-moe #11845
- deepseek-ai/DeepSeek-V3 @ispobock Support piecewise cuda graph for deepseek v3 #12996
- openai/gpt-oss-120b @Oasis-Git Piecewise Cuda Graph Support for gpt-oss model #13045
- moonshotai/Kimi-K2-Thinking, moonshotai/Kimi-K2-Instruct-0905 @b8zhong @ispobock @BBuf [Piecewise CUDA Graph] Support Kimi-K2 (non-Thinking) #13466 Support piecewise cuda graph for fused marlin moe #15100 Fix warp illegal instruction in kimi k2 thinking PCG #15306
- Qwen/Qwen3-Next-80B-A3B-Instruct @Chen-0210 Support piecewise cuda graph for Qwen3-next #13081
- QwenLM/Qwen3-VL @yuan-luo [Feature] Support Piecewise Graph for VLM #12838 [VLM] Support Piecewise CUDA Graph for Qwen2.5-VL #13055
- Grok2/Mixtral: [Piecewise CUDA Graph] Fix recompile issue for Mixtral and Grok2 #13667 @hebiao064 @zminglei
Contributions and discussions are highly welcome.