Motivation
With PR #10062 merged, we have implemented the foundational framework for piecewise CUDA Graph and torch.compile backend support.
In this issue, we aim to outline the key follow-up tasks and enhancements planned for future iterations.
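For readers new to the approach, here is a minimal sketch of the piecewise idea using only plain PyTorch CUDA graph APIs. This is an illustration of the general technique, not SGLang's actual implementation: graph-safe segments of the model are captured as separate CUDA graphs, and graph-unsafe operations (e.g. attention with dynamic shapes) run eagerly between replays.

```python
# Minimal piecewise CUDA graph sketch (illustrative, not SGLang's code).
import torch

device = "cuda"
piece1 = torch.nn.Linear(1024, 1024, device=device)
piece2 = torch.nn.Linear(1024, 1024, device=device)

# Static buffers: CUDA graphs replay on fixed memory addresses.
static_in = torch.zeros(8, 1024, device=device)
static_mid = torch.zeros(8, 1024, device=device)

# Warm up on a side stream before capture, as CUDA graph capture requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    piece1(static_in)
    piece2(static_mid)
torch.cuda.current_stream().wait_stream(s)

# Capture each graph-safe segment as its own CUDA graph.
g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    static_out1 = piece1(static_in)
with torch.cuda.graph(g2):
    static_out2 = piece2(static_mid)

def forward(x):
    static_in.copy_(x)
    g1.replay()                     # captured piece 1
    mid = torch.relu(static_out1)   # stand-in for a graph-unsafe op, run eagerly
    static_mid.copy_(mid)
    g2.replay()                     # captured piece 2
    return static_out2.clone()

print(forward(torch.randn(8, 1024, device=device)).shape)
```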
Todo List
- Support eager compiler @Oasis-Git Eager Compiler for Torch Compile #11803
- Fix B200 being unable to use piecewise CUDA graph. @BBuf [b200] fix piecewise cuda graph launch bug #12067
- Fix TP illegal memory issue @zyksir [Fix] fix allreduce bug in Piecewise Graph #12106 Custom All Reduce for Piecewise Cuda Graph #15356
- Fix accuracy issue under high concurrency @ispobock @BBuf @Oasis-Git 34f9ac7
- DeepSeek V3 Architecture Support: Add support for MLA. @ispobock Support piecewise cuda graph for MLA #11812
- Sgl-kernel Update: Move the JIT script of `weak_ref_tensor` to `sgl_kernel`. @BBuf Migrate weak_ref_tensor to sgl-kernel #12505
- `custom_ops` Integration: Enable custom operator support for `cuda_forward`. @BBuf We do this by setting the compile backend to `torch` (a general sketch of the custom-op technique follows this list).
- Quantization Support: Extend the backend to support quantized model execution.
- FP8 @BBuf, @hebiao064 Support FP8 Per Token Quant Piecewise #13272 ModelOpt FP8 @b8zhong [Piecewise CUDA Graph] Support ModelOpt FP8 #13094
- INT8 @BBuf [Piecewise CUDA Graph] Support INT8 #14918
- W4A8 @b8zhong [Piecewise CUDA Graph] Support W4A8 #13179
- FP4 + DeepSeek (TRTLLM MLA + Flashinfer-TRTLLM MoE) @ispobock Support piecewise cuda graph for dsv3 fp4 #15531
- ModelOpt FP4 @b8zhong [Piecewise CUDA Graph] Support ModelOpt FP4 #13101
- GPTQ, AWQ @BBuf [PieceWise CUDA Graph] Support awq/gptq model in piecewise cudagraph #12518
- Other Attention Support
- Verify compatibility with the remaining MLA backends (FlashMLA, TRT-LLM/Cutlass MLA) @b8zhong
- SWA (GPT-OSS) @Oasis-Git Piecewise Cuda Graph Support for gpt-oss model #13045
- Hybrid Linear Attention (Qwen3-Next, Kimi-Linear) @Chen-0210 Support piecewise cuda graph for Qwen3-next #13081
- LogProbs Support: Support `LogProbs` return value. @narutolhy support more model in piecewise cuda graph #11745
- Data Parallelism Attention: Make the backend compatible with DP attention.
- Expert Parallelism (EP): Implement full EP functionality and integration with the existing compile graph backend. @Oasis-Git EP Support for Piecewise Cuda Graph #14164
- Pipeline Parallelism (PP): Implement full PP functionality and integration with the existing compile graph backend. @baonudesifeizhai Enable Pipeline Parallelism support for Piecewise CUDA Graph #14515 #14547
- Inductor Backend: Support the Inductor compiler
- PassManager Support: Better fusion kernels with torch.compile enabled. @yuan-luo [1/N] Support PassManager Framework and Fusion Pass #11830
- Turn on piecewise CUDA graph by default
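As referenced in the `custom_ops` item above, here is a hedged sketch of the general technique for making custom kernels visible to torch.compile: registering an op through `torch.library` lets the compiler treat it as a single opaque node that can still be traced and captured. The op name and body below are illustrative placeholders, not SGLang's actual kernels.

```python
# Custom-op registration sketch (illustrative; "demo::scaled_silu" is hypothetical).
import torch

@torch.library.custom_op("demo::scaled_silu", mutates_args=())
def scaled_silu(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Placeholder for a real CUDA kernel (e.g. one living in sgl-kernel).
    return torch.nn.functional.silu(x) * scale

@scaled_silu.register_fake
def _(x, scale):
    # Shape/dtype propagation so the compiler can trace without running CUDA.
    return torch.empty_like(x)

@torch.compile(backend="eager")  # "eager" is a built-in debugging backend
def f(x):
    return scaled_silu(x, 2.0)

print(f(torch.randn(4, 8)).shape)
```

Because the fake (meta) implementation gives the compiler shape information without executing the kernel, the op compiles cleanly instead of forcing a graph break.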
Below we list all the models that cannot yet use piecewise CUDA graph. Contributors and users can report additional models in this issue.
Model List
- Qwen/Qwen3-235B-A22B @BBuf piecewise cuda graph support qwen3-moe #11845
- deepseek-ai/DeepSeek-V3 @ispobock Support piecewise cuda graph for deepseek v3 #12996
- openai/gpt-oss-120b @Oasis-Git Piecewise Cuda Graph Support for gpt-oss model #13045
- moonshotai/Kimi-K2-Thinking, moonshotai/Kimi-K2-Instruct-0905 @b8zhong @ispobock @BBuf [Piecewise CUDA Graph] Support Kimi-K2 (non-Thinking) #13466 Support piecewise cuda graph for fused marlin moe #15100 Fix warp illegal instruction in kimi k2 thinking PCG #15306
- Qwen/Qwen3-Next-80B-A3B-Instruct @Chen-0210 Support piecewise cuda graph for Qwen3-next #13081
- QwenLM/Qwen3-VL @yuan-luo [Feature] Support Piecewise Graph for VLM #12838 [VLM] Support Piecewise CUDA Graph for Qwen2.5-VL #13055
- Grok2/Mixtral: [Piecewise CUDA Graph] Fix recompile issue for Mixtral and Grok2 #13667 @hebiao064 @zminglei
Contributions and discussions are highly welcome.