Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
Earlier this year, LLaDA released the first diffusion LLM (dLLM), immediately capturing significant attention from both the academic and industrial communities. But there was no production-ready dLLM serving engine.
We plan to implement the most performant, production-ready dLLM framework in SGLang and make dLLM serving robust!
Features
- Initial diffusion LLM framework @ClawSeven @btw616 [Feature] Initial block diffusion language model support #12588
- Support LLaDA2.0-flash / LLaDA2.0-mini
- Support tensor parallel / expert parallel
- Support block diffusion and kv cache
- Doc and CI @ClawSeven @Monstertail
  - [DLLM] Add documentation for diffusion LLMs #14358
  - [DLLM] Add CI for diffusion LLMs #14723
- Support CUDA graph @btw616 [DLLM] Add initial cuda graph support #14203
- Support self-defined attention mask
- Support parallel decoding @btw616 @Monstertail [DLLM] Add threshold based parallel decoding support #14412 (see the sketch after this list)
- Support temperature, top-p, top-k sampling
- dLLM code refactoring @ClawSeven
- Support initial dynamic batching [DLLM] Implement initial dynamic batching for diffusion LLM #14883
- Batching optimization & dLLM scheduling refactor [DLLM] Basic dLLM scheduling strategy and implementation #17484
- Requests early exit (for decoding optimization)
- Support radix cache (for prefill optimization) @btw616 [DLLM] Add initial radix cache support #18724
- Support overlap scheduling
- Support dLLM editing @edwardzjl @btw616 [DLLM] Add JointThreshold algorithm for joint M2T and T2T decoding #18171
- Metrics for dLLM @zhanghaotong
- Support non-block diffusion LLMs
- piecewise CUDA Graph (for prefill optimization)
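
For context on the parallel decoding item above (#14412): threshold-based parallel decoding commits every masked token whose predicted confidence clears a threshold in a single denoising step, instead of one token per step. Below is a minimal sketch of that acceptance rule; the function name, shapes, and default threshold are illustrative, not SGLang's actual API.

```python
import torch

def parallel_decode_step(logits: torch.Tensor,
                         is_masked: torch.Tensor,
                         threshold: float = 0.9) -> torch.Tensor:
    """One denoising step: accept every masked position whose top-1
    probability exceeds `threshold`; always accept at least the most
    confident masked position so the step makes progress.

    logits:    (seq_len, vocab_size) model outputs for the block
    is_masked: (seq_len,) bool, True where the token is still [MASK]
    Returns a (seq_len,) bool tensor marking positions to unmask now.
    """
    probs = torch.softmax(logits.float(), dim=-1)
    confidence, _ = probs.max(dim=-1)              # top-1 prob per position
    confidence = confidence.masked_fill(~is_masked, -1.0)

    accept = (confidence > threshold) & is_masked
    if not accept.any():                           # guarantee progress
        accept[confidence.argmax()] = True
    return accept
```

Positions that are not accepted stay masked and are re-predicted in the next step; raising the threshold trades decoding speed for accuracy.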
For RL
- Support step maps @RuixiangMa @ClawSeven [DLLM] support step map #17297
VL-dLLM
- Initial multi-modal LLM implementation @btw616
More supported models
- LLaDA2.0
- SDAR @chengshuang18 Add SDAR model support #18318
- Fast-dLLM v2 @Monstertail (WIP) [DLLM]Fast-dLLM-v2 support with HierarchyBlock algorithm for parallel decoding #17577
More Hardware
- Nvidia
- AMD:
- triton backend: [AMD] Add DLLM support for AMD GPUs with LLaDA2 testing #15560
- aiter backend
- Ascend
- ascend backend: [NPU] support DLLM ascend backend on NPU, with LLaDA2 testing #16494
- triton backend
- Intel
More Parallelism
- Tensor parallelism
- Expert parallelism
- Data Parallelism (with DPA)
- Context Parallelism
- Pipeline parallelism
Kernel Optimization for dLLM
- MoE
  dLLM workloads make the fused MoE kernel a bottleneck, so we need to optimize the fused implementation for dLLM scenarios (see the SwapAB sketch after this list).
  - FP8 optimization for small batch sizes: Add SwapAB Optimization for triton fused_moe_kernel on SM90. #15712
- Attention
  Optimize block-wise causal attention for dLLM prefill (see the mask sketch after this list).
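
On the MoE bullet above: the SwapAB optimization in #15712 rests on the identity C = A·B = (BᵀAᵀ)ᵀ. On SM90, wgmma tiles are fixed at 64 along M, so for small-batch (small-M) expert GEMMs, swapping operands moves the tiny dimension onto the more flexible N side and wastes fewer padded tile rows. A minimal numerical illustration of the identity (not the Triton kernel itself; shapes are made up):

```python
import torch

# SwapAB idea: for a "skinny" GEMM C = A @ B with tiny M (small decode
# batch), compute C^T = B^T @ A^T instead. The result is identical, but
# the small dimension moves off the tensor-core tile's rigid M side.
M, K, N = 4, 7168, 2048            # illustrative small-batch MoE shapes
A = torch.randn(M, K)
B = torch.randn(K, N)

C_direct = A @ B                   # straightforward formulation
C_swapped = (B.T @ A.T).T          # operand-swapped formulation

# Identical up to floating-point accumulation order.
assert torch.allclose(C_direct, C_swapped, rtol=1e-4, atol=1e-3)
```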
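
And on the attention bullet: block-wise causal attention is what distinguishes block diffusion prefill from a plain causal pass. Tokens attend bidirectionally within their own block and causally across blocks, which is also what lets completed blocks be served from the KV cache. A minimal mask construction, assuming a fixed block size (the helper is hypothetical, not SGLang's internal layout):

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """True where attention is allowed: a query token may attend to a key
    token iff the key's block index is <= the query's block index, i.e.
    bidirectional inside a block, causal across blocks."""
    blk = torch.arange(seq_len) // block_size   # block index per position
    return blk[:, None] >= blk[None, :]

# 6 tokens, block size 2: each 2x2 diagonal block is fully visible to
# itself, and every earlier block is fully visible as well.
print(block_causal_mask(6, 2).int())
```

In practice the mask would be fused into the attention kernel rather than materialized as a dense tensor.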
More disaggregation
Prefill-decode (PD) disaggregation is not suitable for dLLM, but attention-FFN disaggregation (AFD) might be a viable option.
More Tests
- Small unit tests for specific functions
- Nightly unit tests for E2E accuracy and throughput testing
Better streaming output
- Support diffusion-style streaming output (like Mercury)