Model Support
- Day-0 Model Support (Nemotron V3)
- LLM Model Optimization
  - Focus on Blackwell hardware: (G)B300/(G)B200/Spark/Thor (popular + customer-driven)
  - DeepSeek R1
    - NVFP4 Disagg optimization
    - Long Context optimization
    - MTP + Disagg compatibility
  - gpt-oss
    - FlashInfer trtllm-gen MoE autotuning (a tuning sketch follows this list)
    - Support DEP=2
    - Support trtllm-gen MoE tileN 64 vs 32 selection
    - mnnvl_alltoall integration
    - Low Latency
      - Finalize+Slice fusion
      - RoPE+Q+Cache fusion
      - Router GEMM
  - DeepSeek V3.2/Speciale
    - Optimization roadmap: [Roadmap] DeepSeek v3.2 Optimization #15025
  - Qwen3-Next
    - MoE FlashInfer backend
  - Other models: Nemotron V3/GLM-4.6/Kimi-K2/Qwen3-235B/Mistral-3…
- Diffusion model optimization
  - WAN, Flux, Z-Image, Qwen-Image…
  - Research previews: transfusion models, real-time video models
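
Items like the MoE autotuning and tileN 64-vs-32 selection above generally reduce to timing the candidate kernel configurations per problem shape and caching the winner. A minimal sketch of that pattern, where `run_moe_tile32`/`run_moe_tile64` are hypothetical stand-ins for the real trtllm-gen entry points in FlashInfer:

```python
import torch

# Hypothetical stand-ins for the real trtllm-gen MoE kernel variants; the
# actual entry points and signatures live in FlashInfer and differ from this.
def run_moe_tile32(x, w):
    return x @ w

def run_moe_tile64(x, w):
    return x @ w

_tuned = {}  # problem shape -> fastest kernel, cached after first call

def autotuned_moe(x: torch.Tensor, w: torch.Tensor, iters: int = 10):
    key = (x.shape[0], x.shape[1])
    if key not in _tuned:
        best, best_ms = None, float("inf")
        for fn in (run_moe_tile32, run_moe_tile64):
            fn(x, w)  # warm-up (also triggers any lazy compilation)
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(iters):
                fn(x, w)
            end.record()
            torch.cuda.synchronize()
            ms = start.elapsed_time(end) / iters
            if ms < best_ms:
                best, best_ms = fn, ms
        _tuned[key] = best
    return _tuned[key](x, w)
```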
Flashinfer
- Kernel support for models including Kimi-K2, DeepSeek, and Qwen3-Next
- Attention Updates
  - Blackwell (SM10x) performance
  - NSA & DSA support (SM10x)
- Investigating
  - Sparse Attention APIs – Diffusion models
  - Deterministic Use Cases & Requirements
  - FP4 KV-Cache Support
- New compute + comms fused kernels (a reference sketch follows this list)
  - AR + RMSNorm
  - AR + RMSNorm + Quant
- Improved packaging & deployment
  - JIT cache & cubins
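
For the AR + RMSNorm item, a fused kernel combines the all-reduce with the subsequent normalization in a single launch, avoiding an extra round trip to HBM. As a reference for the intended semantics only (not the fused kernel itself), an unfused PyTorch version:

```python
import torch
import torch.distributed as dist

def allreduce_rmsnorm(x: torch.Tensor, weight: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    # Unfused reference: a fused kernel performs both steps in one launch.
    dist.all_reduce(x)                                   # AR: sum across ranks
    variance = x.float().pow(2).mean(-1, keepdim=True)   # RMS statistics
    x = x * torch.rsqrt(variance + eps).to(x.dtype)      # normalize
    return x * weight                                    # learned scale
```

The "+ Quant" variant would additionally quantize the normalized output (e.g., to FP8/NVFP4) inside the same kernel before writing it back.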
Model Optimizer
- NVFP4 Quantization support + Inference-Optimized HF Checkpoints: link
- Improved techniques (a hedged PTQ sketch follows this list)
  - PTQ algorithms (NVFP4 AWQ/GPTQ/Rotation)
  - Eagle3 offline training: SGLang hidden-state support
  - Improved AutoQuantize – more mixed-precision checkpoint options
  - Pruning/distillation improvements – more pruned/NAS models
- Direct SGLang integration
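
A minimal PTQ sketch for the NVFP4 item, assuming TensorRT Model Optimizer's `mtq.quantize` API and its `NVFP4_DEFAULT_CFG` preset (names as in current ModelOpt releases; `model` and `calib_dataloader` are assumed to exist):

```python
import modelopt.torch.quantization as mtq

# Calibration forward loop: run a small sample set through the model so
# activation ranges can be observed; data loading is elided here.
def forward_loop(model):
    for batch in calib_dataloader:  # assumed to exist
        model(batch)

# PTQ: replace supported layers with NVFP4-quantized equivalents,
# calibrated with the loop above.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```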
Kernel Optimization
- Implement low-latency symmetric kernels with NCCL v2.29
- Test and upgrade cutlass/cutlass-dsl to the latest version (v4.4)
- Replace slower torch/triton kernels with experimental cuTile/CuTe DSL implementations
- Move more kernels from sgl-kernel to JIT compilation (a toy example follows this list)
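
On moving kernels from the prebuilt sgl-kernel wheel to JIT: the general pattern is to compile CUDA sources at first use and cache the binary. A toy illustration using PyTorch's `load_inline` (not SGLang's actual JIT machinery):

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void scale_kernel(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * s;
}

torch::Tensor scale(torch::Tensor x, double s) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    scale_kernel<<<(n + 255) / 256, 256>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), (float)s, n);
    return y;
}
"""

# Compiled at first use and cached on disk, so later runs skip the build.
mod = load_inline(
    name="toy_scale",
    cpp_sources="torch::Tensor scale(torch::Tensor x, double s);",
    cuda_sources=cuda_src,
    functions=["scale"],
)
print(mod.scale(torch.ones(1024, device="cuda"), 2.0)[:4])
```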
PD Disaggregation/Large Scale Serving
- Benchmarking + Optimization
  - Continue to push SOL performance with SGLang model recipes
  - E2E benchmarking + Pareto-curve optimization with new strategies, including:
    - MTP
    - Long Context
    - Real-life workloads that exercise EPLB + KV cache
    - DP Attention load balancing
- Dynamo
  - Production-ready GB200/GB300 + Slurm/K8s recipes
  - KVBM + HiCache integration
  - KV cache connector API (a hypothetical interface sketch follows this list)
  - Explore integration of AI Configurator/Grove/ModelExpress
- NIXL
  - KV transfer optimization
    - Layer-wise/chunk-wise KV transfer
- Resilience in production deployment
  - NIXL EP for fault tolerance + DP ranks created without a fixed world size
  - Lightweight companion process for weights, enabling fast restarts and checkpointing
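
The KV cache connector API item suggests a transport-agnostic interface between prefill and decode workers. Purely a hypothetical sketch (every name below is illustrative, not an existing SGLang/Dynamo/NIXL API):

```python
from abc import ABC, abstractmethod
import torch

class KVConnector(ABC):
    """Hypothetical transport-agnostic KV cache connector.

    A NIXL- or Dynamo-backed implementation would override these with
    RDMA/NVLink transfers; names and signatures are illustrative only.
    """

    @abstractmethod
    def send_kv(self, request_id: str, layer: int, kv: torch.Tensor) -> None:
        """Push one layer's KV block to the decode side; per-layer pushes
        enable layer-wise/chunk-wise overlap of transfer with prefill."""

    @abstractmethod
    def recv_kv(self, request_id: str, layer: int) -> torch.Tensor:
        """Block until the layer's KV block is available locally."""
```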
DGX Station & DGX Spark
- DGX Spark
  - Introduce a model support matrix for the SGLang playbook
  - Enable community contributions
- DGX Station
  - Day-0 support with runnable examples
CI/CD
- Refactors and Improvements: [Roadmap] CI suites organization #13808
  - Unit level (builds/utility/features)
  - Kernels
  - E2E sanity
  - E2E performance
- Create a test/execution plan
- Ensure enough hardware to improve CI/CD
- Enable a CUDA 13 CI workflow
- Add B200/B300 CI runners
- Add whole-rack GB200 CI (if possible)
- Add a performance dashboard for popular models (DeepSeek R1) to track performance changes (a regression-check sketch follows this list)
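
The dashboard item implies automated regression detection in CI. A hedged sketch of the core check, assuming benchmark results stored as JSON with a made-up `throughput_tok_s` field (the real schema and metrics would come from the benchmarking suite):

```python
import json
import sys

THRESHOLD = 0.05  # flag throughput regressions larger than 5%

def check(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        base = json.load(f)
    with open(current_path) as f:
        cur = json.load(f)
    # Compare one throughput metric; field name is illustrative only.
    drop = (base["throughput_tok_s"] - cur["throughput_tok_s"]) / base["throughput_tok_s"]
    if drop > THRESHOLD:
        print(f"Regression: throughput down {drop:.1%}")
        return 1
    print(f"OK: change {-drop:+.1%}")
    return 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1], sys.argv[2]))
```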