Model Support
- Day-0 Model Support (Nemotron V3)
- LLM Model Optimization
  - Focus on Blackwell hardware: (G)B300/(G)B200/Spark/Thor (popular + customer-driven)
  - DeepSeek R1
    - NVFP4 Disagg optimization
    - Long Context optimization
    - MTP + Disagg compatibility
  - gpt-oss
    - FlashInfer trtllm-gen MoE autotuning (a tuning sketch follows this list)
    - Support DEP=2
    - Support trtllm-gen MoE tileN 64 vs 32 selection
    - mnnvl_alltoall integration
    - Low Latency
      - Finalize+Slice fusion
      - RoPE+Q+Cache fusion
      - Router GEMM
  - DeepSeek V3.2/Speciale
    - Optimization roadmap: [Roadmap] DeepSeek v3.2 Optimization #15025
  - Qwen3-Next
    - MoE FlashInfer backend
  - Other models: Nemotron V3/GLM-4.6/Kimi-K2/Qwen3-235B/Mistral-3…
- Diffusion model optimization
  - WAN, Flux, Z-Image, Qwen-Image…
  - Research previews: transfusion models, real-time video models
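
Items like the MoE autotuning and tileN 64-vs-32 selection above generally reduce to timing the candidate kernel configurations per problem shape and caching the winner. A minimal sketch of that pattern, where `run_moe_tile32`/`run_moe_tile64` are hypothetical stand-ins for the real trtllm-gen entry points in FlashInfer:

```python
import torch

# Hypothetical stand-ins for the real trtllm-gen MoE kernel variants; the
# actual entry points and signatures live in FlashInfer and differ from this.
def run_moe_tile32(x, w):
    return x @ w

def run_moe_tile64(x, w):
    return x @ w

_tuned = {}  # problem shape -> fastest kernel, cached after first call

def autotuned_moe(x: torch.Tensor, w: torch.Tensor, iters: int = 10):
    key = (x.shape[0], x.shape[1])
    if key not in _tuned:
        best, best_ms = None, float("inf")
        for fn in (run_moe_tile32, run_moe_tile64):
            fn(x, w)  # warm-up (also triggers any lazy compilation)
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(iters):
                fn(x, w)
            end.record()
            torch.cuda.synchronize()
            ms = start.elapsed_time(end) / iters
            if ms < best_ms:
                best, best_ms = fn, ms
        _tuned[key] = best
    return _tuned[key](x, w)
```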
Flashinfer
- Kernel support for models including Kimi-K2, DeepSeek, and Qwen3-Next
- Attention Updates
  - Blackwell (SM10x) performance
  - NSA & DSA support (SM10x)
- Investigating
  - Sparse Attention APIs – Diffusion models
  - Deterministic Use Cases & Requirements
  - FP4 KV-Cache Support
- New compute + comms fused kernels (a reference sketch follows this list)
  - AR + RMSNorm
  - AR + RMSNorm + Quant
- Improved packaging & deployment
  - JIT cache & cubins
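
For the AR + RMSNorm item, a fused kernel combines the all-reduce with the subsequent normalization in a single launch, avoiding an extra round trip to HBM. As a reference for the intended semantics only (not the fused kernel itself), an unfused PyTorch version:

```python
import torch
import torch.distributed as dist

def allreduce_rmsnorm(x: torch.Tensor, weight: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    # Unfused reference: a fused kernel performs both steps in one launch.
    dist.all_reduce(x)                                   # AR: sum across ranks
    variance = x.float().pow(2).mean(-1, keepdim=True)   # RMS statistics
    x = x * torch.rsqrt(variance + eps).to(x.dtype)      # normalize
    return x * weight                                    # learned scale
```

The "+ Quant" variant would additionally quantize the normalized output (e.g., to FP8/NVFP4) inside the same kernel before writing it back.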
Model Optimizer
- NVFP4 Quantization support + Inference-Optimized HF Checkpoints: link
- Improved techniques (a hedged PTQ sketch follows this list)
  - PTQ algorithms (NVFP4 AWQ/GPTQ/Rotation)
  - Eagle3 offline training: SGLang hidden-state support
  - Improved AutoQuantize – more mixed-precision checkpoint options
  - Pruning/distillation improvements – more pruned/NAS models
- Direct SGLang integration
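
A minimal PTQ sketch for the NVFP4 item, assuming TensorRT Model Optimizer's `mtq.quantize` API and its `NVFP4_DEFAULT_CFG` preset (names as in current ModelOpt releases; `model` and `calib_dataloader` are assumed to exist):

```python
import modelopt.torch.quantization as mtq

# Calibration forward loop: run a small sample set through the model so
# activation ranges can be observed; data loading is elided here.
def forward_loop(model):
    for batch in calib_dataloader:  # assumed to exist
        model(batch)

# PTQ: replace supported layers with NVFP4-quantized equivalents,
# calibrated with the loop above.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```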
Kernel Optimization
- Implement low-latency symmetric kernels with NCCL v2.29
- Test and upgrade cutlass/cutlass-dsl to the latest version (v4.4)
- Replace slower torch/triton kernels with experimental cuTile/CuTe DSL implementations
- Move more kernels from sgl-kernel to JIT compilation (a toy example follows this list)
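
On moving kernels from the prebuilt sgl-kernel wheel to JIT: the general pattern is to compile CUDA sources at first use and cache the binary. A toy illustration using PyTorch's `load_inline` (not SGLang's actual JIT machinery):

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void scale_kernel(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * s;
}

torch::Tensor scale(torch::Tensor x, double s) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    scale_kernel<<<(n + 255) / 256, 256>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), (float)s, n);
    return y;
}
"""

# Compiled at first use and cached on disk, so later runs skip the build.
mod = load_inline(
    name="toy_scale",
    cpp_sources="torch::Tensor scale(torch::Tensor x, double s);",
    cuda_sources=cuda_src,
    functions=["scale"],
)
print(mod.scale(torch.ones(1024, device="cuda"), 2.0)[:4])
```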
PD Disaggregation/Large Scale Serving
- Benchmarking + Optimization
  - Continue to push SOL performance with SGLang model recipes
  - E2E benchmarking + Pareto-curve optimization with new strategies, including:
    - MTP
    - Long Context
    - Real-life workloads that exercise EPLB + KV cache
    - DP Attention load balancing
- Dynamo
  - Production-ready GB200/GB300 + Slurm/K8s recipes
  - KVBM + HiCache integration
  - KV cache connector API (a hypothetical interface sketch follows this list)
  - Explore integration of AI Configurator/Grove/ModelExpress
- NIXL
  - KV transfer optimization
    - Layer-wise/chunk-wise KV transfer
- Resilience in production deployment
  - NIXL EP for fault tolerance + DP ranks created without a fixed world size
  - Lightweight companion process for weights, enabling fast restarts and checkpointing
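
The KV cache connector API item suggests a transport-agnostic interface between prefill and decode workers. Purely a hypothetical sketch (every name below is illustrative, not an existing SGLang/Dynamo/NIXL API):

```python
from abc import ABC, abstractmethod
import torch

class KVConnector(ABC):
    """Hypothetical transport-agnostic KV cache connector.

    A NIXL- or Dynamo-backed implementation would override these with
    RDMA/NVLink transfers; names and signatures are illustrative only.
    """

    @abstractmethod
    def send_kv(self, request_id: str, layer: int, kv: torch.Tensor) -> None:
        """Push one layer's KV block to the decode side; per-layer pushes
        enable layer-wise/chunk-wise overlap of transfer with prefill."""

    @abstractmethod
    def recv_kv(self, request_id: str, layer: int) -> torch.Tensor:
        """Block until the layer's KV block is available locally."""
```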
DGX Station & DGX Spark
- DGX Spark
  - Introduce a model support matrix for the SGLang playbook
  - Enable community contributions
- DGX Station
  - Day-0 support with runnable examples
CI/CD
- Refactors and Improvements: [Roadmap] CI suites organization #13808
  - Unit level (builds/utility/features)
  - Kernels
  - E2E sanity
  - E2E performance
- Create a test/execution plan
- Ensure enough hardware to improve CI/CD
- Enable a CUDA 13 CI workflow
- Add B200/B300 CI runners
- Add whole-rack GB200 CI (if possible)
- Add a performance dashboard for popular models (DeepSeek R1) to track performance changes (a regression-check sketch follows this list)
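
The dashboard item implies automated regression detection in CI. A hedged sketch of the core check, assuming benchmark results stored as JSON with a made-up `throughput_tok_s` field (the real schema and metrics would come from the benchmarking suite):

```python
import json
import sys

THRESHOLD = 0.05  # flag throughput regressions larger than 5%

def check(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        base = json.load(f)
    with open(current_path) as f:
        cur = json.load(f)
    # Compare one throughput metric; field name is illustrative only.
    drop = (base["throughput_tok_s"] - cur["throughput_tok_s"]) / base["throughput_tok_s"]
    if drop > THRESHOLD:
        print(f"Regression: throughput down {drop:.1%}")
        return 1
    print(f"OK: change {-drop:+.1%}")
    return 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1], sys.argv[2]))
```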