
SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130


Model Support

  • Day-0 Model Support (Nemotron V3)
  • LLM Model Optimization
    • Focus on Blackwell hardware: (G)B300/(G)B200/Spark/Thor (popular + customer-driven)
    • DeepSeek R1
      • NVFP4 Disagg optimization
      • Long-context optimization
      • MTP + Disagg compatibility
    • gpt-oss
      • Flashinfer trtllm-gen MoE autotuning
      • Support DEP=2
        • Support trtllm-gen MoE tileN 64 vs. 32 selection
        • mnnvl_alltoall integration
      • Low Latency
        • Finalize+Slice fusion
        • RoPE+Q+Cache fusion
        • Router GEMM (see the toy sketch after this list)
    • DeepSeek V3.2/Speciale
    • Qwen3-Next
      • MoE Flashinfer backend
    • Other models: Nemotron V3/GLM-4.6/Kimi-K2/Qwen3-235B/Mistral-3…
  • Diffusion model optimization
    • WAN, Flux, Z-Image, Qwen-Image…
    • Research previews: transfusion models, real-time video models
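
The "Router GEMM" item above refers to the small expert-selection GEMM at the front of each MoE layer. The toy sketch below (plain PyTorch, illustrative shapes rather than gpt-oss's exact configuration) shows what a dedicated router-GEMM kernel would cover: a narrow GEMM producing expert logits, followed by top-k selection and renormalization of the selected weights.

```python
# Toy MoE router step: a narrow GEMM over expert logits + top-k selection.
# Shapes are illustrative only, not gpt-oss's actual configuration.
import torch

tokens, hidden, num_experts, top_k = 4096, 2880, 32, 4
x = torch.randn(tokens, hidden, device="cuda", dtype=torch.bfloat16)
router_w = torch.randn(hidden, num_experts, device="cuda", dtype=torch.bfloat16)

logits = x @ router_w                                   # the small "router GEMM"
probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
weights, experts = torch.topk(probs, top_k, dim=-1)     # chosen experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize selected weights
```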

Flashinfer

  • Kernel support for models including Kimi-K2, DeepSeek, and Qwen3-Next
  • Attention Updates
    • Blackwell (SM10x) performance
    • NSA & DSA support (SM10x)
  • Investigating
    • Sparse Attention APIs – Diffusion models
    • Deterministic use cases & requirements
  • FP4 KV-Cache Support
  • New compute + comms fused kernels (see the reference sketch after this list)
    • AR + RMSNorm
    • AR + RMSNorm + Quant
  • Improved packaging & deployment
    • jit-cache & cubins
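
For the fused compute + comms kernels above, the sketch below shows the unfused baseline that an AR + RMSNorm fusion would replace: a tensor-parallel all-reduce, residual add, and RMSNorm issued as separate ops. This is a plain-PyTorch reference for the math only, not the Flashinfer kernel or its API; the "+ Quant" variant would additionally cast the normalized output to FP8/FP4 inside the same launch.

```python
# Unfused reference for what an AR + RMSNorm fused kernel computes.
# Assumes an already-initialized tensor-parallel process group.
import torch
import torch.distributed as dist

def allreduce_rmsnorm_reference(x: torch.Tensor,
                                residual: torch.Tensor,
                                weight: torch.Tensor,
                                eps: float = 1e-6) -> torch.Tensor:
    dist.all_reduce(x)                        # sum partial TP outputs across ranks
    x = x + residual                          # residual add
    variance = x.pow(2).mean(-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)       # RMS normalization
    return x * weight                         # learned per-channel scale
```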

Model Optimizer

  • NVFP4 quantization support + inference-optimized HF checkpoints: link (see the fake-quantization sketch after this list)
  • Improved techniques
    • PTQ algorithms (NVFP4 AWQ/GPTQ/Rotation)
    • SGLang hidden-state support for Eagle3 offline training
    • Improved AutoQuantize – more mixed-precision checkpoint options
    • Pruning/distillation improvements – more pruned/NAS models
  • Direct SGLang integration
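
As a rough illustration of what the NVFP4 checkpoints above encode, here is a fake-quantization sketch in plain PyTorch. It assumes the commonly described NVFP4 layout (4-bit E2M1 values with one FP8 E4M3 scale per 16-element block); it is only a reference for accuracy experiments, not the ModelOpt implementation.

```python
# NVFP4-style fake quantization (quantize-dequantize) of a weight tensor.
# Assumed layout: 4-bit E2M1 values, one FP8 (E4M3) scale per 16-element block.
import torch

# Representable magnitudes of a 4-bit E2M1 float.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Round-trip w through an NVFP4-like format; numel must divide by `block`."""
    grid = E2M1_GRID.to(device=w.device, dtype=w.dtype)
    orig_shape = w.shape
    w = w.reshape(-1, block)
    # One scale per block, chosen so the block max maps to the E2M1 max (6.0).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / 6.0
    # The format stores block scales in FP8 E4M3; round-trip to mimic that.
    scale = scale.to(torch.float8_e4m3fn).to(w.dtype).clamp(min=1e-6)
    scaled = w / scale
    # Snap each scaled value to the nearest representable E2M1 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return (grid[idx] * scaled.sign() * scale).reshape(orig_shape)
```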

Kernel Optimization

  • Implement low-latency symmetric kernels with NCCL v2.29
  • Test and upgrade cutlass/cutlass-dsl to the latest version (v4.4)
  • Replace slower torch/triton kernels with experimental cuTile/CuTe DSL implementations
  • Move more kernels from sgl-kernel to JIT compilation (see the sketch after this list)
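
For the sgl-kernel → JIT item above, the snippet below shows the general pattern with torch.utils.cpp_extension.load_inline: the kernel is compiled and cached on first use instead of being shipped as a prebuilt binary. The CUDA kernel here is a throwaway toy, not an SGLang kernel, and SGLang's actual JIT path may differ.

```python
# JIT-compile a toy CUDA kernel at runtime instead of shipping it prebuilt.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void scale_kernel(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * s;
}

torch::Tensor scale(torch::Tensor x, double s) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    int threads = 256, blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(),
                                      static_cast<float>(s), n);
    return y;
}
"""
cpp_src = "torch::Tensor scale(torch::Tensor x, double s);"

# Built and cached under ~/.cache/torch_extensions on first call; reused afterwards.
mod = load_inline(name="toy_jit_scale", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["scale"])

x = torch.randn(1024, device="cuda")
y = mod.scale(x, 2.0)
```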

PD Disaggregation/Large Scale Serving

  • Benchmarking + Optimization
  • Dynamo
    • Production-ready GB200/GB300 + Slurm/K8s recipes
    • KVBM + Hi-cache integration
    • KV Cache connector API
    • Explore integration of AI Configurator/Grove/ModelExpress
  • NIXL
    • KV transfer optimization
    • Layer-/chunk-wise KV transfer (see the overlap sketch after this list)
  • Resilience in Prod Deployment
    • NIXL EP for fault tolerance + DP ranks created without a fixed world size
    • Lightweight companion process for weights, enabling fast restarts and checkpointing
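
To make the layer-/chunk-wise KV transfer item above concrete, here is a minimal overlap sketch: each layer's KV is shipped on a side CUDA stream as soon as it is produced, so the transfer of layer i overlaps with the prefill compute of layer i+1. The copy to pinned host memory stands in for the real NIXL/RDMA put to the decode worker; shapes and the prefill_layer stub are illustrative only.

```python
# Overlap per-layer KV transfer with prefill compute using a side CUDA stream.
import torch

num_layers, tokens, heads, dim = 4, 1024, 8, 128
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Staging buffers standing in for the decode worker's KV pool.
staging = [torch.empty(tokens, heads, dim, pin_memory=True) for _ in range(num_layers)]

def prefill_layer(i: int) -> torch.Tensor:
    """Placeholder for the real attention/MLP compute of layer i."""
    return torch.randn(tokens, heads, dim, device=device)

kv_alive = []
for i in range(num_layers):
    kv = prefill_layer(i)                            # layer i on the default stream
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):             # ship layer i while layer i+1 computes
        staging[i].copy_(kv, non_blocking=True)
    kv_alive.append(kv)                              # keep KV alive until its copy finishes

copy_stream.synchronize()                            # all layer transfers done
```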

DGX Station & DGX Spark

  • DGX Spark
    • Introduce model support matrix for SGLang playbook
    • Enable community contribution
  • DGX Station
    • Day 0 Support with runnable examples

CI/CD

  • Refactors and improvements: [Roadmap] CI suites organization #13808
    • Unit Level (Builds/Utility/Features)
    • Kernels
    • E2E Sanity
    • E2E Performance
    • Create tests/execution plan
  • Ensure enough HW to improve CI/CD
    • Enable CUDA 13 CI workflow
    • Add B200/B300 CI runners
    • Add whole rack GB200 CI (if possible)
  • Add a performance dashboard for popular models (e.g., DeepSeek R1) to track performance changes
