Skip to content

SGLang Autonomous Model Gateway Roadmap #13098

@slin1237

Description

@slin1237

Gateway 2.4 and 2.5 Final Releases

  • MCP labeling support for built-in tools
  • HA support for data synchronization
  • WASM support for custom plugins
  • Bug fixes for gRPC models
  • Additional model support (min-m2, kimi-k2 thinking)
  • Support background mode for responses API across all models and OAI router

SGLang Autonomous Model Gateway 3.0

Multimodality

  • Support multimodality and image processor, preferably using PyO3 binding existing SGLang image processor
  • Support both URL and raw data image content

Semantic Routing

  • Support PII and classify API for classifying intent and complexity of the input
  • Training new model for semantic routing
  • Publish training library for customers to train on their own data
  • Publish models to HuggingFace
  • Support automatic routing in multi-router mode (use Candle to execute those models)

SLO-Based Routing

  • Allow Gateway to actively listen to SGLang server's KV cache events to better handle routing decisions in gRPC mode
  • Define SLO criteria, such as latency, accuracy, cost, and preference; define set of APIs, preferably HTTP headers to decide the best routing decision
  • Allow SGLang server to start with both gRPC and HTTP server

Gateway UI

  • Terminal UI which includes components such as router metrics, worker metrics, worker metadata, router metadata, and active logs
  • Reactive UI to launch workers remotely; this should support both local machine and remote, with SSH as a beta feature for remote support

Message API Support

  • Natively support Anthropic Message API instead of wrapping around chat completion in gRPC mode
  • HTTP mode routing will fall back to wrapping around chat completion
  • Natively support MCP calls and multi-turn in Anthropic Message API
  • Add continuous integration test for Message API; critical model to support is M2

Build and Language Support Improvement

  • Binding to Go
  • Binding to Node.js
  • Better organization for bindings across all three languages
  • Restructure project as Cargo workspace to streamline multi-crate development and dependency management
  • Publish Rust crate during CI
  • Optimize build and config to leverage ccache properly
  • Update Docker build for multi-architecture support

gRPC Multi-Model Gateway Support

  • Introduce model card data structure to worker, which includes metadata such as tokenizer, chat template, reasoning parser, tool parser, DP size, TP size, etc.
  • Add gRPC endpoint to fetch tokenizer, chat template, and remote Python code for multimodality support
  • Add registry pattern to tokenizer which maps model family to tokenizer

Metrics and Observability Framework

Core Metrics Improvements

  • Model-Specific Metrics
    • Add TTFT (Time to First Token) tracking per model instance with labels for model_id, worker_id
    • Implement token throughput metrics per model (input/output tokens per second)
    • Track generation speed metrics (tokens/second) during streaming per model

OpenTelemetry Integration

  • Distributed Tracing
    • Integrate OpenTelemetry SDK with proper span creation and propagation
    • Add trace context propagation between router and workers (W3C TraceContext)
    • Implement span attributes for model_id, worker_id, request_type, batch_size
    • Create custom spans for routing decisions, queue operations, and retries
    • Add OTLP exporter support for Jaeger, Tempo, and other backends

Dashboard and Visualization

  • Observability UI
    • Create Grafana dashboard templates for standard deployments
    • Add real-time metrics streaming to terminal UI

Sub-issues

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions