[Router-GO] implement a Go SGLang Router - OpenAI Compatible API Server by whybeyoung · Pull Request #14770 · sgl-project/sglang

whybeyoung · 2025-12-10T01:53:20Z

Go SGLang Router - OpenAI Compatible API Server

Go SGLang Router is a high-performance OpenAI-compatible API server that communicates with the SGLang backend via gRPC and performs efficient preprocessing and postprocessing through Rust FFI.

CC @slin1237. This PR still needs more tests.

Features

✅ OpenAI API Compatible: Fully compatible with OpenAI Chat Completions API
✅ High Performance: Low latency and high throughput using gRPC and Rust FFI
✅ Streaming Support: Server-Sent Events (SSE) streaming responses
✅ Thread-Safe: Pre-created tokenizer handle, lock-free concurrency
✅ Graceful Shutdown: Context cancellation mechanism to avoid resource leaks and panics

Architecture Overview

Important Note: gRPC mode still calls FFI, which is used for:

Preprocessing: chat_template and tokenization (request phase)
Postprocessing: token decoding and tool parsing (response phase)

gRPC is used only for communication with the SGLang backend; input and output processing relies entirely on Rust FFI.

┌─────────────────────────────────────────────────────────────────┐
│                        HTTP Client                               │
│                    (OpenAI API Format)                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    FastHTTP Server                               │
│              handlers/chat.go:HandleChatCompletion               │
│              - Parse request JSON                                │
│              - SetBodyStreamWriter (SSE)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│              SGLang Client (client.go)                           │
│         CreateChatCompletionStream(ctx, req)                      │
│         - Wraps gRPC client                                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│          gRPC Client (internal/grpc/client_grpc.go)              │
│         CreateChatCompletionStream(ctx, reqJSON)                 │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Step 1: FFI Preprocess (Rust FFI)                       │  │
│  │  - ffi.PreprocessChatRequestWithTokenizer()              │  │
│  │  - chat_template application                              │  │
│  │  - tokenization                                           │  │
│  │  - tool constraints generation                            │  │
│  │  Returns: PromptText, TokenIDs, ToolConstraintsJSON,     │  │
│  │           PromptTokens                                   │  │
│  └────────────────────┬─────────────────────────────────────┘  │
│                       │                                          │
│                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Step 2: Build gRPC Request                              │  │
│  │  - Parse request JSON (model, temperature, etc.)        │  │
│  │  - Create proto.GenerateRequest                         │  │
│  │  - Set TokenizedInput (PromptText, TokenIDs)            │  │
│  │  - Set SamplingParams (temperature, top_p, top_k, etc.)  │  │
│  │  - Set Constraints (from ToolConstraintsJSON)            │  │
│  └────────────────────┬─────────────────────────────────────┘  │
│                       │                                          │
│                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Step 3: Create gRPC Stream                              │  │
│  │  - client.Generate(generateReq) → gRPC stream            │  │
│  │  - Connects to SGLang Backend (Rust)                      │  │
│  └────────────────────┬─────────────────────────────────────┘  │
│                       │                                          │
│                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Step 4: Create Converter & BatchPostprocessor          │  │
│  │  - ffi.CreateGrpcResponseConverterWithTokenizer()       │  │
│  │  - Uses preprocessed.PromptTokens for initial count      │  │
│  │  - ffi.NewBatchPostprocessor(batchSize=1, immediate)     │  │
│  └────────────────────┬─────────────────────────────────────┘  │
│                       │                                          │
│                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Step 5: Start readLoop (Background Goroutine)           │  │
│  │  - go grpcStream.readLoop()                               │  │
│  │  - Returns GrpcChatCompletionStream immediately          │  │
│  └────────────────────┬─────────────────────────────────────┘  │
└───────────────────────┼────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│         GrpcChatCompletionStream.readLoop()                     │
│         (Background Goroutine)                                   │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Recv() Goroutine (Dedicated)                            │  │
│  │  - Continuously calls stream.Recv()                      │  │
│  │  - Sends results to recvChan (buffered, 2000)          │  │
│  │  - Exits on ctx.Done() or error                          │  │
│  │  - Calls stream.CloseSend() on ctx.Done()               │  │
│  └────────────────────┬─────────────────────────────────────┘  │
│                       │                                          │
│                       ▼                                          │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Main Loop                                                │  │
│  │  - Reads from recvChan                                    │  │
│  │  - For each proto.GenerateResponse:                      │  │
│  │    → go processAndSendResponse() (async)                 │  │
│  │      - protoToJSON() converts proto to JSON string        │  │
│  │      - batchPostprocessor.AddChunk(protoJSON)            │  │
│  │        → FFI postprocessing (token decoding, tool parsing)│  │
│  │        → Returns OpenAI-format JSON strings               │  │
│  │      - Sends JSON to resultJSONChan (buffered, 10000)     │  │
│  │      - All operations check ctx.Done() for cancellation  │  │
│  │  - On EOF: flush batch, send remaining results, return  │  │
│  │  - On error: send to errChan (buffered, 100)            │  │
│  │  - defer: cancel ctx, wait goroutines, close channels     │  │
│  └────────────────────┬─────────────────────────────────────┘  │
└───────────────────────┼────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│         resultJSONChan (Buffered Channel, 10000)                 │
│         - Contains OpenAI-format JSON strings                    │
│         - Ready for consumption                                  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│         ChatCompletionStream.RecvJSON()                          │
│         (client.go:410)                                          │
│         - Direct wrapper: return grpcStream.RecvJSON()           │
│         - No intermediate processing                             │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│         FastHTTP SetBodyStreamWriter                             │
│         (handlers/chat.go:159)                                   │
│         - Loop: stream.RecvJSON() → format SSE → flush         │
│         - Format: "data: {json}\n\n"                           │
│         - Final: "data: [DONE]\n\n"                             │
│         - Immediate flush after each chunk                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                        HTTP Client                               │
│                    (SSE Stream)                                  │
│                    Receives: data: {...}\n\n                    │
└─────────────────────────────────────────────────────────────────┘

Benchmark vs. Rust Router

## Rust
#Input tokens: 50561
#Output tokens: 25883
Starting warmup with 5 sequences...
Warmup completed with 5 sequences. Starting main benchmark run...

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    20.0
Max request concurrency:                 20
Successful requests:                     100
Benchmark duration (s):                  37.71
Total input tokens:                      50561
Total input text tokens:                 50561
Total input vision tokens:               0
Total generated tokens:                  25883
Total generated tokens (retokenized):    25599
Request throughput (req/s):              2.65
Input token throughput (tok/s):          1340.75
Output token throughput (tok/s):         686.35
Total token throughput (tok/s):          2027.10
Concurrency:                             18.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   7008.05
Median E2E Latency (ms):                 7061.24
---------------Time to First Token----------------
Mean TTFT (ms):                          156.09
Median TTFT (ms):                        133.81
P99 TTFT (ms):                           318.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.59
Median TPOT (ms):                        26.75
P99 TPOT (ms):                           29.18
---------------Inter-Token Latency----------------
Mean ITL (ms):                           26.71
Median ITL (ms):                         23.61
P95 ITL (ms):                            66.11
P99 ITL (ms):                            115.30
Max ITL (ms):                            201.08
==================================================

## Go
#Input tokens: 50561
#Output tokens: 25883
Starting warmup with 5 sequences...
Warmup completed with 5 sequences. Starting main benchmark run...

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    20.0
Max request concurrency:                 20
Successful requests:                     100
Benchmark duration (s):                  34.22
Total input tokens:                      50561
Total input text tokens:                 50561
Total input vision tokens:               0
Total generated tokens:                  22970
Total generated tokens (retokenized):    31740
Request throughput (req/s):              2.92
Input token throughput (tok/s):          1477.70
Output token throughput (tok/s):         671.32
Total token throughput (tok/s):          2149.03
Concurrency:                             18.42
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   6303.33
Median E2E Latency (ms):                 6294.46
---------------Time to First Token----------------
Mean TTFT (ms):                          157.10
Median TTFT (ms):                        149.16
P99 TTFT (ms):                           251.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.49
Median TPOT (ms):                        27.15
P99 TPOT (ms):                           28.73
---------------Inter-Token Latency----------------
Mean ITL (ms):                           26.97
Median ITL (ms):                         24.61
P95 ITL (ms):                            52.39
P99 ITL (ms):                            86.52
Max ITL (ms):                            194.55
==================================================

Quick Start

Start Server

./run.sh

The server listens on :8080.

Usage Example

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Key Design

1. Thread-Safe Tokenizer

Pre-create TokenizerHandle at startup
Rust side uses Arc<dyn TokenizerTrait>, thread-safe
Lock-free concurrency, eliminating lock contention

2. Context Cancellation Mechanism (Graceful Shutdown)

Use context.Context cancellation mechanism
In readLoop's defer: cancel context first, then wait for all goroutines to complete, finally close channels
processAndSendResponse checks ctx.Done() at function start, all select statements include case <-s.ctx.Done()
Avoids "send on closed channel" panic
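The cancel, wait, close ordering can be shown with a stripped-down stream type. This is a sketch of the pattern only: the `stream` struct, `send`, `process`, and `shutdown` are hypothetical names rather than the PR's `GrpcChatCompletionStream`, and strings stand in for OpenAI-format JSON chunks.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// stream is a hypothetical, reduced stand-in for the gRPC stream wrapper.
type stream struct {
	ctx     context.Context
	cancel  context.CancelFunc
	results chan string
	wg      sync.WaitGroup
}

func newStream(buf int) *stream {
	ctx, cancel := context.WithCancel(context.Background())
	return &stream{ctx: ctx, cancel: cancel, results: make(chan string, buf)}
}

// send delivers v unless the stream is cancelled. Because every send also
// selects on ctx.Done(), a full channel can never block a worker forever.
func (s *stream) send(v string) bool {
	select {
	case s.results <- v:
		return true
	case <-s.ctx.Done():
		return false
	}
}

// process starts a tracked worker, mirroring "go processAndSendResponse()".
func (s *stream) process(v string) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		if s.ctx.Err() != nil {
			return // cancelled before any work started
		}
		s.send(v)
	}()
}

// shutdown applies the ordering from the text: cancel first, wait for all
// workers, and only then close the channel.
func (s *stream) shutdown() {
	s.cancel()
	s.wg.Wait()
	close(s.results)
}

func main() {
	s := newStream(8)
	for i := 0; i < 3; i++ {
		s.process(fmt.Sprintf("chunk-%d", i))
	}
	s.shutdown()
	// Print whatever reached the buffer before cancellation; the count varies.
	for v := range s.results {
		fmt.Println(v)
	}
}
```

Closing the channel before `wg.Wait()` returns would let a late worker hit `s.results <- v` on a closed channel, which is the panic this ordering is meant to rule out.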

3. Cancellable Recv()

Use dedicated goroutine to execute Recv()
Pass results through recvChan
Call CloseSend() when context is cancelled to make Recv() return error

4. Simplified Channel Design

resultJSONChan: Main data channel (gRPC layer)
errChan: Error channel (gRPC layer)
recvChan: Internal communication channel (gRPC layer)
Removed redundant channels and duplicate reads

Configuration

Channel Buffer Sizes

type ChannelBufferSizes struct {
    ResultJSONChan int // Default: 10000
    ErrChan        int // Default: 100
    RecvChan       int // Default: 2000
}

Timeout Configuration

type Timeouts struct {
    KeepaliveTime    time.Duration // Default: 300s
    KeepaliveTimeout time.Duration // Default: 20s
    CloseTimeout     time.Duration // Default: 5s
}

Performance Optimizations

Pre-create Tokenizer: Created at startup to avoid first request latency
Lock-Free Concurrency: Tokenizer is thread-safe, no locks needed
Lazy Parsing: JSON parsing deferred until needed
Direct JSON Passing: RecvJSON() avoids parse/serialize overhead
Immediate Batching: batchSize=1, no delay
Async Processing: readLoop processes in background, doesn't block request handling
Configurable Buffers: Adjust channel sizes based on concurrency needs
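To make "Immediate Batching" concrete, here is a toy batcher written in pure Go. It is not the FFI-backed `BatchPostprocessor` from the PR; `batcher`, `add`, and `flush` are hypothetical names used only to show why a batch size of 1 adds no delay: every added chunk completes a batch on its own.

```go
package main

import "fmt"

// batcher is a hypothetical, pure-Go stand-in for a batch postprocessor.
type batcher struct {
	size int
	buf  []string
}

func newBatcher(size int) *batcher {
	if size < 1 {
		size = 1
	}
	return &batcher{size: size}
}

// add buffers a chunk and returns the batch once it is full, otherwise nil.
// With size 1 the batch is always full, so chunks pass straight through.
func (b *batcher) add(chunk string) []string {
	b.buf = append(b.buf, chunk)
	if len(b.buf) < b.size {
		return nil
	}
	return b.flush()
}

// flush returns whatever is buffered; callers use it at end of stream.
func (b *batcher) flush() []string {
	out := b.buf
	b.buf = nil
	return out
}

func main() {
	immediate := newBatcher(1)
	fmt.Println(immediate.add("tok-1")) // batch of one, emitted right away

	pairs := newBatcher(2)
	fmt.Println(pairs.add("tok-1")) // nothing yet, still buffering
	fmt.Println(pairs.add("tok-2")) // both chunks emitted together
}
```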

File Structure

sgl-model-gateway/bindings/golang/
├── client.go                          # High-level client API
├── internal/
│   ├── grpc/
│   │   └── client_grpc.go            # gRPC client implementation
│   ├── ffi/                          # FFI bindings (Rust)
│   └── proto/                        # Protobuf definitions
└── examples/
    └── oai_server/
        ├── handlers/
        │   └── chat.go               # HTTP request handling
        ├── models/
        │   └── chat.go               # Request/response models
        └── service/
            └── sglang_service.go      # Service layer

Error Handling

Context Cancellation Mechanism

Client disconnects → SetBodyStreamWriter detects flush error
Cancel streamCtx → readLoop detects ctx.Done()
Call stream.CloseSend() → Recv() goroutine returns error
readLoop defer executes:
- Set closed flag
- Cancel context (if not already cancelled)
- Wait for all processAndSendResponse goroutines to complete (processWg.Wait())
- Close all channels (resultJSONChan, errChan, readLoopDone)
Clean up resources and exit

Channel Blocking and Race Condition Prevention

Context cancellation mechanism: All channel sends use select statements with case <-s.ctx.Done()
Graceful exit: When context is cancelled, all blocking send operations can return immediately
WaitGroup synchronization: readLoop's defer uses processWg.Wait() to ensure all goroutines complete before closing channels
Avoid panic: Through context cancellation and WaitGroup synchronization, avoids "send on closed channel" panic

Key Functions

CreateChatCompletionStream

Location: internal/grpc/client_grpc.go:108

Preprocess request (FFI)
Build gRPC request
Create converter and batch processor
Start readLoop

readLoop

Location: internal/grpc/client_grpc.go:290

Start Recv() goroutine (continuously calls stream.Recv())
Process proto responses
Asynchronously call processAndSendResponse (tracked with processWg)
Graceful shutdown in defer:
- Set closed flag
- Cancel context (if not already cancelled)
- Wait for all processAndSendResponse goroutines to complete (processWg.Wait())
- Close all channels (resultJSONChan, errChan, readLoopDone)

processAndSendResponse

Location: internal/grpc/client_grpc.go:379

Check ctx.Done() at function start, return immediately if cancelled
Convert proto to JSON
Call FFI batch processor
All select statements include case <-s.ctx.Done() for graceful shutdown handling
Send JSON to channel

RecvJSON

Locations:

internal/grpc/client_grpc.go:412: gRPC layer implementation
client.go:410: Client wrapper layer

Read from resultJSONChan
Directly return JSON string, no parsing needed


(sgl-project#14770) * [model-gateway] Dynamically Populate Tool Call Parser Choices (sgl-project#14807) * Support HTTP response status code prometheus metrics (sgl-project#14710) * Fix router keep nonzero metrics after worker is deleted (sgl-project#14819) * Tiny fix incorrect worker removal command (sgl-project#14822) * [NPU] bug fix for mtp and w4a8 (sgl-project#14806) * [CI] fix UT success check in `test_eagle_infer_beta_dp_attention.py` (sgl-project#14831) * Fix CI registry scan to only check test/registered directory (sgl-project#14812) * [model-gateway] add anthropic message api spec (sgl-project#14834) * [diffusion] doc: fix tiny typo in multimodal_gen/README.md (sgl-project#14830) * [model-gateway] support customizing Prometheus duration buckets (sgl-project#14716) * [model-gateway] support engine response http status statistics in router (sgl-project#14712) * [CI] Reduce stage-b auto-partition from 4 to 2 (sgl-project#14769) Co-authored-by: Liangsheng Yin <lsyincs@gmail.com> * Apply back moe_sum_reduce for fused_marlin_moe (sgl-project#14829) * [diffusion] parallel: pad tokens for video models under sp (sgl-project#14833) * [diffusion] CI: use unified sampling_params for CI (sgl-project#14045) * [Auto Sync] Update tool_chat_template_deepseekv31.jinja (20251210) (sgl-project#14837) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jue Wang <zjuwangjue@gmail.com> * Revert transformers to 4.57.1 (sgl-project#14801) * [model-gateway] Fix incompatible metric comparison in` PowerOfTwo` policy (sgl-project#14823) * [bugfix] qwen25-VL support lora (sgl-project#14638) * fix lora target all + csgmv backend (sgl-project#14796) * [model-gateway] adds default implementations to RouterTrait in mod.rs (sgl-project#14841) * [AMD] Add model to AMD nightly test (sgl-project#14442) * Treat unittest SkipTest exception as pass instead of as failure (sgl-project#14847) * [model-gateway] code clean up on oai router 
(sgl-project#14850) * [model-gateway] fix import order in oai conversation (sgl-project#14851) * fix fp8 gemm nightly CI (sgl-project#14844) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * fix: restrict cache validation behaviors to CI only (sgl-project#14849) * Fix CUDA version handling in ci_install_deepep.sh (sgl-project#14854) * Fix TestGLM41VPPAccuracy test flakiness (sgl-project#14848) * Minor code style fix for dllm (sgl-project#14836) * Enable TP for Mamba-based models (sgl-project#14811) Signed-off-by: Roi Koren <roik@nvidia.com> * [CI] Temp disable gb200 test (sgl-project#14865) * Refactor Marlin MoeRunner (sgl-project#14554) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> * [6/n] Fix `num_token_non_padded` computation in prefill (sgl-project#14313) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Runkai Tao <rt572@physics.rutger.edu> * Remove myself to test CI gate issue (sgl-project#14871) * fix: creating blobs only once for publish trace retries (sgl-project#14845) * Move and update MindSpore docs, make it appear on the online documentation (sgl-project#14861) Co-authored-by: wangtiance <tiancew@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix nightly vlm ci : restore original eval for requests without regex (sgl-project#14875) * Only count limitations for previous runs that reaches the test stages (sgl-project#14856) * [CI][BUG] fix ib setup for disaggregation hicache test (sgl-project#14877) Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com> * [Fix] Remove unused import from test_disaggregation_hicache.py (sgl-project#14880) * fix: adding temporary bypass for nightly tests (sgl-project#14876) * Avoid deleting entire cache for missing shards (sgl-project#14754 follow-up) (sgl-project#14853) * Tiny add more 
error info for bench_serving (sgl-project#14827) * Tiny support range ratio in GSP in bench serving (sgl-project#14828) * [diffusion] feat: enable torch compile to eliminate GPU bubble (sgl-project#13641) Co-authored-by: jianyingzhu <53300651@qq.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [NPU] adapt dsv3.2 nsa prefill context parallel (sgl-project#14541) * [diffusion] feat: support sageattn & sageattn3 backend (sgl-project#14878) * dsv32 multistream opt * clean code * delete renormalize in topk * dsv32 use batch_matmul_transpose in MTP * modify comment * Support dynamic w8a8 * dsv3 support ascend_fuseep * rebase modify --------- Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com> Signed-off-by: CLFutureX <chenyongqyl@163.com> Signed-off-by: Kun(llfl) <i@imux.top> Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com> Signed-off-by: Roi Koren <roik@nvidia.com> Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: key4ng <rukeyang@gmail.com> Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com> Co-authored-by: Sam <lsam@nvidia.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Yueming Yuan <yy28@illinois.edu> Co-authored-by: Junrong 
Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: sglang-bot <sglangbot@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: zyksir <zhuyikai.zyk@gmail.com> Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com> Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai> Co-authored-by: dev <devnull@example.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: WenhaoZhang <42087078+niehen6174@users.noreply.github.com> Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: roikoren755 <26850796+roikoren755@users.noreply.github.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: Yuxuan Zhang <2448370773@qq.com> Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> Co-authored-by: blahblah <28567807+Brain97@users.noreply.github.com> Co-authored-by: shuxiguo <shuxiguo@meituan.com> Co-authored-by: DefTruth <qiustudent_r@163.com> Co-authored-by: Hudson Xing <77495133+harvenstar@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Tony Lu <tonylu@linux.alibaba.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Wenyi Xu <wenyixu101@gmail.com> Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com> Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com> Co-authored-by: blzheng <beilei.zheng@intel.com> Co-authored-by: Rain Jiang 
<96632942+rainj-me@users.noreply.github.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Feng Su <sufeng@linux.alibaba.com> Co-authored-by: niehen6174 <nihen6174@gmail.com> Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com> Co-authored-by: harrisonlimh <97203667+harrisonlimh@users.noreply.github.com> Co-authored-by: Lee Nau <lnau@nvidia.com> Co-authored-by: almaslof <187766901+almaslof@users.noreply.github.com> Co-authored-by: Rain H <2510421000@qq.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Chen1022 <jincong.cjc@ant-intl.com> Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com> Co-authored-by: Jinwei Yao <jinweiy@illinois.edu> Co-authored-by: 赵晨阳 <zhaochen20@outlook.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: AichenF <aichenf@nvidia.com> Co-authored-by: jianyingzhu <53300651@qq.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: Vladimir Serov <serov.vladimir.zser@gmail.com> Co-authored-by: khalilzhk <khalilzhk@gmail.com> Co-authored-by: Zhiyu <zhiyuc@nvidia.com> Co-authored-by: wentx <3843588+momaek@users.noreply.github.com> Co-authored-by: Nicholas <45984215+liusy58@users.noreply.github.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: Muqi Li <muqi1029@gmail.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Prozac614 <dwt614707404@163.com> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: yctseng0211 <yctseng@amd.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: PiteXChen <44110731+CLFutureX@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> 
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: Jimmy <29097382+jimmy-evo@users.noreply.github.com> Co-authored-by: Even Zhou <even.y.zhou@outlook.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> Co-authored-by: kun-llfl <i@imux.top> Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com> Co-authored-by: yrk111222 <2493404415@qq.com> Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com> Co-authored-by: Douglas Yang <dyang@college.harvard.edu> Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com> Co-authored-by: Beichen-Ma <bm685@cornell.edu> Co-authored-by: MingxuZh <109504044+MingxuZh@users.noreply.github.com> Co-authored-by: ShawnY112358 <61113840+ShawnY112358@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com> Co-authored-by: Peng Zhang <aniz1905@gmail.com> Co-authored-by: Hecate0821 <hec4te0821@gmail.com> Co-authored-by: eternally-z <zzywzj@gmail.com> Co-authored-by: Wilboludriver <wilbolu@outlook.com> Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com> Co-authored-by: Ke Bao <ispobaoke@gmail.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: liupeng374 <liupeng374@huawei.com> Co-authored-by: Li Jinliang <975761915@qq.com> Co-authored-by: Liangsheng Yin <lsyincs@gmail.com> Co-authored-by: Jue Wang <zjuwangjue@gmail.com> Co-authored-by: Praneth Paruchuri <pranethparuchuri@gmail.com> Co-authored-by: Siyuan Chen <41201609+SYChen123@users.noreply.github.com> Co-authored-by: michael-amd <Michael.Zhang@amd.com> Co-authored-by: Trang Do <200224632+trangdough@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: yuchengz816-bot <yuchengz816@gmail.com> Co-authored-by: Runkai Tao <rt572@physics.rutger.edu> Co-authored-by: 
Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Tiance Wang <wangtiance@gmail.com> Co-authored-by: wangtiance <tiancew@qq.com> Co-authored-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com> Co-authored-by: liupeng374 <782420244@qq.com>

whybeyoung added 7 commits December 10, 2025 00:52

support golang sdk and example server

b10089a

update

c8c4107

update

86c3895

get model info

e8b2025

update

8640dd1

update

6fa0171

support generate

5ff2dcf

whybeyoung requested review from CatherineSue and slin1237 as code owners December 10, 2025 01:53

github-actions bot added the documentation and model-gateway labels Dec 10, 2025

lint

77b07a7

slin1237 added the run-ci label Dec 10, 2025

slin1237 approved these changes Dec 10, 2025

slin1237 merged commit 766476f into sgl-project:main Dec 10, 2025
74 of 75 checks passed

shevateng0 pushed a commit to shevateng0/sglang that referenced this pull request Dec 10, 2025

[SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API …

866a2e3

…Server (sgl-project#14770)

Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025

[SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API …

e5b8432

…Server (sgl-project#14770)

GuoYechang pushed a commit to GuoYechang/sglang that referenced this pull request Jan 13, 2026

[SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API …

6b7ca29

…Server (sgl-project#14770)

slin1237 merged 8 commits into sgl-project:main from whybeyoung:go_bindings_pr

gemini-code-assist bot commented Dec 10, 2025

whybeyoung commented Dec 10, 2025 • edited

Go SGLang Router - OpenAI Compatible API Server

Features

Architecture Overview

Benchmark vs Rust router

Quick Start

Start Server

Usage Example

Key Design

1. Thread-Safe Tokenizer

2. Context Cancellation Mechanism (Graceful Shutdown)

3. Cancellable Recv()

4. Simplified Channel Design

Configuration

Channel Buffer Sizes

Timeout Configuration

Performance Optimizations

File Structure

Error Handling

Context Cancellation Mechanism

Channel Blocking and Race Condition Prevention

Key Functions

CreateChatCompletionStream

readLoop

processAndSendResponse

RecvJSON
