[Roadmap] Encoder Disaggregation for Multi-modal models #15118

## Motivation
Multimodal support enables the framework to handle and process multiple types of data beyond text, including images, videos, and audio, allowing for more comprehensive AI applications that can understand and reason across different modalities.

The original single-encoder architecture limits the TTFT and throughput of multimodal models. To address this, we introduced an encoder DP (data-parallel) method in https://github.com/sgl-project/sglang/pull/13126. However, because this method colocates the encoder with the language model, its flexibility and scalability are limited, and the encoder consumes GPU resources that can interfere with the language part. Therefore, we introduce an encoder disaggregation method in this PR: https://github.com/sgl-project/sglang/pull/12263.

Thanks to the **RedNote (xiaohongshu) hilab team** @gty111, the **Alibaba Cloud Computing** team @liusy58, and the **Ant Group SCT** team @ZhengWG for their collaboration. We combined the advantages of the different designs to build this first version. We also thank @yhyang201 and @mickqian for their insightful reviews and detailed quality checks.

## Encoder Disaggregation Roadmap
In addition to bug fixes, here are the features we would like to integrate step by step.

### Encoder Configuration Optimization, Better Task Distribution, and Load Balancing (with SGLang Model Gateway)
 - [ ] Support dynamic scaling and configuration of disaggregated encoders (the static `--encoder-urls` solution might be deprecated in the future) @gty111 @liusy58
 - [ ] Adaptive EPD/PD: choose the path per request. If a request contains multiple images, route them to the disaggregated encoder (EPD, Encode-Prefill-Decode); otherwise, fall back to the standard PD (Prefill-Decode) path.
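
The adaptive EPD/PD rule above could be sketched roughly as follows. This is only an illustration of the routing decision, not SGLang code: `Request`, `choose_path`, and the `min_images_for_encoder` threshold are hypothetical names, and the threshold of 2 is an assumption derived from "multiple images".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """Hypothetical request shape: a prompt plus zero or more image references."""
    text: str
    image_urls: List[str] = field(default_factory=list)


def choose_path(req: Request, min_images_for_encoder: int = 2) -> str:
    """Return "EPD" to use the disaggregated encoder, or "PD" to fall back.

    The threshold is an assumed tunable; "multiple images" is read here as
    two or more.
    """
    if len(req.image_urls) >= min_images_for_encoder:
        return "EPD"
    return "PD"
```

A real policy would likely also weigh image resolution and current encoder load, which this sketch ignores.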

### Modality Extension
 - [ ] Video and audio input support @ZhengWG https://github.com/sgl-project/sglang/pull/15475 https://github.com/sgl-project/sglang/pull/17824


### Optimization of Communication Efficiency 
 - [ ] The current version supports the ZMQ and Mooncake transfer engine backends for communication, but the overall design and performance are not yet optimal. @gty111 @ZhengWG https://github.com/sgl-project/sglang/pull/16487

### Global Encoder Cache

 - [ ] Design a distributed cache that stores encoder outputs and uses hash-based matching to skip re-processing duplicate multi-modal data (e.g., images, videos, audio), further reducing TTFT @ShangmingCai @liusy58 @hzh0425
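
As a rough illustration of the hash-based matching idea, the sketch below keys encoder outputs by a content hash of the raw input bytes. It is a single-process, in-memory stand-in, not the planned distributed design: `EncoderCache`, `get_or_encode`, and the `encode` callback are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from typing import Callable, Dict, List


class EncoderCache:
    """In-memory stand-in for a global encoder cache (hypothetical API)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(data: bytes) -> str:
        # Identical bytes always map to the same key, so duplicates match.
        return hashlib.sha256(data).hexdigest()

    def get_or_encode(
        self, data: bytes, encode: Callable[[bytes], List[float]]
    ) -> List[float]:
        key = self.key_for(data)
        if key in self._store:
            self.hits += 1  # duplicate input: skip the encoder entirely
            return self._store[key]
        self.misses += 1
        result = encode(data)  # run the (expensive) encoder only once
        self._store[key] = result
        return result
```

A distributed version would additionally need eviction, cross-node lookup, and a transfer path for the cached embeddings, none of which is modeled here.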

### Compatibility with CUDA Graph @bluecoffee8 
 - [ ] ViT CUDA Graph support
 - [ ] Piecewise CUDA Graph support
 
### Fault tolerance
 - [x] Support EPD error handling: https://github.com/sgl-project/sglang/pull/16670
 
### SGLang Model Gateway support
 - [ ] EPD disaggregation support in SGLang Model Gateway: https://github.com/sgl-project/sglang/pull/17550 @chenzongyao200127 @JasonZhang517
 - [ ] gRPC support for the SGLang encoder server: https://github.com/sgl-project/sglang/pull/16552 @chenzongyao200127 @JasonZhang517

We welcome contributors who are interested in these tasks and willing to join us for rapid development.

