[Roadmap] Encoder Disaggregation for Multi-modal models #15118

@ShangmingCai

Motivation

Multimodal support enables the framework to handle and process multiple types of data beyond text, including images, videos, and audio, allowing for more comprehensive AI applications that can understand and reason across different modalities.

The original single-encoder architecture poses challenges for the TTFT and throughput of multimodal models. To address this, we introduced an encoder DP method in #13126. However, the flexibility and scalability of this colocated DP method are limited, and because the encoder is colocated with the language model, it also consumes GPU resources and can interfere with the language part. Therefore, we introduce an encoder disaggregation method in PR #12263.

Thanks to the RedNote (xiaohongshu) hilab team @gty111, the Alibaba Cloud Computing team @liusy58, and the Ant Group SCT team @ZhengWG for their collaboration. We combined the advantages of different designs and made this first version. We also thank @yhyang201 and @mickqian for their insightful reviews and detailed quality checks.

Encoder Disaggregation Roadmap

In addition to bug fixes, here are the features we would like to integrate step by step.

Encoder Configuration Optimization, Better Task Distribution, and Load Balancing (with SGLang Model Gateway)

  • Support dynamic scaling and configuration of disaggregated encoders (the static --encoder-urls solution might be deprecated in the future) @gty111 @liusy58
  • Adaptive EPD/PD: Make the routing adaptive. Specifically, if a request contains multiple images, it should be routed to the encoder; otherwise, it falls back to the standard PD (Prefill-Decode) path.
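The adaptive EPD/PD rule can be sketched as a small routing function. This is only an illustration of the decision described in the bullet, not the actual SGLang implementation: the `Request` type, the `choose_path` function, and the `min_images_for_encoder` threshold are all hypothetical names, and "multiple images" is assumed here to mean two or more.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """Hypothetical request shape; not SGLang's actual request type."""
    text: str
    images: List[bytes] = field(default_factory=list)


def choose_path(req: Request, min_images_for_encoder: int = 2) -> str:
    """Return "epd" to use a disaggregated encoder, or "pd" otherwise.

    Assumption: "multiple images" means at least `min_images_for_encoder`
    images in one request (default 2).
    """
    if len(req.images) >= min_images_for_encoder:
        return "epd"  # route through Encoder -> Prefill -> Decode
    return "pd"       # fall back to the standard Prefill-Decode path
```

Keeping the threshold as a parameter leaves room to tune the cutoff per deployment, for example based on image size or encoder load, without changing the routing logic.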

Modality Extension

Optimization of Communication Efficiency

Global Encoder Cache

  • Design a distributed cache that stores processed results and uses hash-based matching to skip the processing of duplicate multimodal data (e.g., images, videos, audio), thus further reducing TTFT @ShangmingCai @liusy58 @hzh0425
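As a rough sketch of the hash-based matching idea, the snippet below keys encoder outputs by a SHA-256 digest of the raw input bytes and only runs the encoder on a miss. Everything here is an assumption for illustration: `EncoderCache`, `get_or_encode`, and the in-process dict are hypothetical stand-ins, and a real global cache would use a distributed store and also need eviction and a key that accounts for the model and preprocessing settings.

```python
import hashlib
from typing import Callable, Dict, List


class EncoderCache:
    """Hypothetical content-addressed cache for encoder outputs."""

    def __init__(self, encode: Callable[[bytes], List[float]]):
        self._encode = encode                      # the expensive encoder call
        self._store: Dict[str, List[float]] = {}   # stand-in for a distributed store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(data: bytes) -> str:
        # Identical bytes always produce the same key, so duplicates match.
        return hashlib.sha256(data).hexdigest()

    def get_or_encode(self, data: bytes) -> List[float]:
        k = self.key(data)
        if k in self._store:
            self.hits += 1
            return self._store[k]   # duplicate input: skip the encoder
        self.misses += 1
        out = self._encode(data)
        self._store[k] = out
        return out
```

With this shape, a second request carrying the same image bytes is served from the cache and never reaches the encoder, which is where the TTFT saving would come from.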

Compatibility with CUDA Graph @bluecoffee8

  • ViT CUDA Graph support
  • Piecewise CUDA Graph support

Fault tolerance

SGLang Model Gateway support

We welcome contributors who are interested in these tasks and willing to join us for rapid development.
