Description
Motivation
Multimodal support enables the framework to handle and process multiple types of data beyond text, including images, videos, and audio, allowing for more comprehensive AI applications that can understand and reason across different modalities.
The original single-encoder architecture creates challenges for the TTFT and throughput of multimodal models. To address this, we introduced an encoder DP method in #13126. However, the flexibility and scalability of this colocation-based DP method are limited, and because the encoder is colocated, it also consumes GPU resources and can sometimes interfere with the language part. Therefore, we introduce an encoder disaggregation method in this PR: #12263.
Thanks to the RedNote (xiaohongshu) hilab team @gty111, the Alibaba Cloud Computing team @liusy58, and the Ant Group SCT team @ZhengWG for their collaboration. We combined the advantages of different designs and made this first version. We also thank @yhyang201 and @mickqian for their insightful reviews and detailed quality checks.
Encoder Disaggregation Roadmap
In addition to bug fixes, here are some features we would like to integrate step by step.
Encoder Configuration Optimization, Better Task Distribution, and Load Balancing (with SGLang Model Gateway)
- Support dynamic scaling and configuration of the disaggregated encoder (the static --encoder-urls solution might be deprecated in the future) @gty111 @liusy58
- Adaptive EPD/PD: We can implement an optimization to make this adaptive. Specifically, if multiple images are present, they should be routed to the encoder; otherwise, fall back to the standard PD (Prefill-Decode) path.
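The adaptive EPD/PD idea can be illustrated with a minimal routing sketch. Everything here is hypothetical and not taken from the SGLang codebase: the `Request` type, the `choose_path` function, and the `min_images_for_epd` threshold (set to 2 as one reading of "multiple images") are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """Hypothetical request shape: a prompt plus zero or more image references."""
    text: str
    image_urls: List[str] = field(default_factory=list)


def choose_path(req: Request, min_images_for_epd: int = 2) -> str:
    """Pick a serving path for one request.

    Returns "EPD" (route images through the disaggregated encoder) when the
    request carries at least `min_images_for_epd` images, otherwise "PD"
    (the standard Prefill-Decode path). The threshold is an assumption.
    """
    if len(req.image_urls) >= min_images_for_epd:
        return "EPD"
    return "PD"
```

A real policy would likely also consider image resolution, encoder load, and transfer cost; the single threshold above only shows where such a decision would sit.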
Modality Extension
- Video and audio input support @ZhengWG: [EPD][VLM] support video input(qwen-series) #15475, [EPD][VLM] support video/audio input #17824
Optimization of Communication Efficiency
- The current version supports ZMQ and the Mooncake transfer engine backend for communication, but the overall design and performance are not yet optimal. @gty111 @ZhengWG [EPD][Perf] parallelize ZMQ send for encode server #16487
Global Encoder Cache
- Design a distributed cache that stores processed encoder results and uses hash-based matching to skip reprocessing of duplicate multimodal data (e.g., images, videos, audio), thus further reducing TTFT @ShangmingCai @liusy58 @hzh0425
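As a rough illustration of hash-based matching, the sketch below keys encoder outputs by a SHA-256 digest of the raw input bytes. It is not the planned design: `EncoderCache`, `get_or_encode`, and the in-process dictionary (standing in for a distributed store) are all hypothetical names chosen for this example.

```python
import hashlib
from typing import Any, Callable, Dict


class EncoderCache:
    """Content-addressed cache: identical input bytes are encoded only once."""

    def __init__(self, encode_fn: Callable[[bytes], Any]) -> None:
        self._encode_fn = encode_fn          # the (expensive) encoder call
        self._store: Dict[str, Any] = {}     # stand-in for a distributed store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(data: bytes) -> str:
        # Hash the raw bytes so duplicates match regardless of file name or URL.
        return hashlib.sha256(data).hexdigest()

    def get_or_encode(self, data: bytes) -> Any:
        key = self.key_for(data)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._encode_fn(data)
        self._store[key] = result
        return result
```

For example, wrapping a dummy encoder such as `lambda b: len(b)` and calling `get_or_encode` twice with the same bytes invokes the encoder once and records one hit. A distributed version would additionally need eviction, size limits, and a shared key-value backend, none of which are modeled here.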
Compatibility with CUDA Graph @bluecoffee8
- ViT CUDA Graph support.
- Piecewise CUDA Graph support.
Fault Tolerance
- Support EPD error handling #16670
SGLang Model Gateway Support
- EPD Disaggregation SGLang Model Gateway Support: [model gateway][4/N] router EPD support: add EPD python bindings and E2E routing tests #17550 @chenzongyao200127 @JasonZhang517
- SGLang encoder server gRPC support: [model gateway][0/N] router EPD support: add encoder grpc server backend support #16552 @chenzongyao200127 @JasonZhang517
We welcome contributors who are interested in these tasks and willing to join us for rapid development.