
[EPD][VLM] support video/audio input #17824

Open
ZhengWG wants to merge 33 commits into sgl-project:main from ZhengWG:py/epd-support-video-dev

Conversation


@ZhengWG ZhengWG commented Jan 27, 2026

Motivation

This PR adds video/audio input support in EPD mode, tested on Qwen-series models, addressing the requirements outlined in #15118.

Based on #15475.

Modifications

  1. Added a video/audio processing pipeline.
  2. Handled multimodal embedding aggregation with MultiModalEmbeddingData (a rough sketch of the idea is shown after this list).
  3. Supported modality-based assignment.
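
A minimal sketch of the aggregation idea in item 2, using an illustrative stand-in class (the real MultiModalEmbeddingData in this PR may have a different interface and field names):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultiModalEmbeddingData:
    """Illustrative stand-in: collects per-modality encoder outputs for one request."""

    req_id: str
    # modality name ("image" / "video" / "audio") -> list of embedding chunks
    # (a torch.Tensor in practice; typed as Any to keep the sketch dependency-free)
    embeddings: Dict[str, List[Any]] = field(default_factory=dict)

    def add(self, modality: str, embedding: Any) -> None:
        # Group incoming encoder outputs by modality as they arrive.
        self.embeddings.setdefault(modality, []).append(embedding)

    def is_complete(self, expected_counts: Dict[str, int]) -> bool:
        # Aggregation is done once every modality has its expected number of items.
        return all(
            len(self.embeddings.get(modality, [])) == count
            for modality, count in expected_counts.items()
        )
```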

Accuracy Tests

Tested with Qwen-series models:

| Model Family | Model Variants | Supported Modalities | Test Status |
| --- | --- | --- | --- |
| Qwen2.5-VL | Qwen/Qwen2.5-VL-7B-Instruct | Image, Video | ✅ Tested |
| Qwen3-VL | Qwen/Qwen3-VL-30B-A3B-Instruct, Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 | Image, Video | ✅ Tested |
| Qwen2-Audio | Qwen/Qwen2-Audio-7B-Instruct | Audio | ✅ Tested |
| Qwen3-Omni | Qwen/Qwen3-Omni-MoE-A22B-Instruct | Image, Video, Audio | ✅ Tested |

Launch script sample:

```bash
# language
MODEL_PATH=Qwen3-VL-30B-A3B-Instruct  # Qwen2.5-VL-7B-Instruct
CUDA_VISIBLE_DEVICES=4 python3 -m sglang.launch_server --model-path ${MODEL_PATH} --disable-radix-cache \
        --host $HOST_IP --port 8003 --trust-remote-code --tp-size 1 \
        --enable-cache-report --log-level info --encoder-urls 'http://127.0.0.1:8001' 'http://127.0.0.1:8002' \
        --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend fa3 \
        --enable-multimodal --language-only
# encoder
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
        --host $HOST_IP --port 8001 --trust-remote-code --tp-size 1 \
        --enable-cache-report --log-level info \
        --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend fa3 \
        --mm-attention-backend fa3 --encoder-only
CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
        --host $HOST_IP --port 8002 --trust-remote-code --tp-size 1 \
        --enable-cache-report --log-level info \
        --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend fa3 \
        --mm-attention-backend fa3 --encoder-only
```

Request sample:

curl "http://127.0.0.1:8003/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "auto",
    "messages": [
        {
            "role": "system", 
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Desribe the video."
                }
            ]
        }
    ],
    "stream": false,
    "max_tokens": 512
}'
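
The same request from Python (a sketch using the requests library against the OpenAI-compatible endpoint above; the payload mirrors the curl example):

```python
import requests

payload = {
    "model": "auto",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4"
                    },
                },
                {"type": "text", "text": "Describe the video."},
            ],
        },
    ],
    "stream": False,
    "max_tokens": 512,
}

# Send the request and print the generated description.
resp = requests.post("http://127.0.0.1:8003/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```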

Benchmarking and Profiling

Bench script:

```bash
# test qps_list: [0.1, 0.3, 0.6]
python -m sglang.bench_serving \
    --random-image-count \
    --model ${MODEL_PATH} \
    --num-prompts 64 \
    --dataset-name image \
    --random-input-len 128 \
    --random-output-len 256 \
    --image-count 8 \
    --image-resolution 1080p \
    --host $HOST_IP \
    --port $port \
    --backend vllm-chat \
    --request-rate $qps
```

Performance remains consistent before and after the modification (tested on Qwen3-VL-235B-A22B-Instruct-FP8).

Before:

| | One encoder (TP1) / one prefill (TP4) | Two encoders (TP1) / one prefill (TP4) |
| --- | --- | --- |
| qps=0.1 | Mean TTFT (ms): 3396.60 / Mean TPOT (ms): 29.30 | Mean TTFT (ms): 2385.97 / Mean TPOT (ms): 17.97 |
| qps=0.3 | Mean TTFT (ms): 5348.32 / Mean TPOT (ms): 91.68 | Mean TTFT (ms): 3425.13 / Mean TPOT (ms): 63.32 |
| qps=0.6 | Mean TTFT (ms): 61344.05 / Mean TPOT (ms): 739.06 | Mean TTFT (ms): 7376.02 / Mean TPOT (ms): 427.76 |

After:

| | One encoder (TP1) / one prefill (TP4) | Two encoders (TP1) / one prefill (TP4) |
| --- | --- | --- |
| qps=0.1 | Mean TTFT (ms): 3297.11 / Mean TPOT (ms): 29.46 | Mean TTFT (ms): 2414.95 / Mean TPOT (ms): 31.07 |
| qps=0.3 | Mean TTFT (ms): 5250.96 / Mean TPOT (ms): 86.47 | Mean TTFT (ms): 3344.37 / Mean TPOT (ms): 72.21 |
| qps=0.6 | Mean TTFT (ms): 55030.39 / Mean TPOT (ms): 818.90 | Mean TTFT (ms): 8142.78 / Mean TPOT (ms): 439.15 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


liusy58 commented Jan 30, 2026

/rerun-failed-ci

1 similar comment

liusy58 commented Jan 31, 2026

/rerun-failed-ci


liusy58 commented Jan 31, 2026

/rerun-failed-ci


ZhengWG commented Feb 3, 2026

/rerun-failed-ci

1 similar comment

ZhengWG commented Feb 4, 2026

/rerun-failed-ci

```python
while recv_embedding_data is None or not recv_embedding_data.ready:
    parts = await recv_socket.recv_multipart(copy=False)

    recv_obj: EmbeddingData = pickle.loads(parts[0])
```

Collaborator

Why do we need EmbeddingData and MultiModalEmbeddingData at the same time? Can we just keep one? Will mixing these types trigger an AttributeError during pickle.loads?

Contributor Author

They serve different purposes: EmbeddingData is used in encoder_server for single-modality data merging; MultiModalEmbeddingData is used in encode_receiver for cross-modality data merging.
The encode_receiver only receives EmbeddingData from encoder_server via pickle.loads, so there's no mixing of types that would trigger an AttributeError. The conversion from EmbeddingData to MultiModalEmbeddingData happens after deserialization.
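
A minimal sketch of the flow described above: only EmbeddingData objects arrive over the socket, and the cross-modality regrouping happens after deserialization (the EmbeddingData fields here are illustrative, not the PR's actual definition):

```python
import pickle
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class EmbeddingData:
    # Illustrative single-modality payload produced by encoder_server.
    req_id: str
    modality: str   # "image" | "video" | "audio"
    embedding: Any  # encoder output for this modality
    ready: bool = True


def regroup_by_modality(frames: List[bytes]) -> Dict[str, List[Any]]:
    """Deserialize EmbeddingData frames, then regroup them across modalities.

    The wire format only ever carries EmbeddingData, so pickle.loads never sees
    mixed types; building the cross-modality view (what MultiModalEmbeddingData
    represents in the PR) is a separate step performed afterwards.
    """
    per_modality: Dict[str, List[Any]] = {}
    for frame in frames:
        recv_obj: EmbeddingData = pickle.loads(frame)
        per_modality.setdefault(recv_obj.modality, []).append(recv_obj.embedding)
    return per_modality
```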

Collaborator

@ShangmingCai ShangmingCai left a comment

Seems like we only support Qwen now; should we throw an error when users are using other models? Also, a lot of code has been changed, and it is hard to review carefully.

@ShangmingCai
Collaborator

@yhyang201, please help review the processors part when you have time. The manager part LGTM, but I need more time to review the whole PR.

@yhyang201 yhyang201 self-assigned this Feb 4, 2026

ZhengWG commented Feb 4, 2026

> Seems like we only support Qwen now; should we throw an error when users are using other models? Also, a lot of code has been changed, and it is hard to review carefully.

I'll add error handling for unsupported models.

Regarding the code changes: preprocessing logic varies significantly across different models and modalities, which is why the changes are extensive.


ZhengWG commented Feb 6, 2026

/rerun-failed-ci

4 similar comments

ZhengWG commented Feb 6, 2026

/rerun-failed-ci


ZhengWG commented Feb 9, 2026

/rerun-failed-ci


ZhengWG commented Feb 10, 2026

/rerun-failed-ci


ZhengWG commented Feb 10, 2026

/rerun-failed-ci


liusy58 commented Feb 11, 2026

/rerun-failed-ci

flexorRegev added a commit to flexorRegev/sglang that referenced this pull request Feb 11, 2026
Replace _extract_image_url + _extract_image_urls with a single
_extract_url_data function following the pattern from PR sgl-project#17824.

- Use isinstance(item, ImageData) instead of hasattr(item, "url")
- Import ImageData at module level (not TYPE_CHECKING only)
- Fix duplicate comment
- Consolidate two functions into one simpler function
- Still handles str, dict, and ImageData formats for native API compat

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
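
A rough sketch of what such a consolidated helper could look like (the ImageData stand-in and the exact signature are assumptions; the actual implementation in that commit may differ):

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageData:
    # Illustrative stand-in for the module-level ImageData type mentioned above.
    url: str


def _extract_url_data(items: List[Union[str, dict, ImageData]]) -> List[str]:
    """Normalize str / dict / ImageData inputs into a flat list of URLs."""
    urls: List[str] = []
    for item in items:
        if isinstance(item, ImageData):  # typed object from the native API
            urls.append(item.url)
        elif isinstance(item, dict):     # OpenAI-style {"url": ...} payload
            urls.append(item["url"])
        else:                            # plain string URL
            urls.append(item)
    return urls
```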

ZhengWG commented Feb 17, 2026

/rerun-failed-ci

6 similar comments

ZhengWG commented Feb 17, 2026

/rerun-failed-ci


ZhengWG commented Feb 17, 2026

/rerun-failed-ci


ZhengWG commented Feb 17, 2026

/rerun-failed-ci


ZhengWG commented Feb 18, 2026

/rerun-failed-ci


ZhengWG commented Feb 19, 2026

/rerun-failed-ci


ZhengWG commented Feb 20, 2026

/rerun-failed-ci
