
[EPD][VLM] support video/audio input #17824

Open
ZhengWG wants to merge 33 commits into sgl-project:main from ZhengWG:py/epd-support-video-dev

Conversation


@ZhengWG ZhengWG commented Jan 27, 2026

Motivation

This PR adds video/audio input support in EPD mode, tested on Qwen-series models, addressing the requirements outlined in #15118.

Based on #15475.

Modifications

  1. Added a video/audio processing pipeline.
  2. Handled multimodal embedding aggregation with MultiModalEmbeddingData (a rough sketch of the idea is shown after this list).
  3. Supported modality-based assignment.
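
A minimal sketch of the aggregation idea in item 2, using an illustrative stand-in class (the real MultiModalEmbeddingData in this PR may have a different interface and field names):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultiModalEmbeddingData:
    """Illustrative stand-in: collects per-modality encoder outputs for one request."""

    req_id: str
    # modality name ("image" / "video" / "audio") -> list of embedding chunks
    # (a torch.Tensor in practice; typed as Any to keep the sketch dependency-free)
    embeddings: Dict[str, List[Any]] = field(default_factory=dict)

    def add(self, modality: str, embedding: Any) -> None:
        # Group incoming encoder outputs by modality as they arrive.
        self.embeddings.setdefault(modality, []).append(embedding)

    def is_complete(self, expected_counts: Dict[str, int]) -> bool:
        # Aggregation is done once every modality has its expected number of items.
        return all(
            len(self.embeddings.get(modality, [])) == count
            for modality, count in expected_counts.items()
        )
```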

Accuracy Tests

Tested with Qwen-series models:

| Model Family | Model Variants | Supported Modalities | Test Status |
| --- | --- | --- | --- |
| Qwen2.5-VL | Qwen/Qwen2.5-VL-7B-Instruct | Image, Video | ✅ Tested |
| Qwen3-VL | Qwen/Qwen3-VL-30B-A3B-Instruct, Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 | Image, Video | ✅ Tested |
| Qwen2-Audio | Qwen/Qwen2-Audio-7B-Instruct | Audio | ✅ Tested |
| Qwen3-Omni | Qwen/Qwen3-Omni-MoE-A22B-Instruct | Image, Video, Audio | ✅ Tested |

Launch script sample:

```bash
# language
MODEL_PATH=Qwen3-VL-30B-A3B-Instruct  # Qwen2.5-VL-7B-Instruct
CUDA_VISIBLE_DEVICES=4 python3 -m sglang.launch_server --model-path ${MODEL_PATH} --disable-radix-cache \
        --host $HOST_IP --port 8003 --trust-remote-code --tp-size 1 \
        --enable-cache-report --log-level info --encoder-urls 'http://127.0.0.1:8001' 'http://127.0.0.1:8002' \
        --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend fa3 \
        --enable-multimodal --language-only
# encoder
CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
        --host $HOST_IP --port 8001 --trust-remote-code --tp-size 1 \
        --enable-cache-report --log-level info \
        --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend fa3 \
        --mm-attention-backend fa3 --encoder-only
CUDA_VISIBLE_DEVICES=1 python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
        --host $HOST_IP --port 8002 --trust-remote-code --tp-size 1 \
        --enable-cache-report --log-level info \
        --mem-fraction-static 0.7 --chunked-prefill-size 8192 --attention-backend fa3 \
        --mm-attention-backend fa3 --encoder-only
```

Request sample:

curl "http://127.0.0.1:8003/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "auto",
    "messages": [
        {
            "role": "system", 
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Desribe the video."
                }
            ]
        }
    ],
    "stream": false,
    "max_tokens": 512
}'
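
The same request from Python (a sketch using the requests library against the OpenAI-compatible endpoint above; the payload mirrors the curl example):

```python
import requests

payload = {
    "model": "auto",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4"
                    },
                },
                {"type": "text", "text": "Describe the video."},
            ],
        },
    ],
    "stream": False,
    "max_tokens": 512,
}

# Send the request and print the generated description.
resp = requests.post("http://127.0.0.1:8003/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```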

Benchmarking and Profiling

Bench script:

```bash
# test qps_list: [0.1, 0.3, 0.6]
python -m sglang.bench_serving \
    --random-image-count \
    --model ${MODEL_PATH} \
    --num-prompts 64 \
    --dataset-name image \
    --random-input-len 128 \
    --random-output-len 256 \
    --image-count 8 \
    --image-resolution 1080p \
    --host $HOST_IP \
    --port $port \
    --backend vllm-chat \
    --request-rate $qps
```

Performance remains consistent before and after the modification (tested on Qwen3-VL-235B-A22B-Instruct-FP8).

Before:

| | One encoder (TP1) / one prefill (TP4) | Two encoders (TP1) / one prefill (TP4) |
| --- | --- | --- |
| qps=0.1 | Mean TTFT (ms): 3396.60 / Mean TPOT (ms): 29.30 | Mean TTFT (ms): 2385.97 / Mean TPOT (ms): 17.97 |
| qps=0.3 | Mean TTFT (ms): 5348.32 / Mean TPOT (ms): 91.68 | Mean TTFT (ms): 3425.13 / Mean TPOT (ms): 63.32 |
| qps=0.6 | Mean TTFT (ms): 61344.05 / Mean TPOT (ms): 739.06 | Mean TTFT (ms): 7376.02 / Mean TPOT (ms): 427.76 |

After:

| | One encoder (TP1) / one prefill (TP4) | Two encoders (TP1) / one prefill (TP4) |
| --- | --- | --- |
| qps=0.1 | Mean TTFT (ms): 3297.11 / Mean TPOT (ms): 29.46 | Mean TTFT (ms): 2414.95 / Mean TPOT (ms): 31.07 |
| qps=0.3 | Mean TTFT (ms): 5250.96 / Mean TPOT (ms): 86.47 | Mean TTFT (ms): 3344.37 / Mean TPOT (ms): 72.21 |
| qps=0.6 | Mean TTFT (ms): 55030.39 / Mean TPOT (ms): 818.90 | Mean TTFT (ms): 8142.78 / Mean TPOT (ms): 439.15 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


liusy58 commented Jan 30, 2026

/rerun-failed-ci

1 similar comment

liusy58 commented Jan 31, 2026

/rerun-failed-ci


liusy58 commented Jan 31, 2026

/rerun-failed-ci


ZhengWG commented Feb 3, 2026

/rerun-failed-ci

1 similar comment

ZhengWG commented Feb 4, 2026

/rerun-failed-ci

```python
while recv_embedding_data is None or not recv_embedding_data.ready:
    parts = await recv_socket.recv_multipart(copy=False)

    recv_obj: EmbeddingData = pickle.loads(parts[0])
```

Collaborator

Why do we need EmbeddingData and MultiModalEmbeddingData at the same time? Can we just keep one? Will mixing these types trigger an AttributeError during pickle.loads?

Contributor Author

They serve different purposes: EmbeddingData is used in encoder_server for single-modality data merging; MultiModalEmbeddingData is used in encode_receiver for cross-modality data merging.
The encode_receiver only receives EmbeddingData from encoder_server via pickle.loads, so there's no mixing of types that would trigger an AttributeError. The conversion from EmbeddingData to MultiModalEmbeddingData happens after deserialization.
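
A minimal sketch of the flow described above: only EmbeddingData objects arrive over the socket, and the cross-modality regrouping happens after deserialization (the EmbeddingData fields here are illustrative, not the PR's actual definition):

```python
import pickle
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class EmbeddingData:
    # Illustrative single-modality payload produced by encoder_server.
    req_id: str
    modality: str   # "image" | "video" | "audio"
    embedding: Any  # encoder output for this modality
    ready: bool = True


def regroup_by_modality(frames: List[bytes]) -> Dict[str, List[Any]]:
    """Deserialize EmbeddingData frames, then regroup them across modalities.

    The wire format only ever carries EmbeddingData, so pickle.loads never sees
    mixed types; building the cross-modality view (what MultiModalEmbeddingData
    represents in the PR) is a separate step performed afterwards.
    """
    per_modality: Dict[str, List[Any]] = {}
    for frame in frames:
        recv_obj: EmbeddingData = pickle.loads(frame)
        per_modality.setdefault(recv_obj.modality, []).append(recv_obj.embedding)
    return per_modality
```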

Collaborator

@ShangmingCai ShangmingCai left a comment

Seems like we only support Qwen now; should we throw an error when users are using other models? Also, a lot of code has been changed, and it is hard to review carefully.

@ShangmingCai
Collaborator

@yhyang201, please help review the processors part when you have time. The manager part LGTM, but I need more time to review the whole PR.

@yhyang201 yhyang201 self-assigned this Feb 4, 2026

ZhengWG commented Feb 4, 2026

> Seems like we only support Qwen now; should we throw an error when users are using other models? Also, a lot of code has been changed, and it is hard to review carefully.

I'll add error handling for unsupported models.

Regarding the code changes: preprocessing logic varies significantly across different models and modalities, which is why the changes are extensive.


ZhengWG commented Feb 6, 2026

/rerun-failed-ci

4 similar comments

ZhengWG commented Feb 6, 2026

/rerun-failed-ci


ZhengWG commented Feb 9, 2026

/rerun-failed-ci


ZhengWG commented Feb 10, 2026

/rerun-failed-ci


ZhengWG commented Feb 10, 2026

/rerun-failed-ci


liusy58 commented Feb 11, 2026

/rerun-failed-ci

flexorRegev added a commit to flexorRegev/sglang that referenced this pull request Feb 11, 2026
Replace _extract_image_url + _extract_image_urls with a single
_extract_url_data function following the pattern from PR sgl-project#17824.

- Use isinstance(item, ImageData) instead of hasattr(item, "url")
- Import ImageData at module level (not TYPE_CHECKING only)
- Fix duplicate comment
- Consolidate two functions into one simpler function
- Still handles str, dict, and ImageData formats for native API compat

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
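
A rough sketch of what such a consolidated helper could look like (the ImageData stand-in and the exact signature are assumptions; the actual implementation in that commit may differ):

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageData:
    # Illustrative stand-in for the module-level ImageData type mentioned above.
    url: str


def _extract_url_data(items: List[Union[str, dict, ImageData]]) -> List[str]:
    """Normalize str / dict / ImageData inputs into a flat list of URLs."""
    urls: List[str] = []
    for item in items:
        if isinstance(item, ImageData):  # typed object from the native API
            urls.append(item.url)
        elif isinstance(item, dict):     # OpenAI-style {"url": ...} payload
            urls.append(item["url"])
        else:                            # plain string URL
            urls.append(item)
    return urls
```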

ZhengWG commented Feb 17, 2026

/rerun-failed-ci

6 similar comments

ZhengWG commented Feb 17, 2026

/rerun-failed-ci


ZhengWG commented Feb 17, 2026

/rerun-failed-ci


ZhengWG commented Feb 17, 2026

/rerun-failed-ci


ZhengWG commented Feb 18, 2026

/rerun-failed-ci


ZhengWG commented Feb 19, 2026

/rerun-failed-ci


ZhengWG commented Feb 20, 2026

/rerun-failed-ci
