[EPD][VLM] support video/audio input #17824
ZhengWG wants to merge 33 commits into sgl-project:main
while recv_embedding_data is None or not recv_embedding_data.ready:
    parts = await recv_socket.recv_multipart(copy=False)

    recv_obj: EmbeddingData = pickle.loads(parts[0])
Why do we need both EmbeddingData and MultiModalEmbeddingData? Can we keep just one? Will mixing these types trigger an AttributeError in pickle.loads?
They serve different purposes:
- EmbeddingData: used in encoder_server for single-modality data merging.
- MultiModalEmbeddingData: used in encode_receiver for cross-modality data merging.
The encode_receiver only receives EmbeddingData from encoder_server via pickle.loads, so there's no mixing of types that would trigger an AttributeError. The conversion from EmbeddingData to MultiModalEmbeddingData happens after deserialization.
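To make the reply above concrete, here is a minimal sketch of the flow it describes: only one type crosses the wire through pickle, and the cross-modality container is built after deserialization. The two classes below are simplified stand-ins with hypothetical fields (`req_id`, `modality`, `embedding`, `expected`), and `receive` is a hypothetical helper; none of this is the PR's actual implementation.

```python
import pickle
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class EmbeddingData:
    """Hypothetical stand-in: one modality's embedding for one request."""
    req_id: str
    modality: str
    embedding: List[float]


@dataclass
class MultiModalEmbeddingData:
    """Hypothetical stand-in: merges embeddings across modalities."""
    req_id: str
    expected: frozenset
    by_modality: Dict[str, List[float]] = field(default_factory=dict)

    def add(self, data: EmbeddingData) -> None:
        self.by_modality[data.modality] = data.embedding

    @property
    def ready(self) -> bool:
        # Ready once every expected modality has arrived.
        return self.expected.issubset(self.by_modality)


def receive(
    wire_messages: Iterable[bytes], expected: Iterable[str]
) -> Optional[MultiModalEmbeddingData]:
    """Deserialize EmbeddingData only, then convert on the receiver side."""
    merged: Optional[MultiModalEmbeddingData] = None
    for raw in wire_messages:
        obj = pickle.loads(raw)
        # Only EmbeddingData is ever sent, so no mixed types reach pickle.loads.
        if not isinstance(obj, EmbeddingData):
            raise TypeError(f"unexpected payload type: {type(obj).__name__}")
        if merged is None:
            merged = MultiModalEmbeddingData(obj.req_id, frozenset(expected))
        merged.add(obj)
        if merged.ready:
            break
    return merged
```

Because the receiver-side container never gets pickled in this sketch, the sender and receiver only need to agree on the definition of the single-modality type.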
ShangmingCai left a comment
It seems we only support Qwen models for now. Should we raise an error when users run other models? Also, a lot of code has changed, which makes it hard to review carefully.
@yhyang201, please help review the
I'll add error handling for unsupported models. Regarding the code changes: preprocessing logic varies significantly across models and modalities, which is why the changes are extensive.
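One possible shape for the promised error handling is an early allowlist check that fails fast with a clear message. This is only a sketch: the function name `check_epd_model_supported` and the prefix list are hypothetical, with the prefixes derived from the models listed as tested in this PR's description rather than from the actual code.

```python
# Hypothetical allowlist, derived from the models named in the PR description.
SUPPORTED_EPD_MODEL_PREFIXES = (
    "Qwen/Qwen2.5-VL",
    "Qwen/Qwen3-VL",
    "Qwen/Qwen2-Audio",
    "Qwen/Qwen3-Omni",
)


def check_epd_model_supported(model_path: str) -> None:
    """Raise ValueError if the model is not on the (assumed) EPD allowlist."""
    # str.startswith accepts a tuple and matches any of its entries.
    if not model_path.startswith(SUPPORTED_EPD_MODEL_PREFIXES):
        supported = ", ".join(SUPPORTED_EPD_MODEL_PREFIXES)
        raise ValueError(
            f"EPD video/audio input is not supported for {model_path!r}; "
            f"supported model prefixes: {supported}"
        )
```

Calling this once at server start-up, before any encoder work begins, would surface the limitation immediately instead of failing deep inside preprocessing.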
Replace _extract_image_url + _extract_image_urls with a single _extract_url_data function following the pattern from PR sgl-project#17824.
- Use isinstance(item, ImageData) instead of hasattr(item, "url")
- Import ImageData at module level (not TYPE_CHECKING only)
- Fix duplicate comment
- Consolidate two functions into one simpler function
- Still handles str, dict, and ImageData formats for native API compat
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Motivation
This PR adds video/audio input support in EPD mode, tested on Qwen-series models, addressing the requirements outlined in #15118.
It is based on #15475.
Modifications
MultiModalEmbeddingData
Accuracy Tests
Tested with Qwen-series models:
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-VL-30B-A3B-Instruct
- Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
- Qwen/Qwen2-Audio-7B-Instruct
- Qwen/Qwen3-Omni-MoE-A22B-Instruct
Launch script sample:
Request sample:
Benchmarking and Profiling
Bench script:
Performance remains consistent before and after the modification (tested on Qwen3-VL-235B-A22B-Instruct-FP8).
Before:
After:
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci