multimodal: precompute hash for MultimodalDataItem#14354
Kangyan-Zhou merged 3 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the performance of multimodal data processing in VLM systems by optimizing the timing of hash code generation.
Code Review
This pull request effectively moves the hash precomputation for MultimodalDataItem to the tokenizer, which, as the benchmarks show, provides a significant performance improvement by unblocking the scheduler. The changes are well-contained and controlled by a new environment variable.
One concern is that the original hash computation in python/sglang/srt/managers/schedule_batch.py (lines 331-332 in MultimodalInputs.from_dict) still seems to be present based on the full file content. This would cause set_pad_value() to be called twice. Although the implementation of set_pad_value prevents re-hashing, it is best to remove the redundant call from schedule_batch.py to complete the "move" of this logic and keep the code clean. If this is an oversight, please consider adding the change to this file.
I've also left a minor suggestion in tokenizer_manager.py to improve code readability by reducing nesting.
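To illustrate why the duplicate set_pad_value() call is harmless but still worth removing, here is a minimal, hypothetical sketch of an idempotent hash precomputation. The names (MultimodalDataItem, set_pad_value, hash_value) mirror the PR discussion; the actual sglang implementation and hash function may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalDataItem:
    data: bytes
    hash_value: Optional[int] = None

    def set_pad_value(self) -> int:
        # Guard makes repeated calls O(1): the expensive hash runs at most once.
        if self.hash_value is None:
            digest = hashlib.sha256(self.data).digest()
            self.hash_value = int.from_bytes(digest[:8], "big")
        return self.hash_value


item = MultimodalDataItem(data=b"pixel bytes" * 1000)
first = item.set_pad_value()
second = item.set_pad_value()  # no re-hash; returns the cached value
assert first == second
```

With this guard, calling set_pad_value() from both the tokenizer and the scheduler is safe, but removing the second call site keeps the ownership of this step unambiguous.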
/tag-and-rerun-ci
LGTM. @yhyang201 Could you help review?
Signed-off-by: Feng Su <sufeng@linux.alibaba.com> Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
Force-pushed from 6770b08 to d00c2a0.
Great! Could you conduct a more in-depth comparison of enabling versus disabling this feature? This should include both low-bs and high-bs scenarios. Maybe we could make it the default?
Should we consult more people before making it the default?
/rerun-failed-ci
/tag-and-rerun-ci
LGTM for now; maybe we could add an auto-switching feature based on traffic in the future.
/rerun-failed-ci
Do we need to wait for the CI to be fixed before we can merge? The failed items don't seem related to my code. @JustinTong0323 @yuan-luo
/rerun-failed-ci
@yhyang201 All test cases have passed. Can this PR be merged? Thanks
I would like to verify that my understanding of this feature is correct.

First, this feature offloads the hash calculation to the tokenizer manager. Since the tokenizer manager runs asynchronously with the scheduler, this creates an execution overlap, preventing the hash calculation from blocking the scheduler and thereby improving overall throughput.

Second, this feature is controlled via an environment variable because it should not be enabled under high concurrency. The rationale is that if the tokenizer becomes blocked, new requests might fail to establish a connection entirely, whereas if the scheduler handles the load, requests would simply be queued. Therefore, enabling this feature under heavy load could prevent requests from entering the system, leading to a high number of timeout failures.

Is this understanding accurate? Please let me know if I missed anything.
yes, your understanding is correct. |
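The execution overlap described above can be sketched with a toy two-stage pipeline: a producer thread (standing in for the tokenizer manager) hashes items ahead of time, while the consumer (standing in for the scheduler) only takes already-hashed items off a queue. This is an illustrative model, not sglang code.

```python
import hashlib
import queue
import threading

work_q = queue.Queue()
results = []


def tokenizer_stage(items):
    # The expensive hashing step now runs here, overlapping with the consumer.
    for data in items:
        h = hashlib.sha256(data).hexdigest()
        work_q.put((data, h))
    work_q.put(None)  # sentinel: no more items


def scheduler_stage():
    # The "scheduler" never hashes; it only consumes precomputed results.
    while (msg := work_q.get()) is not None:
        _data, h = msg
        results.append(h)


items = [bytes([i]) * 1024 for i in range(8)]
t = threading.Thread(target=tokenizer_stage, args=(items,))
t.start()
scheduler_stage()
t.join()
assert len(results) == 8
```

Because both stages run concurrently, the consumer's critical loop is never stalled by hashing, which is the mechanism behind the throughput gain reported in this PR.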
Signed-off-by: Feng Su <sufeng@linux.alibaba.com> Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
Co-authored-by: Junjie Mao junjie.mao@linux.alibaba.com
Motivation
In VLM scenarios, computing the hash code of a MultimodalDataItem is a relatively time-consuming operation. Currently, this step is implemented in the scheduler, and for large input data, it directly blocks the entire scheduler, resulting in reduced throughput and increased latency.
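A rough back-of-the-envelope check shows why this matters: hashing a payload on the order of the ~20 MB used in the benchmark below takes a nontrivial amount of wall-clock time, which would otherwise be spent inside the scheduler's loop. The choice of SHA-256 here is an assumption for illustration, not necessarily sglang's hash function.

```python
import hashlib
import time

# ~20 MB payload, similar in size to the multimodal data in the benchmark.
payload = b"\x00" * (20 * 1024 * 1024)

start = time.perf_counter()
hashlib.sha256(payload).hexdigest()
elapsed_ms = (time.perf_counter() - start) * 1000
# Typically tens of milliseconds on common CPUs; blocking a scheduler
# loop for this long per request directly hurts throughput and latency.
print(f"hashing ~20 MB took {elapsed_ms:.1f} ms")
```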
Modifications
This PR moves this step earlier into the tokenizer and introduces an environment variable SGLANG_MM_PRECOMPUTE_HASH to control whether this feature is enabled.
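A minimal sketch of the environment-variable gate, assuming a simple truthy-string convention. The variable name SGLANG_MM_PRECOMPUTE_HASH comes from this PR; the helper function and call site are illustrative, not the actual sglang code.

```python
import os


def mm_precompute_hash_enabled() -> bool:
    # Treat "1"/"true"/"yes" (case-insensitive) as enabled; off by default.
    value = os.environ.get("SGLANG_MM_PRECOMPUTE_HASH", "0")
    return value.lower() in ("1", "true", "yes")


# In the tokenizer path, the hash would be computed eagerly only when the
# flag is set; otherwise the scheduler computes it lazily as before.
os.environ["SGLANG_MM_PRECOMPUTE_HASH"] = "1"
assert mm_precompute_hash_enabled()
```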
Accuracy Tests
Benchmarking and Profiling
test parameters
server:
client:
```shell
vllm bench serve --port 30260 --backend openai-chat --model Qwen/Qwen3-VL-4B-Instruct-FP8 \
  --trust-remote-code --percentile-metrics ttft,tpot,itl,e2el --num-prompts 1024 \
  --dataset-name random-mm --random-mm-base-items-per-request 1 \
  --random-mm-num-mm-items-range-ratio 0 --random-mm-bucket-config '{(1280, 720, 1): 1.0}' \
  --seed 18347 --request-rate inf --random-prefix-len 500 --random-input-len 600 \
  --random-output-len 230 --endpoint /v1/chat/completions --max-concurrency 128
```
Result
Multimodal data size is approximately 20 MB, achieving a 15% throughput improvement.
Original:
Optimized:
Checklist