
transfer mrope_position_delta to device when first running #11047

Merged
zhyncs merged 9 commits into sgl-project:main from ping1jing2:mm_h2d
Oct 26, 2025

Conversation

@ash-sigh (Contributor) commented Sep 29, 2025

Motivation

related to #11046

Modifications

The mrope_position_delta tensor is transferred to the device during the first forward pass, so subsequent decode steps reuse the device-resident copy instead of issuing a host-to-device transfer on every step. This improves decode performance for multi-modal models.
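The caching pattern described above can be sketched as follows. This is a minimal illustration, not the actual SGLang implementation: the class, field names, and the `to_device` placeholder (standing in for a real host-to-device copy such as `tensor.to(device)`) are hypothetical.

```python
# Hypothetical sketch: pay the host-to-device (H2D) copy cost once, on the
# first run, and reuse the device-resident copy on every later decode step.

class MRopeState:
    def __init__(self, mrope_position_delta):
        self.mrope_position_delta = mrope_position_delta  # host-side value
        self._device_copy = None   # populated lazily on the first run
        self.h2d_transfers = 0     # transfer counter, for illustration only

    def to_device(self, value):
        # Placeholder for a real H2D copy, e.g. tensor.to(device).
        self.h2d_transfers += 1
        return list(value)

    def get_device_delta(self):
        # First call performs the transfer; subsequent calls are free.
        if self._device_copy is None:
            self._device_copy = self.to_device(self.mrope_position_delta)
        return self._device_copy


state = MRopeState([3, 7, 11])
for _ in range(4):  # simulate one prefill plus three decode steps
    _ = state.get_device_delta()
print(state.h2d_transfers)  # -> 1, a single transfer despite four steps
```

Without the cached device copy, each of the four steps would have triggered its own transfer, which is the per-decode-step overhead this PR removes.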

Accuracy Tests

Ascend NPU:

# launch_server :
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
python -m sglang.launch_server --model-path /home/weights/Qwen2.5-VL-32B-Instruct --host 127.0.0.1 --port 8022 --device npu --attention-backend ascend --tp 2 --grammar-backend xgrammar --base-gpu-id 4 --mm-attention-backend ascend_attn --trust-remote-code --enable-multimodal

# benchmark mmmu:
python benchmark/mmmu/bench_sglang.py --port 8022 --concurrency 16 --dataset-path /home/dataset/MMMU

result before modification:
Benchmark time: 388.05547650624067s
Overall accuracy: 0.582

result after modification:
Benchmark time: 279.4189731432125s
Overall accuracy: 0.583

H20 GPU:

# launch_server :
python -m sglang.launch_server --model-path /data/MultiModalWeights/Qwen2.5-VL-32B-Instruct --tp 2 --host 127.0.0.1 --port 8022 --base-gpu-id 0 --mm-attention-backend fa3 --trust-remote-code  --enable-multimodal

# benchmark mmmu:
python benchmark/mmmu/bench_sglang.py --port 8022 --concurrency 16 --dataset-path /home/dataset/MMMU

result before modification:
Benchmark time: 293.04421259183437s
Overall accuracy: 0.586

result after modification:
Benchmark time: 291.70924793416634s
Overall accuracy: 0.586

Benchmarking and Profiling

Ascend NPU:

# launch_server :
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
python -m sglang.launch_server --model-path /home/weights/Qwen2.5-VL-32B-Instruct --host 127.0.0.1 --port 8022 --device npu --attention-backend ascend --tp 2 --grammar-backend xgrammar --base-gpu-id 4 --mm-attention-backend ascend_attn --trust-remote-code --enable-multimodal

# benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8022 --dataset-name random-image --model /home/weights/Qwen2.5-VL-32B-Instruct --num-prompt 512 --random-input-len 31 --random-output-len 256 --random-range-ratio 1.0 --random-image-num-images 1 --random-image-resolution 720p --max-concurrency 32 --apply-chat-template --flush-cache

result before modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 497.67
Total input tokens: 27971
Total generated tokens: 131072
Total generated tokens (retokenized): 60654
Request throughput (req/s): 1.03
Input token throughput (tok/s): 56.20
Output token throughput (tok/s): 263.37
Total token throughput (tok/s): 319.58
Concurrency: 31.82
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 30929.16
Median E2E Latency (ms): 31412.63
---------------Time to First Token----------------
Mean TTFT (ms): 6228.17
Median TTFT (ms): 6573.16
P99 TTFT (ms): 13450.14
---------------Inter-Token Latency----------------
Mean ITL (ms): 96.89
Median ITL (ms): 64.00
P95 ITL (ms): 283.95
P99 ITL (ms): 540.91
Max ITL (ms): 7429.85

result after modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 456.94
Total input tokens: 27877
Total generated tokens: 131072
Total generated tokens (retokenized): 60441
Request throughput (req/s): 1.12
Input token throughput (tok/s): 61.01
Output token throughput (tok/s): 286.85
Total token throughput (tok/s): 347.85
Concurrency: 31.83
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 28403.76
Median E2E Latency (ms): 28722.04
---------------Time to First Token----------------
Mean TTFT (ms): 6642.61
Median TTFT (ms): 7468.42
P99 TTFT (ms): 13886.43
---------------Inter-Token Latency----------------
Mean ITL (ms): 85.35
Median ITL (ms): 57.98
P95 ITL (ms): 128.43
P99 ITL (ms): 584.89
Max ITL (ms): 10766.88
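As a quick arithmetic check of the NPU serving gains, the relative changes between the two result tables above can be computed directly from the reported numbers:

```python
# Relative change between the "before" and "after" NPU serving results.
before = {"duration_s": 497.67, "total_tok_s": 319.58, "mean_itl_ms": 96.89}
after = {"duration_s": 456.94, "total_tok_s": 347.85, "mean_itl_ms": 85.35}

def pct_change(a, b):
    """Percentage change from a to b."""
    return (b - a) / a * 100

for key in before:
    print(f"{key}: {pct_change(before[key], after[key]):+.1f}%")
# duration_s: -8.2%
# total_tok_s: +8.8%
# mean_itl_ms: -11.9%
```

So on the NPU the change yields roughly an 8.8% total-throughput improvement and a 11.9% lower mean inter-token latency, consistent with removing a per-decode-step H2D transfer.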

H20 GPU:

# launch_server :
python -m sglang.launch_server --model-path /data/MultiModalWeights/Qwen2.5-VL-32B-Instruct --tp 2 --host 127.0.0.1 --port 8022 --base-gpu-id 0 --mm-attention-backend fa3 --trust-remote-code  --enable-multimodal

# benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8022 --dataset-name random-image --model /data/MultiModalWeights/Qwen2.5-VL-32B-Instruct --num-prompt 512 --random-input-len 31 --random-output-len 256 --random-range-ratio 1.0 --random-image-num-images 1 --random-image-resolution 720p --max-concurrency 32 --apply-chat-template --flush-cache

result before modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 301.21
Total input tokens: 27944
Total generated tokens: 131072
Total generated tokens (retokenized): 59641
Request throughput (req/s): 1.70
Input token throughput (tok/s): 92.77
Output token throughput (tok/s): 435.15
Total token throughput (tok/s): 527.92
Concurrency: 31.52
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18543.10
Median E2E Latency (ms): 18754.19
---------------Time to First Token----------------
Mean TTFT (ms): 7447.40
Median TTFT (ms): 8141.37
P99 TTFT (ms): 17402.11
---------------Inter-Token Latency----------------
Mean ITL (ms): 43.60
Median ITL (ms): 17.32
P95 ITL (ms): 76.08
P99 ITL (ms): 294.61
Max ITL (ms): 14387.22

result after modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 297.15
Total input tokens: 28008
Total generated tokens: 131072
Total generated tokens (retokenized): 59120
Request throughput (req/s): 1.72
Input token throughput (tok/s): 94.26
Output token throughput (tok/s): 441.10
Total token throughput (tok/s): 535.36
Concurrency: 31.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18405.00
Median E2E Latency (ms): 18126.27
---------------Time to First Token----------------
Mean TTFT (ms): 7743.42
Median TTFT (ms): 8579.59
P99 TTFT (ms): 16227.20
---------------Inter-Token Latency----------------
Mean ITL (ms): 41.81
Median ITL (ms): 17.31
P95 ITL (ms): 36.65
P99 ITL (ms): 625.31
Max ITL (ms): 13778.65


@JustinTong0323 JustinTong0323 self-assigned this Oct 1, 2025
@JustinTong0323 JustinTong0323 added the Multi-modal multi-modal language model label Oct 1, 2025
@ash-sigh ash-sigh marked this pull request as ready for review October 16, 2025 09:50
@JustinTong0323 JustinTong0323 added the express-lane A PR may be merged without a full CI check label Oct 26, 2025
@zhyncs zhyncs merged commit 0b3b3e9 into sgl-project:main Oct 26, 2025
29 of 70 checks passed
