
transfer mrope_position_delta to device when first running #11047

Merged
zhyncs merged 9 commits into sgl-project:main from ping1jing2:mm_h2d
Oct 26, 2025

Conversation

@ash-sigh (Contributor) commented Sep 29, 2025

Motivation

related to #11046

Modifications

The mrope_position_delta tensor is transferred to the device during the first forward pass, so subsequent decode steps reuse the device-resident copy instead of issuing a host-to-device transfer on every step. This improves decode performance for multi-modal models.
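The caching pattern described above can be sketched as follows. This is a minimal illustration, not the actual SGLang implementation: the class, field names, and the `to_device` placeholder (standing in for a real host-to-device copy such as `tensor.to(device)`) are hypothetical.

```python
# Hypothetical sketch: pay the host-to-device (H2D) copy cost once, on the
# first run, and reuse the device-resident copy on every later decode step.

class MRopeState:
    def __init__(self, mrope_position_delta):
        self.mrope_position_delta = mrope_position_delta  # host-side value
        self._device_copy = None   # populated lazily on the first run
        self.h2d_transfers = 0     # transfer counter, for illustration only

    def to_device(self, value):
        # Placeholder for a real H2D copy, e.g. tensor.to(device).
        self.h2d_transfers += 1
        return list(value)

    def get_device_delta(self):
        # First call performs the transfer; subsequent calls are free.
        if self._device_copy is None:
            self._device_copy = self.to_device(self.mrope_position_delta)
        return self._device_copy


state = MRopeState([3, 7, 11])
for _ in range(4):  # simulate one prefill plus three decode steps
    _ = state.get_device_delta()
print(state.h2d_transfers)  # -> 1, a single transfer despite four steps
```

Without the cached device copy, each of the four steps would have triggered its own transfer, which is the per-decode-step overhead this PR removes.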

Accuracy Tests

Ascend NPU:

# launch_server :
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
python -m sglang.launch_server --model-path /home/weights/Qwen2.5-VL-32B-Instruct --host 127.0.0.1 --port 8022 --device npu --attention-backend ascend --tp 2 --grammar-backend xgrammar --base-gpu-id 4 --mm-attention-backend ascend_attn --trust-remote-code --enable-multimodal

# benchmark mmmu:
python benchmark/mmmu/bench_sglang.py --port 8022 --concurrency 16 --dataset-path /home/dataset/MMMU

result before modification:
Benchmark time: 388.05547650624067s
Overall accuracy: 0.582

result after modification:
Benchmark time: 279.4189731432125s
Overall accuracy: 0.583

H20 GPU:

# launch_server :
python -m sglang.launch_server --model-path /data/MultiModalWeights/Qwen2.5-VL-32B-Instruct --tp 2 --host 127.0.0.1 --port 8022 --base-gpu-id 0 --mm-attention-backend fa3 --trust-remote-code  --enable-multimodal

# benchmark mmmu:
python benchmark/mmmu/bench_sglang.py --port 8022 --concurrency 16 --dataset-path /home/dataset/MMMU

result before modification:
Benchmark time: 293.04421259183437s
Overall accuracy: 0.586

result after modification:
Benchmark time: 291.70924793416634s
Overall accuracy: 0.586

Benchmarking and Profiling

Ascend NPU:

# launch_server :
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
python -m sglang.launch_server --model-path /home/weights/Qwen2.5-VL-32B-Instruct --host 127.0.0.1 --port 8022 --device npu --attention-backend ascend --tp 2 --grammar-backend xgrammar --base-gpu-id 4 --mm-attention-backend ascend_attn --trust-remote-code --enable-multimodal

# benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8022 --dataset-name random-image --model /home/weights/Qwen2.5-VL-32B-Instruct --num-prompt 512 --random-input-len 31 --random-output-len 256 --random-range-ratio 1.0 --random-image-num-images 1 --random-image-resolution 720p --max-concurrency 32 --apply-chat-template --flush-cache

result before modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 497.67
Total input tokens: 27971
Total generated tokens: 131072
Total generated tokens (retokenized): 60654
Request throughput (req/s): 1.03
Input token throughput (tok/s): 56.20
Output token throughput (tok/s): 263.37
Total token throughput (tok/s): 319.58
Concurrency: 31.82
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 30929.16
Median E2E Latency (ms): 31412.63
---------------Time to First Token----------------
Mean TTFT (ms): 6228.17
Median TTFT (ms): 6573.16
P99 TTFT (ms): 13450.14
---------------Inter-Token Latency----------------
Mean ITL (ms): 96.89
Median ITL (ms): 64.00
P95 ITL (ms): 283.95
P99 ITL (ms): 540.91
Max ITL (ms): 7429.85

result after modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 456.94
Total input tokens: 27877
Total generated tokens: 131072
Total generated tokens (retokenized): 60441
Request throughput (req/s): 1.12
Input token throughput (tok/s): 61.01
Output token throughput (tok/s): 286.85
Total token throughput (tok/s): 347.85
Concurrency: 31.83
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 28403.76
Median E2E Latency (ms): 28722.04
---------------Time to First Token----------------
Mean TTFT (ms): 6642.61
Median TTFT (ms): 7468.42
P99 TTFT (ms): 13886.43
---------------Inter-Token Latency----------------
Mean ITL (ms): 85.35
Median ITL (ms): 57.98
P95 ITL (ms): 128.43
P99 ITL (ms): 584.89
Max ITL (ms): 10766.88
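As a quick arithmetic check of the NPU serving gains, the relative changes between the two result tables above can be computed directly from the reported numbers:

```python
# Relative change between the "before" and "after" NPU serving results.
before = {"duration_s": 497.67, "total_tok_s": 319.58, "mean_itl_ms": 96.89}
after = {"duration_s": 456.94, "total_tok_s": 347.85, "mean_itl_ms": 85.35}

def pct_change(a, b):
    """Percentage change from a to b."""
    return (b - a) / a * 100

for key in before:
    print(f"{key}: {pct_change(before[key], after[key]):+.1f}%")
# duration_s: -8.2%
# total_tok_s: +8.8%
# mean_itl_ms: -11.9%
```

So on the NPU the change yields roughly an 8.8% total-throughput improvement and a 11.9% lower mean inter-token latency, consistent with removing a per-decode-step H2D transfer.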

H20 GPU:

# launch_server :
python -m sglang.launch_server --model-path /data/MultiModalWeights/Qwen2.5-VL-32B-Instruct --tp 2 --host 127.0.0.1 --port 8022 --base-gpu-id 0 --mm-attention-backend fa3 --trust-remote-code  --enable-multimodal

# benchmark:
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8022 --dataset-name random-image --model /data/MultiModalWeights/Qwen2.5-VL-32B-Instruct --num-prompt 512 --random-input-len 31 --random-output-len 256 --random-range-ratio 1.0 --random-image-num-images 1 --random-image-resolution 720p --max-concurrency 32 --apply-chat-template --flush-cache

result before modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 301.21
Total input tokens: 27944
Total generated tokens: 131072
Total generated tokens (retokenized): 59641
Request throughput (req/s): 1.70
Input token throughput (tok/s): 92.77
Output token throughput (tok/s): 435.15
Total token throughput (tok/s): 527.92
Concurrency: 31.52
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18543.10
Median E2E Latency (ms): 18754.19
---------------Time to First Token----------------
Mean TTFT (ms): 7447.40
Median TTFT (ms): 8141.37
P99 TTFT (ms): 17402.11
---------------Inter-Token Latency----------------
Mean ITL (ms): 43.60
Median ITL (ms): 17.32
P95 ITL (ms): 76.08
P99 ITL (ms): 294.61
Max ITL (ms): 14387.22

result after modification:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 32
Successful requests: 512
Benchmark duration (s): 297.15
Total input tokens: 28008
Total generated tokens: 131072
Total generated tokens (retokenized): 59120
Request throughput (req/s): 1.72
Input token throughput (tok/s): 94.26
Output token throughput (tok/s): 441.10
Total token throughput (tok/s): 535.36
Concurrency: 31.71
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18405.00
Median E2E Latency (ms): 18126.27
---------------Time to First Token----------------
Mean TTFT (ms): 7743.42
Median TTFT (ms): 8579.59
P99 TTFT (ms): 16227.20
---------------Inter-Token Latency----------------
Mean ITL (ms): 41.81
Median ITL (ms): 17.31
P95 ITL (ms): 36.65
P99 ITL (ms): 625.31
Max ITL (ms): 13778.65


@JustinTong0323 JustinTong0323 self-assigned this Oct 1, 2025
@JustinTong0323 JustinTong0323 added the Multi-modal multi-modal language model label Oct 1, 2025
@ash-sigh ash-sigh marked this pull request as ready for review October 16, 2025 09:50
@JustinTong0323 JustinTong0323 added the express-lane A PR may be merged without a full CI check label Oct 26, 2025
@zhyncs zhyncs merged commit 0b3b3e9 into sgl-project:main Oct 26, 2025
29 of 70 checks passed
