sglang: DFlash speculative decoding on Qwen3-VL-4B gives no speedup, only negative gains #24142
Unanswered · cuihangbin asked this question in Q&A · Replies: 0 comments
Installed from the latest sglang main branch. Run log:
【2026-04-30 17:50:00】============================================
【2026-04-30 17:50:00】🚀 Starting Baseline (port=30000)...
【2026-04-30 17:50:00】============================================
【2026-04-30 17:50:00】⏳ Waiting for service 30000 to start (up to 300s)...
【2026-04-30 17:52:56】✅ Service 30000 ready (waited 176s)
【2026-04-30 17:52:56】
【2026-04-30 17:52:56】📊 Benchmark Baseline...
【2026-04-30 17:53:36】Loading MMStar samples...
【2026-04-30 17:53:36】
Generating val split: 0%| | 0/1500 [00:00<?, ? examples/s]
Generating val split: 7%|▋ | 100/1500 [00:00<00:01, 786.00 examples/s]
Generating val split: 100%|██████████| 1500/1500 [00:00<00:00, 8336.81 examples/s]
【2026-04-30 17:53:37】Running benchmark with 10 samples, concurrency=1
【2026-04-30 17:53:48】
MMStar: 0%| | 0/10 [00:00<?, ?it/s]
MMStar: 10%|█ | 1/10 [00:02<00:21, 2.44s/it]
MMStar: 20%|██ | 2/10 [00:03<00:12, 1.56s/it]
MMStar: 30%|███ | 3/10 [00:04<00:09, 1.40s/it]
MMStar: 40%|████ | 4/10 [00:05<00:06, 1.01s/it]
MMStar: 50%|█████ | 5/10 [00:05<00:03, 1.28it/s]
MMStar: 60%|██████ | 6/10 [00:05<00:02, 1.47it/s]
MMStar: 70%|███████ | 7/10 [00:06<00:01, 1.67it/s]
MMStar: 80%|████████ | 8/10 [00:07<00:01, 1.58it/s]
MMStar: 90%|█████████ | 9/10 [00:07<00:00, 1.81it/s]
MMStar: 100%|██████████| 10/10 [00:11<00:00, 1.54s/it]
MMStar: 100%|██████████| 10/10 [00:11<00:00, 1.11s/it]
【2026-04-30 17:53:48】============================================================
【2026-04-30 17:53:48】MMStar multimodal benchmark result
【2026-04-30 17:53:48】============================================================
【2026-04-30 17:53:48】 Base URL: http://127.0.0.1:30000
【2026-04-30 17:53:48】 Num samples: 10
【2026-04-30 17:53:48】 Errors: 0
【2026-04-30 17:53:48】 Total latency (s): 11.12
【2026-04-30 17:53:48】 Total completion: 2484 tokens
【2026-04-30 17:53:48】 Throughput: 223.29 tok/s
【2026-04-30 17:53:48】 Avg accept length: N/A
【2026-04-30 17:53:48】============================================================
【2026-04-30 17:53:48】Wrote json report to: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json
【2026-04-30 17:53:51】✅ Baseline benchmark complete; result: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json
【2026-04-30 17:53:51】🛑 Stopping the service on port 30000...
【2026-04-30 17:53:56】/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/test_sglang_dflash.sh: line 78: 1460 Killed CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server --model-path "$TARGET_MODEL" --tp-size $TP_SIZE --dtype bfloat16 --mem-fraction-static 0.9 --cuda-graph-max-bs 32 --context-length $CONTEXT_LENGTH --enable-return-hidden-states --port $PORT_BASELINE > "$LOG_DIR/baseline.log" 2>&1
【2026-04-30 17:53:56】
【2026-04-30 17:53:56】============================================
【2026-04-30 17:53:56】🚀 Starting DFlash (port=30001)...
【2026-04-30 17:53:56】 --speculative-num-draft-tokens=5
【2026-04-30 17:53:56】============================================
【2026-04-30 17:53:56】⏳ Waiting for service 30001 to start (up to 300s)...
【2026-04-30 17:56:14】✅ Service 30001 ready (waited 138s)
【2026-04-30 17:56:14】
【2026-04-30 17:56:14】📊 Benchmark DFlash...
【2026-04-30 17:56:54】Loading MMStar samples...
【2026-04-30 17:56:54】Running benchmark with 10 samples, concurrency=1
【2026-04-30 17:57:05】
MMStar: 0%| | 0/10 [00:00<?, ?it/s]
MMStar: 10%|█ | 1/10 [00:02<00:23, 2.64s/it]
MMStar: 20%|██ | 2/10 [00:03<00:13, 1.72s/it]
MMStar: 30%|███ | 3/10 [00:04<00:09, 1.40s/it]
MMStar: 40%|████ | 4/10 [00:05<00:06, 1.11s/it]
MMStar: 50%|█████ | 5/10 [00:05<00:04, 1.12it/s]
MMStar: 60%|██████ | 6/10 [00:06<00:03, 1.24it/s]
MMStar: 70%|███████ | 7/10 [00:07<00:02, 1.32it/s]
MMStar: 80%|████████ | 8/10 [00:08<00:01, 1.12it/s]
MMStar: 90%|█████████ | 9/10 [00:09<00:00, 1.21it/s]
MMStar: 100%|██████████| 10/10 [00:10<00:00, 1.14it/s]
MMStar: 100%|██████████| 10/10 [00:10<00:00, 1.01s/it]
【2026-04-30 17:57:05】============================================================
【2026-04-30 17:57:05】MMStar multimodal benchmark result
【2026-04-30 17:57:05】============================================================
【2026-04-30 17:57:05】 Base URL: http://127.0.0.1:30001
【2026-04-30 17:57:05】 Num samples: 10
【2026-04-30 17:57:05】 Errors: 0
【2026-04-30 17:57:05】 Total latency (s): 10.06
【2026-04-30 17:57:05】 Total completion: 1437 tokens
【2026-04-30 17:57:05】 Throughput: 142.85 tok/s
【2026-04-30 17:57:05】 Avg accept length: 1.3711538461538462
【2026-04-30 17:57:05】============================================================
【2026-04-30 17:57:05】Wrote json report to: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json
【2026-04-30 17:57:07】✅ DFlash benchmark complete; result: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】📐 Compare baseline vs dflash hidden states...
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】Compare baseline vs dflash per sample:
【2026-04-30 17:57:07】sample_id,comp_tokens_baseline,comp_tokens_dflash,hidden_dim_baseline,hidden_dim_dflash,max_diff_last_hidden,cos_last_hidden
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】========== Throughput comparison summary ==========
【2026-04-30 17:57:07】Baseline throughput=N/A tok/s latency=11.124370343983173s tokens=2484 avg_accept_len=N/A
【2026-04-30 17:57:07】DFlash throughput=N/A tok/s latency=10.059408709406853s tokens=1437 avg_accept_len=N/A
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】✅ All done
【2026-04-30 17:57:07】📁 Log directory: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】Key files:
【2026-04-30 17:57:07】 baseline server log : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline.log
【2026-04-30 17:57:07】 dflash server log : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash.log
【2026-04-30 17:57:07】 baseline bench : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json
【2026-04-30 17:57:07】 dflash bench : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json
【2026-04-30 17:57:07】🧹 Cleaning up processes...
【2026-04-30 17:57:10】============================================== Task Result ============================================
【2026-04-30 17:57:10】[DEBUG] result: success
【2026-04-30 17:57:10】[DEBUG] exit code: 0
【2026-04-30 17:57:10】[DEBUG] start time: 2026-04-30 17:49:03
【2026-04-30 17:57:10】[DEBUG] end time: 2026-04-30 17:57:10
【2026-04-30 17:57:10】============================================== Task End ===============================================
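Reading the two result blocks together: baseline throughput is 223.29 tok/s, DFlash only 142.85 tok/s, and the average accept length is about 1.37 out of the 5 drafted tokens. A quick sanity check (the cost model below is a simplifying assumption, not sglang's actual scheduler) shows why such a low accept rate turns speculation into a net loss:

```python
# Numbers taken from the run log above.
base_tps = 223.29                 # baseline throughput, tok/s
dflash_tps = 142.85               # DFlash throughput, tok/s
accept_len = 1.3711538461538462   # avg tokens emitted per verify step
num_draft = 5                     # --speculative-num-draft-tokens

# Observed relative speed: well below 1.0, i.e. a slowdown.
rel = dflash_tps / base_tps
print(f"relative throughput: {rel:.2f}x")

# Rough cost model (an assumption, not sglang internals): each verify step
# emits ~accept_len tokens but pays one target forward pass plus the draft
# cost. For speculation to win, accept_len must exceed
#   1 + (draft cost per step) / (target cost per step).
# With ~1.37 accepted out of 5 drafted tokens, most draft work is wasted,
# so any nontrivial draft overhead makes the net gain negative.
break_even_draft_frac = accept_len - 1  # max affordable draft/target cost ratio
print(f"draft cost must stay below {break_even_draft_frac:.2f}x of a target step")
```

Under this model, either the accept length has to rise substantially or the draft step has to be nearly free for DFlash to pay off at concurrency=1.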
The script:
#!/usr/bin/env bash
set -euo pipefail
########################################
# Mount datasets
########################################
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh heterogeneous-computing \
  /nfs/dataset-ofs-heterogeneous-computing/ 5c20783e7023437a93220006b9232e40 nmgvoyagermodel
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh voyager-model-evaluation \
  /ofs/model-eval-output 62030ef0956e44269180321d62d46e22 nmgpu
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh prediction-dos \
  /nfs/dataset-ofs-prediction-dos ac28a6f0810a49b09eb86f66fca7bd93 nmgvoyagerbag
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh architecture \
  /nfs/dataset-ofs-architecture 8658742bfcad4e89bbf4c9b042d984e1 nmgvoyagermodel
########################################
# Activate the environment
########################################
eval "$(conda shell.bash hook)"
conda activate /nfs/dataset-ofs-architecture/cuihangbin/workspace/miniconda3/envs/sglang_env
cd /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang
python -c "import sglang; print(sglang.__file__)"
########################################
# Path configuration
########################################
TARGET_MODEL="/home/luban/Model-Optimizer/model_path/Qwen3-VL-4B-Instruct"
DRAFT_MODEL="/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/llm_model/Qwen3-4B-DFlash-b16"
DATASET_PATH="/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/llm_model/Qwen3-4B-DFlash-b16/mmstar"
TP_SIZE=2
CONTEXT_LENGTH=40960
MAX_COMPLETION_TOKENS=2048
NUM_SAMPLES=10
CONCURRENCY=1
# DFlash core parameters, tuned for GPU memory and the draft model
SPECULATIVE_NUM_DRAFT_TOKENS=5  # start with 5; drop to 3 if accept length stays low
PORT_BASELINE=30000
PORT_DFLASH=30001
LOG_DIR=/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log
mkdir -p "$LOG_DIR"
########################################
# Environment variables
########################################
export SGLANG_DISABLE_CUDNN_CHECK=1
########################################
# Helper functions
########################################
cleanup() {
  echo "🧹 Cleaning up processes..."
  pkill -f sglang.launch_server 2>/dev/null || true
  sleep 3
}
trap cleanup EXIT
wait_ready() {
  local port=$1
  local max_wait=300
  echo "⏳ Waiting for service $port to start (up to ${max_wait}s)..."
  for i in $(seq 1 $max_wait); do
    if curl -s "http://127.0.0.1:${port}/health" >/dev/null 2>&1; then
      echo "✅ Service $port ready (waited ${i}s)"
      return 0
    fi
    sleep 1
  done
  echo "❌ Service $port failed to start in time; log tail:"
  # Note: the servers actually log to baseline.log / dflash.log, so this
  # ${port}.log path only matters on the failure path and may miss.
  tail -30 "$LOG_DIR/${port}.log" 2>/dev/null || true
  exit 1
}
kill_server() {
  local port=$1
  echo "🛑 Stopping the service on port $port..."
  pkill -f "port $port" 2>/dev/null || true
  pkill -f "launch_server.*--port $port" 2>/dev/null || true
  # Fallback: find the process via lsof
  local pid
  pid=$(lsof -ti tcp:"$port" 2>/dev/null || true)
  if [ -n "$pid" ]; then
    kill "$pid" 2>/dev/null || true
  fi
  sleep 5
}
########################################
# 1️⃣ Start Baseline (no speculative decoding)
########################################
echo ""
echo "============================================"
echo "🚀 Starting Baseline (port=$PORT_BASELINE)..."
echo "============================================"
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path "$TARGET_MODEL" \
  --tp-size $TP_SIZE \
  --dtype bfloat16 \
  --mem-fraction-static 0.9 \
  --cuda-graph-max-bs 32 \
  --context-length $CONTEXT_LENGTH \
  --enable-return-hidden-states \
  --port $PORT_BASELINE \
  > "$LOG_DIR/baseline.log" 2>&1 &
wait_ready $PORT_BASELINE
########################################
# 2️⃣ Benchmark Baseline
########################################
echo ""
echo "📊 Benchmark Baseline..."
python benchmark/dflash/bench_dflash_mmstar.py \
  --port $PORT_BASELINE \
  --dataset-path "$DATASET_PATH" \
  --num-samples $NUM_SAMPLES \
  --concurrency $CONCURRENCY \
  --max-completion-tokens $MAX_COMPLETION_TOKENS \
  --temperature 0.0 \
  --output-json "$LOG_DIR/baseline_bench.json" \
  | tee "$LOG_DIR/baseline_bench.log"
echo "✅ Baseline benchmark complete; result: $LOG_DIR/baseline_bench.json"
########################################
# Stop the Baseline server
########################################
kill_server $PORT_BASELINE
########################################
# 3️⃣ Start DFlash (speculative decoding)
########################################
echo ""
echo "============================================"
echo "🚀 Starting DFlash (port=$PORT_DFLASH)..."
echo " --speculative-num-draft-tokens=$SPECULATIVE_NUM_DRAFT_TOKENS"
echo "============================================"
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path "$TARGET_MODEL" \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path "$DRAFT_MODEL" \
  --speculative-num-draft-tokens $SPECULATIVE_NUM_DRAFT_TOKENS \
  --tp-size $TP_SIZE \
  --dtype bfloat16 \
  --mem-fraction-static 0.9 \
  --cuda-graph-max-bs 32 \
  --context-length $CONTEXT_LENGTH \
  --enable-return-hidden-states \
  --port $PORT_DFLASH \
  > "$LOG_DIR/dflash.log" 2>&1 &
wait_ready $PORT_DFLASH
########################################
# 4️⃣ Benchmark DFlash
########################################
echo ""
echo "📊 Benchmark DFlash..."
python benchmark/dflash/bench_dflash_mmstar.py \
  --port $PORT_DFLASH \
  --dataset-path "$DATASET_PATH" \
  --num-samples $NUM_SAMPLES \
  --concurrency $CONCURRENCY \
  --max-completion-tokens $MAX_COMPLETION_TOKENS \
  --temperature 0.0 \
  --output-json "$LOG_DIR/dflash_bench.json" \
  | tee "$LOG_DIR/dflash_bench.log"
echo "✅ DFlash benchmark complete; result: $LOG_DIR/dflash_bench.json"
########################################
# 5️⃣ Hidden-states comparison
########################################
echo ""
echo "📐 Compare baseline vs dflash hidden states..."
python - <<'EOF'
import json
import math
import sys
def load(path):
    try:
        with open(path) as f:
            return json.load(f)
    except Exception as e:
        print(f"❌ Failed to read {path}: {e}")
        sys.exit(1)
baseline = load("/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json")
dflash = load("/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json")
samples_a = {str(s.get("sample_id")): s for s in baseline.get("per_request_metrics", [])}
samples_b = {str(s.get("sample_id")): s for s in dflash.get("per_request_metrics", [])}
def align_hidden(a, b):
    if not a or not b:
        return None, None
    la, lb = len(a), len(b)
    if la == lb:
        return a, b
    if la > lb and la % lb == 0:
        return a[-lb:], b
    if lb > la and lb % la == 0:
        return a, b[-la:]
    return None, None
print("\nCompare baseline vs dflash per sample:")
print("sample_id,comp_tokens_baseline,comp_tokens_dflash,hidden_dim_baseline,hidden_dim_dflash,max_diff_last_hidden,cos_last_hidden")
all_ids = sorted(set(samples_a) & set(samples_b))
for sid in all_ids:
    a = samples_a.get(sid, {})
    b = samples_b.get(sid, {})

# Throughput comparison summary
# NOTE: the throughput and accept-length keys printed N/A in the run above,
# which suggests those two key names do not match what
# bench_dflash_mmstar.py actually writes into the JSON reports.
print("\n========== Throughput comparison summary ==========")
for label, data in [("Baseline", baseline), ("DFlash", dflash)]:
    tp = data.get("throughput_toks_per_s", "N/A")
    lat = data.get("total_latency_s", "N/A")
    toks = data.get("total_completion_tokens", "N/A")
    acc = data.get("avg_spec_accept_length", "N/A")
    print(f"{label:10s} throughput={tp} tok/s latency={lat}s tokens={toks} avg_accept_len={acc}")
EOF
########################################
echo ""
echo "✅ All done"
echo "📁 Log directory: $LOG_DIR"
echo ""
echo "Key files:"
echo " baseline server log : $LOG_DIR/baseline.log"
echo " dflash server log : $LOG_DIR/dflash.log"
echo " baseline bench : $LOG_DIR/baseline_bench.json"
echo " dflash bench : $LOG_DIR/dflash_bench.json"
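One gap worth flagging: the per-sample loop in the heredoc only fetches `a` and `b` and never prints a CSV row, which matches the empty comparison table in the log. A sketch of what the row computation could look like is below; it reuses the script's `align_hidden`, and the `last_hidden_state` and `completion_tokens` field names are guesses about what bench_dflash_mmstar.py records per request, so adapt them to the real schema:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else float("nan")

def compare_row(sid, a, b, align_hidden):
    # "last_hidden_state" is an assumed field name; adapt to whatever
    # the benchmark script actually stores per request.
    ha, hb = align_hidden(a.get("last_hidden_state"), b.get("last_hidden_state"))
    ta, tb = a.get("completion_tokens"), b.get("completion_tokens")
    if ha is None:
        return f"{sid},{ta},{tb},-,-,N/A,N/A"
    max_diff = max(abs(x - y) for x, y in zip(ha, hb))
    return f"{sid},{ta},{tb},{len(ha)},{len(hb)},{max_diff:.6g},{cosine(ha, hb):.6f}"
```

With greedy decoding (temperature 0.0) the cosine of the last hidden states should be very close to 1.0 whenever the two runs produce the same tokens; large deviations would point at a correctness problem rather than just a performance one.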