sglang: DFlash speculative decoding on Qwen3-VL-4B gives no speedup, only negative gains #24142
Unanswered · cuihangbin asked this question in Q&A · Replies: 0 comments
Installed from the latest sglang main branch. Run log:
【2026-04-30 17:50:00】============================================
【2026-04-30 17:50:00】🚀 Starting Baseline (port=30000)...
【2026-04-30 17:50:00】============================================
【2026-04-30 17:50:00】⏳ Waiting for service 30000 to start (up to 300s)...
【2026-04-30 17:52:56】✅ Service 30000 ready (waited 176s)
【2026-04-30 17:52:56】
【2026-04-30 17:52:56】📊 Benchmark Baseline...
【2026-04-30 17:53:36】Loading MMStar samples...
【2026-04-30 17:53:36】
Generating val split: 0%| | 0/1500 [00:00<?, ? examples/s]
Generating val split: 7%|▋ | 100/1500 [00:00<00:01, 786.00 examples/s]
Generating val split: 100%|██████████| 1500/1500 [00:00<00:00, 8336.81 examples/s]
【2026-04-30 17:53:37】Running benchmark with 10 samples, concurrency=1
【2026-04-30 17:53:48】
MMStar: 0%| | 0/10 [00:00<?, ?it/s]
MMStar: 10%|█ | 1/10 [00:02<00:21, 2.44s/it]
MMStar: 20%|██ | 2/10 [00:03<00:12, 1.56s/it]
MMStar: 30%|███ | 3/10 [00:04<00:09, 1.40s/it]
MMStar: 40%|████ | 4/10 [00:05<00:06, 1.01s/it]
MMStar: 50%|█████ | 5/10 [00:05<00:03, 1.28it/s]
MMStar: 60%|██████ | 6/10 [00:05<00:02, 1.47it/s]
MMStar: 70%|███████ | 7/10 [00:06<00:01, 1.67it/s]
MMStar: 80%|████████ | 8/10 [00:07<00:01, 1.58it/s]
MMStar: 90%|█████████ | 9/10 [00:07<00:00, 1.81it/s]
MMStar: 100%|██████████| 10/10 [00:11<00:00, 1.54s/it]
MMStar: 100%|██████████| 10/10 [00:11<00:00, 1.11s/it]
【2026-04-30 17:53:48】============================================================
【2026-04-30 17:53:48】MMStar multimodal benchmark result
【2026-04-30 17:53:48】============================================================
【2026-04-30 17:53:48】 Base URL: http://127.0.0.1:30000
【2026-04-30 17:53:48】 Num samples: 10
【2026-04-30 17:53:48】 Errors: 0
【2026-04-30 17:53:48】 Total latency (s): 11.12
【2026-04-30 17:53:48】 Total completion: 2484 tokens
【2026-04-30 17:53:48】 Throughput: 223.29 tok/s
【2026-04-30 17:53:48】 Avg accept length: N/A
【2026-04-30 17:53:48】============================================================
【2026-04-30 17:53:48】Wrote json report to: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json
【2026-04-30 17:53:51】✅ Baseline benchmark complete; result: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json
【2026-04-30 17:53:51】🛑 Stopping the service on port 30000...
【2026-04-30 17:53:56】/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/test_sglang_dflash.sh: line 78: 1460 Killed CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server --model-path "$TARGET_MODEL" --tp-size $TP_SIZE --dtype bfloat16 --mem-fraction-static 0.9 --cuda-graph-max-bs 32 --context-length $CONTEXT_LENGTH --enable-return-hidden-states --port $PORT_BASELINE > "$LOG_DIR/baseline.log" 2>&1
【2026-04-30 17:53:56】
【2026-04-30 17:53:56】============================================
【2026-04-30 17:53:56】🚀 Starting DFlash (port=30001)...
【2026-04-30 17:53:56】 --speculative-num-draft-tokens=5
【2026-04-30 17:53:56】============================================
【2026-04-30 17:53:56】⏳ Waiting for service 30001 to start (up to 300s)...
【2026-04-30 17:56:14】✅ Service 30001 ready (waited 138s)
【2026-04-30 17:56:14】
【2026-04-30 17:56:14】📊 Benchmark DFlash...
【2026-04-30 17:56:54】Loading MMStar samples...
【2026-04-30 17:56:54】Running benchmark with 10 samples, concurrency=1
【2026-04-30 17:57:05】
MMStar: 0%| | 0/10 [00:00<?, ?it/s]
MMStar: 10%|█ | 1/10 [00:02<00:23, 2.64s/it]
MMStar: 20%|██ | 2/10 [00:03<00:13, 1.72s/it]
MMStar: 30%|███ | 3/10 [00:04<00:09, 1.40s/it]
MMStar: 40%|████ | 4/10 [00:05<00:06, 1.11s/it]
MMStar: 50%|█████ | 5/10 [00:05<00:04, 1.12it/s]
MMStar: 60%|██████ | 6/10 [00:06<00:03, 1.24it/s]
MMStar: 70%|███████ | 7/10 [00:07<00:02, 1.32it/s]
MMStar: 80%|████████ | 8/10 [00:08<00:01, 1.12it/s]
MMStar: 90%|█████████ | 9/10 [00:09<00:00, 1.21it/s]
MMStar: 100%|██████████| 10/10 [00:10<00:00, 1.14it/s]
MMStar: 100%|██████████| 10/10 [00:10<00:00, 1.01s/it]
【2026-04-30 17:57:05】============================================================
【2026-04-30 17:57:05】MMStar multimodal benchmark result
【2026-04-30 17:57:05】============================================================
【2026-04-30 17:57:05】 Base URL: http://127.0.0.1:30001
【2026-04-30 17:57:05】 Num samples: 10
【2026-04-30 17:57:05】 Errors: 0
【2026-04-30 17:57:05】 Total latency (s): 10.06
【2026-04-30 17:57:05】 Total completion: 1437 tokens
【2026-04-30 17:57:05】 Throughput: 142.85 tok/s
【2026-04-30 17:57:05】 Avg accept length: 1.3711538461538462
【2026-04-30 17:57:05】============================================================
【2026-04-30 17:57:05】Wrote json report to: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json
【2026-04-30 17:57:07】✅ DFlash benchmark complete; result: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】📐 Compare baseline vs dflash hidden states...
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】Compare baseline vs dflash per sample:
【2026-04-30 17:57:07】sample_id,comp_tokens_baseline,comp_tokens_dflash,hidden_dim_baseline,hidden_dim_dflash,max_diff_last_hidden,cos_last_hidden
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】========== Throughput comparison summary ==========
【2026-04-30 17:57:07】Baseline throughput=N/A tok/s latency=11.124370343983173s tokens=2484 avg_accept_len=N/A
【2026-04-30 17:57:07】DFlash throughput=N/A tok/s latency=10.059408709406853s tokens=1437 avg_accept_len=N/A
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】✅ All done
【2026-04-30 17:57:07】📁 Log directory: /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log
【2026-04-30 17:57:07】
【2026-04-30 17:57:07】Key files:
【2026-04-30 17:57:07】 baseline server log : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline.log
【2026-04-30 17:57:07】 dflash server log : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash.log
【2026-04-30 17:57:07】 baseline bench : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json
【2026-04-30 17:57:07】 dflash bench : /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json
【2026-04-30 17:57:07】🧹 Cleaning up processes...
【2026-04-30 17:57:10】============================================== Task Result ============================================
【2026-04-30 17:57:10】[DEBUG] result: success
【2026-04-30 17:57:10】[DEBUG] exit code: 0
【2026-04-30 17:57:10】[DEBUG] start time: 2026-04-30 17:49:03
【2026-04-30 17:57:10】[DEBUG] end time: 2026-04-30 17:57:10
【2026-04-30 17:57:10】============================================== Task End ===============================================
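Reading the two result blocks together: baseline throughput is 223.29 tok/s, DFlash only 142.85 tok/s, and the average accept length is about 1.37 out of the 5 drafted tokens. A quick sanity check (the cost model below is a simplifying assumption, not sglang's actual scheduler) shows why such a low accept rate turns speculation into a net loss:

```python
# Numbers taken from the run log above.
base_tps = 223.29                 # baseline throughput, tok/s
dflash_tps = 142.85               # DFlash throughput, tok/s
accept_len = 1.3711538461538462   # avg tokens emitted per verify step
num_draft = 5                     # --speculative-num-draft-tokens

# Observed relative speed: well below 1.0, i.e. a slowdown.
rel = dflash_tps / base_tps
print(f"relative throughput: {rel:.2f}x")

# Rough cost model (an assumption, not sglang internals): each verify step
# emits ~accept_len tokens but pays one target forward pass plus the draft
# cost. For speculation to win, accept_len must exceed
#   1 + (draft cost per step) / (target cost per step).
# With ~1.37 accepted out of 5 drafted tokens, most draft work is wasted,
# so any nontrivial draft overhead makes the net gain negative.
break_even_draft_frac = accept_len - 1  # max affordable draft/target cost ratio
print(f"draft cost must stay below {break_even_draft_frac:.2f}x of a target step")
```

Under this model, either the accept length has to rise substantially or the draft step has to be nearly free for DFlash to pay off at concurrency=1.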
The script:
#!/usr/bin/env bash
set -euo pipefail
########################################
# Mount datasets
########################################
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh heterogeneous-computing \
  /nfs/dataset-ofs-heterogeneous-computing/ 5c20783e7023437a93220006b9232e40 nmgvoyagermodel
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh voyager-model-evaluation \
  /ofs/model-eval-output 62030ef0956e44269180321d62d46e22 nmgpu
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh prediction-dos \
  /nfs/dataset-ofs-prediction-dos ac28a6f0810a49b09eb86f66fca7bd93 nmgvoyagerbag
sudo bash /mnt/common/jianshu/ofs/release/current/script/ofs_mount.sh architecture \
  /nfs/dataset-ofs-architecture 8658742bfcad4e89bbf4c9b042d984e1 nmgvoyagermodel
########################################
# Activate the environment
########################################
eval "$(conda shell.bash hook)"
conda activate /nfs/dataset-ofs-architecture/cuihangbin/workspace/miniconda3/envs/sglang_env
cd /nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang
python -c "import sglang; print(sglang.__file__)"
########################################
# Path configuration
########################################
TARGET_MODEL="/home/luban/Model-Optimizer/model_path/Qwen3-VL-4B-Instruct"
DRAFT_MODEL="/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/llm_model/Qwen3-4B-DFlash-b16"
DATASET_PATH="/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/llm_model/Qwen3-4B-DFlash-b16/mmstar"
TP_SIZE=2
CONTEXT_LENGTH=40960
MAX_COMPLETION_TOKENS=2048
NUM_SAMPLES=10
CONCURRENCY=1
# DFlash core parameters, tuned for GPU memory and the draft model
SPECULATIVE_NUM_DRAFT_TOKENS=5  # start with 5; drop to 3 if accept length stays low
PORT_BASELINE=30000
PORT_DFLASH=30001
LOG_DIR=/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log
mkdir -p "$LOG_DIR"
########################################
# Environment variables
########################################
export SGLANG_DISABLE_CUDNN_CHECK=1
########################################
# Helper functions
########################################
cleanup() {
  echo "🧹 Cleaning up processes..."
  pkill -f sglang.launch_server 2>/dev/null || true
  sleep 3
}
trap cleanup EXIT
wait_ready() {
  local port=$1
  local max_wait=300
  echo "⏳ Waiting for service $port to start (up to ${max_wait}s)..."
  for i in $(seq 1 $max_wait); do
    if curl -s "http://127.0.0.1:${port}/health" >/dev/null 2>&1; then
      echo "✅ Service $port ready (waited ${i}s)"
      return 0
    fi
    sleep 1
  done
  echo "❌ Service $port failed to start in time; log tail:"
  # Note: the servers actually log to baseline.log / dflash.log, so this
  # ${port}.log path only matters on the failure path and may miss.
  tail -30 "$LOG_DIR/${port}.log" 2>/dev/null || true
  exit 1
}
kill_server() {
  local port=$1
  echo "🛑 Stopping the service on port $port..."
  pkill -f "port $port" 2>/dev/null || true
  pkill -f "launch_server.*--port $port" 2>/dev/null || true
  # Fallback: find the process via lsof
  local pid
  pid=$(lsof -ti tcp:"$port" 2>/dev/null || true)
  if [ -n "$pid" ]; then
    kill "$pid" 2>/dev/null || true
  fi
  sleep 5
}
########################################
# 1️⃣ Start Baseline (no speculative decoding)
########################################
echo ""
echo "============================================"
echo "🚀 Starting Baseline (port=$PORT_BASELINE)..."
echo "============================================"
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path "$TARGET_MODEL" \
  --tp-size $TP_SIZE \
  --dtype bfloat16 \
  --mem-fraction-static 0.9 \
  --cuda-graph-max-bs 32 \
  --context-length $CONTEXT_LENGTH \
  --enable-return-hidden-states \
  --port $PORT_BASELINE \
  > "$LOG_DIR/baseline.log" 2>&1 &
wait_ready $PORT_BASELINE
########################################
# 2️⃣ Benchmark Baseline
########################################
echo ""
echo "📊 Benchmark Baseline..."
python benchmark/dflash/bench_dflash_mmstar.py \
  --port $PORT_BASELINE \
  --dataset-path "$DATASET_PATH" \
  --num-samples $NUM_SAMPLES \
  --concurrency $CONCURRENCY \
  --max-completion-tokens $MAX_COMPLETION_TOKENS \
  --temperature 0.0 \
  --output-json "$LOG_DIR/baseline_bench.json" \
  | tee "$LOG_DIR/baseline_bench.log"
echo "✅ Baseline benchmark complete; result: $LOG_DIR/baseline_bench.json"
########################################
# Stop the Baseline server
########################################
kill_server $PORT_BASELINE
########################################
# 3️⃣ Start DFlash (speculative decoding)
########################################
echo ""
echo "============================================"
echo "🚀 Starting DFlash (port=$PORT_DFLASH)..."
echo " --speculative-num-draft-tokens=$SPECULATIVE_NUM_DRAFT_TOKENS"
echo "============================================"
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
  --model-path "$TARGET_MODEL" \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path "$DRAFT_MODEL" \
  --speculative-num-draft-tokens $SPECULATIVE_NUM_DRAFT_TOKENS \
  --tp-size $TP_SIZE \
  --dtype bfloat16 \
  --mem-fraction-static 0.9 \
  --cuda-graph-max-bs 32 \
  --context-length $CONTEXT_LENGTH \
  --enable-return-hidden-states \
  --port $PORT_DFLASH \
  > "$LOG_DIR/dflash.log" 2>&1 &
wait_ready $PORT_DFLASH
########################################
# 4️⃣ Benchmark DFlash
########################################
echo ""
echo "📊 Benchmark DFlash..."
python benchmark/dflash/bench_dflash_mmstar.py \
  --port $PORT_DFLASH \
  --dataset-path "$DATASET_PATH" \
  --num-samples $NUM_SAMPLES \
  --concurrency $CONCURRENCY \
  --max-completion-tokens $MAX_COMPLETION_TOKENS \
  --temperature 0.0 \
  --output-json "$LOG_DIR/dflash_bench.json" \
  | tee "$LOG_DIR/dflash_bench.log"
echo "✅ DFlash benchmark complete; result: $LOG_DIR/dflash_bench.json"
########################################
# 5️⃣ Hidden-states comparison
########################################
echo ""
echo "📐 Compare baseline vs dflash hidden states..."
python - <<'EOF'
import json
import math
import sys
def load(path):
    try:
        with open(path) as f:
            return json.load(f)
    except Exception as e:
        print(f"❌ Failed to read {path}: {e}")
        sys.exit(1)
baseline = load("/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/baseline_bench.json")
dflash = load("/nfs/dataset-ofs-heterogeneous-computing/cuihangbin/vlm_infra/sglang/log/dflash_bench.json")
samples_a = {str(s.get("sample_id")): s for s in baseline.get("per_request_metrics", [])}
samples_b = {str(s.get("sample_id")): s for s in dflash.get("per_request_metrics", [])}
def align_hidden(a, b):
    if not a or not b:
        return None, None
    la, lb = len(a), len(b)
    if la == lb:
        return a, b
    if la > lb and la % lb == 0:
        return a[-lb:], b
    if lb > la and lb % la == 0:
        return a, b[-la:]
    return None, None
print("\nCompare baseline vs dflash per sample:")
print("sample_id,comp_tokens_baseline,comp_tokens_dflash,hidden_dim_baseline,hidden_dim_dflash,max_diff_last_hidden,cos_last_hidden")
all_ids = sorted(set(samples_a) & set(samples_b))
for sid in all_ids:
    a = samples_a.get(sid, {})
    b = samples_b.get(sid, {})

# Throughput comparison summary
# NOTE: the throughput and accept-length keys printed N/A in the run above,
# which suggests those two key names do not match what
# bench_dflash_mmstar.py actually writes into the JSON reports.
print("\n========== Throughput comparison summary ==========")
for label, data in [("Baseline", baseline), ("DFlash", dflash)]:
    tp = data.get("throughput_toks_per_s", "N/A")
    lat = data.get("total_latency_s", "N/A")
    toks = data.get("total_completion_tokens", "N/A")
    acc = data.get("avg_spec_accept_length", "N/A")
    print(f"{label:10s} throughput={tp} tok/s latency={lat}s tokens={toks} avg_accept_len={acc}")
EOF
########################################
echo ""
echo "✅ All done"
echo "📁 Log directory: $LOG_DIR"
echo ""
echo "Key files:"
echo " baseline server log : $LOG_DIR/baseline.log"
echo " dflash server log : $LOG_DIR/dflash.log"
echo " baseline bench : $LOG_DIR/baseline_bench.json"
echo " dflash bench : $LOG_DIR/dflash_bench.json"
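One gap worth flagging: the per-sample loop in the heredoc only fetches `a` and `b` and never prints a CSV row, which matches the empty comparison table in the log. A sketch of what the row computation could look like is below; it reuses the script's `align_hidden`, and the `last_hidden_state` and `completion_tokens` field names are guesses about what bench_dflash_mmstar.py records per request, so adapt them to the real schema:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else float("nan")

def compare_row(sid, a, b, align_hidden):
    # "last_hidden_state" is an assumed field name; adapt to whatever
    # the benchmark script actually stores per request.
    ha, hb = align_hidden(a.get("last_hidden_state"), b.get("last_hidden_state"))
    ta, tb = a.get("completion_tokens"), b.get("completion_tokens")
    if ha is None:
        return f"{sid},{ta},{tb},-,-,N/A,N/A"
    max_diff = max(abs(x - y) for x, y in zip(ha, hb))
    return f"{sid},{ta},{tb},{len(ha)},{len(hb)},{max_diff:.6g},{cosine(ha, hb):.6f}"
```

With greedy decoding (temperature 0.0) the cosine of the last hidden states should be very close to 1.0 whenever the two runs produce the same tokens; large deviations would point at a correctness problem rather than just a performance one.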