[DSv32] Overlap indexer weights_proj during dual_stream decode#16637

Merged
Fridge003 merged 1 commit into sgl-project:main from zianglih:zianglih/DSv32_overlap
Jan 10, 2026
Conversation

@zianglih
Contributor

@zianglih zianglih commented Jan 7, 2026

Motivation

The weights_proj in the DSA indexer runs in float32, which is slow on modern GPUs. Profiler traces show it accounts for ~20% of decode-layer runtime. The traces also reveal that the projection typically launches with a grid of (bs, 1, 1), occupying very few SMs and leaving room for inter-stream overlap.

Modifications

In this optimization, we overlap the slow weights_proj with q_b_proj, the indexer's _get_q_k_bf16, and the indexer's qk act_quant during dual-stream decode.

Accuracy Tests

BEFORE 
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 4096 --repeat 10 '",
  "benchmark_result": [
    "====================",
    "Repeat: 10, mean: 0.777",
    "Scores: ['0.788', '0.768', '0.808', '0.783', '0.788', '0.758', '0.747', '0.778', '0.778', '0.773']",
    "====================",
    "[METRIC] gpqa_mean_score=0.7767676767676768 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 10}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(3697.338383838384), 'chars:std': np.float64(2099.168857082937), 'score:std': np.float64(0.4190702026042221), 'scores': ['0.788', '0.768', '0.808', '0.783', '0.788', '0.758', '0.747', '0.778', '0.778', '0.773'], 'mean_score': np.float64(0.7767676767676768)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}
AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 4096 --repeat 10 '",
  "benchmark_result": [
    "====================",
    "Repeat: 10, mean: 0.771",
    "Scores: ['0.737', '0.793', '0.783', '0.747', '0.763', '0.798', '0.753', '0.773', '0.793', '0.773']",
    "====================",
    "[METRIC] gpqa_mean_score=0.7712121212121212 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 10}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(3708.1464646464647), 'chars:std': np.float64(2132.345340292064), 'score:std': np.float64(0.4190702026042221), 'scores': ['0.737', '0.793', '0.783', '0.747', '0.763', '0.798', '0.753', '0.773', '0.793', '0.773'], 'mean_score': np.float64(0.7712121212121212)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}

Benchmarking and Profiling

[profiler trace screenshots omitted]
1k/2k BEFORE
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  85.95",
    "Total input tokens:                      235251",
    "Total input text tokens:                 235251",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473977",
    "Total generated tokens (retokenized):    473807",
    "Request throughput (req/s):              2.98",
    "Input token throughput (tok/s):          2736.95",
    "Output token throughput (tok/s):         5514.33",
    "Peak output token throughput (tok/s):    6807.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          8251.28",
    "Concurrency:                             232.38",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   78024.30",
    "Median E2E Latency (ms):                 78047.66",
    "P90 E2E Latency (ms):                    84378.09",
    "P99 E2E Latency (ms):                    85878.01",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          4084.06",
    "Median TTFT (ms):                        4083.83",
    "P99 TTFT (ms):                           6333.84",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          39.96",
    "Median TPOT (ms):                        39.96",
    "P99 TPOT (ms):                           41.53",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           39.96",
    "Median ITL (ms):                         38.43",
    "P95 ITL (ms):                            52.08",
    "P99 ITL (ms):                            61.16",
    "Max ITL (ms):                            5643.54",
    "=================================================="
  ]
}
1k/2k AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  82.53",
    "Total input tokens:                      235251",
    "Total input text tokens:                 235251",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473977",
    "Total generated tokens (retokenized):    473750",
    "Request throughput (req/s):              3.10",
    "Input token throughput (tok/s):          2850.66",
    "Output token throughput (tok/s):         5743.42",
    "Peak output token throughput (tok/s):    7168.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          8594.08",
    "Concurrency:                             231.91",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   74760.71",
    "Median E2E Latency (ms):                 74710.94",
    "P90 E2E Latency (ms):                    80995.56",
    "P99 E2E Latency (ms):                    82450.48",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          4001.88",
    "Median TTFT (ms):                        4025.09",
    "P99 TTFT (ms):                           6228.64",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          38.24",
    "Median TPOT (ms):                        38.26",
    "P99 TPOT (ms):                           39.78",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           38.24",
    "Median ITL (ms):                         36.74",
    "P95 ITL (ms):                            52.39",
    "P99 ITL (ms):                            65.52",
    "Max ITL (ms):                            5586.01",
    "=================================================="
  ]
}
8k/2k BEFORE
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 8192 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  132.50",
    "Total input tokens:                      1900585",
    "Total input text tokens:                 1900585",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473587",
    "Total generated tokens (retokenized):    473578",
    "Request throughput (req/s):              1.93",
    "Input token throughput (tok/s):          14344.10",
    "Output token throughput (tok/s):         3574.26",
    "Peak output token throughput (tok/s):    6784.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          17918.36",
    "Concurrency:                             240.61",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   124536.19",
    "Median E2E Latency (ms):                 124556.93",
    "P90 E2E Latency (ms):                    130819.25",
    "P99 E2E Latency (ms):                    132421.32",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          27678.59",
    "Median TTFT (ms):                        27763.26",
    "P99 TTFT (ms):                           52400.84",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          52.43",
    "Median TPOT (ms):                        52.39",
    "P99 TPOT (ms):                           66.58",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           52.39",
    "Median ITL (ms):                         38.38",
    "P95 ITL (ms):                            52.47",
    "P99 ITL (ms):                            61.65",
    "Max ITL (ms):                            50119.64",
    "=================================================="
  ]
}
8k/2k AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  85.95",
    "Total input tokens:                      235251",
    "Total input text tokens:                 235251",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473977",
    "Total generated tokens (retokenized):    473807",
    "Request throughput (req/s):              2.98",
    "Input token throughput (tok/s):          2736.95",
    "Output token throughput (tok/s):         5514.33",
    "Peak output token throughput (tok/s):    6807.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          8251.28",
    "Concurrency:                             232.38",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   78024.30",
    "Median E2E Latency (ms):                 78047.66",
    "P90 E2E Latency (ms):                    84378.09",
    "P99 E2E Latency (ms):                    85878.01",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          4084.06",
    "Median TTFT (ms):                        4083.83",
    "P99 TTFT (ms):                           6333.84",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          39.96",
    "Median TPOT (ms):                        39.96",
    "P99 TPOT (ms):                           41.53",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           39.96",
    "Median ITL (ms):                         38.43",
    "P95 ITL (ms):                            52.08",
    "P99 ITL (ms):                            61.16",
    "Max ITL (ms):                            5643.54",
    "=================================================="
  ]
}


Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@Fridge003
Collaborator

Hi @zianglih, can you please test accuracy with the command here, with a longer output length and thinking enabled?

Given the 0.77 score, I'm still unsure about the correctness.

@zianglih zianglih force-pushed the zianglih/DSv32_overlap branch from ca63555 to b8c807d Compare January 7, 2026 21:10
@zianglih
Contributor Author

zianglih commented Jan 7, 2026

Hi @Fridge003 here are the results. I ran the AFTER twice:

AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
  "benchmark_result": [
    "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 198/198 [24:13<00:00,  7.34s/it]",
    "====================",
    "Repeat: 8, mean: 0.847",
    "Scores: ['0.848', '0.848', '0.843', '0.869', '0.854', '0.833', '0.838', '0.838']",
    "====================",
    "[METRIC] gpqa_mean_score=0.8465909090909091 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(25902.161616161615), 'chars:std': np.float64(25426.272464093607), 'score:std': np.float64(0.3680983264300727), 'scores': ['0.848', '0.848', '0.843', '0.869', '0.854', '0.833', '0.838', '0.838'], 'mean_score': np.float64(0.8465909090909091)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}
AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
  "benchmark_result": [
    "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 198/198 [24:43<00:00,  7.49s/it]",
    "====================",
    "Repeat: 8, mean: 0.841",
    "Scores: ['0.823', '0.843', '0.838', '0.864', '0.854', '0.803', '0.838', '0.864']",
    "====================",
    "[METRIC] gpqa_mean_score=0.8409090909090908 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(26298.60101010101), 'chars:std': np.float64(25170.52269430342), 'score:std': np.float64(0.3431742925123068), 'scores': ['0.823', '0.843', '0.838', '0.864', '0.854', '0.803', '0.838', '0.864'], 'mean_score': np.float64(0.8409090909090908)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}
BEFORE
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
  "benchmark_result": [
    "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 198/198 [25:57<00:00,  7.87s/it]",
    "====================",
    "Repeat: 8, mean: 0.844",
    "Scores: ['0.848', '0.859', '0.838', '0.859', '0.848', '0.848', '0.823', '0.828']",
    "====================",
    "[METRIC] gpqa_mean_score=0.8440656565656566 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(25794.828282828283), 'chars:std': np.float64(25157.76033677148), 'score:std': np.float64(0.3771344384362519), 'scores': ['0.848', '0.859', '0.838', '0.859', '0.848', '0.848', '0.823', '0.828'], 'mean_score': np.float64(0.8440656565656566)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}

Also, the total eval runtime dropped from 26 min to 24.5 min, a ~1.06x speedup.

@Fridge003
Collaborator

Are q_b_proj and _get_q_k_bf16 sharing the same alt stream?
Can you please point out this part in the trace file?

@zianglih
Contributor Author

zianglih commented Jan 8, 2026

@Fridge003 Yes, q_b_proj and _get_q_k_bf16 share the same alt stream, and this is intended. In deepseek_v2.py, q_b_proj is dispatched to the alt stream and self.indexer is called. In nsa_indexer.py, we then 1) call self.alt_stream.wait_stream(current_stream), which does not wait for q_b_proj to complete; 2) run the slow weights = self._project_and_scale_head_gates(x) on the current stream; and 3) dispatch _get_q_k_bf16 and act_quant to the alt stream, where they queue behind q_b_proj and overlap with the slow weights_proj.
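The scheduling above can be sketched with PyTorch CUDA streams. This is a minimal illustration, not the actual SGLang code: the function and module names (overlapped_decode_step, q_b_proj, weights_proj) are stand-ins, and it falls back to sequential execution when CUDA or an alt stream is unavailable.

```python
import torch


def overlapped_decode_step(x, q_b_proj, weights_proj, alt_stream=None):
    """Sketch of the dual-stream overlap pattern (hypothetical names).

    The slow float32 weights_proj runs on the current stream while
    q_b_proj (and, in the real code, _get_q_k_bf16 / act_quant)
    run on an alternate stream.
    """
    if alt_stream is None or not torch.cuda.is_available():
        # Sequential fallback for CPU or when no alt stream is provided.
        return q_b_proj(x), weights_proj(x.float())

    current = torch.cuda.current_stream()
    # The alt stream only needs x to be ready; it does not need to
    # wait for anything weights_proj produces.
    alt_stream.wait_stream(current)
    with torch.cuda.stream(alt_stream):
        q = q_b_proj(x)  # overlaps with the fp32 projection below
    w = weights_proj(x.float())  # slow fp32 GEMM on the current stream
    # Rejoin before downstream ops consume q.
    current.wait_stream(alt_stream)
    return q, w
```

The key point is the order of operations: the wait_stream happens before the slow projection is launched, so the alt-stream work is enqueued against x alone and the GPU can execute both streams concurrently.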
[profiler trace screenshot omitted]

There is an additional 10.1 μs (2.112 + 6.176 + 1.856) of overlap opportunity per layer (0.62 ms across 61 layers) from hiding the (kv norm + nvjet + qk rope) kernels circled in red; none of them has a data dependency on the indexer weights_proj. However, hiding them requires more invasive code changes, so we can save this for a next step.

@Fridge003
Collaborator

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 9, 2026
@zianglih zianglih force-pushed the zianglih/DSv32_overlap branch from b8c807d to 7c6b1b7 Compare January 9, 2026 18:35
@ziang-and
Contributor

@Fridge003 Could you trigger /rerun-failed-ci? The currently failing CI jobs did not fail last time, and they don't seem related to my code changes.

@ziang-and ziang-and force-pushed the zianglih/DSv32_overlap branch from 7c6b1b7 to b4c7f23 Compare January 10, 2026 00:56
@Fridge003 Fridge003 merged commit 20abaee into sgl-project:main Jan 10, 2026
202 of 211 checks passed
