[DSv32] Overlap indexer weights_proj during dual_stream decode#16637

Merged
Fridge003 merged 1 commit into sgl-project:main from zianglih:zianglih/DSv32_overlap
Jan 10, 2026
Conversation

@zianglih
Contributor

@zianglih zianglih commented Jan 7, 2026

Motivation

The weights_proj in the DSA indexer runs in float32, which is slow on modern GPUs. Profiler traces show it accounts for ~20% of decode-layer runtime. The traces also reveal that the projection typically launches with a grid of (bs, 1, 1), occupying very few SMs and leaving room for inter-stream overlap.

Modifications

In this optimization, we overlap the slow weights_proj with q_b_proj, the indexer's _get_q_k_bf16, and the indexer's qk act_quant during dual-stream decode.

Accuracy Tests

BEFORE 
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 4096 --repeat 10 '",
  "benchmark_result": [
    "====================",
    "Repeat: 10, mean: 0.777",
    "Scores: ['0.788', '0.768', '0.808', '0.783', '0.788', '0.758', '0.747', '0.778', '0.778', '0.773']",
    "====================",
    "[METRIC] gpqa_mean_score=0.7767676767676768 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 10}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(3697.338383838384), 'chars:std': np.float64(2099.168857082937), 'score:std': np.float64(0.4190702026042221), 'scores': ['0.788', '0.768', '0.808', '0.783', '0.788', '0.758', '0.747', '0.778', '0.778', '0.773'], 'mean_score': np.float64(0.7767676767676768)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}
AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 4096 --repeat 10 '",
  "benchmark_result": [
    "====================",
    "Repeat: 10, mean: 0.771",
    "Scores: ['0.737', '0.793', '0.783', '0.747', '0.763', '0.798', '0.753', '0.773', '0.793', '0.773']",
    "====================",
    "[METRIC] gpqa_mean_score=0.7712121212121212 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 10}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(3708.1464646464647), 'chars:std': np.float64(2132.345340292064), 'score:std': np.float64(0.4190702026042221), 'scores': ['0.737', '0.793', '0.783', '0.747', '0.763', '0.798', '0.753', '0.773', '0.793', '0.773'], 'mean_score': np.float64(0.7712121212121212)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}

Benchmarking and Profiling

[profiler trace screenshots omitted]
1k/2k BEFORE
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  85.95",
    "Total input tokens:                      235251",
    "Total input text tokens:                 235251",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473977",
    "Total generated tokens (retokenized):    473807",
    "Request throughput (req/s):              2.98",
    "Input token throughput (tok/s):          2736.95",
    "Output token throughput (tok/s):         5514.33",
    "Peak output token throughput (tok/s):    6807.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          8251.28",
    "Concurrency:                             232.38",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   78024.30",
    "Median E2E Latency (ms):                 78047.66",
    "P90 E2E Latency (ms):                    84378.09",
    "P99 E2E Latency (ms):                    85878.01",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          4084.06",
    "Median TTFT (ms):                        4083.83",
    "P99 TTFT (ms):                           6333.84",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          39.96",
    "Median TPOT (ms):                        39.96",
    "P99 TPOT (ms):                           41.53",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           39.96",
    "Median ITL (ms):                         38.43",
    "P95 ITL (ms):                            52.08",
    "P99 ITL (ms):                            61.16",
    "Max ITL (ms):                            5643.54",
    "=================================================="
  ]
}
1k/2k AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  82.53",
    "Total input tokens:                      235251",
    "Total input text tokens:                 235251",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473977",
    "Total generated tokens (retokenized):    473750",
    "Request throughput (req/s):              3.10",
    "Input token throughput (tok/s):          2850.66",
    "Output token throughput (tok/s):         5743.42",
    "Peak output token throughput (tok/s):    7168.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          8594.08",
    "Concurrency:                             231.91",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   74760.71",
    "Median E2E Latency (ms):                 74710.94",
    "P90 E2E Latency (ms):                    80995.56",
    "P99 E2E Latency (ms):                    82450.48",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          4001.88",
    "Median TTFT (ms):                        4025.09",
    "P99 TTFT (ms):                           6228.64",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          38.24",
    "Median TPOT (ms):                        38.26",
    "P99 TPOT (ms):                           39.78",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           38.24",
    "Median ITL (ms):                         36.74",
    "P95 ITL (ms):                            52.39",
    "P99 ITL (ms):                            65.52",
    "Max ITL (ms):                            5586.01",
    "=================================================="
  ]
}
8k/2k BEFORE
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 8192 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  132.50",
    "Total input tokens:                      1900585",
    "Total input text tokens:                 1900585",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473587",
    "Total generated tokens (retokenized):    473578",
    "Request throughput (req/s):              1.93",
    "Input token throughput (tok/s):          14344.10",
    "Output token throughput (tok/s):         3574.26",
    "Peak output token throughput (tok/s):    6784.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          17918.36",
    "Concurrency:                             240.61",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   124536.19",
    "Median E2E Latency (ms):                 124556.93",
    "P90 E2E Latency (ms):                    130819.25",
    "P99 E2E Latency (ms):                    132421.32",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          27678.59",
    "Median TTFT (ms):                        27763.26",
    "P99 TTFT (ms):                           52400.84",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          52.43",
    "Median TPOT (ms):                        52.39",
    "P99 TPOT (ms):                           66.58",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           52.39",
    "Median ITL (ms):                         38.38",
    "P95 ITL (ms):                            52.47",
    "P99 ITL (ms):                            61.65",
    "Max ITL (ms):                            50119.64",
    "=================================================="
  ]
}
8k/2k AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_serving_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
  "benchmark_result": [
    "============ Serving Benchmark Result ============",
    "Backend:                                 sglang",
    "Traffic request rate:                    inf",
    "Max request concurrency:                 not set",
    "Successful requests:                     256",
    "Benchmark duration (s):                  85.95",
    "Total input tokens:                      235251",
    "Total input text tokens:                 235251",
    "Total input vision tokens:               0",
    "Total generated tokens:                  473977",
    "Total generated tokens (retokenized):    473807",
    "Request throughput (req/s):              2.98",
    "Input token throughput (tok/s):          2736.95",
    "Output token throughput (tok/s):         5514.33",
    "Peak output token throughput (tok/s):    6807.00",
    "Peak concurrent requests:                256",
    "Total token throughput (tok/s):          8251.28",
    "Concurrency:                             232.38",
    "----------------End-to-End Latency----------------",
    "Mean E2E Latency (ms):                   78024.30",
    "Median E2E Latency (ms):                 78047.66",
    "P90 E2E Latency (ms):                    84378.09",
    "P99 E2E Latency (ms):                    85878.01",
    "---------------Time to First Token----------------",
    "Mean TTFT (ms):                          4084.06",
    "Median TTFT (ms):                        4083.83",
    "P99 TTFT (ms):                           6333.84",
    "-----Time per Output Token (excl. 1st token)------",
    "Mean TPOT (ms):                          39.96",
    "Median TPOT (ms):                        39.96",
    "P99 TPOT (ms):                           41.53",
    "---------------Inter-Token Latency----------------",
    "Mean ITL (ms):                           39.96",
    "Median ITL (ms):                         38.43",
    "P95 ITL (ms):                            52.08",
    "P99 ITL (ms):                            61.16",
    "Max ITL (ms):                            5643.54",
    "=================================================="
  ]
}


Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@Fridge003
Collaborator

Hi @zianglih, can you please test accuracy with the command here, with a longer output length and thinking enabled?

Given the 0.77 score, I'm still unsure about the correctness.

@zianglih zianglih force-pushed the zianglih/DSv32_overlap branch from ca63555 to b8c807d Compare January 7, 2026 21:10
@zianglih
Contributor Author

zianglih commented Jan 7, 2026

Hi @Fridge003 here are the results. I ran the AFTER twice:

AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
  "benchmark_result": [
    "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 198/198 [24:13<00:00,  7.34s/it]",
    "====================",
    "Repeat: 8, mean: 0.847",
    "Scores: ['0.848', '0.848', '0.843', '0.869', '0.854', '0.833', '0.838', '0.838']",
    "====================",
    "[METRIC] gpqa_mean_score=0.8465909090909091 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(25902.161616161615), 'chars:std': np.float64(25426.272464093607), 'score:std': np.float64(0.3680983264300727), 'scores': ['0.848', '0.848', '0.843', '0.869', '0.854', '0.833', '0.838', '0.838'], 'mean_score': np.float64(0.8465909090909091)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}
AFTER
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b zianglih/DSv32_overlap https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
  "benchmark_result": [
    "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 198/198 [24:43<00:00,  7.49s/it]",
    "====================",
    "Repeat: 8, mean: 0.841",
    "Scores: ['0.823', '0.843', '0.838', '0.864', '0.854', '0.803', '0.838', '0.864']",
    "====================",
    "[METRIC] gpqa_mean_score=0.8409090909090908 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(26298.60101010101), 'chars:std': np.float64(25170.52269430342), 'score:std': np.float64(0.3431742925123068), 'scores': ['0.823', '0.843', '0.838', '0.864', '0.854', '0.803', '0.838', '0.864'], 'mean_score': np.float64(0.8409090909090908)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}
BEFORE
{
  "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
  "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
  "benchmark_result": [
    "100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 198/198 [25:57<00:00,  7.87s/it]",
    "====================",
    "Repeat: 8, mean: 0.844",
    "Scores: ['0.848', '0.859', '0.838', '0.859', '0.848', '0.848', '0.823', '0.828']",
    "====================",
    "[METRIC] gpqa_mean_score=0.8440656565656566 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
    "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
    "{'chars': np.float64(25794.828282828283), 'chars:std': np.float64(25157.76033677148), 'score:std': np.float64(0.3771344384362519), 'scores': ['0.848', '0.859', '0.838', '0.859', '0.848', '0.848', '0.823', '0.828'], 'mean_score': np.float64(0.8440656565656566)}",
    "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
  ]
}

Also, the total eval runtime dropped from 26 min to 24.5 min, a ~1.06x speedup.

@Fridge003
Collaborator

Are q_b_proj and _get_q_k_bf16 sharing the same alt stream?
Can you please point out this part in the trace file?

@zianglih
Contributor Author

zianglih commented Jan 8, 2026

@Fridge003 Yes, q_b_proj and _get_q_k_bf16 share the same alt stream, and this is intended. In deepseek_v2.py, q_b_proj is dispatched to the alt stream and self.indexer is called. In nsa_indexer.py, we then 1) call self.alt_stream.wait_stream(current_stream), which does not wait for q_b_proj to complete; 2) run the slow weights = self._project_and_scale_head_gates(x) on the current stream; and 3) dispatch _get_q_k_bf16 and act_quant to the alt stream, where they queue behind q_b_proj and overlap with the slow weights_proj.
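The scheduling above can be sketched with PyTorch CUDA streams. This is a minimal illustration, not the actual SGLang code: the function and module names (overlapped_decode_step, q_b_proj, weights_proj) are stand-ins, and it falls back to sequential execution when CUDA or an alt stream is unavailable.

```python
import torch


def overlapped_decode_step(x, q_b_proj, weights_proj, alt_stream=None):
    """Sketch of the dual-stream overlap pattern (hypothetical names).

    The slow float32 weights_proj runs on the current stream while
    q_b_proj (and, in the real code, _get_q_k_bf16 / act_quant)
    run on an alternate stream.
    """
    if alt_stream is None or not torch.cuda.is_available():
        # Sequential fallback for CPU or when no alt stream is provided.
        return q_b_proj(x), weights_proj(x.float())

    current = torch.cuda.current_stream()
    # The alt stream only needs x to be ready; it does not need to
    # wait for anything weights_proj produces.
    alt_stream.wait_stream(current)
    with torch.cuda.stream(alt_stream):
        q = q_b_proj(x)  # overlaps with the fp32 projection below
    w = weights_proj(x.float())  # slow fp32 GEMM on the current stream
    # Rejoin before downstream ops consume q.
    current.wait_stream(alt_stream)
    return q, w
```

The key point is the order of operations: the wait_stream happens before the slow projection is launched, so the alt-stream work is enqueued against x alone and the GPU can execute both streams concurrently.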
[profiler trace screenshot omitted]

There is an additional 10.1 μs (2.112 + 6.176 + 1.856) of overlap opportunity per layer (0.62 ms across 61 layers) from hiding the (kv norm + nvjet + qk rope) kernels circled in red; none of them has a data dependency on the indexer weights_proj. However, hiding them requires more invasive code changes, so we can save this for a next step.

@Fridge003
Collaborator

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 9, 2026
@zianglih zianglih force-pushed the zianglih/DSv32_overlap branch from b8c807d to 7c6b1b7 Compare January 9, 2026 18:35
@ziang-and
Contributor

@Fridge003 Could you trigger /rerun-failed-ci? The currently failing CI jobs did not fail last time, and they don't seem related to my code changes.

@ziang-and ziang-and force-pushed the zianglih/DSv32_overlap branch from 7c6b1b7 to b4c7f23 Compare January 10, 2026 00:56
@Fridge003 Fridge003 merged commit 20abaee into sgl-project:main Jan 10, 2026
202 of 211 checks passed
