[DSv32] Overlap indexer qk projection and activation quant #17688

Merged
Fridge003 merged 2 commits into sgl-project:main from zianglih:qk
Jan 28, 2026

Conversation

@zianglih (Contributor) commented Jan 25, 2026

Motivation

After #17205, the indexer weight projection is fully hidden and no longer exposes latency. This PR further overlaps the indexer q/k projection with activation quantization to reduce the remaining exposed latency during decode.

Modifications
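Per the review summary below, the decode path of the nsa_indexer now uses a second CUDA stream so the query projection and its activation quantization run concurrently with the key-side activation quantization. A minimal sketch of that dual-stream pattern follows; the names (alt_stream, wq_b, wk, act_quant) and the exact split of work are illustrative assumptions, not the actual sglang code.

import torch
import torch.nn.functional as F

alt_stream = torch.cuda.Stream()

def indexer_qk_overlapped(hidden_states, q_lora, wq_b, wk, act_quant):
    # Illustrative only: the real logic lives in the nsa_indexer forward path.
    main_stream = torch.cuda.current_stream()
    # The side stream must see all work queued so far before it starts.
    alt_stream.wait_stream(main_stream)

    with torch.cuda.stream(alt_stream):
        # Q projection + activation quant on the side stream ...
        q = F.linear(q_lora, wq_b)
        q_fp8, q_scale = act_quant(q)

    # ... while the K projection + activation quant run on the main stream.
    k = F.linear(hidden_states, wk)
    k_fp8, k_scale = act_quant(k)

    # Rejoin before anything downstream consumes the Q-side outputs.
    main_stream.wait_stream(alt_stream)
    return (q_fp8, q_scale), (k_fp8, k_scale)

In real code, tensors produced on one stream and consumed on the other also need allocator handling (e.g. Tensor.record_stream) so their memory is not recycled before both streams are done with it.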

Accuracy Tests

# BEFORE
    {
        "server_command": "bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
        "bench_accuracy_command": "bash -c ' cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
        "benchmark_result": [
            "100%|| 198/198 [49:46<00:00, 15.08s/it]",
            "====================",
            "Repeat: 8, mean: 0.852",
            "Scores: ['0.854', '0.864', '0.823', '0.838', '0.854', '0.848', '0.854', '0.879']",
            "====================",
            "[METRIC] gpqa_mean_score=0.8516414141414141 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
            "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
            "{'chars': np.float64(24966.80808080808), 'chars:std': np.float64(24518.718051777523), 'score:std': np.float64(0.32637362467481845), 'scores': ['0.854', '0.864', '0.823', '0.838', '0.854', '0.848', '0.854', '0.879'], 'mean_score': np.float64(0.8516414141414141)}",
            "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
        ]
    },
# AFTER
    {
        "server_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b qk https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
        "bench_accuracy_command": "docker run --rm --ipc=host --network=host --privileged --shm-size 256G -e DO_NOT_TRACK=1 -e SGLANG_DG_CACHE_DIR=/data/.cache/deep_gemm --gpus all -v $HOME/dockerx:/dockerx -v /data:/data -v /data/.cache/huggingface:/root/.cache/huggingface -v /data/.cache:/root/.cache -p 30000:30000 lmsysorg/sglang:dev bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b qk https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && cd /sgl-workspace/sglang && python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3 '",
        "benchmark_result": [
            "100%|| 198/198 [23:29<00:00,  7.12s/it]",
            "====================",
            "Repeat: 8, mean: 0.858",
            "Scores: ['0.854', '0.884', '0.889', '0.828', '0.859', '0.843', '0.859', '0.848']",
            "====================",
            "[METRIC] gpqa_mean_score=0.8579545454545455 labels={\"model\": \"deepseek-ai/DeepSeek-V3.2\", \"eval\": \"gpqa\", \"repeat\": 8}",
            "Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.html",
            "{'chars': np.float64(26761.303030303032), 'chars:std': np.float64(26398.355972093883), 'score:std': np.float64(0.3585502898848252), 'scores': ['0.854', '0.884', '0.889', '0.828', '0.859', '0.843', '0.859', '0.848'], 'mean_score': np.float64(0.8579545454545455)}",
            "Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2.json"
        ]
    }

# AIME25 accuracy (NeMo-Skills): install the eval harness, then build and launch the server from this branch.
pip install git+https://github.com/NVIDIA/NeMo-Skills.git --ignore-installed blinker
cd /sgl-workspace && \
rm -rf sglang && \
git clone -b qk https://github.com/zianglih/sglang.git && \
cd sglang && \
pip install --upgrade pip && \
pip install -e "python"
python3 -m sglang.launch_server   --model-path deepseek-ai/DeepSeek-V3.2   --trust-remote-code   --tp-size 8 --dp-size 8 --enable-dp-attention   --tool-call-parser deepseekv32   --reasoning-parser deepseek-v3 &

#! /bin/bash
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

ns prepare_data aime25

PORT=30000
BACKEND=sglang
MODEL="deepseek-ai/DeepSeek-V3.2" # change to the served model name if different
MODEL_NAME="dsv32-fp8"

echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
ns eval \
  --benchmarks=aime25:4 \
  --server_type=$BACKEND \
  --model=$MODEL \
  --server_address=http://localhost:${PORT}/v1 \
  --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
  ++chat_template_kwargs.thinking=true \
  ++inference.temperature=1.0 \
  ++inference.top_p=0.95 \
  ++inference.tokens_to_generate=64000
  # ++inference.tokens_to_generate=120000 for Speciale model


# BEFORE
nemo-run_1/0 ---------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-4] | 30          | 15537      | 1652        | 90.00% ± 2.72%   | 2.50%
nemo-run_1/0 majority@4       | 30          | 15537      | 1652        | 91.67%           | 0.00%
nemo-run_1/0 pass@4           | 30          | 15537      | 1652        | 96.67%           | 0.00%
nemo-run_1/0
nemo-run_1/0
nemo-run_1/0 Metrics are saved to nemo_skills_aime25_dsv32-fp8_output_sglang_20260127_070245/eval-results/aime25/metrics.json

# AFTER
nemo-run_1/0 ---------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-4] | 30          | 15068      | 1591        | 91.67% ± 4.30%   | 0.00%
nemo-run_1/0 majority@4       | 30          | 15068      | 1591        | 93.33%           | 0.00%
nemo-run_1/0 pass@4           | 30          | 15068      | 1591        | 96.67%           | 0.00%
nemo-run_1/0
nemo-run_1/0
nemo-run_1/0 Metrics are saved to nemo_skills_aime25_dsv32-fp8_output_sglang_20260127_060801/eval-results/aime25/metrics.json



Benchmarking and Profiling

[profile screenshot] Per layer, the latency drops from 52.5 us to 39.5 us, a 13 us reduction, i.e. 0.793 ms across all 61 layers.

Output token throughput improves by 1.05x for 1k input / 2k output and 1.032x for 8k input / 2k output.
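As a quick sanity check, the speedups above can be recomputed from the "Output token throughput" lines in the BEFORE/AFTER logs below (Python, numbers copied verbatim):

before = {"1k/2k": 6087.86, "8k/2k": 3878.61}  # Output token throughput (tok/s), BEFORE
after = {"1k/2k": 6386.24, "8k/2k": 4004.22}   # Output token throughput (tok/s), AFTER
for cfg in before:
    print(f"{cfg}: {after[cfg] / before[cfg]:.3f}x")  # 1k/2k: 1.049x, 8k/2k: 1.032x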

BEFORE, 1k/2k, 8k/2k

    {
        "server_command": "bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
        "bench_serving_command": "bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
        "benchmark_result": [
            "============ Serving Benchmark Result ============",
            "Backend:                                 sglang",
            "Traffic request rate:                    inf",
            "Max request concurrency:                 not set",
            "Successful requests:                     256",
            "Benchmark duration (s):                  77.86",
            "Total input tokens:                      235251",
            "Total input text tokens:                 235251",
            "Total generated tokens:                  473977",
            "Total generated tokens (retokenized):    473683",
            "Request throughput (req/s):              3.29",
            "Input token throughput (tok/s):          3021.61",
            "Output token throughput (tok/s):         6087.86",
            "Peak output token throughput (tok/s):    7680.00",
            "Peak concurrent requests:                256",
            "Total token throughput (tok/s):          9109.47",
            "Concurrency:                             231.64",
            "----------------End-to-End Latency----------------",
            "Mean E2E Latency (ms):                   70447.71",
            "Median E2E Latency (ms):                 70369.19",
            "P90 E2E Latency (ms):                    76390.10",
            "P99 E2E Latency (ms):                    77762.35",
            "---------------Time to First Token----------------",
            "Mean TTFT (ms):                          4074.74",
            "Median TTFT (ms):                        4068.72",
            "P99 TTFT (ms):                           6346.09",
            "-----Time per Output Token (excl. 1st token)------",
            "Mean TPOT (ms):                          35.86",
            "Median TPOT (ms):                        35.92",
            "P99 TPOT (ms):                           37.41",
            "---------------Inter-Token Latency----------------",
            "Mean ITL (ms):                           35.87",
            "Median ITL (ms):                         34.23",
            "P95 ITL (ms):                            50.44",
            "P99 ITL (ms):                            61.24",
            "Max ITL (ms):                            5725.49",
            "=================================================="
        ]
    },
    {
        "server_command": "bash -c ' python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
        "bench_serving_command": "bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 8192 --random-output 2048 --host 0.0.0.0 --port 30000 '",
        "benchmark_result": [
            "============ Serving Benchmark Result ============",
            "Backend:                                 sglang",
            "Traffic request rate:                    inf",
            "Max request concurrency:                 not set",
            "Successful requests:                     256",
            "Benchmark duration (s):                  122.10",
            "Total input tokens:                      1900585",
            "Total input text tokens:                 1900585",
            "Total generated tokens:                  473587",
            "Total generated tokens (retokenized):    473576",
            "Request throughput (req/s):              2.10",
            "Input token throughput (tok/s):          15565.51",
            "Output token throughput (tok/s):         3878.61",
            "Peak output token throughput (tok/s):    7580.00",
            "Peak concurrent requests:                256",
            "Total token throughput (tok/s):          19444.12",
            "Concurrency:                             242.78",
            "----------------End-to-End Latency----------------",
            "Mean E2E Latency (ms):                   115796.89",
            "Median E2E Latency (ms):                 115959.76",
            "P90 E2E Latency (ms):                    121061.79",
            "P99 E2E Latency (ms):                    121999.02",
            "---------------Time to First Token----------------",
            "Mean TTFT (ms):                          28152.69",
            "Median TTFT (ms):                        28222.94",
            "P99 TTFT (ms):                           51803.88",
            "-----Time per Output Token (excl. 1st token)------",
            "Mean TPOT (ms):                          47.45",
            "Median TPOT (ms):                        47.30",
            "P99 TPOT (ms):                           61.32",
            "---------------Inter-Token Latency----------------",
            "Mean ITL (ms):                           47.40",
            "Median ITL (ms):                         34.49",
            "P95 ITL (ms):                            51.77",
            "P99 ITL (ms):                            61.75",
            "Max ITL (ms):                            48323.07",
            "=================================================="
        ]
    },

AFTER, 1k/2k, 8k/2k

    {
        "server_command": "bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b qk https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
        "bench_serving_command": "bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 1024 --random-output 2048 --host 0.0.0.0 --port 30000 '",
        "benchmark_result": [
            "============ Serving Benchmark Result ============",
            "Backend:                                 sglang",
            "Traffic request rate:                    inf",
            "Max request concurrency:                 not set",
            "Successful requests:                     256",
            "Benchmark duration (s):                  74.22",
            "Total input tokens:                      235251",
            "Total input text tokens:                 235251",
            "Total generated tokens:                  473977",
            "Total generated tokens (retokenized):    473767",
            "Request throughput (req/s):              3.45",
            "Input token throughput (tok/s):          3169.71",
            "Output token throughput (tok/s):         6386.24",
            "Peak output token throughput (tok/s):    7847.00",
            "Peak concurrent requests:                256",
            "Total token throughput (tok/s):          9555.96",
            "Concurrency:                             235.14",
            "----------------End-to-End Latency----------------",
            "Mean E2E Latency (ms):                   68169.36",
            "Median E2E Latency (ms):                 68344.97",
            "P90 E2E Latency (ms):                    73210.79",
            "P99 E2E Latency (ms):                    74117.08",
            "---------------Time to First Token----------------",
            "Mean TTFT (ms):                          3596.92",
            "Median TTFT (ms):                        3596.68",
            "P99 TTFT (ms):                           5700.22",
            "-----Time per Output Token (excl. 1st token)------",
            "Mean TPOT (ms):                          34.91",
            "Median TPOT (ms):                        34.86",
            "P99 TPOT (ms):                           36.37",
            "---------------Inter-Token Latency----------------",
            "Mean ITL (ms):                           34.90",
            "Median ITL (ms):                         33.57",
            "P95 ITL (ms):                            46.91",
            "P99 ITL (ms):                            53.95",
            "Max ITL (ms):                            5122.21",
            "=================================================="
        ]
    },
    {
        "server_command": "bash -c ' cd /sgl-workspace && rm -rf sglang && git clone -b qk https://github.com/zianglih/sglang.git && cd sglang && pip install --upgrade pip && pip install -e \"python\" && python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --tp 8 --enable-dp-attention --kv-cache-dtype bf16 --mem-fraction-static 0.8 --dp 8 --cuda-graph-max-bs 256 --trust-remote-code --host 0.0.0.0 --port 30000'",
        "bench_serving_command": "bash -c ' python3 -m sglang.bench_serving --warmup-requests 1 --backend sglang --dataset-name random --random-range-ratio 0.8 --num-prompts 256 --random-input 8192 --random-output 2048 --host 0.0.0.0 --port 30000 '",
        "benchmark_result": [
            "============ Serving Benchmark Result ============",
            "Backend:                                 sglang",
            "Traffic request rate:                    inf",
            "Max request concurrency:                 not set",
            "Successful requests:                     256",
            "Benchmark duration (s):                  118.27",
            "Total input tokens:                      1900585",
            "Total input text tokens:                 1900585",
            "Total generated tokens:                  473587",
            "Total generated tokens (retokenized):    473286",
            "Request throughput (req/s):              2.16",
            "Input token throughput (tok/s):          16069.63",
            "Output token throughput (tok/s):         4004.22",
            "Peak output token throughput (tok/s):    7680.00",
            "Peak concurrent requests:                256",
            "Total token throughput (tok/s):          20073.86",
            "Concurrency:                             242.90",
            "----------------End-to-End Latency----------------",
            "Mean E2E Latency (ms):                   112219.93",
            "Median E2E Latency (ms):                 112398.33",
            "P90 E2E Latency (ms):                    117253.22",
            "P99 E2E Latency (ms):                    118172.76",
            "---------------Time to First Token----------------",
            "Mean TTFT (ms):                          26143.37",
            "Median TTFT (ms):                        26110.09",
            "P99 TTFT (ms):                           49363.24",
            "-----Time per Output Token (excl. 1st token)------",
            "Mean TPOT (ms):                          46.60",
            "Median TPOT (ms):                        46.47",
            "P99 TPOT (ms):                           60.05",
            "---------------Inter-Token Latency----------------",
            "Mean ITL (ms):                           46.55",
            "Median ITL (ms):                         33.84",
            "P95 ITL (ms):                            45.71",
            "P99 ITL (ms):                            53.37",
            "Max ITL (ms):                            46917.77",
            "=================================================="
        ]
    }

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @zianglih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the performance of the nsa_indexer by strategically reordering and overlapping computational tasks. By leveraging CUDA streams to run query projection and quantization concurrently with key quantization, the change significantly reduces processing latency per layer, leading to a faster overall execution of the model's attention mechanism.

Highlights

  • Concurrency Optimization: The query (Q) and key (K) projection and activation quantization steps within the nsa_indexer are now overlapped using CUDA streams, allowing for concurrent execution of these operations.
  • Latency Reduction: This optimization results in a notable latency reduction of 13 microseconds per layer, accumulating to a total of 0.793 milliseconds across 61 layers, as indicated by benchmarking.
  • Stream Management Enhancement: The _get_q_k_bf16 function now explicitly utilizes the enable_dual_stream parameter, indicating improved stream awareness and potential for further concurrent processing within this critical path.


@gemini-code-assist bot left a comment

Code Review

This pull request refactors the CUDA stream usage in forward_cuda to increase parallelism for the indexer during decode. The changes enable dual-stream computation within _get_q_k_bf16 to overlap query and key projections. Additionally, the activation quantization for query and key are now parallelized across two separate streams. These optimizations should improve performance by increasing computation overlap, as supported by the benchmark results in the pull request description. The change looks correct and aligns with the goal of reducing latency.

@Kangyan-Zhou (Collaborator) commented:

/tag-and-rerun-ci

@Fridge003 (Collaborator) commented:

@zianglih (Contributor, Author) commented Jan 27, 2026

Hi @Fridge003, I have added both results under Accuracy Tests. Please check them out!

@Fridge003 merged commit a8dda2a into sgl-project:main on Jan 28, 2026
198 of 215 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026