[DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as metadata by qianlihuang · Pull Request #15040 · sgl-project/sglang

qianlihuang · 2025-12-13T06:24:22Z

Motivation

Changes

Add paged_mqa_schedule_metadata to NSAMetadata (batch-level caching).
Compute it once in init_forward_metadata() / init_forward_metadata_capture_cuda_graph().
Update it in init_forward_metadata_replay_cuda_graph().
get_indexer_metadata() forwards the cached tensor; the indexer reuses it, with a fallback.
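
The caching pattern described above can be sketched in plain Python. This is a minimal illustration, not the SGLang implementation: `BatchMetadata`, `compute_schedule`, `indexer_forward`, and `NUM_LAYERS` are hypothetical names, the schedule computation is a placeholder rather than the real `deep_gemm.get_paged_mqa_logits_metadata` logic, and the fallback is assumed to recompute the schedule on demand when no cached value is present.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_LAYERS = 4          # hypothetical layer count, for illustration only
stats = {"calls": 0}    # counts how often the schedule is computed


def compute_schedule(seq_lens: List[int], num_workers: int = 4) -> List[int]:
    """Placeholder for deep_gemm.get_paged_mqa_logits_metadata (NOT its real logic).

    Splits the total token count into num_workers contiguous ranges and
    returns the range boundaries.
    """
    stats["calls"] += 1
    total = sum(seq_lens)
    return [total * i // num_workers for i in range(num_workers + 1)]


@dataclass
class BatchMetadata:
    """Hypothetical stand-in for NSAMetadata; only the cached field is modelled."""
    seq_lens: List[int]
    paged_mqa_schedule_metadata: Optional[List[int]] = None


def init_forward_metadata(seq_lens: List[int]) -> BatchMetadata:
    # Compute the schedule once per batch and cache it on the metadata object.
    return BatchMetadata(seq_lens, compute_schedule(seq_lens))


def indexer_forward(metadata: BatchMetadata) -> List[int]:
    # Reuse the cached schedule; fall back to computing it if it is missing.
    schedule = metadata.paged_mqa_schedule_metadata
    if schedule is None:
        schedule = compute_schedule(metadata.seq_lens)
    return schedule


# Cached path: one computation at init time, reused by every layer.
metadata = init_forward_metadata([5, 7, 12])
schedules = [indexer_forward(metadata) for _ in range(NUM_LAYERS)]
calls_with_cache = stats["calls"]

# Fallback path: no cached value, so each layer recomputes the schedule.
uncached = BatchMetadata([5, 7, 12])
for _ in range(NUM_LAYERS):
    indexer_forward(uncached)
calls_without_cache = stats["calls"] - calls_with_cache

print(f"with cache: {calls_with_cache} call(s), without: {calls_without_cache} call(s)")
```

In this sketch the cached path computes the schedule once per batch regardless of the layer count, while the fallback path computes it once per layer.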

Accuracy Tests

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 36000 --repeat 8 --thinking-mode deepseek-v3

Repeat: 8, mean: 0.848
Scores: ['0.884', '0.854', '0.864', '0.828', '0.818', '0.843', '0.838', '0.854']
====================
Writing report to /tmp/gpqa__data_models_DeepSeek-V3.2.html
{'chars': np.float64(23971.762626262625), 'chars:std': np.float64(22059.27866937971), 'score:std': np.float64(0.35357142673105335), 'score': np.float64(0.8535353535353535)}
Writing results to /tmp/gpqa__data_models_DeepSeek-V3.2.json
Total latency: 7081.645 s
Score: 0.854
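
As a quick sanity check on the log above, the reported mean can be recomputed from the eight per-run scores. The snippet below is only a sketch that copies the numbers from the `Scores:` line; the standard deviation it prints is the spread across repeats, which is a different quantity from the per-example `score:std` field in the log.

```python
from statistics import mean, stdev

# Per-run GPQA scores copied from the "Scores:" line of the log.
scores = [0.884, 0.854, 0.864, 0.828, 0.818, 0.843, 0.838, 0.854]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across the 8 repeats

print(
    f"runs={len(scores)} mean={avg:.3f} "
    f"min={min(scores):.3f} max={max(scores):.3f} stdev={spread:.3f}"
)
```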

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

Accuracy: 0.957
Invalid: 0.000
Latency: 167.105 s
Output throughput: 765.038 token/s

Benchmarking and Profiling

python3 -m sglang.launch_server \
 --model deepseek-ai/DeepSeek-V3.2 \
 --tp 8 \
 --speculative-algorithm EAGLE \
 --speculative-num-steps 3 \
 --speculative-eagle-topk 1 \
 --speculative-num-draft-tokens 4

Benchmark

python3 -m sglang.bench_serving \
    --backend sglang \
    --model deepseek-ai/DeepSeek-V3.2 \
    --num-prompts 160 \
    --request-rate inf \
    --max-concurrency 16 \
    --dataset-name random-ids \
    --random-input-len 1 \
    --random-output-len 2000

Before

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  165.43    
Total input tokens:                      160       
Total input text tokens:                 160       
Total input vision tokens:               0         
Total generated tokens:                  171369    
Total generated tokens (retokenized):    171274    
Request throughput (req/s):              0.97      
Input token throughput (tok/s):          0.97      
Output token throughput (tok/s):         1035.87   
Peak output token throughput (tok/s):    1403.00   
Peak concurrent requests:                20        
Total token throughput (tok/s):          1036.84   
Concurrency:                             14.94     
Accept length:                           3.01      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15446.38  
Median E2E Latency (ms):                 15407.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          162.15    
Median TTFT (ms):                        144.04    
P99 TTFT (ms):                           288.15    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.74     
Median TPOT (ms):                        14.45     
P99 TPOT (ms):                           20.76     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.28     
Median ITL (ms):                         9.93      
P95 ITL (ms):                            36.37     
P99 ITL (ms):                            49.08     
Max ITL (ms):                            254.78    
==================================================

After

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  163.37    
Total input tokens:                      160       
Total input text tokens:                 160       
Total input vision tokens:               0         
Total generated tokens:                  171369    
Total generated tokens (retokenized):    171274    
Request throughput (req/s):              0.98      
Input token throughput (tok/s):          0.98      
Output token throughput (tok/s):         1048.97   
Peak output token throughput (tok/s):    1414.00   
Peak concurrent requests:                19        
Total token throughput (tok/s):          1049.95   
Concurrency:                             14.94     
Accept length:                           3.01      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15256.81  
Median E2E Latency (ms):                 15301.24  
---------------Time to First Token----------------
Mean TTFT (ms):                          161.94    
Median TTFT (ms):                        143.68    
P99 TTFT (ms):                           306.59    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.54     
Median TPOT (ms):                        14.27     
P99 TPOT (ms):                           19.98     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.11     
Median ITL (ms):                         9.81      
P95 ITL (ms):                            36.30     
P99 ITL (ms):                            48.67     
Max ITL (ms):                            269.44    
==================================================
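
The two result tables can be compared with a few lines of Python. The sketch below copies a handful of metrics from the Before and After runs and prints the relative change; the selection of metrics is arbitrary. Each configuration was run once here, so small differences may fall within run-to-run noise.

```python
# Selected metrics copied from the Before / After benchmark tables.
before = {
    "Output token throughput (tok/s)": 1035.87,
    "Mean E2E Latency (ms)": 15446.38,
    "Mean TPOT (ms)": 14.74,
    "Mean ITL (ms)": 14.28,
    "P99 TTFT (ms)": 288.15,
}
after = {
    "Output token throughput (tok/s)": 1048.97,
    "Mean E2E Latency (ms)": 15256.81,
    "Mean TPOT (ms)": 14.54,
    "Mean ITL (ms)": 14.11,
    "P99 TTFT (ms)": 306.59,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {name: pct_change(before[name], after[name]) for name in before}

for name, delta in changes.items():
    print(f"{name:34s} {before[name]:>10.2f} -> {after[name]:>10.2f}  ({delta:+.2f}%)")
```

On these numbers, output throughput rises by roughly 1.3 percent and mean TPOT and E2E latency drop by a similar amount, while P99 TTFT is higher in the After run.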

Profile

curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "num_steps": 10,
    "output_dir": "/sgl-workspace/profile",
    "activities": ["CPU","GPU"],
    "merge_profiles": false
  }'

Before

1765676758.7169704-TP-0.trace.json.gz

After

1765678229.1364799-TP-0.trace.json.gz

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-13T06:24:26Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

qianlihuang · 2025-12-14T07:06:39Z

/gemini review

gemini-code-assist · 2025-12-14T07:06:42Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Fridge003 · 2025-12-15T11:10:19Z

@YAMY1234 Can you please take a look?

YAMY1234 · 2025-12-15T17:06:52Z

@qianlihuang Could you add GPQA and GSM8K 20-shot results to the PR description? Thanks.
GPQA:

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 36000 --repeat 8 --thinking-mode deepseek-v3

GSM8K:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

qianlihuang · 2025-12-16T07:33:49Z

@YAMY1234 I have finished the tests and the accuracy looks fine. I've updated the PR description with the results.

YAMY1234 · 2025-12-16T07:43:52Z

Thanks! But the GPQA results look higher than expected; normally the average should be around 79.9, as reported. Do you have any idea why? cc @Fridge003

qianlihuang · 2025-12-16T08:07:08Z

@YAMY1234 The GPQA score listed in the doc likely refers to DeepSeek V3.2 Exp.
My result aligns with the nightly test scores. You can refer to the benchmark in this run:
https://github.com/sgl-project/sglang/actions/runs/20188050540/job/57961445321

YAMY1234 · 2025-12-16T08:11:23Z

Thanks, LGTM overall.

Fridge003 · 2025-12-16T21:48:49Z

@YAMY1234 The GPQA score for the new DeepSeek V3.2 checkpoint is around 85%, so this is expected.

Fridge003 · 2025-12-16T23:28:31Z

/tag-and-rerun-ci

Fridge003 · 2025-12-18T22:32:34Z

@qianlihuang Please fix lint

qianlihuang · 2025-12-19T02:31:52Z

fixed by #15424
@Fridge003

[NSA][Perf] Pre-compute deep_gemm schedule metadata at init time for …

311a36e

…decode

Fridge003 mentioned this pull request Dec 13, 2025

[Roadmap] DeepSeek v3.2 (GLM 5) Optimization #15025

Open

34 tasks

abcdea added 2 commits December 13, 2025 16:18

[NSA][Perf] Cache DeepGEMM paged MQA schedule per batch

2a1fa05

[NSA][Perf] Cache DeepGEMM schedule even with BF16 KV

aa90feb

qianlihuang marked this pull request as ready for review December 14, 2025 08:28

qianlihuang requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners December 14, 2025 08:28

github-actions bot added the run-ci label Dec 16, 2025

Fridge003 approved these changes Dec 17, 2025

Fridge003 and others added 3 commits December 16, 2025 20:37

Merge branch 'main' into feature/precompute-deepgemm-schedule-metadata

5257c9d

Merge branch 'main' into feature/precompute-deepgemm-schedule-metadata

150b61d

Merge branch 'main' into feature/precompute-deepgemm-schedule-metadata

5ead065

Merge branch 'main' into feature/precompute-deepgemm-schedule-metadata

e29ec35

Merge branch 'main' into feature/precompute-deepgemm-schedule-metadata

6ec339c

Fridge003 merged commit 6afc5d4 into sgl-project:main Dec 19, 2025
139 of 149 checks passed

Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025

[DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as …

fbf9df1

…metadata (sgl-project#15040)

jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025

[DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as …

8788c6c

…metadata (sgl-project#15040)

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

[DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as …

ba6f1d8

…metadata (sgl-project#15040)

4 participants
