
[DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as metadata#15040

Merged
Fridge003 merged 8 commits into sgl-project:main from qianlihuang:feature/precompute-deepgemm-schedule-metadata
Dec 19, 2025


@qianlihuang
Contributor

@qianlihuang qianlihuang commented Dec 13, 2025

Motivation

#15025

Changes

  • Add paged_mqa_schedule_metadata to NSAMetadata (batch-level caching).
  • Compute it once in init_forward_metadata() / init_forward_metadata_capture_cuda_graph().
  • Update it in init_forward_metadata_replay_cuda_graph().
  • get_indexer_metadata() forwards the cached tensor; the indexer reuses it, with a fallback when no cached tensor is available.
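
The caching pattern in the list above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `compute_schedule_metadata` is a hypothetical stand-in for `deep_gemm.get_paged_mqa_logits_metadata` (whose real signature is not shown here), `NSAMetadataSketch`, `init_forward_metadata_replay`, `indexer_forward`, and `BLOCK_SIZE` are names invented for the example, and plain Python lists stand in for GPU tensors. The in-place update on replay is an assumption about how a captured CUDA graph would keep seeing the same buffer.

```python
from dataclasses import dataclass
from typing import List, Optional

BLOCK_SIZE = 64  # hypothetical page size, chosen only for this sketch

calls = {"n": 0}  # counts how often the (stand-in) scheduling helper runs


def compute_schedule_metadata(seq_lens: List[int]) -> List[int]:
    """Hypothetical stand-in for deep_gemm.get_paged_mqa_logits_metadata."""
    calls["n"] += 1
    # Toy "schedule": number of pages each sequence occupies.
    return [(n + BLOCK_SIZE - 1) // BLOCK_SIZE for n in seq_lens]


@dataclass
class NSAMetadataSketch:
    seq_lens: List[int]
    # Batch-level cache; None means "not precomputed".
    paged_mqa_schedule_metadata: Optional[List[int]] = None


def init_forward_metadata(seq_lens: List[int]) -> NSAMetadataSketch:
    """Compute the schedule once per batch and cache it on the metadata."""
    md = NSAMetadataSketch(seq_lens=list(seq_lens))
    md.paged_mqa_schedule_metadata = compute_schedule_metadata(md.seq_lens)
    return md


def init_forward_metadata_replay(md: NSAMetadataSketch, seq_lens: List[int]) -> None:
    """Refresh the cache in place so the buffer object itself is unchanged."""
    md.seq_lens[:] = seq_lens
    md.paged_mqa_schedule_metadata[:] = compute_schedule_metadata(md.seq_lens)


def indexer_forward(md: NSAMetadataSketch) -> List[int]:
    """Per-layer consumer: reuse the cached schedule, else compute it."""
    sched = md.paged_mqa_schedule_metadata
    if sched is None:  # fallback path when nothing was precomputed
        sched = compute_schedule_metadata(md.seq_lens)
    return sched


md = init_forward_metadata([100, 64, 1])
for _layer in range(4):  # every layer reuses the same cached schedule
    indexer_forward(md)
print(calls["n"])  # 1
```

In this sketch the helper runs once per batch regardless of how many layers consume the result, which is the saving the PR is after; the fallback branch only runs for metadata objects built without the precomputed field.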

Accuracy Tests

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 36000 --repeat 8 --thinking-mode deepseek-v3
Repeat: 8, mean: 0.848
Scores: ['0.884', '0.854', '0.864', '0.828', '0.818', '0.843', '0.838', '0.854']
====================
Writing report to /tmp/gpqa__data_models_DeepSeek-V3.2.html
{'chars': np.float64(23971.762626262625), 'chars:std': np.float64(22059.27866937971), 'score:std': np.float64(0.35357142673105335), 'score': np.float64(0.8535353535353535)}
Writing results to /tmp/gpqa__data_models_DeepSeek-V3.2.json
Total latency: 7081.645 s
Score: 0.854
python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319
Accuracy: 0.957
Invalid: 0.000
Latency: 167.105 s
Output throughput: 765.038 token/s

Benchmarking and Profiling

python3 -m sglang.launch_server \
 --model deepseek-ai/DeepSeek-V3.2 \
 --tp 8 \
 --speculative-algorithm EAGLE \
 --speculative-num-steps 3 \
 --speculative-eagle-topk 1 \
 --speculative-num-draft-tokens 4

Benchmark

python3 -m sglang.bench_serving \
    --backend sglang \
    --model deepseek-ai/DeepSeek-V3.2 \
    --num-prompts 160 \
    --request-rate inf \
    --max-concurrency 16 \
    --dataset-name random-ids \
    --random-input-len 1 \
    --random-output-len 2000

Before

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  165.43    
Total input tokens:                      160       
Total input text tokens:                 160       
Total input vision tokens:               0         
Total generated tokens:                  171369    
Total generated tokens (retokenized):    171274    
Request throughput (req/s):              0.97      
Input token throughput (tok/s):          0.97      
Output token throughput (tok/s):         1035.87   
Peak output token throughput (tok/s):    1403.00   
Peak concurrent requests:                20        
Total token throughput (tok/s):          1036.84   
Concurrency:                             14.94     
Accept length:                           3.01      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15446.38  
Median E2E Latency (ms):                 15407.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          162.15    
Median TTFT (ms):                        144.04    
P99 TTFT (ms):                           288.15    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.74     
Median TPOT (ms):                        14.45     
P99 TPOT (ms):                           20.76     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.28     
Median ITL (ms):                         9.93      
P95 ITL (ms):                            36.37     
P99 ITL (ms):                            49.08     
Max ITL (ms):                            254.78    
==================================================

After

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     160       
Benchmark duration (s):                  163.37    
Total input tokens:                      160       
Total input text tokens:                 160       
Total input vision tokens:               0         
Total generated tokens:                  171369    
Total generated tokens (retokenized):    171274    
Request throughput (req/s):              0.98      
Input token throughput (tok/s):          0.98      
Output token throughput (tok/s):         1048.97   
Peak output token throughput (tok/s):    1414.00   
Peak concurrent requests:                19        
Total token throughput (tok/s):          1049.95   
Concurrency:                             14.94     
Accept length:                           3.01      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15256.81  
Median E2E Latency (ms):                 15301.24  
---------------Time to First Token----------------
Mean TTFT (ms):                          161.94    
Median TTFT (ms):                        143.68    
P99 TTFT (ms):                           306.59    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.54     
Median TPOT (ms):                        14.27     
P99 TPOT (ms):                           19.98     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           14.11     
Median ITL (ms):                         9.81      
P95 ITL (ms):                            36.30     
P99 ITL (ms):                            48.67     
Max ITL (ms):                            269.44    
==================================================
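
To make the before/after comparison easier to read, the relative changes can be computed from the two reports above. This is a small sketch using only figures copied from those reports; the dictionary keys and the `pct_change` helper are names made up for the example. Each configuration was run once here, so the differences (on the order of 1%) are indicative rather than statistically established.

```python
# Figures copied from the "Before" and "After" serving benchmark reports.
before = {
    "output_tok_per_s": 1035.87,
    "mean_tpot_ms": 14.74,
    "mean_e2e_ms": 15446.38,
}
after = {
    "output_tok_per_s": 1048.97,
    "mean_tpot_ms": 14.54,
    "mean_e2e_ms": 15256.81,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


deltas = {name: pct_change(before[name], after[name]) for name in before}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

Throughput goes up slightly while mean TPOT and mean end-to-end latency go down by a similar fraction.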

Profile

curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "num_steps": 10,
    "output_dir": "/sgl-workspace/profile",
    "activities": ["CPU","GPU"],
    "merge_profiles": false
  }'

Before
image
1765676758.7169704-TP-0.trace.json.gz

After
image
1765678229.1364799-TP-0.trace.json.gz

@qianlihuang qianlihuang marked this pull request as ready for review December 14, 2025 08:28
@Fridge003
Collaborator

@YAMY1234 Can you please take a look?

@YAMY1234
Contributor

YAMY1234 commented Dec 15, 2025

@qianlihuang Could you add GPQA and GSM8K (20-shot) results to the PR description? Thanks.
GPQA:

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 36000 --repeat 8 --thinking-mode deepseek-v3

GSM8K:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

@qianlihuang
Contributor Author

@YAMY1234 I have finished the tests and the accuracy looks fine. I've updated the PR description with the results.

@YAMY1234
Contributor

Thanks! But the GPQA results look higher than expected; normally the average should be around 79.9, as reported. Do you have a clue? cc @Fridge003

@qianlihuang
Contributor Author

qianlihuang commented Dec 16, 2025

@YAMY1234 The GPQA score listed in the doc likely refers to DeepSeek V3.2 exp.
My result aligns with the nightly test scores. You can refer to the benchmark in this run:
https://github.com/sgl-project/sglang/actions/runs/20188050540/job/57961445321

@YAMY1234
Contributor

Thanks, LGTM overall.

@Fridge003
Collaborator

@YAMY1234 The GPQA score for the new DeepSeek V3.2 checkpoint is around 85%, so this is expected.

@Fridge003
Collaborator

/tag-and-rerun-ci

@Fridge003
Collaborator

@qianlihuang Please fix lint

@qianlihuang
Contributor Author

qianlihuang commented Dec 19, 2025

fixed by #15424
@Fridge003

@Fridge003 Fridge003 merged commit 6afc5d4 into sgl-project:main Dec 19, 2025
139 of 149 checks passed
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 23, 2025
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026