[Perf] Fused qk_norm_rope for GLM4.6 #14952

Closed
Kevin-XiongC wants to merge 10 commits into sgl-project:main from novitalabs:fused_glm

Conversation

@Kevin-XiongC (Contributor) commented Dec 12, 2025

Motivation


Inspired by #13998

Modifications

  • Added fused QK norm + RoPE support in Glm4MoeAttention.
  • Added a rotary_dim parameter to support partial rotary embedding:
    • Allows rotary_dim < head_dim (e.g., partial_rotary_factor=0.5).
    • RoPE is applied only to the first rotary_dim dimensions.

Use --enable-fused-qk-norm-rope to turn it on.
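For context, the unfused path that this kernel fuses performs a per-head RMSNorm on Q/K followed by a partial rotary embedding. Below is a minimal NumPy sketch of that reference behavior; the function names, the NeoX-style half-split rotation layout, and the placeholder frequencies are illustrative assumptions, not the actual sgl-kernel implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Per-head RMSNorm over the last (head_dim) axis.
    var = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(var + eps) * weight

def apply_partial_rope(x, cos, sin, rotary_dim):
    # Rotate only the first `rotary_dim` dims (NeoX half-split layout);
    # the remaining head_dim - rotary_dim dims pass through untouched.
    rot, passthrough = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)
    return np.concatenate([rotated, passthrough], axis=-1)

# Toy shapes: 1 token, 2 heads, head_dim=8, partial_rotary_factor=0.5.
head_dim = 8
rotary_dim = head_dim // 2  # partial_rotary_factor = 0.5
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 2, head_dim)).astype(np.float32)
w = np.ones(head_dim, dtype=np.float32)
angles = np.arange(rotary_dim // 2, dtype=np.float32)  # placeholder frequencies
cos, sin = np.cos(angles), np.sin(angles)

q_out = apply_partial_rope(rms_norm(q, w), cos, sin, rotary_dim)
```

The fused kernel performs both steps in one pass over Q/K, avoiding the extra global-memory round trip between the norm and the rotation.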

Accuracy Tests

KERNEL UT

pytest tests/test_fused_qk_norm_rope.py
=========================== test session starts ===========================
platform linux -- Python 3.12.12, pytest-9.0.1, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sgl-workspace/sglang-int/sgl-kernel
configfile: pyproject.toml
plugins: anyio-4.12.0, typeguard-4.4.4
collected 200 items
==================== 200 passed, 2 warnings in 8.27s ====================

E2E

HTTPS_PROXY='http://127.0.0.1:1081' python3 -m sglang.test.run_eval --port 8123 --eval-name mmlu --num-examples 200 
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.0 self.max_tokens=2048 self.reasoning_effort=None self.extra_body={}
100%|██████████| 200/200 [01:32<00:00, 2.15it/s]
Writing report to /tmp/mmlu__models_GLM-4.6-FP8_.html
{'other': np.float64(0.8723404255319149), 'other:std': np.float64(0.3337103647097472), 'score:std': np.float64(0.41758232721225164), 'stem': np.float64(0.7380952380952381), 'stem:std': np.float64(0.43967107887189016), 'humanities': np.float64(0.7258064516129032), 'humanities:std': np.float64(0.4461069898690107), 'social_sciences': np.float64(0.7755102040816326), 'social_sciences:std': np.float64(0.4172458836787933), 'score': np.float64(0.775)}
Writing results to /tmp/mmlu__models_GLM-4.6-FP8_.json
Total latency: 92.947 s
Score: 0.775

Benchmarking and Profiling

On 8xH20

python3 -m sglang.bench_serving \
    --backend sglang \
    --tokenizer /models/GLM-4.6-FP8 \
    --dataset-name random \
    --random-input 2000 \
    --random-output 300 \
    --num-prompts 400 --port 8123

BEFORE

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     400       
Benchmark duration (s):                  74.00     
Total input tokens:                      413547    
Total input text tokens:                 413547    
Total input vision tokens:               0         
Total generated tokens:                  60438     
Total generated tokens (retokenized):    60283     
Request throughput (req/s):              5.41      
Input token throughput (tok/s):          5588.78   
Output token throughput (tok/s):         816.77    
Peak output token throughput (tok/s):    2096.00   
Peak concurrent requests:                400       
Total token throughput (tok/s):          6405.55   
Concurrency:                             241.94    
Accept length:                           1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   44757.22  
Median E2E Latency (ms):                 45578.14  
---------------Time to First Token----------------
Mean TTFT (ms):                          31770.14  
Median TTFT (ms):                        31222.84  
P99 TTFT (ms):                           66947.24  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          91.85     
Median TPOT (ms):                        91.94     
P99 TPOT (ms):                           200.68    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           85.69     
Median ITL (ms):                         89.57     
P95 ITL (ms):                            192.59    
P99 ITL (ms):                            237.08    
Max ITL (ms):                            5352.32   
==================================================

AFTER

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     400       
Benchmark duration (s):                  73.76     
Total input tokens:                      413547    
Total input text tokens:                 413547    
Total input vision tokens:               0         
Total generated tokens:                  60438     
Total generated tokens (retokenized):    60288     
Request throughput (req/s):              5.42      
Input token throughput (tok/s):          5606.52   
Output token throughput (tok/s):         819.37    
Peak output token throughput (tok/s):    2140.00   
Peak concurrent requests:                400       
Total token throughput (tok/s):          6425.88   
Concurrency:                             240.97    
Accept length:                           1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   44436.41  
Median E2E Latency (ms):                 45314.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          31483.80  
Median TTFT (ms):                        31055.40  
P99 TTFT (ms):                           66442.16  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          90.36     
Median TPOT (ms):                        91.38     
P99 TPOT (ms):                           189.26    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           85.46     
Median ITL (ms):                         91.90     
P95 ITL (ms):                            188.44    
P99 ITL (ms):                            258.64    
Max ITL (ms):                            4940.61   
==================================================
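The gains are modest but consistent at this saturation level. Relative deltas can be computed directly from the two tables above (values copied from the logs):

```python
# Before/after metrics copied from the serving benchmark tables above.
before = {"mean_tpot_ms": 91.85, "p99_tpot_ms": 200.68, "mean_ttft_ms": 31770.14}
after = {"mean_tpot_ms": 90.36, "p99_tpot_ms": 189.26, "mean_ttft_ms": 31483.80}

# Percentage reduction for each metric (positive = improvement).
deltas = {
    k: round((before[k] - after[k]) / before[k] * 100, 2)
    for k in before
}
print(deltas)
```

That works out to roughly a 1.6% lower mean TPOT and 5.7% lower P99 TPOT, with TTFT and overall throughput essentially unchanged.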

Checklist

@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@Kevin-XiongC Kevin-XiongC marked this pull request as ready for review December 12, 2025 06:47
@yuan-luo (Collaborator)

@Kevin-XiongC Please split the sgl-kernel change into an independent PR, such as #14036.

@Kevin-XiongC (Contributor, Author)

I've split out the kernel part into #15141. Could you please review it? @yuan-luo

@yuan-luo (Collaborator)

Sure.
