[Perf] Fused qk_norm_rope for GLM4.6 #14952

Closed
Kevin-XiongC wants to merge 10 commits into sgl-project:main from novitalabs:fused_glm

Conversation

@Kevin-XiongC (Contributor) commented Dec 12, 2025

Motivation


Inspired by #13998

Modifications

  • Added fused QK norm + RoPE support in Glm4MoeAttention.
  • Added a rotary_dim parameter to support partial rotary embedding:
    • Allows rotary_dim < head_dim (e.g., partial_rotary_factor=0.5).
    • RoPE is applied only to the first rotary_dim dimensions.

Use --enable-fused-qk-norm-rope to turn it on.
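For context, the unfused path that this kernel fuses performs a per-head RMSNorm on Q/K followed by a partial rotary embedding. Below is a minimal NumPy sketch of that reference behavior; the function names, the NeoX-style half-split rotation layout, and the placeholder frequencies are illustrative assumptions, not the actual sgl-kernel implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Per-head RMSNorm over the last (head_dim) axis.
    var = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(var + eps) * weight

def apply_partial_rope(x, cos, sin, rotary_dim):
    # Rotate only the first `rotary_dim` dims (NeoX half-split layout);
    # the remaining head_dim - rotary_dim dims pass through untouched.
    rot, passthrough = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)
    return np.concatenate([rotated, passthrough], axis=-1)

# Toy shapes: 1 token, 2 heads, head_dim=8, partial_rotary_factor=0.5.
head_dim = 8
rotary_dim = head_dim // 2  # partial_rotary_factor = 0.5
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 2, head_dim)).astype(np.float32)
w = np.ones(head_dim, dtype=np.float32)
angles = np.arange(rotary_dim // 2, dtype=np.float32)  # placeholder frequencies
cos, sin = np.cos(angles), np.sin(angles)

q_out = apply_partial_rope(rms_norm(q, w), cos, sin, rotary_dim)
```

The fused kernel performs both steps in one pass over Q/K, avoiding the extra global-memory round trip between the norm and the rotation.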

Accuracy Tests

KERNEL UT

pytest tests/test_fused_qk_norm_rope.py
=========================== test session starts ===========================
platform linux -- Python 3.12.12, pytest-9.0.1, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sgl-workspace/sglang-int/sgl-kernel
configfile: pyproject.toml
plugins: anyio-4.12.0, typeguard-4.4.4
collected 200 items
==================== 200 passed, 2 warnings in 8.27s ====================

E2E

HTTPS_PROXY='http://127.0.0.1:1081' python3 -m sglang.test.run_eval --port 8123 --eval-name mmlu --num-examples 200 
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.0 self.max_tokens=2048 self.reasoning_effort=None self.extra_body={}
100%|██████████| 200/200 [01:32<00:00, 2.15it/s]
Writing report to /tmp/mmlu__models_GLM-4.6-FP8_.html
{'other': np.float64(0.8723404255319149), 'other:std': np.float64(0.3337103647097472), 'score:std': np.float64(0.41758232721225164), 'stem': np.float64(0.7380952380952381), 'stem:std': np.float64(0.43967107887189016), 'humanities': np.float64(0.7258064516129032), 'humanities:std': np.float64(0.4461069898690107), 'social_sciences': np.float64(0.7755102040816326), 'social_sciences:std': np.float64(0.4172458836787933), 'score': np.float64(0.775)}
Writing results to /tmp/mmlu__models_GLM-4.6-FP8_.json
Total latency: 92.947 s
Score: 0.775

Benchmarking and Profiling

On 8xH20

python3 -m sglang.bench_serving \
    --backend sglang \
    --tokenizer /models/GLM-4.6-FP8 \
    --dataset-name random \
    --random-input 2000 \
    --random-output 300 \
    --num-prompts 400 --port 8123

BEFORE

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     400       
Benchmark duration (s):                  74.00     
Total input tokens:                      413547    
Total input text tokens:                 413547    
Total input vision tokens:               0         
Total generated tokens:                  60438     
Total generated tokens (retokenized):    60283     
Request throughput (req/s):              5.41      
Input token throughput (tok/s):          5588.78   
Output token throughput (tok/s):         816.77    
Peak output token throughput (tok/s):    2096.00   
Peak concurrent requests:                400       
Total token throughput (tok/s):          6405.55   
Concurrency:                             241.94    
Accept length:                           1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   44757.22  
Median E2E Latency (ms):                 45578.14  
---------------Time to First Token----------------
Mean TTFT (ms):                          31770.14  
Median TTFT (ms):                        31222.84  
P99 TTFT (ms):                           66947.24  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          91.85     
Median TPOT (ms):                        91.94     
P99 TPOT (ms):                           200.68    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           85.69     
Median ITL (ms):                         89.57     
P95 ITL (ms):                            192.59    
P99 ITL (ms):                            237.08    
Max ITL (ms):                            5352.32   
==================================================

AFTER

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     400       
Benchmark duration (s):                  73.76     
Total input tokens:                      413547    
Total input text tokens:                 413547    
Total input vision tokens:               0         
Total generated tokens:                  60438     
Total generated tokens (retokenized):    60288     
Request throughput (req/s):              5.42      
Input token throughput (tok/s):          5606.52   
Output token throughput (tok/s):         819.37    
Peak output token throughput (tok/s):    2140.00   
Peak concurrent requests:                400       
Total token throughput (tok/s):          6425.88   
Concurrency:                             240.97    
Accept length:                           1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   44436.41  
Median E2E Latency (ms):                 45314.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          31483.80  
Median TTFT (ms):                        31055.40  
P99 TTFT (ms):                           66442.16  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          90.36     
Median TPOT (ms):                        91.38     
P99 TPOT (ms):                           189.26    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           85.46     
Median ITL (ms):                         91.90     
P95 ITL (ms):                            188.44    
P99 ITL (ms):                            258.64    
Max ITL (ms):                            4940.61   
==================================================
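The gains are modest but consistent at this saturation level. Relative deltas can be computed directly from the two tables above (values copied from the logs):

```python
# Before/after metrics copied from the serving benchmark tables above.
before = {"mean_tpot_ms": 91.85, "p99_tpot_ms": 200.68, "mean_ttft_ms": 31770.14}
after = {"mean_tpot_ms": 90.36, "p99_tpot_ms": 189.26, "mean_ttft_ms": 31483.80}

# Percentage reduction for each metric (positive = improvement).
deltas = {
    k: round((before[k] - after[k]) / before[k] * 100, 2)
    for k in before
}
print(deltas)
```

That works out to roughly a 1.6% lower mean TPOT and 5.7% lower P99 TPOT, with TTFT and overall throughput essentially unchanged.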

Checklist

@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@Kevin-XiongC Kevin-XiongC marked this pull request as ready for review December 12, 2025 06:47
@yuan-luo (Collaborator)

@Kevin-XiongC Please split the sgl-kernel change into an independent PR, such as #14036.

@Kevin-XiongC (Contributor, Author)

I've split out the kernel part into #15141. Could you please review it? @yuan-luo

@yuan-luo (Collaborator)

Sure.
