
[kernel] opt moe align block kernel by block/warp scan algorithm #7884

Merged
ispobock merged 8 commits into sgl-project:main from yuan-luo:refine_moe_align_block_kernel
Jul 17, 2025

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Jul 9, 2025

Motivation

This PR introduces a block/warp scan algorithm in the fused MoE path's moe_align_block_size_kernel, which yields approximately a 10% speedup.

Here is the benchmark result.
Note: this PR applies when num_experts >= 128; cases with num_experts < 128 take the moe_align_block_size_small_batch_expert_kernel path instead.
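
For reference, below is a minimal editorial sketch of a two-level (warp + block) exclusive prefix sum of the kind this PR adopts. The function and variable names are illustrative assumptions, not the actual sgl-kernel code.

// Illustrative two-level exclusive scan: warp-level register shuffles plus a
// block-level combine of per-warp totals. Assumes blockDim.x is a multiple of 32
// and at most 1024; warp_sums is shared memory with one slot per warp.
__device__ int block_exclusive_scan(int val, int* warp_sums) {
  const int lane = threadIdx.x & 31;
  const int warp = threadIdx.x >> 5;
  const int num_warps = blockDim.x >> 5;

  // 1) Warp-level inclusive scan via shuffles.
  int inclusive = val;
  for (int offset = 1; offset < 32; offset <<= 1) {
    int n = __shfl_up_sync(0xffffffff, inclusive, offset);
    if (lane >= offset) inclusive += n;
  }

  // 2) Last lane of each warp publishes its warp total to shared memory.
  if (lane == 31) warp_sums[warp] = inclusive;
  __syncthreads();

  // 3) The first warp turns the per-warp totals into exclusive warp bases.
  if (warp == 0) {
    int w = (lane < num_warps) ? warp_sums[lane] : 0;
    int winc = w;
    for (int offset = 1; offset < 32; offset <<= 1) {
      int n = __shfl_up_sync(0xffffffff, winc, offset);
      if (lane >= offset) winc += n;
    }
    if (lane < num_warps) warp_sums[lane] = winc - w;  // exclusive base per warp
  }
  __syncthreads();

  // 4) Exclusive result = warp base + (inclusive within warp - own value).
  return warp_sums[warp] + inclusive - val;
}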

This PR:

$python ./sgl-kernel/benchmark/bench_moe_align_block_size.py
INFO 07-09 13:15:28 [__init__.py:244] Automatically detected platform cuda.
✅ VLLM implementation works with 64 experts!
✅ SGL and Triton implementations match
✅ SGL and VLLM implementations match

📊 Running performance benchmark for 64 experts...
moe-align-block-size-performance:
     num_tokens  num_experts  topk        SGL  SGL Fusion       Triton
0           1.0          8.0   1.0  16.672000   14.912000    44.032000
1           1.0          8.0   2.0  16.896000   15.072000    44.256002
2           1.0          8.0   4.0  16.672000   15.008000    43.264002
3           1.0          8.0   8.0  16.432000   15.104000    44.767998
4           1.0         32.0   1.0  19.200001   17.424000    44.512000
5           1.0         32.0   2.0  19.231999   17.503999    44.464000
6           1.0         32.0   4.0  19.200001   17.535999    43.728001
7           1.0         32.0   8.0  19.231999   17.568000    44.767998
8           1.0         64.0   1.0  22.655999   20.864001    44.160001
9           1.0         64.0   2.0  22.608001   20.927999    44.799998
10          1.0         64.0   4.0  22.624001   20.927999    44.863999
11          1.0         64.0   8.0  22.496000   20.959999    45.343999
12          1.0        128.0   1.0  19.296000   17.696001    47.295999
13          1.0        128.0   2.0  19.264000   17.472001    44.927999
14          1.0        128.0   4.0  19.328000   17.728001    43.968000
15          1.0        128.0   8.0  19.296000   17.503999    44.560000
16          1.0        256.0   1.0  19.231999   17.632000    45.311999
17          1.0        256.0   2.0  19.200001   17.728001    45.056000
18          1.0        256.0   4.0  19.360000   17.440001    46.271998
19          1.0        256.0   8.0  19.328000   17.759999    46.016000
20          8.0          8.0   1.0  16.543999   15.104000    44.096000
21          8.0          8.0   2.0  16.576000   14.976000    44.383999
22          8.0          8.0   4.0  16.319999   15.008000    45.407999
23          8.0          8.0   8.0  17.055999   15.488000    44.911999
24          8.0         32.0   1.0  19.216000   17.568000    45.152001
25          8.0         32.0   2.0  19.231999   17.728001    45.919999
26          8.0         32.0   4.0  19.072000   17.824000    45.887999
27          8.0         32.0   8.0  19.776000   18.528000    45.664001
28          8.0         64.0   1.0  22.608001   20.959999    46.112001
29          8.0         64.0   2.0  22.528000   20.992000    43.584000
30          8.0         64.0   4.0  22.624001   21.183999    44.704001
31          8.0         64.0   8.0  22.496000   21.376001    45.664001
32          8.0        128.0   1.0  19.328000   17.535999    46.464000
33          8.0        128.0   2.0  19.360000   17.696001    44.831999
34          8.0        128.0   4.0  19.328000   17.759999    46.592001
35          8.0        128.0   8.0  19.360000   17.792000    45.279998
36          8.0        256.0   1.0  19.296000   17.535999    45.823999
37          8.0        256.0   2.0  19.360000   17.568000    46.416000
38          8.0        256.0   4.0  19.328000   17.696001    46.271998
39          8.0        256.0   8.0  19.360000   17.983999    46.048000
40         16.0          8.0   1.0  16.559999   15.008000    44.544000
41         16.0          8.0   2.0  16.287999   15.008000    44.895999
42         16.0          8.0   4.0  17.055999   15.456000    44.576000
43         16.0          8.0   8.0  18.144000   16.543999    45.600001
44         16.0         32.0   1.0  19.231999   17.759999    43.807998
45         16.0         32.0   2.0  19.296000   17.856000    45.248002
46         16.0         32.0   4.0  19.552000   18.480000    45.056000
47         16.0         32.0   8.0  20.832000   19.584000    44.512000
48         16.0         64.0   1.0  22.624001   20.992000    44.927999
49         16.0         64.0   2.0  22.624001   21.120001    47.200002
50         16.0         64.0   4.0  22.592001   21.376001    47.648001
51         16.0         64.0   8.0  22.944000   22.175999    45.728002
52         16.0        128.0   1.0  19.360000   17.535999    45.311999
53         16.0        128.0   2.0  19.392001   17.568000    44.767998
54         16.0        128.0   4.0  19.360000   17.920000    45.472000
55         16.0        128.0   8.0  19.360000   18.015999    45.375999
56         16.0        256.0   1.0  19.328000   17.568000    46.208002
57         16.0        256.0   2.0  19.296000   17.728001    46.944000
58         16.0        256.0   4.0  19.328000   17.983999    46.815999
59         16.0        256.0   8.0  19.424001   18.368000    46.432000
60         32.0          8.0   1.0  16.543999   15.040000    43.744002
61         32.0          8.0   2.0  16.799999   15.488000    44.064000
62         32.0          8.0   4.0  18.112000   16.576000    43.200001
63         32.0          8.0   8.0  20.256000   18.816000    42.495999
64         32.0         32.0   1.0  19.296000   17.792000    41.951999
65         32.0         32.0   2.0  19.568000   18.495999    41.471999
66         32.0         32.0   4.0  20.576000   19.616000    42.495999
67         32.0         32.0   8.0  22.879999   21.728000    44.128001
68         32.0         64.0   1.0  22.560000   21.183999    43.584000
69         32.0         64.0   2.0  22.560000   21.360001    44.831999
70         32.0         64.0   4.0  22.944000   22.175999    43.935999
71         32.0         64.0   8.0  23.840001   23.200000    45.311999
72         32.0        128.0   1.0  19.328000   17.503999    43.488000
73         32.0        128.0   2.0  19.360000   17.983999    44.256002
74         32.0        128.0   4.0  19.392001   17.983999    43.359999
75         32.0        128.0   8.0  19.552000   18.528000    45.136001
76         32.0        256.0   1.0  19.360000   17.744000    45.984000
77         32.0        256.0   2.0  19.312000   17.759999    45.632001
78         32.0        256.0   4.0  19.360000   18.208001    46.144001
79         32.0        256.0   8.0  19.552000   18.880000    46.528000
80         64.0          8.0   1.0  17.055999   15.488000    44.544000
81         64.0          8.0   2.0  17.856000   16.543999    45.056000
82         64.0          8.0   4.0  20.352000   18.848000    44.064000
83         64.0          8.0   8.0  24.480000   22.784000    45.120001
84         64.0         32.0   1.0  19.743999   18.495999    46.560001
85         64.0         32.0   2.0  20.576000   19.616000    43.520000
86         64.0         32.0   4.0  22.879999   21.695999    43.488000
87         64.0         32.0   8.0  26.208000   25.280001    45.024000
88         64.0         64.0   1.0  22.544000   21.376001    45.040000
89         64.0         64.0   2.0  23.135999   22.175999    44.544000
90         64.0         64.0   4.0  23.840001   23.167999    45.248002
91         64.0         64.0   8.0  26.272001   26.272001    45.456000
92         64.0        128.0   1.0  19.360000   17.759999    44.608001
93         64.0        128.0   2.0  19.392001   17.983999    45.120001
94         64.0        128.0   4.0  19.584000   18.656000    46.879999
95         64.0        128.0   8.0  19.424001   18.592000    46.016000
96         64.0        256.0   1.0  19.264000   17.952001    45.855999
97         64.0        256.0   2.0  19.360000   18.400000    46.303999
98         64.0        256.0   4.0  19.487999   18.688001    46.656001
99         64.0        256.0   8.0  19.664001   19.168001    47.359999
100       128.0          8.0   1.0  17.856000   16.543999    46.528000
101       128.0          8.0   2.0  20.191999   18.848000    45.279998
102       128.0          8.0   4.0  24.512000   22.784000    44.032000
103       128.0          8.0   8.0  19.231999   17.472001    46.080001
104       128.0         32.0   1.0  20.832000   19.616000    45.311999
105       128.0         32.0   2.0  22.848001   21.728000    44.480000
106       128.0         32.0   4.0  26.400000   25.312001    45.120001
107       128.0         32.0   8.0  19.104000   17.360000    43.839999
108       128.0         64.0   1.0  23.072001   22.175999    44.831999
109       128.0         64.0   2.0  23.936000   23.167999    44.799998
110       128.0         64.0   4.0  26.272001   26.272001    44.895999
111       128.0         64.0   8.0  19.632000   18.080000    45.375999
112       128.0        128.0   1.0  19.392001   18.015999    45.024000
113       128.0        128.0   2.0  19.552000   18.624000    45.216002
114       128.0        128.0   4.0  19.424001   18.495999    44.415999
115       128.0        128.0   8.0  19.360000   18.495999    44.496000
116       128.0        256.0   1.0  19.424001   18.432001    45.919999
117       128.0        256.0   2.0  19.504000   18.880000    46.688002
118       128.0        256.0   4.0  19.776000   19.168001    47.488000
119       128.0        256.0   8.0  19.808000   19.776000    48.016001
120       256.0          8.0   1.0  20.223999   18.880000    45.248002
121       256.0          8.0   2.0  24.480000   22.784000    45.088001
122       256.0          8.0   4.0  19.231999   17.503999    46.592001
123       256.0          8.0   8.0  19.296000   17.728001    69.696002
124       256.0         32.0   1.0  22.848001   21.695999    43.648001
125       256.0         32.0   2.0  26.400000   25.312001    44.704001
126       256.0         32.0   4.0  19.040000   17.360000    44.000000
127       256.0         32.0   8.0  19.680001   18.080000    44.480000
128       256.0         64.0   1.0  23.871999   23.167999    44.287998
129       256.0         64.0   2.0  26.432000   26.272001    43.935999
130       256.0         64.0   4.0  19.616000   18.015999    44.000000
131       256.0         64.0   8.0  20.096000   18.751999    42.624000
132       256.0        128.0   1.0  19.616000   18.592000    45.375999
133       256.0        128.0   2.0  19.424001   18.528000    43.552000
134       256.0        128.0   4.0  19.360000   18.528000    45.375999
135       256.0        128.0   8.0  20.096000   19.264000    44.592001
136       256.0        256.0   1.0  19.455999   18.751999    46.432000
137       256.0        256.0   2.0  19.680001   19.584000    47.520000
138       256.0        256.0   4.0  19.840000   19.792000    47.839999
139       256.0        256.0   8.0  20.160001   20.000000    50.080001
140       512.0          8.0   1.0  24.351999   22.784000    42.656001
141       512.0          8.0   2.0  19.200001   17.472001    45.728002
142       512.0          8.0   4.0  19.296000   17.759999    69.728002
143       512.0          8.0   8.0  21.024000   19.296000   114.335999
144       512.0         32.0   1.0  26.400000   25.280001    45.311999
145       512.0         32.0   2.0  19.072000   17.376000    43.712001
146       512.0         32.0   4.0  19.680001   18.048000    44.767998
147       512.0         32.0   8.0  21.056000   19.455999    49.311999
148       512.0         64.0   1.0  26.432000   26.272001    45.279998
149       512.0         64.0   2.0  19.648001   18.255999    45.375999
150       512.0         64.0   4.0  20.064000   18.784000    43.455999
151       512.0         64.0   8.0  21.695999   20.191999    43.423999
152       512.0        128.0   1.0  19.392001   18.592000    45.120001
153       512.0        128.0   2.0  19.360000   18.560000    43.104000
154       512.0        128.0   4.0  20.160001   19.264000    43.616001
155       512.0        128.0   8.0  21.183999   20.352000    45.504000
156       512.0        256.0   1.0  19.696000   19.200001    47.168002
157       512.0        256.0   2.0  19.840000   19.727999    47.488000
158       512.0        256.0   4.0  20.128001   20.160001    50.303999
159       512.0        256.0   8.0  21.792000   21.632001    54.752000
160      1024.0          8.0   1.0  19.231999   17.535999    46.239998
161      1024.0          8.0   2.0  19.264000   17.664000    69.824003
162      1024.0          8.0   4.0  20.992000   19.392001   114.944004
163      1024.0          8.0   8.0  23.072001   21.568000   203.743994
164      1024.0         32.0   1.0  19.072000   17.376000    44.112001
165      1024.0         32.0   2.0  19.680001   18.015999    42.176001
166      1024.0         32.0   4.0  21.056000   19.455999    48.864000
167      1024.0         32.0   8.0  23.936000   22.688000    73.824003
168      1024.0         64.0   1.0  19.648001   18.239999    44.096000
169      1024.0         64.0   2.0  20.096000   18.751999    43.776002
170      1024.0         64.0   4.0  21.695999   20.240000    45.056000
171      1024.0         64.0   8.0  24.160000   22.976000    53.215999
172      1024.0        128.0   1.0  19.360000   18.464001    45.216002
173      1024.0        128.0   2.0  20.176000   19.296000    43.807998
174      1024.0        128.0   4.0  21.215999   20.352000    44.447999
175      1024.0        128.0   8.0  23.391999   22.368001    47.839999
176      1024.0        256.0   1.0  19.840000   19.776000    48.416000
177      1024.0        256.0   2.0  20.128001   20.048000    50.080001
178      1024.0        256.0   4.0  21.824000   21.695999    54.719999
179      1024.0        256.0   8.0  24.095999   24.160000    60.640000
180      2048.0          8.0   1.0  19.296000   17.744000    69.888003
181      2048.0          8.0   2.0  20.992000   19.328000   114.432000
182      2048.0          8.0   4.0  23.104001   21.568000   203.968003
183      2048.0          8.0   8.0  27.775999   26.912000   378.239989
184      2048.0         32.0   1.0  19.680001   17.983999    44.128001
185      2048.0         32.0   2.0  21.056000   19.424001    48.448000
186      2048.0         32.0   4.0  23.840001   22.496000    73.600002
187      2048.0         32.0   8.0  28.672000   28.160000   118.207999
188      2048.0         64.0   1.0  20.032000   18.751999    43.968000
189      2048.0         64.0   2.0  21.728000   20.352000    43.903999
190      2048.0         64.0   4.0  24.127999   23.072001    53.183999
191      2048.0         64.0   8.0  30.239999   29.408000    77.408001
192      2048.0        128.0   1.0  20.160001   19.231999    43.968000
193      2048.0        128.0   2.0  21.152001   20.288000    44.096000
194      2048.0        128.0   4.0  23.424000   22.399999    48.384000
195      2048.0        128.0   8.0  28.320000   28.000001    60.479999
196      2048.0        256.0   1.0  20.160001   20.160001    50.112002
197      2048.0        256.0   2.0  21.824000   21.792000    55.103999
198      2048.0        256.0   4.0  24.095999   24.064001    60.768001
199      2048.0        256.0   8.0  29.727999   29.792000    68.928003
200      4096.0          8.0   1.0  20.992000   19.376000   114.576001
201      4096.0          8.0   2.0  23.072001   21.568000   203.615993
202      4096.0          8.0   4.0  27.807999   26.880000   377.952009
203      4096.0          8.0   8.0  36.543999   36.640000   728.096008
204      4096.0         32.0   1.0  21.024000   19.424001    49.024001
205      4096.0         32.0   2.0  23.903999   22.496000    73.632002
206      4096.0         32.0   4.0  28.736001   28.063999   118.303999
207      4096.0         32.0   8.0  38.880002   38.943999   209.503993
208      4096.0         64.0   1.0  21.632001   20.352000    43.327998
209      4096.0         64.0   2.0  24.160000   23.135999    53.280000
210      4096.0         64.0   4.0  30.208001   29.408000    77.264000
211      4096.0         64.0   8.0  40.959999   41.248001   122.560002
212      4096.0        128.0   1.0  21.215999   20.352000    45.952000
213      4096.0        128.0   2.0  23.360001   22.624001    47.775999
214      4096.0        128.0   4.0  28.352000   28.096000    60.288001
215      4096.0        128.0   8.0  37.967999   38.511999    84.352002
216      4096.0        256.0   1.0  21.824000   21.728000    54.912001
217      4096.0        256.0   2.0  24.064001   24.160000    60.479999
218      4096.0        256.0   4.0  29.632000   29.696001    68.928003
219      4096.0        256.0   8.0  35.615999   36.543999    82.208000
220      8192.0          8.0   1.0  23.040000   21.695999   203.968003
221      8192.0          8.0   2.0  27.744001   26.912000   378.176004
222      8192.0          8.0   4.0  36.575999   36.672000   730.368018
223      8192.0          8.0   8.0  54.848000   57.055999  1443.904042
224      8192.0         32.0   1.0  23.808001   22.655999    73.536001
225      8192.0         32.0   2.0  28.704001   27.872000   118.368000
226      8192.0         32.0   4.0  38.784001   38.896000   209.503993
227      8192.0         32.0   8.0  58.432002   60.640000   381.760001
228      8192.0         64.0   1.0  24.160000   23.040000    53.119998
229      8192.0         64.0   2.0  30.176001   29.344000    77.472001
230      8192.0         64.0   4.0  40.832002   41.184001   122.368000
231      8192.0         64.0   8.0  63.072003   65.504000   215.136006
232      8192.0        128.0   1.0  23.391999   22.368001    47.839999
233      8192.0        128.0   2.0  28.320000   28.320000    60.479999
234      8192.0        128.0   4.0  37.951998   38.431998    84.448002
235      8192.0        128.0   8.0  56.384001   59.071999   131.743997
236      8192.0        256.0   1.0  24.095999   24.127999    60.927998
237      8192.0        256.0   2.0  29.664000   29.600000    68.960004
238      8192.0        256.0   4.0  35.583999   36.607999    82.176000
239      8192.0        256.0   8.0  52.480001   55.424001   106.271997

Main:

$python ./sgl-kernel/benchmark/bench_moe_align_block_size.py
INFO 07-09 13:51:19 [__init__.py:244] Automatically detected platform cuda.
✅ VLLM implementation works with 64 experts!
✅ SGL and Triton implementations match
✅ SGL and VLLM implementations match

📊 Running performance benchmark for 64 experts...
moe-align-block-size-performance:
     num_tokens  num_experts  topk        SGL  SGL Fusion       Triton
0           1.0          8.0   1.0  16.832000   14.912000    42.431999
1           1.0          8.0   2.0  16.672000   15.072000    42.240001
2           1.0          8.0   4.0  16.864000   15.072000    44.383999
3           1.0          8.0   8.0  16.511999   15.104000    43.327998
4           1.0         32.0   1.0  19.231999   17.503999    42.463999
5           1.0         32.0   2.0  19.264000   17.535999    41.760001
6           1.0         32.0   4.0  19.200001   17.535999    43.200001
7           1.0         32.0   8.0  19.264000   17.568000    44.128001
8           1.0         64.0   1.0  22.655999   20.927999    41.680001
9           1.0         64.0   2.0  22.655999   20.912000    43.200001
10          1.0         64.0   4.0  22.655999   20.959999    43.584000
11          1.0         64.0   8.0  22.592001   20.927999    43.264002
12          1.0        128.0   1.0  20.447999   18.751999    43.584000
13          1.0        128.0   2.0  20.416001   18.784000    42.112000
14          1.0        128.0   4.0  20.447999   18.784000    44.032000
15          1.0        128.0   8.0  20.479999   18.608000    43.232001
16          1.0        256.0   1.0  20.576000   19.040000    45.600001
17          1.0        256.0   2.0  20.671999   18.912001    45.056000
18          1.0        256.0   4.0  20.736000   19.200001    45.504000
19          1.0        256.0   8.0  20.640001   19.231999    44.960000
20          8.0          8.0   1.0  16.416000   15.072000    42.335998
21          8.0          8.0   2.0  16.511999   15.072000    43.280002
22          8.0          8.0   4.0  16.511999   14.976000    44.736002
23          8.0          8.0   8.0  17.023999   15.488000    44.927999
24          8.0         32.0   1.0  19.264000   17.568000    42.624000
25          8.0         32.0   2.0  19.200001   17.759999    43.568000
26          8.0         32.0   4.0  19.247999   17.792000    43.520000
27          8.0         32.0   8.0  19.711999   18.464001    43.648001
28          8.0         64.0   1.0  22.592001   20.927999    44.000000
29          8.0         64.0   2.0  22.592001   20.992000    43.648001
30          8.0         64.0   4.0  22.655999   21.152001    43.935999
31          8.0         64.0   8.0  22.655999   21.376001    43.807998
32          8.0        128.0   1.0  20.416001   18.751999    44.512000
33          8.0        128.0   2.0  20.447999   18.719999    44.863999
34          8.0        128.0   4.0  20.447999   18.656000    44.064000
35          8.0        128.0   8.0  20.447999   19.007999    44.351999
36          8.0        256.0   1.0  20.671999   19.007999    45.407999
37          8.0        256.0   2.0  20.703999   19.040000    46.048000
38          8.0        256.0   4.0  20.768000   19.200001    46.144001
39          8.0        256.0   8.0  20.768000   19.264000    45.248002
40         16.0          8.0   1.0  16.543999   14.944000    43.184001
41         16.0          8.0   2.0  16.543999   15.072000    43.855999
42         16.0          8.0   4.0  17.023999   15.488000    42.815998
43         16.0          8.0   8.0  18.080000   16.543999    43.008000
44         16.0         32.0   1.0  19.231999   17.728001    42.560000
45         16.0         32.0   2.0  19.040000   17.792000    41.760001
46         16.0         32.0   4.0  19.648001   18.608000    42.528000
47         16.0         32.0   8.0  20.768000   19.616000    43.807998
48         16.0         64.0   1.0  22.528000   20.992000    43.520000
49         16.0         64.0   2.0  22.496000   21.183999    43.616001
50         16.0         64.0   4.0  22.592001   21.376001    44.576000
51         16.0         64.0   8.0  22.911999   21.952000    44.480000
52         16.0        128.0   1.0  20.463999   18.719999    44.672001
53         16.0        128.0   2.0  20.447999   18.784000    44.672001
54         16.0        128.0   4.0  20.416001   18.848000    44.383999
55         16.0        128.0   8.0  20.384001   19.072000    44.927999
56         16.0        256.0   1.0  20.736000   19.200001    45.759998
57         16.0        256.0   2.0  20.703999   19.040000    45.120001
58         16.0        256.0   4.0  20.800000   19.200001    45.440000
59         16.0        256.0   8.0  20.832000   19.552000    46.080001
60         32.0          8.0   1.0  16.287999   15.008000    44.767998
61         32.0          8.0   2.0  17.055999   15.424000    44.160001
62         32.0          8.0   4.0  18.080000   16.511999    43.807998
63         32.0          8.0   8.0  20.256000   18.848000    44.032000
64         32.0         32.0   1.0  19.231999   17.759999    43.136001
65         32.0         32.0   2.0  19.648001   18.464001    42.959999
66         32.0         32.0   4.0  20.768000   19.616000    44.192001
67         32.0         32.0   8.0  22.848001   21.695999    42.592000
68         32.0         64.0   1.0  22.576001   21.152001    43.040000
69         32.0         64.0   2.0  22.624001   21.344000    42.208001
70         32.0         64.0   4.0  23.104001   21.888001    43.712001
71         32.0         64.0   8.0  23.808001   23.135999    42.720001
72         32.0        128.0   1.0  20.463999   18.848000    42.560000
73         32.0        128.0   2.0  20.479999   18.848000    43.776002
74         32.0        128.0   4.0  20.447999   19.264000    43.264002
75         32.0        128.0   8.0  20.671999   19.711999    44.128001
76         32.0        256.0   1.0  20.703999   19.040000    46.144001
77         32.0        256.0   2.0  20.800000   19.231999    46.144001
78         32.0        256.0   4.0  20.736000   19.552000    45.472000
79         32.0        256.0   8.0  20.959999   20.160001    46.656001
80         64.0          8.0   1.0  16.767999   15.488000    45.823999
81         64.0          8.0   2.0  17.856000   16.511999    45.088001
82         64.0          8.0   4.0  20.256000   18.848000    45.024000
83         64.0          8.0   8.0  24.480000   22.752000    44.128001
84         64.0         32.0   1.0  19.648001   18.464001    44.000000
85         64.0         32.0   2.0  20.576000   19.648001    44.767998
86         64.0         32.0   4.0  22.655999   21.760000    44.351999
87         64.0         32.0   8.0  26.240001   25.280001    43.712001
88         64.0         64.0   1.0  22.592001   21.376001    44.992000
89         64.0         64.0   2.0  23.104001   21.888001    43.040000
90         64.0         64.0   4.0  23.840001   23.135999    44.160001
91         64.0         64.0   8.0  26.400000   26.272001    43.104000
92         64.0        128.0   1.0  20.447999   18.816000    45.407999
93         64.0        128.0   2.0  20.416001   19.231999    43.584000
94         64.0        128.0   4.0  20.671999   19.648001    44.447999
95         64.0        128.0   8.0  20.479999   19.648001    45.504000
96         64.0        256.0   1.0  20.736000   19.424001    45.536000
97         64.0        256.0   2.0  20.832000   19.711999    45.855999
98         64.0        256.0   4.0  20.896001   20.128001    47.231998
99         64.0        256.0   8.0  21.088000   20.640001    47.072001
100       128.0          8.0   1.0  17.856000   16.543999    43.008000
101       128.0          8.0   2.0  20.256000   18.816000    43.040000
102       128.0          8.0   4.0  24.383999   22.784000    42.752001
103       128.0          8.0   8.0  19.360000   17.664000    46.432000
104       128.0         32.0   1.0  20.768000   19.648001    42.847998
105       128.0         32.0   2.0  22.816001   21.695999    43.776002
106       128.0         32.0   4.0  26.272001   25.280001    43.152001
107       128.0         32.0   8.0  19.711999   17.952001    44.608001
108       128.0         64.0   1.0  22.879999   21.952000    43.968000
109       128.0         64.0   2.0  24.000000   22.816001    43.903999
110       128.0         64.0   4.0  26.335999   26.303999    43.040000
111       128.0         64.0   8.0  20.447999   19.040000    44.895999
112       128.0        128.0   1.0  20.416001   19.231999    45.216002
113       128.0        128.0   2.0  20.671999   19.616000    46.144001
114       128.0        128.0   4.0  20.479999   19.584000    44.000000
115       128.0        128.0   8.0  20.447999   19.568000    44.319998
116       128.0        256.0   1.0  20.848000   19.520000    45.504000
117       128.0        256.0   2.0  20.832000   20.256000    46.783999
118       128.0        256.0   4.0  21.088000   20.864001    46.944000
119       128.0        256.0   8.0  21.248000   21.152001    47.231998
120       256.0          8.0   1.0  20.384001   18.848000    42.975999
121       256.0          8.0   2.0  24.351999   22.784000    43.792000
122       256.0          8.0   4.0  19.392001   17.600000    45.919999
123       256.0          8.0   8.0  19.455999   17.920000    69.760002
124       256.0         32.0   1.0  22.816001   21.728000    42.463999
125       256.0         32.0   2.0  26.400000   25.280001    42.943999
126       256.0         32.0   4.0  19.743999   17.952001    42.720001
127       256.0         32.0   8.0  20.320000   18.719999    41.536000
128       256.0         64.0   1.0  23.808001   23.167999    42.560000
129       256.0         64.0   2.0  26.400000   26.303999    43.391999
130       256.0         64.0   4.0  20.479999   19.072000    41.312002
131       256.0         64.0   8.0  20.927999   19.552000    42.431999
132       256.0        128.0   1.0  20.703999   19.648001    42.943999
133       256.0        128.0   2.0  20.512000   19.584000    43.935999
134       256.0        128.0   4.0  20.447999   19.648001    43.232001
135       256.0        128.0   8.0  21.280000   20.384001    43.072000
136       256.0        256.0   1.0  20.992000   20.160001    47.040001
137       256.0        256.0   2.0  21.024000   20.927999    47.040001
138       256.0        256.0   4.0  21.232000   21.088000    47.488000
139       256.0        256.0   8.0  21.472000   21.536000    50.271999
140       512.0          8.0   1.0  24.320001   22.752000    45.871999
141       512.0          8.0   2.0  19.424001   17.632000    46.335999
142       512.0          8.0   4.0  19.455999   17.824000    69.696002
143       512.0          8.0   8.0  21.152001   19.520000   114.367999
144       512.0         32.0   1.0  26.303999   25.248000    43.903999
145       512.0         32.0   2.0  19.680001   17.983999    42.911999
146       512.0         32.0   4.0  20.336000   18.672001    43.664001
147       512.0         32.0   8.0  21.663999   20.160001    49.056001
148       512.0         64.0   1.0  26.400000   26.303999    43.648001
149       512.0         64.0   2.0  20.512000   19.104000    44.287998
150       512.0         64.0   4.0  20.927999   19.616000    42.879999
151       512.0         64.0   8.0  22.528000   21.056000    43.616001
152       512.0        128.0   1.0  20.512000   19.632000    46.432000
153       512.0        128.0   2.0  20.479999   19.648001    43.200001
154       512.0        128.0   4.0  21.248000   20.384001    44.000000
155       512.0        128.0   8.0  22.272000   21.344000    44.256002
156       512.0        256.0   1.0  21.120001   20.896001    47.616001
157       512.0        256.0   2.0  21.215999   21.120001    47.584001
158       512.0        256.0   4.0  21.504000   21.536000    50.271999
159       512.0        256.0   8.0  23.200000   23.135999    55.008002
160      1024.0          8.0   1.0  19.455999   17.600000    46.496000
161      1024.0          8.0   2.0  19.455999   17.888000    69.696002
162      1024.0          8.0   4.0  21.183999   19.552000   114.432000
163      1024.0          8.0   8.0  23.232000   21.792000   203.424007
164      1024.0         32.0   1.0  19.680001   18.048000    44.160001
165      1024.0         32.0   2.0  20.352000   18.719999    43.536000
166      1024.0         32.0   4.0  21.632001   20.032000    49.040001
167      1024.0         32.0   8.0  24.464000   23.104001    73.536001
168      1024.0         64.0   1.0  20.447999   18.912001    44.128001
169      1024.0         64.0   2.0  20.959999   19.584000    43.616001
170      1024.0         64.0   4.0  22.560000   21.056000    44.192001
171      1024.0         64.0   8.0  25.024001   23.936000    53.151999
172      1024.0        128.0   1.0  20.447999   19.584000    44.512000
173      1024.0        128.0   2.0  21.248000   20.352000    42.495999
174      1024.0        128.0   4.0  22.272000   21.344000    43.903999
175      1024.0        128.0   8.0  24.480000   23.584001    48.351999
176      1024.0        256.0   1.0  21.215999   21.120001    47.711998
177      1024.0        256.0   2.0  21.504000   21.536000    50.112002
178      1024.0        256.0   4.0  23.135999   23.200000    54.880001
179      1024.0        256.0   8.0  25.472000   25.408000    60.704000
180      2048.0          8.0   1.0  19.455999   17.983999    69.824003
181      2048.0          8.0   2.0  21.168000   19.487999   114.560001
182      2048.0          8.0   4.0  23.200000   21.792000   203.840002
183      2048.0          8.0   8.0  28.192000   27.104000   378.480002
184      2048.0         32.0   1.0  20.320000   18.784000    43.152001
185      2048.0         32.0   2.0  21.632001   20.000000    48.896000
186      2048.0         32.0   4.0  24.480000   23.216000    73.728003
187      2048.0         32.0   8.0  29.312000   28.384000   118.464001
188      2048.0         64.0   1.0  20.927999   19.584000    43.648001
189      2048.0         64.0   2.0  22.528000   21.088000    43.807998
190      2048.0         64.0   4.0  25.024001   23.871999    53.247999
191      2048.0         64.0   8.0  31.008000   30.272000    77.344000
192      2048.0        128.0   1.0  21.248000   20.352000    44.224001
193      2048.0        128.0   2.0  22.272000   21.376001    44.512000
194      2048.0        128.0   4.0  24.544001   23.456000    48.319999
195      2048.0        128.0   8.0  29.536000   28.864000    60.320001
196      2048.0        256.0   1.0  21.536000   21.536000    50.032001
197      2048.0        256.0   2.0  23.104001   23.135999    55.071998
198      2048.0        256.0   4.0  25.504000   25.456000    60.832001
199      2048.0        256.0   8.0  31.072000   31.104000    69.055997
200      4096.0          8.0   1.0  21.152001   19.487999   114.335999
201      4096.0          8.0   2.0  23.264000   21.824000   204.352006
202      4096.0          8.0   4.0  27.936000   27.104000   377.535999
203      4096.0          8.0   8.0  36.672000   36.864001   730.944008
204      4096.0         32.0   1.0  21.632001   20.160001    48.896000
205      4096.0         32.0   2.0  24.416000   23.135999    73.600002
206      4096.0         32.0   4.0  29.312000   28.416000   118.432000
207      4096.0         32.0   8.0  39.328001   39.423998   209.087998
208      4096.0         64.0   1.0  22.528000   21.152001    43.104000
209      4096.0         64.0   2.0  24.992000   23.936000    53.183999
210      4096.0         64.0   4.0  31.008000   30.208001    77.408001
211      4096.0         64.0   8.0  41.648000   42.016000   122.528002
212      4096.0        128.0   1.0  22.304000   21.392000    44.303998
213      4096.0        128.0   2.0  24.448000   23.456000    48.480000
214      4096.0        128.0   4.0  29.503999   28.960001    60.608000
215      4096.0        128.0   8.0  39.039999   39.391998    84.352002
216      4096.0        256.0   1.0  23.200000   23.200000    54.848000
217      4096.0        256.0   2.0  25.440000   25.440000    60.416002
218      4096.0        256.0   4.0  31.040000   31.040000    68.960004
219      4096.0        256.0   8.0  37.055999   37.951998    82.176000
220      8192.0          8.0   1.0  23.248000   21.728000   204.799995
221      8192.0          8.0   2.0  27.968001   27.071999   377.855986
222      8192.0          8.0   4.0  36.672000   36.768001   728.160024
223      8192.0          8.0   8.0  54.976001   57.312001  1443.519950
224      8192.0         32.0   1.0  24.480000   23.104001    73.568001
225      8192.0         32.0   2.0  29.312000   28.352000   118.368000
226      8192.0         32.0   4.0  39.423998   39.487999   209.296003
227      8192.0         32.0   8.0  58.944002   61.280001   382.111996
228      8192.0         64.0   1.0  24.992000   23.903999    53.151999
229      8192.0         64.0   2.0  31.008000   30.239999    77.376001
230      8192.0         64.0   4.0  41.632000   42.016000   122.368000
231      8192.0         64.0   8.0  63.936003   66.303998   215.039998
232      8192.0        128.0   1.0  24.480000   23.520000    48.191998
233      8192.0        128.0   2.0  29.536000   28.960001    60.256001
234      8192.0        128.0   4.0  39.007999   39.487999    84.128000
235      8192.0        128.0   8.0  57.503998   60.192000   131.807998
236      8192.0        256.0   1.0  25.488000   25.440000    60.896002
237      8192.0        256.0   2.0  31.040000   31.008000    68.991996
238      8192.0        256.0   4.0  36.959998   37.792001    82.208000
239      8192.0        256.0   8.0  53.888001   56.800000   106.080003
$python ./sgl-kernel/tests/test_moe_align.py
========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/root/luoyuan.luo/sglang/sgl-kernel
configfile: pyproject.toml
plugins: hypothesis-6.135.10, anyio-4.8.0
collected 4368 items                                                                                                                                                                                                                  

sgl-kernel/tests/test_moe_align.py ............................................................................................................................................................................................ [  4%]
............................................................................................................................................................................................................................... [  9%]
............................................................................................................................................................................................................................... [ 14%]
............................................................................................................................................................................................................................... [ 19%]
............................................................................................................................................................................................................................... [ 24%]
............................................................................................................................................................................................................................... [ 29%]
............................................................................................................................................................................................................................... [ 34%]
............................................................................................................................................................................................................................... [ 40%]
............................................................................................................................................................................................................................... [ 45%]
............................................................................................................................................................................................................................... [ 50%]
............................................................................................................................................................................................................................... [ 55%]
............................................................................................................................................................................................................................... [ 60%]
............................................................................................................................................................................................................................... [ 65%]
............................................................................................................................................................................................................................... [ 70%]
............................................................................................................................................................................................................................... [ 75%]
............................................................................................................................................................................................................................... [ 80%]
............................................................................................................................................................................................................................... [ 85%]
............................................................................................................................................................................................................................... [ 91%]
............................................................................................................................................................................................................................... [ 96%]
......................................................................................................................................................................                                                          [100%]

======================================================================================================== 4368 passed in 9.74s =========================================================================================================

Modifications

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the performance of the moe_align_block_size_kernel by introducing a more optimized block and warp scan algorithm. The primary intent is to accelerate the fused Mixture-of-Experts (MoE) path, especially for configurations involving a large number of experts. The implementation shifts from a traditional Blelloch scan to a two-level exclusive prefix sum, which has resulted in measurable speedups according to the included benchmarks.

Highlights

  • Kernel Refinement: Refactored the moe_align_block_size_kernel to replace the previous Blelloch scan with a 2-level block and warp scan approach for calculating padded token counts and prefix sums, leading to improved performance.
  • Resource Management: Adjusted shared memory allocation for the moe_align_block_size_kernel to support the new scan mechanism and removed obsolete variables related to the prior implementation.
  • Benchmark Results: The changes demonstrate significant performance improvements for the MoE align block kernel, particularly for num_experts >= 128, as evidenced by the provided benchmark data showing faster 'SGL Fusion' times.
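
As a rough illustration of what these prefix sums feed (an editorial sketch with made-up numbers, not the kernel code): each expert's token count is rounded up to a multiple of block_size, and the running sum of the padded counts gives the per-expert block offsets.

// Host-side illustration with hypothetical counts: pad per-expert token counts to
// block_size multiples, then take a running sum to get per-expert block offsets.
#include <cstdio>

int main() {
  const int num_experts = 4, block_size = 8;
  int tokens_per_expert[num_experts] = {3, 17, 0, 9};   // hypothetical counts
  int cumsum[num_experts + 1] = {0};
  for (int e = 0; e < num_experts; ++e) {
    int padded = (tokens_per_expert[e] + block_size - 1) / block_size * block_size;
    cumsum[e + 1] = cumsum[e] + padded;                 // running sum of padded counts
  }
  for (int e = 0; e <= num_experts; ++e) printf("cumsum[%d] = %d\n", e, cumsum[e]);
  return 0;  // cumsum[num_experts] == 8 + 24 + 0 + 16 == 48 padded token slots
}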

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The code changes introduce a block/warp scan in the fused MoE path's moe_align_block_size_kernel, using warp-level and block-level prefix sums to improve performance. However, there are potential issues with out-of-bounds reads and incorrect calculations due to incomplete warps and blocks, which need to be addressed to ensure the correctness and stability of the code.

@BBuf changed the title from "Refine moe align block kernel" to "[kernel] opt moe align block kernel by block/warp scan algorithm" on Jul 9, 2025
__syncthreads();

// Write prefix[0..num_experts - 1] and cumsum
if (tid < num_experts) prefix[tid] = scan_buf[tid];
Collaborator

Do we still need to keep prefix?

Collaborator Author

We still need to keep prefix; it is used in the expert_ids fill stage to binary-search for the expert IDs.
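
For context, here is a minimal sketch (with assumed names and array shape, not the actual kernel code) of the kind of binary search the expert_ids fill can perform over the prefix array.

// Illustrative device helper: given an exclusive prefix array of padded per-expert
// offsets (prefix[0] == 0, prefix[num_experts] == total padded tokens), find the
// expert that owns the block starting at block_start_token.
__device__ int find_expert_for_block(const int* prefix, int num_experts, int block_start_token) {
  int lo = 0, hi = num_experts;              // answer lies in [lo, hi)
  while (lo + 1 < hi) {
    int mid = (lo + hi) >> 1;
    if (prefix[mid] <= block_start_token) lo = mid;
    else hi = mid;
  }
  return lo;                                 // prefix[lo] <= block_start_token < prefix[lo + 1]
}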

Collaborator

Can we use scan_buf for expert_ids search?

@ispobock
Collaborator

ispobock commented Jul 9, 2025

Could you also add end2end accuracy test for dsv3 or qwen3 models?

@yuan-luo
Collaborator Author

yuan-luo commented Jul 10, 2025

Could you also add end2end accuracy test for dsv3 or qwen3 models?

QWEN3:
Main:

$python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:10<00:00, 19.44it/s]
Accuracy: 0.720
Invalid: 0.000
Latency: 10.441 s
Output throughput: 4756.603 token/s

This PR:

$python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:11<00:00, 17.07it/s]
Accuracy: 0.725
Invalid: 0.000
Latency: 12.372 s
Output throughput: 4044.128 token/s

@ispobock
Collaborator

Why is the output throughput of this PR much lower than on the main branch? Could you try Qwen/Qwen3-235B-A22B-FP8?

@yuan-luo
Collaborator Author

yuan-luo commented Jul 11, 2025

$python3 -m sglang.launch_server --model /home/admin/Qwen3-30B-A3B --tp-size 8 --port 30000

I ran the benchmark again; the PR version does perform better.

Main:

[root  /home/root/luoyuan.luo/sglang] Fri Jul 11 11:01:10
$python ./sgl-kernel/benchmark/bench_moe_align_block_size.py
INFO 07-11 11:01:24 [__init__.py:244] Automatically detected platform cuda.
✅ VLLM implementation works with 64 experts!
✅ SGL and Triton implementations match
✅ SGL and VLLM implementations match

📊 Running performance benchmark for 64 experts...
moe-align-block-size-performance:
     num_tokens  num_experts  topk        SGL  SGL Fusion       Triton
0           1.0          8.0   1.0  16.511999   14.816000    46.560001
1           1.0          8.0   2.0  16.543999   14.832000    44.128001
2           1.0          8.0   4.0  16.543999   14.944000    45.632001
3           1.0          8.0   8.0  16.480001   14.976000    46.496000
4           1.0         32.0   1.0  19.168001   17.344000    46.367999
5           1.0         32.0   2.0  19.168001   17.376000    44.640001
6           1.0         32.0   4.0  19.104000   17.472001    45.472000
7           1.0         32.0   8.0  19.136000   17.535999    45.632001
8           1.0         64.0   1.0  22.464000   20.768000    45.823999
9           1.0         64.0   2.0  22.496000   20.896001    45.375999
10          1.0         64.0   4.0  22.720000   20.864001    45.504000
11          1.0         64.0   8.0  22.720000   20.896001    44.831999
12          1.0        128.0   1.0  20.384001   18.656000    46.976000
13          1.0        128.0   2.0  20.352000   18.560000    46.656001
14          1.0        128.0   4.0  20.384001   18.688001    47.104001
15          1.0        128.0   8.0  20.384001   18.656000    46.847999
16          1.0        256.0   1.0  20.447999   18.784000    47.040001
17          1.0        256.0   2.0  20.479999   18.751999    47.584001
18          1.0        256.0   4.0  20.512000   18.975999    45.728002
19          1.0        256.0   8.0  20.496000   19.007999    47.104001
20          8.0          8.0   1.0  16.480001   14.944000    47.072001
21          8.0          8.0   2.0  16.704001   15.264000    46.271998
22          8.0          8.0   4.0  16.576000   15.264000    46.512000
23          8.0          8.0   8.0  17.055999   15.744001    47.072001
24          8.0         32.0   1.0  19.200001   17.503999    46.335999
25          8.0         32.0   2.0  19.104000   17.600000    46.528000
26          8.0         32.0   4.0  19.007999   17.888000    47.871999
27          8.0         32.0   8.0  19.520000   18.464001    45.759998
28          8.0         64.0   1.0  22.464000   20.896001    45.823999
29          8.0         64.0   2.0  22.399999   20.927999    45.184001
30          8.0         64.0   4.0  22.688000   21.104001    45.407999
31          8.0         64.0   8.0  22.464000   21.183999    46.831999
32          8.0        128.0   1.0  20.368000   18.656000    46.432000
33          8.0        128.0   2.0  20.352000   18.704000    46.239998
34          8.0        128.0   4.0  20.416001   18.751999    47.648001
35          8.0        128.0   8.0  20.512000   19.072000    46.847999
36          8.0        256.0   1.0  20.479999   19.007999    47.072001
37          8.0        256.0   2.0  20.512000   19.007999    48.432000
38          8.0        256.0   4.0  20.560000   19.104000    47.136001
39          8.0        256.0   8.0  20.656000   19.296000    46.239998
40         16.0          8.0   1.0  16.704001   15.264000    46.112001
41         16.0          8.0   2.0  16.576000   15.232000    46.815999
42         16.0          8.0   4.0  17.088000   15.712000    45.440000
43         16.0          8.0   8.0  18.080000   16.736001    47.040001
44         16.0         32.0   1.0  19.231999   17.696001    46.624001
45         16.0         32.0   2.0  19.231999   17.759999    46.560001
46         16.0         32.0   4.0  19.552000   18.495999    45.120001
47         16.0         32.0   8.0  20.608000   19.552000    46.016000
48         16.0         64.0   1.0  22.399999   20.959999    47.343999
49         16.0         64.0   2.0  22.624001   21.088000    47.616001
50         16.0         64.0   4.0  22.560000   21.312000    46.496000
51         16.0         64.0   8.0  22.944000   22.143999    46.399999
52         16.0        128.0   1.0  20.352000   18.688001    46.271998
53         16.0        128.0   2.0  20.512000   18.880000    46.432000
54         16.0        128.0   4.0  20.479999   18.944001    46.367999
55         16.0        128.0   8.0  20.416001   19.264000    45.696001
56         16.0        256.0   1.0  20.544000   19.040000    46.688002
57         16.0        256.0   2.0  20.576000   19.104000    47.104001
58         16.0        256.0   4.0  20.544000   19.328000    46.720002
59         16.0        256.0   8.0  20.576000   19.584000    47.871999
60         32.0          8.0   1.0  16.576000   15.280000    46.239998
61         32.0          8.0   2.0  17.088000   15.712000    47.168002
62         32.0          8.0   4.0  18.144000   16.767999    46.656001
63         32.0          8.0   8.0  20.128001   18.719999    46.367999
64         32.0         32.0   1.0  19.264000   17.856000    46.399999
65         32.0         32.0   2.0  19.552000   18.464001    46.208002
66         32.0         32.0   4.0  20.640001   19.552000    46.399999
67         32.0         32.0   8.0  22.560000   21.504000    45.520000
68         32.0         64.0   1.0  22.431999   21.183999    44.960000
69         32.0         64.0   2.0  22.688000   21.344000    45.248002
70         32.0         64.0   4.0  22.944000   22.143999    45.311999
71         32.0         64.0   8.0  23.903999   23.167999    44.351999
72         32.0        128.0   1.0  20.479999   18.816000    45.056000
73         32.0        128.0   2.0  20.479999   19.040000    45.536000
74         32.0        128.0   4.0  20.416001   19.168001    45.375999
75         32.0        128.0   8.0  20.640001   19.455999    47.392000
76         32.0        256.0   1.0  20.576000   19.007999    47.424000
77         32.0        256.0   2.0  20.544000   19.231999    47.168002
78         32.0        256.0   4.0  20.608000   19.520000    47.648001
79         32.0        256.0   8.0  20.896001   20.191999    47.168002
80         64.0          8.0   1.0  17.152000   15.680000    46.016000
81         64.0          8.0   2.0  18.144000   16.767999    47.200002
82         64.0          8.0   4.0  20.032000   18.688001    45.696001
83         64.0          8.0   8.0  24.512000   22.879999    47.648001
84         64.0         32.0   1.0  19.711999   18.464001    46.464000
85         64.0         32.0   2.0  20.576000   19.552000    46.688002
86         64.0         32.0   4.0  22.655999   21.504000    45.903999
87         64.0         32.0   8.0  26.559999   25.567999    46.335999
88         64.0         64.0   1.0  22.655999   21.344000    46.271998
89         64.0         64.0   2.0  22.944000   22.112001    46.016000
90         64.0         64.0   4.0  23.903999   23.167999    46.335999
91         64.0         64.0   8.0  26.464000   25.936000    44.927999
92         64.0        128.0   1.0  20.479999   18.944001    45.696001
93         64.0        128.0   2.0  20.447999   19.296000    46.751998
94         64.0        128.0   4.0  20.640001   19.455999    45.311999
95         64.0        128.0   8.0  20.640001   19.680001    45.343999
96         64.0        256.0   1.0  20.576000   19.264000    46.399999
97         64.0        256.0   2.0  20.640001   19.520000    45.887999
98         64.0        256.0   4.0  20.768000   20.128001    46.335999
99         64.0        256.0   8.0  21.056000   20.768000    47.231998
100       128.0          8.0   1.0  18.112000   16.799999    45.823999
101       128.0          8.0   2.0  20.032000   18.688001    45.311999
102       128.0          8.0   4.0  24.480000   22.911999    44.480000
103       128.0          8.0   8.0  19.552000   17.792000    46.271998
104       128.0         32.0   1.0  20.608000   19.520000    46.016000
105       128.0         32.0   2.0  22.655999   21.504000    45.440000
106       128.0         32.0   4.0  26.528001   25.536001    43.839999
107       128.0         32.0   8.0  19.648001   17.824000    46.176001
108       128.0         64.0   1.0  22.944000   22.143999    47.072001
109       128.0         64.0   2.0  23.871999   23.167999    45.952000
110       128.0         64.0   4.0  26.432000   26.016001    45.855999
111       128.0         64.0   8.0  20.447999   19.136000    44.576000
112       128.0        128.0   1.0  20.447999   19.136000    46.335999
113       128.0        128.0   2.0  20.608000   19.520000    47.936000
114       128.0        128.0   4.0  20.608000   19.680001    47.120001
115       128.0        128.0   8.0  20.416001   19.487999    46.351999
116       128.0        256.0   1.0  20.544000   19.616000    47.392000
117       128.0        256.0   2.0  20.752000   20.128001    47.008000
118       128.0        256.0   4.0  21.136001   20.864001    48.223998
119       128.0        256.0   8.0  21.312000   21.056000    47.871999
120       256.0          8.0   1.0  19.967999   18.656000    46.688002
121       256.0          8.0   2.0  24.448000   22.879999    46.367999
122       256.0          8.0   4.0  19.552000   17.616000    47.231998
123       256.0          8.0   8.0  19.552000   17.983999    69.472000
124       256.0         32.0   1.0  22.528000   21.552000    46.399999
125       256.0         32.0   2.0  26.496001   25.504000    45.568001
126       256.0         32.0   4.0  19.648001   17.824000    47.200002
127       256.0         32.0   8.0  20.447999   18.640000    44.096000
128       256.0         64.0   1.0  24.000000   23.135999    44.895999
129       256.0         64.0   2.0  26.208000   26.016001    45.440000
130       256.0         64.0   4.0  20.479999   19.104000    45.664001
131       256.0         64.0   8.0  20.927999   19.648001    46.287999
132       256.0        128.0   1.0  20.640001   19.455999    44.736002
133       256.0        128.0   2.0  20.608000   19.680001    46.239998
134       256.0        128.0   4.0  20.352000   19.487999    46.432000
135       256.0        128.0   8.0  21.248000   20.288000    45.504000
136       256.0        256.0   1.0  20.736000   20.191999    46.879999
137       256.0        256.0   2.0  21.056000   20.864001    47.904000
138       256.0        256.0   4.0  21.280000   21.088000    47.488000
139       256.0        256.0   8.0  21.439999   21.536000    50.016001
140       512.0          8.0   1.0  24.544001   22.879999    48.064001
141       512.0          8.0   2.0  19.552000   17.632000    47.424000
142       512.0          8.0   4.0  19.552000   18.015999    69.263998
143       512.0          8.0   8.0  21.152001   19.520000   115.263999
144       512.0         32.0   1.0  26.528001   25.504000    45.248002
145       512.0         32.0   2.0  19.616000   17.888000    47.359999
146       512.0         32.0   4.0  20.447999   18.848000    46.496000
147       512.0         32.0   8.0  21.632001   20.160001    48.608001
148       512.0         64.0   1.0  26.464000   25.984000    46.656001
149       512.0         64.0   2.0  20.479999   18.944001    44.415999
150       512.0         64.0   4.0  20.927999   19.520000    44.544000
151       512.0         64.0   8.0  22.624001   21.072000    46.016000
152       512.0        128.0   1.0  20.656000   19.840000    46.048000
153       512.0        128.0   2.0  20.352000   19.520000    45.120001
154       512.0        128.0   4.0  21.248000   20.191999    44.064000
155       512.0        128.0   8.0  22.464000   21.600001    46.112001
156       512.0        256.0   1.0  21.152001   20.896001    47.584001
157       512.0        256.0   2.0  21.280000   21.248000    48.544001
158       512.0        256.0   4.0  21.472000   21.568000    50.352000
159       512.0        256.0   8.0  23.072001   23.040000    54.016002
160      1024.0          8.0   1.0  19.552000   17.759999    46.912000
161      1024.0          8.0   2.0  19.552000   17.952001    69.215998
162      1024.0          8.0   4.0  21.183999   19.520000   115.247998
163      1024.0          8.0   8.0  23.232000   21.760000   205.791995
164      1024.0         32.0   1.0  19.648001   17.983999    47.263999
165      1024.0         32.0   2.0  20.416001   18.864000    45.248002
166      1024.0         32.0   4.0  21.632001   20.064000    63.072000
167      1024.0         32.0   8.0  24.448000   23.008000    73.119998
168      1024.0         64.0   1.0  20.512000   19.104000    70.784003
169      1024.0         64.0   2.0  20.959999   19.648001    47.552001
170      1024.0         64.0   4.0  22.624001   21.280000    47.775999
171      1024.0         64.0   8.0  25.024001   23.887999    53.312000
172      1024.0        128.0   1.0  20.352000   19.455999    46.080001
173      1024.0        128.0   2.0  21.264000   20.288000    47.936000
174      1024.0        128.0   4.0  22.431999   21.504000    46.271998
175      1024.0        128.0   8.0  24.544001   23.552001    48.560001
176      1024.0        256.0   1.0  21.280000   21.120001    47.711998
177      1024.0        256.0   2.0  21.472000   21.504000    50.239999
178      1024.0        256.0   4.0  23.040000   22.944000    54.143999
179      1024.0        256.0   8.0  25.536001   25.536001    61.152000
180      2048.0          8.0   1.0  19.552000   17.952001    69.311999
181      2048.0          8.0   2.0  21.183999   19.360000   115.039997
182      2048.0          8.0   4.0  23.264000   21.744000   206.208006
183      2048.0          8.0   8.0  27.968001   27.008001   379.487991
184      2048.0         32.0   1.0  20.400001   18.832000    45.791999
185      2048.0         32.0   2.0  21.632001   19.967999    48.576001
186      2048.0         32.0   4.0  24.448000   23.167999    73.055997
187      2048.0         32.0   8.0  29.216001   28.160000   118.880004
188      2048.0         64.0   1.0  20.927999   19.584000    45.855999
189      2048.0         64.0   2.0  22.624001   21.088000    46.128001
190      2048.0         64.0   4.0  24.992000   23.968000    53.087998
191      2048.0         64.0   8.0  31.008000   30.112000    77.376001
192      2048.0        128.0   1.0  21.215999   20.320000    46.815999
193      2048.0        128.0   2.0  22.464000   21.600001    46.303999
194      2048.0        128.0   4.0  24.480000   23.424000    48.640002
195      2048.0        128.0   8.0  29.472001   28.832000    60.479999
196      2048.0        256.0   1.0  21.472000   21.536000    50.496001
197      2048.0        256.0   2.0  23.072001   22.944000    54.591998
198      2048.0        256.0   4.0  25.504000   25.536001    61.184000
199      2048.0        256.0   8.0  31.072000   31.104000    69.007996
200      4096.0          8.0   1.0  21.183999   19.360000   115.199998
201      4096.0          8.0   2.0  23.232000   21.792000   206.432000
202      4096.0          8.0   4.0  27.968001   26.944000   380.127996
203      4096.0          8.0   8.0  36.704000   36.672000   733.951986
204      4096.0         32.0   1.0  21.616001   20.000000    49.088001
205      4096.0         32.0   2.0  24.448000   23.167999    72.927997
206      4096.0         32.0   4.0  29.216001   28.192000   118.496001
207      4096.0         32.0   8.0  39.455999   39.360002   210.048005
208      4096.0         64.0   1.0  22.624001   21.152001    44.480000
209      4096.0         64.0   2.0  24.992000   23.871999    53.312000
210      4096.0         64.0   4.0  31.024000   30.144000    77.312000
211      4096.0         64.0   8.0  41.568000   41.792002   122.496001
212      4096.0        128.0   1.0  22.464000   21.472000    45.759998
213      4096.0        128.0   2.0  24.480000   23.600000    48.767999
214      4096.0        128.0   4.0  29.503999   28.991999    60.416002
215      4096.0        128.0   8.0  38.943999   39.360002    84.799998
216      4096.0        256.0   1.0  23.072001   23.040000    54.175999
217      4096.0        256.0   2.0  25.504000   25.536001    61.120000
218      4096.0        256.0   4.0  31.104000   30.960000    69.183998
219      4096.0        256.0   8.0  36.959998   37.951998    82.687996
220      8192.0          8.0   1.0  23.312001   21.856001   206.272006
221      8192.0          8.0   2.0  27.936000   26.912000   379.040003
222      8192.0          8.0   4.0  36.591999   36.768001   733.951986
223      8192.0          8.0   8.0  54.976001   57.280000  1433.135986
224      8192.0         32.0   1.0  24.383999   22.944000    73.023997
225      8192.0         32.0   2.0  29.216001   28.224001   118.752003
226      8192.0         32.0   4.0  39.391998   39.360002   209.984004
227      8192.0         32.0   8.0  58.880001   61.216000   382.272005
228      8192.0         64.0   1.0  24.960000   23.903999    53.344000
229      8192.0         64.0   2.0  30.975999   30.208001    77.087998
230      8192.0         64.0   4.0  41.664001   41.887999   122.528002
231      8192.0         64.0   8.0  63.648000   66.047996   214.080006
232      8192.0        128.0   1.0  24.576001   23.440000    49.279999
233      8192.0        128.0   2.0  29.472001   28.991999    60.416002
234      8192.0        128.0   4.0  38.975999   39.391998    84.608003
235      8192.0        128.0   8.0  57.312001   59.935998   131.679997
236      8192.0        256.0   1.0  25.472000   25.536001    60.991999
237      8192.0        256.0   2.0  31.136001   31.296000    69.151998
238      8192.0        256.0   4.0  36.959998   37.856001    82.624003
239      8192.0        256.0   8.0  53.727999   56.800000   106.271997

This PR:

[root  /home/root/luoyuan.luo/sglang] Fri Jul 11 11:42:48 
$python ./sgl-kernel/benchmark/bench_moe_align_block_size.py
INFO 07-11 11:43:05 [__init__.py:244] Automatically detected platform cuda.
✅ VLLM implementation works with 64 experts!
✅ SGL and Triton implementations match
✅ SGL and VLLM implementations match

📊 Running performance benchmark for 64 experts...
moe-align-block-size-performance:
     num_tokens  num_experts  topk        SGL  SGL Fusion       Triton
0           1.0          8.0   1.0  16.224001   14.464000    44.992000
1           1.0          8.0   2.0  16.319999   14.432000    45.855999
2           1.0          8.0   4.0  16.319999   14.464000    47.008000
3           1.0          8.0   8.0  16.192000   14.560000    45.791999
4           1.0         32.0   1.0  18.080000   16.448000    45.472000
5           1.0         32.0   2.0  18.112000   16.448000    46.208002
6           1.0         32.0   4.0  18.112000   16.480001    45.311999
7           1.0         32.0   8.0  18.096000   16.608000    45.823999
8           1.0         64.0   1.0  21.919999   20.256000    47.200002
9           1.0         64.0   2.0  22.112001   20.288000    48.960000
10          1.0         64.0   4.0  22.080000   20.384001    46.432000
11          1.0         64.0   8.0  22.016000   20.400001    46.144001
12          1.0        128.0   1.0  18.176001   16.767999    47.327999
13          1.0        128.0   2.0  18.239999   16.704001    47.263999
14          1.0        128.0   4.0  18.208001   16.704001    47.936000
15          1.0        128.0   8.0  18.239999   16.736001    47.136001
16          1.0        256.0   1.0  18.560000   16.864000    47.648001
17          1.0        256.0   2.0  18.560000   16.992001    46.815999
18          1.0        256.0   4.0  18.592000   16.992001    46.656001
19          1.0        256.0   8.0  18.528000   16.896000    47.200002
20          8.0          8.0   1.0  16.192000   14.592000    47.888000
21          8.0          8.0   2.0  16.192000   14.720000    46.847999
22          8.0          8.0   4.0  15.984000   14.720000    48.416000
23          8.0          8.0   8.0  16.576000   15.072000    47.136001
24          8.0         32.0   1.0  18.176001   16.608000    47.904000
25          8.0         32.0   2.0  18.176001   16.767999    47.648001
26          8.0         32.0   4.0  17.952001   16.832000    48.672002
27          8.0         32.0   8.0  18.271999   17.376000    46.976000
28          8.0         64.0   1.0  21.919999   20.416001    47.648001
29          8.0         64.0   2.0  22.016000   20.384001    47.040001
30          8.0         64.0   4.0  22.080000   20.544000    46.512000
31          8.0         64.0   8.0  21.856001   20.703999    49.152002
32          8.0        128.0   1.0  18.176001   16.704001    47.200002
33          8.0        128.0   2.0  18.208001   16.767999    47.200002
34          8.0        128.0   4.0  18.239999   16.864000    48.400000
35          8.0        128.0   8.0  18.271999   17.023999    47.263999
36          8.0        256.0   1.0  18.495999   17.088000    45.791999
37          8.0        256.0   2.0  18.528000   16.992001    48.191998
38          8.0        256.0   4.0  18.592000   17.184000    46.560001
39          8.0        256.0   8.0  18.592000   17.376000    46.271998
40         16.0          8.0   1.0  16.224001   14.720000    46.432000
41         16.0          8.0   2.0  16.192000   14.688000    46.208002
42         16.0          8.0   4.0  16.608000   15.072000    46.080001
43         16.0          8.0   8.0  17.312000   15.904000    45.759998
44         16.0         32.0   1.0  18.176001   16.704001    46.208002
45         16.0         32.0   2.0  17.952001   16.928000    46.688002
46         16.0         32.0   4.0  18.528000   17.408000    44.767998
47         16.0         32.0   8.0  19.264000   18.271999    45.504000
48         16.0         64.0   1.0  22.048000   20.447999    47.136001
49         16.0         64.0   2.0  21.888001   20.576000    48.128001
50         16.0         64.0   4.0  22.016000   20.736000    46.879999
51         16.0         64.0   8.0  22.431999   21.312000    46.847999
52         16.0        128.0   1.0  18.176001   16.672000    46.976000
53         16.0        128.0   2.0  18.176001   16.799999    47.263999
54         16.0        128.0   4.0  18.271999   17.023999    47.136001
55         16.0        128.0   8.0  18.368000   17.344000    48.096001
56         16.0        256.0   1.0  18.528000   16.992001    47.104001
57         16.0        256.0   2.0  18.608000   17.216001    46.976000
58         16.0        256.0   4.0  18.624000   17.408000    47.295999
59         16.0        256.0   8.0  18.719999   17.759999    48.223998
60         32.0          8.0   1.0  16.063999   14.656000    47.616001
61         32.0          8.0   2.0  16.656000   15.072000    46.912000
62         32.0          8.0   4.0  17.344000   15.904000    46.335999
63         32.0          8.0   8.0  18.880000   17.440001    46.239998
64         32.0         32.0   1.0  18.144000   16.767999    46.239998
65         32.0         32.0   2.0  18.560000   17.344000    46.144001
66         32.0         32.0   4.0  19.104000   18.239999    46.528000
67         32.0         32.0   8.0  20.896001   19.920001    47.040001
68         32.0         64.0   1.0  22.048000   20.544000    46.496000
69         32.0         64.0   2.0  21.984000   20.703999    45.504000
70         32.0         64.0   4.0  22.399999   21.407999    46.560001
71         32.0         64.0   8.0  23.232000   22.528000    46.783999
72         32.0        128.0   1.0  18.239999   16.767999    46.432000
73         32.0        128.0   2.0  18.368000   16.976001    45.984000
74         32.0        128.0   4.0  18.336000   17.344000    47.104001
75         32.0        128.0   8.0  18.400000   17.376000    47.552001
76         32.0        256.0   1.0  18.560000   17.152000    47.295999
77         32.0        256.0   2.0  18.719999   17.376000    46.656001
78         32.0        256.0   4.0  18.688001   17.664000    48.512001
79         32.0        256.0   8.0  18.848000   18.176001   133.648001
80         64.0          8.0   1.0  16.576000   15.072000    46.592001
81         64.0          8.0   2.0  17.312000   15.936000    46.656001
82         64.0          8.0   4.0  18.784000   17.472001    46.271998
83         64.0          8.0   8.0  22.256000   20.864001    46.399999
84         64.0         32.0   1.0  18.336000   17.376000    47.168002
85         64.0         32.0   2.0  19.104000   18.208001    48.000000
86         64.0         32.0   4.0  20.896001   19.904001    62.656000
87         64.0         32.0   8.0  24.480000   23.647999    47.871999
88         64.0         64.0   1.0  21.888001   20.736000    48.319999
89         64.0         64.0   2.0  22.415999   21.376001    47.920000
90         64.0         64.0   4.0  23.264000   22.592001    48.544001
91         64.0         64.0   8.0  24.704000   24.448000    48.096001
92         64.0        128.0   1.0  18.239999   17.055999    47.839999
93         64.0        128.0   2.0  18.336000   17.344000    48.351999
94         64.0        128.0   4.0  18.432001   17.472001    48.351999
95         64.0        128.0   8.0  18.592000   17.888000    48.128001
96         64.0        256.0   1.0  18.656000   17.312000    48.544001
97         64.0        256.0   2.0  18.719999   17.728001    47.584001
98         64.0        256.0   4.0  18.816000   18.208001    47.680002
99         64.0        256.0   8.0  18.560000   18.400000    48.032001
100       128.0          8.0   1.0  17.376000   15.936000    48.191998
101       128.0          8.0   2.0  18.848000   17.440001    47.775999
102       128.0          8.0   4.0  22.208000   20.896001    47.711998
103       128.0          8.0   8.0  18.048000   16.608000    45.952000
104       128.0         32.0   1.0  19.104000   18.208001    47.424000
105       128.0         32.0   2.0  20.832000   19.856000    48.064001
106       128.0         32.0   4.0  24.639999   23.680000    47.807999
107       128.0         32.0   8.0  18.528000   17.088000    48.096001
108       128.0         64.0   1.0  22.399999   21.407999    49.279999
109       128.0         64.0   2.0  23.264000   22.624001    48.831999
110       128.0         64.0   4.0  24.752000   24.512000    47.168002
111       128.0         64.0   8.0  18.304000   17.184000    47.552001
112       128.0        128.0   1.0  18.304000   17.312000    47.904000
113       128.0        128.0   2.0  18.432001   17.535999    45.759998
114       128.0        128.0   4.0  18.560000   17.856000    46.624001
115       128.0        128.0   8.0  18.784000   18.080000    47.231998
116       128.0        256.0   1.0  18.719999   17.728001    46.112001
117       128.0        256.0   2.0  18.880000   18.144000    47.488000
118       128.0        256.0   4.0  18.592000   18.495999    46.399999
119       128.0        256.0   8.0  18.719999   18.656000    47.616001
120       256.0          8.0   1.0  18.816000   17.503999    47.488000
121       256.0          8.0   2.0  22.304000   20.864001    47.584001
122       256.0          8.0   4.0  18.048000   16.608000    47.104001
123       256.0          8.0   8.0  19.264000   17.552000    67.520000
124       256.0         32.0   1.0  20.927999   19.872000    45.871999
125       256.0         32.0   2.0  24.480000   23.615999    47.584001
126       256.0         32.0   4.0  18.528000   17.136000    46.144001
127       256.0         32.0   8.0  19.231999   17.535999    46.528000
128       256.0         64.0   1.0  23.296000   22.592001    47.263999
129       256.0         64.0   2.0  24.800001   24.480000    46.399999
130       256.0         64.0   4.0  18.304000   17.184000    46.112001
131       256.0         64.0   8.0  19.424001   18.176001    47.871999
132       256.0        128.0   1.0  18.400000   17.503999    47.552001
133       256.0        128.0   2.0  18.624000   17.888000    47.424000
134       256.0        128.0   4.0  18.848000   17.983999    48.416000
135       256.0        128.0   8.0  19.168001   18.432001    47.680002
136       256.0        256.0   1.0  18.816000   18.239999    47.327999
137       256.0        256.0   2.0  18.528000   18.432001    47.072001
138       256.0        256.0   4.0  18.719999   18.688001    46.815999
139       256.0        256.0   8.0  19.648001   19.584000    49.279999
140       512.0          8.0   1.0  22.304000   20.896001    47.136001
141       512.0          8.0   2.0  18.048000   16.511999    46.736000
142       512.0          8.0   4.0  19.296000   17.600000    67.680001
143       512.0          8.0   8.0  19.808000   18.144000   113.664001
144       512.0         32.0   1.0  24.512000   23.647999    47.536001
145       512.0         32.0   2.0  18.495999   16.864000    47.839999
146       512.0         32.0   4.0  19.231999   17.535999    45.807999
147       512.0         32.0   8.0  20.447999   19.136000    48.608001
148       512.0         64.0   1.0  24.752000   24.480000    47.839999
149       512.0         64.0   2.0  18.336000   16.976001    46.624001
150       512.0         64.0   4.0  19.424001   18.176001    46.720002
151       512.0         64.0   8.0  20.608000   19.424001    47.552001
152       512.0        128.0   1.0  18.592000   17.856000    45.791999
153       512.0        128.0   2.0  18.848000   17.983999    47.552001
154       512.0        128.0   4.0  19.168001   18.464001    47.008000
155       512.0        128.0   8.0  20.544000   19.776000    47.088001
156       512.0        256.0   1.0  18.560000   18.336000    46.432000
157       512.0        256.0   2.0  18.719999   18.864000    48.191998
158       512.0        256.0   4.0  19.648001   19.743999    49.088001
159       512.0        256.0   8.0  20.768000   20.832000    51.904000
160      1024.0          8.0   1.0  18.048000   16.608000    47.392000
161      1024.0          8.0   2.0  19.264000   17.376000    67.680001
162      1024.0          8.0   4.0  19.711999   18.176001   113.504000
163      1024.0          8.0   8.0  22.496000   21.120001   203.520000
164      1024.0         32.0   1.0  18.544000   16.832000    46.496000
165      1024.0         32.0   2.0  19.200001   17.535999    47.904000
166      1024.0         32.0   4.0  20.463999   19.072000    48.319999
167      1024.0         32.0   8.0  22.752000   21.600001    71.392000
168      1024.0         64.0   1.0  18.304000   17.184000    46.720002
169      1024.0         64.0   2.0  19.424001   18.304000    46.879999
170      1024.0         64.0   4.0  20.576000   19.360000    48.864000
171      1024.0         64.0   8.0  23.808001   22.655999    51.743999
172      1024.0        128.0   1.0  18.880000   17.983999    47.456000
173      1024.0        128.0   2.0  19.168001   18.400000    48.160002
174      1024.0        128.0   4.0  20.576000   19.680001    48.351999
175      1024.0        128.0   8.0  23.040000   22.112001    48.351999
176      1024.0        256.0   1.0  18.719999   18.880000    48.416000
177      1024.0        256.0   2.0  19.648001   19.616000    48.480000
178      1024.0        256.0   4.0  20.816000   20.896001    51.552001
179      1024.0        256.0   8.0  23.680000   23.520000    58.591999
180      2048.0          8.0   1.0  19.231999   17.600000    67.648001
181      2048.0          8.0   2.0  19.743999   18.336000   113.792002
182      2048.0          8.0   4.0  22.496000   21.088000   203.615993
183      2048.0          8.0   8.0  26.848000   25.872000   374.736011
184      2048.0         32.0   1.0  19.168001   17.696001    47.424000
185      2048.0         32.0   2.0  20.479999   19.072000    66.367999
186      2048.0         32.0   4.0  22.784000   21.407999    71.263999
187      2048.0         32.0   8.0  28.096000   27.392000   117.311999
188      2048.0         64.0   1.0  19.424001   18.112000    47.743998
189      2048.0         64.0   2.0  20.624001   19.424001    46.783999
190      2048.0         64.0   4.0  23.776000   22.784000    51.824000
191      2048.0         64.0   8.0  28.928000   28.224001    74.848004
192      2048.0        128.0   1.0  19.136000   18.224000    45.952000
193      2048.0        128.0   2.0  20.608000   19.648001    47.520000
194      2048.0        128.0   4.0  23.008000   22.208000    48.640002
195      2048.0        128.0   8.0  27.775999   27.327999    59.808001
196      2048.0        256.0   1.0  19.648001   19.711999    49.343999
197      2048.0        256.0   2.0  20.800000   20.703999    51.647998
198      2048.0        256.0   4.0  23.647999   23.584001    58.527999
199      2048.0        256.0   8.0  28.767999   28.880000    66.880003
200      4096.0          8.0   1.0  19.743999   18.192001   113.504000
201      4096.0          8.0   2.0  22.496000   21.280000   204.255998
202      4096.0          8.0   4.0  26.815999   26.112000   375.871986
203      4096.0          8.0   8.0  36.127999   36.288001   728.640020
204      4096.0         32.0   1.0  20.479999   19.104000    49.376000
205      4096.0         32.0   2.0  22.688000   21.376001    71.520001
206      4096.0         32.0   4.0  28.160000   27.264001   117.376000
207      4096.0         32.0   8.0  37.983999   38.208000   208.800003
208      4096.0         64.0   1.0  20.671999   19.424001    46.976000
209      4096.0         64.0   2.0  23.776000   22.688000    51.488001
210      4096.0         64.0   4.0  28.960001   28.192000    75.071998
211      4096.0         64.0   8.0  40.383998   40.576000   120.991997
212      4096.0        128.0   1.0  20.544000   19.680001    47.616001
213      4096.0        128.0   2.0  23.008000   22.112001    48.480000
214      4096.0        128.0   4.0  27.775999   27.295999    59.744000
215      4096.0        128.0   8.0  37.152000   37.696000    83.903998
216      4096.0        256.0   1.0  20.736000   20.848000    51.647998
217      4096.0        256.0   2.0  23.647999   23.615999    58.304001
218      4096.0        256.0   4.0  28.767999   28.848000    67.071997
219      4096.0        256.0   8.0  35.168000   36.320001    80.063999
220      8192.0          8.0   1.0  22.464000   21.280000   203.424007
221      8192.0          8.0   2.0  26.880000   26.112000   374.496013
222      8192.0          8.0   4.0  36.160000   36.479998   729.503989
223      8192.0          8.0   8.0  54.207999   56.575999  1447.360039
224      8192.0         32.0   1.0  22.752000   21.568000    71.456000
225      8192.0         32.0   2.0  28.096000   27.232001   117.472000
226      8192.0         32.0   4.0  37.951998   38.208000   208.800003
227      8192.0         32.0   8.0  58.143999   60.479999   381.312013
228      8192.0         64.0   1.0  23.776000   22.688000    51.808000
229      8192.0         64.0   2.0  28.896000   28.255999    74.720003
230      8192.0         64.0   4.0  40.288001   40.832002   120.863996
231      8192.0         64.0   8.0  62.208001   64.672001   213.024005
232      8192.0        128.0   1.0  23.008000   22.240000    48.367999
233      8192.0        128.0   2.0  27.712001   27.136000    59.872001
234      8192.0        128.0   4.0  37.248001   37.760001    83.999999
235      8192.0        128.0   8.0  56.095999   58.720000   131.456003
236      8192.0        256.0   1.0  23.680000   23.776000    58.495998
237      8192.0        256.0   2.0  28.783999   28.928000    67.008004
238      8192.0        256.0   4.0  35.135999   36.160000    80.159999
239      8192.0        256.0   8.0  51.552001   54.880001   105.407998

GSM8K results:

Main:

[root  /home/root/luoyuan.luo/sglang] Fri Jul 11 11:15:50 
$python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000 --data-path ./test.jsonl
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:11<00:00, 17.97it/s]
Accuracy: 0.710
Invalid: 0.000
Latency: 11.374 s
Output throughput: 4666.546 token/s

This PR:

[root  /home/root/luoyuan.luo/sglang] Fri Jul 11 11:17:28 
$python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000 --data-path ./test.jsonl
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:10<00:00, 19.74it/s]
Accuracy: 0.755
Invalid: 0.000
Latency: 10.316 s
Output throughput: 4843.095 token/s
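
For context, these GSM8K runs only measure the serving path and assume an SGLang server is already listening on port 30000. A rough sketch of how such a server might be launched (the model path and TP size are assumptions for illustration; the PR does not state which MoE model was served):

$python3 -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-FP8 --tp 8 --port 30000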

This PR Precision Verification:

[root  /home/root/luoyuan.luo/sglang] Fri Jul 11 11:42:36 
$python ./sgl-kernel/tests/test_moe_align.py
============================================================================================= test session starts =============================================================================================
platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/root/luoyuan.luo/sglang/sgl-kernel
configfile: pyproject.toml
plugins: hypothesis-6.135.10, anyio-4.8.0
collected 4368 items                                                                                                                                                                                          

sgl-kernel/tests/test_moe_align.py .................................................................................................................................................................... [  3%]
....................................................................................................................................................................................................... [  8%]
....................................................................................................................................................................................................... [ 12%]
....................................................................................................................................................................................................... [ 17%]
....................................................................................................................................................................................................... [ 21%]
....................................................................................................................................................................................................... [ 26%]
....................................................................................................................................................................................................... [ 31%]
....................................................................................................................................................................................................... [ 35%]
....................................................................................................................................................................................................... [ 40%]
....................................................................................................................................................................................................... [ 44%]
....................................................................................................................................................................................................... [ 49%]
....................................................................................................................................................................................................... [ 53%]
....................................................................................................................................................................................................... [ 58%]
....................................................................................................................................................................................................... [ 62%]
....................................................................................................................................................................................................... [ 67%]
....................................................................................................................................................................................................... [ 72%]
....................................................................................................................................................................................................... [ 76%]
....................................................................................................................................................................................................... [ 81%]
....................................................................................................................................................................................................... [ 85%]
....................................................................................................................................................................................................... [ 90%]
....................................................................................................................................................................................................... [ 94%]
....................................................................................................................................................................................................... [ 99%]
.........................                                                                                                                                                                               [100%]

============================================================================================ 4368 passed in 8.45s =============================================================================================
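
For readers skimming the test output above: the suite checks that the CUDA kernel and the Triton reference agree on what moe_align_block_size should produce. Below is a minimal pure-PyTorch sketch of that contract (illustrative only; the function name, argument layout, and padding convention are assumptions based on the usual vLLM/SGLang semantics, not the actual test code):

import torch

# Illustrative reference of the moe_align_block_size contract (not the actual kernel or test code):
# group flattened token indices by expert and pad each expert's group to a multiple of block_size,
# so that every block of sorted_token_ids is owned by exactly one expert.
def moe_align_block_size_ref(topk_ids: torch.Tensor, num_experts: int, block_size: int):
    flat = topk_ids.flatten().to(torch.int64)
    numel = flat.numel()
    counts = torch.bincount(flat, minlength=num_experts)             # tokens routed to each expert
    padded = ((counts + block_size - 1) // block_size) * block_size  # round up per expert
    cumsum = torch.cumsum(padded, dim=0)                             # prefix sum the PR parallelizes with a block/warp scan
    num_tokens_post_padded = int(cumsum[-1])
    # Padding slots point one past the last valid flattened index.
    sorted_token_ids = torch.full((num_tokens_post_padded,), numel, dtype=torch.int32)
    # One expert id per block of block_size slots.
    expert_ids = torch.repeat_interleave(torch.arange(num_experts, dtype=torch.int32),
                                         padded // block_size)
    fill = (cumsum - padded).clone()                                 # start offset of each expert's segment
    for i in range(numel):
        e = int(flat[i])
        sorted_token_ids[int(fill[e])] = i
        fill[e] += 1
    return sorted_token_ids, expert_ids, num_tokens_post_padded

# Tiny example: 4 tokens, topk=2, 4 experts, block_size=4.
topk_ids = torch.tensor([[0, 1], [1, 2], [0, 3], [2, 2]], dtype=torch.int32)
print(moe_align_block_size_ref(topk_ids, num_experts=4, block_size=4))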

@yuan-luo
Collaborator Author

Why is the output throughput of this PR much lower than on the main branch? Could you try Qwen/Qwen3-235B-A22B-FP8?

My H20 test machine is occasionally shared with other users, so another benchmark may have been running at the same time. I'll pay more attention to this kind of situation, since it skews the results.

Collaborator

@ispobock ispobock left a comment

It seems all the AMD tests failed. This kernel is also used on AMD GPUs. Please take a look.

@yuan-luo
Collaborator Author

It seems all the AMD tests failed. This kernel is also used on AMD GPUs. Please take a look.

This is unrelated to this PR's changes. Retrying the CI.

@yuan-luo yuan-luo force-pushed the refine_moe_align_block_kernel branch from 1593e6e to c852fc6 on July 14, 2025 at 11:29
@ispobock
Collaborator

@HaiShaw Please help review this change for AMD GPUs.


@ispobock ispobock merged commit af1cc8f into sgl-project:main Jul 17, 2025
156 of 169 checks passed
BBuf added a commit that referenced this pull request Jul 28, 2025
yuan-luo pushed a commit to antgroup/sglang that referenced this pull request Sep 18, 2025
Merge branch 'sglang_public_tracker' of git@code.alipay.com:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/192


Reviewed-by: 得泽 <zhangkaihong.zkh@antgroup.com>


* fix duplicate args in schedule_batch (sgl-project#7816)
* [AMD] Fail gracefully when AITER is unavailable gfx90a GPUs (sgl-project#7187)
* docs: update README (sgl-project#7821)
* [theta] add py-spy deps
* feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (sgl-project#7762)
* Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (sgl-project#7129)
* [Minor] Fix sporadic CI timeout caused by underestimated tests. (sgl-project#7850)
* [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (sgl-project#7853)
* Fix cache modules of triton import error (sgl-project#7832)
* [router] forward stream_options in request (sgl-project#7860)
* Fix illegal memory in trtllm allreduce fusion (sgl-project#7864)
* Fix llama4 vision (sgl-project#7840)
* Support Mimo-VL (sgl-project#7579)
* fix: Handles input_embeds in GenerateReqInput when n>1 (sgl-project#7830)
* [Multimodal][Perf] Use `pybase64` instead of `base64` (sgl-project#7724)
* Bump xgrammar's version to 0.1.20 (sgl-project#7866)
* [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (sgl-project#7818)
* [PD] Add guidance for prefill bootstrap timeout (sgl-project#7846)
* Update native_api doc to match the change in the `get_model_info` endpoint (sgl-project#7660)
* Revert "Embedding parallel by attn_tp (sgl-project#7623)" (sgl-project#7880)
* chore: bump v0.4.9.post1 (sgl-project#7882)
* Fixes typo in assertion message (sgl-project#7895)
* [CI] Add deepep tests to CI (sgl-project#7872)
* [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (sgl-project#7885)
* [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (sgl-project#7838)
* Remove unused imports (sgl-project#7898)
* [router] Update metrics when request completes (sgl-project#7899)
* [feature] Add start step profile argument in /start_profile (sgl-project#7608)
* [bugfix] add pd router policy validation (sgl-project#7904)
* vlm: support video as an input modality (sgl-project#5888)
* Feat: Support Phi-3.5-MoE in SGLang (sgl-project#7907)
* add sentencepiece as dependency explicitly (sgl-project#7922)
* Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (sgl-project#6449)
* [feature]Ascend quantization support (sgl-project#7791)
* [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (sgl-project#7775)
* Support Kimi K2 (sgl-project#7940)
* [feature] kv transfer support of ascend npu (sgl-project#7795)
* fix: minor fix for modelopt weight load compatibility (sgl-project#7953)
* temporarily disable deepep-8-gpu and activate two small tests (sgl-project#7961)
* [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel (sgl-project#7932)
* chore: bump sgl-kernel v0.2.5 (sgl-project#7964)
* Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)" (sgl-project#7968)
* chore: upgrade xgrammar 0.1.21 (sgl-project#7962)
* delete uselese code caused by fuse allreduce+add_rmsnorm pr (sgl-project#7970)
* Fix wrong gemm branch cause 250us slower (sgl-project#7969)
* [router] add worker abstraction (sgl-project#7960)
* chore: upgrade sgl-kernel 0.2.5 (sgl-project#7971)
* chore: bump v0.4.9.post2 (sgl-project#7963)
* [minor fix] llama4 hybrid memory (sgl-project#7950)
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h20_3e
* [theta] perf: add kimi k2 fused_moe tuning config for h200
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h30_3e
* docs: update README (sgl-project#7985)
* Overlap the gating function with shared experts in DeepSeek (sgl-project#7978)
* [BugFix] fix pre_reorder_triton_kernel default int32 issue (sgl-project#7814)
* [minor] Add server_args check for Llama4 with hybrid (sgl-project#7988)
* Tiny fix mooncake log warning wrong output (sgl-project#7952)
* [BugFix] add verify logit_bias to avoid crash because of IndexError  (sgl-project#7749)
* SWA Prefix Cache (sgl-project#7367)
* chore: remove unnecessary limits on quantization methods in test script (sgl-project#7997)
* Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (sgl-project#7844)
* Support for Phi-1.5 & Phi-2 models (sgl-project#7862)
* [Dockerfile] Multi-arch support for ROCm (sgl-project#7902)
* [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (sgl-project#8010)
* perf: add kimi k2 fused_moe tuning config for h30_3e (sgl-project#8021)
* [ci] CI supports use cached models (sgl-project#7874)
* [Minor] Remove redundant print (sgl-project#8005)
* [Feature]TP Group Switching for PD-Multiplexing (sgl-project#7653)
* [Feature] CUDA Green Context Support (sgl-project#7649)
* Fix flaky CI: test_vlm_models (sgl-project#8006)
* Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (sgl-project#7982)
* prevent server crash from potential invalid grammar (sgl-project#7897)
* Setup workflow for releasing mi300x and mi350x dockers. (sgl-project#8035)
* fix: modality length mismatch with image_data (sgl-project#7887)
* Update CODEOWNERS (sgl-project#8044)
* perf: add qwen3-30b-a3b fused moe tuning config for h20
* [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023)
* feat: update multimodal data handling in engine entrypoint (sgl-project#8002)
* fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022)
* Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)
* fix: resolve arm build issue (sgl-project#8052)
* concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943)
* H20 tune config for Kimi (sgl-project#8047)
* Update amd docker image. (sgl-project#8045)
* feat: replace Decord with video_reader-rs (sgl-project#5163)
* remove kv_a.congigous in DeepseekV2AttentionMLA (sgl-project#8058)
* update transformers to 4.53.2 (sgl-project#8029)
* Fix different device type adjustment in PP (sgl-project#7760)
* Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001)
* Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077)
* Fix CI xeon test with triton 3.3.1 (sgl-project#8086)
* fix greenctx stream compability (sgl-project#8090)
* [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098)
* [Feature] Layer-wise Prefill (sgl-project#7634)
* [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992)
* refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075)
* feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597)
* [ci] limit cmake build nproc (sgl-project#8100)
* [ci] disable memory imbalance check for draft worker (sgl-project#8108)
* [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110)
* [ci] recover 8-gpu deepep test (sgl-project#8105)
* Refactor: move all quantization-related code to `srt/layer/quantization` (sgl-project#7989)
* [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884)
* Super tiny fix typo (sgl-project#8046)
* fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102)
* [Hunyuan]: Fix Dense Model Support (sgl-project#8117)
* feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030)
* refactor: simply MultimodalTokens logic (sgl-project#7924)
* [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127)
* Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959)
* load draft model fix (sgl-project#7506)
* [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input"  (sgl-project#7889)
* [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820)
* Hicache Storage Layer Prototype (sgl-project#7704)
* Revert "Fix different device type adjustment in PP" (sgl-project#8141)
* feat: enchance green context stream creation robust with backward compatibility (sgl-project#8136)
* fix compressed tensors WNA16 imports (sgl-project#8142)
* [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147)
* [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113)
* [router] Refactor router and policy traits with dependency injection (sgl-project#7987)
* [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661)
* [Doc] Steps to add a new attention backend (sgl-project#8155)
* chore: tune mem fraction static for vlm (sgl-project#6881)
* Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302)
* Feat: Support audio in Phi4-mm model (sgl-project#8048)
* [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931)
* [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028)
* [router] router metrics cleanup (sgl-project#8158)
* [router] allow router to have empty workers (sgl-project#8160)
* Add GB200 wide-EP docker (sgl-project#8157)
* [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966)
* chore: bump sgl-kernel v0.2.6 (sgl-project#8165)
* chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166)
* [theta] sync bailing
* Fix suffix mismatch for the metrics. (sgl-project#8168)
* Update README.md (sgl-project#8171)
* Clean up server args (sgl-project#8161)
* Fix LoRA buffer contamination during adapter eviction (sgl-project#8103)
* Fix Dockerfile.gb200 (sgl-project#8169)
* [router] add ut for worker and errors (sgl-project#8170)
* bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167)
* Support start up LoRA server without initial adapters (sgl-project#8019)
* Clean warning logs for gate_proj loading in Lora (sgl-project#8172)
* Fix tuning_fused_moe_triton.py (sgl-project#8175)
* [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115)
* Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999)
* Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230)
* Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176)
* feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178)
* fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825)
* Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181)
* feat: update nccl 2.27.6 (sgl-project#8182)
* Feat: Support for Persimmon Model (sgl-project#7983)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183)
* Fix eagle3 cuda graph (sgl-project#8163)
* fix: fix the bug of loading Internvl3 (sgl-project#8067)
* Fix dtype error in CI (sgl-project#8197)
* Cherry-pick commit 2dc5de40 "perf: add bailing mo..." to the current branch
* [router] add ut for pd request, metrics and config (sgl-project#8184)
* [feature] enable NPU CI (sgl-project#7935)
* [fix] fix modelopt fp4 on b200 (sgl-project#8195)
* chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200)
* Apply fused sorted token ids padding (sgl-project#8193)
* [Refactor] simplify multimodal data processing (sgl-project#8107)
* [theta] feat vl name
* [router] add ut for pd router (sgl-project#8208)
* [router] upgade router version to 0.1.6 (sgl-project#8209)
* Remve router gemm output dtype conversion (sgl-project#8204)
* chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202)
* [Feature] Add a test for Layer-wise Prefill (sgl-project#8231)
* docs: update 2025 h2 roadmap (sgl-project#8237)
* fix: retrieve mm token by modality, raise error if none (sgl-project#8221)
* [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484)
* [theta] tune h20 config for qwen3 235b
* [theta] tune h20 config for qwen3 235b
* fix: sgl-router remove dead code (sgl-project#8257)
* [fix] benchmark : routed_scaling_factor is None (sgl-project#8059)
* [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822)
* Preliminary Support for Qwen3XMLDetector (sgl-project#8260)
* chore: bump v0.4.9.post3 (sgl-project#8265)
* PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config
* [theta] tune h20 config for qwen3 480b
* Skip llama4 vision module loading when multimodal disabled (sgl-project#8272)
* PullRequest: 180 Add Fused MoE Triton configs for Qwen480B and Qwen235B on NVIDIA H20-3e
* Fix sgl-kernel ci test (sgl-project#8284)
* [theta] tune h200 config for qwen3 480b
* Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261)
* Hicache IO kernel refactoring (sgl-project#8264)
* bug fix and tag (sgl-project#8282)
* HiCache Fix (sgl-project#8288)
* [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130)
* [router] add common ut infra to mock worker and app (sgl-project#8295)
* fix: workaround for deepgemm warmup issue (sgl-project#8302)
* [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133)
* Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708)
* fix: match chat-template for internvl3 (sgl-project#8262)
* Fix gemma3n with hybrid swa (sgl-project#8240)
* chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304)
* fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685)
* feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043)
* Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP  (sgl-project#7562)
* [AMD] Pull latest image for AMD CI (sgl-project#8070)
* Pin the version of petit kernel to fix the APIs (sgl-project#8235)
* [bug] fix pd completion protocol for batching support (sgl-project#8317)
* [router] fix pd model completion request (sgl-project#8303)
* fix bug when eos_ids==0 (sgl-project#8315)
* [router] add endpoint unit test (sgl-project#8298)
* [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310)
* chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301)
* [router] add streaming unit test (sgl-project#8299)
* [router] add request format unit test (sgl-project#8300)
* HiCache Storage TP Refinement (sgl-project#8307)
* breakdown kernel update (sgl-project#8334)
* support idle batch for TBO (sgl-project#8233)
* [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619)
* DP Enhancement (sgl-project#8280)
* fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266)
* [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135)
* [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000)
* chore: upgrade mooncake 0.3.5 (sgl-project#8341)
* [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353)
* [P/D] Support ipv6 in P/D scenario (sgl-project#7858)
* Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344)
* [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357)
* Clean up server_args, triton cache manager (sgl-project#8332)
* fix: upgrade nccl version (sgl-project#8359)
* [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363)
* fix: kimi k2 xgrammar crash (sgl-project#8367)
* Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333)
* [CI] Fix flaky threshold (sgl-project#8370)
* chore: bump v0.4.9.post4 (sgl-project#8305)
* Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374)
* Uodate Dockerfile.gb200 to latest sglang (sgl-project#8356)
* chore: improve mmmu benchmark (sgl-project#7000)
* Save peak memory in logits processor (sgl-project#8343)
* Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267)
* chore: improvements on mm_utils (sgl-project#7737)
* vlm: optimize tensor transport (sgl-project#6003)
* Tiny assert EPLB is used together with expert parallel (sgl-project#8381)
* model: support intern-s1 (sgl-project#8350)
* Add perf tests for LoRA (sgl-project#8314)
* Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396)
* Add docker release flow for gb200 (sgl-project#8394)
* HiCache, check before terminate prefetching (sgl-project#8372)
* Add nvfp4 scaled mm benchmark. (sgl-project#8401)
* Urgent Fix: intern-s1 chat-template matching (sgl-project#8403)
* Tool to dump and compare internal activation tensors (sgl-project#7976)
* Minor tool for comparison of benchmark results (sgl-project#7974)
* Fix bench script making input data on L2 cache (sgl-project#7739)
* [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036)
* Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392)
* fix: minor fix TransportProxyTensor under tp (sgl-project#8382)
* [router] add different policies for p node and d node (sgl-project#8395)
* Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351)
* fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720)
* [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397)
* Use FlashInfer FP4 gemm. (sgl-project#8241)
* Support precomputed_embeddings for Llama 4 (sgl-project#8156)
* [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405)
* chore: update CODEOWNERS (sgl-project#8407)
* chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406)
* Support triton kernels v3.4.0 for fused_moe (sgl-project#8258)
* [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062)
* Change to use native arm runner (sgl-project#8414)
* Support overlapped lora updates  (sgl-project#8213)
* Support ue8m0 for triton quant kernel (sgl-project#7603)
* Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316)
* bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417)
* Fix test_openai_server (sgl-project#8419)
* Fix docker buildx push error (sgl-project#8425)
* bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422)
* [router] improve router logs and request id header (sgl-project#8415)
* [feat] Support different attention backends for prefill and decode  (sgl-project#6338)
* chore: bump transformer to 4.54.0 (sgl-project#8416)
* [PD] Fix abort_request for PD disaggregation (sgl-project#8352)
* GLM-4.5 Model Support (sgl-project#8224)
* Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442)
* doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441)
* GLM-4.5 Model Support Follow-up (sgl-project#8445)
* fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456)
* Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (sgl-project#8449)
* Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457)
* chore: bump v0.4.9.post5 (sgl-project#8458)
* fix:reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125)
* perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049)
* Cherry-pick commit 4a75e015 "Add draft model fuse..." to the current branch
* Update PR template (sgl-project#8465)
* feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565)
* [theta] tuning script for glm4 moe
* perf: add fused moe kernel config glm4.5,h20-3e,tp8
* [theta] tuning script for glm4 moe h20
* fix: update dep (sgl-project#8467)
* [NVIDIA] Change to use `num_local_experts` (sgl-project#8453)
* Fix parsing ChatCompletionMessage (sgl-project#7273)
* [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421)
* feat: support glm4 tuning (sgl-project#8473)
* Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469)
* Update codeowner (sgl-project#8476)
* chore: add glm4 fp8 tp8 config (sgl-project#8478)
* chore: add glm 4.5 fp8 tp4 config (sgl-project#8480)
* [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477)
* Update CODEOWNERS (sgl-project#8485)
* Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486)
* Reduce memory usage for fp4 moe (sgl-project#8413)
* Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426)
* Support colocating requests (sgl-project#7973)
* Fix incorrect KV cache allocation for MTP models. (sgl-project#8482)
* Add PVC and update resource limits in k8s config (sgl-project#8489)
* chore: bump v0.4.9.post6 (sgl-project#8517)
* Always trigger pr-test (sgl-project#8527)
* Update README.md (sgl-project#8528)
* [sgl-kernel performace] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499)
* Update cutlass_moe.py (sgl-project#8535)
* Fix moe align kernel test (sgl-project#8531)
* Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483)
* bring back kimi vl ci (sgl-project#8537)
* fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431)
* Support EPLB in FusedMoE (sgl-project#8448)
* feat(hicache): support file backend reading directory config form env. (sgl-project#8498)
* feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516)
* [router] allow longer time out for router e2e (sgl-project#8560)
* Update cutlass_moe.py (sgl-project#8545)
* Update CODEOWNERS (sgl-project#8562)
* [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869)
* [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461)
* Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." to the current branch
* [Model] Add support for Arcee Foundational Model (sgl-project#8154)
* Revert "Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)" (sgl-project#8584)
* Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280)
* [router] migrate router from actix to axum (sgl-project#8479)
* [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564)
* Bump transfomers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541)
* Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (sgl-project#8544)
* Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365)
* update sgl-kernel for EP: kernel part  (sgl-project#8514)
* chore: bump sgl-kernel v0.2.8 (sgl-project#8599)
* [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404)
* fix incorrect increase of hit count (sgl-project#8533)
* Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211)
* [theta] Conditionally import HiCacheHF3FS (sgl-project#8598)
* update sgl-kernel for EP: python part (sgl-project#8550)
* add SVG logo (sgl-project#8603)
* [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515)
* fix: fork should not run pypi router (sgl-project#8604)
* model: support Step3V (sgl-project#8583)
* [Feature] Hybrid EP and TP (sgl-project#8590)
* chore: bump v0.4.10 (sgl-project#8608)
* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595)
* [bugfix] QWen-1M context support [2/3]: use the current cuda stream in the DCA's kernel. (sgl-project#8611)
* Fix hf3fs_fuse import error (sgl-project#8623)
* Update step3v default config (sgl-project#8626)
* [ci] fix genai-bench execution cmd (sgl-project#8629)
* [router] update router pypi version (sgl-project#8628)
* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577)
* Fix typos in py_test/test_launch_server.py (sgl-project#6227)
* misc: Remove debug print to logger.info (sgl-project#8633)
* SGLang HiCache NIXL Connector (sgl-project#8488)
* [bug] remove pdlb from minilb since it's no longer available (sgl-project#8634)
* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630)
* Conditionally import HiCacheHF3FS (sgl-project#8598)
* TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632)
* Fix nan value generated after custom all reduce (sgl-project#8532)
* Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642)
* Feature/modelscope model download (sgl-project#8083)
* chore: speedup NPU CI by cache (sgl-project#8270)
* [Bugfix] fix w8a8_int8 load issue (sgl-project#8308)
* [bugfix] fix router python parser for pd urls (sgl-project#8644)
* [router] add basic usage doc (sgl-project#8640)
* [router] upgrade router version to 0.1.8 (sgl-project#8645)
* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450)
* HiCache, fixing hash value indexing (sgl-project#8636)
* Interface change for kvcache io to support page first layout (sgl-project#8318)
* Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051)
* chore: bump v0.4.10.post1 (sgl-project#8652)
* Add hf3fs_utils.cpp to package-data (sgl-project#8653)
* Fix chat template handling for OpenAI serving (sgl-project#8635)
* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511)
* [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658)
* Increase tolerance to address CI failures (sgl-project#8643)
* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013)
* [DOC] Update sgl-kernel README (sgl-project#8665)
* fix per-token cuda kernel when hidden dim is not divisible by 16 (sgl-project#8543)
* fix arg typo for --disaggregation-transfer-backend (sgl-project#8664)
* [fix] fix pd disagg error of vlms (sgl-project#8094)
* Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647)
* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685)
* [bug] limit bootstrap room to [0, 2^63 - 1] (sgl-project#8684)
* Update CODEOWNERS (sgl-project#8686)
* Fix deepgemm masked grouped gemm jit compile (sgl-project#8679)
* Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648)
* bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668)
* upgrade xgrammar 0.1.22 (sgl-project#8522)
* [bugfix] Add 'disaggregation_mode' parameter to warmup function when compiling deep_gemm manually (sgl-project#8618)
* Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238)
* [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364)
* chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694)
* [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688)
* Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)
* model: adapt mllama4 to VisionAttention (sgl-project#8512)
* Add tensor.detach() back to update weight util (sgl-project#8691)
* [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704)
* [theta] merge 0802-3
* Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706)
* [router] minor code cleanup and refactoring (sgl-project#8711)
* [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701)
* chore: bump sgl-kernel v0.2.9 (sgl-project#8713)
* Remove assertions about per group quant fp8 (sgl-project#8717)
* [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693)
* Fix triton moe error caused by TopK refactor (sgl-project#8705)
* [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714)
* [Feature] Radix Tree in C++ (sgl-project#7369)
* [Perf] Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722)
* Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709)
* Tiny fix CI pytest error (sgl-project#8524)
* [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721)
* Support limiting max loaded loras in CPU. (sgl-project#8650)
* Reduce memory accumulation in long-running server (sgl-project#8306)
* HiCache storage, style change and bug fix (sgl-project#8719)
* [feat] support minimum token load balance in dp attention (sgl-project#7379)
* Do layernorm before allgather for DP attention (sgl-project#8631)
* [fix] Fix divide by zero error for llama4. (sgl-project#8683)
* feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547)
* [Improvements] Merge health check route (sgl-project#8444)
* chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718)
* Save cuda graph memory for fa3 (sgl-project#8567)
* [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579)
* [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723)
* chore: bump v0.4.10.post2 (sgl-project#8727)
* feat: Support DP Attention for step3_vl (sgl-project#8699)
* [RL] fix update weight for FusedMoE with EP (sgl-project#8676)
* use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729)
* Fix triton kernels topk with keyword arguments (sgl-project#8732)
* feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678)
* Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114)
* fix args typo in memory_pool_host (sgl-project#8662)
* [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737)
* [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735)
* Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733)
* Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745)
* Rename lora_path to lora_id in batches (sgl-project#8437)
* [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738)
* [CI] Ascend NPU CI enhancement (sgl-project#8294)
* [bugfix] fix import path in HiCacheController (sgl-project#8749)