
Conversation

@kaixih (Collaborator) commented Nov 18, 2025

This PR adds a kernel benchmark for the fp8 GEMMs available on Blackwell GPUs. It currently covers the flashinfer and deepgemm implementations, and can guide users in picking a backend for their use case.
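
For context on how the numbers below are produced: the script drives both backends through Triton's benchmarking utilities (the same machinery behind the `line_vals`/`ylabel` configuration discussed in the review comments further down). The snippet here is only a sketch of that timing pattern, not the script's actual code; `run_deepgemm` and `run_flashinfer` are hypothetical pre-bound callables standing in for the real kernel launches.

import triton.testing


def bench_us(fn, warmup=25, rep=100):
    # do_bench reports a runtime in milliseconds; convert to microseconds.
    return triton.testing.do_bench(fn, warmup=warmup, rep=rep) * 1000.0


def compare_backends(run_deepgemm, run_flashinfer):
    # Both arguments are zero-argument callables that launch one fp8 GEMM
    # with already-quantized inputs (placeholders for the real launches).
    return {
        "DeepGEMM": bench_us(run_deepgemm),
        "Flashinfer": bench_us(run_flashinfer),
    }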

Benchmark

On GB200:

> python benchmark_deepgemm_fp8_gemm_blackwell.py --tp-size 4 --plot-friendly
[plot: DeepGEMM vs. Flashinfer fp8 GEMM latency across the benchmarked shapes]
# unfortunately, we need to run it again in normal mode to get more human-readable text with the shapes
> python benchmark_deepgemm_fp8_gemm_blackwell.py --tp-size 4
          m        n        k  tp_size    DeepGEMM   Flashinfer
0       8.0    576.0   7168.0      4.0   65.119997    16.640000
1      16.0    576.0   7168.0      4.0   60.640000    16.287999
2      32.0    576.0   7168.0      4.0   61.792001    16.416000
3      64.0    576.0   7168.0      4.0   59.904002    16.352000
4     128.0    576.0   7168.0      4.0   60.160000    16.384000
5     256.0    576.0   7168.0      4.0   58.304001    16.384000
6    1024.0    576.0   7168.0      4.0   60.543999    28.704001
7    2048.0    576.0   7168.0      4.0   61.439998    41.216001
8    4096.0    576.0   7168.0      4.0   62.208001    63.487999
9       8.0  24576.0   7168.0      4.0   72.031997    47.136001
10     16.0  24576.0   7168.0      4.0   70.047997    47.104001
11     32.0  24576.0   7168.0      4.0   70.015997    47.392000
12     64.0  24576.0   7168.0      4.0   71.039997    69.920003
13    128.0  24576.0   7168.0      4.0   69.103997   125.216007
14    256.0  24576.0   7168.0      4.0   71.039997   223.519996
15   1024.0  24576.0   7168.0      4.0  146.031998   826.191992
16   2048.0  24576.0   7168.0      4.0  260.383993  1632.544041
17   4096.0  24576.0   7168.0      4.0  514.335990  3283.008099
18      8.0  32768.0    512.0      4.0   59.967998    11.904000
19     16.0  32768.0    512.0      4.0   58.912002    13.216000
20     32.0  32768.0    512.0      4.0   59.487998    16.352000
21     64.0  32768.0    512.0      4.0   58.095999    20.703999
22    128.0  32768.0    512.0      4.0   58.304001    32.768000
23    256.0  32768.0    512.0      4.0   57.728000    55.264000
24   1024.0  32768.0    512.0      4.0   58.368001   192.736000
25   2048.0  32768.0    512.0      4.0   66.463999   375.039995
26   4096.0  32768.0    512.0      4.0   73.760003   741.375983
27      8.0   7168.0  16384.0      4.0   65.920003    44.608001
28     16.0   7168.0  16384.0      4.0   68.960004    45.056000
29     32.0   7168.0  16384.0      4.0   70.015997    44.608001
30     64.0   7168.0  16384.0      4.0   66.944003    45.311999
31    128.0   7168.0  16384.0      4.0   66.944003    73.728003
32    256.0   7168.0  16384.0      4.0   65.920003   100.351997
33   1024.0   7168.0  16384.0      4.0   89.695998   350.511998
34   2048.0   7168.0  16384.0      4.0  184.640005   700.687975
35   4096.0   7168.0  16384.0      4.0  358.720005  1507.328033
36      8.0   7168.0  18432.0      4.0   64.896002    47.359999
37     16.0   7168.0  18432.0      4.0   64.896002    47.327999
38     32.0   7168.0  18432.0      4.0   65.920003    47.104001
39     64.0   7168.0  18432.0      4.0   65.920003    47.104001
40    128.0   7168.0  18432.0      4.0   63.872002    77.903997
41    256.0   7168.0  18432.0      4.0   63.872002   108.800001
42   1024.0   7168.0  18432.0      4.0  104.720000   393.216014
43   2048.0   7168.0  18432.0      4.0  204.832003   751.616001
44   4096.0   7168.0  18432.0      4.0  395.296007  1559.551954
45      8.0   9216.0   7168.0      4.0   55.744000    24.768000
46     16.0   9216.0   7168.0      4.0   56.704000    24.576001
47     32.0   9216.0   7168.0      4.0   51.344000    24.576001
48     64.0   9216.0   7168.0      4.0   52.191999    26.624000
49    128.0   9216.0   7168.0      4.0   52.223999    39.168000
50    256.0   9216.0   7168.0      4.0   55.328000    61.407998
51   1024.0   9216.0   7168.0      4.0   61.087999   192.543998
52   2048.0   9216.0   7168.0      4.0  112.896003   358.687997
53   4096.0   9216.0   7168.0      4.0  212.576002   692.480028
54      8.0   6144.0   7168.0      4.0   54.655999    20.768000
55     16.0   6144.0   7168.0      4.0   55.679999    20.768000
56     32.0   6144.0   7168.0      4.0   47.711998    20.992000
57     64.0   6144.0   7168.0      4.0   48.160002    22.496000
58    128.0   6144.0   7168.0      4.0   47.904000    33.023998
59    256.0   6144.0   7168.0      4.0   48.160002    45.056000
60   1024.0   6144.0   7168.0      4.0   54.687999   131.104007
61   2048.0   6144.0   7168.0      4.0   66.175997   239.904001
62   4096.0   6144.0   7168.0      4.0  140.864000   459.776014
63      8.0   8192.0    512.0      4.0   45.568001     9.216000
64     16.0   8192.0    512.0      4.0   44.608001     8.864000
65     32.0   8192.0    512.0      4.0   44.064000     9.168000
66     64.0   8192.0    512.0      4.0   44.192001    10.240000
67    128.0   8192.0    512.0      4.0   43.584000    14.784000
68    256.0   8192.0    512.0      4.0   43.935999    22.784000
69   1024.0   8192.0    512.0      4.0   44.160001    71.648002
70   2048.0   8192.0    512.0      4.0   45.024000   133.376002
71   4096.0   8192.0    512.0      4.0   49.120001   260.127991
72      8.0   6144.0   1536.0      4.0   46.335999    10.240000
73     16.0   6144.0   1536.0      4.0   45.472000    10.240000
74     32.0   6144.0   1536.0      4.0   46.592001    10.240000
75     64.0   6144.0   1536.0      4.0   46.224000    14.336000
76    128.0   6144.0   1536.0      4.0   45.855999    18.495999
77    256.0   6144.0   1536.0      4.0   46.624001    28.608000
78   1024.0   6144.0   1536.0      4.0   46.895999    82.496002
79   2048.0   6144.0   1536.0      4.0   47.104001   157.664001
80   4096.0   6144.0   1536.0      4.0   51.584002   307.231992
81      8.0   1024.0   7168.0      4.0   50.207999    16.896000
82     16.0   1024.0   7168.0      4.0   47.136001    16.960001
83     32.0   1024.0   7168.0      4.0   45.311999    17.408000
84     64.0   1024.0   7168.0      4.0   45.056000    16.384000
85    128.0   1024.0   7168.0      4.0   44.224001    16.480001
86    256.0   1024.0   7168.0      4.0   44.831999    17.408000
87   1024.0   1024.0   7168.0      4.0   44.624001    30.208001
88   2048.0   1024.0   7168.0      4.0   46.815999    53.247999
89   4096.0   1024.0   7168.0      4.0   52.607998    87.583996
90      8.0   7168.0   4608.0      4.0   46.112001    16.384000
91     16.0   7168.0   4608.0      4.0   47.104001    16.384000
92     32.0   7168.0   4608.0      4.0   43.807998    16.384000
93     64.0   7168.0   4608.0      4.0   44.576000    16.640000
94    128.0   7168.0   4608.0      4.0   44.128001    24.576001
95    256.0   7168.0   4608.0      4.0   44.351999    34.784000
96   1024.0   7168.0   4608.0      4.0   49.536001   104.447998
97   2048.0   7168.0   4608.0      4.0   51.775999   194.848001
98   4096.0   7168.0   4608.0      4.0  104.768001   376.832008
99      8.0   7168.0   4096.0      4.0   46.112001    15.904000
100    16.0   7168.0   4096.0      4.0   44.256002    15.904000
101    32.0   7168.0   4096.0      4.0   42.975999    15.904000
102    64.0   7168.0   4096.0      4.0   43.887999    16.352000
103   128.0   7168.0   4096.0      4.0   44.128001    22.528000
104   256.0   7168.0   4096.0      4.0   43.136001    30.688001
105  1024.0   7168.0   4096.0      4.0   50.528001    96.256003
106  2048.0   7168.0   4096.0      4.0   49.152002   178.207994
107  4096.0   7168.0   4096.0      4.0   93.759999   344.352007
108     8.0   7168.0    512.0      4.0   46.464000     8.864000
109    16.0   7168.0    512.0      4.0   43.968000     8.192000
110    32.0   7168.0    512.0      4.0   44.447999     9.184000
111    64.0   7168.0    512.0      4.0   43.648001    10.240000
112   128.0   7168.0    512.0      4.0   43.552000    14.304000
113   256.0   7168.0    512.0      4.0   42.688001    20.736000
114  1024.0   7168.0    512.0      4.0   44.767998    63.744001
115  2048.0   7168.0    512.0      4.0   43.680001   118.784003
116  4096.0   7168.0    512.0      4.0   46.080001   231.040001

Accuracy

The benchmark also includes a correctness check:

Running correctness tests...                                                                                                                                                                                                               
[2025-11-18 18:04:02] WARNING compile_utils.py:94: Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_
gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`     
[2025-11-18 18:04:02] INFO compile_utils.py:104: Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=512, K=7168, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_g
emm`.                                                                                                                                                                                                                                      
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:02<00:00, 8031.39it/s]
Shape m=64, n=512, k=7168:                                                                                                                                                                                                                 
Flashinfer output: tensor([-26.2500,  40.5000, -15.4375, -61.0000, -68.5000], device='cuda:0',                                                                                                                                             
       dtype=torch.bfloat16)                                                                                                                                                                                                               
DeepGEMM output: tensor([-17.5000,  39.0000, -11.8125, -58.7500, -66.5000], device='cuda:0',                                                                                                                                               
       dtype=torch.bfloat16)                                                                                                                                                                                                               
Correctness check:                                                                                                                                                                                                                         
  - Flashinfer vs DeepGEMM: ✅                                                                                                                                                                                                             
[2025-11-18 18:04:04] WARNING compile_utils.py:94: Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_
gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`     
[2025-11-18 18:04:04] INFO compile_utils.py:104: Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=7168, K=16384, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep
_gemm`.                                                                                                                                                                                                                                    
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:02<00:00, 8106.33it/s]
Shape m=64, n=7168, k=16384:                                                                                                                                                                                                               
Flashinfer output: tensor([ 51.5000, 137.0000, -87.5000, -53.7500, 111.0000], device='cuda:0',                       
       dtype=torch.bfloat16)                                                                                                                                                                                                               
DeepGEMM output: tensor([ 50.2500, 138.0000, -95.0000, -61.7500, 114.0000], device='cuda:0',                         
       dtype=torch.bfloat16)                                                                                         
Correctness check:                                                                                                   
  - Flashinfer vs DeepGEMM: ✅                                                                                       
[2025-11-18 18:04:06] WARNING compile_utils.py:94: Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_
gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`     
[2025-11-18 18:04:06] INFO compile_utils.py:104: Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=18432, K=7168, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep
_gemm`.                                                                                                                                                                                                                                    
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:02<00:00, 8110.53it/s]
Shape m=64, n=18432, k=7168:                                                                                         
Flashinfer output: tensor([-41.0000,  29.7500, 117.5000,  78.0000,  28.8750], device='cuda:0',                                                                                                                                             
       dtype=torch.bfloat16)                                                                                         
DeepGEMM output: tensor([-35.2500,  33.7500, 120.0000,  75.0000,  26.2500], device='cuda:0',
       dtype=torch.bfloat16)                                                                                         
Correctness check:                                                                                                   
  - Flashinfer vs DeepGEMM: ✅  
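
For reference, the ✅ above is a tolerance-based comparison rather than an exact match: the two outputs are compared element-wise, and the check passes when a large enough fraction of elements agrees within a relative tolerance. Below is a minimal sketch of that idea; the names `rtol`, `percent`, and `match_ratio` follow the snippet quoted in the review further down, while the default thresholds here are illustrative and may differ from the script's.

import torch


def outputs_match(a: torch.Tensor, b: torch.Tensor, rtol: float = 0.05, percent: float = 0.95) -> bool:
    # Fraction of elements whose relative error is within rtol.
    a, b = a.float(), b.float()
    rel_err = (a - b).abs() / b.abs().clamp_min(1e-6)
    match_ratio = (rel_err <= rtol).float().mean()
    if match_ratio.item() >= percent:
        return True
    mismatch_percent = 1.0 - match_ratio.item()
    print(
        f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
        f"(threshold: {1 - percent:.4f})"
    )
    return False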

@gemini-code-assist (Contributor)

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a dedicated benchmark for FP8 General Matrix Multiply (GEMM) kernels tailored for NVIDIA Blackwell GPUs. The benchmark evaluates and compares the performance and numerical accuracy of two prominent implementations, Flashinfer and DeepGEMM, under diverse matrix dimensions and tensor parallelism settings. This new tool aims to empower users with critical performance data, enabling them to make informed decisions when choosing the most suitable FP8 GEMM backend for their specific applications on Blackwell architecture.

Highlights

  • New FP8 GEMM Benchmark: Introduced a new benchmark script specifically for FP8 General Matrix Multiply (GEMM) operations on NVIDIA Blackwell GPUs.
  • Backend Comparison: The benchmark compares the performance and correctness of two FP8 GEMM implementations: Flashinfer and DeepGEMM.
  • Configurable Benchmarking: The benchmark supports various matrix shapes, batch sizes, and tensor parallelism configurations to provide comprehensive performance insights.
  • FP8 Utility Function Made Public: The internal _requant_weight_ue8m0 function in fp8_utils.py has been renamed to requant_weight_ue8m0, making it a public utility for requantizing FP8 weights.

@kaixih (Collaborator, Author) commented Nov 18, 2025

gemini-code-assist bot left a comment

Code Review

This pull request introduces a new benchmark script for fp8 GEMM operations on Blackwell GPUs, comparing flashinfer and deepgemm implementations. The changes also include making a utility function public for use in the benchmark. My review focuses on improving the new benchmark script's correctness, readability, and maintainability. I've provided suggestions to fix an argument parsing bug, correct TFLOPS calculations and time unit inconsistencies, reduce code duplication, and improve code style for better clarity.

Comment on lines +21 to +25
x_padded = torch.zeros(
    (ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), dtype=x.dtype, device=x.device
)
x_padded[:m, :n] = x
x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)

medium

The constant BLOCK_SIZE is defined at the top of the file, but the magic number 128 is used here. For better readability and maintainability, please use the defined constant BLOCK_SIZE.

Suggested change
- x_padded = torch.zeros(
-     (ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), dtype=x.dtype, device=x.device
- )
- x_padded[:m, :n] = x
- x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)
+ x_padded = torch.zeros(
+     (ceil_div(m, BLOCK_SIZE) * BLOCK_SIZE, ceil_div(n, BLOCK_SIZE) * BLOCK_SIZE), dtype=x.dtype, device=x.device
+ )
+ x_padded[:m, :n] = x
+ x_view = x_padded.view(-1, BLOCK_SIZE, x_padded.size(1) // BLOCK_SIZE, BLOCK_SIZE)

Comment on lines +53 to +61
weight_shapes = []
for t in total:
    weight_shapes.append(t)
for n_t in n_tp:
    new_t = (n_t[0] // tp_size, n_t[1])
    weight_shapes.append(new_t)
for k_t in k_tp:
    new_t = (k_t[0], k_t[1] // tp_size)
    weight_shapes.append(new_t)

medium

The construction of weight_shapes can be simplified by using list.extend with list comprehensions. This is more Pythonic and improves readability.

Suggested change
- weight_shapes = []
- for t in total:
-     weight_shapes.append(t)
- for n_t in n_tp:
-     new_t = (n_t[0] // tp_size, n_t[1])
-     weight_shapes.append(new_t)
- for k_t in k_tp:
-     new_t = (k_t[0], k_t[1] // tp_size)
-     weight_shapes.append(new_t)
+ weight_shapes = list(total)
+ weight_shapes.extend([(n_t[0] // tp_size, n_t[1]) for n_t in n_tp])
+ weight_shapes.extend([(k_t[0], k_t[1] // tp_size) for k_t in k_tp])

Comment on lines +125 to +131
mismatch_percent = 1.0 - match_ratio.item()
if mismatch_percent > 1 - percent:
    print(
        f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
        f"(threshold: {1 - percent:.4f})"
    )
    return False

medium

The condition if mismatch_percent > 1 - percent: is redundant. If the execution reaches this point, it means match_ratio < percent, which implies 1.0 - match_ratio.item() > 1.0 - percent. Therefore, the if statement is always true. You can simplify the code by removing this condition.

    mismatch_percent = 1.0 - match_ratio.item()
    print(
        f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
        f"(threshold: {1 - percent:.4f})"
    )
    return False

Comment on lines 178 to 197
x = torch.randn((m, k), device="cuda", dtype=torch.bfloat16)
y = torch.randn((n, k), device="cuda", dtype=torch.bfloat16)

# Preprocess data before benchmarking
y_fp8, y_scale = per_block_cast_to_fp8(y)
x_fp8, x_scale = sglang_per_token_group_quant_fp8(
    x, BLOCK_SIZE, column_major_scales=True
)
dg_x_fp8, dg_x_scale = sglang_per_token_group_quant_fp8(
    x,
    BLOCK_SIZE,
    column_major_scales=True,
    scale_tma_aligned=True,
    scale_ue8m0=True,
)
dg_y_fp8, dg_y_scale = requant_weight_ue8m0(
    y_fp8,
    y_scale,
    [BLOCK_SIZE, BLOCK_SIZE]
)

medium

The input preparation logic in this block is duplicated in the calculate_diff function (lines 135-165). To improve code maintainability and avoid redundancy, consider extracting this logic into a separate helper function. This function could handle the creation and quantization of tensors for both flashinfer and deepgemm, returning the prepared inputs.
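
One possible shape for such a helper, reusing the calls quoted above (a sketch only; the function name and return layout are illustrative, and it assumes the imports already present in the script):

def prepare_fp8_inputs(m: int, n: int, k: int):
    # Random bf16 activations and weights, then quantize once for each backend.
    x = torch.randn((m, k), device="cuda", dtype=torch.bfloat16)
    y = torch.randn((n, k), device="cuda", dtype=torch.bfloat16)

    # Flashinfer-style inputs: per-block weight scales, per-token-group activation scales.
    y_fp8, y_scale = per_block_cast_to_fp8(y)
    x_fp8, x_scale = sglang_per_token_group_quant_fp8(
        x, BLOCK_SIZE, column_major_scales=True
    )

    # DeepGEMM-style inputs: UE8M0 scales with TMA-aligned layout.
    dg_x_fp8, dg_x_scale = sglang_per_token_group_quant_fp8(
        x, BLOCK_SIZE, column_major_scales=True, scale_tma_aligned=True, scale_ue8m0=True
    )
    dg_y_fp8, dg_y_scale = requant_weight_ue8m0(y_fp8, y_scale, [BLOCK_SIZE, BLOCK_SIZE])

    flashinfer_inputs = (x_fp8, x_scale, y_fp8, y_scale)
    deepgemm_inputs = (dg_x_fp8, dg_x_scale, dg_y_fp8, dg_y_scale)
    return flashinfer_inputs, deepgemm_inputs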

line_vals=["deepgemm", "flashinfer"],
line_names=["DeepGEMM", "Flashinfer"],
styles=[("blue", "-"), ("red", "-")],
ylabel="us",

medium

The ylabel is set to "us" (microseconds), but the benchmark function returns values in milliseconds (ms * 1000). Please change ylabel to "ms" to match the returned values.

            ylabel="ms",

line_vals=["deepgemm", "flashinfer"],
line_names=["DeepGEMM", "Flashinfer"],
styles=[("blue", "-"), ("red", "-")],
ylabel="us",

medium

The ylabel is set to "us" (microseconds), but the benchmark function returns values in milliseconds (ms * 1000). This should be corrected to "ms" to match the returned values.

            ylabel="ms",

Comment on lines +291 to +294
"--run-correctness",
action="store_true",
default=True,
help="Whether to run correctness test",

medium

The current argparse configuration for --run-correctness with action="store_true" and default=True makes it impossible to disable the correctness check. A better approach is to use action=argparse.BooleanOptionalAction, which automatically creates both --run-correctness and --no-run-correctness flags, providing more intuitive control.

        "--run-correctness",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Whether to run correctness test",

@Fridge003 merged commit c3c4da7 into sgl-project:main Nov 20, 2025
116 of 129 checks passed
yukavio pushed a commit to yukavio/sglang that referenced this pull request Nov 25, 2025