
Conversation

@kaixih (Collaborator) commented Nov 18, 2025

This PR adds a kernel benchmark for the fp8 GEMMs available on Blackwell GPUs. It currently covers the flashinfer and deepgemm implementations, and can guide users in picking a backend for their use case.
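
For context on how the numbers below are produced: the script drives both backends through Triton's benchmarking utilities (the same machinery behind the `line_vals`/`ylabel` configuration discussed in the review comments further down). The snippet here is only a sketch of that timing pattern, not the script's actual code; `run_deepgemm` and `run_flashinfer` are hypothetical pre-bound callables standing in for the real kernel launches.

import triton.testing


def bench_us(fn, warmup=25, rep=100):
    # do_bench reports a runtime in milliseconds; convert to microseconds.
    return triton.testing.do_bench(fn, warmup=warmup, rep=rep) * 1000.0


def compare_backends(run_deepgemm, run_flashinfer):
    # Both arguments are zero-argument callables that launch one fp8 GEMM
    # with already-quantized inputs (placeholders for the real launches).
    return {
        "DeepGEMM": bench_us(run_deepgemm),
        "Flashinfer": bench_us(run_flashinfer),
    }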

Benchmark

On GB200:

> python benchmark_deepgemm_fp8_gemm_blackwell.py --tp-size 4 --plot-friendly
[plot: DeepGEMM vs. Flashinfer fp8 GEMM latency across the benchmarked shapes]
# unfortunately, we need to run it again in normal mode to get more human-readable text with the shapes
> python benchmark_deepgemm_fp8_gemm_blackwell.py --tp-size 4
          m        n        k  tp_size    DeepGEMM   Flashinfer
0       8.0    576.0   7168.0      4.0   65.119997    16.640000
1      16.0    576.0   7168.0      4.0   60.640000    16.287999
2      32.0    576.0   7168.0      4.0   61.792001    16.416000
3      64.0    576.0   7168.0      4.0   59.904002    16.352000
4     128.0    576.0   7168.0      4.0   60.160000    16.384000
5     256.0    576.0   7168.0      4.0   58.304001    16.384000
6    1024.0    576.0   7168.0      4.0   60.543999    28.704001
7    2048.0    576.0   7168.0      4.0   61.439998    41.216001
8    4096.0    576.0   7168.0      4.0   62.208001    63.487999
9       8.0  24576.0   7168.0      4.0   72.031997    47.136001
10     16.0  24576.0   7168.0      4.0   70.047997    47.104001
11     32.0  24576.0   7168.0      4.0   70.015997    47.392000
12     64.0  24576.0   7168.0      4.0   71.039997    69.920003
13    128.0  24576.0   7168.0      4.0   69.103997   125.216007
14    256.0  24576.0   7168.0      4.0   71.039997   223.519996
15   1024.0  24576.0   7168.0      4.0  146.031998   826.191992
16   2048.0  24576.0   7168.0      4.0  260.383993  1632.544041
17   4096.0  24576.0   7168.0      4.0  514.335990  3283.008099
18      8.0  32768.0    512.0      4.0   59.967998    11.904000
19     16.0  32768.0    512.0      4.0   58.912002    13.216000
20     32.0  32768.0    512.0      4.0   59.487998    16.352000
21     64.0  32768.0    512.0      4.0   58.095999    20.703999
22    128.0  32768.0    512.0      4.0   58.304001    32.768000
23    256.0  32768.0    512.0      4.0   57.728000    55.264000
24   1024.0  32768.0    512.0      4.0   58.368001   192.736000
25   2048.0  32768.0    512.0      4.0   66.463999   375.039995
26   4096.0  32768.0    512.0      4.0   73.760003   741.375983
27      8.0   7168.0  16384.0      4.0   65.920003    44.608001
28     16.0   7168.0  16384.0      4.0   68.960004    45.056000
29     32.0   7168.0  16384.0      4.0   70.015997    44.608001
30     64.0   7168.0  16384.0      4.0   66.944003    45.311999
31    128.0   7168.0  16384.0      4.0   66.944003    73.728003
32    256.0   7168.0  16384.0      4.0   65.920003   100.351997
33   1024.0   7168.0  16384.0      4.0   89.695998   350.511998
34   2048.0   7168.0  16384.0      4.0  184.640005   700.687975
35   4096.0   7168.0  16384.0      4.0  358.720005  1507.328033
36      8.0   7168.0  18432.0      4.0   64.896002    47.359999
37     16.0   7168.0  18432.0      4.0   64.896002    47.327999
38     32.0   7168.0  18432.0      4.0   65.920003    47.104001
39     64.0   7168.0  18432.0      4.0   65.920003    47.104001
40    128.0   7168.0  18432.0      4.0   63.872002    77.903997
41    256.0   7168.0  18432.0      4.0   63.872002   108.800001
42   1024.0   7168.0  18432.0      4.0  104.720000   393.216014
43   2048.0   7168.0  18432.0      4.0  204.832003   751.616001
44   4096.0   7168.0  18432.0      4.0  395.296007  1559.551954
45      8.0   9216.0   7168.0      4.0   55.744000    24.768000
46     16.0   9216.0   7168.0      4.0   56.704000    24.576001
47     32.0   9216.0   7168.0      4.0   51.344000    24.576001
48     64.0   9216.0   7168.0      4.0   52.191999    26.624000
49    128.0   9216.0   7168.0      4.0   52.223999    39.168000
50    256.0   9216.0   7168.0      4.0   55.328000    61.407998
51   1024.0   9216.0   7168.0      4.0   61.087999   192.543998
52   2048.0   9216.0   7168.0      4.0  112.896003   358.687997
53   4096.0   9216.0   7168.0      4.0  212.576002   692.480028
54      8.0   6144.0   7168.0      4.0   54.655999    20.768000
55     16.0   6144.0   7168.0      4.0   55.679999    20.768000
56     32.0   6144.0   7168.0      4.0   47.711998    20.992000
57     64.0   6144.0   7168.0      4.0   48.160002    22.496000
58    128.0   6144.0   7168.0      4.0   47.904000    33.023998
59    256.0   6144.0   7168.0      4.0   48.160002    45.056000
60   1024.0   6144.0   7168.0      4.0   54.687999   131.104007
61   2048.0   6144.0   7168.0      4.0   66.175997   239.904001
62   4096.0   6144.0   7168.0      4.0  140.864000   459.776014
63      8.0   8192.0    512.0      4.0   45.568001     9.216000
64     16.0   8192.0    512.0      4.0   44.608001     8.864000
65     32.0   8192.0    512.0      4.0   44.064000     9.168000
66     64.0   8192.0    512.0      4.0   44.192001    10.240000
67    128.0   8192.0    512.0      4.0   43.584000    14.784000
68    256.0   8192.0    512.0      4.0   43.935999    22.784000
69   1024.0   8192.0    512.0      4.0   44.160001    71.648002
70   2048.0   8192.0    512.0      4.0   45.024000   133.376002
71   4096.0   8192.0    512.0      4.0   49.120001   260.127991
72      8.0   6144.0   1536.0      4.0   46.335999    10.240000
73     16.0   6144.0   1536.0      4.0   45.472000    10.240000
74     32.0   6144.0   1536.0      4.0   46.592001    10.240000
75     64.0   6144.0   1536.0      4.0   46.224000    14.336000
76    128.0   6144.0   1536.0      4.0   45.855999    18.495999
77    256.0   6144.0   1536.0      4.0   46.624001    28.608000
78   1024.0   6144.0   1536.0      4.0   46.895999    82.496002
79   2048.0   6144.0   1536.0      4.0   47.104001   157.664001
80   4096.0   6144.0   1536.0      4.0   51.584002   307.231992
81      8.0   1024.0   7168.0      4.0   50.207999    16.896000
82     16.0   1024.0   7168.0      4.0   47.136001    16.960001
83     32.0   1024.0   7168.0      4.0   45.311999    17.408000
84     64.0   1024.0   7168.0      4.0   45.056000    16.384000
85    128.0   1024.0   7168.0      4.0   44.224001    16.480001
86    256.0   1024.0   7168.0      4.0   44.831999    17.408000
87   1024.0   1024.0   7168.0      4.0   44.624001    30.208001
88   2048.0   1024.0   7168.0      4.0   46.815999    53.247999
89   4096.0   1024.0   7168.0      4.0   52.607998    87.583996
90      8.0   7168.0   4608.0      4.0   46.112001    16.384000
91     16.0   7168.0   4608.0      4.0   47.104001    16.384000
92     32.0   7168.0   4608.0      4.0   43.807998    16.384000
93     64.0   7168.0   4608.0      4.0   44.576000    16.640000
94    128.0   7168.0   4608.0      4.0   44.128001    24.576001
95    256.0   7168.0   4608.0      4.0   44.351999    34.784000
96   1024.0   7168.0   4608.0      4.0   49.536001   104.447998
97   2048.0   7168.0   4608.0      4.0   51.775999   194.848001
98   4096.0   7168.0   4608.0      4.0  104.768001   376.832008
99      8.0   7168.0   4096.0      4.0   46.112001    15.904000
100    16.0   7168.0   4096.0      4.0   44.256002    15.904000
101    32.0   7168.0   4096.0      4.0   42.975999    15.904000
102    64.0   7168.0   4096.0      4.0   43.887999    16.352000
103   128.0   7168.0   4096.0      4.0   44.128001    22.528000
104   256.0   7168.0   4096.0      4.0   43.136001    30.688001
105  1024.0   7168.0   4096.0      4.0   50.528001    96.256003
106  2048.0   7168.0   4096.0      4.0   49.152002   178.207994
107  4096.0   7168.0   4096.0      4.0   93.759999   344.352007
108     8.0   7168.0    512.0      4.0   46.464000     8.864000
109    16.0   7168.0    512.0      4.0   43.968000     8.192000
110    32.0   7168.0    512.0      4.0   44.447999     9.184000
111    64.0   7168.0    512.0      4.0   43.648001    10.240000
112   128.0   7168.0    512.0      4.0   43.552000    14.304000
113   256.0   7168.0    512.0      4.0   42.688001    20.736000
114  1024.0   7168.0    512.0      4.0   44.767998    63.744001
115  2048.0   7168.0    512.0      4.0   43.680001   118.784003
116  4096.0   7168.0    512.0      4.0   46.080001   231.040001

Accuracy

The benchmark also includes a correctness check:

Running correctness tests...                                                                                                                                                                                                               
[2025-11-18 18:04:02] WARNING compile_utils.py:94: Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_
gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`     
[2025-11-18 18:04:02] INFO compile_utils.py:104: Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=512, K=7168, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_g
emm`.                                                                                                                                                                                                                                      
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:02<00:00, 8031.39it/s]
Shape m=64, n=512, k=7168:                                                                                                                                                                                                                 
Flashinfer output: tensor([-26.2500,  40.5000, -15.4375, -61.0000, -68.5000], device='cuda:0',                                                                                                                                             
       dtype=torch.bfloat16)                                                                                                                                                                                                               
DeepGEMM output: tensor([-17.5000,  39.0000, -11.8125, -58.7500, -66.5000], device='cuda:0',                                                                                                                                               
       dtype=torch.bfloat16)                                                                                                                                                                                                               
Correctness check:                                                                                                                                                                                                                         
  - Flashinfer vs DeepGEMM: ✅                                                                                                                                                                                                             
[2025-11-18 18:04:04] WARNING compile_utils.py:94: Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_
gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`     
[2025-11-18 18:04:04] INFO compile_utils.py:104: Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=7168, K=16384, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep
_gemm`.                                                                                                                                                                                                                                    
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:02<00:00, 8106.33it/s]
Shape m=64, n=7168, k=16384:                                                                                                                                                                                                               
Flashinfer output: tensor([ 51.5000, 137.0000, -87.5000, -53.7500, 111.0000], device='cuda:0',                       
       dtype=torch.bfloat16)                                                                                                                                                                                                               
DeepGEMM output: tensor([ 50.2500, 138.0000, -95.0000, -61.7500, 114.0000], device='cuda:0',                         
       dtype=torch.bfloat16)                                                                                         
Correctness check:                                                                                                   
  - Flashinfer vs DeepGEMM: ✅                                                                                       
[2025-11-18 18:04:06] WARNING compile_utils.py:94: Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_
gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`     
[2025-11-18 18:04:06] INFO compile_utils.py:104: Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=18432, K=7168, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep
_gemm`.                                                                                                                                                                                                                                    
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16384/16384 [00:02<00:00, 8110.53it/s]
Shape m=64, n=18432, k=7168:                                                                                         
Flashinfer output: tensor([-41.0000,  29.7500, 117.5000,  78.0000,  28.8750], device='cuda:0',                                                                                                                                             
       dtype=torch.bfloat16)                                                                                         
DeepGEMM output: tensor([-35.2500,  33.7500, 120.0000,  75.0000,  26.2500], device='cuda:0',
       dtype=torch.bfloat16)                                                                                         
Correctness check:                                                                                                   
  - Flashinfer vs DeepGEMM: ✅  
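
For reference, the ✅ above is a tolerance-based comparison rather than an exact match: the two outputs are compared element-wise, and the check passes when a large enough fraction of elements agrees within a relative tolerance. Below is a minimal sketch of that idea; the names `rtol`, `percent`, and `match_ratio` follow the snippet quoted in the review further down, while the default thresholds here are illustrative and may differ from the script's.

import torch


def outputs_match(a: torch.Tensor, b: torch.Tensor, rtol: float = 0.05, percent: float = 0.95) -> bool:
    # Fraction of elements whose relative error is within rtol.
    a, b = a.float(), b.float()
    rel_err = (a - b).abs() / b.abs().clamp_min(1e-6)
    match_ratio = (rel_err <= rtol).float().mean()
    if match_ratio.item() >= percent:
        return True
    mismatch_percent = 1.0 - match_ratio.item()
    print(
        f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
        f"(threshold: {1 - percent:.4f})"
    )
    return False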

@gemini-code-assist (Contributor)

Summary of Changes

Hello @kaixih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a dedicated benchmark for FP8 General Matrix Multiply (GEMM) kernels tailored for NVIDIA Blackwell GPUs. The benchmark evaluates and compares the performance and numerical accuracy of two prominent implementations, Flashinfer and DeepGEMM, under diverse matrix dimensions and tensor parallelism settings. This new tool aims to empower users with critical performance data, enabling them to make informed decisions when choosing the most suitable FP8 GEMM backend for their specific applications on Blackwell architecture.

Highlights

  • New FP8 GEMM Benchmark: Introduced a new benchmark script specifically for FP8 General Matrix Multiply (GEMM) operations on NVIDIA Blackwell GPUs.
  • Backend Comparison: The benchmark compares the performance and correctness of two FP8 GEMM implementations: Flashinfer and DeepGEMM.
  • Configurable Benchmarking: The benchmark supports various matrix shapes, batch sizes, and tensor parallelism configurations to provide comprehensive performance insights.
  • FP8 Utility Function Made Public: The internal _requant_weight_ue8m0 function in fp8_utils.py has been renamed to requant_weight_ue8m0, making it a public utility for requantizing FP8 weights.

@kaixih (Collaborator, Author) commented Nov 18, 2025

gemini-code-assist bot left a comment

Code Review

This pull request introduces a new benchmark script for fp8 GEMM operations on Blackwell GPUs, comparing flashinfer and deepgemm implementations. The changes also include making a utility function public for use in the benchmark. My review focuses on improving the new benchmark script's correctness, readability, and maintainability. I've provided suggestions to fix an argument parsing bug, correct TFLOPS calculations and time unit inconsistencies, reduce code duplication, and improve code style for better clarity.

Comment on lines +21 to +25
x_padded = torch.zeros(
    (ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), dtype=x.dtype, device=x.device
)
x_padded[:m, :n] = x
x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)

medium

The constant BLOCK_SIZE is defined at the top of the file, but the magic number 128 is used here. For better readability and maintainability, please use the defined constant BLOCK_SIZE.

Suggested change
- x_padded = torch.zeros(
-     (ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), dtype=x.dtype, device=x.device
- )
- x_padded[:m, :n] = x
- x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)
+ x_padded = torch.zeros(
+     (ceil_div(m, BLOCK_SIZE) * BLOCK_SIZE, ceil_div(n, BLOCK_SIZE) * BLOCK_SIZE), dtype=x.dtype, device=x.device
+ )
+ x_padded[:m, :n] = x
+ x_view = x_padded.view(-1, BLOCK_SIZE, x_padded.size(1) // BLOCK_SIZE, BLOCK_SIZE)

Comment on lines +53 to +61
weight_shapes = []
for t in total:
    weight_shapes.append(t)
for n_t in n_tp:
    new_t = (n_t[0] // tp_size, n_t[1])
    weight_shapes.append(new_t)
for k_t in k_tp:
    new_t = (k_t[0], k_t[1] // tp_size)
    weight_shapes.append(new_t)

medium

The construction of weight_shapes can be simplified by using list.extend with list comprehensions. This is more Pythonic and improves readability.

Suggested change
- weight_shapes = []
- for t in total:
-     weight_shapes.append(t)
- for n_t in n_tp:
-     new_t = (n_t[0] // tp_size, n_t[1])
-     weight_shapes.append(new_t)
- for k_t in k_tp:
-     new_t = (k_t[0], k_t[1] // tp_size)
-     weight_shapes.append(new_t)
+ weight_shapes = list(total)
+ weight_shapes.extend([(n_t[0] // tp_size, n_t[1]) for n_t in n_tp])
+ weight_shapes.extend([(k_t[0], k_t[1] // tp_size) for k_t in k_tp])

Comment on lines +125 to +131
mismatch_percent = 1.0 - match_ratio.item()
if mismatch_percent > 1 - percent:
    print(
        f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
        f"(threshold: {1 - percent:.4f})"
    )
    return False

medium

The condition if mismatch_percent > 1 - percent: is redundant. If the execution reaches this point, it means match_ratio < percent, which implies 1.0 - match_ratio.item() > 1.0 - percent. Therefore, the if statement is always true. You can simplify the code by removing this condition.

    mismatch_percent = 1.0 - match_ratio.item()
    print(
        f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
        f"(threshold: {1 - percent:.4f})"
    )
    return False

Comment on lines 178 to 197
x = torch.randn((m, k), device="cuda", dtype=torch.bfloat16)
y = torch.randn((n, k), device="cuda", dtype=torch.bfloat16)

# Preprocess data before benchmarking
y_fp8, y_scale = per_block_cast_to_fp8(y)
x_fp8, x_scale = sglang_per_token_group_quant_fp8(
    x, BLOCK_SIZE, column_major_scales=True
)
dg_x_fp8, dg_x_scale = sglang_per_token_group_quant_fp8(
    x,
    BLOCK_SIZE,
    column_major_scales=True,
    scale_tma_aligned=True,
    scale_ue8m0=True,
)
dg_y_fp8, dg_y_scale = requant_weight_ue8m0(
    y_fp8,
    y_scale,
    [BLOCK_SIZE, BLOCK_SIZE]
)

medium

The input preparation logic in this block is duplicated in the calculate_diff function (lines 135-165). To improve code maintainability and avoid redundancy, consider extracting this logic into a separate helper function. This function could handle the creation and quantization of tensors for both flashinfer and deepgemm, returning the prepared inputs.
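
One possible shape for such a helper, reusing the calls quoted above (a sketch only; the function name and return layout are illustrative, and it assumes the imports already present in the script):

def prepare_fp8_inputs(m: int, n: int, k: int):
    # Random bf16 activations and weights, then quantize once for each backend.
    x = torch.randn((m, k), device="cuda", dtype=torch.bfloat16)
    y = torch.randn((n, k), device="cuda", dtype=torch.bfloat16)

    # Flashinfer-style inputs: per-block weight scales, per-token-group activation scales.
    y_fp8, y_scale = per_block_cast_to_fp8(y)
    x_fp8, x_scale = sglang_per_token_group_quant_fp8(
        x, BLOCK_SIZE, column_major_scales=True
    )

    # DeepGEMM-style inputs: UE8M0 scales with TMA-aligned layout.
    dg_x_fp8, dg_x_scale = sglang_per_token_group_quant_fp8(
        x, BLOCK_SIZE, column_major_scales=True, scale_tma_aligned=True, scale_ue8m0=True
    )
    dg_y_fp8, dg_y_scale = requant_weight_ue8m0(y_fp8, y_scale, [BLOCK_SIZE, BLOCK_SIZE])

    flashinfer_inputs = (x_fp8, x_scale, y_fp8, y_scale)
    deepgemm_inputs = (dg_x_fp8, dg_x_scale, dg_y_fp8, dg_y_scale)
    return flashinfer_inputs, deepgemm_inputs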

line_vals=["deepgemm", "flashinfer"],
line_names=["DeepGEMM", "Flashinfer"],
styles=[("blue", "-"), ("red", "-")],
ylabel="us",

medium

The ylabel is set to "us" (microseconds), but the benchmark function returns values in milliseconds (ms * 1000). Please change ylabel to "ms" to match the returned values.

            ylabel="ms",

line_vals=["deepgemm", "flashinfer"],
line_names=["DeepGEMM", "Flashinfer"],
styles=[("blue", "-"), ("red", "-")],
ylabel="us",

medium

The ylabel is set to "us" (microseconds), but the benchmark function returns values in milliseconds (ms * 1000). This should be corrected to "ms" to match the returned values.

            ylabel="ms",

Comment on lines +291 to +294
"--run-correctness",
action="store_true",
default=True,
help="Whether to run correctness test",

medium

The current argparse configuration for --run-correctness with action="store_true" and default=True makes it impossible to disable the correctness check. A better approach is to use action=argparse.BooleanOptionalAction, which automatically creates both --run-correctness and --no-run-correctness flags, providing more intuitive control.

        "--run-correctness",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Whether to run correctness test",

@Fridge003 merged commit c3c4da7 into sgl-project:main Nov 20, 2025
116 of 129 checks passed
yukavio pushed a commit to yukavio/sglang that referenced this pull request Nov 25, 2025