
[sgl-kernel][Feat][B200][1/N] Support MXFP8 Grouped GEMM in Blackwell #13731

Merged
mickqian merged 10 commits into sgl-project:main from HydraQYH:dev_support_mxfp8_grouped_gemm on Dec 4, 2025
Conversation

@HydraQYH (Collaborator) commented Nov 21, 2025

Motivation

This PR adds CUTLASS-based MXFP8 Grouped GEMM. In addition to the Grouped GEMM itself, it provides a Grouped Quant Kernel that computes the quantized inputs and scale factors; the kernel is implemented in C++ on top of CUTLASS CuTe and quantizes all groups in a single launch (a reference sketch of the quantization math follows at the end of this section). Note that the scale factor layout is special: https://docs.nvidia.com/cuda/cublas/index.html#d-block-scaling-factors-layout
[Figure: cuBLAS 1D block scaling factor layout]
In TensorRT-LLM, the scale factors are written to global memory with the STG.8 instruction, which is an inefficient way to store them. Our implementation therefore uses the following three optimization techniques:

  1. 256bit Load
  2. 100% Occupancy
  3. Overlapping with TMA STORE

With sufficient data, the kernel makes effective use of HBM bandwidth on B200 (up to 85%+ of peak):
[Figure: achieved HBM bandwidth on B200]
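
To make the quantization step concrete, here is a minimal reference sketch (not this PR's kernel) of MXFP8 block-scaled quantization, assuming FP8 E4M3 elements with one power-of-two (UE8M0) scale per 32-element block; the swizzled scale-factor layout from the cuBLAS link and the 256-bit/TMA store optimizations are deliberately omitted, and the scale choice is one common convention rather than necessarily the kernel's.

  # Reference sketch only: per-row, 32-element-block MXFP8 quantization in PyTorch.
  import torch

  FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn
  BLOCK = 32            # MX block size

  def mxfp8_quant_ref(x: torch.Tensor):
      """x: [M, K] float tensor with K % 32 == 0.
      Returns (q, sf): q in float8_e4m3fn, sf one uint8 biased exponent per block."""
      M, K = x.shape
      xb = x.float().reshape(M, K // BLOCK, BLOCK)
      amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
      # Power-of-two scale chosen so every element of the block fits into E4M3.
      exp = torch.ceil(torch.log2(amax / FP8_E4M3_MAX))
      scale = torch.exp2(exp)
      q = (xb / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
      sf = (exp.squeeze(-1) + 127).to(torch.uint8)  # UE8M0 scale, row-major (not swizzled)
      return q.reshape(M, K), sf

A grouped version runs this math once per expert segment; the point of the Grouped Quant Kernel is to handle all segments in one launch and to emit sf directly in the swizzled layout the tensor cores expect.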

Modifications

  • sgl-kernel/CMakeLists.txt
  • sgl-kernel/include/sgl_kernel_ops.h
  • sgl-kernel/csrc/common_extension.cc
  • sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_functor.cuh
  • sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_group_quant.cu
  • sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_group_quant.cuh
  • sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_launcher.cuh
  • sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled_traits.cuh
  • sgl-kernel/csrc/expert_specialization/es_sm100_mxfp8_blockscaled.cu
  • sgl-kernel/python/sgl_kernel/__init__.py
  • sgl-kernel/python/sgl_kernel/expert_specialization.py
  • sgl-kernel/tests/test_es_mxfp8_blockscaled_moe.py

Accuracy Tests

[Screenshot: mxfp8_unitest results]

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @HydraQYH, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the sgl-kernel library by adding specialized support for MXFP8 Grouped GEMM and an optimized Group Quantization Kernel tailored for NVIDIA's Blackwell architecture. The primary goal is to accelerate mixed-precision operations in deep learning models by leveraging advanced hardware capabilities and efficient data handling, ensuring high performance and memory bandwidth utilization.

Highlights

  • MXFP8 Grouped GEMM Support: Introduces support for CUTLASS-based MXFP8 Grouped General Matrix Multiplication (GEMM) specifically for Blackwell (SM100) architecture, enabling highly efficient mixed-precision computations.
  • Group Quantization Kernel: Adds a new Group Quant Kernel, implemented in C++ using CUTLASS CuTe, which calculates quantization inputs and scale factors for all groups within a single kernel, optimizing the quantization process.
  • Performance Optimizations: Incorporates several optimization techniques for scale factor storage and processing, including 256-bit loads, achieving 100% occupancy, and overlapping with TMA STORE, leading to efficient HBM bandwidth utilization (up to 85%+).
  • PyTorch Integration and Testing: Integrates the new MXFP8 grouped GEMM and quantization kernels into the PyTorch extension, making them accessible via torch.ops.sgl_kernel, and includes comprehensive unit tests to verify accuracy on SM100 devices.

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for CUTLASS-based MXFP8 Grouped GEMM on the Blackwell architecture, along with a corresponding group quantization kernel. The implementation is comprehensive and includes several performance optimizations. My review has identified a critical type-mismatch issue in the test code that should be addressed, as well as a high-severity issue related to incomplete input validation in one of the new kernels. I've also noted several medium-severity issues, such as typos in function names, incorrect error messages, and leftover TODO comments, which should be cleaned up to improve code quality and maintainability.

@HydraQYH force-pushed the dev_support_mxfp8_grouped_gemm branch from fe5b209 to a9e5243 on November 21, 2025, 14:35
@BBuf (Collaborator) left a comment


Great job! We could also improve the HBM bandwidth of many memory-bound kernels with 256-bit LDG/STS on B200.

@yuan-luo (Collaborator) commented Nov 22, 2025

Awesome work! The design spirit is quite similar to the DeepSeek DSA Indexer: a pre-compute kernel is used to avoid massive computation. Whereas the Indexer picks 2048 previous tokens to participate in the ongoing self-attention, making the compute linear in complexity, in this PR the "intelligent" pre-compute is used to dispatch the "real" grouped GEMM kernel by masking the problem sizes. Moreover, you propagate the design to SM100 MXFP8. It's really a great design. Thanks a lot.
One comment, which may not be mature: is it possible to introduce a pre-trained pre-compute kernel (something like the Indexer weights in DSv3.2) to mask the problem sizes better, instead of deciding by token count? That would be pretty cool.

@yuan-luo (Collaborator) commented

The basic design for Hopper follows the #11432 series.

@yuan-luo (Collaborator) commented Nov 23, 2025

For sm100, SGLang has sm100_fp8_blockwise_group_mm_dispatch_shape to do dispatch based on shape. The dispatch principle seems similar to this PR.

  if (a.size(0) <= 2048 && a.size(1) >= 2048) {
    run_get_group_gemm_starts<MmaConfig1::LayoutSFA, MmaConfig1::LayoutSFB, MmaConfig1::ScaleConfig>(
        expert_offsets,
        a_ptrs,
        b_ptrs,
        out_ptrs,
        a_scales_ptrs,
        b_scales_ptrs,
        b_t,
        a_t,
        output_t,
        scales_b_t,
        scales_a_t,
        layout_sfa,
        layout_sfb,
        problem_sizes,
        problem_sizes_transpose,
        true);
    launch_sm100_fp8_blockwise_scaled_group_mm<OutType, MmaConfig1, cutlass::layout::ColumnMajor>(
        out_ptrs,
        a_ptrs,
        b_ptrs,
        a_scales_ptrs,
        b_scales_ptrs,
        stride_a,
        stride_b,
        stride_c,
        layout_sfa,
        layout_sfb,
        problem_sizes_transpose,
        expert_offsets,
        workspace);
    output = output_t.t();
  } else if (a.size(0) > 2048 && a.size(1) >= 2048) {
  ......
     // Dispatch to another kernel with another set of MmaConfig/LayoutSF.
  }

The launch_es_sm100_mxfp8_blockscaled_grouped_quant kernel, which does the grouped MXFP8 quantization with CuTe, is awesome! It is really an excellent example of CuTe.

@HydraQYH (Collaborator, Author) commented


@yuan-luo That is a fantastic idea. I also believe that the choice of kernel for an expert should not be based solely on the number of tokens, but rather on arithmetic intensity. Imagine a scenario where the N and K dimensions of an expert are very small: even with a large number of tokens, the overall arithmetic intensity of that expert is still very low, placing it in the memory-bound regime. In this case, a kernel with higher TMA load efficiency should be chosen, rather than a kernel designed for the compute-bound regime. I am still verifying this idea on SM90; if there is any progress, I will submit a PR for the optimization.
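
As a rough illustration of the arithmetic-intensity idea (not code from this PR; the byte counts and the ridge-point threshold are placeholders), the dispatch decision could look like this:

  # Hypothetical sketch: classify one expert's GEMM (m tokens, weight [k, n])
  # as memory- or compute-bound via arithmetic intensity instead of m alone.
  def arithmetic_intensity(m: int, n: int, k: int,
                           in_bytes: int = 1, out_bytes: int = 2) -> float:
      flops = 2.0 * m * n * k
      # FP8 A and B (1 byte/elem) plus a BF16 output (2 bytes/elem); scale factors ignored.
      bytes_moved = (m * k + n * k) * in_bytes + m * n * out_bytes
      return flops / bytes_moved

  def pick_kernel(m: int, n: int, k: int, ridge_point: float = 300.0) -> str:
      # ridge_point ~ peak FLOP/s divided by peak HBM bandwidth of the target GPU (placeholder).
      if arithmetic_intensity(m, n, k) < ridge_point:
          return "memory_bound_tile"   # favor TMA load efficiency
      return "compute_bound_tile"      # favor large MMA tiles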

@HydraQYH (Collaborator, Author) commented


@yuan-luo This dispatch strategy has a very obvious problem: a.size(0) <= 2048 means it dispatches solely on the total number of tokens. It does not even infer the average number of tokens processed by each expert from the number of experts. In #11432, when the batch size is between 512 and 1024, the total number of tokens is already greater than 2048, yet on average each expert only needs to process 16 to 32 tokens; a kernel with M=128 was still chosen, resulting in a lot of redundant computation.
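
For reference, a tiny illustration of the averaging argument (the expert count and top-k below are assumed, DeepSeek-V3-style values chosen only to reproduce the 16-32 figure above):

  # Average routed assignments per expert, assuming an even spread across experts.
  def avg_tokens_per_expert(num_tokens: int, top_k: int, num_experts: int) -> float:
      return num_tokens * top_k / num_experts

  avg_tokens_per_expert(512, 8, 256)   # -> 16.0  (assumed: 256 experts, top_k = 8)
  avg_tokens_per_expert(1024, 8, 256)  # -> 32.0, both far below an M = 128 tile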

@yuan-luo (Collaborator) commented


Got it, this is one of the key designs in your PR.

@FlamingoPg mentioned this pull request on Nov 25, 2025
@mickqian merged commit 16ff892 into sgl-project:main on Dec 4, 2025
135 of 140 checks passed