[7/n] decouple quantization impl from vllm dependency - gguf kernel #11019
zhyncs merged 12 commits into sgl-project:main
Conversation
Summary of Changes
Hello @FlamingoPg, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request marks a significant step towards supporting native GGUF (GGML Universal Format) kernels.
Code Review
This pull request introduces GGUF quantization kernels, decoupling them from vllm dependencies. The changes are extensive, adding several new CUDA source and header files, and registering the new operations with PyTorch. The implementation seems to be a good adaptation from existing codebases like vllm and llama.cpp. However, there are some areas for improvement. Several functions for matrix and MoE operations are missing support for i-quant types, making the feature incomplete. There are also minor issues in the CUDA kernels regarding unnecessary use of 64-bit integers, and the new tests have some issues that will prevent them from running or properly verifying correctness. Addressing these points will improve the completeness and robustness of this new feature.
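For readers unfamiliar with the registration side mentioned above, the sketch below shows one common way a CUDA op like this can be exposed to PyTorch from C++. It is an illustration only: the namespace sgl_kernel, the op name ggml_dequantize, and its schema are assumptions for this example, not the registration code actually added by this PR.

```cuda
// Minimal sketch of registering a custom CUDA op with PyTorch
// (illustrative names only; not taken from this PR).
#include <torch/library.h>
#include <torch/torch.h>

// Hypothetical host wrapper: allocates the fp16 output and would launch the
// CUDA dequantization kernel for the given GGML quant type (launch omitted).
torch::Tensor ggml_dequantize(torch::Tensor W, int64_t type, int64_t m, int64_t n) {
  auto out = torch::empty({m, n}, W.options().dtype(torch::kHalf));
  // dequantize_kernel<<<grid, block>>>(W.data_ptr(), out.data_ptr(), type, m, n);
  return out;
}

// Declare the operator schema once...
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  m.def("ggml_dequantize(Tensor W, int type, int m, int n) -> Tensor");
}

// ...and bind the CUDA implementation to that schema.
TORCH_LIBRARY_IMPL(sgl_kernel, CUDA, m) {
  m.impl("ggml_dequantize", &ggml_dequantize);
}
```

Once registered this way, the op is reachable from Python as torch.ops.sgl_kernel.ggml_dequantize(...), which is presumably how the new tests would exercise the kernels.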
const int64_t tid = threadIdx.x;
const int64_t il = tid/8; // 0...3
const int64_t ib = tid%8; // 0...7
The variables tid, il, and ib are derived from threadIdx.x, which is an unsigned int with a maximum value of 1023. Declaring them as int64_t is unnecessary and can lead to less optimal code generation; using int or const auto would be more appropriate and cleaner.
const int tid = threadIdx.x;
const int il = tid/8; // 0...3
const int ib = tid%8; // 0...7
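As a sanity check on this suggestion: blockDim.x is capped at 1024 in CUDA, so tid, il, and ib always fit comfortably in a 32-bit int. Widening to int64_t is only needed for indices that feed into global-memory offsets large enough to overflow 2^31, which is not the case for these per-block thread indices.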
All tests passed, ready for review.
fix lint plz
Motivation
Support a native GGUF kernel in sglang. The srt-side integration will follow in the next PR.
Kernel List
sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py, line 598 at commit 2a9d995
Adapted from: https://github.com/vllm-project/vllm/tree/c85be1f6dd3e20d9b42cd68ff54b328ffeb6cb4b/csrc/quantization/gguf
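To give a concrete feel for what these adapted kernels do, here is a minimal CUDA sketch of dequantizing the GGML Q4_0 block format. The layout and names (QK4_0, block_q4_0, dequantize_block_q4_0) follow upstream llama.cpp/vLLM conventions but are reproduced from memory as an illustration, not copied from this PR; the actual kernels cover many more quantization types.

```cuda
// Sketch of the GGML Q4_0 block layout and its dequantization, modeled on the
// upstream code this PR adapts (illustrative only).
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_0 32  // number of weights per Q4_0 block

typedef struct {
  half d;                 // per-block scale
  uint8_t qs[QK4_0 / 2];  // 4-bit quants, two packed per byte
} block_q4_0;

// One thread dequantizes one block of 32 weights into fp16 output y.
__global__ void dequantize_block_q4_0(const block_q4_0* __restrict__ x,
                                      half* __restrict__ y, int k) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;  // block index
  if (i >= k / QK4_0) return;

  const float d = __half2float(x[i].d);
  for (int j = 0; j < QK4_0 / 2; ++j) {
    const int q = x[i].qs[j];
    // Low and high nibbles each hold one 4-bit weight, stored with a +8 offset.
    y[i * QK4_0 + j]             = __float2half(((q & 0x0F) - 8) * d);
    y[i * QK4_0 + j + QK4_0 / 2] = __float2half(((q >> 4) - 8) * d);
  }
}
```

Note that the indices here are plain int, in line with the review comment above about avoiding unnecessary int64_t.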
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist