[7/n] decouple quantization impl from vllm dependency - gguf kernel #11019

Merged: zhyncs merged 12 commits into sgl-project:main from FlamingoPg:gguf_quant_stage1 on Oct 11, 2025

Conversation

FlamingoPg (Collaborator) commented on Sep 28, 2025

Motivation

Support SGLang-native GGUF kernels. srt will be switched over in the next PR.

Kernel List

  • Done (see the usage sketch below):
  1. ggml_mul_mat_vec_a8
  2. ggml_mul_mat_a8
  3. ggml_dequantize
  4. ggml_moe_get_block_size
  5. ggml_moe_a8
  6. ggml_moe_a8_vec
  7. moe_sum
  • TODO:

Adapted from: https://github.com/vllm-project/vllm/tree/c85be1f6dd3e20d9b42cd68ff54b328ffeb6cb4b/csrc/quantization/gguf
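
For orientation, here is a minimal Python usage sketch of the two most common entry points. The import path and the exact signatures are assumptions carried over from the vLLM originals these kernels were adapted from; the bindings registered in this PR are authoritative.

    # Hypothetical usage sketch: import path and signatures are assumed
    # from the vLLM originals, not confirmed by this PR.
    import torch
    from sgl_kernel import ggml_dequantize, ggml_mul_mat_vec_a8  # assumed path

    Q4_K = 12          # ggml_type enum value for Q4_K in llama.cpp
    m, n = 4096, 4096  # logical weight shape after dequantization
    # Q4_K packs 256 weights into 144-byte blocks: n // 256 * 144 bytes per row.
    W = torch.randint(0, 256, (m, n // 256 * 144), dtype=torch.uint8, device="cuda")
    x = torch.randn(1, n, dtype=torch.float16, device="cuda")

    W_fp16 = ggml_dequantize(W, Q4_K, m, n)  # full fp16 dequantization
    y = ggml_mul_mat_vec_a8(W, x, Q4_K, m)   # fused quantized mat-vec (batch size 1)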

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

gemini-code-assist (bot) commented:

Summary of Changes

Hello @FlamingoPg, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request marks a significant step towards supporting native GGUF (GGML Universal Format) kernels within the sgl-kernel project. By integrating a suite of CUDA kernels adapted from vLLM and llama.cpp, the project gains capabilities for efficient dequantization, matrix-vector, matrix-matrix, and Mixture-of-Experts (MoE) operations across numerous GGUF quantization types. This foundational work also includes the necessary TorchScript bindings to expose these new functionalities and a comprehensive test suite to ensure their accuracy and reliability. The "Work In Progress" status indicates this is part of a larger effort to enhance quantization support.

Highlights

  • GGUF Kernel Integration: Introduces native GGUF (GGML Universal Format) quantization kernels, enabling efficient operations with quantized models.
  • Decoupled Implementation: The GGUF implementation is now decoupled from external vLLM dependencies, promoting a more self-contained and flexible codebase.
  • New TorchScript Bindings: Adds TorchScript bindings for various GGUF operations, including dequantization, matrix-vector multiplication, matrix-matrix multiplication, and Mixture-of-Experts (MoE) computations.
  • Comprehensive Testing: Includes a new test suite (test_gguf.py) with pytest to validate the correctness of dequantization, matrix-vector, and matrix-matrix multiplication across a wide range of GGUF quantization types and hidden sizes (a minimal sketch follows below).
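
For reference, an accuracy check in the spirit of that test suite might look like the following sketch. The reference quantizer comes from the gguf Python package; the sgl_kernel import path and the tolerances are assumptions, and the test_gguf.py added in this PR is the authoritative version.

    # Minimal pytest sketch; binding names and paths are assumed, not from this PR.
    import pytest
    import torch
    from gguf import GGMLQuantizationType
    from gguf.quants import quantize, dequantize  # reference CPU (de)quantizer

    from sgl_kernel import ggml_dequantize  # assumed import path

    @pytest.mark.parametrize("qtype", [GGMLQuantizationType.Q4_0, GGMLQuantizationType.Q8_0])
    @pytest.mark.parametrize("hidden_size", [512, 4096])
    def test_dequantize_matches_reference(qtype, hidden_size):
        x = torch.rand(64, hidden_size, dtype=torch.float32)
        q_np = quantize(x.numpy(), qtype)  # quantize on CPU with the reference impl
        ref = torch.tensor(dequantize(q_np, qtype), device="cuda")
        out = ggml_dequantize(torch.tensor(q_np, device="cuda"), qtype, *x.shape)
        torch.testing.assert_close(out.float(), ref.float(), rtol=1e-2, atol=1e-2)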

gemini-code-assist bot left a comment

Code Review

This pull request introduces GGUF quantization kernels, decoupling them from vllm dependencies. The changes are extensive, adding several new CUDA source and header files, and registering the new operations with PyTorch. The implementation seems to be a good adaptation from existing codebases like vllm and llama.cpp. However, there are some areas for improvement. Several functions for matrix and MoE operations are missing support for i-quant types, making the feature incomplete. There are also minor issues in the CUDA kernels regarding unnecessary use of 64-bit integers, and the new tests have some issues that will prevent them from running or properly verifying correctness. Addressing these points will improve the completeness and robustness of this new feature.

Comment on lines 362 to 364:

    const int64_t tid = threadIdx.x;
    const int64_t il = tid/8; // 0...3
    const int64_t ib = tid%8; // 0...7

Severity: medium

The variables tid, il, and ib are derived from threadIdx.x, which is an unsigned int with a maximum value of 1023. Declaring them as int64_t is unnecessary and could potentially lead to less optimal code generation. Using int or const auto would be more appropriate and cleaner.

    const int tid = threadIdx.x;
    const int il = tid/8; // 0...3
    const int ib = tid%8; // 0...7

FlamingoPg (Collaborator, Author) commented:

All tests passed; ready for review.

FlamingoPg marked this pull request as ready for review on September 28, 2025 at 14:37.
FlamingoPg changed the title from "[WIP][7/n] decouple quantization impl from vllm dependency - gguf kernel" to "[7/n] decouple quantization impl from vllm dependency - gguf kernel" on Sep 28, 2025.
AniZpZ (Collaborator) commented on Sep 29, 2025:

Please fix lint.

ch-wan added the ready-to-merge label (The PR is ready to merge after the CI is green) on Oct 8, 2025.
zhyncs merged commit 8fdcd98 into sgl-project:main on Oct 11, 2025; 140 of 157 checks passed.
AniZpZ mentioned this pull request on Oct 21, 2025.

Labels: ready-to-merge (The PR is ready to merge after the CI is green), run-ci