[7/n] decouple quantization impl from vllm dependency - gguf kernel #11019
zhyncs merged 12 commits into sgl-project:main
Conversation
Summary of Changes
Hello @FlamingoPg, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request marks a significant step towards supporting native GGUF (GGML Universal Format) kernels.
Code Review
This pull request introduces GGUF quantization kernels, decoupling them from vllm dependencies. The changes are extensive, adding several new CUDA source and header files, and registering the new operations with PyTorch. The implementation seems to be a good adaptation from existing codebases like vllm and llama.cpp. However, there are some areas for improvement. Several functions for matrix and MoE operations are missing support for i-quant types, making the feature incomplete. There are also minor issues in the CUDA kernels regarding unnecessary use of 64-bit integers, and the new tests have some issues that will prevent them from running or properly verifying correctness. Addressing these points will improve the completeness and robustness of this new feature.
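For readers unfamiliar with the registration side mentioned above, the sketch below shows one common way a CUDA op like this can be exposed to PyTorch from C++. It is an illustration only: the namespace sgl_kernel, the op name ggml_dequantize, and its schema are assumptions for this example, not the registration code actually added by this PR.

```cuda
// Minimal sketch of registering a custom CUDA op with PyTorch
// (illustrative names only; not taken from this PR).
#include <torch/library.h>
#include <torch/torch.h>

// Hypothetical host wrapper: allocates the fp16 output and would launch the
// CUDA dequantization kernel for the given GGML quant type (launch omitted).
torch::Tensor ggml_dequantize(torch::Tensor W, int64_t type, int64_t m, int64_t n) {
  auto out = torch::empty({m, n}, W.options().dtype(torch::kHalf));
  // dequantize_kernel<<<grid, block>>>(W.data_ptr(), out.data_ptr(), type, m, n);
  return out;
}

// Declare the operator schema once...
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  m.def("ggml_dequantize(Tensor W, int type, int m, int n) -> Tensor");
}

// ...and bind the CUDA implementation to that schema.
TORCH_LIBRARY_IMPL(sgl_kernel, CUDA, m) {
  m.impl("ggml_dequantize", &ggml_dequantize);
}
```

Once registered this way, the op is reachable from Python as torch.ops.sgl_kernel.ggml_dequantize(...), which is presumably how the new tests would exercise the kernels.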
const int64_t tid = threadIdx.x;
const int64_t il = tid/8; // 0...3
const int64_t ib = tid%8; // 0...7
The variables tid, il, and ib are derived from threadIdx.x, which is an unsigned int with a maximum value of 1023. Declaring them as int64_t is unnecessary and can lead to less optimal code generation; using int or const auto would be more appropriate and cleaner.
const int tid = threadIdx.x;
const int il = tid/8; // 0...3
const int ib = tid%8; // 0...7
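As a sanity check on this suggestion: blockDim.x is capped at 1024 in CUDA, so tid, il, and ib always fit comfortably in a 32-bit int. Widening to int64_t is only needed for indices that feed into global-memory offsets large enough to overflow 2^31, which is not the case for these per-block thread indices.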
All tests passed, ready for review.
fix lint plz
Motivation
Support a native GGUF kernel in sglang. The srt-side integration will follow in the next PR.
Kernel List
sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py, line 598 at commit 2a9d995
Adapted from: https://github.com/vllm-project/vllm/tree/c85be1f6dd3e20d9b42cd68ff54b328ffeb6cb4b/csrc/quantization/gguf
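To give a concrete feel for what these adapted kernels do, here is a minimal CUDA sketch of dequantizing the GGML Q4_0 block format. The layout and names (QK4_0, block_q4_0, dequantize_block_q4_0) follow upstream llama.cpp/vLLM conventions but are reproduced from memory as an illustration, not copied from this PR; the actual kernels cover many more quantization types.

```cuda
// Sketch of the GGML Q4_0 block layout and its dequantization, modeled on the
// upstream code this PR adapts (illustrative only).
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_0 32  // number of weights per Q4_0 block

typedef struct {
  half d;                 // per-block scale
  uint8_t qs[QK4_0 / 2];  // 4-bit quants, two packed per byte
} block_q4_0;

// One thread dequantizes one block of 32 weights into fp16 output y.
__global__ void dequantize_block_q4_0(const block_q4_0* __restrict__ x,
                                      half* __restrict__ y, int k) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;  // block index
  if (i >= k / QK4_0) return;

  const float d = __half2float(x[i].d);
  for (int j = 0; j < QK4_0 / 2; ++j) {
    const int q = x[i].qs[j];
    // Low and high nibbles each hold one 4-bit weight, stored with a +8 offset.
    y[i * QK4_0 + j]             = __float2half(((q & 0x0F) - 8) * d);
    y[i * QK4_0 + j + QK4_0 / 2] = __float2half(((q >> 4) - 8) * d);
  }
}
```

Note that the indices here are plain int, in line with the review comment above about avoiding unnecessary int64_t.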
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist