[CPU][INT4] Add INT4 kernels for CPU #8226
Conversation
Summary of Changes
Hello @jianan-gu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces new CPU kernels for performing linear operations with 4-bit quantized weights (INT4). It provides two main approaches: A16W4 (16-bit activations, 4-bit weights) and A8W4 (8-bit activations, 4-bit weights), including utilities for weight packing and activation quantization. This significantly expands CPU support for efficient inference with quantized models.
Highlights
- **A16W4 Linear Kernel:** Implemented a `tinygemm_kernel` for `BFloat16` and `Half` activations with `quint4x2` (packed INT4) weights, supporting both a custom `tinygemm_kernel_nn` (for smaller M) and a `brgemm` path (for larger M) leveraging Intel's optimized GEMM library. This is found in `sgl-kernel/csrc/cpu/gemm_int4_w4a16.cpp`. A scalar sketch of the underlying INT4 dequantization follows this list.
- **A8W4 Linear Kernel:** Introduced a new set of kernels for 8-bit quantized activations and 4-bit weights, including per-token symmetric quantization for activations (also sketched after this list) and specialized `_dequant_gemm_accum` functions for the matrix multiplication and dequantization. This is found in `sgl-kernel/csrc/cpu/gemm_int4_w4a8.cpp`.
- **Weight Packing Utility:** Added `convert_int4_weight_packed` to pre-process INT4 weights, scales, and zero points into a packed format suitable for efficient CPU execution, including VNNI4 reordering for AVX512 (see the repacking sketch after this list). This utility is part of the A8W4 implementation in `sgl-kernel/csrc/cpu/gemm_int4_w4a8.cpp`.
- **CPU Parallelism Enhancements:** Introduced generic `parallel_2d` and `adjust_num_threads` utilities in `sgl-kernel/csrc/cpu/common.h` to improve thread blocking and utilization for 2D parallel computations; these are leveraged by the new GEMM kernels (a sketch of the blocking idea follows this list).
- **AVX512 Optimizations:** The new kernels heavily utilize AVX512 intrinsics (e.g., `_mm512_dpbf16_ps`, `_mm512_dpbusd_epi32`) for high-performance computation on supported CPUs, particularly evident in both `gemm_int4_w4a16.cpp` and `gemm_int4_w4a8.cpp` (the `_mm512_dpbusd_epi32` usage pattern is sketched after the code-review summary below).
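To make the A16W4 dequantization concrete, here is a minimal scalar sketch. The layout (two K-adjacent nibbles per byte), the group-wise scale/zero-point scheme, and the function name are illustrative assumptions, not the kernel's actual format:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative scalar reference for unpacking quint4x2-style weights:
// packed[k / 2] holds weight k in the low nibble and weight k + 1 in the
// high nibble; every group of `group_size` consecutive k indices shares
// one scale and zero point, and w = (q - zero_point) * scale.
// Assumes K and group_size are even. Layout is hypothetical.
void dequant_int4_column(const uint8_t* packed, const float* scales,
                         const uint8_t* zero_points, float* out,
                         size_t K, size_t group_size) {
  for (size_t k = 0; k < K; k += 2) {
    uint8_t byte = packed[k / 2];
    uint8_t q0 = byte & 0x0F;         // low nibble  -> weight k
    uint8_t q1 = (byte >> 4) & 0x0F;  // high nibble -> weight k + 1
    size_t g = k / group_size;        // both nibbles fall in the same group
    out[k]     = (float(q0) - zero_points[g]) * scales[g];
    out[k + 1] = (float(q1) - zero_points[g]) * scales[g];
  }
}
```

In the actual kernels this unpacking is presumably fused into the GEMM inner loop (converting to `BFloat16`/`Half` on the fly) rather than materializing a full float buffer.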
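The per-token symmetric activation quantization on the A8W4 path can be illustrated with a scalar reference like the one below; the function name, rounding mode, and clamping range are assumptions for illustration:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Per-token symmetric INT8 quantization (scalar sketch): each row
// ("token") of x[M][K] gets scale = max|x| / 127 and is quantized
// without a zero point. Names and rounding details are illustrative.
void quantize_per_token_symmetric(const float* x, int8_t* q, float* scales,
                                  size_t M, size_t K) {
  for (size_t m = 0; m < M; ++m) {
    const float* row = x + m * K;
    float amax = 0.0f;
    for (size_t k = 0; k < K; ++k) amax = std::max(amax, std::fabs(row[k]));
    float scale = amax > 0.0f ? amax / 127.0f : 1.0f;  // avoid div by zero
    scales[m] = scale;
    for (size_t k = 0; k < K; ++k) {
      int v = static_cast<int>(std::nearbyint(row[k] / scale));
      q[m * K + k] = static_cast<int8_t>(std::clamp(v, -127, 127));
    }
  }
}
```

In schemes like this, the int32 GEMM accumulators are then rescaled by `scale_a * scale_w` per (token, output-channel) pair during the dequantize-accumulate step.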
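The VNNI4 reordering exists because AVX512-VNNI instructions such as `_mm512_dpbusd_epi32` consume 4 consecutive K bytes per 32-bit accumulator lane. A plain repacking sketch, ignoring the nibble packing that `convert_int4_weight_packed` additionally performs:

```cpp
#include <cstddef>
#include <cstdint>

// VNNI4 repacking sketch: B[K][N] (row-major) -> B_vnni[K/4][N][4], so
// the 4 K-values feeding one 32-bit accumulator lane become contiguous.
// Assumes K % 4 == 0; the real convert_int4_weight_packed also packs
// two INT4 values per byte and reorders scales/zero points.
void repack_vnni4(const int8_t* B, int8_t* B_vnni, size_t K, size_t N) {
  for (size_t k = 0; k < K; k += 4)
    for (size_t n = 0; n < N; ++n)
      for (size_t d = 0; d < 4; ++d)
        B_vnni[(k / 4) * N * 4 + n * 4 + d] = B[(k + d) * N + n];
}
```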
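The idea behind a `parallel_2d` helper is to assign each thread one contiguous 2D tile of an (MB x NB) grid of cache blocks, rather than slicing a single flattened dimension. A hedged OpenMP sketch of that blocking scheme (not the actual `common.h` implementation):

```cpp
#include <omp.h>

#include <algorithm>
#include <cstdint>

// Hypothetical parallel_2d-style helper: factor the thread count into a
// thr_m x thr_n grid (more threads along the larger dimension) and give
// each thread a contiguous tile of the (MB x NB) block grid. Empty tiles
// are passed through as empty ranges.
template <typename F>
void parallel_2d_sketch(int64_t MB, int64_t NB, const F& f) {
#pragma omp parallel
  {
    int nthr = omp_get_num_threads();
    int tid = omp_get_thread_num();
    // Most-square factorization: thr_m <= thr_n.
    int thr_m = 1, thr_n = nthr;
    for (int t = 2; t * t <= nthr; ++t)
      if (nthr % t == 0) { thr_m = t; thr_n = nthr / t; }
    if (MB > NB) std::swap(thr_m, thr_n);  // more threads on the larger dim
    int tm = tid / thr_n, tn = tid % thr_n;
    int64_t mb_per = (MB + thr_m - 1) / thr_m;
    int64_t nb_per = (NB + thr_n - 1) / thr_n;
    int64_t mb0 = std::min(tm * mb_per, MB), mb1 = std::min(mb0 + mb_per, MB);
    int64_t nb0 = std::min(tn * nb_per, NB), nb1 = std::min(nb0 + nb_per, NB);
    f(mb0, mb1, nb0, nb1);  // this thread owns blocks [mb0, mb1) x [nb0, nb1)
  }
}
```

A caller would loop over `[mb0, mb1) x [nb0, nb1)` inside `f`; `adjust_num_threads` presumably caps the thread count when the block grid is too small to keep every thread busy.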
Code Review
This pull request introduces AWQ linear kernels for CPU, including 2D parallelization helpers and kernels for INT4-weight GEMM with both 16-bit (w4a16) and 8-bit (w4a8) activations. It is recommended to address the potential runtime error with the `at::Half` instantiation and to verify the correctness of the compensation calculation and the symmetric-quantization data type.
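On the compensation point: `_mm512_dpbusd_epi32` multiplies unsigned bytes from its first operand by signed bytes from its second, so signed int8 activations cannot be fed in directly. A common workaround, assumed here for illustration (the PR's exact scheme may differ), is to bias activations by +128 into u8 and subtract the bias contribution afterwards, since `sum((a + 128) * b) = sum(a * b) + 128 * sum(b)`:

```cpp
#include <immintrin.h>

#include <cstdint>

// 64-element int8 dot product via AVX512-VNNI with bias compensation.
// Requires AVX512F + AVX512BW + AVX512-VNNI (e.g. -mavx512f -mavx512bw
// -mavx512vnni). Illustrative only; the PR's kernels fuse this into
// blocked GEMM loops and precompute the compensation per output column.
int32_t dot_u8s8_with_compensation(const int8_t* a, const int8_t* b) {
  const __m512i bias = _mm512_set1_epi8((char)0x80);  // +128 per byte (mod 256)
  __m512i va = _mm512_loadu_si512(a);
  __m512i vb = _mm512_loadu_si512(b);
  __m512i ua = _mm512_add_epi8(va, bias);             // signed -> unsigned
  __m512i acc = _mm512_dpbusd_epi32(_mm512_setzero_si512(), ua, vb);
  int32_t sum = _mm512_reduce_add_epi32(acc);
  // Compensation: remove the 128 * sum(b) term introduced by the bias.
  int32_t comp = 0;
  for (int i = 0; i < 64; ++i) comp += b[i];
  return sum - 128 * comp;
}
```

In a real w4a8 kernel the `128 * sum(weights)` term can be precomputed once per output column at weight-packing time, which is presumably where a compensation calculation would live.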
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/rerun-failed-ci
2 similar comments
This PR implements CPU INT4 kernels, which are called by the CPU AWQ frontend: https://github.com/sgl-project/sglang/pull/8225/files