[CPU][INT4] Add INT4 kernels for CPU #8226

Merged
Kangyan-Zhou merged 48 commits into sgl-project:main from jianan-gu:cpu_int4_kernel on Jan 30, 2026

Conversation

@jianan-gu (Contributor) commented on Jul 21, 2025

This PR implements CPU INT4 kernels, which are called by the CPU AWQ frontend: https://github.com/sgl-project/sglang/pull/8225/files

  • Includes: AWQLinear and AWQMoE

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @jianan-gu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new CPU kernels for performing linear operations with 4-bit quantized weights (INT4). It provides two main approaches: one for A16W4 (16-bit activations, 4-bit weights) and another for A8W4 (8-bit activations, 4-bit weights), including utilities for weight packing and activation quantization. This significantly expands the CPU backend's capability for efficient inference with quantized models.

Highlights

  • A16W4 Linear Kernel: Implements a tinygemm_kernel for BFloat16 and Half activations with quint4x2 (packed INT4) weights, supporting both a custom tinygemm_kernel_nn path (for smaller M) and a brgemm path (for larger M) that leverages Intel's optimized GEMM library. Found in sgl-kernel/csrc/cpu/gemm_int4_w4a16.cpp; see the first sketch after this list.
  • A8W4 Linear Kernel: Introduces a new set of kernels for 8-bit quantized activations and 4-bit weights, including per-token symmetric quantization for activations (second sketch below) and specialized _dequant_gemm_accum functions for the matrix multiplication and dequantization. Found in sgl-kernel/csrc/cpu/gemm_int4_w4a8.cpp.
  • Weight Packing Utility: Adds convert_int4_weight_packed to pre-process INT4 weights, scales, and zero points into a packed format suited to efficient CPU execution, including VNNI4 reordering for AVX512 (third sketch below). This utility is part of the A8W4 implementation in sgl-kernel/csrc/cpu/gemm_int4_w4a8.cpp.
  • CPU Parallelism Enhancements: Introduces generic parallel_2d and adjust_num_threads utilities in sgl-kernel/csrc/cpu/common.h to improve thread blocking and utilization for 2D parallel computations; the new GEMM kernels build on them (fourth sketch below).
  • AVX512 Optimizations: The new kernels rely heavily on AVX512 intrinsics (e.g., _mm512_dpbf16_ps, _mm512_dpbusd_epi32) for high-performance computation on supported CPUs, in both gemm_int4_w4a16.cpp and gemm_int4_w4a8.cpp (fifth sketch below).
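
To make the A16W4 path concrete, here is a minimal sketch of the kind of BFloat16 dot-product step that _mm512_dpbf16_ps enables. This is not the PR's tinygemm_kernel; the function name dot_bf16_32 and the fp32-input convention are assumptions for illustration, and it requires AVX512-BF16 (e.g. compile with -mavx512bf16).

```cpp
// Hedged sketch: a 32-element bf16 dot product with AVX512-BF16.
// The real kernel operates on pre-packed bf16 tiles; here the inputs are
// fp32 and converted on the fly. Compile with: g++ -O2 -mavx512f -mavx512bf16
#include <immintrin.h>

float dot_bf16_32(const float* a, const float* b) {
    // Pack two fp32 vectors (32 floats total) into one bf16 vector each.
    __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + 16), _mm512_loadu_ps(a));
    __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + 16), _mm512_loadu_ps(b));
    // Each fp32 lane accumulates the product of one bf16 pair from va and vb.
    __m512 acc = _mm512_dpbf16_ps(_mm512_setzero_ps(), va, vb);
    return _mm512_reduce_add_ps(acc);
}
```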
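
The per-token symmetric activation quantization in the A8W4 path can be shown in plain scalar C++ (the PR's version is vectorized). One scale per token row, derived from the row's absolute maximum; quantize_per_token and its flat row-major layout are assumed names and conventions for this sketch.

```cpp
// Hedged sketch of per-token symmetric int8 quantization (scalar form).
// One scale per token: scale = max(|x_row|) / 127, q = round(x / scale).
#include <algorithm>
#include <cmath>
#include <cstdint>

void quantize_per_token(const float* x, int8_t* q, float* scales,
                        int64_t tokens, int64_t dim) {
    for (int64_t t = 0; t < tokens; ++t) {
        const float* row = x + t * dim;
        // The row's absolute maximum sets the symmetric range.
        float amax = 0.f;
        for (int64_t i = 0; i < dim; ++i) {
            amax = std::max(amax, std::fabs(row[i]));
        }
        const float scale = amax > 0.f ? amax / 127.f : 1.f;
        scales[t] = scale;
        // Round-to-nearest; results stay within [-127, 127] by construction.
        for (int64_t i = 0; i < dim; ++i) {
            q[t * dim + i] = static_cast<int8_t>(std::lround(row[i] / scale));
        }
    }
}
```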
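
For the weight format, this is roughly what consuming quint4x2 data looks like in scalar form: each byte carries two unsigned 4-bit values, dequantized as (nibble - zero) * scale. The single scale/zero per row here is a simplification of the PR's grouped AWQ layout, and dequant_int4_row is a hypothetical name.

```cpp
// Hedged sketch: dequantize a row of quint4x2 weights (two unsigned 4-bit
// values per byte) with one scale and zero point for the whole row.
// The PR's real format groups scales/zeros and reorders for VNNI4.
#include <cstdint>

void dequant_int4_row(const uint8_t* packed, float scale, int32_t zero,
                      float* out, int64_t n /* number of weights, even */) {
    for (int64_t i = 0; i < n / 2; ++i) {
        const uint8_t b = packed[i];
        const int lo = b & 0x0F;         // first 4-bit value
        const int hi = (b >> 4) & 0x0F;  // second 4-bit value
        out[2 * i]     = static_cast<float>(lo - zero) * scale;
        out[2 * i + 1] = static_cast<float>(hi - zero) * scale;
    }
}
```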
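
A plausible shape for a parallel_2d utility, sketched with OpenMP: factor the thread count into a near-square Tm x Tn grid and give each thread one contiguous tile of the (M, N) iteration space. This is an assumption about the utility's behavior, not the actual code in common.h.

```cpp
// Hedged sketch of 2D thread blocking. Compile with: g++ -O2 -fopenmp
#include <omp.h>
#include <algorithm>
#include <cstdint>

template <typename F>
void parallel_2d_sketch(int64_t M, int64_t N, const F& f) {
#pragma omp parallel
    {
        const int nt  = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        // Factor nt into Tm x Tn, preferring a near-square grid.
        int Tm = 1;
        for (int t = 1; t * t <= nt; ++t) {
            if (nt % t == 0) Tm = t;
        }
        const int Tn = nt / Tm;
        const int tm = tid / Tn, tn = tid % Tn;
        // Contiguous tile owned by this thread.
        const int64_t mb = (M + Tm - 1) / Tm, nb = (N + Tn - 1) / Tn;
        const int64_t m0 = tm * mb, m1 = std::min(M, m0 + mb);
        const int64_t n0 = tn * nb, n1 = std::min(N, n0 + nb);
        if (m0 < m1 && n0 < n1) f(m0, m1, n0, n1);
    }
}
```

A caller would invoke it as parallel_2d_sketch(M, N, [&](int64_t m0, int64_t m1, int64_t n0, int64_t n1) { /* tile work */ }). Blocking in both dimensions keeps enough parallelism even when M (the token count) is small, where a 1D row partition would leave most threads idle.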
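
On the int8 side, _mm512_dpbusd_epi32 multiplies unsigned 8-bit bytes of one operand by signed 8-bit bytes of the other and accumulates groups of four products into 32-bit lanes; this unsigned-times-signed asymmetry is why w4a8 kernels typically carry a compensation term, which the review below asks to verify. A minimal sketch (dot_u8s8_64 is an assumed name):

```cpp
// Hedged sketch: dot product of 64 uint8 activations with 64 int8 values
// via AVX512-VNNI. Compile with: g++ -O2 -mavx512f -mavx512vnni
#include <immintrin.h>
#include <cstdint>

int32_t dot_u8s8_64(const uint8_t* a, const int8_t* w) {
    __m512i va  = _mm512_loadu_si512(a);
    __m512i vw  = _mm512_loadu_si512(w);
    // Each 32-bit lane accumulates four u8*s8 products.
    __m512i acc = _mm512_dpbusd_epi32(_mm512_setzero_si512(), va, vw);
    return _mm512_reduce_add_epi32(acc);
}
```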

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces AWQ linear kernels for CPU, including 2D parallelization helpers and kernels for INT4-weight GEMM with both 16-bit (w4a16) and 8-bit (w4a8) activations. It is recommended to address the potential runtime error with the at::Half instantiation and to verify the correctness of the compensation calculation and the symmetric quantization data type.

jianan-gu and others added 2 commits July 21, 2025 19:56
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@jianan-gu changed the title from [CPU][INT4] Add AWQ Linear kernels for CPU to [CPU][INT4] Add INT4 Linear kernels for CPU on Jul 21, 2025
@jianan-gu changed the title from [CPU][INT4] Add INT4 Linear kernels for CPU to [CPU][INT4] Add INT4 kernels for CPU on Jul 30, 2025
@mingfeima marked this pull request as a draft on August 1, 2025 00:06
@mingfeima added the intel and cpu labels on Aug 1, 2025
@jianan-gu requested a review from fzyzcjy as a code owner on December 10, 2025 04:30
@jianan-gu (Contributor, Author) commented:

/rerun-failed-ci

2 similar comments

@jianan-gu (Contributor, Author) commented:

/rerun-failed-ci

@jianan-gu (Contributor, Author) commented:

/rerun-failed-ci

@Kangyan-Zhou merged commit c35aa02 into sgl-project:main on Jan 30, 2026
24 of 40 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
jianan-gu added a commit to jianan-gu/sglang that referenced this pull request Feb 3, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Labels

cpu, deepseek, intel, quant, run-ci, sgl-kernel
