## 1. Quantization refactor
### Background
#### Scheme structure
Currently, quantization methods are implemented in two ways: with a scheme structure, as in compressed_tensors and quark, and without one, as in modelslim for NPU or AWQ. Forgoing a scheme structure can be acceptable when only one format is loaded, but supporting many formats without format-specific dispatch overloads the get_quant_method function. Overall, the more quantization methods are supported, the larger each respective file grows.
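To make the maintenance cost concrete, here is a minimal, hypothetical sketch (the class names and returned strings are illustrative, not taken from the actual code) of how a scheme-less config tends to grow: every new checkpoint variant or layer type adds another branch inside `get_quant_method`.

```python
# Hypothetical, simplified sketch: without a scheme layer, all format- and
# layer-specific dispatch is funneled through one ever-growing method.

class LinearBase:  # stand-in for the linear layer base class
    pass


class FusedMoE:  # stand-in for the fused MoE layer
    pass


class MonolithicQuantConfig:
    def __init__(self, weight_bits: int, is_dynamic: bool):
        self.weight_bits = weight_bits
        self.is_dynamic = is_dynamic

    def get_quant_method(self, layer, prefix: str):
        # Every new checkpoint variant adds another branch here.
        if isinstance(layer, LinearBase):
            if self.weight_bits == 4:
                return "w4a16-linear-method"
            if self.weight_bits == 8 and self.is_dynamic:
                return "w8a8-dynamic-linear-method"
            return "w8a8-static-linear-method"
        if isinstance(layer, FusedMoE):
            if self.weight_bits == 4:
                return "w4a16-moe-method"
            return "w8a8-moe-method"
        return None
```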
#### Weight loading and inference
Weight creation and inference code currently live in the same class, even though the same inference code could be reused by different frameworks.
### Motivation
#### Support more scheme structures
The key benefit of following the scheme structure is maintainability: each format becomes easier to update, implement, and review.
Below is an example of the proposed scheme structure change:
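The original figure is not reproduced here; as a stand-in, the following is a minimal sketch of what a scheme-structured config can look like, loosely modeled on the compressed-tensors approach. All class names (`QuantizationScheme`, `W4A16Scheme`, `SchemeBasedQuantConfig`, etc.) are illustrative assumptions: the config only resolves which scheme a layer uses, while each scheme owns its own weight creation and apply logic.

```python
# Hypothetical sketch of a scheme-structured config. The config only picks
# a scheme per layer; each scheme owns create_weights() and apply().
from abc import ABC, abstractmethod


class QuantizationScheme(ABC):
    @abstractmethod
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        """Register quantized weight tensors on the layer."""

    @abstractmethod
    def apply(self, layer, x):
        """Run the quantized matmul for this scheme."""


class W4A16Scheme(QuantizationScheme):
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        ...  # allocate packed int4 weights, scales, zero points

    def apply(self, layer, x):
        ...  # call the w4a16 kernel


class W8A8DynamicScheme(QuantizationScheme):
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        ...  # allocate int8 weights and per-channel scales

    def apply(self, layer, x):
        ...  # quantize activations on the fly, call the int8 kernel


class SchemeBasedQuantConfig:
    def __init__(self, weight_bits: int, is_dynamic: bool):
        self.weight_bits = weight_bits
        self.is_dynamic = is_dynamic

    def get_scheme(self, layer, prefix: str) -> QuantizationScheme:
        # Adding a new format means adding a scheme class plus one line here.
        if self.weight_bits == 4:
            return W4A16Scheme()
        return W8A8DynamicScheme()
```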
#### Split weight loading and inference
Quant config, weight creation, and kernel call logic should be clearly separated so that different frameworks can use the same kernel where it fits. This avoids code duplication, improves readability, and prevents circular imports. The end goal is a unified, simpler structure for quantization functionality; the main source of inspiration for the refactoring is the compressed-tensors scheme structure.
Below are image examples of the proposed change for AWQ:
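The AWQ images are likewise not reproduced; the sketch below illustrates the intended split under assumed names (`awq_apply`, `AWQLinearMethod`, and the tensor layout are placeholders, not the actual implementation): the kernel call lives in a standalone function that any framework can import, while the linear method owns only weight creation and delegates inference to that function.

```python
# Hypothetical sketch of the weight-init / kernel-call split for AWQ.
import torch


def awq_apply(x: torch.Tensor, qweight: torch.Tensor, scales: torch.Tensor,
              qzeros: torch.Tensor, group_size: int) -> torch.Tensor:
    """Pure kernel-call wrapper: no weight loading, importable from anywhere."""
    # The actual backend GEMM kernel would be invoked here.
    raise NotImplementedError("backend kernel goes here")


class AWQLinearMethod:
    """Owns weight creation only; inference delegates to awq_apply()."""

    def __init__(self, group_size: int):
        self.group_size = group_size

    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        pack_factor = 8  # eight 4-bit values per int32
        layer.qweight = torch.empty(input_size, output_size // pack_factor,
                                    dtype=torch.int32)
        layer.scales = torch.empty(input_size // self.group_size, output_size,
                                   dtype=torch.float16)
        layer.qzeros = torch.empty(input_size // self.group_size,
                                   output_size // pack_factor,
                                   dtype=torch.int32)

    def apply(self, layer, x: torch.Tensor) -> torch.Tensor:
        return awq_apply(x, layer.qweight, layer.scales, layer.qzeros,
                         self.group_size)
```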

- NPU-specific refactoring and format support: [Feature] Ascend NPU quantization refactoring & more quantization formats support #14424
- Compressed-Tensors, ModelSlim, Quark MoE schemes
  - [2/N] Quantization Refactor: Compressed tensors MoE schemes #17503
  - [3/N] Quantization Refactor: ModelSlim MoE schemes #17993
  - [4/N] Quantization Refactor: Quark MoE schemes #18252
  - Look into a possible improvement: `# compressed-tensors checkpoints with packed weights are stored flipped`
- Support schemes for AWQ, GPTQ, Auto Round, GGUF
- Kernel call and weight init split
## 2. Non-Linear Module & Communication Quantization
Objective: Optimize components beyond standard linear layers to further improve performance.
- Attention
  - MLA Quantization @hammersam
  - GQA/MHA Quantization
- Improved KV Cache Quantization
  - feat: Add FP8 KV cache support for Triton attention backend #18882 @zack041
  - Support int8 kv cache for NPU @TamirBaydasov
- Communication Quantization
  - AllReduce @m8ngotree
## 3. New formats support
- NVFP4 Quantization support (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- Improved AutoQuantize (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- FP4 KV-Cache Support (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- mxfp8 support @zianglih: Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE #17449
- Online Rotation (for FlatQuant, etc.)
- Vector quantization (for QuIP#, AQLM, VPTQ)