## 1. Quantization refactor
### Background
#### Scheme structure
Currently, quantization methods are implemented in two ways: with a scheme structure, as in compressed_tensors and quark, and without one, as in modelslim for NPU or AWQ. Forgoing a scheme structure can be acceptable when only one format is loaded, but supporting many formats without format-specific dispatch overloads the get_quant_method function. Overall, the more quantization methods are supported, the larger each respective file grows.
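To make the maintenance cost concrete, here is a minimal, hypothetical sketch (the class names and returned strings are illustrative, not taken from the actual code) of how a scheme-less config tends to grow: every new checkpoint variant or layer type adds another branch inside `get_quant_method`.

```python
# Hypothetical, simplified sketch: without a scheme layer, all format- and
# layer-specific dispatch is funneled through one ever-growing method.

class LinearBase:  # stand-in for the linear layer base class
    pass


class FusedMoE:  # stand-in for the fused MoE layer
    pass


class MonolithicQuantConfig:
    def __init__(self, weight_bits: int, is_dynamic: bool):
        self.weight_bits = weight_bits
        self.is_dynamic = is_dynamic

    def get_quant_method(self, layer, prefix: str):
        # Every new checkpoint variant adds another branch here.
        if isinstance(layer, LinearBase):
            if self.weight_bits == 4:
                return "w4a16-linear-method"
            if self.weight_bits == 8 and self.is_dynamic:
                return "w8a8-dynamic-linear-method"
            return "w8a8-static-linear-method"
        if isinstance(layer, FusedMoE):
            if self.weight_bits == 4:
                return "w4a16-moe-method"
            return "w8a8-moe-method"
        return None
```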
#### Weight loading and inference
Weight creation and inference code currently live in the same class, even though the same inference code could be reused by different frameworks.
### Motivation
#### Support more scheme structures
The key benefit of following the scheme structure is maintainability: each format becomes easier to update, implement, and review.
Below is an example of the proposed scheme structure change:
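The original figure is not reproduced here; as a stand-in, the following is a minimal sketch of what a scheme-structured config can look like, loosely modeled on the compressed-tensors approach. All class names (`QuantizationScheme`, `W4A16Scheme`, `SchemeBasedQuantConfig`, etc.) are illustrative assumptions: the config only resolves which scheme a layer uses, while each scheme owns its own weight creation and apply logic.

```python
# Hypothetical sketch of a scheme-structured config. The config only picks
# a scheme per layer; each scheme owns create_weights() and apply().
from abc import ABC, abstractmethod


class QuantizationScheme(ABC):
    @abstractmethod
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        """Register quantized weight tensors on the layer."""

    @abstractmethod
    def apply(self, layer, x):
        """Run the quantized matmul for this scheme."""


class W4A16Scheme(QuantizationScheme):
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        ...  # allocate packed int4 weights, scales, zero points

    def apply(self, layer, x):
        ...  # call the w4a16 kernel


class W8A8DynamicScheme(QuantizationScheme):
    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        ...  # allocate int8 weights and per-channel scales

    def apply(self, layer, x):
        ...  # quantize activations on the fly, call the int8 kernel


class SchemeBasedQuantConfig:
    def __init__(self, weight_bits: int, is_dynamic: bool):
        self.weight_bits = weight_bits
        self.is_dynamic = is_dynamic

    def get_scheme(self, layer, prefix: str) -> QuantizationScheme:
        # Adding a new format means adding a scheme class plus one line here.
        if self.weight_bits == 4:
            return W4A16Scheme()
        return W8A8DynamicScheme()
```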
#### Split weight loading and inference
Quant config, weight creation, and kernel call logic should be clearly separated so that different frameworks can use the same kernel where it fits. This avoids code duplication, improves readability, and prevents circular imports. The end goal is a unified, simpler structure for quantization functionality; the main source of inspiration for the refactoring is the compressed-tensors scheme structure.
Below are image examples of the proposed change for AWQ:
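The AWQ images are likewise not reproduced; the sketch below illustrates the intended split under assumed names (`awq_apply`, `AWQLinearMethod`, and the tensor layout are placeholders, not the actual implementation): the kernel call lives in a standalone function that any framework can import, while the linear method owns only weight creation and delegates inference to that function.

```python
# Hypothetical sketch of the weight-init / kernel-call split for AWQ.
import torch


def awq_apply(x: torch.Tensor, qweight: torch.Tensor, scales: torch.Tensor,
              qzeros: torch.Tensor, group_size: int) -> torch.Tensor:
    """Pure kernel-call wrapper: no weight loading, importable from anywhere."""
    # The actual backend GEMM kernel would be invoked here.
    raise NotImplementedError("backend kernel goes here")


class AWQLinearMethod:
    """Owns weight creation only; inference delegates to awq_apply()."""

    def __init__(self, group_size: int):
        self.group_size = group_size

    def create_weights(self, layer, input_size: int, output_size: int) -> None:
        pack_factor = 8  # eight 4-bit values per int32
        layer.qweight = torch.empty(input_size, output_size // pack_factor,
                                    dtype=torch.int32)
        layer.scales = torch.empty(input_size // self.group_size, output_size,
                                   dtype=torch.float16)
        layer.qzeros = torch.empty(input_size // self.group_size,
                                   output_size // pack_factor,
                                   dtype=torch.int32)

    def apply(self, layer, x: torch.Tensor) -> torch.Tensor:
        return awq_apply(x, layer.qweight, layer.scales, layer.qzeros,
                         self.group_size)
```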

- NPU-specific refactoring and format support: [Feature] Ascend NPU quantization refactoring & more quantization formats support #14424
- Compressed-Tensors, ModelSlim, Quark MoE schemes
  - [2/N] Quantization Refactor: Compressed tensors MoE schemes #17503
  - [3/N] Quantization Refactor: ModelSlim MoE schemes #17993
  - [4/N] Quantization Refactor: Quark MoE schemes #18252
  - Look into a possible improvement: `# compressed-tensors checkpoints with packed weights are stored flipped`
- Support schemes for AWQ, GPTQ, Auto Round, GGUF
- Kernel call and weight init split
## 2. Non-Linear Module & Communication Quantization
Objective: Optimize components beyond standard linear layers to further improve performance.
- Attention
  - MLA Quantization @hammersam
  - GQA/MHA Quantization
- Improved KV Cache Quantization
  - feat: Add FP8 KV cache support for Triton attention backend #18882 @zack041
  - Support int8 kv cache for NPU @TamirBaydasov
- Communication Quantization
  - AllReduce @m8ngotree
## 3. New formats support
- NVFP4 Quantization support (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- Improved AutoQuantize (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- FP4 KV-Cache Support (Roadmap: SGLang Nvidia Collaboration Roadmap (2026 Q1) #17130)
- mxfp8 support @zianglih: Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE #17449
- Online Rotation (for FlatQuant, etc.)
- Vector quantization (for QuIP#, AQLM, VPTQ)