[Roadmap] Quantization Modifications #15194

@TamirBaydasov

1. Quantization refactor

Background

Scheme structure

Currently, quantization methods are implemented in two ways: with a scheme structure, as in compressed_tensors and quark, and without one, as in modelslim for NPU or AWQ. Forgoing the scheme structure can be acceptable when only one format is loaded, but supporting many formats without per-scheme logic overloads the get_quant_method function. Overall, the more quant methods are supported, the larger each respective file grows.
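
As a minimal sketch of the dispatch problem, the snippet below contrasts a branch-per-format get_quant_method with a registry of scheme classes. All names here (QuantScheme, AWQScheme, SCHEME_REGISTRY, ...) are hypothetical placeholders for illustration, not the actual API:

```python
# Every name in this sketch is hypothetical.
class QuantScheme:
    """Base class: one subclass per quantization format."""
    @classmethod
    def from_config(cls, quant_config):
        return cls()

class AWQScheme(QuantScheme): ...
class ModelSlimScheme(QuantScheme): ...

# With a scheme structure, adding a format means registering one class;
# get_quant_method stays a constant-size lookup instead of a function
# that grows an if/elif branch for every newly supported format.
SCHEME_REGISTRY = {
    "awq": AWQScheme,
    "modelslim": ModelSlimScheme,
}

def get_quant_method(quant_config):
    scheme_cls = SCHEME_REGISTRY[quant_config["format"]]
    return scheme_cls.from_config(quant_config)

print(get_quant_method({"format": "awq"}))  # -> AWQScheme instance
```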

Weight loading and inference

Weight creation and inference code are currently implemented in the same class, even though the same inference code could be reused by different frameworks.
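
For illustration, here is a hedged sketch of the current coupling; the class, method, and attribute names are made up, not the real implementation:

```python
import torch

# Hypothetical sketch of the current layout: weight creation and the
# inference call share one class, so a framework that only wants the
# kernel also drags in the weight-loading state.
class AWQLinearMethod:
    def create_weights(self, in_features, out_features, group_size=128):
        pack_factor = 32 // 4  # eight 4-bit values packed per int32
        self.qweight = torch.zeros(in_features, out_features // pack_factor,
                                   dtype=torch.int32)
        self.scales = torch.zeros(in_features // group_size, out_features,
                                  dtype=torch.float16)

    def apply(self, x):
        # kernel call entangled with the state created above
        out_features = self.scales.shape[-1]
        return x.new_zeros(*x.shape[:-1], out_features)  # placeholder
```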

Motivation

Support more scheme structures

The key benefit of following the scheme structure is that it is much easier to maintain and update, which simplifies both implementation and review.
Below is an example of the proposed scheme structure change, followed by a code sketch:

[Image: proposed scheme structure change]
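
The following is a hedged sketch of what a per-scheme class could look like, loosely modeled on the compressed-tensors layout; the class and method names (QuantizationScheme, W4A16Scheme, create_weights, apply_weights) are illustrative assumptions:

```python
from abc import ABC, abstractmethod
import torch

# Illustrative base class: each scheme owns its weight layout and its
# kernel selection, so new formats add a subclass instead of a branch.
class QuantizationScheme(ABC):
    @abstractmethod
    def create_weights(self, layer, in_features, out_features):
        """Attach this scheme's weight tensors to the layer."""

    @abstractmethod
    def apply_weights(self, layer, x):
        """Run the forward pass with this scheme's kernel."""

class W4A16Scheme(QuantizationScheme):
    def create_weights(self, layer, in_features, out_features):
        layer.qweight = torch.nn.Parameter(
            torch.zeros(in_features, out_features // 8, dtype=torch.int32),
            requires_grad=False,
        )

    def apply_weights(self, layer, x):
        # placeholder for a 4-bit weight / 16-bit activation kernel
        return x.new_zeros(*x.shape[:-1], layer.qweight.shape[-1] * 8)
```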

Split weight loading and inference

Quant config, weight creation, and kernel call logic should be clearly separated so that different frameworks can use the same kernel where it fits. This avoids code duplication, improves code readability, and prevents circular imports. The end goal is a unified and simpler structure for the quantization functionality; the main source of inspiration for the refactoring ideas is the compressed-tensors scheme structure.
Below are image examples of the proposed change for AWQ, followed by a code sketch of the split:
[Image: proposed AWQ change (1 of 2)]

[Image: proposed AWQ change (2 of 2)]
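
To make the three-way separation concrete, here is a hedged sketch for AWQ; the function and field names (AWQConfig, create_awq_weights, awq_gemm) are assumptions for illustration, not the actual refactor:

```python
from dataclasses import dataclass
import torch

@dataclass
class AWQConfig:
    """1) Quant config: describes how the checkpoint was quantized."""
    weight_bits: int = 4
    group_size: int = 128

def create_awq_weights(cfg: AWQConfig, in_features: int, out_features: int):
    """2) Weight creation: layout only, no kernel knowledge."""
    pack_factor = 32 // cfg.weight_bits
    return {
        "qweight": torch.zeros(in_features, out_features // pack_factor,
                               dtype=torch.int32),
        "scales": torch.zeros(in_features // cfg.group_size, out_features,
                              dtype=torch.float16),
        "qzeros": torch.zeros(in_features // cfg.group_size,
                              out_features // pack_factor, dtype=torch.int32),
    }

def awq_gemm(x: torch.Tensor, weights: dict, cfg: AWQConfig) -> torch.Tensor:
    """3) Kernel call: the only piece another framework needs to import."""
    out_features = weights["scales"].shape[-1]
    return x.new_zeros(*x.shape[:-1], out_features)  # placeholder kernel
```

With this split, a framework that already has its own weight loading could call only the kernel entry point and skip the rest.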

2. Non-Linear Module & Communication Quantization

Objective: Optimize components beyond standard linear layers to further improve performance.

3. New format support
