1. Decouple Quantization Implementation from vLLM
Objective: Refactor the code so the quantization module is easier to maintain and extend without depending on vLLM internals (a rough interface sketch follows this list).
- Weight Only Methods
  - GPTQ:
    - [1/n] chore: decouple quantization implementation from vLLM dependency #7992
    - [2/n] decouple quantization implementation from vLLM dependency #8112
    - [4/n] decouple quantization implementation from vLLM dependency #9191
  - AWQ: [3/n] chore: decouple AWQ implementation from vLLM dependency #8113
  - Compressed Tensors: [6/n] decouple quantization implementation from vLLM dependency #10750
- MoE Quantization Optimization
- Kernel Optimization
  - fbgemm fp8: [5/n] decouple quantization implementation from vLLM dependency #9454
  - gguf: [7/n] decouple quantization impl from vllm dependency - gguf kernel #11019
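To make the decoupling concrete, here is a minimal sketch of the kind of layer-agnostic interface such a refactor tends to converge on: a config object that parses checkpoint quantization metadata, and a per-layer method object that creates and applies quantized weights. The class and method names (`QuantizationConfig`, `QuantizeMethodBase`, `create_weights`, `apply`) are illustrative assumptions, not the exact sglang API.

```python
# Illustrative sketch only; names do not necessarily match sglang's code.
from abc import ABC, abstractmethod

import torch


class QuantizationConfig(ABC):
    """Parses checkpoint quantization metadata into a method-specific config."""

    @classmethod
    @abstractmethod
    def from_config(cls, hf_quant_config: dict) -> "QuantizationConfig":
        ...

    @abstractmethod
    def get_quant_method(self, layer: torch.nn.Module) -> "QuantizeMethodBase":
        ...


class QuantizeMethodBase(ABC):
    """Creates quantized weights for one layer and runs its quantized matmul."""

    @abstractmethod
    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        ...

    @abstractmethod
    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        ...
```

Keeping methods behind an interface like this lets individual kernels be swapped (e.g. the fbgemm fp8 and gguf items above) without touching model code.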
2. Quantization on Various Hardware Platforms (Other than GPUs)
Objective: Extend sglang's efficient inference capabilities to a broader range of hardware.
- Ascend NPUs
- Intel Xeon CPUs
  - W8A8
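As a rough illustration of the W8A8 item above, here is a minimal sketch of symmetric per-tensor int8 weight-and-activation quantization. The function names are hypothetical, and the float matmul stands in for the int8 GEMM with int32 accumulation that a real CPU kernel would use.

```python
import torch


def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns int8 values and a scale."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale


def w8a8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    """Quantize activations on the fly, multiply, and rescale the result."""
    x_q, x_scale = quantize_int8(x)
    # A real kernel would run an int8 GEMM with int32 accumulation; a float
    # matmul keeps this demo hardware-agnostic with comparable numerics.
    acc = torch.matmul(x_q.to(torch.float32), w_q.to(torch.float32).t())
    return acc * (x_scale * w_scale)


w = torch.randn(256, 128)                              # [out_features, in_features]
w_q, w_scale = quantize_int8(w)                        # weights quantized once, offline
y = w8a8_linear(torch.randn(4, 128), w_q, w_scale)     # [4, 256] output
```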
3. Non-Linear Module & Communication Quantization
Objective: Optimize components beyond standard linear layers to further improve performance.
- Attention
  - MLA Quantization
  - GQA/MHA Quantization
  - Improved KV Cache Quantization @Wilbolu (see the sketch after this list)
- Communication Quantization
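For the KV cache quantization item above, the following is a minimal sketch of per-head int8 quantization of cached keys/values with one scale per head; the tensor layout and function names are illustrative and do not mirror sglang's actual KV cache.

```python
import torch


def quantize_kv(kv: torch.Tensor):
    """kv: [num_tokens, num_heads, head_dim] -> int8 values plus per-head scales."""
    scale = kv.abs().amax(dim=(0, 2), keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(kv / scale), -128, 127).to(torch.int8)
    return q, scale


def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Restore float values when a cache entry is read back for attention."""
    return q.to(torch.float32) * scale


k = torch.randn(16, 8, 64)            # 16 cached tokens, 8 heads, head_dim 64
k_q, k_scale = quantize_kv(k)         # stored at 1 byte per element plus scales
k_deq = dequantize_kv(k_q, k_scale)   # dequantized (or fused) at attention time
```

Storing the cache at one byte per element halves KV memory relative to FP16, at the cost of a dequantization step (or a fused quantized-attention kernel) on read.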
4. Support for More Features & Novel Formats
Objective: Stay current with cutting-edge quantization techniques and data formats.
- MXFP4 Quantization
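To sketch what the MXFP4 item refers to: in the OCP microscaling format, values are grouped into blocks of 32 that share one power-of-two (E8M0) scale, and each element is stored as a 4-bit E2M1 float. The snippet below is a fake-quantization simulation of that idea, assuming a simplified scale-selection rule; it is not a packed-tensor or kernel implementation.

```python
import torch

# Representable non-negative magnitudes of the E2M1 (FP4) element format.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def mxfp4_fake_quant(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Round x to values representable by MXFP4 blocks (simulation only)."""
    orig_shape = x.shape
    x = x.reshape(-1, block_size)
    # One shared power-of-two scale per block (E8M0 in the MX spec), chosen so
    # the block maximum lands near the top of the FP4 range (6.0).
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = torch.exp2(torch.floor(torch.log2(amax / 6.0)))
    scaled = x / scale
    # Snap each scaled magnitude to the nearest FP4 grid point, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (torch.sign(scaled) * FP4_GRID[idx] * scale).reshape(orig_shape)


y = mxfp4_fake_quant(torch.randn(64))   # 64 values -> two blocks of 32
```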