LoTA-QAF: Accepted to NeurIPS'25 as a poster. 😁
arxiv.org/abs/2505.18724
This repository contains the code for LoTA-QAF, a novel fine-tuning method for quantized Large Language Models (LLMs). It enables the lossless merging of ternary adaptation weights and the adjustment of all quantized weights. LoTA-QAF combines:
- Custom-designed Ternary Adaptation (TA) that aligns ternary weights with the quantization grid to adjust quantized weights.
- A TA-based mechanism for the lossless merging of adaptation weights.
- Ternary Signed Gradient Descent (t-SignSGD) for updating TA weights.
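The first two components can be pictured with a toy example. The sketch below is an illustration under stated assumptions, not the code in LoTA/layer.py or LoTA/lota_merge.py: it assumes two ternary factor matrices whose integer product is thresholded by a value `omega` into a ternary adjustment in {-1, 0, +1}, which is then added to the integer weights and clipped to the quantization grid. The function names, the shapes, and the exact threshold rule are hypothetical; see the paper for the actual formulation.

```python
def matmul(a, b):
    """Multiply two small integer matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ternarize(delta, omega):
    """Map each entry to +1, -1, or 0 using threshold omega (assumed rule)."""
    return [[1 if v > omega else -1 if v < -omega else 0 for v in row] for row in delta]


def merge_ternary(w_int, adjust, bits):
    """Add a ternary adjustment to integer weights and clip to [0, 2**bits - 1]."""
    q_max = 2 ** bits - 1
    return [
        [min(max(w + t, 0), q_max) for w, t in zip(w_row, t_row)]
        for w_row, t_row in zip(w_int, adjust)
    ]


# Toy 2x2 layer with rank-3 ternary factors (all values are made up).
A = [[1, 1, 1], [-1, 0, -1]]         # 2 x 3, entries in {-1, 0, 1}
B = [[1, -1], [1, 0], [1, -1]]       # 3 x 2, entries in {-1, 0, 1}
W_int = [[14, 5], [0, 15]]           # 4-bit integer weights

delta = matmul(A, B)                 # integer entries in [-3, 3]
adjust = ternarize(delta, omega=1)   # keep only entries with |value| > 1
merged = merge_ternary(W_int, adjust, bits=4)

print(delta)   # integer product of the ternary factors
print(adjust)  # ternary adjustment
print(merged)  # merged weights, still integers on the 4-bit grid
```

Because the adjustment is itself an integer step on the quantization grid, the merged weights in this sketch need no re-quantization, which is the intuition behind a lossless merge.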
- Core Logic:
  - `LoTA_QAF_main.py`: The main script for training LoTA-QAF and running evaluations (lm-eval for MMLU, and `evalGSV.py` for GSM8K, SQL, and ViGGO).
- LoTA Components (located in the `LoTA/` directory):
  - `LoTA/layer.py`: Contains `CustomLoraLinear`, which implements Ternary Adaptation and is used for training.
  - `LoTA/adapter.py`: Provides the `LTA` (Lossless Ternary Adaptation) classes for loading trained Ternary Adaptation weights during inference and evaluation.
  - `LoTA/lota_merge.py`: Contains the logic for merging Ternary Adaptation weights into the quantized model weights.
- Optimizer:
  - `t_signSGD.py`: Implementation of the Ternary Signed Gradient Descent (t-SignSGD) optimizer used to train Ternary Adaptation.
- Utility Modules:
  - `data_print_save.py`: Utility functions for preparing datasets (e.g., Alpaca, GSM8K, SQL, ViGGO), printing configurations, and saving experimental results.
  - `evalGSV.py`: A custom evaluation script for task-specific datasets such as GSM8K, SQL, and ViGGO.
  - `gptq_quantize.py`: A script that quantizes models with the GPTQModel library to prepare them for QAF.
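As rough intuition for what a ternary signed update could look like, here is a minimal sketch. It is not the code in t_signSGD.py: it assumes that only the entries with the largest gradient magnitudes are updated, that each selected entry moves one step against the sign of its gradient, and that the result is clamped back into {-1, 0, +1}. The function name `ternary_sign_step` and the `keep_ratio` parameter are hypothetical.

```python
def ternary_sign_step(weights, grads, keep_ratio):
    """One hypothetical signed update on a flat list of ternary weights.

    Only the `keep_ratio` fraction of entries with the largest |gradient|
    is updated; each moves one step against its gradient sign and is
    clamped back into {-1, 0, 1}.
    """
    k = max(1, int(round(len(grads) * keep_ratio)))
    # Indices of the k largest-magnitude gradients.
    top = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:k]
    updated = list(weights)
    for i in top:
        sign = (grads[i] > 0) - (grads[i] < 0)
        updated[i] = max(-1, min(1, updated[i] - sign))
    return updated


weights = [0, 1, -1, 0, 1, 0, -1, 0, 0, 1]
grads = [0.01, -0.90, 0.02, 0.80, 0.03, -0.04, 0.05, 0.00, -0.70, 0.06]
new_weights = ternary_sign_step(weights, grads, keep_ratio=0.3)
print(new_weights)
```

Under this reading, a filter ratio of 0.95 would correspond to `keep_ratio=0.05`, since the README describes it as discarding 95% and selecting the top 5%.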
- CUDA Version: 12.2 (Recommended).
The LoTA-QAF implementation is built upon specific versions of key libraries:
- `peft==0.15.1`
- `gptqmodel==2.1.1.dev0`
It is recommended to install packages using a virtual environment.
pip install -r requirements.txt
# For detailed versioning of all dependencies, please refer to the environment.yml file.

The main script `LoTA_QAF_main.py` operates in two modes: Training (mode 1) and Evaluation (mode 2).
Common Base Parameters (baseConfig):
- `--mode`: `1` for training, `2` for evaluation.
- `--pretrained`: Path to the base pre-trained model (e.g., `/your_path/models/llama_3.1_8B_Instruct`).
- `--quantized_model_dir`: Path to the quantized model directory (e.g., `/your_path/quant_models/8B_instruct/int4_64_asym`).
Mode 1:
For training, you'll primarily use trainingConfig arguments alongside baseConfig.
# --interval_point: Omega for LoTA-QAF.
# --filter_ratio: Sigma_t for LoTA-QAF; 0.95 means discard 95% and select the top 5%.
# --min_grad: effective range 0.95-0.999 in 0-80% of epochs. See "Baselines and Hyper-parameters" in the paper; the naming is not ideal and has not been updated yet.
# --filter_upper: 0.999-0.9999 in 20-100% of epochs.
python LoTA_QAF_main.py \
--mode 1 \
--pretrained "/your_path/models/llama_3.1_8B_Instruct" \
--quantized_model_dir "/your_path/quant_models/8B_instruct/int4_64_asym" \
--lota_qaf True \
--training_data_name "alpaca" \
--adapter_path "your_path/adapter_output" \
--interval_point 48 \
--filter_ratio 0.95 \
--min_grad 0.999 \
--filter_upper 0.9999 \
--max_steps 300 \
--save_number 5 \
--train_batch_size 64 \
--gradient_accumulation_steps 1

Mode 2:
For evaluation, you'll use evalConfig arguments. Parameters like pretrained, quantized_model_dir, w_bits, group_size, lora_r, lora_alpha, and lota_qaf are often automatically inferred from the --load_adapter path if an adapter is being evaluated.
# Example 1: Evaluate a GPTQ model with a LoTA-QAF adapter on MMLU
python LoTA_QAF_main.py \
--mode 2 \
--load_adapter "/path/to/your/trained_lota_adapter/8B_int4_LoTA_48_0.950_0.999_alpaca_..." \
--tasks "mmlu" \
--num_fewshot 5 \
--eval_batch_size 16 \
--output_path "./eval_results"
# --auto_gptq "gptq" is the default when loading an adapter with a GPTQ model, so it is omitted here.
# Example 2: Evaluate a GPTQ model with a LoTA-QAF adapter on a task-specific dataset (e.g., gsm8k)
python LoTA_QAF_main.py \
--mode 2 \
--load_adapter "/path/to/your/trained_lota_adapter/8B_int4_LoTA_48_0.950_0.999_gsm8k" \
--output_path "./eval_results_gsv" \
--ft_dataset_name "gsm8k" \
--eval_batch_size 64 \
--auto_gptq "gptq"

- Key Evaluation Parameters (`evalConfig`):
  - `--load_adapter`: Path to the trained adapter to load. Set to `"none"` to evaluate the base model without an adapter. Many parameters, such as `lota_qaf`, `w_bits`, `group_size`, the `pretrained` model path, and `quantized_model_dir`, are auto-configured from this path.
  - `--auto_gptq`: Use `"gptq"` to load a GPTQ-quantized model (with or without an adapter). Use `"none"` to load a 16-bit model (typically to evaluate a base 16-bit model without an adapter).
  - `--tasks`: List of tasks for lm-eval (e.g., "mmlu").
  - `--ft_dataset_name`: For task-specific evaluation using `evalGSV.py` (e.g., "gsm8k", "sql", "viggo"). If not "none", this evaluation type is chosen over lm-eval.
  - `--num_fewshot`: Number of few-shot examples for lm-eval.
  - `--eval_batch_size`: Batch size for evaluation.
  - `--output_path`: Directory to save evaluation results.
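The interaction between `--ft_dataset_name`, `--tasks`, and `--auto_gptq` can be summarized in a few lines. The helper below is a hypothetical sketch of the rules as described here, not code taken from LoTA_QAF_main.py; the function name and the returned dictionary keys are assumptions made for illustration.

```python
def describe_eval_run(auto_gptq, ft_dataset_name, tasks):
    """Summarize which loader and evaluator the described rules would select."""
    # "gptq" loads a GPTQ-quantized model; "none" loads a 16-bit model.
    loader = "gptq" if auto_gptq == "gptq" else "16-bit"
    # A task-specific dataset takes precedence over lm-eval tasks.
    if ft_dataset_name != "none":
        return {"loader": loader, "evaluator": "evalGSV", "target": ft_dataset_name}
    return {"loader": loader, "evaluator": "lm-eval", "target": tasks}


run_a = describe_eval_run("gptq", "none", ["mmlu"])
run_b = describe_eval_run("gptq", "gsm8k", ["mmlu"])
print(run_a)
print(run_b)
```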
Automatic Parameter Configuration:
The script includes logic to automatically determine several parameters, especially in evaluation mode (mode 2) when --load_adapter is specified. This includes:
- `w_bits` and `group_size`, from `quantized_model_dir` (training) or the `load_adapter` path (evaluation).
- `lora_r` and `lora_alpha`, based on model size.
- `lota_qaf` and `load_ada_interval` (Omega), based on the `load_adapter` path structure.
- The `pretrained` model path and `quantized_model_dir`, based on the model size and quantization bits inferred from the `load_adapter` path.

Note: You will need to update the placeholder `/your_path/` in the script (`base_args.pretrained = f"/your_path/models/{pre}"` and `base_args.quantized_model_dir = f"/your_path/quant_models/{model_size}_instruct/int{base_args.w_bits}_{base_args.group_size}_asym"`) to your actual model paths for this auto-configuration to work correctly.
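To make the naming convention concrete, the sketch below shows one way such values could be read from the example paths used in this README. It is an assumption about the naming pattern, not the parsing logic in LoTA_QAF_main.py: the function names, the regular expressions, and the returned keys are hypothetical, and only the fields visible in the example names are extracted.

```python
import re
from pathlib import PurePosixPath


def parse_quant_dir(quantized_model_dir):
    """Extract (w_bits, group_size) from a directory name such as 'int4_64_asym'."""
    name = PurePosixPath(quantized_model_dir).name
    match = re.fullmatch(r"int(\d+)_(\d+)_\w+", name)
    if match is None:
        raise ValueError(f"unrecognized quantized model directory: {name}")
    return int(match.group(1)), int(match.group(2))


def parse_adapter_name(load_adapter):
    """Extract model size, bit width, and Omega from a LoTA adapter directory name."""
    name = PurePosixPath(load_adapter).name
    match = re.match(r"(\d+B)_int(\d+)_LoTA_(\d+)_", name)
    if match is None:
        raise ValueError(f"unrecognized adapter name: {name}")
    return {
        "model_size": match.group(1),
        "w_bits": int(match.group(2)),
        "lota_qaf": True,             # the name contains the LoTA tag
        "omega": int(match.group(3)),
    }


quant_info = parse_quant_dir("/your_path/quant_models/8B_instruct/int4_64_asym")
adapter_info = parse_adapter_name(
    "/path/to/your/trained_lota_adapter/8B_int4_LoTA_48_0.950_0.999_gsm8k"
)
print(quant_info)
print(adapter_info)
```

Raising an error on unrecognized names, rather than guessing defaults, keeps a mistyped path from silently selecting the wrong base model.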
- Version: 2025.05.15
- License: MIT License
Copyright (c) 2025 [KingdalfGoodman]
This project is licensed under the MIT License - see the LICENSE file for details