Commit 3f7e828

Refactor Quantizer for reusing in QAT (#9276)
* Refactor Quantizer for reusing in QAT
* Address more reviewer comments
* update yaml config

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 67bc846 commit 3f7e828

File tree

6 files changed: +153 additions, −159 deletions

.github/workflows/cicd-main.yml

Lines changed: 14 additions & 14 deletions
@@ -213,10 +213,10 @@ jobs:
     with:
       RUNNER: self-hosted-azure
       SCRIPT: |
-        python examples/nlp/language_modeling/megatron_quantization.py \
-            model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+        python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+            model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
             quantization.algorithm=null \
-            model_save=/home/TestData/nlp/megatron_llama/ci_baseline
+            export.save_path=/home/TestData/nlp/megatron_llama/ci_baseline
       AFTER_SCRIPT: |
         rm -rf /home/TestData/nlp/megatron_llama/ci_baseline
@@ -226,16 +226,16 @@ jobs:
     with:
       RUNNER: self-hosted-azure
       SCRIPT: |
-        python examples/nlp/language_modeling/megatron_quantization.py \
-            model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
-            tensor_model_parallel_size=2 \
+        python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+            model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+            model.tensor_model_parallel_size=2 \
             trainer.devices=2 \
             quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \
             quantization.algorithm=fp8 \
             quantization.num_calib_size=8 \
             inference.batch_size=2 \
             export.inference_tensor_parallel=2 \
-            model_save=/home/TestData/nlp/megatron_llama/ci_fp8.qnemo
+            export.save_path=/home/TestData/nlp/megatron_llama/ci_fp8.qnemo
       AFTER_SCRIPT: |
         rm -rf /home/TestData/nlp/megatron_llama/ci_fp8.qnemo
@@ -245,13 +245,13 @@ jobs:
     with:
       RUNNER: self-hosted-azure
       SCRIPT: |
-        python examples/nlp/language_modeling/megatron_quantization.py \
-            model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+        python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+            model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
             quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \
             quantization.algorithm=int8_sq \
             quantization.num_calib_size=8 \
             inference.batch_size=2 \
-            model_save=/home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo
+            export.save_path=/home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo
       AFTER_SCRIPT: |
         rm -rf /home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo
@@ -274,15 +274,15 @@ jobs:
     # - name: Checkout repository
     #   uses: actions/checkout@v4
     # - run: |
-    #     python examples/nlp/language_modeling/megatron_quantization.py \
-    #         model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
-    #         tensor_model_parallel_size=1 \
+    #     python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+    #         model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+    #         model.tensor_model_parallel_size=1 \
     #         trainer.devices=1 \
     #         quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \
    #         quantization.algorithm=int4_awq \
    #         quantization.num_calib_size=8 \
    #         inference.batch_size=2 \
-    #         model_save=/home/TestData/nlp/megatron_llama/ci_int4_awq.qnemo
+    #         export.save_path=/home/TestData/nlp/megatron_llama/ci_int4_awq.qnemo
     #
     #     rm -rf /home/TestData/nlp/megatron_llama/ci_int4_awq.qnemo
     #- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"

docs/source/nlp/quantization.rst

Lines changed: 5 additions & 5 deletions
@@ -73,17 +73,17 @@ The script must be launched correctly with the number of processes equal to tens
 
 .. code-block:: bash
 
-    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_quantization.py \
-        model_file=llama2-70b-base-bf16.nemo \
-        tensor_model_parallel_size=8 \
-        pipeline_model_parallel_size=1 \
+    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_gpt_quantization.py \
+        model.restore_from_path=llama2-70b-base-bf16.nemo \
+        model.tensor_model_parallel_size=8 \
+        model.pipeline_model_parallel_size=1 \
         trainer.num_nodes=1 \
         trainer.devices=8 \
         trainer.precision=bf16 \
         quantization.algorithm=fp8 \
         export.decoder_type=llama \
         export.inference_tensor_parallel=2 \
-        model_save=llama2-70b-base-fp8-qnemo
+        export.save_path=llama2-70b-base-fp8-qnemo

examples/nlp/language_modeling/conf/megatron_quantization.yaml renamed to examples/nlp/language_modeling/conf/megatron_gpt_quantization.yaml

Lines changed: 15 additions & 10 deletions
@@ -20,21 +20,26 @@ trainer:
   precision: bf16 # 16, 32, or bf16
   enable_checkpointing: false
 
+model:
+  tensor_model_parallel_size: 1
+  pipeline_model_parallel_size: 1
+  restore_from_path: llama2-7b-fp16.nemo # Nemo file path
+
+  ## Activation Checkpoint
+  activations_checkpoint_granularity: null # 'selective' or 'full'
+  activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
+
 quantization:
-  quantize_bmm1: false
-  algorithm: fp8 # int8_sq, fp8, int8, int4_awq, null
+  decoder_type: ${export.decoder_type} # gptnext, gpt2, llama
+  algorithm: fp8 # null, int8_sq, fp8, int4_awq
   calib_dataset: cnn_dailymail # wikitext, cnn_dailymail, or a local dataset
   num_calib_size: 512 # number of samples used for calibration
-  awq_block_size: 128 # block size for scaling factors in AWQ algorithm
-  alpha: 1.0 # alpha parameter in SmoothQuant algorithm
+  awq_block_size: 128 # block size for scaling factors (only used in AWQ algorithms)
+  sq_alpha: 1.0 # alpha parameter (only used in SmoothQuant algorithms)
 
 export:
   decoder_type: llama # gptnext, gpt2, llama
   inference_tensor_parallel: 1 # Default using 1 TP for inference
   inference_pipeline_parallel: 1 # Default using 1 PP for inference
-  dtype: bf16 # Default precision data type
-
-model_file: llama2-7b-fp16.nemo # Nemo file path
-model_save: llama2-7b-fp8.qnemo # Path where the quantized model will be saved
-tensor_model_parallel_size: 1
-pipeline_model_parallel_size: 1
+  dtype: ${trainer.precision} # Default precision data type
+  save_path: llama2-7b-${quantization.algorithm}.qnemo # Path where the quantized model will be saved
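The reworked config leans on OmegaConf-style interpolation (`dtype: ${trainer.precision}`, `save_path: llama2-7b-${quantization.algorithm}.qnemo`) so that export defaults track the rest of the file instead of being hard-coded. A hand-rolled sketch of how such `${a.b}` references resolve, purely illustrative (the real resolution is performed by OmegaConf/Hydra, not by this helper):

```python
import re

def resolve(cfg, value):
    """Replace ${a.b} references in value with lookups into the nested dict cfg."""
    def lookup(match):
        node = cfg
        for key in match.group(1).split("."):
            node = node[key]  # walk the dotted path
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

# Mirrors the relevant keys of megatron_gpt_quantization.yaml
cfg = {
    "trainer": {"precision": "bf16"},
    "quantization": {"algorithm": "fp8"},
    "export": {
        "dtype": "${trainer.precision}",
        "save_path": "llama2-7b-${quantization.algorithm}.qnemo",
    },
}

print(resolve(cfg, cfg["export"]["dtype"]))      # bf16
print(resolve(cfg, cfg["export"]["save_path"]))  # llama2-7b-fp8.qnemo
```

A consequence of the interpolation is that overriding `trainer.precision` or `quantization.algorithm` on the command line also changes the export dtype and default save path, which is why the CI jobs above only set `export.save_path` explicitly.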

examples/nlp/language_modeling/megatron_quantization.py renamed to examples/nlp/language_modeling/megatron_gpt_quantization.py

Lines changed: 37 additions & 16 deletions
@@ -15,32 +15,38 @@
 import torch
 import torch.multiprocessing as mp
 from datasets import load_dataset
+from omegaconf import OmegaConf
+from pytorch_lightning.trainer.trainer import Trainer
+from tqdm import tqdm
 
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy
 from nemo.core.config import hydra_runner
 from nemo.export.quantize import Quantizer
+from nemo.utils.model_utils import load_config
 
 mp.set_start_method("spawn", force=True)
 
 """
 Nemo quantization example script.
 
 Please consult nemo.export.quantize.Quantizer class
-and examples/nlp/language_modeling/conf/megatron_quantization.yaml config on available quantization methods,
+and examples/nlp/language_modeling/conf/megatron_gpt_quantization.yaml config on available quantization methods,
 models supported as well as how to set up data and inference for calibration (with defaults recommended).
 
 Example usage:
 ```
-python examples/nlp/language_modeling/megatron_quantization.py \
-    model_file=llama2-7b-fp16.nemo \
-    model_save=llama2-7b-fp8.qnemo \
+python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+    model.restore_from_path=llama2-7b-fp16.nemo \
     quantization.algorithm=fp8 \
     export.decoder_type=llama \
     export.inference_tensor_parallel=1
+    export.save_path=llama2-7b-fp8.qnemo \
 ```
 """
 
 
-def get_calib_dataloader(data="cnn_dailymail", batch_size=64, calib_size=512, max_sequence_length=512):
+def get_calib_data_iter(data="cnn_dailymail", batch_size=64, calib_size=512, max_sequence_length=512):
     if data == "wikitext":
         dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")
         text_column = "text"
@@ -59,31 +65,46 @@ def get_calib_dataloader(data="cnn_dailymail", batch_size=64, calib_size=512, ma
         yield batch
 
 
-@hydra_runner(config_path="conf", config_name="megatron_quantization")
+@hydra_runner(config_path="conf", config_name="megatron_gpt_quantization")
 def main(cfg) -> None:
     if not torch.cuda.is_available():
-        raise EnvironmentError("GPU is required for the inference.")
+        raise EnvironmentError("GPU is required for the quantization.")
 
-    quantizer = Quantizer(cfg.quantization, cfg.inference, cfg.export, cfg.trainer)
+    # Initialize quantizer
+    quantizer = Quantizer(cfg.quantization, cfg.export)
+
+    # Overwrite model config with the one from the model checkpoint and apply quantization modifications
+    model_cfg = load_config(cfg.model.restore_from_path)
+    model_cfg.update(cfg.model)
+    model_cfg = quantizer.modify_model_config(model_cfg)
+
+    trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
+    model = MegatronGPTModel.restore_from(
+        restore_path=cfg.model.restore_from_path, override_config_path=model_cfg, trainer=trainer
+    )
+    model.freeze()
 
     # Quantization algorithm can be set to None. This is useful for baseline precision
     # accuracy validation. In this case only weights export step will be performed:
     if cfg.quantization.algorithm is not None:
-        dataloader = get_calib_dataloader(
+        data_iter = get_calib_data_iter(
             cfg.quantization.calib_dataset,
             cfg.inference.batch_size,
             cfg.quantization.num_calib_size,
             cfg.inference.max_context_length,
         )
-        dataloader = [data for data in dataloader]
-    else:
-        dataloader = None
+        dataloader = [data for data in data_iter]
 
-    model = quantizer.quantize(
-        cfg.model_file, dataloader, cfg.tensor_model_parallel_size, cfg.pipeline_model_parallel_size
-    )
+        def forward_loop(model):
+            # NOTE: Alternatively you can also use `model.forward_bwd_step(data_iter, forward_only=True)`
+            # if your model is setup for training.
+            model.set_inference_config(OmegaConf.to_container(cfg.inference))
+            for i, batch in enumerate(tqdm(dataloader, desc="Calibrating")):
+                model.predict_step(batch, i)
+
+        model = quantizer.quantize(model, forward_loop)
 
-    quantizer.export(model, cfg.model_save)
+    quantizer.export(model)
 
 
 if __name__ == '__main__':
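The central change in this file is the new calling convention: `Quantizer` no longer restores the checkpoint or drives calibration itself; the script builds the model and passes a `forward_loop` callback into `quantize(model, forward_loop)`, which is what makes the class reusable for QAT. The control flow can be sketched with a toy stand-in (all names hypothetical; this is not the real NeMo API):

```python
# Toy sketch of the inversion-of-control pattern: the quantizer hands the
# model back to a caller-supplied loop instead of owning data loading.

class ToyQuantizer:
    """Stands in for nemo.export.quantize.Quantizer; counts calibration batches."""

    def __init__(self):
        self.calibrated_batches = 0

    def quantize(self, model, forward_loop):
        # The caller decides how forward passes are run (PTQ inference here,
        # a training step in a QAT setting), so this class stays agnostic.
        forward_loop(model)
        return model


def toy_model(batch):
    # Hypothetical model: doubles each element of a batch.
    return [x * 2 for x in batch]


calib_data = [[1, 2], [3, 4]]
quantizer = ToyQuantizer()


def forward_loop(model):
    # Analogue of the script's forward_loop: iterate calibration batches
    # and run the model on each one.
    for batch in calib_data:
        model(batch)
        quantizer.calibrated_batches += 1


quantized_model = quantizer.quantize(toy_model, forward_loop)
print(quantizer.calibrated_batches)  # 2
```

In the real script the same shape appears with `MegatronGPTModel.restore_from(...)` producing the model and `model.predict_step(batch, i)` inside the loop; a QAT pipeline can substitute its own training loop without touching the quantizer.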
