Commit e57a462

Merge branch 'main' of https://github.com/NVIDIA/NeMo into hillst-bn2/add-pipelineparallel-dtype

2 parents: a507623 + 3f7e828

File tree: 7 files changed, +154 −160 lines


.github/workflows/cicd-main.yml

Lines changed: 14 additions & 14 deletions
@@ -213,10 +213,10 @@ jobs:
     with:
       RUNNER: self-hosted-azure
       SCRIPT: |
-        python examples/nlp/language_modeling/megatron_quantization.py \
-            model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+        python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+            model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
             quantization.algorithm=null \
-            model_save=/home/TestData/nlp/megatron_llama/ci_baseline
+            export.save_path=/home/TestData/nlp/megatron_llama/ci_baseline
       AFTER_SCRIPT: |
         rm -rf /home/TestData/nlp/megatron_llama/ci_baseline
@@ -226,16 +226,16 @@ jobs:
     with:
       RUNNER: self-hosted-azure
       SCRIPT: |
-        python examples/nlp/language_modeling/megatron_quantization.py \
-            model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
-            tensor_model_parallel_size=2 \
+        python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+            model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+            model.tensor_model_parallel_size=2 \
             trainer.devices=2 \
             quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \
             quantization.algorithm=fp8 \
             quantization.num_calib_size=8 \
             inference.batch_size=2 \
             export.inference_tensor_parallel=2 \
-            model_save=/home/TestData/nlp/megatron_llama/ci_fp8.qnemo
+            export.save_path=/home/TestData/nlp/megatron_llama/ci_fp8.qnemo
       AFTER_SCRIPT: |
         rm -rf /home/TestData/nlp/megatron_llama/ci_fp8.qnemo
@@ -245,13 +245,13 @@ jobs:
     with:
       RUNNER: self-hosted-azure
       SCRIPT: |
-        python examples/nlp/language_modeling/megatron_quantization.py \
-            model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+        python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+            model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
             quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \
             quantization.algorithm=int8_sq \
             quantization.num_calib_size=8 \
             inference.batch_size=2 \
-            model_save=/home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo
+            export.save_path=/home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo
       AFTER_SCRIPT: |
         rm -rf /home/TestData/nlp/megatron_llama/ci_int8_sq.qnemo
@@ -274,15 +274,15 @@ jobs:
 #      - name: Checkout repository
 #        uses: actions/checkout@v4
 #      - run: |
-#          python examples/nlp/language_modeling/megatron_quantization.py \
-#              model_file=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
-#              tensor_model_parallel_size=1 \
+#          python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+#              model.restore_from_path=/home/TestData/nlp/megatron_llama/llama_ci.nemo \
+#              model.tensor_model_parallel_size=1 \
 #              trainer.devices=1 \
 #              quantization.calib_dataset=/home/TestData/nlp/test_quantization/test.json \
 #              quantization.algorithm=int4_awq \
 #              quantization.num_calib_size=8 \
 #              inference.batch_size=2 \
-#              model_save=/home/TestData/nlp/megatron_llama/ci_int4_awq.qnemo
+#              export.save_path=/home/TestData/nlp/megatron_llama/ci_int4_awq.qnemo
 #
 #          rm -rf /home/TestData/nlp/megatron_llama/ci_int4_awq.qnemo
 #- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
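The overrides in these CI jobs moved from flat keys (`model_file=...`, `model_save=...`) to nested, Hydra-style dotted keys (`model.restore_from_path=...`, `export.save_path=...`). As a rough illustration of what dotted overrides do to a nested config, here is a minimal stdlib sketch (the `apply_overrides` helper is hypothetical; Hydra's real override grammar is much richer):

```python
def apply_overrides(config, overrides):
    """Apply Hydra-style dotted key=value overrides to a nested dict (illustrative sketch only)."""
    for item in overrides:
        key, _, value = item.partition("=")  # split on the first '=' only
        node = config
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # create intermediate sections as needed
        node[parts[-1]] = value
    return config

cfg = {"model": {"tensor_model_parallel_size": 1}}
apply_overrides(cfg, ["model.restore_from_path=llama_ci.nemo", "quantization.algorithm=null"])
```

Note that the sketch keeps every value as a string; Hydra additionally type-converts values against the config schema.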

docs/source/nlp/quantization.rst

Lines changed: 5 additions & 5 deletions
@@ -73,17 +73,17 @@ The script must be launched correctly with the number of processes equal to tens

 .. code-block:: bash

-    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_quantization.py \
-        model_file=llama2-70b-base-bf16.nemo \
-        tensor_model_parallel_size=8 \
-        pipeline_model_parallel_size=1 \
+    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_gpt_quantization.py \
+        model.restore_from_path=llama2-70b-base-bf16.nemo \
+        model.tensor_model_parallel_size=8 \
+        model.pipeline_model_parallel_size=1 \
         trainer.num_nodes=1 \
         trainer.devices=8 \
         trainer.precision=bf16 \
         quantization.algorithm=fp8 \
         export.decoder_type=llama \
         export.inference_tensor_parallel=2 \
-        model_save=llama2-70b-base-fp8-qnemo
+        export.save_path=llama2-70b-base-fp8-qnemo
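The docs hunk keeps the requirement that the script be launched with a process count matching the model-parallel layout (here `torchrun --nproc-per-node 8` for TP=8, PP=1). Assuming no data parallelism during calibration, that count is simply TP x PP; a tiny sketch (the helper name is hypothetical):

```python
def required_world_size(tensor_model_parallel_size, pipeline_model_parallel_size):
    """Processes torchrun must launch: one rank per model-parallel shard (assumes data-parallel size 1)."""
    return tensor_model_parallel_size * pipeline_model_parallel_size

# The documented llama2-70b example: TP=8, PP=1 -> 8 processes on one node.
world_size = required_world_size(8, 1)
```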

examples/nlp/language_modeling/conf/megatron_quantization.yaml renamed to examples/nlp/language_modeling/conf/megatron_gpt_quantization.yaml

Lines changed: 15 additions & 10 deletions
@@ -20,21 +20,26 @@ trainer:
   precision: bf16 # 16, 32, or bf16
   enable_checkpointing: false

+model:
+  tensor_model_parallel_size: 1
+  pipeline_model_parallel_size: 1
+  restore_from_path: llama2-7b-fp16.nemo # Nemo file path
+
+  ## Activation Checkpoint
+  activations_checkpoint_granularity: null # 'selective' or 'full'
+  activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
+
 quantization:
-  quantize_bmm1: false
-  algorithm: fp8 # int8_sq, fp8, int8, int4_awq, null
+  decoder_type: ${export.decoder_type} # gptnext, gpt2, llama
+  algorithm: fp8 # null, int8_sq, fp8, int4_awq
   calib_dataset: cnn_dailymail # wikitext, cnn_dailymail, or a local dataset
   num_calib_size: 512 # number of samples used for calibration
-  awq_block_size: 128 # block size for scaling factors in AWQ algorithm
-  alpha: 1.0 # alpha parameter in SmoothQuant algorithm
+  awq_block_size: 128 # block size for scaling factors (only used in AWQ algorithms)
+  sq_alpha: 1.0 # alpha parameter (only used in SmoothQuant algorithms)

 export:
   decoder_type: llama # gptnext, gpt2, llama
   inference_tensor_parallel: 1 # Default using 1 TP for inference
   inference_pipeline_parallel: 1 # Default using 1 PP for inference
-  dtype: bf16 # Default precision data type
-
-model_file: llama2-7b-fp16.nemo # Nemo file path
-model_save: llama2-7b-fp8.qnemo # Path where the quantized model will be saved
-tensor_model_parallel_size: 1
-pipeline_model_parallel_size: 1
+  dtype: ${trainer.precision} # Default precision data type
+  save_path: llama2-7b-${quantization.algorithm}.qnemo # Path where the quantized model will be saved
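The reworked config leans on OmegaConf interpolation: `export.dtype` now tracks `${trainer.precision}` and the save path embeds `${quantization.algorithm}`. A toy stdlib resolver sketches how such references expand (this is not OmegaConf's implementation; `resolve` is a hypothetical helper, and real interpolation also supports nesting, custom resolvers, and escapes):

```python
import re

def resolve(cfg):
    """Expand simple ${section.key} references in string values (toy model of OmegaConf interpolation)."""
    def lookup(match):
        node = cfg
        for part in match.group(1).split("."):  # walk the dotted path
            node = node[part]
        return str(node)
    return {
        section: {k: re.sub(r"\$\{([^}]+)\}", lookup, v) if isinstance(v, str) else v
                  for k, v in values.items()}
        for section, values in cfg.items()
    }

cfg = {
    "trainer": {"precision": "bf16"},
    "quantization": {"algorithm": "fp8"},
    "export": {"dtype": "${trainer.precision}",
               "save_path": "llama2-7b-${quantization.algorithm}.qnemo"},
}
resolved = resolve(cfg)
```

With this, changing `trainer.precision` or `quantization.algorithm` once updates the export dtype and save path automatically, which is the point of the config change.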

examples/nlp/language_modeling/megatron_quantization.py renamed to examples/nlp/language_modeling/megatron_gpt_quantization.py

Lines changed: 37 additions & 16 deletions
@@ -15,32 +15,38 @@
 import torch
 import torch.multiprocessing as mp
 from datasets import load_dataset
+from omegaconf import OmegaConf
+from pytorch_lightning.trainer.trainer import Trainer
+from tqdm import tqdm

+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy
 from nemo.core.config import hydra_runner
 from nemo.export.quantize import Quantizer
+from nemo.utils.model_utils import load_config

 mp.set_start_method("spawn", force=True)

 """
 Nemo quantization example script.

 Please consult nemo.export.quantize.Quantizer class
-and examples/nlp/language_modeling/conf/megatron_quantization.yaml config on available quantization methods,
+and examples/nlp/language_modeling/conf/megatron_gpt_quantization.yaml config on available quantization methods,
 models supported as well as how to set up data and inference for calibration (with defaults recommended).

 Example usage:
 ```
-python examples/nlp/language_modeling/megatron_quantization.py \
-    model_file=llama2-7b-fp16.nemo \
-    model_save=llama2-7b-fp8.qnemo \
+python examples/nlp/language_modeling/megatron_gpt_quantization.py \
+    model.restore_from_path=llama2-7b-fp16.nemo \
     quantization.algorithm=fp8 \
     export.decoder_type=llama \
     export.inference_tensor_parallel=1
+    export.save_path=llama2-7b-fp8.qnemo \
 ```
 """


-def get_calib_dataloader(data="cnn_dailymail", batch_size=64, calib_size=512, max_sequence_length=512):
+def get_calib_data_iter(data="cnn_dailymail", batch_size=64, calib_size=512, max_sequence_length=512):
     if data == "wikitext":
         dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")
         text_column = "text"
@@ -59,31 +65,46 @@ def get_calib_dataloader(data="cnn_dailymail", batch_size=64, calib_size=512, ma
         yield batch


-@hydra_runner(config_path="conf", config_name="megatron_quantization")
+@hydra_runner(config_path="conf", config_name="megatron_gpt_quantization")
 def main(cfg) -> None:
     if not torch.cuda.is_available():
-        raise EnvironmentError("GPU is required for the inference.")
+        raise EnvironmentError("GPU is required for the quantization.")

-    quantizer = Quantizer(cfg.quantization, cfg.inference, cfg.export, cfg.trainer)
+    # Initialize quantizer
+    quantizer = Quantizer(cfg.quantization, cfg.export)
+
+    # Overwrite model config with the one from the model checkpoint and apply quantization modifications
+    model_cfg = load_config(cfg.model.restore_from_path)
+    model_cfg.update(cfg.model)
+    model_cfg = quantizer.modify_model_config(model_cfg)
+
+    trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
+    model = MegatronGPTModel.restore_from(
+        restore_path=cfg.model.restore_from_path, override_config_path=model_cfg, trainer=trainer
+    )
+    model.freeze()

     # Quantization algorithm can be set to None. This is useful for baseline precision
     # accuracy validation. In this case only weights export step will be performed:
     if cfg.quantization.algorithm is not None:
-        dataloader = get_calib_dataloader(
+        data_iter = get_calib_data_iter(
             cfg.quantization.calib_dataset,
             cfg.inference.batch_size,
             cfg.quantization.num_calib_size,
             cfg.inference.max_context_length,
         )
-        dataloader = [data for data in dataloader]
-    else:
-        dataloader = None
+        dataloader = [data for data in data_iter]

-    model = quantizer.quantize(
-        cfg.model_file, dataloader, cfg.tensor_model_parallel_size, cfg.pipeline_model_parallel_size
-    )
+        def forward_loop(model):
+            # NOTE: Alternatively you can also use `model.forward_bwd_step(data_iter, forward_only=True)`
+            # if your model is setup for training.
+            model.set_inference_config(OmegaConf.to_container(cfg.inference))
+            for i, batch in enumerate(tqdm(dataloader, desc="Calibrating")):
+                model.predict_step(batch, i)
+
+        model = quantizer.quantize(model, forward_loop)

-    quantizer.export(model, cfg.model_save)
+    quantizer.export(model)


 if __name__ == '__main__':
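The renamed `get_calib_data_iter` yields fixed-size batches of truncated text until the calibration budget (`calib_size`) is exhausted; the script then materializes them into a list for the calibration `forward_loop`. The batching logic can be sketched against a toy in-memory list instead of `datasets.load_dataset` (same parameter names, hypothetical data source):

```python
def get_calib_data_iter(samples, batch_size=2, calib_size=6, max_sequence_length=8):
    """Yield calib_size // batch_size batches of samples truncated to max_sequence_length chars."""
    for i in range(calib_size // batch_size):
        batch = samples[i * batch_size : (i + 1) * batch_size]
        yield [text[:max_sequence_length] for text in batch]

texts = ["a" * 20, "b" * 5, "c" * 20, "d" * 20, "e" * 3, "f" * 20, "g" * 20]
batches = list(get_calib_data_iter(texts))  # materialized, as the script does with `dataloader`
```

Materializing the iterator up front keeps the calibration loop simple at the cost of holding all calibration text in memory, which is acceptable for the few hundred samples used here.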

nemo/collections/asr/modules/audio_preprocessing.py

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ def __init__(self, win_length, hop_length):
     @torch.no_grad()
     def forward(self, input_signal, length):
         if input_signal.dtype != torch.float32:
-            logging.warn(
+            logging.warning(
                 f"AudioPreprocessor received an input signal of dtype {input_signal.dtype}, rather than torch.float32. In sweeps across multiple datasets, we have found that the preprocessor is not robust to low precision mathematics. As such, it runs in float32. Your input will be cast to float32, but this is not necessarily enough to recovery full accuracy. For example, simply casting input_signal from torch.float32 to torch.bfloat16, then back to torch.float32 before running AudioPreprocessor causes drops in absolute WER of up to 0.1%. torch.bfloat16 simply does not have enough mantissa bits to represent enough values in the range [-1.0,+1.0] correctly.",
                 mode=logging_mode.ONCE,
             )
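The final hunk swaps `logging.warn` for `logging.warning`. NeMo's logger (which also accepts the NeMo-specific `mode=logging_mode.ONCE` argument) mirrors the stdlib naming, where `warn` has long been a deprecated alias of `warning`. A minimal stdlib demo of the supported spelling, capturing the emitted message (plain `logging`, not NeMo's wrapper):

```python
import io
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))  # capture output for inspection

# `warning` is the supported spelling; `warn` is a deprecated stdlib alias.
logger.warning("input was %s, casting to float32", "bfloat16")

output = stream.getvalue()
```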
