Expose --quantize_dtype CLI arg to control dtype during quantization#228

Open
rhn19 wants to merge 1 commit into huggingface:main from rhn19:fix/qat-bf16-quantization-dtype

Conversation


rhn19 commented Apr 13, 2026

Fixes #226

Problem

The model is loaded in the --dtype precision before quantization, and there is currently no way to quantize in a different dtype (e.g. to match the precision used during QAT training).

Changes

  • Add --quantize_dtype CLI argument (choices=["float32", "float16", "bfloat16"])
  • Wire it through all task loaders (causal_lm, masked_lm, asr, multimodal_text_to_text) to quantize_model_()
  • In quantize_model_(): cast to quantize_dtype before quantization, restore original dtype before unwrap_tensor_subclass

The restore-before-unwrap ordering matters: unwrap_tensor_subclass breaks .to() for quantized tensor subclasses, so the dtype cast must happen while the subclass wrappers are still intact.
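The ordering can be sketched as below. This is a hypothetical stand-in, not the actual optimum-executorch or torchao code: `FakeModel`, the `unwrapped` flag, and this `quantize_model_` skeleton only model the constraint that the dtype cast must happen before the (simulated) `unwrap_tensor_subclass` step.

```python
from dataclasses import dataclass


@dataclass
class FakeModel:
    """Toy model that tracks its dtype and whether subclasses are unwrapped."""
    dtype: str
    unwrapped: bool = False

    def to(self, dtype):
        if self.unwrapped:
            # Mirrors the real failure mode: after unwrap_tensor_subclass,
            # .to() no longer works for quantized tensor subclasses.
            raise RuntimeError(".to() is broken after unwrapping")
        self.dtype = dtype
        return self


def quantize_model_(model, quantize_dtype=None):
    """Sketch of the cast -> quantize -> restore -> unwrap ordering."""
    original_dtype = model.dtype
    if quantize_dtype is not None:
        model.to(quantize_dtype)      # cast before quantization
    # ... quantization (e.g. 8da4w) would run here, in quantize_dtype ...
    if quantize_dtype is not None:
        model.to(original_dtype)      # restore while wrappers are still intact
    model.unwrapped = True            # stand-in for unwrap_tensor_subclass
    return model


model = quantize_model_(FakeModel(dtype="float32"), quantize_dtype="bfloat16")
print(model.dtype)  # float32 — original dtype restored before unwrapping
```

Swapping the last two steps in `quantize_model_` (unwrap first, then restore) would raise, which is why the restore must come first.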

Usage

optimum-cli export executorch \
   --model merged-qat-model/ \
   --task text-generation --recipe xnnpack \
   --qlinear 8da4w --quantize_dtype bfloat16 \
   --dtype float32 \
   --output_dir output/

No behavior change when --quantize_dtype is not specified.



Successfully merging this pull request may close these issues.

[Bug] QAT-trained models produce degraded output after export due to quantization parameter mismatches