Expose --quantize_dtype CLI arg to control dtype during quantization#228

Open
rhn19 wants to merge 1 commit into huggingface:main from rhn19:fix/qat-bf16-quantization-dtype

Conversation


rhn19 commented Apr 13, 2026

Fixes #226

Problem

The model is loaded in the --dtype precision before quantization, and there is currently no way to quantize in a different dtype (e.g. to match the precision used during QAT training).

Changes

  • Add --quantize_dtype CLI argument (choices=["float32", "float16", "bfloat16"])
  • Wire it through all task loaders (causal_lm, masked_lm, asr, multimodal_text_to_text) to quantize_model_()
  • In quantize_model_(): cast to quantize_dtype before quantization, restore original dtype before unwrap_tensor_subclass

The restore-before-unwrap ordering matters: unwrap_tensor_subclass breaks .to() for quantized tensor subclasses, so the dtype cast must happen while the subclass wrappers are still intact.
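The ordering can be sketched as below. This is a hypothetical stand-in, not the actual optimum-executorch or torchao code: `FakeModel`, the `unwrapped` flag, and this `quantize_model_` skeleton only model the constraint that the dtype cast must happen before the (simulated) `unwrap_tensor_subclass` step.

```python
from dataclasses import dataclass


@dataclass
class FakeModel:
    """Toy model that tracks its dtype and whether subclasses are unwrapped."""
    dtype: str
    unwrapped: bool = False

    def to(self, dtype):
        if self.unwrapped:
            # Mirrors the real failure mode: after unwrap_tensor_subclass,
            # .to() no longer works for quantized tensor subclasses.
            raise RuntimeError(".to() is broken after unwrapping")
        self.dtype = dtype
        return self


def quantize_model_(model, quantize_dtype=None):
    """Sketch of the cast -> quantize -> restore -> unwrap ordering."""
    original_dtype = model.dtype
    if quantize_dtype is not None:
        model.to(quantize_dtype)      # cast before quantization
    # ... quantization (e.g. 8da4w) would run here, in quantize_dtype ...
    if quantize_dtype is not None:
        model.to(original_dtype)      # restore while wrappers are still intact
    model.unwrapped = True            # stand-in for unwrap_tensor_subclass
    return model


model = quantize_model_(FakeModel(dtype="float32"), quantize_dtype="bfloat16")
print(model.dtype)  # float32 — original dtype restored before unwrapping
```

Swapping the last two steps in `quantize_model_` (unwrap first, then restore) would raise, which is why the restore must come first.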

Usage

optimum-cli export executorch \
   --model merged-qat-model/ \
   --task text-generation --recipe xnnpack \
   --qlinear 8da4w --quantize_dtype bfloat16 \
   --dtype float32 \
   --output_dir output/

No behavior change when --quantize_dtype is not specified.



Successfully merging this pull request may close these issues.

[Bug] QAT-trained models produce degraded output after export due to quantization parameter mismatches