Conversation

@vkuzo (Contributor) commented Dec 5, 2025

Summary:

Creates a standalone eval script for generating accuracy metrics for the
quantization README.md, based on the HuggingFace model definition of
Llama 3.1 8B.

Why a new script?

  1. The current prod script in
     https://github.com/pytorch/ao/blob/main/torchao/_models/llama/eval.py
     uses a custom model definition. It predates the HF integration, so it's
     better to use HF's model definition now.
  2. We have HummingBird scripts in
     https://github.com/pytorch/ao/tree/40c4f44677ae11166c3dcfbb9189cfa78789390c/.github/scripts/torchao_model_releases,
     but they seem pretty verbose and hard to use/modify.
  3. We have
     https://github.com/pytorch/ao/blob/main/benchmarks/_models/eval_hf_models.py,
     which I copy-pasted and modified for the current PR. That script didn't
     work as-is for various reasons and also seemed hard to use/modify; for
     the main README.md it's important to have a very simple standalone
     script.

We should probably do a pass on the naming before landing.

Future work:

  1. add metrics for int4_weight_only_hqq (needs to run on A100)
  2. add metrics for 'int4 weight float8 activation' (currently doesn't work with HF accelerate)
  3. add metrics for mxfp8 and nvfp4 (needs to run on B200)
  4. automate the parsing of logs
  5. add a similar script for performance benchmarks, using vLLM
  6. delete https://github.com/pytorch/ao/blob/main/torchao/_models/llama/
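Future work item 4 (automating the parsing of logs) could start from something like the sketch below. This is not code from this PR: the table format it parses is an assumption, loosely modeled on lm-evaluation-harness result rows, and `parse_results` is a hypothetical name.

```python
def parse_results(log_text: str) -> dict[str, dict[str, float]]:
    """Collect {task: {metric: value}} from lm_eval-style result tables, e.g.:

    |wikitext  | 2|none | 0|word_perplexity|. | 9.7110|+-|   N/A|
    |winogrande| 1|none | 0|acc            |. | 0.7388|+-|0.0123|
    """
    results: dict[str, dict[str, float]] = {}
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # assumed columns: task, version, filter, n-shot, metric, dir, value, +-, stderr
        if len(cells) >= 7 and cells[0]:
            try:
                value = float(cells[6])
            except ValueError:
                continue  # header or separator row
            results.setdefault(cells[0], {})[cells[4]] = value
    return results

if __name__ == "__main__":
    sample = (
        "|wikitext  | 2|none | 0|word_perplexity|. | 9.7110|+-|   N/A|\n"
        "|winogrande| 1|none | 0|acc            |. | 0.7388|+-|0.0123|\n"
    )
    print(parse_results(sample))
```

The split-on-`|` approach avoids a brittle regex and silently skips header and separator rows, since their value column does not parse as a float.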

Test Plan:

```
# debug run on a small model
with-proxy time ./benchmarks/quantization/eval_accuracy_for_readme.sh facebook/opt-125m

# real run
with-proxy time ./benchmarks/quantization/eval_accuracy_for_readme.sh
```
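The driver inside such a script would presumably loop over quantization recipes for one model, with the model overridable from the CLI as in the debug run above. As a rough illustration only — the recipe names, the `MODEL_ID` variable, and the commented-out `eval_one.py` helper are all hypothetical, not names from this PR:

```shell
#!/bin/bash
set -euo pipefail

# Model defaults to Llama 3.1 8B but can be overridden by the first CLI
# argument, mirroring the debug-run usage above.
MODEL_ID="${1:-meta-llama/Llama-3.1-8B}"

# Recipe names below are placeholders modeled on torchao's README.
for quant_recipe in none float8_rowwise int8_rowwise; do
    echo "=== ${MODEL_ID} / ${quant_recipe} ==="
    # python eval_one.py --model_id "${MODEL_ID}" --quant_recipe "${quant_recipe}"
done
```

Keeping the loop this flat is what makes the script easy to modify: adding or removing a recipe is a one-word change.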

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@vkuzo (Contributor Author) commented Dec 5, 2025

pytorch-bot commented Dec 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3449

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1581808 with merge base 69ce0fd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo added a commit that referenced this pull request Dec 5, 2025
Summary: essentially the same as the PR description above.

ghstack-source-id: 39c1d72
ghstack-comment-id: 3618394399
Pull-Request: #3449
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2025
@vkuzo vkuzo added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Dec 8, 2025
[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Dec 8, 2025
Summary: essentially the same as the PR description above.

ghstack-source-id: 174b317
ghstack-comment-id: 3618394399
Pull-Request: #3449
```
# note:
# * `int4_groupwise_hqq_weight_float8_rowwise_activation` doesn't work with dtype_map auto: https://gist.github.com/vkuzo/6b128681b628744d445c553cdeac8a85
# * `int4_groupwise_hqq_weight_only` only works on A100
for quant_recipe in float8_rowwise int4_groupwise_weight_float8_rowwise_activation int4_groupwise_hqq_weight_only int8_rowwise_weight_only int8_rowwise; do
```
A reviewer (Contributor) commented:

nit: int4_groupwise_weight_float8_rowwise_activation --> float8_rowwise_activation_int4_groupwise_weight to match the config name order?

@vkuzo (Contributor Author) replied:

I'm matching the order in https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#stable-workflows.

Overall, if we want to standardize this everywhere, that sounds reasonable; IMO let's do that in a separate "rename-only" PR?

@jerryzh168 jerryzh168 requested a review from jainapurva December 8, 2025 18:52
vkuzo added a commit that referenced this pull request Dec 8, 2025
Summary:

#3449 is a newer version of these scripts, which uses
the HuggingFace model definition.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:
ghstack-source-id: 9d85193
ghstack-comment-id: 3628761600
Pull-Request: #3466
vkuzo added a commit that referenced this pull request Dec 8, 2025
Summary: same as the commit above.

ghstack-source-id: 0ad33cb
ghstack-comment-id: 3628761600
Pull-Request: #3466
@vkuzo vkuzo merged commit 7b65989 into main Dec 8, 2025
56 checks passed
vkuzo added a commit that referenced this pull request Dec 8, 2025
vkuzo added a commit that referenced this pull request Dec 8, 2025