chore: bench phase breakdown + thread sweep for MSM reduction#21885
Merged
johnathan79717 merged 3 commits intoAztecProtocol:merge-train/barretenbergfrom Mar 30, 2026
Merged
Conversation
Contributor
Author
|
CI3 (External) is currently blocked because this is an external PR from a fork: the workflow says external PRs need the Would someone with label permission mind adding one of those labels so CI can execute? (This PR only touches the MSM benchmark + |
Contributor
Author
|
Follow-up: I narrowed the
Latest local run (4 vCPU laptop;
|
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Mar 24, 2026
## Summary - The `ci-external` workflow fails with `Resource not accessible by integration (removeLabelsFromLabelable)` because `github.token` defaults to read-only when no `permissions` block is set. - Adds `contents: read` + `pull-requests: write` to the `ci-external` job so `gh pr edit --remove-label "ci-external-once"` can succeed. ## Context - Safe because the workflow uses `pull_request_target` (code always from base branch, not fork) and is gated by a maintainer adding the `ci-external` / `ci-external-once` label. - Matches the pattern used by other label-modifying workflows in the repo (e.g., `merge-train-create-pr`, `auto-close-stale-drafts`). - Fixes the CI failure seen on external PRs like #21885.
Collaborator
|
This issue was automatically closed because it was referenced in PR #21965 which has been merged to the default branch. |
johnathan79717
approved these changes
Mar 25, 2026
69ef274 to
07ca389
Compare
auto-merge was automatically disabled
March 26, 2026 15:45
Pull request was closed
Replaces speculative TODO with actual measurements from a 192-core machine: ~512us for 2^16 MSM (1.2% of total), ~207us for 2^20 (<0.1%).
6ebf784 to
88f04ea
Compare
c251a92
into
AztecProtocol:merge-train/barretenberg
21 of 22 checks passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
This PR addresses
AztecProtocol/barretenberg#1656by making theMSM::batch_multi_scalar_mul(...)phase breakdown measurable and by adding a benchmark case that targets the exact regime called out in the issue (2^16points with256threads).The goal is to answer: is the single-threaded final reduction (
accumulate_results) actually a bottleneck relative to the full MSM?What Changed
1) Phase Breakdown (
BB_BENCH) insidebatch_multi_scalar_mulAdded
BB_BENCHscopes for:MSM::batch_multi_scalar_mul/evaluate_work_unitsMSM::batch_multi_scalar_mul/accumulate_resultsMSM::batch_multi_scalar_mul/batch_normalizeMSM::batch_multi_scalar_mul/scalars_to_montgomeryThese are surfaced in google-benchmark output via
GOOGLE_BB_BENCH_REPORTER(state).2) New Benchmark Case:
BatchMSM_1656Added a dedicated benchmark:
PippengerBench/BatchMSM_1656/256/{msm_size}num_polys = 1)msm_size ∈ {2^16, 2^20}This uses
bb::set_parallel_for_concurrency(256)to force the intended partitioning and restores the original value after the benchmark finishes.Results (Local)
Machine: 4 vCPU laptop (so
256 threadsis intentionally oversubscribed; the key point is the absolute reduction overhead).Key takeaways (from
BB_BENCHcounters; time counters are nanoseconds):2^16, 256 threads:accumulate_results ≈ 139k ns(~0.139 ms)≈ 521 ms~0.027%2^20, 256 threads:accumulate_results ≈ 141k ns(~0.141 ms)≈ 3457 ms~0.004%So the final reduction appears negligible in these regimes on my setup, and the benchmark now makes it easy to validate on a 64+ core machine where the original concern is most relevant.
Notes / Background
MSMhere is a Pippenger-style MSM. For background reading on multiexponentiation / multi-product methods:How To Run
Fixes AztecProtocol/barretenberg#1656