IREE CPU Reduction Performance Report
Machine: 32-core AMD Zen4 7975WX, AVX-512, 32 threads.
Baseline: PyTorch 2.9.0+cpu with MKL.
IREE flags: --iree-opt-level=O2 --iree-opt-data-tiling --iree-llvmcpu-target-cpu=host
```mlir
// Reduction sum: static 1000000-element 1-D input.
func.func @sum(%input: tensor<1000000xf32>) -> tensor<f32> {
  %cst = arith.constant 0.0 : f32
  %init = tensor.empty() : tensor<f32>
  %fill = linalg.fill ins(%cst : f32) outs(%init : tensor<f32>) -> tensor<f32>
  %result = linalg.reduce { arith.addf }
              ins(%input : tensor<1000000xf32>)
              outs(%fill : tensor<f32>)
              dimensions = [0]
  return %result : tensor<f32>
}
```
Note: this report is built on top of a codegen bug fix (arguably an improved feature): #24010.
Overview: IREE vs MKL (sum 1D N=1M, @32 threads/workers)
|                  | MKL   | IREE no-split | IREE c=8 fixed | IREE c=2 |
|------------------|-------|---------------|----------------|----------|
| Total (measured) | 15 us | 61 us         | 59 us          | 61 us    |
| Compute          | —     | 39.5 us       | 9.9 us         | 22.7 us  |
| Runtime overhead | —     | 21.5 us       | 49.1 us        | 38.3 us  |
| vs MKL           | 1.0x  | 4.1x          | 3.9x           | 4.1x     |
c=N means there are N chunks in total, i.e., the main reduction dispatch has N workgroups.
MKL total is a black-box measurement; no internal breakdown available.
IREE compute and overhead are from Tracy single-iteration trace (see below).
c=8 (tile=124992) has the lowest compute (9.9 us, 9 parallel shards at ~5 us) but
highest overhead (49.1 us — scheduling 9 workers + two dispatches).
The gap to MKL is 3.9x and is dominated by runtime overhead, not
compute quality.
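The 9 shards for "c=8" follow from the tile size: 124992 does not divide 1M evenly, so a ninth workgroup picks up the remainder. A quick arithmetic check (plain ceil-division over the report's numbers; no IREE APIs involved):

```cpp
#include <cassert>
#include <cstdint>

// Why "c=8" (tile = 124992) shows 9 dispatch_0 shards in the trace:
// 8 full tiles cover only 999'936 elements, so ceil(1'000'000 / 124'992) = 9
// workgroups, the last handling a 64-element remainder.
constexpr int64_t kNumel = 1000000;
constexpr int64_t kTile = 124992;
constexpr int64_t kWorkgroups = (kNumel + kTile - 1) / kTile;           // = 9
constexpr int64_t kRemainder = kNumel - (kWorkgroups - 1) * kTile;      // = 64
```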
Comparison summary: PyTorch vs IREE dispatch model
|                        | PyTorch/OpenMP               | IREE HAL/task system                               |
|------------------------|------------------------------|----------------------------------------------------|
| Dispatch floor         | 1.1 us                       | ~10 us                                             |
| Thread fork/join       | ~13 us (OpenMP)              | ~15 us (task scheduler)                            |
| Per-call allocations   | 2 us (output + 128B buffer)  | ~15 us (semaphores, fences, cmd buffer, heap buffer) |
| Work distribution      | Static chunk per thread      | Workgroup → task queue → worker                    |
| Small tensor handling  | Sequential if < 32K elements | Always dispatches through full HAL path            |
| Total @32t, 1M sum     | 15 us                        | 56 us                                              |
The fork/join costs are comparable (~13 vs ~15 us). The gap comes from:
- Dispatch floor: 1.1 us vs ~10 us (9 us difference)
- Per-call allocations: 2 us vs ~15 us (13 us difference)
- Kernel efficiency: roughly comparable according to the profiles.
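Tallying those line items against the total gap (a back-of-the-envelope sketch using the table's numbers, not a measurement; the unexplained remainder lands in scheduling delay and cleanup, consistent with the Tracy breakdown):

```cpp
#include <cassert>

// Back-of-the-envelope tally of the dispatch-model gap, using the numbers
// from the comparison table above (a sanity sketch, not a measurement).
double gap_us = 56.0 - 15.0;              // total IREE-vs-PyTorch gap: 41 us
double dispatch_floor_diff = 10.0 - 1.1;  // ~9 us
double alloc_diff = 15.0 - 2.0;           // ~13 us
double fork_join_diff = 15.0 - 13.0;      // ~2 us (comparable)
// ~24 us explained; the rest sits in scheduling delay and cleanup.
double explained_us = dispatch_floor_diff + alloc_diff + fork_join_diff;
```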
Tracy single-iteration breakdown: c=8 fixed @32w (56.0 us)
Traced from steady-state BenchmarkIteration at t≈3s in
sum_1M_c8_fixed_32w.tracy. Threads: 1 (main/VM), 2-10 (task workers).
| Phase                       | Duration (us) | %    | Description |
|-----------------------------|---------------|------|-------------|
| VM setup                    | 12.4          | 22%  | vm_invoke, semaphore_create, fence_create, buffer_alloc, queue_execute, executor_submit |
| Scheduling delay            | 14.8          | 26%  | fence_await starts (t=15.9) → first shard starts (t=30.7) |
| Dispatch 0 compute          | ~10           | 18%  | 7-9 parallel shards across threads 2-10, each ~5 us; wall-clock ≈ 10 us (t=30.7 to t=40.7) |
| Inter-dispatch + dispatch_1 | 3.7           | 7%   | dispatch_0 retire → dispatch_1 execute (0.0 us) → retire |
| Cleanup + wakeup            | 15.1          | 27%  | semaphore_signal, resource cleanup, fence_await wakeup, buffer_view_create, vm_end_invoke |
| Total                       | 56.0          | 100% | |
Timeline visualization: c=8 fixed (one iteration, 56 us)
From sum_1M_c8_fixed_32w.tracy, BenchmarkIteration at ns_since_start≈3000051421.
All timestamps below are relative to iteration start (t=0).
0 us 10 20 30 40 50 56
| | | | | | |
T1 [--vm_setup(12.4)--][------fence_await/blocked(40.2)------][cl]
t=0 t=12.5 t=55.5
vm_invoke semaphore_multi_wait end_invoke
sem_create×2
fence_create×3
buf_alloc×2
queue_exec
T2 ·············[cmd_buf][wait]·········[===D0===]·[retire]········
t=13.0 t=19.4 t=30.7 t=47.7
issue_cmd queue_wait 5.1 us cmd_cleanup
T3 ·····································[===D0===]·················
t=32.2 5.0 us
T4 ······································[===D0===]················
t=31.3 5.1 us
T5 ······································[===D0===]················
t=31.3 5.0 us
T6 ·····································[===D0===]·················
t=30.4 5.1 us
T7 ···································(asleep — woke but no shard this iter)
T8 ·································[===D0===]·····················
t=27.3 4.9 us
T9 ··································[===D0===]····················
t=28.6 4.9 us
T10 ··········································[D1 0.04us]·[retire+signal]
t=41.0 t=44.4
reduce 9→1 sem_signal
Key timestamps (us from iteration start):
- t=0.0: BenchmarkIteration starts (thread 1)
- t=12.5: fence_await starts — thread 1 blocks
- t=27.3: first dispatch_0 shard starts (thread 8, 4.9 us)
- t=32.2: last dispatch_0 shard starts (thread 3, 5.0 us)
- t=37.2: last dispatch_0 shard completes
- t=41.0: dispatch_1 starts (thread 10, reduce 9→1, 0.04 us)
- t=44.4: semaphore_signal (wakes main thread)
- t=55.5: vm_end_invoke, iteration ends
Compute: dispatch_0 shards t=27.3→37.2 (9.9 us) + dispatch_1 at t=41.0 (0.04 us) = 9.9 us (18%).
Inter-dispatch overhead: t=37.2→41.0 = 3.8 us (retire, barrier, issue).
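The compute share can be recomputed directly from the key timestamps (plain arithmetic over the trace values above, not a new measurement):

```cpp
#include <cassert>

// Recomputing the c=8 phase durations from the key timestamps above.
double d0_wall_us = 37.2 - 27.3;          // dispatch_0 wall-clock: 9.9 us
double d1_us = 0.04;                      // dispatch_1
double compute_us = d0_wall_us + d1_us;   // ~9.9 us after rounding
double inter_dispatch_us = 41.0 - 37.2;   // 3.8 us (retire, barrier, issue)
double compute_pct = 100.0 * compute_us / 56.0;  // ~17.8%, the "18%" above
```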
Timeline visualization: c=2 (one iteration, 60.7 us)
From sum_1M_c2_32w.tracy, BenchmarkIteration at ns_since_start=3000054482.
All timestamps below are relative to iteration start (t=0).
0 us 10 20 30 40 50 60.7
| | | | | | |
T1 [--vm_setup(12.6)--][-------fence_await/blocked(43.2)-------][cl]
t=0 t=12.6 t=58.7
vm_invoke semaphore_multi_wait end_invoke
sem_create×2
fence_create×3
buf_alloc
queue_exec
T2 ·················[cmd][wait]·[========dispatch_0========]·[retire]
t=14.8 t=22.9 t=42.6
cmd_buf 19.7 us (500K f32) cleanup
issue
T3 ···························[========dispatch_0========]·[D1]·[ret]
t=25.7 t=45.5 t=48.8
19.8 us (500K f32) 0.1us
signal
Key timestamps (us from iteration start):
- t=0.0: BenchmarkIteration starts (thread 1)
- t=12.6: fence_await starts — thread 1 blocks
- t=22.9: first dispatch_0 shard starts (thread 2, 19.7 us)
- t=25.7: second dispatch_0 shard starts (thread 3, 19.8 us)
- t=45.5: last dispatch_0 shard completes
- t=48.8: dispatch_1 completes (reduce 2→1, 0.1 us)
- t=58.7: vm_end_invoke, iteration ends
Compute: dispatch_0 shards t=22.9→45.5 (22.6 us) + dispatch_1 at t=48.8 (0.1 us) = 22.7 us (37%).
Inter-dispatch overhead: t=45.5→48.7 = 3.2 us (retire, barrier, issue).
Comparison: c=2 vs c=8 fixed iteration breakdown
| Phase                       | c=2 (60.7 us) | c=8 fixed (56.0 us) |
|-----------------------------|---------------|---------------------|
| VM setup                    | 12.6 us (21%) | 12.4 us (22%)       |
| Scheduling delay            | 5.8 us (10%)  | 14.8 us (26%)       |
| Dispatch 0 compute          | 22.8 us (37%) | ~10 us (18%)        |
| Inter-dispatch + dispatch_1 | 3.1 us (5%)   | 3.7 us (7%)         |
| Cleanup + wakeup            | 16.4 us (27%) | 15.1 us (27%)       |
c=8 has lower compute (10 us vs 22.8 us — more parallelism) but higher
scheduling delay (14.8 us vs 5.8 us — more workers to wake up). The net
result is roughly the same wall-clock (56 vs 61 us).
Thread scaling: c=8 fixed vs no-split vs MKL
| Workers | no-split (ms) | c=8 fixed (ms) | MKL (ms) |
|---------|---------------|----------------|----------|
| 1       | 0.059         | 0.080          | 0.031    |
| 2       | 0.065         | 0.069          | 0.022    |
| 4       | 0.065         | 0.056          | 0.015    |
| 8       | 0.065         | 0.055          | 0.011    |
| 16      | 0.065         | 0.063          | 0.010    |
| 32      | 0.066         | 0.059          | 0.015    |
c=8 fixed now scales: 0.080 → 0.055 ms from 1 → 8 workers.
Best at 8 workers: 0.055 ms (3.7x off MKL @32t).
Remaining gap to MKL
The gap is now entirely runtime overhead. Per the Tracy breakdown, only
18% of wall time is compute; 82% is IREE runtime machinery:
- VM lifecycle: 12.4 us per iteration (semaphore/fence/buffer management)
- Scheduling delay: 14.8 us (fence_await → first shard)
- Cleanup: 15.1 us (retire, signal, resource free)
MKL completes the same sum in 15 us total @32t.
MKL Analysis
Parallelism: OpenMP two-pass reduction
Source: aten/src/ATen/native/TensorIteratorReduce.cpp
This path is equivalent to IREE's split-reduction path.
```cpp
// Abridged from TensorIteratorReduce.cpp; locals such as dst, unsqueezed,
// base_ptrs, shape, and strides come from the elided surrounding context.
static void two_pass_reduction(TensorIteratorBase& iter, loop2d_t loop) {
  const int max_threads = at::get_num_threads();
  auto buffer = at::empty({max_threads, 1}, dst.options()); // one alloc: 128 bytes
  buffer.copy_(unsqueezed);                                 // fill with identity
  at::parallel_for(0, numel, GRAIN_SIZE, [&](int64_t begin, int64_t end) {
    // Each thread reduces its chunk into buffer[thread_id].
    base_ptrs[0] += buffer_stride * thread_num;
    serial_for_each(shape, strides, base_ptrs, loop, {begin, end});
  });
  // Final pass: reduce buffer[0..max_threads] → output.
  auto final_reduce = TensorIterator::reduce_op(unsqueezed, buffer);
  final_reduce.for_each(loop);
}
```
- Pass 1: parallel_for(0, 1M, 32768, ...) → ~31 chunks across 32 threads. Each thread accumulates into buffer[thread_id].
- Pass 2: sequential reduce of the 32 partial sums (trivial).
- Grain size: GRAIN_SIZE = 32,768 elements (TensorIterator.h:78).
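The two passes can be sketched without ATen at all. A minimal std::thread version (illustrative names; only the structure mirrors the quoted source: grain-size cutoff, per-thread partials in pass 1, trivial sequential pass 2):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Structural sketch of the two-pass reduction using std::thread instead of
// OpenMP. two_pass_sum and kGrainSize are illustrative names, not ATen's.
constexpr int64_t kGrainSize = 32768;  // ATen's GRAIN_SIZE

inline float two_pass_sum(const std::vector<float>& data) {
  const int64_t n = static_cast<int64_t>(data.size());
  const int64_t nthreads =
      std::max<int64_t>(1, std::thread::hardware_concurrency());
  // Below the grain size (or single-threaded): sequential, zero fork/join cost.
  if (n < kGrainSize || nthreads == 1) {
    float acc = 0.f;
    for (float v : data) acc += v;
    return acc;
  }
  // Pass 1: each thread reduces a contiguous chunk into buffer[tid]
  // (the analogue of the 128-byte {max_threads, 1} scratch tensor).
  std::vector<float> buffer(nthreads, 0.f);
  std::vector<std::thread> pool;
  const int64_t chunk = (n + nthreads - 1) / nthreads;  // divup
  for (int64_t tid = 0; tid < nthreads; ++tid) {
    pool.emplace_back([&data, &buffer, tid, chunk, n] {
      float acc = 0.f;
      const int64_t end = std::min(n, (tid + 1) * chunk);
      for (int64_t i = tid * chunk; i < end; ++i) acc += data[i];
      buffer[tid] = acc;
    });
  }
  for (auto& t : pool) t.join();  // the OpenMP implicit barrier
  // Pass 2: sequential reduce of the per-thread partials (trivial).
  float total = 0.f;
  for (float v : buffer) total += v;
  return total;
}
```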
Per-iteration allocations
Source: TensorIteratorReduce.cpp, c10/core/CPUAllocator.cpp:20
```cpp
auto buffer = at::empty(buffer_shape, dst.options());
// → alloc_cpu(128) → posix_memalign(128 bytes)
```
Per torch.sum() call:
- Output tensor: ~0.6 us (torch.empty(()) — scalar + metadata)
- Intermediate buffer: ~0.7 us (at::empty({32, 1}) — 128 bytes via posix_memalign)
- Total allocation: ~2 us (measured: torch.sum(a) - torch.sum(a, out=out) = 2.0 us)
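The buffer path bottoms out in a single posix_memalign call, which can be sketched as follows (the 64-byte alignment here is an assumption for illustration, not necessarily c10's actual value):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Sketch of the allocation backing the intermediate buffer: one
// posix_memalign of 128 bytes (32 threads x 1 float).
// NOTE: the 64-byte alignment is illustrative; c10's value may differ.
inline void* alloc_partials_buffer(size_t nbytes) {
  void* ptr = nullptr;
  // posix_memalign returns 0 on success; alignment must be a power of two
  // and a multiple of sizeof(void*).
  if (posix_memalign(&ptr, 64, nbytes) != 0) return nullptr;
  return ptr;
}
```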
Parallelization decision
Source: TensorIteratorReduce.cpp
```cpp
if (numel < GRAIN_SIZE || get_num_threads() == 1 || in_parallel_region()) {
  serial_for_each(loop, {0, numel});    // no parallelism
} else if (output.numel() == 1) {
  two_pass_reduction(*this, loop);      // scalar output → buffer per thread
} else {
  parallel_dim_reduction(*this, loop);  // reduce along one dimension
}
```
- Below 32,768 elements: sequential (no OpenMP overhead)
- Scalar output (our case): two-pass (buffer per thread)
- Multi-element output: parallel_dim (parallelize non-reduced dims)
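The decision restated as a standalone function (choose_reduce_path and ReducePath are illustrative names; the thresholds mirror the quoted source):

```cpp
#include <cassert>
#include <cstdint>

// ATen's reduction dispatch decision as a pure function (sketch).
enum class ReducePath { Serial, TwoPass, ParallelDim };

inline ReducePath choose_reduce_path(int64_t numel, int nthreads,
                                     int64_t out_numel,
                                     bool in_parallel_region = false) {
  constexpr int64_t kGrainSize = 32768;
  if (numel < kGrainSize || nthreads == 1 || in_parallel_region)
    return ReducePath::Serial;       // below 32K elements: no OpenMP overhead
  if (out_numel == 1)
    return ReducePath::TwoPass;      // scalar output: per-thread partials
  return ReducePath::ParallelDim;    // parallelize the non-reduced dims
}
```

For the 1M-element sum with a scalar output and 32 threads, this returns the two-pass path.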
No caching allocator for CPU — uses glibc posix_memalign/free directly
(c10/core/impl/alloc_cpu.cpp:126). But glibc's internal free-list makes
repeated same-size allocations fast (~50-100 ns after first call).
Thread management: OpenMP
Source: aten/src/ATen/ParallelOpenMP.h:14-54
```cpp
#pragma omp parallel
{
  int tid = omp_get_thread_num();
  int nthreads = omp_get_num_threads();
  int64_t chunk = divup((end - begin), nthreads);
  // Each thread gets a contiguous chunk; there is no task queue.
}
```
- Persistent thread pool (OpenMP creates threads once at program start)
- Static work division (no task stealing, no scheduling overhead)
- Implicit barrier at the end of the #pragma omp parallel block
- Measured fork/join overhead @32t: ~13 us (single-thread vs parallel comparison)
No dispatch machinery
Unlike IREE, there is no:
- VM invoke/deinvoke
- HAL command buffer create/destroy
- Semaphore create/signal/wait
- Fence create/destroy
- Task scheduler / work stealing
- Buffer allocation for intermediate state (beyond the 128-byte buffer)
The kernel is a direct C++ function call from parallel_reduce into the
compiled template instantiation. The only indirection is OpenMP's thread
fork/join.