
[CPU] Reduction (sum) performance study on AMD Zen4 7975WX #24011

@hanhanW

Description


IREE CPU Reduction Performance Report

Machine: 32-core AMD Zen4 7975WX, AVX-512, 32 threads.
Baseline: PyTorch 2.9.0+cpu with MKL.
IREE flags: --iree-opt-level=O2 --iree-opt-data-tiling --iree-llvmcpu-target-cpu=host

```mlir
// reduction sum: static 1000000
func.func @sum(%input: tensor<1000000xf32>) -> tensor<f32> {
  %cst = arith.constant 0.0 : f32
  %init = tensor.empty() : tensor<f32>
  %fill = linalg.fill ins(%cst : f32) outs(%init : tensor<f32>) -> tensor<f32>
  %result = linalg.reduce { arith.addf }
    ins(%input : tensor<1000000xf32>)
    outs(%fill : tensor<f32>)
    dimensions = [0]
  return %result : tensor<f32>
}
```

Note: this report is built on top of a codegen bug fix (arguably an improved feature): #24010

Overview: IREE vs MKL (sum 1D N=1M, @32 threads/workers)

|                  | MKL   | IREE no-split | IREE c=8 fixed | IREE c=2 |
|------------------|-------|---------------|----------------|----------|
| Total (measured) | 15 us | 61 us         | 59 us          | 61 us    |
| Compute          | —     | 39.5 us       | 9.9 us         | 22.7 us  |
| Runtime overhead | —     | 21.5 us       | 49.1 us        | 38.3 us  |
| vs MKL           | 1.0x  | 4.1x          | 3.9x           | 4.1x     |

  • c=N means there are N chunks in total, i.e., the number of workgroups in the main reduction dispatch is N.

MKL total is a black-box measurement; no internal breakdown available.
IREE compute and overhead are from Tracy single-iteration trace (see below).

c=8 (tile=124992) has the lowest compute (9.9 us, 9 parallel shards at ~5 us) but
highest overhead (49.1 us — scheduling 9 workers + two dispatches).
The gap to MKL is 3.9x and is dominated by runtime overhead, not
compute quality.

Comparison summary: PyTorch vs IREE dispatch model

|                       | PyTorch/OpenMP                  | IREE HAL/task system                                   |
|-----------------------|---------------------------------|--------------------------------------------------------|
| Dispatch floor        | 1.1 us                          | ~10 us                                                 |
| Thread fork/join      | ~13 us (OpenMP)                 | ~15 us (task scheduler)                                |
| Per-call allocations  | 2 us (output + 128B buffer)     | ~15 us (semaphores, fences, cmd buffer, heap buffer)   |
| Work distribution     | Static chunk per thread         | Workgroup → task queue → worker                        |
| Small tensor handling | Sequential if < 32K elements    | Always dispatches through full HAL path                |
| Total @32t, 1M sum    | 15 us                           | 56 us                                                  |

The fork/join costs are comparable (~13 vs ~15 us). The gap comes from:

  1. Dispatch floor: 1.1 us vs ~10 us (9 us difference)
  2. Per-call allocations: 2 us vs ~15 us (13 us difference)
  3. Kernel efficiency: profiles show the kernels themselves are comparably efficient.

Tracy single-iteration breakdown: c=8 fixed @32w (56.0 us)

Traced from steady-state BenchmarkIteration at t≈3s in
sum_1M_c8_fixed_32w.tracy. Threads: 1 (main/VM), 2-10 (task workers).

| Phase                       | Duration (us) | %    | Description |
|-----------------------------|---------------|------|-------------|
| VM setup                    | 12.4          | 22%  | vm_invoke, semaphore_create, fence_create, buffer_alloc, queue_execute, executor_submit |
| Scheduling delay            | 14.8          | 26%  | fence_await starts (t=15.9) → first shard starts (t=30.7) |
| Dispatch 0 compute          | ~10           | 18%  | 7-9 parallel shards across threads 2-10, each ~5 us; wall-clock ≈ 10 us (t=30.7 to t=40.7) |
| Inter-dispatch + dispatch_1 | 3.7           | 7%   | dispatch_0 retire → dispatch_1 execute (0.0 us) → retire |
| Cleanup + wakeup            | 15.1          | 27%  | semaphore_signal, resource cleanup, fence_await wakeup, buffer_view_create, vm_end_invoke |
| Total                       | 56.0          | 100% | |

Timeline visualization: c=8 fixed (one iteration, 56 us)

From sum_1M_c8_fixed_32w.tracy, BenchmarkIteration at ns_since_start≈3000051421.
All timestamps below are relative to iteration start (t=0).

```
     0 us       10        20        30        40        50        56
     |          |         |         |         |         |         |
T1  [--vm_setup(12.4)--][------fence_await/blocked(40.2)------][cl]
     t=0                  t=12.5                                t=55.5
     vm_invoke            semaphore_multi_wait                  end_invoke
     sem_create×2
     fence_create×3
     buf_alloc×2
     queue_exec

T2  ·············[cmd_buf][wait]·········[===D0===]·[retire]········
                 t=13.0   t=19.4         t=30.7     t=47.7
                 issue_cmd queue_wait     5.1 us     cmd_cleanup

T3  ·····································[===D0===]·················
                                         t=32.2  5.0 us

T4  ······································[===D0===]················
                                          t=31.3  5.1 us

T5  ······································[===D0===]················
                                          t=31.3  5.0 us

T6  ·····································[===D0===]·················
                                         t=30.4  5.1 us

T7  ···································(asleep — woke but no shard this iter)

T8  ·································[===D0===]·····················
                                     t=27.3  4.9 us

T9  ··································[===D0===]····················
                                      t=28.6  4.9 us

T10 ··········································[D1 0.04us]·[retire+signal]
                                              t=41.0       t=44.4
                                              reduce 9→1   sem_signal
```

Key timestamps (us from iteration start):

  • t=0.0: BenchmarkIteration starts (thread 1)
  • t=12.5: fence_await starts — thread 1 blocks
  • t=27.3: first dispatch_0 shard starts (thread 8, 4.9 us)
  • t=32.2: last dispatch_0 shard starts (thread 3, 5.0 us)
  • t=37.2: last dispatch_0 shard completes
  • t=41.0: dispatch_1 starts (thread 10, reduce 9→1, 0.04 us)
  • t=44.4: semaphore_signal (wakes main thread)
  • t=55.5: vm_end_invoke, iteration ends

Compute: dispatch_0 shards t=27.3→37.2 (9.9 us) + dispatch_1 at t=41.0 (0.04 us) = 9.9 us (18%).
Inter-dispatch overhead: t=37.2→41.0 = 3.8 us (retire, barrier, issue).

Timeline visualization: c=2 (one iteration, 60.7 us)

From sum_1M_c2_32w.tracy, BenchmarkIteration at ns_since_start=3000054482.
All timestamps below are relative to iteration start (t=0).

```
     0 us       10        20        30        40        50        60.7
     |          |         |         |         |         |         |
T1  [--vm_setup(12.6)--][-------fence_await/blocked(43.2)-------][cl]
     t=0                  t=12.6                                  t=58.7
     vm_invoke            semaphore_multi_wait                    end_invoke
     sem_create×2
     fence_create×3
     buf_alloc
     queue_exec

T2  ·················[cmd][wait]·[========dispatch_0========]·[retire]
                     t=14.8      t=22.9                t=42.6
                     cmd_buf     19.7 us (500K f32)    cleanup
                     issue

T3  ···························[========dispatch_0========]·[D1]·[ret]
                               t=25.7                t=45.5 t=48.8
                               19.8 us (500K f32)          0.1us
                                                           signal
```

Key timestamps (us from iteration start):

  • t=0.0: BenchmarkIteration starts (thread 1)
  • t=12.6: fence_await starts — thread 1 blocks
  • t=22.9: first dispatch_0 shard starts (thread 2, 19.7 us)
  • t=25.7: second dispatch_0 shard starts (thread 3, 19.8 us)
  • t=45.5: last dispatch_0 shard completes
  • t=48.8: dispatch_1 completes (reduce 2→1, 0.1 us)
  • t=58.7: vm_end_invoke, iteration ends

Compute: dispatch_0 shards t=22.9→45.5 (22.6 us) + dispatch_1 at t=48.8 (0.1 us) = 22.7 us (37%).
Inter-dispatch overhead: t=45.5→48.7 = 3.2 us (retire, barrier, issue).

Comparison: c=2 vs c=8 fixed iteration breakdown

| Phase                       | c=2 (60.7 us)  | c=8 fixed (56.0 us) |
|-----------------------------|----------------|---------------------|
| VM setup                    | 12.6 us (21%)  | 12.4 us (22%)       |
| Scheduling delay            | 5.8 us (10%)   | 14.8 us (26%)       |
| Dispatch 0 compute          | 22.8 us (37%)  | ~10 us (18%)        |
| Inter-dispatch + dispatch_1 | 3.1 us (5%)    | 3.7 us (7%)         |
| Cleanup + wakeup            | 16.4 us (27%)  | 15.1 us (27%)       |

c=8 has lower compute (10 us vs 22.8 us — more parallelism) but higher
scheduling delay (14.8 us vs 5.8 us — more workers to wake up). The net
result is roughly the same wall-clock (56 vs 61 us).

Thread scaling: c=8 fixed vs no-split vs MKL

| Workers | no-split (ms) | c=8 fixed (ms) | MKL (ms) |
|---------|---------------|----------------|----------|
| 1       | 0.059         | 0.080          | 0.031    |
| 2       | 0.065         | 0.069          | 0.022    |
| 4       | 0.065         | 0.056          | 0.015    |
| 8       | 0.065         | 0.055          | 0.011    |
| 16      | 0.065         | 0.063          | 0.010    |
| 32      | 0.066         | 0.059          | 0.015    |

c=8 fixed now scales: 0.080 → 0.055 ms from 1 → 8 workers.
Best at 8 workers: 0.055 ms (3.7x off MKL @32t).

Remaining gap to MKL

The gap is now entirely runtime overhead. Per the Tracy breakdown, only
18% of wall time is compute; 82% is IREE runtime machinery:

  • VM lifecycle: 12.4 us per iteration (semaphore/fence/buffer management)
  • Scheduling delay: 14.8 us (fence_await → first shard)
  • Cleanup: 15.1 us (retire, signal, resource free)

MKL completes the same sum in 15 us total @32t.

MKL Analysis

Parallelism: OpenMP two-pass reduction

Source: aten/src/ATen/native/TensorIteratorReduce.cpp

This path is equivalent to IREE's split-reduction path.

```cpp
static void two_pass_reduction(TensorIteratorBase& iter, loop2d_t loop) {
    const int max_threads = at::get_num_threads();
    auto buffer = at::empty({max_threads, 1}, dst.options()); // one alloc: 128 bytes
    buffer.copy_(unsqueezed);                                 // fill with identity

    at::parallel_for(0, numel, GRAIN_SIZE, [&](int64_t begin, int64_t end) {
        // Each thread reduces its chunk into buffer[thread_id]
        base_ptrs[0] += buffer_stride * thread_num;
        serial_for_each(shape, strides, base_ptrs, loop, {begin, end});
    });

    // Final: reduce buffer[0..max_threads] → output
    auto final_reduce = TensorIterator::reduce_op(unsqueezed, buffer);
    final_reduce.for_each(loop);
}
```
  • Pass 1: parallel_for(0, 1M, 32768, ...) → ~31 chunks across 32 threads.
    Each thread accumulates into buffer[thread_id].
  • Pass 2: Sequential reduce of 32 partial sums (trivial).
  • Grain size: GRAIN_SIZE = 32,768 elements (TensorIterator.h:78).

Per-iteration allocations

Source: TensorIteratorReduce.cpp, c10/core/CPUAllocator.cpp:20

```cpp
auto buffer = at::empty(buffer_shape, dst.options());
// → alloc_cpu(128) → posix_memalign(128 bytes)
```

Per torch.sum() call:

  1. Output tensor: ~0.6 us (torch.empty(()) — scalar + metadata)
  2. Intermediate buffer: ~0.7 us (at::empty({32, 1}) — 128 bytes via posix_memalign)
  3. Total allocation: ~2 us (measured: torch.sum(a) - torch.sum(a, out=out) = 2.0 us)

Parallelization decision

Source: TensorIteratorReduce.cpp

```cpp
if (numel < GRAIN_SIZE || get_num_threads() == 1 || in_parallel_region()) {
    serial_for_each(loop, {0, numel});      // No parallelism
} else if (output.numel() == 1) {
    two_pass_reduction(*this, loop);        // Scalar output → buffer per thread
} else {
    parallel_dim_reduction(*this, loop);    // Reduce along one dimension
}
```
  • Below 32,768 elements: sequential (no OpenMP overhead)
  • Scalar output (our case): two-pass (buffer per thread)
  • Multi-element output: parallel_dim (parallelize non-reduced dims)

There is no caching allocator for CPU — it uses glibc posix_memalign/free directly
(c10/core/impl/alloc_cpu.cpp:126). But glibc's internal free list makes
repeated same-size allocations fast (~50-100 ns after the first call).

Thread management: OpenMP

Source: aten/src/ATen/ParallelOpenMP.h:14-54

```cpp
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int64_t chunk = divup((end - begin), nthreads);
    // Each thread gets a contiguous chunk, no task queue
}
```
  • Persistent thread pool (OpenMP creates threads once at program start)
  • Static work division (no task stealing, no scheduling overhead)
  • Implicit barrier at end of #pragma omp parallel block
  • Measured fork/join overhead @32t: ~13 us (from single-thread minus parallel comparison)

No dispatch machinery

Unlike IREE, there is no:

  • VM invoke/deinvoke
  • HAL command buffer create/destroy
  • Semaphore create/signal/wait
  • Fence create/destroy
  • Task scheduler / work stealing
  • Buffer allocation for intermediate state (beyond the 128-byte buffer)

The kernel is a direct C++ function call from parallel_reduce into the
compiled template instantiation. The only indirection is OpenMP's thread
fork/join.
