
[CPU] Reduction (sum) performance study on AMD Zen4 7975WX #24011

@hanhanW

Description


IREE CPU Reduction Performance Report

Machine: 32-core AMD Zen4 7975WX, AVX-512, 32 threads.
Baseline: PyTorch 2.9.0+cpu with MKL.
IREE flags: --iree-opt-level=O2 --iree-opt-data-tiling --iree-llvmcpu-target-cpu=host

```mlir
// reduction sum: static 1000000
func.func @sum(%input: tensor<1000000xf32>) -> tensor<f32> {
  %cst = arith.constant 0.0 : f32
  %init = tensor.empty() : tensor<f32>
  %fill = linalg.fill ins(%cst : f32) outs(%init : tensor<f32>) -> tensor<f32>
  %result = linalg.reduce { arith.addf }
    ins(%input : tensor<1000000xf32>)
    outs(%fill : tensor<f32>)
    dimensions = [0]
  return %result : tensor<f32>
}
```

Note: this report is built on top of a codegen bug fix (arguably an improved feature): #24010

Overview: IREE vs MKL (sum 1D N=1M, @32 threads/workers)

|                  | MKL   | IREE no-split | IREE c=8 fixed | IREE c=2 |
|------------------|-------|---------------|----------------|----------|
| Total (measured) | 15 us | 61 us         | 59 us          | 61 us    |
| Compute          | —     | 39.5 us       | 9.9 us         | 22.7 us  |
| Runtime overhead | —     | 21.5 us       | 49.1 us        | 38.3 us  |
| vs MKL           | 1.0x  | 4.1x          | 3.9x           | 4.1x     |

  • c=N means there are N chunks in total, i.e., the number of workgroups in the main reduction dispatch is N.

MKL total is a black-box measurement; no internal breakdown available.
IREE compute and overhead are from Tracy single-iteration trace (see below).

c=8 (tile=124992) has the lowest compute (9.9 us, 9 parallel shards at ~5 us) but
highest overhead (49.1 us — scheduling 9 workers + two dispatches).
The gap to MKL is 3.9x and is dominated by runtime overhead, not
compute quality.

Comparison summary: PyTorch vs IREE dispatch model

|                       | PyTorch/OpenMP                  | IREE HAL/task system                                   |
|-----------------------|---------------------------------|--------------------------------------------------------|
| Dispatch floor        | 1.1 us                          | ~10 us                                                 |
| Thread fork/join      | ~13 us (OpenMP)                 | ~15 us (task scheduler)                                |
| Per-call allocations  | 2 us (output + 128B buffer)     | ~15 us (semaphores, fences, cmd buffer, heap buffer)   |
| Work distribution     | Static chunk per thread         | Workgroup → task queue → worker                        |
| Small tensor handling | Sequential if < 32K elements    | Always dispatches through full HAL path                |
| Total @32t, 1M sum    | 15 us                           | 56 us                                                  |

The fork/join costs are comparable (~13 vs ~15 us). The gap comes from:

  1. Dispatch floor: 1.1 us vs ~10 us (9 us difference)
  2. Per-call allocations: 2 us vs ~15 us (13 us difference)
  3. Kernel efficiency: profiles show the kernels themselves are comparably efficient.

Tracy single-iteration breakdown: c=8 fixed @32w (56.0 us)

Traced from steady-state BenchmarkIteration at t≈3s in
sum_1M_c8_fixed_32w.tracy. Threads: 1 (main/VM), 2-10 (task workers).

| Phase                       | Duration (us) | %    | Description |
|-----------------------------|---------------|------|-------------|
| VM setup                    | 12.4          | 22%  | vm_invoke, semaphore_create, fence_create, buffer_alloc, queue_execute, executor_submit |
| Scheduling delay            | 14.8          | 26%  | fence_await starts (t=15.9) → first shard starts (t=30.7) |
| Dispatch 0 compute          | ~10           | 18%  | 7-9 parallel shards across threads 2-10, each ~5 us; wall-clock ≈ 10 us (t=30.7 to t=40.7) |
| Inter-dispatch + dispatch_1 | 3.7           | 7%   | dispatch_0 retire → dispatch_1 execute (0.0 us) → retire |
| Cleanup + wakeup            | 15.1          | 27%  | semaphore_signal, resource cleanup, fence_await wakeup, buffer_view_create, vm_end_invoke |
| Total                       | 56.0          | 100% | |

Timeline visualization: c=8 fixed (one iteration, 56 us)

From sum_1M_c8_fixed_32w.tracy, BenchmarkIteration at ns_since_start≈3000051421.
All timestamps below are relative to iteration start (t=0).

```
     0 us       10        20        30        40        50        56
     |          |         |         |         |         |         |
T1  [--vm_setup(12.4)--][------fence_await/blocked(40.2)------][cl]
     t=0                  t=12.5                                t=55.5
     vm_invoke            semaphore_multi_wait                  end_invoke
     sem_create×2
     fence_create×3
     buf_alloc×2
     queue_exec

T2  ·············[cmd_buf][wait]·········[===D0===]·[retire]········
                 t=13.0   t=19.4         t=30.7     t=47.7
                 issue_cmd queue_wait     5.1 us     cmd_cleanup

T3  ·····································[===D0===]·················
                                         t=32.2  5.0 us

T4  ······································[===D0===]················
                                          t=31.3  5.1 us

T5  ······································[===D0===]················
                                          t=31.3  5.0 us

T6  ·····································[===D0===]·················
                                         t=30.4  5.1 us

T7  ···································(asleep — woke but no shard this iter)

T8  ·································[===D0===]·····················
                                     t=27.3  4.9 us

T9  ··································[===D0===]····················
                                      t=28.6  4.9 us

T10 ··········································[D1 0.04us]·[retire+signal]
                                              t=41.0       t=44.4
                                              reduce 9→1   sem_signal
```

Key timestamps (us from iteration start):

  • t=0.0: BenchmarkIteration starts (thread 1)
  • t=12.5: fence_await starts — thread 1 blocks
  • t=27.3: first dispatch_0 shard starts (thread 8, 4.9 us)
  • t=32.2: last dispatch_0 shard starts (thread 3, 5.0 us)
  • t=37.2: last dispatch_0 shard completes
  • t=41.0: dispatch_1 starts (thread 10, reduce 9→1, 0.04 us)
  • t=44.4: semaphore_signal (wakes main thread)
  • t=55.5: vm_end_invoke, iteration ends

Compute: dispatch_0 shards t=27.3→37.2 (9.9 us) + dispatch_1 at t=41.0 (0.04 us) = 9.9 us (18%).
Inter-dispatch overhead: t=37.2→41.0 = 3.8 us (retire, barrier, issue).

Timeline visualization: c=2 (one iteration, 60.7 us)

From sum_1M_c2_32w.tracy, BenchmarkIteration at ns_since_start=3000054482.
All timestamps below are relative to iteration start (t=0).

```
     0 us       10        20        30        40        50        60.7
     |          |         |         |         |         |         |
T1  [--vm_setup(12.6)--][-------fence_await/blocked(43.2)-------][cl]
     t=0                  t=12.6                                  t=58.7
     vm_invoke            semaphore_multi_wait                    end_invoke
     sem_create×2
     fence_create×3
     buf_alloc
     queue_exec

T2  ·················[cmd][wait]·[========dispatch_0========]·[retire]
                     t=14.8      t=22.9                t=42.6
                     cmd_buf     19.7 us (500K f32)    cleanup
                     issue

T3  ···························[========dispatch_0========]·[D1]·[ret]
                               t=25.7                t=45.5 t=48.8
                               19.8 us (500K f32)          0.1us
                                                           signal
```

Key timestamps (us from iteration start):

  • t=0.0: BenchmarkIteration starts (thread 1)
  • t=12.6: fence_await starts — thread 1 blocks
  • t=22.9: first dispatch_0 shard starts (thread 2, 19.7 us)
  • t=25.7: second dispatch_0 shard starts (thread 3, 19.8 us)
  • t=45.5: last dispatch_0 shard completes
  • t=48.8: dispatch_1 completes (reduce 2→1, 0.1 us)
  • t=58.7: vm_end_invoke, iteration ends

Compute: dispatch_0 shards t=22.9→45.5 (22.6 us) + dispatch_1 at t=48.8 (0.1 us) = 22.7 us (37%).
Inter-dispatch overhead: t=45.5→48.7 = 3.2 us (retire, barrier, issue).

Comparison: c=2 vs c=8 fixed iteration breakdown

| Phase                       | c=2 (60.7 us)  | c=8 fixed (56.0 us) |
|-----------------------------|----------------|---------------------|
| VM setup                    | 12.6 us (21%)  | 12.4 us (22%)       |
| Scheduling delay            | 5.8 us (10%)   | 14.8 us (26%)       |
| Dispatch 0 compute          | 22.8 us (37%)  | ~10 us (18%)        |
| Inter-dispatch + dispatch_1 | 3.1 us (5%)    | 3.7 us (7%)         |
| Cleanup + wakeup            | 16.4 us (27%)  | 15.1 us (27%)       |

c=8 has lower compute (10 us vs 22.8 us — more parallelism) but higher
scheduling delay (14.8 us vs 5.8 us — more workers to wake up). The net
result is roughly the same wall-clock (56 vs 61 us).

Thread scaling: c=8 fixed vs no-split vs MKL

| Workers | no-split (ms) | c=8 fixed (ms) | MKL (ms) |
|---------|---------------|----------------|----------|
| 1       | 0.059         | 0.080          | 0.031    |
| 2       | 0.065         | 0.069          | 0.022    |
| 4       | 0.065         | 0.056          | 0.015    |
| 8       | 0.065         | 0.055          | 0.011    |
| 16      | 0.065         | 0.063          | 0.010    |
| 32      | 0.066         | 0.059          | 0.015    |

c=8 fixed now scales: 0.080 → 0.055 ms from 1 → 8 workers.
Best at 8 workers: 0.055 ms (3.7x off MKL @32t).

Remaining gap to MKL

The gap is now entirely runtime overhead. Per the Tracy breakdown, only
18% of wall time is compute; 82% is IREE runtime machinery:

  • VM lifecycle: 12.4 us per iteration (semaphore/fence/buffer management)
  • Scheduling delay: 14.8 us (fence_await → first shard)
  • Cleanup: 15.1 us (retire, signal, resource free)

MKL completes the same sum in 15 us total @32t.

MKL Analysis

Parallelism: OpenMP two-pass reduction

Source: aten/src/ATen/native/TensorIteratorReduce.cpp

This path is equivalent to IREE's split-reduction path.

```cpp
static void two_pass_reduction(TensorIteratorBase& iter, loop2d_t loop) {
    const int max_threads = at::get_num_threads();
    auto buffer = at::empty({max_threads, 1}, dst.options()); // one alloc: 128 bytes
    buffer.copy_(unsqueezed);                                 // fill with identity

    at::parallel_for(0, numel, GRAIN_SIZE, [&](int64_t begin, int64_t end) {
        // Each thread reduces its chunk into buffer[thread_id]
        base_ptrs[0] += buffer_stride * thread_num;
        serial_for_each(shape, strides, base_ptrs, loop, {begin, end});
    });

    // Final: reduce buffer[0..max_threads] → output
    auto final_reduce = TensorIterator::reduce_op(unsqueezed, buffer);
    final_reduce.for_each(loop);
}
```
  • Pass 1: parallel_for(0, 1M, 32768, ...) → ~31 chunks across 32 threads.
    Each thread accumulates into buffer[thread_id].
  • Pass 2: Sequential reduce of 32 partial sums (trivial).
  • Grain size: GRAIN_SIZE = 32,768 elements (TensorIterator.h:78).

Per-iteration allocations

Source: TensorIteratorReduce.cpp, c10/core/CPUAllocator.cpp:20

```cpp
auto buffer = at::empty(buffer_shape, dst.options());
// → alloc_cpu(128) → posix_memalign(128 bytes)
```

Per torch.sum() call:

  1. Output tensor: ~0.6 us (torch.empty(()) — scalar + metadata)
  2. Intermediate buffer: ~0.7 us (at::empty({32, 1}) — 128 bytes via posix_memalign)
  3. Total allocation: ~2 us (measured: torch.sum(a) - torch.sum(a, out=out) = 2.0 us)

Parallelization decision

Source: TensorIteratorReduce.cpp

```cpp
if (numel < GRAIN_SIZE || get_num_threads() == 1 || in_parallel_region()) {
    serial_for_each(loop, {0, numel});      // No parallelism
} else if (output.numel() == 1) {
    two_pass_reduction(*this, loop);        // Scalar output → buffer per thread
} else {
    parallel_dim_reduction(*this, loop);    // Reduce along one dimension
}
```
  • Below 32,768 elements: sequential (no OpenMP overhead)
  • Scalar output (our case): two-pass (buffer per thread)
  • Multi-element output: parallel_dim (parallelize non-reduced dims)

There is no caching allocator for CPU — it uses glibc posix_memalign/free directly
(c10/core/impl/alloc_cpu.cpp:126). But glibc's internal free list makes
repeated same-size allocations fast (~50-100 ns after the first call).

Thread management: OpenMP

Source: aten/src/ATen/ParallelOpenMP.h:14-54

```cpp
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int64_t chunk = divup((end - begin), nthreads);
    // Each thread gets a contiguous chunk, no task queue
}
```
  • Persistent thread pool (OpenMP creates threads once at program start)
  • Static work division (no task stealing, no scheduling overhead)
  • Implicit barrier at end of #pragma omp parallel block
  • Measured fork/join overhead @32t: ~13 us (from single-thread minus parallel comparison)

No dispatch machinery

Unlike IREE, there is no:

  • VM invoke/deinvoke
  • HAL command buffer create/destroy
  • Semaphore create/signal/wait
  • Fence create/destroy
  • Task scheduler / work stealing
  • Buffer allocation for intermediate state (beyond the 128-byte buffer)

The kernel is a direct C++ function call from parallel_reduce into the
compiled template instantiation. The only indirection is OpenMP's thread
fork/join.
