Fast & Accurate Microbenchmarking for Zig
Let's benchmark fib:
```zig
const std = @import("std");
const bench = @import("bench");

fn fibNaive(n: u64) u64 {
    if (n <= 1) return n;
    return fibNaive(n - 1) + fibNaive(n - 2);
}

fn fibIterative(n: u64) u64 {
    if (n == 0) return 0;
    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }
    return b;
}

pub fn main() !void {
    const allocator = std.heap.smp_allocator;

    const opts = bench.Options{
        .sample_size = 100,
        .warmup_iters = 3,
    };

    const m_naive = try bench.run(allocator, "fibNaive/30", fibNaive, .{30}, opts);
    const m_iter = try bench.run(allocator, "fibIterative/30", fibIterative, .{30}, opts);

    try bench.report(.{
        .metrics = &.{ m_naive, m_iter },
        .baseline_index = 0, // naive as baseline
    });
}
```

Run it, and you will get the following output in your terminal:
```
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :---------------- | ------: | ---------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `fibNaive/30` | 1.78 ms | 1.00x | 1 | 563.2/s | 8.1M | 27.8M | 3.41 | 0.3 |
| `fibIterative/30` | 3.44 ns | 516055.19x | 300006 | 290.6M/s | 15.9 | 82.0 | 5.15 | 0.0 |
```

The benchmark report generates valid Markdown, so you can copy-paste it directly into a markdown file:
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
|---|---|---|---|---|---|---|---|---|
| `fibNaive/30` | 1.78 ms | 1.00x | 1 | 563.2/s | 8.1M | 27.8M | 3.41 | 0.3 |
| `fibIterative/30` | 3.44 ns | 516055.19x | 300006 | 290.6M/s | 15.9 | 82.0 | 5.15 | 0.0 |
- CPU Counters: Measures CPU cycles, instructions, IPC, and cache misses directly from the kernel (Linux only).
- Argument Support: Pass pre-calculated data to your functions to separate setup overhead from the benchmark loop.
- Baseline Comparison: Easily compare multiple implementations against a reference function to see relative speedups or regressions.
- Flexible Reporting: Access raw metric data programmatically to generate custom reports (JSON, CSV) or assert performance limits in CI.
- Easy Throughput Metrics: Automatically calculates operations per second and data throughput (MB/s, GB/s) when payload size is provided.
- Robust Statistics: Uses median and standard deviation to provide reliable metrics despite system noise.
Fetch the latest version:
```sh
zig fetch --save=bench https://github.com/pyk/bench/archive/main.tar.gz
```

Then add this to your build.zig:
```zig
const bench = b.dependency("bench", .{
    .target = target,
    .optimize = optimize,
});

// Use it on a module
const mod = b.createModule(.{
    .target = target,
    .optimize = optimize,
    .imports = &.{
        .{ .name = "bench", .module = bench.module("bench") },
    },
});

// Or executable
const my_bench = b.addExecutable(.{
    .name = "my-bench",
    .root_module = b.createModule(.{
        .root_source_file = b.path("bench/my-bench.zig"),
        .target = target,
        .optimize = .ReleaseFast,
        .imports = &.{
            .{ .name = "bench", .module = bench.module("bench") },
        },
    }),
});
```
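To run the benchmark through the build system, you can wire the executable into its own step. This is a minimal sketch using standard `std.Build` calls; the step name `bench` is just a choice, not something the library requires:

```zig
// Expose `zig build bench` to build and run the benchmark executable.
const run_bench = b.addRunArtifact(my_bench);
const bench_step = b.step("bench", "Run benchmarks");
bench_step.dependOn(&run_bench.step);
```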
If you are using it only for tests/benchmarks, it is recommended to mark it as lazy:

```zig
.dependencies = .{
    .bench = .{
        .url = "...",
        .hash = "...",
        .lazy = true, // here
    },
}
```
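Note that build.zig then has to resolve a lazy dependency through `b.lazyDependency`, which returns `null` until the package has actually been fetched. A sketch, reusing the names from the example above:

```zig
// Only resolve the dependency when the benchmark executable is actually built.
if (b.lazyDependency("bench", .{
    .target = target,
    .optimize = optimize,
})) |dep| {
    my_bench.root_module.addImport("bench", dep.module("bench"));
}
```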
To benchmark a single function, pass the allocator, a name, the function pointer, a tuple of arguments, and an options struct to `run`:
const res = try bench.run(allocator, "My Function", myFn, .{});
try bench.report(.{ .metrics = &.{res} });You can generate test data before the benchmark starts and pass it via a tuple. This ensures the setup cost doesn't pollute your measurements.
```zig
// Setup data outside the benchmark
const input = try generateLargeString(allocator, 10_000);

// Pass input as a tuple
const res = try bench.run(allocator, "Parser", parseFn, .{input}, .{});
```

You can run multiple benchmarks and compare them against a baseline. The
baseline_index determines which result is used as the reference (1.00x).
```zig
const a = try bench.run(allocator, "Implementation A", implA, .{}, .{});
const b = try bench.run(allocator, "Implementation B", implB, .{}, .{});

try bench.report(.{
    .metrics = &.{ a, b },
    // Use the first metric (Implementation A) as the baseline
    .baseline_index = 0,
});
```

If your function processes data (like copying memory or parsing strings),
provide bytes_per_op to get throughput metrics (MB/s or GB/s).
```zig
const size = 1024 * 1024;
const res = try bench.run(allocator, "Memcpy 1MB", copyFn, .{}, .{
    .bytes_per_op = size,
});

// Report will now show GB/s instead of just Ops/s
try bench.report(.{ .metrics = &.{res} });
```

You can tune the benchmark behavior by modifying the Options struct.
const res = try bench.run(allocator, "Heavy Task", heavyFn, .{
.warmup_iters = 10, // Default: 100
.sample_size = 50, // Default: 1000
});The default bench.report prints a clean, Markdown-compatible table to stdout. It
automatically handles unit scaling (ns, us, ms, s) and formatting.
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :---------------- | ------: | ---------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `fibNaive/30` | 1.78 ms | 1.00x | 1 | 563.2/s | 8.1M | 27.8M | 3.41 | 0.3 |
| `fibIterative/30` | 3.44 ns | 516055.19x | 300006 | 290.6M/s | 15.9 | 82.0 | 5.15 | 0.0 |

The run function returns a Metrics struct containing all raw data (min, max,
median, variance, cycles, etc.). You can use this to generate JSON, CSV, or
assert performance limits in CI.
```zig
const metrics = try bench.run(allocator, "MyFn", myFn, .{}, .{});

// Access raw fields directly
std.debug.print("Median: {d}ns, Max: {d}ns\n", .{
    metrics.median_ns,
    metrics.max_ns,
});
```
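Continuing from the snippet above, here is a minimal sketch of a CI-style guard; the budget value and error name are illustrative, not part of bench:

```zig
// Fail the benchmark run if the median time exceeds a chosen budget.
const budget_ns = 1_000; // illustrative threshold
if (metrics.median_ns > budget_ns) {
    std.debug.print("regression: median {d}ns exceeds {d}ns budget\n", .{
        metrics.median_ns, budget_ns,
    });
    return error.PerformanceRegression;
}
```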
The run function returns a Metrics struct containing the following data points:
| Category | Metric | Description |
|---|---|---|
| Meta | `name` | The identifier string for the benchmark. |
| Time | `min_ns` | Minimum execution time per operation (nanoseconds). |
| Time | `max_ns` | Maximum execution time per operation (nanoseconds). |
| Time | `mean_ns` | Arithmetic mean execution time (nanoseconds). |
| Time | `median_ns` | Median execution time (nanoseconds). |
| Time | `std_dev_ns` | Standard deviation of the execution time. |
| Meta | `samples` | Total number of measurement samples collected. |
| Throughput | `ops_sec` | Calculated operations per second. |
| Throughput | `mb_sec` | Data throughput in MB/s (populated if `bytes_per_op` > 0). |
| Hardware* | `cycles` | Average CPU cycles per operation. |
| Hardware* | `instructions` | Average CPU instructions executed per operation. |
| Hardware* | `ipc` | Instructions Per Cycle (efficiency ratio). |
| Hardware* | `cache_misses` | Average cache misses per operation. |
*Hardware metrics are currently available on Linux only. They will be null
on other platforms or if permissions are restricted.
bench shows you the time your code takes: it tells you *what* the speed is, not *why* it is slow. To find out why, use a profiler such as `perf` on Linux. These tools show you where the CPU spends its time: `perf record` runs your code and collects data, and `perf report` or the Firefox Profiler then shows the hotspots so you can fix the real problems.
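A typical workflow looks something like this; the binary path is only an example, and the `-Doptimize` flag assumes your build script exposes the standard optimize option:

```sh
# Profile the benchmark binary and inspect hotspots
zig build -Doptimize=ReleaseFast
perf record -g ./zig-out/bin/my-bench
perf report
```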
The compiler can remove code if it thinks it does nothing. For example, if you compute a value but never use it, the compiler skips the work. This makes benchmarks wrong: they report fast times for code that never actually ran.

To prevent this, pass your result to `std.mem.doNotOptimizeAway`; the compiler is then forced to compute it. For example, in a scanner or tokenizer:
```zig
while (true) {
    const token = try scanner.next();
    if (token == .end) break;
    std.mem.doNotOptimizeAway(token); // CRITICAL
}
```

Here, `doNotOptimizeAway(token)` forces the compiler to run `scanner.next()`.
Without it, the loop body might be optimized away entirely. Always use this on key results such as counts, parsed values, or other outputs.
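The same applies to accumulated results. For instance, a hypothetical counting loop only survives optimization if the final count is kept alive:

```zig
var count: usize = 0;
while (true) {
    const token = try scanner.next();
    if (token == .end) break;
    count += 1;
}
std.mem.doNotOptimizeAway(count); // without this, the whole counting loop can be elided
```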
On Linux, hardware metrics like cycles and instructions come from the kernel's perf subsystem. By default, access is restricted and you get `null` values.

To fix this, run:

```sh
sudo sysctl -w kernel.perf_event_paranoid=-1
```

This allows your code to read the counters. Set it back to 2 to restrict access again.
Check the current value with `cat /proc/sys/kernel/perf_event_paranoid`. Lower values mean more access; `-1` grants full access. Use it for benchmarks, but be careful on production systems.
If you use constant data like `const input = "hello";`, the compiler knows it at build time. It can unroll loops or compute results ahead of time, so your benchmark measures nothing real: times stay flat even as the data grows.
Instead, use runtime data. Allocate a buffer and fill it.
Bad example:

```zig
const input = " hello"; // Compiler knows every byte
const res = try bench.run(allocator, "Parser", parse, .{input}, .{});
```

Good example:
```zig
const input = try allocator.alloc(u8, 100);
defer allocator.free(input);

// Fill the buffer at runtime so the compiler cannot see its contents
@memset(input, ' ');
@memcpy(input[input.len - 5 ..], "hello");

const res = try bench.run(allocator, "Parser", parse, .{input}, .{});
```

Now the buffer is filled at runtime, the compiler cannot fold it, and times scale with real work. For varying tests, change the buffer size each run.
- Fixing Microbenchmark Accuracy
- Fixing Zig benchmark where `std.mem.doNotOptimizeAway` was ignored
- Writing a Type-Safe Linux Perf Interface in Zig
Install the Zig toolchain via mise (optional):
```sh
mise trust
mise install
```

Run tests:

```sh
zig build test --summary all
```

Build library:

```sh
zig build
```

Enable/disable kernel.perf_event_paranoid for debugging:

```sh
# Restrict access
sudo sysctl -w kernel.perf_event_paranoid=2

# Allow access (Required for CPU metrics)
sudo sysctl -w kernel.perf_event_paranoid=-1
```

MIT. Use it for whatever.