Conversation
```cpp
// Calibrate loop count to hit a minimum wall time per value.
template <typename Fn>
std::uint64_t calibrate_loops(const Options& options, Fn&& fn) {
```
`calibrate_loops()` looks like a custom reimplementation of the same minimum-time calibration idea that tools like pyperf and Google Benchmark already provide. I understand that there are advantages to keeping the existing custom C++ harness / pyperf-compatible JSON output, but if we go that route, I think the custom calibration should match the usual/expected `--min-time` semantics more closely.
Right now, I do not think `--min-time` actually guarantees the requested minimum sample duration for a large chunk of these microbenchmarks. The calibration stops once it hits `max_loops = 1000000`, and `BenchmarkSuite::run()` then uses that capped loop count as-is. For very fast benchmarks, that cap is still well short of 0.1 s per value. For example:
- a ~`6 ns/op` benchmark tops out around `6 ms`
- a ~`33 ns/op` benchmark tops out around `33 ms`
- even a ~`79 ns/op` benchmark only reaches around `79 ms`
So the new "stable collection" mode still under-times many of the exact fast C++ benchmarks it is meant to stabilize. The same issue also seems to apply to the explicit-loop overload, which means the `--min-time` semantics are not consistently honored there either.
I think that is important to fix before we rely on these numbers as "min-time collected" results, because readers will reasonably assume that `--min-time 0.1` means each measured value is actually being calibrated to about 100 ms.
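One way to honor the minimum-time contract regardless of per-op cost is to grow the loop count geometrically until a batch actually reaches the target, rather than stopping at a fixed cap. This is a minimal sketch, not the PR's actual code: `Options`/`min_time_sec` mirror the names quoted in the diff above, while the function name and growth policy are assumptions.

```cpp
#include <chrono>
#include <cstdint>

// Sketch only: min-time calibration without a hard max_loops cap.
struct Options {
  double min_time_sec = 0.1;
};

template <typename Fn>
std::uint64_t calibrate_loops_min_time(const Options& options, Fn&& fn) {
  std::uint64_t loops = 1;
  for (;;) {
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < loops; ++i) fn();
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    if (elapsed >= options.min_time_sec) return loops;
    // Grow toward the target with some headroom, instead of stopping at a
    // fixed cap that a ~6 ns/op benchmark would hit at only ~6 ms.
    double scale = options.min_time_sec / (elapsed > 0.0 ? elapsed : 1e-9);
    loops = static_cast<std::uint64_t>(
                static_cast<double>(loops) * scale * 1.5) + 1;
  }
}
```

The 1.5x headroom over the proportional estimate trades a slightly longer calibration batch for fewer re-measurement rounds; the exact factor is a tuning choice.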
```cpp
std::uint64_t loops = options_.loops;
Options custom = options_;
if (options_.min_time_sec > 0.0) {
  loops = calibrate_loops(options_, fn);
```
Generated by Cursor GPT-5.4 Extra High Fast after a few rounds of prompting, with a few small edits.
I think the `--min-time` direction is great, but for async / stream-backed benchmarks the calibration pass may change the state being measured before the first timed sample. For synchronous operations that is probably harmless, but for enqueue-style operations it looks like calibration can leave a substantial backlog on the persistent stream, so the harness may end up measuring "enqueue on an already busy stream" rather than the quieter starting condition these benchmarks had before.
More concretely, `calibrate_loops()` repeatedly calls `fn()` until the target wall time is reached, and `BenchmarkSuite::run()` then immediately goes into `run_benchmark()` with no post-calibration drain/reset. That means calibration is not just estimating a loop count; it is also mutating benchmark state right before measurement begins.
Why I think this matters:
- For async benchmarks, `fn()` often enqueues work onto a persistent stream rather than fully completing it.
- A `0.1 s` target can mean a lot of enqueued work before the first warmup/value. For example, a ~`1.6 us` kernel launch benchmark needs on the order of ~60k launches to get there.
- So the first measured sample may start from a deeply backlogged stream rather than an effectively empty stream.
That seems especially relevant for benchmarks like:
- kernel launches in `bench_launch.cpp`
- `cuMemAllocAsync`/`cuMemFreeAsync` in `bench_memory.cpp`
- `cuEventRecord` in `bench_event.cpp`
The effect may be small for some cases, but since this PR is specifically about improving measurement stability, I think it is worth making sure the calibration step is not also changing what is being measured.
Helpful follow-up directions could be:
- add a post-calibration drain/reset step before the first measured warmup/value for async benchmarks
- or do a quick A/B comparison with and without a `cuStreamSynchronize()` after calibration
- or calibrate on a separate warmup path/stream if that is practical
If those variants all produce essentially the same numbers, that would be reassuring. If they move the launch/async-memory/event benchmarks materially, then this concern is real.
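The first follow-up direction can be sketched without any CUDA dependency. Every name below is a hypothetical stand-in: `CountingStream` models a persistent stream, and `synchronize()` plays the role of a post-calibration `cuStreamSynchronize()` drain.

```cpp
#include <cstdint>

// Sketch only: models "calibrate, then drain before the first timed sample".
struct CountingStream {
  std::uint64_t pending = 0;           // enqueued-but-unfinished work items
  void enqueue() { ++pending; }        // e.g. a kernel launch or async alloc
  void synchronize() { pending = 0; }  // drain all outstanding work
};

// Run a calibration pass, drain, and report the backlog that the first
// measured iteration would observe afterwards.
std::uint64_t backlog_at_first_sample(CountingStream& stream,
                                      std::uint64_t calib_iters) {
  for (std::uint64_t i = 0; i < calib_iters; ++i) stream.enqueue();  // calibration
  stream.synchronize();  // the proposed drain/reset step
  return stream.pending; // 0 if drained; ~calib_iters if the drain is skipped
}
```

With the drain in place the first sample starts from an empty stream; without it, a `0.1 s` calibration at ~1.6 us per launch would leave on the order of 60k launches pending.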
```bash
# Run the Python benchmarks in the wheel environment
pixi run -e wheel bench
pixi run -e wheel bench --min-time 0.1
```
Description
Follow-up to #1580.
This PR is to discuss and improve the C++ harness for more stable data collection.
I added a `--min-time` flag and made the compare report the min/std.
This is the latest run I got, where most of the results are <2%.
`stream.stream_create_destroy` did go back from about 8% to 6% when I collected for more time. This makes me feel a bit better about those numbers, but the Python ones do seem more stable, except `event.event_query`. If we think it is worth doing process isolation for the C++ harness, let me know and I will take a look.
Checklist