Description
Suggestion Description
I'm capturing timestamps on device with __builtin_readsteadycounter (or extracting them from signals myself) and end up with large buffers of them that I'd like to translate without the API overhead of calling hsa_amd_profiling_convert_tick_to_system_domain on each one in a loop. For such cases it would be nice to have a hsa_amd_profiling_convert_tick_batch_to_system_domain that accepts a list of ticks and either updates them in place or writes the results to an output buffer.
I noticed that GpuAgent::TranslateTime takes a lock, does some looping math to see if synchronization is required, and potentially synchronizes. In a batched mode that work could be done once, and the lock need not be held for the entire duration of the translation, since t0/t1 can be reused. Batching trades off some accuracy, because the skew can change over the course of a batch, but translating all of the timestamps consistently is better behavior than an outer loop: today the timestamps can change base in the middle of translation and produce inconsistent results, which messes up reporting. The user of such an API could choose the batch/flush frequency to balance the drift and rebase when it makes sense (in between top-level invocations, frames, etc., where there are natural points to rebase).
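To make the idea concrete, here is a rough, self-contained sketch of the batching pattern I have in mind. It is not the ROCR implementation: ClockSnapshot, TickTranslator, Resync, and TranslateBatch are hypothetical names, and the simple two-point linear interpolation is my assumption about how t0/t1 would be reused. The point is only that the calibration is copied once under the lock and every tick in the batch is then translated against that same copy.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical calibration state: two paired samples of the GPU counter and
// the system clock. Assumes gpu_t1 != gpu_t0.
struct ClockSnapshot {
  uint64_t gpu_t0, gpu_t1;
  uint64_t sys_t0, sys_t1;
};

// Hypothetical translator illustrating "snapshot once, translate many".
class TickTranslator {
 public:
  explicit TickTranslator(ClockSnapshot initial) : snap_(initial) {}

  // Stand-in for whatever resynchronization the runtime performs.
  void Resync(ClockSnapshot fresh) {
    std::lock_guard<std::mutex> guard(mu_);
    snap_ = fresh;
  }

  // Translates count ticks from `ticks` into `out` (the two may alias for
  // in-place use). The lock is held only while copying the calibration.
  void TranslateBatch(const uint64_t* ticks, uint64_t* out,
                      std::size_t count) const {
    ClockSnapshot s;
    {
      std::lock_guard<std::mutex> guard(mu_);
      s = snap_;
    }
    // One ratio for the whole batch, so all results share the same base.
    // double is used for brevity; a real implementation may need more care
    // with precision for large tick deltas.
    const double ratio = static_cast<double>(s.sys_t1 - s.sys_t0) /
                         static_cast<double>(s.gpu_t1 - s.gpu_t0);
    for (std::size_t i = 0; i < count; ++i) {
      // Signed delta so ticks captured before gpu_t0 still translate.
      const int64_t delta = static_cast<int64_t>(ticks[i] - s.gpu_t0);
      const int64_t offset =
          static_cast<int64_t>(std::llround(static_cast<double>(delta) * ratio));
      out[i] = s.sys_t0 + static_cast<uint64_t>(offset);
    }
  }

 private:
  mutable std::mutex mu_;
  ClockSnapshot snap_;
};
```

A caller would fill a buffer of device ticks, call TranslateBatch once per flush point (for example between frames), and call it again after the next resynchronization. Because a concurrent Resync cannot change the base partway through, every timestamp within one batch stays mutually consistent.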
Operating System
No response
GPU
No response
ROCm Component
No response