[Feature]: add a batched version of hsa_amd_profiling_convert_tick_to_system_domain. #243

@benvanik

Suggestion Description

I'm capturing timestamps on device with __builtin_readsteadycounter (or extracting them from signals myself) and end up with large buffers of them that I'd like to translate without the API overhead of calling hsa_amd_profiling_convert_tick_to_system_domain on each one in a loop. For such cases it'd be nice to have a hsa_amd_profiling_convert_tick_batch_to_system_domain that accepts a list of ticks and either updates them in place or writes the results to an output buffer.

What I noticed is that GpuAgent::TranslateTime takes a lock, does some looping math to see if synchronization is required, and potentially synchronizes. In a batched mode that could be done once, and the lock need not be held for the entire duration of the translation (t0/t1 can be reused). Batching trades off some accuracy, since the skew can change over the course of a batch, but translating all the ticks consistently is better behavior than an outer loop: today the timestamps can change base in the middle of translation and produce inconsistent results, which messes up reporting. The user of such an API could choose the batch/flush frequency to balance the drift and rebase where it makes sense (in between top-level invocations, frames, etc., where there are natural points to rebase).
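To make the idea concrete, here is a rough, self-contained sketch of what a batched translation could look like. It is not the ROCR implementation: `ClockSnapshot`, `AgentClock`, and `TranslateBatch` are hypothetical names, and the linear interpolation between two calibration points is only an assumption about how the t0/t1 pair might be used. The point it illustrates is that the lock is taken once to copy the calibration state, and every tick in the batch is then translated against that same snapshot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical calibration state: two (agent tick, system tick) sample pairs.
struct ClockSnapshot {
  uint64_t agent_t0, system_t0;
  uint64_t agent_t1, system_t1;
};

// Hypothetical stand-in for the per-agent state guarded by a lock.
struct AgentClock {
  std::mutex lock;
  ClockSnapshot current{};
};

// Translates `count` agent ticks to the system domain using one snapshot.
// `system_ticks` may alias `agent_ticks` for in-place conversion.
inline void TranslateBatch(AgentClock& clock, const uint64_t* agent_ticks,
                           size_t count, uint64_t* system_ticks) {
  ClockSnapshot snap;
  {
    // Lock only long enough to copy the calibration points. A real
    // implementation would also decide here whether to resynchronize.
    std::lock_guard<std::mutex> guard(clock.lock);
    snap = clock.current;
  }
  assert(snap.agent_t1 != snap.agent_t0 && "calibration points must differ");

  // Assumed model: system ticks are a linear function of agent ticks.
  const double ratio =
      static_cast<double>(snap.system_t1 - snap.system_t0) /
      static_cast<double>(snap.agent_t1 - snap.agent_t0);

  for (size_t i = 0; i < count; ++i) {
    // Signed delta so ticks captured before agent_t0 extrapolate backwards.
    const int64_t delta = static_cast<int64_t>(agent_ticks[i] - snap.agent_t0);
    const int64_t scaled =
        static_cast<int64_t>(static_cast<double>(delta) * ratio);
    system_ticks[i] = snap.system_t0 + static_cast<uint64_t>(scaled);
  }
}
```

Passing the same pointer for `agent_ticks` and `system_ticks` gives the in-place variant; passing a separate buffer gives the output-buffer variant. Because every element uses the same snapshot, the whole batch shares one time base, and the caller decides how often to start a new batch and pick up a fresh calibration.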

Operating System

No response

GPU

No response

ROCm Component

No response
