
Wall-time/all tasks profiler #55889

Merged
d-netto merged 1 commit into master from dcn-all-task-profiler on Oct 25, 2024

Conversation

@d-netto (Member) commented Sep 26, 2024

One limitation of sampling CPU/thread profiles, as is currently done in Julia, is that they primarily capture samples from CPU-intensive tasks.

If many tasks are performing IO or contending for concurrency primitives like semaphores, these tasks won’t appear in the profile, as they aren't scheduled on OS threads sampled by the profiler.

A wall-time profiler, like the one implemented in this PR, samples tasks regardless of OS thread scheduling. This enables profiling of IO-heavy tasks and detecting areas of heavy contention in the system.

Co-developed with @nickrobinson251.
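
For illustration, a minimal sketch of the intended usage (the workload and the name mixed_workload are made up for this example; only the Profile.@profile_walltime entry point comes from this PR):

using Profile

function mixed_workload()
    # An IO/timer-bound task: invisible to the CPU profiler, but the
    # wall-time profiler should sample it while it sleeps.
    t = Threads.@spawn sleep(1.0)
    # CPU-bound work on the current task.
    s = sum(sqrt(i) for i in 1:10^8)
    wait(t)
    return s
end

Profile.@profile_walltime mixed_workload()
Profile.print()  # stacks for both the sleeping and the computing task should appear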

@d-netto (Member, Author) commented Sep 26, 2024

An additional note: this kind of wall-time/all-tasks profiler also exists in Go (where it's known as the goroutine profiler), so there is precedent for this in other languages as well: https://github.com/felixge/fgprof.

@d-netto (Member, Author) commented Sep 26, 2024

@nickrobinson251 I can't assign you as reviewer... Feel free to assign yourself or post review comments otherwise.

@d-netto force-pushed the dcn-all-task-profiler branch 2 times, most recently from 6b80fe3 to f5c8f5f on September 26, 2024 14:13
@IanButterworth (Member)

I think this is related to #55103. Could the metrics here be useful in that too?

@d-netto force-pushed the dcn-all-task-profiler branch 2 times, most recently from 1e9f41f to 6cd27d7 on September 26, 2024 16:02
@nsajko added the feature (Indicates new feature / enhancement requests) label on Sep 26, 2024
@d-netto force-pushed the dcn-all-task-profiler branch 3 times, most recently from f6ea007 to 1029a84 on September 27, 2024 11:38
@d-netto requested a review from gbaraldi on September 27, 2024 12:25
@d-netto force-pushed the dcn-all-task-profiler branch 3 times, most recently from 0d4ca9c to e493403 on September 30, 2024 11:42
@d-netto (Member, Author) commented Sep 30, 2024

> I think this is related to #55103. Could the metrics here be useful in that too?

For diagnosing excessive scheduling time? I can't immediately see how this PR would be useful for that.

@d-netto (Member, Author) commented Sep 30, 2024

> For diagnosing excessive scheduling time

#55103 seems like a much more direct approach for doing so, at least.

@d-netto force-pushed the dcn-all-task-profiler branch 2 times, most recently from 90bca24 to 14766d3 on September 30, 2024 14:57
@d-netto force-pushed the dcn-all-task-profiler branch from 14766d3 to b9f0f1d on September 30, 2024 18:34
@NHDaly added the needs tests (Unit tests are required for this change) label on Oct 1, 2024
@d-netto force-pushed the dcn-all-task-profiler branch 2 times, most recently from 5ddd5ba to c9d1995 on October 3, 2024 12:50
@d-netto requested review from gbaraldi and vtjnash on October 17, 2024 21:17
@d-netto force-pushed the dcn-all-task-profiler branch 3 times, most recently from 15e041c to 5ca6bf4 on October 18, 2024 14:25
@d-netto force-pushed the dcn-all-task-profiler branch from 5ca6bf4 to 8ffacd7 on October 18, 2024 14:31
@d-netto (Member, Author) commented Oct 19, 2024

@vtjnash: IIRC your comments stem from the fact that a GC may get interleaved with jl_get_all_tasks_arraylist and invalidate the contents of live_tasks -- both by freeing unreachable tasks and by freeing the underlying buffers from mtarraylist_t that have been scheduled for freeing.

Task freeing happens in sweep_stack_pools and the freeing of mtarraylist_t buffers happens when finalizers run.

Perhaps we could:

  • Hold the mutex during the whole jl_profile_task -- to ensure a task we are about to profile doesn't get GC'd while we're collecting its backtrace, since sweep_stack_pools will be blocked because we're holding the lock.
  • Hold the mutex while we free mtarraylist_t buffers for the same reasons.

Any thoughts?

@d-netto (Member, Author) commented Oct 22, 2024

> Can you share some examples e.g. simple workloads you've tried this out on and what the profiles look like?

Yes, here they are:

Workload 1: a_bunch_of_tasks_waiting_on_channel.jl

using Base.Threads
using Profile
using PProf

ch = Channel(1)

const N_SPAWNED_TASKS = (1 << 10)
const WAIT_TIME_NS = 10_000_000

function spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            take!(ch)
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    spawn_a_bunch_of_tasks_waiting_on_channel()
    for i in 1:N_SPAWNED_TASKS
        put!(ch, i)
        busywait()
    end
end

Profile.@profile_walltime main()
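# Presumably followed by pprof() from PProf.jl to render the flame graph
# shown below; that call isn't part of the original snippet.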

Expected results

Given that we have a lot more tasks waiting on a channel than tasks actually computing (spinning), we expect to see a large number of samples in channel-related functions. That seems to be the case here.

[Screenshot: PProf profile, 2024-10-22 09:57:31]

Workload 2: a_lot_of_short_lived_tasks.jl

using Base.Threads
using Profile
using PProf

const N_SPAWNED_TASKS = (1 << 16)
const WAIT_TIME_NS = 100_000

function spawn_a_bunch_of_short_lived_tasks()
    for i in 1:N_SPAWNED_TASKS
        Threads.@spawn begin
            # Do nothing
        end
    end
end

function busywait()
    t0 = time_ns()
    while true
        if time_ns() - t0 > WAIT_TIME_NS
            break
        end
    end
end

function main()
    spawn_a_bunch_of_short_lived_tasks()
    for i in 1:N_SPAWNED_TASKS
        busywait()
    end
end

Profile.@profile_walltime main()

Expected results

Given that we have a lot of short-lived tasks, we expect to get a large number of samples corresponding to "failed to sample task & capture backtrace". That seems to be the case here.

[Screenshot: PProf profile, 2024-10-22 10:03:50]

@d-netto (Member, Author) commented Oct 22, 2024

A few more workloads suggested by @NHDaly.

Workload 3: compute_heavy.jl

using Base.Threads
using Profile
using PProf

ch = Channel(1)

const MAX_ITERS = (1 << 22)
const N_TASKS = (1 << 12)

function spawn_a_task_waiting_on_channel()
    Threads.@spawn begin
        take!(ch)
    end
end

function sum_of_sqrt()
    sum_of_sqrt = 0.0
    for i in 1:MAX_ITERS
        sum_of_sqrt += sqrt(i)
    end
    return sum_of_sqrt
end

function spawn_a_bunch_of_compute_heavy_tasks()
    @sync begin
        for i in 1:N_TASKS
            Threads.@spawn begin
                sum_of_sqrt()
            end
        end
    end
end

function main()
    spawn_a_task_waiting_on_channel()
    spawn_a_bunch_of_compute_heavy_tasks()
end

Profile.@profile_walltime main()

Expected results

We have a lot more compute-heavy tasks than sleeping tasks. We expect to see a lot of samples in sum_of_sqrt. That seems to be the case here.

[Screenshot: PProf profile, 2024-10-22 10:53:57]

@NHDaly (Member) commented Oct 22, 2024

Cool, thanks! 🎉

🤔 Is it expected that the currently-scheduled tasks seem to have their stacks starting at a different frame than the waiting tasks?

It looks like the executing tasks start right with the function in the Task (##1()), while the waiting tasks start with start_task | jl_apply | then the Task function?

I can't decide if I think this is helpful or not. On the one hand, it's maybe nice to visually divide the scheduled vs sleeping tasks, but on the other hand I think it would make more sense to integrate the stacks together if they share most of their content.

@NHDaly (Member) commented Oct 22, 2024

My instinct is that this is not desirable, and we should figure out why they're different, and correct that.

@d-netto (Member, Author) commented Oct 22, 2024

> Is it expected that the currently-scheduled tasks seem to have their stacks starting at a different frame than the waiting tasks?

Good question, I don't know. Will investigate this.

@d-netto (Member, Author) commented Oct 25, 2024

Wrote this short demo on how to use the profiler: https://github.com/d-netto/Wall-time-Profiler-Demo.

Feedback is welcome.

Perhaps we should add this to some documentation page in Julia itself.

@vtjnash (Member) commented Oct 25, 2024

Can you add this to the Profile docs and the NEWS file?

@d-netto (Member, Author) commented Oct 25, 2024

> Can you add this to the Profile docs and the NEWS file?

Sounds good, will do. Thanks for the reviews.

@JZL commented Oct 27, 2024

This looks really interesting, I'm excited to use it!

To confirm, there's no way to use this with SIGUSR1/SIGINFO signalling, right? I've been trying to make sense of the code but am unsure -- am I correct that it's this line's [1] hardcoded 0/false that forces it to always use the old behavior?

Are there plans to make this modifiable using e.g. env variables, as for heap snapshots, or function references like peek_report?

[1]

if (jl_profile_start_timer(0) < 0)

@nickrobinson251 (Contributor)

> there's no way to use this with SIGUSR1/SIGINFO signalling, right?

Your understanding is correct.

> Are there plans to make this modifiable using e.g. env variables, as for heap snapshots, or function references like peek_report?

I think that could be a good addition! Perhaps open a feature request issue for it? We might even want both: e.g. I suppose an ENV variable could control whether to take a CPU profile or a wall-time profile, but if you want both it might be best to use peek_report, so we might want to add an easy, official way to do that, and document it. Right now I think you'd have to use the non-public function Profile.start_timer(true). Worth an issue to discuss, I think :)
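
For concreteness, a rough sketch of that non-public path (Profile.start_timer/Profile.stop_timer are internals and may change; run_my_workload is a hypothetical stand-in for the code being profiled):

using Profile

Profile.start_timer(true)   # non-public internal; true selects the wall-time/all-tasks mode
run_my_workload()           # hypothetical placeholder for the workload to profile
Profile.stop_timer()        # non-public internal; stops sampling
Profile.print()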

You might also be interested in #56043 (which, depending on your use-case, might be even better than a wall-time profile, since it'll print the backtrace for all tasks, not just a sample).

Labels: feature (Indicates new feature / enhancement requests), profiler