The code below runs the file at https://github.com/Ferrite-FEM/Ferrite.jl/blob/kc/landau_opt/docs/src/literate-gallery/landau.jl, which has the option to run the assembly routine with or without Threads.@threads. It uses a bad style of parallelism based on threadid, but that is not the point here.
If we run this code on 1.11:
git clone https://github.com/Ferrite-FEM/Ferrite.jl/
cd Ferrite.jl
git checkout kc/landau_opt
#### 1.11 ####
julia +1.11 --project=docs -e 'using Pkg; Pkg.update()'
# non-threaded loop
julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl
# F: 0.016159 seconds
# ∇F!: 0.065011 seconds
# ∇²F!: 1.169886 seconds (180.00 k allocations: 49.439 MiB, 0.07% gc time)
# 9.461187 seconds (3.51 M allocations: 1.392 GiB, 1.08% gc time, 6.16% compilation time)
# threaded loop
RUN_THREADED=1 julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl
# F: 0.004370 seconds (1.81 k allocations: 188.125 KiB)
# ∇F!: 0.018842 seconds (1.81 k allocations: 188.125 KiB)
# ∇²F!: 0.262754 seconds (181.82 k allocations: 49.624 MiB, 0.21% gc time)
# 3.578371 seconds (3.54 M allocations: 1.395 GiB, 2.35% gc time, 15.19% compilation time)
We can make the following observations:
- The amount allocated for the threaded and non-threaded loop is roughly the same
- The overhead in allocations from F and ∇F! being called threaded is fixed and small.
Now, if we run this on 1.12:
#### 1.12 ####
julia +1.12 --project=docs -e 'using Pkg; Pkg.update()'
# non-threaded loop
julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl
# F: 0.014978 seconds
# ∇F!: 0.064877 seconds
# ∇²F!: 1.213513 seconds (210.00 k allocations: 50.812 MiB, 0.07% gc time)
# 9.258766 seconds (2.01 M allocations: 1.317 GiB, 0.62% gc time, 3.71% compilation time)
# threaded loop
RUN_THREADED=1 julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl
# F: 0.018132 seconds (121.30 k allocations: 83.894 MiB, 24.68% gc time)
# ∇F!: 0.026937 seconds (91.30 k allocations: 83.436 MiB, 12.25% gc time)
# ∇²F!: 1.037706 seconds (11.70 M allocations: 6.379 GiB, 26.83% gc time)
# 7.667274 seconds (72.23 M allocations: 40.276 GiB, 22.67% gc time, 4.58% compilation time)
We can see the following:
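- The code allocates an abhorrent amount in the threaded case (about 40 GiB in total, versus about 1.3 GiB non-threaded), and the GC has to work a lot (over 20% of the run time).
- The overhead in allocations is no longer fixed: it differs between the functions (about 84 MiB for F and ∇F!, but about 6.4 GiB for ∇²F!).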
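To check whether this is specific to Ferrite, a minimal sketch along the following lines could be run on both versions. It is not taken from landau.jl: `work`, `serial!`, `threaded!` and `measure` are hypothetical names, and the kernel is a stand-in that only touches scalars. If only the fixed task-spawning cost of `Threads.@threads` is present, the threaded allocation count should stay roughly constant as `n` grows; whether this small case shows the same growth as above is an open question.

```julia
using Base.Threads

# Hypothetical stand-in for a per-element routine. It only uses scalars,
# so the loop body itself is not expected to allocate.
function work(i::Int)
    s = 0.0
    for j in 1:100
        s += sin(i * j)
    end
    return s
end

# Plain loop, each iteration writes to its own slot.
function serial!(out)
    for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

# Same loop with Threads.@threads; no threadid-indexed buffers are needed
# because every iteration writes to a distinct index.
function threaded!(out)
    @threads for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

# Measure bytes allocated by each variant for a given problem size.
function measure(n)
    out = zeros(n)
    a_serial = @allocated serial!(out)
    a_threaded = @allocated threaded!(out)
    return a_serial, a_threaded
end

# Warm up so that compilation does not show up in the measurements.
measure(8)

for n in (1_000, 10_000, 100_000)
    a_serial, a_threaded = measure(n)
    println("n = $n: serial = $a_serial bytes, threaded = $a_threaded bytes")
end
```

Saved as a script (the file name is up to you), it can be run with the same `julia +1.11 --threads=8` and `julia +1.12 --threads=8` invocations as above to compare the two versions.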
I'll see if I can bisect something.