Regression in performance of FEM-code using AD in threaded loop in 1.12 vs 1.11. #60241

@KristofferC

The commands below run the file at https://github.com/Ferrite-FEM/Ferrite.jl/blob/kc/landau_opt/docs/src/literate-gallery/landau.jl, which can run the assembly routine either serially or with Threads.@threads (selected by setting RUN_THREADED=1). The file uses a discouraged style of parallelism that indexes per-thread buffers with threadid, but that is not the point here.
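For readers unfamiliar with the threadid style mentioned above, here is a minimal sketch of the pattern next to a task-based alternative. It is not the code from landau.jl: element_energy, total_threadid and total_chunked are hypothetical names, and the per-element work is a stand-in for the real assembly.

```julia
using Base.Threads

# Hypothetical stand-in for the per-element work (not the Ferrite assembly routine).
element_energy(i) = sin(i)^2

# threadid style: one accumulator slot per thread, indexed by threadid().
# This is only correct as long as a task never migrates between threads in
# the middle of an iteration, which is why the style is discouraged.
function total_threadid(n)
    partial = zeros(Threads.maxthreadid())
    @threads for i in 1:n
        partial[threadid()] += element_energy(i)
    end
    return sum(partial)
end

# Task-based alternative: each spawned task owns its chunk and its result.
# Assumes n >= 1.
function total_chunked(n; ntasks = nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn sum(element_energy, chunk)
    end
    return sum(fetch, tasks)
end

println(total_threadid(1000) ≈ total_chunked(1000))
```

The second form does not depend on which thread a task happens to run on, so it stays correct if the scheduler moves tasks around.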

If we run this code on 1.11:

git clone https://github.com/Ferrite-FEM/Ferrite.jl/
cd Ferrite.jl 
git checkout kc/landau_opt
#### 1.11 ####

julia +1.11 --project=docs -e 'using Pkg; Pkg.update()'

# non-threaded loop
julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.016159 seconds
# ∇F!: 0.065011 seconds
# ∇²F!: 1.169886 seconds (180.00 k allocations: 49.439 MiB, 0.07% gc time)
#  9.461187 seconds (3.51 M allocations: 1.392 GiB, 1.08% gc time, 6.16% compilation time)

# threaded loop
RUN_THREADED=1 julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.004370 seconds (1.81 k allocations: 188.125 KiB)
# ∇F!: 0.018842 seconds (1.81 k allocations: 188.125 KiB)
# ∇²F!: 0.262754 seconds (181.82 k allocations: 49.624 MiB, 0.21% gc time)
#  3.578371 seconds (3.54 M allocations: 1.395 GiB, 2.35% gc time, 15.19% compilation time)

We can make the following observations:

  • The total amount allocated is roughly the same for the threaded and non-threaded loop (1.395 GiB vs 1.392 GiB).
  • The allocation overhead from calling F and ∇F! threaded is fixed and small (1.81 k allocations, 188.125 KiB each).

Now, if we run this on 1.12:

#### 1.12 ####

julia +1.12 --project=docs -e 'using Pkg; Pkg.update()'

# non-threaded loop
julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.014978 seconds
# ∇F!: 0.064877 seconds
# ∇²F!: 1.213513 seconds (210.00 k allocations: 50.812 MiB, 0.07% gc time)
#  9.258766 seconds (2.01 M allocations: 1.317 GiB, 0.62% gc time, 3.71% compilation time)

# threaded loop
RUN_THREADED=1 julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.018132 seconds (121.30 k allocations: 83.894 MiB, 24.68% gc time)
# ∇F!: 0.026937 seconds (91.30 k allocations: 83.436 MiB, 12.25% gc time)
# ∇²F!: 1.037706 seconds (11.70 M allocations: 6.379 GiB, 26.83% gc time)
#  7.667274 seconds (72.23 M allocations: 40.276 GiB, 22.67% gc time, 4.58% compilation time)

We can see the following:

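  • The code allocates an abhorrent amount in the threaded case (6.379 GiB in ∇²F! vs 50.812 MiB non-threaded, 40.276 GiB vs 1.317 GiB overall) and the GC has to work a lot (roughly 12% to 27% gc time).
  • The allocation overhead is no longer a small fixed cost: it is about 83 MiB for F and ∇F! but several GiB for ∇²F!.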
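To put a number on the regression, the snippet below computes the ratios between the threaded runs in the 1.12 and 1.11 sections. The figures are copied from the timing output above; the variable names are only for illustration.

```julia
# Threaded-run figures copied from the timing output in this issue.
v111 = (hess_mib = 49.624,       hess_allocs = 181.82e3, total_gib = 1.395)
v112 = (hess_mib = 6.379 * 1024, hess_allocs = 11.70e6,  total_gib = 40.276)

# Ratios of 1.12 to 1.11 for the threaded loop.
hess_bytes_ratio  = v112.hess_mib / v111.hess_mib        # ∇²F! bytes allocated
hess_count_ratio  = v112.hess_allocs / v111.hess_allocs  # ∇²F! allocation count
total_bytes_ratio = v112.total_gib / v111.total_gib      # whole-script bytes allocated

println("∇²F! bytes:  ", round(hess_bytes_ratio; digits = 1), "x")
println("∇²F! allocs: ", round(hess_count_ratio; digits = 1), "x")
println("total bytes: ", round(total_bytes_ratio; digits = 1), "x")
```

So the threaded ∇²F! call allocates on the order of a hundred times more memory on 1.12, while the non-threaded numbers are nearly unchanged between the two versions.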

I'll see if I can bisect this.

Labels: multithreading, performance, regression
