Regression in performance of FEM-code using AD in threaded loop in 1.12 vs 1.11.

The commands below run the file https://github.com/Ferrite-FEM/Ferrite.jl/blob/kc/landau_opt/docs/src/literate-gallery/landau.jl, which can run the assembly routine either with `Threads.@threads` or without it (selected here via the `RUN_THREADED` environment variable). It uses a discouraged style of parallelism based on `threadid`, but that is not the point here.

If we run this code on 1.11:

```bash
git clone https://github.com/Ferrite-FEM/Ferrite.jl/
cd Ferrite.jl
git checkout kc/landau_opt
```

```bash
#### 1.11 ####

julia +1.11 --project=docs -e 'using Pkg; Pkg.update()'

# non-threaded loop
julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.016159 seconds
# ∇F!: 0.065011 seconds
# ∇²F!: 1.169886 seconds (180.00 k allocations: 49.439 MiB, 0.07% gc time)
#  9.461187 seconds (3.51 M allocations: 1.392 GiB, 1.08% gc time, 6.16% compilation time)

# threaded loop
RUN_THREADED=1 julia +1.11 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.004370 seconds (1.81 k allocations: 188.125 KiB)
# ∇F!: 0.018842 seconds (1.81 k allocations: 188.125 KiB)
# ∇²F!: 0.262754 seconds (181.82 k allocations: 49.624 MiB, 0.21% gc time)
#  3.578371 seconds (3.54 M allocations: 1.395 GiB, 2.35% gc time, 15.19% compilation time)
```

We can make the following observations:

- The amount allocated is roughly the same for the threaded and non-threaded loops (1.395 GiB vs 1.392 GiB in total).
- The allocation overhead from calling `F` and `∇F!` threaded is fixed and small (1.81 k allocations, 188.125 KiB each).


Now, if we run this on 1.12:


```bash
#### 1.12 ####

julia +1.12 --project=docs -e 'using Pkg; Pkg.update()'

# non-threaded loop
julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.014978 seconds
# ∇F!: 0.064877 seconds
# ∇²F!: 1.213513 seconds (210.00 k allocations: 50.812 MiB, 0.07% gc time)
#  9.258766 seconds (2.01 M allocations: 1.317 GiB, 0.62% gc time, 3.71% compilation time)

# threaded loop
RUN_THREADED=1 julia +1.12 --project=docs --threads=8 docs/src/literate-gallery/landau.jl

# F: 0.018132 seconds (121.30 k allocations: 83.894 MiB, 24.68% gc time)
# ∇F!: 0.026937 seconds (91.30 k allocations: 83.436 MiB, 12.25% gc time)
# ∇²F!: 1.037706 seconds (11.70 M allocations: 6.379 GiB, 26.83% gc time)
#  7.667274 seconds (72.23 M allocations: 40.276 GiB, 22.67% gc time, 4.58% compilation time)
```

We can see the following:

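- The code allocates vastly more in the threaded case (40.276 GiB in total vs 1.317 GiB non-threaded), and the GC has to work a lot (over 20% of the total time).
- The allocation overhead of the threaded loop is no longer small or constant across functions: about 84 MiB for `F` and `∇F!` (188.125 KiB on 1.11) and 6.379 GiB for `∇²F!` (49.624 MiB on 1.11).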
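
To put rough numbers on this, the ratios can be computed directly from the totals quoted above. This is only a back-of-the-envelope sketch; `ratio` is a hypothetical helper introduced for the illustration, not part of the reproduction.

```bash
# Hypothetical helper: print a / b with one decimal.
ratio() { LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }

# Total bytes allocated on 1.12, threaded vs non-threaded (GiB / GiB)
echo "1.12 threaded / non-threaded bytes: $(ratio 40.276 1.317)x"

# Total bytes allocated in the threaded run, 1.12 vs 1.11 (GiB / GiB)
echo "threaded bytes, 1.12 / 1.11: $(ratio 40.276 1.395)x"

# Total allocation count in the threaded run, 1.12 vs 1.11 (millions / millions)
echo "threaded allocation count, 1.12 / 1.11: $(ratio 72.23 3.54)x"
```

The inputs are copied from the last timing line of each run, so the ratios cover the whole script, including compilation.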

I'll see if I can bisect something.
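
One way such a bisection could be automated is with a small predicate for `git bisect run`, which treats exit code 0 as good, 1 as bad, and 125 as skip. The sketch below is an assumption about how one might do it, not what was actually used: `is_good` and the 5 GiB threshold are hypothetical, and it assumes the final timing line of the script reports its total in GiB, as in the outputs above.

```bash
# Hypothetical predicate: classify one timing line by its total allocation.
# Returns 0 (good) below the limit, 1 (bad) at or above it,
# and 125 (skip) when no "<number> GiB" figure can be parsed.
is_good() {
    local line="$1" limit_gib="${2:-5}" gib
    gib=$(printf '%s\n' "$line" | sed -n 's/.*allocations: \([0-9.][0-9.]*\) GiB.*/\1/p')
    [ -n "$gib" ] || return 125
    LC_ALL=C awk -v g="$gib" -v l="$limit_gib" 'BEGIN { exit !(g < l) }'
}

# Example with the threaded totals reported above for 1.11 and 1.12
is_good "  3.578371 seconds (3.54 M allocations: 1.395 GiB, 2.35% gc time, 15.19% compilation time)" && echo good
is_good "  7.667274 seconds (72.23 M allocations: 40.276 GiB, 22.67% gc time, 4.58% compilation time)" || echo bad
```

In a real bisection the line would come from the last line of the threaded run's output at each step, using a Julia build of the commit under test.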



Regression in performance of FEM-code using AD in threaded loop in 1.12 vs 1.11. #60241
