[OpenBLAS] update multithreading cutoff #7189
Merged — ViralBShah merged 3 commits into JuliaPackaging:master on Aug 8, 2023
Conversation
400 is a much better cutoff than 50 for most modern machines. Note that 100 is way too small even for modern 4-core machines (I think the 50 limit was found pre-AVX2, and possibly pre-FMA). 400 is probably a bit larger than optimal on small machines, but it only gives up ~13% performance single-core compared to 8-core (and on laptops it will probably be better because a single core can turbo higher). It also mitigates the horrible performance cliff of using 16 or more threads on medium-sized matrices (between roughly 400 and 1600). Of course, the better answer would be to integrate BLAS's threading with Julia's (and use an appropriate number of threads based on the matrix size), but for now this is a pretty noticeable improvement.

```julia
julia> BLAS.set_num_threads(32)

julia> peakflops(400)
1.1644410982935661e10

julia> BLAS.set_num_threads(16)

julia> peakflops(400)
1.5580026746524042e10

julia> BLAS.set_num_threads(8)

julia> peakflops(400)
2.210268354206555e10

julia> BLAS.set_num_threads(4)

julia> peakflops(400)
1.937951340161483e10

julia> BLAS.set_num_threads(1)

julia> peakflops(400)
1.740427478902416e10

julia> BLAS.set_num_threads(32)

julia> peakflops(100)
1.9949726688744364e9

julia> BLAS.set_num_threads(16)

julia> peakflops(100)
2.9579541605843735e9

julia> BLAS.set_num_threads(8)

julia> peakflops(100)
4.373630506947512e9

julia> BLAS.set_num_threads(4)

julia> peakflops(100)
3.924300248211991e9

julia> BLAS.set_num_threads(1)

julia> peakflops(100)
1.0693014253788e10
```
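The size-dependent threading the description asks for could be approximated today from user code. Below is a minimal sketch of that idea, not the actual OpenBLAS patch: `suggest_threads` is a hypothetical helper name, and the 400/1600 breakpoints and the 8-thread cap for medium sizes are assumptions taken from the benchmark numbers above.

```julia
using LinearAlgebra

# Hypothetical heuristic: choose a BLAS thread count from the matrix
# dimension n, mirroring the benchmarks above. The breakpoints (400, 1600)
# and the medium-size cap of 8 threads are illustrative assumptions.
function suggest_threads(n::Integer; max_threads::Integer = Sys.CPU_THREADS)
    n <= 400  && return 1                    # small: single-threaded wins
    n <= 1600 && return min(8, max_threads)  # medium: avoid the 16+-thread cliff
    return max_threads                       # large: use all available threads
end

# Apply before a matmul-heavy section:
BLAS.set_num_threads(suggest_threads(500))
```

This only sets a global thread count before a hot section; real integration with Julia's scheduler (as the description suggests) would need support inside OpenBLAS itself.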
ViralBShah approved these changes on Aug 8, 2023
ViralBShah added a commit to JuliaLang/julia that referenced this pull request on Aug 8, 2023
ViralBShah added a commit to JuliaLang/julia that referenced this pull request on Aug 9, 2023:
…50844) Detailed discussion and benchmarks by @oscardssmith in JuliaPackaging/Yggdrasil#7189
KristofferC pushed a commit to JuliaLang/julia that referenced this pull request on Aug 10, 2023:
…50844) Detailed discussion and benchmarks by @oscardssmith in JuliaPackaging/Yggdrasil#7189 (cherry picked from commit 626f687)