Fix @fastmath x^2 inlining regression for Float32 and Float16 #60640

oscardssmith merged 3 commits into JuliaLang:master

Conversation
@Yashagarwal9798 please be aware that all uses of AI must be disclosed.
force-pushed from 0d4ccb4 to c79d1c7
force-pushed from c79d1c7 to 102b4a1
force-pushed from 102b4a1 to 1664287
Build failure looks real?
force-pushed from 6a69d79 to bd9c0e9
@oscardssmith CI is all green now.
@Yashagarwal9798 thanks for the PR!
Thanks! Really glad I could help improve the project.
This is a candidate for backporting. Although this concerns only an optimization, generating inefficient code for such a common operation is a real regression.
Agreed. If we backport, we need to be careful not to backport the Float16 version of this to 1.12 (only recently has LLVM stopped producing garbage on x86 for this intrinsic). 1.13 should receive the backport unmodified, though. I'll put up the 1.12 PR.
@oscardssmith did you open a PR to backport this to 1.12?
I did, but I linked to https://github.com/JuliaLang/julia/pull/60640/changes so it didn't tag.
## Summary

This PR fixes the performance regression where `@fastmath x^2` for `Float32` was not being inlined to efficient LLVM code, unlike `Float64`.
## Problem

As reported in #60639, `@fastmath x^2` for `Float32` was falling back to `power_by_squaring` instead of using the LLVM `powi` intrinsic. This resulted in:

- Unnecessary function calls instead of inline multiplication
- Potential type promotion to `Float64`
- Suboptimal generated code compared to `Float64`

Before this fix, `@code_llvm @fastmath Float32(1.5)^2` would show calls to `power_by_squaring`, while `Float64` correctly used the `llvm.powi` intrinsic.
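As an illustration (not part of the PR itself), one way to observe the regression is to inspect the LLVM IR of a small wrapper function; `f32sq` is a hypothetical name:

```julia
using InteractiveUtils  # provides @code_llvm

f32sq(x::Float32) = @fastmath x^2

# Before this fix: the IR contains a call to power_by_squaring.
# After this fix: the body reduces to a single inline fmul, since the
# powi intrinsic with a constant exponent of 2 folds to one multiply.
@code_llvm f32sq(1.5f0)
```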
## Solution

Added the missing `pow_fast` methods for `Float32` and `Float16`:

- `pow_fast(::Float32, ::Int32)` - uses the `llvm.powi.f32.i32` intrinsic directly
- `pow_fast(::Float32, ::Integer)` - a wrapper that converts to `Int32` when safe, matching the `Float64` pattern
- `pow_fast(::Float16, ::Integer)` - converts to `Float32`, computes, and converts back

This mirrors the existing implementation for `Float64`, which already used `llvm.powi.f64.i32`.
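For reference, here is a minimal sketch of what such methods can look like, modeled on the existing `Float64` pattern in `base/fastmath.jl`; the PR's actual definitions live in `Base.FastMath` and may differ in detail:

```julia
# Sketch only: call LLVM's powi intrinsic directly for an Int32 exponent.
pow_fast(x::Float32, y::Int32) =
    ccall("llvm.powi.f32.i32", llvmcall, Float32, (Float32, Int32), x, y)

# Use the intrinsic when the exponent fits in Int32; otherwise fall back
# to the generic power implementation.
pow_fast(x::Float32, y::Integer) =
    typemin(Int32) <= y <= typemax(Int32) ? pow_fast(x, Int32(y)) : x^y

# Float16: widen to Float32, compute, then narrow back.
pow_fast(x::Float16, y::Integer) = Float16(pow_fast(Float32(x), y))
```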
## Testing

Added a regression test that verifies `@fastmath x^2` generates inline `fmul` instructions (not `power_by_squaring` calls) for `Float16`, `Float32`, and `Float64`.
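An illustrative version of such a check (a sketch, not necessarily the literal test added by the PR) can grep the optimized IR:

```julia
using InteractiveUtils, Test

sq(x) = @fastmath x^2

# The optimized IR should contain an inline fmul and no call to
# power_by_squaring for every floating-point width.
for T in (Float16, Float32, Float64)
    ir = sprint(code_llvm, sq, (T,))
    @test occursin("fmul", ir)
    @test !occursin("power_by_squaring", ir)
end
```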
Fixes #60639

---------

Co-authored-by: Oscar Smith <oscardssmith@gmail.com>

(cherry picked from commit f34d5f2)