Implement Narrow and Widen using SIMDAsHWIntrinsic #60094

tannergooding merged 5 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

This continues the work on #49397, which started with #53450. In particular, this moves Narrow and Widen to be implemented using the general SIMDAsHWIntrinsic logic and then has the new APIs in Vector64/128/256 use the same shared entry points. There will be a few more PRs after this one covering:
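(Aside, for orientation rather than from the PR description: a minimal sketch of how the new APIs read from user code, assuming the tuple-returning Widen shape quoted later in the review.)

using System.Runtime.Intrinsics;

class WidenNarrowSketch
{
    // Widen 16 bytes into two vectors of 8 ushorts, then narrow them back.
    // Both calls route through the shared SIMDAsHWIntrinsic entry points.
    static Vector128<byte> RoundTrip(Vector128<byte> source)
    {
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(source);
        return Vector128.Narrow(lower, upper);
    }
}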
Will run outerloop jobs. Also plan on collecting PMI diffs before marking this ready-for-review.
Force-pushed from 366dcdd to c0b823c.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

Azure Pipelines successfully started running 3 pipeline(s).
Force-pushed from c0b823c to 68b3e42.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

Azure Pipelines successfully started running 3 pipeline(s).
No diff for benchmarks.

==============================

Frameworks has the following improvements:

Noting that the Vector:Widen improvements are due to the code being "vectorized" now, since it forwards to the shared entry points.

; Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long)
vmovupd ymm0, ymmword ptr[rcx+rax]
- vpermq ymm2, ymm0, -44
- vxorps ymm1, ymm1
- vpunpcklbw ymm2, ymm1
- vpermq ymm0, ymm0, -24
- vxorps ymm1, ymm1
- vpunpckhbw ymm0, ymm1
+ vmovaps ymm1, ymm0
+ vpmovzxbw ymm1, ymm1
+ vextractf128 xmm0, xmm0, 1
+ vpmovzxbw ymm0, ymm0

(No diff shown against the old software impl, just showing a vectorized example.)

==============================

Tests has the following improvements:

Noting that most of the improvements come from the Vector128/256:Narrow/Widen calls being intrinsified now, where they weren't before. The interesting diffs are for JIT\SIMD, where:

- vmovupd ymm6, ymmword ptr[rsi]
- vmovupd ymm0, ymmword ptr[rdi]
- vcvtpd2ps ymm6, ymm6
- vcvtpd2ps ymm1, ymm0
- vinsertf128 xmm6, xmm1, 1
- vcvtps2pd ymm7, ymm6
- vextractf128 xmm8, xmm6, 1
- vcvtps2pd ymm8, ymm8
+ vcvtpd2ps ymm0, ymmword ptr[rsi]
+ vcvtpd2ps ymm1, ymmword ptr[rdi]
+ vinsertf128 xmm6, xmm0, xmm1, 1
+ vmovaps ymm0, ymm6
+ vcvtps2pd ymm7, ymm0
+ vextractf128 xmm0, xmm6, 1
+ vcvtps2pd ymm8, ymm0
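(For context: a hedged sketch of the kind of double-to-float-and-back round trip behind that diff; the helper name and signature here are illustrative, not taken from the test source.)

using System.Numerics;

class NarrowWidenDoubles
{
    // Narrow two Vector<double> into one Vector<float>, then widen back.
    // With this change the conversions are recognized as intrinsics and
    // map onto the vcvtpd2ps/vcvtps2pd sequence shown above.
    static void RoundTrip(Vector<double> a, Vector<double> b,
                          out Vector<double> lower, out Vector<double> upper)
    {
        Vector<float> narrowed = Vector.Narrow(a, b);
        Vector.Widen(narrowed, out lower, out upper);
    }
}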
Force-pushed from f5d21da to 3094f4c.
This should be ready for review now.
Force-pushed from 3094f4c to c5fb319.
Rebased onto dotnet/main. This is still ready for review.
CC @echesakovMSFT PTAL.
[CLSCompliant(false)]
[Intrinsic]
public static unsafe (Vector64<ushort> Lower, Vector64<ushort> Upper) Widen(Vector64<byte> source)
    => (WidenLower(source), WidenUpper(source));
@tannergooding Have you looked at the code produced for such a function call? Do we pass the return values properly in two registers from the callee to the caller, without copying to/from memory?
The codegen looked good when I checked; let me rebuild and get a disasm example to share, however.
Thanks, I remember seeing some strange artifacts with multi-reg returns while I was working on #52424.
It's worth noting this isn't a multi-reg return; rather, it's a managed wrapper method that calls two helper intrinsics, since no platform implements this as a single instruction with a multi-reg result.
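For concreteness, a hedged sketch of roughly what the two helpers amount to for the float-to-double case on arm64, matching the ConvertToDouble/ConvertToDoubleUpper calls in the IR dump below (the standalone helper methods here are illustrative):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

class WidenHelpersSketch
{
    // Lower half: fcvtl converts the low two floats to two doubles.
    static Vector128<double> WidenLower(Vector128<float> value)
        => AdvSimd.Arm64.ConvertToDouble(value.GetLower());

    // Upper half: fcvtl2 converts the high two floats to two doubles.
    static Vector128<double> WidenUpper(Vector128<float> value)
        => AdvSimd.Arm64.ConvertToDoubleUpper(value);
}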
Sure, it's not a multi-reg intrinsic, but it's still a multi-reg call returning a value in the D0 and D1 registers, so I was wondering whether the codegen for such calls (incl. the method) is optimal.
@echesakovMSFT, looks like yes, there are still some oddities there: it doesn't preference q0/q1 as the target registers, nor does it directly do a mov; instead it does a store/load.
4EA01C10 mov v16.16b, v0.16b
0E617A10 fcvtl v16.2d, v16.2s
4E617811 fcvtl2 v17.2d, v0.4s
3D8007B0 str q16, [fp,#16]
3D800BB1 str q17, [fp,#32]
3DC007A0 ldr q0, [fp,#16]
3DC00BA1 ldr q1, [fp,#32]

It looks like this gets imported initially as:
[000014] I-CXG------- * CALL void ValueTuple`2..ctor (exactContextHnd=0x00007FFA49A115C9)
[000013] ------------ this in x0 +--* ADDR byref
[000012] -------N---- | \--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1
[000018] ---XG------- arg1 +--* OBJ simd16<Vector128`1>
[000017] ------------ | \--* ADDR byref
[000004] -------N---- | \--* HWINTRINSIC simd16 float ConvertToDouble
[000003] ------------ | \--* HWINTRINSIC simd8 float GetLower
[000002] n----------- | \--* OBJ simd16<Vector128`1>
[000001] ------------ | \--* ADDR byref
[000000] -------N---- | \--* LCL_VAR simd16<Vector128`1> V00 arg0
[000016] ---XG------- arg2 \--* OBJ simd16<Vector128`1>
[000015] ------------ \--* ADDR byref
[000008] -------N---- \--* HWINTRINSIC simd16 float ConvertToDoubleUpper
[000007] n----------- \--* OBJ simd16<Vector128`1>
[000006] ------------ \--* ADDR byref
[000005] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0
However, there are locals introduced and indirections kept that never truly go away, even in the face of HVA/HFA. So before the rationalize phase we still have:
***** BB01
STMT00000 (IL 0x000...0x016)
N003 ( 5, 5) [000011] IA------R--- * ASG struct (init) $VN.Void
N002 ( 3, 2) [000009] D------N---- +--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1 d:1
N001 ( 1, 2) [000010] ------------ \--* CNS_INT int 0 $40
***** BB01
STMT00005 (IL ???... ???)
N007 ( 9, 8) [000037] -A--G---R--- * ASG simd16 (copy) $200
N006 ( 1, 1) [000035] D------N---- +--* LCL_VAR simd16<Vector128`1> V03 tmp2 d:1 $200
N005 ( 9, 8) [000004] -------N---- \--* HWINTRINSIC simd16 float ConvertToDouble $200
N004 ( 8, 7) [000003] ------------ \--* HWINTRINSIC simd8 float GetLower $1c0
N003 ( 7, 6) [000002] n----------- \--* OBJ simd16<Vector128`1> $81
N002 ( 1, 2) [000001] ------------ \--* ADDR byref $140
N001 ( 1, 1) [000000] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0 u:1 $80
***** BB01
STMT00006 (IL ???... ???)
N006 ( 8, 7) [000040] -A--G---R--- * ASG simd16 (copy) $84
N005 ( 1, 1) [000038] D------N---- +--* LCL_VAR simd16<Vector128`1> V04 tmp3 d:1 $84
N004 ( 8, 7) [000008] -------N---- \--* HWINTRINSIC simd16 float ConvertToDoubleUpper $84
N003 ( 7, 6) [000007] n----------- \--* OBJ simd16<Vector128`1> $83
N002 ( 1, 2) [000006] ------------ \--* ADDR byref $140
N001 ( 1, 1) [000005] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0 u:1 (last use) $80
***** BB01
STMT00003 (IL ???... ???)
N003 ( 5, 6) [000027] -A------R--- * ASG simd16 (copy) $200
N002 ( 3, 4) [000022] U------N---- +--* LCL_FLD simd16 V02 tmp1 ud:1->2[+0] Fseq[Item1] $300
N001 ( 1, 1) [000023] ------------ \--* LCL_VAR simd16<Vector128`1> V03 tmp2 u:1 (last use) $200
***** BB01
STMT00004 (IL ???... ???)
N003 ( 5, 6) [000034] -A------R--- * ASG simd16 (copy) $84
N002 ( 3, 4) [000029] U------N---- +--* LCL_FLD simd16 V02 tmp1 ud:2->3[+16] Fseq[Item2] $301
N001 ( 1, 1) [000030] ------------ \--* LCL_VAR simd16<Vector128`1> V04 tmp3 u:1 (last use) $84
***** BB01
STMT00002 (IL 0x016... ???)
N002 ( 4, 3) [000020] ------------ * RETURN struct $101
N001 ( 3, 2) [000019] -------N---- \--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1 u:3 (last use) $301
-- Noting this is without inlining. With inlining, the JIT sometimes does the right thing and other times does not.
Okay, thanks for checking - this is what I suspected would happen - we should work on the issue in .NET 7.
Ubuntu-x64 improvements: dotnet/perf-autofiling-issues#2339