Implement Narrow and Widen using SIMDAsHWIntrinsic #60094

tannergooding merged 5 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

This continues the work on #49397, which started with #53450. In particular, this moves Narrow and Widen to be implemented using the general SIMDAsHWIntrinsic logic and then has the new APIs in Vector64/128/256 use the same shared entry points. There will be a few more PRs after this one covering:
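(Aside, for orientation rather than from the PR description: a minimal sketch of how the new APIs read from user code, assuming the tuple-returning Widen shape quoted later in the review.)

using System.Runtime.Intrinsics;

class WidenNarrowSketch
{
    // Widen 16 bytes into two vectors of 8 ushorts, then narrow them back.
    // Both calls route through the shared SIMDAsHWIntrinsic entry points.
    static Vector128<byte> RoundTrip(Vector128<byte> source)
    {
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(source);
        return Vector128.Narrow(lower, upper);
    }
}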
Will run outerloop jobs. Also plan on collecting PMI diffs before marking this ready-for-review.
Force-pushed from 366dcdd to c0b823c.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

Azure Pipelines successfully started running 3 pipeline(s).
Force-pushed from c0b823c to 68b3e42.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

Azure Pipelines successfully started running 3 pipeline(s).
No diff for benchmarks.

==============================

Frameworks has the following improvements:

Noting that the Vector:Widen improvements are due to the code being "vectorized" now, since it forwards to the shared entry points.

; Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long)
vmovupd ymm0, ymmword ptr[rcx+rax]
- vpermq ymm2, ymm0, -44
- vxorps ymm1, ymm1
- vpunpcklbw ymm2, ymm1
- vpermq ymm0, ymm0, -24
- vxorps ymm1, ymm1
- vpunpckhbw ymm0, ymm1
+ vmovaps ymm1, ymm0
+ vpmovzxbw ymm1, ymm1
+ vextractf128 xmm0, xmm0, 1
+ vpmovzxbw ymm0, ymm0

(No diff shown against the old software impl, just showing a vectorized example.)

==============================

Tests has the following improvements:

Noting that most of the improvements come from the Vector128/256:Narrow/Widen calls being intrinsified now, where they weren't before. The interesting diffs are for JIT\SIMD, where:

- vmovupd ymm6, ymmword ptr[rsi]
- vmovupd ymm0, ymmword ptr[rdi]
- vcvtpd2ps ymm6, ymm6
- vcvtpd2ps ymm1, ymm0
- vinsertf128 xmm6, xmm1, 1
- vcvtps2pd ymm7, ymm6
- vextractf128 xmm8, xmm6, 1
- vcvtps2pd ymm8, ymm8
+ vcvtpd2ps ymm0, ymmword ptr[rsi]
+ vcvtpd2ps ymm1, ymmword ptr[rdi]
+ vinsertf128 xmm6, xmm0, xmm1, 1
+ vmovaps ymm0, ymm6
+ vcvtps2pd ymm7, ymm0
+ vextractf128 xmm0, xmm6, 1
+ vcvtps2pd ymm8, ymm0
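(For context: a hedged sketch of the kind of double-to-float-and-back round trip behind that diff; the helper name and signature here are illustrative, not taken from the test source.)

using System.Numerics;

class NarrowWidenDoubles
{
    // Narrow two Vector<double> into one Vector<float>, then widen back.
    // With this change the conversions are recognized as intrinsics and
    // map onto the vcvtpd2ps/vcvtps2pd sequence shown above.
    static void RoundTrip(Vector<double> a, Vector<double> b,
                          out Vector<double> lower, out Vector<double> upper)
    {
        Vector<float> narrowed = Vector.Narrow(a, b);
        Vector.Widen(narrowed, out lower, out upper);
    }
}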
Force-pushed from f5d21da to 3094f4c.
This should be ready for review now.
Force-pushed from 3094f4c to c5fb319.
Rebased onto dotnet/main. This is still ready for review.
CC @echesakovMSFT PTAL.
[CLSCompliant(false)]
[Intrinsic]
public static unsafe (Vector64<ushort> Lower, Vector64<ushort> Upper) Widen(Vector64<byte> source)
    => (WidenLower(source), WidenUpper(source));
@tannergooding Have you looked at the code produced for such a function call? Do we pass the return values properly in two registers from the callee to the caller, without copying to/from memory?
The codegen looked good when I checked; let me rebuild and get a disasm example to share, however.
Thanks, I remember seeing some strange artifacts with multi-reg returns while I was working on #52424.
It's worth noting this isn't a multi-reg return; rather, it's a managed wrapper method that calls two helper intrinsics, since no platform implements this as a single instruction with a multi-reg result.
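For concreteness, a hedged sketch of roughly what the two helpers amount to for the float-to-double case on arm64, matching the ConvertToDouble/ConvertToDoubleUpper calls in the IR dump below (the standalone helper methods here are illustrative):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

class WidenHelpersSketch
{
    // Lower half: fcvtl converts the low two floats to two doubles.
    static Vector128<double> WidenLower(Vector128<float> value)
        => AdvSimd.Arm64.ConvertToDouble(value.GetLower());

    // Upper half: fcvtl2 converts the high two floats to two doubles.
    static Vector128<double> WidenUpper(Vector128<float> value)
        => AdvSimd.Arm64.ConvertToDoubleUpper(value);
}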
Sure, it's not a multi-reg intrinsic, but it's still a multi-reg call returning a value in the D0 and D1 registers, so I was wondering whether the codegen for such calls (incl. the method) is optimal.
@echesakovMSFT, looks like yes, there are still some oddities there: it doesn't preference q0/q1 as the target registers, nor does it directly do a mov; instead it does a store/load.
4EA01C10 mov v16.16b, v0.16b
0E617A10 fcvtl v16.2d, v16.2s
4E617811 fcvtl2 v17.2d, v0.4s
3D8007B0 str q16, [fp,#16]
3D800BB1 str q17, [fp,#32]
3DC007A0 ldr q0, [fp,#16]
3DC00BA1 ldr q1, [fp,#32]

It looks like this gets imported initially as:
[000014] I-CXG------- * CALL void ValueTuple`2..ctor (exactContextHnd=0x00007FFA49A115C9)
[000013] ------------ this in x0 +--* ADDR byref
[000012] -------N---- | \--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1
[000018] ---XG------- arg1 +--* OBJ simd16<Vector128`1>
[000017] ------------ | \--* ADDR byref
[000004] -------N---- | \--* HWINTRINSIC simd16 float ConvertToDouble
[000003] ------------ | \--* HWINTRINSIC simd8 float GetLower
[000002] n----------- | \--* OBJ simd16<Vector128`1>
[000001] ------------ | \--* ADDR byref
[000000] -------N---- | \--* LCL_VAR simd16<Vector128`1> V00 arg0
[000016] ---XG------- arg2 \--* OBJ simd16<Vector128`1>
[000015] ------------ \--* ADDR byref
[000008] -------N---- \--* HWINTRINSIC simd16 float ConvertToDoubleUpper
[000007] n----------- \--* OBJ simd16<Vector128`1>
[000006] ------------ \--* ADDR byref
[000005] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0
However, there are locals introduced and indirections kept that never truly go away, even in the face of HVA/HFA. So before the rationalize phase we still have:
***** BB01
STMT00000 (IL 0x000...0x016)
N003 ( 5, 5) [000011] IA------R--- * ASG struct (init) $VN.Void
N002 ( 3, 2) [000009] D------N---- +--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1 d:1
N001 ( 1, 2) [000010] ------------ \--* CNS_INT int 0 $40
***** BB01
STMT00005 (IL ???... ???)
N007 ( 9, 8) [000037] -A--G---R--- * ASG simd16 (copy) $200
N006 ( 1, 1) [000035] D------N---- +--* LCL_VAR simd16<Vector128`1> V03 tmp2 d:1 $200
N005 ( 9, 8) [000004] -------N---- \--* HWINTRINSIC simd16 float ConvertToDouble $200
N004 ( 8, 7) [000003] ------------ \--* HWINTRINSIC simd8 float GetLower $1c0
N003 ( 7, 6) [000002] n----------- \--* OBJ simd16<Vector128`1> $81
N002 ( 1, 2) [000001] ------------ \--* ADDR byref $140
N001 ( 1, 1) [000000] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0 u:1 $80
***** BB01
STMT00006 (IL ???... ???)
N006 ( 8, 7) [000040] -A--G---R--- * ASG simd16 (copy) $84
N005 ( 1, 1) [000038] D------N---- +--* LCL_VAR simd16<Vector128`1> V04 tmp3 d:1 $84
N004 ( 8, 7) [000008] -------N---- \--* HWINTRINSIC simd16 float ConvertToDoubleUpper $84
N003 ( 7, 6) [000007] n----------- \--* OBJ simd16<Vector128`1> $83
N002 ( 1, 2) [000006] ------------ \--* ADDR byref $140
N001 ( 1, 1) [000005] -------N---- \--* LCL_VAR simd16<Vector128`1> V00 arg0 u:1 (last use) $80
***** BB01
STMT00003 (IL ???... ???)
N003 ( 5, 6) [000027] -A------R--- * ASG simd16 (copy) $200
N002 ( 3, 4) [000022] U------N---- +--* LCL_FLD simd16 V02 tmp1 ud:1->2[+0] Fseq[Item1] $300
N001 ( 1, 1) [000023] ------------ \--* LCL_VAR simd16<Vector128`1> V03 tmp2 u:1 (last use) $200
***** BB01
STMT00004 (IL ???... ???)
N003 ( 5, 6) [000034] -A------R--- * ASG simd16 (copy) $84
N002 ( 3, 4) [000029] U------N---- +--* LCL_FLD simd16 V02 tmp1 ud:2->3[+16] Fseq[Item2] $301
N001 ( 1, 1) [000030] ------------ \--* LCL_VAR simd16<Vector128`1> V04 tmp3 u:1 (last use) $84
***** BB01
STMT00002 (IL 0x016... ???)
N002 ( 4, 3) [000020] ------------ * RETURN struct $101
N001 ( 3, 2) [000019] -------N---- \--* LCL_VAR struct<ValueTuple`2, 32> V02 tmp1 u:3 (last use) $301
-- Noting this is without inlining. With inlining, the JIT sometimes does the right thing and other times does not.
Okay, thanks for checking - this is what I suspected would happen - we should work on the issue in .NET 7.
Ubuntu-x64 improvements: dotnet/perf-autofiling-issues#2339