JIT: simple forward substitution pass#63720
Conversation
Extend ref counting done by local morph so that we can determine single-def single-use locals. Add a phase that runs just after local morph that will attempt to forward single-def single-use local defs to uses when they are in adjacent statements. Fix or work around issues uncovered elsewhere: * `gtFoldExprCompare` might fold "identical" volatile subtrees * `fgGetStubAddrArg` cannot handle complex trees * some simd/hw operations can lose struct handles * some calls cannot handle struct local args Addresses dotnet#6973 and related issues. Still sorting through exactly which ones are fixed, so list below may need revising. Fixes dotnet#48605. Fixes dotnet#51599. Fixes dotnet#55472. Improves some but not all cases in dotnet#12280 and dotnet#62064. Does not fix dotnet#33002, dotnet#47082, or dotnet#63116; these require handling multiple uses or bypassing statements.
|
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsExtend ref counting done by local morph so that we can determine Add a phase that runs just after local morph that will attempt to Fix or work around issues uncovered elsewhere:
Addresses #6973 and related issues. Still sorting through exactly Fixes #48605. Improves some but not all cases in #12280 and #62064. Does not fix #33002, #47082, or #63116; these require handling multiple
|
|
cc @dotnet/jit-contrib Large numbers of diffs expected. A few large regressions where we now decide to clone loops. Some small regressions where we now report generics context where we didn't previously (possibly a bug fix, the exact requirements here are a bit squishy). Lots of improvements. Will post some examples once this gets a bit further along. |
|
Looks like various issues cropping up during crossgen for non-windows non-x64 builds. Hopefully minor. |
| // if the next tree can't change the value of fwdSubNode or be adversly impacted by fwdSubNode effects | ||
| // | ||
| const bool fwdSubNodeInvariant = ((fwdSubNode->gtFlags & GTF_ALL_EFFECT) == 0); | ||
| const bool nextTreeIsPureUpToUse = ((fsv.GetFlags() & (GTF_EXCEPT | GTF_GLOB_REF | GTF_CALL)) == 0); |
There was a problem hiding this comment.
It feels very risky to rely on GTF_GLOB_REF before (or, really, after, since we don't have the checker enabled for it) morph. For example:
static int* _lclAddr;
[MethodImpl(MethodImplOptions.NoInlining)]
static int Problem(int a, int b)
{
[MethodImpl(MethodImplOptions.NoInlining)]
static void CaptureLcl(int* lclAddr)
{
_lclAddr = lclAddr;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int MutateLcl()
{
(*_lclAddr)++;
return 0;
}
CaptureLcl(&a);
b = a + a;
return MutateLcl() + b; // We will forward-substitute 'b' because "a + a" won't have GTF_GLOB_REF on it.
}There was a problem hiding this comment.
Thanks for the example. Seems like we'll need to walk over the tree to substitute and compute the actual set of effects flags.
There was a problem hiding this comment.
Added some extra flags scanning.
* For x86 we can't forward sub `GT_LCLHEAP`. * For arm64 we need to take more care with multi-reg struct types.
|
Going to set this aside for a while to focus on OSR. Looks like:
@EgorBo feel free to look into any of the above. |
|
JIT/Directed/debugging/debuginfo/tester is failing everywhere as we're evidently removing statements with debug info. Wonder if we should be merging debug info somehow. |
|
Latest diff summary for windows x64 (since CI runs are still unable to provide data, likely because of the huge number of impacted methods): aspnet.run.windows.x64.checked.mch:Detail diffsbenchmarks.run.windows.x64.checked.mch:Detail diffscoreclr_tests.pmi.windows.x64.checked.mch:Detail diffslibraries.crossgen2.windows.x64.checked.mch:Detail diffslibraries.pmi.windows.x64.checked.mch:Detail diffslibraries_tests.pmi.windows.x64.checked.mch:Detail diffs |
|
As noted earlier, the big regressions (eg |
|
From my impression some regressions are just missing opportunities for CSE that was supposed to "un-substitute" back where needed. Example: public class Strings
{
public string s1 { get; }
public string s2 { get; }
public string s3 { get; }
}
public class Program
{
public string Test(Strings size)
=> string.Concat(size.s1, size.s2, size.s3);
}G_M19485_IG02:
mov rcx, gword ptr [rdx+8]
- mov rax, gword ptr [rdx+16]
- mov r8, gword ptr [rdx+24]
- mov rdx, rax
+ mov gword ptr [rsp+10H], rdx
+ mov rdx, gword ptr [rdx+16]
+ mov r8, gword ptr [rsp+10H]
+ mov r8, gword ptr [r8+24]
G_M19485_IG03:
jmp System.String:Concat(System.String,System.String,System.String):System.String
-; Total bytes of code: 20
+; Total bytes of code: 27 |
Or (in this case) poor modelling of ABI constraints, LSRA preferencing, or poor choices for or arg evaluation order... @jakobbotsch was noting our current arg order sorting seems to be unhelpful. |
|
Might need to avoid forward sub of an implicit byref's promoted fields.... |
The debug info tests assume that mappings will be placed on statements containing calls. We should probably think about what the best way forward is -- perhaps prefer using the earliest offset? If we determine current behavior is ok the tests can be changed to have multiple uses to turn off the forward sub. |
|
I am not sure the NativeAOT failure is related, well it is related but during reducing a repro I came up with one that fails even on Main jit (if I did everything correctly): using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Text;
public class Test
{
public static void Main()
{
try
{
throw new Exception();
}
catch when (FilterWithStackTrace())
{
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
static bool FilterWithStackTrace()
{
Consume(new StackTrace(0, true).ToString());
return true;
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Consume(string s)
{
}
}@MichalStrehovsky could you please check this repro? UPD: nvm, I am now able to repro it only under this PR, investigating... |
|
NativeAOT's failed test was a difficult one since it only reproduced on app start and not during AOT compilation... NgenDump diff: https://www.diffchecker.com/mbzS9LLk Is this a correct substitution: |
|
It looks like we are reordering a call that modifies local0 and a read of local0. Seems like we need to watch for address-exposed uses when computing the effects of the statement we're trying to substitute into. Can you see if this diff fixes things? @@ -145,6 +145,15 @@ public:
m_parentNode = parent;
}
}
+
+ // Uses of address-exposed locals are modelled as global refs.
+ //
+ LclVarDsc* const varDsc = m_compiler->lvaGetDesc(lclNum);
+
+ if (varDsc->IsAddressExposed())
+ {
+ m_accumulatedFlags |= GTF_GLOB_REF;
+ }
}
m_accumulatedFlags |= (node->gtFlags & GTF_GLOB_EFFECT);
@@ -215,8 +224,8 @@ public:
if (node->OperIs(GT_LCL_VAR))
{
- unsigned const lclNum = node->AsLclVarCommon()->GetLclNum();
- LclVarDsc* varDsc = m_compiler->lvaGetDesc(lclNum);
+ unsigned const lclNum = node->AsLclVarCommon()->GetLclNum();
+ LclVarDsc* const varDsc = m_compiler->lvaGetDesc(lclNum);
if (varDsc->IsAddressExposed())
{ |
|
/azp run Fuzzlyn, Antigen |
|
Azure Pipelines successfully started running 2 pipeline(s). |
|
I think the Fuzzlyn and Antigen issues are known issues. And the runtime CI failures likewise. So, I'm going to merge. |
|
@AndyAyersMS I do see 2 Fuzzlyn examples in the latest run that look related. // Generated by Fuzzlyn v1.5 on 2022-02-03 20:06:22
// Run on X86 Windows
// Seed: 4993495812901676847
// Reduced from 57.3 KiB to 0.6 KiB in 00:00:48
// Debug: Outputs 1
// Release: Outputs 4294967295
public struct S1
{
public uint F4;
public long F7;
public S1(uint f4): this()
{
F4 = f4;
}
public ref bool M1(uint arg0, bool arg1)
{
this.F4 = (uint)-this.F4;
System.Console.WriteLine(arg0);
return ref Program.s_4;
}
}
public struct S2
{
public S1 F3;
public S2(S1 f3): this()
{
F3 = f3;
}
}
public class Program
{
public static S1[] s_2 = new S1[]{new S1(0)};
public static bool s_4;
public static void Main()
{
S2 vr0 = new S2(new S1(1));
s_2[0].M1(vr0.F3.F4, vr0.F3.M1(0, false));
}
}// Generated by Fuzzlyn v1.5 on 2022-02-03 20:11:11
// Run on X86 Windows
// Seed: 3146441340761756415
// Reduced from 127.7 KiB to 0.7 KiB in 00:03:37
// Debug: Throws 'System.NullReferenceException'
// Release: Runs successfully
public struct S0
{
public int F3;
public uint F6;
}
public class C1
{
public S1 M31(C1 arg0)
{
return new S1(new C1(), new S0());
}
}
public struct S1
{
public C1 F4;
public S0 F7;
public S1(C1 f4, S0 f7): this()
{
}
public C1 M26(int arg0)
{
if (Program.s_4)
{
this.F4 = new C1();
}
return new C1();
}
}
public class Program
{
public static bool s_4 = true;
public static long s_20;
public static void Main()
{
S1 vr8 = default(S1);
var vr9 = vr8.F7.F3;
S1 vr10 = vr8.F4.M31(vr8.M26(vr9));
System.Console.WriteLine(s_20);
}
} |
|
The first one needs a little massaging to repro as a console app: // Generated by Fuzzlyn v1.5 on 2022-02-03 20:06:22
// Run on X86 Windows
// Seed: 4993495812901676847
// Reduced from 57.3 KiB to 0.6 KiB in 00:00:48
// Debug: Outputs 1
// Release: Outputs 4294967295
using System;
public struct S1
{
public uint F4;
public long F7;
public S1(uint f4): this()
{
F4 = f4;
}
public ref bool M1(uint arg0, bool arg1)
{
this.F4 = (uint)-this.F4;
Program.s_rt.WriteLine(arg0);
return ref Program.s_4;
}
}
public struct S2
{
public S1 F3;
public S2(S1 f3): this()
{
F3 = f3;
}
}
public class Program
{
public static IRT s_rt;
public static S1[] s_2 = new S1[]{new S1(0)};
public static bool s_4;
public static void Main()
{
s_rt = new C();
S2 vr0 = new S2(new S1(1));
s_2[0].M1(vr0.F3.F4, vr0.F3.M1(0, false));
}
}
public interface IRT
{
void WriteLine<T>(T value);
}
public class C : IRT
{
public void WriteLine<T>(T value)
{
Console.WriteLine(value);
}
}The second repros out of the box. |
For LIR we verify that we can really consider locals to be used at their user by having a checker that looks for interfering stores to the same locals. However, in some cases we may have "interfering" GT_LCL_FLD/GT_STORE_LCL_FLD, in the sense that they work on the same local but on a disjoint range of bytes. Add support to validate this. This fixes dotnet#57919 which made the fuzzer jobs very noisy and made it easy to miss actual new examples (e.g. dotnet#63720 was just merged even though there were real examples found there). Fix dotnet#57919
For LIR we verify that we can really consider locals to be used at their user by having a checker that looks for interfering stores to the same locals. However, in some cases we may have "interfering" GT_LCL_FLD/GT_STORE_LCL_FLD, in the sense that they work on the same local but on a disjoint range of bytes. Add support to validate this. This fixes #57919 which made the fuzzer jobs very noisy and made it easy to miss actual new examples (e.g. #63720 was just merged even though there were real examples found there). Fix #57919
|
Some failures from this weekend's Fuzzlyn run. All of these repro in main (179ffff) in win-x86 and not on win-x64. // Generated by Fuzzlyn v1.5 on 2022-02-06 17:06:01
// Run on X86 Windows
// Seed: 7854701479455767355
// Reduced from 192.2 KiB to 0.6 KiB in 00:03:51
// Debug: Outputs 0
// Release: Outputs 1
public struct S0
{
public uint F0;
public ulong F2;
public int F3;
public ushort F6;
public int F8;
public int[] M35(int arg0, int[] arg1)
{
this.F3 = arg1[0];
Program.s_rt.WriteLine(arg0);
return arg1;
}
}
public class Program
{
public static IRT s_rt;
public static int[][] s_28 = new int[][]{new int[]{1}};
public static void Main()
{
s_rt = new C();
var vr3 = new S0();
var vr6 = vr3.F3;
var vr13 = s_28[0];
var vr12 = vr3.M35(0, vr13);
vr3.M35(vr6, vr12);
}
}
public interface IRT
{
void WriteLine<T>(T value);
}
public class C : IRT
{
public void WriteLine<T>(T value)
{
System.Console.WriteLine(value);
}
}// Generated by Fuzzlyn v1.5 on 2022-02-06 18:06:07
// Run on X86 Windows
// Seed: 525178552114891112
// Reduced from 142.0 KiB to 0.9 KiB in 00:06:35
// Debug: Throws 'System.NullReferenceException'
// Release: Runs successfully
public interface I0
{
}
public class C0
{
public ushort F0;
}
public struct S0 : I0
{
public byte F0;
public bool F1;
public long F3;
public byte F4;
public C0 F5;
public S0(long f3, C0 f5): this()
{
F5 = f5;
}
public C1 M24(ref bool arg0)
{
this = new S0(0, new C0());
return Program.s_18;
}
}
public class C1
{
}
public class Program
{
public static IRT s_rt;
public static C1 s_18;
public static void Main()
{
s_rt = new C();
var vr4 = new S0(0, new C0());
I0 vr7 = vr4;
bool vr8 = default(bool);
S0 vr9 = default(S0);
var vr10 = vr9.F5;
var vr11 = vr9.M24(ref vr8);
M8(vr10, ref vr7, vr11);
}
public static void M8(C0 argThis, ref I0 arg0, C1 arg1)
{
s_rt.WriteLine(argThis.F0);
}
}
public interface IRT
{
void WriteLine<T>(T value);
}
public class C : IRT
{
public void WriteLine<T>(T value)
{
System.Console.WriteLine(value);
}
}// Generated by Fuzzlyn v1.5 on 2022-02-06 16:34:36
// Run on X86 Windows
// Seed: 15345211054104247945
// Reduced from 56.7 KiB to 1.1 KiB in 00:00:55
// Debug: Outputs 0
// Release: Outputs -1
public struct S0
{
public ulong F1;
public int F2;
public byte F3;
public short F5;
public long F6;
public void M4(int arg0, sbyte[] arg1)
{
Program.s_rt.WriteLine(arg0);
}
public sbyte[] M6(ref int arg0, bool arg1, ref S0 arg2, ref byte arg3)
{
var vr0 = Program.s_5;
Program.M7(arg2, vr0, ref Program.s_6, ref Program.s_10, arg0);
var vr1 = arg2.F6;
Program.M8(arg2, vr1);
arg0 = this.F2--;
return new sbyte[]{1};
}
}
public class Program
{
public static IRT s_rt;
public static bool s_2;
public static S0 s_4;
public static uint[] s_5;
public static byte s_6;
public static S0[] s_10;
public static S0[, ] s_20 = new S0[, ]{{new S0()}};
public static void Main()
{
s_rt = new C();
S0 vr2 = default(S0);
new S0().M4(vr2.F2, vr2.M6(ref s_4.F2, s_2, ref s_20[0, 0], ref s_4.F3));
}
public static void M8(S0 arg0, long arg1)
{
}
public static void M7(S0 argThis, uint[] arg0, ref byte arg1, ref S0[] arg2, int arg3)
{
}
}
public interface IRT
{
void WriteLine<T>(T value);
}
public class C : IRT
{
public void WriteLine<T>(T value)
{
System.Console.WriteLine(value);
}
} |
|
@jakobbotsch - do you think they are related to the forward substitution? Could you open GH issues instead for better tracking? |
|
@kunalspathak Yes I expect so. Opened #64904. I don't really want to open two more yet since it seems very likely they may have the same cause. |
|
arm64 improvements: dotnet/perf-autofiling-issues#3438 dotnet/perf-autofiling-issues#3434 and dotnet/perf-autofiling-issues#3433 |


Extend ref counting done by local morph so that we can determine
single-def single-use locals.
Add a phase that runs just after local morph that will attempt to
forward single-def single-use local defs to uses when they are in
adjacent statements.
Fix or work around issues uncovered elsewhere:
gtFoldExprComparemight fold "identical" volatile subtreesfgGetStubAddrArgcannot handle complex treesAddresses #6973 and related issues. Still sorting through exactly
which ones are fixed, so list below may need revising.
Fixes #48605.
Fixes #51599.
Fixes #55472.
Improves some but not all cases in #12880 and #62604.
Does not address #33002, #47082, or #63116; these require handling multiple
uses or bypassing statements.