TestKit and Akka.Remote.TestKit: diagnostic improvements and code modernization#7321
Conversation
…eb/akka.net into mntr-barrier-cleanup
BarrierCoordinator [WIP]
Aaronontheweb
left a comment
There was a problem hiding this comment.
Wasn't brave enough to address nullability enable inside the MNTR, but I thought this was an acceptable set of changes and improvements to help debug some of these tests.
| <HoconVersion>2.0.3</HoconVersion> | ||
| <ConfigurationManagerVersion>6.0.1</ConfigurationManagerVersion> | ||
| <MultiNodeAdapterVersion>1.5.19</MultiNodeAdapterVersion> | ||
| <MultiNodeAdapterVersion>1.5.25</MultiNodeAdapterVersion> |
There was a problem hiding this comment.
Upgrade to latest MNTR version
There was a problem hiding this comment.
Tried to migrate as many tests as I could to use async Task and the async TestKit methods, since those tend to run faster and perform less thread-blocking. The rest of the changes in this file are just compilation fixes for the changes I made to the EnterBarrier / FailBarrier messages - which I'll describe in more detail on the files where those types are defined.
There was a problem hiding this comment.
No substantive changes here other than leveraging the additional diagnostic data added to EnterBarrier and FailBarrier - most of this code has been touched since 2014. I just modernized it a bit to clear up some compiler warnings surrounding the use of .GetHashCode() and mutable values.
|
|
||
| public sealed class Data | ||
| { | ||
| public Data(IEnumerable<Controller.NodeInfo> clients, string barrier, IEnumerable<IActorRef> arrived, Deadline deadline) : |
There was a problem hiding this comment.
Unused CTOR - removed it.
| public WrongBarrierException(string barrier, IActorRef client, Data barrierData) | ||
| : base($"[{client}] tried to enter '{barrier}' while we were waiting for '{barrierData.Barrier}'") | ||
| public WrongBarrierException(string barrier, IActorRef client, RoleName roleName, Data barrierData) | ||
| : base($"[{client}] [{roleName}] tried to enter '{barrier}' while we were waiting for '{barrierData.Barrier}'") |
There was a problem hiding this comment.
Expanded the description here of WHICH ROLE tried to enter the wrong barrier - this should make it much easier to debug tests in the future that have issues with nodes crossing barriers at inappropriate times.
| { | ||
| Name = failBarrier.Name, | ||
| Op = Proto.Msg.EnterBarrier.Types.BarrierOp.Fail, | ||
| RoleName = failBarrier.Role.Name |
There was a problem hiding this comment.
Added RoleName encoding to FailBarrier
| public void EnterBarrier(params string[] name) | ||
| { | ||
| TestConductor.Enter(RemainingOr(TestConductor.Settings.BarrierTimeout), name.ToImmutableList()); | ||
| TestConductor.Enter(RemainingOr(TestConductor.Settings.BarrierTimeout), Myself, name.ToImmutableList()); |
There was a problem hiding this comment.
Whenever we call EnterBarrier during testing now, we always pass in Myself - which is set to the value of our role during the testing system.
| /// throw an exception in case of timeouts or other errors. | ||
| /// </summary> | ||
| public void Enter(string name) | ||
| public void Enter(RoleName roleName, string name) |
There was a problem hiding this comment.
Had to add RoleName support to the Enter methods for transmitting barrier information to the BarrierCoordinator
| var stopped = Now + t; | ||
| if (stopped >= stop) | ||
| { | ||
| Sys.Log.Warning("AwaitAssert failed, timeout [{0}] is over after [{1}] attempts and [{2}] elapsed time", max, attempts, stopped - start); |
There was a problem hiding this comment.
Diagnostic improvements to AwaitAssert - I wanted to understand how many times an assertion actually ran before it failed, so I capture that data and the total elapsed time the test used and pipe it into a Warning log right before we throw the assertion exception.
| string name = 1; | ||
| BarrierOp op = 2; | ||
| int64 timeout = 3; | ||
| string roleName = 4; |
There was a problem hiding this comment.
Added roleName to the wire format of the MNTR test conductor.
|
Even though I made some changes to type signatures, they're all |
Changes
Based on some of the racy spec output we're seeing, it looks to me like there's some cases where the
BarrierCoordinatormight be creating race conditions where tests can inadvertently have multiple nodes crossing different barriers at the same time despite being programmed correctly. Investigating.Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):