Stop failing remembered entity if supervisor strategy failed by Arkatufus · Pull Request #7720 · akkadotnet/akka.net

Arkatufus · 2025-07-02T14:21:31Z

Fixes #7629

Changes

Add specialized shard supervision strategy with feedback mechanism to signal excessive failures
Add new SupervisorStrategy settings to ShardSupervisionStrategy (accessible only via C# fluent API)

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

This change follows the Akka.NET API Compatibility Guidelines.
I have reviewed my own pull request.
Design discussion issue Akka.Cluster.Sharding: dealing with remember-entities and actors who can't start up correctly #7629
Changes in public API reviewed, if any.

Arkatufus

Self review

Arkatufus · 2025-07-02T14:48:09Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+
+namespace Akka.Cluster.Sharding;
+
+public class ShardSupervisionStrategy: OneForOneStrategy


This is the custom supervisor strategy, only for the Shard actor

Arkatufus · 2025-07-02T14:51:05Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+            if(restart)
+                context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause));


The ProcessFailure code is pretty much the same as in OneForOneStrategy, with these additional lines: we send the Shard actor a warning message that this failing child is due for termination because it is thrashing.

This is not quite right - we also need to handle scenarios where the SupervisorStrategy decided to issue a Stop directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.

Also: we need to make sure this only applies to entity actors, not to any other children of the Shard, such as the RE infrastructure itself. You can determine this by checking the actor paths.

Never mind, this gets handled inside the Shard message handlers, by the looks of things.
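A minimal, dependency-free sketch of the decision discussed in this thread, as we read it: restart the child while it is within its restart budget, otherwise stop it and tell the shard why, covering both the exhausted-budget case and an explicit Stop directive. It does not use Akka.NET types. `RestartBudget`, `Outcome`, `ShardSupervisionSketch`, the string notices, and the fixed-window counting rule are all assumptions made for illustration, not the code in this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for per-child restart statistics; not an Akka.NET type.
public sealed class RestartBudget
{
    private int _count;
    private DateTime _windowStart;

    // Returns true while the child has failed at most maxRetries times in the current window.
    public bool RequestPermission(int maxRetries, TimeSpan window, DateTime now)
    {
        if (_count == 0 || now - _windowStart > window)
        {
            // First failure, or the previous window has expired: open a new window.
            _windowStart = now;
            _count = 1;
        }
        else
        {
            _count++;
        }

        return _count <= maxRetries;
    }
}

public enum Outcome { Restarted, StoppedAfterExcessiveRestarts, StoppedByDirective }

public static class ShardSupervisionSketch
{
    // restart == true models a Restart directive; restart == false models a Stop directive.
    // shardInbox is a hypothetical stand-in for messages sent to the Shard actor.
    public static Outcome ProcessFailure(
        bool restart, RestartBudget budget, int maxRetries, TimeSpan window,
        DateTime now, List<string> shardInbox, string childName)
    {
        if (restart && budget.RequestPermission(maxRetries, window, now))
            return Outcome.Restarted;

        // Budget exhausted or Stop directive: in both cases the shard is notified
        // so that it can act on the failing entity.
        shardInbox.Add(restart
            ? $"excessive-restarts:{childName}"
            : $"stop-directive:{childName}");

        return restart ? Outcome.StoppedAfterExcessiveRestarts : Outcome.StoppedByDirective;
    }
}
```

With `maxRetries` set to 2, the first two failures inside the window are restarted and the third one produces a notice; a Stop directive produces a notice on the first failure.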

Arkatufus · 2025-07-02T14:51:45Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardingMessages.cs


+    internal sealed record ExcessiveSupervisorRestartPassivation(IActorRef Child, int TimeWindowInMilliseconds, int MaxRestartCount, Exception LastCause) : IShardRegionCommand;


New internal, local-only message sent from the supervisor strategy to the Shard actor.

Arkatufus · 2025-07-02T14:55:56Z

A note here: while this design works in a very quiet system, it might still fail on a very busy one.

This scheme would not be as responsive as the unit test suggests if the ExcessiveSupervisorRestartPassivation message got buried in the Shard mailbox (busy system).
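To make the concern concrete, here is a small, dependency-free sketch that assumes the Shard's mailbox handles user messages strictly first-in, first-out and that the passivation signal is an ordinary user message. `MailboxDelaySketch` and the message strings are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class MailboxDelaySketch
{
    // Returns how many queued messages are handled before the signal is reached
    // in a strictly first-in, first-out mailbox.
    public static int MessagesHandledBeforeSignal(int backlog)
    {
        var mailbox = new Queue<string>();

        // Messages that were already waiting when the child failed.
        for (var i = 0; i < backlog; i++)
            mailbox.Enqueue($"entity-message-{i}");

        // The signal lands at the back of the queue.
        mailbox.Enqueue("passivation-signal");

        var handled = 0;
        while (mailbox.Dequeue() != "passivation-signal")
            handled++;

        return handled;
    }
}
```

Under that assumption the delay grows linearly with the backlog: every message queued ahead of the signal is handled first.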

Aaronontheweb

Needs some changes in the way the supervision strategy is implemented

Aaronontheweb · 2025-07-03T16:53:37Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+            if(restart)
+                context.Self.Tell(new ExcessiveSupervisorRestartPassivation(child, WithinTimeRangeMilliseconds, MaxNumberOfRetries, cause));


This is not quite right - we also need to handle scenarios where the SupervisorStrategy decided to issue a Stop directive. Reason being: if the actor failed in such a way that it has to be stopped, it's by definition an irrecoverable exception. Continuing to remember the entity after we get this type of signal back is net-destructive.

Arkatufus · 2025-07-07T16:36:46Z

OK, all fixed

Aaronontheweb

LGTM

Aaronontheweb · 2025-07-07T16:48:20Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardingMessages.cs

        public object StopMessage { get; }
    }

+    internal sealed record SupervisorStopDirectivePassivation(IActorRef Child, string Reason, Exception LastCause) : IShardRegionCommand;


Aaronontheweb · 2025-07-07T16:48:41Z

src/contrib/cluster/Akka.Cluster.Sharding/ShardSupervisionStrategy.cs

+    {
+        if (restart && stats.RequestRestartPermission(MaxNumberOfRetries, WithinTimeRangeMilliseconds))
+            RestartChild(child, cause, suspendFirst: false);
+        else


Arkatufus added 2 commits July 2, 2025 21:13

Stop failing remembered entity if supervisor strategy failed

f9414b6

Update API Approval list

91d87b4

Arkatufus commented Jul 2, 2025

Aaronontheweb requested changes Jul 3, 2025

Arkatufus added 2 commits July 7, 2025 23:35

Fix logic

a14fd32

Merge branch 'dev' into akkadotnet#7629-fix-dying-RE-actor

6ba4fe0

Aaronontheweb added the akka-cluster-sharding label Jul 7, 2025

Aaronontheweb approved these changes Jul 7, 2025

Aaronontheweb enabled auto-merge (squash) July 7, 2025 16:48

Arkatufus added 2 commits July 7, 2025 23:54

Fix ShardEntityFailureSpec

651f439

Merge branch 'akkadotnet#7629-fix-dying-RE-actor' of github.com:Arkatufus/akka.net into akkadotnet#7629-fix-dying-RE-actor

930e00a

Aaronontheweb merged commit ec8a419 into akkadotnet:dev Jul 7, 2025
11 checks passed

Arkatufus mentioned this pull request Jul 7, 2025

Update RELEASE_NOTES for 1.5.45 release #7723

Merged
