Fix MergeSeen to filter Seen against current Members (#8009) #8011

Merged
Aaronontheweb merged 5 commits into akkadotnet:dev from Aaronontheweb:feature/gossip-invariant-validation-8009
Jan 26, 2026

@Aaronontheweb Aaronontheweb commented Jan 22, 2026

Problem

ClusterMessageSerializer.GossipToProto throws ArgumentException: Unknown address when Seen contains addresses not present in Members.

Root Cause

The gossip protocol lacks tombstones for removed members. Without tombstones, removed members can be reintroduced during gossip merging, and their Seen entries persist after the member is removed from Members.

The MergeSeen method performed a blind union of seen sets without filtering against current membership.
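
To make the failure mode concrete, here is a minimal sketch that models addresses as plain strings instead of Akka.NET's UniqueAddress. The names SeenInvariantSketch, BlindMergeSeen and HoldsInvariant are hypothetical and are not part of Akka.NET; the sketch only illustrates how an unfiltered union can carry a removed node's entry into the merged Seen set.

```csharp
using System.Collections.Immutable;

public static class SeenInvariantSketch
{
    // Hypothetical stand-in for the pre-fix behaviour:
    // union both seen sets without consulting current membership.
    public static ImmutableHashSet<string> BlindMergeSeen(
        ImmutableHashSet<string> localSeen,
        ImmutableHashSet<string> remoteSeen)
        => localSeen.Union(remoteSeen);

    // The invariant discussed in this PR: every address in Seen
    // must also be present in Members.
    public static bool HoldsInvariant(
        ImmutableHashSet<string> seen,
        ImmutableHashSet<string> members)
        => seen.IsSubsetOf(members);
}
```

With members {a, b}, a local seen set {a} and a remote seen set {b, c} where c has already been removed, BlindMergeSeen returns a set that still contains c, so HoldsInvariant reports false for the merged state.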

Solution

Apply a defensive fix to MergeSeen so that the invariant Seen ⊆ Members always holds after a merge. The merged seen set is intersected with the current member addresses:

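public Gossip MergeSeen(Gossip that)
{
    // Addresses of the members currently present in this gossip state.
    var memberAddresses = _members.Select(m => m.UniqueAddress).ToImmutableHashSet();
    // Union both seen sets, then drop any address that is no longer a member.
    var mergedSeen = _overview.Seen.Union(that._overview.Seen).Intersect(memberAddresses);
    return Copy(overview: _overview.Copy(seen: mergedSeen));
}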
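
The union-then-intersect step can also be tried in isolation. The sketch below is not the Akka.NET implementation: it uses strings in place of UniqueAddress, and FilteredSeenMergeSketch and its MergeSeen method are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;

public static class FilteredSeenMergeSketch
{
    // Union the two seen sets, then keep only addresses that are
    // still current members, so the result is a subset of members.
    public static ImmutableHashSet<string> MergeSeen(
        IEnumerable<string> members,
        ImmutableHashSet<string> localSeen,
        ImmutableHashSet<string> remoteSeen)
    {
        var memberAddresses = members.ToImmutableHashSet();
        return localSeen.Union(remoteSeen).Intersect(memberAddresses);
    }
}
```

For members {a, b}, a local seen set {a} and a remote seen set {b, c}, the result contains a and b only; the stale entry c is dropped because it is not a current member.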

Characteristics

  • Zero breaking changes
  • Wire-compatible with all existing versions
  • Defense-in-depth against corruption from any source
  • Minimal performance impact

Future Work

A more comprehensive fix using tombstones is planned for 1.6.0 - see #8015.

Closes #8009

@Aaronontheweb Aaronontheweb changed the title from "Add gossip invariant checking infrastructure (#8009)" to "Fix MergeSeen to filter Seen against current Members (#8009)" Jan 23, 2026
@Aaronontheweb Aaronontheweb marked this pull request as ready for review January 23, 2026 20:07
Apply defensive fix to MergeSeen that ensures the invariant
Seen ⊆ Members is always maintained. The merged seen set is
now intersected with current member addresses, preventing
stale entries from corrupting gossip state.

Closes akkadotnet#8009
@Aaronontheweb Aaronontheweb force-pushed the feature/gossip-invariant-validation-8009 branch from 98e8044 to 905f17b Compare January 23, 2026 20:11
Aaronontheweb added a commit that referenced this pull request Jan 25, 2026
…8025)

* Fix flaky multi-node cluster tests for member removal

Changes:
- LeaderDowningNodeThatIsUnreachableSpec: Fix bug where test tried to run on
  Second node after it was already exited (line 143)

- NodeDowningAndBeingRemovedSpec: Convert to async, increase outer timeout
  from 30s to 45s, add explicit timeouts to AwaitConditionAsync/AwaitAssertAsync

- NodeLeavingAndExitingAndBeingRemovedSpec: Convert to async, increase outer
  timeout from 15s to 45s for CI variability, add explicit timeouts

These tests are likely affected by PR #8011's MergeSeen filter fix which
changes gossip convergence timing.

* Address review feedback: remove redundant explicit timeouts

- Remove explicit timeout args from AwaitAssertAsync/AwaitConditionAsync
  calls inside WithinAsync blocks (timeouts are inherited from outer block)
- Move address caching outside WithinAsync block for cleaner code
- Keep CancellationToken.None as it's required by the API signature

* Fix SBR and ClusterSharding multi-node test race conditions

SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec,
DownAllIndirectlyConnected5NodeSpec):
- Replace polling via AwaitConditionAsync with event-driven callback
- Use cluster.RegisterOnMemberRemoved() for immediate notification
- The callback fires as soon as the member is removed or cluster daemon stops
- Eliminates race between polling interval and actual state change
- Convert remaining sync methods to async pattern

ClusterShardingRolePartitioningSpec:
- Wrap first message send in AwaitAssert to handle coordinator readiness
- The coordinator may not respond to GetShardHome until HasAllRegionsRegistered()
- GetShardHome requests are silently ignored until _aliveRegions.Count >= _minMembers
- The retry pattern ensures we wait for coordinator readiness without timeout jiggling

* Convert ClusterShardingRolePartitioningSpec to async TestKit methods

- Convert test methods to return Task and use await
- Use AwaitClusterUpAsync, RunOnAsync, EnterBarrierAsync
- Use AwaitAssertAsync and ExpectMsgAsync patterns
- Maintains the coordinator readiness fix from previous commit
@Aaronontheweb Aaronontheweb merged commit 8174e1f into akkadotnet:dev Jan 26, 2026
12 checks passed
@Aaronontheweb Aaronontheweb deleted the feature/gossip-invariant-validation-8009 branch January 26, 2026 03:57
Aaronontheweb added a commit that referenced this pull request Jan 26, 2026
Apply defensive fix to MergeSeen that ensures the invariant
Seen ⊆ Members is always maintained. The merged seen set is
now intersected with current member addresses, preventing
stale entries from corrupting gossip state.

Closes #8009
Aaronontheweb added a commit that referenced this pull request Jan 26, 2026
Documents all 8 backported PRs for the 1.5.59 release including:
- Critical cluster gossip fix (#8011)
- Bug fixes for logging, inbox, persistence, and TestKit
- New features: ActivityContext capture and BroadcastHub improvements
- CoordinatedShutdown logging enhancement
Development

Successfully merging this pull request may close these issues.

Akka.Cluster: Gossip serialization fails with 'Unknown address in cluster message' due to Seen/Members inconsistency