NRG: Fix cluster size drop to 1 on replaying EntryAddPeer after restart #7850

Merged
neilalexander merged 1 commit into main from raft-single-node-cluster-after-restart
Feb 19, 2026

@sciascid
Contributor

On restart, replaying EntryAddPeer could incorrectly leave a raft node at cluster size 1 instead of restoring the expected size and quorum from persisted state.
This bug could lead to the following scenario: a node in a 3-node cluster restarts and resets its cluster size to 1. If the node does not receive any message from the other nodes, it can campaign to become leader. Believing it is in a single-node cluster, it wins the election, so the original cluster splits into two clusters (or has two leaders at the same time).
Specifically, if an EntryAddPeer was replayed from the log, it would overwrite the cluster size and quorum to 1. The peer set is now restored before the log is replayed, and it is taken from the snapshot (if no snapshot is present, we fall back to peer.idx).
If a log entry that changes membership is replayed, it will now update the cluster size and quorum correctly.
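
For readers who want to see the ordering issue in isolation, here is a minimal sketch. It is not the code from this PR: the `node`, `entry`, `restorePeers`, and `replay` names are hypothetical, the cluster size is simply assumed to equal the number of known peers, and quorum is assumed to be a plain majority (`size/2 + 1`).

```go
package main

import "fmt"

// entryType is a hypothetical stand-in for the kinds of log entries.
type entryType int

const (
	entryNormal entryType = iota
	entryAddPeer
	entryRemovePeer
)

// entry is a simplified log entry; peer is only used for membership changes.
type entry struct {
	typ  entryType
	peer string
}

// node keeps only the membership state needed for this illustration.
type node struct {
	peers map[string]struct{}
	csz   int // cluster size
	qn    int // votes needed for quorum
}

// quorum assumes a simple majority of the cluster.
func quorum(size int) int { return size/2 + 1 }

// adjust recomputes cluster size and quorum from the current peer set.
func (n *node) adjust() {
	n.csz = len(n.peers)
	n.qn = quorum(n.csz)
}

// restorePeers seeds the peer set from the snapshot, falling back to the
// separately persisted peer state when the snapshot has no peers.
func (n *node) restorePeers(snapshotPeers, persistedPeers []string) {
	src := snapshotPeers
	if len(src) == 0 {
		src = persistedPeers
	}
	n.peers = make(map[string]struct{}, len(src))
	for _, p := range src {
		n.peers[p] = struct{}{}
	}
	n.adjust()
}

// replay applies membership entries on top of whatever peer set exists.
func (n *node) replay(log []entry) {
	for _, e := range log {
		switch e.typ {
		case entryAddPeer:
			n.peers[e.peer] = struct{}{}
			n.adjust()
		case entryRemovePeer:
			delete(n.peers, e.peer)
			n.adjust()
		}
	}
}

func main() {
	log := []entry{{typ: entryAddPeer, peer: "C"}}

	// Old ordering: replay onto an empty peer set.
	buggy := &node{peers: map[string]struct{}{}}
	buggy.replay(log)
	fmt.Printf("buggy: size=%d quorum=%d\n", buggy.csz, buggy.qn) // buggy: size=1 quorum=1

	// New ordering: restore the peer set first, then replay.
	fixed := &node{}
	fixed.restorePeers([]string{"A", "B", "C"}, nil)
	fixed.replay(log)
	fmt.Printf("fixed: size=%d quorum=%d\n", fixed.csz, fixed.qn) // fixed: size=3 quorum=2
}
```

With a quorum of 1, the "buggy" node can win an election with only its own vote, which is the split described above.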

Signed-off-by: Daniele Sciascia <daniele@nats.io>

@sciascid sciascid requested a review from a team as a code owner February 19, 2026 10:03
@sciascid sciascid force-pushed the raft-single-node-cluster-after-restart branch 4 times, most recently from 01cd563 to 09f117b Compare February 19, 2026 11:35
@sciascid
Contributor Author

Notice that before this PR initSingleMemRaftNode would start in an inconsistent configuration: the node would have cluster size = 3, but only 1 node in its peer list.

With the changes in this PR we go back to a sane initial state:

  func TestNRGInitSingleMemRaftNodeDefaults(t *testing.T) {
          n, cleanup := initSingleMemRaftNode(t)
          defer cleanup()
          require_Equal(t, n.ID(), "esFhDys3")
          require_Equal(t, len(n.Peers()), 1)
          require_Equal(t, n.Peers()[0].ID, "esFhDys3")
          require_Equal(t, n.ClusterSize(), 1)
          require_True(t, n.Quorum())
  }

Some tests were relying on that initial configuration, which explains why some tests need tweaking.
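
As a rough illustration of why the old starting state was awkward, the sketch below assumes a plain majority rule. `hasQuorum` is a hypothetical helper written for this comment, not the `Quorum()` method used in the test above, which may take other factors into account.

```go
package main

import "fmt"

// hasQuorum reports whether the nodes we can count on (ourselves included)
// form a majority of clusterSize. Hypothetical helper for illustration only.
func hasQuorum(clusterSize, reachable int) bool {
	return reachable >= clusterSize/2+1
}

func main() {
	// Old initial state: cluster size 3, but only one known peer (ourselves).
	fmt.Println(hasQuorum(3, 1)) // false

	// New initial state: cluster size 1 and one known peer.
	fmt.Println(hasQuorum(1, 1)) // true
}
```

Under that assumption, a lone node with cluster size 3 could never reach a majority on its own, while a lone node with cluster size 1 can.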

@sciascid sciascid force-pushed the raft-single-node-cluster-after-restart branch from 09f117b to 3e7f03e Compare February 19, 2026 12:30
Member

@MauriceVanVeen MauriceVanVeen left a comment

LGTM!

Member

@neilalexander neilalexander left a comment

LGTM

@neilalexander
Member

@sciascid Looks like there's now a merge conflict in the tests, mind rebasing & resolving please?

On restart, replaying EntryAddPeer could incorrectly leave a
raft node at cluster size 1 instead of restoring the expected
size and quorum from persisted state.
This bug could lead to the following scenario: a node in a
3-node cluster restarts and resets its cluster size to 1.
If the node does not receive any message from the other nodes,
it can campaign to become leader. Believing it is in a
single-node cluster, it wins the election, so the original
cluster splits into two clusters (or has two leaders at the
same time).
Specifically, if an EntryAddPeer was replayed from the log,
it would overwrite the cluster size and quorum to 1.
The peer set is now restored before the log is replayed, and
it is taken from the snapshot (if no snapshot is present,
we fall back to peer.idx).
If a log entry that changes membership is replayed, it will
now update the cluster size and quorum correctly.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
@sciascid sciascid force-pushed the raft-single-node-cluster-after-restart branch from 3e7f03e to b26f715 Compare February 19, 2026 13:07
@sciascid
Copy link
Contributor Author

@neilalexander Done

@neilalexander neilalexander merged commit 0e7df38 into main Feb 19, 2026
48 checks passed
@neilalexander neilalexander deleted the raft-single-node-cluster-after-restart branch February 19, 2026 15:38
neilalexander added a commit that referenced this pull request Feb 20, 2026
Includes the following:

- #7839
- #7843
- #7824
- #7826
- #7845
- #7844
- #7840
- #7827
- #7846
- #7848
- #7849
- #7855
- #7850
- #7857
- #7856

Signed-off-by: Neil Twigg <neil@nats.io>