NRG: Fix cluster size drop to 1 on replaying EntryAddPeer after restart #7850
Merged
neilalexander merged 1 commit into main on Feb 19, 2026
Notice that before this PR
With the changes in this PR we go back to a sane initial state:
Some tests were relying on that initial configuration, which explains why some tests need tweaking.
@sciascid Looks like there's now a merge conflict in the tests, mind rebasing & resolving please?
On restart, replaying EntryAddPeer could incorrectly leave a raft node at cluster size 1 instead of restoring the expected size and quorum from persisted state.

This bug could lead to the following scenario: a node in a 3-node cluster could restart and reset its cluster size to 1. If the node did not receive any message from other nodes, it could campaign to become leader. Being in a single-node cluster, it would win the election, resulting in the original cluster splitting into two clusters (or having two leaders at the same time).

Specifically, if an EntryAddPeer was replayed from the log, it would overwrite the cluster size and quorum to 1. The peer set is now restored before the log is replayed, and it is taken from the snapshot (if no snapshot is present, we fall back to peer.idx).

If a log entry that changes membership is replayed, it will now update the cluster and quorum size correctly.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
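The scenario in the commit message above can be illustrated with a small, self-contained Go sketch. This is not the nats-server implementation: the `node`, `entry`, `newNode`, `adjust`, and `replay` names are hypothetical, and the sketch assumes the failure arises when a membership entry is replayed against a peer set that has not yet been restored. It only shows why restoring the peer set before replay keeps the cluster size and quorum at the expected values.

```go
package main

import "fmt"

// entryType is a simplified stand-in for raft log entry types (hypothetical).
type entryType int

const (
	entryNormal entryType = iota
	entryAddPeer
	entryRemovePeer
)

// entry is a minimal log entry carrying only what the sketch needs.
type entry struct {
	typ  entryType
	peer string
}

// node is a hypothetical, stripped-down raft node that tracks membership only.
type node struct {
	peers map[string]struct{}
	csz   int // cluster size
	qn    int // votes needed for quorum
}

// quorum returns the majority needed for a cluster of the given size.
func quorum(size int) int { return size/2 + 1 }

// newNode seeds the peer set from persisted state (for example a snapshot)
// before any log entry is replayed. Passing nil models the missing restore.
func newNode(persisted []string) *node {
	n := &node{peers: make(map[string]struct{})}
	for _, p := range persisted {
		n.peers[p] = struct{}{}
	}
	n.adjust()
	return n
}

// adjust recomputes cluster size and quorum from the current peer set.
func (n *node) adjust() {
	n.csz = len(n.peers)
	n.qn = quorum(n.csz)
}

// replay applies membership-changing entries and ignores everything else.
func (n *node) replay(log []entry) {
	for _, e := range log {
		switch e.typ {
		case entryAddPeer:
			n.peers[e.peer] = struct{}{}
			n.adjust()
		case entryRemovePeer:
			delete(n.peers, e.peer)
			n.adjust()
		}
	}
}

func main() {
	log := []entry{{typ: entryAddPeer, peer: "C"}}

	// Peer set not restored: the replayed entry is the only known peer,
	// so a single vote is enough to win an election.
	unrestored := newNode(nil)
	unrestored.replay(log)
	fmt.Println(unrestored.csz, unrestored.qn)

	// Peer set restored first: replaying the same entry keeps size 3, quorum 2.
	restored := newNode([]string{"A", "B", "C"})
	restored.replay(log)
	fmt.Println(restored.csz, restored.qn)
}
```

With a quorum of 1, a restarted node that hears nothing from its peers can elect itself, which is the split described above; with the peer set restored first, it needs a second vote.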
@neilalexander Done
neilalexander added a commit that referenced this pull request on Feb 20, 2026