[IMPROVED] Stage meta operations during Raft catchup #7540

Merged
neilalexander merged 4 commits into main from maurice/catchup-recovery on Nov 19, 2025
@MauriceVanVeen
Member

When the Raft node underlying the meta layer enters catchup from another server, it is now placed in the same "recovery mode", so that it can stage changes into `ru *recoveryUpdates`. As a result, a consumer that is added and then deleted during catchup becomes a no-op.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
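
To make the staging idea in the description concrete, here is a minimal sketch of how a staged add followed by a staged delete can cancel out before anything is applied. The `stagedUpdates` type, its fields, and its methods are hypothetical stand-ins for illustration; they are not the server's actual `recoveryUpdates` implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// stagedUpdates is a hypothetical, simplified stand-in for the server's
// recoveryUpdates; the field and method names are illustrative only.
type stagedUpdates struct {
	addConsumers    map[string]struct{}
	removeConsumers map[string]struct{}
}

func newStagedUpdates() *stagedUpdates {
	return &stagedUpdates{
		addConsumers:    make(map[string]struct{}),
		removeConsumers: make(map[string]struct{}),
	}
}

// addConsumer stages a create and drops any staged removal for the same key.
func (s *stagedUpdates) addConsumer(key string) {
	delete(s.removeConsumers, key)
	s.addConsumers[key] = struct{}{}
}

// removeConsumer stages a removal and drops any staged create for the same
// key, so a consumer created and deleted while staging is never created.
func (s *stagedUpdates) removeConsumer(key string) {
	delete(s.addConsumers, key)
	s.removeConsumers[key] = struct{}{}
}

// pending returns the staged creates and removals in sorted order.
func (s *stagedUpdates) pending() (adds, removes []string) {
	for k := range s.addConsumers {
		adds = append(adds, k)
	}
	for k := range s.removeConsumers {
		removes = append(removes, k)
	}
	sort.Strings(adds)
	sort.Strings(removes)
	return adds, removes
}

func main() {
	ru := newStagedUpdates()
	ru.addConsumer("ORDERS/c1")
	ru.removeConsumer("ORDERS/c1") // cancels the staged create for c1
	ru.addConsumer("ORDERS/c2")

	adds, removes := ru.pending()
	fmt.Println("create:", adds, "remove:", removes)
}
```

In this sketch only `ORDERS/c2` is left to create once staging ends; the create for `ORDERS/c1` is never executed.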

server/raft.go Outdated
if n.catchup != nil && n.catchup.sub != nil {
n.unsubscribe(n.catchup.sub)
} else {
// Signal we've started catching up.
Member

"Signal to the upper layer that the following entries are catchup entries, up until the nil guard."

Member Author

Done

mset.mu.RUnlock()

for i, e := range ce.Entries {
// Ignore if lower catchup is started.
Member

May make sense for the comments here to explain why we'd ignore the Raft catchup, i.e. because there's an upper-layer catchup for gaps, and because we no-op sequences we've seen before?

Member Author

Updated, but kept it a bit more generic:

// Ignore if lower-level catchup is started.
// We don't need to optimize during this, all entries are handled as normal.

For streams, the existing upper-layer catchup and the no-op handling of already-seen sequences are not changed by this new `EntryCatchup` entry. So the comment instead highlights that the normal flow is kept and there's no need to optimize here.
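
A rough sketch of what "ignore the marker, handle everything else as normal" could look like in an apply loop. The `entry` and `entryType` definitions below are simplified, hypothetical stand-ins and do not match the server's real types or constant values.

```go
package main

import "fmt"

// entryType and entry are hypothetical, simplified stand-ins for the
// Raft entry types; values and fields are illustrative only.
type entryType uint8

const (
	entryNormal entryType = iota
	entryCatchup
)

type entry struct {
	Type entryType
	Data []byte
}

// applyEntries applies every entry except the catchup marker and
// returns how many entries were applied.
func applyEntries(entries []*entry, apply func(data []byte)) int {
	applied := 0
	for _, e := range entries {
		// Ignore if lower-level catchup is started.
		// We don't need to optimize during this, all entries are handled as normal.
		if e.Type == entryCatchup {
			continue
		}
		apply(e.Data)
		applied++
	}
	return applied
}

func main() {
	entries := []*entry{
		{Type: entryCatchup},
		{Type: entryNormal, Data: []byte("msg-1")},
		{Type: entryNormal, Data: []byte("msg-2")},
	}
	n := applyEntries(entries, func(data []byte) {
		fmt.Printf("applied %s\n", data)
	})
	fmt.Println("applied entries:", n)
}
```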

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/catchup-recovery branch 6 times, most recently from 57570cc to e415658 Compare November 14, 2025 13:53
MauriceVanVeen and others added 3 commits November 18, 2025 18:03
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/catchup-recovery branch 2 times, most recently from 0690d20 to 1c29c31 Compare November 18, 2025 17:40
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/catchup-recovery branch from 1c29c31 to 4e26946 Compare November 19, 2025 07:53
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review November 19, 2025 08:56
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner November 19, 2025 08:56
Member

@neilalexander neilalexander left a comment

LGTM

@neilalexander neilalexander merged commit 5bf3833 into main Nov 19, 2025
89 of 92 checks passed
@neilalexander neilalexander deleted the maurice/catchup-recovery branch November 19, 2025 14:34
neilalexander added a commit that referenced this pull request Nov 26, 2025
Includes the following:

- #7553
- #7540
- #7555
- #7579
- #7578

Signed-off-by: Neil Twigg <neil@nats.io>
wallyqs pushed a commit to wallyqs/nats-server that referenced this pull request Jan 6, 2026
Backport of PR nats-io#7540 to release-v2.11.12.

When a Raft node enters catchup mode from another server, it now enters a
"recovery mode" that stages changes into recovery updates, allowing added
and deleted consumers to become no-ops during this process.

Key changes:
- Add EntryCatchup entry type in raft to signal catchup start/end
- Modify cancelCatchup to push nil entry when catchup completes
- Modify createCatchup to send EntryCatchup entry at catchup start
- Update monitorCluster to handle EntryCatchup and track recovery state
- Change applyMetaEntries signature to return (isRecovering, didSnap, error)
- Handle nil ce entries in monitorStream and monitorConsumer
- Ignore EntryCatchup entries in applyStreamEntries and applyConsumerEntries

Signed-off-by: Claude <noreply@anthropic.com>
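
The signalling described in the key changes above (an `EntryCatchup` entry at catchup start, a nil entry when catchup completes) can be sketched as a small state machine. This is a simplified illustration under stated assumptions: `metaMonitor`, `committedEntry`, and `handle` are hypothetical names, and flushing the staged changes on the nil entry is an assumption of this sketch, not a description of the actual `monitorCluster` code.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the Raft entry types.
type entryType uint8

const (
	entryNormal entryType = iota
	entryCatchup
)

type entry struct {
	Type entryType
	Data string
}

type committedEntry struct {
	Entries []*entry
}

// metaMonitor tracks whether we are in recovery mode and which changes
// were staged versus applied. All names here are illustrative.
type metaMonitor struct {
	recovering bool
	staged     []string
	applied    []string
}

// handle processes one committed entry. A nil entry acts as the guard
// that ends catchup; in this sketch it flushes everything staged so far.
func (m *metaMonitor) handle(ce *committedEntry) {
	if ce == nil {
		m.applied = append(m.applied, m.staged...)
		m.staged = nil
		m.recovering = false
		return
	}
	for _, e := range ce.Entries {
		switch {
		case e.Type == entryCatchup:
			// The entries that follow are catchup entries, up until the nil guard.
			m.recovering = true
		case m.recovering:
			m.staged = append(m.staged, e.Data)
		default:
			m.applied = append(m.applied, e.Data)
		}
	}
}

func main() {
	m := &metaMonitor{}
	m.handle(&committedEntry{Entries: []*entry{{Type: entryCatchup}}})
	m.handle(&committedEntry{Entries: []*entry{{Type: entryNormal, Data: "add consumer c1"}}})
	fmt.Println("recovering:", m.recovering, "staged:", len(m.staged), "applied:", len(m.applied))

	m.handle(nil) // catchup finished
	fmt.Println("recovering:", m.recovering, "staged:", len(m.staged), "applied:", len(m.applied))
}
```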
neilalexander added a commit that referenced this pull request Feb 13, 2026
…eta snapshot (#7824)

`TestJetStreamClusterDeleteConsumerWhileServerDown` (and others) would
fail if the restarted server couldn't install a snapshot during
shutdown. This happened if the server was a follower of the meta layer
and committed the consumer create from a heartbeat, which isn't stored
in the log. So, when the server restarted it didn't know it could
commit/apply this consumer create again (since it already did so prior
to restart). Then, when the meta snapshot was received to catch this
server up, it would not properly remove the consumer as it wasn't
tracked in its assignments.

Marked as 2.12+ as #7540 is
only cherry-picked there.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>