
[IMPROVED] Eventually force snapshots if blocked due to catchups #7846

Merged

neilalexander merged 2 commits into main from maurice/force-snapshot on Feb 19, 2026

Conversation

@MauriceVanVeen (Member) commented Feb 18, 2026

Follow-up related to #7827. This also counts failed snapshot attempts (likely due to a catchup preventing them) and eventually forces the snapshot through for stream and consumer Raft groups. It also tunes meta snapshots to be forced earlier, consistent with the stream/consumer snapshot behavior.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review February 18, 2026 18:47
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner February 18, 2026 18:47

@wallyqs (Member) left a comment

LGTM

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@sciascid (Contributor) commented

These snapshot policies look somewhat aggressive to me. Wondering if there's a risk that a node may be unable to catch up under certain scenarios?
Also, is this trying to solve a problem that has been observed in the wild, or maybe experimentally?

@MauriceVanVeen (Member, Author) commented

These snapshot policies look somewhat aggressive to me. Wondering if there's a risk that a node may be unable to catch up under certain scenarios?

For meta/stream/consumer, snapshots would be blocked indefinitely if a lagging peer were in continuous catchup. That would prevent any snapshots from happening.

The limits are not too aggressive at the moment: in essence, if we can't snapshot due to a catchup we retry again on a 15 second interval 3 more times. If after a minute (the 5th try) we're still blocked, we go ahead and snapshot. Catching-up peers generally should never be catching up for more than a minute, but when they do we want to protect ourselves from infinite WAL growth.
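
A minimal sketch of that idea in Go, under stated assumptions: the field and constant names (blockedSnapshots, maxBlockedAttempts, snapshotRetryInterval) are hypothetical, not the actual nats-server identifiers. It only illustrates the counting-and-forcing behavior described above: attempts skipped because a peer is still catching up are counted, and once the threshold is passed the next attempt is forced through.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Hypothetical values mirroring the description above.
	snapshotRetryInterval = 15 * time.Second // cadence of retries while blocked
	maxBlockedAttempts    = 4                // force through on the attempt after this many skips
)

type raftGroup struct {
	blockedSnapshots int // snapshot attempts skipped because a peer was catching up
}

// maybeSnapshot is called on the usual snapshot cadence. Normally we skip
// while a catchup is active, but after enough skipped attempts we force the
// snapshot anyway so the WAL cannot grow without bound.
func (g *raftGroup) maybeSnapshot(catchupActive bool) {
	force := g.blockedSnapshots >= maxBlockedAttempts
	if catchupActive && !force {
		g.blockedSnapshots++
		fmt.Println("snapshot skipped, peer still catching up")
		return
	}
	g.blockedSnapshots = 0
	fmt.Printf("taking snapshot (forced=%v)\n", force)
}

func main() {
	g := &raftGroup{}
	// Simulate a peer that never finishes catching up: attempts 1-4 are
	// skipped; the 5th (roughly a minute at a 15s retry interval) is forced.
	for i := 0; i < 5; i++ {
		g.maybeSnapshot(true)
	}
}
```

Resetting the counter after any snapshot (forced or not) keeps the forcing behavior limited to prolonged catchups rather than becoming the normal path.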

Also, is this trying to solve a problem that has been observed in the wild, or maybe experimentally?

It has been observed in the wild under heavy-load setups, including the ones @wallyqs is testing with. So we definitely need these kinds of protective measures in place.

However, the 15-second interval, the 4-try threshold, etc. are pretty much arbitrary at this point. They can be relaxed if needed, but we at least need something to protect the log from growing indefinitely in such a case.

@neilalexander (Member) left a comment

LGTM

neilalexander merged commit c2e5c03 into main on Feb 19, 2026
90 of 92 checks passed
neilalexander deleted the maurice/force-snapshot branch on February 19, 2026 10:29
neilalexander added a commit that referenced this pull request Feb 20, 2026
Includes the following:

- #7839
- #7843
- #7824
- #7826
- #7845
- #7844
- #7840
- #7827
- #7846
- #7848
- #7849
- #7855
- #7850
- #7857
- #7856

Signed-off-by: Neil Twigg <neil@nats.io>