[IMPROVED] NRG: Drop proposals/pause quorum if we're being overrun#7853

Draft
MauriceVanVeen wants to merge 1 commit into main from maurice/nrg-overrun-protection

@MauriceVanVeen
Member

This PR adds a protective measure to guard against unbounded WAL growth. Currently, an overloaded server can see its meta log grow to several GBs, eventually requiring the log to be manually deleted on the server in order to recover.

  • The leader will start dropping proposals once it has reached a certain threshold of uncommitted and unapplied entries. We already wait to apply all entries in the log before we signal to the upper layer that we're the leader, so this has no impact during leader changes. This protection ensures that, once we're leader, we're not being spammed with proposals faster than we can commit and apply them.
  • The followers store entries in their logs before the leader can mark them as having quorum/being committed. If a follower applies entries more slowly than the leader can make it add new entries and mark them as committed, the WAL on this follower will grow unbounded. And since all to-be-applied entries are pushed into the apply queue, this eventually makes the server go OOM. This protection ensures the follower will temporarily stop accepting new writes to work through the apply backlog first. This bounds the total number of committed but not-yet-applied entries and allows the follower to be caught up by the leader from a snapshot, instead of continuously storing new append entries and indefinitely growing its log.
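
The two checks described above can be illustrated with a minimal sketch. It is not the code in this PR: the `logState` type, its field names, and both helper methods are hypothetical, and the choice to measure the leader's backlog as everything past the applied index is one reading of "uncommitted and unapplied entries". Only the 100k value of `pauseQuorumThreshold` comes from the description.

```go
package main

import "fmt"

// pauseQuorumThreshold uses the 100k value named in the PR description.
const pauseQuorumThreshold = 100_000

// logState is a hypothetical, simplified view of a Raft node's indexes.
// It is an illustration only, not the real NRG data structure.
type logState struct {
	lastIndex uint64 // last entry stored in the WAL
	commit    uint64 // last entry known to have quorum
	applied   uint64 // last entry applied by the upper layer
}

// backlog returns hi-lo, clamped at zero so a stale index cannot underflow.
func backlog(hi, lo uint64) uint64 {
	if hi > lo {
		return hi - lo
	}
	return 0
}

// leaderShouldDropProposal counts every entry that is not yet applied,
// which includes entries that are still uncommitted (assumed reading).
func (s logState) leaderShouldDropProposal() bool {
	return backlog(s.lastIndex, s.applied) > pauseQuorumThreshold
}

// followerShouldPause counts only entries that are committed but unapplied.
func (s logState) followerShouldPause() bool {
	return backlog(s.commit, s.applied) > pauseQuorumThreshold
}

func main() {
	// Same indexes seen from both roles: 110k unapplied in total,
	// of which 80k are already committed.
	s := logState{lastIndex: 150_000, commit: 120_000, applied: 40_000}
	fmt.Println("leader drops proposals:", s.leaderShouldDropProposal())
	fmt.Println("follower pauses:", s.followerShouldPause())
}
```

Because the leader's count is a superset of the follower's count for the same indexes, the leader-side check trips no later than the follower-side one in this model.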

The threshold is reasonably high. We keep incoming append entries cached in n.pae; this starts logging a warning at paeWarnThreshold: 10k and eventually caps the cache size at paeDropThreshold: 20k, at which point new entries aren't cached and need to be loaded from disk when they are committed. Both of the above protective measures only kick in when going over pauseQuorumThreshold: 100k append entries that haven't gotten quorum on the leader, or that have been committed but not yet applied on the follower. This difference between 'total uncommitted/unapplied entries in the log' on the leader and 'total unapplied but committed entries in the log' on the follower should ensure that, under normal circumstances, the leader starts dropping proposals first. If a follower is otherwise overloaded, it can also guard itself.
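
As a rough illustration of the cache tiers mentioned above, the sketch below classifies what happens to an incoming append entry based on how many entries are already cached. The `cacheAction` type and `classifyPae` function are hypothetical names, and the exact boundary comparisons (`>=`) are an assumption; only the constant names and the 10k/20k values come from the description.

```go
package main

import "fmt"

// Values taken from the PR description.
const (
	paeWarnThreshold = 10_000
	paeDropThreshold = 20_000
)

// cacheAction is a hypothetical enum describing what to do with a new entry.
type cacheAction int

const (
	cacheEntry   cacheAction = iota // cache the entry silently
	cacheAndWarn                    // cache the entry and log a warning
	skipCache                       // do not cache; load from disk on commit
)

func (a cacheAction) String() string {
	switch a {
	case cacheEntry:
		return "cache"
	case cacheAndWarn:
		return "cache+warn"
	default:
		return "skip cache (load from disk later)"
	}
}

// classifyPae decides the action from the number of entries already cached.
// The >= comparisons are an assumed reading of "at" in the description.
func classifyPae(cached int) cacheAction {
	switch {
	case cached >= paeDropThreshold:
		return skipCache
	case cached >= paeWarnThreshold:
		return cacheAndWarn
	default:
		return cacheEntry
	}
}

func main() {
	for _, n := range []int{5_000, 10_000, 25_000} {
		fmt.Printf("%d cached entries: %s\n", n, classifyPae(n))
	}
}
```

Both cache tiers sit well below the 100k `pauseQuorumThreshold`, so in this model a node would be warning and reading from disk long before either protective measure engages.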

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
MauriceVanVeen force-pushed the maurice/nrg-overrun-protection branch from a9c46e6 to d98f0dd on February 20, 2026 10:39