[IMPROVED] Eventually force snapshots if blocked due to catchups #7846
neilalexander merged 2 commits into main
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
These snapshot policies look somewhat aggressive to me. Wondering if there's a risk that a node may be unable to catch up under certain scenarios?
Meta, stream, and consumer snapshots would be blocked indefinitely if a lagging peer were in continuous catchup, which would prevent any snapshots from happening. The limits are not too aggressive at the moment: in essence, if we can't snapshot due to catchup, we retry on a 15-second interval 3 more times. If after a minute (5th try) we're still blocked, we go ahead and snapshot. Peers generally should never be catching up for more than a minute, but when they are, we want to protect ourselves from infinite WAL growth.
This has been observed in the wild under heavy-load setups, including the ones @wallyqs is testing with, so we definitely need these kinds of protective measures in place. However, the 15-second interval and the 4-try threshold are pretty much arbitrary at this point. They can be relaxed if needed, but we at least need something to protect the log from growing infinitely in such a case.
Follow-up related to #7827. This also counts failed snapshots (likely prevented by a catchup) and eventually forces the snapshot through for stream and consumer Raft groups. It also tunes meta snapshots to be forced earlier, consistent with the stream/consumer snapshot behavior.
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>