Background
In ArcadeDBStateMachine.applyTransaction() (around lines 327-340 on the apache-ratis branch), when a follower's apply fails we set needsSnapshotDownload=true and submit a download task. The task uses compareAndSet(true, false) to debounce, so the flag is cleared before the download runs.
If the download itself fails, the flag stays false until the next apply failure re-arms it. On a quiet cluster (no new writes), no new log entry arrives to trigger re-arming, so the follower stays diverged until either:
- The leader sends new entries (which will also fail to apply, re-arming the flag), or
- The node is restarted (the reinitialize() gap-detection path triggers a fresh download)
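The stuck-flag sequence described above can be reproduced in isolation. The following is a minimal sketch, not the actual ArcadeDBStateMachine code: the class SnapshotFlagSketch, its methods, and the BooleanSupplier standing in for the download are all hypothetical, and the task runs synchronously here instead of being submitted to an executor.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the flag handling described in this issue.
final class SnapshotFlagSketch {
  private final AtomicBoolean needsSnapshotDownload = new AtomicBoolean(false);
  private final BooleanSupplier download; // returns true when the download succeeds
  int attempts = 0;                       // number of downloads actually started

  SnapshotFlagSketch(BooleanSupplier download) {
    this.download = download;
  }

  // Called when applying a log entry fails on a follower.
  void onApplyFailure() {
    needsSnapshotDownload.set(true);
    runDownloadTask(); // assumption: run inline instead of submitting to an executor
  }

  // The CAS debounces duplicate tasks, but it also clears the flag
  // before the outcome of the download is known.
  void runDownloadTask() {
    if (!needsSnapshotDownload.compareAndSet(true, false))
      return; // nothing armed, or another task already claimed the request
    attempts++;
    download.getAsBoolean(); // result ignored: a failed download leaves the flag at false
  }

  boolean isArmed() {
    return needsSnapshotDownload.get();
  }
}
```

With a download that always fails, one call to onApplyFailure() starts a single attempt and leaves isArmed() at false. Any later runDownloadTask() call is then a no-op, which is the quiet-cluster case this issue is about.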
Proposal
Add a periodic check in HealthMonitor that re-arms needsSnapshotDownload when:
- this node is a follower
- commitIndex - lastAppliedIndex > <threshold>
- catchingUp == false (active log replay is not making progress)
- needsSnapshotDownload == false (no download already queued)
- the lag has been observed for at least N consecutive ticks (avoids transient catch-up false positives)
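The conditions above could be combined roughly as follows. This is a sketch under assumptions, not a proposed implementation: the class StaleFollowerCheck, its tick() signature, and the choice to pass node state in as plain parameters are all hypothetical, and the threshold and tick count are left to the caller because no defaults have been validated.

```java
// Hypothetical sketch of the proposed HealthMonitor check; all names are illustrative.
final class StaleFollowerCheck {
  private final long lagThreshold;  // maximum tolerated commitIndex - lastAppliedIndex
  private final int requiredTicks;  // N consecutive lagging ticks before re-arming
  private int consecutiveLaggingTicks = 0;

  StaleFollowerCheck(long lagThreshold, int requiredTicks) {
    this.lagThreshold = lagThreshold;
    this.requiredTicks = requiredTicks;
  }

  // Returns true when the caller should re-arm needsSnapshotDownload.
  boolean tick(boolean isFollower, long commitIndex, long lastAppliedIndex,
               boolean catchingUp, boolean needsSnapshotDownload) {
    boolean lagging = isFollower
        && commitIndex - lastAppliedIndex > lagThreshold
        && !catchingUp
        && !needsSnapshotDownload;
    if (!lagging) {
      consecutiveLaggingTicks = 0; // any healthy tick resets the observation window
      return false;
    }
    if (++consecutiveLaggingTicks < requiredTicks)
      return false; // lag seen, but not for long enough yet
    consecutiveLaggingTicks = 0;   // start a fresh window after re-arming
    return true;
  }
}
```

Resetting the counter on any non-lagging tick is what keeps a follower in ordinary catch-up from triggering a download; only lag that persists across requiredTicks consecutive checks re-arms the flag.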
Why deferred from the apache-ratis PR
- Failure mode is narrow: requires apply failure AND snapshot download failure AND a quiet cluster. The restart mitigation already works.
- Threshold tuning needs telemetry, not guessing: on busy clusters with HA_SNAPSHOT_THRESHOLD=100k, normal catch-up lag can be in the tens of thousands. Picking a threshold without production data risks thundering-herd snapshot downloads from multiple followers hitting the leader at once.
- Two new config knobs (lag threshold + persistence duration) widen the public API and should land with a documented default that's been validated in real workloads.
- Test surface: a meaningful regression test requires injecting both an apply failure and a snapshot-download failure, then waiting for the periodic check - slow and flaky-prone.
Acceptance criteria
- HA_STALE_FOLLOWER_LAG_THRESHOLD config (default informed by production telemetry)
- HA_STALE_FOLLOWER_RECOVERY_DURATION_MS config
- ArcadeDBStateMachine.recoverFromPersistentLag() method invoked from HealthMonitor
Existing breadcrumb
The current behavior is documented in ArcadeDBStateMachine.java lines 318-326 (the comment block above needsSnapshotDownload.set(true)).