HA: stale follower recovery when snapshot download fails on a quiet cluster #3893

@lvca

Background

In ArcadeDBStateMachine.applyTransaction() (around lines 327-340 on the apache-ratis branch), when a follower fails to apply an entry we set needsSnapshotDownload=true and submit a download task. The task debounces with compareAndSet(true, false), so the flag is already cleared by the time the download runs.

If the download itself fails, the flag stays false until the next apply failure re-arms it. On a quiet cluster (no new writes), no log entry arrives to re-arm it, so the follower stays diverged until either:

  1. The leader sends new entries (which will also fail to apply, re-arming the flag), or
  2. The node is restarted (the reinitialize() gap-detection path triggers a fresh download)
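
To make the gap concrete, here is a minimal sketch of the flag handling described above. It is not the actual ArcadeDBStateMachine code: the class name, the synchronous runDownloadTask() and the downloadSucceeds parameter are hypothetical stand-ins (the real task is submitted asynchronously).

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the flag handling; not the actual ArcadeDB code. */
final class SnapshotDebounceSketch {
  final AtomicBoolean needsSnapshotDownload = new AtomicBoolean(false);
  int downloadAttempts = 0;

  /** Apply failed: arm the flag and run the download task (synchronously here). */
  boolean onApplyFailure(boolean downloadSucceeds) {
    needsSnapshotDownload.set(true);
    return runDownloadTask(downloadSucceeds);
  }

  /** Debounce: only the caller that flips true -> false performs the download. */
  boolean runDownloadTask(boolean downloadSucceeds) {
    if (!needsSnapshotDownload.compareAndSet(true, false))
      return false; // flag not armed, nothing to do
    downloadAttempts++;
    // The flag is already false at this point. If the download fails,
    // nothing re-arms it, so no further attempt is ever scheduled.
    return downloadSucceeds;
  }

  public static void main(String[] args) {
    SnapshotDebounceSketch s = new SnapshotDebounceSketch();
    s.onApplyFailure(false); // apply fails, then the download fails too
    System.out.println("armed after failed download: " + s.needsSnapshotDownload.get());
    s.runDownloadTask(true); // a later attempt finds the flag cleared and does nothing
    System.out.println("download attempts: " + s.downloadAttempts);
  }
}
```

With no new apply failure, the second runDownloadTask() call is a no-op, which is the stuck state on a quiet cluster.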

Proposal

Add a periodic check in HealthMonitor that re-arms needsSnapshotDownload when:

  • this node is a follower
  • commitIndex - lastAppliedIndex > <threshold>
  • catchingUp == false (log replay is not actively making progress)
  • needsSnapshotDownload == false (no download already queued)
  • the lag has been observed for at least N consecutive ticks (avoids transient catch-up false positives)
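
The conditions above could be combined roughly as follows. This is only a sketch under assumptions: the class name, the tick() signature and the choice to reset the counter after a re-arm are all hypothetical, and the inputs would come from whatever HealthMonitor and the state machine actually expose.

```java
/** Hypothetical sketch of the periodic check; names and parameters are assumptions. */
final class StaleFollowerCheckSketch {
  private final long lagThreshold;  // assumed: maximum tolerated commitIndex - lastAppliedIndex
  private final int requiredTicks;  // assumed: N consecutive ticks before re-arming
  private int laggingTicks = 0;

  StaleFollowerCheckSketch(long lagThreshold, int requiredTicks) {
    this.lagThreshold = lagThreshold;
    this.requiredTicks = requiredTicks;
  }

  /** Returns true when needsSnapshotDownload should be re-armed on this tick. */
  boolean tick(boolean isFollower, long commitIndex, long lastAppliedIndex,
               boolean catchingUp, boolean needsSnapshotDownload) {
    boolean stale = isFollower
        && commitIndex - lastAppliedIndex > lagThreshold
        && !catchingUp
        && !needsSnapshotDownload;
    if (!stale) {
      laggingTicks = 0; // any healthy observation restarts the window
      return false;
    }
    if (++laggingTicks < requiredTicks)
      return false;     // lag not yet observed for long enough
    laggingTicks = 0;   // start a fresh window after re-arming
    return true;
  }
}
```

Resetting the counter on any non-stale tick is what keeps a transient catch-up from triggering a download: the lag has to be seen on N ticks in a row.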

Why deferred from the apache-ratis PR

  • Failure mode is narrow: requires apply failure AND snapshot download failure AND a quiet cluster. The restart mitigation already works.
  • Threshold tuning needs telemetry, not guessing: on busy clusters with HA_SNAPSHOT_THRESHOLD=100k, normal catch-up lag can be in the tens of thousands. Picking a threshold without production data risks thundering-herd snapshot downloads from multiple followers hitting the leader at once.
  • Two new config knobs (lag threshold + persistence duration) widen the public API and should land with a documented default that's been validated in real workloads.
  • Test surface: a meaningful regression test requires injecting both an apply failure and a snapshot-download failure, then waiting for the periodic check, which is slow and prone to flakiness.

Acceptance criteria

  • New HA_STALE_FOLLOWER_LAG_THRESHOLD config (default informed by production telemetry)
  • New HA_STALE_FOLLOWER_RECOVERY_DURATION_MS config
  • ArcadeDBStateMachine.recoverFromPersistentLag() method invoked from HealthMonitor
  • Integration test that simulates the persistent-failure window and asserts the follower converges
  • Telemetry/log line on each retrigger so operators can spot pathological loops
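
As a rough illustration of the last three criteria, recoverFromPersistentLag() could re-arm the flag idempotently and count each retrigger. Everything here except the method name and the needsSnapshotDownload flag is an assumption: the class, the logger name, the counter and the message format are hypothetical, and the sketch leaves out submitting the actual download task.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

/** Hypothetical sketch; the real method would live in ArcadeDBStateMachine. */
final class RecoverFromPersistentLagSketch {
  private static final Logger LOG = Logger.getLogger("ha.stale-follower"); // assumed logger name
  final AtomicBoolean needsSnapshotDownload = new AtomicBoolean(false);
  final AtomicLong retriggers = new AtomicLong(); // assumed telemetry counter

  /** Re-arms the flag if it is not already set; returns true if this call re-armed it. */
  boolean recoverFromPersistentLag(long commitIndex, long lastAppliedIndex) {
    if (!needsSnapshotDownload.compareAndSet(false, true))
      return false; // a download is already queued, do not count it twice
    long count = retriggers.incrementAndGet();
    LOG.warning(String.format(
        "Stale follower: re-arming snapshot download (lag=%d, retrigger #%d)",
        commitIndex - lastAppliedIndex, count));
    // Submitting the download task is omitted in this sketch.
    return true;
  }
}
```

A steadily growing retrigger count on one node would be the signal that it is stuck in a download-fail loop.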

Existing breadcrumb

The current behavior is documented in ArcadeDBStateMachine.java lines 318-326 (the comment block above needsSnapshotDownload.set(true)).

Labels: enhancement
