HA: stale follower recovery when snapshot download fails on a quiet cluster

## Background

In `ArcadeDBStateMachine.applyTransaction()` (around lines 327-340 on the `apache-ratis` branch), when a follower's apply fails we set `needsSnapshotDownload=true` and submit a download task. The task uses `compareAndSet(true, false)` to debounce, so the flag is cleared **before** the download runs.

If the download itself fails, the flag stays `false` until the next apply failure re-arms it. On a quiet cluster (no new writes), no new log entry will arrive to trigger re-arming, so the follower remains permanently diverged until either:

1. The leader sends new entries (which will also fail to apply, re-arming the flag), or
2. The node is restarted (the `reinitialize()` gap-detection path triggers a fresh download)

## Proposal

Add a periodic check in `HealthMonitor` that re-arms `needsSnapshotDownload` when:

- this node is a follower
- `commitIndex - lastAppliedIndex > <threshold>`
- `catchingUp == false` (active log replay is not making progress)
- `needsSnapshotDownload == false` (no download already queued)
- the lag has been observed for at least N consecutive ticks (avoids transient catch-up false positives)

## Why deferred from the apache-ratis PR

- **Failure mode is narrow**: requires apply failure AND snapshot download failure AND a quiet cluster. The restart mitigation already works.
- **Threshold tuning needs telemetry, not guessing**: on busy clusters with `HA_SNAPSHOT_THRESHOLD=100k`, normal catch-up lag can be in the tens of thousands. Picking a threshold without production data risks thundering-herd snapshot downloads from multiple followers hitting the leader at once.
- **Two new config knobs** (lag threshold + persistence duration) widen the public API and should land with a documented default that's been validated in real workloads.
- **Test surface**: a meaningful regression test requires injecting both an apply failure and a snapshot-download failure, then waiting for the periodic check - slow and flaky-prone.

## Acceptance criteria

- [ ] New `HA_STALE_FOLLOWER_LAG_THRESHOLD` config (default informed by production telemetry)
- [ ] New `HA_STALE_FOLLOWER_RECOVERY_DURATION_MS` config
- [ ] `ArcadeDBStateMachine.recoverFromPersistentLag()` method invoked from `HealthMonitor`
- [ ] Integration test that simulates the persistent-failure window and asserts the follower converges
- [ ] Telemetry/log line on each retrigger so operators can spot pathological loops

## Existing breadcrumb

The current behavior is documented in `ArcadeDBStateMachine.java` lines 318-326 (the comment block above `needsSnapshotDownload.set(true)`).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HA: stale follower recovery when snapshot download fails on a quiet cluster #3893

Background

Proposal

Why deferred from the apache-ratis PR

Acceptance criteria

Existing breadcrumb

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

HA: stale follower recovery when snapshot download fails on a quiet cluster #3893

Description

Background

Proposal

Why deferred from the apache-ratis PR

Acceptance criteria

Existing breadcrumb

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions