
feat: Raft-based High Availability using Apache Ratis #3730

@robfrank

Summary

Redesign the ArcadeDB High Availability stack using Apache Ratis (a Java implementation of the Raft consensus protocol) to replace the current custom leader-follower replication. This provides stronger consistency guarantees, automatic leader election, and proven consensus semantics.

Motivation

The current HA implementation uses a custom replication protocol that has known limitations around split-brain prevention, quorum enforcement, and failure recovery. Raft is a well-understood consensus algorithm with formal correctness proofs, and Apache Ratis is a mature, Apache 2.0-licensed Java implementation used in production by Apache Ozone and other systems.

Design

New Module: ha-raft

A self-contained Maven module (arcadedb-ha-raft) that plugs into the existing server via ServiceLoader discovery. Key components:

  • RaftHAPlugin — server plugin entry point, config validation, ServiceLoader registration
  • RaftHAServer — wraps the Ratis RaftServer and RaftClient; peer list parsing, leader election
  • RaftReplicatedDatabase — wraps LocalDatabase; intercepts TX commits and schema changes and submits them to the Raft log
  • ArcadeStateMachine — Ratis StateMachine that applies committed log entries (WAL replay and schema ops)
  • RaftLogEntryCodec — serializes TX (WAL deltas) and SCHEMA entries for the Raft log
  • SnapshotManager — checksum-based incremental snapshots for new node bootstrap
  • ClusterMonitor — replication lag tracking and health reporting
  • GetClusterHandler — HTTP /api/v1/cluster endpoint for cluster status
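The plugin is wired in via the standard java.util.ServiceLoader mechanism. A minimal sketch of how such discovery typically works — the HAPlugin interface and method names below are illustrative stand-ins, not the actual ArcadeDB API:

```java
import java.util.ServiceLoader;

public class PluginDiscovery {
    // Hypothetical plugin contract; ArcadeDB's real plugin interface differs.
    public interface HAPlugin {
        String name();
        void startService();
    }

    public static void main(String[] args) {
        // Providers are declared in META-INF/services/... inside the plugin jar
        // (e.g. arcadedb-ha-raft); none are bundled with this sketch, so the
        // loop below finds zero implementations.
        ServiceLoader<HAPlugin> loader = ServiceLoader.load(HAPlugin.class);
        int found = 0;
        for (HAPlugin plugin : loader) {
            System.out.println("Discovered HA plugin: " + plugin.name());
            plugin.startService();
            found++;
        }
        System.out.println("Plugins found: " + found);
    }
}
```

Dropping the arcadedb-ha-raft jar on the classpath is then enough for the server to pick up the Raft backend without code changes.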

Configuration

New GlobalConfiguration properties:

  • arcadedb.ha.implementation — selects HA backend (raft or legacy)
  • arcadedb.ha.raft.port — gRPC port for Raft inter-node communication (default: 2434)
  • arcadedb.ha.raft.persistStorage — persist Raft log to disk for crash recovery
  • arcadedb.ha.raft.snapshotThreshold — log entries before triggering snapshot compaction
  • arcadedb.ha.replicationLagWarning — threshold (ms) to warn about replication lag
  • arcadedb.ha.serverList — static peer list in format host:raftPort:httpPort[,...]
  • arcadedb.ha.quorum — quorum policy (majority, all, none)
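The serverList format above can be parsed into per-peer records roughly as follows. Class and field names are illustrative (the real parser lives in RaftHAServer and also accepts an optional fourth :priority field); this sketch covers only the basic three-field form:

```java
import java.util.ArrayList;
import java.util.List;

public class PeerListParser {
    // Illustrative peer record; not the actual ArcadeDB class.
    public record Peer(String host, int raftPort, int httpPort) {}

    // Parses "host:raftPort:httpPort[,host:raftPort:httpPort...]".
    public static List<Peer> parse(String serverList) {
        List<Peer> peers = new ArrayList<>();
        for (String entry : serverList.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 3)
                throw new IllegalArgumentException("Expected host:raftPort:httpPort, got: " + entry);
            peers.add(new Peer(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2])));
        }
        return peers;
    }

    public static void main(String[] args) {
        List<Peer> peers = parse("node1:2434:2480,node2:2434:2480,node3:2434:2480");
        System.out.println(peers.size() + " peers, first host: " + peers.get(0).host());
    }
}
```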

How It Works

  1. Startup: Each server starts a Ratis RaftServer with the configured peer list. Ratis handles leader election automatically.
  2. Writes: RaftReplicatedDatabase intercepts commit() and command() calls, serializes the WAL delta or schema operation, and submits it to the Raft log via RaftClient.
  3. Replication: Ratis replicates the log entry to a majority of peers. Once committed, ArcadeStateMachine.applyTransaction() replays the entry on each node.
  4. Leader forwarding: Follower nodes forward write commands to the leader via authenticated HTTP, transparent to the client.
  5. Reads: served locally from any node; follower reads may be slightly stale (eventual consistency).
  6. Failure recovery: Crashed nodes replay the Raft log on restart. Nodes that fall too far behind receive a snapshot via SnapshotManager.
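Step 2 of the write path can be sketched as follows. The RaftLog interface, the encoding, and the fixed commit index are hypothetical stand-ins for the real RaftClient, RaftLogEntryCodec, and Ratis replication, meant only to show the shape of "intercept commit, serialize, submit, wait for quorum":

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class WritePathSketch {
    // Hypothetical stand-in for the Ratis RaftClient submission path.
    interface RaftLog {
        CompletableFuture<Long> submit(byte[] entry); // replicate, complete with commit index
    }

    static byte[] encodeTxEntry(String walDelta) {
        // The real codec (RaftLogEntryCodec) serializes WAL deltas; here, a tagged string.
        return ("TX:" + walDelta).getBytes(StandardCharsets.UTF_8);
    }

    // Intercepted commit(): serialize the WAL delta and block until a
    // majority of peers have acknowledged the log entry.
    static long commit(RaftLog log, String walDelta) {
        byte[] entry = encodeTxEntry(walDelta);
        return log.submit(entry).join();
    }

    public static void main(String[] args) {
        // In-memory fake standing in for Ratis replication.
        RaftLog fake = entry -> CompletableFuture.completedFuture(42L);
        System.out.println("committed at index " + commit(fake, "insert-row-7"));
    }
}
```

Only after the future completes (majority acknowledgment) does the intercepted commit() return to the caller, which is what gives writes their strong-consistency guarantee.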

Quorum & Split-Brain Prevention

With quorum=majority, a 3-node cluster tolerates 1 failure. If the leader loses contact with the majority, it automatically steps down (Raft protocol guarantee). The majority partition elects a new leader. No split-brain is possible — Raft's term-based voting prevents two leaders in the same term.
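The fault-tolerance claim follows from simple majority arithmetic: an n-node cluster needs n/2 + 1 votes to commit, so it tolerates (n - 1)/2 failures. For example:

```java
public class QuorumMath {
    // Votes required for a majority quorum in an n-node cluster.
    static int majority(int n)          { return n / 2 + 1; }

    // Failures the cluster can tolerate while still reaching quorum.
    static int toleratedFailures(int n) { return (n - 1) / 2; }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.println(n + " nodes: quorum=" + majority(n)
                + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

This is why production Raft clusters use odd node counts: a 4-node cluster still needs 3 votes, so it tolerates no more failures than a 3-node one.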

Implementation Status

Development is on the ha-redesign branch. Current state:

Core Implementation (complete)

  • Raft log entry serialization for TX and schema operations
  • State machine with WAL replay and schema application
  • Cluster monitor with lag tracking
  • Peer list parsing (supports host:raftPort:httpPort:priority format)
  • Snapshot manager with checksum-based incremental sync
  • Leader command forwarding via authenticated HTTP
  • Plugin discovery via ServiceLoader
  • Cluster status API (/api/v1/cluster)

Test Coverage

Unit tests (25 in ha-raft):

  • RaftHAServerTest — peer list parsing, display names, cluster token, separator handling
  • ArcadeStateMachineTest — state machine entry application
  • RaftLogEntryCodecTest — serialization round-trips
  • ClusterMonitorTest — lag tracking
  • SnapshotManagerTest — snapshot lifecycle
  • RaftHAPluginTest — config validation
  • ConfigValidationTest — configuration edge cases

Integration tests (14 in ha-raft):

  • 2-node and 3-node replication
  • Schema replication across cluster
  • Leader failover and re-election
  • Replica crash and recovery (Raft log replay)
  • Leader crash and recovery with DatabaseComparator verification
  • Quorum loss detection
  • Split-brain prevention (3-node and 5-node clusters)
  • Full snapshot resync for lagging nodes

End-to-end container tests (9 in e2e-ha):

  • SimpleHaScenarioIT — 2-node basic replication
  • ThreeInstancesScenarioIT — 3-node cluster with data verification
  • LeaderFailoverIT — leader crash, re-election, data consistency
  • NetworkPartitionIT — leader partition, follower partition, no-quorum scenarios
  • NetworkPartitionRecoveryIT — partition healing with Raft log catch-up
  • SplitBrainIT — split-brain prevention verification
  • RollingRestartIT — zero-downtime rolling restarts with data persistence
  • NetworkDelayIT — cluster behavior under latency (via Toxiproxy)
  • PacketLossIT — cluster behavior under packet loss (via Toxiproxy)

Remaining Work

  • Performance benchmarking vs current HA implementation
  • Read consistency options (leader reads, follower reads with staleness bound)
  • Dynamic cluster membership (add/remove nodes without restart)
  • Metrics integration (Ratis metrics → Prometheus)
  • Documentation and migration guide from legacy HA
  • Studio UI updates for Raft cluster status

Labels: enhancement (New feature or request)