Summary
Redesign the ArcadeDB High Availability stack using Apache Ratis (Raft consensus protocol) to replace the current custom leader-follower replication. This provides stronger consistency guarantees, automatic leader election, and proven consensus semantics.
Motivation
The current HA implementation uses a custom replication protocol that has known limitations around split-brain prevention, quorum enforcement, and failure recovery. Raft is a well-understood consensus algorithm with formal correctness proofs, and Apache Ratis is a mature, Apache 2.0-licensed Java implementation used in production by Apache Ozone and other systems.
Design
New Module: ha-raft
A self-contained Maven module (arcadedb-ha-raft) that plugs into the existing server via ServiceLoader discovery. Key components:
| Component | Responsibility |
|---|---|
| RaftHAPlugin | Server plugin entry point, config validation, ServiceLoader registration |
| RaftHAServer | Wraps Ratis RaftServer + RaftClient; peer list parsing, leader election |
| RaftReplicatedDatabase | Wraps LocalDatabase, intercepts TX commit and schema changes, submits to Raft log |
| ArcadeStateMachine | Ratis StateMachine — applies committed log entries (WAL replay + schema ops) |
| RaftLogEntryCodec | Serializes TX (WAL deltas) and SCHEMA entries for the Raft log |
| SnapshotManager | Checksum-based incremental snapshot for new node bootstrap |
| ClusterMonitor | Replication lag tracking and health reporting |
| GetClusterHandler | HTTP /api/v1/cluster endpoint for cluster status |
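To make the codec's role concrete, here is a minimal, hypothetical sketch of the kind of framing a RaftLogEntryCodec performs: a one-byte entry type followed by a length-prefixed payload. The real codec serializes WAL deltas and schema operations; all names here are illustrative, not the actual API.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of log-entry framing: [type byte][payload length][payload].
// The real RaftLogEntryCodec carries serialized WAL deltas and schema ops.
public final class LogEntryCodecSketch {
    public static final byte TYPE_TX = 0;      // transaction (WAL delta) entry
    public static final byte TYPE_SCHEMA = 1;  // schema operation entry

    public static byte[] encode(byte type, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + payload.length);
        buf.put(type).putInt(payload.length).put(payload);
        return buf.array();
    }

    public static byte decodeType(byte[] entry) {
        return entry[0];
    }

    public static byte[] decodePayload(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);
        buf.get();                               // skip the type marker
        byte[] payload = new byte[buf.getInt()]; // read length prefix
        buf.get(payload);
        return payload;
    }
}
```

A codec like this is exercised by round-trip tests (encode then decode and compare), which is what RaftLogEntryCodecTest below verifies for the real implementation.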
Configuration
New GlobalConfiguration properties:
- arcadedb.ha.implementation — selects the HA backend (raft or legacy)
- arcadedb.ha.raft.port — gRPC port for Raft inter-node communication (default: 2434)
- arcadedb.ha.raft.persistStorage — persist the Raft log to disk for crash recovery
- arcadedb.ha.raft.snapshotThreshold — number of log entries before triggering snapshot compaction
- arcadedb.ha.replicationLagWarning — threshold (ms) above which replication lag is logged as a warning
- arcadedb.ha.serverList — static peer list in the format host:raftPort:httpPort[:priority][,...]
- arcadedb.ha.quorum — quorum policy (majority, all, or none)
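As an illustration only, a 3-node cluster member might set these properties as follows (values are made up for the example, and the properties-file form is an assumption; the keys are the GlobalConfiguration names listed above):

```properties
# Select the Raft-based HA backend
arcadedb.ha.implementation=raft
# gRPC port for Raft inter-node traffic
arcadedb.ha.raft.port=2434
# Persist the Raft log so a crashed node can replay it on restart
arcadedb.ha.raft.persistStorage=true
# Compact the log into a snapshot after this many entries
arcadedb.ha.raft.snapshotThreshold=10000
# Warn when replication lag exceeds 5 seconds
arcadedb.ha.replicationLagWarning=5000
# Static peer list: host:raftPort:httpPort
arcadedb.ha.serverList=node1:2434:2480,node2:2434:2480,node3:2434:2480
# Require a majority of peers to acknowledge each write
arcadedb.ha.quorum=majority
```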
How It Works
- Startup: Each server starts a Ratis RaftServer with the configured peer list. Ratis handles leader election automatically.
- Writes: RaftReplicatedDatabase intercepts commit() and command() calls, serializes the WAL delta or schema operation, and submits it to the Raft log via RaftClient.
- Replication: Ratis replicates the log entry to a majority of peers. Once committed, ArcadeStateMachine.applyTransaction() replays the entry on each node.
- Leader forwarding: Follower nodes forward write commands to the leader via authenticated HTTP, transparent to the client.
- Reads: Served locally from any node (eventually consistent).
- Failure recovery: Crashed nodes replay the Raft log on restart. Nodes that fall too far behind receive a snapshot via SnapshotManager.
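The replication and failure-recovery steps above hinge on one property of the apply side: committed entries arrive in log-index order, and replay after a crash must be idempotent. A plain-Java sketch of that discipline (an illustrative model, not the actual ArcadeStateMachine API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the apply side of a Raft state machine: entries are
// applied in log-index order, and a lastAppliedIndex guard skips entries that
// are re-delivered during crash-recovery replay.
public final class StateMachineSketch {
    private long lastAppliedIndex = -1;
    private final List<String> applied = new ArrayList<>();

    // In the real module this would replay a WAL delta or a schema operation.
    public void applyTransaction(long logIndex, String entry) {
        if (logIndex <= lastAppliedIndex)
            return;                  // already applied; duplicate from replay
        applied.add(entry);
        lastAppliedIndex = logIndex;
    }

    public List<String> applied() { return applied; }
    public long lastAppliedIndex() { return lastAppliedIndex; }
}
```

Because every node applies the same committed entries in the same order, all replicas converge on the same database state.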
Quorum & Split-Brain Prevention
With quorum=majority, a 3-node cluster tolerates 1 failure. If the leader loses contact with the majority, it automatically steps down (Raft protocol guarantee). The majority partition elects a new leader. No split-brain is possible — Raft's term-based voting prevents two leaders in the same term.
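The arithmetic behind these fault-tolerance claims is simple and worth making explicit (names here are illustrative):

```java
// Majority-quorum arithmetic: a cluster of n nodes needs floor(n/2) + 1
// votes to commit, so it tolerates n - (floor(n/2) + 1) = floor((n-1)/2)
// failures. Hence 3 nodes tolerate 1 failure and 5 nodes tolerate 2.
public final class QuorumMath {
    public static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    public static int toleratedFailures(int clusterSize) {
        return clusterSize - majority(clusterSize);
    }
}
```

Two disjoint majorities cannot exist in the same cluster, which is why a minority partition can never elect a competing leader.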
Implementation Status
Development is on the ha-redesign branch. Current state:
Core Implementation (complete)
- Raft log entry serialization for TX and schema operations
- State machine with WAL replay and schema application
- Cluster monitor with lag tracking
- Peer list parsing (supports host:raftPort:httpPort:priority format)
- Snapshot manager with checksum-based incremental sync
- Leader command forwarding via authenticated HTTP
- Plugin discovery via ServiceLoader
- Cluster status API (/api/v1/cluster)
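As a rough sketch of what the peer-list parsing above involves, here is a minimal plain-Java version handling the host:raftPort:httpPort[:priority] format, with priority treated as optional. Names and defaults are hypothetical, not the actual RaftHAServer API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for a serverList value such as
// "node1:2434:2480,node2:2434:2480:10". The priority field is optional;
// the default of 0 here is an assumption for the sketch.
public final class PeerListSketch {
    public record Peer(String host, int raftPort, int httpPort, int priority) {}

    public static List<Peer> parse(String serverList) {
        List<Peer> peers = new ArrayList<>();
        for (String entry : serverList.split(",")) {
            String[] parts = entry.trim().split(":");
            int priority = parts.length > 3 ? Integer.parseInt(parts[3]) : 0;
            peers.add(new Peer(parts[0],
                               Integer.parseInt(parts[1]),
                               Integer.parseInt(parts[2]),
                               priority));
        }
        return peers;
    }
}
```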
Test Coverage
Unit tests (25 in ha-raft):
- RaftHAServerTest — peer list parsing, display names, cluster token, separator handling
- ArcadeStateMachineTest — state machine entry application
- RaftLogEntryCodecTest — serialization round-trips
- ClusterMonitorTest — lag tracking
- SnapshotManagerTest — snapshot lifecycle
- RaftHAPluginTest — config validation
- ConfigValidationTest — configuration edge cases
Integration tests (14 in ha-raft):
- 2-node and 3-node replication
- Schema replication across cluster
- Leader failover and re-election
- Replica crash and recovery (Raft log replay)
- Leader crash and recovery with DatabaseComparator verification
- Quorum loss detection
- Split-brain prevention (3-node and 5-node clusters)
- Full snapshot resync for lagging nodes
End-to-end container tests (9 in e2e-ha):
- SimpleHaScenarioIT — 2-node basic replication
- ThreeInstancesScenarioIT — 3-node cluster with data verification
- LeaderFailoverIT — leader crash, re-election, data consistency
- NetworkPartitionIT — leader partition, follower partition, no-quorum scenarios
- NetworkPartitionRecoveryIT — partition healing with Raft log catch-up
- SplitBrainIT — split-brain prevention verification
- RollingRestartIT — zero-downtime rolling restarts with data persistence
- NetworkDelayIT — cluster behavior under latency (via Toxiproxy)
- PacketLossIT — cluster behavior under packet loss (via Toxiproxy)
Remaining Work