#2043 High Availability (HA) Subsystem Improvement #2062
#2043 High Availability (HA) Subsystem Improvement #2062
Conversation
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferences |
c12ede7 to
697f2fc
Compare
697f2fc to
2832b8f
Compare
2832b8f to
db7bd83
Compare
db7bd83 to
5f527f2
Compare
af3014c to
73b2324
Compare
e287b3b to
7dad934
Compare
369a5b8 to
9777a56
Compare
9777a56 to
f82c3af
Compare
f82c3af to
aea3728
Compare
- Replace CodeUtils.sleep with Awaitility condition wait - Check for stable cluster state including empty replication queues - Improves test reliability by waiting for actual conditions Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Replace CodeUtils.sleep with Awaitility pollDelay for retry backoff - Maintains 500ms intentional delay before transaction retry - Improves consistency with Awaitility-based timeout handling Note: Test has pre-existing flakiness unrelated to this change Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Replace CodeUtils.sleep with TimeUnit.sleep in hot loop pacing delay - Add proper interrupt handling and throw RuntimeException on interruption - Document why Awaitility is not used (performance overhead in hot loop) - Remove unused CodeUtils import Note: Test has pre-existing flakiness in cleanup phase unrelated to this change Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…onsToForwardIT - Replace CodeUtils.sleep with Awaitility pollDelay for retry backoff - Replace arbitrary 1s sleep with condition-based wait for replication queues - Wait for each server's replication queue to drain before checking entries - Improves test reliability by waiting for actual conditions Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…InstallIT - Replace Thread.sleep with Awaitility pollDelay for intentional latency injection - Remove try/catch block as Awaitility handles interruption internally - Document that 10s delay is intentional to simulate slow replica - Improves consistency with Awaitility-based delay handling Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Replace Thread.sleep with Awaitility pollDelay for intentional delays - First delay (1s) to slow replica response and trigger hot resync - Second delay (1s) to allow current message processing before channel close - Remove try/catch blocks as Awaitility handles interruption internally - Improves consistency with Awaitility-based delay handling Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Replace Thread.sleep with Awaitility pollDelay in main method - Add console message indicating cluster is running - Keep 1000s delay for manual testing/observation - Improves consistency with Awaitility-based delay handling Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add ReplicationTransientException for retryable failures - Add ReplicationPermanentException for unrecoverable errors - Add LeadershipChangeException for leader transitions - Each exception documents recovery strategy Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Classify IOExceptions and throw typed exceptions (ReplicationTransientException, LeadershipChangeException, ReplicationPermanentException) - Update handleIOException and handleGenericException to catch and handle specific types - Add transitionTo() helper for state transitions with validation and logging - Metrics track exception categories - Add Leader2ReplicaExceptionHandlingTest to verify classification logic Behind HA_ENHANCED_RECONNECTION feature flag. Legacy code path unchanged. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add real health status calculation (HEALTHY/DEGRADED/UNHEALTHY) - Expose replica connection metrics via health endpoint - Add HAServer.getReplicaConnections() accessor for health monitoring - Health considers quorum status, replica failures, queue sizes - Add backward compatibility fields (role, onlineServers, quorumAvailable) - Add comprehensive test for replica metrics validation Health status logic: - HEALTHY: All replicas online, no failures, quorum present - DEGRADED: Some replicas offline or excessive failures (>5), or no leader connection - UNHEALTHY: Failed replicas or quorum lost Replica metrics exposed: - status: current replica connection status - queueSize: messages pending replication - consecutiveFailures: count of consecutive connection failures - totalReconnections: lifetime reconnection count - transientFailures: network-related failures - leadershipChanges: leadership transition count Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Implement ReplicaCircuitBreaker with CLOSED/OPEN/HALF_OPEN states - Add configuration for thresholds and timeouts - Integrate into Leader2ReplicaNetworkExecutor.sendMessage() - Expose circuit breaker state in health API - Disabled by default (HA_CIRCUIT_BREAKER_ENABLED=false) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Implement ConsistencyMonitor background thread - Sample records and detect drift across replicas - Support auto-alignment when drift exceeds threshold - Disabled by default (HA_CONSISTENCY_CHECK_ENABLED=false) - TODO: Add replica checksum request protocol Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed three critical issues in ConsistencyMonitor: 1. Resource leak - ResultSet not closed on exception (now using try-with-resources) 2. Platform-dependent checksum - json.getBytes() now uses UTF-8 explicitly 3. SQL injection vulnerability - type name validated before query construction All fixes preserve existing behavior while addressing security and reliability concerns. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Critical fix: waitForClusterStable expects server count, not timeout. Test was incorrectly passing 60_000 (intended as timeout) but the method signature expects the number of servers in the cluster. Additional improvements: - Add test data creation (200 vertices) for monitor to sample - Replace Thread.sleep with Awaitility for monitor execution wait - Add comments explaining verification approach Test now passes consistently.
ISSUE: Test has fundamental design flaw causing deadlock The test attempts to restart the leader from within the replication callback thread, which causes a deadlock: 1. Callback fires on replica's replication thread 2. Callback stops/restarts leader synchronously 3. Callback calls waitForClusterStable() to wait for cluster reformation 4. DEADLOCK: The callback thread IS PART OF the cluster infrastructure needed for stabilization, but it's blocked waiting for stabilization Evidence from test run: - 3 restart attempts logged (18:06, 18:08, 18:10) - Each attempt hangs for ~2+ minutes - Test runs for 874 seconds (14+ minutes), hits timeout - Final result: 0 restarts completed despite 3 attempts ROOT CAUSE: Cannot modify cluster infrastructure from within that infrastructure. The replication callback thread is part of the cluster's messaging system, so blocking it prevents cluster stabilization. SOLUTIONS (for future): 1. Move restart logic to separate control thread (not callback thread) 2. Use external chaos tool (Toxiproxy) to trigger failures 3. Simplify to test 1 restart from main test thread 4. Test that cluster survives natural leader changes (don't control timing) Also increased getTxs() from 5,000 to 15,000 and added restart synchronization lock, but these changes didn't address the fundamental architectural issue. Related: ConsistencyMonitorIT fix (dd34d7e)
…er config
ISSUE: RemoteDatabase cannot failover when leader goes down
The test has a design flaw in how it configures the RemoteDatabase client:
1. RemoteDatabase created with ONLY server 0 address (line 80):
new RemoteDatabase(server0Host, server0Port, ...)
2. Test callback stops server 0 after 10 messages (line 151)
3. RemoteDatabase has no knowledge of servers 1 and 2
4. Client cannot failover to new leader, exhausts 2-minute retry timeout
5. Test fails with RemoteException: "no server available"
EVIDENCE:
- Awaitility retry loop at line 95-110 times out after 2 minutes
- RemoteException thrown at line 100
- No automatic failover occurs
ROOT CAUSE: RemoteDatabase needs full cluster topology (all server
addresses) to perform automatic failover. When configured with only
one server, it has no fallback when that server goes down.
SOLUTIONS (for future):
1. Configure RemoteDatabase with all server addresses:
new RemoteDatabase(new String[]{server0, server1, server2}, ...)
2. Manually reconnect to server 1 or 2 after detecting server 0 down
3. Use service discovery or load balancer in front of cluster
4. Test should verify client failover behavior, not just cluster behavior
This is different from ReplicationServerLeaderChanges3TimesIT which has
a deadlock issue. This test has a missing configuration issue.
Related: ReplicationServerLeaderChanges3TimesIT (e878bd7),
ConsistencyMonitorIT (dd34d7e)
…plicationServerFixedClientConnectionIT test method
…s, cleanup Thread safety fixes: - HAServer: capture cluster reference before compute() lambda to prevent race condition during replica registration - Leader2ReplicaNetworkExecutor: move status checks inside lock to fix TOCTOU vulnerability in setStatus() - LeaderFence: add synchronized to fence/unfence/reset methods to prevent non-atomic check-then-set race conditions Disable incomplete features: - HA_CONSISTENCY_CHECK_ENABLED: default to false, mark as EXPERIMENTAL (only collects leader checksums, replica collection not implemented) - HA_ENHANCED_RECONNECTION: default to false, mark as EXPERIMENTAL (recovery strategies still being refined) Code cleanup: - HAServer: remove commented-out code (unused HostUtil calls, dead getServerAddressList method) - ThreeInstancesScenarioIT: add TODO comments explaining why network partition test is disabled Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove ElectionContext.java: defined but never used anywhere in codebase - LeaderFence.acceptEpoch(): change while(true) to bounded retry loop with max 100 attempts to prevent infinite spinning under high contention Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update test assertions to match the new default values: - HA_ENHANCED_RECONNECTION: true -> false - HA_CONSISTENCY_CHECK_ENABLED: true -> false Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The custom page-parsing implementation of countEntries() introduced in commit fb0201a had bugs with LSM merge-on-read semantics that caused test failures (incorrect entry counts after deletions and compaction). Reverted to using vectorIndex.getActiveCount() which properly handles LSM semantics through the in-memory VectorLocationIndex. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fix DatabaseIsClosedException in HA tests caused by two issues: 1. Race condition: db.isOpen() could return true, but by the time db.begin() executed, the database was closed during failover. Added try-catch to gracefully handle this expected scenario. 2. Inherited test not disabled: ReplicationServerFixedClientConnectionIT had its own testReplication() disabled, but inherited replication() from parent was still running. Override replication() with @disabled. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
0e91510 to
48f1cc3
Compare
Pull Request Review: #2062 - High Availability (HA) Subsystem ImprovementsExecutive SummaryThis PR makes significant improvements to ArcadeDB's High Availability (HA) subsystem, addressing critical bugs in cluster discovery, server removal, and HTTP address propagation. The changes span 201 files with 33,374 additions and include new infrastructure for split-brain prevention, circuit breakers, and consistency monitoring. Overall code quality is good, but there are several critical thread safety concerns and incomplete implementations that should be addressed before merging. 1. Critical Issues 🔴1.1 Thread Safety Violation in
|
| Category | Rating | Notes |
|---|---|---|
| Code Quality | 8/10 | Excellent structure, minor cleanup needed |
| Thread Safety | 6/10 | Good patterns but 2-3 critical issues |
| Performance | 8/10 | Good optimizations, some aggressive policies |
| Security | 7/10 | Leader fencing is great, minor hardening needed |
| Test Coverage | 7/10 | Good breadth, some critical paths untested |
| Documentation | 6/10 | Excellent inline docs, missing config docs |
Overall Recommendation:
The PR makes substantial improvements to HA reliability and includes well-designed split-brain prevention. However, 3 critical bugs must be fixed before merge:
receivedResponse()parameter type mismatch- Cluster configuration race condition
- Non-functional ConsistencyMonitor
Once these are addressed, this will be a strong improvement to ArcadeDB's HA subsystem.
11. Positive Highlights ⭐
- Excellent architecture for leader fencing using epochs
- Well-designed circuit breaker pattern
- Comprehensive error categorization for better recovery strategies
- Good separation of concerns with new helper classes
- Extensive test coverage (46 tests)
- WAL semantics properly maintained
- Clear upgrade path from old protocol
Great work overall! The foundation is solid - just needs the critical fixes before merging.
This pull request addresses several critical issues and improvements in the High Availability (HA) subsystem, focusing on bug fixes, protocol consistency, and test reliability. The main changes include fixing alias resolution for cluster discovery, correcting type mismatches in server removal, re-enabling HTTP address propagation for client redirection, and fixing test assertions in resilience scenarios. Additionally, the CI workflow is updated to properly handle resilience tests.
HA Subsystem Bug Fixes and Improvements:
Alias Resolution
resolveAlias()method inHAServer.javato ensure that server addresses containing aliases (e.g.,{arcade2}proxy:8667) are correctly resolved to actual host:port values during leader connection, fixing cluster formation issues in Docker/K8s environments.Server Removal Consistency
removeServer()method inHAServer.javato acceptServerInfoobjects instead ofString, aligning with the migration toServerInfo-based replica connection maps. A backward-compatible method acceptingStringwas also added.findByHostAndPort()toHAClusterfor easierServerInfolookups and fixed enum naming conventions for consistency.HTTP Address Propagation
GetServerHandlerto return the correct HTTP addresses for client redirection and ensured cleanup of HTTP address mappings when servers are removed.Test Reliability and Workflow:
Test Assertion Correction
ThreeInstancesScenarioITtest by removing assertions on a disconnected node (arcade1) and clarifying comments to reflect the actual test scenario, ensuring the test only checks nodes that are connected and can replicate data.CI Workflow Update
resiliencemodule from the main integration test job and introduced a dedicatedjava-resilience-testsjob to run resilience-specific tests separately. [1] [2]These changes collectively improve the robustness, correctness, and maintainability of the HA subsystem and associated tests.
Related issues
#2043
Checklist
mvn clean packagecommand