[2.19] Fix segment replication failure during rolling restart #20423
cuonghm2809 wants to merge 4 commits into opensearch-project:2.19
Conversation
During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after the restart, the primary may have rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - the replica accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from a restart

Added unit tests to verify:
- A stale checkpoint is rejected during normal replication
- A stale checkpoint is accepted during shard recovery

Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of AbstractSegmentReplicationTarget.java (which was introduced in later versions).

Fixes opensearch-project#19234

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
❌ Gradle check result for 85cb2e0: FAILURE
❌ Gradle check result for 4f5ed51: FAILURE
❌ Gradle check result for bd60de1: null
Description
This PR backports the fix for issue #19234 to the 2.19 branch.
Problem:
During rolling restart with segment replication enabled, replica shards fail to re-assign with the error "Rejecting stale metadata checkpoint" after 5 retries. This occurs because the replica received newer checkpoints from the primary before the restart, while the restarted primary comes back in an older state, so its current checkpoint appears stale to the replica.
This is a legitimate scenario during recovery, not an actual stale checkpoint issue.
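The "appears older" condition comes down to how replication checkpoints are ordered. Below is a minimal, hypothetical sketch assuming ordering by primary term and segment-infos version; the field and method names are illustrative, not the real ReplicationCheckpoint API:

```java
// Hypothetical, simplified model of checkpoint ordering: a checkpoint is
// "ahead" of another when its primary term, or (at equal terms) its
// segment-infos version, is higher.
record Checkpoint(long primaryTerm, long segmentInfosVersion) {
    boolean isAheadOf(Checkpoint other) {
        return primaryTerm > other.primaryTerm
                || (primaryTerm == other.primaryTerm
                        && segmentInfosVersion > other.segmentInfosVersion);
    }
}

public class RollingRestartExample {
    public static void main(String[] args) {
        // The replica saw this checkpoint from the primary before the restart.
        Checkpoint replicaLastSeen = new Checkpoint(1, 10);
        // After the restart the primary may come back in an older state, so its
        // current checkpoint can report a lower segment-infos version.
        Checkpoint primaryAfterRestart = new Checkpoint(1, 7);

        // Strict validation flags this as "stale", even though it is the
        // legitimate state the replica must now recover from.
        System.out.println(replicaLastSeen.isAheadOf(primaryAfterRestart)); // true
    }
}
```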
Root cause:
PR #18944 added strict checkpoint validation to prevent issue #18490, but it doesn't distinguish between normal replication, where a stale checkpoint genuinely indicates a race during primary relocation, and recovery, where an apparently older checkpoint from the restarted primary is expected.
Solution:
Only enforce strict checkpoint validation during normal replication. During recovery (when shard is INITIALIZING or RELOCATING), accept checkpoints that may appear older, as the replica is catching up to the primary's current state.
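The decision described above can be sketched as follows. This is a simplified, self-contained illustration, not the actual patch: the enum values mirror OpenSearch's shard state names, but the class and method here are hypothetical (the real change lives in SegmentReplicationTarget.java):

```java
import java.util.EnumSet;

// Illustrative sketch of the validation branch: strict checkpoint validation
// applies only while the shard is in normal replication; during recovery an
// apparently older checkpoint from the primary is accepted.
public class CheckpointValidationSketch {
    enum ShardState { INITIALIZING, RELOCATING, STARTED }

    static boolean shouldAcceptCheckpoint(boolean incomingIsOlder, ShardState state) {
        // Recovery: the replica is catching up to the primary's current state,
        // so a checkpoint that looks older than its last known one is expected.
        if (EnumSet.of(ShardState.INITIALIZING, ShardState.RELOCATING).contains(state)) {
            return true;
        }
        // Normal replication: keep the strict validation from
        // opensearch-project#18944 and reject stale checkpoints.
        return !incomingIsOlder;
    }

    public static void main(String[] args) {
        System.out.println(shouldAcceptCheckpoint(true, ShardState.STARTED));      // false
        System.out.println(shouldAcceptCheckpoint(true, ShardState.INITIALIZING)); // true
    }
}
```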
Related Issues
Resolves #19234
Related to #18490
Check List
Test Plan
Added two unit tests in SegmentReplicationTargetTests.java:
- testStaleCheckpointRejected_duringNormalReplication(): verifies that stale checkpoints are correctly rejected when the shard is active (not recovering)
- testStaleCheckpointAccepted_duringRecovery(): verifies that checkpoints are accepted during recovery even if they appear older than the replica's persisted checkpoint
Both tests passed.
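As a rough illustration of what those two tests assert, here is a self-contained sketch using plain assertions; the accept helper is purely hypothetical and stands in for the checkpoint check inside SegmentReplicationTarget:

```java
// Hypothetical stand-in for the checkpoint acceptance check: a recovering
// shard accepts any checkpoint, an active shard rejects stale ones.
public class SegmentReplicationScenarioSketch {
    static boolean accept(boolean checkpointLooksStale, boolean shardIsRecovering) {
        return shardIsRecovering || !checkpointLooksStale;
    }

    public static void main(String[] args) {
        // Scenario 1: active shard, stale checkpoint -> rejected.
        if (accept(true, false)) throw new AssertionError("stale checkpoint must be rejected");
        // Scenario 2: recovering shard, apparently stale checkpoint -> accepted.
        if (!accept(true, true)) throw new AssertionError("recovery must accept older checkpoint");
        System.out.println("both scenarios behave as described");
    }
}
```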
Additional Context
This fix is critical for production deployments that use segment replication during rolling restarts; without it, shards fail to recover and require manual intervention.