Fix segment replication failure during rolling restart #20422
andrross merged 5 commits into opensearch-project:main from
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI. You can disable this status message in the settings.

📝 Walkthrough
Replaces a strict stale-checkpoint rejection with a recovery-aware condition in startReplication: if the shard is initializing or relocating (i.e., recovering), stale primary checkpoints are accepted; otherwise stale metadata checkpoints continue to be rejected. Two tests added covering both scenarios.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
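The recovery-aware condition summarized in the walkthrough can be reduced to a few booleans. The sketch below is a hypothetical simplification, assuming the guard boils down to "reject only when the checkpoint is stale and the shard is not recovering"; the names are illustrative, not the actual OpenSearch source, where the decision involves IndexShard routing state and checkpoint metadata.

```java
// Hypothetical reduction of the recovery-aware guard described above.
// Names here are illustrative, not the actual OpenSearch source.
public class CheckpointGuardSketch {
    /** True when an incoming primary checkpoint should be rejected as stale. */
    static boolean shouldReject(boolean checkpointIsStale, boolean initializing, boolean relocating) {
        // A shard that is initializing or relocating is being recovered, so the
        // primary's current state is accepted even if it looks older.
        boolean recovering = initializing || relocating;
        // Outside recovery, keep the strict rejection that guards the
        // primary-relocation race (#18944).
        return checkpointIsStale && !recovering;
    }

    public static void main(String[] args) {
        System.out.println(shouldReject(true, false, false)); // true: normal replication rejects stale
        System.out.println(shouldReject(true, true, false));  // false: recovery accepts stale
    }
}
```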
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java`:
- Around line 593-596: The test incorrectly attempts to stub methods on the real
ShardRouting returned by spyIndexShard.routingEntry(); instead create a mock
ShardRouting, stub initializing() and relocating() on that mock, and have
spyIndexShard.routingEntry() return the mocked ShardRouting (e.g., mock
ShardRouting, when(mockRouting.initializing()).thenReturn(false),
when(mockRouting.relocating()).thenReturn(false), and
when(spyIndexShard.routingEntry()).thenReturn(mockRouting)); also add the import
for org.opensearch.cluster.routing.ShardRouting.
- Around line 660-662: The current test incorrectly chains mocks on
spyIndexShard.routingEntry(); instead create a dedicated mocked ShardRouting,
return it from spyIndexShard.routingEntry(), and then stub its state methods:
e.g., mock a ShardRouting instance,
when(spyIndexShard.routingEntry()).thenReturn(mockShardRouting), then
when(mockShardRouting.initializing()).thenReturn(true) and
when(mockShardRouting.relocating()).thenReturn(false) so the
initializing/relocating behavior is properly isolated on the ShardRouting mock.
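The fix the comments above ask for can be illustrated without Mockito: give the shard a dedicated routing double with fixed state and have routingEntry() return it, instead of chaining stubs through a real routing object. Hand-rolled fakes replace Mockito mocks here so the sketch compiles on its own; the interfaces are simplified stand-ins, not the real OpenSearch ShardRouting/IndexShard types.

```java
// Self-contained illustration of the review fix: stub routing state on a
// dedicated ShardRouting double and return it from routingEntry().
public class RoutingStubSketch {
    interface ShardRouting {
        boolean initializing();
        boolean relocating();
    }

    interface IndexShard {
        ShardRouting routingEntry();
    }

    // Analogous to: when(mockRouting.initializing()).thenReturn(init); etc.
    static ShardRouting fakeRouting(boolean init, boolean reloc) {
        return new ShardRouting() {
            public boolean initializing() { return init; }
            public boolean relocating() { return reloc; }
        };
    }

    // Analogous to: when(spyIndexShard.routingEntry()).thenReturn(mockRouting);
    static IndexShard fakeShard(ShardRouting routing) {
        return () -> routing;
    }

    // The condition under test: the shard counts as recovering if it is
    // initializing or relocating.
    static boolean isRecovering(IndexShard shard) {
        ShardRouting r = shard.routingEntry();
        return r.initializing() || r.relocating();
    }

    public static void main(String[] args) {
        IndexShard active = fakeShard(fakeRouting(false, false));
        IndexShard recovering = fakeShard(fakeRouting(true, false));
        System.out.println(isRecovering(active));     // false: stale checkpoint rejected
        System.out.println(isRecovering(recovering)); // true: stale checkpoint accepted
    }
}
```

The key point is the same as with Mockito: the routing state lives on its own test double, so each test case controls initializing/relocating independently of the shard spy.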
🧹 Nitpick comments (1)
server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java (1)
546-611: Consider adding test coverage for the `relocating` state. The implementation treats both `initializing()` and `relocating()` as recovery states. The current tests only verify the `initializing` scenario. Consider adding a test case where `initializing()` returns `false` and `relocating()` returns `true` to ensure both branches of the OR condition are covered.
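The missing branch the nitpick points at can be enumerated directly. This is a tiny illustrative sketch (names are not the real test code): the relocating-only case must also count as "recovering".

```java
// Enumerates all four states of the (initializing || relocating) condition;
// the relocating-only row is the branch the nitpick says is untested.
public class RecoveringBranches {
    static boolean recovering(boolean initializing, boolean relocating) {
        return initializing || relocating;
    }

    public static void main(String[] args) {
        boolean[][] cases = {
            { false, false }, // active shard     -> not recovering
            { true,  false }, // initializing     -> recovering (already tested)
            { false, true  }, // relocating only  -> recovering (missing test case)
            { true,  true  }, // both             -> recovering
        };
        for (boolean[] c : cases) {
            System.out.println(c[0] + "," + c[1] + " -> " + recovering(c[0], c[1]));
        }
    }
}
```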
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.
Applied to files:
server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
🔇 Additional comments (1)
server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java (1)
166-185: LGTM! Recovering-state guard correctly addresses rolling restart scenario. The logic correctly differentiates between normal replication (reject stale checkpoints) and recovery (accept stale checkpoints). The comments clearly explain the rationale and link to the relevant issues. The `indexShard.routingEntry()` calls are safe within this context: the callers (`afterIndexShardStarted` and `clusterChanged`) validate the routing entry before initiating replication, ensuring it is non-null at this point in the lifecycle.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
❌ Gradle check result for 45fd6b1: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 6ebfe41: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Force-pushed from 2653d2e to af48e43
❌ Gradle check result for af48e43: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication: strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING): accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Fixes opensearch-project#19234
The chained mock syntax when(spyIndexShard.routingEntry().initializing()) doesn't work as intended because routingEntry() returns a real ShardRouting object, not a mock.

Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false): should reject stale checkpoint
2. Recovering shard (initializing=true): should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Force-pushed from af48e43 to 26b0eb2
❕ Gradle check result for 26b0eb2: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #20422 +/- ##
============================================
- Coverage 73.33% 73.24% -0.10%
+ Complexity 72004 71921 -83
============================================
Files 5793 5793
Lines 329110 329111 +1
Branches 47402 47403 +1
============================================
- Hits 241364 241058 -306
- Misses 68394 68706 +312
+ Partials 19352 19347 -5

☔ View full report in Codecov by Sentry.
Signed-off-by: Andrew Ross <andrross@amazon.com>
❌ Gradle check result for 4c7f021: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Thanks for the fix here @cuonghm2809, this fix looks correct; the force_sync performed during recovery should disregard checkpoint metadata to catch up to wherever the elected primary is at that point. I suspect the reason the old primary is ahead is the flush performed on shard closure. If the new primary has not indexed/refreshed anything new in that timeframe, the old primary (now replica) would always come back ahead.
❌ Gradle check result for 4c7f021: Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
You are right, @mch2. We are blocked from upgrading our OpenSearch cluster from version 2.17 to 2.19 because of this issue, and we need the fix to be released before we proceed with the upgrade. |
…oject#20422)

* Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication: strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING): accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Fixes opensearch-project#19234

* Fix incorrect mock chaining in unit tests

The chained mock syntax when(spyIndexShard.routingEntry().initializing()) doesn't work as intended because routingEntry() returns a real ShardRouting object, not a mock. Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false): should reject stale checkpoint
2. Recovering shard (initializing=true): should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix ReplicationCheckpoint constructor in tests

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix code formatting

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Add CHANGELOG entry

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
Description
During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.
This fix distinguishes between two scenarios:
1. Normal replication: strict checkpoint validation still applies, preventing acceptance of stale data during primary relocation (maintains the #18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING): the primary's current state is accepted even if it appears older than the replica's last known checkpoint

Changes
Updated AbstractSegmentReplicationTarget.java to check shard state before rejecting stale checkpoints

Testing
Added two unit tests:
testStaleCheckpointRejected_duringNormalReplication() - Verifies stale checkpoint is rejected during normal replication
testStaleCheckpointAccepted_duringRecovery() - Verifies stale checkpoint is accepted during shard recovery

Related Issues
Fixes #19234
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.