
Fix segment replication failure during rolling restart#20422

Merged
andrross merged 5 commits into opensearch-project:main from cuonghm2809:fix-shard-reallocation-19234
Jan 23, 2026

Conversation

@cuonghm2809
Contributor

@cuonghm2809 cuonghm2809 commented Jan 14, 2026

Description

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:

  1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the #18944 fix, "Fix segment replication bug during primary relocation")
  2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart
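The two scenarios above boil down to one boolean guard. A minimal sketch of that decision logic (class and method names here are illustrative stand-ins, not OpenSearch's actual identifiers):

```java
// Illustrative sketch only: models the recovery-aware guard described in
// this PR. Names are hypothetical, not OpenSearch's real API.
public class RecoveryAwareCheckpointGuard {

    // Returns true when a checkpoint that appears older than the replica's
    // last-known checkpoint should be rejected. During recovery
    // (INITIALIZING or RELOCATING) the "stale" checkpoint is accepted,
    // because a restarted primary may legitimately be behind the replica.
    public static boolean shouldRejectStaleCheckpoint(
            boolean checkpointIsStale, boolean initializing, boolean relocating) {
        boolean isRecovering = initializing || relocating;
        return checkpointIsStale && !isRecovering;
    }

    public static void main(String[] args) {
        // Normal replication: a stale checkpoint is rejected (preserves #18944).
        System.out.println(shouldRejectStaleCheckpoint(true, false, false)); // true
        // Rolling-restart recovery: the stale checkpoint is accepted.
        System.out.println(shouldRejectStaleCheckpoint(true, true, false));  // false
    }
}
```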

Changes

  • Modified AbstractSegmentReplicationTarget.java to check shard state before rejecting stale checkpoints
  • Added unit tests to verify both scenarios work correctly

Testing

Added two unit tests:

  • testStaleCheckpointRejected_duringNormalReplication() - Verifies stale checkpoint is rejected during normal replication
  • testStaleCheckpointAccepted_duringRecovery() - Verifies stale checkpoint is accepted during shard recovery

Related Issues

Fixes #19234

Checklist

  • Functionality includes testing
  • API changes companion pull request created, if applicable
  • Public documentation issue/PR created, if applicable

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cuonghm2809 cuonghm2809 requested a review from a team as a code owner January 14, 2026 13:46
@github-actions github-actions bot added labels: bug (Something isn't working), Cluster Manager, Indexing:Replication (Issues and PRs related to the core replication framework, e.g. segrep) Jan 14, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 14, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Replaces a strict stale-checkpoint rejection with a recovery-aware condition in startReplication: if the shard is initializing or relocating (i.e., recovering), stale primary checkpoints are accepted; otherwise stale metadata checkpoints continue to be rejected. Two tests were added covering both scenarios.

Changes

  • Checkpoint validation: server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
    Adds isRecovering (based on indexShard.routingEntry().initializing() or relocating()) in startReplication() to bypass the strict "checkpoint ahead" rejection when the shard is recovering.
  • Tests: server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java
    Adds testStaleCheckpointRejected_duringNormalReplication() and testStaleCheckpointAccepted_duringRecovery(); simulates ShardRouting states (normal vs initializing/relocating) to assert rejection during normal replication and acceptance during recovery.
  • Build/manifest: pom.xml, manifest_file
    Minor manifest/pom adjustments (small line changes).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 23.08%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check: The title 'Fix segment replication failure during rolling restart' clearly describes the main change, addressing a replication issue during rolling restarts, which matches the core objective of the PR.
  • Description check: The description includes all required template sections: a clear description of the problem, changes made, testing details, and related issue (#19234), with the contributor confirmation.
  • Linked Issues check: The PR fulfills all coding requirements from issue #19234: distinguishes between normal replication and recovery scenarios, preserves strict checkpoint validation during normal operation, permits checkpoint differences during recovery, and includes comprehensive unit tests demonstrating both paths.
  • Out of Scope Changes check: All changes are scoped to the stated objectives: modifications to AbstractSegmentReplicationTarget.java to implement recovery-aware checkpoint validation, and two unit tests validating both the normal replication and recovery scenarios.




Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java`:
- Around line 593-596: The test incorrectly attempts to stub methods on the real
ShardRouting returned by spyIndexShard.routingEntry(); instead create a mock
ShardRouting, stub initializing() and relocating() on that mock, and have
spyIndexShard.routingEntry() return the mocked ShardRouting (e.g., mock
ShardRouting, when(mockRouting.initializing()).thenReturn(false),
when(mockRouting.relocating()).thenReturn(false), and
when(spyIndexShard.routingEntry()).thenReturn(mockRouting)); also add the import
for org.opensearch.cluster.routing.ShardRouting.
- Around line 660-662: The current test incorrectly chains mocks on
spyIndexShard.routingEntry(); instead create a dedicated mocked ShardRouting,
return it from spyIndexShard.routingEntry(), and then stub its state methods:
e.g., mock a ShardRouting instance,
when(spyIndexShard.routingEntry()).thenReturn(mockShardRouting), then
when(mockShardRouting.initializing()).thenReturn(true) and
when(mockShardRouting.relocating()).thenReturn(false) so the
initializing/relocating behavior is properly isolated on the ShardRouting mock.
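The pitfall the review flags can be shown without Mockito: `when(spy.routingEntry().initializing())` evaluates `routingEntry()` on the spy first, which returns a real `ShardRouting`, so the stub never attaches. The fix is to make `routingEntry()` return an object the test controls. A dependency-free sketch of that shape (the types below are minimal stand-ins, not the real `IndexShard`/`ShardRouting`):

```java
// Dependency-free sketch of the review suggestion: have routingEntry()
// return a controllable fake, then drive initializing()/relocating()
// through it. Types are illustrative, not OpenSearch classes.
interface Routing {
    boolean initializing();
    boolean relocating();
}

class FakeShard {
    private final Routing routing;

    FakeShard(Routing routing) {
        this.routing = routing;
    }

    Routing routingEntry() {
        return routing;
    }
}

public class RoutingStubSketch {
    // Mirrors the guard under test: recovering == initializing || relocating.
    static boolean isRecovering(FakeShard shard) {
        Routing r = shard.routingEntry();
        return r.initializing() || r.relocating();
    }

    public static void main(String[] args) {
        // Active shard: neither initializing nor relocating.
        FakeShard active = new FakeShard(new Routing() {
            public boolean initializing() { return false; }
            public boolean relocating() { return false; }
        });
        // Recovering shard: initializing.
        FakeShard recovering = new FakeShard(new Routing() {
            public boolean initializing() { return true; }
            public boolean relocating() { return false; }
        });
        System.out.println(isRecovering(active));     // false
        System.out.println(isRecovering(recovering)); // true
    }
}
```

With Mockito the same shape is `when(spyIndexShard.routingEntry()).thenReturn(mockRouting)` followed by stubbing `mockRouting` directly, exactly as the actionable comments describe.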
🧹 Nitpick comments (1)
server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java (1)

546-611: Consider adding test coverage for the relocating state.

The implementation treats both initializing() and relocating() as recovery states. The current tests only verify the initializing scenario. Consider adding a test case where initializing() returns false and relocating() returns true to ensure both branches of the OR condition are covered.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52b699d and 5e2114f.

📒 Files selected for processing (2)
  • server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
  • server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (1)
server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java (1)

166-185: LGTM! Recovering-state guard correctly addresses rolling restart scenario.

The logic correctly differentiates between normal replication (reject stale checkpoints) and recovery (accept stale checkpoints). The comments clearly explain the rationale and link to the relevant issues. The indexShard.routingEntry() calls are safe within this context—the callers (afterIndexShardStarted and clusterChanged) validate the routing entry before initiating replication, ensuring it is non-null at this point in the lifecycle.


@github-actions
Contributor

❌ Gradle check result for 45fd6b1: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 6ebfe41: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@cuonghm2809 cuonghm2809 force-pushed the fix-shard-reallocation-19234 branch from 2653d2e to af48e43 Compare January 15, 2026 07:29
@github-actions
Contributor

❌ Gradle check result for af48e43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@cuonghm2809 cuonghm2809 reopened this Jan 15, 2026
@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Cluster Manager Project Board Jan 15, 2026
@github-actions
Contributor

❌ Gradle check result for af48e43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Commit: Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234

Commit: Fix incorrect mock chaining in unit tests

The chained mock syntax when(spyIndexShard.routingEntry().initializing())
doesn't work as intended because routingEntry() returns a real ShardRouting
object, not a mock.

Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false) - should reject stale checkpoint
2. Recovering shard (initializing=true) - should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Commit: Fix ReplicationCheckpoint constructor in tests

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Commit: Fix code formatting

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@andrross andrross force-pushed the fix-shard-reallocation-19234 branch from af48e43 to 26b0eb2 Compare January 21, 2026 19:41
@github-actions
Contributor

❕ Gradle check result for 26b0eb2: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.24%. Comparing base (68c2d14) to head (4c7f021).
⚠️ Report is 6 commits behind head on main.

Files with missing lines | Patch % | Lines
.../replication/AbstractSegmentReplicationTarget.java | 50.00% | 0 Missing, 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20422      +/-   ##
============================================
- Coverage     73.33%   73.24%   -0.10%     
+ Complexity    72004    71921      -83     
============================================
  Files          5793     5793              
  Lines        329110   329111       +1     
  Branches      47402    47403       +1     
============================================
- Hits         241364   241058     -306     
- Misses        68394    68706     +312     
+ Partials      19352    19347       -5     

☔ View full report in Codecov by Sentry.

Commit: Add CHANGELOG entry

Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions
Contributor

❌ Gradle check result for 4c7f021: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-project-automation github-project-automation bot moved this from 🏗 In progress to 👀 In review in Cluster Manager Project Board Jan 22, 2026
@mch2
Member

mch2 commented Jan 22, 2026

Thanks for the fix here @cuonghm2809, this fix looks correct: the force_sync performed during recovery should disregard checkpoint metadata to catch up to wherever the elected primary is at that point. I suspect the reason the old primary is ahead is the flush performed on shard closure. If the new primary has not indexed/refreshed anything in that timeframe, the old primary (now a replica) would always come back ahead.
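The ordering problem described in the comment above can be modeled as a toy checkpoint comparison. The fields and isAheadOf logic below are simplified illustrations, not the real ReplicationCheckpoint implementation, and the primary-term bump on promotion is ignored for simplicity:

```java
// Toy model of why an old primary (now replica) can compare "ahead" of
// the elected primary after a rolling restart: the flush on shard closure
// advanced its segment infos version, while the new primary has indexed
// nothing since. Fields and comparison are illustrative only.
public class CheckpointOrderingSketch {

    public static class Checkpoint {
        final long primaryTerm;
        final long segmentInfosVersion;

        public Checkpoint(long primaryTerm, long segmentInfosVersion) {
            this.primaryTerm = primaryTerm;
            this.segmentInfosVersion = segmentInfosVersion;
        }

        public boolean isAheadOf(Checkpoint other) {
            if (primaryTerm != other.primaryTerm) {
                return primaryTerm > other.primaryTerm;
            }
            return segmentInfosVersion > other.segmentInfosVersion;
        }
    }

    public static void main(String[] args) {
        // Old primary's checkpoint after the flush performed on closure.
        Checkpoint oldPrimaryNowReplica = new Checkpoint(1, 12);
        // Elected primary that has not indexed/refreshed since.
        Checkpoint newPrimary = new Checkpoint(1, 10);

        // Strict validation would reject syncing to the "older" primary;
        // the recovery path in this PR accepts it.
        System.out.println(oldPrimaryNowReplica.isAheadOf(newPrimary)); // true
    }
}
```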

@github-actions
Contributor

❌ Gradle check result for 4c7f021:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for 4c7f021: SUCCESS

@andrross andrross merged commit 5906e7b into opensearch-project:main Jan 23, 2026
35 of 40 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Jan 23, 2026
@cuonghm2809
Contributor Author

You are right, @mch2.
I have opened a PR #20498 to backport the fix to version 2.19. Could you please help review the PR and merge it into the 2.19 branch @mch2 @andrross?

We are blocked from upgrading our OpenSearch cluster from version 2.17 to 2.19 because of this issue, and we need the fix to be released before we proceed with the upgrade.

tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026
…oject#20422)

* Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234

* Fix incorrect mock chaining in unit tests

The chained mock syntax when(spyIndexShard.routingEntry().initializing())
doesn't work as intended because routingEntry() returns a real ShardRouting
object, not a mock.

Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false) - should reject stale checkpoint
2. Recovering shard (initializing=true) - should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix ReplicationCheckpoint constructor in tests

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix code formatting

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Add CHANGELOG entry

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>

Labels

bug (Something isn't working), Cluster Manager, Indexing:Replication (Issues and PRs related to the core replication framework, e.g. segrep)

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[BUG] Shard fails to re-assign after a rolling restart

3 participants