
Fix segment replication failure during rolling restart#20422

Merged
andrross merged 5 commits into opensearch-project:main from cuonghm2809:fix-shard-reallocation-19234
Jan 23, 2026

Conversation

@cuonghm2809
Contributor

@cuonghm2809 cuonghm2809 commented Jan 14, 2026

Description

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:

  1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains the #18944 fix, "Fix segment replication bug during primary relocation")
  2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart
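The two scenarios above boil down to one boolean guard. A minimal sketch of that decision logic (class and method names here are illustrative stand-ins, not OpenSearch's actual identifiers):

```java
// Illustrative sketch only: models the recovery-aware guard described in
// this PR. Names are hypothetical, not OpenSearch's real API.
public class RecoveryAwareCheckpointGuard {

    // Returns true when a checkpoint that appears older than the replica's
    // last-known checkpoint should be rejected. During recovery
    // (INITIALIZING or RELOCATING) the "stale" checkpoint is accepted,
    // because a restarted primary may legitimately be behind the replica.
    public static boolean shouldRejectStaleCheckpoint(
            boolean checkpointIsStale, boolean initializing, boolean relocating) {
        boolean isRecovering = initializing || relocating;
        return checkpointIsStale && !isRecovering;
    }

    public static void main(String[] args) {
        // Normal replication: a stale checkpoint is rejected (preserves #18944).
        System.out.println(shouldRejectStaleCheckpoint(true, false, false)); // true
        // Rolling-restart recovery: the stale checkpoint is accepted.
        System.out.println(shouldRejectStaleCheckpoint(true, true, false));  // false
    }
}
```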

Changes

  • Modified AbstractSegmentReplicationTarget.java to check shard state before rejecting stale checkpoints
  • Added unit tests to verify both scenarios work correctly

Testing

Added two unit tests:

  • testStaleCheckpointRejected_duringNormalReplication() - Verifies stale checkpoint is rejected during normal replication
  • testStaleCheckpointAccepted_duringRecovery() - Verifies stale checkpoint is accepted during shard recovery

Related Issues

Fixes #19234

Checklist

  • Functionality includes testing
  • API changes companion pull request created, if applicable
  • Public documentation issue/PR created, if applicable

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cuonghm2809 cuonghm2809 requested a review from a team as a code owner January 14, 2026 13:46
@github-actions github-actions bot added labels: bug (Something isn't working), Cluster Manager, Indexing:Replication (Issues and PRs related to the core replication framework, e.g. segrep) Jan 14, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 14, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Replaces a strict stale-checkpoint rejection with a recovery-aware condition in startReplication: if the shard is initializing or relocating (i.e., recovering), stale primary checkpoints are accepted; otherwise stale metadata checkpoints continue to be rejected. Two tests were added covering both scenarios.

Changes

  • Checkpoint validation: server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
    Adds isRecovering (based on indexShard.routingEntry().initializing() or relocating()) in startReplication() to bypass the strict "checkpoint ahead" rejection when the shard is recovering.
  • Tests: server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java
    Adds testStaleCheckpointRejected_duringNormalReplication() and testStaleCheckpointAccepted_duringRecovery(); simulates ShardRouting states (normal vs initializing/relocating) to assert rejection during normal replication and acceptance during recovery.
  • Build/manifest: pom.xml, manifest_file
    Minor manifest/pom adjustments (small line changes).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 23.08%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check: The title 'Fix segment replication failure during rolling restart' clearly describes the main change, addressing a replication issue during rolling restarts, which matches the core objective of the PR.
  • Description check: The description includes all required template sections: a clear description of the problem, changes made, testing details, and related issue (#19234), with the contributor confirmation.
  • Linked Issues check: The PR fulfills all coding requirements from issue #19234: distinguishes between normal replication and recovery scenarios, preserves strict checkpoint validation during normal operation, permits checkpoint differences during recovery, and includes comprehensive unit tests demonstrating both paths.
  • Out of Scope Changes check: All changes are scoped to the stated objectives: modifications to AbstractSegmentReplicationTarget.java to implement recovery-aware checkpoint validation, and two unit tests validating both the normal replication and recovery scenarios.




Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java`:
- Around line 593-596: The test incorrectly attempts to stub methods on the real
ShardRouting returned by spyIndexShard.routingEntry(); instead create a mock
ShardRouting, stub initializing() and relocating() on that mock, and have
spyIndexShard.routingEntry() return the mocked ShardRouting (e.g., mock
ShardRouting, when(mockRouting.initializing()).thenReturn(false),
when(mockRouting.relocating()).thenReturn(false), and
when(spyIndexShard.routingEntry()).thenReturn(mockRouting)); also add the import
for org.opensearch.cluster.routing.ShardRouting.
- Around line 660-662: The current test incorrectly chains mocks on
spyIndexShard.routingEntry(); instead create a dedicated mocked ShardRouting,
return it from spyIndexShard.routingEntry(), and then stub its state methods:
e.g., mock a ShardRouting instance,
when(spyIndexShard.routingEntry()).thenReturn(mockShardRouting), then
when(mockShardRouting.initializing()).thenReturn(true) and
when(mockShardRouting.relocating()).thenReturn(false) so the
initializing/relocating behavior is properly isolated on the ShardRouting mock.
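The pitfall the review flags can be shown without Mockito: `when(spy.routingEntry().initializing())` evaluates `routingEntry()` on the spy first, which returns a real `ShardRouting`, so the stub never attaches. The fix is to make `routingEntry()` return an object the test controls. A dependency-free sketch of that shape (the types below are minimal stand-ins, not the real `IndexShard`/`ShardRouting`):

```java
// Dependency-free sketch of the review suggestion: have routingEntry()
// return a controllable fake, then drive initializing()/relocating()
// through it. Types are illustrative, not OpenSearch classes.
interface Routing {
    boolean initializing();
    boolean relocating();
}

class FakeShard {
    private final Routing routing;

    FakeShard(Routing routing) {
        this.routing = routing;
    }

    Routing routingEntry() {
        return routing;
    }
}

public class RoutingStubSketch {
    // Mirrors the guard under test: recovering == initializing || relocating.
    static boolean isRecovering(FakeShard shard) {
        Routing r = shard.routingEntry();
        return r.initializing() || r.relocating();
    }

    public static void main(String[] args) {
        // Active shard: neither initializing nor relocating.
        FakeShard active = new FakeShard(new Routing() {
            public boolean initializing() { return false; }
            public boolean relocating() { return false; }
        });
        // Recovering shard: initializing.
        FakeShard recovering = new FakeShard(new Routing() {
            public boolean initializing() { return true; }
            public boolean relocating() { return false; }
        });
        System.out.println(isRecovering(active));     // false
        System.out.println(isRecovering(recovering)); // true
    }
}
```

With Mockito the same shape is `when(spyIndexShard.routingEntry()).thenReturn(mockRouting)` followed by stubbing `mockRouting` directly, exactly as the actionable comments describe.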
🧹 Nitpick comments (1)
server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java (1)

546-611: Consider adding test coverage for the relocating state.

The implementation treats both initializing() and relocating() as recovery states. The current tests only verify the initializing scenario. Consider adding a test case where initializing() returns false and relocating() returns true to ensure both branches of the OR condition are covered.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52b699d and 5e2114f.

📒 Files selected for processing (2)
  • server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
  • server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (1)
server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java (1)

166-185: LGTM! Recovering-state guard correctly addresses rolling restart scenario.

The logic correctly differentiates between normal replication (reject stale checkpoints) and recovery (accept stale checkpoints). The comments clearly explain the rationale and link to the relevant issues. The indexShard.routingEntry() calls are safe within this context—the callers (afterIndexShardStarted and clusterChanged) validate the routing entry before initiating replication, ensuring it is non-null at this point in the lifecycle.


@github-actions
Contributor

❌ Gradle check result for 45fd6b1: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 6ebfe41: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@cuonghm2809 cuonghm2809 force-pushed the fix-shard-reallocation-19234 branch from 2653d2e to af48e43 Compare January 15, 2026 07:29
@github-actions
Contributor

❌ Gradle check result for af48e43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@cuonghm2809 cuonghm2809 reopened this Jan 15, 2026
@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Cluster Manager Project Board Jan 15, 2026
@github-actions
Contributor

❌ Gradle check result for af48e43: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Commit: Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234

Commit: Fix incorrect mock chaining in unit tests

The chained mock syntax when(spyIndexShard.routingEntry().initializing())
doesn't work as intended because routingEntry() returns a real ShardRouting
object, not a mock.

Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false) - should reject stale checkpoint
2. Recovering shard (initializing=true) - should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Commit: Fix ReplicationCheckpoint constructor in tests

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Commit: Fix code formatting

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@andrross andrross force-pushed the fix-shard-reallocation-19234 branch from af48e43 to 26b0eb2 Compare January 21, 2026 19:41
@github-actions
Contributor

❕ Gradle check result for 26b0eb2: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.24%. Comparing base (68c2d14) to head (4c7f021).
⚠️ Report is 6 commits behind head on main.

Files with missing lines | Patch % | Lines
.../replication/AbstractSegmentReplicationTarget.java | 50.00% | 0 Missing, 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20422      +/-   ##
============================================
- Coverage     73.33%   73.24%   -0.10%     
+ Complexity    72004    71921      -83     
============================================
  Files          5793     5793              
  Lines        329110   329111       +1     
  Branches      47402    47403       +1     
============================================
- Hits         241364   241058     -306     
- Misses        68394    68706     +312     
+ Partials      19352    19347       -5     

☔ View full report in Codecov by Sentry.

Commit: Add CHANGELOG entry

Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions
Contributor

❌ Gradle check result for 4c7f021: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-project-automation github-project-automation bot moved this from 🏗 In progress to 👀 In review in Cluster Manager Project Board Jan 22, 2026
@mch2
Member

mch2 commented Jan 22, 2026

Thanks for the fix here @cuonghm2809, this fix looks correct: the force_sync performed during recovery should disregard checkpoint metadata to catch up to wherever the elected primary is at that point. I suspect the reason the old primary is ahead is the flush performed on shard closure. If the new primary has not indexed/refreshed anything in that timeframe, the old primary (now a replica) would always come back ahead.
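The ordering problem described in the comment above can be modeled as a toy checkpoint comparison. The fields and isAheadOf logic below are simplified illustrations, not the real ReplicationCheckpoint implementation, and the primary-term bump on promotion is ignored for simplicity:

```java
// Toy model of why an old primary (now replica) can compare "ahead" of
// the elected primary after a rolling restart: the flush on shard closure
// advanced its segment infos version, while the new primary has indexed
// nothing since. Fields and comparison are illustrative only.
public class CheckpointOrderingSketch {

    public static class Checkpoint {
        final long primaryTerm;
        final long segmentInfosVersion;

        public Checkpoint(long primaryTerm, long segmentInfosVersion) {
            this.primaryTerm = primaryTerm;
            this.segmentInfosVersion = segmentInfosVersion;
        }

        public boolean isAheadOf(Checkpoint other) {
            if (primaryTerm != other.primaryTerm) {
                return primaryTerm > other.primaryTerm;
            }
            return segmentInfosVersion > other.segmentInfosVersion;
        }
    }

    public static void main(String[] args) {
        // Old primary's checkpoint after the flush performed on closure.
        Checkpoint oldPrimaryNowReplica = new Checkpoint(1, 12);
        // Elected primary that has not indexed/refreshed since.
        Checkpoint newPrimary = new Checkpoint(1, 10);

        // Strict validation would reject syncing to the "older" primary;
        // the recovery path in this PR accepts it.
        System.out.println(oldPrimaryNowReplica.isAheadOf(newPrimary)); // true
    }
}
```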

@github-actions
Contributor

❌ Gradle check result for 4c7f021:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for 4c7f021: SUCCESS

@andrross andrross merged commit 5906e7b into opensearch-project:main Jan 23, 2026
35 of 40 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Jan 23, 2026
@cuonghm2809
Contributor Author

You are right, @mch2.
I have opened a PR #20498 to backport the fix to version 2.19. Could you please help review the PR and merge it into the 2.19 branch @mch2 @andrross?

We are blocked from upgrading our OpenSearch cluster from version 2.17 to 2.19 because of this issue, and we need the fix to be released before we proceed with the upgrade.

tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026
…oject#20422)

* Fix segment replication failure during rolling restart

During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234

* Fix incorrect mock chaining in unit tests

The chained mock syntax when(spyIndexShard.routingEntry().initializing())
doesn't work as intended because routingEntry() returns a real ShardRouting
object, not a mock.

Fixed by:
- Added ShardRouting import
- Created separate ShardRouting mocks for both test cases
- Properly stubbed initializing() and relocating() methods on the mock
- Stubbed routingEntry() to return the mocked ShardRouting

This ensures tests correctly verify the behavior for both:
1. Active shard (initializing=false) - should reject stale checkpoint
2. Recovering shard (initializing=true) - should accept stale checkpoint

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix ReplicationCheckpoint constructor in tests

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Fix code formatting

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

* Add CHANGELOG entry

Signed-off-by: Andrew Ross <andrross@amazon.com>

---------

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Andrew Ross <andrross@amazon.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>

Labels

bug (Something isn't working), Cluster Manager, Indexing:Replication (Issues and PRs related to the core replication framework, e.g. segrep)

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[BUG] Shard fails to re-assign after a rolling restart

3 participants