
[Backport 2.19] Fix segment replication failure during rolling restart #20498

Merged
andrross merged 5 commits into opensearch-project:2.19 from cuonghm2809:fix-19234-tag-2.19.4
Jan 29, 2026



@cuonghm2809 cuonghm2809 commented Jan 29, 2026

Description

Backport of the fix in #20422 for segment replication failures during rolling restarts to the 2.19 branch.

Related Issues

Resolves #19234

Root Cause

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

Error Message

Rejecting stale metadata checkpoint [ReplicationCheckpoint{segmentsGen=261}] 
since initial checkpoint [ReplicationCheckpoint{segmentsGen=278}] is ahead of it

Solution

This fix distinguishes between two scenarios:

  1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (preserves the fix from #18944)
  2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart
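The two scenarios above can be sketched as a small decision function. This is a simplified illustration, not the code in SegmentReplicationTarget.java: the class `CheckpointValidationSketch`, the `ShardState` enum, and the methods `shouldReject` and `validate` are hypothetical names, and a checkpoint is reduced to its `segmentsGen` value, which is an assumption made for brevity.

```java
/** Hypothetical, simplified model of the validation rule described in this PR. */
final class CheckpointValidationSketch {

    /** Stand-in for the shard's routing state (assumption: only these three matter here). */
    enum ShardState { INITIALIZING, RELOCATING, STARTED }

    /** A shard that is initializing or relocating is treated as being in recovery. */
    static boolean isRecovering(ShardState state) {
        return state == ShardState.INITIALIZING || state == ShardState.RELOCATING;
    }

    /**
     * Returns true when the primary's checkpoint should be rejected.
     *
     * @param initialGen segments generation the replica last knew about
     * @param primaryGen segments generation reported by the primary now
     * @param state      current state of the replica shard
     */
    static boolean shouldReject(long initialGen, long primaryGen, ShardState state) {
        // "Stale" means the replica's known checkpoint is ahead of the primary's.
        boolean stale = initialGen > primaryGen;
        // Strict validation applies only outside recovery.
        return stale && !isRecovering(state);
    }

    /** Throws when the checkpoint is rejected; the message loosely mirrors the reported error. */
    static void validate(long initialGen, long primaryGen, ShardState state) {
        if (shouldReject(initialGen, primaryGen, state)) {
            throw new IllegalStateException(
                "Rejecting stale metadata checkpoint [segmentsGen=" + primaryGen
                    + "] since initial checkpoint [segmentsGen=" + initialGen + "] is ahead of it"
            );
        }
    }
}
```

With the generations from the error message (278 on the replica, 261 on the primary), `shouldReject(278, 261, ShardState.STARTED)` returns true, while the same call with `ShardState.INITIALIZING` returns false, so a restarted shard can recover from the primary's current state.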

Implementation Notes

  • In 2.19, the logic is in SegmentReplicationTarget.java instead of AbstractSegmentReplicationTarget.java (which was introduced in later versions)
  • Added comprehensive unit tests to verify both scenarios

@cuonghm2809 cuonghm2809 requested a review from a team as a code owner January 29, 2026 08:00

coderabbitai bot commented Jan 29, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions github-actions bot added the bug, Cluster Manager, and Indexing:Replication labels Jan 29, 2026
During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of
AbstractSegmentReplicationTarget.java (which was introduced in later versions).

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@github-actions

✅ Gradle check result for 36e440d: SUCCESS


codecov bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 71.98%. Comparing base (1f78365) to head (2191bb0).
⚠️ Report is 3 commits behind head on 2.19.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../indices/replication/SegmentReplicationTarget.java | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               2.19   #20498      +/-   ##
============================================
- Coverage     71.99%   71.98%   -0.02%     
- Complexity    65993    66022      +29     
============================================
  Files          5342     5342              
  Lines        307364   307363       -1     
  Branches      44857    44857              
============================================
- Hits         221298   221251      -47     
- Misses        67624    67668      +44     
- Partials      18442    18444       +2     


Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions

✅ Gradle check result for 2191bb0: SUCCESS

@andrross andrross merged commit 04900c6 into opensearch-project:2.19 Jan 29, 2026
44 of 105 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Jan 29, 2026
louzadod commented Mar 11, 2026

Hello @andrross, is this PR part of OpenSearch 2.19.5? The 2.19.5 release notes do not mention it. Was the issue fixed?

@cuonghm2809

@louzadod This is included in OpenSearch version 2.19.5. You can check it here https://github.com/opensearch-project/OpenSearch/blob/2.19/release-notes/opensearch.release-notes-2.19.5.md

louzadod commented Mar 11, 2026

Thank you @cuonghm2809 for the quick answer. Last night our cluster nodes were updated, and the OS cluster automatically returned to the GREEN state thanks to v2.19.5.
