[Backport 2.19] Fix segment replication failure during rolling restart by cuonghm2809 · Pull Request #20498 · opensearch-project/OpenSearch

cuonghm2809 · 2026-01-29T08:00:49Z

Description

Backport of the fix #20422 for segment replication failure during rolling restart to 2.19 branch.

Related Issues

Resolves #19234

Root Cause

During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in #18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

Error Message

Rejecting stale metadata checkpoint [ReplicationCheckpoint{segmentsGen=261}] 
since initial checkpoint [ReplicationCheckpoint{segmentsGen=278}] is ahead of it

Solution

This fix distinguishes between two scenarios:

1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (preserves the fix from #18944)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart
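
The two scenarios above can be illustrated with a minimal, self-contained sketch. It is not the code from SegmentReplicationTarget.java: the class, enum, and method names below are hypothetical, and the checkpoint comparison is simplified to the segmentsGen value shown in the error message, which is an assumption about how "ahead of" is decided.

```java
import java.util.EnumSet;
import java.util.Set;

public class CheckpointValidationSketch {

    // Hypothetical stand-in for the shard's routing state; not the OpenSearch enum.
    enum ShardState { INITIALIZING, RELOCATING, STARTED }

    // Hypothetical, simplified checkpoint that carries only a segments generation.
    static final class Checkpoint {
        final long segmentsGen;

        Checkpoint(long segmentsGen) {
            this.segmentsGen = segmentsGen;
        }

        // Assumption: "ahead" is reduced to comparing segmentsGen for illustration.
        boolean isAheadOf(Checkpoint other) {
            return segmentsGen > other.segmentsGen;
        }
    }

    // States treated as "recovery" in the PR description.
    static final Set<ShardState> RECOVERING = EnumSet.of(ShardState.INITIALIZING, ShardState.RELOCATING);

    /**
     * Returns true if the checkpoint reported by the primary should be accepted.
     * A checkpoint older than the replica's initial checkpoint is rejected during
     * normal replication but accepted while the shard is recovering.
     */
    static boolean shouldAccept(Checkpoint initial, Checkpoint fromPrimary, ShardState state) {
        boolean stale = initial.isAheadOf(fromPrimary);
        if (!stale) {
            return true; // nothing to reject
        }
        return RECOVERING.contains(state);
    }

    public static void main(String[] args) {
        // Values taken from the error message in this PR.
        Checkpoint initial = new Checkpoint(278);
        Checkpoint fromPrimary = new Checkpoint(261);

        System.out.println(shouldAccept(initial, fromPrimary, ShardState.STARTED));      // false
        System.out.println(shouldAccept(initial, fromPrimary, ShardState.INITIALIZING)); // true
        System.out.println(shouldAccept(initial, fromPrimary, ShardState.RELOCATING));   // true
    }
}
```

In this sketch the strict check stays the default, and the relaxed path is taken only for the two recovery states, which mirrors the intent of keeping the #18944 behavior intact for normal replication.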

Implementation Notes

- In 2.19, the logic is in SegmentReplicationTarget.java instead of AbstractSegmentReplicationTarget.java (which was introduced in later versions).
- Added comprehensive unit tests to verify both scenarios.

coderabbitai · 2026-01-29T08:00:59Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.


During rolling restarts, replica shards may have received newer checkpoints from the primary before the restart, but after restart, the primary may have rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944 to fix race conditions during primary relocation incorrectly rejects this legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:

1. Normal replication - strict checkpoint validation applies to prevent accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's current state even if it appears older than the replica's last known checkpoint, as this is expected during recovery from restart

Added unit tests to verify:

- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of AbstractSegmentReplicationTarget.java (which was introduced in later versions).

Fixes opensearch-project#19234

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

github-actions · 2026-01-29T09:37:12Z

✅ Gradle check result for 36e440d: SUCCESS

codecov · 2026-01-29T09:38:18Z

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 71.98%. Comparing base (1f78365) to head (2191bb0).
⚠️ Report is 3 commits behind head on 2.19.

Files with missing lines	Patch %	Lines
.../indices/replication/SegmentReplicationTarget.java	50.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               2.19   #20498      +/-   ##
============================================
- Coverage     71.99%   71.98%   -0.02%     
- Complexity    65993    66022      +29     
============================================
  Files          5342     5342              
  Lines        307364   307363       -1     
  Branches      44857    44857              
============================================
- Hits         221298   221251      -47     
- Misses        67624    67668      +44     
- Partials      18442    18444       +2


Signed-off-by: Andrew Ross <andrross@amazon.com>

github-actions · 2026-01-29T20:07:21Z

✅ Gradle check result for 2191bb0: SUCCESS

louzadod · 2026-03-11T10:38:59Z

Hello @andrross, is this PR part of OpenSearch 2.19.5? The 2.19.5 release notes do not mention it. Was the issue fixed?

cuonghm2809 · 2026-03-11T10:47:04Z

@louzadod This is included in OpenSearch version 2.19.5. You can check the release notes here: https://github.com/opensearch-project/OpenSearch/blob/2.19/release-notes/opensearch.release-notes-2.19.5.md

louzadod · 2026-03-11T12:26:56Z

Thank you @cuonghm2809 for the quick answer. Last night our cluster nodes were updated, and the OpenSearch cluster returned to the GREEN state automatically thanks to v2.19.5.

cuonghm2809 requested a review from a team as a code owner January 29, 2026 08:00

github-actions bot added the bug, Cluster Manager, and Indexing:Replication labels Jan 29, 2026

github-project-automation bot added this to Cluster Manager Project Board Jan 29, 2026

cuonghm2809 mentioned this pull request Jan 29, 2026

Fix segment replication failure during rolling restart #20422

Merged

3 tasks

cuonghm2809 added 4 commits January 29, 2026 15:40

Fix javadoc syntax error in SearchPhase

469314f

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fix ReplicationCheckpoint constructor in unit tests

734cf97

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fix code formatting

36e440d

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

cuonghm2809 force-pushed the fix-19234-tag-2.19.4 branch from bd60de1 to 36e440d Compare January 29, 2026 08:41

Add CHANGELOG entry

2191bb0

Signed-off-by: Andrew Ross <andrross@amazon.com>

andrross approved these changes Jan 29, 2026

View reviewed changes

github-project-automation bot moved this to 👀 In review in Cluster Manager Project Board Jan 29, 2026

andrross merged commit 04900c6 into opensearch-project:2.19 Jan 29, 2026
44 of 105 checks passed

github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Jan 29, 2026

andrross merged 5 commits into opensearch-project:2.19 from cuonghm2809:fix-19234-tag-2.19.4


3 participants
