
[2.19] Fix segment replication failure during rolling restart#20423

Closed
cuonghm2809 wants to merge 4 commits into opensearch-project:2.19 from cuonghm2809:fix-19234-tag-2.19.4

Conversation

@cuonghm2809
Contributor

@cuonghm2809 cuonghm2809 commented Jan 14, 2026

Description

This PR backports the fix for issue #19234 to the 2.19 branch.

Problem:
During rolling restart with segment replication enabled, replica shards fail to re-assign with error "Rejecting stale metadata checkpoint" after 5 retries. This occurs because:

  1. Replica restarts and loads its last persisted checkpoint (e.g., version 100)
  2. During restart, primary relocates to another node and advances to a newer checkpoint (version 150)
  3. After relocation completes, primary checkpoint might revert to an earlier version (version 120)
  4. Replica tries to replicate with initial checkpoint version 100
  5. Primary returns checkpoint version 120, which is older than the version 150 checkpoint the replica was last notified of, so the primary's response appears stale
  6. The strict validation added in PR #18944 (Fix segment replication bug during primary relocation) incorrectly rejects it as "stale"

This is a legitimate scenario during recovery, not an actual stale checkpoint issue.
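The rejection in step 6 comes down to a version comparison that has no awareness of why the versions diverged. A minimal standalone sketch of the pre-fix strict check (illustrative names and a plain `long` version, not the actual OpenSearch code):

```java
// Illustrative sketch of the strict validation introduced by PR #18944
// (not the actual OpenSearch code): any received checkpoint older than
// the one the replica expects is rejected, regardless of shard state.
public class StrictValidationSketch {

    // Throws when the primary's checkpoint is behind the version the
    // replica expects, mimicking "Rejecting stale metadata checkpoint".
    static void validate(long expectedVersion, long receivedVersion) {
        if (receivedVersion < expectedVersion) {
            throw new IllegalStateException(
                "Rejecting stale metadata checkpoint: received " + receivedVersion
                    + " but expected at least " + expectedVersion);
        }
    }

    public static void main(String[] args) {
        validate(100, 150); // fine: the primary is ahead of the replica
        try {
            // Rolling-restart case: primary rolled back to 120 after
            // relocation, replica expects 150, so the check trips.
            validate(150, 120);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Run repeatedly across five retries, this unconditional check is what exhausts the replica's allocation attempts.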

Root cause:
PR #18944 added strict checkpoint validation to prevent issue #18490, but it doesn't distinguish between:

  • Normal replication (where stale checkpoints should be rejected)
  • Recovery/initialization (where the replica's persisted checkpoint may legitimately appear newer)

Solution:
Only enforce strict checkpoint validation during normal replication. During recovery (when shard is INITIALIZING or RELOCATING), accept checkpoints that may appear older, as the replica is catching up to the primary's current state.
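The intended decision logic can be sketched as follows. This is a simplified standalone model with illustrative names (`shouldReject`, a bare `long` version), not the actual `SegmentReplicationTarget` code:

```java
// Standalone sketch of the fixed, state-aware validation (illustrative
// names, not the actual SegmentReplicationTarget code).
public class RecoveryAwareValidationSketch {

    enum ShardState { INITIALIZING, RELOCATING, STARTED }

    // The strict staleness check applies only during normal replication;
    // a recovering shard accepts an apparently older primary checkpoint
    // because it is re-synchronizing with the primary's current state.
    static boolean shouldReject(long expectedVersion, long receivedVersion, ShardState state) {
        boolean recovering = state == ShardState.INITIALIZING || state == ShardState.RELOCATING;
        return !recovering && receivedVersion < expectedVersion;
    }

    public static void main(String[] args) {
        // Normal replication: a rolled-back checkpoint is still rejected,
        // preserving the protection from PR #18944.
        System.out.println(shouldReject(150, 120, ShardState.STARTED));      // prints true

        // Recovery after a rolling restart: the same versions are
        // accepted so the replica can catch up to the primary.
        System.out.println(shouldReject(150, 120, ShardState.INITIALIZING)); // prints false
    }
}
```

The key design point is that the shard's routing state, not the version comparison alone, determines whether "older" means "stale" or simply "resynchronizing".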

Related Issues

Resolves #19234
Related to #18490

Check List

  • New functionality includes testing

Test Plan

Added two unit tests in SegmentReplicationTargetTests.java:

  1. testStaleCheckpointRejected_duringNormalReplication(): Verifies that stale checkpoints are correctly rejected when the shard is active (not recovering)

  2. testStaleCheckpointAccepted_duringRecovery(): Verifies that checkpoints are accepted during recovery even if they appear older than the replica's persisted checkpoint

Both tests passed:

> Task :server:test --tests "org.opensearch.indices.replication.SegmentReplicationTargetTests.testStaleCheckpoint*"

SegmentReplicationTargetTests > testStaleCheckpointRejected_duringNormalReplication PASSED
SegmentReplicationTargetTests > testStaleCheckpointAccepted_duringRecovery PASSED

BUILD SUCCESSFUL
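The two cases those tests cover can be exercised against a simplified model. A self-contained harness mirroring them (this drives an illustrative `shouldReject` model, not the real `SegmentReplicationTargetTests`):

```java
// Self-contained harness mirroring the two test cases above; it drives
// a simplified model of the validation, not the real OpenSearch tests.
public class CheckpointValidationTests {

    enum ShardState { INITIALIZING, STARTED }

    // Simplified model: reject stale checkpoints only on active shards.
    static boolean shouldReject(long expected, long received, ShardState state) {
        return state == ShardState.STARTED && received < expected;
    }

    public static void main(String[] args) {
        // Case 1: a stale checkpoint is rejected during normal replication.
        if (!shouldReject(150, 120, ShardState.STARTED)) {
            throw new AssertionError("expected rejection for an active shard");
        }
        System.out.println("stale checkpoint rejected during normal replication: PASSED");

        // Case 2: the same checkpoint is accepted while the shard recovers.
        if (shouldReject(150, 120, ShardState.INITIALIZING)) {
            throw new AssertionError("expected acceptance during recovery");
        }
        System.out.println("stale checkpoint accepted during recovery: PASSED");
    }
}
```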

Additional Context

This fix is critical for production deployments using segment replication during rolling restarts. Without this fix, shards fail to recover and require manual intervention.

During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of
AbstractSegmentReplicationTarget.java (which was introduced in later versions).

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@cuonghm2809 cuonghm2809 requested a review from a team as a code owner January 14, 2026 16:42
@coderabbitai
Contributor

coderabbitai bot commented Jan 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@github-actions github-actions bot added bug Something isn't working Cluster Manager Indexing:Replication Issues and PRs related to core replication framework eg segrep labels Jan 14, 2026
@github-actions
Contributor

❌ Gradle check result for 85cb2e0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@github-actions
Contributor

❌ Gradle check result for 4f5ed51: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@github-actions
Contributor

❌ Gradle check result for bd60de1: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?



Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

1 participant