
[2.19] Fix segment replication failure during rolling restart#20423

Closed
cuonghm2809 wants to merge 4 commits into opensearch-project:2.19 from cuonghm2809:fix-19234-tag-2.19.4

Conversation

@cuonghm2809
Contributor

@cuonghm2809 cuonghm2809 commented Jan 14, 2026

Description

This PR backports the fix for issue #19234 to the 2.19 branch.

Problem:
During rolling restart with segment replication enabled, replica shards fail to re-assign with error "Rejecting stale metadata checkpoint" after 5 retries. This occurs because:

  1. Replica restarts and loads its last persisted checkpoint (e.g., version 100)
  2. During restart, primary relocates to another node and advances to a newer checkpoint (version 150)
  3. After relocation completes, primary checkpoint might revert to an earlier version (version 120)
  4. Replica tries to replicate with initial checkpoint version 100
  5. Primary returns checkpoint version 120, which is older than the version 150 checkpoint the replica was last notified of, so the primary's response appears stale
  6. The strict validation added in PR #18944 (Fix segment replication bug during primary relocation) incorrectly rejects it as "stale"

This is a legitimate scenario during recovery, not an actual stale checkpoint issue.
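The rejection in step 6 comes down to a version comparison that has no awareness of why the versions diverged. A minimal standalone sketch of the pre-fix strict check (illustrative names and a plain `long` version, not the actual OpenSearch code):

```java
// Illustrative sketch of the strict validation introduced by PR #18944
// (not the actual OpenSearch code): any received checkpoint older than
// the one the replica expects is rejected, regardless of shard state.
public class StrictValidationSketch {

    // Throws when the primary's checkpoint is behind the version the
    // replica expects, mimicking "Rejecting stale metadata checkpoint".
    static void validate(long expectedVersion, long receivedVersion) {
        if (receivedVersion < expectedVersion) {
            throw new IllegalStateException(
                "Rejecting stale metadata checkpoint: received " + receivedVersion
                    + " but expected at least " + expectedVersion);
        }
    }

    public static void main(String[] args) {
        validate(100, 150); // fine: the primary is ahead of the replica
        try {
            // Rolling-restart case: primary rolled back to 120 after
            // relocation, replica expects 150, so the check trips.
            validate(150, 120);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Run repeatedly across five retries, this unconditional check is what exhausts the replica's allocation attempts.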

Root cause:
PR #18944 added strict checkpoint validation to prevent issue #18490, but it doesn't distinguish between:

  • Normal replication (where stale checkpoints should be rejected)
  • Recovery/initialization (where the replica's persisted checkpoint may legitimately appear newer)

Solution:
Only enforce strict checkpoint validation during normal replication. During recovery (when shard is INITIALIZING or RELOCATING), accept checkpoints that may appear older, as the replica is catching up to the primary's current state.
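The intended decision logic can be sketched as follows. This is a simplified standalone model with illustrative names (`shouldReject`, a bare `long` version), not the actual `SegmentReplicationTarget` code:

```java
// Standalone sketch of the fixed, state-aware validation (illustrative
// names, not the actual SegmentReplicationTarget code).
public class RecoveryAwareValidationSketch {

    enum ShardState { INITIALIZING, RELOCATING, STARTED }

    // The strict staleness check applies only during normal replication;
    // a recovering shard accepts an apparently older primary checkpoint
    // because it is re-synchronizing with the primary's current state.
    static boolean shouldReject(long expectedVersion, long receivedVersion, ShardState state) {
        boolean recovering = state == ShardState.INITIALIZING || state == ShardState.RELOCATING;
        return !recovering && receivedVersion < expectedVersion;
    }

    public static void main(String[] args) {
        // Normal replication: a rolled-back checkpoint is still rejected,
        // preserving the protection from PR #18944.
        System.out.println(shouldReject(150, 120, ShardState.STARTED));      // prints true

        // Recovery after a rolling restart: the same versions are
        // accepted so the replica can catch up to the primary.
        System.out.println(shouldReject(150, 120, ShardState.INITIALIZING)); // prints false
    }
}
```

The key design point is that the shard's routing state, not the version comparison alone, determines whether "older" means "stale" or simply "resynchronizing".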

Related Issues

Resolves #19234
Related to #18490

Check List

  • New functionality includes testing

Test Plan

Added two unit tests in SegmentReplicationTargetTests.java:

  1. testStaleCheckpointRejected_duringNormalReplication(): Verifies that stale checkpoints are correctly rejected when the shard is active (not recovering)

  2. testStaleCheckpointAccepted_duringRecovery(): Verifies that checkpoints are accepted during recovery even if they appear older than the replica's persisted checkpoint

Both tests passed:

> Task :server:test --tests "org.opensearch.indices.replication.SegmentReplicationTargetTests.testStaleCheckpoint*"

SegmentReplicationTargetTests > testStaleCheckpointRejected_duringNormalReplication PASSED
SegmentReplicationTargetTests > testStaleCheckpointAccepted_duringRecovery PASSED

BUILD SUCCESSFUL
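The two cases those tests cover can be exercised against a simplified model. A self-contained harness mirroring them (this drives an illustrative `shouldReject` model, not the real `SegmentReplicationTargetTests`):

```java
// Self-contained harness mirroring the two test cases above; it drives
// a simplified model of the validation, not the real OpenSearch tests.
public class CheckpointValidationTests {

    enum ShardState { INITIALIZING, STARTED }

    // Simplified model: reject stale checkpoints only on active shards.
    static boolean shouldReject(long expected, long received, ShardState state) {
        return state == ShardState.STARTED && received < expected;
    }

    public static void main(String[] args) {
        // Case 1: a stale checkpoint is rejected during normal replication.
        if (!shouldReject(150, 120, ShardState.STARTED)) {
            throw new AssertionError("expected rejection for an active shard");
        }
        System.out.println("stale checkpoint rejected during normal replication: PASSED");

        // Case 2: the same checkpoint is accepted while the shard recovers.
        if (shouldReject(150, 120, ShardState.INITIALIZING)) {
            throw new AssertionError("expected acceptance during recovery");
        }
        System.out.println("stale checkpoint accepted during recovery: PASSED");
    }
}
```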

Additional Context

This fix is critical for production deployments using segment replication during rolling restarts. Without this fix, shards fail to recover and require manual intervention.

During rolling restarts, replica shards may have received newer checkpoints
from the primary before the restart, but after restart, the primary may have
rolled back to an older state. The strict checkpoint validation added in opensearch-project#18944
to fix race conditions during primary relocation incorrectly rejects this
legitimate scenario, causing shards to fail allocation after 5 retries.

This fix distinguishes between two scenarios:
1. Normal replication - strict checkpoint validation applies to prevent
   accepting stale data during primary relocation (maintains opensearch-project#18944 fix)
2. Recovery (shard INITIALIZING or RELOCATING) - accepts the primary's
   current state even if it appears older than the replica's last known
   checkpoint, as this is expected during recovery from restart

Added unit tests to verify:
- Stale checkpoint is rejected during normal replication
- Stale checkpoint is accepted during shard recovery

Note: In 2.19, the logic is in SegmentReplicationTarget.java instead of
AbstractSegmentReplicationTarget.java (which was introduced in later versions).

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

Fixes opensearch-project#19234
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@cuonghm2809 cuonghm2809 requested a review from a team as a code owner January 14, 2026 16:42
@coderabbitai
Contributor

coderabbitai bot commented Jan 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@github-actions github-actions bot added bug Something isn't working Cluster Manager Indexing:Replication Issues and PRs related to core replication framework eg segrep labels Jan 14, 2026
@github-actions
Contributor

❌ Gradle check result for 85cb2e0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@github-actions
Contributor

❌ Gradle check result for 4f5ed51: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@github-actions
Contributor

❌ Gradle check result for bd60de1: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?



Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

1 participant