
Fix segment replication checkpoint not published after engine reset#20749

Closed
andrross wants to merge 1 commit into opensearch-project:main from andrross:fix-checkpoint-not-published

Conversation

@andrross
Member

When a primary shard's engine is created or reset, the InternalEngine constructor performs an internal-only refresh during restoreVersionMapAndCheckpointTracker. This updates the local replication checkpoint via ReplicationCheckpointUpdater (an internal refresh listener), but does not trigger CheckpointRefreshListener (an external refresh listener), which is responsible for publishing the checkpoint to replicas.

If no new writes arrive after the engine starts, the external reader manager never observes a reader change, didRefresh remains false, and the checkpoint is never published. Replicas believe they are current and never request the missing segments, leaving them permanently stale.

The promotion path in updateShardState already handled this correctly with an explicit updateReplicationCheckpoint() and checkpointPublisher.publish() call after engine reset. Two other paths were missing the same fix:

  1. Primary recovery after node restart (initializing -> active with same primary term): activatePrimaryMode() was called without publishing the checkpoint. This is the path hit during rolling upgrades.

  2. resetToWriteableEngine() (relocation handoff): called resetEngineToGlobalCheckpoint() without updating or publishing the checkpoint afterward.
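The failure mode described above can be sketched with a minimal, self-contained simulation. All names and types here are hypothetical stand-ins for the real ReplicationCheckpointUpdater / CheckpointRefreshListener machinery, not OpenSearch code; the point is only to show why `didRefresh == false` starves the publish path.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointPublishSim {
    static long localCheckpoint = 0;   // advanced by the internal refresh listener
    static long externalReaderGen = 0; // segment generation the external reader has seen
    static long engineGen = 0;         // latest segment generation in the engine
    static List<Long> published = new ArrayList<>(); // checkpoints replicas received

    // Engine creation/reset: internal-only refresh (restoreVersionMapAndCheckpointTracker).
    // The internal listener updates the local checkpoint, and the reader opens the
    // latest segments, but the external CheckpointRefreshListener never runs.
    static void internalRefresh() {
        engineGen++;
        localCheckpoint = engineGen;   // ReplicationCheckpointUpdater analogue
        externalReaderGen = engineGen; // reader is already at the latest generation
    }

    // A later external refresh only publishes if the reader actually changed.
    static void externalRefresh() {
        boolean didRefresh = externalReaderGen < engineGen;
        if (didRefresh) {
            externalReaderGen = engineGen;
            published.add(localCheckpoint); // CheckpointRefreshListener analogue
        }
    }

    public static void main(String[] args) {
        internalRefresh(); // engine reset or creation
        externalRefresh(); // no new writes arrived: didRefresh == false
        System.out.println("published=" + published.size()); // prints published=0
        // Hence the fix: explicitly update and publish the checkpoint after the
        // engine reset, as the updateShardState() promotion path already does.
    }
}
```

With no writes between the two refreshes, nothing is ever published and replicas stay stale, which is exactly the symptom the PR addresses.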

Related Issues

Resolves #14302

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run Indexing Indexing, Bulk Indexing and anything related to indexing labels Feb 27, 2026
@github-actions
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Missing Publish

In resetToWriteableEngine(), updateReplicationCheckpoint() is called but checkpointPublisher.publish() is never called afterward. The PR description explicitly states the fix requires both updating and publishing the checkpoint (as done in the updateShardState promotion path), but the publish step is missing here. This means replicas may still not receive the updated checkpoint after a relocation handoff engine reset.

updateReplicationCheckpoint();
Missing Guard

In resetToWriteableEngine(), the call to updateReplicationCheckpoint() is not guarded by a check for indexSettings.isSegRepLocalEnabled() and checkpointPublisher != null, unlike the analogous fix in updateShardState(). This could cause a NullPointerException or unintended behavior when segment replication is not enabled.

updateReplicationCheckpoint();
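A minimal sketch of the guarded update-and-publish pattern this observation refers to, using stub types rather than the real IndexShard, IndexSettings, and publisher classes (the method names mirror the PR; everything else is a placeholder):

```java
public class GuardedPublishSketch {
    static class Settings {
        boolean segRepLocal = true;
        boolean isSegRepLocalEnabled() { return segRepLocal; }
    }
    static class Publisher {
        int publishes = 0;
        void publish(Object shard, Object checkpoint) { publishes++; }
    }

    Settings indexSettings = new Settings();
    Publisher checkpointPublisher; // may be null when segment replication is disabled
    Object latestCheckpoint = new Object();

    void updateReplicationCheckpoint() { /* recompute checkpoint from the engine */ }
    Object getLatestReplicationCheckpoint() { return latestCheckpoint; }

    void afterEngineReset() {
        // Guard mirrors the updateShardState() promotion path: skip entirely when
        // segrep is off or no publisher is wired, avoiding an NPE.
        if (indexSettings.isSegRepLocalEnabled() && checkpointPublisher != null) {
            updateReplicationCheckpoint();
            checkpointPublisher.publish(this, getLatestReplicationCheckpoint());
        }
    }

    public static void main(String[] args) {
        GuardedPublishSketch shard = new GuardedPublishSketch();
        shard.afterEngineReset(); // publisher is null: no NPE, nothing published
        shard.checkpointPublisher = new Publisher();
        shard.afterEngineReset(); // publisher wired: checkpoint published once
        System.out.println("publishes=" + shard.checkpointPublisher.publishes);
    }
}
```

The guard makes the reset path a no-op for document-replication indices, matching the behavior of the promotion path the PR description cites as the model.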

@andrross
Member Author

Full disclosure, kiro-cli (aka claude opus) and I debugged this together so I would appreciate some extra eyes from the segment replication experts: @mch2 @kkewwei @guojialiang92 @gbbafna

@github-actions
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Publish checkpoint inside the blocked operations scope

The updateReplicationCheckpoint() call in resetToWriteableEngine() is placed outside
the blockOperations lambda, meaning it runs after the block is released. This could
lead to a race condition where operations resume before the checkpoint is published.
Additionally, unlike the updateShardState path, there is no guard checking
indexSettings.isSegRepLocalEnabled() and checkpointPublisher != null, nor is
checkpointPublisher.publish() explicitly called, which may result in the checkpoint
not being propagated to replicas.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [2300-2306]

 public void resetToWriteableEngine() throws IOException, InterruptedException, TimeoutException {
-    indexShardOperationPermits.blockOperations(30, TimeUnit.MINUTES, () -> { resetEngineToGlobalCheckpoint(); });
-    // Force update and publish checkpoint so replicas learn about segments that existed
-    // before the engine reset. See resetEngineToGlobalCheckpoint() in updateShardState()
-    // promotion path for the same pattern.
-    updateReplicationCheckpoint();
+    indexShardOperationPermits.blockOperations(30, TimeUnit.MINUTES, () -> {
+        resetEngineToGlobalCheckpoint();
+        // Force update and publish checkpoint so replicas learn about segments that existed
+        // before the engine reset. See updateShardState() promotion path for the same pattern.
+        if (indexSettings.isSegRepLocalEnabled() && checkpointPublisher != null) {
+            updateReplicationCheckpoint();
+            checkpointPublisher.publish(this, getLatestReplicationCheckpoint());
+        }
+    });
 }
Suggestion importance[1-10]: 6


Why: The suggestion raises a valid concern about the updateReplicationCheckpoint() call being outside the blockOperations lambda, which could allow operations to resume before the checkpoint is updated. However, the updateShardState path also calls these methods outside any lock, suggesting this may be intentional design. The missing guard for isSegRepLocalEnabled() and checkpointPublisher != null is a more concrete concern.

Impact: Low

Category: General
Guard against null checkpoint before publishing

The updateReplicationCheckpoint() call updates the checkpoint, but then
checkpointPublisher.publish() is called with getLatestReplicationCheckpoint(). If
updateReplicationCheckpoint() is asynchronous or the checkpoint is not immediately
reflected, there could be a mismatch. It would be safer to capture the result of
updateReplicationCheckpoint() (if it returns the checkpoint) or ensure the publish
uses the same checkpoint that was just set.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [798-805]

 if (indexSettings.isSegRepLocalEnabled() && this.checkpointPublisher != null) {
     // Force publish checkpoint so replicas catch up to segments that existed before
     // the node restart. Without this, the internal refresh during engine construction
     // may have already opened the latest reader, causing subsequent external refreshes
     // to see no change and never trigger CheckpointRefreshListener.
     updateReplicationCheckpoint();
-    checkpointPublisher.publish(this, getLatestReplicationCheckpoint());
+    final ReplicationCheckpoint checkpoint = getLatestReplicationCheckpoint();
+    if (checkpoint != null) {
+        checkpointPublisher.publish(this, checkpoint);
+    }
 }
Suggestion importance[1-10]: 3


Why: The suggestion to capture the checkpoint and guard against null is a minor defensive improvement, but getLatestReplicationCheckpoint() is unlikely to return null in this context since the engine was just activated. The improved_code also changes the logic slightly without strong justification.

Impact: Low

@github-actions
Contributor

❌ Gradle check result for bd62e7d: FAILURE

Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Member Author

andrross commented Mar 3, 2026

I'm not confident this is the right way to fix the issues we're seeing. Closing for now.

@andrross andrross closed this Mar 3, 2026


Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for IndexingIT
