
[RFC] Reproduce the corruption issue in the segment replication scenario with soft delete enabled. #20312

@guojialiang92

Description


Background

In the segment replication scenario, to avoid file-name conflicts after primary promotion, the segment counter is incremented by 100000 when NRTReplicationEngine is closed.
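The effect of that counter bump can be sketched as follows. This is an illustrative model only (the class and method names are hypothetical, not the OpenSearch implementation), assuming Lucene's convention of naming segments as "_" plus the counter rendered in base 36. Note that the bump only moves future base segment names forward; it does not touch per-segment generation fields such as those tracked in SegmentCommitInfo.

```java
// Hypothetical sketch, not the OpenSearch implementation.
public class SegmentNaming {

    /** Lucene-style segment name: "_" + counter in base 36 (Character.MAX_RADIX). */
    static String segmentName(long counter) {
        return "_" + Long.toString(counter, Character.MAX_RADIX);
    }

    /** On NRTReplicationEngine close, jump the counter past any name
     *  the old primary could plausibly have handed out. */
    static long bumpOnClose(long counter) {
        return counter + 100_000;
    }

    public static void main(String[] args) {
        long counter = 1;                                      // "_0" already written
        System.out.println(segmentName(counter));              // _1
        System.out.println(segmentName(bumpOnClose(counter))); // a name far past "_1"
    }
}
```

A newly promoted primary that resumes from the bumped counter therefore cannot reuse a *base* segment name, which is exactly why the remaining collisions are confined to the generation-suffixed files discussed in this issue.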

In the earlier discussion #5701, updating all generation fields in SegmentCommitInfo was evaluated. Because the environment where the issue occurred had been deleted and the problem was difficult to reproduce, that issue was ultimately closed.

Currently, InternalEngine uses only soft deletes, so we focus on the segment replication scenario with soft deletes enabled. Having studied the current segment replication implementation, I believe that in this scenario there is indeed a possibility of an OpenSearchCorruptionException being thrown.
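Under soft deletes, an update does not rewrite the base segment; it writes per-segment doc-values update files whose names embed a per-segment generation. A minimal sketch of that naming scheme, inferred from the file names that appear later in the stack trace (the class and method names are hypothetical, and the exact suffix layout is an assumption):

```java
// Hedged illustration of Lucene-style doc-values update file names,
// e.g. "_0_2_Lucene90_0.dvm": segment "_0", generation 2, codec suffix
// "Lucene90_0", extension "dvm". Names are assumptions from this issue's logs.
public class DocValuesUpdateFiles {

    /** Build "<segment>_<gen in base 36>_<suffix>.<ext>". */
    static String updateFileName(String segment, long gen, String suffix, String ext) {
        return segment + "_" + Long.toString(gen, Character.MAX_RADIX) + "_" + suffix + "." + ext;
    }

    public static void main(String[] args) {
        // Two independent writers (old primary, promoted replica) that both
        // start segment _0's update generation at 1 produce colliding names:
        System.out.println(updateFileName("_0", 1, "Lucene90_0", "dvm")); // _0_1_Lucene90_0.dvm
        System.out.println(updateFileName("_0", 2, "Lucene90_0", "dvm")); // _0_2_Lucene90_0.dvm
    }
}
```

Because the generation here is per segment rather than drawn from the global counter, the 100000 counter bump described above offers it no protection.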

Reproduce

To reproduce the issue, I constructed an IT test (testPrimaryStopped_ReplicaPromoted_UpdateDoc) based on the latest main branch and pushed it to a branch.

The exception stack trace is as follows.


[2025-12-23T18:05:06,031][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t1] [shardId [test-idx-1][0]] [replication id 611] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=0, REPLICATING=0}
org.opensearch.indices.replication.common.ReplicationFailedException: Store corruption during replication
	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:350) [main/:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.5.0-SNAPSHOT.jar:3.5.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:84) [opensearch-core-3.5.0-SNAPSHOT.jar:3.5.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1604) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [main/:?]
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58) [opensearch-core-3.5.0-SNAPSHOT.jar:3.5.0-SNAPSHOT]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [main/:?]
	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:73) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587) [main/:?]
	at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:468) [main/:?]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$3(NativeMessageHandler.java:462) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) [main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1474) [?:?]
Caused by: org.opensearch.OpenSearchCorruptionException: Shard [test-idx-1][0] has local copies of segments that differ from the primary [name [_0_2_Lucene90_0.dvm], length [160], checksum [12940pr], writtenBy [10.3.2], name [_0_2_Lucene90_0.dvd], length [91], checksum [1aenr37], writtenBy [10.3.2]]
	at org.opensearch.indices.replication.AbstractSegmentReplicationTarget.getFiles(AbstractSegmentReplicationTarget.java:251) ~[main/:?]
	at org.opensearch.indices.replication.AbstractSegmentReplicationTarget.lambda$startReplication$2(AbstractSegmentReplicationTarget.java:180) ~[main/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-3.5.0-SNAPSHOT.jar:3.5.0-SNAPSHOT]
	... 20 more

The test process is as follows:

  1. Start two data nodes.
  2. Create an index with 1 primary shard and 1 replica shard. Turn off automatic refresh.
  3. Write 20 documents.
  4. Execute a refresh, generating _0.si.
  5. Wait for segment replication to complete; both the primary and replica shards now contain 20 documents.
  6. Mock the segment replication process so that segment replication between the primary and replica shards cannot complete.
  7. Update the document with id 5 and execute a refresh. The primary shard generates _0_1_Lucene90_0.dvm.
  8. Update the document with id 6 and execute a refresh. The primary shard generates _0_2_Lucene90_0.dvm.
  9. Restart the node where the primary shard is located.
  10. The replica is promoted to the new primary shard and generates _0_1_Lucene90_0.dvm through translog recovery, but its content differs from the earlier file of the same name.
  11. Update the document with id 7. The new primary shard generates _0_2_Lucene90_0.dvm, but its content likewise differs from the earlier file of the same name.
  12. Peer recovery fails with an OpenSearchCorruptionException when force segment replication is performed.
  13. After the OpenSearchCorruptionException is thrown, file-based recovery is executed and the cluster eventually turns green.

This test reproduces the situation in which OpenSearchCorruptionException occurs; the test still passes, however, because recovery is retried.
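The failure in step 12 boils down to a name collision with different contents: the replica holds a stale _0_2_Lucene90_0.dvm whose length and checksum no longer match the new primary's file of the same name. A minimal sketch of that detection, using hypothetical types (FileMeta, DiffCheck) rather than the actual OpenSearch Store metadata API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Sketch only: models how a replication target can flag local files that
// share a name with the primary's files but differ in length or checksum.
public class DiffCheck {

    /** Minimal stand-in for per-file store metadata. */
    record FileMeta(String name, long length, long checksum) {}

    /** CRC32 checksum of a file's bytes. */
    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes);
        return c.getValue();
    }

    /** Names present locally with metadata that differs from the primary's copy. */
    static List<String> different(Map<String, FileMeta> local, Map<String, FileMeta> remote) {
        List<String> diff = new ArrayList<>();
        for (FileMeta r : remote.values()) {
            FileMeta l = local.get(r.name());
            if (l != null && (l.length() != r.length() || l.checksum() != r.checksum())) {
                diff.add(r.name());
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        byte[] stale = "soft-delete dv written by the old primary".getBytes();
        byte[] fresh = "soft-delete dv rewritten by the promoted replica".getBytes();
        Map<String, FileMeta> local = Map.of(
            "_0_2_Lucene90_0.dvm", new FileMeta("_0_2_Lucene90_0.dvm", stale.length, crc(stale)));
        Map<String, FileMeta> remote = Map.of(
            "_0_2_Lucene90_0.dvm", new FileMeta("_0_2_Lucene90_0.dvm", fresh.length, crc(fresh)));
        System.out.println(different(local, remote)); // the colliding file name
    }
}
```

In the real code path the analogous mismatch is what surfaces as "has local copies of segments that differ from the primary" in the stack trace above.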

Expected behavior

In the segment replication scenario with soft deletes enabled, OpenSearchCorruptionException should not occur.



Labels

    Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep), bug (Something isn't working)
