[BUG] test org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWithQueuedOperationsDuringHandoff is flaky #13797

@sohami

Describe the bug

The test org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWithQueuedOperationsDuringHandoff fails intermittently with the following error:

java.lang.AssertionError: beforeRefreshed called by a different thread. current [opensearch[node_t1][generic][T#4]], thread that called beforeRefresh [opensearch[node_t1][generic][T#5]]
	at __randomizedtesting.SeedInfo.seed([F86C333F9F578BBD]:0)
	at org.opensearch.index.shard.IndexShard$RefreshMetricUpdater.afterRefresh(IndexShard.java:4803)
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275)
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182)
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
	at org.opensearch.index.engine.NRTReplicationReaderManager.updateSegments(NRTReplicationReaderManager.java:106)
	at org.opensearch.index.engine.NRTReplicationEngine.updateSegments(NRTReplicationEngine.java:168)
	at org.opensearch.index.shard.IndexShard.finalizeReplication(IndexShard.java:1683)
	at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:299)
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$3(SegmentReplicationTarget.java:194)
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341)
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120)
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112)
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160)
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141)
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79)
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:58)
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70)
	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleResponse(TraceableTransportResponseHandler.java:72)
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1483)
	at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:428)
	at org.opensearch.transport.NativeMessageHandler.lambda$handleResponse$3(NativeMessageHandler.java:422)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
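
The assertion message indicates a check that compares the thread that ran the before-refresh hook with the thread now running afterRefresh. The following is a minimal, self-contained sketch of that kind of check, under stated assumptions: it is not the OpenSearch RefreshMetricUpdater code, and the names ThreadCheckingListener, sameThread, differentThreads and the thread names are hypothetical, chosen only to mirror the message above. It does not try to explain why two generic-pool threads end up interleaving in this test.

```java
class RefreshThreadCheckSketch {

    // Hypothetical stand-in for a refresh listener; NOT the OpenSearch implementation.
    static final class ThreadCheckingListener {
        // Thread that most recently called beforeRefresh().
        private volatile Thread beforeRefreshThread;

        void beforeRefresh() {
            beforeRefreshThread = Thread.currentThread();
        }

        // Returns null when the same thread called both hooks,
        // otherwise a message shaped like the one in the stack trace.
        String afterRefresh() {
            Thread current = Thread.currentThread();
            Thread recorded = beforeRefreshThread;
            if (recorded == current) {
                return null;
            }
            String recordedName = (recorded == null) ? "none" : recorded.getName();
            return "beforeRefreshed called by a different thread. current [" + current.getName()
                + "], thread that called beforeRefresh [" + recordedName + "]";
        }
    }

    // Both hooks run on the calling thread, so the check passes.
    static String sameThread() {
        ThreadCheckingListener listener = new ThreadCheckingListener();
        listener.beforeRefresh();
        return listener.afterRefresh();
    }

    // The hooks run on two different threads sharing one listener, so the check fails.
    static String differentThreads() {
        ThreadCheckingListener listener = new ThreadCheckingListener();
        String[] result = new String[1];
        Thread first = new Thread(listener::beforeRefresh, "generic-T#5");
        Thread second = new Thread(() -> result[0] = listener.afterRefresh(), "generic-T#4");
        try {
            first.start();
            first.join();
            second.start();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("same thread: " + sameThread());
        System.out.println("different threads: " + differentThreads());
    }
}
```

In this sketch the failure is forced deterministically by running the two hooks on separate threads. In the real test the interleaving depends on scheduling, which would be consistent with the failure being intermittent.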

Related component

Indexing:Replication

To Reproduce

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWithQueuedOperationsDuringHandoff" -Dtests.seed=F86C333F9F578BBD -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Australia/Currie -Druntime.java=21
NOTE: test params are: codec=Asserting(Lucene99): {index_uuid=PostingsFormat(name=Asserting), type=PostingsFormat(name=Asserting)}, docValues:{}, maxPointsInLeafNode=1922, maxMBSortInHeap=5.46729700825271, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=es-VE, timezone=Australia/Currie

Expected behavior

The test should pass consistently.

Additional Details

Screenshots
N/A

Host/Environment (please complete the following information):
CI

Additional context
https://build.ci.opensearch.org/job/gradle-check/39111/testReport/junit/org.opensearch.indices.replication/SegmentReplicationRelocationIT/testRelocateWithQueuedOperationsDuringHandoff/

Metadata

Assignees

No one assigned

Labels

Indexing:Replication, bug, flaky-test, lucene
