Skip to content

fix flakey test testDecommissionNodeNoReplicas#18537

Merged
andrross merged 1 commit into
opensearch-project:mainfrom
guojialiang92:dev/fix_testDecommissionNodeNoReplica
Jun 17, 2025
Merged

fix flakey test testDecommissionNodeNoReplicas#18537
andrross merged 1 commit into
opensearch-project:mainfrom
guojialiang92:dev/fix_testDecommissionNodeNoReplica

Conversation

@guojialiang92
Copy link
Copy Markdown
Contributor

Description

I first found this flakey test in link, and I added some logs in the branch for analysis.

Reproduce

There may be 1 to 2 failures per 100 tests.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.allocation.FilteringAllocationIT.testDecommissionNodeNoReplicas" -Dtests.seed=C7410BE30403BEEF -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=tr -Dtests.timezone=Antarctica/Troll -Druntime.java=21
[2025-06-05T06:02:23,626][INFO ][o.o.c.a.FilteringAllocationIT] [testDecommissionNodeNoReplicas] --> verify all are allocated on node1 now
[2025-06-05T06:02:23,633][INFO ][o.o.c.m.MetadataIndexStateService] [node_t0] opening indices [[test/9PmfDww_Q9CJoI2C0pb-ow]]
[2025-06-05T06:02:23,635][INFO ][o.o.p.PluginsService     ] [node_t0] PluginService:onIndexModule index:[test/9PmfDww_Q9CJoI2C0pb-ow]
[2025-06-05T06:02:23,792][ERROR][o.o.c.r.BatchedRerouteService] [node_t0] unexpected failure during [cluster_reroute(async_shard_batch_fetch)], current state version [14]
java.lang.NullPointerException: Cannot invoke "org.opensearch.gateway.TransportNodesGatewayStartedShardHelper$GatewayStartedShard.allocationId()" because "shardData" is null
	at org.opensearch.gateway.PrimaryShardBatchAllocator.lambda$adaptToNodeShardStates$0(PrimaryShardBatchAllocator.java:153) ~[main/:?]
	at java.base/java.util.HashMap.forEach(HashMap.java:1429) ~[?:?]
	at org.opensearch.gateway.PrimaryShardBatchAllocator.adaptToNodeShardStates(PrimaryShardBatchAllocator.java:149) ~[main/:?]
	at org.opensearch.gateway.PrimaryShardBatchAllocator.allocateUnassignedBatch(PrimaryShardBatchAllocator.java:115) ~[main/:?]
	at org.opensearch.gateway.ShardsBatchGatewayAllocator$3.run(ShardsBatchGatewayAllocator.java:330) ~[main/:?]
	at org.opensearch.common.util.BatchRunnableExecutor.run(BatchRunnableExecutor.java:54) ~[main/:?]
	at java.base/java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.opensearch.cluster.routing.allocation.AllocationService.allocateAllUnassignedShards(AllocationService.java:653) ~[main/:?]
	at org.opensearch.cluster.routing.allocation.AllocationService.allocateExistingUnassignedShards(AllocationService.java:625) ~[main/:?]
	at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:601) ~[main/:?]
	at org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:566) ~[main/:?]
	at org.opensearch.cluster.routing.BatchedRerouteService$1.execute(BatchedRerouteService.java:136) ~[main/:?]
	at org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67) ~[main/:?]
	at org.opensearch.cluster.service.ClusterManagerService.executeTasks(ClusterManagerService.java:889) ~[main/:?]
	at org.opensearch.cluster.service.ClusterManagerService.calculateTaskOutputs(ClusterManagerService.java:441) ~[main/:?]
	at org.opensearch.cluster.service.ClusterManagerService.runTasks(ClusterManagerService.java:301) ~[main/:?]
	at org.opensearch.cluster.service.ClusterManagerService$Batcher.run(ClusterManagerService.java:214) ~[main/:?]
	at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:206) ~[main/:?]
	at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:264) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[main/:?]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) ~[main/:?]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) ~[main/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-05T06:03:53,804][INFO ][o.o.c.a.FilteringAllocationIT] [testDecommissionNodeNoReplicas] after test

Analysis

After repeated verification, the test will only fail when condition closed == true.

Under this condition, it will first close the index test and then exclude node_1. When all shards has been migrated to node_0, the index test will be opened. In rare cases, an NPE will be generated after opening the index test.

I found that in InternalPrimaryBatchShardAllocator#fetchData, it is possible to return a AsyncShardFetch.FetchResult where the AsyncShardFetch.FetchResult#getData contains a DiscoveryNode to an empty NodeGatewayStartedShardsBatch#nodeGatewayStartedShardsBatch Map.

In AsyncShardFetch#fetchData, information is asynchronously collected through TransportNodesListGatewayStartedShardsBatch and placed in AsyncShardFetch#cache, and then fetchData is retrieved from the AsyncShardFetch#cache.

In TransportNodesListGatewayStartedShardsBatch#nodeOperation, if TransportNodesGatewayStartedShardHelper#getShardInfoOnLocalNode throws an exception, the returned shardsOnNode may be a mapping from shardId to null value. This will cause the operation this.emptyShardResponse[shardIdKey.get(shardId)] = true to be skipped in method AsyncShardBatchFetch.ShardBatchCache.NodeEntry#fillShardData.

Finally, when getting the fetchData from the AsyncShardFetch#cache, AsyncShardBatchFetch.ShardBatchCache#getBatchData will be called and an empty map will be returned. This causes a NPE in PrimaryShardBatchAllocator#adaptToNodeShardStates.

In summary, since we cannot ensure that no exceptions occur in TransportNodesListGatewayStartedShardsBatch#nodeOperation, we need to handle null values in PrimaryShardBatchAllocator#adaptToNodeShardStates.

Related Issues

Resolves #[18310]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 requested a review from a team as a code owner June 17, 2025 09:45
@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 35fe8da: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 17, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.72%. Comparing base (a4dba6e) to head (35fe8da).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...opensearch/gateway/PrimaryShardBatchAllocator.java 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18537      +/-   ##
============================================
- Coverage     72.77%   72.72%   -0.06%     
+ Complexity    68170    68116      -54     
============================================
  Files          5540     5540              
  Lines        313384   313385       +1     
  Branches      45473    45474       +1     
============================================
- Hits         228051   227894     -157     
- Misses        66811    66993     +182     
+ Partials      18522    18498      -24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross andrross merged commit 8b95b19 into opensearch-project:main Jun 17, 2025
34 of 35 checks passed
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>Signed-off-by: TJ Neuenfeldt <tjneu@amazon.com>
neuenfeldttj pushed a commit to neuenfeldttj/OpenSearch that referenced this pull request Jun 26, 2025
Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants