
Fix for RemoteIndexPrimaryRelocationIT Flaky test #21249

Merged
andrross merged 5 commits into opensearch-project:main from divyaruhil:fix_flaky on Apr 27, 2026

Conversation

@divyaruhil
Contributor

Description

The test was failing with IllegalArgumentException: [move_allocation] can't move 0, failed to find it on node, caused by a race condition. The parent class IndexPrimaryRelocationIT was:

  1. Fetching cluster state once at the beginning
  2. Tracking the shard location manually with relocationSource = relocationTarget after each iteration
  3. As a result, the tracked location went stale whenever the shard's actual location diverged from it (see the pre-fix sketch below)
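
For context, here is a minimal sketch of the pre-fix loop shape, reconstructed from the description above and the snippets quoted later in this thread (names such as relocationCount are illustrative; this is not the verbatim parent-class code):

ClusterState initialState = client().admin().cluster().prepareState().get().getState();
DiscoveryNode[] dataNodes = initialState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
// Shard location resolved once, from the initial state only.
DiscoveryNode relocationSource = initialState.getNodes()
    .getDataNodes()
    .get(initialState.getRoutingTable().shardRoutingTable("test", 0).primaryShard().currentNodeId());
for (int i = 0; i < relocationCount; i++) {
    DiscoveryNode relocationTarget = randomFrom(dataNodes);
    while (relocationTarget.equals(relocationSource)) {
        relocationTarget = randomFrom(dataNodes);
    }
    client().admin()
        .cluster()
        .prepareReroute()
        .add(new MoveAllocationCommand("test", 0, relocationSource.getName(), relocationTarget.getName()))
        .execute()
        .actionGet();
    // Manual bookkeeping: if the shard actually ends up anywhere other than
    // relocationTarget, this tracked value goes stale and the next move command
    // fails with "[move_allocation] can't move 0, failed to find it on node".
    relocationSource = relocationTarget;
}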

Fix Applied:

  1. Fetch fresh cluster state before each relocation: added ClusterState currentState = client().admin().cluster().prepareState().get().getState(); inside the loop
  2. Get the current shard location from the fresh state: query the current routing table to find where the primary shard actually is
  3. Removed stale state tracking: deleted relocationSource = relocationTarget; since the actual location is now fetched each time (see the sketch after this list)
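
Putting it together, the per-iteration logic after the fix looks roughly like this (assembled from the snippets quoted in the review comments below; see IndexPrimaryRelocationIT for the authoritative code):

for (int i = 0; i < relocationCount; i++) {
    // Fetch fresh cluster state to get the current shard location and available nodes.
    ClusterState currentState = client().admin().cluster().prepareState().get().getState();
    DiscoveryNode[] dataNodes = currentState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
    ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
    DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
    DiscoveryNode relocationTarget = randomFrom(dataNodes);
    while (relocationTarget.equals(relocationSource)) {
        relocationTarget = randomFrom(dataNodes);
    }
    client().admin()
        .cluster()
        .prepareReroute()
        .add(new MoveAllocationCommand("test", 0, relocationSource.getName(), relocationTarget.getName()))
        .execute()
        .actionGet();
    // No trailing relocationSource = relocationTarget: the next iteration
    // re-reads the routing table instead of trusting a tracked value.
}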

Related Issues

Resolves #20872

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@divyaruhil divyaruhil requested a review from a team as a code owner April 16, 2026 16:18
@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

PR Reviewer Guide 🔍

(Review updated until commit bc32048)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Infinite Loop Risk

If there is only one data node available, the while (relocationTarget.equals(relocationSource)) loop will spin forever, since every random selection returns the same single node. This was a pre-existing issue, but the fix now fetches dataNodes inside the loop on each iteration, so it's worth verifying that the cluster is guaranteed to have at least 2 data nodes before this loop runs.

DiscoveryNode relocationTarget = randomFrom(dataNodes);
while (relocationTarget.equals(relocationSource)) {
    relocationTarget = randomFrom(dataNodes);
}
Null Pointer Risk

primaryShard could be null if the primary shard is not yet assigned or is in a transitional state when the fresh cluster state is fetched. There is no null check before calling primaryShard.currentNodeId(), which could cause a NullPointerException during the test.

ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());

@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

PR Code Suggestions ✨

Latest suggestions up to bc32048

Explore these optional code suggestions:

Category: Possible issue
Add null checks for shard assignment state

The primaryShard or its currentNodeId() could be null if the shard is not yet
assigned or is in a transitional state. This would cause a NullPointerException when
calling currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId()).
Add null checks before proceeding with the relocation logic.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [87-88]

 ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+if (primaryShard == null || primaryShard.currentNodeId() == null) {
+    continue;
+}
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
+if (relocationSource == null) {
+    continue;
+}
Suggestion importance[1-10]: 5

Why: The suggestion is valid - primaryShard or currentNodeId() could theoretically be null during transitional states, causing a NullPointerException. However, in this test context the shard should always be assigned, and silently continue-ing could mask test failures rather than surfacing them properly.

Impact: Low

Previous suggestions

Suggestions up to commit bc32048
Category: Possible issue
Add null checks for unassigned primary shard

The primaryShard could be null if no primary shard is currently assigned (e.g.,
during a previous relocation), and primaryShard.currentNodeId() could also return
null if the shard is unassigned. This would cause a NullPointerException. Add null
checks before dereferencing these values.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [87-88]

 ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+if (primaryShard == null || !primaryShard.assignedToNode()) {
+    logger.info("--> [iteration {}] primary shard not yet assigned, retrying...", i);
+    i--;
+    continue;
+}
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
Suggestion importance[1-10]: 5

Why: While the null check for primaryShard is a valid defensive programming concern, in this test context the shard should always be assigned since the test waits for cluster health before proceeding. The suggestion adds retry logic with i-- which changes test behavior and may not be the right approach for a test that expects a stable cluster state.

Impact: Low
Suggestions up to commit a5e0cde
Category: Possible issue
Add null checks for unassigned shard state

The primaryShard could be null if the shard is not yet assigned, or
primaryShard.currentNodeId() could return null if the shard is in an unassigned
state. This would cause a NullPointerException when looking up the node. Add null
checks before proceeding with the relocation logic.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [87-88]

 ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+if (primaryShard == null || primaryShard.currentNodeId() == null) {
+    continue;
+}
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
+if (relocationSource == null) {
+    continue;
+}
Suggestion importance[1-10]: 5

Why: The suggestion is valid - primaryShard or primaryShard.currentNodeId() could theoretically be null if the shard is unassigned, causing a NullPointerException. However, in this test context, the shard should always be assigned and active, making this a low-probability edge case. The fix is reasonable but may be overly defensive for a test scenario.

Impact: Low
Suggestions up to commit f99e571
Category: Possible issue
Guard against infinite retry loops

The retry logic using i-- followed by continue can cause an infinite loop if the
primary shard is persistently unassigned or the source node is consistently missing.
A maximum retry count or a sleep/backoff should be added to prevent the test from
hanging indefinitely.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [88-99]

+int retryCount = 0;
+final int MAX_RETRIES = 10;
 if (primaryShard == null || primaryShard.currentNodeId() == null) {
     logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Primary shard not assigned after " + MAX_RETRIES + " retries at iteration " + i);
+    }
     i--; // retry this iteration
     continue;
 }
 
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
 if (relocationSource == null) {
     logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Source node not found after " + MAX_RETRIES + " retries at iteration " + i);
+    }
     i--; // retry this iteration
     continue;
 }
Suggestion importance[1-10]: 6

Why: The i-- retry logic could cause an infinite loop if the primary shard remains unassigned or the source node is persistently missing. Adding a max retry count with an assertion failure is a valid defensive improvement, though in practice this scenario is unlikely in a controlled test environment.

Impact: Low
Prevent infinite loop with single data node

If dataNodes contains only one node (equal to relocationSource), the while loop will
spin forever. A guard check should be added to ensure there is at least one node
different from relocationSource before entering the loop.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [101-104]

+long distinctTargets = Arrays.stream(dataNodes).filter(n -> !n.equals(relocationSource)).count();
+if (distinctTargets == 0) {
+    logger.warn("--> [iteration {}] no alternative data node available for relocation, skipping", i);
+    continue;
+}
 DiscoveryNode relocationTarget = randomFrom(dataNodes);
 while (relocationTarget.equals(relocationSource)) {
     relocationTarget = randomFrom(dataNodes);
 }
Suggestion importance[1-10]: 5

Why: If dataNodes contains only one node matching relocationSource, the while loop would spin indefinitely. Adding a guard check before the loop is a valid defensive improvement, though the test setup typically ensures multiple data nodes exist.

Impact: Low
Suggestions up to commit af8ec38
Category: Possible issue
Add retry limit to prevent infinite loop

Decrementing i before continue to retry the same iteration is risky because there is
no retry limit or sleep/backoff, which can cause an infinite loop if the shard
remains unassigned or the node is persistently missing. Consider adding a maximum
retry count or a brief sleep between retries to prevent the test from hanging
indefinitely.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [88-99]

+int retryCount = 0;
+final int MAX_RETRIES = 10;
 if (primaryShard == null || primaryShard.currentNodeId() == null) {
     logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
-    i--; // retry this iteration
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Primary shard not assigned after " + MAX_RETRIES + " retries at iteration " + i);
+    }
+    Thread.sleep(500);
     continue;
 }

 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
 if (relocationSource == null) {
     logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
-    i--; // retry this iteration
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Relocation source node not found after " + MAX_RETRIES + " retries at iteration " + i);
+    }
+    Thread.sleep(500);
     continue;
 }
+retryCount = 0;
Suggestion importance[1-10]: 6

Why: The i-- retry pattern without a limit or backoff could cause an infinite loop if the shard remains unassigned. The suggestion to add a max retry count and sleep is valid, though the improved_code has a scoping issue with retryCount being declared inside the loop body.

Impact: Low
Guard against insufficient data nodes for relocation

The dataNodes array is fetched inside the loop but used to pick a relocationTarget
without verifying the target node is different from the current primary's node in
the refreshed state. If dataNodes has only one element, the while
(relocationTarget.equals(relocationSource)) loop will spin forever. Consider adding
a guard to ensure there are at least two data nodes before attempting relocation.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [85]

 DiscoveryNode[] dataNodes = currentState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
+if (dataNodes.length < 2) {
+    throw new AssertionError("Expected at least 2 data nodes for relocation test, but found: " + dataNodes.length);
+}
Suggestion importance[1-10]: 5

Why: The concern about an infinite loop in while (relocationTarget.equals(relocationSource)) when only one data node exists is valid, but this is a pre-existing issue and the test setup likely guarantees multiple nodes. The guard would improve robustness with a clear failure message.

Impact: Low
Suggestions up to commit 2f6d1cb
Category: Possible issue
Add retry limit to prevent infinite loop

Decrementing i and using continue to retry an iteration is fragile and can cause an
infinite loop if the cluster state never recovers (e.g., primary shard remains
unassigned indefinitely). A retry counter or a bounded sleep/wait mechanism should
be used to avoid hanging the test forever.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [88-99]

-if (primaryShard == null || primaryShard.currentNodeId() == null) {
-    logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
-    i--; // retry this iteration
-    continue;
+int retries = 0;
+final int MAX_RETRIES = 10;
+while (retries < MAX_RETRIES) {
+    ClusterState currentState = client().admin().cluster().prepareState().get().getState();
+    ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+    if (primaryShard == null || primaryShard.currentNodeId() == null) {
+        logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
+        retries++;
+        Thread.sleep(500);
+        continue;
+    }
+    DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
+    if (relocationSource == null) {
+        logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
+        retries++;
+        Thread.sleep(500);
+        continue;
+    }
+    // proceed with relocation using relocationSource
+    break;
+}
+if (retries >= MAX_RETRIES) {
+    throw new AssertionError("Primary shard not assigned after " + MAX_RETRIES + " retries at iteration [" + i + "]");
 }
 
-DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
-if (relocationSource == null) {
-    logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
-    i--; // retry this iteration
-    continue;
-}
-
Suggestion importance[1-10]: 6

Why: The i-- + continue pattern can cause an infinite loop if the primary shard remains unassigned indefinitely. A bounded retry mechanism with a max retry count would be safer, though this is a test file where infinite loops are less critical but still problematic.

Impact: Low
Guard against insufficient data nodes for relocation

If dataNodes has only one element (i.e., only one data node), the while
(relocationTarget.equals(relocationSource)) loop below will spin forever since there
is no other node to relocate to. A guard check should be added to ensure there are
at least two data nodes before attempting relocation.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [83-85]

 // Fetch fresh cluster state to get current shard location and available nodes
 ClusterState currentState = client().admin().cluster().prepareState().get().getState();
 DiscoveryNode[] dataNodes = currentState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
+if (dataNodes.length < 2) {
+    logger.warn("--> [iteration {}] fewer than 2 data nodes available, skipping relocation", i);
+    continue;
+}
Suggestion importance[1-10]: 5

Why: If only one data node is available, the while (relocationTarget.equals(relocationSource)) loop will spin forever. Adding a guard for dataNodes.length < 2 is a valid defensive check, though in practice this test likely requires multiple nodes to run meaningfully.

Impact: Low

@github-actions
Contributor

Persistent review updated to latest commit dae38ed

@github-actions
Contributor

Persistent review updated to latest commit 2f6d1cb

@github-actions
Contributor

❌ Gradle check result for 2f6d1cb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit af8ec38

@github-actions
Contributor

❌ Gradle check result for af8ec38: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit f99e571

@github-actions
Contributor

❌ Gradle check result for f99e571: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for f99e571: SUCCESS

@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.42%. Comparing base (436e4a6) to head (bc32048).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21249      +/-   ##
============================================
+ Coverage     73.34%   73.42%   +0.07%     
- Complexity    74223    74261      +38     
============================================
  Files          5958     5958              
  Lines        337309   337309              
  Branches      48664    48664              
============================================
+ Hits         247408   247668     +260     
+ Misses        70188    69865     -323     
- Partials      19713    19776      +63     

☔ View full report in Codecov by Sentry.

@divyaruhil
Contributor Author

Hi @andrross, just a gentle nudge, please review the PR whenever it’s convenient for you. Thank you!

DIVYA2 and others added 4 commits April 21, 2026 16:27
Signed-off-by: Divya <divyruhil999@gmail.com>
Signed-off-by: Divya <divyruhil999@gmail.com>
Signed-off-by: Divya <divyruhil999@gmail.com>
Signed-off-by: Divya <divyruhil999@gmail.com>
@github-actions
Contributor

Persistent review updated to latest commit a5e0cde

@github-actions
Contributor

✅ Gradle check result for a5e0cde: SUCCESS

@divyaruhil divyaruhil requested a review from andrross April 21, 2026 12:25
@divyaruhil
Contributor Author

Hi @andrross, please let me know if there are any more review comments. I’d appreciate it if you could review when you have time.

@github-actions
Contributor

Persistent review updated to latest commit bc32048

@github-actions
Contributor

❌ Gradle check result for bc32048: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit f0373c1.

'Diff too large, requires skip by maintainers after manual review'


Pull Request Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Contributor

Persistent review updated to latest commit bc32048

@github-actions
Contributor

❕ Gradle check result for bc32048: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross andrross merged commit 9bcf4f0 into opensearch-project:main Apr 27, 2026
28 of 32 checks passed
krishna-ggk pushed a commit to krishna-ggk/OpenSearch that referenced this pull request Apr 28, 2026
…#21249)

Signed-off-by: Divya <divyruhil999@gmail.com>
Co-authored-by: DIVYA2 <DIVYA2@ibm.com>
Co-authored-by: Divya <divyruhil999@gmail.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
imRishN pushed a commit to imRishN/OpenSearch that referenced this pull request May 8, 2026
…#21249)

Signed-off-by: Divya <divyruhil999@gmail.com>
Co-authored-by: DIVYA2 <DIVYA2@ibm.com>
Co-authored-by: Divya <divyruhil999@gmail.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
