
Fix for RemoteIndexPrimaryRelocationIT Flaky test #21249

Merged
andrross merged 5 commits into opensearch-project:main from divyaruhil:fix_flaky on Apr 27, 2026

Conversation

@divyaruhil
Contributor

Description

The test was failing with IllegalArgumentException: [move_allocation] can't move 0, failed to find it on node, caused by a race condition. The parent class IndexPrimaryRelocationIT was:

  1. Fetching cluster state once at the beginning
  2. Tracking the shard location manually with relocationSource = relocationTarget after each iteration
  3. As a result, the tracked location went stale whenever the shard's actual location diverged from it (see the pre-fix sketch below)
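
For context, here is a minimal sketch of the pre-fix loop shape, reconstructed from the description above and the snippets quoted later in this thread (names such as relocationCount are illustrative; this is not the verbatim parent-class code):

ClusterState initialState = client().admin().cluster().prepareState().get().getState();
DiscoveryNode[] dataNodes = initialState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
// Shard location resolved once, from the initial state only.
DiscoveryNode relocationSource = initialState.getNodes()
    .getDataNodes()
    .get(initialState.getRoutingTable().shardRoutingTable("test", 0).primaryShard().currentNodeId());
for (int i = 0; i < relocationCount; i++) {
    DiscoveryNode relocationTarget = randomFrom(dataNodes);
    while (relocationTarget.equals(relocationSource)) {
        relocationTarget = randomFrom(dataNodes);
    }
    client().admin()
        .cluster()
        .prepareReroute()
        .add(new MoveAllocationCommand("test", 0, relocationSource.getName(), relocationTarget.getName()))
        .execute()
        .actionGet();
    // Manual bookkeeping: if the shard actually ends up anywhere other than
    // relocationTarget, this tracked value goes stale and the next move command
    // fails with "[move_allocation] can't move 0, failed to find it on node".
    relocationSource = relocationTarget;
}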

Fix Applied:

  1. Fetch fresh cluster state before each relocation: added ClusterState currentState = client().admin().cluster().prepareState().get().getState(); inside the loop
  2. Get the current shard location from the fresh state: query the current routing table to find where the primary shard actually is
  3. Removed stale state tracking: deleted relocationSource = relocationTarget; since the actual location is now fetched each time (see the sketch after this list)
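
Putting it together, the per-iteration logic after the fix looks roughly like this (assembled from the snippets quoted in the review comments below; see IndexPrimaryRelocationIT for the authoritative code):

for (int i = 0; i < relocationCount; i++) {
    // Fetch fresh cluster state to get the current shard location and available nodes.
    ClusterState currentState = client().admin().cluster().prepareState().get().getState();
    DiscoveryNode[] dataNodes = currentState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
    ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
    DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
    DiscoveryNode relocationTarget = randomFrom(dataNodes);
    while (relocationTarget.equals(relocationSource)) {
        relocationTarget = randomFrom(dataNodes);
    }
    client().admin()
        .cluster()
        .prepareReroute()
        .add(new MoveAllocationCommand("test", 0, relocationSource.getName(), relocationTarget.getName()))
        .execute()
        .actionGet();
    // No trailing relocationSource = relocationTarget: the next iteration
    // re-reads the routing table instead of trusting a tracked value.
}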

Related Issues

Resolves #20872

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@divyaruhil divyaruhil requested a review from a team as a code owner April 16, 2026 16:18
@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

PR Reviewer Guide 🔍

(Review updated until commit bc32048)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Infinite Loop Risk

If there is only one data node available, the while (relocationTarget.equals(relocationSource)) loop will spin forever, since every random selection returns the same single node. This was a pre-existing issue, but the fix now fetches dataNodes inside the loop on each iteration, so it's worth verifying that the cluster is guaranteed to have at least 2 data nodes before this loop runs.

DiscoveryNode relocationTarget = randomFrom(dataNodes);
while (relocationTarget.equals(relocationSource)) {
    relocationTarget = randomFrom(dataNodes);
}
Null Pointer Risk

primaryShard could be null if the primary shard is not yet assigned or is in a transitional state when the fresh cluster state is fetched. There is no null check before calling primaryShard.currentNodeId(), which could cause a NullPointerException during the test.

ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());

@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

PR Code Suggestions ✨

Latest suggestions up to bc32048

Explore these optional code suggestions:

Category: Possible issue
Add null checks for shard assignment state

The primaryShard or its currentNodeId() could be null if the shard is not yet
assigned or is in a transitional state. This would cause a NullPointerException when
calling currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId()).
Add null checks before proceeding with the relocation logic.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [87-88]

 ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+if (primaryShard == null || primaryShard.currentNodeId() == null) {
+    continue;
+}
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
+if (relocationSource == null) {
+    continue;
+}
Suggestion importance[1-10]: 5

Why: The suggestion is valid - primaryShard or currentNodeId() could theoretically be null during transitional states, causing a NullPointerException. However, in this test context the shard should always be assigned, and silently continue-ing could mask test failures rather than surfacing them properly.

Impact: Low

Previous suggestions

Suggestions up to commit bc32048
Category: Possible issue
Add null checks for unassigned primary shard

The primaryShard could be null if no primary shard is currently assigned (e.g.,
during a previous relocation), and primaryShard.currentNodeId() could also return
null if the shard is unassigned. This would cause a NullPointerException. Add null
checks before dereferencing these values.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [87-88]

 ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+if (primaryShard == null || !primaryShard.assignedToNode()) {
+    logger.info("--> [iteration {}] primary shard not yet assigned, retrying...", i);
+    i--;
+    continue;
+}
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
Suggestion importance[1-10]: 5

Why: While the null check for primaryShard is a valid defensive programming concern, in this test context the shard should always be assigned since the test waits for cluster health before proceeding. The suggestion adds retry logic with i-- which changes test behavior and may not be the right approach for a test that expects a stable cluster state.

Impact: Low
Suggestions up to commit a5e0cde
Category: Possible issue
Add null checks for unassigned shard state

The primaryShard could be null if the shard is not yet assigned, or
primaryShard.currentNodeId() could return null if the shard is in an unassigned
state. This would cause a NullPointerException when looking up the node. Add null
checks before proceeding with the relocation logic.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [87-88]

 ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+if (primaryShard == null || primaryShard.currentNodeId() == null) {
+    continue;
+}
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
+if (relocationSource == null) {
+    continue;
+}
Suggestion importance[1-10]: 5

Why: The suggestion is valid - primaryShard or primaryShard.currentNodeId() could theoretically be null if the shard is unassigned, causing a NullPointerException. However, in this test context, the shard should always be assigned and active, making this a low-probability edge case. The fix is reasonable but may be overly defensive for a test scenario.

Impact: Low
Suggestions up to commit f99e571
Category: Possible issue
Guard against infinite retry loops

The retry logic using i-- followed by continue can cause an infinite loop if the
primary shard is persistently unassigned or the source node is consistently missing.
A maximum retry count or a sleep/backoff should be added to prevent the test from
hanging indefinitely.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [88-99]

+int retryCount = 0;
+final int MAX_RETRIES = 10;
 if (primaryShard == null || primaryShard.currentNodeId() == null) {
     logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Primary shard not assigned after " + MAX_RETRIES + " retries at iteration " + i);
+    }
     i--; // retry this iteration
     continue;
 }
 
 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
 if (relocationSource == null) {
     logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Source node not found after " + MAX_RETRIES + " retries at iteration " + i);
+    }
     i--; // retry this iteration
     continue;
 }
Suggestion importance[1-10]: 6

Why: The i-- retry logic could cause an infinite loop if the primary shard remains unassigned or the source node is persistently missing. Adding a max retry count with an assertion failure is a valid defensive improvement, though in practice this scenario is unlikely in a controlled test environment.

Impact: Low
Prevent infinite loop with single data node

If dataNodes contains only one node (equal to relocationSource), the while loop will
spin forever. A guard check should be added to ensure there is at least one node
different from relocationSource before entering the loop.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [101-104]

+long distinctTargets = Arrays.stream(dataNodes).filter(n -> !n.equals(relocationSource)).count();
+if (distinctTargets == 0) {
+    logger.warn("--> [iteration {}] no alternative data node available for relocation, skipping", i);
+    continue;
+}
 DiscoveryNode relocationTarget = randomFrom(dataNodes);
 while (relocationTarget.equals(relocationSource)) {
     relocationTarget = randomFrom(dataNodes);
 }
Suggestion importance[1-10]: 5

Why: If dataNodes contains only one node matching relocationSource, the while loop would spin indefinitely. Adding a guard check before the loop is a valid defensive improvement, though the test setup typically ensures multiple data nodes exist.

Impact: Low
Suggestions up to commit af8ec38
Category: Possible issue
Add retry limit to prevent infinite loop

Decrementing i before continue to retry the same iteration is risky because there is
no retry limit or sleep/backoff, which can cause an infinite loop if the shard
remains unassigned or the node is persistently missing. Consider adding a maximum
retry count or a brief sleep between retries to prevent the test from hanging
indefinitely.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [88-99]

+int retryCount = 0;
+final int MAX_RETRIES = 10;
 if (primaryShard == null || primaryShard.currentNodeId() == null) {
     logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
-    i--; // retry this iteration
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Primary shard not assigned after " + MAX_RETRIES + " retries at iteration " + i);
+    }
+    Thread.sleep(500);
     continue;
 }

 DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
 if (relocationSource == null) {
     logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
-    i--; // retry this iteration
+    if (++retryCount > MAX_RETRIES) {
+        throw new AssertionError("Relocation source node not found after " + MAX_RETRIES + " retries at iteration " + i);
+    }
+    Thread.sleep(500);
     continue;
 }
+retryCount = 0;
Suggestion importance[1-10]: 6

Why: The i-- retry pattern without a limit or backoff could cause an infinite loop if the shard remains unassigned. The suggestion to add a max retry count and sleep is valid, though the improved_code has a scoping issue with retryCount being declared inside the loop body.

Impact: Low
Guard against insufficient data nodes for relocation

The dataNodes array is fetched inside the loop but used to pick a relocationTarget
without verifying the target node is different from the current primary's node in
the refreshed state. If dataNodes has only one element, the while
(relocationTarget.equals(relocationSource)) loop will spin forever. Consider adding
a guard to ensure there are at least two data nodes before attempting relocation.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [85]

 DiscoveryNode[] dataNodes = currentState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
+if (dataNodes.length < 2) {
+    throw new AssertionError("Expected at least 2 data nodes for relocation test, but found: " + dataNodes.length);
+}
Suggestion importance[1-10]: 5

Why: The concern about an infinite loop in while (relocationTarget.equals(relocationSource)) when only one data node exists is valid, but this is a pre-existing issue and the test setup likely guarantees multiple nodes. The guard would improve robustness with a clear failure message.

Impact: Low
Suggestions up to commit 2f6d1cb
Category: Possible issue
Add retry limit to prevent infinite loop

Decrementing i and using continue to retry an iteration is fragile and can cause an
infinite loop if the cluster state never recovers (e.g., primary shard remains
unassigned indefinitely). A retry counter or a bounded sleep/wait mechanism should
be used to avoid hanging the test forever.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [88-99]

-if (primaryShard == null || primaryShard.currentNodeId() == null) {
-    logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
-    i--; // retry this iteration
-    continue;
+int retries = 0;
+final int MAX_RETRIES = 10;
+while (retries < MAX_RETRIES) {
+    ClusterState currentState = client().admin().cluster().prepareState().get().getState();
+    ShardRouting primaryShard = currentState.getRoutingTable().shardRoutingTable("test", 0).primaryShard();
+    if (primaryShard == null || primaryShard.currentNodeId() == null) {
+        logger.warn("--> [iteration {}] primary shard not found or not assigned, retrying", i);
+        retries++;
+        Thread.sleep(500);
+        continue;
+    }
+    DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
+    if (relocationSource == null) {
+        logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
+        retries++;
+        Thread.sleep(500);
+        continue;
+    }
+    // proceed with relocation using relocationSource
+    break;
+}
+if (retries >= MAX_RETRIES) {
+    throw new AssertionError("Primary shard not assigned after " + MAX_RETRIES + " retries at iteration [" + i + "]");
 }
 
-DiscoveryNode relocationSource = currentState.getNodes().getDataNodes().get(primaryShard.currentNodeId());
-if (relocationSource == null) {
-    logger.warn("--> [iteration {}] source node not found in cluster, retrying", i);
-    i--; // retry this iteration
-    continue;
-}
-
Suggestion importance[1-10]: 6

Why: The i-- + continue pattern can cause an infinite loop if the primary shard remains unassigned indefinitely. A bounded retry mechanism with a max retry count would be safer, though this is a test file where infinite loops are less critical but still problematic.

Impact: Low
Guard against insufficient data nodes for relocation

If dataNodes has only one element (i.e., only one data node), the while
(relocationTarget.equals(relocationSource)) loop below will spin forever since there
is no other node to relocate to. A guard check should be added to ensure there are
at least two data nodes before attempting relocation.

server/src/internalClusterTest/java/org/opensearch/indices/recovery/IndexPrimaryRelocationIT.java [83-85]

 // Fetch fresh cluster state to get current shard location and available nodes
 ClusterState currentState = client().admin().cluster().prepareState().get().getState();
 DiscoveryNode[] dataNodes = currentState.getNodes().getDataNodes().values().toArray(new DiscoveryNode[0]);
+if (dataNodes.length < 2) {
+    logger.warn("--> [iteration {}] fewer than 2 data nodes available, skipping relocation", i);
+    continue;
+}
Suggestion importance[1-10]: 5

Why: If only one data node is available, the while (relocationTarget.equals(relocationSource)) loop will spin forever. Adding a guard for dataNodes.length < 2 is a valid defensive check, though in practice this test likely requires multiple nodes to run meaningfully.

Impact: Low

@github-actions
Contributor

Persistent review updated to latest commit dae38ed

@github-actions
Contributor

Persistent review updated to latest commit 2f6d1cb

@github-actions
Contributor

❌ Gradle check result for 2f6d1cb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit af8ec38

@github-actions
Contributor

❌ Gradle check result for af8ec38: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Persistent review updated to latest commit f99e571

@github-actions
Contributor

❌ Gradle check result for f99e571: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

✅ Gradle check result for f99e571: SUCCESS

@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.42%. Comparing base (436e4a6) to head (bc32048).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21249      +/-   ##
============================================
+ Coverage     73.34%   73.42%   +0.07%     
- Complexity    74223    74261      +38     
============================================
  Files          5958     5958              
  Lines        337309   337309              
  Branches      48664    48664              
============================================
+ Hits         247408   247668     +260     
+ Misses        70188    69865     -323     
- Partials      19713    19776      +63     

☔ View full report in Codecov by Sentry.

@divyaruhil
Contributor Author

Hi @andrross, just a gentle nudge, please review the PR whenever it’s convenient for you. Thank you!

DIVYA2 and others added 4 commits April 21, 2026 16:27
Signed-off-by: Divya <divyruhil999@gmail.com>
Signed-off-by: Divya <divyruhil999@gmail.com>
Signed-off-by: Divya <divyruhil999@gmail.com>
Signed-off-by: Divya <divyruhil999@gmail.com>
@github-actions
Contributor

Persistent review updated to latest commit a5e0cde

@github-actions
Contributor

✅ Gradle check result for a5e0cde: SUCCESS

@divyaruhil divyaruhil requested a review from andrross April 21, 2026 12:25
@divyaruhil
Contributor Author

Hi @andrross, please let me know if there are any more review comments. I’d appreciate it if you could review when you have time.

@github-actions
Contributor

Persistent review updated to latest commit bc32048

@github-actions
Contributor

❌ Gradle check result for bc32048: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit f0373c1.

'Diff too large, requires skip by maintainers after manual review'


Pull Request Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Contributor

Persistent review updated to latest commit bc32048

@github-actions
Contributor

❕ Gradle check result for bc32048: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross andrross merged commit 9bcf4f0 into opensearch-project:main Apr 27, 2026
28 of 32 checks passed
krishna-ggk pushed a commit to krishna-ggk/OpenSearch that referenced this pull request Apr 28, 2026
…#21249)

Signed-off-by: Divya <divyruhil999@gmail.com>
Co-authored-by: DIVYA2 <DIVYA2@ibm.com>
Co-authored-by: Divya <divyruhil999@gmail.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
imRishN pushed a commit to imRishN/OpenSearch that referenced this pull request May 8, 2026
…#21249)

Signed-off-by: Divya <divyruhil999@gmail.com>
Co-authored-by: DIVYA2 <DIVYA2@ibm.com>
Co-authored-by: Divya <divyruhil999@gmail.com>
Co-authored-by: Andrew Ross <andrross@amazon.com>
