Skip to content

Fix a flaky test for issue 20058#20240

Merged
andrross merged 2 commits into
opensearch-project:mainfrom
liuguoqingfz:flakytest-20058
Dec 18, 2025
Merged

Fix a flaky test for issue 20058#20240
andrross merged 2 commits into
opensearch-project:mainfrom
liuguoqingfz:flakytest-20058

Conversation

@liuguoqingfz
Copy link
Copy Markdown
Contributor

@liuguoqingfz liuguoqingfz commented Dec 15, 2025

Description

Fix a flaky test where there's a timing/race where the snapshot sometimes starts while one primary shard is still initializing/relocating, so createFullSnapshot() observes totalShards=11 but only successfulShards=10.

Related Issues

Resolves #20058

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Enhanced snapshot cloning test stability by adding cluster health waits around key steps and replacing direct checks with polling assertions for lock file counts, reducing flakiness during snapshot, shallow-snapshot, and clone operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 15, 2025 15:27
@github-actions github-actions Bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 15, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 15, 2025

Walkthrough

Added cluster health waits and converted direct lock-file count assertions to polled assertBusy checks in the CloneSnapshotIT test around snapshot creation, shallow snapshot, and clone operations to reduce timing-related flakes.

Changes

Cohort / File(s) Summary
Test Flakiness Fix
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java
Added ensureYellow/cluster health waits around key steps; replaced direct lock-file count assertions with assertBusy-wrapped polls (waiting for counts 0, 1, 2) at snapshot and clone checkpoints; extended health polling after optional index deletion.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Check that each added health wait targets the correct cluster state (yellow, no initializing/relocating shards).
  • Verify assertBusy timeouts and polling intervals are reasonable and consistent.
  • Confirm assertions still fail fast on real functional regressions rather than masking issues.

Suggested reviewers

  • andrross
  • dbwiddis
  • msfroh
  • sachinpkale
  • owaiskazi19
  • saratvemulapalli
  • CEHENKLE
  • ashking94

Poem

🐇 I hopped through tests with careful paws,
Polled the locks and waited for applause,
Yellow lights checked, no races in sight,
Now snapshots clone with calmer delight. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the change as fixing a flaky test for issue 20058, which matches the main objective of the PR.
Description check ✅ Passed The description provides the required sections with clear explanation of the timing/race condition causing the flakiness, links to issue #20058, and includes standard checklist items.
Linked Issues check ✅ Passed The PR changes directly address the flaky CloneSnapshotIT.testCloneShallowSnapshotIndex test by adding health checks and assertBusy-wrapped assertions to eliminate timing race conditions documented in issue #20058.
Out of Scope Changes check ✅ Passed All changes are focused on fixing the flaky test through health polling and assertion wrapping; no out-of-scope modifications detected in the single modified file.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dc6303 and 59efde0.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, macos-15)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, macos-15)
  • GitHub Check: Analyze (java)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)

205-218: Remove unnecessary fully qualified class names.

ClusterState and SnapshotDeletionsInProgress are already imported at the top of this file (lines 44-45), and notNullValue should be added as a static import for consistency with the existing matcher imports. Using FQNs here is inconsistent with the rest of the codebase.

-        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
-        assertBusy(() -> {
-            final org.opensearch.cluster.ClusterState state = client().admin().cluster().prepareState().get().getState();
-            final org.opensearch.cluster.SnapshotDeletionsInProgress deletions = state.custom(
-                org.opensearch.cluster.SnapshotDeletionsInProgress.TYPE
-            );
-
-            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, org.hamcrest.Matchers.notNullValue());
-            assertThat(
-                deletions.getEntries().stream().map(org.opensearch.cluster.SnapshotDeletionsInProgress.Entry::repository).toList(),
-                hasItem(equalTo(repoName))
-            );
-        });
+        // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
+        assertBusy(() -> {
+            final ClusterState state = client().admin().cluster().prepareState().get().getState();
+            final SnapshotDeletionsInProgress deletions = state.custom(SnapshotDeletionsInProgress.TYPE);
+
+            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, notNullValue());
+            assertThat(
+                deletions.getEntries().stream().map(SnapshotDeletionsInProgress.Entry::repository).toList(),
+                hasItem(equalTo(repoName))
+            );
+        });

Add notNullValue to the static imports:

import static org.hamcrest.Matchers.notNullValue;

224-228: Clean up fully qualified names and add import for ExceptionsHelper.

The approach of catching Exception and unwrapping via ExceptionsHelper.unwrapCause() is correct for handling transport-layer wrapping, but the FQNs should be replaced with imports for consistency.

         // Transport can wrap the real exception; assert on the unwrapped root cause
-        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
-        final Throwable cause = org.opensearch.ExceptionsHelper.unwrapCause(ex);
-        assertThat(cause, org.hamcrest.Matchers.instanceOf(IllegalStateException.class));
+        final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
+        final Throwable cause = ExceptionsHelper.unwrapCause(ex);
+        assertThat(cause, instanceOf(IllegalStateException.class));
         assertEquals("trying to modify or unregister repository that is currently used", cause.getMessage());

Add ExceptionsHelper to the imports:

import org.opensearch.ExceptionsHelper;

Note: instanceOf is already statically imported at line 93.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d7f787 and 2ea5728.

📒 Files selected for processing (2)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)
server/src/main/java/org/opensearch/cluster/SnapshotDeletionsInProgress.java (1)
  • SnapshotDeletionsInProgress (59-391)
libs/core/src/main/java/org/opensearch/ExceptionsHelper.java (1)
  • ExceptionsHelper (76-476)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: gradle-check
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)
🔇 Additional comments (6)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (6)

180-189: Excellent fix for the race condition!

This assertBusy wrapper correctly addresses the flaky test issue described in the PR. By waiting for no initializing or relocating shards before taking the snapshot, you prevent the race condition where createFullSnapshot() would observe fewer successful shards than total shards.


193-193: Correct use of assertBusy for async file operations.

Wrapping the lock file count assertion in assertBusy properly handles the asynchronous nature of file creation/cleanup operations in the remote store.


197-206: Consistent stabilization pattern applied.

This health check ensures the cluster is stable after indexing additional documents and before creating the shallow snapshot, preventing the same race condition at this stage of the test.


210-210: LGTM!

Consistent application of assertBusy for lock file assertions.


214-223: Good defensive programming for conditional path.

Adding cluster stabilization after the optional index deletion ensures the subsequent clone operation doesn't encounter race conditions regardless of which test path is taken.


230-230: Completes the stabilization pattern correctly.

The final assertBusy wrapper for the lock file count after cloning completes the comprehensive fix. The logging statement on line 229 is a helpful addition for debugging if flakiness reoccurs.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)

157-231: The other flaky tests also lack pre-snapshot health checks.

testShallowCloneNameAvailability (line 258, 261) and testCloneAfterRepoShallowSettingDisabled (line 324) both call createFullSnapshot immediately after indexing, without the assertBusy cluster health validation that was added to testCloneShallowSnapshotIndex. They should be updated with similar health checks to prevent race conditions during snapshot and clone operations.

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1)

184-184: Consider adding an import for ClusterHealthStatus.

The fully qualified class name org.opensearch.cluster.health.ClusterHealthStatus.YELLOW is used in three places. Adding a static import would improve readability:

+import static org.opensearch.cluster.health.ClusterHealthStatus.YELLOW;

Then use:

-                .setWaitForStatus(org.opensearch.cluster.health.ClusterHealthStatus.YELLOW)
+                .setWaitForStatus(YELLOW)

Also applies to: 201-201, 218-218

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ea5728 and 1dc6303.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, macos-15-intel)
🔇 Additional comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java (2)

180-189: Excellent fix for the race condition.

The health checks with assertBusy directly address the timing issue where snapshots could start while shards are initializing or relocating. Waiting for YELLOW status with no initializing/relocating shards before snapshot operations ensures cluster stability.

Also applies to: 197-206, 214-223


193-193: Good fix for asynchronous lock file operations.

Converting the lock file assertions to use assertBusy properly handles the asynchronous nature of remote store lock file creation/cleanup.

Minor observation: The logger statement at line 229 executes once before the assertBusy loop, so it may log a count that differs from the eventual assertion result during test flakiness investigations.

Also applies to: 210-210, 230-230

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 1dc6303: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.17%. Comparing base (930ae74) to head (59efde0).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20240      +/-   ##
============================================
- Coverage     73.20%   73.17%   -0.04%     
+ Complexity    71745    71712      -33     
============================================
  Files          5795     5795              
  Lines        328304   328298       -6     
  Branches      47283    47279       -4     
============================================
- Hits         240334   240223     -111     
- Misses        68663    68752      +89     
- Partials      19307    19323      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread server/src/internalClusterTest/java/org/opensearch/snapshots/CloneSnapshotIT.java Outdated
…on: cluster at least YELLOW + no initializing shards + no relocating shards before each snapshot

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
…a single blocking health request that waits for no initializing and no relocating shards

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 59efde0: SUCCESS

@andrross andrross merged commit fc01431 into opensearch-project:main Dec 18, 2025
35 of 54 checks passed
pradeep-L pushed a commit to pradeep-L/OpenSearch that referenced this pull request Apr 21, 2026
* make testCloneShallowSnapshotIndex() establish the missing precondition: cluster at least YELLOW + no initializing shards + no relocating shards before each snapshot

Signed-off-by: Joe Liu <guoqing4@illinois.edu>

* keep ensureYellow(...) for the “primaries assigned” barrier, and add a single blocking health request that waits for no initializing and no relocating shards

Signed-off-by: Joe Liu <guoqing4@illinois.edu>

---------

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for CloneSnapshotIT

2 participants