Skip to content

Fix flaky tests in NodeJoinLeftIT#20382

Merged
andrross merged 1 commit into
opensearch-project:mainfrom
cwperks:fix-flaky-node-drop-it
Jan 7, 2026
Merged

Fix flaky tests in NodeJoinLeftIT#20382
andrross merged 1 commit into
opensearch-project:mainfrom
cwperks:fix-flaky-node-drop-it

Conversation

@cwperks
Copy link
Copy Markdown
Member

@cwperks cwperks commented Jan 7, 2026

Description

The changes in this PR make the tests in NodeJoinLeftIT more stable. Tested with multiple iterations locally where the test fails regularly before the change and is stable with the change.

Seeing a lot of this lately:

org.opensearch.cluster.coordination.NodeJoinLeftIT.testClusterStabilityWhenClusterStatePublicationLagsOnShardCleanup

java.lang.AssertionError: Cluster didn't have a node drop
	at __randomizedtesting.SeedInfo.seed([6FE736075EC75D45:FCB23544E7373C09]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertFalse(Assert.java:65)
	at org.opensearch.cluster.coordination.NodeJoinLeftIT.validateNodeDropDueToPublicationLag(NodeJoinLeftIT.java:460)
	at org.opensearch.cluster.coordination.NodeJoinLeftIT.testClusterStabilityWhenClusterStatePublicationLagsOnShardCleanup(NodeJoinLeftIT.java:365)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)

ref: https://build.ci.opensearch.org/job/gradle-check/69731/

Related Issues

Resolves #18972

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Improved cluster coordination tests by adding simulated publication lag scenarios with enhanced validation logic.
    • Replaced direct health checks with polling-based node drop detection for more reliable test execution.
    • Added cluster recovery validation after removing lag simulation.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@cwperks cwperks requested a review from a team as a code owner January 7, 2026 18:43
@github-actions github-actions Bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Jan 7, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

The test file NodeJoinLeftIT was modified to introduce cluster state listener delays on blue nodes, simulating publication lag during shard cleanup. The test workflow was updated to use assertBusy-based polling for node drop detection instead of immediate checks, and the validateNodeDropDueToPublicationLag method signature was updated to throw Exception to support the new timing-aware logic.

Changes

Cohort / File(s) Summary
Test Flakiness Fix
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Added delaying ClusterStateListener instances on blue nodes to simulate publication lag; refactored node drop detection from immediate health checks to assertBusy-based polling with timeout; updated validateNodeDropDueToPublicationLag() method signature to throw Exception; added cleanup logic to remove listeners and validate cluster recovery post-cleanup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

flaky-test, >test-failure

Suggested reviewers

  • andrross
  • msfroh
  • shwetathareja
  • owaiskazi19
  • dbwiddis
  • anasalkouz

Poem

🐰 A test was flaky, timing went wrong,
We add delays to make it strong!
With polling and listeners, patience we show,
Now cluster coordination flows just so! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix flaky tests in NodeJoinLeftIT' clearly and directly describes the main change—stabilizing flaky tests in the NodeJoinLeftIT test class.
Description check ✅ Passed The PR description provides context about the flaky test failures, includes a concrete error example, references the related issue, and follows most template structure, though some checklist items remain unchecked.
Linked Issues check ✅ Passed The changes add delaying listeners to simulate publication lag, implement polling-based detection with assertBusy, and update test validation logic—all addressing flakiness issues reported in issue #18972.
Out of Scope Changes check ✅ Passed All changes are focused on fixing flakiness in NodeJoinLeftIT through test logic improvements and do not introduce unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b656e0a and 68e1e15.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: detect-breaking-change
🔇 Additional comments (4)
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java (4)

357-371: LGTM - Blue node delay listeners properly set up.

The implementation correctly identifies blue data nodes and registers delay listeners to simulate publication lag during shard cleanup. The use of a map for tracking listeners enables proper cleanup in the finally block.

One observation: The delay listener fires on every cluster state change event, not just shard-related events. This broad scope appears intentional for simulating publication lag, but be aware it may cause longer test execution times.


380-384: LGTM - Proper cleanup and validation ordering.

The finally block correctly removes all registered listeners before validating cluster recovery. This ensures the cluster can recover without the artificial delays interfering with the validation.


465-480: Good fix for flaky test using assertBusy pattern.

The switch from an immediate health check to assertBusy with retries is the correct approach to fix timing-related flakiness. This allows the test to wait for the expected node drop condition rather than failing on the first check.

The 120-second outer timeout is generous but reasonable given:

  • The 30-second delay listeners on blue nodes
  • The 30-second shard delete delay in TestIndexStoreListener
  • Variable cluster state publication timing

The 5-second inner timeout per health check attempt is appropriate for polling.


482-496: LGTM - Log validation properly sequenced after node drop detection.

The added debug log at line 482 helps with test troubleshooting. The log validation assertions are correctly sequenced to run after confirming the node drop has occurred.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 7, 2026

✅ Gradle check result for 68e1e15: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.25%. Comparing base (d866be8) to head (68e1e15).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20382      +/-   ##
============================================
- Coverage     73.30%   73.25%   -0.05%     
+ Complexity    71777    71748      -29     
============================================
  Files          5784     5784              
  Lines        328141   328147       +6     
  Branches      47269    47270       +1     
============================================
- Hits         240531   240379     -152     
- Misses        68329    68533     +204     
+ Partials      19281    19235      -46     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross andrross merged commit 62eed1e into opensearch-project:main Jan 7, 2026
39 of 40 checks passed
tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026
Signed-off-by: Craig Perkins <cwperx@amazon.com>
pradeep-L pushed a commit to pradeep-L/OpenSearch that referenced this pull request Apr 21, 2026
Signed-off-by: Craig Perkins <cwperx@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for NodeJoinLeftIT

2 participants