Skip to content

Fix flaky test failures in ShardsLimitAllocationDeciderIT#20375

Merged
andrross merged 5 commits intoopensearch-project:mainfrom
cwperks:fix-19762
Feb 5, 2026
Merged

Fix flaky test failures in ShardsLimitAllocationDeciderIT#20375
andrross merged 5 commits intoopensearch-project:mainfrom
cwperks:fix-19762

Conversation

@cwperks
Copy link
Copy Markdown
Member

@cwperks cwperks commented Jan 6, 2026

Description

This PR is a similar attempt to #19762 to fix flaky tests in ShardsLimitAllocationDeciderIT

See example failure here: https://build.ci.opensearch.org/job/gradle-check/69672/testReport/junit/org.opensearch.cluster.routing.allocation.decider/ShardsLimitAllocationDeciderIT/testCombinedClusterAndIndexSpecificShardLimits__p0___opensearch_experimental_feature_writable_warm_index_enabled___true___/

>  ./gradlew ':server:internalClusterTest' --tests 'org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits' -Dtests.iters=5

java.lang.AssertionError: Total assigned shards should be 17 expected:<17> but was:<16>
	at __randomizedtesting.SeedInfo.seed([228B86D5A534EC39:A0832C873E69A8FB]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.lambda$testCombinedClusterAndIndexSpecificShardLimits$0(ShardsLimitAllocationDeciderIT.java:303)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1193)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1166)
	at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits(ShardsLimitAllocationDeciderIT.java:263)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)

Related Issues

Resolves #19726

Stable with 100x runs. Fails a few runs consistently in a batch of 100 iters without this fix.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of cluster shard-allocation tests by waiting for cluster stabilization, forcing a reroute after index creation, tightening shard-distribution checks, and adding a 60s timeout to assertions to reduce flaky failures and improve test stability.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
Signed-off-by: Craig Perkins <cwperx@amazon.com>
@cwperks cwperks requested a review from a team as a code owner January 6, 2026 19:54
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Jan 6, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 6, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
📝 Walkthrough

Walkthrough

The PR adds a changelog entry and updates a test: ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits now waits for cluster stabilization, forces a reroute after index creation, adjusts shard-distribution assertions to count only nodes that actually own shards, and wraps final assertions with a 60s timeout.

Changes

Cohort / File(s) Summary
Test Stabilization & Assertions
server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java
Wait for cluster stabilization and call a forced reroute after creating indices; change shard-distribution check to count only nodes that own shards (expecting three nodes with shard counts 6, 6, and 5); wrap final assertions in a 60s timeout.
Changelog
CHANGELOG.md
Added a Fixed entry noting flaky test failures in ShardsLimitAllocationDeciderIT with associated PR link.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • andrross
  • reta
  • dbwiddis
  • kotwanikunal
  • msfroh

Poem

🐰 I waited, I rerouted, I counted each shard,
Three nodes now balanced, their burdens not hard.
A timeout to guard the test's patient art,
I hop with a grin — the cluster's in part. 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main fix being applied: resolving flaky test failures in ShardsLimitAllocationDeciderIT.
Description check ✅ Passed The PR description comprehensively covers the problem, provides a failure example, references the related issue, and demonstrates the fix stability. All required template sections are present.
Linked Issues check ✅ Passed The PR directly addresses the flaky test failures documented in issue #19726 by implementing fixes to the testCombinedClusterAndIndexSpecificShardLimits test method.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the flaky test: modifications to ShardsLimitAllocationDeciderIT test logic and a CHANGELOG entry documenting the fix.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Failure to add the new IP will result in interrupted reviews.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 6, 2026

❌ Gradle check result for e8fba2b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 6, 2026

❌ Gradle check result for e8fba2b: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 7, 2026

❕ Gradle check result for bc221a8: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.26%. Comparing base (93ae8db) to head (ce82e1c).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20375      +/-   ##
============================================
- Coverage     73.33%   73.26%   -0.07%     
+ Complexity    72125    72086      -39     
============================================
  Files          5798     5798              
  Lines        329654   329697      +43     
  Branches      47491    47508      +17     
============================================
- Hits         241741   241566     -175     
- Misses        68504    68757     +253     
+ Partials      19409    19374      -35     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 3, 2026

✅ Gradle check result for ce82e1c: SUCCESS

@andrross andrross merged commit 3ba2f37 into opensearch-project:main Feb 5, 2026
34 of 45 checks passed
tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026
tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ShardsLimitAllocationDeciderIT

2 participants