Fix flaky test failures in ShardsLimitAllocationDeciderIT by cwperks · Pull Request #20375 · opensearch-project/OpenSearch

cwperks · 2026-01-06T19:54:48Z

Description

This PR is another attempt, similar to #19762, to fix flaky tests in ShardsLimitAllocationDeciderIT.

See example failure here: https://build.ci.opensearch.org/job/gradle-check/69672/testReport/junit/org.opensearch.cluster.routing.allocation.decider/ShardsLimitAllocationDeciderIT/testCombinedClusterAndIndexSpecificShardLimits__p0___opensearch_experimental_feature_writable_warm_index_enabled___true___/

>  ./gradlew ':server:internalClusterTest' --tests 'org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits' -Dtests.iters=5

java.lang.AssertionError: Total assigned shards should be 17 expected:<17> but was:<16>
	at __randomizedtesting.SeedInfo.seed([228B86D5A534EC39:A0832C873E69A8FB]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.lambda$testCombinedClusterAndIndexSpecificShardLimits$0(ShardsLimitAllocationDeciderIT.java:303)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1193)
	at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1166)
	at org.opensearch.cluster.routing.allocation.decider.ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits(ShardsLimitAllocationDeciderIT.java:263)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)

Related Issues

Resolves #19726

With this fix, the test is stable across 100 iterations. Without it, a few runs fail consistently in every batch of 100 iterations.

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

Bug Fixes
- Improved reliability of cluster shard-allocation tests by waiting for cluster stabilization, forcing a reroute after index creation, tightening shard-distribution checks, and adding a 60s timeout to assertions to reduce flaky failures and improve test stability.
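The 60s timeout mentioned above follows the usual retry-until-deadline pattern for assertions against cluster state that settles asynchronously. The following is a minimal standalone sketch of that pattern, not the OpenSearch implementation of `assertBusy`. The class `BusyAssert`, the method `awaitAssertion`, and the backoff values are hypothetical and chosen only for illustration.

```java
import java.time.Duration;

/** Hypothetical helper illustrating retry-until-deadline assertions; not OpenSearch's assertBusy. */
final class BusyAssert {

    private BusyAssert() {}

    /**
     * Re-runs {@code check} until it stops throwing AssertionError or {@code timeout} elapses.
     * Once the deadline has passed, the most recent AssertionError is rethrown.
     * Returns the number of attempts that were made.
     */
    static int awaitAssertion(Runnable check, Duration timeout) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        long sleepMillis = 1; // assumed starting backoff
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                check.run();
                return attempts; // the assertion held, stop polling
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // assumed backoff cap
        }
    }
}
```

Under these assumptions, a caller would pass the shard-count assertions as the `check` and `Duration.ofSeconds(60)` as the timeout, so that a slow allocation round delays the test instead of failing it.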


Signed-off-by: Craig Perkins <cwperx@amazon.com>

coderabbitai · 2026-01-06T19:55:23Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

The PR adds a changelog entry and updates a test: ShardsLimitAllocationDeciderIT.testCombinedClusterAndIndexSpecificShardLimits now waits for cluster stabilization, forces a reroute after index creation, adjusts shard-distribution assertions to count only nodes that actually own shards, and wraps final assertions with a 60s timeout.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Stabilization & Assertions `server/src/internalClusterTest/java/org/opensearch/cluster/routing/allocation/decider/ShardsLimitAllocationDeciderIT.java` | Wait for cluster stabilization and call a forced reroute after creating indices; change shard-distribution check to count only nodes that own shards (expecting three nodes with shard counts 6, 6, and 5); wrap final assertions in a 60s timeout. |
| Changelog `CHANGELOG.md` | Added a Fixed entry noting flaky test failures in `ShardsLimitAllocationDeciderIT` with associated PR link. |
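The tightened shard-distribution check summarized above can be pictured with a small standalone sketch. It uses plain maps in place of the real cluster state, and the names `ShardDistribution`, `countsOfNodesWithShards`, `totalAssigned`, and `matchesExpected` are hypothetical. It only illustrates the idea of ignoring nodes that hold no shards before comparing the per-node counts against the expected 6, 6, and 5 (17 in total).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration of the distribution check; not the code from the PR. */
final class ShardDistribution {

    private ShardDistribution() {}

    /** Ascending shard counts of the nodes that own at least one shard. */
    static List<Integer> countsOfNodesWithShards(Map<String, Integer> shardsPerNode) {
        return shardsPerNode.values().stream()
            .filter(count -> count > 0) // skip nodes that hold no shards
            .sorted()
            .collect(Collectors.toList());
    }

    /** Total number of assigned shards across all nodes. */
    static int totalAssigned(Map<String, Integer> shardsPerNode) {
        return shardsPerNode.values().stream().mapToInt(Integer::intValue).sum();
    }

    /** True when 17 shards are assigned and exactly three nodes hold 6, 6, and 5 of them. */
    static boolean matchesExpected(Map<String, Integer> shardsPerNode) {
        return totalAssigned(shardsPerNode) == 17
            && countsOfNodesWithShards(shardsPerNode).equals(List.of(5, 6, 6));
    }
}
```

For example, a map such as `{node-a=6, node-b=0, node-c=5, node-d=6}` (made-up node names, including an assumed empty node) satisfies `matchesExpected`, while `{node-a=6, node-b=5, node-c=5}` does not because only 16 shards are assigned, which mirrors the `expected:<17> but was:<16>` failure.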

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

andrross
reta
dbwiddis
kotwanikunal
msfroh

Poem

🐰 I waited, I rerouted, I counted each shard,
Three nodes now balanced, their burdens not hard.
A timeout to guard the test's patient art,
I hop with a grin — the cluster's in part. 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main fix being applied: resolving flaky test failures in ShardsLimitAllocationDeciderIT. |
| Description check | ✅ Passed | The PR description comprehensively covers the problem, provides a failure example, references the related issue, and demonstrates the fix stability. All required template sections are present. |
| Linked Issues check | ✅ Passed | The PR directly addresses the flaky test failures documented in issue `#19726` by implementing fixes to the testCombinedClusterAndIndexSpecificShardLimits test method. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the flaky test: modifications to ShardsLimitAllocationDeciderIT test logic and a CHANGELOG entry documenting the fix. |



github-actions · 2026-01-06T21:14:04Z

❌ Gradle check result for e8fba2b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-01-06T23:32:32Z

❌ Gradle check result for e8fba2b: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions · 2026-01-07T02:23:07Z

❕ Gradle check result for bc221a8: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2026-01-07T02:23:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.26%. Comparing base (93ae8db) to head (ce82e1c).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #20375      +/-   ##
============================================
- Coverage     73.33%   73.26%   -0.07%     
+ Complexity    72125    72086      -39     
============================================
  Files          5798     5798              
  Lines        329654   329697      +43     
  Branches      47491    47508      +17     
============================================
- Hits         241741   241566     -175     
- Misses        68504    68757     +253     
+ Partials      19409    19374      -35



github-actions · 2026-02-03T17:07:54Z

✅ Gradle check result for ce82e1c: SUCCESS


cwperks added 2 commits January 6, 2026 14:03

Fix flaky test failures in ShardsLimitAllocationDeciderIT

fe9a7ea

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Force re-route

c70f79c

Signed-off-by: Craig Perkins <cwperx@amazon.com>

cwperks requested a review from a team as a code owner January 6, 2026 19:54

github-actions bot added the labels `>test-failure` (Test failure from CI, local build, etc.), `autocut`, and `flaky-test` (Random test failure that succeeds on second run) on Jan 6, 2026

Add CHANGELOG entry

e8fba2b

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Merge branch 'main' into fix-19762

bc221a8

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Merge branch 'main' into fix-19762

ce82e1c

Signed-off-by: Craig Perkins <cwperx@amazon.com>

andrross approved these changes Feb 5, 2026


andrross merged commit 3ba2f37 into opensearch-project:main Feb 5, 2026
34 of 45 checks passed

This was referenced Feb 18, 2026

[AUTOCUT] Gradle Check Flaky Test Report for NodeJoinLeftIT #18972

Open

[AUTOCUT] Gradle Check Flaky Test Report for AwarenessAllocationIT #17930

Open

tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026

Fix flaky test failures in ShardsLimitAllocationDeciderIT (opensearch…

c295a82

…-project#20375) Signed-off-by: Craig Perkins <cwperx@amazon.com>

tanyabti pushed a commit to tanyabti/OpenSearch that referenced this pull request Feb 24, 2026

Fix flaky test failures in ShardsLimitAllocationDeciderIT (opensearch…

2628cba

…-project#20375) Signed-off-by: Craig Perkins <cwperx@amazon.com>

andrross merged 5 commits into opensearch-project:main from cwperks:fix-19762
