Skip to content

Fixed a flaky test in issue 19042#20270

Merged
andrross merged 1 commit intoopensearch-project:mainfrom
liuguoqingfz:flakytest-19042
Dec 17, 2025
Merged

Fixed a flaky test in issue 19042#20270
andrross merged 1 commit intoopensearch-project:mainfrom
liuguoqingfz:flakytest-19042

Conversation

@liuguoqingfz
Copy link
Copy Markdown
Contributor

@liuguoqingfz liuguoqingfz commented Dec 17, 2025

Description

Fixed a flaky test org.opensearch.indices.replication.MergedSegmentWarmerIT.testPrimaryStopped_ReplicaPromoted which is caused by a logic mismatch in the test: the comment says “index another doc but don’t refresh”, but the code does refresh doc 2 with RefreshPolicy.IMMEDIATE. With segment replication, that can turn into a race where the refreshed segment containing doc 2 is not guaranteed to be present on the replica at the moment the primary is stopped, so when the replica is promoted you end up with only docs 1 and 3.

Related Issues

Resolves #19042

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Updated segment replication test case to verify document visibility behavior following replica promotion, including validation of translog-based realtime reads and refined indexing sequencing.

✏️ Tip: You can customize this high-level summary in your review settings.

…nt on the replica in realtime translog, then stop the primary.Index doc 3 via client so the refresh happens on the promoted primary you are querying with _only_local.

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 17, 2025 04:25
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 17, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Dec 17, 2025

Walkthrough

A test method testPrimaryStopped_ReplicaPromoted in SegmentReplicationIT is modified to fix flaky behavior by adjusting document indexing timing, refresh policies, and visibility verification. Doc 2 is now indexed without refresh and verified through translog realtime reads, while doc 3 indexing is relocated to the replica with immediate refresh post-promotion.

Changes

Cohort / File(s) Change Summary
Test Logic Fix
server/src/internalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationIT.java
Modified testPrimaryStopped_ReplicaPromoted method: changed doc 2 indexing to use RefreshPolicy.NONE and added assertion validating visibility via translog realtime reads after promotion; relocated doc 3 indexing from new primary to replica with IMMEDIATE refresh; updated hit count assertion to expect three visible documents.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Area for attention: Verify that the new RefreshPolicy.NONE timing and translog realtime read assertion correctly address the flakiness root cause in the promotion scenario.

Suggested labels

bug, flaky-test, Indexing

Suggested reviewers

  • msfroh
  • andrross
  • dbwiddis
  • sachinpkale
  • shwetathareja
  • mch2
  • cwperks
  • owaiskazi19
  • saratvemulapalli

Poem

🐰 A test once danced in flaky ways,
Through timing bugs and refresh delays.
Now translog reads and replicas aligned,
The promotion path runs true and kind!
Docs settle down, assertions shine—
Fixed at last, this test is mine! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fixed a flaky test in issue 19042' is clear and specific, directly referencing the issue being addressed and the primary change (fixing a flaky test).
Description check ✅ Passed The PR description includes a clear explanation of the issue, root cause analysis, and references the related issue #19042. It follows the template structure with Description and Related Issues sections.
Linked Issues check ✅ Passed The PR directly addresses issue #19042 by fixing the flaky test testPrimaryStopped_ReplicaPromoted, which was specifically identified as failing in the linked issue's test report.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the identified flaky test in SegmentReplicationIT.java; no unrelated modifications are present.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 799fb9b and 75308aa.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationIT.java (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)
🔇 Additional comments (2)
server/src/internalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationIT.java (2)

308-312: Technical inaccuracy in reasoning about segment replication and translog handling.

With segment replication, all documents are sent to each shard and persisted in the transaction log. Replicas maintain their own translog, so doc 2 indexed with RefreshPolicy.NONE will be written to the replica's translog. The assertion on line 312 checking for doc 2 on the replica via realtime GET is not fundamentally broken by segment replication behavior—the document will reach the replica's translog.

However, segment replication has known resource constraints around GET operations on replicas, which may still warrant attention for test reliability. Consider verifying that the assertBusy timeout is sufficient to account for translog write latency on the replica.

Likely an incorrect or invalid review comment.


325-331: The test expectation is correct. Doc 2 is already present on the replica in the translog and memory buffer (verified by line 311's realtime GET check). When doc 3 is indexed with RefreshPolicy.IMMEDIATE, the resulting refresh operation converts both the buffered doc 2 and doc 3 to searchable segments. After promotion, the promoted replica contains: doc 1 (replicated segments), doc 2 (buffer/translog, searchable after refresh), and doc 3 (newly indexed and refreshed), totaling 3 documents. The test assertion is accurate.

Likely an incorrect or invalid review comment.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 75308aa: SUCCESS

@codecov
Copy link
Copy Markdown

codecov bot commented Dec 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.15%. Comparing base (930ae74) to head (75308aa).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20270      +/-   ##
============================================
- Coverage     73.20%   73.15%   -0.05%     
+ Complexity    71745    71697      -48     
============================================
  Files          5795     5795              
  Lines        328304   328304              
  Branches      47283    47283              
============================================
- Hits         240334   240178     -156     
- Misses        68663    68833     +170     
+ Partials      19307    19293      -14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross andrross merged commit 99602cc into opensearch-project:main Dec 17, 2025
39 of 40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for MergedSegmentWarmerIT

2 participants