Skip to content

Fix a flaky test with Files.walk()#19758

Closed
liuguoqingfz wants to merge 0 commit intoopensearch-project:mainfrom
liuguoqingfz:main
Closed

Fix a flaky test with Files.walk()#19758
liuguoqingfz wants to merge 0 commit intoopensearch-project:mainfrom
liuguoqingfz:main

Conversation

@liuguoqingfz
Copy link
Copy Markdown
Contributor

@liuguoqingfz liuguoqingfz commented Oct 24, 2025

Description

The original code was walking the source data path while it’s still being mutated by engine/shard cleanup, and Files.walk(...) is lazy: as it iterates it calls readAttributes on each entry to decide file/dir. One of those entries (translog-7.ckp) was deleted between directory listing and attribute read → NoSuchFileException, which then gets wrapped as UncheckedIOException

Related Issues

Resolves #19731

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Improved reliability of index data handling in test scenarios with enhanced error handling and more robust file operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 1a29cb1: SUCCESS

@codecov
Copy link
Copy Markdown

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.24%. Comparing base (0c89456) to head (1a29cb1).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19758      +/-   ##
============================================
+ Coverage     73.19%   73.24%   +0.04%     
- Complexity    70946    70983      +37     
============================================
  Files          5735     5735              
  Lines        324654   324654              
  Branches      46962    46962              
============================================
+ Hits         237643   237789     +146     
+ Misses        67875    67702     -173     
- Partials      19136    19163      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@liuguoqingfz liuguoqingfz force-pushed the main branch 2 times, most recently from 2de4651 to 196fb16 Compare October 24, 2025 14:15
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 196fb16: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 196fb16: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

try {
if (path.endsWith(".lock") == false) {
Files.copy(path, newIndexDataPath.resolve(indexDataPath.relativize(path)));
Files.walkFileTree(indexDataPath, new SimpleFileVisitor<Path>() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this same race condition was attempted to be fixed by this code:

// race condition: async flush may cause translog file deletion resulting in an inconsistent stream from
// Files.walk below during copy phase
// temporarily disable refresh to avoid any flushes or syncs that may inadvertently cause the deletion
assertAcked(
client().admin()
.indices()
.prepareUpdateSettings(index)
.setSettings(Settings.builder().put(IndexSettings.INDEX_REFRESH_INTERVAL_SETTING.getKey(), "-1").build())
);

Is it possible there is still an in-flight refresh that causes the inconsistency even after disabling refresh? If so, could we just force a refresh after disabling and that might ensure everything is quiescent?

My concern with the approach here is that if things are being mutated while traversing the files then we can't know that we've copied a consistent view of the flies for the shard, which seems like a requirement of this test.

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Nov 24, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Dec 15, 2025

Walkthrough

Modified index data copying logic in a test file by replacing shallow directory walk with full file tree traversal using SimpleFileVisitor. Added explicit directory creation, .lock file skipping, and NoSuchFileException handling to improve robustness.

Changes

Cohort / File(s) Change Summary
Test File Visitor Implementation
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java
Refactored file copy mechanism from Files.walk() to Files.walkFileTree() with custom SimpleFileVisitor. Implements preVisitDirectory() for target directory creation, visitFile() with .lock file filtering and error handling, and visitFileFailed() for graceful exception recovery. Added imports for FileVisitResult, NoSuchFileException, SimpleFileVisitor, BasicFileAttributes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • File copied logic refactoring: Verify correctness of the SimpleFileVisitor implementation, particularly the directory creation and file copying sequence.
  • Exception handling: Confirm that NoSuchFileException is handled gracefully while other IO errors properly fail the test via fail().
  • Edge cases: Ensure .lock file filtering and test behavior remain consistent with original intent.

Suggested labels

Indexing

Suggested reviewers

  • msfroh
  • reta
  • mch2
  • sachinpkale
  • shwetathareja
  • dbwiddis
  • cwperks
  • kotwanikunal
  • ashking94
  • gbbafna

Poem

🐰 A flaky test needed mending,
With file trees now bending,
Visitors traverse with care,
No locks left to spare,
Better paths, test reliability's sending! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing a flaky test caused by Files.walk() behavior.
Description check ✅ Passed The description explains the root cause of the flaky test and follows the template with completed description and related issues.
Linked Issues check ✅ Passed The PR addresses the flaky IndexShardIT.testIndexCanChangeCustomDataPath test by replacing Files.walk() with Files.walkFileTree() and handling NoSuchFileException.
Out of Scope Changes check ✅ Passed All changes are focused on fixing the flaky test by improving file copy logic and error handling, with no out-of-scope modifications.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)

341-370: Clarify/strengthen handling of NoSuchFileException and other IO failures during walkFileTree

The switch to Files.walkFileTree plus a SimpleFileVisitor is a good direction to avoid the lazy Files.walk stream issues, but the current exception handling has a few sharp edges:

  • visitFileFailed silently skips any path where readAttributes throws NoSuchFileException (including non‑translog files) and just continues. That fixes the reported flakiness for transient files but also means we can now silently drop other files if they somehow disappear, potentially giving an inconsistent copy without a clear signal.
  • For other IOExceptions, visitFileFailed returns TERMINATE but does not log or rethrow, so the walk stops early but the test will only fail later (e.g., when reopening the index), with less direct diagnostics.
  • In contrast, the NoSuchFileException from Files.copy in visitFile is treated as a hard failure via fail(). That asymmetry is intentional for the reported case but might still leave a narrow race where we ignore some deletions (attribute read) but consider others fatal (between attributes and copy).

Consider tightening this logic a bit for clarity and robustness, for example:

  • Restrict the NoSuchFileException suppression in visitFileFailed to known transient patterns (e.g., translog checkpoint files), or at least document why ignoring all such misses is acceptable for this test’s guarantees.
  • For non‑NoSuchFileException cases in visitFileFailed, log the error and either rethrow exc or call fail() so the root cause is immediately visible rather than just terminating the walk silently.

Illustrative sketch:

@@
-            @Override
-            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
-                if (exc instanceof NoSuchFileException) {
-                    return FileVisitResult.CONTINUE;
-                }
-                return FileVisitResult.TERMINATE;
-            }
+            @Override
+            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
+                if (exc instanceof NoSuchFileException) {
+                    // Expected for transient files (e.g., translog checkpoints) that can disappear while copying.
+                    return FileVisitResult.CONTINUE;
+                }
+                logger.error("Failed while copying data path file [{}]", file, exc);
+                throw exc; // or fail("...") to immediately fail the test
+            }

This keeps the flakiness fix while making unexpected IO problems more visible and reducing the chance of silently copying an inconsistent view.

If you want extra validation, it may be worth confirming from JDK docs that walkFileTree routes attribute‑read failures via visitFileFailed (and not visitFile), ensuring this suppression indeed covers the originally observed NoSuchFileException.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee59cb0 and b01b2e6.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: Mend Security Check
🔇 Additional comments (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)

102-110: New NIO imports are appropriate and scoped to the new visitor

The added imports (FileVisitResult, NoSuchFileException, SimpleFileVisitor, BasicFileAttributes) match the new walkFileTree usage and keep dependencies in java.nio.file only. No issues here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run stalled Issues that have stalled >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for IndexShardIT

2 participants