Fix a flaky test with Files.walk()#19758
Fix a flaky test with Files.walk()#19758liuguoqingfz wants to merge 0 commit intoopensearch-project:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #19758 +/- ##
============================================
+ Coverage 73.19% 73.24% +0.04%
- Complexity 70946 70983 +37
============================================
Files 5735 5735
Lines 324654 324654
Branches 46962 46962
============================================
+ Hits 237643 237789 +146
+ Misses 67875 67702 -173
- Partials 19136 19163 +27 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
2de4651 to
196fb16
Compare
|
❌ Gradle check result for 196fb16: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 196fb16: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
| try { | ||
| if (path.endsWith(".lock") == false) { | ||
| Files.copy(path, newIndexDataPath.resolve(indexDataPath.relativize(path))); | ||
| Files.walkFileTree(indexDataPath, new SimpleFileVisitor<Path>() { |
There was a problem hiding this comment.
It looks like this same race condition was attempted to be fixed by this code:
Is it possible there is still an in-flight refresh that causes the inconsistency even after disabling refresh? If so, could we just force a refresh after disabling and that might ensure everything is quiescent?
My concern with the approach here is that if things are being mutated while traversing the files then we can't know that we've copied a consistent view of the flies for the shard, which seems like a requirement of this test.
|
This PR is stalled because it has been open for 30 days with no activity. |
WalkthroughModified index data copying logic in a test file by replacing shallow directory walk with full file tree traversal using SimpleFileVisitor. Added explicit directory creation, .lock file skipping, and NoSuchFileException handling to improve robustness. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)
341-370: Clarify/strengthen handling ofNoSuchFileExceptionand other IO failures duringwalkFileTreeThe switch to
Files.walkFileTreeplus aSimpleFileVisitoris a good direction to avoid the lazyFiles.walkstream issues, but the current exception handling has a few sharp edges:
visitFileFailedsilently skips any path wherereadAttributesthrowsNoSuchFileException(including non‑translog files) and just continues. That fixes the reported flakiness for transient files but also means we can now silently drop other files if they somehow disappear, potentially giving an inconsistent copy without a clear signal.- For other IOExceptions,
visitFileFailedreturnsTERMINATEbut does not log or rethrow, so the walk stops early but the test will only fail later (e.g., when reopening the index), with less direct diagnostics.- In contrast, the
NoSuchFileExceptionfromFiles.copyinvisitFileis treated as a hard failure viafail(). That asymmetry is intentional for the reported case but might still leave a narrow race where we ignore some deletions (attribute read) but consider others fatal (between attributes and copy).Consider tightening this logic a bit for clarity and robustness, for example:
- Restrict the
NoSuchFileExceptionsuppression invisitFileFailedto known transient patterns (e.g., translog checkpoint files), or at least document why ignoring all such misses is acceptable for this test’s guarantees.- For non‑
NoSuchFileExceptioncases invisitFileFailed, log the error and either rethrowexcor callfail()so the root cause is immediately visible rather than just terminating the walk silently.Illustrative sketch:
@@ - @Override - public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException { - if (exc instanceof NoSuchFileException) { - return FileVisitResult.CONTINUE; - } - return FileVisitResult.TERMINATE; - } + @Override + public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException { + if (exc instanceof NoSuchFileException) { + // Expected for transient files (e.g., translog checkpoints) that can disappear while copying. + return FileVisitResult.CONTINUE; + } + logger.error("Failed while copying data path file [{}]", file, exc); + throw exc; // or fail("...") to immediately fail the test + }This keeps the flakiness fix while making unexpected IO problems more visible and reducing the chance of silently copying an inconsistent view.
If you want extra validation, it may be worth confirming from JDK docs that
walkFileTreeroutes attribute‑read failures viavisitFileFailed(and notvisitFile), ensuring this suppression indeed covers the originally observedNoSuchFileException.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: gradle-check
- GitHub Check: Analyze (java)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: Mend Security Check
🔇 Additional comments (1)
server/src/internalClusterTest/java/org/opensearch/index/shard/IndexShardIT.java (1)
102-110: New NIO imports are appropriate and scoped to the new visitorThe added imports (
FileVisitResult,NoSuchFileException,SimpleFileVisitor,BasicFileAttributes) match the new walkFileTree usage and keep dependencies injava.nio.fileonly. No issues here.
Description
The original code was walking the source data path while it’s still being mutated by engine/shard cleanup, and
Files.walk(...)is lazy: as it iterates it callsreadAttributeson each entry to decide file/dir. One of those entries (translog-7.ckp) was deleted between directory listing and attribute read →NoSuchFileException, which then gets wrapped asUncheckedIOExceptionRelated Issues
Resolves #19731
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.