fix(test): wait for prefetch to complete before file verification #1815

huiboxes wants to merge 2 commits into dragonflyoss:master
Conversation
This commit fixes intermittent test failures in TestNativeLayer when enable_prefetch=true by ensuring prefetch operations complete before file verification begins.

Changes:
- Add waitPrefetch() method to monitor backend read metrics
- Wait for read count to stabilize before proceeding with tests
- Apply wait after both Mount() and MountByAPI() when prefetch is enabled

This resolves race conditions where file verification would start before prefetch completed, causing hash mismatches and I/O errors.

Signed-off-by: xieshaohui.xsh <xieshaohui.xsh@digital-engine.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #1815      +/-   ##
==========================================
- Coverage   56.31%   56.27%    -0.05%
==========================================
  Files         199      199
  Lines       52324    52324
  Branches    44838    44838
==========================================
- Hits        29467    29445      -22
- Misses      21368    21383      +15
- Partials     1489     1496       +7
```
Pull request overview

This PR fixes intermittent test failures in TestNativeLayer when prefetch is enabled by ensuring prefetch operations complete before file verification begins. The fix reduces test failure rates from approximately 8% to 0% by monitoring backend read metrics and waiting for them to stabilize.

Key Changes:
- Added a waitPrefetch() method that polls backend read metrics to detect prefetch completion
- Modified Mount() and MountByAPI() to wait for prefetch when it is enabled
- Uses stability detection (3 consecutive identical read counts) to determine when prefetch is complete
Comment on line `if metrics.ReadCount == 0 {`:
The logic assumes ReadCount == 0 means prefetch hasn't started yet, but this doesn't distinguish between prefetch not starting and legitimate cases where there's nothing to prefetch (e.g., empty filesystem or all data already cached). This could lead to the full 10-second timeout in scenarios where prefetch completes instantly or isn't needed. Consider checking additional metrics like PrefetchDataAmount from BlobCacheMetrics or adding an early exit condition when it's clear there's no prefetch work to do.
Suggested change:

```diff
-if metrics.ReadCount == 0 {
+// If there is no prefetch data configured or needed, return early instead of
+// waiting for ReadCount to change.
+if metrics.ReadCount == 0 {
+	// PrefetchDataAmount comes from BlobCacheMetrics; 0 here means there is
+	// nothing to prefetch (e.g., empty filesystem or everything already cached).
+	if metrics.PrefetchDataAmount == 0 {
+		return
+	}
+	// There is prefetch work, but it hasn't started generating reads yet.
```
```go
func (nydusd *Nydusd) waitPrefetch() {
	// Wait for prefetch to start and stabilize
	deadline := time.Now().Add(10 * time.Second)
	var lastReadCount uint64
	stableCount := 0

	for time.Now().Before(deadline) {
		time.Sleep(500 * time.Millisecond)

		metrics, err := nydusd.GetBackendMetrics()
		if err != nil {
			continue
		}

		if metrics.ReadCount == 0 {
			continue
		}

		// Check if read count is stable (prefetch completed)
		if metrics.ReadCount == lastReadCount {
			stableCount++
			if stableCount >= 3 {
				return
			}
		} else {
			stableCount = 0
			lastReadCount = metrics.ReadCount
		}
	}
}
```
The waitPrefetch method silently returns without any indication of success or failure. When the deadline expires without achieving stable read counts, the function returns normally, which could lead to false positives in tests where prefetch actually failed or timed out. Consider adding logging to distinguish between successful completion and timeout scenarios, or returning an error when the timeout is reached. This would make test failures more debuggable and help identify when the timeout duration needs adjustment.
The timeout duration of 10 seconds, polling interval of 500 milliseconds, and stability threshold of 3 consecutive identical reads are hardcoded. These values may not be appropriate for all environments (e.g., slower CI systems, larger prefetch operations). Consider making these configurable through environment variables or the NydusdConfig structure to allow tuning for different test scenarios and environments.
Errors from GetBackendMetrics are silently ignored with continue statements. While this approach ensures the wait loop doesn't fail prematurely, it could mask underlying issues such as the API being unavailable, authentication problems, or other persistent errors. Consider adding logging for these errors or implementing a counter to fail fast if consecutive errors exceed a threshold, helping identify configuration or connectivity issues earlier.
Suggested change:

```diff
 stableCount := 0
+errCount := 0
 for time.Now().Before(deadline) {
 	time.Sleep(500 * time.Millisecond)
 	metrics, err := nydusd.GetBackendMetrics()
 	if err != nil {
+		errCount++
+		logrus.WithError(err).Warn("waitPrefetch: failed to get backend metrics")
+		if errCount >= 5 {
+			logrus.WithError(err).Error("waitPrefetch: aborting due to repeated backend metrics retrieval failures")
+			return
+		}
```
|
Thanks for the PR, the latest failed CI shows the
bergwolf left a comment:
No, this is WRONG.
The integration tests are supposed to pass even if the blob is not cached locally. If there is anything broken with the on-demand IO path, we should locate and fix it rather than trying to work around it with full prefetching.
Sorry.

Overview
Fixes intermittent test failures in TestNativeLayer when enable_prefetch=true by ensuring prefetch operations complete before file verification begins.
Related Issues
Related #1801
Change Details
Test Results
I used the same config and environment variables to test both the old code and the new fix. The old code had a failure rate of around 8%, while with the new fix it dropped to 0%.
Test run screenshots (old code 1, old code 2, new fix 1, new fix 2) omitted.
Change Type
Self-Checklist