Skip to content

Comments

fix(test): wait for prefetch to complete before file verification#1815

Open
huiboxes wants to merge 2 commits intodragonflyoss:masterfrom
huiboxes:fix/nydus-integration-test
Open

fix(test): wait for prefetch to complete before file verification#1815
huiboxes wants to merge 2 commits intodragonflyoss:masterfrom
huiboxes:fix/nydus-integration-test

Conversation

@huiboxes
Copy link
Contributor

Overview

Fixes intermittent test failures in TestNativeLayer when enable_prefetch=true by ensuring prefetch operations complete before file verification begins.

Related Issues

Related #1801

Change Details

  • Add waitPrefetch() method to monitor backend read metrics
  • Wait for read count to stabilize before proceeding with tests
  • Apply wait after both Mount() and MountByAPI() when prefetch is enabled

Test Results

I used the same config and environment variables to test both the old code and the new fix. The old code had a failure rate of around 8%, when the new fix, it is 0%.

old code 1:
image
old code 2:
image
new fix 1:
image

new fix 2:
image

Change Type

  • Bug Fix
  • Feature Addition
  • Documentation Update
  • Code Refactoring
  • Performance Improvement
  • Other (please describe)

Self-Checklist

  • I have run a code style check and addressed any warnings/errors.
  • I have added appropriate comments to my code (if applicable).
  • I have updated the documentation (if applicable).
  • I have written appropriate unit tests.

This commit fixes intermittent test failures in TestNativeLayer when
enable_prefetch=true by ensuring prefetch operations complete before
file verification begins.

Changes:
- Add waitPrefetch() method to monitor backend read metrics
- Wait for read count to stabilize before proceeding with tests
- Apply wait after both Mount() and MountByAPI() when prefetch is enabled

This resolves race conditions where file verification would start
before prefetch completed, causing hash mismatches and I/O errors.

Signed-off-by: xieshaohui.xsh <xieshaohui.xsh@digital-engine.com>
Signed-off-by: xieshaohui.xsh <xieshaohui.xsh@digital-engine.com>
@huiboxes huiboxes force-pushed the fix/nydus-integration-test branch from 8345b6a to 23f5e6e Compare November 25, 2025 04:31
@codecov
Copy link

codecov bot commented Dec 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.27%. Comparing base (fdae0d6) to head (23f5e6e).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1815      +/-   ##
==========================================
- Coverage   56.31%   56.27%   -0.05%     
==========================================
  Files         199      199              
  Lines       52324    52324              
  Branches    44838    44838              
==========================================
- Hits        29467    29445      -22     
- Misses      21368    21383      +15     
- Partials     1489     1496       +7     

see 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes intermittent test failures in TestNativeLayer when prefetch is enabled by ensuring prefetch operations complete before file verification begins. The fix reduces test failure rates from approximately 8% to 0% by monitoring backend read metrics and waiting for them to stabilize.

Key Changes:

  • Added waitPrefetch() method that polls backend read metrics to detect prefetch completion
  • Modified Mount() and MountByAPI() to wait for prefetch when enabled
  • Uses stability detection (3 consecutive identical read counts) to determine when prefetch is complete

Comment on lines +729 to +730

if metrics.ReadCount == 0 {
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic assumes ReadCount == 0 means prefetch hasn't started yet, but this doesn't distinguish between prefetch not starting and legitimate cases where there's nothing to prefetch (e.g., empty filesystem or all data already cached). This could lead to the full 10-second timeout in scenarios where prefetch completes instantly or isn't needed. Consider checking additional metrics like PrefetchDataAmount from BlobCacheMetrics or adding an early exit condition when it's clear there's no prefetch work to do.

Suggested change
if metrics.ReadCount == 0 {
// If there is no prefetch data configured or needed, return early instead of
// waiting for ReadCount to change.
if metrics.ReadCount == 0 {
// PrefetchDataAmount comes from BlobCacheMetrics; 0 here means there is
// nothing to prefetch (e.g., empty filesystem or everything already cached).
if metrics.PrefetchDataAmount == 0 {
return
}
// There is prefetch work, but it hasn't started generating reads yet.

Copilot uses AI. Check for mistakes.
Comment on lines +716 to +745
func (nydusd *Nydusd) waitPrefetch() {
// Wait for prefetch to start and stabilize
deadline := time.Now().Add(10 * time.Second)
var lastReadCount uint64
stableCount := 0

for time.Now().Before(deadline) {
time.Sleep(500 * time.Millisecond)

metrics, err := nydusd.GetBackendMetrics()
if err != nil {
continue
}

if metrics.ReadCount == 0 {
continue
}

// Check if read count is stable (prefetch completed)
if metrics.ReadCount == lastReadCount {
stableCount++
if stableCount >= 3 {
return
}
} else {
stableCount = 0
lastReadCount = metrics.ReadCount
}
}
}
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The waitPrefetch method silently returns without any indication of success or failure. When the deadline expires without achieving stable read counts, the function returns normally, which could lead to false positives in tests where prefetch actually failed or timed out. Consider adding logging to distinguish between successful completion and timeout scenarios, or returning an error when the timeout is reached. This would make test failures more debuggable and help identify when the timeout duration needs adjustment.

Copilot uses AI. Check for mistakes.
Comment on lines +718 to +738
deadline := time.Now().Add(10 * time.Second)
var lastReadCount uint64
stableCount := 0

for time.Now().Before(deadline) {
time.Sleep(500 * time.Millisecond)

metrics, err := nydusd.GetBackendMetrics()
if err != nil {
continue
}

if metrics.ReadCount == 0 {
continue
}

// Check if read count is stable (prefetch completed)
if metrics.ReadCount == lastReadCount {
stableCount++
if stableCount >= 3 {
return
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout duration of 10 seconds, polling interval of 500 milliseconds, and stability threshold of 3 consecutive identical reads are hardcoded. These values may not be appropriate for all environments (e.g., slower CI systems, larger prefetch operations). Consider making these configurable through environment variables or the NydusdConfig structure to allow tuning for different test scenarios and environments.

Copilot uses AI. Check for mistakes.
Comment on lines +720 to +726
stableCount := 0

for time.Now().Before(deadline) {
time.Sleep(500 * time.Millisecond)

metrics, err := nydusd.GetBackendMetrics()
if err != nil {
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors from GetBackendMetrics are silently ignored with continue statements. While this approach ensures the wait loop doesn't fail prematurely, it could mask underlying issues such as the API being unavailable, authentication problems, or other persistent errors. Consider adding logging for these errors or implementing a counter to fail fast if consecutive errors exceed a threshold, helping identify configuration or connectivity issues earlier.

Suggested change
stableCount := 0
for time.Now().Before(deadline) {
time.Sleep(500 * time.Millisecond)
metrics, err := nydusd.GetBackendMetrics()
if err != nil {
stableCount := 0
errCount := 0
for time.Now().Before(deadline) {
time.Sleep(500 * time.Millisecond)
metrics, err := nydusd.GetBackendMetrics()
if err != nil {
errCount++
logrus.WithError(err).Warn("waitPrefetch: failed to get backend metrics")
if errCount >= 5 {
logrus.WithError(err).Error("waitPrefetch: aborting due to repeated backend metrics retrieval failures")
return
}

Copilot uses AI. Check for mistakes.
@imeoer
Copy link
Collaborator

imeoer commented Dec 23, 2025

Thanks for the PR, the latest failed CI shows the TestSubMountCache case timedout:

image

Copy link
Member

@bergwolf bergwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is WRONG.

The integration tests are supposed to pass even if the blob is not cached locally. If there is anything broken with the on-demand IO path, we should locate and fix it rather than trying to work around it with full prefetching.

Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants