fix(test): wait for prefetch to complete before file verification by huiboxes · Pull Request #1815 · dragonflyoss/nydus

huiboxes · 2025-11-25T04:29:07Z

Overview

Fixes intermittent test failures in TestNativeLayer when enable_prefetch=true by ensuring prefetch operations complete before file verification begins.

Related Issues

Related #1801

Change Details

Add waitPrefetch() method to monitor backend read metrics
Wait for read count to stabilize before proceeding with tests
Apply wait after both Mount() and MountByAPI() when prefetch is enabled

Test Results

I used the same config and environment variables to test both the old code and the new fix. The old code had a failure rate of around 8%, when the new fix, it is 0%.

old code 1:

old code 2:

new fix 1:

new fix 2:

Change Type

Self-Checklist

I have run a code style check and addressed any warnings/errors.
I have added appropriate comments to my code (if applicable).
I have updated the documentation (if applicable).
I have written appropriate unit tests.

This commit fixes intermittent test failures in TestNativeLayer when enable_prefetch=true by ensuring prefetch operations complete before file verification begins. Changes: - Add waitPrefetch() method to monitor backend read metrics - Wait for read count to stabilize before proceeding with tests - Apply wait after both Mount() and MountByAPI() when prefetch is enabled This resolves race conditions where file verification would start before prefetch completed, causing hash mismatches and I/O errors. Signed-off-by: xieshaohui.xsh <xieshaohui.xsh@digital-engine.com>

Signed-off-by: xieshaohui.xsh <xieshaohui.xsh@digital-engine.com>

codecov · 2025-12-18T08:43:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.27%. Comparing base (fdae0d6) to head (23f5e6e).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1815      +/-   ##
==========================================
- Coverage   56.31%   56.27%   -0.05%     
==========================================
  Files         199      199              
  Lines       52324    52324              
  Branches    44838    44838              
==========================================
- Hits        29467    29445      -22     
- Misses      21368    21383      +15     
- Partials     1489     1496       +7

see 8 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

This PR fixes intermittent test failures in TestNativeLayer when prefetch is enabled by ensuring prefetch operations complete before file verification begins. The fix reduces test failure rates from approximately 8% to 0% by monitoring backend read metrics and waiting for them to stabilize.

Key Changes:

Added waitPrefetch() method that polls backend read metrics to detect prefetch completion
Modified Mount() and MountByAPI() to wait for prefetch when enabled
Uses stability detection (3 consecutive identical read counts) to determine when prefetch is complete

Copilot · 2025-12-23T07:55:03Z

smoke/tests/tool/nydusd.go

+
+		if metrics.ReadCount == 0 {


The logic assumes ReadCount == 0 means prefetch hasn't started yet, but this doesn't distinguish between prefetch not starting and legitimate cases where there's nothing to prefetch (e.g., empty filesystem or all data already cached). This could lead to the full 10-second timeout in scenarios where prefetch completes instantly or isn't needed. Consider checking additional metrics like PrefetchDataAmount from BlobCacheMetrics or adding an early exit condition when it's clear there's no prefetch work to do.

Suggested change

if metrics.ReadCount == 0 {

// If there is no prefetch data configured or needed, return early instead of

// waiting for ReadCount to change.

if metrics.ReadCount == 0 {

// PrefetchDataAmount comes from BlobCacheMetrics; 0 here means there is

// nothing to prefetch (e.g., empty filesystem or everything already cached).

if metrics.PrefetchDataAmount == 0 {

return

}

// There is prefetch work, but it hasn't started generating reads yet.

Copilot · 2025-12-23T07:55:04Z

smoke/tests/tool/nydusd.go

+func (nydusd *Nydusd) waitPrefetch() {
+	// Wait for prefetch to start and stabilize
+	deadline := time.Now().Add(10 * time.Second)
+	var lastReadCount uint64
+	stableCount := 0
+
+	for time.Now().Before(deadline) {
+		time.Sleep(500 * time.Millisecond)
+
+		metrics, err := nydusd.GetBackendMetrics()
+		if err != nil {
+			continue
+		}
+
+		if metrics.ReadCount == 0 {
+			continue
+		}
+
+		// Check if read count is stable (prefetch completed)
+		if metrics.ReadCount == lastReadCount {
+			stableCount++
+			if stableCount >= 3 {
+				return
+			}
+		} else {
+			stableCount = 0
+			lastReadCount = metrics.ReadCount
+		}
+	}
+}


The waitPrefetch method silently returns without any indication of success or failure. When the deadline expires without achieving stable read counts, the function returns normally, which could lead to false positives in tests where prefetch actually failed or timed out. Consider adding logging to distinguish between successful completion and timeout scenarios, or returning an error when the timeout is reached. This would make test failures more debuggable and help identify when the timeout duration needs adjustment.

Copilot · 2025-12-23T07:55:04Z

smoke/tests/tool/nydusd.go

+	deadline := time.Now().Add(10 * time.Second)
+	var lastReadCount uint64
+	stableCount := 0
+
+	for time.Now().Before(deadline) {
+		time.Sleep(500 * time.Millisecond)
+
+		metrics, err := nydusd.GetBackendMetrics()
+		if err != nil {
+			continue
+		}
+
+		if metrics.ReadCount == 0 {
+			continue
+		}
+
+		// Check if read count is stable (prefetch completed)
+		if metrics.ReadCount == lastReadCount {
+			stableCount++
+			if stableCount >= 3 {
+				return


The timeout duration of 10 seconds, polling interval of 500 milliseconds, and stability threshold of 3 consecutive identical reads are hardcoded. These values may not be appropriate for all environments (e.g., slower CI systems, larger prefetch operations). Consider making these configurable through environment variables or the NydusdConfig structure to allow tuning for different test scenarios and environments.

Copilot · 2025-12-23T07:55:04Z

smoke/tests/tool/nydusd.go

+	stableCount := 0
+
+	for time.Now().Before(deadline) {
+		time.Sleep(500 * time.Millisecond)
+
+		metrics, err := nydusd.GetBackendMetrics()
+		if err != nil {


Errors from GetBackendMetrics are silently ignored with continue statements. While this approach ensures the wait loop doesn't fail prematurely, it could mask underlying issues such as the API being unavailable, authentication problems, or other persistent errors. Consider adding logging for these errors or implementing a counter to fail fast if consecutive errors exceed a threshold, helping identify configuration or connectivity issues earlier.

Suggested change

stableCount := 0

for time.Now().Before(deadline) {

time.Sleep(500 * time.Millisecond)

metrics, err := nydusd.GetBackendMetrics()

if err != nil {

stableCount := 0

errCount := 0

for time.Now().Before(deadline) {

time.Sleep(500 * time.Millisecond)

metrics, err := nydusd.GetBackendMetrics()

if err != nil {

errCount++

logrus.WithError(err).Warn("waitPrefetch: failed to get backend metrics")

if errCount >= 5 {

logrus.WithError(err).Error("waitPrefetch: aborting due to repeated backend metrics retrieval failures")

return

}

imeoer · 2025-12-23T09:48:18Z

Thanks for the PR, the latest failed CI shows the TestSubMountCache case timedout:

bergwolf

No, this is WRONG.

The integration tests are supposed to pass even if the blob is not cached locally. If there is anything broken with the on-demand IO path, we should locate and fix it rather than trying to work around it with full prefetching.

Sorry.

huiboxes added 2 commits November 25, 2025 12:31

fix(test): fix waitPrefetch zero ReadCount check

23f5e6e

Signed-off-by: xieshaohui.xsh <xieshaohui.xsh@digital-engine.com>

huiboxes force-pushed the fix/nydus-integration-test branch from 8345b6a to 23f5e6e Compare November 25, 2025 04:31

bergwolf requested a review from Copilot December 23, 2025 07:47

Copilot started reviewing on behalf of bergwolf December 23, 2025 07:47 View session

Copilot AI reviewed Dec 23, 2025

View reviewed changes

bergwolf requested changes Jan 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

fix(test): wait for prefetch to complete before file verification#1815

fix(test): wait for prefetch to complete before file verification#1815
huiboxes wants to merge 2 commits intodragonflyoss:masterfrom
huiboxes:fix/nydus-integration-test

huiboxes commented Nov 25, 2025

Uh oh!

codecov bot commented Dec 18, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 23, 2025

Uh oh!

Copilot AI Dec 23, 2025

Uh oh!

Copilot AI Dec 23, 2025

Uh oh!

Copilot AI Dec 23, 2025

Uh oh!

imeoer commented Dec 23, 2025

Uh oh!

bergwolf left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-		if metrics.ReadCount == 0 {
+		// If there is no prefetch data configured or needed, return early instead of
+		// waiting for ReadCount to change.
+		if metrics.ReadCount == 0 {
+			// PrefetchDataAmount comes from BlobCacheMetrics; 0 here means there is
+			// nothing to prefetch (e.g., empty filesystem or everything already cached).
+			if metrics.PrefetchDataAmount == 0 {
+				return
+			}
+			// There is prefetch work, but it hasn't started generating reads yet.

Comments

Conversation

huiboxes commented Nov 25, 2025

Overview

Related Issues

Change Details

Test Results

Change Type

Self-Checklist

Uh oh!

codecov bot commented Dec 18, 2025

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

imeoer commented Dec 23, 2025

Uh oh!

bergwolf left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants