
Conversation

@mrhapile

Ⅰ. Describe what this PR does

This PR improves the robustness of FUSE recovery in the CSI node plugin by making the recovery process idempotent.

Previously, when a FUSE daemon restarted, stale or corrupted mount points (for example, "transport endpoint is not connected") could remain in the kernel. Subsequent recovery attempts would remount on top of these stale mounts, leading to mount stacking, excessive /proc/self/mountinfo growth, and persistent data access failures.

This change ensures that stale or corrupted FUSE mount points are cleaned up before any remount attempt using Kubernetes’ canonical mount utilities. As a result, recovery always starts from a clean mount state and avoids mount stacking.

Ⅱ. Does this pull request fix one issue?

fixes #5416

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

  • Existing unit tests under pkg/csi/recover were executed.
  • No new test cases were added, as this change refines existing recovery behavior and does not introduce new public APIs or logic paths that can be effectively unit-tested in isolation.

Ⅳ. Describe how to verify it

  • Run unit tests:
    go test ./pkg/csi/recover

  • In a Kubernetes cluster with FUSE recovery enabled:

    • Repeatedly restart the FUSE pod.
    • Verify that stale or corrupted FUSE mounts are cleaned up before remounting.
    • Confirm that mount stacking does not occur and data access recovers without persistent "transport endpoint is not connected" errors.

This change relies on mount.CleanupMountPoint with extensiveMountPointCheck=true, which detaches corrupted FUSE mounts in a manner consistent with kubelet behavior.
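The cleanup-before-remount idea can be illustrated in isolation. The following is a minimal, self-contained Go sketch: the `Mounter` interface, `recoverMount` helper, and fake mount table are illustrative stand-ins (not Fluid's actual code or the k8s.io/utils/mount API), simulating stale mount layers left behind by FUSE daemon restarts:

```go
package main

import (
	"errors"
	"fmt"
)

// Mounter is a hypothetical stand-in for the subset of mount
// operations used during recovery; names are illustrative only.
type Mounter interface {
	Unmount(target string) error
	IsLikelyNotMountPoint(target string) (bool, error)
	Mount(source, target string) error
}

// fakeMounter simulates a kernel mount table where mounts can stack.
type fakeMounter struct {
	layers map[string]int // how many mounts are stacked on each path
}

func (f *fakeMounter) Unmount(target string) error {
	if f.layers[target] == 0 {
		return errors.New("not mounted")
	}
	f.layers[target]--
	return nil
}

func (f *fakeMounter) IsLikelyNotMountPoint(target string) (bool, error) {
	return f.layers[target] == 0, nil
}

func (f *fakeMounter) Mount(source, target string) error {
	f.layers[target]++
	return nil
}

// recoverMount cleans up every stale layer before remounting, so
// repeated recoveries always converge to one healthy mount (idempotent).
func recoverMount(m Mounter, source, target string) error {
	for {
		notMnt, err := m.IsLikelyNotMountPoint(target)
		if err != nil {
			return err
		}
		if notMnt {
			break
		}
		if err := m.Unmount(target); err != nil {
			return err
		}
	}
	return m.Mount(source, target)
}

func main() {
	// Start from a stale stack of 3 mounts, then recover repeatedly.
	fm := &fakeMounter{layers: map[string]int{"/runtime-mnt/fuse": 3}}
	for i := 0; i < 5; i++ {
		if err := recoverMount(fm, "fuse-daemon", "/runtime-mnt/fuse"); err != nil {
			panic(err)
		}
	}
	fmt.Println(fm.layers["/runtime-mnt/fuse"]) // always 1: no stacking
}
```

Without the cleanup loop, each recovery would add a layer on top of the stale mount, which is exactly the /proc/self/mountinfo growth described above.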

Ⅴ. Special notes for reviews

  • This PR intentionally uses Kubernetes’ canonical mount cleanup utilities to handle corrupted FUSE mounts instead of forcing unconditional lazy unmounts, aligning the behavior with kubelet and avoiding disruption to healthy mount propagation.
  • The scope of this change is limited to FUSE recovery logic and does not modify configuration, public APIs, or runtime behavior outside of recovery.

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
Copilot AI review requested due to automatic review settings January 25, 2026 05:34
@gemini-code-assist
Contributor

Summary of Changes

Hello @mrhapile, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability of FUSE recovery within the CSI node plugin. By proactively cleaning up any lingering stale or corrupted FUSE mount points before attempting a remount, it ensures that recovery operations always start from a clean state. This prevents issues like mount stacking and persistent data access errors, leading to a more robust and predictable system when FUSE daemons restart.

Highlights

  • Idempotent FUSE Recovery: The FUSE recovery process in the CSI node plugin has been made idempotent by ensuring that stale or corrupted mount points are cleaned up before any remount attempt.
  • Prevention of Mount Stacking: This change prevents mount stacking, excessive /proc/self/mountinfo growth, and persistent data access failures that could occur when FUSE daemons restarted and left stale mounts.
  • Utilizing Kubernetes Mount Utilities: The solution leverages Kubernetes’ canonical mount utilities, specifically mount.CleanupMountPoint(force=true), to lazily detach corrupted FUSE mounts, aligning behavior with kubelet.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@fluid-e2e-bot

fluid-e2e-bot bot commented Jan 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangyuliufeng for approval by writing /assign @yangyuliufeng in a comment. For more information, see The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot

fluid-e2e-bot bot commented Jan 25, 2026

Hi @mrhapile. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces an idempotent FUSE recovery mechanism by cleaning up stale mount points before attempting to recover. The approach of using mount.CleanupMountPoint is sound and aligns with Kubernetes' own practices for handling corrupted mounts.

My review has identified a significant issue in the polling logic within the new lazyUnmountIfNeeded function that would cause unnecessary delays during recovery, along with a minor refactoring opportunity to improve code clarity and efficiency. Please see the detailed comment for a suggested fix.

Contributor

Copilot AI left a comment

Pull request overview

This PR improves the robustness of FUSE recovery in the CSI node plugin by preventing mount stacking when FUSE daemons restart. The fix adds a cleanup step before remounting to ensure corrupted or stale mount points are properly removed.

Changes:

  • Adds lazyUnmountIfNeeded function to clean up stale/corrupted mount points before recovery
  • Integrates cleanup call into the recovery flow to prevent mount stacking
  • Adds formatting improvements for code readability



```go
	mounter := mount.New("")

	// Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true).
```
Copilot AI Jan 25, 2026

The comment mentions "force=true" but CleanupMountPoint's third parameter is actually "extensiveMountPointCheck", not "force". The comment should be updated to accurately reflect what the parameter does: "extensiveMountPointCheck=true uses IsNotMountPoint instead of IsLikelyNotMountPoint, which properly handles bind mounts."

Suggested change:

```diff
-// Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true).
+// Use CleanupMountPoint with extensiveMountPointCheck=true, which uses IsNotMountPoint
+// instead of IsLikelyNotMountPoint and properly handles bind mounts and corrupted mounts.
```

```go
	glog.V(3).Infof("FuseRecovery: recovering broken mount point: %v", point)

	r.lazyUnmountIfNeeded(point.MountPath)
```

Copilot AI Jan 25, 2026

CleanupMountPoint will delete the directory after successfully unmounting if only a single mount layer exists (see k8s.io/utils/mount implementation). In the edge case where a mount point has only one corrupted mount layer, lazyUnmountIfNeeded will delete the directory, causing the subsequent recoverBrokenMount call to fail since the bind mount target no longer exists. While this primarily affects mount stacking scenarios where multiple layers exist, consider adding a check to recreate the directory if it was deleted, or verify the directory exists before attempting recoverBrokenMount to handle this edge case more gracefully.

Suggested change:

```go
// CleanupMountPoint in k8s.io/utils/mount may remove the mount directory after
// unmounting the last layer. In that case, ensure the directory exists before
// attempting to recover the broken mount.
if _, err := os.Stat(point.MountPath); err != nil {
	if os.IsNotExist(err) {
		if mkErr := os.MkdirAll(point.MountPath, 0755); mkErr != nil {
			glog.Warningf("FuseRecovery: failed to recreate mount path directory %s after lazy unmount: %v, skip recovering it", point.MountPath, mkErr)
			return
		}
	} else {
		glog.Warningf("FuseRecovery: failed to stat mount path %s after lazy unmount: %v, skip recovering it", point.MountPath, err)
		return
	}
}
```

Comment on lines 239 to 258

```go
func (r *FuseRecover) lazyUnmountIfNeeded(mountPath string) {
	// FUSE restart leaves stale mounts. Normal unmount may fail or block.
	// Lazy unmount prevents stack piling and cleans up /dev/fuse references.

	mounter := mount.New("")

	// Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true).
	if err := mount.CleanupMountPoint(mountPath, mounter, true); err != nil {
		glog.Warningf("FuseRecovery: failed to cleanup mount %s: %v", mountPath, err)
	}

	// Wait briefly until the mount is released by the kernel.
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
		notMnt, err := mounter.IsLikelyNotMountPoint(mountPath)
		return err == nil && notMnt, nil
	})
	if err != nil {
		glog.Warningf("FuseRecovery: timeout waiting for mount point %s to be cleaned", mountPath)
	}
}
```
Copilot AI Jan 25, 2026

The new lazyUnmountIfNeeded function introduces important logic for cleaning up stale FUSE mounts but lacks unit test coverage. While the PR description mentions existing tests were executed, adding a focused unit test for this function (similar to the existing TestFuseRecover_umountDuplicate test) would help ensure the mount cleanup logic works correctly in various scenarios, such as when CleanupMountPoint succeeds, fails, or times out during polling.

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
@RongGu
Member

RongGu commented Jan 25, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request improves the FUSE recovery process by making it idempotent. It introduces a lazyUnmountIfNeeded function to clean up stale mount points before attempting recovery, which is a good approach to prevent mount stacking. My review focuses on the implementation of this new function. I've found a potential issue in the error handling logic within the polling mechanism that could mask errors and cause unnecessary delays. The rest of the changes look good.

Comment on lines 251 to 257

```go
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
		notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath)
		if os.IsNotExist(err) {
			return true, nil
		}
		return err == nil && notMnt, nil
	})
```
Contributor

high

The error handling in the polling function could be improved. If r.Interface.IsLikelyNotMountPoint(mountPath) returns an error other than os.IsNotExist, the current implementation will cause wait.Poll to continue until it times out, without propagating the underlying error. This can mask the real issue and lead to unnecessary delays. It's better to propagate the error immediately to stop polling.

Suggested change:

```diff
 err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
 	notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath)
-	if os.IsNotExist(err) {
-		return true, nil
+	if err != nil {
+		if os.IsNotExist(err) {
+			return true, nil
+		}
+		return false, err
 	}
-	return err == nil && notMnt, nil
+	return notMnt, nil
 })
```

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



```go
	}

	// Wait briefly until the mount is released by the kernel.
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
```
Copilot AI Jan 25, 2026

The deprecated wait.Poll function is used here. The codebase uses wait.PollUntilContextTimeout in other locations (e.g., pkg/csi/plugins/nodeserver.go:629 and pkg/utils/dataset/lifecycle/node.go:305). Consider using wait.PollUntilContextTimeout or wait.PollUntilContextCancel instead to follow the established pattern and use the more modern context-based API.

Suggested change:

```diff
-	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
+	err := wait.PollUntilContextTimeout(context.Background(), 500*time.Millisecond, 2*time.Second, false, func(ctx context.Context) (bool, error) {
```

Comment on lines 239 to 261

```go
func (r *FuseRecover) lazyUnmountIfNeeded(mountPath string) {
	// FUSE restart leaves stale mounts. Normal unmount may fail or block.
	// Lazy unmount prevents stack piling and cleans up /dev/fuse references.

	// Use CleanupMountPoint with extensiveMountPointCheck=true, which relies on
	// IsNotMountPoint instead of IsLikelyNotMountPoint and properly handles
	// bind mounts and corrupted mount points during cleanup.
	if err := mount.CleanupMountPoint(mountPath, r.Interface, true); err != nil {
		glog.Warningf("FuseRecovery: failed to cleanup mount %s: %v", mountPath, err)
	}

	// Wait briefly until the mount is released by the kernel.
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
		notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath)
		if os.IsNotExist(err) {
			return true, nil
		}
		return err == nil && notMnt, nil
	})
	if err != nil {
		glog.Warningf("FuseRecovery: timeout waiting for mount point %s to be cleaned", mountPath)
	}
}
```
Copilot AI Jan 25, 2026

The lazyUnmountIfNeeded function was introduced but lacks corresponding test coverage. While the PR description states "No new test cases were added," other functions in this file like umountDuplicate and recoverBrokenMount have dedicated unit tests (see pkg/csi/recover/recover_test.go:140 and pkg/csi/recover/recover_test.go:172). Consider adding a test case for lazyUnmountIfNeeded to maintain consistent test coverage and verify its behavior, especially around the CleanupMountPoint call and polling logic.

@codecov

codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.82%. Comparing base (f9acae1) to head (5893f5a).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
pkg/csi/recover/recover.go 71.42% 6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5495      +/-   ##
==========================================
- Coverage   57.61%   56.82%   -0.80%     
==========================================
  Files         443      443              
  Lines       30735    30763      +28     
==========================================
- Hits        17708    17481     -227     
- Misses      11414    11725     +311     
+ Partials     1613     1557      -56     

☔ View full report in Codecov by Sentry.

@RongGu
Member

RongGu commented Jan 25, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request improves the robustness of FUSE recovery by making it idempotent, which is a valuable enhancement. The approach of cleaning up stale mount points before remounting is sound. I've found one issue in the new lazyUnmountIfNeeded function related to error handling within the polling logic that could mask underlying problems. My review includes a specific suggestion to address this to make the recovery process even more robust.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
@mrhapile
Author

@RongGu can you check whether I have implemented the changes suggested by gemini-code-assist?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.




@mrhapile
Author

@RongGu can you please review again? I have addressed all the Copilot and Gemini review comments.


Successfully merging this pull request may close these issues.

[BUG] Fail to access data from mount point after recover FUSE mount point several times
