
Conversation

@mrhapile

Ⅰ. Describe what this PR does

This PR improves the robustness of FUSE recovery in the CSI node plugin by making the recovery process idempotent.

Previously, when a FUSE daemon restarted, stale or corrupted mount points (for example, "transport endpoint is not connected") could remain in the kernel. Subsequent recovery attempts would remount on top of these stale mounts, leading to mount stacking, excessive /proc/self/mountinfo growth, and persistent data access failures.

This change ensures that stale or corrupted FUSE mount points are cleaned up before any remount attempt using Kubernetes’ canonical mount utilities. As a result, recovery always starts from a clean mount state and avoids mount stacking.

Ⅱ. Does this pull request fix one issue?

fixes #5416

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

  • Existing unit tests under pkg/csi/recover were executed.
  • No new test cases were added, as this change refines existing recovery behavior and does not introduce new public APIs or logic paths that can be effectively unit-tested in isolation.

Ⅳ. Describe how to verify it

  • Run unit tests:
    go test ./pkg/csi/recover

  • In a Kubernetes cluster with FUSE recovery enabled:

    • Repeatedly restart the FUSE pod.
    • Verify that stale or corrupted FUSE mounts are cleaned up before remounting.
    • Confirm that mount stacking does not occur and data access recovers without persistent "transport endpoint is not connected" errors.

This change relies on mount.CleanupMountPoint with extensiveMountPointCheck=true, which detaches corrupted FUSE mounts in a manner consistent with kubelet behavior.
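The cleanup-before-remount idea can be illustrated in isolation. The following is a minimal, self-contained Go sketch: the `Mounter` interface, `recoverMount` helper, and fake mount table are illustrative stand-ins (not Fluid's actual code or the k8s.io/utils/mount API), simulating stale mount layers left behind by FUSE daemon restarts:

```go
package main

import (
	"errors"
	"fmt"
)

// Mounter is a hypothetical stand-in for the subset of mount
// operations used during recovery; names are illustrative only.
type Mounter interface {
	Unmount(target string) error
	IsLikelyNotMountPoint(target string) (bool, error)
	Mount(source, target string) error
}

// fakeMounter simulates a kernel mount table where mounts can stack.
type fakeMounter struct {
	layers map[string]int // how many mounts are stacked on each path
}

func (f *fakeMounter) Unmount(target string) error {
	if f.layers[target] == 0 {
		return errors.New("not mounted")
	}
	f.layers[target]--
	return nil
}

func (f *fakeMounter) IsLikelyNotMountPoint(target string) (bool, error) {
	return f.layers[target] == 0, nil
}

func (f *fakeMounter) Mount(source, target string) error {
	f.layers[target]++
	return nil
}

// recoverMount cleans up every stale layer before remounting, so
// repeated recoveries always converge to one healthy mount (idempotent).
func recoverMount(m Mounter, source, target string) error {
	for {
		notMnt, err := m.IsLikelyNotMountPoint(target)
		if err != nil {
			return err
		}
		if notMnt {
			break
		}
		if err := m.Unmount(target); err != nil {
			return err
		}
	}
	return m.Mount(source, target)
}

func main() {
	// Start from a stale stack of 3 mounts, then recover repeatedly.
	fm := &fakeMounter{layers: map[string]int{"/runtime-mnt/fuse": 3}}
	for i := 0; i < 5; i++ {
		if err := recoverMount(fm, "fuse-daemon", "/runtime-mnt/fuse"); err != nil {
			panic(err)
		}
	}
	fmt.Println(fm.layers["/runtime-mnt/fuse"]) // always 1: no stacking
}
```

Without the cleanup loop, each recovery would add a layer on top of the stale mount, which is exactly the /proc/self/mountinfo growth described above.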

Ⅴ. Special notes for reviews

  • This PR intentionally uses Kubernetes’ canonical mount cleanup utilities to handle corrupted FUSE mounts instead of forcing unconditional lazy unmounts, aligning the behavior with kubelet and avoiding disruption to healthy mount propagation.
  • The scope of this change is limited to FUSE recovery logic and does not modify configuration, public APIs, or runtime behavior outside of recovery.

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
Copilot AI review requested due to automatic review settings January 25, 2026 05:34
@gemini-code-assist
Contributor

Summary of Changes

Hello @mrhapile, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability of FUSE recovery within the CSI node plugin. By proactively cleaning up any lingering stale or corrupted FUSE mount points before attempting a remount, it ensures that recovery operations always start from a clean state. This prevents issues like mount stacking and persistent data access errors, leading to a more robust and predictable system when FUSE daemons restart.

Highlights

  • Idempotent FUSE Recovery: The FUSE recovery process in the CSI node plugin has been made idempotent by ensuring that stale or corrupted mount points are cleaned up before any remount attempt.
  • Prevention of Mount Stacking: This change prevents mount stacking, excessive /proc/self/mountinfo growth, and persistent data access failures that could occur when FUSE daemons restarted and left stale mounts.
  • Utilizing Kubernetes Mount Utilities: The solution leverages Kubernetes’ canonical mount utilities, specifically mount.CleanupMountPoint(force=true), to lazily detach corrupted FUSE mounts, aligning behavior with kubelet.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@fluid-e2e-bot

fluid-e2e-bot bot commented Jan 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangyuliufeng for approval by writing /assign @yangyuliufeng in a comment. For more information, see The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fluid-e2e-bot

fluid-e2e-bot bot commented Jan 25, 2026

Hi @mrhapile. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces an idempotent FUSE recovery mechanism by cleaning up stale mount points before attempting to recover. The approach of using mount.CleanupMountPoint is sound and aligns with Kubernetes' own practices for handling corrupted mounts.

My review has identified a significant issue in the polling logic within the new lazyUnmountIfNeeded function that would cause unnecessary delays during recovery, along with a minor refactoring opportunity to improve code clarity and efficiency. Please see the detailed comment for a suggested fix.

Contributor

Copilot AI left a comment

Pull request overview

This PR improves the robustness of FUSE recovery in the CSI node plugin by preventing mount stacking when FUSE daemons restart. The fix adds a cleanup step before remounting to ensure corrupted or stale mount points are properly removed.

Changes:

  • Adds lazyUnmountIfNeeded function to clean up stale/corrupted mount points before recovery
  • Integrates cleanup call into the recovery flow to prevent mount stacking
  • Adds formatting improvements for code readability



```go
	mounter := mount.New("")

	// Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true).
```
Copilot AI Jan 25, 2026

The comment mentions "force=true" but CleanupMountPoint's third parameter is actually "extensiveMountPointCheck", not "force". The comment should be updated to accurately reflect what the parameter does: "extensiveMountPointCheck=true uses IsNotMountPoint instead of IsLikelyNotMountPoint, which properly handles bind mounts."

Suggested change:

```diff
-// Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true).
+// Use CleanupMountPoint with extensiveMountPointCheck=true, which uses IsNotMountPoint
+// instead of IsLikelyNotMountPoint and properly handles bind mounts and corrupted mounts.
```

```go
	glog.V(3).Infof("FuseRecovery: recovering broken mount point: %v", point)

	r.lazyUnmountIfNeeded(point.MountPath)
```

Copilot AI Jan 25, 2026

CleanupMountPoint will delete the directory after successfully unmounting if only a single mount layer exists (see k8s.io/utils/mount implementation). In the edge case where a mount point has only one corrupted mount layer, lazyUnmountIfNeeded will delete the directory, causing the subsequent recoverBrokenMount call to fail since the bind mount target no longer exists. While this primarily affects mount stacking scenarios where multiple layers exist, consider adding a check to recreate the directory if it was deleted, or verify the directory exists before attempting recoverBrokenMount to handle this edge case more gracefully.

Suggested change:

```go
// CleanupMountPoint in k8s.io/utils/mount may remove the mount directory after
// unmounting the last layer. In that case, ensure the directory exists before
// attempting to recover the broken mount.
if _, err := os.Stat(point.MountPath); err != nil {
	if os.IsNotExist(err) {
		if mkErr := os.MkdirAll(point.MountPath, 0755); mkErr != nil {
			glog.Warningf("FuseRecovery: failed to recreate mount path directory %s after lazy unmount: %v, skip recovering it", point.MountPath, mkErr)
			return
		}
	} else {
		glog.Warningf("FuseRecovery: failed to stat mount path %s after lazy unmount: %v, skip recovering it", point.MountPath, err)
		return
	}
}
```

Comment on lines 239 to 258

```go
func (r *FuseRecover) lazyUnmountIfNeeded(mountPath string) {
	// FUSE restart leaves stale mounts. Normal unmount may fail or block.
	// Lazy unmount prevents stack piling and cleans up /dev/fuse references.

	mounter := mount.New("")

	// Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true).
	if err := mount.CleanupMountPoint(mountPath, mounter, true); err != nil {
		glog.Warningf("FuseRecovery: failed to cleanup mount %s: %v", mountPath, err)
	}

	// Wait briefly until the mount is released by the kernel.
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
		notMnt, err := mounter.IsLikelyNotMountPoint(mountPath)
		return err == nil && notMnt, nil
	})
	if err != nil {
		glog.Warningf("FuseRecovery: timeout waiting for mount point %s to be cleaned", mountPath)
	}
}
```
Copilot AI Jan 25, 2026

The new lazyUnmountIfNeeded function introduces important logic for cleaning up stale FUSE mounts but lacks unit test coverage. While the PR description mentions existing tests were executed, adding a focused unit test for this function (similar to the existing TestFuseRecover_umountDuplicate test) would help ensure the mount cleanup logic works correctly in various scenarios, such as when CleanupMountPoint succeeds, fails, or times out during polling.

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
@RongGu
Member

RongGu commented Jan 25, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request improves the FUSE recovery process by making it idempotent. It introduces a lazyUnmountIfNeeded function to clean up stale mount points before attempting recovery, which is a good approach to prevent mount stacking. My review focuses on the implementation of this new function. I've found a potential issue in the error handling logic within the polling mechanism that could mask errors and cause unnecessary delays. The rest of the changes look good.

Comment on lines 251 to 257

```go
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
		notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath)
		if os.IsNotExist(err) {
			return true, nil
		}
		return err == nil && notMnt, nil
	})
```
Contributor

high

The error handling in the polling function could be improved. If r.Interface.IsLikelyNotMountPoint(mountPath) returns an error other than os.IsNotExist, the current implementation will cause wait.Poll to continue until it times out, without propagating the underlying error. This can mask the real issue and lead to unnecessary delays. It's better to propagate the error immediately to stop polling.

Suggested change:

```diff
 err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
 	notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath)
-	if os.IsNotExist(err) {
-		return true, nil
+	if err != nil {
+		if os.IsNotExist(err) {
+			return true, nil
+		}
+		return false, err
 	}
-	return err == nil && notMnt, nil
+	return notMnt, nil
 })
```

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



```go
	}

	// Wait briefly until the mount is released by the kernel.
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
```
Copilot AI Jan 25, 2026

The deprecated wait.Poll function is used here. The codebase uses wait.PollUntilContextTimeout in other locations (e.g., pkg/csi/plugins/nodeserver.go:629 and pkg/utils/dataset/lifecycle/node.go:305). Consider using wait.PollUntilContextTimeout or wait.PollUntilContextCancel instead to follow the established pattern and use the more modern context-based API.

Suggested change:

```diff
-	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
+	err := wait.PollUntilContextTimeout(context.Background(), 500*time.Millisecond, 2*time.Second, false, func(ctx context.Context) (bool, error) {
```

Comment on lines 239 to 261

```go
func (r *FuseRecover) lazyUnmountIfNeeded(mountPath string) {
	// FUSE restart leaves stale mounts. Normal unmount may fail or block.
	// Lazy unmount prevents stack piling and cleans up /dev/fuse references.

	// Use CleanupMountPoint with extensiveMountPointCheck=true, which relies on
	// IsNotMountPoint instead of IsLikelyNotMountPoint and properly handles
	// bind mounts and corrupted mount points during cleanup.
	if err := mount.CleanupMountPoint(mountPath, r.Interface, true); err != nil {
		glog.Warningf("FuseRecovery: failed to cleanup mount %s: %v", mountPath, err)
	}

	// Wait briefly until the mount is released by the kernel.
	err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) {
		notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath)
		if os.IsNotExist(err) {
			return true, nil
		}
		return err == nil && notMnt, nil
	})
	if err != nil {
		glog.Warningf("FuseRecovery: timeout waiting for mount point %s to be cleaned", mountPath)
	}
}
```
Copilot AI Jan 25, 2026

The lazyUnmountIfNeeded function was introduced but lacks corresponding test coverage. While the PR description states "No new test cases were added," other functions in this file like umountDuplicate and recoverBrokenMount have dedicated unit tests (see pkg/csi/recover/recover_test.go:140 and pkg/csi/recover/recover_test.go:172). Consider adding a test case for lazyUnmountIfNeeded to maintain consistent test coverage and verify its behavior, especially around the CleanupMountPoint call and polling logic.

@codecov

codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.82%. Comparing base (f9acae1) to head (5893f5a).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
pkg/csi/recover/recover.go 71.42% 6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5495      +/-   ##
==========================================
- Coverage   57.61%   56.82%   -0.80%     
==========================================
  Files         443      443              
  Lines       30735    30763      +28     
==========================================
- Hits        17708    17481     -227     
- Misses      11414    11725     +311     
+ Partials     1613     1557      -56     

☔ View full report in Codecov by Sentry.

@RongGu
Member

RongGu commented Jan 25, 2026

/gemini review

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request improves the robustness of FUSE recovery by making it idempotent, which is a valuable enhancement. The approach of cleaning up stale mount points before remounting is sound. I've found one issue in the new lazyUnmountIfNeeded function related to error handling within the polling logic that could mask underlying problems. My review includes a specific suggestion to address this to make the recovery process even more robust.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
@mrhapile
Author

@RongGu can you check whether I have implemented the changes suggested by gemini-code-assist?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.




@mrhapile
Author

@RongGu can you please review again? I have addressed all the Copilot and Gemini review comments.


Successfully merging this pull request may close these issues.

[BUG] Fail to access data from mount point after recover FUSE mount point several times
