-
Notifications
You must be signed in to change notification settings - Fork 1.2k
csi: make FUSE recovery idempotent by cleaning stale mounts #5495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
Summary of ChangesHello @mrhapile, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reliability of FUSE recovery within the CSI node plugin. By proactively cleaning up any lingering stale or corrupted FUSE mount points before attempting a remount, it ensures that recovery operations always start from a clean state. This prevents issues like mount stacking and persistent data access errors, leading to a more robust and predictable system when FUSE daemons restart. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @mrhapile. Thanks for your PR. I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces an idempotent FUSE recovery mechanism by cleaning up stale mount points before attempting to recover. The approach of using mount.CleanupMountPoint is sound and aligns with Kubernetes' own practices for handling corrupted mounts.
My review has identified a significant issue in the polling logic within the new lazyUnmountIfNeeded function that would cause unnecessary delays during recovery, along with a minor refactoring opportunity to improve code clarity and efficiency. Please see the detailed comment for a suggested fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves the robustness of FUSE recovery in the CSI node plugin by preventing mount stacking when FUSE daemons restart. The fix adds a cleanup step before remounting to ensure corrupted or stale mount points are properly removed.
Changes:
- Adds
lazyUnmountIfNeededfunction to clean up stale/corrupted mount points before recovery - Integrates cleanup call into the recovery flow to prevent mount stacking
- Adds formatting improvements for code readability
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/csi/recover/recover.go
Outdated
|
|
||
| mounter := mount.New("") | ||
|
|
||
| // Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true). |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "force=true" but CleanupMountPoint's third parameter is actually "extensiveMountPointCheck", not "force". The comment should be updated to accurately reflect what the parameter does: "extensiveMountPointCheck=true uses IsNotMountPoint instead of IsLikelyNotMountPoint, which properly handles bind mounts."
| // Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true). | |
| // Use CleanupMountPoint with extensiveMountPointCheck=true, which uses IsNotMountPoint | |
| // instead of IsLikelyNotMountPoint and properly handles bind mounts and corrupted mounts. |
| glog.V(3).Infof("FuseRecovery: recovering broken mount point: %v", point) | ||
|
|
||
| r.lazyUnmountIfNeeded(point.MountPath) | ||
|
|
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CleanupMountPoint will delete the directory after successfully unmounting if only a single mount layer exists (see k8s.io/utils/mount implementation). In the edge case where a mount point has only one corrupted mount layer, lazyUnmountIfNeeded will delete the directory, causing the subsequent recoverBrokenMount call to fail since the bind mount target no longer exists. While this primarily affects mount stacking scenarios where multiple layers exist, consider adding a check to recreate the directory if it was deleted, or verify the directory exists before attempting recoverBrokenMount to handle this edge case more gracefully.
| // CleanupMountPoint in k8s.io/utils/mount may remove the mount directory after | |
| // unmounting the last layer. In that case, ensure the directory exists before | |
| // attempting to recover the broken mount. | |
| if _, err := os.Stat(point.MountPath); err != nil { | |
| if os.IsNotExist(err) { | |
| if mkErr := os.MkdirAll(point.MountPath, 0755); mkErr != nil { | |
| glog.Warningf("FuseRecovery: failed to recreate mount path directory %s after lazy unmount: %v, skip recovering it", point.MountPath, mkErr) | |
| return | |
| } | |
| } else { | |
| glog.Warningf("FuseRecovery: failed to stat mount path %s after lazy unmount: %v, skip recovering it", point.MountPath, err) | |
| return | |
| } | |
| } |
| func (r *FuseRecover) lazyUnmountIfNeeded(mountPath string) { | ||
| // FUSE restart leaves stale mounts. Normal unmount may fail or block. | ||
| // Lazy unmount prevents stack piling and cleans up /dev/fuse references. | ||
|
|
||
| mounter := mount.New("") | ||
|
|
||
| // Use CleanupMountPoint which handles corrupted mounts and lazy unmount (force=true). | ||
| if err := mount.CleanupMountPoint(mountPath, mounter, true); err != nil { | ||
| glog.Warningf("FuseRecovery: failed to cleanup mount %s: %v", mountPath, err) | ||
| } | ||
|
|
||
| // Wait briefly until the mount is released by the kernel. | ||
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { | ||
| notMnt, err := mounter.IsLikelyNotMountPoint(mountPath) | ||
| return err == nil && notMnt, nil | ||
| }) | ||
| if err != nil { | ||
| glog.Warningf("FuseRecovery: timeout waiting for mount point %s to be cleaned", mountPath) | ||
| } | ||
| } |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new lazyUnmountIfNeeded function introduces important logic for cleaning up stale FUSE mounts but lacks unit test coverage. While the PR description mentions existing tests were executed, adding a focused unit test for this function (similar to the existing TestFuseRecover_umountDuplicate test) would help ensure the mount cleanup logic works correctly in various scenarios, such as when CleanupMountPoint succeeds, fails, or times out during polling.
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
7838223 to
d04668d
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request improves the FUSE recovery process by making it idempotent. It introduces a lazyUnmountIfNeeded function to clean up stale mount points before attempting recovery, which is a good approach to prevent mount stacking. My review focuses on the implementation of this new function. I've found a potential issue in the error handling logic within the polling mechanism that could mask errors and cause unnecessary delays. The rest of the changes look good.
pkg/csi/recover/recover.go
Outdated
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { | ||
| notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath) | ||
| if os.IsNotExist(err) { | ||
| return true, nil | ||
| } | ||
| return err == nil && notMnt, nil | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling in the polling function could be improved. If r.Interface.IsLikelyNotMountPoint(mountPath) returns an error other than os.IsNotExist, the current implementation will cause wait.Poll to continue until it times out, without propagating the underlying error. This can mask the real issue and lead to unnecessary delays. It's better to propagate the error immediately to stop polling.
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { | |
| notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath) | |
| if os.IsNotExist(err) { | |
| return true, nil | |
| } | |
| return err == nil && notMnt, nil | |
| }) | |
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { | |
| notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath) | |
| if err != nil { | |
| if os.IsNotExist(err) { | |
| return true, nil | |
| } | |
| return false, err | |
| } | |
| return notMnt, nil | |
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/csi/recover/recover.go
Outdated
| } | ||
|
|
||
| // Wait briefly until the mount is released by the kernel. | ||
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deprecated wait.Poll function is used here. The codebase uses wait.PollUntilContextTimeout in other locations (e.g., pkg/csi/plugins/nodeserver.go:629 and pkg/utils/dataset/lifecycle/node.go:305). Consider using wait.PollUntilContextTimeout or wait.PollUntilContextCancel instead to follow the established pattern and use the more modern context-based API.
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { | |
| err := wait.PollUntilContextTimeout(context.Background(), 500*time.Millisecond, 2*time.Second, false, func(ctx context.Context) (bool, error) { |
| func (r *FuseRecover) lazyUnmountIfNeeded(mountPath string) { | ||
| // FUSE restart leaves stale mounts. Normal unmount may fail or block. | ||
| // Lazy unmount prevents stack piling and cleans up /dev/fuse references. | ||
|
|
||
| // Use CleanupMountPoint with extensiveMountPointCheck=true, which relies on | ||
| // IsNotMountPoint instead of IsLikelyNotMountPoint and properly handles | ||
| // bind mounts and corrupted mount points during cleanup. | ||
| if err := mount.CleanupMountPoint(mountPath, r.Interface, true); err != nil { | ||
| glog.Warningf("FuseRecovery: failed to cleanup mount %s: %v", mountPath, err) | ||
| } | ||
|
|
||
| // Wait briefly until the mount is released by the kernel. | ||
| err := wait.Poll(500*time.Millisecond, 2*time.Second, func() (bool, error) { | ||
| notMnt, err := r.Interface.IsLikelyNotMountPoint(mountPath) | ||
| if os.IsNotExist(err) { | ||
| return true, nil | ||
| } | ||
| return err == nil && notMnt, nil | ||
| }) | ||
| if err != nil { | ||
| glog.Warningf("FuseRecovery: timeout waiting for mount point %s to be cleaned", mountPath) | ||
| } | ||
| } |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lazyUnmountIfNeeded function was introduced but lacks corresponding test coverage. While the PR description states "No new test cases were added," other functions in this file like umountDuplicate and recoverBrokenMount have dedicated unit tests (see pkg/csi/recover/recover_test.go:140 and pkg/csi/recover/recover_test.go:172). Consider adding a test case for lazyUnmountIfNeeded to maintain consistent test coverage and verify its behavior, especially around the CleanupMountPoint call and polling logic.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5495 +/- ##
==========================================
- Coverage 57.61% 56.82% -0.80%
==========================================
Files 443 443
Lines 30735 30763 +28
==========================================
- Hits 17708 17481 -227
- Misses 11414 11725 +311
+ Partials 1613 1557 -56 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request improves the robustness of FUSE recovery by making it idempotent, which is a valuable enhancement. The approach of cleaning up stale mount points before remounting is sound. I've found one issue in the new lazyUnmountIfNeeded function related to error handling within the polling logic that could mask underlying problems. My review includes a specific suggestion to address this to make the recovery process even more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
|
@RongGu can u check I have implemented the suggested changes asked by gemini-code-assist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
|
|
@RongGu can you please review again I did all the required copilot and gemini review |



Ⅰ. Describe what this PR does
This PR improves the robustness of FUSE recovery in the CSI node plugin by making the recovery process idempotent.
Previously, when a FUSE daemon restarted, stale or corrupted mount points (for example, "transport endpoint is not connected") could remain in the kernel. Subsequent recovery attempts would remount on top of these stale mounts, leading to mount stacking, excessive /proc/self/mountinfo growth, and persistent data access failures.
This change ensures that stale or corrupted FUSE mount points are cleaned up before any remount attempt using Kubernetes’ canonical mount utilities. As a result, recovery always starts from a clean mount state and avoids mount stacking.
Ⅱ. Does this pull request fix one issue?
fixes #5416
Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅳ. Describe how to verify it
Run unit tests:
go test ./pkg/csi/recover
In a Kubernetes cluster with FUSE recovery enabled:
This change relies on mount.CleanupMountPoint(force=true), which lazily detaches corrupted FUSE mounts in a manner consistent with kubelet behavior.
Ⅴ. Special notes for reviews