Skip to content

fix(vfs/sys_write): 在进入写路径前先将用户缓冲区拷贝到内核#1821

Merged
fslongjin merged 1 commit intoDragonOS-Community:masterfrom
xboHodx:fix/sys_write
Mar 19, 2026
Merged

fix(vfs/sys_write): 在进入写路径前先将用户缓冲区拷贝到内核#1821
fslongjin merged 1 commit intoDragonOS-Community:masterfrom
xboHodx:fix/sys_write

Conversation

@xboHodx
Copy link
Contributor

@xboHodx xboHodx commented Mar 19, 2026

这个 PR 修复了 DragonOS 运行 kata-agent 时触发的一处内核 panic。

问题分析

sys_write(2) 之前会把“直接借用的用户态缓冲区切片”传入 VFS 写路径。对于 pipe fd,pipe::write_at()
会先获取自旋锁,再去读取这段用户缓冲区。如果此时源用户页还没有 fault in,内核就会在持有 pipe 自旋锁的情况下
进入 filemap_fault,最终带着 preempt_count=1 进入阻塞等待路径,并在 schedule() 处触发断言 panic。

这个 PR 将 sys_write 的用户态写入路径改为:先通过受保护的用户缓冲区接口把数据拷贝到内核 Vec,然后再调用
do_write()。这样 VFS/pipe 等后续路径只会访问内核内存,不会再在持锁状态下触碰用户页。

修复前的路径大致是:

  • sys_write 创建 UserBufferReader
  • 调用 read_from_user() 得到一个基于用户地址的借用切片
  • 将该切片直接传给 do_write()
  • pipe::write_at() 持有 self.inner 自旋锁后,通过 copy_from_slice() 消费该切片

这意味着:如果用户缓冲区对应页面还未建立映射或需要缺页处理,就可能在持有 spinlock 的情况下触发页错误,这是
不允许的。

修复方式

在 kernel/src/filesystem/vfs/syscall/sys_write.rs 中:

  • 保留原有的 user_accessible_len() 检查,维持当前“部分可访问时按短写处理”的语义
  • 对于来自用户态的写请求,改为使用 buffer_protected(0)?
  • 通过 read_all()? 先把可访问数据安全拷贝到内核 Vec
  • 再将内核缓冲区传给 do_write()

对于内核态发起的写请求,仍然保留原有直接读取路径。

修复效果

  • 执行 /bin/kata-agent 不再触发之前的 schedule() panic
  • 原先由 preempt_count=1 引发的内核崩溃已经消失
  • kata-agent 已经能够正常进入用户态执行,并继续运行到后续兼容性问题
  • 当前新的报错是 /proc/self/mountstats 缺失,这与本次 panic 无关

这说明本 PR 已经从根本上修复了这次内核 bug。
:不再 panic,kata-agent 能输出自身日志并继续执行到新的用户态错误

更加合适的处理

现在的处理会多一次拷贝,影响性能。
参考linux中,将用户态引用包装成一个带类型信息的 iov_iter,明确告诉后续路径:

  • 这段数据是 user-backed
  • 不能按普通内核缓冲区随便解引用
  • 必须通过 copy_from_iter()、copy_page_from_iter()、iov_iter_get_pages*() 这类受控接口消费

若是要这样处理的话得重构 VFS 写接口,让写路径能区分 user buffer 和 kernel buffer,并且用类似 iov_iter / UserBuffer 的抽象把这个语义一路保留下去

…ite syscall

Signed-off-by: xboHodx <xbohodx2024@gmail.com>
@fslongjin
Copy link
Member

@claude review

@dragonos-community-ai-pr-reviewer
Copy link

dragonos-community-ai-pr-reviewer bot commented Mar 19, 2026

Claude finished @fslongjin's task in 5m 31s —— View job


Code Review for PR #1821: fix(vfs/sys_write): 在进入写路径前先将用户缓冲区拷贝到内核

I'll review the changes in this PR that fixes a kernel panic when running kata-agent. Let me analyze the code changes and provide feedback.

Todo list:

  • Read the changed file to understand the implementation
  • Analyze the fix for correctness and potential issues
  • Check for edge cases and performance considerations
  • Review against the PR description's analysis
  • Provide comprehensive review feedback

Summary

The PR successfully fixes a kernel panic that occurred when sys_write to a pipe file descriptor triggered a page fault while holding a spinlock. The fix ensures user buffers are copied to kernel memory before entering write paths that may hold locks.

Code Review

Changes Overview

The modifications in kernel/src/filesystem/vfs/syscall/sys_write.rs replace the previous pattern of passing a direct reference to user memory with a safe copy-to-kernel approach:

Before: UserBufferReader::read_from_user(0)? returned a slice referencing user memory, which was passed to do_write(). Downstream code (e.g., pipe::write_at) could page-fault while holding locks.

After: For user-space calls, buffer_protected(0)?.read_all() copies accessible user data into a kernel Vec<u8> before calling do_write(). Kernel-space calls continue using read_from_user(0)?.to_vec().

Correctness Assessment

  1. Safety: The fix correctly prevents page faults while holding locks by ensuring all subsequent write paths only access kernel memory.

  2. Error Handling: Maintains existing user_accessible_len checks for partial accessibility (short write semantics). The copy_from_user_protected mechanism safely handles page faults via exception tables.

  3. Edge Cases:

    • Zero-length writes (len == 0) handled correctly
    • Inaccessible buffers (accessible == 0) return EFAULT
    • Partial accessibility results in short writes (consistent with Linux behavior)
    • Kernel-space writes continue to work (with minor extra copying)
  4. Race Conditions: The user_accessible_len + copy_from_user_protected combination properly handles concurrent memory mapping changes (faults return EFAULT).

Potential Issues

  1. Memory Allocation: read_all() allocates a Vec<u8> of size accessible (up to the requested write length). This could cause kernel OOM for large writes that previously only referenced user memory. Consider:

    • Are there existing limits on write sizes? (pipe buffers, socket buffers, etc.)
    • Should we implement chunked copying or limit the maximum allocation size?
  2. Performance Impact: As noted in the PR description, this adds an extra copy. While acceptable for fixing the panic, a longer-term solution (类似 Linux 的 iov_iter 抽象) would be beneficial.

  3. Consistency: Other syscalls (sys_read, sys_writev, pwrite, etc.) may have similar patterns and could benefit from the same fix. Consider auditing related code.

Recommendations

  1. Merge this fix as it resolves a critical kernel panic.

  2. Add a comment explaining why the copy is necessary (to avoid page faults while holding locks) and referencing the kata-agent panic scenario.

  3. Consider future refactoring toward a UserBuffer/iov_iter-like abstraction that preserves the user-backed semantics through the write path without unnecessary copies.

  4. Test Coverage: Ensure existing tests pass and consider adding a test for the pipe+page-fault scenario if not already covered.

Verification

The fix aligns with the PR's analysis and should eliminate the schedule() panic observed with kata-agent. The additional copy is a reasonable trade-off for correctness.

Approval: ✅ The PR can be merged as-is. Future improvements (performance, abstraction) can be addressed separately.

--- |

@github-actions github-actions bot added the Bug fix A bug is fixed in this pull request label Mar 19, 2026
@fslongjin fslongjin merged commit 7376eae into DragonOS-Community:master Mar 19, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug fix A bug is fixed in this pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants