fix(session): fsync sessions on graceful shutdown to prevent data loss by hussein1362 · Pull Request #3369 · HKUDS/nanobot

hussein1362 · 2026-04-21T20:23:13Z

Summary

Flush all cached sessions to durable storage with fsync during gateway shutdown, preventing silent data loss on filesystems with write-back caching.

Problem

When nanobot runs on a filesystem with write-back caching (rclone VFS, NFS, FUSE mounts), the OS page cache may buffer recent session writes. If the process is killed or restarted before the cache flushes to the backing store, the most recent conversation turns are silently lost.

Real-world scenario we hit: nanobot running in Docker with rclone VFS mount (Google Drive, write_back=5s). User asks the agent to check emails → agent responds with 3 options → user says "Yes do it" → container is restarted (config change) → rclone write-back buffer is lost → agent loads stale session from Drive (missing the last 4 messages) → agent responds to a topic from 30 minutes ago instead of the user's most recent request.

The user's message is correctly processed but the response references completely wrong context because the session history jumped backward.

Changes

`nanobot/session/manager.py`

save(fsync=True) — optional flag that flushes the written file and its parent directory to durable storage via os.fsync(). Defaults to False (no performance impact on normal save operations).
flush_all() — re-saves every cached session with fsync=True. Errors on individual sessions are logged but don't prevent other sessions from flushing. Returns count of successfully flushed sessions.
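
A minimal sketch of how the two methods could fit together. `SessionStore`, its JSONL layout, and `_fsync_dir` are hypothetical stand-ins for illustration, not nanobot's actual `manager.py` code:

```python
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)


class SessionStore:
    """Hypothetical stand-in for the session manager (not nanobot's class)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self._cache: dict[str, list[dict]] = {}

    def save(self, key: str, messages: list[dict], *, fsync: bool = False) -> None:
        self._cache[key] = messages
        path = self.root / f"{key}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for msg in messages:
                f.write(json.dumps(msg) + "\n")
            if fsync:
                f.flush()             # move Python's buffer into the OS
                os.fsync(f.fileno())  # ask the OS to write the file out
        if fsync:
            self._fsync_dir(path.parent)

    @staticmethod
    def _fsync_dir(directory: Path) -> None:
        # Sync the directory entry too, so a newly created file is not lost.
        try:
            fd = os.open(directory, os.O_RDONLY)
        except PermissionError:
            return  # Windows refuses to open a directory this way
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    def flush_all(self) -> int:
        flushed = 0
        for key, messages in list(self._cache.items()):
            try:
                self.save(key, messages, fsync=True)
                flushed += 1
            except Exception:
                # Best-effort: log and keep going with the remaining sessions.
                logger.warning("failed to flush session %s", key, exc_info=True)
        return flushed
```

Calling `f.flush()` before `os.fsync()` matters in this sketch: `os.fsync()` only covers data that has already left Python's own buffer.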

`nanobot/cli/commands.py`

Call agent.sessions.flush_all() in the gateway shutdown finally block, after stopping heartbeat/cron/channels but before process exit.
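
The ordering can be sketched roughly as follows. `run_gateway`, `FakeService`, and `FakeSessions` are hypothetical names for illustration; the real gateway command and the stop methods of its services may look different:

```python
import asyncio


class FakeSessions:
    """Hypothetical stand-in for agent.sessions; only flush_all() matters."""

    def __init__(self, log: list[str]) -> None:
        self.log = log

    def flush_all(self) -> int:
        self.log.append("flush_all")
        return 0  # number of sessions flushed


class FakeService:
    """Hypothetical heartbeat/cron/channel service with an async stop()."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self.log = log

    async def stop(self) -> None:
        self.log.append(f"stop:{self.name}")


async def run_gateway(serve, services: list[FakeService], sessions: FakeSessions) -> None:
    try:
        await serve()
    finally:
        # Stop heartbeat/cron/channels first, as the PR describes...
        for service in services:
            await service.stop()
        # ...then flush cached sessions before the process exits.
        sessions.flush_all()
```

Because the flush sits in `finally`, it runs whether the serve loop returns normally or raises.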

`tests/session/test_session_fsync.py` (new, 8 tests)

save(fsync=False) does not call os.fsync
save(fsync=True) calls os.fsync twice (file + directory) on POSIX, once (file only) on Windows
save() default does not fsync (backward compat)
flush_all() on empty cache returns 0
flush_all() saves all cached sessions
flush_all() uses fsync
flush_all() continues on per-session errors
Flushed data survives simulated process restart (cold cache reload)
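
As an illustration of the last two cases, here is a rough sketch against a hypothetical `TinyStore`. The class, its JSON file layout, and the helper names are assumptions made for this example; the actual tests in `test_session_fsync.py` target the real session manager:

```python
import json
import os
from pathlib import Path
from unittest import mock


class TinyStore:
    """Hypothetical session store, just large enough to test against."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.cache: dict[str, list[str]] = {}

    def save(self, key: str, *, fsync: bool = False) -> None:
        with open(self.root / f"{key}.json", "w", encoding="utf-8") as f:
            json.dump(self.cache[key], f)
            if fsync:
                f.flush()
                os.fsync(f.fileno())

    def load(self, key: str) -> list[str]:
        if key not in self.cache:  # cold cache: read from disk
            with open(self.root / f"{key}.json", encoding="utf-8") as f:
                self.cache[key] = json.load(f)
        return self.cache[key]

    def flush_all(self) -> int:
        flushed = 0
        for key in list(self.cache):
            try:
                self.save(key, fsync=True)
                flushed += 1
            except Exception:
                pass  # the PR logs a warning here instead of raising
        return flushed


def check_continues_on_error(root: Path) -> int:
    store = TinyStore(root)
    store.cache = {"a": ["hi"], "broken": ["x"], "c": ["yo"]}
    real_save = store.save

    def flaky(key: str, *, fsync: bool = False) -> None:
        if key == "broken":
            raise OSError("simulated write failure")
        real_save(key, fsync=fsync)

    # One failing session must not stop the other two from flushing.
    with mock.patch.object(store, "save", side_effect=flaky):
        return store.flush_all()


def check_survives_restart(root: Path) -> list[str]:
    first = TinyStore(root)
    first.cache["s"] = ["Yes do it"]
    first.flush_all()
    second = TinyStore(root)  # fresh instance simulates a restarted process
    return second.load("s")
```

A fresh `TinyStore` only proves the data reached the filesystem API; it cannot prove the data reached the backing store of a write-back mount, which is what the fsync itself is for.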

`tests/cli/test_commands.py`

Added sessions attribute with flush_all() stub to _FakeAgentLoop so the existing gateway health endpoint test passes.

Design Notes

No performance regression: fsync=False is the default for all normal save() calls during message processing. fsync=True is only used during the shutdown flush.
Best-effort: flush_all() logs a warning on each individual failure but doesn't raise, so as many sessions as possible are saved even if one file is corrupted.
Portable: os.fsync() works on all platforms. Directory fsync opens the parent directory with os.open(O_RDONLY), which works on Linux/macOS; on Windows that call raises PermissionError, so the directory sync is skipped there (NTFS journals metadata synchronously).

Tests

$ python3.12 -m pytest tests/session/test_session_fsync.py tests/cli/test_commands.py tests/agent/test_unified_session.py -q
82 passed in 1.79s

$ ruff check nanobot/session/manager.py tests/session/test_session_fsync.py
All checks passed!

On filesystems with write-back caching (rclone VFS, NFS, FUSE mounts) the OS page cache may buffer recent session writes. If the process is killed before the cache flushes, the most recent conversation turns are silently lost, causing the agent to "forget" recent context and respond to stale history on the next startup.

Changes:
- session/manager.py: add fsync=True option to save() that flushes the file and its parent directory to durable storage. Add flush_all() that re-saves every cached session with fsync. Default save() behavior is unchanged (no fsync) to avoid performance regression in normal operation.
- cli/commands.py: call agent.sessions.flush_all() in the gateway shutdown finally block, after stopping heartbeat/cron/channels.
- tests/session/test_session_fsync.py: 8 tests covering fsync flag behavior, flush_all with empty/multiple/errored sessions, and data survival across simulated process restart.
- tests/cli/test_commands.py: add sessions attribute to _FakeAgentLoop so the gateway health endpoint test passes with the new shutdown flush.

On Windows, opening a directory with O_RDONLY raises PermissionError. Wrap the directory fsync in a try/except PermissionError — NTFS journals metadata synchronously so the directory sync is unnecessary there. Also adjust test assertions to expect 1 fsync call (file only) on Windows vs 2 (file + directory) on POSIX.
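
A sketch of what that guard and the platform-dependent assertion might look like. `fsync_directory`, `save_durably`, and `EXPECTED_FSYNC_CALLS` are hypothetical names, not identifiers from the PR:

```python
import os
import sys
import tempfile
from unittest import mock


def fsync_directory(path: str) -> bool:
    """Fsync a directory; return False where the platform refuses to open it."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except PermissionError:
        return False  # Windows: skip the directory sync
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return True


def save_durably(path: str, text: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    fsync_directory(os.path.dirname(path))


# File only on Windows; file + directory elsewhere.
EXPECTED_FSYNC_CALLS = 1 if sys.platform == "win32" else 2


def count_fsync_calls() -> int:
    # Replace os.fsync with a mock so the calls can be counted.
    with tempfile.TemporaryDirectory() as d, mock.patch("os.fsync") as fake:
        save_durably(os.path.join(d, "s.jsonl"), "{}\n")
        return fake.call_count
```

A test can then assert `count_fsync_calls() == EXPECTED_FSYNC_CALLS` and pass on both platform families.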

github-actions Bot mentioned this pull request Apr 22, 2026

🦞 OpenClaw Ecosystem Daily 2026-04-22 gsscsd/big_model_radar#226

Open

hussein1362 wants to merge 2 commits into HKUDS:main from hussein1362:fix/graceful-session-fsync
