fix(session): fsync sessions on graceful shutdown to prevent data loss #3369

Open
hussein1362 wants to merge 2 commits into HKUDS:main from hussein1362:fix/graceful-session-fsync

Summary

Flush all cached sessions to durable storage with fsync during gateway shutdown, preventing silent data loss on filesystems with write-back caching.

Problem

When nanobot runs on a filesystem with write-back caching (rclone VFS, NFS, FUSE mounts), the OS page cache may buffer recent session writes. If the process is killed or restarted before the cache flushes to the backing store, the most recent conversation turns are silently lost.

Real-world scenario we hit: nanobot running in Docker with rclone VFS mount (Google Drive, write_back=5s). User asks the agent to check emails → agent responds with 3 options → user says "Yes do it" → container is restarted (config change) → rclone write-back buffer is lost → agent loads stale session from Drive (missing the last 4 messages) → agent responds to a topic from 30 minutes ago instead of the user's most recent request.

The user's message is correctly processed but the response references completely wrong context because the session history jumped backward.

Changes

nanobot/session/manager.py

  • save(fsync=True) — optional flag that flushes the written file and its parent directory to durable storage via os.fsync(). Defaults to False (no performance impact on normal save operations).
  • flush_all() — re-saves every cached session with fsync=True. Errors on individual sessions are logged but don't prevent other sessions from flushing. Returns the number of sessions flushed successfully.
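The two methods above can be sketched roughly as follows. This is a minimal illustration, not the code in nanobot/session/manager.py: the class name `SketchSessionManager`, the `_cache` dict, the `_fsync_dir` helper, and the one-JSON-object-per-line file layout are all assumptions made for the example.

```python
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)


class SketchSessionManager:
    """Hypothetical stand-in for the real session manager (names are illustrative)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self._cache: dict[str, list[dict]] = {}  # session key -> messages

    def save(self, key: str, *, fsync: bool = False) -> Path:
        path = self.root / f"{key}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for message in self._cache[key]:
                f.write(json.dumps(message) + "\n")
            if fsync:
                f.flush()             # move Python's buffer into the OS
                os.fsync(f.fileno())  # ask the OS to persist the file contents
        if fsync:
            self._fsync_dir(path.parent)  # persist the directory entry too
        return path

    @staticmethod
    def _fsync_dir(directory: Path) -> None:
        try:
            fd = os.open(directory, os.O_RDONLY)
        except PermissionError:
            return  # Windows cannot open a directory this way; skip the sync
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    def flush_all(self) -> int:
        """Re-save every cached session with fsync; return how many succeeded."""
        flushed = 0
        for key in list(self._cache):
            try:
                self.save(key, fsync=True)
                flushed += 1
            except Exception:
                # Best-effort: log and keep going so one bad session
                # does not block the others.
                logger.warning("Failed to flush session %s", key, exc_info=True)
        return flushed
```

Keeping `fsync` keyword-only with a `False` default means existing `save()` call sites behave exactly as before, which is the backward-compatibility property the PR relies on.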

nanobot/cli/commands.py

  • Call agent.sessions.flush_all() in the gateway shutdown finally block, after stopping heartbeat/cron/channels but before process exit.
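The ordering described above might look something like the sketch below. The function name `run_until_shutdown`, the `services` sequence, and the `serve` callable are hypothetical; only `agent.sessions.flush_all()` and the "stop services first, flush last" order come from this PR. The comment about why services stop first is an inference, not something the PR states.

```python
import logging
from typing import Any, Awaitable, Callable, Sequence

logger = logging.getLogger(__name__)


async def run_until_shutdown(
    agent: Any,
    services: Sequence[Any],
    serve: Callable[[], Awaitable[None]],
) -> None:
    """Run `serve`, then stop services and flush sessions, in that order."""
    try:
        await serve()
    finally:
        # Stop heartbeat/cron/channels first (assumed reason: no new turns
        # should arrive while sessions are being flushed).
        for service in services:
            await service.stop()
        # Last step before the process exits: persist every cached session.
        flushed = agent.sessions.flush_all()
        logger.info("Flushed %d session(s) on shutdown", flushed)
```

Because the flush sits in a `finally` block, it runs on a normal return and when `serve` raises. It cannot help when the process is killed with SIGKILL, which is why the PR title says "graceful" shutdown.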

tests/session/test_session_fsync.py (new, 8 tests)

  • save(fsync=False) does not call os.fsync
  • save(fsync=True) calls os.fsync twice on POSIX (file + directory) and once on Windows (file only)
  • save() default does not fsync (backward compat)
  • flush_all() on empty cache returns 0
  • flush_all() saves all cached sessions
  • flush_all() uses fsync
  • flush_all() continues on per-session errors
  • Flushed data survives simulated process restart (cold cache reload)
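As an illustration of how the fsync-count tests can be written, the sketch below patches `os.fsync` with `unittest.mock.patch` and counts calls. The `save_text` helper is a hypothetical miniature of `save()`, written only so the example is self-contained; the real tests exercise the session manager itself.

```python
import os
import tempfile
from pathlib import Path
from unittest.mock import patch


def save_text(path: Path, text: str, *, fsync: bool = False) -> None:
    """Hypothetical miniature of save(); not the nanobot implementation."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        if fsync:
            f.flush()
            os.fsync(f.fileno())  # first call: the file itself
    if not fsync:
        return
    try:
        fd = os.open(path.parent, os.O_RDONLY)
    except PermissionError:
        return  # Windows: directory cannot be opened, so no second call
    try:
        os.fsync(fd)  # second call: the parent directory
    finally:
        os.close(fd)


def count_fsync_calls(**save_kwargs: bool) -> int:
    """Run save_text once with os.fsync mocked and return the call count."""
    with tempfile.TemporaryDirectory() as tmp, patch("os.fsync") as mock_fsync:
        save_text(Path(tmp) / "session.jsonl", "{}\n", **save_kwargs)
        return mock_fsync.call_count


def test_fsync_call_counts() -> None:
    expected_with_fsync = 1 if os.name == "nt" else 2
    assert count_fsync_calls() == 0             # default: no fsync
    assert count_fsync_calls(fsync=False) == 0  # explicit opt-out
    assert count_fsync_calls(fsync=True) == expected_with_fsync
```

Mocking `os.fsync` verifies that the call is made, not that data reaches the backing store. The cold-cache reload test in the list above covers the complementary case of reading the data back.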

tests/cli/test_commands.py

  • Added sessions attribute with flush_all() stub to _FakeAgentLoop so the existing gateway health endpoint test passes.

Design Notes

  • No performance regression: fsync=False is the default for all normal save() calls during message processing. fsync=True is only used during the shutdown flush.
  • Best-effort: flush_all() logs warnings on individual failures but doesn't raise, so as many sessions as possible are saved even if one file is corrupted.
  • Portable: os.fsync() works on all platforms. The directory fsync opens the parent directory with os.open(..., os.O_RDONLY), which works on Linux/macOS; on Windows that open raises PermissionError, so the directory sync is skipped there (NTFS journals metadata synchronously).

Tests

$ python3.12 -m pytest tests/session/test_session_fsync.py tests/cli/test_commands.py tests/agent/test_unified_session.py -q
82 passed in 1.79s

$ ruff check nanobot/session/manager.py tests/session/test_session_fsync.py
All checks passed!

On filesystems with write-back caching (rclone VFS, NFS, FUSE mounts)
the OS page cache may buffer recent session writes. If the process is
killed before the cache flushes, the most recent conversation turns are
silently lost — causing the agent to "forget" recent context and
respond to stale history on the next startup.

Changes:

- session/manager.py: add fsync=True option to save() that flushes the
  file and its parent directory to durable storage. Add flush_all() that
  re-saves every cached session with fsync. Default save() behavior is
  unchanged (no fsync) to avoid performance regression in normal
  operation.

- cli/commands.py: call agent.sessions.flush_all() in the gateway
  shutdown finally block, after stopping heartbeat/cron/channels.

- tests/session/test_session_fsync.py: 8 tests covering fsync flag
  behavior, flush_all with empty/multiple/errored sessions, and
  data survival across simulated process restart.

- tests/cli/test_commands.py: add sessions attribute to _FakeAgentLoop
  so the gateway health endpoint test passes with the new shutdown
  flush.

On Windows, opening a directory with O_RDONLY raises PermissionError.
Wrap the directory fsync in a try/except PermissionError — NTFS journals
metadata synchronously so the directory sync is unnecessary there.

Also adjust test assertions to expect 1 fsync call (file only) on
Windows vs 2 (file + directory) on POSIX.