fix(session): fsync sessions on graceful shutdown to prevent data loss #3369
Open
hussein1362 wants to merge 2 commits into HKUDS:main
On filesystems with write-back caching (rclone VFS, NFS, FUSE mounts) the OS page cache may buffer recent session writes. If the process is killed before the cache flushes, the most recent conversation turns are silently lost, causing the agent to "forget" recent context and respond to stale history on the next startup.

Changes:
- session/manager.py: add fsync=True option to save() that flushes the file and its parent directory to durable storage. Add flush_all() that re-saves every cached session with fsync. Default save() behavior is unchanged (no fsync) to avoid performance regression in normal operation.
- cli/commands.py: call agent.sessions.flush_all() in the gateway shutdown finally block, after stopping heartbeat/cron/channels.
- tests/session/test_session_fsync.py: 8 tests covering fsync flag behavior, flush_all with empty/multiple/errored sessions, and data survival across simulated process restart.
- tests/cli/test_commands.py: add sessions attribute to _FakeAgentLoop so the gateway health endpoint test passes with the new shutdown flush.
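The `flush_all()` behavior and the shutdown hook described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SketchSessionManager`, `put`, `run_gateway`, and the injected `save_fn` callable are all hypothetical names chosen to keep the example storage-agnostic.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("session-sketch")


class SketchSessionManager:
    """Hypothetical stand-in for the session manager; not nanobot's real class."""

    def __init__(self, save_fn: Callable[[str, dict, bool], None]) -> None:
        self._cache: Dict[str, dict] = {}
        self._save_fn = save_fn  # injected so the sketch stays storage-agnostic

    def put(self, key: str, session: dict) -> None:
        self._cache[key] = session

    def save(self, key: str, fsync: bool = False) -> None:
        # Default stays fsync=False so normal saves keep their current cost.
        self._save_fn(key, self._cache[key], fsync)

    def flush_all(self) -> int:
        """Re-save every cached session with fsync=True; return the success count."""
        flushed = 0
        for key in list(self._cache):
            try:
                self.save(key, fsync=True)
                flushed += 1
            except Exception:
                # One bad session must not stop the rest from being flushed.
                logger.warning("failed to flush session %s", key, exc_info=True)
        return flushed


def run_gateway(manager: SketchSessionManager, serve: Callable[[], None]) -> int:
    """Hypothetical gateway loop: flush sessions in the finally block."""
    flushed = 0
    try:
        serve()
    finally:
        # In the PR this runs after heartbeat/cron/channels are stopped.
        flushed = manager.flush_all()
    return flushed
```

Catching the error per session rather than around the whole loop is what lets one failing file leave the others unaffected; the returned count makes partial success visible to the caller.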
On Windows, opening a directory with O_RDONLY raises PermissionError. Wrap the directory fsync in a try/except PermissionError; NTFS journals metadata synchronously, so the directory sync is unnecessary there. Also adjust test assertions to expect 1 fsync call (file only) on Windows vs 2 (file + directory) on POSIX.
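A minimal sketch of the file-plus-directory fsync with the Windows guard might look like this. The function names (`fsync_parent_dir`, `save_session`) and the one-JSON-object-per-line file layout are assumptions made for illustration; the PR's actual `save()` may be structured differently.

```python
import json
import os
from pathlib import Path


def fsync_parent_dir(path: Path) -> bool:
    """Flush the directory entry for `path`; return False if the sync was skipped."""
    try:
        dir_fd = os.open(str(path.parent), os.O_RDONLY)
    except PermissionError:
        # Windows refuses to open a directory this way; skip the directory sync.
        return False
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return True


def save_session(path: Path, messages: list, fsync: bool = False) -> None:
    """Write one JSON message per line (assumed layout); optionally force it to disk."""
    with open(path, "w", encoding="utf-8") as f:
        for message in messages:
            f.write(json.dumps(message) + "\n")
        if fsync:
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push its cache to the device
    if fsync:
        fsync_parent_dir(path)
```

The `f.flush()` before `os.fsync()` matters: without it, data still sitting in Python's userspace buffer would not be covered by the fsync.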
Summary
Flush all cached sessions to durable storage with `fsync` during gateway shutdown, preventing silent data loss on filesystems with write-back caching.
Problem
When nanobot runs on a filesystem with write-back caching (rclone VFS, NFS, FUSE mounts), the OS page cache may buffer recent session writes. If the process is killed or restarted before the cache flushes to the backing store, the most recent conversation turns are silently lost.
Real-world scenario we hit: nanobot running in Docker with an rclone VFS mount (Google Drive, `write_back=5s`). User asks the agent to check emails → agent responds with 3 options → user says "Yes do it" → container is restarted (config change) → rclone write-back buffer is lost → agent loads stale session from Drive (missing the last 4 messages) → agent responds to a topic from 30 minutes ago instead of the user's most recent request.

The user's message is correctly processed but the response references completely wrong context because the session history jumped backward.
Changes
`nanobot/session/manager.py`
- `save(fsync=True)`: optional flag that flushes the written file and its parent directory to durable storage via `os.fsync()`. Defaults to `False` (no performance impact on normal save operations).
- `flush_all()`: re-saves every cached session with `fsync=True`. Errors on individual sessions are logged but don't prevent other sessions from flushing. Returns the count of successfully flushed sessions.

`nanobot/cli/commands.py`
- Call `agent.sessions.flush_all()` in the gateway shutdown `finally` block, after stopping heartbeat/cron/channels but before process exit.

`tests/session/test_session_fsync.py` (new, 8 tests)
- `save(fsync=False)` does not call `os.fsync`
- `save(fsync=True)` calls `os.fsync` twice (file + directory) on POSIX, once (file only) on Windows
- `save()` default does not fsync (backward compat)
- `flush_all()` on empty cache returns 0
- `flush_all()` saves all cached sessions
- `flush_all()` uses fsync
- `flush_all()` continues on per-session errors

`tests/cli/test_commands.py`
- Add a `sessions` attribute with a `flush_all()` stub to `_FakeAgentLoop` so the existing gateway health endpoint test passes.

Design Notes
- `fsync=False` is the default for all normal `save()` calls during message processing. `fsync=True` is only used during the shutdown flush.
- `flush_all()` logs warnings on individual failures but doesn't raise, so as many sessions as possible are saved even if one file is corrupted.
- `os.fsync()` works on all platforms. Directory fsync uses `os.open(O_RDONLY)`, which works on Linux/macOS; on Windows the open raises `PermissionError`, so the directory sync is skipped (NTFS journals metadata synchronously).

Tests
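As a rough illustration of how the fsync-count checks could be written, the sketch below mocks `os.fsync` and counts calls. It is not the PR's test file: `save_text` is a hypothetical stand-in for the manager's `save()`, and `count_fsync_calls` is a helper invented for this example.

```python
import os
import sys
import tempfile
from pathlib import Path
from unittest import mock


def save_text(path: Path, text: str, fsync: bool = False) -> None:
    """Hypothetical stand-in for the manager's save(); not the PR's code."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        if fsync:
            f.flush()
            os.fsync(f.fileno())
    if fsync:
        try:
            dir_fd = os.open(str(path.parent), os.O_RDONLY)
        except PermissionError:
            return  # Windows: directory sync skipped
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def count_fsync_calls(fsync: bool) -> int:
    """Run save_text in a temp dir with os.fsync mocked; return the call count."""
    with tempfile.TemporaryDirectory() as tmp:
        with mock.patch("os.fsync") as fake_fsync:
            save_text(Path(tmp) / "session.txt", "hello", fsync=fsync)
        return fake_fsync.call_count


def test_fsync_call_counts() -> None:
    # File + directory on POSIX; file only on Windows.
    expected_with_fsync = 1 if sys.platform == "win32" else 2
    assert count_fsync_calls(fsync=False) == 0
    assert count_fsync_calls(fsync=True) == expected_with_fsync
```

Patching `os.fsync` on the module works here because `save_text` looks the function up through `os` at call time.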