fix: add asyncio.Lock to prevent concurrent file write conflicts #1520
Br1an67 wants to merge 1 commit into MoonshotAI:main
Conversation
On Windows, multiple coroutines writing to the same session context file simultaneously cause a `PermissionError` (Errno 13) due to Windows file locking semantics. This happens when `update_token_count()` and `append_message()` (or other write methods) race against each other. Add an `asyncio.Lock` to serialize all file write operations on the context backend file, preventing concurrent access conflicts. Closes MoonshotAI#1429
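The shape of the fix can be sketched as follows. This is a minimal, hypothetical model of the pattern, not the PR's actual code: the class and method names are illustrative, and plain `open()` stands in for `aiofiles`.

```python
import asyncio
import json
import tempfile
from pathlib import Path

class Context:
    """Sketch: serialize every write to the backing file behind one lock."""

    def __init__(self, path: Path):
        self._file_backend = path
        self._write_lock = asyncio.Lock()  # one lock per context instance

    async def append_record(self, record: dict) -> None:
        # Without the lock, two coroutines can sit inside the
        # open/write/close window at the same time; on Windows the
        # second open() then fails with PermissionError (Errno 13).
        async with self._write_lock:
            with self._file_backend.open("a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")

async def demo() -> int:
    path = Path(tempfile.mkdtemp()) / "context.jsonl"
    ctx = Context(path)
    # Concurrent writers are serialized, one at a time.
    await asyncio.gather(
        ctx.append_record({"role": "user", "content": "hi"}),
        ctx.append_record({"role": "_usage", "token_count": 42}),
    )
    return len(path.read_text(encoding="utf-8").splitlines())

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because `asyncio` is single-threaded, the lock only needs to cover the span between the `await` points of open and close; holding it any longer is unnecessary.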
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 07a7f1736e
```python
self._next_checkpoint_id += 1
logger.debug("Checkpointing, ID: {id}", id=checkpoint_id)

async with self._write_lock:
    async with aiofiles.open(self._file_backend, "a", encoding="utf-8") as f:
        await f.write(json.dumps({"role": "_checkpoint", "id": checkpoint_id}) + "\n")
```
Acquire `_write_lock` before advancing in-memory state

Because `run_soul()` cancels `soul_task` on user interrupt, any `checkpoint()`/`append_message()`/`update_token_count()` call that is queued on the new `_write_lock` can now be cancelled after `_next_checkpoint_id`, `_history`, or `_token_count` has already been mutated. That leaves the live `Context` ahead of `context.jsonl`: later turns in the same CLI session can see phantom checkpoints/messages/token counts, while a resumed or exported session after restart does not. Moving the in-memory updates inside the locked section avoids this divergence under the exact concurrent-write condition this patch introduces.
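Following that suggestion, `checkpoint()` could defer the counter increment until the lock is held, so a task cancelled while queued on the lock has mutated neither memory nor the file. This is a hedged sketch using the field names visible in the diff above, not the repository's exact code; plain `open()` stands in for `aiofiles`.

```python
import asyncio
import json
from pathlib import Path

class Context:
    def __init__(self, path: Path):
        self._file_backend = path
        self._write_lock = asyncio.Lock()
        self._next_checkpoint_id = 0

    async def checkpoint(self) -> int:
        # Advance in-memory state only after the lock is acquired:
        # cancellation while waiting on the lock then leaves both the
        # counter and context.jsonl untouched, so they cannot diverge.
        async with self._write_lock:
            checkpoint_id = self._next_checkpoint_id
            self._next_checkpoint_id += 1
            with self._file_backend.open("a", encoding="utf-8") as f:
                f.write(json.dumps({"role": "_checkpoint", "id": checkpoint_id}) + "\n")
        return checkpoint_id
```

Note that `asyncio.Lock.acquire()` raises `CancelledError` without granting the lock when the waiting task is cancelled, which is what makes "mutate only after acquire" a safe ordering.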
```diff
@@ -203,13 +209,15 @@ async def append_message(self, message: Message | Sequence[Message]):
     messages = [message] if isinstance(message, Message) else message
     self._history.extend(messages)
```
🔴 In-memory state updated before acquiring `_write_lock` in `append_message`, enabling file-memory desync under concurrent access

`self._history.extend(messages)` at line 210 executes before the `_write_lock` is acquired at line 212. This means a concurrent lock holder (e.g., `revert_to()` or `clear()`) can clear `_history` after the extend but before the file write, leaving the file with data that's missing from memory.
Concrete scenario via `asyncio.shield`

`asyncio.shield(self._grow_context(...))` at kimisoul.py:728 keeps `_grow_context` alive as a background task after cancellation. `_grow_context` calls `append_message` (kimisoul.py:778, 785) and `update_token_count` (kimisoul.py:780). If the outer task is cancelled and a new turn starts calling `clear()` or `revert_to()`:

1. Shielded `_grow_context` calls `append_message` → `self._history.extend(messages)` succeeds (pre-lock)
2. `append_message` blocks on `_write_lock` (held by `clear()`/`revert_to()`)
3. The lock holder clears `_history` (wiping the just-extended message)
4. Lock is released → `append_message` acquires it and writes the message to the file
5. Result: file contains the message, `_history` does not → desync on next `restore()`

The same pattern exists in `update_token_count` (line 219), where `self._token_count = token_count` is set before the lock.
```diff
-    self._history.extend(messages)
+    async with self._write_lock:
+        self._history.extend(messages)
```
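The interleaving described above is easy to reproduce in miniature. In this toy model (names hypothetical, not the PR's code: a list stands in for `context.jsonl`, and `asyncio.sleep(0)` stands in for the real file awaits), memory is mutated before the lock while a concurrent lock holder wipes it:

```python
import asyncio

class Ctx:
    def __init__(self):
        self._history: list = []
        self._file: list = []               # stands in for context.jsonl
        self._write_lock = asyncio.Lock()

    async def append_buggy(self, msg: str) -> None:
        self._history.append(msg)           # memory mutated BEFORE the lock
        async with self._write_lock:
            await asyncio.sleep(0)          # yield point, like a real file await
            self._file.append(msg)

async def race():
    ctx = Ctx()
    await ctx._write_lock.acquire()         # simulate clear()/revert_to() holding the lock
    writer = asyncio.create_task(ctx.append_buggy("hi"))
    await asyncio.sleep(0)                  # writer extends _history, then blocks on the lock
    ctx._history.clear()                    # lock holder wipes the just-extended message
    ctx._write_lock.release()
    await writer                            # writer now appends "hi" to the file anyway
    return ctx._file, ctx._history

file_contents, history = asyncio.run(race())
print(file_contents, history)               # the "file" has the message, memory does not
```

Moving the `append` inside the `async with` block makes the extend and the write atomic with respect to any other lock holder, which is exactly what the suggestion above does.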
```python
async def update_token_count(self, token_count: int):
    logger.debug("Updating token count in context: {token_count}", token_count=token_count)
    self._token_count = token_count
```
🔴 In-memory state updated before acquiring `_write_lock` in `update_token_count`, same desync risk as `append_message`

`self._token_count = token_count` at line 219 executes before the `_write_lock` is acquired at line 221, following the same problematic pattern as `append_message`. Under the `asyncio.shield` concurrency scenario described for `append_message` (via `_grow_context` at kimisoul.py:780), a concurrent `clear()` or `revert_to()` can reset `_token_count` to 0 between the assignment and the file write, causing the stale token count to be written to the file while in-memory state shows 0.
```diff
-    self._token_count = token_count
+    async with self._write_lock:
+        self._token_count = token_count
```
```python
self._history.clear()
self._token_count = 0
```
🟡 `clear()` performs memory clearing outside `_write_lock`, inconsistent with `revert_to()`

In `clear()`, the in-memory state clearing (`_history.clear()`, `_token_count = 0`, `_next_checkpoint_id = 0`, `_system_prompt = None`) at lines 202-205 is outside the `async with self._write_lock:` block. In contrast, `revert_to()` performs identical memory operations inside its lock (lines 150-153). While this is technically safe in current CPython asyncio (there is no yield point between the lock release and the synchronous statements), it is inconsistent with the locking strategy of `revert_to()` and fragile: if `append_message` is fixed to move the extend inside its lock, `clear()`'s memory operations also need to move inside the lock to ensure proper serialization.
(Refers to lines 202-205)
Prompt for agents

In src/kimi_cli/soul/context.py, move the four memory-clearing statements (lines 202-205) inside the `async with self._write_lock:` block in the `clear()` method, matching the pattern used by `revert_to()`. Specifically, `self._history.clear()`, `self._token_count = 0`, `self._next_checkpoint_id = 0`, and `self._system_prompt = None` should be placed after `self._file_backend.touch()`, indented under the `async with`, before the block ends.
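A sketch of that change (hedged: the field names come from the review text above, and plain `Path.touch()` replaces any `aiofiles` handling, so this is not the repository's exact code) would look like:

```python
import asyncio
from pathlib import Path

class Context:
    def __init__(self, path: Path):
        self._file_backend = path
        self._write_lock = asyncio.Lock()
        self._history: list = []
        self._token_count = 0
        self._next_checkpoint_id = 0
        self._system_prompt = None

    async def clear(self) -> None:
        # File touch and all four memory resets sit under the same lock,
        # mirroring revert_to(): a writer queued on the lock can never
        # observe half-cleared state.
        async with self._write_lock:
            self._file_backend.touch()
            self._history.clear()
            self._token_count = 0
            self._next_checkpoint_id = 0
            self._system_prompt = None
```

Keeping the memory resets under the lock means `clear()` and a fixed `append_message()` serialize against each other as a unit, closing the desync window described in the red comments above.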
Related Issue
Resolves #1429
Description
On Windows, multiple coroutines writing to the same session context file (`context.jsonl`) simultaneously cause `PermissionError: [Errno 13]` due to Windows mandatory file locking. This happens when `update_token_count()` and `append_message()` (or other write methods) race against each other during normal CLI operation.

Root cause: The `Context` class has no synchronization between its async file write methods. While `asyncio` is single-threaded, the `await` points in `aiofiles.open()` and `await f.write()` yield control, allowing another coroutine to attempt opening the same file before the first one has closed it.

Fix: Added an `asyncio.Lock` to the `Context` class that serializes all file I/O operations:

- `write_system_prompt()`: acquires lock before file creation or atomic rewrite
- `checkpoint()`: acquires lock before appending the checkpoint record
- `revert_to()`: acquires lock for rotation + rebuild
- `clear()`: acquires lock for rotation + touch
- `append_message()`: acquires lock before appending messages
- `update_token_count()`: acquires lock before appending the usage record

All 14 existing context tests pass.
Checklist

- [ ] `make gen-changelog` to update the changelog.
- [ ] `make gen-docs` to update the user documentation.