
fix: add asyncio.Lock to prevent concurrent file write conflicts#1520

Closed
Br1an67 wants to merge 1 commit into MoonshotAI:main from Br1an67:fix/context-concurrent-write-lock

Conversation

@Br1an67

@Br1an67 Br1an67 commented Mar 20, 2026

Related Issue

Resolve #1429

Description

On Windows, multiple coroutines writing to the same session context file (context.jsonl) at the same time can trigger PermissionError: [Errno 13], because Windows enforces mandatory file locking. This happens when update_token_count() and append_message() (or other write methods) race against each other during normal CLI operation.

Root cause: The Context class has no synchronization between its async file write methods. While asyncio is single-threaded, await points in aiofiles.open() and await f.write() yield control, allowing another coroutine to attempt opening the same file before the first one has closed it.

Fix: Added an asyncio.Lock to the Context class that serializes all file I/O operations:

  • write_system_prompt() — acquires lock before file creation or atomic rewrite
  • checkpoint() — acquires lock before appending checkpoint record
  • revert_to() — acquires lock for rotation + rebuild
  • clear() — acquires lock for rotation + touch
  • append_message() — acquires lock before appending messages
  • update_token_count() — acquires lock before appending usage record
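The serialized-write pattern can be sketched as follows. This is a simplified stand-in for the real Context class, not the PR's actual code: plain open() dispatched via asyncio.to_thread replaces aiofiles so the example is self-contained, and only append_message() is shown.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class Context:
    """Simplified stand-in: one asyncio.Lock serializes all appends
    to the backend file."""

    def __init__(self, file_backend: Path) -> None:
        self._file_backend = file_backend
        self._write_lock = asyncio.Lock()  # serializes all file I/O

    def _append_line(self, line: str) -> None:
        with open(self._file_backend, "a", encoding="utf-8") as f:
            f.write(line)

    async def append_message(self, message: dict) -> None:
        # Without the lock, two coroutines can both sit inside the
        # open/write window at an await point; on Windows the second
        # open() then raises PermissionError: [Errno 13].
        async with self._write_lock:
            await asyncio.to_thread(self._append_line, json.dumps(message) + "\n")


async def main() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "context.jsonl"
        ctx = Context(path)
        # 50 concurrent writers; the lock forces them through one at a time.
        await asyncio.gather(
            *(ctx.append_message({"role": "user", "id": i}) for i in range(50))
        )
        lines = path.read_text(encoding="utf-8").splitlines()
        assert len(lines) == 50  # every record written, none lost


asyncio.run(main())
```

Note that the lock only serializes writers within one event loop; it does not protect against a second process opening the same file.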

All 14 existing context tests pass.

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked the related issue, if any.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have run make gen-changelog to update the changelog.
  • I have run make gen-docs to update the user documentation.


On Windows, multiple coroutines writing to the same session context
file simultaneously causes PermissionError (Errno 13) due to Windows
file locking semantics. This happens when update_token_count() and
append_message() (or other write methods) race against each other.

Add an asyncio.Lock to serialize all file write operations on the
context backend file, preventing concurrent access conflicts.

Closes MoonshotAI#1429

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07a7f1736e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 108 to +111
self._next_checkpoint_id += 1
logger.debug("Checkpointing, ID: {id}", id=checkpoint_id)

async with self._write_lock:
    async with aiofiles.open(self._file_backend, "a", encoding="utf-8") as f:
        await f.write(json.dumps({"role": "_checkpoint", "id": checkpoint_id}) + "\n")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Acquire _write_lock before advancing in-memory state

Because run_soul() cancels soul_task on user interrupt, any checkpoint()/append_message()/update_token_count() call that is queued on the new _write_lock can now be cancelled after _next_checkpoint_id, _history, or _token_count has already been mutated. That leaves the live Context ahead of context.jsonl: later turns in the same CLI can see phantom checkpoints/messages/token counts, while a resumed or exported session after restart does not. Moving the in-memory updates inside the locked section avoids this divergence under the exact concurrent-write condition this patch introduces.
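The reordering Codex suggests looks roughly like this. It is a sketch under assumptions: the field names come from the quoted diff, but plain synchronous open() replaces aiofiles and the rest of the method is simplified.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class Context:
    def __init__(self, file_backend: Path) -> None:
        self._file_backend = file_backend
        self._write_lock = asyncio.Lock()
        self._next_checkpoint_id = 0

    async def checkpoint(self) -> int:
        # Advance in-memory state only once the lock is held: a task
        # cancelled while still waiting on the lock has mutated nothing,
        # so memory and context.jsonl cannot diverge.
        async with self._write_lock:
            checkpoint_id = self._next_checkpoint_id
            self._next_checkpoint_id += 1
            with open(self._file_backend, "a", encoding="utf-8") as f:
                f.write(json.dumps({"role": "_checkpoint", "id": checkpoint_id}) + "\n")
        return checkpoint_id


async def main() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        ctx = Context(Path(tmp) / "context.jsonl")
        ids = [await ctx.checkpoint() for _ in range(3)]
        assert ids == [0, 1, 2]
        backend = Path(tmp) / "context.jsonl"
        assert len(backend.read_text(encoding="utf-8").splitlines()) == 3


asyncio.run(main())
```

The key property: between lock acquisition and release there is no point where the counter has advanced but the record is unwritten, so cancellation at any await leaves the two in sync.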



@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 3 potential issues.


@@ -203,13 +209,15 @@ async def append_message(self, message: Message | Sequence[Message]):
messages = [message] if isinstance(message, Message) else message
self._history.extend(messages)


🔴 In-memory state updated before acquiring _write_lock in append_message, enabling file-memory desync under concurrent access

self._history.extend(messages) at line 210 executes before the _write_lock is acquired at line 212. This means a concurrent lock holder (e.g., revert_to() or clear()) can clear _history after the extend but before the file write, leaving the file with data that's missing from memory.

Concrete scenario via asyncio.shield

asyncio.shield(self._grow_context(...)) at kimisoul.py:728 keeps _grow_context alive as a background task after cancellation. _grow_context calls append_message (kimisoul.py:778,785) and update_token_count (kimisoul.py:780). If the outer task is cancelled and a new turn starts calling clear() or revert_to():

  1. Shielded _grow_context calls append_message; self._history.extend(messages) succeeds (pre-lock)
  2. append_message blocks on _write_lock (held by clear()/revert_to())
  3. The lock holder clears _history (wiping the just-extended message)
  4. Lock is released → append_message acquires it and writes the message to the file
  5. Result: file contains the message, _history does not → desync on next restore()

The same pattern exists in update_token_count (line 219), where self._token_count = token_count is set before the lock.

Suggested change
-        self._history.extend(messages)
+        async with self._write_lock:
+            self._history.extend(messages)
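The hazard described above is small enough to reproduce in a self-contained toy, with the backend file modeled as a plain list (class and field names mirror the review but nothing here is the PR's real code; the interleaving is forced with asyncio.sleep(0) yields):

```python
import asyncio


class BadContext:
    """Toy version of the pre-lock mutation hazard: _history is extended
    before the lock, so a concurrent lock holder can wipe it first."""

    def __init__(self) -> None:
        self._history: list[str] = []
        self._file: list[str] = []  # stands in for context.jsonl
        self._write_lock = asyncio.Lock()

    async def append_message(self, msg: str) -> None:
        self._history.append(msg)  # mutated BEFORE the lock (the bug)
        async with self._write_lock:
            await asyncio.sleep(0)  # stands in for await f.write(...)
            self._file.append(msg)


async def main() -> None:
    ctx = BadContext()
    async with ctx._write_lock:  # stand-in for clear()/revert_to() holding the lock
        task = asyncio.create_task(ctx.append_message("hello"))
        await asyncio.sleep(0)  # append extends _history, then blocks on the lock
        ctx._history.clear()  # the lock holder clears memory...
        ctx._file.clear()  # ...and rotates the file
    await task  # append wakes up and writes "hello" to the file anyway
    assert ctx._file == ["hello"] and ctx._history == []  # file/memory desync


asyncio.run(main())
```

Moving the append inside the locked section makes the assertion impossible: the extend then cannot run until the lock holder has finished clearing.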



async def update_token_count(self, token_count: int):
    logger.debug("Updating token count in context: {token_count}", token_count=token_count)
    self._token_count = token_count


🔴 In-memory state updated before acquiring _write_lock in update_token_count, same desync risk as append_message

self._token_count = token_count at line 219 executes before the _write_lock is acquired at line 221, following the same problematic pattern as append_message. Under the asyncio.shield concurrency scenario described for append_message (via _grow_context at kimisoul.py:780), a concurrent clear() or revert_to() can reset _token_count to 0 between the assignment and the file write, causing the stale token count to be written to the file while in-memory state shows 0.

Suggested change
-        self._token_count = token_count
+        async with self._write_lock:
+            self._token_count = token_count


Comment on lines 202 to 203
self._history.clear()
self._token_count = 0


🟡 clear() performs memory clearing outside _write_lock, inconsistent with revert_to()

In clear(), the in-memory state clearing (_history.clear(), _token_count = 0, _next_checkpoint_id = 0, _system_prompt = None) at lines 202-205 is outside the async with self._write_lock: block. In contrast, revert_to() performs identical memory operations inside its lock (lines 150-153). While in current CPython's asyncio this is technically safe (no yield between lock release and synchronous statements), it's inconsistent with the locking strategy of revert_to() and fragile — if append_message is fixed to move extend inside its lock, clear()'s memory ops also need to be inside the lock to ensure proper serialization.

(Refers to lines 202-205)

Prompt for agents
In src/kimi_cli/soul/context.py, move the four memory-clearing statements (lines 202-205) inside the async with self._write_lock: block in the clear() method, matching the pattern used by revert_to(). Specifically, self._history.clear(), self._token_count = 0, self._next_checkpoint_id = 0, and self._system_prompt = None should be placed after self._file_backend.touch() but before the async with block ends (i.e., before line 200's closing of the logger.debug call, add them indented under the async with).
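Applied literally, the suggested clear() would look something like this sketch. Field names come from the review; "rotation" is reduced to a rename plus touch, and plain pathlib I/O replaces aiofiles for self-containment.

```python
import asyncio
import tempfile
from pathlib import Path


class Context:
    def __init__(self, file_backend: Path) -> None:
        self._file_backend = file_backend
        self._write_lock = asyncio.Lock()
        self._history: list = []
        self._token_count = 0
        self._next_checkpoint_id = 0
        self._system_prompt = None

    async def clear(self) -> None:
        async with self._write_lock:
            # Rotate the old backend file aside, then start a fresh one.
            if self._file_backend.exists():
                self._file_backend.rename(self._file_backend.with_suffix(".bak"))
            self._file_backend.touch()
            # Memory resets INSIDE the lock, matching revert_to(): no other
            # coroutine can observe a fresh file paired with stale memory.
            self._history.clear()
            self._token_count = 0
            self._next_checkpoint_id = 0
            self._system_prompt = None


async def main() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "context.jsonl"
        path.write_text('{"role": "user"}\n', encoding="utf-8")
        ctx = Context(path)
        ctx._history = ["stale"]
        ctx._token_count = 42
        await ctx.clear()
        assert path.read_text(encoding="utf-8") == ""  # fresh, empty file
        assert ctx._history == [] and ctx._token_count == 0


asyncio.run(main())
```
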




Development

Successfully merging this pull request may close these issues.

[Bug] Concurrent writes on Windows cause Permission denied: [Errno 13]
