fix: spawn wait_for_result_and_end_input as background task for string prompts#780

Merged
qing-ant merged 1 commit into main from fix/string-prompt-buffer-deadlock
Mar 30, 2026
@qing-ant
Contributor

Problem

query() with a string prompt and hooks or MCP servers deadlocks once the internal 100-slot anyio message buffer fills up. Each tool call produces ~2 messages, so this happens after about 50 tool calls.

Root cause

For string prompts, client.py:141 awaited wait_for_result_and_end_input() before receive_messages() started draining the buffer:

if isinstance(prompt, str):
    await chosen_transport.write(json.dumps(user_message) + "\n")
    await query.wait_for_result_and_end_input()   # blocks until "result" arrives

async for data in query.receive_messages():        # buffer drain starts here

Meanwhile _read_messages() keeps reading CLI stdout and pushing into the 100-slot channel. After ~50 tool calls the channel is full and _message_send.send() blocks. At that point _read_messages() cannot read anything further from stdout, including the "result" message that wait_for_result_and_end_input() is waiting for, so neither side can make progress: deadlock.
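The ordering problem can be reproduced in miniature. The sketch below is a simplified model, not the SDK's code: it uses a bounded asyncio.Queue as a stand-in for the anyio channel, shrinks the buffer to 4 slots, and every name in it (read_messages, inline_wait_deadlocks, result_seen) is hypothetical. It only illustrates why waiting for the result before draining can never finish once the producer has more messages than the buffer has slots.

```python
import asyncio

BUFFER_SLOTS = 4             # stand-in for the SDK's 100-slot channel (assumption)
MESSAGES_BEFORE_RESULT = 10  # more messages than the buffer can hold


async def read_messages(buffer: asyncio.Queue, result_seen: asyncio.Event) -> None:
    """Producer: models a reader pushing CLI output into a bounded buffer."""
    for i in range(MESSAGES_BEFORE_RESULT):
        # put() blocks as soon as the buffer is full and nobody is draining it.
        await buffer.put({"type": "assistant", "n": i})
    # Only reached once every earlier message has been accepted by the buffer.
    result_seen.set()
    await buffer.put({"type": "result"})


async def inline_wait_deadlocks() -> bool:
    """Return True if waiting for the result before draining hangs."""
    buffer: asyncio.Queue = asyncio.Queue(maxsize=BUFFER_SLOTS)
    result_seen = asyncio.Event()
    reader = asyncio.create_task(read_messages(buffer, result_seen))
    try:
        # Same ordering as the buggy path: wait for "result" before any drain.
        await asyncio.wait_for(result_seen.wait(), timeout=0.2)
        return False
    except asyncio.TimeoutError:
        # The producer is stuck on a full buffer, so the result never arrives.
        return True
    finally:
        reader.cancel()
        await asyncio.gather(reader, return_exceptions=True)


if __name__ == "__main__":
    print("deadlocked:", asyncio.run(inline_wait_deadlocks()))
```

The timeout is only there so the demonstration terminates; without it the wait would block indefinitely, which is the hang described above.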

Fix

Spawn wait_for_result_and_end_input() as a background task instead of awaiting it inline. This matches the existing AsyncIterable path which already uses spawn_task(stream_input()), and allows receive_messages() to start draining the buffer immediately.

# Before (deadlocks)
await query.wait_for_result_and_end_input()

# After (concurrent)
query.spawn_task(query.wait_for_result_and_end_input())
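To illustrate why the concurrent version completes, here is a minimal self-contained sketch of the same idea using plain asyncio. It is a model under stated assumptions rather than the SDK's implementation: asyncio.create_task stands in for query.spawn_task, a 4-slot asyncio.Queue stands in for the anyio channel, and all function names are hypothetical.

```python
import asyncio

BUFFER_SLOTS = 4             # stand-in for the SDK's 100-slot channel (assumption)
MESSAGES_BEFORE_RESULT = 10  # more messages than the buffer can hold


async def read_messages(buffer: asyncio.Queue, result_seen: asyncio.Event) -> None:
    """Producer: models a reader pushing CLI output into a bounded buffer."""
    for i in range(MESSAGES_BEFORE_RESULT):
        await buffer.put({"type": "assistant", "n": i})
    result_seen.set()
    await buffer.put({"type": "result"})


async def wait_for_result_and_end_input(
    result_seen: asyncio.Event, log: list
) -> None:
    """Hypothetical stand-in: wait for the result, then 'close' the input."""
    await result_seen.wait()
    log.append("input ended")


async def run_with_background_wait() -> tuple:
    buffer: asyncio.Queue = asyncio.Queue(maxsize=BUFFER_SLOTS)
    result_seen = asyncio.Event()
    log: list = []
    reader = asyncio.create_task(read_messages(buffer, result_seen))
    # The wait runs in the background, so the drain loop below starts at once.
    waiter = asyncio.create_task(wait_for_result_and_end_input(result_seen, log))

    received = []
    while True:
        message = await buffer.get()  # draining frees slots for the producer
        received.append(message["type"])
        if message["type"] == "result":
            break

    await asyncio.gather(reader, waiter)
    return received, log


if __name__ == "__main__":
    messages, log = asyncio.run(run_with_background_wait())
    print(len(messages), messages[-1], log)
```

Because the consumer loop starts immediately, the producer never stays blocked on a full buffer, and the background wait finishes as soon as the result is reached.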

Testing

  • Added regression test verifying spawn_task is called instead of direct await
  • Fixed existing test warnings from unawaited mock coroutines
  • All 425 tests pass
  • Lint, format, and mypy all clean
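A regression test of the kind described in the first two bullets might look roughly like the sketch below. This is not the test added in the PR: start_string_prompt is a hypothetical stand-in for the string-prompt branch, and the mocks only mirror the method names mentioned above. It uses unittest.mock to check that the coroutine is handed to spawn_task rather than awaited, and closes the unused coroutine so Python does not warn that it was never awaited.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def start_string_prompt(query, transport, payload: str) -> None:
    """Hypothetical stand-in for the string-prompt branch after the fix."""
    await transport.write(payload)
    query.spawn_task(query.wait_for_result_and_end_input())


def _close_coroutine(coro) -> None:
    # The mocked spawn_task never runs the coroutine, so close it explicitly
    # to avoid a "coroutine ... was never awaited" RuntimeWarning.
    coro.close()


def test_string_prompt_spawns_background_wait() -> None:
    transport = MagicMock()
    transport.write = AsyncMock()
    query = MagicMock()
    query.wait_for_result_and_end_input = AsyncMock()
    query.spawn_task = MagicMock(side_effect=_close_coroutine)

    asyncio.run(start_string_prompt(query, transport, '{"type": "user"}\n'))

    transport.write.assert_awaited_once()
    # The coroutine was created and passed to spawn_task...
    query.wait_for_result_and_end_input.assert_called_once()
    query.spawn_task.assert_called_once()
    # ...but never awaited inline, which is what caused the deadlock.
    query.wait_for_result_and_end_input.assert_not_awaited()


if __name__ == "__main__":
    test_string_prompt_spawns_background_wait()
    print("ok")
```

AsyncMock records a call when the mock is invoked and an await only when the returned coroutine is actually awaited, which is what lets the test distinguish the two code paths.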

Fixes #779

…g prompts

For string prompts with hooks or SDK MCP servers, query() awaited
wait_for_result_and_end_input() before receive_messages() started
draining the buffer. Once the 100-slot anyio channel filled (~50 tool
calls), _read_messages blocked on send() and could never deliver the
result message that wait_for_result_and_end_input needed, causing a
deadlock.

Spawn it as a background task instead, matching the existing
AsyncIterable path which already uses spawn_task(stream_input()).

Fixes #779
@qing-ant qing-ant enabled auto-merge (squash) March 30, 2026 20:11
@qing-ant qing-ant merged commit bd3b7a6 into main Mar 30, 2026
10 checks passed
@qing-ant qing-ant deleted the fix/string-prompt-buffer-deadlock branch March 30, 2026 20:26
@qing-ant
Contributor Author

E2E Test Results

Test script:

#!/usr/bin/env python3
"""E2E proof for PR #780: verify query() with string prompt + MCP doesn't deadlock.

The fix spawns wait_for_result_and_end_input() as a background task so the
message buffer drains concurrently, preventing deadlock after many tool calls.

This test creates an MCP server with a simple tool and asks the model to call
it many times sequentially. Each tool call generates multiple messages through
the buffer (assistant message, tool result, progress updates). Without the fix,
wait_for_result_and_end_input() is awaited inline, so receive_messages() never
starts draining the buffer and _read_messages() blocks once the 100-slot anyio
buffer fills up, before it can deliver the result message. With the fix, the
wait runs in a background task and messages drain concurrently.
"""

import asyncio
import sys
import time
from typing import Any

import claude_agent_sdk
from claude_agent_sdk import (
    AssistantMessage,
    ClaudeAgentOptions,
    ResultMessage,
    ToolUseBlock,
    create_sdk_mcp_server,
    tool,
)


@tool("get_number", "Return the square of the given number", {"n": int})
async def get_number(args: dict[str, Any]) -> dict[str, Any]:
    """Return the square of a number."""
    n = args.get("n", 0)
    return {"content": [{"type": "text", "text": str(n * n)}]}


async def main() -> None:
    print("=" * 70)
    print("PR #780 E2E Test: string-prompt + MCP deadlock fix (many tool calls)")
    print("=" * 70)
    print()
    print(f"SDK version: {claude_agent_sdk.__version__}")

    mcp_server = create_sdk_mcp_server(
        name="math_server",
        version="1.0.0",
        tools=[get_number],
    )

    options = ClaudeAgentOptions(
        mcp_servers={"math": mcp_server},
        max_turns=30,
        permission_mode="acceptEdits",
    )

    # Request 20 sequential tool calls. The model may delegate to sub-agent(s),
    # each of which makes several individual MCP tool calls. The messages
    # generated by these calls (assistant messages, tool results, progress
    # updates) all flow through the same anyio buffer that would deadlock
    # without the fix.
    prompt = (
        "Use the get_number MCP tool to compute the square of every integer from 1 to 20. "
        "Make exactly 20 individual get_number calls (one per integer). "
        "After all calls complete, list every input and result."
    )

    print(f"Prompt: {prompt[:120]}...")
    print(f"Max turns: {options.max_turns}")
    print(f"Permission mode: {options.permission_mode}")
    print("Timeout: 180s")
    print()
    print("--- Running query() ---")

    tool_calls = 0
    total_messages = 0
    msg_types: dict[str, int] = {}
    result_msg = None
    t0 = time.monotonic()

    try:
        async with asyncio.timeout(180):
            async for message in claude_agent_sdk.query(
                prompt=prompt,
                options=options,
            ):
                total_messages += 1
                mtype = type(message).__name__
                msg_types[mtype] = msg_types.get(mtype, 0) + 1

                if isinstance(message, AssistantMessage):
                    for block in message.content:
                        if isinstance(block, ToolUseBlock):
                            tool_calls += 1
                            if block.name == "mcp__math__get_number":
                                print(f"  Tool #{tool_calls}: get_number(n={block.input.get('n', '?')})")
                            else:
                                desc = str(block.input)[:100]
                                print(f"  Tool #{tool_calls}: {block.name}({desc}...)")
                elif isinstance(message, ResultMessage):
                    result_msg = message

    except TimeoutError:
        elapsed = time.monotonic() - t0
        print()
        print(f"FAIL: Timed out after {elapsed:.1f}s -- likely deadlock!")
        print("This is what happened BEFORE the fix was applied.")
        sys.exit(1)
    except Exception as e:
        elapsed = time.monotonic() - t0
        print()
        print(f"FAIL: Exception after {elapsed:.1f}s: {e}")
        sys.exit(1)

    elapsed = time.monotonic() - t0
    print()
    print("-" * 70)
    print(f"Completed in {elapsed:.1f}s")
    print(f"Total messages: {total_messages}")
    print(f"Tool calls: {tool_calls}")
    print(f"Message breakdown: {msg_types}")
    if result_msg:
        print(f"Cost: ${result_msg.total_cost_usd or 0:.6f}")
        print(f"Turns: {result_msg.num_turns}")
    print()

    if result_msg is not None and tool_calls >= 1:
        print(f"PASS: query() with string prompt + MCP completed {tool_calls} tool")
        print(f"      call(s) generating {total_messages} buffered messages without")
        print(f"      deadlocking ({elapsed:.1f}s). The background-task fix works.")
    elif result_msg is not None:
        print(f"PASS: query() completed without deadlock ({total_messages} messages).")
    else:
        print("FAIL: No result message received.")
        sys.exit(1)


if __name__ == "__main__":
    asyncio.run(main())

Output:

======================================================================
PR #780 E2E Test: string-prompt + MCP deadlock fix (many tool calls)
======================================================================

SDK version: 0.1.52
Prompt: Use the get_number MCP tool to compute the square of every integer from 1 to 20. Make exactly 20 individual get_number c...
Max turns: 30
Permission mode: acceptEdits
Timeout: 180s

--- Running query() ---
  Tool #1: Agent({'description': 'Square integers 1-5', 'prompt': 'Use the get_number tool from the math MCP server t...)
  Tool #2: Agent({'description': 'Square integers 6-10', 'prompt': 'Use the get_number tool from the math MCP server ...)
  Tool #3: Agent({'description': 'Square integers 11-15', 'prompt': 'Use the get_number tool from the math MCP server...)
  Tool #4: Agent({'description': 'Square integers 16-20', 'prompt': 'Use the get_number tool from the math MCP server...)
  Tool #5: TaskStop({'task_id': 'ad8f11b5363ad29e1'}...)
  Tool #6: TaskStop({'task_id': 'abf382a54282dfd4a'}...)
  Tool #7: TaskStop({'task_id': 'ac3bc5bb63ffac2ec'}...)
  Tool #8: Agent({'description': 'Square integers 1-5', 'prompt': 'Use the mcp__math__get_number tool to compute the ...)
  Tool #9: Agent({'description': 'Square integers 6-10', 'prompt': 'Use the mcp__math__get_number tool to compute the...)
  Tool #10: Agent({'description': 'Square integers 11-15', 'prompt': 'Use the mcp__math__get_number tool to compute th...)
  Tool #11: Agent({'description': 'Square integers 16-20', 'prompt': 'Use the mcp__math__get_number tool to compute th...)
  Tool #12: TaskStop({'task_id': 'a54b9749f937e33b4'}...)
  Tool #13: TaskStop({'task_id': 'ae90d4ebef7ece669'}...)
  Tool #14: TaskStop({'task_id': 'a491ee1e5f9244cdf'}...)

----------------------------------------------------------------------
Completed in 148.4s
Total messages: 124
Tool calls: 14
Message breakdown: {'SystemMessage': 9, 'AssistantMessage': 31, 'TaskStartedMessage': 8, 'UserMessage': 14, 'RateLimitEvent': 1, 'TaskProgressMessage': 49, 'TaskNotificationMessage': 8, 'ResultMessage': 4}
Cost: $1.096781
Turns: 1

PASS: query() with string prompt + MCP completed 14 tool
      call(s) generating 124 buffered messages without
      deadlocking (148.4s). The background-task fix works.

Verified: query() with a string prompt and an MCP server processed 124 messages (more than the 100-slot anyio buffer holds) without deadlocking. The run made 14 top-level tool calls: 8 Agent sub-agent spawns and 6 TaskStop calls. The model delegated the 20 get_number MCP tool calls to sub-agents in batches of 5, generating 49 TaskProgressMessage, 31 AssistantMessage, 14 UserMessage, 8 TaskStartedMessage, and 8 TaskNotificationMessage, all of which drained through the buffer concurrently thanks to the background-task fix.

Development

Successfully merging this pull request may close these issues.

[BUG]: SDK hangs when string prompt generates more than ~50 tool calls with hooks enabled