Skip to content

fix: defer CLI discovery to connect() to avoid blocking async event loops#722

Merged
qing-ant merged 3 commits intomainfrom
fix/defer-cli-discovery-to-connect
Mar 26, 2026
Merged

fix: defer CLI discovery to connect() to avoid blocking async event loops#722
qing-ant merged 3 commits intomainfrom
fix/defer-cli-discovery-to-connect

Conversation

@qing-ant
Copy link
Copy Markdown
Contributor

Summary

SubprocessCLITransport.__init__() called shutil.which("claude") synchronously, which blocks the async event loop. This causes BlockingError in async frameworks that monitor for blocking calls (e.g., LangGraph with the blockbuster library).

Fix

Defer CLI resolution from __init__ to the async connect() method, where it runs via anyio.to_thread.run_sync() on a worker thread. This wraps the entire _find_cli() function (including shutil.which() and filesystem existence checks) in a single thread dispatch.

When cli_path is provided via ClaudeAgentOptions, it is still set immediately in __init__ — no behavior change for that path.

Before:

def __init__(self, ...):
    self._cli_path = (
        str(options.cli_path) if options.cli_path is not None else self._find_cli()
    )  # ← blocks the event loop

After:

def __init__(self, ...):
    self._cli_path = (
        str(options.cli_path) if options.cli_path is not None else None
    )  # ← no blocking

async def connect(self):
    if self._cli_path is None:
        self._cli_path = await anyio.to_thread.run_sync(self._find_cli)
    # ...

Tests

  • Updated test_find_cli_not_found to verify the error is raised during connect() instead of __init__
  • Added test_init_does_not_call_find_cli to verify __init__ defers discovery
  • Added test_init_uses_provided_cli_path to verify explicit cli_path is used immediately
  • All 330 tests pass, mypy clean, ruff clean

Fixes #324

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean fix deferring blocking I/O to the async connect() path.

Extended reasoning...

Overview

This PR modifies 2 files: subprocess_cli.py and test_transport.py. It moves the shutil.which("claude") call (and related filesystem checks) from the synchronous __init__ constructor to the async connect() method, where it runs via anyio.to_thread.run_sync(). Defensive None checks are added in _build_command() and _check_claude_version() for the case where _cli_path hasn't been resolved yet. When cli_path is explicitly provided via options, behavior is unchanged.

Security risks

None. The change doesn't alter what commands are executed or how paths are resolved — it only changes when the resolution happens. No new inputs are accepted, no auth or permissions logic is modified.

Level of scrutiny

This is a straightforward async-correctness fix with a small blast radius. The pattern (deferring blocking I/O to a thread in an async context) is well-established and the implementation is minimal. The only bug found is a style nit about test convention (@pytest.mark.anyio vs anyio.run()) which doesn't affect correctness.

Other factors

Three new/updated tests cover the key behaviors: deferred discovery, explicit path passthrough, and error on missing CLI. The existing test suite (330 tests) reportedly passes. No CODEOWNERS file exists for these paths.

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — small, focused fix with good test coverage; previous feedback addressed.

Extended reasoning...

Overview

This PR modifies two files: subprocess_cli.py (production) and test_transport.py (tests). The production change defers _find_cli() from __init__ to connect() using anyio.to_thread.run_sync(), preventing synchronous shutil.which() from blocking async event loops. Two defensive None guards were added in _build_command() and _check_claude_version(). Three tests were added/updated to cover the new behavior.

Security risks

No security concerns. The change only affects when CLI discovery happens (construction vs connection), not how. No auth, crypto, or permissions code is touched.

Level of scrutiny

Low scrutiny warranted. This is a straightforward timing change — moving a blocking call out of a synchronous constructor into an async method where it can be properly offloaded to a thread. The pattern (anyio.to_thread.run_sync) is idiomatic and well-established in this codebase.

Other factors

  • My previous review flagged a test convention issue (@pytest.mark.anyio vs anyio.run), which has been addressed in commit 547da9e.
  • The bug report about using CLINotFoundError instead of RuntimeError for the internal guards is a valid semantic nit, but both guards are effectively unreachable dead code — they protect private methods that are only ever called after connect() resolves the path.
  • All existing tests continue to pass, and the new tests verify the deferred behavior correctly.

@qing-ant
Copy link
Copy Markdown
Contributor Author

E2E Test Results for PR #722

Tested the deferred CLI discovery change on branch fix/defer-cli-discovery-to-connect.

Unit Tests

4/4 passed (test_find_cli_not_found, test_init_does_not_call_find_cli, test_init_uses_provided_cli_path, test_build_command_agents_always_via_initialize).

E2E Tests (3/3 passed)

Test 1: Transport __init__ does NOT resolve CLI path

  • SubprocessCLITransport.__init__() completes in ~0.00s
  • transport._cli_path is None after construction
  • Confirms no shutil.which call during init

Test 2: CLI path resolved during connect()

  • After calling await transport.connect(), _cli_path is set (e.g., /root/.local/bin/claude)
  • Confirms CLI discovery happens via anyio.to_thread.run_sync() inside connect()

Test 3: Full query end-to-end

  • claude_agent_sdk.query(prompt="Reply with exactly: hello", options=ClaudeAgentOptions(max_turns=1)) completed successfully
  • Received expected response: TextBlock(text="hello")
  • CLI discovery, version check, subprocess launch, and message parsing all work correctly with the deferred approach
======================================================================
SUMMARY
======================================================================
  test_init_no_cli_path: PASS
  test_cli_path_after_connect: PASS
  test_full_query_e2e: PASS

ALL TESTS PASSED

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean fix for deferring blocking CLI discovery to the async connect() method.

Extended reasoning...

Overview

This PR touches two files: subprocess_cli.py (production) and test_transport.py (tests). The change defers shutil.which() and filesystem existence checks from __init__ to connect(), where they run on a worker thread via anyio.to_thread.run_sync(). When cli_path is explicitly provided, it is still set immediately in __init__ with no behavior change.

Security risks

None. The change does not alter any security-sensitive code paths — it only changes when the CLI binary is located, not how. No auth, crypto, or permissions code is affected.

Level of scrutiny

This is a low-risk, well-scoped bug fix. The core logic change is minimal (3 lines in __init__, 2 lines in connect()), with defensive guards added to two private methods. The existing _find_cli() method is completely unchanged. E2E tests confirmed the full query flow works correctly with the deferred approach.

Other factors

My two prior inline comments were both resolved: the test convention issue was fixed (now uses anyio.run(_test) matching the project pattern), and the CLINotFoundError vs RuntimeError semantic nit was acknowledged as minor since the guards are in private methods only called after connect() resolves the path. Three new tests were added covering deferred discovery, explicit cli_path, and the error-during-connect path. All 330 tests pass, mypy and ruff clean, and independent E2E verification confirmed correct behavior.

claude and others added 3 commits March 26, 2026 03:13
…oops

shutil.which() in __init__ blocks the event loop in async frameworks
(e.g. LangGraph with blockbuster monitoring). Defer CLI resolution to
the async connect() method using anyio.to_thread.run_sync().

When cli_path is provided via ClaudeAgentOptions, it is still set
immediately in __init__ with no behavior change.

Fixes #324
@qing-ant qing-ant force-pushed the fix/defer-cli-discovery-to-connect branch from 5e0e807 to d9d6e3c Compare March 26, 2026 03:13
Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward async fix with good test coverage.

Extended reasoning...

Overview

This PR touches two files: subprocess_cli.py (the transport implementation) and test_transport.py (its tests). The change moves _find_cli() from __init__ (synchronous, blocks the event loop) to connect() (async, runs via anyio.to_thread.run_sync()). When cli_path is explicitly provided, the behavior is unchanged.

Security risks

None. The change only affects when CLI discovery happens, not how. No new inputs, no auth changes, no data exposure.

Level of scrutiny

This is a low-risk bugfix. The diff is small (~40 lines of meaningful change), the logic is straightforward (move a blocking call to an async context), and it follows the established anyio.to_thread.run_sync() pattern already used elsewhere in the codebase. Both my prior nits (test convention and exception type) have been resolved — the test convention was addressed in a follow-up commit, and the exception type nit was acknowledged and closed.

Other factors

  • Three new/updated unit tests directly validate the deferred behavior.
  • E2E tests (posted by a reviewer) confirm the full flow works correctly.
  • No CODEOWNERS file exists for these paths.
  • The defensive guards in _build_command() and _check_claude_version() are effectively unreachable (both are private methods only called after _cli_path is resolved in connect()), making the exception type choice a non-issue in practice.

Copy link
Copy Markdown
Collaborator

@bogini bogini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamped ⚡

@qing-ant qing-ant merged commit 19e1f53 into main Mar 26, 2026
10 checks passed
@qing-ant qing-ant deleted the fix/defer-cli-discovery-to-connect branch March 26, 2026 06:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

shutil.which() blocking call causes BlockingError in async environments

3 participants