Skip to content

feat(vscode): add retry logic and auto-reconnect for ACP connection#2666

Merged
tanzhenxin merged 2 commits intoQwenLM:mainfrom
qqqys:feat/vscode-acp-reconnect-logic
Apr 1, 2026
Merged

feat(vscode): add retry logic and auto-reconnect for ACP connection#2666
tanzhenxin merged 2 commits intoQwenLM:mainfrom
qqqys:feat/vscode-acp-reconnect-logic

Conversation

@qqqys
Copy link
Copy Markdown
Collaborator

@qqqys qqqys commented Mar 25, 2026

TLDR

This PR adds robust retry logic and auto-reconnect functionality to the VS Code IDE companions ACP (Agent Client Protocol) connection handling. Key changes include:

  • Retry on connect failure: connectWithRetry() method with exponential backoff (up to 3 retries)
  • Auto-reconnect on crash: Automatic reconnection when the Qwen CLI subprocess exits unexpectedly (limited to 1 attempt)
  • Improved timeout handling: Proper readiness detection (10s timeout) and initialize handshake timeout (15s)
  • Intentional disconnect tracking: Skip auto-reconnect when user explicitly disconnects
  • User-friendly error handling: VS Code warning notifications with "Retry Connection" action button
  • Comprehensive tests: Added tests for retry logic, cleanup between retries, and disconnect flag behavior

Screenshots / Video Demo

Dive Deeper

Motivation

Previously, if the Qwen CLI subprocess failed to start or crashed unexpectedly, the VS Code extension would fail silently or require manual reconnection. This improves reliability by:

  1. Handling transient spawn failures: Network issues, race conditions, or temporary resource constraints can cause subprocess spawn to fail. Retry logic with exponential backoff helps recover from these transient failures.

  2. Automatic recovery from crashes: If the CLI process crashes during a session (e.g., due to OOM, SIGTERM), the extension now attempts to automatically reconnect once, preserving user workflow continuity.

  3. Better timeout management: The previous fixed 1s delay for readiness was unreliable. Now we use proper event-driven readiness detection with a 10s timeout, and the initialize handshake has a 15s timeout.

Implementation Details

  • connectWithRetry() wraps the full connect() call with retry logic
  • cleanupForRetry() ensures clean state between retry attempts by killing zombie processes and resetting connection state
  • intentionalDisconnect flag prevents auto-reconnect when user explicitly disconnects
  • autoReconnectAttempts counter prevents infinite reconnection loops
  • onAutoReconnectFailed callback notifies the WebViewProvider to show user-facing error messages

Reviewer Test Plan

  1. Test normal connection: Ensure extension connects successfully on first attempt
  2. Test retry behavior: Simulate spawn failures and verify retry attempts with backoff
  3. Test auto-reconnect: Kill the CLI subprocess during a session and verify auto-reconnect attempt
  4. Test intentional disconnect: Use the disconnect command and verify no auto-reconnect occurs
  5. Test error notifications: Force connection failure and verify VS Code shows warning with retry button

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Linked issues / bugs

Resolves #2634


🤖 Generated with Qwen Code

- Add connectWithRetry() method with exponential backoff (3 retries max)
- Add cleanupForRetry() to clean partial state between retry attempts
- Improve readiness detection with proper timeout and exit handling
- Add timeout for ACP initialize handshake (15s)
- Add auto-reconnect on unexpected subprocess exit (max 1 attempt)
- Track intentionalDisconnect flag to skip auto-reconnect for user-initiated disconnects
- Add onAutoReconnectFailed callback for user notification when auto-reconnect fails
- Show VS Code warning notification with 'Retry Connection' action button
- Add comprehensive tests for retry logic, cleanup, and intentional disconnect flag

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
@github-actions
Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR adds robust retry logic and auto-reconnect functionality to the VS Code IDE companion's ACP (Agent Client Protocol) connection handling. The implementation is well-structured with comprehensive test coverage, proper timeout handling, and thoughtful error management. Overall, this is a solid improvement to the extension's reliability.

🔍 General Feedback

  • Strong test coverage: The addition of 170+ lines of tests demonstrates thorough consideration of edge cases including retry behavior, cleanup between retries, and disconnect flag handling
  • Good separation of concerns: Retry logic (connectWithRetry), cleanup (cleanupForRetry), and auto-reconnect tracking are well-factored
  • Proper timeout management: The shift from a fixed 1s delay to event-driven readiness detection (10s) and initialize timeout (15s) is a significant reliability improvement
  • User-friendly error handling: VS Code notifications with "Retry Connection" action buttons provide good UX
  • Consistent patterns: The code follows existing project conventions for TypeScript, error handling, and logging

🎯 Specific Feedback

🟡 High

  • File: acpConnection.ts:687-691 - The cleanupForRetry() method kills the child process but doesn't wait for the process exit handler to run before retrying. This could lead to race conditions where the exit handler clears state that cleanupForRetry() already modified, or where the process hasn't fully terminated before the next spawn attempt. Consider adding a brief delay or waiting for a process termination signal before proceeding with retry.

  • File: qwenAgentManager.ts:327-360 - The auto-reconnect logic increments autoReconnectAttempts before attempting reconnection, but if the reconnection promise is still pending when another crash occurs, multiple reconnection attempts could be triggered simultaneously. Consider adding a flag to track when a reconnection is in progress to prevent concurrent reconnection attempts.

🟢 Medium

  • File: acpConnection.ts:207-218 - The readiness detection timer cleanup is good, but the spawnError handling at lines 217-220 happens after the timer is already set. If spawnError is already set, there's no need to wait for the timeout. Consider checking spawnError before setting up the readiness detection to fail fast.

  • File: acpConnection.ts:414-420 - When the initialize handshake times out or fails, the subprocess is killed. However, this kill operation is synchronous and could potentially throw. Consider wrapping the kill in a try-catch similar to cleanupForRetry() to ensure the original error isn't masked.

  • File: acpConnection.ts:647-676 - The exponential backoff delays are hardcoded as [1000, 2000, 4000]. Consider making this configurable or extracting to a constant for easier tuning and testing.

🔵 Low

  • File: acpConnection.ts:62 - The JSDoc comment for intentionalDisconnect uses /** */ but is on a single line. For consistency with other comments in the file, consider using the multi-line format or inline // comment.

  • File: acpConnection.ts:640-642 - The log message [ACP] Spawn retry attempt ${attempt}/${maxRetries}... uses "Spawn" but this is actually a full connect retry (which includes more than just spawning). Consider renaming to "Connect retry attempt" for accuracy.

  • File: qwenAgentManager.ts:97 - The onAutoReconnectFailed callback is initialized to () => {} but the JSDoc says it's "invoked when auto-reconnect fails and user action is needed." Consider documenting what the default no-op behavior implies (i.e., silent failure if not overridden).

  • File: acpConnection.test.ts:283-285 - The test comment says "1 initial + 2 retries = 3 total" which is helpful, but consider adding similar comments to other retry tests for clarity (e.g., test at line 269).

✅ Highlights

  • Excellent timeout implementation: The readiness detection pattern (lines 171-222 in acpConnection.ts) is well-designed with proper cleanup, settled flag to prevent duplicate resolution, and comprehensive error messages
  • Thoughtful auto-reconnect limits: The single-attempt auto-reconnect policy prevents infinite loops while still providing recovery from transient crashes
  • Good intentional disconnect tracking: The intentionalDisconnect flag properly distinguishes between user-initiated disconnects and crashes, preventing unwanted reconnection behavior
  • Comprehensive test coverage: Tests cover edge cases like already-killed processes, null child processes, and the interaction between retry counters and successful connections
  • Proper state reset on success: Resetting autoReconnectAttempts on successful connect (line 655) ensures future crashes can still trigger auto-reconnect

Copy link
Copy Markdown
Collaborator

@tanzhenxin tanzhenxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tanzhenxin tanzhenxin merged commit cd47e9b into QwenLM:main Apr 1, 2026
13 checks passed
tanzhenxin added a commit that referenced this pull request Apr 1, 2026
…-logic"

This reverts commit cd47e9b, reversing
changes made to 449a421.
tanzhenxin added a commit that referenced this pull request Apr 1, 2026
revert: PR #2666 ACP retry/reconnect logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VS Code extension intermittently fails to connect to the Qwen agent after ACP process exits with SIGTERM

2 participants