refactor: replace file-based IPC with WebSocket communication #683
echthesia wants to merge 18 commits into qwibitai:main from
Conversation
Replace the three-channel IPC system (stdin init, stdout sentinel markers, filesystem polling) with a single WebSocket server authenticated with per-container cryptographic tokens. This eliminates polling latency (was 1s host-side, 500ms container-side) and removes filesystem IPC volume mounts.

Key changes:
- New src/ws-server.ts: host-side WS server with token auth, message routing, output promise chains, and authorization checks
- New container/agent-runner/src/ws-client.ts: container-side WS client with reconnection, request-response correlation for list_tasks
- container-runner.ts: removed stdout marker parsing, added WS token lifecycle (create before spawn, revoke on exit with grace period)
- group-queue.ts: sendMessage/closeStdin now use WS instead of files
- ipc.ts: removed filesystem watcher, kept processTaskIpc for reuse
- index.ts: wired WsIpcServer, removed IPC watcher and snapshot writes
- Container side: agent-runner and MCP server use WsClient instead of file reads/writes and stdout markers
- Updated all tests (304 pass), added 15 new ws-server tests
- Updated docs: debug SKILL.md, groups/main/CLAUDE.md, x-integration

stdin remains the secure bootstrap channel for secrets, prompt, and the new wsUrl + wsToken fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
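For illustration, the create-before-spawn / revoke-on-exit token lifecycle might be sketched like this. TokenRegistry and its method names are assumptions for this sketch, not the PR's actual API; only the general scheme (random per-container token, constant-time check, revocation) is taken from the description above.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical per-container token registry.
class TokenRegistry {
  private tokens = new Map<string, string>(); // containerId -> token

  // Called before spawning the container; the token rides in via stdin.
  createToken(containerId: string): string {
    const token = randomBytes(32).toString("hex");
    this.tokens.set(containerId, token);
    return token;
  }

  // Constant-time comparison so token checks don't leak timing information.
  verify(containerId: string, presented: string): boolean {
    const expected = this.tokens.get(containerId);
    if (!expected || presented.length !== expected.length) return false;
    return timingSafeEqual(Buffer.from(presented), Buffer.from(expected));
  }

  // Called when the container exits (after any grace period).
  revoke(containerId: string): void {
    this.tokens.delete(containerId);
  }
}
```

A revoked token immediately fails verification, so a stale or orphaned container cannot reconnect.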
Inline processTaskIpc into ws-server.ts and delete the vestigial ipc.ts indirection layer. Fix the outputChain race condition in container-runner where a dummy promise chain was awaited instead of the real one from TokenContext. Remove dead code (WsIpcServerDeps.onOutput, resolveGroupIpcPath). Rebase skill templates (add-gmail, convert-to-apple-container, x-integration) onto the current WS-based codebase and remove stale data/env/env references from all SKILL.md files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an IPC handler registry (src/ipc-handlers/) that mirrors the channel self-registration pattern. Skills register handlers by message-type prefix without modifying ws-server.ts or index.ts directly. The X handler self-registers for x_* messages, spawning skill scripts as subprocesses. On the container side, replace the list_tasks_response-specific pattern with a unified ipc_response type and add sendTaskRequest() for request-response IPC. Add 5 X tools (post, like, reply, retweet, quote) to the MCP server, conditionally registered for main group only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
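The self-registration idea can be illustrated with a small registry sketch. All names here (registerIpcHandler, dispatch, the x_post handler) are illustrative; the sketch uses an exact message-type lookup, while this commit initially matched by prefix.

```typescript
// Illustrative handler registry: skills register by message type
// without modifying the server's dispatch code.
type IpcHandler = (params: unknown) => Promise<unknown>;

const handlers = new Map<string, IpcHandler>();

function registerIpcHandler(type: string, handler: IpcHandler): void {
  if (handlers.has(type)) throw new Error(`duplicate handler for ${type}`);
  handlers.set(type, handler);
}

async function dispatch(type: string, params: unknown): Promise<unknown> {
  const handler = handlers.get(type); // exact lookup, no prefix scan
  if (!handler) throw new Error(`no handler for ${type}`);
  return handler(params);
}

// A skill module self-registers at import time (hypothetical handler):
registerIpcHandler("x_post", async (params) => ({ ok: true, params }));
```

The core server only ever calls dispatch; adding a new skill means adding a module that calls registerIpcHandler, which mirrors the channel self-registration pattern mentioned above.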
Address code review findings from PR qwibitai#683:
- Add .catch() to outputChain to prevent silent output drops
- Switch requestId to crypto.randomUUID() for collision resistance
- Add connection-state guard in sendTaskRequest
- Replace prefix-based IPC handler matching with O(1) exact lookup
- Add isNonEmptyString validation for all message fields after auth
- Add ping/pong heartbeat (30s interval, 10s pong timeout) with client-side 45s watchdog that triggers reconnect
- Eliminate /tmp/input.json temp file by piping stdin directly to node

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
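The client-side watchdog can be sketched as a resettable timer. This is a hedged illustration: the Watchdog class and its callback are assumptions, and the intervals named in the commit (30s pings, 45s timeout) are shortened here for clarity.

```typescript
// Hypothetical silence watchdog: reset on every inbound frame,
// fire a callback (e.g. reconnect) if the connection goes quiet.
class Watchdog {
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly timeoutMs: number,
    private readonly onSilent: () => void,
  ) {}

  // Any inbound message (including server pings) proves liveness.
  reset(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = setTimeout(this.onSilent, this.timeoutMs);
  }

  stop(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = null;
  }
}
```

With a 30s server ping and a 45s client timeout, one missed ping leaves slack, while two missed pings trigger a reconnect.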
- Add agent/mcp role to WS auth handshake; reject connections without valid role
- Filter sendInput/sendClose to agent connections only; MCP disconnects don't trigger grace timer
- Serialize post-auth message processing per connection via promise chain
- Replace response broadcasting with targeted sender-only replies
- Fix X tools in agent.ts to use sendTaskRequest and return real results
- Validate scheduleType at runtime before type assertion
- Remove unused PONG_TIMEOUT_MS constant
- Document WS_BIND_ADDRESS Linux setup in REQUIREMENTS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
X (Twitter) handler and MCP tools were hardcoded in core source files during the WebSocket IPC refactor. Move them back to the skill where they belong, using the handler registry for clean plug-in integration.
- Delete src/ipc-handlers/x.ts (core X handler)
- Remove X tool registrations from container ipc-mcp-stdio.ts
- Remove X-specific tests from ipc-auth.test.ts
- Update skill host.ts with handler registry + table-driven dispatch
- Rewrite skill agent.ts for MCP architecture (registerXTools)
- Update skill SKILL.md integration instructions for ws-server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address issues found during code review:
- ws-client: clear timeout timer on successful response (was leaking 130s timers per request)
- ipc-mcp-stdio: remove redundant wsReady flag (just await the promise), fix silent error swallowing on connect failure, remove dead fields (groupFolder, isMain, timestamp) from task messages that the server ignores, remove redundant client-side task filtering (server already filters)
- container-runner: extract cleanup() helper for revokeToken, replacing 6 scattered inline guards
- ws-server/tests: formatting cleanup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hand-rolled JSON-RPC protocol plumbing with the json-rpc-2.0 npm library, eliminating ~100 lines of pendingRequests map, requestId generation, ipc_response routing, and sendIpcError/sendIpcOk helpers.
- Use JSONRPCServerAndClient for bidirectional RPC on both client and server
- Create per-connection RPC instances on the server with all methods registered
- Simplify IpcHandler type to (params, groupFolder, isMain) → Promise<unknown>
- Remove dead code: resolveIpcHandler, groups_updated RPC method, createdBy param
- Fix timer leak in request() with proper cleanup in finally block
- Rename sendTask → request for clarity
- Update all skills to match new API conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…restore

Extract withAuthorizedTask() to replace three copy-pasted task mutation handlers (pause/resume/cancel). Replace magic WS close codes 4001-4004 with named constants. Reduce stdout buffer from 10MB to 64KB since output now flows via WebSocket. Wrap runQuery handler swap in try/finally to guarantee restore on exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e code corrections

Containers can no longer persist output after the host exits (unlike the old file-based IPC), so shutdown now signals containers via WS close, waits for them to exit within the grace period, and force-stops stragglers.

Also adds .catch() to outputChain.then() calls to prevent unhandled rejections, uses the correct WS_CLOSE_TOKEN_REVOKED (4004) close code in revokeToken, and fixes shutdown ordering so the queue drains before the WS server closes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
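The shutdown sequence described above (signal via WS close, bounded wait, force-stop stragglers) can be sketched as a small helper. The function and its callback parameters are hypothetical, not the PR's API.

```typescript
// Hypothetical per-container shutdown: signal, wait up to graceMs, then kill.
async function shutdownContainer(
  signalClose: () => void,            // e.g. close the container's WS connection
  waitForExit: () => Promise<void>,   // resolves when the container process exits
  forceStop: () => Promise<void>,     // e.g. `docker stop` / hard kill
  graceMs: number,
): Promise<void> {
  signalClose();
  const timedOut = await Promise.race([
    waitForExit().then(() => false),
    new Promise<boolean>((resolve) => setTimeout(() => resolve(true), graceMs)),
  ]);
  if (timedOut) await forceStop();
}
```

Promise.race gives a bounded wait: a cooperative container exits within the grace period and forceStop is never called; a straggler is stopped forcibly.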
…econnect race

- Grace timer now stops the container directly via onOrphaned callback instead of just logging when all agent connections are lost
- Make `message` a JSON-RPC request (not notification) so callers get delivery confirmation and error feedback
- Guard concurrent reconnect attempts with `reconnecting` flag
- Use 1000 close code for token revocation (normal closure, not error)
- Capture process ref in shutdown to avoid linear scan of groups map
- Log exec errors when force-stopping containers during shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ropped WS events

The WS migration introduced a callback-swapping pattern for onInput/onClose that silently drops events when handlers are null between query phases. Replace with an internal event queue on WsClient (pushEvent/nextEvent/cancelWait) so events are always buffered and never lost.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in-loop deadlock
Two behavioral regressions in the WebSocket migration:
1. Follow-up messages arriving while a container is active but its WS
connection hasn't authenticated yet were marked as pending and only
drained after the container exited (up to 30 min idle timeout).
- sendMessage now returns 'sent' | 'retry' | 'unavailable' instead
of boolean, distinguishing "WS not connected yet" from "no
container / task container"
- Poll loop tracks retry JIDs and re-attempts delivery each
iteration (2s), falling back to enqueueMessageCheck if the
container exits
2. The drain loop in runQuery could deadlock when cancelWait() fired
between event consumption and the next nextEvent() call, leaving
no waiter to resolve.
- cancelWait() now unshifts a null sentinel to the front of the
event queue instead of only resolving the current waiter
- Guarantees the drain loop stops immediately while preserving
buffered events for the next query cycle
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
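The buffered queue plus null-sentinel cancellation described in these commits can be sketched as follows. The method names pushEvent, nextEvent, and cancelWait come from the commit messages; the rest of the class is illustrative.

```typescript
type Waiter<T> = (value: T | null) => void;

// Illustrative buffered event queue. Events are never dropped: if no
// consumer is waiting, they accumulate in `queue` for the next cycle.
class EventQueue<T> {
  private queue: (T | null)[] = [];
  private waiter: Waiter<T> | null = null;

  pushEvent(event: T): void {
    if (this.waiter) {
      const w = this.waiter;
      this.waiter = null;
      w(event);
    } else {
      this.queue.push(event);
    }
  }

  // Resolves with the next buffered event, or null when the wait is cancelled.
  nextEvent(): Promise<T | null> {
    if (this.queue.length > 0) {
      return Promise.resolve(this.queue.shift()!);
    }
    return new Promise<T | null>((resolve) => {
      this.waiter = resolve;
    });
  }

  // Stop the drain loop: resolve the current waiter, or unshift a null
  // sentinel to the FRONT so the next nextEvent() returns immediately
  // while later events stay buffered for the next query cycle.
  cancelWait(): void {
    if (this.waiter) {
      const w = this.waiter;
      this.waiter = null;
      w(null);
    } else {
      this.queue.unshift(null);
    }
  }
}
```

Putting the sentinel at the front (rather than only resolving a waiter) closes the race where cancelWait() fires between event consumption and the next nextEvent() call.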
… on reconnect exhaustion

When WS reconnect attempts were exhausted, handleDisconnect called close() before pushing the synthetic close event. close() invoked cancelWait(), which resolved the drain loop's waiter with null (normal cancellation) instead of letting the close event propagate through the drain loop to set closedDuringQuery. This caused the container to exit cleanly with no response delivered.

cancelWait() is a query-scoped operation (terminate the drain loop after the SDK query completes) and has no business in close(), which is connection-scoped cleanup. In the only other call site (main loop exit), the drain loop has already exited, so cancelWait() was always a no-op there.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nprocessed piped messages

Reconnect exhaustion now emits a distinct 'disconnected' event instead of 'close', so the agent loop exits with code 1 (error) rather than 0 (success). This prevents the host from advancing cursors and silently dropping messages during WS outages longer than the retry window.

On the host side, cursor rollback after errors now snapshots the cursor at each successful output delivery. When a mid-run error occurs after output was already sent, the cursor rolls back to the last output position instead of skipping rollback entirely, ensuring piped-but-unprocessed messages are retried.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ages immediately

When piped-but-unprocessed messages are rolled back after a mid-run error, return false instead of true so GroupQueue.scheduleRetry() fires immediately rather than waiting for the next inbound message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
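A toy sketch of the cursor-resolution rule from these two commits. All names here are hypothetical; only the rule itself (advance on success, roll back to the last delivered output on error, report failure so the retry fires immediately) is taken from the commit messages.

```typescript
interface RunResult {
  ok: boolean;     // false tells the queue to schedule a retry immediately
  cursor: number;  // where message consumption resumes next run
}

// Hypothetical helper: decide the post-run cursor position.
function resolveCursorAfterRun(
  startCursor: number,
  lastOutputCursor: number | null, // snapshot taken at each successful output delivery
  runSucceeded: boolean,
  endCursor: number,
): RunResult {
  if (runSucceeded) return { ok: true, cursor: endCursor };
  // Mid-run error: keep messages that already produced output,
  // retry everything piped after the last delivered output.
  return { ok: false, cursor: lastOutputCursor ?? startCursor };
}
```

The null case covers an error before any output: the whole run rolls back to its starting position.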
…acted handlers

Move WebSocket auth from custom post-connect message to Authorization header during HTTP upgrade, eliminating the dual-protocol state machine (auth timer, authenticated flag, three custom close codes). Auth_ok is now a JSON-RPC notification sent after connection setup.

Merge setTokenCallbacks into createToken so callbacks are atomically registered with the token, closing the race window where a container could connect before callbacks were set. Make revokeToken async — it awaits the output promise chain before closing connections, removing the leaky getOutputChain abstraction that callers had to sequence manually.

Extract ~250 lines of domain handlers (message, schedule_task, pause/resume/cancel_task, refresh_groups, register_group) from ws-server.ts into ipc-handlers/core.ts using the existing handler registry. WsIpcServerDeps shrinks to just snapshot functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
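Checking a bearer token during the HTTP upgrade might look like this sketch. The function names are assumptions; timingSafeEqual guards the comparison against timing side channels, and the length check is required because timingSafeEqual throws on unequal-length buffers.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical parser for "Authorization: Bearer <token>".
function extractBearer(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/.exec(header);
  return match ? match[1] : null;
}

// Constant-time token comparison (lengths must match first).
function tokenMatches(presented: string, expected: string): boolean {
  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Rejecting bad credentials at upgrade time means an unauthenticated WebSocket connection never exists, which is what removes the auth timer, the authenticated flag, and the custom close codes.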
WebSocket IPC is an interesting direction but replacing the entire IPC layer is a big change. Keeping this as draft makes sense — let's discuss the design more before committing to this. @gavrielc thoughts on moving away from file-based IPC?
Hey @echthesia 👋 Wow — replacing file-based IPC with an authenticated WebSocket server is a genuinely ambitious and well-thought-out proposal. The JSON-RPC 2.0 bidirectional design is elegant! That said, the codebase has evolved significantly since this was opened, and this draft now has severe merge conflicts that would make it very difficult to apply cleanly. The current IPC architecture is also deeply integrated with the container runner and scheduler. We're planning to close this soon, but you are absolutely welcome to open a fresh PR against
I ended up determining that WebSockets weren't the right way to do this anyway, so closing now. |
@echthesia, I'm curious why you don't think websockets are the best method for this anymore. I've been closely watching this PR and thinking about them for a similar project. |
@mvgrimes The connection lifecycle management introduced a lot of complexity and potential failure modes relative to using stdio, which is both buffered and available when and only when the container is running. |
Summary
Replace the file-based IPC system (stdin init, stdout sentinel markers, filesystem polling) with a single WebSocket server authenticated via per-container cryptographic tokens.
What changed:
- src/ws-server.ts: host-side WS server with token auth, JSON-RPC 2.0 bidirectional protocol, heartbeat, buffered event queue, and per-group authorization
- container/agent-runner/src/ws-client.ts: container-side WS client with automatic reconnection and reconnect guards
- src/container-runner.ts: token lifecycle (create before spawn, revoke on exit), orphaned container cleanup, graceful shutdown (signal via WS close → wait → force-stop)
- src/group-queue.ts: sendMessage/closeStdin use WS instead of files
- src/ipc.ts: deleted (filesystem watcher and polling)
- src/ipc-handlers/ registry: skills register IPC handlers by message type without modifying core files (mirrors channel self-registration pattern)
- stdin piped directly to node (no /tmp/input.json temp file)

Why:
Scope: 45 files changed, ~3300 insertions, ~2400 deletions. stdin remains the secure bootstrap channel for secrets, prompt, and the new wsUrl + wsToken fields.

Test plan
npm test — all existing tests pass, plus 15+ new ws-server.test.ts tests covering auth, message routing, heartbeat, token revocation, orphan cleanup, and graceful shutdown

🤖 Generated with Claude Code