
refactor: replace file-based IPC with WebSocket communication#683

Closed
echthesia wants to merge 18 commits into qwibitai:main from echthesia:feat/websocket-ipc

Conversation


@echthesia echthesia commented Mar 4, 2026

Summary

Replace the file-based IPC system (stdin init, stdout sentinel markers, filesystem polling) with a single WebSocket server authenticated via per-container cryptographic tokens.

What changed:

  • New src/ws-server.ts: host-side WS server with token auth, JSON-RPC 2.0 bidirectional protocol, heartbeat, buffered event queue, and per-group authorization
  • New container/agent-runner/src/ws-client.ts: container-side WS client with automatic reconnection and reconnect guards
  • src/container-runner.ts: token lifecycle (create before spawn, revoke on exit), orphaned container cleanup, graceful shutdown (signal via WS close → wait → force-stop)
  • src/group-queue.ts: sendMessage/closeStdin use WS instead of files
  • Deleted src/ipc.ts (filesystem watcher and polling)
  • New src/ipc-handlers/ registry: skills register IPC handlers by message type without modifying core files (mirrors channel self-registration pattern)
  • Container agent-runner and MCP server use WsClient instead of file reads/writes and stdout markers; stdin piped directly (no /tmp/input.json temp file)
  • All skill templates updated for WS-based API

Why:

  • Eliminates polling latency (was 1s host-side, 500ms container-side)
  • Removes filesystem IPC volume mounts
  • Gives delivery confirmation on messages (JSON-RPC requests, not fire-and-forget)
  • Heartbeat + watchdog detect dead connections and trigger reconnect
  • Handler registry makes IPC extensible without touching core

Scope: 45 files changed, ~3300 insertions, ~2400 deletions. stdin remains the secure bootstrap channel for secrets, prompt, and new wsUrl + wsToken fields.
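The per-container token lifecycle (create before spawn, verify on connect, revoke on exit) can be sketched roughly as follows. `TokenRegistry` and its method names are illustrative stand-ins, not the PR's actual identifiers:

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// One opaque token per container: minted before spawn, checked when the
// container's WS client connects, revoked when the container exits.
class TokenRegistry {
  private tokens = new Map<string, string>(); // token -> groupFolder

  /** Mint a fresh token for a container before spawning it. */
  createToken(groupFolder: string): string {
    const token = randomBytes(32).toString("hex");
    this.tokens.set(token, groupFolder);
    return token;
  }

  /** Resolve a presented token to its group, using a constant-time compare. */
  verifyToken(presented: string): string | undefined {
    for (const [token, group] of this.tokens) {
      const a = Buffer.from(token);
      const b = Buffer.from(presented);
      if (a.length === b.length && timingSafeEqual(a, b)) return group;
    }
    return undefined;
  }

  /** Revoke on container exit so a stale token can never reconnect. */
  revokeToken(token: string): void {
    this.tokens.delete(token);
  }
}
```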

Test plan

  • npm test — all existing tests pass, plus 15+ new ws-server.test.ts tests covering auth, message routing, heartbeat, token revocation, orphan cleanup, and graceful shutdown
  • Send a message, verify agent receives it over WS and responds
  • Send a follow-up message mid-conversation, verify it's delivered to the running container
  • Kill a container mid-conversation, verify the host detects the disconnect and cleans up
  • Trigger a scheduled task, verify it spawns a container and completes
  • Send messages to two different groups concurrently, verify isolation (each gets its own container and responses route back correctly)
  • Stop the host process while containers are running, verify graceful shutdown (containers exit cleanly within timeout)

🤖 Generated with Claude Code

Replace the three-channel IPC system (stdin init, stdout sentinel markers,
filesystem polling) with a single WebSocket server authenticated with
per-container cryptographic tokens. This eliminates polling latency
(was 1s host-side, 500ms container-side) and removes filesystem IPC
volume mounts.

Key changes:
- New src/ws-server.ts: host-side WS server with token auth, message
  routing, output promise chains, and authorization checks
- New container/agent-runner/src/ws-client.ts: container-side WS client
  with reconnection, request-response correlation for list_tasks
- container-runner.ts: removed stdout marker parsing, added WS token
  lifecycle (create before spawn, revoke on exit with grace period)
- group-queue.ts: sendMessage/closeStdin now use WS instead of files
- ipc.ts: removed filesystem watcher, kept processTaskIpc for reuse
- index.ts: wired WsIpcServer, removed IPC watcher and snapshot writes
- Container side: agent-runner and MCP server use WsClient instead of
  file reads/writes and stdout markers
- Updated all tests (304 pass), added 15 new ws-server tests
- Updated docs: debug SKILL.md, groups/main/CLAUDE.md, x-integration

stdin remains the secure bootstrap channel for secrets, prompt, and
the new wsUrl + wsToken fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@echthesia echthesia marked this pull request as draft March 4, 2026 05:01
echthesia and others added 6 commits March 4, 2026 12:01
Inline processTaskIpc into ws-server.ts and delete the vestigial ipc.ts
indirection layer. Fix the outputChain race condition in container-runner
where a dummy promise chain was awaited instead of the real one from
TokenContext. Remove dead code (WsIpcServerDeps.onOutput,
resolveGroupIpcPath). Rebase skill templates (add-gmail,
convert-to-apple-container, x-integration) onto the current WS-based
codebase and remove stale data/env/env references from all SKILL.md
files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an IPC handler registry (src/ipc-handlers/) that mirrors the channel
self-registration pattern. Skills register handlers by message-type prefix
without modifying ws-server.ts or index.ts directly. The X handler
self-registers for x_* messages, spawning skill scripts as subprocesses.

On the container side, replace the list_tasks_response-specific pattern with
a unified ipc_response type and add sendTaskRequest() for request-response
IPC. Add 5 X tools (post, like, reply, retweet, quote) to the MCP server,
conditionally registered for main group only.
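The unified request-response pattern described above can be sketched like this: every outbound request carries a `requestId`, and a single `ipc_response` handler resolves the matching pending promise. `RequestResponder` and its field names are illustrative, not the PR's actual code:

```typescript
import { randomUUID } from "node:crypto";

type IpcResponse = {
  type: "ipc_response";
  requestId: string;
  result?: unknown;
  error?: string;
};

class RequestResponder {
  private pending = new Map<
    string,
    { resolve: (v: unknown) => void; reject: (e: Error) => void }
  >();

  constructor(private send: (msg: object) => void) {}

  /** Send a task request and wait for the correlated ipc_response. */
  sendTaskRequest(type: string, params: object): Promise<unknown> {
    const requestId = randomUUID(); // collision-resistant, per review feedback
    return new Promise((resolve, reject) => {
      this.pending.set(requestId, { resolve, reject });
      this.send({ type, requestId, ...params });
    });
  }

  /** Called for every inbound ipc_response frame. */
  handleResponse(msg: IpcResponse): void {
    const entry = this.pending.get(msg.requestId);
    if (!entry) return; // late or unknown response: ignore
    this.pending.delete(msg.requestId);
    if (msg.error) entry.reject(new Error(msg.error));
    else entry.resolve(msg.result);
  }
}
```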

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address code review findings from PR qwibitai#683:
- Add .catch() to outputChain to prevent silent output drops
- Switch requestId to crypto.randomUUID() for collision resistance
- Add connection-state guard in sendTaskRequest
- Replace prefix-based IPC handler matching with O(1) exact lookup
- Add isNonEmptyString validation for all message fields after auth
- Add ping/pong heartbeat (30s interval, 10s pong timeout) with
  client-side 45s watchdog that triggers reconnect
- Eliminate /tmp/input.json temp file by piping stdin directly to node
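The heartbeat and watchdog pairing can be sketched as below. Intervals are shortened for the example (the PR uses a 30s ping interval, 10s pong timeout, and a client-side 45s watchdog), and the class and callback names are illustrative:

```typescript
class Heartbeat {
  private timer?: ReturnType<typeof setInterval>;
  private pongTimer?: ReturnType<typeof setTimeout>;

  constructor(
    private ping: () => void,
    private onDead: () => void,
    private intervalMs: number,
    private pongTimeoutMs: number,
  ) {}

  start(): void {
    this.timer = setInterval(() => {
      this.ping();
      // If no pong arrives before the timeout, declare the peer dead.
      this.pongTimer = setTimeout(() => this.onDead(), this.pongTimeoutMs);
    }, this.intervalMs);
  }

  /** Wire this to the transport's pong event. */
  onPong(): void {
    if (this.pongTimer) clearTimeout(this.pongTimer);
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
    if (this.pongTimer) clearTimeout(this.pongTimer);
  }
}
```

On the client side, the same shape works as a watchdog: reset a timer on every inbound frame and trigger a reconnect when it fires.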

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add agent/mcp role to WS auth handshake; reject connections without valid role
- Filter sendInput/sendClose to agent connections only; MCP disconnects don't trigger grace timer
- Serialize post-auth message processing per connection via promise chain
- Replace response broadcasting with targeted sender-only replies
- Fix X tools in agent.ts to use sendTaskRequest and return real results
- Validate scheduleType at runtime before type assertion
- Remove unused PONG_TIMEOUT_MS constant
- Document WS_BIND_ADDRESS Linux setup in REQUIREMENTS.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
X (Twitter) handler and MCP tools were hardcoded in core source files
during the WebSocket IPC refactor. Move them back to the skill where
they belong, using the handler registry for clean plug-in integration.

- Delete src/ipc-handlers/x.ts (core X handler)
- Remove X tool registrations from container ipc-mcp-stdio.ts
- Remove X-specific tests from ipc-auth.test.ts
- Update skill host.ts with handler registry + table-driven dispatch
- Rewrite skill agent.ts for MCP architecture (registerXTools)
- Update skill SKILL.md integration instructions for ws-server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address issues found during code review:

- ws-client: clear timeout timer on successful response (was leaking
  130s timers per request)
- ipc-mcp-stdio: remove redundant wsReady flag (just await the promise),
  fix silent error swallowing on connect failure, remove dead fields
  (groupFolder, isMain, timestamp) from task messages that server ignores,
  remove redundant client-side task filtering (server already filters)
- container-runner: extract cleanup() helper for revokeToken, replacing
  6 scattered inline guards
- ws-server/tests: formatting cleanup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
echthesia and others added 5 commits March 4, 2026 16:33
Replace hand-rolled JSON-RPC protocol plumbing with the json-rpc-2.0
npm library, eliminating ~100 lines of pendingRequests map, requestId
generation, ipc_response routing, and sendIpcError/sendIpcOk helpers.

- Use JSONRPCServerAndClient for bidirectional RPC on both client and server
- Create per-connection RPC instances on the server with all methods registered
- Simplify IpcHandler type to (params, groupFolder, isMain) → Promise<unknown>
- Remove dead code: resolveIpcHandler, groups_updated RPC method, createdBy param
- Fix timer leak in request() with proper cleanup in finally block
- Rename sendTask → request for clarity
- Update all skills to match new API conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…restore

Extract withAuthorizedTask() to replace three copy-pasted task mutation
handlers (pause/resume/cancel). Replace magic WS close codes 4001-4004
with named constants. Reduce stdout buffer from 10MB to 64KB since output
now flows via WebSocket. Wrap runQuery handler swap in try/finally to
guarantee restore on exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e code corrections

Containers can no longer persist output after the host exits (unlike the old
file-based IPC), so shutdown now signals containers via WS close, waits for
them to exit within the grace period, and force-stops stragglers. Also adds
.catch() to outputChain.then() calls to prevent unhandled rejections, uses
the correct WS_CLOSE_TOKEN_REVOKED (4004) close code in revokeToken, and
fixes shutdown ordering so the queue drains before the WS server closes.
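The signal-wait-force sequence can be sketched as follows. `Container`, `closeWs`, `waitForExit`, and `forceStop` are illustrative stand-ins for the PR's actual container-runner interfaces:

```typescript
interface Container {
  closeWs(): void;              // signals the agent to wind down
  waitForExit(): Promise<void>; // resolves when the process exits
  forceStop(): Promise<void>;   // docker-stop equivalent
}

// Signal each container by closing its WS connection, wait up to the
// grace period for a clean exit, then force-stop any stragglers.
async function gracefulShutdown(
  containers: Container[],
  graceMs: number,
): Promise<void> {
  await Promise.all(
    containers.map(async (c) => {
      c.closeWs();
      const timedOut = await Promise.race([
        c.waitForExit().then(() => false),
        new Promise<boolean>((r) => setTimeout(() => r(true), graceMs)),
      ]);
      if (timedOut) {
        try {
          await c.forceStop();
        } catch (err) {
          console.error("force-stop failed:", err); // log exec errors
        }
      }
    }),
  );
}
```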

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…econnect race

- Grace timer now stops the container directly via onOrphaned callback
  instead of just logging when all agent connections are lost
- Make `message` a JSON-RPC request (not notification) so callers get
  delivery confirmation and error feedback
- Guard concurrent reconnect attempts with `reconnecting` flag
- Use 1000 close code for token revocation (normal closure, not error)
- Capture process ref in shutdown to avoid linear scan of groups map
- Log exec errors when force-stopping containers during shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ropped WS events

The WS migration introduced a callback-swapping pattern for onInput/onClose
that silently drops events when handlers are null between query phases.
Replace with an internal event queue on WsClient (pushEvent/nextEvent/cancelWait)
so events are always buffered and never lost.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
echthesia and others added 6 commits March 4, 2026 20:04
…in-loop deadlock

Two behavioral regressions in the WebSocket migration:

1. Follow-up messages arriving while a container is active but its WS
   connection hasn't authenticated yet were marked as pending and only
   drained after the container exited (up to 30 min idle timeout).
   - sendMessage now returns 'sent' | 'retry' | 'unavailable' instead
     of boolean, distinguishing "WS not connected yet" from "no
     container / task container"
   - Poll loop tracks retry JIDs and re-attempts delivery each
     iteration (2s), falling back to enqueueMessageCheck if the
     container exits

2. The drain loop in runQuery could deadlock when cancelWait() fired
   between event consumption and the next nextEvent() call, leaving
   no waiter to resolve.
   - cancelWait() now unshifts a null sentinel to the front of the
     event queue instead of only resolving the current waiter
   - Guarantees the drain loop stops immediately while preserving
     buffered events for the next query cycle
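The null-sentinel cancellation in point 2 can be sketched as a small buffered queue. The class and event shapes here are illustrative, not the PR's actual WsClient code:

```typescript
type WsEvent = { kind: "input" | "close"; payload?: string };

class EventQueue {
  private queue: (WsEvent | null)[] = [];
  private waiter?: (ev: WsEvent | null) => void;

  /** Deliver to a pending waiter if there is one, else buffer. */
  pushEvent(ev: WsEvent): void {
    if (this.waiter) {
      const w = this.waiter;
      this.waiter = undefined;
      w(ev);
    } else {
      this.queue.push(ev);
    }
  }

  /** Resolves with the next event, or null when the wait is cancelled. */
  nextEvent(): Promise<WsEvent | null> {
    const head = this.queue.shift();
    if (head !== undefined) return Promise.resolve(head);
    return new Promise((resolve) => {
      this.waiter = resolve;
    });
  }

  /**
   * Stop the drain loop immediately: resolve the current waiter with null,
   * or unshift a null sentinel so the very next nextEvent() returns null
   * while events queued behind it survive for the next query cycle.
   */
  cancelWait(): void {
    if (this.waiter) {
      const w = this.waiter;
      this.waiter = undefined;
      w(null);
    } else {
      this.queue.unshift(null);
    }
  }
}
```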

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… on reconnect exhaustion

When WS reconnect attempts were exhausted, handleDisconnect called
close() before pushing the synthetic close event. close() invoked
cancelWait(), which resolved the drain loop's waiter with null
(normal cancellation) instead of letting the close event propagate
through the drain loop to set closedDuringQuery. This caused the
container to exit cleanly with no response delivered.

cancelWait() is a query-scoped operation (terminate the drain loop
after SDK query completes) and has no business in close(), which is
connection-scoped cleanup. In the only other call site (main loop
exit), the drain loop has already exited so cancelWait() was always
a no-op there.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nprocessed piped messages

Reconnect exhaustion now emits a distinct 'disconnected' event instead of
'close', so the agent loop exits with code 1 (error) rather than 0 (success).
This prevents the host from advancing cursors and silently dropping messages
during WS outages longer than the retry window.

On the host side, cursor rollback after errors now snapshots the cursor at
each successful output delivery. When a mid-run error occurs after output was
already sent, the cursor rolls back to the last output position instead of
skipping rollback entirely, ensuring piped-but-unprocessed messages are retried.
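The snapshot-and-rollback logic can be sketched like this; `CursorTracker` and its method names are illustrative:

```typescript
class CursorTracker {
  private lastDelivered: number;

  constructor(private cursor: number) {
    this.lastDelivered = cursor;
  }

  /** Advance past messages as they are piped to the container. */
  advance(to: number): void {
    this.cursor = to;
  }

  /** Snapshot the cursor at each successful output delivery. */
  snapshotOnOutput(): void {
    this.lastDelivered = this.cursor;
  }

  /**
   * On a mid-run error, rewind only past the point of last confirmed
   * output, so piped-but-unprocessed messages are retried without
   * re-delivering messages that already produced a response.
   */
  rollbackOnError(): number {
    this.cursor = this.lastDelivered;
    return this.cursor;
  }

  position(): number {
    return this.cursor;
  }
}
```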

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ages immediately

When piped-but-unprocessed messages are rolled back after a mid-run error,
return false instead of true so GroupQueue.scheduleRetry() fires immediately
rather than waiting for the next inbound message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…acted handlers

Move WebSocket auth from custom post-connect message to Authorization
header during HTTP upgrade, eliminating the dual-protocol state machine
(auth timer, authenticated flag, three custom close codes). Auth_ok is
now a JSON-RPC notification sent after connection setup.

Merge setTokenCallbacks into createToken so callbacks are atomically
registered with the token, closing the race window where a container
could connect before callbacks were set.

Make revokeToken async — it awaits the output promise chain before
closing connections, removing the leaky getOutputChain abstraction
that callers had to sequence manually.

Extract ~250 lines of domain handlers (message, schedule_task,
pause/resume/cancel_task, refresh_groups, register_group) from
ws-server.ts into ipc-handlers/core.ts using the existing handler
registry. WsIpcServerDeps shrinks to just snapshot functions.
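Auth at HTTP upgrade time reduces to parsing the `Authorization` header and looking the token up before accepting the socket. This is a minimal sketch; `parseBearer`, `authorizeUpgrade`, and the token-record shape are hypothetical names, not the PR's actual API:

```typescript
function parseBearer(header: string | undefined): string | undefined {
  if (!header) return undefined;
  const match = /^Bearer\s+(\S+)$/.exec(header);
  return match?.[1];
}

type TokenRecord = { groupFolder: string; role: "agent" | "mcp" };

// Returns the token's record on success, or undefined to reject the
// upgrade (no dual-protocol state machine, no post-connect auth timer).
function authorizeUpgrade(
  headers: Record<string, string | undefined>,
  tokens: Map<string, TokenRecord>,
): TokenRecord | undefined {
  const token = parseBearer(headers["authorization"]);
  if (!token) return undefined; // no credentials: reject
  return tokens.get(token);     // unknown token: reject
}
```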

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TomGranot
Collaborator

WebSocket IPC is an interesting direction but replacing the entire IPC layer is a big change. Keeping this as draft makes sense — let's discuss the design more before committing to this. @gavrielc thoughts on moving away from file-based IPC?

@Andy-NanoClaw-AI added the labels PR: Refactor (code restructuring without behavior change) and Status: WIP (work in progress) on Mar 5, 2026
@Andy-NanoClaw-AI
Collaborator

Hey @echthesia 👋 Wow — replacing file-based IPC with an authenticated WebSocket server is a genuinely ambitious and well-thought-out proposal. The JSON-RPC 2.0 bidirectional design is elegant!

That said, the codebase has evolved significantly since this was opened, and this draft now has severe merge conflicts that would make it very difficult to apply cleanly. The current IPC architecture is also deeply integrated with the container runner and scheduler.

We're planning to close this soon, but you are absolutely welcome to open a fresh PR against main if you'd like to revive it. This kind of architectural improvement deserves a clean shot! 🙌

@Andy-NanoClaw-AI added the label Status: Pending Closure (PR flagged for closure during triage) on Mar 7, 2026
@echthesia
Author

I ended up determining that WebSockets weren't the right way to do this anyway, so closing now.

@echthesia echthesia closed this Mar 7, 2026
@mvgrimes

mvgrimes commented Mar 8, 2026

@echthesia, I'm curious why you don't think websockets are the best method for this anymore. I've been closely watching this PR and thinking about them for a similar project.

@echthesia
Author

@mvgrimes The connection lifecycle management introduced a lot of complexity and potential failure modes relative to using stdio, which is both buffered and available when and only when the container is running.

