fix: retry proxy requests when gateway crashes after startup #338

Open
andreasjansson wants to merge 9 commits into main from ajansson/fix/gateway-crash-retry

@andreasjansson
Member

Fixes #179

Problem

When the OpenClaw gateway process starts successfully and passes the TCP port health check, but then crashes while handling the first HTTP request, subsequent requests fail with:

Error proxying request to container: The container is not listening in the TCP address 10.0.0.1:18789

This happens because:

  1. ensureMoltbotGateway() succeeds — the port is reachable
  2. containerFetch() is called — the gateway processes the request, returns 500, and crashes
  3. The next request finds the port unreachable, but there's no recovery path

The containerFetch() call had no error handling at all — any exception would propagate as an unhandled error, returning an opaque failure to the client.

Fix

1. Retry on gateway crash (HTTP + WebSocket)

Both the HTTP proxy (containerFetch) and WebSocket proxy (wsConnect) now detect "container is not listening" errors from the Sandbox SDK. When detected:

  1. Kill the dead gateway process
  2. Restart the gateway via ensureMoltbotGateway()
  3. Retry the request once

This handles the exact pattern from #179: gateway starts → first request crashes it → retry brings it back.
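
The three recovery steps above can be sketched as a small wrapper. This is a minimal illustration, not the code from this PR: `withCrashRetry` and `RecoveryDeps` are hypothetical names, and the helpers are passed in as parameters so the sketch does not depend on the Sandbox SDK.

```typescript
// Hypothetical dependency bundle (an assumption for this sketch); the PR's
// own code presumably calls its helpers directly instead of injecting them.
interface RecoveryDeps {
  isCrash: (err: unknown) => boolean; // e.g. isGatewayCrashedError
  kill: () => Promise<void>;          // e.g. killExistingGateway
  ensure: () => Promise<void>;        // e.g. ensureMoltbotGateway
}

// Run `attempt`; if it fails with a crash error, kill the dead gateway,
// restart it, and retry exactly once. Any other error is rethrown as-is.
async function withCrashRetry<T>(
  attempt: () => Promise<T>,
  deps: RecoveryDeps,
): Promise<T> {
  try {
    return await attempt();
  } catch (err) {
    if (!deps.isCrash(err)) throw err;
    await deps.kill();
    await deps.ensure();
    // Single retry: a second failure propagates to the caller.
    return await attempt();
  }
}
```

Limiting the wrapper to one retry keeps a gateway that crashes on every request from causing a restart loop; the caller sees the second failure and can map it to an error response.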

2. Proper error handling around containerFetch

Previously, if containerFetch threw for any reason (not just crashes), the error was completely unhandled. Now all errors return structured JSON responses with appropriate status codes:

  • 503 for gateway crash + failed recovery
  • 502 for other proxy errors
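
As a rough sketch of the status-code mapping above: the function name `toProxyError`, the `recoveryFailed` flag, and the JSON field names are all assumptions made for illustration; the PR only states that errors become structured JSON with 503 or 502.

```typescript
// Plain result object; a Worker would typically wrap this as
// new Response(body, { status, headers: { "Content-Type": "application/json" } }).
interface ProxyErrorResult {
  status: 502 | 503;
  body: string; // JSON-encoded payload
}

// Map a caught error to a structured result.
// recoveryFailed = true means the gateway crashed and the restart/retry also failed.
function toProxyError(err: unknown, recoveryFailed: boolean): ProxyErrorResult {
  const details = err instanceof Error ? err.message : String(err);
  if (recoveryFailed) {
    return {
      status: 503,
      body: JSON.stringify({ error: "Gateway crashed and could not be recovered", details }),
    };
  }
  return {
    status: 502,
    body: JSON.stringify({ error: "Error proxying request to gateway", details }),
  };
}
```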

Helper functions

  • isGatewayCrashedError() — detects Sandbox SDK errors indicating the gateway process died
  • killExistingGateway() — finds and kills the dead process so ensureMoltbotGateway() starts fresh
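
A minimal sketch of what these helpers might look like. The substring check follows the 'is not listening' wording quoted in the commit message; the `ProcessLike` shape and `killMatching` are hypothetical stand-ins, since the Sandbox SDK's process API is not shown in this PR description.

```typescript
// Detect the Sandbox SDK error that signals a dead gateway by matching
// the 'is not listening' text in the error message.
function isGatewayCrashedError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return message.includes("is not listening");
}

// Hypothetical process handle (assumption): just enough to illustrate the kill step.
interface ProcessLike {
  command: string;
  kill: () => Promise<void>;
}

// Kill every process whose command line contains `pattern`.
// Returns the number of processes killed; kill failures are ignored
// because the process may already have exited.
async function killMatching(processes: ProcessLike[], pattern: string): Promise<number> {
  let killed = 0;
  for (const proc of processes) {
    if (!proc.command.includes(pattern)) continue;
    try {
      await proc.kill();
      killed++;
    } catch {
      // Process already gone; nothing to clean up.
    }
  }
  return killed;
}
```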

Test plan

  • npm run build — passes
  • npm run lint — 0 warnings, 0 errors
  • npm test — 82 tests pass

@andreasjansson andreasjansson force-pushed the ajansson/fix/gateway-crash-retry branch 2 times, most recently from 7cdbcbe to c2ca3ff on March 27, 2026 11:26
Metamolty and others added 7 commits March 27, 2026 17:39
When the OpenClaw gateway process starts successfully and passes the port
health check, but then crashes while handling the first request, subsequent
containerFetch/wsConnect calls throw 'is not listening' errors with no
recovery path. The user sees HTTP 500s followed by connection failures.

This adds retry-on-crash logic to both HTTP and WebSocket proxy paths:
1. Detect 'is not listening' errors from the Sandbox SDK
2. Kill the dead gateway process
3. Restart the gateway via ensureMoltbotGateway()
4. Retry the request once

Also adds proper error handling around containerFetch (previously had no
try-catch at all), returning structured JSON errors instead of unhandled
exceptions.

Fixes #179
…Gateway

Complete the crash retry implementation:
- HTTP proxy: catch 'is not listening' errors from containerFetch,
  kill crashed gateway, restart, retry once
- WebSocket proxy: same for wsConnect
- Return structured errors (503 for crash+failed recovery, 502 for other)

Extract killGateway() into gateway/process.ts as a shared function
used by both the restart handler and the crash retry logic. Removes
duplicate kill code from index.ts and api.ts.

Tested on staging: kill gateway → next HTTP request returns 200 (retry worked).
@andreasjansson andreasjansson force-pushed the ajansson/fix/gateway-crash-retry branch from c2ca3ff to 8770607 on March 27, 2026 16:45
The 4 e2e variants pull the sandbox base image in parallel, frequently
hitting Docker Hub rate limits. Retry up to 5 times with 60s backoff.
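
The bounded retry described here is most likely implemented in the CI workflow itself; the following only illustrates the pattern in TypeScript. `retryWithBackoff` is a hypothetical name, and the fixed 60-second wait between attempts is an assumption (the commit message does not say whether the backoff grows).

```typescript
// Retry `attempt` up to maxAttempts times, waiting delayMs between failures.
// `sleep` is injectable so the wait can be stubbed out in tests.
async function retryWithBackoff<T>(
  attempt: () => Promise<T>,
  maxAttempts = 5,
  delayMs = 60_000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (i < maxAttempts) await sleep(delayMs);
    }
  }
  throw lastError;
}
```
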
@github-actions

E2E Test Recording (telegram)

❌ Tests failed

E2E Test Video

@github-actions

E2E Test Recording (base)

❌ Tests failed

E2E Test Video

@github-actions

E2E Test Recording (discord)

❌ Tests failed

E2E Test Video

@github-actions

E2E Test Recording (workers-ai)

❌ Tests failed

E2E Test Video

Successfully merging this pull request may close these issues.

Gateway returns HTTP 500 errors and crashes immediately after startup

1 participant