fix: retry proxy requests when gateway crashes after startup #338
Open
andreasjansson wants to merge 9 commits into main
When the OpenClaw gateway process starts successfully and passes the port health check, but then crashes while handling the first request, subsequent containerFetch/wsConnect calls throw 'is not listening' errors with no recovery path. The user sees HTTP 500s followed by connection failures.

This adds retry-on-crash logic to both HTTP and WebSocket proxy paths:
1. Detect 'is not listening' errors from the Sandbox SDK
2. Kill the dead gateway process
3. Restart the gateway via ensureMoltbotGateway()
4. Retry the request once

Also adds proper error handling around containerFetch (previously had no try-catch at all), returning structured JSON errors instead of unhandled exceptions.

Fixes #179
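Step 1 above (detecting the crash) could be sketched as a simple message check. This is only an illustration: the function name `looksLikeGatewayCrash` is hypothetical, and it assumes the Sandbox SDK surfaces the failure as an error whose message contains the text 'is not listening'. The real detection logic in this PR may differ.

```typescript
// Hypothetical predicate, not the PR's actual code.
// Assumption: a crashed gateway shows up as an error whose message
// contains the substring "is not listening".
function looksLikeGatewayCrash(err: unknown): boolean {
  // Thrown values are not always Error instances, so fall back to String().
  const message = err instanceof Error ? err.message : String(err);
  return message.includes("is not listening");
}
```

Matching on a message substring is fragile if the SDK changes its wording, so a typed error or error code would be preferable if the SDK exposes one.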
…Gateway

Complete the crash retry implementation:
- HTTP proxy: catch 'is not listening' errors from containerFetch, kill crashed gateway, restart, retry once
- WebSocket proxy: same for wsConnect
- Return structured errors (503 for crash+failed recovery, 502 for other)

Extract killGateway() into gateway/process.ts as a shared function used by both the restart handler and the crash retry logic. Removes duplicate kill code from index.ts and api.ts.

Tested on staging: kill gateway → next HTTP request returns 200 (retry worked).
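The 503/502 split described in this commit could be sketched as follows. The function name `toProxyError`, the `ProxyError` shape, and the JSON field names are all assumptions made for illustration; only the status codes come from the commit message.

```typescript
// Hypothetical shape for a structured proxy error; field names are assumed.
interface ProxyError {
  status: 502 | 503;
  body: string; // JSON-encoded payload
}

function toProxyError(err: unknown, crashRecoveryFailed: boolean): ProxyError {
  const message = err instanceof Error ? err.message : String(err);
  // 503: the gateway crashed and could not be recovered.
  // 502: any other failure while proxying to the gateway.
  const status = crashRecoveryFailed ? 503 : 502;
  const error = crashRecoveryFailed
    ? "Gateway crashed and could not be restarted"
    : "Gateway request failed";
  return { status, body: JSON.stringify({ error, details: message }) };
}
```

In a Worker, the result would typically be wrapped as `new Response(e.body, { status: e.status, headers: { "Content-Type": "application/json" } })`.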
The 4 e2e variants pull the sandbox base image in parallel, frequently hitting Docker Hub rate limits. Retry up to 5 times with 60s backoff.
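The retry policy in this commit (up to 5 tries, 60s between them) can be illustrated with a small generic helper. This is a sketch only: the actual change presumably lives in CI configuration rather than TypeScript, `retryWithBackoff` is a hypothetical name, and it assumes "5 times" means 5 attempts in total with a fixed delay.

```typescript
// Hypothetical helper illustrating fixed-delay retries; not the PR's CI code.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,   // assumed: 5 attempts in total
  delayMs = 60000,   // assumed: fixed 60s wait between attempts
  // The sleep function is injectable so the delay can be stubbed out.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts) await sleep(delayMs);
    }
  }
  // All attempts failed; surface the most recent error.
  throw lastError;
}
```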

Fixes #179
Problem
When the OpenClaw gateway process starts successfully and passes the TCP port health check, but then crashes while handling the first HTTP request, subsequent requests fail with an 'is not listening' error from the Sandbox SDK.
This happens because:
- ensureMoltbotGateway() succeeds: the port is reachable
- containerFetch() is called: the gateway processes the request, returns 500, and crashes

The containerFetch() call had no error handling at all. Any exception would propagate as an unhandled error, returning an opaque failure to the client.

Fix
1. Retry on gateway crash (HTTP + WebSocket)
Both the HTTP proxy (containerFetch) and WebSocket proxy (wsConnect) now detect "container is not listening" errors from the Sandbox SDK. When detected:
- Kill the dead gateway process
- Restart the gateway via ensureMoltbotGateway()
- Retry the request once

This handles the exact pattern from #179: gateway starts → first request crashes it → retry brings it back.
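The retry-once flow can be sketched as a generic wrapper that applies equally to the HTTP and WebSocket paths. Everything here is hypothetical and for illustration: `withCrashRetry` and its parameter names are not taken from the PR, and the operation, recovery step, and crash check are passed in rather than calling the Sandbox SDK directly.

```typescript
// Hypothetical wrapper; names are illustrative, not the PR's actual code.
async function withCrashRetry<T>(
  op: () => Promise<T>,            // e.g. the containerFetch or wsConnect call
  recover: () => Promise<void>,    // e.g. kill the dead process, then restart the gateway
  isCrash: (err: unknown) => boolean, // e.g. matches "is not listening" errors
): Promise<T> {
  try {
    return await op();
  } catch (err) {
    // Errors unrelated to a crash are rethrown without a retry.
    if (!isCrash(err)) throw err;
    await recover();
    // Exactly one retry; a second failure propagates to the caller.
    return await op();
  }
}
```

Limiting recovery to a single retry avoids a restart loop when the gateway crashes on every request; the caller can then turn the second failure into a structured error response.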
2. Proper error handling around containerFetch
Previously, if containerFetch threw for any reason (not just crashes), the error was completely unhandled. Now all errors return structured JSON responses with appropriate status codes:
- 503: the gateway crashed and recovery failed
- 502: any other proxy error

Helper functions
- isGatewayCrashedError(): detects Sandbox SDK errors indicating the gateway process died
- killExistingGateway(): finds and kills the dead process so ensureMoltbotGateway() starts fresh

Test plan
- npm run build: passes
- npm run lint: 0 warnings, 0 errors
- npm test: 82 tests pass