fix(openclaw): fix worker startup race condition in installer #1295
Unayung wants to merge 2 commits into thedotmack:main
The installer's start_worker() had two bugs causing the worker to silently exit with code 0 after spawning:

1. Race condition: The installer wrote a PID file containing the worker's own PID before the worker started. When worker-service.cjs ran its default code path, it read this PID file, found the PID alive (itself), concluded another instance was already running, and exited immediately.
2. Missing --daemon flag: Without --daemon passed as argv[2], the worker fell into the default switch case, which includes duplicate instance detection. The --daemon flag triggers the correct startup path that directly initializes the HTTP server.

Fix:
- Remove stale PID files before starting the worker
- Pass --daemon flag to worker-service.cjs
- Add brief delay before writing PID file to avoid racing with the worker's own duplicate-detection logic
- Keep original PID file write for future management use

Tested on both Linux (Arch) and macOS (Mac mini).
xkonjin left a comment:
Nice targeted fix. One thing I'd tighten before merging: the new rm -f ~/.claude-mem/worker.pid ~/.claude-mem/worker-37777.pid step removes the duplicate-instance guardrail for a genuinely running worker if the installer is re-run while one is already healthy. In that case we may start a second daemon and orphan the first process.
I'd suggest checking whether the PID in those files is alive and belongs to the expected worker command before deleting them, instead of unconditionally removing both files.
Test gap: this really wants an installer-level regression test (or at least a shell test harness) for three cases: stale PID file, healthy existing worker, and immediate daemon startup after --daemon.
…ests

Address review feedback from thedotmack#1295:

1. PID file removal is now guarded: before deleting worker.pid or worker-37777.pid, we parse the PID, check if it's alive (kill -0), and verify it belongs to a worker-service process (ps -p ... | grep).
   - If a healthy worker is already running → skip start, return 0
   - If PID is dead or recycled (not worker-service) → remove stale file
2. Added three installer regression tests to test-install.sh:
   - Stale PID file (dead process) is cleaned up
   - Live non-worker PID (recycled) is cleaned up
   - --daemon flag is passed to worker-service
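The guard described in this commit message can be sketched roughly as follows. This is a minimal sketch, not the code in install.sh: the function names is_worker_pid and guard_pid_file are hypothetical, and it assumes the PID file holds a bare process ID on its first line. Only the kill -0 liveness check and the ps | grep process-name check come from the commit message.

```shell
# Hypothetical helper: succeeds only if the PID is alive AND its command line
# mentions worker-service (so a recycled PID owned by another program fails).
is_worker_pid() {
  local pid="$1"
  kill -0 "$pid" 2>/dev/null || return 1
  ps -p "$pid" -o command= 2>/dev/null | grep -q "worker-service"
}

# Hypothetical helper: returns 0 if a healthy worker owns the PID file (the
# caller should skip starting a new one); otherwise removes any stale file
# and returns 1.
guard_pid_file() {
  local pid_file="$1" pid
  [ -f "$pid_file" ] || return 1
  # Assumption: the file contains just the PID on its first line.
  pid="$(head -n 1 "$pid_file" | tr -cd '0-9')"
  if [ -n "$pid" ] && is_worker_pid "$pid"; then
    return 0
  fi
  rm -f "$pid_file"
  return 1
}
```

A caller would run guard_pid_file once per PID file and return early from start_worker() when it succeeds. The dead-PID and recycled-PID cases listed above both end in the rm -f branch.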
Thanks for the review! Pushed a fix addressing both points:

1. Guarded PID file removal
2. Added installer regression tests (3 cases)

Didn't add a full test for the "healthy existing worker" case since that requires a real running worker process, but the live-PID + process name check logic covers the guard.
Problem
The OpenClaw installer's start_worker() function silently fails to start the worker: the process spawns, immediately exits with code 0, and the health check times out after 30 attempts.

This affects both Linux and macOS installations via bash <(curl -fsSL https://install.cmem.ai/openclaw.sh).

Root Cause
Two bugs in openclaw/install.sh:

1. PID file race condition (self-duplicate detection)
The installer writes a PID file containing the worker's own PID before the worker starts.
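The ordering problem can be reproduced in isolation with a small simulation. This is a sketch under stated assumptions, not the installer's or the worker's code: fake_worker is a hypothetical stand-in that only mimics the duplicate-instance check described in this PR, and simulate_race is a hypothetical driver that writes the PID file the way the buggy installer is described as doing.

```shell
# Simulation only. fake_worker is a hypothetical stand-in for the worker's
# default code path; it is not the code in worker-service.cjs.
fake_worker() {
  local pid_file="$1"
  sleep 1   # stand-in for boot time; the installer writes the PID file meanwhile
  if [ -f "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null; then
    # The live PID found here is the worker's own.
    echo "another instance is running; exiting"
    return 0
  fi
  echo "starting HTTP server"
}

simulate_race() {
  local dir pid_file worker_pid
  dir="$(mktemp -d)"
  pid_file="$dir/worker.pid"
  fake_worker "$pid_file" > "$dir/out" &
  worker_pid=$!
  echo "$worker_pid" > "$pid_file"   # racy write: lands before the worker's check
  wait "$worker_pid"
  cat "$dir/out"
  rm -rf "$dir"
}

simulate_race
```

Because the PID file already names the background process by the time it performs its check, the simulated worker takes the "another instance" branch and exits with status 0, which matches the silent exit described under Problem.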
When worker-service.cjs starts and hits the default code path, it reads this PID file, finds the PID alive (it's checking itself), concludes another instance is already running, and exits.

2. Missing --daemon flag
Without --daemon passed as argv[2], the worker falls into the default switch case, which includes the duplicate instance detection. The --daemon flag triggers the correct startup path.

Fix
- Pass --daemon flag to worker-service.cjs

Testing
Tested on Linux (Arch) and macOS (Mac mini). Both confirmed worker starts successfully and responds to health checks after the fix.
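
The corrected start order (spawn with --daemon, wait briefly, then write the PID file) might look like the sketch below. It is illustrative only: start_worker_sketch, its arguments, and the one-second delay are assumptions, the liveness check before writing the file is an addition not described in the PR, and stale PID cleanup is left out.

```shell
# Hypothetical sketch of the fixed ordering; not the code in install.sh.
# Usage: start_worker_sketch <pid_dir> <worker command...>
start_worker_sketch() {
  local pid_dir="$1" worker_pid
  shift                         # remaining arguments are the worker command
  mkdir -p "$pid_dir"
  # Pass --daemon so the worker skips the default duplicate-detection path.
  "$@" --daemon >/dev/null 2>&1 &
  worker_pid=$!
  sleep 1                       # brief delay before writing the PID file (length assumed)
  # Extra safety check (an assumption): only record a PID that is still alive.
  if kill -0 "$worker_pid" 2>/dev/null; then
    echo "$worker_pid" > "$pid_dir/worker.pid"
    return 0
  fi
  return 1
}
```

A hypothetical invocation would be start_worker_sketch "$HOME/.claude-mem" node worker-service.cjs, with the script path adjusted to wherever the worker is actually installed.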