fix(openclaw): fix worker startup race condition in installer #1295

Open
Unayung wants to merge 2 commits into thedotmack:main from Unayung:fix/openclaw-installer-worker-startup

Unayung commented Mar 7, 2026

Problem

The OpenClaw installer's start_worker() function silently fails to start the worker — the process spawns, immediately exits with code 0, and the health check times out after 30 attempts.

This affects both Linux and macOS installations via bash <(curl -fsSL https://install.cmem.ai/openclaw.sh).

Root Cause

Two bugs in openclaw/install.sh:

1. PID file race condition (self-duplicate detection)

The installer writes a PID file containing the worker's own PID before the worker starts:

nohup "$BUN_PATH" "$worker_script" >> "$log_file" 2>&1 &
WORKER_PID=$!
# Writes {pid: WORKER_PID} to worker.pid immediately

When worker-service.cjs starts and hits the default code path, it reads this PID file, finds the PID alive (it's checking itself), concludes another instance is already running, and exits:

// In worker-service.cjs default path:
let i = Rf();  // reads PID file → finds own PID
i && xw(i.pid) && process.exit(0);  // alive check → true → exit
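
The self-detection can be reproduced outside the worker. The following is a minimal sketch, not the installer's or the worker's actual code: fake_worker is a hypothetical stand-in for the startup check, and the PID file is assumed to be JSON with a numeric "pid" field.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction of the self-detection; not the real worker code.
# Assumes the PID file is JSON with a numeric "pid" field.

# Stand-in for the worker's startup check: read the recorded PID and
# treat any live process with that PID as an already-running instance.
fake_worker() {
  local recorded
  recorded="$(sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1")"
  if [ -n "$recorded" ] && kill -0 "$recorded" 2>/dev/null; then
    echo "duplicate detected (pid $recorded), exiting 0"
    return 0
  fi
  echo "no live instance recorded, starting server"
}

pid_file="$(mktemp)"

# The installer records the PID of the process it just spawned. Here the
# current shell plays that process, so the check finds itself alive.
printf '{"pid": %d}\n' "$$" > "$pid_file"
fake_worker "$pid_file"

# With no PID recorded ahead of time, startup proceeds.
: > "$pid_file"
fake_worker "$pid_file"

rm -f "$pid_file"
```

The first call reports a duplicate because the recorded PID belongs to the very process doing the check; the second call proceeds to startup.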

2. Missing --daemon flag

Without --daemon as argv[2], the worker falls into the default switch case, which includes the duplicate-instance detection. The --daemon flag triggers the correct startup path.

Fix

  • Remove stale PID files before starting the worker
  • Pass --daemon flag to worker-service.cjs
  • Let the worker manage its own PID file internally (written after successful server startup)
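
Under those three points, the launch step might look like the sketch below. It is illustrative only: launch_worker and wait_until_healthy are hypothetical helper names, the health probe is passed in by the caller because the worker's health endpoint is not shown in this PR, and the one-second retry interval is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from openclaw/install.sh.

# launch_worker RUNTIME SCRIPT LOG_FILE [PID_FILE...]
# Clears leftover PID files, then starts the worker detached with --daemon
# as the first script argument (argv[2] inside the worker). No PID file is
# written here; that is left to the worker.
launch_worker() {
  local runtime="$1" script="$2" log_file="$3"
  shift 3
  if [ "$#" -gt 0 ]; then
    rm -f -- "$@"
  fi
  nohup "$runtime" "$script" --daemon >> "$log_file" 2>&1 &
  LAUNCHED_PID=$!
}

# wait_until_healthy ATTEMPTS PROBE [ARGS...]
# Runs the caller-supplied probe up to ATTEMPTS times, one second apart
# (assumed interval). Returns 0 on the first success, 1 if all attempts fail.
wait_until_healthy() {
  local attempts="$1" i=0
  shift
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A caller would pass the installer's own values, for example launch_worker "$BUN_PATH" "$worker_script" "$log_file" followed by the PID file paths, and then wait_until_healthy 30 with whatever health probe the installer already uses. The unconditional removal mirrors the first revision of this PR; the review further down asks for a guard.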

Testing

Tested on:

  • Linux (Arch Linux x64, Bun 1.3.10)
  • macOS (Mac mini, Bun 1.3.10)

Both confirmed worker starts successfully and responds to health checks after the fix.

The installer's start_worker() had two bugs causing the worker to
silently exit with code 0 after spawning:

1. Race condition: The installer wrote a PID file containing the
   worker's own PID before the worker started. When worker-service.cjs
   ran its default code path, it read this PID file, found the PID
   alive (itself), concluded another instance was already running,
   and exited immediately.

2. Missing --daemon flag: Without passing --daemon as argv[2], the
   worker fell into the default switch case which includes duplicate
   instance detection. The --daemon flag triggers the correct startup
   path that directly initializes the HTTP server.

Fix:
- Remove stale PID files before starting the worker
- Pass --daemon flag to worker-service.cjs
- Add brief delay before writing PID file to avoid racing with
  the worker's own duplicate-detection logic
- Keep original PID file write for future management use

Tested on both Linux (Arch) and macOS (Mac mini).
Unayung force-pushed the fix/openclaw-installer-worker-startup branch from fac0476 to 158f08a on March 7, 2026 07:38

xkonjin left a comment

Nice targeted fix. One thing I'd tighten before merging: the new rm -f ~/.claude-mem/worker.pid ~/.claude-mem/worker-37777.pid step removes the duplicate-instance guardrail for a genuinely running worker if the installer is re-run while one is already healthy. In that case we may start a second daemon and orphan the first process.

I'd suggest checking whether the PID in those files is alive and belongs to the expected worker command before deleting them, instead of unconditionally removing both files.

Test gap: this really wants an installer-level regression test (or at least a shell test harness) for three cases: stale PID file, healthy existing worker, and immediate daemon startup after --daemon.

…ests

Address review feedback from thedotmack#1295:

1. PID file removal is now guarded: before deleting worker.pid or
   worker-37777.pid, we parse the PID, check if it's alive (kill -0),
   and verify it belongs to a worker-service process (ps -p ... | grep).
   - If a healthy worker is already running → skip start, return 0
   - If PID is dead or recycled (not worker-service) → remove stale file

2. Added three installer regression tests to test-install.sh:
   - Stale PID file (dead process) is cleaned up
   - Live non-worker PID (recycled) is cleaned up
   - --daemon flag is passed to worker-service

Unayung commented Mar 8, 2026

Thanks for the review! Pushed a fix addressing both points:

1. Guarded PID file removal
Instead of unconditional rm -f, we now:

  • Parse the PID from the file
  • Check if it's alive (kill -0)
  • Verify it's actually a worker-service process via ps
  • If a healthy worker is running → skip start entirely (return 0)
  • Only remove the file if the PID is dead or recycled (not a worker process)
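
The guard described above could be structured roughly as follows. This is a sketch under assumptions, not the code in this PR: the helper names are hypothetical, the PID file is assumed to be JSON with a numeric "pid" field, and the process match is a plain substring check on the command line reported by ps.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the PID-file format are assumptions.

# Print the PID recorded in a PID file, or nothing if none can be parsed.
read_recorded_pid() {
  sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" 2>/dev/null | head -n 1
}

# Succeed only if the PID is alive and its command line mentions worker-service.
is_live_worker() {
  local pid="$1"
  kill -0 "$pid" 2>/dev/null || return 1
  ps -p "$pid" -o args= 2>/dev/null | grep -q 'worker-service'
}

# check_pid_file PID_FILE
# Returns 0 if a live worker owns the PID file, so the caller should skip
# the start. Otherwise removes the stale file (if any) and returns 1.
check_pid_file() {
  local pid_file="$1" pid
  [ -f "$pid_file" ] || return 1
  pid="$(read_recorded_pid "$pid_file")"
  if [ -n "$pid" ] && is_live_worker "$pid"; then
    return 0
  fi
  rm -f -- "$pid_file"
  return 1
}
```

Inside start_worker, a call such as check_pid_file ~/.claude-mem/worker.pid && return 0 would skip the launch when a worker is already running and fall through to a clean start otherwise.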

2. Added installer regression tests (3 cases in test-install.sh):

  • Stale PID file (dead process) → cleaned up ✓
  • Live non-worker PID (recycled PID) → cleaned up ✓
  • --daemon flag passed to worker-service → verified ✓

Didn't add a full test for the "healthy existing worker" case since that requires a real running worker process, but the live-PID + process name check logic covers the guard.
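
One possible way to cover the healthy-worker case without a real worker is a stub process whose command line contains worker-service. The sketch below is hypothetical and not part of test-install.sh; the stub name, the helper names, and the one-second settle delay are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: stub name and helpers are hypothetical, not from test-install.sh.

# Start a long-running stub whose command line contains "worker-service",
# so a ps-based name check sees it as a worker. Sets STUB_DIR and STUB_PID.
start_stub_worker() {
  STUB_DIR="$(mktemp -d)"
  cat > "$STUB_DIR/worker-service-stub.sh" <<'EOF'
trap 'kill "$child" 2>/dev/null; exit 0' TERM
sleep 30 &
child=$!
wait "$child"
EOF
  sh "$STUB_DIR/worker-service-stub.sh" &
  STUB_PID=$!
  # Give the child time to exec; until then ps may still report the
  # parent's command line for this PID.
  sleep 1
}

# Stop the stub, reap it, and remove its temporary directory.
stop_stub_worker() {
  kill "$STUB_PID" 2>/dev/null
  wait "$STUB_PID" 2>/dev/null
  rm -rf "$STUB_DIR"
}

# The two checks the PR describes: liveness, then command-line match.
looks_like_worker() {
  kill -0 "$1" 2>/dev/null && ps -p "$1" -o args= 2>/dev/null | grep -q 'worker-service'
}
```

A test could call start_stub_worker, write STUB_PID into a temporary PID file, run the installer's start path against it, assert that no second process was launched, and finish with stop_stub_worker.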
