fix(lifecycle): shut down when grandparent reparents to init (#311)#315
fix(lifecycle): shut down when grandparent reparents to init (#311)#315ousamabenyounes wants to merge 1 commit intomksglu:nextfrom
Conversation
) Zombie context-mode servers persisted at 50-60% CPU when Claude Code died because the process chain is Claude Code → start.mjs → npm exec → context-mode server When Claude Code dies, `start.mjs` reparents to init (PID 1) but `npm exec` (our direct parent) stays alive — so the existing ppid-only check never flipped and the lifecycle guard never fired. This PR extends the default parent-alive check to also verify the grandparent. If the grandparent is reparented to PID 1 and wasn't already 1 at startup, the wrapper chain is orphaned and we shut down. When the grandparent was already 1 at startup (launchd / systemd / detached daemon) the check is skipped to avoid a false positive, and on Windows — where there is no cheap cross-platform `ps` equivalent — the reader returns NaN and the behavior is unchanged. So this change is strictly additive to what was there before. Stdin is intentionally not touched. The existing `cleanup does NOT remove stdin listeners` and `child does NOT exit when stdin is closed (mksglu#236)` tests still pass. Note: the issue also describes a symptom inside the vendored MCP SDK (missing stdin end handler in `StdioServerTransport`). That lives in `server.bundle.mjs` and is out of scope here — mksglu#236 documented why we won't patch shutdown via stdin in this process. - src/lifecycle.ts: `makeDefaultIsParentAlive(deps)` factory, injectable ppid + grandparent readers. Production wiring uses it with the live `ps` probe. Module-level `defaultIsParentAlive` remains the single default consumed by `startLifecycleGuard`. - tests/lifecycle.test.ts: 5 new regression cases (grandparent flipping to init, daemon-startup false-positive guard, Windows/NaN fallback, direct-ppid precedence, and an end-to-end check routed through startLifecycleGuard). Tests: - npx vitest run tests/lifecycle.test.ts → 14/14 pass - npm test → 1600 pass, 23 skipped, 0 fail - npm run typecheck → clean Co-Authored-By: Claude <noreply@anthropic.com>
|
Hi @ousamabenyounes! That's great. Did you fully-tested that? (I'm know you'd already created a some Unit tests) I'm gonna merge to the |
|
Thanks @mksglu! Test status, to set expectations cleanly: Unit-level (hermetic, deterministic): 5 new cases in
Full suite: What I didn't add: a real 3-level Your call on the integration test. Ready to ship either way. |
What
Partial fix for the zombie MCP server problem described in #311. The server previously only monitored its direct ppid; this PR also watches the grandparent and shuts down when it reparents to init while the chain was originally not rooted at init.
Fixes the common case described in #311 on macOS/Linux. The second half of the issue (the vendored MCP SDK missing a stdin
endhandler) is deliberately out of scope — see the "Not fixed here" section below.Why the existing guard wasn't enough
The Claude Code process chain is:
When Claude Code dies:
start.mjsreparents to init (PID 1)npm execstays alive, socontext-mode server's directppidnever changesdefaultIsParentAliveis still happy, the 30-second ppid tick never flips, and the server spins forever.Extending the check to "is the grandparent still a real process?" catches exactly this case. The grandparent IS
start.mjs, andstart.mjs's ppid flips to 1 the moment Claude Code dies.Why not stdin
The issue suggests adding stdin
end/closehandlers. #236 already removed those because the MCP stdio transport owns stdin and transient pipe events caused spurious-32000errors. The repo locks that in with three tests intests/lifecycle.test.ts:cleanup does NOT remove stdin listeners (stdin is not used)startLifecycleGuard does NOT call process.stdin.resume()child does NOT exit when stdin is closed (#236)All three continue to pass — this PR adds zero stdin listeners.
Changes
src/lifecycle.tsreadGrandparentPpidImpl()helper: runsps -o ppid= -p $PPIDon POSIX, returnsNaNon Windows or probe failure.makeDefaultIsParentAlive(deps)with injectablegetPpidandreadGrandparentPpid— used by production (const defaultIsParentAlive = makeDefaultIsParentAlive()) and by tests with fake readers.falsewhen the grandparent is now PID 1 but wasn't at startup.Safety net for false positives:
psunavailable) or if thepsprobe fails, the reader returnsNaNand the check is also skipped. So the behavior on Windows is unchanged.Not fixed here (out of scope)
#311 also reports that the MCP SDK's
StdioServerTransportlacks anend/closehandler insideserver.bundle.mjs. That's vendored third-party code, and #236 explicitly established that we don't drive shutdown via stdin listeners in this process. Addressing that root cause belongs in the SDK bundle (or a separate, coordinated change around #236's constraints).Tests added (
tests/lifecycle.test.ts)5 regression cases in a new
describe(\"makeDefaultIsParentAlive — grandparent orphan detection (#311)\"), using the factory with injected readers so each scenario is hermetic:returns false when grandparent is reparented to init after startup— core regression case.does not false-positive when grandparent was already init at startup— daemon-start tolerance.tolerates NaN grandparent (Windows / ps failure)— falls back to original behavior.direct ppid death still takes precedence over grandparent check— existing contract preserved.grandparent check kicks in through startLifecycleGuard end-to-end— wiring verification via the public API.Test plan
npx vitest run tests/lifecycle.test.ts→ 14/14 pass (9 existing + 5 new)npm test→ 1600 pass, 23 skipped, 0 failuresnpm run typecheck→ cleannpm run build→ cleanGenerated by Claude Code
Vibe coded by ousamabenyounes