Skip to content

fix(lifecycle): shut down when grandparent reparents to init (#311)#315

Open
ousamabenyounes wants to merge 1 commit intomksglu:nextfrom
ousamabenyounes:fix/issue-311
Open

fix(lifecycle): shut down when grandparent reparents to init (#311)#315
ousamabenyounes wants to merge 1 commit intomksglu:nextfrom
ousamabenyounes:fix/issue-311

Conversation

@ousamabenyounes
Copy link
Copy Markdown

@ousamabenyounes ousamabenyounes commented Apr 20, 2026

What

Partial fix for the zombie MCP server problem described in #311. The server previously only monitored its direct ppid; this PR also watches the grandparent and shuts down when it reparents to init while the chain was originally not rooted at init.

Fixes the common case described in #311 on macOS/Linux. The second half of the issue (the vendored MCP SDK missing a stdin end handler) is deliberately out of scope — see the "Not fixed here" section below.

Why the existing guard wasn't enough

The Claude Code process chain is:

Claude Code → start.mjs → npm exec → context-mode server

When Claude Code dies:

  • start.mjs reparents to init (PID 1)
  • npm exec stays alive, so context-mode server's direct ppid never changes
  • The existing defaultIsParentAlive is still happy, the 30-second ppid tick never flips, and the server spins forever.

Extending the check to "is the grandparent still a real process?" catches exactly this case. The grandparent IS start.mjs, and start.mjs's ppid flips to 1 the moment Claude Code dies.

Why not stdin

The issue suggests adding stdin end/close handlers. #236 already removed those because the MCP stdio transport owns stdin and transient pipe events caused spurious -32000 errors. The repo locks that in with three tests in tests/lifecycle.test.ts:

  • cleanup does NOT remove stdin listeners (stdin is not used)
  • startLifecycleGuard does NOT call process.stdin.resume()
  • child does NOT exit when stdin is closed (#236)

All three continue to pass — this PR adds zero stdin listeners.

Changes

src/lifecycle.ts

  • New readGrandparentPpidImpl() helper: runs ps -o ppid= -p $PPID on POSIX, returns NaN on Windows or probe failure.
  • New exported factory makeDefaultIsParentAlive(deps) with injectable getPpid and readGrandparentPpid — used by production (const defaultIsParentAlive = makeDefaultIsParentAlive()) and by tests with fake readers.
  • The default check now additionally returns false when the grandparent is now PID 1 but wasn't at startup.

Safety net for false positives:

  • If the original grandparent was already 1 (daemon / nohup / launchd startup), the grandparent check is skipped entirely.
  • On Windows (ps unavailable) or if the ps probe fails, the reader returns NaN and the check is also skipped. So the behavior on Windows is unchanged.

Not fixed here (out of scope)

#311 also reports that the MCP SDK's StdioServerTransport lacks an end / close handler inside server.bundle.mjs. That's vendored third-party code, and #236 explicitly established that we don't drive shutdown via stdin listeners in this process. Addressing that root cause belongs in the SDK bundle (or a separate, coordinated change around #236's constraints).

Tests added (tests/lifecycle.test.ts)

5 regression cases in a new describe(\"makeDefaultIsParentAlive — grandparent orphan detection (#311)\"), using the factory with injected readers so each scenario is hermetic:

  • returns false when grandparent is reparented to init after startup — core regression case.
  • does not false-positive when grandparent was already init at startup — daemon-start tolerance.
  • tolerates NaN grandparent (Windows / ps failure) — falls back to original behavior.
  • direct ppid death still takes precedence over grandparent check — existing contract preserved.
  • grandparent check kicks in through startLifecycleGuard end-to-end — wiring verification via the public API.

Test plan

  • npx vitest run tests/lifecycle.test.ts → 14/14 pass (9 existing + 5 new)
  • npm test → 1600 pass, 23 skipped, 0 failures
  • npm run typecheck → clean
  • npm run build → clean

Generated by Claude Code
Vibe coded by ousamabenyounes

)

Zombie context-mode servers persisted at 50-60% CPU when Claude Code
died because the process chain is

  Claude Code → start.mjs → npm exec → context-mode server

When Claude Code dies, `start.mjs` reparents to init (PID 1) but
`npm exec` (our direct parent) stays alive — so the existing ppid-only
check never flipped and the lifecycle guard never fired.

This PR extends the default parent-alive check to also verify the
grandparent. If the grandparent is reparented to PID 1 and wasn't
already 1 at startup, the wrapper chain is orphaned and we shut down.
When the grandparent was already 1 at startup (launchd / systemd /
detached daemon) the check is skipped to avoid a false positive, and on
Windows — where there is no cheap cross-platform `ps` equivalent — the
reader returns NaN and the behavior is unchanged. So this change is
strictly additive to what was there before.

Stdin is intentionally not touched. The existing `cleanup does NOT
remove stdin listeners` and `child does NOT exit when stdin is closed
(mksglu#236)` tests still pass.

Note: the issue also describes a symptom inside the vendored MCP SDK
(missing stdin end handler in `StdioServerTransport`). That lives in
`server.bundle.mjs` and is out of scope here — mksglu#236 documented why we
won't patch shutdown via stdin in this process.

- src/lifecycle.ts: `makeDefaultIsParentAlive(deps)` factory, injectable
  ppid + grandparent readers. Production wiring uses it with the live
  `ps` probe. Module-level `defaultIsParentAlive` remains the single
  default consumed by `startLifecycleGuard`.
- tests/lifecycle.test.ts: 5 new regression cases (grandparent flipping
  to init, daemon-startup false-positive guard, Windows/NaN fallback,
  direct-ppid precedence, and an end-to-end check routed through
  startLifecycleGuard).

Tests:
- npx vitest run tests/lifecycle.test.ts → 14/14 pass
- npm test → 1600 pass, 23 skipped, 0 fail
- npm run typecheck → clean

Co-Authored-By: Claude <noreply@anthropic.com>
@mksglu
Copy link
Copy Markdown
Owner

mksglu commented Apr 20, 2026

Hi @ousamabenyounes! That's great. Did you fully-tested that? (I'm know you'd already created a some Unit tests) I'm gonna merge to the next branch.

@ousamabenyounes
Copy link
Copy Markdown
Author

Thanks @mksglu! Test status, to set expectations cleanly:

Unit-level (hermetic, deterministic): 5 new cases in tests/lifecycle.test.ts under describe("makeDefaultIsParentAlive — grandparent orphan detection (#311)"). They use the factory with injected ppid/grandparent readers so each scenario is isolated:

  • grandparent flipping from a real PID to 1 post-startup → returns false
  • grandparent already 1 at startup (daemon / launchd / nohup case) → never flips, no false positive ✅
  • NaN grandparent (Windows / ps probe failure) → falls back to original ppid-only behavior ✅
  • direct-ppid death takes precedence (short-circuit before grandparent check) ✅
  • end-to-end through startLifecycleGuard with injected reader, actual shutdown path fires ✅
$ npx vitest run tests/lifecycle.test.ts
Test Files  1 passed (1)
Tests  14 passed (14)

Full suite: npm test → 1600 pass, 23 skipped, 0 failures. Typecheck clean. Build clean. The existing #236-era tests (cleanup does NOT remove stdin listeners, does NOT call process.stdin.resume(), child does NOT exit when stdin is closed) all still pass — this PR adds zero stdin listeners.

What I didn't add: a real 3-level spawn → spawn → spawn → kill-grandparent → assert grandchild exits harness. That'd need a POSIX-only helper (same describe.skipIf(isWindows) pattern the repo already uses for #236) and a few hundred ms of sleep to hit the real 30s guard tick with a lowered checkIntervalMs. Happy to prototype one if you want it before merging — otherwise the current coverage is "logic-deterministic + wiring-verified", and a one-time manual repro on your side post-merge (VSCode → Claude Code → kill the VSCode process → ps aux | grep context-mode 30s later) would close the loop cheaply.

Your call on the integration test. Ready to ship either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants