Skip to content

fix(process): use process group kill when SIGKILL fails to terminate child#902

Open
livepeer-tessa wants to merge 1 commit intomainfrom
fix/process-guardian-sigkill-process-group
Open

fix(process): use process group kill when SIGKILL fails to terminate child#902
livepeer-tessa wants to merge 1 commit intomainfrom
fix/process-guardian-sigkill-process-group

Conversation

@livepeer-tessa
Copy link

Problem

When a pipeline's child process (e.g. streamdiffusion-sdxl) enters ERROR state, ProcessGuardian._monitor_loop calls process.stop(), which sends SIGTERM then SIGKILL. In rare cases SIGKILL fails to terminate the child (is_alive=True), causing RuntimeError("Failed to kill process") and forcing a full container restart.

Root causes (both can apply simultaneously):

  1. Grandchild processes — the pipeline spawns subprocesses (e.g. ComfyUI worker threads/procs) not in the parent's process group. Killing only the parent PID leaves those alive, which can hold kernel resources and prevent the parent from exiting.
  2. D-state — after handling SIGTERM the parent may enter an uninterruptible sleep waiting for a grandchild kernel operation to complete; SIGKILL is not delivered until that wait resolves.

Ref: daydreamlive/scope#735

Changes

1. process_loop() — call os.setsid() on startup

Makes the child process the leader of a new session and process group (Unix only). Every grandchild it spawns inherits the same PGID (unless it explicitly calls setsid() itself), setting up the precondition for a group kill.

2. stop() — process group kill fallback

After per-PID SIGKILL fails (2 s timeout), try os.killpg(pgid, SIGKILL) with a 3 s wait. If the whole group is gone, stop() returns normally — container restart avoided. If it still fails, /proc/{pid}/status is logged for D-state diagnosis before re-raising RuntimeError.

Also saves child_pid before queue cleanup since process.pid can become None after join.

Kill sequence after this fix

stop() called
  → done.set()
  → SIGTERM → wait 3s
    [if still alive]
  → close queues
  → SIGKILL (per-PID) → wait 2s
    [if still alive — NEW]
  → os.killpg(pgid, SIGKILL) → wait 3s
    [if still alive]
  → log /proc/{pid}/status
  → raise RuntimeError

Testing

Behavior is hard to reproduce in isolation (requires a D-state or rogue grandchild), but the code paths are straightforward. The os.setsid() call is guarded by is_unix and catches OSError for the edge case where the process is already a group leader. The killpg path catches ProcessLookupError/PermissionError/OSError so a missing/already-dead process doesn't cause a secondary failure.

…child

When a pipeline process's child (pid=169) handles SIGTERM and signals
_run_pipeline_loops to finish, but the process remains alive (is_alive=True)
after SIGKILL, the monitor loop raises RuntimeError and forces a container
restart. This can happen when:

  1. The child spawned grandchildren (e.g. ComfyUI worker subprocesses) that
     are not in the same process group — killing only the parent PID leaves
     them alive, blocking the parent from exiting.
  2. The child enters uninterruptible D-state waiting for a descendant kernel
     operation; SIGKILL is not delivered until the wait completes.

Two changes:

1. process_loop() calls os.setsid() on startup (Unix only). This makes the
   child the leader of a new session and process group. Every grandchild it
   spawns inherits the same PGID (unless it calls setsid() itself), so a
   subsequent os.killpg() from the parent can reach the whole subtree.

2. stop() adds a process-group kill fallback after the per-PID SIGKILL times
   out: os.killpg(pgid, SIGKILL) followed by a 3-second wait. If the whole
   group is gone, stop() returns normally and avoids the container restart.
   If it still fails, /proc/{pid}/status is logged to diagnose D-state before
   re-raising RuntimeError.

Fixes: daydreamlive/scope#735
Signed-off-by: livepeer-tessa <livepeer-tessa@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant