fix(process): use process group kill when SIGKILL fails to terminate child#902
Open
livepeer-tessa wants to merge 1 commit intomainfrom
Open
fix(process): use process group kill when SIGKILL fails to terminate child#902livepeer-tessa wants to merge 1 commit intomainfrom
livepeer-tessa wants to merge 1 commit intomainfrom
Conversation
…child
When a pipeline process's child (pid=169) handles SIGTERM and signals
_run_pipeline_loops to finish, but the process remains alive (is_alive=True)
after SIGKILL, the monitor loop raises RuntimeError and forces a container
restart. This can happen when:
1. The child spawned grandchildren (e.g. ComfyUI worker subprocesses) that
are not in the same process group — killing only the parent PID leaves
them alive, blocking the parent from exiting.
2. The child enters uninterruptible D-state waiting for a descendant kernel
operation; SIGKILL is not delivered until the wait completes.
Two changes:
1. process_loop() calls os.setsid() on startup (Unix only). This makes the
child the leader of a new session and process group. Every grandchild it
spawns inherits the same PGID (unless it calls setsid() itself), so a
subsequent os.killpg() from the parent can reach the whole subtree.
2. stop() adds a process-group kill fallback after the per-PID SIGKILL times
out: os.killpg(pgid, SIGKILL) followed by a 3-second wait. If the whole
group is gone, stop() returns normally and avoids the container restart.
If it still fails, /proc/{pid}/status is logged to diagnose D-state before
re-raising RuntimeError.
Fixes: daydreamlive/scope#735
Signed-off-by: livepeer-tessa <livepeer-tessa@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a pipeline's child process (e.g.
streamdiffusion-sdxl) entersERRORstate,ProcessGuardian._monitor_loopcallsprocess.stop(), which sends SIGTERM then SIGKILL. In rare cases SIGKILL fails to terminate the child (is_alive=True), causingRuntimeError("Failed to kill process")and forcing a full container restart.Root causes (both can apply simultaneously):
Ref: daydreamlive/scope#735
Changes
1.
process_loop()— callos.setsid()on startupMakes the child process the leader of a new session and process group (Unix only). Every grandchild it spawns inherits the same PGID (unless it explicitly calls
setsid()itself), setting up the precondition for a group kill.2.
stop()— process group kill fallbackAfter per-PID SIGKILL fails (2 s timeout), try
os.killpg(pgid, SIGKILL)with a 3 s wait. If the whole group is gone,stop()returns normally — container restart avoided. If it still fails,/proc/{pid}/statusis logged for D-state diagnosis before re-raisingRuntimeError.Also saves
child_pidbefore queue cleanup sinceprocess.pidcan becomeNoneafter join.Kill sequence after this fix
Testing
Behavior is hard to reproduce in isolation (requires a D-state or rogue grandchild), but the code paths are straightforward. The
os.setsid()call is guarded byis_unixand catchesOSErrorfor the edge case where the process is already a group leader. Thekillpgpath catchesProcessLookupError/PermissionError/OSErrorso a missing/already-dead process doesn't cause a secondary failure.