Skip to content

when running an itest_service, exec.CommandContext sends SIGKILL before graceful shutdown can run #58

@samschlegel

Description

@samschlegel

Problem

When svcinit receives SIGINT/SIGTERM, the configured shutdown_signal and shutdown_timeout are bypassed because exec.CommandContext automatically sends SIGKILL when the context is cancelled.

Root Cause

In runner/runner.go:258:

cmd := exec.CommandContext(ctx, s.Exe, s.Args...)

When the signal handler calls cancelFunc(), Go's exec package immediately sends SIGKILL to the child process via pidfd_send_signal. This happens before StopAll() can send SIGTERM with the configured grace period.

Timeline (from strace analysis)

  1. SIGINT received → signal handler calls cancelFunc()
  2. pidfd_send_signal(fd, SIGKILL, ...) sent by Go runtime
  3. Child process killed immediately
  4. "Shutting down services." printed
  5. StopAll() tries to send SIGTERM to already-dead process

Impact

Services that need graceful shutdown (like docker run/docker compose up) are killed immediately without time to clean up. The shutdown_signal = "SIGTERM" and shutdown_timeout = "5s" configuration is effectively ignored.

Proposed Fixes

Have a couple ideas here but would like some input on what you think feels best

  1. Set cmd.Cancel (Go 1.20+) to send the configured ShutdownSignal instead of SIGKILL, and set cmd.WaitDelay to the configured shutdown_timeout. Not sure if the WaitDelay is necessary or not.
cmd.Cancel = func() error {
    var sig syscall.Signal
    switch s.ShutdownSignal {
    case "SIGTERM":
        sig = syscall.SIGTERM
    default:
        sig = syscall.SIGKILL
    }
    return killGroup(cmd, sig)
}
cmd.WaitDelay = shutdownTimeout
  1. Use a separate context for commands separate from the root context and propagate manually. Since the rest of the code seems to be handling the shutdown manually, perhaps this is the right move?

  2. Don't use exec.CommandContext - in a similar vein, the code is already handling this lifecycle stuff, so perhaps it makes more sense to just remove the Context propagation and handle things ourselves?

Reproduction

  1. Create a service that needs graceful shutdown (e.g., something that wraps docker run)
  2. Configure shutdown_signal = "SIGTERM" and shutdown_timeout = "5s"
  3. Run with bazel run
  4. Send SIGINT (Ctrl+C)
  5. Observe that the service is killed immediately without graceful shutdown

I'll also see about getting this into a test

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions