Missing child.wait() after start_kill() creates zombie processes and prevents directory cleanup #4726

@2mawi2

What version of Codex is running?

0.44.0

Which model were you using?

gpt-5-codex

What platform is your computer?

Darwin 24.6.0 arm64 arm

What steps can reproduce the bug?

When Codex exec terminates a child process (due to timeout or Ctrl+C), it
calls start_kill() but never calls wait(). This creates zombie processes
that hold file descriptors.

Location: codex-rs/core/src/exec.rs lines 339 and 345

Err(_) => {
    // timeout
    child.start_kill()?;
    // Debatable whether `child.wait().await` should be called here. ⚠️
    (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + TIMEOUT_CODE), true)
}

To reproduce:

  1. Run Codex with a timeout that will trigger:
    timeout 2s codex exec "sleep 10"
  2. After timeout, check for zombie processes:
    ps aux | grep codex | grep defunct
  3. Try to remove the working directory that Codex was running in:
    rm -rf /path/to/worktree
    -> Fails with "Directory busy" or "Resource busy"

The issue only appears in 0.44.0 and can be traced back to commit b8e1fe6.

What is the expected behavior?

After calling start_kill(), the parent process should call wait() to reap
the terminated child process. This is standard Unix process management:

Err(_) => {
    child.start_kill()?;
    let _ = child.wait().await;  // ✅ Reap the zombie
    (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + TIMEOUT_CODE), true)
}

Expected:

  • No zombie processes after timeout/Ctrl+C
  • File descriptors (including CWD) are released immediately
  • Host applications can clean up worktree directories without errors
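
The kill-then-wait pattern described above can be tried in isolation. The following is a minimal sketch using the blocking `std::process` API rather than tokio, so it is not the Codex code path. The helper names `spawn_sleeper` and `kill_and_reap` are hypothetical, and it assumes a Unix-like system with `sleep` on the PATH.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};

/// Spawn a long-running child that stands in for the command Codex executes.
/// Assumption: a Unix-like system with `sleep` available on PATH.
fn spawn_sleeper() -> io::Result<Child> {
    Command::new("sleep").arg("30").spawn()
}

/// Kill the child, then reap it so that no zombie entry is left behind.
fn kill_and_reap(child: &mut Child) -> io::Result<ExitStatus> {
    // On Unix this sends SIGKILL; it does not collect the exit status.
    child.kill()?;
    // wait() collects the exit status, which lets the kernel drop the
    // process-table entry for the terminated child.
    child.wait()
}
```

Calling `kill_and_reap(&mut child)` on a child returned by `spawn_sleeper()` should return promptly with a non-success status, since the child was terminated by a signal rather than exiting normally.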

What do you see instead?

  1. Zombie processes persist until the host application exits:
    $ ps aux | grep codex
    codex  12345  0.0  0.0  0  0  ??  Z  <defunct>
    
  2. File descriptors remain open, preventing directory cleanup:
    $ rm -rf /path/to/worktree
    rm: cannot remove '/path/to/worktree': Device or resource busy
  3. Resource leaks accumulate in long-running host applications that manage
    multiple Codex sessions.

Root cause: Without wait(), terminated processes become zombies and retain
their open file descriptors (including the current working directory).

Additional information

This affects any application that:

  • Runs Codex in isolated git worktrees (common in CI/CD and development
    tools)
  • Manages multiple concurrent Codex sessions
  • Needs to clean up directories after sessions complete
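
For host applications like these, one defensive option is to tie reaping to scope exit. The sketch below is only an illustration with `std::process`, not something Codex or the issue proposes. `ReapOnDrop` and `spawn_guarded` are hypothetical names, and the `wait()` in `drop` is unbounded, so it assumes the kill succeeds.

```rust
use std::io;
use std::process::{Child, Command};

/// Hypothetical guard: kills and reaps the wrapped child when dropped.
struct ReapOnDrop(Child);

impl Drop for ReapOnDrop {
    fn drop(&mut self) {
        // Errors are ignored: the child may already have exited.
        let _ = self.0.kill();
        // Reap the child so it does not linger as a zombie.
        // Note: this blocks until the child has actually exited.
        let _ = self.0.wait();
    }
}

/// Spawn `program` with `args` and wrap the child in the guard.
fn spawn_guarded(program: &str, args: &[&str]) -> io::Result<ReapOnDrop> {
    Command::new(program).args(args).spawn().map(ReapOnDrop)
}
```

With this shape, a session that ends early (for example through an early return or a panic that unwinds) still kills and reaps its child before the host tries to delete the worktree.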

Why This Became More Noticeable in v0.44

Commit b8e1fe60c enabled process hardening (PT_DENY_ATTACH) by default.
While this is a good security feature, it may slightly extend process
termination time, making the missing wait() call more apparent. However,
the bug exists regardless of process hardening.

The Fix

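Wait for the child after both start_kill() calls in
codex-rs/core/src/exec.rs. The snippets below bound child.wait() with a
one-second timeout so that a child that is slow to exit cannot block Codex
indefinitely: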

Line 339 (timeout case):

Err(_) => {
    child.start_kill()?;
    let _ = tokio::time::timeout(Duration::from_secs(1), child.wait()).await;
    (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + TIMEOUT_CODE), true)
}

Line 345 (Ctrl+C case):

_ = tokio::signal::ctrl_c() => {
    child.start_kill()?;
    let _ = tokio::time::timeout(Duration::from_secs(1), child.wait()).await;
    (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + SIGKILL_CODE), false)
}

See the tokio documentation for Child::kill, which pairs start_kill() with wait(): https://docs.rs/tokio/latest/tokio/process/struct.Child.html#method.kill
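
The bounded wait in the proposed fix can also be approximated without an async runtime. This is a rough sketch that polls `try_wait()` against a deadline using `std::process`. It is an analogue of `tokio::time::timeout(..., child.wait())`, not the same mechanism. The names `reap_with_deadline` and `kill_then_reap_bounded` are hypothetical, and the 10 ms poll interval is an arbitrary choice.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Poll for the child's exit status until `limit` has elapsed.
/// Returns Ok(None) if the child is still running at the deadline.
fn reap_with_deadline(child: &mut Child, limit: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + limit;
    loop {
        // try_wait() reaps the child if it has exited and never blocks.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}

/// Kill the child, then give it up to one second to be reaped.
fn kill_then_reap_bounded(child: &mut Child) -> io::Result<Option<ExitStatus>> {
    child.kill()?;
    reap_with_deadline(child, Duration::from_secs(1))
}
```

The trade-off is the same as in the tokio version: if the deadline passes first, the caller moves on and the child may still be left unreaped, so the bound protects against hangs at the cost of not guaranteeing cleanup.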

Labels: bug