Missing child.wait() after start_kill() creates zombie processes and prevents directory cleanup

### What version of Codex is running?

0.44.0

### Which model were you using?

gpt-5-codex

### What platform is your computer?

Darwin 24.6.0 arm64 arm

### What steps can reproduce the bug?

When `codex exec` terminates a child process (on timeout or Ctrl+C), it
  calls `start_kill()` but never calls `wait()`. This creates zombie processes
  that hold file descriptors.

  **Location:** `codex-rs/core/src/exec.rs` lines 339 and 345

  ```rust
  Err(_) => {
      // timeout
      child.start_kill()?;
      // Debatable whether `child.wait().await` should be called here. ⚠️
      (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + TIMEOUT_CODE), true)
  }
  ```
  To reproduce:

  1. Run Codex with a timeout that will trigger: `timeout 2s codex exec "sleep 10"`
  2. After the timeout, check for zombie processes: `ps aux | grep codex | grep defunct`
  3. Try to remove the working directory that Codex was running in: `rm -rf /path/to/worktree`.
     This fails with "Directory busy" or "Resource busy".


The issue appears only in 0.44.0 and can be traced to commit `b8e1fe60c`.
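
The zombie check from the reproduction steps can also be scripted. The following is a minimal sketch rather than anything from the Codex codebase: `is_zombie` is a hypothetical helper, and it assumes a `ps` that accepts `-o stat= -p <pid>` (the macOS and Linux procps versions do) and reports zombies with a state starting with `Z`.

```rust
use std::io;
use std::process::Command;

/// Hypothetical helper: asks `ps` for the state column of `pid` and reports
/// whether it starts with `Z` (zombie).
/// Assumption: `ps` supports `-o stat= -p <pid>`.
fn is_zombie(pid: u32) -> io::Result<bool> {
    let pid_arg = pid.to_string();
    let output = Command::new("ps")
        .args(["-o", "stat=", "-p", pid_arg.as_str()])
        .output()?;
    // `ps` prints nothing when the PID no longer exists (already reaped),
    // so an empty state string is treated as "not a zombie".
    let state = String::from_utf8_lossy(&output.stdout);
    Ok(state.trim_start().starts_with('Z'))
}
```

Calling `is_zombie(child.id())` on a killed but unreaped child should eventually return `Ok(true)`; once the parent has called `wait()`, `ps` no longer lists the PID and the helper returns `Ok(false)`.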

### What is the expected behavior?


  After calling `start_kill()`, the parent process should call `wait()` to reap
   the terminated child process. This is standard Unix process management:

  ```rust
  Err(_) => {
      child.start_kill()?;
      let _ = child.wait().await;  // ✅ Reap the zombie
      (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + TIMEOUT_CODE), true)
  }
  ```
  Expected:
  - No zombie processes after timeout/Ctrl+C
  - File descriptors (including CWD) are released immediately
  - Host applications can clean up worktree directories without errors
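
The kill-then-reap pattern can be tried in isolation with the standard library. This sketch deliberately uses `std::process` instead of Tokio so that it compiles without dependencies; `std::process::Child::kill` plays the role of `start_kill()` here, and `kill_and_reap` is a hypothetical name, not a Codex function.

```rust
use std::io;
use std::process::{Child, ExitStatus};

/// Hypothetical helper: kill the child, then block until the OS has
/// reported its exit status.
/// `Child::kill` only sends the signal (SIGKILL on Unix); the `wait` that
/// follows is what collects the status and removes the zombie entry.
fn kill_and_reap(child: &mut Child) -> io::Result<ExitStatus> {
    child.kill()?;
    child.wait()
}
```

On Unix, a child killed this way reports no exit code (`status.code()` is `None`) because it was terminated by a signal.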

### What do you see instead?



  1. **Zombie processes persist** until the host application exits:
     ```bash
     $ ps aux | grep codex
     codex  12345  0.0  0.0  0  0  ??  Z  <defunct>
     ```
  2. **File descriptors remain open**, preventing directory cleanup:
     ```bash
     $ rm -rf /path/to/worktree
     rm: cannot remove '/path/to/worktree': Device or resource busy
     ```
  3. **Resource leaks accumulate** in long-running host applications that manage
     multiple Codex sessions.

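  Root cause: `start_kill()` only sends the kill signal and returns. Without a
  following `wait()`, the parent never reaps the child, so it lingers as a
  zombie, and nothing guarantees that the child has released its open file
  descriptors (including the current working directory) before cleanup starts.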


### Additional information

This affects any application that:
  - Runs Codex in isolated git worktrees (common in CI/CD and development
  tools)
  - Manages multiple concurrent Codex sessions
  - Needs to clean up directories after sessions complete

  ## Why This Became More Noticeable in v0.44

  Commit `b8e1fe60c` enabled process hardening (`PT_DENY_ATTACH`) by default.
  While this is a good security feature, it may slightly extend process
  termination time, making the missing `wait()` call more apparent. However,
  the bug exists regardless of process hardening.

  ## The Fix

  Add `child.wait().await`, wrapped in a one-second `tokio::time::timeout`,
  after both `start_kill()` calls in `codex-rs/core/src/exec.rs`:

  **Line 339 (timeout case):**
  ```rust
  Err(_) => {
      child.start_kill()?;
      let _ = tokio::time::timeout(Duration::from_secs(1), child.wait()).await;
      (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + TIMEOUT_CODE), true)
  }
  ```

  **Line 345 (Ctrl+C case):**
  ```rust
  _ = tokio::signal::ctrl_c() => {
      child.start_kill()?;
      let _ = tokio::time::timeout(Duration::from_secs(1), child.wait()).await;
      (synthetic_exit_status(EXIT_CODE_SIGNAL_BASE + SIGKILL_CODE), false)
  }
  ```

See the Tokio documentation for `Child::kill`: https://docs.rs/tokio/latest/tokio/process/struct.Child.html#method.kill
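
The bounded wait in the proposed fix can be approximated without an async runtime by polling `try_wait()` against a deadline. This is a sketch under stated assumptions, not the Codex implementation: it uses `std::process` rather than Tokio, `kill_and_reap_within` is a hypothetical helper, and the 10 ms poll interval is an arbitrary choice.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: kill `child`, then poll `try_wait` until the child
/// has been reaped or `limit` has elapsed.
/// Returns `Ok(None)` if the child had not exited within `limit`.
fn kill_and_reap_within(
    child: &mut Child,
    limit: Duration,
) -> io::Result<Option<ExitStatus>> {
    child.kill()?;
    let deadline = Instant::now() + limit;
    loop {
        // When `try_wait` returns `Some`, the child has also been reaped.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            return Ok(None);
        }
        // Arbitrary poll interval (assumption, tune as needed).
        thread::sleep(Duration::from_millis(10));
    }
}
```

Returning `Ok(None)` lets the caller decide what to do when the child outlives the limit, for example logging it and continuing with the synthetic exit status.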

Missing child.wait() after start_kill() creates zombie processes and prevents directory cleanup #4726
