bug: Agent-job finalization uses a brittle 250ms grace period that misclassifies successful workers #13948

@William17738

Summary

When a batch job worker finishes execution, the finalization logic uses a hardcoded 250ms sleep as a grace period to check whether the worker reported its result. Under load, this is insufficient, causing correctly-completed workers to be marked as failed.

Root cause

File: codex-rs/core/src/tools/handlers/agent_jobs.rs, lines 894-939

async fn finalize_finished_item(...) {
    // line 908: hardcoded grace period
    tokio::time::sleep(Duration::from_millis(250)).await;

    // line 916: re-read the item
    let item = db.get_agent_job_item(...)?;

    // line 926-929: if result still empty, mark as failed
    if item.result_json.is_none() {
        // "worker finished without calling report_agent_job_result"
        db.update_agent_job_item_status(..., AgentJobItemStatus::Failed, ...)?;
    }
}

The flow is:

  1. Worker thread reaches a final status (detected by find_finished_threads(), line 851).
  2. finalize_finished_item() sleeps exactly 250ms.
  3. Re-reads the DB row to check if result_json was populated by report_agent_job_result.
  4. If still empty → marks the item as failed.

This is a timing assumption, not a correctness condition. If the worker's report_agent_job_result call takes longer than 250ms to complete (due to load, slow DB writes, or lock contention), a successful worker is permanently misclassified as failed.
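The race can be reproduced in isolation. The following is a minimal sketch using only the Rust standard library, not the actual Codex code: `ResultSlot`, `GraceOutcome`, `finalize_with_fixed_grace`, and `simulate_fixed_grace` are hypothetical names, and a `Mutex<Option<String>>` stands in for the `result_json` column.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum GraceOutcome {
    Completed,
    Failed,
}

/// Hypothetical stand-in for the DB row's `result_json` column.
type ResultSlot = Arc<Mutex<Option<String>>>;

/// Mimics the fixed-grace finalizer: sleep once, read once, decide.
fn finalize_with_fixed_grace(slot: &ResultSlot, grace: Duration) -> GraceOutcome {
    thread::sleep(grace);
    if slot.lock().unwrap().is_some() {
        GraceOutcome::Completed
    } else {
        // The worker did succeed; its report simply has not landed yet.
        GraceOutcome::Failed
    }
}

/// Runs one simulated worker whose report lands after `report_latency`,
/// then finalizes it with the given grace period.
fn simulate_fixed_grace(report_latency: Duration, grace: Duration) -> GraceOutcome {
    let slot: ResultSlot = Arc::new(Mutex::new(None));
    let writer = Arc::clone(&slot);
    let worker = thread::spawn(move || {
        thread::sleep(report_latency);
        *writer.lock().unwrap() = Some("{\"ok\":true}".to_string());
    });
    let status = finalize_with_fixed_grace(&slot, grace);
    worker.join().unwrap();
    status
}
```

In this model the same successful worker is classified differently depending only on whether its report latency exceeds the grace period, which is the behavior described above.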

Secondary issue: fixed-interval polling

File: codex-rs/core/src/tools/handlers/agent_jobs.rs, lines 38, 713, 851-862

The main job loop polls every STATUS_POLL_INTERVAL = 250ms (line 38). Each poll iteration calls find_finished_threads(), which iterates over all active items and fetches their status one by one. With MAX_AGENT_JOB_CONCURRENCY = 64 workers, this means up to 64 sequential async status lookups every 250ms, or roughly 256 lookups per second, even when nothing has changed.

This could be replaced with a watch-channel-based notification system (similar to what the collab multi-agent system already uses in multi_agents.rs:546), eliminating the polling entirely.
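As a rough illustration of push-based completion, here is a sketch in which each worker announces that it has finished and the coordinator blocks on a channel instead of polling. It uses `std::sync::mpsc` as a dependency-free stand-in for the watch channel mentioned above; the `Finished` event and `run_job` function are hypothetical and do not appear in the Codex source.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical completion event; real code would carry a thread or item ID.
#[derive(Debug)]
struct Finished {
    item_id: usize,
}

/// Spawns `workers` simulated workers and returns the sorted IDs of the
/// items that reported completion. The coordinator blocks on the channel
/// rather than looking up each worker's status on a timer.
fn run_job(workers: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel::<Finished>();
    let mut handles = Vec::new();
    for item_id in 0..workers {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Simulated work.
            thread::sleep(Duration::from_millis(10));
            // Push the completion event to the coordinator.
            tx.send(Finished { item_id }).unwrap();
        }));
    }
    // Drop the original sender so the receiver ends once all workers are done.
    drop(tx);

    // Each event arrives as soon as its worker finishes; no polling interval.
    let mut finished: Vec<usize> = rx.iter().map(|event| event.item_id).collect();
    for handle in handles {
        handle.join().unwrap();
    }
    finished.sort_unstable();
    finished
}
```

The coordinator does work proportional to the number of completions rather than to the number of active workers multiplied by the polling rate.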

Suggested improvements

  1. Replace the 250ms sleep with a proper notification mechanism: have report_agent_job_result signal a tokio::sync::Notify that finalize_finished_item awaits (with a reasonable timeout, e.g., 5 seconds).
  2. Alternatively, use an exponential backoff with multiple re-checks instead of a single fixed sleep.
  3. Replace the fixed-interval status polling in the main loop with watch-channel-based notifications from the worker threads.
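The first suggestion could look roughly like the sketch below. It is not the proposed patch: it uses `std::sync::Condvar` as a standard-library stand-in for `tokio::sync::Notify`, and `ResultCell`, `report`, `finalize`, and `simulate_notified` are hypothetical names chosen for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum ItemStatus {
    Completed,
    Failed,
}

/// Hypothetical shared cell: the reported result plus a condition variable
/// that is signalled when the result is stored.
#[derive(Default)]
struct ResultCell {
    result: Mutex<Option<String>>,
    reported: Condvar,
}

impl ResultCell {
    /// Stand-in for the report path: store the result, then wake the finalizer.
    fn report(&self, json: &str) {
        *self.result.lock().unwrap() = Some(json.to_string());
        self.reported.notify_all();
    }

    /// Waits until a result is reported or `timeout` elapses, whichever is first.
    fn finalize(&self, timeout: Duration) -> ItemStatus {
        let guard = self.result.lock().unwrap();
        let (guard, _timed_out) = self
            .reported
            .wait_timeout_while(guard, timeout, |result| result.is_none())
            .unwrap();
        if guard.is_some() {
            ItemStatus::Completed
        } else {
            ItemStatus::Failed
        }
    }
}

/// Simulates a worker whose report lands after `report_latency`.
fn simulate_notified(report_latency: Duration, timeout: Duration) -> ItemStatus {
    let cell = Arc::new(ResultCell::default());
    let writer = Arc::clone(&cell);
    let worker = thread::spawn(move || {
        thread::sleep(report_latency);
        writer.report("{\"ok\":true}");
    });
    let status = cell.finalize(timeout);
    worker.join().unwrap();
    status
}
```

With this shape the timeout is only an upper bound: a report that arrives after 300ms is still accepted under a 5 second limit, and the finalizer returns as soon as the report lands instead of always sleeping for the full period.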

I have a fix ready and can submit a PR if invited.

Labels: CLI, bug