bug: Agent-job finalization uses a brittle 250ms grace period that misclassifies successful workers

## Summary

When a batch job worker finishes execution, the finalization logic sleeps for a hardcoded 250ms grace period and then checks whether the worker reported its result. Under load, 250ms can be too short, so workers that completed correctly are marked as failed.

## Root cause

**File:** `codex-rs/core/src/tools/handlers/agent_jobs.rs`, lines 894-939

```rust
async fn finalize_finished_item(...) {
    // line 908: hardcoded grace period
    tokio::time::sleep(Duration::from_millis(250)).await;

    // line 916: re-read the item
    let item = db.get_agent_job_item(...)?;

    // lines 926-929: if result still empty, mark as failed
    if item.result_json.is_none() {
        // "worker finished without calling report_agent_job_result"
        db.update_agent_job_item_status(..., AgentJobItemStatus::Failed, ...)?;
    }
}
```

The flow is:
1. Worker thread reaches a final status (detected by `find_finished_threads()`, line 851).
2. `finalize_finished_item()` sleeps exactly 250ms.
3. It then re-reads the DB row to check whether `result_json` was populated by `report_agent_job_result`.
4. If `result_json` is still empty, it marks the item as **failed**.

This is a timing assumption, not a correctness condition. If the worker's `report_agent_job_result` call takes longer than 250ms to complete (due to load, slow DB writes, or lock contention), a successful worker is permanently misclassified as failed.

## Secondary issue: fixed-interval polling

**File:** `codex-rs/core/src/tools/handlers/agent_jobs.rs`, lines 38, 713, 851-862

The main job loop polls every `STATUS_POLL_INTERVAL = 250ms` (line 38). Each poll iteration calls `find_finished_threads()`, which iterates over all active items and fetches their statuses one by one. With `MAX_AGENT_JOB_CONCURRENCY = 64` workers, this means up to 64 sequential async status lookups every 250ms.

This could be replaced with a watch-channel-based notification system (similar to what the collab multi-agent system already uses in `multi_agents.rs:546`), eliminating the polling entirely.
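To illustrate the event-driven shape, here is a minimal sketch in which workers push a completion event and the main loop blocks until one arrives. It is not the codebase's implementation: it uses `std::sync::mpsc` and OS threads as a std-only stand-in for a tokio watch channel, and `Finished` and `run_job` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical completion event a worker sends when it reaches a final status.
#[derive(Debug, PartialEq)]
struct Finished {
    item_id: usize,
}

/// Spawns `worker_count` simulated workers and returns the item ids in the
/// order their completion events were received.
fn run_job(worker_count: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for item_id in 0..worker_count {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Simulated work of varying length.
            thread::sleep(Duration::from_millis(10 * (item_id as u64 + 1)));
            // The worker announces completion itself; nobody polls for it.
            let _ = tx.send(Finished { item_id });
        }));
    }
    // Drop the original sender so the receive loop ends once every worker
    // has finished and dropped its clone.
    drop(tx);

    let mut finished = Vec::new();
    // Blocks until an event arrives: no periodic wake-ups, no per-item lookups.
    for event in rx {
        finished.push(event.item_id);
    }

    for handle in handles {
        handle.join().unwrap();
    }
    finished
}

fn main() {
    let done = run_job(4);
    println!("finished items: {done:?}");
}
```

The main loop does work only when a worker actually finishes, so the cost scales with the number of completions rather than with the number of active items times the poll rate.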

## Suggested improvements

1. Replace the 250ms sleep with a proper notification mechanism: have `report_agent_job_result` signal a `tokio::sync::Notify` that `finalize_finished_item` awaits (with a reasonable timeout, e.g., 5 seconds).
2. Alternatively, use an exponential backoff with multiple re-checks instead of a single fixed sleep.
3. Replace the fixed-interval status polling in the main loop with watch-channel-based notifications from the worker threads.
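The first suggestion can be sketched as follows. This is a synchronous, std-only illustration that uses `Condvar::wait_timeout_while` in place of `tokio::sync::Notify` with a timeout; `ResultSlot`, `report`, and `wait_for_result` are hypothetical names, not identifiers from `agent_jobs.rs`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for one job item's result slot.
#[derive(Default)]
struct ResultSlot {
    result_json: Mutex<Option<String>>,
    reported: Condvar,
}

impl ResultSlot {
    /// What a `report_agent_job_result`-style call would do: store the
    /// result, then wake anyone waiting for it.
    fn report(&self, json: String) {
        *self.result_json.lock().unwrap() = Some(json);
        self.reported.notify_all();
    }

    /// Waits until a result is present or `timeout` elapses, whichever
    /// comes first. Returns `None` only if no result arrived in time.
    fn wait_for_result(&self, timeout: Duration) -> Option<String> {
        let guard = self.result_json.lock().unwrap();
        let (guard, _timed_out) = self
            .reported
            .wait_timeout_while(guard, timeout, |result| result.is_none())
            .unwrap();
        guard.clone()
    }
}

/// Runs one late-reporting worker and one silent worker.
fn demo() -> (Option<String>, Option<String>) {
    // A worker that reports after 300ms, i.e. later than a 250ms grace period.
    let slot = Arc::new(ResultSlot::default());
    let worker_slot = Arc::clone(&slot);
    let worker = thread::spawn(move || {
        thread::sleep(Duration::from_millis(300));
        worker_slot.report("{\"ok\":true}".to_string());
    });
    let late = slot.wait_for_result(Duration::from_secs(5));
    worker.join().unwrap();

    // A worker that never reports: the wait gives up after the timeout.
    let silent = ResultSlot::default();
    let missing = silent.wait_for_result(Duration::from_millis(50));

    (late, missing)
}

fn main() {
    let (late, missing) = demo();
    println!("late reporter: {late:?}, silent worker: {missing:?}");
}
```

The wait returns as soon as the result is reported, so a generous timeout such as 5 seconds costs nothing in the common case and only delays the failure verdict for workers that never report.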

I have a fix ready and can submit a PR if invited.
