Summary
When a batch job worker finishes execution, the finalization logic uses a hardcoded 250ms sleep as a grace period to check whether the worker reported its result. Under load, this is insufficient, causing correctly-completed workers to be marked as failed.
Root cause
File: codex-rs/core/src/tools/handlers/agent_jobs.rs, lines 894-939
async fn finalize_finished_item(...) {
    // line 908: hardcoded grace period
    tokio::time::sleep(Duration::from_millis(250)).await;

    // line 916: re-read the item
    let item = db.get_agent_job_item(...)?;

    // lines 926-929: if the result is still empty, mark the item as failed
    if item.result_json.is_none() {
        // "worker finished without calling report_agent_job_result"
        db.update_agent_job_item_status(..., AgentJobItemStatus::Failed, ...)?;
    }
}
The flow is:
- Worker thread reaches a final status (detected by find_finished_threads(), line 851).
- finalize_finished_item() sleeps exactly 250ms.
- It re-reads the DB row to check whether result_json was populated by report_agent_job_result.
- If still empty → marks the item as failed.
This is a timing assumption, not a correctness condition. If the worker's report_agent_job_result call takes longer than 250ms to complete (due to load, slow DB writes, or lock contention), a successful worker is permanently misclassified as failed.
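To make the difference concrete, here is a minimal sketch contrasting a single check after a fixed delay with a bounded series of re-checks. It is synchronous and uses only the standard library so it can be run on its own. The helper name recheck_with_backoff and the probe closure are hypothetical and not part of the codex codebase. In the real handler the probe would be the DB re-read and the sleep would be tokio::time::sleep.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the codebase): calls `probe` repeatedly,
/// doubling the delay between attempts, until it yields a value or the
/// overall `budget` has elapsed.
fn recheck_with_backoff<T>(
    mut probe: impl FnMut() -> Option<T>,
    initial_delay: Duration,
    budget: Duration,
) -> Option<T> {
    let deadline = Instant::now() + budget;
    let mut delay = initial_delay;
    loop {
        // Check first, so an already-reported result costs no sleep at all.
        if let Some(value) = probe() {
            return Some(value);
        }
        let now = Instant::now();
        if now >= deadline {
            // Budget exhausted: only now is "failed" a defensible verdict.
            return None;
        }
        // Never sleep past the deadline.
        thread::sleep(delay.min(deadline - now));
        delay = delay.saturating_mul(2);
    }
}

fn main() {
    // Simulated slow reporter: the result only becomes visible on the
    // third read, standing in for a delayed report_agent_job_result write.
    let mut reads = 0;
    let result = recheck_with_backoff(
        || {
            reads += 1;
            if reads >= 3 { Some("result_json") } else { None }
        },
        Duration::from_millis(1),
        Duration::from_millis(500),
    );
    println!("{:?} after {} reads", result, reads);
}
```

A single fixed sleep corresponds to stopping after the first failed read. The loop above only gives up once the whole budget is spent, so a late but successful report is still picked up.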
Secondary issue: fixed-interval polling
File: codex-rs/core/src/tools/handlers/agent_jobs.rs, lines 38, 713, 851-862
The main job loop polls every STATUS_POLL_INTERVAL = 250ms (line 38). Each poll iteration calls find_finished_threads(), which iterates over all active items and fetches their status one by one. With MAX_AGENT_JOB_CONCURRENCY = 64 workers, this means up to 64 sequential async status lookups every 250ms.
This could be replaced with a watch-channel-based notification system (similar to what the collab multi-agent system already uses in multi_agents.rs:546), eliminating the polling entirely.
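As a rough illustration of the event-driven shape, the sketch below has each worker push a completion event that the coordinator blocks on, instead of the coordinator polling every worker. It uses std::sync::mpsc and OS threads as a stand-in so it runs without dependencies. It is not the watch-channel code from multi_agents.rs, and run_workers is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sketch: spawns `n` workers and returns their ids in the
/// order they reported completion. The coordinator never polls; it wakes
/// only when a worker sends an event.
fn run_workers(n: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();

    let handles: Vec<_> = (0..n)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                // ... the worker's actual job would run here ...
                tx.send(id).expect("coordinator dropped the receiver");
            })
        })
        .collect();

    // Drop the coordinator's own sender so the channel closes once every
    // worker has finished and dropped its clone.
    drop(tx);

    // Blocks until the next completion event; ends when all senders are gone.
    let finished: Vec<usize> = rx.iter().collect();

    for handle in handles {
        handle.join().expect("worker panicked");
    }
    finished
}

fn main() {
    let finished = run_workers(4);
    println!("completion order: {:?}", finished);
}
```

With this shape the cost per completion is one wake-up, rather than up to 64 status lookups per poll interval regardless of whether anything changed.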
Suggested improvements
- Replace the 250ms sleep with a proper notification mechanism: have report_agent_job_result signal a tokio::sync::Notify that finalize_finished_item awaits (with a reasonable timeout, e.g., 5 seconds).
- Alternatively, use an exponential backoff with multiple re-checks instead of a single fixed sleep.
- Replace the fixed-interval status polling in the main loop with watch-channel-based notifications from the worker threads.
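The first suggestion can be sketched as follows. This is a standard-library approximation that uses Mutex plus Condvar in place of tokio::sync::Notify, so it compiles without tokio. ResultSlot, report, and wait_for_result are hypothetical names for illustration. An async version would wrap notified() in tokio::time::timeout instead.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for one job item's result plus its notification.
#[derive(Default)]
struct ResultSlot {
    result: Mutex<Option<String>>,
    reported: Condvar,
}

impl ResultSlot {
    /// Reporting side: store the result, then wake any waiting finalizer.
    fn report(&self, value: String) {
        *self.result.lock().unwrap() = Some(value);
        self.reported.notify_all();
    }

    /// Finalizing side: return as soon as a result exists, or None once
    /// `timeout` has elapsed without one.
    fn wait_for_result(&self, timeout: Duration) -> Option<String> {
        let guard = self.result.lock().unwrap();
        // The predicate is checked before waiting, so a result reported
        // earlier is returned immediately and spurious wake-ups are ignored.
        let (guard, _timeout_result) = self
            .reported
            .wait_timeout_while(guard, timeout, |result| result.is_none())
            .unwrap();
        guard.clone()
    }
}

fn main() {
    let slot = Arc::new(ResultSlot::default());

    let reporter = {
        let slot = Arc::clone(&slot);
        thread::spawn(move || {
            // Simulate a report that lands late.
            thread::sleep(Duration::from_millis(20));
            slot.report("{\"ok\":true}".to_string());
        })
    };

    // The timeout is an upper bound, not a fixed wait: this returns right
    // after the report arrives.
    let result = slot.wait_for_result(Duration::from_secs(5));
    reporter.join().unwrap();
    println!("{:?}", result);
}
```

The timeout here only bounds the worst case. A worker that reports promptly is finalized promptly, and one that reports late is still classified correctly as long as it reports within the bound.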
I have a fix ready and can submit a PR if invited.