
fix(acp): propagate follow-up prompt failures as job errors #1981

Merged
serrrfirat merged 3 commits into nearai:staging from rajulbhatnagar:worktree-acp-fixes on Apr 7, 2026

Conversation

@rajulbhatnagar
Contributor

@rajulbhatnagar rajulbhatnagar commented Apr 3, 2026

Summary

  • Bug fix (ACP bridge follow-up prompt failures should mark sandbox job failed #1915): The ACP follow-up loop silently swallowed prompt failures — it logged and posted a status event but continued the loop, so jobs always reported success: true even when follow-up prompts failed.
  • Terminal event fix: Per-turn ACP results were emitting event_type: "result" which maps to AppEvent::JobResult (the terminal signal). Job monitors exit on the first JobResult, so all subsequent follow-up outcomes were invisible. Changed per-turn events to "turn_result" (non-terminal) and emit exactly one terminal "result" from run() after the entire session ends.
  • Poll retry limits: All poll_prompt() errors were retried infinitely. Now permanent errors (OrchestratorRejected, LlmProxyFailed) fail immediately, and transient errors (ConnectionFailed) retry with a cap of 5 (consistent with LLM retry patterns).
  • Child process cleanup: Kill child process on protocol failure so stderr reader completes and the job doesn't hang. Race poll_prompt against child exit via tokio::select!.
  • Refactor: Extract the inline loop into a standalone run_follow_up_loop function with trait-based dependency injection (FollowUpPromptSource, AcpPromptSender, AcpEventSink), enabling unit testing without Docker/ACP infrastructure.
  • Logging fix: Downgrade info! to debug! in the follow-up loop, since it runs in background Docker containers and the info!-level logs were corrupting REPL/TUI output.
  • Add PartialEq + Clone to JobEventPayload, narrow AcpPromptSender visibility to private.
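The retry-discrimination rule in the bullets above can be sketched as follows. This is a simplified, synchronous reduction: `handle_poll_error` and the `PollError` enum shape are hypothetical stand-ins, while the variant names and the cap of 5 come from the PR description:

```rust
// Hypothetical stand-ins for the worker's poll error variants (names from the PR).
#[derive(Debug)]
enum PollError {
    OrchestratorRejected, // permanent: fail the job immediately
    LlmProxyFailed,       // permanent
    ConnectionFailed,     // transient: retry with a cap
}

const MAX_TRANSIENT_RETRIES: u32 = 5;

/// Returns Err to fail the job, Ok(updated_failure_count) to keep looping.
fn handle_poll_error(err: PollError, consecutive_failures: u32) -> Result<u32, String> {
    match err {
        PollError::OrchestratorRejected | PollError::LlmProxyFailed => {
            Err("permanent poll error: failing job immediately".to_string())
        }
        PollError::ConnectionFailed => {
            let failures = consecutive_failures + 1;
            if failures >= MAX_TRANSIENT_RETRIES {
                Err(format!("transient poll error retried {failures} times; giving up"))
            } else {
                Ok(failures) // caller sleeps/backs off, then polls again
            }
        }
    }
}

fn main() {
    assert!(handle_poll_error(PollError::OrchestratorRejected, 0).is_err());
    assert_eq!(handle_poll_error(PollError::ConnectionFailed, 0), Ok(1));
    assert!(handle_poll_error(PollError::ConnectionFailed, 4).is_err());
    println!("retry discrimination behaves as described");
}
```

A successful poll would reset the failure counter to zero, which is what makes the cap a bound on *consecutive* transient failures.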

Known follow-up

report_complete() does not broadcast AppEvent::JobResult — it only updates container state. If the fire-and-forget post_event("result") fails silently, the job monitor never learns the job ended and the max_jobs slot stays consumed. This is a pre-existing gap affecting all three bridge types (ACP, Claude, Container). Fix: have the orchestrator's report_complete handler also broadcast JobResult. Tracked separately since it touches shared orchestrator infrastructure.

Test plan

  • follow_up_success_emits_turn_event_not_terminal — per-turn events use "turn_result", not "result"
  • follow_up_prompt_failure_returns_error — follow-up failure propagates as Err
  • follow_up_loop_fails_on_permanent_poll_error — OrchestratorRejected fails immediately
  • follow_up_loop_exhausts_transient_retries — 5 consecutive ConnectionFailed → Err
  • follow_up_loop_recovers_from_transient_poll_error — transient error then success → Ok
  • follow_up_loop_exits_during_long_poll_when_child_dies — child exit detected during blocked poll
  • child_monitor_kills_process_so_stderr_reader_completes — kill channel unblocks stderr
  • cargo clippy zero warnings, cargo fmt clean
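As a rough illustration of how the stub-based tests above can run without Docker/ACP infrastructure, here is a dependency-free, synchronous reduction of the pattern. The trait name FollowUpPromptSource comes from the PR; ScriptedSource and this loop body are hypothetical, and Ok(None) is repurposed here as "session ended" (in the real code it means "no prompt yet, back off") to keep the sketch synchronous:

```rust
// Synchronous reduction of the PR's trait-based dependency injection;
// the real loop is async and drives a live ACP session.
trait FollowUpPromptSource {
    fn poll_prompt(&mut self) -> Result<Option<String>, String>;
}

/// Test stub that replays a scripted sequence of poll outcomes.
struct ScriptedSource {
    script: Vec<Result<Option<String>, String>>,
}

impl FollowUpPromptSource for ScriptedSource {
    fn poll_prompt(&mut self) -> Result<Option<String>, String> {
        if self.script.is_empty() {
            Ok(None) // script exhausted: treat as session end
        } else {
            self.script.remove(0)
        }
    }
}

/// Handles prompts until the source signals session end (Ok(None)).
/// Poll errors propagate as Err instead of being swallowed.
fn run_follow_up_loop(source: &mut dyn FollowUpPromptSource) -> Result<Vec<String>, String> {
    let mut handled = Vec::new();
    loop {
        match source.poll_prompt()? {
            Some(prompt) => handled.push(prompt),
            None => return Ok(handled),
        }
    }
}

fn main() {
    let mut failing = ScriptedSource {
        script: vec![Ok(Some("follow-up 1".into())), Err("prompt failed".into())],
    };
    // The failure is propagated, so run() can report success: false.
    assert!(run_follow_up_loop(&mut failing).is_err());
    println!("follow-up failure propagated as Err");
}
```

The `?` on `poll_prompt()` is the one-character version of the whole bug fix: the old loop logged the error and continued, so the Err never reached the caller.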

@github-actions github-actions bot added labels on Apr 3, 2026: scope: worker (Container worker), size: L (200-499 changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: experienced (6-19 merged PRs)
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the ACP bridge follow-up loop into a dedicated run_follow_up_loop function and introduces the FollowUpPromptSource and AcpPromptSender traits to improve modularity and testability. It also includes new unit tests with stubs to verify error propagation and recovery logic. Feedback was provided regarding the concurrency model of the loop, suggesting that prompt polling and the child exit signal should be monitored simultaneously using tokio::select! to ensure the worker responds immediately if the agent process terminates during a poll.

Comment thread src/worker/acp_bridge.rs Outdated
) -> Result<(), WorkerError> {
let session_id_str = session_id.to_string();
loop {
let backoff = match prompt_source.poll_prompt().await {
Medium Severity

The follow-up loop currently polls the orchestrator sequentially. If poll_prompt is a long-poll or if the orchestrator returns prompts quickly, the loop might delay responding to the agent process exiting (monitored via child_exit_rx). Consider using tokio::select! to poll for prompts and watch for the child exit signal simultaneously to ensure the worker shuts down immediately if the agent dies.
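The fix that eventually landed uses tokio::select! for exactly this race. A dependency-free analogue of the idea, using std threads and a shared channel (all names here are illustrative, not the worker's API), shows why whichever event completes first must win:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum LoopEvent {
    Prompt(String),
    ChildExited,
}

/// Races a (simulated) long poll against a (simulated) child exit; whichever
/// completes first wins, mirroring what tokio::select! does for the futures.
fn race(poll_delay: Duration, child_delay: Duration) -> LoopEvent {
    let (tx, rx) = mpsc::channel();
    let poll_tx = tx.clone();
    thread::spawn(move || {
        thread::sleep(poll_delay); // stands in for a blocked poll_prompt()
        let _ = poll_tx.send(LoopEvent::Prompt("follow-up".into()));
    });
    thread::spawn(move || {
        thread::sleep(child_delay); // stands in for child_exit_rx firing
        let _ = tx.send(LoopEvent::ChildExited);
    });
    rx.recv().expect("one racer always sends")
}

fn main() {
    // Child dies while the poll is still pending: the loop sees it immediately.
    let winner = race(Duration::from_millis(300), Duration::from_millis(10));
    assert_eq!(winner, LoopEvent::ChildExited);
    println!("child exit observed even while the poll was still blocked");
}
```

Without the race, the sequential version would sit inside the blocked poll and only notice the dead child on the next loop iteration, if ever.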

Collaborator

@serrrfirat serrrfirat left a comment


Posting the requested high-severity finding from the review.

Comment thread src/worker/acp_bridge.rs
data: json!({ "message": msg }),
})
.await;
return Err(WorkerError::ExecutionFailed { reason: msg });
High Severity

Early error propagation here can still leave the ACP job stuck instead of failed.

This new return Err(...) fixes the logical success/failure bug, but the process lifecycle is now split: the Child was moved into a background wait() task earlier, while run_acp_session() still unconditionally awaits the stderr reader after the LocalSet returns. If conn.prompt() fails at the protocol layer without the subprocess exiting immediately, this branch returns out of run_follow_up_loop(), then run_acp_session() can block waiting for stderr and never reach report_complete(success = false).

Please explicitly terminate the child on this error path, or keep an owned handle that can be killed during cleanup before awaiting stderr. A regression test that simulates prompt() failing while the subprocess stays alive would pin this down.
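The kill-channel approach the follow-up commit adopts can be illustrated with std threads and a real subprocess. spawn_child_monitor here is a loose stand-in for the PR's function of the same name; the channel shape and the `sleep` child are assumptions made for the sketch:

```rust
use std::process::{Child, Command, ExitStatus, Stdio};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Monitor thread owns the Child; a message (or drop) on the kill channel
/// terminates the process, so readers of its pipes (e.g. a stderr task)
/// unblock and complete instead of hanging the job.
fn spawn_child_monitor(mut child: Child) -> (Sender<()>, JoinHandle<ExitStatus>) {
    let (kill_tx, kill_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        let _ = kill_rx.recv(); // blocks until an error path fires the kill
        let _ = child.kill();   // no-op if the child already exited
        child.wait().expect("wait on child")
    });
    (kill_tx, handle)
}

fn main() {
    // Long-running subprocess standing in for the ACP agent.
    let child = Command::new("sleep")
        .arg("30")
        .stdout(Stdio::null())
        .spawn()
        .expect("spawn sleep");

    let (kill_tx, monitor) = spawn_child_monitor(child);

    // Simulated protocol failure: send the kill before awaiting cleanup,
    // so the pipe closes and the job fails promptly instead of hanging.
    kill_tx.send(()).expect("monitor alive");
    let status = monitor.join().expect("monitor join");
    assert!(!status.success()); // killed, not a clean exit
    println!("child terminated; cleanup can complete");
}
```

The key property is ownership: only the monitor holds the Child, so the error path never races another task for the handle; it just signals the channel.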

Contributor Author


Done

rajulbhatnagar pushed a commit to rajulbhatnagar/ironclaw that referenced this pull request Apr 4, 2026
Address two review findings from nearai#1981:

1. (gemini-code-assist) poll_prompt() blocked the follow-up loop — child
   exit wasn't detected during active HTTP polling. Wrap poll_prompt() in
   tokio::select! alongside child_exit_rx.

2. (serrrfirat, high severity) Protocol failure could leave the child
   alive, causing stderr_handle.await to block forever and the job to
   hang. Add a kill channel: spawn_child_monitor() returns (child_exit_rx,
   kill_tx); on error, run_acp_session sends the kill signal before
   awaiting stderr so the pipe closes and cleanup completes.

Extract spawn_child_monitor() as a standalone function so both production
code and the regression test exercise the same code path.

Tests:
- follow_up_loop_exits_during_long_poll_when_child_dies (ForeverPromptSource
  + timeout guard — hangs without the select! fix)
- child_monitor_kills_process_so_stderr_reader_completes (real sleep
  subprocess + kill signal — hangs without the kill channel)
@github-actions github-actions bot added size: XL (500+ changed lines) and removed size: L (200-499 changed lines) labels on Apr 4, 2026
Comment thread src/worker/acp_bridge.rs Outdated
}
Ok(None) => Duration::from_secs(2),
Err(e) => {
tracing::warn!(job_id = %job_id, "Prompt polling error: {}", e);

Medium Severity

All poll_prompt() errors take the same infinite 5s retry path here.

poll_prompt() also surfaces permanent failures such as OrchestratorRejected and response-parse errors, so if the prompt endpoint is broken while the ACP subprocess stays alive, this loop can keep the container alive indefinitely instead of failing the job.

Please distinguish retryable transport errors from permanent endpoint/protocol failures, or cap consecutive retries and convert exhaustion into ExecutionFailed plus a terminal error result. A regression test for repeated OrchestratorRejected or malformed prompt responses would cover this.

Comment thread src/worker/acp_bridge.rs
match follow_result {
Ok(resp) => {
let payload = stop_reason_to_result(&resp.stop_reason, &session_id_str);
sink.emit_event(&payload).await;

High Severity

This emits a result event for each successful follow-up turn, but on failure the new code only emits a status event before returning Err.

AppEvent::JobResult is the terminal signal that background job monitors use, and they ignore status events. Those monitors are also spawned for non-wait sandbox jobs, so an ACP job can still be observed as completed after an earlier follow-up succeeds while a later follow-up failure never propagates through the normal terminal event path.

Please emit exactly one terminal result for the overall ACP session, including an error result before returning on follow-up failure, and use a non-terminal event type for per-turn stop reasons. An integration test for a non-wait ACP job that fails on a later follow-up turn would lock this down.
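The terminal/non-terminal split this comment asks for comes down to a dispatch like the following. AppEvent::JobResult, JobStatus, and the "turn_result" event type are named in the thread; map_event_type itself is a hypothetical sketch of the orchestrator-side mapping:

```rust
#[derive(Debug, PartialEq)]
enum AppEvent {
    JobResult, // terminal: job monitors exit on the first one they see
    JobStatus, // non-terminal: monitors keep listening
}

/// Only the single end-of-session "result" event becomes the terminal
/// JobResult; per-turn "turn_result" events fall through to JobStatus,
/// so monitors survive every intermediate follow-up turn.
fn map_event_type(event_type: &str) -> AppEvent {
    match event_type {
        "result" => AppEvent::JobResult,
        _ => AppEvent::JobStatus, // includes "turn_result" and "status"
    }
}

fn main() {
    assert_eq!(map_event_type("turn_result"), AppEvent::JobStatus);
    assert_eq!(map_event_type("result"), AppEvent::JobResult);
    println!("per-turn events are non-terminal; one terminal result per session");
}
```

With this mapping, a follow-up failure on turn N still reaches the monitor, because nothing before the session-final "result" can terminate it.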

Rajul Bhatnagar added 3 commits April 5, 2026 05:51

The follow-up loop silently swallowed ACP prompt failures — logging and
posting a status event but continuing the loop, so the job always
reported success: true. Extract the loop into a testable
`run_follow_up_loop` function with trait-based dependency injection and
return Err on failure so run() reports success: false.

Also: downgrade follow-up loop info! logs to debug! (background task),
add PartialEq to JobEventPayload, narrow AcpPromptSender to private.
Address two review findings from nearai#1981.
Per-turn ACP results were emitting event_type "result" which maps to
AppEvent::JobResult (the terminal signal). This caused job monitors to
exit after the first prompt, making all subsequent follow-up outcomes
invisible. Changed per-turn events to "turn_result" and emit exactly
one terminal "result" from run() after the session ends.

Also added retry discrimination for poll_prompt errors: permanent
errors (OrchestratorRejected, LlmProxyFailed) fail immediately,
transient errors (ConnectionFailed) retry with a cap of 5.
@rajulbhatnagar
Contributor Author

Addressed all review feedback:

@serrrfirat (line 471, high severity): Per-turn "result" events were prematurely terminating job monitors. Fixed by changing per-turn events to "turn_result" (non-terminal, falls through to AppEvent::JobStatus in the orchestrator) and emitting exactly one terminal "result" from run() after the entire session ends.

@serrrfirat (line 488, medium severity): poll_prompt() errors now have retry discrimination — OrchestratorRejected and LlmProxyFailed fail immediately (permanent), ConnectionFailed retries with a cap of 5 (~25s), consistent with llm/retry.rs patterns.

Known follow-up (not in scope): report_complete() does not broadcast AppEvent::JobResult — if the fire-and-forget post_event("result") fails silently, the monitor never learns the job ended. This is a pre-existing gap affecting all three bridge types. Fix: have the orchestrator's report_complete handler also broadcast JobResult. Will track separately since it touches shared orchestrator infrastructure.

Comment thread src/worker/acp_bridge.rs
tracing::error!(job_id = %job_id, "{}", msg);
return Err(WorkerError::ExecutionFailed { reason: msg });
}
tracing::warn!(

Medium Severity

This retry path still uses warn! in a background ACP loop.

Per the repo logging guidance, background worker paths should avoid info! and warn! because they surface in the REPL/TUI and corrupt the interactive display. A transient poll_prompt() failure is explicitly retried here, so these warnings can leak noisy infrastructure blips into the user-facing terminal even when the loop recovers normally.

Please downgrade this to debug! and keep error! for permanent failures or retry exhaustion.

Comment thread src/worker/acp_bridge.rs
}
};
self.client
.post_event(&JobEventPayload {

High Severity

Terminal ACP completion is still not reliably delivered.

This PR moved ACP to emit a single result event here, which fixes the earlier multi-terminal-event problem. However, post_event() is still fire-and-forget and only logs failures. If this POST is dropped or rejected, report_complete() still succeeds, but the orchestrator completion path does not broadcast AppEvent::JobResult; it only updates container state and cleans up. That means job monitors can miss the terminal signal entirely and keep the max_jobs slot occupied even though the worker finished.

Please make terminal completion delivery authoritative in one place: either have report_complete() also broadcast JobResult, or make terminal event delivery acknowledged/retried instead of best-effort.

Contributor Author


I did note this in the PR description. It actually affects all 3 codepaths (Worker, Claude Code, and ACP). Happy to tackle it here by updating report_complete.

Collaborator

@serrrfirat serrrfirat left a comment


Reviewed as paranoid architect. Well-structured fix for real bugs (silent follow-up failure swallowing, premature terminal events, infinite retries). Clean trait-based DI with thorough test coverage. No blockers found — minor findings (child exit code not checked, send_prompt not raced against child exit) are pre-existing and fit naturally into the acknowledged follow-up work.

@serrrfirat serrrfirat merged commit 13774cc into nearai:staging Apr 7, 2026
14 checks passed
rajulbhatnagar pushed a commit to rajulbhatnagar/ironclaw that referenced this pull request Apr 7, 2026
…earai#1981 follow-up)

Follow-up to nearai#1981 (propagate follow-up prompt failures as job errors).

- Add safety-net JobResult broadcast in report_complete so job monitors
  free the max_jobs slot even when the worker's post_event is lost
- Extract broadcast_app_event helper (shared by job_event_handler and
  report_complete) with debug logging on DB fallback errors
- Extend idempotent self-transitions to all terminal states (Failed,
  Accepted, Cancelled) and block cross-terminal transitions
- Add duplicate-event test proving completion watcher handles back-to-back
  JobResult events without error
- Demote transient poll error log from warn to debug in acp_bridge
