fix(acp): propagate follow-up prompt failures as job errors #1981
serrrfirat merged 3 commits into nearai:staging from
Conversation
Code Review
This pull request refactors the ACP bridge follow-up loop into a dedicated run_follow_up_loop function and introduces the FollowUpPromptSource and AcpPromptSender traits to improve modularity and testability. It also includes new unit tests with stubs to verify error propagation and recovery logic. Feedback was provided regarding the concurrency model of the loop, suggesting that prompt polling and the child exit signal should be monitored simultaneously using tokio::select! to ensure the worker responds immediately if the agent process terminates during a poll.
    ) -> Result<(), WorkerError> {
        let session_id_str = session_id.to_string();
        loop {
            let backoff = match prompt_source.poll_prompt().await {
The follow-up loop currently polls the orchestrator sequentially. If poll_prompt is a long-poll or if the orchestrator returns prompts quickly, the loop might delay responding to the agent process exiting (monitored via child_exit_rx). Consider using tokio::select! to poll for prompts and watch for the child exit signal simultaneously to ensure the worker shuts down immediately if the agent dies.
serrrfirat left a comment:
Posting the requested high-severity finding from the review.
            data: json!({ "message": msg }),
        })
        .await;
    return Err(WorkerError::ExecutionFailed { reason: msg });
High Severity: Early error propagation here can still leave the ACP job stuck instead of failed.
This new return Err(...) fixes the logical success/failure bug, but the process lifecycle is now split: the Child was moved into a background wait() task earlier, while run_acp_session() still unconditionally awaits the stderr reader after the LocalSet returns. If conn.prompt() fails at the protocol layer without the subprocess exiting immediately, this branch returns out of run_follow_up_loop(), then run_acp_session() can block waiting for stderr and never reach report_complete(success = false).
Please explicitly terminate the child on this error path, or keep an owned handle that can be killed during cleanup before awaiting stderr. A regression test that simulates prompt() failing while the subprocess stays alive would pin this down.
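A minimal std-only sketch of the cleanup ordering this comment asks for: kill the child on the error path before waiting on the stderr reader, so the pipe closes and the reader finishes instead of blocking forever. The `sleep 3600` child and the helper name are illustrative, not the PR's actual `spawn_child_monitor`.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;

fn kill_then_drain_stderr() -> std::io::Result<String> {
    let mut child = Command::new("sleep")
        .arg("3600")
        .stderr(Stdio::piped())
        .spawn()?;
    let mut stderr = child.stderr.take().expect("piped stderr");

    // The reader blocks until the child's stderr pipe closes.
    let reader = thread::spawn(move || {
        let mut buf = String::new();
        let _ = stderr.read_to_string(&mut buf);
        buf
    });

    // Simulate a protocol-layer prompt failure while the subprocess is still
    // alive: kill the child explicitly so the pipe closes and cleanup completes.
    child.kill()?;
    child.wait()?;
    Ok(reader.join().expect("stderr reader panicked"))
}

fn main() {
    let captured = kill_then_drain_stderr().expect("demo failed");
    assert!(captured.is_empty());
    println!("stderr reader completed after kill");
}
```

If the kill step is skipped, `reader.join()` hangs for the lifetime of the child, which is exactly the stuck-job symptom described above.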
Address two review findings from nearai#1981:

1. (gemini-code-assist) poll_prompt() blocked the follow-up loop — child exit wasn't detected during active HTTP polling. Wrap poll_prompt() in tokio::select! alongside child_exit_rx.
2. (serrrfirat, high severity) Protocol failure could leave the child alive, causing stderr_handle.await to block forever and the job to hang. Add a kill channel: spawn_child_monitor() returns (child_exit_rx, kill_tx); on error, run_acp_session sends the kill signal before awaiting stderr so the pipe closes and cleanup completes.

Extract spawn_child_monitor() as a standalone function so both production code and the regression test exercise the same code path.

Tests:
- follow_up_loop_exits_during_long_poll_when_child_dies (ForeverPromptSource + timeout guard — hangs without the select! fix)
- child_monitor_kills_process_so_stderr_reader_completes (real sleep subprocess + kill signal — hangs without the kill channel)
    }
    Ok(None) => Duration::from_secs(2),
    Err(e) => {
        tracing::warn!(job_id = %job_id, "Prompt polling error: {}", e);
Medium Severity: All poll_prompt() errors take the same infinite 5s retry path here.
poll_prompt() also surfaces permanent failures such as OrchestratorRejected and response-parse errors, so if the prompt endpoint is broken while the ACP subprocess stays alive, this loop can keep the container alive indefinitely instead of failing the job.
Please distinguish retryable transport errors from permanent endpoint/protocol failures, or cap consecutive retries and convert exhaustion into ExecutionFailed plus a terminal error result. A regression test for repeated OrchestratorRejected or malformed prompt responses would cover this.
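A hedged sketch of the suggested retry policy: permanent poll errors fail immediately, transient ones retry up to a cap, and a success resets the counter. The error variants mirror names from this discussion (`OrchestratorRejected`, `LlmProxyFailed`, `ConnectionFailed`); the function shape is an assumption, not the PR's implementation.

```rust
#[derive(Debug, Clone)]
enum PollError {
    OrchestratorRejected, // permanent: the endpoint rejected us
    LlmProxyFailed,       // permanent: broken upstream
    ConnectionFailed,     // transient: worth retrying
}

impl PollError {
    fn is_transient(&self) -> bool {
        matches!(self, PollError::ConnectionFailed)
    }
}

const MAX_TRANSIENT_RETRIES: u32 = 5;

/// Walks a sequence of poll outcomes: Err on the first permanent failure or
/// on retry exhaustion; the consecutive-failure counter resets on success.
fn drive_polls(outcomes: &[Result<(), PollError>]) -> Result<(), String> {
    let mut consecutive = 0u32;
    for outcome in outcomes {
        match outcome {
            Ok(()) => consecutive = 0,
            Err(e) if e.is_transient() => {
                consecutive += 1;
                if consecutive >= MAX_TRANSIENT_RETRIES {
                    return Err(format!("transient retries exhausted: {e:?}"));
                }
            }
            Err(e) => return Err(format!("permanent poll failure: {e:?}")),
        }
    }
    Ok(())
}

fn main() {
    // Permanent errors fail on the first occurrence.
    assert!(drive_polls(&[Err(PollError::OrchestratorRejected)]).is_err());
    assert!(drive_polls(&[Err(PollError::LlmProxyFailed)]).is_err());
    // A success between transient errors resets the retry budget.
    let mixed = [
        Err(PollError::ConnectionFailed),
        Ok(()),
        Err(PollError::ConnectionFailed),
    ];
    assert!(drive_polls(&mixed).is_ok());
    // Five consecutive transient failures exhaust the cap.
    let five = vec![Err::<(), _>(PollError::ConnectionFailed); 5];
    assert!(drive_polls(&five).is_err());
    println!("retry policy behaves as described");
}
```

The key design point is the reset-on-success: a flaky network should not accumulate toward exhaustion across otherwise healthy turns.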
    match follow_result {
        Ok(resp) => {
            let payload = stop_reason_to_result(&resp.stop_reason, &session_id_str);
            sink.emit_event(&payload).await;
High Severity: This emits a result event for each successful follow-up turn, but on failure the new code only emits a status event before returning Err.
AppEvent::JobResult is the terminal signal that background job monitors use, and they ignore status events. Those monitors are also spawned for non-wait sandbox jobs, so an ACP job can still be observed as completed after an earlier follow-up succeeds while a later follow-up failure never propagates through the normal terminal event path.
Please emit exactly one terminal result for the overall ACP session, including an error result before returning on follow-up failure, and use a non-terminal event type for per-turn stop reasons. An integration test for a non-wait ACP job that fails on a later follow-up turn would lock this down.
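A small sketch of the "exactly one terminal event" rule this comment asks for: per-turn stop reasons map to a non-terminal event type, and only the overall session outcome maps to the terminal "result" that job monitors wait on. The event-type strings come from the discussion; the types and functions are illustrative.

```rust
#[derive(Debug, PartialEq)]
enum AcpEvent {
    TurnResult, // one per follow-up turn, ignored by job monitors
    Result,     // exactly one per session: the terminal signal
}

fn event_type(e: &AcpEvent) -> &'static str {
    match e {
        AcpEvent::TurnResult => "turn_result",
        AcpEvent::Result => "result",
    }
}

/// Builds the event stream for a session: one TurnResult per turn, then a
/// single terminal Result, regardless of how many turns ran or whether the
/// session ultimately failed.
fn session_events(turns: usize) -> Vec<AcpEvent> {
    let mut events: Vec<AcpEvent> = (0..turns).map(|_| AcpEvent::TurnResult).collect();
    events.push(AcpEvent::Result);
    events
}

fn main() {
    let events = session_events(3);
    let terminals = events.iter().filter(|e| **e == AcpEvent::Result).count();
    assert_eq!(terminals, 1);
    assert_eq!(event_type(events.last().unwrap()), "result");
    println!("3 turns -> {} events, 1 terminal", events.len());
}
```

A monitor that exits on the first `"result"` then always sees the true session outcome, including a failure on a late follow-up turn.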
The follow-up loop silently swallowed ACP prompt failures — logging and posting a status event but continuing the loop, so the job always reported success: true. Extract the loop into a testable `run_follow_up_loop` function with trait-based dependency injection and return Err on failure so run() reports success: false.

Also: downgrade follow-up loop info! logs to debug! (background task), add PartialEq to JobEventPayload, narrow AcpPromptSender to private.
Per-turn ACP results were emitting event_type "result" which maps to AppEvent::JobResult (the terminal signal). This caused job monitors to exit after the first prompt, making all subsequent follow-up outcomes invisible. Changed per-turn events to "turn_result" and emit exactly one terminal "result" from run() after the session ends. Also added retry discrimination for poll_prompt errors: permanent errors (OrchestratorRejected, LlmProxyFailed) fail immediately, transient errors (ConnectionFailed) retry with a cap of 5.
Force-pushed b4e4252 to 4d833f3
Addressed all review feedback:
- @serrrfirat (line 471, high severity): Per-turn
- @serrrfirat (line 488, medium severity):
- Known follow-up (not in scope):
        tracing::error!(job_id = %job_id, "{}", msg);
        return Err(WorkerError::ExecutionFailed { reason: msg });
    }
    tracing::warn!(
Medium Severity: This retry path still uses warn! in a background ACP loop.

Per the repo logging guidance, background worker paths should avoid info! and warn! because they surface in the REPL/TUI and corrupt the interactive display. A transient poll_prompt() failure is explicitly retried here, so these warnings can leak noisy infrastructure blips into the user-facing terminal even when the loop recovers normally.

Please downgrade this to debug! and keep error! for permanent failures or retry exhaustion.
        }
    };
    self.client
        .post_event(&JobEventPayload {
High Severity: Terminal ACP completion is still not reliably delivered.

This PR moved ACP to emit a single result event here, which fixes the earlier multi-terminal-event problem. However, post_event() is still fire-and-forget and only logs failures. If this POST is dropped or rejected, report_complete() still succeeds, but the orchestrator completion path does not broadcast AppEvent::JobResult; it only updates container state and cleans up. That means job monitors can miss the terminal signal entirely and keep the max_jobs slot occupied even though the worker finished.

Please make terminal completion delivery authoritative in one place: either have report_complete() also broadcast JobResult, or make terminal event delivery acknowledged/retried instead of best-effort.
I did note this in the PR description. It actually affects all 3 code paths (Worker, Claude Code, and ACP). Happy to tackle it here by updating report_complete.
serrrfirat left a comment:
Reviewed as paranoid architect. Well-structured fix for real bugs (silent follow-up failure swallowing, premature terminal events, infinite retries). Clean trait-based DI with thorough test coverage. No blockers found — minor findings (child exit code not checked, send_prompt not raced against child exit) are pre-existing and fit naturally into the acknowledged follow-up work.
…earai#1981 follow-up)

Follow-up to nearai#1981 (propagate follow-up prompt failures as job errors).

- Add safety-net JobResult broadcast in report_complete so job monitors free the max_jobs slot even when the worker's post_event is lost
- Extract broadcast_app_event helper (shared by job_event_handler and report_complete) with debug logging on DB fallback errors
- Extend idempotent self-transitions to all terminal states (Failed, Accepted, Cancelled) and block cross-terminal transitions
- Add duplicate-event test proving completion watcher handles back-to-back JobResult events without error
- Demote transient poll error log from warn to debug in acp_bridge
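A minimal sketch of the transition rules described in this commit: self-transitions on terminal states are idempotent no-ops (so duplicate JobResult events are harmless), while cross-terminal transitions are rejected. The state names come from the commit message; the function is an assumption, not the orchestrator's actual code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Running,
    Failed,
    Accepted,
    Cancelled,
}

fn is_terminal(s: JobState) -> bool {
    !matches!(s, JobState::Running)
}

fn transition(from: JobState, to: JobState) -> Result<JobState, String> {
    if is_terminal(from) {
        if from == to {
            // Duplicate terminal event (e.g. the safety-net broadcast plus the
            // worker's own post_event): idempotent success, not an error.
            return Ok(from);
        }
        return Err(format!("blocked cross-terminal transition {from:?} -> {to:?}"));
    }
    Ok(to)
}

fn main() {
    assert_eq!(transition(JobState::Running, JobState::Failed), Ok(JobState::Failed));
    // Back-to-back identical terminal events are accepted without error.
    assert_eq!(transition(JobState::Failed, JobState::Failed), Ok(JobState::Failed));
    // Moving between two different terminal states is rejected.
    assert!(transition(JobState::Failed, JobState::Cancelled).is_err());
    println!("terminal-state rules hold");
}
```

This is what makes the safety-net broadcast safe: whichever of the two JobResult deliveries arrives second becomes a no-op instead of a state error.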
Summary
- Reported `success: true` even when follow-up prompts failed.
- Per-turn events used `event_type: "result"`, which maps to `AppEvent::JobResult` (the terminal signal). Job monitors exit on the first `JobResult`, so all subsequent follow-up outcomes were invisible. Changed per-turn events to `"turn_result"` (non-terminal) and emit exactly one terminal `"result"` from `run()` after the entire session ends.
- `poll_prompt()` errors were retried infinitely. Now permanent errors (`OrchestratorRejected`, `LlmProxyFailed`) fail immediately, and transient errors (`ConnectionFailed`) retry with a cap of 5 (consistent with LLM retry patterns).
- Race `poll_prompt` against child exit via `tokio::select!`.
- Extract the loop into a testable `run_follow_up_loop` function with trait-based dependency injection (`FollowUpPromptSource`, `AcpPromptSender`, `AcpEventSink`), enabling unit testing without Docker/ACP infrastructure.
- `info!` → `debug!` in the follow-up loop since it runs in background Docker containers and was corrupting REPL/TUI output.
- Add `PartialEq` + `Clone` to `JobEventPayload`, narrow `AcpPromptSender` visibility to private.

Known follow-up
`report_complete()` does not broadcast `AppEvent::JobResult`; it only updates container state. If the fire-and-forget `post_event("result")` fails silently, the job monitor never learns the job ended and the `max_jobs` slot stays consumed. This is a pre-existing gap affecting all three bridge types (ACP, Claude, Container). Fix: have the orchestrator's `report_complete` handler also broadcast `JobResult`. Tracked separately since it touches shared orchestrator infrastructure.

Test plan
- `follow_up_success_emits_turn_event_not_terminal` — per-turn events use `"turn_result"`, not `"result"`
- `follow_up_prompt_failure_returns_error` — follow-up failure propagates as `Err`
- `follow_up_loop_fails_on_permanent_poll_error` — `OrchestratorRejected` fails immediately
- `follow_up_loop_exhausts_transient_retries` — 5 consecutive `ConnectionFailed` → `Err`
- `follow_up_loop_recovers_from_transient_poll_error` — transient error then success → Ok
- `follow_up_loop_exits_during_long_poll_when_child_dies` — child exit detected during blocked poll
- `child_monitor_kills_process_so_stderr_reader_completes` — kill channel unblocks stderr
- `cargo clippy` zero warnings, `cargo fmt` clean