
fix(acp): propagate follow-up prompt failures as job errors #1981

Merged
serrrfirat merged 3 commits into nearai:staging from rajulbhatnagar:worktree-acp-fixes on Apr 7, 2026

Conversation

@rajulbhatnagar
Contributor

@rajulbhatnagar rajulbhatnagar commented Apr 3, 2026

Summary

  • Bug fix (ACP bridge follow-up prompt failures should mark sandbox job failed #1915): The ACP follow-up loop silently swallowed prompt failures — it logged and posted a status event but continued the loop, so jobs always reported success: true even when follow-up prompts failed.
  • Terminal event fix: Per-turn ACP results were emitting event_type: "result" which maps to AppEvent::JobResult (the terminal signal). Job monitors exit on the first JobResult, so all subsequent follow-up outcomes were invisible. Changed per-turn events to "turn_result" (non-terminal) and emit exactly one terminal "result" from run() after the entire session ends.
  • Poll retry limits: All poll_prompt() errors were retried infinitely. Now permanent errors (OrchestratorRejected, LlmProxyFailed) fail immediately, and transient errors (ConnectionFailed) retry with a cap of 5 (consistent with LLM retry patterns).
  • Child process cleanup: Kill child process on protocol failure so stderr reader completes and the job doesn't hang. Race poll_prompt against child exit via tokio::select!.
  • Refactor: Extract the inline loop into a standalone run_follow_up_loop function with trait-based dependency injection (FollowUpPromptSource, AcpPromptSender, AcpEventSink), enabling unit testing without Docker/ACP infrastructure.
  • Logging fix: Downgrade info! to debug! in the follow-up loop, since it runs in background Docker containers and the info!-level logs were corrupting REPL/TUI output.
  • Add PartialEq + Clone to JobEventPayload, narrow AcpPromptSender visibility to private.
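The retry-discrimination rule in the bullets above can be sketched as follows. This is a simplified, synchronous reduction: `handle_poll_error` and the `PollError` enum shape are hypothetical stand-ins, while the variant names and the cap of 5 come from the PR description:

```rust
// Hypothetical stand-ins for the worker's poll error variants (names from the PR).
#[derive(Debug)]
enum PollError {
    OrchestratorRejected, // permanent: fail the job immediately
    LlmProxyFailed,       // permanent
    ConnectionFailed,     // transient: retry with a cap
}

const MAX_TRANSIENT_RETRIES: u32 = 5;

/// Returns Err to fail the job, Ok(updated_failure_count) to keep looping.
fn handle_poll_error(err: PollError, consecutive_failures: u32) -> Result<u32, String> {
    match err {
        PollError::OrchestratorRejected | PollError::LlmProxyFailed => {
            Err("permanent poll error: failing job immediately".to_string())
        }
        PollError::ConnectionFailed => {
            let failures = consecutive_failures + 1;
            if failures >= MAX_TRANSIENT_RETRIES {
                Err(format!("transient poll error retried {failures} times; giving up"))
            } else {
                Ok(failures) // caller sleeps/backs off, then polls again
            }
        }
    }
}

fn main() {
    assert!(handle_poll_error(PollError::OrchestratorRejected, 0).is_err());
    assert_eq!(handle_poll_error(PollError::ConnectionFailed, 0), Ok(1));
    assert!(handle_poll_error(PollError::ConnectionFailed, 4).is_err());
    println!("retry discrimination behaves as described");
}
```

A successful poll would reset the failure counter to zero, which is what makes the cap a bound on *consecutive* transient failures.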

Known follow-up

report_complete() does not broadcast AppEvent::JobResult — it only updates container state. If the fire-and-forget post_event("result") fails silently, the job monitor never learns the job ended and the max_jobs slot stays consumed. This is a pre-existing gap affecting all three bridge types (ACP, Claude, Container). Fix: have the orchestrator's report_complete handler also broadcast JobResult. Tracked separately since it touches shared orchestrator infrastructure.

Test plan

  • follow_up_success_emits_turn_event_not_terminal — per-turn events use "turn_result", not "result"
  • follow_up_prompt_failure_returns_error — follow-up failure propagates as Err
  • follow_up_loop_fails_on_permanent_poll_error — OrchestratorRejected fails immediately
  • follow_up_loop_exhausts_transient_retries — 5 consecutive ConnectionFailed → Err
  • follow_up_loop_recovers_from_transient_poll_error — transient error then success → Ok
  • follow_up_loop_exits_during_long_poll_when_child_dies — child exit detected during blocked poll
  • child_monitor_kills_process_so_stderr_reader_completes — kill channel unblocks stderr
  • cargo clippy zero warnings, cargo fmt clean
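As a rough illustration of how the stub-based tests above can run without Docker/ACP infrastructure, here is a dependency-free, synchronous reduction of the pattern. The trait name FollowUpPromptSource comes from the PR; ScriptedSource and this loop body are hypothetical, and Ok(None) is repurposed here as "session ended" (in the real code it means "no prompt yet, back off") to keep the sketch synchronous:

```rust
// Synchronous reduction of the PR's trait-based dependency injection;
// the real loop is async and drives a live ACP session.
trait FollowUpPromptSource {
    fn poll_prompt(&mut self) -> Result<Option<String>, String>;
}

/// Test stub that replays a scripted sequence of poll outcomes.
struct ScriptedSource {
    script: Vec<Result<Option<String>, String>>,
}

impl FollowUpPromptSource for ScriptedSource {
    fn poll_prompt(&mut self) -> Result<Option<String>, String> {
        if self.script.is_empty() {
            Ok(None) // script exhausted: treat as session end
        } else {
            self.script.remove(0)
        }
    }
}

/// Handles prompts until the source signals session end (Ok(None)).
/// Poll errors propagate as Err instead of being swallowed.
fn run_follow_up_loop(source: &mut dyn FollowUpPromptSource) -> Result<Vec<String>, String> {
    let mut handled = Vec::new();
    loop {
        match source.poll_prompt()? {
            Some(prompt) => handled.push(prompt),
            None => return Ok(handled),
        }
    }
}

fn main() {
    let mut failing = ScriptedSource {
        script: vec![Ok(Some("follow-up 1".into())), Err("prompt failed".into())],
    };
    // The failure is propagated, so run() can report success: false.
    assert!(run_follow_up_loop(&mut failing).is_err());
    println!("follow-up failure propagated as Err");
}
```

The `?` on `poll_prompt()` is the one-character version of the whole bug fix: the old loop logged the error and continued, so the Err never reached the caller.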

@github-actions github-actions bot added labels on Apr 3, 2026: scope: worker (Container worker), size: L (200-499 changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: experienced (6-19 merged PRs)
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the ACP bridge follow-up loop into a dedicated run_follow_up_loop function and introduces the FollowUpPromptSource and AcpPromptSender traits to improve modularity and testability. It also includes new unit tests with stubs to verify error propagation and recovery logic. Feedback was provided regarding the concurrency model of the loop, suggesting that prompt polling and the child exit signal should be monitored simultaneously using tokio::select! to ensure the worker responds immediately if the agent process terminates during a poll.

Comment thread src/worker/acp_bridge.rs Outdated
) -> Result<(), WorkerError> {
let session_id_str = session_id.to_string();
loop {
let backoff = match prompt_source.poll_prompt().await {
Medium Severity

The follow-up loop currently polls the orchestrator sequentially. If poll_prompt is a long-poll or if the orchestrator returns prompts quickly, the loop might delay responding to the agent process exiting (monitored via child_exit_rx). Consider using tokio::select! to poll for prompts and watch for the child exit signal simultaneously to ensure the worker shuts down immediately if the agent dies.
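The fix that eventually landed uses tokio::select! for exactly this race. A dependency-free analogue of the idea, using std threads and a shared channel (all names here are illustrative, not the worker's API), shows why whichever event completes first must win:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum LoopEvent {
    Prompt(String),
    ChildExited,
}

/// Races a (simulated) long poll against a (simulated) child exit; whichever
/// completes first wins, mirroring what tokio::select! does for the futures.
fn race(poll_delay: Duration, child_delay: Duration) -> LoopEvent {
    let (tx, rx) = mpsc::channel();
    let poll_tx = tx.clone();
    thread::spawn(move || {
        thread::sleep(poll_delay); // stands in for a blocked poll_prompt()
        let _ = poll_tx.send(LoopEvent::Prompt("follow-up".into()));
    });
    thread::spawn(move || {
        thread::sleep(child_delay); // stands in for child_exit_rx firing
        let _ = tx.send(LoopEvent::ChildExited);
    });
    rx.recv().expect("one racer always sends")
}

fn main() {
    // Child dies while the poll is still pending: the loop sees it immediately.
    let winner = race(Duration::from_millis(300), Duration::from_millis(10));
    assert_eq!(winner, LoopEvent::ChildExited);
    println!("child exit observed even while the poll was still blocked");
}
```

Without the race, the sequential version would sit inside the blocked poll and only notice the dead child on the next loop iteration, if ever.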

Collaborator

@serrrfirat serrrfirat left a comment


Posting the requested high-severity finding from the review.

Comment thread src/worker/acp_bridge.rs
data: json!({ "message": msg }),
})
.await;
return Err(WorkerError::ExecutionFailed { reason: msg });
High Severity

Early error propagation here can still leave the ACP job stuck instead of failed.

This new return Err(...) fixes the logical success/failure bug, but the process lifecycle is now split: the Child was moved into a background wait() task earlier, while run_acp_session() still unconditionally awaits the stderr reader after the LocalSet returns. If conn.prompt() fails at the protocol layer without the subprocess exiting immediately, this branch returns out of run_follow_up_loop(), then run_acp_session() can block waiting for stderr and never reach report_complete(success = false).

Please explicitly terminate the child on this error path, or keep an owned handle that can be killed during cleanup before awaiting stderr. A regression test that simulates prompt() failing while the subprocess stays alive would pin this down.
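The kill-channel approach the follow-up commit adopts can be illustrated with std threads and a real subprocess. spawn_child_monitor here is a loose stand-in for the PR's function of the same name; the channel shape and the `sleep` child are assumptions made for the sketch:

```rust
use std::process::{Child, Command, ExitStatus, Stdio};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Monitor thread owns the Child; a message (or drop) on the kill channel
/// terminates the process, so readers of its pipes (e.g. a stderr task)
/// unblock and complete instead of hanging the job.
fn spawn_child_monitor(mut child: Child) -> (Sender<()>, JoinHandle<ExitStatus>) {
    let (kill_tx, kill_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        let _ = kill_rx.recv(); // blocks until an error path fires the kill
        let _ = child.kill();   // no-op if the child already exited
        child.wait().expect("wait on child")
    });
    (kill_tx, handle)
}

fn main() {
    // Long-running subprocess standing in for the ACP agent.
    let child = Command::new("sleep")
        .arg("30")
        .stdout(Stdio::null())
        .spawn()
        .expect("spawn sleep");

    let (kill_tx, monitor) = spawn_child_monitor(child);

    // Simulated protocol failure: send the kill before awaiting cleanup,
    // so the pipe closes and the job fails promptly instead of hanging.
    kill_tx.send(()).expect("monitor alive");
    let status = monitor.join().expect("monitor join");
    assert!(!status.success()); // killed, not a clean exit
    println!("child terminated; cleanup can complete");
}
```

The key property is ownership: only the monitor holds the Child, so the error path never races another task for the handle; it just signals the channel.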

Contributor Author


Done

rajulbhatnagar pushed a commit to rajulbhatnagar/ironclaw that referenced this pull request Apr 4, 2026
Address two review findings from nearai#1981:

1. (gemini-code-assist) poll_prompt() blocked the follow-up loop — child
   exit wasn't detected during active HTTP polling. Wrap poll_prompt() in
   tokio::select! alongside child_exit_rx.

2. (serrrfirat, high severity) Protocol failure could leave the child
   alive, causing stderr_handle.await to block forever and the job to
   hang. Add a kill channel: spawn_child_monitor() returns (child_exit_rx,
   kill_tx); on error, run_acp_session sends the kill signal before
   awaiting stderr so the pipe closes and cleanup completes.

Extract spawn_child_monitor() as a standalone function so both production
code and the regression test exercise the same code path.

Tests:
- follow_up_loop_exits_during_long_poll_when_child_dies (ForeverPromptSource
  + timeout guard — hangs without the select! fix)
- child_monitor_kills_process_so_stderr_reader_completes (real sleep
  subprocess + kill signal — hangs without the kill channel)
@github-actions github-actions bot added size: XL (500+ changed lines) and removed size: L (200-499 changed lines) labels on Apr 4, 2026
Comment thread src/worker/acp_bridge.rs Outdated
}
Ok(None) => Duration::from_secs(2),
Err(e) => {
tracing::warn!(job_id = %job_id, "Prompt polling error: {}", e);

Medium Severity

All poll_prompt() errors take the same infinite 5s retry path here.

poll_prompt() also surfaces permanent failures such as OrchestratorRejected and response-parse errors, so if the prompt endpoint is broken while the ACP subprocess stays alive, this loop can keep the container alive indefinitely instead of failing the job.

Please distinguish retryable transport errors from permanent endpoint/protocol failures, or cap consecutive retries and convert exhaustion into ExecutionFailed plus a terminal error result. A regression test for repeated OrchestratorRejected or malformed prompt responses would cover this.

Comment thread src/worker/acp_bridge.rs
match follow_result {
Ok(resp) => {
let payload = stop_reason_to_result(&resp.stop_reason, &session_id_str);
sink.emit_event(&payload).await;

High Severity

This emits a result event for each successful follow-up turn, but on failure the new code only emits a status event before returning Err.

AppEvent::JobResult is the terminal signal that background job monitors use, and they ignore status events. Those monitors are also spawned for non-wait sandbox jobs, so an ACP job can still be observed as completed after an earlier follow-up succeeds while a later follow-up failure never propagates through the normal terminal event path.

Please emit exactly one terminal result for the overall ACP session, including an error result before returning on follow-up failure, and use a non-terminal event type for per-turn stop reasons. An integration test for a non-wait ACP job that fails on a later follow-up turn would lock this down.
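The terminal/non-terminal split this comment asks for comes down to a dispatch like the following. AppEvent::JobResult, JobStatus, and the "turn_result" event type are named in the thread; map_event_type itself is a hypothetical sketch of the orchestrator-side mapping:

```rust
#[derive(Debug, PartialEq)]
enum AppEvent {
    JobResult, // terminal: job monitors exit on the first one they see
    JobStatus, // non-terminal: monitors keep listening
}

/// Only the single end-of-session "result" event becomes the terminal
/// JobResult; per-turn "turn_result" events fall through to JobStatus,
/// so monitors survive every intermediate follow-up turn.
fn map_event_type(event_type: &str) -> AppEvent {
    match event_type {
        "result" => AppEvent::JobResult,
        _ => AppEvent::JobStatus, // includes "turn_result" and "status"
    }
}

fn main() {
    assert_eq!(map_event_type("turn_result"), AppEvent::JobStatus);
    assert_eq!(map_event_type("result"), AppEvent::JobResult);
    println!("per-turn events are non-terminal; one terminal result per session");
}
```

With this mapping, a follow-up failure on turn N still reaches the monitor, because nothing before the session-final "result" can terminate it.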

Rajul Bhatnagar added 3 commits April 5, 2026 05:51

The follow-up loop silently swallowed ACP prompt failures — logging and
posting a status event but continuing the loop, so the job always
reported success: true. Extract the loop into a testable
`run_follow_up_loop` function with trait-based dependency injection and
return Err on failure so run() reports success: false.

Also: downgrade follow-up loop info! logs to debug! (background task),
add PartialEq to JobEventPayload, narrow AcpPromptSender to private.
Address two review findings from nearai#1981.
Per-turn ACP results were emitting event_type "result" which maps to
AppEvent::JobResult (the terminal signal). This caused job monitors to
exit after the first prompt, making all subsequent follow-up outcomes
invisible. Changed per-turn events to "turn_result" and emit exactly
one terminal "result" from run() after the session ends.

Also added retry discrimination for poll_prompt errors: permanent
errors (OrchestratorRejected, LlmProxyFailed) fail immediately,
transient errors (ConnectionFailed) retry with a cap of 5.
@rajulbhatnagar
Contributor Author

Addressed all review feedback:

@serrrfirat (line 471, high severity): Per-turn "result" events were prematurely terminating job monitors. Fixed by changing per-turn events to "turn_result" (non-terminal, falls through to AppEvent::JobStatus in the orchestrator) and emitting exactly one terminal "result" from run() after the entire session ends.

@serrrfirat (line 488, medium severity): poll_prompt() errors now have retry discrimination — OrchestratorRejected and LlmProxyFailed fail immediately (permanent), ConnectionFailed retries with a cap of 5 (~25s), consistent with llm/retry.rs patterns.

Known follow-up (not in scope): report_complete() does not broadcast AppEvent::JobResult — if the fire-and-forget post_event("result") fails silently, the monitor never learns the job ended. This is a pre-existing gap affecting all three bridge types. Fix: have the orchestrator's report_complete handler also broadcast JobResult. Will track separately since it touches shared orchestrator infrastructure.

Comment thread src/worker/acp_bridge.rs
tracing::error!(job_id = %job_id, "{}", msg);
return Err(WorkerError::ExecutionFailed { reason: msg });
}
tracing::warn!(

Medium Severity

This retry path still uses warn! in a background ACP loop.

Per the repo logging guidance, background worker paths should avoid info! and warn! because they surface in the REPL/TUI and corrupt the interactive display. A transient poll_prompt() failure is explicitly retried here, so these warnings can leak noisy infrastructure blips into the user-facing terminal even when the loop recovers normally.

Please downgrade this to debug! and keep error! for permanent failures or retry exhaustion.

Comment thread src/worker/acp_bridge.rs
}
};
self.client
.post_event(&JobEventPayload {

High Severity

Terminal ACP completion is still not reliably delivered.

This PR moved ACP to emit a single result event here, which fixes the earlier multi-terminal-event problem. However, post_event() is still fire-and-forget and only logs failures. If this POST is dropped or rejected, report_complete() still succeeds, but the orchestrator completion path does not broadcast AppEvent::JobResult; it only updates container state and cleans up. That means job monitors can miss the terminal signal entirely and keep the max_jobs slot occupied even though the worker finished.

Please make terminal completion delivery authoritative in one place: either have report_complete() also broadcast JobResult, or make terminal event delivery acknowledged/retried instead of best-effort.

Contributor Author


I did note this in the PR description. It actually affects all 3 codepaths (Worker, Claude Code, and ACP). Happy to tackle it here by updating report_complete.

Collaborator

@serrrfirat serrrfirat left a comment


Reviewed as paranoid architect. Well-structured fix for real bugs (silent follow-up failure swallowing, premature terminal events, infinite retries). Clean trait-based DI with thorough test coverage. No blockers found — minor findings (child exit code not checked, send_prompt not raced against child exit) are pre-existing and fit naturally into the acknowledged follow-up work.

@serrrfirat serrrfirat merged commit 13774cc into nearai:staging Apr 7, 2026
14 checks passed
rajulbhatnagar pushed a commit to rajulbhatnagar/ironclaw that referenced this pull request Apr 7, 2026
…earai#1981 follow-up)

Follow-up to nearai#1981 (propagate follow-up prompt failures as job errors).

- Add safety-net JobResult broadcast in report_complete so job monitors
  free the max_jobs slot even when the worker's post_event is lost
- Extract broadcast_app_event helper (shared by job_event_handler and
  report_complete) with debug logging on DB fallback errors
- Extend idempotent self-transitions to all terminal states (Failed,
  Accepted, Cancelled) and block cross-terminal transitions
- Add duplicate-event test proving completion watcher handles back-to-back
  JobResult events without error
- Demote transient poll error log from warn to debug in acp_bridge
