[FEATURE]: Recovery path for errored subagent sessions (composer lockout blocks manual continuation after provider failures) #21525

@cpkt9762

Feature hasn't been suggested before.

  • I have verified this feature I'm about to request hasn't been suggested before.

Describe the enhancement you want to request

Not a duplicate of #21458 / #6907

Those issues are about steering a running subagent — "I want to change direction mid-flight". The existing reply ("you weren't actually talking to the subagent, it was routed through the primary agent") is valid for that case.

This issue is about a different scenario: recovery from a terminal error state, where there is no live primary-agent turn to route to. The design justification for #20708 does not cover this case.

Problem

When the LLM call inside a subagent session fails in a way that the backend's automatic retry policy can't recover from (non-retryable error, or transient errors that outlast the retry budget), the subagent session enters a halted error state. In v1.4.0 the user has no way to resume it:

  1. The subagent composer shows "Subagent sessions cannot be prompted. Back to main session." (introduced by feat(app): better subagent experience #20708).
  2. There is no "Resume" / "Retry" / "Continue" action on the error banner.
  3. The main agent has no mechanism to observe the child failure — it continues executing its own work unaware that one of its delegated tasks died. Even if it noticed, re-delegating from the main session loses the subagent's tool history and partial findings.

The only remaining user action is to abort the main session entirely, losing all in-flight parallel subagent work.

Concrete scenario (what made us notice)

  1. Main agent delegates several subagents in parallel (typical task(run_in_background=true) fan-out).
  2. The upstream provider (in our case a third-party OpenAI-compatible gateway) has a brief outage or rate-limit burst.
  3. One subagent's stream fails with a transient error. The backend retries with exponential backoff but the outage outlasts the retry budget, or the error is classified as non-retryable.
  4. That subagent's session is halted with an error banner.
  5. The other subagents are still running. The user wants to resume just the failed one without killing the whole main session.
  6. Before 1.4.0: open the errored subagent session in the web/desktop UI, type something like "retry your previous task" in the composer, session resumes with prior context. Work continues.
  7. After 1.4.0: the composer is locked. No recovery action exists. Only option is to cancel everything.

This is not a hypothetical — it happens regularly when the upstream provider's availability dips.
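
To make step 3 concrete, here is a rough sketch of how an exponential backoff schedule can run out before an outage ends. The function names, the 1 s base delay, the doubling factor, and the retry count are all illustrative assumptions, not the backend's actual retry policy.

```typescript
// Hypothetical helper: times (ms since the first attempt) at which each
// attempt is made, given an exponential backoff schedule.
function retryTimes(baseMs: number, factor: number, maxRetries: number): number[] {
  const times: number[] = [0]; // the initial attempt happens at t = 0
  let delay = baseMs;
  let t = 0;
  for (let i = 0; i < maxRetries; i++) {
    t += delay; // wait, then retry
    times.push(t);
    delay *= factor; // grow the wait before the next retry
  }
  return times;
}

// Hypothetical helper: true if at least one attempt lands after the outage ends.
function recoversWithinBudget(
  outageMs: number,
  baseMs: number,
  factor: number,
  maxRetries: number,
): boolean {
  return retryTimes(baseMs, factor, maxRetries).some((t) => t >= outageMs);
}

// Assumed policy: 1 s base, doubling, 5 retries. The last attempt is at 31 s.
console.log(retryTimes(1000, 2, 5));
console.log(recoversWithinBudget(10_000, 1000, 2, 5)); // 10 s outage
console.log(recoversWithinBudget(60_000, 1000, 2, 5)); // 60 s outage
```

Under these assumed numbers, a 10-second outage is absorbed but a 60-second one is not, which is the point at which the subagent session halts with no recovery path.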

Why existing alternatives don't work

  • Bump retry cap: helps transient errors only. Non-retryable errors and sustained outages still dead-end, and no cap setting covers an hours-long provider incident.
  • Re-delegate from main agent: the main agent does not receive subagent failures as a task tool result, so it has no signal to retry. Even with that signal, re-delegation starts fresh and loses the subagent's context, tool calls, and partial findings.
  • session.abort: terminates, doesn't resume. Also see [FEATURE] Add force-kill API for subagent session termination #21176 for the related gap on the termination side.
  • Aborting the main session: throws away all healthy in-flight subagents to recover one failed one.

Proposed solutions (any one of these would close the gap)

Option A — Unlock the composer for errored subagent sessions only

Keep the "no steering during live generation" design from #20708. When session.status === "error" (or similar halted state), re-enable the composer. A fresh user message on a halted session triggers a new turn that inherits the prior context. This is the smallest possible change — effectively a conditional around the disabled state introduced in #20708.
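
A minimal sketch of what that conditional might look like. The `SessionInfo` shape, the `parentID` field, and the status values are assumptions made for illustration; the real session model and the check in session-composer-region.tsx may differ.

```typescript
// Hypothetical status values; the actual set of session states may differ.
type SessionStatus = "idle" | "running" | "error";

// Hypothetical session shape. `parentID` is assumed to be set only on
// subagent sessions.
interface SessionInfo {
  id: string;
  parentID?: string;
  status: SessionStatus;
}

// The composer stays locked for subagent sessions (the #20708 behaviour)
// unless the session has halted in an error state.
function composerEnabled(session: SessionInfo): boolean {
  const isSubagent = session.parentID !== undefined;
  if (!isSubagent) return true; // main sessions are always promptable
  return session.status === "error"; // recovery only, no live steering
}
```

With this shape, a running subagent is still not promptable, so the "no steering" intent is preserved; only the halted case changes.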

Option B — Explicit "Resume" / "Retry" action on the error banner

Add a button on the halted subagent session's error banner that re-dispatches the turn from the last checkpoint, without requiring the user to type anything. UX-wise this is the cleanest because it doesn't reintroduce any "subagent chat" affordance at all — it's clearly a recovery action, not steering.
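
As a rough sketch of the selection logic behind such a button, the function below picks which turn to re-dispatch. The `Turn` and `HaltedSession` types are hypothetical, and treating the last failed turn as the "checkpoint" is an assumption; how checkpoints are actually stored is not something this issue knows.

```typescript
// Hypothetical turn and session shapes, for illustration only.
type TurnState = "completed" | "failed";

interface Turn {
  id: string;
  state: TurnState;
}

interface HaltedSession {
  id: string;
  status: "running" | "error";
  turns: Turn[];
}

// Returns the ID of the turn a "Resume" button would re-dispatch, or
// undefined when there is nothing to recover (the button would be hidden).
function turnToResume(session: HaltedSession): string | undefined {
  if (session.status !== "error") return undefined; // only halted sessions
  const last = session.turns[session.turns.length - 1];
  if (last === undefined || last.state !== "failed") return undefined;
  return last.id;
}
```

Because the function returns nothing for a running session, the action cannot be used to steer live generation.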

Option C — Surface subagent failure to the parent agent as a tool event

Let the main agent receive subagent errors as a task tool result so it can decide whether to retry, fall back, or ask the user. Architecturally the cleanest long-term answer, but the most work. It also doesn't fully solve the case where the parent agent has already moved on.
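
One possible shape for that tool result, sketched as a discriminated union. The `TaskResult` type, its field names, and the `decideNextStep` policy are hypothetical and only show how a parent could branch on a child failure.

```typescript
// Hypothetical task tool result: success, or a structured failure the
// parent agent can react to.
type TaskResult =
  | { kind: "ok"; sessionID: string; output: string }
  | { kind: "error"; sessionID: string; message: string; retryable: boolean };

type ParentAction = "continue" | "retry" | "ask_user";

// Illustrative policy: retry retryable failures up to a cap, otherwise
// hand the decision to the user.
function decideNextStep(
  result: TaskResult,
  attemptsSoFar: number,
  maxAttempts: number,
): ParentAction {
  if (result.kind === "ok") return "continue";
  if (result.retryable && attemptsSoFar < maxAttempts) return "retry";
  return "ask_user";
}
```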

We'd be happy with Option A or B as a minimal fix. Option C is a separate, larger design discussion.

Related issues

OpenCode version

1.4.0

Operating System

Not OS-specific: reproduces on the macOS, Windows, and Linux web/desktop UIs. The composer lockout was introduced by #20708 in packages/app/src/pages/session/composer/session-composer-region.tsx and renders identically across platforms.

Labels

web: Relates to opencode on web / desktop
