
Missing step-finish/step-start parts after retryable stream errors cause tool_use/tool_result mismatch #16749

@altendky


Summary

When the finish-step handler in processor.ts:244-288 throws (any of its async operations — Session.updatePart, Session.updateMessage, Snapshot.patch — can fail), the error is caught, and if it is deemed retryable, the continue at line 377 creates a new LLM stream. But the step-finish part for step 1 and the step-start part for step 2 were never saved, so both steps' content gets merged into one DB message with no boundary between them.

On replay, convertToModelMessages() in the AI SDK produces a single assistant block with interleaved tool_use/text/reasoning content, which the Anthropic API rejects with:

messages.N: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_XXX.
Each `tool_use` block must have a corresponding `tool_result` block in the next message.

or:

messages.N.content.0.type: Expected `thinking` or `redacted_thinking`, but found `tool_use`.

Root Cause

The finish-step handler at processor.ts:244-288 performs multiple async operations:

case "finish-step":
  const usage = Session.getUsage({ ... })
  await Session.updatePart({ type: "step-finish", ... })   // can throw
  await Session.updateMessage(input.assistantMessage)       // can throw
  if (snapshot) {
    const patch = await Snapshot.patch(snapshot)             // can throw
    // ...
  }
  // ...

If any of these throw and the error is deemed retryable, the catch block at line 353 hits continue at line 377, which loops back to while(true) and creates a new LLM stream. The new stream's events are appended to the same DB message, but step 1's step-finish and step 2's step-start parts were never saved.
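The failure mode can be sketched as a minimal simulation (hypothetical shapes and names — this is not the real processor.ts code): content parts are persisted as they stream in, but the step boundary is only written if the handler finishes cleanly, and a retryable failure just re-enters the loop with the same accumulator.

```typescript
type Part = { type: string };

// Simulates stream attempts appended to the same DB message. On an
// attempt where persisting the boundary throws (as Session.updatePart
// can), the error is treated as retryable and the loop opens a new
// stream — the "step-finish" boundary is never saved.
function runWithRetry(attempts: Part[][], boundaryPersistFails: boolean[]): Part[] {
  const saved: Part[] = [];
  let attempt = 0;
  while (true) {
    try {
      for (const part of attempts[attempt]) saved.push(part); // content parts saved as they arrive
      if (boundaryPersistFails[attempt]) throw new Error("Session.updatePart failed");
      saved.push({ type: "step-finish" }); // boundary written only on clean completion
      return saved;
    } catch (err) {
      if (++attempt >= attempts.length) throw err; // out of retries
      continue; // back to while(true): new LLM stream, same DB message
    }
  }
}

const parts = runWithRetry(
  [
    [{ type: "text" }, { type: "tool" }], // attempt 1: boundary persist fails
    [{ type: "text" }, { type: "tool" }], // attempt 2 (retry): succeeds
  ],
  [true, false],
);
// parts: text, tool, text, tool, step-finish — no boundary between the two steps
```

The result matches the corruption observed in the DB: two steps' worth of parts with a single trailing step-finish and nothing separating them.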

Without step boundaries, the AI SDK's convertToModelMessages() merges all parts into a single block, producing:

assistant: [text, tool-call, text, tool-call]   ← INVALID: text after tool-call
tool:      [tool-result, tool-result]

Instead of the correct:

assistant: [text, tool-call]
tool:      [tool-result]
assistant: [text, tool-call]
tool:      [tool-result]

Secondary Root Cause: tool-error Race Condition

processor.ts:206 — the tool-error handler only processes errors when match.state.status === "running":

case "tool-error": {
  const match = toolcalls[value.toolCallId]
  if (match && match.state.status === "running") { // ← only "running"

Due to the AI SDK's merged-stream event ordering, tool-error can arrive before tool-call, when the status is still "pending". The error is silently ignored, leaving the tool in "pending" state. It's later cleaned up as "Tool execution aborted" with empty input {} by the post-stream cleanup at lines 401-417.
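The race and the proposed relaxation can be sketched as follows (hypothetical state shapes mirroring the guard at processor.ts:206 — a sketch, not the actual handler):

```typescript
type ToolState = { status: "pending" | "running" | "completed" | "error"; error?: string };

// With acceptPending=false (current behavior), a tool-error arriving
// while the call is still "pending" is silently dropped; with
// acceptPending=true it is recorded.
function onToolError(
  toolcalls: Record<string, { state: ToolState }>,
  toolCallId: string,
  error: string,
  acceptPending: boolean,
): void {
  const match = toolcalls[toolCallId];
  if (!match) return;
  const accepted = acceptPending
    ? match.state.status === "running" || match.state.status === "pending"
    : match.state.status === "running"; // current strict guard
  if (!accepted) return; // error silently dropped
  match.state = { status: "error", error };
}

const calls: Record<string, { state: ToolState }> = {
  toolu_1: { state: { status: "pending" } },
};
onToolError(calls, "toolu_1", "boom", false); // strict: still "pending", later "aborted"
onToolError(calls, "toolu_1", "boom", true);  // relaxed: recorded as "error"
```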

This was independently discovered by a user in #10616 (comment) who wrote:

"the tool-error handler only processes errors for tools in 'running' status. If the SDK emits a tool-error for a tool that's still in 'pending' status (because tool-call was never processed), the error is silently ignored."

Not Tool-Specific

This is a pipeline bug, not a tool bug. I've observed it with:

  • MCP tools (custom api_search tool)
  • Built-in tools (the write tool)

Real-World Evidence

Session ses_32fb35486ffeeJAHmplKU1gB2t, message msg_cd05ba534001gICo48Lsy1NHWp (from pre-repair DB backup):

SELECT p.id, p.time_created, json_extract(p.data, '$.type') as type,
       json_extract(p.data, '$.tool') as tool,
       json_extract(p.data, '$.state.status') as status,
       json_extract(p.data, '$.state.error') as error
FROM part p WHERE p.message_id = 'msg_cd05ba534001gICo48Lsy1NHWp'
ORDER BY p.time_created;
part_id                          | time_created  | type        | tool  | status    | error
---------------------------------+---------------+-------------+-------+-----------+------------------------
prt_cd05bb9ac001brzJbfx6NPVO2y   | 1773022198188 | step-start  |       |           |
prt_cd05bb9ad001pzM736ephha8OT   | 1773022198189 | text        |       |           |
prt_cd05bb9f0001N3qbpvXSA0NBGs   | 1773022198257 | tool        | write | error     | Tool execution aborted
                                                                                      ← 96 SECOND GAP
prt_cd05d3273001z4y25K6X1Q3Piz   | 1773022294644 | text        |       |           |
prt_cd05d35a8001jOK62EPx3KVVEd   | 1773022295465 | tool        | write | completed |
prt_cd05f3c5d001QVGr7VZTzuN4Gf   | 1773022428254 | step-finish |       |           |

Key observations:

  • The errored write tool has input: {} — the tool-error event was dropped because the tool was still "pending" when it arrived
  • There's a 96-second gap between the errored tool and the next text — this is when the retry created a new stream
  • No step-finish / step-start boundary between the two groups
  • The errored tool's time_updated (1773022428264) is 10ms after step-finish (1773022428254) — confirming the post-stream cleanup ran after the stream ended

Reproduction Test

A failing test is provided in the companion PR. It constructs a WithParts[] with parts from two merged steps:

step-start → text → tool(error) → [no boundary] → text → tool(completed)

The test runs these parts through MessageV2.toModelMessages() and asserts the structural invariant: no text or reasoning part may appear after a tool-call part within the same assistant ModelMessage.

Currently fails:

error: Invalid interleaving: found "text" part after "tool-call" in the same assistant message.
Content types in this message: [text, tool-call, text, tool-call]
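The invariant the test asserts can be sketched like this (hypothetical content-part shape, not the AI SDK's actual ModelMessage type):

```typescript
type ContentPart = { type: "text" | "reasoning" | "tool-call" };

// Returns null if the assistant message content is well-ordered, or an
// error string if a text/reasoning part follows a tool-call part.
function checkInterleaving(content: ContentPart[]): string | null {
  let sawToolCall = false;
  for (const part of content) {
    if (part.type === "tool-call") sawToolCall = true;
    else if (sawToolCall && (part.type === "text" || part.type === "reasoning"))
      return `Invalid interleaving: found "${part.type}" part after "tool-call"`;
  }
  return null;
}

checkInterleaving([{ type: "text" }, { type: "tool-call" }]);
// → null: valid (the Anthropic API accepts trailing tool_use)
checkInterleaving([{ type: "text" }, { type: "tool-call" }, { type: "text" }]);
// → error string: the shape the merged steps currently produce
```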

Suggested Fixes

  1. Reconstruction-time fix (most important — handles already-corrupted data): In toModelMessages() or normalizeMessages(), detect when a text/reasoning part appears after a tool-call part in the same assistant block, and inject a synthetic step-start boundary to force the AI SDK to split the content into separate blocks.

  2. tool-error race fix: Accept tool-error when status === "pending" in addition to "running" at processor.ts:206.

  3. finish-step hardening: Wrap individual operations in the finish-step handler so partial failures don't lose the step boundary.
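Fix 1 could look roughly like the following (a sketch with hypothetical part shapes — the real part types in MessageV2 are richer): scan the assistant message's parts and inject a synthetic step-start wherever a text/reasoning part follows a tool part with no boundary in between, so the AI SDK splits the content into separate blocks.

```typescript
type Part = { type: "step-start" | "step-finish" | "text" | "reasoning" | "tool" };

// Repairs a corrupted part sequence by inserting synthetic step-start
// boundaries before any text/reasoning part that follows a tool part.
function injectBoundaries(parts: Part[]): Part[] {
  const out: Part[] = [];
  let sawTool = false;
  for (const part of parts) {
    if (part.type === "step-start" || part.type === "step-finish") sawTool = false;
    else if (part.type === "tool") sawTool = true;
    else if (sawTool && (part.type === "text" || part.type === "reasoning")) {
      out.push({ type: "step-start" }); // synthetic boundary
      sawTool = false;
    }
    out.push(part);
  }
  return out;
}

const repaired = injectBoundaries([
  { type: "step-start" }, { type: "text" }, { type: "tool" },
  { type: "text" }, { type: "tool" }, // corrupted: no boundary before this group
]);
// → step-start, text, tool, step-start, text, tool
```

Because this runs at reconstruction time, it also repairs messages that were already corrupted before the stream-side fixes land.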

Related Issues

Environment

  • Provider: Anthropic (direct API)
  • Model: claude-opus-4-6 with adaptive thinking
  • OS: Linux (Ubuntu 22.04)
  • OpenCode version: dev build (latest dev branch)

Labels

core — Anything pertaining to core functionality of the application (opencode server stuff)
