
[Agent refactor] Context management: boundaries, compression, and token budgeting #1439

@is-Xiaoen


Context

This addresses track 6 of the agent refactor (#1216):

  • define history / summary / runtime / system prompt boundaries
  • define compression triggers and strategies
  • define what belongs to session context and what does not

I've been working in this area through the session persistence track (#732, #1170) and spent time reading the compression and context-building code. Below is what I found, and a proposal for how to clarify these boundaries.


Current state

Context management is currently spread across three locations with implicit boundaries between them:

  • context.go:BuildMessages() — assembles system prompt (cached static + dynamic) + summary + history + current message into []Message. Runs sanitizeHistoryForProvider() to drop orphaned tool pairs at read time.
  • loop.go:maybeSummarize() — checks two conditions after each turn: len(history) > SummarizeMessageThreshold (default 20) or estimateTokens(history) > ContextWindow * SummarizeTokenPercent / 100. If either is true, fires a background goroutine to run summarizeSession().
  • loop.go:forceCompression() — called reactively when the LLM returns a context-window error. Drops the oldest 50% of conversation messages, appends an emergency note to the system prompt.

There is no explicit model of how much context space is available, what fills it, or when compression should happen relative to the actual budget.
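The two summarization triggers described above can be sketched as follows (constant and function names are reconstructions from this description, not the actual code):

```go
package main

import "fmt"

const (
	summarizeMessageThreshold = 20 // default, per the description above
	summarizeTokenPercent     = 75 // percent of ContextWindow
)

// shouldSummarize mirrors the two conditions maybeSummarize checks
// after each turn: message count, or estimated token usage relative
// to the context window.
func shouldSummarize(historyLen, estimatedTokens, contextWindow int) bool {
	return historyLen > summarizeMessageThreshold ||
		estimatedTokens > contextWindow*summarizeTokenPercent/100
}

func main() {
	// With ContextWindow wrongly defaulting to MaxTokens (32768),
	// the token trigger fires at only 24576 estimated tokens.
	fmt.Println(shouldSummarize(5, 25000, 32768)) // true — far too early
}
```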


Specific problems

1. ContextWindow defaults to MaxTokens

In instance.go:227:

ContextWindow: maxTokens,

MaxTokens is the max output tokens (default 32768 in defaults.go:33), passed to the LLM as the max_tokens request parameter (loop.go:930). But ContextWindow should represent the model's input capacity — typically 128K+ for modern models.

Setting ContextWindow = maxTokens means:

  • maybeSummarize threshold = 32768 * 75 / 100 = 24576 estimated tokens
  • History gets summarized far too early, well before the model's actual context limit is reached
  • Conversely, if a user raises max_tokens to a large value, summarization never triggers at all

PR #556 identified the same issue.

2. forceCompression can orphan tool pairs

forceCompression() slices conversation at mid = len(conversation) / 2 (loop.go:1355) without checking whether the cut falls between an assistant message with ToolCalls and its matching tool result messages.

The read-path defense (sanitizeHistoryForProvider at context.go:577) catches orphaned pairs at query time, but the stored session history remains corrupted — tool messages without their matching assistant predecessor, or assistant messages with tool_calls but no results following. PR #665 identified this gap.
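A minimal repro of the orphaning, using a simplified stand-in for the message type (field names here are assumptions, not the real providers.Message):

```go
package main

import "fmt"

// Message is a simplified stand-in; real messages carry more fields.
type Message struct {
	Role      string
	ToolCalls []string // non-empty: the assistant issued tool calls
}

func main() {
	conversation := []Message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: []string{"read_file"}},
		{Role: "tool"}, // result for the call above
		{Role: "user"},
	}
	// forceCompression-style cut: drop the oldest half.
	mid := len(conversation) / 2 // == 2
	kept := conversation[mid:]
	// The kept history now begins with a tool result whose assistant
	// predecessor was dropped — exactly the corruption described above.
	fmt.Println(kept[0].Role) // "tool"
}
```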

3. Compression is reactive, not proactive

forceCompression only runs after the LLM already rejected the request with a context-window error (loop.go:1009-1027). This means:

  • The user sees "Context window exceeded. Compressing history and retrying..." — disruptive
  • The emergency drop is blunt — 50% of messages gone without summarization
  • A failed LLM call is wasted (and billed) before we realize the context was too large

A proactive check before the LLM call would prevent this entirely for the common case.

4. Token estimation undercounts

estimateTokens() (loop.go:1691) only counts utf8.RuneCountInString(m.Content). It ignores:

  • ToolCalls — function name + JSON arguments can be substantial (complex tool args easily add thousands of tokens)
  • The system prompt — built separately in BuildMessages, not included in the history estimate
  • Tool definitions — injected by the provider adapter, invisible to the estimator

The summarization threshold check in maybeSummarize compares this undercounted estimate against ContextWindow, so the check misfires in both directions: the undercount delays compression relative to real usage, while the wrong ContextWindow default (problem 1) triggers it far too early.


Proposal

The goal is to clarify existing implicit boundaries, not introduce new abstractions. This follows the refactor's "minimum concepts" rule — no new types unless the current code cannot be clarified without them.

A. Separate context_window from max_tokens in config

Add context_window as an explicit field in AgentDefaults. Default to 0, meaning "fall back to a safe default" (e.g. 131072). This lets ContextWindow and MaxTokens serve their actual distinct purposes:

  • MaxTokens → max output tokens per LLM call
  • ContextWindow → total input capacity of the model

A follow-up improvement could auto-detect context window from the provider, but that's not needed for the initial fix.
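The fallback rule could be as simple as the following sketch (the constant and function names are illustrative, not proposed identifiers):

```go
package main

import "fmt"

// defaultContextWindow is the proposed safe fallback when the user
// does not configure context_window explicitly.
const defaultContextWindow = 131072

// resolveContextWindow applies the "0 means fall back" rule for the
// proposed context_window config field.
func resolveContextWindow(configured int) int {
	if configured > 0 {
		return configured
	}
	return defaultContextWindow
}

func main() {
	fmt.Println(resolveContextWindow(0))      // 131072 — fallback
	fmt.Println(resolveContextWindow(200000)) // 200000 — explicit value wins
}
```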

B. Compute the available history budget explicitly

After building the system prompt, we know the fixed overhead. The available space for history becomes a simple subtraction:

fixed = tokenEstimate(systemPrompt) + tokenEstimate(toolDefinitions)
reserve = maxTokens  // leave room for model output
historyBudget = contextWindow - fixed - reserve

This historyBudget replaces the current ContextWindow * SummarizeTokenPercent / 100 as the compression threshold. No new types — just making the arithmetic explicit and correct.
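As a Go sketch of that arithmetic (the clamp at zero is my addition, to guard against a window smaller than the fixed overhead; it is not part of the proposal text):

```go
package main

import "fmt"

// historyBudget computes the token space left for conversation history
// after subtracting the fixed overhead (system prompt + tool definitions)
// and the reserve for model output.
func historyBudget(contextWindow, fixed, reserve int) int {
	b := contextWindow - fixed - reserve
	if b < 0 {
		return 0 // assumed clamp: never report a negative budget
	}
	return b
}

func main() {
	// e.g. a 131072-token window, ~4000 tokens of system prompt and
	// tool definitions, 32768 reserved for output:
	fmt.Println(historyBudget(131072, 4000, 32768)) // 94304
}
```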

C. Proactive pre-call check

Before calling the LLM in runLLMIteration, estimate the total token cost of the assembled messages slice. If it exceeds contextWindow - reserve, run summarization before the call.

forceCompression stays as a last-resort fallback for cases where the estimate was too low. But it should stop being the primary compression path.
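The guard could be sketched like this (the function shape is hypothetical; estimate and summarize stand in for the real estimateTokens and summarizeSession):

```go
package main

import "fmt"

// preCallCheck sketches the proactive guard: if the assembled messages
// would not fit alongside the output reserve, summarize before calling
// the LLM instead of waiting for a context-window error.
func preCallCheck(estimated, contextWindow, reserve int, summarize func()) bool {
	if estimated > contextWindow-reserve {
		summarize() // compress up front, not after a failed (billed) call
		return true
	}
	return false
}

func main() {
	ran := false
	preCallCheck(120000, 131072, 32768, func() { ran = true })
	fmt.Println(ran) // true: 120000 > 98304, so summarization runs first
}
```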

D. Tool-pair-aware truncation

When forceCompression or summarizeSession truncates history, ensure the cut point does not fall inside a [assistant+tool_calls, tool_result, ...] group. If it does, move the cut before the group start. This gives write-path protection to complement the existing read-path sanitization.
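One way to pick a safe cut point, again with a simplified message type (field names assumed): back the cut up past any tool results so the group's assistant message is kept with them.

```go
package main

import "fmt"

// Message is a simplified stand-in; real messages carry more fields.
type Message struct {
	Role      string
	ToolCalls []string
}

// safeCut adjusts a proposed cut index (history is kept from the cut
// onward) so it never lands inside a tool-call group: if the message
// at the cut is a tool result, move the cut back to the assistant
// message that issued the calls, keeping the pair intact.
func safeCut(msgs []Message, cut int) int {
	for cut > 0 && msgs[cut].Role == "tool" {
		cut--
	}
	return cut
}

func main() {
	conversation := []Message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: []string{"read_file"}},
		{Role: "tool"},
		{Role: "user"},
	}
	mid := len(conversation) / 2 // 2 — would orphan the tool result
	fmt.Println(safeCut(conversation, mid)) // 1 — cut before the group
}
```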

E. Include ToolCalls in token estimation

Extend estimateTokens to account for m.ToolCalls — serialize function name and arguments into the character count. A rough estimate is still better than ignoring them.
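A sketch of the extension — the message shape and the 4-characters-per-token divisor are assumptions for illustration, not the codebase's actual heuristic:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ToolCall and Message are simplified stand-ins (fields assumed).
type ToolCall struct {
	Name string
	Args string // JSON-encoded arguments
}

type Message struct {
	Content   string
	ToolCalls []ToolCall
}

// estimateTokens counts runes in Content and, per the proposal, also
// in each tool call's name and serialized arguments, then applies a
// rough characters-per-token divisor (assumed here to be 4).
func estimateTokens(msgs []Message) int {
	chars := 0
	for _, m := range msgs {
		chars += utf8.RuneCountInString(m.Content)
		for _, tc := range m.ToolCalls {
			chars += utf8.RuneCountInString(tc.Name)
			chars += utf8.RuneCountInString(tc.Args)
		}
	}
	return chars / 4
}

func main() {
	msgs := []Message{{
		Content:   "run the search",
		ToolCalls: []ToolCall{{Name: "grep", Args: `{"pattern":"budget"}`}},
	}}
	// Tool-call characters now count toward the estimate instead of
	// being invisible to it.
	fmt.Println(estimateTokens(msgs))
}
```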


What this does NOT propose

This is a boundary-clarification and correctness track, not a feature expansion.


Relationship to other tracks

Intentionally independent. Context management operates on []providers.Message and integer token counts. It does not depend on the Agent abstraction (track 1), EventBus (track 3), persona assembly (track 4), or capability model (track 5). If the AgentLoop lifecycle changes (track 2/3), the call sites may shift, but the budget logic itself is unaffected.


Related issues and PRs

  • #1216 — the agent refactor umbrella (this is track 6)
  • #732, #1170 — session persistence track
  • PR #556 — same ContextWindow / MaxTokens conflation (problem 1)
  • PR #665 — the forceCompression tool-pair gap (problem 2)

I'd be happy to take this on if it fits the refactor direction. My plan would be to start with a working note in docs/agent-refactor/context.md covering the boundary definitions, then follow up with implementation PRs targeting the refactor/agent branch.
