
[Agent refactor] Context management: boundaries, compression, and token budgeting #1439

@is-Xiaoen


Context

This addresses track 6 of the agent refactor (#1216):

  • define history / summary / runtime / system prompt boundaries
  • define compression triggers and strategies
  • define what belongs to session context and what does not

I've been working in this area through the session persistence track (#732, #1170) and spent time reading the compression and context-building code. Below is what I found, and a proposal for how to clarify these boundaries.


Current state

Context management is currently spread across three locations with implicit boundaries between them:

  • context.go:BuildMessages() — assembles system prompt (cached static + dynamic) + summary + history + current message into []Message. Runs sanitizeHistoryForProvider() to drop orphaned tool pairs at read time.
  • loop.go:maybeSummarize() — checks two conditions after each turn: len(history) > SummarizeMessageThreshold (default 20) or estimateTokens(history) > ContextWindow * SummarizeTokenPercent / 100. If either is true, fires a background goroutine to run summarizeSession().
  • loop.go:forceCompression() — called reactively when the LLM returns a context-window error. Drops the oldest 50% of conversation messages, appends an emergency note to the system prompt.

There is no explicit model of how much context space is available, what fills it, or when compression should happen relative to the actual budget.
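The two summarization triggers described above can be sketched as follows (constant and function names are reconstructions from this description, not the actual code):

```go
package main

import "fmt"

const (
	summarizeMessageThreshold = 20 // default, per the description above
	summarizeTokenPercent     = 75 // percent of ContextWindow
)

// shouldSummarize mirrors the two conditions maybeSummarize checks
// after each turn: message count, or estimated token usage relative
// to the context window.
func shouldSummarize(historyLen, estimatedTokens, contextWindow int) bool {
	return historyLen > summarizeMessageThreshold ||
		estimatedTokens > contextWindow*summarizeTokenPercent/100
}

func main() {
	// With ContextWindow wrongly defaulting to MaxTokens (32768),
	// the token trigger fires at only 24576 estimated tokens.
	fmt.Println(shouldSummarize(5, 25000, 32768)) // true — far too early
}
```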


Specific problems

1. ContextWindow defaults to MaxTokens

In instance.go:227:

ContextWindow: maxTokens,

MaxTokens is the max output tokens (default 32768 in defaults.go:33), passed to the LLM as the max_tokens request parameter (loop.go:930). But ContextWindow should represent the model's input capacity — typically 128K+ for modern models.

Setting ContextWindow = maxTokens means:

  • maybeSummarize threshold = 32768 * 75 / 100 = 24576 estimated tokens
  • History gets summarized far too early, well before the model's actual context limit is reached
  • Conversely, if a user raises max_tokens to a large value, summarization never triggers at all

PR #556 identified the same issue.

2. forceCompression can orphan tool pairs

forceCompression() slices conversation at mid = len(conversation) / 2 (loop.go:1355) without checking whether the cut falls between an assistant message with ToolCalls and its matching tool result messages.

The read-path defense (sanitizeHistoryForProvider at context.go:577) catches orphaned pairs at query time, but the stored session history remains corrupted — tool messages without their matching assistant predecessor, or assistant messages with tool_calls but no results following. PR #665 identified this gap.
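A minimal repro of the orphaning, using a simplified stand-in for the message type (field names here are assumptions, not the real providers.Message):

```go
package main

import "fmt"

// Message is a simplified stand-in; real messages carry more fields.
type Message struct {
	Role      string
	ToolCalls []string // non-empty: the assistant issued tool calls
}

func main() {
	conversation := []Message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: []string{"read_file"}},
		{Role: "tool"}, // result for the call above
		{Role: "user"},
	}
	// forceCompression-style cut: drop the oldest half.
	mid := len(conversation) / 2 // == 2
	kept := conversation[mid:]
	// The kept history now begins with a tool result whose assistant
	// predecessor was dropped — exactly the corruption described above.
	fmt.Println(kept[0].Role) // "tool"
}
```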

3. Compression is reactive, not proactive

forceCompression only runs after the LLM already rejected the request with a context-window error (loop.go:1009-1027). This means:

  • The user sees "Context window exceeded. Compressing history and retrying..." — disruptive
  • The emergency drop is blunt — 50% of messages gone without summarization
  • A failed LLM call is wasted (and billed) before we realize the context was too large

A proactive check before the LLM call would prevent this entirely for the common case.

4. Token estimation undercounts

estimateTokens() (loop.go:1691) only counts utf8.RuneCountInString(m.Content). It ignores:

  • ToolCalls — function name + JSON arguments can be substantial (complex tool args easily add thousands of tokens)
  • The system prompt — built separately in BuildMessages, not included in the history estimate
  • Tool definitions — injected by the provider adapter, invisible to the estimator

The summarization threshold check in maybeSummarize compares this undercounted estimate against ContextWindow, so the check misfires in both directions: the undercount delays compression relative to real usage, while the wrong ContextWindow default (problem 1) triggers it far too early.


Proposal

The goal is to clarify existing implicit boundaries, not introduce new abstractions. This follows the refactor's "minimum concepts" rule — no new types unless the current code cannot be clarified without them.

A. Separate context_window from max_tokens in config

Add context_window as an explicit field in AgentDefaults. Default to 0, meaning "fall back to a safe default" (e.g. 131072). This lets ContextWindow and MaxTokens serve their actual distinct purposes:

  • MaxTokens → max output tokens per LLM call
  • ContextWindow → total input capacity of the model

A follow-up improvement could auto-detect context window from the provider, but that's not needed for the initial fix.
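The fallback rule could be as simple as the following sketch (the constant and function names are illustrative, not proposed identifiers):

```go
package main

import "fmt"

// defaultContextWindow is the proposed safe fallback when the user
// does not configure context_window explicitly.
const defaultContextWindow = 131072

// resolveContextWindow applies the "0 means fall back" rule for the
// proposed context_window config field.
func resolveContextWindow(configured int) int {
	if configured > 0 {
		return configured
	}
	return defaultContextWindow
}

func main() {
	fmt.Println(resolveContextWindow(0))      // 131072 — fallback
	fmt.Println(resolveContextWindow(200000)) // 200000 — explicit value wins
}
```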

B. Compute the available history budget explicitly

After building the system prompt, we know the fixed overhead. The available space for history becomes a simple subtraction:

fixed = tokenEstimate(systemPrompt) + tokenEstimate(toolDefinitions)
reserve = maxTokens  // leave room for model output
historyBudget = contextWindow - fixed - reserve

This historyBudget replaces the current ContextWindow * SummarizeTokenPercent / 100 as the compression threshold. No new types — just making the arithmetic explicit and correct.
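As a Go sketch of that arithmetic (the clamp at zero is my addition, to guard against a window smaller than the fixed overhead; it is not part of the proposal text):

```go
package main

import "fmt"

// historyBudget computes the token space left for conversation history
// after subtracting the fixed overhead (system prompt + tool definitions)
// and the reserve for model output.
func historyBudget(contextWindow, fixed, reserve int) int {
	b := contextWindow - fixed - reserve
	if b < 0 {
		return 0 // assumed clamp: never report a negative budget
	}
	return b
}

func main() {
	// e.g. a 131072-token window, ~4000 tokens of system prompt and
	// tool definitions, 32768 reserved for output:
	fmt.Println(historyBudget(131072, 4000, 32768)) // 94304
}
```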

C. Proactive pre-call check

Before calling the LLM in runLLMIteration, estimate the total token cost of the assembled messages slice. If it exceeds contextWindow - reserve, run summarization before the call.

forceCompression stays as a last-resort fallback for cases where the estimate was too low. But it should stop being the primary compression path.
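The guard could be sketched like this (the function shape is hypothetical; estimate and summarize stand in for the real estimateTokens and summarizeSession):

```go
package main

import "fmt"

// preCallCheck sketches the proactive guard: if the assembled messages
// would not fit alongside the output reserve, summarize before calling
// the LLM instead of waiting for a context-window error.
func preCallCheck(estimated, contextWindow, reserve int, summarize func()) bool {
	if estimated > contextWindow-reserve {
		summarize() // compress up front, not after a failed (billed) call
		return true
	}
	return false
}

func main() {
	ran := false
	preCallCheck(120000, 131072, 32768, func() { ran = true })
	fmt.Println(ran) // true: 120000 > 98304, so summarization runs first
}
```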

D. Tool-pair-aware truncation

When forceCompression or summarizeSession truncates history, ensure the cut point does not fall inside a [assistant+tool_calls, tool_result, ...] group. If it does, move the cut before the group start. This gives write-path protection to complement the existing read-path sanitization.
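One way to pick a safe cut point, again with a simplified message type (field names assumed): back the cut up past any tool results so the group's assistant message is kept with them.

```go
package main

import "fmt"

// Message is a simplified stand-in; real messages carry more fields.
type Message struct {
	Role      string
	ToolCalls []string
}

// safeCut adjusts a proposed cut index (history is kept from the cut
// onward) so it never lands inside a tool-call group: if the message
// at the cut is a tool result, move the cut back to the assistant
// message that issued the calls, keeping the pair intact.
func safeCut(msgs []Message, cut int) int {
	for cut > 0 && msgs[cut].Role == "tool" {
		cut--
	}
	return cut
}

func main() {
	conversation := []Message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: []string{"read_file"}},
		{Role: "tool"},
		{Role: "user"},
	}
	mid := len(conversation) / 2 // 2 — would orphan the tool result
	fmt.Println(safeCut(conversation, mid)) // 1 — cut before the group
}
```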

E. Include ToolCalls in token estimation

Extend estimateTokens to account for m.ToolCalls — serialize function name and arguments into the character count. A rough estimate is still better than ignoring them.
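A sketch of the extension — the message shape and the 4-characters-per-token divisor are assumptions for illustration, not the codebase's actual heuristic:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ToolCall and Message are simplified stand-ins (fields assumed).
type ToolCall struct {
	Name string
	Args string // JSON-encoded arguments
}

type Message struct {
	Content   string
	ToolCalls []ToolCall
}

// estimateTokens counts runes in Content and, per the proposal, also
// in each tool call's name and serialized arguments, then applies a
// rough characters-per-token divisor (assumed here to be 4).
func estimateTokens(msgs []Message) int {
	chars := 0
	for _, m := range msgs {
		chars += utf8.RuneCountInString(m.Content)
		for _, tc := range m.ToolCalls {
			chars += utf8.RuneCountInString(tc.Name)
			chars += utf8.RuneCountInString(tc.Args)
		}
	}
	return chars / 4
}

func main() {
	msgs := []Message{{
		Content:   "run the search",
		ToolCalls: []ToolCall{{Name: "grep", Args: `{"pattern":"budget"}`}},
	}}
	// Tool-call characters now count toward the estimate instead of
	// being invisible to it.
	fmt.Println(estimateTokens(msgs))
}
```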


What this does NOT propose

This is a boundary-clarification and correctness track, not a feature expansion.


Relationship to other tracks

Intentionally independent. Context management operates on []providers.Message and integer token counts. It does not depend on the Agent abstraction (track 1), EventBus (track 3), persona assembly (track 4), or capability model (track 5). If the AgentLoop lifecycle changes (track 2/3), the call sites may shift, but the budget logic itself is unaffected.


Related issues and PRs

  • #1216 — the agent refactor umbrella (this is track 6)
  • #732, #1170 — session persistence track
  • PR #556 — same ContextWindow / MaxTokens conflation (problem 1)
  • PR #665 — the forceCompression tool-pair gap (problem 2)

I'd be happy to take this on if it fits the refactor direction. My plan would be to start with a working note in docs/agent-refactor/context.md covering the boundary definitions, then follow up with implementation PRs targeting the refactor/agent branch.
