## Context

This addresses track 6 of the agent refactor (#1216):

- define history / summary / runtime / system prompt boundaries
- define compression triggers and strategies
- define what belongs to session context and what does not

I've been working in this area through the session persistence track (#732, #1170) and spent time reading the compression and context-building code. Below is what I found, and a proposal for how to clarify these boundaries.
## Current state

Context management is currently spread across three locations with implicit boundaries between them:

- `context.go:BuildMessages()` — assembles system prompt (cached static + dynamic) + summary + history + current message into `[]Message`. Runs `sanitizeHistoryForProvider()` to drop orphaned tool pairs at read time.
- `loop.go:maybeSummarize()` — checks two conditions after each turn: `len(history) > SummarizeMessageThreshold` (default 20) or `estimateTokens(history) > ContextWindow * SummarizeTokenPercent / 100`. If either is true, fires a background goroutine to run `summarizeSession()`.
- `loop.go:forceCompression()` — called reactively when the LLM returns a context-window error. Drops the oldest 50% of conversation messages and appends an emergency note to the system prompt.

There is no explicit model of how much context space is available, what fills it, or when compression should happen relative to the actual budget.
## Specific problems

### 1. `ContextWindow` defaults to `MaxTokens`

In `instance.go:227`:

```go
ContextWindow: maxTokens,
```

`MaxTokens` is the max output tokens (default 32768 in `defaults.go:33`), passed to the LLM as the `max_tokens` request parameter (`loop.go:930`). But `ContextWindow` should represent the model's input capacity — typically 128K+ for modern models.

Setting `ContextWindow = maxTokens` means:

- the `maybeSummarize` threshold becomes `32768 * 75 / 100 = 24576` estimated tokens
- history gets summarized far too early, well before the model's actual context limit is reached
- conversely, if a user raises `max_tokens` to a large value, summarization never triggers at all

PR #556 identified the same issue.
### 2. `forceCompression` can orphan tool pairs

`forceCompression()` slices the conversation at `mid = len(conversation) / 2` (`loop.go:1355`) without checking whether the cut falls between an assistant message with `ToolCalls` and its matching `tool` result messages.

The read-path defense (`sanitizeHistoryForProvider` at `context.go:577`) catches orphaned pairs at query time, but the stored session history remains corrupted — tool messages without their matching assistant predecessor, or assistant messages with `tool_calls` but no results following. PR #665 identified this gap.
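To make the failure concrete, here is a minimal sketch of how a blind midpoint slice strands a tool result. The `Message` type is a stand-in for the real provider message type, not code from the repo:

```go
package main

import "fmt"

// Message is a minimal stand-in for the provider message type;
// the fields here are illustrative.
type Message struct {
	Role      string // "user", "assistant", or "tool"
	ToolCalls int    // number of tool calls carried (0 for none)
}

func main() {
	conversation := []Message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: 2}, // calls two tools
		{Role: "tool"},                    // result #1
		{Role: "tool"},                    // result #2
		{Role: "user"},
		{Role: "assistant"},
	}
	// The naive cut keeps conversation[mid:], as the current code does.
	mid := len(conversation) / 2 // mid == 3, inside the tool-result group
	kept := conversation[mid:]
	// kept[0] is a tool result whose assistant message was dropped: an orphan.
	fmt.Println(kept[0].Role) // prints "tool"
}
```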
### 3. Compression is reactive, not proactive

`forceCompression` only runs after the LLM already rejected the request with a context-window error (`loop.go:1009-1027`). This means:

- the user sees "Context window exceeded. Compressing history and retrying..." — disruptive
- the emergency drop is blunt — 50% of messages gone without summarization
- a failed LLM call is wasted (and billed) before we realize the context was too large

A proactive check before the LLM call would prevent this entirely for the common case.
### 4. Token estimation undercounts

`estimateTokens()` (`loop.go:1691`) only counts `utf8.RuneCountInString(m.Content)`. It ignores:

- `ToolCalls` — function name + JSON arguments can be substantial (complex tool args easily add thousands of tokens)
- the system prompt — built separately in `BuildMessages`, not included in the history estimate
- tool definitions — injected by the provider adapter, invisible to the estimator

The summarization threshold check in `maybeSummarize` compares against `ContextWindow` using this undercount, so the check is weaker than intended in both directions.
## Proposal
The goal is to clarify existing implicit boundaries, not introduce new abstractions. This follows the refactor's "minimum concepts" rule — no new types unless the current code cannot be clarified without them.
### A. Separate `context_window` from `max_tokens` in config

Add `context_window` as an explicit field in `AgentDefaults`. Default to 0, meaning "fall back to a safe default" (e.g. 131072). This lets `ContextWindow` and `MaxTokens` serve their actual distinct purposes:

- `MaxTokens` → max output tokens per LLM call
- `ContextWindow` → total input capacity of the model

A follow-up improvement could auto-detect the context window from the provider, but that's not needed for the initial fix.
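A sketch of the zero-value fallback, assuming an illustrative field layout and fallback constant (the real `AgentDefaults` struct will differ):

```go
package main

import "fmt"

// defaultContextWindow is the assumed safe fallback from the proposal.
const defaultContextWindow = 131072

// AgentDefaults sketches the two fields with their distinct purposes.
type AgentDefaults struct {
	MaxTokens     int // max output tokens per LLM call (max_tokens)
	ContextWindow int // total input capacity; 0 means "use the fallback"
}

// resolveContextWindow applies the proposed zero-value fallback.
func resolveContextWindow(d AgentDefaults) int {
	if d.ContextWindow > 0 {
		return d.ContextWindow
	}
	return defaultContextWindow
}

func main() {
	fmt.Println(resolveContextWindow(AgentDefaults{MaxTokens: 32768}))      // prints 131072
	fmt.Println(resolveContextWindow(AgentDefaults{ContextWindow: 200000})) // prints 200000
}
```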
### B. Compute the available history budget explicitly

After building the system prompt, we know the fixed overhead. The available space for history becomes a simple subtraction:

```
fixed = tokenEstimate(systemPrompt) + tokenEstimate(toolDefinitions)
reserve = maxTokens // leave room for model output
historyBudget = contextWindow - fixed - reserve
```

This `historyBudget` replaces the current `ContextWindow * SummarizeTokenPercent / 100` as the compression threshold. No new types — just making the arithmetic explicit and correct.
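The arithmetic above can be written as one small function. The name and the clamp-to-zero behavior are illustrative assumptions, not existing code:

```go
package main

import "fmt"

// computeHistoryBudget makes the proposed subtraction explicit.
func computeHistoryBudget(contextWindow, systemPromptTokens, toolDefTokens, maxTokens int) int {
	fixed := systemPromptTokens + toolDefTokens
	reserve := maxTokens // leave room for model output
	budget := contextWindow - fixed - reserve
	if budget < 0 {
		budget = 0 // fixed overhead alone already exceeds the window
	}
	return budget
}

func main() {
	// e.g. 128K window, ~4K system prompt, ~2K tool defs, 8K output reserve
	fmt.Println(computeHistoryBudget(131072, 4096, 2048, 8192)) // prints 116736
}
```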
### C. Proactive pre-call check

Before calling the LLM in `runLLMIteration`, estimate the total token cost of the assembled `messages` slice. If it exceeds `contextWindow - reserve`, run summarization before the call.

`forceCompression` stays as a last-resort fallback for cases where the estimate was too low, but it should stop being the primary compression path.
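The pre-call check itself is a one-line comparison; a hedged sketch, with an assumed function name and signature:

```go
package main

import "fmt"

// needsCompression reports whether the assembled request should be
// summarized before the LLM call. estimatedTokens covers the full messages
// slice (system prompt + summary + history + current message).
func needsCompression(estimatedTokens, contextWindow, maxTokens int) bool {
	reserve := maxTokens // same output reserve as the budget arithmetic
	return estimatedTokens > contextWindow-reserve
}

func main() {
	// 131072-token window with an 8192-token output reserve
	fmt.Println(needsCompression(120000, 131072, 8192)) // prints false
	fmt.Println(needsCompression(125000, 131072, 8192)) // prints true
}
```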
### D. Tool-pair-aware truncation

When `forceCompression` or `summarizeSession` truncates history, ensure the cut point does not fall inside an `[assistant+tool_calls, tool_result, ...]` group. If it does, move the cut before the group start. This gives write-path protection to complement the existing read-path sanitization.
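One way to implement the adjustment is to walk the proposed cut point backwards off any tool results, so the group is kept or dropped as a unit. A hypothetical sketch with a stand-in `Message` type:

```go
package main

import "fmt"

// Message is a minimal stand-in for the provider message type.
type Message struct {
	Role      string // "user", "assistant", or "tool"
	ToolCalls int    // number of tool calls carried (0 for none)
}

// safeCutIndex moves a proposed cut point (the index of the first kept
// message) backwards until it no longer lands on a tool result, so the
// assistant message that issued the tool calls is kept with its results.
func safeCutIndex(history []Message, cut int) int {
	for cut > 0 && cut < len(history) && history[cut].Role == "tool" {
		cut--
	}
	return cut
}

func main() {
	history := []Message{
		{Role: "user"},
		{Role: "assistant", ToolCalls: 2},
		{Role: "tool"},
		{Role: "tool"},
		{Role: "user"},
		{Role: "assistant"},
	}
	mid := len(history) / 2 // 3, inside the tool-result group
	fmt.Println(safeCutIndex(history, mid)) // prints 1: cut moved before the group
}
```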
### E. Include `ToolCalls` in token estimation

Extend `estimateTokens` to account for `m.ToolCalls` — serialize the function name and arguments into the character count. A rough estimate is still better than ignoring them.
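A sketch of the extended count, keeping the existing rune-count heuristic and only adding the tool-call fields. The types and field names are stand-ins for the real provider types:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ToolCall and Message are minimal stand-ins for the provider types.
type ToolCall struct {
	Name      string
	Arguments string // raw JSON arguments
}

type Message struct {
	Content   string
	ToolCalls []ToolCall
}

// estimateChars counts runes in content plus tool-call names and arguments.
// Converting characters to tokens stays as it is today; this only closes
// the ToolCalls blind spot.
func estimateChars(msgs []Message) int {
	total := 0
	for _, m := range msgs {
		total += utf8.RuneCountInString(m.Content)
		for _, tc := range m.ToolCalls {
			total += utf8.RuneCountInString(tc.Name) + utf8.RuneCountInString(tc.Arguments)
		}
	}
	return total
}

func main() {
	msgs := []Message{{
		Content:   "hello",
		ToolCalls: []ToolCall{{Name: "search", Arguments: `{"q":"x"}`}},
	}}
	fmt.Println(estimateChars(msgs)) // prints 20 (5 + 6 + 9)
}
```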
## What this does NOT propose

- no new types (no `ContextBudget` struct, no `ContextManager` interface)
- no changes to the `SessionStore` or `BuildMessages` API signatures

This is a boundary-clarification and correctness track, not a feature expansion.
## Relationship to other tracks

Intentionally independent. Context management operates on `[]providers.Message` and integer token counts. It does not depend on the Agent abstraction (track 1), EventBus (track 3), persona assembly (track 4), or capability model (track 5). If the AgentLoop lifecycle changes (track 2/3), the call sites may shift, but the budget logic itself is unaffected.
## Related issues and PRs

- #1216: agent refactor tracking issue (this is track 6)
- #732, #1170: session persistence track
- PR #556: identified the `ContextWindow = maxTokens` default issue
- PR #665: identified the orphaned-tool-pair gap in `forceCompression`

I'd be happy to take this on if it fits the refactor direction. My plan would be to start with a working note in `docs/agent-refactor/context.md` covering the boundary definitions, then follow up with implementation PRs targeting the `refactor/agent` branch.