feat(core): adaptive output token escalation (8K default + 64K retry) #2898
99% of model responses are under 5K tokens, but we previously reserved 32K for every request, over-reserving GPU slot capacity by ~4x. Now the default output limit is 8K. When a response hits this cap (stop_reason=max_tokens), it automatically retries once at 64K — only the ~1% of requests that actually need more tokens pay the cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
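The flow described above can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation — `generateWithEscalation`, `ModelResponse`, and the `call` callback are hypothetical names; only the two limits and the retry-on-truncation behavior come from the PR:

```typescript
const CAPPED_DEFAULT_MAX_TOKENS = 8_000; // low default, frees GPU slots
const ESCALATED_MAX_TOKENS = 64_000; // large slot for the ~1% retry

interface ModelResponse {
  stopReason: 'max_tokens' | 'stop';
  text: string;
}

// `call` stands in for whatever actually issues the API request.
async function generateWithEscalation(
  call: (maxTokens: number) => Promise<ModelResponse>,
): Promise<ModelResponse> {
  const first = await call(CAPPED_DEFAULT_MAX_TOKENS);
  if (first.stopReason !== 'max_tokens') {
    return first; // the common case: the response fit in 8K
  }
  // Truncated at the capped default: retry once with the large slot.
  return call(ESCALATED_MAX_TOKENS);
}
```

Only truncated responses pay for a second API call; everything else stays in the small slot.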
📋 Review Summary

This PR implements an adaptive output token escalation strategy to optimize GPU slot utilization by reducing the default output token reservation from 32K to 8K tokens, with an automatic retry at 64K when responses are truncated. The implementation is well-structured, maintains backward compatibility, and includes comprehensive test updates. Overall, this is a solid optimization that should significantly improve server concurrency.

🔍 General Feedback

🎯 Specific Feedback

🟡 High

🟢 Medium

🔵 Low

✅ Highlights
Thanks for the review. Addressing the High and Medium items:

High 1 —

High 2 —

```typescript
if (
  self.history.length > 0 &&
  self.history[self.history.length - 1].role === "model"
) {
  self.history.pop();
}
```

It only pops when the last entry is a model response. Additionally, there are no

Medium 1 —

Medium 2 — Escalated stream not wrapped in retry loop

Medium 3 — "99% of outputs" claim
- Add design doc covering problem, architecture, token limit determination, escalation mechanism, and design decisions
- Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
- Add max_tokens adaptive behavior explanation in model config section
tanzhenxin left a comment

Review
Nice optimization — "start small, escalate on demand" is the right approach, and the E2E results confirm the core flow works correctly. One item to flag.
Issues
1. Pre-existing: agent-core.ts and forkedQuery.ts don't clear accumulated state on RETRY
Recommend a follow-up PR.
Not introduced by this PR, but escalation increases the blast radius. Previously, RETRY only fired after error/partial streams where little state had accumulated. With escalation, RETRY fires after a complete successful stream — `functionCalls`, `roundText`, `roundThoughtText`, etc. in `agent-core.ts:427` are already populated, and the `continue` doesn't clear them. The escalated response then appends on top, producing duplicate tool calls and doubled text. Same pattern in `forkedQuery.ts:222`, where `fullText` keeps concatenating across both attempts. `Turn` already handles this correctly (`turn.ts:286`), but these two consumers don't.
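The accumulation bug can be reduced to a small sketch. The event loop, `StreamEvent` type, and `consume` function below are hypothetical — only the "reset accumulated state before `continue`" pattern reflects the fix the comment is asking for:

```typescript
// Hypothetical consumer loop. Without the reset, text (and tool calls)
// accumulated before a RETRY would be duplicated by the escalated stream.
type StreamEvent =
  | { type: 'text'; value: string }
  | { type: 'retry' };

function consume(events: StreamEvent[]): string {
  let roundText = '';
  for (const ev of events) {
    if (ev.type === 'retry') {
      roundText = ''; // the fix: drop state accumulated before escalation
      continue;
    }
    roundText += ev.value;
  }
  return roundText;
}
```

With the reset in place, only the escalated (second) attempt's output survives; dropping the reset line reproduces the doubled-text symptom described above.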
E2E Test Results (manual, CAPPED_DEFAULT_MAX_TOKENS=256)
| Test | Result | Details |
|---|---|---|
| Default cap applied | PASS | Cap is functionally active (proven by truncation in test 2) |
| Escalation retry | PASS | 2 API calls: first truncated (finish_reason: length), second with max_tokens: 64000 (finish_reason: stop). Complete output produced. |
| Debug log message | PASS | `Output truncated at capped default. Escalating to 64000 tokens.` confirmed in `~/.qwen/debug/` |
| User override bypasses cap | PASS | QWEN_CODE_MAX_OUTPUT_TOKENS=16000 → single API call, no escalation |
| No duplicate tool calls | PASS | list_directory appeared exactly once after escalation. RETRY event correctly cleared pendingToolCalls, pendingCitations, and debugResponses. |
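The override behavior exercised in test 4 corresponds to a precedence chain roughly like the following. The `resolveMaxTokens` function is a hypothetical sketch; the precedence itself (explicit user config > `QWEN_CODE_MAX_OUTPUT_TOKENS` > capped default) is what the PR describes:

```typescript
// Hypothetical sketch of the max_tokens resolution order:
// user-configured value > env var override > capped default.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

function resolveMaxTokens(userConfigured?: number): number {
  if (userConfigured !== undefined) {
    return userConfigured; // explicit user config always wins
  }
  const env = process.env.QWEN_CODE_MAX_OUTPUT_TOKENS;
  if (env !== undefined) {
    const parsed = Number(env);
    if (Number.isFinite(parsed) && parsed > 0) {
      return parsed; // env override bypasses the cap entirely
    }
  }
  return CAPPED_DEFAULT_MAX_TOKENS;
}
```

This also explains why test 4 sees a single API call: a user-provided limit never hits the capped default, so there is no truncation-triggered escalation path.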
Verdict
APPROVE — Core escalation logic works as designed. The consumer state cleanup (#1) is pre-existing and can be a follow-up.
Background
Currently every request reserves a fixed 32K output token slot, but 99% of responses are under 5K tokens. This over-reserves GPU slot capacity by ~4x, limiting server concurrency.
Approach
Adaptive "low default + escalate on truncation" strategy for `max_tokens`.

Flow: send with 8K → if `finish_reason === MAX_TOKENS` → auto-retry with 64K → only ~1% of requests use the large slot.

Changes
1. New constants (`tokenLimits.ts`)
   - `CAPPED_DEFAULT_MAX_TOKENS = 8_000` — low default for slot optimization
   - `ESCALATED_MAX_TOKENS = 64_000` — escalated limit for truncated requests
2. Lower default max_tokens (`default.ts`, `anthropicContentGenerator.ts`)
   - `max_tokens` reduced from 32K to 8K when no user config is set
   - `QWEN_CODE_MAX_OUTPUT_TOKENS` env var to override the default
   - User-configured `max_tokens` is unaffected (highest priority)
3. Auto-escalation retry (`geminiChat.ts`)
   - Detects `finishReason === MAX_TOKENS` from streamed chunks
   - Retries once with `maxOutputTokens: 64K`
4. RETRY event state cleanup (`turn.ts`)
   - Clears `pendingToolCalls`, `pendingCitations`, `debugResponses`, and `finishReason` on RETRY

Testing
Updated `default.test.ts`, `dashscope.test.ts`, and `anthropicContentGenerator.test.ts`.

🤖 Generated with Claude Code