
feat(core): adaptive output token escalation (8K default + 64K retry)#2898

Merged
wenshao merged 2 commits into QwenLM:main from wenshao:feat/adaptive-output-token-escalation
Apr 8, 2026

Conversation

@wenshao
Collaborator

@wenshao wenshao commented Apr 4, 2026

Background

Currently every request reserves a fixed 32K output token slot, but 99% of responses are under 5K tokens. This over-reserves GPU slot capacity by ~4x, limiting server concurrency.

Approach

Adaptive "low default + escalate on truncation" strategy:

| Phase | Output limit | Trigger |
| --- | --- | --- |
| Initial request | 8K | All requests |
| Escalated retry | 64K | Previous response truncated by max_tokens |

Flow: send with 8K → if finish_reason === MAX_TOKENS → auto-retry with 64K → only ~1% of requests use the large slot.

Changes

1. New constants (tokenLimits.ts)

  • CAPPED_DEFAULT_MAX_TOKENS = 8_000 — low default for slot optimization
  • ESCALATED_MAX_TOKENS = 64_000 — escalated limit for truncated requests

2. Lower default max_tokens (default.ts, anthropicContentGenerator.ts)

  • Default max_tokens reduced from 32K to 8K when no user config is set
  • New QWEN_CODE_MAX_OUTPUT_TOKENS env var to override the default
  • User-configured max_tokens is unaffected (highest priority)
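The constants and the priority order above can be sketched as follows. This is an illustrative sketch only; resolveMaxOutputTokens is a hypothetical helper name, not the actual code in default.ts:

```typescript
// Sketch of the new constants from tokenLimits.ts and the resolution order:
// user-configured max_tokens > QWEN_CODE_MAX_OUTPUT_TOKENS env var > capped default.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000; // low default for slot optimization
const ESCALATED_MAX_TOKENS = 64_000;     // limit used on the escalated retry

function resolveMaxOutputTokens(userMaxTokens?: number): number {
  if (userMaxTokens !== undefined && userMaxTokens > 0) {
    return userMaxTokens; // explicit user config always wins
  }
  const envValue = Number(process.env['QWEN_CODE_MAX_OUTPUT_TOKENS']);
  if (Number.isFinite(envValue) && envValue > 0) {
    return envValue; // operator override without a code change
  }
  return CAPPED_DEFAULT_MAX_TOKENS;
}
```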

3. Auto-escalation retry (geminiChat.ts)

  • Detects finishReason === MAX_TOKENS from streamed chunks
  • When conditions are met (no user/env override, not already escalated):
    • Removes the truncated model response from history
    • Yields a RETRY event (UI discards partial output)
    • Re-sends the same request with maxOutputTokens: 64K
  • Escalation is placed outside the retry loop so errors from the escalated stream propagate directly instead of being caught by retry logic
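The escalation conditions above reduce to a small predicate. This is a sketch; shouldEscalate and the literal finish-reason string are assumptions for illustration, not the actual geminiChat.ts implementation:

```typescript
type FinishReason = 'STOP' | 'MAX_TOKENS' | 'SAFETY' | undefined;

// Escalate only when: the stream ended because the capped default was hit,
// the user/env has not pinned max_tokens, and we have not already escalated
// (the retry happens at most once per request).
function shouldEscalate(
  finishReason: FinishReason,
  hasUserMaxTokensOverride: boolean,
  alreadyEscalated: boolean,
): boolean {
  return (
    finishReason === 'MAX_TOKENS' &&
    !hasUserMaxTokensOverride &&
    !alreadyEscalated
  );
}
```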

4. RETRY event state cleanup (turn.ts)

  • Clears pendingToolCalls, pendingCitations, debugResponses, and finishReason on RETRY
  • Prevents duplicate tool calls when the first truncated response contained completed tool calls and the escalated retry produces the same ones
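A minimal sketch of that cleanup, using the field names from the description above (the real Turn class in turn.ts carries more state than this):

```typescript
interface TurnState {
  pendingToolCalls: unknown[];
  pendingCitations: string[];
  debugResponses: unknown[];
  finishReason?: string;
}

// On RETRY, drop everything accumulated during the truncated attempt so the
// escalated retry cannot emit duplicate tool calls or carry stale metadata.
function resetOnRetry(state: TurnState): void {
  state.pendingToolCalls = [];
  state.pendingCitations = [];
  state.debugResponses = [];
  state.finishReason = undefined;
}
```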

Testing

  • All 780 existing tests pass
  • Updated default-value expectations in default.test.ts, dashscope.test.ts, and anthropicContentGenerator.test.ts

🤖 Generated with Claude Code

99% of model responses are under 5K tokens, but we previously reserved
32K for every request. This wastes GPU slot capacity by ~4x.

Now the default output limit is 8K. When a response hits this cap
(stop_reason=max_tokens), it automatically retries once at 64K — only
the ~1% of requests that actually need more tokens pay the cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Apr 4, 2026

📋 Review Summary

This PR implements an adaptive output token escalation strategy to optimize GPU slot utilization by reducing the default output token reservation from 32K to 8K tokens, with an automatic retry at 64K when responses are truncated. The implementation is well-structured, maintains backward compatibility, and includes comprehensive test updates. Overall, this is a solid optimization that should significantly improve server concurrency.

🔍 General Feedback

  • Strong architectural approach: The "low default + escalate on truncation" pattern is well-suited for the stated goal of reducing GPU slot over-reservation
  • Good separation of concerns: Token limit constants are centralized in tokenLimits.ts, while escalation logic lives in geminiChat.ts
  • Comprehensive test coverage: Test files are updated to reflect the new default values across all affected providers
  • Environment variable override: The QWEN_CODE_MAX_OUTPUT_TOKENS env var provides good operational flexibility
  • Consistent implementation: Token limit logic is applied consistently across Anthropic, DashScope, and Default OpenAI providers

🎯 Specific Feedback

🟡 High

  • File: packages/core/src/core/geminiChat.ts:316-320 - The hasUserMaxTokensOverride check reads from cgConfig?.samplingParams?.max_tokens, but this may not capture all user configuration paths. For example, if a user configures max_tokens directly in the request params rather than in the content generator config, this override detection would fail, causing the escalation logic to incorrectly apply. Consider also checking params.config?.maxOutputTokens in the override detection.

  • File: packages/core/src/core/geminiChat.ts:465-470 - The history cleanup logic (self.history.pop()) assumes the last item is always the partial model response, but if multiple concurrent operations are happening or if the history was modified elsewhere, this could remove the wrong entry. Add a safety check to verify the role before popping, or consider a more robust cleanup mechanism.

🟢 Medium

  • File: packages/core/src/core/geminiChat.ts:344-347 - The lastFinishReason tracking only captures the finish reason from the last chunk of the stream. If the stream yields multiple candidates or if finish reasons vary across chunks, this could miss important state. Consider tracking finish reasons from all candidates or adding a comment explaining why the last candidate's finish reason is authoritative.

  • File: packages/core/src/core/geminiChat.ts:454-491 - The escalation retry is placed outside the INVALID_CONTENT_RETRY_OPTIONS loop, which is correct for error propagation, but this means escalation doesn't benefit from the same retry semantics (e.g., backoff). If the escalated request also fails with a transient error, it won't be retried. Consider whether the escalated stream should also be wrapped in retry logic or if this is acceptable behavior.

  • File: packages/core/src/core/tokenLimits.ts:14-18 - The comment mentions "99% of outputs are under 5K tokens" and "4-6× slot capacity" - these are operational claims that should ideally be backed by telemetry data or moved to documentation rather than code comments. Consider adding a reference to telemetry dashboards or benchmarks.

🔵 Low

  • File: packages/core/src/core/geminiChat.ts:315-321 - Consider extracting the escalation state initialization into a small helper object or function for better readability:

    const escalationState = {
      maxTokensEscalated: false,
      hasUserMaxTokensOverride: ...,
      lastFinishReason: undefined as string | undefined,
    };
  • File: packages/core/src/core/geminiChat.ts:462 - The debug log message could be more actionable by including the current token count or the trigger condition. Consider: Output truncated (finishReason: ${lastFinishReason}). Escalating from ${CAPPED_DEFAULT_MAX_TOKENS} to ${ESCALATED_MAX_TOKENS} tokens.

  • File: packages/core/src/core/turn.ts:283-288 - The state cleanup on RETRY events clears pendingToolCalls, pendingCitations, debugResponses, and finishReason. Consider adding a comment explaining why each field needs to be cleared, especially for future maintainers who may add new state fields to the Turn class.

  • File: packages/core/src/core/openaiContentGenerator/provider/default.ts:144-146 - The comment "Capped default (8K) reduces GPU slot over-reservation by ~4×" uses a magic number. Consider using the constant CAPPED_DEFAULT_MAX_TOKENS in the comment for consistency: Capped default (${CAPPED_DEFAULT_MAX_TOKENS}) reduces...

✅ Highlights

  • Excellent documentation in PR description: The background, solution, and implementation details are clearly explained with a helpful table showing the escalation stages
  • Smart escalation design: The single-retry escalation pattern prevents runaway token consumption while still handling the <1% of requests that need more tokens
  • Environment variable support: Adding QWEN_CODE_MAX_OUTPUT_TOKENS provides operational flexibility without code changes
  • Comprehensive test updates: All affected test files (default.test.ts, dashscope.test.ts, anthropicContentGenerator.test.ts) have been updated to reflect the new default values
  • State cleanup on RETRY: The turn.ts changes properly handle state cleanup to prevent duplicate tool calls and stale metadata during escalation retries

@wenshao
Collaborator Author

wenshao commented Apr 4, 2026

Thanks for the review. Addressing the High and Medium items:

High 1 — hasUserMaxTokensOverride not checking params.config?.maxOutputTokens
Not needed. The only caller (turn.run()) sets params.config = { abortSignal: signal }, so maxOutputTokens is never present in the initial params. It only appears in the escalated retry params we construct ourselves (which is guarded by maxTokensEscalated).

High 2 — history.pop() could remove the wrong entry
The code already includes the safety check:

if (
    self.history.length > 0 &&
    self.history[self.history.length - 1].role === "model"
) {
    self.history.pop();
}

It only pops when the last entry is a model response. Additionally, there are no await points between processStreamResponse pushing the model entry and the escalation code checking it, so no concurrent modification is possible.

Medium 1 — lastFinishReason tracking
This codebase uses single-candidate responses. The finish reason appears in one chunk (the last one with candidates). The if (fr) lastFinishReason = fr pattern correctly captures the authoritative final value.
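That last-value-wins pattern can be sketched as follows (illustrative only — the chunk shape is simplified down to a finishReason field):

```typescript
// Only chunks that actually carry a finish reason update the tracked value,
// so the final chunk with candidates is authoritative.
function trackFinishReason(
  chunks: Array<{ finishReason?: string }>,
): string | undefined {
  let lastFinishReason: string | undefined;
  for (const chunk of chunks) {
    const fr = chunk.finishReason;
    if (fr) lastFinishReason = fr;
  }
  return lastFinishReason;
}
```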

Medium 2 — Escalated stream not wrapped in retry loop
By design. The escalated makeApiCallAndProcessStream already includes HTTP-level retries via retryWithBackoff (rate limits, 5xx). Adding stream-level retry on top of the escalation would over-engineer a path that is itself already a retry. If the escalated stream fails with a transient stream error, the error surfaces to the user — acceptable for a first implementation.

Medium 3 — "99% of outputs" claim
This comes from the upstream reference implementation (Claude Code) design doc where it was validated with production telemetry (BQ p99 output = 4,911 tokens).

- Add design doc covering problem, architecture, token limit
  determination, escalation mechanism, and design decisions
- Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
- Add max_tokens adaptive behavior explanation in model config section
Collaborator

@tanzhenxin tanzhenxin left a comment

Review

Nice optimization — "start small, escalate on demand" is the right approach, and the E2E results confirm the core flow works correctly. One item to flag.

Issues

1. Pre-existing: agent-core.ts and forkedQuery.ts don't clear accumulated state on RETRY

Recommend a follow-up PR.

Not introduced by this PR, but escalation increases the blast radius. Previously, RETRY only fired after error/partial streams where little state had accumulated. With escalation, RETRY fires after a complete successful stream — functionCalls, roundText, roundThoughtText, etc. in agent-core.ts:427 are already populated, and the continue doesn't clear them. The escalated response then appends on top, producing duplicate tool calls and doubled text. Same pattern in forkedQuery.ts:222 where fullText keeps concatenating across both attempts. Turn already handles this correctly (turn.ts:286), but these two consumers don't.

E2E Test Results (manual, CAPPED_DEFAULT_MAX_TOKENS=256)

| Test | Result | Details |
| --- | --- | --- |
| Default cap applied | PASS | Cap is functionally active (proven by truncation in test 2) |
| Escalation retry | PASS | 2 API calls: first truncated (finish_reason: length), second with max_tokens: 64000 (finish_reason: stop). Complete output produced. |
| Debug log message | PASS | "Output truncated at capped default. Escalating to 64000 tokens." confirmed in ~/.qwen/debug/ |
| User override bypasses cap | PASS | QWEN_CODE_MAX_OUTPUT_TOKENS=16000 → single API call, no escalation |
| No duplicate tool calls | PASS | list_directory appeared exactly once after escalation. RETRY event correctly cleared pendingToolCalls, pendingCitations, and debugResponses. |

Verdict

APPROVE — Core escalation logic works as designed. The consumer state cleanup (#1) is pre-existing and can be a follow-up.

@wenshao wenshao merged commit 1e8bc03 into QwenLM:main Apr 8, 2026
25 of 26 checks passed

Labels

DDAR DataWorks Data Agent Ready

2 participants