
feat(core): adaptive output token escalation (8K default + 64K retry)#2898

Merged
wenshao merged 2 commits into QwenLM:main from wenshao:feat/adaptive-output-token-escalation
Apr 8, 2026

Conversation

@wenshao
Collaborator

@wenshao wenshao commented Apr 4, 2026

Background

Currently every request reserves a fixed 32K output token slot, but 99% of responses are under 5K tokens. This over-reserves GPU slot capacity by ~4x, limiting server concurrency.

Approach

Adaptive "low default + escalate on truncation" strategy:

| Phase | Output limit | Trigger |
| --- | --- | --- |
| Initial request | 8K | All requests |
| Escalated retry | 64K | Previous response truncated by max_tokens |

Flow: send with 8K → if finish_reason === MAX_TOKENS → auto-retry with 64K → only ~1% of requests use the large slot.

Changes

1. New constants (tokenLimits.ts)

  • CAPPED_DEFAULT_MAX_TOKENS = 8_000 — low default for slot optimization
  • ESCALATED_MAX_TOKENS = 64_000 — escalated limit for truncated requests

2. Lower default max_tokens (default.ts, anthropicContentGenerator.ts)

  • Default max_tokens reduced from 32K to 8K when no user config is set
  • New QWEN_CODE_MAX_OUTPUT_TOKENS env var to override the default
  • User-configured max_tokens is unaffected (highest priority)
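The constants and the priority order above can be sketched as follows. This is an illustrative sketch only; resolveMaxOutputTokens is a hypothetical helper name, not the actual code in default.ts:

```typescript
// Sketch of the new constants from tokenLimits.ts and the resolution order:
// user-configured max_tokens > QWEN_CODE_MAX_OUTPUT_TOKENS env var > capped default.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000; // low default for slot optimization
const ESCALATED_MAX_TOKENS = 64_000;     // limit used on the escalated retry

function resolveMaxOutputTokens(userMaxTokens?: number): number {
  if (userMaxTokens !== undefined && userMaxTokens > 0) {
    return userMaxTokens; // explicit user config always wins
  }
  const envValue = Number(process.env['QWEN_CODE_MAX_OUTPUT_TOKENS']);
  if (Number.isFinite(envValue) && envValue > 0) {
    return envValue; // operator override without a code change
  }
  return CAPPED_DEFAULT_MAX_TOKENS;
}
```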

3. Auto-escalation retry (geminiChat.ts)

  • Detects finishReason === MAX_TOKENS from streamed chunks
  • When conditions are met (no user/env override, not already escalated):
    • Removes the truncated model response from history
    • Yields a RETRY event (UI discards partial output)
    • Re-sends the same request with maxOutputTokens: 64K
  • Escalation is placed outside the retry loop so errors from the escalated stream propagate directly instead of being caught by retry logic
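The escalation conditions above reduce to a small predicate. This is a sketch; shouldEscalate and the literal finish-reason string are assumptions for illustration, not the actual geminiChat.ts implementation:

```typescript
type FinishReason = 'STOP' | 'MAX_TOKENS' | 'SAFETY' | undefined;

// Escalate only when: the stream ended because the capped default was hit,
// the user/env has not pinned max_tokens, and we have not already escalated
// (the retry happens at most once per request).
function shouldEscalate(
  finishReason: FinishReason,
  hasUserMaxTokensOverride: boolean,
  alreadyEscalated: boolean,
): boolean {
  return (
    finishReason === 'MAX_TOKENS' &&
    !hasUserMaxTokensOverride &&
    !alreadyEscalated
  );
}
```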

4. RETRY event state cleanup (turn.ts)

  • Clears pendingToolCalls, pendingCitations, debugResponses, and finishReason on RETRY
  • Prevents duplicate tool calls when the first truncated response contained completed tool calls and the escalated retry produces the same ones
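A minimal sketch of that cleanup, using the field names from the description above (the real Turn class in turn.ts carries more state than this):

```typescript
interface TurnState {
  pendingToolCalls: unknown[];
  pendingCitations: string[];
  debugResponses: unknown[];
  finishReason?: string;
}

// On RETRY, drop everything accumulated during the truncated attempt so the
// escalated retry cannot emit duplicate tool calls or carry stale metadata.
function resetOnRetry(state: TurnState): void {
  state.pendingToolCalls = [];
  state.pendingCitations = [];
  state.debugResponses = [];
  state.finishReason = undefined;
}
```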

Testing

  • All 780 existing tests pass
  • Updated default-value expectations in default.test.ts, dashscope.test.ts, and anthropicContentGenerator.test.ts

🤖 Generated with Claude Code

99% of model responses are under 5K tokens, but we previously reserved
32K for every request. This wastes GPU slot capacity by ~4x.

Now the default output limit is 8K. When a response hits this cap
(stop_reason=max_tokens), it automatically retries once at 64K — only
the ~1% of requests that actually need more tokens pay the cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Apr 4, 2026

📋 Review Summary

This PR implements an adaptive output token escalation strategy to optimize GPU slot utilization by reducing the default output token reservation from 32K to 8K tokens, with an automatic retry at 64K when responses are truncated. The implementation is well-structured, maintains backward compatibility, and includes comprehensive test updates. Overall, this is a solid optimization that should significantly improve server concurrency.

🔍 General Feedback

  • Strong architectural approach: The "low default + escalate on truncation" pattern is well-suited for the stated goal of reducing GPU slot over-reservation
  • Good separation of concerns: Token limit constants are centralized in tokenLimits.ts, while escalation logic lives in geminiChat.ts
  • Comprehensive test coverage: Test files are updated to reflect the new default values across all affected providers
  • Environment variable override: The QWEN_CODE_MAX_OUTPUT_TOKENS env var provides good operational flexibility
  • Consistent implementation: Token limit logic is applied consistently across Anthropic, DashScope, and Default OpenAI providers

🎯 Specific Feedback

🟡 High

  • File: packages/core/src/core/geminiChat.ts:316-320 - The hasUserMaxTokensOverride check reads from cgConfig?.samplingParams?.max_tokens, but this may not capture all user configuration paths. For example, if a user configures max_tokens directly in the request params rather than in the content generator config, this override detection would fail, causing the escalation logic to incorrectly apply. Consider also checking params.config?.maxOutputTokens in the override detection.

  • File: packages/core/src/core/geminiChat.ts:465-470 - The history cleanup logic (self.history.pop()) assumes the last item is always the partial model response, but if multiple concurrent operations are happening or if the history was modified elsewhere, this could remove the wrong entry. Add a safety check to verify the role before popping, or consider a more robust cleanup mechanism.

🟢 Medium

  • File: packages/core/src/core/geminiChat.ts:344-347 - The lastFinishReason tracking only captures the finish reason from the last chunk of the stream. If the stream yields multiple candidates or if finish reasons vary across chunks, this could miss important state. Consider tracking finish reasons from all candidates or adding a comment explaining why the last candidate's finish reason is authoritative.

  • File: packages/core/src/core/geminiChat.ts:454-491 - The escalation retry is placed outside the INVALID_CONTENT_RETRY_OPTIONS loop, which is correct for error propagation, but this means escalation doesn't benefit from the same retry semantics (e.g., backoff). If the escalated request also fails with a transient error, it won't be retried. Consider whether the escalated stream should also be wrapped in retry logic or if this is acceptable behavior.

  • File: packages/core/src/core/tokenLimits.ts:14-18 - The comment mentions "99% of outputs are under 5K tokens" and "4-6× slot capacity" - these are operational claims that should ideally be backed by telemetry data or moved to documentation rather than code comments. Consider adding a reference to telemetry dashboards or benchmarks.

🔵 Low

  • File: packages/core/src/core/geminiChat.ts:315-321 - Consider extracting the escalation state initialization into a small helper object or function for better readability:

    const escalationState = {
      maxTokensEscalated: false,
      hasUserMaxTokensOverride: ...,
      lastFinishReason: undefined as string | undefined,
    };
  • File: packages/core/src/core/geminiChat.ts:462 - The debug log message could be more actionable by including the current token count or the trigger condition. Consider: Output truncated (finishReason: ${lastFinishReason}). Escalating from ${CAPPED_DEFAULT_MAX_TOKENS} to ${ESCALATED_MAX_TOKENS} tokens.

  • File: packages/core/src/core/turn.ts:283-288 - The state cleanup on RETRY events clears pendingToolCalls, pendingCitations, debugResponses, and finishReason. Consider adding a comment explaining why each field needs to be cleared, especially for future maintainers who may add new state fields to the Turn class.

  • File: packages/core/src/core/openaiContentGenerator/provider/default.ts:144-146 - The comment "Capped default (8K) reduces GPU slot over-reservation by ~4×" uses a magic number. Consider using the constant CAPPED_DEFAULT_MAX_TOKENS in the comment for consistency: Capped default (${CAPPED_DEFAULT_MAX_TOKENS}) reduces...

✅ Highlights

  • Excellent documentation in PR description: The background, solution, and implementation details are clearly explained with a helpful table showing the escalation stages
  • Smart escalation design: The single-retry escalation pattern prevents runaway token consumption while still handling the <1% of requests that need more tokens
  • Environment variable support: Adding QWEN_CODE_MAX_OUTPUT_TOKENS provides operational flexibility without code changes
  • Comprehensive test updates: All affected test files (default.test.ts, dashscope.test.ts, anthropicContentGenerator.test.ts) have been updated to reflect the new default values
  • State cleanup on RETRY: The turn.ts changes properly handle state cleanup to prevent duplicate tool calls and stale metadata during escalation retries

@wenshao
Collaborator Author

wenshao commented Apr 4, 2026

Thanks for the review. Addressing the High and Medium items:

High 1 — hasUserMaxTokensOverride not checking params.config?.maxOutputTokens
Not needed. The only caller (turn.run()) sets params.config = { abortSignal: signal }, so maxOutputTokens is never present in the initial params. It only appears in the escalated retry params we construct ourselves (which is guarded by maxTokensEscalated).

High 2 — history.pop() could remove the wrong entry
The code already includes the safety check:

if (
    self.history.length > 0 &&
    self.history[self.history.length - 1].role === "model"
) {
    self.history.pop();
}

It only pops when the last entry is a model response. Additionally, there are no await points between processStreamResponse pushing the model entry and the escalation code checking it, so no concurrent modification is possible.

Medium 1 — lastFinishReason tracking
This codebase uses single-candidate responses. The finish reason appears in one chunk (the last one with candidates). The if (fr) lastFinishReason = fr pattern correctly captures the authoritative final value.
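That last-value-wins pattern can be sketched as follows (illustrative only — the chunk shape is simplified down to a finishReason field):

```typescript
// Only chunks that actually carry a finish reason update the tracked value,
// so the final chunk with candidates is authoritative.
function trackFinishReason(
  chunks: Array<{ finishReason?: string }>,
): string | undefined {
  let lastFinishReason: string | undefined;
  for (const chunk of chunks) {
    const fr = chunk.finishReason;
    if (fr) lastFinishReason = fr;
  }
  return lastFinishReason;
}
```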

Medium 2 — Escalated stream not wrapped in retry loop
By design. The escalated makeApiCallAndProcessStream already includes HTTP-level retries via retryWithBackoff (rate limits, 5xx). Adding stream-level retry on top of the escalation would over-engineer a path that is itself already a retry. If the escalated stream fails with a transient stream error, the error surfaces to the user — acceptable for a first implementation.

Medium 3 — "99% of outputs" claim
This comes from the upstream reference implementation (Claude Code) design doc where it was validated with production telemetry (BQ p99 output = 4,911 tokens).

- Add design doc covering problem, architecture, token limit
  determination, escalation mechanism, and design decisions
- Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
- Add max_tokens adaptive behavior explanation in model config section
Collaborator

@tanzhenxin tanzhenxin left a comment

Review

Nice optimization — "start small, escalate on demand" is the right approach, and the E2E results confirm the core flow works correctly. One item to flag.

Issues

1. Pre-existing: agent-core.ts and forkedQuery.ts don't clear accumulated state on RETRY

Recommend a follow-up PR.

Not introduced by this PR, but escalation increases the blast radius. Previously, RETRY only fired after error/partial streams where little state had accumulated. With escalation, RETRY fires after a complete successful stream — functionCalls, roundText, roundThoughtText, etc. in agent-core.ts:427 are already populated, and the continue doesn't clear them. The escalated response then appends on top, producing duplicate tool calls and doubled text. Same pattern in forkedQuery.ts:222 where fullText keeps concatenating across both attempts. Turn already handles this correctly (turn.ts:286), but these two consumers don't.

E2E Test Results (manual, CAPPED_DEFAULT_MAX_TOKENS=256)

| Test | Result | Details |
| --- | --- | --- |
| Default cap applied | PASS | Cap is functionally active (proven by truncation in test 2) |
| Escalation retry | PASS | 2 API calls: first truncated (finish_reason: length), second with max_tokens: 64000 (finish_reason: stop). Complete output produced. |
| Debug log message | PASS | "Output truncated at capped default. Escalating to 64000 tokens." confirmed in ~/.qwen/debug/ |
| User override bypasses cap | PASS | QWEN_CODE_MAX_OUTPUT_TOKENS=16000 → single API call, no escalation |
| No duplicate tool calls | PASS | list_directory appeared exactly once after escalation. RETRY event correctly cleared pendingToolCalls, pendingCitations, and debugResponses. |

Verdict

APPROVE — Core escalation logic works as designed. The consumer state cleanup (#1) is pre-existing and can be a follow-up.

@wenshao wenshao merged commit 1e8bc03 into QwenLM:main Apr 8, 2026
25 of 26 checks passed

Labels

DDAR DataWorks Data Agent Ready

2 participants