fix: max_tokens not auto-detected for non-Qwen models, causing response truncation #2358

@netbrah

Description

Problem

When using non-Qwen models (Claude, GPT, Gemini) via the openai auth type with a modelProviders config, requests omit max_tokens entirely unless samplingParams.max_tokens is explicitly configured. Many APIs then fall back to a small default (e.g., Anthropic via Vertex AI defaults to 4096 output tokens), so long responses are truncated mid-generation.

This frequently breaks tool calls: the model generates a WriteFile tool call with a large content parameter, the response is cut off at 4096 tokens, and the tool call JSON arrives incomplete.

Root Cause

applyResolvedModelDefaults() in modelsConfig.ts auto-detects contextWindowSize (line 767) and modalities (line 776) from the model name using existing utility functions, but does NOT auto-detect max_tokens. A suitable fallback already exists: tokenLimit(model, 'output') in tokenLimits.ts returns the correct output limit for every supported model family, but it is never called here.

Expected Behavior

max_tokens should be auto-detected from the model name (using tokenLimit(model, 'output')) when not explicitly set by the provider config, following the same pattern as contextWindowSize and modalities.
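A minimal sketch of the proposed fallback. The shape of applyResolvedModelDefaults() and the real tokenLimit() are not shown in this issue, so the types and the per-family limits below are stand-in assumptions; only the pattern (fill in max_tokens from the model name when the provider config leaves it unset) mirrors the existing contextWindowSize/modalities handling:

```typescript
type SamplingParams = { max_tokens?: number };

// Stand-in for tokenLimit(model, 'output') from tokenLimits.ts.
// The limits here are assumed values for illustration only.
function tokenLimit(model: string, kind: 'output'): number {
  if (model.startsWith('claude')) return 8192; // assumed
  if (model.startsWith('gemini')) return 8192; // assumed
  return 4096;                                 // assumed default
}

// Same pattern as the existing contextWindowSize/modalities defaults:
// only auto-detect when the provider config did not set a value.
function applyMaxTokensDefault(
  model: string,
  params: SamplingParams,
): SamplingParams {
  if (params.max_tokens === undefined) {
    return { ...params, max_tokens: tokenLimit(model, 'output') };
  }
  return params;
}

console.log(applyMaxTokensDefault('claude-sonnet', {}).max_tokens);
console.log(applyMaxTokensDefault('gpt-4o', { max_tokens: 2048 }).max_tokens);
```

An explicitly configured max_tokens always wins; the detected limit is only a fallback, so existing modelProviders configs are unaffected.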

Workaround

Set generationConfig.samplingParams.max_tokens explicitly in modelProviders settings for each model.
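A hypothetical settings fragment illustrating the workaround. Only the key path generationConfig.samplingParams.max_tokens comes from this issue; the surrounding modelProviders structure, the model name, and the 8192 value are assumptions for illustration:

```json
{
  "modelProviders": {
    "openai": [
      {
        "name": "claude-sonnet-4",
        "generationConfig": {
          "samplingParams": {
            "max_tokens": 8192
          }
        }
      }
    ]
  }
}
```

This must be repeated per model until max_tokens is auto-detected as proposed above.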
