
feat: fallback model support for transient LLM failures#1199

Open
nikolasdehor wants to merge 2 commits into HKUDS:main from nikolasdehor:fix/fallback-model-on-timeout

Conversation

@nikolasdehor (Collaborator)

Summary

Resolves #1121 — when the primary model fails with a transient error (timeout, rate limit, 503, 500), nanobot now automatically retries with user-configured fallback models.

Changes

  • config/schema.py: Added fallbacks field to AgentDefaults — a list of model names to try when the primary model fails transiently
  • cli/commands.py: Added _setup_all_provider_envs() to set env vars for all configured providers, so fallback models from different providers can authenticate via LiteLLM
  • providers/litellm_provider.py:
    • Defined _TRANSIENT_ERRORS tuple (Timeout, ServiceUnavailable, InternalServer, RateLimitError)
    • Added fallbacks parameter to provider constructor
    • Modified chat() to catch transient errors separately and delegate to _try_fallbacks()
    • Implemented _try_fallbacks() — iterates through fallback models, resolving names and applying overrides, logging each attempt
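The control flow described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the litellm exception classes are stubbed out, and the real `_try_fallbacks()` additionally resolves model names and applies per-model overrides before each attempt.

```python
import asyncio

# Stand-ins for litellm's exception types; the PR's _TRANSIENT_ERRORS tuple
# references litellm's Timeout, ServiceUnavailable, InternalServer, and
# RateLimitError classes.
class Timeout(Exception): ...
class RateLimitError(Exception): ...
class ServiceUnavailableError(Exception): ...

_TRANSIENT_ERRORS = (Timeout, RateLimitError, ServiceUnavailableError)

class FallbackProvider:
    """Minimal sketch of the chat()/_try_fallbacks() flow."""

    def __init__(self, model: str, fallbacks: list[str], call):
        self.model = model
        self.fallbacks = fallbacks
        self._call = call  # async fn(model, messages) -> response

    async def chat(self, messages):
        try:
            return await self._call(self.model, messages)
        except _TRANSIENT_ERRORS as e:
            # Transient failure on the primary model: walk the fallback chain.
            return await self._try_fallbacks(e, messages)

    async def _try_fallbacks(self, primary_error, messages):
        last_error = primary_error
        for fb_model in self.fallbacks:
            try:
                return await self._call(fb_model, messages)
            except _TRANSIENT_ERRORS as e:
                last_error = e
        # All fallbacks exhausted: surface the last transient error.
        raise last_error
```

Non-transient errors deliberately bypass the fallback loop, so bad requests fail fast instead of being retried against every configured model.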

Usage

```json
{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-5",
      "fallbacks": ["openai/gpt-4o", "deepseek/deepseek-chat"]
    }
  },
  "providers": {
    "anthropic": { "api_key": "sk-ant-..." },
    "openai": { "api_key": "sk-..." },
    "deepseek": { "api_key": "sk-..." }
  }
}
```

If Claude times out or returns 429/500/503, nanobot will automatically try GPT-4o, then DeepSeek, before returning an error.

Copilot AI review requested due to automatic review settings February 25, 2026 15:48

Copilot AI left a comment


Pull request overview

Adds automatic fallback model retries to nanobot when LiteLLM calls fail with transient errors, enabling more resilient agent runs across multiple configured providers.

Changes:

  • Introduces agents.defaults.fallbacks config field for fallback model chain.
  • Sets env vars for multiple configured providers so LiteLLM can authenticate when switching providers.
  • Adds transient-error handling in LiteLLMProvider.chat() and implements _try_fallbacks() retry loop.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| nanobot/providers/litellm_provider.py | Adds transient error classification and fallback retry logic in the LiteLLM provider. |
| nanobot/config/schema.py | Extends agent defaults schema with a fallbacks list. |
| nanobot/cli/commands.py | Adds CLI-time env setup to support authenticating fallback providers via LiteLLM. |


Comment thread nanobot/providers/litellm_provider.py Outdated
```python
        content=f"Error calling LLM: {str(e)}",
        finish_reason="error",
    )
return await self._try_fallbacks(e, messages, tools, max_tokens, temperature)
```

Copilot AI Feb 25, 2026


This new transient-error fallback behavior isn’t covered by tests. Since the repo already has unit tests for LiteLLMProvider, add a test that stubs acompletion() to raise a transient error on the primary call and then succeed on a fallback, asserting the fallback is attempted and the successful response is returned (plus a test for the all-fallbacks-fail path).
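A test along the lines Copilot suggests might look like the following sketch. It exercises a toy stand-in for the retry loop with `unittest.mock.AsyncMock` rather than patching the real `LiteLLMProvider`, whose internals aren't shown in this excerpt; the names here are illustrative, and the real test would stub `litellm.acompletion` on the provider instead.

```python
import asyncio
from unittest.mock import AsyncMock

class Timeout(Exception):
    """Stand-in for litellm's Timeout exception."""

async def chat_with_fallbacks(acompletion, models, messages):
    # Toy version of the primary-then-fallback loop under test.
    last_error = None
    for model in models:
        try:
            return await acompletion(model=model, messages=messages)
        except Timeout as e:
            last_error = e
    raise last_error

def test_fallback_used_after_transient_error():
    # Primary call raises a transient error, fallback call succeeds.
    acompletion = AsyncMock(side_effect=[Timeout("503"), {"content": "hello"}])
    result = asyncio.run(chat_with_fallbacks(acompletion, ["primary", "fallback"], []))
    assert result == {"content": "hello"}
    assert acompletion.await_count == 2

def test_error_surfaces_when_all_fallbacks_fail():
    # Every model raises: the last transient error should propagate.
    acompletion = AsyncMock(side_effect=Timeout("503"))
    try:
        asyncio.run(chat_with_fallbacks(acompletion, ["primary", "fallback"], []))
        raised = False
    except Timeout:
        raised = True
    assert raised
```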

Comment thread nanobot/cli/commands.py Outdated
Comment on lines +250 to +252
```python
if spec and spec.env_key:
    os.environ.setdefault(spec.env_key, p.api_key)
```


Copilot AI Feb 25, 2026


_setup_all_provider_envs() only sets spec.env_key, but several registry specs rely on env_extras (e.g. Zhipu mirrors to ZHIPUAI_API_KEY, Moonshot sets MOONSHOT_API_BASE). Without applying env_extras (and resolving {api_base} defaults), some fallback providers won’t authenticate or route correctly. Consider reusing the same env setup logic as LiteLLMProvider._setup_env() (or extracting a shared helper) so fallbacks behave consistently.

Suggested change (replacing the two lines above):

```python
if not spec:
    continue
# Primary API key env var
if getattr(spec, "env_key", None):
    os.environ.setdefault(spec.env_key, p.api_key)
# Additional provider-specific env vars (e.g. mirrored keys, API base)
env_extras = getattr(spec, "env_extras", None)
if isinstance(env_extras, dict):
    # Derive a provider-specific api_base if available on the config
    api_base = getattr(p, "api_base", None)
    for extra_key, extra_value in env_extras.items():
        if not extra_key:
            continue
        # Support simple string templating with api_key/api_base
        if isinstance(extra_value, str):
            try:
                formatted = extra_value.format(
                    api_key=p.api_key,
                    api_base=api_base or "",
                )
            except Exception:
                formatted = extra_value
        else:
            formatted = str(extra_value)
        os.environ.setdefault(extra_key, formatted)
```

Comment thread nanobot/cli/commands.py Outdated
Comment on lines +239 to +253
```python
from nanobot.providers.registry import find_by_name

for spec_name in (
    "anthropic", "openai", "openrouter", "deepseek", "groq", "zhipu",
    "dashscope", "gemini", "moonshot", "minimax", "aihubmix",
    "siliconflow", "volcengine", "vllm",
):
    p = getattr(config.providers, spec_name, None)
    if not p or not p.api_key:
        continue
    spec = find_by_name(spec_name)
    if spec and spec.env_key:
        os.environ.setdefault(spec.env_key, p.api_key)
```


Copilot AI Feb 25, 2026


The provider name list in _setup_all_provider_envs() is hardcoded, but providers/registry.py is documented as the single source of truth for provider metadata. To avoid missing newly added providers (or OAuth/local/direct ones) and duplicating maintenance, consider iterating over PROVIDERS (filtering to non-direct/non-OAuth specs with env_key/env_extras) instead of maintaining a separate tuple here.

Suggested change (replacing the hardcoded provider tuple above):

```python
from nanobot.providers.registry import PROVIDERS

# Iterate over provider specs from the central registry so that newly
# added providers are picked up automatically, while skipping OAuth/direct
# providers and those without an env_key.
provider_specs = PROVIDERS.values() if hasattr(PROVIDERS, "values") else PROVIDERS
for spec in provider_specs:
    # Skip providers that are explicitly marked as OAuth/direct, if such
    # a classification is available on the spec.
    if getattr(spec, "kind", None) in {"oauth", "direct"}:
        continue
    env_key = getattr(spec, "env_key", None)
    if not env_key:
        continue
    name = getattr(spec, "name", None)
    if not name:
        continue
    p = getattr(config.providers, name, None)
    if not p or not getattr(p, "api_key", None):
        continue
    os.environ.setdefault(env_key, p.api_key)
```

Comment on lines +268 to +272
```python
fb_kwargs: dict[str, Any] = {
    "model": resolved,
    "messages": sanitized,
    "max_tokens": max_tokens,
    "temperature": temperature,
```

Copilot AI Feb 25, 2026


Fallback retries build fb_kwargs without including api_key/api_base/extra_headers. This can make fallbacks fail or authenticate incorrectly (e.g., if the primary call relied on explicit api_key overriding an existing env var, or if a gateway requires extra_headers). Consider propagating the same connection/auth params used in chat() when in gateway mode (and/or when the fallback resolves to the same provider), while still allowing cross-provider fallbacks to rely on their own env vars.
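One way to implement this suggestion is sketched below. It propagates the primary call's auth parameters only when the fallback resolves to the same provider prefix; the helper name, the `auth_kwargs` dict, and the prefix-based `provider_of()` heuristic are illustrative assumptions, not the PR's actual code.

```python
def build_fallback_kwargs(
    fb_model: str,
    primary_model: str,
    base_kwargs: dict,
    auth_kwargs: dict,
) -> dict:
    """Sketch: reuse explicit auth params for same-provider fallbacks only."""

    def provider_of(model: str) -> str:
        # "anthropic/claude-sonnet-4-5" -> "anthropic" (illustrative heuristic)
        return model.split("/", 1)[0]

    kwargs = dict(base_kwargs, model=fb_model)
    if provider_of(fb_model) == provider_of(primary_model):
        # Same provider/gateway: carry over api_key/api_base/extra_headers so
        # explicit overrides keep winning over env vars.
        kwargs.update(auth_kwargs)
    # Cross-provider fallbacks rely on the env vars set up at CLI time.
    return kwargs
```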

Comment thread nanobot/providers/litellm_provider.py Outdated
```python
        content=f"Error calling LLM: {str(e)}",
        finish_reason="error",
    )
return await self._try_fallbacks(e, messages, tools, max_tokens, temperature)
```

Copilot AI Feb 25, 2026


messages/tools may already have cache_control injected (via _apply_cache_control() for the primary model). Passing those same mutated structures into _try_fallbacks() can break fallback models/providers that don’t support cache_control. Consider keeping an unmodified copy and, inside _try_fallbacks(), only applying cache_control when _supports_cache_control(fb_model) is true (or stripping it when unsupported).
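The stripping half of this suggestion could look like the sketch below: keep the caller's structures intact and hand each unsupported fallback a cleaned copy. The helper name is hypothetical, and `_supports_cache_control()` (which would gate whether stripping runs at all) is assumed from the provider and not shown in this excerpt.

```python
import copy

def strip_cache_control(messages: list[dict]) -> list[dict]:
    """Return a deep copy of messages with Anthropic-style cache_control
    markers removed from structured content blocks, leaving the original
    message list untouched."""
    cleaned = copy.deepcopy(messages)
    for msg in cleaned:
        content = msg.get("content")
        if isinstance(content, list):  # structured content blocks
            for block in content:
                if isinstance(block, dict):
                    block.pop("cache_control", None)
    return cleaned
```

Because the copy is deep, the primary model's cache-annotated messages survive for a later retry while each fallback sees only fields it supports.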

Commits
When the primary model fails with a transient error (timeout, rate limit,
503, 500), automatically retry with user-configured fallback models.

- Add `fallbacks` list to AgentDefaults config schema
- Set env vars for all configured providers so fallback models from
  different providers can authenticate via LiteLLM
- Catch transient errors separately in chat() and try each fallback
  model in order before giving up

Closes HKUDS#1121
- Use PROVIDERS registry instead of hardcoded provider list
- Resolve env_extras (e.g. ZHIPUAI_API_KEY, MOONSHOT_API_BASE) for
  fallback provider authentication
- Skip OAuth/direct providers when setting up env vars
- Pass raw (pre-cache_control) messages to _try_fallbacks and only
  inject cache_control when the fallback provider supports it
@nikolasdehor force-pushed the fix/fallback-model-on-timeout branch from 87b896a to b747ba3 on February 28, 2026 03:46


Development

Successfully merging this pull request may close these issues.

Fallback model not triggered on LLM timeout (ServiceUnavailableError / 503)

2 participants