feat: fallback model support for transient LLM failures by nikolasdehor · Pull Request #1199 · HKUDS/nanobot

nikolasdehor · 2026-02-25T15:48:18Z

Summary

Resolves #1121 — when the primary model fails with a transient error (timeout, rate limit, 503, 500), nanobot now automatically retries with user-configured fallback models.

Changes

config/schema.py: Added fallbacks field to AgentDefaults — a list of model names to try when the primary model fails transiently
cli/commands.py: Added _setup_all_provider_envs() to set env vars for all configured providers, so fallback models from different providers can authenticate via LiteLLM
providers/litellm_provider.py:
- Defined _TRANSIENT_ERRORS tuple (Timeout, ServiceUnavailable, InternalServer, RateLimitError)
- Added fallbacks parameter to provider constructor
- Modified chat() to catch transient errors separately and delegate to _try_fallbacks()
- Implemented _try_fallbacks() — iterates through fallback models, resolving names and applying overrides, logging each attempt

Usage

{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-5",
      "fallbacks": ["openai/gpt-4o", "deepseek/deepseek-chat"]
    }
  },
  "providers": {
    "anthropic": { "api_key": "sk-ant-..." },
    "openai": { "api_key": "sk-..." },
    "deepseek": { "api_key": "sk-..." }
  }
}

If Claude times out or returns 429/500/503, nanobot will automatically try GPT-4o, then DeepSeek, before returning an error.

Copilot

Pull request overview

Adds automatic fallback model retries to nanobot when LiteLLM calls fail with transient errors, enabling more resilient agent runs across multiple configured providers.

Changes:

Introduces agents.defaults.fallbacks config field for fallback model chain.
Sets env vars for multiple configured providers so LiteLLM can authenticate when switching providers.
Adds transient-error handling in LiteLLMProvider.chat() and implements _try_fallbacks() retry loop.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`nanobot/providers/litellm_provider.py`	Adds transient error classification and fallback retry logic in the LiteLLM provider.
`nanobot/config/schema.py`	Extends agent defaults schema with a `fallbacks` list.
`nanobot/cli/commands.py`	Adds CLI-time env setup to support authenticating fallback providers via LiteLLM.


Copilot · 2026-02-25T15:55:18Z

+                    content=f"Error calling LLM: {str(e)}",
+                    finish_reason="error",
+                )
+            return await self._try_fallbacks(e, messages, tools, max_tokens, temperature)


This new transient-error fallback behavior isn’t covered by tests. Since the repo already has unit tests for LiteLLMProvider, add a test that stubs acompletion() to raise a transient error on the primary call and then succeed on a fallback, asserting the fallback is attempted and the successful response is returned (plus a test for the all-fallbacks-fail path).
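
The two tests requested above might look something like the sketch below. It uses `unittest.mock.AsyncMock` from the standard library to stub the completion call. The `chat` function and `TransientError` class here are hypothetical stand-ins so that the example runs on its own; a real test would instead patch `acompletion()` where `LiteLLMProvider` uses it and call the provider's `chat()`.

```python
import asyncio
from unittest.mock import AsyncMock


class TransientError(Exception):
    """Hypothetical stand-in for a LiteLLM transient exception."""


async def chat(acompletion, models):
    """Minimal stand-in for the provider: try each model until one succeeds."""
    last = None
    for model in models:
        try:
            return await acompletion(model=model)
        except TransientError as exc:
            last = exc
    return f"Error calling LLM: {last}"


def test_fallback_used_after_transient_error():
    # First awaited call raises, second returns a value.
    stub = AsyncMock(side_effect=[TransientError("429"), "fallback reply"])
    result = asyncio.run(chat(stub, ["primary", "fallback"]))
    assert result == "fallback reply"
    # The fallback was attempted, in order, after the primary.
    tried = [call.kwargs["model"] for call in stub.await_args_list]
    assert tried == ["primary", "fallback"]


def test_all_fallbacks_fail():
    # Every awaited call raises the same transient error.
    stub = AsyncMock(side_effect=TransientError("503"))
    result = asyncio.run(chat(stub, ["primary", "fallback"]))
    assert result == "Error calling LLM: 503"
    assert stub.await_count == 2
```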

Copilot · 2026-02-25T15:55:19Z

+        if spec and spec.env_key:
+            os.environ.setdefault(spec.env_key, p.api_key)
+


_setup_all_provider_envs() only sets spec.env_key, but several registry specs rely on env_extras (e.g. Zhipu mirrors to ZHIPUAI_API_KEY, Moonshot sets MOONSHOT_API_BASE). Without applying env_extras (and resolving {api_base} defaults), some fallback providers won’t authenticate or route correctly. Consider reusing the same env setup logic as LiteLLMProvider._setup_env() (or extracting a shared helper) so fallbacks behave consistently.

Suggested change

-        if spec and spec.env_key:
-            os.environ.setdefault(spec.env_key, p.api_key)
+        if not spec:
+            continue
+        # Primary API key env var
+        if getattr(spec, "env_key", None):
+            os.environ.setdefault(spec.env_key, p.api_key)
+        # Additional provider-specific env vars (e.g. mirrored keys, API base)
+        env_extras = getattr(spec, "env_extras", None)
+        if isinstance(env_extras, dict):
+            # Derive a provider-specific api_base if available on the config
+            api_base = getattr(p, "api_base", None)
+            for extra_key, extra_value in env_extras.items():
+                if not extra_key:
+                    continue
+                # Support simple string templating with api_key/api_base
+                if isinstance(extra_value, str):
+                    try:
+                        formatted = extra_value.format(
+                            api_key=p.api_key,
+                            api_base=api_base or "",
+                        )
+                    except Exception:
+                        formatted = extra_value
+                else:
+                    formatted = str(extra_value)
+                os.environ.setdefault(extra_key, formatted)

Copilot · 2026-02-25T15:55:19Z

+    from nanobot.providers.registry import find_by_name
+
+    for spec_name in (
+        "anthropic", "openai", "openrouter", "deepseek", "groq", "zhipu",
+        "dashscope", "gemini", "moonshot", "minimax", "aihubmix",
+        "siliconflow", "volcengine", "vllm",
+    ):
+        p = getattr(config.providers, spec_name, None)
+        if not p or not p.api_key:
+            continue
+        spec = find_by_name(spec_name)
+        if spec and spec.env_key:
+            os.environ.setdefault(spec.env_key, p.api_key)
+
+


The provider name list in _setup_all_provider_envs() is hardcoded, but providers/registry.py is documented as the single source of truth for provider metadata. To avoid missing newly added providers (or OAuth/local/direct ones) and duplicating maintenance, consider iterating over PROVIDERS (filtering to non-direct/non-OAuth specs with env_key/env_extras) instead of maintaining a separate tuple here.

Suggested change

-    from nanobot.providers.registry import find_by_name
-    for spec_name in (
-        "anthropic", "openai", "openrouter", "deepseek", "groq", "zhipu",
-        "dashscope", "gemini", "moonshot", "minimax", "aihubmix",
-        "siliconflow", "volcengine", "vllm",
-    ):
-        p = getattr(config.providers, spec_name, None)
-        if not p or not p.api_key:
-            continue
-        spec = find_by_name(spec_name)
-        if spec and spec.env_key:
-            os.environ.setdefault(spec.env_key, p.api_key)
+    from nanobot.providers.registry import PROVIDERS
+    # Iterate over provider specs from the central registry so that newly
+    # added providers are picked up automatically, while skipping OAuth/direct
+    # providers and those without an env_key.
+    provider_specs = PROVIDERS.values() if hasattr(PROVIDERS, "values") else PROVIDERS
+    for spec in provider_specs:
+        # Skip providers that are explicitly marked as OAuth/direct, if such
+        # a classification is available on the spec.
+        if getattr(spec, "kind", None) in {"oauth", "direct"}:
+            continue
+        env_key = getattr(spec, "env_key", None)
+        if not env_key:
+            continue
+        name = getattr(spec, "name", None)
+        if not name:
+            continue
+        p = getattr(config.providers, name, None)
+        if not p or not getattr(p, "api_key", None):
+            continue
+        os.environ.setdefault(env_key, p.api_key)

Copilot · 2026-02-25T15:55:19Z

+            fb_kwargs: dict[str, Any] = {
+                "model": resolved,
+                "messages": sanitized,
+                "max_tokens": max_tokens,
+                "temperature": temperature,


Fallback retries build fb_kwargs without including api_key/api_base/extra_headers. This can make fallbacks fail or authenticate incorrectly (e.g., if the primary call relied on explicit api_key overriding an existing env var, or if a gateway requires extra_headers). Consider propagating the same connection/auth params used in chat() when in gateway mode (and/or when the fallback resolves to the same provider), while still allowing cross-provider fallbacks to rely on their own env vars.
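
One way to act on this comment is to build the fallback kwargs in a helper that copies the primary call's connection settings only when the caller says they apply. The sketch below is an assumption about how that could look, not code from the PR: `build_fallback_kwargs` and its `reuse_primary_auth` flag are hypothetical names, and the caller is expected to set the flag in gateway mode or when the fallback resolves to the same provider as the primary.

```python
from __future__ import annotations

from typing import Any


def build_fallback_kwargs(
    model: str,
    messages: list[dict[str, Any]],
    *,
    max_tokens: int,
    temperature: float,
    reuse_primary_auth: bool,
    api_key: str | None = None,
    api_base: str | None = None,
    extra_headers: dict[str, str] | None = None,
) -> dict[str, Any]:
    """Build completion kwargs for one fallback attempt (hypothetical helper)."""
    kwargs: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if reuse_primary_auth:
        # Same gateway or same provider: reuse explicit connection settings.
        if api_key:
            kwargs["api_key"] = api_key
        if api_base:
            kwargs["api_base"] = api_base
        if extra_headers:
            kwargs["extra_headers"] = dict(extra_headers)
    # Otherwise leave auth out so a cross-provider fallback relies on
    # its own environment variables.
    return kwargs
```

Leaving the auth keys out entirely in the cross-provider case, rather than passing `None`, keeps the fallback call identical to one that never had explicit credentials.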

Copilot · 2026-02-25T15:55:20Z

+                    content=f"Error calling LLM: {str(e)}",
+                    finish_reason="error",
+                )
+            return await self._try_fallbacks(e, messages, tools, max_tokens, temperature)


messages/tools may already have cache_control injected (via _apply_cache_control() for the primary model). Passing those same mutated structures into _try_fallbacks() can break fallback models/providers that don’t support cache_control. Consider keeping an unmodified copy and, inside _try_fallbacks(), only applying cache_control when _supports_cache_control(fb_model) is true (or stripping it when unsupported).
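
The approach suggested here, keeping the raw messages and injecting `cache_control` per model, can be sketched as below. All three function names are hypothetical, the `anthropic/` prefix check is only a placeholder for whatever `_supports_cache_control()` really tests, and the placement of the marker on the system message's content block is illustrative rather than a copy of `_apply_cache_control()`.

```python
import copy
from typing import Any

Message = dict[str, Any]


def supports_cache_control(model: str) -> bool:
    """Placeholder rule; the real check lives in the provider."""
    return model.startswith("anthropic/")


def apply_cache_control(messages: list[Message]) -> list[Message]:
    """Return a copy with a cache_control marker on string system prompts."""
    marked = copy.deepcopy(messages)
    for msg in marked:
        if msg.get("role") == "system" and isinstance(msg.get("content"), str):
            msg["content"] = [
                {
                    "type": "text",
                    "text": msg["content"],
                    "cache_control": {"type": "ephemeral"},
                }
            ]
    return marked


def prepare_messages(raw_messages: list[Message], model: str) -> list[Message]:
    """Derive per-model messages from the untouched originals."""
    if supports_cache_control(model):
        return apply_cache_control(raw_messages)
    # Unsupported model: send a plain copy with no cache_control at all.
    return copy.deepcopy(raw_messages)
```

Because every attempt starts from `raw_messages`, a marker added for the primary model cannot leak into a fallback request.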

When the primary model fails with a transient error (timeout, rate limit, 503, 500), automatically retry with user-configured fallback models.

- Add `fallbacks` list to AgentDefaults config schema
- Set env vars for all configured providers so fallback models from different providers can authenticate via LiteLLM
- Catch transient errors separately in chat() and try each fallback model in order before giving up

Closes HKUDS#1121

- Use PROVIDERS registry instead of hardcoded provider list
- Resolve env_extras (e.g. ZHIPUAI_API_KEY, MOONSHOT_API_BASE) for fallback provider authentication
- Skip OAuth/direct providers when setting up env vars
- Pass raw (pre-cache_control) messages to _try_fallbacks and only inject cache_control when the fallback provider supports it

Copilot AI review requested due to automatic review settings February 25, 2026 15:48

Copilot started reviewing on behalf of nikolasdehor February 25, 2026 15:48

Copilot AI reviewed Feb 25, 2026


nikolasdehor added 2 commits February 28, 2026 00:45

nikolasdehor force-pushed the fix/fallback-model-on-timeout branch from 87b896a to b747ba3 Compare February 28, 2026 03:46

nikolasdehor wants to merge 2 commits into HKUDS:main from nikolasdehor:fix/fallback-model-on-timeout

2 participants

