feat: fallback model support for transient LLM failures #1199
nikolasdehor wants to merge 2 commits into HKUDS:main
Pull request overview
Adds automatic fallback model retries to nanobot when LiteLLM calls fail with transient errors, enabling more resilient agent runs across multiple configured providers.
Changes:
- Introduces the `agents.defaults.fallbacks` config field for the fallback model chain.
- Sets env vars for multiple configured providers so LiteLLM can authenticate when switching providers.
- Adds transient-error handling in `LiteLLMProvider.chat()` and implements the `_try_fallbacks()` retry loop.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `nanobot/providers/litellm_provider.py` | Adds transient error classification and fallback retry logic in the LiteLLM provider. |
| `nanobot/config/schema.py` | Extends agent defaults schema with a fallbacks list. |
| `nanobot/cli/commands.py` | Adds CLI-time env setup to support authenticating fallback providers via LiteLLM. |
            content=f"Error calling LLM: {str(e)}",
            finish_reason="error",
        )
        return await self._try_fallbacks(e, messages, tools, max_tokens, temperature)
This new transient-error fallback behavior isn’t covered by tests. Since the repo already has unit tests for LiteLLMProvider, add a test that stubs acompletion() to raise a transient error on the primary call and then succeed on a fallback, asserting the fallback is attempted and the successful response is returned (plus a test for the all-fallbacks-fail path).
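A minimal sketch of the two tests described above. It does not import nanobot or LiteLLM: `TransientError`, `chat_with_fallbacks`, and the injected `acompletion` callable are hypothetical stand-ins for the real `LiteLLMProvider.chat()` and `_try_fallbacks()` pair, so the assertions would need adapting to the actual provider API and response type.

```python
import asyncio
from unittest.mock import AsyncMock


class TransientError(Exception):
    """Hypothetical stand-in for a LiteLLM timeout / rate-limit / 5xx error."""


async def chat_with_fallbacks(acompletion, model, fallbacks, messages):
    """Stand-in for chat(): try the primary model, then each fallback in order."""
    last_error = None
    for candidate in [model, *fallbacks]:
        try:
            return await acompletion(model=candidate, messages=messages)
        except TransientError as exc:
            last_error = exc  # remember the failure and move to the next model
    # Every model failed transiently: surface an error response.
    return {"content": f"Error calling LLM: {last_error}", "finish_reason": "error"}


def test_fallback_used_after_transient_error():
    ok = {"content": "hello", "finish_reason": "stop"}
    # First await raises, second await returns the successful response.
    stub = AsyncMock(side_effect=[TransientError("timeout"), ok])
    result = asyncio.run(chat_with_fallbacks(stub, "primary", ["backup"], []))
    assert result == ok
    assert [c.kwargs["model"] for c in stub.await_args_list] == ["primary", "backup"]


def test_error_returned_when_all_fallbacks_fail():
    # A single exception instance as side_effect is raised on every await.
    stub = AsyncMock(side_effect=TransientError("503"))
    result = asyncio.run(chat_with_fallbacks(stub, "primary", ["a", "b"], []))
    assert result["finish_reason"] == "error"
    assert stub.await_count == 3  # primary plus two fallbacks
```

In the real test suite the stub would more likely be patched over the `acompletion` name that the provider module imports, rather than passed in as an argument.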
        if spec and spec.env_key:
            os.environ.setdefault(spec.env_key, p.api_key)
_setup_all_provider_envs() only sets spec.env_key, but several registry specs rely on env_extras (e.g. Zhipu mirrors to ZHIPUAI_API_KEY, Moonshot sets MOONSHOT_API_BASE). Without applying env_extras (and resolving {api_base} defaults), some fallback providers won’t authenticate or route correctly. Consider reusing the same env setup logic as LiteLLMProvider._setup_env() (or extracting a shared helper) so fallbacks behave consistently.
Suggested change:
    -        if spec and spec.env_key:
    -            os.environ.setdefault(spec.env_key, p.api_key)
    +        if not spec:
    +            continue
    +        # Primary API key env var
    +        if getattr(spec, "env_key", None):
    +            os.environ.setdefault(spec.env_key, p.api_key)
    +        # Additional provider-specific env vars (e.g. mirrored keys, API base)
    +        env_extras = getattr(spec, "env_extras", None)
    +        if isinstance(env_extras, dict):
    +            # Derive a provider-specific api_base if available on the config
    +            api_base = getattr(p, "api_base", None)
    +            for extra_key, extra_value in env_extras.items():
    +                if not extra_key:
    +                    continue
    +                # Support simple string templating with api_key/api_base
    +                if isinstance(extra_value, str):
    +                    try:
    +                        formatted = extra_value.format(
    +                            api_key=p.api_key,
    +                            api_base=api_base or "",
    +                        )
    +                    except Exception:
    +                        formatted = extra_value
    +                else:
    +                    formatted = str(extra_value)
    +                os.environ.setdefault(extra_key, formatted)
    from nanobot.providers.registry import find_by_name

    for spec_name in (
        "anthropic", "openai", "openrouter", "deepseek", "groq", "zhipu",
        "dashscope", "gemini", "moonshot", "minimax", "aihubmix",
        "siliconflow", "volcengine", "vllm",
    ):
        p = getattr(config.providers, spec_name, None)
        if not p or not p.api_key:
            continue
        spec = find_by_name(spec_name)
        if spec and spec.env_key:
            os.environ.setdefault(spec.env_key, p.api_key)
The provider name list in _setup_all_provider_envs() is hardcoded, but providers/registry.py is documented as the single source of truth for provider metadata. To avoid missing newly added providers (or OAuth/local/direct ones) and duplicating maintenance, consider iterating over PROVIDERS (filtering to non-direct/non-OAuth specs with env_key/env_extras) instead of maintaining a separate tuple here.
Suggested change:
    -    from nanobot.providers.registry import find_by_name
    -    for spec_name in (
    -        "anthropic", "openai", "openrouter", "deepseek", "groq", "zhipu",
    -        "dashscope", "gemini", "moonshot", "minimax", "aihubmix",
    -        "siliconflow", "volcengine", "vllm",
    -    ):
    -        p = getattr(config.providers, spec_name, None)
    -        if not p or not p.api_key:
    -            continue
    -        spec = find_by_name(spec_name)
    -        if spec and spec.env_key:
    -            os.environ.setdefault(spec.env_key, p.api_key)
    +    from nanobot.providers.registry import PROVIDERS
    +    # Iterate over provider specs from the central registry so that newly
    +    # added providers are picked up automatically, while skipping OAuth/direct
    +    # providers and those without an env_key.
    +    provider_specs = PROVIDERS.values() if hasattr(PROVIDERS, "values") else PROVIDERS
    +    for spec in provider_specs:
    +        # Skip providers that are explicitly marked as OAuth/direct, if such
    +        # a classification is available on the spec.
    +        if getattr(spec, "kind", None) in {"oauth", "direct"}:
    +            continue
    +        env_key = getattr(spec, "env_key", None)
    +        if not env_key:
    +            continue
    +        name = getattr(spec, "name", None)
    +        if not name:
    +            continue
    +        p = getattr(config.providers, name, None)
    +        if not p or not getattr(p, "api_key", None):
    +            continue
    +        os.environ.setdefault(env_key, p.api_key)
            fb_kwargs: dict[str, Any] = {
                "model": resolved,
                "messages": sanitized,
                "max_tokens": max_tokens,
                "temperature": temperature,
Fallback retries build fb_kwargs without including api_key/api_base/extra_headers. This can make fallbacks fail or authenticate incorrectly (e.g., if the primary call relied on explicit api_key overriding an existing env var, or if a gateway requires extra_headers). Consider propagating the same connection/auth params used in chat() when in gateway mode (and/or when the fallback resolves to the same provider), while still allowing cross-provider fallbacks to rely on their own env vars.
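One possible reading of this suggestion, sketched under assumptions: `build_fallback_kwargs`, `provider_of`, the `gateway_mode` flag, and the prefix-based provider check are hypothetical and not taken from the PR. The idea is to reuse the primary call's explicit credentials only in gateway mode or when the fallback stays on the same provider, and otherwise leave authentication to that provider's own env vars.

```python
from __future__ import annotations

from typing import Any


def provider_of(model: str) -> str:
    """Return the provider prefix of a 'provider/model' name ('' if none)."""
    return model.split("/", 1)[0] if "/" in model else ""


def build_fallback_kwargs(
    fb_model: str,
    primary_model: str,
    messages: list[dict[str, Any]],
    max_tokens: int,
    temperature: float,
    *,
    api_key: str | None = None,
    api_base: str | None = None,
    extra_headers: dict[str, str] | None = None,
    gateway_mode: bool = False,
) -> dict[str, Any]:
    """Build completion kwargs for one fallback attempt (hypothetical helper)."""
    kwargs: dict[str, Any] = {
        "model": fb_model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    fb_provider = provider_of(fb_model)
    same_provider = bool(fb_provider) and fb_provider == provider_of(primary_model)
    # Only carry explicit connection/auth params over when they still apply;
    # cross-provider fallbacks fall back to their own env vars instead.
    if gateway_mode or same_provider:
        if api_key:
            kwargs["api_key"] = api_key
        if api_base:
            kwargs["api_base"] = api_base
        if extra_headers:
            kwargs["extra_headers"] = dict(extra_headers)
    return kwargs
```

Whether same-provider detection by model prefix is reliable depends on how nanobot resolves model names, so the real check would probably go through the provider registry.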
            content=f"Error calling LLM: {str(e)}",
            finish_reason="error",
        )
        return await self._try_fallbacks(e, messages, tools, max_tokens, temperature)
messages/tools may already have cache_control injected (via _apply_cache_control() for the primary model). Passing those same mutated structures into _try_fallbacks() can break fallback models/providers that don’t support cache_control. Consider keeping an unmodified copy and, inside _try_fallbacks(), only applying cache_control when _supports_cache_control(fb_model) is true (or stripping it when unsupported).
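A rough sketch of the stripping variant of this suggestion. `strip_cache_control` and `prepare_for_fallback` are hypothetical names, and the two callables stand in for the provider's `_supports_cache_control()` and `_apply_cache_control()` methods, whose real signatures may differ.

```python
from typing import Any, Callable


def strip_cache_control(obj: Any) -> Any:
    """Return a copy of obj with every 'cache_control' key removed, at any depth."""
    if isinstance(obj, dict):
        return {k: strip_cache_control(v) for k, v in obj.items() if k != "cache_control"}
    if isinstance(obj, list):
        return [strip_cache_control(v) for v in obj]
    return obj


def prepare_for_fallback(
    messages: list,
    fb_model: str,
    supports_cache_control: Callable[[str], bool],
    apply_cache_control: Callable[[list], list],
) -> list:
    """Give each fallback a clean payload; re-inject cache_control only if supported."""
    clean = strip_cache_control(messages)  # new containers, input left untouched
    if supports_cache_control(fb_model):
        return apply_cache_control(clean)
    return clean
```

Because `strip_cache_control` builds new dicts and lists, the payload prepared for the primary model is not mutated, so each fallback attempt starts from the same input. The same function can be applied to `tools`.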
When the primary model fails with a transient error (timeout, rate limit, 503, 500), automatically retry with user-configured fallback models.
- Add `fallbacks` list to AgentDefaults config schema
- Set env vars for all configured providers so fallback models from different providers can authenticate via LiteLLM
- Catch transient errors separately in chat() and try each fallback model in order before giving up
Closes HKUDS#1121
- Use PROVIDERS registry instead of hardcoded provider list
- Resolve env_extras (e.g. ZHIPUAI_API_KEY, MOONSHOT_API_BASE) for fallback provider authentication
- Skip OAuth/direct providers when setting up env vars
- Pass raw (pre-cache_control) messages to _try_fallbacks and only inject cache_control when the fallback provider supports it
87b896a to b747ba3
Summary
Resolves #1121 — when the primary model fails with a transient error (timeout, rate limit, 503, 500), nanobot now automatically retries with user-configured fallback models.
Changes
- `config/schema.py`: Added `fallbacks` field to `AgentDefaults`, a list of model names to try when the primary model fails transiently
- `cli/commands.py`: Added `_setup_all_provider_envs()` to set env vars for all configured providers, so fallback models from different providers can authenticate via LiteLLM
- `providers/litellm_provider.py`:
  - `_TRANSIENT_ERRORS` tuple (Timeout, ServiceUnavailable, InternalServer, RateLimitError)
  - `fallbacks` parameter on the provider constructor
  - `chat()` now catches transient errors separately and delegates to `_try_fallbacks()`
  - `_try_fallbacks()` iterates through fallback models, resolving names and applying overrides, logging each attempt
Usage
    {
      "agents": {
        "defaults": {
          "model": "anthropic/claude-sonnet-4-5",
          "fallbacks": ["openai/gpt-4o", "deepseek/deepseek-chat"]
        }
      },
      "providers": {
        "anthropic": { "api_key": "sk-ant-..." },
        "openai": { "api_key": "sk-..." },
        "deepseek": { "api_key": "sk-..." }
      }
    }
If Claude times out or returns 429/500/503, nanobot will automatically try GPT-4o, then DeepSeek, before returning an error.
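To make the attempt order in the usage example concrete, here is a small sketch that derives the model chain from that config and checks that every model in it has a configured key. `model_chain` and `providers_missing_keys` are hypothetical helpers, not part of nanobot, and the check assumes the model-name prefix matches the key under `providers`, which holds for this example but may not hold in general.

```python
import json

# The usage config from the PR description, with placeholder keys as given.
CONFIG = json.loads("""
{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-5",
      "fallbacks": ["openai/gpt-4o", "deepseek/deepseek-chat"]
    }
  },
  "providers": {
    "anthropic": { "api_key": "sk-ant-..." },
    "openai": { "api_key": "sk-..." },
    "deepseek": { "api_key": "sk-..." }
  }
}
""")


def model_chain(config: dict) -> list:
    """Primary model first, then fallbacks in configured order, without duplicates."""
    defaults = config["agents"]["defaults"]
    ordered = []
    for model in [defaults["model"], *defaults.get("fallbacks", [])]:
        if model not in ordered:
            ordered.append(model)
    return ordered


def providers_missing_keys(config: dict) -> list:
    """Provider prefixes used in the chain that have no api_key configured."""
    providers = config.get("providers", {})
    missing = []
    for model in model_chain(config):
        name = model.split("/", 1)[0]  # assumption: prefix == provider config key
        if not providers.get(name, {}).get("api_key") and name not in missing:
            missing.append(name)
    return missing
```

A check like this at startup could warn about a fallback that cannot authenticate before the first transient failure exposes it.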