
feat(runtime): add explicit LLM fallback chain across providers/models#3354

Open
hussein1362 wants to merge 2 commits into HKUDS:main from hussein1362:feat/llm-fallback-chain

Conversation

@hussein1362
Contributor

Problem

nanobot can be configured with multiple LLM providers, but at runtime it uses only one provider/model pair.

Today, if the active provider/model hits a transient upstream failure, nanobot retries only that same provider via provider_retry_mode. Once those retries are exhausted, the turn fails even if another configured provider is healthy.

That leaves a real gap between configuration and runtime behavior:

  • multiple providers can be configured
  • only one provider/model is active for the turn
  • transient provider failure still becomes a hard user-visible failure

Streaming has one extra constraint: once text has already started streaming, we cannot safely fail over to another model without duplicating or corrupting user-visible output.

Root Cause

Provider selection currently happens once inside _make_provider(), which constructs a single provider instance and returns it.

After that:

  • provider_retry_mode retries only the same provider
  • there is no explicit ordered fallback chain in config
  • there is no runtime wrapper that can move to another provider/model after terminal transient failure
  • streaming has no guardrail for the "fallback after first delta" case

Solution

This PR adds explicit runtime fallback chains across providers/models.

Config

Adds agents.defaults.fallbacks, an ordered list of fallback targets:

{
  "agents": {
    "defaults": {
      "model": "gpt-5.4",
      "provider": "openai",
      "fallbacks": [
        {"model": "claude-sonnet-4-6", "provider": "anthropic"},
        {"model": "gpt-4.1-mini"}
      ]
    }
  }
}

Each fallback entry supports:

  • model (required)
  • provider (optional, defaults to auto)
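A minimal sketch of how such fallback entries could be parsed and validated. The names `FallbackTarget` and `parse_fallbacks` are illustrative, not the actual schema classes in `nanobot/config/schema.py`:

```python
from dataclasses import dataclass

@dataclass
class FallbackTarget:
    """One entry in agents.defaults.fallbacks (illustrative, not the PR's schema)."""
    model: str
    provider: str = "auto"  # optional; "auto" means resolve from the model string

def parse_fallbacks(defaults: dict) -> list[FallbackTarget]:
    """Parse an ordered list of fallback targets from the defaults block."""
    targets = []
    for entry in defaults.get("fallbacks", []):
        if "model" not in entry:
            raise ValueError("fallback entry requires 'model'")
        targets.append(
            FallbackTarget(model=entry["model"],
                           provider=entry.get("provider", "auto"))
        )
    return targets
```

Order matters: the list is tried top to bottom, so the cheapest or most trusted fallback should come first.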

Runtime

Extract provider construction into a shared factory used by both runtime and CLI.

Add FallbackProvider, which wraps the primary provider plus ordered fallback candidates and applies this policy:

  1. Use the primary provider first.
  2. Preserve the existing retry behavior within that provider.
  3. If the provider still ends in a transient error, move to the next configured fallback.
  4. Stop immediately on non-transient errors.
  5. For streaming, never fail over after the first content delta has been emitted.
  6. Keep behavior unchanged when no fallbacks are configured.
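The policy above can be sketched as a wrapper that walks the ordered candidates, treating transient and non-transient errors differently, and refusing to fail over mid-stream. This is a simplified illustration under assumed names (`TransientProviderError`, `complete`, `stream`), not the PR's actual `FallbackProvider` code:

```python
import copy

class TransientProviderError(Exception):
    """Assumed marker for retryable upstream failures (rate limit, 5xx, timeout)."""

class FallbackProvider:
    def __init__(self, candidates):
        # candidates: primary provider first, then ordered fallback providers;
        # each candidate is assumed to apply its own internal retry policy
        self.candidates = candidates

    def complete(self, messages):
        last_err = None
        for candidate in self.candidates:
            try:
                # each candidate sees an isolated copy of the conversation
                return candidate.complete(copy.deepcopy(messages))
            except TransientProviderError as err:
                last_err = err  # transient terminal failure: try the next candidate
            # any other exception is non-transient and propagates immediately
        raise last_err

    def stream(self, messages):
        emitted = False
        for candidate in self.candidates:
            try:
                for delta in candidate.stream(copy.deepcopy(messages)):
                    emitted = True
                    yield delta
                return
            except TransientProviderError:
                if emitted:
                    raise  # never fail over after the first visible delta
                # nothing shown to the user yet: safe to try the next candidate
```

With a single candidate and no fallbacks configured, the wrapper degenerates to the current behavior: the primary's error propagates unchanged.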

Implementation details:

  • each candidate gets its own provider instance
  • messages are deep-copied per candidate so provider-specific rewrites do not leak across fallback boundaries
  • CLI and runtime now share the same provider-construction logic instead of maintaining two separate implementations
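The per-candidate deep copy matters because a provider adapter may rewrite the message list in place (system-prompt handling, role mapping, and so on). A small illustration of the leakage it prevents, with a hypothetical in-place rewrite standing in for a real adapter:

```python
import copy

messages = [{"role": "system", "content": "base prompt"}]

def provider_specific_rewrite(msgs):
    # hypothetical adapter that mutates messages, then fails transiently
    msgs[0]["content"] = "rewritten for provider A"
    raise RuntimeError("transient failure after rewriting")

attempt = copy.deepcopy(messages)  # candidate works on its own copy
try:
    provider_specific_rewrite(attempt)
except RuntimeError:
    pass

# the original conversation is untouched for the next fallback candidate
assert messages[0]["content"] == "base prompt"
```

Without the copy, the second candidate would receive messages already rewritten for the first provider's wire format.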

Tests

Added coverage for:

  • parsing explicit fallback targets from config
  • falling through to the next candidate after transient terminal failure
  • stopping on non-transient errors without failover
  • disabling stream failover after the first emitted delta
  • existing CLI _make_provider() behavior still working through the shared factory path

Commands run:

uv run --python 3.12 pytest -q tests/providers/test_provider_retry.py tests/cli/test_commands.py -k 'fallback or make_provider'
uv run --python 3.12 ruff check nanobot/config/schema.py nanobot/providers/fallback_provider.py nanobot/providers/factory.py tests/providers/test_provider_retry.py

…get_api_base

Config.get_api_base() re-resolves the provider from the model string via
_match_provider(), which only reads agents.defaults.provider. For fallback
targets that override the provider, this would return the wrong provider
config. _resolve_api_base() operates on the already-resolved objects from
_resolve_provider() to avoid this.
