
feat(runtime): add explicit LLM fallback chain across providers/models#3354

Open
hussein1362 wants to merge 2 commits into HKUDS:main from hussein1362:feat/llm-fallback-chain

Conversation

@hussein1362
Contributor

Problem

nanobot can be configured with multiple LLM providers, but at runtime it uses only one provider/model pair.

Today, if the active provider/model hits a transient upstream failure, nanobot retries only that same provider via provider_retry_mode. Once those retries are exhausted, the turn fails even if another configured provider is healthy.

That leaves a real gap between configuration and runtime behavior:

  • multiple providers can be configured
  • only one provider/model is active for the turn
  • transient provider failure still becomes a hard user-visible failure

Streaming has one extra constraint: once text has already started streaming, we cannot safely fail over to another model without duplicating or corrupting user-visible output.

Root Cause

Provider selection currently happens once inside _make_provider(), which constructs a single provider instance and returns it.

After that:

  • provider_retry_mode retries only the same provider
  • there is no explicit ordered fallback chain in config
  • there is no runtime wrapper that can move to another provider/model after terminal transient failure
  • streaming has no guardrail for the "fallback after first delta" case

Solution

This PR adds explicit runtime fallback chains across providers/models.

Config

Adds agents.defaults.fallbacks, an ordered list of fallback targets:

{
  "agents": {
    "defaults": {
      "model": "gpt-5.4",
      "provider": "openai",
      "fallbacks": [
        {"model": "claude-sonnet-4-6", "provider": "anthropic"},
        {"model": "gpt-4.1-mini"}
      ]
    }
  }
}

Each fallback entry supports:

  • model (required)
  • provider (optional, defaults to auto)
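A minimal sketch of how such fallback entries could be parsed and validated. The names `FallbackTarget` and `parse_fallbacks` are illustrative, not the actual schema classes in `nanobot/config/schema.py`:

```python
from dataclasses import dataclass

@dataclass
class FallbackTarget:
    """One entry in agents.defaults.fallbacks (illustrative, not the PR's schema)."""
    model: str
    provider: str = "auto"  # optional; "auto" means resolve from the model string

def parse_fallbacks(defaults: dict) -> list[FallbackTarget]:
    """Parse an ordered list of fallback targets from the defaults block."""
    targets = []
    for entry in defaults.get("fallbacks", []):
        if "model" not in entry:
            raise ValueError("fallback entry requires 'model'")
        targets.append(
            FallbackTarget(model=entry["model"],
                           provider=entry.get("provider", "auto"))
        )
    return targets
```

Order matters: the list is tried top to bottom, so the cheapest or most trusted fallback should come first.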

Runtime

Extract provider construction into a shared factory used by both runtime and CLI.

Add FallbackProvider, which wraps the primary provider plus ordered fallback candidates and applies this policy:

  1. Use the primary provider first.
  2. Preserve the existing retry behavior within that provider.
  3. If the provider still ends in a transient error, move to the next configured fallback.
  4. Stop immediately on non-transient errors.
  5. For streaming, never fail over after the first content delta has been emitted.
  6. Keep behavior unchanged when no fallbacks are configured.
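The policy above can be sketched as a wrapper that walks the ordered candidates, treating transient and non-transient errors differently, and refusing to fail over mid-stream. This is a simplified illustration under assumed names (`TransientProviderError`, `complete`, `stream`), not the PR's actual `FallbackProvider` code:

```python
import copy

class TransientProviderError(Exception):
    """Assumed marker for retryable upstream failures (rate limit, 5xx, timeout)."""

class FallbackProvider:
    def __init__(self, candidates):
        # candidates: primary provider first, then ordered fallback providers;
        # each candidate is assumed to apply its own internal retry policy
        self.candidates = candidates

    def complete(self, messages):
        last_err = None
        for candidate in self.candidates:
            try:
                # each candidate sees an isolated copy of the conversation
                return candidate.complete(copy.deepcopy(messages))
            except TransientProviderError as err:
                last_err = err  # transient terminal failure: try the next candidate
            # any other exception is non-transient and propagates immediately
        raise last_err

    def stream(self, messages):
        emitted = False
        for candidate in self.candidates:
            try:
                for delta in candidate.stream(copy.deepcopy(messages)):
                    emitted = True
                    yield delta
                return
            except TransientProviderError:
                if emitted:
                    raise  # never fail over after the first visible delta
                # nothing shown to the user yet: safe to try the next candidate
```

With a single candidate and no fallbacks configured, the wrapper degenerates to the current behavior: the primary's error propagates unchanged.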

Implementation details:

  • each candidate gets its own provider instance
  • messages are deep-copied per candidate so provider-specific rewrites do not leak across fallback boundaries
  • CLI and runtime now share the same provider-construction logic instead of maintaining two separate implementations
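The per-candidate deep copy matters because a provider adapter may rewrite the message list in place (system-prompt handling, role mapping, and so on). A small illustration of the leakage it prevents, with a hypothetical in-place rewrite standing in for a real adapter:

```python
import copy

messages = [{"role": "system", "content": "base prompt"}]

def provider_specific_rewrite(msgs):
    # hypothetical adapter that mutates messages, then fails transiently
    msgs[0]["content"] = "rewritten for provider A"
    raise RuntimeError("transient failure after rewriting")

attempt = copy.deepcopy(messages)  # candidate works on its own copy
try:
    provider_specific_rewrite(attempt)
except RuntimeError:
    pass

# the original conversation is untouched for the next fallback candidate
assert messages[0]["content"] == "base prompt"
```

Without the copy, the second candidate would receive messages already rewritten for the first provider's wire format.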

Tests

Added coverage for:

  • parsing explicit fallback targets from config
  • falling through to the next candidate after transient terminal failure
  • stopping on non-transient errors without failover
  • disabling stream failover after the first emitted delta
  • existing CLI _make_provider() behavior still working through the shared factory path

Commands run:

uv run --python 3.12 pytest -q tests/providers/test_provider_retry.py tests/cli/test_commands.py -k 'fallback or make_provider'
uv run --python 3.12 ruff check nanobot/config/schema.py nanobot/providers/fallback_provider.py nanobot/providers/factory.py tests/providers/test_provider_retry.py

…get_api_base

Config.get_api_base() re-resolves the provider from the model string via
_match_provider(), which only reads agents.defaults.provider. For fallback
targets that override the provider, this would return the wrong provider
config. _resolve_api_base() operates on the already-resolved objects from
_resolve_provider() to avoid this.
