feat(engine): LLM council via per-call model override in CodeAct#2320
serrrfirat merged 5 commits into staging from …
Conversation
Extend `llm_query()` and `llm_query_batched()` with a `model=` (and `models=` for the batched variant) keyword so CodeAct can route individual sub-queries to specific LLMs. The "LLM council" pattern becomes a skill — the agent broadcasts the same prompt across a parallel array of models and synthesizes the responses — with no new tool, dispatch path, or capability boundary.

- Add `model: Option<String>` to `LlmCallConfig`; thread it through `LlmBridgeAdapter` onto `CompletionRequest.model` / `ToolCompletionRequest.model` so providers that honor per-request overrides (NEAR AI, Anthropic OAuth, GitHub Copilot, Bedrock) pick it up. Other providers fall back to their configured model.
- `handle_llm_query` extracts a `model` arg; `handle_llm_query_batched` accepts either `model="..."` (broadcast) or `models=[...]` (parallel array, length-validated against `prompts`).
- `__llm_complete__` host fn extracts `model` from `explicit_config` so the Python orchestrator can also forward it.
- Update CodeAct preamble docs so the agent sees the new parameters.
- Add `skills/llm-council/SKILL.md` with the council pattern, recommended NEAR AI model line-ups, and a synthesis example.

Tests: 5 new scripting tests (model kwarg forwarding, default `None`, `models=` broadcast, single-`model=` broadcast, length-mismatch error) and 2 new bridge tests (`config.model` → `CompletionRequest.model` on both the no-tools and with-tools paths).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
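The council pattern this PR enables can be sketched end to end. Everything below is a hypothetical stand-in written for illustration: the real `llm_query_batched` is a CodeAct host function backed by LLM providers, and the stub here only mimics its documented signature and per-slot override behavior. The model IDs are the ones named in this PR, not a guaranteed lineup.

```python
from concurrent.futures import ThreadPoolExecutor

def llm_query_batched(prompts, context=None, model=None, models=None):
    """Stand-in for the CodeAct host function: one query per prompt,
    run in parallel, with an optional model override per slot."""
    if models is not None and len(models) != len(prompts):
        raise TypeError("llm_query_batched(): `models` must match `prompts` length")
    overrides = models if models is not None else [model] * len(prompts)

    def one(prompt, override):
        # The real host fn dispatches to the configured LLM provider;
        # we fake a response so the sketch is runnable.
        return f"[{override or 'default'}] answer to: {prompt}"

    with ThreadPoolExecutor() as pool:
        return list(pool.map(one, prompts, overrides))

# Council: broadcast one question across several models, then synthesize
# the answers with a follow-up single query.
question = "What are the trade-offs of per-call model overrides?"
council = ["anthropic/claude-opus-4-6", "google/gemini-3-pro", "openai/gpt-5.4"]
answers = llm_query_batched([question] * len(council), models=council)
synthesis_prompt = "Synthesize these answers:\n" + "\n---\n".join(answers)
```

In the real skill, `synthesis_prompt` would be passed to a final `llm_query()` call; the key point is that the whole pattern is expressible in agent-side code with no new tool.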
**Warning:** You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Address three review comments on the LLM council PR:

1. Test coverage for the orchestrator entry point. Add two tests driving `handle_llm_complete` directly with `explicit_config` containing `model` (and a control case without it). Closes the "test through the caller, not just the helper" gap — the previous tests only exercised `handle_llm_query`, leaving the parallel `__llm_complete__` host fn path unverified.
2. Loud failure on non-string entries in `models=[...]`. Previously `models=[1, 2]` was silently coerced via `monty_to_string` to `["1", "2"]`. Now returns `TypeError` with the offending value, matching the existing length-mismatch handling style.
3. `None` slots in `models=[...]` are no longer backfilled by the singular `model=` kwarg. Each slot is authoritative: a `None` means "no override for this prompt" (use the configured default). Mixing the two would have been surprising — the docs now spell out the contract explicitly. Add a regression test that passes both `models=[None, "gpt-4o"]` and `model="claude-..."` and asserts the `None` slot stays `None`.

Also update `skills/llm-council/SKILL.md` to default to a 4-model council of `anthropic/claude-opus-4-6`, `google/gemini-3-pro`, `zai-org/GLM-latest`, `openai/gpt-5.4`. Per-call provider errors already flow through the existing `Ok(Err(e))` arm as `"Error: ..."` strings — the batch never fails as a whole, so unavailable models just surface in their own slot.

Tests: 4 new (2 orchestrator, 2 scripting), all 346 engine unit tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
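The slot-resolution contract from points 2 and 3 can be written out as a short sketch. `resolve_overrides` is a hypothetical Python helper that mirrors the documented behavior (singular `model=` broadcasts only when `models=` is absent; a `None` slot is authoritative), not the Rust implementation:

```python
def resolve_overrides(prompts, model=None, models=None):
    """Mirror of the documented contract: each `models` slot is
    authoritative; a None slot means "use the configured default"
    and is NOT backfilled by the singular `model=` kwarg."""
    if models is None:
        return [model] * len(prompts)        # broadcast the singular kwarg
    if len(models) != len(prompts):
        raise TypeError("`models` length must match `prompts`")
    for m in models:
        if m is not None and not isinstance(m, str):
            raise TypeError(f"`models` must be a list of str or None, got {m!r}")
    return list(models)                      # None slots stay None

assert resolve_overrides(["a", "b"], model="claude-x") == ["claude-x", "claude-x"]
# A None slot is not backfilled by model=:
assert resolve_overrides(
    ["a", "b"], model="claude-x", models=[None, "gpt-4o"]
) == [None, "gpt-4o"]
```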
…odeact # Conflicts: # crates/ironclaw_engine/src/executor/orchestrator.rs
Pull request overview
Adds per-call model overrides to CodeAct’s existing llm_query() / llm_query_batched() host functions so a single CodeAct step can query a “council” of models, and includes documentation + a skill to teach the pattern.
Changes:

- Extended engine LLM call configuration with an optional per-request `model` override and forwarded it through the bridge adapter.
- Added `model=`/`models=` parsing + validation to CodeAct host functions, with unit tests covering the new behavior.
- Documented the new API in the CodeAct preamble and introduced an `llm-council` skill describing recommended usage.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `crates/ironclaw_engine/src/traits/llm.rs` | Adds `model: Option<String>` to `LlmCallConfig` for per-call model overrides. |
| `crates/ironclaw_engine/src/executor/scripting.rs` | Parses `model`/`models` kwargs for `llm_query*` and adds tests for plumbing and validation. |
| `crates/ironclaw_engine/src/executor/orchestrator.rs` | Forwards `explicit_config["model"]` into `LlmCallConfig` for `__llm_complete__`. |
| `crates/ironclaw_engine/prompts/codeact_preamble.md` | Documents new `model`/`models` parameters in the agent-visible preamble. |
| `src/bridge/llm_adapter.rs` | Forwards `config.model` into provider request model fields and adds coverage. |
| `skills/llm-council/SKILL.md` | New skill describing "LLM council" usage pattern and suggested lineups. |
```rust
) -> ExtFunctionResult {
    let prompt = extract_string_arg(args, kwargs, "prompt", 0);
    let context_arg = extract_string_arg(args, kwargs, "context", 1);
    let model_arg = extract_string_arg(args, kwargs, "model", 2);
```
`model` is extracted via `extract_string_arg`, which coerces `None` to the literal string `"None"` and also stringifies non-string values. That means `llm_query(..., model=None)` will incorrectly set a model override of `"None"` instead of leaving it unset. Consider parsing `model` as an optional string with explicit handling for `MontyObject::None` (=> `None`) and raising a `TypeError` for other non-string types.
```diff
-let model_arg = extract_string_arg(args, kwargs, "model", 2);
+let model_value = kwargs
+    .iter()
+    .find_map(|(key, value)| match key {
+        MontyObject::String(name) if name == "model" => Some(value),
+        _ => None,
+    })
+    .or_else(|| args.get(2));
+let model_arg = match model_value {
+    Some(MontyObject::None) => None,
+    Some(MontyObject::String(model)) => Some(model.clone()),
+    Some(_) => {
+        return ExtFunctionResult::Error(MontyException::new(
+            ExcType::TypeError,
+            Some("llm_query() 'model' argument must be a string or None".into()),
+        ));
+    }
+    None => None,
+};
```
Fixed in f47004c. Confirmed the bug: `extract_string_arg` delegates to `monty_to_string`, which converts `MontyObject::None` to the literal string `"None"`, so `llm_query(..., model=None)` would route to a bogus model ID. Replaced with a new `extract_optional_string_kwarg` helper that returns `Ok(None)` for missing/`None`, `Ok(Some(s))` for strings, and `TypeError` for anything else. Regression tests added: `llm_query_model_none_kwarg_is_no_override_not_literal_none_string` and `llm_query_rejects_non_string_model_kwarg`.
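The contract described in this fix can be mirrored in a few lines. `extract_optional_string_kwarg` here is a Python sketch of the behavior the comment describes (missing or `None` means no override, strings pass through, everything else is a loud `TypeError`), not the actual Rust helper:

```python
_MISSING = object()  # sentinel to distinguish "absent" from an explicit None

def extract_optional_string_kwarg(kwargs, name):
    """Missing or None -> None, str -> the string, anything else ->
    TypeError. No silent stringification of non-string values."""
    value = kwargs.get(name, _MISSING)
    if value is _MISSING or value is None:
        return None
    if isinstance(value, str):
        return value
    raise TypeError(
        f"'{name}' argument must be a string or None, got {type(value).__name__}"
    )

assert extract_optional_string_kwarg({}, "model") is None
assert extract_optional_string_kwarg({"model": None}, "model") is None  # not "None"!
assert extract_optional_string_kwarg({"model": "gpt-4o"}, "model") == "gpt-4o"
```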
```rust
// a `None` slot means "no override for this prompt" (the caller
// opted out of routing for that slot); the singular `model=` kwarg
// does NOT fill those slots, since mixing the two would be surprising.
let single_model = extract_string_arg(&[], kwargs, "model", usize::MAX);
```
Same issue for the batched API: `model` is extracted via `extract_string_arg`, so `model=None` becomes the string `"None"` and non-string values get stringified. This should behave like an optional string override: treat `None` as no override, and reject other non-string types with a `TypeError` to avoid silently routing to unintended model IDs.
```diff
-let single_model = extract_string_arg(&[], kwargs, "model", usize::MAX);
+let single_model = match kwargs.iter().find_map(|(k, v)| match k {
+    MontyObject::String(key) if key == "model" => Some(v),
+    _ => None,
+}) {
+    None | Some(MontyObject::None) => None,
+    Some(MontyObject::String(s)) => Some(s.clone()),
+    Some(other) => {
+        return ExtFunctionResult::Error(MontyException::new(
+            ExcType::TypeError,
+            Some(format!(
+                "llm_query_batched(): `model` must be a string or None, got {other:?}"
+            )),
+        ));
+    }
+};
```
Fixed in f47004c. Same root cause as comment #3067596620 — the singular `model=` kwarg on `llm_query_batched` used the same `extract_string_arg` helper with the same `None` → `"None"` coercion bug. Now uses the shared `extract_optional_string_kwarg`. Regression tests added: `llm_query_batched_single_model_none_kwarg_is_no_override` and `llm_query_batched_rejects_non_string_single_model`.
```rust
return ExtFunctionResult::Error(MontyException::new(
    ExcType::TypeError,
    Some(format!(
        "llm_query_batched(): `models` must be a list of strings, got {other:?}"
```
Error message says "models must be a list of strings", but the implementation explicitly allows None entries (and the message in the other branch even says "str or None"). Consider adjusting this message to reflect the actual accepted types (e.g., list of str or None).
```diff
-"llm_query_batched(): `models` must be a list of strings, got {other:?}"
+"llm_query_batched(): `models` must be a list of str or None, got {other:?}"
```
Fixed in f47004c. Message now reads `` `models` must be a list of str or None `` to match the actual accepted types.
```markdown
- `llm_query_batched()` runs all calls in parallel — total latency is roughly the slowest model.
- If `models=` is provided, it must be the same length as `prompts`.
- Use `model=` (singular) when you want one model applied to every prompt; use `models=` (plural list) for the council pattern.
- Errors from individual models surface as `"Error: ..."` strings in the result list — the batch never raises.
```
The note says the batched call "never raises", but llm_query_batched() does raise exceptions for invalid arguments (e.g., models length mismatch, non-list models, non-list prompts). Consider clarifying that per-model failures are returned as "Error: ..." strings, but argument validation errors still raise.
```diff
-- Errors from individual models surface as `"Error: ..."` strings in the result list — the batch never raises.
+- Errors from individual models surface as `"Error: ..."` strings in the result list, but invalid arguments (for example, wrong types or a `models`/`prompts` length mismatch) still raise exceptions.
```
Fixed in f47004c. Clarified the note: per-model runtime failures (unavailable model, provider timeout, etc.) surface as "Error: ..." strings so a single bad model doesn't fail the whole batch, but argument validation errors (wrong types, length mismatch) still raise.
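The two error channels clarified here can be sketched as follows. `run_batch` is a hypothetical illustration of the semantics (argument validation raises before any call; per-model runtime failures become `"Error: ..."` strings in their own slot), not the engine's implementation:

```python
def run_batch(prompts, models, call):
    """Argument validation raises up front; per-model runtime failures
    are caught and returned as "Error: ..." strings in their own slot,
    so one bad model never fails the whole batch."""
    if len(models) != len(prompts):
        raise TypeError("`models` length must match `prompts`")  # validation raises
    results = []
    for prompt, model in zip(prompts, models):
        try:
            results.append(call(prompt, model))
        except Exception as e:               # runtime failure -> string slot
            results.append(f"Error: {e}")
    return results

def flaky(prompt, model):
    # Stand-in provider call that fails for one model.
    if model == "bad/model":
        raise RuntimeError("model unavailable")
    return f"ok from {model}"

out = run_batch(["q", "q"], ["good/model", "bad/model"], flaky)
assert out == ["ok from good/model", "Error: model unavailable"]
```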
Address Copilot review on PR #2320. Four comments, all valid:

1 & 2. `model` and `single_model` were extracted via `extract_string_arg`, which calls `monty_to_string` — that coerces `MontyObject::None` to the literal string "None" and stringifies non-string values (ints become "1", etc.). So `llm_query(prompt="hi", model=None)` would silently route every call to a bogus model ID called "None". Add a strict `extract_optional_string_kwarg` helper that returns `Ok(None)` for missing/`None`, `Ok(Some(s))` for strings, and a `TypeError` for anything else. Use it in both `handle_llm_query` and `handle_llm_query_batched` for the `model=` kwarg. Regression tests cover: `model=None` → no override, `model=<int>` → TypeError, and the same two cases on the batched path.

3. The `models=` list-type error message said "list of strings" but we accept `None` entries. Updated to "list of str or None".

4. SKILL.md claimed the batched call "never raises". It does — for argument validation errors (wrong types, length mismatch). Clarified that per-model failures return as `"Error: ..."` strings, but argument validation still raises.

Tests: 4 new regression tests, all 4687 main-crate and 358 engine tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```markdown
using the built-in `llm_query()` and `llm_query_batched()` functions. Both
accept a `model=` (or `models=`) keyword that overrides the configured model
for that call. Providers that support per-request model overrides (NEAR AI,
Anthropic OAuth, GitHub Copilot, Bedrock) honor it; others fall back to their
```
Medium Severity
The skill says Bedrock honors per-request overrides and provides a cross-provider default council, but `src/llm/bedrock.rs` ignores `request.model` and explicitly reports the active model only. Also, non-aggregating providers like Anthropic OAuth/Copilot cannot switch to Google/OpenAI by provider-prefixed model name.
Please scope the default lineup to NEAR AI/aggregator backends, provide provider-specific lineups, and either implement Bedrock request-level model selection or remove Bedrock from the "honors it" list.
Fixed in 458f614. Verified both sub-points:

- Bedrock does not honor per-request overrides. `src/llm/bedrock.rs::complete()` unconditionally calls `self.current_model_id()` and never reads `request.model`. Dropped Bedrock from the "honors it" list.
- Cross-vendor routing requires an aggregator. Anthropic OAuth and GitHub Copilot do honor `set_model` (confirmed via `take_model_override` calls in both files), but they can only switch between models exposed by their own vendor — a direct Anthropic provider cannot route `openai/gpt-5.4` to OpenAI's API. Scoped the default prefixed 4-model lineup explicitly to NEAR AI.

SKILL.md now has a provider capability table with separate columns for "honors `model=`" and "cross-vendor routing", plus per-backend default lineups: NEAR AI (cross-vendor aggregator), Anthropic OAuth (Claude tiers only), Copilot (Copilot-exposed models). For backends that don't honor `model=` at all (Bedrock, raw OpenAI/Ollama/Tinfoil), the skill instructs the agent to tell the user and fall back to a single-model answer.

Implementing Bedrock request-level model selection would require changing `bedrock.rs` to build the Converse API call with `model_id(override.unwrap_or_else(...))`; I left that as out-of-scope for this PR since it's a separate provider change.
```rust
// does NOT fill those slots, since mixing the two would be surprising.
// See note in handle_llm_query: `model` must be parsed explicitly so that
// `model=None` doesn't become the literal string "None".
let single_model = match extract_optional_string_kwarg(&[], kwargs, "model", usize::MAX) {
```
Medium Severity
`llm_query_batched(prompts, context=None, model=None, models=None)` only reads `context`, `model`, and `models` from kwargs. Positional calls like `llm_query_batched(prompts, None, "gpt-4o")` silently ignore the model and route to the default model.

Please either extract `context` from position 1, `model` from position 2, and `models` from position 3, or document these as keyword-only and reject extra positional args. Add tests for positional `model`/`models` so the call signature in the preamble stays honest.
Fixed in 458f614. Confirmed the bug: the `extract_*_kwarg` calls all passed `&[]` for args, so `llm_query_batched(prompts, None, "gpt-4o")` silently dropped the model. Now threads the real `args` slice into each extractor with the documented positional indices (`context=1`, `model=2`, `models=3`). Positional `None` at any of those slots correctly means "no override". Three regression tests added: `llm_query_batched_honors_positional_context_and_model`, `llm_query_batched_honors_positional_models_list`, and `llm_query_batched_positional_none_for_models_is_no_override`. The preamble signature now matches actual behavior.
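The keyword-then-positional lookup rule this fix describes can be sketched in Python. `get_arg` is a hypothetical analogue of the Rust extractors, showing how the documented indices make positional calls behave like their keyword equivalents:

```python
def get_arg(args, kwargs, name, index):
    """Look up an argument by keyword first, then by its documented
    positional index; absent or an explicit None means "no override"."""
    if name in kwargs:
        return kwargs[name]
    return args[index] if index < len(args) else None

# Signature under test: llm_query_batched(prompts, context=None, model=None, models=None)
args = (["p1"], None, "gpt-4o")   # positional context=None, model="gpt-4o"
assert get_arg(args, {}, "context", 1) is None
assert get_arg(args, {}, "model", 2) == "gpt-4o"
assert get_arg(args, {}, "models", 3) is None   # not supplied -> default
```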
… docs

Address two review comments from serrrfirat on PR #2320.

1. `llm_query_batched` silently dropped positional `context`/`model`/`models` args. The documented signature is `llm_query_batched(prompts, context=None, model=None, models=None)`, but the extractors were hardcoded to kwargs only (`&[]` for args). A call like `llm_query_batched(prompts, None, "gpt-4o")` routed to the default model — silent contract violation. Thread the real `args` slice into each extractor with the documented positional indices: context=1, model=2, models=3. Positional `MontyObject::None` at any of those slots now correctly means "no override". Added 3 regression tests:
   - `llm_query_batched_honors_positional_context_and_model`
   - `llm_query_batched_honors_positional_models_list`
   - `llm_query_batched_positional_none_for_models_is_no_override`
2. SKILL.md claimed Bedrock honors per-request model overrides, but `bedrock.rs::complete()` unconditionally uses `self.current_model_id()` and ignores `request.model`. Also, the default 4-model prefixed lineup (`anthropic/...`, `google/...`, `openai/...`) only works on aggregator backends like NEAR AI — a direct Anthropic OAuth or Copilot provider honors `set_model` but can only switch between models within its own vendor. Rewrite the SKILL.md preamble with a provider capability table (dropping Bedrock from "honors it" and adding cross-vendor routing as a separate column), and add per-backend default lineups: NEAR AI (prefixed cross-vendor), Anthropic OAuth (Anthropic tiers only), Copilot (Copilot-exposed models). For backends that don't honor `model=` at all (Bedrock, raw OpenAI/Ollama/Tinfoil), the skill now instructs the agent to tell the user and fall back to a single-model answer.

Tests: 3 new regression tests, all 4687 main-crate and 361 engine tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
```markdown
| Backend | Honors `model=`? | Cross-vendor routing? |
|--------------------|------------------|-----------------------|
| NEAR AI | Yes | Yes (aggregator — hosts models from many vendors) |
| Anthropic OAuth | Yes | No (Anthropic models only) |
| GitHub Copilot | Yes | No (Copilot-exposed models only) |
| Bedrock | **No** | — (model fixed at construction) |
| OpenAI / Ollama / Tinfoil via rig | No (silent fallback with warning log) | — |
```
The Bedrock row here says per-request `model=` is not honored (and implies the model is "fixed at construction"), but the PR description claims Bedrock supports per-request model overrides. The current Bedrock provider explicitly ignores per-request overrides (see `src/llm/bedrock.rs` `effective_model_name`: it always returns the active model). Please reconcile the PR description and reword this row to avoid implying Bedrock can't switch models at runtime (it supports `set_model()`, just not per-request overrides).
```rust
/// Optional per-call model override. When set, the bridge adapter forwards
/// this to the underlying `LlmProvider` via `CompletionRequest::model`.
/// Providers that don't support per-request overrides will fall back to
/// their configured model and log a warning.
```
Doc comment for LlmCallConfig.model says providers that don’t support per-request overrides “will … log a warning”, but at least the Bedrock and rig adapters currently ignore overrides without emitting a warning. Please either soften this to “may log a warning” or add a warning at the bridge/provider layer when an override is provided but effective_model_name() resolves to the active/default model.
```diff
-/// their configured model and log a warning.
+/// their configured model and may log a warning.
```
```markdown
| Anthropic OAuth | Yes | No (Anthropic models only) |
| GitHub Copilot | Yes | No (Copilot-exposed models only) |
| Bedrock | **No** | — (model fixed at construction) |
| OpenAI / Ollama / Tinfoil via rig | No (silent fallback with warning log) | — |
```
This table says the rig-backed providers (OpenAI / Ollama / Tinfoil) “silently fallback with warning log” when model= is provided, but the current rig adapter path injects a model field without logging, and effective_model_name() ignores the override. Please either update this wording (no warning today) or add an explicit warning when a model override is provided but the adapter/provider won’t honor it.
```diff
-| OpenAI / Ollama / Tinfoil via rig | No (silent fallback with warning log) | — |
+| OpenAI / Ollama / Tinfoil via rig | No (override currently not honored; no warning log) | — |
```
**serrrfirat** left a comment
Thorough review complete. Clean design, excellent test coverage (18 new tests), well-motivated helper (extract_optional_string_kwarg correctly avoids the None → "None" coercion bug). No blocking issues found.
Minor suggestions for follow-up (non-blocking):

- PR body provider table lists Bedrock as `Yes` for model override, but SKILL.md correctly says `No` — consider fixing the PR body to avoid confusion
- `context` arg in `handle_llm_query_batched` switched from lenient coercion to strict str/None — creates a subtle inconsistency with `handle_llm_query`, where context still uses `extract_string_arg`. Consider aligning both to the strict extractor
- No test for `model=""` (empty string) — downstream `normalized_model_override()` handles it, but a unit test would document the expectation
LGTM 🚢
Summary
- Adds a `model=` (and `models=` for the batched variant) keyword to the existing `llm_query()` / `llm_query_batched()` host functions. No new tool, no new dispatch path.
- Threads the override through `LlmCallConfig` → `LlmBridgeAdapter` → `CompletionRequest.model`, which providers that support per-request switching (NEAR AI, Anthropic OAuth, GitHub Copilot, Bedrock) already honor.
- Adds an `llm-council` skill that activates on council/cross-reference language and teaches the pattern (broadcast same prompt across N models, then synthesize via a follow-up `llm_query`).

Why a skill, not a tool?
Originally drafted as a `LlmCouncilTool` built-in. Reverted in favor of extending CodeAct because:

- … `llm_query`.
- … `SKILL.md`. No Rust changes needed to add new council variants.
- … `llm_query(..., model="...")`.

API
Files changed
- `crates/ironclaw_engine/src/traits/llm.rs`: `model: Option<String>` to `LlmCallConfig`
- `crates/ironclaw_engine/src/executor/scripting.rs`: `handle_llm_query` and `handle_llm_query_batched` extract `model=`/`models=`; length-mismatch validation; 5 new unit tests
- `crates/ironclaw_engine/src/executor/orchestrator.rs`: `__llm_complete__` host fn extracts `model` from explicit_config
- `crates/ironclaw_engine/prompts/codeact_preamble.md`
- `src/bridge/llm_adapter.rs`: `config.model` onto `CompletionRequest.model` / `ToolCompletionRequest.model` for both paths; 2 new tests
- `skills/llm-council/SKILL.md`

Provider behavior
- … (`take_model_override()`)
- … (`set_model()` based)

Test plan
- `cargo test -p ironclaw_engine --lib`: 342 tests pass (337 + 5 new in `executor::scripting`)
- `cargo test --lib`: 4644 tests pass (incl. 2 new in `bridge::llm_adapter`)
- `cargo clippy --all --benches --tests --examples --all-features`: zero warnings
- … `model=` is passed

🤖 Generated with Claude Code