Skip to content

fix(gateway): v2 engine tool_calls persistence + e2e test coverage#2452

Merged
ilblackdragon merged 9 commits intonearai:stagingfrom
standardtoaster:test/v2-engine-tool-lifecycle
Apr 19, 2026
Merged

fix(gateway): v2 engine tool_calls persistence + e2e test coverage#2452
ilblackdragon merged 9 commits intonearai:stagingfrom
standardtoaster:test/v2-engine-tool-lifecycle

Conversation

@standardtoaster
Copy link
Copy Markdown
Contributor

Summary

  • The v2 engine had zero e2e coverage for the tool call -> result -> response path (flagged in Refactor: Remove V1 agent loop code, fully migrate to Engine V2 #2193 audit)
  • The v2 engine executed tools correctly but never persisted tool_calls metadata to the chat history DB, so the web UI showed empty tool call arrays for all v2 threads
  • After thread completion, ActionResult messages are extracted from the thread's internal transcript and written as a role="tool_calls" DB row before the assistant response
  • Unlike the existing write_v1_response, conversation ID resolution failures are logged rather than silently swallowed

Test plan

  • cargo clippy --no-default-features --features libsql -- -D warnings -- zero warnings
  • 6 new e2e tests pass (pytest tests/e2e/scenarios/test_v2_engine_tool_lifecycle.py -v)
  • 9 existing e2e tests unaffected (error_handling, tool_execution, agent_loop_recovery)
  • Manual verification: v2 chat history returns populated tool_calls array with correct names and result previews
# Manual verification
curl -s -H "Authorization: Bearer $TOKEN" \
  -X POST http://localhost:8080/api/chat/thread/new | jq .id

curl -s -H "Authorization: Bearer $TOKEN" \
  -X POST http://localhost:8080/api/chat/send \
  -H "Content-Type: application/json" \
  -d '{"content":"what time is it","thread_id":"<THREAD_ID>"}'

# Wait a few seconds, then:
curl -s -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/api/chat/history?thread_id=<THREAD_ID>" \
  | jq '.turns[-1].tool_calls'
# Before: []
# After:  [{"name":"time","has_result":true,"result_preview":"..."}]

🤖 Generated with Claude Code

@github-actions github-actions Bot added size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: regular 2-5 merged PRs labels Apr 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the persistence of v2 engine tool call metadata into the v1 conversation database, ensuring that tool execution history is correctly reflected in the web UI. It also introduces a comprehensive E2E test suite for the v2 engine tool lifecycle, covering single, parallel, and multi-step tool chains. Feedback was provided to simplify the message iteration in persist_v2_tool_calls by traversing only the internal transcript, as the public message list is redundant for locating action results.

Comment thread src/bridge/router.rs Outdated
// `internal_messages` has the full chain including action results with
// actual tool output content.
let mut calls = Vec::new();
for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment on lines 3046-3048 suggests that internal_messages contains the full chain of messages including action results, while messages only contains user/assistant messages. If this is correct, iterating over thread.messages.iter() is unnecessary because you are filtering for MessageRole::ActionResult, which would not be present in thread.messages. You can simplify this to iterate only over internal_messages. Additionally, when reconstructing these tool results, ensure you look up the call_id from the preceding FunctionCall item to ensure proper correlation between the call and its output.

Suggested change
for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {
for msg in &thread.internal_messages {
References
  1. When reconstructing tool results from stored messages with role 'tool', look up the call_id from the preceding FunctionCall item to ensure proper correlation between the call and its output.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ff372e1. Iterating only internal_messages now.

Comment thread src/bridge/router.rs Outdated
// `internal_messages` has the full chain including action results with
// actual tool output content.
let mut calls = Vec::new();
for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — Potential duplicate tool calls

The loop chains both internal_messages and messages:

for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {

The inline comment states "user-visible messages only has user/assistant messages" — but if that invariant ever breaks (or doesn't hold for all thread types), ActionResult messages present in both collections will produce duplicate entries in the persisted calls array. There's no deduplication by action_call_id.

Suggestion: Either iterate only internal_messages (if the comment is correct, the .chain(thread.messages.iter()) is dead weight), or deduplicate before serializing:

let mut seen_ids = HashSet::new();
// ... in the loop:
if let Some(ref call_id) = msg.action_call_id {
    if !seen_ids.insert(call_id.clone()) {
        continue;
    }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ff372e1. Dropped the .chain(thread.messages.iter()) -- now iterates only internal_messages as you suggested.

Comment thread src/bridge/router.rs Outdated
let content = match serde_json::to_string(&wrapper) {
Ok(c) => c,
Err(e) => {
tracing::warn!("failed to serialize v2 tool_calls: {e}");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — tracing::warn! in gateway request path may corrupt TUI

Per CLAUDE.md: "info! and warn! output appears in the REPL and corrupts the terminal UI."

This function has three tracing::warn!() calls (here, line ~3106, and line ~3117) that fire on non-critical persistence failures. Since persist_v2_tool_calls is fire-and-forget and these are not user-actionable, they should use debug!() — consistent with the two debug!() calls earlier in this same function for thread-not-found / load-failure.

// Change these three:
tracing::warn!("failed to serialize v2 tool_calls: {e}");
tracing::warn!(thread_id = %thread_id, "failed to resolve v1 conversation ...");
tracing::warn!("failed to persist v2 tool_calls to v1 DB: {e}");

// To:
debug!("failed to serialize v2 tool_calls: {e}");
debug!(thread_id = %thread_id, "failed to resolve v1 conversation ...");
debug!("failed to persist v2 tool_calls to v1 DB: {e}");

(Also a minor style note: debug! is used bare via use tracing::debug at the top of the file, while warn is fully qualified — normalizing to debug! fixes both issues.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ff372e1. All three calls are debug!() now, consistent with the rest of the function.


history = await _wait_for_response(
v2_tool_server,
thread_id,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — 5 of 6 tests don't verify the core feature (tool_calls persistence)

Only test_echo_tool asserts on the tool_calls array. The other five tests verify response text but never check that tool_calls was persisted to chat history — which is the stated purpose of this PR.

A regression that silently breaks persistence for parallel or multi-step flows would pass the entire suite.

Suggestion: Add tool_calls assertions to at least:

  • test_parallel_tool_calls — assert 2 entries with names echo and time
  • test_multi_step_chain — assert 2 entries: echo then time
  • test_tool_then_text_then_tool — assert tool_calls present in turns 1 and 3, absent/empty in turn 2

These are the highest-value additions since they cover the flows most likely to regress independently of single-tool persistence.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 652315e. Added tool_calls assertions to test_parallel_tool_calls (2 entries, echo + time), test_multi_step_chain (2 entries with results), and test_tool_then_text_then_tool (tool_calls present in turns 1 and 3, absent in turn 2).

@github-actions github-actions Bot added size: M 50-199 changed lines scope: db/postgres PostgreSQL backend DB MIGRATION PR adds or modifies PostgreSQL or libSQL migration definitions and removed size: XL 500+ changed lines labels Apr 15, 2026
Comment thread src/bridge/router.rs Outdated
if let Ok(Some(ref text)) = result
&& let Some(ref db) = state.db
{
// Write tool_calls row BEFORE assistant response so the v1 history
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High Severity — Tool calls and response orphaned after gate resume

persist_v2_tool_calls is called inside the if let Ok(Some(ref text)) = result block, which fires for all outcomes including GatePaused. This creates a correctness issue on the gate-approval-resume path:

  1. User sends message → handle_with_engine writes user row to v1 DB
  2. Thread hits gate → await_thread_outcome returns GatePaused
  3. persist_v2_tool_calls writes tool_calls row (partial tools before gate)
  4. write_v1_response writes assistant row ("Tool requires approval…")
  5. User sends "yes" → routed to handle_approval (NOT handle_with_engine, so no "user" row is written to v1 DB)
  6. Thread resumes, completes → await_thread_outcome runs again
  7. persist_v2_tool_calls writes another tool_calls row (ALL tools, duplicating pre-gate ones)
  8. write_v1_response writes assistant row (actual response)

DB state after:

user         → "do something"
tool_calls   → [partial]              ← step 3
assistant    → "requires approval"    ← step 4  (consumed by turn parser)
tool_calls   → [ALL tools]            ← step 7  (ORPHANED)
assistant    → "actual response"      ← step 8  (ORPHANED)

build_turns_from_db_messages (util.rs:100) only creates turns starting from role == "user" messages — the second tool_calls + assistant pair has no preceding "user" row, so they're silently skipped. The actual tool results and final response are invisible in the history API.

Note: the assistant orphaning is pre-existing (from write_v1_response), but this PR adds an additional orphaned tool_calls row and duplicates pre-gate entries.

Suggested fix: Move persist_v2_tool_calls into the Completed match arm so it only runs for final outcomes:

ThreadOutcome::Completed { response } => {
    if let Some(ref db) = state.db {
        persist_v2_tool_calls(&state.store, db, thread_id, message).await;
    }
    // ... existing credential fallback logic ...
    Ok(response)
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in 652315e. Moved persist_v2_tool_calls into the Completed match arm so it only fires for final outcomes. The GatePaused path no longer writes partial/duplicate rows.

Comment thread src/bridge/router.rs Outdated
if let Ok(Some(ref text)) = result
&& let Some(ref db) = state.db
{
// Write tool_calls row BEFORE assistant response so the v1 history
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — Tool calls not persisted for Completed { response: None }

The persist_v2_tool_calls call is guarded by if let Ok(Some(ref text)) = result, so it only fires when the outcome produces a text response. ThreadOutcome::Completed { response: None } (thread completes with tools but no final text) skips this block entirely, leaving a gap in the tool_calls history.

If the fix for the GatePaused issue (finding #1) moves persist_v2_tool_calls into the Completed match arm, this is also resolved — the call would run regardless of whether response is Some or None:

ThreadOutcome::Completed { response } => {
    if let Some(ref db) = state.db {
        persist_v2_tool_calls(&state.store, db, thread_id, message).await;
    }
    // ... rest of Completed handling ...
    Ok(response)
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the same commit (652315e). The call is now in the Completed arm unconditionally, so it runs regardless of whether response is Some or None.

Comment thread src/bridge/router.rs
@@ -3006,6 +3010,106 @@ async fn handle_mission_notification(
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity — No unit/integration test for persist_v2_tool_calls

This function is only covered by e2e tests (which are slow and run with AGENT_AUTO_APPROVE_TOOLS=true, so they never exercise the GatePaused path). Edge cases not covered:

Per the project's testing rules ("Test Through the Caller"), an integration-tier test driving the await_thread_outcome path with a mock store/DB would catch these regressions faster and more reliably than the e2e suite.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two libsql-backed unit tests in 652315e. One verifies correct extraction from internal_messages, the other verifies skip behavior for text-only threads. The GatePaused path is harder to test at unit level since it needs the full engine, but the e2e suite now covers the persistence assertions that would catch regressions there.

@github-actions github-actions Bot added size: L 200-499 changed lines and removed size: M 50-199 changed lines labels Apr 15, 2026
ilblackdragon added a commit that referenced this pull request Apr 19, 2026
…ceptance

Closes two gaps blocking v2 engine becoming the default:

**Phase 4 — token + cost accounting**

- Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs`
  (176 lines). The Python orchestrator (`default.py::compact_if_needed`)
  has owned compaction policy since #1557; the Rust module had no
  callers anywhere in the workspace.
- Wire `cost_usd` in `LlmBridgeAdapter` by calling
  `LlmProvider::calculate_cost()` at both the no-tools and with-tools
  response paths. The engine's `Thread::total_cost_usd` accumulator and
  `max_budget_usd` gates were already plumbed — only the adapter was
  hardcoding 0.0.
- Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in
  `store_adapter.rs`. Previously, rehydrating an archived thread
  silently dropped the cost to 0.0. `#[serde(default)]` keeps existing
  archive files deserializing cleanly.

**Phase 6 — mission lifecycle acceptance**

Three new integration tests in `bridge/effect_adapter.rs` driving
`execute_action()` end-to-end (per `.claude/rules/testing.md` "Test
Through the Caller"):

- `mission_full_lifecycle_via_execute_action` — create → list → complete
  → list, asserting the `Completed` status surfaces through
  `mission_list` after `mission_complete`.
- `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action`
  — fresh manual mission fires successfully and returns a UUID thread_id
  rather than `not_fired`.
- `mission_list_returns_all_user_missions_via_execute_action` — all
  three created missions appear in `mission_list` output.

**Regression tests for cost wiring**

Three new tests in `bridge/llm_adapter.rs`:

- `complete_no_tools_populates_cost_usd_through_adapter`
- `complete_with_tools_populates_cost_usd_through_adapter`
- `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that
  `depth > 0` is priced with the cheap provider, not the primary.

Coordinated with in-flight work: skipped paths owned by #2504 (auth
E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549
(mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot).

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5079 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 19, 2026
…ceptance

Closes two gaps blocking v2 engine becoming the default:

**Phase 4 — token + cost accounting**

- Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs`
  (176 lines). The Python orchestrator (`default.py::compact_if_needed`)
  has owned compaction policy since #1557; the Rust module had no
  callers anywhere in the workspace.
- Wire `cost_usd` in `LlmBridgeAdapter` by calling
  `LlmProvider::calculate_cost()` at both the no-tools and with-tools
  response paths. The engine's `Thread::total_cost_usd` accumulator and
  `max_budget_usd` gates were already plumbed — only the adapter was
  hardcoding 0.0.
- Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in
  `store_adapter.rs`. Previously, rehydrating an archived thread
  silently dropped the cost to 0.0. `#[serde(default)]` keeps existing
  archive files deserializing cleanly.

**Phase 6 — mission lifecycle acceptance**

Three new integration tests in `bridge/effect_adapter.rs` driving
`execute_action()` end-to-end (per `.claude/rules/testing.md` "Test
Through the Caller"):

- `mission_full_lifecycle_via_execute_action` — create → list → complete
  → list, asserting the `Completed` status surfaces through
  `mission_list` after `mission_complete`.
- `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action`
  — fresh manual mission fires successfully and returns a UUID thread_id
  rather than `not_fired`.
- `mission_list_returns_all_user_missions_via_execute_action` — all
  three created missions appear in `mission_list` output.

**Regression tests for cost wiring**

Three new tests in `bridge/llm_adapter.rs`:

- `complete_no_tools_populates_cost_usd_through_adapter`
- `complete_with_tools_populates_cost_usd_through_adapter`
- `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that
  `depth > 0` is priced with the cheap provider, not the primary.

Coordinated with in-flight work: skipped paths owned by #2504 (auth
E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549
(mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot).

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5079 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 19, 2026
…ceptance (#2660)

* feat(engine-v2): Phase 4 cost tracking + Phase 6 mission lifecycle acceptance

Closes two gaps blocking v2 engine becoming the default:

**Phase 4 — token + cost accounting**

- Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs`
  (176 lines). The Python orchestrator (`default.py::compact_if_needed`)
  has owned compaction policy since #1557; the Rust module had no
  callers anywhere in the workspace.
- Wire `cost_usd` in `LlmBridgeAdapter` by calling
  `LlmProvider::calculate_cost()` at both the no-tools and with-tools
  response paths. The engine's `Thread::total_cost_usd` accumulator and
  `max_budget_usd` gates were already plumbed — only the adapter was
  hardcoding 0.0.
- Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in
  `store_adapter.rs`. Previously, rehydrating an archived thread
  silently dropped the cost to 0.0. `#[serde(default)]` keeps existing
  archive files deserializing cleanly.

**Phase 6 — mission lifecycle acceptance**

Three new integration tests in `bridge/effect_adapter.rs` driving
`execute_action()` end-to-end (per `.claude/rules/testing.md` "Test
Through the Caller"):

- `mission_full_lifecycle_via_execute_action` — create → list → complete
  → list, asserting the `Completed` status surfaces through
  `mission_list` after `mission_complete`.
- `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action`
  — fresh manual mission fires successfully and returns a UUID thread_id
  rather than `not_fired`.
- `mission_list_returns_all_user_missions_via_execute_action` — all
  three created missions appear in `mission_list` output.

**Regression tests for cost wiring**

Three new tests in `bridge/llm_adapter.rs`:

- `complete_no_tools_populates_cost_usd_through_adapter`
- `complete_with_tools_populates_cost_usd_through_adapter`
- `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that
  `depth > 0` is priced with the cheap provider, not the primary.

Coordinated with in-flight work: skipped paths owned by #2504 (auth
E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549
(mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot).

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5079 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine-v2): surface engine capability actions to LLM via available_actions

Fixes the gap called out in the PR body: `EffectBridgeAdapter::available_actions`
was only enumerating v1 `ToolRegistry` tools + latent OAuth actions, so
engine-native capabilities like `missions` never appeared in the LLM's
tools list even when a thread held an active lease for them. The LLM
was therefore unable to call `mission_create` / `mission_list` / etc.
via structured tool calls; the only ways to drive missions were CodeAct
Python calls (which relied on the same `known_actions` set and hit the
same gap) or `/routine` slash commands falling through to v1.

Wire `CapabilityRegistry` into the adapter and iterate active leases to
surface every leased, engine-registered capability action. Respects
lease grant scope — a lease granting only `mission_list` does not leak
`mission_create`. Skips the `"tools"` capability since that lease is
already reconciled from the v1 path.

Router wires the shared `Arc<CapabilityRegistry>` to both the adapter
and `ThreadManager` at setup.

Three new regression tests:
- `available_actions_surfaces_leased_mission_capability`
- `available_actions_respects_partial_lease_grant`
- `available_actions_omits_capability_without_lease`

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5082 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(engine-v2): close review gaps — archive round-trip, v1/engine merge, defensive filters

Addresses gaps raised in PR #2660 review:

- **Consolidate `use ironclaw_engine::{...}`** into a single grouped
  import in `effect_adapter.rs` (was split across two statements).

- **Apply `is_v1_only_tool` / `is_v1_auth_tool` filters** to the engine
  capability path in `available_actions`. Defensive guardrail: a future
  engine capability that registers an action under a v1-denylisted name
  (`create_job`, `tool_auth`, ...) must not bypass the v2-isolation
  filters by virtue of coming through a different capability registry.

- **`ThreadArchiveSummary` serialization round-trip tests** in
  `store_adapter.rs`:
    - `archive_summary_preserves_total_cost_usd_through_round_trip` —
      pins the regression the PR fixed (cost silently zeroed on
      rehydration).
    - `archive_summary_handles_legacy_json_without_total_cost_usd_field`
      — pins `#[serde(default)]` back-compat for archive files written
      before this PR.

- **`available_actions` combined advertising tests** in
  `effect_adapter.rs`:
    - `available_actions_merges_v1_tools_with_engine_capabilities` —
      v1 tool + mission capability both surface on one call.
    - `available_actions_filters_v1_denylisted_names_from_engine_capabilities`
      — pins the new defensive filter.

- **`cost_usd_from` subscription-billed-provider test** in
  `llm_adapter.rs`:
    - `complete_with_subscription_billed_provider_yields_zero_cost` —
      zero `cost_per_token` round-trips to exactly `0.0`, no NaN/Inf.

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5087 passed, +5 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine-v2): price cache tokens correctly in LlmBridgeAdapter

Addresses PR #2660 review (gemini-code-assist + Copilot, L23/L115/L189):
`cost_usd_from` only priced `input_tokens + output_tokens`, ignoring
`cache_read_input_tokens` and `cache_creation_input_tokens`. For
providers with prompt caching (Anthropic, OpenAI), this undercounted
input cost and silently neutered the `max_budget_usd` gate.

Extend the helper to mirror the canonical formula in
`src/agent/cost_guard.rs::CostGuard::record_llm_call`:

  uncached_input = input_tokens - (cache_read + cache_write)
  cache_read_cost  = input_rate * cache_read  / cache_read_discount()
  cache_write_cost = input_rate * cache_write * cache_write_multiplier()
  cost = input_rate * uncached_input
       + cache_read_cost
       + cache_write_cost
       + output_rate * output_tokens

All `LlmProvider` implementations already supply `cache_read_discount()`
(default 1, Anthropic 10, OpenAI 2) and `cache_write_multiplier()`
(default 1, Anthropic 1.25 for 5m / 2.0 for 1h) through the decorator
chain, so no trait surgery is required.

Regression test: `complete_prices_cache_tokens_with_discount_and_multiplier`
uses Anthropic Sonnet 5m-TTL rates, exercises a 10k-input / 2k-read /
1k-write / 500-output response, and pins the correct total ($0.03285)
against the old naive $0.0375 that would have undercounted ~14%.

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (435
passed), `cargo test -p ironclaw --lib` (5136 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@ilblackdragon ilblackdragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Two interlocking pieces:

  1. Persistence fix — adds persist_v2_tool_calls in src/bridge/router.rs that, after a ThreadOutcome::Completed thread, loads the thread, walks internal_messages, extracts ActionResult messages, and writes a role="tool_calls" row to the v1 conversation DB so the chat history API can reconstruct tool call info for the web UI.
  2. E2E coverage — 6 new scenarios in test_v2_engine_tool_lifecycle.py (single-tool echo/time, text-only, parallel dispatch, multi-step chain, multi-turn) + 2 libsql unit tests in router.rs.

Net effect: chat history API stops returning tool_calls: [] for v2 threads.

Correctness

  • Completed-arm gating — commit 652315e8 correctly moved persist_v2_tool_calls out of the general outcome path and into ThreadOutcome::Completed only. Without this, GatePaused would orphan partial rows that duplicate on resume. The Completed { response: None } case (tools executed but no final text) is now also covered.
  • UTF-8-safe truncation — the 500-byte preview uses char_indices().take_while + char.len_utf8() for char-boundary truncation (matches .claude/rules/review-discipline.md "UTF-8 string safety"). Won't panic on multi-byte content.
  • Conversation ID resolution — parses UUID from message.conversation_scope() first, falls back to get_or_create_assistant_conversation. Both paths can fail cleanly (debug log + return) without corrupting the thread.
  • has_result derivation — the written JSON is {name, result_preview, tool_call_id}. src/channels/web/util.rs:93-94 derives has_result from result_preview.is_some_and(!is_null), so the E2E assertion tc["has_result"] is True is a valid contract check.
  • V24__llm_calls_created_at_index checksum — the migration file already exists on staging (landed in #1963, 532fc61d); the PR is just adding the checksum after a rebase. Not a schema change, not risky.
  • ⚠️ debug! vs warn! for persistence failures — the author initially used warn!, reviewer (commit ff372e11) changed to debug! citing CLAUDE.md's "background tasks must not use info/warn" rule. That rule is about REPL/TUI corruption. This is an HTTP handler path, not a background task — a silent debug! hides a user-visible bug (missing tool_calls in the UI) unless someone sets RUST_LOG=debug. Reasonable case for reverting to warn! or adding a metrics counter.
  • ⚠️ Extra store.load_thread on every Completed — runs even for text-only threads that will produce no rows. The comment says the thread is still in the in-memory cache, so this is a HashMap hit, not a DB round-trip. Acceptable; worth a doc comment noting the cheap-path reliance.

Style / Conventions

  • dispatch-exempt annotation not requiredrouter.rs is the bridge layer, not a handler/CLI file (the pre-commit hook's scope). Direct state.db / state.store access is canonical here.
  • 🟡 Inconsistent Arc qualification — helper signature uses &std::sync::Arc<dyn Store> / &std::sync::Arc<dyn Database> while the test module uses plain Arc::new(...). Arc is in scope — could drop the std::sync:: prefix for consistency with surrounding code.
  • 🟡 Mock LLM multi-step trigger uses _conversation_has_user_trigger with a tool-result count — pragmatic, but the comment explicitly notes "v2 engine tool results don't always include the tool name in a parseable format". Worth a TODO to tighten once that contract is stable.

Test Coverage

Strong:

  • 6 E2E scenarios cover the happy path + parallel + multi-step + multi-turn. Each asserts on the actual tool_calls array shape, not just response text.
  • 2 libsql unit tests pin the write path and the skip-when-empty path. Uses a real LibSqlBackend (not a mock), which is the right tier per .claude/rules/testing.md.

Gaps worth adding:

  • GatePaused does NOT persist — the bug that prompted commit 652315e8. No test asserts that a paused gate leaves the DB clean of role=\"tool_calls\" rows. Without this, a future "simplification" could move the call back out of the Completed arm and silently regress. Would be a straightforward libsql test: construct a thread with ActionResults, drive await_thread_outcome with a GatePaused outcome (or test the call-site conditional directly), then verify no write happens. The current unit tests always call the function — they don't exercise the gating logic at the call site.
  • is_error: true action results — the preview includes the error body but no explicit error flag in the written JSON. E2E tests don't exercise a failing tool. The history reader at util.rs:93-94 would still show has_result: true for a failed tool — probably correct, but not pinned.
  • Truncation boundarychar_indices safety isn't exercised. A small unit test with a multi-byte string >500 bytes would pin the invariant.
  • Conversation ID fallback — the fallback path (get_or_create_assistant_conversation) has no direct test. Both unit tests exercise only the conversation_scope()-provided path.

Security

  • No prompt-injection surface introduced — the preview is JSON-serialized into a content field, not interpolated into a prompt.
  • ⚠️ No sensitive-data redaction on the preview — if a tool output contains a secret (API key, auth token), the 500-byte preview persists it to DB and exposes it via chat history. This matches v1 write_v1_response behavior, so it's not introduced here, but worth flagging as a PR-body follow-up. Verify that the SafetyLayer sanitizer runs on ActionResult.content before it lands in internal_messages.

Performance

Per-completion: one extra store.load_thread (in-memory cache hit in practice) + O(n) internal_messages scan + one DB row insert. Negligible for typical threads. The 500-char truncation happens per ActionResult, which bounds worst-case row size.

Verdict

Approve-leaning with minor follow-ups. The fix is surgical, the Completed-arm placement is the right bug to solve, and E2E coverage meaningfully closes the #2193 audit gap. Before merge I'd suggest:

  1. Revert the debug!warn! decision for persistence failures (user-visible UI gap when it fails silently).
  2. Add a regression test that pins "no rows written for GatePaused" at the call-site level — otherwise the ThreadOutcome::Completed guard has no test protection.
  3. Optional: add the multi-byte truncation unit test.

Not blockers — ship after 1+2 or file as immediate follow-up. This is on the engine-v2-default critical path, so moving quickly is worth more than polishing these.

standardtoaster and others added 6 commits April 19, 2026 13:14
The v2 engine had zero e2e coverage for the tool call -> result ->
response path. This gap was flagged in the nearai#2193 audit and is the
same code path that breaks in QA bug nearai#2402 (infinite loop after
tool operations).

New test file: test_v2_engine_tool_lifecycle.py
- Single tool call (echo, time) completes through v2
- Text-only message completes through v2
- Parallel tool calls (2 tools in one response)
- Multi-step chain (echo -> result -> time -> result -> completion)
- Multi-turn tool usage across conversation turns

Mock LLM additions:
- "parallel echo and time" trigger for multi-call responses
- "multi step echo then time" trigger for sequential chains

Also documents that v2 engine does not populate the tool_calls
array in chat history (tool names show as "unknown"). This is a
separate gap from execution correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2 engine executed tools correctly but never wrote a
`role="tool_calls"` message to the v1 conversation DB. This
meant the chat history API returned `tool_calls: []` for all
v2 threads, breaking the web UI's tool call display.

Fix: after thread completion, extract ActionExecuted/ActionFailed
events from the v2 event log and write them as a tool_calls DB
row before the assistant response. The v1 history API now shows
tool names, results, and errors for v2 engine threads.

Steps are evicted from the in-memory store after join_thread,
so this reads from the append-only event log instead.

E2E test updated to assert tool_calls are populated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The events approach used params_summary (input parameters) where
result_preview (output) was expected. Thread internal_messages
carry the actual tool output in ActionResult messages.

Also fixes stale test file docstring that said tool_calls were
not populated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v1 write_v1_response silently drops errors via .ok(). Don't
replicate that -- log a warning so failed tool_calls persistence
is diagnosable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop redundant .chain(thread.messages.iter()) — ActionResult messages
  only exist in internal_messages
- Change tracing::warn! to debug! for fire-and-forget persistence
  failures (warn corrupts TUI per CLAUDE.md)
- Add tool_calls assertions to parallel, multi-step, and multi-turn
  tests — all 6 tests now verify the core persistence feature
- Add result_preview content assertion to echo test for tighter coverage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Apr 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: regular 2-5 merged PRs DB MIGRATION PR adds or modifies PostgreSQL or libSQL migration definitions risk: low Changes to docs, tests, or low-risk modules scope: db/postgres PostgreSQL backend size: L 200-499 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants