fix(gateway): v2 engine tool_calls persistence + e2e test coverage by standardtoaster · Pull Request #2452 · nearai/ironclaw

standardtoaster · 2026-04-14T09:51:42Z

Summary

The v2 engine had zero e2e coverage for the tool call -> result -> response path (flagged in Refactor: Remove V1 agent loop code, fully migrate to Engine V2 #2193 audit)
The v2 engine executed tools correctly but never persisted tool_calls metadata to the chat history DB, so the web UI showed empty tool call arrays for all v2 threads
After thread completion, ActionResult messages are extracted from the thread's internal transcript and written as a role="tool_calls" DB row before the assistant response
Unlike the existing write_v1_response, conversation ID resolution failures are logged rather than silently swallowed

Test plan

cargo clippy --no-default-features --features libsql -- -D warnings -- zero warnings
6 new e2e tests pass (pytest tests/e2e/scenarios/test_v2_engine_tool_lifecycle.py -v)
9 existing e2e tests unaffected (error_handling, tool_execution, agent_loop_recovery)
Manual verification: v2 chat history returns populated tool_calls array with correct names and result previews

# Manual verification
curl -s -H "Authorization: Bearer $TOKEN" \
  -X POST http://localhost:8080/api/chat/thread/new | jq .id

curl -s -H "Authorization: Bearer $TOKEN" \
  -X POST http://localhost:8080/api/chat/send \
  -H "Content-Type: application/json" \
  -d '{"content":"what time is it","thread_id":"<THREAD_ID>"}'

# Wait a few seconds, then:
curl -s -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/api/chat/history?thread_id=<THREAD_ID>" \
  | jq '.turns[-1].tool_calls'
# Before: []
# After:  [{"name":"time","has_result":true,"result_preview":"..."}]

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request implements the persistence of v2 engine tool call metadata into the v1 conversation database, ensuring that tool execution history is correctly reflected in the web UI. It also introduces a comprehensive E2E test suite for the v2 engine tool lifecycle, covering single, parallel, and multi-step tool chains. Feedback was provided to simplify the message iteration in persist_v2_tool_calls by traversing only the internal transcript, as the public message list is redundant for locating action results.

gemini-code-assist · 2026-04-14T09:56:19Z

+    // `internal_messages` has the full chain including action results with
+    // actual tool output content.
+    let mut calls = Vec::new();
+    for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {


The comment on lines 3046-3048 suggests that internal_messages contains the full chain of messages including action results, while messages only contains user/assistant messages. If this is correct, iterating over thread.messages.iter() is unnecessary because you are filtering for MessageRole::ActionResult, which would not be present in thread.messages. You can simplify this to iterate only over internal_messages. Additionally, when reconstructing these tool results, ensure you look up the call_id from the preceding FunctionCall item to ensure proper correlation between the call and its output.

Suggested change

for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {

for msg in &thread.internal_messages {

References

When reconstructing tool results from stored messages with role 'tool', look up the call_id from the preceding FunctionCall item to ensure proper correlation between the call and its output.

Fixed in ff372e1. Iterating only internal_messages now.

serrrfirat · 2026-04-14T10:06:08Z

+    // `internal_messages` has the full chain including action results with
+    // actual tool output content.
+    let mut calls = Vec::new();
+    for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {


Medium Severity — Potential duplicate tool calls

The loop chains both internal_messages and messages:

for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {

The inline comment states "user-visible messages only has user/assistant messages" — but if that invariant ever breaks (or doesn't hold for all thread types), ActionResult messages present in both collections will produce duplicate entries in the persisted calls array. There's no deduplication by action_call_id.

Suggestion: Either iterate only internal_messages (if the comment is correct, the .chain(thread.messages.iter()) is dead weight), or deduplicate before serializing:

let mut seen_ids = HashSet::new(); // ... in the loop: if let Some(ref call_id) = msg.action_call_id { if !seen_ids.insert(call_id.clone()) { continue; } }

Fixed in ff372e1. Dropped the .chain(thread.messages.iter()) -- now iterates only internal_messages as you suggested.

serrrfirat · 2026-04-14T10:06:21Z

+    let content = match serde_json::to_string(&wrapper) {
+        Ok(c) => c,
+        Err(e) => {
+            tracing::warn!("failed to serialize v2 tool_calls: {e}");


Medium Severity — tracing::warn! in gateway request path may corrupt TUI

Per CLAUDE.md: "info! and warn! output appears in the REPL and corrupts the terminal UI."

This function has three tracing::warn!() calls (here, line ~3106, and line ~3117) that fire on non-critical persistence failures. Since persist_v2_tool_calls is fire-and-forget and these are not user-actionable, they should use debug!() — consistent with the two debug!() calls earlier in this same function for thread-not-found / load-failure.

// Change these three: tracing::warn!("failed to serialize v2 tool_calls: {e}"); tracing::warn!(thread_id = %thread_id, "failed to resolve v1 conversation ..."); tracing::warn!("failed to persist v2 tool_calls to v1 DB: {e}"); // To: debug!("failed to serialize v2 tool_calls: {e}"); debug!(thread_id = %thread_id, "failed to resolve v1 conversation ..."); debug!("failed to persist v2 tool_calls to v1 DB: {e}");

(Also a minor style note: debug! is used bare via use tracing::debug at the top of the file, while warn is fully qualified — normalizing to debug! fixes both issues.)

Fixed in ff372e1. All three calls are debug!() now, consistent with the rest of the function.

serrrfirat · 2026-04-14T10:06:29Z

+
+        history = await _wait_for_response(
+            v2_tool_server,
+            thread_id,


Medium Severity — 5 of 6 tests don't verify the core feature (tool_calls persistence)

Only test_echo_tool asserts on the tool_calls array. The other five tests verify response text but never check that tool_calls was persisted to chat history — which is the stated purpose of this PR.

A regression that silently breaks persistence for parallel or multi-step flows would pass the entire suite.

Suggestion: Add tool_calls assertions to at least:

test_parallel_tool_calls — assert 2 entries with names echo and time

test_multi_step_chain — assert 2 entries: echo then time

test_tool_then_text_then_tool — assert tool_calls present in turns 1 and 3, absent/empty in turn 2

These are the highest-value additions since they cover the flows most likely to regress independently of single-tool persistence.

Fixed in 652315e. Added tool_calls assertions to test_parallel_tool_calls (2 entries, echo + time), test_multi_step_chain (2 entries with results), and test_tool_then_text_then_tool (tool_calls present in turns 1 and 3, absent in turn 2).

serrrfirat · 2026-04-15T13:29:53Z

    if let Ok(Some(ref text)) = result
        && let Some(ref db) = state.db
    {
+        // Write tool_calls row BEFORE assistant response so the v1 history


High Severity — Tool calls and response orphaned after gate resume

persist_v2_tool_calls is called inside the if let Ok(Some(ref text)) = result block, which fires for all outcomes including GatePaused. This creates a correctness issue on the gate-approval-resume path:

User sends message → handle_with_engine writes user row to v1 DB

Thread hits gate → await_thread_outcome returns GatePaused

persist_v2_tool_calls writes tool_calls row (partial tools before gate)

write_v1_response writes assistant row ("Tool requires approval…")

User sends "yes" → routed to handle_approval (NOT handle_with_engine, so no "user" row is written to v1 DB)

Thread resumes, completes → await_thread_outcome runs again

persist_v2_tool_calls writes another tool_calls row (ALL tools, duplicating pre-gate ones)

write_v1_response writes assistant row (actual response)

DB state after:

user → "do something" tool_calls → [partial] ← step 3 assistant → "requires approval" ← step 4 (consumed by turn parser) tool_calls → [ALL tools] ← step 7 (ORPHANED) assistant → "actual response" ← step 8 (ORPHANED)

build_turns_from_db_messages (util.rs:100) only creates turns starting from role == "user" messages — the second tool_calls + assistant pair has no preceding "user" row, so they're silently skipped. The actual tool results and final response are invisible in the history API.

Note: the assistant orphaning is pre-existing (from write_v1_response), but this PR adds an additional orphaned tool_calls row and duplicates pre-gate entries.

Suggested fix: Move persist_v2_tool_calls into the Completed match arm so it only runs for final outcomes:

ThreadOutcome::Completed { response } => { if let Some(ref db) = state.db { persist_v2_tool_calls(&state.store, db, thread_id, message).await; } // ... existing credential fallback logic ... Ok(response) }

Good catch. Fixed in 652315e. Moved persist_v2_tool_calls into the Completed match arm so it only fires for final outcomes. The GatePaused path no longer writes partial/duplicate rows.

serrrfirat · 2026-04-15T13:29:57Z

    if let Ok(Some(ref text)) = result
        && let Some(ref db) = state.db
    {
+        // Write tool_calls row BEFORE assistant response so the v1 history


Medium Severity — Tool calls not persisted for Completed { response: None }

The persist_v2_tool_calls call is guarded by if let Ok(Some(ref text)) = result, so it only fires when the outcome produces a text response. ThreadOutcome::Completed { response: None } (thread completes with tools but no final text) skips this block entirely, leaving a gap in the tool_calls history.

If the fix for the GatePaused issue (finding #1) moves persist_v2_tool_calls into the Completed match arm, this is also resolved — the call would run regardless of whether response is Some or None:

ThreadOutcome::Completed { response } => { if let Some(ref db) = state.db { persist_v2_tool_calls(&state.store, db, thread_id, message).await; } // ... rest of Completed handling ... Ok(response) }

Fixed in the same commit (652315e). The call is now in the Completed arm unconditionally, so it runs regardless of whether response is Some or None.

serrrfirat · 2026-04-15T13:30:04Z

@@ -3006,6 +3010,106 @@ async fn handle_mission_notification(
    }
 }



Medium Severity — No unit/integration test for persist_v2_tool_calls

This function is only covered by e2e tests (which are slow and run with AGENT_AUTO_APPROVE_TOOLS=true, so they never exercise the GatePaused path). Edge cases not covered:

GatePaused + resume: duplicate tool_calls rows (finding Move whatsapp channel source to channels-src/ for consistency #1)

Empty tool result: msg.content is empty → result_preview is "" → has_result is true but preview is meaningless

Completed { response: None }: tool_calls not persisted at all (finding feat: adding Web UI #2)

Thread not in store after join: load_thread returns None, function silently returns

Per the project's testing rules ("Test Through the Caller"), an integration-tier test driving the await_thread_outcome path with a mock store/DB would catch these regressions faster and more reliably than the e2e suite.

Added two libsql-backed unit tests in 652315e. One verifies correct extraction from internal_messages, the other verifies skip behavior for text-only threads. The GatePaused path is harder to test at unit level since it needs the full engine, but the e2e suite now covers the persistence assertions that would catch regressions there.

…ceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ceptance (#2660) * feat(engine-v2): Phase 4 cost tracking + Phase 6 mission lifecycle acceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(engine-v2): surface engine capability actions to LLM via available_actions Fixes the gap called out in the PR body: `EffectBridgeAdapter::available_actions` was only enumerating v1 `ToolRegistry` tools + latent OAuth actions, so engine-native capabilities like `missions` never appeared in the LLM's tools list even when a thread held an active lease for them. The LLM was therefore unable to call `mission_create` / `mission_list` / etc. via structured tool calls; the only ways to drive missions were CodeAct Python calls (which relied on the same `known_actions` set and hit the same gap) or `/routine` slash commands falling through to v1. Wire `CapabilityRegistry` into the adapter and iterate active leases to surface every leased, engine-registered capability action. Respects lease grant scope — a lease granting only `mission_list` does not leak `mission_create`. Skips the `"tools"` capability since that lease is already reconciled from the v1 path. Router wires the shared `Arc<CapabilityRegistry>` to both the adapter and `ThreadManager` at setup. Three new regression tests: - `available_actions_surfaces_leased_mission_capability` - `available_actions_respects_partial_lease_grant` - `available_actions_omits_capability_without_lease` Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5082 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(engine-v2): close review gaps — archive round-trip, v1/engine merge, defensive filters Addresses gaps raised in PR #2660 review: - **Consolidate `use ironclaw_engine::{...}`** into a single grouped import in `effect_adapter.rs` (was split across two statements). - **Apply `is_v1_only_tool` / `is_v1_auth_tool` filters** to the engine capability path in `available_actions`. Defensive guardrail: a future engine capability that registers an action under a v1-denylisted name (`create_job`, `tool_auth`, ...) must not bypass the v2-isolation filters by virtue of coming through a different capability registry. - **`ThreadArchiveSummary` serialization round-trip tests** in `store_adapter.rs`: - `archive_summary_preserves_total_cost_usd_through_round_trip` — pins the regression the PR fixed (cost silently zeroed on rehydration). - `archive_summary_handles_legacy_json_without_total_cost_usd_field` — pins `#[serde(default)]` back-compat for archive files written before this PR. - **`available_actions` combined advertising tests** in `effect_adapter.rs`: - `available_actions_merges_v1_tools_with_engine_capabilities` — v1 tool + mission capability both surface on one call. - `available_actions_filters_v1_denylisted_names_from_engine_capabilities` — pins the new defensive filter. - **`cost_usd_from` subscription-billed-provider test** in `llm_adapter.rs`: - `complete_with_subscription_billed_provider_yields_zero_cost` — zero `cost_per_token` round-trips to exactly `0.0`, no NaN/Inf. Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5087 passed, +5 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine-v2): price cache tokens correctly in LlmBridgeAdapter Addresses PR #2660 review (gemini-code-assist + Copilot, L23/L115/L189): `cost_usd_from` only priced `input_tokens + output_tokens`, ignoring `cache_read_input_tokens` and `cache_creation_input_tokens`. For providers with prompt caching (Anthropic, OpenAI), this undercounted input cost and silently neutered the `max_budget_usd` gate. Extend the helper to mirror the canonical formula in `src/agent/cost_guard.rs::CostGuard::record_llm_call`: uncached_input = input_tokens - (cache_read + cache_write) cache_read_cost = input_rate * cache_read / cache_read_discount() cache_write_cost = input_rate * cache_write * cache_write_multiplier() cost = input_rate * uncached_input + cache_read_cost + cache_write_cost + output_rate * output_tokens All `LlmProvider` implementations already supply `cache_read_discount()` (default 1, Anthropic 10, OpenAI 2) and `cache_write_multiplier()` (default 1, Anthropic 1.25 for 5m / 2.0 for 1h) through the decorator chain, so no trait surgery is required. Regression test: `complete_prices_cache_tokens_with_discount_and_multiplier` uses Anthropic Sonnet 5m-TTL rates, exercises a 10k-input / 2k-read / 1k-write / 500-output response, and pins the correct total ($0.03285) against the old naive $0.0375 that would have undercounted ~14%. Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (435 passed), `cargo test -p ironclaw --lib` (5136 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ilblackdragon

Overview

Two interlocking pieces:

Persistence fix — adds persist_v2_tool_calls in src/bridge/router.rs that, after a ThreadOutcome::Completed thread, loads the thread, walks internal_messages, extracts ActionResult messages, and writes a role="tool_calls" row to the v1 conversation DB so the chat history API can reconstruct tool call info for the web UI.
E2E coverage — 6 new scenarios in test_v2_engine_tool_lifecycle.py (single-tool echo/time, text-only, parallel dispatch, multi-step chain, multi-turn) + 2 libsql unit tests in router.rs.

Net effect: chat history API stops returning tool_calls: [] for v2 threads.

Correctness

Completed-arm gating — commit 652315e8 correctly moved persist_v2_tool_calls out of the general outcome path and into ThreadOutcome::Completed only. Without this, GatePaused would orphan partial rows that duplicate on resume. The Completed { response: None } case (tools executed but no final text) is now also covered.
UTF-8-safe truncation — the 500-byte preview uses char_indices().take_while + char.len_utf8() for char-boundary truncation (matches .claude/rules/review-discipline.md "UTF-8 string safety"). Won't panic on multi-byte content.
Conversation ID resolution — parses UUID from message.conversation_scope() first, falls back to get_or_create_assistant_conversation. Both paths can fail cleanly (debug log + return) without corrupting the thread.
has_result derivation — the written JSON is {name, result_preview, tool_call_id}. src/channels/web/util.rs:93-94 derives has_result from result_preview.is_some_and(!is_null), so the E2E assertion tc["has_result"] is True is a valid contract check.
V24__llm_calls_created_at_index checksum — the migration file already exists on staging (landed in #1963, 532fc61d); the PR is just adding the checksum after a rebase. Not a schema change, not risky.
⚠️ debug! vs warn! for persistence failures — the author initially used warn!, reviewer (commit ff372e11) changed to debug! citing CLAUDE.md's "background tasks must not use info/warn" rule. That rule is about REPL/TUI corruption. This is an HTTP handler path, not a background task — a silent debug! hides a user-visible bug (missing tool_calls in the UI) unless someone sets RUST_LOG=debug. Reasonable case for reverting to warn! or adding a metrics counter.
⚠️ Extra store.load_thread on every Completed — runs even for text-only threads that will produce no rows. The comment says the thread is still in the in-memory cache, so this is a HashMap hit, not a DB round-trip. Acceptable; worth a doc comment noting the cheap-path reliance.

Style / Conventions

✅ dispatch-exempt annotation not required — router.rs is the bridge layer, not a handler/CLI file (the pre-commit hook's scope). Direct state.db / state.store access is canonical here.
🟡 Inconsistent Arc qualification — helper signature uses &std::sync::Arc<dyn Store> / &std::sync::Arc<dyn Database> while the test module uses plain Arc::new(...). Arc is in scope — could drop the std::sync:: prefix for consistency with surrounding code.
🟡 Mock LLM multi-step trigger uses _conversation_has_user_trigger with a tool-result count — pragmatic, but the comment explicitly notes "v2 engine tool results don't always include the tool name in a parseable format". Worth a TODO to tighten once that contract is stable.

Test Coverage

Strong:

6 E2E scenarios cover the happy path + parallel + multi-step + multi-turn. Each asserts on the actual tool_calls array shape, not just response text.
2 libsql unit tests pin the write path and the skip-when-empty path. Uses a real LibSqlBackend (not a mock), which is the right tier per .claude/rules/testing.md.

Gaps worth adding:

GatePaused does NOT persist — the bug that prompted commit 652315e8. No test asserts that a paused gate leaves the DB clean of role=\"tool_calls\" rows. Without this, a future "simplification" could move the call back out of the Completed arm and silently regress. Would be a straightforward libsql test: construct a thread with ActionResults, drive await_thread_outcome with a GatePaused outcome (or test the call-site conditional directly), then verify no write happens. The current unit tests always call the function — they don't exercise the gating logic at the call site.
is_error: true action results — the preview includes the error body but no explicit error flag in the written JSON. E2E tests don't exercise a failing tool. The history reader at util.rs:93-94 would still show has_result: true for a failed tool — probably correct, but not pinned.
Truncation boundary — char_indices safety isn't exercised. A small unit test with a multi-byte string >500 bytes would pin the invariant.
Conversation ID fallback — the fallback path (get_or_create_assistant_conversation) has no direct test. Both unit tests exercise only the conversation_scope()-provided path.

Security

✅ No prompt-injection surface introduced — the preview is JSON-serialized into a content field, not interpolated into a prompt.
⚠️ No sensitive-data redaction on the preview — if a tool output contains a secret (API key, auth token), the 500-byte preview persists it to DB and exposes it via chat history. This matches v1 write_v1_response behavior, so it's not introduced here, but worth flagging as a PR-body follow-up. Verify that the SafetyLayer sanitizer runs on ActionResult.content before it lands in internal_messages.

Performance

Per-completion: one extra store.load_thread (in-memory cache hit in practice) + O(n) internal_messages scan + one DB row insert. Negligible for typical threads. The 500-char truncation happens per ActionResult, which bounds worst-case row size.

Verdict

Approve-leaning with minor follow-ups. The fix is surgical, the Completed-arm placement is the right bug to solve, and E2E coverage meaningfully closes the #2193 audit gap. Before merge I'd suggest:

Revert the debug! → warn! decision for persistence failures (user-visible UI gap when it fails silently).
Add a regression test that pins "no rows written for GatePaused" at the call-site level — otherwise the ThreadOutcome::Completed guard has no test protection.
Optional: add the multi-byte truncation unit test.

Not blockers — ship after 1+2 or file as immediate follow-up. This is on the engine-v2-default critical path, so moving quickly is worth more than polishing these.

The v2 engine had zero e2e coverage for the tool call -> result -> response path. This gap was flagged in the nearai#2193 audit and is the same code path that breaks in QA bug nearai#2402 (infinite loop after tool operations). New test file: test_v2_engine_tool_lifecycle.py - Single tool call (echo, time) completes through v2 - Text-only message completes through v2 - Parallel tool calls (2 tools in one response) - Multi-step chain (echo -> result -> time -> result -> completion) - Multi-turn tool usage across conversation turns Mock LLM additions: - "parallel echo and time" trigger for multi-call responses - "multi step echo then time" trigger for sequential chains Also documents that v2 engine does not populate the tool_calls array in chat history (tool names show as "unknown"). This is a separate gap from execution correctness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The v2 engine executed tools correctly but never wrote a `role="tool_calls"` message to the v1 conversation DB. This meant the chat history API returned `tool_calls: []` for all v2 threads, breaking the web UI's tool call display. Fix: after thread completion, extract ActionExecuted/ActionFailed events from the v2 event log and write them as a tool_calls DB row before the assistant response. The v1 history API now shows tool names, results, and errors for v2 engine threads. Steps are evicted from the in-memory store after join_thread, so this reads from the append-only event log instead. E2E test updated to assert tool_calls are populated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The events approach used params_summary (input parameters) where result_preview (output) was expected. Thread internal_messages carry the actual tool output in ActionResult messages. Also fixes stale test file docstring that said tool_calls were not populated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The v1 write_v1_response silently drops errors via .ok(). Don't replicate that -- log a warning so failed tool_calls persistence is diagnosable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Drop redundant .chain(thread.messages.iter()) — ActionResult messages only exist in internal_messages - Change tracing::warn! to debug! for fire-and-forget persistence failures (warn corrupts TUI per CLAUDE.md) - Add tool_calls assertions to parallel, multi-step, and multi-turn tests — all 6 tests now verify the core persistence feature - Add result_preview content assertion to echo test for tighter coverage Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot added size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: regular 2-5 merged PRs labels Apr 14, 2026

gemini-code-assist Bot reviewed Apr 14, 2026

View reviewed changes

serrrfirat reviewed Apr 14, 2026

View reviewed changes

github-actions Bot added size: M 50-199 changed lines scope: db/postgres PostgreSQL backend DB MIGRATION PR adds or modifies PostgreSQL or libSQL migration definitions and removed size: XL 500+ changed lines labels Apr 15, 2026

serrrfirat reviewed Apr 15, 2026

View reviewed changes

standardtoaster mentioned this pull request Apr 15, 2026

fix: restore issue-2402 v2 gate resume and action alias consistency #2458

Merged

github-actions Bot added size: L 200-499 changed lines and removed size: M 50-199 changed lines labels Apr 15, 2026

serrrfirat mentioned this pull request Apr 18, 2026

tracking: complete engine→AppEvent coverage (gaps after #2571 / #2530) #2654

Open

ilblackdragon mentioned this pull request Apr 19, 2026

feat(engine-v2): Phase 4 cost tracking + Phase 6 mission lifecycle acceptance #2660

Merged

5 tasks

ilblackdragon reviewed Apr 19, 2026

View reviewed changes

ilblackdragon mentioned this pull request Apr 19, 2026

Engine v2: review follow-ups deferred from merged PRs #2669

Open

standardtoaster and others added 6 commits April 19, 2026 13:14

fix: log conversation ID resolution failures instead of swallowing

43d8677

The v1 write_v1_response silently drops errors via .ok(). Don't replicate that -- log a warning so failed tool_calls persistence is diagnosable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: cargo fmt + add V24 migration checksum

0261c31

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This was referenced Apr 19, 2026

chore: promote staging to staging-promote/a524bf8a-24630860599 (2026-04-19 14:26 UTC) #2679

Merged

chore: promote staging to staging-promote/14333e4a-24631327376 (2026-04-19 15:01 UTC) #2682

Merged

ilblackdragon mentioned this pull request Apr 19, 2026

Preserve paused leases across engine auth resume #2631

Merged

	for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {
	for msg in &thread.internal_messages {

		@@ -3006,6 +3010,106 @@ async fn handle_mission_notification(
		}
		}

Conversation

standardtoaster commented Apr 14, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilblackdragon left a comment

Choose a reason for hiding this comment

Overview

Correctness

Style / Conventions

Test Coverage

Security

Performance

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants