fix(gateway): v2 engine tool_calls persistence + e2e test coverage#2452
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements the persistence of v2 engine tool call metadata into the v1 conversation database, ensuring that tool execution history is correctly reflected in the web UI. It also introduces a comprehensive E2E test suite for the v2 engine tool lifecycle, covering single, parallel, and multi-step tool chains. Feedback was provided to simplify the message iteration in persist_v2_tool_calls by traversing only the internal transcript, as the public message list is redundant for locating action results.
| // `internal_messages` has the full chain including action results with | ||
| // actual tool output content. | ||
| let mut calls = Vec::new(); | ||
| for msg in thread.internal_messages.iter().chain(thread.messages.iter()) { |
There was a problem hiding this comment.
The comment on lines 3046-3048 suggests that internal_messages contains the full chain of messages including action results, while messages only contains user/assistant messages. If this is correct, iterating over thread.messages.iter() is unnecessary because you are filtering for MessageRole::ActionResult, which would not be present in thread.messages. You can simplify this to iterate only over internal_messages. Additionally, when reconstructing these tool results, ensure you look up the call_id from the preceding FunctionCall item to ensure proper correlation between the call and its output.
| for msg in thread.internal_messages.iter().chain(thread.messages.iter()) { | |
| for msg in &thread.internal_messages { |
References
- When reconstructing tool results from stored messages with role 'tool', look up the call_id from the preceding FunctionCall item to ensure proper correlation between the call and its output.
There was a problem hiding this comment.
Fixed in ff372e1. Iterating only internal_messages now.
| // `internal_messages` has the full chain including action results with | ||
| // actual tool output content. | ||
| let mut calls = Vec::new(); | ||
| for msg in thread.internal_messages.iter().chain(thread.messages.iter()) { |
There was a problem hiding this comment.
Medium Severity — Potential duplicate tool calls
The loop chains both internal_messages and messages:
for msg in thread.internal_messages.iter().chain(thread.messages.iter()) {The inline comment states "user-visible messages only has user/assistant messages" — but if that invariant ever breaks (or doesn't hold for all thread types), ActionResult messages present in both collections will produce duplicate entries in the persisted calls array. There's no deduplication by action_call_id.
Suggestion: Either iterate only internal_messages (if the comment is correct, the .chain(thread.messages.iter()) is dead weight), or deduplicate before serializing:
let mut seen_ids = HashSet::new();
// ... in the loop:
if let Some(ref call_id) = msg.action_call_id {
if !seen_ids.insert(call_id.clone()) {
continue;
}
}There was a problem hiding this comment.
Fixed in ff372e1. Dropped the .chain(thread.messages.iter()) -- now iterates only internal_messages as you suggested.
| let content = match serde_json::to_string(&wrapper) { | ||
| Ok(c) => c, | ||
| Err(e) => { | ||
| tracing::warn!("failed to serialize v2 tool_calls: {e}"); |
There was a problem hiding this comment.
Medium Severity — tracing::warn! in gateway request path may corrupt TUI
Per CLAUDE.md: "info! and warn! output appears in the REPL and corrupts the terminal UI."
This function has three tracing::warn!() calls (here, line ~3106, and line ~3117) that fire on non-critical persistence failures. Since persist_v2_tool_calls is fire-and-forget and these are not user-actionable, they should use debug!() — consistent with the two debug!() calls earlier in this same function for thread-not-found / load-failure.
// Change these three:
tracing::warn!("failed to serialize v2 tool_calls: {e}");
tracing::warn!(thread_id = %thread_id, "failed to resolve v1 conversation ...");
tracing::warn!("failed to persist v2 tool_calls to v1 DB: {e}");
// To:
debug!("failed to serialize v2 tool_calls: {e}");
debug!(thread_id = %thread_id, "failed to resolve v1 conversation ...");
debug!("failed to persist v2 tool_calls to v1 DB: {e}");(Also a minor style note: debug! is used bare via use tracing::debug at the top of the file, while warn is fully qualified — normalizing to debug! fixes both issues.)
There was a problem hiding this comment.
Fixed in ff372e1. All three calls are debug!() now, consistent with the rest of the function.
|
|
||
| history = await _wait_for_response( | ||
| v2_tool_server, | ||
| thread_id, |
There was a problem hiding this comment.
Medium Severity — 5 of 6 tests don't verify the core feature (tool_calls persistence)
Only test_echo_tool asserts on the tool_calls array. The other five tests verify response text but never check that tool_calls was persisted to chat history — which is the stated purpose of this PR.
A regression that silently breaks persistence for parallel or multi-step flows would pass the entire suite.
Suggestion: Add tool_calls assertions to at least:
test_parallel_tool_calls— assert 2 entries with namesechoandtimetest_multi_step_chain— assert 2 entries:echothentimetest_tool_then_text_then_tool— assert tool_calls present in turns 1 and 3, absent/empty in turn 2
These are the highest-value additions since they cover the flows most likely to regress independently of single-tool persistence.
There was a problem hiding this comment.
Fixed in 652315e. Added tool_calls assertions to test_parallel_tool_calls (2 entries, echo + time), test_multi_step_chain (2 entries with results), and test_tool_then_text_then_tool (tool_calls present in turns 1 and 3, absent in turn 2).
| if let Ok(Some(ref text)) = result | ||
| && let Some(ref db) = state.db | ||
| { | ||
| // Write tool_calls row BEFORE assistant response so the v1 history |
There was a problem hiding this comment.
High Severity — Tool calls and response orphaned after gate resume
persist_v2_tool_calls is called inside the if let Ok(Some(ref text)) = result block, which fires for all outcomes including GatePaused. This creates a correctness issue on the gate-approval-resume path:
- User sends message →
handle_with_enginewritesuserrow to v1 DB - Thread hits gate →
await_thread_outcomereturnsGatePaused persist_v2_tool_callswritestool_callsrow (partial tools before gate)write_v1_responsewritesassistantrow ("Tool requires approval…")- User sends "yes" → routed to
handle_approval(NOThandle_with_engine, so no "user" row is written to v1 DB) - Thread resumes, completes →
await_thread_outcomeruns again persist_v2_tool_callswrites anothertool_callsrow (ALL tools, duplicating pre-gate ones)write_v1_responsewritesassistantrow (actual response)
DB state after:
user → "do something"
tool_calls → [partial] ← step 3
assistant → "requires approval" ← step 4 (consumed by turn parser)
tool_calls → [ALL tools] ← step 7 (ORPHANED)
assistant → "actual response" ← step 8 (ORPHANED)
build_turns_from_db_messages (util.rs:100) only creates turns starting from role == "user" messages — the second tool_calls + assistant pair has no preceding "user" row, so they're silently skipped. The actual tool results and final response are invisible in the history API.
Note: the assistant orphaning is pre-existing (from write_v1_response), but this PR adds an additional orphaned tool_calls row and duplicates pre-gate entries.
Suggested fix: Move persist_v2_tool_calls into the Completed match arm so it only runs for final outcomes:
ThreadOutcome::Completed { response } => {
if let Some(ref db) = state.db {
persist_v2_tool_calls(&state.store, db, thread_id, message).await;
}
// ... existing credential fallback logic ...
Ok(response)
}There was a problem hiding this comment.
Good catch. Fixed in 652315e. Moved persist_v2_tool_calls into the Completed match arm so it only fires for final outcomes. The GatePaused path no longer writes partial/duplicate rows.
| if let Ok(Some(ref text)) = result | ||
| && let Some(ref db) = state.db | ||
| { | ||
| // Write tool_calls row BEFORE assistant response so the v1 history |
There was a problem hiding this comment.
Medium Severity — Tool calls not persisted for Completed { response: None }
The persist_v2_tool_calls call is guarded by if let Ok(Some(ref text)) = result, so it only fires when the outcome produces a text response. ThreadOutcome::Completed { response: None } (thread completes with tools but no final text) skips this block entirely, leaving a gap in the tool_calls history.
If the fix for the GatePaused issue (finding #1) moves persist_v2_tool_calls into the Completed match arm, this is also resolved — the call would run regardless of whether response is Some or None:
ThreadOutcome::Completed { response } => {
if let Some(ref db) = state.db {
persist_v2_tool_calls(&state.store, db, thread_id, message).await;
}
// ... rest of Completed handling ...
Ok(response)
}There was a problem hiding this comment.
Fixed in the same commit (652315e). The call is now in the Completed arm unconditionally, so it runs regardless of whether response is Some or None.
| @@ -3006,6 +3010,106 @@ async fn handle_mission_notification( | |||
| } | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Medium Severity — No unit/integration test for persist_v2_tool_calls
This function is only covered by e2e tests (which are slow and run with AGENT_AUTO_APPROVE_TOOLS=true, so they never exercise the GatePaused path). Edge cases not covered:
- GatePaused + resume: duplicate tool_calls rows (finding Move whatsapp channel source to channels-src/ for consistency #1)
- Empty tool result:
msg.contentis empty →result_previewis""→has_resultistruebut preview is meaningless Completed { response: None }: tool_calls not persisted at all (finding feat: adding Web UI #2)- Thread not in store after join:
load_threadreturnsNone, function silently returns
Per the project's testing rules ("Test Through the Caller"), an integration-tier test driving the await_thread_outcome path with a mock store/DB would catch these regressions faster and more reliably than the e2e suite.
There was a problem hiding this comment.
Added two libsql-backed unit tests in 652315e. One verifies correct extraction from internal_messages, the other verifies skip behavior for text-only threads. The GatePaused path is harder to test at unit level since it needs the full engine, but the e2e suite now covers the persistence assertions that would catch regressions there.
…ceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ceptance (#2660) * feat(engine-v2): Phase 4 cost tracking + Phase 6 mission lifecycle acceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(engine-v2): surface engine capability actions to LLM via available_actions Fixes the gap called out in the PR body: `EffectBridgeAdapter::available_actions` was only enumerating v1 `ToolRegistry` tools + latent OAuth actions, so engine-native capabilities like `missions` never appeared in the LLM's tools list even when a thread held an active lease for them. The LLM was therefore unable to call `mission_create` / `mission_list` / etc. via structured tool calls; the only ways to drive missions were CodeAct Python calls (which relied on the same `known_actions` set and hit the same gap) or `/routine` slash commands falling through to v1. Wire `CapabilityRegistry` into the adapter and iterate active leases to surface every leased, engine-registered capability action. Respects lease grant scope — a lease granting only `mission_list` does not leak `mission_create`. Skips the `"tools"` capability since that lease is already reconciled from the v1 path. Router wires the shared `Arc<CapabilityRegistry>` to both the adapter and `ThreadManager` at setup. Three new regression tests: - `available_actions_surfaces_leased_mission_capability` - `available_actions_respects_partial_lease_grant` - `available_actions_omits_capability_without_lease` Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5082 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(engine-v2): close review gaps — archive round-trip, v1/engine merge, defensive filters Addresses gaps raised in PR #2660 review: - **Consolidate `use ironclaw_engine::{...}`** into a single grouped import in `effect_adapter.rs` (was split across two statements). - **Apply `is_v1_only_tool` / `is_v1_auth_tool` filters** to the engine capability path in `available_actions`. Defensive guardrail: a future engine capability that registers an action under a v1-denylisted name (`create_job`, `tool_auth`, ...) must not bypass the v2-isolation filters by virtue of coming through a different capability registry. - **`ThreadArchiveSummary` serialization round-trip tests** in `store_adapter.rs`: - `archive_summary_preserves_total_cost_usd_through_round_trip` — pins the regression the PR fixed (cost silently zeroed on rehydration). - `archive_summary_handles_legacy_json_without_total_cost_usd_field` — pins `#[serde(default)]` back-compat for archive files written before this PR. - **`available_actions` combined advertising tests** in `effect_adapter.rs`: - `available_actions_merges_v1_tools_with_engine_capabilities` — v1 tool + mission capability both surface on one call. - `available_actions_filters_v1_denylisted_names_from_engine_capabilities` — pins the new defensive filter. - **`cost_usd_from` subscription-billed-provider test** in `llm_adapter.rs`: - `complete_with_subscription_billed_provider_yields_zero_cost` — zero `cost_per_token` round-trips to exactly `0.0`, no NaN/Inf. Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5087 passed, +5 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine-v2): price cache tokens correctly in LlmBridgeAdapter Addresses PR #2660 review (gemini-code-assist + Copilot, L23/L115/L189): `cost_usd_from` only priced `input_tokens + output_tokens`, ignoring `cache_read_input_tokens` and `cache_creation_input_tokens`. For providers with prompt caching (Anthropic, OpenAI), this undercounted input cost and silently neutered the `max_budget_usd` gate. Extend the helper to mirror the canonical formula in `src/agent/cost_guard.rs::CostGuard::record_llm_call`: uncached_input = input_tokens - (cache_read + cache_write) cache_read_cost = input_rate * cache_read / cache_read_discount() cache_write_cost = input_rate * cache_write * cache_write_multiplier() cost = input_rate * uncached_input + cache_read_cost + cache_write_cost + output_rate * output_tokens All `LlmProvider` implementations already supply `cache_read_discount()` (default 1, Anthropic 10, OpenAI 2) and `cache_write_multiplier()` (default 1, Anthropic 1.25 for 5m / 2.0 for 1h) through the decorator chain, so no trait surgery is required. Regression test: `complete_prices_cache_tokens_with_discount_and_multiplier` uses Anthropic Sonnet 5m-TTL rates, exercises a 10k-input / 2k-read / 1k-write / 500-output response, and pins the correct total ($0.03285) against the old naive $0.0375 that would have undercounted ~14%. Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (435 passed), `cargo test -p ironclaw --lib` (5136 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ilblackdragon
left a comment
There was a problem hiding this comment.
Overview
Two interlocking pieces:
- Persistence fix — adds
persist_v2_tool_callsinsrc/bridge/router.rsthat, after aThreadOutcome::Completedthread, loads the thread, walksinternal_messages, extractsActionResultmessages, and writes arole="tool_calls"row to the v1 conversation DB so the chat history API can reconstruct tool call info for the web UI. - E2E coverage — 6 new scenarios in
test_v2_engine_tool_lifecycle.py(single-tool echo/time, text-only, parallel dispatch, multi-step chain, multi-turn) + 2 libsql unit tests inrouter.rs.
Net effect: chat history API stops returning tool_calls: [] for v2 threads.
Correctness
Completed-arm gating — commit652315e8correctly movedpersist_v2_tool_callsout of the general outcome path and intoThreadOutcome::Completedonly. Without this,GatePausedwould orphan partial rows that duplicate on resume. TheCompleted { response: None }case (tools executed but no final text) is now also covered.- UTF-8-safe truncation — the 500-byte preview uses
char_indices().take_while+char.len_utf8()for char-boundary truncation (matches.claude/rules/review-discipline.md"UTF-8 string safety"). Won't panic on multi-byte content. - Conversation ID resolution — parses UUID from
message.conversation_scope()first, falls back toget_or_create_assistant_conversation. Both paths can fail cleanly (debug log + return) without corrupting the thread. has_resultderivation — the written JSON is{name, result_preview, tool_call_id}.src/channels/web/util.rs:93-94deriveshas_resultfromresult_preview.is_some_and(!is_null), so the E2E assertiontc["has_result"] is Trueis a valid contract check.V24__llm_calls_created_at_indexchecksum — the migration file already exists on staging (landed in #1963,532fc61d); the PR is just adding the checksum after a rebase. Not a schema change, not risky.⚠️ debug!vswarn!for persistence failures — the author initially usedwarn!, reviewer (commitff372e11) changed todebug!citing CLAUDE.md's "background tasks must not use info/warn" rule. That rule is about REPL/TUI corruption. This is an HTTP handler path, not a background task — a silentdebug!hides a user-visible bug (missing tool_calls in the UI) unless someone setsRUST_LOG=debug. Reasonable case for reverting towarn!or adding a metrics counter.⚠️ Extrastore.load_threadon everyCompleted— runs even for text-only threads that will produce no rows. The comment says the thread is still in the in-memory cache, so this is a HashMap hit, not a DB round-trip. Acceptable; worth a doc comment noting the cheap-path reliance.
Style / Conventions
- ✅
dispatch-exemptannotation not required —router.rsis the bridge layer, not a handler/CLI file (the pre-commit hook's scope). Directstate.db/state.storeaccess is canonical here. - 🟡 Inconsistent
Arcqualification — helper signature uses&std::sync::Arc<dyn Store>/&std::sync::Arc<dyn Database>while the test module uses plainArc::new(...).Arcis in scope — could drop thestd::sync::prefix for consistency with surrounding code. - 🟡 Mock LLM multi-step trigger uses
_conversation_has_user_triggerwith a tool-result count — pragmatic, but the comment explicitly notes "v2 engine tool results don't always include the tool name in a parseable format". Worth a TODO to tighten once that contract is stable.
Test Coverage
Strong:
- 6 E2E scenarios cover the happy path + parallel + multi-step + multi-turn. Each asserts on the actual
tool_callsarray shape, not just response text. - 2 libsql unit tests pin the write path and the skip-when-empty path. Uses a real
LibSqlBackend(not a mock), which is the right tier per.claude/rules/testing.md.
Gaps worth adding:
GatePauseddoes NOT persist — the bug that prompted commit652315e8. No test asserts that a paused gate leaves the DB clean ofrole=\"tool_calls\"rows. Without this, a future "simplification" could move the call back out of theCompletedarm and silently regress. Would be a straightforward libsql test: construct a thread withActionResults, driveawait_thread_outcomewith aGatePausedoutcome (or test the call-site conditional directly), then verify no write happens. The current unit tests always call the function — they don't exercise the gating logic at the call site.is_error: trueaction results — the preview includes the error body but no expliciterrorflag in the written JSON. E2E tests don't exercise a failing tool. The history reader atutil.rs:93-94would still showhas_result: truefor a failed tool — probably correct, but not pinned.- Truncation boundary —
char_indicessafety isn't exercised. A small unit test with a multi-byte string >500 bytes would pin the invariant. - Conversation ID fallback — the fallback path (
get_or_create_assistant_conversation) has no direct test. Both unit tests exercise only theconversation_scope()-provided path.
Security
- ✅ No prompt-injection surface introduced — the preview is JSON-serialized into a
contentfield, not interpolated into a prompt. ⚠️ No sensitive-data redaction on the preview — if a tool output contains a secret (API key, auth token), the 500-byte preview persists it to DB and exposes it via chat history. This matches v1write_v1_responsebehavior, so it's not introduced here, but worth flagging as a PR-body follow-up. Verify that theSafetyLayersanitizer runs onActionResult.contentbefore it lands ininternal_messages.
Performance
Per-completion: one extra store.load_thread (in-memory cache hit in practice) + O(n) internal_messages scan + one DB row insert. Negligible for typical threads. The 500-char truncation happens per ActionResult, which bounds worst-case row size.
Verdict
Approve-leaning with minor follow-ups. The fix is surgical, the Completed-arm placement is the right bug to solve, and E2E coverage meaningfully closes the #2193 audit gap. Before merge I'd suggest:
- Revert the
debug!→warn!decision for persistence failures (user-visible UI gap when it fails silently). - Add a regression test that pins "no rows written for
GatePaused" at the call-site level — otherwise theThreadOutcome::Completedguard has no test protection. - Optional: add the multi-byte truncation unit test.
Not blockers — ship after 1+2 or file as immediate follow-up. This is on the engine-v2-default critical path, so moving quickly is worth more than polishing these.
The v2 engine had zero e2e coverage for the tool call -> result -> response path. This gap was flagged in the nearai#2193 audit and is the same code path that breaks in QA bug nearai#2402 (infinite loop after tool operations). New test file: test_v2_engine_tool_lifecycle.py - Single tool call (echo, time) completes through v2 - Text-only message completes through v2 - Parallel tool calls (2 tools in one response) - Multi-step chain (echo -> result -> time -> result -> completion) - Multi-turn tool usage across conversation turns Mock LLM additions: - "parallel echo and time" trigger for multi-call responses - "multi step echo then time" trigger for sequential chains Also documents that v2 engine does not populate the tool_calls array in chat history (tool names show as "unknown"). This is a separate gap from execution correctness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2 engine executed tools correctly but never wrote a `role="tool_calls"` message to the v1 conversation DB. This meant the chat history API returned `tool_calls: []` for all v2 threads, breaking the web UI's tool call display. Fix: after thread completion, extract ActionExecuted/ActionFailed events from the v2 event log and write them as a tool_calls DB row before the assistant response. The v1 history API now shows tool names, results, and errors for v2 engine threads. Steps are evicted from the in-memory store after join_thread, so this reads from the append-only event log instead. E2E test updated to assert tool_calls are populated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The events approach used params_summary (input parameters) where result_preview (output) was expected. Thread internal_messages carry the actual tool output in ActionResult messages. Also fixes stale test file docstring that said tool_calls were not populated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v1 write_v1_response silently drops errors via .ok(). Don't replicate that -- log a warning so failed tool_calls persistence is diagnosable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop redundant .chain(thread.messages.iter()) — ActionResult messages only exist in internal_messages - Change tracing::warn! to debug! for fire-and-forget persistence failures (warn corrupts TUI per CLAUDE.md) - Add tool_calls assertions to parallel, multi-step, and multi-turn tests — all 6 tests now verify the core persistence feature - Add result_preview content assertion to echo test for tighter coverage Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
tool_callsmetadata to the chat history DB, so the web UI showed empty tool call arrays for all v2 threadsrole="tool_calls"DB row before the assistant responsewrite_v1_response, conversation ID resolution failures are logged rather than silently swallowedTest plan
cargo clippy --no-default-features --features libsql -- -D warnings-- zero warningspytest tests/e2e/scenarios/test_v2_engine_tool_lifecycle.py -v)tool_callsarray with correct names and result previews🤖 Generated with Claude Code