Preserve paused leases across engine auth resume#2631
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements the persistence and reuse of capability leases when an execution is paused by a gate. It adds a "paused_lease" field to the ThreadOutcome, EngineError, and PendingGate structures, allowing the engine to restore the specific lease context upon resumption. The changes span the orchestrator, scripting, and structured executors, as well as the bridge and gateway logic, including a refactor of the OAuth callback handler to use a dedicated dispatch function. I have no feedback to provide.
…ceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ceptance (#2660) * feat(engine-v2): Phase 4 cost tracking + Phase 6 mission lifecycle acceptance Closes two gaps blocking v2 engine becoming the default: **Phase 4 — token + cost accounting** - Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs` (176 lines). The Python orchestrator (`default.py::compact_if_needed`) has owned compaction policy since #1557; the Rust module had no callers anywhere in the workspace. - Wire `cost_usd` in `LlmBridgeAdapter` by calling `LlmProvider::calculate_cost()` at both the no-tools and with-tools response paths. The engine's `Thread::total_cost_usd` accumulator and `max_budget_usd` gates were already plumbed — only the adapter was hardcoding 0.0. - Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in `store_adapter.rs`. Previously, rehydrating an archived thread silently dropped the cost to 0.0. `#[serde(default)]` keeps existing archive files deserializing cleanly. **Phase 6 — mission lifecycle acceptance** Three new integration tests in `bridge/effect_adapter.rs` driving `execute_action()` end-to-end (per `.claude/rules/testing.md` "Test Through the Caller"): - `mission_full_lifecycle_via_execute_action` — create → list → complete → list, asserting the `Completed` status surfaces through `mission_list` after `mission_complete`. - `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action` — fresh manual mission fires successfully and returns a UUID thread_id rather than `not_fired`. - `mission_list_returns_all_user_missions_via_execute_action` — all three created missions appear in `mission_list` output. **Regression tests for cost wiring** Three new tests in `bridge/llm_adapter.rs`: - `complete_no_tools_populates_cost_usd_through_adapter` - `complete_with_tools_populates_cost_usd_through_adapter` - `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that `depth > 0` is priced with the cheap provider, not the primary. Coordinated with in-flight work: skipped paths owned by #2504 (auth E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549 (mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot). Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5079 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(engine-v2): surface engine capability actions to LLM via available_actions Fixes the gap called out in the PR body: `EffectBridgeAdapter::available_actions` was only enumerating v1 `ToolRegistry` tools + latent OAuth actions, so engine-native capabilities like `missions` never appeared in the LLM's tools list even when a thread held an active lease for them. The LLM was therefore unable to call `mission_create` / `mission_list` / etc. via structured tool calls; the only ways to drive missions were CodeAct Python calls (which relied on the same `known_actions` set and hit the same gap) or `/routine` slash commands falling through to v1. Wire `CapabilityRegistry` into the adapter and iterate active leases to surface every leased, engine-registered capability action. Respects lease grant scope — a lease granting only `mission_list` does not leak `mission_create`. Skips the `"tools"` capability since that lease is already reconciled from the v1 path. Router wires the shared `Arc<CapabilityRegistry>` to both the adapter and `ThreadManager` at setup. Three new regression tests: - `available_actions_surfaces_leased_mission_capability` - `available_actions_respects_partial_lease_grant` - `available_actions_omits_capability_without_lease` Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5082 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(engine-v2): close review gaps — archive round-trip, v1/engine merge, defensive filters Addresses gaps raised in PR #2660 review: - **Consolidate `use ironclaw_engine::{...}`** into a single grouped import in `effect_adapter.rs` (was split across two statements). - **Apply `is_v1_only_tool` / `is_v1_auth_tool` filters** to the engine capability path in `available_actions`. Defensive guardrail: a future engine capability that registers an action under a v1-denylisted name (`create_job`, `tool_auth`, ...) must not bypass the v2-isolation filters by virtue of coming through a different capability registry. - **`ThreadArchiveSummary` serialization round-trip tests** in `store_adapter.rs`: - `archive_summary_preserves_total_cost_usd_through_round_trip` — pins the regression the PR fixed (cost silently zeroed on rehydration). - `archive_summary_handles_legacy_json_without_total_cost_usd_field` — pins `#[serde(default)]` back-compat for archive files written before this PR. - **`available_actions` combined advertising tests** in `effect_adapter.rs`: - `available_actions_merges_v1_tools_with_engine_capabilities` — v1 tool + mission capability both surface on one call. - `available_actions_filters_v1_denylisted_names_from_engine_capabilities` — pins the new defensive filter. - **`cost_usd_from` subscription-billed-provider test** in `llm_adapter.rs`: - `complete_with_subscription_billed_provider_yields_zero_cost` — zero `cost_per_token` round-trips to exactly `0.0`, no NaN/Inf. Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (409 passed), `cargo test -p ironclaw --lib` (5087 passed, +5 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine-v2): price cache tokens correctly in LlmBridgeAdapter Addresses PR #2660 review (gemini-code-assist + Copilot, L23/L115/L189): `cost_usd_from` only priced `input_tokens + output_tokens`, ignoring `cache_read_input_tokens` and `cache_creation_input_tokens`. For providers with prompt caching (Anthropic, OpenAI), this undercounted input cost and silently neutered the `max_budget_usd` gate. Extend the helper to mirror the canonical formula in `src/agent/cost_guard.rs::CostGuard::record_llm_call`: uncached_input = input_tokens - (cache_read + cache_write) cache_read_cost = input_rate * cache_read / cache_read_discount() cache_write_cost = input_rate * cache_write * cache_write_multiplier() cost = input_rate * uncached_input + cache_read_cost + cache_write_cost + output_rate * output_tokens All `LlmProvider` implementations already supply `cache_read_discount()` (default 1, Anthropic 10, OpenAI 2) and `cache_write_multiplier()` (default 1, Anthropic 1.25 for 5m / 2.0 for 1h) through the decorator chain, so no trait surgery is required. Regression test: `complete_prices_cache_tokens_with_discount_and_multiplier` uses Anthropic Sonnet 5m-TTL rates, exercises a 10k-input / 2k-read / 1k-write / 500-output response, and pins the correct total ($0.03285) against the old naive $0.0375 that would have undercounted ~14%. Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (435 passed), `cargo test -p ironclaw --lib` (5136 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4bd24e9 to
8bb2ef5
Compare
|
Rebased onto latest `origin/staging` and addressed the two issues blocking merge. Marked ready for review. Rebase notesConflict shape: the PR touched `src/channels/web/server.rs` but since it was authored, PR #2645 (stage 4a) extracted the OAuth callback into `src/channels/web/features/oauth/mod.rs`. Cherry-picked the engine/bridge/gate changes cleanly (they applied auto-merge) and re-applied the server.rs changes surgically to the new post-refactor location:
Deliberately did not port the PR's `dispatch_engine_external_callback` helper refactor — that helper lives in `platform/engine_dispatch.rs` which is introduced by the not-yet-merged PR #2665 (stage 4b). The inline `tx.send` flow stays for now; rolling up the refactor into this PR would couple it to #2665's merge order. Clippy fixAdding `paused_lease` to the `ThreadOutcome::GatePaused` variant tipped three clippy `large_enum_variant` lints past the threshold (lint fires at ~200 bytes, `CapabilityLease` is ~360 bytes):
Fixed by boxing `paused_lease` in `ThreadOutcome::GatePaused` — matches what the PR already does in `EngineError::GatePaused`. The single consumer that stores it unboxed (`PendingGate.paused_lease`) gets an inline `paused_lease.map(|b| *b)` unbox at the assignment site in `router.rs`. Verification
Engine v2 default critical pathWith #2452 already merged and this PR ready, only #2622 (unify v2 extension auth resume flow) remains before the default flip. It's also DIRTY and needs rebase; happy to rebase that next if you want me to keep unblocking. |
There was a problem hiding this comment.
Pull request overview
Updates engine v2 gate handling to persist and reuse the original capability lease across auth-gated pauses/resumes, and adjusts web OAuth callback cleanup so the engine pending-auth gate remains until replay completes.
Changes:
- Add
paused_leaseplumbing acrossThreadOutcome::GatePaused,EngineError::GatePaused, and persistedPendingGatestate. - Prefer
paused_leasewhen replaying/resuming a pending gate action instead of re-deriving a lease. - Split legacy session auth cleanup from engine gate cleanup to avoid clearing the engine pending-auth gate in OAuth callbacks before replay finishes.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/engine_v2_gate_integration.rs | Updates test fixtures to include the new paused_lease field. |
| src/gate/store.rs | Updates pending-gate store tests to include paused_lease. |
| src/gate/persistence.rs | Updates persistence tests to include paused_lease. |
| src/gate/pending.rs | Adds paused_lease to persisted PendingGate model. |
| src/channels/web/server.rs | Splits auth cleanup into “session only” vs “session + engine gate” helpers. |
| src/channels/web/features/oauth/mod.rs | Uses session-only cleanup in OAuth callback to preserve engine gate until replay. |
| src/bridge/router.rs | Persists paused_lease into pending gates and prefers it during resume/replay. |
| src/bridge/effect_adapter.rs | Captures the current lease snapshot into GatePaused errors. |
| src/agent/agent_loop.rs | Allows ExternalCallback submissions to target a thread UUID similarly to approvals. |
| crates/ironclaw_engine/src/types/error.rs | Extends EngineError::GatePaused with paused_lease. |
| crates/ironclaw_engine/src/runtime/mission.rs | Updates test fixtures to include paused_lease. |
| crates/ironclaw_engine/src/runtime/messaging.rs | Extends ThreadOutcome::GatePaused with boxed paused_lease. |
| crates/ironclaw_engine/src/executor/structured.rs | Propagates paused_lease through GatePaused result classification and parsing. |
| crates/ironclaw_engine/src/executor/scripting.rs | Updates preflight GatePaused construction to include paused_lease. |
| crates/ironclaw_engine/src/executor/orchestrator.rs | Logs/serializes paused_lease, parses it from tool results, and adds a parsing test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let lease = if let Some(lease) = pending.paused_lease.clone() { | ||
| lease | ||
| } else { | ||
| state | ||
| .thread_manager | ||
| .leases | ||
| .find_lease_for_action(pending.thread_id, &pending.action_name) | ||
| .await | ||
| .ok_or_else(|| { | ||
| engine_err( | ||
| "resume lease", | ||
| format!("no active lease covers action '{}'", pending.action_name), | ||
| ) | ||
| })? | ||
| }; |
There was a problem hiding this comment.
When resuming a pending gate, pending.paused_lease is used directly without any validation (thread_id match / action coverage / revocation / expiry). This bypasses the normal LeaseManager::find_lease_for_action checks and could allow resuming with a stale or mismatched lease. Consider validating that the snapshot lease applies to pending.thread_id and pending.action_name (and is not revoked/expired), otherwise fall back to find_lease_for_action or fail closed.
| let lease = if let Some(lease) = pending.paused_lease.clone() { | |
| lease | |
| } else { | |
| state | |
| .thread_manager | |
| .leases | |
| .find_lease_for_action(pending.thread_id, &pending.action_name) | |
| .await | |
| .ok_or_else(|| { | |
| engine_err( | |
| "resume lease", | |
| format!("no active lease covers action '{}'", pending.action_name), | |
| ) | |
| })? | |
| }; | |
| let lease = state | |
| .thread_manager | |
| .leases | |
| .find_lease_for_action(pending.thread_id, &pending.action_name) | |
| .await | |
| .ok_or_else(|| { | |
| engine_err( | |
| "resume lease", | |
| format!("no active lease covers action '{}'", pending.action_name), | |
| ) | |
| })?; |
There was a problem hiding this comment.
Fixed in eaf91a42. Extracted the lease-picking into a standalone resume_lease_for_pending_gate(pending, leases) helper that now validates the snapshot before use: thread_id match, action coverage, !revoked, and expires_at in the future. If any check fails, falls through to LeaseManager::find_lease_for_action — the snapshot is still preferred when valid (the original bug: no active lease at resume after restart) but stale snapshots no longer silently bypass revocation/expiry.
| let lease = if let Some(lease) = pending.paused_lease.clone() { | ||
| lease | ||
| } else { | ||
| state | ||
| .thread_manager |
There was a problem hiding this comment.
The new resume path that prefers pending.paused_lease over re-deriving via LeaseManager is a key behavioral change but doesn’t appear to be covered by a router-level test. Adding a test that reproduces the regression (e.g., paused action where no active lease exists at resume, but paused_lease allows completion) would help prevent future breaks.
There was a problem hiding this comment.
Fixed in eaf91a42. Added 6 libsql-free tests in bridge::router::tests driving resume_lease_for_pending_gate end-to-end:
resume_lease_prefers_snapshot_even_when_lease_manager_empty— pins the original bug fix.resume_lease_rejects_revoked_snapshot_and_falls_back— revoked snapshot + live lease → use live.resume_lease_rejects_expired_snapshot_and_falls_back— expired snapshot + live lease → use live.resume_lease_rejects_snapshot_with_wrong_thread_id— defensive: drift between snapshot and pending.resume_lease_rejects_snapshot_missing_action_coverage— lease must actually grant the pending action.resume_lease_returns_none_when_no_snapshot_and_no_active_lease— cleanNonefor caller's error path.
…ase-replay # Conflicts: # src/channels/web/features/oauth/mod.rs # src/channels/web/server.rs
Addresses PR #2631 review comments from Copilot: 1. **Snapshot used without validation** (src/bridge/router.rs:684 orig): `pending.paused_lease.clone()` was used directly to resume a gated action. A gate can sit in the pending-gate store for hours or across process restarts; during that window the original lease may have been revoked, expired, or the pending record could have drifted off its original thread. Extract `snapshot_lease_still_valid` + `resume_lease_for_pending_gate` helpers. The snapshot must pass four checks before use: - `lease.thread_id == pending.thread_id` - `granted_actions.covers(&pending.action_name)` - `!revoked` - `expires_at` is unset or in the future If any check fails, fall through to `LeaseManager::find_lease_for_action` (the normal path). Matches the reviewer's suggestion to avoid silently resuming a stale snapshot; still prefers the snapshot when valid so the original bug (no active lease at resume after restart) stays fixed. 2. **No router-level regression test** for the snapshot-vs-fallback decision. Six new libsql-free tests in `bridge::router::tests`: - `resume_lease_prefers_snapshot_even_when_lease_manager_empty` — reproduces the original bug; snapshot must carry the resume. - `resume_lease_rejects_revoked_snapshot_and_falls_back` - `resume_lease_rejects_expired_snapshot_and_falls_back` - `resume_lease_rejects_snapshot_with_wrong_thread_id` - `resume_lease_rejects_snapshot_missing_action_coverage` - `resume_lease_returns_none_when_no_snapshot_and_no_active_lease` Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples --all-features` (0 warnings), `cargo test -p ironclaw_engine` (435 passed), `cargo test -p ironclaw --lib` (5182 passed, +6 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI rustfmt (nightly/stable 1.95.0) wants the `assert!(matches!())` in `orchestrator.rs::parse_outcome_gate_paused` collapsed to fewer lines. Local rustfmt 1.94 was happy with the expanded form; matching CI to unblock the fmt check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let approval_thread_uuid = if matches!( | ||
| submission, | ||
| Submission::ExecApproval { .. } | ||
| | Submission::ApprovalResponse { .. } | ||
| | Submission::ExternalCallback { .. } | ||
| | Submission::GateAuthResolution { .. } | ||
| ) { |
There was a problem hiding this comment.
Including Submission::ExternalCallback in approval_thread_uuid resolution can mutate sess.active_thread based solely on conversation_scope even when the request_id doesn’t match any pending auth gate. A crafted JSON ExternalCallback (parsed from message content) can therefore switch the user’s active thread and then fail with "No matching pending authentication gate found." Consider preventing thread switching unless the callback is known-valid for that thread (e.g., look up the pending gate by request_id first and only then set active_thread), or gate this cross-thread routing to trusted/internal submissions only.
Synthesizes recurring patterns from ~30 merged PRs, 147 bot review comments (Copilot/Gemini), human reviews, and ~50 issues filed in the past 2 weeks. Each rule cites the motivating PR/issue numbers. New files: - error-handling.md — silent-failure taxonomy (unwrap_or_default, .ok()?, poisoned caches), persist-then-reload atomicity, channel-edge error mapping. (#2526, #2633, #2653, #2673, #2546, #2407, #2408) - agent-evidence.md — side-effect claims must cite tool evidence, empty-fast outputs are errors, external-effect tools must read back, setup UI round-trip. (#2544, #2580, #2582, #2541, #2545, #2411, #2543, #2586) - lifecycle.md — discovery vs. activation, terminal auth rejection, list_installed vs. list_active, deactivation unwinds, snapshot rehydrate must re-validate. (#2556, #2557, #2558, #2564, #2419, PR #2617, PR #2631) Extended: - types.md — from_trusted boundary rule, validated-newtype template with shared validate(&str), serde(try_from) required for validated types, wire-stable enums (no Debug; serde alias for migrations), canonical wire-contract field naming. (PR #2685, #2681, #2687, #2678, #2669, #2665, #2683, #2702) - safety-and-sandbox.md — every new ingress scans pre-transform/pre- injection, bounded resources (interners/streams/fan-out caps), cache keys must be complete. (#2491, #2676, #2470, #2633, #2673, #2710, PR #2702) - review-discipline.md — PR scope discipline, guardrail scripts are code (regression tests, grouped-import parsing, CI has_code inclusion), absolute-path ban in committed docs, stale comments after refactors. (PR #2668, #2628, #2680, #2687, #2647, #2689, #2701) All new files carry paths: frontmatter so they auto-load only on matching files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(rules): add review-driven guidance for Claude Code Synthesizes recurring patterns from ~30 merged PRs, 147 bot review comments (Copilot/Gemini), human reviews, and ~50 issues filed in the past 2 weeks. Each rule cites the motivating PR/issue numbers. New files: - error-handling.md — silent-failure taxonomy (unwrap_or_default, .ok()?, poisoned caches), persist-then-reload atomicity, channel-edge error mapping. (#2526, #2633, #2653, #2673, #2546, #2407, #2408) - agent-evidence.md — side-effect claims must cite tool evidence, empty-fast outputs are errors, external-effect tools must read back, setup UI round-trip. (#2544, #2580, #2582, #2541, #2545, #2411, #2543, #2586) - lifecycle.md — discovery vs. activation, terminal auth rejection, list_installed vs. list_active, deactivation unwinds, snapshot rehydrate must re-validate. (#2556, #2557, #2558, #2564, #2419, PR #2617, PR #2631) Extended: - types.md — from_trusted boundary rule, validated-newtype template with shared validate(&str), serde(try_from) required for validated types, wire-stable enums (no Debug; serde alias for migrations), canonical wire-contract field naming. (PR #2685, #2681, #2687, #2678, #2669, #2665, #2683, #2702) - safety-and-sandbox.md — every new ingress scans pre-transform/pre- injection, bounded resources (interners/streams/fan-out caps), cache keys must be complete. (#2491, #2676, #2470, #2633, #2673, #2710, PR #2702) - review-discipline.md — PR scope discipline, guardrail scripts are code (regression tests, grouped-import parsing, CI has_code inclusion), absolute-path ban in committed docs, stale comments after refactors. (PR #2668, #2628, #2680, #2687, #2647, #2689, #2701) All new files carry paths: frontmatter so they auto-load only on matching files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(rules): split agent-evidence into prompt + code rule agent-evidence.md mixed two concerns: runtime agent instruction (what the LLM should do when concluding a turn) and code-enforcement rules (what the dispatcher, engine, and tools must implement). Rules under .claude/rules/ only guide Claude Code when editing the repo — the runtime agent never reads them. Splits the two: - crates/ironclaw_engine/prompts/codeact_postamble.md — new section "Evidence before claiming side effects". Sits next to the existing "FINAL() answer quality" guidance; loaded via include_str! in executor/prompt.rs (no Rust change needed). - .claude/rules/tool-evidence.md — renamed from agent-evidence.md, keeps only the code invariants (engine v2 side-effect gate, empty-fast ToolError::EmptyResult, external-effect tools must read back, setup UI round-trip). Prompt tests pass unchanged; the postamble addition is pure text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * prompt: tighten evidence rule to FINAL() claims only, not tool use Live-test validation of the "Evidence before claiming side effects" section (added in the prior commit) showed it inhibited legitimate tool use. With the original wording, `zizmor_scan_v2` live-recording timed out at 302s with zero responses; reverting the postamble restored healthy behavior (88s run, 8 shell calls including `cargo install zizmor` and full workflow analysis). The original phrasing conflated two things: what the agent should claim and what tools it should call. The rule is only about the claim. Re-tunes the section to: - Open with an explicit "this does not restrict tool calls" scope. - Drop the "<1ms = failure" heuristic (too broad — normal tools like `tool_info(schema)` are legitimately fast). - Drop the full enumeration of forbidden side-effect verbs; keep the rule narrower and clearer. - Shorten the code example (remove redundant early-return). Re-tuned run: agent is active (shell calls, real reasoning), live recording completes in ~9s. The remaining test failure is a pre-existing assertion bug (exact `t == "shell"` match against tool strings that now carry arguments like `"shell(cmd)"`) — reproduces with the old postamble too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(live): fix tool-name assertions + re-record zizmor traces The two `zizmor_scan*` live tests had four broken tool-name assertions that silently failed to match: `tools.iter().any(|t| t == "shell")` against a tool list that now contains `"shell(cmd)"` strings (tool events carry args via `format_action_display_name` in `src/bridge/router.rs`). Two of the four were negative assertions checking for the absence of `tool_install` recovery loops — those silently passed even when a recovery loop actually ran. `sandbox_live_e2e.rs:203` already used the correct `t == "shell" || t.starts_with("shell(")` pattern; applied it consistently to all four sites. Verified live: - `IRONCLAW_LIVE_TEST=1 cargo test --test e2e_live -- zizmor_scan --ignored --test-threads=1` → 2 passed, 0 failed, 51.78s. Agent installs and runs zizmor end-to-end, producing real findings (exit code 14, dangerous triggers, excessive permissions, etc.). Traces re-recorded with the tuned postamble (commit 50d8517) and scrubbed: replaced `/home/illia/.cargo/bin/zizmor` with `/home/user/.cargo/bin/zizmor` per the developer-local-path ban in `.claude/rules/review-discipline.md`. No credentials, PII, or high-entropy secrets in either trace (only git SHAs from zizmor's workflow analysis output). Replay still passes: `cargo test --test e2e_live -- zizmor_scan --ignored` → 2/2 ok. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(replay): update zizmor_scan_v2 insta snapshot The engine v2 replay-snapshot gate (`engine_v2_tests::snapshot_zizmor_scan_v2`) failed against the re-recorded trace from 1691efe because the old snapshot encoded a broken run: - final_state: Failed - Missing Assistant message role - 3 issues: thread_failure (error), no_response (warning), llm_error (error) - 6 tool calls that never produced a final answer The new trace completes cleanly: - final_state: Done - System / User / Assistant roles present - 1 issue: mixed_mode (info) - 3 shell tool calls + successful `FINAL()` with real findings The snapshot was pinning a regression. Regenerated with `INSTA_UPDATE=always cargo test --test e2e_engine_v2 -- snapshot_zizmor_scan_v2`; passes on replay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(rules): address PR #2714 review feedback - review-discipline: reword "Doc Absolute Paths" as a review convention (pre-commit only scans .rs; the rule misleadingly claimed enforcement). - safety-and-sandbox: broaden `paths:` frontmatter to include the actual ingress owners (`bridge`, `channels`, `workspace`, `agent`, engine crate) so the rule auto-loads where it applies. - tool-evidence: mark the side-effect gate, empty-fast rule, and `unverified` flag as target/aspirational invariants — neither `ToolError::EmptyResult`, an `unverified` field on `ToolOutput`, nor a byte-count field on `ActionRecord` exist today. Point at concrete interim conventions (`ToolError::ExecutionFailed`, `unverified: true` in the JSON result body). - types: scope "Validated newtypes must gate Deserialize" to *new* types, document the `CredentialName`/`ExtensionName` exception (they intentionally use `#[serde(transparent)]` + derived `Deserialize` under the `serde_does_not_revalidate` test). Clarify the `from_trusted` trust boundary (trusted = typed upstream, untrusted = raw JSON field even if the field *name* is "registry entry"). Switch `new` template to `impl Into<String>` to avoid an unnecessary clone when an owned `String` is passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(rules): simplify types.md + split doc-hygiene; address review round 2 - types: collapse two templates into one canonical validated-newtype shape. New types use `#[serde(try_from = "String")]` with a shared `validate(&str)` helper — no more dual "transparent for some / try_from for others" guidance. `CredentialName`/`ExtensionName` are documented as the sole legacy exception (locked in by the `serde_does_not_revalidate` test); new code must not copy their `transparent` + `from_trusted` pattern. Removes the long "Using `from_trusted` safely" section and the separate "Validated newtypes must gate Deserialize" subsection that contradicted the Don'ts list. - doc-hygiene: new tiny rule file scoped to `**/*.md`, `**/*.py`, `docs/**` that carries the "no developer-local absolute paths in committed docs" convention. Removed from review-discipline.md where its `src/**/*.rs` scope meant the rule never loaded on the files it governed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(live): match hyphenated tool-install in attempted_relevant_tool The engine records `action_name` as the raw string the LLM emitted (`crates/ironclaw_engine/src/executor/structured.rs:381`), and the registry's lookup canonicalization only affects dispatch — not the name that reaches `StatusUpdate::ToolStarted`. The two other predicates in this file (`bad_recovery` at :420, `phase_b_recovery` at :531) already defend against both forms; this one should too, for consistency. Addresses PR #2714 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
paused_leasethrough pending gate creation, storage, and replayVerification
CARGO_TARGET_DIR=/Users/henry/near_ai/near_claw/ironclaw-ci-regression-fix-2/.shared-target cargo test -p ironclaw_engine parse_outcome_gate_pausedCARGO_TARGET_DIR=/Users/henry/near_ai/near_claw/ironclaw-ci-regression-fix-2/.shared-target cargo test -p ironclaw_engine normalize_pause_outcome_transitions_thread_to_waiting