Skip to content

Preserve paused leases across engine auth resume#2631

Merged
ilblackdragon merged 4 commits intostagingfrom
henry/v2-paused-lease-replay
Apr 19, 2026
Merged

Preserve paused leases across engine auth resume#2631
ilblackdragon merged 4 commits intostagingfrom
henry/v2-paused-lease-replay

Conversation

@henrypark133
Copy link
Copy Markdown
Collaborator

Summary

  • persist paused_lease through pending gate creation, storage, and replay
  • prefer the paused lease when resuming gated engine actions instead of re-deriving a fresh lease
  • keep successful OAuth callbacks from clearing the pending engine auth gate before replay completes

Verification

  • CARGO_TARGET_DIR=/Users/henry/near_ai/near_claw/ironclaw-ci-regression-fix-2/.shared-target cargo test -p ironclaw_engine parse_outcome_gate_paused
  • CARGO_TARGET_DIR=/Users/henry/near_ai/near_claw/ironclaw-ci-regression-fix-2/.shared-target cargo test -p ironclaw_engine normalize_pause_outcome_transitions_thread_to_waiting

@github-actions github-actions Bot added scope: agent Agent core (agent loop, router, scheduler) scope: channel/web Web gateway channel size: M 50-199 changed lines risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs labels Apr 18, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the persistence and reuse of capability leases when an execution is paused by a gate. It adds a "paused_lease" field to the ThreadOutcome, EngineError, and PendingGate structures, allowing the engine to restore the specific lease context upon resumption. The changes span the orchestrator, scripting, and structured executors, as well as the bridge and gateway logic, including a refactor of the OAuth callback handler to use a dedicated dispatch function. I have no feedback to provide.

ilblackdragon added a commit that referenced this pull request Apr 19, 2026
…ceptance

Closes two gaps blocking v2 engine becoming the default:

**Phase 4 — token + cost accounting**

- Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs`
  (176 lines). The Python orchestrator (`default.py::compact_if_needed`)
  has owned compaction policy since #1557; the Rust module had no
  callers anywhere in the workspace.
- Wire `cost_usd` in `LlmBridgeAdapter` by calling
  `LlmProvider::calculate_cost()` at both the no-tools and with-tools
  response paths. The engine's `Thread::total_cost_usd` accumulator and
  `max_budget_usd` gates were already plumbed — only the adapter was
  hardcoding 0.0.
- Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in
  `store_adapter.rs`. Previously, rehydrating an archived thread
  silently dropped the cost to 0.0. `#[serde(default)]` keeps existing
  archive files deserializing cleanly.

**Phase 6 — mission lifecycle acceptance**

Three new integration tests in `bridge/effect_adapter.rs` driving
`execute_action()` end-to-end (per `.claude/rules/testing.md` "Test
Through the Caller"):

- `mission_full_lifecycle_via_execute_action` — create → list → complete
  → list, asserting the `Completed` status surfaces through
  `mission_list` after `mission_complete`.
- `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action`
  — fresh manual mission fires successfully and returns a UUID thread_id
  rather than `not_fired`.
- `mission_list_returns_all_user_missions_via_execute_action` — all
  three created missions appear in `mission_list` output.

**Regression tests for cost wiring**

Three new tests in `bridge/llm_adapter.rs`:

- `complete_no_tools_populates_cost_usd_through_adapter`
- `complete_with_tools_populates_cost_usd_through_adapter`
- `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that
  `depth > 0` is priced with the cheap provider, not the primary.

Coordinated with in-flight work: skipped paths owned by #2504 (auth
E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549
(mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot).

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5079 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 19, 2026
…ceptance

Closes two gaps blocking v2 engine becoming the default:

**Phase 4 — token + cost accounting**

- Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs`
  (176 lines). The Python orchestrator (`default.py::compact_if_needed`)
  has owned compaction policy since #1557; the Rust module had no
  callers anywhere in the workspace.
- Wire `cost_usd` in `LlmBridgeAdapter` by calling
  `LlmProvider::calculate_cost()` at both the no-tools and with-tools
  response paths. The engine's `Thread::total_cost_usd` accumulator and
  `max_budget_usd` gates were already plumbed — only the adapter was
  hardcoding 0.0.
- Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in
  `store_adapter.rs`. Previously, rehydrating an archived thread
  silently dropped the cost to 0.0. `#[serde(default)]` keeps existing
  archive files deserializing cleanly.

**Phase 6 — mission lifecycle acceptance**

Three new integration tests in `bridge/effect_adapter.rs` driving
`execute_action()` end-to-end (per `.claude/rules/testing.md` "Test
Through the Caller"):

- `mission_full_lifecycle_via_execute_action` — create → list → complete
  → list, asserting the `Completed` status surfaces through
  `mission_list` after `mission_complete`.
- `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action`
  — fresh manual mission fires successfully and returns a UUID thread_id
  rather than `not_fired`.
- `mission_list_returns_all_user_missions_via_execute_action` — all
  three created missions appear in `mission_list` output.

**Regression tests for cost wiring**

Three new tests in `bridge/llm_adapter.rs`:

- `complete_no_tools_populates_cost_usd_through_adapter`
- `complete_with_tools_populates_cost_usd_through_adapter`
- `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that
  `depth > 0` is priced with the cheap provider, not the primary.

Coordinated with in-flight work: skipped paths owned by #2504 (auth
E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549
(mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot).

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5079 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 19, 2026
…ceptance (#2660)

* feat(engine-v2): Phase 4 cost tracking + Phase 6 mission lifecycle acceptance

Closes two gaps blocking v2 engine becoming the default:

**Phase 4 — token + cost accounting**

- Delete orphaned `crates/ironclaw_engine/src/executor/compaction.rs`
  (176 lines). The Python orchestrator (`default.py::compact_if_needed`)
  has owned compaction policy since #1557; the Rust module had no
  callers anywhere in the workspace.
- Wire `cost_usd` in `LlmBridgeAdapter` by calling
  `LlmProvider::calculate_cost()` at both the no-tools and with-tools
  response paths. The engine's `Thread::total_cost_usd` accumulator and
  `max_budget_usd` gates were already plumbed — only the adapter was
  hardcoding 0.0.
- Persist `total_cost_usd` through `ThreadArchiveSummary` round-trip in
  `store_adapter.rs`. Previously, rehydrating an archived thread
  silently dropped the cost to 0.0. `#[serde(default)]` keeps existing
  archive files deserializing cleanly.

**Phase 6 — mission lifecycle acceptance**

Three new integration tests in `bridge/effect_adapter.rs` driving
`execute_action()` end-to-end (per `.claude/rules/testing.md` "Test
Through the Caller"):

- `mission_full_lifecycle_via_execute_action` — create → list → complete
  → list, asserting the `Completed` status surfaces through
  `mission_list` after `mission_complete`.
- `mission_fire_returns_thread_id_for_manual_cadence_via_execute_action`
  — fresh manual mission fires successfully and returns a UUID thread_id
  rather than `not_fired`.
- `mission_list_returns_all_user_missions_via_execute_action` — all
  three created missions appear in `mission_list` output.

**Regression tests for cost wiring**

Three new tests in `bridge/llm_adapter.rs`:

- `complete_no_tools_populates_cost_usd_through_adapter`
- `complete_with_tools_populates_cost_usd_through_adapter`
- `complete_routes_subcalls_through_cheap_provider_for_cost` — pins that
  `depth > 0` is priced with the cheap provider, not the primary.

Coordinated with in-flight work: skipped paths owned by #2504 (auth
E2E), #2631 (paused-lease resume), #2570 (mission re-fire), #2549
(mission_get), #2452 (tool_calls persistence), #2621 (replay snapshot).

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5079 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine-v2): surface engine capability actions to LLM via available_actions

Fixes the gap called out in the PR body: `EffectBridgeAdapter::available_actions`
was only enumerating v1 `ToolRegistry` tools + latent OAuth actions, so
engine-native capabilities like `missions` never appeared in the LLM's
tools list even when a thread held an active lease for them. The LLM
was therefore unable to call `mission_create` / `mission_list` / etc.
via structured tool calls; the only ways to drive missions were CodeAct
Python calls (which relied on the same `known_actions` set and hit the
same gap) or `/routine` slash commands falling through to v1.

Wire `CapabilityRegistry` into the adapter and iterate active leases to
surface every leased, engine-registered capability action. Respects
lease grant scope — a lease granting only `mission_list` does not leak
`mission_create`. Skips the `"tools"` capability since that lease is
already reconciled from the v1 path.

Router wires the shared `Arc<CapabilityRegistry>` to both the adapter
and `ThreadManager` at setup.

Three new regression tests:
- `available_actions_surfaces_leased_mission_capability`
- `available_actions_respects_partial_lease_grant`
- `available_actions_omits_capability_without_lease`

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5082 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(engine-v2): close review gaps — archive round-trip, v1/engine merge, defensive filters

Addresses gaps raised in PR #2660 review:

- **Consolidate `use ironclaw_engine::{...}`** into a single grouped
  import in `effect_adapter.rs` (was split across two statements).

- **Apply `is_v1_only_tool` / `is_v1_auth_tool` filters** to the engine
  capability path in `available_actions`. Defensive guardrail: a future
  engine capability that registers an action under a v1-denylisted name
  (`create_job`, `tool_auth`, ...) must not bypass the v2-isolation
  filters by virtue of coming through a different capability registry.

- **`ThreadArchiveSummary` serialization round-trip tests** in
  `store_adapter.rs`:
    - `archive_summary_preserves_total_cost_usd_through_round_trip` —
      pins the regression the PR fixed (cost silently zeroed on
      rehydration).
    - `archive_summary_handles_legacy_json_without_total_cost_usd_field`
      — pins `#[serde(default)]` back-compat for archive files written
      before this PR.

- **`available_actions` combined advertising tests** in
  `effect_adapter.rs`:
    - `available_actions_merges_v1_tools_with_engine_capabilities` —
      v1 tool + mission capability both surface on one call.
    - `available_actions_filters_v1_denylisted_names_from_engine_capabilities`
      — pins the new defensive filter.

- **`cost_usd_from` subscription-billed-provider test** in
  `llm_adapter.rs`:
    - `complete_with_subscription_billed_provider_yields_zero_cost` —
      zero `cost_per_token` round-trips to exactly `0.0`, no NaN/Inf.

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (409
passed), `cargo test -p ironclaw --lib` (5087 passed, +5 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine-v2): price cache tokens correctly in LlmBridgeAdapter

Addresses PR #2660 review (gemini-code-assist + Copilot, L23/L115/L189):
`cost_usd_from` only priced `input_tokens + output_tokens`, ignoring
`cache_read_input_tokens` and `cache_creation_input_tokens`. For
providers with prompt caching (Anthropic, OpenAI), this undercounted
input cost and silently neutered the `max_budget_usd` gate.

Extend the helper to mirror the canonical formula in
`src/agent/cost_guard.rs::CostGuard::record_llm_call`:

  uncached_input = input_tokens - (cache_read + cache_write)
  cache_read_cost  = input_rate * cache_read  / cache_read_discount()
  cache_write_cost = input_rate * cache_write * cache_write_multiplier()
  cost = input_rate * uncached_input
       + cache_read_cost
       + cache_write_cost
       + output_rate * output_tokens

All `LlmProvider` implementations already supply `cache_read_discount()`
(default 1, Anthropic 10, OpenAI 2) and `cache_write_multiplier()`
(default 1, Anthropic 1.25 for 5m / 2.0 for 1h) through the decorator
chain, so no trait surgery is required.

Regression test: `complete_prices_cache_tokens_with_discount_and_multiplier`
uses Anthropic Sonnet 5m-TTL rates, exercises a 10k-input / 2k-read /
1k-write / 500-output response, and pins the correct total ($0.03285)
against the old naive $0.0375 that would have undercounted ~14%.

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (435
passed), `cargo test -p ironclaw --lib` (5136 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ilblackdragon ilblackdragon force-pushed the henry/v2-paused-lease-replay branch from 4bd24e9 to 8bb2ef5 Compare April 19, 2026 15:11
@ilblackdragon ilblackdragon marked this pull request as ready for review April 19, 2026 15:11
Copilot AI review requested due to automatic review settings April 19, 2026 15:11
@ilblackdragon
Copy link
Copy Markdown
Member

Rebased onto latest `origin/staging` and addressed the two issues blocking merge. Marked ready for review.

Rebase notes

Conflict shape: the PR touched `src/channels/web/server.rs` but since it was authored, PR #2645 (stage 4a) extracted the OAuth callback into `src/channels/web/features/oauth/mod.rs`. Cherry-picked the engine/bridge/gate changes cleanly (they applied auto-merge) and re-applied the server.rs changes surgically to the new post-refactor location:

  • Split `clear_auth_mode_for_thread` in `server.rs` into two: the public helper still clears both legacy session + engine pending auth, and a new `clear_session_auth_mode_for_thread` clears only the session half. Both `pub(crate)` so `features/oauth/` can reach them.
  • Updated `features/oauth/mod.rs::oauth_callback_handler` to call `clear_session_auth_mode_for_thread(..., None)` instead of `clear_auth_mode(...)`, preserving the engine gate across the callback so the `ExternalCallback` replay path can resolve it (preserving the paused_lease).

Deliberately did not port the PR's `dispatch_engine_external_callback` helper refactor — that helper lives in `platform/engine_dispatch.rs` which is introduced by the not-yet-merged PR #2665 (stage 4b). The inline `tx.send` flow stays for now; rolling up the refactor into this PR would couple it to #2665's merge order.

Clippy fix

Adding `paused_lease` to the `ThreadOutcome::GatePaused` variant tipped three clippy `large_enum_variant` lints past the threshold (lint fires at ~200 bytes, `CapabilityLease` is ~360 bytes):

  1. `ThreadOutcome::GatePaused` itself
  2. `ThreadSignal::ChildCompleted { outcome: ThreadOutcome }` (transitively)
  3. `PreflightResult::GatePaused(ThreadOutcome)` (transitively)

Fixed by boxing `paused_lease` in `ThreadOutcome::GatePaused` — matches what the PR already does in `EngineError::GatePaused`. The single consumer that stores it unboxed (`PendingGate.paused_lease`) gets an inline `paused_lease.map(|b| *b)` unbox at the assignment site in `router.rs`.

Verification

  • `cargo fmt` ✓
  • `cargo clippy -p ironclaw_engine --all-targets -- -D warnings` → 0 warnings
  • `cargo clippy --all --benches --tests --examples --all-features` → 0 warnings
  • `cargo test -p ironclaw_engine parse_outcome_gate_paused` → ok (the PR's named test)
  • `cargo test -p ironclaw_engine normalize_pause_outcome_transitions_thread_to_waiting` → ok (the PR's other named test)
  • `cargo test -p ironclaw_engine` → all pass
  • `cargo test --test engine_v2_gate_integration` → all pass (including the PR's new gate test)
  • `cargo test -p ironclaw --lib` → 5158 passed. 1 flake on `extensions::manager::tests::test_telegram_token_colon_preserved_in_validation_url` — known flake (Flaky test: test_telegram_token_colon_preserved_in_validation_url leaks across parallel runs #2102), passes in isolation, pre-existing.

Engine v2 default critical path

With #2452 already merged and this PR ready, only #2622 (unify v2 extension auth resume flow) remains before the default flip. It's also DIRTY and needs rebase; happy to rebase that next if you want me to keep unblocking.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates engine v2 gate handling to persist and reuse the original capability lease across auth-gated pauses/resumes, and adjusts web OAuth callback cleanup so the engine pending-auth gate remains until replay completes.

Changes:

  • Add paused_lease plumbing across ThreadOutcome::GatePaused, EngineError::GatePaused, and persisted PendingGate state.
  • Prefer paused_lease when replaying/resuming a pending gate action instead of re-deriving a lease.
  • Split legacy session auth cleanup from engine gate cleanup to avoid clearing the engine pending-auth gate in OAuth callbacks before replay finishes.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/engine_v2_gate_integration.rs Updates test fixtures to include the new paused_lease field.
src/gate/store.rs Updates pending-gate store tests to include paused_lease.
src/gate/persistence.rs Updates persistence tests to include paused_lease.
src/gate/pending.rs Adds paused_lease to persisted PendingGate model.
src/channels/web/server.rs Splits auth cleanup into “session only” vs “session + engine gate” helpers.
src/channels/web/features/oauth/mod.rs Uses session-only cleanup in OAuth callback to preserve engine gate until replay.
src/bridge/router.rs Persists paused_lease into pending gates and prefers it during resume/replay.
src/bridge/effect_adapter.rs Captures the current lease snapshot into GatePaused errors.
src/agent/agent_loop.rs Allows ExternalCallback submissions to target a thread UUID similarly to approvals.
crates/ironclaw_engine/src/types/error.rs Extends EngineError::GatePaused with paused_lease.
crates/ironclaw_engine/src/runtime/mission.rs Updates test fixtures to include paused_lease.
crates/ironclaw_engine/src/runtime/messaging.rs Extends ThreadOutcome::GatePaused with boxed paused_lease.
crates/ironclaw_engine/src/executor/structured.rs Propagates paused_lease through GatePaused result classification and parsing.
crates/ironclaw_engine/src/executor/scripting.rs Updates preflight GatePaused construction to include paused_lease.
crates/ironclaw_engine/src/executor/orchestrator.rs Logs/serializes paused_lease, parses it from tool results, and adds a parsing test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/bridge/router.rs Outdated
Comment on lines +684 to +698
let lease = if let Some(lease) = pending.paused_lease.clone() {
lease
} else {
state
.thread_manager
.leases
.find_lease_for_action(pending.thread_id, &pending.action_name)
.await
.ok_or_else(|| {
engine_err(
"resume lease",
format!("no active lease covers action '{}'", pending.action_name),
)
})?
};
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When resuming a pending gate, pending.paused_lease is used directly without any validation (thread_id match / action coverage / revocation / expiry). This bypasses the normal LeaseManager::find_lease_for_action checks and could allow resuming with a stale or mismatched lease. Consider validating that the snapshot lease applies to pending.thread_id and pending.action_name (and is not revoked/expired), otherwise fall back to find_lease_for_action or fail closed.

Suggested change
let lease = if let Some(lease) = pending.paused_lease.clone() {
lease
} else {
state
.thread_manager
.leases
.find_lease_for_action(pending.thread_id, &pending.action_name)
.await
.ok_or_else(|| {
engine_err(
"resume lease",
format!("no active lease covers action '{}'", pending.action_name),
)
})?
};
let lease = state
.thread_manager
.leases
.find_lease_for_action(pending.thread_id, &pending.action_name)
.await
.ok_or_else(|| {
engine_err(
"resume lease",
format!("no active lease covers action '{}'", pending.action_name),
)
})?;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in eaf91a42. Extracted the lease-picking into a standalone resume_lease_for_pending_gate(pending, leases) helper that now validates the snapshot before use: thread_id match, action coverage, !revoked, and expires_at in the future. If any check fails, falls through to LeaseManager::find_lease_for_action — the snapshot is still preferred when valid (the original bug: no active lease at resume after restart) but stale snapshots no longer silently bypass revocation/expiry.

Comment thread src/bridge/router.rs Outdated
Comment on lines +684 to +688
let lease = if let Some(lease) = pending.paused_lease.clone() {
lease
} else {
state
.thread_manager
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new resume path that prefers pending.paused_lease over re-deriving via LeaseManager is a key behavioral change but doesn’t appear to be covered by a router-level test. Adding a test that reproduces the regression (e.g., paused action where no active lease exists at resume, but paused_lease allows completion) would help prevent future breaks.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in eaf91a42. Added 6 libsql-free tests in bridge::router::tests driving resume_lease_for_pending_gate end-to-end:

  • resume_lease_prefers_snapshot_even_when_lease_manager_empty — pins the original bug fix.
  • resume_lease_rejects_revoked_snapshot_and_falls_back — revoked snapshot + live lease → use live.
  • resume_lease_rejects_expired_snapshot_and_falls_back — expired snapshot + live lease → use live.
  • resume_lease_rejects_snapshot_with_wrong_thread_id — defensive: drift between snapshot and pending.
  • resume_lease_rejects_snapshot_missing_action_coverage — lease must actually grant the pending action.
  • resume_lease_returns_none_when_no_snapshot_and_no_active_lease — clean None for caller's error path.

ilblackdragon and others added 2 commits April 19, 2026 15:28
…ase-replay

# Conflicts:
#	src/channels/web/features/oauth/mod.rs
#	src/channels/web/server.rs
Addresses PR #2631 review comments from Copilot:

1. **Snapshot used without validation** (src/bridge/router.rs:684 orig):
   `pending.paused_lease.clone()` was used directly to resume a gated
   action. A gate can sit in the pending-gate store for hours or across
   process restarts; during that window the original lease may have
   been revoked, expired, or the pending record could have drifted off
   its original thread.

   Extract `snapshot_lease_still_valid` + `resume_lease_for_pending_gate`
   helpers. The snapshot must pass four checks before use:
     - `lease.thread_id == pending.thread_id`
     - `granted_actions.covers(&pending.action_name)`
     - `!revoked`
     - `expires_at` is unset or in the future

   If any check fails, fall through to `LeaseManager::find_lease_for_action`
   (the normal path). Matches the reviewer's suggestion to avoid silently
   resuming a stale snapshot; still prefers the snapshot when valid so
   the original bug (no active lease at resume after restart) stays
   fixed.

2. **No router-level regression test** for the snapshot-vs-fallback
   decision. Six new libsql-free tests in `bridge::router::tests`:
     - `resume_lease_prefers_snapshot_even_when_lease_manager_empty` —
       reproduces the original bug; snapshot must carry the resume.
     - `resume_lease_rejects_revoked_snapshot_and_falls_back`
     - `resume_lease_rejects_expired_snapshot_and_falls_back`
     - `resume_lease_rejects_snapshot_with_wrong_thread_id`
     - `resume_lease_rejects_snapshot_missing_action_coverage`
     - `resume_lease_returns_none_when_no_snapshot_and_no_active_lease`

Verified: `cargo fmt`, `cargo clippy --all --benches --tests --examples
--all-features` (0 warnings), `cargo test -p ironclaw_engine` (435
passed), `cargo test -p ironclaw --lib` (5182 passed, +6 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size: L 200-499 changed lines and removed size: M 50-199 changed lines labels Apr 19, 2026
CI rustfmt (nightly/stable 1.95.0) wants the `assert!(matches!())` in
`orchestrator.rs::parse_outcome_gate_paused` collapsed to fewer lines.
Local rustfmt 1.94 was happy with the expanded form; matching CI to
unblock the fmt check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 19, 2026 16:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/agent/agent_loop.rs
Comment on lines 1594 to 1600
let approval_thread_uuid = if matches!(
submission,
Submission::ExecApproval { .. }
| Submission::ApprovalResponse { .. }
| Submission::ExternalCallback { .. }
| Submission::GateAuthResolution { .. }
) {
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including Submission::ExternalCallback in approval_thread_uuid resolution can mutate sess.active_thread based solely on conversation_scope even when the request_id doesn’t match any pending auth gate. A crafted JSON ExternalCallback (parsed from message content) can therefore switch the user’s active thread and then fail with "No matching pending authentication gate found." Consider preventing thread switching unless the callback is known-valid for that thread (e.g., look up the pending gate by request_id first and only then set active_thread), or gate this cross-thread routing to trusted/internal submissions only.

Copilot uses AI. Check for mistakes.
@ilblackdragon ilblackdragon merged commit 6419347 into staging Apr 19, 2026
22 checks passed
@ilblackdragon ilblackdragon deleted the henry/v2-paused-lease-replay branch April 19, 2026 16:45
ilblackdragon added a commit that referenced this pull request Apr 20, 2026
Synthesizes recurring patterns from ~30 merged PRs, 147 bot review
comments (Copilot/Gemini), human reviews, and ~50 issues filed in the
past 2 weeks. Each rule cites the motivating PR/issue numbers.

New files:
- error-handling.md — silent-failure taxonomy (unwrap_or_default, .ok()?,
  poisoned caches), persist-then-reload atomicity, channel-edge error
  mapping. (#2526, #2633, #2653, #2673, #2546, #2407, #2408)
- agent-evidence.md — side-effect claims must cite tool evidence,
  empty-fast outputs are errors, external-effect tools must read back,
  setup UI round-trip. (#2544, #2580, #2582, #2541, #2545, #2411, #2543,
  #2586)
- lifecycle.md — discovery vs. activation, terminal auth rejection,
  list_installed vs. list_active, deactivation unwinds, snapshot
  rehydrate must re-validate. (#2556, #2557, #2558, #2564, #2419,
  PR #2617, PR #2631)

Extended:
- types.md — from_trusted boundary rule, validated-newtype template with
  shared validate(&str), serde(try_from) required for validated types,
  wire-stable enums (no Debug; serde alias for migrations), canonical
  wire-contract field naming. (PR #2685, #2681, #2687, #2678, #2669,
  #2665, #2683, #2702)
- safety-and-sandbox.md — every new ingress scans pre-transform/pre-
  injection, bounded resources (interners/streams/fan-out caps), cache
  keys must be complete. (#2491, #2676, #2470, #2633, #2673, #2710,
  PR #2702)
- review-discipline.md — PR scope discipline, guardrail scripts are
  code (regression tests, grouped-import parsing, CI has_code inclusion),
  absolute-path ban in committed docs, stale comments after refactors.
  (PR #2668, #2628, #2680, #2687, #2647, #2689, #2701)

All new files carry paths: frontmatter so they auto-load only on
matching files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 20, 2026
* docs(rules): add review-driven guidance for Claude Code

Synthesizes recurring patterns from ~30 merged PRs, 147 bot review
comments (Copilot/Gemini), human reviews, and ~50 issues filed in the
past 2 weeks. Each rule cites the motivating PR/issue numbers.

New files:
- error-handling.md — silent-failure taxonomy (unwrap_or_default, .ok()?,
  poisoned caches), persist-then-reload atomicity, channel-edge error
  mapping. (#2526, #2633, #2653, #2673, #2546, #2407, #2408)
- agent-evidence.md — side-effect claims must cite tool evidence,
  empty-fast outputs are errors, external-effect tools must read back,
  setup UI round-trip. (#2544, #2580, #2582, #2541, #2545, #2411, #2543,
  #2586)
- lifecycle.md — discovery vs. activation, terminal auth rejection,
  list_installed vs. list_active, deactivation unwinds, snapshot
  rehydrate must re-validate. (#2556, #2557, #2558, #2564, #2419,
  PR #2617, PR #2631)

Extended:
- types.md — from_trusted boundary rule, validated-newtype template with
  shared validate(&str), serde(try_from) required for validated types,
  wire-stable enums (no Debug; serde alias for migrations), canonical
  wire-contract field naming. (PR #2685, #2681, #2687, #2678, #2669,
  #2665, #2683, #2702)
- safety-and-sandbox.md — every new ingress scans pre-transform/pre-
  injection, bounded resources (interners/streams/fan-out caps), cache
  keys must be complete. (#2491, #2676, #2470, #2633, #2673, #2710,
  PR #2702)
- review-discipline.md — PR scope discipline, guardrail scripts are
  code (regression tests, grouped-import parsing, CI has_code inclusion),
  absolute-path ban in committed docs, stale comments after refactors.
  (PR #2668, #2628, #2680, #2687, #2647, #2689, #2701)

All new files carry paths: frontmatter so they auto-load only on
matching files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(rules): split agent-evidence into prompt + code rule

agent-evidence.md mixed two concerns: runtime agent instruction (what
the LLM should do when concluding a turn) and code-enforcement rules
(what the dispatcher, engine, and tools must implement). Rules under
.claude/rules/ only guide Claude Code when editing the repo — the
runtime agent never reads them.

Splits the two:

- crates/ironclaw_engine/prompts/codeact_postamble.md — new section
  "Evidence before claiming side effects". Sits next to the existing
  "FINAL() answer quality" guidance; loaded via include_str! in
  executor/prompt.rs (no Rust change needed).
- .claude/rules/tool-evidence.md — renamed from agent-evidence.md,
  keeps only the code invariants (engine v2 side-effect gate,
  empty-fast ToolError::EmptyResult, external-effect tools must read
  back, setup UI round-trip).

Prompt tests pass unchanged; the postamble addition is pure text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* prompt: tighten evidence rule to FINAL() claims only, not tool use

Live-test validation of the "Evidence before claiming side effects"
section (added in the prior commit) showed it inhibited legitimate
tool use. With the original wording, `zizmor_scan_v2` live-recording
timed out at 302s with zero responses; reverting the postamble
restored healthy behavior (88s run, 8 shell calls including
`cargo install zizmor` and full workflow analysis).

The original phrasing conflated two things: what the agent should
claim and what tools it should call. The rule is only about the
claim. Re-tunes the section to:

- Open with an explicit "this does not restrict tool calls" scope.
- Drop the "<1ms = failure" heuristic (too broad — normal tools like
  `tool_info(schema)` are legitimately fast).
- Drop the full enumeration of forbidden side-effect verbs; keep the
  rule narrower and clearer.
- Shorten the code example (remove redundant early-return).

Re-tuned run: agent is active (shell calls, real reasoning), live
recording completes in ~9s. The remaining test failure is a
pre-existing assertion bug (exact `t == "shell"` match against tool
strings that now carry arguments like `"shell(cmd)"`) — reproduces
with the old postamble too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(live): fix tool-name assertions + re-record zizmor traces

The two `zizmor_scan*` live tests had four broken tool-name assertions
that silently failed to match: `tools.iter().any(|t| t == "shell")`
against a tool list that now contains `"shell(cmd)"` strings (tool
events carry args via `format_action_display_name` in
`src/bridge/router.rs`). Two of the four were negative assertions
checking for the absence of `tool_install` recovery loops — those
silently passed even when a recovery loop actually ran. `sandbox_live_e2e.rs:203`
already used the correct `t == "shell" || t.starts_with("shell(")`
pattern; applied it consistently to all four sites.

Verified live:

- `IRONCLAW_LIVE_TEST=1 cargo test --test e2e_live -- zizmor_scan --ignored --test-threads=1`
  → 2 passed, 0 failed, 51.78s. Agent installs and runs zizmor
  end-to-end, producing real findings (exit code 14, dangerous
  triggers, excessive permissions, etc.).

Traces re-recorded with the tuned postamble (commit 50d8517) and
scrubbed: replaced `/home/illia/.cargo/bin/zizmor` with
`/home/user/.cargo/bin/zizmor` per the developer-local-path ban in
`.claude/rules/review-discipline.md`. No credentials, PII, or
high-entropy secrets in either trace (only git SHAs from zizmor's
workflow analysis output).

Replay still passes: `cargo test --test e2e_live -- zizmor_scan --ignored`
→ 2/2 ok.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(replay): update zizmor_scan_v2 insta snapshot

The engine v2 replay-snapshot gate (`engine_v2_tests::snapshot_zizmor_scan_v2`)
failed against the re-recorded trace from 1691efe because the old
snapshot encoded a broken run:

- final_state: Failed
- Missing Assistant message role
- 3 issues: thread_failure (error), no_response (warning), llm_error (error)
- 6 tool calls that never produced a final answer

The new trace completes cleanly:

- final_state: Done
- System / User / Assistant roles present
- 1 issue: mixed_mode (info)
- 3 shell tool calls + successful `FINAL()` with real findings

The snapshot was pinning a regression. Regenerated with
`INSTA_UPDATE=always cargo test --test e2e_engine_v2 -- snapshot_zizmor_scan_v2`;
passes on replay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(rules): address PR #2714 review feedback

- review-discipline: reword "Doc Absolute Paths" as a review convention
  (pre-commit only scans .rs; the rule misleadingly claimed enforcement).
- safety-and-sandbox: broaden `paths:` frontmatter to include the actual
  ingress owners (`bridge`, `channels`, `workspace`, `agent`, engine
  crate) so the rule auto-loads where it applies.
- tool-evidence: mark the side-effect gate, empty-fast rule, and
  `unverified` flag as target/aspirational invariants — neither
  `ToolError::EmptyResult`, an `unverified` field on `ToolOutput`, nor a
  byte-count field on `ActionRecord` exist today. Point at concrete
  interim conventions (`ToolError::ExecutionFailed`, `unverified: true`
  in the JSON result body).
- types: scope "Validated newtypes must gate Deserialize" to *new*
  types, document the `CredentialName`/`ExtensionName` exception (they
  intentionally use `#[serde(transparent)]` + derived `Deserialize`
  under the `serde_does_not_revalidate` test). Clarify the
  `from_trusted` trust boundary (trusted = typed upstream, untrusted =
  raw JSON field even if the field *name* is "registry entry").
  Switch `new` template to `impl Into<String>` to avoid an unnecessary
  clone when an owned `String` is passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(rules): simplify types.md + split doc-hygiene; address review round 2

- types: collapse two templates into one canonical validated-newtype
  shape. New types use `#[serde(try_from = "String")]` with a shared
  `validate(&str)` helper — no more dual "transparent for some /
  try_from for others" guidance. `CredentialName`/`ExtensionName` are
  documented as the sole legacy exception (locked in by the
  `serde_does_not_revalidate` test); new code must not copy their
  `transparent` + `from_trusted` pattern. Removes the long "Using
  `from_trusted` safely" section and the separate "Validated newtypes
  must gate Deserialize" subsection that contradicted the Don'ts list.
- doc-hygiene: new tiny rule file scoped to `**/*.md`, `**/*.py`,
  `docs/**` that carries the "no developer-local absolute paths in
  committed docs" convention. Removed from review-discipline.md where
  its `src/**/*.rs` scope meant the rule never loaded on the files it
  governed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(live): match hyphenated tool-install in attempted_relevant_tool

The engine records `action_name` as the raw string the LLM emitted
(`crates/ironclaw_engine/src/executor/structured.rs:381`), and the
registry's lookup canonicalization only affects dispatch — not the
name that reaches `StatusUpdate::ToolStarted`. The two other predicates
in this file (`bad_recovery` at :420, `phase_b_recovery` at :531)
already defend against both forms; this one should too, for
consistency. Addresses PR #2714 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@henrypark133 henrypark133 mentioned this pull request Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: agent Agent core (agent loop, router, scheduler) scope: channel/web Web gateway channel size: L 200-499 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants