chore: promote staging to staging-promote/11f00698-24612972670 (2026-04-19 09:02 UTC) #2659
Open
ironclaw-ci[bot] wants to merge 1 commit into staging-promote/11f00698-24612972670
* test(replay): promote engine replay traces to insta-backed snapshot gate

Adds a ReplayOutcome snapshot type, a replay-gate CI workflow, and a developer script wrapper for cargo-insta. Replaces unreviewable 3,000-line JSON diffs on engine changes with a YAML snapshot of the observable run shape (tool sequence, final state, retrospective analyzer issues).

Why: engine v2 live-fixture traces had grown past reviewability. A single prompt-wording change could move the whole fixture, and reviewers had no way to see which behaviour actually changed. Splitting the fixture into a "replay driver" (JSON stays in tests/fixtures/) and a "regression snapshot" (YAML in tests/snapshots/) gives reviewers a narrow, stable diff to approve, while keeping the full recorded context for deterministic replay.

Changes:
- `tests/support/replay_outcome.rs`: ReplayOutcome + assert_replay_snapshot! macro; snapshots include retrospective analyzer output (TraceIssue severity/category) via a new `ironclaw::bridge::engine_retrospectives_for_test()` helper that runs `build_trace()` over engine threads
- `tests/e2e_engine_v2.rs`: three POC snapshot tests (single_tool_echo, tool_error_recovery, zizmor_scan_v2)
- `tests/e2e_bug_bash_snapshots.rs` + `tests/fixtures/llm_traces/bug_bash/`: bug-regression fixture template, mapped to open issues in the README
- `.github/workflows/replay-gate.yml`: cargo insta test --check on engine/agent/LLM/tools/bridge path changes; rejects committed .snap.new
- `scripts/replay-snap.sh`: review/accept/test/record wrappers around cargo-insta and IRONCLAW_RECORD_TRACE
- `scripts/trace-coverage.sh`: reports EventKind variants with snapshot coverage; `--strict` mode for future CI promotion
- `tests/e2e_live.rs`: `#[ignore]` swapped for `cfg_attr(not(feature="replay"), ignore)` so the replay CI job can run the scenarios without `-- --ignored`
- `Cargo.toml`: new `replay = ["libsql"]` feature; insta gains the `yaml` feature
- `tests/fixtures/llm_traces/README.md`: documents the two-role driver/snapshot split

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(replay): address PR #2621 review + swap cargo-insta installer

Review fixes:
- Replay gate was missing the bug-bash snapshot suite. Adds `tests/e2e_bug_bash_snapshots.rs` to the workflow paths trigger and the `cargo insta test --check` invocation so bug-regression snapshots are actually gated. (copilot-pull-request-reviewer)
- `cargo install cargo-insta --locked` added ~40s of cold-cache compile to the gate. Swapped for `taiki-e/install-action@v2`, which downloads a precompiled binary in a few seconds. Also updated `scripts/replay-snap.sh` to *fail closed* when cargo-insta is missing instead of silently auto-installing it. (gemini-code-assist)
- `engine_retrospectives_for_test` was `pub` and re-exported under the default-enabled `libsql` feature, contradicting its "not part of any public API" doc. Split the re-export, kept `reset_engine_state` as a plain `pub use`, and hid `engine_retrospectives_for_test` behind `#[doc(hidden)]`; it still needs to cross the crate boundary for integration tests (which live in a separate crate, so `#[cfg(test)]` doesn't reach them), but no longer appears in published docs. (copilot-pull-request-reviewer)
- Added an explicit "caller must serialize" note on `engine_retrospectives_for_test` explaining the `ENGINE_STATE` singleton and pointing new callers at `engine_v2_test_lock()` / `reset_engine_state()`. Matches what the existing snapshot tests already do. (gemini-code-assist)

Doc corrections:
- `snapshot_zizmor_scan_v2` doc claimed the snapshot pinned `ApprovalNeeded` events and response wording; it doesn't. Rewrote to describe what the snapshot actually asserts (tool order, step count, retrospective issues, final state). (copilot-pull-request-reviewer)
- `llm_call_count` was documented as "bucketed" but passed through verbatim. Updated the field doc to reflect the raw value. Bucketing wasn't needed because fixtures are deterministic. (copilot-pull-request-reviewer)
- `src/bridge/router.rs` doc referenced a non-existent `ReplayOutcome.trace_issues` field; the struct uses `engine_threads`. Fixed the reference. (copilot-pull-request-reviewer)
- `scripts/trace-coverage.sh` header claimed CI runs it with `--strict`; the workflow runs it in advisory mode. Rewrote the header to match, with a pointer for when to promote to strict. (copilot-pull-request-reviewer)

No-change replies (rationale commented in the code):
- `event_kind_name` uses an exhaustive `match` on `EventKind` rather than `Debug` or a `strum` derive. The compile-time exhaustiveness check is the point: adding a new engine event should force a conscious decision about how the snapshot represents it, not a silent fallthrough. Added a comment making that intent explicit.
- `trace-coverage.sh` awk parser of `event.rs` is fragile: agreed, but the script is advisory and its failure mode is false negatives (uncovered variants simply aren't gated). Documented the tradeoff and the rewrite-in-Rust escape hatch in the script header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(replay-gate): prime cache on staging, restrict PR runs to read-only

The second run on PR #2621 missed the cache ("No cache found" in the rust-cache restore step) even though the workflow is wired correctly. Root cause: the repo sits close to GitHub's 10 GB per-repo cache quota (~59 entries, many >500 MB), and the LRU policy evicts PR-scoped caches before they get reused.

Fix:
- Add `push: [staging, main]` so the gate runs (and saves a ~1.2 GB cache under the `replay-gate` key) on every merge to the branches PRs actually target. Subsequent PRs restore from that base-branch cache; GitHub Actions permits cross-ref restore when the restoring ref's base matches the saved ref.
- Set `save-if: ${{ github.event_name == 'push' }}` so PR runs only *read* the cache. Without this gate, each PR push would save its own copy and crowd out the primed base-branch cache, putting us right back in the eviction loop.

Expected effect: cold-cache 9m → warm ~2-3m once staging has a run with the new workflow. Base-branch prime run still pays 9m (no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(replay): drop bug-bash fixture scaffolding

Replay fixtures can't reproduce the Phase 3 target bugs because the fixture *is* the LLM's output: handwriting a trace where the LLM emits a tool call doesn't test whether the real LLM would have emitted that call, only that the harness dispatches a scripted one. What `summarization_uses_tools.json` actually pinned was the happy path, not the #2541 bug.

Of the 7 open bug-bash issues, only #2544 ("plans and delegates but never executes") is catchable by replay, and only via a live-recorded fixture. The other six are LLM-behavior or infra-timing bugs outside replay's reach. Rather than ship regression theater, tear out the scaffolding.

Removed:
- tests/e2e_bug_bash_snapshots.rs
- tests/fixtures/llm_traces/bug_bash/
- tests/snapshots/replay__bug_bash_summarization_uses_tools.snap

Unwired:
- Replay-gate workflow paths + test list no longer mention bug_bash
- scripts/replay-snap.sh test command drops the extra --test flag

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: switch to cargo-nextest with per-test timeouts

Nextest runs each integration test in its own process and runs test binaries in parallel, which is a big unlock for this repo:
- Engine v2 tests share a process-global `ENGINE_STATE` singleton (OnceLock), which the current test lock serialises inside a single test binary. Nextest's process-per-test model gives each test a clean state automatically, so the 16 engine_v2 tests stop running one-by-one.
- Cross-binary parallelism: `cargo test --test A --test B` runs binaries in sequence; nextest runs them concurrently.

Measured locally: the replay-gate test set (3 binaries, 21 tests) went from ~30s sequential to **2.7s parallel**.

Adds `.config/nextest.toml` with:
- `slow-timeout = 60s / terminate-after 3` in the default profile so a hung test fails fast instead of blocking the workflow-level 25-minute cap.
- A `ci` profile with `fail-fast = false` (one flake shouldn't mask other failures), `failure-output = immediate-final`, `success-output = never` for readable Actions logs.
- Per-test 300s override for the handful of genuinely slow scenarios (zizmor scan, e2e_thread_scheduling).

Workflows updated:
- `replay-gate.yml`: installs cargo-nextest via taiki-e/install-action alongside cargo-insta (one step), runs `cargo insta test --test-runner nextest` with `NEXTEST_PROFILE=ci`.
- `test.yml`: all five `cargo test` invocations swapped for `cargo nextest run --profile ci`. Nextest doesn't execute doctests, so every nextest step is paired with a `cargo test --doc` follow-up to preserve coverage.

Local dev is unchanged: `cargo test` still works; nextest is only required in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: re-trigger replay-gate workflow after nextest migration

Previous push only modified workflow files and `.config/nextest.toml`; GitHub skipped the `pull_request` workflow events for that sync, so the nextest migration didn't actually get exercised in CI. Empty commit forces re-evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(replay): note nextest wiring in the fixtures README

Also forces a CI re-run: the previous empty commit had no matching paths, so the `pull_request.paths` filters skipped every workflow including replay-gate. Touching a file under `tests/fixtures/llm_traces/**` re-matches the filter and runs the nextest-based gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(test): defer test.yml nextest migration

Staging restructured test.yml significantly while this PR was open (matrix-config dynamic matrix, `changes` code-detection job, composite install-cargo-component action, save-if restricted to base-branch pushes). The merge into staging had heavy conflicts for every nextest-swap hunk.

Rather than force a re-layering of the new staging structure on top of the nextest migration in this PR, revert test.yml to staging's current version. This PR now scopes the nextest change to just the replay-gate workflow (where it cleanly demonstrates the value) plus the shared `.config/nextest.toml` profile. Migrating the rest of test.yml to nextest is a follow-up that can rebase on the new structure without the heavy conflict surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Henry Park <henrypark133@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
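
For readers unfamiliar with the reasoning behind the `event_kind_name` reply in the commit notes above, the following is a minimal sketch of the exhaustive-`match` pattern it describes. The `EventKind` enum here is a hypothetical stand-in: only `ApprovalNeeded` is named in this PR, and the other variants and all of the string names are assumptions made for illustration, not the engine's real definitions.

```rust
// Hypothetical stand-in for the engine's event enum. Only `ApprovalNeeded`
// is mentioned in the PR text; `ToolCall` and `ToolError` are illustrative.
#[derive(Clone, Copy)]
enum EventKind {
    ToolCall,
    ToolError,
    ApprovalNeeded,
}

// Maps each variant to the name a snapshot would record (names are assumed).
// There is deliberately no `_ =>` arm: adding a variant to `EventKind`
// makes this function fail to compile until a name is chosen for it.
fn event_kind_name(kind: EventKind) -> &'static str {
    match kind {
        EventKind::ToolCall => "tool_call",
        EventKind::ToolError => "tool_error",
        EventKind::ApprovalNeeded => "approval_needed",
    }
}

fn main() {
    let sequence = [
        EventKind::ToolCall,
        EventKind::ApprovalNeeded,
        EventKind::ToolError,
    ];
    // Collect the stable names in order, as a snapshot of the run shape might.
    let names: Vec<&str> = sequence.iter().copied().map(event_kind_name).collect();
    println!("{}", names.join(" -> "));
}
```

A `Debug`-based or derive-based name would keep compiling when a variant is added, which is the silent fallthrough the reply wants to avoid; the hand-written `match` trades a little boilerplate for a compile error at the point where a decision is needed.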
Auto-promotion from staging CI
Batch range: a53eac5c2dec6b6cd5c08189086093fde64aa9cb..ff119531d43506e014d1e2101899c983e2ff0ec8
Promotion branch: staging-promote/ff119531-24625403497
Base: staging-promote/11f00698-24612972670
Triggered by: Staging CI batch at 2026-04-19 09:02 UTC
Commits in this batch (122):
- feat(cli): add `ironclaw profile list` subcommand (#2288)
Current commits in this promotion (1)
Current base: staging-promote/11f00698-24612972670
Current head: staging-promote/ff119531-24625403497
Current range: origin/staging-promote/11f00698-24612972670..origin/staging-promote/ff119531-24625403497
Auto-updated by staging promotion metadata workflow
Waiting for gates:
Auto-created by staging-ci workflow