chore: promote staging to staging-promote/11f00698-24612972670 (2026-04-19 09:02 UTC) #2659

Open
ironclaw-ci[bot] wants to merge 1 commit into staging-promote/11f00698-24612972670 from staging-promote/ff119531-24625403497

ironclaw-ci bot commented on Apr 19, 2026

Auto-promotion from staging CI

Batch range: a53eac5c2dec6b6cd5c08189086093fde64aa9cb..ff119531d43506e014d1e2101899c983e2ff0ec8
Promotion branch: staging-promote/ff119531-24625403497
Base: staging-promote/11f00698-24612972670
Triggered by: Staging CI batch at 2026-04-19 09:02 UTC

Commits in this batch (122):

Current commits in this promotion (1)

Current base: staging-promote/11f00698-24612972670
Current head: staging-promote/ff119531-24625403497
Current range: origin/staging-promote/11f00698-24612972670..origin/staging-promote/ff119531-24625403497

Auto-updated by staging promotion metadata workflow

Waiting for gates:

  • Tests: pending
  • E2E: pending
  • Claude Code review: pending (will post comments on this PR)

Auto-created by staging-ci workflow

* test(replay): promote engine replay traces to insta-backed snapshot gate

Adds a ReplayOutcome snapshot type, a replay-gate CI workflow, and a
developer script wrapper for cargo-insta. Replaces unreviewable 3,000-line
JSON diffs on engine changes with a YAML snapshot of the observable run
shape (tool sequence, final state, retrospective analyzer issues).

Why: engine v2 live-fixture traces had grown past reviewability. A single
prompt-wording change could move the whole fixture, and reviewers had no
way to see which behaviour actually changed. Splitting the fixture into a
"replay driver" (JSON stays in tests/fixtures/) and a "regression
snapshot" (YAML in tests/snapshots/) gives reviewers a narrow, stable diff
to approve, while keeping the full recorded context for deterministic
replay.

Changes:
- `tests/support/replay_outcome.rs` — ReplayOutcome + assert_replay_snapshot!
  macro; snapshots include retrospective analyzer output (TraceIssue
  severity/category) via a new `ironclaw::bridge::engine_retrospectives_for_test()`
  helper that runs `build_trace()` over engine threads
- `tests/e2e_engine_v2.rs` — three POC snapshot tests
  (single_tool_echo, tool_error_recovery, zizmor_scan_v2)
- `tests/e2e_bug_bash_snapshots.rs` + `tests/fixtures/llm_traces/bug_bash/`
  — bug-regression fixture template, mapped to open issues in the README
- `.github/workflows/replay-gate.yml` — cargo insta test --check on
  engine/agent/LLM/tools/bridge path changes; rejects committed .snap.new
- `scripts/replay-snap.sh` — review/accept/test/record wrappers around
  cargo-insta and IRONCLAW_RECORD_TRACE
- `scripts/trace-coverage.sh` — reports EventKind variants with
  snapshot coverage; `--strict` mode for future CI promotion
- `tests/e2e_live.rs` — `#[ignore]` swapped for
  `cfg_attr(not(feature="replay"), ignore)` so the replay CI job can
  run the scenarios without `-- --ignored`
- `Cargo.toml` — new `replay = ["libsql"]` feature; insta gains
  the `yaml` feature
- `tests/fixtures/llm_traces/README.md` — documents the two-role
  driver/snapshot split

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(replay): address PR #2621 review + swap cargo-insta installer

Review fixes:

- Replay gate was missing the bug-bash snapshot suite. Adds
  `tests/e2e_bug_bash_snapshots.rs` to the workflow paths trigger and the
  `cargo insta test --check` invocation so bug-regression snapshots are
  actually gated. (copilot-pull-request-reviewer)

- `cargo install cargo-insta --locked` added ~40s of cold-cache compile
  to the gate. Swapped for `taiki-e/install-action@v2`, which downloads
  a precompiled binary in a few seconds. Also updated
  `scripts/replay-snap.sh` to *fail closed* when cargo-insta is missing
  instead of silently auto-installing it. (gemini-code-assist)

- `engine_retrospectives_for_test` was `pub` and re-exported under the
  default-enabled `libsql` feature, contradicting its "not part of any
  public API" doc. Split the re-export, kept `reset_engine_state` as a
  plain `pub use`, and hid `engine_retrospectives_for_test` behind
  `#[doc(hidden)]` — it still needs to cross the crate boundary for
  integration tests (which live in a separate crate, so `#[cfg(test)]`
  doesn't reach them), but no longer appears in published docs.
  (copilot-pull-request-reviewer)

- Added an explicit "caller must serialize" note on
  `engine_retrospectives_for_test` explaining the `ENGINE_STATE`
  singleton and pointing new callers at `engine_v2_test_lock()` /
  `reset_engine_state()`. Matches what the existing snapshot tests
  already do. (gemini-code-assist)
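
The "caller must serialize" contract can be illustrated with a minimal sketch. The function names mirror those mentioned above, but every signature and body here is an assumption for illustration: the real `ENGINE_STATE` holds engine threads, not a list of strings, and `record_thread` and `run_serialized_scenario` are hypothetical helpers.

```rust
use std::sync::{Mutex, MutexGuard, OnceLock};

// Hypothetical stand-in for the process-global engine state.
static ENGINE_STATE: OnceLock<Mutex<Vec<String>>> = OnceLock::new();
// Hypothetical lock that tests take before touching the singleton.
static TEST_LOCK: Mutex<()> = Mutex::new(());

fn engine_state() -> &'static Mutex<Vec<String>> {
    ENGINE_STATE.get_or_init(|| Mutex::new(Vec::new()))
}

/// Take the test lock, recovering from poisoning so that one
/// panicking test does not wedge every later test.
fn engine_v2_test_lock() -> MutexGuard<'static, ()> {
    TEST_LOCK.lock().unwrap_or_else(|e| e.into_inner())
}

/// Clear the shared state so each scenario starts from scratch.
fn reset_engine_state() {
    engine_state().lock().unwrap_or_else(|e| e.into_inner()).clear();
}

/// Hypothetical helper that simulates the engine recording a thread.
fn record_thread(name: &str) {
    engine_state()
        .lock()
        .unwrap_or_else(|e| e.into_inner())
        .push(name.to_string());
}

/// Public so integration tests can call it, hidden from rendered docs.
#[doc(hidden)]
pub fn engine_retrospectives_for_test() -> Vec<String> {
    engine_state().lock().unwrap_or_else(|e| e.into_inner()).clone()
}

/// The pattern a caller follows: lock, reset, run, then read.
fn run_serialized_scenario(name: &str) -> Vec<String> {
    let _guard = engine_v2_test_lock();
    reset_engine_state();
    record_thread(name);
    engine_retrospectives_for_test()
}
```

Holding the guard for the whole scenario is what keeps a concurrent test in the same binary from resetting or appending to the shared state between the run and the read.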

Doc corrections:

- `snapshot_zizmor_scan_v2` doc claimed the snapshot pinned
  `ApprovalNeeded` events and response wording — it doesn't. Rewrote to
  describe what the snapshot actually asserts (tool order, step count,
  retrospective issues, final state). (copilot-pull-request-reviewer)

- `llm_call_count` was documented as "bucketed" but passed through
  verbatim. Updated the field doc to reflect the raw value. Bucketing
  wasn't needed because fixtures are deterministic. (copilot-pull-request-reviewer)

- `src/bridge/router.rs` doc referenced a non-existent
  `ReplayOutcome.trace_issues` field — the struct uses `engine_threads`.
  Fixed the reference. (copilot-pull-request-reviewer)

- `scripts/trace-coverage.sh` header claimed CI runs it with `--strict`;
  the workflow runs it in advisory mode. Rewrote the header to match,
  with a pointer for when to promote to strict. (copilot-pull-request-reviewer)

No-change replies (rationale commented in the code):

- `event_kind_name` uses an exhaustive `match` on `EventKind` rather
  than `Debug` or a `strum` derive. The compile-time exhaustiveness
  check is the point — adding a new engine event should force a
  conscious decision about how the snapshot represents it, not a silent
  fallthrough. Added a comment making that intent explicit.

- `trace-coverage.sh` awk parser of `event.rs` is fragile — agreed, but
  the script is advisory and its failure mode is false negatives
  (uncovered variants simply aren't gated). Documented the tradeoff and
  the rewrite-in-Rust escape hatch in the script header.
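
The exhaustiveness argument for `event_kind_name` can be shown with a small sketch. The variants and the returned names below are hypothetical; only the shape of the `match` reflects the approach described above.

```rust
/// Hypothetical subset of engine event kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventKind {
    ToolCallStarted,
    ToolCallFailed,
    ThreadCompleted,
}

/// Map each event kind to the name used in snapshots.
/// There is deliberately no `_ =>` arm: adding a variant to
/// `EventKind` makes this function fail to compile until someone
/// decides how the new event should appear in a snapshot.
fn event_kind_name(kind: EventKind) -> &'static str {
    match kind {
        EventKind::ToolCallStarted => "tool_call_started",
        EventKind::ToolCallFailed => "tool_call_failed",
        EventKind::ThreadCompleted => "thread_completed",
    }
}
```

A `Debug`-based name would keep compiling when a variant is added, and the new event would flow into snapshots without anyone having chosen its representation.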

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(replay-gate): prime cache on staging, restrict PR runs to read-only

The second run on PR #2621 missed the cache ("No cache found" in the
rust-cache restore step) even though the workflow is wired correctly.
Root cause: the repo sits close to GitHub's 10 GB per-repo cache quota
(~59 entries, many >500 MB), and the LRU policy evicts PR-scoped caches
before they get reused.

Fix:
- Add `push: [staging, main]` so the gate runs (and saves a ~1.2 GB
  cache under the `replay-gate` key) on every merge to the branches
  PRs actually target. Subsequent PRs restore from that base-branch
  cache — GitHub Actions permits cross-ref restore when the restoring
  ref's base matches the saved ref.
- Set `save-if: ${{ github.event_name == 'push' }}` so PR runs only
  *read* the cache. Without this gate, each PR push would save its
  own copy and crowd out the primed base-branch cache, putting us
  right back in the eviction loop.

Expected effect: a cold-cache run of about 9m drops to a warm run of
about 2-3m once staging has completed a run with the new workflow. The
base-branch priming run still takes 9m (no regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(replay): drop bug-bash fixture scaffolding

Replay fixtures can't reproduce the Phase 3 target bugs because the
fixture *is* the LLM's output — handwriting a trace where the LLM
emits a tool call doesn't test whether the real LLM would have emitted
that call, only that the harness dispatches a scripted one. What
`summarization_uses_tools.json` actually pinned was the happy path,
not the #2541 bug.

Of the 7 open bug-bash issues, only #2544 ("plans and delegates but
never executes") is catchable by replay, and only via a live-recorded
fixture. The other six are LLM-behavior or infra-timing bugs outside
replay's reach. Rather than ship regression theater, tear out the
scaffolding.

Removed:
- tests/e2e_bug_bash_snapshots.rs
- tests/fixtures/llm_traces/bug_bash/
- tests/snapshots/replay__bug_bash_summarization_uses_tools.snap

Unwired:
- Replay-gate workflow paths + test list no longer mention bug_bash
- scripts/replay-snap.sh test command drops the extra --test flag

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: switch to cargo-nextest with per-test timeouts

Nextest runs each integration test in its own process and runs test
binaries in parallel, which is a big unlock for this repo:

- Engine v2 tests share a process-global `ENGINE_STATE` singleton
  (OnceLock), which the current test lock serialises inside a single
  test binary. Nextest's process-per-test model gives each test a
  clean state automatically, so the 16 engine_v2 tests stop running
  one-by-one.

- Cross-binary parallelism: `cargo test --test A --test B` runs
  binaries in sequence; nextest runs them concurrently.

Measured locally: the replay-gate test set (3 binaries, 21 tests)
went from ~30s sequential to **2.7s parallel**.

Adds `.config/nextest.toml` with:
- `slow-timeout = 60s / terminate-after 3` in the default profile so
  a hung test fails fast instead of stalling until the workflow-level
  25-minute cap.
- A `ci` profile with `fail-fast = false` (one flake shouldn't mask
  other failures), `failure-output = immediate-final`,
  `success-output = never` for readable Actions logs.
- Per-test 300s override for the handful of genuinely slow scenarios
  (zizmor scan, e2e_thread_scheduling).

Workflows updated:
- `replay-gate.yml`: installs cargo-nextest via taiki-e/install-action
  alongside cargo-insta (one step), runs `cargo insta test
  --test-runner nextest` with `NEXTEST_PROFILE=ci`.
- `test.yml`: all five `cargo test` invocations swapped for
  `cargo nextest run --profile ci`. Nextest doesn't execute doctests,
  so every nextest step is paired with a `cargo test --doc` follow-up
  to preserve coverage.

Local dev is unchanged — `cargo test` still works; nextest is only
required in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: re-trigger replay-gate workflow after nextest migration

Previous push only modified workflow files and `.config/nextest.toml`;
GitHub skipped the `pull_request` workflow events for that sync, so
the nextest migration didn't actually get exercised in CI. Empty
commit forces re-evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(replay): note nextest wiring in the fixtures README

Also forces a CI re-run: the previous empty commit had no matching
paths, so the `pull_request.paths` filters skipped every workflow
including replay-gate. Touching a file under
`tests/fixtures/llm_traces/**` re-matches the filter and runs the
nextest-based gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(test): defer test.yml nextest migration

Staging restructured test.yml significantly while this PR was open
(matrix-config dynamic matrix, `changes` code-detection job,
composite install-cargo-component action, save-if restricted to
base-branch pushes). The merge into staging had heavy conflicts for
every nextest-swap hunk.

Rather than force a re-layering of the new staging structure on top
of the nextest migration in this PR, revert test.yml to staging's
current version. This PR now scopes the nextest change to just the
replay-gate workflow (where it cleanly demonstrates the value) plus
the shared `.config/nextest.toml` profile. Migrating the rest of
test.yml to nextest is a follow-up that can rebase on the new
structure without the heavy conflict surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Henry Park <henrypark133@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added labels on Apr 19, 2026: scope: ci (CI/CD workflows), scope: docs (Documentation), scope: dependencies (Dependency updates), size: L (200-499 changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: core (20+ merged PRs)
Labels

  • contributor: core (20+ merged PRs)
  • risk: medium (Business logic, config, or moderate-risk modules)
  • scope: ci (CI/CD workflows)
  • scope: dependencies (Dependency updates)
  • scope: docs (Documentation)
  • size: L (200-499 changed lines)
  • staging-promotion
