feat(test): dual-mode live/replay test harness with LLM judge by ilblackdragon · Pull Request #2039 · nearai/ironclaw

ilblackdragon · 2026-04-05T11:59:27Z

Summary

- Add a general-purpose dual-mode test harness (LiveTestHarness) for running E2E tests either with live LLM calls (recording traces) or by replaying saved traces (deterministic, no API keys)
- TestRigBuilder gains with_config() for real-binary config parity and with_http_interceptor() for live recording support
- First test case: zizmor security scanner against ironclaw's own GitHub Actions workflows
- Engine v2 variant reveals that EffectBridgeAdapter doesn't honor the auto_approve_tools config; this is documented as a known gap

How it works

# Record a trace (live LLM + real tools):
IRONCLAW_LIVE_TEST=1 cargo test --features libsql --test e2e_live -- --ignored

# Replay from saved trace (deterministic):
cargo test --features libsql --test e2e_live -- --ignored

- Live mode loads real config via Config::from_env(), so ENGINE_V2, ALLOW_LOCAL_TOOLS, and approval gates all match the real binary
- Session logs (.log) are saved alongside trace fixtures (.json) for human inspection and live-vs-replay diffing
- An LLM judge verifies response quality semantically in live mode
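
The mode switch described above could be modeled roughly as follows. This is a minimal sketch, not the harness's actual code: the `Mode` enum and both function names are hypothetical, and treating only the value `1` as live mode is an assumption based on the commands shown.

```rust
/// Hypothetical mode enum; the real harness may model this differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    /// Real LLM calls and real tools; a trace is recorded to disk.
    Live,
    /// Responses come from a saved trace fixture; no API keys needed.
    Replay,
}

/// Decide the mode from the raw value of `IRONCLAW_LIVE_TEST`.
/// Assumption: only "1" selects live mode; unset or any other value replays.
fn mode_from_env_value(value: Option<&str>) -> Mode {
    match value {
        Some("1") => Mode::Live,
        _ => Mode::Replay,
    }
}

/// Read the variable from the process environment.
fn current_mode() -> Mode {
    let raw = std::env::var("IRONCLAW_LIVE_TEST").ok();
    mode_from_env_value(raw.as_deref())
}
```

Keeping the decision in a pure function that takes the raw value makes it testable without mutating the process environment.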

New files

| File | Purpose |
| --- | --- |
| `tests/support/live_harness.rs` | `LiveTestHarness`, `LiveTestHarnessBuilder`, `judge_response()` |
| `tests/e2e_live.rs` | `zizmor_scan` (v1) + `zizmor_scan_v2` (engine v2) |
| `tests/fixtures/llm_traces/live/*.json` | Recorded traces |
| `tests/fixtures/llm_traces/live/*.log` | Session logs |
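
The fixture layout in the table can be sketched as a pair of path helpers. The directory and the `.json`/`.log` extensions come from the PR; the helper names `trace_path` and `session_log_path` are hypothetical and are not claimed to exist in `live_harness.rs`.

```rust
use std::path::PathBuf;

/// Directory named in the PR for live trace fixtures.
const LIVE_FIXTURE_DIR: &str = "tests/fixtures/llm_traces/live";

/// Path of the recorded trace for a test (hypothetical helper).
fn trace_path(test_name: &str) -> PathBuf {
    PathBuf::from(LIVE_FIXTURE_DIR).join(format!("{test_name}.json"))
}

/// Path of the session log saved next to the trace (hypothetical helper).
fn session_log_path(test_name: &str) -> PathBuf {
    PathBuf::from(LIVE_FIXTURE_DIR).join(format!("{test_name}.log"))
}
```

Deriving both paths from the test name keeps the trace and its log side by side, which is what makes live-vs-replay diffing convenient.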

Test plan

- cargo clippy --features libsql --tests: zero warnings
- Existing recorded trace tests pass (e2e_recorded_trace)
- IRONCLAW_LIVE_TEST=1 records trace + session log for v1 and v2
- Replay mode loads trace and passes assertions
- Replay session log can be diffed against live log (identical for deterministic replay)
- V2 test documents approval gate gap (relaxed assertions)

🤖 Generated with Claude Code

Add a general-purpose test infrastructure for running E2E tests in two modes:
- Live mode (IRONCLAW_LIVE_TEST=1): real LLM calls with real tools, records traces to disk for future replay
- Replay mode (default): loads saved trace fixtures, deterministic, no API keys

The harness uses Config::from_env() in live mode so the test agent mirrors the real binary's behavior (engine_v2, allow_local_tools, approval gates). Includes an LLM judge for semantic verification of non-deterministic output, and saves human-readable session logs alongside trace fixtures for inspection and diffing between live and replay runs.

First test case: zizmor security scanner against ironclaw's own workflows.

New files:
- tests/support/live_harness.rs: LiveTestHarness, builder, LLM judge
- tests/e2e_live.rs: zizmor_scan test
- tests/fixtures/llm_traces/live/: recorded trace + session log

TestRigBuilder additions:
- with_http_interceptor() for injecting RecordingHttpInterceptor
- with_config() for real-binary config parity (respects allow_local_tools, engine_v2 from env instead of forcing test defaults)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ness

Add zizmor_scan_v2 test that exercises the same scenario through engine v2. Documents the current v2 limitation: auto_approve_tools config flag is not honored by EffectBridgeAdapter. It only checks the per-session "always" set, so shell calls pause at the approval gate.

Also:
- Add with_engine_v2() to LiveTestHarnessBuilder for config override
- Refactor v1 test to use shared run_zizmor_scan() helper
- V2 test has relaxed assertions matching current v2 behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a dual-mode E2E test harness that supports both live LLM execution with recording and deterministic replay from saved traces. It includes new tests for zizmor security scans and updates the TestRig to better mirror real binary behavior by allowing configuration overrides. Feedback focuses on ensuring UTF-8 safety during string truncation in logs and correctly respecting tool approval configurations to maintain parity with the production environment.

gemini-code-assist · 2026-04-05T12:03:48Z

+                    let short = if preview.len() > 200 {
+                        format!("{}…", &preview[..200])
+                    } else {
+                        preview
+                    };


The string truncation logic here (&preview[..200]) is not safe for UTF-8 strings. If a multi-byte character happens to cross the 200-byte boundary, this will cause a panic. To fix this, you should use a character-boundary-aware truncation method. The crate already has a utility for this: ironclaw::agent::agentic_loop::truncate_for_preview. Using it will prevent potential panics and ensure the log output is always valid.

let short = ironclaw::agent::agentic_loop::truncate_for_preview(&preview, 200);

References

When truncating a UTF-8 string, use character-aware methods (e.g., char_indices() or iterating over chars()) to avoid panics caused by slicing in the middle of a multi-byte character.

Fixed in f7ceb6e — replaced byte-index slicing with char_indices() to find a safe UTF-8 boundary before truncating.
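
A boundary-safe truncation along the lines described in the fix might look like the sketch below. It is an illustration of the `char_indices()` technique, not the code from commit f7ceb6e; the function name `truncate_preview` is hypothetical.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a character,
/// appending an ellipsis when anything was cut off.
fn truncate_preview(s: &str, max_bytes: usize) -> String {
    if s.len() <= max_bytes {
        return s.to_string();
    }
    // `char_indices` yields the byte offset where each character starts,
    // so the last offset that is <= max_bytes is always a valid slice end.
    let end = s
        .char_indices()
        .map(|(idx, _)| idx)
        .take_while(|&idx| idx <= max_bytes)
        .last()
        .unwrap_or(0);
    format!("{}…", &s[..end])
}
```

For example, `truncate_preview("héllo", 2)` cuts before the two-byte `é` rather than through it, whereas `&"héllo"[..2]` would panic.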

gemini-code-assist · 2026-04-05T12:03:48Z

+            .with_llm(llm)
+            .with_http_interceptor(http_interceptor)
+            .with_max_tool_iterations(self.max_tool_iterations)
+            .with_auto_approve_tools(true)


Forcing auto_approve_tools to true here overrides the value from the user's configuration loaded via Config::from_env(). This prevents the live test harness from accurately mirroring the real binary's behavior regarding tool approval gates, which contradicts one of the goals of this harness. This makes it difficult to test scenarios involving approvals, such as the one documented in zizmor_scan_v2. This line should be removed to allow the configuration from the environment to be respected. A corresponding change is needed in tests/support/test_rig.rs to change the default for auto_approve_tools to None.

Fixed in f7ceb6e — removed forced .with_auto_approve_tools(true) from build_live(). The harness now respects Config::from_env() by default. Added .with_auto_approve_tools() to the builder so tests can opt in explicitly.

gemini-code-assist · 2026-04-05T12:03:48Z

+            config_override: None,
            max_tool_iterations: 10,
            injection_check: false,
            auto_approve_tools: Some(true),


The default for auto_approve_tools should be None instead of Some(true). When with_config() is used in the live test harness, this will ensure that the value from the loaded configuration is used by default, rather than being overridden. This change is necessary for the live test harness to correctly reflect the real binary's approval gate behavior.

Suggested change

-            auto_approve_tools: Some(true),
+            auto_approve_tools: None,

The default Some(true) is kept for backward compat with all existing tests. When with_config() is used, the live harness now controls auto_approve via its own builder method, which applies the override to the config before TestRig sees it.

Copilot

Pull request overview

This PR introduces a dual-mode E2E test harness that can either (1) run with live LLM/tool calls while recording traces, or (2) deterministically replay from committed trace fixtures (no API keys), and adds initial zizmor-based live/replay E2E scenarios.

Changes:

- Extend TestRigBuilder to support real-binary config parity (with_config) and direct HTTP interceptor injection (with_http_interceptor).
- Add LiveTestHarness to orchestrate live recording vs replay runs, plus an optional LLM-based judge in live mode.
- Add an ignored e2e_live integration test and commit initial live/replay trace + log fixtures.

Reviewed changes

Copilot reviewed 6 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `tests/support/test_rig.rs` | Adds config override + HTTP interceptor override paths to align test rigs with real binary behavior and enable recording/replay. |
| `tests/support/mod.rs` | Exposes the new `live_harness` module behind `libsql`. |
| `tests/support/live_harness.rs` | Implements the dual-mode harness and a simple “judge” helper. |
| `tests/e2e_live.rs` | Adds ignored live/replay E2E tests for a zizmor scan (v1 + v2 variant). |
| `tests/fixtures/llm_traces/live/zizmor_scan.json` | Recorded trace fixture for replay. |
| `tests/fixtures/llm_traces/live/zizmor_scan.log` | Session log fixture (currently labeled Replay). |
| `tests/fixtures/llm_traces/live/zizmor_scan_v2.json` | Recorded trace fixture for engine v2 scenario (includes HTTP capture). |
| `tests/fixtures/llm_traces/live/zizmor_scan_v2.log` | Session log for engine v2 scenario. |
| `tests/fixtures/llm_traces/live/.gitkeep` | Keeps the live fixture directory present in git. |

Comments suppressed due to low confidence (1)

tests/support/test_rig.rs:598

build(self) destructures self into locals, but later in this function it still references self.keep_bootstrap (see around line ~897). After destructuring, self is moved, so this will not compile. Use the destructured keep_bootstrap local consistently (or avoid destructuring / keep self intact).

        // Destructure self up front to avoid partial-move issues.
        let TestRigBuilder {
            trace,
            llm,
            config_override,
            max_tool_iterations,
            injection_check,
            auto_approve_tools,
            enable_skills,
            enable_routines,
            http_exchanges: explicit_http_exchanges,
            http_interceptor_override,
            extra_tools,
            wasm_tools,
            keep_bootstrap,
            engine_v2,
        } = self;

Copilot · 2026-04-05T12:04:28Z

        // AppBuilder may re-resolve config from env/TOML and override test defaults.
-        // Force test-rig agent flags to the requested deterministic values.
-        components.config.agent.auto_approve_tools = auto_approve_tools.unwrap_or(true);
-        components.config.agent.allow_local_tools = true;
-        components.config.agent.engine_v2 = engine_v2;
+        // When a config override was provided, preserve its agent settings to mirror
+        // the real binary. Otherwise force deterministic test defaults.
+        if has_config_override {
+            if let Some(v) = auto_approve_tools {
+                components.config.agent.auto_approve_tools = v;
+            }
+            // allow_local_tools and engine_v2 come from the provided config.
+        } else {
+            components.config.agent.auto_approve_tools = auto_approve_tools.unwrap_or(true);
+            components.config.agent.allow_local_tools = true;
+            components.config.agent.engine_v2 = engine_v2;
+        }


When config_override is present, this branch ignores the builder’s with_engine_v2() flag (engine_v2 local) and relies solely on the provided config. This makes with_engine_v2() a no-op in config-parity mode, which is surprising for callers; consider applying engine_v2 when explicitly set, or document that with_config() takes precedence for engine selection.

Fixed in f7ceb6e — the config-override branch now applies the builder's engine_v2 flag when explicitly set.
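
The precedence rules discussed in this thread can be sketched as a single resolution function. This is a simplified model under stated assumptions: `AgentFlags` and `resolve_agent_flags` are hypothetical names, `engine_v2` is modeled as an `Option<bool>` to express "explicitly set", and the `false` default for `engine_v2` in the no-override branch is an assumption rather than something the diff confirms.

```rust
/// Stand-in for the agent flags touched in the diff (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct AgentFlags {
    auto_approve_tools: bool,
    allow_local_tools: bool,
    engine_v2: bool,
}

/// `from_config` is `Some` when the caller supplied a config override.
/// Explicit builder values win; otherwise the supplied config is preserved.
fn resolve_agent_flags(
    from_config: Option<AgentFlags>,
    auto_approve_tools: Option<bool>,
    engine_v2: Option<bool>,
) -> AgentFlags {
    match from_config {
        // Config-parity mode: only apply what the builder set explicitly.
        Some(mut flags) => {
            if let Some(v) = auto_approve_tools {
                flags.auto_approve_tools = v;
            }
            if let Some(v) = engine_v2 {
                flags.engine_v2 = v;
            }
            flags
        }
        // No override: force deterministic test defaults.
        None => AgentFlags {
            auto_approve_tools: auto_approve_tools.unwrap_or(true),
            allow_local_tools: true,
            engine_v2: engine_v2.unwrap_or(false), // assumed default
        },
    }
}
```

Using `Option` for each builder knob is what lets the function tell "not set" apart from "set to false", so a builder call is never silently ignored.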

Copilot · 2026-04-05T12:04:29Z

+        // - auto_approve_tools controls approval gates
+        let rig = TestRigBuilder::new()
+            .with_config(config)
+            .with_llm(llm)
+            .with_http_interceptor(http_interceptor)
+            .with_max_tool_iterations(self.max_tool_iterations)
+            .with_auto_approve_tools(true)


build_live claims to mirror real-binary behavior (including approval gates), but it unconditionally forces .with_auto_approve_tools(true), overriding whatever Config::from_env() resolved. To keep parity, consider defaulting to the config value (don’t call with_auto_approve_tools here) and add an explicit builder knob to override when tests want to bypass approvals.

Suggested change

-        // - auto_approve_tools controls approval gates
-        let rig = TestRigBuilder::new()
-            .with_config(config)
-            .with_llm(llm)
-            .with_http_interceptor(http_interceptor)
-            .with_max_tool_iterations(self.max_tool_iterations)
-            .with_auto_approve_tools(true)
+        // - auto_approve_tools remains driven by the resolved config
+        let rig = TestRigBuilder::new()
+            .with_config(config)
+            .with_llm(llm)
+            .with_http_interceptor(http_interceptor)
+            .with_max_tool_iterations(self.max_tool_iterations)
Fixed in f7ceb6e — removed forced .with_auto_approve_tools(true) from build_live(). The harness now respects Config::from_env() by default. Added .with_auto_approve_tools() to the builder so tests can opt in explicitly.

Copilot · 2026-04-05T12:04:29Z

+/// Builder for constructing a `LiveTestHarness`.
+pub struct LiveTestHarnessBuilder {
+    test_name: String,
+    max_tool_iterations: usize,
+    timeout: Duration,
+    engine_v2: Option<bool>,
+}
+
+impl LiveTestHarnessBuilder {
+    /// Create a new builder for a test with the given name.
+    ///
+    /// The name determines the trace fixture filename:
+    /// `tests/fixtures/llm_traces/live/{test_name}.json`
+    pub fn new(test_name: impl Into<String>) -> Self {
+        Self {
+            test_name: test_name.into(),
+            max_tool_iterations: 30,
+            timeout: Duration::from_secs(120),
+            engine_v2: None,
+        }
+    }
+
+    /// Set the maximum number of tool iterations per agentic loop invocation.
+    pub fn with_max_tool_iterations(mut self, n: usize) -> Self {
+        self.max_tool_iterations = n;
+        self
+    }
+
+    /// Set the default timeout for the test (used as a hint; callers still
+    /// pass timeout to `wait_for_responses`).
+    pub fn with_timeout(mut self, timeout: Duration) -> Self {
+        self.timeout = timeout;
+        self
+    }


LiveTestHarnessBuilder stores timeout, but it’s never read/used when building the rig or waiting for responses. This can mislead callers into thinking it changes harness behavior; either wire it into the harness API (e.g., expose as default for wait_for_responses / internal timeouts) or remove it.

Fixed in f7ceb6e — removed the timeout field and with_timeout() method.

Copilot · 2026-04-05T12:04:29Z

+            let pass = trimmed.starts_with("PASS");
+            JudgeVerdict {
+                pass,
+                reasoning: trimmed.to_string(),
+            }


judge_response treats any output starting with PASS as a pass, even if the judge doesn’t follow the required PASS: / FAIL: one-line format. This makes false positives more likely (e.g., PASSING ... or multi-line output). Consider parsing strictly for PASS: / FAIL: (case-insensitive) and treating anything else as a failure with a helpful reasoning message.

Suggested change

-            let pass = trimmed.starts_with("PASS");
-            JudgeVerdict {
-                pass,
-                reasoning: trimmed.to_string(),
-            }
+            if trimmed.is_empty() {
+                return JudgeVerdict {
+                    pass: false,
+                    reasoning: "Judge returned empty output; expected exactly one line starting with PASS: or FAIL:".to_string(),
+                };
+            }
+            if trimmed.lines().count() != 1 {
+                return JudgeVerdict {
+                    pass: false,
+                    reasoning: format!(
+                        "Judge returned invalid multi-line output; expected exactly one line starting with PASS: or FAIL:. Raw output: {}",
+                        trimmed
+                    ),
+                };
+            }
+            let line = trimmed;
+            if let Some(reasoning) = line
+                .get(..5)
+                .filter(|prefix| prefix.eq_ignore_ascii_case("PASS:"))
+                .map(|_| line[5..].trim())
+            {
+                JudgeVerdict {
+                    pass: true,
+                    reasoning: reasoning.to_string(),
+                }
+            } else if let Some(reasoning) = line
+                .get(..5)
+                .filter(|prefix| prefix.eq_ignore_ascii_case("FAIL:"))
+                .map(|_| line[5..].trim())
+            {
+                JudgeVerdict {
+                    pass: false,
+                    reasoning: reasoning.to_string(),
+                }
+            } else {
+                JudgeVerdict {
+                    pass: false,
+                    reasoning: format!(
+                        "Judge returned invalid output; expected exactly one line starting with PASS: or FAIL:. Raw output: {}",
+                        trimmed
+                    ),
+                }
+            }

Fixed in f7ceb6e — now parses strictly for PASS: / FAIL: prefix via strip_prefix(). Any other format is treated as failure with a diagnostic message.
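
A `strip_prefix()`-based parser of the kind the reply describes could look like this. It is a sketch rather than the merged code: the `JudgeVerdict` fields match the diff above, but the function name `parse_verdict`, the case-sensitive matching, and the wording of the diagnostic are assumptions.

```rust
/// Verdict shape taken from the diff (`pass` + `reasoning`).
#[derive(Debug, PartialEq)]
struct JudgeVerdict {
    pass: bool,
    reasoning: String,
}

/// Accept only output that begins with `PASS:` or `FAIL:`.
/// Anything else is a failure with a diagnostic message.
fn parse_verdict(raw: &str) -> JudgeVerdict {
    let trimmed = raw.trim();
    if let Some(rest) = trimmed.strip_prefix("PASS:") {
        JudgeVerdict { pass: true, reasoning: rest.trim().to_string() }
    } else if let Some(rest) = trimmed.strip_prefix("FAIL:") {
        JudgeVerdict { pass: false, reasoning: rest.trim().to_string() }
    } else {
        JudgeVerdict {
            pass: false,
            reasoning: format!(
                "unexpected judge output (expected PASS: or FAIL:): {trimmed}"
            ),
        }
    }
}
```

Requiring the colon is what rejects near-misses such as `PASSING ...`, which the original `starts_with("PASS")` check would have accepted.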

Copilot · 2026-04-05T12:04:29Z

+        "min_message_count": 6
+      },
+      "response": {
+        "type": "text",
+        "content": "\n\nGreat! zizmor is already installed at `/Users/coder/.cargo/bin/zizmor`. \n\nTo run it, I need to know what you'd like to audit:\n1. **Current directory** - audit the GitHub Actions workflows in this repo\n2. **Specific repository** - audit another GitHub repo (e.g., `github:user/repo`)\n3. **Local workflow files** - audit specific `.yml` files\n\nWhich would you like to do? Or if you're in a repo with GitHub Actions workflows, I can just run it on the current directory.\n[Called tool `list_dir` with arguments: {\"path\":\".\"}]",
+        "input_tokens": 14778,
+        "output_tokens": 188
+      },
+      "expected_tool_results": [
+        {
+          "tool_call_id": "chatcmpl-tool-8e89a891c7087c33",
+          "name": "shell",
+          "content": "<tool_output name=\"shell\">\n{\n  \"exit_code\": 0,\n  \"output\": \"/Users/coder/.cargo/bin/zizmor\\n\",\n  \"sandboxed\": false,\n  \"success\": true\n}\n</tool_output>"
+        }
+      ]
+    },
+    {
+      "request_hint": {
+        "last_user_message_contains": "can we run https://github.com/zizmorcore/zizmor",
+        "min_message_count": 8
+      },
+      "response": {
+        "type": "text",
+        "content": "\n\nI can see there's a `.github/` directory which likely contains GitHub Actions workflows. Let me check what's in there and then run zizmor on this repository.\n[Called tool `list_dir` with arguments: {\"path\":\".github\"}]",
+        "input_tokens": 16007,
+        "output_tokens": 86
+      },
+      "expected_tool_results": [
+        {
+          "tool_call_id": "00000001B",
+          "name": "list_dir",
+          "content": "<tool_output name=\"list_dir\">\n{\n  \"count\": 70,\n  \"entries\": [\n    \".claude/\",\n    \".git/\",\n    \".githooks/\",\n    \".github/\",\n    \"channels-src/\",\n    \"crates/\",\n    \"deploy/\",\n    \"docker/\",\n    \"docs/\",\n    \"fuzz/\",\n    \"migrations/\",\n    \"registry/\",\n    \"scripts/\",\n    \"skills/\",\n    \"src/\",\n    \"target/\",\n    \"tests/\",\n    \"tools-src/\",\n    \"wit/\",\n    \"wix/\",\n    \".dockerignore (67B)\",\n    \".env.example (12.5KB)\",\n    \".gitattributes (50B)\",\n    \".gitignore (593B)\",\n    \"AGENTS.md (5.4KB)\",\n    \"CHANGELOG.md (69.2KB)\",\n    \"CLAUDE.md (14.7KB)\",\n    \"CONTRIBUTING.md (5.3KB)\",\n    \"COVERAGE_PLAN.md (32.2KB)\",\n    \"Cargo.lock (227.0KB)\",\n    \"Cargo.toml (9.1KB)\",\n    \"Dockerfile (2.3KB)\",\n    \"Dockerfile.test (1.5KB)\",\n    \"Dockerfile.worker (2.3KB)\",\n    \"FEATURE_PARITY.md (30.5KB)\",\n    \"LICENSE-APACHE (10.5KB)\",\n    \"LICENSE-MIT (1.0KB)\",\n    \"README.ja.md (16.7KB)\",\n    \"README.md (13.7KB)\",\n    \"README.ru.md (20.4KB)\",\n    \"README.zh-CN.md (13.3KB)\",\n    \"build.rs (8.5KB)\",\n    \"clippy.toml (537B)\",\n    \"codecov.yml (219B)\",\n    \"deny.toml (1.8KB)\",\n    \"docker-compose.yml (524B)\",\n    \"engine_trace_20260405T100717.json (20.9KB)\",\n    \"ironclaw.bash (100.3KB)\",\n    \"ironclaw.fish (69.9KB)\",\n    \"ironclaw.png (266.7KB)\",\n    \"ironclaw.zsh (84.0KB)\",\n    \"providers.json (12.8KB)\",\n    \"release-plz.toml (39B)\",\n    \"trace_20260309T172815.json (8.8KB)\",\n    \"trace_20260310T075912.json (4.8KB)\",\n    \"trace_20260311T002753.json (4.8KB)\",\n    \"trace_20260311T130928.json (4.8KB)\",\n    \"trace_20260311T141121.json (9.1KB)\",\n    \"trace_20260311T141138.json (11.7KB)\",\n    \"trace_20260311T141430.json (9.1KB)\",\n    \"trace_20260311T181338.json (11.8KB)\",\n    \"trace_20260311T183957.json (9.2KB)\",\n    \"trace_20260311T185836.json (9.2KB)\",\n    
\"trace_20260311T191112.json (95.2KB)\",\n    \"trace_20260311T191641.json (156.7KB)\",\n    \"trace_20260311T194203.json (9.8KB)\",\n    \"trace_20260311T194240.json (272.4KB)\",\n    \"trace_20260311T214436.json (4.7KB)\",\n    \"trace_20260311T214457.json (133.1KB)\",\n    \"trace_20260311T221512.json (7.6KB)\"\n  ],\n  \"path\": \"/Users/coder/ironclaw8\",\n  \"truncated\": false\n}\n</tool_output>"
+        }
+      ]


This trace fixture embeds absolute local paths (e.g. /Users/coder/ironclaw8 and /Users/coder/.cargo/bin/zizmor) inside recorded tool outputs. That leaks machine-specific info and makes fixtures harder to share/re-record across environments. Consider scrubbing/normalizing paths in recorded fixtures (or in the recorder) before committing (e.g., replace with relative paths or a placeholder root).

Acknowledged — these are inside recorded tool output (what the LLM saw). Scrubbing would make replay diverge from reality. Will consider path normalization / HTTP body trimming in the recorder as a follow-up.
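
If path normalization is added to the recorder later, one simple form is a prefix substitution applied before the fixture is written. The sketch below is purely illustrative: `normalize_root` and the `<REPO_ROOT>` placeholder are hypothetical, and nothing like this exists in the PR.

```rust
/// Replace a machine-specific root directory with a stable placeholder.
/// An empty `root` is ignored, because replacing the empty pattern would
/// insert the placeholder between every character.
fn normalize_root(content: &str, root: &str, placeholder: &str) -> String {
    if root.is_empty() {
        return content.to_string();
    }
    content.replace(root, placeholder)
}
```

As the reply notes, any such rewrite changes what is stored relative to what the LLM actually saw, so it would need to be applied consistently on both the record and replay sides.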

Copilot · 2026-04-05T12:04:30Z

+          "content": "{'count': 0, 'results': [], 'searched_online': False}"
+        },
+        {
+          "tool_call_id": "chatcmpl-tool-a3b63943e42a0ecb",
+          "name": "http",
+          "content": "{'body': '# Search code, repositories, users, issues, pull requests...\\n\\n\\n\\n\\n\\n[Sign in](/login?return_to=https%3A%2F%2Fgithub.com%2Fzizmorcore%2Fzizmor)\\n\\n/;ref_cta:Sign up;ref_loc:header logged out\"}\" data-hydro-click=\"{\"event_type\":\"authentication.click\",\"payload\":{\"location_in_page\":\"site header menu\",\"repository_id\":null,\"auth_type\":\"SIGN_UP\",\"originating_url\":\"https://github.com/zizmorcore/zizmor\",\"user_id\":null}}\" data-hydro-click-hmac=\"8caabd8b90225b5d5573f566ac422fd36398eba76666d09587f4d4f6f6450bb2\" href=\"/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E%2F%3Crepo-name%3E&source=header-repo&source_repo=zizmorcore%2Fzizmor\">\\n Sign up\\n\\nAppearance settings\\n', 'headers': {'accept-ranges': 'bytes', 'cache-control': 'max-age=0, private, must-revalidate', 'content-security-policy': \"default-src 'none'; base-uri 'self'; child-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com *.rel.tunnels.api.visualstudio.com wss://*.rel.tunnels.api.visualstudio.com github.githubassets.com objects-origin.githubusercontent.com copilot-proxy.githubusercontent.com proxy.individual.githubcopilot.com proxy.business.githubcopilot.com proxy.enterprise.githubcopilot.com *.actions.githubusercontent.com wss://*.actions.githubusercontent.com productionresultssa0.blob.core.windows.net productionresultssa1.blob.core.windows.net productionresultssa2.blob.core.windows.net productionresultssa3.blob.core.windows.net productionresultssa4.blob.core.windows.net productionresultssa5.blob.core.windows.net 
productionresultssa6.blob.core.windows.net productionresultssa7.blob.core.windows.net productionresultssa8.blob.core.windows.net productionresultssa9.blob.core.windows.net productionresultssa10.blob.core.windows.net productionresultssa11.blob.core.windows.net productionresultssa12.blob.core.windows.net productionresultssa13.blob.core.windows.net productionresultssa14.blob.core.windows.net productionresultssa15.blob.core.windows.net productionresultssa16.blob.core.windows.net productionresultssa17.blob.core.windows.net productionresultssa18.blob.core.windows.net productionresultssa19.blob.core.windows.net github-production-repository-image-32fea6.s3.amazonaws.com github-production-release-asset-2e65be.s3.amazonaws.com insights.github.com wss://alive.github.com wss://alive-staging.github.com api.githubcopilot.com api.individual.githubcopilot.com api.business.githubcopilot.com api.enterprise.githubcopilot.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com copilot-workspace.githubnext.com objects-origin.githubusercontent.com; frame-ancestors 'none'; frame-src viewscreen.githubusercontent.com notebooks.githubusercontent.com; img-src 'self' data: blob: github.githubassets.com media.githubusercontent.com camo.githubusercontent.com identicons.github.com avatars.githubusercontent.com private-avatars.githubusercontent.com github-cloud.s3.amazonaws.com objects.githubusercontent.com release-assets.githubusercontent.com secured-user-images.githubusercontent.com user-images.githubusercontent.com private-user-images.githubusercontent.com opengraph.githubassets.com marketplace-screenshots.githubusercontent.com copilotprodattachments.blob.core.windows.net/github-production-copilot-attachments/ github-production-user-asset-6210df.s3.amazonaws.com customer-stories-feed.github.com spotlights-feed.github.com objects-origin.githubusercontent.com *.githubusercontent.com; manifest-src 'self'; media-src github.com user-images.githubusercontent.com 
secured-user-images.githubusercontent.com private-user-images.githubusercontent.com github-production-user-asset-6210df.s3.amazonaws.com gist.github.com github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; upgrade-insecure-requests; worker-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/\", 'content-type': 'text/html; charset=utf-8', 'date': 'Sun, 05 Apr 2026 11:49:36 GMT', 'etag': 'W/\"f79c89ccf6deeb8bf72f05468daae684\"', 'referrer-policy': 'no-referrer-when-downgrade', 'server': 'github.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', 'vary': 'X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, X-Requested-With, Sec-Fetch-Site,Accept-Encoding, Accept, X-Requested-With', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-request-id': '80C1:2E367D:125F734:18F58BA:69D24C4F', 'x-xss-protection': '0'}, 'status': 200}"
+        }


The recorded http tool result stores a large HTML body plus many volatile headers (date, x-github-request-id, etc.). This tends to bloat the repo and causes frequent fixture churn on re-recording without improving test coverage. Consider redacting/truncating HTTP bodies and dropping non-essential headers in the recording layer (or post-process fixtures) so replay remains deterministic but compact and stable.

Acknowledged — these are inside recorded tool output (what the LLM saw). Scrubbing would make replay diverge from reality. Will consider path normalization / HTTP body trimming in the recorder as a follow-up.
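
For the HTTP side of that follow-up, trimming could amount to dropping volatile headers and capping the body length. This is a hypothetical sketch: the two header names come from the review comment, while the function names and the map-based representation of headers are assumptions about a recorder that the PR does not show.

```rust
use std::collections::BTreeMap;

/// Headers the review comment calls out as volatile between recordings.
const VOLATILE_HEADERS: [&str; 2] = ["date", "x-github-request-id"];

/// Remove volatile headers in place, matching names case-insensitively.
fn drop_volatile_headers(headers: &mut BTreeMap<String, String>) {
    headers.retain(|name, _| {
        !VOLATILE_HEADERS.contains(&name.to_ascii_lowercase().as_str())
    });
}

/// Keep at most `max_chars` characters of a response body.
/// Counting characters rather than bytes avoids splitting a code point.
fn truncate_body(body: &str, max_chars: usize) -> String {
    body.chars().take(max_chars).collect()
}
```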

- Fix UTF-8 unsafe string truncation in session log (use char_indices to find safe boundary instead of byte-index slicing)
- Remove forced auto_approve_tools(true) from LiveTestHarness build_live; let Config::from_env() drive it, with per-test override via new with_auto_approve_tools() builder method
- Apply engine_v2 builder override in TestRig's config-override branch so with_engine_v2() is not silently ignored when with_config() is used
- Remove unused timeout field and with_timeout() from LiveTestHarnessBuilder
- Tighten judge_response parsing to require strict PASS:/FAIL: prefix; anything else is treated as a failure with diagnostic message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ilblackdragon and others added 2 commits April 5, 2026 20:44

Copilot AI review requested due to automatic review settings April 5, 2026 11:59

github-actions bot added the labels size: XL (500+ changed lines), risk: low (changes to docs, tests, or low-risk modules), and contributor: core (20+ merged PRs) on Apr 5, 2026

Copilot started reviewing on behalf of ilblackdragon April 5, 2026 12:00

gemini-code-assist bot reviewed Apr 5, 2026

Copilot AI reviewed Apr 5, 2026

ilblackdragon merged commit 1c2d2f2 into staging Apr 5, 2026
14 checks passed

ilblackdragon deleted the feat/live-replay-test-harness branch April 5, 2026 13:03

ironclaw-ci bot mentioned this pull request Apr 5, 2026

chore: promote staging to staging-promote/733678dd-23996777140 (2026-04-05 13:23 UTC) #2044

Merged

github-actions bot mentioned this pull request Apr 6, 2026

🦞 OpenClaw Ecosystem Daily 2026-04-06 gsscsd/big_model_radar#142

Open

ilblackdragon added a commit that referenced this pull request Apr 7, 2026

Merge origin/staging to pick up #2039 live test harness

6f59195

This was referenced Apr 10, 2026

chore: promote staging to staging-promote/4c9a985b-23931806540 (2026-04-03 05:32 UTC) #1953

Merged

chore: promote staging to staging-promote/42623ed1-23780941831 (2026-04-01 23:10 UTC) #1893

Merged

ironclaw-ci bot mentioned this pull request Apr 10, 2026

chore: release #2075

Merged

ironclaw-ci bot mentioned this pull request Apr 18, 2026

chore: release #2606

Open
