feat(test): dual-mode live/replay test harness with LLM judge #2039
ilblackdragon merged 3 commits into staging
Conversation
Add a general-purpose test infrastructure for running E2E tests in two modes:
- Live mode (IRONCLAW_LIVE_TEST=1): real LLM calls with real tools, records traces to disk for future replay
- Replay mode (default): loads saved trace fixtures, deterministic, no API keys

The harness uses Config::from_env() in live mode so the test agent mirrors the real binary's behavior (engine_v2, allow_local_tools, approval gates). Includes an LLM judge for semantic verification of non-deterministic output, and saves human-readable session logs alongside trace fixtures for inspection and diffing between live and replay runs.

First test case: zizmor security scanner against ironclaw's own workflows.

New files:
- tests/support/live_harness.rs: LiveTestHarness, builder, LLM judge
- tests/e2e_live.rs: zizmor_scan test
- tests/fixtures/llm_traces/live/: recorded trace + session log

TestRigBuilder additions:
- with_http_interceptor() for injecting RecordingHttpInterceptor
- with_config() for real-binary config parity (respects allow_local_tools, engine_v2 from env instead of forcing test defaults)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
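The mode switch described in this commit message can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `TestMode` enum and the `mode_from_value`, `current_mode`, and `trace_path` helpers are hypothetical names, and treating only the exact value `1` as "live" is an assumption. The fixture path layout follows the `tests/fixtures/llm_traces/live/{test_name}.json` convention quoted in the builder's doc comment.

```rust
use std::path::{Path, PathBuf};

/// Which mode a dual-mode test runs in (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TestMode {
    /// Real LLM and tool calls; traces are recorded to disk.
    Live,
    /// Saved trace fixtures are replayed; no API keys needed.
    Replay,
}

/// Decide the mode from the raw value of `IRONCLAW_LIVE_TEST`.
/// Assumption: only the exact value "1" enables live mode.
pub fn mode_from_value(value: Option<&str>) -> TestMode {
    match value {
        Some("1") => TestMode::Live,
        _ => TestMode::Replay,
    }
}

/// Read the environment variable and pick the mode; replay is the default.
pub fn current_mode() -> TestMode {
    mode_from_value(std::env::var("IRONCLAW_LIVE_TEST").ok().as_deref())
}

/// Fixture path for a given test name.
pub fn trace_path(test_name: &str) -> PathBuf {
    Path::new("tests/fixtures/llm_traces/live").join(format!("{test_name}.json"))
}
```

Keeping the decision in a pure function (`mode_from_value`) makes it testable without mutating process-wide environment state.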
…ness

Add zizmor_scan_v2 test that exercises the same scenario through engine v2. Documents the current v2 limitation: the auto_approve_tools config flag is not honored by EffectBridgeAdapter. It only checks the per-session "always" set, so shell calls pause at the approval gate.

Also:
- Add with_engine_v2() to LiveTestHarnessBuilder for config override
- Refactor v1 test to use shared run_zizmor_scan() helper
- V2 test has relaxed assertions matching current v2 behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces a dual-mode E2E test harness that supports both live LLM execution with recording and deterministic replay from saved traces. It includes new tests for zizmor security scans and updates the TestRig to better mirror real binary behavior by allowing configuration overrides. Feedback focuses on ensuring UTF-8 safety during string truncation in logs and correctly respecting tool approval configurations to maintain parity with the production environment.
| let short = if preview.len() > 200 { | ||
| format!("{}…", &preview[..200]) | ||
| } else { | ||
| preview | ||
| }; |
The string truncation logic here (&preview[..200]) is not safe for UTF-8 strings. If a multi-byte character happens to cross the 200-byte boundary, this will cause a panic. To fix this, you should use a character-boundary-aware truncation method. The crate already has a utility for this: ironclaw::agent::agentic_loop::truncate_for_preview. Using it will prevent potential panics and ensure the log output is always valid.
let short = ironclaw::agent::agentic_loop::truncate_for_preview(&preview, 200);
References
- When truncating a UTF-8 string, use character-aware methods (e.g., char_indices() or iterating over chars()) to avoid panics caused by slicing in the middle of a multi-byte character.
Fixed in f7ceb6e — replaced byte-index slicing with char_indices() to find a safe UTF-8 boundary before truncating.
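The `char_indices()` approach mentioned in the reply can be sketched like this. It is an illustrative stand-in, not the code from f7ceb6e: the function name `truncate_preview` is hypothetical, and the crate's own `truncate_for_preview` may have a different signature and behavior.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a character,
/// appending an ellipsis when anything was cut.
/// Hypothetical helper; not the crate's `truncate_for_preview`.
pub fn truncate_preview(s: &str, max_bytes: usize) -> String {
    if s.len() <= max_bytes {
        return s.to_string();
    }
    // Largest character start offset that does not exceed the byte budget.
    // Every offset yielded by `char_indices` is a valid slice boundary.
    let cut = s
        .char_indices()
        .map(|(idx, _)| idx)
        .take_while(|&idx| idx <= max_bytes)
        .last()
        .unwrap_or(0);
    format!("{}…", &s[..cut])
}
```

With `"héllo"` and a budget of 2 bytes, a raw `&s[..2]` would panic because byte 2 falls inside `é`; the sketch backs off to byte 1 instead.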
| .with_llm(llm) | ||
| .with_http_interceptor(http_interceptor) | ||
| .with_max_tool_iterations(self.max_tool_iterations) | ||
| .with_auto_approve_tools(true) |
Forcing auto_approve_tools to true here overrides the value from the user's configuration loaded via Config::from_env(). This prevents the live test harness from accurately mirroring the real binary's behavior regarding tool approval gates, which contradicts one of the goals of this harness. This makes it difficult to test scenarios involving approvals, such as the one documented in zizmor_scan_v2. This line should be removed to allow the configuration from the environment to be respected. A corresponding change is needed in tests/support/test_rig.rs to change the default for auto_approve_tools to None.
Fixed in f7ceb6e — removed forced .with_auto_approve_tools(true) from build_live(). The harness now respects Config::from_env() by default. Added .with_auto_approve_tools() to the builder so tests can opt in explicitly.
| config_override: None, | ||
| max_tool_iterations: 10, | ||
| injection_check: false, | ||
| auto_approve_tools: Some(true), |
The default for auto_approve_tools should be None instead of Some(true). When with_config() is used in the live test harness, this will ensure that the value from the loaded configuration is used by default, rather than being overridden. This change is necessary for the live test harness to correctly reflect the real binary's approval gate behavior.
| auto_approve_tools: Some(true), | |
| auto_approve_tools: None, |
The default Some(true) is kept for backward compat with all existing tests. When with_config() is used, the live harness now controls auto_approve via its own builder method, which applies the override to the config before TestRig sees it.
Pull request overview
This PR introduces a dual-mode E2E test harness that can either (1) run with live LLM/tool calls while recording traces, or (2) deterministically replay from committed trace fixtures (no API keys), and adds initial zizmor-based live/replay E2E scenarios.
Changes:
- Extend `TestRigBuilder` to support real-binary config parity (`with_config`) and direct HTTP interceptor injection (`with_http_interceptor`).
- Add `LiveTestHarness` to orchestrate live recording vs replay runs, plus an optional LLM-based judge in live mode.
- Add an ignored `e2e_live` integration test and commit initial live/replay trace + log fixtures.
Reviewed changes
Copilot reviewed 6 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/support/test_rig.rs | Adds config override + HTTP interceptor override paths to align test rigs with real binary behavior and enable recording/replay. |
| tests/support/mod.rs | Exposes the new live_harness module behind libsql. |
| tests/support/live_harness.rs | Implements the dual-mode harness and a simple “judge” helper. |
| tests/e2e_live.rs | Adds ignored live/replay E2E tests for a zizmor scan (v1 + v2 variant). |
| tests/fixtures/llm_traces/live/zizmor_scan.json | Recorded trace fixture for replay. |
| tests/fixtures/llm_traces/live/zizmor_scan.log | Session log fixture (currently labeled Replay). |
| tests/fixtures/llm_traces/live/zizmor_scan_v2.json | Recorded trace fixture for engine v2 scenario (includes HTTP capture). |
| tests/fixtures/llm_traces/live/zizmor_scan_v2.log | Session log for engine v2 scenario. |
| tests/fixtures/llm_traces/live/.gitkeep | Keeps the live fixture directory present in git. |
Comments suppressed due to low confidence (1)
tests/support/test_rig.rs:598
`build(self)` destructures `self` into locals, but later in this function it still references `self.keep_bootstrap` (see around line ~897). After destructuring, `self` is moved, so this will not compile. Use the destructured `keep_bootstrap` local consistently (or avoid destructuring / keep `self` intact).
// Destructure self up front to avoid partial-move issues.
let TestRigBuilder {
trace,
llm,
config_override,
max_tool_iterations,
injection_check,
auto_approve_tools,
enable_skills,
enable_routines,
http_exchanges: explicit_http_exchanges,
http_interceptor_override,
extra_tools,
wasm_tools,
keep_bootstrap,
engine_v2,
} = self;
| // AppBuilder may re-resolve config from env/TOML and override test defaults. | ||
| // Force test-rig agent flags to the requested deterministic values. | ||
| components.config.agent.auto_approve_tools = auto_approve_tools.unwrap_or(true); | ||
| components.config.agent.allow_local_tools = true; | ||
| components.config.agent.engine_v2 = engine_v2; | ||
| // When a config override was provided, preserve its agent settings to mirror | ||
| // the real binary. Otherwise force deterministic test defaults. | ||
| if has_config_override { | ||
| if let Some(v) = auto_approve_tools { | ||
| components.config.agent.auto_approve_tools = v; | ||
| } | ||
| // allow_local_tools and engine_v2 come from the provided config. | ||
| } else { | ||
| components.config.agent.auto_approve_tools = auto_approve_tools.unwrap_or(true); | ||
| components.config.agent.allow_local_tools = true; | ||
| components.config.agent.engine_v2 = engine_v2; | ||
| } |
When config_override is present, this branch ignores the builder’s with_engine_v2() flag (engine_v2 local) and relies solely on the provided config. This makes with_engine_v2() a no-op in config-parity mode, which is surprising for callers; consider applying engine_v2 when explicitly set, or document that with_config() takes precedence for engine selection.
Fixed in f7ceb6e — the config-override branch now applies the builder's engine_v2 flag when explicitly set.
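The precedence rules discussed in this thread can be summarized in a small sketch. It assumes the builder tracks both overrides as `Option<bool>` (the diff above shows `engine_v2` as a plain value in `TestRigBuilder`, so the `Option` here is an assumption), and `AgentFlags` and `resolve_flags` are hypothetical names. The default of `false` for `engine_v2` without a config override is also assumed.

```rust
/// The three agent flags discussed in the review (hypothetical struct).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AgentFlags {
    pub auto_approve_tools: bool,
    pub allow_local_tools: bool,
    pub engine_v2: bool,
}

/// Resolve the final flags.
/// `from_config` is `Some` when a config override was supplied (config-parity mode).
/// `auto_approve` and `engine_v2` are builder overrides; `None` means "not set".
pub fn resolve_flags(
    from_config: Option<AgentFlags>,
    auto_approve: Option<bool>,
    engine_v2: Option<bool>,
) -> AgentFlags {
    match from_config {
        // Config parity: keep the config's values unless a builder
        // override was set explicitly.
        Some(cfg) => AgentFlags {
            auto_approve_tools: auto_approve.unwrap_or(cfg.auto_approve_tools),
            allow_local_tools: cfg.allow_local_tools,
            engine_v2: engine_v2.unwrap_or(cfg.engine_v2),
        },
        // No config override: force deterministic test defaults.
        None => AgentFlags {
            auto_approve_tools: auto_approve.unwrap_or(true),
            allow_local_tools: true,
            engine_v2: engine_v2.unwrap_or(false), // assumed default
        },
    }
}
```

Modeling an unset override as `None` is what lets an explicit `with_engine_v2()` call win over the config without turning an unset flag into a silent override.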
| // - auto_approve_tools controls approval gates | ||
| let rig = TestRigBuilder::new() | ||
| .with_config(config) | ||
| .with_llm(llm) | ||
| .with_http_interceptor(http_interceptor) | ||
| .with_max_tool_iterations(self.max_tool_iterations) | ||
| .with_auto_approve_tools(true) |
build_live claims to mirror real-binary behavior (including approval gates), but it unconditionally forces .with_auto_approve_tools(true), overriding whatever Config::from_env() resolved. To keep parity, consider defaulting to the config value (don’t call with_auto_approve_tools here) and add an explicit builder knob to override when tests want to bypass approvals.
| // - auto_approve_tools controls approval gates | |
| let rig = TestRigBuilder::new() | |
| .with_config(config) | |
| .with_llm(llm) | |
| .with_http_interceptor(http_interceptor) | |
| .with_max_tool_iterations(self.max_tool_iterations) | |
| .with_auto_approve_tools(true) | |
| // - auto_approve_tools remains driven by the resolved config | |
| let rig = TestRigBuilder::new() | |
| .with_config(config) | |
| .with_llm(llm) | |
| .with_http_interceptor(http_interceptor) | |
| .with_max_tool_iterations(self.max_tool_iterations) |
Fixed in f7ceb6e — removed forced .with_auto_approve_tools(true) from build_live(). The harness now respects Config::from_env() by default. Added .with_auto_approve_tools() to the builder so tests can opt in explicitly.
| /// Builder for constructing a `LiveTestHarness`. | ||
| pub struct LiveTestHarnessBuilder { | ||
| test_name: String, | ||
| max_tool_iterations: usize, | ||
| timeout: Duration, | ||
| engine_v2: Option<bool>, | ||
| } | ||
|
|
||
| impl LiveTestHarnessBuilder { | ||
| /// Create a new builder for a test with the given name. | ||
| /// | ||
| /// The name determines the trace fixture filename: | ||
| /// `tests/fixtures/llm_traces/live/{test_name}.json` | ||
| pub fn new(test_name: impl Into<String>) -> Self { | ||
| Self { | ||
| test_name: test_name.into(), | ||
| max_tool_iterations: 30, | ||
| timeout: Duration::from_secs(120), | ||
| engine_v2: None, | ||
| } | ||
| } | ||
|
|
||
| /// Set the maximum number of tool iterations per agentic loop invocation. | ||
| pub fn with_max_tool_iterations(mut self, n: usize) -> Self { | ||
| self.max_tool_iterations = n; | ||
| self | ||
| } | ||
|
|
||
| /// Set the default timeout for the test (used as a hint; callers still | ||
| /// pass timeout to `wait_for_responses`). | ||
| pub fn with_timeout(mut self, timeout: Duration) -> Self { | ||
| self.timeout = timeout; | ||
| self | ||
| } |
LiveTestHarnessBuilder stores timeout, but it’s never read/used when building the rig or waiting for responses. This can mislead callers into thinking it changes harness behavior; either wire it into the harness API (e.g., expose as default for wait_for_responses / internal timeouts) or remove it.
Fixed in f7ceb6e — removed the timeout field and with_timeout() method.
| let pass = trimmed.starts_with("PASS"); | ||
| JudgeVerdict { | ||
| pass, | ||
| reasoning: trimmed.to_string(), | ||
| } |
judge_response treats any output starting with PASS as a pass, even if the judge doesn’t follow the required PASS: / FAIL: one-line format. This makes false positives more likely (e.g., PASSING ... or multi-line output). Consider parsing strictly for PASS: / FAIL: (case-insensitive) and treating anything else as a failure with a helpful reasoning message.
| let pass = trimmed.starts_with("PASS"); | |
| JudgeVerdict { | |
| pass, | |
| reasoning: trimmed.to_string(), | |
| } | |
| if trimmed.is_empty() { | |
| return JudgeVerdict { | |
| pass: false, | |
| reasoning: "Judge returned empty output; expected exactly one line starting with PASS: or FAIL:".to_string(), | |
| }; | |
| } | |
| if trimmed.lines().count() != 1 { | |
| return JudgeVerdict { | |
| pass: false, | |
| reasoning: format!( | |
| "Judge returned invalid multi-line output; expected exactly one line starting with PASS: or FAIL:. Raw output: {}", | |
| trimmed | |
| ), | |
| }; | |
| } | |
| let line = trimmed; | |
| if let Some(reasoning) = line | |
| .get(..5) | |
| .filter(|prefix| prefix.eq_ignore_ascii_case("PASS:")) | |
| .map(|_| line[5..].trim()) | |
| { | |
| JudgeVerdict { | |
| pass: true, | |
| reasoning: reasoning.to_string(), | |
| } | |
| } else if let Some(reasoning) = line | |
| .get(..5) | |
| .filter(|prefix| prefix.eq_ignore_ascii_case("FAIL:")) | |
| .map(|_| line[5..].trim()) | |
| { | |
| JudgeVerdict { | |
| pass: false, | |
| reasoning: reasoning.to_string(), | |
| } | |
| } else { | |
| JudgeVerdict { | |
| pass: false, | |
| reasoning: format!( | |
| "Judge returned invalid output; expected exactly one line starting with PASS: or FAIL:. Raw output: {}", | |
| trimmed | |
| ), | |
| } | |
| } |
Fixed in f7ceb6e — now parses strictly for PASS: / FAIL: prefix via strip_prefix(). Any other format is treated as failure with a diagnostic message.
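A strict `strip_prefix()` parser along the lines described in the reply might look like the sketch below. The actual code in f7ceb6e is not shown in this thread, so this is an assumption-laden illustration: `parse_verdict` is a hypothetical name, the match is case-sensitive (unlike the case-insensitive suggestion above), and the diagnostic wording is invented.

```rust
/// Outcome of the LLM judge, mirroring the `JudgeVerdict` fields in the diff.
#[derive(Debug, PartialEq, Eq)]
pub struct JudgeVerdict {
    pub pass: bool,
    pub reasoning: String,
}

/// Parse raw judge output, accepting only a `PASS:` or `FAIL:` prefix.
/// Anything else is a failure with a diagnostic message.
pub fn parse_verdict(raw: &str) -> JudgeVerdict {
    let trimmed = raw.trim();
    if let Some(rest) = trimmed.strip_prefix("PASS:") {
        JudgeVerdict { pass: true, reasoning: rest.trim().to_string() }
    } else if let Some(rest) = trimmed.strip_prefix("FAIL:") {
        JudgeVerdict { pass: false, reasoning: rest.trim().to_string() }
    } else {
        JudgeVerdict {
            pass: false,
            reasoning: format!(
                "Judge returned invalid output; expected a PASS: or FAIL: prefix. Raw output: {trimmed}"
            ),
        }
    }
}
```

Requiring the colon is what rejects outputs such as `PASSING ...`, which the original `starts_with("PASS")` check would have accepted.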
| "min_message_count": 6 | ||
| }, | ||
| "response": { | ||
| "type": "text", | ||
| "content": "\n\nGreat! zizmor is already installed at `/Users/coder/.cargo/bin/zizmor`. \n\nTo run it, I need to know what you'd like to audit:\n1. **Current directory** - audit the GitHub Actions workflows in this repo\n2. **Specific repository** - audit another GitHub repo (e.g., `github:user/repo`)\n3. **Local workflow files** - audit specific `.yml` files\n\nWhich would you like to do? Or if you're in a repo with GitHub Actions workflows, I can just run it on the current directory.\n[Called tool `list_dir` with arguments: {\"path\":\".\"}]", | ||
| "input_tokens": 14778, | ||
| "output_tokens": 188 | ||
| }, | ||
| "expected_tool_results": [ | ||
| { | ||
| "tool_call_id": "chatcmpl-tool-8e89a891c7087c33", | ||
| "name": "shell", | ||
| "content": "<tool_output name=\"shell\">\n{\n \"exit_code\": 0,\n \"output\": \"/Users/coder/.cargo/bin/zizmor\\n\",\n \"sandboxed\": false,\n \"success\": true\n}\n</tool_output>" | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "request_hint": { | ||
| "last_user_message_contains": "can we run https://github.com/zizmorcore/zizmor", | ||
| "min_message_count": 8 | ||
| }, | ||
| "response": { | ||
| "type": "text", | ||
| "content": "\n\nI can see there's a `.github/` directory which likely contains GitHub Actions workflows. Let me check what's in there and then run zizmor on this repository.\n[Called tool `list_dir` with arguments: {\"path\":\".github\"}]", | ||
| "input_tokens": 16007, | ||
| "output_tokens": 86 | ||
| }, | ||
| "expected_tool_results": [ | ||
| { | ||
| "tool_call_id": "00000001B", | ||
| "name": "list_dir", | ||
| "content": "<tool_output name=\"list_dir\">\n{\n \"count\": 70,\n \"entries\": [\n \".claude/\",\n \".git/\",\n \".githooks/\",\n \".github/\",\n \"channels-src/\",\n \"crates/\",\n \"deploy/\",\n \"docker/\",\n \"docs/\",\n \"fuzz/\",\n \"migrations/\",\n \"registry/\",\n \"scripts/\",\n \"skills/\",\n \"src/\",\n \"target/\",\n \"tests/\",\n \"tools-src/\",\n \"wit/\",\n \"wix/\",\n \".dockerignore (67B)\",\n \".env.example (12.5KB)\",\n \".gitattributes (50B)\",\n \".gitignore (593B)\",\n \"AGENTS.md (5.4KB)\",\n \"CHANGELOG.md (69.2KB)\",\n \"CLAUDE.md (14.7KB)\",\n \"CONTRIBUTING.md (5.3KB)\",\n \"COVERAGE_PLAN.md (32.2KB)\",\n \"Cargo.lock (227.0KB)\",\n \"Cargo.toml (9.1KB)\",\n \"Dockerfile (2.3KB)\",\n \"Dockerfile.test (1.5KB)\",\n \"Dockerfile.worker (2.3KB)\",\n \"FEATURE_PARITY.md (30.5KB)\",\n \"LICENSE-APACHE (10.5KB)\",\n \"LICENSE-MIT (1.0KB)\",\n \"README.ja.md (16.7KB)\",\n \"README.md (13.7KB)\",\n \"README.ru.md (20.4KB)\",\n \"README.zh-CN.md (13.3KB)\",\n \"build.rs (8.5KB)\",\n \"clippy.toml (537B)\",\n \"codecov.yml (219B)\",\n \"deny.toml (1.8KB)\",\n \"docker-compose.yml (524B)\",\n \"engine_trace_20260405T100717.json (20.9KB)\",\n \"ironclaw.bash (100.3KB)\",\n \"ironclaw.fish (69.9KB)\",\n \"ironclaw.png (266.7KB)\",\n \"ironclaw.zsh (84.0KB)\",\n \"providers.json (12.8KB)\",\n \"release-plz.toml (39B)\",\n \"trace_20260309T172815.json (8.8KB)\",\n \"trace_20260310T075912.json (4.8KB)\",\n \"trace_20260311T002753.json (4.8KB)\",\n \"trace_20260311T130928.json (4.8KB)\",\n \"trace_20260311T141121.json (9.1KB)\",\n \"trace_20260311T141138.json (11.7KB)\",\n \"trace_20260311T141430.json (9.1KB)\",\n \"trace_20260311T181338.json (11.8KB)\",\n \"trace_20260311T183957.json (9.2KB)\",\n \"trace_20260311T185836.json (9.2KB)\",\n \"trace_20260311T191112.json (95.2KB)\",\n \"trace_20260311T191641.json (156.7KB)\",\n \"trace_20260311T194203.json (9.8KB)\",\n \"trace_20260311T194240.json (272.4KB)\",\n \"trace_20260311T214436.json (4.7KB)\",\n 
\"trace_20260311T214457.json (133.1KB)\",\n \"trace_20260311T221512.json (7.6KB)\"\n ],\n \"path\": \"/Users/coder/ironclaw8\",\n \"truncated\": false\n}\n</tool_output>" | ||
| } | ||
| ] |
This trace fixture embeds absolute local paths (e.g. /Users/coder/ironclaw8 and /Users/coder/.cargo/bin/zizmor) inside recorded tool outputs. That leaks machine-specific info and makes fixtures harder to share/re-record across environments. Consider scrubbing/normalizing paths in recorded fixtures (or in the recorder) before committing (e.g., replace with relative paths or a placeholder root).
Acknowledged — these are inside recorded tool output (what the LLM saw). Scrubbing would make replay diverge from reality. Will consider path normalization / HTTP body trimming in the recorder as a follow-up.
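As a rough idea of what the path-normalization follow-up could look like, here is a sketch that rewrites machine-specific roots to placeholders. Nothing like this exists in the PR; `scrub_paths` and the placeholder strings are hypothetical, and whether scrubbing should happen in the recorder or as a post-processing step is left open.

```rust
/// Replace machine-specific path roots in recorded text with placeholders.
/// `roots` pairs each absolute root with its replacement.
/// Hypothetical helper; not part of the recorder in this PR.
pub fn scrub_paths(text: &str, roots: &[(&str, &str)]) -> String {
    // Apply the longest root first so a repo path is not partially
    // rewritten by the shorter home-directory rule.
    let mut ordered: Vec<&(&str, &str)> =
        roots.iter().filter(|pair| !pair.0.is_empty()).collect();
    ordered.sort_by(|a, b| b.0.len().cmp(&a.0.len()));

    let mut out = text.to_string();
    for pair in ordered {
        out = out.replace(pair.0, pair.1);
    }
    out
}
```

A caller would pass the roots it knows about, for example the repository checkout and the home directory, at record time.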
| "content": "{'count': 0, 'results': [], 'searched_online': False}" | ||
| }, | ||
| { | ||
| "tool_call_id": "chatcmpl-tool-a3b63943e42a0ecb", | ||
| "name": "http", | ||
| "content": "{'body': '# Search code, repositories, users, issues, pull requests...\\n\\n\\n\\n\\n\\n[Sign in](/login?return_to=https%3A%2F%2Fgithub.com%2Fzizmorcore%2Fzizmor)\\n\\n/;ref_cta:Sign up;ref_loc:header logged out\"}\" data-hydro-click=\"{\"event_type\":\"authentication.click\",\"payload\":{\"location_in_page\":\"site header menu\",\"repository_id\":null,\"auth_type\":\"SIGN_UP\",\"originating_url\":\"https://github.com/zizmorcore/zizmor\",\"user_id\":null}}\" data-hydro-click-hmac=\"8caabd8b90225b5d5573f566ac422fd36398eba76666d09587f4d4f6f6450bb2\" href=\"/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E%2F%3Crepo-name%3E&source=header-repo&source_repo=zizmorcore%2Fzizmor\">\\n Sign up\\n\\nAppearance settings\\n', 'headers': {'accept-ranges': 'bytes', 'cache-control': 'max-age=0, private, must-revalidate', 'content-security-policy': \"default-src 'none'; base-uri 'self'; child-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com *.rel.tunnels.api.visualstudio.com wss://*.rel.tunnels.api.visualstudio.com github.githubassets.com objects-origin.githubusercontent.com copilot-proxy.githubusercontent.com proxy.individual.githubcopilot.com proxy.business.githubcopilot.com proxy.enterprise.githubcopilot.com *.actions.githubusercontent.com wss://*.actions.githubusercontent.com productionresultssa0.blob.core.windows.net productionresultssa1.blob.core.windows.net productionresultssa2.blob.core.windows.net productionresultssa3.blob.core.windows.net productionresultssa4.blob.core.windows.net productionresultssa5.blob.core.windows.net 
productionresultssa6.blob.core.windows.net productionresultssa7.blob.core.windows.net productionresultssa8.blob.core.windows.net productionresultssa9.blob.core.windows.net productionresultssa10.blob.core.windows.net productionresultssa11.blob.core.windows.net productionresultssa12.blob.core.windows.net productionresultssa13.blob.core.windows.net productionresultssa14.blob.core.windows.net productionresultssa15.blob.core.windows.net productionresultssa16.blob.core.windows.net productionresultssa17.blob.core.windows.net productionresultssa18.blob.core.windows.net productionresultssa19.blob.core.windows.net github-production-repository-image-32fea6.s3.amazonaws.com github-production-release-asset-2e65be.s3.amazonaws.com insights.github.com wss://alive.github.com wss://alive-staging.github.com api.githubcopilot.com api.individual.githubcopilot.com api.business.githubcopilot.com api.enterprise.githubcopilot.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com copilot-workspace.githubnext.com objects-origin.githubusercontent.com; frame-ancestors 'none'; frame-src viewscreen.githubusercontent.com notebooks.githubusercontent.com; img-src 'self' data: blob: github.githubassets.com media.githubusercontent.com camo.githubusercontent.com identicons.github.com avatars.githubusercontent.com private-avatars.githubusercontent.com github-cloud.s3.amazonaws.com objects.githubusercontent.com release-assets.githubusercontent.com secured-user-images.githubusercontent.com user-images.githubusercontent.com private-user-images.githubusercontent.com opengraph.githubassets.com marketplace-screenshots.githubusercontent.com copilotprodattachments.blob.core.windows.net/github-production-copilot-attachments/ github-production-user-asset-6210df.s3.amazonaws.com customer-stories-feed.github.com spotlights-feed.github.com objects-origin.githubusercontent.com *.githubusercontent.com; manifest-src 'self'; media-src github.com user-images.githubusercontent.com 
secured-user-images.githubusercontent.com private-user-images.githubusercontent.com github-production-user-asset-6210df.s3.amazonaws.com gist.github.com github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; upgrade-insecure-requests; worker-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/\", 'content-type': 'text/html; charset=utf-8', 'date': 'Sun, 05 Apr 2026 11:49:36 GMT', 'etag': 'W/\"f79c89ccf6deeb8bf72f05468daae684\"', 'referrer-policy': 'no-referrer-when-downgrade', 'server': 'github.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', 'vary': 'X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, X-Requested-With, Sec-Fetch-Site,Accept-Encoding, Accept, X-Requested-With', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-request-id': '80C1:2E367D:125F734:18F58BA:69D24C4F', 'x-xss-protection': '0'}, 'status': 200}" | ||
| } |
The recorded http tool result stores a large HTML body plus many volatile headers (date, x-github-request-id, etc.). This tends to bloat the repo and causes frequent fixture churn on re-recording without improving test coverage. Consider redacting/truncating HTTP bodies and dropping non-essential headers in the recording layer (or post-process fixtures) so replay remains deterministic but compact and stable.
Acknowledged — these are inside recorded tool output (what the LLM saw). Scrubbing would make replay diverge from reality. Will consider path normalization / HTTP body trimming in the recorder as a follow-up.
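For the header-churn concern, one possible shape of the follow-up is an allow-list filter applied before a response is written to the fixture. This is only a sketch: `stable_headers` and the `KEEP` list are hypothetical, and which headers count as stable is an assumption that would need to be decided per test.

```rust
/// Header names assumed stable enough to keep in a fixture (illustrative list).
const KEEP: [&str; 2] = ["content-type", "cache-control"];

/// Keep only allow-listed headers (case-insensitive name match) and sort
/// them so re-recording produces a stable order.
/// Hypothetical helper; not part of the recorder in this PR.
pub fn stable_headers(headers: &[(String, String)]) -> Vec<(String, String)> {
    let mut kept: Vec<(String, String)> = headers
        .iter()
        .filter(|pair| KEEP.iter().any(|k| k.eq_ignore_ascii_case(&pair.0)))
        .cloned()
        .collect();
    kept.sort();
    kept
}
```

Volatile values such as `date` and `x-github-request-id` would be dropped, which removes the main source of diff noise on re-recording; as the reply notes, the trade-off is that the replayed tool output no longer matches exactly what the LLM saw live.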
- Fix UTF-8 unsafe string truncation in session log (use char_indices to find safe boundary instead of byte-index slicing)
- Remove forced auto_approve_tools(true) from LiveTestHarness build_live; let Config::from_env() drive it, with per-test override via new with_auto_approve_tools() builder method
- Apply engine_v2 builder override in TestRig's config-override branch so with_engine_v2() is not silently ignored when with_config() is used
- Remove unused timeout field and with_timeout() from LiveTestHarnessBuilder
- Tighten judge_response parsing to require strict PASS:/FAIL: prefix; anything else is treated as a failure with diagnostic message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- … LiveTestHarness) for running E2E tests with live LLM calls (recording traces) or replay from saved traces (deterministic, no API keys)
- TestRigBuilder gains with_config() for real-binary config parity and with_http_interceptor() for live recording support
- … EffectBridgeAdapter doesn't honor auto_approve_tools config; documents this as a known gap

How it works
- … Config::from_env() so ENGINE_V2, ALLOW_LOCAL_TOOLS, approval gates all match the real binary
- … .log) saved alongside trace fixtures (.json) for human inspection and live-vs-replay diffing

New files
- tests/support/live_harness.rs: LiveTestHarness, LiveTestHarnessBuilder, judge_response()
- tests/e2e_live.rs: zizmor_scan (v1) + zizmor_scan_v2 (engine v2)
- tests/fixtures/llm_traces/live/*.json
- tests/fixtures/llm_traces/live/*.log

Test plan
- cargo clippy --features libsql --tests: zero warnings
- … e2e_recorded_trace)
- IRONCLAW_LIVE_TEST=1 records trace + session log for v1 and v2

🤖 Generated with Claude Code