
feat(test): dual-mode live/replay test harness with LLM judge #2039

Merged
ilblackdragon merged 3 commits into staging from feat/live-replay-test-harness on Apr 5, 2026

Conversation

@ilblackdragon
Member

Summary

  • Add a general-purpose dual-mode test harness (LiveTestHarness) for running E2E tests with live LLM calls (recording traces) or replay from saved traces (deterministic, no API keys)
  • TestRigBuilder gains with_config() for real-binary config parity and with_http_interceptor() for live recording support
  • First test case: zizmor security scanner against ironclaw's own GitHub Actions workflows
  • Engine v2 variant reveals that EffectBridgeAdapter doesn't honor auto_approve_tools config — documents this as a known gap

How it works

# Record a trace (live LLM + real tools):
IRONCLAW_LIVE_TEST=1 cargo test --features libsql --test e2e_live -- --ignored

# Replay from saved trace (deterministic):
cargo test --features libsql --test e2e_live -- --ignored

  • Live mode loads the real config via Config::from_env(), so ENGINE_V2, ALLOW_LOCAL_TOOLS, and approval gates all match the real binary
  • Session logs (.log) saved alongside trace fixtures (.json) for human inspection and live-vs-replay diffing
  • LLM judge verifies response quality semantically in live mode
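
The mode switch described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `Mode` enum and the helpers `fixture_path`, `session_log_path`, `select_mode`, and `mode_from_env` are hypothetical names, and treating only the exact value `1` as "live" is an assumption. The fixture directory follows the paths listed in this PR.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Which way a test run should execute (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    /// Real LLM and tool calls; the trace is recorded to `trace_path`.
    Live { trace_path: PathBuf },
    /// Responses are served from the fixture at `trace_path`; no API keys needed.
    Replay { trace_path: PathBuf },
}

/// Trace fixture path for a test name, using the directory from this PR.
fn fixture_path(test_name: &str) -> PathBuf {
    PathBuf::from("tests/fixtures/llm_traces/live").join(format!("{test_name}.json"))
}

/// Session log stored next to the trace: same stem, `.log` extension.
fn session_log_path(trace_path: &Path) -> PathBuf {
    trace_path.with_extension("log")
}

/// Pure decision helper so the logic can be tested without touching the
/// process environment. Assumption: only the exact value "1" enables live mode.
fn select_mode(live_flag: Option<&str>, test_name: &str) -> Mode {
    let trace_path = fixture_path(test_name);
    match live_flag {
        Some("1") => Mode::Live { trace_path },
        _ => Mode::Replay { trace_path },
    }
}

/// Thin wrapper that reads IRONCLAW_LIVE_TEST from the real environment.
fn mode_from_env(test_name: &str) -> Mode {
    let flag = env::var("IRONCLAW_LIVE_TEST").ok();
    select_mode(flag.as_deref(), test_name)
}
```

Keeping the decision in a pure function and the environment read in a thin wrapper makes the replay default easy to test without mutating process-wide state.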

New files

| File | Purpose |
| --- | --- |
| tests/support/live_harness.rs | LiveTestHarness, LiveTestHarnessBuilder, judge_response() |
| tests/e2e_live.rs | zizmor_scan (v1) + zizmor_scan_v2 (engine v2) |
| tests/fixtures/llm_traces/live/*.json | Recorded traces |
| tests/fixtures/llm_traces/live/*.log | Session logs |

Test plan

  • cargo clippy --features libsql --tests — zero warnings
  • Existing recorded trace tests pass (e2e_recorded_trace)
  • IRONCLAW_LIVE_TEST=1 records trace + session log for v1 and v2
  • Replay mode loads trace and passes assertions
  • Replay session log can be diffed against live log (identical for deterministic replay)
  • V2 test documents approval gate gap (relaxed assertions)

🤖 Generated with Claude Code

ilblackdragon and others added 2 commits April 5, 2026 20:44
Add a general-purpose test infrastructure for running E2E tests in two modes:
- Live mode (IRONCLAW_LIVE_TEST=1): real LLM calls with real tools, records
  traces to disk for future replay
- Replay mode (default): loads saved trace fixtures, deterministic, no API keys

The harness uses Config::from_env() in live mode so the test agent mirrors
the real binary's behavior (engine_v2, allow_local_tools, approval gates).
Includes an LLM judge for semantic verification of non-deterministic output,
and saves human-readable session logs alongside trace fixtures for inspection
and diffing between live and replay runs.

First test case: zizmor security scanner against ironclaw's own workflows.

New files:
- tests/support/live_harness.rs — LiveTestHarness, builder, LLM judge
- tests/e2e_live.rs — zizmor_scan test
- tests/fixtures/llm_traces/live/ — recorded trace + session log

TestRigBuilder additions:
- with_http_interceptor() for injecting RecordingHttpInterceptor
- with_config() for real-binary config parity (respects allow_local_tools,
  engine_v2 from env instead of forcing test defaults)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ness

Add zizmor_scan_v2 test that exercises the same scenario through engine v2.
Documents the current v2 limitation: auto_approve_tools config flag is not
honored by EffectBridgeAdapter — it only checks the per-session "always"
set, so shell calls pause at the approval gate.

Also:
- Add with_engine_v2() to LiveTestHarnessBuilder for config override
- Refactor v1 test to use shared run_zizmor_scan() helper
- V2 test has relaxed assertions matching current v2 behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 11:59
@github-actions bot added the labels size: XL (500+ changed lines), risk: low (changes to docs, tests, or low-risk modules), and contributor: core (20+ merged PRs) on Apr 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a dual-mode E2E test harness that supports both live LLM execution with recording and deterministic replay from saved traces. It includes new tests for zizmor security scans and updates the TestRig to better mirror real binary behavior by allowing configuration overrides. Feedback focuses on ensuring UTF-8 safety during string truncation in logs and correctly respecting tool approval configurations to maintain parity with the production environment.

Comment on lines +169 to +173
let short = if preview.len() > 200 {
    format!("{}…", &preview[..200])
} else {
    preview
};
Contributor

high

The string truncation logic here (&preview[..200]) is not safe for UTF-8 strings. If a multi-byte character happens to cross the 200-byte boundary, this will cause a panic. To fix this, you should use a character-boundary-aware truncation method. The crate already has a utility for this: ironclaw::agent::agentic_loop::truncate_for_preview. Using it will prevent potential panics and ensure the log output is always valid.

                    let short = ironclaw::agent::agentic_loop::truncate_for_preview(&preview, 200);
References
  1. When truncating a UTF-8 string, use character-aware methods (e.g., char_indices() or iterating over chars()) to avoid panics caused by slicing in the middle of a multi-byte character.

Member Author

Fixed in f7ceb6e — replaced byte-index slicing with char_indices() to find a safe UTF-8 boundary before truncating.
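
For reference, a boundary-safe truncation along the lines described in this reply might look like the sketch below. It is not the code from f7ceb6e. The function name `truncate_preview` is hypothetical, and it only illustrates using `char_indices()` to find the last character start at or before the byte limit.

```rust
/// Truncate `s` to at most `max_bytes` bytes of original text without
/// splitting a multi-byte character, appending "…" when anything was cut.
fn truncate_preview(s: &str, max_bytes: usize) -> String {
    if s.len() <= max_bytes {
        return s.to_string();
    }
    // Largest character start offset that does not exceed the byte limit.
    // Offset 0 always qualifies, so the fallback is only a formality.
    let cut = s
        .char_indices()
        .map(|(idx, _)| idx)
        .take_while(|&idx| idx <= max_bytes)
        .last()
        .unwrap_or(0);
    // Slicing at a character start offset is always a valid UTF-8 boundary.
    format!("{}…", &s[..cut])
}
```

With the input `"héllo"` and a limit of 2 bytes, the `é` occupies bytes 1 and 2, so the cut falls back to offset 1 instead of panicking in the middle of the character.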

Comment thread tests/support/live_harness.rs Outdated
.with_llm(llm)
.with_http_interceptor(http_interceptor)
.with_max_tool_iterations(self.max_tool_iterations)
.with_auto_approve_tools(true)
Contributor

high

Forcing auto_approve_tools to true here overrides the value from the user's configuration loaded via Config::from_env(). This prevents the live test harness from accurately mirroring the real binary's behavior regarding tool approval gates, which contradicts one of the goals of this harness. This makes it difficult to test scenarios involving approvals, such as the one documented in zizmor_scan_v2. This line should be removed to allow the configuration from the environment to be respected. A corresponding change is needed in tests/support/test_rig.rs to change the default for auto_approve_tools to None.

Member Author

Fixed in f7ceb6e — removed forced .with_auto_approve_tools(true) from build_live(). The harness now respects Config::from_env() by default. Added .with_auto_approve_tools() to the builder so tests can opt in explicitly.

Comment thread tests/support/test_rig.rs
config_override: None,
max_tool_iterations: 10,
injection_check: false,
auto_approve_tools: Some(true),
Contributor

high

The default for auto_approve_tools should be None instead of Some(true). When with_config() is used in the live test harness, this will ensure that the value from the loaded configuration is used by default, rather than being overridden. This change is necessary for the live test harness to correctly reflect the real binary's approval gate behavior.

Suggested change
- auto_approve_tools: Some(true),
+ auto_approve_tools: None,

Member Author

The default Some(true) is kept for backward compat with all existing tests. When with_config() is used, the live harness now controls auto_approve via its own builder method, which applies the override to the config before TestRig sees it.

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a dual-mode E2E test harness that can either (1) run with live LLM/tool calls while recording traces, or (2) deterministically replay from committed trace fixtures (no API keys), and adds initial zizmor-based live/replay E2E scenarios.

Changes:

  • Extend TestRigBuilder to support real-binary config parity (with_config) and direct HTTP interceptor injection (with_http_interceptor).
  • Add LiveTestHarness to orchestrate live recording vs replay runs, plus an optional LLM-based judge in live mode.
  • Add an ignored e2e_live integration test and commit initial live/replay trace + log fixtures.

Reviewed changes

Copilot reviewed 6 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/support/test_rig.rs | Adds config override + HTTP interceptor override paths to align test rigs with real binary behavior and enable recording/replay. |
| tests/support/mod.rs | Exposes the new live_harness module behind libsql. |
| tests/support/live_harness.rs | Implements the dual-mode harness and a simple “judge” helper. |
| tests/e2e_live.rs | Adds ignored live/replay E2E tests for a zizmor scan (v1 + v2 variant). |
| tests/fixtures/llm_traces/live/zizmor_scan.json | Recorded trace fixture for replay. |
| tests/fixtures/llm_traces/live/zizmor_scan.log | Session log fixture (currently labeled Replay). |
| tests/fixtures/llm_traces/live/zizmor_scan_v2.json | Recorded trace fixture for engine v2 scenario (includes HTTP capture). |
| tests/fixtures/llm_traces/live/zizmor_scan_v2.log | Session log for engine v2 scenario. |
| tests/fixtures/llm_traces/live/.gitkeep | Keeps the live fixture directory present in git. |

Comments suppressed due to low confidence (1)

tests/support/test_rig.rs:598

  • build(self) destructures self into locals, but later in this function it still references self.keep_bootstrap (see around line ~897). After destructuring, self is moved, so this will not compile. Use the destructured keep_bootstrap local consistently (or avoid destructuring / keep self intact).
        // Destructure self up front to avoid partial-move issues.
        let TestRigBuilder {
            trace,
            llm,
            config_override,
            max_tool_iterations,
            injection_check,
            auto_approve_tools,
            enable_skills,
            enable_routines,
            http_exchanges: explicit_http_exchanges,
            http_interceptor_override,
            extra_tools,
            wasm_tools,
            keep_bootstrap,
            engine_v2,
        } = self;


Comment thread tests/support/test_rig.rs
Comment on lines 701 to +713
  // AppBuilder may re-resolve config from env/TOML and override test defaults.
- // Force test-rig agent flags to the requested deterministic values.
- components.config.agent.auto_approve_tools = auto_approve_tools.unwrap_or(true);
- components.config.agent.allow_local_tools = true;
- components.config.agent.engine_v2 = engine_v2;
+ // When a config override was provided, preserve its agent settings to mirror
+ // the real binary. Otherwise force deterministic test defaults.
+ if has_config_override {
+     if let Some(v) = auto_approve_tools {
+         components.config.agent.auto_approve_tools = v;
+     }
+     // allow_local_tools and engine_v2 come from the provided config.
+ } else {
+     components.config.agent.auto_approve_tools = auto_approve_tools.unwrap_or(true);
+     components.config.agent.allow_local_tools = true;
+     components.config.agent.engine_v2 = engine_v2;
+ }
Copilot AI Apr 5, 2026

When config_override is present, this branch ignores the builder’s with_engine_v2() flag (engine_v2 local) and relies solely on the provided config. This makes with_engine_v2() a no-op in config-parity mode, which is surprising for callers; consider applying engine_v2 when explicitly set, or document that with_config() takes precedence for engine selection.

Member Author

Fixed in f7ceb6e — the config-override branch now applies the builder's engine_v2 flag when explicitly set.
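
The precedence rule discussed in this thread (explicit builder overrides win, otherwise the provided config is kept, and forced test defaults apply only when no config was supplied) can be modeled as below. This is a simplified sketch under stated assumptions: `AgentFlags`, `Overrides`, and `resolve_flags` are hypothetical names, `engine_v2` is modeled as `Option<bool>` to express "explicitly set", and the no-config default of `false` for `engine_v2` is assumed rather than taken from the PR.

```rust
/// The three agent flags named in the diff above (hypothetical container type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct AgentFlags {
    auto_approve_tools: bool,
    allow_local_tools: bool,
    engine_v2: bool,
}

/// Builder-side overrides; `None` means the test did not ask for a value.
#[derive(Debug, Clone, Copy, Default)]
struct Overrides {
    auto_approve_tools: Option<bool>,
    engine_v2: Option<bool>,
}

/// Resolve the final flags from the loaded config and the builder overrides.
fn resolve_flags(from_config: AgentFlags, has_config_override: bool, o: Overrides) -> AgentFlags {
    if has_config_override {
        // Config parity: keep what the config says unless a test explicitly
        // overrides a flag. allow_local_tools always comes from the config.
        AgentFlags {
            auto_approve_tools: o.auto_approve_tools.unwrap_or(from_config.auto_approve_tools),
            allow_local_tools: from_config.allow_local_tools,
            engine_v2: o.engine_v2.unwrap_or(from_config.engine_v2),
        }
    } else {
        // No config supplied: force deterministic test defaults.
        AgentFlags {
            auto_approve_tools: o.auto_approve_tools.unwrap_or(true),
            allow_local_tools: true,
            // Assumption: engine v2 is off unless the test requests it.
            engine_v2: o.engine_v2.unwrap_or(false),
        }
    }
}
```

Modeling each override as an `Option` is what keeps a builder call such as `with_engine_v2()` from being silently ignored: "not set" and "set to false" stay distinguishable.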

Comment thread tests/support/live_harness.rs Outdated
Comment on lines +311 to +317
// - auto_approve_tools controls approval gates
let rig = TestRigBuilder::new()
    .with_config(config)
    .with_llm(llm)
    .with_http_interceptor(http_interceptor)
    .with_max_tool_iterations(self.max_tool_iterations)
    .with_auto_approve_tools(true)
Copilot AI Apr 5, 2026

build_live claims to mirror real-binary behavior (including approval gates), but it unconditionally forces .with_auto_approve_tools(true), overriding whatever Config::from_env() resolved. To keep parity, consider defaulting to the config value (don’t call with_auto_approve_tools here) and add an explicit builder knob to override when tests want to bypass approvals.

Suggested change
- // - auto_approve_tools controls approval gates
- let rig = TestRigBuilder::new()
-     .with_config(config)
-     .with_llm(llm)
-     .with_http_interceptor(http_interceptor)
-     .with_max_tool_iterations(self.max_tool_iterations)
-     .with_auto_approve_tools(true)
+ // - auto_approve_tools remains driven by the resolved config
+ let rig = TestRigBuilder::new()
+     .with_config(config)
+     .with_llm(llm)
+     .with_http_interceptor(http_interceptor)
+     .with_max_tool_iterations(self.max_tool_iterations)

Member Author

Fixed in f7ceb6e — removed forced .with_auto_approve_tools(true) from build_live(). The harness now respects Config::from_env() by default. Added .with_auto_approve_tools() to the builder so tests can opt in explicitly.

Comment thread tests/support/live_harness.rs Outdated
Comment on lines +212 to +245
/// Builder for constructing a `LiveTestHarness`.
pub struct LiveTestHarnessBuilder {
    test_name: String,
    max_tool_iterations: usize,
    timeout: Duration,
    engine_v2: Option<bool>,
}

impl LiveTestHarnessBuilder {
    /// Create a new builder for a test with the given name.
    ///
    /// The name determines the trace fixture filename:
    /// `tests/fixtures/llm_traces/live/{test_name}.json`
    pub fn new(test_name: impl Into<String>) -> Self {
        Self {
            test_name: test_name.into(),
            max_tool_iterations: 30,
            timeout: Duration::from_secs(120),
            engine_v2: None,
        }
    }

    /// Set the maximum number of tool iterations per agentic loop invocation.
    pub fn with_max_tool_iterations(mut self, n: usize) -> Self {
        self.max_tool_iterations = n;
        self
    }

    /// Set the default timeout for the test (used as a hint; callers still
    /// pass timeout to `wait_for_responses`).
    pub fn with_timeout(mut self, timeout: Duration) -> Self {
        self.timeout = timeout;
        self
    }
Copilot AI Apr 5, 2026

LiveTestHarnessBuilder stores timeout, but it’s never read/used when building the rig or waiting for responses. This can mislead callers into thinking it changes harness behavior; either wire it into the harness API (e.g., expose as default for wait_for_responses / internal timeouts) or remove it.

Member Author

Fixed in f7ceb6e — removed the timeout field and with_timeout() method.

Comment thread tests/support/live_harness.rs Outdated
Comment on lines +393 to +397
let pass = trimmed.starts_with("PASS");
JudgeVerdict {
    pass,
    reasoning: trimmed.to_string(),
}
Copilot AI Apr 5, 2026

judge_response treats any output starting with PASS as a pass, even if the judge doesn’t follow the required PASS: / FAIL: one-line format. This makes false positives more likely (e.g., PASSING ... or multi-line output). Consider parsing strictly for PASS: / FAIL: (case-insensitive) and treating anything else as a failure with a helpful reasoning message.

Suggested change
- let pass = trimmed.starts_with("PASS");
- JudgeVerdict {
-     pass,
-     reasoning: trimmed.to_string(),
- }
+ if trimmed.is_empty() {
+     return JudgeVerdict {
+         pass: false,
+         reasoning: "Judge returned empty output; expected exactly one line starting with PASS: or FAIL:".to_string(),
+     };
+ }
+ if trimmed.lines().count() != 1 {
+     return JudgeVerdict {
+         pass: false,
+         reasoning: format!(
+             "Judge returned invalid multi-line output; expected exactly one line starting with PASS: or FAIL:. Raw output: {}",
+             trimmed
+         ),
+     };
+ }
+ let line = trimmed;
+ if let Some(reasoning) = line
+     .get(..5)
+     .filter(|prefix| prefix.eq_ignore_ascii_case("PASS:"))
+     .map(|_| line[5..].trim())
+ {
+     JudgeVerdict {
+         pass: true,
+         reasoning: reasoning.to_string(),
+     }
+ } else if let Some(reasoning) = line
+     .get(..5)
+     .filter(|prefix| prefix.eq_ignore_ascii_case("FAIL:"))
+     .map(|_| line[5..].trim())
+ {
+     JudgeVerdict {
+         pass: false,
+         reasoning: reasoning.to_string(),
+     }
+ } else {
+     JudgeVerdict {
+         pass: false,
+         reasoning: format!(
+             "Judge returned invalid output; expected exactly one line starting with PASS: or FAIL:. Raw output: {}",
+             trimmed
+         ),
+     }
+ }

Member Author

Fixed in f7ceb6e — now parses strictly for PASS: / FAIL: prefix via strip_prefix(). Any other format is treated as failure with a diagnostic message.
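
A strict parser in the spirit of this reply could be sketched as follows. It is an illustration rather than the merged code: `parse_verdict` is a hypothetical name, the `JudgeVerdict` fields are taken from the snippet under review, and this version matches the prefixes case-sensitively, which is an assumption.

```rust
/// Verdict shape taken from the snippet under review.
#[derive(Debug, PartialEq, Eq)]
struct JudgeVerdict {
    pass: bool,
    reasoning: String,
}

/// Accept only output that starts with the literal prefix "PASS:" or "FAIL:".
/// Anything else is a failure whose reasoning carries the raw judge output.
fn parse_verdict(raw: &str) -> JudgeVerdict {
    let trimmed = raw.trim();
    if let Some(rest) = trimmed.strip_prefix("PASS:") {
        JudgeVerdict { pass: true, reasoning: rest.trim().to_string() }
    } else if let Some(rest) = trimmed.strip_prefix("FAIL:") {
        JudgeVerdict { pass: false, reasoning: rest.trim().to_string() }
    } else {
        JudgeVerdict {
            pass: false,
            reasoning: format!(
                "Judge output did not start with PASS: or FAIL:. Raw output: {trimmed}"
            ),
        }
    }
}
```

Because `strip_prefix` requires the trailing colon, an answer such as `PASSING ...` no longer counts as a pass, which addresses the false-positive case raised in the review.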

Comment on lines +60 to +93
"min_message_count": 6
},
"response": {
"type": "text",
"content": "\n\nGreat! zizmor is already installed at `/Users/coder/.cargo/bin/zizmor`. \n\nTo run it, I need to know what you'd like to audit:\n1. **Current directory** - audit the GitHub Actions workflows in this repo\n2. **Specific repository** - audit another GitHub repo (e.g., `github:user/repo`)\n3. **Local workflow files** - audit specific `.yml` files\n\nWhich would you like to do? Or if you're in a repo with GitHub Actions workflows, I can just run it on the current directory.\n[Called tool `list_dir` with arguments: {\"path\":\".\"}]",
"input_tokens": 14778,
"output_tokens": 188
},
"expected_tool_results": [
{
"tool_call_id": "chatcmpl-tool-8e89a891c7087c33",
"name": "shell",
"content": "<tool_output name=\"shell\">\n{\n \"exit_code\": 0,\n \"output\": \"/Users/coder/.cargo/bin/zizmor\\n\",\n \"sandboxed\": false,\n \"success\": true\n}\n</tool_output>"
}
]
},
{
"request_hint": {
"last_user_message_contains": "can we run https://github.com/zizmorcore/zizmor",
"min_message_count": 8
},
"response": {
"type": "text",
"content": "\n\nI can see there's a `.github/` directory which likely contains GitHub Actions workflows. Let me check what's in there and then run zizmor on this repository.\n[Called tool `list_dir` with arguments: {\"path\":\".github\"}]",
"input_tokens": 16007,
"output_tokens": 86
},
"expected_tool_results": [
{
"tool_call_id": "00000001B",
"name": "list_dir",
"content": "<tool_output name=\"list_dir\">\n{\n \"count\": 70,\n \"entries\": [\n \".claude/\",\n \".git/\",\n \".githooks/\",\n \".github/\",\n \"channels-src/\",\n \"crates/\",\n \"deploy/\",\n \"docker/\",\n \"docs/\",\n \"fuzz/\",\n \"migrations/\",\n \"registry/\",\n \"scripts/\",\n \"skills/\",\n \"src/\",\n \"target/\",\n \"tests/\",\n \"tools-src/\",\n \"wit/\",\n \"wix/\",\n \".dockerignore (67B)\",\n \".env.example (12.5KB)\",\n \".gitattributes (50B)\",\n \".gitignore (593B)\",\n \"AGENTS.md (5.4KB)\",\n \"CHANGELOG.md (69.2KB)\",\n \"CLAUDE.md (14.7KB)\",\n \"CONTRIBUTING.md (5.3KB)\",\n \"COVERAGE_PLAN.md (32.2KB)\",\n \"Cargo.lock (227.0KB)\",\n \"Cargo.toml (9.1KB)\",\n \"Dockerfile (2.3KB)\",\n \"Dockerfile.test (1.5KB)\",\n \"Dockerfile.worker (2.3KB)\",\n \"FEATURE_PARITY.md (30.5KB)\",\n \"LICENSE-APACHE (10.5KB)\",\n \"LICENSE-MIT (1.0KB)\",\n \"README.ja.md (16.7KB)\",\n \"README.md (13.7KB)\",\n \"README.ru.md (20.4KB)\",\n \"README.zh-CN.md (13.3KB)\",\n \"build.rs (8.5KB)\",\n \"clippy.toml (537B)\",\n \"codecov.yml (219B)\",\n \"deny.toml (1.8KB)\",\n \"docker-compose.yml (524B)\",\n \"engine_trace_20260405T100717.json (20.9KB)\",\n \"ironclaw.bash (100.3KB)\",\n \"ironclaw.fish (69.9KB)\",\n \"ironclaw.png (266.7KB)\",\n \"ironclaw.zsh (84.0KB)\",\n \"providers.json (12.8KB)\",\n \"release-plz.toml (39B)\",\n \"trace_20260309T172815.json (8.8KB)\",\n \"trace_20260310T075912.json (4.8KB)\",\n \"trace_20260311T002753.json (4.8KB)\",\n \"trace_20260311T130928.json (4.8KB)\",\n \"trace_20260311T141121.json (9.1KB)\",\n \"trace_20260311T141138.json (11.7KB)\",\n \"trace_20260311T141430.json (9.1KB)\",\n \"trace_20260311T181338.json (11.8KB)\",\n \"trace_20260311T183957.json (9.2KB)\",\n \"trace_20260311T185836.json (9.2KB)\",\n \"trace_20260311T191112.json (95.2KB)\",\n \"trace_20260311T191641.json (156.7KB)\",\n \"trace_20260311T194203.json (9.8KB)\",\n \"trace_20260311T194240.json (272.4KB)\",\n \"trace_20260311T214436.json (4.7KB)\",\n 
\"trace_20260311T214457.json (133.1KB)\",\n \"trace_20260311T221512.json (7.6KB)\"\n ],\n \"path\": \"/Users/coder/ironclaw8\",\n \"truncated\": false\n}\n</tool_output>"
}
]
Copilot AI Apr 5, 2026

This trace fixture embeds absolute local paths (e.g. /Users/coder/ironclaw8 and /Users/coder/.cargo/bin/zizmor) inside recorded tool outputs. That leaks machine-specific info and makes fixtures harder to share/re-record across environments. Consider scrubbing/normalizing paths in recorded fixtures (or in the recorder) before committing (e.g., replace with relative paths or a placeholder root).

Member Author

Acknowledged — these are inside recorded tool output (what the LLM saw). Scrubbing would make replay diverge from reality. Will consider path normalization / HTTP body trimming in the recorder as a follow-up.
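
If path normalization is picked up as a follow-up, one possible shape is a plain string rewrite applied by the recorder before a tool result is written to the fixture. The sketch below is hypothetical: `scrub_paths` and the `$HOME` / `$REPO` placeholders are invented for illustration, and nothing like it exists in this PR.

```rust
/// Replace machine-specific path prefixes with stable placeholders.
/// Longer roots are applied first so that a repository path nested under the
/// home directory gets its own placeholder rather than the home one.
fn scrub_paths(content: &str, roots: &[(&str, &str)]) -> String {
    let mut ordered: Vec<(&str, &str)> = roots.to_vec();
    ordered.sort_by(|a, b| b.0.len().cmp(&a.0.len()));

    let mut out = content.to_string();
    for (root, placeholder) in ordered {
        out = out.replace(root, placeholder);
    }
    out
}
```

To address the divergence concern raised in the reply, a rewrite like this would presumably have to run before the tool output is shown to the LLM in live mode as well, so that the recorded trace and the replayed trace stay identical.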

Comment on lines +68 to +74
"content": "{'count': 0, 'results': [], 'searched_online': False}"
},
{
"tool_call_id": "chatcmpl-tool-a3b63943e42a0ecb",
"name": "http",
"content": "{'body': '# Search code, repositories, users, issues, pull requests...\\n\\n\\n\\n\\n\\n[Sign in](/login?return_to=https%3A%2F%2Fgithub.com%2Fzizmorcore%2Fzizmor)\\n\\n/;ref_cta:Sign up;ref_loc:header logged out\"}\" data-hydro-click=\"{\"event_type\":\"authentication.click\",\"payload\":{\"location_in_page\":\"site header menu\",\"repository_id\":null,\"auth_type\":\"SIGN_UP\",\"originating_url\":\"https://github.com/zizmorcore/zizmor\",\"user_id\":null}}\" data-hydro-click-hmac=\"8caabd8b90225b5d5573f566ac422fd36398eba76666d09587f4d4f6f6450bb2\" href=\"/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E%2F%3Crepo-name%3E&source=header-repo&source_repo=zizmorcore%2Fzizmor\">\\n Sign up\\n\\nAppearance settings\\n', 'headers': {'accept-ranges': 'bytes', 'cache-control': 'max-age=0, private, must-revalidate', 'content-security-policy': \"default-src 'none'; base-uri 'self'; child-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com *.rel.tunnels.api.visualstudio.com wss://*.rel.tunnels.api.visualstudio.com github.githubassets.com objects-origin.githubusercontent.com copilot-proxy.githubusercontent.com proxy.individual.githubcopilot.com proxy.business.githubcopilot.com proxy.enterprise.githubcopilot.com *.actions.githubusercontent.com wss://*.actions.githubusercontent.com productionresultssa0.blob.core.windows.net productionresultssa1.blob.core.windows.net productionresultssa2.blob.core.windows.net productionresultssa3.blob.core.windows.net productionresultssa4.blob.core.windows.net productionresultssa5.blob.core.windows.net 
productionresultssa6.blob.core.windows.net productionresultssa7.blob.core.windows.net productionresultssa8.blob.core.windows.net productionresultssa9.blob.core.windows.net productionresultssa10.blob.core.windows.net productionresultssa11.blob.core.windows.net productionresultssa12.blob.core.windows.net productionresultssa13.blob.core.windows.net productionresultssa14.blob.core.windows.net productionresultssa15.blob.core.windows.net productionresultssa16.blob.core.windows.net productionresultssa17.blob.core.windows.net productionresultssa18.blob.core.windows.net productionresultssa19.blob.core.windows.net github-production-repository-image-32fea6.s3.amazonaws.com github-production-release-asset-2e65be.s3.amazonaws.com insights.github.com wss://alive.github.com wss://alive-staging.github.com api.githubcopilot.com api.individual.githubcopilot.com api.business.githubcopilot.com api.enterprise.githubcopilot.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com copilot-workspace.githubnext.com objects-origin.githubusercontent.com; frame-ancestors 'none'; frame-src viewscreen.githubusercontent.com notebooks.githubusercontent.com; img-src 'self' data: blob: github.githubassets.com media.githubusercontent.com camo.githubusercontent.com identicons.github.com avatars.githubusercontent.com private-avatars.githubusercontent.com github-cloud.s3.amazonaws.com objects.githubusercontent.com release-assets.githubusercontent.com secured-user-images.githubusercontent.com user-images.githubusercontent.com private-user-images.githubusercontent.com opengraph.githubassets.com marketplace-screenshots.githubusercontent.com copilotprodattachments.blob.core.windows.net/github-production-copilot-attachments/ github-production-user-asset-6210df.s3.amazonaws.com customer-stories-feed.github.com spotlights-feed.github.com objects-origin.githubusercontent.com *.githubusercontent.com; manifest-src 'self'; media-src github.com user-images.githubusercontent.com 
secured-user-images.githubusercontent.com private-user-images.githubusercontent.com github-production-user-asset-6210df.s3.amazonaws.com gist.github.com github.githubassets.com; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; upgrade-insecure-requests; worker-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/\", 'content-type': 'text/html; charset=utf-8', 'date': 'Sun, 05 Apr 2026 11:49:36 GMT', 'etag': 'W/\"f79c89ccf6deeb8bf72f05468daae684\"', 'referrer-policy': 'no-referrer-when-downgrade', 'server': 'github.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', 'vary': 'X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, X-Requested-With, Sec-Fetch-Site,Accept-Encoding, Accept, X-Requested-With', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-request-id': '80C1:2E367D:125F734:18F58BA:69D24C4F', 'x-xss-protection': '0'}, 'status': 200}"
}
Copilot AI Apr 5, 2026

The recorded http tool result stores a large HTML body plus many volatile headers (date, x-github-request-id, etc.). This tends to bloat the repo and causes frequent fixture churn on re-recording without improving test coverage. Consider redacting/truncating HTTP bodies and dropping non-essential headers in the recording layer (or post-process fixtures) so replay remains deterministic but compact and stable.

Member Author

Acknowledged — these are inside recorded tool output (what the LLM saw). Scrubbing would make replay diverge from reality. Will consider path normalization / HTTP body trimming in the recorder as a follow-up.
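
For the HTTP side of that follow-up, a recorder could drop headers that change on every request and cap the stored body. The sketch below is only one possible approach: `VOLATILE_HEADERS`, `is_volatile`, `stable_headers`, and `truncate_body` are hypothetical names, and the header list is an assumption based on the volatile fields visible in the fixture above (`date`, `etag`, `x-github-request-id`).

```rust
use std::collections::BTreeMap;

/// Headers assumed to change between recordings (illustrative list only).
const VOLATILE_HEADERS: &[&str] = &["date", "etag", "x-github-request-id"];

/// Case-insensitive membership test, since header names are case-insensitive.
fn is_volatile(name: &str) -> bool {
    VOLATILE_HEADERS.iter().any(|v| v.eq_ignore_ascii_case(name))
}

/// Keep only headers that should be stable across re-recordings.
fn stable_headers(headers: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    headers
        .iter()
        .filter(|(name, _)| !is_volatile(name.as_str()))
        .map(|(name, value)| (name.clone(), value.clone()))
        .collect()
}

/// Cap a recorded body at `max_chars` characters; counting characters rather
/// than bytes keeps the cut on a valid UTF-8 boundary.
fn truncate_body(body: &str, max_chars: usize) -> String {
    body.chars().take(max_chars).collect()
}
```

A `BTreeMap` keeps header order deterministic in the serialized fixture, which also helps keep diffs between re-recordings small.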

- Fix UTF-8 unsafe string truncation in session log (use char_indices
  to find safe boundary instead of byte-index slicing)
- Remove forced auto_approve_tools(true) from LiveTestHarness build_live;
  let Config::from_env() drive it, with per-test override via new
  with_auto_approve_tools() builder method
- Apply engine_v2 builder override in TestRig's config-override branch
  so with_engine_v2() is not silently ignored when with_config() is used
- Remove unused timeout field and with_timeout() from LiveTestHarnessBuilder
- Tighten judge_response parsing to require strict PASS:/FAIL: prefix;
  anything else is treated as a failure with diagnostic message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ilblackdragon ilblackdragon merged commit 1c2d2f2 into staging Apr 5, 2026
14 checks passed
@ilblackdragon ilblackdragon deleted the feat/live-replay-test-harness branch April 5, 2026 13:03
drchirag1991 pushed a commit to drchirag1991/ironclaw that referenced this pull request Apr 8, 2026
…#2039)

* feat(test): add dual-mode live/replay test harness with LLM judge

* test(live): add engine v2 zizmor scan test, add with_engine_v2 to harness

* fix(test): address PR review feedback

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ironclaw-ci ironclaw-ci bot mentioned this pull request Apr 10, 2026
@ironclaw-ci ironclaw-ci bot mentioned this pull request Apr 18, 2026

Labels

  • contributor: core (20+ merged PRs)
  • risk: low (changes to docs, tests, or low-risk modules)
  • size: XL (500+ changed lines)

2 participants