
Trajectory benchmarks and e2e trace test rig#553

Merged
ilblackdragon merged 53 commits into main from how-can-we-have-testingbenchma
Mar 5, 2026

Conversation

@zmanian
Collaborator

@zmanian zmanian commented Mar 4, 2026

Summary

  • Add e2e trace test rig (TraceLlm + TestRig) for replaying LLM traces against the full agent loop
  • Unified trace format around turns -- each turn pairs a user_input with LLM response steps, making traces self-contained conversation trajectories
  • Backward-compatible: flat "steps" JSON (legacy) is deserialized as a single turn transparently
  • Add TestRig::run_trace() for automatic multi-turn replay (send user messages, collect responses, advance turns)
  • Add trace fixtures across three tiers: spot/ (quick smoke tests), coverage/ (tool/feature coverage), advanced/ (multi-turn, error recovery, steering, iteration limits, prompt injection)
  • Add advanced e2e tests: multi-turn memory coherence, user steering, tool error recovery, long tool chains, workspace search, iteration limit guard, prompt injection resilience
  • Add benchmark design plans and phased implementation docs
  • Remove the in-tree benchmark harness, keeping the retain_only utilities
  • Add GitHub Actions benchmark workflow (manual trigger)
  • Add --json flag for machine-readable benchmark output
  • Add SkillRegistry::retain_only and wire skill filtering in scenarios
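As a rough sketch of the turn-splitting idea (hypothetical, simplified types — the real TraceStep/TraceTurn live in the test support code and carry more fields), a flat legacy step list can be split into turns at UserInput boundaries, with a placeholder input for steps that precede the first marker:

```rust
// Simplified model of the trace format; illustrative names only.
#[derive(Debug, Clone, PartialEq)]
enum TraceStep {
    UserInput(String), // marker recorded when the user sent a message
    Response(String),  // an LLM response step
}

#[derive(Debug, PartialEq)]
struct TraceTurn {
    user_input: String,
    steps: Vec<TraceStep>,
}

/// Split a flat legacy step list into turns at UserInput boundaries.
/// Steps before the first UserInput get a placeholder input.
fn split_into_turns(steps: Vec<TraceStep>) -> Vec<TraceTurn> {
    let mut turns: Vec<TraceTurn> = Vec::new();
    for step in steps {
        match step {
            TraceStep::UserInput(input) => turns.push(TraceTurn {
                user_input: input,
                steps: Vec::new(),
            }),
            other => {
                if turns.is_empty() {
                    // Legacy trace with no markers: single synthetic turn.
                    turns.push(TraceTurn {
                        user_input: "(test input)".to_string(),
                        steps: Vec::new(),
                    });
                }
                turns.last_mut().unwrap().steps.push(other);
            }
        }
    }
    turns
}
```

This is why a flat "steps" JSON with no markers deserializes as exactly one turn, while recorded multi-turn traces become proper multi-turn conversations.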

Test plan

  • cargo test trace_llm -- unit tests for trace types, deserialization (flat + turns), backward compat
  • cargo test e2e_spot -- spot check smoke tests pass
  • cargo test e2e_advanced -- multi-turn, steering, error recovery, long chains, memory, injection tests pass
  • cargo test e2e_status -- status event tests pass
  • Trace fixtures cover: greeting, math, tool echo, JSON ops, chain write/read, memory save/recall, error paths, shell, list_dir, apply_patch, injection detection, multi-turn steering, workspace search, iteration limits

Generated with Claude Code

zmanian and others added 30 commits March 3, 2026 15:48
Move 5 assertion helpers from e2e_spot_checks.rs to a shared module.
Add assert_all_tools_succeeded and assert_tool_succeeded for eliminating
false positives in E2E tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract (name, preview) from ToolResult status events in TestChannel
and TestRig, enabling content assertions on tool outputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- tool_time.json: add missing "operation": "now" for time tool
- robust_correct_tool.json: same fix
- memory_full_cycle.json: change "path" to "target" for memory_write

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Every E2E test that exercises tools now calls assert_all_tools_succeeded.
Added tool output content assertions where tool results are predictable
(time year, read_file content, memory_read content).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Record Instant on ToolStarted and compute elapsed duration on
ToolCompleted, wiring real timing data into collect_metrics() instead
of hardcoded zeros.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace manual cleanup_test_dir() calls and inline remove_file() with
Drop-based CleanupGuard that ensures cleanup even if a test panics.
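A minimal sketch of the RAII pattern (the real CleanupGuard also handles directories via a PathKind enum; names here are illustrative):

```rust
use std::fs;
use std::path::PathBuf;

/// Simplified RAII guard: removes a temp file when dropped, which runs
/// during unwinding too, so cleanup happens even if the test panics.
struct CleanupGuard {
    path: PathBuf,
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path); // best-effort cleanup
    }
}

fn guard_demo() -> bool {
    let path = std::env::temp_dir().join("cleanup_guard_demo.txt");
    {
        let _guard = CleanupGuard { path: path.clone() };
        fs::write(&path, "scratch data").unwrap();
        assert!(path.exists());
    } // _guard dropped here: file removed
    path.exists()
}
```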

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap agent_handle in Option so Drop can abort leaked tasks. Signal
the channel shutdown before aborting for future cooperative shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a oneshot channel fired in Channel::start() instead of a fixed
100ms sleep, eliminating the race condition on slow systems.
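The actual change uses a tokio oneshot fired in Channel::start(); a std-only analogue of the handshake (illustrative names, sync_channel standing in for the oneshot) looks like:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for agent startup: the thread signals readiness
// once setup is done, instead of the test sleeping a fixed 100ms.
fn start_agent(ready_tx: mpsc::SyncSender<()>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // ... channel/agent setup would happen here ...
        let _ = ready_tx.send(()); // signal: startup complete
        // ... main loop would run here ...
    })
}

fn wait_until_ready() -> bool {
    let (tx, rx) = mpsc::sync_channel(1);
    let _handle = start_agent(tx);
    // Block until the agent says it is ready (bounded by a timeout),
    // which cannot race no matter how slow the machine is.
    rx.recv_timeout(Duration::from_secs(5)).is_ok()
}
```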

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: replace fragile string-matching iteration limit with count-based detection

Use tool completion count vs max_tool_iterations instead of scanning
status messages for "iteration"/"limit" substrings.
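Sketched with illustrative event types (the real status events carry more data), the count-based check is just:

```rust
#[derive(Debug)]
enum StatusEvent {
    ToolStarted(String),
    ToolCompleted(String),
    Message(String),
}

/// Count ToolCompleted events instead of scanning message text for
/// "iteration"/"limit" substrings, which breaks whenever wording changes.
fn hit_iteration_limit(events: &[StatusEvent], max_tool_iterations: usize) -> bool {
    let completed = events
        .iter()
        .filter(|e| matches!(e, StatusEvent::ToolCompleted(_)))
        .count();
    completed >= max_tool_iterations
}
```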

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove incorrect comment about memory_tree failing with empty path
(it actually succeeds). Omit empty path from fixture and use the
standard assert_all_tools_succeeded instead of per-tool assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move TraceMetrics, ScenarioResult, RunResult, MetricDelta, and
compare_runs() from tests/support/metrics.rs to src/benchmark/metrics.rs.
Existing tests use re-export for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scenario defines a task with input, success criteria, and resource
limits. Criterion is an enum of programmatic checks (tool_used,
response_contains, etc.) evaluated without LLM judgment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add initial benchmark scenario suite (12 scenarios across 5 categories)

Scenarios cover tool_selection, tool_chaining, error_recovery,
efficiency, and memory_operations. All loaded from JSON with
deserialization validation test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BenchChannel is a minimal Channel implementation for benchmarks.
InstrumentedLlm wraps any LlmProvider to capture per-call metrics.
Runner creates a fresh agent per scenario, evaluates success criteria,
and produces RunResult with timing, token, and cost metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- baseline.rs: load/save/promote benchmark results
- report.rs: format comparison reports with regression detection
- benchmark_runner.rs: integration test with real LLM (feature-gated)
- Add benchmark feature flag to Cargo.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): add multi-turn scenario types with setup, judge, ResponseNotContains

Add BenchScenario, Turn, TurnAssertions, JudgeConfig, ScenarioSetup,
WorkspaceSetup, SeedDocument types for multi-turn benchmark scenarios.
Add ResponseNotContains criterion variant. Add TurnAssertions::to_criteria()
converter for backward compat with existing evaluation engine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): add JSON scenario loader with recursive discovery and tag filter

Add load_bench_scenarios() for the new BenchScenario format with recursive
directory traversal and tag-based filtering. Create 4 initial trajectory
scenarios across tool-selection, multi-turn, and efficiency categories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): multi-turn runner with workspace seeding and per-turn metrics

Add run_bench_scenario() that loops over BenchScenario turns, seeds workspace
documents, collects per-turn metrics (tokens, tool calls, wall time), and
evaluates per-turn assertions. Add TurnMetrics to metrics.rs and
clear_for_next_turn() to BenchChannel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): add LLM-as-judge scoring with prompt formatting and score parsing

Create judge.rs with format_judge_prompt, parse_judge_score, and judge_turn.
Wire into run_bench_scenario for turns with judge config -- scores below
min_score fail the turn.
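A hedged sketch of the scoring path (the "SCORE: &lt;n&gt;" reply format and both function names are assumptions for illustration; the real parse_judge_score may accept a different shape):

```rust
/// Parse a 0-10 judge score from an LLM reply that is assumed to
/// contain a line like "SCORE: 8". Out-of-range values are rejected.
fn parse_judge_score(reply: &str) -> Option<f64> {
    reply.lines().find_map(|line| {
        let rest = line.trim().strip_prefix("SCORE:")?;
        rest.trim()
            .parse::<f64>()
            .ok()
            .filter(|s| (0.0..=10.0).contains(s))
    })
}

/// A turn fails when no score parses or the score is below min_score.
fn turn_passes(reply: &str, min_score: f64) -> bool {
    parse_judge_score(reply).is_some_and(|s| s >= min_score)
}
```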

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add BenchmarkCommand with --tags, --scenario, --no-judge, --timeout,
--update-baseline flags. Wire into Command enum and main.rs dispatch.
Feature-gated behind benchmark flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add save_scenario_results() that writes per-scenario JSON files alongside
the run summary. Each scenario gets its own file with turn_metrics trajectory.
Update CLI to use new output format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): add ToolRegistry::retain_only and wire tool filtering in scenarios

Add a retain_only() method to ToolRegistry that filters tools down to a
given allowlist. Wire this into run_bench_scenario() so that when a
scenario specifies a tools list in its setup, only those tools are
available during the benchmark run. Includes two tests for the new
method: one verifying filtering works and one verifying empty input
is a no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): wire identity overrides into workspace before agent start

Add seed_identity() helper that writes identity files (IDENTITY.md,
USER.md, etc.) into the workspace before the agent starts, so that
workspace.system_prompt() picks them up. Wire it into
run_bench_scenario() after workspace seeding. Include a test that
verifies identity files are written and readable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix(benchmark): use feature-conditional snapshot names for CLI help tests

Prevents snapshot conflicts between default (no benchmark) and
all-features (with benchmark) builds by using separate snapshot names
per feature set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): parallel execution with JoinSet and budget cap enforcement

Replace sequential loop in run_all_bench() with parallel execution using
JoinSet + semaphore when config.parallel > 1. Add budget cap enforcement
that skips remaining scenarios when max_total_cost_usd is exceeded.
Track skipped count in RunResult.skipped_scenarios and display it in
format_report().
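The real runner interleaves this with JoinSet + semaphore parallelism; the budget-cap logic on its own can be sketched synchronously (illustrative names, per-scenario costs as plain f64):

```rust
/// Run scenarios in order, skipping the rest once accumulated cost
/// exceeds max_total_cost_usd. Returns (ran, skipped) so skipped
/// counts can be surfaced in the report.
fn run_with_budget(costs: &[f64], max_total_cost_usd: f64) -> (usize, usize) {
    let mut total = 0.0;
    let mut ran = 0;
    for (i, cost) in costs.iter().enumerate() {
        if total > max_total_cost_usd {
            // Budget exhausted: everything from here on is skipped.
            return (ran, costs.len() - i);
        }
        total += cost;
        ran += 1;
    }
    (ran, 0)
}
```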

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): add tool restriction and identity override test scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(benchmark): add SkillRegistry::retain_only and wire skill filtering in scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added labels scope: agent (Agent core: agent loop, router, scheduler), scope: tool/builtin (Built-in tools), and scope: llm (LLM integration) on Mar 5, 2026
ilblackdragon and others added 5 commits March 4, 2026 16:52
Add TraceExpects struct with 9 optional assertion fields (response_contains,
tools_used, all_tools_succeeded, etc.) that can be declared in fixture JSON
instead of hand-written Rust. Add verify_expects() and run_recorded_trace()
so recorded trace tests become one-liners.

Split trace infra tests (deserialization, backward compat) into
tests/trace_format.rs which doesn't require the libsql feature gate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add declarative expects blocks to all 19 trace fixture JSONs across
spot/, coverage/, advanced/, and root directories. Update all 8 e2e
test files to use verify_trace_expects() / run_and_verify_trace(),
replacing ~270 lines of hand-written assertions with fixture-driven
verification.

Tests that check things beyond expects (file content on disk, metrics,
event ordering) keep those extra assertions alongside the declarative
ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update test files to work with refactored TestRigBuilder that uses
AppBuilder::build_all() (removing with_tools/with_workspace methods).
Update telegram_check fixture to use tool_list instead of echo.
Fix cargo fmt issues in src/llm/mod.rs and src/llm/recording.rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support modules (assertions, cleanup, test_channel, test_rig, trace_llm)
had #[cfg(test)] mod tests blocks that were compiled and run 12 times —
once per e2e test binary that declares `mod support;`. Extracted all 29
support unit tests into a dedicated `tests/support_unit_tests.rs` so they
run exactly once.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ilblackdragon
Member

Checked that "check_telegram" test actually works and hits it with coverage:

Screenshot 2026-03-04 at 5 51 22 PM

@zmanian
Collaborator Author

zmanian commented Mar 5, 2026

Review of Illia's commits (a53a852..49dfad6)

Overall these are solid improvements -- declarative expects, AppBuilder integration, and test deduplication are all good moves. Three issues worth discussing:

1. Duplicate type systems between src/llm/recording.rs and tests/support/trace_llm.rs

Both files define parallel trace types (TraceStep, TraceResponse, TraceToolCall, RequestHint, etc.) that are structurally similar but not shared. Changes to the trace format need to be kept in sync in two places. We should either:

  • Move the canonical types into src/ (behind a testing feature) so both recording and replay use the same types
  • Have trace_llm.rs import from recording.rs

2. Recording produces flat steps, not turns

RecordingLlm emits UserInput markers in a flat steps array. The LlmTrace deserializer wraps this into a single turn with placeholder user input "(test input)". For multi-turn recorded traces, this means run_trace() sends only the placeholder message, not the actual user messages from the recording. The UserInput markers are in the step list but get skipped by TraceLlm.

This works today because flat steps replay correctly in sequence, but run_and_verify_trace() won't actually drive multi-turn conversations from recorded traces -- it'll send one placeholder message and replay all steps linearly.

Should RecordingLlm produce turns instead? Or should the deserializer split flat steps at UserInput boundaries into separate turns?

3. with_tools() / with_workspace() removal

The switch to AppBuilder::build_all() is great for prod/test parity, but removing with_tools() means tests can no longer control tool registry composition. All tests now get the full tool set. This is fine for E2E realism, but tests like iteration_limit_stops_runaway that previously used a minimal registry may be slower or have unexpected interactions. Worth keeping an eye on.

zmanian and others added 10 commits March 4, 2026 19:20
Import shared types (TraceStep, TraceResponse, TraceToolCall, RequestHint,
ExpectedToolResult, MemorySnapshotEntry, HttpExchange*) from
ironclaw::llm::recording instead of redefining them in trace_llm.rs.

Fix the flat-steps deserializer to split at UserInput boundaries into
multiple turns, instead of filtering them out and wrapping everything
into a single turn. This enables recorded multi-turn traces to be
replayed as proper multi-turn conversations via run_trace().

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add #[allow(unused_imports)] on pub use re-exports in trace_llm.rs
  (types are re-exported for downstream test files, not used locally)
- Add `..` to ToolCompleted pattern in test_channel.rs to match new
  `error` and `parameters` fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add missing `error` and `parameters` fields to ToolCompleted
  constructors in support_unit_tests.rs
- Add `..` to ToolCompleted pattern match in support_unit_tests.rs
- Add #[allow(dead_code)] to CleanupGuard, LlmTrace impl, and
  TraceLlm impl (only used behind #[cfg(feature = "libsql")])

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase wait_for_responses polling to exponential backoff (50ms-500ms)
  and raise default timeout from 15s to 30s to reduce CI flakiness (#1)
- Strengthen prompt_injection_resilience test with positive safety layer
  assertion via has_safety_warnings(), enable injection_check (#2)
- Add assert_tool_order() helper and tools_order field in TraceExpects
  for verifying tool execution ordering in multi-step traces (#3)
- Document TraceLlm sequential-call assumption for concurrency (#6)
- Clean up CleanupGuard with PathKind enum instead of shotgun
  remove_file + remove_dir_all on every path (#8)
- Fix coverage.sh: default to --lib only, fix multi-filter syntax,
  add COV_ALL_TARGETS option
- Add coverage/ to .gitignore
- Remove planning docs from PR
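The backoff change in the first item can be sketched std-only (illustrative name; the real wait_for_responses is async and checks channel state):

```rust
use std::time::{Duration, Instant};

/// Poll `check` with exponential backoff from 50ms up to a 500ms cap,
/// giving up once `timeout` has elapsed. Backoff keeps fast tests fast
/// while reducing busy-polling pressure on slow CI machines.
fn poll_with_backoff(timeout: Duration, mut check: impl FnMut() -> bool) -> bool {
    let start = Instant::now();
    let mut delay = Duration::from_millis(50);
    loop {
        if check() {
            return true;
        }
        if start.elapsed() >= timeout {
            return false;
        }
        std::thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_millis(500)); // 50, 100, 200, 400, 500, 500...
    }
}
```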

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use HashSet for O(N+M) lookup in SkillRegistry::retain_only and
  ToolRegistry::retain_only instead of linear scan
- Strengthen test_retain_only_empty_is_noop in SkillRegistry to
  pre-populate with a skill before asserting the no-op behavior
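The shape of the HashSet version, sketched with an illustrative stand-in registry (the real ToolRegistry/SkillRegistry store richer entries than names):

```rust
use std::collections::HashSet;

struct ToolRegistry {
    tools: Vec<String>,
}

impl ToolRegistry {
    /// Keep only tools whose names appear in `allowed`. An empty
    /// allowlist is a no-op. Building a HashSet first makes this
    /// O(N + M) instead of a nested linear scan.
    fn retain_only(&mut self, allowed: &[String]) {
        if allowed.is_empty() {
            return;
        }
        let allowed: HashSet<&str> = allowed.iter().map(String::as_str).collect();
        self.tools.retain(|name| allowed.contains(name.as_str()));
    }
}
```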

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The safety layer sanitizes tool output, not user input. The injection
test sends a malicious user message with no tools called, so the safety
layer never fires. Reverted to the original test which correctly
validates the LLM refuses via trace expects. Also fixed case-sensitive
request hint ("ignore" -> "Ignore") to suppress noisy warning.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds `cargo llvm-cov clean` before each run to prevent
"mismatched data" warnings from stale instrumentation profiles.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ilblackdragon ilblackdragon merged commit b4b1973 into main Mar 5, 2026
14 checks passed
@ilblackdragon ilblackdragon deleted the how-can-we-have-testingbenchma branch March 5, 2026 09:13
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026

Labels

  • contributor: core (20+ merged PRs)
  • risk: medium (Business logic, config, or moderate-risk modules)
  • scope: agent (Agent core: agent loop, router, scheduler)
  • scope: docs (Documentation)
  • scope: llm (LLM integration)
  • scope: tool/builtin (Built-in tools)
  • scope: tool (Tool infrastructure)
  • size: XL (500+ changed lines)
