Trajectory benchmarks and e2e trace test rig #553
Conversation
refactor: extract shared assertion helpers to support/assertions.rs
Move 5 assertion helpers from e2e_spot_checks.rs to a shared module. Add assert_all_tools_succeeded and assert_tool_succeeded to eliminate false positives in E2E tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add tool output capture via tool_results() accessor
Extract (name, preview) from ToolResult status events in TestChannel and TestRig, enabling content assertions on tool outputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: correct tool parameters in 3 broken trace fixtures
- tool_time.json: add missing "operation": "now" for the time tool
- robust_correct_tool.json: same fix
- memory_full_cycle.json: change "path" to "target" for memory_write
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add tool success and output assertions to eliminate false positives
Every E2E test that exercises tools now calls assert_all_tools_succeeded. Added tool output content assertions where tool results are predictable (time year, read_file content, memory_read content). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: capture per-tool timing from ToolStarted/ToolCompleted events
Record an Instant on ToolStarted and compute the elapsed duration on ToolCompleted, wiring real timing data into collect_metrics() instead of hardcoded zeros. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
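The timing wiring described in this commit can be sketched as follows. Names (`ToolTimer`, `on_tool_started`, `on_tool_completed`) are hypothetical stand-ins for the real event handlers; the actual code hooks the agent's status-event stream.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sketch: record an Instant when a tool starts and compute
/// the elapsed Duration when it completes.
#[derive(Default)]
struct ToolTimer {
    started: HashMap<String, Instant>,
    durations: Vec<(String, Duration)>,
}

impl ToolTimer {
    fn on_tool_started(&mut self, name: &str) {
        self.started.insert(name.to_string(), Instant::now());
    }

    fn on_tool_completed(&mut self, name: &str) {
        if let Some(t0) = self.started.remove(name) {
            // A real measurement instead of a hardcoded zero.
            self.durations.push((name.to_string(), t0.elapsed()));
        }
    }
}

fn main() {
    let mut timer = ToolTimer::default();
    timer.on_tool_started("read_file");
    timer.on_tool_completed("read_file");
    assert_eq!(timer.durations.len(), 1);
}
```

A metrics collector can then sum or average `durations` per tool name rather than reporting zeros.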
refactor: add RAII CleanupGuard for temp file/dir cleanup in tests
Replace manual cleanup_test_dir() calls and inline remove_file() with a Drop-based CleanupGuard that ensures cleanup even if a test panics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
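A minimal sketch of such a Drop-based guard (struct shape is illustrative; this uses the naive remove-both-kinds cleanup that a later commit in this PR refines with a PathKind enum):

```rust
use std::fs;
use std::path::PathBuf;

/// Hypothetical sketch of an RAII cleanup guard: the path is removed when the
/// guard goes out of scope, even if the test panics in between.
struct CleanupGuard {
    path: PathBuf,
}

impl CleanupGuard {
    fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        // Ignore errors: the path may be a file, a dir, or never created.
        let _ = fs::remove_file(&self.path);
        let _ = fs::remove_dir_all(&self.path);
    }
}

fn main() {
    let path = std::env::temp_dir().join("cleanup_guard_demo.txt");
    {
        let _guard = CleanupGuard::new(&path);
        fs::write(&path, "scratch data").unwrap();
        assert!(path.exists());
    } // guard dropped here, file removed
    assert!(!path.exists());
}
```

Because `Drop` runs during unwinding, a `panic!` inside the inner block would still trigger the cleanup, which manual `cleanup_test_dir()` calls at the end of a test do not guarantee.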
fix: add Drop impl and graceful shutdown for TestRig
Wrap agent_handle in an Option so Drop can abort leaked tasks. Signal channel shutdown before aborting, to allow future cooperative shutdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: replace agent startup sleep with oneshot ready signal
Use a oneshot channel fired in Channel::start() instead of a fixed 100ms sleep, eliminating the race condition on slow systems. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
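The ready-signal idea can be sketched with `std::sync::mpsc` standing in for the tokio oneshot channel the commit actually uses (the `start_agent` name is hypothetical): the started component signals readiness explicitly, so the test never has to guess with a fixed sleep.

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: the "agent" thread fires a ready signal once its setup has run,
// replacing a fixed 100ms sleep that races on slow machines.
fn start_agent(ready_tx: mpsc::Sender<()>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // ... channel setup work would happen here ...
        ready_tx.send(()).expect("test dropped the ready receiver");
        // ... agent loop would run here ...
    })
}

fn main() {
    let (ready_tx, ready_rx) = mpsc::channel();
    let handle = start_agent(ready_tx);
    // Blocks exactly until start() has run, however slow the machine is.
    ready_rx.recv().expect("agent exited before signaling ready");
    handle.join().unwrap();
}
```

The same shape works with `tokio::sync::oneshot` in async code; the point is that readiness becomes an event, not a timing assumption.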
fix: replace fragile string-matching iteration limit with count-based detection
Use the tool completion count vs max_tool_iterations instead of scanning status messages for "iteration"/"limit" substrings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: use assert_all_tools_succeeded for memory_full_cycle test
Remove the incorrect comment about memory_tree failing with an empty path (it actually succeeds). Omit the empty path from the fixture and use the standard assert_all_tools_succeeded instead of per-tool assertions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor: promote benchmark metrics types to library code
Move TraceMetrics, ScenarioResult, RunResult, MetricDelta, and compare_runs() from tests/support/metrics.rs to src/benchmark/metrics.rs. Existing tests use a re-export for backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add Scenario and Criterion types for agent benchmarking
Scenario defines a task with input, success criteria, and resource limits. Criterion is an enum of programmatic checks (tool_used, response_contains, etc.) evaluated without LLM judgment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
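A programmatic criterion like this can be sketched as an enum with an evaluation method. The variant and field names mirror the commit message, not the real code, and `Transcript` is an invented stand-in for whatever run record the evaluator inspects:

```rust
/// Hypothetical sketch of programmatic success criteria that are checked
/// mechanically, with no LLM judgment involved.
enum Criterion {
    ToolUsed(String),
    ResponseContains(String),
}

/// Invented stand-in for the data a finished scenario run exposes.
struct Transcript {
    tools_called: Vec<String>,
    response: String,
}

impl Criterion {
    fn passes(&self, t: &Transcript) -> bool {
        match self {
            Criterion::ToolUsed(name) => t.tools_called.iter().any(|c| c == name),
            Criterion::ResponseContains(needle) => t.response.contains(needle.as_str()),
        }
    }
}

fn main() {
    let t = Transcript {
        tools_called: vec!["time".into()],
        response: "The year is 2025.".into(),
    };
    assert!(Criterion::ToolUsed("time".into()).passes(&t));
    assert!(!Criterion::ResponseContains("weather".into()).passes(&t));
}
```

A scenario then passes when every declared criterion returns true, which keeps the check deterministic and cheap compared to judge-based scoring.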
feat: add initial benchmark scenario suite (12 scenarios across 5 categories)
Scenarios cover tool_selection, tool_chaining, error_recovery, efficiency, and memory_operations. All are loaded from JSON, with a deserialization validation test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add benchmark runner with BenchChannel and InstrumentedLlm
BenchChannel is a minimal Channel implementation for benchmarks. InstrumentedLlm wraps any LlmProvider to capture per-call metrics. The runner creates a fresh agent per scenario, evaluates success criteria, and produces a RunResult with timing, token, and cost metrics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add baseline management, reports, and benchmark entry point
- baseline.rs: load/save/promote benchmark results
- report.rs: format comparison reports with regression detection
- benchmark_runner.rs: integration test with a real LLM (feature-gated)
- Add the benchmark feature flag to Cargo.toml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: apply cargo fmt to benchmark module
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add multi-turn scenario types with setup, judge, ResponseNotContains
Add BenchScenario, Turn, TurnAssertions, JudgeConfig, ScenarioSetup, WorkspaceSetup, and SeedDocument types for multi-turn benchmark scenarios. Add a ResponseNotContains criterion variant. Add a TurnAssertions::to_criteria() converter for backward compat with the existing evaluation engine. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add JSON scenario loader with recursive discovery and tag filter
Add load_bench_scenarios() for the new BenchScenario format, with recursive directory traversal and tag-based filtering. Create 4 initial trajectory scenarios across the tool-selection, multi-turn, and efficiency categories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): multi-turn runner with workspace seeding and per-turn metrics
Add run_bench_scenario(), which loops over BenchScenario turns, seeds workspace documents, collects per-turn metrics (tokens, tool calls, wall time), and evaluates per-turn assertions. Add TurnMetrics to metrics.rs and clear_for_next_turn() to BenchChannel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add LLM-as-judge scoring with prompt formatting and score parsing
Create judge.rs with format_judge_prompt, parse_judge_score, and judge_turn. Wire it into run_bench_scenario for turns with a judge config; scores below min_score fail the turn. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
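Judge-score parsing can be sketched as below. The `SCORE: n/10` marker and scale are assumptions for illustration; the real prompt format lives in judge.rs and may differ.

```rust
// Hypothetical sketch: pull a numeric score out of a free-form judge reply
// such as "Reasoning...\nSCORE: 7/10". Marker and scale are illustrative.
fn parse_judge_score(reply: &str) -> Option<f64> {
    let line = reply.lines().find(|l| l.trim_start().starts_with("SCORE:"))?;
    let value = line.trim_start().trim_start_matches("SCORE:").trim();
    // Accept either "7" or "7/10".
    let numerator = value.split('/').next()?.trim();
    numerator.parse().ok()
}

fn main() {
    assert_eq!(parse_judge_score("Reasoning...\nSCORE: 7/10"), Some(7.0));
    assert_eq!(parse_judge_score("SCORE: 4"), Some(4.0));
    assert_eq!(parse_judge_score("no score here"), None);
    // A turn fails when the parsed score falls below the turn's min_score.
    let min_score = 6.0;
    assert!(parse_judge_score("SCORE: 4").unwrap() < min_score);
}
```

Returning `Option` keeps the failure mode explicit: an unparseable judge reply can be treated as a failed turn rather than silently scored as zero.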
feat(benchmark): add CLI subcommand (ironclaw benchmark)
Add BenchmarkCommand with --tags, --scenario, --no-judge, --timeout, and --update-baseline flags. Wire it into the Command enum and main.rs dispatch. Feature-gated behind the benchmark flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): per-scenario JSON output with full trajectory
Add save_scenario_results(), which writes per-scenario JSON files alongside the run summary. Each scenario gets its own file with a turn_metrics trajectory. Update the CLI to use the new output format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add ToolRegistry::retain_only and wire tool filtering in scenarios
Add a retain_only() method to ToolRegistry that filters tools down to a given allowlist. Wire this into run_bench_scenario() so that when a scenario specifies a tools list in its setup, only those tools are available during the benchmark run. Includes two tests for the new method: one verifying filtering works and one verifying empty input is a no-op. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): wire identity overrides into workspace before agent start
Add a seed_identity() helper that writes identity files (IDENTITY.md, USER.md, etc.) into the workspace before the agent starts, so that workspace.system_prompt() picks them up. Wire it into run_bench_scenario() after workspace seeding. Include a test that verifies identity files are written and readable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add --parallel and --max-cost CLI flags
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(benchmark): use feature-conditional snapshot names for CLI help tests
Prevents snapshot conflicts between default (no benchmark) and all-features (with benchmark) builds by using separate snapshot names per feature set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): parallel execution with JoinSet and budget cap enforcement
Replace the sequential loop in run_all_bench() with parallel execution using a JoinSet plus a semaphore when config.parallel > 1. Add budget cap enforcement that skips remaining scenarios when max_total_cost_usd is exceeded. Track the skipped count in RunResult.skipped_scenarios and display it in format_report(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add tool restriction and identity override test scenarios
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: fix formatting for Phase 3
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(benchmark): add SkillRegistry::retain_only and wire skill filtering in scenarios
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(test): add declarative expects to trace fixtures, split infra tests
Add a TraceExpects struct with 9 optional assertion fields (response_contains, tools_used, all_tools_succeeded, etc.) that can be declared in fixture JSON instead of hand-written Rust. Add verify_expects() and run_recorded_trace() so recorded trace tests become one-liners. Split trace infra tests (deserialization, backward compat) into tests/trace_format.rs, which doesn't require the libsql feature gate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
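The optional-field design can be sketched as below. Field names follow the commit message, but this struct shows only 3 of the 9 fields, and the `verify_expects` signature is an invented simplification of the real helper:

```rust
// Hypothetical sketch of declarative expects: every field is optional, so a
// fixture only pays for the assertions it actually declares.
#[derive(Default)]
struct TraceExpects {
    response_contains: Option<Vec<String>>,
    tools_used: Option<Vec<String>>,
    all_tools_succeeded: Option<bool>,
}

/// Check a finished run (response text plus (tool, succeeded) pairs)
/// against the declared expects; None fields are skipped entirely.
fn verify_expects(e: &TraceExpects, response: &str, tools: &[(&str, bool)]) -> Result<(), String> {
    if let Some(needles) = &e.response_contains {
        for n in needles {
            if !response.contains(n.as_str()) {
                return Err(format!("response missing {n:?}"));
            }
        }
    }
    if let Some(names) = &e.tools_used {
        for n in names {
            if !tools.iter().any(|(t, _)| *t == n.as_str()) {
                return Err(format!("tool {n:?} never ran"));
            }
        }
    }
    if e.all_tools_succeeded == Some(true) && tools.iter().any(|(_, ok)| !ok) {
        return Err("a tool call failed".into());
    }
    Ok(())
}

fn main() {
    let expects = TraceExpects {
        response_contains: Some(vec!["2025".into()]),
        tools_used: Some(vec!["time".into()]),
        all_tools_succeeded: Some(true),
    };
    assert!(verify_expects(&expects, "The year is 2025.", &[("time", true)]).is_ok());
    assert!(verify_expects(&expects, "The year is 2025.", &[("time", false)]).is_err());
}
```

With serde `Option` fields deserializing from JSON, a fixture's `expects` block becomes the whole test body, which is what lets the recorded-trace tests shrink to one-liners.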
refactor(test): add expects to all trace fixtures, simplify e2e tests
Add declarative expects blocks to all 19 trace fixture JSONs across the spot/, coverage/, advanced/, and root directories. Update all 8 e2e test files to use verify_trace_expects() / run_and_verify_trace(), replacing ~270 lines of hand-written assertions with fixture-driven verification. Tests that check things beyond expects (file content on disk, metrics, event ordering) keep those extra assertions alongside the declarative ones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(test): adapt tests to AppBuilder refactor, fix formatting
Update test files to work with the refactored TestRigBuilder that uses AppBuilder::build_all() (removing the with_tools/with_workspace methods). Update the telegram_check fixture to use tool_list instead of echo. Fix cargo fmt issues in src/llm/mod.rs and src/llm/recording.rs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor(test): deduplicate support unit tests into single binary
Support modules (assertions, cleanup, test_channel, test_rig, trace_llm) had #[cfg(test)] mod tests blocks that were compiled and run 12 times, once per e2e test binary that declares `mod support;`. Extracted all 29 support unit tests into a dedicated `tests/support_unit_tests.rs` so they run exactly once. [skip-regression-check] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: fix trailing newlines in support files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review of Illia's commits (a53a852..49dfad6)

Overall these are solid improvements -- declarative expects, AppBuilder integration, and test deduplication are all good moves. Three issues worth discussing:
1. Duplicate type systems between …
refactor(test): unify trace types and fix recorded multi-turn replay
Import shared types (TraceStep, TraceResponse, TraceToolCall, RequestHint, ExpectedToolResult, MemorySnapshotEntry, HttpExchange*) from ironclaw::llm::recording instead of redefining them in trace_llm.rs. Fix the flat-steps deserializer to split at UserInput boundaries into multiple turns, instead of filtering them out and wrapping everything into a single turn. This enables recorded multi-turn traces to be replayed as proper multi-turn conversations via run_trace(). [skip-regression-check] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
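The deserializer fix can be sketched as a splitting pass over the flat step list. `Step` and `Turn` here are simplified stand-ins for the real trace types:

```rust
// Hypothetical sketch of the flat-steps fix: instead of filtering UserInput
// steps out and wrapping everything in one turn, split the flat list into
// one turn per UserInput boundary.
#[derive(Debug, PartialEq)]
enum Step {
    UserInput(String),
    Response(String),
}

#[derive(Debug, PartialEq)]
struct Turn {
    user_input: String,
    steps: Vec<String>,
}

fn split_into_turns(flat: Vec<Step>) -> Vec<Turn> {
    let mut turns: Vec<Turn> = Vec::new();
    for step in flat {
        match step {
            // Each UserInput opens a new turn...
            Step::UserInput(text) => turns.push(Turn { user_input: text, steps: Vec::new() }),
            // ...and every response step belongs to the most recent turn.
            Step::Response(text) => {
                if let Some(turn) = turns.last_mut() {
                    turn.steps.push(text);
                }
            }
        }
    }
    turns
}

fn main() {
    let flat = vec![
        Step::UserInput("hi".into()),
        Step::Response("hello".into()),
        Step::UserInput("what time is it?".into()),
        Step::Response("3pm".into()),
    ];
    let turns = split_into_turns(flat);
    assert_eq!(turns.len(), 2); // two turns, not one collapsed turn
    assert_eq!(turns[1].user_input, "what time is it?");
}
```

With the old behavior the second `UserInput` was dropped and both responses landed in a single turn, which is why recorded multi-turn traces could not be replayed as real conversations.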
fix(test): fix CI failures - unused imports and missing struct fields
- Add #[allow(unused_imports)] on pub use re-exports in trace_llm.rs (types are re-exported for downstream test files, not used locally)
- Add `..` to the ToolCompleted pattern in test_channel.rs to match the new `error` and `parameters` fields
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(test): fix CI failures after merging main
- Add missing `error` and `parameters` fields to ToolCompleted constructors in support_unit_tests.rs
- Add `..` to the ToolCompleted pattern match in support_unit_tests.rs
- Add #[allow(dead_code)] to CleanupGuard, the LlmTrace impl, and the TraceLlm impl (only used behind #[cfg(feature = "libsql")])
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(test): address review feedback on E2E test infrastructure
- Switch wait_for_responses polling to exponential backoff (50ms-500ms) and raise the default timeout from 15s to 30s to reduce CI flakiness (#1)
- Strengthen the prompt_injection_resilience test with a positive safety layer assertion via has_safety_warnings(); enable injection_check (#2)
- Add an assert_tool_order() helper and a tools_order field in TraceExpects for verifying tool execution ordering in multi-step traces (#3)
- Document the TraceLlm sequential-call assumption for concurrency (#6)
- Clean up CleanupGuard with a PathKind enum instead of shotgun remove_file + remove_dir_all on every path (#8)
- Fix coverage.sh: default to --lib only, fix multi-filter syntax, add a COV_ALL_TARGETS option
- Add coverage/ to .gitignore
- Remove planning docs from the PR
[skip-regression-check]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
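The backoff change can be sketched as follows. The `wait_until` helper and its `done` predicate are hypothetical; the real helper polls TestChannel for responses.

```rust
use std::time::{Duration, Instant};

// Sketch of exponential-backoff polling: start at 50ms, double each miss,
// cap at 500ms, and give up at the deadline.
fn wait_until(timeout: Duration, mut done: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + timeout;
    let mut delay = Duration::from_millis(50);
    loop {
        if done() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Never sleep past the deadline.
        std::thread::sleep(delay.min(deadline.saturating_duration_since(Instant::now())));
        delay = (delay * 2).min(Duration::from_millis(500)); // 50 -> 100 -> 200 -> 400 -> 500ms cap
    }
}

fn main() {
    let mut calls = 0;
    let ok = wait_until(Duration::from_secs(30), || {
        calls += 1;
        calls >= 3 // "responses arrived" on the third poll
    });
    assert!(ok);
    assert_eq!(calls, 3);
}
```

Fast tests still finish on the first 50ms poll, while slow CI runs back off instead of hammering the channel, and the longer 30s ceiling only matters when something is genuinely stuck.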
fix: address PR review - use HashSet in retain_only, improve skill test
- Use a HashSet for O(N+M) lookup in SkillRegistry::retain_only and ToolRegistry::retain_only instead of a linear scan
- Strengthen test_retain_only_empty_is_noop in SkillRegistry by pre-populating it with a skill before asserting the no-op behavior
[skip-regression-check]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
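The O(N+M) shape can be sketched as below. The registry is modeled as a plain name-to-tool map for illustration; the real `retain_only` lives on ToolRegistry and SkillRegistry.

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the HashSet-based retain_only: build the allowlist set once
// (O(M)), then retain in a single pass over the registry (O(N)).
fn retain_only(tools: &mut HashMap<String, String>, keep: &[&str]) {
    if keep.is_empty() {
        return; // empty allowlist is a no-op, not "remove everything"
    }
    let keep: HashSet<&str> = keep.iter().copied().collect();
    tools.retain(|name, _| keep.contains(name.as_str()));
}

fn main() {
    let mut tools: HashMap<String, String> = [
        ("time".to_string(), "time tool".to_string()),
        ("echo".to_string(), "echo tool".to_string()),
        ("read_file".to_string(), "fs tool".to_string()),
    ]
    .into();
    retain_only(&mut tools, &["time", "echo"]);
    assert_eq!(tools.len(), 2);
    retain_only(&mut tools, &[]);
    assert_eq!(tools.len(), 2); // no-op, registry untouched
}
```

A naive version would call `keep.contains(name)` on the slice inside `retain`, which is O(N*M); hoisting the allowlist into a HashSet is what the review asked for.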
fix(test): revert incorrect safety layer assertion in injection test
The safety layer sanitizes tool output, not user input. The injection test sends a malicious user message with no tools called, so the safety layer never fires. Reverted to the original test, which correctly validates that the LLM refuses via trace expects. Also fixed a case-sensitive request hint ("ignore" -> "Ignore") to suppress a noisy warning. [skip-regression-check]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: clean stale profdata before coverage run
Adds `cargo llvm-cov clean` before each run to prevent "mismatched data" warnings from stale instrumentation profiles. [skip-regression-check]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: fix formatting in retain_only test
[skip-regression-check]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The squashed merge commit repeats the messages above and additionally includes:

* feat(benchmark): add --json flag for machine-readable output
* ci: add GitHub Actions benchmark workflow (manual trigger)
* refactor(benchmark): remove in-tree benchmark harness, keep retain_only utilities
  Move benchmark-specific code out of ironclaw in preparation for the nearai/benchmarks trajectory adapter. This removes:
  - src/benchmark/ (runner, scenarios, metrics, judge, report, etc.)
  - src/cli/benchmark.rs and the Benchmark CLI subcommand
  - the benchmarks/ data directory (scenarios + trajectories)
  - .github/workflows/benchmark.yml
  - the "benchmark" Cargo feature flag
  What remains: ToolRegistry::retain_only() and SkillRegistry::retain_only(), plus test support types (TraceMetrics, InstrumentedLlm) inlined into tests/support/ instead of re-exported from the deleted module.
* docs: add README for LLM trace fixture format
  Documents the trajectory JSON format, response types, request hints, directory structure, and how to write new traces.
* feat(test): unify trace format around turns, add multi-turn support
  Introduce a TraceTurn type that groups user_input with LLM response steps, making traces self-contained conversation trajectories. Add run_trace() to TestRig for automatic multi-turn replay. Backward-compatible: flat "steps" JSON is deserialized as a single turn transparently. Includes all trace fixtures (spot, coverage, advanced), plan docs, and new e2e tests for steering, error recovery, long chains, memory, and prompt injection resilience.
* fix(test): fix CI failures after merging main
  - Fix the tool_json fixture: use the "data" parameter (not "input") to match the JsonTool schema
  - Fix the status_events test: remove the assertion for a "time" tool that isn't in the fixture (only "echo" calls are used)
  - Allow dead_code in the test support metrics/instrumented_llm modules (utilities for future benchmark tests)
  [skip-regression-check]
* Working on recording traces and testing them
* Adding coverage running script

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com>

Summary
- End-to-end trace test rig (TraceLlm + TestRig) for replaying LLM traces against the full agent loop
- TraceTurn groups user_input with LLM response steps, making traces self-contained conversation trajectories
- Flat "steps" JSON (legacy) is deserialized as a single turn transparently
- TestRig::run_trace() for automatic multi-turn replay (send user messages, collect responses, advance turns)
- Fixture suites: spot/ (quick smoke tests), coverage/ (tool/feature coverage), advanced/ (multi-turn, error recovery, steering, iteration limits, prompt injection)
- In-tree benchmark harness removed, keeping the retain_only utilities
- --json flag for machine-readable benchmark output
- SkillRegistry::retain_only and skill filtering wired into scenarios

Test plan

- cargo test trace_llm: unit tests for trace types, deserialization (flat + turns), backward compat
- cargo test e2e_spot: spot check smoke tests pass
- cargo test e2e_advanced: multi-turn, steering, error recovery, long chains, memory, and injection tests pass
- cargo test e2e_status: status event tests pass

Generated with Claude Code