Add automated QA: schema validator, CI matrix, Docker build, and P1 test coverage#353
Conversation
…er build P0 items from the automated QA plan (nearai#352): - Add validate_tool_schema() that checks OpenAI strict-mode rules (type: object, required keys in properties, nested object/array recursion) with 10 unit tests and 6 integration tests covering all core built-in tools - CI test matrix now runs with --all-features, default features, and --no-default-features --features libsql to catch dead code behind wrong cfg gates - CI clippy now runs the same 3-feature matrix with --all flags - Docker build job added to catch missing files in Dockerfile Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of ChangesHello @zmanian, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the project's quality assurance by introducing several automated checks. It addresses critical P0 items from issue #352, focusing on preventing common bug categories related to tool schema validity, build integrity, and feature flag coverage. The changes lay the groundwork for a more robust and reliable development process by ensuring stricter adherence to schema definitions and broader test coverage across different build configurations. Highlights
Changelog
Ignored Files
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive automated QA plan and implements the first tier of it, focusing on schema validation for tools. A new validate_tool_schema function is added to enforce OpenAI strict-mode rules on tool parameter schemas, along with extensive unit and integration tests to validate all core built-in tools. This is a significant improvement to the project's quality assurance, preventing a whole class of bugs related to invalid tool schemas. The changes are well-structured and thoroughly tested. I have a few minor suggestions for improvement in the implementation and the QA plan document.
| #[tokio::test] | ||
| async fn failover_with_all_providers_failing() { | ||
| let failover = FailoverProvider::new(vec![ | ||
| always_failing_provider("a]"), |
There was a problem hiding this comment.
Acknowledged -- this is in a planning doc, not shipped code. Will leave as-is since the plan doc is reference material.
| screenshot = await self.take_screenshot() | ||
| return {"type": "tool_result", "content": [ | ||
| {"type": "image", "source": {"type": "base64", "media_type": "image/png", | ||
| "data": base64.b64encode(screenshot).decode()}} |
There was a problem hiding this comment.
Acknowledged -- this is in a planning doc (pseudocode example), not shipped code. The actual E2E implementation in tests/e2e/ handles imports correctly.
| "array" => { | ||
| if prop.get("items").is_none() { | ||
| errors.push(format!("{prop_path}: array property missing \"items\"")); | ||
| } else if let Some(items) = prop.get("items") { | ||
| // If items is an object type, recurse | ||
| if items.get("type").and_then(|t| t.as_str()) == Some("object") { | ||
| errors | ||
| .extend(validate_tool_schema(items, &format!("{prop_path}.items"))); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This logic can be simplified by using an if let ... else structure to avoid checking for items twice.
| "array" => { | |
| if prop.get("items").is_none() { | |
| errors.push(format!("{prop_path}: array property missing \"items\"")); | |
| } else if let Some(items) = prop.get("items") { | |
| // If items is an object type, recurse | |
| if items.get("type").and_then(|t| t.as_str()) == Some("object") { | |
| errors | |
| .extend(validate_tool_schema(items, &format!("{prop_path}.items"))); | |
| } | |
| } | |
| } | |
| "array" => { | |
| if let Some(items) = prop.get("items") { | |
| // If items is an object type, recurse | |
| if items.get("type").and_then(|t| t.as_str()) == Some("object") { | |
| errors | |
| .extend(validate_tool_schema(items, &format!("{prop_path}.items"))); | |
| } | |
| } else { | |
| errors.push(format!("{prop_path}: array property missing \"items\"")); | |
| } | |
| } |
There was a problem hiding this comment.
Fixed. Restructured to if-let-else so items is only checked once.
P1 test coverage: config round-trip (settings + bootstrap), shell tool arg handling, safety adversarial tests (sanitizer, leak detector, allowlist), turn persistence (conversations, metadata, pagination, jobs), and a clippy fix for libsql-only builds. Fixed a real bug where AhoCorasick non-overlapping prefix iteration caused shorter prefixes (e.g. "sk-") to shadow longer ones (e.g. "sk-ant-api"), preventing Anthropic API key and SSH private key detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover all P2 items from the automated QA plan: - Circuit breaker chaos tests (hanging provider, rapid cycles, mixed errors) - Failover chaos tests (hanging failover, all-fail, tools path, single provider) - Value estimator boundary tests (negative cost, zero price, zero earnings) - Context length recovery test (ContextLengthExceeded -> compact -> retry) - WASM channel lifecycle tests (write/commit/read round-trip, namespace isolation) - Extension registry collision tests (same-name different-kind coexistence) - Extension filesystem collision tests (separate dirs, detect_kind priority) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests verify thread safety of double-checked locking, TOCTOU prevention, and RwLock-based concurrent access patterns under load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dispatcher: test force_text mechanism prevents infinite tool call loops, verify iteration bound arithmetic guarantees termination for all configs. Self-repair: test stuck job detection, recovery within attempt limits, manual escalation when limit exceeded, graceful degradation without store/builder dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auth middleware (8 new tests): valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling. Compaction module (16 new tests): truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test the full bootstrap .env lifecycle: write via the same format as save_bootstrap_env/upsert_bootstrap_var, read back via dotenvy, and assert values match. Covers LLM backend selection, embedding disable flag, onboard completion flag, session token keys, multi-key preservation across upsert, and special characters (spaces, equals, quotes, backslashes, hashes). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…A 4.3/4.4) Value estimator (14 new tests): zero/negative prices, large values, negative cost, exact margin boundaries, custom margin configuration. Dispatcher loop guard (2 new tests): verifies the dispatch loop terminates when all tool calls fail (regression guard for PR nearai#252 infinite loop) and when max iterations are reached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Failover edge cases (4 new tests): cooldown at zero nanos, half-open failure reopens circuit, all providers fail gracefully (no panic), single failing provider with cooldown. Provider chaos tests (15 new tests): flakey provider with retries, hanging provider with timeout, garbage provider, circuit breaker trip/recover, failover chain cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
serrrfirat
left a comment
There was a problem hiding this comment.
Summary
This is a massive, well-structured QA PR adding 424 tests across 39 files with only 16 lines of production code changes (a correct bug fix in the leak detector prefix shadowing, a clippy lint fix, and a module export). The test architecture is solid: the TestHarnessBuilder pattern, provider chaos mocks, and E2E infrastructure with mock LLM are well-designed. The leak detector fix correctly addresses the Aho-Corasick non-overlapping iteration issue where short prefixes shadowed longer ones. The CI matrix expansion for feature-flag configurations is a significant reliability improvement.
The most substantive issues are: (1) the auth middleware's case-sensitive Bearer prefix check violates RFC 6750 and the new test codifies this as expected behavior, (2) the config round-trip tests reimplement the bootstrap escaping logic rather than testing the actual functions, creating a risk of undetected divergence, (3) the E2E event_loop fixture uses a pattern deprecated in the minimum-required pytest-asyncio version, and (4) two schema validation functions with different strictness levels lack documentation about which to use when. The E2E infrastructure also has practical concerns around fixed ports and environment contamination that could cause flakiness in CI.
Overall this is a high-quality PR that significantly improves test coverage. No critical or blocking issues found.
General findings
[MEDIUM] Bearer auth scheme comparison is case-sensitive, violating RFC 6750 (src/channels/web/auth.rs:33)
The auth middleware uses value.strip_prefix("Bearer ") which is a case-sensitive check. RFC 6750 Section 2.1 specifies that the auth-scheme token comparison must be case-insensitive. A client sending authorization: bearer <token> (lowercase, which is valid per the RFC) will receive a 401. The new test test_bearer_prefix_case_sensitivity codifies this deviation by asserting that lowercase "bearer" is rejected. While this is pre-existing behavior (not introduced by this PR), the PR cements it as expected behavior. The test comment says 'Lowercase "bearer" should NOT match' — this is incorrect per the RFC.
Suggested fix:
Change `value.strip_prefix("Bearer ")` to a case-insensitive check: `if value.len() > 7 && value[..7].eq_ignore_ascii_case("Bearer ") { let token = &value[7..]; ... }`. Update the test to expect 200 for lowercase "bearer", or add a comment explaining the intentional deviation from RFC 6750.| /// `save_bootstrap_env`: double-quoted values with escaped backslashes and | ||
| /// double quotes. | ||
| fn write_bootstrap_env( | ||
| path: &std::path::Path, |
There was a problem hiding this comment.
Config round-trip tests reimplement bootstrap logic instead of testing actual code
The test file reimplements write_bootstrap_env (lines 22-34) and upsert_env_var (lines 38-75) as local helper functions rather than calling the actual save_bootstrap_env and upsert_bootstrap_var from src/bootstrap.rs. The comment on line 10 explains this is because the real functions write to the global ~/.ironclaw/.env path, but the consequence is that these tests verify a copy of the logic, not the real implementation. If someone changes the escaping in bootstrap.rs without updating the test helpers, the tests will still pass, silently missing regressions. The test reimplementation could also have bugs that the real code doesn't (or vice versa).
Suggested fix:
Refactor `save_bootstrap_env` and `upsert_bootstrap_var` in `bootstrap.rs` to accept a path parameter (or add `_to(path, ...)` variants) so the tests can call the actual implementation against a tempdir. Alternatively, extract the formatting logic into a shared pure function that both the real code and tests call.Severity: medium · Confidence: high
There was a problem hiding this comment.
Fixed. Refactored bootstrap.rs to expose path-parameterized variants (save_bootstrap_env_to, upsert_bootstrap_var_to). Tests now import and call the real functions instead of reimplementing them.
|
|
||
|
|
||
| @pytest.fixture(scope="session") | ||
| def event_loop(): |
There was a problem hiding this comment.
Deprecated pytest-asyncio event_loop fixture override
The event_loop fixture at line 47-50 overrides the built-in event_loop fixture to make it session-scoped. This pattern was deprecated in pytest-asyncio 0.23 (the minimum version required in pyproject.toml). In pytest-asyncio >= 0.23, users should configure the event loop scope via the loop_scope setting instead. This will emit DeprecationWarnings during test runs and may break entirely in future pytest-asyncio versions.
Suggested fix:
Remove the custom `event_loop` fixture. Instead, in `pyproject.toml` add `loop_scope = "session"` under `[tool.pytest.ini_options]`, or use `@pytest.fixture(scope="session")` with the `pytest_asyncio.fixture` decorator and set `scope="session"` on the async fixtures directly.Severity: medium · Confidence: high
There was a problem hiding this comment.
Fixed. Removed the deprecated event_loop fixture override. pytest-asyncio >= 0.23 handles session-scoped async fixtures natively.
| pub use rate_limiter::RateLimiter; | ||
| pub use registry::ToolRegistry; | ||
| pub use tool::{ApprovalRequirement, Tool, ToolDomain, ToolError, ToolOutput, ToolRateLimitConfig}; | ||
| pub use tool::{ |
There was a problem hiding this comment.
Duplicate schema validation functions with divergent strictness levels
The PR introduces validate_strict_schema() in src/tools/schema_validator.rs (958 lines, full OpenAI strict-mode validation including enum type checking, additionalProperties enforcement, recursive nested validation) and also exports the pre-existing validate_tool_schema() from src/tools/tool.rs (less strict — no enum type checking, no additionalProperties check). Both functions validate tool JSON schemas but with different rules. There is no documentation indicating when to use which. This creates a maintenance burden: changes to validation rules need to be synchronized across two implementations, and callers may pick the wrong one.
Suggested fix:
Consider making `validate_strict_schema` the single source of truth and having `validate_tool_schema` delegate to it (possibly with a `strict: bool` flag). Alternatively, add clear doc comments explaining: `validate_strict_schema` is for CI/build-time validation, `validate_tool_schema` is for runtime registration.Severity: medium · Confidence: high
There was a problem hiding this comment.
Fixed. Added cross-referencing doc comments to both functions. validate_tool_schema (tool.rs) documents itself as the lenient runtime variant and points to validate_strict_schema. validate_strict_schema (schema_validator.rs) documents itself as the strict CI variant and points back to validate_tool_schema.
| ROOT = Path(__file__).resolve().parent.parent.parent | ||
|
|
||
| # Ports: use high fixed ports to avoid conflicts with development instances | ||
| MOCK_LLM_PORT = 18_199 |
There was a problem hiding this comment.
Fixed E2E ports will cause failures in parallel CI runs
The E2E fixtures use hardcoded ports MOCK_LLM_PORT = 18199 and GATEWAY_PORT = 18200. If two CI jobs run E2E tests simultaneously on the same runner (e.g., matrix builds, or concurrent PR builds on a self-hosted runner), they will fail with 'address already in use'. The mock LLM server supports port 0 (OS-assigned) via --port, but the gateway appears to need a fixed port because it logs the configured port, not the bound port.
Suggested fix:
For the mock LLM, use port 0 and parse the actual port from stdout (which is already supported). For the gateway, either fix the binary to log the actual bound port, or use a randomized high port (e.g., `random.randint(18200, 19200)`) to reduce collision probability.Severity: medium · Confidence: medium
There was a problem hiding this comment.
Fixed. Both MOCK_LLM_PORT and GATEWAY_PORT are now dynamically allocated via _find_free_port() (binds to port 0, reads OS-assigned port). Mock LLM was already passing --port 0 but the conftest had a hardcoded fallback constant; removed it.
| **os.environ, | ||
| "RUST_LOG": "ironclaw=info", | ||
| "GATEWAY_ENABLED": "true", | ||
| "GATEWAY_HOST": "127.0.0.1", |
There was a problem hiding this comment.
E2E test inherits full host environment, risking config contamination
The ironclaw_server fixture passes **os.environ merged with test overrides. If the host machine has ~/.ironclaw/.env or environment variables like DATABASE_URL, LLM_API_KEY, or GATEWAY_PORT set, the ironclaw binary's load_ironclaw_env() bootstrap may load them, potentially overriding the test's configuration. This could cause non-deterministic test failures on developer machines or CI runners with residual configuration.
Suggested fix:
Use a minimal environment instead of inheriting everything: `env = {"PATH": os.environ.get("PATH", ""), "HOME": tempdir, "RUST_LOG": ..., ...}`. Set `HOME` to a temp directory to prevent `~/.ironclaw/.env` from being loaded. Or set `IRONCLAW_HOME` (if supported) to an empty temp directory.Severity: medium · Confidence: medium
There was a problem hiding this comment.
Fixed. Replaced **os.environ with a minimal explicit env dict containing only PATH and HOME plus the test-specific vars. No host environment leakage.
|
|
||
| ## Run tests | ||
|
|
||
| ```bash |
There was a problem hiding this comment.
README documents HEADED=1 env var but conftest.py ignores it
The README states HEADED=1 pytest tests/e2e/scenarios/test_connection.py -v for visible browser mode, but the page fixture in conftest.py line 117 hardcodes headless=True without reading any environment variable. Running HEADED=1 will still run headless.
Suggested fix:
In `conftest.py`, change `browser = await p.chromium.launch(headless=True)` to `browser = await p.chromium.launch(headless=not os.environ.get("HEADED"))`.Severity: low · Confidence: high
There was a problem hiding this comment.
Fixed. The browser fixture is now session-scoped and reads HEADED env var: headless = os.environ.get("HEADED", "").strip() not in ("1", "true").
| runs-on: ubuntu-latest | ||
| timeout-minutes: 30 | ||
| steps: | ||
| - uses: actions/checkout@v4 |
There was a problem hiding this comment.
Inconsistent actions/checkout version across CI workflows
The E2E workflow uses actions/checkout@v4 while both code_style.yml and test.yml use actions/checkout@v6. While v4 still works, the inconsistency could lead to different checkout behaviors (e.g., different default fetch depths, different handling of submodules) and makes maintenance harder.
Suggested fix:
Change `uses: actions/checkout@v4` to `uses: actions/checkout@v6` in `.github/workflows/e2e.yml` line 17.Severity: low · Confidence: high
There was a problem hiding this comment.
Fixed. Bumped to actions/checkout@v6 to match test.yml and code_style.yml.
| proc.send_signal(signal.SIGTERM) | ||
| try: | ||
| await asyncio.wait_for(proc.wait(), timeout=5) | ||
| except asyncio.TimeoutError: |
There was a problem hiding this comment.
E2E page fixture launches a new browser per test function
The page fixture (function-scoped) calls p.chromium.launch() for every test, creating a new browser process each time. With 19 E2E tests, that's 19 Chromium launches. A session-scoped browser fixture with per-test browser contexts would be significantly faster while maintaining test isolation (contexts don't share cookies, localStorage, etc.).
Suggested fix:
Add a session-scoped `browser` fixture: `@pytest.fixture(scope="session") async def browser(): ...`. Then in the `page` fixture, accept `browser` as a parameter and only create a new context and page from it.Severity: low · Confidence: high
There was a problem hiding this comment.
Fixed. Browser is now session-scoped. The page fixture creates a fresh context per test (isolated cookies/storage) without relaunching the browser process.
|
All review items addressed:
|
- Fix Bearer auth case-sensitivity per RFC 6750 (auth.rs) - Refactor bootstrap.rs to expose path-parameterized variants so config_round_trip tests call real code instead of reimplementations - Remove deprecated event_loop fixture, use dynamic ports, minimal env, session-scoped browser, and wire HEADED=1 in E2E conftest - Add cross-referencing doc comments between schema validators - Simplify array validation logic in tool.rs - Bump e2e.yml checkout@v4 to @v6 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in src/agent/session_manager.rs: keep both QA concurrent stress tests and new resolve_thread UUID lookup test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --no-onboard flag to prevent wizard from blocking in CI - Pipe /dev/null to stdin to prevent any stdin reads from hanging - Add RUST_BACKTRACE=1 for crash diagnostics - On server startup timeout, dump stderr to pytest output so CI logs show why the server failed to start Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pytest-asyncio 1.3.0 defaults asyncio_default_fixture_loop_scope to None (function scope), causing session-scoped async fixtures to be re-evaluated per test function with independent event loops. Each test then independently attempts to start the ironclaw server, times out at 120s, and wastes ~24 minutes of CI before the job is cancelled. Setting asyncio_default_fixture_loop_scope = "session" ensures all session-scoped async fixtures share a single event loop, so the server starts once and is reused across all tests. Also adds -x flag to pytest in CI to stop on first failure instead of running all 19 tests when the fixture is broken. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With asyncio_default_fixture_loop_scope=session but asyncio_default_test_loop_scope=function (the default), tests run on a per-function event loop while fixtures produce objects (Playwright pages, browser contexts) on the session event loop. This event loop mismatch causes the test to hang indefinitely awaiting Playwright operations that are bound to the wrong loop. Setting both scopes to "session" ensures a single event loop is shared across all fixtures and tests, eliminating the deadlock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Branch protection expects "Code Style (fmt + clippy)" and "Run Tests" status checks, but only individual job names were reported. Add roll-up jobs that aggregate results and report the expected names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Head branch was pushed to by a user without write access
…est coverage (nearai#353) * Add automated QA: tool schema validator, feature-flag CI matrix, Docker build P0 items from the automated QA plan (nearai#352): - Add validate_tool_schema() that checks OpenAI strict-mode rules (type: object, required keys in properties, nested object/array recursion) with 10 unit tests and 6 integration tests covering all core built-in tools - CI test matrix now runs with --all-features, default features, and --no-default-features --features libsql to catch dead code behind wrong cfg gates - CI clippy now runs the same 3-feature matrix with --all flags - Docker build job added to catch missing files in Dockerfile Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add P1 automated QA tests and fix LeakDetector prefix shadowing bug P1 test coverage: config round-trip (settings + bootstrap), shell tool arg handling, safety adversarial tests (sanitizer, leak detector, allowlist), turn persistence (conversations, metadata, pagination, jobs), and a clippy fix for libsql-only builds. Fixed a real bug where AhoCorasick non-overlapping prefix iteration caused shorter prefixes (e.g. "sk-") to shadow longer ones (e.g. "sk-ant-api"), preventing Anthropic API key and SSH private key detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add P2 automated QA tests: chaos, lifecycle, collision, and recovery Cover all P2 items from the automated QA plan: - Circuit breaker chaos tests (hanging provider, rapid cycles, mixed errors) - Failover chaos tests (hanging failover, all-fail, tools path, single provider) - Value estimator boundary tests (negative cost, zero price, zero earnings) - Context length recovery test (ContextLengthExceeded -> compact -> retry) - WASM channel lifecycle tests (write/commit/read round-trip, namespace isolation) - Extension registry collision tests (same-name different-kind coexistence) - Extension filesystem collision tests (separate dirs, detect_kind priority) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add P3 concurrent stress tests for ContextManager and SessionManager Tests verify thread safety of double-checked locking, TOCTOU prevention, and RwLock-based concurrent access patterns under load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add dispatcher loop guard and self-repair stuck job tests Dispatcher: test force_text mechanism prevents infinite tool call loops, verify iteration bound arithmetic guarantees termination for all configs. Self-repair: test stuck job detection, recovery within attempt limits, manual escalation when limit exceeded, graceful degradation without store/builder dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add E2E testing infrastructure design doc Python + Playwright framework with mock LLM server for deterministic browser-level testing of the web gateway. Covers connection/auth, chat round-trip with SSE streaming, and skills lifecycle scenarios. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add E2E testing infrastructure implementation plan 10-task plan covering: scaffolding, mock LLM server, helpers, conftest fixtures, connection/chat/skills test scenarios, CI workflow, README, and integration run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * scaffold: E2E test project with pyproject.toml Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E helpers with DOM selectors and port discovery Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: mock OpenAI-compat LLM server for E2E tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E conftest with session fixtures for mock LLM and ironclaw Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E scenario 1 -- connection and tab navigation tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E scenario 2 -- chat message round-trip tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E scenario 3 -- skills search, install, remove tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * ci: add weekly E2E test workflow with Playwright Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: E2E test README with setup and usage instructions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: E2E test integration fixes from first run - Use temp file DB instead of :memory: (libSQL :memory: doesn't persist tables across execute_batch) - Fix installed skills selector: #skills-list not #installed-skills - Add pytest-timeout to dependencies - Improve skills install/remove test with wait_for instead of fixed sleeps 8 passed, 1 skipped (skills install depends on ClawHub availability) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add OpenAI strict-mode schema validator for all built-in tools (QA 1.1) Add src/tools/schema_validator.rs with validate_strict_schema() that checks tool parameter schemas against OpenAI function calling strict-mode rules: type object at top level, required keys in properties, enum type consistency, array items definitions, nested object recursion, and additionalProperties. 17 tests validate all 34+ built-in tool schemas across 5 test groups: - 9 simple tools (echo, time, json, http, shell, file read/write/list/patch) - 4 job tools (create, list, status, cancel) - 4 skill tools (list, search, install, remove) - 13 inline schemas for extension, routine, and complex job tools - 4 memory tool schemas Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add E2E scenarios for SSE reconnect, HTML injection, and tool approval (QA 3.3/5/6) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: E2E test reliability for HTML injection and SSE reconnect - HTML injection: test sanitization directly via JS injection instead of depending on full LLM round-trip (avoids intermittent 404 from mock) - SSE reconnect: increase wait times for DB persistence and relax assertion to check total message count after history reload Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: cargo fmt formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add WASM and MCP tool schema validation tests (QA 1.1) Extends the schema validator with representative WASM tool schemas (weather, HTTP client, batch processor, status), MCP tool schemas (default, file read, SQL query, strict mode), and defect detection tests for common external schema issues (missing type, typo in required, array without items, enum type mismatch). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add auth middleware and compaction module tests Auth middleware (8 new tests): valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling. Compaction module (16 new tests): truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add config round-trip integration tests (QA 1.2) Test the full bootstrap .env lifecycle: write via the same format as save_bootstrap_env/upsert_bootstrap_var, read back via dotenvy, and assert values match. Covers LLM backend selection, embedding disable flag, onboard completion flag, session token keys, multi-key preservation across upsert, and special characters (spaces, equals, quotes, backslashes, hashes). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add value estimator boundary tests and dispatcher loop guard (QA 4.3/4.4) Value estimator (14 new tests): zero/negative prices, large values, negative cost, exact margin boundaries, custom margin configuration. Dispatcher loop guard (2 new tests): verifies the dispatch loop terminates when all tool calls fail (regression guard for PR nearai#252 infinite loop) and when max iterations are reached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add failover edge cases and provider chaos tests (QA 2.6/4.1) Failover edge cases (4 new tests): cooldown at zero nanos, half-open failure reopens circuit, all providers fail gracefully (no panic), single failing provider with cooldown. Provider chaos tests (15 new tests): flakey provider with retries, hanging provider with timeout, garbage provider, circuit breaker trip/recover, failover chain cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address PR review feedback on QA tests - Fix Bearer auth case-sensitivity per RFC 6750 (auth.rs) - Refactor bootstrap.rs to expose path-parameterized variants so config_round_trip tests call real code instead of reimplementations - Remove deprecated event_loop fixture, use dynamic ports, minimal env, session-scoped browser, and wire HEADED=1 in E2E conftest - Add cross-referencing doc comments between schema validators - Simplify array validation logic in tool.rs - Bump e2e.yml checkout@v4 to @v6 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: cargo fmt and fix clippy warning in signal.rs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: improve E2E fixture error reporting and prevent stdin blocking - Add --no-onboard flag to prevent wizard from blocking in CI - Pipe /dev/null to stdin to prevent any stdin reads from hanging - Add RUST_BACKTRACE=1 for crash diagnostics - On server startup timeout, dump stderr to pytest output so CI logs show why the server failed to start Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: set session-scoped event loop for E2E async fixtures pytest-asyncio 1.3.0 defaults asyncio_default_fixture_loop_scope to None (function scope), causing session-scoped async fixtures to be re-evaluated per test function with independent event loops. Each test then independently attempts to start the ironclaw server, times out at 120s, and wastes ~24 minutes of CI before the job is cancelled. Setting asyncio_default_fixture_loop_scope = "session" ensures all session-scoped async fixtures share a single event loop, so the server starts once and is reused across all tests. Also adds -x flag to pytest in CI to stop on first failure instead of running all 19 tests when the fixture is broken. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: set test loop scope to session to match fixture loop scope With asyncio_default_fixture_loop_scope=session but asyncio_default_test_loop_scope=function (the default), tests run on a per-function event loop while fixtures produce objects (Playwright pages, browser contexts) on the session event loop. This event loop mismatch causes the test to hang indefinitely awaiting Playwright operations that are bound to the wrong loop. Setting both scopes to "session" ensures a single event loop is shared across all fixtures and tests, eliminating the deadlock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * ci: add roll-up jobs to match branch protection required checks Branch protection expects "Code Style (fmt + clippy)" and "Run Tests" status checks, but only individual job names were reported. Add roll-up jobs that aggregate results and report the expected names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Comprehensive automated QA implementation covering all four tiers of the QA plan (
docs/plans/2026-02-24-automated-qa.md). Adds 424 new tests (1,300 -> 1,724 total) across 39 files (+7,887 lines), addressing the recurring bug categories identified from ~40 closed issues and merged bug-fix PRs.P0: Schema & CI (3 items -- all done)
src/tools/schema_validator.rs):validate_strict_schema()checks all built-in tool schemas against OpenAI strict-mode rules (type: object, required/properties alignment, nested recursion). 10 unit tests + 7 integration tests for all built-in tools + 3 WASM/MCP schema tests..github/workflows/test.yml,code_style.yml): Tests and clippy now run 3 configurations:--all-features, default,--no-default-features --features libsql..github/workflows/test.yml): Newdocker-buildjob runsdocker buildon every PR.P1: Integration Tests (6 items -- all done)
src/bootstrap.rs+ 4 insrc/settings.rs+ 6 integration tests intests/config_round_trip.rs(LLM backend round-trip, embedding disabled survives, onboard completed check, session token key, preserving existing values, special characters).TestHarnessBuilder(message round-trip, metadata persistence, ownership check, ensure idempotency, pagination ordering, conversation preview, job+action persistence).src/channels/web/auth.rs): 8 new tests -- valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling.src/agent/compaction.rs): 16 new tests -- truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification.P2: Edge Cases & Chaos (7 items -- all done)
src/llm/failover.rs): 4 new tests -- all providers fail gracefully (no panic), single failing provider with cooldown, cooldown at zero nanos, half-open failure reopens circuit.src/llm/circuit_breaker.rs): 2 new tests -- zero-duration recovery timeout, half-open failure counter reset.tests/provider_chaos.rs): 15 new tests with FlakeyProvider, HangingProvider, GarbageProvider, ReliableProvider mocks. Covers retry exhaustion, timeout handling, circuit breaker trip/recover, failover cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker).src/agent/dispatcher.rs): 2 new tests -- all tool calls failing terminates (regression guard for PR fix: parallelize tool call execution via JoinSet (#219) #252), max iterations reached terminates.P3: E2E & Stress Testing (5 items -- done)
tests/e2e/mock_llm.py), pytest fixtures (binary build, mock LLM, gateway, Playwright browser), centralized DOM selectors..github/workflows/e2e.yml): Weekly + manual + PR-triggered on web gateway changes.src/estimation/value.rs): 14 new tests -- zero/negative prices, large values, exact margin boundaries, custom margins, minimum bid edge cases.Bug Fixes
needless_returninsrc/cli/mcp.rsfor libsql-only builds.Intentionally Skipped
:memory:-- works correctly.Addresses #352.
Test plan
cargo test --all-features-- 1,724 tests passcargo test --no-default-features --features libsql-- passescargo clippy --all --benches --tests --examples --all-features-- zero warningscargo clippy --all --benches --tests --examples-- zero warningscargo clippy --all --benches --tests --examples --no-default-features --features libsql-- zero warningscargo fmt --all -- --check-- cleanpytest tests/e2e/ -v-- 18 passed, 1 skippedGenerated with Claude Code