Add automated QA: schema validator, CI matrix, Docker build, and P1 test coverage by zmanian · Pull Request #353 · nearai/ironclaw

zmanian · 2026-02-24T19:30:25Z

Summary

Comprehensive automated QA implementation covering all four tiers of the QA plan (docs/plans/2026-02-24-automated-qa.md). Adds 424 new tests (1,300 -> 1,724 total) across 39 files (+7,887 lines), addressing the recurring bug categories identified from ~40 closed issues and merged bug-fix PRs.

P0: Schema & CI (3 items -- all done)

Tool schema validator (src/tools/schema_validator.rs): validate_strict_schema() checks all built-in tool schemas against OpenAI strict-mode rules (type: object, required/properties alignment, nested recursion). 10 unit tests + 7 integration tests for all built-in tools + 3 WASM/MCP schema tests.
Feature-flag CI matrix (.github/workflows/test.yml, code_style.yml): Tests and clippy now run 3 configurations: --all-features, default, --no-default-features --features libsql.
Docker build in CI (.github/workflows/test.yml): New docker-build job runs docker build on every PR.

P1: Integration Tests (6 items -- all done)

Config round-trip tests: 4 tests in src/bootstrap.rs + 4 in src/settings.rs + 6 integration tests in tests/config_round_trip.rs (LLM backend round-trip, embedding disabled survives, onboard completed check, session token key, preserving existing values, special characters).
Turn persistence tests: 7 tests using TestHarnessBuilder (message round-trip, metadata persistence, ownership check, ensure idempotency, pagination ordering, conversation preview, job+action persistence).
Shell tool realistic arg tests: Object-format args (the real LLM case), large output deadlock prevention (131KB), env scrubbing, injection pattern detection.
Safety adversarial tests: 6 sanitizer tests + 9 leak detector tests + 8 allowlist tests covering case insensitivity, subdomain bypass, credential URLs, IP addresses, multi-type detection, masking.
Auth middleware tests (src/channels/web/auth.rs): 8 new tests -- valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling.
Compaction module tests (src/agent/compaction.rs): 16 new tests -- truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification.

P2: Edge Cases & Chaos (7 items -- all done)

Failover edge cases (src/llm/failover.rs): 4 new tests -- all providers fail gracefully (no panic), single failing provider with cooldown, cooldown at zero nanos, half-open failure reopens circuit.
Circuit breaker edge cases (src/llm/circuit_breaker.rs): 2 new tests -- zero-duration recovery timeout, half-open failure counter reset.
Provider chaos tests (tests/provider_chaos.rs): 15 new tests with FlakeyProvider, HangingProvider, GarbageProvider, ReliableProvider mocks. Covers retry exhaustion, timeout handling, circuit breaker trip/recover, failover cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker).
Dispatcher loop guard (src/agent/dispatcher.rs): 2 new tests -- all tool calls failing terminates (regression guard for PR fix: parallelize tool call execution via JoinSet (#219) #252), max iterations reached terminates.
Context length recovery: ContextLengthExceeded triggers compaction and retry succeeds.
WASM channel lifecycle: workspace write/commit/read round-trip, overwrite across callbacks, emit order preservation, isolated namespaces.
Extension registry collisions: same-name different-kind entries coexist, tools/channels dirs separate, capabilities files independent.

P3: E2E & Stress Testing (5 items -- done)

E2E infrastructure (Python + Playwright): Mock OpenAI-compatible LLM server (tests/e2e/mock_llm.py), pytest fixtures (binary build, mock LLM, gateway, Playwright browser), centralized DOM selectors.
E2E scenarios (19 tests across 6 scenarios):
- Scenario 1 (Connection): auth, tab navigation, connection status
- Scenario 2 (Chat): message round-trip via SSE, multiple messages, empty message guard
- Scenario 3 (SSE Reconnect): status indicator, disconnect/reconnect, history preservation
- Scenario 4 (Skills): tab visibility, ClawHub search, install/remove lifecycle
- Scenario 5 (HTML Injection): XSS sanitization, user messages as plain text, no script elements
- Scenario 6 (Tool Approval): approval card rendering, approve/deny, parameter toggle
E2E CI workflow (.github/workflows/e2e.yml): Weekly + manual + PR-triggered on web gateway changes.
Concurrent stress tests: ContextManager (50 concurrent creates, max_jobs contention, 20 writers+readers, state transitions), SessionManager (30 concurrent get_or_create, 20 distinct users, 5 channels isolation, undo manager).
Value estimator boundaries (src/estimation/value.rs): 14 new tests -- zero/negative prices, large values, exact margin boundaries, custom margins, minimum bid edge cases.

Bug Fixes

LeakDetector prefix shadowing: AhoCorasick non-overlapping iteration caused shorter prefixes ("sk-") to shadow longer ones ("sk-ant-api"), preventing Anthropic API key and SSH private key detection.
Clippy needless_return in src/cli/mcp.rs for libsql-only builds.

Intentionally Skipped

E2E Scenario 7 (Onboarding Wizard): Requires PTY/pexpect infrastructure not viable in current E2E setup.
Test harness in-memory mode (2.1): Using tempdir-backed libSQL instead of :memory: -- works correctly.

Addresses #352.

Test plan

cargo test --all-features -- 1,724 tests pass
cargo test --no-default-features --features libsql -- passes
cargo clippy --all --benches --tests --examples --all-features -- zero warnings
cargo clippy --all --benches --tests --examples -- zero warnings
cargo clippy --all --benches --tests --examples --no-default-features --features libsql -- zero warnings
cargo fmt --all -- --check -- clean
E2E tests: pytest tests/e2e/ -v -- 18 passed, 1 skipped
CI matrix runs all 3 feature configurations without errors
Docker build job succeeds in CI

Generated with Claude Code

…er build P0 items from the automated QA plan (nearai#352): - Add validate_tool_schema() that checks OpenAI strict-mode rules (type: object, required keys in properties, nested object/array recursion) with 10 unit tests and 6 integration tests covering all core built-in tools - CI test matrix now runs with --all-features, default features, and --no-default-features --features libsql to catch dead code behind wrong cfg gates - CI clippy now runs the same 3-feature matrix with --all flags - Docker build job added to catch missing files in Dockerfile Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist · 2026-02-24T19:30:43Z

Summary of Changes

Hello @zmanian, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's quality assurance by introducing several automated checks. It addresses critical P0 items from issue #352, focusing on preventing common bug categories related to tool schema validity, build integrity, and feature flag coverage. The changes lay the groundwork for a more robust and reliable development process by ensuring stricter adherence to schema definitions and broader test coverage across different build configurations.

Highlights

Tool Schema Validation: Implemented validate_tool_schema() to check built-in tool schemas against OpenAI strict-mode rules, including 10 unit tests and 6 integration tests covering core built-in tools.
CI Test Matrix Expansion: Expanded the CI test matrix to run with --all-features, default features, and --no-default-features --features libsql to catch dead code behind incorrect #[cfg] gates.
CI Clippy Configuration: Configured CI clippy to run across the same three-feature matrix with comprehensive flags (--all --benches --tests --examples).
Docker Build in CI: Introduced a Docker build job in CI for every pull request to verify Dockerfile integrity and catch missing files.
Automated QA Plan Document: Included a comprehensive automated QA plan document (docs/plans/2026-02-24-automated-qa.md) outlining future QA initiatives.

Changelog

docs/plans/2026-02-24-automated-qa.md
- Added a new document outlining a comprehensive automated QA plan for the project.
src/tools/mod.rs
- Exported the new validate_tool_schema function for public use within the tools module.
src/tools/tool.rs
- Implemented the validate_tool_schema function to enforce OpenAI strict-mode rules for tool schemas.
- Added extensive unit tests for the new schema validation logic, covering various valid and invalid schema structures.
tests/tool_schema_validation.rs
- Added new integration tests to validate the schemas of all core built-in tools, ensuring they conform to strict OpenAI requirements.
- Included specific regression tests for json, http, time, and shell tool schemas.

Ignored Files

Ignored by pattern: .github/workflows/** (2)
- .github/workflows/code_style.yml
- .github/workflows/test.yml

Activity

The pull request was opened by zmanian to introduce automated QA measures and address P0 items from issue Automated QA: schema validation, integration tests, computer-use E2E, chaos testing #352.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a comprehensive automated QA plan and implements the first tier of it, focusing on schema validation for tools. A new validate_tool_schema function is added to enforce OpenAI strict-mode rules on tool parameter schemas, along with extensive unit and integration tests to validate all core built-in tools. This is a significant improvement to the project's quality assurance, preventing a whole class of bugs related to invalid tool schemas. The changes are well-structured and thoroughly tested. I have a few minor suggestions for improvement in the implementation and the QA plan document.

gemini-code-assist · 2026-02-24T19:34:27Z

+#[tokio::test]
+async fn failover_with_all_providers_failing() {
+    let failover = FailoverProvider::new(vec![
+        always_failing_provider("a]"),


There appears to be a typo in this code example. The string literal "a]" likely has an unintentional closing bracket. It should probably be "a".

Suggested change

always_failing_provider("a]"),

always_failing_provider("a"),

Acknowledged -- this is in a planning doc, not shipped code. Will leave as-is since the plan doc is reference material.

gemini-code-assist · 2026-02-24T19:34:27Z

+        screenshot = await self.take_screenshot()
+        return {"type": "tool_result", "content": [
+            {"type": "image", "source": {"type": "base64", "media_type": "image/png",
+             "data": base64.b64encode(screenshot).decode()}}


The base64 module is used here, but it's not imported in the ComputerUseAgent class snippet above. For completeness of the example, you should add import base64.

Acknowledged -- this is in a planning doc (pseudocode example), not shipped code. The actual E2E implementation in tests/e2e/ handles imports correctly.

gemini-code-assist · 2026-02-24T19:34:27Z

+                "array" => {
+                    if prop.get("items").is_none() {
+                        errors.push(format!("{prop_path}: array property missing \"items\""));
+                    } else if let Some(items) = prop.get("items") {
+                        // If items is an object type, recurse
+                        if items.get("type").and_then(|t| t.as_str()) == Some("object") {
+                            errors
+                                .extend(validate_tool_schema(items, &format!("{prop_path}.items")));
+                        }
+                    }
+                }


This logic can be simplified by using an if let ... else structure to avoid checking for items twice.

Suggested change

"array" => {

if prop.get("items").is_none() {

errors.push(format!("{prop_path}: array property missing \"items\""));

} else if let Some(items) = prop.get("items") {

// If items is an object type, recurse

if items.get("type").and_then(|t| t.as_str()) == Some("object") {

errors

.extend(validate_tool_schema(items, &format!("{prop_path}.items")));

}

}

}

"array" => {

if let Some(items) = prop.get("items") {

// If items is an object type, recurse

if items.get("type").and_then(|t| t.as_str()) == Some("object") {

errors

.extend(validate_tool_schema(items, &format!("{prop_path}.items")));

}

} else {

errors.push(format!("{prop_path}: array property missing \"items\""));

}

}

Fixed. Restructured to if-let-else so items is only checked once.

P1 test coverage: config round-trip (settings + bootstrap), shell tool arg handling, safety adversarial tests (sanitizer, leak detector, allowlist), turn persistence (conversations, metadata, pagination, jobs), and a clippy fix for libsql-only builds. Fixed a real bug where AhoCorasick non-overlapping prefix iteration caused shorter prefixes (e.g. "sk-") to shadow longer ones (e.g. "sk-ant-api"), preventing Anthropic API key and SSH private key detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Cover all P2 items from the automated QA plan: - Circuit breaker chaos tests (hanging provider, rapid cycles, mixed errors) - Failover chaos tests (hanging failover, all-fail, tools path, single provider) - Value estimator boundary tests (negative cost, zero price, zero earnings) - Context length recovery test (ContextLengthExceeded -> compact -> retry) - WASM channel lifecycle tests (write/commit/read round-trip, namespace isolation) - Extension registry collision tests (same-name different-kind coexistence) - Extension filesystem collision tests (separate dirs, detect_kind priority) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tests verify thread safety of double-checked locking, TOCTOU prevention, and RwLock-based concurrent access patterns under load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Dispatcher: test force_text mechanism prevents infinite tool call loops, verify iteration bound arithmetic guarantees termination for all configs. Self-repair: test stuck job detection, recovery within attempt limits, manual escalation when limit exceeded, graceful degradation without store/builder dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Auth middleware (8 new tests): valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling. Compaction module (16 new tests): truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Test the full bootstrap .env lifecycle: write via the same format as save_bootstrap_env/upsert_bootstrap_var, read back via dotenvy, and assert values match. Covers LLM backend selection, embedding disable flag, onboard completion flag, session token keys, multi-key preservation across upsert, and special characters (spaces, equals, quotes, backslashes, hashes). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…A 4.3/4.4) Value estimator (14 new tests): zero/negative prices, large values, negative cost, exact margin boundaries, custom margin configuration. Dispatcher loop guard (2 new tests): verifies the dispatch loop terminates when all tool calls fail (regression guard for PR nearai#252 infinite loop) and when max iterations are reached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Failover edge cases (4 new tests): cooldown at zero nanos, half-open failure reopens circuit, all providers fail gracefully (no panic), single failing provider with cooldown. Provider chaos tests (15 new tests): flakey provider with retries, hanging provider with timeout, garbage provider, circuit breaker trip/recover, failover chain cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

serrrfirat

Summary

This is a massive, well-structured QA PR adding 424 tests across 39 files with only 16 lines of production code changes (a correct bug fix in the leak detector prefix shadowing, a clippy lint fix, and a module export). The test architecture is solid: the TestHarnessBuilder pattern, provider chaos mocks, and E2E infrastructure with mock LLM are well-designed. The leak detector fix correctly addresses the Aho-Corasick non-overlapping iteration issue where short prefixes shadowed longer ones. The CI matrix expansion for feature-flag configurations is a significant reliability improvement.

The most substantive issues are: (1) the auth middleware's case-sensitive Bearer prefix check violates RFC 6750 and the new test codifies this as expected behavior, (2) the config round-trip tests reimplement the bootstrap escaping logic rather than testing the actual functions, creating a risk of undetected divergence, (3) the E2E event_loop fixture uses a pattern deprecated in the minimum-required pytest-asyncio version, and (4) two schema validation functions with different strictness levels lack documentation about which to use when. The E2E infrastructure also has practical concerns around fixed ports and environment contamination that could cause flakiness in CI.

Overall this is a high-quality PR that significantly improves test coverage. No critical or blocking issues found.

General findings

[MEDIUM] Bearer auth scheme comparison is case-sensitive, violating RFC 6750 (src/channels/web/auth.rs:33)

The auth middleware uses value.strip_prefix("Bearer ") which is a case-sensitive check. RFC 6750 Section 2.1 specifies that the auth-scheme token comparison must be case-insensitive. A client sending authorization: bearer <token> (lowercase, which is valid per the RFC) will receive a 401. The new test test_bearer_prefix_case_sensitivity codifies this deviation by asserting that lowercase "bearer" is rejected. While this is pre-existing behavior (not introduced by this PR), the PR cements it as expected behavior. The test comment says 'Lowercase "bearer" should NOT match' — this is incorrect per the RFC.

Suggested fix:

Change `value.strip_prefix("Bearer ")` to a case-insensitive check: `if value.len() > 7 && value[..7].eq_ignore_ascii_case("Bearer ") { let token = &value[7..]; ... }`. Update the test to expect 200 for lowercase "bearer", or add a comment explaining the intentional deviation from RFC 6750.

serrrfirat · 2026-02-26T11:52:50Z

+/// `save_bootstrap_env`: double-quoted values with escaped backslashes and
+/// double quotes.
+fn write_bootstrap_env(
+    path: &std::path::Path,


Config round-trip tests reimplement bootstrap logic instead of testing actual code

The test file reimplements write_bootstrap_env (lines 22-34) and upsert_env_var (lines 38-75) as local helper functions rather than calling the actual save_bootstrap_env and upsert_bootstrap_var from src/bootstrap.rs. The comment on line 10 explains this is because the real functions write to the global ~/.ironclaw/.env path, but the consequence is that these tests verify a copy of the logic, not the real implementation. If someone changes the escaping in bootstrap.rs without updating the test helpers, the tests will still pass, silently missing regressions. The test reimplementation could also have bugs that the real code doesn't (or vice versa).

Suggested fix:

Refactor `save_bootstrap_env` and `upsert_bootstrap_var` in `bootstrap.rs` to accept a path parameter (or add `_to(path, ...)` variants) so the tests can call the actual implementation against a tempdir. Alternatively, extract the formatting logic into a shared pure function that both the real code and tests call.

Severity: medium · Confidence: high

Fixed. Refactored bootstrap.rs to expose path-parameterized variants (save_bootstrap_env_to, upsert_bootstrap_var_to). Tests now import and call the real functions instead of reimplementing them.

serrrfirat · 2026-02-26T11:52:50Z

+
+
+@pytest.fixture(scope="session")
+def event_loop():


Deprecated pytest-asyncio event_loop fixture override

The event_loop fixture at line 47-50 overrides the built-in event_loop fixture to make it session-scoped. This pattern was deprecated in pytest-asyncio 0.23 (the minimum version required in pyproject.toml). In pytest-asyncio >= 0.23, users should configure the event loop scope via the loop_scope setting instead. This will emit DeprecationWarnings during test runs and may break entirely in future pytest-asyncio versions.

Suggested fix:

Remove the custom `event_loop` fixture. Instead, in `pyproject.toml` add `loop_scope = "session"` under `[tool.pytest.ini_options]`, or use `@pytest.fixture(scope="session")` with the `pytest_asyncio.fixture` decorator and set `scope="session"` on the async fixtures directly.

Severity: medium · Confidence: high

Fixed. Removed the deprecated event_loop fixture override. pytest-asyncio >= 0.23 handles session-scoped async fixtures natively.

serrrfirat · 2026-02-26T11:52:50Z

 pub use rate_limiter::RateLimiter;
 pub use registry::ToolRegistry;
-pub use tool::{ApprovalRequirement, Tool, ToolDomain, ToolError, ToolOutput, ToolRateLimitConfig};
+pub use tool::{


Duplicate schema validation functions with divergent strictness levels

The PR introduces validate_strict_schema() in src/tools/schema_validator.rs (958 lines, full OpenAI strict-mode validation including enum type checking, additionalProperties enforcement, recursive nested validation) and also exports the pre-existing validate_tool_schema() from src/tools/tool.rs (less strict — no enum type checking, no additionalProperties check). Both functions validate tool JSON schemas but with different rules. There is no documentation indicating when to use which. This creates a maintenance burden: changes to validation rules need to be synchronized across two implementations, and callers may pick the wrong one.

Suggested fix:

Consider making `validate_strict_schema` the single source of truth and having `validate_tool_schema` delegate to it (possibly with a `strict: bool` flag). Alternatively, add clear doc comments explaining: `validate_strict_schema` is for CI/build-time validation, `validate_tool_schema` is for runtime registration.

Severity: medium · Confidence: high

Fixed. Added cross-referencing doc comments to both functions. validate_tool_schema (tool.rs) documents itself as the lenient runtime variant and points to validate_strict_schema. validate_strict_schema (schema_validator.rs) documents itself as the strict CI variant and points back to validate_tool_schema.

serrrfirat · 2026-02-26T11:52:50Z

+ROOT = Path(__file__).resolve().parent.parent.parent
+
+# Ports: use high fixed ports to avoid conflicts with development instances
+MOCK_LLM_PORT = 18_199


Fixed E2E ports will cause failures in parallel CI runs

The E2E fixtures use hardcoded ports MOCK_LLM_PORT = 18199 and GATEWAY_PORT = 18200. If two CI jobs run E2E tests simultaneously on the same runner (e.g., matrix builds, or concurrent PR builds on a self-hosted runner), they will fail with 'address already in use'. The mock LLM server supports port 0 (OS-assigned) via --port, but the gateway appears to need a fixed port because it logs the configured port, not the bound port.

Suggested fix:

For the mock LLM, use port 0 and parse the actual port from stdout (which is already supported). For the gateway, either fix the binary to log the actual bound port, or use a randomized high port (e.g., `random.randint(18200, 19200)`) to reduce collision probability.

Severity: medium · Confidence: medium

Fixed. Both MOCK_LLM_PORT and GATEWAY_PORT are now dynamically allocated via _find_free_port() (binds to port 0, reads OS-assigned port). Mock LLM was already passing --port 0 but the conftest had a hardcoded fallback constant; removed it.

serrrfirat · 2026-02-26T11:52:50Z

+        **os.environ,
+        "RUST_LOG": "ironclaw=info",
+        "GATEWAY_ENABLED": "true",
+        "GATEWAY_HOST": "127.0.0.1",


E2E test inherits full host environment, risking config contamination

The ironclaw_server fixture passes **os.environ merged with test overrides. If the host machine has ~/.ironclaw/.env or environment variables like DATABASE_URL, LLM_API_KEY, or GATEWAY_PORT set, the ironclaw binary's load_ironclaw_env() bootstrap may load them, potentially overriding the test's configuration. This could cause non-deterministic test failures on developer machines or CI runners with residual configuration.

Suggested fix:

Use a minimal environment instead of inheriting everything: `env = {"PATH": os.environ.get("PATH", ""), "HOME": tempdir, "RUST_LOG": ..., ...}`. Set `HOME` to a temp directory to prevent `~/.ironclaw/.env` from being loaded. Or set `IRONCLAW_HOME` (if supported) to an empty temp directory.

Severity: medium · Confidence: medium

Fixed. Replaced **os.environ with a minimal explicit env dict containing only PATH and HOME plus the test-specific vars. No host environment leakage.

serrrfirat · 2026-02-26T11:52:50Z

+
+## Run tests
+
+```bash


README documents HEADED=1 env var but conftest.py ignores it

The README states HEADED=1 pytest tests/e2e/scenarios/test_connection.py -v for visible browser mode, but the page fixture in conftest.py line 117 hardcodes headless=True without reading any environment variable. Running HEADED=1 will still run headless.

Suggested fix:

In `conftest.py`, change `browser = await p.chromium.launch(headless=True)` to `browser = await p.chromium.launch(headless=not os.environ.get("HEADED"))`.

Severity: low · Confidence: high

Fixed. The browser fixture is now session-scoped and reads HEADED env var: headless = os.environ.get("HEADED", "").strip() not in ("1", "true").

serrrfirat · 2026-02-26T11:52:50Z

+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    steps:
+      - uses: actions/checkout@v4


Inconsistent actions/checkout version across CI workflows

The E2E workflow uses actions/checkout@v4 while both code_style.yml and test.yml use actions/checkout@v6. While v4 still works, the inconsistency could lead to different checkout behaviors (e.g., different default fetch depths, different handling of submodules) and makes maintenance harder.

Suggested fix:

Change `uses: actions/checkout@v4` to `uses: actions/checkout@v6` in `.github/workflows/e2e.yml` line 17.

Severity: low · Confidence: high

Fixed. Bumped to actions/checkout@v6 to match test.yml and code_style.yml.

serrrfirat · 2026-02-26T11:52:50Z

+        proc.send_signal(signal.SIGTERM)
+        try:
+            await asyncio.wait_for(proc.wait(), timeout=5)
+        except asyncio.TimeoutError:


E2E page fixture launches a new browser per test function

The page fixture (function-scoped) calls p.chromium.launch() for every test, creating a new browser process each time. With 19 E2E tests, that's 19 Chromium launches. A session-scoped browser fixture with per-test browser contexts would be significantly faster while maintaining test isolation (contexts don't share cookies, localStorage, etc.).

Suggested fix:

Add a session-scoped `browser` fixture: `@pytest.fixture(scope="session") async def browser(): ...`. Then in the `page` fixture, accept `browser` as a parameter and only create a new context and page from it.

Severity: low · Confidence: high

Fixed. Browser is now session-scoped. The page fixture creates a fresh context per test (isolated cookies/storage) without relaunching the browser process.

zmanian · 2026-02-26T17:20:51Z

All review items addressed:

Bearer auth case-insensitive (RFC 6750) -- Fixed auth.rs to use eq_ignore_ascii_case("Bearer ") for the auth-scheme check. Tests updated to expect 200 for lowercase/mixed-case "bearer". Added test_bearer_prefix_mixed_case for "BEARER" variant.
Config round-trip tests reimplemented bootstrap logic -- Refactored bootstrap.rs to expose save_bootstrap_env_to and upsert_bootstrap_var_to (path-parameterized variants). tests/config_round_trip.rs now imports and calls the real functions.
Deprecated event_loop fixture -- Removed. pytest-asyncio >= 0.23 handles session-scoped async fixtures natively.
Schema validator doc comments -- Added cross-referencing doc comments to both validate_tool_schema (lenient, runtime) and validate_strict_schema (strict, CI).
Fixed E2E ports -- Both ports now dynamically allocated via _find_free_port().
Environment contamination -- Replaced **os.environ with minimal explicit env dict (PATH + HOME only).
HEADED=1 not wired -- browser fixture now reads HEADED env var for headless toggle.
checkout@v4 -> @v6 -- Bumped e2e.yml to match other workflows.
Browser launched per-test -- Browser is now session-scoped. page fixture creates fresh contexts per test.
Array validation simplification -- Restructured to if-let-else.

@v6

- Fix Bearer auth case-sensitivity per RFC 6750 (auth.rs) - Refactor bootstrap.rs to expose path-parameterized variants so config_round_trip tests call real code instead of reimplementations - Remove deprecated event_loop fixture, use dynamic ports, minimal env, session-scoped browser, and wire HEADED=1 in E2E conftest - Add cross-referencing doc comments between schema validators - Simplify array validation logic in tool.rs - Bump e2e.yml checkout@v4 to @v6 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolve conflict in src/agent/session_manager.rs: keep both QA concurrent stress tests and new resolve_thread UUID lookup test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add --no-onboard flag to prevent wizard from blocking in CI - Pipe /dev/null to stdin to prevent any stdin reads from hanging - Add RUST_BACKTRACE=1 for crash diagnostics - On server startup timeout, dump stderr to pytest output so CI logs show why the server failed to start Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

pytest-asyncio 1.3.0 defaults asyncio_default_fixture_loop_scope to None (function scope), causing session-scoped async fixtures to be re-evaluated per test function with independent event loops. Each test then independently attempts to start the ironclaw server, times out at 120s, and wastes ~24 minutes of CI before the job is cancelled. Setting asyncio_default_fixture_loop_scope = "session" ensures all session-scoped async fixtures share a single event loop, so the server starts once and is reused across all tests. Also adds -x flag to pytest in CI to stop on first failure instead of running all 19 tests when the fixture is broken. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

With asyncio_default_fixture_loop_scope=session but asyncio_default_test_loop_scope=function (the default), tests run on a per-function event loop while fixtures produce objects (Playwright pages, browser contexts) on the session event loop. This event loop mismatch causes the test to hang indefinitely awaiting Playwright operations that are bound to the wrong loop. Setting both scopes to "session" ensures a single event loop is shared across all fixtures and tests, eliminating the deadlock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Branch protection expects "Code Style (fmt + clippy)" and "Run Tests" status checks, but only individual job names were reported. Add roll-up jobs that aggregate results and report the expected names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@v6

…est coverage (nearai#353) * Add automated QA: tool schema validator, feature-flag CI matrix, Docker build P0 items from the automated QA plan (nearai#352): - Add validate_tool_schema() that checks OpenAI strict-mode rules (type: object, required keys in properties, nested object/array recursion) with 10 unit tests and 6 integration tests covering all core built-in tools - CI test matrix now runs with --all-features, default features, and --no-default-features --features libsql to catch dead code behind wrong cfg gates - CI clippy now runs the same 3-feature matrix with --all flags - Docker build job added to catch missing files in Dockerfile Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add P1 automated QA tests and fix LeakDetector prefix shadowing bug P1 test coverage: config round-trip (settings + bootstrap), shell tool arg handling, safety adversarial tests (sanitizer, leak detector, allowlist), turn persistence (conversations, metadata, pagination, jobs), and a clippy fix for libsql-only builds. Fixed a real bug where AhoCorasick non-overlapping prefix iteration caused shorter prefixes (e.g. "sk-") to shadow longer ones (e.g. "sk-ant-api"), preventing Anthropic API key and SSH private key detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add P2 automated QA tests: chaos, lifecycle, collision, and recovery Cover all P2 items from the automated QA plan: - Circuit breaker chaos tests (hanging provider, rapid cycles, mixed errors) - Failover chaos tests (hanging failover, all-fail, tools path, single provider) - Value estimator boundary tests (negative cost, zero price, zero earnings) - Context length recovery test (ContextLengthExceeded -> compact -> retry) - WASM channel lifecycle tests (write/commit/read round-trip, namespace isolation) - Extension registry collision tests (same-name different-kind coexistence) - Extension filesystem collision tests (separate dirs, detect_kind priority) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add P3 concurrent stress tests for ContextManager and SessionManager Tests verify thread safety of double-checked locking, TOCTOU prevention, and RwLock-based concurrent access patterns under load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add dispatcher loop guard and self-repair stuck job tests Dispatcher: test force_text mechanism prevents infinite tool call loops, verify iteration bound arithmetic guarantees termination for all configs. Self-repair: test stuck job detection, recovery within attempt limits, manual escalation when limit exceeded, graceful degradation without store/builder dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add E2E testing infrastructure design doc Python + Playwright framework with mock LLM server for deterministic browser-level testing of the web gateway. Covers connection/auth, chat round-trip with SSE streaming, and skills lifecycle scenarios. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add E2E testing infrastructure implementation plan 10-task plan covering: scaffolding, mock LLM server, helpers, conftest fixtures, connection/chat/skills test scenarios, CI workflow, README, and integration run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * scaffold: E2E test project with pyproject.toml Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E helpers with DOM selectors and port discovery Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: mock OpenAI-compat LLM server for E2E tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E conftest with session fixtures for mock LLM and ironclaw Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E scenario 1 -- connection and tab navigation tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E scenario 2 -- chat message round-trip tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: E2E scenario 3 -- skills search, install, remove tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * ci: add weekly E2E test workflow with Playwright Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: E2E test README with setup and usage instructions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: E2E test integration fixes from first run - Use temp file DB instead of :memory: (libSQL :memory: doesn't persist tables across execute_batch) - Fix installed skills selector: #skills-list not #installed-skills - Add pytest-timeout to dependencies - Improve skills install/remove test with wait_for instead of fixed sleeps 8 passed, 1 skipped (skills install depends on ClawHub availability) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add OpenAI strict-mode schema validator for all built-in tools (QA 1.1) Add src/tools/schema_validator.rs with validate_strict_schema() that checks tool parameter schemas against OpenAI function calling strict-mode rules: type object at top level, required keys in properties, enum type consistency, array items definitions, nested object recursion, and additionalProperties. 17 tests validate all 34+ built-in tool schemas across 5 test groups: - 9 simple tools (echo, time, json, http, shell, file read/write/list/patch) - 4 job tools (create, list, status, cancel) - 4 skill tools (list, search, install, remove) - 13 inline schemas for extension, routine, and complex job tools - 4 memory tool schemas Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add E2E scenarios for SSE reconnect, HTML injection, and tool approval (QA 3.3/5/6) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: E2E test reliability for HTML injection and SSE reconnect - HTML injection: test sanitization directly via JS injection instead of depending on full LLM round-trip (avoids intermittent 404 from mock) - SSE reconnect: increase wait times for DB persistence and relax assertion to check total message count after history reload Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: cargo fmt formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add WASM and MCP tool schema validation tests (QA 1.1) Extends the schema validator with representative WASM tool schemas (weather, HTTP client, batch processor, status), MCP tool schemas (default, file read, SQL query, strict mode), and defect detection tests for common external schema issues (missing type, typo in required, array without items, enum type mismatch). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add auth middleware and compaction module tests Auth middleware (8 new tests): valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling. Compaction module (16 new tests): truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add config round-trip integration tests (QA 1.2) Test the full bootstrap .env lifecycle: write via the same format as save_bootstrap_env/upsert_bootstrap_var, read back via dotenvy, and assert values match. Covers LLM backend selection, embedding disable flag, onboard completion flag, session token keys, multi-key preservation across upsert, and special characters (spaces, equals, quotes, backslashes, hashes). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add value estimator boundary tests and dispatcher loop guard (QA 4.3/4.4) Value estimator (14 new tests): zero/negative prices, large values, negative cost, exact margin boundaries, custom margin configuration. Dispatcher loop guard (2 new tests): verifies the dispatch loop terminates when all tool calls fail (regression guard for PR nearai#252 infinite loop) and when max iterations are reached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: add failover edge cases and provider chaos tests (QA 2.6/4.1) Failover edge cases (4 new tests): cooldown at zero nanos, half-open failure reopens circuit, all providers fail gracefully (no panic), single failing provider with cooldown. Provider chaos tests (15 new tests): flakey provider with retries, hanging provider with timeout, garbage provider, circuit breaker trip/recover, failover chain cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address PR review feedback on QA tests - Fix Bearer auth case-sensitivity per RFC 6750 (auth.rs) - Refactor bootstrap.rs to expose path-parameterized variants so config_round_trip tests call real code instead of reimplementations - Remove deprecated event_loop fixture, use dynamic ports, minimal env, session-scoped browser, and wire HEADED=1 in E2E conftest - Add cross-referencing doc comments between schema validators - Simplify array validation logic in tool.rs - Bump e2e.yml checkout@v4 to @v6 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: cargo fmt and fix clippy warning in signal.rs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: improve E2E fixture error reporting and prevent stdin blocking - Add --no-onboard flag to prevent wizard from blocking in CI - Pipe /dev/null to stdin to prevent any stdin reads from hanging - Add RUST_BACKTRACE=1 for crash diagnostics - On server startup timeout, dump stderr to pytest output so CI logs show why the server failed to start Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: set session-scoped event loop for E2E async fixtures pytest-asyncio 1.3.0 defaults asyncio_default_fixture_loop_scope to None (function scope), causing session-scoped async fixtures to be re-evaluated per test function with independent event loops. Each test then independently attempts to start the ironclaw server, times out at 120s, and wastes ~24 minutes of CI before the job is cancelled. Setting asyncio_default_fixture_loop_scope = "session" ensures all session-scoped async fixtures share a single event loop, so the server starts once and is reused across all tests. Also adds -x flag to pytest in CI to stop on first failure instead of running all 19 tests when the fixture is broken. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: set test loop scope to session to match fixture loop scope With asyncio_default_fixture_loop_scope=session but asyncio_default_test_loop_scope=function (the default), tests run on a per-function event loop while fixtures produce objects (Playwright pages, browser contexts) on the session event loop. This event loop mismatch causes the test to hang indefinitely awaiting Playwright operations that are bound to the wrong loop. Setting both scopes to "session" ensures a single event loop is shared across all fixtures and tests, eliminating the deadlock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * ci: add roll-up jobs to match branch protection required checks Branch protection expects "Code Style (fmt + clippy)" and "Run Tests" status checks, but only individual job names were reported. Add roll-up jobs that aggregate results and report the expected names. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added scope: tool Tool infrastructure scope: ci CI/CD workflows scope: docs Documentation size: L 200-499 changed lines risk: medium Business logic, config, or moderate-risk modules contributor: experienced 6-19 merged PRs labels Feb 24, 2026

gemini-code-assist bot reviewed Feb 24, 2026

View reviewed changes

zmanian changed the title ~~Add automated QA: schema validator, feature-flag CI matrix, Docker build~~ Add automated QA: schema validator, CI matrix, Docker build, and P1 test coverage Feb 24, 2026

github-actions bot added scope: agent Agent core (agent loop, router, scheduler) scope: channel/wasm WASM channel runtime scope: llm LLM integration scope: extensions Extension management scope: estimation Cost/time estimation and removed size: L 200-499 changed lines labels Feb 24, 2026

zmanian and others added 2 commits February 24, 2026 13:11

Add P3 concurrent stress tests for ContextManager and SessionManager

dd90140

Tests verify thread safety of double-checked locking, TOCTOU prevention, and RwLock-based concurrent access patterns under load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added the scope: channel/web Web gateway channel label Feb 25, 2026

zmanian and others added 3 commits February 24, 2026 19:55

serrrfirat reviewed Feb 26, 2026

View reviewed changes

serrrfirat previously approved these changes Feb 26, 2026

View reviewed changes

Merge origin/main into feat/automated-qa-p0

4e5297c

Resolve conflict in src/agent/session_manager.rs: keep both QA concurrent stress tests and new resolve_thread UUID lookup test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zmanian dismissed serrrfirat’s stale review via 4e5297c February 26, 2026 19:13

zmanian and others added 4 commits February 26, 2026 11:20

style: cargo fmt and fix clippy warning in signal.rs

66a5997

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot mentioned this pull request Feb 27, 2026

🦞 OpenClaw 生态日报 2026-02-27 duanyytop/agents-radar#21

Open

serrrfirat previously approved these changes Feb 27, 2026

View reviewed changes

serrrfirat enabled auto-merge (squash) February 27, 2026 03:39

zmanian and others added 2 commits February 26, 2026 20:21

Merge branch 'main' into feat/automated-qa-p0

4c31a5c

auto-merge was automatically disabled February 27, 2026 04:54
Head branch was pushed to by a user without write access

zmanian dismissed serrrfirat’s stale review via ad9cffc February 27, 2026 04:54

serrrfirat approved these changes Feb 27, 2026

View reviewed changes

serrrfirat merged commit a24fd3e into nearai:main Feb 27, 2026
14 checks passed

github-actions bot mentioned this pull request Feb 27, 2026

chore: release v0.13.0 #385

Merged

zmanian deleted the feat/automated-qa-p0 branch February 27, 2026 17:06

This was referenced Feb 28, 2026

🦞 OpenClaw 生态日报 2026-02-28 duanyytop/agents-radar#24

Open

🦞 OpenClaw 生态日报 2026-02-28 duanyytop/agents-radar#28

Open

serrrfirat mentioned this pull request Feb 28, 2026

docs: add comprehensive documentation for v0.16.0 / v0.16.1 #294

Closed

Conversation

zmanian commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

P0: Schema & CI (3 items -- all done)

P1: Integration Tests (6 items -- all done)

P2: Edge Cases & Chaos (7 items -- all done)

P3: E2E & Stress Testing (5 items -- done)

Bug Fixes

Intentionally Skipped

Test plan

Uh oh!

gemini-code-assist bot commented Feb 24, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

serrrfirat left a comment

Choose a reason for hiding this comment

Summary

General findings

[MEDIUM] Bearer auth scheme comparison is case-sensitive, violating RFC 6750 (src/channels/web/auth.rs:33)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zmanian commented Feb 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

zmanian commented Feb 24, 2026 •

edited

Loading