Skip to content

Add automated QA: schema validator, CI matrix, Docker build, and P1 test coverage#353

Merged
serrrfirat merged 34 commits intonearai:mainfrom
zmanian:feat/automated-qa-p0
Feb 27, 2026
Merged

Add automated QA: schema validator, CI matrix, Docker build, and P1 test coverage#353
serrrfirat merged 34 commits intonearai:mainfrom
zmanian:feat/automated-qa-p0

Conversation

@zmanian
Copy link
Copy Markdown
Collaborator

@zmanian zmanian commented Feb 24, 2026

Summary

Comprehensive automated QA implementation covering all four tiers of the QA plan (docs/plans/2026-02-24-automated-qa.md). Adds 424 new tests (1,300 -> 1,724 total) across 39 files (+7,887 lines), addressing the recurring bug categories identified from ~40 closed issues and merged bug-fix PRs.

P0: Schema & CI (3 items -- all done)

  • Tool schema validator (src/tools/schema_validator.rs): validate_strict_schema() checks all built-in tool schemas against OpenAI strict-mode rules (type: object, required/properties alignment, nested recursion). 10 unit tests + 7 integration tests for all built-in tools + 3 WASM/MCP schema tests.
  • Feature-flag CI matrix (.github/workflows/test.yml, code_style.yml): Tests and clippy now run 3 configurations: --all-features, default, --no-default-features --features libsql.
  • Docker build in CI (.github/workflows/test.yml): New docker-build job runs docker build on every PR.

P1: Integration Tests (6 items -- all done)

  • Config round-trip tests: 4 tests in src/bootstrap.rs + 4 in src/settings.rs + 6 integration tests in tests/config_round_trip.rs (LLM backend round-trip, embedding disabled survives, onboard completed check, session token key, preserving existing values, special characters).
  • Turn persistence tests: 7 tests using TestHarnessBuilder (message round-trip, metadata persistence, ownership check, ensure idempotency, pagination ordering, conversation preview, job+action persistence).
  • Shell tool realistic arg tests: Object-format args (the real LLM case), large output deadlock prevention (131KB), env scrubbing, injection pattern detection.
  • Safety adversarial tests: 6 sanitizer tests + 9 leak detector tests + 8 allowlist tests covering case insensitivity, subdomain bypass, credential URLs, IP addresses, multi-type detection, masking.
  • Auth middleware tests (src/channels/web/auth.rs): 8 new tests -- valid/invalid bearer tokens, query param fallback, case sensitivity, empty tokens, whitespace handling.
  • Compaction module tests (src/agent/compaction.rs): 16 new tests -- truncation strategy, summarize strategy with mock LLM, workspace fallback, format_turns helper, sequential compactions, coherence after compaction, token decrease verification.

P2: Edge Cases & Chaos (7 items -- all done)

  • Failover edge cases (src/llm/failover.rs): 4 new tests -- all providers fail gracefully (no panic), single failing provider with cooldown, cooldown at zero nanos, half-open failure reopens circuit.
  • Circuit breaker edge cases (src/llm/circuit_breaker.rs): 2 new tests -- zero-duration recovery timeout, half-open failure counter reset.
  • Provider chaos tests (tests/provider_chaos.rs): 15 new tests with FlakeyProvider, HangingProvider, GarbageProvider, ReliableProvider mocks. Covers retry exhaustion, timeout handling, circuit breaker trip/recover, failover cascading, non-transient error stops chain, full stack integration (retry + failover + circuit breaker).
  • Dispatcher loop guard (src/agent/dispatcher.rs): 2 new tests -- all tool calls failing terminates (regression guard for PR fix: parallelize tool call execution via JoinSet (#219) #252), max iterations reached terminates.
  • Context length recovery: ContextLengthExceeded triggers compaction and retry succeeds.
  • WASM channel lifecycle: workspace write/commit/read round-trip, overwrite across callbacks, emit order preservation, isolated namespaces.
  • Extension registry collisions: same-name different-kind entries coexist, tools/channels dirs separate, capabilities files independent.

P3: E2E & Stress Testing (5 items -- done)

  • E2E infrastructure (Python + Playwright): Mock OpenAI-compatible LLM server (tests/e2e/mock_llm.py), pytest fixtures (binary build, mock LLM, gateway, Playwright browser), centralized DOM selectors.
  • E2E scenarios (19 tests across 6 scenarios):
    • Scenario 1 (Connection): auth, tab navigation, connection status
    • Scenario 2 (Chat): message round-trip via SSE, multiple messages, empty message guard
    • Scenario 3 (SSE Reconnect): status indicator, disconnect/reconnect, history preservation
    • Scenario 4 (Skills): tab visibility, ClawHub search, install/remove lifecycle
    • Scenario 5 (HTML Injection): XSS sanitization, user messages as plain text, no script elements
    • Scenario 6 (Tool Approval): approval card rendering, approve/deny, parameter toggle
  • E2E CI workflow (.github/workflows/e2e.yml): Weekly + manual + PR-triggered on web gateway changes.
  • Concurrent stress tests: ContextManager (50 concurrent creates, max_jobs contention, 20 writers+readers, state transitions), SessionManager (30 concurrent get_or_create, 20 distinct users, 5 channels isolation, undo manager).
  • Value estimator boundaries (src/estimation/value.rs): 14 new tests -- zero/negative prices, large values, exact margin boundaries, custom margins, minimum bid edge cases.

Bug Fixes

  • LeakDetector prefix shadowing: AhoCorasick non-overlapping iteration caused shorter prefixes ("sk-") to shadow longer ones ("sk-ant-api"), preventing Anthropic API key and SSH private key detection.
  • Clippy needless_return in src/cli/mcp.rs for libsql-only builds.

Intentionally Skipped

  • E2E Scenario 7 (Onboarding Wizard): Requires PTY/pexpect infrastructure not viable in current E2E setup.
  • Test harness in-memory mode (2.1): Using tempdir-backed libSQL instead of :memory: -- works correctly.

Addresses #352.

Test plan

  • cargo test --all-features -- 1,724 tests pass
  • cargo test --no-default-features --features libsql -- passes
  • cargo clippy --all --benches --tests --examples --all-features -- zero warnings
  • cargo clippy --all --benches --tests --examples -- zero warnings
  • cargo clippy --all --benches --tests --examples --no-default-features --features libsql -- zero warnings
  • cargo fmt --all -- --check -- clean
  • E2E tests: pytest tests/e2e/ -v -- 18 passed, 1 skipped
  • CI matrix runs all 3 feature configurations without errors
  • Docker build job succeeds in CI

Generated with Claude Code

…er build

P0 items from the automated QA plan (nearai#352):

- Add validate_tool_schema() that checks OpenAI strict-mode rules
  (type: object, required keys in properties, nested object/array
  recursion) with 10 unit tests and 6 integration tests covering
  all core built-in tools

- CI test matrix now runs with --all-features, default features, and
  --no-default-features --features libsql to catch dead code behind
  wrong cfg gates

- CI clippy now runs the same 3-feature matrix with --all flags

- Docker build job added to catch missing files in Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added scope: tool Tool infrastructure scope: ci CI/CD workflows scope: docs Documentation size: L 200-499 changed lines risk: medium Business logic, config, or moderate-risk modules contributor: experienced 6-19 merged PRs labels Feb 24, 2026
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @zmanian, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's quality assurance by introducing several automated checks. It addresses critical P0 items from issue #352, focusing on preventing common bug categories related to tool schema validity, build integrity, and feature flag coverage. The changes lay the groundwork for a more robust and reliable development process by ensuring stricter adherence to schema definitions and broader test coverage across different build configurations.

Highlights

  • Tool Schema Validation: Implemented validate_tool_schema() to check built-in tool schemas against OpenAI strict-mode rules, including 10 unit tests and 6 integration tests covering core built-in tools.
  • CI Test Matrix Expansion: Expanded the CI test matrix to run with --all-features, default features, and --no-default-features --features libsql to catch dead code behind incorrect #[cfg] gates.
  • CI Clippy Configuration: Configured CI clippy to run across the same three-feature matrix with comprehensive flags (--all --benches --tests --examples).
  • Docker Build in CI: Introduced a Docker build job in CI for every pull request to verify Dockerfile integrity and catch missing files.
  • Automated QA Plan Document: Included a comprehensive automated QA plan document (docs/plans/2026-02-24-automated-qa.md) outlining future QA initiatives.
Changelog
  • docs/plans/2026-02-24-automated-qa.md
    • Added a new document outlining a comprehensive automated QA plan for the project.
  • src/tools/mod.rs
    • Exported the new validate_tool_schema function for public use within the tools module.
  • src/tools/tool.rs
    • Implemented the validate_tool_schema function to enforce OpenAI strict-mode rules for tool schemas.
    • Added extensive unit tests for the new schema validation logic, covering various valid and invalid schema structures.
  • tests/tool_schema_validation.rs
    • Added new integration tests to validate the schemas of all core built-in tools, ensuring they conform to strict OpenAI requirements.
    • Included specific regression tests for json, http, time, and shell tool schemas.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/code_style.yml
    • .github/workflows/test.yml
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive automated QA plan and implements the first tier of it, focusing on schema validation for tools. A new validate_tool_schema function is added to enforce OpenAI strict-mode rules on tool parameter schemas, along with extensive unit and integration tests to validate all core built-in tools. This is a significant improvement to the project's quality assurance, preventing a whole class of bugs related to invalid tool schemas. The changes are well-structured and thoroughly tested. I have a few minor suggestions for improvement in the implementation and the QA plan document.

#[tokio::test]
async fn failover_with_all_providers_failing() {
let failover = FailoverProvider::new(vec![
always_failing_provider("a]"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There appears to be a typo in this code example. The string literal "a]" likely has an unintentional closing bracket. It should probably be "a".

Suggested change
always_failing_provider("a]"),
always_failing_provider("a"),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged -- this is in a planning doc, not shipped code. Will leave as-is since the plan doc is reference material.

screenshot = await self.take_screenshot()
return {"type": "tool_result", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/png",
"data": base64.b64encode(screenshot).decode()}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The base64 module is used here, but it's not imported in the ComputerUseAgent class snippet above. For completeness of the example, you should add import base64.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged -- this is in a planning doc (pseudocode example), not shipped code. The actual E2E implementation in tests/e2e/ handles imports correctly.

Comment thread src/tools/tool.rs
Comment on lines +349 to +359
"array" => {
if prop.get("items").is_none() {
errors.push(format!("{prop_path}: array property missing \"items\""));
} else if let Some(items) = prop.get("items") {
// If items is an object type, recurse
if items.get("type").and_then(|t| t.as_str()) == Some("object") {
errors
.extend(validate_tool_schema(items, &format!("{prop_path}.items")));
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic can be simplified by using an if let ... else structure to avoid checking for items twice.

Suggested change
"array" => {
if prop.get("items").is_none() {
errors.push(format!("{prop_path}: array property missing \"items\""));
} else if let Some(items) = prop.get("items") {
// If items is an object type, recurse
if items.get("type").and_then(|t| t.as_str()) == Some("object") {
errors
.extend(validate_tool_schema(items, &format!("{prop_path}.items")));
}
}
}
"array" => {
if let Some(items) = prop.get("items") {
// If items is an object type, recurse
if items.get("type").and_then(|t| t.as_str()) == Some("object") {
errors
.extend(validate_tool_schema(items, &format!("{prop_path}.items")));
}
} else {
errors.push(format!("{prop_path}: array property missing \"items\""));
}
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Restructured to if-let-else so items is only checked once.

P1 test coverage: config round-trip (settings + bootstrap), shell tool
arg handling, safety adversarial tests (sanitizer, leak detector,
allowlist), turn persistence (conversations, metadata, pagination, jobs),
and a clippy fix for libsql-only builds.

Fixed a real bug where AhoCorasick non-overlapping prefix iteration
caused shorter prefixes (e.g. "sk-") to shadow longer ones
(e.g. "sk-ant-api"), preventing Anthropic API key and SSH private key
detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size: L 200-499 changed lines scope: channel/cli TUI / CLI channel scope: tool/builtin Built-in tools scope: safety Prompt injection defense scope: config Configuration scope: sandbox Docker sandbox size: XL 500+ changed lines risk: high Safety, secrets, auth, or critical infrastructure and removed size: L 200-499 changed lines risk: medium Business logic, config, or moderate-risk modules labels Feb 24, 2026
@zmanian zmanian changed the title Add automated QA: schema validator, feature-flag CI matrix, Docker build Add automated QA: schema validator, CI matrix, Docker build, and P1 test coverage Feb 24, 2026
Cover all P2 items from the automated QA plan:
- Circuit breaker chaos tests (hanging provider, rapid cycles, mixed errors)
- Failover chaos tests (hanging failover, all-fail, tools path, single provider)
- Value estimator boundary tests (negative cost, zero price, zero earnings)
- Context length recovery test (ContextLengthExceeded -> compact -> retry)
- WASM channel lifecycle tests (write/commit/read round-trip, namespace isolation)
- Extension registry collision tests (same-name different-kind coexistence)
- Extension filesystem collision tests (separate dirs, detect_kind priority)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added scope: agent Agent core (agent loop, router, scheduler) scope: channel/wasm WASM channel runtime scope: llm LLM integration scope: extensions Extension management scope: estimation Cost/time estimation and removed size: L 200-499 changed lines labels Feb 24, 2026
zmanian and others added 2 commits February 24, 2026 13:11
Tests verify thread safety of double-checked locking, TOCTOU
prevention, and RwLock-based concurrent access patterns under load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dispatcher: test force_text mechanism prevents infinite tool call loops,
verify iteration bound arithmetic guarantees termination for all configs.

Self-repair: test stuck job detection, recovery within attempt limits,
manual escalation when limit exceeded, graceful degradation without
store/builder dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auth middleware (8 new tests): valid/invalid bearer tokens, query param
fallback, case sensitivity, empty tokens, whitespace handling.

Compaction module (16 new tests): truncation strategy, summarize strategy
with mock LLM, workspace fallback, format_turns helper, sequential
compactions, coherence after compaction, token decrease verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the scope: channel/web Web gateway channel label Feb 25, 2026
zmanian and others added 3 commits February 24, 2026 19:55
Test the full bootstrap .env lifecycle: write via the same format
as save_bootstrap_env/upsert_bootstrap_var, read back via dotenvy,
and assert values match. Covers LLM backend selection, embedding
disable flag, onboard completion flag, session token keys, multi-key
preservation across upsert, and special characters (spaces, equals,
quotes, backslashes, hashes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…A 4.3/4.4)

Value estimator (14 new tests): zero/negative prices, large values,
negative cost, exact margin boundaries, custom margin configuration.

Dispatcher loop guard (2 new tests): verifies the dispatch loop terminates
when all tool calls fail (regression guard for PR nearai#252 infinite loop)
and when max iterations are reached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Failover edge cases (4 new tests): cooldown at zero nanos, half-open
failure reopens circuit, all providers fail gracefully (no panic),
single failing provider with cooldown.

Provider chaos tests (15 new tests): flakey provider with retries,
hanging provider with timeout, garbage provider, circuit breaker
trip/recover, failover chain cascading, non-transient error stops
chain, full stack integration (retry + failover + circuit breaker).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This is a massive, well-structured QA PR adding 424 tests across 39 files with only 16 lines of production code changes (a correct bug fix in the leak detector prefix shadowing, a clippy lint fix, and a module export). The test architecture is solid: the TestHarnessBuilder pattern, provider chaos mocks, and E2E infrastructure with mock LLM are well-designed. The leak detector fix correctly addresses the Aho-Corasick non-overlapping iteration issue where short prefixes shadowed longer ones. The CI matrix expansion for feature-flag configurations is a significant reliability improvement.

The most substantive issues are: (1) the auth middleware's case-sensitive Bearer prefix check violates RFC 6750 and the new test codifies this as expected behavior, (2) the config round-trip tests reimplement the bootstrap escaping logic rather than testing the actual functions, creating a risk of undetected divergence, (3) the E2E event_loop fixture uses a pattern deprecated in the minimum-required pytest-asyncio version, and (4) two schema validation functions with different strictness levels lack documentation about which to use when. The E2E infrastructure also has practical concerns around fixed ports and environment contamination that could cause flakiness in CI.

Overall this is a high-quality PR that significantly improves test coverage. No critical or blocking issues found.

General findings

[MEDIUM] Bearer auth scheme comparison is case-sensitive, violating RFC 6750 (src/channels/web/auth.rs:33)

The auth middleware uses value.strip_prefix("Bearer ") which is a case-sensitive check. RFC 6750 Section 2.1 specifies that the auth-scheme token comparison must be case-insensitive. A client sending authorization: bearer <token> (lowercase, which is valid per the RFC) will receive a 401. The new test test_bearer_prefix_case_sensitivity codifies this deviation by asserting that lowercase "bearer" is rejected. While this is pre-existing behavior (not introduced by this PR), the PR cements it as expected behavior. The test comment says 'Lowercase "bearer" should NOT match' — this is incorrect per the RFC.

Suggested fix:

Change `value.strip_prefix("Bearer ")` to a case-insensitive check: `if value.len() > 7 && value[..7].eq_ignore_ascii_case("Bearer ") { let token = &value[7..]; ... }`. Update the test to expect 200 for lowercase "bearer", or add a comment explaining the intentional deviation from RFC 6750.

Comment thread tests/config_round_trip.rs Outdated
/// `save_bootstrap_env`: double-quoted values with escaped backslashes and
/// double quotes.
fn write_bootstrap_env(
path: &std::path::Path,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config round-trip tests reimplement bootstrap logic instead of testing actual code

The test file reimplements write_bootstrap_env (lines 22-34) and upsert_env_var (lines 38-75) as local helper functions rather than calling the actual save_bootstrap_env and upsert_bootstrap_var from src/bootstrap.rs. The comment on line 10 explains this is because the real functions write to the global ~/.ironclaw/.env path, but the consequence is that these tests verify a copy of the logic, not the real implementation. If someone changes the escaping in bootstrap.rs without updating the test helpers, the tests will still pass, silently missing regressions. The test reimplementation could also have bugs that the real code doesn't (or vice versa).

Suggested fix:

Refactor `save_bootstrap_env` and `upsert_bootstrap_var` in `bootstrap.rs` to accept a path parameter (or add `_to(path, ...)` variants) so the tests can call the actual implementation against a tempdir. Alternatively, extract the formatting logic into a shared pure function that both the real code and tests call.

Severity: medium · Confidence: high

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Refactored bootstrap.rs to expose path-parameterized variants (save_bootstrap_env_to, upsert_bootstrap_var_to). Tests now import and call the real functions instead of reimplementing them.

Comment thread tests/e2e/conftest.py Outdated


@pytest.fixture(scope="session")
def event_loop():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated pytest-asyncio event_loop fixture override

The event_loop fixture at line 47-50 overrides the built-in event_loop fixture to make it session-scoped. This pattern was deprecated in pytest-asyncio 0.23 (the minimum version required in pyproject.toml). In pytest-asyncio >= 0.23, users should configure the event loop scope via the loop_scope setting instead. This will emit DeprecationWarnings during test runs and may break entirely in future pytest-asyncio versions.

Suggested fix:

Remove the custom `event_loop` fixture. Instead, in `pyproject.toml` add `loop_scope = "session"` under `[tool.pytest.ini_options]`, or use `@pytest.fixture(scope="session")` with the `pytest_asyncio.fixture` decorator and set `scope="session"` on the async fixtures directly.

Severity: medium · Confidence: high

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed the deprecated event_loop fixture override. pytest-asyncio >= 0.23 handles session-scoped async fixtures natively.

Comment thread src/tools/mod.rs
pub use rate_limiter::RateLimiter;
pub use registry::ToolRegistry;
pub use tool::{ApprovalRequirement, Tool, ToolDomain, ToolError, ToolOutput, ToolRateLimitConfig};
pub use tool::{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate schema validation functions with divergent strictness levels

The PR introduces validate_strict_schema() in src/tools/schema_validator.rs (958 lines, full OpenAI strict-mode validation including enum type checking, additionalProperties enforcement, recursive nested validation) and also exports the pre-existing validate_tool_schema() from src/tools/tool.rs (less strict — no enum type checking, no additionalProperties check). Both functions validate tool JSON schemas but with different rules. There is no documentation indicating when to use which. This creates a maintenance burden: changes to validation rules need to be synchronized across two implementations, and callers may pick the wrong one.

Suggested fix:

Consider making `validate_strict_schema` the single source of truth and having `validate_tool_schema` delegate to it (possibly with a `strict: bool` flag). Alternatively, add clear doc comments explaining: `validate_strict_schema` is for CI/build-time validation, `validate_tool_schema` is for runtime registration.

Severity: medium · Confidence: high

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added cross-referencing doc comments to both functions. validate_tool_schema (tool.rs) documents itself as the lenient runtime variant and points to validate_strict_schema. validate_strict_schema (schema_validator.rs) documents itself as the strict CI variant and points back to validate_tool_schema.

Comment thread tests/e2e/conftest.py Outdated
ROOT = Path(__file__).resolve().parent.parent.parent

# Ports: use high fixed ports to avoid conflicts with development instances
MOCK_LLM_PORT = 18_199
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed E2E ports will cause failures in parallel CI runs

The E2E fixtures use hardcoded ports MOCK_LLM_PORT = 18199 and GATEWAY_PORT = 18200. If two CI jobs run E2E tests simultaneously on the same runner (e.g., matrix builds, or concurrent PR builds on a self-hosted runner), they will fail with 'address already in use'. The mock LLM server supports port 0 (OS-assigned) via --port, but the gateway appears to need a fixed port because it logs the configured port, not the bound port.

Suggested fix:

For the mock LLM, use port 0 and parse the actual port from stdout (which is already supported). For the gateway, either fix the binary to log the actual bound port, or use a randomized high port (e.g., `random.randint(18200, 19200)`) to reduce collision probability.

Severity: medium · Confidence: medium

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Both MOCK_LLM_PORT and GATEWAY_PORT are now dynamically allocated via _find_free_port() (binds to port 0, reads OS-assigned port). Mock LLM was already passing --port 0 but the conftest had a hardcoded fallback constant; removed it.

Comment thread tests/e2e/conftest.py
**os.environ,
"RUST_LOG": "ironclaw=info",
"GATEWAY_ENABLED": "true",
"GATEWAY_HOST": "127.0.0.1",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E2E test inherits full host environment, risking config contamination

The ironclaw_server fixture passes **os.environ merged with test overrides. If the host machine has ~/.ironclaw/.env or environment variables like DATABASE_URL, LLM_API_KEY, or GATEWAY_PORT set, the ironclaw binary's load_ironclaw_env() bootstrap may load them, potentially overriding the test's configuration. This could cause non-deterministic test failures on developer machines or CI runners with residual configuration.

Suggested fix:

Use a minimal environment instead of inheriting everything: `env = {"PATH": os.environ.get("PATH", ""), "HOME": tempdir, "RUST_LOG": ..., ...}`. Set `HOME` to a temp directory to prevent `~/.ironclaw/.env` from being loaded. Or set `IRONCLAW_HOME` (if supported) to an empty temp directory.

Severity: medium · Confidence: medium

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced **os.environ with a minimal explicit env dict containing only PATH and HOME plus the test-specific vars. No host environment leakage.

Comment thread tests/e2e/README.md

## Run tests

```bash
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README documents HEADED=1 env var but conftest.py ignores it

The README states HEADED=1 pytest tests/e2e/scenarios/test_connection.py -v for visible browser mode, but the page fixture in conftest.py line 117 hardcodes headless=True without reading any environment variable. Running HEADED=1 will still run headless.

Suggested fix:

In `conftest.py`, change `browser = await p.chromium.launch(headless=True)` to `browser = await p.chromium.launch(headless=not os.environ.get("HEADED"))`.

Severity: low · Confidence: high

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The browser fixture is now session-scoped and reads HEADED env var: headless = os.environ.get("HEADED", "").strip() not in ("1", "true").

Comment thread .github/workflows/e2e.yml Outdated
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent actions/checkout version across CI workflows

The E2E workflow uses actions/checkout@v4 while both code_style.yml and test.yml use actions/checkout@v6. While v4 still works, the inconsistency could lead to different checkout behaviors (e.g., different default fetch depths, different handling of submodules) and makes maintenance harder.

Suggested fix:

Change `uses: actions/checkout@v4` to `uses: actions/checkout@v6` in `.github/workflows/e2e.yml` line 17.

Severity: low · Confidence: high

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Bumped to actions/checkout@v6 to match test.yml and code_style.yml.

Comment thread tests/e2e/conftest.py Outdated
proc.send_signal(signal.SIGTERM)
try:
await asyncio.wait_for(proc.wait(), timeout=5)
except asyncio.TimeoutError:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E2E page fixture launches a new browser per test function

The page fixture (function-scoped) calls p.chromium.launch() for every test, creating a new browser process each time. With 19 E2E tests, that's 19 Chromium launches. A session-scoped browser fixture with per-test browser contexts would be significantly faster while maintaining test isolation (contexts don't share cookies, localStorage, etc.).

Suggested fix:

Add a session-scoped `browser` fixture: `@pytest.fixture(scope="session") async def browser(): ...`. Then in the `page` fixture, accept `browser` as a parameter and only create a new context and page from it.

Severity: low · Confidence: high

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Browser is now session-scoped. The page fixture creates a fresh context per test (isolated cookies/storage) without relaunching the browser process.

@zmanian
Copy link
Copy Markdown
Collaborator Author

zmanian commented Feb 26, 2026

All review items addressed:

  1. Bearer auth case-insensitive (RFC 6750) -- Fixed auth.rs to use eq_ignore_ascii_case("Bearer ") for the auth-scheme check. Tests updated to expect 200 for lowercase/mixed-case "bearer". Added test_bearer_prefix_mixed_case for "BEARER" variant.

  2. Config round-trip tests reimplemented bootstrap logic -- Refactored bootstrap.rs to expose save_bootstrap_env_to and upsert_bootstrap_var_to (path-parameterized variants). tests/config_round_trip.rs now imports and calls the real functions.

  3. Deprecated event_loop fixture -- Removed. pytest-asyncio >= 0.23 handles session-scoped async fixtures natively.

  4. Schema validator doc comments -- Added cross-referencing doc comments to both validate_tool_schema (lenient, runtime) and validate_strict_schema (strict, CI).

  5. Fixed E2E ports -- Both ports now dynamically allocated via _find_free_port().

  6. Environment contamination -- Replaced **os.environ with minimal explicit env dict (PATH + HOME only).

  7. HEADED=1 not wired -- browser fixture now reads HEADED env var for headless toggle.

  8. checkout@v4 -> @v6 -- Bumped e2e.yml to match other workflows.

  9. Browser launched per-test -- Browser is now session-scoped. page fixture creates fresh contexts per test.

  10. Array validation simplification -- Restructured to if-let-else.

- Fix Bearer auth case-sensitivity per RFC 6750 (auth.rs)
- Refactor bootstrap.rs to expose path-parameterized variants so
  config_round_trip tests call real code instead of reimplementations
- Remove deprecated event_loop fixture, use dynamic ports, minimal env,
  session-scoped browser, and wire HEADED=1 in E2E conftest
- Add cross-referencing doc comments between schema validators
- Simplify array validation logic in tool.rs
- Bump e2e.yml checkout@v4 to @v6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
serrrfirat
serrrfirat previously approved these changes Feb 26, 2026
Resolve conflict in src/agent/session_manager.rs: keep both QA
concurrent stress tests and new resolve_thread UUID lookup test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zmanian and others added 4 commits February 26, 2026 11:20
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --no-onboard flag to prevent wizard from blocking in CI
- Pipe /dev/null to stdin to prevent any stdin reads from hanging
- Add RUST_BACKTRACE=1 for crash diagnostics
- On server startup timeout, dump stderr to pytest output so CI
  logs show why the server failed to start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pytest-asyncio 1.3.0 defaults asyncio_default_fixture_loop_scope to
None (function scope), causing session-scoped async fixtures to be
re-evaluated per test function with independent event loops. Each test
then independently attempts to start the ironclaw server, times out
at 120s, and wastes ~24 minutes of CI before the job is cancelled.

Setting asyncio_default_fixture_loop_scope = "session" ensures all
session-scoped async fixtures share a single event loop, so the server
starts once and is reused across all tests.

Also adds -x flag to pytest in CI to stop on first failure instead of
running all 19 tests when the fixture is broken.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With asyncio_default_fixture_loop_scope=session but
asyncio_default_test_loop_scope=function (the default), tests run on
a per-function event loop while fixtures produce objects (Playwright
pages, browser contexts) on the session event loop. This event loop
mismatch causes the test to hang indefinitely awaiting Playwright
operations that are bound to the wrong loop.

Setting both scopes to "session" ensures a single event loop is shared
across all fixtures and tests, eliminating the deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
serrrfirat
serrrfirat previously approved these changes Feb 27, 2026
@serrrfirat serrrfirat enabled auto-merge (squash) February 27, 2026 03:39
zmanian and others added 2 commits February 26, 2026 20:21
Branch protection expects "Code Style (fmt + clippy)" and "Run Tests"
status checks, but only individual job names were reported. Add
roll-up jobs that aggregate results and report the expected names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
auto-merge was automatically disabled February 27, 2026 04:54

Head branch was pushed to by a user without write access

@serrrfirat serrrfirat merged commit a24fd3e into nearai:main Feb 27, 2026
14 checks passed
@github-actions github-actions bot mentioned this pull request Feb 27, 2026
@zmanian zmanian deleted the feat/automated-qa-p0 branch February 27, 2026 17:06
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026
…est coverage (nearai#353)

* Add automated QA: tool schema validator, feature-flag CI matrix, Docker build

P0 items from the automated QA plan (nearai#352):

- Add validate_tool_schema() that checks OpenAI strict-mode rules
  (type: object, required keys in properties, nested object/array
  recursion) with 10 unit tests and 6 integration tests covering
  all core built-in tools

- CI test matrix now runs with --all-features, default features, and
  --no-default-features --features libsql to catch dead code behind
  wrong cfg gates

- CI clippy now runs the same 3-feature matrix with --all flags

- Docker build job added to catch missing files in Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add P1 automated QA tests and fix LeakDetector prefix shadowing bug

P1 test coverage: config round-trip (settings + bootstrap), shell tool
arg handling, safety adversarial tests (sanitizer, leak detector,
allowlist), turn persistence (conversations, metadata, pagination, jobs),
and a clippy fix for libsql-only builds.

Fixed a real bug where AhoCorasick non-overlapping prefix iteration
caused shorter prefixes (e.g. "sk-") to shadow longer ones
(e.g. "sk-ant-api"), preventing Anthropic API key and SSH private key
detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add P2 automated QA tests: chaos, lifecycle, collision, and recovery

Cover all P2 items from the automated QA plan:
- Circuit breaker chaos tests (hanging provider, rapid cycles, mixed errors)
- Failover chaos tests (hanging failover, all-fail, tools path, single provider)
- Value estimator boundary tests (negative cost, zero price, zero earnings)
- Context length recovery test (ContextLengthExceeded -> compact -> retry)
- WASM channel lifecycle tests (write/commit/read round-trip, namespace isolation)
- Extension registry collision tests (same-name different-kind coexistence)
- Extension filesystem collision tests (separate dirs, detect_kind priority)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add P3 concurrent stress tests for ContextManager and SessionManager

Tests verify thread safety of double-checked locking, TOCTOU
prevention, and RwLock-based concurrent access patterns under load.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add dispatcher loop guard and self-repair stuck job tests

Dispatcher: test force_text mechanism prevents infinite tool call loops,
verify iteration bound arithmetic guarantees termination for all configs.

Self-repair: test stuck job detection, recovery within attempt limits,
manual escalation when limit exceeded, graceful degradation without
store/builder dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add E2E testing infrastructure design doc

Python + Playwright framework with mock LLM server for deterministic
browser-level testing of the web gateway. Covers connection/auth,
chat round-trip with SSE streaming, and skills lifecycle scenarios.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add E2E testing infrastructure implementation plan

10-task plan covering: scaffolding, mock LLM server, helpers,
conftest fixtures, connection/chat/skills test scenarios,
CI workflow, README, and integration run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* scaffold: E2E test project with pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: E2E helpers with DOM selectors and port discovery

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: mock OpenAI-compat LLM server for E2E tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: E2E conftest with session fixtures for mock LLM and ironclaw

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: E2E scenario 1 -- connection and tab navigation tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: E2E scenario 2 -- chat message round-trip tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: E2E scenario 3 -- skills search, install, remove tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add weekly E2E test workflow with Playwright

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: E2E test README with setup and usage instructions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: E2E test integration fixes from first run

- Use temp file DB instead of :memory: (libSQL :memory: doesn't persist
  tables across execute_batch)
- Fix installed skills selector: #skills-list not #installed-skills
- Add pytest-timeout to dependencies
- Improve skills install/remove test with wait_for instead of fixed sleeps

8 passed, 1 skipped (skills install depends on ClawHub availability)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add OpenAI strict-mode schema validator for all built-in tools (QA 1.1)

Add src/tools/schema_validator.rs with validate_strict_schema() that checks
tool parameter schemas against OpenAI function calling strict-mode rules:
type object at top level, required keys in properties, enum type consistency,
array items definitions, nested object recursion, and additionalProperties.

17 tests validate all 34+ built-in tool schemas across 5 test groups:
- 9 simple tools (echo, time, json, http, shell, file read/write/list/patch)
- 4 job tools (create, list, status, cancel)
- 4 skill tools (list, search, install, remove)
- 13 inline schemas for extension, routine, and complex job tools
- 4 memory tool schemas

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add E2E scenarios for SSE reconnect, HTML injection, and tool approval (QA 3.3/5/6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: E2E test reliability for HTML injection and SSE reconnect

- HTML injection: test sanitization directly via JS injection instead of
  depending on full LLM round-trip (avoids intermittent 404 from mock)
- SSE reconnect: increase wait times for DB persistence and relax
  assertion to check total message count after history reload

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: cargo fmt formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add WASM and MCP tool schema validation tests (QA 1.1)

Extends the schema validator with representative WASM tool schemas
(weather, HTTP client, batch processor, status), MCP tool schemas
(default, file read, SQL query, strict mode), and defect detection
tests for common external schema issues (missing type, typo in
required, array without items, enum type mismatch).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add auth middleware and compaction module tests

Auth middleware (8 new tests): valid/invalid bearer tokens, query param
fallback, case sensitivity, empty tokens, whitespace handling.

Compaction module (16 new tests): truncation strategy, summarize strategy
with mock LLM, workspace fallback, format_turns helper, sequential
compactions, coherence after compaction, token decrease verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add config round-trip integration tests (QA 1.2)

Test the full bootstrap .env lifecycle: write via the same format
as save_bootstrap_env/upsert_bootstrap_var, read back via dotenvy,
and assert values match. Covers LLM backend selection, embedding
disable flag, onboard completion flag, session token keys, multi-key
preservation across upsert, and special characters (spaces, equals,
quotes, backslashes, hashes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add value estimator boundary tests and dispatcher loop guard (QA 4.3/4.4)

Value estimator (14 new tests): zero/negative prices, large values,
negative cost, exact margin boundaries, custom margin configuration.

Dispatcher loop guard (2 new tests): verifies the dispatch loop terminates
when all tool calls fail (regression guard for PR nearai#252 infinite loop)
and when max iterations are reached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add failover edge cases and provider chaos tests (QA 2.6/4.1)

Failover edge cases (4 new tests): cooldown at zero nanos, half-open
failure reopens circuit, all providers fail gracefully (no panic),
single failing provider with cooldown.

Provider chaos tests (15 new tests): flakey provider with retries,
hanging provider with timeout, garbage provider, circuit breaker
trip/recover, failover chain cascading, non-transient error stops
chain, full stack integration (retry + failover + circuit breaker).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback on QA tests

- Fix Bearer auth case-sensitivity per RFC 6750 (auth.rs)
- Refactor bootstrap.rs to expose path-parameterized variants so
  config_round_trip tests call real code instead of reimplementations
- Remove deprecated event_loop fixture, use dynamic ports, minimal env,
  session-scoped browser, and wire HEADED=1 in E2E conftest
- Add cross-referencing doc comments between schema validators
- Simplify array validation logic in tool.rs
- Bump e2e.yml checkout@v4 to @v6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: cargo fmt and fix clippy warning in signal.rs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: improve E2E fixture error reporting and prevent stdin blocking

- Add --no-onboard flag to prevent wizard from blocking in CI
- Pipe /dev/null to stdin to prevent any stdin reads from hanging
- Add RUST_BACKTRACE=1 for crash diagnostics
- On server startup timeout, dump stderr to pytest output so CI
  logs show why the server failed to start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set session-scoped event loop for E2E async fixtures

pytest-asyncio 1.3.0 defaults asyncio_default_fixture_loop_scope to
None (function scope), causing session-scoped async fixtures to be
re-evaluated per test function with independent event loops. Each test
then independently attempts to start the ironclaw server, times out
at 120s, and wastes ~24 minutes of CI before the job is cancelled.

Setting asyncio_default_fixture_loop_scope = "session" ensures all
session-scoped async fixtures share a single event loop, so the server
starts once and is reused across all tests.

Also adds -x flag to pytest in CI to stop on first failure instead of
running all 19 tests when the fixture is broken.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set test loop scope to session to match fixture loop scope

With asyncio_default_fixture_loop_scope=session but
asyncio_default_test_loop_scope=function (the default), tests run on
a per-function event loop while fixtures produce objects (Playwright
pages, browser contexts) on the session event loop. This event loop
mismatch causes the test to hang indefinitely awaiting Playwright
operations that are bound to the wrong loop.

Setting both scopes to "session" ensures a single event loop is shared
across all fixtures and tests, eliminating the deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add roll-up jobs to match branch protection required checks

Branch protection expects "Code Style (fmt + clippy)" and "Run Tests"
status checks, but only individual job names were reported. Add
roll-up jobs that aggregate results and report the expected names.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: experienced 6-19 merged PRs risk: high Safety, secrets, auth, or critical infrastructure scope: agent Agent core (agent loop, router, scheduler) scope: channel/cli TUI / CLI channel scope: channel/wasm WASM channel runtime scope: channel/web Web gateway channel scope: ci CI/CD workflows scope: config Configuration scope: docs Documentation scope: estimation Cost/time estimation scope: extensions Extension management scope: llm LLM integration scope: safety Prompt injection defense scope: sandbox Docker sandbox scope: tool/builtin Built-in tools scope: tool Tool infrastructure size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants