
feat: structured collections — typed CRUD tools for agent workspaces#1937

Open
standardtoaster wants to merge 56 commits into nearai:staging from
standardtoaster:fix/collection-scoping-and-startup

Conversation

@standardtoaster
Contributor

Collections: Typed Structured Storage for Agent Workspaces

Summary

"Add milk to the grocery list." Most deployments can't do this out of the box. The agent either creates a new document every time (fragmenting the list) or tries to edit an existing markdown file and corrupts it. Across 28 test scenarios and 2 models, modifying structured data stored in memory documents fails every time.

This PR adds schema-defined collections with auto-generated typed CRUD tools. Register a collection with a schema, get a tool that handles inserts, queries, updates, deletes, and aggregations with validation.

| Model | Collections | Memory docs | Delta |
|---|---|---|---|
| Qwen 3.5-35B (local) | 76% | 37% | +39 |
| Claude Haiku 4.5 | 70% | 26% | +44 |

35 files changed, ~12,600 lines. Dual-backend (PostgreSQL + libSQL), per-user tool isolation. Also provides the storage layer for #1474.

The problem

A user says "add milk to the grocery list." Both options fail:

  1. Write a new document. Creates fragments. "What's on the list?" requires deduplication across scattered files.

  2. Edit an existing document. Read, parse, modify, write back. Models can't do this — they lose items, duplicate entries, or corrupt formatting. Fails on every model tested.

The problem is structural: append-only documents don't support mutation.

What a user sees

POST /api/collections

```json
{
  "collection": "grocery_items",
  "description": "Shopping list",
  "fields": {
    "name":     { "type": "text", "required": true },
    "quantity": { "type": "number" },
    "category": { "type": { "enum": ["produce", "dairy", "meat", "bakery", "other"] } },
    "store":    { "type": "text" },
    "done":     { "type": "bool", "default": false }
  }
}
```

This generates a tool called grocery_items. The model calls:

```
grocery_items(operation: "add", data: { "name": "milk", "category": "dairy" })
grocery_items(operation: "query")
grocery_items(operation: "summary", agg_operation: "count")
```

The schema validates inputs and coerces LLM quirks: "12" becomes 12, "true" becomes true, "tomorrow" becomes an ISO date.
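A minimal sketch of what such a coercion layer might look like (function names are assumptions, not the PR's actual API; natural-language date parsing is omitted for brevity):

```rust
// Hypothetical sketch of schema-level input coercion: LLMs often emit
// numbers and booleans as strings, so normalize before validation.
fn coerce_number(v: &str) -> Option<f64> {
    v.trim().parse::<f64>().ok()
}

fn coerce_bool(v: &str) -> Option<bool> {
    match v.trim().to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => Some(true),
        "false" | "no" | "0" => Some(false),
        _ => None,
    }
}

fn main() {
    assert_eq!(coerce_number("12"), Some(12.0));
    assert_eq!(coerce_bool("true"), Some(true));
    // Unparseable input surfaces as None so validation can reject it cleanly.
    assert_eq!(coerce_bool("maybe"), None);
    println!("coercion ok");
}
```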

Evidence

Collections vs memory docs (28 scenarios, same data, same models)

| Category | Qwen+Coll | Qwen+Mem | Haiku+Coll | Haiku+Mem |
|---|---|---|---|---|
| Grocery (7) | 76% | 29% | 76% | 28% |
| Nanny hours (7) | 56% | 21% | 69% | 24% |
| Todo (6) | 72% | 52% | 72% | 35% |
| Transactions (8) | 59% | 48% | 65% | 21% |

Tool design matters

| Approach | Score | Why |
|---|---|---|
| 1 unified tool/collection + skills | 76% | Low tool count + guided operations |
| 5 per-operation tools + skills | 65% | Self-documenting but 20 tools adds noise |
| 1 unified tool/collection, no skills | 68% | Model handles the operation enum fine |
| Flat files (memory docs) | 37% | Writes broken |
| Generic CRUD (5 tools for all) | 41% | Model forgets collection names |

One tool per collection with an operation parameter is the best design. Auto-generated skills (+8%) teach the model which operation to use for natural language intents.

How it works

Storage

StructuredStore trait, fully implemented for both PostgreSQL (JSONB) and libSQL (json_extract). Records in a shared structured_records table, discriminated by (user_id, collection) with composite index + GIN index. 7 field types: Text, Number, Date, Time, DateTime, Bool, Enum.

Schema alteration supports adding/removing fields and enum values. Existing records are preserved — queries, filters, and aggregations handle missing fields correctly (SUM/AVG skip records where the field doesn't exist).
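To make the missing-field semantics concrete, here's an illustrative sketch (not the PR's code, which pushes aggregation into SQL/JSONB): records that predate a field are skipped by SUM/AVG rather than counted as zero.

```rust
// Illustrative aggregation semantics: a record without the field
// contributes nothing to SUM and doesn't inflate the AVG denominator.
fn sum_present(values: &[Option<f64>]) -> f64 {
    values.iter().flatten().sum()
}

fn avg_present(values: &[Option<f64>]) -> Option<f64> {
    let present: Vec<f64> = values.iter().flatten().copied().collect();
    if present.is_empty() {
        None
    } else {
        Some(present.iter().sum::<f64>() / present.len() as f64)
    }
}

fn main() {
    // Middle record predates the `quantity` field: skipped, not zeroed.
    let quantities = [Some(2.0), None, Some(4.0)];
    assert_eq!(sum_present(&quantities), 6.0);
    assert_eq!(avg_present(&quantities), Some(3.0));
}
```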

Tools

Each collection gets one tool named {user}_{collection} with parameters:

  • operation (required enum: query, add, update, delete, summary)
  • data (object — validated against the collection's field schema)
  • record_id (string — for update/delete)
  • filters (object — field → {op, value}, supports eq/neq/gt/gte/lt/lte/is_null/is_not_null)
  • field, agg_operation, group_by (for summary aggregations)

Tool names include the owner prefix; the dispatcher filters per-user via tool_definitions_for_user().
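As an illustration of the filters shape (a hypothetical payload, not taken from the PR), a query for unbought produce might pass:

```json
{
  "operation": "query",
  "filters": {
    "done":     { "op": "eq", "value": false },
    "category": { "op": "eq", "value": "produce" }
  }
}
```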

When a collection is registered, a SKILL.md is auto-generated with activation keywords from the schema. Keyword/regex matching injects the skill into the system prompt when relevant. On restart, existing schemas are loaded and tools registered before the first conversation.
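A rough sketch of the keyword-activation step described above (names are hypothetical; the real selector also supports regex matching and, per later commits, embedding-based scoring):

```rust
// Hypothetical sketch of keyword-based skill activation: a skill is
// injected into the system prompt when any of its activation keywords
// appears in the user message.
fn matching_skills<'a>(
    message: &str,
    skills: &'a [(&'a str, Vec<&'a str>)], // (skill name, activation keywords)
) -> Vec<&'a str> {
    let msg = message.to_ascii_lowercase();
    skills
        .iter()
        .filter(|(_, keywords)| {
            keywords.iter().any(|k| msg.contains(&k.to_ascii_lowercase()))
        })
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    let skills = vec![
        ("grocery_items", vec!["grocery", "shopping", "milk"]),
        ("nanny_shifts", vec!["nanny", "hours", "shift"]),
    ];
    let active = matching_skills("add milk to the grocery list", &skills);
    assert_eq!(active, vec!["grocery_items"]);
}
```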

REST API

6 endpoints for external integration (webhooks, cron, scripts). Inherits gateway auth.

  • GET/POST /api/collections — List / register schemas
  • GET/POST /api/collections/{name}/records — Query / insert records
  • PUT/DELETE /api/collections/{name}/records/{id} — Update / delete records

Multi-tenant scoping

Collections are scoped by user_id. Every query includes WHERE user_id = $1. Tool definitions are filtered per-user in the dispatcher, job workers, and routine engine.

For cross-user read access, source_scope allows a tool to query another user's data. source_scope is stripped in CollectionRegisterTool::execute() and collections_register_handler — only server-side seeding can set it.
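A minimal sketch of that stripping step (types are simplified stand-ins; the real code operates on the registration payload inside the tool and handler):

```rust
// Hypothetical sketch: client-supplied source_scope is discarded before
// registration, so cross-user read grants can't be forged via the API.
// Only server-side seeding bypasses this step.
use std::collections::BTreeMap;

fn strip_source_scope(payload: &mut BTreeMap<String, String>) {
    payload.remove("source_scope");
}

fn main() {
    let mut payload = BTreeMap::from([
        ("collection".to_string(), "grocery_items".to_string()),
        ("source_scope".to_string(), "other_user".to_string()),
    ]);
    strip_source_scope(&mut payload);
    assert!(!payload.contains_key("source_scope"));
    assert!(payload.contains_key("collection"));
}
```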

Tool scaling

Each collection adds 1 tool. 20 collections = 20 tools, ~300 extra tokens with compressed descriptions. Per-user filtering and on-demand skill injection keep prompts lean.

Drawbacks

  • Adds a new storage abstraction alongside memory documents. Zero impact if unused.
  • Fields must be defined upfront. Adding a required field doesn't backfill old records.
  • Delete by description succeeds ~40%. Record IDs in query results make delete-by-ID reliable.

Alternatives considered

  • Improve document-based search: writes remain 0%.
  • External database/service: breaks single-deployment model.
  • Generic CRUD tools: 41%. Model forgets collection names as parameters.
  • 5 per-operation tools per collection: 65%. Scales poorly (50 tools at 10 collections).

Compatibility

Additive only. No existing tables modified, no existing tool behavior changed, no changes to memory documents. V16 migration adds two new tables (structured_schemas, structured_records). owner_user_id() added to Tool trait with default None — existing tools unaffected. Both backends implemented and tested at parity. Set COLLECTION_TOOL_MODE=unified to enable (recommended).

@github-actions github-actions bot added scope: agent Agent core (agent loop, router, scheduler) scope: channel/web Web gateway channel scope: tool Tool infrastructure scope: tool/builtin Built-in tools scope: db Database trait / abstraction scope: db/postgres PostgreSQL backend scope: db/libsql libSQL / Turso backend scope: worker Container worker scope: docs Documentation size: XL 500+ changed lines risk: medium Business logic, config, or moderate-risk modules contributor: regular 2-5 merged PRs labels Apr 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces "Collections," a typed structured storage system for agent workspaces that provides schema-validated CRUD tools with support for PostgreSQL and libSQL backends. The code review identified several critical issues, including the omission of workspace read scopes in API handlers which breaks cross-lens access, and missing event broadcasts in the unified collection tool that would prevent routine triggers from firing. Additionally, feedback highlighted a correctness issue in libSQL aggregations where app-layer processing combined with record limits could yield wrong results, and pointed out that ORDER BY clauses fail to handle top-level database columns correctly. Other improvements were suggested regarding transaction locking in libSQL, robust group key extraction, and the use of UTC for consistent date resolution.

Comment on lines +190 to +195
```rust
let owner = match resolve_collection_owner(
    db.as_ref(),
    &user.user_id,
    &Vec::<String>::new(),
    &name,
)
```

high

The call to resolve_collection_owner passes an empty vector for scopes, effectively disabling cross-lens access for queries. To support the multi-tenant scoping described in the RFC, the user's workspace_read_scopes should be passed here. Similar issues exist in the insert, update, and delete handlers (lines 281, 384, and 478).

Suggested change

```diff
 let owner = match resolve_collection_owner(
     db.as_ref(),
     &user.user_id,
-    &Vec::<String>::new(),
+    &user.workspace_read_scopes,
     &name,
 )
```

Comment on lines +230 to +232
```rust
.query_records(user_id, collection, &aggregation.filters, None, usize::MAX)
.await?;
```


security-high

The aggregate_in_rust function performs aggregations in the application layer after fetching records. However, query_records enforces a limit of 1000, leading to incorrect results for larger collections. Following repository rules, use targeted database queries (e.g., SQL SUM/AVG) to perform aggregations at the database level to ensure correctness and prevent performance/DoS vulnerabilities.

References
  1. Use targeted database queries to fetch specific records instead of loading all records and filtering in the application, especially for public endpoints, to prevent performance bottlenecks and DoS vulnerabilities.

Comment on lines +212 to +218
```rust
self.db
    .update_record(
        self.owner_scope(ctx),
        record_id,
        serde_json::Value::Object(final_updates),
    )
    .await
```

high

The execute_update method is missing the broadcast of a CollectionWriteEvent. This will prevent routine triggers based on collection updates from firing when this tool is used. An event should be sent to collection_write_tx similar to the implementation in execute_add.

Comment on lines +242 to +244
```rust
self.db
    .delete_record(self.owner_scope(ctx), record_id)
    .await
```

high

The execute_delete method is missing the broadcast of a CollectionWriteEvent. This will prevent routine triggers based on collection deletions from firing when this tool is used.

Comment on lines +137 to +138
```rust
let mut user_ids = vec![user.user_id.clone()];
user_ids.extend(Vec::<String>::new().iter().cloned());
```

medium

The collections_list_handler is missing the logic to include the user's workspace read scopes, as indicated by the redundant extension with an empty vector. This prevents users from seeing collections they have cross-lens access to, which contradicts the PR description regarding cross-lens support.

Suggested change

```diff
 let mut user_ids = vec![user.user_id.clone()];
-user_ids.extend(Vec::<String>::new().iter().cloned());
+user_ids.extend(user.workspace_read_scopes.iter().cloned());
```

Comment thread src/db/libsql/structured.rs Outdated
```rust
) -> Result<(), DatabaseError> {
    let conn = self.connect().await?;

    conn.execute("BEGIN", ())
```

medium

A deferred transaction (BEGIN) is used here for a read-modify-write cycle. To prevent race conditions and potential deadlocks in libSQL/SQLite, use BEGIN IMMEDIATE to acquire the write lock at the start of the transaction, as per repository guidelines.

Suggested change

```diff
-conn.execute("BEGIN", ())
+conn.execute("BEGIN IMMEDIATE", ())
```
References
  1. To prevent race conditions in get-or-create operations in SQLite/libSQL, use BEGIN IMMEDIATE transactions to serialize concurrent writers.

Comment on lines +647 to +649
```rust
Some(field) => {
    structured::validate_field_name(field)
        .map_err(|e| DatabaseError::Query(e.to_string()))?;
```

medium

The ORDER BY clause interpolates the field name directly into json_extract. If a user attempts to order by a top-level column like created_at or updated_at, the query will attempt to extract these from the JSON data field (where they don't exist), resulting in an incorrect sort order. Consider handling these special fields similarly to resolve_filter_field.

Comment thread src/db/libsql/structured.rs Outdated
Comment on lines +778 to +780
```rust
.get::<String>(0)
.unwrap_or_else(|_| "null".to_string());
let value = extract_agg_value(&row, 1, &aggregation.operation)?;
```

medium

Using row.get::<String>(0) will fail if the group key is a numeric or boolean type in the JSON data. Using the get_text helper is safer as it handles type conversion and NULL values more robustly.

Suggested change

```diff
-    .get::<String>(0)
-    .unwrap_or_else(|_| "null".to_string());
+let key = get_text(&row, 0);
 let value = extract_agg_value(&row, 1, &aggregation.operation)?;
```

Comment thread src/db/postgres.rs Outdated
Comment on lines +1375 to +1377
```rust
structured::validate_field_name(field)
    .map_err(|e| DatabaseError::Query(e.to_string()))?;
sql.push_str(&format!(" ORDER BY data->>'{field}'"));
```

medium

Similar to the libSQL implementation, ordering by top-level columns like created_at or updated_at will fail because the query attempts to access them via the data->> JSONB operator. These fields should be handled as direct column references.

Comment thread src/db/structured/mod.rs Outdated
```rust
/// "tomorrow", "this week", or day names instead of YYYY-MM-DD. This handles
/// the most common cases so the tool call doesn't fail validation.
fn try_parse_natural_date(s: &str) -> Option<NaiveDate> {
    let today = chrono::Local::now().date_naive();
```

medium

Using chrono::Local::now() for natural language date resolution can lead to inconsistent results depending on the server's local timezone configuration. It is generally safer to use Utc::now() to ensure consistent behavior across different deployment environments.

Suggested change

```diff
-let today = chrono::Local::now().date_naive();
+let today = chrono::Utc::now().date_naive();
```

@standardtoaster standardtoaster marked this pull request as draft April 2, 2026 23:28
@ilblackdragon
Member

What about modelling this as a SQL database with tables, and letting the agent just run SQL queries on it?
This would then also work if the user connects to external databases. It lets the model leverage all the SQL knowledge it already has, and the agent can just maintain metadata about the tables it can fetch.

For the multi-tenant case it would be great to have: per-user, shared across everyone, and also group-based. This would allow sharing these tables between groups of users or the whole organization.

@github-actions github-actions bot added scope: channel Channel infrastructure scope: tool/builder Dynamic tool builder labels Apr 3, 2026
@standardtoaster
Contributor Author

Thanks for the feedback. I think SQL is interesting as a backing store, but I'd push back on exposing raw SQL to the agent:

Safety surface. SQL gives the model room to accidentally DELETE, DROP TABLE, or ALTER schema. Multi-tenant safety would need sharded databases and row-level ACLs — a lot of infrastructure to protect against something typed tools prevent by design.

Model interface. Small models likely know SQL syntax well — it's heavily represented in training data. But SQL shifts the scaling problem: instead of knowing which tool to call, the model needs to know every table name, column name, and type. That's the same context pressure as tool count, but less constrained. Typed tools give the model a bounded action space with self-documenting parameters. Happy to benchmark SQL vs typed tools if you feel strongly — I have the test harness for it.

Tool scaling is the real concern. With cross-scope access, tool count can balloon (107 tools in my deployment, 40 from collection duplicates). I think the answer is a few things working together:

  1. Deferred tool schemas — compressed descriptions (~15 tokens/tool) instead of full JSON schemas. I have a branch where 87 tools = ~1,300 tokens vs ~18K for full schemas.
  2. Unified search pipeline with tool activation — embed tool descriptions into the same search index as memory. When the agent does a memory_search, it gets back relevant memories and relevant tool schemas for actions it might want to take — similar to how skills are already injected based on context. As the system grows in capabilities, the search pipeline becomes the discovery mechanism rather than the tool list.
  3. REST endpoints / routines for bulk operations — imports, syncs, and batch updates don't need to go through the agent's tool list. A webhook-triggered routine or dedicated endpoint handles bulk ingestion directly, keeping the agent's tool surface focused on interactive CRUD.

v2's available_actions(leases) has the right hook for this — just needs filtering wired up alongside the existing execution-time gating.

On multi-tenant sharing: I just pushed new code for cross-scope auto-discovery in this PR but haven't had a chance to fully test it yet. The approach uses workspace scoping (workspace_read_scopes) to make collections from other scopes available without duplicating tools.

standardtoaster and others added 12 commits April 3, 2026 15:11
Add V16 migration creating structured_schemas and structured_records
tables. Define FieldType, FieldDef, CollectionSchema, Record, Filter,
Aggregation types and the StructuredStore trait (10 async methods) for
schema-validated CRUD on user-defined collections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PostgreSQL: full implementation with JSONB operators for filtering,
aggregation via SQL, and user-scoped queries.

libSQL: stub returning "not yet supported" — collections require
PostgreSQL for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dynamic tool generation: each registered collection gets 5 typed tools
(add, query, update, delete, summary). Tools are prefixed with owner
user_id to prevent cross-tenant collisions in the shared registry.

Includes:
- Keyword-triggered SKILL.md generation per collection
- Collections router skill for cross-domain discovery
- Discovery docs written to workspace memory for embedding search
- Domain synonym mappings for natural language matching
- CollectionWriteEvent broadcast channel for routine triggers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Six endpoints:
- GET/POST /api/collections (list schemas, register new)
- PUT/DELETE /api/collections/:name (alter, drop schema)
- GET/POST /api/collections/:name/records (query, insert)
- PUT/DELETE /api/collections/:name/records/:id (update, delete)
- POST /api/collections/:name/aggregate (summary queries)

GatewayState extended with collection_write_tx and skills_dir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Load existing collection schemas from the database during build_all()
and register their CRUD tools before the first conversation. Without
this, collections created in prior sessions had no tools after restart.

Iterates all known users via db.list_users() to cover multi-tenant
deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Schema lifecycle (register, alter, drop)
- CRUD operations with validation
- User isolation (tenant A can't see tenant B's data)
- Querying with filters and aggregations
- Per-collection tool generation and naming
- Fixture schemas: nanny_shifts, grocery_items

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that two users registering the same collection get tool names
with different prefixes (andrew_grocery_items_query vs
grace_grocery_items_query) and no names overlap in the shared registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full implementation of all 10 StructuredStore methods for SQLite/libSQL:
register, get, list, drop collections; insert, get, update, delete
records; query with json_extract filters; aggregate with SUM/COUNT/
AVG/MIN/MAX.

Key SQLite adaptations:
- json_extract(data, '$.field') for JSONB equivalent
- Type-preserving filter params (Integer/Real/Text matching)
- Explicit timestamps via fmt_ts() (no NOW())
- V17 incremental migration for existing databases

All 19 structured_collections integration tests pass on libSQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add owner_user_id() to the Tool trait (default None for built-ins).
Collection CRUD tools return their owner's user_id. ToolRegistry gains
tool_definitions_for_user() which excludes tools owned by other users.
Dispatcher uses this for both initial tool list and mid-loop refreshes.

Prevents collection tool leakage across lenses in multi-tenant
deployments — Andrew only sees andrew_grocery_items_query, never
grace_grocery_items_query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
POST /api/collections now also handles schema registration (alias for
/api/collections/register). POST /api/collections/{name} inserts
records (alias for /api/collections/{name}/records). Ensures bench
tooling and Percy fork clients work against the upstream API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
15 new integration tests (35 total, all passing):
- Filter operators: Lt, Lte, Gte, Ne
- Aggregations: Avg, Min, Max
- Non-existent resources: query/get/update/delete errors
- Limit pagination
- Empty collection aggregation (sum/count/avg)
- Multiple combined filters (AND logic)

No bugs found — all edge cases handled correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2×2 experiment (Qwen/Haiku × Collections/MemoryDocs), 28 scenarios:
- Collections: Qwen 0.65, Haiku 0.70
- Memory docs: Qwen 0.37, Haiku 0.26
- Delta: +0.28 (Qwen), +0.44 (Haiku)
- 126 tests passing on both backends

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
standardtoaster and others added 3 commits April 3, 2026 15:23
…akage prevention

Add 10 new tests across three files validating the complete cross-scope feature:

- registry.rs: comprehensive multi-user visibility test (3 owners + builtins +
  scoped/unscoped users), execution gate test (5 cases: blocked, allowed,
  owner-always, builtin-always, nonexistent)
- collections.rs: 5 libsql-backed tests covering write targeting to owner
  partition, cross-scope query reads, collections_list leakage prevention,
  scoped collection inclusion, and source_scope write targeting
- selector.rs: 3 skill leakage tests verifying tools_prefix prevents skill
  activation for users without matching tools, multiple prefixed skills
  activate independently, and prefix matching is case-insensitive

All tests use in-process libsql (file-backed tempdir) or in-memory registries
— no postgres dependency. Total test count: 4348 passed, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After rebasing onto upstream staging (v2 engine PR), fix:
- Remove duplicate chat handlers from handlers/chat.rs (now in server.rs)
- Update crate::skills::SkillRegistry → ironclaw_skills::SkillRegistry
  (types moved to extracted crate in v2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@standardtoaster standardtoaster force-pushed the fix/collection-scoping-and-startup branch from 4fca16e to ebb5735 Compare April 3, 2026 14:31
standardtoaster and others added 12 commits April 3, 2026 15:45
The REST handler for POST /api/collections only generated tools for the
single schema being registered. If collections from prior sessions had
their tools lost on restart, they remained toolless until individually
re-registered. Add a public bootstrap_collection_tools() that loads ALL
collections for the user and generates per-collection tools for each.
The handler now calls this instead of refresh_collection_tools() on
the single schema, ensuring every collection has working tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- rustfmt on 5 files subagents touched but didn't format
- Gate unique_id behind #[cfg(feature = "postgres")] for libsql-only builds
- Include bootstrap regression test from subagent working tree

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add embedding-based scoring alongside existing keyword/tag scoring in the
skill selection pipeline, so skills activate for semantically related
queries even when keywords don't match.

Architecture:
- New `Embedder` trait in engine crate (minimal async embed interface)
- `EmbedderBridgeAdapter` wraps main crate's `EmbeddingProvider`
- `__embed_text__` host function exposed to Python orchestrator
- Threaded through ExecutionLoop -> ThreadManager -> init_engine
- Wired automatically when embedding provider is configured

Python orchestrator changes:
- `score_skill()` computes cosine similarity (max 25 points)
- Skill description embeddings cached per session
- Goal embedding computed once in `select_skills()`
- Graceful fallback to keyword-only when embeddings unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tests covering the Percy multi-lens scenario: generate_collection_tools()
through ToolRegistry::tool_definitions_for_user() for 3 lenses (andrew,
grace, household) with workspace_read_scopes filtering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v1 agent loop's skill injection (dispatcher.rs) selected skills via
keyword matching but only logged at debug level. The v2 engine emits
skill_activated events via __emit_event__ but the v1 path didn't.

Emit StatusUpdate::SkillActivated after skill selection so SSE consumers
(bench scripts, web UI) can track which skills are actually injected per
scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cherry-pick from 7775902 left merge conflict markers in
collections.rs. Resolve conflicts and verify generic_collections.rs
compiles cleanly against the post-rebase codebase.

COLLECTION_TOOL_MODE=unified gives 1 tool per collection (with operation
enum) instead of 5, reducing tool count from 5N to N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each collection now generates 1 tool with an operation enum parameter
instead of 5 separate tools. This reduces tool count from 5N to N,
which is critical for small models that struggle with 55+ tools.

Set COLLECTION_TOOL_MODE=separate for legacy 5-per-collection behavior.

Adds regression test verifying generate_collection_tools returns exactly
1 tool per collection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After keyword-based skill selection, the v1 dispatcher now runs a
semantic re-scoring pass using the EmbeddingProvider. Skills that
keywords missed but are semantically similar to the user message
(cosine similarity >= 0.5) get discovered and activated. Skill
description embeddings are cached across calls to avoid redundant
API requests. Falls back gracefully when embeddings are unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After keyword-based skill selection, boost with cosine similarity
scoring against skill descriptions using EmbeddingProvider. Skills
with similarity >= 0.5 are added even without keyword matches.
Skill embeddings are cached across calls.

This should suppress irrelevant built-in skills (e.g. github injecting
on commitment queries) while surfacing relevant skills that lack exact
keyword matches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e flag

- UnifiedCollectionTool now broadcasts events for update and delete operations
  (was only broadcasting for insert)
- Remove COLLECTION_TOOL_MODE env var and is_separate_mode() — unified tool is
  the only mode
- Simplify generate_collection_tools() and refresh_collection_tools() to always
  use unified path
- Update multi-lens tests for unified tool names (1 tool per collection)
- Add 4 event broadcast tests: insert/update/delete fire events, query does not

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keyword matching alone lets irrelevant built-in skills (e.g. github)
inject alongside commitment skills. Now keyword-matched skills are
checked against the message embedding — if cosine similarity < 0.3,
the skill is dropped. Users can always force-activate with /skill-name.

Two thresholds:
- VETO (0.3): keyword-matched skills below this are dropped
- DISCOVERY (0.5): skills without keyword matches above this are added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete CollectionAddTool, CollectionUpdateTool, CollectionDeleteTool,
CollectionQueryTool, and CollectionSummaryTool — all dead production code
replaced by UnifiedCollectionTool.

Rewrite all test modules (history, cross_scope, scoped_access, bootstrap,
cross_scope_libsql, structured_tools) to use the unified tool API with
operation parameter instead of per-op struct constructors.

Net -816 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ilblackdragon
Member

Here is another suggestion I actually want to try: memory files with schema validation.

We already have metadata for files in the workspace; we can add schema validation on top.
Whenever the agent reads a folder or file it gets the schema, and when it writes, the data is validated automatically.

It would be great to test this approach. It uses all the same infra we already have.

You can still create alias tools like grocery_list which effectively does memory_tree("collections/groceries", ).

@standardtoaster
Contributor Author

standardtoaster commented Apr 5, 2026

I prototyped this to get some signal. Quick hack: same 30 benchmark scenarios from my earlier comment on #1736, but seeded as JSON files in workspace folders instead of collection records or markdown. That comment has the full breakdown across all variants including the workspace skills from #1736.

Setup: Each entity type gets a folder (collections/commitments/, collections/signals/, etc.). Each record is a {id}-{slug}.json file. Each folder has a _schema.json describing fields and types. AGENTS.md tells the model to use memory_tree to list folders, memory_read to fetch individual records, and memory_search to find items by keyword. No custom tools — just the standard memory tools with structured data and explicit guidance. ~50 lines of seeding code added to bench-commitments.py (--variant schema-workspace-bare).

Results (Qwen 3.5 4B, 30 scenarios):

| Approach | Score |
| --- | --- |
| collections-bare (typed tools, #1937) | 0.65 |
| workspace + skills (#1736) | 0.47 |
| workspace-bare (markdown) | 0.45 |
| schema-workspace (JSON + _schema.json) | 0.37 |

Per-category:

| Category | schema-workspace | collections-bare |
| --- | --- | --- |
| Commitment query | 0.24 | 0.72 |
| Commitment CRUD | 0.30 | 0.58 |
| Signal | 0.50 | 0.46 |
| Decision | 0.53 | 0.56 |
| Aggregation | 0.35 | 0.44 |
| Cross-entity | 0.46 | 0.48 |

Schema-workspace lands below plain markdown workspace. The model has to memory_tree each folder → memory_read each file → parse JSON → reason across all results. For "how many commitments are critical?" that's 10+ tool calls vs 1 with a typed collection tool. Filtering by field value (urgency=critical, status=open) is where workspace approaches fundamentally struggle — memory_search finds relevant files semantically but can't do structured comparisons.

Where it does fine: signal detection (0.50) and decisions (0.53) — these are mostly "find the thing and read it" queries where semantic search works. The gap opens on filtering, counting, and cross-record aggregation.

I think both ideas have a place: collections for data you filter/aggregate (transactions, time tracking, tasks), schema-validated workspace files for semi-structured content where search is more useful than structured queries.

@standardtoaster
Contributor Author

This has been bouncing around in my head since your first comment about SQL. I've been looking at #2025 too, the glob/grep/file_undo tools and coding skills, and I keep coming back to a tension between two philosophies of how memory could work: everything is a file (schema validation, _schema.json, the workspace as universal substrate) versus structured data behind typed query tools. They're not necessarily in conflict, but they pull in different directions when you're deciding where to put new features.

Text files clearly work well for agent memory. But sometimes what you want isn't a blank notebook, it's a preprinted one. A grocery pad with lines for item, quantity, aisle. A timesheet with columns for date, hours, name. The structure isn't constraining the agent, it's telling it exactly what to write down and giving it a way to look things up later without reading every page. That's what my testing is showing across the variants in #1736 and above: schema-workspace scored 0.37 not because the data format was wrong, but because memory_search can't flip to "all entries where urgency=critical" the way a typed tool can. I tried throwing larger models at it (Sonnet) and it didn't seem to close the gap.

For the use cases I'm building toward (financial transactions, hour tracking, to-do management with "what's overdue?" and nagging when things go stale), the query interface is load-bearing. Even modest use accumulates: a year of nanny shifts, a few months of to-dos, eventually bank statements feeding in. That's the preprinted notebook. The schema tells the agent what fields matter, and the typed tool lets it flip straight to the right page instead of reading every entry.

I think both belong in the system. Workspace for the blank notebook (prose, context, meeting notes, identity), collections for the preprinted pads (transactions, hours, tasks). Schema validation on workspace files could bridge the gap for semi-structured content. Curious whether you see a path to making the file approach work for these query patterns, or whether you're thinking of them as separate use cases too.

ilblackdragon added a commit that referenced this pull request Apr 6, 2026
Add a `schema` field to `DocumentMetadata` that enables automatic content
validation on workspace writes. When a document or its folder `.config`
carries a JSON Schema, all write operations (write, append, patch,
write_to_layer, append_to_layer) validate content against it before
persisting. This is the foundation for typed system state (settings,
extension configs, skill manifests) stored as workspace documents.

Builds on the metadata infrastructure from #1723 — schema is inherited
via the existing `.config` chain (folder → document → defaults).

Refs: #640, #1894, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 6, 2026
Add WorkspaceSettingsAdapter that implements SettingsStore by reading/
writing workspace documents at _system/settings/{key}.json. During
migration, dual-writes to both the legacy settings table and workspace.
Reads prefer workspace, falling back to the legacy table.

Known setting keys (llm_backend, selected_model, tool_permissions.*, etc.)
get JSON Schemas stored in document metadata — writes are validated
automatically by Phase 0's schema validation.

Also adds settings_schemas.rs with compile-time schema registry and
settings_path() helper.

Refs: #640, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 9, 2026
* feat(workspace): add JSON Schema validation to document metadata

Add a `schema` field to `DocumentMetadata` that enables automatic content
validation on workspace writes. When a document or its folder `.config`
carries a JSON Schema, all write operations (write, append, patch,
write_to_layer, append_to_layer) validate content against it before
persisting. This is the foundation for typed system state (settings,
extension configs, skill manifests) stored as workspace documents.

Builds on the metadata infrastructure from #1723 — schema is inherited
via the existing `.config` chain (folder → document → defaults).

Refs: #640, #1894, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(tools): add channel-agnostic ToolDispatcher with audit trail

Introduce `ToolDispatcher` — a universal entry point for executing tools
from any caller (gateway, CLI, routine engine, WASM channels). Creates
lightweight system jobs for FK integrity, records ActionRecords, and
returns ToolOutput. This is a third entry point alongside v1's
Worker::execute_tool() and v2's EffectBridgeAdapter::execute_action().

DispatchSource::Channel(String) is intentionally string-typed — channels
are interchangeable extensions that can appear at runtime.

Also adds JobContext::system() factory and create_system_job() to both
PostgreSQL and libSQL backends.

Refs: #640

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(workspace): settings-as-workspace-documents with dual-write adapter

Add WorkspaceSettingsAdapter that implements SettingsStore by reading/
writing workspace documents at _system/settings/{key}.json. During
migration, dual-writes to both the legacy settings table and workspace.
Reads prefer workspace, falling back to the legacy table.

Known setting keys (llm_backend, selected_model, tool_permissions.*, etc.)
get JSON Schemas stored in document metadata — writes are validated
automatically by Phase 0's schema validation.

Also adds settings_schemas.rs with compile-time schema registry and
settings_path() helper.

Refs: #640, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(gateway): wire ToolDispatcher into GatewayState

Add tool_dispatcher field to GatewayState with with_tool_dispatcher()
builder method. Create and wire the dispatcher in main.rs when both
tool_registry and database are available. All 16 GatewayState
construction sites updated.

Per-handler migration (routing mutations through ToolDispatcher instead
of direct DB calls) is deferred to follow-up PRs — each handler has
complex ownership checks, cache refresh, and response types.

Refs: #640

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(tools): add system introspection tools (tools_list, version)

Add SystemToolsListTool and SystemVersionTool as proper Tool
implementations that replace hardcoded /tools and /version commands.
Registered at startup via register_system_tools(). Available in both
v1 and v2 engines — no is_v1_only_tool filter to worry about.

Refs: #640

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(workspace): extension and skill state schemas and path helpers

Add workspace path helpers and JSON Schemas for storing extension configs,
extension state, and skill manifests under _system/extensions/ and
_system/skills/. This establishes the workspace document structure that
ExtensionManager and SkillRegistry will use as a durable persistence
backend (read-through cache pattern).

Runtime state (active MCP connections, WASM runtimes) stays in memory.
Only durable config and activation state moves to workspace documents.

Refs: #640, #1741

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review feedback and CI failures

CI fixes:
- deny.toml: allow MIT-0 license required by jsonschema
- workspace/document.rs: #[allow(dead_code)] on system path constants
  pending follow-up phases that consume them
- workspace/settings_adapter.rs: remove unused chrono::Utc import
- workspace/settings_adapter.rs: collapse nested if into && form

Review fixes (gemini-code-assist):
- tools/dispatch.rs: await save_action directly instead of fire-and-forget
  tokio::spawn so short-lived CLI callers cannot drop audit records before
  they are persisted; surface errors via tracing::warn
- tools/dispatch.rs: remove DispatchSource::Agent variant — sequence_num=0
  with a reused job_id would violate UNIQUE(job_id, sequence_num). Agent
  callers must use Worker::execute_tool() which manages sequence numbers
  atomically against the agent's existing job
- workspace/settings_adapter.rs: validate content against the schema BEFORE
  the first workspace write so the initial document creation cannot bypass
  schema enforcement (subsequent writes are validated by the workspace
  resolved-metadata path established after the first write)

Refs: #2049

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: unify all machine state under .system/

Rename the workspace prefix from `_system/` to `.system/` (Unix dot-prefix
convention for hidden internal state) and migrate v2 engine state from
`engine/` to `.system/engine/` so all machine-managed state lives under
one root.

New layout:

  .system/
  ├── settings/         (per-user settings as workspace docs)
  ├── extensions/       (extension config + activation state)
  ├── skills/           (skill manifests)
  └── engine/
      ├── README.md     (auto-generated index)
      ├── knowledge/    (lessons, skills, summaries, specs, issues)
      ├── orchestrator/ (Python orchestrator versions, failures, overlays)
      ├── projects/     (project files + nested missions/)
      └── runtime/      (threads, steps, events, leases, conversations)

The inner `.runtime/` dot-prefix is dropped under `.system/engine/` since
`.system/` itself is the hidden marker; no double-hiding needed.

The `ENGINE_PREFIX` constant in `workspace::document::system_paths` is
declared as the canonical convention; bridge `store_adapter` continues
to define per-subdirectory constants below it for ergonomic interpolation.

No legacy migration code — pre-production rename.

Refs: #2049

* fix(pr-2049): security, correctness, and robustness fixes from review

Critical security:
- dispatch.rs: redact sensitive params before persisting ActionRecord
  (was leaking plaintext secrets into the audit log for tools with
  sensitive_params())
- settings_schemas.rs: validate settings keys against path traversal
  (reject /, \, .., leading ., empty, length > 128, non-alphanumeric);
  wire validation into all settings_adapter read/write/delete paths

Data correctness:
- history/store.rs + libsql/jobs.rs: write status as JobState::Completed
  .to_string() ('completed' snake_case) instead of 'Completed'; system
  jobs were round-tripping as Pending in parse_job_state()
- settings_adapter.rs: fix .system/.config metadata to set
  skip_versioning: false (was true) — descendants inherit this via
  find_nearest_config, so the previous value silently disabled
  versioning for ALL .system/** documents, contradicting the audit-
  trail intent
- workspace/mod.rs: add resolve_metadata_in_scope; use it in
  write_to_layer / append_to_layer so non-primary layer writes resolve
  schema/indexing/versioning from the target layer's .config chain
  instead of the primary user_id's. Also pass &scope (not &self.user_id)
  to maybe_save_version so versions are attributed to the correct scope

Pipeline parity:
- dispatch.rs: add SafetyLayer to ToolDispatcher; mirror Worker pipeline
  (prepare_tool_params -> validator -> redact -> timeout -> sanitize
  output) so dispatch path gets the same safety guarantees as the agent
  worker. Sanitized output is now stored in ActionRecord.output_sanitized
  instead of duplicating raw JSON

Robustness:
- settings_adapter.rs: propagate update_metadata errors in
  ensure_system_config and write_to_workspace (was silently ignored
  via let _ =, leaving schemas/skip_indexing unenforced)
- settings_adapter.rs: set_all_settings now collects the first workspace
  write error and returns it after the legacy write completes, so
  partial-migration state is observable
- settings_schemas.rs: rewrite llm_custom_providers schema to match
  CustomLlmProviderSettings (id/name/adapter/base_url/default_model/
  api_key/builtin instead of stale name/protocol/base_url/model)

Build:
- Cargo.toml: jsonschema with default-features = false to avoid pulling
  a second reqwest major version

Docs:
- db/mod.rs: docstring for create_system_job uses 'completed' snake_case
- workspace/document.rs: clarify .system/ versioning ("by default ARE
  versioned; individual files may opt out via skip_versioning")
- settings_adapter.rs: clarify per-key reads prefer workspace, aggregate
  reads stay on legacy during migration
- tools/builtin/system.rs: trim doc to match implemented scope
  (system_tools_list, system_version)
- channels/web/mod.rs: move stale 'sweep tasks managed by with_oauth'
  comment back to oauth_sweep_shutdown line

Refs: #2049
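The settings-key validation described in this round (reject /, \, .., leading ., empty, length > 128, non-alphanumeric) can be sketched as follows. The exact character policy and function name here are illustrative approximations of the Rust `validate_settings_key`.

```python
import re

# First char must be alphanumeric; dots, dashes, underscores allowed after.
_KEY_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")

def validate_settings_key(key: str) -> None:
    """Reject keys that could escape .system/settings/ when interpolated."""
    if not key or len(key) > 128:
        raise ValueError(f"invalid settings key length: {key!r}")
    if "/" in key or "\\" in key or ".." in key:
        raise ValueError(f"settings key contains path characters: {key!r}")
    if not _KEY_RE.match(key):
        raise ValueError(f"settings key has disallowed characters: {key!r}")
```

The point is to validate the key at the boundary, before it is ever interpolated into a workspace path like `.system/settings/{key}.json`.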

* docs+ci: enforce 'everything goes through tools' principle

Document the core design principle from #2049 in two places so future
contributors (human and AI) discover it during development:

- CLAUDE.md: new "Everything Goes Through Tools" section near the
  "Adding a New Channel" guide. Includes the rule, the rationale (audit
  trail, safety pipeline parity, channel-agnostic surface, agent
  parity), and a pointer to the detailed rule file.
- .claude/rules/tools.md: full pattern with required/forbidden examples,
  the list of layers that ARE exempt (Worker::execute_tool, v2
  EffectBridgeAdapter, tool implementations themselves, background
  engine jobs, read-aggregation queries), and how to annotate
  intentional exceptions. Also extends `paths` to cover
  src/channels/** and src/cli/** so it surfaces when those files are
  edited.

Enforce with a new pre-commit safety check (#7) in
scripts/pre-commit-safety.sh:

- Scans newly added lines under src/channels/web/handlers/*.rs and
  src/cli/*.rs for direct touches of state.{store, workspace,
  workspace_pool, extension_manager, skill_registry, session_manager}.
- Suppress with a trailing `// dispatch-exempt: <reason>` comment on
  the same line, matching the existing `// safety:` convention.
- Only checks added lines (`+` in the diff), so existing untouched
  handlers don't trip the check during incremental migration.

The check fires only for new code: handlers that haven't been migrated
yet (52 existing direct accesses across 12 handler files) won't break
unmodified, but any new line that bypasses the dispatcher will be
flagged at commit time.

Refs: #2049

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address Copilot review on workspace schema layer

- workspace::extension_state: extension/skill path helpers now reuse the
  canonical name validators (`canonicalize_extension_name`,
  `validate_skill_name`) instead of a weak `replace('/', "_")`. Names
  containing `..`, `\`, NUL, or other escapes are now rejected at the
  helper boundary, eliminating a path-traversal foothold for callers.
  Helpers return `Result<String, PathError>`. Regression tests added.

- workspace::settings_adapter::ensure_system_config: now idempotent across
  upgrades. If `.system/.config` already exists with stale metadata
  (e.g. an older `skip_versioning: true` from before fix #3042846635),
  it is repaired to the expected inherited values instead of being left
  silently broken. Regression test added.

- workspace::settings_adapter::write_to_workspace: lazily seeds
  `.system/.config` via a `OnceCell`, so callers no longer need to
  remember to invoke `ensure_system_config()` at startup before any
  setting write. Regression test added.

- workspace::settings_adapter::delete_setting: workspace delete failures
  are now logged via `tracing::warn!` instead of being silently dropped.
  We still don't propagate the error — the legacy table is the source of
  truth during migration and a stale workspace doc is recoverable on the
  next write — but partial-delete state is now observable.

- workspace::schema: documented why we don't cache compiled validators
  yet (settings/extension/skill writes are not a hot path; revisit if
  schema validation moves into a frequent write path).

[skip-regression-check] schema.rs change is doc-only.

* fix(pr-2049): address 4 remaining review issues

1. tool_dispatcher dropped during gateway startup
   src/channels/web/mod.rs: rebuild_state was initializing
   tool_dispatcher to None, so every subsequent with_* call zeroed
   the dispatcher the first caller injected. Preserve it across
   rebuild_state like every other field. Regression test:
   tool_dispatcher_survives_subsequent_with_calls.

2. WorkspaceSettingsAdapter not wired into runtime
   src/app.rs: Build the adapter in build_all() when workspace+db
   are both present, eagerly call ensure_system_config(), expose
   on AppComponents as settings_store, and thread it into
   init_extensions(...) so register_permission_tools and
   upgrade_tool_list receive it instead of the raw db.
   src/main.rs: SIGHUP handler prefers the adapter over raw db.
   src/workspace/mod.rs: re-export WorkspaceSettingsAdapter.

3. changed_by regression on layered writes
   src/workspace/mod.rs: write_to_layer and append_to_layer were
   passing the target layer's scope as changed_by, so version
   history attributed layered edits to the layer name instead of
   the actor. Pass self.user_id while keeping metadata resolution
   in the target scope. Regression test:
   layered_writes_record_actor_in_changed_by.

4. Legacy engine/ paths invisible after upgrade
   src/bridge/store_adapter.rs: Add migrate_legacy_engine_paths(),
   called at the start of load_state_from_workspace(), which scans
   list_all() for engine/... documents and rewrites them to
   .system/engine/... Idempotent: skips rewrites when the new path
   already exists, deletes the legacy duplicate either way. Three
   regression tests in #[cfg(all(test, feature = "libsql"))]
   module.

Quality gate: cargo fmt, cargo clippy --all --all-features zero
warnings, cargo test --all-features --lib 4313 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(e2e): use PUT for settings write in ownership test

test_settings_written_and_readable was sending POST /api/settings/{key}
but the route has been PUT since #4 (Feb 2026) — the test was returning
405 Method Not Allowed. Switch to httpx.put() so it matches the current
route registration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address second round of review feedback

Addresses the remaining unresolved PR #2049 review comments from
serrrfirat and ilblackdragon.

## Changes

### ToolDispatcher — integration coverage + log level
- src/tools/dispatch.rs: add two libsql-gated integration tests for
  the full dispatch pipeline: (a) persist an ActionRecord with
  sensitive params redacted in the audit row while the tool still
  sees the raw value, sanitized output populated; (b) honor the
  per-tool execution_timeout() and record a failure action.
- Tests use a raw-SQL helper to find system-category jobs since
  list_agent_jobs_for_user intentionally filters them out.
- Replace warn! with debug! on audit persistence failure — dispatch
  is reachable from interactive CLI/REPL sessions where warn!/info!
  output corrupts the terminal UI (CLAUDE.md Code Style → logging).

### WorkspaceSettingsAdapter — log level
- src/workspace/settings_adapter.rs: same warn! → debug! fix on the
  delete_setting workspace failure path, for the same REPL reason.

### Schema validation — surface all errors
- src/workspace/schema.rs: switch from jsonschema::validate to
  validator_for + iter_errors so users fixing a malformed setting
  see every violation in one round instead of playing whack-a-mole.
  Also distinguishes "invalid schema" from "invalid content" errors.
- Regression tests: multiple_errors_are_all_reported and
  invalid_schema_is_distinguished_from_invalid_content.

### create_system_job — started_at + row growth docs
- src/db/libsql/jobs.rs and src/history/store.rs: include started_at
  in the INSERT (set to the same instant as created_at/completed_at)
  so duration queries don't see NULL and "started but not completed"
  filters don't misclassify these rows. Fixed in both backends.
- Add doc comments on both impls warning about row growth per
  dispatch call. Deleting rows would violate "LLM data is never
  deleted" (CLAUDE.md); if listing-query performance becomes a
  concern, prefer a partial index (WHERE category != 'system') over
  deletion.

### Lib test repair
- src/channels/web/server.rs: extensions_setup_submit_handler Err
  branch now sets resp.activated = Some(false) so clients and the
  regression test see an explicit `false` rather than `null`. Also
  rename the test's fake channel to snake_case (test_failing_channel)
  so it matches the canonicalize-extension-names behavior from
  PR #2129 — previously the test was passing a dashed name and
  getting "Capabilities file not found" instead of the intended
  activation failure.

## Not addressed (false positive / deferred)
- dispatch.rs:177 output_raw/output_sanitized swap — verified against
  ActionRecord::succeed(Option<String>, Value, Duration) and the
  worker's call site at job.rs:704; argument order is correct.
- settings_adapter.rs:186 TOCTOU window — author self-classified as
  "Low / completeness" and no other code path writes to
  .system/settings/** without going through write_to_workspace.
- schema.rs recompilation caching — deferred per earlier review.

## Quality gate
- cargo fmt
- cargo clippy --all --benches --tests --examples --all-features
  zero warnings
- cargo test --all-features --lib: 4387 passed, 0 failed, 3 ignored

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address third round of review feedback

Addresses unresolved comments from serrrfirat's "Paranoid Architect
Review" and Copilot's third pass on the engine-state migration.

## src/workspace/settings_adapter.rs

### HIGH — Cross-tenant data leak through owner-scoped Workspace

`Workspace` is constructed for a single user_id at AppBuilder time.
Without gating, `set_setting("user_B", key, val)` would dual-write into
the **owner's** workspace, and a subsequent `user_A.get_setting(...)`
would return user_B's value: a real cross-user data leak.

Fix:
- Add `gate_user_id` field set to `workspace.user_id()` at construction.
- All `SettingsStore` methods that touch the workspace now check
  `workspace_allowed_for(user_id)` first; non-owner callers fall through
  to the legacy table only — preserving their pre-#2049 behavior.
- This matches the long-term plan: per-user settings live in the legacy
  table until a per-user `WorkspaceSettingsAdapter` (one per
  WorkspacePool entry) is wired up; admin/global settings go through
  the workspace-backed path so they pick up schema validation.

Regression test: `workspace_settings_are_owner_gated_in_multi_tenant_mode`
asserts (a) owner's workspace doc is not overwritten by a non-owner write,
(b) each user reads back their own legacy value, and (c) a non-owner with
no legacy entry must NOT see the owner's workspace value bleeding through.

### MEDIUM — Dual-write order

Reverse `set_setting` and `set_all_settings` to write legacy first,
workspace second. The legacy table is the source of truth during
migration (it backs aggregate `list_settings` reads), so writing it
first guarantees those readers always see a consistent value even if
the workspace write fails. Failed workspace writes are self-healing on
the next per-key read-miss.

### MEDIUM — `ensure_system_config_lazy` double-execution race

Replace the manual `get()`/`set()` pattern with
`OnceCell::get_or_try_init`. Two concurrent first-callers no longer
both run `ensure_system_config()`. Functionally equivalent (idempotent
either way) but no longer wasteful.
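A Python analogue of the `get_or_try_init` pattern, for readers less familiar with Rust's OnceCell: a lock serializes the check-then-set so two concurrent first-callers cannot both run the initializer. The class here is a hand-rolled sketch, not a real library API.

```python
import threading

class OnceCell:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = None
        self._set = False

    def get_or_try_init(self, init):
        with self._lock:
            if not self._set:
                # If init() raises, _set stays False and a later
                # caller retries — mirroring get_or_try_init semantics.
                self._value = init()
                self._set = True
            return self._value
```

As the commit notes, the manual get()/set() version was idempotent anyway; the win is avoiding wasted duplicate initialization under contention.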

## src/bridge/store_adapter.rs

### MEDIUM — Migration drops document metadata (S3)

`migrate_legacy_engine_paths` previously copied only `doc.content`,
silently dropping the `metadata` column. Now calls
`ws.update_metadata(new_doc.id, &doc.metadata)` after each write to
preserve schema/skip_indexing/hygiene flags. Logged-not-fatal: content
has already been moved, metadata loss is recoverable.

Regression test: `migration_preserves_document_metadata` seeds a doc
with custom metadata and asserts it survives the rewrite.

### MEDIUM — `ws.exists()` swallowed transient errors (Copilot)

`unwrap_or(false)` on the existence check could cause the migrator to
overwrite an existing `.system/engine/...` doc when storage hiccups.
Now propagates the error (counts as failed step + `continue`), per
Copilot's exact suggested patch.

### LOW — `list_all()` runs every startup (Copilot)

Add a cheap preflight: `ws.list("engine")` first; only fall through to
the recursive `list_all()` discovery when the directory listing returns
at least one entry. Steady-state startups (post-migration) skip the
full workspace scan entirely.

Regression test: `migration_preflight_skips_full_scan_when_no_legacy_paths`
asserts unrelated and already-migrated documents are untouched.

### MEDIUM — Counter undercount on `already_present` (S5)

When `already_present` is true the legacy duplicate is still deleted,
but the previous code skipped the `migrated += 1` increment, undercounting
in debug logs. Fixed: `migrated` now counts every successful path
migration including the already-present case.

### Documented — Version-history loss is acceptable scope (C1)

Read-write-delete pattern means `memory_document_versions.document_id
ON DELETE CASCADE` drops the legacy doc's version chain. Documented in
the function-level doc comment as intentional + bounded:
- v2 engine state is runtime state (rewritten on every mutation), not
  user-curated data
- v2 was newly introduced in this PR — no production deployment with
  pre-existing curated history at risk
- A path-preserving rename op would need new trait methods on both
  backends; out of scope for fix-forward. If a future caller needs
  history-preserving rename, it should be added to the storage layer
  properly, not bolted onto migration.

## Quality gate
- cargo fmt
- cargo clippy --all --benches --tests --examples --all-features
  zero warnings
- cargo test --all-features --lib: 4390 passed, 0 failed, 3 ignored
  (+3 new tests on top of round 2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address fourth round of review feedback

Two latent issues flagged by serrrfirat in the latest review pass:

1. **Null schema permanently locks documents** (`src/workspace/schema.rs`).
   `serde_json` deserializes a metadata field of `"schema": null` as
   `Some(Value::Null)`, not `None`, so the upstream
   `if let Some(schema) = &metadata.schema` check passes through to
   `validate_content_against_schema`. There, `validator_for(Value::Null)`
   errors out and every subsequent write to that document is blocked — a
   latent DoS. Added an explicit `schema.is_null()` early-return guard at
   the top of the validator, plus a regression test
   (`null_schema_is_treated_as_no_op`) that asserts even non-JSON content
   passes when the schema is null.
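The guard can be sketched as follows. In this Python stand-in, a metadata entry of `"schema": null` arrives as `None` (the Rust equivalent of `Some(Value::Null)`); the validator function and its field checks are illustrative.

```python
import json

def validate_write(content: str, schema) -> None:
    # Guard first: treating a null schema as a real schema would fail
    # schema compilation and block every later write to the document.
    if schema is None:
        return
    doc = json.loads(content)
    for field in schema.get("required", []):
        if field not in doc:
            raise ValueError(f"missing required field: {field}")
```

With the guard, even non-JSON content passes when the schema is null, exactly as the regression test asserts.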

2. **System job titles were raw source labels** (`src/history/store.rs`,
   `src/db/libsql/jobs.rs`). `create_system_job` set `title = source`,
   so any UI rendering `agent_jobs.title` would display dispatched
   system jobs as `channel:gateway` / `system` / etc. instead of a
   human-readable label. Both PostgreSQL and libSQL backends now write
   `format!("System: {source}")`. Updated the two dispatch integration
   tests that pinned the old format.

Schema-recompilation comment (`schema.rs:47`) was acknowledged as
"acceptable for now" by the reviewer; existing NOTE in the source
already documents the caching trade-off and upgrade path, so no code
change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address fifth round of review feedback

Eight comments from Copilot + serrrfirat. Real fixes for the load-bearing
gaps; doc clarifications for the rest where the existing behavior is
intentional.

**Real code changes**

- `src/tools/dispatch.rs` — enforce `tool.parameters_schema()` (JSON
  Schema) in the dispatch path. Previously the SafetyLayer validator only
  checked for injection patterns; channel/CLI/routine callers could pass
  arbitrary shapes and only discover the mismatch (or worse, silently
  malformed behavior) inside the tool itself. Now we run
  `jsonschema::validate(&tool.parameters_schema(), &normalized_params)`
  after the injection check, with a permissive-empty-schema fast path so
  tools that haven't yet declared a schema aren't penalised. Regression
  test `dispatch_rejects_params_violating_tool_schema` asserts a
  required-field violation is rejected before the tool is invoked.
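The dispatch-path check can be sketched like this. A minimal stand-in: real dispatch runs full JSON Schema validation via the jsonschema crate, while this only checks required fields and one type; `dispatch_check` is a hypothetical name.

```python
def dispatch_check(params: dict, schema: dict) -> None:
    """Reject malformed params before the tool is ever invoked."""
    if not schema:
        # Permissive fast path: tools that haven't declared a
        # parameters schema yet are not penalized.
        return
    for field in schema.get("required", []):
        if field not in params:
            raise ValueError(f"params violate tool schema: missing {field!r}")
    for field, spec in schema.get("properties", {}).items():
        if field in params and spec.get("type") == "string" and not isinstance(params[field], str):
            raise ValueError(f"params violate tool schema: {field!r} must be a string")
```

Failing here means the caller gets a clear schema error instead of silently malformed behavior inside the tool.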

- `src/workspace/settings_adapter.rs` — `write_to_workspace` now calls
  `schema_for_key(key)` once and reuses the resolved schema for both
  pre-write validation and post-write metadata persistence (was called
  twice). Eliminates duplicate work and removes a theoretical
  divergence window if the schema registry ever became non-deterministic.

- `src/workspace/settings_adapter.rs` — `ensure_system_config` now also
  rewrites the `.config` document content when its metadata is repaired,
  not just the metadata column. The metadata column is the inheritance
  source of truth, but having the doc's content silently diverge from it
  confuses anyone reading the doc directly to understand which inherited
  flags are active.

- `src/error.rs` + `src/workspace/settings_schemas.rs` — new
  `WorkspaceError::InvalidPath { path, reason }` variant. Path/key
  rejection (path-traversal, character set, length) now surfaces as
  `InvalidPath`, not `SchemaValidation` — callers and downstream UIs can
  distinguish "your settings *key* has bad characters" from "your
  settings *value* failed JSON-Schema validation" without string-matching
  error messages. `validate_settings_key` returns the new variant; the
  one match site in `settings_adapter.rs::write_to_workspace` is updated.
  Regression test `validate_settings_key_returns_invalid_path_variant`.

**Documentation-only fixes**

- `src/tools/dispatch.rs` — clarify in the `dispatch()` doc-comment that
  `sanitize_tool_output` runs only against the persisted ActionRecord
  payload, NOT against the value returned to the caller. This mirrors
  `Worker::execute_tool` (the agent loop also receives the raw output so
  reasoning can be reproduced from history). Channels that forward
  dispatcher output to end users must run their own boundary
  sanitization at the channel edge.

- `src/history/store.rs` + `src/db/libsql/jobs.rs` —
  `create_system_job` doc updated to explicitly state that system job
  timestamps do NOT reflect tool execution time (the row is INSERTed
  before the tool runs, with all three timestamps pinned to "now").
  Consumers that need execution duration must read
  `job_actions.duration_ms` for the associated action rows. Restructuring
  to a two-phase INSERT+UPDATE was rejected: the audit row must be
  durable even if the dispatcher panics mid-tool, and the second write
  would double per-dispatch DB cost.
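A sketch of the timestamp contract, with an invented `SystemJob` struct standing in for the real row:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

struct SystemJob { created_at: u64, started_at: u64, finished_at: u64 }

fn create_system_job() -> SystemJob {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    // All three timestamps are pinned to the same "now" at INSERT time,
    // before the tool runs, so none of them measures execution duration;
    // the real schema keeps that in job_actions.duration_ms.
    SystemJob { created_at: now, started_at: now, finished_at: now }
}
```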

- `src/workspace/schema.rs` — added baseline regression test
  `moderately_complex_schema_compiles_within_budget` that pins schema
  compile + validate latency for a moderately deep nested schema at
  <500ms wall-clock. Guards against orders-of-magnitude regressions
  from a future `jsonschema` upgrade or accidentally pathological
  schema construction. Hard limits on schema complexity are deferred
  (the real defense today is keeping schema-bearing paths under
  `.system/`, which is system-controlled).
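The budget-pinning pattern can be sketched like this — `compile_and_validate` is a trivial stand-in for the real schema compile + validate work, not the actual test body:

```rust
use std::time::{Duration, Instant};

fn compile_and_validate() {
    // Stand-in workload; the real test builds and validates a moderately
    // deep nested schema via the `jsonschema` crate.
    let mut acc = 0u64;
    for i in 0..10_000u64 { acc = acc.wrapping_add(i * i); }
    std::hint::black_box(acc);
}

// Asserting a generous wall-clock budget guards against
// orders-of-magnitude regressions without flaking on normal CI jitter.
fn runs_within_budget(budget: Duration) -> bool {
    let start = Instant::now();
    compile_and_validate();
    start.elapsed() < budget
}
```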

**Acknowledged, no change**

- libSQL `create_system_job` unbounded row growth — already documented
  as intentional in the existing comment block, with the mitigation path
  spelled out (partial index on `WHERE category != 'system'` for listing
  queries). Rate-limiting dispatch would silently drop user-initiated
  actions, which is worse than unbounded retention. The "LLM data is
  never deleted" rule (CLAUDE.md) explicitly applies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@standardtoaster
Contributor Author

I ran these scenarios through the nearai/benchmarks harness — 52 structured-data scenarios across 8 entity types (commitments, signals, decisions, parked ideas, grocery, todos, nanny hours, transactions). Same model (Qwen 3.5), same harness, same scenarios — only the storage and tool surface differ between runs.

Workspace files (0.356) vs Collections (0.850):

| Category | Workspace | Collections | Delta |
| --- | --- | --- | --- |
| commitments | 0.516 | 0.937 | +0.421 |
| cross-entity | 0.000 | 1.000 | +1.000 |
| grocery | 0.000 | 0.833 | +0.833 |
| todos | 0.400 | 0.800 | +0.400 |
| transactions | 0.000 | 0.857 | +0.857 |
| **overall** | **0.356** | **0.850** | **+0.494** |

The gap is largest for high-item-count entities — transactions (60 items: 0→0.857), grocery (12 items: 0→0.833) — and cross-entity queries (0→1.0) where workspace requires reading every file but a collection tool handles it in one call.

Benchmark PR: nearai/benchmarks#14. Harness features PR: nearai/benchmarks#13. The generator supports `--variant workspace` and `--variant collections`, so anyone can reproduce both runs.
