
feat: structured collections — typed CRUD tools for agent workspaces#1937

Open
standardtoaster wants to merge 56 commits into nearai:staging from
standardtoaster:fix/collection-scoping-and-startup

Conversation

@standardtoaster
Contributor

Collections: Typed Structured Storage for Agent Workspaces

Summary

"Add milk to the grocery list." Most deployments can't do this out of the box. The agent either creates a new document every time (fragmenting the list) or tries to edit an existing markdown file and corrupts it. Across 28 test scenarios and 2 models, modifying structured data stored in memory documents fails every time.

This PR adds schema-defined collections with auto-generated typed CRUD tools. Register a collection with a schema, get a tool that handles inserts, queries, updates, deletes, and aggregations with validation.

| Model | Collections | Memory docs | Delta |
|---|---|---|---|
| Qwen 3.5-35B (local) | 76% | 37% | +39 |
| Claude Haiku 4.5 | 70% | 26% | +44 |

35 files changed, ~12,600 lines. Dual-backend (PostgreSQL + libSQL), per-user tool isolation. Also provides the storage layer for #1474.

The problem

A user says "add milk to the grocery list." Both options fail:

  1. Write a new document. Creates fragments. "What's on the list?" requires deduplication across scattered files.

  2. Edit an existing document. Read, parse, modify, write back. Models can't do this — they lose items, duplicate entries, or corrupt formatting. Fails on every model tested.

The problem is structural: append-only documents don't support mutation.

What a user sees

POST /api/collections

```json
{
  "collection": "grocery_items",
  "description": "Shopping list",
  "fields": {
    "name":     { "type": "text", "required": true },
    "quantity": { "type": "number" },
    "category": { "type": { "enum": ["produce", "dairy", "meat", "bakery", "other"] } },
    "store":    { "type": "text" },
    "done":     { "type": "bool", "default": false }
  }
}
```

This generates a tool called grocery_items. The model calls:

```
grocery_items(operation: "add", data: { "name": "milk", "category": "dairy" })
grocery_items(operation: "query")
grocery_items(operation: "summary", agg_operation: "count")
```

The schema validates inputs and coerces LLM quirks: "12" becomes 12, "true" becomes true, "tomorrow" becomes an ISO date.
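A minimal sketch of what such a coercion layer might look like (function names are assumptions, not the PR's actual API; natural-language date parsing is omitted for brevity):

```rust
// Hypothetical sketch of schema-level input coercion: LLMs often emit
// numbers and booleans as strings, so normalize before validation.
fn coerce_number(v: &str) -> Option<f64> {
    v.trim().parse::<f64>().ok()
}

fn coerce_bool(v: &str) -> Option<bool> {
    match v.trim().to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => Some(true),
        "false" | "no" | "0" => Some(false),
        _ => None,
    }
}

fn main() {
    assert_eq!(coerce_number("12"), Some(12.0));
    assert_eq!(coerce_bool("true"), Some(true));
    // Unparseable input surfaces as None so validation can reject it cleanly.
    assert_eq!(coerce_bool("maybe"), None);
    println!("coercion ok");
}
```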

Evidence

Collections vs memory docs (28 scenarios, same data, same models)

| Category | Qwen+Coll | Qwen+Mem | Haiku+Coll | Haiku+Mem |
|---|---|---|---|---|
| Grocery (7) | 76% | 29% | 76% | 28% |
| Nanny hours (7) | 56% | 21% | 69% | 24% |
| Todo (6) | 72% | 52% | 72% | 35% |
| Transactions (8) | 59% | 48% | 65% | 21% |

Tool design matters

| Approach | Score | Why |
|---|---|---|
| 1 unified tool/collection + skills | 76% | Low tool count + guided operations |
| 5 per-operation tools + skills | 65% | Self-documenting but 20 tools adds noise |
| 1 unified tool/collection, no skills | 68% | Model handles the operation enum fine |
| Flat files (memory docs) | 37% | Writes broken |
| Generic CRUD (5 tools for all) | 41% | Model forgets collection names |

One tool per collection with an operation parameter is the best design. Auto-generated skills (+8%) teach the model which operation to use for natural language intents.

How it works

Storage

StructuredStore trait, fully implemented for both PostgreSQL (JSONB) and libSQL (json_extract). Records in a shared structured_records table, discriminated by (user_id, collection) with composite index + GIN index. 7 field types: Text, Number, Date, Time, DateTime, Bool, Enum.

Schema alteration supports adding/removing fields and enum values. Existing records are preserved — queries, filters, and aggregations handle missing fields correctly (SUM/AVG skip records where the field doesn't exist).
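To make the missing-field semantics concrete, here's an illustrative sketch (not the PR's code, which pushes aggregation into SQL/JSONB): records that predate a field are skipped by SUM/AVG rather than counted as zero.

```rust
// Illustrative aggregation semantics: a record without the field
// contributes nothing to SUM and doesn't inflate the AVG denominator.
fn sum_present(values: &[Option<f64>]) -> f64 {
    values.iter().flatten().sum()
}

fn avg_present(values: &[Option<f64>]) -> Option<f64> {
    let present: Vec<f64> = values.iter().flatten().copied().collect();
    if present.is_empty() {
        None
    } else {
        Some(present.iter().sum::<f64>() / present.len() as f64)
    }
}

fn main() {
    // Middle record predates the `quantity` field: skipped, not zeroed.
    let quantities = [Some(2.0), None, Some(4.0)];
    assert_eq!(sum_present(&quantities), 6.0);
    assert_eq!(avg_present(&quantities), Some(3.0));
}
```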

Tools

Each collection gets one tool named {user}_{collection} with parameters:

  • operation (required enum: query, add, update, delete, summary)
  • data (object — validated against the collection's field schema)
  • record_id (string — for update/delete)
  • filters (object — field → {op, value}, supports eq/neq/gt/gte/lt/lte/is_null/is_not_null)
  • field, agg_operation, group_by (for summary aggregations)

Tool names include the owner prefix; the dispatcher filters per-user via tool_definitions_for_user().
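As an illustration of the filters shape (a hypothetical payload, not taken from the PR), a query for unbought produce might pass:

```json
{
  "operation": "query",
  "filters": {
    "done":     { "op": "eq", "value": false },
    "category": { "op": "eq", "value": "produce" }
  }
}
```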

When a collection is registered, a SKILL.md is auto-generated with activation keywords from the schema. Keyword/regex matching injects the skill into the system prompt when relevant. On restart, existing schemas are loaded and tools registered before the first conversation.
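A rough sketch of the keyword-activation step described above (names are hypothetical; the real selector also supports regex matching and, per later commits, embedding-based scoring):

```rust
// Hypothetical sketch of keyword-based skill activation: a skill is
// injected into the system prompt when any of its activation keywords
// appears in the user message.
fn matching_skills<'a>(
    message: &str,
    skills: &'a [(&'a str, Vec<&'a str>)], // (skill name, activation keywords)
) -> Vec<&'a str> {
    let msg = message.to_ascii_lowercase();
    skills
        .iter()
        .filter(|(_, keywords)| {
            keywords.iter().any(|k| msg.contains(&k.to_ascii_lowercase()))
        })
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    let skills = vec![
        ("grocery_items", vec!["grocery", "shopping", "milk"]),
        ("nanny_shifts", vec!["nanny", "hours", "shift"]),
    ];
    let active = matching_skills("add milk to the grocery list", &skills);
    assert_eq!(active, vec!["grocery_items"]);
}
```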

REST API

6 endpoints for external integration (webhooks, cron, scripts). Inherits gateway auth.

  • GET/POST /api/collections — List / register schemas
  • GET/POST /api/collections/{name}/records — Query / insert records
  • PUT/DELETE /api/collections/{name}/records/{id} — Update / delete records

Multi-tenant scoping

Collections are scoped by user_id. Every query includes WHERE user_id = $1. Tool definitions are filtered per-user in the dispatcher, job workers, and routine engine.

For cross-user read access, source_scope allows a tool to query another user's data. source_scope is stripped in CollectionRegisterTool::execute() and collections_register_handler — only server-side seeding can set it.
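A minimal sketch of that stripping step (types are simplified stand-ins; the real code operates on the registration payload inside the tool and handler):

```rust
// Hypothetical sketch: client-supplied source_scope is discarded before
// registration, so cross-user read grants can't be forged via the API.
// Only server-side seeding bypasses this step.
use std::collections::BTreeMap;

fn strip_source_scope(payload: &mut BTreeMap<String, String>) {
    payload.remove("source_scope");
}

fn main() {
    let mut payload = BTreeMap::from([
        ("collection".to_string(), "grocery_items".to_string()),
        ("source_scope".to_string(), "other_user".to_string()),
    ]);
    strip_source_scope(&mut payload);
    assert!(!payload.contains_key("source_scope"));
    assert!(payload.contains_key("collection"));
}
```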

Tool scaling

Each collection adds 1 tool. 20 collections = 20 tools, ~300 extra tokens with compressed descriptions. Per-user filtering and on-demand skill injection keep prompts lean.

Drawbacks

  • Adds a new storage abstraction alongside memory documents. Zero impact if unused.
  • Fields must be defined upfront. Adding a required field doesn't backfill old records.
  • Delete by description succeeds ~40%. Record IDs in query results make delete-by-ID reliable.

Alternatives considered

  • Improve document-based search: writes remain 0%.
  • External database/service: breaks single-deployment model.
  • Generic CRUD tools: 41%. Model forgets collection names as parameters.
  • 5 per-operation tools per collection: 65%. Scales poorly (50 tools at 10 collections).

Compatibility

Additive only. No existing tables modified, no existing tool behavior changed, no changes to memory documents. V16 migration adds two new tables (structured_schemas, structured_records). owner_user_id() added to Tool trait with default None — existing tools unaffected. Both backends implemented and tested at parity. Set COLLECTION_TOOL_MODE=unified to enable (recommended).

@github-actions github-actions bot added scope: agent Agent core (agent loop, router, scheduler) scope: channel/web Web gateway channel scope: tool Tool infrastructure scope: tool/builtin Built-in tools scope: db Database trait / abstraction scope: db/postgres PostgreSQL backend scope: db/libsql libSQL / Turso backend scope: worker Container worker scope: docs Documentation size: XL 500+ changed lines risk: medium Business logic, config, or moderate-risk modules contributor: regular 2-5 merged PRs labels Apr 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces "Collections," a typed structured storage system for agent workspaces that provides schema-validated CRUD tools with support for PostgreSQL and libSQL backends. The code review identified several critical issues, including the omission of workspace read scopes in API handlers which breaks cross-lens access, and missing event broadcasts in the unified collection tool that would prevent routine triggers from firing. Additionally, feedback highlighted a correctness issue in libSQL aggregations where app-layer processing combined with record limits could yield wrong results, and pointed out that ORDER BY clauses fail to handle top-level database columns correctly. Other improvements were suggested regarding transaction locking in libSQL, robust group key extraction, and the use of UTC for consistent date resolution.

Comment on lines +190 to +195
```rust
let owner = match resolve_collection_owner(
    db.as_ref(),
    &user.user_id,
    &Vec::<String>::new(),
    &name,
)
```

high

The call to resolve_collection_owner passes an empty vector for scopes, effectively disabling cross-lens access for queries. To support the multi-tenant scoping described in the RFC, the user's workspace_read_scopes should be passed here. Similar issues exist in the insert, update, and delete handlers (lines 281, 384, and 478).

Suggested change

```diff
 let owner = match resolve_collection_owner(
     db.as_ref(),
     &user.user_id,
-    &Vec::<String>::new(),
+    &user.workspace_read_scopes,
     &name,
 )
```

Comment on lines +230 to +232
```rust
.query_records(user_id, collection, &aggregation.filters, None, usize::MAX)
.await?;
```


security-high

The aggregate_in_rust function performs aggregations in the application layer after fetching records. However, query_records enforces a limit of 1000, leading to incorrect results for larger collections. Following repository rules, use targeted database queries (e.g., SQL SUM/AVG) to perform aggregations at the database level to ensure correctness and prevent performance/DoS vulnerabilities.

References
  1. Use targeted database queries to fetch specific records instead of loading all records and filtering in the application, especially for public endpoints, to prevent performance bottlenecks and DoS vulnerabilities.

Comment on lines +212 to +218
```rust
self.db
    .update_record(
        self.owner_scope(ctx),
        record_id,
        serde_json::Value::Object(final_updates),
    )
    .await
```

high

The execute_update method is missing the broadcast of a CollectionWriteEvent. This will prevent routine triggers based on collection updates from firing when this tool is used. An event should be sent to collection_write_tx similar to the implementation in execute_add.

Comment on lines +242 to +244
```rust
self.db
    .delete_record(self.owner_scope(ctx), record_id)
    .await
```

high

The execute_delete method is missing the broadcast of a CollectionWriteEvent. This will prevent routine triggers based on collection deletions from firing when this tool is used.

Comment on lines +137 to +138
```rust
let mut user_ids = vec![user.user_id.clone()];
user_ids.extend(Vec::<String>::new().iter().cloned());
```

medium

The collections_list_handler is missing the logic to include the user's workspace read scopes, as indicated by the redundant extension with an empty vector. This prevents users from seeing collections they have cross-lens access to, which contradicts the PR description regarding cross-lens support.

Suggested change

```diff
 let mut user_ids = vec![user.user_id.clone()];
-user_ids.extend(Vec::<String>::new().iter().cloned());
+user_ids.extend(user.workspace_read_scopes.iter().cloned());
```

Comment thread src/db/libsql/structured.rs Outdated
```rust
) -> Result<(), DatabaseError> {
    let conn = self.connect().await?;

    conn.execute("BEGIN", ())
```

medium

A deferred transaction (BEGIN) is used here for a read-modify-write cycle. To prevent race conditions and potential deadlocks in libSQL/SQLite, use BEGIN IMMEDIATE to acquire the write lock at the start of the transaction, as per repository guidelines.

Suggested change

```diff
-conn.execute("BEGIN", ())
+conn.execute("BEGIN IMMEDIATE", ())
```
References
  1. To prevent race conditions in get-or-create operations in SQLite/libSQL, use BEGIN IMMEDIATE transactions to serialize concurrent writers.

Comment on lines +647 to +649
```rust
Some(field) => {
    structured::validate_field_name(field)
        .map_err(|e| DatabaseError::Query(e.to_string()))?;
```

medium

The ORDER BY clause interpolates the field name directly into json_extract. If a user attempts to order by a top-level column like created_at or updated_at, the query will attempt to extract these from the JSON data field (where they don't exist), resulting in an incorrect sort order. Consider handling these special fields similarly to resolve_filter_field.

Comment thread src/db/libsql/structured.rs Outdated
Comment on lines +778 to +780
```rust
.get::<String>(0)
.unwrap_or_else(|_| "null".to_string());
let value = extract_agg_value(&row, 1, &aggregation.operation)?;
```

medium

Using row.get::<String>(0) will fail if the group key is a numeric or boolean type in the JSON data. Using the get_text helper is safer as it handles type conversion and NULL values more robustly.

Suggested change

```diff
-    .get::<String>(0)
-    .unwrap_or_else(|_| "null".to_string());
+let key = get_text(&row, 0);
 let value = extract_agg_value(&row, 1, &aggregation.operation)?;
```

Comment thread src/db/postgres.rs Outdated
Comment on lines +1375 to +1377
```rust
structured::validate_field_name(field)
    .map_err(|e| DatabaseError::Query(e.to_string()))?;
sql.push_str(&format!(" ORDER BY data->>'{field}'"));
```

medium

Similar to the libSQL implementation, ordering by top-level columns like created_at or updated_at will fail because the query attempts to access them via the data->> JSONB operator. These fields should be handled as direct column references.

Comment thread src/db/structured/mod.rs Outdated
```rust
/// "tomorrow", "this week", or day names instead of YYYY-MM-DD. This handles
/// the most common cases so the tool call doesn't fail validation.
fn try_parse_natural_date(s: &str) -> Option<NaiveDate> {
    let today = chrono::Local::now().date_naive();
```

medium

Using chrono::Local::now() for natural language date resolution can lead to inconsistent results depending on the server's local timezone configuration. It is generally safer to use Utc::now() to ensure consistent behavior across different deployment environments.

Suggested change

```diff
-let today = chrono::Local::now().date_naive();
+let today = chrono::Utc::now().date_naive();
```

@standardtoaster standardtoaster marked this pull request as draft April 2, 2026 23:28
@ilblackdragon
Member

What about modelling this as a SQL database with tables, and letting the agent just run SQL queries on it?
This would then also work if the user connects to external databases. It lets the model leverage all the SQL knowledge it already has, and the agent can just maintain metadata about the tables it can fetch.

For the multi-tenant case it would be great to have: per-user, shared across everyone, and also group-based. This would allow sharing these tables between groups of users or the whole organization.

@github-actions github-actions bot added scope: channel Channel infrastructure scope: tool/builder Dynamic tool builder labels Apr 3, 2026
@standardtoaster
Contributor Author

Thanks for the feedback. I think SQL is interesting as a backing store, but I'd push back on exposing raw SQL to the agent:

Safety surface. SQL gives the model room to accidentally DELETE, DROP TABLE, or ALTER schema. Multi-tenant safety would need sharded databases and row-level ACLs — a lot of infrastructure to protect against something typed tools prevent by design.

Model interface. Small models likely know SQL syntax well — it's heavily represented in training data. But SQL shifts the scaling problem: instead of knowing which tool to call, the model needs to know every table name, column name, and type. That's the same context pressure as tool count, but less constrained. Typed tools give the model a bounded action space with self-documenting parameters. Happy to benchmark SQL vs typed tools if you feel strongly — I have the test harness for it.

Tool scaling is the real concern. With cross-scope access, tool count can balloon (107 tools in my deployment, 40 from collection duplicates). I think the answer is a few things working together:

  1. Deferred tool schemas — compressed descriptions (~15 tokens/tool) instead of full JSON schemas. I have a branch where 87 tools = ~1,300 tokens vs ~18K for full schemas.
  2. Unified search pipeline with tool activation — embed tool descriptions into the same search index as memory. When the agent does a memory_search, it gets back relevant memories and relevant tool schemas for actions it might want to take — similar to how skills are already injected based on context. As the system grows in capabilities, the search pipeline becomes the discovery mechanism rather than the tool list.
  3. REST endpoints / routines for bulk operations — imports, syncs, and batch updates don't need to go through the agent's tool list. A webhook-triggered routine or dedicated endpoint handles bulk ingestion directly, keeping the agent's tool surface focused on interactive CRUD.

v2's available_actions(leases) has the right hook for this — just needs filtering wired up alongside the existing execution-time gating.

On multi-tenant sharing: I just pushed new code for cross-scope auto-discovery in this PR but haven't had a chance to fully test it yet. The approach uses workspace scoping (workspace_read_scopes) to make collections from other scopes available without duplicating tools.

standardtoaster and others added 12 commits April 3, 2026 15:11
Add V16 migration creating structured_schemas and structured_records
tables. Define FieldType, FieldDef, CollectionSchema, Record, Filter,
Aggregation types and the StructuredStore trait (10 async methods) for
schema-validated CRUD on user-defined collections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PostgreSQL: full implementation with JSONB operators for filtering,
aggregation via SQL, and user-scoped queries.

libSQL: stub returning "not yet supported" — collections require
PostgreSQL for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dynamic tool generation: each registered collection gets 5 typed tools
(add, query, update, delete, summary). Tools are prefixed with owner
user_id to prevent cross-tenant collisions in the shared registry.

Includes:
- Keyword-triggered SKILL.md generation per collection
- Collections router skill for cross-domain discovery
- Discovery docs written to workspace memory for embedding search
- Domain synonym mappings for natural language matching
- CollectionWriteEvent broadcast channel for routine triggers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Six endpoints:
- GET/POST /api/collections (list schemas, register new)
- PUT/DELETE /api/collections/:name (alter, drop schema)
- GET/POST /api/collections/:name/records (query, insert)
- PUT/DELETE /api/collections/:name/records/:id (update, delete)
- POST /api/collections/:name/aggregate (summary queries)

GatewayState extended with collection_write_tx and skills_dir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Load existing collection schemas from the database during build_all()
and register their CRUD tools before the first conversation. Without
this, collections created in prior sessions had no tools after restart.

Iterates all known users via db.list_users() to cover multi-tenant
deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Schema lifecycle (register, alter, drop)
- CRUD operations with validation
- User isolation (tenant A can't see tenant B's data)
- Querying with filters and aggregations
- Per-collection tool generation and naming
- Fixture schemas: nanny_shifts, grocery_items

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that two users registering the same collection get tool names
with different prefixes (andrew_grocery_items_query vs
grace_grocery_items_query) and no names overlap in the shared registry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full implementation of all 10 StructuredStore methods for SQLite/libSQL:
register, get, list, drop collections; insert, get, update, delete
records; query with json_extract filters; aggregate with SUM/COUNT/
AVG/MIN/MAX.

Key SQLite adaptations:
- json_extract(data, '$.field') for JSONB equivalent
- Type-preserving filter params (Integer/Real/Text matching)
- Explicit timestamps via fmt_ts() (no NOW())
- V17 incremental migration for existing databases

All 19 structured_collections integration tests pass on libSQL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add owner_user_id() to the Tool trait (default None for built-ins).
Collection CRUD tools return their owner's user_id. ToolRegistry gains
tool_definitions_for_user() which excludes tools owned by other users.
Dispatcher uses this for both initial tool list and mid-loop refreshes.

Prevents collection tool leakage across lenses in multi-tenant
deployments — Andrew only sees andrew_grocery_items_query, never
grace_grocery_items_query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
POST /api/collections now also handles schema registration (alias for
/api/collections/register). POST /api/collections/{name} inserts
records (alias for /api/collections/{name}/records). Ensures bench
tooling and Percy fork clients work against the upstream API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
15 new integration tests (35 total, all passing):
- Filter operators: Lt, Lte, Gte, Ne
- Aggregations: Avg, Min, Max
- Non-existent resources: query/get/update/delete errors
- Limit pagination
- Empty collection aggregation (sum/count/avg)
- Multiple combined filters (AND logic)

No bugs found — all edge cases handled correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2×2 experiment (Qwen/Haiku × Collections/MemoryDocs), 28 scenarios:
- Collections: Qwen 0.65, Haiku 0.70
- Memory docs: Qwen 0.37, Haiku 0.26
- Delta: +0.28 (Qwen), +0.44 (Haiku)
- 126 tests passing on both backends

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
standardtoaster and others added 3 commits April 3, 2026 15:23
…akage prevention

Add 10 new tests across three files validating the complete cross-scope feature:

- registry.rs: comprehensive multi-user visibility test (3 owners + builtins +
  scoped/unscoped users), execution gate test (5 cases: blocked, allowed,
  owner-always, builtin-always, nonexistent)
- collections.rs: 5 libsql-backed tests covering write targeting to owner
  partition, cross-scope query reads, collections_list leakage prevention,
  scoped collection inclusion, and source_scope write targeting
- selector.rs: 3 skill leakage tests verifying tools_prefix prevents skill
  activation for users without matching tools, multiple prefixed skills
  activate independently, and prefix matching is case-insensitive

All tests use in-process libsql (file-backed tempdir) or in-memory registries
— no postgres dependency. Total test count: 4348 passed, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After rebasing onto upstream staging (v2 engine PR), fix:
- Remove duplicate chat handlers from handlers/chat.rs (now in server.rs)
- Update crate::skills::SkillRegistry → ironclaw_skills::SkillRegistry
  (types moved to extracted crate in v2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@standardtoaster standardtoaster force-pushed the fix/collection-scoping-and-startup branch from 4fca16e to ebb5735 Compare April 3, 2026 14:31
standardtoaster and others added 12 commits April 3, 2026 15:45
The REST handler for POST /api/collections only generated tools for the
single schema being registered. If collections from prior sessions had
their tools lost on restart, they remained toolless until individually
re-registered. Add a public bootstrap_collection_tools() that loads ALL
collections for the user and generates per-collection tools for each.
The handler now calls this instead of refresh_collection_tools() on
the single schema, ensuring every collection has working tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- rustfmt on 5 files subagents touched but didn't format
- Gate unique_id behind #[cfg(feature = "postgres")] for libsql-only builds
- Include bootstrap regression test from subagent working tree

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add embedding-based scoring alongside existing keyword/tag scoring in the
skill selection pipeline, so skills activate for semantically related
queries even when keywords don't match.

Architecture:
- New `Embedder` trait in engine crate (minimal async embed interface)
- `EmbedderBridgeAdapter` wraps main crate's `EmbeddingProvider`
- `__embed_text__` host function exposed to Python orchestrator
- Threaded through ExecutionLoop -> ThreadManager -> init_engine
- Wired automatically when embedding provider is configured

Python orchestrator changes:
- `score_skill()` computes cosine similarity (max 25 points)
- Skill description embeddings cached per session
- Goal embedding computed once in `select_skills()`
- Graceful fallback to keyword-only when embeddings unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tests covering the Percy multi-lens scenario: generate_collection_tools()
through ToolRegistry::tool_definitions_for_user() for 3 lenses (andrew,
grace, household) with workspace_read_scopes filtering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v1 agent loop's skill injection (dispatcher.rs) selected skills via
keyword matching but only logged at debug level. The v2 engine emits
skill_activated events via __emit_event__ but the v1 path didn't.

Emit StatusUpdate::SkillActivated after skill selection so SSE consumers
(bench scripts, web UI) can track which skills are actually injected per
scenario.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cherry-pick from 7775902 left merge conflict markers in
collections.rs. Resolve conflicts and verify generic_collections.rs
compiles cleanly against the post-rebase codebase.

COLLECTION_TOOL_MODE=unified gives 1 tool per collection (with operation
enum) instead of 5, reducing tool count from 5N to N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each collection now generates 1 tool with an operation enum parameter
instead of 5 separate tools. This reduces tool count from 5N to N,
which is critical for small models that struggle with 55+ tools.

Set COLLECTION_TOOL_MODE=separate for legacy 5-per-collection behavior.

Adds regression test verifying generate_collection_tools returns exactly
1 tool per collection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After keyword-based skill selection, the v1 dispatcher now runs a
semantic re-scoring pass using the EmbeddingProvider. Skills that
keywords missed but are semantically similar to the user message
(cosine similarity >= 0.5) get discovered and activated. Skill
description embeddings are cached across calls to avoid redundant
API requests. Falls back gracefully when embeddings are unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After keyword-based skill selection, boost with cosine similarity
scoring against skill descriptions using EmbeddingProvider. Skills
with similarity >= 0.5 are added even without keyword matches.
Skill embeddings are cached across calls.

This should suppress irrelevant built-in skills (e.g. github injecting
on commitment queries) while surfacing relevant skills that lack exact
keyword matches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e flag

- UnifiedCollectionTool now broadcasts events for update and delete operations
  (was only broadcasting for insert)
- Remove COLLECTION_TOOL_MODE env var and is_separate_mode() — unified tool is
  the only mode
- Simplify generate_collection_tools() and refresh_collection_tools() to always
  use unified path
- Update multi-lens tests for unified tool names (1 tool per collection)
- Add 4 event broadcast tests: insert/update/delete fire events, query does not

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keyword matching alone lets irrelevant built-in skills (e.g. github)
inject alongside commitment skills. Now keyword-matched skills are
checked against the message embedding — if cosine similarity < 0.3,
the skill is dropped. Users can always force-activate with /skill-name.

Two thresholds:
- VETO (0.3): keyword-matched skills below this are dropped
- DISCOVERY (0.5): skills without keyword matches above this are added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete CollectionAddTool, CollectionUpdateTool, CollectionDeleteTool,
CollectionQueryTool, and CollectionSummaryTool — all dead production code
replaced by UnifiedCollectionTool.

Rewrite all test modules (history, cross_scope, scoped_access, bootstrap,
cross_scope_libsql, structured_tools) to use the unified tool API with
operation parameter instead of per-op struct constructors.

Net -816 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ilblackdragon
Member

Here is another suggestion I actually want to try: memory files with schema validation.

We already have metadata for files in the workspace; we can add schema validation on top.
Whenever the agent reads a folder or file it gets the schema, and when it writes, the data is validated automatically.

It would be great to test this approach. It uses all the same infra we already have.

You can still create alias tools like grocery_list which effectively does memory_tree("collections/groceries", ).

@standardtoaster
Contributor Author

standardtoaster commented Apr 5, 2026

I prototyped this to get some signal. Quick hack: same 30 benchmark scenarios from my earlier comment on #1736, but seeded as JSON files in workspace folders instead of collection records or markdown. That comment has the full breakdown across all variants including the workspace skills from #1736.

Setup: Each entity type gets a folder (collections/commitments/, collections/signals/, etc.). Each record is a {id}-{slug}.json file. Each folder has a _schema.json describing fields and types. AGENTS.md tells the model to use memory_tree to list folders, memory_read to fetch individual records, and memory_search to find items by keyword. No custom tools — just the standard memory tools with structured data and explicit guidance. ~50 lines of seeding code added to bench-commitments.py (--variant schema-workspace-bare).

Results (Qwen 3.5 4B, 30 scenarios):

| Approach | Score |
| --- | --- |
| collections-bare (typed tools, #1937) | 0.65 |
| workspace + skills (#1736) | 0.47 |
| workspace-bare (markdown) | 0.45 |
| schema-workspace (JSON + _schema.json) | 0.37 |

Per-category:

| Category | schema-workspace | collections-bare |
| --- | --- | --- |
| Commitment query | 0.24 | 0.72 |
| Commitment CRUD | 0.30 | 0.58 |
| Signal | 0.50 | 0.46 |
| Decision | 0.53 | 0.56 |
| Aggregation | 0.35 | 0.44 |
| Cross-entity | 0.46 | 0.48 |

Schema-workspace lands below plain markdown workspace. The model has to memory_tree each folder → memory_read each file → parse JSON → reason across all results. For "how many commitments are critical?" that's 10+ tool calls vs 1 with a typed collection tool. Filtering by field value (urgency=critical, status=open) is where workspace approaches fundamentally struggle — memory_search finds relevant files semantically but can't do structured comparisons.

Where it does fine: signal detection (0.50) and decisions (0.53) — these are mostly "find the thing and read it" queries where semantic search works. The gap opens on filtering, counting, and cross-record aggregation.

I think both ideas have a place: collections for data you filter/aggregate (transactions, time tracking, tasks), schema-validated workspace files for semi-structured content where search is more useful than structured queries.

@standardtoaster
Contributor Author

This has been bouncing around in my head since your first comment about SQL. I've been looking at #2025 too, the glob/grep/file_undo tools and coding skills, and I keep coming back to a tension between two philosophies of how memory could work: everything is a file (schema validation, _schema.json, the workspace as universal substrate) versus structured data behind typed query tools. They're not necessarily in conflict, but they pull in different directions when you're deciding where to put new features.

Text files clearly work well for agent memory. But sometimes what you want isn't a blank notebook, it's a preprinted one. A grocery pad with lines for item, quantity, aisle. A timesheet with columns for date, hours, name. The structure isn't constraining the agent, it's telling it exactly what to write down and giving it a way to look things up later without reading every page. That's what my testing is showing across the variants in #1736 and above: schema-workspace scored 0.37 not because the data format was wrong, but because memory_search can't flip to "all entries where urgency=critical" the way a typed tool can. I tried throwing larger models at it (Sonnet) and it didn't seem to close the gap.

For the use cases I'm building toward (financial transactions, hour tracking, to-do management with "what's overdue?" and nagging when things go stale), the query interface is load-bearing. Even modest use accumulates: a year of nanny shifts, a few months of to-dos, eventually bank statements feeding in. That's the preprinted notebook. The schema tells the agent what fields matter, and the typed tool lets it flip straight to the right page instead of reading every entry.

I think both belong in the system. Workspace for the blank notebook (prose, context, meeting notes, identity), collections for the preprinted pads (transactions, hours, tasks). Schema validation on workspace files could bridge the gap for semi-structured content. Curious whether you see a path to making the file approach work for these query patterns, or whether you're thinking of them as separate use cases too.

ilblackdragon added a commit that referenced this pull request Apr 6, 2026
Add a `schema` field to `DocumentMetadata` that enables automatic content
validation on workspace writes. When a document or its folder `.config`
carries a JSON Schema, all write operations (write, append, patch,
write_to_layer, append_to_layer) validate content against it before
persisting. This is the foundation for typed system state (settings,
extension configs, skill manifests) stored as workspace documents.

Builds on the metadata infrastructure from #1723 — schema is inherited
via the existing `.config` chain (folder → document → defaults).

Refs: #640, #1894, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 6, 2026
Add WorkspaceSettingsAdapter that implements SettingsStore by reading/
writing workspace documents at _system/settings/{key}.json. During
migration, dual-writes to both the legacy settings table and workspace.
Reads prefer workspace, falling back to the legacy table.

Known setting keys (llm_backend, selected_model, tool_permissions.*, etc.)
get JSON Schemas stored in document metadata — writes are validated
automatically by Phase 0's schema validation.

Also adds settings_schemas.rs with compile-time schema registry and
settings_path() helper.

Refs: #640, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 9, 2026
* feat(workspace): add JSON Schema validation to document metadata

Add a `schema` field to `DocumentMetadata` that enables automatic content
validation on workspace writes. When a document or its folder `.config`
carries a JSON Schema, all write operations (write, append, patch,
write_to_layer, append_to_layer) validate content against it before
persisting. This is the foundation for typed system state (settings,
extension configs, skill manifests) stored as workspace documents.

Builds on the metadata infrastructure from #1723 — schema is inherited
via the existing `.config` chain (folder → document → defaults).

Refs: #640, #1894, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(tools): add channel-agnostic ToolDispatcher with audit trail

Introduce `ToolDispatcher` — a universal entry point for executing tools
from any caller (gateway, CLI, routine engine, WASM channels). Creates
lightweight system jobs for FK integrity, records ActionRecords, and
returns ToolOutput. This is a third entry point alongside v1's
Worker::execute_tool() and v2's EffectBridgeAdapter::execute_action().

DispatchSource::Channel(String) is intentionally string-typed — channels
are interchangeable extensions that can appear at runtime.

Also adds JobContext::system() factory and create_system_job() to both
PostgreSQL and libSQL backends.

Refs: #640

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(workspace): settings-as-workspace-documents with dual-write adapter

Add WorkspaceSettingsAdapter that implements SettingsStore by reading/
writing workspace documents at _system/settings/{key}.json. During
migration, dual-writes to both the legacy settings table and workspace.
Reads prefer workspace, falling back to the legacy table.

Known setting keys (llm_backend, selected_model, tool_permissions.*, etc.)
get JSON Schemas stored in document metadata — writes are validated
automatically by Phase 0's schema validation.

Also adds settings_schemas.rs with compile-time schema registry and
settings_path() helper.

Refs: #640, #1937

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(gateway): wire ToolDispatcher into GatewayState

Add tool_dispatcher field to GatewayState with with_tool_dispatcher()
builder method. Create and wire the dispatcher in main.rs when both
tool_registry and database are available. All 16 GatewayState
construction sites updated.

Per-handler migration (routing mutations through ToolDispatcher instead
of direct DB calls) is deferred to follow-up PRs — each handler has
complex ownership checks, cache refresh, and response types.

Refs: #640

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(tools): add system introspection tools (tools_list, version)

Add SystemToolsListTool and SystemVersionTool as proper Tool
implementations that replace hardcoded /tools and /version commands.
Registered at startup via register_system_tools(). Available in both
v1 and v2 engines — no is_v1_only_tool filter to worry about.

Refs: #640

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(workspace): extension and skill state schemas and path helpers

Add workspace path helpers and JSON Schemas for storing extension configs,
extension state, and skill manifests under _system/extensions/ and
_system/skills/. This establishes the workspace document structure that
ExtensionManager and SkillRegistry will use as a durable persistence
backend (read-through cache pattern).

Runtime state (active MCP connections, WASM runtimes) stays in memory.
Only durable config and activation state moves to workspace documents.

Refs: #640, #1741

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review feedback and CI failures

CI fixes:
- deny.toml: allow MIT-0 license required by jsonschema
- workspace/document.rs: #[allow(dead_code)] on system path constants
  pending follow-up phases that consume them
- workspace/settings_adapter.rs: remove unused chrono::Utc import
- workspace/settings_adapter.rs: collapse nested if into && form

Review fixes (gemini-code-assist):
- tools/dispatch.rs: await save_action directly instead of fire-and-forget
  tokio::spawn so short-lived CLI callers cannot drop audit records before
  they are persisted; surface errors via tracing::warn
- tools/dispatch.rs: remove DispatchSource::Agent variant — sequence_num=0
  with a reused job_id would violate UNIQUE(job_id, sequence_num). Agent
  callers must use Worker::execute_tool() which manages sequence numbers
  atomically against the agent's existing job
- workspace/settings_adapter.rs: validate content against the schema BEFORE
  the first workspace write so the initial document creation cannot bypass
  schema enforcement (subsequent writes are validated by the workspace
  resolved-metadata path established after the first write)

Refs: #2049

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: unify all machine state under .system/

Rename the workspace prefix from `_system/` to `.system/` (Unix dot-prefix
convention for hidden internal state) and migrate v2 engine state from
`engine/` to `.system/engine/` so all machine-managed state lives under
one root.

New layout:

  .system/
  ├── settings/         (per-user settings as workspace docs)
  ├── extensions/       (extension config + activation state)
  ├── skills/           (skill manifests)
  └── engine/
      ├── README.md     (auto-generated index)
      ├── knowledge/    (lessons, skills, summaries, specs, issues)
      ├── orchestrator/ (Python orchestrator versions, failures, overlays)
      ├── projects/     (project files + nested missions/)
      └── runtime/      (threads, steps, events, leases, conversations)

The inner `.runtime/` dot-prefix is dropped under `.system/engine/` since
`.system/` itself is the hidden marker; no double-hiding needed.

The `ENGINE_PREFIX` constant in `workspace::document::system_paths` is
declared as the canonical convention; bridge `store_adapter` continues
to define per-subdirectory constants below it for ergonomic interpolation.

No legacy migration code — pre-production rename.

Refs: #2049

* fix(pr-2049): security, correctness, and robustness fixes from review

Critical security:
- dispatch.rs: redact sensitive params before persisting ActionRecord
  (was leaking plaintext secrets into the audit log for tools with
  sensitive_params())
- settings_schemas.rs: validate settings keys against path traversal
  (reject /, \, .., leading ., empty, length > 128, non-alphanumeric);
  wire validation into all settings_adapter read/write/delete paths

Data correctness:
- history/store.rs + libsql/jobs.rs: write status as JobState::Completed
  .to_string() ('completed' snake_case) instead of 'Completed'; system
  jobs were round-tripping as Pending in parse_job_state()
- settings_adapter.rs: fix .system/.config metadata to set
  skip_versioning: false (was true) — descendants inherit this via
  find_nearest_config, so the previous value silently disabled
  versioning for ALL .system/** documents, contradicting the audit-
  trail intent
- workspace/mod.rs: add resolve_metadata_in_scope; use it in
  write_to_layer / append_to_layer so non-primary layer writes resolve
  schema/indexing/versioning from the target layer's .config chain
  instead of the primary user_id's. Also pass &scope (not &self.user_id)
  to maybe_save_version so versions are attributed to the correct scope

Pipeline parity:
- dispatch.rs: add SafetyLayer to ToolDispatcher; mirror Worker pipeline
  (prepare_tool_params -> validator -> redact -> timeout -> sanitize
  output) so dispatch path gets the same safety guarantees as the agent
  worker. Sanitized output is now stored in ActionRecord.output_sanitized
  instead of duplicating raw JSON

Robustness:
- settings_adapter.rs: propagate update_metadata errors in
  ensure_system_config and write_to_workspace (was silently ignored
  via let _ =, leaving schemas/skip_indexing unenforced)
- settings_adapter.rs: set_all_settings now collects the first workspace
  write error and returns it after the legacy write completes, so
  partial-migration state is observable
- settings_schemas.rs: rewrite llm_custom_providers schema to match
  CustomLlmProviderSettings (id/name/adapter/base_url/default_model/
  api_key/builtin instead of stale name/protocol/base_url/model)

Build:
- Cargo.toml: jsonschema with default-features = false to avoid pulling
  a second reqwest major version

Docs:
- db/mod.rs: docstring for create_system_job uses 'completed' snake_case
- workspace/document.rs: clarify .system/ versioning ("by default ARE
  versioned; individual files may opt out via skip_versioning")
- settings_adapter.rs: clarify per-key reads prefer workspace, aggregate
  reads stay on legacy during migration
- tools/builtin/system.rs: trim doc to match implemented scope
  (system_tools_list, system_version)
- channels/web/mod.rs: move stale 'sweep tasks managed by with_oauth'
  comment back to oauth_sweep_shutdown line

Refs: #2049
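The settings-key validation described in this round (reject /, \, .., leading ., empty, length > 128, non-alphanumeric) can be sketched as follows. The exact character policy and function name here are illustrative approximations of the Rust `validate_settings_key`.

```python
import re

# First char must be alphanumeric; dots, dashes, underscores allowed after.
_KEY_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")

def validate_settings_key(key: str) -> None:
    """Reject keys that could escape .system/settings/ when interpolated."""
    if not key or len(key) > 128:
        raise ValueError(f"invalid settings key length: {key!r}")
    if "/" in key or "\\" in key or ".." in key:
        raise ValueError(f"settings key contains path characters: {key!r}")
    if not _KEY_RE.match(key):
        raise ValueError(f"settings key has disallowed characters: {key!r}")
```

The point is to validate the key at the boundary, before it is ever interpolated into a workspace path like `.system/settings/{key}.json`.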

* docs+ci: enforce 'everything goes through tools' principle

Document the core design principle from #2049 in two places so future
contributors (human and AI) discover it during development:

- CLAUDE.md: new "Everything Goes Through Tools" section near the
  "Adding a New Channel" guide. Includes the rule, the rationale (audit
  trail, safety pipeline parity, channel-agnostic surface, agent
  parity), and a pointer to the detailed rule file.
- .claude/rules/tools.md: full pattern with required/forbidden examples,
  the list of layers that ARE exempt (Worker::execute_tool, v2
  EffectBridgeAdapter, tool implementations themselves, background
  engine jobs, read-aggregation queries), and how to annotate
  intentional exceptions. Also extends `paths` to cover
  src/channels/** and src/cli/** so it surfaces when those files are
  edited.

Enforce with a new pre-commit safety check (#7) in
scripts/pre-commit-safety.sh:

- Scans newly added lines under src/channels/web/handlers/*.rs and
  src/cli/*.rs for direct touches of state.{store, workspace,
  workspace_pool, extension_manager, skill_registry, session_manager}.
- Suppress with a trailing `// dispatch-exempt: <reason>` comment on
  the same line, matching the existing `// safety:` convention.
- Only checks added lines (`+` in the diff), so existing untouched
  handlers don't trip the check during incremental migration.

The check fires only for new code: handlers that haven't been migrated
yet (52 existing direct accesses across 12 handler files) won't break
unmodified, but any new line that bypasses the dispatcher will be
flagged at commit time.

Refs: #2049

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address Copilot review on workspace schema layer

- workspace::extension_state: extension/skill path helpers now reuse the
  canonical name validators (`canonicalize_extension_name`,
  `validate_skill_name`) instead of a weak `replace('/', "_")`. Names
  containing `..`, `\`, NUL, or other escapes are now rejected at the
  helper boundary, eliminating a path-traversal foothold for callers.
  Helpers return `Result<String, PathError>`. Regression tests added.

- workspace::settings_adapter::ensure_system_config: now idempotent across
  upgrades. If `.system/.config` already exists with stale metadata
  (e.g. an older `skip_versioning: true` from before fix #3042846635),
  it is repaired to the expected inherited values instead of being left
  silently broken. Regression test added.

- workspace::settings_adapter::write_to_workspace: lazily seeds
  `.system/.config` via a `OnceCell`, so callers no longer need to
  remember to invoke `ensure_system_config()` at startup before any
  setting write. Regression test added.

- workspace::settings_adapter::delete_setting: workspace delete failures
  are now logged via `tracing::warn!` instead of being silently dropped.
  We still don't propagate the error — the legacy table is the source of
  truth during migration and a stale workspace doc is recoverable on the
  next write — but partial-delete state is now observable.

- workspace::schema: documented why we don't cache compiled validators
  yet (settings/extension/skill writes are not a hot path; revisit if
  schema validation moves into a frequent write path).

[skip-regression-check] schema.rs change is doc-only.

* fix(pr-2049): address 4 remaining review issues

1. tool_dispatcher dropped during gateway startup
   src/channels/web/mod.rs: rebuild_state was initializing
   tool_dispatcher to None, so every subsequent with_* call zeroed
   the dispatcher the first caller injected. Preserve it across
   rebuild_state like every other field. Regression test:
   tool_dispatcher_survives_subsequent_with_calls.

2. WorkspaceSettingsAdapter not wired into runtime
   src/app.rs: Build the adapter in build_all() when workspace+db
   are both present, eagerly call ensure_system_config(), expose
   on AppComponents as settings_store, and thread it into
   init_extensions(...) so register_permission_tools and
   upgrade_tool_list receive it instead of the raw db.
   src/main.rs: SIGHUP handler prefers the adapter over raw db.
   src/workspace/mod.rs: re-export WorkspaceSettingsAdapter.

3. changed_by regression on layered writes
   src/workspace/mod.rs: write_to_layer and append_to_layer were
   passing the target layer's scope as changed_by, so version
   history attributed layered edits to the layer name instead of
   the actor. Pass self.user_id while keeping metadata resolution
   in the target scope. Regression test:
   layered_writes_record_actor_in_changed_by.

4. Legacy engine/ paths invisible after upgrade
   src/bridge/store_adapter.rs: Add migrate_legacy_engine_paths(),
   called at the start of load_state_from_workspace(), which scans
   list_all() for engine/... documents and rewrites them to
   .system/engine/... Idempotent: skips rewrites when the new path
   already exists, deletes the legacy duplicate either way. Three
   regression tests in #[cfg(all(test, feature = "libsql"))]
   module.

Quality gate: cargo fmt, cargo clippy --all --all-features zero
warnings, cargo test --all-features --lib 4313 passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(e2e): use PUT for settings write in ownership test

test_settings_written_and_readable was sending POST /api/settings/{key}
but the route has been PUT since #4 (Feb 2026) — the test was returning
405 Method Not Allowed. Switch to httpx.put() so it matches the current
route registration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address second round of review feedback

Addresses the remaining unresolved PR #2049 review comments from
serrrfirat and ilblackdragon.

## Changes

### ToolDispatcher — integration coverage + log level
- src/tools/dispatch.rs: add two libsql-gated integration tests for
  the full dispatch pipeline: (a) persist an ActionRecord with
  sensitive params redacted in the audit row while the tool still
  sees the raw value, sanitized output populated; (b) honor the
  per-tool execution_timeout() and record a failure action.
- Tests use a raw-SQL helper to find system-category jobs since
  list_agent_jobs_for_user intentionally filters them out.
- Replace warn! with debug! on audit persistence failure — dispatch
  is reachable from interactive CLI/REPL sessions where warn!/info!
  output corrupts the terminal UI (CLAUDE.md Code Style → logging).

### WorkspaceSettingsAdapter — log level
- src/workspace/settings_adapter.rs: same warn! → debug! fix on the
  delete_setting workspace failure path, for the same REPL reason.

### Schema validation — surface all errors
- src/workspace/schema.rs: switch from jsonschema::validate to
  validator_for + iter_errors so users fixing a malformed setting
  see every violation in one round instead of playing whack-a-mole.
  Also distinguishes "invalid schema" from "invalid content" errors.
- Regression tests: multiple_errors_are_all_reported and
  invalid_schema_is_distinguished_from_invalid_content.

### create_system_job — started_at + row growth docs
- src/db/libsql/jobs.rs and src/history/store.rs: include started_at
  in the INSERT (set to the same instant as created_at/completed_at)
  so duration queries don't see NULL and "started but not completed"
  filters don't misclassify these rows. Fixed in both backends.
- Add doc comments on both impls warning about row growth per
  dispatch call. Deleting rows would violate "LLM data is never
  deleted" (CLAUDE.md); if listing-query performance becomes a
  concern, prefer a partial index (WHERE category != 'system') over
  deletion.

### Lib test repair
- src/channels/web/server.rs: extensions_setup_submit_handler Err
  branch now sets resp.activated = Some(false) so clients and the
  regression test see an explicit `false` rather than `null`. Also
  rename the test's fake channel to snake_case (test_failing_channel)
  so it matches the canonicalize-extension-names behavior from
  PR #2129 — previously the test was passing a dashed name and
  getting "Capabilities file not found" instead of the intended
  activation failure.

## Not addressed (false positive / deferred)
- dispatch.rs:177 output_raw/output_sanitized swap — verified against
  ActionRecord::succeed(Option<String>, Value, Duration) and the
  worker's call site at job.rs:704; argument order is correct.
- settings_adapter.rs:186 TOCTOU window — author self-classified as
  "Low / completeness" and no other code path writes to
  .system/settings/** without going through write_to_workspace.
- schema.rs recompilation caching — deferred per earlier review.

## Quality gate
- cargo fmt
- cargo clippy --all --benches --tests --examples --all-features
  zero warnings
- cargo test --all-features --lib: 4387 passed, 0 failed, 3 ignored

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address third round of review feedback

Addresses unresolved comments from serrrfirat's "Paranoid Architect
Review" and Copilot's third pass on the engine-state migration.

## src/workspace/settings_adapter.rs

### HIGH — Cross-tenant data leak through owner-scoped Workspace

`Workspace` is constructed for a single user_id at AppBuilder time.
Without gating, `set_setting("user_B", key, val)` would dual-write into
the **owner's** workspace, and a subsequent `user_A.get_setting(...)`
would return user_B's value: a real cross-user data leak.

Fix:
- Add `gate_user_id` field set to `workspace.user_id()` at construction.
- All `SettingsStore` methods that touch the workspace now check
  `workspace_allowed_for(user_id)` first; non-owner callers fall through
  to the legacy table only — preserving their pre-#2049 behavior.
- This matches the long-term plan: per-user settings live in the legacy
  table until a per-user `WorkspaceSettingsAdapter` (one per
  WorkspacePool entry) is wired up; admin/global settings go through
  the workspace-backed path so they pick up schema validation.

Regression test: `workspace_settings_are_owner_gated_in_multi_tenant_mode`
asserts (a) owner's workspace doc is not overwritten by a non-owner write,
(b) each user reads back their own legacy value, and (c) a non-owner with
no legacy entry must NOT see the owner's workspace value bleeding through.

### MEDIUM — Dual-write order

Reverse `set_setting` and `set_all_settings` to write legacy first,
workspace second. The legacy table is the source of truth during
migration (it backs aggregate `list_settings` reads), so writing it
first guarantees those readers always see a consistent value even if
the workspace write fails. Failed workspace writes are self-healing on
the next per-key read-miss.

### MEDIUM — `ensure_system_config_lazy` double-execution race

Replace the manual `get()`/`set()` pattern with
`OnceCell::get_or_try_init`. Two concurrent first-callers no longer
both run `ensure_system_config()`. Functionally equivalent (idempotent
either way) but no longer wasteful.
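A Python analogue of the `get_or_try_init` pattern, for readers less familiar with Rust's OnceCell: a lock serializes the check-then-set so two concurrent first-callers cannot both run the initializer. The class here is a hand-rolled sketch, not a real library API.

```python
import threading

class OnceCell:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = None
        self._set = False

    def get_or_try_init(self, init):
        with self._lock:
            if not self._set:
                # If init() raises, _set stays False and a later
                # caller retries — mirroring get_or_try_init semantics.
                self._value = init()
                self._set = True
            return self._value
```

As the commit notes, the manual get()/set() version was idempotent anyway; the win is avoiding wasted duplicate initialization under contention.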

## src/bridge/store_adapter.rs

### MEDIUM — Migration drops document metadata (S3)

`migrate_legacy_engine_paths` previously copied only `doc.content`,
silently dropping the `metadata` column. Now calls
`ws.update_metadata(new_doc.id, &doc.metadata)` after each write to
preserve schema/skip_indexing/hygiene flags. Logged-not-fatal: content
has already been moved, metadata loss is recoverable.

Regression test: `migration_preserves_document_metadata` seeds a doc
with custom metadata and asserts it survives the rewrite.

### MEDIUM — `ws.exists()` swallowed transient errors (Copilot)

`unwrap_or(false)` on the existence check could cause the migrator to
overwrite an existing `.system/engine/...` doc when storage hiccups.
Now propagates the error (counts as failed step + `continue`), per
Copilot's exact suggested patch.

### LOW — `list_all()` runs every startup (Copilot)

Add a cheap preflight: `ws.list("engine")` first; only fall through to
the recursive `list_all()` discovery when the directory listing returns
at least one entry. Steady-state startups (post-migration) skip the
full workspace scan entirely.

Regression test: `migration_preflight_skips_full_scan_when_no_legacy_paths`
asserts unrelated and already-migrated documents are untouched.

### MEDIUM — Counter undercount on `already_present` (S5)

When `already_present` is true the legacy duplicate is still deleted,
but the previous code skipped the `migrated += 1` increment, undercounting
in debug logs. Fixed: `migrated` now counts every successful path
migration including the already-present case.

### Documented — Version-history loss is acceptable scope (C1)

Read-write-delete pattern means `memory_document_versions.document_id
ON DELETE CASCADE` drops the legacy doc's version chain. Documented in
the function-level doc comment as intentional + bounded:
- v2 engine state is runtime state (rewritten on every mutation), not
  user-curated data
- v2 was newly introduced in this PR — no production deployment with
  pre-existing curated history at risk
- A path-preserving rename op would need new trait methods on both
  backends; out of scope for fix-forward. If a future caller needs
  history-preserving rename, it should be added to the storage layer
  properly, not bolted onto migration.

## Quality gate
- cargo fmt
- cargo clippy --all --benches --tests --examples --all-features
  zero warnings
- cargo test --all-features --lib: 4390 passed, 0 failed, 3 ignored
  (+3 new tests on top of round 2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address fourth round of review feedback

Two latent issues flagged by serrrfirat in the latest review pass:

1. **Null schema permanently locks documents** (`src/workspace/schema.rs`).
   `serde_json` deserializes a metadata field of `"schema": null` as
   `Some(Value::Null)`, not `None`, so the upstream
   `if let Some(schema) = &metadata.schema` check passes through to
   `validate_content_against_schema`. There, `validator_for(Value::Null)`
   errors out and every subsequent write to that document is blocked — a
   latent DoS. Added an explicit `schema.is_null()` early-return guard at
   the top of the validator, plus a regression test
   (`null_schema_is_treated_as_no_op`) that asserts even non-JSON content
   passes when the schema is null.
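The guard can be sketched as follows. In this Python stand-in, a metadata entry of `"schema": null` arrives as `None` (the Rust equivalent of `Some(Value::Null)`); the validator function and its field checks are illustrative.

```python
import json

def validate_write(content: str, schema) -> None:
    # Guard first: treating a null schema as a real schema would fail
    # schema compilation and block every later write to the document.
    if schema is None:
        return
    doc = json.loads(content)
    for field in schema.get("required", []):
        if field not in doc:
            raise ValueError(f"missing required field: {field}")
```

With the guard, even non-JSON content passes when the schema is null, exactly as the regression test asserts.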

2. **System job titles were raw source labels** (`src/history/store.rs`,
   `src/db/libsql/jobs.rs`). `create_system_job` set `title = source`,
   so any UI rendering `agent_jobs.title` would display dispatched
   system jobs as `channel:gateway` / `system` / etc. instead of a
   human-readable label. Both PostgreSQL and libSQL backends now write
   `format!("System: {source}")`. Updated the two dispatch integration
   tests that pinned the old format.

Schema-recompilation comment (`schema.rs:47`) was acknowledged as
"acceptable for now" by the reviewer; existing NOTE in the source
already documents the caching trade-off and upgrade path, so no code
change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(pr-2049): address fifth round of review feedback

Eight comments from Copilot + serrrfirat. Real fixes for the load-bearing
gaps; doc clarifications for the rest where the existing behavior is
intentional.

**Real code changes**

- `src/tools/dispatch.rs` — enforce `tool.parameters_schema()` (JSON
  Schema) in the dispatch path. Previously the SafetyLayer validator only
  checked for injection patterns; channel/CLI/routine callers could pass
  arbitrary shapes and only discover the mismatch (or worse, silently
  malformed behavior) inside the tool itself. Now we run
  `jsonschema::validate(&tool.parameters_schema(), &normalized_params)`
  after the injection check, with a permissive-empty-schema fast path so
  tools that haven't yet declared a schema aren't penalised. Regression
  test `dispatch_rejects_params_violating_tool_schema` asserts a
  required-field violation is rejected before the tool is invoked.
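The dispatch-path check can be sketched like this. A minimal stand-in: real dispatch runs full JSON Schema validation via the jsonschema crate, while this only checks required fields and one type; `dispatch_check` is a hypothetical name.

```python
def dispatch_check(params: dict, schema: dict) -> None:
    """Reject malformed params before the tool is ever invoked."""
    if not schema:
        # Permissive fast path: tools that haven't declared a
        # parameters schema yet are not penalized.
        return
    for field in schema.get("required", []):
        if field not in params:
            raise ValueError(f"params violate tool schema: missing {field!r}")
    for field, spec in schema.get("properties", {}).items():
        if field in params and spec.get("type") == "string" and not isinstance(params[field], str):
            raise ValueError(f"params violate tool schema: {field!r} must be a string")
```

Failing here means the caller gets a clear schema error instead of silently malformed behavior inside the tool.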

- `src/workspace/settings_adapter.rs` — `write_to_workspace` now calls
  `schema_for_key(key)` once and reuses the resolved schema for both
  pre-write validation and post-write metadata persistence (was called
  twice). Eliminates duplicate work and removes a theoretical
  divergence window if the schema registry ever became non-deterministic.

- `src/workspace/settings_adapter.rs` — `ensure_system_config` now also
  rewrites the `.config` document content when its metadata is repaired,
  not just the metadata column. The metadata column is the inheritance
  source of truth, but having the doc's content silently diverge from it
  confuses anyone reading the doc directly to understand which inherited
  flags are active.

- `src/error.rs` + `src/workspace/settings_schemas.rs` — new
  `WorkspaceError::InvalidPath { path, reason }` variant. Path/key
  rejection (path-traversal, character set, length) now surfaces as
  `InvalidPath`, not `SchemaValidation` — callers and downstream UIs can
  distinguish "your settings *key* has bad characters" from "your
  settings *value* failed JSON-Schema validation" without string-matching
  error messages. `validate_settings_key` returns the new variant; the
  one match site in `settings_adapter.rs::write_to_workspace` is updated.
  Regression test `validate_settings_key_returns_invalid_path_variant`.

**Documentation-only fixes**

- `src/tools/dispatch.rs` — clarify in the `dispatch()` doc-comment that
  `sanitize_tool_output` runs only against the persisted ActionRecord
  payload, NOT against the value returned to the caller. This mirrors
  `Worker::execute_tool` (the agent loop also receives the raw output so
  reasoning can be reproduced from history). Channels that forward
  dispatcher output to end users must run their own boundary
  sanitization at the channel edge.

- `src/history/store.rs` + `src/db/libsql/jobs.rs` —
  `create_system_job` doc updated to explicitly state that system job
  timestamps do NOT reflect tool execution time (the row is INSERTed
  before the tool runs, with all three timestamps pinned to "now").
  Consumers that need execution duration must read
  `job_actions.duration_ms` for the associated action rows. Restructuring
  to a two-phase INSERT+UPDATE was rejected: the audit row must be
  durable even if the dispatcher panics mid-tool, and the second write
  would double per-dispatch DB cost.
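A sketch of the timestamp contract, with an invented `SystemJob` struct standing in for the real row:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

struct SystemJob { created_at: u64, started_at: u64, finished_at: u64 }

fn create_system_job() -> SystemJob {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    // All three timestamps are pinned to the same "now" at INSERT time,
    // before the tool runs, so none of them measures execution duration;
    // the real schema keeps that in job_actions.duration_ms.
    SystemJob { created_at: now, started_at: now, finished_at: now }
}
```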

- `src/workspace/schema.rs` — added baseline regression test
  `moderately_complex_schema_compiles_within_budget` that pins schema
  compile + validate latency for a moderately deep nested schema at
  <500ms wall-clock. Guards against orders-of-magnitude regressions
  from a future `jsonschema` upgrade or accidentally pathological
  schema construction. Hard limits on schema complexity are deferred
  (the real defense today is keeping schema-bearing paths under
  `.system/`, which is system-controlled).
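The budget-pinning pattern can be sketched like this — `compile_and_validate` is a trivial stand-in for the real schema compile + validate work, not the actual test body:

```rust
use std::time::{Duration, Instant};

fn compile_and_validate() {
    // Stand-in workload; the real test builds and validates a moderately
    // deep nested schema via the `jsonschema` crate.
    let mut acc = 0u64;
    for i in 0..10_000u64 { acc = acc.wrapping_add(i * i); }
    std::hint::black_box(acc);
}

// Asserting a generous wall-clock budget guards against
// orders-of-magnitude regressions without flaking on normal CI jitter.
fn runs_within_budget(budget: Duration) -> bool {
    let start = Instant::now();
    compile_and_validate();
    start.elapsed() < budget
}
```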

**Acknowledged, no change**

- libSQL `create_system_job` unbounded row growth — already documented
  as intentional in the existing comment block, with the mitigation path
  spelled out (partial index on `WHERE category != 'system'` for listing
  queries). Rate-limiting dispatch would silently drop user-initiated
  actions, which is worse than unbounded retention. The "LLM data is
  never deleted" rule (CLAUDE.md) explicitly applies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@standardtoaster
Contributor Author

I ran these scenarios through the nearai/benchmarks harness — 52 structured-data scenarios across 8 entity types (commitments, signals, decisions, parked ideas, grocery, todos, nanny hours, transactions). Same model (Qwen 3.5), same harness, same scenarios — only the storage and tool surface differ between runs.

Workspace files (0.356) vs Collections (0.850):

| Category | Workspace | Collections | Delta |
| --- | --- | --- | --- |
| commitments | 0.516 | 0.937 | +0.421 |
| cross-entity | 0.000 | 1.000 | +1.000 |
| grocery | 0.000 | 0.833 | +0.833 |
| todos | 0.400 | 0.800 | +0.400 |
| transactions | 0.000 | 0.857 | +0.857 |
| **overall** | **0.356** | **0.850** | **+0.494** |

The gap is largest for high-item-count entities — transactions (60 items: 0→0.857), grocery (12 items: 0→0.833) — and cross-entity queries (0→1.0) where workspace requires reading every file but a collection tool handles it in one call.

Benchmark PR: nearai/benchmarks#14. Harness features PR: nearai/benchmarks#13. The generator supports `--variant workspace` and `--variant collections`, so anyone can reproduce both runs.
