feat: spawn_status/spawn_cancel tools, domain loop detection, hardened cursor recovery#3303
MuataSr wants to merge 4 commits into HKUDS:main
Conversation
- Add SpawnStatusTool: list running/done subagent tasks with ID, label, elapsed time
- Add SpawnCancelTool: cancel a running subagent by task ID
- Add cancel_by_id(), get_task_status(), get_all_status(), _format_status() to SubagentManager
- Add max_iterations and timeout_seconds params to SpawnTool.execute()
- Wire new tools into AgentLoop tool registration
Upstream blocks repeated exact-URL lookups but not repeated fetches from the same domain. Add a _MAX_REPEAT_SAME_DOMAIN = 3 threshold and domain extraction via urlparse, blocking the 4th fetch from the same site and catching agent loops that vary the URL path but keep hitting one domain.
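The detection can be sketched as follows; the counter and helper name here are illustrative, not the actual nanobot internals:

```python
from collections import Counter
from urllib.parse import urlparse

# Mirrors the PR's threshold; names below are a sketch, not nanobot code.
_MAX_REPEAT_SAME_DOMAIN = 3
_domain_counts: Counter = Counter()

def should_block_fetch(url: str) -> bool:
    """Return True once a domain has been fetched more than the threshold."""
    domain = urlparse(url).netloc.lower()
    _domain_counts[domain] += 1
    # Fetches 1-3 from a domain pass; the 4th from the same domain is blocked,
    # even when the URL path varies.
    return _domain_counts[domain] > _MAX_REPEAT_SAME_DOMAIN
```

Keying on `urlparse(url).netloc` rather than the full URL is what catches loops that rotate paths while hammering one site.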
Upstream uses e.get('cursor', 0) which passes string cursors (e.g. 'abc')
as truthy values. Our fix explicitly checks isinstance(c, int) in both
_next_cursor fallback and read_unprocessed_history, preventing poisoned
string cursors from corrupting session state.
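A minimal sketch of the guard, assuming a list-of-dicts history shape (the helper name is hypothetical, not the actual memory.py code):

```python
def next_cursor(history: list[dict]) -> int:
    """Return the next cursor, ignoring entries whose cursor isn't an int."""
    for entry in reversed(history):
        c = entry.get("cursor", 0)
        # e.get('cursor', 0) alone treats a string like 'abc' as a valid
        # (truthy) cursor; the explicit isinstance check rejects it.
        if isinstance(c, int):
            return c + 1
    return 0
```

A poisoned string cursor in the newest entry no longer propagates: the scan falls back to the most recent entry whose cursor is actually an int.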
tool_parameters decorator wraps all params as StringSchema, so values arrive as strings (e.g. '5', not 5). The runner compares these against int thresholds, causing TypeError: '<=' not supported between instances of 'str' and 'int'. Add int() coercion before passing to the manager.
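The failure mode reproduces in isolation; a minimal sketch (the function name is hypothetical):

```python
def check_budget(max_iterations, limit: int = 10) -> bool:
    """Compare a requested iteration budget against a hard limit."""
    return max_iterations <= limit

# A StringSchema parameter arrives as '5', and '5' <= 10 raises
# TypeError: '<=' not supported between instances of 'str' and 'int'.
try:
    check_budget("5")
    crashed = False
except TypeError:
    crashed = True

# Coercing before the comparison avoids the crash.
ok = check_budget(int("5"))
```

Note the review below argues the better fix is declaring the parameter as an integer schema in the first place, so no runtime coercion is needed.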
Re-bin left a comment
Thanks for bundling these — the individual ideas (spawn status/cancel, iteration budget, cursor poisoning guard) are all reasonable. But the PR as-is has enough execution issues that I'd ask for changes before we merge any of it.
Bugs and blockers
- `nanobot/utils/runtime.py` declares `_MAX_REPEAT_SAME_DOMAIN = 3` twice on consecutive lines. Drop the duplicate.
- `spawn.py` declares `max_iterations` and `timeout_seconds` as `StringSchema` with a comment saying "tool_parameters wraps everything as StringSchema — coerce to int". That's not correct — `nanobot/agent/tools/schema.py:54` defines `IntegerSchema`. Using `StringSchema` tells strict-mode providers to emit strings and makes `int(max_iterations)` crash with `ValueError` if the model emits anything non-numeric. Please use `IntegerSchema` and drop the runtime coerce.
- `spawn_status.py` and `spawn_cancel.py` declare their parameters with the old hand-rolled `parameters` property dict, while `spawn.py` in the same PR uses `@tool_parameters(tool_parameters_schema(...))`. The rest of the codebase has standardized on the decorator DSL — please convert these two new tools over for consistency.
- `memory.py`'s `_next_cursor` now scans every history entry on every call (O(n)). The old fast path read the last entry (O(1)). The poisoning fix is correct in intent — the right shape is to keep the fast path, add `isinstance(c, int)` there, and only fall back to the full scan when the last entry's cursor isn't an int.
- `asyncio.wait_for` around `self.runner.run(...)` will raise `TimeoutError` without setting `status.phase = "done"` or cleaning up `_running_tasks`/`_task_statuses` for the timed-out task. Please add explicit cleanup so timed-out subagents don't show as "running" forever in `spawn_status`.
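The timeout-cleanup point can be sketched as follows; the bookkeeping names follow the PR (`_running_tasks`, `_task_statuses`), but the manager class and wrapper function here are stand-ins, not the actual SubagentManager:

```python
import asyncio
from types import SimpleNamespace

class SubagentBookkeeping:
    """Minimal stand-in for the manager's task bookkeeping (assumed shape)."""
    def __init__(self):
        self._running_tasks: dict = {}
        self._task_statuses: dict = {}

async def run_with_timeout(manager, task_id, coro, timeout_seconds):
    """Run a subagent coroutine, cleaning up bookkeeping even on timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Without this branch the timed-out task stays "running" forever
        # in spawn_status output.
        manager._task_statuses[task_id].phase = "error"
        raise
    finally:
        manager._running_tasks.pop(task_id, None)

async def demo():
    mgr = SubagentBookkeeping()
    mgr._running_tasks["t1"] = object()
    mgr._task_statuses["t1"] = SimpleNamespace(phase="running")
    try:
        await run_with_timeout(mgr, "t1", asyncio.sleep(5), 0.01)
    except asyncio.TimeoutError:
        pass
    return mgr
```

The `finally` clause covers both success and timeout, so `_running_tasks` is always reaped; the status phase is flipped only on the timeout path.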
Testing
223 lines of new behavior across subagent lifecycle, history cursor recovery, and same-domain blocking, with zero tests. "Running in production" isn't a test. Please add at least:
- one test for cursor recovery: write a history entry with a non-int cursor, verify `_next_cursor` doesn't crash and returns a valid int
- one test for cancel_by_id: spawn → cancel → verify `_running_tasks` and `_task_statuses` are cleaned up
- one test for timeout: spawn with 1s timeout on a task that sleeps 5s, verify cleanup happens and status ends as error
- one test for domain detection: 4 different URLs from the same domain → 4th is blocked
Scope
These are actually four unrelated improvements (spawn tools, iteration/timeout budget, domain loop detection, cursor recovery). Merging them as one makes any partial revert painful. If you can split into separate PRs I can review each one faster; the cursor recovery one especially deserves to land on its own since it's a reliability fix with a clear story.
Ergonomics
_MAX_REPEAT_SAME_DOMAIN = 3 is hardcoded and not configurable. Legitimate research flows often fetch 4+ pages from one site (blog archives, docs sites). Please expose this as a config knob with a more permissive default (I'd suggest 10) so the protection doesn't fire on legitimate usage.
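One possible shape for the knob, as a sketch (the config class and field name are suggestions, not existing nanobot config):

```python
from dataclasses import dataclass

@dataclass
class FetchGuardConfig:
    """Config knob for the same-domain repeat guard.

    A default of 10 keeps the loop protection while allowing legitimate
    multi-page research flows (blog archives, docs sites).
    """
    max_repeat_same_domain: int = 10

def is_over_budget(fetch_count_for_domain: int, cfg: FetchGuardConfig) -> bool:
    """True when a domain's fetch count exceeds the configured threshold."""
    return fetch_count_for_domain > cfg.max_repeat_same_domain
```

Users who want the PR's original strictness can still set `max_repeat_same_domain=3` explicitly.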
Also worth closing #3172 with a note pointing at this PR once the scope is settled.
Thanks for the thorough review, @Re-bin — the feedback was spot-on. I've split this PR into separate focused PRs as you suggested. First one is up: PR #3340 — cursor recovery, the highest-priority reliability fix with a clear production incident behind it. It addresses your specific feedback.
The remaining three PRs (spawn status/cancel with decorator DSL, iteration/timeout budget with IntegerSchema, domain loop detection with configurable threshold) will follow. Also closing #3172 — cursor poisoning was the blocking issue reported there and PR #3340 fixes it.
#3340 is merged. Thanks!
Summary
Four targeted improvements to agent orchestration and safety:
1. spawn_status and spawn_cancel tools
2. max_iterations and timeout_seconds on spawn
3. Same-domain loop detection
4. Hardened cursor recovery (memory.py)
Files changed (7 files, +218 -35)
Testing
Replaces