
feat: spawn_status/spawn_cancel tools, domain loop detection, hardened cursor recovery #3303

Open

MuataSr wants to merge 4 commits into HKUDS:main from MuataSr:feat/spawn-tools-and-domain-loop-detection
Conversation

Contributor

@MuataSr MuataSr commented Apr 19, 2026

Summary

Four targeted improvements to agent orchestration and safety:

1. spawn_status and spawn_cancel tools

  • SpawnStatusTool: List running/done subagent tasks with task ID, label, elapsed time, and status
  • SpawnCancelTool: Cancel a running subagent by task ID
  • New methods on SubagentManager: cancel_by_id, get_task_status, get_all_status, _format_status
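The status-tracking side of this could look roughly like the following sketch. The method names mirror the PR description, but the internal fields (`_task_statuses`, the `TaskStatus` shape) are assumptions, not the actual `SubagentManager` code:

```python
# Hypothetical sketch of the status methods described above; field names
# and the TaskStatus shape are assumptions.
import time
from dataclasses import dataclass


@dataclass
class TaskStatus:
    task_id: str
    label: str
    started_at: float
    phase: str = "running"  # "running" or "done"


class SubagentManager:
    def __init__(self):
        self._task_statuses: dict[str, TaskStatus] = {}

    def get_task_status(self, task_id: str):
        return self._task_statuses.get(task_id)

    def get_all_status(self) -> list[TaskStatus]:
        return list(self._task_statuses.values())

    def _format_status(self, s: TaskStatus) -> str:
        # render one line per task: ID, label, phase, elapsed time
        elapsed = time.monotonic() - s.started_at
        return f"{s.task_id} [{s.label}] {s.phase} ({elapsed:.0f}s)"
```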

2. max_iterations and timeout_seconds on spawn

  • SpawnTool.execute now accepts max_iterations (default 15) and timeout_seconds (default 300)
  • Passed through to SubagentManager.spawn and enforced via asyncio.wait_for
  • Lets the main agent allocate iteration budget based on task complexity
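A minimal sketch of the timeout half of this, assuming enforcement via `asyncio.wait_for` as the bullet says; the wrapper function name is invented here and the `max_iterations` budgeting is omitted:

```python
# Sketch only: run_subagent is a stand-in, not SubagentManager.spawn.
import asyncio


async def run_subagent(coro_fn, timeout_seconds: float = 300):
    try:
        # wait_for cancels the subagent if it exceeds the budget
        return await asyncio.wait_for(coro_fn(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "error: subagent timed out"


async def slow_task():
    await asyncio.sleep(10)
    return "done"
```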

3. Same-domain loop detection

  • Upstream blocks repeated exact-URL lookups but not repeated fetches from the same domain
  • Adds _MAX_REPEAT_SAME_DOMAIN = 3 threshold with domain extraction via urlparse
  • Blocks the fourth and subsequent fetches from the same site, catching loops that vary the URL path but keep hitting one domain
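In shape, the check described above amounts to something like this sketch; the threshold name matches the PR, but the surrounding tracker class is an assumption:

```python
# Sketch of a per-domain fetch counter; DomainTracker is hypothetical.
from urllib.parse import urlparse

_MAX_REPEAT_SAME_DOMAIN = 3


class DomainTracker:
    def __init__(self):
        self._counts: dict[str, int] = {}

    def should_block(self, url: str) -> bool:
        domain = urlparse(url).netloc.lower()
        self._counts[domain] = self._counts.get(domain, 0) + 1
        # allow up to 3 fetches per domain; block the 4th and later
        return self._counts[domain] > _MAX_REPEAT_SAME_DOMAIN
```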

4. Hardened cursor recovery (memory.py)

  • Upstream uses e.get('cursor', 0), which passes poisoned string cursors through as truthy values
  • Explicit isinstance(c, int) guard in both _next_cursor fallback and read_unprocessed_history
  • Prevents string cursors from corrupting session state (proven fix from Apr 2026 outage)
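The guard described above reduces to something like this sketch; the history-entry shape (`{"cursor": ...}`) and the standalone function are assumptions about memory.py, not its actual code:

```python
# Sketch of the isinstance guard: skip non-int cursors instead of
# letting a poisoned string win a truthiness check.
def next_cursor(entries: list[dict]) -> int:
    best = 0
    for e in entries:
        c = e.get("cursor", 0)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(c, int) and not isinstance(c, bool) and c > best:
            best = c
    return best + 1
```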

Files changed (7 files, +218 -35)

  • nanobot/agent/tools/spawn_status.py (new)
  • nanobot/agent/tools/spawn_cancel.py (new)
  • nanobot/agent/subagent.py
  • nanobot/agent/loop.py
  • nanobot/agent/tools/spawn.py
  • nanobot/utils/runtime.py
  • nanobot/agent/memory.py

Testing

  • All imports verified passing
  • Local patches running in production since April 2026
  • Memory cursor fix proven by surviving the Apr 2026 history poisoning outage

Replaces

MuataSr added 4 commits April 19, 2026 03:54
- Add SpawnStatusTool: list running/done subagent tasks with ID, label, elapsed time
- Add SpawnCancelTool: cancel a running subagent by task ID
- Add cancel_by_id(), get_task_status(), get_all_status(), _format_status() to SubagentManager
- Add max_iterations and timeout_seconds params to SpawnTool.execute()
- Wire new tools into AgentLoop tool registration

Upstream blocks repeated exact-URL lookups but not repeated fetches from the same domain. Add _MAX_REPEAT_SAME_DOMAIN=3 threshold and domain extraction via urlparse. Blocks after 4 fetches from the same site, catching agent loops that vary the URL path but keep hitting one domain.

Upstream uses e.get('cursor', 0) which passes string cursors (e.g. 'abc') as truthy values. Our fix explicitly checks isinstance(c, int) in both _next_cursor fallback and read_unprocessed_history, preventing poisoned string cursors from corrupting session state.

tool_parameters decorator wraps all params as StringSchema, so values arrive as strings (e.g. '5' not 5). The runner compares these against int thresholds, causing TypeError: '<=' not supported between instances of 'str' and 'int'. Add int() coercion before passing to manager.
Collaborator

@Re-bin Re-bin left a comment

Thanks for bundling these — the individual ideas (spawn status/cancel, iteration budget, cursor poisoning guard) are all reasonable. But the PR as-is has enough execution issues that I'd ask for changes before we merge any of it.

Bugs and blockers

  1. nanobot/utils/runtime.py declares _MAX_REPEAT_SAME_DOMAIN = 3 twice on consecutive lines. Drop the duplicate.

  2. spawn.py declares max_iterations and timeout_seconds as StringSchema with a comment saying tool_parameters wraps everything as StringSchema — coerce to int. That's not correct — nanobot/agent/tools/schema.py:54 defines IntegerSchema. Using StringSchema tells strict-mode providers to emit strings and makes int(max_iterations) crash with ValueError if the model emits anything non-numeric. Please use IntegerSchema and drop the runtime coerce.

  3. spawn_status.py and spawn_cancel.py declare their parameters with the old hand-rolled parameters property dict, while spawn.py in the same PR uses @tool_parameters(tool_parameters_schema(...)). The rest of the codebase has standardized on the decorator DSL — please convert these two new tools over for consistency.

  4. memory.py _next_cursor now scans every history entry on every call (O(n)). The old fast path read the last entry (O(1)). The poisoning fix is correct in intent — the right shape is to keep the fast path and add isinstance(c, int) there, and only fall back to the full scan when the last entry's cursor isn't an int.
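One way to get the shape the reviewer asks for, under the same assumed entry format; this is a stand-in for `_next_cursor`, not the actual memory.py code:

```python
# Fast path reads only the last entry; the full scan runs only when
# that entry's cursor is poisoned.
def next_cursor(entries: list[dict]) -> int:
    if entries:
        c = entries[-1].get("cursor", 0)
        if isinstance(c, int) and not isinstance(c, bool):
            return c + 1  # O(1) fast path: last entry is well-formed
    # slow path: scan for the max valid cursor among all entries
    best = 0
    for e in entries:
        c = e.get("cursor", 0)
        if isinstance(c, int) and not isinstance(c, bool) and c > best:
            best = c
    return best + 1
```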

  5. asyncio.wait_for around self.runner.run(...) will raise TimeoutError without setting status.phase = "done" or cleaning up _running_tasks/_task_statuses for the timed-out task. Please add explicit cleanup so timed-out subagents don't show as "running" forever in spawn_status.
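The cleanup being requested could be sketched as follows; the field names mirror the review comment, while the class and method here are stand-ins for the real manager:

```python
# Hypothetical sketch: clean up _running_tasks/_task_statuses whether
# the subagent finishes, fails, or times out.
import asyncio


class Manager:
    def __init__(self):
        self._running_tasks = {}
        self._task_statuses = {}

    async def run_with_timeout(self, task_id: str, coro, timeout: float):
        task = asyncio.ensure_future(coro)
        self._running_tasks[task_id] = task
        self._task_statuses[task_id] = {"phase": "running"}
        try:
            return await asyncio.wait_for(task, timeout=timeout)
        except asyncio.TimeoutError:
            # mark it so spawn_status never shows a dead task as running
            self._task_statuses[task_id]["phase"] = "error"
            return f"error: {task_id} timed out"
        finally:
            self._running_tasks.pop(task_id, None)
            if self._task_statuses[task_id]["phase"] == "running":
                self._task_statuses[task_id]["phase"] = "done"
```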

Testing

223 lines of new behavior across subagent lifecycle, history cursor recovery, and same-domain blocking, with zero tests. "Running in production" isn't a test. Please add at least:

  • one test for cursor recovery: write a history entry with a non-int cursor, verify _next_cursor doesn't crash and returns a valid int
  • one test for cancel_by_id: spawn → cancel → verify _running_tasks and _task_statuses are cleaned up
  • one test for timeout: spawn with 1s timeout on a task that sleeps 5s, verify cleanup happens and status ends as error
  • one test for domain detection: 4 different URLs from the same domain → 4th is blocked
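For instance, the cancel test could take roughly this shape; `MiniManager` is a stand-in for `SubagentManager`, whose actual API is assumed:

```python
# Sketch of the spawn -> cancel -> verify-cleanup test requested above.
import asyncio


class MiniManager:
    def __init__(self):
        self._running_tasks = {}
        self._task_statuses = {}

    def spawn(self, task_id, coro):
        task = asyncio.ensure_future(coro)
        self._running_tasks[task_id] = task
        self._task_statuses[task_id] = "running"
        return task

    def cancel_by_id(self, task_id) -> bool:
        task = self._running_tasks.pop(task_id, None)
        if task is None:
            return False
        task.cancel()
        self._task_statuses.pop(task_id, None)
        return True


async def run_cancel_test():
    m = MiniManager()
    m.spawn("t1", asyncio.sleep(60))
    assert m.cancel_by_id("t1") is True
    await asyncio.sleep(0)  # let cancellation propagate
    assert "t1" not in m._running_tasks
    assert "t1" not in m._task_statuses


asyncio.run(run_cancel_test())
```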

Scope

These are actually four unrelated improvements (spawn tools, iteration/timeout budget, domain loop detection, cursor recovery). Merging them as one makes any partial revert painful. If you can split into separate PRs I can review each one faster; the cursor recovery one especially deserves to land on its own since it's a reliability fix with a clear story.

Ergonomics

_MAX_REPEAT_SAME_DOMAIN = 3 is hardcoded and not configurable. Legitimate research flows often fetch 4+ pages from one site (blog archives, docs). Please expose this as a config knob with a more permissive default (I'd suggest 10) so the protection doesn't fire on legitimate usage.
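The knob could be as small as a config field replacing the module constant; the field name and config class here are hypothetical:

```python
# Hypothetical config knob; 10 is the reviewer-suggested default.
from dataclasses import dataclass


@dataclass
class RuntimeConfig:
    max_repeat_same_domain: int = 10


def should_block(domain_count: int, cfg: RuntimeConfig) -> bool:
    # block only once the count exceeds the configured threshold
    return domain_count > cfg.max_repeat_same_domain
```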

Also worth closing #3172 with a note pointing at this PR once the scope is settled.

Contributor Author

MuataSr commented Apr 20, 2026

Thanks for the thorough review, @Re-bin — the feedback was spot-on.

I've split this PR into separate focused PRs as you suggested. First one is up:

PR #3340 — Cursor recovery (the highest-priority reliability fix with a clear production incident behind it)

It addresses your specific feedback:

  • Keeps the O(1) fast path plus isinstance guard, only falls back to full scan on corruption
  • read_unprocessed_history skips non-int cursors instead of crashing
  • 7 regression tests included

The remaining three PRs (spawn status/cancel with decorator DSL, iteration/timeout budget with IntegerSchema, domain loop detection with configurable threshold) will follow.

Also closing #3172 — cursor poisoning was the blocking issue reported there and PR #3340 fixes it.

Collaborator

Re-bin commented Apr 21, 2026

#3340 is merged. Thanks!
