fix(memory): harden cursor recovery against non-integer corruption by MuataSr · Pull Request #3340 · HKUDS/nanobot

MuataSr · 2026-04-20T20:29:54Z

Problem

External callers (cron jobs, plugins, buggy scripts) occasionally wrote non-integer values to the cursor field in history.jsonl — e.g. "cursor": "abc". This caused _next_cursor and read_unprocessed_history to crash with TypeError/ValueError, blocking all subsequent append_history calls and silently breaking Dream, cron, and any feature that reads history.

This happened in production (nanobot v0.1.5.post1, April 2026) — cron jobs wrote string cursors that poisoned the file and caused a 3-day agent outage.

Changes

nanobot/agent/memory.py (2 functions patched, ~15 lines changed):

_next_cursor: Added isinstance(cursor, int) guard on the fast path. When the last entry has a non-int cursor, falls back to a reverse scan of all entries to find the highest valid int cursor. Returns 1 if no valid cursor exists anywhere.
read_unprocessed_history: Skips entries with non-int cursors instead of crashing on the > comparison.

Both changes preserve the existing fast path (O(1) cursor read, O(1) last-entry fallback) — the full scan only runs when corruption is detected.

tests/agent/test_cursor_recovery.py (7 tests, 110 lines):

Test	What it covers
`test_string_cursor_falls_back_to_scan`	String cursor in last entry → scan finds valid int
`test_all_corrupted_cursors_return_one`	Every entry corrupted → restart at 1
`test_non_int_cursor_types`	Float, None, list cursors all handled
`test_cursor_file_with_string_content`	Corrupted .cursor file → falls back to JSONL scan
`test_skips_string_cursor_entries`	`read_unprocessed_history` skips non-int entries
`test_mixed_corruption_preserves_order`	Valid entries maintain order despite corrupt neighbors
`test_all_valid_still_works`	Baseline — normal operation unaffected

All 7 tests pass. Existing test_memory_store.py tests also pass (no regressions).

Scope

This is extracted from PR #3303 as a standalone reliability fix with a clear story. It does not include the spawn tools, domain detection, or iteration budget changes from that PR — those will follow as separate PRs.

Closes #3172 (cursor poisoning was the blocking issue reported there).

_next_cursor now checks isinstance(cursor, int) before arithmetic, falling back to a reverse scan of all entries when the last entry's cursor is corrupted. read_unprocessed_history skips entries with non-int cursors instead of crashing on comparison. Root cause: external callers (cron jobs, plugins) occasionally wrote string cursors to history.jsonl, which blocked all subsequent append_history calls with TypeError/ValueError. Includes 7 regression tests covering string, float, null, and list cursor types.

Re-bin

Verdict: Approve after my follow-up refactor

Great catch on the root symptom — a single non-int cursor in history.jsonl (cron/external writer poisoning, see #3172) can wedge the agent, and the two isinstance guards + recovery scan both move in the right direction. Clean, focused, well-tested diff.

I pushed one follow-up commit (a8043ca0, refactor(memory): centralize cursor validation behind a single gate) to tighten the fix from first principles:

Single gate, single invariant. Pulled the guard into MemoryStore._valid_cursor so "what counts as a valid cursor" lives in exactly one place; both _next_cursor and read_unprocessed_history consume it. Reduces the surface area where the next contributor might miss a check.
Closed the bool edge case. isinstance(True, int) is True in Python, so the original guard silently admits {"cursor": true} as cursor 1. _valid_cursor explicitly excludes bool.
Recovery returns max, not "first int from the tail". Your PR's comment promised "the highest valid one" but the loop returned the first int found while scanning in reverse. Under adversarial corruption the two differ, and only max guarantees the recovered cursor is strictly greater than every legitimate cursor still on disk.
Observability, rate-limited. When we drop a corrupted entry we now log a single warning per MemoryStore with the offending raw value. Without this, the root cause (the external writer) stays invisible; with it, the log trail survives while the agent keeps running. Rate-limiting avoids spamming every agent turn on a persistently poisoned file.

All 7 of your tests pass unchanged; I added 3 to pin the invariants above (test_bool_cursor_rejected, test_next_cursor_returns_max_not_just_last_int, test_corruption_is_logged_exactly_once_per_store). Full agent suite: 670/670.

Nothing in the refactor changes externally observable behavior beyond your fix for any case that was previously hit in production — it's a safety-net widening plus one diagnostic line. Happy to merge once CI turns green on the new head.

Move the non-int cursor guard out of the two consumer sites and into a shared ``_iter_valid_entries`` iterator so the invariant lives in one place. Closes three gaps left by the original fix: * ``bool`` is now rejected — ``isinstance(True, int)`` is ``True`` in Python, so the previous guard silently treated ``{"cursor": true}`` as cursor ``1``. * Recovery now returns ``max(valid cursors) + 1``. Under adversarial corruption "first int scanning in reverse" is not the same thing, and only ``max`` keeps the recovered cursor strictly greater than every legitimate cursor still on disk. * Non-int cursors are logged exactly once per ``MemoryStore``. Silently dropping corrupted entries hides the root cause (an external writer to ``memory/history.jsonl``); rate-limiting keeps the log clean when the same poisoned file is read every turn. All 7 tests from the original fix pass unchanged; 3 new tests pin the invariants above. Made-with: Cursor

MuataSr mentioned this pull request Apr 20, 2026

feat: spawn_status/spawn_cancel tools, domain loop detection, hardened cursor recovery #3303

Open

Merge remote-tracking branch 'origin/main' into pr-3340

2940804

Re-bin approved these changes Apr 21, 2026

View reviewed changes

Re-bin force-pushed the fix/cursor-recovery branch from a8043ca to a965a69 Compare April 21, 2026 06:01

Re-bin merged commit c1957e1 into HKUDS:main Apr 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(memory): harden cursor recovery against non-integer corruption#3340

fix(memory): harden cursor recovery against non-integer corruption#3340
Re-bin merged 3 commits intoHKUDS:mainfrom
MuataSr:fix/cursor-recovery

MuataSr commented Apr 20, 2026

Uh oh!

Re-bin left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MuataSr commented Apr 20, 2026

Problem

Changes

Scope

Uh oh!

Re-bin left a comment

Choose a reason for hiding this comment

Verdict: Approve after my follow-up refactor

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants