fix(memory): harden cursor recovery against non-integer corruption#3340
fix(memory): harden cursor recovery against non-integer corruption#3340Re-bin merged 3 commits intoHKUDS:mainfrom
Conversation
_next_cursor now checks isinstance(cursor, int) before arithmetic, falling back to a reverse scan of all entries when the last entry's cursor is corrupted. read_unprocessed_history skips entries with non-int cursors instead of crashing on comparison. Root cause: external callers (cron jobs, plugins) occasionally wrote string cursors to history.jsonl, which blocked all subsequent append_history calls with TypeError/ValueError. Includes 7 regression tests covering string, float, null, and list cursor types.
Re-bin
left a comment
There was a problem hiding this comment.
Verdict: Approve after my follow-up refactor
Great catch on the root symptom — a single non-int cursor in history.jsonl (cron/external writer poisoning, see #3172) can wedge the agent, and the two isinstance guards + recovery scan both move in the right direction. Clean, focused, well-tested diff.
I pushed one follow-up commit (a8043ca0, refactor(memory): centralize cursor validation behind a single gate) to tighten the fix from first principles:
- Single gate, single invariant. Pulled the guard into
MemoryStore._valid_cursorso "what counts as a valid cursor" lives in exactly one place; both_next_cursorandread_unprocessed_historyconsume it. Reduces the surface area where the next contributor might miss a check. - Closed the
booledge case.isinstance(True, int)isTruein Python, so the original guard silently admits{"cursor": true}as cursor1._valid_cursorexplicitly excludesbool. - Recovery returns
max, not "first int from the tail". Your PR's comment promised "the highest valid one" but the loop returned the first int found while scanning in reverse. Under adversarial corruption the two differ, and onlymaxguarantees the recovered cursor is strictly greater than every legitimate cursor still on disk. - Observability, rate-limited. When we drop a corrupted entry we now log a single warning per
MemoryStorewith the offending raw value. Without this, the root cause (the external writer) stays invisible; with it, the log trail survives while the agent keeps running. Rate-limiting avoids spamming every agent turn on a persistently poisoned file.
All 7 of your tests pass unchanged; I added 3 to pin the invariants above (test_bool_cursor_rejected, test_next_cursor_returns_max_not_just_last_int, test_corruption_is_logged_exactly_once_per_store). Full agent suite: 670/670.
Nothing in the refactor changes externally observable behavior beyond your fix for any case that was previously hit in production — it's a safety-net widening plus one diagnostic line. Happy to merge once CI turns green on the new head.
Move the non-int cursor guard out of the two consumer sites and into a
shared ``_iter_valid_entries`` iterator so the invariant lives in one
place. Closes three gaps left by the original fix:
* ``bool`` is now rejected — ``isinstance(True, int)`` is ``True`` in
Python, so the previous guard silently treated ``{"cursor": true}`` as
cursor ``1``.
* Recovery now returns ``max(valid cursors) + 1``. Under adversarial
corruption "first int scanning in reverse" is not the same thing, and
only ``max`` keeps the recovered cursor strictly greater than every
legitimate cursor still on disk.
* Non-int cursors are logged exactly once per ``MemoryStore``. Silently
dropping corrupted entries hides the root cause (an external writer
to ``memory/history.jsonl``); rate-limiting keeps the log clean when
the same poisoned file is read every turn.
All 7 tests from the original fix pass unchanged; 3 new tests pin the
invariants above.
Made-with: Cursor
a8043ca to
a965a69
Compare
Problem
External callers (cron jobs, plugins, buggy scripts) occasionally wrote non-integer values to the
cursorfield inhistory.jsonl— e.g."cursor": "abc". This caused_next_cursorandread_unprocessed_historyto crash withTypeError/ValueError, blocking all subsequentappend_historycalls and silently breaking Dream, cron, and any feature that reads history.This happened in production (nanobot v0.1.5.post1, April 2026) — cron jobs wrote string cursors that poisoned the file and caused a 3-day agent outage.
Changes
nanobot/agent/memory.py(2 functions patched, ~15 lines changed):_next_cursor: Addedisinstance(cursor, int)guard on the fast path. When the last entry has a non-int cursor, falls back to a reverse scan of all entries to find the highest valid int cursor. Returns1if no valid cursor exists anywhere.read_unprocessed_history: Skips entries with non-int cursors instead of crashing on the>comparison.Both changes preserve the existing fast path (O(1) cursor read, O(1) last-entry fallback) — the full scan only runs when corruption is detected.
tests/agent/test_cursor_recovery.py(7 tests, 110 lines):test_string_cursor_falls_back_to_scantest_all_corrupted_cursors_return_onetest_non_int_cursor_typestest_cursor_file_with_string_contenttest_skips_string_cursor_entriesread_unprocessed_historyskips non-int entriestest_mixed_corruption_preserves_ordertest_all_valid_still_worksAll 7 tests pass. Existing
test_memory_store.pytests also pass (no regressions).Scope
This is extracted from PR #3303 as a standalone reliability fix with a clear story. It does not include the spawn tools, domain detection, or iteration budget changes from that PR — those will follow as separate PRs.
Closes #3172 (cursor poisoning was the blocking issue reported there).