Skip to content

fix(memory): harden cursor recovery against non-integer corruption#3340

Merged
Re-bin merged 3 commits intoHKUDS:mainfrom
MuataSr:fix/cursor-recovery
Apr 21, 2026
Merged

fix(memory): harden cursor recovery against non-integer corruption#3340
Re-bin merged 3 commits intoHKUDS:mainfrom
MuataSr:fix/cursor-recovery

Conversation

@MuataSr
Copy link
Copy Markdown
Contributor

@MuataSr MuataSr commented Apr 20, 2026

Problem

External callers (cron jobs, plugins, buggy scripts) occasionally wrote non-integer values to the cursor field in history.jsonl — e.g. "cursor": "abc". This caused _next_cursor and read_unprocessed_history to crash with TypeError/ValueError, blocking all subsequent append_history calls and silently breaking Dream, cron, and any feature that reads history.

This happened in production (nanobot v0.1.5.post1, April 2026) — cron jobs wrote string cursors that poisoned the file and caused a 3-day agent outage.

Changes

nanobot/agent/memory.py (2 functions patched, ~15 lines changed):

  • _next_cursor: Added isinstance(cursor, int) guard on the fast path. When the last entry has a non-int cursor, falls back to a reverse scan of all entries to find the highest valid int cursor. Returns 1 if no valid cursor exists anywhere.
  • read_unprocessed_history: Skips entries with non-int cursors instead of crashing on the > comparison.

Both changes preserve the existing fast path (O(1) cursor read, O(1) last-entry fallback) — the full scan only runs when corruption is detected.

tests/agent/test_cursor_recovery.py (7 tests, 110 lines):

Test What it covers
test_string_cursor_falls_back_to_scan String cursor in last entry → scan finds valid int
test_all_corrupted_cursors_return_one Every entry corrupted → restart at 1
test_non_int_cursor_types Float, None, list cursors all handled
test_cursor_file_with_string_content Corrupted .cursor file → falls back to JSONL scan
test_skips_string_cursor_entries read_unprocessed_history skips non-int entries
test_mixed_corruption_preserves_order Valid entries maintain order despite corrupt neighbors
test_all_valid_still_works Baseline — normal operation unaffected

All 7 tests pass. Existing test_memory_store.py tests also pass (no regressions).

Scope

This is extracted from PR #3303 as a standalone reliability fix with a clear story. It does not include the spawn tools, domain detection, or iteration budget changes from that PR — those will follow as separate PRs.

Closes #3172 (cursor poisoning was the blocking issue reported there).

_next_cursor now checks isinstance(cursor, int) before arithmetic,
falling back to a reverse scan of all entries when the last entry's
cursor is corrupted. read_unprocessed_history skips entries with
non-int cursors instead of crashing on comparison.

Root cause: external callers (cron jobs, plugins) occasionally wrote
string cursors to history.jsonl, which blocked all subsequent
append_history calls with TypeError/ValueError.

Includes 7 regression tests covering string, float, null, and list
cursor types.
Copy link
Copy Markdown
Collaborator

@Re-bin Re-bin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: Approve after my follow-up refactor

Great catch on the root symptom — a single non-int cursor in history.jsonl (cron/external writer poisoning, see #3172) can wedge the agent, and the two isinstance guards + recovery scan both move in the right direction. Clean, focused, well-tested diff.

I pushed one follow-up commit (a8043ca0, refactor(memory): centralize cursor validation behind a single gate) to tighten the fix from first principles:

  • Single gate, single invariant. Pulled the guard into MemoryStore._valid_cursor so "what counts as a valid cursor" lives in exactly one place; both _next_cursor and read_unprocessed_history consume it. Reduces the surface area where the next contributor might miss a check.
  • Closed the bool edge case. isinstance(True, int) is True in Python, so the original guard silently admits {"cursor": true} as cursor 1. _valid_cursor explicitly excludes bool.
  • Recovery returns max, not "first int from the tail". Your PR's comment promised "the highest valid one" but the loop returned the first int found while scanning in reverse. Under adversarial corruption the two differ, and only max guarantees the recovered cursor is strictly greater than every legitimate cursor still on disk.
  • Observability, rate-limited. When we drop a corrupted entry we now log a single warning per MemoryStore with the offending raw value. Without this, the root cause (the external writer) stays invisible; with it, the log trail survives while the agent keeps running. Rate-limiting avoids spamming every agent turn on a persistently poisoned file.

All 7 of your tests pass unchanged; I added 3 to pin the invariants above (test_bool_cursor_rejected, test_next_cursor_returns_max_not_just_last_int, test_corruption_is_logged_exactly_once_per_store). Full agent suite: 670/670.

Nothing in the refactor changes externally observable behavior beyond your fix for any case that was previously hit in production — it's a safety-net widening plus one diagnostic line. Happy to merge once CI turns green on the new head.

Move the non-int cursor guard out of the two consumer sites and into a
shared ``_iter_valid_entries`` iterator so the invariant lives in one
place.  Closes three gaps left by the original fix:

* ``bool`` is now rejected — ``isinstance(True, int)`` is ``True`` in
  Python, so the previous guard silently treated ``{"cursor": true}`` as
  cursor ``1``.
* Recovery now returns ``max(valid cursors) + 1``.  Under adversarial
  corruption "first int scanning in reverse" is not the same thing, and
  only ``max`` keeps the recovered cursor strictly greater than every
  legitimate cursor still on disk.
* Non-int cursors are logged exactly once per ``MemoryStore``.  Silently
  dropping corrupted entries hides the root cause (an external writer
  to ``memory/history.jsonl``); rate-limiting keeps the log clean when
  the same poisoned file is read every turn.

All 7 tests from the original fix pass unchanged; 3 new tests pin the
invariants above.

Made-with: Cursor
@Re-bin Re-bin force-pushed the fix/cursor-recovery branch from a8043ca to a965a69 Compare April 21, 2026 06:01
@Re-bin Re-bin merged commit c1957e1 into HKUDS:main Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants