t1256: Reduce stale-evaluating frequency via rate limit cooldown and eval checkpoint #1963
marcusquinn merged 1 commit into main
Conversation
…al checkpoint (t1256)

Root cause analysis of 114 Phase 0.7 recovery events (30 days):

- worker_rate_limited: 33 events (29%) — immediate re-dispatch into same rate limit
- eval_process_died: 30 events (26%) — pulse killed mid-evaluation, imprecise diagnosis
- worker_failed_before_eval: 27 events (24%) — worker crashes before evaluation
- worker_oom_killed: 12 events (11%) — OOM kills

Fixes:

1. Rate limit cooldown (t1256): Phase 0.7 now sets rate_limit_until = now+5min when recovering a worker_rate_limited task. cmd_next() skips tasks until the cooldown expires, preventing immediate re-dispatch into the same rate limit. Addresses the #1 root cause. Configurable via SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS (default: 300s).
2. Eval checkpoint file (t1256): evaluate_with_ai() writes a checkpoint file at eval start and removes it on completion. If the pulse is killed mid-eval, the checkpoint persists. _diagnose_stale_root_cause() now detects 'pulse_killed_mid_eval' vs the generic 'eval_process_died', enabling more precise observability.
3. DB schema: Added rate_limit_until column to tasks table (migration + schema).
4. Stale GC report: Added rate limit cooldown section showing currently deferred tasks.
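The cooldown-aware dispatch filter can be illustrated with a minimal standalone sketch. This is a hedged illustration, not the real cmd_next(): it assumes the sqlite3 CLI is available, and only the rate_limit_until column name and the NULL-or-expired filter shape come from the description above; the table schema and task IDs are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: how a rate_limit_until cooldown column defers dispatch.
# Hypothetical schema; the real tasks table has many more columns.
set -euo pipefail
db="$(mktemp)"
sqlite3 "$db" <<'SQL'
CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, rate_limit_until TEXT DEFAULT NULL);
INSERT INTO tasks VALUES ('t1', 'queued', NULL);                                                -- never rate-limited
INSERT INTO tasks VALUES ('t2', 'queued', strftime('%Y-%m-%dT%H:%M:%SZ','now','+300 seconds')); -- cooling down
INSERT INTO tasks VALUES ('t3', 'queued', strftime('%Y-%m-%dT%H:%M:%SZ','now','-60 seconds'));  -- cooldown expired
SQL
# Dispatcher-side filter: skip tasks whose cooldown has not yet expired.
# ISO-8601 UTC strings compare correctly as text.
eligible="$(sqlite3 "$db" "SELECT id FROM tasks
  WHERE status = 'queued'
    AND (rate_limit_until IS NULL
         OR rate_limit_until <= strftime('%Y-%m-%dT%H:%M:%SZ','now'))
  ORDER BY id;")"
echo "$eligible"   # prints t1 and t3; t2 stays deferred
rm -f "$db"
```

Because the column stores a UTC timestamp rather than a flag, no background job is needed to "un-limit" tasks: eligibility returns automatically once the stored instant passes.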
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the supervisor system's resilience and diagnostic capabilities by addressing common causes of 'stale-evaluating' events. It introduces a rate limit cooldown mechanism to prevent rapid re-dispatch of rate-limited tasks and an evaluation checkpoint system that provides more granular insight into why evaluation processes fail. These changes are expected to reduce task recovery overhead and improve the overall stability and observability of task evaluations.
Walkthrough

This change introduces a rate-limiting mechanism with eval checkpoint tracking to the supervisor automation framework.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Dispatcher as State/Dispatcher
    participant Scheduler as Pulse Scheduler
    participant Evaluator as Evaluate Process
    participant Database as Tasks DB
    participant FileSystem as Checkpoint FS
    Dispatcher->>Database: Query next tasks (WHERE rate_limit_until IS NULL OR <= now)
    Database-->>Dispatcher: Available tasks
    Dispatcher->>Scheduler: Dispatch task
    Scheduler->>Evaluator: Start evaluation
    Evaluator->>FileSystem: Write checkpoint (PID, timestamp)
    Evaluator->>Database: Execute task eval
    alt Eval Completes
        Evaluator->>FileSystem: Remove checkpoint file
        Evaluator-->>Scheduler: Success
    else Eval Stalls/Timeout
        Scheduler->>Scheduler: Detect stale task
        Scheduler->>FileSystem: Check checkpoint file exists
        alt Checkpoint Found
            Scheduler->>Scheduler: Diagnose: pulse_killed_mid_eval
        else No Checkpoint
            Scheduler->>Scheduler: Diagnose: eval_process_died
        end
        Scheduler->>Database: Set rate_limit_until = now + 300s
        Scheduler->>FileSystem: Remove stale checkpoint
        Scheduler-->>Dispatcher: Task rate-limited, retry later
    end
```
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics

Generated on: Thu Feb 19 12:32:55 UTC 2026 by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces two valuable features for improving stale state recovery: a rate-limit cooldown and an evaluation checkpoint. The implementation is well-documented and aligns with the project's existing patterns. I've found one high-severity issue related to the checkpoint file cleanup that could lead to stale files and another medium-severity issue regarding non-atomic database updates. Addressing these points will enhance the robustness of these new features. Overall, this is a solid contribution to improving the supervisor's stability.
```shell
local _eval_checkpoint_file=""
mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
```
There is a potential for leaving stale checkpoint files if writing the file fails. If the printf command fails (e.g., due to disk full or permissions issues), the file might be created but be empty or partially written. The || _eval_checkpoint_file="" part then resets the variable, causing the subsequent push_cleanup to register a cleanup for an empty path (rm -f ''), leaving the stale checkpoint file on disk. This could lead to incorrect root cause diagnosis for stale tasks later on.
To ensure the file is always cleaned up, the variable should not be reset. The push_cleanup will then correctly register the cleanup for the file path, and rm -f will handle removing it even if it's empty or was never fully created.
```diff
-printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
+printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || true
```
```shell
db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
# t1256: Rate limit cooldown — when the root cause is worker_rate_limited,
# set rate_limit_until to prevent immediate re-dispatch into the same limit.
# Default cooldown: 5 minutes (SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS).
# This addresses the #1 stale-evaluating root cause (29% of events).
if [[ "$root_cause" == "worker_rate_limited" ]]; then
  local rate_limit_cooldown="${SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS:-300}"
  db "$SUPERVISOR_DB" "UPDATE tasks SET rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
  log_info "  Phase 0.7: $stale_id rate_limit_until set (+${rate_limit_cooldown}s cooldown, t1256)"
fi
```
The logic to update the task's retries count and conditionally set rate_limit_until uses two separate UPDATE statements. This is not atomic. There's a small window where the script could be interrupted between these two database calls, leaving the task in an inconsistent state (retries incremented, but cooldown not applied).
To improve robustness, these updates should be combined into a single, atomic UPDATE statement. This can be done by using an if/else block to construct the appropriate UPDATE query.
```diff
-db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
 # t1256: Rate limit cooldown — when the root cause is worker_rate_limited,
 # set rate_limit_until to prevent immediate re-dispatch into the same limit.
 # Default cooldown: 5 minutes (SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS).
 # This addresses the #1 stale-evaluating root cause (29% of events).
 if [[ "$root_cause" == "worker_rate_limited" ]]; then
   local rate_limit_cooldown="${SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS:-300}"
-  db "$SUPERVISOR_DB" "UPDATE tasks SET rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
+  db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries, rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
   log_info "  Phase 0.7: $stale_id rate_limit_until set (+${rate_limit_cooldown}s cooldown, t1256)"
+else
+  db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
 fi
```
Actionable comments posted: 1
🧹 Nitpick comments (4)
.agents/scripts/supervisor/database.sh (2)
741-752: Migration is idempotent and consistent with existing patterns — log_success always prints even on failure

The pragma_table_info guard + ALTER TABLE … 2>/dev/null || true + log_success pattern is consistent throughout the file, so this isn't a regression. That said, log_success "Added rate_limit_until …" will print even when ALTER TABLE silently fails (e.g., disk-full), which could give a false sense of safety. Compare with the guarded pattern used for max_load_factor at line 329 that captures failure and emits log_warn. A small defensive improvement:

🛡️ Proposed defensive guard (matches max_load_factor pattern)

```diff
-db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN rate_limit_until TEXT DEFAULT NULL;" 2>/dev/null || true
-log_success "Added rate_limit_until column to tasks (t1256)"
+if db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN rate_limit_until TEXT DEFAULT NULL;" 2>/dev/null; then
+  log_success "Added rate_limit_until column to tasks (t1256)"
+else
+  log_warn "Failed to add rate_limit_until column (may already exist or DB error)"
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/database.sh around lines 741 - 752, The migration block that adds rate_limit_until uses has_rate_limit_until, db, and then blindly calls log_success even if ALTER TABLE failed; change it to detect ALTER TABLE failure (capture the db command exit/status or output) and only call log_success on success, otherwise call log_warn with the error message similar to the max_load_factor pattern: run db "$SUPERVISOR_DB" "ALTER TABLE ..." capturing stderr/status, check the result, and emit log_warn on failure (including the error) and log_success only on success.
813-827: Consider adding an index on rate_limit_until for larger deployments

The cmd_next() filter (rate_limit_until IS NULL OR rate_limit_until <= strftime(…)) currently rides the idx_tasks_status index and then scans the filtered queued set. For typical workloads this is fine, but at scale a partial index improves scheduling latency to O(log n). Both init_db() and the migration block omit this index.

📈 Proposed partial index

In init_db() after the existing index block (line 826):

```diff
 CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
+-- t1256: index to make rate-limit cooldown filter in cmd_next() O(log n)
+CREATE INDEX IF NOT EXISTS idx_tasks_rate_limit ON tasks(status, rate_limit_until)
+  WHERE status = 'queued' AND rate_limit_until IS NOT NULL;
```

And add the same CREATE INDEX IF NOT EXISTS statement to the migration block at line 751 after the ALTER TABLE.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/database.sh around lines 813-827: Add a partial index on rate_limit_until to speed up cmd_next() queries: locate init_db() and the migration block that creates/alters tasks and, after the existing CREATE INDEX statements (and after the ALTER TABLE in the migration block), add a CREATE INDEX IF NOT EXISTS on tasks(rate_limit_until) using a partial predicate that matches cmd_next()'s filter (e.g., WHERE status='queued' AND (rate_limit_until IS NULL OR rate_limit_until <= strftime(...))); reference the rate_limit_until column and the cmd_next() query when adding the index so both init and migration paths include the same partial index.

.agents/scripts/supervisor/pulse.sh (2)
1020-1028: Rate limit cooldown logic is correct and well-ordered.

The rate_limit_until UPDATE executes before cmd_transition (line 1029), so even if the transition fails, the cooldown is in place — a safe ordering. The sql_escape on stale_id is properly applied.

One minor robustness note: rate_limit_cooldown is interpolated directly into the SQL modifier string ('+${rate_limit_cooldown} seconds'). If SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS is set to a non-numeric value, SQLite's strftime will return NULL, meaning no cooldown is applied. This is a safe fallback, but you could add a numeric guard for defense-in-depth if you want to log a warning when the operator misconfigures the value.

Optional: numeric validation for the cooldown env var

```diff
 if [[ "$root_cause" == "worker_rate_limited" ]]; then
   local rate_limit_cooldown="${SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS:-300}"
+  if ! [[ "$rate_limit_cooldown" =~ ^[0-9]+$ ]]; then
+    log_warn "  Phase 0.7: SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS='$rate_limit_cooldown' is not numeric, defaulting to 300"
+    rate_limit_cooldown=300
+  fi
   db "$SUPERVISOR_DB" "UPDATE tasks SET rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 1020-1028: Validate SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS before using it: check that it is a non-negative integer (e.g., via a numeric regex or integer cast) and if invalid, set rate_limit_cooldown to the default 300 and emit a warning with log_info/log_warn; then use that sanitized rate_limit_cooldown in the strftime DB UPDATE (the variable rate_limit_cooldown used in the db call), leaving sql_escape "$stale_id" and the ordering with cmd_transition intact.
1733-1734: Phase 1c and Phase 4b are both missing eval checkpoint cleanup (unlike Phase 0.7).

Phase 0.7 (lines 1071-1075) correctly removes eval-checkpoints/${task_id}.eval after recovering a stale evaluating task. Phase 1c (stuck evaluating recovery) and Phase 4b (DB orphan recovery) both handle evaluating tasks but do not perform the same cleanup. This means checkpoint files from pulse_killed_mid_eval scenarios recovered by either phase will accumulate on disk. They won't cause incorrect diagnosis (they're keyed by task ID), but they drift from the Phase 0.7 pattern and could grow unbounded over time.

Phase 4b is particularly susceptible since it re-evaluates orphaned evaluating tasks with cmd_evaluate, which will create new checkpoints that are never cleaned when the re-evaluation completes.

Add checkpoint cleanup to Phase 1c and Phase 4b, matching Phase 0.7 pattern

Phase 1c (after line 1734, within the stuck_evaluating loop):

```diff
 # Clean up PID file
 cleanup_worker_processes "$stuck_id" 2>>"$SUPERVISOR_LOG" || true
+
+# t1256: Clean up eval checkpoint file (matches Phase 0.7 cleanup)
+local stuck_eval_checkpoint="${SUPERVISOR_DIR}/eval-checkpoints/${stuck_id}.eval"
+if [[ -f "$stuck_eval_checkpoint" ]]; then
+  rm -f "$stuck_eval_checkpoint" 2>/dev/null || true
+  log_verbose "  Phase 1c: removed eval checkpoint for $stuck_id"
+fi
 done <<<"$stuck_evaluating"
```

Phase 4b (after line 2570, within the db_orphans loop, following cmd_transition on failure):

```diff
 attempt_self_heal "$orphan_id" "failed" "No worker process found" "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
+
+# t1256: Clean up eval checkpoint file if it exists (matches Phase 0.7 cleanup)
+local orphan_eval_checkpoint="${SUPERVISOR_DIR}/eval-checkpoints/${orphan_id}.eval"
+if [[ -f "$orphan_eval_checkpoint" ]]; then
+  rm -f "$orphan_eval_checkpoint" 2>/dev/null || true
+  log_verbose "  Phase 4b: removed eval checkpoint for $orphan_id"
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 1733 - 1734, Phase 1c (stuck evaluating recovery) and Phase 4b (DB orphan recovery) need to remove the eval checkpoint file just like Phase 0.7: after you detect/handle an evaluating task (use the same task_id variable used in those blocks) call the cleanup that removes eval-checkpoints/${task_id}.eval (the same removal logic used in Phase 0.7) so checkpoint files are unlinked when the task is recovered or re-evaluated (for Phase 4b where cmd_evaluate may recreate checkpoints). Ensure the removal occurs after successful recovery/handling and logs/errors are redirected the same way as Phase 0.7.
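The checkpoint-presence diagnosis that these comments revolve around can be sketched as a standalone function. This is a hypothetical helper for illustration only: the function name diagnose_eval_death and the temp-dir layout are made up, and the real _diagnose_stale_root_cause considers more signals than file presence alone.

```shell
#!/usr/bin/env bash
# Sketch: distinguish "pulse died mid-eval" from "eval never ran / unknown"
# purely from whether the per-task checkpoint file survived.
set -euo pipefail
diagnose_eval_death() {
  local task_id="$1" checkpoint_dir="$2"
  if [[ -f "${checkpoint_dir}/${task_id}.eval" ]]; then
    # Checkpoint survived: eval started, but cleanup never ran.
    echo "pulse_killed_mid_eval"
  else
    # No checkpoint: eval never started, or died for an unknown reason.
    echo "eval_process_died"
  fi
}
tmp="$(mktemp -d)"
# Simulate a pulse killed mid-eval for task t42: checkpoint left behind.
printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" >"${tmp}/t42.eval"
r1="$(diagnose_eval_death t42 "$tmp")"   # checkpoint present
r2="$(diagnose_eval_death t99 "$tmp")"   # no checkpoint
echo "$r1 / $r2"
rm -rf "$tmp"
```

This also shows why the review comments insist on reliable cleanup: a checkpoint that outlives a successful eval would make this check report a false pulse_killed_mid_eval.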
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1442-1452: Move the checkpoint-file creation and its push_cleanup
registration (the block that sets _eval_checkpoint_dir, creates the file into
_eval_checkpoint_file, and calls push_cleanup to remove it) so it occurs after
installing the RETURN trap and after calling _save_cleanup_scope (i.e. after the
trap '_run_cleanups' RETURN and _save_cleanup_scope are set up), ensuring the
cleanup entry is within the RETURN-trap scope so it runs when evaluate_with_ai
returns; additionally, guard the push_cleanup invocation and any later manual
removals by checking that _eval_checkpoint_file is non-empty (e.g., only call
push_cleanup or rm when _eval_checkpoint_file != "") so you never register or
attempt to remove an empty filename.
---
Nitpick comments:
In @.agents/scripts/supervisor/database.sh:
- Around line 741-752: The migration block that adds rate_limit_until uses
has_rate_limit_until, db, and then blindly calls log_success even if ALTER TABLE
failed; change it to detect ALTER TABLE failure (capture the db command
exit/status or output) and only call log_success on success, otherwise call
log_warn with the error message similar to the max_load_factor pattern: run db
"$SUPERVISOR_DB" "ALTER TABLE ..." capturing stderr/status, check the result,
and emit log_warn on failure (including the error) and log_success only on
success.
- Around line 813-827: Add a partial index on rate_limit_until to speed up
cmd_next() queries: locate init_db() and the migration block that creates/alter
tasks and, after the existing CREATE INDEX statements (and after the ALTER TABLE
in the migration block), add a CREATE INDEX IF NOT EXISTS on
tasks(rate_limit_until) using a partial predicate that matches cmd_next()’s
filter (e.g., WHERE status='queued' AND (rate_limit_until IS NULL OR
rate_limit_until <= strftime(...))); reference the rate_limit_until column and
the cmd_next() query when adding the index so both init and migration paths
include the same partial index.
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1020-1028: Validate SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS before
using it: check that it is a non-negative integer (e.g., via a numeric regex or
integer cast) and if invalid, set rate_limit_cooldown to the default 300 and
emit a warning with log_info/log_warn; then use that sanitized
rate_limit_cooldown in the strftime DB UPDATE (the variable rate_limit_cooldown
used in the db call), leaving sql_escape "$stale_id" and the ordering with
cmd_transition intact.
- Around line 1733-1734: Phase 1c (stuck evaluating recovery) and Phase 4b (DB
orphan recovery) need to remove the eval checkpoint file just like Phase 0.7:
after you detect/handle an evaluating task (use the same task_id variable used
in those blocks) call the cleanup that removes eval-checkpoints/${task_id}.eval
(the same removal logic used in Phase 0.7) so checkpoint files are unlinked when
the task is recovered or re-evaluated (for Phase 4b where cmd_evaluate may
recreate checkpoints). Ensure the removal occurs after successful
recovery/handling and logs/errors are redirected the same way as Phase 0.7.
```shell
# t1256: Write eval checkpoint file — persists across pulse restarts.
# If the pulse is killed mid-eval (SIGTERM, cron restart, system shutdown),
# the checkpoint file survives. _diagnose_stale_root_cause() reads it to
# distinguish "pulse_killed_mid_eval" from "eval_process_died" (unknown cause).
# The checkpoint is removed when evaluate_with_ai() returns (success or timeout).
local _eval_checkpoint_dir="${SUPERVISOR_DIR}/eval-checkpoints"
local _eval_checkpoint_file=""
mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
push_cleanup "rm -f '${_eval_checkpoint_file}' 2>/dev/null || true"
```
🧩 Analysis chain (collapsed): CodeRabbit ran rg/fd/sed verification scripts against .agents/scripts/supervisor/evaluate.sh and shared-constants.sh in marcusquinn/aidevops to confirm the cleanup-scope behavior before posting the finding below.
Checkpoint push_cleanup registered outside the RETURN-trap scope — stale checkpoint files will persist and cause false diagnoses
The cleanup for _eval_checkpoint_file is registered via push_cleanup at line 1452, but _save_cleanup_scope and trap '_run_cleanups' RETURN are not set until lines 1488–1489. Based on the scope-bounding behavior of _run_cleanups (which executes only commands added after the most-recent _save_cleanup_scope), the checkpoint cleanup will not fire when evaluate_with_ai() returns — directly contradicting the comment on line 1446.
The actual code pattern confirms this:
- Checkpoint (line 1452): push_cleanup called before _save_cleanup_scope → skipped on RETURN ✗
- Heartbeat (line 1478): push_cleanup called before _save_cleanup_scope, but includes manual cleanup at lines 1530-1534 ✓
- Watchdog (lines 1494-1508): push_cleanup called after _save_cleanup_scope → cleaned up on RETURN ✓
Stale checkpoint files undermine the 26% diagnosis improvement target by causing _diagnose_stale_root_cause() to emit false pulse_killed_mid_eval diagnoses for tasks where eval completed normally.
Fix: move checkpoint setup to after the RETURN trap is installed and add a guard for empty _eval_checkpoint_file:
🔧 Recommended fix
```diff
-# t1256: Write eval checkpoint file — persists across pulse restarts.
-# If the pulse is killed mid-eval (SIGTERM, cron restart, system shutdown),
-# the checkpoint file survives. _diagnose_stale_root_cause() reads it to
-# distinguish "pulse_killed_mid_eval" from "eval_process_died" (unknown cause).
-# The checkpoint is removed when evaluate_with_ai() returns (success or timeout).
-local _eval_checkpoint_dir="${SUPERVISOR_DIR}/eval-checkpoints"
-local _eval_checkpoint_file=""
-mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
-_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
-printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
-push_cleanup "rm -f '${_eval_checkpoint_file}' 2>/dev/null || true"
-
 # t1251: Initial heartbeat …
 …
 _save_cleanup_scope
 trap '_run_cleanups' RETURN
+
+# t1256: Write eval checkpoint file — persists across pulse restarts.
+# Registered AFTER _save_cleanup_scope so _run_cleanups fires on RETURN.
+local _eval_checkpoint_dir="${SUPERVISOR_DIR}/eval-checkpoints"
+local _eval_checkpoint_file=""
+mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
+_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
+printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" \
+  >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
+[[ -n "$_eval_checkpoint_file" ]] && \
+  push_cleanup "rm -f '${_eval_checkpoint_file}' 2>/dev/null || true"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1452, Move the
checkpoint-file creation and its push_cleanup registration (the block that sets
_eval_checkpoint_dir, creates the file into _eval_checkpoint_file, and calls
push_cleanup to remove it) so it occurs after installing the RETURN trap and
after calling _save_cleanup_scope (i.e. after the trap '_run_cleanups' RETURN
and _save_cleanup_scope are set up), ensuring the cleanup entry is within the
RETURN-trap scope so it runs when evaluate_with_ai returns; additionally, guard
the push_cleanup invocation and any later manual removals by checking that
_eval_checkpoint_file is non-empty (e.g., only call push_cleanup or rm when
_eval_checkpoint_file != "") so you never register or attempt to remove an empty
filename.



Summary
Root cause analysis of 114 Phase 0.7 stale-state recovery events (30 days) identified systemic issues causing the 93% stale-evaluating rate.
Root causes: worker_rate_limited 33 events (29%), eval_process_died 30 events (26%), worker_failed_before_eval 27 events (24%), worker_oom_killed 12 events (11%).
Changes
1. Rate limit cooldown (database.sh, pulse.sh, state.sh)
2. Eval checkpoint file (evaluate.sh, pulse.sh)
3. Stale GC report enhancement (pulse.sh)
Expected Impact
Ref #1962