
t1256: Reduce stale-evaluating frequency via rate limit cooldown and eval checkpoint#1963

Merged
marcusquinn merged 1 commit into main from feature/t1256
Feb 19, 2026
Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 19, 2026

Summary

Root cause analysis of 114 Phase 0.7 stale-state recovery events (30 days) identified systemic issues causing the 93% stale-evaluating rate.

Root causes: worker_rate_limited 33 events (29%), eval_process_died 30 events (26%), worker_failed_before_eval 27 events (24%), worker_oom_killed 12 events (11%).

Changes

1. Rate limit cooldown (database.sh, pulse.sh, state.sh)

  • Added rate_limit_until column to tasks table (migration + schema)
  • Phase 0.7: when recovering a worker_rate_limited task, sets rate_limit_until = now + 5min
  • cmd_next(): filters out tasks where rate_limit_until > now
  • Prevents immediate re-dispatch into the same rate limit (addresses the #1 root cause, 29%)
  • Configurable via SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS (default: 300s)
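The dispatch-time filter in cmd_next() can be pictured as a single WHERE clause. A sketch of what it could look like, using the column and status names from this PR's description (the actual query in state.sh may differ):

```sql
-- Dispatch-time filter: a task is eligible when it has no cooldown,
-- or its cooldown timestamp is already in the past. UTC ISO-8601
-- strings compare correctly as TEXT in SQLite.
SELECT id FROM tasks
WHERE status = 'queued'
  AND (rate_limit_until IS NULL
       OR rate_limit_until <= strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
LIMIT 1;
```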

2. Eval checkpoint file (evaluate.sh, pulse.sh)

  • evaluate_with_ai() writes eval-checkpoints/{task_id}.eval at eval start
  • Checkpoint removed on completion (success or timeout) via push_cleanup
  • _diagnose_stale_root_cause() now detects pulse_killed_mid_eval vs generic eval_process_died
  • Phase 0.7 cleans up checkpoint files after recovery
  • Enables more precise observability of pulse-kill vs eval-crash patterns
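The checkpoint lifecycle above can be sketched as a standalone script. SUPERVISOR_DIR here is a temporary stand-in, and the real evaluate.sh registers removal via push_cleanup rather than calling rm directly:

```shell
# Minimal sketch of the eval-checkpoint lifecycle (stand-in paths, not
# the evaluate.sh code itself).
set -u
SUPERVISOR_DIR="$(mktemp -d)"
task_id="t-demo"
checkpoint_file="${SUPERVISOR_DIR}/eval-checkpoints/${task_id}.eval"

# Eval start: record PID and UTC timestamp so a later pulse can distinguish
# "pulse killed mid-eval" (file present) from "eval process died" (absent).
mkdir -p "${SUPERVISOR_DIR}/eval-checkpoints"
printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" >"$checkpoint_file"
written="no"; [ -s "$checkpoint_file" ] && written="yes"

# Eval end (success or timeout): cleanup removes the checkpoint.
rm -f "$checkpoint_file"
removed="yes"; [ -e "$checkpoint_file" ] && removed="no"
echo "written=$written removed=$removed"
rm -rf "$SUPERVISOR_DIR"
```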

3. Stale GC report enhancement (pulse.sh)

  • Added Rate Limit Cooldown section showing currently deferred tasks with retry-after timestamps

Expected Impact

  • 29% reduction in stale-evaluating events from rate limit cooldown
  • Improved diagnosis for 26% of eval_process_died events to pulse_killed_mid_eval
  • No retry overhead for rate-limited tasks during cooldown window
  • Backward compatible: rate_limit_until IS NULL tasks dispatch normally
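The backward-compatibility point rests on two facts: an unset (NULL) rate_limit_until means no cooldown, and ISO-8601 UTC strings compare correctly as plain text, in SQLite and in bash alike. A small self-contained illustration (the decide helper is illustrative, not part of the PR):

```shell
# ISO-8601 UTC timestamps sort lexicographically, so a plain string
# comparison answers "is the cooldown expired?" correctly.
now="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"

decide() {
  local until="$1"
  # Empty (NULL-equivalent) or not-in-the-future means the task dispatches.
  if [ -z "$until" ] || [[ ! "$until" > "$now" ]]; then
    echo "dispatch"
  else
    echo "defer"
  fi
}

a="$(decide "")"                      # no cooldown set: dispatch normally
b="$(decide "2020-01-01T00:00:00Z")"  # expired cooldown: dispatch
c="$(decide "2099-01-01T00:00:00Z")"  # future cooldown: defer
echo "$a $b $c"
```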

Ref #1962

Summary by CodeRabbit

Release Notes

  • New Features
    • Implemented rate limit cooldown mechanism that temporarily pauses task retry scheduling when workers are rate-limited, with a configurable cooldown window.
    • Enhanced task recovery diagnostics to distinguish between pulse process interruption and evaluation failure, improving troubleshooting accuracy.
    • Added checkpoint tracking system to assist with debugging evaluation timeouts and process failures.

…al checkpoint (t1256)

Root cause analysis of 114 Phase 0.7 recovery events (30 days):
- worker_rate_limited: 33 events (29%) — immediate re-dispatch into same rate limit
- eval_process_died: 30 events (26%) — pulse killed mid-evaluation, imprecise diagnosis
- worker_failed_before_eval: 27 events (24%) — worker crashes before evaluation
- worker_oom_killed: 12 events (11%) — OOM kills

Fixes:
1. Rate limit cooldown (t1256): Phase 0.7 now sets rate_limit_until = now+5min when
   recovering a worker_rate_limited task. cmd_next() skips tasks until cooldown expires,
   preventing immediate re-dispatch into the same rate limit. Addresses #1 root cause.
   Configurable via SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS (default: 300s).

2. Eval checkpoint file (t1256): evaluate_with_ai() writes a checkpoint file at eval
   start, removed on completion. If the pulse is killed mid-eval, the checkpoint
   persists. _diagnose_stale_root_cause() now detects 'pulse_killed_mid_eval' vs
   the generic 'eval_process_died', enabling more precise observability.

3. DB schema: Added rate_limit_until column to tasks table (migration + schema).

4. Stale GC report: Added rate limit cooldown section showing currently deferred tasks.
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the supervisor system's resilience and diagnostic capabilities by addressing common causes of 'stale-evaluating' events. It introduces a rate limit cooldown mechanism to prevent rapid re-dispatch of rate-limited tasks and implements an evaluation checkpoint system to provide more granular insights into why evaluation processes might fail. These changes are expected to reduce task recovery overhead and improve the overall stability and observability of task evaluations.

Highlights

  • Rate Limit Cooldown: Introduced a rate_limit_until column in the tasks table to prevent immediate re-dispatch of worker_rate_limited tasks, aiming to reduce stale-evaluating events by 29%.
  • Eval Checkpoint File: Implemented a mechanism to write a checkpoint file at the start of an AI evaluation and remove it upon completion, allowing for more precise diagnosis of pulse_killed_mid_eval versus generic eval_process_died events.
  • Stale GC Report Enhancement: Added a new section to the stale garbage collection report to display tasks currently deferred due to rate limiting, including their retry-after timestamps.
Changelog
  • .agents/scripts/supervisor/database.sh
    • Added a migration to include the rate_limit_until column in the tasks table.
    • Updated the CREATE TABLE statement for tasks to include the rate_limit_until column with a TEXT DEFAULT NULL type.
  • .agents/scripts/supervisor/evaluate.sh
    • Implemented logic to create an evaluation checkpoint file ({task_id}.eval) at the beginning of evaluate_with_ai.
    • Ensured the checkpoint file is removed upon successful completion or timeout of the evaluation via push_cleanup.
  • .agents/scripts/supervisor/pulse.sh
    • Modified _diagnose_stale_root_cause to check for the presence of an eval checkpoint file to specifically identify pulse_killed_mid_eval events.
    • Added a new 'Rate Limit Cooldown' section to cmd_stale_gc_report to display tasks currently deferred due to rate limiting.
    • Updated comments in cmd_pulse to include SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS as a configurable grace period.
    • Added logic in cmd_pulse to set the rate_limit_until timestamp for tasks identified as worker_rate_limited during stale recovery.
    • Added logic in cmd_pulse to clean up eval checkpoint files for recovered tasks.
  • .agents/scripts/supervisor/state.sh
    • Modified cmd_next queries to filter out tasks where rate_limit_until is in the future, preventing them from being dispatched prematurely.

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Walkthrough

This change introduces a rate-limiting mechanism with eval checkpoint tracking to the supervisor automation framework. A new rate_limit_until column is added to the tasks table with database migration support, while evaluate processes now create checkpoint files to track active evaluations. Stale task recovery logic in pulse now handles rate-limit cooldowns and performs checkpoint diagnostics and cleanup.

Changes

  • Database Schema — .agents/scripts/supervisor/database.sh: Added rate_limit_until TEXT DEFAULT NULL column to tasks table. Includes runtime migration logic to backfill existing databases with an idempotent ALTER TABLE statement and logging on success.
  • Eval Checkpoint System — .agents/scripts/supervisor/evaluate.sh: Implemented per-task checkpoint file mechanism in the eval-checkpoints directory. Captures process ID and AI-start timestamp at evaluation start; auto-cleaned up on completion or timeout.
  • Pulse Recovery & Diagnostics — .agents/scripts/supervisor/pulse.sh: Enhanced Phase 0.7 stale task recovery with: (1) rate-limit cooldown assignment (now + 300s default) when recovering rate-limited tasks, (2) eval checkpoint diagnostics in _diagnose_stale_root_cause() to distinguish pulse-killed-mid-eval scenarios, (3) checkpoint file cleanup during recovery, (4) rate-limit status reporting in cmd_stale_gc_report.
  • Task Dispatch Filtering — .agents/scripts/supervisor/state.sh: Added rate-limit filtering to batch and global task selection queries in cmd_next(): excludes tasks where rate_limit_until is set and in the future.
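The idempotent migration described for database.sh typically follows SQLite's pragma_table_info pattern. A hedged sketch (the script's db helper and logging are omitted; the exact statements may differ):

```sql
-- Idempotent column add: pragma_table_info() returns one row per existing
-- column, so a zero count means the migration still needs to run.
SELECT COUNT(*) FROM pragma_table_info('tasks')
WHERE name = 'rate_limit_until';

-- Only when the count above is 0:
ALTER TABLE tasks ADD COLUMN rate_limit_until TEXT DEFAULT NULL;
```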

Sequence Diagram

```mermaid
sequenceDiagram
    participant Dispatcher as State/Dispatcher
    participant Scheduler as Pulse Scheduler
    participant Evaluator as Evaluate Process
    participant Database as Tasks DB
    participant FileSystem as Checkpoint FS

    Dispatcher->>Database: Query next tasks (WHERE rate_limit_until IS NULL OR <= now)
    Database-->>Dispatcher: Available tasks
    Dispatcher->>Scheduler: Dispatch task
    Scheduler->>Evaluator: Start evaluation
    Evaluator->>FileSystem: Write checkpoint (PID, timestamp)
    Evaluator->>Database: Execute task eval
    alt Eval Completes
        Evaluator->>FileSystem: Remove checkpoint file
        Evaluator-->>Scheduler: Success
    else Eval Stalls/Timeout
        Scheduler->>Scheduler: Detect stale task
        Scheduler->>FileSystem: Check checkpoint file exists
        alt Checkpoint Found
            Scheduler->>Scheduler: Diagnose: pulse_killed_mid_eval
        else No Checkpoint
            Scheduler->>Scheduler: Diagnose: eval_process_died
        end
        Scheduler->>Database: Set rate_limit_until = now + 300s
        Scheduler->>FileSystem: Remove stale checkpoint
        Scheduler-->>Dispatcher: Task rate-limited, retry later
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

⏱️ A checkpoint marks the eval's way,
Rate limits tell it: "Please delay!"
When stale affairs arise anew,
Diagnosis points to what is true—
Five hundred seconds? Give it space,
Then let the task resume its race! 🚀

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)
  • Description Check — skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — the PR title accurately summarizes the main changes: introducing a rate limit cooldown mechanism and eval checkpoint feature to reduce stale-evaluating frequency.
  • Docstring Coverage — 100.00%, above the required 80.00% threshold.


@github-actions
Contributor

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 29 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 12:32:52 UTC 2026: Code review monitoring started
Thu Feb 19 12:32:52 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 29

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 29
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 12:32:55 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two valuable features for improving stale state recovery: a rate-limit cooldown and an evaluation checkpoint. The implementation is well-documented and aligns with the project's existing patterns. I've found one high-severity issue related to the checkpoint file cleanup that could lead to stale files and another medium-severity issue regarding non-atomic database updates. Addressing these points will enhance the robustness of these new features. Overall, this is a solid contribution to improving the supervisor's stability.

```shell
local _eval_checkpoint_file=""
mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
```


high

There is a potential for leaving stale checkpoint files if writing the file fails. If the printf command fails (e.g., due to disk full or permissions issues), the file might be created but be empty or partially written. The || _eval_checkpoint_file="" part then resets the variable, causing the subsequent push_cleanup to register a cleanup for an empty path (rm -f ''), leaving the stale checkpoint file on disk. This could lead to incorrect root cause diagnosis for stale tasks later on.

To ensure the file is always cleaned up, the variable should not be reset. The push_cleanup will then correctly register the cleanup for the file path, and rm -f will handle removing it even if it's empty or was never fully created.

Suggested change

```diff
-printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
+printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || true
```
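The failure mode described here is easy to reproduce in isolation: once the variable is cleared, the registered cleanup operates on an empty path and the file survives. A standalone sketch (not the evaluate.sh code itself):

```shell
# If printf creates the file but then fails, `|| var=""` clears the
# variable while the (possibly empty) file remains on disk; the cleanup
# then runs `rm -f ''`, which removes nothing.
tmp="$(mktemp -d)"
checkpoint="$tmp/task.eval"
: >"$checkpoint"             # file exists, as a partial write would leave it
checkpoint=""                # the `|| checkpoint=""` branch fires
rm -f "$checkpoint" 2>/dev/null || true   # cleanup on empty path: no-op
stale="no"; [ -f "$tmp/task.eval" ] && stale="yes"
echo "stale=$stale"
rm -rf "$tmp"
```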

Comment on lines 1019 to +1028
```shell
db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
# t1256: Rate limit cooldown — when the root cause is worker_rate_limited,
# set rate_limit_until to prevent immediate re-dispatch into the same limit.
# Default cooldown: 5 minutes (SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS).
# This addresses the #1 stale-evaluating root cause (29% of events).
if [[ "$root_cause" == "worker_rate_limited" ]]; then
	local rate_limit_cooldown="${SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS:-300}"
	db "$SUPERVISOR_DB" "UPDATE tasks SET rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
	log_info "  Phase 0.7: $stale_id rate_limit_until set (+${rate_limit_cooldown}s cooldown, t1256)"
fi
```


medium

The logic to update the task's retries count and conditionally set rate_limit_until uses two separate UPDATE statements. This is not atomic. There's a small window where the script could be interrupted between these two database calls, leaving the task in an inconsistent state (retries incremented, but cooldown not applied).

To improve robustness, these updates should be combined into a single, atomic UPDATE statement. This can be done by using an if/else block to construct the appropriate UPDATE query.

Suggested change

```diff
-db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
 # t1256: Rate limit cooldown — when the root cause is worker_rate_limited,
 # set rate_limit_until to prevent immediate re-dispatch into the same limit.
 # Default cooldown: 5 minutes (SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS).
 # This addresses the #1 stale-evaluating root cause (29% of events).
 if [[ "$root_cause" == "worker_rate_limited" ]]; then
 	local rate_limit_cooldown="${SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS:-300}"
-	db "$SUPERVISOR_DB" "UPDATE tasks SET rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
+	db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries, rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
 	log_info "  Phase 0.7: $stale_id rate_limit_until set (+${rate_limit_cooldown}s cooldown, t1256)"
+else
+	db "$SUPERVISOR_DB" "UPDATE tasks SET retries = $new_retries WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
 fi
```

@marcusquinn marcusquinn merged commit 5045020 into main Feb 19, 2026
18 of 19 checks passed
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
.agents/scripts/supervisor/database.sh (2)

741-752: Migration is idempotent and consistent with existing patterns — log_success always prints even on failure

The pragma_table_info guard + ALTER TABLE … 2>/dev/null || true + log_success pattern is consistent throughout the file, so this isn't a regression. That said, log_success "Added rate_limit_until …" will print even when ALTER TABLE silently fails (e.g., disk-full), which could give a false sense of safety. Compare with the guarded pattern used for max_load_factor at line 329 that captures failure and emits log_warn. A small defensive improvement:

🛡️ Proposed defensive guard (matches max_load_factor pattern)

```diff
-	db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN rate_limit_until TEXT DEFAULT NULL;" 2>/dev/null || true
-	log_success "Added rate_limit_until column to tasks (t1256)"
+	if db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN rate_limit_until TEXT DEFAULT NULL;" 2>/dev/null; then
+		log_success "Added rate_limit_until column to tasks (t1256)"
+	else
+		log_warn "Failed to add rate_limit_until column (may already exist or DB error)"
+	fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/database.sh around lines 741 - 752, The migration
block that adds rate_limit_until uses has_rate_limit_until, db, and then blindly
calls log_success even if ALTER TABLE failed; change it to detect ALTER TABLE
failure (capture the db command exit/status or output) and only call log_success
on success, otherwise call log_warn with the error message similar to the
max_load_factor pattern: run db "$SUPERVISOR_DB" "ALTER TABLE ..." capturing
stderr/status, check the result, and emit log_warn on failure (including the
error) and log_success only on success.

813-827: Consider adding an index on rate_limit_until for larger deployments

The cmd_next() filter (rate_limit_until IS NULL OR rate_limit_until <= strftime(…)) currently rides the idx_tasks_status index and then scans the filtered queued set. For typical workloads this is fine, but at scale a partial index improves scheduling latency to O(log n). Both init_db() and the migration block omit this index.

📈 Proposed partial index

In init_db() after the existing index block (line 826):

```diff
 CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
+-- t1256: index to make rate-limit cooldown filter in cmd_next() O(log n)
+CREATE INDEX IF NOT EXISTS idx_tasks_rate_limit ON tasks(status, rate_limit_until)
+    WHERE status = 'queued' AND rate_limit_until IS NOT NULL;
```

And add the same CREATE INDEX IF NOT EXISTS statement to the migration block at line 751 after the ALTER TABLE.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/database.sh around lines 813 - 827, Add a partial
index on rate_limit_until to speed up cmd_next() queries: locate init_db() and
the migration block that creates/alter tasks and, after the existing CREATE
INDEX statements (and after the ALTER TABLE in the migration block), add a
CREATE INDEX IF NOT EXISTS on tasks(rate_limit_until) using a partial predicate
that matches cmd_next()’s filter (e.g., WHERE status='queued' AND
(rate_limit_until IS NULL OR rate_limit_until <= strftime(...))); reference the
rate_limit_until column and the cmd_next() query when adding the index so both
init and migration paths include the same partial index.
.agents/scripts/supervisor/pulse.sh (2)

1020-1028: Rate limit cooldown logic is correct and well-ordered.

The rate_limit_until UPDATE executes before cmd_transition (line 1029), so even if the transition fails, the cooldown is in place — a safe ordering. The sql_escape on stale_id is properly applied.

One minor robustness note: rate_limit_cooldown is interpolated directly into the SQL modifier string ('+${rate_limit_cooldown} seconds'). If SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS is set to a non-numeric value, SQLite's strftime will return NULL, meaning no cooldown is applied. This is a safe fallback, but you could add a numeric guard for defense-in-depth if you want to log a warning when the operator misconfigures the value.

Optional: numeric validation for the cooldown env var

```diff
 			if [[ "$root_cause" == "worker_rate_limited" ]]; then
 				local rate_limit_cooldown="${SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS:-300}"
+				if ! [[ "$rate_limit_cooldown" =~ ^[0-9]+$ ]]; then
+					log_warn "  Phase 0.7: SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS='$rate_limit_cooldown' is not numeric, defaulting to 300"
+					rate_limit_cooldown=300
+				fi
 				db "$SUPERVISOR_DB" "UPDATE tasks SET rate_limit_until = strftime('%Y-%m-%dT%H:%M:%SZ','now','+${rate_limit_cooldown} seconds') WHERE id = '$(sql_escape "$stale_id")';" 2>/dev/null || true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 1020 - 1028, Validate
SUPERVISOR_RATE_LIMIT_COOLDOWN_SECONDS before using it: check that it is a
non-negative integer (e.g., via a numeric regex or integer cast) and if invalid,
set rate_limit_cooldown to the default 300 and emit a warning with
log_info/log_warn; then use that sanitized rate_limit_cooldown in the strftime
DB UPDATE (the variable rate_limit_cooldown used in the db call), leaving
sql_escape "$stale_id" and the ordering with cmd_transition intact.

1733-1734: Phase 1c and Phase 4b are both missing eval checkpoint cleanup (unlike Phase 0.7).

Phase 0.7 (lines 1071-1075) correctly removes eval-checkpoints/${task_id}.eval after recovering a stale evaluating task. Phase 1c (stuck evaluating recovery) and Phase 4b (DB orphan recovery) both handle evaluating tasks but do not perform the same cleanup. This means checkpoint files from pulse_killed_mid_eval scenarios recovered by either phase will accumulate on disk. They won't cause incorrect diagnosis (they're keyed by task ID), but they drift from the Phase 0.7 pattern and could grow unbounded over time.

Phase 4b is particularly susceptible since it re-evaluates orphaned evaluating tasks with cmd_evaluate, which will create new checkpoints that are never cleaned when the re-evaluation completes.

Add checkpoint cleanup to Phase 1c and Phase 4b, matching Phase 0.7 pattern
Phase 1c (after line 1734, within the stuck_evaluating loop):
```diff
 			# Clean up PID file
 			cleanup_worker_processes "$stuck_id" 2>>"$SUPERVISOR_LOG" || true
+
+			# t1256: Clean up eval checkpoint file (matches Phase 0.7 cleanup)
+			local stuck_eval_checkpoint="${SUPERVISOR_DIR}/eval-checkpoints/${stuck_id}.eval"
+			if [[ -f "$stuck_eval_checkpoint" ]]; then
+				rm -f "$stuck_eval_checkpoint" 2>/dev/null || true
+				log_verbose "  Phase 1c: removed eval checkpoint for $stuck_id"
+			fi
 		done <<<"$stuck_evaluating"
```

Phase 4b (after line 2570, within the db_orphans loop, following cmd_transition on failure):
```diff
 				attempt_self_heal "$orphan_id" "failed" "No worker process found" "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
+
+			# t1256: Clean up eval checkpoint file if it exists (matches Phase 0.7 cleanup)
+			local orphan_eval_checkpoint="${SUPERVISOR_DIR}/eval-checkpoints/${orphan_id}.eval"
+			if [[ -f "$orphan_eval_checkpoint" ]]; then
+				rm -f "$orphan_eval_checkpoint" 2>/dev/null || true
+				log_verbose "  Phase 4b: removed eval checkpoint for $orphan_id"
+			fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 1733 - 1734, Phase 1c
(stuck evaluating recovery) and Phase 4b (DB orphan recovery) need to remove the
eval checkpoint file just like Phase 0.7: after you detect/handle an evaluating
task (use the same task_id variable used in those blocks) call the cleanup that
removes eval-checkpoints/${task_id}.eval (the same removal logic used in Phase
0.7) so checkpoint files are unlinked when the task is recovered or re-evaluated
(for Phase 4b where cmd_evaluate may recreate checkpoints). Ensure the removal
occurs after successful recovery/handling and logs/errors are redirected the
same way as Phase 0.7.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1442-1452: Move the checkpoint-file creation and its push_cleanup
registration (the block that sets _eval_checkpoint_dir, creates the file into
_eval_checkpoint_file, and calls push_cleanup to remove it) so it occurs after
installing the RETURN trap and after calling _save_cleanup_scope (i.e. after the
trap '_run_cleanups' RETURN and _save_cleanup_scope are set up), ensuring the
cleanup entry is within the RETURN-trap scope so it runs when evaluate_with_ai
returns; additionally, guard the push_cleanup invocation and any later manual
removals by checking that _eval_checkpoint_file is non-empty (e.g., only call
push_cleanup or rm when _eval_checkpoint_file != "") so you never register or
attempt to remove an empty filename.


Comment on lines +1442 to +1452
```shell
# t1256: Write eval checkpoint file — persists across pulse restarts.
# If the pulse is killed mid-eval (SIGTERM, cron restart, system shutdown),
# the checkpoint file survives. _diagnose_stale_root_cause() reads it to
# distinguish "pulse_killed_mid_eval" from "eval_process_died" (unknown cause).
# The checkpoint is removed when evaluate_with_ai() returns (success or timeout).
local _eval_checkpoint_dir="${SUPERVISOR_DIR}/eval-checkpoints"
local _eval_checkpoint_file=""
mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
push_cleanup "rm -f '${_eval_checkpoint_file}' 2>/dev/null || true"
```
Contributor


⚠️ Potential issue | 🟠 Major

sed -n '1520,1540p' .agents/scripts/supervisor/evaluate.sh

Repository: marcusquinn/aidevops

Length of output: 1046


🏁 Script executed:

# Confirm checkpoint cleanup is never called — no manual cleanup equivalent to heartbeat
sed -n '1442,1560p' .agents/scripts/supervisor/evaluate.sh | grep -A 100 'checkpoint' | head -120

Repository: marcusquinn/aidevops

Length of output: 5563


Checkpoint push_cleanup registered outside the RETURN-trap scope — stale checkpoint files will persist and cause false diagnoses

The cleanup for _eval_checkpoint_file is registered via push_cleanup at line 1452, but _save_cleanup_scope and trap '_run_cleanups' RETURN are not set until lines 1488–1489. Based on the scope-bounding behavior of _run_cleanups (which executes only commands added after the most-recent _save_cleanup_scope), the checkpoint cleanup will not fire when evaluate_with_ai() returns — directly contradicting the comment on line 1446.

The actual code pattern confirms this:

  • Checkpoint (line 1452): push_cleanup called before _save_cleanup_scope → skipped on RETURN ✗
  • Heartbeat (line 1478): push_cleanup called before _save_cleanup_scope, but includes manual cleanup at lines 1530–1534 ✓
  • Watchdog (lines 1494–1508): push_cleanup called after _save_cleanup_scope → cleaned up on RETURN ✓

Stale checkpoint files undermine the 26% diagnosis improvement target by causing _diagnose_stale_root_cause() to emit false pulse_killed_mid_eval diagnoses for tasks where eval completed normally.
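The scope-bounding rule this finding relies on can be modeled in isolation. The helpers below are a minimal standalone sketch with the semantics described above (cleanups pushed before the most recent _save_cleanup_scope are skipped on RETURN), not the actual implementations from shared-constants.sh:

```shell
#!/usr/bin/env bash
# Minimal model of the cleanup-scope semantics: _run_cleanups executes, in
# reverse order, only the commands pushed after the most recent
# _save_cleanup_scope mark, then drops them from the stack.
_CLEANUPS=()
_SCOPE_MARK=0

push_cleanup() { _CLEANUPS+=("$1"); }
_save_cleanup_scope() { _SCOPE_MARK=${#_CLEANUPS[@]}; }
_run_cleanups() {
	local i
	for ((i = ${#_CLEANUPS[@]} - 1; i >= _SCOPE_MARK; i--)); do
		eval "${_CLEANUPS[$i]}"
	done
	_CLEANUPS=("${_CLEANUPS[@]:0:_SCOPE_MARK}")
}

demo() {
	push_cleanup "echo before-scope"   # registered BEFORE the scope mark: skipped
	_save_cleanup_scope
	trap '_run_cleanups' RETURN
	push_cleanup "echo after-scope"    # registered AFTER the scope mark: runs on RETURN
}

demo   # prints "after-scope" only
```

Under this model, the checkpoint cleanup at line 1452 sits in the "before-scope" position, which is exactly why it never fires.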

Fix: move checkpoint setup to after the RETURN trap is installed and add a guard for empty _eval_checkpoint_file:

🔧 Recommended fix
-	# t1256: Write eval checkpoint file — persists across pulse restarts.
-	# If the pulse is killed mid-eval (SIGTERM, cron restart, system shutdown),
-	# the checkpoint file survives. _diagnose_stale_root_cause() reads it to
-	# distinguish "pulse_killed_mid_eval" from "eval_process_died" (unknown cause).
-	# The checkpoint is removed when evaluate_with_ai() returns (success or timeout).
-	local _eval_checkpoint_dir="${SUPERVISOR_DIR}/eval-checkpoints"
-	local _eval_checkpoint_file=""
-	mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
-	_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
-	printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" >"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
-	push_cleanup "rm -f '${_eval_checkpoint_file}' 2>/dev/null || true"
-
 	# t1251: Initial heartbeat …
 	…
 	_save_cleanup_scope
 	trap '_run_cleanups' RETURN
+
+	# t1256: Write eval checkpoint file — persists across pulse restarts.
+	# Registered AFTER _save_cleanup_scope so _run_cleanups fires on RETURN.
+	local _eval_checkpoint_dir="${SUPERVISOR_DIR}/eval-checkpoints"
+	local _eval_checkpoint_file=""
+	mkdir -p "$_eval_checkpoint_dir" 2>/dev/null || true
+	_eval_checkpoint_file="${_eval_checkpoint_dir}/${task_id}.eval"
+	printf '%s\n%s\n' "$$" "$(date -u '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || echo 'unknown')" \
+		>"$_eval_checkpoint_file" 2>/dev/null || _eval_checkpoint_file=""
+	[[ -n "$_eval_checkpoint_file" ]] && \
+		push_cleanup "rm -f '${_eval_checkpoint_file}' 2>/dev/null || true"
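For context on why stale files matter, here is a hedged sketch of how a surviving checkpoint could be interpreted during recovery. _diagnose_stale_root_cause() in pulse.sh is not shown in this PR, so the helper name and branch labels below are assumptions based on the checkpoint format in the snippet (line 1 holds the writer's PID):

```shell
#!/usr/bin/env bash
# Hypothetical diagnosis helper; the real _diagnose_stale_root_cause() may
# differ. A checkpoint that survives while its writer PID is dead suggests
# the pulse was killed before cleanup could run.
diagnose_eval_checkpoint() {
	local checkpoint_file="$1" pid=""
	if [ ! -f "$checkpoint_file" ]; then
		# No checkpoint: cleanup ran or eval never started; cause unknown.
		echo "eval_process_died"
		return 0
	fi
	pid="$(head -n 1 "$checkpoint_file" 2>/dev/null)"
	if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
		echo "eval_in_progress"   # writer process is still alive
	else
		echo "pulse_killed_mid_eval"
	fi
}
```

A checkpoint left behind by a normally completed eval (the bug above) would hit the pulse_killed_mid_eval branch, producing the false diagnoses the finding describes.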
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1442-1452, move the
checkpoint-file creation and its push_cleanup registration (the block that sets
_eval_checkpoint_dir, writes the file path into _eval_checkpoint_file, and
calls push_cleanup to remove it) so it occurs after _save_cleanup_scope is
called and the RETURN trap (trap '_run_cleanups' RETURN) is installed, ensuring
the cleanup entry falls within the RETURN-trap scope and runs when
evaluate_with_ai returns. Additionally, guard the push_cleanup invocation and
any later manual removals by checking that _eval_checkpoint_file is non-empty,
so you never register or attempt to remove an empty filename.
