t1331: Supervisor circuit breaker — pause on consecutive failures#2294
t1331: Supervisor circuit breaker — pause on consecutive failures#2294marcusquinn merged 1 commit intomainfrom
Conversation
Add circuit-breaker-helper.sh: a standalone, file-based circuit breaker that prevents runaway dispatch loops in the AI pulse supervisor. Tracks consecutive task failures globally; after N failures (default: 3, configurable via SUPERVISOR_CIRCUIT_BREAKER_THRESHOLD), pauses dispatch and creates a GitHub issue with the circuit-breaker label. Key features: - File-based JSON state (no SQLite dependency) - Atomic writes with directory-based locking - Auto-reset after configurable cooldown (default: 30 min) - Counter resets to 0 on any task success - Manual trip/reset/status/check CLI commands - GitHub issue creation/close on trip/reset - CB_SKIP_GITHUB env var for test isolation Integration: - pulse.md: Step 0 checks circuit breaker before dispatch - pulse.md: Step 5 records success/failure after dispatch - 19 passing tests covering all state transitions Closes #2264
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust circuit breaker mechanism for the AI pulse supervisor. It prevents continuous task dispatch when a configurable number of consecutive failures occur, thereby safeguarding the system from repeated errors. The implementation is a standalone Bash script that manages its state via JSON files and integrates seamlessly into the existing Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Wed Feb 25 16:46:23 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable, standalone circuit breaker helper script to enhance system stability by pausing the supervisor on consecutive task failures. The implementation is robust, featuring file-based locking for atomic state updates and a clear CLI. The accompanying test suite is comprehensive, covering a wide range of states and edge cases.
My review primarily focuses on improving the script's debuggability. I've identified a recurring pattern of suppressing stderr (2>/dev/null) for file operations and gh CLI calls. While this keeps logs clean, it unfortunately hides critical error information when issues like permission errors or authentication failures occur. My suggestions aim to remove this suppression, making underlying error messages visible to significantly simplify troubleshooting. The integration into pulse.md is well-done. Overall, this is a solid feature, and addressing the feedback on error visibility will make it even more resilient.
| if ! printf '%s\n' "$state_json" >"$tmp_file" 2>/dev/null; then | ||
| _cb_log_warn "failed to write temp state file: $tmp_file" | ||
| rm -f "$tmp_file" 2>/dev/null || true | ||
| return 1 | ||
| fi | ||
| if ! mv -f "$tmp_file" "$state_file" 2>/dev/null; then | ||
| _cb_log_warn "failed to move temp state to: $state_file" | ||
| rm -f "$tmp_file" 2>/dev/null || true | ||
| return 1 | ||
| fi |
There was a problem hiding this comment.
In this function, printf and mv commands have their stderr redirected to /dev/null. This practice conceals the root cause of failures, such as 'disk full' or permission errors, making troubleshooting difficult. The function logs a generic warning, but the specific error from the command is lost.
| if ! printf '%s\n' "$state_json" >"$tmp_file" 2>/dev/null; then | |
| _cb_log_warn "failed to write temp state file: $tmp_file" | |
| rm -f "$tmp_file" 2>/dev/null || true | |
| return 1 | |
| fi | |
| if ! mv -f "$tmp_file" "$state_file" 2>/dev/null; then | |
| _cb_log_warn "failed to move temp state to: $state_file" | |
| rm -f "$tmp_file" 2>/dev/null || true | |
| return 1 | |
| fi | |
| if ! printf '%s\n' "$state_json" >"$tmp_file"; then | |
| _cb_log_warn "failed to write temp state file: $tmp_file" | |
| rm -f "$tmp_file" 2>/dev/null || true | |
| return 1 | |
| fi | |
| if ! mv -f "$tmp_file" "$state_file"; then | |
| _cb_log_warn "failed to move temp state to: $state_file" | |
| rm -f "$tmp_file" 2>/dev/null || true | |
| return 1 | |
| fi |
References
- When reporting errors for failed file operations in shell scripts, such as 'jq' writes, include the file path in the error message. Avoid suppressing stderr with '2>/dev/null' to ensure that diagnostic information about malformed files or write failures is visible.
| issue_url=$(gh issue create \ | ||
| --repo "$repo_slug" \ | ||
| --title "Supervisor circuit breaker tripped — ${failure_count} consecutive failures" \ | ||
| --body "$body" \ | ||
| --label "circuit-breaker" 2>/dev/null) || { | ||
| _cb_log_warn "failed to create GitHub issue" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
The gh issue create command has its stderr suppressed. This prevents useful error messages from gh (e.g., authentication failure, invalid repository, permission denied) from being displayed, making it difficult to diagnose why issue creation failed. The script only logs a generic "failed to create GitHub issue" message. This applies to other gh calls in this script as well.
| issue_url=$(gh issue create \ | |
| --repo "$repo_slug" \ | |
| --title "Supervisor circuit breaker tripped — ${failure_count} consecutive failures" \ | |
| --body "$body" \ | |
| --label "circuit-breaker" 2>/dev/null) || { | |
| _cb_log_warn "failed to create GitHub issue" | |
| return 1 | |
| } | |
| issue_url=$(gh issue create \ | |
| --repo "$repo_slug" \ | |
| --title "Supervisor circuit breaker tripped — ${failure_count} consecutive failures" \ | |
| --body "$body" \ | |
| --label "circuit-breaker") || { | |
| _cb_log_warn "failed to create GitHub issue" | |
| return 1 | |
| } |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
|
|
||
| _cb_state_dir() { | ||
| local dir="${SUPERVISOR_DIR:-${HOME}/.aidevops/.agent-workspace/supervisor}" | ||
| mkdir -p "$dir" 2>/dev/null || true |
There was a problem hiding this comment.
Suppressing stderr with 2>/dev/null here hides potentially important errors, such as the script lacking permissions to create the state directory. While || true prevents the script from exiting, the failure is silent. To improve debuggability, it's better to allow the error message from mkdir to be displayed.
| mkdir -p "$dir" 2>/dev/null || true | |
| mkdir -p "$dir" || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
|
|
||
| if [[ -f "$state_file" ]]; then | ||
| local content | ||
| content=$(cat "$state_file" 2>/dev/null) || content="" |
There was a problem hiding this comment.
The 2>/dev/null suppresses error messages from cat. Although the file's existence is checked on line 159, cat could still fail due to permission issues. Hiding this error makes debugging more difficult. If cat fails, the script currently proceeds as if the state file is corrupt, which could be misleading.
| content=$(cat "$state_file" 2>/dev/null) || content="" | |
| content=$(cat "$state_file") || content="" |
References
- Avoid using
2>/dev/nullto suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g.,[[ -f "$file" ]]or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.
| return 0 | ||
| fi | ||
| local repo_slug | ||
| repo_slug=$(gh repo view --json nameWithOwner -q '.nameWithOwner' 2>/dev/null) || repo_slug="" |
There was a problem hiding this comment.
Suppressing stderr for the gh command hides important diagnostic information. If the command fails due to authentication issues, network problems, or incorrect repository configuration, the error is silenced. The script then proceeds with an empty repo_slug, leading to a less specific error message later.
| repo_slug=$(gh repo view --json nameWithOwner -q '.nameWithOwner' 2>/dev/null) || repo_slug="" | |
| repo_slug=$(gh repo view --json nameWithOwner -q '.nameWithOwner') || repo_slug="" |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
🤖 Augment PR SummarySummary: This PR adds a supervisor “circuit breaker” to pause the AI pulse dispatcher after repeated failures, restoring a safety mechanism that existed in the archived bash supervisor. Changes:
Technical Notes: Behavior is configured via 🤖 Was this summary useful? React with 👍 or 👎 |
| done | ||
| # Trap ensures lock cleanup even if "$@" fails under set -e | ||
| # shellcheck disable=SC2064 | ||
| trap "rmdir '$lock_dir' 2>/dev/null || true" RETURN |
There was a problem hiding this comment.
trap "rmdir ..." RETURN in _cb_with_state_lock sets a global RETURN trap that persists after the function returns, so later function returns can remove the lock directory while another process holds it (breaking mutual exclusion). Also, cleanup only runs on RETURN (not EXIT), so an early exit/set -e abort can leave a stale lock and block dispatch.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| CIRCUIT_BREAKER_THRESHOLD="${SUPERVISOR_CIRCUIT_BREAKER_THRESHOLD:-3}" | ||
| # Validate numeric — strip non-digits, fallback to default if empty | ||
| CIRCUIT_BREAKER_THRESHOLD="${CIRCUIT_BREAKER_THRESHOLD//[!0-9]/}" | ||
| [[ -n "$CIRCUIT_BREAKER_THRESHOLD" ]] || CIRCUIT_BREAKER_THRESHOLD=3 |
There was a problem hiding this comment.
After stripping non-digits, SUPERVISOR_CIRCUIT_BREAKER_THRESHOLD=0 remains “valid” and will cause the breaker to trip on the first failure (new_count >= 0). That seems like an invalid/misleading configuration value to accept silently.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| if [[ "$elapsed" -ge "$CIRCUIT_BREAKER_COOLDOWN_SECS" ]]; then | ||
| _cb_log_info "auto-reset after ${elapsed}s cooldown (threshold: ${CIRCUIT_BREAKER_COOLDOWN_SECS}s)" | ||
| cmd_reset "auto_cooldown" |
There was a problem hiding this comment.
In cmd_check, the auto-reset path calls cmd_reset "auto_cooldown" without handling failures; if reset errors (lock timeout/write failure), check may abort under set -e and still report dispatch as blocked unexpectedly. If this runs in the pulse loop, it could turn a transient reset failure into a longer dispatch pause.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
…manence trap Adds _is_bot_generated_cleanup_issue() helper that detects issues created by post-merge-review-scanner.sh (labels: review-followup or source:review- scanner). Extracts the NMR approval gate from _dispatch_dedup_check_layers into _check_nmr_approval_gate() to keep the parent function under the 100-line complexity threshold. When the gate evaluates, if: (a) the issue is bot-generated cleanup AND (b) the needs-maintainer-review label is NOT currently present, then the historical timeline-based ever-NMR check is skipped, allowing dispatch to proceed. The exemption does NOT fire when NMR is currently present — the active case preserves t1894's security posture (maintainer-gated issues still require cryptographic approval). The exemption surgically unblocks the drift case where the fast-fail escalation path applied NMR, the maintainer subsequently removed it, and the historical timeline event keeps the issue permanently blocked. Live evidence from the 2026-04-13 awardsapp queue drain: #2294, #2293, #2292, #2290, #2287 all stuck on ever-NMR with current labels: [auto-dispatch, origin:worker, origin:interactive, review- followup, source:review-scanner]. No current NMR label. Fix 3a immediately unblocks them on the next pulse cycle. 9 new unit tests cover: review-followup detection, source:review-scanner detection, both labels together, regular issue rejection, empty labels array, empty meta_json, invalid JSON, partial-name rejection (avoid 'review-followup-queued' false positive), and the exact awardsapp#2294 regression scenario. Function lengths after refactor: _dispatch_dedup_check_layers: 77 lines (was 83 pre-Fix-3, 103 post-Fix-3 before _check_nmr_approval_gate extraction) _check_nmr_approval_gate: 23 lines (new) _is_bot_generated_cleanup_issue: 6 lines (new) Scope: intentionally surgical. Does NOT change timeline API semantics, remove NMR from scanner output (GH#18538), or touch the origin:worker vs origin:interactive double-labeling (separate bug). Fixes #18648
…manence trap (#18649) Adds _is_bot_generated_cleanup_issue() helper that detects issues created by post-merge-review-scanner.sh (labels: review-followup or source:review- scanner). Extracts the NMR approval gate from _dispatch_dedup_check_layers into _check_nmr_approval_gate() to keep the parent function under the 100-line complexity threshold. When the gate evaluates, if: (a) the issue is bot-generated cleanup AND (b) the needs-maintainer-review label is NOT currently present, then the historical timeline-based ever-NMR check is skipped, allowing dispatch to proceed. The exemption does NOT fire when NMR is currently present — the active case preserves t1894's security posture (maintainer-gated issues still require cryptographic approval). The exemption surgically unblocks the drift case where the fast-fail escalation path applied NMR, the maintainer subsequently removed it, and the historical timeline event keeps the issue permanently blocked. Live evidence from the 2026-04-13 awardsapp queue drain: #2294, #2293, #2292, #2290, #2287 all stuck on ever-NMR with current labels: [auto-dispatch, origin:worker, origin:interactive, review- followup, source:review-scanner]. No current NMR label. Fix 3a immediately unblocks them on the next pulse cycle. 9 new unit tests cover: review-followup detection, source:review-scanner detection, both labels together, regular issue rejection, empty labels array, empty meta_json, invalid JSON, partial-name rejection (avoid 'review-followup-queued' false positive), and the exact awardsapp#2294 regression scenario. Function lengths after refactor: _dispatch_dedup_check_layers: 77 lines (was 83 pre-Fix-3, 103 post-Fix-3 before _check_nmr_approval_gate extraction) _check_nmr_approval_gate: 23 lines (new) _is_bot_generated_cleanup_issue: 6 lines (new) Scope: intentionally surgical. Does NOT change timeline API semantics, remove NMR from scanner output (GH#18538), or touch the origin:worker vs origin:interactive double-labeling (separate bug). Fixes #18648



Summary
circuit-breaker-helper.shthat tracks consecutive task failures and pauses dispatch after a configurable threshold (default: 3)pulse.md(Step 0: check before dispatch, Step 5: record outcomes)What Changed
New Files
.agents/scripts/circuit-breaker-helper.sh— Standalone circuit breaker (672 lines)check,status,record-failure,record-success,reset,trip.agents/scripts/tests/test-circuit-breaker.sh— 19 testsModified Files
.agents/scripts/commands/pulse.md— Added Step 0 (circuit breaker check) and Step 5 (outcome recording)Acceptance Criteria (from issue #2264)
circuit-breakerlabel on tripcircuit-breaker-helper.sh statusshows current statecircuit-breaker-helper.sh resetresumes dispatchSUPERVISOR_CIRCUIT_BREAKER_THRESHOLDConfiguration
SUPERVISOR_CIRCUIT_BREAKER_THRESHOLDSUPERVISOR_CIRCUIT_BREAKER_COOLDOWN_SECSSUPERVISOR_CIRCUIT_BREAKER_REPOCB_SKIP_GITHUBContext
The old bash supervisor (37K lines, archived in #2291) had a circuit breaker integrated into its pulse loop. The new AI pulse system (
pulse.md) had no circuit breaker. This PR adds a standalone, lightweight implementation that works with the new system.Closes #2264