t263: Fix deploying auto-recovery infinite loop#1036

Merged
marcusquinn merged 1 commit into main from feature/t263
Feb 11, 2026

Conversation

marcusquinn (Owner) commented Feb 11, 2026

Summary

Fixes the deploying auto-recovery infinite loop by adding a persistent loop guard, explicit error handling, and fallback direct SQL updates.

Problem

The auto-recovery logic in Step 4b had a critical flaw: the retry_count variable was local and reset on every pulse cycle. This allowed infinite recovery attempts across pulses:

  1. Pulse N: Task in deploying → 3 retry attempts fail → stays deploying
  2. Pulse N+1: Task still deploying → counter resets → 3 NEW attempts fail → stays deploying
  3. Loop continues forever...

Additionally, if cmd_transition to failed also failed (line 8637), the task would remain stuck in deploying state indefinitely.
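The flawed pattern can be sketched as follows (a simplified illustration, not the actual supervisor code; `pulse_step_4b` and the retry limit of 3 stand in for the real Step 4b logic):

```shell
#!/usr/bin/env bash
# The bug: retry_count is local to one pulse, so every pulse starts
# counting from zero again and the task never reaches a terminal state.
pulse_step_4b() {
  local retry_count=0            # reset on EVERY pulse -- the flaw
  while [ "$retry_count" -lt 3 ]; do
    retry_count=$((retry_count + 1))
    echo "recovery attempt $retry_count"
    # recovery fails; task stays in 'deploying'
  done
}

pulse_step_4b   # Pulse N:   attempts 1..3
pulse_step_4b   # Pulse N+1: counter is fresh, attempts 1..3 again -- forever
```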

Solution

1. Persistent Loop Guard (t263)

  • Added deploying_recovery_attempts column to tasks table
  • Tracks recovery attempts across all pulse cycles (not just within one pulse)
  • Max limit: 10 total attempts across all pulses
  • Counter increments at the start of each Step 4b execution
  • Counter resets to 0 on successful recovery
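The guard logic above can be sketched as follows. This is an illustrative stand-in: a temp file plays the role of the `deploying_recovery_attempts` column (the PR persists the counter in the tasks table), and `attempt_recovery` is a hypothetical name:

```shell
#!/usr/bin/env bash
MAX_ATTEMPTS=10                      # limit from this PR
counter_file=$(mktemp)               # stand-in for the DB column
echo 0 > "$counter_file"

attempt_recovery() {
  local attempts
  attempts=$(cat "$counter_file")
  if [ "$attempts" -ge "$MAX_ATTEMPTS" ]; then
    echo "loop guard triggered after $attempts attempts: forcing failed"
    return 1
  fi
  # Increment BEFORE the attempt, so a crash mid-recovery still counts.
  echo $((attempts + 1)) > "$counter_file"
  echo "recovery attempt $((attempts + 1)) of $MAX_ATTEMPTS"
}
```

Because the counter lives outside the function (here a file, in the PR a column), it survives across pulse cycles, which is exactly what the local `retry_count` did not do.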

2. Explicit Error Handling

  • When max limit (10) is exceeded, force transition to failed
  • Clear error messages indicating infinite loop guard triggered
  • Proper logging at each decision point

3. Fallback Direct SQL

  • If cmd_transition fails (both to deployed and to failed), use direct SQL UPDATE
  • Ensures task state is updated even when helper functions fail
  • Prevents task from staying stuck in deploying forever
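A minimal sketch of the fallback path, assuming a sqlite3-style CLI (the PR does not name the database engine here; `DB_FILE`, `force_failed`, and the exact invocation are illustrative, while the column names match this PR's schema):

```shell
#!/usr/bin/env bash
# If the normal transition helper fails, bypass it with a direct UPDATE
# so the task cannot remain stuck in 'deploying'.
force_failed() {
  local task_id=$1
  if ! cmd_transition "$task_id" failed; then
    sqlite3 "$DB_FILE" \
      "UPDATE tasks SET status = 'failed' WHERE id = '$task_id';"
  fi
}
```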

Changes

Database Schema

  • Added deploying_recovery_attempts INTEGER NOT NULL DEFAULT 0 to tasks table
  • Migration runs automatically on supervisor-helper.sh init
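An idempotent migration of this shape might look like the following sketch (assuming a sqlite3-style CLI; `DB_FILE` and the `migrate` wrapper are illustrative, not the actual init code):

```shell
#!/usr/bin/env bash
# Skip the ALTER TABLE when the column already exists, so re-running
# init against an already-migrated database is harmless.
migrate() {
  if ! sqlite3 "$DB_FILE" "PRAGMA table_info(tasks);" \
      | grep -q deploying_recovery_attempts; then
    sqlite3 "$DB_FILE" \
      "ALTER TABLE tasks ADD COLUMN deploying_recovery_attempts INTEGER NOT NULL DEFAULT 0;"
  fi
}
```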

Step 4b Auto-Recovery Logic

  • Check persistent counter before attempting recovery
  • Exit early if max attempts exceeded (with fallback SQL)
  • Increment counter at start of each recovery attempt
  • Reset counter on successful recovery
  • Use fallback SQL if cmd_transition fails

Testing

✅ All tests passed:

  • Migration adds column successfully
  • Default value is 0
  • Counter increments correctly
  • Max limit check works (10 attempts)
  • Fallback SQL updates status when cmd_transition fails
  • ShellCheck: zero violations in t263 changes

Integration Test Results

=== t263 Integration Test ===

Test 1: Creating test task in deploying state
✓ Task set to deploying state

Test 2: Verify default recovery attempts is 0
✓ Default recovery attempts is 0

Test 3: Simulate recovery attempt increment
✓ Recovery attempts incremented to 5

Test 4: Test max limit behavior
✓ Recovery attempts at max limit (10)

Test 5: Test fallback SQL update
✓ Fallback SQL successfully updated status to failed

Cleanup: Removing test task
✓ Test task removed

=== All t263 tests passed! ===

Impact

  • Prevents infinite loops: Tasks can no longer loop forever in deploying state
  • Self-healing: Fallback SQL ensures state updates even when helpers fail
  • Observable: Clear logging shows recovery attempt count and max limit
  • Backward compatible: Migration runs automatically, existing tasks unaffected

Related

  • t222: Initial deploying auto-recovery implementation
  • t248: Added retry with exponential backoff
  • t263: Added infinite loop guard (this PR)

- Add deploying_recovery_attempts field to tasks table
- Implement persistent counter across pulse cycles (max 10 attempts)
- Add explicit error handling with fallback direct SQL
- Prevent infinite recovery loops when cmd_transition fails
- Reset counter on successful recovery


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 46 code smells

[INFO] Recent monitoring activity:
Wed Feb 11 00:34:32 UTC 2026: Code review monitoring started
Wed Feb 11 00:34:33 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 46

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 46
  • VULNERABILITIES: 0

Generated on: Wed Feb 11 00:34:35 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


@marcusquinn marcusquinn marked this pull request as ready for review February 11, 2026 00:36


Labels

code-reviews-actioned All review feedback has been actioned
