Add agentic workflow ci-doctor #34045

Merged
rmarinho merged 1 commit into main from add-workflow-.github-workflows-ci-doctor.md-9462
Feb 13, 2026

@rmarinho
Member

Add agentic workflow ci-doctor

Copilot AI review requested due to automatic review settings February 13, 2026 16:23
@rmarinho rmarinho merged commit 2f64d2c into main Feb 13, 2026
@rmarinho rmarinho deleted the add-workflow-.github-workflows-ci-doctor.md-9462 branch February 13, 2026 16:23
Contributor

Copilot AI left a comment

Pull request overview

Adds a new gh-aw agentic workflow (“CI Failure Doctor”) that triggers on failed CI workflow runs on main and uses GitHub Copilot CLI + GitHub MCP (read-only) to analyze failures and emit safe outputs (issue/comment/noop), with optional cached “memory”.

Changes:

  • Introduces ci-doctor.md as the agent prompt/frontmatter definition for investigating CI failures.
  • Adds the compiled ci-doctor.lock.yml GitHub Actions workflow implementing the agent execution, threat detection, safe-output handling, and cache-memory persistence.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| .github/workflows/ci-doctor.md | Defines the CI Doctor agent prompt, triggers, safe-outputs configuration, and investigation protocol. |
| .github/workflows/ci-doctor.lock.yml | Generated workflow that runs the agent, performs threat detection, processes safe outputs, and manages cache-memory. |

Comment on lines +68 to +73
**ONLY proceed if the workflow conclusion is 'failure' or 'cancelled'**. If the workflow was successful, **call the `noop` tool** immediately and exit.

### Phase 1: Initial Triage
1. **Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure` or `cancelled`
- **If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis.
- **If the workflow failed or was cancelled**: Proceed with the investigation steps below.

Copilot AI Feb 13, 2026

The prompt says to proceed on 'failure' or 'cancelled', but this workflow is gated to run only when conclusion == 'failure' (so cancelled runs won’t be analyzed). Please align the prompt with the actual trigger/if logic (either include cancelled in the workflow conditions, or remove it from the instructions).

Suggested change
**ONLY proceed if the workflow conclusion is 'failure' or 'cancelled'**. If the workflow was successful, **call the `noop` tool** immediately and exit.
### Phase 1: Initial Triage
1. **Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure` or `cancelled`
- **If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis.
- **If the workflow failed or was cancelled**: Proceed with the investigation steps below.
**ONLY proceed if the workflow conclusion is 'failure'**. If the workflow was successful, **call the `noop` tool** immediately and exit.
### Phase 1: Initial Triage
1. **Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure`
- **If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis.
- **If the workflow failed**: Proceed with the investigation steps below.
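
The suggested change above takes the second route and drops `cancelled` from the prompt. For comparison, the first route (widening the gate so cancelled runs are analyzed too) might look like the sketch below. It assumes the `if:` frontmatter field shown elsewhere in this review is the only place the conclusion is checked, and that the lock file is regenerated afterwards.

```yaml
# Sketch only: let the workflow proceed for cancelled CI runs as well.
# Assumption: this `if:` field in ci-doctor.md is the single gate on the
# conclusion; ci-doctor.lock.yml must be recompiled to pick up the change.
if: ${{ github.event.workflow_run.conclusion == 'failure' || github.event.workflow_run.conclusion == 'cancelled' }}
```

With this variant the original prompt wording ("failure or cancelled") would already match the trigger and could stay as written.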

Copilot uses AI. Check for mistakes.
Comment on lines +94 to +122
### Phase 3: Historical Context Analysis
1. **Search Investigation History**: Use file-based storage to search for similar failures:
- Read from cached investigation files in `/tmp/memory/investigations/`
- Parse previous failure patterns and solutions
- Look for recurring error signatures
2. **Issue History**: Search existing issues for related problems
3. **Commit Analysis**: Examine the commit that triggered the failure
4. **PR Context**: If triggered by a PR, analyze the changed files

### Phase 4: Root Cause Investigation
1. **Categorize Failure Type**:
- **Code Issues**: Syntax errors, logic bugs, test failures
- **Infrastructure**: Runner issues, network problems, resource constraints
- **Dependencies**: Version conflicts, missing packages, outdated libraries
- **Configuration**: Workflow configuration, environment variables
- **Flaky Tests**: Intermittent failures, timing issues
- **External Services**: Third-party API failures, downstream dependencies

2. **Deep Dive Analysis**:
- For test failures: Identify specific test methods and assertions
- For build failures: Analyze compilation errors and missing dependencies
- For infrastructure issues: Check runner logs and resource usage
- For timeout issues: Identify slow operations and bottlenecks

### Phase 5: Pattern Storage and Knowledge Building
1. **Store Investigation**: Save structured investigation data to files:
- Write investigation report to `/tmp/memory/investigations/<timestamp>-<run-id>.json`
- Store error patterns in `/tmp/memory/patterns/`
- Maintain an index file of all investigations for fast searching

Copilot AI Feb 13, 2026

The workflow enables cache-memory, but the prompt instructs writing investigation history to /tmp/memory/.... The generated workflow only restores/saves /tmp/gh-aw/cache-memory, so anything written to /tmp/memory won’t persist across runs. Please update the prompt to write under the cache-memory directory (e.g., /tmp/gh-aw/cache-memory/...) so the “historical context” and pattern DB actually work.
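
To illustrate why the path matters: a cache step persists only the directory it is given. The sketch below assumes the compiled workflow relies on `actions/cache` for cache-memory; the actual step in ci-doctor.lock.yml is not shown in this review, and the step name, `key`, and `restore-keys` values are hypothetical.

```yaml
# Illustrative sketch, not copied from ci-doctor.lock.yml.
- name: Restore and save cache-memory   # hypothetical step name
  uses: actions/cache@v4
  with:
    # Only this directory is restored at job start and saved at job end.
    path: /tmp/gh-aw/cache-memory
    # Hypothetical key scheme; the generated workflow may use a different one.
    key: ci-doctor-memory-${{ github.run_id }}
    restore-keys: |
      ci-doctor-memory-
# Anything written to /tmp/memory/ lies outside `path`, so it is lost
# when the runner is torn down.
```

Under that assumption, the prompt would point at locations such as /tmp/gh-aw/cache-memory/investigations/ and /tmp/gh-aw/cache-memory/patterns/ rather than /tmp/memory/.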

pre_activation:
  if: ${{ github.event.workflow_run.conclusion == 'failure' }}
  runs-on: ubuntu-slim

Copilot AI Feb 13, 2026

pre_activation doesn’t declare any job permissions, while the workflow’s top-level permissions is {}. The actions/github-script step uses ${{ secrets.GITHUB_TOKEN }} to call GitHub APIs (team membership check), which is likely to fail with an insufficiently scoped token. Add the minimal required `permissions:` for this job (at least contents: read, and any others required by check_membership.cjs) so activation gating works reliably.

Suggested change
runs-on: ubuntu-slim
runs-on: ubuntu-slim
permissions:
  contents: read
  members: read

- If you find a duplicate issue, add a comment with your findings and close the investigation.
- Do NOT open a new issue since you found a duplicate already (skip next phases).

### Phase 6: Reporting and Recommendations

Copilot AI Feb 13, 2026

There are two sections labeled “Phase 6” ("Looking for existing issues" and "Reporting and Recommendations"). Please renumber one of them to avoid confusion when following the protocol.

Suggested change
### Phase 6: Reporting and Recommendations
### Phase 7: Reporting and Recommendations

Comment on lines +3 to +16
on:
  workflow_run:
    workflows: ["CI"] # Monitor the CI workflow specifically
    types:
      - completed
    branches:
      - main
    # This will trigger only when the CI workflow completes with failure
    # The condition is handled in the workflow body
  stop-after: +1mo

# Only trigger for failures - check in the workflow body
if: ${{ github.event.workflow_run.conclusion == 'failure' }}

Copilot AI Feb 13, 2026

stop-after: +1mo will cause this workflow to automatically deactivate after a month (and bakes a fixed stop time into the generated .lock.yml). If this is intended to be a long-lived workflow, please remove stop-after (or set an appropriate long-term window) so it doesn’t unexpectedly stop running and require periodic re-compilation.
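
If the workflow is meant to be long-lived, the trigger block with `stop-after` dropped might look like the sketch below. The indentation is reconstructed, since the excerpt in this review lost its whitespace, and the lock file would need to be regenerated for the change to take effect.

```yaml
# Sketch of the ci-doctor.md trigger frontmatter without `stop-after`.
# Assumption: nesting follows standard GitHub Actions `workflow_run` layout.
on:
  workflow_run:
    workflows: ["CI"]   # Monitor the CI workflow specifically
    types:
      - completed
    branches:
      - main

# Only trigger for failures
if: ${{ github.event.workflow_run.conclusion == 'failure' }}
```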
