Pull request overview
Adds a new gh-aw agentic workflow (“CI Failure Doctor”) that triggers on failed CI workflow runs on main and uses GitHub Copilot CLI + GitHub MCP (read-only) to analyze failures and emit safe outputs (issue/comment/noop), with optional cached “memory”.
Changes:
- Introduces `ci-doctor.md` as the agent prompt/frontmatter definition for investigating CI failures.
- Adds the compiled `ci-doctor.lock.yml` GitHub Actions workflow implementing the agent execution, threat detection, safe-output handling, and cache-memory persistence.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/ci-doctor.md | Defines the CI Doctor agent prompt, triggers, safe-outputs configuration, and investigation protocol. |
| .github/workflows/ci-doctor.lock.yml | Generated workflow that runs the agent, performs threat detection, processes safe outputs, and manages cache-memory. |
| **ONLY proceed if the workflow conclusion is 'failure' or 'cancelled'**. If the workflow was successful, **call the `noop` tool** immediately and exit. | ||
| ### Phase 1: Initial Triage | ||
| 1. **Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure` or `cancelled` | ||
| - **If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis. | ||
| - **If the workflow failed or was cancelled**: Proceed with the investigation steps below. |
The prompt says to proceed on `failure` or `cancelled`, but this workflow is gated to run only when `conclusion == 'failure'`, so cancelled runs won’t be analyzed. Please align the prompt with the actual trigger/`if` logic: either include `cancelled` in the workflow conditions, or remove it from the instructions.
| **ONLY proceed if the workflow conclusion is 'failure' or 'cancelled'**. If the workflow was successful, **call the `noop` tool** immediately and exit. | |
| ### Phase 1: Initial Triage | |
| 1. **Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure` or `cancelled` | |
| - **If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis. | |
| - **If the workflow failed or was cancelled**: Proceed with the investigation steps below. | |
| **ONLY proceed if the workflow conclusion is 'failure'**. If the workflow was successful, **call the `noop` tool** immediately and exit. | |
| ### Phase 1: Initial Triage | |
| 1. **Verify Failure**: Check that `${{ github.event.workflow_run.conclusion }}` is `failure` | |
| - **If the workflow was successful**: Call the `noop` tool with message "CI workflow completed successfully - no investigation needed" and **stop immediately**. Do not proceed with any further analysis. | |
| - **If the workflow failed**: Proceed with the investigation steps below. |
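The suggested change above takes the second option (dropping `cancelled` from the prompt). For the first option, a minimal sketch of a widened gate, assuming it replaces the existing `if:` in the `ci-doctor.md` frontmatter and that `ci-doctor.lock.yml` is recompiled afterwards:

```yaml
# Sketch only: assumes cancelled CI runs should also be investigated.
# The prompt would then keep its "failure or cancelled" wording unchanged.
if: ${{ github.event.workflow_run.conclusion == 'failure' || github.event.workflow_run.conclusion == 'cancelled' }}
```

Either way, the prompt and the gate should name the same set of conclusions, so the agent is never told to handle a case the workflow cannot reach.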
| ### Phase 3: Historical Context Analysis | ||
| 1. **Search Investigation History**: Use file-based storage to search for similar failures: | ||
| - Read from cached investigation files in `/tmp/memory/investigations/` | ||
| - Parse previous failure patterns and solutions | ||
| - Look for recurring error signatures | ||
| 2. **Issue History**: Search existing issues for related problems | ||
| 3. **Commit Analysis**: Examine the commit that triggered the failure | ||
| 4. **PR Context**: If triggered by a PR, analyze the changed files | ||
| ### Phase 4: Root Cause Investigation | ||
| 1. **Categorize Failure Type**: | ||
| - **Code Issues**: Syntax errors, logic bugs, test failures | ||
| - **Infrastructure**: Runner issues, network problems, resource constraints | ||
| - **Dependencies**: Version conflicts, missing packages, outdated libraries | ||
| - **Configuration**: Workflow configuration, environment variables | ||
| - **Flaky Tests**: Intermittent failures, timing issues | ||
| - **External Services**: Third-party API failures, downstream dependencies | ||
| 2. **Deep Dive Analysis**: | ||
| - For test failures: Identify specific test methods and assertions | ||
| - For build failures: Analyze compilation errors and missing dependencies | ||
| - For infrastructure issues: Check runner logs and resource usage | ||
| - For timeout issues: Identify slow operations and bottlenecks | ||
| ### Phase 5: Pattern Storage and Knowledge Building | ||
| 1. **Store Investigation**: Save structured investigation data to files: | ||
| - Write investigation report to `/tmp/memory/investigations/<timestamp>-<run-id>.json` | ||
| - Store error patterns in `/tmp/memory/patterns/` | ||
| - Maintain an index file of all investigations for fast searching |
The workflow enables cache-memory, but the prompt instructs writing investigation history to `/tmp/memory/...`. The generated workflow only restores/saves `/tmp/gh-aw/cache-memory`, so anything written to `/tmp/memory` won’t persist across runs. Please update the prompt to write under the cache-memory directory (e.g., `/tmp/gh-aw/cache-memory/...`) so the “historical context” and pattern DB actually work.
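To illustrate why the path matters: a cache step persists only the directories listed under its `path` input. The following is a minimal sketch, assuming the generated workflow relies on `actions/cache` in roughly this shape; the step name and key layout are hypothetical and not copied from `ci-doctor.lock.yml`.

```yaml
# Hypothetical step: the name and cache key are illustrative assumptions.
- name: Restore and save cache-memory
  uses: actions/cache@v4
  with:
    # Only this directory survives between runs; /tmp/memory is not covered.
    path: /tmp/gh-aw/cache-memory
    # A unique key per run forces a fresh save; restore-keys picks up the latest prior entry.
    key: ci-doctor-memory-${{ github.run_id }}
    restore-keys: |
      ci-doctor-memory-
```

Under that assumption, the prompt's `investigations/` and `patterns/` directories would need to live beneath `/tmp/gh-aw/cache-memory/` to be restored on the next run.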
| pre_activation: | ||
| if: ${{ github.event.workflow_run.conclusion == 'failure' }} | ||
| runs-on: ubuntu-slim |
`pre_activation` doesn’t declare any job permissions, while the workflow’s top-level `permissions` is `{}`. The `actions/github-script` step uses `${{ secrets.GITHUB_TOKEN }}` to call GitHub APIs (team membership check), which is likely to fail with an insufficiently scoped token. Add the minimal required `permissions:` for this job (at least `contents: read`, and any others required by `check_membership.cjs`) so activation gating works reliably.
| runs-on: ubuntu-slim | |
| runs-on: ubuntu-slim | |
| permissions: | |
| contents: read |
| - If you find a duplicate issue, add a comment with your findings and close the investigation. | ||
| - Do NOT open a new issue since you found a duplicate already (skip next phases). | ||
| ### Phase 6: Reporting and Recommendations |
There are two sections labeled “Phase 6” ("Looking for existing issues" and "Reporting and Recommendations"). Please renumber one of them to avoid confusion when following the protocol.
| ### Phase 6: Reporting and Recommendations | |
| ### Phase 7: Reporting and Recommendations |
| on: | ||
| workflow_run: | ||
| workflows: ["CI"] # Monitor the CI workflow specifically | ||
| types: | ||
| - completed | ||
| branches: | ||
| - main | ||
| # This will trigger only when the CI workflow completes with failure | ||
| # The condition is handled in the workflow body | ||
| stop-after: +1mo | ||
| # Only trigger for failures - check in the workflow body | ||
| if: ${{ github.event.workflow_run.conclusion == 'failure' }} | ||
`stop-after: +1mo` will cause this workflow to automatically deactivate after a month (and bakes a fixed stop time into the generated `.lock.yml`). If this is intended to be a long-lived workflow, please remove `stop-after` (or set an appropriate long-term window) so it doesn’t unexpectedly stop running and require periodic re-compilation.
Add agentic workflow ci-doctor