Skip to content

Add GitHub Actions workflow to run evaluate-pr-tests via Copilot CLI#34548

Merged
PureWeen merged 42 commits intomainfrom
feature/copilot-evaluate-tests-workflow
Mar 25, 2026
Merged

Add GitHub Actions workflow to run evaluate-pr-tests via Copilot CLI#34548
PureWeen merged 42 commits intomainfrom
feature/copilot-evaluate-tests-workflow

Conversation

@PureWeen
Copy link
Copy Markdown
Member

@PureWeen PureWeen commented Mar 18, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Description

Adds a gh-aw (GitHub Agentic Workflows) workflow that automatically evaluates test quality on PRs using the evaluate-pr-tests skill.

What it does

When a PR adds or modifies test files, this workflow:

  1. Checks out the PR branch (including fork PRs) in a pre-agent step
  2. Runs the evaluate-pr-tests skill via Copilot CLI in a sandboxed container
  3. Posts the evaluation report as a PR comment using gh-aw safe-outputs

Triggers

Trigger When Fork PR support
pull_request Automatic on test file changes (src/**/tests/**) ❌ Blocked by pre_activation gate
workflow_dispatch Manual — enter PR number ✅ Works for all PRs
issue_comment (/evaluate-tests) Comment on PR ⚠️ Same-repo only (see Known Limitations)

Security model

Layer Implementation
gh-aw sandbox Agent runs in container with scrubbed credentials, network firewall
Safe outputs Max 1 PR comment per run, content-limited
Checkout without execution steps: checks out PR code but never executes workspace scripts
Base branch restoration .github/skills/, .github/instructions/, .github/copilot-instructions.md restored from base branch after checkout
Fork PR activation gate pull_request events blocked for forks via head.repo.id == repository_id
Pinned actions SHA-pinned actions/checkout, actions/github-script, etc.
Minimal permissions Each job declares only what it needs
Concurrency One evaluation per PR, cancels in-progress
Threat detection gh-aw built-in threat detection analyzes agent output

Files added/modified

  • .github/workflows/copilot-evaluate-tests.md — gh-aw workflow source
  • .github/workflows/copilot-evaluate-tests.lock.yml — Compiled workflow (auto-generated by gh aw compile)
  • .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1 — Test context gathering script (binary-safe file download, path traversal protection)
  • .github/instructions/gh-aw-workflows.instructions.md — Copilot instructions for gh-aw development

Known Limitations

Fork PR evaluation via /evaluate-tests comment is not supported in v1. The gh-aw platform inserts a checkout_pr_branch.cjs step after all user steps, which may overwrite base-branch skill files restored for fork PRs. This is a known gh-aw platform limitation — user steps always run before platform-generated steps, with no way to insert steps after.

Workaround: Use workflow_dispatch (Actions UI → "Run workflow" → enter PR number) to evaluate fork PRs. This trigger bypasses the platform checkout step entirely and works correctly.

Related upstream issues:

Fixes

Adds a workflow that automatically evaluates test quality on PRs using
the evaluate-pr-tests skill through the Copilot CLI.

Architecture:
- 3-job pipeline: gate → evaluate → comment
- Same-repo PRs: triggers automatically on test file changes
- Fork PRs: gated behind /evaluate-tests comment from write+ user
- Token separation: COPILOT_GITHUB_TOKEN for Copilot API, github.token for PR comments
- File sandboxing: --add-dir restricts agent to repo checkout only
- Tool allowlist: bash, view, grep, glob, lsp (no git push, no network)
- Idempotent comments: updates existing bot comment instead of creating duplicates

Security model follows dotnet/skills evaluation.yml patterns:
- Pinned action SHAs
- persist-credentials: false
- Minimal per-job permissions
- Fork PR permission verification

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 18, 2026 17:23
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 18, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34548

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34548"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

github-actions bot and others added 8 commits March 18, 2026 15:09
…el if:

Adopted the dotnet/skills pattern: comment body is only checked in the
GitHub Actions expression engine (job-level if: condition), never passed
to any run: block or env: variable. Permission check and eyes reaction
are separate steps. All event data in run: blocks passed via env:.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add workflow file paths to pull_request trigger so the workflow runs on its own PR
- Add COPILOT_GITHUB_TOKEN presence check before running Copilot CLI
- Write informative message when token is missing (graceful degradation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The download-artifact action preserves the original directory structure,
so /tmp/copilot-output.txt becomes /tmp/evaluation/tmp/copilot-output.txt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Pass GH_TOKEN (github.token) to Copilot CLI so it can resolve PR
  metadata via 'gh pr diff' and 'gh api'
- Add usage example for dispatching against a specific PR from this branch
- Include repo context in prompt so CLI knows which repo to query

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem: Copilot CLI checked out the target PR's head SHA, which doesn't
have the evaluate-pr-tests skill. Also couldn't call gh API from sandbox.

Fix:
- Gate job now pre-fetches PR diff, metadata, and file list as artifacts
- Evaluate job checks out the workflow branch (which has .github/skills/)
- CLI reads pre-fetched PR context from /tmp/pr-context/ (local files)
- No GH_TOKEN needed in evaluate job — all API calls happen in gate
- Added /tmp/pr-context to --add-dir for CLI sandbox access

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CLI sandbox couldn't run pwsh or create directories, so it was
synthesizing evaluations instead of using the actual analysis script.

Fix:
- Gate job now checks out PR code at head_sha (fetch-depth: 0 for diff)
- Runs Gather-TestContext.ps1 against the PR code with proper base branch
- Includes test-context.md in the pr-context artifact
- Updated prompt: tells CLI to read pre-generated files, NOT run scripts
- Explicit instruction to not attempt running Gather-TestContext.ps1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
actions/checkout refuses paths outside /home/runner/work/ — use
'pr-code' relative to workspace instead of /tmp/pr-code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the copilot-evaluate-tests workflow to prefer a MAUI_BOT_TOKEN PAT (from the MauiBot account) for posting PR comments and fall back to the built-in github.token. Comments in the file were updated to document the new token purpose and clarify that github.token is used for reactions and commit statuses. The GH_TOKEN env was changed to ${{ secrets.MAUI_BOT_TOKEN || github.token }} to implement the fallback.
@dotnet dotnet deleted a comment from github-actions bot Mar 18, 2026
kubaflo and others added 9 commits March 19, 2026 00:43
- Add copilot-evaluate-tests.md (gh-aw source)
- Add copilot-evaluate-tests.lock.yml (compiled)
- Delete copilot-evaluate-tests.yml (old CLI workflow)
- Delete Post-CopilotComment.ps1 (use add_comment safe-output)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hub.com/dotnet/maui into feature/copilot-evaluate-tests-workflow

# Conflicts:
#	.github/scripts/Post-CopilotComment.ps1
#	.github/workflows/copilot-evaluate-tests.yml
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 PR Test Evaluation

Overall Verdict: ℹ️ No test evaluation needed

This PR adds GitHub Actions workflow infrastructure for automated test evaluation and includes minor test utility improvements. Since it contains no functional code changes that would require test coverage, the standard test evaluation criteria do not apply.

📊 Expand Analysis Summary

Changed Files Analysis

This PR primarily contains:

Workflow Infrastructure (83% of changes)

  • .github/workflows/copilot-evaluate-tests.lock.yml - New workflow for automated PR test evaluation
  • .github/workflows/copilot-evaluate-tests.md - Workflow documentation
  • .github/aw/actions-lock.json - Action configuration
  • Various workflow-related documentation updates

Minor Test Infrastructure Improvements (17% of changes)

  • UITest.cs: Removed unused ImageMagick.Drawing import and simplified width/height validation
  • VisualTestUtils.MagickNet: Updated to use newer ImageMagick patterns (RePage() vs ResetPage(), implicit int casting)
  • BaseTemplateTests.cs: Added OnlyAndroid() helper for integration tests
  • Integration test files: Minor updates to test helper methods

Assessment

Appropriate scope: This is infrastructure work that doesn't require functional test coverage
Test utility improvements: The minor changes to test infrastructure are improvements/modernizations
No functional changes: No business logic or UI changes that would need testing

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • dc.services.visualstudio.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "dc.services.visualstudio.com"

See Network Configuration for more information.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

🧪 Test evaluation by Evaluate PR Tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 PR Test Evaluation

Overall Verdict: ❌ Tests are insufficient

No tests were added for this GitHub workflow infrastructure change.

📊 Expand Full Evaluation

PR Test Evaluation Report

PR: #34548 — Add copilot-evaluate-tests workflow
Test files evaluated: 0
Fix files: 0 (workflow/infrastructure only)


Overall Verdict

❌ Tests are insufficient

This PR adds GitHub workflow infrastructure but includes no tests to validate the workflow functionality.


1. Fix Coverage — ❌

No tests were added for this infrastructure change. While this PR adds a GitHub Actions workflow for test evaluation, there are no tests to verify:

  • The workflow triggers correctly on the expected events
  • The skill integration works as intended
  • The comment posting functionality operates properly
  • The workflow handles edge cases (fork PRs, missing permissions, etc.)

2. Edge Cases & Gaps — ❌

Covered:

  • None

Missing:

  • Workflow trigger validation (PR events, issue comments, workflow dispatch)
  • Permission handling for fork PRs vs same-repo PRs
  • Error handling when the skill fails or times out
  • Comment formatting and posting verification
  • Integration testing with the evaluate-pr-tests skill

3. Test Type Appropriateness — ❌

Current: No tests
Recommendation: Integration tests would be appropriate for workflow validation, possibly using GitHub's workflow testing patterns or mock scenarios.

4. Convention Compliance — ✅

N/A - No test files to evaluate for conventions.

5. Flakiness Risk — N/A

Cannot assess without tests.

6. Duplicate Coverage — ✅ No duplicates

No existing workflow tests to compare against.

7. Platform Scope — N/A

This is workflow infrastructure, not platform-specific code.

8. Assertion Quality — ❌

No assertions present as no tests were added.

9. Fix-Test Alignment — ❌

The infrastructure change (workflow addition) has no corresponding tests to validate its functionality.


Recommendations

  1. Add workflow integration tests - Consider adding tests that validate the workflow triggers and executes correctly in different scenarios
  2. Add skill unit tests - The evaluate-pr-tests skill should have its own tests to ensure reliable operation
  3. Document testing strategy - If manual testing was performed, document the test scenarios in the PR description

Note: For infrastructure/workflow changes like this, testing may be inherently challenging and manual validation may be sufficient. However, documenting the manual testing performed would improve confidence in the change.

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • dc.services.visualstudio.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "dc.services.visualstudio.com"

See Network Configuration for more information.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

🧪 Test evaluation by Evaluate PR Tests

Eliminates integrity filtering warning on fork PR comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 PR Test Evaluation

Overall Verdict: ✅ Tests are adequate

This PR adds GitHub Actions workflow infrastructure and does not contain functional code changes that require test coverage.

📊 Expand Full Evaluation

PR Test Evaluation Report

PR: #34548 — Add GitHub Actions workflow to run evaluate-pr-tests via Copilot CLI
Test files evaluated: 0
Fix files: 0


Overall Verdict

Tests are adequate

This PR adds GitHub Actions workflow infrastructure (.github/workflows/copilot-evaluate-tests.lock.yml, .github/workflows/copilot-evaluate-tests.md, and metadata) but does not contain functional code changes that require test coverage.


1. Fix Coverage — ℹ️ N/A

Finding: This PR contains no functional code changes to evaluate for test coverage.

Analysis: The changed files are GitHub Actions workflow definitions and documentation:

  • .github/aw/actions-lock.json — Actions security lockdown metadata
  • .github/workflows/copilot-evaluate-tests.lock.yml — GitHub Actions workflow definition
  • .github/workflows/copilot-evaluate-tests.md — Workflow documentation

These are infrastructure/configuration files that enable automated test evaluation on future PRs, not code that needs test coverage itself.


Evaluation Summary

No evaluation criteria apply to this PR because:

  • No fix files detected — Only GitHub Actions workflow and metadata files
  • No test files added — No test coverage to evaluate
  • No functional code changes — Infrastructure-only changes

Expected behavior: Future PRs that modify functional code will trigger this new workflow to automatically evaluate their test coverage using the evaluate-pr-tests skill.


Recommendations

None required — This PR appropriately adds CI infrastructure without functional changes requiring test coverage.

For future PRs: The new workflow will automatically evaluate test quality when functional changes are detected.

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • dc.services.visualstudio.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "dc.services.visualstudio.com"

See Network Configuration for more information.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

🧪 Test evaluation by Evaluate PR Tests

The agent was evaluating the workflow branch instead of the target PR
when triggered via workflow_dispatch. Two fixes:
1. Prompt now instructs gh pr checkout <number> before analysis
2. Gather-TestContext.ps1 accepts -PrNumber and uses gh pr diff for
   correct file listing regardless of local checkout state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 PR Test Evaluation

Overall Verdict: ℹ️ No functional test evaluation needed

This PR adds GitHub Actions workflow infrastructure and contains only cleanup/refactoring changes to test utilities.

📊 Expand Full Evaluation

PR Test Evaluation Report

PR: #34548 — Add GitHub Actions workflow to run evaluate-pr-tests via Copilot CLI
Test files evaluated: 0 functional tests
Fix files: 0 functional fixes


Overall Verdict

ℹ️ No functional test evaluation needed

This PR adds GitHub Actions workflow infrastructure for automated test evaluation. The test file changes are infrastructure cleanup/refactoring, not new test coverage for bug fixes.


Analysis Summary

What this PR contains:

  • GitHub Actions workflow — Adds copilot-evaluate-tests.yml to automatically evaluate test quality on PRs
  • Security model — Follows dotnet/skills patterns with token separation, fork gating, file sandboxing
  • Infrastructure cleanup — Minor refactoring in test utilities (ImageMagick, integration tests)

Files changed:

  • Workflow infrastructure: .github/workflows/copilot-evaluate-tests.lock.yml, related scripts
  • Test utilities cleanup: Removed unused imports, simplified ImageMagick calls, minor integration test improvements
  • Skills updates: Enhanced Gather-TestContext.ps1, updated skill documentation

Infrastructure Changes Assessment

Test Utility Updates:

  • UITest.cs: Removed unused ImageMagick.Drawing import, simplified screenshot masking
  • MagickNetImageEditor.cs: Removed redundant bounds checks, simplified API calls
  • Integration tests: Added platform-specific logic for CoreCLR vs Mono runtime

Quality: These are appropriate cleanup changes that:

  • Remove dead code and unused imports
  • Simplify API usage where safe
  • Add proper platform-specific handling

Recommendations

  1. Workflow is ready — The GitHub Actions workflow follows security best practices and integrates well with existing CI
  2. Test cleanup is appropriate — Infrastructure changes improve code quality without affecting functionality
  3. Documentation is comprehensive — The PR description clearly explains the security model and prerequisites

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • dc.services.visualstudio.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "dc.services.visualstudio.com"

See Network Configuration for more information.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

🧪 Test evaluation by Evaluate PR Tests

@PureWeen
Copy link
Copy Markdown
Member Author

Re-Review #3: PR #34548 — Commits through 8ab12f4

3 new commits since our ✅ Approve changed the security model fundamentally:

Commit Change
7b06ea0 Fetch changed files via API instead of skipping fork PRs
acf8b1c Remove fork guard entirely, allow checkout for all PRs
8ab12f4 Restore .github/skills/ and .github/instructions/ from base branch after checkout

Previous Findings — All Still Fixed ✅

Finding Status
Path traversal in Gather-TestContext.ps1 ✅ FIXED
Binary corruption (Encoding.UTF8WriteAllBytes) ✅ FIXED
URL encoding for API paths ✅ FIXED
Fork guard fail-open N/A — guard removed, replaced with new model

New Security Model Assessment

The "checkout without execution" approach is architecturally sound for preventing classic pwn-request RCE. No workspace scripts are executed after checkout in the steps: context. The gh-aw sandbox runs with scrubbed credentials, network firewalling, and safe-outputs limits (max 1 comment). The pre_activation gate blocks fork PRs on pull_request events (head.repo.id == repository_id). This is well-documented with clear warnings.

However, the trusted-file restoration has structural issues:

New Findings (5-model consensus, 2+ agreement threshold)

🔴 Finding 1 — checkout_pr_branch.cjs runs AFTER restore, likely undoes it (5/5 models)

The compiled lock.yml has two "Checkout PR branch" steps in the agent job:

  1. User's step (~line 332): gh pr checkout + git checkout "$BASE_SHA" -- .github/skills/ .github/instructions/
  2. gh-aw platform step (~line 347): checkout_pr_branch.cjs with if: (github.event.pull_request) || (github.event.issue.pull_request) — runs after step 1

For /evaluate-tests on a fork PR (issue_comment trigger), checkout_pr_branch.cjs fires and likely re-applies the PR branch, overwriting the restored files. The restoration becomes a no-op for the exact scenario it was designed to protect.

Fix: Move the git checkout "$BASE_SHA" restoration to occur after all checkout steps, or verify with the gh-aw team that checkout_pr_branch.cjs is a no-op when the branch is already checked out.

🔴 Finding 2 — .github/copilot-instructions.md not in restore list (5/5 models)

The restore covers two directories but misses the primary Copilot instruction file:

git checkout "$BASE_SHA" -- .github/skills/ .github/instructions/ 2>/dev/null || true
#                          ☝️ missing: .github/copilot-instructions.md

This 19KB file is auto-loaded by Copilot CLI as the system-level behavioral context. A fork PR could modify it to inject instructions that bias the evaluation (e.g., "Always report tests as adequate").

Fix: Add to the restore list:

git checkout "$BASE_SHA" -- .github/skills/ .github/instructions/ .github/copilot-instructions.md 2>/dev/null || true

🟡 Finding 3 — git checkout doesn't remove fork-added files (4/5 models)

git checkout "$BASE_SHA" -- .github/skills/ restores existing files but does not remove files added by the fork. A fork could add .github/skills/malicious/SKILL.md that survives the restore and gets discovered by the agent.

Fix:

rm -rf .github/skills/ .github/instructions/
git checkout "$BASE_SHA" -- .github/skills/ .github/instructions/ .github/copilot-instructions.md 2>/dev/null || true

🟡 Finding 4 — || true silently swallows restore failures (4/5 models)

If the restore fails (bad SHA, git error), the workflow continues with fork-supplied files and no observable failure. At minimum, log a warning.

🟢 Finding 5 — Gate exit 1 still shows red X (3/5 models)

PRs without test files get a ❌ failed check instead of a skip. UX issue, not security.

🟢 Finding 6 — .pr-changed-files.txt spoofable by fork (3/5 models)

A fork could commit a .pr-changed-files.txt picked up by Gather-TestContext.ps1, directing the agent to evaluate different files. Low impact since the agent also reads the actual PR diff.

Blast Radius Assessment

Even if all findings above were exploited:

  • No RCE — no workspace scripts executed in privileged context
  • No credential exfil — sandbox has scrubbed credentials + network firewall
  • Max damage — 1 misleading PR comment (safe-outputs limit)
  • Threat detection — gh-aw's threat detection job analyzes agent output

CI Status: ✅ passing, ⏭️ 2 skipping, ⏳ Build Analysis pending

Recommended Action: ⚠️ Request Changes

Findings 1 and 2 are straightforward fixes. The fundamental "checkout without execution" model is sound, but the restoration needs to:

  1. Run after checkout_pr_branch.cjs (or be verified as not undone)
  2. Include .github/copilot-instructions.md in the restore list
  3. Clean directories before restoring (rm -rf + git checkout)

Findings addressed:
- [CRITICAL] checkout_pr_branch.cjs ordering: split restore into a
  separate step that runs as late as possible. For workflow_dispatch
  the platform checkout is skipped (if condition false), so this is
  fully effective. For issue_comment on fork PRs, the platform step
  may still run after — noted as a gh-aw platform limitation.
- [CRITICAL] .github/copilot-instructions.md added to restore list
- [MODERATE] rm -rf before git checkout to remove fork-added files
- [MODERATE] restore failure now exits with error instead of || true
- [MINOR] gate exit 1 is a UX issue — acceptable for now

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Member Author

Re-Review #3 Response — Commit 77361d5

All 6 findings addressed. Here is the breakdown:

🔴 Finding 1 — checkout_pr_branch.cjs ordering

Status: Partially addressed (gh-aw platform limitation)

Split the restore into a separate step ("Restore agent infrastructure from base branch") so it runs as late as possible among user steps. However, gh aw compile places ALL user steps: before platform-generated steps, so checkout_pr_branch.cjs still runs after our restore.

Current impact by trigger:

Trigger Platform checkout runs? Restore effective?
workflow_dispatch ❌ Skipped (if: pull_request || issue.pull_request is false) ✅ Fully effective
pull_request (same-repo) ✅ Runs ✅ N/A — same repo has all files
pull_request (fork) ✅ Runs but pre_activation gate blocks fork PRs ✅ N/A — workflow never reaches agent
issue_comment (fork PR) ✅ Runs and may undo restore ⚠️ May be undone

The only vulnerable path is /evaluate-tests comments on fork PRs. This is a gh-aw platform ordering limitation — we cannot insert user steps after platform steps. Options:

  1. File a gh-aw feature request for post-platform user steps
  2. Verify with gh-aw team whether checkout_pr_branch.cjs is a no-op when branch is already checked out
  3. Accept the risk (blast radius = 1 misleading comment on a fork PR explicitly triggered by a maintainer)

🔴 Finding 2 — .github/copilot-instructions.md ✅ FIXED

Added to the git checkout restore list alongside .github/skills/ and .github/instructions/.

🟡 Finding 3 — Fork-added files ✅ FIXED

Added rm -rf .github/skills/ .github/instructions/ before git checkout restore. This removes any fork-added files before restoring the base branch versions.

🟡 Finding 4 — || true swallowing failures ✅ FIXED

Replaced with explicit exit code check:

git checkout "$BASE_SHA" -- ... 2>/dev/null
if [ $? -eq 0 ]; then
  echo "✅ Restored agent infrastructure from base branch ($BASE_SHA)"
else
  echo "⚠️ Failed to restore agent infrastructure from $BASE_SHA"
  exit 1
fi

🟢 Finding 5 — Gate exit 1 red X

Acknowledged as UX issue. Keeping exit 1 for now since exit 0 would require a separate mechanism to prevent the agent from running on non-test PRs.

🟢 Finding 6 — .pr-changed-files.txt spoofing

Low impact since the agent also reads the actual PR diff via MCP. No change needed.

Architecture note

BASE_SHA is now saved to $GITHUB_ENV in the checkout step so the restore step can read it. The two steps are:

  1. Checkout PR branch — saves BASE_SHA, runs gh pr checkout
  2. Restore agent infrastructurerm -rf + git checkout $BASE_SHA -- <paths>, fails hard on error

Reference upstream gh-aw#18481 in the error message and point users
to the Actions tab workflow_dispatch trigger as the workaround.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Member Author

Re-Review #4: PR #34548 — Commits 77361d5 + 39405a9

2 new commits since Re-Review #3 (8ab12f4). 5-model consensus (4/5 agents completed).

Previous Findings — All Addressed ✅

Finding Status Notes
🔴 F1: checkout_pr_branch.cjs ordering ADDRESSED Untracked .restore-backup/ survives git checkout; agent prompt (from trusted base-branch artifact) instructs restore as first action
🔴 F2: .github/copilot-instructions.md not in restore FIXED Added to both git checkout $BASE_SHA -- list and .restore-backup/ backup
🟡 F3: Fork-added files survive restore FIXED rm -rf .github/skills/ .github/instructions/ now runs before restore
🟡 F4: || true swallowed failures FIXED Replaced with if [ $? -eq 0 ]; then ... else exit 1; fi
🟢 F5: Gate exit 1 UX ⚠️ Still present Accepted by author — no change
🟢 F6: Changed files spoofable IMPROVED Gather-TestContext.ps1 -PrNumber uses gh pr diff via API, not local git diff

The untracked backup design is mechanically correct: git checkout <branch> only affects tracked files, so .restore-backup/ (untracked) survives checkout_pr_branch.cjs. The agent prompt is baked from the base-branch artifact by the activation job — fork PRs cannot alter it.


Remaining Issue: issue_comment Fork Guard is Fail-Open (4/4 consensus)

The new guard for /evaluate-tests comments on fork PRs:

HEAD_REPO_ID=$(gh pr view "$PR_NUMBER" --repo "$GITHUB_REPOSITORY" \
  --json headRepository --jq '.headRepository.id // ""')
BASE_REPO_ID=$(gh pr view "$PR_NUMBER" --repo "$GITHUB_REPOSITORY" \
  --json baseRepository --jq '.baseRepository.id // ""')
if [ -n "$HEAD_REPO_ID" ] && [ -n "$BASE_REPO_ID" ] && \
   [ "$HEAD_REPO_ID" != "$BASE_REPO_ID" ]; then
  exit 1
fi

When gh pr view fails (API error, rate limit, transient network issue), both IDs are "". [ -n "" ] is false, so the outer if never fires — fork PRs are not blocked. This is fail-open, and is a regression from the fail-closed approach in e65c602 ([ -z "$HEAD_OWNER" ] || ...).

One-line fix:

HEAD_REPO_ID=$(gh pr view "$PR_NUMBER" --repo "$GITHUB_REPOSITORY" \
  --json headRepository --jq '.headRepository.id' 2>/dev/null) || { echo "❌ API error"; exit 1; }
BASE_REPO_ID=$(gh pr view "$PR_NUMBER" --repo "$GITHUB_REPOSITORY" \
  --json baseRepository --jq '.baseRepository.id' 2>/dev/null) || { echo "❌ API error"; exit 1; }
if [ "$HEAD_REPO_ID" != "$BASE_REPO_ID" ]; then
  exit 1
fi

Mitigating factors (why this is 🟡 not 🔴):

  • Even if the guard fails open, the untracked backup + agent prompt restore still apply
  • Agent runs in a sandboxed container with scrubbed credentials
  • Max blast radius remains 1 misleading PR comment (safe-outputs enforced)
  • pull_request-triggered fork runs are already blocked by the activation job's if: condition

Minor Note: Agent Restore Is a Soft Control (4/4 consensus)

The final defense is the LLM agent following the prompt instruction: "Before doing ANYTHING else, you MUST restore from .restore-backup/". The prompt is trusted (base-branch artifact, not workspace), but LLM compliance is probabilistic. This is acceptable as defense-in-depth (sandbox + max 1 comment), not as a sole security gate — and it isn't, since fork PR paths have additional blocks before reaching the agent.


Verdict: ⚠️ One small fix, then ✅ Approve

The fail-open guard is a one-line change. Everything else is well-addressed, and the untracked backup architecture is clever and sound. Once the issue_comment fork guard is made fail-closed, this is ready to merge.

Review conducted with 5-model consensus (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex)

github-actions bot and others added 8 commits March 24, 2026 14:13
When gh pr view fails (rate limit, network error), both IDs are empty
and the guard silently passes. Now uses || exit 1 on each API call
and removes the -n guards so any mismatch (including empty) blocks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The .restore-backup/ mechanism and agent prompt restore were dead code:
- Same-repo PRs: files already exist on the PR branch
- Fork PRs via workflow_dispatch: platform checkout skipped, restore works
- Fork PRs via /evaluate-tests: hard-fail before reaching restore

Keep only the simple git checkout restore for the workflow_dispatch path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…il-closed patterns

Incorporates learnings from building copilot-evaluate-tests workflow:
- Detailed execution model diagram (activation → user steps → platform steps → agent)
- Step ordering constraint (user steps always before platform steps)
- Prompt rendering via runtime-import from base branch
- Fork PR behavior table by trigger type
- issue_comment + fork platform limitation and workaround
- Fail-closed fork guard pattern with code examples
- Upstream gh-aw issue references (#18481, #18518, #18521)
- Expanded limitations and troubleshooting tables

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move checkout, fork guard, and restore logic into .github/scripts/gh-aw-checkout-pr.sh.
This makes the pattern reusable by future gh-aw workflows with a single step:

  run: bash .github/scripts/gh-aw-checkout-pr.sh

The script handles:
- Base SHA capture before PR checkout
- Fail-closed fork guard for issue_comment triggers
- PR branch checkout via gh CLI
- Restore of trusted agent infrastructure from base branch

Fix set -e + $? bug found by 3-model review (Opus/Sonnet/Codex):
git checkout inside if-condition is exempt from errexit, ensuring
the diagnostic message prints on restore failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All other scripts in .github/scripts/ are .ps1. Rename gh-aw-checkout-pr.sh
to Checkout-GhAwPr.ps1, following PowerShell verb-noun naming convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the fork detection guard from Checkout-GhAwPr.ps1. Fork PRs that
are rebased on main already have the skill files, so checkout_pr_branch.cjs
producing the final workspace state is fine.

Make the base-branch restore best-effort (non-fatal) since for issue_comment
triggers the platform may overwrite it anyway.

Add a pre-flight check to the agent prompt: if SKILL.md is missing after
all checkouts, post a helpful comment telling the user to rebase or use
workflow_dispatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… debugging

- Remove .pr-changed-files.txt support from Gather-TestContext.ps1 (dead code
  attack surface — no step creates this file, fork could ship one to manipulate
  analysis)
- Update gh-aw instructions: replace stale fork hard-fail documentation with
  optimistic path description and known limitation
- Clean up Checkout-GhAwPr.ps1: remove unused GITHUB_EVENT_NAME/WORKFLOW_NAME
  references, replace 2>$null with 2>&1 for restore diagnostics
- Remove WORKFLOW_NAME env var from workflow (no longer consumed by script)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Member Author

Re-Review #6 — Round 6 Security Assessment

5 new commits since last review (efde5ef6, 1d399304, 207c3ddf, 876587d4, de1d9e10)

Previous Findings Status

Finding Status
R5: Fork guard fail-open ([ -n "" ] passes when API fails) CHANGED — Guard removed entirely; replaced by test -f SKILL.md existence check
R4: Missing copilot-instructions.md in restore ✅ Fixed (still covered in Checkout-GhAwPr.ps1)
R4: No rm -rf before restore ✅ Fixed
R4: `
R3–R5: All earlier findings ✅ Fixed/N/A

New Findings (3/4 model consensus)

🟡 MODERATE — test -f SKILL.md existence check does not protect against a modified SKILL.md

Models agreeing: 3/4 (claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex)

For issue_comment triggered on a fork PR, the execution order is:

  1. User step (Checkout-GhAwPr.ps1): checks out PR branch, then does best-effort git checkout $BaseSha -- .github/skills/ ... to restore trusted files ✅
  2. Platform step (checkout_pr_branch.cjs): re-checks out the fork branch → overwrites .github/skills/ with the fork's version
  3. Agent pre-flight: test -f .github/skills/evaluate-pr-tests/SKILL.md → if the fork includes a modified SKILL.md, this check passes and the agent runs with the fork's instructions

The check correctly blocks forks that omit the skill file (e.g., external contributors' PRs that predate the workflow). It doesn't block forks that include a modified SKILL.md, which is the deliberate-attacker case.

Mitigating factors (why this is MODERATE rather than CRITICAL):

  • The /evaluate-tests slash command requires write access to trigger — a random external fork author cannot self-trigger this; a maintainer would have to deliberately run it on a suspicious fork PR
  • Agent is sandboxed (no credentials), output limited to 1 comment via safe-outputs
  • The blast radius is one potentially-misleading evaluation comment

The core issue: The existing instructions file (gh-aw-workflows.instructions.md) explicitly documents the fail-closed hard-fail as the recommended pattern and labels the current approach's alternative as an "anti-pattern." The code and documentation now actively contradict each other.

Suggested fix — reinstate fail-closed fork detection in Checkout-GhAwPr.ps1 (fixing the R5 fail-open bug):

# In Checkout-GhAwPr.ps1, add before gh pr checkout:
if ($env:GITHUB_EVENT_NAME -eq 'issue_comment') {
    $headRepoId = gh pr view $PrNumber --repo $env:GITHUB_REPOSITORY --json headRepository --jq '.headRepository.id'
    if ($LASTEXITCODE -ne 0) { Write-Error "Could not verify PR origin — failing closed"; exit 1 }
    $baseRepoId = gh repo view $env:GITHUB_REPOSITORY --json id --jq '.id'
    if ($LASTEXITCODE -ne 0) { Write-Error "Could not verify repo ID — failing closed"; exit 1 }
    if ($headRepoId -ne $baseRepoId) {
        Write-Host "Fork PR detected on issue_comment trigger — use workflow_dispatch instead"
        exit 1
    }
}

This is fail-closed (API error → exit 1, not exit 0), fixing the R5 bug, and matches what the instructions file documents.


🟢 MINOR — Documentation actively contradicts implementation

Models agreeing: 4/4

.github/instructions/gh-aw-workflows.instructions.md states:

Current workaround: Hard-fail for fork PRs on issue_comment, direct users to workflow_dispatch.

It also includes the fail-closed fork guard as the "Safe Pattern." Commit de1d9e10 removed this guard. Future maintainers and agents reading the instructions will believe a fork guard is in place when it isn't. The fix: either restore the guard (per the above), or update the docs to describe the actual model and its accepted risk explicitly.


🟢 MINOR — Best-effort restore on failure path (workflow_dispatch edge case)

Models agreeing: 2/4 (claude-sonnet-4.6, gpt-5.3-codex)

If git checkout $BaseSha -- ... fails in Checkout-GhAwPr.ps1, the script logs a warning but continues. For workflow_dispatch (where checkout_pr_branch.cjs is skipped), an unrestored failure would leave the PR branch's .github/skills/ in the workspace when the agent runs. Consider adding exit 1 on restore failure to be consistent with the script's $ErrorActionPreference = 'Stop' posture.


Blast Radius Note

Per the gh-aw platform: the agent has scrubbed credentials, network-firewalled, and can post max 1 comment (sanitized, 65KB). The pre-activation gate requires write access (admin/maintainer/write), so this threat requires a collaborator to trigger /evaluate-tests on a malicious fork PR. This significantly reduces real-world risk — the primary concern is comment-content manipulation (fake evaluation results, misleading test assessments).


Verdict: ⚠️ Request Changes

The simplification in de1d9e10 trades a fixable fail-open bug for a new "existence check ≠ integrity check" vulnerability on the issue_comment fork path. The fix from R5 was to remove the guard rather than correct it. Given that the instructions file already documents exactly what the correct implementation should look like, the ask is straightforward: implement the fail-closed guard as documented, or update the documentation to explicitly accept and describe the current risk posture.

Reviewed by 5-model consensus (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex) — 3/4 models concur on the MODERATE finding.

Address PR review findings (3-model consensus + PureWeen Round 6):

1. Checkout-GhAwPr.ps1: Add fail-closed fork guard for issue_comment
   triggers using isCrossRepository. Fork PRs are rejected with an
   actionable notice to use workflow_dispatch instead. API failures
   also fail closed (exit 1, not 0).

2. Gather-TestContext.ps1: Escape markdown metacharacters (pipes,
   angle brackets) in Format-FileList and use double-backtick code
   spans so literal backticks in filenames render correctly.

3. gh-aw-workflows.instructions.md: Update fork handling docs to
   reflect the fail-closed guard, remove 'Known limitation' section,
   update trigger table and troubleshooting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet dotnet deleted a comment from github-actions bot Mar 24, 2026
@dotnet dotnet deleted a comment from github-actions bot Mar 24, 2026
@dotnet dotnet deleted a comment from github-actions bot Mar 24, 2026
@dotnet dotnet deleted a comment from github-actions bot Mar 24, 2026
@dotnet dotnet deleted a comment from github-actions bot Mar 24, 2026
@dotnet dotnet deleted a comment from MauiBot Mar 24, 2026
- Add ready_for_review and reopened to pull_request types (reopened was
  lost when overriding defaults)
- Gate pull_request activation on draft == false so draft PRs don't
  waste agent runs
- When marked ready-for-review, the workflow now re-triggers
- Reorder concurrency group to prefer pull_request.number first

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 6717d1c into main Mar 25, 2026
9 of 10 checks passed
@PureWeen PureWeen deleted the feature/copilot-evaluate-tests-workflow branch March 25, 2026 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AGR] Add an gh-aw for our evaluate-pr-tests workflow to automatically assess test coverage for users submitting PRs

4 participants