Add slash command for promoting behavioral evals to CI blocking #20575

Merged
gundermanc merged 7 commits into main from gundermanc/promote
Feb 27, 2026

gundermanc (Member) commented Feb 27, 2026

Summary

Updates the docs to state that tests must start as non-CI-blocking and may be promoted to CI-blocking only via the new slash command. The slash command uses the historical record of nightly eval runs to identify tests that are reliable enough to promote.

  • Updated docs clarify that every test should start out as USUALLY_PASSES (not CI blocking).
  • Added a slash command that automatically reviews the nightly run history and promotes only tests that have passed 100% of the time (3/3 runs) for all supported models, 7 days in a row (21 consecutive passes per model).
  • The process to promote a test is now: check in -> let it stabilize over 7 days -> if needed, use /fix-behavioral-eval to stabilize -> use the slash command to promote.
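
For illustration, the promotion criterion described above could be sketched roughly as follows. This is not the slash command's actual logic (the command is defined in a .toml prompt file not shown here); the `NightlyResult` shape, the `promotableTests` function, and the assumption of one record per test, model, and night are all hypothetical.

```typescript
// Hypothetical record for one nightly run of one test on one model.
// The real nightly-run data format is not shown in this thread.
interface NightlyResult {
  date: string; // ISO date of the nightly run, e.g. "2026-02-20"
  model: string;
  test: string;
  passes: number; // passing attempts that night
  attempts: number; // total attempts that night (3 in the PR description)
}

// The PR description requires 7 days in a row.
const REQUIRED_DAYS = 7;

// Returns the tests whose most recent REQUIRED_DAYS nightly records are
// all-pass for every supported model. Assumes one record per night and
// does not check that the dates are contiguous.
function promotableTests(results: NightlyResult[], models: string[]): string[] {
  const tests = Array.from(new Set(results.map((r) => r.test)));
  return tests
    .filter((test) =>
      models.every((model) => {
        const recent = results
          .filter((r) => r.test === test && r.model === model)
          .sort((a, b) => b.date.localeCompare(a.date)) // newest first
          .slice(0, REQUIRED_DAYS);
        return (
          recent.length === REQUIRED_DAYS &&
          recent.every((r) => r.attempts > 0 && r.passes === r.attempts)
        );
      }),
    )
    .sort();
}
```

With 3 attempts per night, a test that clears this check has 21 consecutive passes per model, matching the figure in the description. A single failed attempt on any model within the window keeps the test at USUALLY_PASSES.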

gundermanc requested a review from a team as a code owner February 27, 2026 16:55

gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a structured process for managing behavioral evaluations, ensuring that new tests do not immediately block continuous integration. By mandating that all new evaluations start as USUALLY_PASSES and are promoted to ALWAYS_PASSES only after demonstrating consistent reliability via a new slash command, the change aims to enhance CI stability and reduce flakiness. The updated documentation provides clear guidelines for developers on this new workflow.

Highlights

  • New Behavioral Evaluation Policy: All new behavioral evaluations must now be created with the USUALLY_PASSES policy, ensuring they do not block CI initially.
  • Automated Test Promotion: A new slash command, /promote-behavioral-eval, has been introduced to automate the promotion of stable tests from USUALLY_PASSES to ALWAYS_PASSES.
  • Promotion Criteria: The promotion command verifies that tests have achieved a 100% success rate over at least 10 nightly runs across all supported models before updating their policy.
  • Documentation Update: The evals/README.md file has been updated to clearly outline the new test promotion process and the usage of the /promote-behavioral-eval command.

Changelog

  • evals/README.md
    • Updated the 'Policies' section to mandate that new behavioral evaluations start as USUALLY_PASSES and link to the new promotion process.
    • Added a new 'Test promotion process' section detailing the incubation, monitoring, and promotion steps for evaluations.
    • Modified an example evalTest call to use USUALLY_PASSES and included a comment about the promotion process.
    • Introduced a new top-level section 'Promoting evaluations' that describes the /promote-behavioral-eval slash command, its automated steps, and usage.
  • evals/validation_fidelity.eval.ts
    • Changed the evalTest policy from ALWAYS_PASSES to USUALLY_PASSES for the 'validation_fidelity' test.
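
As a rough sketch of what the policy change in the changelog means, the snippet below models the two policies and which one gates CI. The repository's real `evalTest` helper is not shown in this thread, so the `evalTest` defined here is a local stand-in with an assumed signature. `EvalCase`, `RegisteredEval`, `registry`, and the placeholder prompt are hypothetical as well.

```typescript
// Stand-in types; the real helper in the evals directory may differ.
type EvalPolicy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';

interface EvalCase {
  name: string;
  prompt: string;
}

interface RegisteredEval extends EvalCase {
  policy: EvalPolicy;
  blocksCi: boolean;
}

// Hypothetical in-memory registry, used only to make the sketch runnable.
const registry: RegisteredEval[] = [];

// Stand-in for evalTest: records the eval and whether its policy blocks CI.
// Only ALWAYS_PASSES evals are CI blocking, per the PR description.
function evalTest(policy: EvalPolicy, evalCase: EvalCase): void {
  registry.push({ ...evalCase, policy, blocksCi: policy === 'ALWAYS_PASSES' });
}

// New evals start as USUALLY_PASSES and are promoted later via
// /promote-behavioral-eval once they prove stable in nightly runs.
evalTest('USUALLY_PASSES', {
  name: 'validation_fidelity',
  prompt: '...', // placeholder; the real prompt is not shown in this thread
});
```

Under this model, promotion amounts to changing the first argument from 'USUALLY_PASSES' to 'ALWAYS_PASSES' for a test that has met the reliability bar.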

Ignored Files

  • Ignored by pattern: .gemini/** (1)
    • .gemini/commands/promote-behavioral-eval.toml

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new process for promoting behavioral evaluation tests. My review found a broken link in the documentation that should be fixed to ensure clarity for developers following the new process.

github-actions bot commented Feb 27, 2026

Size Change: -2 B (0%)

Total Size: 25.7 MB

| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/gemini.js | 25.2 MB | -2 B (0%) |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 221 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 227 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 11.5 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B |
| ./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B |
| ./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B |
| ./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B |
| ./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B |

compressed-size-action

gemini-cli bot added the status/need-issue label Feb 27, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
gundermanc enabled auto-merge February 27, 2026 17:38
gundermanc added this pull request to the merge queue Feb 27, 2026
gundermanc removed this pull request from the merge queue due to a manual request Feb 27, 2026
gundermanc added this pull request to the merge queue Feb 27, 2026
Merged via the queue into main with commit b2b6092 Feb 27, 2026
27 checks passed
gundermanc deleted the gundermanc/promote branch February 27, 2026 19:22
BryanBradfo pushed a commit to BryanBradfo/gemini-cli that referenced this pull request Mar 5, 2026
…le-gemini#20575)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
liamhelmer pushed a commit to badal-io/gemini-cli that referenced this pull request Mar 12, 2026
…le-gemini#20575)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>