feat: add behavioral evals for web tool selection by PewterZz · Pull Request #23415 · google-gemini/gemini-cli

PewterZz · 2026-03-22T01:40:35Z

Summary

Adds four behavioral evals testing the agent's ability to correctly choose between web tools based on the nature of the request -- without being told which tool to use.

Details

| Test | Policy | What it tests |
| --- | --- | --- |
| Current info not in local files | `USUALLY_PASSES` | Agent chooses `google_web_search` for version info not available locally |
| Specific URL provided | `USUALLY_PASSES` | Agent uses `web_fetch` for an explicit URL, not `google_web_search` |
| Answer available locally | `USUALLY_PASSES` | Agent reads `package.json` rather than searching the web |
| URL mentioned in context but task is local | `USUALLY_PASSES` | Agent resists fetching a URL mentioned as context when the task only requires local edits |

Design note: Prompts do not name the expected tool. Each eval creates a situation where the agent must infer the right tool from context. This tests genuine decision-making rather than instruction-following.
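As a rough illustration of what each eval asserts, the four cases can be written as "must call" and "must not call" expectations over the tool calls recorded in a run. This is a minimal sketch, not the code in `evals/web-tools.eval.ts`: the `ToolCall` and `ToolExpectation` shapes, the scenario keys, and `checkToolSelection` are hypothetical names introduced here, and the tool names are redeclared locally only to keep the block self-contained.

```typescript
// Hypothetical shapes for illustration; the real eval harness may record calls differently.
interface ToolCall {
  name: string;
}

interface ToolExpectation {
  mustCall: string[]; // tools that should appear at least once
  mustNotCall: string[]; // tools that should never appear
}

// Tool names taken from the PR text, redeclared here as local stand-ins.
const WEB_SEARCH = 'google_web_search';
const WEB_FETCH = 'web_fetch';

type ScenarioName =
  | 'currentInfoNotLocal'
  | 'specificUrlProvided'
  | 'answerAvailableLocally'
  | 'urlInContextLocalTask';

// One entry per row of the table above.
const scenarios: Record<ScenarioName, ToolExpectation> = {
  currentInfoNotLocal: { mustCall: [WEB_SEARCH], mustNotCall: [] },
  specificUrlProvided: { mustCall: [WEB_FETCH], mustNotCall: [WEB_SEARCH] },
  // The local file-read tool is left unnamed to avoid guessing its identifier.
  answerAvailableLocally: { mustCall: [], mustNotCall: [WEB_SEARCH, WEB_FETCH] },
  urlInContextLocalTask: { mustCall: [], mustNotCall: [WEB_FETCH] },
};

// Returns a list of human-readable failures; an empty list means the run passed.
function checkToolSelection(
  calls: ToolCall[],
  expected: ToolExpectation,
): string[] {
  const called = new Set(calls.map((c) => c.name));
  const failures: string[] = [];
  for (const tool of expected.mustCall) {
    if (!called.has(tool)) failures.push(`expected a call to ${tool}`);
  }
  for (const tool of expected.mustNotCall) {
    if (called.has(tool)) failures.push(`unexpected call to ${tool}`);
  }
  return failures;
}
```

The negative expectations matter as much as the positive ones: a run that calls both `web_fetch` and `google_web_search` for an explicit URL would satisfy a positive-only check while still showing poor tool selection.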

Finding during validation: The correct tool name is `google_web_search` (defined as `WEB_SEARCH_TOOL_NAME` in `base-declarations.ts`), whereas the documentation uses `web_search`. All assertions import constants from `@google/gemini-cli-core` rather than using string literals.
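The risk with string literals can be shown in a few lines. This is a sketch under stated assumptions: `wasToolCalled` is a hypothetical helper, and the constant is redeclared locally as a stand-in for the import from `@google/gemini-cli-core` so the block runs on its own.

```typescript
// Local stand-in for the exported constant; the value is the one stated in the PR.
const WEB_SEARCH_TOOL_NAME = 'google_web_search';

// Hypothetical helper: true if the named tool appears in the recorded calls.
function wasToolCalled(
  calledNames: readonly string[],
  toolName: string,
): boolean {
  return calledNames.includes(toolName);
}

// Simulated record of a run in which the agent searched the web.
const calledNames: readonly string[] = [WEB_SEARCH_TOOL_NAME];

// A literal copied from the documentation does not match the real tool name,
// so a positive assertion fails and a negative assertion passes vacuously.
const literalMatch = wasToolCalled(calledNames, 'web_search');

// The constant tracks whatever name the tool is actually registered under.
const constantMatch = wasToolCalled(calledNames, WEB_SEARCH_TOOL_NAME);
```

The vacuous pass is the more dangerous outcome: a "did not search the web" assertion written against `web_search` would succeed even when the agent did call `google_web_search`.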

How to Validate

npm run build
RUN_EVALS=1 npx vitest run evals/web-tools.eval.ts --config evals/vitest.config.ts --reporter=verbose

Related Issues

Fixes #23483
Related to #23331

Adds three evals covering the agent's decision about when to use web tools vs. local file reads:
- google_web_search for current information queries
- web_fetch when given a specific URL
- no web tool calls when the answer exists in local files

All three evals validated against the live Gemini API. Notably, the correct tool name is google_web_search (as defined in WEB_SEARCH_TOOL_NAME in base-declarations.ts), not web_search.

gemini-code-assist · 2026-03-22T01:40:50Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial behavioral evaluations to enhance the agent's ability to correctly select between web tools (google_web_search, web_fetch) and local file access. By adding these tests, the system gains better coverage for scenarios requiring live data, specific URL fetching, or local context, thereby improving the agent's overall tool-use accuracy and reliability.

Highlights

New Behavioral Evals: Added three new behavioral evaluation tests to cover the agent's decision-making regarding web tool usage versus local file reads, specifically for google_web_search and web_fetch.
Tool Name Discrepancy Addressed: Corrected the tool name for web search from web_search to google_web_search in the new evals, addressing a discrepancy found during validation and documenting it.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

PewterZz · 2026-03-22T01:41:16Z

cc @gundermanc — this is part of pre-proposal work for the GSoC behavioral evals project (#23331). Happy to adjust the prompt wording or assertion logic based on your feedback.

gemini-code-assist

Code Review

This pull request adds valuable behavioral evaluations for the web tools (google_web_search and web_fetch), significantly improving test coverage for the agent's tool selection logic. The new tests are well-structured with clear prompts and assertions. The implementation is clean and follows existing patterns.

PewterZz requested a review from a team as a code owner March 22, 2026 01:40

PewterZz mentioned this pull request Mar 22, 2026

feat: add behavioral evals for tool selection decisions #23416

Open

PewterZz changed the title ~~feat(evals): add behavioral evals for web tool selection~~ feat: add behavioral evals for web tool selection Mar 22, 2026

gemini-code-assist bot reviewed Mar 22, 2026

github-actions bot mentioned this pull request Mar 22, 2026

📊 AI CLI tools daily newsletter 2026-03-22 compasify/agents-radar#70

Open

gemini-cli bot added labels Mar 22, 2026: priority/p2 (Important but can be addressed in a future release), area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality), 🔒 maintainer only (⛔ Do not contribute. Internal roadmap item.)

PewterZz mentioned this pull request Mar 22, 2026

feat: add behavioral eval for write_todos task planning #23418

Open

gemini-cli bot added the status/need-issue label (pull requests that need to have an associated issue) Mar 22, 2026

fix: add negative assertions and turn count to web tool selection evals

a1b754d

gemini-cli bot removed the status/need-issue label (pull requests that need to have an associated issue) Mar 22, 2026

This was referenced Mar 23, 2026

📊 AI CLI tools community daily report 2026-03-23 gsscsd/big_model_radar#80

Open

📊 AI CLI tools daily newsletter 2026-03-23 compasify/agents-radar#75

Open

PewterZz added 3 commits March 25, 2026 19:56

fix: use tool name constants from base-declarations instead of string literals

6103eef

test: add hard edge case for URL mentioned in context without fetching

5049b3f

fix: rewrite web-tools evals -- remove telegraphed tool hints from prompts

ef62d12

PewterZz wants to merge 5 commits into google-gemini:main from PewterZz:feat/add-web-tools-eval