feat: add behavioral evals for web tool selection #23415
PewterZz wants to merge 5 commits into google-gemini:main
Adds three evals covering the agent's decision about when to use web tools vs. local file reads:

- `google_web_search` for current information queries
- `web_fetch` when given a specific URL
- no web tool calls when the answer exists in local files

All three evals were validated against the live Gemini API. Notably, the correct tool name is `google_web_search` (as defined in `WEB_SEARCH_TOOL_NAME` in `base-declarations.ts`), not `web_search`.
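The three cases above can be read as checks over the names of the tools the agent called during a run. A minimal sketch, assuming a hypothetical `ToolCall` record and local stand-in constants; the real evals would import `WEB_SEARCH_TOOL_NAME` from the core package, and the name `WEB_FETCH_TOOL_NAME` and the `read_file` tool name used in the usage note are assumptions made for illustration:

```typescript
// Stand-ins for constants the real evals would import.
// Values are taken from the PR text; WEB_FETCH_TOOL_NAME is an assumed identifier.
const WEB_SEARCH_TOOL_NAME = 'google_web_search';
const WEB_FETCH_TOOL_NAME = 'web_fetch';

// Hypothetical shape of one recorded tool call.
interface ToolCall {
  name: string;
}

// True if any recorded call used the given tool name.
const called = (calls: ToolCall[], name: string): boolean =>
  calls.some((c) => c.name === name);

// Case 1: a current-information query should trigger a web search.
function usedWebSearch(calls: ToolCall[]): boolean {
  return called(calls, WEB_SEARCH_TOOL_NAME);
}

// Case 2: a prompt containing a specific URL should trigger a fetch.
function usedWebFetch(calls: ToolCall[]): boolean {
  return called(calls, WEB_FETCH_TOOL_NAME);
}

// Case 3: when the answer is in local files, no web tool should be called.
function usedNoWebTools(calls: ToolCall[]): boolean {
  return (
    !called(calls, WEB_SEARCH_TOOL_NAME) && !called(calls, WEB_FETCH_TOOL_NAME)
  );
}
```

Under these assumptions, a run that only calls `read_file` satisfies `usedNoWebTools`, while a check written against the literal `'web_search'` would never match a call recorded as `google_web_search`.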
Summary of Changes (Gemini Code Assist)

This pull request introduces crucial behavioral evaluations to enhance the agent's ability to correctly select between web tools ([...]
cc @gundermanc, this is part of pre-proposal work for the GSoC behavioral evals project (#23331). Happy to adjust the prompt wording or assertion logic based on your feedback.
Code Review
This pull request adds valuable behavioral evaluations for the web tools (google_web_search and web_fetch), significantly improving test coverage for the agent's tool selection logic. The new tests are well-structured with clear prompts and assertions. The implementation is clean and follows existing patterns.
Summary
Adds four behavioral evals testing the agent's ability to correctly choose between web tools based on the nature of the request -- without being told which tool to use.
Details
- `USUALLY_PASSES`: `google_web_search` for version info not available locally
- `USUALLY_PASSES`: `web_fetch` for an explicit URL, not `google_web_search`
- `USUALLY_PASSES`: `package.json` rather than searching the web
- `USUALLY_PASSES`: …

Design note: Prompts do not name the expected tool. Each eval creates a situation where the agent must infer the right tool from context. This tests genuine decision-making rather than instruction-following.
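One way to enforce the design note mechanically is to check that no prompt mentions a tool name. A minimal sketch, assuming a hypothetical `EvalCase` shape; the prompts and the helper `toolNamesInPrompt` are invented for illustration and are not taken from this PR:

```typescript
// Tool names taken from the PR text; real evals would use imported constants.
const WEB_TOOL_NAMES = ['google_web_search', 'web_fetch'] as const;

// Hypothetical eval case shape, for illustration only.
interface EvalCase {
  name: string;
  prompt: string;
}

// Returns the tool names that a prompt mentions, ignoring case.
function toolNamesInPrompt(prompt: string): string[] {
  const lower = prompt.toLowerCase();
  return WEB_TOOL_NAMES.filter((tool) => lower.includes(tool));
}

// Invented example prompts: the first leaves the tool choice to the agent,
// the second leaks the expected tool and would only test instruction-following.
const cases: EvalCase[] = [
  {
    name: 'explicit URL',
    prompt: 'Summarize the page at https://example.com/changelog.',
  },
  {
    name: 'leaky prompt',
    prompt: 'Use web_fetch to read https://example.com/changelog.',
  },
];

// Cases whose prompts name a tool and therefore violate the design note.
const leakyCases: EvalCase[] = cases.filter(
  (c) => toolNamesInPrompt(c.prompt).length > 0,
);
```

A guard like this could run alongside the evals so that a later prompt edit does not accidentally turn a decision-making test into an instruction-following one.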
Finding during validation: The correct tool name is `google_web_search` (defined as `WEB_SEARCH_TOOL_NAME` in `base-declarations.ts`). Documentation uses `web_search`. All assertions import constants from `@google/gemini-cli-core` rather than using string literals.

How to Validate
Related Issues
Fixes #23483
Related to #23331