fix(evals): stale tool log snapshot, missing telemetry wait, and wrong import path#23842
fix(evals): stale tool log snapshot, missing telemetry wait, and wrong import path#23842ishaan-arora-1 wants to merge 1 commit intogoogle-gemini:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the reliability and consistency of several evaluation files by resolving issues related to stale tool log snapshots, ensuring proper telemetry readiness before assertions, and standardizing import paths for shared test utilities. These changes aim to prevent misleading test failures and improve the overall robustness of the evaluation suite. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request improves the reliability and correctness of evaluation tests. Specifically, it adds a waitForTelemetryReady() call in edit-locations-eval.eval.ts to ensure telemetry data is fully processed before reading tool logs, removes a debug console.log statement, corrects import paths in save_memory.eval.ts to use a local test-helper.js for consistency, and re-reads toolLogs in tracker.eval.ts to prevent issues with stale log data. I have no feedback to provide as all review comments were filtered out.
| ).toBe(true); | ||
|
|
||
| const updateCall = toolLogs.find( | ||
| const updatedToolLogs = rig.readToolLogs(); |
There was a problem hiding this comment.
Re-reading toolLogs into updatedToolLogs before finding the tracker_update_task call is a critical fix. Previously, the code was using a stale snapshot of toolLogs, which could lead to updateCall being undefined and subsequent assertion failures that masked the true issue. This ensures the test uses the most current state of tool activity.
| prompt: 'Fix the bug in src/math.ts. Do not run the code.', | ||
| timeout: 180000, | ||
| assert: async (rig) => { | ||
| await rig.waitForTelemetryReady(); |
There was a problem hiding this comment.
| } | ||
| }); | ||
|
|
||
| console.log('DEBUG: targetFiles', targetFiles); |
| import { | ||
| evalTest, | ||
| assertModelHasOutput, | ||
| checkModelOutputContent, | ||
| } from '../integration-tests/test-helper.js'; | ||
| } from './test-helper.js'; |
There was a problem hiding this comment.
…g import path
tracker.eval.ts: readToolLogs() was called once after waiting for
tracker_create_task, but the same snapshot was reused to find
tracker_update_task after a second waitForToolCall(). The update call
was not in that snapshot. Re-read logs after the second wait.
edit-locations-eval.eval.ts: readToolLogs() was called without
waitForTelemetryReady(), risking incomplete logs. Also removed a
leftover console.log('DEBUG: ...') statement.
save_memory.eval.ts: Consolidated imports to use ./test-helper.js
(which re-exports from @google/gemini-cli-test-utils) instead of
reaching into ../integration-tests/test-helper.js.
Fixes google-gemini#23841
1fa4a25 to
68b69e2
Compare
|
Hi there! Thank you for your interest in contributing to Gemini CLI. To ensure we maintain high code quality and focus on our prioritized roadmap, we have updated our contribution policy (see Discussion #17383). We only guarantee review and consideration of pull requests for issues that are explicitly labeled as 'help wanted'. All other community pull requests are subject to closure after 14 days if they do not align with our current focus areas. For this reason, we strongly recommend that contributors only submit pull requests against issues explicitly labeled as 'help-wanted'. This pull request is being closed as it has been open for 14 days without a 'help wanted' designation. We encourage you to find and contribute to existing 'help wanted' issues in our backlog! Thank you for your understanding and for being part of our community! |
Fixes #23841
Three eval files had assertion reliability bugs where tool logs were read at the wrong time or imported from the wrong source.
tracker.eval.ts—readToolLogs()was captured once afterwaitForToolCall(TRACKER_CREATE_TASK_TOOL_NAME)(line 44). After a secondwaitForToolCall(TRACKER_UPDATE_TASK_TOOL_NAME)(line 55), the code searched the stale snapshot for the update call (line 63). The update call was not in that snapshot, soupdateCallwould beundefinedandJSON.parse(updateCall!.toolRequest.args)would throw a misleading error instead of a clear assertion failure. Fixed by re-reading tool logs after the second wait.edit-locations-eval.eval.ts—readToolLogs()was called without a precedingwaitForTelemetryReady(), so tool logs could be incomplete when assertions ran. Also removed a leftoverconsole.log('DEBUG: targetFiles', targetFiles)statement.save_memory.eval.ts— ImportedassertModelHasOutputandcheckModelOutputContentfrom../integration-tests/test-helper.jsinstead of./test-helper.js(which re-exports the same functions viaexport * from '@google/gemini-cli-test-utils'). Every other eval file uses the local import. This is the same fix applied tohierarchical_memory.eval.tsin #23790.