
fix(evals): stale tool log snapshot, missing telemetry wait, and wrong import path #23842

Closed
ishaan-arora-1 wants to merge 1 commit into google-gemini:main from ishaan-arora-1:fix/eval-assertion-reliability

Conversation

@ishaan-arora-1
Contributor

Fixes #23841

Three eval files had assertion reliability bugs where tool logs were read at the wrong time or imported from the wrong source.

tracker.eval.ts: readToolLogs() was called once, after waitForToolCall(TRACKER_CREATE_TASK_TOOL_NAME) (line 44). After a second waitForToolCall(TRACKER_UPDATE_TASK_TOOL_NAME) (line 55), the code searched that stale snapshot for the update call (line 63). The update call was not in the snapshot, so updateCall would be undefined and JSON.parse(updateCall!.toolRequest.args) would throw a misleading TypeError instead of a clear assertion failure. Fixed by re-reading tool logs after the second wait.
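To illustrate the stale-snapshot problem in isolation, here is a minimal sketch. FakeRig is a hypothetical stand-in written for this example, not the real eval rig; it only assumes that readToolLogs() returns a copy of the log as it exists at call time, which is the behaviour described above.

```typescript
// Hypothetical stand-in for the eval rig; only the snapshot behaviour matters.
interface ToolLogEntry {
  toolRequest: { name: string; args: string };
}

class FakeRig {
  private entries: ToolLogEntry[] = [];

  // Simulates the agent completing a tool call.
  recordToolCall(name: string, args: object): void {
    this.entries.push({ toolRequest: { name, args: JSON.stringify(args) } });
  }

  // Returns a copy of the log as it exists right now (a snapshot).
  readToolLogs(): ToolLogEntry[] {
    return [...this.entries];
  }
}

const rig = new FakeRig();

rig.recordToolCall('tracker_create_task', { title: 'Write docs' });
const toolLogs = rig.readToolLogs(); // snapshot taken after the first call

rig.recordToolCall('tracker_update_task', { id: 1, status: 'done' });

// Searching the old snapshot misses the second call, so this is undefined.
const staleUpdateCall = toolLogs.find(
  (log) => log.toolRequest.name === 'tracker_update_task',
);

// Re-reading after the second call finds it.
const updatedToolLogs = rig.readToolLogs();
const updateCall = updatedToolLogs.find(
  (log) => log.toolRequest.name === 'tracker_update_task',
);
```

In the sketch, dereferencing staleUpdateCall!.toolRequest.args would throw a TypeError, which mirrors the misleading failure described above.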

edit-locations-eval.eval.ts: readToolLogs() was called without a preceding waitForTelemetryReady(), so tool logs could be incomplete when assertions ran. Also removed a leftover console.log('DEBUG: targetFiles', targetFiles) statement.
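A rough sketch of why the wait matters, under the assumption that tool log entries become readable only after an asynchronous flush. FakeTelemetryRig, readBeforeAndAfterWait, and the tool name 'some_edit_tool' are all hypothetical and exist only for this illustration; the real rig's internals may differ.

```typescript
// Hypothetical rig whose log entries become visible only after an async flush.
class FakeTelemetryRig {
  private flushed: string[] = [];
  private pending: string[] = [];

  // The tool call has happened, but it is not yet visible to readToolLogs().
  recordToolCall(name: string): void {
    this.pending.push(name);
  }

  // Resolves once pending entries have been moved into the readable log.
  async waitForTelemetryReady(): Promise<void> {
    await Promise.resolve(); // yield once to simulate asynchronous I/O
    this.flushed.push(...this.pending);
    this.pending = [];
  }

  readToolLogs(): string[] {
    return [...this.flushed];
  }
}

// Reads the log before and after the wait and reports how many entries
// were visible each time.
async function readBeforeAndAfterWait(
  rig: FakeTelemetryRig,
): Promise<{ before: number; after: number }> {
  const before = rig.readToolLogs().length; // read too early
  await rig.waitForTelemetryReady();
  const after = rig.readToolLogs().length; // read once telemetry is ready
  return { before, after };
}
```

With one recorded call, the early read sees zero entries and the read after the wait sees one, which is the kind of gap that could make assertions flaky.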

save_memory.eval.ts: Imported assertModelHasOutput and checkModelOutputContent from ../integration-tests/test-helper.js instead of ./test-helper.js (which re-exports the same functions via export * from '@google/gemini-cli-test-utils'). Every other eval file uses the local import. This is the same fix applied to hierarchical_memory.eval.ts in #23790.

@ishaan-arora-1 ishaan-arora-1 requested a review from a team as a code owner March 26, 2026 00:21
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the reliability and consistency of several evaluation files by resolving issues related to stale tool log snapshots, ensuring proper telemetry readiness before assertions, and standardizing import paths for shared test utilities. These changes aim to prevent misleading test failures and improve the overall robustness of the evaluation suite.

Highlights

  • tracker.eval.ts reliability: Fixed an issue in tracker.eval.ts where tool logs were read once and then used as a stale snapshot, leading to misleading assertion errors when a second tool call was made. The fix involves re-reading the tool logs to ensure up-to-date data.
  • edit-locations-eval.eval.ts telemetry and debugging: Addressed a potential race condition in edit-locations-eval.eval.ts by ensuring the test waits for telemetry to be ready before attempting to read tool logs. Additionally, a leftover debug console.log statement was removed.
  • save_memory.eval.ts import path correction: Corrected an incorrect import path in save_memory.eval.ts for test utility functions (assertModelHasOutput and checkModelOutputContent), aligning it with other evaluation files by using the local test-helper.js.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request improves the reliability and correctness of evaluation tests. Specifically, it adds a waitForTelemetryReady() call in edit-locations-eval.eval.ts to ensure telemetry data is fully processed before reading tool logs, removes a debug console.log statement, corrects import paths in save_memory.eval.ts to use a local test-helper.js for consistency, and re-reads toolLogs in tracker.eval.ts to prevent issues with stale log data. I have no feedback to provide as all review comments were filtered out.

  ).toBe(true);

- const updateCall = toolLogs.find(
+ const updatedToolLogs = rig.readToolLogs();

critical

Re-reading toolLogs into updatedToolLogs before finding the tracker_update_task call is a critical fix. Previously, the code was using a stale snapshot of toolLogs, which could lead to updateCall being undefined and subsequent assertion failures that masked the true issue. This ensures the test uses the most current state of tool activity.
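One way to make a missing call fail loudly is sketched below. requireToolCall is a hypothetical helper written for this illustration and is not part of this PR or, as far as the discussion shows, the test utilities; it assumes log entries have the toolRequest.name and toolRequest.args shape used in the eval.

```typescript
interface LoggedToolCall {
  toolRequest: { name: string; args: string };
}

// Hypothetical helper: look up a tool call and fail with a descriptive
// message instead of letting a later property access throw a TypeError.
function requireToolCall(
  logs: LoggedToolCall[],
  name: string,
): LoggedToolCall {
  const call = logs.find((log) => log.toolRequest.name === name);
  if (call === undefined) {
    const seen =
      logs.map((log) => log.toolRequest.name).join(', ') || '(none)';
    throw new Error(`Expected a "${name}" tool call, but only saw: ${seen}`);
  }
  return call;
}
```

With a helper along these lines, searching a stale snapshot would report which tool calls were actually present rather than failing inside JSON.parse.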

  prompt: 'Fix the bug in src/math.ts. Do not run the code.',
  timeout: 180000,
  assert: async (rig) => {
+ await rig.waitForTelemetryReady();

high

The addition of await rig.waitForTelemetryReady(); is a crucial fix. Without this, readToolLogs() could be called before telemetry data is fully processed, leading to incomplete logs and flaky test results. This ensures the test operates on accurate and up-to-date telemetry information.

  }
  });

- console.log('DEBUG: targetFiles', targetFiles);

high

Removing the console.log statement is a good practice. Debugging logs should not be committed to the main branch as they can clutter test output and potentially expose internal state during automated runs, impacting maintainability and clarity.

Comment on lines 8 to +12
  import {
  evalTest,
  assertModelHasOutput,
  checkModelOutputContent,
- } from '../integration-tests/test-helper.js';
+ } from './test-helper.js';

high

Correcting the import path from ../integration-tests/test-helper.js to ./test-helper.js ensures consistency and correctness within the evals directory. Relying on the local re-export is more robust and aligns with the import patterns used by other evaluation files, improving maintainability.

@gemini-cli gemini-cli bot added the area/platform Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt label Mar 26, 2026
…g import path

tracker.eval.ts: readToolLogs() was called once after waiting for
tracker_create_task, but the same snapshot was reused to find
tracker_update_task after a second waitForToolCall(). The update call
was not in that snapshot. Re-read logs after the second wait.

edit-locations-eval.eval.ts: readToolLogs() was called without
waitForTelemetryReady(), risking incomplete logs. Also removed a
leftover console.log('DEBUG: ...') statement.

save_memory.eval.ts: Consolidated imports to use ./test-helper.js
(which re-exports from @google/gemini-cli-test-utils) instead of
reaching into ../integration-tests/test-helper.js.

Fixes google-gemini#23841
@ishaan-arora-1 ishaan-arora-1 force-pushed the fix/eval-assertion-reliability branch from 1fa4a25 to 68b69e2 Compare March 29, 2026 11:14
@gemini-cli
Contributor

gemini-cli bot commented Apr 9, 2026

Hi there! Thank you for your interest in contributing to Gemini CLI.

To ensure we maintain high code quality and focus on our prioritized roadmap, we have updated our contribution policy (see Discussion #17383).

We only guarantee review and consideration of pull requests for issues that are explicitly labeled as 'help wanted'. All other community pull requests are subject to closure after 14 days if they do not align with our current focus areas. For this reason, we strongly recommend that contributors only submit pull requests against issues explicitly labeled as 'help wanted'.

This pull request is being closed as it has been open for 14 days without a 'help wanted' designation. We encourage you to find and contribute to existing 'help wanted' issues in our backlog! Thank you for your understanding and for being part of our community!

@gemini-cli gemini-cli bot closed this Apr 9, 2026

Labels

area/platform Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(evals): stale tool log snapshots, missing telemetry wait, and wrong import paths in eval assertions

1 participant