Add Gemini 2.5 support and robust answer key handling by codelion · Pull Request #8 · snap-research/locomo

codelion · 2025-07-10T01:14:35Z

Updated Gemini integration to use the new google-genai SDK and support Gemini 2.5 models. Refactored answer extraction logic across evaluation, Claude, GPT, Gemini, and HF utils to handle both 'answer' and 'adversarial_answer' keys, improving robustness to missing fields. Cleaned up requirements.txt, added .gitignore, and removed compiled Python cache files.
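The key fallback described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `get_reference_answer` and the `default` parameter are hypothetical, and the only assumption carried over from the description is that a QA item may hold its reference under either `answer` or `adversarial_answer`.

```python
from typing import Any, Mapping


def get_reference_answer(qa: Mapping[str, Any], default: Any = "") -> Any:
    """Return the reference answer of one QA item (hypothetical helper).

    Prefers 'answer', falls back to 'adversarial_answer', and returns
    `default` when neither key is present or both are None.
    """
    for key in ("answer", "adversarial_answer"):
        value = qa.get(key)
        if value is not None:
            return value
    return default


# A regular item, an adversarial item, and an item with no reference at all.
print(get_reference_answer({"question": "q1", "answer": "7 May 2023"}))
print(get_reference_answer({"question": "q2", "adversarial_answer": "painting"}))
print(repr(get_reference_answer({"question": "q3"})))
```

Using a single accessor instead of indexing `qa['answer']` directly avoids a `KeyError` on items that only carry `adversarial_answer`.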

Will fix #6, #5, #3, and #2.

Introduced a CATEGORY_MAPPING dictionary in evaluate_qa.py, evaluation.py, and evaluation_stats.py to map category numbers to descriptive names. Enhanced output and result dictionaries to include category names and detailed statistics, improving interpretability of QA evaluation results and reporting.
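A rough sketch of how such a mapping could feed per-category statistics is shown below. The category numbers and names are taken from the result tables posted in this PR; the function `summarize_by_category`, the input shape (pairs of category number and score), and the `"Unknown"` fallback are assumptions for illustration, and the actual dictionary in the repository may differ.

```python
from collections import defaultdict

# Category numbers and names as they appear in the result tables of this PR.
CATEGORY_MAPPING = {
    1: "Multi-hop",
    2: "Temporal",
    3: "Open-domain",
    4: "Single-hop",
    5: "Adversarial",
}


def summarize_by_category(results):
    """Aggregate (category, score) pairs into per-category statistics.

    Hypothetical helper: `score` is a per-question value between 0 and 1.
    """
    totals = defaultdict(lambda: {"count": 0, "correct": 0.0})
    for category, score in results:
        totals[category]["count"] += 1
        totals[category]["correct"] += score

    summary = {}
    for category, t in sorted(totals.items()):
        summary[category] = {
            "name": CATEGORY_MAPPING.get(category, "Unknown"),
            "count": t["count"],
            "correct": t["correct"],
            "accuracy": t["correct"] / t["count"],
        }
    return summary


# Three toy results: two single-hop questions and one adversarial question.
for category, stats in summarize_by_category([(4, 1.0), (4, 0.5), (5, 1.0)]).items():
    print(category, stats)
```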

codelion · 2025-07-10T01:20:27Z

This also fixes #6

Added checks to convert non-string answer values to strings in the process_output function. This prevents errors when processing outputs that may not be strings.
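The kind of check described here might look like the sketch below. The helper name `ensure_str` is hypothetical and the real `process_output` signature is not shown in this thread, so this only illustrates the coercion step, assuming answers can arrive as numbers or `None`.

```python
def ensure_str(value):
    """Coerce an answer value to a string (hypothetical helper).

    Strings pass through unchanged, None becomes an empty string, and
    anything else (int, float, list, ...) is converted with str().
    """
    if isinstance(value, str):
        return value
    if value is None:
        return ""
    return str(value)


# Without the coercion, calling .strip() or .lower() on an int answer
# such as 2022 would raise AttributeError.
print(ensure_str(2022).strip().lower())
print(ensure_str(" Single-Hop ").strip().lower())
```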

codelion · 2025-07-19T03:02:49Z

Gemini-2.5-Flash

Overall accuracy: 0.728

Category | Name        | Count | Correct | Accuracy
       4 | Single-hop  |   841 |   619.5 |    0.737
       1 | Multi-hop   |   282 |   161.1 |    0.571
       2 | Temporal    |   321 |   208.5 |    0.649
       3 | Open-domain |    96 |    32.6 |    0.340
       5 | Adversarial |   446 |   424.0 |    0.951

Gemini-2.5-Flash-Lite

Overall accuracy: 0.490

Category | Name        | Count | Correct | Accuracy
       4 | Single-hop  |   841 |   584.7 |    0.695
       1 | Multi-hop   |   282 |   111.2 |    0.394
       2 | Temporal    |   321 |   111.7 |    0.348
       3 | Open-domain |    96 |    18.0 |    0.187
       5 | Adversarial |   446 |   148.0 |    0.332

These are the results with the Gemini 2.5 models.
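As a quick sanity check, the overall accuracy can be recomputed from the per-category rows as the sum of the Correct column divided by the sum of the Count column. The snippet below copies the numbers from the two tables above; the function name `overall_accuracy` is only for illustration, and the weighting (every question counts equally) is an assumption that happens to reproduce the reported figures.

```python
# (count, correct) per category, copied from the tables above.
flash = {
    "Single-hop": (841, 619.5),
    "Multi-hop": (282, 161.1),
    "Temporal": (321, 208.5),
    "Open-domain": (96, 32.6),
    "Adversarial": (446, 424.0),
}
flash_lite = {
    "Single-hop": (841, 584.7),
    "Multi-hop": (282, 111.2),
    "Temporal": (321, 111.7),
    "Open-domain": (96, 18.0),
    "Adversarial": (446, 148.0),
}


def overall_accuracy(table):
    """Question-weighted accuracy: total correct divided by total count."""
    total_count = sum(count for count, _ in table.values())
    total_correct = sum(correct for _, correct in table.values())
    return total_correct / total_count


print(f"Gemini-2.5-Flash:      {overall_accuracy(flash):.3f}")
print(f"Gemini-2.5-Flash-Lite: {overall_accuracy(flash_lite):.3f}")
```

Both tables cover 1986 questions in total, and the recomputed values round to the reported 0.728 and 0.490.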

alphanlp · 2025-07-25T12:02:33Z

dataset	metric	score
locomo	Overall	62
locomo	Single Hop	65.8
locomo	Multi Hop	39.7
locomo	Temporal	54
locomo	Open Domain	22
locomo	Adversarial	83.4

alphanlp · 2025-07-25T12:03:58Z

Model: Gemini-2.5-Flash. These are my evaluation scores:

dataset	metric	score
locomo	Overall	62
locomo	Single Hop	65.8
locomo	Multi Hop	39.7
locomo	Temporal	54
locomo	Open Domain	22
locomo	Adversarial	83.4

codelion added 2 commits July 10, 2025 09:14

Ensure answer values are strings in process_output

8bcbd89

WujiangXu mentioned this pull request Sep 30, 2025

Reproduce results on Locomo WujiangXu/A-mem#11

Closed

codelion wants to merge 3 commits into snap-research:main from codelion:main
