Add Gemini 2.5 support and robust answer key handling #8

Open
codelion wants to merge 3 commits into snap-research:main from codelion:main

codelion commented Jul 10, 2025

Updated Gemini integration to use the new google-genai SDK and support Gemini 2.5 models. Refactored answer extraction logic across evaluation, Claude, GPT, Gemini, and HF utils to handle both 'answer' and 'adversarial_answer' keys, improving robustness to missing fields. Cleaned up requirements.txt, added .gitignore, and removed compiled Python cache files.

Will fix #6, #5, #3, and #2.
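
A minimal sketch of the answer-key fallback described above, assuming each QA item is a plain dict. The helper name `get_answer` and the `None` default are illustrative assumptions, not code taken from this PR's diff.

```python
from typing import Any, Mapping, Optional


def get_answer(qa: Mapping[str, Any], default: Optional[Any] = None) -> Any:
    """Return the reference answer for one QA item (hypothetical helper).

    Prefers the 'answer' key and falls back to 'adversarial_answer'.
    Returns `default` when neither key is present, so a missing field
    no longer raises KeyError.
    """
    for key in ("answer", "adversarial_answer"):
        if key in qa:
            return qa[key]
    return default


# Usage: all three item shapes are handled without raising.
regular = {"question": "q1", "answer": "Paris"}
adversarial = {"question": "q2", "adversarial_answer": "not mentioned"}
incomplete = {"question": "q3"}

print(get_answer(regular), get_answer(adversarial), get_answer(incomplete))
```

Centralizing the lookup in one helper would let the evaluation, Claude, GPT, Gemini, and HF utilities share the same behavior for missing fields.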

codelion added 2 commits July 10, 2025 09:14
Updated Gemini integration to use the new google-genai SDK and support Gemini 2.5 models. Refactored answer extraction logic across evaluation, Claude, GPT, Gemini, and HF utils to handle both 'answer' and 'adversarial_answer' keys, improving robustness to missing fields. Cleaned up requirements.txt, added .gitignore, and removed compiled Python cache files.
Introduced a CATEGORY_MAPPING dictionary in evaluate_qa.py, evaluation.py, and evaluation_stats.py to map category numbers to descriptive names. Enhanced output and result dictionaries to include category names and detailed statistics, improving interpretability of QA evaluation results and reporting.
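
For illustration, a mapping of this kind might look like the sketch below. The number-to-name pairs are taken from the results tables posted in this conversation; the exact dictionary in the PR may differ, and `category_name` is a hypothetical helper.

```python
# Category numbers and names as they appear in the results tables in this thread.
CATEGORY_MAPPING = {
    1: "Multi-hop",
    2: "Temporal",
    3: "Open-domain",
    4: "Single-hop",
    5: "Adversarial",
}


def category_name(category: int) -> str:
    """Look up a descriptive name, keeping unknown numbers visible (hypothetical helper)."""
    return CATEGORY_MAPPING.get(category, f"Unknown ({category})")


for number in sorted(CATEGORY_MAPPING):
    print(number, category_name(number))
```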
codelion (author) commented

This also fixes #6

Added checks to convert non-string answer values to strings in the process_output function. This prevents errors when processing outputs that may not be strings.
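
A sketch of the kind of check described here, assuming answers can arrive as integers, floats, or `None` rather than strings. The name `as_text` is hypothetical; the actual `process_output` function is not shown in this conversation.

```python
def as_text(value: object) -> str:
    """Coerce an answer value to a string before string processing (hypothetical helper)."""
    if isinstance(value, str):
        return value
    if value is None:
        return ""
    return str(value)


# A numeric answer such as a year would otherwise fail on string methods like .lower().
print(as_text(2022).lower())
print(as_text("Seven Years").lower())
```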
codelion (author) commented

Gemini-2.5-Flash

Overall accuracy: 0.728

Category | Name        | Count | Correct | Accuracy
       4 | Single-hop  |   841 |   619.5 |    0.737
       1 | Multi-hop   |   282 |   161.1 |    0.571
       2 | Temporal    |   321 |   208.5 |    0.649
       3 | Open-domain |    96 |    32.6 |    0.340
       5 | Adversarial |   446 |   424.0 |    0.951

Gemini-2.5-Flash-Lite

Overall accuracy: 0.490

Category | Name        | Count | Correct | Accuracy
       4 | Single-hop  |   841 |   584.7 |    0.695
       1 | Multi-hop   |   282 |   111.2 |    0.394
       2 | Temporal    |   321 |   111.7 |    0.348
       3 | Open-domain |    96 |    18.0 |    0.187
       5 | Adversarial |   446 |   148.0 |    0.332

These are the results with the Gemini 2.5 models.
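
As a quick consistency check, the overall accuracy in each table appears to be the count-weighted total, that is, the sum of the Correct column divided by the sum of the Count column. The sketch below recomputes it from the rows above; the function and variable names are illustrative only.

```python
# (count, correct) per category, copied from the two tables above.
flash = {
    "Single-hop": (841, 619.5),
    "Multi-hop": (282, 161.1),
    "Temporal": (321, 208.5),
    "Open-domain": (96, 32.6),
    "Adversarial": (446, 424.0),
}
flash_lite = {
    "Single-hop": (841, 584.7),
    "Multi-hop": (282, 111.2),
    "Temporal": (321, 111.7),
    "Open-domain": (96, 18.0),
    "Adversarial": (446, 148.0),
}


def overall_accuracy(rows: dict) -> float:
    """Total correct score divided by total question count."""
    total = sum(count for count, _ in rows.values())
    correct = sum(score for _, score in rows.values())
    return correct / total


print(f"Gemini-2.5-Flash:      {overall_accuracy(flash):.3f}")
print(f"Gemini-2.5-Flash-Lite: {overall_accuracy(flash_lite):.3f}")
```

Under this assumption the recomputed values agree with the reported 0.728 and 0.490 to three decimal places. The unweighted mean of the per-category accuracies would give different numbers, because the categories differ widely in size.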

alphanlp commented Jul 25, 2025

Model: Gemini-2.5-Flash. These are my eval scores:

dataset | metric      | score
locomo  | Overall     | 62
locomo  | Single Hop  | 65.8
locomo  | Multi Hop   | 39.7
locomo  | Temporal    | 54
locomo  | Open Domain | 22
locomo  | Adversarial | 83.4

Development

Successfully merging this pull request may close these issues.

The mapping between category number and category type
