The current benchmark evaluation logic does not correctly handle Yes/No questions when the prediction format and the ground-truth labels use different conventions:
Model prediction (pred): "yes" / "no"
Ground truth (GT): "A" / "B"
Because of this mismatch, correct answers are marked as incorrect during evaluation.
Steps to Reproduce
1. Run the evaluation on Ego_Centric_Abosolute_Distance.
2. Observe that correct predictions are not counted as correct.
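One possible fix is to normalize free-form "yes"/"no" predictions onto the ground-truth option letters before comparison. The sketch below is illustrative only; the function names (`normalize_yes_no`, `is_correct`) and the `options` mapping are assumptions, not part of the benchmark's actual code:

```python
def normalize_yes_no(pred: str, options: dict[str, str]) -> str:
    """Map a "yes"/"no" prediction to its option letter (e.g. "A"/"B").

    `options` maps option letters to their answer text,
    e.g. {"A": "yes", "B": "no"}.
    """
    pred_clean = pred.strip().lower().rstrip(".")
    for letter, text in options.items():
        if text.strip().lower() == pred_clean:
            return letter
    return pred.strip()  # no match: fall back to the raw prediction


def is_correct(pred: str, gt: str, options: dict[str, str]) -> bool:
    """Compare a prediction against a GT stored as an option letter."""
    return normalize_yes_no(pred, options) == gt.strip().upper()


options = {"A": "yes", "B": "no"}
assert is_correct("Yes", "A", options)       # "Yes" -> "A", matches GT
assert not is_correct("no", "A", options)    # "no" -> "B", does not match
```

With this normalization in place, a prediction of "yes" is scored as correct whenever the GT letter points at the "yes" option, regardless of which format the model emitted.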