The current benchmark evaluation logic does not correctly handle Yes/No questions when the prediction format and the ground-truth labels use different conventions:
Model prediction (pred): "yes" / "no"
Ground truth (GT): "A" / "B"
Because of this mismatch, correct answers are marked as incorrect during evaluation.
Steps to Reproduce
1. Run the evaluation on Ego_Centric_Abosolute_Distance.
2. Observe that correct predictions are not counted as correct.
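One possible fix is to normalize free-form "yes"/"no" predictions onto the ground-truth option letters before comparison. The sketch below is illustrative only; the function names (`normalize_yes_no`, `is_correct`) and the `options` mapping are assumptions, not part of the benchmark's actual code:

```python
def normalize_yes_no(pred: str, options: dict[str, str]) -> str:
    """Map a "yes"/"no" prediction to its option letter (e.g. "A"/"B").

    `options` maps option letters to their answer text,
    e.g. {"A": "yes", "B": "no"}.
    """
    pred_clean = pred.strip().lower().rstrip(".")
    for letter, text in options.items():
        if text.strip().lower() == pred_clean:
            return letter
    return pred.strip()  # no match: fall back to the raw prediction


def is_correct(pred: str, gt: str, options: dict[str, str]) -> bool:
    """Compare a prediction against a GT stored as an option letter."""
    return normalize_yes_no(pred, options) == gt.strip().upper()


options = {"A": "yes", "B": "no"}
assert is_correct("Yes", "A", options)       # "Yes" -> "A", matches GT
assert not is_correct("no", "A", options)    # "no" -> "B", does not match
```

With this normalization in place, a prediction of "yes" is scored as correct whenever the GT letter points at the "yes" option, regardless of which format the model emitted.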