Hi there,
In going through your evaluation code for ZebraLogic and examining my model's outputs, I noticed that the current evaluation does not account for instances where the model's answers are semantically correct but incorrectly formatted. The issue is quite noticeable: roughly 300 out of 1000 samples are scored as "failed" cases purely because of format violations. That count includes both semantically correct and incorrect answers, but the point remains:
Should we (and how could we) add another metric that takes these semantically correct but mis-formatted answers into account?
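For concreteness, here is a minimal sketch of what I have in mind: a two-pass answer extractor that first attempts strict JSON parsing and then falls back to a lenient regex-based extraction, so the grader can report a "relaxed" accuracy alongside the strict one. This is not the repo's actual API; the function names, the assumed JSON `"solution"` key, and the metric names are hypothetical and only for illustration.

```python
import json
import re


def extract_solution(raw_output: str):
    """Try strict JSON parsing first, then fall back to a lenient
    extraction for outputs that are semantically correct but not
    well-formed (e.g. wrapped in markdown fences or extra prose)."""
    # Strict pass: assume the expected format is a JSON object
    # with a "solution" key (hypothetical key name).
    try:
        parsed = json.loads(raw_output)
        if isinstance(parsed, dict) and "solution" in parsed:
            return parsed["solution"], "strict"
    except json.JSONDecodeError:
        pass

    # Lenient pass: pull out the outermost {...} span and retry,
    # which recovers answers buried in surrounding text.
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "solution" in parsed:
                return parsed["solution"], "lenient"
        except json.JSONDecodeError:
            pass
    return None, "failed"


def relaxed_accuracy(outputs, references):
    """Report strict accuracy (format + content) alongside a relaxed
    accuracy that also credits lenient-parsed answers."""
    strict_correct = relaxed_correct = parse_failures = 0
    for raw, ref in zip(outputs, references):
        solution, mode = extract_solution(raw)
        if solution is None:
            parse_failures += 1
            continue
        if solution == ref:
            relaxed_correct += 1
            if mode == "strict":
                strict_correct += 1
    n = len(outputs)
    return {
        "strict_acc": strict_correct / n,
        "relaxed_acc": relaxed_correct / n,
        "parse_failure_rate": parse_failures / n,
    }
```

Reporting both numbers would keep the current metric intact while making the gap attributable to formatting visible.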
Thanks!