Hi there,
In going through your evaluation code for ZebraLogic and examining my model's outputs, I noticed that the current evaluation does not account for instances where the model's answers are semantically correct but incorrectly formatted. The issue is quite noticeable: roughly 300 out of 1000 samples are scored as "failed" cases purely because of format violations. That count includes both semantically correct and incorrect answers, but the point remains:
Should we (and how could we) add another metric that takes these semantically correct but mis-formatted answers into account?
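For concreteness, here is a minimal sketch of what I have in mind: a two-pass answer extractor that first attempts strict JSON parsing and then falls back to a lenient regex-based extraction, so the grader can report a "relaxed" accuracy alongside the strict one. This is not the repo's actual API; the function names, the assumed JSON `"solution"` key, and the metric names are hypothetical and only for illustration.

```python
import json
import re


def extract_solution(raw_output: str):
    """Try strict JSON parsing first, then fall back to a lenient
    extraction for outputs that are semantically correct but not
    well-formed (e.g. wrapped in markdown fences or extra prose)."""
    # Strict pass: assume the expected format is a JSON object
    # with a "solution" key (hypothetical key name).
    try:
        parsed = json.loads(raw_output)
        if isinstance(parsed, dict) and "solution" in parsed:
            return parsed["solution"], "strict"
    except json.JSONDecodeError:
        pass

    # Lenient pass: pull out the outermost {...} span and retry,
    # which recovers answers buried in surrounding text.
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "solution" in parsed:
                return parsed["solution"], "lenient"
        except json.JSONDecodeError:
            pass
    return None, "failed"


def relaxed_accuracy(outputs, references):
    """Report strict accuracy (format + content) alongside a relaxed
    accuracy that also credits lenient-parsed answers."""
    strict_correct = relaxed_correct = parse_failures = 0
    for raw, ref in zip(outputs, references):
        solution, mode = extract_solution(raw)
        if solution is None:
            parse_failures += 1
            continue
        if solution == ref:
            relaxed_correct += 1
            if mode == "strict":
                strict_correct += 1
    n = len(outputs)
    return {
        "strict_acc": strict_correct / n,
        "relaxed_acc": relaxed_correct / n,
        "parse_failure_rate": parse_failures / n,
    }
```

Reporting both numbers would keep the current metric intact while making the gap attributable to formatting visible.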
Thanks!