New Evaluation? #29

@Peter0452

Description

Hi there,

While going through your evaluation code for ZebraLogic and examining my model's outputs, I noticed that the current evaluation fails to account for instances where the model's answer is semantically correct but incorrectly formatted or syntactically malformed. The issue is quite noticeable: around 300 out of 1000 samples are graded as "failed" cases purely because of their format. That count includes both semantically correct and incorrect answers, but the point remains:
Should we (and if so, how could we) add another metric that accounts for these format-failed answers? A rough sketch of what I have in mind is below.
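
For illustration, here is a minimal sketch of the kind of lenient fallback I mean. I'm assuming the grader expects a JSON object and compares it to a gold solution; the `lenient_parse` / `score` names and the `{"output": ..., "gold": ...}` sample shape are placeholders I made up, not your actual implementation:

```python
import json
import re

def lenient_parse(raw_output: str):
    """Try strict JSON first, then fall back to progressively looser
    extraction before declaring the sample a format failure."""
    # 1. Strict parse: the whole output is valid JSON.
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        pass

    # 2. JSON wrapped in a markdown code fence.
    fence = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", raw_output, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass

    # 3. Last resort: the outermost {...} span anywhere in the text.
    brace = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if brace:
        try:
            return json.loads(brace.group(0))
        except json.JSONDecodeError:
            pass

    return None  # genuinely unparseable -> count as a format failure


def score(samples):
    """Report accuracy two ways: strict (format failures count as wrong)
    and lenient (accuracy over the parseable subset), plus the parse rate."""
    parsed = [(lenient_parse(s["output"]), s["gold"]) for s in samples]
    n = len(samples)
    n_parsed = sum(1 for p, _ in parsed if p is not None)
    n_correct = sum(1 for p, g in parsed if p is not None and p == g)
    return {
        "strict_accuracy": n_correct / n,
        "parse_rate": n_parsed / n,
        "lenient_accuracy": n_correct / n_parsed if n_parsed else 0.0,
    }
```

Reporting the parse rate alongside both accuracies would make it easy to see how much of the current failure count is formatting versus actual reasoning errors.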

Thanks!
