Open
Description
The evaluation module is not complete. It requires a finalised structure, more information, and exercises.
Structure
Here is a basic proposal for a structure:
- what's eval
- here are the well known benchmarks, limitations, and some alternatives people set up (arenas/llm judges)
- you should do your own evals for your own use case
- project on domain specific evaluation
- notebook on comparing models
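To make the "do your own evals" and "comparing models" items concrete, here is a minimal sketch of what such a notebook could start from. Everything in it is an assumption for illustration: the tiny `EVAL_SET`, the stand-in functions `model_a` and `model_b` (which would be replaced by real model calls), and the choice of exact match as the metric. It is not taken from the evaluation guidebook.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical domain-specific eval set: (prompt, expected answer) pairs.
EVAL_SET: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("What is the chemical formula for water?", "H2O"),
]


def normalize(text: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences are ignored."""
    return text.strip().lower()


def exact_match_accuracy(
    model: Callable[[str], str], eval_set: List[Tuple[str, str]]
) -> float:
    """Fraction of prompts whose normalized output equals the normalized reference."""
    correct = sum(
        normalize(model(prompt)) == normalize(expected)
        for prompt, expected in eval_set
    )
    return correct / len(eval_set)


def model_a(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical); answers every prompt correctly."""
    answers = {
        "What is 2 + 2?": " 4 ",
        "What is the capital of France?": "paris",
        "What is the chemical formula for water?": "H2O",
    }
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    """Stand-in for a weaker model (hypothetical); always returns the same answer."""
    return "4"


def compare_models(
    models: Dict[str, Callable[[str], str]], eval_set: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Score every model on the same eval set so the results are comparable."""
    return {name: exact_match_accuracy(fn, eval_set) for name, fn in models.items()}


scores = compare_models({"model_a": model_a, "model_b": model_b}, EVAL_SET)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Exact match is only a starting point; for open-ended tasks the module would likely need a softer metric or a judge model instead.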
Comments
- Add a short mention of human-based Elo rankings and LLM-as-a-judge
- Notebook for implementing a custom eval (there is one in the eval guidebook; it could make sense to point to it for further analysis/knowledge)
- Refactor to basic structure and add TODOs
- Add all information and references from the evaluation guidebook
- Update projects
- Update notebook with exercises
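For the Elo ranking mention, a small sketch of the standard Elo update could help whoever writes that part. The starting rating of 1000, the K-factor of 32, and the example `votes` are assumptions for illustration; real arenas may use different constants or a different rating model altogether.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical human preference votes: (model A, model B, score for A).
votes: List[Tuple[str, str, float]] = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 0.0),
]

ratings: Dict[str, float] = {"model_a": 1000.0, "model_b": 1000.0}
for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

The same loop works with an LLM judge in place of human voters, as long as the judge's verdict is mapped to a score for model A.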