Improve eval: improve doc and support multiple samples

1. Add an example for another model (DeepScaleR-1.5B-Preview) and dataset in `docs/guides/eval.md`.
    (The default eval config tests AIME on Qwen2.5-Math-1.5B-Instruct.)
2. Test and add an eval link in the training guides: `docs/guides/dpo|grpo|sft.md`.
    I don't think the training guides need a conversion guide, since the eval guide already covers it.
3. Support Pass@1 accuracy averaged over n samples.

Link `docs/guides/eval.md` from the front-page README once all of the above is done.
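
For item 3, here is a minimal sketch of what "Pass@1 accuracy averaged over n samples" could mean: score each of the n samples per problem, take the fraction that are correct, then average that fraction over all problems. The function name `pass_at_1_avg` and the nested-list input format are assumptions for illustration, not the repo's existing API.

```python
from statistics import mean


def pass_at_1_avg(correct: list[list[bool]]) -> float:
    """Pass@1 averaged over n samples per problem.

    correct[i][j] is True if sample j for problem i was graded correct.
    (Hypothetical input format; the real eval code may store results differently.)
    """
    if not correct:
        raise ValueError("need at least one problem")

    per_problem = []
    for samples in correct:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # Fraction of this problem's samples that are correct.
        per_problem.append(sum(samples) / len(samples))

    # Average the per-problem fractions over the dataset.
    return mean(per_problem)


if __name__ == "__main__":
    # Two problems, two samples each: fractions 0.5 and 1.0.
    print(pass_at_1_avg([[True, False], [True, True]]))
```

With n = 1 this reduces to plain accuracy, so the existing single-sample behavior would be a special case.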
