There's a very nice multilingual benchmark that would be really valuable to evaluate on: PolyMath.
Paper: https://arxiv.org/pdf/2504.18428
Dataset: https://huggingface.co/datasets/Qwen/PolyMath
Code: https://github.com/QwenLM/PolyMath
How could this be included?