✨ TL;DR
MathNet is a large-scale, multilingual dataset of 30,676 Olympiad-level math problems from 47 countries spanning two decades, designed to benchmark both mathematical reasoning in generative models and mathematical retrieval in embedding systems. The benchmark reveals that even state-of-the-art models struggle with these problems, with top models achieving only 78.4% accuracy, and that retrieval quality significantly impacts retrieval-augmented generation performance.
Existing mathematical reasoning benchmarks suffer from significant limitations in scale, language diversity, and task coverage. Current datasets are too small to test modern large language models adequately, focus predominantly on English, and fail to evaluate critical capabilities such as mathematical retrieval: the ability to find semantically or structurally similar problems. This gap is particularly problematic because mathematical problem solving is a fundamental test of reasoning ability, and real-world mathematical applications often require both solving problems and retrieving relevant prior work or similar examples.
The authors constructed MathNet by collecting 30,676 expert-authored Olympiad-level mathematics problems with solutions from 47 countries across 17 languages, spanning two decades of competitions. They designed a comprehensive benchmark supporting three distinct tasks: (i) Problem Solving, where models generate solutions to problems; (ii) Math-Aware Retrieval, where embedding models must retrieve mathematically equivalent or structurally similar problems from a corpus; and (iii) Retrieval-Augmented Problem Solving, which combines retrieval with generation. For the retrieval benchmark, human experts curated pairs of mathematically equivalent and structurally similar problems to enable rigorous evaluation of mathematical understanding in embedding models.
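To make the Math-Aware Retrieval task concrete, here is a minimal sketch of how such a benchmark is typically scored: each query problem is embedded, the corpus is ranked by cosine similarity, and recall@k checks whether an expert-labeled equivalent problem appears in the top k. This is an illustrative reconstruction, not the paper's actual evaluation code; the function names and the use of plain cosine similarity are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_top_k(query_vec, corpus_vecs, k=5):
    # Rank corpus problems by similarity to the query problem
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(query_vecs, corpus_vecs, gold, k=5):
    # gold[i] is the set of corpus indices that human experts labeled
    # as mathematically equivalent or structurally similar to query i
    hits = 0
    for i, q in enumerate(query_vecs):
        retrieved = set(retrieve_top_k(q, corpus_vecs, k))
        if retrieved & gold[i]:
            hits += 1
    return hits / len(query_vecs)
```

In a real evaluation the vectors would come from the embedding model under test (one vector per problem statement), and the expert-curated equivalent/similar pairs described above would supply the `gold` labels.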