On April 9, 2026, a group of university professors and International Mathematical Olympiad medalists published a detailed report on Riemann-Bench, a private benchmark designed to test artificial intelligence systems on research-level mathematics.
Recent language models have achieved exceptional performance at the International Mathematical Olympiad, demonstrating that they can rapidly solve competition-style problems. Competition mathematics, however, represents only a narrow slice of genuine mathematical reasoning: its problems require minimal theoretical machinery and often reward momentary flashes of insight over deep theoretical knowledge.
The new benchmark introduces a set of 25 problems rigorously selected by expert evaluators. Authored by top-tier university mathematics professors and doctoral students, these problems push far beyond the boundaries of standard competitions; the authors themselves routinely required weeks to solve each exercise independently. Riemann-Bench mandates a double-blind verification process for every problem: two independent experts solve it from scratch to certify the integrity of the dataset and guarantee a unique solution. The final answer, expressed exclusively in closed form, is then validated by automated verification programs.
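To make the automated validation step concrete, here is a minimal sketch of how a closed-form answer could be checked symbolically. The report does not disclose Riemann-Bench's actual verification code, so the use of SymPy and the function name below are illustrative assumptions.

```python
# Illustrative sketch only: assumes answers arrive as SymPy-parsable strings.
import sympy as sp

def answers_match(submitted: str, reference: str) -> bool:
    """Return True if two closed-form expressions are symbolically equal."""
    try:
        # Two expressions are equal iff their difference simplifies to zero.
        diff = sp.simplify(sp.sympify(submitted) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False  # an unparsable submission counts as incorrect
    return diff == 0

# Equivalent closed forms written differently still match.
assert answers_match("sin(x)**2 + cos(x)**2", "1")
assert not answers_match("pi/4", "pi/2")
```

A symbolic equality check of this kind rewards any correct closed form rather than one canonical spelling of it, which is why benchmarks that demand closed-form answers can grade automatically at all.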
Frontier models are evaluated as unrestricted research agents: software systems operating with full access to programming tools and open-ended reasoning. Performance is measured with an unbiased statistical estimator computed over one hundred independent runs per problem, a methodology that averages out random variation and gives a clear picture of each model's stability.
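The estimator described above can be pictured as follows. Under the standard assumption that each run is an independent pass/fail trial, the sample mean over runs is an unbiased estimator of the model's true solve probability; the benchmark's harness is private, so the helper names and the averaging across problems below are hypothetical.

```python
# Hypothetical sketch of per-problem and suite-level scoring.
from statistics import mean

def estimate_solve_rate(run_outcomes: list[bool]) -> float:
    """Sample mean of pass/fail outcomes: an unbiased estimate of p(solve)."""
    return mean(1.0 if ok else 0.0 for ok in run_outcomes)

def benchmark_score(per_problem_outcomes: list[list[bool]]) -> float:
    """Average the per-problem estimates across the problem suite."""
    return mean(estimate_solve_rate(runs) for runs in per_problem_outcomes)

# A problem solved in 7 of 100 runs contributes an estimated rate of 0.07.
print(estimate_solve_rate([True] * 7 + [False] * 93))  # 0.07
```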
The results expose a stark reality: every current frontier model scores below ten percent. This figure reveals a wide gap between solving olympiad-style exercises and carrying out the genuine mathematical reasoning required in advanced academic research. Such low performance highlights the limits of current neural architectures when confronted with problems that cannot be solved through pattern matching over training text.
The benchmark's administrators keep the dataset entirely private. Withholding the problems ensures that results reflect genuine mathematical generalization rather than memorization of training literature, and gives engineers a reliable signal for calibrating systems and guiding future development.