Every year, the countries competing in the International Mathematical Olympiad (IMO) arrive with a booklet of their best, most original problems. Those booklets get shared among delegations, then quietly disappear. No one had ever collected them systematically, cleaned them, and made them available, not for AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions largely on their own.
MathNet is the largest high-quality dataset of proof-based math problems ever created. Comprising more than 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-biggest dataset of its kind. The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
What makes MathNet different is not only its size, but its breadth. Previous Olympiad-level datasets draw almost exclusively from competitions in the United States and China. MathNet spans dozens of countries across six continents, covers 17 languages, includes both text- and image-based problems and solutions, and draws on four decades of competition mathematics. The goal is to capture the full range of mathematical perspectives and problem-solving traditions that exist across the global math community, not just the most visible ones.
"Every country brings a booklet of its most novel and most creative problems," says Shaden Alshammari, an MIT PhD student and lead author on the paper. "They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online."
Building MathNet required tracking down 1,595 PDF volumes totaling more than 25,000 pages, spanning digital documents and decades-old scans in more than a dozen languages. A significant portion of that archive came from an unlikely source: Navid Safaei, a longtime IMO community figure and co-author who had been collecting and scanning those booklets by hand since 2006. His personal archive formed much of the backbone of the dataset.
The sourcing matters as much as the scale. Where most existing math datasets pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in those booklets are expert-written and peer-reviewed, and they often run to multiple pages, with authors walking through several approaches to the same problem. That depth gives AI models a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community-sourced datasets. It also means the dataset is genuinely useful for students: Anyone preparing for the IMO or a national competition now has access to a centralized, searchable collection of high-quality problems and worked solutions from traditions around the world.
"I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition," says Alshammari, who competed in the IMO as a student herself. "We hope this gives them a centralized place with high-quality problems and solutions to learn from."
The team has deep roots in the IMO community. Sultan Albarakati, a co-author, currently serves on the IMO board, and the researchers are working to share the dataset with the IMO foundation directly. To validate the dataset, they assembled a grading group of more than 30 human evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who coordinated to verify thousands of solutions.
"The MathNet database has the potential to be an excellent resource for both students and leaders seeking new problems to work on or looking for the solution to a difficult question," says Tanish Patil, Switzerland's IMO deputy leader. "Whilst other archives of Olympiad problems do exist (notably, the Contest Collections forums on AoPS), these resources lack a standardized formatting system, verified solutions, and important problem metadata such as the topics and theory required. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and if we will soon be able to reliably answer an important issue when creating novel Olympiad questions: determining if a problem is truly original."
MathNet also functions as a rigorous benchmark for AI performance, and the results reveal a more complicated picture than recent headlines about AI math prowess might suggest. Frontier models have made extraordinary progress: Some have reportedly achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would stump most humans. But MathNet shows that progress is uneven. Even GPT-5, the top-performing model tested, averaged around 69.3 percent on MathNet's main benchmark of 6,400 problems, failing nearly one in three Olympiad-level problems. And when problems include figures, performance drops significantly across the board, exposing visual reasoning as a consistent weak point for even the most capable models.
Several open-source models scored 0 percent on Mongolian-language problems, highlighting another dimension where current AI systems fall short despite their overall strength.
"GPT models are equally good in English and other languages," Alshammari says. "But many of the open-source models fail completely at less-common languages, such as Mongolian."
The diversity of MathNet is also designed to address a deeper limitation in how AI models learn mathematics. When training data skews toward English and Chinese problems, models absorb a narrow slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number theory problem may approach the same underlying concept from a completely different angle. Exposure to that range, the researchers argue, makes both humans and AI systems better mathematical thinkers.
Beyond problem-solving, MathNet introduces a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure, a capability that matters both for AI development and for the math community itself. Near-duplicate problems have appeared in real IMO exams over the years because finding mathematical equivalences across different notations, languages, and formats is genuinely hard, even for expert human committees. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, with models frequently ranking structurally unrelated problems as more similar than equivalent ones.
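As a rough illustration of what "correct match on the first try" means, the sketch below computes a top-1 retrieval score: for each query problem, it checks whether the most similar candidate by cosine similarity is the one labeled as mathematically equivalent. The function names, the toy vectors, and the choice of cosine similarity are assumptions made for this example; the article does not describe how MathNet's benchmark scores similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_1(queries, candidates, gold):
    """Fraction of queries whose top-ranked candidate is the labeled equivalent."""
    hits = 0
    for qid, qvec in queries.items():
        # Pick the candidate whose embedding is most similar to the query.
        best = max(candidates, key=lambda cid: cosine(qvec, candidates[cid]))
        if best == gold[qid]:
            hits += 1
    return hits / len(queries)

# Hand-made toy vectors (hypothetical, not real embeddings or MathNet data).
queries = {"q1": [1.0, 0.0, 0.0], "q2": [0.0, 1.0, 0.0]}
candidates = {
    "c1": [0.9, 0.1, 0.0],
    "c2": [0.0, 0.2, 1.0],
    "c3": [0.1, 0.9, 0.3],
}
# gold maps each query to the candidate labeled as its equivalent problem.
gold = {"q1": "c1", "q2": "c2"}

score = recall_at_1(queries, candidates, gold)
print(score)
```

In this toy setup, q2's labeled equivalent (c2) is outranked by c3, which looks closer in embedding space. That is the failure mode the researchers describe: an unrelated problem ranking as more similar than the equivalent one.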
The dataset also includes a retrieval-augmented generation benchmark, testing whether giving a model a structurally related problem before asking it to solve a new one improves performance. It does, but only when the retrieved problem is genuinely relevant. DeepSeek-V3.2-Speciale gained up to 12 percentage points with well-matched retrieval, while irrelevant retrieval degraded performance in roughly 22 percent of cases.
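A minimal sketch of the setup, under stated assumptions: a retrieved problem and its solution are placed ahead of the target problem in the prompt, and the effect is reported as a change in accuracy in percentage points. The prompt wording, the helper names `build_prompt` and `gain_in_points`, and the toy inputs are all hypothetical; they are not taken from the MathNet paper.

```python
def build_prompt(problem, retrieved=None):
    """Prepend a retrieved problem and its solution, if any, to the target problem."""
    parts = []
    if retrieved is not None:
        parts.append("Related problem:\n" + retrieved["problem"])
        parts.append("Solution to the related problem:\n" + retrieved["solution"])
    parts.append("Now solve this problem:\n" + problem)
    return "\n\n".join(parts)

def gain_in_points(baseline_correct, augmented_correct):
    """Accuracy change in percentage points between two runs on the same problems."""
    n = len(baseline_correct)
    base = sum(baseline_correct) / n
    aug = sum(augmented_correct) / n
    return 100 * (aug - base)

# Toy placeholders, not MathNet data or results.
target = "Prove that n^3 - n is divisible by 6 for every integer n."
neighbor = {
    "problem": "Prove that n^2 + n is even for every integer n.",
    "solution": "n^2 + n = n(n + 1), a product of two consecutive integers, "
                "so one factor is even.",
}

prompt = build_prompt(target, neighbor)
# Each list marks, per problem, whether a run's answer was graded correct.
delta = gain_in_points([True, False, False, True], [True, True, False, True])
print(prompt)
print(delta)
```

A negative value from `gain_in_points` would correspond to the case the researchers report for irrelevant retrieval, where the added context makes the model do worse.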