RAQUEL DÍAZ
Updated 11/13/2025 - 08:02 ET
The AlphaProof system formalizes millions of statements and verifies each step, making it possible to check, and to explain, why it succeeds

An artificial intelligence has achieved something that until now belonged to the realm of the world's brightest students: competing, with results comparable to a silver medal, in the International Mathematical Olympiad (IMO). The work, published in Nature and developed by Google DeepMind, introduces AlphaProof, an AI system that not only solves problems but does so within a formal mathematical environment where each step is automatically verified. This places it a step above large language models that "seem" to reason but whose correctness is difficult to verify.
The starting point of the study is a known limitation: general AIs can write elegant, almost human-like proofs in natural language, but mathematicians need certainty. What AlphaProof does is train on an immense amount of formalized mathematical material (the authors mention self-formalizing about 80 million statements) and reason inside a formal proof environment where every step can be machine-checked.
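To give a flavor of what "formalized" means here: in an interactive proof assistant such as Lean, the system AlphaProof is reported to work with, a statement and its proof are written as code that the kernel checks mechanically. The toy example below (not AlphaProof's output; the theorem name is illustrative, and the standard Mathlib library is assumed) shows the idea: if any step were logically wrong, the file simply would not compile.

```lean
import Mathlib

-- A machine-checkable statement: the sum of two squares of integers
-- is never negative. The Lean kernel verifies each inference;
-- there is no way to sneak in an unjustified step.
theorem sq_sum_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 :=
  add_nonneg (sq_nonneg a) (sq_nonneg b)
```

This is what separates a formally verified proof from a natural-language one: correctness is a property of the artifact itself, not of a reader's judgment.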
Tested in a real setting, the IMO 2024, the system independently solved three of the six problems (algebra and number theory) and needed support from AlphaGeometry 2 for the geometry exercise. The overall performance matched that of a silver medalist, with one significant difference: the system needed two to three days of computation, whereas human students have four hours. "Of the six problems posed in this competition, AlphaProof correctly solved three problems in algebra and number theory, but failed to solve the two combinatorial ones," notes Marta Macho-Stadler, a mathematician at the University of the Basque Country, in statements to SMC Spain, highlighting the key advance: "it adds a verification method to check the correctness of its results."
This automatic verification is what excites part of the community: it's not just that the AI gets it right, but that we can see why it does. However, the work also shows its limitations. First, the show of strength came on a very particular type of problem, Olympiad problems, which combine ingenuity and small tricks but do not require the background of a full course in algebraic geometry. Second, the performance relies on a level of computing power available to very few labs. "In the current situation of AI, you either have 'infinite' computing resources or you don't develop a prototype from start to finish," notes Teodoro Calonge, a computer science professor at the University of Valladolid, who calls the article "worthy" and well explained in its use of pre-trained models.
There is also a direct consequence that impacts education: if an AI can solve variations of seen problems, exams—and Olympiads—will have to become more creative. Calonge puts it plainly: teachers tend to "base exercises on previous ones with very slight variations," and that's where a system trained with millions of examples shines. When the problem moves away from the known, the AI still struggles. This is one of the limits that the authors themselves point out in Nature: transitioning from structured competition to open mathematics, which demands creativity, remains a challenge.
This distinction is also highlighted by Clara Grima, a mathematics professor at the University of Seville, who recalls to SMC Spain Hans Moravec's metaphor, popularized by Max Tegmark, of the "landscape" of human capabilities: first calculation fell, then theorem proving and chess, and the rising water gradually covers higher slopes. For Grima, the article shows that even the ingenuity we attributed to Olympiad teenagers is starting to be within reach of machines, although we are not yet talking about original mathematical research or open conjectures.
Overall, DeepMind's work sets a clear marker: it is possible to make AI reason about mathematics in a controllable and verifiable way. In the short term, this could translate into assistants that help teachers generate exercises, more reliable automatic graders, or tools for mathematicians who want to check the routine steps of a long proof. In the medium term, if computing costs fall and the range of problems expands, it could become a tool for exploring entire branches of formal knowledge without relying solely on human hours.

