Setting the scene
The 2025 International Mathematics Olympiad has come and gone. Reminder: this is an exam for high-school kids across the world (each country typically sends six kids), comprising two 4.5-hour exams each containing three questions, so six questions in total, which I’ll call P1 to P6. Solutions to each question are scored out of 7, for some reason lost in the mists of time. As you can see pretty clearly from the individual results and especially the sea of scores at or very close to the sequence “7,7,7,7,7,0”, this year had five reasonable questions and one stinker, namely P6. Around half of the participants get medals, with the cut-offs chosen so that approximately 1/12 of participants get a gold, 2/12 get a silver and 3/12 get a bronze. I still have my gold medal from 1987 in a drawer upstairs; I beat Terry Tao 😛 (although I was 18 and he had just turned 12 at the time; he was noticeably younger than all other contestants). The cut-off for gold this year was 35, which is also the answer to 7+7+7+7+7+0. You can take a look at the questions for 2025 (and indeed for any year) at the official IMO website.
Back in 2024 Google DeepMind “took the IMO” in the sense that they waited until the questions were released publicly and then threw two tools at them: AlphaGeometry, which is designed to solve geometry problems in a human-readable format, and AlphaProof, which is designed to write Lean code to solve any mathematics problem written in Lean format. Humans translated the questions into a form appropriate to each tool, and then between them the tools solved 4 out of the 6 problems over a period of several days (rather than the 9 hours allowed for the exam), which led to DeepMind reporting that their system had “achieved Silver Medal standard” and which then led to the rest of the world saying that DeepMind had “got a Silver Medal” (although right now it is only possible for humans to get medals). Indeed DeepMind were one point off a gold in 2024.
This leads inevitably to two questions about IMO 2025. The first: will a system “get gold in 2025”? And the second: were the IMO committee going to embrace this interest from the tech companies, define precisely what it would mean for an AI to “get gold”, demand an entrance fee or sponsorship from the tech companies and in return use official IMO markers to give official grades, also reporting on scores from tech companies who didn’t do very well? Spoiler alert: the answers are “yes” and “no” respectively.
Deciding the rules
Because the IMO committee chose not to write down any formal rules for how AI could enter the competition, each of the tech companies was free to make up its own rules and in particular to define what it means to “get gold”. In short, the tech companies got to both set and mark their own homework. But before I go any further, it’s worth discussing what kind of entries are going to come into the Wild West that is “AI at IMO 2025”. There are going to be two kinds of submissions — formal and informal. Let’s just break down what these mean.
“Informal” is just using a language model like ChatGPT — you give it the questions in English, you ask it for the answers in English, and you then decide how many points to give it out of 7. Let’s scrutinise this last part a little more. IMO marking is done in a very precise fashion and the full marking scheme is not publicly revealed; there are unpublished rules established by the coordinators (the markers) such as “lose a mark for not explicitly making a remark which deals with degenerate case X” and, because of this secrecy, it is not really possible for someone who is just “good at IMO problems” but who hasn’t seen the mark scheme to mark LLM output accurately. Given that (spoiler alert) the systems are all going to get 0 points on problem 6, the claim that AI company X got a gold medal with an informal solution rests on things like “we paid some people to mark our solutions to questions 1 to 5 and in return they gave us 7/7 for each solution despite not having seen the official mark scheme”. Judging by this Reddit post there seems to have been some unofficial but failed(?) attempt by the tech companies to get the coordinators to officially give out scores to their informal solutions.
“Formal” means using a computer proof assistant, a programming language in which code corresponds to mathematics. For such entries, someone has to translate the statement of each question into the language of the proof assistant, and then a language model trained to write code in this proof assistant will attempt to write a solution in code form. I had naively hoped that IMO 2025 would come with “official” translations of the questions into languages such as Lean (just as they supply official translations into many many human languages), but no such luck. So the formal solutions will involve something (probably a human) translating each question into the relevant computer language (possibly in many different ways; this translation, just like translation between human languages, is not uniquely determined by the input) and then giving the formal questions to a system which has been trained to write code solving them in that language.

Another catch here is that, as the name suggests, a proof assistant is something which can check proofs, and unfortunately the only question in the 2025 IMO of the form “Prove that…” was P2; all the other questions were of the form “Determine this value”. This throws a spanner into the works of a proof-assistant-based attempt on the IMO: if one is asked to “determine the smallest real number c such that [something]” (which is what P3 asked us to do) then what is to stop a machine saying “the number to be determined is c, where c is the smallest real number such that [the same thing], and the proof that I am correct is that I am correct by definition”? It is actually rather difficult even to say formally what we mean by a “determine” question. Formal systems attempting the IMO thus typically use a second (usually informal) system to suggest answers and then get the formal tool to try and prove the suggested answer correct. Modulo checking that the corresponding “proof” question does correspond to the question being asked (and this does need checking), formal proofs are of a binary nature: if the code compiles then it gets 7/7, because it is a computer-checked solution to the question which uses only the axioms of mathematics and their consequences. It is not really meaningful to ask for partial credit here, so anything other than a full solution gets 0/7 (although we’ll see an example of someone claiming to get 2/7 with a formal solution later).
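To make the loophole concrete, here is a minimal Lean sketch (assuming Lean 4 with Mathlib; the condition `1 ≤ c` is a made-up stand-in for the actual, far more complicated, condition in P3, so the honest answer in this toy case is just c = 1):

```lean
import Mathlib

-- Toy stand-in for "determine the smallest real number c such that [something]";
-- here [something] is just the made-up condition 1 ≤ c, so the answer is c = 1.

-- Loophole formalisation: "there is a smallest such c".  A machine can prove
-- this without the statement ever committing to a concrete number.
example : ∃ c : ℝ, 1 ≤ c ∧ ∀ x : ℝ, 1 ≤ x → c ≤ x :=
  ⟨1, le_refl 1, fun _ hx => hx⟩

-- Honest formalisation: the claimed answer is baked into the statement, so a
-- compiling proof really does pin down the value.
example : (1 : ℝ) ≤ 1 ∧ ∀ x : ℝ, 1 ≤ x → (1 : ℝ) ≤ x :=
  ⟨le_refl 1, fun _ hx => hx⟩
```

In the first statement the prover never has to name the number; in the second, the claimed value is part of the theorem being checked, which is essentially why the formal pipelines need an answer to be supplied (by a human or by an informal model) before the proof search even starts.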
The results are in!
First out of the starting blocks were the wonderful people at MathArena. These are people who don’t work for a tech company and are trying to do science. They reported on the performance of the best informal models available to mere mortals such as us (sometimes at a cost of hundreds of dollars a month), and the results were disappointing; the models that we regular people can get our hands on did not even get a bronze medal (this claim relies on the marking being accurate, which as I’ve already explained we cannot guarantee, but presumably they are in the right ballpark).
But of course we have not yet taken into account the fact that tech companies might have secret models up their sleeve which have not yet been released to the public. And surprise surprise, this is what happened. So from this point on in the blog post, none of the claims made by the tech companies can be independently verified, a situation which is very far from my experience as a mathematician; one wonders if one is even allowed to call it science. It seems that Gregor Dolinar, the President of the IMO, is similarly cautious about the claims covered in the remainder of this post; his comments on AI entries can be seen here on the 2025 IMO website and I’ll reproduce them to save you the click:
It is very exciting to see progress in the mathematical capabilities of AI models, but we would like to be clear that the IMO cannot validate the methods, including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced.
In particular, because there is a huge incentive to be the first to get gold, and no way to verify what is actually going on behind the scenes, tech companies can in theory just cheat. Let’s assume they didn’t cheat though, and press on.
As we can see from the IMO 2025 website, the exams this year were on the 15th and 16th of July. Marking occurred on 17th and 18th, the students’ results were announced in the closing ceremony on the 19th, and students left Australia when the IMO officially ended on the 20th.
The first company to claim gold was OpenAI. Minutes after the closing ceremony (so a day before the IMO had even officially ended) OpenAI released a twitter thread claiming a score of 7+7+7+7+7+0 with an informal model which is not available to the public. They also released their model’s natural-language solutions to problems 1–5, and they are on the whole pretty good. The solutions were marked by former IMO contestants. Whether the solutions submitted were hand-picked by humans from a collection of candidate solutions, who knows — there were no rules. I have no reason to believe that OpenAI cheated — but my feeling is that we have diverged a long way from the traditional scientific process (peer review etc) at this point. What is clear is that OpenAI were desperate to claim the prize and couldn’t wait.
I had heard through the grapevine that there was an informal embargo on tech companies announcing anything until a week after the IMO had ended (i.e. July 28th), but of course once one company had claimed the ultimate prize, informal embargoes don’t count for much. On 21st July DeepMind published a blog post also claiming gold in 7+7+7+7+7+0 form, together with a quote from Dolinar saying “We can confirm that Google DeepMind has reached the much-desired milestone” (which rather seems to contradict his earlier statement quoted above, in which he explicitly states that the IMO cannot validate the methods being used by the tech companies). DeepMind’s blog post will no doubt be regarded as the source of truth for this claim, neatly avoiding the pesky refereeing process. Indeed it is worth noting that, over a year after DeepMind’s 2024 announcement of their silver medal, we still only have the (unrefereed) blog post to cite for that claim, with no published refereed paper detailing the process, and this (paywalled) article from the Financial Times seems to indicate that perhaps the paper was blocked at senior management level.
Not to be outdone, Harmonic posted on Twitter that same day that they would be sticking to the embargo date of the 28th, together with a screenshot of what was plainly Lean code containing part of a solution to P3. By July 22nd both the New York Times and Reuters were uncritically reporting that both Google and OpenAI had achieved gold, together with a non-sequitur quote from someone who works at Google saying “The achievement suggests AI is less than a year away from being used by mathematicians to crack unsolved research problems at the frontier of the field”. Let me come back to this later.
By the 23rd the Chinese had weighed in, with ByteDance announcing (in Chinese) that their Seed Prover model got a silver (2+7+7+7+7+0) after 3 days of working on the problems (note that there are no time limits because there are no rules; each tech company gets to make up its own, and quite how they decided that a formal solution was worth 2/7 is anyone’s guess), upgrading to a gold (7+7+7+7+7+0) if the solver was given “additional attempts” (whatever that means). This was the first announcement of a formal (Lean) approach in 2025. The announcement also contains the claim “The final score for Seed Prover at IMO 2025 disclosed here was confirmed with the IMO committee prior to this announcement”, which again seems to go against Dolinar’s quote above, and is a bit confusing. Seed Prover didn’t just do the IMO — it also achieved state-of-the-art results on a Lean database of old Putnam problems (although who knows whether solutions were in the training data; this is a big problem with testing LLMs on old exams); if IMO gold is now regarded as old news, the 2026 Putnam will no doubt be next in the firing line.
Finally we get on to poor old Harmonic, who dutifully waited until the embargo date of the 28th before doing a livestream on Twitter announcing that they also got a gold, again with formal methods; mostly Lean, but (like ByteDance in 2025 and DeepMind in 2024) using a bespoke system to solve the geometry problem. There was no information given (either by Harmonic or by ByteDance) about how the numbers to be “determined” in P1, P3, P4 and P5 were supplied to the system. The joys of unrefereed announcements.
So there you have it, the take-home message is that “AI got a gold in the IMO” and you can forget about all the annoying details.
Summary
As is probably clear, I am a bit frustrated by all these shenanigans. Don’t get me wrong — it is abundantly clear that both informal and formal math provers are getting better and better. We have seen absolutely incredible and extraordinary developments this decade in computer systems doing mathematics autonomously. However, at the end of the day, AI was one point off the gold medal boundary in 2024 and hit the gold medal boundary exactly in 2025, so in a sense what the last 12 months have given us is a one-point improvement on a test which is solvable using just high school methods. To extrapolate from this to “less than a year away from being used by mathematicians to crack unsolved research problems” is in my opinion completely unjustified. I cannot stress enough that these IMO problems are a million miles away from the kind of questions which most research mathematicians are working on. All the ignorant comments on social media of the form “it’ll be the Riemann hypothesis next” are in my mind simply way off.
I also feel that the IMO committee missed a trick. This was the year when the tech companies could have been forced to play by rules which the committee laid down. They did not take such action; we ended up with a rather chaotic endgame, and now everyone knows that “AI has got a gold in the IMO” and the question is probably no longer of interest. Note that things appear to be completely bimodal here — other than the enigmatic 2/7 claimed by ByteDance on P1, all scores were either 7/7 or 0/7. Four problems out of the six were solved in 2024, five in 2025, so perhaps it’s perfect score time in 2026. Probably a lot will depend on what the financial backers of the tech companies want to see though; maybe the better headline would be “AI gets degree from Harvard/Cambridge/Stanford/whatever”, something which is clearly within reach given that most university exams are far more formulaic than the IMO, because they are designed in such a way that an average student can pass them. And although this will no doubt generate a lot of excitement on social media, again it is a million miles away from what researchers are doing at the frontier of the field; indeed most pure mathematics taught to undergraduates at even the best universities was known by the 1940s. Mathematics is an extraordinarily conservative field in this regard; we choose not to teach our undergraduates what is happening at the boundary of research and instead focus on a rigorous development of everything from the axioms, and because of this approach there is simply no time to get any further. I find it striking that at Imperial College (my university) basically the only pure mathematics course offered to undergraduates which I was not offered myself at Cambridge 35 years ago is the Lean course.
I will finish by stating what I have stated so many times before. Yes, it is absolutely clear that AI is getting better at mathematics. Yes, it is absolutely clear that progress is currently very rapid. It is important, however, to remember that past performance is no guarantee of future results, and the systems are right now still worse at IMO problems than I was when I was 18, which is an extremely long way away from where I was at 30, with a PhD and several novel research papers under my belt. AI approaches to harder problems seem to become more and more vibes-based as the difficulty increases, with the systems being consistently unable to rigorously justify their claims, instead insisting that certain facts are true without having any understanding as to why. Will the systems start to solve interesting open problems which many mathematicians have spent lots of time on and failed to crack? Maybe — but this is still far from guaranteed.