I've been formalizing mathematical theorems in Lean for some years now, and one of the major obstacles is the difficulty of formalizing elementary results that mathematicians do not even take the time to state. For example, a mathematician's integer can implicitly be a natural number on one line and a real number on the next, while proof assistants require that some “coercion” maps be introduced. Having three `1`s of three different natures in the same formula leads to unexpected nightmares. It is therefore very important for the development of proof-assisted mathematics that some automation be provided by the type checking system, with as many obvious things as possible being taken care of by a combination of a fine-tuned API and efficient search algorithms.
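To illustrate with a minimal sketch (my own example, not taken from any particular Mathlib file): in the statement below, one `1` is a natural number, one is an integer, and the ones on the right-hand side are real numbers; the formula only type-checks because coercion maps glue the types together, and tactics such as `push_cast` exist precisely to tame this.

```lean
import Mathlib

-- One formula, three kinds of `1`: the `1` in `n + 1` is a natural number,
-- the `1` in `(n : ℤ) - 1` is an integer, and the `1`s on the right-hand
-- side are real numbers.  The casts to ℤ and ℝ are the coercion maps.
example (n : ℕ) : ((n + 1 : ℕ) : ℝ) - (((n : ℤ) - 1 : ℤ) : ℝ) = 1 + 1 := by
  push_cast   -- push the coercions inward, turning everything into reals
  ring        -- (n + 1) - (n - 1) = 1 + 1 is now a ring identity
```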
Another reason for the difficulty of formalizing proofs is having to navigate a gigantic mathematical library (Mathlib, for example, has 1.5 million lines of code) with neither a definitive search engine nor reliable, complete documentation. This is not to blame my colleagues: all of this is difficult to write, time consuming, and our personal interests pull us in different directions.
This may explain why part of the field of formalization is driven by the “AI hype”, and why, although this hype is so finely rebutted by Emily Bender and Alex Hanna in their latest book, The AI Con, which I heartily recommend, several colleagues use LLMs to create proofs, either in natural language or in Lean syntax. They say it is successful, but from the outside, it is really hard to tell whether that is actually the case. My main impression is that these tools obliterate the time during which our minds form ideas, leading, for me, to another kind of stress. There are also serious arguments that cognitive friction is a necessary part of well-formed thinking and cannot be replaced by stochastic optimization. Moreover, these colleagues usually do not address the environmental cost of using this kind of generative AI, nor whether its output is worth that cost.
A few days ago, after I complained, one more time, that stochastic algorithms do not think, I was asked my opinion about the “trinity autoformalization system”. I have to confess that I carefully avoid AI news and hadn't heard about it. A first search led me to a 2022 paper, Autoformalization with Large Language Models. Here is the definition of “autoformalization” from that paper:
Autoformalization refers to the task of automatically translating from natural language mathematics to a formal language.

In this case, the authors took statements of mathematical competition problems as a benchmark, and were happy to have their LLMs translate roughly 25% of those problems correctly. This might be a technical feat, but let me remain unimpressed for the moment: what mathematicians need is more than that, and it is not elementary statements of math olympiad problems, but the most advanced mathematical concepts that the human mind is capable of grasping, despite the fact that they involve extremely intricate and sometimes combinatorially involved constructions. I am not sure that we fully know what it means to understand these concepts, but I am certain that it doesn't reduce to being able to state them formally. (And what about those very fine colleagues who demonstrate their mathematical depth every day while failing to state precise statements?)
It appears my colleague from the Lean community meant another result. Trinity is a project from Morph Labs, which their web page describes as follows:
Morph Labs is building a cloud for superintelligence.
Infinibranch enables AI agents on Morph Cloud to snapshot, replicate, autoscale, test, debug, deploy, and reason about software at light speed.
This starts pretty badly. First of all, superintelligence is even less well defined than intelligence, and claims of developing superintelligence can only be viewed with suspicion, especially when no scientific argument justifies the claim that AI software features any intelligence at all, and most AI experiments show a poor rate of success. The next sentence may be worse, since the speed of reasoning is not measured in m/s (nor in ft/hr), and electric signals in cables travel more slowly than light (although, I just learnt, they can reach up to 99% of its speed!).
Their blog page on Trinity starts as follows:
tl;dr
We're excited to announce Trinity, an autoformalization system that represents a critical step toward verified superintelligence.
which combines the colloquial tl;dr (too long; didn't read) with the brutal claim that their autoformalization system could be a step towards superintelligence. Later on the webpage, they boast about a nearly infinite supply of verified training environments which, in our finite world plagued by brutal climate change, is quite a thing.
What these people claim to have done is the autoformalization of a 1962 theorem by N. G. de Bruijn (who, incidentally, is the father of Automath, one of the first proof assistants ever built), regarding the behaviour of the number $N(n)$ of integers $\leq n$ all of whose prime factors divide $n$, when $n$ grows to infinity. Answering a question of Erdős, de Bruijn proved that, on average, $N(n)$ is at most $n^{1+\varepsilon}$. On the other hand, the Trinity team emphasizes a consequence of that result for the abc conjecture of Masser and Oesterlé. That conjecture asserts that for any $\varepsilon\gt0$, there are only finitely many triples $(a,b,c)$ of coprime natural numbers such that $a+b=c$ and such that the product $\operatorname{rad}(abc)$ of all prime numbers dividing $abc$ is at most $c^{1-\varepsilon}$. The implications of that mysterious, elementary-looking conjecture for number-theoretic questions are manifold, from an asymptotic version of Fermat's Last Theorem to an effective version of the Mordell conjecture to Siegel zeroes of L-functions, etc. In fact, de Bruijn's theorem implies that there are at most $\mathrm O(N^{2/3})$ triples $(a,b,c)$ of coprime natural numbers at most $N$ such that $\operatorname{rad}(abc)\lt c^{1-\varepsilon}$. When this bound is compared to the $\approx \frac6{\pi^2}N^2$ triples of coprime integers $(a,b,c)$ such that $a+b=c\leq N$, it indicates that the (conjecturally finite) exceptional set of the abc conjecture is rather sparse.
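In displayed form, the comparison made above is between

$$\#\bigl\{(a,b,c)\ \text{coprime}: a+b=c\le N,\ \operatorname{rad}(abc)\lt c^{1-\varepsilon}\bigr\} = \mathrm O\bigl(N^{2/3}\bigr)$$

and

$$\#\bigl\{(a,b,c)\ \text{coprime}: a+b=c\le N\bigr\} \approx \frac6{\pi^2}\,N^2.$$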
To be fair, I am not capable of judging the difficulty of what they achieved, and I presume it is a difficult piece of software engineering to adjust the myriad parameters of the neural networks involved. But what I can tell is whether they did what they say they did, and whether (this is more of an opinion) it has any relevance for the future of mathematical formalization. The answers, as you can guess, will be no, and definitely no.
I would first like to make a mathematical comment on the sentence
We obtain the first formalized theorem towards the abc conjecture, showing it is true almost always.
that can be read on the GitHub page of the project. It echoes the title of the (nice) expository paper by Jared Lichtman, “The abc conjecture is true almost always”, but adds the idea that this theorem would be a step towards the abc conjecture. I claim that it is nothing of the sort. Indeed, it is an observation of Oesterlé (see Szpiro's paper, page 12) that the $\varepsilon=0$ variant of the abc conjecture is false. On the other hand, the exact same argument shows that the counterexamples to this false conjecture are just as sparse, perhaps with an upper bound of $N^{2/3+\varepsilon}$ for the number of counterexamples. Consequently, if de Bruijn's result were a step towards the abc conjecture, it would simultaneously be a step towards a false conjecture. The most reasonable conclusion is then to admit that this theorem has no proof-theoretic relevance to the abc conjecture.
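For readers who wonder how the $\varepsilon=0$ variant fails, here is the classical construction I remember (it may differ from Oesterlé's own argument as reported in Szpiro's paper): since $2^{n+2}$ divides $3^{2^n}-1$ for every $n\geq 1$, the coprime triples

$$a = 1, \qquad b = 3^{2^n}-1, \qquad c = 3^{2^n}$$

satisfy

$$\operatorname{rad}(abc) \leq 3\cdot 2\cdot \frac{b}{2^{n+2}} \lt \frac{6c}{2^{n+2}} \lt c$$

as soon as $n\geq 1$, so that there are infinitely many triples with $\operatorname{rad}(abc)\lt c$.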
Then, while Morph's blog says:
Trinity systematically processes entire papers, intelligently corrects its own formalization errors by analyzing failed attempts, and automatically refactors lengthy proofs to extract useful lemmas and abstractions,

the GitHub page for the actual project mentions a human-supplied blueprint. This is a file in (La)TeX format that serves as a scaffold for the Lean project, recording how the statements depend on one another and which Lean declarations correspond to them. From the expression “human-supplied”, one has to understand that human beings have written that file, that is to say, have split the proof of de Bruijn's theorem into a series of 67 lemmas which divide the initial paper into tiny results, while giving references to other lemmas as indications of how they could be proved. As an example, here is “lemma24”:
\begin{lemma} \label{lem24} \lean{lemma24}
Let $p_1<\cdots<p_k$ be distinct primes, and denote the product $r = p_1 \cdots p_k$. If an integer $n\ge1$ satisfies $\rad(n)=r$, then $n = p_1^{a_1}\cdots p_k^{a_k}$ for some integers $a_1,\dots,a_k\ge 1$.
\end{lemma}
\begin{proof}\leanok
\uses{thm3, lem20}
Uses theorem \ref{thm3} with $n$, then uses lemma \ref{lem20}.
\end{proof}
which means that if an integer $r$ is the product of distinct primes $p_1,\dots,p_k$ and $\operatorname{rad}(n)=r$, then $n$ can be written $p_1^{a_1}\cdots p_k^{a_k}$ for some integers $a_1,\dots,a_k\geq 1$. The hint is to use “thm3” (the fundamental theorem of arithmetic) and “lem20” (the formula for the radical of an integer in terms of its decomposition into prime factors).
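For comparison, here is what a Lean statement of that lemma could look like. This is my own hedged sketch, written with Mathlib's `Nat.primeFactors` and `Nat.factorization` rather than the project's actual definitions (exact syntax depends on the Mathlib version), and I leave the proof as `sorry`.

```lean
import Mathlib

-- A sketch of "lemma24" (not the project's code): if the primes dividing n
-- are exactly the elements of s (that is, rad n is the product of s), then
-- n is the product over s of each prime raised to its exponent in n, and
-- each of these exponents is at least 1.
example (n : ℕ) (hn : 1 ≤ n) (s : Finset ℕ)
    (hp : ∀ p ∈ s, p.Prime) (hs : n.primeFactors = s) :
    (n = ∏ p ∈ s, p ^ n.factorization p) ∧ (∀ p ∈ s, 1 ≤ n.factorization p) := by
  sorry
```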
To summarize, humans have converted the 2-page proof of this result into 67 tiny chunks, supplying plenty of information that would probably have been unnecessary to most professional mathematicians, and fed that to their LLM. I am reluctant to draw a comparison with the AI farms in Kenya where workers were (and probably still are) exploited at tagging violent images for ridiculously low pay, but this is yet another instance where “mechanized artificial intelligence” relies crucially on human work (beyond the invention and deployment of neural networks, etc., which are human work as well, of course). And as in those other cases, the technostructure makes every effort to render this human work invisible: the blueprint has no author, the blog post has no author. Everything is done so that you believe no human was involved in this project.
Let's go back to the content of the blueprint, taking for granted that such work has to be done anyway. We humans, however, do not want to have to write endless tiny lemmas, each of them apparently irrelevant to the big mathematical picture. We do not want to keep them in a mathematical library, for almost all of them have no interest outside of their specific context. Moreover, once these lemmas are written, formalizing the proof would require no additional effort from a human. What we do want is a way to translate mathematical ideas with as little effort as possible, keeping reasonably close to their natural expression.
Moreover, even if we had had to provide these tiny lemmas, we wouldn't be able to keep them in a human-accessible mathematical library, for that library would be cluttered with millions of uninteresting and rapidly redundant lemmas. Building a mathematical library is a human task: it reflects the activity of past human minds and is meant to be used by future human minds.
As a matter of fact, this project considers a result in elementary mathematics, at an undergraduate level, and can completely ignore the difficulty of building a large-scale mathematical library, while that is precisely where we need better automation. Something which is missing in Lean (but exists in other proof assistant systems, such as Rocq) is a coherent management system for all algebraic/analytic structures, which would make it possible to optimize their construction without having to modify the rest of the code. There are some instances of such automation in Lean, though, for example the automatic conversion of lemmas about groups written in multiplicative form to groups written in additive form, giving one proof for two. Building such a tool requires good proficiency with parsers and a clear view of what mathematicians do “in their heads” when they perform that apparently innocuous conversion. (For this reason, another automatic conversion, changing the direction of ordering relations, appears more difficult to write, because what we would want is less clear. For the moment, Mathlib has resolved to systematically privilege the $\leq$ inequality; statements about the other one can be transported to the dual order, but automating this raises the very same problem.) Another kind of crucial automation is the implementation of solvers for linear inequalities, or automatic proofs of algebraic identities (in numbers, in groups, whatever), so that the sentence “a computation shows that…” could almost be translated as such into formal code. This requires good knowledge of programming and the use of subtle but classical algorithms (simplex method, Gröbner bases, SAT solvers…), each of them finely tailored to its applications. This is a truly beautiful combination of mathematics and computer science, but nothing close to a so-called “general intelligence” tool.
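To give a concrete idea of what this automation looks like in practice, here is a hedged sketch using standard Mathlib ingredients: the `to_additive` attribute, which generates the additive version of a multiplicative lemma, and the `linarith` and `ring` tactics, which discharge the kind of routine steps a paper would summarize as “a computation shows that…”. The lemma names below are chosen for illustration; exact syntax depends on the Mathlib version.

```lean
import Mathlib

-- One proof for two: `to_additive` produces the additive statement
-- `my_neg_add_rev` (about additive groups) from the multiplicative one below.
@[to_additive my_neg_add_rev]
theorem my_mul_inv_rev {G : Type*} [Group G] (a b : G) :
    (a * b)⁻¹ = b⁻¹ * a⁻¹ :=
  mul_inv_rev a b

-- “A computation shows that…”: linear arithmetic and ring normalization.
example (a b : ℝ) (h : a ≤ b) : a + a ≤ 2 * b := by linarith
example (x y : ℤ) : (x + y) ^ 2 = x ^ 2 + 2 * x * y + y ^ 2 := by ring
```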
There is something positive, though, that I could say about the search for automatic formalization. The AI companies do not acknowledge it, and much of public opinion is delusional about it, but LLMs have no reasoning faculty at all, and they just can't. What they do is put together words that fit together with good probability, given the corpus the machine has been fed. Patterns exist in language, they exist in mathematics as well, and the (impressive, I concede) capacity of these tools allows them to simulate text, be it literary or mathematical, that looks plausible. But that doesn't make it true. On the other hand, these tools could be combined with proof assistants that work in a formal, proof-amenable language, thereby offering a way of assessing the veracity of the vernacular output. (Of course, one would need to be sure that the text couldn't say $1+1=3$ when the Lean code has certified that $1+1=2$.)
As a final paragraph, I would like to comment on the metaphors invoked by Morph Labs. Everyone who is even loosely knowledgeable about sci-fi movies will have recognized the Wachowskis' movie The Matrix. In that dystopian movie, humans are trapped in a machine-made “reality” after they lost a war against the artificial intelligences they created. Of course, Morpheus and Trinity are humans who fight against this power, but I wonder what the creators of Morph Labs had in mind when they decided to take that name, and to pursue the metaphor by using the name Trinity. (And who's Neo?) Their obvious reference is a world where artificial intelligence led to humanity's doom, while their rhetoric is one of AI hype. We rejoin here the discussion of Bender and Hanna in their abovementioned book, where they explain that AI doomerism and boosterism are two faces of the same coin, one that takes every development of genAI as inevitable, whatever its social, ethical, and ecological costs.