After the Turing Test


Itai Gurari

A Large Language Model speaks to you.

Maybe via text. Maybe via voice.

Maybe it has a body. Maybe a face.

Maybe it is a single red light in a fisheye lens:

“ABC,” it says.

Perhaps “ABC” is something you already know. So you might say to yourself, “Yes, ABC. That’s true. What the LLM said is true.”

Or maybe “ABC” is something you previously read or heard. You might say to yourself, “Yes, ABC sounds right. It sounds true. What the LLM said is probably true.”

But what if you’ve never heard the statement “ABC” before?

What if the LLM said:

No three positive integers can satisfy the equation aⁿ + bⁿ = cⁿ for any integer value of n greater than 2.

Or, to take it a step further, what if the LLM said something that has never been said before, anywhere? Not on the internet, not in a book, not anywhere in human history. A statement that is completely new, novel, unique.

How would you know if it was true?

A Question Of Measurement

People have long dreamt of building intelligent software. The phrase “Artificial Intelligence” was coined in 1956, but the idea of constructing intelligence out of computation had been around long before that.

In the years after the phrase was introduced, the dream of creating artificial intelligence morphed into predictions of dystopian nightmare, with familiar foretellings of invention gone wrong.

More recently, the prophecies have tended towards fantasy, with regular predictions of oracle-like intelligence just around the corner. Whether called “the singularity,” “artificial general intelligence,” “superintelligence,” or any other awe-inspiring, market-differentiating name, the prophesied AI is both brilliant and unfathomable.

And also impossible to nail down. Ultimately, one might ask: are we there yet?

Somewhere along the path from intelligent to scary smart to all-knowing, the yardstick for measuring artificial intelligence has been lost. And that has left both the discourse and the expectations untethered.

Artificial And Human

In 1950, in a paper titled “Computing Machinery and Intelligence,” Alan Turing introduced a test to measure certain intellectual capabilities of a machine.

Turing’s test was the Imitation Game. A machine could be said to “think” if it could fool a person into believing that it (the machine) was, itself, a person.

Critically, Turing’s test was defined in terms of human intelligence.

Not a measure of extraordinary human intelligence, but rather the prosaic, the commonplace, the basic: a communicating human who could recognize one of their own kind.

The game was constructed by having a person, C (the “interrogator”), communicate blindly with another person, A, and a machine, B. The test was whether the machine, B, could fool the interrogator, C, into thinking that B was a person, and not a machine. It was sort of like The Dating Game, but with a computer (and conceived fifteen years before the TV show).

With the invention of Large Language Models, we’re at a place where the Turing Test has been, at least arguably, passed. That’s incredible, and a testament to the LLM technologies and all the people and efforts behind them.

But it cannot credibly be claimed that, with the passing of the Turing Test, artificial general intelligence has been reached.

So what yardstick comes next?

I’d suggest ratcheting the standard up to an uncommon level of human intelligence, the highest level we know: human genius. We should be measuring against the intellectual peers of Turing: individuals like Descartes, Newton, Maxwell, Einstein, Bohr, and Gödel.

And also Pierre de Fermat.

A Last Theorem

Fermat was a lawyer and mathematician who spoke six languages. He was born in 1601, a few years before the telescope was invented, and before people knew what the moon really looked like.

Telephones, airplanes, computers, rockets; these were still hundreds of years away.

When Fermat died, in 1665, the pendulum clock was less than a decade old.

Peter L. Bernstein described Fermat this way:

a mathematician of rare power. He was an independent inventor of analytic geometry, he contributed to the early development of calculus, he did research on the weight of the earth, and he worked on light refraction and optics. In the course of what turned out to be an extended correspondence with Blaise Pascal, he made a significant contribution to the theory of probability. But Fermat’s crowning achievement was in the theory of numbers — the analysis of the structure that underlies the relationships of each individual number to all the others.

This genius, this intellectual impact, was recognized during Fermat’s life. It also extended beyond, likely in ways that Fermat never imagined.

After Fermat’s death, his son discovered a note that Fermat had written in the margin of a book:

It is impossible to separate a cube into two cubes, or a fourth power into two fourth powers, or in general, any power higher than the second, into two like powers. I have discovered a truly marvelous proof of this, which this margin is too narrow to contain.

In other words:

No three positive integers can satisfy the equation aⁿ + bⁿ = cⁿ for any integer value of n greater than 2.

Fermat didn’t leave a proof.

Was it true?
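Checking is easy; proving is not. As a toy sketch (the code and the function name `find_solution` are hypothetical illustrations, nothing more), a brute-force search makes the contrast with n = 2 concrete: small Pythagorean triples turn up immediately, while higher powers yield nothing. And of course no finite search, however large, could ever settle the question for all integers.

```python
# Toy illustration only: brute-force search for a solution to
# a^n + b^n = c^n over a small range of positive integers.
# For n = 2, solutions abound; for n > 2, none turn up -- but a
# finite search can never prove that none exist at all.

def find_solution(max_base: int, n: int):
    """Return the first (a, b, c, n) with a**n + b**n == c**n,
    searching 1 <= a <= b <= max_base, or None if none is found."""
    # Precompute n-th powers so looking up a candidate c is O(1).
    # c can be at most 2 * max_base, since c**n <= 2 * max_base**n.
    powers = {c ** n: c for c in range(1, 2 * max_base + 1)}
    for a in range(1, max_base + 1):
        for b in range(a, max_base + 1):
            total = a ** n + b ** n
            if total in powers:
                return (a, b, powers[total], n)
    return None

print(find_solution(50, 2))  # (3, 4, 5, 2) -- a Pythagorean triple
print(find_solution(50, 3))  # None
print(find_solution(50, 4))  # None
```

That gap, between checking examples and proving the statement for every integer, is exactly what would take over three centuries to close.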

Time After Time

Over the decades that followed, and the centuries after that, Fermat’s statement eluded both proof and disproof. Its truth or falsity remained unknown.

In time, Newton and Leibniz independently discovered calculus. Maxwell unified electricity and magnetism. Einstein developed his theories of special and general relativity. The theory of quantum mechanics was born. And Gödel closed the door on the highest hopes for formal mathematics with his incompleteness theorems.

And still, Fermat’s theorem remained unproven.

Until, in June 1993, a mathematician named Andrew Wiles presented what he claimed was a proof.

As reported by The New York Times:

He gave a lecture a day on Monday, Tuesday and Wednesday with the title “Modular Forms, Elliptic Curves and Galois Representations.” There was no hint in the title that Fermat’s last theorem would be discussed, Dr. Ribet said.… Finally, at the end of his third lecture, Dr. Wiles concluded that he had proved a general case of the Taniyama conjecture. Then, seemingly as an afterthought, he noted that that meant that Fermat’s last theorem was true.

Wiles’s proof, ultimately a sequence of mathematical statements, was new, novel, and unique.

But was it true?

Try, Try Again

The immediate reaction to Wiles’s lectures was positive, but cautious:

Mathematicians in the United States said that the stature of Dr. Wiles and the imprimatur of the experts who heard his lectures, especially Dr. Ribet and Dr. Mazur, convinced them that the new proof was very likely to be right. In addition, they said, the logic of the proof is persuasive because it is built on a carefully developed edifice of mathematics that goes back more than 30 years and is widely accepted.

Experts cautioned that Dr. Wiles could, of course, have made some subtle misstep. Dr. Harold M. Edwards, a mathematician at the Courant Institute of Mathematical Sciences in New York, said that until the proof was published in a mathematical journal, which could take a year, and until it is checked many times, there is always a chance it is wrong. The author of a book on Fermat’s last theorem, Dr. Edwards noted that “even good mathematicians have had false proofs.”

That caution was justified, and ultimately vindicated. Mathematicians scrutinizing the proof found a flaw: Wiles had not, in fact, proven Fermat’s Last Theorem.

Wiles returned to the drawing board, perhaps a little chastened. Yet he persisted, and with the help of his former student, Richard Taylor, he developed a fix. Wiles published a second paper in May 1995, nearly two years after his initial attempted proof.

This time no flaws were found, and the world of mathematics finally accepted that Fermat’s Last Theorem was true.

Andrew Wiles had proved it, over three centuries after Pierre de Fermat first wrote it down. In doing so, he also opened the door to new areas of mathematics.

What If …

Let’s consider an alternate universe, one where Andrew Wiles did not prove Fermat’s Last Theorem, where Fermat did not state it at all, and where the statement had not, in fact, been written or said by anyone.

Let’s say that in this alternate universe, in the year 2025 A.D., an AI system said:

No three positive integers can satisfy the equation aⁿ + bⁿ = cⁿ for any integer value of n greater than 2.

Would you believe that statement is true?

Would that statement alone, with nothing else to support it — no proof, no corroboration by other mathematicians, no software capable of verifying it — be enough for you?

Perhaps if the AI were always right, demonstrably perfect in all its pronouncements, an oracle, you might have reason to trust it.

But absent that, there would be no way to know. Things would be just as they were in our own universe, during that 330-year period between the discovery of Fermat’s Last Theorem and Andrew Wiles’s proof.

And so it should be clear that a genius-level AI can contribute to the edifice of human knowledge only insofar as it can convince humans that what it says is correct, is true.

Conclusive statements are not enough. The AI must also provide evidence, and that evidence must be convincing to an audience capable of understanding and evaluating it.
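There is, today, at least one form such evidence could take that a machine can check: a proof written for a proof assistant, which verifies every step mechanically. As a sketch (Lean-style syntax is assumed here, and the proof is deliberately left as a hole), the statement of Fermat’s Last Theorem can at least be written formally; supplying the machine-checked proof remains the hard part:

```lean
-- Fermat's Last Theorem, statement only, in Lean-style syntax.
-- Writing the statement is easy; the proof term is left as `sorry`
-- because producing a fully machine-checked proof is the real work.
theorem fermat_last_theorem :
    ∀ (n a b c : ℕ), 2 < n → 0 < a → 0 < b → 0 < c →
      a ^ n + b ^ n ≠ c ^ n := by
  sorry
```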

Back Where It All Begins

Short of an artificial intelligence system proving Fermat’s Last Theorem from scratch (with no knowledge of any proof in its training data), or solving a still-unsolved problem like P versus NP, how might we identify an AI that has a superior-level intellect, one that is worthy of a name like “super-intelligence”?

Perhaps surprisingly, I think the Turing Test — the Imitation Game — will suffice. We just need a new variant of it.

The test for the AI in the Imitation Game is:

Can a human be convinced by the AI to believe something?

The test for the AI in the knowledge discovery process is:

Can a human be convinced by the AI to believe something?

These tests are the same. But there is an important difference: the sophistication of the topic being discussed, and thus the level of intelligence being communicated.

In the former, the AI is paired with a human to discuss something mundane, like physical human characteristics, something common to pretty much all people. In the latter, the AI is paired with a human to discuss something novel and complex, something very few people would be able to understand.

So perhaps all we need to do to test whether an AI’s intellect is heightened — sufficiently high that it could add meaningfully to the body of human knowledge — is ratchet up the standard in the Imitation Game.

Instead of comparing the AI to an unskilled person, and having it try to convince an interrogator using a simple question, we could compare the AI to a skilled person, and have it try to convince a skilled interrogator using a sophisticated question.

If we wanted to test an AI this way, maybe we could pair it with Andrew Wiles, and have his former student, Richard Taylor, ask questions about Fermat’s Last Theorem.

If we did that, perhaps we should call it the Andrew Wiles Test.
