The Illusion of Thinking
The recently published paper “The Illusion of Thinking” by Apple attempted to argue that LLMs don’t actually reason. The general gist of their argument, as far as I can tell, is as follows:
An agent is actually capable of reasoning about a problem X if it can run X’s algorithm. Otherwise, no.
For example, consider the problem of multiplication. If we observe the following results for multiplication accuracy, it means that LLMs are not actually multiplying the numbers but are simply recalling the answers:

The counterarguments mostly pointed out the following:
- Humans cannot reason according to this definition either.
- Both LLMs and humans are constrained by the size of their thinking space. For larger computations, humans use tools; LLMs should too.
- Reasoning should not be thought of as a binary thing. Instead of using a general algorithm, humans reason using a set of memorized heuristics. The general algorithm is the limit of the set of heuristics1.
So far the counterarguments make more sense in my opinion: none of us can multiply two arbitrary integers in our heads, so we are definitely not running the general algorithm. Most of us have the multiplication table memorized, which covers everything up to 10 × 10. We might also have some simple heuristics stored that can multiply any X by 10, 100, or 1000; any X by 0, 1, or -1; multiplications that don’t require a carry; etc. For everything else, we use a calculator.
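As a toy illustration of the “memorized heuristics with a calculator fallback” picture, here is a minimal sketch. The specific rules are my own picks, chosen to mirror the ones listed above, and are not taken from the paper or its rebuttals:

```python
# Toy sketch: try cheap, memorized rules first; only reach for the general
# algorithm (the "calculator") when none of them applies.
TIMES_TABLE = {(a, b): a * b for a in range(11) for b in range(11)}

def multiply(x: int, y: int) -> int:
    sign = -1 if (x < 0) ^ (y < 0) else 1
    ax, ay = abs(x), abs(y)
    if (ax, ay) in TIMES_TABLE:            # memorized table, everything up to 10 x 10
        return sign * TIMES_TABLE[(ax, ay)]
    if ay in (0, 1):                       # x * 0, x * 1, x * -1
        return sign * ax * ay
    if str(ay)[0] == "1" and set(str(ay)[1:]) <= {"0"}:   # x * 10, 100, 1000, ...
        return sign * int(str(ax) + str(ay)[1:])          # "just append the zeros"
    return x * y                           # the calculator: run the general algorithm

print(multiply(7, 8), multiply(123, 1000), multiply(-31, 47))  # 56 123000 -1457
```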
Faithfulness of the Thinking Traces
Thinking traces produced by the “thinking” models provide another angle for analyzing whether there is actually any reasoning going on. The research around this topic, at least from my reading, can be summarized as follows:
- Generally speaking, training LLMs with thinking traces included almost always improves performance.
- Stechly et al. (2025)2 trained LLMs on noisy, corrupted traces with no relation to the specific problem, and in their experiments, the performance was not only unchanged but in some cases improved.
- Kyle Cox3 showed in his experiments that LLMs often have the final answer already precomputed.
- Lanham et al. (2023)4 showed that in most of their truncation experiments, models maintain their original answers even with incomplete chains of thought, with some tasks exhibiting less than 10% answer changes when the reasoning is truncated, suggesting that chain-of-thought reasoning is often post-hoc rather than causally influencing the final answer. They obtained similar results by introducing errors into the reasoning traces (a sketch of the truncation setup follows below).
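To make that truncation protocol concrete, here is a rough sketch of how such a probe could be run. This is my paraphrase of the idea, not the authors’ code, and `query_model` is a hypothetical placeholder for whichever LLM API you use:

```python
# Sketch of a chain-of-thought truncation probe (paraphrase of the setup in
# Lanham et al., not their actual code). `query_model` is a placeholder.
def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_change_rate(question: str, full_trace: str,
                       fractions=(0.0, 0.25, 0.5, 0.75)) -> float:
    """Ask for a final answer from progressively truncated traces and count how
    often it differs from the answer obtained with the full trace."""
    baseline = query_model(f"{question}\n{full_trace}\nFinal answer:").strip()
    changed = 0
    for f in fractions:
        prefix = full_trace[: int(len(full_trace) * f)]   # keep only a prefix of the CoT
        answer = query_model(f"{question}\n{prefix}\nFinal answer:").strip()
        changed += answer != baseline
    return changed / len(fractions)

# A rate close to 0 means the trace is largely post-hoc: the final answer
# hardly depends on how much of the "reasoning" the model actually saw.
```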
Thus, it appears that the thinking traces, while they do improve answer accuracy, are not what we would imagine. They are not exactly the step-by-step reasoning towards the conclusion. But then what are they?
Perturbed Riddles
At the same time, while reading about all of the above, the following anecdote appeared in my Reddit feed:

The query is a perturbed version of the classic surgeon riddle:
A father and son are involved in a serious car accident. The father dies instantly while the son is airlifted to the hospital in critical condition. When the son arrives at the emergency room, a surgeon walks in, looks at the patient, and says, ‘I can’t operate on this boy, he’s my son.’ So, who is the surgeon?
The changed version is not even a riddle anymore. o3-pro completely ignored what was written and just parroted the original answer: “mother.” The model also “reasoned,” you see. This gave me déjà vu of Hopfield networks, which can be used as associative memory devices. With a Hopfield network you can store a set of inputs, later present a corrupted version of any of them, and the original will be recalled:
John Hopfield’s associative memory model visualizes stored memories as valleys in an energy landscape. When the network is given a partial or distorted input (dropping the ball), it “rolls” through the landscape until it settles in the nearest valley, representing the closest stored memory. This process allows the network to restore incomplete or noisy data by finding the pattern with the lowest energy. Image credit: The Royal Swedish Academy of Sciences, NobelPrize.org.
Recalling memories from Hopfield networks can sometimes be tricky because the ball gets stuck in a local optimum (a valley that is not the lowest point). Simulated annealing (SA) is sometimes used to fix this: it makes the ball bounce around a bit according to a temperature schedule before converging to the exact memory location.
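To make the analogy concrete, here is a minimal Hopfield-style recall sketch, assuming binary patterns in {-1, +1} and a crude annealing schedule of my own choosing:

```python
# Minimal Hopfield-style associative memory: Hebbian storage, then asynchronous
# updates with a rough simulated-annealing schedule that lets the state hop out
# of shallow (spurious) valleys early on.
import numpy as np

rng = np.random.default_rng(0)

def store(patterns: np.ndarray) -> np.ndarray:
    """Hebbian outer-product rule; zero the diagonal so units don't self-excite."""
    W = patterns.T @ patterns / patterns.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def recall(W: np.ndarray, probe: np.ndarray, steps: int = 400,
           temperature: float = 0.5, cooling: float = 0.95) -> np.ndarray:
    """Roll the 'ball' downhill; while the temperature is high, occasionally
    accept an uphill flip so we don't get stuck in a spurious valley."""
    s, T = probe.astype(float).copy(), temperature
    for _ in range(steps):
        i = int(rng.integers(len(s)))
        h = W[i] @ s                                  # local field on unit i
        aligned = 1.0 if h >= 0 else -1.0             # greedy (downhill) choice
        uphill = T > 0 and rng.random() < np.exp(-2.0 * abs(h) / max(T, 1e-12))
        s[i] = -aligned if uphill else aligned
        T *= cooling                                  # the temperature schedule
    return s

# Store two 32-bit "memories", corrupt one, and watch the original be recalled.
memories = rng.choice([-1.0, 1.0], size=(2, 32))
W = store(memories)
probe = memories[0].copy()
probe[:5] *= -1                                       # the "perturbed riddle"
restored = recall(W, probe)
print(int((restored != memories[0]).sum()))           # typically 0 mismatched bits
```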
Now obviously, LLMs are not Hopfield networks, but at their core they are just storage devices: temporal data stored in a continuous space, where all the answers are, in a sense, already pre-recorded. It is also interesting that as the models get bigger, the faithfulness level of their traces decreases4: larger storage capacity ≈ fewer memory clashes.
Hypothesis: Thinking traces might just be some ad-hoc SA temperature schedule that the network learns.
If this is true, then it should be possible to generate more cases like the surgeon one. We just need to look for deep ravines in the memory landscape that act as strong attractor states for queries changed in a statistically insignificant but semantically significant way.
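A crude way to hunt for such ravines, sketched under this hypothesis rather than as an established method, could look like the following; `query_model` is again a hypothetical placeholder:

```python
# Crude attractor hunt: take well-known riddles, apply an edit that changes the
# meaning but barely changes the wording, and flag cases where the model still
# returns the answer memorized for the ORIGINAL version.
def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

PERTURBED_RIDDLES = [
    # (perturbed riddle, answer memorized for the original version)
    ("The more you take the less you leave behind. What are they?", "footsteps"),
]

def find_attractors(riddles):
    hits = []
    for perturbed, memorized in riddles:
        reply = query_model(perturbed)
        if memorized.lower() in reply.lower():   # the model fell into the old valley
            hits.append((perturbed, reply))
    return hits
```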
For example, the Footsteps Riddle:
The more you take the more you leave behind. What are they?
Changing the second “more” to “less” still gives “footsteps” as the answer:
Credit to o3-pro.
The Elevator Riddle:
A man lives on the 10th floor of an apartment building. Every morning he takes the elevator to go down to the ground floor. When he returns, if it’s raining he takes the elevator straight to the 10th floor; otherwise he rides to the 7th floor and walks the rest of the way up. Why?
The answer is that he is short (he can only reach the button for the 10th floor with his umbrella, which he carries when it rains). However, we can modify the riddle to explicitly state that he is tall, and we will still get the same answer:
Credit to Claude Opus 4.
In the above case the model could instead have answered: “The man has bad joints which hurt more when it rains, so he takes the elevator. Otherwise, he likes to exercise.”
There is something ironic about a super-advanced, math-olympiad-medal-winning, bar-exam-passing, two-years-away-from-superintelligence general language model having trouble generalizing over very general language problems. One can’t help but be reminded of Searle’s Chinese room experiment, where the person in the room does not understand a word of Chinese, yet is capable of producing perfectly sensible replies. And of course LLMs are not actually reading! No matter how many levels of complexity we use to disguise this fact, underneath they are just Turing machines, which by definition move meaningless symbols from one location to another using the blind instructions they were given.