Why I'm excited about the Hierarchical Reasoning Model




Part of Figure 6 from the HRM paper. HRM is demonstrated on four tasks: ARC-AGI visual reasoning problems, Sudoku (Hard and Extreme difficulty), and maze navigation. After internal analysis, HRM produces a complete solution to each problem as a continuous stream of output tokens.

Since the arrival of ChatGPT, I’ve felt that progress in AI has been largely incremental. Of course, even ChatGPT was (and is) mostly just a scaled-up Transformer model, the architecture introduced by Vaswani et al. in 2017.

The belief that AI is advancing relatively slowly is a radical and unpopular opinion these days, and it’s been frustrating to defend that view while most of the Internet talks excitedly about the latest Large Language Models (LLMs).

But in my opinion the new Hierarchical Reasoning Model (HRM) by Wang et al. is a genuine leap forward. Their model is the first that seems to have the ability to think. In addition (or perhaps because of this capability), the model is also extremely efficient in terms of required training samples, trainable parameter count, and computational resources such as memory, because it uses relatively local gradient propagation rather than Back-Propagation-Through-Time (BPTT) or an equivalent.

What is thinking anyway?

Researchers have been trying for a while now to build models with increased reasoning capabilities. Some improvement has been achieved through Chain-of-Thought (CoT) prompting and related techniques, which induce LLMs to systematically analyze a problem while generating their response. Researchers have also tried adding “thinking tokens” to induce more deliberative output, and have designed “Agentic AI architectures”, which aim to let AIs tackle a broad range of user problems more independently. However, Agentic architectures almost always involve human engineers inventing a way to break down and frame broad problems into specific sub-problems which the AI will find more manageable, helping it remain task-focused. Who is doing the thinking here?

When using Machine Learning for complex problems, experience shows that training a model to decide how to approach a problem tends to work much better than designing an architecture which imposes a pre-determined approach. The reality is that we only use hand-designed Agentic architectures, thinking tokens and CoT prompting because we don’t know how to make the model think by itself in an open-ended yet task-focused way.

It’s true that LLM-based AI models are achieving extremely impressive results on reasoning tasks, including recently reaching gold-medal standard at an international math olympiad. These are genuinely difficult tasks. But I would still describe the process implemented by these models as inference, rather than thinking. Why is that?

LLMs always perform a pre-defined quantity of computation to produce each output token. These models can’t decide to spend additional iterations exploring alternative hypotheses, invalidating assumptions or discovering new avenues. There is some potential for competing hypotheses to transiently exist in the latent space across the model’s fixed stack of layers, but the maximum effective depth of “thought” is architecturally limited. The model also can’t decide to go back, rewrite its previous output and start again. Instead, the output is essentially locked in, token by token, as it is produced (see figure below).

Figure 2 from the HRM paper shows how Transformer architectural choices limit depth of thought for complex reasoning. Each series shows how model performance (accuracy) varies with the total number of trainable parameters. Scaling layer width shows no improvement with larger model size, because the architecture is limited in a way which prevents it from solving this problem (a Sudoku). Scaling the number of Transformer blocks (i.e. layers) does show improvement, until it saturates. The fixed architectural choice of the number of layers bounds the depth of reasoning the model can achieve.

In contrast, the new HRM model repeatedly alternates between iterations of two modules which operate at different levels of abstraction (High and Low). Each time the High-level network (H) updates, the Low-level network (L) iterates to convergence on a steady-state output*. H then updates conditional on the L output. Conceptually, you could imagine this as H producing a broad strategy, and L exploring the implications and detailed realization of that strategy. The results of that rollout are then made available to H to update the strategy, if necessary. Finally, when it’s ready, the model outputs the solution. Convergence dictates the end of the thought-process.
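To make the nested update scheme concrete, here is a minimal sketch in PyTorch. Everything below (module names, GRU cells standing in for the paper’s recurrent Transformer blocks, dimensions, iteration counts) is my own illustrative choice, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class TinyHRM(nn.Module):
    """Toy sketch of an HRM-style nested update loop (not the authors' code)."""

    def __init__(self, dim=128, n_high=4, n_low=8):
        super().__init__()
        self.n_high = n_high        # slow, high-level ("strategy") updates
        self.n_low = n_low          # fast, low-level iterations per high-level step
        # GRU cells stand in for the paper's recurrent Transformer blocks
        self.low_cell = nn.GRUCell(dim * 2, dim)   # sees the input plus current strategy
        self.high_cell = nn.GRUCell(dim, dim)      # sees the converged low-level state
        self.readout = nn.Linear(dim, dim)

    def forward(self, x):
        z_high = torch.zeros_like(x)   # high-level ("strategy") state
        z_low = torch.zeros_like(x)    # low-level ("detail") state
        for _ in range(self.n_high):
            # L iterates toward a (quasi) steady state under the current strategy
            for _ in range(self.n_low):
                z_low = self.low_cell(torch.cat([x, z_high], dim=-1), z_low)
            # H updates the strategy, conditioned on the low-level rollout
            z_high = self.high_cell(z_low, z_high)
        # the solution is only emitted once, after the outer loop finishes
        return self.readout(z_high)

model = TinyHRM()
answer = model(torch.randn(2, 128))    # a batch of two toy "problems"
```

The structural point is simply that L runs many fast inner iterations for every slow outer update of H, and the answer is read out only once the outer loop has finished.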

*In fact the number of iterations of the H and L networks is bounded, but the HRM paper demonstrates that this bound effectively acts as an upper limit which is sufficient for convergence. The paper also provides an HRM variant which successfully uses Reinforcement Learning (RL) to choose the number of iterations within these bounds, achieving comparable performance with a dynamically adjusted number of iterations, where the model itself decides when to stop (see figure below). It’s a step in the direction of open-endedness.

Figure 5b from the HRM paper (reproduced here) shows that HRM’s performance is comparable whether it uses an adaptive number of compute steps (ACT series) or simply a constant upper bound (Fixed M series). In the ACT variant, HRM “decides” how many steps M are required for a given problem.
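The halting logic of the adaptive variant can be pictured roughly as follows. This is a hedged sketch: `segment_fn` and `q_head` are hypothetical stand-ins, and the training of the halting head (which the paper handles with RL, as noted above) is omitted entirely:

```python
import torch
import torch.nn as nn

def run_with_act(segment_fn, z0, q_head, max_segments=16):
    """Run one high-level segment at a time; after each segment a small Q-head
    scores halt vs. continue, and the loop stops when halt wins or the upper
    bound is reached. Training of q_head (the paper uses RL) is not shown."""
    z = z0
    for m in range(1, max_segments + 1):
        z = segment_fn(z)                          # one H update plus its inner L iterations
        q_halt, q_continue = q_head(z).unbind(-1)  # two scores per example
        if m == max_segments or (q_halt > q_continue).all():
            break                                  # the model "decides" it has thought enough
    return z, m

# Toy usage with stand-in components (dimensions and modules are arbitrary)
dim = 128
segment_fn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
q_head = nn.Linear(dim, 2)                         # scores for [halt, continue]
z_final, steps_used = run_with_act(segment_fn, torch.zeros(1, dim), q_head)
```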

HRM is scalable and efficient because it’s “biologically plausible”

In my view, HRM is impressive not only because of its performance on the cited benchmark reasoning problems. In addition:

  • HRM does not require any pre-training (most current, state-of-the-art LLM-AI ARC-AGI models benefit from huge amounts of pre-training on other tasks — in fact they are trained for other tasks and then incidentally applied to ARC-AGI).
  • HRM is sample efficient, requiring only 1000 training examples per task (presumably this also means per ARC-AGI task, although I haven’t confirmed this). This is a key measure of performance which will become increasingly relevant in the future.
  • HRM uses relatively local and immediate gradient backpropagation, which is more biologically plausible.

Why would anyone care about biological plausibility or local gradients? This is actually a topic I’ve cared about for a long time, including having published an alternative to deep backpropagation through time (BPTT) by recurrently encoding historic latent states in the current state. You’ll notice the phrase “biologically plausible” occurs several times in that paper!

It’s tantalizing because the way deep backpropagation is implemented in LLMs requires a) lots of memory and b) precise synchronization and data-matching between distant synapses. It’s extremely difficult to imagine this occurring in human or animal brains, which suggests that there’s potentially a different (and better) way to do it. Incidentally, the most promising biologically-plausible implementation of an approximation of deep backpropagation using local gradients is this Predictive Coding paper by Millidge et al, although it doesn’t solve the implausibility of managing gradients over time.

The approximation of neural dendrites used in Artificial Neural Networks is a simplification of the multi-level, tree-like structure found in biological neurons; that structure would be better approximated by shallow neural networks of 2–3 layers (see here, here and here). It is more reasonable to believe that the kind of tightly synchronized feedback needed to implement some sort of error backpropagation exists within individual neurons rather than between them.

The other issue with deep BPTT is the time dimension. It’s even more of a stretch to believe that partial derivatives can be stored at each synapse and appropriately chained between layers when the model generates output over time and incurs losses over time, some of them very delayed. Storing the intermediate states needed to compute these gradients over multiple forward passes actually accounts for much of the memory consumed during training of big models, so here the biological plausibility of local and immediate gradients makes a huge difference to the practical scalability of the model. If HRM becomes popular, this will be a big deal.
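To illustrate the memory point, here is a rough sketch of the difference between backpropagating through a full inner loop and using a "local and immediate" one-step gradient. This illustrates the general idea only, not the paper’s exact gradient approximation (which rests on a fixed-point argument); the `step` function is a hypothetical stand-in for one recurrent update:

```python
import torch

def unroll_bptt(step, z, x, n_steps):
    """Full BPTT: every intermediate state stays in the autograd graph,
    so activation memory grows linearly with n_steps."""
    for _ in range(n_steps):
        z = step(z, x)
    return z

def unroll_one_step_grad(step, z, x, n_steps):
    """'Local and immediate' alternative: run most iterations without building a
    graph, then differentiate only through the final step. Activation memory is
    constant in n_steps. (An illustration, not the paper's exact derivation.)"""
    with torch.no_grad():
        for _ in range(n_steps - 1):
            z = step(z, x)
    return step(z.detach(), x)   # only this last step participates in backward

# Toy usage: `step` is a hypothetical recurrent update
W = torch.randn(64, 64, requires_grad=True)
step = lambda z, x: torch.tanh(z @ W + x)
x = torch.randn(8, 64)
loss = unroll_one_step_grad(step, torch.zeros(8, 64), x, n_steps=32).pow(2).mean()
loss.backward()   # gradient flows through a single step, not all 32
```

In the second variant the activation memory stays constant no matter how many iterations are run, because only the final step participates in the backward pass.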

Where should HRM go next?

Personally, I’m already trying to adapt HRM into an Agent which can be trained entirely by Reinforcement Learning (RL), rather than by supervised losses on static reasoning tasks. This will allow HRM to be applied to a wider range of problems, including ones where the environment is dynamic, changing, or partially observable.

It will be fascinating to try to observe an HRM-Agent make long-range plans, start to execute them, and then be forced to come up with new plans when circumstances change.

Since both on-policy and off-policy RL models can be trained step-by-step using only model estimates of the future value function, this should be compatible with the shallow and immediate feedback HRM requires for model training.
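As a sketch of why the two fit together: a one-step temporal-difference (TD) target needs only the current transition plus the model’s own estimate of the next state’s value, so nothing has to be backpropagated through earlier time steps. The `value_net` below is a hypothetical value head attached to an HRM-Agent, not anything from the paper:

```python
import torch
import torch.nn.functional as F

def td_value_loss(value_net, obs, reward, next_obs, done, gamma=0.99):
    """One-step TD(0) value loss: the target uses the network's own estimate of
    the next state's value, so training needs only the current transition and
    no backpropagation through earlier time steps."""
    v = value_net(obs).squeeze(-1)
    with torch.no_grad():
        v_next = value_net(next_obs).squeeze(-1)
        target = reward + gamma * (1.0 - done) * v_next
    return F.mse_loss(v, target)

# Toy usage with a stand-in value head
value_net = torch.nn.Linear(16, 1)
obs, next_obs = torch.randn(4, 16), torch.randn(4, 16)
reward, done = torch.randn(4), torch.zeros(4)
loss = td_value_loss(value_net, obs, reward, next_obs, done)
loss.backward()
```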

Watch this space!
