We at @HigherOrderCO decided to reproduce the HRM results because they are very interesting, especially when comparing the total compute time against other models/architectures (such as LLMs).
At first, we chose to run the smaller Sudoku-Extreme 9x9 experiment. We used one H200 GPU, and training took approximately one hour.
The training process was exactly the one described in the README, with:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=2000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0
and the evaluation with:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ loose-caracara/step_26040
The evaluation results were:
- 45.8% accuracy (roughly 10 percentage points below the 55% reported in the paper)
- perfect halting accuracy
- 27,275,266 parameters
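
For context, we read the accuracy number as exact-match over the full 9x9 grid, i.e. a puzzle counts as solved only when every cell is predicted correctly. A minimal sketch of that metric, assuming `(batch, 81)` integer tensors (function name and shapes are our illustration, not the repo's actual evaluation code):

```python
import torch

def exact_match_accuracy(preds: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of puzzles solved exactly.

    preds / targets: (batch, 81) integer tensors of predicted and
    ground-truth cell values (shapes/names assumed for illustration).
    """
    # A puzzle counts as solved only if all 81 cells match.
    solved = (preds == targets).all(dim=-1)
    return solved.float().mean().item()
```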
Then, we started a run to reproduce the ARC-AGI-1 experiment. We used 8 H200 GPUs, and training took roughly 24 hours.
We built the dataset with:
python dataset/build_arc_dataset.py
And ran the training with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py
Finally, the evaluation with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>
The results were:
- ~25% accuracy (roughly 15 percentage points below the reported 40%)
- 58% halting accuracy
- 27,276,290 parameters
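
Both parameter counts above can be sanity-checked directly from a checkpoint. A minimal sketch, assuming the checkpoint is either a plain PyTorch state dict or wraps one under a `"model"` key (the path is a placeholder, and that key is a guess, not the repo's confirmed layout):

```python
import torch

# Placeholder path; point this at the checkpoint you evaluated.
ckpt = torch.load("<CHECKPOINT_PATH>", map_location="cpu")

# Assumption: either a raw state dict, or weights wrapped under "model".
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

total = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f"{total:,} parameters")
```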
With this, we successfully reproduced the HRM experiments, albeit with lower accuracy than reported.
Now, the only question that remains on my end is why we got roughly 10 points less on Sudoku and 15 points less on ARC. I saw a tweet from someone on the team saying the compute time for ARC was 50–200 hours (setup not shared, GPUs unspecified), so I assume they ran the training longer and/or slightly changed the setup.
Anyway, it is certainly interesting that we get ~25% with 960 examples and 24 hours of training time.