We at @HigherOrderCO decided to reproduce the HRM results because they are very interesting, especially when comparing the total compute time against other models/architectures (such as LLMs).
At first, we chose to run the smaller Sudoku-Extreme 9x9 experiment. We used one H200 GPU, and training took approximately one hour.
The training process was exactly the one described in the README, with:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=2000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0
and the evaluation with:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ loose-caracara/step_26040
The evaluation results were:
- 45.8% accuracy (roughly 10 percentage points below the 55% reported in the paper)
- perfect halting accuracy
- 27,275,266 parameters
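
For context, we read the accuracy number as exact-match over the full 9x9 grid, i.e. a puzzle counts as solved only when every cell is predicted correctly. A minimal sketch of that metric, assuming `(batch, 81)` integer tensors (function name and shapes are our illustration, not the repo's actual evaluation code):

```python
import torch

def exact_match_accuracy(preds: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of puzzles solved exactly.

    preds / targets: (batch, 81) integer tensors of predicted and
    ground-truth cell values (shapes/names assumed for illustration).
    """
    # A puzzle counts as solved only if all 81 cells match.
    solved = (preds == targets).all(dim=-1)
    return solved.float().mean().item()
```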
Then, we started a run to reproduce the ARC-AGI-1 experiment. We used 8 H200 GPUs, and training took roughly 24 hours.
We built the dataset with:
python dataset/build_arc_dataset.py
And ran the training with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py
Finally, the evaluation with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>
The results were:
- ~25% accuracy (roughly 15 percentage points below the reported 40%)
- 58% halting accuracy
- 27,276,290 parameters
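
Both parameter counts above can be sanity-checked directly from a checkpoint. A minimal sketch, assuming the checkpoint is either a plain PyTorch state dict or wraps one under a `"model"` key (the path is a placeholder, and that key is a guess, not the repo's confirmed layout):

```python
import torch

# Placeholder path; point this at the checkpoint you evaluated.
ckpt = torch.load("<CHECKPOINT_PATH>", map_location="cpu")

# Assumption: either a raw state dict, or weights wrapped under "model".
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

total = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f"{total:,} parameters")
```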
With this, we successfully reproduced the HRM experiments, albeit with lower accuracy than reported.
Now, the only question that remains on my end is why we got roughly 10 points less on Sudoku and 15 points less on ARC. I saw a tweet from someone on the team saying the compute time for ARC was 50–200 hours (setup not shared, GPUs unspecified), so I assume they ran the training longer and/or slightly changed the setup.
Anyway, it is certainly interesting that we get ~25% with 960 examples and 24 hours of training time.