[Submitted on 11 Nov 2024 (v1), last revised 25 Mar 2025 (this version, v2)]
Abstract: Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT) -- temporarily updating model parameters during inference using a loss derived from input data -- as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines -- reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
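For intuition only, the sketch below illustrates the general test-time training loop the abstract refers to: briefly fine-tune on a task's in-context examples, answer the query, then restore the original weights. It is not the authors' recipe; the model name (`gpt2`, standing in for the paper's 8B-parameter LM), the demonstration format, and the plain AdamW full-parameter update with arbitrary step count and learning rate are all assumptions made for brevity.

```python
# Minimal sketch of test-time training (TTT) on in-context examples.
# Assumptions (not from the paper): a small Hugging Face causal LM ("gpt2"
# as a placeholder), demonstrations given as plain "input -> output" strings,
# and a simple AdamW loop over all parameters rather than the paper's recipe.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the paper's 8B-parameter LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def test_time_train(model, demos, steps=20, lr=1e-5):
    """Temporarily fine-tune the model on a task's in-context examples."""
    original_state = copy.deepcopy(model.state_dict())  # saved for later restore
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for demo in demos:
            inputs = tokenizer(demo, return_tensors="pt")
            # Standard language-modeling loss over the demonstration text.
            loss = model(**inputs, labels=inputs["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()
    return original_state

# Hypothetical novel task: reverse a string, shown via two demonstrations.
demos = ["Input: abc -> Output: cba", "Input: hello -> Output: olleh"]
query = "Input: world -> Output:"

saved = test_time_train(model, demos)
with torch.no_grad():
    out = model.generate(**tokenizer(query, return_tensors="pt"), max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))

model.load_state_dict(saved)  # discard the temporary update after inference
```

The key property is that the parameter update is transient: the loss is derived only from the current task's input examples, and the original weights are restored once the query has been answered.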
Submission history

From: Ekin Akyürek [view email]
[v1] Mon, 11 Nov 2024 18:59:45 UTC (2,622 KB)
[v2] Tue, 25 Mar 2025 03:36:21 UTC (3,340 KB)