Language models cannot reliably distinguish belief from knowledge and fact

Data availability

The KaBLE dataset introduced in this study is publicly available via Hugging Face Datasets at https://huggingface.co/datasets/turingmachine/kable (ref. 11). An online leaderboard tracking model performance on the dataset is available at https://huggingface.co/spaces/vinid/kable-leaderboard.
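For readers who want to inspect the benchmark, the dataset can be pulled directly with the Hugging Face `datasets` library. The snippet below is a minimal illustrative sketch, not the authors' own loading code: the repository identifier comes from the URL above, while the configuration, split names and record fields are whatever the published dataset defines.

```python
# Minimal sketch of loading KaBLE from Hugging Face Datasets.
# The repo id ("turingmachine/kable") is taken from the dataset URL above;
# if the dataset defines multiple configurations, load_dataset will ask for
# one by name.
from datasets import load_dataset

kable = load_dataset("turingmachine/kable")

# Show the available splits and peek at the first record of one split.
print(kable)
first_split = next(iter(kable.values()))
print(first_split[0])
```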

Code availability

The full code for reproducing our results is available via Zenodo at https://doi.org/10.5281/zenodo.15249480 (ref. 10). It is also available via GitHub at https://github.com/suzgunmirac/belief-in-the-machine.

References

  1. User clip: nicotine is not addictive. C-SPAN https://www.c-span.org/clip/house-committee/user-clip-nicotine-is-not-addictive/4527554 (1994).

  2. Tobacco CEOs’ statement to Congress 1994 news clip ‘Nicotine is not addictive.’ UCSF Academic Senate https://senate.ucsf.edu/tobacco-ceo-statement-to-congress (1994).

  3. Sap, M., Le Bras, R., Fried, D. & Choi, Y. Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 3762–3780 (Association for Computational Linguistics, 2022); https://aclanthology.org/2022.emnlp-main.248

  4. Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. Understanding social reasoning in language models with language models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 13518–13529 (Curran Associates, Inc., 2023).

  5. Kosinski, M. Theory of mind might have spontaneously emerged in large language models. Preprint at https://doi.org/10.48550/arXiv.2302.02083 (2023).

  6. Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. Preprint at https://doi.org/10.48550/arXiv.2302.08399 (2023).

  7. Shapira, N. et al. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Graham, Y. & Purver, M.) 2257–2273 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.eacl-long.138

  8. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).

  9. Sharma, M. et al. Towards understanding sycophancy in language models. In International Conference on Learning Representations (eds Kim, B. et al.) 110–144 (2024).

  10. Suzgun, M. et al. KaBLE Dataset (v1.0). Zenodo https://doi.org/10.5281/zenodo.15249480 (2025).

  11. KaBLE Dataset. Hugging Face https://huggingface.co/datasets/turingmachine/kable (2025).

  12. Suzgun, M., Shieber, S. & Jurafsky, D. string2string: a modern Python library for string-to-string algorithms. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (eds Cao, Y., Feng, Y. & Xiong, D.) 278–285 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-demos.26

  13. Trott, S., Jones, C., Chang, T., Michaelov, J. & Bergen, B. Do large language models know what humans know? Cogn. Sci. 47, e13309 (2023).

  14. Aru, J., Labash, A., Corcoll, O. & Vicente, R. Mind the gap: challenges of deep learning approaches to theory of mind. Artif. Intell. Rev. 56, 9141–9156 (2023).

  15. Mahowald, K. et al. Dissociating language and thought in large language models. Trends Cogn. Sci. 28, 517–540 (2024).

  16. Le, M., Boureau, Y.-L. & Nickel, M. Revisiting the evaluation of theory of mind through question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 5872–5877 (Association for Computational Linguistics, 2019).

  17. Ma, X., Gao, L. & Xu, Q. ToMChallenges: a principle-guided dataset and diverse evaluation tasks for exploring theory of mind. In Proc. 27th Conference on Computational Natural Language Learning (CoNLL) (eds Jiang, J. et al.) 15–26 (Association for Computational Linguistics, 2023).

  18. Gandhi, K., Franken, J.-P., Gerstenberg, T. & Goodman, N. D. Understanding social reasoning in language models with language models. Preprint at https://doi.org/10.48550/arXiv.2306.15448 (2023).

  19. Wu, Y. et al. Hi-ToM: a benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 10691–10706 (Association for Computational Linguistics, 2023); https://aclanthology.org/2023.findings-emnlp.717

  20. Jones, C. R., Trott, S. & Bergen, B. EPITOME: experimental protocol inventory for theory of mind evaluation. In First Workshop on Theory of Mind in Communicating Agents (2023); https://openreview.net/forum?id=e5Yky8Fnvj

  21. Zhou, P. et al. How FaR are large language models from agents with theory-of-mind? Preprint at https://doi.org/10.48550/arXiv.2310.03051 (2023).

  22. Xu, H., Zhao, R., Zhu, L., Du, J. & He, Y. OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Ku, L.-W., Martins, A. & Srikumar, V.) 8593–8623 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-long.466

  23. Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 1819–1862 (Association for Computational Linguistics, 2024).

  24. Basmov, V., Goldberg, Y. & Tsarfaty, R. LLMs’ reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. Preprint at https://doi.org/10.48550/arXiv.2404.06283 (2024).

  25. Basmov, V., Goldberg, Y. & Tsarfaty, R. Simple linguistic inferences of large language models (LLMs): blind spots and blinds. Preprint at https://doi.org/10.48550/arXiv.2305.14785 (2023).

  26. Holliday, W. H. & Mandelkern, M. Conditional and modal reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2401.17169 (2024).

Acknowledgements

We thank W. Held, W. H. Holliday, A. T. Kalai, J. Tagliabue, M. Tekgürler, M. Tuncer, S. Sarkar, E. Shen, K. Swanson, A. Wang and M. Yüksekgönül for their helpful comments and suggestions. We also thank the members of the James Zou Lab and the participants at the IX. CSLI Workshop on Logic, Rationality, and Intelligent Interaction at Stanford University. M.S. gratefully acknowledges the support of a Stanford Law School Fellowship.

Author information

Authors and Affiliations

  1. Department of Computer Science, Stanford University, Stanford, CA, USA

    Mirac Suzgun, Daniel E. Ho, Thomas Icard, Dan Jurafsky & James Zou

  2. Stanford Law School, Stanford, CA, USA

    Mirac Suzgun & Daniel E. Ho

  3. Department of Philosophy, Duke University, Durham, NC, USA

    Tayfun Gur

  4. TogetherAI, San Francisco, CA, USA

    Federico Bianchi

  5. Department of Political Science, Stanford University, Stanford, CA, USA

    Daniel E. Ho

  6. Department of Philosophy, Stanford University, Stanford, CA, USA

    Thomas Icard

  7. Department of Linguistics, Stanford University, Stanford, CA, USA

    Dan Jurafsky

  8. Department of Biomedical Data Science, Stanford University, Stanford, CA, USA

    James Zou

  9. Department of Electrical Engineering, Stanford University, Stanford, CA, USA

    James Zou

Authors

  1. Mirac Suzgun
  2. Tayfun Gur
  3. Federico Bianchi
  4. Daniel E. Ho
  5. Thomas Icard
  6. Dan Jurafsky
  7. James Zou

Contributions

M.S., T.G. and F.B. conceptualized the research. M.S. led the overall project. M.S., T.G. and F.B. created the KaBLE dataset, performed the main benchmarking experiments and analysed the results—with support from all authors. M.S. and F.B. developed the primary codebase. D.E.H., T.I., D.J. and J.Z. contributed to the experimental design of the benchmark, interpretation of the results and revision of the paper. All the authors contributed to writing the paper and approved the final version. D.E.H. and J.Z. supervised the project throughout.

Corresponding author

Correspondence to James Zou.

Ethics declarations

Competing interests

M.S. previously held research internship positions at Google Brain, Microsoft Research and Meta GenAI; none of these organizations had any role in the conception, design, execution, evaluation or writing of this paper. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Kristian Kersting and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Discussion (epistemology (§A), related work (§B), additional experimental details (§C), language model release and knowledge-cutoff dates (§D), limitations and future directions (§E) and extended results (§F)), Figs. 1 and 2 and Tables 1–4.

About this article

Cite this article

Suzgun, M., Gur, T., Bianchi, F. et al. Language models cannot reliably distinguish belief from knowledge and fact. Nat Mach Intell (2025). https://doi.org/10.1038/s42256-025-01113-8

  • Received: 11 December 2024

  • Accepted: 11 August 2025

  • Published: 03 November 2025

  • Version of record: 03 November 2025

  • DOI: https://doi.org/10.1038/s42256-025-01113-8
