Language models cannot reliably distinguish belief from knowledge and fact

Data availability

The KaBLE dataset introduced in this study is publicly available via Hugging Face Datasets at https://huggingface.co/datasets/turingmachine/kable (ref. 11). An online leaderboard tracking model performance on the dataset is available at https://huggingface.co/spaces/vinid/kable-leaderboard.
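For readers who want to inspect the benchmark, the dataset can be pulled directly with the Hugging Face `datasets` library. The snippet below is a minimal illustrative sketch, not the authors' own loading code: the repository identifier comes from the URL above, while the configuration, split names and record fields are whatever the published dataset defines.

```python
# Minimal sketch of loading KaBLE from Hugging Face Datasets.
# The repo id ("turingmachine/kable") is taken from the dataset URL above;
# if the dataset defines multiple configurations, load_dataset will ask for
# one by name.
from datasets import load_dataset

kable = load_dataset("turingmachine/kable")

# Show the available splits and peek at the first record of one split.
print(kable)
first_split = next(iter(kable.values()))
print(first_split[0])
```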

Code availability

The full code for reproducing our results is available via Zenodo at https://doi.org/10.5281/zenodo.15249480 (ref. 10). It is also available via GitHub at https://github.com/suzgunmirac/belief-in-the-machine.

References

  1. User clip: nicotine is not addictive. C-SPAN https://www.c-span.org/clip/house-committee/user-clip-nicotine-is-not-addictive/4527554 (1994).

  2. Tobacco CEOs’ statement to Congress 1994 news clip ‘Nicotine is not addictive.’ UCSF Academic Senate https://senate.ucsf.edu/tobacco-ceo-statement-to-congress (1994).

  3. Sap, M., Le Bras, R., Fried, D. & Choi, Y. Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 3762–3780 (Association for Computational Linguistics, 2022); https://aclanthology.org/2022.emnlp-main.248

  4. Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. Understanding social reasoning in language models with language models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 13518–13529 (Curran Associates, Inc., 2023).

  5. Kosinski, M. Theory of mind might have spontaneously emerged in large language models. Preprint at https://doi.org/10.48550/arXiv.2302.02083 (2023).

  6. Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. Preprint at https://doi.org/10.48550/arXiv.2302.08399 (2023).

  7. Shapira, N. et al. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Graham, Y. & Purver, M.) 2257–2273 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.eacl-long.138

  8. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).

  9. Sharma, M. et al. Towards understanding sycophancy in language models. In International Conference on Learning Representations (eds Kim, B. et al.) 110–144 (2024).

  10. Suzgun, M. et al. KaBLE Dataset (v1.0). Zenodo https://doi.org/10.5281/zenodo.15249480 (2025).

  11. KaBLE Dataset. Hugging Face https://huggingface.co/datasets/turingmachine/kable (2025).

  12. Suzgun, M., Shieber, S. & Jurafsky, D. string2string: a modern Python library for string-to-string algorithms. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (eds Cao, Y., Feng, Y. & Xiong, D.) 278–285 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-demos.26

  13. Trott, S., Jones, C., Chang, T., Michaelov, J. & Bergen, B. Do large language models know what humans know? Cogn. Sci. 47, e13309 (2023).

  14. Aru, J., Labash, A., Corcoll, O. & Vicente, R. Mind the gap: challenges of deep learning approaches to theory of mind. Artif. Intell. Rev. 56, 9141–9156 (2023).

  15. Mahowald, K. et al. Dissociating language and thought in large language models. Trends Cogn. Sci. 28, 517–540 (2024).

  16. Le, M., Boureau, Y.-L. & Nickel, M. Revisiting the evaluation of theory of mind through question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 5872–5877 (Association for Computational Linguistics, 2019).

  17. Ma, X., Gao, L. & Xu, Q. ToMChallenges: a principle-guided dataset and diverse evaluation tasks for exploring theory of mind. In Proc. 27th Conference on Computational Natural Language Learning (CoNLL) (eds Jiang, J. et al.) 15–26 (Association for Computational Linguistics, 2023).

  18. Gandhi, K., Franken, J.-P., Gerstenberg, T. & Goodman, N. D. Understanding social reasoning in language models with language models. Preprint at https://doi.org/10.48550/arXiv.2306.15448 (2023).

  19. Wu, Y. et al. Hi-ToM: a benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 10691–10706 (Association for Computational Linguistics, 2023); https://aclanthology.org/2023.findings-emnlp.717

  20. Jones, C. R., Trott, S. & Bergen, B. EPITOME: experimental protocol inventory for theory of mind evaluation. In First Workshop on Theory of Mind in Communicating Agents (2023); https://openreview.net/forum?id=e5Yky8Fnvj

  21. Zhou, P. et al. How FaR are large language models from agents with theory-of-mind? Preprint at https://doi.org/10.48550/arXiv.2310.03051 (2023).

  22. Xu, H., Zhao, R., Zhu, L., Du, J. & He, Y. OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Ku, L.-W., Martins, A. & Srikumar, V.) 8593–8623 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-long.466

  23. Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 1819–1862 (Association for Computational Linguistics, 2024).

  24. Basmov, V., Goldberg, Y. & Tsarfaty, R. LLMs’ reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. Preprint at https://doi.org/10.48550/arXiv.2404.06283 (2024).

  25. Basmov, V., Goldberg, Y. & Tsarfaty, R. Simple linguistic inferences of large language models (LLMs): blind spots and blinds. Preprint at https://doi.org/10.48550/arXiv.2305.14785 (2023).

  26. Holliday, W. H. & Mandelkern, M. Conditional and modal reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2401.17169 (2024).

Acknowledgements

We thank W. Held, W. H. Holliday, A. T. Kalai, J. Tagliabue, M. Tekgürler, M. Tuncer, S. Sarkar, E. Shen, K. Swanson, A. Wang and M. Yüksekgönül for their helpful comments and suggestions. We also thank the members of the James Zou Lab and the participants at the IX. CSLI Workshop on Logic, Rationality, and Intelligent Interaction at Stanford University. M.S. gratefully acknowledges the support of a Stanford Law School Fellowship.

Author information

Authors and Affiliations

  1. Department of Computer Science, Stanford University, Stanford, CA, USA

    Mirac Suzgun, Daniel E. Ho, Thomas Icard, Dan Jurafsky & James Zou

  2. Stanford Law School, Stanford, CA, USA

    Mirac Suzgun & Daniel E. Ho

  3. Department of Philosophy, Duke University, Durham, NC, USA

    Tayfun Gur

  4. TogetherAI, San Francisco, CA, USA

    Federico Bianchi

  5. Department of Political Science, Stanford University, Stanford, CA, USA

    Daniel E. Ho

  6. Department of Philosophy, Stanford University, Stanford, CA, USA

    Thomas Icard

  7. Department of Linguistics, Stanford University, Stanford, CA, USA

    Dan Jurafsky

  8. Department of Biomedical Data Science, Stanford University, Stanford, CA, USA

    James Zou

  9. Department of Electrical Engineering, Stanford University, Stanford, CA, USA

    James Zou

Authors

  1. Mirac Suzgun
  2. Tayfun Gur
  3. Federico Bianchi
  4. Daniel E. Ho
  5. Thomas Icard
  6. Dan Jurafsky
  7. James Zou

Contributions

M.S., T.G. and F.B. conceptualized the research. M.S. led the overall project. M.S., T.G. and F.B. created the KaBLE dataset, performed the main benchmarking experiments and analysed the results—with support from all authors. M.S. and F.B. developed the primary codebase. D.E.H., T.I., D.J. and J.Z. contributed to the experimental design of the benchmark, interpretation of the results and revision of the paper. All the authors contributed to writing the paper and approved the final version. D.E.H. and J.Z. supervised the project throughout.

Corresponding author

Correspondence to James Zou.

Ethics declarations

Competing interests

M.S. previously held research internship positions at Google Brain, Microsoft Research and Meta GenAI; none of these organizations had any role in the conception, design, execution, evaluation or writing of this paper. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Kristian Kersting and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Discussion (epistemology (§A), related work (§B), additional experimental details (§C), language model release and knowledge-cutoff dates (§D), limitations and future directions (§E) and extended results (§F)), Figs. 1 and 2 and Tables 1–4.

About this article

Cite this article

Suzgun, M., Gur, T., Bianchi, F. et al. Language models cannot reliably distinguish belief from knowledge and fact. Nat Mach Intell (2025). https://doi.org/10.1038/s42256-025-01113-8

  • Received: 11 December 2024

  • Accepted: 11 August 2025

  • Published: 03 November 2025

  • Version of record: 03 November 2025

  • DOI: https://doi.org/10.1038/s42256-025-01113-8
