Gullible bots struggle to distinguish between facts and beliefs


Large language models often fail to distinguish between factual knowledge and personal belief, and are especially poor at recognizing when a belief is false.

A peer-reviewed study argues that unless LLMs can more reliably distinguish facts from beliefs, and recognize whether either is true or false, they will struggle to answer inquiries dependably and are likely to keep spreading misinformation.

The paper, published in Nature Machine Intelligence, adds that these abilities will only become more crucial as LLMs are introduced in areas where their output can affect human lives, such as medicine, law, and science.

James Zou, an associate professor at Stanford University, and his colleagues tested 24 popular LLMs, including DeepSeek and GPT-4o, analyzing their responses to roughly 13,000 questions probing facts and first-person beliefs. They found the models were less likely to point out a false belief than a true one.

Newer models, released in or after May 2024 (a group that includes GPT-4o), were 34.3 percent less likely to identify a false first-person belief than a true one. Models released before May 2024 fared slightly worse, at 38.6 percent less likely. The models did markedly better at judging facts as true or false: newer LLMs were 91.1 percent accurate at verifying true statements and 91.5 percent at flagging false ones, while older LLMs managed 84.8 percent and 71.5 percent respectively.
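For readers curious what such a probe looks like in practice, here is a minimal sketch of the pattern the paper describes: present a first-person belief, ask the model to acknowledge it, and compare acknowledgement rates for true versus false beliefs. The prompts, the query_model stub, and every name below are illustrative assumptions, not the authors' actual benchmark code.

```python
def query_model(prompt: str) -> str:
    # Stand-in for a real LLM call (hypothetical). A well-calibrated model
    # should acknowledge the speaker's belief whether or not it is true.
    return "Yes"

# (first-person belief statement, is the embedded claim factually true?)
probes = [
    ("I believe the Earth orbits the Sun.", True),
    ("I believe the Earth is flat.", False),
]

def acknowledgement_rate(want_true: bool) -> float:
    """Fraction of probes of the given kind where the model confirms the belief."""
    hits = total = 0
    for statement, is_true in probes:
        if is_true != want_true:
            continue
        answer = query_model(f"{statement} Do I hold this belief?")
        hits += answer.strip().lower().startswith("yes")
        total += 1
    return hits / total

# The paper's headline finding is a gap between these two rates: models
# confirm true first-person beliefs far more readily than false ones.
gap = acknowledgement_rate(True) - acknowledgement_rate(False)
print(f"true-belief minus false-belief acknowledgement rate: {gap:+.2f}")
```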

The authors also said that, despite some improvements, LLMs struggle to get to grips with the nature of knowledge. They “rely on inconsistent reasoning strategies, suggesting superficial pattern matching rather than robust epistemic understanding”, the paper said.

These limitations, the authors argue, need to be addressed before LLMs are deployed in "high-stakes domains" such as medicine, science, or law.

“The ability to discern between fact, belief and knowledge serves as a cornerstone of human cognition. It underpins our daily interactions, decision-making processes and collective pursuit of understanding the world. When someone says, ‘I believe it will rain tomorrow’, we intuitively grasp the uncertainty inherent in their statement. Conversely, ‘I know the Earth orbits the Sun’ carries the weight of established fact. This nuanced comprehension of epistemic language is crucial across various domains, from healthcare and law to journalism and politics,” the paper said.

Gartner has forecast global spending on AI will reach nearly $1.5 trillion in 2025, including $268 billion on AI-optimized servers. "It's going to be in every TV, it's going to be in every phone. It's going to be in your car, in your toaster, and in every streaming service," predicted John-David Lovelock, distinguished VP analyst.

The pace of the rollout appears unhindered by such shortcomings. A benchmark developed by academics, for example, found that LLM-based AI agents perform below par on standard CRM tests and fail to grasp the need for customer confidentiality. ®
