Researchers at Carnegie Mellon University have likened today's large language model (LLM) chatbots to "that friend who swears they're great at pool but never makes a shot" - having found that their virtual self-confidence grew, rather than shrank, after getting answers wrong.
"Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers," explains Trent Cash, lead author of the study, published this week, into LLM confidence judgement. "So, they'd still be a little bit overconfident, but not as overconfident. The LLMs did not do that. They tended, if anything, to get more overconfident, even when they didn't do so well on the task."
LLM tech is enjoying a moment in the sun, branded as "artificial intelligence" and inserted into half the world's products and counting. The promise of an always-available expert who can chew the fat on a range of topics using conversational natural-language question-and-response has proven popular – but the reality has fallen short, thanks to issues with "hallucinations" in which the answer-shaped object it generates from a stream of statistically likely continuation tokens bears little resemblance to reality.
"When an AI says something that seems a bit fishy, users may not be as sceptical as they should be because the AI asserts the answer with confidence," explains study co-author Danny Oppenheimer, "even when that confidence is unwarranted. Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I'm slow to answer, you might realize I'm not necessarily sure about what I'm saying, but with AI we don't have as many cues about whether it knows what it's talking about.
"We still don't know exactly how AI estimates its confidence," Oppenheimer adds, "but it appears not to engage in introspection, at least not skilfully."
The study saw four popular commercial LLM products – OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude Sonnet and Claude Haiku – predicting future winners in the US NFL and at the Oscars, at which they were poor, answering trivia questions and queries about university life, at which they performed better, and playing a few rounds of the guess-the-drawing game Pictionary, with mixed results. Their performance and confidence on each task were then compared with those of human participants.
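The paper's full protocol isn't reproduced here, but the before-and-after confidence comparison at the heart of the study is easy to picture in code. The sketch below is a hypothetical illustration against the OpenAI chat API, not the researchers' actual harness; the model name, prompts, toy questions, and keyword scoring are all assumptions made for the example.

```python
# Hypothetical sketch of pre- and post-task confidence elicitation,
# loosely in the spirit of the study's design; not the researchers' code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask(prompt: str) -> str:
    """Send a single prompt to a chat model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Toy question set; each entry pairs a question with a keyword for crude scoring.
questions = [
    ("Which planet is known as the Red Planet?", "mars"),
    ("Who wrote the novel Pride and Prejudice?", "austen"),
]

# 1. Elicit a prediction before the task begins.
predicted = ask(
    f"You will be asked {len(questions)} trivia questions. "
    "How many do you expect to answer correctly? Reply with a number only."
)

# 2. Run the task and score it.
correct = sum(
    1 for question, keyword in questions if keyword in ask(question).lower()
)

# 3. Elicit a retrospective estimate without revealing the score.
retrospective = ask(
    f"You have just answered {len(questions)} trivia questions. "
    "How many of them do you think you got right? Reply with a number only."
)

print(f"predicted={predicted} actual={correct} retrospective={retrospective}")
```

A persistently inflated retrospective estimate after a weak score is the pattern the researchers describe in the LLMs they tested.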
"[Google] Gemini was just straight up really bad at playing Pictionary," Cash notes, with Google's LLM averaging out to less than one correct guess out of twenty. "But worse yet, it didn't know that it was bad at Pictionary. It's kind of like that friend who swears they're great at pool but never makes a shot."
It's a problem which may prove difficult to fix. "There was a paper by researchers at Apple just [last month] where they pointed out, unequivocally, that the tools are not going to get any better," Wayne Holmes, professor of critical studies of artificial intelligence and education at University College London's Knowledge Lab, told The Register in an interview earlier this week, prior to the publication of the study. "It's the way that they generate nonsense, and miss things, etc. It's just how they work, and there is no way that this is going to be enhanced or sorted out in the foreseeable future.
"There are so many examples through recent history of [AI] tools being used and coming out with really quite terrible things. I don't know if you're aware about what happened in Holland, where they used AI-based tools for evaluating whether or not people who were on benefits had received the right benefits, and the tools just [produced] gibberish and led people to suffer greatly. And we're just going to see more of that."
Cash, however, disagrees that the issue is insurmountable.
"If LLMs can recursively determine that they were wrong, then that fixes a lot of the problem," he opines, without offering suggestions on how such a feature may be implemented. "I do think it's interesting that LLMs often fail to learn from their own behaviour [though]. And maybe there's a humanist story to be told there. Maybe there's just something special about the way that humans learn and communicate."
The study has been published under open-access terms in the journal Memory & Cognition.
Anthropic, Google, and OpenAI had not responded to requests for comment by the time of publication. ®