AI terrible at knowing when it doesn’t know something
A new study reveals that AI chatbots appear to be unaware of their own mistakes
Published 11 months ago by Talker News
By Stephen Beech
AI chatbots are "overconfident" - even when they’re wrong, warns new research.
They appear to be unaware of their own mistakes, say scientists, prompting concerns about their increasing use.
Artificial intelligence chatbots are now commonplace, from smartphone apps and customer service portals to online search engines.
American researchers asked both human participants and four large language models (LLMs) - including ChatGPT - how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Oscar ceremonies, or play a Pictionary-like image identification game.
Both the people and the LLMs tended to be overconfident about how they would hypothetically perform.
They also answered questions or identified images with relatively similar success rates, according to the study published in the journal Memory & Cognition.
But when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations.

Study lead author Dr Trent Cash, of Carnegie Mellon University (CMU), Pittsburgh, said: “Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right.
“Typically, their estimate afterwards would be something like 16 correct answers.
“So, they’d still be a little bit overconfident, but not as overconfident.
“The LLMs did not do that.
“They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
He acknowledged that the world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging.
But a strength of the study was that the data was collected over the course of two years, which meant using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku.
Dr. Cash said this meant AI overconfidence was detectable across different models over time.
Co-author Professor Danny Oppenheimer, from CMU’s Department of Social and Decision Sciences, said: “When an AI says something that seems a bit fishy, users may not be as sceptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted.

“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans.
“If my brow furrows or I’m slow to answer, you might realise I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about.”
While the accuracy of LLMs at answering trivia questions and predicting American football results is relatively low stakes, the researchers say their findings hint at the "pitfalls" associated with integrating chatbot technology into daily life.
A recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had “significant issues” - including factual errors, misattribution of sources and missing or misleading context.
Another study from 2023 found LLMs “hallucinated” - or produced incorrect information - in 69% to 88% of legal queries.
The researchers say LLMs are not designed to answer everything users are throwing at them on a daily basis.
Oppenheimer said: “If I’d asked ‘What is the population of London?’ the AI would have searched the web, given a perfect answer and given a perfect confidence calibration.”
But, by asking questions about future events – such as the winners of the upcoming Academy Awards – or more subjective topics, such as the intended identity of a hand-drawn image, the research team were able to expose the chatbots’ apparent weakness in "metacognition" – the ability to be aware of one’s own thought processes.

Oppenheimer said: “We still don’t know exactly how AI estimates its confidence, but it appears not to engage in introspection, at least not skillfully.”
The study also showed that each LLM has strengths and weaknesses.
Overall, the LLM known as Sonnet tended to be less overconfident than its peers.
ChatGPT-4, meanwhile, performed similarly to human participants in the Pictionary-like trial, accurately identifying an average of 12.5 hand-drawn images out of 20, while Gemini could identify just 0.93 sketches, on average.
Gemini also predicted it would get an average of 10.03 sketches correct, and even after identifying fewer than one of the 20 sketches correctly, the LLM retrospectively estimated that it had got 14.4 right, demonstrating its lack of self-awareness.
Dr. Cash said, “Gemini was just straight up really bad at playing Pictionary.
“But worse yet, it didn’t know that it was bad at Pictionary.
“It’s kind of like that friend who swears they’re great at pool but never makes a shot.”
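
The gap between expectation and performance described above comes down to simple arithmetic. The following is a minimal sketch, assuming overconfidence is measured as the estimated score minus the actual score; the study itself may define it differently, and the function name is illustrative rather than taken from the research.

```python
def overconfidence(estimate: float, actual: float) -> float:
    """Estimated score minus actual score (an assumed, simplified measure).

    Positive values mean the estimate was higher than the real result.
    """
    return estimate - actual


# Averages reported in the article for Gemini on the 20-sketch task.
gemini_actual = 0.93
gemini_before = overconfidence(10.03, gemini_actual)  # prediction made beforehand
gemini_after = overconfidence(14.4, gemini_actual)    # estimate made afterwards

# Dr. Cash's illustrative human case: predicted 18, scored 15, later estimated 16.
human_before = overconfidence(18, 15)
human_after = overconfidence(16, 15)

print(f"Gemini gap: {gemini_before:.2f} before, {gemini_after:.2f} after")
print(f"Human example gap: {human_before} before, {human_after} after")
```

On these figures, the gap in the human example shrinks from three answers to one after the task, while Gemini's gap grows from roughly nine sketches to more than 13.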

For everyday chatbot users, Dr. Cash said the biggest takeaway is to remember that LLMs are not "inherently correct" and that it might be a good idea to ask them how confident they are when answering important questions.
The study suggests LLMs may not always judge their own confidence accurately, but if a chatbot does acknowledge low confidence, that is a good sign its answer should not be trusted.
The researchers say it is also possible that chatbots could develop a better understanding of their own abilities if given vastly larger data sets.
Oppenheimer said: “Maybe if it had thousands or millions of trials, it would do better.”
The research team says that exposing weaknesses such as overconfidence will help those in the industry who are developing and improving LLMs.
And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.
Dr. Cash said: "If LLMs can recursively determine that they were wrong, then that fixes a lot of the problem."
He added: “I do think it’s interesting that LLMs often fail to learn from their own behavior.
“And maybe there’s a humanist story to be told there.
“Maybe there’s just something special about the way that humans learn and communicate.”