When Did AI Eat a Cookie from Amsterdam? And How Would You Know?
I know the title is silly, but I couldn’t resist. I want to explore how often AI output is a hallucination, whether that changes by topic, and how you can even tell. Oh, so many questions. The fun part is that there are disclaimers when you use AI to generate results, but they are tiny, smaller than the cookie-policy notice that most of us never read. Now I hope you went to your favourite AI and checked whether I’m wrong, or whether that disclaimer really is there.
So, did you find that tiny disclaimer? Probably not. It’s the digital equivalent of that small, crumbly bite of a cookie that tells you the dough might be stale. That ‘stale dough’ is what researchers politely call a hallucination. It’s the AI confidently telling you a plausible-sounding lie that it totally fabricated—like claiming that cookie from Amsterdam was actually baked on the moon.
How often is AI accurate?
So what does the self-reported data show? I’ve based my research mainly on two articles:
AI Hallucination Report 2025: Which AI Hallucinates the Most?
Who’s the Most Delusional? The AI Hallucination Leaderboard Is Here
Let’s get straight to the numbers, because this is where the fun (and the fear) begins. The good news is that the industry is crushing it.
Back in 2021, AI models were hallucinating in nearly one out of every five responses, a terrifying 21.8% rate. Fast-forward to 2025, and the best of the bunch, the sleek and reliable Google Gemini-2.0-Flash-001, has an industry-leading hallucination rate of just 0.7%. That means it is accurate 99.3% of the time on benchmark tests! Other top-tier models like Gemini-2.0-Pro-Exp and OpenAI o3-mini-high are right behind it. That's a whopping 96% improvement in just a few years.
I love numbers, and a steep drop in the hallucination rate is exactly the kind of percentage I like to see. Woohoo.
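If you want to sanity-check that improvement figure yourself, the arithmetic is a one-liner. Here's a quick sketch using only the two rates quoted above:

```python
# Relative reduction in hallucination rate, using the figures quoted above.
old_rate = 21.8  # % of responses hallucinated in 2021
new_rate = 0.7   # % for the best 2025 model on the benchmark

improvement = (old_rate - new_rate) / old_rate * 100
print(f"Relative reduction: {improvement:.1f}%")  # ~96.8%, i.e. roughly 96%
```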
That 99.3% accuracy isn’t universal. Some models, like TII’s Falcon-7B-Instruct, are still chewing on those stale cookies, hallucinating almost 30% of the time. Ouch.
Remember that the 99.3% figure comes from benchmarking. The AI was given a document and told to summarize only from that document (a "controlled summarization task"). The researchers, like those at Vectara who developed the Hughes Hallucination Evaluation Model (HHEM), force the model to stay "grounded." When you ask it an open-ended question in the real world, like asking it to generate a whole article, the hallucination rate is almost certainly going to be higher. I have to highlight open-ended.
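To make that "controlled summarization task" concrete, here's a toy sketch of the benchmark loop. The `looks_grounded` check below is a naive word-overlap proxy I made up for illustration; the real HHEM is a trained judge model, and none of these names come from Vectara's code.

```python
# Toy illustration of a grounded-summarization benchmark: give the model a
# document, take its summary, and count summaries with unsupported content.
# The overlap check is a crude stand-in for a real judge model like HHEM.

def looks_grounded(source: str, summary: str, threshold: float = 0.8) -> bool:
    """Toy proxy: call the summary grounded if most of its words appear in the source."""
    source_words = set(source.lower().split())
    summary_words = summary.lower().split()
    if not summary_words:
        return True
    overlap = sum(w in source_words for w in summary_words) / len(summary_words)
    return overlap >= threshold

def hallucination_rate(summaries_by_doc: dict[str, str]) -> float:
    """Fraction of (document, summary) pairs flagged as ungrounded."""
    flagged = sum(not looks_grounded(doc, s) for doc, s in summaries_by_doc.items())
    return flagged / len(summaries_by_doc)
```

The point isn't the scoring trick; it's that the model is graded only on staying inside the source document, which is a much easier game than answering open-ended questions.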
Does the hallucination rate differ task by task?
Absolutely! It turns out our AI is better at talking about general knowledge than it is at specialized, high-stakes stuff. You might think, "knowledge is knowledge," but try asking it to write a legal brief versus a shopping list.
Look at the breakdown for the best-performing models (the sub-1% group):
General Knowledge: 0.8% hallucination rate
Coding & Programming: 5.2% hallucination rate
Legal Information: 6.4% hallucination rate
Notice the drop in accuracy. When the stakes get high, like dealing with legal contracts or writing code that might actually break something, reliability dips. In law, that lovely 0.8% hallucination rate jumps all the way up to 6.4%. It's like your lawyer AI suddenly forgetting the difference between 'shall' and 'may.'
This is because specialized knowledge is often less represented in the training data, and more nuanced, than general facts.
How does it compare to humans searching on their own?
This is the $1 million question (before inflation). We don't have a neat chart comparing "Average Human Fact-Checking Error Rate" to "AI’s Hallucination Rate." Why? Because humans and AIs make mistakes in fundamentally different ways.
A human mistake is usually a typo, a cognitive bias, or maybe they just had a terrible morning and misread a document (we’ve all been there). An AI hallucination, on the other hand, is a fluent, highly confident lie born from statistical probability. The model's goal is to predict the most plausible next word, not the most truthful next word.
According to the researchers, AI models are 34% more likely to use confident language (words like "definitely," "certainly," and "without a doubt") when they are generating incorrect information. It’s the digital equivalent of someone shouting a lie at you, making you believe it just because they sound so sure of themselves.
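If you want a crude guard against that tone, you can at least flag the certainty words before you start believing them. Here's a minimal sketch; the phrase list is my own illustrative pick, not the set the researchers measured:

```python
# Crude overconfidence flag: list the certainty phrases an answer leans on.
# The phrase list is an illustrative assumption, not taken from the research.
CONFIDENT_PHRASES = ("definitely", "certainly", "without a doubt",
                     "absolutely", "undoubtedly")

def flag_confident_language(answer: str) -> list[str]:
    """Return the certainty phrases found in an answer, as a nudge to verify it."""
    lowered = answer.lower()
    return [p for p in CONFIDENT_PHRASES if p in lowered]

print(flag_confident_language(
    "This cookie was definitely, without a doubt, baked on the moon."
))  # ['definitely', 'without a doubt']
```

A hit doesn't mean the answer is wrong, of course; it's just a reminder that confident phrasing and correctness are not the same thing.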
Because of this confident nonsense, we, the users, have to spend mental energy verifying every single answer, especially for high-stakes work. The research confirms that the time workers spend on verification can sometimes completely outweigh the time saved by using the tool in the first place. You are outsourcing the writing, but you are not outsourcing the responsibility. You are now the AI’s fact-checker, and you don’t get paid extra for it!
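To see how fast verification can eat the gains, a back-of-the-envelope calculation is enough. The minutes below are made-up numbers for illustration, not figures from the report:

```python
# Back-of-the-envelope: does the tool still save time once you verify the output?
# All three numbers are illustrative assumptions, not data from the research.
minutes_saved_drafting = 20   # time the AI saves you on the first draft
minutes_per_claim_check = 3   # time to verify one factual claim
claims_to_verify = 8          # claims in the draft that need checking

verification_cost = minutes_per_claim_check * claims_to_verify
net_saving = minutes_saved_drafting - verification_cost
print(f"Net time saved: {net_saving} minutes")  # -4: verification ate the gain
```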
Final Conclusion
So, am I done looking into numbers? Not even close! But here’s the wrap-up of what I learned from my deep dive into the numbers:
AI is getting spectacularly good: Hallucination rates are dropping fast, with the best models flirting with 99%+ accuracy. This is massive, market-moving progress.
Context is king (or queen): Don't trust the headline accuracy rate. When the subject gets complicated (Legal, Coding), the models get less reliable.
Watch the tone, not just the content: If the AI is being extra chatty and certain—tossing around "absolutely" and "undoubtedly"—put your guard up. It might just be trying to distract you while it’s inventing a fact about that Amsterdam cookie.
In the end, what about that small disclaimer we mostly ignore? It should be more prominent; the answer could even start with it. For now, the most crucial tool in your AI workflow isn't the model itself, but your own scepticism.
The golden rule remains: AI writes it, you own it, you must check it.