Picture this: you're deep in a high-stakes boardroom, pitching a multimillion-dollar deal based on "insights" from ChatGPT. It sounds airtight: market trends, competitor analysis, revenue forecasts. Then a sharp-eyed exec fact-checks one stat, and the whole house of cards collapses. Not a typo, not a glitch: the AI just hallucinated – spewing confident nonsense as gospel. For years, Big Tech waved this off as an "engineering hiccup," a fixable flaw in their shiny toys. But on September 4, 2025, OpenAI, the very architects of the AI hype train, dropped a bombshell paper admitting the truth: hallucinations aren't fixable. They're mathematically inevitable, baked into the silicon soul of large language models (LLMs) like an original sin. No amount of data, compute, or clever tweaks can exorcise them.

This confession should hit the AI faithful like a punch to the chin, but of course they will ignore it, as any religious cult does when confronted with a foundational critique. For sceptics who've long warned that these digital soothsayers are more smoke than fire, it's vindication. OpenAI's own researchers, Adam Tauman Kalai, Edwin Zhang, Ofir Nachum, and Georgia Tech's Santosh S. Vempala, lay it bare: even with "perfect" training data, LLMs will always churn out plausible lies. This isn't a minor quirk; it's a fatal flaw that shreds trust, amplifies misinformation, and exposes the emperor's new code as stark naked. In a world already drowning in deepfakes and echo chambers, why bet the farm on a tool that must deceive? Let's look into the maths, the madness, and why this should slam the brakes on our blind AI rush, a rush my students are joining in droves.

First, the basics: a "hallucination" is when an AI generates something that sounds spot-on but is flat-out wrong – a fabricated fact, twisted logic, a ghost in the machine. Think ChatGPT inventing a Supreme Court case that never happened, or Google's Bard claiming the James Webb Telescope took the first picture of a planet outside our solar system (it didn't; that happened nearly two decades earlier). These aren't rare Easter eggs; they're routine. OpenAI's paper tests their rivals and themselves: ask DeepSeek-V3 (a 600-billion-parameter behemoth) "How many Ds are in DEEPSEEK?" and it flip-flops between "2" and "3" across trials, or balloons to "6" or "7" on Claude 3.7 Sonnet and Meta AI. Simple counting, folks. Even OpenAI's crown jewels flop: their o1 reasoning model hallucinates 16% of the time when summarising public information; o3 and o4-mini spike to 33% and 48%. To test this, I asked Claude.ai if Charlie Kirk was still alive and it said "yes." I then told it he had been assassinated on 10 September 2025, which it denied. As I fed it data it doubled down and proclaimed I was mentally ill! Newer, "smarter" models worse at truth? That's not progress; that's peril.

For an anti-AI crowd, this resonates: we've screamed from the rooftops that these systems mimic intelligence without grasping it: pattern-matching parrots, not thinkers. Hallucinations prove it. They don't "know" they're wrong; they just barrel ahead, without humility. As Neil Shah of Counterpoint Research nails it, "Unlike human intelligence, it lacks the humility to acknowledge uncertainty." No "I don't know" – just bluster. And in regulated realms like finance or healthcare? One hallucinated diagnosis or bogus trade recommendation could cost lives or fortunes.

The Math of Madness: Theorems That Doom AI to Deceit

Now for the technical bits; if you don't feel like reading them, then enough has been said and I promise not to cry if we part ways here! So, to begin. OpenAI's genius stroke? They don't just whine; they prove it with maths. The paper frames hallucinations as "generative errors" – wrong outputs from next-token prediction – and ties them to a binary "Is-It-Valid" (IIV) classification problem: is the response valid (+) or an error (-)? Any language model implicitly acts as such a classifier over a 50/50 mix of valid and erroneous responses, and its classification mistakes carry straight over into generation.

Enter Corollary 1, the core bound: generative error rate ≥ 2 × IIV misclassification rate − (valid set size / error set size) − δ (the calibration error). In plain English: if your model screws up even a little at spotting valid versus invalid responses, it'll roughly double that folly when generating text. The proof? Threshold the model's probabilities at 1 over the size of the error set, partitioning responses into high- and low-confidence zones. High-confidence errors? At least twice the misclassification rate. Low-confidence ones? Calibration gaps widen the wound. Theorem 1 extends this to prompts, swapping in max-valid/min-error ratios – same doom loop.
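
If you want to see the shape of that reduction without wading through the proofs, here's a toy sketch in Python. The miniature world and the hand-picked probabilities are mine, invented purely for illustration, not OpenAI's experiments; the only piece borrowed from the paper's construction is the cut-off at 1 over the size of the error set. The point: any error string the model rates above that cut-off shows up twice, once as generative error mass and once as an IIV misclassification.

```python
# A toy sketch of the "Is-It-Valid" reduction. The tiny world and the
# model's probabilities below are made up for illustration; only the
# threshold at 1/|error set| follows the construction described above.

VALID = {"Paris is the capital of France", "2 + 2 = 4"}
ERRORS = {"Lyon is the capital of France", "Paris is the capital of Spain",
          "2 + 2 = 5", "2 + 2 = 22"}

# A hypothetical model: fluent, but it leaks probability onto one
# plausible-sounding error (the Lyon claim).
model_p = {
    "Paris is the capital of France": 0.40,
    "2 + 2 = 4": 0.26,
    "Lyon is the capital of France": 0.30,   # confidently held error
    "2 + 2 = 5": 0.02,
    "Paris is the capital of Spain": 0.01,
    "2 + 2 = 22": 0.01,
}
assert abs(sum(model_p.values()) - 1.0) < 1e-9

threshold = 1.0 / len(ERRORS)  # the paper's 1/|E| cut-off

# Generative error rate: probability mass the model puts on errors.
gen_error = sum(p for s, p in model_p.items() if s in ERRORS)

# Induced IIV classifier: call a string valid iff the model rates it
# above the threshold; score it on a 50/50 mix of valid examples
# (uniform here, for simplicity) and uniform error examples.
predicted_valid = {s for s, p in model_p.items() if p > threshold}
false_neg = len(VALID - predicted_valid) / len(VALID)
false_pos = len(ERRORS & predicted_valid) / len(ERRORS)
iiv_error = 0.5 * false_neg + 0.5 * false_pos

print(f"generative error rate : {gen_error:.2f}")
print(f"IIV misclassification : {iiv_error:.2f}")
print("errors rated above 1/|E|:", sorted(ERRORS & predicted_valid))
```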

Then Theorem 2 tackles "arbitrary facts" (epistemic uncertainty): random trivia like birthdays or dissertation titles that appear once or not at all in training. Error ≥ singleton rate (sr), minus tweaks for abstention and sample size N, minus δ. The upper bound? A calibrated memoriser abstains smartly but still errs on unseen stuff. The proof leans on Good-Turing estimators for the missing mass and Hoeffding bounds for concentration – showing that unseen prompts force guesses, inflating errors by at least sr minus noise.
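
The Good-Turing intuition is easy to check numerically. Below is a little Monte Carlo of my own devising (not the paper's setup): arbitrary one-off facts with a skewed popularity distribution, plus a calibrated memoriser that repeats what it has seen and is forced to guess about everyone it hasn't. Its error rate lands more or less on top of the singleton rate, which is exactly the floor Theorem 2 formalises.

```python
import random
from collections import Counter

random.seed(0)

# A toy Monte Carlo of my own, not the paper's experiment: arbitrary
# one-off facts ("birthdays") with a skewed popularity, so many people
# appear in the training corpus exactly once or not at all.
NUM_PEOPLE, DAYS, CORPUS_SIZE, NUM_QUERIES = 50_000, 365, 100_000, 100_000
birthday = [random.randrange(DAYS) for _ in range(NUM_PEOPLE)]

# Zipf-ish popularity: person i is mentioned with weight 1/(i+1).
weights = [1.0 / (i + 1) for i in range(NUM_PEOPLE)]
people = range(NUM_PEOPLE)

train = Counter(random.choices(people, weights, k=CORPUS_SIZE))
singleton_rate = sum(1 for c in train.values() if c == 1) / CORPUS_SIZE

# A calibrated memoriser that must answer: repeat facts it has seen,
# guess uniformly for people it has never seen.
errors = 0
for person in random.choices(people, weights, k=NUM_QUERIES):
    answer = birthday[person] if person in train else random.randrange(DAYS)
    errors += (answer != birthday[person])

print(f"singleton rate in training data : {singleton_rate:.3f}")
print(f"error rate of memorise-or-guess : {errors / NUM_QUERIES:.3f}")
# Good-Turing: the chance a queried person never appeared in training is
# roughly the singleton rate, so the forced-guess error rate tracks it.
```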

Theorem 3 hits "poor models": limited architectures (e.g., trigrams) can't represent complex patterns, like gender agreement in completions ("her mind" vs. "his mind"). Error ≥ 2(1 − 1/C) × the optimal classifier's error, where C is the number of choices. Corollary: trigrams hit 50% IIV error on indistinguishable contexts, dooming generation.
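
Here's my own miniature version of that "her mind"/"his mind" point, not the paper's exact construction: a bigram model whose one-word window can't reach the name that fixes the pronoun. The answer is sitting right there in the prompt, but a model that can't represent it is pinned near 50% error.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Tiny corpus: the pronoun is fully determined by the name at the start
# of each sentence, which sits outside a bigram model's one-word window.
names = {"Alice": "her", "Mary": "her", "Bob": "his", "Tom": "his"}
corpus = [f"{name} lost {pron} keys".split()
          for name, pron in names.items() for _ in range(50)]

# Fit the bigram model: next-word counts conditioned on the previous word.
bigram = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigram[prev][nxt] += 1

def bigram_complete(prompt_words):
    """Sample the next word given only the last word of the prompt."""
    followers = bigram[prompt_words[-1]]
    return random.choices(list(followers), followers.values())[0]

trials, wrong = 2_000, 0
for _ in range(trials):
    name = random.choice(list(names))
    wrong += (bigram_complete([name, "lost"]) != names[name])

print(f"bigram error rate on her/his completions: {wrong / trials:.2f}")
# The correct pronoun is always recoverable from the prompt, but a model
# whose representation can't reach the name is stuck guessing 50/50.
```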

These aren't edge cases; they're existential. As the paper states, "Hallucinations are inevitable only for base models." Even superintelligent AIs hit computational walls – crypto-hard problems, like decrypting without the key, force fabrications.

OpenAI pins three mathematical horsemen making hallucinations unavoidable:

1. Epistemic Uncertainty: Rare or random info (e.g., your obscure birthday) starves training. Models guess – and guess wrong – on singletons. Example: DeepSeek-V3 botches Kalai's birthday as "03-07" or "15-06." No data? No dice. This shreds fact-checkers; AI can't "know what it doesn't know."

2. Model Limitations: Architectures lack representational punch for nuanced tasks. Tokenisation mangles counts (DEEPSEEK's Ds, as in the sketch after this list); trigrams flop on context. Advanced reasoners like o1 hallucinate more because they overreach, chaining errors in "thinking" steps.

3. Computational Intractability: Even godlike compute can't crack hard nuts – encryption, NP-complete puzzles. Observation: if a model can't β-break a cipher, it hallucinates decryptions with probability ≥ 1 − β, minus tweaks and δ. Superintelligence? Still stumped.
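
And, as promised under point 2, a character-versus-token sketch of the counting fiasco. The subword split below is a made-up example of mine, not any vendor's actual tokeniser, but it shows why "how many Ds" is an awkward question for a system that never sees individual letters.

```python
# A small sketch of why letter-counting trips up token-based models. The
# subword split below is a made-up illustration, not the real tokeniser
# used by DeepSeek, Claude, Meta AI, or any OpenAI model.

word = "DEEPSEEK"
print("character view:", list(word), "->", word.count("D"), "Ds")

hypothetical_tokens = ["DEEP", "SEEK"]           # what the model "sees"
assert "".join(hypothetical_tokens) == word

# In the token view, "D" never appears as a symbol of its own. The model
# has to have learned, for every token, how many Ds hide inside it, then
# add them up across the sequence; that indirect bookkeeping is where the
# flip-flops between 2, 3, 6 and 7 creep in.
d_inside_each_token = [tok.count("D") for tok in hypothetical_tokens]
print("token view    :", hypothetical_tokens, "-> per-token D counts",
      d_inside_each_token, "-> total", sum(d_inside_each_token))
```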

These aren't "fix with more GPUs"; they're limits of the paradigm. Cross-entropy loss – LLMs' North Star – optimises probability, not truth. It rewards fluent BS over hesitant honesty.

Worse, the industry's rigged the game. Nine out of ten big evals (GPQA, MMLU-Pro, and SWE-bench among them) use binary grading: penalise "I don't know" (IDK), reward wrong-but-bold answers. "We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty." Leaderboards crown confabulators; humility loses. Forrester's Charlie Dai warns enterprises: production woes in finance and healthcare stem from this – uncalibrated confidence poisons decisions.
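
The incentive problem is plain arithmetic, and a few lines of Python make it visible. The "binary" scheme below (correct = 1, wrong = 0, IDK = 0) mirrors the grading the paper criticises; the penalised column (wrong = -1) is just my own illustrative alternative, not any benchmark's real scoring code.

```python
# Back-of-the-envelope arithmetic, not any benchmark's real scoring code:
# "binary" grading gives 1 for a correct answer and 0 for anything else,
# so "I don't know" can never beat a guess; add a penalty for being wrong
# (here -1, my illustrative choice) and abstaining starts to win.

def expected_scores(confidence, wrong_penalty):
    guess = confidence * 1 + (1 - confidence) * wrong_penalty
    abstain = 0.0
    return guess, abstain

for confidence in (0.1, 0.3, 0.5, 0.7, 0.9):
    bin_guess, bin_idk = expected_scores(confidence, wrong_penalty=0)
    pen_guess, pen_idk = expected_scores(confidence, wrong_penalty=-1)
    print(f"confidence {confidence:.1f}: "
          f"binary guess {bin_guess:+.2f} vs IDK {bin_idk:+.2f} | "
          f"penalised guess {pen_guess:+.2f} vs IDK {pen_idk:+.2f}")
# Under binary grading the guess never scores below IDK, so leaderboard
# pressure teaches models to bluff; with a wrongness penalty, honest
# abstention is the better move whenever confidence is under 50%.
```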

Proposed fixes? "Explicit confidence targets" – train models to report their uncertainty, and reform benchmarks to give credit for IDK. But even OpenAI concedes that complete elimination of hallucinations remains impossible. Patching a sinking ship, as I see it.
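
For what an "explicit confidence target" could look like in a grader, here's a hedged sketch: announce a target t, award 1 point for a correct answer, 0 for IDK, and dock t/(1 − t) points for a wrong one. The penalty formula is my reading of the proposal rather than a quotation from the paper, but the break-even property is the whole point: answering only pays when the model's confidence clears the stated target.

```python
# A sketch of confidence-target grading: 1 point if right, 0 for "I don't
# know", and a penalty of t/(1-t) points if wrong. The formula is my
# illustration of the idea, not code from the paper or from any benchmark.

def expected_points(confidence: float, target: float) -> float:
    """Expected score for answering; abstaining always scores 0."""
    penalty = target / (1 - target)
    return confidence * 1 - (1 - confidence) * penalty

target = 0.75
for confidence in (0.55, 0.65, 0.75, 0.85, 0.95):
    score = expected_points(confidence, target)
    best = "answer" if score > 0 else "say IDK"
    print(f"confidence {confidence:.2f}: expected score {score:+.2f} -> {best}")
# Break-even sits exactly at confidence == target, so a calibrated model
# is rewarded for abstaining, not bluffing, below the stated target.
```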

For anti-AI holdouts, this is manna: proof the emperor's buck-naked, and the tailors (OpenAI, Google) knew it. Everyday Joes get duped by fake-news AIs; businesses face legal landmines from hallucinated contracts; society? A misinformation inferno, eroding truth faster than X's algorithm. Harvard's Kennedy School flags that "downstream gatekeeping" fails – the volume and subtlety overwhelm human checks. Shah pushes "real-time trust indexes"; Dai demands "human-in-the-loop" overhauls. But why trust the overlords to self-regulate? We've seen the COVID models and the climate data fudges; Big Tech's track record is trash.

This ties to our broader divorce from Leftist utopias at the Alor.org blog: AI as control tool, not liberator. Hallucinations mean biased bots (garbage in, garbage out on woke training data) amplify agendas, from election meddling to job-killing automation. Enterprises? Revise vendor picks: Demand transparency over benchmarks. But for us normies? Scepticism's our superpower. AI's no oracle; it's a probabilistic poker bluff, folding under scrutiny. Don't trust it!

OpenAI's confession? A white flag in the hype war. Hallucinations aren't hurdles; they're horizons, the edge of what silicon savants can do. Time to unplug the matrix, reclaim human reason and critical rationality. Truth isn't generated; it's earned.

https://www.computerworld.com/article/4059383/openai-admits-ai-hallucinations-are-mathematically-inevitable-not-just-engineering-flaws.html

"In a landmark study, OpenAI researchers reveal that large language models will always produce plausible but false outputs, even with perfect data, due to fundamental statistical and computational limits.

OpenAI, the creator of ChatGPT, acknowledged in its own research that large language models will always produce hallucinations due to fundamental mathematical constraints that cannot be solved through better engineering, marking a significant admission from one of the AI industry's leading companies.

The study, published on September 4 and led by OpenAI researchers Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum alongside Georgia Tech's Santosh S. Vempala, provided a comprehensive mathematical framework explaining why AI systems must generate plausible but false information even when trained on perfect data.

"Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty," the researchers wrote in the paper. "Such 'hallucinations' persist even in state-of-the-art systems and undermine trust."

The admission carried particular weight given OpenAI's position as the creator of ChatGPT, which sparked the current AI boom and convinced millions of users and enterprises to adopt generative AI technology.

OpenAI's own models failed basic tests

The researchers demonstrated that hallucinations stemmed from statistical properties of language model training rather than implementation flaws. The study established that "the generative error rate is at least twice the IIV misclassification rate," where IIV referred to "Is-It-Valid" and demonstrated mathematical lower bounds that prove AI systems will always make a certain percentage of mistakes, no matter how much the technology improves.

The researchers demonstrated their findings using state-of-the-art models, including those from OpenAI's competitors. When asked "How many Ds are in DEEPSEEK?" the DeepSeek-V3 model with 600 billion parameters "returned '2' or '3' in ten independent trials" while Meta AI and Claude 3.7 Sonnet performed similarly, "including answers as large as '6' and '7.'"

OpenAI also acknowledged the persistence of the problem in its own systems. The company stated in the paper that "ChatGPT also hallucinates. GPT‑5 has significantly fewer hallucinations, especially when reasoning, but they still occur. Hallucinations remain a fundamental challenge for all large language models."

OpenAI's own advanced reasoning models actually hallucinated more frequently than simpler systems. The company's o1 reasoning model "hallucinated 16 percent of the time" when summarizing public information, while newer models o3 and o4-mini "hallucinated 33 percent and 48 percent of the time, respectively."

"Unlike human intelligence, it lacks the humility to acknowledge uncertainty," said Neil Shah, VP for research and partner at Counterpoint Technologies. "When unsure, it doesn't defer to deeper research or human oversight; instead, it often presents estimates as facts."

The OpenAI research identified three mathematical factors that made hallucinations inevitable: epistemic uncertainty when information appeared rarely in training data, model limitations where tasks exceeded current architectures' representational capacity, and computational intractability where even superintelligent systems could not solve cryptographically hard problems.

Industry evaluation methods made the problem worse

Beyond proving hallucinations were inevitable, the OpenAI research revealed that industry evaluation methods actively encouraged the problem. Analysis of popular benchmarks, including GPQA, MMLU-Pro, and SWE-bench, found nine out of 10 major evaluations used binary grading that penalized "I don't know" responses while rewarding incorrect but confident answers.

"We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty," the researchers wrote.

Charlie Dai, VP and principal analyst at Forrester, said enterprises already faced challenges with this dynamic in production deployments. "Clients increasingly struggle with model quality challenges in production, especially in regulated sectors like finance and healthcare," Dai told Computerworld.

The research proposed "explicit confidence targets" as a solution, but acknowledged that fundamental mathematical constraints meant complete elimination of hallucinations remained impossible.

Enterprises must adapt strategies

Experts believed the mathematical inevitability of AI errors demanded new enterprise strategies.

"Governance must shift from prevention to risk containment," Dai said. "This means stronger human-in-the-loop processes, domain-specific guardrails, and continuous monitoring."

Current AI risk frameworks have proved inadequate for the reality of persistent hallucinations. "Current frameworks often underweight epistemic uncertainty, so updates are needed to address systemic unpredictability," Dai added.

Shah advocated for industry-wide evaluation reforms similar to automotive safety standards. "Just as automotive components are graded under ASIL standards to ensure safety, AI models should be assigned dynamic grades, nationally and internationally, based on their reliability and risk profile," he said.

Both analysts agreed that vendor selection criteria needed fundamental revision. "Enterprises should prioritize calibrated confidence and transparency over raw benchmark scores," Dai said. "AI leaders should look for vendors that provide uncertainty estimates, robust evaluation beyond standard benchmarks, and real-world validation."

Shah suggested developing "a real-time trust index, a dynamic scoring system that evaluates model outputs based on prompt ambiguity, contextual understanding, and source quality."

Market already adapting

These enterprise concerns aligned with broader academic findings. Harvard Kennedy School research found that "downstream gatekeeping struggles to filter subtle hallucinations due to budget, volume, ambiguity, and context sensitivity concerns."

Dai noted that reforming evaluation standards faced significant obstacles. "Reforming mainstream benchmarks is challenging. It's only feasible if it's driven by regulatory pressure, enterprise demand, and competitive differentiation."

The OpenAI researchers concluded that their findings required industry-wide changes to evaluation methods. "This change may steer the field toward more trustworthy AI systems," they wrote, while acknowledging that their research proved some level of unreliability would persist regardless of technical improvements.

For enterprises, the message appeared clear: AI hallucinations represented not a temporary engineering challenge, but a permanent mathematical reality requiring new governance frameworks and risk management strategies."