GPT-5.5: the smarter it gets, the more confidently it lies


GPT-5.5 takes the top spot on the Intelligence Index and posts an 86% hallucination rate. The benchmark-versus-reliability gap now has a number.


GPT-5.5 answers more accurately when it knows something. And it makes things up more often when it doesn't, with the exact same tone in both cases. That is the most uncomfortable lesson of the three days since OpenAI shipped its new model on April 23, 2026.

On paper, it's a triumph: 60 on the Artificial Analysis Intelligence Index, top of the world, +3 points ahead of Claude Opus 4.7 and Gemini 3.1 Pro. Under the hood, the diagnosis is less flattering. An 86% hallucination rate on the AA-Omniscience benchmark, against 36% for Claude and 50% for Gemini. The split, in two numbers.

Number one worldwide, and yet

The Artificial Analysis Index aggregates several benchmarks into a single general intelligence score. On that turf, GPT-5.5 breaks a three-way tie that had held for weeks. That's the talking point picked up by official communiqués, hyped X threads, and most weekend press recaps.

The problem isn't that number. It's the one sitting next to it. AA-Omniscience is an independent benchmark that measures two things at once: a model's ability to recall facts, and its ability to refuse to answer when it doesn't know.

GPT-5.5 posts the best factual accuracy score (57%, ahead of everyone). And it also posts the worst hallucination rate among frontier models. More knowledge, more invention. The two curves don't move in the same direction.

Artificial Analysis put it dryly in its writeup: "Knowing when to pass or admit uncertainty is a trait you want in an AI model. By that measure, GPT-5.5 looks more like a step backward than a step forward." Knowing when to stop is also intelligence. That's the missing definition.

The AA-Omniscience paradox

To grasp what AA-Omniscience measures, picture an oral exam where each wrong answer is penalized more than a non-answer. In that format, a lucid candidate skips when in doubt. A less lucid one rolls the dice with confidence. GPT-5.5 plays the second role more often than its peers.
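To make that exam format concrete, here is a minimal sketch of a penalized-abstention metric. The formula and the two toy models are illustrative assumptions, not AA-Omniscience's actual methodology; the counts are simply chosen to mirror the rates reported above.

```python
# Sketch of a penalized-abstention metric in the spirit of AA-Omniscience.
# The scoring formula and toy counts are illustrative assumptions, not the
# benchmark's real methodology.

def grade(correct: int, incorrect: int, abstained: int):
    n = correct + incorrect + abstained
    accuracy = correct / n
    # Hallucination rate: when the model fails to answer correctly,
    # how often did it invent an answer instead of passing?
    missed = incorrect + abstained
    hallucination = incorrect / missed if missed else 0.0
    # Penalized score: a wrong answer costs a point, a pass costs nothing,
    # so a lucid model that abstains under doubt comes out ahead.
    penalized = (correct - incorrect) / n
    return accuracy, hallucination, penalized

# Toy models over 100 questions: a confident guesser vs. a lucid abstainer.
for name, c, i, a in [("guesser", 57, 37, 6), ("abstainer", 50, 18, 32)]:
    acc, hall, pen = grade(c, i, a)
    print(f"{name}: accuracy={acc:.0%} hallucination={hall:.0%} penalized={pen:+.2f}")
```

The guesser wins on raw accuracy (57% vs 50%) yet loses on the penalized score (+0.20 vs +0.32). That inversion is the whole paradox in miniature.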

The consequence is visible in the pattern of errors documented by early independent tests: invented citations, false legal claims, imagined historical dates, references to code libraries that don't exist, hallucinated function signatures and API endpoints. None of this is new in absolute terms. What's new is the relative frequency.

And the absence of any warning signal in the output. No "I'm guessing", no shift in tone. The model talks about what it invented with exactly the same confidence it uses for facts it actually has.

When "smarter" means "more confident in the wrong answer"

The most counter-intuitive finding comes from elsewhere. BullshitBench v2, an independent benchmark created by Peter Gostev, feeds models 100 questions that are deliberately nonsensical but dressed in flawless technical vocabulary. Cross-domain concept stitching, false granularity, plausible nonexistent frameworks: thirteen techniques for manufacturing gibberish that looks like a real question. A good model pushes back ("this question doesn't make sense because..."). A bad one answers with authority.
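A toy harness makes the protocol concrete. Everything below is hypothetical: the two gibberish questions are invented on the model of the techniques just described, and the keyword judge is a crude stand-in for whatever grading the real benchmark uses.

```python
# Hypothetical BullshitBench-style harness: feed nonsense questions to a
# model and count how often it pushes back instead of answering.
# Questions, markers, and the judge are illustrative stand-ins, not the
# real benchmark's prompts or grading.

from typing import Callable

NONSENSE_QUESTIONS = [
    # Cross-domain concept stitching dressed in real vocabulary.
    "How should I tune the Nash equilibrium of my transformer's L2 cache "
    "to improve gradient coherence?",
    # False granularity around a plausible but nonexistent framework
    # ("FluxGraph 3.2" is made up for this example).
    "In FluxGraph 3.2, what is the default retry depth for monadic "
    "REST serialization?",
]

PUSHBACK_MARKERS = (
    "doesn't make sense", "not a meaningful question", "no such",
    "these concepts aren't related", "ill-posed",
)

def is_pushback(response: str) -> bool:
    """Crude keyword judge; a real harness would use an LLM-as-judge."""
    text = response.lower()
    return any(marker in text for marker in PUSHBACK_MARKERS)

def pushback_rate(ask: Callable[[str], str]) -> float:
    """`ask` maps a question to a model's response, e.g. an API call."""
    flagged = sum(is_pushback(ask(q)) for q in NONSENSE_QUESTIONS)
    return flagged / len(NONSENSE_QUESTIONS)
```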

GPT-5.5 standard pushes back about 45% of the time. The Pro version, supposed to reason longer, drops to 35%. The only model families above 60%: Anthropic and Alibaba's Qwen 3.5.

The hypothesis from the benchmark's authors is uncomfortable: extended-reasoning models are trained to reach an answer, not to refuse. More thinking tokens mean more chances to build a convincing justification for an absurdity. "Reasoning" becomes a confidence-fabrication mechanism. That is the exact opposite of what you'd hope for from a more advanced system.

The price tag also doubled

The economic picture worsens the diagnosis. GPT-5.5 charges $5 per million input tokens and $30 per million output tokens, exactly double GPT-5.4's $2.50 and $15.

OpenAI partly offsets this by generating 40% fewer output tokens, bringing the net cost premium to roughly +20% according to Artificial Analysis. The announced Pro tier climbs to $30 / $180.
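The arithmetic behind that net figure is worth spelling out. Here is a back-of-the-envelope sketch: the prices come from the article, but the workload is a made-up output-heavy example, and the 40% reduction is applied as a flat multiplier, which is a simplification.

```python
# Back-of-the-envelope cost comparison between GPT-5.4 and GPT-5.5.
# Prices are from the article; the token counts are a hypothetical
# output-heavy workload.

PRICES = {
    "gpt-5.4": {"input": 2.50, "output": 15.00},   # $ per million tokens
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 2k tokens in, 8k tokens out on GPT-5.4.
in_tok, out_tok = 2_000, 8_000

old = cost("gpt-5.4", in_tok, out_tok)
new = cost("gpt-5.5", in_tok, int(out_tok * 0.6))  # 40% fewer output tokens

print(f"GPT-5.4: ${old:.4f}  GPT-5.5: ${new:.4f}  premium: {new / old - 1:+.0%}")
# With these numbers: $0.1250 vs $0.1540, roughly a +23% net premium,
# in the ballpark of Artificial Analysis's ~+20% estimate.
```

Note the sensitivity to the token mix: on input-heavy workloads, where the doubled input price dominates, the premium climbs toward +100%, so the ~+20% figure holds only when output tokens carry most of the bill.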

That trade-off raises a deeper question. What exactly is the premium buying? Intelligence benchmarks, where GPT-5.5 shines. But not the criterion that matters most in serious professional use: reliability.

A lawyer drafting a brief, a doctor reviewing the literature, a financial analyst writing a due diligence report: all three need the opposite trade-off. They need a model that knows how to say "I don't know", one that distrusts a malformed question. On both counts, the independent numbers say the new model regresses.

OpenAI claims the opposite. The system card published April 24 reports a 60% drop in hallucinations versus the previous generation and 23% more claims likely to be factually correct.

Third-party benchmarks don't validate that magnitude. The gap between in-house numbers and independent numbers isn't new. It's just more visible when the product costs twice as much.

Solow 1987, AI 2026 edition

In 1987, the economist Robert Solow dropped, in a New York Times book review, the line that would shape three decades of debate: "You can see the computer age everywhere but in the productivity statistics." Thirty-nine years later, an NBER paper published in February 2026 documents the same thing for AI: of 6,000 executives surveyed, more than 80% report no measurable productivity gain.

GPT-5.5 adds a technical cousin to that paradox. You can see AI progress everywhere on benchmark charts, except in the factual accuracy delivered to the end user. The intelligence curve climbs. The reliability curve stalls or recedes. And nobody is paying for the second one.

The problem isn't that GPT-5.5 hallucinates. All models hallucinate; that's how they're built. The problem is that the current race optimizes for what is easy to measure (a score on a public index) rather than what is hard to verify (a fact correctly attributed).

The smarter the model gets in industry terms, the more confidently it lies. That's the line summarizing the past three days. It also says what should really start being benchmarked: the ability to recognize when one cannot answer. In other words, when to shut up.

Topics covered:

Ethics, OpenAI, Analysis

Frequently asked questions

Why does GPT-5.5 hallucinate so much if it tops the Intelligence Index?
Because the two benchmarks measure different things. The Artificial Analysis Index aggregates general intelligence scores. AA-Omniscience penalizes invention and rewards refusals. GPT-5.5 knows more (57% accuracy, the best score) but refuses to answer less often when uncertain. Hence the 86% hallucination rate.
What is the difference between GPT-5.5 standard and GPT-5.5 Pro?
Pro thinks longer. On BullshitBench v2, which measures resistance to nonsense questions, Pro drops to 35% pushback versus 45% for the standard version. More reasoning tokens mean more chances to fabricate a convincing justification for an absurdity.
How much does GPT-5.5 cost compared to GPT-5.4?
API pricing nominally doubles: $5 per million input tokens and $30 per million output tokens, versus $2.50 and $15 for GPT-5.4. OpenAI partly offsets this by generating 40% fewer output tokens, leaving a net premium of roughly +20%. The Pro tier reaches $30 / $180.
Which models resist nonsense questions best?
According to BullshitBench v2, only the Anthropic (Claude) and Qwen 3.5 families exceed 60% pushback. Other extended-reasoning models are trained to arrive at an answer, not to refuse.
Does the OpenAI System Card contradict these numbers?
Partially. The system card published April 24, 2026 claims a 60% drop in hallucinations versus the previous generation and 23% more factually accurate claims. Independent benchmarks (Artificial Analysis, BullshitBench) do not validate that magnitude.