OpenAI's new reasoning AI models hallucinate more

OpenAI’s recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up — in fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, impacting even today’s best-performing systems. Historically, each new model has hallucinated slightly less than its predecessor. But that doesn’t seem to be the case for o3 and o4-mini.

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models — o1, o1-mini, and o3-mini — as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o.

Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up reasoning models. The two models perform better in some areas, including tasks related to coding and math. But because they “make more claims overall,” they’re often led to make “more accurate claims as well as more inaccurate/hallucinated claims,” per the report.

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That’s roughly double the hallucination rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA — hallucinating 48% of the time.
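
For readers who want to check the "roughly double" comparison, the arithmetic can be sketched in a few lines of Python. The figures are the PersonQA rates quoted above; the dictionary and the `ratio_to` helper are illustrative names introduced here, not anything from OpenAI's report.

```python
# PersonQA hallucination rates as quoted in this article (percent of questions).
personqa_rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}


def ratio_to(model: str, baseline: str) -> float:
    """Hypothetical helper: how many times higher `model`'s rate is than `baseline`'s."""
    return personqa_rates[model] / personqa_rates[baseline]


for baseline in ("o1", "o3-mini"):
    for model in ("o3", "o4-mini"):
        print(f"{model} vs {baseline}: {ratio_to(model, baseline):.2f}x")
```

On these numbers, o3 comes out a little over twice the rate of either older model, and o4-mini at about three times.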

Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT,” then copied the numbers into its answer. While o3 has access to some tools, it can’t do that.

“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to TechCrunch.

Sarah Schwettmann, co-founder of Transluce, added that o3’s hallucination rate may make it less useful than it otherwise would be.
