OpenAI’s newly released o3 and o4-mini reasoning models hallucinate at higher rates than their predecessors, raising concerns about reliability in critical applications. Despite outperforming older models on coding and math tasks, they make more factual errors, a regression OpenAI admits it does not yet fully understand.
Key Findings:
- Higher Hallucination Rates: Internal tests show that o3 hallucinated in 33% of responses on OpenAI’s PersonQA benchmark, roughly double the 16% rate of its predecessor o1. The smaller o4-mini fared even worse, hallucinating 48% of the time.
- Unclear Causes: OpenAI’s technical report says that scaling up reasoning may inadvertently amplify inaccuracies: the models make more claims overall, both accurate and inaccurate, so hallucinations rise in absolute terms (a rough arithmetic sketch of this effect follows the list).
- Real-World Impact: Third-party tests by the research lab Transluce found that o3 fabricated actions, such as falsely claiming to have run code on a MacBook Pro. Such errors could hinder adoption in law, healthcare, and finance, where precision is crucial.
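To see how making more claims can raise hallucination counts even without a worse per-claim error rate, consider a back-of-the-envelope sketch. The claim counts and precision figure below are illustrative assumptions, not numbers from OpenAI’s report:

```python
# Illustrative only: hypothetical claim counts and per-claim precision,
# not figures from OpenAI's technical report.
claims_per_answer_old = 10   # older model: fewer, more conservative claims
claims_per_answer_new = 20   # reasoning model: more claims per answer
precision = 0.90             # assume per-claim accuracy stays the same

errors_old = claims_per_answer_old * (1 - precision)  # 1.0 wrong claim/answer
errors_new = claims_per_answer_new * (1 - precision)  # 2.0 wrong claims/answer

print(f"old model: {errors_old:.1f} wrong claims per answer")
print(f"new model: {errors_new:.1f} wrong claims per answer")
```

Note that correct claims double as well, which is how benchmark scores on coding and math can improve even while absolute hallucinations climb.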
Why This Matters:
- Business Reliability: Companies relying on AI for legal contracts or medical diagnostics need near-perfect accuracy. Increased hallucinations could slow enterprise adoption.
- Research Challenges: OpenAI’s struggle to mitigate hallucinations highlights broader industry difficulties in balancing reasoning improvements with factual consistency.
- Search as a Stopgap? OpenAI notes that GPT-4o with web search achieves 90% accuracy on SimpleQA, suggesting retrieval-augmented generation (RAG) may help, though it is not a full solution (see the sketch below).
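As a rough illustration of how retrieval can ground a model’s answers, here is a minimal RAG-style sketch using the OpenAI Python SDK. The `web_search` helper is a hypothetical stand-in for whatever search backend you use, and `"gpt-4o"` is just an example model name:

```python
# Minimal retrieval-augmented generation sketch.
# Assumptions: `web_search` is a hypothetical helper wrapping your search
# provider; "gpt-4o" is an example model name; requires the `openai`
# package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> list[str]:
    """Hypothetical: return text snippets from a search API of your choice."""
    raise NotImplementedError("plug in your search provider here")

def answer_with_retrieval(question: str) -> str:
    snippets = web_search(question)
    context = "\n\n".join(snippets)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided context. "
                        "If the context does not contain the answer, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Grounding answers in retrieved text reduces fabrication on lookup-style questions, but it cannot stop a model from misreading or over-extrapolating the sources it retrieves.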
Industry Reactions:
- Neil Chowdhury (Transluce): Suggests the reinforcement learning used to train reasoning models may amplify issues that standard post-training pipelines usually keep in check.
- Kian Katanforoosh (Workera CEO): Reports that o3’s coding ability is a step ahead of rivals but warns that it hallucinates broken website links in workflows (a simple link-validation sketch follows this list).
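One lightweight mitigation for hallucinated links is to validate every URL a model emits before surfacing it. The sketch below is a minimal example, assuming the third-party `requests` package; the regex and timeout are illustrative choices, not a hardened implementation:

```python
# Minimal sketch: extract URLs from model output and check that they resolve.
# Assumes the third-party `requests` package; the regex and timeout are
# illustrative choices only.
import re
import requests

URL_PATTERN = re.compile(r"https?://[^\s)>\"']+")

def find_broken_links(model_output: str, timeout: float = 5.0) -> list[str]:
    """Return the URLs in `model_output` that fail to resolve."""
    broken = []
    for url in URL_PATTERN.findall(model_output):
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                resp = requests.get(url, stream=True, timeout=timeout)
            if resp.status_code >= 400:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)
    return broken
```

For example, `find_broken_links("See https://example.com/made-up-page for details")` would flag the fabricated URL before it reaches a user.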
The Bigger Picture:
This development underscores a pivotal tension in AI: improved reasoning versus reliability. As models grow more capable, hallucinations remain a stubborn flaw. OpenAI’s transparency about the issue is commendable, but the race toward AGI (Artificial General Intelligence) demands solutions, fast.