Discrepancies in OpenAI’s o3 AI Model Benchmarks Raise Transparency Concerns

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model has sparked questions about the company’s transparency and model evaluation practices.

When OpenAI introduced o3 in December, it claimed the model could correctly answer just over a quarter of the questions on FrontierMath, a challenging math problem set. That score far surpassed the competition: the next-best model answered only about 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” OpenAI’s chief research officer, Mark Chen, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 backed by more computing power than the model OpenAI publicly released last week.

Epoch AI, the research institute behind FrontierMath, published its independent benchmark results for o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed figure.

That doesn’t necessarily mean OpenAI misled anyone. The benchmark results the company published in December indicated a lower-bound score consistent with the one Epoch observed, and Epoch noted that its testing setup likely differs from OpenAI’s and that it used a more recent release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs. the 290 problems in frontiermath-2025-02-28-private),” Epoch explained.

According to a post on X from the ARC Prize Foundation, which tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s findings. “All released o3 compute tiers are smaller than the version we [benchmarked],” ARC Prize added. Generally speaking, larger compute tiers tend to produce better benchmark results.

OpenAI technical staff member Wenda Zhou said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed than the version demoed in December, which may account for the benchmark “disparities.”

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

That said, the fact that the publicly released o3 falls short of OpenAI’s initial testing claims is somewhat moot, since the company’s o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI plans to launch a more powerful variant, o3-pro, soon.

Nonetheless, the episode is a reminder that AI benchmark results should be viewed cautiously, especially when they come from a company marketing its own services. Benchmarking “controversies” have become increasingly common in the AI field as vendors compete for headlines and user attention with new models.

In January, Epoch was criticized for waiting to disclose OpenAI funding until after the o3 announcement; many academics who contributed to FrontierMath were unaware of OpenAI’s involvement until it was publicly revealed. More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its AI model Grok 3, and Meta admitted earlier this month to touting benchmark scores for a version of a model different from the one it made available to developers.

Updated 4:21 p.m. Pacific: Added comments from OpenAI technical staff member Wenda Zhou from last week’s livestream.
Brief news summary
OpenAI’s o3 AI model has sparked transparency concerns due to conflicting performance results on the FrontierMath benchmark. OpenAI claimed that o3 solved over 25% of the benchmark’s difficult math problems, far outperforming competitors, which scored under 2%. However, independent testing by Epoch AI put o3’s accuracy closer to 10%, in line with the lower-bound score OpenAI itself published in December. The discrepancy likely arises because OpenAI’s internal evaluations used a more powerful version of o3 with greater test-time compute, while the publicly released model is smaller and tuned for speed and chat use, reducing benchmark performance. Both the ARC Prize Foundation and OpenAI staff acknowledged these differences in size and tuning. Newer models such as o3-mini-high and o4-mini already outperform o3 on FrontierMath, but the episode highlights the need for skepticism toward AI benchmark claims, especially promotional ones. Similar transparency issues have surrounded Epoch itself, as well as AI developers such as xAI and Meta, underscoring ongoing challenges across the industry.
