April 20, 2025, 8:55 p.m.

Discrepancies in OpenAI’s o3 AI Model Benchmarks Raise Transparency Concerns

Brief news summary

OpenAI’s o3 AI model has sparked transparency concerns after conflicting performance results on the FrontierMath benchmark. OpenAI claimed that o3 solved over 25% of the benchmark’s difficult math problems, far outperforming competitors, which scored under 2%. However, independent tests by Epoch AI reported accuracy closer to 10%, consistent with the lower-bound figure OpenAI itself had published. The discrepancy appears to stem from OpenAI’s internal evaluations using a larger, more compute-intensive version of o3, while the publicly released model is smaller and tuned for speed, at some cost to benchmark performance. Both the ARC Prize Foundation and OpenAI staff have acknowledged these size and tuning differences. Newer models such as o3-mini-high and o4-mini demonstrate improvements, but the episode highlights the need for skepticism toward AI benchmark claims, especially promotional ones. Similar transparency issues have also surfaced around Epoch, xAI, and Meta, underscoring ongoing challenges across the AI industry.

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model has sparked questions about the company’s transparency and model evaluation practices. When OpenAI introduced o3 in December, it claimed the model could correctly answer just over a quarter of the questions on FrontierMath, a challenging math problem set. That score far surpassed competitors: the next-best model answered only about 2% of FrontierMath problems correctly. “Today, all offerings out there have less than 2% [on FrontierMath],” OpenAI’s chief research officer, Mark Chen, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

However, that figure likely represented an upper bound, achieved by a version of o3 backed by more compute power than the model OpenAI publicly released last week. Epoch AI, the research institute behind FrontierMath, published independent benchmark results for o3 on Friday and found it scored around 10%, well below OpenAI’s highest claimed figure.

This does not necessarily imply deception by OpenAI. The benchmark results the company published in December included a lower-bound score consistent with Epoch’s findings, and Epoch noted that its testing setup differed and that it used a more recent FrontierMath release. “The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs. the 290 problems in frontiermath-2025-02-28-private),” Epoch explained.

According to a post on X by the ARC Prize Foundation, which tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” which aligns with Epoch’s observations. “All released o3 compute tiers are smaller than the version we [benchmarked],” ARC Prize added. Generally, larger compute tiers produce better benchmark results.

OpenAI technical staff member Wenda Zhou said during a livestream last week that the production version of o3 is “more optimized for real-world use cases” and speed than the December demo version, which may account for the benchmark “disparities.” “[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

That said, the gap between the released o3 and OpenAI’s initial testing claims is somewhat moot: OpenAI’s o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and the company plans to launch a more powerful variant, o3-pro, soon. Nonetheless, the episode underscores that AI benchmark results should be treated with caution, especially when they come from companies marketing their own services. Benchmarking “controversies” have become increasingly common in the AI field as vendors compete for headlines and user attention with new models. In January, Epoch faced criticism for not disclosing OpenAI funding until after the o3 announcement; many FrontierMath academic contributors were unaware of OpenAI’s involvement until it was made public. More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its Grok 3 model, and earlier this month Meta admitted to promoting benchmark scores from a model version different from the one it made available to developers.

Updated 4:21 p.m. Pacific: Added comments from OpenAI technical staff member Wenda Zhou from last week’s livestream.
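Both Epoch and ARC Prize point to test-time compute as a likely driver of the score gap. As a rough intuition only, the short Python sketch below uses a simplified best-of-n scoring model with hypothetical numbers (not OpenAI’s or Epoch’s actual evaluation protocol) to show how the same underlying model can post very different benchmark scores when the evaluation budget allows more attempts per problem:

```python
# Hedged sketch: a simplified best-of-n scoring model with hypothetical
# numbers, NOT the protocol OpenAI or Epoch actually used. If a problem
# counts as solved when any of n independent attempts succeeds, the
# expected score is 1 - (1 - p)^n for per-attempt success probability p.

def expected_score(p: float, attempts: int) -> float:
    """Expected fraction of problems solved when any of `attempts`
    independent tries, each succeeding with probability p, counts."""
    return 1.0 - (1.0 - p) ** attempts

# A model that solves 10% of problems per attempt looks very different
# depending on how much test-time compute the evaluation allows:
for n in (1, 4, 16, 64):
    print(f"{n:>2} attempt(s) -> expected score {expected_score(0.10, n):.1%}")
```

Under these assumptions, a 10% per-attempt solve rate yields a 10% score with one attempt per problem but over 80% with sixteen, which is why the amount of test-time compute behind a headline benchmark figure matters so much.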

