Discrepancies in OpenAI’s o3 AI Model Benchmarks Raise Transparency Concerns

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model has sparked questions about the company’s transparency and model evaluation practices.

When OpenAI introduced o3 in December, it claimed the model could correctly answer just over a quarter of the questions on FrontierMath, a challenging math problem set. That score far surpassed the competition: the next-best model answered only about 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” OpenAI’s chief research officer, Mark Chen, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 backed by more computing power than the model OpenAI publicly released last week.

Epoch AI, the research institute behind FrontierMath, published its independent benchmark results for o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed figure.

That doesn’t necessarily mean OpenAI misled anyone. The benchmark results the company published in December indicated a lower-bound score consistent with the one Epoch observed, and Epoch noted that its testing setup likely differs from OpenAI’s and that it used a more recent release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs. the 290 problems in frontiermath-2025-02-28-private),” Epoch explained.

According to a post on X from the ARC Prize Foundation, which tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s findings. “All released o3 compute tiers are smaller than the version we [benchmarked],” ARC Prize added. Generally speaking, larger compute tiers tend to produce better benchmark results.

OpenAI technical staff member Wenda Zhou said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed than the version demoed in December, which may account for the benchmark “disparities.”

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

That said, the fact that the publicly released o3 falls short of OpenAI’s initial testing claims is somewhat moot, since the company’s o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI plans to launch a more powerful variant, o3-pro, soon.

Nonetheless, the episode is a reminder that AI benchmark results should be viewed cautiously, especially when they come from a company marketing its own services. Benchmarking “controversies” have become increasingly common in the AI field as vendors compete for headlines and user attention with new models.

In January, Epoch was criticized for waiting to disclose OpenAI funding until after the o3 announcement; many academics who contributed to FrontierMath were unaware of OpenAI’s involvement until it was publicly revealed. More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its AI model Grok 3, and Meta admitted earlier this month to touting benchmark scores for a version of a model different from the one it made available to developers.

Updated 4:21 p.m. Pacific: Added comments from OpenAI technical staff member Wenda Zhou from last week’s livestream.
Brief news summary
OpenAI’s o3 AI model has sparked transparency concerns due to conflicting performance results on the FrontierMath benchmark. OpenAI claimed that o3 solved over 25% of the benchmark’s difficult math problems, far outperforming competitors, which scored under 2%. However, independent testing by Epoch AI put o3’s accuracy closer to 10%, in line with the lower-bound score OpenAI itself published in December. The discrepancy likely arises because OpenAI’s internal evaluations used a more powerful version of o3 with greater test-time compute, while the publicly released model is smaller and tuned for speed and chat use, reducing benchmark performance. Both the ARC Prize Foundation and OpenAI staff acknowledged these differences in size and tuning. Newer models such as o3-mini-high and o4-mini already outperform o3 on FrontierMath, but the episode highlights the need for skepticism toward AI benchmark claims, especially promotional ones. Similar transparency issues have surrounded Epoch itself, as well as AI developers such as xAI and Meta, underscoring ongoing challenges across the industry.
