Discrepancies in OpenAI’s o3 AI Model Benchmarks Raise Transparency Concerns

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model has sparked questions about the company’s transparency and model evaluation practices.

When OpenAI introduced o3 in December, it claimed the model could correctly answer just over a quarter of the questions on FrontierMath, a challenging math problem set. That score far surpassed the competition; the next-best model answered only about 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” OpenAI’s chief research officer, Mark Chen, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

However, that figure was likely an upper bound, achieved by a version of o3 backed by more compute power than the model OpenAI publicly released last week.

Epoch AI, the research institute behind FrontierMath, published independent benchmark results for o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed figure.

That does not necessarily imply deception by OpenAI. The benchmark results the company published in December indicated a lower-bound score consistent with Epoch’s findings. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used a more recent release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs. the 290 problems in frontiermath-2025-02-28-private),” Epoch explained.

According to a post on X by the ARC Prize Foundation, which tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” which aligns with Epoch’s observations. “All released o3 compute tiers are smaller than the version we [benchmarked],” ARC Prize added. Generally speaking, larger compute tiers tend to produce better benchmark results.

Wenda Zhou, a member of OpenAI’s technical staff, said during a livestream last week that the production version of o3 is “more optimized for real-world use cases” and speed compared to the version demoed in December, which may account for the benchmark “disparities.”

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

That said, the fact that the publicly released o3 falls short of OpenAI’s initial testing claims is somewhat moot, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to launch a more powerful variant, o3-pro, soon.

Nonetheless, the episode underscores that AI benchmark results should be viewed cautiously, especially when they come from companies marketing their own services. Benchmarking “controversies” have become increasingly common in the AI industry as vendors compete for headlines and user attention with new models.

In January, Epoch was criticized for waiting to disclose its OpenAI funding until after the o3 announcement. Many academics who contributed to FrontierMath were unaware of OpenAI’s involvement until it was publicly revealed.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its AI model Grok 3. And earlier this month, Meta admitted to promoting benchmark scores for a version of a model different from the one it made available to developers.

Updated 4:21 p.m. Pacific: Added comments from OpenAI technical staff member Wenda Zhou, made during last week’s livestream.