
OpenAI unveiled o3 and o3-mini on December 20, 2024, as the final announcement in its "12 Days of OpenAI" event, with the new reasoning model achieving performance that CEO Sam Altman called a breakthrough toward artificial general intelligence. The model scored 96.7% on the 2024 American Invitational Mathematics Examination (AIME) and 87.5% on the ARC-AGI benchmark, surpassing the 85% threshold often cited as human-level performance.
Benchmark Performance
o3 demonstrated dramatic improvements across technical domains. On the competitive programming platform Codeforces, the model achieved an Elo rating of 2727 under high-compute settings, placing it among the top programmers globally; for context, a 2400 rating represents the 99.2nd percentile of human engineers. On SWE-Bench Verified, which focuses on real-world software engineering tasks, o3 scored 71.7%, more than 20 percentage points higher than its predecessor o1.
The model achieved 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions on which PhD-level experts typically score around 70%. On Epoch AI's FrontierMath benchmark, o3 solved 25.2% of problems, where previous models struggled to exceed 2%.
The ARC-AGI results generated particular excitement. The benchmark tests an AI system's ability to adapt to novel tasks requiring pattern recognition and logical reasoning. Previous models scored just 5% on this test, while o3 reached 75.7% on low-compute settings and 87.5% with high compute allocation, marking the first time an AI system has surpassed the benchmark's 85% human-level threshold.
Why o3, Not o2?
OpenAI skipped the "o2" designation to avoid a potential trademark conflict with the British telecommunications company O2. The naming decision reflects the intellectual property constraints OpenAI faces as it expands its product line globally.
The o3 family includes two variants: the full o3 model and o3-mini, a smaller distilled version optimized for coding tasks with lower computational requirements. o3-mini supports adjustable reasoning effort settings—low, medium, and high—enabling developers to balance performance against cost and latency for specific applications.
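The adjustable effort setting is exposed as a per-request choice rather than as separate models. A minimal sketch of how a developer might select a tier, assuming a chat-completions-style interface with a `reasoning_effort` parameter (the parameter name and request shape here are illustrative of the announced low/medium/high tiers, not confirmed API details):

```python
# Sketch: selecting o3-mini's reasoning-effort tier per task.
# The request is only constructed here, not sent, so the example runs offline.

VALID_EFFORTS = {"low", "medium", "high"}

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble chat-completion parameters with a reasoning-effort tier.

    `reasoning_effort` mirrors the announced low/medium/high settings;
    the exact field name is an assumption for illustration.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {sorted(VALID_EFFORTS)}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

# Same prompt, two cost/latency trade-offs: a cheap draft pass
# versus a slower, more deliberate pass for a hard coding problem.
draft = build_request("Refactor this function.", effort="low")
thorough = build_request("Refactor this function.", effort="high")
```

At "low", the model spends fewer reasoning tokens and responds faster; "high" trades latency and cost for deeper deliberation, which is the balance the announcement describes developers tuning per application.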
Safety Testing Approach
OpenAI opened applications for safety and security researchers to evaluate o3 and o3-mini through January 10, 2025, before broader release. This red-teaming process allows external experts to probe for vulnerabilities, biases, and potential misuse cases before public availability.
The company introduced "deliberative alignment" techniques enhancing the model's ability to reason explicitly over safety policies before generating responses. This approach integrates chain-of-thought reasoning into training, helping the model balance safety compliance with utility in everyday use.
Market Response
Industry leaders immediately praised the benchmark scores. OpenAI Product Chief Kevin Weil wrote on LinkedIn that "o3 is bonkers good, a massive step up from o1 on every one of our hardest benchmarks." Box CEO Aaron Levie posted that the model "appears to perform insanely well across benchmarks."
François Chollet, creator of the ARC-AGI benchmark, acknowledged o3's performance while cautioning that the model still fails on "very easy tasks," indicating fundamental differences from human intelligence. Chollet noted that upcoming benchmark versions will likely reduce o3's score significantly while humans would still score above 95%.
Computational Costs
The high-compute version of o3 that achieved 87.5% on ARC-AGI requires substantially more resources than the baseline model. Reports indicate the highly tuned configuration uses roughly 172 times the compute of the low-compute setting, raising questions about the economics of practical deployment.
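The scale of that multiplier is easier to see with back-of-envelope arithmetic. Assuming a hypothetical per-task baseline cost (actual pricing was not disclosed), the 172x factor compounds quickly:

```python
# Back-of-envelope: how a 172x compute multiplier scales per-task cost.
# The $20 baseline below is a hypothetical placeholder, not a disclosed price.

COMPUTE_FACTOR = 172  # reported multiplier for the high-compute configuration

def high_compute_cost(baseline_cost: float, factor: int = COMPUTE_FACTOR) -> float:
    """Scale a per-task baseline cost by the reported compute multiplier."""
    return baseline_cost * factor

per_task = high_compute_cost(20.0)   # hypothetical $20 low-compute task
print(f"${per_task:,.0f} per task")  # prints "$3,440 per task"
```

Even a modest baseline lands in the thousands of dollars per task at the high-compute setting, which is why the deployment-economics question follows directly from the benchmark result.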
Release Timeline
OpenAI indicated o3-mini will launch by late January 2025, with the full o3 model following shortly after. The phased rollout allows the company to incorporate safety testing feedback before general availability while maintaining competitive pressure on Google's recently released Gemini 2.0.
The timing positions OpenAI to compete directly with rivals releasing advanced models, particularly as the company navigates competitive threats that prompted Altman's recent "code red" memo to employees.
