Last Updated: March 23, 2026

Every Score, Every Benchmark, Honest Context

Benchmark scores are simultaneously the most-cited and most-misunderstood numbers in AI. OpenAI's o3 model has produced some of the most remarkable benchmark results in the history of the field - and those results require context to interpret correctly, because a model that scores 96.7% on a math competition and 2.9% on a different reasoning benchmark in the same week is telling you something important about what benchmarks actually measure.

This guide compiles every major o3 benchmark score in one place - AIME, GPQA Diamond, HumanEval, ARC-AGI, SWE-bench, FrontierMath, Codeforces, and more - with comparisons against o1, Claude, Gemini, and DeepSeek. More importantly, it explains what each benchmark actually tests and what the scores mean for teams evaluating whether o3 belongs in their AI stack.

I have spent four years advising executives on AI implementation. The question I get most often after a new model releases is not "what did it score?" but "does the score actually mean it will work better for what we need?" The answer is almost always "it depends on the benchmark and your use case." This guide helps you answer that question for o3.

🎯 Before you read on - we put together a free 2026 AI Tools Cheat Sheet covering the tools business leaders are actually using right now. Get it instantly when you subscribe to AI Business Weekly.


What Makes o3 Different from Previous OpenAI Models

Before the scores, it helps to understand why o3 produces different benchmark results than GPT-4o or o1.

OpenAI describes o3 as trained to "think" before responding using a private chain of thought - the model is trained with reinforcement learning to run through intermediate reasoning steps before generating an answer. Unlike standard language models that produce responses by pattern matching against training data, o3 actively works through a problem before committing to an answer.

This architecture change is why o3's benchmark performance looks so different from GPT-4o on reasoning-heavy tasks. It is also why o3 is slower and more expensive per query than standard models - it is burning more compute on each response to run that reasoning process. The tradeoff between reasoning depth, response speed, and cost is the central practical question for any team evaluating o3.

According to OpenAI's own release notes, o3 makes 20% fewer major errors than o1 on difficult real-world tasks, with particular improvements in programming, business consulting, and creative ideation. The benchmark scores below reflect that improvement in detail.

o3 Benchmark Scores: The Complete Results Table

This table compiles o3's scores across all major benchmarks with comparison to its predecessor o1.

| Benchmark | What It Tests | o3 Score | o1 Score | Improvement |
|---|---|---|---|---|
| AIME 2024 | Advanced math competition | 96.7% | 74.3% | +22.4 pts |
| AIME 2025 | Advanced math competition | 88.9% | 79.2% | +9.7 pts |
| GPQA Diamond | PhD-level science questions | 87.7% | 78.0% | +9.7 pts |
| HumanEval | Standard coding tasks | 81.3% | ~70% | +~11 pts |
| SWE-bench Verified | Real-world software engineering | 71.7% | 48.9% | +22.8 pts |
| ARC-AGI (low compute) | Novel visual reasoning patterns | 75.7% | ~25% | ~3x |
| ARC-AGI (high compute) | Novel visual reasoning patterns | 87.5% | ~25% | ~3.5x |
| ARC-AGI-2 | Next-gen reasoning benchmark | 2.9% | N/A | N/A |
| Codeforces Elo | Competitive programming | 2,727 | 1,891 | +836 pts |
| FrontierMath | Unseen research-level math | 25.2% | ~2% | ~12x |
| MMLU | General knowledge | 91.6% | 90.8% | +0.8 pts |
| MathVista | Visual math reasoning | 86.8% | 71.8% | +15 pts |

Math Benchmarks: AIME and FrontierMath

AIME: 96.7% on 2024, 88.9% on 2025

The American Invitational Mathematics Examination is one of the most respected math competition benchmarks in AI evaluation. It tests genuine mathematical problem-solving rather than recall of pre-trained answers - problems require multi-step reasoning that cannot be solved through pattern matching alone.

o3 scored 96.7% on AIME 2024, compared to o1's 74.3% - a substantial improvement that reflects the advantage of o3's extended reasoning architecture on complex multi-step problems. The smaller o4-mini achieves 99.5% pass@1 on AIME 2025 when given access to a Python interpreter, showing how tool access compounds o3-family performance on math tasks.

What this means practically: if your team uses AI for financial modeling, engineering calculations, actuarial work, or any domain involving complex quantitative reasoning, o3's AIME performance is a meaningful signal. The improvement over o1 is real and measurable on genuine problem-solving tasks - not just benchmark optimization.

FrontierMath: 25.2% - The Most Important Benchmark Nobody Talks About

FrontierMath is where the o3 story gets genuinely interesting and more honest. Developed by Epoch AI, FrontierMath tests AI on unseen, research-level mathematical problems - problems that cannot be solved through training data memorization because they are novel. o3's score of 25.2% on this benchmark is a leap ahead of previous state-of-the-art performance, where no other model had exceeded 2%.

That context matters enormously. A 25.2% score sounds low until you understand that the previous best was 2% and human mathematicians solve these problems routinely. o3 represents a genuine discontinuity in AI mathematical reasoning on this benchmark - 12x better than anything that came before it. It also means o3 fails on roughly 75% of the hardest math problems, which is important context for teams expecting perfect mathematical reasoning.

Science and Reasoning: GPQA Diamond

GPQA Diamond: 87.7%

GPQA Diamond - short for Graduate-Level Google-Proof Q&A, Diamond subset - tests PhD-level science knowledge in biology, chemistry, and physics using questions specifically designed to be unsearchable. The "Google-Proof" designation means answers cannot be found through web search, requiring genuine scientific understanding rather than information retrieval.

OpenAI reported that o3 achieved a score of 87.7% on the GPQA Diamond benchmark, which contains expert-level science questions not publicly available online. Human experts score approximately 70% on the same test, meaning o3 outperforms the average PhD scientist on this benchmark.

The business applications here are more specific than the math scores. For pharmaceutical companies, biotech teams, materials science researchers, or any organization doing knowledge work at the intersection of science and business, o3's GPQA performance is a genuine capability differentiator. For marketing teams writing blog posts, 87.7% on PhD-level chemistry questions is interesting but not particularly relevant to the day-to-day work.

Our ChatGPT vs Claude comparison covers how the two platforms compare on scientific and analytical tasks for business teams specifically.

Coding Benchmarks: SWE-bench and Codeforces

SWE-bench Verified: 71.7%

SWE-bench Verified tests AI on real GitHub issues from popular open-source repositories - actual software bugs and feature requests that real development teams filed. It is widely considered the most practically relevant coding benchmark because it tests performance on genuine production problems rather than algorithmic puzzles.

o3 achieved 71.7% on SWE-bench Verified, a software engineering benchmark assessing the ability to solve real GitHub issues, compared to 48.9% for o1 - a 22.8 percentage point improvement that represents a meaningful capability jump for software engineering applications.

For development teams, this score matters more than HumanEval. HumanEval tests whether a model can write clean code for well-defined problems. SWE-bench tests whether a model can diagnose and fix real bugs in real codebases - a much closer approximation of what developers actually need AI to do.

Codeforces Elo: 2,727

Codeforces is a competitive programming platform where humans compete to solve algorithmic problems under time pressure. An Elo rating of 2,727 places o3 in the top fraction of a percent of all competitive programmers globally - above the level of most professional software engineers and approaching international competitive programming champions.

On Codeforces, o3 reached an Elo score of 2,727, whereas o1 scored 1,891. For practical context, ratings of 1,400 to 1,800 are typical of experienced software engineers who actively compete on the platform. The implication is that for algorithmic problem-solving specifically, o3 outperforms the large majority of professional developers.
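For intuition on what an 836-point gap means, the standard Elo formula converts a rating difference into an expected head-to-head win probability. This is a general property of Elo-style rating systems, not a Codeforces-specific calculation:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# o3 (2,727) vs o1 (1,891): an 836-point gap
p = elo_win_probability(2727, 1891)
print(f"{p:.3f}")  # ~0.992 - o3 would be expected to win roughly 99% of contests
```

Under this model, the gap between o3 and o1 is not incremental: an 836-point difference implies near-certain victory in any single matchup.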

For a complete comparison of AI coding tools including Claude Code, Cursor, and GitHub Copilot alongside o3, our AI coding tools guide covers the full landscape.

💡 Finding this helpful? Get bite-sized AI news and practical business insights like this delivered free every morning at 7 AM EST.

General Intelligence: ARC-AGI and MMLU

ARC-AGI: 75.7% (low compute), 87.5% (high compute)

ARC-AGI - Abstraction and Reasoning Corpus for Artificial General Intelligence - is the benchmark most directly associated with the question of whether AI is approaching human-level general intelligence. Created by François Chollet in 2019, it tests pattern recognition in genuinely novel situations that cannot be solved through memorization or pattern matching from training data.

o3 scored 75.7% on the ARC-AGI visual reasoning benchmark in low-compute scenarios, impressive compared to human-level performance of 85%. At high compute, o3 achieves 87.5% - surpassing average human performance on a benchmark specifically designed to be resistant to AI pattern matching.

This is the score that generated the most discussion about AGI in late 2024 and early 2025. It is worth being precise about what it means. o3 exceeded average human performance on ARC-AGI-1. It did not demonstrate general intelligence. ARC Prize published results showing OpenAI's o3 scores 2.9% on the newer ARC-AGI-2 benchmark, compared to 60% for average humans - demonstrating that the newer, harder version of the test immediately exposed the limits of the capability. Progress on one benchmark version does not automatically transfer to harder variants of the same test.

MMLU: 91.6%

MMLU - Massive Multitask Language Understanding - is a broad general knowledge benchmark covering 57 academic subjects. It is one of the most widely cited benchmarks despite being relatively easy to saturate - most frontier models now score above 85%, limiting its discriminating power between top models.

o3 hits 91.6% on MMLU. This is strong performance but not dramatically different from competing frontier models. Google's Gemini 2.5 Pro scores 92% on MMLU, and Claude models are competitive in the 88-91% range. MMLU is no longer a meaningful differentiator at the frontier model level.

o3 vs Competitors: How It Compares

| Benchmark | o3 | Claude Opus 4.6 | Gemini 2.5 Pro | DeepSeek R1 |
|---|---|---|---|---|
| AIME 2024 | 96.7% | ~85%+ | ~90% | 79.8% |
| GPQA Diamond | 87.7% | 84.2% | 86.4% | 71.5% |
| HumanEval | 81.3% | 78.9% | 82.2% | ~80% |
| SWE-bench | 71.7% | 70.3% | ~65% | 49.2% |
| MMLU | 91.6% | ~88% | 92.0% | 90.8% |
| ARC-AGI-1 | 87.5% | ~50% | ~60% | N/A |

The competitive picture is genuinely close at the frontier. o3 holds a narrow lead on GPQA Diamond at 87.7% versus Gemini 2.5 Pro's 86.4%, while Gemini edges ahead on HumanEval at 82.2% versus 81.3%. Claude Opus 4.6 runs nearly level with o3 on SWE-bench at 70.3% versus 71.7%. DeepSeek R1 remains competitive on math and general reasoning benchmarks at a fraction of the cost.

The honest conclusion from the competitive data is that no single model dominates across all benchmarks. The best model for your use case depends on which benchmarks proxy most closely to your actual workflows - and what you are willing to pay per token.
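One way to operationalize that conclusion is to weight each benchmark by how closely it proxies your workflows, then compare models on the weighted score. A minimal sketch using scores from the comparison table above - the weights are hypothetical (a software-engineering-heavy team), and the "~" approximations are treated as point estimates:

```python
# Scores from the comparison table above; approximate ("~") values as point estimates.
scores = {
    "o3":              {"AIME": 96.7, "GPQA": 87.7, "HumanEval": 81.3, "SWE-bench": 71.7, "MMLU": 91.6},
    "Claude Opus 4.6": {"AIME": 85.0, "GPQA": 84.2, "HumanEval": 78.9, "SWE-bench": 70.3, "MMLU": 88.0},
    "Gemini 2.5 Pro":  {"AIME": 90.0, "GPQA": 86.4, "HumanEval": 82.2, "SWE-bench": 65.0, "MMLU": 92.0},
}

# Hypothetical weights for a team whose workload is mostly real-world coding.
weights = {"AIME": 0.1, "GPQA": 0.1, "HumanEval": 0.2, "SWE-bench": 0.5, "MMLU": 0.1}

def weighted_score(model_scores: dict, weights: dict) -> float:
    """Weighted average of benchmark scores; weights should sum to 1."""
    return sum(model_scores[b] * w for b, w in weights.items())

ranked = sorted(scores, key=lambda m: weighted_score(scores[m], weights), reverse=True)
for model in ranked:
    print(f"{model}: {weighted_score(scores[model], weights):.1f}")
```

Changing the weights changes the winner, which is exactly the point: a research-math-heavy weighting and a general-knowledge weighting produce different rankings from the same table.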

Per our OpenAI Statistics guide, OpenAI commands approximately 55% of the paid AI subscriber market despite this competitive landscape, suggesting that benchmark leadership is not the only factor driving adoption. Ecosystem, integrations, and user experience matter alongside raw benchmark performance.

o3 Pricing: What the Model Costs

Pricing matters as much as scores for real-world deployment decisions. o3's pricing has changed significantly since launch.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| o3 (standard) | $2.00 | $8.00 | 200K tokens |
| o3 (flex mode) | $5.00 | $20.00 | 200K tokens |
| o3-pro | $20.00 | $80.00 | 200K tokens |
| o3-mini | $1.10 | $4.40 | 200K tokens |
| o1 (for comparison) | $15.00 | $60.00 | 128K tokens |

OpenAI dropped o3 prices to $2 per million input tokens and $8 per million output tokens - a direct 80% reduction from the original $10/$40 pricing. This change, announced in March 2026, significantly altered the cost-benefit calculation for teams evaluating o3 against competing models.

One important cost consideration for API users: o-series models use reasoning tokens for internal thinking steps that are billed as output tokens but not visible in API responses - a 500-token visible response may consume 2,000+ total tokens. Teams budgeting for o3 API usage need to account for these invisible reasoning tokens in their cost estimates.
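A back-of-envelope estimator makes the hidden-token effect concrete. The sketch below uses the standard o3 pricing from the table above; the reasoning-token multiplier is a hypothetical illustration, since actual ratios vary by task and reasoning effort:

```python
# Standard o3 pricing (per the table above), in dollars per million tokens.
INPUT_PRICE = 2.00
OUTPUT_PRICE = 8.00

def estimate_o3_cost(input_tokens: int, visible_output_tokens: int,
                     reasoning_multiplier: float = 4.0) -> float:
    """Estimate the dollar cost of one o3 request.

    reasoning_multiplier is a hypothetical ratio of total billed output tokens
    (visible response + hidden reasoning) to visible output tokens.
    """
    billed_output = visible_output_tokens * reasoning_multiplier
    return (input_tokens * INPUT_PRICE + billed_output * OUTPUT_PRICE) / 1_000_000

# A 500-token visible response billed as 2,000 output tokens (4x multiplier):
naive = estimate_o3_cost(1_000, 500, reasoning_multiplier=1.0)
realistic = estimate_o3_cost(1_000, 500, reasoning_multiplier=4.0)
print(f"naive: ${naive:.4f}, with reasoning tokens: ${realistic:.4f}")
# naive: $0.0060, with reasoning tokens: $0.0180
```

In this illustration the real bill is 3x the naive estimate - the kind of gap that only shows up after the first invoice if the budget was built from visible token counts.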

At the current pricing, o3 is cost-competitive with GPT-4o for complex reasoning tasks where the accuracy improvement justifies the premium, and significantly cheaper than o1 while outperforming it on nearly every benchmark.

What the Scores Actually Mean for Business Teams

After reviewing every o3 benchmark in detail, here is the practical synthesis for business leaders making AI platform decisions.

o3 is the right choice when your use case genuinely involves complex multi-step reasoning. Financial analysis, scientific research, complex coding, and engineering problems all benefit from o3's architecture in ways that show up in real work, not just benchmarks. The SWE-bench and AIME improvements are real productivity gains for teams doing that kind of work.

o3 is not the right choice as a replacement for GPT-4o or Claude for everyday business tasks. Writing emails, summarizing documents, drafting marketing copy, answering customer questions - none of these benefit meaningfully from o3's extended reasoning. You will pay more per token for no meaningful quality improvement on these tasks. Our What is ChatGPT guide covers which ChatGPT models suit which business applications.

The ARC-AGI-2 result - 2.9% on the harder benchmark while humans score 60% - is the most important data point for executive teams thinking about long-term AI capabilities. o3 is genuinely impressive on tasks it was trained to handle well. It fails dramatically on novel reasoning tasks that fall outside its training distribution. That gap between benchmark-specific performance and genuine general intelligence is the most honest thing the o3 data tells us about where AI actually is in 2026.

For a full breakdown of OpenAI's model lineup and how o3 fits into the broader platform, our OpenAI Statistics guide covers revenue, user adoption, and product strategy in detail.

What is OpenAI? Complete Company Guide 2026 OpenAI's founding story, product lineup, and how o3 fits into the company's broader model strategy.

OpenAI Statistics 2026: Users, Revenue & Growth The full data picture on OpenAI - 800M weekly users, $29B projected revenue, and market position context for o3.

ChatGPT vs Claude: Which AI Is Better for Business in 2026? How o3 and the broader ChatGPT platform compare to Claude across business workflows.

AI Coding Tools 2026: Ranked & Compared Where o3 fits in the AI coding tool landscape alongside Claude Code, GitHub Copilot, and Cursor.

What is ChatGPT? Complete Guide 2026 How o3 fits into ChatGPT's model family and which plan gives you access to it.

Frequently Asked Questions

What is OpenAI o3's score on the AIME math benchmark? o3 scored 96.7% on the AIME 2024 benchmark and 88.9% on AIME 2025, compared to o1's scores of 74.3% and 79.2% respectively. AIME tests advanced multi-step mathematical reasoning - problems requiring genuine problem-solving rather than memorized answers. The improvement over o1 is one of the most significant performance jumps between consecutive OpenAI reasoning models.

How does o3 score on GPQA Diamond? o3 scored 87.7% on GPQA Diamond, a benchmark testing PhD-level science questions in biology, chemistry, and physics designed to be unsearchable. This compares to o1's score of 78% and surpasses average human expert performance of approximately 70%. Google's Gemini 2.5 Pro scores slightly lower at 86.4% on the same benchmark.

What did o3 score on ARC-AGI? o3 scored 75.7% on ARC-AGI-1 using low compute and 87.5% using high compute, compared to average human performance of 85%. This result generated significant AGI discussion in late 2024. However, on the newer ARC-AGI-2 benchmark, o3 scores only 2.9% compared to 60% for average humans - demonstrating that the ARC-AGI-1 result does not represent general intelligence.

What is o3's SWE-bench score? o3 scored 71.7% on SWE-bench Verified, which tests AI on real GitHub issues from production software repositories. This compares to o1's 48.9% - a 22.8 percentage point improvement and one of the most practically relevant performance gains for software development teams. Claude Opus 4.6 scores approximately 70.3% on the same benchmark.

How much does the o3 API cost in 2026? Following an 80% price reduction in March 2026, o3 costs $2 per million input tokens and $8 per million output tokens at standard pricing. Flex mode costs $5/$20 per million tokens. o3-pro costs $20/$80 per million tokens. o3-mini costs $1.10/$4.40 per million tokens. Note that o-series models use hidden reasoning tokens that are billed as output tokens but not visible in responses - actual costs are often higher than visible token counts suggest.

What is FrontierMath and how did o3 score? FrontierMath is a benchmark developed by Epoch AI testing AI on novel, research-level mathematical problems that cannot be solved through memorization. o3 scored 25.2% on FrontierMath - approximately 12x better than the previous best performance of under 2% from any other model. While 25.2% sounds low, it represents a genuine discontinuity in AI mathematical reasoning capability on genuinely novel problems.

How does o3 compare to Gemini 2.5 Pro on benchmarks? The comparison is competitive with neither model clearly dominant. o3 leads on AIME math (96.7% vs approximately 90%), SWE-bench coding (71.7% vs approximately 65%), GPQA Diamond (87.7% vs 86.4% - essentially tied), and ARC-AGI. Gemini 2.5 Pro leads slightly on HumanEval coding (82.2% vs 81.3%) and MMLU (92.0% vs 91.6%). Gemini 2.5 Pro also has a significant speed advantage - o3's extended reasoning produces slower responses. For research-level math and complex coding, o3 leads. For general business tasks, Gemini 2.5 Pro's speed advantage matters more than the benchmark differences.

What is o3-mini and how does it compare to o3? o3-mini is a smaller, faster, and cheaper version of o3 designed for cost-efficient reasoning on STEM tasks. It costs $1.10/$4.40 per million tokens compared to o3's $2/$8. o3-mini scored 87.3% on AIME 2024 with high reasoning effort, compared to o3's 96.7% - delivering approximately 85-90% of o3's capability at roughly 55% of the cost. o3-mini has been largely succeeded by o4-mini in ChatGPT and the API as of April 2025, which offers better performance at similar cost.

What are OpenAI o3's benchmark scores? OpenAI o3's key benchmark scores are: AIME 2024 math - 96.7%; GPQA Diamond PhD science - 87.7%; SWE-bench Verified software engineering - 71.7%; ARC-AGI-1 general reasoning - 87.5% at high compute; Codeforces competitive programming - 2,727 Elo; FrontierMath research-level math - 25.2%; MMLU general knowledge - 91.6%; HumanEval coding - 81.3%. These scores represent substantial improvements over predecessor model o1 across nearly every benchmark.

How does OpenAI o3 score on math benchmarks? o3 scored 96.7% on AIME 2024 and 88.9% on AIME 2025, compared to o1's scores of 74.3% and 79.2%. On FrontierMath, which tests unseen research-level problems, o3 scored 25.2% - approximately 12x better than any previous model. On MathVista visual math reasoning, o3 scored 86.8% versus o1's 71.8%. These math benchmark improvements reflect o3's extended chain-of-thought reasoning architecture that actively works through problems rather than pattern-matching to memorized solutions.

What is o3's score on GPQA Diamond? o3 scored 87.7% on GPQA Diamond, a benchmark testing PhD-level science questions in biology, chemistry, and physics. This surpasses average human expert performance of approximately 70% and compares to o1's 78% score. Google's Gemini 2.5 Pro scores 86.4% on the same benchmark. GPQA Diamond is considered one of the most meaningful benchmarks for evaluating genuine scientific reasoning because questions are specifically designed to be unsearchable online.

What did o3 score on ARC-AGI and what does it mean? o3 scored 75.7% on ARC-AGI-1 at low compute and 87.5% at high compute, compared to human average performance of 85%. This result sparked AGI discussions in late 2024. However, on ARC-AGI-2 - the harder next-generation version of the same benchmark - o3 scores only 2.9% versus 60% for average humans. The ARC-AGI-1 result reflects strong performance on a specific test, not general intelligence. The ARC-AGI-2 result is the more important data point for understanding o3's actual reasoning limitations.

How does o3 compare to o1 on benchmarks? o3 outperforms o1 across nearly every benchmark: AIME math improved from 74.3% to 96.7%; GPQA Diamond science improved from 78% to 87.7%; SWE-bench coding improved from 48.9% to 71.7%; ARC-AGI improved from approximately 25% to 87.5% at high compute; Codeforces Elo improved from 1,891 to 2,727. o3 also reduced pricing by 80% from o1's rates. OpenAI describes o3 as making 20% fewer major errors than o1 on difficult real-world tasks.

What is the o3 model good at based on benchmarks? Based on benchmark performance, o3 is strongest at complex mathematical reasoning (96.7% AIME), PhD-level science questions (87.7% GPQA Diamond), real-world software engineering tasks (71.7% SWE-bench), competitive programming (2,727 Codeforces Elo), and visual reasoning on novel patterns (87.5% ARC-AGI). It is weakest on genuinely novel reasoning problems outside its training distribution, as demonstrated by the 2.9% score on ARC-AGI-2. For everyday business tasks like writing, summarization, and standard analysis, GPT-4o or Claude offer better cost-performance ratios.

How much does o3 cost per token in 2026? Following an 80% price reduction in March 2026, o3 costs $2 per million input tokens and $8 per million output tokens at standard pricing. This compares to the original pricing of $10/$40 per million tokens. o3-mini costs $1.10/$4.40 per million tokens. o3-pro costs $20/$80 per million tokens. Note that o3's reasoning architecture generates hidden reasoning tokens that are billed as output but not visible in responses - actual costs per visible response are often 2-4x higher than the visible token count suggests.

How does o3 compare to Claude and Gemini on benchmarks? On math (AIME): o3 leads at 96.7% versus approximately 85-90% for Claude Opus 4.6 and Gemini 2.5 Pro. On science (GPQA Diamond): essentially tied at 87.7% (o3), 86.4% (Gemini 2.5 Pro), 84.2% (Claude Opus 4.6). On coding (SWE-bench): o3 at 71.7%, Claude Opus 4.6 at 70.3%, Gemini approximately 65%. On general knowledge (MMLU): Gemini 2.5 Pro leads at 92% versus o3's 91.6% and Claude's approximately 88%. No single model dominates across all benchmarks - the best choice depends on your specific use case.

What is the difference between o3 and o3-mini benchmark performance? o3 scores higher than o3-mini across most benchmarks: o3 hits 96.7% on AIME 2024 versus o3-mini's 87.3% with high reasoning effort. o3 reaches 87.7% on GPQA Diamond versus o3-mini's 79.7%. o3's Codeforces Elo of 2,727 compares to o3-mini's 2,130. o3-mini delivers approximately 85-90% of o3's capability at roughly 55% of the cost, making it the preferred choice for most development use cases where full o3 performance is not required. o3-mini has been largely replaced by o4-mini in ChatGPT as of April 2025.

What is o3-pro and how does it score? o3-pro is a version of o3 designed to think longer on difficult problems for maximum reliability. OpenAI describes it as best for "challenging questions where reliability matters more than speed." It costs $20 per million input tokens and $80 per million output tokens. o3-pro scores higher than standard o3 on most benchmarks due to its extended reasoning time. It is available to ChatGPT Pro subscribers and via API, and is designed for specialized applications where maximum accuracy justifies both the cost premium and significantly slower response times.

What does the o3 ARC-AGI-2 score of 2.9% mean? ARC-AGI-2 is a harder version of the ARC-AGI benchmark designed to resist the techniques that allowed o3 to score well on ARC-AGI-1. o3's score of 2.9% on ARC-AGI-2 compared to 60% for average humans demonstrates that o3's ARC-AGI-1 success reflected optimization for that specific test format rather than general reasoning capability. This result is important context for interpreting all AI benchmarks - models can be specifically optimized for benchmark tasks in ways that do not reflect genuine underlying capability improvements.

Conclusion

o3's benchmark scores tell a story that is both genuinely impressive and usefully clarifying. The AIME, GPQA, SWE-bench, and Codeforces results represent real capability improvements over o1 that translate to better performance on complex reasoning, science, and coding tasks. The 80% price reduction makes those improvements accessible to a wider range of teams and use cases.

The ARC-AGI-2 result - 2.9% where humans score 60% - is the most honest data point in the entire o3 story. It tells you that benchmark performance on specific tests does not equal general intelligence, and that the gap between impressive benchmark results and genuine reasoning across novel problems remains very wide.

For business teams, the practical conclusion is straightforward. Use o3 where complexity justifies the cost - financial modeling, scientific research, complex software engineering, and analytical tasks requiring multi-step reasoning. Use GPT-4o, Claude, or Gemini for everything else. The benchmark scores point precisely to where the performance premium is real.

📨 Don't miss tomorrow's edition. Subscribe free to AI Business Weekly and get our 2026 AI Tools Cheat Sheet instantly - bite-sized AI news every morning, zero hype.
