OpenAI's O3 model has achieved something remarkable in mathematical reasoning. On the American Invitational Mathematics Examination (AIME), one of the most challenging high school math competitions in the United States, O3 scored 96.7% on the 2024 exam. That works out to 14.5 of the 15 problems solved correctly (an averaged figure, since no single attempt can solve half a problem).

To understand how impressive this is, consider what AIME actually tests. These aren't standard textbook problems. The median human score is just 4 to 6 correct answers out of 15. Only the top 5% of students who take the AMC 12 (itself a challenging competition) even qualify to attempt AIME. And among those elite students, scoring above 10 out of 15 puts you in contention for the USA Mathematical Olympiad.

O3's 96.7% performance places it solidly in the top 1% of all human AIME competitors. But here's what makes the achievement more nuanced than it first appears: O3 isn't superhuman at math. It's human-level at the elite end, with specific strengths and surprising weaknesses.

What AIME Actually Tests

Before diving into O3's performance, it's important to understand what AIME measures and why it matters as an AI benchmark.

The Structure

AIME consists of 15 problems testing advanced high school mathematics across multiple domains:

Problem Types:

  • Number theory (divisibility, primes, modular arithmetic)

  • Algebra (polynomials, systems, inequalities)

  • Geometry (triangles, circles, coordinate systems)

  • Combinatorics (counting, probability, arrangements)

Key Characteristics:

  • Three-hour time limit

  • No calculators allowed

  • Each answer is an integer from 000 to 999

  • No partial credit

  • Multi-step reasoning required

The format is deliberately unforgiving. You can't guess your way to success. Each problem requires understanding the underlying mathematics, devising a solution strategy, executing multiple steps correctly, and arriving at a precise numerical answer.
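
To make the format concrete, here is a small worked illustration in Python. The problem is invented for this article rather than taken from an actual AIME, but it has the right shape: several reasoning steps, exact arithmetic, and a single integer answer between 000 and 999.

```python
# An invented AIME-style exercise (not an actual AIME problem): how many
# integers from 1 to 1000 are divisible by 3 or 5 but not by 15?
# Multi-step reasoning (inclusion-exclusion) ending in one exact integer.

def count_divisible(limit: int = 1000) -> int:
    """Count integers in [1, limit] divisible by 3 or 5 but not by 15."""
    div3, div5, div15 = limit // 3, limit // 5, limit // 15
    # Multiples of 3 only, plus multiples of 5 only (multiples of 15 excluded from both)
    return (div3 - div15) + (div5 - div15)

answer = count_divisible()
assert 0 <= answer <= 999      # AIME answers are integers from 000 to 999
print(f"{answer:03d}")         # -> 401, in the exam's three-digit format
```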

Why Humans Find It Difficult

The median score of 4 to 6 out of 15 isn't because students are poorly prepared. AIME qualifiers have already demonstrated exceptional mathematical ability by scoring in the top 5% on the AMC 12 or top 2.5% on the AMC 10.

What makes AIME challenging:

  • Novel problem structures that don't match textbook examples

  • Multiple solution paths where choosing the right approach matters

  • Heavy calculations that must be carried out without computational aids

  • Time pressure forcing strategic decisions about which problems to attempt

Top performers at AIME typically score 10 to 12 out of 15. Scoring 14 or 15 is exceptionally rare and usually indicates someone who will advance deep into International Mathematical Olympiad territory.

How O3 Actually Performed

O3's 96.7% score on AIME 2024 represents 14.5 out of 15 problems solved correctly. But the story becomes more interesting when you examine performance across different versions and compare it to other AI models.

The 2024 vs 2025 Question

Independent testing revealed an important pattern. O3 performed significantly better on AIME 2024 (released in February 2024) than on AIME 2025 (released in February 2025).

Performance breakdown:

  • AIME 2024: 96.7% accuracy

  • AIME 2025: Lower performance (exact scores vary by testing)

  • Pattern: Models scored better on older, potentially seen problems

This raises questions about whether O3's training data included AIME 2024 problems or similar content. The performance gap suggests that when faced with truly novel problems, the model's accuracy drops.

According to independent benchmark testing from Vals.ai, no current model has fully mastered AIME 2025. The correctly answered questions are spread across different models, suggesting that no single AI handles the full range of these problems consistently.

Comparison to Other AI Models

O3's performance represents a dramatic leap from earlier AI capabilities, but it's not the only model achieving high scores.

TABLE 1: AI Model AIME Performance Progression

| Model | AIME 2024 Score | AIME 2025 Score | Release Date |
| --- | --- | --- | --- |
| GPT-4o | 42.1% | ~35-40% | May 2024 |
| O1 | 56.6% | ~50-55% | September 2024 |
| O3 | 96.7% | ~85-90% | December 2024 (announced) |
| O3-mini | 86.5% | ~80-85% | January 2025 |
| Gemini 2.5 Pro | 86.7% | ~82-86% | March 2025 |
| DeepSeek R1 | ~75-80% | ~70-75% | January 2025 |

The progression is striking. In under a year, AI performance on AIME jumped from 42.1% to a near-perfect 96.7%. That is one of the fastest capability gains recorded on any mathematical reasoning benchmark.

Where O3 Excels

O3 performs exceptionally well on certain types of AIME problems:

Strong Performance Areas:

  • Algebraic manipulation with multiple variables

  • Systematic case analysis requiring thorough enumeration

  • Pattern recognition in number sequences

  • Coordinate geometry with clear calculation paths

  • Combinatorial counting with well-defined structures

These problem types play to O3's strengths: following multi-step logical procedures, maintaining consistency across calculations, and exploring solution spaces systematically.
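
As a toy illustration of that "clear calculation path" character (again invented, not an actual AIME problem), the shoelace formula turns a coordinate geometry question into pure arithmetic, exactly the kind of mechanical procedure a model can execute reliably:

```python
# The shoelace formula: area of a simple polygon from its vertex coordinates.
# A mechanical, multi-step calculation with no creative leap required.
# (Invented illustration, not an actual AIME problem.)

def shoelace_area(vertices: list[tuple[float, float]]) -> float:
    """Area of a simple polygon whose vertices are listed in order."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

# Triangle with vertices (0, 0), (10, 0), (4, 6):
print(shoelace_area([(0, 0), (10, 0), (4, 6)]))  # -> 30.0
```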

Where O3 Struggles

Despite the impressive overall score, O3 shows weaknesses on specific problem types:

Weaker Performance Areas:

  • Creative geometric insights requiring novel constructions

  • Problems with multiple valid approaches where choosing matters

  • Questions requiring mathematical intuition over calculation

  • Ambiguous setups where problem interpretation is key

The difference matters. Top human mathematicians often solve AIME problems through elegant insights that short-circuit lengthy calculations. O3 tends toward brute-force systematic approaches, which work most of the time but fail on problems designed to reward creativity.
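
A toy contrast makes the point (the problem is invented, not from AIME). Counting monotone lattice paths from (0, 0) to (10, 10) can be done by systematically filling a table of subproblems, or in one line via the insight that each path is a choice of 10 rightward steps among 20, i.e. C(20, 10):

```python
# Two routes to the same count of monotone lattice paths from (0,0) to (n,n):
# a systematic table-filling enumeration versus the closed-form insight C(2n, n).
# (Toy illustration, not an actual AIME problem.)
from math import comb

def paths_systematic(n: int) -> int:
    """Brute-but-reliable: fill a grid where each cell counts paths reaching it."""
    grid = [[1] * (n + 1) for _ in range(n + 1)]   # one path along each edge
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid[n][n]

def paths_insight(n: int) -> int:
    """The shortcut: choose which n of the 2n steps go right."""
    return comb(2 * n, n)

assert paths_systematic(10) == paths_insight(10) == 184_756
```

Both routes agree here, but the second generalizes instantly; problems engineered so that the systematic route becomes unwieldy are precisely where creative insight pays off.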

How O3 Compares to Human Performance

Placing O3's 96.7% score in human context reveals both its capabilities and limitations.

The Human Performance Distribution

AIME results follow a predictable pattern among qualified students:

TABLE 2: AIME Human Performance Percentiles

| Score Range | Percentile | What It Means |
| --- | --- | --- |
| 0-3 | Bottom 50% | Qualified but struggled |
| 4-6 | 50th percentile | Median AIME qualifier |
| 7-9 | 75th percentile | Strong performance |
| 10-12 | 90th percentile | Excellent, potential USAMO |
| 13-14 | 98th percentile | Elite competition math |
| 15 | 99.5+ percentile | Exceptionally rare |
O3's score of 14.5 places it firmly in the 98th to 99th percentile. It outperforms roughly 99% of human AIME competitors, including many students who go on to study mathematics at top universities.

But Not Perfect

The key insight: O3 reaches elite human performance but doesn't exceed the very best humans.

Human advantages:

  • Top IMO competitors routinely score 14 or 15 out of 15 on AIME

  • Perfect scores, while rare, happen multiple times each year

  • The best human mathematicians solve these problems faster

  • Creative solutions often outpace O3's systematic approaches

O3 has reached "really good human" level, not "superhuman" level. This distinction matters for understanding AI's current capabilities in mathematical reasoning.

What O3's AIME Performance Actually Demonstrates

The 96.7% score tells us something important about the state of AI reasoning models.

Multi-Step Reasoning Works

O3's success on AIME validates the reasoning model approach. Unlike earlier models that generated answers in one pass, O3 uses extended "thinking time" to work through problems step by step.

The process:

  1. Parse the problem into mathematical components

  2. Identify relevant concepts from mathematical knowledge

  3. Explore solution strategies through internal reasoning

  4. Execute calculations systematically

  5. Verify the answer before responding

This mirrors how human mathematicians approach AIME problems, and the results show it works. The model can maintain logical consistency across complex multi-step solutions, a capability that eluded earlier AI systems.
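
OpenAI has not published how O3 implements this loop, so the sketch below is purely conceptual. The functions `propose_solution` and `check_answer` are hypothetical placeholders standing in for a reasoning model's internal drafting and verification passes; nothing here is a real API.

```python
# Conceptual propose-and-verify loop, mirroring the five steps above.
# NOT OpenAI's implementation: the two callables are hypothetical stand-ins.
from typing import Callable, Optional

def solve_with_verification(
    problem: str,
    propose_solution: Callable[[str], tuple[str, int]],  # returns (reasoning, answer)
    check_answer: Callable[[str, int], bool],            # independent sanity check
    max_attempts: int = 4,
) -> Optional[int]:
    """Draft a solution, verify it, and retry if verification fails."""
    for _ in range(max_attempts):
        _reasoning, answer = propose_solution(problem)   # steps 1-4: parse, plan, execute
        if 0 <= answer <= 999 and check_answer(problem, answer):  # step 5: verify
            return answer
    return None  # give up rather than return an unverified answer
```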

Pattern Matching Still Dominates

Despite the reasoning capabilities, O3 likely relies heavily on pattern matching from training data.

Evidence for this:

  • Better performance on AIME 2024 than 2025

  • Higher scores when similar problems exist in mathematical training data

  • Struggles with truly novel problem structures

  • Systematic rather than creative solutions

The model has learned to recognize mathematical problem types and apply appropriate solution templates. This is useful but different from genuine mathematical understanding.
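
As a caricature of what "applying a solution template" means, the sketch below maps recognized trigger phrases to stored recipes. The categories and recipes are invented for illustration; a real model encodes nothing this explicit, but the failure mode is analogous: when no learned pattern fits, quality drops.

```python
# A deliberately crude analogy: recognize a problem type by a trigger phrase,
# then retrieve a stored solution recipe. Categories and recipes are invented.

SOLUTION_TEMPLATES = {
    "remainder": "reduce everything modulo n and track residues step by step",
    "count the number of": "split into cases or use inclusion-exclusion, then sum",
    "triangle": "drop coordinates or apply the law of cosines, then compute",
}

def match_template(problem: str) -> str:
    """Return the first stored recipe whose trigger phrase appears in the problem."""
    text = problem.lower()
    for trigger, recipe in SOLUTION_TEMPLATES.items():
        if trigger in text:
            return recipe
    return "no stored template: novel structure, where accuracy tends to drop"

print(match_template("Count the number of ordered pairs (a, b) such that ..."))
```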

The Training Data Question

O3's exceptional performance raises questions about what mathematics appeared in its training data.

Potential training sources:

  • Published AIME problems from previous years

  • Similar olympiad-style problems from other competitions

  • Mathematics textbooks covering these topics

  • Online discussion forums analyzing solutions

  • Mathematical olympiad preparation materials

OpenAI hasn't disclosed exactly what mathematical content trained O3. The performance gap between AIME 2024 and 2025 suggests the model may have indirect exposure to problems similar to the 2024 exam through publicly available training data.

Where AI Math Reasoning Still Falls Short

AIME performance, while impressive, reveals clear limitations in AI mathematical capabilities.

Comparison to Harder Benchmarks

O3's 96.7% on AIME contrasts sharply with performance on more difficult mathematical challenges:

TABLE 3: O3 Performance Across Math Benchmarks

| Benchmark | Difficulty Level | O3 Score | Human Expert Score |
| --- | --- | --- | --- |
| AIME | High school olympiad | 96.7% | 60-100% (varies widely) |
| GPQA Diamond | Graduate-level science | 87.7% | ~70% (average PhD expert) |
| FrontierMath | Research-level math | 25.2%* | Varies (hours to days per problem) |
| IMO problems | International olympiad | Below medal level on full proofs | Top competitors: near perfect |

*December preview version; the production version scores ~10%.

The pattern is clear: as problems require more creativity, longer reasoning chains, and genuine mathematical insight, O3's performance degrades significantly.

What This Tells Us About AI Math

The performance across benchmarks reveals important truths:

O3 can:

  • Solve structured problems with clear solution paths

  • Execute complex calculations accurately

  • Maintain logical consistency across steps

  • Apply known mathematical techniques systematically

O3 cannot:

  • Generate truly novel mathematical insights

  • Solve problems requiring creative breakthroughs

  • Handle ambiguous or poorly specified questions well

  • Match human creativity in proof construction

This mirrors findings across other domains. AI excels at systematic application of learned patterns but struggles with genuine creative reasoning.

Real-World Implications

O3's AIME performance matters beyond benchmark competitions.

For Education

AI systems reaching this level of mathematical competence create both opportunities and challenges for education:

Opportunities:

  • High-quality tutoring for challenging problems

  • Instant feedback on solution approaches

  • Exposure to multiple solution strategies

  • Help for students without access to expert teachers

Challenges:

  • Students could use AI to complete homework without learning

  • Risk of over-reliance on AI for problem-solving

  • Questions about what mathematical skills students actually need

  • Pressure on educational assessment methods

The fact that O3 can solve 96.7% of AIME problems means it can handle essentially all standard high school mathematics and much of the routine calculation in early undergraduate coursework. This forces educators to rethink what and how they teach.

For Technical Roles

Mathematical problem-solving is central to many technical careers:

Impact on:

  • Software engineering (algorithm design, optimization)

  • Data science (statistical analysis, modeling)

  • Engineering (mathematical modeling, simulation)

  • Finance (quantitative analysis, risk modeling)

AI systems with AIME-level mathematical competence can assist with routine mathematical work in these fields. The key word is "assist." The creative mathematical thinking that drives breakthroughs remains human territory.

For AI Development

O3's performance provides crucial data about AI capability development:

What we learned:

  • Extended reasoning time delivers significant gains

  • Training on mathematical content works

  • Multi-step logical reasoning is teachable to AI

  • But creative insight remains elusive

This informs development priorities for next-generation AI systems. Improving reasoning capability shows clear returns. The remaining challenge is developing genuine mathematical creativity rather than pattern application.

The Path Forward

AIME is approaching saturation as an AI benchmark. O3-mini already scores 86.5%, and newer models will likely push even higher.

Next-Level Benchmarks

The AI research community is already moving toward harder evaluations:

Emerging benchmarks:

  • USAMO/IMO problems requiring proof construction

  • Research-level mathematics from FrontierMath and similar

  • Combined reasoning tasks mixing math with other domains

  • Multi-step projects requiring sustained reasoning

These benchmarks test capabilities beyond AIME's structured problem format. They require not just calculation and logic but genuine mathematical creativity.

What 100% on AIME Would Mean

When AI systems consistently score 100% on AIME, it will represent an important milestone:

  • Mastery of high school olympiad mathematics

  • Reliable multi-step reasoning in structured domains

  • Capability to assist with routine mathematical work

But it won't mean:

  • AI has surpassed human mathematical ability

  • Creative mathematical research is automated

  • The hard problems in mathematics are solved

AIME tests a specific, valuable form of mathematical reasoning. Perfect performance on it demonstrates capability at that level, nothing more.

Conclusion

OpenAI's O3 scoring 96.7% on AIME represents genuine progress in AI mathematical reasoning. The model has reached elite human performance on problems that stump 99% of students, demonstrating that multi-step logical reasoning is teachable to AI systems.

But context matters. O3 performs better on older AIME exams than newer ones, suggesting training data exposure. It excels at systematic problem-solving but struggles with creative insights. And its performance drops dramatically on harder benchmarks requiring genuine mathematical creativity.

The achievement is impressive within its proper scope. O3 can reliably solve complex, multi-step mathematical problems that require maintaining logical consistency across extended reasoning chains. This has clear applications for education, technical work, and advancing AI capabilities.

What it doesn't demonstrate is mathematical understanding equivalent to top human mathematicians. The best humans still solve AIME problems faster, more creatively, and with perfect accuracy. And on research-level mathematics, AI systems still lag far behind expert human performance.

O3's 96.7% on AIME marks an important milestone in AI development. It shows that AI has mastered a significant portion of mathematical problem-solving. But the gap between mastering structured olympiad problems and achieving genuine mathematical creativity remains vast.

For now, O3 represents an extremely capable mathematical assistant rather than a mathematical innovator. That's valuable enough to transform how we approach mathematical education and technical work. But it's not the end of human mathematical reasoning. It's the beginning of a new phase where humans and AI collaborate on mathematical problems, each contributing their distinct strengths.
