OpenAI's O3 model has achieved something remarkable in mathematical reasoning. On the American Invitational Mathematics Examination (AIME), one of the most challenging high school math competitions in the United States, O3 scored 96.7% accuracy on the 2024 exam. Since each problem is all-or-nothing, that translates to solving, on average, about 14.5 of the 15 problems correctly.
To understand how impressive this is, consider what AIME actually tests. These aren't standard textbook problems. The median human score is just 4 to 6 correct answers out of 15. Only the top 5% of students who take the AMC 12 (itself a challenging competition) even qualify to attempt AIME. And among those elite students, scoring above 10 out of 15 puts you in contention for the USA Mathematical Olympiad.
O3's 96.7% performance places it solidly in the top 1% of all human AIME competitors. But here's what makes the achievement more nuanced than it first appears: O3 isn't superhuman at math. It's human-level at the elite end, with specific strengths and surprising weaknesses.

What AIME Actually Tests
Before diving into O3's performance, it's important to understand what AIME measures and why it matters as an AI benchmark.
The Structure
AIME consists of 15 problems testing advanced high school mathematics across multiple domains:
Problem Types:
Number theory (divisibility, primes, modular arithmetic)
Algebra (polynomials, systems, inequalities)
Geometry (triangles, circles, coordinate systems)
Combinatorics (counting, probability, arrangements)
Key Characteristics:
Three-hour time limit
No calculators allowed
Each answer is an integer from 000 to 999
No partial credit
Multi-step reasoning required
The format is deliberately unforgiving. You can't guess your way to success. Each problem requires understanding the underlying mathematics, devising a solution strategy, executing multiple steps correctly, and arriving at a precise numerical answer.
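To make the format concrete, here is a minimal sketch of AIME-style scoring. The rules (15 problems, integer answers from 000 to 999, no partial credit) come from the exam format described above; the answer values in the usage example are made up, not real AIME data.

```python
def score_aime(submitted: dict[int, int], answer_key: dict[int, int]) -> int:
    """All-or-nothing AIME-style scoring: one point per exact match, no partial credit."""
    score = 0
    for problem, correct_answer in answer_key.items():
        answer = submitted.get(problem)
        if answer is None:
            continue                      # blank: no credit, no penalty
        if not (0 <= answer <= 999):
            continue                      # answers outside 000-999 cannot be correct
        if answer == correct_answer:
            score += 1                    # an exact integer match is the only way to score
    return score

# Toy usage with invented answers (not real AIME data):
key = {1: 204, 2: 25, 3: 809}
mine = {1: 204, 2: 24}                    # problem 3 left blank
print(score_aime(mine, key))              # -> 1
```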
Why Humans Find It Difficult
The median score of 4 to 6 out of 15 isn't because students are poorly prepared. AIME qualifiers have already demonstrated exceptional mathematical ability by scoring in the top 5% on the AMC 12 or top 2.5% on the AMC 10.
What makes AIME challenging:
Novel problem structures that don't match textbook examples
Multiple solution paths where choosing the right approach matters
Calculation complexity that must be done without computational aids
Time pressure forcing strategic decisions about which problems to attempt
Top performers at AIME typically score 10 to 12 out of 15. Scoring 14 or 15 is exceptionally rare and usually indicates someone who will advance deep into International Mathematical Olympiad territory.
How O3 Actually Performed
O3's 96.7% score on AIME 2024 represents 14.5 out of 15 problems solved correctly. But the story becomes more interesting when you examine performance across different versions and compare it to other AI models.
The 2024 vs 2025 Question
Independent testing revealed an important pattern. O3 performed significantly better on AIME 2024 (released in February 2024) than on AIME 2025 (released in February 2025).
Performance breakdown:
AIME 2024: 96.7% accuracy
AIME 2025: Lower performance (exact scores vary by testing)
Pattern: Models scored better on older, potentially seen problems
This raises questions about whether O3's training data included AIME 2024 problems or similar content. The performance gap suggests that when faced with truly novel problems, the model's accuracy drops.
According to independent benchmark testing from Vals.ai, no current model has fully mastered AIME 2025. The problems answered correctly differ from model to model, suggesting that no single AI yet handles the full range of these problems consistently.
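One caveat when reading these comparisons: each AIME has only 15 problems, so a score can swing by a problem or two from exam to exam even without any change in underlying ability. The sketch below simulates that exam-to-exam variation for a hypothetical model with an assumed 90% per-problem accuracy; the 90% figure is purely illustrative, not a measured value for any model.

```python
import random

# How much does a 15-problem score vary for a model with fixed per-problem accuracy?
TRUE_ACCURACY = 0.90        # assumed value for illustration only
N_PROBLEMS = 15
N_SIMULATED_EXAMS = 100_000

random.seed(0)
scores = [
    sum(random.random() < TRUE_ACCURACY for _ in range(N_PROBLEMS))
    for _ in range(N_SIMULATED_EXAMS)
]

for threshold in (15, 14, 13, 12):
    share = sum(s >= threshold for s in scores) / N_SIMULATED_EXAMS
    print(f"P(score >= {threshold}/15) = {share:.2%}")
```

A single-exam gap of one or two problems is therefore within ordinary sampling noise; what makes the contamination question worth asking is that independent evaluations report the same direction of gap across multiple models.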
Comparison to Other AI Models
O3's performance represents a dramatic leap from earlier AI capabilities, but it's not the only model achieving high scores.
TABLE 1: AI Model AIME Performance Progression
| Model | AIME 2024 Score | AIME 2025 Score | Release Date |
|---|---|---|---|
| GPT-4o | 42.1% | ~35-40% | May 2024 |
| O1 | 56.6% | ~50-55% | September 2024 |
| O3 | 96.7% | ~85-90% | December 2024 (announced) |
| O3-mini | 86.5% | ~80-85% | January 2025 |
| Gemini 2.5 Pro | 86.7% | ~82-86% | March 2025 |
| DeepSeek R1 | ~75-80% | ~70-75% | January 2025 |
The progression is striking. In less than a year, AI performance on AIME jumped from roughly 42% to near-perfect (96.7%). This represents one of the fastest capability gains recorded on any mathematical reasoning benchmark.
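To make the table concrete, here is a quick sketch converting each AIME 2024 accuracy figure from Table 1 into expected problems solved out of 15.

```python
# Convert reported accuracy into expected problems solved out of 15.
AIME_PROBLEMS = 15

aime_2024_scores = {    # accuracy values as reported in Table 1
    "GPT-4o": 0.421,
    "O1": 0.566,
    "O3-mini": 0.865,
    "O3": 0.967,
}

for model, accuracy in aime_2024_scores.items():
    expected = accuracy * AIME_PROBLEMS
    print(f"{model:8s} {accuracy:6.1%}  ~ {expected:.1f} / 15 problems")
```

Read this way, GPT-4o's 42.1% corresponds to roughly 6 of 15, around the median for human qualifiers, while O3's 96.7% corresponds to roughly 14.5 of 15.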
Where O3 Excels
O3 performs exceptionally well on certain types of AIME problems:
Strong Performance Areas:
Algebraic manipulation with multiple variables
Systematic case analysis requiring thorough enumeration
Pattern recognition in number sequences
Coordinate geometry with clear calculation paths
Combinatorial counting with well-defined structures
These problem types play to O3's strengths: following multi-step logical procedures, maintaining consistency across calculations, and exploring solution spaces systematically.
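As a toy illustration of that systematic-enumeration style (this is not an actual AIME problem), consider counting the three-digit integers whose digits sum to 9 by simply checking every candidate:

```python
# Brute-force enumeration: check every three-digit number.
count = sum(
    1
    for n in range(100, 1000)
    if sum(int(digit) for digit in str(n)) == 9
)
print(count)  # 45
```

Exhaustive, error-free case-checking of this kind is exactly the sort of work the model handles well; the elegant human alternative is a short stars-and-bars argument that reaches the same count of 45 without enumerating anything.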
Where O3 Struggles
Despite the impressive overall score, O3 shows weaknesses on specific problem types:
Weaker Performance Areas:
Creative geometric insights requiring novel constructions
Problems with multiple valid approaches where choosing matters
Questions requiring mathematical intuition over calculation
Ambiguous setups where problem interpretation is key
The difference matters. Top human mathematicians often solve AIME problems through elegant insights that short-circuit lengthy calculations. O3 tends toward brute-force systematic approaches, which work most of the time but fail on problems designed to reward creativity.
How O3 Compares to Human Performance
Placing O3's 96.7% score in human context reveals both its capabilities and limitations.
The Human Performance Distribution
AIME results follow a predictable pattern among qualified students:
TABLE 2: AIME Human Performance Percentiles
| Score Range | Percentile | What It Means |
|---|---|---|
| 0-3 | Bottom 50% | Qualified but struggled |
| 4-6 | 50th percentile | Median AIME qualifier |
| 7-9 | 75th percentile | Strong performance |
| 10-12 | 90th percentile | Excellent, potential USAMO |
| 13-14 | 98th percentile | Elite competition math |
| 15 | 99.5th+ percentile | Exceptionally rare |
O3's score of 14.5 places it firmly in the 98-99th percentile. It outperforms 99% of human AIME competitors, including many students who go on to study mathematics at top universities.
But Not Perfect
The key insight: O3 reaches elite human performance but doesn't exceed the very best humans.
Human advantages:
Top IMO competitors consistently score 15/15 on AIME
Perfect scores, while rare, happen multiple times each year
The best human mathematicians solve these problems faster
Creative solutions often outpace O3's systematic approaches
O3 has reached "really good human" level, not "superhuman" level. This distinction matters for understanding AI's current capabilities in mathematical reasoning.
What O3's AIME Performance Actually Demonstrates
The 96.7% score tells us something important about the state of AI reasoning models.
Multi-Step Reasoning Works
O3's success on AIME validates the reasoning model approach. Unlike earlier models that generated answers in one pass, O3 uses extended "thinking time" to work through problems step by step.
The process:
Parse the problem into mathematical components
Identify relevant concepts from mathematical knowledge
Explore solution strategies through internal reasoning
Execute calculations systematically
Verify the answer before responding
This mirrors how human mathematicians approach AIME problems, and the results show it works. The model can maintain logical consistency across complex multi-step solutions, a capability that eluded earlier AI systems.
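For readers who want to try this style of prompting themselves, here is a minimal sketch using the OpenAI Python SDK. It assumes a recent SDK version in which o-series models accept a `reasoning_effort` setting, uses o3-mini as a stand-in model name, and uses an illustrative competition-style prompt rather than a real AIME problem.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask a reasoning model to work through a competition-style problem.
# reasoning_effort ("low" | "medium" | "high") trades latency and cost
# for more internal "thinking" before the final answer.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[
        {
            "role": "user",
            "content": (
                "Find the number of three-digit integers whose digits sum to 9. "
                "Give only the final integer answer."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```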
Pattern Matching Still Dominates
Despite the reasoning capabilities, O3 likely relies heavily on pattern matching from training data.
Evidence for this:
Better performance on AIME 2024 than 2025
Higher scores when similar problems exist in mathematical training data
Struggles with truly novel problem structures
Systematic rather than creative solutions
The model has learned to recognize mathematical problem types and apply appropriate solution templates. This is useful but different from genuine mathematical understanding.
The Training Data Question
O3's exceptional performance raises questions about what mathematics appeared in its training data.
Potential training sources:
Published AIME problems from previous years
Similar olympiad-style problems from other competitions
Mathematics textbooks covering these topics
Online discussion forums analyzing solutions
Mathematical olympiad preparation materials
OpenAI hasn't disclosed exactly what mathematical content trained O3. The performance gap between AIME 2024 and 2025 suggests the model may have indirect exposure to problems similar to the 2024 exam through publicly available training data.
Where AI Math Reasoning Still Falls Short
AIME performance, while impressive, reveals clear limitations in AI mathematical capabilities.
Comparison to Harder Benchmarks
O3's 96.7% on AIME contrasts sharply with performance on more difficult mathematical challenges:
TABLE 3: O3 Performance Across Math Benchmarks
| Benchmark | Difficulty Level | O3 Score | Human Expert Score |
|---|---|---|---|
| AIME | High school olympiad | 96.7% | 60-100% (varies widely) |
| GPQA Diamond | Graduate-level science | 87.7% | ~70% average for PhDs |
| FrontierMath | Research-level math | 25.2%* | Varies (hours to days per problem) |
| IMO Problems | International olympiad | Gold medal level | Top competitors: near perfect |

*December preview version; production version ~10%
The pattern is clear: as problems require more creativity, longer reasoning chains, and genuine mathematical insight, O3's performance degrades significantly.
What This Tells Us About AI Math
The performance across benchmarks reveals important truths:
O3 can:
Solve structured problems with clear solution paths
Execute complex calculations accurately
Maintain logical consistency across steps
Apply known mathematical techniques systematically
O3 cannot:
Generate truly novel mathematical insights
Solve problems requiring creative breakthroughs
Handle ambiguous or poorly specified questions well
Match human creativity in proof construction
This mirrors findings across other domains. AI excels at systematic application of learned patterns but struggles with genuine creative reasoning.
Real-World Implications
O3's AIME performance matters beyond benchmark competitions.
For Education
AI systems reaching this level of mathematical competence create both opportunities and challenges for education:
Opportunities:
High-quality tutoring for challenging problems
Instant feedback on solution approaches
Exposure to multiple solution strategies
Help for students without access to expert teachers
Challenges:
Students could use AI to complete homework without learning
Risk of over-reliance on AI for problem-solving
Questions about what mathematical skills students actually need
Pressure on educational assessment methods
The fact that O3 can solve 96.7% of AIME problems means it can handle essentially all standard high school mathematics and much of the routine problem-solving taught in early undergraduate courses. This forces educators to rethink what and how they teach.
For Technical Roles
Mathematical problem-solving is central to many technical careers:
Impact on:
Software engineering (algorithm design, optimization)
Data science (statistical analysis, modeling)
Engineering (mathematical modeling, simulation)
Finance (quantitative analysis, risk modeling)
AI systems with AIME-level mathematical competence can assist with routine mathematical work in these fields. The key word is "assist." The creative mathematical thinking that drives breakthroughs remains human territory.
For AI Development
O3's performance provides crucial data about AI capability development:
What we learned:
Extended reasoning time delivers significant gains
Training on mathematical content works
Multi-step logical reasoning is teachable to AI
But creative insight remains elusive
This informs development priorities for next-generation AI systems. Improving reasoning capability shows clear returns. The remaining challenge is developing genuine mathematical creativity rather than pattern application.
The Path Forward
AIME is approaching saturation as an AI benchmark. O3-mini already scores 86.5%, and newer models will likely push even higher.
Next-Level Benchmarks
The AI research community is already moving toward harder evaluations:
Emerging benchmarks:
USAMO/IMO problems requiring proof construction
Research-level mathematics from FrontierMath and similar
Combined reasoning tasks mixing math with other domains
Multi-step projects requiring sustained reasoning
These benchmarks test capabilities beyond AIME's structured problem format. They require not just calculation and logic but genuine mathematical creativity.
What 100% on AIME Would Mean
When AI systems consistently score 100% on AIME, it will represent an important milestone:
Mastery of high school olympiad mathematics
Reliable multi-step reasoning in structured domains
Capability to assist with routine mathematical work
But it won't mean:
Artificial general intelligence has arrived
AI has surpassed human mathematical ability
Creative mathematical research is automated
The hard problems in mathematics are solved
AIME tests a specific, valuable form of mathematical reasoning. Perfect performance on it demonstrates capability at that level, nothing more.
Conclusion
OpenAI's O3 scoring 96.7% on AIME represents genuine progress in AI mathematical reasoning. The model has reached elite human performance on problems that stump 99% of students, demonstrating that multi-step logical reasoning is teachable to AI systems.
But context matters. O3 performs better on older AIME exams than newer ones, suggesting training data exposure. It excels at systematic problem-solving but struggles with creative insights. And its performance drops dramatically on harder benchmarks requiring genuine mathematical creativity.
The achievement is impressive within its proper scope. O3 can reliably solve complex, multi-step mathematical problems that require maintaining logical consistency across extended reasoning chains. This has clear applications for education, technical work, and advancing AI capabilities.
What it doesn't demonstrate is mathematical understanding equivalent to top human mathematicians. The best humans still solve AIME problems faster, more creatively, and with perfect accuracy. And on research-level mathematics, AI systems still lag far behind expert human performance.
O3's 96.7% on AIME marks an important milestone in AI development. It shows that AI has mastered a significant portion of mathematical problem-solving. But the gap between mastering structured olympiad problems and achieving genuine mathematical creativity remains vast.
For now, O3 represents an extremely capable mathematical assistant rather than a mathematical innovator. That's valuable enough to transform how we approach mathematical education and technical work. But it's not the end of human mathematical reasoning. It's the beginning of a new phase where humans and AI collaborate on mathematical problems, each contributing their distinct strengths.



