When OpenAI CEO Sam Altman showcased O3 in December 2024, the AI community watched in amazement. The model achieved 87.5% on ARC-AGI, a test designed to measure progress toward artificial general intelligence. It scored over 25% on FrontierMath, where competing models barely exceeded 2%. The demonstrations suggested a genuine breakthrough in AI reasoning.
Then in April 2025, OpenAI finally released O3 to the public. Users eager to access this groundbreaking technology quickly discovered something unexpected.
The O3 they could actually use was dramatically weaker than what Altman had demonstrated four months earlier.

The production version scored 41% on ARC-AGI instead of 87.5%. Independent testing showed it achieved just 10% on FrontierMath instead of the claimed 25%. Even OpenAI quietly acknowledged on their website that the released O3 was "a materially different system" optimized for real-world use rather than benchmark performance.
This wasn't an accident or a mistake. OpenAI deliberately released a weaker version of O3 while keeping the most powerful configuration locked away. The reasons why reveal fundamental tensions in AI development between what's technically possible and what's practically deployable.
The Three Versions of O3 Nobody Talks About
Most people think there's just one O3 model. In reality, OpenAI has created at least three distinct versions, each with dramatically different capabilities and costs.
O3-Preview: The Uneconomical Benchmark Destroyer
This is the version Sam Altman demonstrated in December 2024.
Performance highlights:
87.5% on ARC-AGI
25% on FrontierMath (internal testing)
Record-breaking scores across nearly every AI benchmark
The catch? According to the ARC Prize Foundation, this version cost an estimated $34,400 per task to run. At those economics, you could hire a professional mathematician to solve problems for $5 each and still save 99.9% compared to using O3-preview at high compute.
Even the "low-compute" preview version that scored 75.7% on ARC-AGI still cost $200 per task. For context, the average API call to GPT-4 costs fractions of a cent.
Production O3: The Compromise Version
This is what OpenAI actually released to users in April 2025.
Performance reality:
41-53% on ARC-AGI (depending on reasoning level)
10% on FrontierMath in independent testing
Significantly reduced compute per query
The pricing for production O3 launched at $10 per million input tokens and $40 per million output tokens. In June 2025, OpenAI slashed these prices by 80% to make the model more accessible, bringing costs down to $2 input and $8 output per million tokens.
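To see what those per-million-token prices mean for a single request, here is a minimal cost sketch. The token counts are hypothetical; the rates are the launch and post-cut prices quoted above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Estimate the dollar cost of one API request at per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

# Hypothetical request: a 2,000-token prompt with a 1,500-token answer.
launch_price = request_cost(2_000, 1_500, input_per_m=10, output_per_m=40)  # April 2025 pricing
june_price = request_cost(2_000, 1_500, input_per_m=2, output_per_m=8)      # after the 80% cut

print(f"At launch: ${launch_price:.4f}  |  After the cut: ${june_price:.4f}")
# At launch: $0.0800  |  After the cut: $0.0160
```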
O3-Pro: The Premium Tier
Released in June 2025, O3-pro represents OpenAI's attempt to offer something closer to preview-level performance at a price some organizations might actually pay.
The economics:
$20 per million input tokens
$80 per million output tokens
10x more expensive than standard O3
Still uses significantly less compute than O3-preview
Even at these premium prices, O3-pro doesn't match the December preview's benchmark performance. It's better than production O3, but it's not the model that set records.
TABLE 1: O3 Version Comparison
| Version | ARC-AGI Score | FrontierMath | Cost Per Task | Availability |
|---|---|---|---|---|
| O3-Preview (High) | 87.5% | ~25%* | $34,400 | Not released |
| O3-Preview (Low) | 75.7% | ~25%* | $200 | Not released |
| O3-Pro | ~60-70%** | 15-19% | ~$50-100** | Released June 2025 |
| Production O3 (Medium) | 53% | ~10% | ~$5-10 | Released April 2025 |
| Production O3 (Low) | 41% | ~10% | ~$2-5 | Released April 2025 |

*Internal testing, not independently verified. **Estimated based on available data.
Why OpenAI Couldn't Release the Real O3
The gap between O3-preview and production O3 isn't a bug. It's the result of fundamental constraints that make the preview version impossible to deploy at scale.
Reason 1: The Economics Don't Work
At $34,400 per task, O3-preview costs more than hiring human experts for most problems.
The math is brutal:
Human mathematician: $5-50 per problem solved
O3-preview (high compute): $34,400 per problem
Cost difference: 688x to 6,880x more expensive than humans
Even wealthy enterprises wouldn't pay those rates for routine work. The model could only justify its cost for problems so difficult that human experts would take weeks to solve, and even then, the ROI would be questionable.
When OpenAI cut production O3 pricing by 80% in June 2025, bringing it from $10/$40 to $2/$8 per million tokens, they were trying to make the technology accessible. At preview-level pricing, O3 would have remained a curiosity rather than a useful product.
Reason 2: Infrastructure Can't Scale
O3-preview uses 172x more computing resources than the low-compute configuration to achieve its highest scores.
The scaling problem:
ChatGPT serves millions of users simultaneously
Each O3-preview query requires massive GPU clusters
Response times stretch into minutes, not seconds
Can't handle the concurrent load of a production service
Imagine if every ChatGPT user suddenly required 172x more computing power. OpenAI would need data centers the size of small cities just to serve current demand. The infrastructure simply doesn't exist to deploy O3-preview to millions of users.
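To make the 172x figure concrete, here is a deliberately simplified capacity calculation. The baseline fleet size is a made-up number, used only for illustration:

```python
# Illustrative only: assume a GPU fleet sized to serve N concurrent queries
# at today's per-query compute, then ask what 172x per-query compute does to it.
COMPUTE_MULTIPLIER = 172                # high-compute preview vs. low-compute config
baseline_concurrent_queries = 100_000   # hypothetical fleet capacity today

capacity_at_preview_compute = baseline_concurrent_queries / COMPUTE_MULTIPLIER
print(f"Same fleet at preview-level compute: ~{capacity_at_preview_compute:,.0f} concurrent queries")
# Same fleet at preview-level compute: ~581 concurrent queries
```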
According to OpenAI's technical staff, the production version was deliberately "optimized for real-world use cases and speed" rather than maximum benchmark performance. Translation: they reduced the compute requirements so the model could actually run at scale.
Reason 3: Speed Matters More Than Perfection
Users expect AI responses in seconds, not minutes.
The latency trade-off:
O3-preview: 20-30 seconds at minimum, stretching into minutes for complex queries
Production O3: Typically responds in 5-15 seconds
Standard models like GPT-4: Respond in 1-3 seconds
In user testing, latency matters more than slightly better accuracy for most applications. A model that gives an 85% quality answer in 10 seconds beats one that gives a 90% answer in 60 seconds for nearly every real-world use case.
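One crude way to see why latency wins is to compare quality-weighted answers delivered per minute, reusing the hypothetical 85%/90% numbers above:

```python
# Crude throughput comparison for the hypothetical example above.
fast = {"quality": 0.85, "seconds": 10}
slow = {"quality": 0.90, "seconds": 60}

for name, m in [("fast", fast), ("slow", slow)]:
    answers_per_minute = 60 / m["seconds"]
    quality_per_minute = answers_per_minute * m["quality"]
    print(f"{name}: {answers_per_minute:.1f} answers/min, "
          f"{quality_per_minute:.2f} quality-weighted answers/min")

# fast: 6.0 answers/min, 5.10 quality-weighted answers/min
# slow: 1.0 answers/min, 0.90 quality-weighted answers/min
```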
OpenAI's Wenda Zhou confirmed in a livestream that the company made deliberate optimizations "to make the model more cost-efficient and more useful in general," explicitly accepting lower benchmark scores in exchange for faster, more practical performance.
Reason 4: Safety Testing Takes Time
More powerful models require more extensive safety evaluations.
The safety challenge:
O3-preview wasn't thoroughly safety tested before December demo
Production O3 underwent months of additional evaluation
O3-pro required even more testing before June release
Unknown capabilities need unknown safeguards
OpenAI's updated Preparedness Framework requires rigorous testing across biological/chemical risks, cybersecurity threats, and AI self-improvement capabilities. The more powerful the model, the longer this process takes.
By the time O3-preview completed safety testing, OpenAI had already optimized it down to production specifications anyway. The preview version never needed to pass the safety bar because it was never going to be released.
Reason 5: The Business Model Requires Volume
OpenAI needs paying customers, not showcase models.
The revenue reality:
Valued at $500 billion with massive infrastructure commitments
Needs sustainable pricing that attracts developers and enterprises
Can't build a business on models that cost $34,400 per query
Must compete with Google, Anthropic, and others on price
Production O3 at $2/$8 per million tokens (after the June price cut) allows developers to build applications that actually make economic sense. That generates API revenue, attracts enterprise customers, and justifies OpenAI's enormous valuation.
O3-preview, no matter how impressive on benchmarks, couldn't generate meaningful revenue at any price point users would accept.
What Users Actually Got Instead
The production O3 that shipped in April 2025 isn't the breakthrough model from December demonstrations. But it's not useless either.
What Production O3 Does Well
Despite lower benchmark scores, production O3 offers genuine improvements over earlier models:
Coding and Software Development:
20% fewer major errors than O1 on real-world programming tasks
Strong performance on algorithmic challenges
Better at catching edge cases and suggesting optimizations
Useful for AI coding tools and development workflows
Business Analysis:
Generates and evaluates hypotheses effectively
Breaks down complex problems systematically
Considers multiple perspectives and trade-offs
Works well as a thought partner for strategic decisions
Multimodal Reasoning:
First reasoning model that can "think with images"
Parses diagrams, charts, and visual data
Solves problems requiring both visual and textual understanding
Handles low-quality or handwritten documents
Scientific Research Support:
Assists with hypothesis generation in biology, math, and engineering
Explores problem spaces and suggests experimental approaches
Identifies potential flaws in research designs
Functions as a capable research assistant (with human oversight)
The Cost-Performance Sweet Spot
For most applications, production O3 hits a practical balance:
TABLE 2: Cost vs. Performance Trade-offs
| Configuration | Input Cost | Output Cost | Response Time | Best For |
|---|---|---|---|---|
| O3-mini-low | $1.10/M | $4.40/M | 3-5 seconds | High-volume tasks, simple queries |
| O3-mini-medium | $1.10/M | $4.40/M | 5-10 seconds | Moderate complexity coding |
| O3-mini-high | $1.10/M | $4.40/M | 10-15 seconds | Complex STEM problems |
| O3 (standard) | $2/M | $8/M | 10-20 seconds | Advanced reasoning tasks |
| O3-pro | $20/M | $80/M | 20-60 seconds | Mission-critical accuracy |
*Pricing as of June 2025 after 80% reduction
For the 90% of use cases that don't require absolute maximum performance, production O3 and O3-mini offer excellent value. They're fast enough for interactive applications, cheap enough for high-volume deployment, and capable enough for most real-world problems.
The Hidden Cost of Test-Time Compute
O3's reasoning capability comes from a technique called test-time compute scaling. The model literally "thinks longer" before responding, exploring multiple solution paths.
This creates a new problem: unpredictable costs.
How Test-Time Compute Works
Traditional AI models have predictable costs:
Input tokens: Known quantity based on prompt length
Output tokens: Roughly estimable based on desired response length
Total cost: Straightforward to calculate
Reasoning models like O3 break this model:
Thinking tokens: Hidden internal reasoning process
Variable compute: More difficult problems use more resources
Unpredictable costs: Can't easily estimate expenses in advance
According to AI researcher Jack Clark, "this is interesting because it has made the costs of running AI systems somewhat less predictable - previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output."
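A small sketch of why budgeting becomes harder: hidden reasoning tokens are billed like output tokens under OpenAI's current o-series pricing, but you cannot know in advance how many a given prompt will trigger. The token counts below are purely illustrative:

```python
def reasoning_request_cost(input_tokens, visible_output_tokens, reasoning_tokens,
                           input_per_m=2.0, output_per_m=8.0):
    """Estimate cost when hidden reasoning tokens are billed at the output rate."""
    billable_output = visible_output_tokens + reasoning_tokens
    return input_tokens / 1e6 * input_per_m + billable_output / 1e6 * output_per_m

prompt, answer = 1_000, 500
# The same prompt can trigger very different amounts of hidden "thinking".
for reasoning in (500, 5_000, 50_000):
    cost = reasoning_request_cost(prompt, answer, reasoning)
    print(f"{reasoning:>6} reasoning tokens -> ${cost:.3f}")

#    500 reasoning tokens -> $0.010
#   5000 reasoning tokens -> $0.046
#  50000 reasoning tokens -> $0.406
```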
The 1,024 Attempts Problem
To achieve its best benchmark scores, O3-preview sometimes needed 1,024 attempts at each task.
The efficiency issue:
Brute force approach: Try many solutions, return the best one
Works for benchmarks: Multiple attempts improve scores
Terrible for production: Can't bill users for 1,024 attempts when they wanted one answer
Not genuine reasoning: More like exhaustive search
This approach works for research and benchmarks but fails for real applications. Users don't want to pay for a model's trial-and-error process. They want accurate answers on the first try.
Production O3 uses far fewer attempts per query, making costs predictable but reducing performance on difficult problems where multiple approaches might be needed.
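For intuition, the benchmark setup behaves roughly like a best-of-N search: sample many candidate solutions, then keep whichever one a scoring or voting step prefers. The sketch below illustrates that general idea, not OpenAI's actual pipeline; generate_candidate and score are placeholders for the model call and the selection rule:

```python
import random

def generate_candidate(task: str) -> str:
    # Placeholder for a (costly) model call that returns one attempted solution.
    return f"candidate-{random.randint(0, 9999)} for {task}"

def score(candidate: str) -> float:
    # Placeholder for the selection rule (verifier, majority vote, etc.).
    return random.random()

def best_of_n(task: str, n: int = 1024) -> str:
    """Brute-force selection: n independent attempts, keep the best-scoring one.
    Every attempt costs compute, which is why n=1024 is fine for a benchmark
    run and ruinous for a production API."""
    candidates = (generate_candidate(task) for _ in range(n))
    return max(candidates, key=score)

print(best_of_n("ARC puzzle #17", n=8))  # production-style n is small; benchmarks can afford 1024
```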
What O3-Pro Actually Offers
When OpenAI released O3-pro in June 2025, they positioned it as offering "the most reliable responses" in the O3 family.
O3-Pro Performance Claims
OpenAI states that O3-pro:
Thinks longer than standard O3
Provides more reliable responses for challenging questions
Excels in science, coding, and math domains
Outperforms standard O3 in expert evaluations
The pricing reflects these capabilities:
10x more expensive than standard O3
Still cheaper than preview by orders of magnitude
Targets enterprise customers with mission-critical needs
Who Actually Needs O3-Pro
The 10x price premium makes sense for specific use cases:
Legal Analysis:
One accurate interpretation of contract law is worth far more than the $100 or so in tokens it costs
Reliability matters more than cost for regulatory compliance
Deep context understanding justifies premium pricing
Financial Modeling:
Accurate risk assessment worth far more than token costs
Complex scenarios require extended reasoning
Errors in analysis carry significant consequences
Scientific Research:
PhD-level problem solving at $50-100 per query beats human researcher time
Novel hypothesis generation has high value
Deep domain expertise justifies compute costs
Advanced Code Synthesis:
Complex architectural decisions benefit from extended reasoning
Catching subtle bugs saves debugging time
System design trade-offs require careful analysis
For these applications, O3-pro's 10x price increase delivers more than 10x value. For everything else, standard O3 or O3-mini makes more economic sense.
TABLE 3: Model Selection Decision Framework
| Your Use Case | Recommended Model | Why |
|---|---|---|
| Simple queries, high volume | O3-mini-low | Fastest, cheapest, sufficient for 70% of tasks |
| Standard coding tasks | O3-mini-high | Good balance of speed and capability |
| Complex business analysis | O3 (standard) | Extended reasoning at reasonable cost |
| Mission-critical decisions | O3-pro | Reliability justifies premium pricing |
| Benchmark research | N/A | Production models optimized for cost, not scores |
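If you wanted to encode that decision framework in code, a minimal routing helper might look like the sketch below. The thresholds and labels are assumptions layered on top of Table 3, not anything OpenAI publishes:

```python
def pick_model(task_complexity: str, volume: str, criticality: str) -> str:
    """Toy router mirroring Table 3. Inputs are rough labels like 'low'/'high'."""
    if criticality == "mission-critical":
        return "o3-pro"            # reliability justifies the 10x premium
    if task_complexity == "high":
        return "o3"                # extended reasoning at reasonable cost
    if volume == "high" and task_complexity == "low":
        return "o3-mini-low"       # fastest and cheapest, fine for simple queries
    return "o3-mini-high"          # default balance of speed and capability

print(pick_model("low", "high", "routine"))              # o3-mini-low
print(pick_model("high", "low", "routine"))              # o3
print(pick_model("medium", "low", "mission-critical"))   # o3-pro
```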
The Broader Pattern in AI Development
OpenAI's decision to release a weaker O3 reflects an industry-wide tension between research capabilities and production reality.
Laboratory vs. Production Gap
Every major AI lab faces the same challenge:
Google's Gemini:
Ultra version showcased in demos
Pro version actually released to users
Significant capability differences between versions
Anthropic's Claude:
Various Opus models with different compute profiles
Production versions optimized for cost and speed
Research capabilities exceed deployed systems
Microsoft's Copilot:
Powered by various models at different price points
Performance varies based on subscription tier
Most advanced capabilities reserved for highest tiers
The pattern is consistent: what gets demonstrated isn't what gets deployed.
Why This Matters for Businesses
Understanding the laboratory/production gap is critical for AI adoption decisions.
Key lessons:
Benchmark scores don't predict production performance
Demo versions often unavailable at any price
Production models optimized for different goals than research
Cost and speed matter more than peak capability
Companies evaluating AI solutions should:
Test production versions, not demo systems
Measure performance on actual tasks, not benchmarks
Calculate real costs including test-time compute variability
Expect capability gaps between announced and delivered systems
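In practice, "test production versions on actual tasks" can be as simple as a small harness that runs your own prompts against the deployed model and records latency and token usage. The sketch below uses the OpenAI Python SDK's chat completions call; the model name, prompts, and output handling are placeholders to adapt to your own workload:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Replace these with prompts drawn from your real workload.
TASKS = [
    "Summarize the key risks in this supplier contract: ...",
    "Find the bug in this SQL migration: ...",
]

def run_eval(model: str = "o3"):  # model name is illustrative; test whichever tier you plan to buy
    for prompt in TASKS:
        start = time.time()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.time() - start
        usage = resp.usage
        print(f"{elapsed:5.1f}s | in={usage.prompt_tokens} out={usage.completion_tokens} | "
              f"{resp.choices[0].message.content[:60]!r}")

if __name__ == "__main__":
    run_eval()
```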
Will OpenAI Ever Release the "Real" O3?
The O3-preview from December 2024 probably won't ever become publicly available at anything close to its original capability level.
The Economics Won't Change Enough
Even with massive improvements in efficiency:
Hardware costs have floors - GPUs, data centers, electricity aren't free
172x compute can't shrink to 1x without architectural changes
Physics constrains optimization - some computation is irreducible
Economic viability requires volume - can't build business on $34,400 queries
The Competition is Heating Up
OpenAI faces pressure from multiple directions:
Google's Gemini offering competitive performance at lower costs
Anthropic's Claude attracting enterprise customers
Open-source models like DeepSeek and Llama providing "good enough" at minimal cost
Pricing pressure pushing toward commoditization
In this environment, OpenAI must optimize for cost and speed rather than absolute peak performance. The market demands practical, affordable AI, not showcase research models.
Future Models Will Follow the Same Pattern
Expect the pattern to repeat with future releases:
O4 (whenever it launches):
Preview version will set new benchmark records
Production version will be weaker but practical
Price will be competitive with current O3
Actual capabilities will improve on today's production O3 but fall short of the preview demos
GPT-5 (announced for "a few months" after O3):
Demonstrations will be impressive
Released version will be optimized for production
Performance gap will exist between showcase and deployment
Business model will require accessible pricing
This isn't deceptive marketing. It's the reality of deploying frontier AI systems at scale. Research capabilities advance faster than production economics can support.
What This Means for AI's Future
The O3 situation reveals important truths about where AI development is heading.
Test-Time Compute is the New Scaling Frontier
After years of making models bigger, AI labs are now making them "think longer."
The shift:
Pre-training scaling hitting diminishing returns
Test-time compute offering new performance gains
Cost-performance trade-offs becoming central to model design
Inference costs mattering as much as training costs
This creates new dynamics where the same model can deliver vastly different results depending on how much compute you're willing to spend per query. It's like having a consultant who can provide quick answers or deep analysis, depending on your budget and timeline.
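OpenAI already exposes this knob for its o-series models as a reasoning-effort setting. The sketch below runs the same prompt at three effort levels; the model name is illustrative, and the exact parameter shape has shifted between API versions, so treat it as a pattern rather than copy-paste code:

```python
from openai import OpenAI

client = OpenAI()
prompt = "Prove or disprove: every positive integer is the sum of at most four squares."

# Same model, same prompt, different compute budgets.
# `reasoning_effort` is the knob exposed for o-series models in the Chat Completions API;
# check the current API reference before relying on this exact parameter name.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3",                      # assumed model name, for illustration
        reasoning_effort=effort,
        messages=[{"role": "user", "content": prompt}],
    )
    print(effort, "->", resp.usage.completion_tokens, "output tokens (including hidden reasoning)")
```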
AGI Remains Distant Despite Impressive Demos
O3-preview's 87.5% on ARC-AGI doesn't mean AGI is here.
The reality check:
Still fails on simple tasks that humans find trivial
ARC-AGI-2 reduces O3 to under 30% while humans score 95%
Hallucination problems persist across all configurations
Reliability issues prevent deployment without human oversight
François Chollet, creator of ARC-AGI, put it directly: "Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet."
The performance gains are real and meaningful. But they're incremental progress toward AGI, not evidence that AGI has arrived.
The Transparency Problem Will Get Worse
As AI companies compete more intensely, expect:
More aggressive benchmark claims
Larger gaps between demos and deployments
Less clarity about testing conditions
More confusion about version differences
The solution isn't trusting vendor claims. It's:
Demanding independent verification
Testing production systems yourself
Focusing on real-world performance metrics
Building relationships with technical communities that share honest assessments
Conclusion
OpenAI released a weaker version of O3 because the alternative was releasing nothing at all.
The O3-preview that scored 87.5% on ARC-AGI and dominated every AI benchmark would cost $34,400 per task to run. It used 172x more computing than practical configurations. It took minutes to respond instead of seconds. And it required infrastructure that doesn't exist at the scale needed to serve millions of users.
The production O3 that actually shipped in April 2025 is dramatically less capable on benchmarks. It scores 41-53% on ARC-AGI instead of 87.5%. It achieves 10% on FrontierMath instead of 25%. But it costs $2-8 per million tokens instead of thousands of dollars per query. It responds in seconds instead of minutes. And it can actually run on OpenAI's existing infrastructure.
This isn't a failure. It's the reality of deploying frontier AI systems at scale. The most advanced capabilities remain locked in research labs not because companies are withholding them maliciously, but because the economics and infrastructure don't exist to make them practical.
For businesses evaluating AI tools, the lesson is clear: focus on what's actually available, not what gets demonstrated. Test production systems with realistic workloads. Calculate true costs including variable test-time compute. And expect meaningful gaps between announced capabilities and delivered performance.
O3-preview set records that prove AI continues advancing. Production O3 represents what's actually deployable today. The gap between them shows how far we still have to go before AI research capabilities become AI production realities.
The "real" O3 exists. You just can't afford to use it, and OpenAI can't afford to let you try.



