When OpenAI CEO Sam Altman showcased O3 in December 2024, the AI community watched in amazement. The model achieved 87.5% on ARC-AGI, a test designed to measure progress toward artificial general intelligence. It scored over 25% on FrontierMath, where competing models barely exceeded 2%. The demonstrations suggested a genuine breakthrough in AI reasoning.

Then in April 2025, OpenAI finally released O3 to the public. Users eager to access this groundbreaking technology quickly discovered something unexpected.

The O3 they could actually use was dramatically weaker than what Altman had demonstrated four months earlier.

The production version scored 41% on ARC-AGI instead of 87.5%. Independent testing showed it achieved just 10% on FrontierMath instead of the claimed 25%. Even OpenAI quietly acknowledged on their website that the released O3 was "a materially different system" optimized for real-world use rather than benchmark performance.

This wasn't an accident or a mistake. OpenAI deliberately released a weaker version of O3 while keeping the most powerful configuration locked away. The reasons why reveal fundamental tensions in AI development between what's technically possible and what's practically deployable.

The Three Versions of O3 Nobody Talks About

Most people think there's just one O3 model. In reality, OpenAI has created at least three distinct versions, each with dramatically different capabilities and costs.

O3-Preview: The Uneconomical Benchmark Destroyer

This is the version Sam Altman demonstrated in December 2024.

Performance highlights:

  • 87.5% on ARC-AGI

  • 25% on FrontierMath (internal testing)

  • Record-breaking scores across nearly every AI benchmark

The catch? According to the ARC Prize Foundation, this version cost an estimated $34,400 per task to run. At that price, you could hire a professional mathematician to solve problems for $5 each and still save 99.9% compared to running O3-preview at high compute.

Even the "low-compute" preview version that scored 75.7% on ARC-AGI still cost $200 per task. For context, a typical GPT-4 API call costs pennies.

Production O3: The Compromise Version

This is what OpenAI actually released to users in April 2025.

Performance reality:

  • 41-53% on ARC-AGI (depending on reasoning level)

  • 10% on FrontierMath in independent testing

  • Significantly reduced compute per query

The pricing for production O3 launched at $10 per million input tokens and $40 per million output tokens. In June 2025, OpenAI slashed these prices by 80% to make the model more accessible, bringing costs down to $2 input and $8 output per million tokens.
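To make that pricing concrete, here is a minimal back-of-envelope sketch of the per-query arithmetic. The rates are the ones quoted above; the query size (2,000 input tokens, 1,000 output tokens) is a hypothetical example, not an OpenAI figure.

```python
# Per-query cost for production O3 at the launch rates and the post-cut rates.
# Rates are in dollars per million tokens; the query size is an assumed example.
PRICES = {
    "o3_launch":   {"input": 10.00, "output": 40.00},  # April 2025 launch pricing
    "o3_post_cut": {"input": 2.00,  "output": 8.00},   # June 2025, after the 80% cut
}

def query_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single call at the listed per-million-token rates."""
    p = PRICES[tier]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Example: 2,000 input tokens, 1,000 output tokens.
print(f"Launch pricing:   ${query_cost('o3_launch', 2_000, 1_000):.4f}")    # ≈ $0.0600
print(f"Post-cut pricing: ${query_cost('o3_post_cut', 2_000, 1_000):.4f}")  # ≈ $0.0120
```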

O3-Pro: The Premium Option Most Can't Afford

Released in June 2025, O3-pro represents OpenAI's attempt to offer something closer to preview-level performance at a price some organizations might actually pay.

The economics:

  • $20 per million input tokens

  • $80 per million output tokens

  • 10x more expensive than standard O3

  • Still uses significantly less compute than O3-preview

Even at these premium prices, O3-pro doesn't match the December preview's benchmark performance. It's better than production O3, but it's not the model that set records.

TABLE 1: O3 Version Comparison

| Version | ARC-AGI Score | FrontierMath | Cost Per Task | Availability |
|---|---|---|---|---|
| O3-Preview (High) | 87.5% | ~25%* | $34,400 | Not released |
| O3-Preview (Low) | 75.7% | ~25%* | $200 | Not released |
| O3-Pro | ~60-70%** | 15-19% | ~$50-100** | Released June 2025 |
| Production O3 (Medium) | 53% | ~10% | ~$5-10 | Released April 2025 |
| Production O3 (Low) | 41% | ~10% | ~$2-5 | Released April 2025 |

*Internal testing, not independently verified. **Estimated based on available data.

Why OpenAI Couldn't Release the Real O3

The gap between O3-preview and production O3 isn't a bug. It's the result of fundamental constraints that make the preview version impossible to deploy at scale.

Reason 1: The Economics Don't Work

At $34,400 per task, O3-preview costs more than hiring human experts for most problems.

The math is brutal:

  • Human mathematician: $5-50 per problem solved

  • O3-preview (high compute): $34,400 per problem

  • Cost difference: 688x to 6,880x more expensive than humans

Even wealthy enterprises wouldn't pay those rates for routine work. The model could only justify its cost for problems so difficult that human experts would take weeks to solve, and even then, the ROI would be questionable.
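For readers who want to check the arithmetic, a quick sketch using the figures above (the $5-50 human range and the estimated $34,400 per task):

```python
# Ratio of O3-preview (high compute) cost to a human expert, per task.
o3_preview_per_task = 34_400       # estimated cost per task (ARC Prize Foundation)
human_low, human_high = 5, 50      # the article's assumed human range per problem

print(o3_preview_per_task / human_high)  # 688.0   -> "688x more expensive"
print(o3_preview_per_task / human_low)   # 6880.0  -> "6,880x more expensive"

# Savings from using the $5 human instead of the model:
print(1 - human_low / o3_preview_per_task)  # ≈ 0.99985, i.e. the "save 99.9%" claim
```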

When OpenAI cut production O3 pricing by 80% in June 2025, bringing it from $10/$40 to $2/$8 per million tokens, they were trying to make the technology accessible. At preview-level pricing, O3 would have remained a curiosity rather than a useful product.

Reason 2: Infrastructure Can't Scale

O3-preview uses 172x more computing resources than the low-compute configuration to achieve its highest scores.

The scaling problem:

  • ChatGPT serves millions of users simultaneously

  • Each O3-preview query requires massive GPU clusters

  • Response times stretch into minutes, not seconds

  • Can't handle the concurrent load of a production service

Imagine if every ChatGPT user suddenly required 172x more computing power. OpenAI would need data centers the size of small cities just to serve current demand. The infrastructure simply doesn't exist to deploy O3-preview to millions of users.

According to OpenAI's technical staff, the production version was deliberately "optimized for real-world use cases and speed" rather than maximum benchmark performance. Translation: they reduced the compute requirements so the model could actually run at scale.

Reason 3: Speed Matters More Than Perfection

Users expect AI responses in seconds, not minutes.

The latency trade-off:

  • O3-preview: Can take 20-30 seconds or longer for complex queries

  • Production O3: Typically responds in 5-15 seconds

  • Standard models like GPT-4: Respond in 1-3 seconds

In user testing, latency matters more than slightly better accuracy for most applications. A model that gives an 85% quality answer in 10 seconds beats one that gives a 90% answer in 60 seconds for nearly every real-world use case.

OpenAI's Wenda Zhou confirmed in a livestream that the company made deliberate optimizations "to make the model more cost-efficient and more useful in general," explicitly accepting lower benchmark scores in exchange for faster, more practical performance.

Reason 4: Safety Testing Takes Time

More powerful models require more extensive safety evaluations.

The safety challenge:

  • O3-preview wasn't thoroughly safety tested before the December demo

  • Production O3 underwent months of additional evaluation

  • O3-pro required even more testing before June release

  • Unknown capabilities need unknown safeguards

OpenAI's updated Preparedness Framework requires rigorous testing across biological/chemical risks, cybersecurity threats, and AI self-improvement capabilities. The more powerful the model, the longer this process takes.

By the time O3-preview completed safety testing, OpenAI had already optimized it down to production specifications anyway. The preview version never needed to pass the safety bar because it was never going to be released.

Reason 5: The Business Model Requires Volume

OpenAI needs paying customers, not showcase models.

The revenue reality:

  • Valued at $500 billion with massive infrastructure commitments

  • Needs sustainable pricing that attracts developers and enterprises

  • Can't build a business on models that cost $34,400 per query

  • Must compete with Google, Anthropic, and others on price

Production O3 at $2/$8 per million tokens (after the June price cut) allows developers to build applications that actually make economic sense. That generates API revenue, attracts enterprise customers, and justifies OpenAI's enormous valuation.

O3-preview, no matter how impressive on benchmarks, couldn't generate meaningful revenue at any price point users would accept.

What Users Actually Got Instead

The production O3 that shipped in April 2025 isn't the breakthrough model from December demonstrations. But it's not useless either.

What Production O3 Does Well

Despite lower benchmark scores, production O3 offers genuine improvements over earlier models:

Coding and Software Development:

  • 20% fewer major errors than O1 on real-world programming tasks

  • Strong performance on algorithmic challenges

  • Better at catching edge cases and suggesting optimizations

  • Useful for AI coding tools and development workflows

Business Analysis:

  • Generates and evaluates hypotheses effectively

  • Breaks down complex problems systematically

  • Considers multiple perspectives and trade-offs

  • Works well as a thought partner for strategic decisions

Multimodal Reasoning:

  • First reasoning model that can "think with images"

  • Parses diagrams, charts, and visual data

  • Solves problems requiring both visual and textual understanding

  • Handles low-quality or handwritten documents

Scientific Research Support:

  • Assists with hypothesis generation in biology, math, and engineering

  • Explores problem spaces and suggests experimental approaches

  • Identifies potential flaws in research designs

  • Functions as a capable research assistant (with human oversight)

The Cost-Performance Sweet Spot

For most applications, production O3 hits a practical balance:

TABLE 2: Cost vs. Performance Trade-offs

| Configuration | Input Cost | Output Cost | Response Time | Best For |
|---|---|---|---|---|
| O3-mini-low | $1.10/M | $4.40/M | 3-5 seconds | High-volume tasks, simple queries |
| O3-mini-medium | $1.10/M | $4.40/M | 5-10 seconds | Moderate complexity coding |
| O3-mini-high | $1.10/M | $4.40/M | 10-15 seconds | Complex STEM problems |
| O3 (standard) | $2/M | $8/M | 10-20 seconds | Advanced reasoning tasks |
| O3-pro | $20/M | $80/M | 20-60 seconds | Mission-critical accuracy |

*Pricing as of June 2025, after the 80% price cut on standard O3.

For the 90% of use cases that don't require absolute maximum performance, production O3 and O3-mini offer excellent value. They're fast enough for interactive applications, cheap enough for high-volume deployment, and capable enough for most real-world problems.

The Hidden Cost of Test-Time Compute

O3's reasoning capability comes from a technique called test-time compute scaling. The model literally "thinks longer" before responding, exploring multiple solution paths.

This creates a new problem: unpredictable costs.

How Test-Time Compute Works

Traditional AI models have predictable costs:

  • Input tokens: Known quantity based on prompt length

  • Output tokens: Roughly estimable based on desired response length

  • Total cost: Straightforward to calculate

Reasoning models like O3 break this model:

  • Thinking tokens: Hidden internal reasoning process

  • Variable compute: More difficult problems use more resources

  • Unpredictable costs: Can't easily estimate expenses in advance

According to AI researcher Jack Clark, "this is interesting because it has made the costs of running AI systems somewhat less predictable - previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output."
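Here is a minimal sketch of why billing becomes unpredictable. It assumes what OpenAI documents for its o-series models: hidden reasoning tokens are billed at the output-token rate, and their count varies with problem difficulty, so the same prompt can produce very different bills. The token counts below are illustrative, not measured.

```python
# Cost estimate for a reasoning-model call where hidden "thinking" tokens
# are billed at the output rate (as OpenAI documents for its o-series models).
# The reasoning_tokens count is unknown in advance, which is the whole problem.

INPUT_RATE = 2.00    # $ per million tokens (production O3, post-cut)
OUTPUT_RATE = 8.00   # $ per million tokens

def reasoning_call_cost(input_tokens: int, visible_output_tokens: int,
                        reasoning_tokens: int) -> float:
    """Total cost when reasoning tokens are charged like output tokens."""
    billable_output = visible_output_tokens + reasoning_tokens
    return (input_tokens / 1e6) * INPUT_RATE + (billable_output / 1e6) * OUTPUT_RATE

# Same 2,000-token prompt and 500-token visible answer, different thinking budgets:
for thinking in (1_000, 10_000, 50_000):   # hypothetical reasoning-token counts
    print(f"{thinking:>6} reasoning tokens -> ${reasoning_call_cost(2_000, 500, thinking):.3f}")
# ~$0.016, ~$0.088, ~$0.408: an order-of-magnitude spread driven by a quantity
# the caller never sets directly.
```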

The 1,024 Attempts Problem

To achieve its best benchmark scores, O3-preview sometimes needed 1,024 attempts at each task.

The efficiency issue:

  • Brute force approach: Try many solutions, return the best one

  • Works for benchmarks: Multiple attempts improve scores

  • Terrible for production: Can't bill users for 1,024 attempts when they wanted one answer

  • Not genuine reasoning: More like exhaustive search

This approach works for research and benchmarks but fails for real applications. Users don't want to pay for a model's trial-and-error process. They want accurate answers on the first try.

Production O3 uses far fewer attempts per query, making costs predictable but reducing performance on difficult problems where multiple approaches might be needed.
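As an illustration of the difference, here is a minimal sketch of the benchmark-style best-of-N pattern versus a single production attempt. The `ask_model` and `score_candidate` functions are hypothetical placeholders, not OpenAI APIs; the point is only that the brute-force loop multiplies cost by N.

```python
import random

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for one model call; every call costs money."""
    return f"candidate-{random.randint(0, 9)}"

def score_candidate(answer: str) -> float:
    """Hypothetical grader/verifier used to rank candidate answers."""
    return random.random()

def best_of_n(prompt: str, n: int = 1024) -> str:
    """Benchmark-style brute force: sample n answers, keep the best-scoring one.
    Roughly n times the cost of a single attempt."""
    candidates = [ask_model(prompt) for _ in range(n)]
    return max(candidates, key=score_candidate)

def single_attempt(prompt: str) -> str:
    """Production-style: one attempt, predictable cost, lower ceiling on hard problems."""
    return ask_model(prompt)
```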

What O3-Pro Actually Offers

When OpenAI released O3-pro in June 2025, they positioned it as offering "the most reliable responses" in the O3 family.

O3-Pro Performance Claims

OpenAI states that O3-pro:

  • Thinks longer than standard O3

  • Provides more reliable responses for challenging questions

  • Excels in science, coding, and math domains

  • Outperforms standard O3 in expert evaluations

The pricing reflects these capabilities:

  • 10x more expensive than standard O3

  • Still cheaper than preview by orders of magnitude

  • Targets enterprise customers with mission-critical needs

Who Actually Needs O3-Pro

The 10x price premium makes sense for specific use cases:

Legal Analysis:

  • One accurate interpretation of contract law is worth far more than the $100 or so it costs in tokens

  • Reliability matters more than cost for regulatory compliance

  • Deep context understanding justifies premium pricing

Financial Modeling:

  • Accurate risk assessment worth far more than token costs

  • Complex scenarios require extended reasoning

  • Errors in analysis carry significant consequences

Scientific Research:

  • PhD-level problem solving at $50-100 per query beats human researcher time

  • Novel hypothesis generation has high value

  • Deep domain expertise justifies compute costs

Advanced Code Synthesis:

  • Complex architectural decisions benefit from extended reasoning

  • Catching subtle bugs saves debugging time

  • System design trade-offs require careful analysis

For these applications, O3-pro's 10x price increase delivers more than 10x value. For everything else, standard O3 or O3-mini makes more economic sense.

TABLE 3: Model Selection Decision Framework

| Your Use Case | Recommended Model | Why |
|---|---|---|
| Simple queries, high volume | O3-mini-low | Fastest, cheapest, sufficient for 70% of tasks |
| Standard coding tasks | O3-mini-high | Good balance of speed and capability |
| Complex business analysis | O3 (standard) | Extended reasoning at reasonable cost |
| Mission-critical decisions | O3-pro | Reliability justifies premium pricing |
| Benchmark research | N/A | Production models optimized for cost, not scores |
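If you prefer the same framework in code form, a minimal sketch might look like the following. The mapping simply mirrors the table above, and the use-case labels are illustrative; adapt them to your own workloads.

```python
# Illustrative mapping of the decision framework above to model tiers.
MODEL_FOR_USE_CASE = {
    "simple_high_volume":        "o3-mini (low effort)",
    "standard_coding":           "o3-mini (high effort)",
    "complex_business_analysis": "o3",
    "mission_critical":          "o3-pro",
}

def recommend_model(use_case: str) -> str:
    """Return the recommended tier for a use case, mirroring the table above."""
    return MODEL_FOR_USE_CASE.get(
        use_case,
        "no recommendation: benchmark-level peak scores are not a production target",
    )

print(recommend_model("standard_coding"))   # o3-mini (high effort)
print(recommend_model("benchmark_research"))  # falls through to the default
```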

The Broader Pattern in AI Development

OpenAI's decision to release a weaker O3 reflects an industry-wide tension between research capabilities and production reality.

Laboratory vs. Production Gap

Every major AI lab faces the same challenge:

Google's Gemini:

  • Ultra version showcased in demos

  • Pro version actually released to users

  • Significant capability differences between versions

Anthropic's Claude:

  • Various Opus models with different compute profiles

  • Production versions optimized for cost and speed

  • Research capabilities exceed deployed systems

Microsoft's Copilot:

  • Powered by various models at different price points

  • Performance varies based on subscription tier

  • Most advanced capabilities reserved for highest tiers

The pattern is consistent: what gets demonstrated isn't what gets deployed.

Why This Matters for Businesses

Understanding the laboratory/production gap is critical for AI adoption decisions.

Key lessons:

  • Benchmark scores don't predict production performance

  • Demo versions often unavailable at any price

  • Production models optimized for different goals than research

  • Cost and speed matter more than peak capability

Companies evaluating AI solutions should:

  1. Test production versions, not demo systems

  2. Measure performance on actual tasks, not benchmarks

  3. Calculate real costs including test-time compute variability

  4. Expect capability gaps between announced and delivered systems

Will OpenAI Ever Release the "Real" O3?

The O3-preview from December 2024 probably won't ever become publicly available at anything close to its original capability level.

The Economics Won't Change Enough

Even with massive improvements in efficiency:

  • Hardware costs have floors - GPUs, data centers, electricity aren't free

  • 172x compute can't shrink to 1x without architectural changes

  • Physics constrains optimization - some computation is irreducible

  • Economic viability requires volume - can't build business on $34,400 queries

The Competition is Heating Up

OpenAI faces pressure from multiple directions:

  • Google's Gemini offering competitive performance at lower costs

  • Anthropic's Claude attracting enterprise customers

  • Open-source models like DeepSeek and Llama providing "good enough" at minimal cost

  • Pricing pressure pushing toward commoditization

In this environment, OpenAI must optimize for cost and speed rather than absolute peak performance. The market demands practical, affordable AI, not showcase research models.

Future Models Will Follow the Same Pattern

Expect the pattern to repeat with future releases:

O4 (whenever it launches):

  • Preview version will set new benchmark records

  • Production version will be weaker but practical

  • Price will be competitive with current O3

  • Actual capabilities will improve on today's models but fall short of the preview demos

GPT-5 (announced for "a few months" after O3):

  • Demonstrations will be impressive

  • Released version will be optimized for production

  • Performance gap will exist between showcase and deployment

  • Business model will require accessible pricing

This isn't deceptive marketing. It's the reality of deploying frontier AI systems at scale. Research capabilities advance faster than production economics can support.

What This Means for AI's Future

The O3 situation reveals important truths about where AI development is heading.

Test-Time Compute is the New Scaling Frontier

After years of making models bigger, AI labs are now making them "think longer."

The shift:

  • Pre-training scaling hitting diminishing returns

  • Test-time compute offering new performance gains

  • Cost-performance trade-offs becoming central to model design

  • Inference costs mattering as much as training costs

This creates new dynamics where the same model can deliver vastly different results depending on how much compute you're willing to spend per query. It's like having a consultant who can provide quick answers or deep analysis, depending on your budget and timeline.
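In API terms, this usually surfaces as a per-request knob rather than a different model. The sketch below uses the `reasoning_effort` parameter OpenAI exposes for its o-series models in the Chat Completions API; treat the exact parameter name, supported values, and billed-token behavior as assumptions to verify against the current documentation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "How many distinct ways can a 4x10 rectangle be tiled with 1x2 dominoes?"

# Same model, three different "thinking" budgets. More effort generally means more
# hidden reasoning tokens, a higher bill, and a slower response for the same question.
for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="o3-mini",                 # o-series reasoning model
        reasoning_effort=effort,         # per-request compute knob
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    print(effort, usage.completion_tokens, "output tokens billed")
```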

AGI Remains Distant Despite Impressive Demos

O3-preview's 87.5% on ARC-AGI doesn't mean AGI is here.

The reality check:

  • Still fails on simple tasks that humans find trivial

  • ARC-AGI-2 reduces O3 to under 30% while humans score 95%

  • Hallucination problems persist across all configurations

  • Reliability issues prevent deployment without human oversight

François Chollet, creator of ARC-AGI, put it directly: "Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet."

The performance gains are real and meaningful. But they're incremental progress toward AGI, not evidence that AGI has arrived.

The Transparency Problem Will Get Worse

As AI companies compete more intensely, expect:

  • More aggressive benchmark claims

  • Larger gaps between demos and deployments

  • Less clarity about testing conditions

  • More confusion about version differences

The solution isn't trusting vendor claims. It's:

  • Demanding independent verification

  • Testing production systems yourself

  • Focusing on real-world performance metrics

  • Building relationships with technical communities that share honest assessments

Conclusion

OpenAI released a weaker version of O3 because the alternative was releasing nothing at all.

The O3-preview that scored 87.5% on ARC-AGI and dominated every AI benchmark would cost $34,400 per task to run. It used 172x more computing than practical configurations. It took minutes to respond instead of seconds. And it required infrastructure that doesn't exist at the scale needed to serve millions of users.

The production O3 that actually shipped in April 2025 is dramatically less capable on benchmarks. It scores 41-53% on ARC-AGI instead of 87.5%. It achieves 10% on FrontierMath instead of 25%. But it costs $2-8 per million tokens instead of thousands of dollars per query. It responds in seconds instead of minutes. And it can actually run on OpenAI's existing infrastructure.

This isn't a failure. It's the reality of deploying frontier AI systems at scale. The most advanced capabilities remain locked in research labs not because companies are withholding them maliciously, but because the economics and infrastructure don't exist to make them practical.

For businesses evaluating AI tools, the lesson is clear: focus on what's actually available, not what gets demonstrated. Test production systems with realistic workloads. Calculate true costs including variable test-time compute. And expect meaningful gaps between announced capabilities and delivered performance.

O3-preview set records that prove AI continues advancing. Production O3 represents what's actually deployable today. The gap between them shows how far we still have to go before AI research capabilities become AI production realities.

The "real" O3 exists. You just can't afford to use it, and OpenAI can't afford to let you try.
