This website uses cookies

Read our Privacy policy and Terms of use for more information.

Last Updated: July 2, 2026

What Is Grok 4? The Complete Guide to xAI's Model Family in 2026

Grok 4 is xAI's frontier AI model family, launched July 9, 2025, trained with reinforcement learning at pretraining scale using the Colossus 200,000-GPU supercomputer. The Grok 4 family includes six distinct variants ranging from Grok 4.1 Fast at $0.20 per million tokens to Grok 4 Heavy - the only consumer AI model to score 100% on AIME 2025 and the first to break 50% on Humanity's Last Exam. The current API flagship is Grok 4.3, launched April 30, 2026.

That one paragraph answers the direct question. What follows is everything a business professional or developer needs to know about the Grok 4 family in June 2026 - what each variant does, what the benchmark scores actually mean, where Grok 4 leads competitors and where it does not, how to access it, and what Grok 5 means for anyone building on Grok 4 today.

🎯 Before you read on - we put together a free 2026 AI Tools Cheat Sheet covering the tools business leaders are actually using right now. Get it instantly when you subscribe to AI Business Weekly.

Table of Contents

What Is Grok 4 and Why It Matters

Grok 4 is the fourth major model generation from xAI - Elon Musk's AI company, now a subsidiary of SpaceX following the February 2026 acquisition. It launched on July 9, 2025 via a livestream event and marked a step change from Grok 3 in three specific ways: reinforcement learning applied at pretraining scale using the Colossus supercomputer, native tool use trained directly into the model rather than bolted on, and a multi-agent parallel reasoning variant (Grok 4 Heavy) that represented a genuine architectural departure from single-model inference.

The name comes from Robert Heinlein's 1961 novel Stranger in a Strange Land, where "to grok" means to understand something deeply and intuitively - a naming choice that reflects xAI's stated goal of building AI that comprehends rather than merely retrieves, per Suprmind's complete Grok model guide.

What distinguishes the Grok 4 family from other frontier models:

Three things. First, real-time native access to approximately 68 million English-language posts per day from X - the only major frontier model with this capability built natively into inference rather than as a third-party connector. Second, the Grok 4.1 Fast and Grok 4.20 variants offer 2M token context windows - matching Gemini 2.5 Pro and exceeding GPT-5.4's 1M token ceiling. Third, the price efficiency of the fast tier: Grok 4.1 Fast at $0.20 per million input tokens is one of the most competitive frontier API prices available, per TokenMix's April 2026 benchmark and pricing analysis.

Understanding the Grok 4 family requires holding two things simultaneously: the benchmark scores are genuinely impressive, and the model's design choices produce specific failure modes that matter for business use. Both are true, and both are covered below.

For background on xAI as a company and the Colossus infrastructure that trained Grok 4, our xAI statistics guide covers the full corporate picture. For pricing across all consumer subscription tiers, our Grok AI pricing guide covers every option.

The Grok 4 Model Family: Every Variant Explained

The "Grok 4" name covers six distinct models in June 2026, with meaningfully different capabilities, context windows, speeds, and price points. Most benchmark numbers floating online describe one specific variant - usually the original Grok 4 or Grok 4 Heavy from July 2025 - while the model most users are actually accessing may be different.

Complete Grok 4 family as of June 2026:

Model

Released

Context

Speed

Input Price

Best For

Grok 4

July 2025

256K (API) / 128K (app)

Standard

$3/M

Original flagship, heavy reasoning

Grok 4 Heavy

July 2025

256K

Slow

SuperGrok Heavy only

Max reasoning, parallel agents

Grok 4.1

Nov 2025

256K

Standard

$1.25/M

Improved reasoning over Grok 4

Grok 4.1 Fast

Nov 2025

2M

235 tok/s

$0.20/M

High-volume, cost-efficient

Grok 4.20

March 2026

2M (multi-agent)

235 tok/s

$2/M

Multi-agent inference, throughput

Grok 4.3

April 2026

1M

Standard

$1.25/M

Current API flagship, video input

Important note on context windows: The original Grok 4 and 4 Heavy have 128K context in the consumer app and 256K in the API - not generous by 2026 standards, per DataCamp's Grok 4 review. The 2M context window that appears in Grok's marketing is from Grok 4.1 Fast and Grok 4.20 - not the original Grok 4. Grok 4.3, despite being the current flagship, has a 1M token API context - smaller than the 4.20 it nominally succeeds. If long-context work is your primary requirement, Grok 4.1 Fast or 4.20 are the right choices, not necessarily the newest model.

Note on model aliases: As of May 2026, Grok 3 was pulled from the platform and its traffic redirects to Grok 4.3. Eight older model identifiers automatically redirect to Grok 4.3. Legacy aliases like grok-4 and grok-latest now resolve to Grok 4.20, per Suprmind.

What Makes Grok 4 Different From Grok 3

The architectural gap between Grok 3 and Grok 4 is the largest in xAI's model history - and understanding it explains why the benchmark improvements are real rather than marketing.

1. Reinforcement learning at pretraining scale

Grok 3 used reinforcement learning in the standard post-training phase - the same approach most frontier labs use. For Grok 4, xAI ran RL training at roughly 10x the compute used for Grok 3, applied at the pretraining scale using the Colossus 200,000-GPU cluster, per xAI's official Grok 4 announcement. The practical effect: the model develops reasoning patterns during pretraining rather than having them imposed afterward. This is why Grok 4 shows particularly strong scaling on hard math and science benchmarks - those gains come from reasoning built into the weights rather than stitched on top.

2. Native tool use trained in

Grok 3's web search and tool access were integrated at the product level. For Grok 4, tool use was trained directly into the model - meaning Grok 4 was taught to select its own search queries, choose when to use a code interpreter, and decide how deeply to research before responding. The distinction matters: Grok 4 does not use tools because an external system routes it to them. It uses tools because it learned that tool use improves answers, per xAI's Grok 4 announcement.

With tools disabled, Grok 4 relies only on trained knowledge. With tools enabled - the default in consumer interfaces - it can verify answers using code execution, live data, and logical workflows. This is why benchmark scores with tools are substantially higher than without, especially in math, science, and research tasks, per igmGuru's Grok 4 guide.

3. X data integration goes deeper

Grok 3 had basic X search access. Grok 4 added advanced keyword and semantic search tools that can search deep within X, plus the ability to view media (images and video from X posts) to improve answer quality. The canonical example from xAI's launch: a user asks about a popular post from a few days ago involving a word puzzle about legs. Grok 4 can locate, read, and reference that specific post within its response using visual and semantic X search.

4. Parallel test-time compute (Heavy variant)

Grok 4 Heavy introduced genuine multi-agent parallel inference - multiple hypotheses considered simultaneously rather than a single sequential chain of thought. Where standard Grok 4 and 4.1 produce one reasoning path, Heavy runs parallel paths that can catch each other's errors and converge on more reliable answers for high-stakes complex tasks.

💡 Finding this helpful? Get bite-sized AI news and practical business insights like this delivered free every morning at 7 AM EST.

Grok 4 Benchmarks: The Honest Picture

The benchmark data for the Grok 4 family is genuinely strong - and requires careful reading because xAI self-reported scores, independent scores, and scores from specific model variants are frequently mixed in online coverage. Here is the clean version.

Benchmark source labeling: Where scores below say "xAI," the figure is self-reported by xAI. Where scores say "independent," they come from outside evaluators such as Artificial Analysis, TokenMix, or DataCamp's own testing, per TechJack Solutions' Grok 4.3 guide.

Grok 4 Heavy benchmarks (original July 2025 model, xAI self-reported unless noted):

Benchmark

What It Tests

Grok 4 Heavy Score

Notes

AIME 2025

Advanced math competition

100%

xAI, with tools

Humanity's Last Exam

PhD-level science/math

50.7%

xAI - first model to break 50%

Humanity's Last Exam

Same, text-only

44.4%

xAI

USAMO 2025

Math olympiad

61.9%

xAI

ARC-AGI-2

Novel visual reasoning

15.9%

Independent - first to break 10%, ~2x prior best

GPQA Diamond

PhD-level science

~88-89%

Multiple sources

Vending-Bench

Agentic tool use

$4,694 net worth

xAI vs humans at $844

Grok 4.20 benchmarks (March 2026, independent via TokenMix April 2026):

Benchmark

Grok 4.20

GPT-5.4

Claude Opus 4.6

Context

SWE-bench Verified

78%

81.5%

76%

3rd place, 60% cheaper output than GPT-5.4

MMLU

91.2%

~92%

~88%

Near parity

Coding Arena Elo

1,385

Higher

Lower

Competitive

Context window

2M tokens

1M tokens

200K tokens

Grok leads

The benchmark caveats every reader needs:

Caveat 1: Self-reported vs independent. The most impressive Grok 4 scores (100% AIME, 50.7% HLE) are xAI self-reported from benchmark conditions that may include specific prompting, tool configurations, and test harnesses not available to typical users. Independent scores, where available, tend to be lower, per Suprmind's model analysis.

Caveat 2: Version confusion. Most benchmark numbers circulating online describe Grok 4 or Grok 4 Heavy from July 2025. Independent Grok 4.3-specific benchmark scores are still thin as of June 2026. When you read a Grok 4 benchmark, check which specific model variant it describes.

Caveat 3: Benchmark ≠ workflow performance. The useful question is not whether Grok 4 wins a benchmark but whether the benchmark maps to your actual work. SWE-bench at 78% for Grok 4.20 is meaningful for developers building AI coding tools. AIME 100% for Grok 4 Heavy is meaningful for quantitative finance researchers. ARC-AGI-2 at 15.9% is meaningful for AGI researchers tracking general reasoning progress. For marketing copy, email drafting, and general business communication, benchmark rankings rarely determine which tool you should use.

Caveat 4: The overconfidence problem. On the AA-Omniscience benchmark, Grok 4 attempts answers it should refuse 64% of the time. Claude 4.1 Opus achieves a 0% rate on the same metric by declining when uncertain. This is a fundamental design choice - Grok prioritizes attempting answers over acknowledging uncertainty. For tasks where a wrong answer costs more than no answer (medical, legal, financial research), this design choice matters, per Suprmind.

Grok 4 vs GPT-5, Claude, and Gemini

No single model wins everything in 2026. The Grok 4 family leads in some categories, trails in others, and offers unique value in one area that no competitor can match.

Category-by-category comparison:

Category

Leader

Grok 4 Position

Notes

Coding (SWE-bench)

GPT-5.4 (81.5%)

2nd - Grok 4.20 at 78%

Claude Opus 4.6 at 76% in 3rd

GPQA Diamond (science)

Gemini 3.1 Pro (77.1% ARC-AGI-2)

Strong - ~88-89% GPQA

Gemini leads newer reasoning benchmarks

Math - AIME

Grok 4 Heavy (100%)

Leader

xAI self-reported

Math - Humanity's Last Exam

Grok 4 Heavy (50.7%)

Leader

First to break 50%

Context window

Grok 4.1 Fast / 4.20 (2M)

Leader

Tied with Gemini 2.5 Pro

Speed

Grok 4.20 (235 tok/s)

Leader

Fastest frontier model

Cost efficiency

Grok 4.1 Fast ($0.20/M)

Leader

Among cheapest frontier APIs

Real-time X data

Grok 4

Exclusive

No competitor has native X access

Computer use

GPT-5.4

Trails

GPT-5.4 leads computer use benchmarks

Coding satisfaction

Claude Code (46% "most loved")

Trails

Developer survey data

Overconfidence calibration

Claude 4.1 Opus (0% wrong refusal rate)

Trails

Grok 4 at 64% attempts-should-refuse

The honest competitive summary from AI/ML API's analysis:

"If your workload needs real-time data, very long context, or maximum throughput at low cost - Grok 4.20 is the strongest option right now. If you need best-in-class coding (Claude Opus 4.6), top GPQA scores (Gemini 3.1 Pro), or all-around benchmark leadership with computer use (GPT-5.4), those models still lead in their respective lanes."

That summary applies as well to the full Grok 4 family. The context window, price efficiency, and X data advantages are real and specific. The areas where GPT-5, Claude, and Gemini lead are also real and specific.

For detailed head-to-head comparisons, our Grok vs ChatGPT guide and our SuperGrok vs ChatGPT Plus comparison cover the workflow-specific decision framework.

How to Access Grok 4 in 2026

Access to specific Grok 4 variants depends on which tier you are on. The mapping is not intuitive, particularly because Grok 4.3 - the current flagship - is still rolling out to consumer tiers in stages.

Consumer access:

Plan

Price

Grok 4 Access

Free tier

$0

No Grok 4 access (Grok 3.5)

X Premium

$8/month

Basic Grok 4.1 access, higher rate limits

SuperGrok Lite

$10/month

No confirmed Grok 4 access

SuperGrok

$30/month

Grok 4 and 4.1 (4.3 rolling out in stages)

X Premium+

$40/month

Grok 4/4.1 (4.3 rolling out in stages)

SuperGrok Heavy

$300/month

Confirmed full Grok 4.3 + exclusive Grok 4 Heavy

The staged rollout caveat: As of June 2026, two SuperGrok subscribers sending identical queries may receive responses from different model versions during the Grok 4.3 rollout. SuperGrok Heavy is currently the only consumer plan with confirmed, guaranteed Grok 4.3 access at all times. For work where model version consistency matters, this is worth factoring into your subscription decision.

API access (developer):

Grok 4 family models are available through the xAI API via console.x.ai. API access does not require a SuperGrok subscription - it is a separate developer billing relationship charged per token. Multiple model variants are available simultaneously, which is different from the consumer interface where you use whichever version xAI serves you.

For a complete consumer tier decision framework, our what is SuperGrok guide covers every plan in detail.

Grok 4 API: Pricing and Technical Specs

The API is where the Grok 4 family's cost efficiency story is most compelling.

Current API pricing and specs (June 2026):

Model

Input (per 1M tokens)

Output (per 1M tokens)

Context

Speed

Grok 4

$3.00

$15.00

256K

Standard

Grok 4.1

$1.25

$2.50

256K

Standard

Grok 4.1 Fast

$0.20

$0.50

2M

235 tok/s

Grok 4.20 (multi-agent)

$2.00

$6.00

2M

235 tok/s

Grok 4.3

$1.25

$2.50

1M

Standard

Grok 4 (cached input)

$0.75

-

-

-

The cost efficiency story:

At $0.20 per million input tokens, Grok 4.1 Fast is approximately 25x cheaper than the original Grok 4 at $3.00/M and 10x cheaper than GPT-5.4. TokenMix calculates cost efficiency as output price per million tokens divided by SWE-bench score - at that metric, Grok 4.1 Fast ties with DeepSeek V4 as the most cost-efficient frontier API available, per TokenMix's April 2026 analysis.

For Grok 4.20, the cost per SWE-benchmark point is $0.077 - 2.4x more efficient than Claude Opus ($0.197) or GPT-5.4 ($0.184). At 500 million tokens per month, Grok 4.20 saves approximately $4,584 versus GPT-5.4 for equivalent benchmark performance, per TokenMix.

Additional API costs to factor in:

Beyond token costs, the xAI API charges separately for built-in tools. Web search, X search, and code execution each cost $5 per 1,000 successful calls. File attachments cost $10 per 1,000 calls. A research query that triggers 3-5 tool calls adds $0.015-0.025 per query in tool fees on top of token costs. For high-volume production deployments, these tool costs are a meaningful line item.

xAI offers up to $175 per month in free API credits through their data-sharing program - the most generous free tier among major AI API providers. Credits are available by enabling data sharing in your xAI console settings, per xAI's documentation.

Prompt caching: Grok 4 cached input costs $0.75 per million tokens - an 75% reduction from the $3/M standard rate. For applications that send the same system prompt or document context repeatedly, prompt caching dramatically reduces operating costs.

What Grok 4 Is Best For

Based on the benchmark data, the design philosophy, and real-world use cases documented by practitioners, the Grok 4 family earns its place for specific workflows.

Research requiring real-time X intelligence:

This is Grok 4's exclusive advantage and the clearest case for using it over any competitor. When you need an AI that can tell you what is being said about a topic on X right now - not last month, not in a training dataset, but in the last few hours - Grok 4 is the only frontier model with native access to do that. Journalists tracking breaking news, PR professionals monitoring brand sentiment, market researchers tracking product launches, political analysts monitoring public discourse - for all of these, Grok 4's X integration creates genuine, unique value.

Complex math and scientific reasoning (Heavy variant):

Grok 4 Heavy's 100% AIME and 50.7% HLE scores reflect genuine capability on hard quantitative problems that most models cannot approach. Biomedical researchers, quantitative finance teams, and scientific computing applications represent the legitimate production use cases for Heavy. xAI reports it is already in use at biomedical labs and financial firms, per DataCamp's Grok 4 review.

High-volume API workloads requiring long context at low cost:

Grok 4.1 Fast at $0.20/M input tokens with a 2M token context window is the strongest option in the market for applications where you need to process large documents at scale, run thousands of daily queries, or build cost-efficient production pipelines. The math is straightforward: at scale, Grok 4.1 Fast saves meaningful money against GPT-5.4 or Claude Opus while providing larger context than either.

Multi-source research synthesis:

Grok 4.20's 4-agent architecture produces unique insights that consensus models miss. The Suprmind Multi-Model Divergence Index found Grok surfacing 509 unique insights (19.7% share) per 1,324 production turns - third among five providers, per Suprmind's analysis. For research tasks where perspective diversity matters more than calibration caution, Grok 4.20's willingness to attempt answers creates breadth that more conservative models lack.

What Grok 4 Is Not Best For

Honest evaluation requires both sides. There are clear use cases where Grok 4 is not the strongest available option.

Everyday conversational queries:

The original Grok 4 is a reasoning-first model designed for hard problems. For quick factual questions, casual conversation, and light-touch assistance, Grok 3 was faster and cheaper. The Grok 4.1 Fast variant fills this gap better than the original Grok 4, but Grok 4 itself is overkill for most everyday queries, per DataCamp.

Best-in-class coding production work:

Grok 4.20 scores 78% on SWE-bench - competitive but behind GPT-5.4 at 81.5%. More importantly, Claude Code at 54% AI coding market share and 46% "most loved" rating in the Pragmatic Engineer Survey reflects genuine developer workflow preference that benchmark scores do not fully capture. For teams choosing an AI coding tool, Claude Code and GitHub Copilot have deeper ecosystem integration and higher developer satisfaction than Grok 4, per our claude-code-statistics guide.

High-stakes tasks requiring calibrated refusals:

Grok 4's design choice to attempt answers rather than refuse uncertain questions produces a 64% wrong-answer rate on AA-Omniscience when the model is uncertain. For legal research, medical information, financial compliance work, and any application where a confidently wrong answer is worse than an honest "I don't know," this design choice is a structural mismatch. Claude's 0% wrong-refusal rate on the same benchmark reflects a fundamentally different approach to uncertainty handling, per Suprmind.

Long document processing at consumer tier:

The 128K token app context and 256K API context of the original Grok 4 are not generous by 2026 standards - Gemini 2.5 Pro offers 2M tokens. If you are using Grok 4 through the consumer interface (not API Grok 4.1 Fast or 4.20), the context ceiling creates real limitations for long-document work, per DataCamp.

For broader context on where each major AI platform leads in enterprise workflows, our best AI chatbots for business guide covers the full decision framework.

The Controversies Worth Knowing

Any honest guide to Grok 4 has to address the concerns that reputable technical reviewers have raised.

No formal model card at launch:

Grok 4 launched without a model card - the standard documentation that describes a model's training data, safety evaluations, known limitations, and intended use cases. This makes independent safety assessment significantly harder, per Chatbase's Grok 4 analysis. For enterprise customers with AI governance requirements, the absence of formal safety documentation at launch is a material gap.

Overconfidence on uncertain questions:

The 64% wrong-answer rate on AA-Omniscience (answering questions it should refuse) is not a bug in xAI's view - it is a deliberate design choice reflecting Grok's philosophy of attempting answers over declining. For many consumer use cases, this produces more useful outputs. For high-stakes professional applications, it produces confidently wrong information. Understanding which category your workflow falls into is essential before deploying Grok 4 in production.

Training data transparency:

xAI has not published comprehensive information about Grok 4's training data sources beyond stating that it used large-scale internet data and reinforcement learning at pretraining scale. The training cutoff inconsistency - November 2024 per API documentation versus December 2025 per release notes - has not been officially resolved, per Suprmind.

The rapid iteration pace:

xAI's model cadence is genuinely fast - Grok 3 was pulled in May 2026, Grok 4.3 arrived in April, and Grok 4.4 and 4.5 are reportedly in stages of development. This means any production workflow built on a specific Grok 4 variant needs to account for potential deprecation. Eight older model identifiers already automatically redirect to Grok 4.3. Teams building on the API should pin to specific model version strings rather than using aliases that may redirect to new models without notice.

Grok 5: What Comes Next

Grok 5 has been repeatedly referenced by Elon Musk and xAI's official channels as the next major architectural step. Here is what is reported and what to treat as target rather than shipped fact.

What is reported:

Grok 5 is in training on Colossus 2 - xAI's gigawatt-scale supercomputer launched January 2026 in Southaven, Mississippi. Musk has publicly referenced a 10 trillion parameter target, per Suprmind. Multiple sources report xAI is training parallel Grok 5 variants ranging from 6 trillion to 10 trillion parameters simultaneously. Intermediate checkpoints are reportedly being designated as Grok 4.4 (~1T parameters, reported 2-3 weeks from late April 2026) and Grok 4.5 (~1.5T, reported 4-5 weeks from late April 2026).

What Q2-Q3 2026 means for current Grok 4 users:

The Grok 4.3 timeframe and the Grok 5 projection overlap - meaning the model you access via SuperGrok today could be superseded within months. For consumer users, this means updated model access as new versions roll out within existing subscription tiers (xAI's pattern). For API developers, it means pinning to specific model version strings is important to avoid unexpected behavior changes when aliases redirect to new models.

The honest caveat: Grok 5 as described would represent one of the largest parameter counts in any publicly accessible model. Capability improvements at that scale are genuinely uncertain - scaling laws have produced diminishing returns for some organizations at large parameter counts, while others have found continued gains. Treat any Grok 5 specification as a target until shipped.

For context on Colossus 2's infrastructure and how it compares to OpenAI's Stargate, our xAI statistics guide covers the full infrastructure picture.

What is Grok AI? Complete Guide 2026
Full background on Grok as a platform - history from Grok 1 through Grok 4, xAI's founding, and where the platform fits in the AI landscape.

Grok AI Pricing 2026: Every Plan Compared
Full breakdown of all eight Grok access options including SuperGrok, SuperGrok Heavy, and the xAI API.

What is SuperGrok?
The complete guide to SuperGrok subscription - what you get at $30/month including Grok 4 access, DeepSearch, and Big Brain mode.

Grok Context Window 2026
Detailed breakdown of every Grok 4 variant's context window and what it means for your workflow.

Grok AI Statistics 2026
User numbers, market share, and growth data for the Grok platform.

SuperGrok vs ChatGPT Plus 2026
Head-to-head subscription comparison for professionals choosing between Grok 4 and GPT-5.5 access.

xAI Statistics 2026
The corporate and infrastructure data behind Grok 4 - Colossus, SpaceX acquisition, and xAI's competitive position.

Frequently Asked Questions

What is Grok 4?
Grok 4 is xAI's fourth-generation frontier AI model family, launched July 9, 2025. It was trained with reinforcement learning at pretraining scale using the Colossus 200,000-GPU supercomputer - approximately 10x the compute used for Grok 3. The family includes six variants: Grok 4, Grok 4 Heavy, Grok 4.1, Grok 4.1 Fast, Grok 4.20, and Grok 4.3 (current API flagship, launched April 30, 2026). The defining features are native tool use trained into the model, real-time X data access, and multi-agent parallel inference in the Heavy variant.

What benchmark scores does Grok 4 achieve?
Grok 4 Heavy scores 100% on AIME 2025 (math competition) and 50.7% on Humanity's Last Exam (PhD-level science/math) - the first model to break 50% on HLE - per xAI's self-reported benchmarks. Grok 4 Heavy scored 15.9% on ARC-AGI-2 (independent), nearly doubling the previous best. Grok 4.20 (March 2026) scores 78% on SWE-bench Verified (independent, TokenMix April 2026) - third behind GPT-5.4 at 81.5% but with 60% cheaper output pricing. Scores marked "xAI" are self-reported and should be treated as directional - independent scores are generally lower.

What is Grok 4 Heavy?
Grok 4 Heavy is the multi-agent parallel reasoning variant of Grok 4, exclusive to SuperGrok Heavy subscribers at $300/month. Instead of single-model sequential reasoning, Grok 4 Heavy runs multiple agents in parallel - each working independently on the same problem, comparing results, and converging on an answer. This architecture produced the 100% AIME and 50.7% HLE scores and is designed for researchers and enterprise teams with complex analytical workflows where reasoning quality directly drives outcomes. It is not available via the standard API.

What is the Grok 4.20 multi-agent model?
Grok 4.20 is xAI's multi-agent flagship launched March 10, 2026. It uses a 4-agent "Society of Mind" architecture: Grok (coordinator), Harper (research), Benjamin (math/code), and Lucas (synthesis/creativity) - all four running in parallel on shared weights at inference time. It offers a 2M token context window, 235 tokens per second speed, and costs $2 per million input tokens. The overhead is approximately 1.5-2.5x a single call (not 4x) due to shared KV caching on Colossus infrastructure. Available via API as grok-4.20-0309 in reasoning and non-reasoning variants.

What is Grok 4.3 and how does it differ from Grok 4.20?
Grok 4.3 is xAI's current API flagship, launched April 30, 2026. It costs $1.25 per million input tokens (versus $2 for Grok 4.20) with a 1M token context window (versus 2M for Grok 4.20) and adds native video input capability. Grok 4.3 is recommended as the general-purpose starting point for developers. Grok 4.20 is the right choice when you specifically need the 2M context window, the 4-agent parallel architecture, or the lowest per-token pricing for reasoning-intensive workloads. Eight legacy model aliases now redirect to Grok 4.3 automatically.

How much does Grok 4 cost?
Via consumer subscription: Grok 4/4.1 is included with SuperGrok at $30/month. Grok 4.3 access is rolling out to SuperGrok in stages. Grok 4 Heavy is exclusive to SuperGrok Heavy at $300/month. Via API: Grok 4 costs $3/$15 per million input/output tokens. Grok 4.3 costs $1.25/$2.50. Grok 4.1 Fast costs $0.20/$0.50 with a 2M token context window. Grok 4.20 costs $2/$6. Cached Grok 4 input costs $0.75/M. xAI offers up to $175/month in free API credits through data sharing. Tool use (web search, X search, code execution) costs $5 per 1,000 calls additionally.

Is Grok 4 better than GPT-5 or Claude?
Grok 4 leads in: AIME math benchmarks (100% vs competitors), Humanity's Last Exam (50.7% first to break 50%), context window at the fast tier (2M tokens), throughput speed (235 tok/s), cost efficiency ($0.20/M for Grok 4.1 Fast), and real-time X data access (exclusive). GPT-5.4 leads in: SWE-bench coding (81.5% vs Grok 4.20 at 78%) and computer use benchmarks. Claude Opus 4.6 leads in: developer preference (46% "most loved"), uncertainty calibration (0% wrong refusal vs Grok's 64%), and enterprise coding market share (54%). Gemini 3.1 Pro leads on ARC-AGI-2 at 77.1%.

What is Grok 4's context window?
Context window varies by variant. The original Grok 4: 128K tokens in the consumer app, 256K in the API. Grok 4.1 Fast and Grok 4.20: 2M tokens. Grok 4.3 (current API flagship): 1M tokens. The headline 2M context window associated with Grok in marketing reflects the Fast and 4.20 variants - not the original Grok 4. For consumer SuperGrok subscribers, effective context depends on which model variant is served. For developers, the specific API model string determines context.

Quick Answers

What is Grok 4 in simple terms?
Grok 4 is xAI's AI model launched July 2025, trained with reinforcement learning at 10x the compute of Grok 3 using the Colossus 200,000-GPU supercomputer. It includes native tool use (code interpreter, web browsing), real-time X data access, and a multi-agent parallel variant (Grok 4 Heavy) that was the first AI to break 50% on Humanity's Last Exam. The current API flagship is Grok 4.3 (April 2026) at $1.25/M input tokens. The family spans six variants from $0.20/M (Grok 4.1 Fast) to $300/month consumer (SuperGrok Heavy). Grok 5 is reportedly in training on Colossus 2.

What benchmark scores does Grok 4 achieve in 2026?
Grok 4 Heavy scored 100% on AIME 2025 and 50.7% on Humanity's Last Exam (first model to break 50%) per xAI's self-reported benchmarks. Grok 4.20 scores 78% on SWE-bench Verified (third behind GPT-5.4 at 81.5%) and 91.2% on MMLU per independent TokenMix data from April 2026. Grok 4 scored 15.9% on ARC-AGI-2 (independently verified) - nearly double the prior record. Scores vary by model variant; most impressive figures are Heavy with tools, self-reported by xAI, which typically exceed independently verified scores.

How is Grok 4 different from Grok 3?
Three main differences. First, Grok 4 used reinforcement learning at pretraining scale - roughly 10x more compute than Grok 3 applied earlier in training rather than as post-training fine-tuning. Second, Grok 4 has native tool use trained into the model (code interpreter and web search), not bolted on externally. Third, Grok 4 Heavy introduced genuine multi-agent parallel inference where multiple agents work simultaneously on the same problem. Grok 3 was retired from the platform in May 2026 and its traffic now redirects to Grok 4.3.

What is Grok 4 Heavy and what can it do?
Grok 4 Heavy is the multi-agent parallel reasoning variant of Grok 4, exclusive to SuperGrok Heavy subscribers at $300/month and not available via standard API. It runs multiple agents in parallel that compare results and converge on more reliable answers. Benchmark results per xAI: 100% on AIME 2025, 50.7% on Humanity's Last Exam (first model to break 50%), 61.9% on USAMO 2025 math olympiad. Designed for researchers, quant traders, and enterprise teams where reasoning quality on complex technical problems directly drives outcomes. SuperGrok Heavy is currently the only consumer plan with confirmed full Grok 4.3 access.

Conclusion

The Grok 4 family in June 2026 represents a genuine capability step from Grok 3 - the benchmark scores on hard math and science tasks are real, the X data access is exclusive, and the cost efficiency at the Grok 4.1 Fast tier is unmatched among frontier APIs.

The honest picture requires holding the strengths alongside the limitations. The 64% overconfidence rate on uncertain questions is a design choice, not a bug - but it is a design choice that matters for how you deploy Grok 4 in professional contexts. The benchmark version confusion is real - most numbers people cite describe the July 2025 Grok 4 or Grok 4 Heavy, not the April 2026 Grok 4.3 that API users are actually calling. The context window is larger on the fast tier than on the flagship tier - an unusual inversion that requires checking before assuming the newest model is the best fit.

For most professionals evaluating Grok 4: the consumer entry point is SuperGrok at $30/month, where Grok 4/4.1 access provides the full feature set including DeepSearch and X data integration. The API entry point is Grok 4.3 at $1.25/M input tokens for general use, Grok 4.1 Fast at $0.20/M for high-volume cost-sensitive workloads, and Grok 4.20 for multi-agent tasks where the 2M context and parallel architecture add specific value.

Grok 5, in training on Colossus 2 at a reported 10 trillion parameter target, is the next inflection point. Until it ships, Grok 4.3 is the stable foundation - and Grok 4.1 Fast remains one of the best value propositions in the frontier API market.

📨 Don't miss tomorrow's edition. Subscribe free to AI Business Weekly and get our 2026 AI Tools Cheat Sheet instantly - bite-sized AI news every morning, zero hype.

Keep Reading