Last Updated: July 2, 2026

What Is Grok 4? The Complete Guide to xAI's Model Family in 2026
Grok 4 is xAI's frontier AI model family, launched July 9, 2025, trained with reinforcement learning at pretraining scale using the Colossus 200,000-GPU supercomputer. The Grok 4 family includes six distinct variants ranging from Grok 4.1 Fast at $0.20 per million tokens to Grok 4 Heavy - the only consumer AI model to score 100% on AIME 2025 and the first to break 50% on Humanity's Last Exam. The current API flagship is Grok 4.3, launched April 30, 2026.
That one paragraph answers the direct question. What follows is everything a business professional or developer needs to know about the Grok 4 family in June 2026 - what each variant does, what the benchmark scores actually mean, where Grok 4 leads competitors and where it does not, how to access it, and what Grok 5 means for anyone building on Grok 4 today.
🎯 Before you read on - we put together a free 2026 AI Tools Cheat Sheet covering the tools business leaders are actually using right now. Get it instantly when you subscribe to AI Business Weekly.
Table of Contents
What Is Grok 4 and Why It Matters
Grok 4 is the fourth major model generation from xAI - Elon Musk's AI company, now a subsidiary of SpaceX following the February 2026 acquisition. It launched on July 9, 2025 via a livestream event and marked a step change from Grok 3 in three specific ways: reinforcement learning applied at pretraining scale using the Colossus supercomputer, native tool use trained directly into the model rather than bolted on, and a multi-agent parallel reasoning variant (Grok 4 Heavy) that represented a genuine architectural departure from single-model inference.
The name comes from Robert Heinlein's 1961 novel Stranger in a Strange Land, where "to grok" means to understand something deeply and intuitively - a naming choice that reflects xAI's stated goal of building AI that comprehends rather than merely retrieves, per Suprmind's complete Grok model guide.
What distinguishes the Grok 4 family from other frontier models:
Three things. First, real-time native access to approximately 68 million English-language posts per day from X - the only major frontier model with this capability built natively into inference rather than as a third-party connector. Second, the Grok 4.1 Fast and Grok 4.20 variants offer 2M token context windows - matching Gemini 2.5 Pro and exceeding GPT-5.4's 1M token ceiling. Third, the price efficiency of the fast tier: Grok 4.1 Fast at $0.20 per million input tokens is one of the most competitive frontier API prices available, per TokenMix's April 2026 benchmark and pricing analysis.
Understanding the Grok 4 family requires holding two things simultaneously: the benchmark scores are genuinely impressive, and the model's design choices produce specific failure modes that matter for business use. Both are true, and both are covered below.
For background on xAI as a company and the Colossus infrastructure that trained Grok 4, our xAI statistics guide covers the full corporate picture. For pricing across all consumer subscription tiers, our Grok AI pricing guide covers every option.
The Grok 4 Model Family: Every Variant Explained
The "Grok 4" name covers six distinct models in June 2026, with meaningfully different capabilities, context windows, speeds, and price points. Most benchmark numbers floating online describe one specific variant - usually the original Grok 4 or Grok 4 Heavy from July 2025 - while the model most users are actually accessing may be different.
Complete Grok 4 family as of June 2026:
Model | Released | Context | Speed | Input Price | Best For |
|---|---|---|---|---|---|
Grok 4 | July 2025 | 256K (API) / 128K (app) | Standard | $3/M | Original flagship, heavy reasoning |
Grok 4 Heavy | July 2025 | 256K | Slow | SuperGrok Heavy only | Max reasoning, parallel agents |
Grok 4.1 | Nov 2025 | 256K | Standard | $1.25/M | Improved reasoning over Grok 4 |
Grok 4.1 Fast | Nov 2025 | 2M | 235 tok/s | $0.20/M | High-volume, cost-efficient |
Grok 4.20 | March 2026 | 2M (multi-agent) | 235 tok/s | $2/M | Multi-agent inference, throughput |
Grok 4.3 | April 2026 | 1M | Standard | $1.25/M | Current API flagship, video input |
Sources: xAI official documentation, Suprmind's model guide, AI/ML API's Grok 4.20 analysis, TechJack Solutions' Grok 4.3 guide
Important note on context windows: The original Grok 4 and 4 Heavy have 128K context in the consumer app and 256K in the API - not generous by 2026 standards, per DataCamp's Grok 4 review. The 2M context window that appears in Grok's marketing is from Grok 4.1 Fast and Grok 4.20 - not the original Grok 4. Grok 4.3, despite being the current flagship, has a 1M token API context - smaller than the 4.20 it nominally succeeds. If long-context work is your primary requirement, Grok 4.1 Fast or 4.20 are the right choices, not necessarily the newest model.
Note on model aliases: As of May 2026, Grok 3 was pulled from the platform and its traffic redirects to Grok 4.3. Eight older model identifiers automatically redirect to Grok 4.3. Legacy aliases like grok-4 and grok-latest now resolve to Grok 4.20, per Suprmind.
What Makes Grok 4 Different From Grok 3
The architectural gap between Grok 3 and Grok 4 is the largest in xAI's model history - and understanding it explains why the benchmark improvements are real rather than marketing.
1. Reinforcement learning at pretraining scale
Grok 3 used reinforcement learning in the standard post-training phase - the same approach most frontier labs use. For Grok 4, xAI ran RL training at roughly 10x the compute used for Grok 3, applied at the pretraining scale using the Colossus 200,000-GPU cluster, per xAI's official Grok 4 announcement. The practical effect: the model develops reasoning patterns during pretraining rather than having them imposed afterward. This is why Grok 4 shows particularly strong scaling on hard math and science benchmarks - those gains come from reasoning built into the weights rather than stitched on top.
2. Native tool use trained in
Grok 3's web search and tool access were integrated at the product level. For Grok 4, tool use was trained directly into the model - meaning Grok 4 was taught to select its own search queries, choose when to use a code interpreter, and decide how deeply to research before responding. The distinction matters: Grok 4 does not use tools because an external system routes it to them. It uses tools because it learned that tool use improves answers, per xAI's Grok 4 announcement.
With tools disabled, Grok 4 relies only on trained knowledge. With tools enabled - the default in consumer interfaces - it can verify answers using code execution, live data, and logical workflows. This is why benchmark scores with tools are substantially higher than without, especially in math, science, and research tasks, per igmGuru's Grok 4 guide.
3. X data integration goes deeper
Grok 3 had basic X search access. Grok 4 added advanced keyword and semantic search tools that can search deep within X, plus the ability to view media (images and video from X posts) to improve answer quality. The canonical example from xAI's launch: a user asks about a popular post from a few days ago involving a word puzzle about legs. Grok 4 can locate, read, and reference that specific post within its response using visual and semantic X search.
4. Parallel test-time compute (Heavy variant)
Grok 4 Heavy introduced genuine multi-agent parallel inference - multiple hypotheses considered simultaneously rather than a single sequential chain of thought. Where standard Grok 4 and 4.1 produce one reasoning path, Heavy runs parallel paths that can catch each other's errors and converge on more reliable answers for high-stakes complex tasks.
💡 Finding this helpful? Get bite-sized AI news and practical business insights like this delivered free every morning at 7 AM EST.
Grok 4 Benchmarks: The Honest Picture
The benchmark data for the Grok 4 family is genuinely strong - and requires careful reading because xAI self-reported scores, independent scores, and scores from specific model variants are frequently mixed in online coverage. Here is the clean version.
Benchmark source labeling: Where scores below say "xAI," the figure is self-reported by xAI. Where scores say "independent," they come from outside evaluators such as Artificial Analysis, TokenMix, or DataCamp's own testing, per TechJack Solutions' Grok 4.3 guide.
Grok 4 Heavy benchmarks (original July 2025 model, xAI self-reported unless noted):
Benchmark | What It Tests | Grok 4 Heavy Score | Notes |
|---|---|---|---|
AIME 2025 | Advanced math competition | 100% | xAI, with tools |
Humanity's Last Exam | PhD-level science/math | 50.7% | xAI - first model to break 50% |
Humanity's Last Exam | Same, text-only | 44.4% | xAI |
USAMO 2025 | Math olympiad | 61.9% | xAI |
ARC-AGI-2 | Novel visual reasoning | 15.9% | Independent - first to break 10%, ~2x prior best |
GPQA Diamond | PhD-level science | ~88-89% | Multiple sources |
Vending-Bench | Agentic tool use | $4,694 net worth | xAI vs humans at $844 |
Grok 4.20 benchmarks (March 2026, independent via TokenMix April 2026):
Benchmark | Grok 4.20 | GPT-5.4 | Claude Opus 4.6 | Context |
|---|---|---|---|---|
SWE-bench Verified | 78% | 81.5% | 76% | 3rd place, 60% cheaper output than GPT-5.4 |
MMLU | 91.2% | ~92% | ~88% | Near parity |
Coding Arena Elo | 1,385 | Higher | Lower | Competitive |
Context window | 2M tokens | 1M tokens | 200K tokens | Grok leads |
The benchmark caveats every reader needs:
Caveat 1: Self-reported vs independent. The most impressive Grok 4 scores (100% AIME, 50.7% HLE) are xAI self-reported from benchmark conditions that may include specific prompting, tool configurations, and test harnesses not available to typical users. Independent scores, where available, tend to be lower, per Suprmind's model analysis.
Caveat 2: Version confusion. Most benchmark numbers circulating online describe Grok 4 or Grok 4 Heavy from July 2025. Independent Grok 4.3-specific benchmark scores are still thin as of June 2026. When you read a Grok 4 benchmark, check which specific model variant it describes.
Caveat 3: Benchmark ≠ workflow performance. The useful question is not whether Grok 4 wins a benchmark but whether the benchmark maps to your actual work. SWE-bench at 78% for Grok 4.20 is meaningful for developers building AI coding tools. AIME 100% for Grok 4 Heavy is meaningful for quantitative finance researchers. ARC-AGI-2 at 15.9% is meaningful for AGI researchers tracking general reasoning progress. For marketing copy, email drafting, and general business communication, benchmark rankings rarely determine which tool you should use.
Caveat 4: The overconfidence problem. On the AA-Omniscience benchmark, Grok 4 attempts answers it should refuse 64% of the time. Claude 4.1 Opus achieves a 0% rate on the same metric by declining when uncertain. This is a fundamental design choice - Grok prioritizes attempting answers over acknowledging uncertainty. For tasks where a wrong answer costs more than no answer (medical, legal, financial research), this design choice matters, per Suprmind.
Grok 4 vs GPT-5, Claude, and Gemini
No single model wins everything in 2026. The Grok 4 family leads in some categories, trails in others, and offers unique value in one area that no competitor can match.
Category-by-category comparison:
Category | Leader | Grok 4 Position | Notes |
|---|---|---|---|
Coding (SWE-bench) | GPT-5.4 (81.5%) | 2nd - Grok 4.20 at 78% | Claude Opus 4.6 at 76% in 3rd |
GPQA Diamond (science) | Gemini 3.1 Pro (77.1% ARC-AGI-2) | Strong - ~88-89% GPQA | Gemini leads newer reasoning benchmarks |
Math - AIME | Grok 4 Heavy (100%) | Leader | xAI self-reported |
Math - Humanity's Last Exam | Grok 4 Heavy (50.7%) | Leader | First to break 50% |
Context window | Grok 4.1 Fast / 4.20 (2M) | Leader | Tied with Gemini 2.5 Pro |
Speed | Grok 4.20 (235 tok/s) | Leader | Fastest frontier model |
Cost efficiency | Grok 4.1 Fast ($0.20/M) | Leader | Among cheapest frontier APIs |
Real-time X data | Grok 4 | Exclusive | No competitor has native X access |
Computer use | GPT-5.4 | Trails | GPT-5.4 leads computer use benchmarks |
Coding satisfaction | Claude Code (46% "most loved") | Trails | Developer survey data |
Overconfidence calibration | Claude 4.1 Opus (0% wrong refusal rate) | Trails | Grok 4 at 64% attempts-should-refuse |
The honest competitive summary from AI/ML API's analysis:
"If your workload needs real-time data, very long context, or maximum throughput at low cost - Grok 4.20 is the strongest option right now. If you need best-in-class coding (Claude Opus 4.6), top GPQA scores (Gemini 3.1 Pro), or all-around benchmark leadership with computer use (GPT-5.4), those models still lead in their respective lanes."
That summary applies as well to the full Grok 4 family. The context window, price efficiency, and X data advantages are real and specific. The areas where GPT-5, Claude, and Gemini lead are also real and specific.
For detailed head-to-head comparisons, our Grok vs ChatGPT guide and our SuperGrok vs ChatGPT Plus comparison cover the workflow-specific decision framework.
How to Access Grok 4 in 2026
Access to specific Grok 4 variants depends on which tier you are on. The mapping is not intuitive, particularly because Grok 4.3 - the current flagship - is still rolling out to consumer tiers in stages.
Consumer access:
Plan | Price | Grok 4 Access |
|---|---|---|
Free tier | $0 | No Grok 4 access (Grok 3.5) |
X Premium | $8/month | Basic Grok 4.1 access, higher rate limits |
SuperGrok Lite | $10/month | No confirmed Grok 4 access |
SuperGrok | $30/month | Grok 4 and 4.1 (4.3 rolling out in stages) |
X Premium+ | $40/month | Grok 4/4.1 (4.3 rolling out in stages) |
SuperGrok Heavy | $300/month | Confirmed full Grok 4.3 + exclusive Grok 4 Heavy |
The staged rollout caveat: As of June 2026, two SuperGrok subscribers sending identical queries may receive responses from different model versions during the Grok 4.3 rollout. SuperGrok Heavy is currently the only consumer plan with confirmed, guaranteed Grok 4.3 access at all times. For work where model version consistency matters, this is worth factoring into your subscription decision.
API access (developer):
Grok 4 family models are available through the xAI API via console.x.ai. API access does not require a SuperGrok subscription - it is a separate developer billing relationship charged per token. Multiple model variants are available simultaneously, which is different from the consumer interface where you use whichever version xAI serves you.
For a complete consumer tier decision framework, our what is SuperGrok guide covers every plan in detail.
Grok 4 API: Pricing and Technical Specs
The API is where the Grok 4 family's cost efficiency story is most compelling.
Current API pricing and specs (June 2026):
Model | Input (per 1M tokens) | Output (per 1M tokens) | Context | Speed |
|---|---|---|---|---|
Grok 4 | $3.00 | $15.00 | 256K | Standard |
Grok 4.1 | $1.25 | $2.50 | 256K | Standard |
Grok 4.1 Fast | $0.20 | $0.50 | 2M | 235 tok/s |
Grok 4.20 (multi-agent) | $2.00 | $6.00 | 2M | 235 tok/s |
Grok 4.3 | $1.25 | $2.50 | 1M | Standard |
Grok 4 (cached input) | $0.75 | - | - | - |
The cost efficiency story:
At $0.20 per million input tokens, Grok 4.1 Fast is approximately 25x cheaper than the original Grok 4 at $3.00/M and 10x cheaper than GPT-5.4. TokenMix calculates cost efficiency as output price per million tokens divided by SWE-bench score - at that metric, Grok 4.1 Fast ties with DeepSeek V4 as the most cost-efficient frontier API available, per TokenMix's April 2026 analysis.
For Grok 4.20, the cost per SWE-benchmark point is $0.077 - 2.4x more efficient than Claude Opus ($0.197) or GPT-5.4 ($0.184). At 500 million tokens per month, Grok 4.20 saves approximately $4,584 versus GPT-5.4 for equivalent benchmark performance, per TokenMix.
Additional API costs to factor in:
Beyond token costs, the xAI API charges separately for built-in tools. Web search, X search, and code execution each cost $5 per 1,000 successful calls. File attachments cost $10 per 1,000 calls. A research query that triggers 3-5 tool calls adds $0.015-0.025 per query in tool fees on top of token costs. For high-volume production deployments, these tool costs are a meaningful line item.
xAI offers up to $175 per month in free API credits through their data-sharing program - the most generous free tier among major AI API providers. Credits are available by enabling data sharing in your xAI console settings, per xAI's documentation.
Prompt caching: Grok 4 cached input costs $0.75 per million tokens - an 75% reduction from the $3/M standard rate. For applications that send the same system prompt or document context repeatedly, prompt caching dramatically reduces operating costs.
What Grok 4 Is Best For
Based on the benchmark data, the design philosophy, and real-world use cases documented by practitioners, the Grok 4 family earns its place for specific workflows.
Research requiring real-time X intelligence:
This is Grok 4's exclusive advantage and the clearest case for using it over any competitor. When you need an AI that can tell you what is being said about a topic on X right now - not last month, not in a training dataset, but in the last few hours - Grok 4 is the only frontier model with native access to do that. Journalists tracking breaking news, PR professionals monitoring brand sentiment, market researchers tracking product launches, political analysts monitoring public discourse - for all of these, Grok 4's X integration creates genuine, unique value.
Complex math and scientific reasoning (Heavy variant):
Grok 4 Heavy's 100% AIME and 50.7% HLE scores reflect genuine capability on hard quantitative problems that most models cannot approach. Biomedical researchers, quantitative finance teams, and scientific computing applications represent the legitimate production use cases for Heavy. xAI reports it is already in use at biomedical labs and financial firms, per DataCamp's Grok 4 review.
High-volume API workloads requiring long context at low cost:
Grok 4.1 Fast at $0.20/M input tokens with a 2M token context window is the strongest option in the market for applications where you need to process large documents at scale, run thousands of daily queries, or build cost-efficient production pipelines. The math is straightforward: at scale, Grok 4.1 Fast saves meaningful money against GPT-5.4 or Claude Opus while providing larger context than either.
Multi-source research synthesis:
Grok 4.20's 4-agent architecture produces unique insights that consensus models miss. The Suprmind Multi-Model Divergence Index found Grok surfacing 509 unique insights (19.7% share) per 1,324 production turns - third among five providers, per Suprmind's analysis. For research tasks where perspective diversity matters more than calibration caution, Grok 4.20's willingness to attempt answers creates breadth that more conservative models lack.
What Grok 4 Is Not Best For
Honest evaluation requires both sides. There are clear use cases where Grok 4 is not the strongest available option.
Everyday conversational queries:
The original Grok 4 is a reasoning-first model designed for hard problems. For quick factual questions, casual conversation, and light-touch assistance, Grok 3 was faster and cheaper. The Grok 4.1 Fast variant fills this gap better than the original Grok 4, but Grok 4 itself is overkill for most everyday queries, per DataCamp.
Best-in-class coding production work:
Grok 4.20 scores 78% on SWE-bench - competitive but behind GPT-5.4 at 81.5%. More importantly, Claude Code at 54% AI coding market share and 46% "most loved" rating in the Pragmatic Engineer Survey reflects genuine developer workflow preference that benchmark scores do not fully capture. For teams choosing an AI coding tool, Claude Code and GitHub Copilot have deeper ecosystem integration and higher developer satisfaction than Grok 4, per our claude-code-statistics guide.
High-stakes tasks requiring calibrated refusals:
Grok 4's design choice to attempt answers rather than refuse uncertain questions produces a 64% wrong-answer rate on AA-Omniscience when the model is uncertain. For legal research, medical information, financial compliance work, and any application where a confidently wrong answer is worse than an honest "I don't know," this design choice is a structural mismatch. Claude's 0% wrong-refusal rate on the same benchmark reflects a fundamentally different approach to uncertainty handling, per Suprmind.
Long document processing at consumer tier:
The 128K token app context and 256K API context of the original Grok 4 are not generous by 2026 standards - Gemini 2.5 Pro offers 2M tokens. If you are using Grok 4 through the consumer interface (not API Grok 4.1 Fast or 4.20), the context ceiling creates real limitations for long-document work, per DataCamp.
For broader context on where each major AI platform leads in enterprise workflows, our best AI chatbots for business guide covers the full decision framework.
The Controversies Worth Knowing
Any honest guide to Grok 4 has to address the concerns that reputable technical reviewers have raised.
No formal model card at launch:
Grok 4 launched without a model card - the standard documentation that describes a model's training data, safety evaluations, known limitations, and intended use cases. This makes independent safety assessment significantly harder, per Chatbase's Grok 4 analysis. For enterprise customers with AI governance requirements, the absence of formal safety documentation at launch is a material gap.
Overconfidence on uncertain questions:
The 64% wrong-answer rate on AA-Omniscience (answering questions it should refuse) is not a bug in xAI's view - it is a deliberate design choice reflecting Grok's philosophy of attempting answers over declining. For many consumer use cases, this produces more useful outputs. For high-stakes professional applications, it produces confidently wrong information. Understanding which category your workflow falls into is essential before deploying Grok 4 in production.
Training data transparency:
xAI has not published comprehensive information about Grok 4's training data sources beyond stating that it used large-scale internet data and reinforcement learning at pretraining scale. The training cutoff inconsistency - November 2024 per API documentation versus December 2025 per release notes - has not been officially resolved, per Suprmind.
The rapid iteration pace:
xAI's model cadence is genuinely fast - Grok 3 was pulled in May 2026, Grok 4.3 arrived in April, and Grok 4.4 and 4.5 are reportedly in stages of development. This means any production workflow built on a specific Grok 4 variant needs to account for potential deprecation. Eight older model identifiers already automatically redirect to Grok 4.3. Teams building on the API should pin to specific model version strings rather than using aliases that may redirect to new models without notice.
Grok 5: What Comes Next
Grok 5 has been repeatedly referenced by Elon Musk and xAI's official channels as the next major architectural step. Here is what is reported and what to treat as target rather than shipped fact.
What is reported:
Grok 5 is in training on Colossus 2 - xAI's gigawatt-scale supercomputer launched January 2026 in Southaven, Mississippi. Musk has publicly referenced a 10 trillion parameter target, per Suprmind. Multiple sources report xAI is training parallel Grok 5 variants ranging from 6 trillion to 10 trillion parameters simultaneously. Intermediate checkpoints are reportedly being designated as Grok 4.4 (~1T parameters, reported 2-3 weeks from late April 2026) and Grok 4.5 (~1.5T, reported 4-5 weeks from late April 2026).
What Q2-Q3 2026 means for current Grok 4 users:
The Grok 4.3 timeframe and the Grok 5 projection overlap - meaning the model you access via SuperGrok today could be superseded within months. For consumer users, this means updated model access as new versions roll out within existing subscription tiers (xAI's pattern). For API developers, it means pinning to specific model version strings is important to avoid unexpected behavior changes when aliases redirect to new models.
The honest caveat: Grok 5 as described would represent one of the largest parameter counts in any publicly accessible model. Capability improvements at that scale are genuinely uncertain - scaling laws have produced diminishing returns for some organizations at large parameter counts, while others have found continued gains. Treat any Grok 5 specification as a target until shipped.
For context on Colossus 2's infrastructure and how it compares to OpenAI's Stargate, our xAI statistics guide covers the full infrastructure picture.
What is Grok AI? Complete Guide 2026
Full background on Grok as a platform - history from Grok 1 through Grok 4, xAI's founding, and where the platform fits in the AI landscape.
Grok AI Pricing 2026: Every Plan Compared
Full breakdown of all eight Grok access options including SuperGrok, SuperGrok Heavy, and the xAI API.
What is SuperGrok?
The complete guide to SuperGrok subscription - what you get at $30/month including Grok 4 access, DeepSearch, and Big Brain mode.
Grok Context Window 2026
Detailed breakdown of every Grok 4 variant's context window and what it means for your workflow.
Grok AI Statistics 2026
User numbers, market share, and growth data for the Grok platform.
SuperGrok vs ChatGPT Plus 2026
Head-to-head subscription comparison for professionals choosing between Grok 4 and GPT-5.5 access.
xAI Statistics 2026
The corporate and infrastructure data behind Grok 4 - Colossus, SpaceX acquisition, and xAI's competitive position.
Frequently Asked Questions
What is Grok 4?
Grok 4 is xAI's fourth-generation frontier AI model family, launched July 9, 2025. It was trained with reinforcement learning at pretraining scale using the Colossus 200,000-GPU supercomputer - approximately 10x the compute used for Grok 3. The family includes six variants: Grok 4, Grok 4 Heavy, Grok 4.1, Grok 4.1 Fast, Grok 4.20, and Grok 4.3 (current API flagship, launched April 30, 2026). The defining features are native tool use trained into the model, real-time X data access, and multi-agent parallel inference in the Heavy variant.
What benchmark scores does Grok 4 achieve?
Grok 4 Heavy scores 100% on AIME 2025 (math competition) and 50.7% on Humanity's Last Exam (PhD-level science/math) - the first model to break 50% on HLE - per xAI's self-reported benchmarks. Grok 4 Heavy scored 15.9% on ARC-AGI-2 (independent), nearly doubling the previous best. Grok 4.20 (March 2026) scores 78% on SWE-bench Verified (independent, TokenMix April 2026) - third behind GPT-5.4 at 81.5% but with 60% cheaper output pricing. Scores marked "xAI" are self-reported and should be treated as directional - independent scores are generally lower.
What is Grok 4 Heavy?
Grok 4 Heavy is the multi-agent parallel reasoning variant of Grok 4, exclusive to SuperGrok Heavy subscribers at $300/month. Instead of single-model sequential reasoning, Grok 4 Heavy runs multiple agents in parallel - each working independently on the same problem, comparing results, and converging on an answer. This architecture produced the 100% AIME and 50.7% HLE scores and is designed for researchers and enterprise teams with complex analytical workflows where reasoning quality directly drives outcomes. It is not available via the standard API.
What is the Grok 4.20 multi-agent model?
Grok 4.20 is xAI's multi-agent flagship launched March 10, 2026. It uses a 4-agent "Society of Mind" architecture: Grok (coordinator), Harper (research), Benjamin (math/code), and Lucas (synthesis/creativity) - all four running in parallel on shared weights at inference time. It offers a 2M token context window, 235 tokens per second speed, and costs $2 per million input tokens. The overhead is approximately 1.5-2.5x a single call (not 4x) due to shared KV caching on Colossus infrastructure. Available via API as grok-4.20-0309 in reasoning and non-reasoning variants.
What is Grok 4.3 and how does it differ from Grok 4.20?
Grok 4.3 is xAI's current API flagship, launched April 30, 2026. It costs $1.25 per million input tokens (versus $2 for Grok 4.20) with a 1M token context window (versus 2M for Grok 4.20) and adds native video input capability. Grok 4.3 is recommended as the general-purpose starting point for developers. Grok 4.20 is the right choice when you specifically need the 2M context window, the 4-agent parallel architecture, or the lowest per-token pricing for reasoning-intensive workloads. Eight legacy model aliases now redirect to Grok 4.3 automatically.
How much does Grok 4 cost?
Via consumer subscription: Grok 4/4.1 is included with SuperGrok at $30/month. Grok 4.3 access is rolling out to SuperGrok in stages. Grok 4 Heavy is exclusive to SuperGrok Heavy at $300/month. Via API: Grok 4 costs $3/$15 per million input/output tokens. Grok 4.3 costs $1.25/$2.50. Grok 4.1 Fast costs $0.20/$0.50 with a 2M token context window. Grok 4.20 costs $2/$6. Cached Grok 4 input costs $0.75/M. xAI offers up to $175/month in free API credits through data sharing. Tool use (web search, X search, code execution) costs $5 per 1,000 calls additionally.
Is Grok 4 better than GPT-5 or Claude?
Grok 4 leads in: AIME math benchmarks (100% vs competitors), Humanity's Last Exam (50.7% first to break 50%), context window at the fast tier (2M tokens), throughput speed (235 tok/s), cost efficiency ($0.20/M for Grok 4.1 Fast), and real-time X data access (exclusive). GPT-5.4 leads in: SWE-bench coding (81.5% vs Grok 4.20 at 78%) and computer use benchmarks. Claude Opus 4.6 leads in: developer preference (46% "most loved"), uncertainty calibration (0% wrong refusal vs Grok's 64%), and enterprise coding market share (54%). Gemini 3.1 Pro leads on ARC-AGI-2 at 77.1%.
What is Grok 4's context window?
Context window varies by variant. The original Grok 4: 128K tokens in the consumer app, 256K in the API. Grok 4.1 Fast and Grok 4.20: 2M tokens. Grok 4.3 (current API flagship): 1M tokens. The headline 2M context window associated with Grok in marketing reflects the Fast and 4.20 variants - not the original Grok 4. For consumer SuperGrok subscribers, effective context depends on which model variant is served. For developers, the specific API model string determines context.
Quick Answers
What is Grok 4 in simple terms?
Grok 4 is xAI's AI model launched July 2025, trained with reinforcement learning at 10x the compute of Grok 3 using the Colossus 200,000-GPU supercomputer. It includes native tool use (code interpreter, web browsing), real-time X data access, and a multi-agent parallel variant (Grok 4 Heavy) that was the first AI to break 50% on Humanity's Last Exam. The current API flagship is Grok 4.3 (April 2026) at $1.25/M input tokens. The family spans six variants from $0.20/M (Grok 4.1 Fast) to $300/month consumer (SuperGrok Heavy). Grok 5 is reportedly in training on Colossus 2.
What benchmark scores does Grok 4 achieve in 2026?
Grok 4 Heavy scored 100% on AIME 2025 and 50.7% on Humanity's Last Exam (first model to break 50%) per xAI's self-reported benchmarks. Grok 4.20 scores 78% on SWE-bench Verified (third behind GPT-5.4 at 81.5%) and 91.2% on MMLU per independent TokenMix data from April 2026. Grok 4 scored 15.9% on ARC-AGI-2 (independently verified) - nearly double the prior record. Scores vary by model variant; most impressive figures are Heavy with tools, self-reported by xAI, which typically exceed independently verified scores.
How is Grok 4 different from Grok 3?
Three main differences. First, Grok 4 used reinforcement learning at pretraining scale - roughly 10x more compute than Grok 3 applied earlier in training rather than as post-training fine-tuning. Second, Grok 4 has native tool use trained into the model (code interpreter and web search), not bolted on externally. Third, Grok 4 Heavy introduced genuine multi-agent parallel inference where multiple agents work simultaneously on the same problem. Grok 3 was retired from the platform in May 2026 and its traffic now redirects to Grok 4.3.
What is Grok 4 Heavy and what can it do?
Grok 4 Heavy is the multi-agent parallel reasoning variant of Grok 4, exclusive to SuperGrok Heavy subscribers at $300/month and not available via standard API. It runs multiple agents in parallel that compare results and converge on more reliable answers. Benchmark results per xAI: 100% on AIME 2025, 50.7% on Humanity's Last Exam (first model to break 50%), 61.9% on USAMO 2025 math olympiad. Designed for researchers, quant traders, and enterprise teams where reasoning quality on complex technical problems directly drives outcomes. SuperGrok Heavy is currently the only consumer plan with confirmed full Grok 4.3 access.
Conclusion
The Grok 4 family in June 2026 represents a genuine capability step from Grok 3 - the benchmark scores on hard math and science tasks are real, the X data access is exclusive, and the cost efficiency at the Grok 4.1 Fast tier is unmatched among frontier APIs.
The honest picture requires holding the strengths alongside the limitations. The 64% overconfidence rate on uncertain questions is a design choice, not a bug - but it is a design choice that matters for how you deploy Grok 4 in professional contexts. The benchmark version confusion is real - most numbers people cite describe the July 2025 Grok 4 or Grok 4 Heavy, not the April 2026 Grok 4.3 that API users are actually calling. The context window is larger on the fast tier than on the flagship tier - an unusual inversion that requires checking before assuming the newest model is the best fit.
For most professionals evaluating Grok 4: the consumer entry point is SuperGrok at $30/month, where Grok 4/4.1 access provides the full feature set including DeepSearch and X data integration. The API entry point is Grok 4.3 at $1.25/M input tokens for general use, Grok 4.1 Fast at $0.20/M for high-volume cost-sensitive workloads, and Grok 4.20 for multi-agent tasks where the 2M context and parallel architecture add specific value.
Grok 5, in training on Colossus 2 at a reported 10 trillion parameter target, is the next inflection point. Until it ships, Grok 4.3 is the stable foundation - and Grok 4.1 Fast remains one of the best value propositions in the frontier API market.
📨 Don't miss tomorrow's edition. Subscribe free to AI Business Weekly and get our 2026 AI Tools Cheat Sheet instantly - bite-sized AI news every morning, zero hype.



