Mercury 2 Review: The Diffusion LLM That Hits 1,000 Tokens Per Second (2026)


10 March 2026

Note: This article is based on publicly available information from Inception Labs, independent benchmarks, developer reports, and AI research community analysis. We have not independently tested Mercury 2 at production scale.

Quick answer:

Mercury 2 is Inception Labs' second-generation diffusion language model, capable of generating text at approximately 1,000 tokens per second - roughly 5 to 10 times faster than comparable autoregressive models. It achieves this by iteratively denoising a sequence in parallel rather than predicting tokens one at a time. Quality is competitive with mid-tier models on coding and instruction tasks, making it a strong choice for latency-critical applications like real-time autocomplete, interactive coding assistants, and high-volume document processing.

What is Mercury 2?


In February 2026, Inception Labs published results for Mercury 2, the successor to their original Mercury diffusion language model. The headline number was hard to ignore: 1,000 tokens per second. For context, GPT-4o at standard throughput produces roughly 80-120 tokens per second. Claude Opus 4.6 operates in a similar range. Mercury 2 is not incrementally faster - it is an order of magnitude faster at text generation.

The reason is architectural. Mercury 2 is not an autoregressive model. It does not predict the next token based on all previous tokens, then repeat that process until the sequence is complete. Instead, it uses a diffusion process borrowed from the image generation world: it starts with a noisy, blurry representation of the output and iteratively refines it until a coherent, high-quality response emerges.

This is a genuinely different approach to language modelling, not an incremental improvement. The implications for latency, cost, and the types of tasks AI can realistically handle in real-time are significant. Whether Mercury 2 represents a paradigm shift or a specialist tool for specific workloads is the central question this article addresses.

Background:

Inception Labs was founded by former researchers from Meta AI, Google DeepMind, and other leading AI labs. The original Mercury model demonstrated diffusion LLM viability in 2025. Mercury 2 represents their second-generation model, with substantially improved quality benchmarks alongside the high throughput the architecture is known for.

Diffusion vs autoregressive: the fundamental difference


To understand why Mercury 2's speed claim is credible, you need to understand the core limitation of standard language models.

Every LLM you have used before - GPT-4, Claude, Gemini, Llama, Mistral - uses autoregressive generation. When you ask a question, the model generates the first token of the response, then uses that token (plus your original input) to generate the second token, then generates token three conditioned on tokens one and two, and so on. Each token generation requires a full forward pass through the model. If your response is 500 tokens long, the model runs 500 sequential forward passes.

This is inherently serial. You cannot parallelise it. You cannot generate token 50 until you have generated tokens 1 through 49. Modern GPUs are extraordinarily good at parallel computation, but autoregressive generation largely wastes that parallelism because each step depends on the previous one.

How diffusion changes the equation

Diffusion models work differently. In image generation (Stable Diffusion, DALL-E 3, Midjourney), you start with a grid of random noise and run it through a denoising network repeatedly. Each denoising step makes the image slightly less noisy and slightly more coherent. After enough steps, noise becomes a high-quality image.

Inception Labs applies this same concept to language. Mercury 2 starts with a sequence of masked or noisy token embeddings representing the entire output - from the first word to the last - and runs a denoising process over the whole sequence simultaneously. Each denoising step refines all positions in parallel. The result after enough denoising steps is a coherent, grammatical response.

The critical difference: all positions are processed simultaneously rather than sequentially. A GPU that can run 1,000 parallel computations uses all of those for every denoising step. An autoregressive model generating a 1,000-token response runs 1,000 sequential steps, each using only a fraction of the available parallelism.

| Property | Autoregressive (GPT-4, Claude) | Diffusion (Mercury 2) |
| --- | --- | --- |
| Generation order | Left to right, one token at a time | All positions simultaneously, iteratively refined |
| GPU parallelism | Limited by sequential dependency | Full parallelism per denoising step |
| Throughput | 80-150 tokens/sec (typical) | ~1,000 tokens/sec |
| Time to first token | Fast (near-instant) | Slower (must complete denoising first) |
| Streaming support | Yes (natural, token by token) | Partial (requires special handling) |
| Multi-step reasoning | Strong (later tokens see earlier thinking) | Weaker (global denoising, less chain-of-thought) |

The trade-offs are real. Mercury 2's diffusion approach has a higher latency to the first token compared to autoregressive models. When you ask GPT-4 a question, the first word of the response appears almost instantly because the model starts outputting immediately. Mercury 2 has to complete several denoising steps across the full output before showing you anything. For short responses, this initial delay is noticeable. For longer responses, the total wall-clock time is much lower.
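A back-of-envelope comparison makes the trade-off concrete. The figures below are illustrative assumptions (an autoregressive model at ~100 tok/s that streams immediately; a diffusion model that runs ~10 denoising steps of ~30 ms each), not measured values:

```python
# Illustrative latency model: autoregressive output streams token by token,
# while a diffusion model shows nothing until all denoising steps finish,
# then delivers the whole response at once. All numbers are assumptions.

def autoregressive_times(n_tokens: int, tok_per_sec: float = 100.0):
    ttft = 1.0 / tok_per_sec                # first token after one pass
    return ttft, n_tokens / tok_per_sec     # (time to first token, total)

def diffusion_times(n_tokens: int, steps: int = 10, step_s: float = 0.03):
    total = steps * step_s                  # nothing visible until the end
    return total, total

for n in (50, 500):
    ar = autoregressive_times(n)
    df = diffusion_times(n)
    print(f"{n:>3} tokens  AR: ttft={ar[0]:.2f}s total={ar[1]:.2f}s   "
          f"diffusion: ttft={df[0]:.2f}s total={df[1]:.2f}s")
```

Under these assumed numbers, a 50-token reply feels slower from the diffusion model (0.30 s before anything appears versus near-instant streaming), while a 500-token reply finishes in 0.30 s versus 5 s autoregressively - matching the pattern described above.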

How Mercury 2 achieves 1,000 tokens per second


The 1,000 tokens per second figure requires some unpacking. It refers to the throughput of the full generation process, not a sustained per-step rate. Here is how Inception Labs achieves it.

Masked diffusion over discrete tokens

Mercury 2 uses a form of discrete diffusion applied to token masks. During training, the model learns to reconstruct masked tokens in text sequences. During inference, it starts with all output positions masked and gradually unmasks them across multiple denoising steps. This is analogous to BERT's masked language modelling objective, but applied as a generative process with multiple refinement passes.

Because positions are unmasked in parallel groups rather than one at a time, a single denoising step can update hundreds of token positions at once. On a modern A100 or H100 GPU cluster, this degree of parallelism translates directly into high throughput.
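A heavily simplified sketch of that unmask-in-groups loop follows. The "model" here is a seeded random stand-in; a real system would use a trained transformer to propose tokens and confidence scores:

```python
import random

# Simplified masked-diffusion generation: start fully masked, and on each
# denoising step commit the positions the model is most "confident" about,
# in parallel. Predictions here are random stand-ins, not a trained model.

MASK = None

def denoise_step(seq, vocab, frac=0.5, rng=random.Random(0)):
    """Unmask the most 'confident' fraction of still-masked positions."""
    masked = [i for i, t in enumerate(seq) if t is MASK]
    # stand-in predictions: (confidence, token) for each masked position
    preds = {i: (rng.random(), rng.choice(vocab)) for i in masked}
    # commit the highest-confidence group this step; updates are parallel
    ranked = sorted(masked, key=lambda i: preds[i][0], reverse=True)
    for i in ranked[: max(1, int(len(masked) * frac))]:
        seq[i] = preds[i][1]
    return seq

seq, steps = [MASK] * 16, 0
while MASK in seq:
    seq = denoise_step(seq, vocab=list(range(100)))
    steps += 1
print(steps)  # 5 denoising steps instead of 16 sequential passes
```

Halving the masked set each step is why the step count grows roughly logarithmically with length here, whereas autoregressive decoding grows linearly.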

Fewer denoising steps than image diffusion

Image diffusion models typically require 20-50 denoising steps to produce a high-quality image. Early language diffusion models required a similar number, which offset their parallelism advantage. Mercury 2 reduces the required steps significantly - reportedly to around 5-10 steps for most tasks - through architectural improvements and improved training procedures.

Fewer steps means the full generation process completes faster, and the parallelism advantage compounds. A 10-step diffusion process that refines 500 tokens in parallel is structurally very different from a 500-step autoregressive process.
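Counting serial forward passes makes that structural difference explicit, using the figures from the paragraph above:

```python
# Sequential depth is what bounds wall-clock time when parallel capacity is
# abundant: count the serial passes each approach needs for one response.

response_len = 500
autoregressive_passes = response_len   # one full forward pass per token
diffusion_passes = 10                  # ~10 denoising steps, each parallel

print(autoregressive_passes // diffusion_passes)  # 50x fewer serial steps
```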

Speculative decoding is not needed

One technique used to speed up autoregressive models is speculative decoding: a smaller "draft" model generates candidate tokens, which the larger model verifies in parallel. This adds complexity and requires careful system engineering. Mercury 2 does not need speculative decoding - its architecture is inherently parallel. This simplifies deployment and reduces infrastructure overhead.

Important context:

The 1,000 tokens/sec figure is measured on Inception Labs' own infrastructure. Individual API users will experience different throughput depending on server load, concurrent requests, and request size. Independent developers have reported throughput in the 600-900 tokens/sec range in early API testing.

Benchmarks and output quality

Speed is irrelevant if quality suffers severely. Inception Labs has published benchmark results comparing Mercury 2 against GPT-4o, Llama 3.1 70B, Claude Haiku 4.5, and Mistral Large. The picture is nuanced.

Coding benchmarks

Mercury 2 performs strongest on coding tasks. On HumanEval (Python code generation), Inception Labs reports Mercury 2 scoring comparably to GPT-4o on straightforward single-function problems. On SWE-Bench Verified (real-world software engineering tasks requiring multi-file reasoning), the gap widens significantly - frontier autoregressive models like Claude Sonnet 4.6 outperform Mercury 2.

The pattern makes intuitive sense. Diffusion models are strong at generating syntactically correct, locally coherent code. They struggle with tasks that require tracking complex state across hundreds of lines, understanding dependencies across multiple files, or planning a sequence of interdependent steps.

Instruction following

On MT-Bench and AlpacaEval, Mercury 2 scores competitively with Llama 3.1 70B and Mistral Large. For standard instruction-following tasks - writing emails, summarising documents, answering factual questions, generating content - the quality is practically indistinguishable from mid-tier frontier models when assessed by human raters in A/B tests.

Multi-step reasoning

This is where diffusion models currently show their clearest weakness. On MATH, AIME, and reasoning benchmarks that reward step-by-step problem solving, Mercury 2 lags behind models trained explicitly for chain-of-thought reasoning. The diffusion approach does not naturally produce "thinking out loud" before committing to an answer, because all positions are generated together rather than left-to-right.

Inception Labs has explored ways to incorporate chain-of-thought into diffusion models - essentially reserving some token positions as a scratchpad - but as of early 2026, this remains an area of active research rather than a production capability.

| Task category | Mercury 2 quality | vs GPT-4o | vs Claude Opus 4.6 |
| --- | --- | --- | --- |
| Code generation (simple) | Strong | Comparable | Slightly below |
| Code generation (complex) | Moderate | Below | Significantly below |
| Instruction following | Strong | Comparable | Slightly below |
| Summarisation | Strong | Comparable | Comparable |
| Multi-step reasoning | Weak | Significantly below | Significantly below |
| Long-context tasks | Moderate | Below | Significantly below |
| Creative writing | Strong | Comparable | Slightly below |

Real-world use cases and where it fits


The combination of high throughput and competitive quality on standard tasks creates a distinct product niche. Mercury 2 is not a replacement for frontier reasoning models, but it is compelling for specific workloads where speed matters more than maximum depth.

Code autocompletion

Coding assistants like GitHub Copilot and Cursor are heavily constrained by latency. A suggestion that arrives after 500 milliseconds disrupts flow. Mercury 2's throughput means a 200-token code suggestion can complete in roughly 200 milliseconds rather than the 1.5-2 seconds that comparable autoregressive models typically require. For inline autocomplete, developers notice this difference immediately.

Several developer tools companies have already integrated or expressed interest in integrating Mercury 2 as the backend for autocomplete specifically, using a frontier model only when the developer triggers a more complex request like "refactor this function" or "explain what this code does."

High-volume document processing

Legal firms, insurance companies, and financial institutions process large volumes of documents daily - contracts, claims, reports, filings. Summarising, classifying, or extracting fields from each document typically runs through an LLM API. At standard autoregressive speeds, processing 10,000 documents takes hours and carries significant API costs.

Mercury 2's throughput means the same volume completes in a fraction of the time. For tasks that do not require deep reasoning - extracting a policy number from an insurance document, classifying an email as complaint or enquiry, summarising a meeting transcript - Mercury 2 quality is fully adequate and the speed advantage is enormous.
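Rough capacity arithmetic for the batch scenario (the document count, summary length, and throughput figures are illustrative assumptions):

```python
# Back-of-envelope batch timing: how long does it take to generate one
# summary per document at a given sustained throughput? All inputs are
# illustrative assumptions, not vendor-measured numbers.

def batch_hours(n_docs: int, tokens_per_doc: int, tok_per_sec: float) -> float:
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / tok_per_sec / 3600   # seconds -> hours

docs, out_tokens = 10_000, 300                 # e.g. a 300-token summary each
print(f"autoregressive (~100 tok/s): {batch_hours(docs, out_tokens, 100):.1f} h")
print(f"Mercury 2    (~1,000 tok/s): {batch_hours(docs, out_tokens, 1000):.1f} h")
```

Under these assumptions the job drops from roughly 8.3 hours to under an hour - the "hours versus a fraction of the time" difference described above, before accounting for request concurrency.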

Interactive chat and customer service

Customer-facing chat applications benefit from faster responses, but the interaction style matters. Users tolerate slightly longer waits if text streams in immediately - which autoregressive models do naturally. Mercury 2's higher latency to first token can feel slower in a chat context even when total generation time is lower. Applications need to design around this, showing a typing indicator and then displaying the full response, rather than streaming token by token.

For FAQ-style customer service where responses are relatively predictable, Mercury 2 is well-suited. For complex, branching conversations that require nuanced reasoning or multi-turn context tracking, frontier models still have an advantage.

Content generation at scale

Marketing teams, publishers, and content platforms that need to generate large volumes of text - product descriptions, social media variations, article drafts, email campaigns - find Mercury 2 economically attractive. If quality is acceptable (which for many templated content tasks it is), generating 10x more content in the same time at comparable cost changes what is operationally feasible.

Practical tip:

The best architecture for most production applications is a tiered approach: use Mercury 2 for high-volume, latency-sensitive tasks (autocomplete, classification, summarisation) and route complex reasoning tasks (multi-step planning, code review, research) to a frontier model. This captures the cost and speed benefits where they matter most without sacrificing quality where it counts.
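A minimal sketch of that tiered routing pattern; the task labels and model names are illustrative placeholders, not a vendor API:

```python
# Tiered routing sketch: cheap, latency-sensitive task types go to the fast
# model, everything else defaults up to the frontier tier. Labels and model
# names are hypothetical placeholders for your own classifier and clients.

FAST_TASKS = {"autocomplete", "classification", "summarisation", "extraction"}

def route(task_type: str) -> str:
    """Pick a model tier; default to the frontier model when unsure."""
    return "mercury-2" if task_type in FAST_TASKS else "frontier-model"

print(route("autocomplete"))   # mercury-2
print(route("code-review"))    # frontier-model
```

Defaulting unknown task types to the frontier tier trades some cost for safety: misrouting a hard task to the fast model degrades quality, while the reverse only wastes a little money.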

Mercury 2 vs GPT-4o, Claude, and Gemini

To be clear about what Mercury 2 is competing with: it is not primarily competing with frontier reasoning models. The honest comparison is with fast, cost-efficient models designed for throughput - GPT-4o mini, Claude Haiku 4.5, Gemini Flash, and Llama 3.1 8B. Against that cohort, Mercury 2's speed advantage is dramatic.

| Model | Throughput | Quality tier | Time to first token | Streaming |
| --- | --- | --- | --- | --- |
| Mercury 2 | ~1,000 tok/s | Mid-tier | Slower | Limited |
| GPT-4o mini | ~100-150 tok/s | Mid-tier | Fast | Yes |
| Claude Haiku 4.5 | ~100-130 tok/s | Mid-tier | Fast | Yes |
| Gemini 3.1 Flash | ~120-160 tok/s | Mid-tier | Fast | Yes |
| Llama 3.1 8B | ~200-400 tok/s | Lower-mid | Fast | Yes |
| Claude Opus 4.6 | ~60-80 tok/s | Frontier | Moderate | Yes |

Throughput figures are approximate and vary significantly based on hardware, batch size, and concurrent load. Hosted API throughput differs from dedicated deployment. Mercury 2's speed advantage is real but is most evident in batch and high-concurrency scenarios rather than low-concurrency API calls.

It is also worth noting that Groq - the LPU (Language Processing Unit) hardware company - has achieved high throughput for autoregressive models through specialised silicon. Llama 3.1 70B on Groq's infrastructure has been demonstrated at 300-800 tokens per second. Mercury 2's architecture-level advantage is genuine but not in a vacuum - hardware acceleration is narrowing the gap for autoregressive models too.

Current limitations and trade-offs


No technology is without trade-offs. Mercury 2 has several meaningful limitations that matter depending on your use case.

Time to first token

Because the diffusion process must complete before output is ready, Mercury 2 has a higher time to first token than autoregressive models. For a 500-token response, you might wait 200-300 milliseconds for Mercury 2 to begin showing output, versus near-instant streaming from GPT-4o or Claude. In conversational interfaces, this can feel noticeably slower even when total generation time is lower.

Inception Labs is working on partial streaming approaches - showing progressively refined output as denoising steps complete - but this is architecturally challenging and adds complexity. As of early 2026, most Mercury 2 integrations show the full response at once rather than streaming.

Context window constraints

Mercury 2's context window is smaller than frontier models. Where Claude Opus 4.6 supports 1 million tokens in context and GPT-4o handles 128,000 tokens, Mercury 2 operates with a more limited window - around 32,000 tokens as of initial releases. For most tasks this is sufficient, but it rules out the long-document analysis, codebase-wide reasoning, and extended conversation use cases where frontier models have pulled ahead significantly.
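One common workaround for a smaller window is chunked map-reduce processing: split the input into overlapping pieces that each fit, process them independently, and combine the results. The sketch below approximates tokens with whitespace-separated words; a real pipeline would use the provider's tokenizer:

```python
# Split a long input into overlapping chunks that fit a limited context
# window. Word count stands in for token count here -- an assumption, since
# real token counts depend on the model's tokenizer.

def chunk_text(text: str, max_tokens: int = 30_000, overlap: int = 200):
    words = text.split()
    step = max_tokens - overlap            # consecutive chunks share `overlap`
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), step)]

chunks = chunk_text("word " * 70_000)
print(len(chunks))  # a ~70k-word document becomes 3 overlapping chunks
```

The small overlap keeps sentences that straddle a boundary visible to both chunks, at the cost of a little duplicated work.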

Chain-of-thought reasoning

The most significant quality limitation is reasoning. Autoregressive models can be trained to think step by step - to write out reasoning before committing to an answer. This dramatically improves performance on maths, logic, coding problems, and multi-step planning. Diffusion models generate all positions simultaneously, which makes this kind of sequential reasoning structurally difficult.

For tasks where the answer can be generated without extensive deliberation - summarising a document, completing a code snippet, classifying a support ticket - this limitation rarely surfaces. For tasks that genuinely require working through a problem, Mercury 2 produces answers that feel more like pattern matching than reasoning, and this shows in the results.

Ecosystem maturity

Mercury 2 is newer than established API providers. Documentation is thinner, community resources are sparser, and integration support is more limited. Developers choosing Mercury 2 are trading the mature ecosystems of OpenAI or Anthropic for a newer platform that is still building its developer experience. This is a practical consideration for production deployments.

Reliability caveat:

As a younger company, Inception Labs has not yet established the uptime track record and enterprise SLAs that OpenAI and Anthropic offer. For mission-critical applications, factor in infrastructure reliability and support responsiveness alongside raw performance metrics.

Who is Inception Labs?

Inception Labs was founded by AI researchers with roots at Meta AI, Google DeepMind, and other frontier labs. The founding team published influential work on diffusion models applied to language, arguing that the image diffusion paradigm could be adapted to discrete token sequences with careful design choices.

The original Mercury model, released in 2025, was a proof-of-concept demonstrating that diffusion LLMs could produce competitive quality at high throughput. Mercury 2 is their first production-grade model, with substantially improved benchmarks and a commercial API offering targeting enterprise customers.

The company has been open about its technical approach in published research, positioning itself as building a fundamentally different model architecture rather than competing directly with frontier labs on parameter count and compute. Their thesis is that the AI industry's inference bottleneck - the cost and latency of serving large models to millions of users - creates a durable market for high-throughput architectures even if they do not top reasoning benchmarks.

Inception Labs has raised venture funding from AI-focused investors and has announced enterprise partnerships with several technology companies. They have indicated plans for open-weight releases of smaller Mercury models, which would allow self-hosted deployments, though timelines remain unconfirmed.

What Mercury 2 means for the future of AI inference


Mercury 2 matters beyond its own specific capabilities. It signals something important about where AI inference is heading.

The inference cost problem is real

Running large language models at scale is expensive. Every API call to a frontier model costs money, and as AI usage grows, those costs add up quickly. Organisations building AI features into products need to think carefully about which model to use for which task - not just in terms of quality, but in terms of economics.

Mercury 2's higher throughput means lower cost per token at scale - more output for the same server capacity. If the quality is sufficient for a given task, this is a compelling economic argument. The emergence of high-throughput alternatives creates price pressure across the entire inference market, which benefits users.

Architecture diversity is healthy

The AI field has been dominated by the autoregressive transformer since GPT-2 demonstrated its capabilities. Mercury 2, alongside other architectural experiments (state-space models, mixture-of-experts, hybrid approaches), represents growing architectural diversity. This is good. Different architectures have different strengths, and a healthy ecosystem will use multiple model types for different workloads rather than forcing all tasks through a single architecture.

The quality gap will narrow

Current diffusion LLMs lag frontier autoregressive models on complex reasoning. But autoregressive models also lagged well behind human experts on benchmarks just a few years ago, and that gap has shrunk dramatically. The diffusion approach is young. As training techniques improve, more data is incorporated, and architectural refinements accumulate, the quality gap with autoregressive models should continue to narrow.

The scenario most worth watching is a future where diffusion models achieve frontier-level quality on most tasks while maintaining their throughput advantage. If that happens, autoregressive generation as the dominant AI serving paradigm faces a genuine challenge.

Hardware companies are paying attention

Nvidia's GPU dominance in AI is partly because autoregressive generation is inefficient enough that you need a lot of GPU memory bandwidth to serve models at reasonable latency. High-throughput architectures like Mercury 2 are more efficient in their GPU utilisation, which could shift hardware economics. Intel, Groq, Cerebras, and others building AI-specific hardware will all be evaluating whether diffusion LLMs change their optimal chip designs.

Bottom line:

Mercury 2 is not yet a replacement for frontier reasoning models. But it is a genuinely impressive demonstration that diffusion language models have moved from research curiosity to production-viable technology. For high-throughput, latency-critical applications at mid-tier quality requirements, it deserves serious evaluation. And it is an early signal of architectural diversity that will reshape the AI inference market over the coming years.

Frequently asked questions

What is Mercury 2?

Mercury 2 is Inception Labs' second-generation diffusion language model. It uses a denoising approach to generate entire output sequences in parallel rather than token by token, achieving approximately 1,000 tokens per second throughput - far higher than comparable autoregressive models.

How does a diffusion LLM differ from a standard LLM?

Standard LLMs generate tokens one at a time left-to-right (autoregressive). Diffusion LLMs start with a noisy representation of the full output and iteratively refine all positions simultaneously, similar to how image diffusion models work. This enables much higher parallelism and throughput at the cost of higher time-to-first-token and weaker sequential reasoning.

Is Mercury 2 as good as GPT-4 or Claude?

Mercury 2 performs comparably to mid-tier models on standard instruction following, coding, and summarisation tasks. On complex multi-step reasoning, maths, and long-context tasks, frontier autoregressive models like Claude Opus 4.6 and GPT-5.3 remain significantly stronger. The quality trade-off is real but acceptable for many production applications.

What is Mercury 2 best suited for?

Code autocompletion, high-volume document processing (summarisation, classification, extraction), content generation at scale, and interactive applications where response speed is critical. Less suited to complex reasoning, multi-step planning, and tasks requiring long context windows.

Can I use Mercury 2 via an API?

Yes. Inception Labs offers API access to Mercury 2 with usage-based pricing. Enterprise plans with dedicated throughput and SLAs are available. The company has also indicated plans for open-weight smaller model releases, though specific timelines are unconfirmed.

Does Mercury 2 support streaming?

Not in the same natural way as autoregressive models. Because diffusion requires multiple denoising passes before output is ready, streaming is architecturally challenging. Most integrations currently show the full response at once. Inception Labs is working on progressive output approaches, but as of early 2026 this is not a standard feature.

Last updated: 10 March 2026. This article is based on publicly available research and community reports.