Startup Inception debuts Mercury 2, the first diffusion model for lightning-fast AI reasoning

By replacing sequential generation with parallel diffusion, Mercury 2 delivers record-breaking speed and cost efficiency for complex reasoning.

February 24, 2026

The landscape of artificial intelligence has long been dominated by the autoregressive paradigm, a method of generating language where models predict text one word or token at a time in a strictly sequential, left-to-right fashion.[1][2][3][4] While this approach has powered the rise of famous systems like GPT-4 and Claude, it has always been hampered by a fundamental bottleneck: speed is limited by the serial nature of the process, and errors made early in a sentence often compound throughout the output. AI startup Inception has now introduced a potential solution to these structural limitations with the launch of Mercury 2, the first large-scale language reasoning model based entirely on diffusion.[4] By abandoning the word-by-word generation loop in favor of a parallel refinement process, Mercury 2 achieves a significant leap in both inference speed and reasoning efficiency, signaling a major architectural pivot for the industry.
At the core of Mercury 2 is a technical departure from the industry standard known as Next Token Prediction. Instead of building a sentence piece by piece, Mercury 2 utilizes a process called Masked Diffusion Modeling. This approach treats text generation similarly to how high-end image generators like Midjourney or Sora create visuals. It begins with a rough, noisy approximation of the entire response and then iteratively refines the passage in parallel. During each step of this "denoising" process, the model looks at the whole output simultaneously, correcting errors and sharpening the logic across multiple blocks of text at once.[4][5] This coarse-to-fine generation allows the model to maintain better global coherence because it is not restricted to considering only the words that came before its current position.[1][2] Inception’s engineers have described this as the difference between a writer typing a letter through a narrow keyhole and an editor revising an entire manuscript page in a single glance.
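The coarse-to-fine loop described above can be sketched in a few lines of Python. This is a toy illustration, not Inception's implementation: the stand-in `denoise_step` fills a random fraction of the masked positions on each pass, where a real masked diffusion model would predict every masked token in parallel and commit only its most confident predictions.

```python
import random

MASK = "<mask>"

def denoise_step(tokens, fill_fraction=0.5):
    """One illustrative denoising step: fill a fraction of the remaining
    masked positions in parallel. A real model would score all masked
    positions at once and keep the highest-confidence predictions."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    to_fill = random.sample(masked, max(1, int(len(masked) * fill_fraction)))
    for i in to_fill:
        tokens[i] = f"tok{i}"  # stand-in for the model's predicted token
    return tokens

def generate(length=8, steps=10):
    """Start from a fully masked ('noisy') sequence and refine it
    coarse-to-fine, touching many positions per iteration."""
    tokens = [MASK] * length
    for _ in range(steps):
        if MASK not in tokens:
            break  # sequence fully denoised
        tokens = denoise_step(tokens)
    return tokens
```

Because each `denoise_step` revisits the whole sequence, a later pass is free to overwrite an earlier position, which is the error-correction property the article attributes to diffusion.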
The performance gains resulting from this architecture are stark.[6][7][5] Running on the latest Nvidia Blackwell GPUs, Mercury 2 has demonstrated a throughput of 1,009 tokens per second.[8] To put this into perspective, the model is more than five times faster than traditional language models of comparable reasoning depth.[6][4] The impact of this speed is most visible in the end-to-end latency of complex reasoning tasks. While leading speed-optimized models like Gemini 3 Flash and Claude Haiku 4.5 can take between 14 and 23 seconds to return a reasoned response, Mercury 2 delivers similar outputs in just 1.7 seconds.[8] This reduction in latency is not merely a quantitative improvement but a qualitative shift that makes real-time, high-logic applications viable for the first time. For industries relying on instant decision-making, such as high-frequency financial analysis or real-time autonomous agent loops, the ability to "think" in sub-second intervals could redefine operational standards.
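The quoted figures imply the following back-of-the-envelope timings. The ~1,700-token response size is an assumption chosen to reconcile the 1,009 tokens-per-second throughput with the reported 1.7-second latency; network transfer and prompt processing are ignored.

```python
def response_latency(num_tokens, tokens_per_second):
    """End-to-end generation time in seconds, ignoring network
    transfer and prompt-processing (prefill) overhead."""
    return num_tokens / tokens_per_second

RESPONSE_TOKENS = 1_700  # assumed length of a fully reasoned response

mercury2_s = response_latency(RESPONSE_TOKENS, 1_009)  # ≈ 1.7 s
# A serial model near 100 tokens/s lands in the reported 14-23 s band:
serial_s = response_latency(RESPONSE_TOKENS, 100)      # 17.0 s
```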
Beyond raw speed, Mercury 2 has been positioned as a top-tier reasoning model, achieving benchmark scores that rival the best-known frontier models.[9] On the AIME 2025 mathematics benchmark, Mercury 2 scored a 91.1, placing it in the upper echelon of reasoning AI. It also performed strongly on the GPQA Diamond benchmark for graduate-level science, scoring 73.6, and the IFBench for instruction following, where it reached 71.3. These metrics suggest that the diffusion-based approach does not sacrifice intelligence for speed. In fact, the iterative nature of the diffusion process provides a built-in mechanism for error correction that autoregressive models lack. If a traditional model makes a logical misstep in the middle of a paragraph, it is forced to continue building on that mistake. Mercury 2, however, can refine and "fix" its own logic as it iterates through the denoising steps, leading to more reliable outputs in complex coding and scientific tasks.
The economic implications of this launch are equally disruptive. Inception has priced Mercury 2 at $0.25 per million input tokens and $0.75 per million output tokens.[8] This pricing aggressively undercuts the current market leaders: it is roughly four times cheaper on input and nearly seven times cheaper on output than some comparable reasoning-enabled models from larger competitors. The cost efficiency stems directly from the model's architecture; because it generates many tokens per neural network evaluation rather than one, it keeps GPU compute far better utilized than serial models, which lowers operational overhead for developers. For enterprise clients who have been wary of the high "reasoning tax" associated with advanced AI models, Mercury 2 offers a path to deploying sophisticated logic at a fraction of the previous cost.[4]
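At the published rates, per-request cost is a one-line calculation. The request sizes below are hypothetical workload numbers chosen for illustration, not figures from Inception.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price=0.25, output_price=0.75):
    """Dollar cost of one request at Mercury 2's quoted
    per-million-token rates (defaults from the announcement)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical workload: 2,000 input and 1,500 output tokens per request.
per_request = request_cost_usd(2_000, 1_500)   # $0.001625
million_requests = per_request * 1_000_000     # ≈ $1,625 for a million calls
```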
This breakthrough is the product of a research team with deep roots in the foundational technology of the AI era.[1][7][4] Inception was founded by researchers from Stanford, UCLA, and Cornell who were instrumental in developing the very diffusion methods used in modern image generation.[4] The team also includes co-inventors of critical techniques like Flash Attention and Direct Preference Optimization, which are now standard across almost all large language models. This pedigree has attracted significant backing from major Silicon Valley investors, including Microsoft’s venture capital fund M12, Mayfield, and Menlo Ventures, along with individual contributions from AI pioneers like Andrew Ng and Andrej Karpathy. The consensus among these backers is that the industry is entering a post-Transformer era where the architectural bottlenecks of the last five years must be overcome to enable the next wave of agentic AI.
The broader industry impact of Mercury 2 lies in its potential to unlock the "agentic" future that AI labs have long promised. Most current AI agents are limited by the time it takes for a model to reason through a plan; if an agent needs to perform ten reasoning steps to solve a problem, and each step takes fifteen seconds, the user experience becomes unusable. By compressing that reasoning time into a second or less, Mercury 2 enables agents that can act, react, and correct their course in real time. This is particularly relevant for coding assistants and automated software engineering tools, where the model must understand the global context of a codebase while generating thousands of lines of code. Inception’s earlier specialized model, Mercury Coder, already showed glimpses of this potential, but Mercury 2 generalizes these capabilities into a broader reasoning framework suitable for a wide variety of professional domains.
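The ten-step arithmetic above is simple but worth making explicit, since agent steps are strictly sequential and their latencies add rather than overlap. The step count and per-step timings come straight from the examples in the text.

```python
def agent_wall_time_s(reasoning_steps, seconds_per_step):
    """Wall-clock time for an agent that must finish each reasoning
    step before starting the next (steps run strictly in sequence)."""
    return reasoning_steps * seconds_per_step

slow_agent = agent_wall_time_s(10, 15)    # 150 s: unusable for interactive work
fast_agent = agent_wall_time_s(10, 1.7)   # 17 s with sub-2-second steps
```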
As the AI field continues to grapple with the rising costs of compute and the diminishing returns of simply scaling existing architectures, Mercury 2 represents a successful proof of concept for an alternative foundation. It challenges the assumption that language must be processed as a linear sequence, proving instead that language can be treated as a multidimensional data structure to be refined and sculpted. This shift could prompt other major players in the space to revisit their own reliance on autoregressive transformers. If diffusion-based language models continue to scale as efficiently as their image-based predecessors, the industry may see a rapid transition toward these parallel systems.
In summary, the launch of Mercury 2 marks a pivotal moment in the evolution of artificial intelligence. By successfully applying diffusion techniques to the complexities of human language and reasoning, Inception has broken the speed and cost barriers that have long constrained the most capable AI systems. With throughput above 1,000 tokens per second and a price point that challenges the economics of the entire sector, Mercury 2 is more than just a new product; it is a demonstration of a new way for machines to think. As developers begin to integrate this model into real-world workflows, the industry's focus is likely to shift from how much data a model can see to how efficiently and quickly it can refine that data into actionable intelligence. Mercury 2 has set a new benchmark for what is possible, suggesting that the future of AI will be characterized by speed, parallel reasoning, and structural innovation.
