Revolutionary diffusion LLM Mercury slashes AI costs and boosts speed tenfold.
Inspired by the visual AI behind image and video generators, Mercury's diffusion architecture promises LLMs up to 10x faster and cheaper, potentially reshaping the future of language models.
June 27, 2025

A new challenger has entered the artificial intelligence arena, promising a radical shift in the speed and efficiency of large language models (LLMs). US-based AI startup Inception Labs has launched Mercury, which it bills as the first commercial-scale diffusion large language model.[1][2] This new architecture, inspired by the technology behind AI image and video generators like Midjourney and Sora, offers a significant speed boost over the established autoregressive models that have dominated the field.[1][3] Inception Labs claims Mercury can be up to 10 times faster and cheaper than its contemporaries, a development that could have profound implications for the accessibility and application of AI across various industries.[4][5] The initial release focuses on coding applications, with a model called Mercury Coder, but a version tailored for chat is also in the works, signaling a broader ambition to reshape the LLM landscape.[4][6]
The core innovation of Mercury lies in its departure from the standard autoregressive method of text generation.[4] Traditional LLMs, such as those in the GPT and Claude families, generate text sequentially, producing one word or "token" at a time from left to right.[5] This makes generation inherently slow, as each new token can only be produced after the preceding one is finalized.[5] Diffusion models, in contrast, operate on a "coarse-to-fine" principle.[4][6] They begin with a rough, "noisy" or incomplete version of the entire text and then refine it in parallel across multiple steps, much like an image generator sharpens a blurry picture into a clear one.[1][3] This parallel processing allows for a dramatic increase in generation speed: Inception Labs reports that Mercury can exceed 1,000 tokens per second on standard NVIDIA H100 GPUs, a throughput previously thought to require specialized hardware.[2][4]
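Inception Labs has not published Mercury's exact sampling procedure, but a common way to realize the coarse-to-fine idea for text is masked discrete diffusion: start from a fully masked sequence, predict every position in parallel, commit the most confident predictions, and repeat. The toy sketch below illustrates that loop; the denoiser is a random stand-in for a real transformer, and all names and numbers are illustrative assumptions, not details of Mercury.

```python
import numpy as np

VOCAB = 1000                 # toy vocabulary: token ids 0..999
MASK = VOCAB                 # mask-token id, outside the vocabulary
rng = np.random.default_rng(0)

def denoiser(tokens):
    """Stand-in for the real model: a diffusion LM runs a transformer once
    over the whole sequence and scores every position simultaneously.
    Here we just return random logits of shape (len(tokens), VOCAB)."""
    return rng.normal(size=(tokens.size, VOCAB))

def diffusion_decode(length=16, steps=4):
    tokens = np.full(length, MASK)              # start fully "noisy": all masks
    for step in range(steps):
        logits = denoiser(tokens)               # one parallel forward pass
        preds = logits.argmax(axis=-1)          # best guess at every position
        conf = logits.max(axis=-1)
        masked = tokens == MASK
        conf = np.where(masked, conf, -np.inf)  # only fill still-masked slots
        # Commit a growing share of the remaining masks each step, so the
        # text sharpens from coarse to fine over `steps` parallel passes.
        k = int(np.ceil(masked.sum() / (steps - step)))
        commit = np.argsort(-conf)[:k]
        tokens[commit] = preds[commit]
    # An autoregressive model would instead need `length` sequential passes.
    return tokens

print(diffusion_decode())
```

The key property is that the cost scales with the number of refinement steps (here 4) rather than the number of tokens (here 16), which is where the parallel speedup over token-by-token decoding comes from.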
The performance claims for Mercury are substantial, positioning it as a direct competitor to the speed-optimized models from major AI labs. According to Inception Labs and independent analysis, Mercury Coder's Mini version can generate 1,109 tokens per second, while the Small version reaches 737 tokens per second.[7] This far outpaces models like GPT-4o Mini, which operates at around 59 tokens per second, and Claude 3.5 Haiku.[7] In terms of quality, Mercury Coder is said to be competitive with these models, even surpassing them on several coding benchmarks.[4][7] For instance, in evaluations on standard coding tasks, Mercury Coder Small reportedly outperformed models like Gemini 2.0 Flash-Lite and GPT-4o Mini on at least four out of six benchmarks.[7] While it didn't surpass every competitor on every test, its combination of high speed and comparable quality makes it a compelling alternative for developers.[7][8] The company also highlights that in a "Copilot Arena" benchmark, developers preferred Mercury's code completions, ranking it highly for both speed and quality.[3]
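To put those throughput figures in practical terms, the back-of-envelope calculation below converts the reported rates into wall-clock time for a 1,000-token completion (network and queuing latency excluded).

```python
# Reported throughputs from the benchmarks cited above, in tokens/second.
reported = {
    "Mercury Coder Mini": 1109,
    "Mercury Coder Small": 737,
    "GPT-4o Mini": 59,
}
for name, tps in reported.items():
    print(f"{name}: {1000 / tps:.1f} s per 1,000-token completion")
# Mercury Coder Mini: 0.9 s; Small: 1.4 s; GPT-4o Mini: 16.9 s
```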
Beyond raw speed, the diffusion architecture offers other potential advantages. Inception Labs suggests that because the model refines the entire output iteratively, it has a built-in mechanism for error correction, which could lead to more robust reasoning and a reduction in "hallucinations," or factual errors.[1] The parallel nature of the generation process also lends itself to better control over the output's structure, making it well suited to tasks like function calling and generating structured data.[1] Furthermore, since diffusion is already the dominant technique in image, video, and audio generation, Inception Labs believes this unified framework will give its models stronger performance on multimodal tasks in the future.[1] The company, founded by academics from Stanford, UCLA, and Cornell, aims for Mercury to be a "drop-in replacement" for existing LLMs, supporting common workflows like retrieval-augmented generation (RAG) and agentic systems without requiring a major overhaul of infrastructure.[2][4][8]
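If the "drop-in replacement" claim holds, adopting Mercury should amount to a configuration change rather than a rewrite. The sketch below assumes Mercury is served behind an OpenAI-compatible chat-completions endpoint; the base URL and model name are placeholders, not confirmed details of Inception Labs' API.

```python
from openai import OpenAI

# Placeholder endpoint and model id; substitute the values from Inception
# Labs' documentation. Everything else is a stock OpenAI-client call.
client = OpenAI(
    base_url="https://api.example.com/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="mercury-coder-small",            # assumed model identifier
    messages=[{"role": "user",
               "content": "Write a Python function that merges two sorted lists."}],
)
print(resp.choices[0].message.content)
```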
The introduction of a commercially viable diffusion LLM marks a significant moment for the AI industry. While the underlying transformer architecture is still part of Mercury's process, its application within a diffusion framework represents a new paradigm.[4][9] This could challenge the dominance of purely autoregressive models and spur further innovation in model architecture. The dramatic increase in speed and efficiency addresses a major bottleneck in scaling AI applications: the high cost and latency of inference.[5] By making high-performance AI more accessible, models like Mercury could accelerate the adoption of generative AI in real-time applications such as customer support, business automation, and, most immediately, developer tools for code generation.[6][8] While Inception Labs has kept many details about the model's size, training data, and specific methods under wraps, the public release of Mercury Coder and its impressive performance metrics have firmly established diffusion models as a serious contender in the evolution of language generation technology.[7]