Google Gemma 4 achieves 3x faster generation via new multi-token prediction drafters
New multi-token prediction architecture triples Gemma 4 generation speeds, enabling real-time, high-quality reasoning for open-weights models and on-device applications.
May 6, 2026

Google has fundamentally altered the performance trajectory of its open-weights artificial intelligence ecosystem with the introduction of multi-token prediction (MTP) drafters for the Gemma 4 model family.[1][2] This technical advancement addresses one of the most persistent bottlenecks in large language model inference: the sequential nature of text generation.[3][2] By moving beyond the traditional one-token-at-a-time paradigm, the new architecture achieves a threefold increase in generation speed without compromising the underlying quality or reasoning accuracy of the model outputs.[3][4][2][1] This release represents a pivotal moment for developers and enterprises relying on open-weights models, as it significantly lowers the latency barriers that have historically hindered the deployment of sophisticated AI in real-time environments and on-device applications.
The core of this speed breakthrough lies in a technique known as speculative decoding, specifically refined through the use of multi-token prediction drafters.[5][6][7][3][4][2][8] In standard autoregressive large language models, text is generated by predicting exactly one token (a word or a fragment of a word) at a time.[2][3] Each new token requires a full forward pass of the model, which entails loading billions of parameters from high-speed video memory into the processor's compute units.[3][2][1] This process is heavily memory-bandwidth bound: the bottleneck is not the raw mathematical capability of the GPU or TPU, but the speed at which data can be moved.[3] During each pass, expensive compute resources often sit idle while waiting for the next set of weights to arrive.[6][3] The inefficiency is especially pronounced on highly predictable text, such as common phrases or boilerplate code, where the full power of a frontier-class model is computational overkill.
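To see why memory bandwidth, not compute, sets the ceiling, a rough back-of-envelope estimate helps. The parameter count, weight precision, and bandwidth figures below are illustrative assumptions rather than published Gemma 4 or hardware specifications:

```python
# Rough estimate of the memory-bandwidth ceiling on autoregressive decoding.
# All numbers are illustrative assumptions for a single-request (batch size 1) setup.

PARAMS = 31e9            # assumed parameter count of the dense model
BYTES_PER_PARAM = 2      # bf16 weights; 8-bit quantization would roughly halve this
BANDWIDTH = 1.0e12       # assumed accelerator memory bandwidth, ~1 TB/s

weight_bytes = PARAMS * BYTES_PER_PARAM          # bytes streamed per forward pass
max_tokens_per_s = BANDWIDTH / weight_bytes      # every new token re-reads the weights

print(f"Weights streamed per pass: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound ceiling:   ~{max_tokens_per_s:.0f} tokens/s")
# -> roughly 16 tokens/s under these assumptions, regardless of how many FLOPs the
#    chip can deliver, which is why letting one pass verify several tokens pays off.
```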
To solve this inefficiency, Google's multi-token prediction drafters decouple the suggestion of tokens from their final verification.[2] In this new architecture, a small, highly optimized auxiliary model (the drafter) takes a fast "guess" at the next several tokens in a sequence.[7][6][4][2][8][1] Because this drafter is significantly smaller than the primary Gemma 4 model, it can generate these suggestions in a fraction of the time. The main Gemma 4 model, acting as the verifier, then checks all suggested tokens simultaneously in a single forward pass.[7][5][2][8][1] If the verifier agrees with the drafter's suggestions, the entire block of text is accepted instantly. If the drafter makes an error, the verifier corrects it and the process resumes from that point.[7] Because the main model retains final authority on every token, the output is mathematically identical to what the slower, sequential method would have produced.[3] This "lossless" speedup ensures that users receive the high-quality reasoning of a 31-billion-parameter model at the speed of a much smaller system.
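The draft-then-verify loop can be sketched in a few lines. The snippet below is a minimal greedy-decoding illustration, not Google's implementation: `target_logits_fn` and `draft_next_fn` are assumed callables that return next-token logits, and acceptance simply checks that each drafted token matches the target's own argmax, which is what makes the output identical to plain sequential decoding.

```python
import numpy as np

def speculative_step(target_logits_fn, draft_next_fn, tokens, k=4):
    """One draft-then-verify step under greedy decoding.

    target_logits_fn(seq) -> logits for every position of seq (one big pass).
    draft_next_fn(seq)    -> next-token logits from the small drafter.
    Returns the tokens committed this step (1 to k+1 per target pass).
    """
    # 1. Drafter guesses k tokens sequentially (cheap).
    draft, ctx = [], list(tokens)
    for _ in range(k):
        nxt = int(np.argmax(draft_next_fn(ctx)))
        draft.append(nxt)
        ctx.append(nxt)

    # 2. Target verifies all k guesses in a single forward pass.
    logits = target_logits_fn(tokens + draft)      # shape: (len(tokens)+k, vocab)
    accepted = []
    for i, guess in enumerate(draft):
        # Target's own prediction at the position just before the guess.
        verified = int(np.argmax(logits[len(tokens) + i - 1]))
        if verified == guess:
            accepted.append(guess)                 # agreement: keep the drafted token
        else:
            accepted.append(verified)              # disagreement: take the correction...
            return accepted                        # ...and discard the rest of the draft

    # 3. All guesses accepted: the same pass also yields one bonus token.
    accepted.append(int(np.argmax(logits[-1])))
    return accepted
```

Under greedy decoding, checking the drafter's guesses against the target's own argmax is exactly what preserves bit-identical output; sampling-based variants achieve the same distributional guarantee with a slightly more involved accept/reject rule.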
The implementation for Gemma 4 goes beyond basic speculative decoding by introducing several architectural optimizations that maximize hardware utilization. One of the most significant enhancements is the sharing of the input embedding table and the key-value cache between the drafter and the primary model. By sharing the key-value cache, the system avoids redundant computations of the context window, allowing the drafter to build directly upon the work the larger model has already performed.[3][1] Furthermore, the drafter is designed to utilize the activations from the last layer of the target model, providing it with high-level conceptual information that improves the accuracy of its guesses. For the edge-focused E2B and E4B models within the Gemma 4 family, Google has implemented an efficient clustering technique in the embedder to reduce the computational load of logit calculation on hardware-constrained devices like smartphones and tablets. These refinements allow the system to output a full drafted sequence plus one extra token in roughly the same wall-clock time it would normally take to generate a single token.[3][1][4][2]
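A rough sketch of how a drafter might piggyback on the target's work follows. The class below is a hypothetical illustration of the idea (a shared embedding table, guesses conditioned on the target's final-layer activation), not the Gemma 4 architecture; the tiny two-matrix "draft head" stands in for whatever small network is actually used, and the single `target_hidden` vector stands in for the shared key-value cache that spares the drafter from re-encoding the context.

```python
import numpy as np

class SharedStateDrafter:
    """Hypothetical drafter that reuses the target model's state.

    It ties its output logits to the target's input embedding table and
    conditions each guess on the target's last-layer hidden state, so it
    never recomputes the context the big model has already processed.
    """

    def __init__(self, embedding_table, hidden_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.embed = embedding_table                  # (vocab, d), shared with the target
        vocab, d = embedding_table.shape
        # Tiny draft head: mixes the target hidden state with the last token's embedding.
        self.w_in = rng.normal(scale=0.02, size=(hidden_dim + d, hidden_dim))
        self.w_out = rng.normal(scale=0.02, size=(hidden_dim, d))

    def draft(self, target_hidden, last_token_id, k=4):
        """Guess k tokens given the target's final-layer activation."""
        guesses = []
        tok_emb = self.embed[last_token_id]
        h = target_hidden                             # high-level context from the big model
        for _ in range(k):
            h = np.tanh(np.concatenate([h, tok_emb]) @ self.w_in)   # cheap recurrent update
            logits = (h @ self.w_out) @ self.embed.T                # tied to the shared table
            nxt = int(np.argmax(logits))
            guesses.append(nxt)
            tok_emb = self.embed[nxt]                 # feed the guess back in
        return guesses
```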
The performance gains vary across the Gemma 4 family, which includes both dense and Mixture-of-Experts architectures.[7][8] For the dense 31-billion-parameter model, the 3x speedup is most evident on consumer-grade GPUs and professional workstations, where memory bandwidth has traditionally been the limiting factor. In contrast, the 26-billion-parameter Mixture-of-Experts variant, known as the A4B model, presents a particular challenge for speculative decoding because different experts are activated for different tokens.[7] However, Google has optimized these workflows for higher batch sizes, where expert reuse is more common, ensuring that even complex agentic workflows benefit from the increased throughput. The implications for the broader industry are substantial: these speeds enable more capable offline coding assistants and fluid, real-time voice interactions that were previously possible only through massive, cloud-hosted proprietary models.
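Where a 3x figure can plausibly come from is visible in the standard idealized model of speculative decoding: if each drafted token is accepted independently with probability α and the draft length is k, the expected number of tokens committed per expensive target pass is (1 − α^(k+1)) / (1 − α). The acceptance rates below are illustrative assumptions, not measured Gemma 4 numbers.

```python
def tokens_per_target_pass(alpha, k):
    """Expected tokens committed per verification pass, assuming each drafted
    token is accepted independently with probability alpha and a rejected
    draft still yields the target's own correction token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8, 0.9):
    print(f"alpha={alpha:.1f}, k=4 -> ~{tokens_per_target_pass(alpha, 4):.2f} tokens/pass")
# With an ~80% acceptance rate and drafts of 4 tokens, each big-model pass commits
# ~3.4 tokens on average, i.e. roughly a 3x speedup once the drafter's own cost and
# verification overhead eat a little of the gain.
```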
Beyond the technical metrics, the release of these drafters reinforces Google's strategy of positioning the Gemma family as a leader in the open-weights market. Coming just weeks after Gemma 4 surpassed 60 million downloads, the update addresses the developer community's primary complaint: the trade-off between model intelligence and inference latency. With Gemma 4 supporting sophisticated multimodal capabilities, including image and audio processing, as well as a "thinking" mode for step-by-step reasoning, a threefold acceleration of these workloads makes the models far more viable for production-grade AI agents. By providing the tools to run frontier-level intelligence on local hardware with the responsiveness of a lightweight model, Google is effectively shifting the industry's focus from merely increasing the size of neural networks to maximizing the efficiency with which they are used in the real world.
Over the longer term, multi-token prediction likely signals a broader shift in the artificial intelligence landscape toward more modular and cooperative model architectures. As developers increasingly seek to deploy AI at the edge to enhance privacy and reduce cloud costs, the efficiency of the decoding layer becomes as important as the pre-training of the base model itself. The success of Gemma 4's MTP drafters demonstrates that the next frontier of AI performance will be found not only in larger datasets or more parameters, but also in the intelligent management of the hardware-software interface. By breaking the autoregressive bottleneck, Google has provided a blueprint for how the next generation of open-source AI can achieve real-time, fluid interaction, paving the way for a future in which high-performance reasoning is accessible on every device, from the data center to the pocket.