Cerebras Doubles Nvidia's Blackwell Speed, Claims LLM Inference Record
Wafer-scale AI challenges GPU giants: Cerebras claims record LLM inference speed, intensifying rivalry with Nvidia.
May 30, 2025

Cerebras Systems, a company known for its massive wafer-scale processors, has claimed a significant performance advantage over Nvidia's latest Blackwell GPUs in large language model (LLM) inference.[1][2] According to Cerebras, its CS-3 system, powered by the Wafer Scale Engine 3 (WSE-3), achieved an inference speed of over 2,500 tokens per second (t/s) on Meta's 400-billion-parameter Llama 4 Maverick model.[1][2] This performance, measured by the independent benchmarking firm Artificial Analysis, reportedly more than doubles the 1,038 t/s achieved by an Nvidia DGX B200 system equipped with eight Blackwell GPUs on the same model.[1][3] Cerebras asserts this result sets a new world record for LLM inference speed on this particular Llama model.[1][2]
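For context, a quick back-of-the-envelope check of the figures reported above. The per-token latency numbers below are simply derived from the throughput claims and assume a single generation stream; they are illustrative, not measured values:

```python
# Back-of-the-envelope check of the throughput figures reported above.
cerebras_tps = 2500   # tokens/s claimed for a CS-3 on Llama 4 Maverick
dgx_b200_tps = 1038   # tokens/s reported for an 8-GPU Nvidia DGX B200

speedup = cerebras_tps / dgx_b200_tps
print(f"Claimed speedup: {speedup:.2f}x")               # ~2.41x, i.e. "more than double"

# Implied time per generated token, assuming a single output stream.
print(f"CS-3:     {1000 / cerebras_tps:.2f} ms/token")  # ~0.40 ms
print(f"DGX B200: {1000 / dgx_b200_tps:.2f} ms/token")  # ~0.96 ms
```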
The announcement has ignited fresh debate in the highly competitive AI hardware market, where Nvidia has long been the dominant force.[4][5] Cerebras CEO Andrew Feldman has been vocal about what he sees as his company's advantages, stating, "The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency."[1] He argues that Cerebras's architecture is uniquely suited to address these bottlenecks.[1][6] The company also emphasizes that its hardware and API are available today, drawing a contrast with Nvidia's Blackwell, and claims its record performance was achieved without special kernel optimizations that may not be accessible to all users.[1]
The core of Cerebras's technology is its Wafer Scale Engine, a single, massive chip that contrasts with the traditional approach of using multiple smaller GPUs.[7][8] The WSE-3, a 5 nm chip with 4 trillion transistors and 900,000 AI cores, is designed to offer significant advantages in on-chip memory and memory bandwidth.[9][8] Cerebras claims the WSE-3 provides 7,000 times more memory bandwidth than an Nvidia H100 GPU.[10][11][12] This architecture, according to Cerebras, allows it to process large models more efficiently by keeping more data on-chip, reducing the communication overhead that can slow down GPU clusters.[7][13] For instance, Cerebras states that a single CS-3 can deliver raw performance roughly equivalent to 3.5 Nvidia DGX B200 servers in a more compact footprint and with lower power consumption per unit of performance.[9][14] Specifically, while a CS-3 consumes 23 kW at peak compared to a DGX B200's 14.3 kW, Cerebras claims a 2.2x improvement in performance per watt due to its higher overall throughput.[9][14]
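A rough sanity check shows how the claimed 2.2x performance-per-watt figure follows from the numbers above, assuming the "3.5 DGX B200 servers" equivalence carries over directly to the power comparison:

```python
# Rough check of the claimed 2.2x performance-per-watt advantage.
cs3_peak_power_kw = 23.0    # Cerebras CS-3 peak power draw
dgx_b200_power_kw = 14.3    # Nvidia DGX B200 power draw
dgx_equivalents = 3.5       # Cerebras's claimed raw-performance equivalence

# Power an equivalent amount of DGX B200 hardware would draw vs. one CS-3.
equivalent_dgx_power_kw = dgx_equivalents * dgx_b200_power_kw   # ~50 kW
perf_per_watt_gain = equivalent_dgx_power_kw / cs3_peak_power_kw
print(f"Implied performance-per-watt advantage: {perf_per_watt_gain:.1f}x")  # ~2.2x
```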
Nvidia, with its recently unveiled Blackwell architecture, is not standing still.[15][16] The Blackwell platform, which includes the B200 GPU with 208 billion transistors, promises significant performance gains over its Hopper predecessor, including up to 4x faster LLM training and 30x faster LLM inference in certain configurations.[16][17] Nvidia emphasizes extensive software optimizations with TensorRT-LLM and techniques like speculative decoding to maximize Blackwell's performance.[18][19] Nvidia also points to the scalability of its solutions, with the GB200 NVL72, a full-rack system with 72 Blackwell GPUs, offering substantial aggregate compute power.[14] While Cerebras touts its current availability and raw single-system performance on specific benchmarks, Nvidia's established ecosystem, broad software support (CUDA), and strong partnerships with major cloud providers present a formidable competitive advantage.[20][4][5] Analysts note that enterprises often prioritize these established ecosystems and the ease of integration they offer.[20][4]
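To illustrate the speculative decoding technique mentioned above, here is a minimal, heavily simplified sketch. The toy draft and target "models" and the tiny vocabulary are invented purely for illustration, and the corrective resampling step used in production systems such as TensorRT-LLM is omitted:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration

def draft_prob(context, token):
    """Cheap 'draft' model: uniform over the vocabulary (deliberately weak)."""
    return 1.0 / len(VOCAB)

def target_prob(context, token):
    """Expensive 'target' model: slightly prefers 'cat' (toy distribution)."""
    return 2.0 / len(VOCAB) if token == "cat" else 0.75 / len(VOCAB)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then let the target model accept or reject them.

    Each drafted token is kept with probability min(1, p_target / p_draft); on the
    first rejection the step ends (the corrective resample from the target model
    is omitted in this sketch).
    """
    accepted = []
    for _ in range(k):
        token = random.choice(VOCAB)  # sample from the uniform draft model
        accept_prob = min(1.0, target_prob(context, token) / draft_prob(context, token))
        if random.random() < accept_prob:
            accepted.append(token)
            context = context + [token]
        else:
            break
    return accepted

print(speculative_step(["the"]))  # e.g. ['cat', 'sat']
```

The latency win comes from the target model verifying several drafted tokens in a single pass instead of generating them one at a time, which is exactly the bottleneck both vendors are competing to reduce.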
The implications of this competition are significant for the AI industry. The demand for faster and more efficient LLM inference is surging as AI applications become more sophisticated and widespread, moving beyond training to focus on the cost and speed of deploying these models.[20][4][6] Use cases like real-time reasoning, advanced AI agents, and complex code generation require extremely low latency, making inference speed a critical factor.[1][10] Cerebras's reported performance on the Llama 4 Maverick model, if consistently replicable and broadly applicable, could offer a compelling alternative for companies prioritizing raw inference speed for very large models.[1][21] The company has also highlighted its success in other demanding applications, such as molecular dynamics simulations, where it claims its CS-3 system can be 700 times faster than the Frontier supercomputer.[10][12]
However, Nvidia's Blackwell is also targeting these demanding workloads with significant architectural improvements and a strong focus on both throughput and latency.[18][19] The market also includes other players, such as Groq and SambaNova, offering specialized AI inference solutions.[1][20] Ultimately, the choice between these competing architectures will depend on a variety of factors, including the specific AI workloads, the scale of operations, existing infrastructure, engineering expertise, and total cost of ownership.[20][4][22] While Cerebras positions itself as a performance leader for specific large-model inference tasks, Nvidia's incumbency, vast ecosystem, and continuous innovation with platforms like Blackwell ensure a dynamic and evolving AI hardware landscape.
Sources
[1]
[3]
[5]
[6]
[7]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[21]
[22]