Microsoft's Azure Achieves World-Record 1.1 Million AI Tokens/Second Inference
Azure and NVIDIA's 1.1M tokens/second milestone redefines AI speed, powering responsive, cost-effective generative models for enterprise innovation.
November 4, 2025

Microsoft has set a new benchmark in artificial intelligence, achieving an inference speed of 1.1 million tokens per second on its Azure cloud platform. The industry record was set on a single rack running Azure ND GB300 v6 virtual machines, built on NVIDIA's latest GB300 NVL72 rack-scale systems. The breakthrough, announced by Microsoft CEO Satya Nadella, represents a significant leap in the speed at which large language models can generate responses, a critical factor for the widespread deployment of generative AI applications. It is the direct result of a long-standing collaboration between Microsoft and NVIDIA, combining Microsoft's expertise in large-scale AI operations with NVIDIA's cutting-edge hardware. The new performance level is expected to make complex, enterprise-scale models more responsive and economically viable.
The record-breaking performance was achieved using Meta's Llama 2 70B model, a widely adopted open-source large language model that serves as an industry standard for benchmarking.[1][2] The test was conducted on a single NVIDIA GB300 NVL72 rack, which combines 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs in a liquid-cooled, rack-scale configuration.[2][3] Microsoft's Azure ND GB300 v6 virtual machines are specifically optimized for these demanding inference workloads, offering 50% more GPU memory and a 16% higher Thermal Design Power (TDP) than their predecessors.[1][2] The test used FP4 precision, a 4-bit floating-point quantization format that sharply increases arithmetic throughput and reduces memory use while preserving accuracy on this workload.[1] The result works out to approximately 15,200 tokens per second for each of the 72 Blackwell Ultra GPUs, a substantial increase over previous hardware generations.[1][2] The results and methodology were validated by Signal65, an independent performance and benchmarking firm, lending third-party credibility to the milestone.[2]
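As a sanity check, the per-GPU figure follows directly from the published rack-level number. Here is a minimal sketch in Python, assuming only the reported aggregate of 1.1 million tokens per second; the exact benchmark harness and measurement window are not public, so this is a consistency check on the published figures, not a reproduction of the benchmark:

```python
# Back-of-envelope check of the per-GPU throughput implied by the record.
# Figures come from the announcement; nothing here is measured data.

AGGREGATE_TOKENS_PER_SEC = 1_100_000  # reported rack-level throughput
GPUS_PER_RACK = 72                    # Blackwell Ultra GPUs in a GB300 NVL72

per_gpu = AGGREGATE_TOKENS_PER_SEC / GPUS_PER_RACK
print(f"Implied per-GPU throughput: {per_gpu:,.0f} tokens/s")
# Prints ~15,278 tokens/s; the published ~15,200 figure suggests the exact
# aggregate was slightly below the rounded 1.1M headline number.
```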
This achievement marks a significant step up from Microsoft's own previous record of 865,000 tokens per second, set on a system using the NVIDIA GB200 NVL72 rack.[1][2] The roughly 27% gain with the GB300 system demonstrates the rapid pace of innovation in AI hardware and cloud infrastructure.[1][2] Against the prior hardware generation, the gains are even more stark: the Azure ND GB300 v6 VMs deliver approximately five times the throughput per GPU of the previous-generation ND H100 v5 virtual machines.[1] This fivefold jump in per-GPU efficiency matters because demand for real-time, responsive AI applications continues to grow. Faster inference speeds directly improve the user experience in applications like chatbots, virtual assistants, and content creation tools, making them feel more fluid and natural.
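The generation-over-generation comparison is simple arithmetic on the cited figures. A short sketch follows; note that the implied ND H100 v5 per-GPU number is derived from the article's "approximately five times" claim rather than independently reported:

```python
# Generation-over-generation comparison using the figures cited above.
# All values come from the announcement except where noted as derived.

GB300_RACK_TPS = 1_100_000  # ND GB300 v6 record (this announcement)
GB200_RACK_TPS = 865_000    # previous record on a GB200 NVL72 rack

uplift = GB300_RACK_TPS / GB200_RACK_TPS - 1
print(f"Rack-level uplift over GB200: {uplift:.0%}")  # ~27%

# Derived, not reported: the ~5x per-GPU claim versus ND H100 v5 implies
# roughly 15,200 / 5, i.e. ~3,000 tokens/s per H100-generation GPU.
gb300_per_gpu = 15_200
implied_h100_per_gpu = gb300_per_gpu / 5
print(f"Implied ND H100 v5 per-GPU throughput: ~{implied_h100_per_gpu:,.0f} tokens/s")
```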
The implications of breaking the million-token-per-second barrier are far-reaching. For enterprises, this level of performance makes it more feasible to deploy powerful, large-scale AI models without prohibitive costs or latency.[4] It opens up new possibilities for applications that require instantaneous processing of vast amounts of data, such as real-time language translation, complex financial modeling, and advanced scientific research. By demonstrating this capability on a commercially available cloud platform, Microsoft is signaling to the market that enterprise-grade, high-performance AI is no longer a niche or experimental technology but a reliable and scalable utility. The record serves as a proof point that the infrastructure required for transformative AI is now accessible, potentially accelerating the adoption of generative AI across various industries.[1][4]
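To make the scale concrete, a rough sizing sketch helps. The per-user streaming rate below is an illustrative assumption, not a figure from the announcement, and real capacity depends on batching, sequence lengths, and latency targets:

```python
# Illustrative capacity estimate: concurrent streaming sessions one rack
# could sustain at the reported aggregate throughput. The per-user rate
# is an assumed value chosen for illustration only.

RACK_TPS = 1_100_000           # reported aggregate throughput (tokens/s)
TOKENS_PER_USER_PER_SEC = 30   # assumed comfortable streaming rate per user

concurrent_streams = RACK_TPS // TOKENS_PER_USER_PER_SEC
print(f"~{concurrent_streams:,} concurrent streams at "
      f"{TOKENS_PER_USER_PER_SEC} tokens/s each")
# ~36,666 streams under these simplifying assumptions.
```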
In conclusion, Microsoft's new inference record is more than just a number; it is a clear indicator of the direction of the AI industry. The deep collaboration between Microsoft and NVIDIA continues to push the boundaries of what is possible in cloud computing and artificial intelligence.[5][4] This achievement in processing speed will likely intensify the competition among major cloud providers to offer the most powerful and efficient AI infrastructure. For developers and businesses, this milestone promises a future where sophisticated AI models are faster, more accessible, and more integrated into daily operations, ultimately driving the next wave of innovation powered by artificial intelligence.