OpenAI and tech giants launch MRC protocol to solve bottlenecks in massive AI supercomputers

OpenAI and tech giants launch the MRC protocol to eliminate networking bottlenecks and power the world's largest AI supercomputers

May 6, 2026

As the race for artificial intelligence supremacy shifts from model architecture to the physical foundations of computing, a new bottleneck has emerged at the heart of the world’s largest supercomputers. When clusters grow to the scale of 100,000 or more graphics processing units, the traditional methods used to move data between those chips begin to fail, creating massive delays that leave billions of dollars in hardware sitting idle. To dismantle this barrier, OpenAI has spearheaded a high-stakes collaboration with the industry’s most influential hardware and software powers, including AMD, Broadcom, Intel, Microsoft, and NVIDIA.[1] Together, this consortium has developed a new open-source networking protocol known as Multipath Reliable Connection, or MRC.[2][3][4] Designed to replace aging networking standards that were never intended for the unique stresses of generative AI training, MRC is already being deployed within the massive Stargate supercomputer project to ensure that the next generation of frontier models can scale without technical interruption.
The primary challenge facing modern AI infrastructure is a phenomenon known as the straggler effect. In large-scale synchronous training, thousands of GPUs must work in lockstep, frequently stopping to exchange data and synchronize their progress.[2][1] If a single network link becomes congested or a switch fails, the entire cluster must wait for that one "straggler" to catch up. As clusters scale toward 100,000 GPUs, such delays become statistically inevitable under traditional networking architectures like standard Ethernet or InfiniBand, which typically pin each data flow to a single path: if that route is blocked, traffic must wait or be re-routed through slow, software-driven decisions. MRC addresses this by implementing a technique called packet spraying.[5][1] Instead of pinning a data flow to a single wire, MRC distributes its packets across hundreds of available network paths simultaneously.[1][6][4][7] This prevents any single congested link from stalling a larger operation and allows the network to function as a fluid, unified fabric rather than a collection of rigid, isolated connections.
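To make the idea concrete, the short Python sketch below shows one way a sender could spray a single flow's packets round-robin across several parallel paths, tagging each packet with a per-flow sequence number so the receiver can reassemble them in order. It illustrates the general technique only; the field names, the spray_flow helper, and the round-robin policy are invented for this example and are not taken from the MRC specification.

import itertools
from dataclasses import dataclass

@dataclass
class Packet:
    flow_id: int       # logical flow the packet belongs to
    seq: int           # per-flow sequence number so the receiver can reorder
    path_id: int       # network path chosen for this individual packet
    payload: bytes

def spray_flow(flow_id: int, chunks: list[bytes], paths: list[int]) -> list[Packet]:
    """Distribute one flow's packets round-robin across all available paths.

    A congested or failed path only delays the packets assigned to it,
    instead of stalling the entire flow as single-path routing would.
    """
    path_cycle = itertools.cycle(paths)
    return [
        Packet(flow_id=flow_id, seq=i, path_id=next(path_cycle), payload=chunk)
        for i, chunk in enumerate(chunks)
    ]

# Example: eight packets of one flow sprayed over four parallel paths.
packets = spray_flow(flow_id=7, chunks=[b"x" * 512] * 8, paths=[0, 1, 2, 3])
for p in packets:
    print(p.flow_id, p.seq, p.path_id)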
Beyond mere congestion management, MRC enables a fundamental redesign of how supercomputers are physically built. Most high-performance computing clusters use a multi-tier "fat tree" or Clos topology, which often requires three or four layers of expensive network switches to connect massive numbers of nodes. Each additional layer adds latency, increases power consumption, and introduces more points of failure. By utilizing MRC in conjunction with high-radix switches, engineers can now connect over 100,000 GPUs using only two switch layers. This is achieved by splitting high-speed 800 gigabit-per-second network interfaces into eight parallel 100 gigabit networks. This approach drastically improves "radix efficiency," allowing a single switch to manage more connections than was previously possible. For a project as large as Stargate, reducing the number of switch tiers can save hundreds of millions of dollars in capital expenditure while significantly lowering the massive energy requirements of the data center.
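The capacity math behind the two-layer claim can be sketched in a few lines. The snippet below assumes an idealized non-blocking leaf-spine fabric and uses 512-port switches running 100 gigabit-per-second lanes as an illustrative figure; the two_tier_endpoints helper and the exact port counts are assumptions made for the example, not published Stargate numbers.

def two_tier_endpoints(switch_radix: int) -> int:
    """Endpoints reachable in an idealized non-blocking two-tier leaf-spine fabric.

    Each leaf uses half its ports for endpoints and half for uplinks to
    spines; each spine port serves one leaf, so the number of leaves is
    bounded by the spine radix.
    """
    leaves = switch_radix                  # one spine port per leaf
    endpoints_per_leaf = switch_radix // 2
    return leaves * endpoints_per_leaf

# Illustrative figures: a 51.2 Tb/s switch exposes 512 ports at 100 Gb/s.
# Splitting each 800 Gb/s GPU interface into 8 x 100 Gb/s lanes lets every
# lane join its own two-tier plane, with eight identical planes in parallel.
per_plane = two_tier_endpoints(switch_radix=512)
print(per_plane)   # 131072 GPUs per plane -> comfortably over 100,000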
One of the most radical technical departures in the MRC protocol is its move away from traditional dynamic routing protocols like the Border Gateway Protocol, which has long been a staple of internet and data center networking. In traditional setups, if a switch fails, the network must "converge," a process in which routers exchange updates to agree on new paths, and one that can take seconds or even minutes.[4] In the context of AI training, a multi-second delay is catastrophic. MRC replaces this with static source routing based on the IPv6 Segment Routing standard.[5] In this model, the sender, meaning the GPU or its network interface card, defines the entire path the data will take through the network by encoding it directly into the packet header.[1] If a failure is detected, the hardware can reroute traffic in microseconds, a speed that is virtually invisible to the AI training job.[4] This level of resilience is already being proven in production environments; reports indicate that technicians have successfully rebooted core switches during active frontier model training runs without causing any measurable impact on the workload's progress.
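As a rough illustration of the source-routing idea, the sketch below shows a sender that carries a precomputed hop list in each packet and flips to a precomputed backup path the moment a failure is flagged, with no routing-protocol convergence involved. The class names, segment labels, and failure callback are hypothetical and do not reflect the actual MRC or SRv6 header layout.

from dataclasses import dataclass

@dataclass
class SourceRoutedPacket:
    # The full hop-by-hop path is fixed by the sender and carried with the
    # packet, SRv6-style, so switches simply consume the next segment.
    segments: list[str]
    payload: bytes

@dataclass
class Sender:
    # Precomputed disjoint paths to the destination; no routing protocol
    # needs to converge when one of them fails.
    primary: list[str]
    backup: list[str]
    primary_healthy: bool = True

    def send(self, payload: bytes) -> SourceRoutedPacket:
        path = self.primary if self.primary_healthy else self.backup
        return SourceRoutedPacket(segments=list(path), payload=payload)

    def on_link_failure(self) -> None:
        # Local, microsecond-scale decision: stop using the failed path.
        self.primary_healthy = False

nic = Sender(primary=["leaf3", "spine12", "leaf9"],
             backup=["leaf3", "spine27", "leaf9"])
print(nic.send(b"grad-shard").segments)   # ['leaf3', 'spine12', 'leaf9']
nic.on_link_failure()
print(nic.send(b"grad-shard").segments)   # ['leaf3', 'spine27', 'leaf9']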
The strategic importance of this protocol is underscored by the unlikely alliance of companies that brought it to life. While NVIDIA has long dominated the AI networking space with its proprietary InfiniBand technology, its participation in the MRC project—alongside direct competitors like AMD, Intel, and Broadcom—signals a shift toward a more open, Ethernet-based future for hyper-scale AI. By contributing the MRC specification to the Open Compute Project, the consortium is establishing a common standard that ensures hardware from different vendors can interoperate seamlessly. This prevents large customers from being locked into a single supplier’s ecosystem, fostering a competitive market for AI-optimized networking hardware. For chipmakers like AMD and Broadcom, MRC represents a path to challenge NVIDIA's dominance by offering high-performance, open-standard alternatives that are purpose-built for the "AI factory" model of computing.
The real-world application of MRC is perhaps most visible in the Stargate and Fairwater supercomputers, massive infrastructure projects led by Microsoft and OpenAI. These facilities represent an investment of tens, and potentially hundreds, of billions of dollars, intended to provide the raw horsepower necessary for Artificial General Intelligence.[8] In these environments, the network is not just a utility but a primary component of the "computer" itself. The deployment of MRC allows these clusters to maintain high GPU utilization, ensuring that the expensive silicon is actually processing data rather than waiting for it. By integrating the protocol into cutting-edge hardware like Broadcom’s Tomahawk switches and NVIDIA’s Blackwell-generation systems, the partners have moved the technology from a theoretical concept to an operational foundation for the global AI industry.
As AI models continue to grow in complexity and parameter count, the ability to scale infrastructure will be the defining factor in which organizations lead the field. The creation of the MRC protocol marks the end of the era where general-purpose networking could suffice for specialized AI tasks. By treating the network as an integrated part of the AI stack—capable of managing its own failures and optimizing its own paths at microsecond speeds—OpenAI and its partners have cleared a major hurdle on the path to gigascale computing. The industry's move toward open-source, multi-path reliable networking ensures that the next leap in machine intelligence will be supported by a foundation that is as resilient as it is powerful, turning what was once a bottleneck into a high-speed highway for the future of digital thought.
