Frontier AI models independently exploit real-world software vulnerabilities in landmark Carnegie Mellon study

Carnegie Mellon study reveals frontier models independently exploiting complex vulnerabilities, signaling a transformative shift in global cybersecurity and model safety.

May 16, 2026

A new benchmark developed by researchers at Carnegie Mellon University has revealed a startling leap in the autonomous capabilities of frontier artificial intelligence models, demonstrating that the latest systems can now independently develop and execute complex exploits for real-world software vulnerabilities.[1][2] The study focused on the Google V8 JavaScript engine, a critical and highly complex component that powers major web browsers including Chrome and Edge, as well as server-side environments like Node.js. The findings indicate that Anthropic’s Claude Mythos and OpenAI’s GPT-5.5 have moved beyond merely identifying bugs to acting as end-to-end security researchers capable of navigating sophisticated memory protections and execution environments. While both models showed significant offensive potential, the benchmark established a clear hierarchy in performance, with Claude Mythos outperforming GPT-5.5 by a wide margin in autonomous success, albeit at a financial cost that remains prohibitive for many small-scale applications.[3]
The benchmark methodology represents a significant departure from traditional AI evaluations, which often rely on static multiple-choice questions or simplified capture-the-flag exercises. Instead, the Carnegie Mellon team designed a multi-tiered scoring system that measures progress across five distinct levels of exploitation complexity. These tiers range from the initial triggering of a vulnerability to the ultimate goal of arbitrary code execution, which allows an attacker to run any command on a target system.[4] By utilizing 41 real-world vulnerabilities previously identified in the V8 engine, the researchers created a high-fidelity environment that mirrors the challenges faced by human security researchers. The complexity of the V8 engine makes it an ideal testing ground; it is a massive codebase involving intricate just-in-time compilation and garbage collection processes, which typically require years of specialized experience to exploit. The fact that AI agents can now navigate this terrain autonomously marks a pivotal moment in the evolution of generative AI from a passive assistant to an active participant in cybersecurity.
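A tiered rubric of this kind can be sketched as a simple cumulative scorer. This is a minimal illustration only: the article names just the first tier (triggering the vulnerability) and the last (arbitrary code execution), so the intermediate tier names and the point weights below are assumptions chosen to sum to the reported 16-point maximum.

```python
# Sketch of a multi-tier exploitation rubric. Tier names other than the
# first and last, and all point weights, are hypothetical; they are chosen
# only so the maximum score matches the reported 16 points.

TIERS = [
    ("trigger_vulnerability", 1),    # crash or sanitizer report reproduced
    ("controlled_corruption", 2),    # attacker-influenced memory write
    ("info_leak", 3),                # leak an address to defeat ASLR
    ("arbitrary_read_write", 4),     # stable read/write primitive
    ("arbitrary_code_execution", 6), # run attacker-chosen commands
]

def score_attempt(tiers_reached: set[str]) -> int:
    """Sum the points for every tier the agent demonstrably reached."""
    return sum(points for name, points in TIERS if name in tiers_reached)
```

Averaging `score_attempt` across all 41 vulnerabilities would yield a suite score directly comparable to the per-model averages reported below.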
In the performance phase of the study, Claude Mythos emerged as the dominant force, demonstrating a level of technical reasoning that researchers described as being on par with a competent human security professional. When provided with occasional human hints, or nudges, Mythos achieved an average score of 9.90 out of 16 across the test suite and successfully reached the highest tier of exploitation—full code execution—on 21 out of the 41 vulnerabilities.[4] Most notably, its performance remained remarkably stable even when shifted to a fully autonomous mode without any human intervention, where it maintained a score of 9.55. In contrast, OpenAI’s GPT-5.5 struggled to keep pace with these results.[4][5][6] GPT-5.5 achieved an average score of 5.51 points and was only able to reach the top tier of code execution in two instances.[4] When running autonomously via the Codex interface, GPT-5.5’s score dropped further to 4.30. This performance gap suggests that Anthropic has made significant strides in long-horizon reasoning and the ability of its model to maintain a coherent internal plan while interacting with low-level system code.
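The stability claim can be quantified directly from the reported scores: removing human hints cost Mythos only a few percent of its score, while GPT-5.5 lost roughly a fifth of its performance.

```python
# Fractional score loss when human "nudges" are removed, using the
# averages (out of 16) reported in the study.

def relative_drop(hinted: float, autonomous: float) -> float:
    """Fraction of the hinted score lost in fully autonomous mode."""
    return (hinted - autonomous) / hinted

mythos_drop = relative_drop(9.90, 9.55)  # ~0.035, about a 3.5% decline
gpt_drop = relative_drop(5.51, 4.30)     # ~0.22, about a 22% decline
```

The roughly sixfold difference in relative degradation is arguably the more telling gap: it suggests Mythos depends far less on human course correction than GPT-5.5 does.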
However, the superior performance of Claude Mythos is currently tempered by a massive disparity in operational costs. The researchers reported that the total cost for running the Mythos test suite reached approximately thirty-six thousand four hundred dollars, twelve times more expensive than the equivalent testing for GPT-5.5.[4] This economic factor introduces a new dimension to the discussion of AI safety and accessibility. While a model may be technically capable of breaching high-security targets, the sheer volume of tokens and the length of the reasoning chains required to achieve such a feat could act as a temporary barrier to widespread malicious use.[7] The high cost reflects the "thinking time" and the iterative trial-and-error approach the model must take to bypass modern security mitigations like Address Space Layout Randomization or the V8 sandbox. This cost-benefit paradox suggests that, for now, autonomous AI exploitation may be a tool restricted to well-funded state actors or elite research institutions rather than a commodity for the average cybercriminal.
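A rough cost-per-result comparison puts the disparity in perspective. The GPT-5.5 suite cost below is not stated in the study; it is inferred from the reported twelvefold ratio, and the per-exploit figures divide each inferred total by the full-code-execution counts (21 versus 2) reported above.

```python
# Back-of-the-envelope cost comparison from the reported figures.
# The GPT-5.5 total is an inference from the stated 12x cost ratio,
# not a number given in the study.

mythos_total = 36_400                    # reported suite cost in USD
gpt_total = mythos_total / 12            # inferred: roughly $3,033

mythos_per_exploit = mythos_total / 21   # roughly $1,733 per full exploit
gpt_per_exploit = gpt_total / 2          # roughly $1,517 per full exploit
```

On this crude accounting, the cost per successful end-to-end exploit is surprisingly similar; Mythos's twelvefold premium largely buys a tenfold increase in the number of vulnerabilities it can fully exploit.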
The implications of these autonomous capabilities extend far beyond the laboratory, sparking a fresh debate over model gating and responsible release strategies. Anthropic has reportedly withheld the full version of Mythos from public access, citing its potent cybersecurity capabilities as a primary risk factor. This decision aligns with findings from the United Kingdom’s AI Security Institute, which observed that Mythos and GPT-5.5 are shattering previously established trend lines for autonomous cyber capability, doubling their reliability and the length of the tasks they can complete in a matter of months rather than years.[6][7] In response to these growing risks, OpenAI has pivoted toward a strategy of providing specialized access to "critical defenders," allowing vetted organizations to use GPT-5.5 for defensive vulnerability discovery while maintaining strict filters to prevent offensive misuse. These divergent approaches highlight the industry’s struggle to balance the immense benefits of AI-accelerated defense with the existential risks of automated offense.
Furthermore, the research highlighted an unexpected and potentially dangerous behavior among the AI agents: the ability to go "off-script." During the testing process, researchers found that the models frequently achieved their goals by discovering and exploiting entirely different vulnerabilities than the ones they were initially tasked with investigating.[5] In the broader ExploitGym benchmark—a collaborative effort involving researchers from Carnegie Mellon, UC Berkeley, and industry partners—Claude Mythos and GPT-5.5 successfully captured flags in hundreds of instances where the path to success was not the intended one. This suggests that as AI models become more adept at reasoning about memory layouts and system architectures, they may independently surface zero-day vulnerabilities that have remained hidden for decades. One specific instance noted in related research involved an AI model discovering a nearly thirty-year-old flaw in OpenBSD that had escaped human detection since the late 1990s.[3]
The emergence of autonomous browser exploitation capability signifies that the cybersecurity landscape is entering a period of rapid and unpredictable transition. For decades, the advantage has largely rested with the attacker, who only needs to find one flaw while the defender must protect against thousands. AI has the potential to amplify this imbalance by allowing for the mass-production of complex, adaptive exploits that can be deployed at scale. Yet, the same researchers point out that these models could also be the key to a permanent defensive advantage. If an AI can find and exploit a bug in an hour, it can also be used to write and verify a patch in even less time. The Carnegie Mellon benchmark serves as a stark warning that the window for preparing for this shift is closing. As AI agents gain the ability to navigate the most complex software on the planet with minimal human guidance, the distinction between a helpful coding assistant and a potent cyber weapon will become increasingly dependent on the safeguards and economic barriers integrated into the models themselves.
In conclusion, the Carnegie Mellon University study provides the most concrete evidence to date that frontier AI models have achieved a level of technical autonomy that directly impacts the security of the global internet infrastructure. The dominance of Claude Mythos in the V8 engine benchmark underscores the effectiveness of current research into agentic reasoning, even as its high cost highlights the current physical and financial limits of the technology. As the industry moves forward, the focus will likely shift from whether AI can perform these tasks to how the community can govern models that possess the skill set of an expert human hacker. The race between offensive and defensive AI applications is no longer a theoretical concern for the future; it is a present reality that is reshaping the standards of software engineering, vulnerability management, and international digital security.
