Rival Anthropic's Claude Beats OpenAI's GPT-5 on Groundbreaking New AI Test
OpenAI's transparent new professional benchmark finds a rival's AI outperforming its own, signaling a shift in defining true capability.
September 30, 2025

In a striking admission from a leader in the artificial intelligence race, a new benchmark study released by OpenAI reveals that a model from rival firm Anthropic has outperformed its own flagship offering, GPT-5. The study details the performance of various AI models on a novel evaluation suite called GDPval, designed to measure capabilities on complex, real-world tasks that mirror professional jobs. The results show Anthropic’s Claude Opus 4.1 as the top-performing model, signaling a significant shift in how AI capabilities are measured and understood, moving from academic exercises to practical, economically valuable work. This development underscores the intense competition and nuanced differentiation emerging among frontier AI systems, suggesting that the title of "best" model is becoming increasingly dependent on the specific application.
The new evaluation framework, named GDPval, represents a deliberate move away from traditional AI benchmarks that often resemble academic exams or coding competitions.[1][2] Instead, OpenAI has developed a test that measures model performance on tasks directly sourced from the daily work of experienced professionals across 44 different occupations.[3][4][1] These occupations span nine of the top sectors contributing to U.S. Gross Domestic Product, including healthcare, finance, manufacturing, and media.[4][5] The benchmark comprises 1,320 specialized tasks in its full set, with a publicly available "gold" subset of 220 tasks for broader research.[3][5][6] Unlike simple text-based prompts, GDPval tasks are complex, often requiring the AI to work with multiple reference files and produce sophisticated deliverables such as slide decks, spreadsheets, technical diagrams, and even CAD files.[5][7][8] To ensure realism and quality, each task was crafted and vetted by industry experts with an average of 14 years of experience in their respective fields.[4][5][7] The primary scoring method is a blind, head-to-head comparison in which human experts rate the AI-generated deliverable against one produced by a human professional, without knowing which is which.[7][8]
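The headline figures reported below are rates derived from this kind of blinded comparison: the share of tasks on which graders judged a model's deliverable as good as or better than the human expert's. The following is a minimal, hypothetical sketch of how such a win-or-tie rate could be tallied; the data structures, field names, and sample tasks are illustrative assumptions, not OpenAI's actual grading code.

```python
# Illustrative sketch (not OpenAI's implementation) of tallying a
# GDPval-style win-or-tie rate from blinded pairwise judgments.
# Each judgment records whether an expert grader preferred the model's
# deliverable, the human professional's, or rated them as equal.
from dataclasses import dataclass

@dataclass
class Judgment:
    task_id: str
    verdict: str  # "model_better", "human_better", or "tie"

def win_or_tie_rate(judgments: list[Judgment]) -> float:
    """Fraction of tasks where the model's deliverable was judged
    as good as or better than the human expert's."""
    if not judgments:
        return 0.0
    favorable = sum(j.verdict in ("model_better", "tie") for j in judgments)
    return favorable / len(judgments)

# Example with three hypothetical blinded comparisons.
sample = [
    Judgment("slide_deck_market_summary", "model_better"),
    Judgment("quarterly_budget_spreadsheet", "tie"),
    Judgment("cad_bracket_revision", "human_better"),
]
print(f"Win-or-tie rate: {win_or_tie_rate(sample):.1%}")  # 66.7%
```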
The results from the GDPval assessment revealed a surprising hierarchy among leading AI models. Anthropic’s Claude Opus 4.1 emerged as the clear leader, with 47.6% of its deliverables judged better than or as good as the human-created examples.[6][2] In second place was OpenAI’s own advanced model, GPT-5 (high), which scored 38.8%.[6][2] That gap of nearly nine percentage points marks a significant performance difference on these new, practical evaluations. Other models tested, including Google’s Gemini 2.5 Pro, xAI’s Grok 4, and OpenAI’s older GPT-4o, scored considerably lower, demonstrating the rapid pace of advancement at the frontier of AI development.[6][2] OpenAI’s paper further detailed the specific strengths of the top two models, noting that Claude Opus 4.1 particularly excelled at tasks involving aesthetics, such as document formatting and the layout of presentation slides.[3][1][6] Conversely, GPT-5 demonstrated superior performance in areas requiring accuracy, such as carefully following instructions and executing calculations correctly.[3][1][6] This distinction suggests that different models are developing specialized strengths, a critical insight for businesses and users looking to deploy AI for specific professional tasks.
The decision by OpenAI to publish a study that prominently features a competitor’s superior performance has been widely seen as a significant act of transparency in a highly competitive industry. OpenAI stated that its goal with GDPval is to "transparently communicate progress on how AI models can help people in the real world," aligning with its broader mission to ensure artificial general intelligence benefits all of humanity.[1][2] The rivalry between OpenAI and Anthropic is particularly notable, as Anthropic was founded by former senior members of OpenAI.[9] The two companies are often seen as representing differing philosophies on AI development, with Anthropic placing a strong emphasis on safety and Constitutional AI principles.[10] The introduction of GDPval also reflects a broader industry trend toward more realistic and less gameable benchmarks. As models have become more powerful, there is growing recognition that success on standardized tests does not always translate into utility in messy, real-world scenarios. This new focus on practical application is forcing a reevaluation of what makes an AI model truly capable and valuable.
Ultimately, the findings from the GDPval benchmark signal a maturing landscape for artificial intelligence evaluation. The era of single, universal leaderboards may be giving way to a more nuanced understanding of AI capabilities, where the best tool depends entirely on the job at hand. For businesses and professionals, the clear differentiation between models—one excelling in aesthetic presentation, the other in logical accuracy—provides a more practical framework for choosing the right AI assistant. While the study shows that the top models’ deliverables matched or exceeded expert work on nearly half of the tested tasks, OpenAI is careful to note the benchmark’s current limitations.[3] The tasks are "one-shot" assignments and do not yet capture the iterative feedback loops, context-building, and collaborative dynamics that define much of human knowledge work.[3][8] Nonetheless, this transparent and reality-grounded approach to benchmarking is a crucial step forward, pushing the industry beyond abstract scores and toward a future where the true measure of an AI is its tangible impact on productivity and creativity in the real world.