New Benchmark: AI Models Now Rival Human Experts in Real-World Professions

OpenAI's GDPval benchmark reveals frontier AI now performs at human-expert levels in diverse, economically valuable professional tasks.

September 26, 2025

The latest generation of top-tier artificial intelligence models is demonstrating capabilities that approach, and in some cases meet or exceed, the quality of work produced by human experts across a wide range of professional tasks. This is the central finding from a new and comprehensive evaluation framework developed by OpenAI, called GDPval, which aims to measure the performance of AI on economically valuable, real-world assignments.[1][2] The benchmark suggests a rapid acceleration in AI capabilities, moving beyond academic exercises to tackle the complex, nuanced work that defines many knowledge-based professions. The results indicate that the gap between human and machine performance in these areas is closing faster than many anticipated, signaling a potentially transformative shift for numerous industries.[3]
At the heart of this development is GDPval, a novel benchmark designed to bridge the gap between abstract AI evaluations and practical, real-world applications.[4] Unlike previous tests that focused on academic or narrowly defined problems, GDPval comprises 1,320 specialized tasks spanning 44 different occupations.[4][2] These occupations are drawn from the top nine sectors that contribute most significantly to the U.S. Gross Domestic Product (GDP), including fields like healthcare, finance, law, and software engineering.[4][5] The tasks themselves were meticulously crafted by industry professionals with an average of 14 years of experience to ensure they reflect authentic work products, such as legal briefs, engineering blueprints, or nursing care plans.[4][2] This methodology moves beyond simple text prompts, often providing AI models with reference files and requiring the creation of complex deliverables like presentations, spreadsheets, and diagrams, thereby simulating real-world professional assignments.[4][2]
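To make the benchmark's structure concrete, a single GDPval-style task could be modeled roughly as the record below. This is a minimal sketch based only on the description above; the field names and schema are hypothetical, and OpenAI's actual data format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class GDPvalTask:
    """Hypothetical sketch of one benchmark task record, assuming
    the structure described in the article (occupation, sector,
    reference files, and a complex deliverable)."""
    occupation: str                     # one of 44 occupations
    sector: str                         # one of 9 top GDP-contributing sectors
    prompt: str                         # the professional assignment
    reference_files: list[str] = field(default_factory=list)  # supporting inputs
    deliverable_type: str = "document"  # e.g. presentation, spreadsheet, diagram

# Illustrative example, not an actual GDPval task:
task = GDPvalTask(
    occupation="Registered Nurse",
    sector="Healthcare",
    prompt="Draft a care plan for a post-operative patient...",
    reference_files=["patient_chart.pdf"],
)
print(task.occupation, task.deliverable_type)
```

The key point the sketch captures is that a task bundles inputs (reference files) with an expected deliverable type, rather than being a bare text prompt.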
The initial findings from the GDPval tests are striking. In blind evaluations, human experts compared the outputs of leading AI models against work produced by their professional peers.[4] The results showed that the best "frontier models" are already "approaching the quality of work produced by industry experts".[2] One of the top performers, Anthropic's Claude Opus 4.1, produced work that was rated as equal to or better than human experts' in just under half of the tasks.[6][7] OpenAI's own advanced model, GPT-5, also demonstrated strong performance, winning or tying with human experts in 40.6% of the tasks.[3] This represents a significant leap from the 13.7% success rate of its predecessor, GPT-4o, just 15 months prior, highlighting the rapid pace of improvement in AI capabilities.[3] While Claude Opus 4.1 often excelled in aesthetics and presentation, such as document formatting, GPT-5 showed superior performance in accuracy and the application of domain-specific knowledge.[4][1]
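The win-or-tie figures quoted above come from blind pairwise grading: for each task, a human expert judges whether the model's deliverable or the human expert's is better, without knowing which is which. A minimal sketch of how such a rate could be tallied is below; the three-way judgment labels are an assumption, since the article does not specify GDPval's exact grading schema.

```python
from collections import Counter

def win_tie_rate(judgments):
    """Fraction of tasks where the model's deliverable was graded
    as better than or tied with the human expert's.

    judgments: one label per task, each "model", "human", or "tie"
    (hypothetical labels for the blind grader's preference).
    """
    counts = Counter(judgments)
    favorable = counts["model"] + counts["tie"]
    return favorable / len(judgments)

# Toy example: 1,320 tasks, 536 of them favorable to the model.
sample = ["model"] * 400 + ["tie"] * 136 + ["human"] * 784
print(f"{win_tie_rate(sample):.1%}")  # → 40.6%
```

Counting ties as favorable matters: a "win or tie" rate is an upper-bound-style metric, and reporting wins alone would yield a lower number.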
The implications of these findings for the future of work and the broader economy are profound. The ability of AI to perform a substantial percentage of knowledge work tasks at a human-expert level could reshape entire industries.[3] OpenAI's research also found that frontier models can complete these tasks at a fraction of the time and cost of human experts—roughly 100 times faster and cheaper based on pure processing time and API costs.[4][7] However, the company is quick to qualify this, noting that these figures do not account for the necessary human oversight, iteration, and integration required in a real-world workflow.[4][8] The results suggest a future where AI acts as a powerful collaborator, augmenting human capabilities and allowing professionals to offload certain tasks and focus on higher-value activities that require uniquely human skills like complex reasoning, creativity, and emotional intelligence.[9][10][11]
Despite the impressive results, OpenAI acknowledges that GDPval is still in its early stages and has limitations.[4] The current version focuses on "one-shot" tasks and does not fully capture the iterative, collaborative, and often ambiguous nature of real-world knowledge work.[2][7] The benchmark is a starting point, intended to ground conversations about AI's societal impact in concrete evidence rather than speculation.[8][12] As these models continue to improve at an accelerated pace, the GDPval framework provides a crucial tool for tracking their progress and understanding their evolving potential. The clear trend is that AI is rapidly moving from a tool for simple automation to a capable partner in complex, economically valuable work, heralding a new era of human-machine collaboration across the professional landscape.
