Google's AI Agent Outperforms OpenAI on 'Humanity's Last Exam'

Google's Deep Research Agent beats GPT-5 Pro on 'Humanity's Last Exam,' escalating the AI race for autonomous intelligence.

December 12, 2025

Google has unveiled a powerful new AI agent, the Deep Research Agent, which has achieved state-of-the-art results on several challenging industry benchmarks. This development signals a significant advancement in autonomous research capabilities and intensifies the competitive landscape in the artificial intelligence sector. The new agent, built upon the Gemini 3 Pro model, demonstrates a notable leap in handling complex, multi-step reasoning and information synthesis tasks, outperforming prominent rivals in key evaluations. This move underscores a broader industry shift away from simple chatbots towards more sophisticated, agentic systems designed to perform complex tasks autonomously. The Deep Research Agent is engineered to tackle multifaceted queries that would traditionally require hours of human effort, such as in-depth market analysis or scientific literature reviews.[1][2] By iteratively formulating queries, analyzing sources, identifying information gaps, and conducting further searches, the agent can produce comprehensive, cited reports on complex topics.[3][4]
The most striking result from Google's announcement is the Deep Research Agent's performance on the "Humanity's Last Exam" benchmark, a test designed to assess expert-level reasoning and problem-solving across a wide range of academic subjects.[3] The agent achieved a score of 46.4%, surpassing OpenAI's GPT-5 Pro, which scored 38.9%.[3] The benchmark is notoriously difficult and is intended to push the limits of AI reasoning, so a lead of this size suggests a real improvement in the agent's ability to handle nuance and carry out complex logical steps. Google also introduced its own benchmark, DeepSearchQA, specifically designed to evaluate the comprehensiveness of AI agents in web research tasks.[5] On this new benchmark, the Gemini Deep Research agent scored 66.1%, narrowly edging out GPT-5 Pro's 65.2%.[3] On another benchmark, BrowseComp, which focuses on locating hard-to-find facts, the agent scored 59.2%, just short of GPT-5 Pro's 59.5%.[3] Taken together, the results show a clear lead on "Humanity's Last Exam" and near parity with GPT-5 Pro on the two web-research benchmarks.
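The margins behind these comparisons are simple differences in percentage points. A minimal Python sketch, using only the scores quoted above (the variable and function names are illustrative and do not come from Google's announcement):

```python
# Reported scores in percent: (Deep Research Agent, GPT-5 Pro).
# The numbers are the ones quoted in the article; the names are illustrative.
SCORES = {
    "Humanity's Last Exam": (46.4, 38.9),
    "DeepSearchQA": (66.1, 65.2),
    "BrowseComp": (59.2, 59.5),
}


def margins(scores):
    """Return the agent's lead over GPT-5 Pro in percentage points."""
    return {
        name: round(agent - rival, 1)
        for name, (agent, rival) in scores.items()
    }


if __name__ == "__main__":
    for name, gap in margins(SCORES).items():
        leader = "Deep Research Agent" if gap > 0 else "GPT-5 Pro"
        print(f"{name}: {gap:+.1f} points ({leader} ahead)")
```

A positive margin means the Deep Research Agent is ahead; a negative one means GPT-5 Pro is ahead.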
The technical underpinnings of the Deep Research Agent reveal a sophisticated architecture designed for long-running, autonomous tasks.[6] It is based on the Gemini 3 Pro model and leverages an "agentic workflow" to achieve its impressive results.[4] This means that instead of just generating a response based on an initial prompt, the agent can plan and execute a series of actions, including performing web searches and deeply browsing websites for specific data.[1][4] A key feature of its design is an asynchronous execution platform, allowing it to handle research tasks that may take several minutes to complete.[1] Before beginning its work, the agent presents a research plan to the user, who has the option to review and modify it, offering a degree of transparency and control over the process.[1][2] This iterative planning capability, which allows the agent to adjust its strategy based on the information it gathers, was a significant technical hurdle to overcome.[1] The agent's ability to parallelize tasks and ground its subsequent actions on previously found information is central to its effectiveness.[1]
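The plan, search, and gap-check cycle described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's implementation and not the Interactions API: `ResearchState`, `research_loop`, `toy_search`, and the in-memory `CORPUS` are all hypothetical stand-ins, and a real agent would call a model and a web search service where this sketch uses a dictionary lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ResearchState:
    """Hypothetical record of what the loop has gathered so far."""
    question: str
    findings: dict = field(default_factory=dict)      # sub-topic -> note
    queries_run: list = field(default_factory=list)   # every query issued


def research_loop(
    question: str,
    plan: list,
    search: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> ResearchState:
    """Run searches until every planned sub-topic is covered or rounds run out."""
    state = ResearchState(question)
    for _ in range(max_rounds):
        # Identify information gaps: planned sub-topics with no finding yet.
        gaps = [topic for topic in plan if topic not in state.findings]
        if not gaps:
            break
        for topic in gaps:
            query = f"{question} {topic}"
            state.queries_run.append(query)
            result = search(query)
            if result is not None:
                state.findings[topic] = result
    return state


# Toy stand-in for a search backend (assumption: a real system queries the web).
CORPUS = {
    "pricing": "Note on pricing.",
    "competitors": "Note on competitors.",
}


def toy_search(query: str) -> Optional[str]:
    """Return a note if any known sub-topic appears in the query."""
    for topic, note in CORPUS.items():
        if topic in query:
            return note
    return None


if __name__ == "__main__":
    # The plan is an ordinary list, so a user could edit it before the run.
    plan = ["pricing", "competitors", "regulation"]
    state = research_loop("EV charging market", plan, toy_search)
    print(sorted(state.findings))
    print(len(state.queries_run), "queries issued")
```

Because the plan is plain data, the review step the article mentions maps naturally onto editing that list before the loop starts. The round limit stands in for whatever stopping rule a production agent would use when a gap cannot be filled.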
The introduction of the Deep Research Agent and its accompanying benchmark performance have significant implications for the AI industry, further escalating the competition between major players like Google and OpenAI. The nearly simultaneous release of Google's agent and OpenAI's GPT-5.2 highlights this fierce rivalry.[7] This competition is driving rapid innovation, pushing the industry towards agents that are not just conversationalists but capable assistants that can perform complex, knowledge-based work.[7] For developers, Google is making these advanced capabilities accessible through a new Interactions API, allowing them to integrate the Deep Research Agent into their own applications.[5][8] This could lead to a new wave of AI-powered tools for various industries, from finance to scientific research.[9][7] However, the release of a new benchmark by Google, on which its own model excels, also raises questions about the objectivity of evaluation metrics in the field.[10] As AI agents become more specialized, the lack of neutral, universally accepted benchmarks could make direct comparisons between different systems increasingly difficult.[10]
In conclusion, Google's Deep Research Agent represents a significant milestone in the development of autonomous AI systems. Its state-of-the-art performance on demanding benchmarks, particularly its lead over OpenAI's GPT-5 Pro on "Humanity's Last Exam," demonstrates a clear advancement in AI-driven research and reasoning. The underlying technology, which emphasizes iterative planning and deep web navigation, points to a future where AI agents can handle increasingly complex and time-consuming knowledge work. The agent's release through an API for developers is poised to spur innovation across various sectors. As the rivalry between major AI labs intensifies, the focus will likely remain on creating more capable and reliable agentic systems, a trend that promises to reshape how information is gathered, analyzed, and synthesized. The ongoing challenge will be to develop fair and comprehensive benchmarks to accurately measure the true capabilities of these rapidly evolving technologies.

Sources