OdysseyBench Reveals: Older AI o3 Outperforms Flagship GPT-5 on Complex Office Tasks

OpenAI's older, specialized o3 surprisingly beats flagship GPT-5 on complex office tasks, challenging assumptions about AI progress.

August 16, 2025

In a surprising result for the rapidly advancing field of artificial intelligence, a new, rigorous benchmark has revealed that OpenAI's flagship model, GPT-5, is consistently outperformed by its older, more specialized predecessor, o3, on complex office tasks. The findings, from a novel evaluation suite called OdysseyBench, are sending ripples through the AI community, suggesting that the race toward bigger, more generalized models may overlook the nuanced capabilities required for practical, real-world agentic workflows. The result challenges the long-held assumption that newer is always better in large language models and highlights a growing distinction between raw intelligence and functional capability in AI agents.
The OdysseyBench benchmark, developed by researchers at Microsoft and the University of Edinburgh, was designed to move beyond the limitations of existing AI evaluations.[1] For years, benchmarks have primarily tested models on atomic, self-contained tasks, which fail to capture the complexity of realistic work environments.[2] OdysseyBench introduces a more challenging paradigm: "long-horizon" workflows that unfold over simulated days and require an AI agent to interact with multiple applications such as Word, Excel, PDF clients, email, and calendars.[2] The benchmark is split into two parts: OdysseyBench+, which uses 300 tasks derived from real-world use cases, and OdysseyBench-Neo, which contains 302 newly synthesized and particularly complex scenarios.[1][2] In these tests, the AI does not just answer a single query; it must maintain context from extended dialogues, plan multi-step sequences, and fluidly coordinate actions across different software tools to achieve a final goal.[2]
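To make the task format concrete, here is a minimal sketch in Python of how a long-horizon, multi-application task of the kind OdysseyBench describes might be represented. The structure and every name in it (LongHorizonTask, DialogueTurn, apps, goal) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    """One message in the multi-day dialogue that carries task context."""
    day: int      # simulated day on which the message arrives
    speaker: str  # "user" or "agent"
    text: str

@dataclass
class LongHorizonTask:
    """Hypothetical stand-in for an OdysseyBench-style task record."""
    task_id: str
    apps: list[str]               # applications the agent must coordinate
    dialogue: list[DialogueTurn]  # context scattered across simulated days
    goal: str                     # final outcome a checker would verify

# Example: success requires recalling day-1 context and using three apps.
task = LongHorizonTask(
    task_id="neo-042",
    apps=["excel", "word", "email"],
    dialogue=[
        DialogueTurn(day=1, speaker="user",
                     text="Log this week's sales figures in the spreadsheet."),
        DialogueTurn(day=3, speaker="user",
                     text="Summarize the totals in a report and email it to Dana."),
    ],
    goal="Report emailed, with totals matching the spreadsheet entries.",
)
```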
The results from this demanding evaluation were unexpected. On the most difficult, hand-crafted tasks within OdysseyBench-Neo, OpenAI's o3 model achieved a success rate of 61.26%.[1] In contrast, the newer and more powerful GPT-5 scored 55.96%.[1] The performance gap became even more pronounced in tasks that demanded the simultaneous use of three different applications, where o3 succeeded 59.06% of the time compared to GPT-5's 53.80%.[1] A similar pattern emerged on the OdysseyBench+ tasks, with o3 scoring 56.2% to GPT-5's 54.0%.[1] This consistent advantage suggests that the architectural design of the o3 model is better suited for the specific challenges posed by multi-application, long-term agentic tasks, even though GPT-5 is considered a more advanced and generally capable model on many other standard industry benchmarks.
The reasons for this performance gap likely lie in fundamental architectural differences between the two models. The o3 model was purpose-built as a "reasoning model," designed to excel at tasks requiring deep, step-by-step logical thinking and autonomous tool use. Its architecture has been described as an "Orchestrated Optimization Architecture," which uses dynamic, task-specialized modules that allow it to deliberately plan and execute complex sequences. In essence, o3 was engineered around the core principles of an AI agent: the ability to perceive, plan, and act within a software environment to complete a goal. This specialized focus on reasoning and tool orchestration appears to give it a distinct edge in the intricate, multi-step workflows simulated by OdysseyBench.
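That perceive-plan-act cycle is a standard pattern in agent design and can be sketched generically. Everything in the sketch below, from the Environment interface to the plan_step callback, is a hypothetical illustration of the generic loop, not a description of OpenAI's actual implementation.

```python
from typing import Protocol

class Environment(Protocol):
    """Hypothetical interface to a simulated office environment."""
    def observe(self) -> str: ...               # current app state
    def execute(self, action: str) -> str: ...  # run one tool action
    def goal_reached(self) -> bool: ...

def agent_loop(env: Environment, plan_step, max_steps: int = 50) -> bool:
    """Generic perceive-plan-act loop; a sketch, not OpenAI's design.

    `plan_step` stands in for the model: it maps the latest observation
    and the accumulated history to the next action string.
    """
    history: list[tuple[str, str]] = []           # (action, result) pairs
    for _ in range(max_steps):
        observation = env.observe()               # perceive
        action = plan_step(observation, history)  # plan
        result = env.execute(action)              # act
        history.append((action, result))          # keep long-horizon context
        if env.goal_reached():
            return True
    return False                                  # ran out of steps
```

The point of keeping the full history inside the loop is exactly what OdysseyBench stresses: an agent that drops earlier dialogue or tool results cannot complete tasks whose instructions arrive days apart.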
GPT-5, on the other hand, represents a different evolutionary path. Released in August 2025, it is a unified flagship model that integrates advancements from the "o" series into a more generalized and powerful system.[3] Its hybrid, multi-model architecture employs a router that dynamically selects from various sub-models (such as "main," "thinking," or "nano") based on the perceived complexity of the user's prompt.[3] While this design makes GPT-5 incredibly versatile and efficient for a wide range of tasks, from creative writing to complex coding, it may be less optimized for the sustained, state-tracking demands of long-horizon agentic work.[3][4] The routing mechanism, designed for general efficiency, might not be as effective as o3's dedicated reasoning structure when a task requires unwavering focus and intricate planning across multiple software interfaces over an extended period.
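A toy version of that routing idea appears below. The tier names echo the sub-models mentioned above, but the complexity heuristic and dispatch thresholds are invented purely for illustration; a production router would presumably rely on a learned classifier rather than anything this simple.

```python
def estimate_complexity(prompt: str) -> float:
    """Toy heuristic: longer prompts with planning vocabulary score higher."""
    keywords = ("plan", "steps", "spreadsheet", "then", "across")
    hits = sum(word in prompt.lower() for word in keywords)
    return min(1.0, len(prompt) / 2000 + 0.15 * hits)

def route(prompt: str) -> str:
    """Dispatch to a sub-model tier based on estimated complexity."""
    score = estimate_complexity(prompt)
    if score < 0.2:
        return "nano"      # cheap tier for simple queries
    if score < 0.6:
        return "main"      # default general-purpose tier
    return "thinking"      # slow, deliberate reasoning tier

print(route("What's the capital of France?"))  # -> nano
print(route("Plan the quarterly report: pull figures from the "
            "spreadsheet, then draft and email it to the team."))  # -> main
```

The failure mode the benchmark results hint at falls out of this design: a router tuned for average-case efficiency can hand a long-horizon workflow to a tier that handles each individual step well but loses the thread across steps.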
The implications of the OdysseyBench results are significant for the future of AI development. They suggest a potential divergence in model evolution between all-purpose intelligent systems and highly specialized "agentic" models. While general models like GPT-5 will continue to improve on broad measures of knowledge and reasoning, the findings indicate that specialized architectures like o3's may be necessary to build reliable, effective AI agents for practical workplace automation. This challenges the "one-model-to-rule-them-all" narrative and suggests that the future of AI in the enterprise may involve a suite of models tailored to specific functions, orchestrated by a central routing system. For the industry, the results signal a need to refine benchmarks so they better reflect real-world utility, and to invest in architectures that prioritize task execution and reliability over sheer conversational intelligence. The surprising success of an older model on a sophisticated, next-generation benchmark is a potent reminder that in the quest for artificial general intelligence, the ability to "do" is just as important as the ability to "think."
