Claude Opus 4.5 Agent Achieves Record Near-Five-Hour Autonomous Task Horizon
The result of four hours and forty-nine minutes suggests agents can handle complex, multi-step software engineering tasks autonomously.
December 21, 2025

A new benchmark result from the AI research organization METR places Anthropic's Claude Opus 4.5 at the front of the field in autonomous, long-duration task completion. According to test results released by METR, the model achieved a "50 percent time horizon" of approximately four hours and forty-nine minutes.[1][2] The metric is designed to move beyond short, synthetic evaluations and offers an intuitive, human-centric way to measure an AI's ability to handle complex, real-world work that requires sustained reasoning and multi-step execution.[3][4]
The 50 percent time horizon is a metric developed by METR (Model Evaluation and Threat Research) and introduced in its paper "Measuring AI Ability to Complete Long Tasks."[5] It is the length of task, measured by how long a skilled human professional would typically take, that an AI model can complete successfully 50% of the time.[3][6] To establish the benchmark, the METR team uses a suite of tasks, largely focused on software engineering and related research, that range in human completion time from a few seconds up to 16 hours.[7] Human experts are timed performing the tasks to establish a baseline, and a curve of success probability against human task duration is then fitted for each AI model.[8][3] For Claude Opus 4.5, the result of four hours and forty-nine minutes is the highest published time horizon to date.[2] By comparison, an earlier leading model, Claude 3.7 Sonnet, had a 50% time horizon of around 59 minutes.[9] The jump underscores the rapid, non-linear progress being made in the agency and persistence of large language models.
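To make the curve-fitting step concrete, the following is a minimal sketch in Python, not METR's actual code. It assumes a two-parameter logistic model of success probability against the base-2 logarithm of human task time, fitted by plain gradient ascent. The task durations and pass/fail outcomes are invented purely for illustration, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(minutes, successes, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent.

    The log-durations are centered before fitting to keep the
    optimization well conditioned; the intercept is shifted back at the end.
    """
    xs = [math.log2(m) for m in minutes]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]
    a_c, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, successes):
            err = y - sigmoid(a_c + b * x)  # gradient of the log-likelihood
            grad_a += err
            grad_b += err * x
        a_c += lr * grad_a / n
        b += lr * grad_b / n
    return a_c - b * mean_x, b  # intercept on the uncentered scale, slope

def time_horizon(a, b, p=0.5):
    """Human task length (minutes) at which predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical data: human completion time per task and whether the model passed.
task_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960]
model_passed = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

a, b = fit_logistic(task_minutes, model_passed)
h50 = time_horizon(a, b, 0.5)
h80 = time_horizon(a, b, 0.8)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Because the fitted slope is negative (longer tasks are harder), the 80% horizon always comes out shorter than the 50% horizon for the same model.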
Anthropic released Claude Opus 4.5 in late November, describing it as its most advanced model, designed for complex, multi-step problems, coding, and agentic work.[10][11] Several of its features align directly with the requirements of long-horizon tasks, including a context window of 200,000 tokens by default, with special modes extending this to one million tokens.[11][12] That capacity allows the model to process entire codebases, lengthy documents, or multi-day conversation histories without losing track of crucial context. Anthropic also introduced a dynamic memory management system it calls an "endless chat" mechanism, which automatically compresses or summarizes older messages to free up context space when the limit is reached, so that long dialogues and workflows can continue uninterrupted.[11] These features underpin the model's performance in METR's testing, which assesses an AI's ability to maintain a consistent plan, reason through complex steps, and execute an extended sequence of decisions without human intervention. The company has stated that the model can confidently handle multi-day software development projects, delivering them in a matter of hours, a claim broadly in line with a time horizon of more than four hours.[4]
The implications of a nearly five-hour time horizon are significant for the AI industry and for the future of knowledge work. The metric focuses on the kinds of tasks that make up much of human work, as opposed to academic benchmarks that often test only a single step or a narrowly defined problem.[6][5] An AI agent that succeeds about half the time on tasks that take a human expert nearly five hours begins to move from the role of assistant toward that of autonomous contributor. This capability is especially relevant to software engineering, where Opus 4.5 has already achieved state-of-the-art results, including an industry-leading 80.9% on the SWE-Bench Verified coding benchmark.[11][4] It signals that sophisticated AI agents are approaching the threshold for independently managing substantive projects.
However, the METR results also introduce a critical nuance for evaluating a model's reliability in real-world deployment. While Claude Opus 4.5 sets a new record at the 50% success probability, its "80 percent time horizon" is much lower, at only 27 minutes.[2] This stricter metric, the human-equivalent task length the model can complete with 80% reliability, is similar to or even below that of some older rival models. The gap between the 50% and 80% horizons suggests that Opus 4.5 has a "flatter logistic success curve" than its predecessors: it succeeds disproportionately often on longer tasks, but its reliability still falls off well before those lengths.[2] For enterprise and mission-critical applications, where an 80% or 90% success rate is a minimum requirement, human oversight or further tuning of autonomous agents therefore remains crucial, despite the breakthrough in peak demonstrated capability.
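The relationship between the two horizons can be illustrated with a short calculation. The sketch below assumes, for illustration only, that the success curve is a simple two-parameter logistic in the base-2 logarithm of task length, in which case the two reported figures (289 minutes at 50%, 27 minutes at 80%) are enough to pin down its slope. Any other value it produces, such as a 90% horizon, is an extrapolation under that assumption and not a figure reported by METR; the function names are hypothetical.

```python
import math

H50_MINUTES = 4 * 60 + 49  # 289 minutes, the reported 50% horizon
H80_MINUTES = 27           # the reported 80% horizon

def logit(p):
    return math.log(p / (1.0 - p))

# Assumed model: P(success) = sigmoid(intercept + slope * log2(minutes)).
# Two known points on the curve determine both parameters.
slope = (logit(0.8) - logit(0.5)) / (
    math.log2(H80_MINUTES) - math.log2(H50_MINUTES)
)
intercept = -slope * math.log2(H50_MINUTES)  # logit(0.5) is zero

def horizon_minutes(p):
    """Task length (human minutes) at which the assumed curve predicts p."""
    return 2.0 ** ((logit(p) - intercept) / slope)

def success_probability(minutes):
    """Predicted success rate on a task of the given human length."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * math.log2(minutes))))

print(f"implied slope per doubling of task length: {slope:.3f}")
print(f"extrapolated 90% horizon: {horizon_minutes(0.9):.1f} min")
print(f"predicted success on a 60-minute task: {success_probability(60):.2f}")
```

A slope closer to zero means a flatter curve: the 50% horizon stretches far out while high-reliability horizons stay short, which is the pattern the article describes.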
METR's research has consistently found an exponential trend in AI's ability to complete longer tasks.[6] Over the last six years, the 50% time horizon has doubled approximately every seven months, and some analysis suggests the pace may be accelerating.[7][5] The leap to nearly five hours with Claude Opus 4.5 is a dramatic data point on this curve. Extrapolating the trend, even with the inherent uncertainties of forecasting, suggests that generalist autonomous agents capable of performing a wide range of tasks that currently take humans days or even weeks may arrive in the near future.[5] Anthropic's result indicates that the industry is on a path toward highly autonomous AI, raising the urgency for stakeholders to consider the ethical and practical implications of systems that can sustain complex, multi-hour or even multi-day agency.
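The arithmetic behind such an extrapolation is simple to sketch. The example below assumes the seven-month doubling time holds unchanged and starts from the 289-minute figure. Both the constant rate and the function names are assumptions made for illustration, so the output is a naive projection and not a forecast.

```python
import math

DOUBLING_MONTHS = 7.0       # doubling time cited in the article
START_MINUTES = 4 * 60 + 49  # 289 minutes, the reported 50% horizon

def projected_horizon(months_ahead, start=START_MINUTES, doubling=DOUBLING_MONTHS):
    """50% horizon in minutes after months_ahead, assuming a constant doubling time."""
    return start * 2.0 ** (months_ahead / doubling)

def months_until(target_minutes, start=START_MINUTES, doubling=DOUBLING_MONTHS):
    """Months until the projected horizon reaches target_minutes."""
    return doubling * math.log2(target_minutes / start)

for months in (7, 14, 21):
    hours = projected_horizon(months) / 60
    print(f"+{months} months: about {hours:.1f} hours")

# A 40-hour human work week expressed in minutes.
print(f"40-hour horizon reached after about {months_until(40 * 60):.0f} months")
```

Under these assumptions the horizon reaches a full 40-hour work week in a little under two years, and a faster doubling time would shorten that further.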
Sources
[5]
[7]
[8]
[10]
[11]
[12]