Anthropic’s Opus 4.6 Claims Top AI Benchmark; OpenAI Immediately Counters with GPT-5.3-Codex
Anthropic’s new high-water mark battles a premium price tag and OpenAI’s competing self-improving agent.
February 8, 2026

Anthropic’s Claude Opus 4.6 has seized the top position on the influential Artificial Analysis Intelligence Index, dethroning its chief rival, at least for now, and setting a new high-water mark for large language model capability. The new model scored 53 points on version 4.0 of the Index, a two-point lead over OpenAI’s formidable GPT-5.2, which registered 51. The result is a significant technical milestone for Anthropic, underscoring its relentless pursuit of frontier intelligence and putting it squarely in the lead on publicly tested model performance.[1] However, this victory is tempered by the model’s premium cost and the almost immediate counter-launch of an advanced competitor, ensuring the top spot on the leaderboard remains an ephemeral prize.
The Artificial Analysis Intelligence Index is a composite score that synthesizes performance across ten rigorous evaluations designed to measure diverse aspects of high-level AI capability, moving far beyond simple language generation tasks.[1][2] These benchmarks include GDPval-AA, which assesses performance on economically valuable knowledge work in domains like finance and law, and Humanity’s Last Exam, a complex, multidisciplinary reasoning test.[1][3] Opus 4.6’s superior score reflects advancements in its agentic capabilities, planning, and long-context comprehension, making it particularly adept at sustained, multi-step professional tasks.[4] For instance, Anthropic claims the new model outperforms GPT-5.2 by approximately 144 Elo points on GDPval-AA, a marked improvement in real-world knowledge work automation.[3] The model also achieved an industry-leading score on Terminal-Bench 2.0, a crucial agentic coding evaluation that tests a model’s ability to execute complex tasks using shell access and web browsing within an agentic loop.[4][5] This suggests a growing maturity in AI systems’ ability to handle end-to-end workflows that require intricate reasoning and tool use, moving the technology closer to functioning as a capable digital co-worker.
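A 144-point Elo gap has a concrete interpretation. Under the standard Elo expected-score formula (an assumption here, since the article does not detail the rating methodology behind GDPval-AA), it corresponds to the higher-rated model being preferred in roughly 70 percent of head-to-head comparisons. A minimal sketch:

```python
# Standard Elo expected-score formula: E = 1 / (1 + 10 ** (-diff / 400)).
# Assumes GDPval-AA ratings follow the conventional 400-point Elo scale;
# the benchmark's actual methodology is not spelled out in public reporting.
def elo_expected_score(rating_diff: float) -> float:
    """Expected head-to-head win rate for the higher-rated model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

print(f"{elo_expected_score(144):.1%}")  # -> 69.6%
```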
The celebration of technical leadership is immediately complicated, however, by the economics of running a top-tier large language model. Claude Opus 4.6 is positioned as the most capable, but also the most expensive, model currently available.[1] Pricing starts at $5 per million input tokens and $25 per million output tokens, well above the rates of key competitors.[6][4] For context, the input cost is nearly three times that of the previous-generation Claude Opus 4.5, and the output cost is more than double that of Google’s Gemini 3 Pro.[7] While Anthropic offers potential savings through prompt caching and batch processing, the high base rate presents a substantial barrier to entry for many developers and enterprises.[4] The elevated cost structure forces a cost-benefit analysis on users: the marginal increase in intelligence must translate into a clear, measurable gain in productivity or performance that justifies the higher operational expenditure. For high-stakes, specialized applications like advanced legal reasoning or complex software development, where a small error can incur massive costs, the premium may be justified by Opus 4.6’s superior accuracy and consistency across large projects.[4]
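To make that calculus concrete, here is a back-of-the-envelope sketch at the quoted list prices. The workload (a hypothetical long-context agentic task consuming 200,000 input tokens and emitting 20,000 output tokens) is an illustrative assumption, not published usage data:

```python
# Claude Opus 4.6 list prices cited above, in USD per million tokens.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the quoted per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

# Hypothetical agentic coding task: 200k tokens in, 20k tokens out.
per_task = request_cost_usd(200_000, 20_000)
print(f"${per_task:.2f} per task")                    # $1.50 per task
print(f"${per_task * 1_000:,.2f} per 1,000 tasks")    # $1,500.00 per 1,000 tasks
```

Prompt caching and batch processing, which Anthropic offers, would lower these figures, but the base-rate arithmetic is the starting point for any enterprise deciding whether the extra two Index points pay for themselves.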
The immediate threat to Anthropic’s hard-won position comes from the simultaneous release of OpenAI’s new agentic model, GPT-5.3-Codex. Although early reporting framed the model as "waiting in the wings," OpenAI officially launched GPT-5.3-Codex at the same time as, or very shortly after, the Claude Opus 4.6 announcement, underscoring the neck-and-neck competition between the two AI labs.[8] OpenAI positions GPT-5.3-Codex as its most capable agentic coding model yet, explicitly designed to go beyond simple code generation and function as a general-purpose agent capable of operating a computer and handling complete professional workflows.[8] The model is reported to be an upgrade of GPT-5.2-Codex, blending its specialized coding strength with the broader reasoning capabilities of the general GPT-5.2 model.[9] OpenAI also claims a significant efficiency gain, saying the new model is 25 percent faster and uses fewer tokens than its predecessor, a direct challenge to Anthropic on the cost-efficiency front where it is currently weakest.[8]
Beyond performance metrics and cost, GPT-5.3-Codex introduces a paradigm-shifting concept: self-improvement. OpenAI has publicly stated that the model was "instrumental in creating itself," assisting engineers with crucial tasks such as debugging its own training and managing deployments.[8][10] If scalable and reliable, this recursive self-improvement capability represents a fundamental leap in AI development, potentially accelerating the model’s future growth beyond the pace of human-driven iteration. The agentic focus is another key factor: the model leads on benchmarks like SWE-Bench Pro, Terminal-Bench, and OSWorld, which are critical for software engineering and computer-use automation.[11] While Opus 4.6 currently holds the overall Intelligence Index crown, the targeted nature of GPT-5.3-Codex’s coding and agentic performance, combined with a reportedly improved cost structure and the promise of a self-evolving system, suggests the technical lead may shift rapidly. This sets the stage for a new phase of the frontier AI race, in which the winning model will be not merely the smartest but the one that best pairs superior performance with compelling economics for enterprise adoption. The dueling releases confirm that the industry has entered an era of constant, high-speed iteration, in which a model’s reign at the top of the leaderboard may be measured in days or weeks rather than months.