xAI's Grok 4 Surpasses GPT-5 in Critical AGI Reasoning Test

Grok 4 leads in complex AGI reasoning, yet GPT-5's efficiency highlights the cost dilemma shaping AI's future.

August 7, 2025

In a significant development for the artificial intelligence industry, xAI's new model, Grok 4, has surpassed OpenAI's highly anticipated GPT-5 in a key benchmark designed to measure complex reasoning abilities. The Abstraction and Reasoning Corpus (ARC) AGI benchmark, particularly its second iteration, ARC-AGI-2, has become a critical test for evaluating an AI's capacity for generalization and problem-solving on novel tasks, a core tenet of artificial general intelligence.[1][2][3] The results show Grok 4 achieving a higher score than GPT-5, signaling a notable shift in the competitive landscape of AI development, though a closer look at the data reveals a nuanced picture of performance versus cost.
The ARC-AGI benchmark, created by François Chollet, is specifically designed to test fluid intelligence, the ability to adapt and learn in new situations, rather than to reward rote memorization.[1][2] It presents AI models with abstract visual puzzles that are easy for humans to solve but have proven exceptionally difficult for machines, highlighting gaps in their reasoning and adaptability.[3][4] The second version, ARC-AGI-2, raises the difficulty: most advanced AI systems score in the single digits, while every task has been solved by human testers in two attempts or fewer.[1][4] This makes performance on ARC-AGI-2 a closely watched indicator of progress toward more generalized AI capabilities.
According to results released by ARC Prize, Grok 4 (Thinking) achieved a score of approximately 16 percent on the challenging ARC-AGI-2 benchmark.[5] This is a significant leap, nearly doubling the scores of previous leading models such as Anthropic's Claude Opus 4, which scored around 8.6 percent.[6][7] The ARC Prize organization verified Grok 4's score of 15.9 percent on a hidden evaluation set, confirming it as the new state of the art for closed models.[7][8] However, this superior reasoning performance comes at a considerable cost, estimated at $2 to $4 per task.[5] In contrast, GPT-5 (High) scored 9.9 percent on the same benchmark at a much lower cost of $0.73 per task, making it a more efficient, if less capable, alternative on this specific metric.[5]
On the less demanding ARC-AGI-1 test, the competition between the two models remains tight. Grok 4 again took the lead with a score of about 68 percent, narrowly beating GPT-5's 65.7 percent.[5] The cost differential persists here as well: Grok 4's tasks cost around $1 each, while GPT-5 achieved its comparable performance for just $0.51 per task.[5] This efficiency has significant implications for the practical application and scalability of these models. OpenAI also offers lighter versions of its model, with GPT-5 Mini scoring 54.3 percent on ARC-AGI-1 and 4.4 percent on ARC-AGI-2, and GPT-5 Nano achieving 16.5 percent and 2.5 percent respectively, both at a fraction of the cost.[5] These variants trade performance for lower expense, catering to different user needs.
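One rough way to read the performance-versus-cost trade-off is to divide each reported score by its reported cost per task. The short Python sketch below does that for the Grok 4 and GPT-5 (High) figures cited above. The "score points per dollar" ratio and the names `Result` and `points_per_dollar` are illustrative assumptions, not a metric published by ARC Prize, and the Grok 4 inputs for ARC-AGI-1 are the article's approximate values ("about 68 percent", "around $1").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One reported benchmark result (figures as cited in the article)."""
    model: str
    benchmark: str
    score_pct: float   # reported score, in percent
    cost_low: float    # lower bound of reported cost per task, USD
    cost_high: float   # upper bound of reported cost per task, USD


def points_per_dollar(r: Result) -> tuple[float, float]:
    """Return (worst case, best case) score points per dollar per task.

    Illustrative ratio only: a higher value means more benchmark score
    for each dollar spent on a single task.
    """
    return (r.score_pct / r.cost_high, r.score_pct / r.cost_low)


RESULTS = [
    Result("Grok 4 (Thinking)", "ARC-AGI-2", 15.9, 2.00, 4.00),
    Result("GPT-5 (High)", "ARC-AGI-2", 9.9, 0.73, 0.73),
    Result("Grok 4 (Thinking)", "ARC-AGI-1", 68.0, 1.00, 1.00),  # approximate
    Result("GPT-5 (High)", "ARC-AGI-1", 65.7, 0.51, 0.51),
]

for r in RESULTS:
    worst, best = points_per_dollar(r)
    print(f"{r.benchmark:<10} {r.model:<18} {worst:7.2f} to {best:7.2f} points per dollar")
```

By this ratio, GPT-5 (High) comes out ahead on both benchmarks (roughly 13.6 points per dollar against 4 to 8 for Grok 4 on ARC-AGI-2), even though Grok 4 posts the higher absolute scores.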
Grok 4's strong showing points to a different architectural approach. Its "Heavy" variant in particular uses a multi-agent system that considers multiple hypotheses in parallel, a design that appears to pay off on complex, logic-heavy reasoning tasks.[6][7] In this setup, different AI agents collaborate on a problem, which contributes to higher accuracy on difficult benchmarks.[9] OpenAI's GPT-5 also includes a reasoning model designed to break down complex problems and is touted as a significant upgrade with improved reliability and coding skills, but its performance on this specific benchmark suggests a different balance between raw reasoning power and operational efficiency.[10][11][12] The ARC-AGI results highlight that the path to AGI is not just about achieving higher scores but also about the economic and computational feasibility of these powerful systems, a factor that will continue to shape the future of AI development.[1][13]

Sources