AI Writes Code, Passes Exams, But Can't Read Analog Clocks
AI that writes code and passes exams is baffled by analog clocks, exposing a surprising gap in visual reasoning.
September 14, 2025

In an era when artificial intelligence models can compose music, write complex code, and even pass notoriously difficult professional exams, a new wave of research has exposed a surprisingly mundane yet significant blind spot: the inability to reliably read an analog clock. This fundamental gap in visual and spatial reasoning has been starkly highlighted by recent studies, which reveal that even the most advanced AI systems are consistently baffled by a task most human children master by the age of eight. The findings challenge the narrative of rapidly accelerating AI capabilities and point to deeper architectural limitations with profound implications for the future of artificial intelligence and its integration into our daily lives.
A groundbreaking study has put a precise number on this deficiency, showing a vast chasm between human and machine performance. The research, centered on a novel visual reasoning benchmark called ClockBench, found that while humans read analog clocks with an average accuracy of 89.1%, the best of the eleven leading AI systems tested scored just 13.3%.[1][2] This staggering difference underscores that, for all their computational power, today's large language models lack a fundamental component of real-world understanding. The ClockBench test, created by serial entrepreneur and AI benchmark developer Alek Safar, was specifically designed to be "easy for humans, hard for AI," and its results are a sobering reality check for the AI industry.[3][1] The benchmark is not a simple image-recognition task; it is a comprehensive test of visual reasoning involving 180 custom-built clock faces and 720 distinct questions designed to probe an AI's understanding of time in multiple dimensions.[3][1]
The difficulty for AI models lies in the multi-layered reasoning required to interpret a clock face. The ClockBench test doesn't just ask, "What time is it?" It presents a variety of challenges, including performing time-based calculations, mentally rotating the clock hands by specific angles, and converting time zones based on the visual information provided.[1][4] The clocks themselves feature a wide range of designs intended to prevent the models from relying on familiar patterns in their training data, including a mix of Roman and Arabic numerals, different orientations, unusual hour markers, mirrored layouts, and colorful backgrounds.[1] The study found that certain features were particularly challenging, with accuracy dropping significantly for clocks with Roman numerals or a prominent second hand.[3] Google's Gemini 2.5 Pro was the top performer, yet still managed only 13.3% accuracy, followed by its sibling model Gemini Flash at 10.5%.[3] Strikingly, other highly touted models performed even worse: OpenAI's GPT-5 scored 8.4%, and the Grok 4 model achieved a near-zero accuracy of just 0.7%.[3]
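To make those question types concrete, the arithmetic behind them is straightforward once a time has actually been read off the dial. The following is a minimal Python sketch, not ClockBench's actual harness; the function names and exact question formats are illustrative assumptions:

```python
from datetime import datetime, timedelta

DEG_PER_MINUTE = 360 / 60  # the minute hand sweeps 6 degrees per minute

def rotate_minute_hand(t: datetime, degrees: float) -> datetime:
    """Advance the time by rotating the minute hand clockwise by `degrees`."""
    return t + timedelta(minutes=degrees / DEG_PER_MINUTE)

def convert_time_zone(t: datetime, offset_hours: int) -> datetime:
    """Shift a clock reading into another time zone by a whole-hour offset."""
    return t + timedelta(hours=offset_hours)

reading = datetime(2025, 9, 14, 3, 45)
print(rotate_minute_hand(reading, 90).time())  # 90 deg = 15 min -> 04:00:00
print(convert_time_zone(reading, -5).time())   # e.g. shift to UTC-5 -> 22:45:00
```

Every constant here follows directly from the dial's geometry; the downstream math needs almost no machinery at all.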
This poor performance is not an isolated finding but rather corroborates a growing body of research into the limitations of AI's spatial and temporal reasoning. A separate study from the University of Edinburgh, presented at the 2025 International Conference on Learning Representations (ICLR), came to a similar conclusion.[5][6] This research team tested a variety of leading multimodal large language models, including earlier versions of Gemini, Claude, and GPT-4o, and found they could correctly read a clock in an image only 38.7% of the time and interpret a calendar with just 26.3% accuracy.[5] The Edinburgh researchers concluded that understanding analog clocks requires a combination of spatial awareness, context, and basic mathematics that remains a formidable challenge for AI.[7][8] The models struggled with detecting overlapping hands, measuring angles, and navigating diverse designs, suggesting deep-seated issues with how they process visual information that requires logical inference.[5][7]
Experts theorize that this failure stems from the fundamental way current AI models "think." Unlike humans, who learn the rules and logic of telling time, large language models operate primarily on pattern recognition, predicting outputs based on the vast datasets they were trained on.[5][6] Reading a clock isn't just about recognizing a shape; it's about understanding the spatial relationship between the hands, the markers on the dial, and the abstract system of time they represent. This is reasoning within the visual space, a task that proves far more difficult for AI than reasoning in the text-based domain, where these models have shown so much success.[4] Interestingly, the ClockBench study noted a surprising twist: when a model did manage to correctly read the time, it was almost always successful in performing the subsequent calculations.[1] This indicates the core problem is not math but the initial act of visual interpretation: translating the positions of hands on a circular dial into a coherent, structured understanding of a specific time.
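For comparison, the decoding rule itself fits in a few lines. Here is a minimal sketch, assuming the hand angles have already been extracted from the image (measured clockwise from twelve o'clock), which is precisely the step the models fail at; the function and its tolerance are hypothetical illustrations, not code from either study:

```python
def time_from_angles(hour_deg: float, minute_deg: float) -> tuple[int, int]:
    """Decode a pair of hand angles into (hours, minutes).

    Angles are measured clockwise from twelve o'clock.
    """
    minutes = round(minute_deg / 6) % 60  # minute hand: 6 degrees per minute
    hours = int(hour_deg // 30) % 12      # hour hand: 30 degrees per hour
    # The hour hand also drifts 0.5 degrees per elapsed minute, so its
    # position within the hour must agree with the minute hand.
    if abs((hour_deg % 30) - minutes * 0.5) > 3:
        raise ValueError("hour and minute hands are inconsistent")
    return (hours or 12), minutes

# 3:15 -> hour hand at 3*30 + 15*0.5 = 97.5 deg, minute hand at 90 deg
print(time_from_angles(97.5, 90.0))  # (3, 15)
```

Once the angles exist as numbers, deterministic geometry recovers the time; the benchmark's finding is that getting from pixels to those numbers is where the models break down.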
The implications of this "clock blindness" extend far beyond time-telling. It serves as a potent symbol of the current gap between AI's performance on standardized benchmarks and its readiness for real-world applications that require robust, reliable spatial and temporal awareness.[5][7] Fields like autonomous robotics, smart assistants for scheduling, and assistive technologies for the visually impaired all depend on an accurate understanding of time and the physical world.[8] The inability to master a seemingly simple, rule-based task like reading a clock raises serious questions about the reliability of these systems in more complex, time-sensitive scenarios. This research highlights the urgent need for new AI development approaches that move beyond simply scaling up existing models and training data. It suggests that true artificial general intelligence, or AGI, will require novel architectures capable of the kind of spatial scanning and integrated reasoning that humans use instinctively. Benchmarks like ClockBench, while focused on a simple task, are therefore crucial for the industry, as they expose these fundamental weaknesses and provide a clear target for the next generation of AI research to overcome.