The "Curse of Knowledge" Blinds Advanced AI to Student Struggles.
AI can ace the exam, but its failure to perceive cognitive difficulty limits its use in personalized education.
January 4, 2026

The rapid advance of large language models has created a paradoxical problem for the future of educational and cognitive AI systems: the "curse of knowledge." This established human cognitive bias, which causes experts to overestimate the prior knowledge of novices, is now manifesting in sophisticated AI, creating a fundamental blind spot that prevents even the most advanced models from accurately identifying where human learners will struggle. A recent large-scale empirical analysis of over 20 large language models (LLMs) across diverse educational domains has exposed this systematic misalignment, demonstrating that an AI's superior problem-solving capability often comes at the cost of its ability to perceive human cognitive difficulty.
The curse of knowledge, a term coined in 1989 by the economists Colin Camerer, George Loewenstein, and Martin Weber, describes the inherent difficulty an informed individual has in taking the perspective of someone who is uninformed, leading the expert to assume others share their background knowledge.[1][2] In human terms, a college professor may struggle to explain a basic concept because they have lost the ability to recall what it was like to be a novice, or a design expert might create a user interface that is intuitive to them but baffling to a first-time user.[1][3] The underlying psychological mechanism relates to hindsight bias: once an outcome or piece of knowledge is known, it seems more predictable, and the memory of one's own prior, less knowledgeable state is obscured.[1] This phenomenon, which resists correction even when people are financially incentivized or explicitly warned about the bias, is now being mirrored and amplified in artificial intelligence.[1]
The new study, which rigorously tested LLMs' ability to predict where human students struggle, found a crucial disconnect between a model's problem-solving accuracy and its difficulty-perception accuracy. The research leveraged real student field-testing data, using questions from high-stakes exams such as the USMLE (medical knowledge), Cambridge (linguistic proficiency), and standardized tests like the SAT Reading & Writing and SAT Math.[4] The difficulty level of these items was established by the performance of hundreds of real students, providing a reliable "ground truth" for human struggle.[4] The findings revealed that scaling up model size, the conventional pathway to improved AI performance, does not reliably improve a model's alignment with human perceptions of difficulty.[4][5] Instead of becoming more human-like in their assessments, the most capable LLMs, including GPT-5 and GPT-4o, converged toward a "cohesive machine consensus" that systematically diverged from the measured human reality.[4][5] For example, models exhibited the highest alignment on SAT Math but the lowest in the USMLE medical domain, suggesting that the complexity and domain of the task play a role in the misalignment.[4]
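The coverage does not reproduce the study's exact alignment metric, but one standard way to quantify this kind of agreement is a rank correlation between the difficulty a model assigns to each item and the empirical difficulty observed in field testing. The sketch below illustrates that idea under that assumption; the function names and toy data are hypothetical, not the study's code.

```python
# Illustrative sketch (not the study's code): quantify how well a model's
# difficulty estimates align with empirical difficulty from student data.
# Assumes a 0/1 correctness matrix from field testing and one model-assigned
# difficulty rating per item; all names and numbers here are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def empirical_difficulty(responses: np.ndarray) -> np.ndarray:
    """Empirical item difficulty = 1 - proportion of students answering correctly.

    `responses` is a (num_students, num_items) 0/1 matrix.
    """
    return 1.0 - responses.mean(axis=0)

def difficulty_alignment(model_ratings: np.ndarray, responses: np.ndarray) -> float:
    """Spearman rank correlation between model-rated and empirical difficulty."""
    rho, _ = spearmanr(model_ratings, empirical_difficulty(responses))
    return rho

# Toy example: 5 items, 200 simulated students.
rng = np.random.default_rng(0)
true_difficulty = np.array([0.2, 0.35, 0.5, 0.7, 0.9])
responses = (rng.random((200, 5)) > true_difficulty).astype(int)
model_ratings = np.array([0.3, 0.3, 0.4, 0.8, 0.85])  # hypothetical LLM output
print(f"alignment (Spearman rho): {difficulty_alignment(model_ratings, responses):.2f}")
```

A high rank correlation would mean the model orders items by difficulty roughly the way real students experience them; the study's core finding is that stronger problem-solving does not guarantee a higher value on this kind of measure.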
One of the study’s most revealing findings concerned the models’ failure to simulate a novice’s mind, which is a key method humans use to overcome the curse of knowledge. The researchers attempted to mitigate the bias by prompting the models to act as external observers predicting student difficulty or as internal actors simulating lower-proficiency reasoning.[4] Even with explicit instructions to adopt a specific, lower-proficiency level, the advanced models struggled to accurately mimic the capability limitations of students.[5] This inability to reconstruct or simulate a less-informed state of mind represents a profound limitation in the models’ cognitive architecture, suggesting that their knowledge is not stored or accessed in a way that allows for dynamic, internal introspection or an "unlearning" of facts to predict a knowledge gap.[5][6] Furthermore, the study identified a critical lack of introspection, noting that the models failed to predict their *own* limitations on challenging questions.[5]
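The exact prompts are not quoted in the reporting; the minimal sketch below illustrates the two framings described, external observer versus internal actor, with hypothetical wording. In a setup like the study's, each framing's outputs would then be scored against the measured student data.

```python
# Minimal sketch of the two prompting strategies described above; the wording
# is hypothetical, not the study's actual prompts.
def observer_prompt(question: str) -> str:
    """External observer: predict how hard students will find the item."""
    return (
        "You are an experienced exam designer. Read the question below and "
        "estimate, on a scale from 0 (very easy) to 1 (very hard), how "
        "difficult typical test-takers will find it. Explain the main "
        f"conceptual hurdles.\n\nQuestion:\n{question}"
    )

def actor_prompt(question: str, proficiency: str = "a novice student") -> str:
    """Internal actor: answer as if limited to a lower proficiency level."""
    return (
        f"Answer the question below exactly as {proficiency} would, using only "
        "the knowledge and reasoning such a student plausibly has. It is fine "
        f"to make the mistakes that student would make.\n\nQuestion:\n{question}"
    )
```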
The implications of this AI-specific curse of knowledge are significant, particularly for the burgeoning educational technology (EdTech) sector. Accurate item difficulty estimation is vital for applications like adaptive testing, which adjusts question difficulty in real time based on a student's performance, and for the design of personalized curricula.[4] If an AI cannot reliably gauge what a human finds difficult, its ability to act as an effective, empathetic teacher is severely compromised. A model may ace an exam, demonstrating an impressive command of the material, but its inability to pinpoint the common conceptual hurdles or tricky distractors in a question makes it an unsuitable tool for scaffolding human learning.[4][5] This gap suggests that general problem-solving capability, a major focus of current LLM evaluation, does not equate to an understanding of human cognitive struggles, and it is a challenge that simple scaling or prompt engineering has not yet overcome.[5] For AI to transition from a powerful answer engine to a truly effective pedagogical partner, future research must move beyond accuracy metrics toward psychometric modeling and architectures capable of genuinely simulating different, lower-knowledge states of mind, thereby breaking the machine consensus and aligning AI perception with human reality.[5][7]
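To make the stakes concrete, here is a minimal sketch of the psychometric machinery adaptive testing leans on: a one-parameter (Rasch) item response theory model, in which the probability of a correct answer depends on the gap between student ability and item difficulty. The parameter values and item names are made up for illustration; this is not the study's model.

```python
# Illustrative Rasch (1PL IRT) sketch: adaptive testing picks the item that is
# most informative at the current ability estimate. Values are hypothetical.
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability: float, difficulty: float) -> float:
    """Fisher information of an item at the given ability: p * (1 - p)."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)

def pick_next_item(ability: float, item_difficulties: dict[str, float]) -> str:
    """Choose the item most informative at the current ability estimate,
    i.e. the one whose difficulty sits closest to that ability."""
    return max(item_difficulties, key=lambda k: item_information(ability, item_difficulties[k]))

# Toy item bank: difficulties on the same logit scale as ability.
bank = {"q1": -1.5, "q2": -0.2, "q3": 0.4, "q4": 1.8}
print(pick_next_item(ability=0.3, item_difficulties=bank))  # -> "q3"
```

The sketch also shows why miscalibrated difficulty estimates matter in practice: if the values in the item bank come from a model that misjudges what students find hard, the "most informative" next item it selects will routinely be the wrong one.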