Frontier AI models lose 33 percent of reasoning accuracy during extended conversation threads
New research shows frontier AI models lose 33 percent of their reasoning accuracy as conversations grow long and complex.
February 28, 2026

The promise of an artificial intelligence that can maintain perfect coherence across days of collaboration or thousands of pages of text has long been a primary goal for developers of frontier large language models. However, new research into the latest generation of AI, including flagship systems like GPT-5.2 and Claude 4.6, reveals a persistent and troubling limitation known as "contextual decay." Despite the introduction of massive context windows that theoretically allow these models to process millions of tokens at once, their actual reasoning accuracy tends to plummet as a conversation progresses. Recent industry benchmarks indicate that even these most advanced frontier models lose up to 33 percent of their accuracy when a dialogue extends beyond a few dozen exchanges, a phenomenon that challenges the enterprise-grade reliability many organizations have come to expect.
The core of this performance degradation lies in the gap between a model's ability to see information and its ability to prioritize it correctly. In short single-turn interactions, models like GPT-5.2 show near-perfect performance on standard logic and retrieval benchmarks, often reaching success rates as high as 90 percent. As the conversation history grows, however, the success rate on the same kinds of tasks frequently drops to approximately 65 percent.[1] This 25-percentage-point decline, roughly a one-third loss in total accuracy, suggests that the "working memory" of an AI is not a static repository but a fluid and increasingly noisy environment. The research highlights that the sheer volume of a conversation acts as a form of interference: the model begins to struggle to distinguish its primary instructions from the cumulative weight of its own previous responses.[2]
This specific failure mode has been dubbed the "lost in conversation" effect by researchers at Microsoft and Salesforce, who recently analyzed over 200,000 simulated multi-turn dialogues.[1] Their findings suggest that the models are not merely forgetting information but are actively being led astray by the structure of long-form dialogue.[1] One of the primary technical drivers is the reinforcement of early errors.[2] If a model makes a minor incorrect assumption or provides a slightly flawed piece of data in the third turn of a forty-turn conversation, it tends to treat that error as an immutable truth in every subsequent turn. Rather than self-correcting when the user supplies new, clarifying information, frontier models often double down on their initial misunderstandings.[2] This creates a compounding effect in which unreliability can more than double over the course of an hour-long debugging session or a complex research task.
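The compounding described above can be illustrated with a toy reliability model. This is an assumption for illustration, not a calculation from the cited study: suppose each turn carries a small, independent chance of introducing an error that, once in the context, is never retracted.

```python
# Toy model of error compounding in multi-turn dialogue (an illustrative
# assumption, not the methodology of the cited research): each turn has a
# small independent chance of introducing an error that then persists.
def cumulative_unreliability(per_turn_error: float, turns: int) -> float:
    """Probability that at least one uncorrected error has entered
    the conversation by the given turn."""
    return 1.0 - (1.0 - per_turn_error) ** turns

early = cumulative_unreliability(0.02, 5)    # short session
late = cumulative_unreliability(0.02, 40)    # long session

# With a 2% per-turn error rate, cumulative unreliability more than
# doubles between turn 5 and turn 40, because errors are never undone.
print(f"turn 5:  {early:.1%}")
print(f"turn 40: {late:.1%}")
```

Even a tiny, constant per-turn error rate produces the kind of runaway unreliability the study describes, because the model carries every mistake forward.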
Furthermore, the shape of the AI's output changes as a session wears on. As a dialogue continues, the average response length from models like Claude 4.6 and GPT-5.2 can triple, even when the user's prompts remain concise. This increase in verbosity is often a precursor to a total failure in reasoning. When a model becomes more verbose, it often buries the correct answer under layers of filler or, worse, begins to hallucinate details to fill the expanded space it has created for itself.[2] This suggests that the attention mechanisms used by these models, while architecturally capable of handling millions of tokens, suffer from a form of signal dilution: the more tokens the model must attend to, the less weight it assigns to each individual token, effectively thinning out the "focus" required for high-stakes reasoning.
The implications for the AI industry are significant, particularly for the burgeoning sector of autonomous agents and long-horizon planning tools. If a coding agent or a legal analysis tool loses a third of its accuracy after thirty minutes of interaction, the productivity gains of the technology are severely undercut by the need for constant human oversight and manual verification. Companies are finding that the "context window war," the race to offer the largest possible input capacity, is hitting a wall of diminishing returns. Anthropic’s Claude 4.6, for instance, offers a 1-million-token window, 2.5 times larger than GPT-5.2’s 400,000 tokens. Yet while Claude 4.6 maintains higher retrieval accuracy at these extreme scales, it is still subject to the same fundamental decay in reasoning as the conversation deepens. This suggests that the problem is not a lack of memory space, but a lack of sophisticated memory management.
Architecturally, the industry is now forced to reconsider how it handles the history of a chat. Many developers are beginning to implement aggressive summarization and "context pruning" techniques to keep the AI focused. Instead of feeding the entire, raw conversation back into the model for every new turn, systems are being designed to distill previous turns into key facts and constraints. This mimics human memory more closely; we do not recall every word of a three-hour meeting, but rather the core decisions and action items. However, even these strategies have drawbacks, as the act of summarization is itself an AI-driven task prone to the same 33 percent accuracy loss, potentially creating a feedback loop of misinformation.
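A minimal sketch of the pruning approach described above: keep the most recent turns verbatim and distill everything older into a single summary turn. Here `summarize` is a hypothetical placeholder; in a real system it would itself be a model call, which is precisely why the feedback-loop risk exists.

```python
# Minimal sketch of "context pruning" (illustrative design, not any
# vendor's implementation). `summarize` is a hypothetical stand-in for
# an LLM call or extraction step.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user", "assistant", or "system"
    content: str

def summarize(turns: list[Turn]) -> str:
    # Placeholder: in practice this would itself be an AI-driven task,
    # subject to the same accuracy loss the article describes.
    return "Summary of earlier turns: " + "; ".join(t.content for t in turns)

def prune_context(history: list[Turn], keep_recent: int = 6) -> list[Turn]:
    """Keep the most recent turns verbatim; distill the rest into one turn."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [Turn("system", summarize(older))] + recent

history = [Turn("user", f"message {i}") for i in range(20)]
pruned = prune_context(history)
print(len(pruned))  # 7: one summary turn plus six recent turns
```

The design choice mirrors the human-memory analogy in the paragraph above: recent exchanges stay verbatim, while older material survives only as distilled decisions and constraints.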
The current state of frontier AI reveals a paradox of plenty. We have models that can ingest the entire works of Shakespeare in seconds but cannot accurately track a shifting set of requirements in a complex software project for more than an hour. For enterprise users, this means that the most effective way to use high-end models is not to engage in long, winding threads, but to treat each task as a discrete, fresh start. Frequent resets of the conversation thread remain the most effective "patch" for the accuracy drop, though this limits the potential for the AI to act as a truly integrated, long-term collaborator.
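The reset workaround can be expressed as a thin session wrapper that starts each discrete task from a clean thread, carrying forward only an explicit task brief. This is a hypothetical sketch of the workflow, not an API offered by any vendor; a real version would send the assembled context to a chat-completion endpoint.

```python
# Sketch of the "frequent reset" workaround: each discrete task starts
# from a fresh thread, carrying over only a short brief rather than the
# full accumulated history. (Hypothetical wrapper, no real API calls.)
class ResettableSession:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history: list[str] = []

    def context(self, message: str) -> list[str]:
        """Record a message and return what would be sent to the model."""
        self.history.append(message)
        return [self.system_prompt] + self.history

    def reset(self, carry_over: str = "") -> None:
        """Drop accumulated context; optionally seed the new thread."""
        self.history = [carry_over] if carry_over else []

session = ResettableSession("You are a coding assistant.")
for i in range(30):
    session.context(f"step {i}")
session.reset(carry_over="Task brief: refactor the parser module.")
print(len(session.history))  # 1: only the brief survives the reset
```

The trade-off is exactly the one named above: the reset discards the decayed context, but it also discards whatever long-term working relationship the thread had built up.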
Looking forward, the focus of AI development is likely to shift from raw context size to context quality and architectural stability. The next phase of the industry may move away from the current "monolithic attention" models toward systems that utilize external, structured memory databases that can be queried with high precision, rather than relying on the internal, decaying attention of the transformer itself. Until these breakthroughs occur, the 33 percent performance cliff remains a critical boundary for anyone relying on frontier AI for complex, multi-stage problem solving. The reality is that the smartest machines in the world today still have a tendency to lose their way if the conversation goes on just a little too long.
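The external-memory direction sketched above can be illustrated with a toy keyword-indexed fact store: facts live outside the model's context, and only the handful of entries relevant to the current turn are retrieved and re-injected into the prompt. This is an illustrative design under stated assumptions, not a description of any shipping system.

```python
# Toy external structured memory (illustrative only): facts are stored
# outside the model and queried with precision, so the prompt carries a
# few relevant entries instead of the full decaying history.
from collections import defaultdict

class FactStore:
    def __init__(self):
        self._index = defaultdict(set)   # keyword -> set of fact ids
        self._facts = {}                 # fact id -> fact text

    def add(self, fact_id: str, text: str, keywords: list[str]) -> None:
        self._facts[fact_id] = text
        for kw in keywords:
            self._index[kw.lower()].add(fact_id)

    def query(self, keywords: list[str]) -> list[str]:
        """Return facts matching any keyword, for injection into the prompt."""
        ids = set()
        for kw in keywords:
            ids |= self._index[kw.lower()]
        return [self._facts[i] for i in sorted(ids)]

store = FactStore()
store.add("req-1", "The API must return JSON.", ["api", "format"])
store.add("req-2", "Timeout is 30 seconds.", ["timeout"])
print(store.query(["timeout"]))  # ['Timeout is 30 seconds.']
```

Production systems would use embeddings or a database rather than exact keywords, but the principle is the same: retrieval precision replaces reliance on the transformer's internal, decaying attention.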