Microsoft, Salesforce Find AI Reliability Plummets 39% in Incremental Conversations
Beyond the initial prompt, AI models’ memory and reasoning falter, with reliability dropping 39% in complex, multi-turn interactions.
May 28, 2025

A new study from Microsoft and Salesforce has found that even the most advanced AI language models experience a significant decline in reliability as conversations extend and user requirements are revealed incrementally.[1] The research indicates that, on average, the performance of these systems dropped by 39 percent in scenarios where information was provided piece by piece, rather than all at once.[1] This phenomenon raises critical questions about the current capabilities and future development of conversational AI, particularly for complex, evolving interactions.
The core issue highlighted by the study is the struggle of Large Language Models (LLMs) to maintain coherence and accuracy when instructions are delivered in stages.[1][2] Instead of patiently integrating new information, models often make premature assumptions about the user's ultimate goal and attempt to provide a final solution too early.[2] These initial misinterpretations can then become difficult to correct, leading to a cascade of errors as the conversation progresses.[2] The study employed a method called "sharding," where prompts were broken down into smaller fragments and fed to the AI one at a time.[1] This approach mimics natural human conversation, where context and needs are often clarified progressively.[1][2] The findings revealed that even leading models like GPT-4 and Gemini can swing wildly between near-perfect responses and significant failures depending on how the task is presented.[1] In some cases, output consistency dropped by more than half when instructions were fragmented.[1] This unreliability in multi-turn, underspecified tasks contrasts sharply with their performance on single-turn, fully-specified prompts, which is how most LLMs are currently benchmarked.[2]
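To picture the sharded setup, the sketch below builds the same request twice in Python: once as a single fully specified prompt and once as fragments revealed turn by turn. It is illustrative only; `call_model` is a hypothetical stand-in for any chat-completion API, and the prompts are invented rather than drawn from the study's benchmark.

```python
# Illustrative sketch of "sharding": the same task, specified all at once
# versus revealed incrementally across turns. `call_model` is a hypothetical
# placeholder, not the study's actual harness.

def call_model(messages: list[dict]) -> str:
    """Stand-in for a chat-completion call; returns a canned reply here."""
    return f"(assistant reply after {len(messages)} messages)"

# Single-turn framing: the whole task up front, as in most benchmarks.
full_prompt = (
    "Write a Python function that parses a CSV file, skips malformed rows, "
    "and returns totals grouped by month."
)
single_turn_answer = call_model([{"role": "user", "content": full_prompt}])

# Sharded framing: the same requirements revealed one fragment per turn.
shards = [
    "Write a Python function that parses a CSV file.",
    "Actually, it should skip malformed rows instead of raising an error.",
    "Also, group the totals by month before returning them.",
]

messages: list[dict] = []
for shard in shards:
    messages.append({"role": "user", "content": shard})
    reply = call_model(messages)  # the model may commit to a full solution too early here
    messages.append({"role": "assistant", "content": reply})
```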
Several underlying factors contribute to this degradation in performance during longer exchanges. One of the most significant is the limitation of the "context window," which is the amount of information an AI model can actively process and "remember" at any given time.[3][4][5] While context windows have been expanding, with some models like GPT-4o and Claude 3.5 Sonnet boasting capacities of 128,000 to 200,000 tokens (roughly equivalent to tens of thousands of words), they are still finite.[4][6] As a conversation continues, older information may be pushed out of this window to make room for new input, leading to the model "forgetting" crucial earlier details.[3][4][5][7] This can result in responses that are repetitive, irrelevant, or contradictory to what was discussed previously.[3][5] Furthermore, models may struggle to prioritize the most relevant information from a long history, especially when key details are buried deep within the preceding dialogue.[3] Studies have shown that retrieval accuracy can decrease substantially as input contexts grow, even within the stated token limits.[3] For instance, GPT-4 Turbo, despite a 128,000-token context window, showed degraded information retrieval after 32,000 tokens and a more significant drop beyond 64,000 tokens.[3] Another issue is "cumulative noise," in which minor inaccuracies or misinterpretations from earlier turns compound, progressively degrading the overall quality and coherence of the conversation.[3]
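The "forgetting" effect of a finite context window can be sketched with a simple sliding-window truncation. The example below is a loose illustration, assuming a crude word-count stand-in for a real tokenizer; production systems manage context differently, but the failure mode is the same: once the budget is exceeded, the oldest turns are silently dropped.

```python
# Simplified sliding-window truncation: only the newest turns that fit the
# token budget are kept, so facts stated early in the chat can vanish.

def count_tokens(text: str) -> int:
    # Crude word-count proxy; real tokenizers count differently.
    return len(text.split())

def truncate_to_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep only the most recent messages that fit in the context budget."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):            # walk backwards from the newest turn
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break                             # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "The invoice ID is INV-4412; please remember it."},
    {"role": "assistant", "content": "Noted: INV-4412."},
    {"role": "user", "content": "Here is the full product spec. " + "details " * 60},
    {"role": "user", "content": "What was the invoice ID again?"},
]

window = truncate_to_window(history, max_tokens=50)
# Only the newest turn survives; the message containing INV-4412 has been
# pushed out, so a model answering the last question never sees it.
print([m["content"][:30] for m in window])
```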
The implications of these findings are far-reaching for the AI industry and its users. For developers, the findings underscore the challenge of building truly robust conversational AI systems that can handle the nuances and evolving nature of real-world human interactions.[8][2] The current benchmarks, which often rely on single, complete prompts, may not accurately reflect a model's utility in practical, dynamic scenarios.[2] This suggests a need for new evaluation methodologies that better simulate these extended, incremental conversations.[8][2] For users, particularly in professional settings like customer service, coding assistance, or research, the unreliability in longer interactions can lead to frustration, wasted time, and a loss of trust in AI tools.[9][10][11] If a chatbot provides inconsistent or inaccurate information deep into a complex problem-solving session, the consequences can range from minor annoyance to significant errors in work output.[9][12] This is particularly concerning in critical fields like healthcare, where AI models assisting with tasks like patient history intake have shown a decline in accuracy during back-and-forth exchanges.[8]
Addressing this "long-chat degradation" is a key area of ongoing research and development.[3][13] One approach involves improving how models manage their context window, potentially through more sophisticated summarization techniques or methods to identify and retain the most salient information from earlier in a conversation.[3][14] Researchers at MIT, for example, have developed a method called StreamingLLM, designed to help AI maintain efficiency and coherence even in conversations extending to millions of words by better managing the "key-value cache," which is essentially the bot's conversational memory.[13] Other strategies focus on "contextual pruning," which dynamically filters out irrelevant conversational data while preserving critical context.[14] Salesforce itself is working on benchmarks like "SIMPLE" and "ContextualJudgeBench" to better diagnose and reduce "jagged intelligence"—the erratic performance of AI agents across tasks of similar complexity—and to evaluate an agent's ability to maintain accuracy in context-specific answers.[15] The development of specialized model families, such as Salesforce's xLAM (eXtended Language and Action Models), is also aimed at improving tool use and multi-turn interaction.[15] Furthermore, the concept of "agentic AI," where models can take a sequence of actions and adapt based on outcomes, requires a very low error rate for each individual step, highlighting the need for more reliable underlying models.[16]
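One of those context-management directions, folding older turns into a running summary so salient facts stay in the window, can be sketched as follows. This is not StreamingLLM or Salesforce's implementation; the `summarize` helper here is a naive placeholder that a real system would replace with another model call.

```python
# Loose sketch of summary-based context compaction: older turns are folded
# into one summary message so important details stay in the window even as
# the raw transcript grows.

def summarize(messages: list[dict]) -> str:
    """Placeholder summarizer: keeps only the first sentence of each old turn."""
    firsts = [m["content"].split(".")[0].strip() for m in messages]
    return "Summary of earlier turns: " + "; ".join(firsts) + "."

def compact_history(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Fold everything older than the last `keep_recent` turns into one summary."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_msg = {"role": "system", "content": summarize(old)}
    return [summary_msg] + recent

# On each new turn, the compacted history is what gets sent to the model,
# trading verbatim detail for a bounded, information-dense context.
```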
In conclusion, the study by Microsoft and Salesforce brings to the forefront a significant challenge in the advancement of conversational AI: ensuring reliability and consistency in extended dialogues where information is revealed progressively. The observed 39 percent average drop in performance in such scenarios highlights the limitations of current LLMs in handling real-world conversational dynamics, primarily due to issues like context window constraints, difficulty in tracking evolving user intent, and the accumulation of errors.[1] While these technologies continue to evolve rapidly, this research underscores the critical need for developing more robust context management techniques, refining evaluation benchmarks to reflect real-world usage patterns, and ultimately building AI systems that can maintain a high degree of accuracy and coherence over the entire course of a conversation. The ability to overcome these hurdles will be pivotal in fostering greater trust and unlocking the full potential of AI chatbots in complex, interactive applications.
Research Queries Used
Microsoft Salesforce AI chatbot reliability long conversations study
AI chatbot performance degradation longer conversations
context window limitations AI chatbots
AI model accuracy decline extended dialogue
research on AI chatbot reliability in lengthy interactions
Salesforce Microsoft study AI context length