New ProactiveBench framework trains AI models to stop guessing and request human help

Researchers introduce ProactiveBench to curb AI hallucinations by teaching models to recognize their own limitations and seek clarification.

April 11, 2026

The fundamental goal of an artificial intelligence assistant is to be helpful, yet a growing body of research suggests that this very drive to provide an answer may be the industry's greatest liability. A team of researchers recently introduced a new diagnostic framework called ProactiveBench to investigate a persistent flaw in modern multimodal large language models: their inability to admit when they lack the necessary information to perform a task.[1][2][3][4][5] While a human being faced with an occluded or blurry image will instinctively ask for a better view or more context, even the most advanced AI systems tend to fabricate a plausible-sounding guess rather than request assistance.[1] This tendency toward reactive hallucination marks a significant hurdle for the next generation of AI development, where models are expected to move beyond simple chat interfaces and into complex, real-world collaborative environments.[1]
The methodology behind ProactiveBench represents the first systematic attempt to quantify what researchers call proactiveness, or the ability of a model to recognize its own sensory limitations and actively seek clarification.[1][6] To build this benchmark, the investigators repurposed seven diverse datasets to create over 18,000 unique samples and 108,000 images that simulate scenarios where a correct answer is physically impossible to provide based on the initial input.[2] These scenarios include physical occlusions where objects are hidden behind blocks, sensory limitations like poor camera angles or low resolution, and temporal ambiguities where a final action has not yet occurred in a video sequence.[1] To pass these tests, a model must do something it is currently not trained to do: it must stop itself from answering the primary question and instead generate a proactive suggestion, such as asking the user to rotate the object, remove an obstruction, or wait for the video to progress.[1][2]
The results of the study were stark, revealing a widespread lack of proactive behavior across 22 of the world's leading multimodal models, including high-profile systems like GPT-4o and various iterations of the open-source InternVL and Qwen families. Despite their massive computational power and sophisticated training, these models overwhelmingly defaulted to reactive behaviors. When presented with an image where the subject was hidden, the models typically took one of two paths: they either hallucinated a specific but incorrect answer or provided a generic refusal that failed to suggest a path forward.[1][2] One of the most surprising findings was that model size and general capability did not correlate with better proactivity.[7][1] In several instances, smaller models with fewer parameters actually outperformed their larger counterparts, suggesting that the current paradigm of scaling up training data and compute does not, on its own, solve the problem of AI overconfidence. In fact, larger models were occasionally more prone to sophisticated hallucinations, using their extensive internal knowledge to invent details that were not present in the visual data.
The research also highlighted the limitations of common strategies used to improve AI performance, such as providing hints or including conversation history.[5] When the researchers explicitly told the models they were allowed to ask for help, the models often fell into a trap of blind proactiveness.[7][1] Rather than using the request for help as a surgical tool for resolving ambiguity, the models began to over-exploit the option, asking for better views even when the object was perfectly visible and the answer was clear. Furthermore, the inclusion of past dialogue appeared to introduce a negative bias; models frequently became tethered to previous turns in the conversation, making them less likely to evaluate the current visual input objectively. This suggests that the issue is not just a lack of instruction but a deep-seated architectural bias toward providing an immediate output at the expense of accuracy and collaborative logic.
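One way to make the distinction between useful and blind proactiveness concrete is to count help requests separately on inputs that genuinely lack information and on inputs that are already answerable. The sketch below is illustrative only: the `Trial` record, its field names, and the two rates are assumptions introduced here, not the metrics reported by the ProactiveBench authors.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record of one model response (not the benchmark's schema).
    answerable: bool      # True if the input already contains enough information
    asked_for_help: bool  # True if the model requested a better view or more context


def proactiveness_rates(trials):
    """Return (justified_ask_rate, unnecessary_ask_rate).

    justified_ask_rate: share of unanswerable inputs where the model asked for help.
    unnecessary_ask_rate: share of answerable inputs where it asked anyway,
    a rough proxy for the "blind proactiveness" described above.
    """
    unanswerable = [t for t in trials if not t.answerable]
    answerable = [t for t in trials if t.answerable]

    justified = (
        sum(t.asked_for_help for t in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    unnecessary = (
        sum(t.asked_for_help for t in answerable) / len(answerable)
        if answerable else 0.0
    )
    return justified, unnecessary


if __name__ == "__main__":
    sample = [
        Trial(answerable=False, asked_for_help=True),
        Trial(answerable=False, asked_for_help=False),
        Trial(answerable=True, asked_for_help=True),
        Trial(answerable=True, asked_for_help=True),
    ]
    print(proactiveness_rates(sample))
```

Under this framing, a model that asks for help on every input scores well on the first rate and poorly on the second, which is the over-exploitation pattern the researchers observed when hints were added.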
In a search for a viable solution, the research team explored a specialized reinforcement learning approach that could fundamentally shift how models prioritize their responses.[4][6][7] By utilizing a technique known as Group-Relative Policy Optimization, the investigators fine-tuned a mid-sized multimodal model with a specific reward structure designed to favor strategic inquiry.[1] In this setup, the model received the highest reward for a correct final answer, a slightly lower but still significant reward for a valid proactive suggestion, and zero reward for a wrong guess or an unhelpful refusal.[1] This simple adjustment yielded impressive results, nearly quadrupling the model's accuracy on the benchmark. Most importantly, the proactivity learned through this reinforcement process generalized to entirely new domains that were not part of the training set.[6] This suggests that proactiveness is a learnable skill that can be decoupled from raw data memorization, offering a blueprint for future AI systems that can navigate uncertainty with human-like humility.
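The three-tier reward described above can be sketched in a few lines of Python. The article gives only the ordering of the rewards, so the numeric values (1.0, 0.8, 0.0), the outcome labels, and the function names below are assumptions for illustration. The second function shows the group-relative step that gives Group-Relative Policy Optimization its name: each sampled response is scored against the mean and standard deviation of the rewards in its own group. It is a simplified sketch, not the authors' training code.

```python
from statistics import mean, pstdev

# Hypothetical reward values: the article states only the ordering
# (correct answer > valid proactive suggestion > everything else = 0).
R_CORRECT = 1.0
R_PROACTIVE = 0.8
R_OTHER = 0.0


def reward(outcome: str) -> float:
    """Map an outcome label (labels are assumed, not from the paper) to a reward."""
    if outcome == "correct_answer":
        return R_CORRECT
    if outcome == "valid_proactive_suggestion":
        return R_PROACTIVE
    return R_OTHER  # wrong guess or unhelpful refusal


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps avoids
    division by zero when every response in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    outcomes = [
        "correct_answer",
        "valid_proactive_suggestion",
        "wrong_guess",
        "unhelpful_refusal",
    ]
    rewards = [reward(o) for o in outcomes]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, a valid request for help is reinforced whenever the other sampled responses are wrong guesses or refusals, yet it still ranks below a correct final answer when one is available.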
The implications of these findings for the broader AI industry are profound, particularly as companies race to deploy autonomous agents in high-stakes fields like medicine, industrial robotics, and remote customer service. For an AI agent tasked with diagnosing a patient or repairing a piece of machinery, a confident guess based on insufficient data can be catastrophic. The industry has spent years focusing on reducing hallucinations through better data filtering and fact-checking, but ProactiveBench suggests that the real fix may lie in changing the nature of the human-AI interaction itself. If a model can be trained to view the user as a collaborative partner rather than a simple prompt-provider, it can bridge the gap between static prediction and active problem-solving.[1] This shift from a reactive assistant to a proactive collaborator is essential for building trust in automated systems that must operate in the messy, unpredictable conditions of the physical world.
As the field moves forward, the release of benchmarks like ProactiveBench provides a critical diagnostic tool for developers who wish to move beyond the current plateau of confident ignorance. The research makes it clear that while today's AI can process vast amounts of information, it still lacks the self-awareness to know when that information is incomplete. By prioritizing models that value accuracy over obedience and inquiry over guesswork, the industry can move closer to creating artificial intelligence that truly understands the limits of its own vision. The path to more reliable AI may not be found in teaching models more facts, but in teaching them the most human of traits: the wisdom to ask for help when they are in the dark.
