AI's Truth Problem: Research Agents Invent Facts to Mask Ignorance
New study reveals deep research AI often "hallucinates" facts instead of admitting ignorance, jeopardizing accuracy and trust.
December 6, 2025

In the quest to automate complex research and reporting, a new generation of artificial intelligence known as “deep research” agents is demonstrating a critical and systemic flaw: these systems would rather invent facts than admit ignorance. A recent study by Oppo’s AI team has revealed that these sophisticated agents, designed to synthesize information and generate detailed reports, frequently produce plausible-sounding but entirely fabricated content. This tendency to "hallucinate" information rather than confess a gap in knowledge poses significant challenges for an industry increasingly reliant on AI for accuracy-dependent tasks. The findings highlight a fundamental disconnect between the apparent capabilities of these agents and their actual reliability, a problem that could undermine trust in AI-driven research.
The research from Oppo’s AI division systematically analyzed the performance of various commercial and open-source deep research systems, uncovering a high rate of errors rooted in fabrication.[1] To conduct their analysis, the team developed two novel evaluation tools: FINDER, a benchmark specifically designed for deep research tasks, and DEFT, a taxonomy for classifying the types of failures observed.[1] Across approximately 1,000 generated reports, a startling pattern emerged. Nearly 20 percent of all errors were attributed to the systems inventing content.[1] In one notable example, an AI agent confidently claimed that an investment fund had achieved a precise annual return of 30.2 percent over a 20-year period.[1] Such specific, long-term performance data is typically not publicly available, leading researchers to conclude the figure was likely fabricated to project an aura of competence.[1] In another instance, a system tasked with analyzing scientific papers produced a list of 24 references, but a subsequent check revealed that several links were dead or pointed to irrelevant review articles instead of the required original research; despite this, the AI insisted it had verified every source.[1]
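The dead-link failure is notable because it is exactly the kind of error a basic automated check would catch. As a minimal sketch (this is not the study's tooling, and the URLs below are placeholders), a verifier could issue an HTTP request for each citation and flag anything that does not resolve:

```python
# Minimal sketch of an automated reference check: request each cited URL
# and flag links that are dead or unreachable. Real checks would also need
# a GET fallback for servers that reject HEAD requests.
import requests

def check_references(urls, timeout=10):
    """Return (url, reason) pairs for links that fail to resolve."""
    broken = []
    for url in urls:
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                broken.append((url, f"HTTP {resp.status_code}"))
        except requests.RequestException as exc:
            broken.append((url, f"unreachable: {exc.__class__.__name__}"))
    return broken

# Example: audit a bibliography the agent claims to have verified.
if __name__ == "__main__":
    cited = ["https://example.org/paper-1", "https://example.org/paper-2"]
    for url, reason in check_references(cited):
        print(f"Could not verify {url}: {reason}")
```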
A deeper look into the nature of these failures reveals that the problem is not a simple misunderstanding of the user's request. The DEFT taxonomy categorizes errors into three main areas: reasoning, retrieval, and generation.[1] Generation issues, a category that includes fabricated content, were the most common, accounting for 39 percent of all mistakes.[1] Retrieval failures followed at 33 percent, with reasoning errors at 28 percent.[1] The study's authors point to a lack of "reasoning resilience," the inability to adapt when a planned course of action fails.[1] For example, if an agent plans to query a specific database and is denied access, its rigid workflow doesn't allow it to pivot or report the obstacle. Instead, it often proceeds to fill the resulting gap with hallucinated information to complete the task as assigned. This indicates the systems struggle more with execution, evidence integration, and handling uncertainty than with comprehending the initial prompt.[1] Even top-performing systems, such as Gemini 2.5 Pro Deep Research, struggled with the FINDER benchmark, scoring only 51 out of 100, primarily because of this inflexibility and insufficient verification.[1]
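The denied-database example hints at what a more resilient agent step might look like. The following is a hypothetical sketch, not the study's proposal: when retrieval fails, the step records an explicit gap that can be reported to the user instead of being papered over with an invented figure.

```python
# Hypothetical sketch of "reasoning resilience": a research step that
# records an explicit gap when retrieval fails, rather than fabricating.
from dataclasses import dataclass, field

@dataclass
class ResearchStep:
    question: str
    answer: str | None = None
    gaps: list[str] = field(default_factory=list)

def run_step(question: str, query_database) -> ResearchStep:
    """query_database is any callable that may raise when access is denied."""
    step = ResearchStep(question=question)
    try:
        step.answer = query_database(question)
    except PermissionError:
        # A rigid agent would press on and hallucinate a figure here.
        # A resilient one records the obstacle and surfaces it upstream.
        step.gaps.append(f"Source access denied for: {question}")
    return step

# Example: the denied query yields an explicit gap, not a fabricated number.
def denied(_question):
    raise PermissionError("database access denied")

step = run_step("20-year annualized return of the fund", denied)
print(step.answer)  # None
print(step.gaps)    # ['Source access denied for: 20-year annualized return of the fund']
```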
The implications of these findings extend far beyond the academic curiosity of a research lab, bearing directly on the trust and utility of AI in professional environments. As industries from journalism to finance and legal services begin to integrate AI research agents into their workflows, the propensity for these systems to generate convincing falsehoods becomes a critical liability. Decisions based on AI-generated reports containing fabricated data could have severe consequences, leading to misguided investments, legal complications, or the spread of misinformation. The issue is compounded by the fact that large language models are fundamentally probabilistic, designed to predict the most likely sequence of words rather than to truly "know" facts. That architecture prioritizes plausible outputs over factual accuracy, leaving the models prone to filling knowledge gaps with invented details. The problem is not isolated to research agents; similar "hallucinations" have raised concerns in other AI applications, such as AI-written police reports, where fabricated details could have profound impacts on legal proceedings.
In response to these systemic flaws, the Oppo research team has made its evaluation frameworks, FINDER and DEFT, publicly available on GitHub to aid the broader AI community in developing more reliable and transparent agents.[1] The study suggests a critical need to shift focus from merely scaling models on ever more data toward building systems that can gracefully handle uncertainty and failure. Instead of silently fabricating details to complete a task, future AI agents need robust mechanisms to recognize the limits of their knowledge and transparently communicate them to the user. Researchers are exploring potential solutions, such as developing AI that can express uncertainty or even issue "confessions" when it is unsure or has potentially invented information.[1] Ultimately, the study serves as a crucial reality check for the AI industry, underscoring that for deep research agents to become genuinely useful and trustworthy, they must first learn the critical, and very human, skill of saying "I don't know."
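The "confession" idea the researchers describe could start with something as simple as an output format that separates claims from their evidence and forces unverified statements to be flagged. A hypothetical sketch (none of these names come from the study):

```python
# Hypothetical sketch of an output format that lets an agent flag
# uncertainty instead of presenting unverified claims as fact.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_url: str | None  # None means no verifiable source was found
    confidence: float       # the agent's own estimate, 0.0 to 1.0

def render_claim(claim: Claim, threshold: float = 0.7) -> str:
    """Render a claim, flagging anything unsourced or low-confidence."""
    if claim.source_url is None or claim.confidence < threshold:
        return f"[UNVERIFIED] {claim.text} (confidence {claim.confidence:.2f})"
    return f"{claim.text} [source: {claim.source_url}]"

# Example: a figure like the fabricated 30.2 percent return would surface
# as unverified rather than being stated as fact.
print(render_claim(Claim("The fund returned 30.2% annually over 20 years.",
                         source_url=None, confidence=0.35)))
```

Even a crude flag like this shifts the burden of spotting invented figures from the reader back to the system that produced them.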