New CiteVQA Benchmark Exposes Leading AI Models Citing Wrong Sources for Right Answers

New CiteVQA benchmark reveals that leading AI models frequently hallucinate source citations despite delivering correct factual answers

May 25, 2026

The rapid integration of artificial intelligence into enterprise workflows has highlighted the ability of multimodal models to digest complex PDF files, financial reports, and lengthy legal briefs in seconds[1]. However, a structural flaw has quietly undermined this progress. Leading AI systems frequently deliver correct answers to complex document queries while citing entirely wrong or irrelevant source passages[2]. This phenomenon, newly categorized as attribution hallucination, means that even when a model gets the facts right, the evidence it cites is often fabricated[2][1]. For regulated industries where a verifiable paper trail is legally required, this disconnect is a serious risk, and it suggests that current AI evaluation methods are missing a critical source of untrustworthiness[1][3].
To address this systemic blind spot, a collaborative team of researchers from Peking University and the Shanghai Artificial Intelligence Laboratory developed a specialized evaluation framework called CiteVQA[4]. Traditional document understanding tests, such as DocVQA or MMLongBench-Doc, focus exclusively on the final textual response[5]. Because these benchmarks do not verify where the model retrieved its information, they remain blind to whether a system is truly reading the provided document or merely leveraging its pre-existing training data to make an educated guess[5]. The CiteVQA benchmark changes this paradigm by requiring evaluated models to not only state the correct answer but also return precise, element-level bounding-box citations that pinpoint the exact visual region of the document supporting the claim[1].
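To make the idea of a bounding-box citation concrete, here is a minimal Python sketch of how a cited region might be represented and compared against a ground-truth region. The `BoxCitation` type, its field names, and the use of intersection-over-union (IoU) as the overlap measure are all assumptions made for illustration; the article does not describe how CiteVQA actually stores or matches regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxCitation:
    """Hypothetical citation record: a page number plus a rectangle on that page."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: BoxCitation, b: BoxCitation) -> float:
    """Intersection-over-union of two boxes; 0.0 if they sit on different pages.

    IoU is an assumed overlap measure, not one confirmed by the source.
    """
    if a.page != b.page:
        return 0.0
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up coordinates: the predicted box overlaps half of the reference box.
reference = BoxCitation(page=3, x0=0.0, y0=0.0, x1=2.0, y1=2.0)
predicted = BoxCitation(page=3, x0=1.0, y0=0.0, x1=3.0, y1=2.0)
wrong_page = BoxCitation(page=4, x0=0.0, y0=0.0, x1=2.0, y1=2.0)

overlap = iou(reference, predicted)
```

Under this sketch, a citation on the wrong page scores zero no matter how well the rectangle lines up, which mirrors the article's point that pointing to the wrong page counts as a failed citation.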
The scale of the new benchmark exposes models to realistic, complex data environments that mirror actual corporate and academic use cases[1]. CiteVQA consists of 1,897 questions mapped across 711 multi-page PDF files, spanning seven distinct professional domains and two languages[1]. With documents in the evaluation averaging over 40 pages in length, models cannot succeed through superficial scanning[1]. To establish a rigorous ground truth for comparison, the researchers used a structured, automated pipeline that identifies critical supporting evidence through masking ablation, with the results then verified by human domain experts[1]. This multi-layered validation is meant to give the benchmark a dependable reference standard for both factual and visual traceability.
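One way to picture masking ablation is as a leave-one-out test: hide one document element at a time and check whether the question can still be answered. The sketch below is an illustrative assumption, not the researchers' pipeline. The function `essential_elements`, the `can_answer` callback, and the toy page contents are all hypothetical, and in practice the callback would presumably be a model call rather than a string check.

```python
from typing import Callable, List, Sequence


def essential_elements(
    elements: Sequence[str],
    can_answer: Callable[[Sequence[str]], bool],
) -> List[int]:
    """Return indices of elements whose removal makes the question unanswerable.

    Hypothetical leave-one-out sketch of masking ablation.
    """
    # If the full document cannot answer the question, nothing is evidence.
    if not can_answer(elements):
        return []
    essential = []
    for i in range(len(elements)):
        # Mask element i by dropping it from the visible set.
        visible = [e for j, e in enumerate(elements) if j != i]
        if not can_answer(visible):
            essential.append(i)
    return essential


# Toy page with made-up content, used only to exercise the function.
page = [
    "Title: Annual Report",
    "Revenue in 2025 was $4.2M",
    "Footer: page 3",
]


def toy_can_answer(visible: Sequence[str]) -> bool:
    """Stand-in for a model: the answer is recoverable if the revenue line is visible."""
    return any("Revenue in 2025" in e for e in visible)


evidence = essential_elements(page, toy_can_answer)
```

Here only the revenue line is flagged as evidence, because masking the title or footer leaves the answer recoverable. A leave-one-out pass like this would miss redundant evidence that appears in two places, which is one reason a human verification step is a sensible follow-up.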
At the heart of this evaluation lies a newly introduced metric known as Strict Attributed Accuracy, which sets an uncompromising standard for what constitutes a correct output[1]. Under this scoring system, a model receives zero credit if it provides the correct answer but fails to locate the exact supporting text region, or if it highlights the correct region but misinterprets the data to produce an incorrect answer[5][6]. Only when both the semantic answer and the spatial citation are verified as correct does the system earn points[1]. This approach effectively unmasks models that appear highly capable under standard evaluations but are actually suffering from deep attribution failures, providing developers with a much more accurate picture of how reliable their models will be in the field[1].
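The all-or-nothing scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration that assumes each prediction has already been judged on two boolean axes; the `Prediction` type and function names are hypothetical, and the article does not say how the benchmark decides whether an answer or a cited region counts as correct.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Prediction:
    """Hypothetical per-question judgment on two independent axes."""
    answer_correct: bool    # did the answer match the reference answer?
    citation_correct: bool  # did the cited region match the ground-truth evidence?


def answer_accuracy(preds: Sequence[Prediction]) -> float:
    """Conventional answer-only accuracy: citations are ignored."""
    return sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: Sequence[Prediction]) -> float:
    """Credit only when the answer AND the citation are both correct."""
    return sum(p.answer_correct and p.citation_correct for p in preds) / len(preds)


# Four made-up outcomes covering every combination.
preds = [
    Prediction(answer_correct=True, citation_correct=True),    # earns credit
    Prediction(answer_correct=True, citation_correct=False),   # right answer, wrong source
    Prediction(answer_correct=False, citation_correct=True),   # right source, wrong answer
    Prediction(answer_correct=False, citation_correct=False),  # wrong on both
]

print(answer_accuracy(preds))             # 0.5
print(strict_attributed_accuracy(preds))  # 0.25
```

In this toy set, answer-only scoring credits two of four predictions while the strict rule credits one, which is the kind of gap the benchmark is designed to surface.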
An audit of 20 leading multimodal large language models on the CiteVQA benchmark reveals a pervasive and troubling gap between answer accuracy and sourcing precision[1]. The commercial frontier model GPT-5.4 achieved an answer-only accuracy of 87.1 percent, the highest of any model tested[7]. Yet when scored under Strict Attributed Accuracy, its performance fell to just 59 percent[5]. That drop of roughly 28 points implies that for nearly one-third of the questions GPT-5.4 answered correctly, it pointed to the wrong page or cited an incorrect paragraph, suggesting that it relied on internal parametric knowledge or flawed pattern matching rather than the actual document context[7].
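The "nearly one-third" figure follows from the two reported scores. The short Python sketch below reproduces the arithmetic, assuming that both scores are computed over the same question set and that every strictly correct item also counts as correct under answer-only scoring. The function name is illustrative.

```python
def misattributed_share(answer_acc: float, strict_acc: float) -> float:
    """Fraction of correctly answered questions whose citation was wrong.

    Assumes both accuracies share one question set and that strict
    correctness implies answer correctness.
    """
    if not 0 < answer_acc <= 1:
        raise ValueError("answer_acc must be in (0, 1]")
    if not 0 <= strict_acc <= answer_acc:
        raise ValueError("strict_acc must be in [0, answer_acc]")
    return (answer_acc - strict_acc) / answer_acc


# Scores reported in the article for GPT-5.4.
share = misattributed_share(answer_acc=0.871, strict_acc=0.59)
print(f"{share:.1%}")  # 32.3%
```

In other words, about 28 of every 87 correct answers came with a citation that failed the strict check.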
Among the tested systems, Google's Gemini-3.1-Pro-Preview demonstrated the strongest spatial awareness, leading the benchmark with a Strict Attributed Accuracy score of 76 percent[5]. Researchers hypothesize that the Gemini series performs better in this regard due to architectural optimizations specifically designed for native citation and layout alignment[8]. However, even this top-tier performance leaves roughly a quarter of all queries without a correct answer backed by a correct citation, highlighting that accurate citation remains a formidable barrier even for the world's most heavily funded proprietary models[5]. Meanwhile, the divide between commercial systems and open-source models remains vast[8]. The strongest open-source model evaluated, Qwen3-VL-235B, achieved a Strict Attributed Accuracy score of just 22.5 percent, while smaller open-source models frequently scored below 10 percent, making them impractical for tasks that require strict document grounding[1][7].
These findings carry heavy implications for industries such as law, medicine, and corporate finance, where mistakes in source attribution can carry severe legal or financial penalties[1][3]. In a legal setting, an artificial intelligence assistant that summarizes a contract correctly but cites the wrong clause could cause a lawyer to submit inaccurate court filings, leading to professional misconduct allegations. Similarly, in healthcare, an AI system that correctly identifies a patient's allergy but links it to the wrong medical report or page could mislead clinicians during a critical care decision. If human professionals must spend significant time manually verifying every single citation generated by an AI assistant to ensure its validity, the promised productivity gains of document-parsing technologies are largely negated.
To bridge this trust gap, AI development must move beyond training models to be merely fluent and persuasive[9]. The next phase of artificial intelligence design must prioritize spatial reasoning and strict logical grounding, teaching models to treat the act of citation not as an afterthought, but as an inseparable part of the reasoning process[1][10]. This will require training methodologies that penalize models during reinforcement learning when they fail to align their internal logic with explicit, verifiable source coordinates. Furthermore, enterprise buyers must demand more rigorous testing protocols, shifting their procurement evaluations away from simple accuracy metrics and toward attribution-focused benchmarks[11].
Ultimately, the transition toward trustworthy document intelligence depends on a fundamental shift in how the industry defines and measures success[1]. The development of benchmarks like CiteVQA exposes the dangerous illusion of correctness that has characterized first-generation document analysis tools[1]. For artificial intelligence to be safely integrated into the foundational pillars of society, it must be held to the same standards as any human expert. It is not enough for an intelligent system to simply know the right answer; it must also possess the discipline and precision to show its work, proving beyond a doubt exactly where that knowledge was found[1].

Sources