Google DeepMind's FACTS Benchmark Reveals Top AI Models Struggle With Truth

A new benchmark delivers a sobering verdict: even leading AI models struggle with factual accuracy, especially on multimodal tasks.

December 11, 2025

A new comprehensive benchmark from Google DeepMind reveals a sobering truth about the current state of artificial intelligence: even the most advanced models from industry leaders struggle significantly with factual accuracy. The FACTS Benchmark Suite, designed to be a more holistic measure of AI truthfulness, shows that top-tier models like Gemini 3 Pro and GPT-5 are far from infallible, with none of the tested models achieving an overall accuracy score above 70%.[1][2][3] This rigorous evaluation highlights a critical gap between these models' generative capabilities and their reliability, a gap that has profound implications for industries looking to deploy these technologies in high-stakes environments.[2][4]
Developed by Google's FACTS team in collaboration with the data science platform Kaggle, the benchmark aims to fill a crucial void left by previous evaluation methods.[3][5] Researchers at DeepMind argued that prior tests often assessed skills in isolation, failing to provide a complete picture of a model's reliability.[6] A model might excel at one task, like summarizing a document, but fail when asked to retrieve facts from its internal knowledge base.[6] To address this, the FACTS suite is structured around four distinct sub-benchmarks designed to simulate real-world scenarios and potential failure points.[2][5] These include a Parametric Benchmark to test the accuracy of a model's internal knowledge, a Search Benchmark to evaluate its ability to use web search to find and synthesize information, a Multimodal Benchmark for interpreting visual data like charts and images, and an updated Grounding Benchmark to assess if a model's answers are firmly based on provided text.[2][5] To ensure impartiality and prevent models from being trained specifically for the test, Kaggle hosts the official leaderboard and uses a mix of public and private datasets for evaluation.[7][3][5]
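To make the suite's structure concrete, the sketch below shows one way per-category scoring could be organized in code. Only the four category names come from the benchmark itself; the data structures, toy records, and scoring logic are illustrative assumptions, not the actual FACTS evaluation pipeline.

```python
# Minimal, hypothetical sketch of tallying per-category accuracy across the
# four FACTS sub-benchmarks. Category names come from the article; everything
# else (records, scoring logic) is illustrative, not the real FACTS pipeline.
from collections import defaultdict
from dataclasses import dataclass

CATEGORIES = ("parametric", "search", "multimodal", "grounding")

@dataclass
class EvalRecord:
    category: str   # one of CATEGORIES
    correct: bool   # whether the answer was judged factually accurate

def score_by_category(records: list[EvalRecord]) -> dict[str, float]:
    """Return accuracy per sub-benchmark as a fraction in [0, 1]."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec.category] += 1
        hits[rec.category] += int(rec.correct)
    return {cat: hits[cat] / totals[cat] for cat in totals}

if __name__ == "__main__":
    # Toy data only; the real evaluation uses a mix of public and private sets.
    sample = [
        EvalRecord("parametric", True),
        EvalRecord("parametric", False),
        EvalRecord("search", True),
        EvalRecord("multimodal", False),
        EvalRecord("grounding", True),
    ]
    per_category = score_by_category(sample)
    for cat in CATEGORIES:
        if cat in per_category:
            print(f"{cat:>11}: {per_category[cat]:.1%}")
```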
The initial results from the FACTS leaderboard are telling. Google's own Gemini 3 Pro model achieved the highest overall score of 68.8%, followed by Gemini 2.5 Pro at 62.1% and OpenAI's GPT-5 at 61.8%.[1][3] While Gemini 3 Pro showed particular strength in the search-related tasks, scoring 83.8%, its performance on tasks relying on its internal knowledge was lower at 76.4%.[3] The results underscore a critical challenge for the entire field: despite rapid advancements in AI's reasoning and creative abilities, fundamental truthfulness remains a significant hurdle.[1][2] The fact that even the leading model answers incorrectly nearly a third of the time signals a clear message to developers and enterprise users: the era of "trust but verify" is far from over.[2]
Perhaps the most alarming findings came from the Multimodal Benchmark, which tests the ability of models to accurately understand and answer questions about visual information.[2][6] In this category, every model tested scored below 50%, with the top-performing model only reaching 46.9%.[1][3] This "disaster zone," as some commentators noted, indicates that AI is not yet reliable for tasks requiring unsupervised data extraction from visual sources like invoices or financial charts.[1][2] The high error rate suggests that deploying AI for such tasks without rigorous human oversight could introduce severe inaccuracies.[2] The tendency for models to "hallucinate," or invent false information, remains a persistent problem that erodes trust and limits real-world applications.[8][9] This issue is not merely a technical glitch but a fundamental aspect of how current generative models work, as they are designed to predict plausible-sounding text based on patterns rather than possessing a true understanding of facts.[9][10]
The release of the FACTS benchmark and its candid results have spurred discussion throughout the AI community. While it provides a much-needed standardized tool for measuring a crucial aspect of AI performance, the fact that a Google-created benchmark ranked Google's model at the top has raised questions about potential bias.[1] Regardless, the benchmark sets a new, more challenging standard for the industry. It moves beyond simple question-answering to assess how well models can synthesize complex information and ground their responses in evidence.[8][7] For industries like finance, law, and healthcare, where accuracy is paramount, this level of scrutiny is essential.[2][3] The findings suggest that enterprise users should carefully consider the specific strengths and weaknesses of different models, potentially using them in conjunction with other tools like search and vector databases to bolster accuracy.[3]
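For teams weighing that advice, the sketch below illustrates the general retrieve-then-answer pattern in which a model is asked to respond only from supplied evidence and its output is still flagged for human review. The retriever, model call, and review flag are hypothetical placeholders, not the FACTS methodology or any particular vendor's API.

```python
# Illustrative sketch of grounding an answer in retrieved evidence before
# trusting it. The retriever and model client are placeholder functions; the
# point is the pattern: retrieve -> answer only from evidence -> human review.
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    answer: str
    evidence: list[str]
    needs_review: bool

def retrieve_passages(query: str, k: int = 3) -> list[str]:
    """Placeholder for a web-search or vector-database lookup."""
    return [f"[passage {i} relevant to: {query}]" for i in range(1, k + 1)]

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string here."""
    return "Answer drafted strictly from the provided passages."

def grounded_answer(question: str) -> GroundedAnswer:
    evidence = retrieve_passages(question)
    prompt = (
        "Answer the question using ONLY the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        + "\n".join(evidence)
        + f"\n\nQuestion: {question}"
    )
    draft = ask_model(prompt)
    # Anything destined for a high-stakes use still goes to a human reviewer.
    return GroundedAnswer(answer=draft, evidence=evidence, needs_review=True)

if __name__ == "__main__":
    result = grounded_answer("What overall score did the top model reach on FACTS?")
    print(result.answer)
    print("Evidence passages used:", len(result.evidence))
    print("Human review required:", result.needs_review)
```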
In conclusion, the FACTS benchmark serves as a crucial reality check for the artificial intelligence industry. While large language models continue to demonstrate astonishing capabilities, their grasp on the truth remains tenuous. The sub-70% accuracy ceiling and particularly poor performance in multimodal understanding reveal that the path to creating genuinely reliable and trustworthy AI is still long.[1][2] As AI becomes more integrated into society, this benchmark highlights the urgent need for continued research and development focused not just on making models smarter, but on making them more truthful. For the foreseeable future, human oversight and critical evaluation of AI-generated content will remain indispensable.[11]
