"Humanity's Last Exam" for AI Fails Its Own Test: Questions Flawed

The ultimate AI test, "Humanity's Last Exam," is failing its own test of accuracy, complicating AI evaluation.

July 24, 2025

"Humanity's Last Exam" for AI Fails Its Own Test: Questions Flawed
A groundbreaking benchmark for artificial intelligence, heralded as "Humanity's Last Exam" (HLE), is facing scrutiny after an analysis revealed that a significant portion of its questions in key scientific domains may be flawed.[1][2] Researchers from FutureHouse, a company focused on AI for scientific research, found that nearly 29 percent of the biology and chemistry questions in the HLE dataset have answers that are either incorrect or misleading when compared with published scientific literature.[1][2] The finding casts a shadow on a tool designed to be the definitive test of advanced AI, raising critical questions about how such benchmarks are created and how reliably they gauge the true capabilities of frontier AI models.
"Humanity's Last Exam" was introduced by the Center for AI Safety (CAIS) and Scale AI as a response to the growing problem of "benchmark saturation," where leading AI models were achieving near-perfect scores on existing tests, making it difficult to track further progress.[3][4][5] The exam, comprising 2,500 to 3,000 highly challenging questions across a vast range of subjects, was intended to push AI systems to their absolute limits, testing their reasoning and knowledge at the frontiers of human expertise.[3][4][6] The creators crowdsourced questions from nearly 1,000 experts, including university professors and top researchers from over 500 institutions, offering a $500,000 prize pool to incentivize the submission of problems that would genuinely challenge the most advanced AI.[7][4][8] The core idea was to create a benchmark so difficult that it would remain relevant even as AI capabilities continued their rapid advance, providing a clear measure of how close machines are to expert-level human intelligence.[7][5][8]
The methodology for creating the exam, however, may have inadvertently contributed to the high error rate. A key criterion for including a question was that current frontier AI models could not answer it correctly.[2] That prerequisite, combined with a review process in which human experts were not expected to spend more than about five minutes verifying any single question's accuracy, likely encouraged the inclusion of "gotcha" or adversarial-style questions.[2] The FutureHouse analysis, which combined human expert review with the company's own AI tools, found that this focus on stumping AI may have compromised the factual soundness of the questions themselves.[1][2] The researchers argue that the incentive structure prioritized difficulty over correctness, producing questions that were esoteric or based on misinterpretations of scientific knowledge.[2] One cited example asks for the rarest noble gas on Earth in 2002, with the provided answer being oganesson, a synthetic element that has only ever been produced as a handful of short-lived atoms in a Russian laboratory and does not occur naturally on Earth.[2]
The implications of these findings are significant for the AI industry, which relies heavily on benchmarks to evaluate and compare the performance of different models. If a benchmark designed to be the gold standard contains a high percentage of flawed questions, it undermines the validity of the results. Top-performing models like OpenAI's o1, Google's Gemini, and Anthropic's Claude have all been tested against HLE, with even the best systems scoring relatively low, initially suggesting a large gap remained between AI and human expert performance.[9][4][10] However, if a substantial number of questions are fundamentally incorrect, the low scores may not accurately reflect the models' reasoning abilities but rather the flawed nature of the test itself. This calls into question the true meaning of performance on HLE and complicates the narrative of AI progress toward artificial general intelligence (AGI).
The discovery of these errors highlights a critical challenge in the field of AI evaluation. As models become more powerful, creating robust, accurate, and truly challenging benchmarks becomes increasingly difficult. The creators of HLE did implement a bug bounty program to identify and correct errors after its release, acknowledging the difficulty of creating a perfect dataset.[3][4] The FutureHouse analysis, however, suggests a more systematic issue rooted in the initial design and review process.[2] As the industry continues its quest to build and measure superintelligent systems, this episode serves as a crucial reminder that the tools used for measurement must be as rigorously vetted and validated as the AI models they are designed to assess. The integrity of these benchmarks is paramount for fostering genuine progress, ensuring transparency, and guiding the responsible development of artificial intelligence.
