GPT-5, Gemini 3 Pro flunk physics reasoning, exposing AI's limits.
CritPt, a rigorous physics benchmark built by more than 50 physicists, shows that even advanced AI models remain far from capable of novel, independent scientific discovery.
November 23, 2025

A new, formidable benchmark in the field of physics has delivered a sobering reality check to the artificial intelligence industry, revealing that even the most advanced large language models, such as Gemini 3 Pro and GPT-5, fall far short when asked to perform complex scientific reasoning at the level of genuine research. The benchmark, known as CritPt, was meticulously designed by a consortium of over 50 physicists to mirror the challenges faced by early-stage PhD students.[1] The stark results show that while AI has made remarkable strides in many areas, the dream of an autonomous AI scientist capable of novel, independent discovery remains a distant frontier. The findings highlight a significant gap between the pattern-matching capabilities of current AI and the nuanced, multi-step reasoning required for scientific breakthroughs.
The CritPt benchmark, which stands for Complex Research using Integrated Thinking-Physics Test, is a purpose-built challenge to evaluate the authentic reasoning abilities of AI in the context of modern physics research.[1][2] It comprises 71 composite research challenges and 190 more granular "checkpoint" tasks, all derived from the real-world research of its creators across a wide array of physics disciplines, including quantum physics, astrophysics, and biophysics.[1][3][4] Crucially, the problems are unpublished, rendering them "search-proof" and preventing the models from simply retrieving answers from their vast training data.[2][3] The open-ended nature of the questions demands more than just a simple numerical answer; models are expected to produce complex symbolic expressions and even executable code, pushing them far beyond the scope of typical academic-style questions.[1] This design philosophy is intended to probe what the creators call the "critical point" of AI reasoning—the transition from superficial pattern recognition to genuine, deep understanding and problem-solving.[2]
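To see why grading open-ended symbolic answers is harder than checking a number, consider a minimal sketch of how such a harness might work. This is an illustration under stated assumptions, not CritPt's actual grading code: the function name and grading rule are hypothetical, and the idea is simply that a correct answer written in a different algebraic form should still score.

```python
# Hypothetical sketch of auto-grading a symbolic physics answer:
# compare expressions for mathematical equivalence rather than
# string equality. Not CritPt's actual harness.
import sympy as sp

def grade_symbolic(model_answer: str, reference: str) -> bool:
    """Return True if the model's expression is equivalent to the reference."""
    a = sp.sympify(model_answer)
    b = sp.sympify(reference)
    # If the difference simplifies to zero, the two forms are equivalent.
    return sp.simplify(a - b) == 0

# Algebraically different but correct forms still pass:
print(grade_symbolic("sin(x)**2 + cos(x)**2", "1"))  # True
print(grade_symbolic("x + 1", "x"))                  # False
```

An equivalence check like this is far more forgiving than exact-match grading, which matters when a derivation can legitimately be expressed many ways.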
The performance of the leading AI models on this rigorous test was deeply underwhelming. On the full research challenges, the best-performing base model, GPT-5, achieved an average accuracy of only 4.0%.[2] Even when augmented with powerful tools like a code interpreter and web search, its accuracy rose only modestly, to around 11.7%.[5][6] This indicates that the models are far from being able to handle the multi-step, integrated reasoning required to see a complex research problem through from beginning to end. While the models fared slightly better on the smaller, isolated "checkpoint" tasks, with accuracy reaching over 20% in some cases, their performance was still far from reliable.[5] A particularly telling finding was the models' lack of consistency; even when a model managed to solve a problem correctly, it would often fail on subsequent attempts at the same problem, revealing a fragile and unreliable reasoning process.[5]
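The distinction between solving a problem once and solving it reliably can be made concrete with a small sketch. The attempt data below is made up for illustration (not CritPt results), but it shows why "best of k attempts" accuracy can look far rosier than the consistency figure that matters for real research use.

```python
# Illustrative sketch with invented data: per-attempt success can
# overstate reliability when a model fails repeat tries of a problem
# it previously solved.
def pass_at_least_once(attempts):
    """Fraction of problems solved on at least one attempt ("best of k")."""
    return sum(any(runs) for runs in attempts) / len(attempts)

def pass_every_time(attempts):
    """Fraction of problems solved on every attempt (consistency)."""
    return sum(all(runs) for runs in attempts) / len(attempts)

# 4 problems x 3 attempts each (True = correct answer)
attempts = [
    [True, False, False],
    [True, True, True],
    [False, False, False],
    [True, False, True],
]
print(pass_at_least_once(attempts))  # 0.75
print(pass_every_time(attempts))     # 0.25
```

The gap between the two numbers is exactly the fragility the benchmark's authors describe: a headline accuracy can triple the rate at which the model is actually dependable.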
The implications of the CritPt results are far-reaching for the future of AI in science. The benchmark exposes a fundamental weakness in current large language models: they often generate plausible-sounding answers that contain subtle but critical flaws. In the unforgiving domain of physics, a single incorrect assumption or flawed inference can invalidate an entire line of reasoning.[5][7] This suggests that the current paradigm of training models on vast amounts of text is insufficient for instilling the kind of robust, logical, and creative thinking that scientific discovery demands. While AI has demonstrated impressive successes in specific scientific applications, such as Google DeepMind's AlphaFold predicting the structure of millions of proteins, these are often in more constrained and well-defined problem spaces.[8] The open-ended, novel challenges of frontier research, as simulated by CritPt, require a level of abstraction and genuine understanding that remains elusive.
In conclusion, the CritPt benchmark serves as a crucial milestone in assessing the true capabilities of AI in the scientific domain. It demonstrates that while models like Gemini 3 Pro and GPT-5 are powerful tools, they are not yet the autonomous scientific collaborators that some have envisioned. The path forward will likely involve not just scaling up existing models, but developing new architectures and training methodologies that can foster a deeper, more robust form of reasoning. The future of AI in science will likely be one of collaboration, where AI assists human researchers with specific, well-defined tasks, rather than taking the lead in discovery.[5] CritPt has laid bare the significant hurdles that remain, providing a clear and challenging roadmap for the development of AI that can one day truly think like a physicist.