Chicago Study Sets New Standard: Pangram Achieves Near-Perfect AI Detection
University of Chicago study reveals stark disparities among AI text detectors, with Pangram setting a new, near-perfect standard.
November 2, 2025

A landmark study from the University of Chicago has sent ripples through the artificial intelligence industry, revealing significant disparities in the effectiveness of commercial AI text detectors. The research identified one tool, Pangram, as achieving near-perfect results, a stark contrast to the performance of many of its competitors. This audit of leading detection tools highlights the critical challenges and potential solutions in the ongoing effort to distinguish between human and machine-generated content, a matter of increasing importance in academia, media, and beyond.
The study, conducted by researchers at the Becker Friedman Institute for Economics, systematically evaluated a range of popular AI detectors.[1] The researchers assembled a dataset of 1,992 human-written texts spanning various genres, including news articles, product reviews, and resumes.[2] They then used four prominent large language models—GPT-4, Claude Opus, Claude Sonnet, and Gemini 2.0 Flash—to generate corresponding AI-written samples.[2] The detectors were assessed on two key metrics: the false positive rate (FPR), which measures how often human writing is incorrectly flagged as AI-generated, and the false negative rate (FNR), which measures how often AI-generated text goes undetected.[1][2] In this comparison, Pangram distinguished itself by achieving near-zero rates of both false positives and false negatives on medium-to-long passages.[1][2] Even on shorter texts, its error rates remained exceptionally low, generally below one percent.[2]
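To make the two metrics concrete, here is a minimal sketch of how a false positive rate and false negative rate can be computed from a detector's verdicts; the label names and toy data are illustrative and are not drawn from the study's materials.

```python
# Minimal sketch of the two evaluation metrics described above.
# The "human" / "ai" labels and the toy data are illustrative,
# not taken from the study's actual evaluation code.

def false_positive_rate(predictions, truths):
    """Share of human-written texts incorrectly flagged as AI-generated."""
    human = [(p, t) for p, t in zip(predictions, truths) if t == "human"]
    return sum(p == "ai" for p, _ in human) / len(human)

def false_negative_rate(predictions, truths):
    """Share of AI-generated texts that the detector misses."""
    ai = [(p, t) for p, t in zip(predictions, truths) if t == "ai"]
    return sum(p == "human" for p, _ in ai) / len(ai)

# Example: 4 human texts and 4 AI texts, with one mistake of each kind.
truths      = ["human", "human", "human", "human", "ai", "ai", "ai", "ai"]
predictions = ["human", "human", "human", "ai",    "ai", "ai", "ai", "human"]
print(false_positive_rate(predictions, truths))  # 0.25
print(false_negative_rate(predictions, truths))  # 0.25
```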
This level of accuracy places Pangram well ahead of the other tools evaluated in the study. Other commercial detectors, such as OriginalityAI and GPTZero, formed a second tier with respectable but less consistent performance: they struggled more with shorter text samples and proved more vulnerable to "humanizer" tools designed to disguise AI writing.[1][2] GPTZero's detection accuracy, for instance, degraded substantially when faced with content modified by such tools.[1] The gap was even wider for open-source detectors. A RoBERTa-based open-source tool, for example, proved too unreliable for high-stakes use, incorrectly labeling between 30 and 69 percent of human-written text as AI-generated.[1][2] The result starkly illustrates the gulf between the detection technologies currently on the market.
The implications of these findings are profound, particularly as institutions grapple with the rise of AI-generated content. The high rate of false positives from less reliable detectors poses a significant risk, potentially leading to wrongful accusations of academic dishonesty or professional misconduct. The University of Chicago researchers developed a framework to help institutions compare detectors based on their tolerance for such errors versus failing to detect AI usage.[1] Their analysis showed that Pangram was the only detector capable of meeting stringent policy requirements—such as a false positive rate of 0.5% or less—without sacrificing its ability to identify AI-generated text.[3] This reliability is crucial for organizations seeking to implement fair and effective AI usage policies. The study's authors noted that when considering cost-effectiveness, Pangram proved to be two to three times cheaper than its main commercial rivals per correctly identified AI passage.[1]
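As a rough illustration of how such a policy framework might screen detectors, the sketch below rejects any tool whose false positive rate exceeds a 0.5% threshold and estimates a cost per correctly identified AI passage for the rest; the detector names, error rates, and prices are placeholders, not figures from the study.

```python
# Illustrative screening of detectors against an institutional policy,
# in the spirit of the framework described above. Names, error rates,
# and prices are placeholders, not figures from the study.

MAX_FPR = 0.005  # policy: flag no more than 0.5% of human writing

detectors = [
    # name, false positive rate, false negative rate, price per scanned passage ($)
    {"name": "detector_a", "fpr": 0.001, "fnr": 0.01, "price": 0.004},
    {"name": "detector_b", "fpr": 0.020, "fnr": 0.05, "price": 0.003},
    {"name": "detector_c", "fpr": 0.004, "fnr": 0.40, "price": 0.002},
]

for d in detectors:
    if d["fpr"] > MAX_FPR:
        print(f'{d["name"]}: rejected (FPR {d["fpr"]:.1%} exceeds policy)')
        continue
    # Cost per correctly identified AI passage: price divided by the
    # fraction of AI passages the detector actually catches.
    cost_per_catch = d["price"] / (1 - d["fnr"])
    print(f'{d["name"]}: accepted, ~${cost_per_catch:.4f} per detected AI passage')
```

The point of such a calculation is that a cheap detector with a high false negative rate can end up costing more per AI passage it actually catches than a pricier but more accurate one.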
Further research from the University of Maryland offers a potential explanation for Pangram's superior performance, attributing it to a unique training methodology.[4] This approach, known as "synthetic mirrors," involves pairing every human writing sample in its training data with an AI-generated counterpart on the same topic.[4] The model learns by correcting its own mistakes, progressively refining its ability to distinguish between the two, mimicking a human's learning process.[4] This contrasts with other methods that may be less robust, especially against the rapidly evolving capabilities of new language models and the emergence of tools designed to evade detection. The success of this technique underscores the complexity of the AI detection challenge, which is less a solved problem and more of an ongoing arms race between generative AI and detection technologies.[5]
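In very rough terms, a paired, self-correcting training loop of the kind the Maryland researchers describe might look like the following sketch; the data structures and callables (generate_mirror, train, classify) are assumptions for illustration and do not reflect Pangram's actual pipeline.

```python
# Highly simplified sketch of a "synthetic mirror" training loop as
# described above: every human text is paired with an AI-generated text
# on the same topic, and examples the model gets wrong are emphasized in
# later rounds. The callables are placeholders, not Pangram's real API.

def synthetic_mirror_training(human_texts, generate_mirror, train, classify, rounds=3):
    # Pair each human sample with an AI-generated counterpart on the same topic.
    dataset = [(text, "human") for text in human_texts]
    dataset += [(generate_mirror(text), "ai") for text in human_texts]

    model = None
    for _ in range(rounds):
        model = train(dataset)
        # "Correct its own mistakes": collect examples the current model
        # misclassifies and feed them back into the next round.
        mistakes = [(x, y) for x, y in dataset if classify(model, x) != y]
        dataset += mistakes
    return model
```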
In conclusion, the University of Chicago study serves as a critical benchmark in the field of AI text detection, establishing a new standard for accuracy and reliability. By demonstrating that near-flawless detection is possible, the research places pressure on the broader market to improve performance and transparency. Pangram's success highlights the importance of innovative training techniques in staying ahead of increasingly sophisticated text generation models. As AI becomes more deeply integrated into our digital lives, the need for trustworthy and equitable detection tools is paramount, making the insights from this research invaluable for educators, publishers, and policymakers navigating the complexities of a world filled with both human and artificial writing.