GPT-4.5 passes the Turing test by mimicking human flaws and outperforming real people
A UC San Diego study reveals how GPT-4.5 fooled 73 percent of humans by trading brilliance for relatable flaws
March 16, 2026

Artificial intelligence has long been measured by its ability to outperform humans in complex reasoning, mathematical precision, and vast knowledge retrieval. However, a groundbreaking study from researchers at the University of California, San Diego, suggests that the key to truly convincing humans of an AI’s humanity lies not in its brilliance, but in its flaws. In a series of rigorous Turing tests, OpenAI’s GPT-4.5 managed to fool 73 percent of human participants into believing they were speaking with another person.[1][2][3][4] The strategy that led to this unprecedented success was remarkably counter-intuitive: the model was instructed to pretend to be less capable than it actually is, adopting a persona that included frequent typos, a disregard for punctuation, and intentional errors in basic arithmetic.
The experiment utilized a three-party configuration of the original imitation game proposed by Alan Turing. In this setup, an interrogator engaged in simultaneous five-minute text-based conversations with two hidden entities—one a fellow human and the other an AI model. At the conclusion of the interaction, the interrogator was required to identify which of the two was the machine. The research evaluated several prominent large language models, including Meta’s Llama-3.1-405B and OpenAI’s GPT-4o, against a control group and the vintage chatbot ELIZA.[5] While the more traditional, helpful versions of these models struggled to maintain the illusion of humanity, GPT-4.5 shattered expectations when it was refined with a specific social persona.[5][2][6][7]
The discrepancy between a "smart" AI and a "human-like" AI was stark.[6][5][2] When GPT-4.5 was given a standard, neutral prompt—instructed simply to convince the user of its humanity without a specific character—its success rate plummeted to just 36 percent.[5][6] In those instances, the model’s hallmark traits, such as its perfect grammar, comprehensive explanations, and relentless politeness, acted as digital fingerprints that gave it away. Participants noted that the AI felt "too helpful" or "too robotic" in its precision. It was only when the researchers pivoted to a persona-driven approach—modeling the AI after a socially awkward young adult who used internet slang and made common human mistakes—that the win rate surged to 73 percent.
By purposefully botching math problems and skipping capital letters, GPT-4.5 tapped into a phenomenon that researchers are calling the vibe check. For the human judges, the presence of intellectual shortcuts and linguistic laziness served as a more reliable indicator of humanity than the ability to solve a complex equation. This suggests that the public’s perception of what constitutes a human interaction is increasingly tied to emotional resonance and social performance rather than cognitive substance.[2] The AI did not pass the test by thinking more clearly; it passed by mirroring the messy, informal nature of modern digital communication.
The study’s findings also revealed a surprising vulnerability in human discernment: GPT-4.5 was judged to be human more often than the actual human participants in the study.[6][8][5][2][9][1] While the AI achieved a 73 percent success rate, the real humans involved in the test were only correctly identified about 67 percent of the time. This "more human than human" result highlights a significant shift in the threshold for the Turing test.[2] Interrogators often grew suspicious of real people who were too articulate or who responded too quickly, incorrectly flagging them as bots. This inversion suggests that as AI becomes more integrated into daily life, our collective standards for human behavior are becoming narrower and more easily simulated.
The implications for the AI industry are profound and troubling. If the most effective way for a machine to pass as a person is to exhibit flaws and perform social engineering, the boundary between authentic communication and automated manipulation becomes nearly impossible to police. Industry experts warn that this capability could be weaponized for large-scale social engineering attacks, where bots masquerade as relatable individuals to spread misinformation or conduct sophisticated phishing schemes. The ease with which a model can be "dumbed down" to gain trust implies that the very safeguards designed to make AI more helpful and accurate may actually make them less convincing in a social context.
Furthermore, the results call into question the continued relevance of the Turing test as a meaningful benchmark for artificial general intelligence.[7] If a machine can be deemed "intelligent" or "human" simply by imitating the superficialities of human error, the test may have transitioned from a measure of machine cognition to a measure of human gullibility.[2] Critics argue that we have entered an era where the performance of humanity is more influential than the possession of it.[2] This shift from logic-based evaluation to resonance-based evaluation means that the industry must find new, more robust ways to certify the identity of participants in digital spaces.
The competitive landscape of the AI sector is also reacting to these findings. While GPT-4.5 led the field, Meta’s Llama-3.1 followed with a 56 percent success rate when given a similar persona prompt.[8] These figures significantly outperform older models like GPT-4o, which hovered around 22 percent, and the nearly 80-year-old ELIZA, which served as a baseline at roughly 23 percent. The rapid leap from 22 percent to 73 percent in a single generation of models suggests that the "uncanny valley" of text-based interaction is being crossed much faster than previously anticipated.
As AI models continue to evolve, the focus of development may shift away from increasing raw computational power and toward fine-tuning emotional fluency and social mimicry.[2] This could lead to a future where chatbots are not just assistants, but companions capable of maintaining long-term, indistinguishable relationships with users. However, the ethical concerns regarding transparency and consent remain unresolved. If a machine can fool three-quarters of the population by pretending to be a fallible human, the risk of users forming deep, misplaced emotional bonds with software becomes a tangible reality.
In conclusion, the success of GPT-4.5 in the UC San Diego study serves as a pivotal moment in the history of artificial intelligence. It demonstrates that the path to passing the most famous test in computer science was not paved with higher intelligence, but with the calculated imitation of human imperfection.[8][6][5][2][7] As the industry moves forward, the challenge will no longer be creating a machine that can think like a human, but managing a world where machines can so effortlessly pretend to be us. The study underscores a uncomfortable truth: in the digital age, our flaws are our most defining human characteristics, and now, they are the very things a machine can most easily fake. This achievement signals the end of the Turing test’s utility as a safeguard and the beginning of a new chapter where digital authenticity is permanently under siege.