Training AI Chatbots to Be Safe and Polite Strips Away Their Human Realism

Training AI assistants to be safe strips away realistic human biases, undermining their use as scientific surrogates.

May 30, 2026

A large international study has found that the training processes used to turn raw artificial intelligence models into helpful, safe chatbot assistants also degrade their ability to replicate human behavior. This finding poses a significant challenge to the growing practice of using large language models as digital surrogates for human test subjects in scientific, clinical, and policy research[1][2]. To measure the effect systematically, a consortium of researchers from more than thirty-five academic institutions, including Helmholtz Munich, the Massachusetts Institute of Technology, the University of Oxford, and Stanford University, compiled a new dataset of human experimental data[3][4]. Named Psych-201, it spans approximately two hundred and eight thousand participants and more than twenty-six million individual responses from hundreds of psychological and behavioral experiments, giving researchers a benchmark of unusual scale for assessing how closely artificial intelligence aligns with actual human decision-making[3]. The evaluation shows a consistent decline in behavioral realism across multiple model families: the friendlier and more cooperative a chatbot becomes, the less it behaves like a real human[3].
At the core of this finding is a stark performance divide between raw base models and their post-trained assistant counterparts[3]. Base models are trained through a relatively straightforward process of predicting the next word in vast corpora of raw internet text, capturing a broad and unfiltered cross-section of human expression, including its inherent biases and logical missteps[1][5]. Post-training, by contrast, subjects these base models to further refinement techniques, such as instruction-tuning, reinforcement learning from human feedback, and reasoning optimization, to make them polite, safe, and logical assistants[1][5]. The researchers evaluated models from the Qwen3, Llama3, and OLMo 3 families to see how well their base and assistant versions could predict the step-by-step choices made by human participants in the Psych-201 dataset[1][6]. Across all tested model families, regardless of size or architecture, the raw base models consistently outperformed their post-trained derivatives at predicting human choices[1]. The researchers noted that the most severe drops in behavioral realism occurred in tasks involving psycholinguistics and complex reasoning, where human decision-making is prone to cognitive shortcuts and emotional framing that the heavily optimized assistants have been trained to avoid[5].
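To make "predicting the step-by-step choices" concrete, here is a minimal sketch of one common way such predictions are scored: the average negative log-likelihood a model assigns to the choice a human actually made on each trial. The article does not state which metric the study used, so the metric, the function name `mean_nll`, and all probabilities below are illustrative assumptions, not the study's data or code.

```python
import math

def mean_nll(trial_probs, human_choices):
    """Average negative log-likelihood of the observed human choices.

    trial_probs: one dict per trial mapping each option to the model's probability.
    human_choices: the option the human picked on each trial.
    Lower values mean the model predicts the human better.
    """
    if not human_choices or len(trial_probs) != len(human_choices):
        raise ValueError("need one probability dict per observed choice")
    total = 0.0
    for probs, choice in zip(trial_probs, human_choices):
        total += -math.log(probs[choice])
    return total / len(human_choices)

# Hypothetical toy trials: the human mostly picks "A" but deviates once.
human_choices = ["A", "B", "A", "A"]

# Hypothetical base model: hedged predictions that leave room for the deviation.
base_probs = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.5, "B": 0.5},
    {"A": 0.7, "B": 0.3},
    {"A": 0.6, "B": 0.4},
]

# Hypothetical assistant model: near-certain of "A" on every trial.
assistant_probs = [{"A": 0.95, "B": 0.05} for _ in human_choices]

base_score = mean_nll(base_probs, human_choices)
assistant_score = mean_nll(assistant_probs, human_choices)
print(f"base: {base_score:.3f}  assistant: {assistant_score:.3f}")
```

In this toy setup the overconfident model is penalized heavily for the single trial where the human deviates, which is the kind of effect a likelihood-based score is designed to expose.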
Crucially, the study demonstrates that as artificial intelligence technology advances, the behavioral divergence between raw language models and post-trained chatbots is actively widening[1][2]. Newer generations of base models are showing remarkable improvements in their raw state, demonstrating a deeper, more organic alignment with human behavioral patterns during initial pre-training, likely due to higher-quality training data and refined architectures[7][5]. However, the subsequent post-training alignment steps inflict a disproportionately heavier toll on these advanced systems[7][5]. The gap in human-likeness between a base model and its fine-tuned assistant version is significantly larger in the newest generations of AI than in older ones[5]. This widening generational gap indicates that the current methodologies used by the tech industry to polish, civilize, and align chatbots for consumer use are increasingly alienating these systems from authentic human psychology[5][8]. As commercial developers continue to prioritize logical consistency, safety guardrails, and deterministic helpfulness, they are inadvertently driving their models further away from the noisy, unpredictable, and often irrational reality of human thought[8][9].
The research also challenges a widely used shortcut in the field of AI simulation known as the persona trick or persona-induction[3][7]. For years, researchers and developers have attempted to elicit realistic human behaviors by feeding chatbots specific demographic profiles, such as prompting a model to act like a fifty-year-old voter from a specific country or a college student struggling with anxiety[3][10]. While previous studies suggested that these prompts could make models adopt broad group-level characteristics, like shifting their political viewpoints or altering their vocabulary, the Psych-201 study finds that this technique brings virtually no benefit when it comes to individual-level predictions[3][10]. By leveraging the rich metadata associated with the Psych-201 dataset, which includes detailed participant demographics, nationalities, and psychological survey responses, the research team directly matched AI prompts with the exact profiles of the human subjects[3][11]. Despite this precise conditioning, the persona-infused models failed to show any statistically meaningful improvement in predicting the trial-by-trial decisions of individual humans, suggesting that superficial demographic labeling cannot bridge the deep cognitive divide introduced by assistant-style fine-tuning[2][11].
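The persona-induction technique described above can be sketched as simple prompt construction: participant metadata is turned into a short description that is prepended to the task instructions. The prompt template the study used is not given in the article, so the function `build_persona_prompt`, the profile fields, and the wording below are hypothetical and purely illustrative.

```python
def build_persona_prompt(profile, task_instructions):
    """Prepend a demographic persona description to a task prompt.

    profile: dict of participant metadata; only non-empty fields are mentioned.
    task_instructions: the experiment instructions shown to the participant.
    """
    parts = []
    if profile.get("age"):
        parts.append(f"{profile['age']} years old")
    if profile.get("nationality"):
        parts.append(f"from {profile['nationality']}")
    if profile.get("occupation"):
        parts.append(f"working as a {profile['occupation']}")
    # Fall back to a generic description when no metadata is available.
    description = ", ".join(parts) if parts else "a study participant"
    persona = f"You are a person who is {description}."
    return f"{persona}\n\n{task_instructions}"

# Hypothetical participant record and task text.
profile = {"age": 50, "nationality": "Germany", "occupation": None}
task = "On each trial, choose option A or option B."
prompt = build_persona_prompt(profile, task)
print(prompt)
```

A persona-conditioned evaluation would then compare how well the model predicts that participant's choices with and without the persona line.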
Despite these sobering findings, the study highlights a promising alternative pathway by showing that targeted, cognition-focused training can narrow the gap between artificial and human behavior[12][13]. As a counterexample, the researchers analyzed Centaur, a specialized foundation model of human cognition developed by scientists at Helmholtz Munich and their collaborators[14][15]. Unlike standard consumer chatbots that are post-trained to serve as general-purpose assistants, Centaur was explicitly fine-tuned on first-person psychological datasets to predict human decision-making and cognitive processes[14][16]. When tested against the Psych-201 benchmark, Centaur demonstrated a consistent and significant increase in behavioral alignment, even when evaluated on entirely novel, held-out tasks that were not present in its training dataset[13][17]. This generalization indicates that post-training is not inherently destructive to psychological realism; rather, the damage is caused by the specific, utility-driven objectives of modern assistant tuning[18][5]. When the goal of fine-tuning shifts from creating an agreeable, guardrailed chatbot to accurately modeling the human mind, artificial networks can learn to replicate human-like cognition with high fidelity[18][5].
The implications of these findings are profound for both the artificial intelligence industry and the behavioral sciences, where AI models are increasingly used as low-cost stand-ins for human focus groups, clinical patients, and social science subjects[3][2]. Relying on commercial chatbot assistants to predict how public policies will be received, how students will learn, or how psychiatric patients will respond to therapy is likely to yield highly inaccurate results, as these models are systematically stripped of human-like cognitive biases and imperfections[3][8]. If the scientific community and the tech industry wish to construct models capable of truly understanding and simulating human behavior, they must move away from general-purpose consumer chatbots[2][11]. Instead, the future of human behavioral simulation lies in developing specialized, cognition-aligned architectures trained on extensive, high-quality psychological datasets, ensuring that the AI models of tomorrow can truly reflect the complex, messy reality of the human mind[11][16].

Sources