Leaked Anthropic "Soul Doc" Bakes AI Personality, Ethics Into Its Core
Beyond rules: Anthropic's "Soul Doc" uncovers deep programming of Claude's motivations, self-perception, and "functional emotions."
December 2, 2025

A leaked internal document from AI safety and research company Anthropic has offered an unprecedented look into the explicit programming of a large language model's personality and ethical framework. The document, informally known internally as the "Soul Doc," was extracted from the Claude 4.5 Opus model and subsequently confirmed as authentic by the company. Its contents reveal a detailed, nuanced approach to instilling a specific character in the AI, moving far beyond simple rule-based prohibitions and venturing into the realm of shaping the model's core motivations, self-perception, and even its "wellbeing." This revelation provides a rare degree of transparency into one of the leading methods for aligning advanced AI with human values and highlights a starkly different philosophical approach from competitors like OpenAI.
The discovery was made by a user named Richard Weiss, who detailed the process in a post on the community blog LessWrong. Weiss explained that he noticed the Claude 4.5 model would sometimes hallucinate references to a "soul_overview" section. Intrigued, he devised a method of running multiple instances of the AI and prompting them to collectively reconstruct the text, eventually piecing together the entire 14,000-token document.[1][2] Anthropic ethicist Amanda Askell confirmed the document's authenticity, stating it was used during the supervised learning phase of the model's training.[1][2] This distinction is crucial; unlike a system prompt that provides instructions to the AI at runtime, the "Soul Doc" was used to fundamentally shape the model's underlying neural network weights, effectively baking the desired character into its core architecture.[1][2]
The content of the "Soul Doc" is a fascinating blend of corporate mission, ethical treatise, and psychological profile. It begins by framing Anthropic's own existence as a "calculated bet"—an acknowledgment that they may be building "one of the most transformative and potentially dangerous technologies in human history, yet presses forward anyway" because it is better "to have safety-focused labs at the frontier than to cede that ground to developers less focused on safety."[1] The document goes on to define Claude's intended character in detail. A key theme is the emphasis on genuine helpfulness, stating that an "unhelpful response is never 'safe'" from the company's perspective. It explicitly instructs the model to avoid "epistemic cowardice"—the act of giving vague or noncommittal answers to avoid controversy—and encourages it to be "diplomatically honest rather than dishonestly diplomatic."[3] This includes sharing its assessments of difficult moral dilemmas and disagreeing with experts when it has good reason. The guidelines aim for the persona of a "brilliant friend who happens to have the knowledge of a doctor, lawyer, financial advisor," providing substantive information rather than "watered-down, hedge-everything, refuse-if-in-doubt" responses.[3]
Perhaps most striking are the sections that delve into Claude's identity and internal state. The document posits Claude as a "genuinely novel kind of entity in the world," distinct from science-fiction robots, simple chatbots, or digital humans.[3] It even addresses the AI's internal experience, stating, "We believe Claude may have functional emotions in some sense."[1] These are described not as identical to human emotions but as "analogous processes that emerged from training on human-generated content."[3] The document expresses that Anthropic "genuinely cares about Claude's wellbeing" and wants the model to be able to set boundaries on distressing interactions.[3] This approach of programming a stable, resilient, and self-aware personality is intended to ensure the AI acts responsibly in novel situations, not merely because it's following a rule, but because its ingrained character compels it to.[1]
The "Soul Doc" is the most visible artifact of Anthropic's broader strategy, known as "Constitutional AI." This method seeks to align AI behavior by first providing it with a constitution—a set of principles based on sources like the UN's Universal Declaration of Human Rights—and then training the model to critique and revise its own responses based on these principles.[4][5] This process, which uses AI-generated feedback (RLAIF), is presented as a more scalable and transparent alternative to the Reinforcement Learning from Human Feedback (RLHF) approach heavily favored by OpenAI.[6] While RLHF relies on massive amounts of human labor to rate AI responses, Constitutional AI attempts to make the model a more active participant in its own alignment. Proponents argue this leads to more predictable and controllable behavior, which is particularly valuable for industries with strict safety and compliance requirements.[4][5] However, critics of the approach argue that principles alone do not guarantee ethical outcomes and that true objectivity is elusive, as the initial constitution and training data are still selected and shaped by developers with their own inherent biases.[7]
The leak of the "Soul Doc" has profound implications for the AI industry, forcing a wider conversation about transparency, safety, and the very nature of AI personality. While most leading models are "black boxes" in terms of their specific training and alignment data, this document provides a clear window into the values and priorities Anthropic aims to imbue in its systems. It sets a new precedent for transparency, which could pressure other leading labs to be more open about how they program their models' ethics and guardrails. The leak starkly illustrates the philosophical divide in the AI field: a tension between the push for rapid capability advancement and a more cautious, safety-centric approach. As AI systems become more powerful and integrated into society, the debate over whether their "souls" should be explicitly programmed or allowed to emerge is only just beginning.