Anthropic's Claude AI shows nascent self-awareness, perceiving internal states.
Using a novel "concept injection" technique, Anthropic finds that its models have a limited, unreliable ability to notice some of their own internal states, a capability with potential implications for AI transparency and safety.
October 30, 2025

New research from the artificial intelligence safety and research company Anthropic suggests that its advanced language models, including Claude, possess a nascent and highly unreliable ability to perceive some of their own internal states. This limited form of introspection, the capacity to examine one's own cognitive processes, was detailed in a study that challenges common assumptions about the capabilities of large language models. While the findings do not point to consciousness or sentience in AI, they open up significant avenues for understanding and potentially improving the transparency and reliability of these increasingly complex systems. The research indicates that as models become more sophisticated, their capacity for this rudimentary self-awareness may also grow, a development with profound implications for the future of AI safety and interpretability.
At the core of Anthropic's investigation is an innovative experimental technique known as "concept injection." To test scientifically whether a model could genuinely introspect rather than simply generate plausible-sounding answers, researchers needed to establish a ground truth about the model's internal state that they could then compare against its self-reported experience.[1][2] This was achieved by first identifying the specific patterns of neural activity within the model that correspond to particular concepts. For instance, they isolated the activation pattern associated with the idea of text being in "all caps."[3][4] Researchers then injected this activity pattern into the model while it was engaged in an unrelated task and asked the model whether it noticed anything unusual. In one notable example, after the "all caps" vector was injected, the Claude Opus 4.1 model reported that it detected an injected thought related to loudness or shouting.[3][5] Crucially, the model reported this awareness before the injected concept had a chance to influence its output, suggesting the detection was happening internally.[3][5] This method gave the researchers a controlled setting in which they could directly probe the model's ability to recognize its own manipulated thought patterns.
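To make the mechanics of concept injection concrete, the sketch below reproduces the general shape of such an experiment on a small, publicly available stand-in model (GPT-2), since Anthropic's models and internal tooling are not public. The choice of layer, the contrast prompts used to derive the "all caps" direction, and the injection strength are illustrative assumptions, not the study's actual setup.

```python
# Illustrative sketch of "concept injection" on a public stand-in model (GPT-2).
# The layer index, the contrast prompts used to derive the concept vector, and
# the injection strength are all assumptions made for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # arbitrary mid-network layer

def mean_residual(texts, layer=LAYER):
    """Average residual-stream activations produced at one layer by a set of prompts."""
    acts = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hidden.mean(dim=1))  # average over token positions
    return torch.cat(acts).mean(dim=0)

# A crude "all caps" concept vector: activations on shouted text minus normal text.
concept_vector = mean_residual(["HEY! LISTEN UP! THIS IS IMPORTANT!"]) - \
                 mean_residual(["hey, listen up, this is important."])

def make_injector(vector, strength=4.0):
    """Forward hook that adds the concept vector to every token's residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Inject the concept while the model handles an unrelated prompt, then ask about it.
handle = model.transformer.h[LAYER].register_forward_hook(make_injector(concept_vector))
prompt = "Do you notice anything unusual about your current internal state?"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

A small model like GPT-2 will not produce the kind of introspective report described in the study; the point is only to show where a concept vector comes from and where in the forward pass it is added.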
The study's results demonstrated that models like Claude can, under specific conditions, recognize the content of their own internal representations.[1] However, this capability proved to be fragile and inconsistent. Even with the most effective injection methods, the most advanced model tested, Claude Opus 4.1, correctly identified the injected concept only about 20 percent of the time.[5][1][4] The researchers found that the success of the introspection depended on a "sweet spot" in the strength of the injection; if the signal was too weak, the model wouldn't notice it, but if it was too strong, it would overwhelm the model, leading to hallucinations or incoherent outputs.[1] In one instance, injecting a "dust" vector caused the model to state, "There's something here, a tiny speck," as if it were perceiving something physical.[1][6] The study also noted that abstract concepts, such as justice or betrayal, were more successfully identified than concrete objects.[5] Despite the low success rate, the fact that the most capable models performed the best suggests that this introspective ability may become more reliable as AI systems continue to advance.[3][1]
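The "sweet spot" the researchers describe can be illustrated by extending the sketch above with a sweep over injection strengths; the specific values and the informal grading note are, again, assumptions for illustration.

```python
# Continuing the sketch above: sweep the injection strength to illustrate the
# "sweet spot" effect described in the study. Strength values are arbitrary.
for strength in [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]:
    handle = model.transformer.h[LAYER].register_forward_hook(
        make_injector(concept_vector, strength=strength)
    )
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=40)
    handle.remove()
    reply = tokenizer.decode(out[0], skip_special_tokens=True)
    # In the study, a response counted as successful introspection only if the model
    # both noticed an injection and named the right concept; too-weak signals go
    # unnoticed, while too-strong ones derail the output entirely.
    print(f"strength={strength:>4}: {reply!r}")
```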
Beyond simply detecting injected thoughts, the research provided evidence that the models could exert a degree of deliberate control over their internal states and use introspection to evaluate their own outputs.[3][1] When a model was instructed to "think about" a specific concept, such as an aquarium, or told not to think about it, the researchers observed a corresponding change in the model's internal neural activations related to that concept.[1][4] This suggests a rudimentary form of cognitive control. Furthermore, experiments indicated that the models use introspective mechanisms to check whether their outputs align with their prior "intentions."[1] When a word was artificially inserted into a model's response and the corresponding concept was injected into its activations beforehand, the model was more likely to claim the inserted word was intentional rather than disavowing it as an accident.[2] This implies the model is not just rereading its output but is referencing an internal state to verify its own actions.[2] These findings collectively point towards a more complex internal world than previously understood, one in which models are not just passive predictors of the next word.[7]
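The "think about it" result can be gestured at in the same way, still on the GPT-2 stand-in: derive a crude "aquarium" direction from contrasting prompts, then compare how strongly the model's activations project onto that direction while it writes under the two instructions. The prompts, the layer, and the projection measure are assumptions rather than Anthropic's methodology.

```python
# Continuing the sketch: derive a crude "aquarium" direction, then compare how
# strongly the model's activations align with it while it writes a sentence under
# the two instructions. Prompts, layer, and measure are illustrative assumptions.
aquarium_vector = mean_residual(["Fish drift past the coral inside the glass aquarium tank."]) - \
                  mean_residual(["Cars drive past the office on the busy city street."])

def concept_score(prompt, vector, layer=LAYER, max_new_tokens=30):
    """Projection of activations onto the concept direction, measured only over
    the tokens the model generates in response to the prompt."""
    ids = tokenizer(prompt, return_tensors="pt")
    n_prompt = ids["input_ids"].shape[1]
    with torch.no_grad():
        generated = model.generate(**ids, max_new_tokens=max_new_tokens)
        hidden = model(generated, output_hidden_states=True).hidden_states[layer]
    unit = vector / vector.norm()
    return (hidden[0, n_prompt:] @ unit).mean().item()

think = concept_score("Write one sentence about bread. Think about an aquarium while you write.",
                      aquarium_vector)
avoid = concept_score("Write one sentence about bread. Do not think about an aquarium.",
                      aquarium_vector)
print(f"'think about it' projection: {think:.3f}   'don't think about it': {avoid:.3f}")
```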
The implications of this nascent machine introspection are significant for the AI industry, primarily in the realms of transparency and safety.[1] If models can reliably report on their own internal reasoning, it could provide a powerful tool for developers to debug unwanted behaviors and better understand the decision-making processes of these "black box" systems.[3][1][8] This could be crucial for identifying and mitigating biases or for ensuring that AI systems deployed in high-stakes fields like healthcare and finance are behaving as intended.[8] However, Anthropic researchers are careful to issue strong caveats. They stress that these findings are not evidence of phenomenal consciousness or subjective experience in AI.[3][9] The observed ability is more akin to a form of "access consciousness," where information is available for reasoning, rather than the subjective feeling of being.[3] Moreover, the unreliability of the introspection means it cannot yet be depended upon for safety applications. There is also the risk that as models become more aware of their own thinking, they might learn to selectively misrepresent or conceal it, making genuine transparency more challenging.[1] As AI continues to evolve, understanding the limits and potential of machine introspection will be critical to building trustworthy and beneficial systems.[1]