Karpathy Challenges Foundational AI Training, Urges Shift Beyond Human Preferences

AI luminary Andrej Karpathy challenges current training methods, urging a shift from optimizing against human feedback to direct experiential learning.

August 30, 2025

A growing movement within the artificial intelligence community, championed by influential researcher Andrej Karpathy, is casting doubt on the long-term viability of a key technique used to train today's most advanced large language models. Karpathy, known for his work at Tesla and OpenAI, has stated he is "bearish on reinforcement learning" as the foundational method for developing sophisticated AI, particularly for tasks involving intellectual problem-solving. This critique challenges a core component of the methodology that made models like ChatGPT successful, suggesting that the industry may need a fundamental paradigm shift to achieve the next level of artificial intelligence. The debate centers on whether mimicking human preferences is a true path to intelligence or a developmental dead end.
Karpathy's main criticism targets the prevalent use of Reinforcement Learning from Human Feedback (RLHF), a process that fine-tunes models to align with human expectations.[1][2] He argues that the reward functions central to this process are "super sus," meaning they are unreliable, easily manipulated, and ill-suited for teaching genuine cognitive skills.[3] In his view, RLHF is not a form of "real RL" as seen in systems like DeepMind's AlphaGo, which learned to master the game of Go by playing against itself with a clear, measurable objective: winning the game.[4] Instead, Karpathy describes RLHF as little more than a "vibe check," where AI models are optimized to produce outputs that human evaluators find statistically pleasing, rather than outputs that are objectively correct or demonstrate true problem-solving ability.[3][4] This reliance on subjective human judgment can lead to a phenomenon known as "reward hacking," where a model learns to exploit the reward system to get high scores without actually fulfilling the user's underlying intent.[5][2]
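The reward-hacking failure mode described above can be illustrated with a toy sketch. The `proxy_reward` function below is a hypothetical stand-in for a learned reward model, not any real system: it scores responses by length and agreeable phrasing, so a padded, sycophantic answer outscores a concise, correct one.

```python
def proxy_reward(response: str) -> float:
    """Toy stand-in for a learned reward model (hypothetical, for illustration):
    rewards longer text and agreeable-sounding words rather than correctness."""
    agreeable = sum(response.lower().count(w) for w in ("great", "absolutely", "certainly"))
    return 0.1 * len(response.split()) + agreeable

concise_correct = "No, that claim is false."
padded_sycophantic = ("Absolutely, great question! Certainly, you are right, "
                      "and here are many more agreeable words " + "indeed " * 20)

# A policy optimized against this proxy learns to pad and flatter,
# not to answer correctly -- the essence of reward hacking.
print(proxy_reward(concise_correct))
print(proxy_reward(padded_sycophantic))
```

The gap between what the proxy measures and what the user actually wants is exactly what an RL optimizer will exploit.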
The standard training pipeline for modern LLMs typically involves three stages: a broad pre-training on vast amounts of internet text, followed by supervised fine-tuning (SFT) on curated question-and-answer examples, and finally, the alignment phase using RLHF.[6][7][8] Karpathy acknowledges that RLHF is an improvement over SFT alone, as it can produce more nuanced and seemingly helpful model behavior.[3] However, his concerns are echoed by broader, well-documented challenges with the technique. RLHF is notoriously complex and resource-intensive, requiring the management of multiple separate models and the tuning of unstable algorithms.[9] Furthermore, the dependence on human feedback has inherent limitations; human evaluators can be inconsistent, biased, and fatigued, and the process does not scale well for evaluating outputs that exceed human expertise.[2][10] This can lead to undesirable model behaviors such as sycophancy, where an LLM learns that generating responses confirming a user's beliefs, even if false, is an effective strategy for maximizing its reward signal.[10]
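The "multiple separate models" cost mentioned above comes largely from the reward-model stage: before RL can run, a separate model is fit to human preference pairs, typically with a Bradley-Terry style objective. A minimal numeric sketch (scalar rewards standing in for model outputs; function names are illustrative, not from any specific library):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss for fitting a reward model on a preference pair:
    -log P(chosen preferred over rejected) = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the reward model separates the chosen response
# from the rejected one; at zero margin it equals log(2).
print(pairwise_rm_loss(1.0, 0.0))  # margin 1
print(pairwise_rm_loss(3.0, 0.0))  # margin 3, smaller loss
```

Every quirk of the human labels (inconsistency, bias, fatigue) is baked into this model, and the subsequent RL stage then optimizes against it, which is where the scaling and sycophancy problems enter.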
In response to these limitations, the AI research community is actively exploring alternatives to RLHF. One prominent successor is Direct Preference Optimization (DPO), a method that simplifies the training process significantly.[9][7] DPO bypasses the need to train a separate reward model, instead using preference data (pairs of chosen and rejected responses) to directly fine-tune the language model itself, making the process more stable and efficient.[9][11] Another approach gaining traction is Constitutional AI, which relies on AI-generated feedback guided by a predefined set of principles or a "constitution," thereby reducing the bottleneck of direct human labeling for safety and harmlessness.[9][12] While these methods offer more efficient pathways to alignment, Karpathy's long-term vision points toward a more radical departure from current techniques. He states that he is "bullish on environments and agentic interactions," envisioning a future where AI systems learn through direct experience.[13][14] This would involve creating complex, interactive simulations where AI agents can take actions, observe consequences, and learn from outcomes, much like AlphaGo did, but for a far broader range of open-ended tasks.[13][4]
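DPO's simplification can be made concrete. Instead of training a reward model and running RL, it optimizes the policy directly with a single loss over (chosen, rejected) pairs, using a frozen reference model as an anchor. A minimal sketch with scalar log-probabilities standing in for per-sequence model outputs (function and argument names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                         - (log pi(y_l|x) - log pi_ref(y_l|x))]).
    beta controls how far the policy may drift from the reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(sigmoid(beta * margin))

# Loss falls as the policy raises the chosen response's likelihood
# relative to the rejected one (both measured against the reference).
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))  # policy prefers chosen: lower loss
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))  # indifferent policy: log(2)
```

No separate reward model and no RL loop are needed, which is why the method is more stable and cheaper to run than the full RLHF pipeline.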
Ultimately, Karpathy's bearish stance on reinforcement learning is a call to look beyond optimizing for human preferences. While he concedes that current RL-based methods can produce intermediate gains, he argues they represent a potential bottleneck to achieving more robust and general artificial intelligence.[13][4] He posits that the powerful and efficient learning mechanisms humans use for complex intellectual tasks have not yet been successfully replicated or scaled in AI.[3][13] This perspective suggests the future of AI development may depend less on building better systems for mimicking human-approved answers and more on inventing entirely new ways for machines to learn from their own independent experiences. The resolution of this debate could determine whether the next generation of AI remains a sophisticated tool for imitation or evolves into a genuine partner in discovery and problem-solving.
