Rival LLMs Crown OpenAI's GPT-5.1 Most Capable in Groundbreaking AI Peer Review

Andrej Karpathy's LLM Council turns rival AIs into anonymous judges of each other's work, and their collective verdict named GPT-5.1 the most capable of the group, hinting at a new way to measure capability.

November 24, 2025

In a novel experiment that turned the tables on the artificial intelligence industry's top competitors, a council of leading large language models (LLMs) has collectively declared OpenAI's GPT-5.1 to be the most capable among them. The surprising verdict emerged from the "LLM Council," a project created by esteemed AI researcher Andrej Karpathy, which forces rival AIs to anonymously judge each other's work. This peer-review process, involving prominent models such as Google's Gemini 3.0, Anthropic's Claude, and xAI's Grok, offers a dynamic new method for evaluating the rapidly advancing technology, suggesting that even the most sophisticated AIs can recognize superior performance in a competitor. The outcome is particularly noteworthy as it challenges recent benchmarks that had indicated other models were pulling ahead in the AI race, highlighting a potential gap between standardized tests and qualitative, context-rich performance.
The LLM Council operates on a straightforward yet ingenious three-stage process designed to elicit unbiased evaluations.[1][2] First, a user's query is simultaneously dispatched to a panel of top-tier LLMs, in this case GPT-5.1, Gemini 3.0 Pro Preview, Claude Sonnet 4.5, and Grok-4.[3][4] These models, accessed through the OpenRouter API, generate their individual responses in isolation.[5][2] In the second stage, the crucial peer review begins. Each AI is presented with the complete set of anonymized answers generated by its counterparts.[6][2] Without knowing the author of each response, the models are tasked with ranking them on criteria such as accuracy, insight, and overall quality.[1][2] This blind review is critical to preventing inherent biases, such as a model favoring its own output.[6] Finally, a designated "Chairman LLM" takes all the individual responses and the collective rankings and synthesizes a single, consolidated final answer for the user, effectively building a consensus out of a competitive yet collaborative process.[1][7]
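To make the pipeline concrete, the minimal Python sketch below walks through the three stages against an OpenAI-compatible OpenRouter endpoint. The model identifiers, prompts, and choice of chairman are illustrative assumptions for this article, not Karpathy's actual code.

```python
# Illustrative sketch of the three-stage council flow (not Karpathy's implementation).
# Assumes the OpenAI-compatible OpenRouter endpoint and an OPENROUTER_API_KEY env var.
import os
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

# Hypothetical model slugs standing in for the council members described above.
COUNCIL = ["openai/gpt-5.1", "google/gemini-3-pro-preview",
           "anthropic/claude-sonnet-4.5", "x-ai/grok-4"]
CHAIRMAN = "google/gemini-3-pro-preview"  # any member could be designated chairman

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def council(query: str) -> str:
    # Stage 1: every council member answers the query independently.
    answers = [ask(m, query) for m in COUNCIL]

    # Stage 2: each member ranks the anonymized answers of the whole panel.
    anon = "\n\n".join(f"Response {i + 1}:\n{a}" for i, a in enumerate(answers))
    rank_prompt = (f"Here are anonymous responses to the query '{query}'.\n\n{anon}\n\n"
                   "Rank them from best to worst by accuracy, insight, and overall quality.")
    rankings = [ask(m, rank_prompt) for m in COUNCIL]

    # Stage 3: the chairman synthesizes answers and rankings into one final reply.
    final_prompt = (f"Query: {query}\n\nCandidate answers:\n{anon}\n\n"
                    "Peer rankings:\n" + "\n\n".join(rankings) +
                    "\n\nSynthesize a single, consolidated final answer.")
    return ask(CHAIRMAN, final_prompt)

print(council("Summarize the key themes of this chapter."))  # example usage
```

Because every answer is stripped of attribution before the ranking stage, a judge cannot systematically favor its own output, which is the property the blind review is designed to enforce.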
During his initial tests with the system, Karpathy observed a clear and consistent hierarchy emerge from the AI judges. While using the council to analyze book chapters, the models repeatedly praised GPT-5.1, identifying it as the "best and most insightful model" of the group.[1][4] The consensus was decisive, with OpenAI's offering consistently outranking its peers in this specific domain. At the other end of the spectrum, Anthropic's Claude was regularly selected as the "worst model," while Gemini and Grok occupied the middle ground.[1][4] However, Karpathy, a former head of AI at Tesla and a founding member of OpenAI, was quick to provide a crucial human counterpoint. He noted that the council's rankings did not perfectly align with his own qualitative assessment.[1] From his perspective, GPT-5.1's responses, while insightful, could be "a little too wordy and sprawled." In contrast, he found Gemini 3 to be "a bit more condensed and processed," and described Claude's output as "too terse" for the task.[1][4][6] This discrepancy underscores the subjective nature of evaluating AI-generated text and highlights how human preference for style and conciseness can differ from the models' own criteria for quality.
The implications of Karpathy's weekend project extend far beyond a simple leaderboard ranking. The experiment introduces a fascinating new paradigm for AI evaluation, moving away from static benchmarks that can be gamed and toward a more fluid, interactive form of peer assessment.[1][8] Karpathy himself remarked that "models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally."[1] This capacity for self-assessment and for recognizing quality in rival systems could pave the way for more robust and reliable testing methodologies. The concept aligns with academic research into "Language Model Councils" (LMC), which proposes that a democratic evaluation process involving a diverse group of AIs can produce rankings more consistent with human judgment, especially on subjective tasks such as evaluating emotional intelligence or creative writing.[9][10] The approach also taps into the growing field of multi-model ensembles, in which the outputs of several AIs are combined to mitigate the biases and weaknesses of any single model, potentially boosting accuracy and the quality of final responses.[3][7][11]
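For concreteness, one common way to fold many judges' orderings into a single consensus ranking is a Borda count, sketched below. This is a generic aggregation method chosen for illustration; neither Karpathy's project nor the LMC research is confirmed to use it, and the judge and response names are hypothetical.

```python
# Borda-count aggregation of per-judge rankings into one consensus order (illustrative).
from collections import defaultdict

def borda_consensus(rankings: dict[str, list[str]]) -> list[str]:
    """rankings maps each judge to its ordered list of candidates, best first."""
    scores: dict[str, int] = defaultdict(int)
    for ordered in rankings.values():
        n = len(ordered)
        for position, candidate in enumerate(ordered):
            scores[candidate] += n - position  # top placement earns the most points
    return sorted(scores, key=scores.get, reverse=True)

# Example: three judges ranking four anonymized responses.
judges = {
    "judge_a": ["resp_2", "resp_1", "resp_4", "resp_3"],
    "judge_b": ["resp_2", "resp_4", "resp_1", "resp_3"],
    "judge_c": ["resp_1", "resp_2", "resp_4", "resp_3"],
}
print(borda_consensus(judges))  # ['resp_2', 'resp_1', 'resp_4', 'resp_3']
```

The appeal of a positional scheme like this is that no single judge can dominate the outcome, which mirrors the council's premise that a diverse panel yields a steadier verdict than any one evaluator.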
In conclusion, the LLM Council has provided a compelling, if informal, snapshot of the current AI landscape from the perspective of the models themselves. The collective endorsement of GPT-5.1 by its direct competitors is a powerful testament to the model's capabilities in generating insightful, high-quality text. Yet, the experiment's true significance lies not in the crowning of a single winner, but in the innovative methodology it showcases. By creating a system where AIs critique each other, Karpathy has opened the door to more nuanced and dynamic evaluation frameworks. The divergence between the AI consensus and expert human opinion further serves as a critical reminder that metrics of quality remain complex and multifaceted. As the race towards more advanced artificial intelligence continues at a breakneck pace, this new form of digital peer review could become an invaluable tool for developers, offering a more holistic understanding of model performance and pushing the entire field toward a higher standard of excellence.
