Reinforcement learning boosts AI speed, not true reasoning, study finds.
Study suggests advanced AI refines existing knowledge for speed rather than unlocking genuinely new problem-solving capabilities.
November 11, 2025

A new study is casting doubt on the perceived advancements in artificial intelligence reasoning, suggesting that specialized "reasoning models" are not more capable than their generalist counterparts but are simply more efficient at arriving at known solutions. Research from Tsinghua University and Shanghai Jiao Tong University investigates a popular training technique called reinforcement learning with verifiable rewards (RLVR) and finds that while it helps large language models produce correct answers more frequently on the first attempt, it does not unlock new problem-solving abilities. This distinction carries significant implications for the future of AI development, questioning whether current methods are leading to genuinely smarter models or just faster ones.
The core of the research revolves around RLVR, a method used to train models on tasks with clearly verifiable outcomes, such as mathematics and programming.[1] Instead of relying on subjective human feedback, RLVR uses automated signals, like a correct calculation or a passed software test, to reward the model.[1] This technique has been employed in the development of several prominent models. The study's central finding is that this method improves a model's "pass@1" rate, which is the probability of getting the correct answer on the first try.[1][2][3][4] However, this gain in efficiency comes at a cost. The researchers discovered that the underlying base models, before being fine-tuned with RLVR, could often solve the same problems if given multiple attempts.[2][3][4][5] This suggests that RLVR does not teach the model new reasoning skills but rather trains it to more reliably access and repeat solution paths it has already learned.
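To make the idea of a "verifiable reward" concrete, the sketch below shows what such an automated signal might look like in practice: a binary score from an exact-match answer check or a passing software test. This is a minimal illustration of the general RLVR reward setup described above; the function names and toy tasks are assumptions for this article, not code from the study.

```python
# Minimal sketch of a "verifiable reward" signal as used in RLVR-style training:
# the reward is an automatic binary check (a matching answer or a passing test),
# not a subjective human preference score. Names and tasks here are hypothetical.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 only if the model's final answer matches the reference."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_src: str, test_src: str) -> float:
    """Reward 1.0 only if the candidate code passes the given tests."""
    scope: dict = {}
    try:
        exec(candidate_src, scope)   # define the candidate function
        exec(test_src, scope)        # run the asserts against it
        return 1.0
    except Exception:
        return 0.0

# Example: a correct candidate earns the reward, an incorrect one does not.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(code_reward(good, tests), code_reward(bad, tests))  # 1.0 0.0
```

Because the check is fully automatic, the training loop can score thousands of model attempts without human review, which is what makes mathematics and programming natural domains for this technique.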
A key trade-off identified by the study is between this single-try efficiency and the model's overall problem-solving breadth.[2] When evaluated on a "pass@k" metric, which counts a problem as solved if at least one correct answer is found within 'k' attempts, the base models without RLVR training eventually match or even surpass the performance of the supposedly more advanced reasoning models as 'k' increases.[2][3][4][5] The study posits that RLVR achieves its efficiency by reducing the diversity of the model's outputs, a property measured by entropy.[1] By concentrating the model's responses around a few high-reward solution paths, the training makes the model less likely to explore alternative, and potentially novel, ways of solving a problem.[1][4] This narrowing of focus means that while the model becomes better at hitting the target on the first shot, its ability to explore the entire solution space diminishes.[4][5]
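Pass@k figures of this kind are typically computed with a standard unbiased estimator: from n sampled attempts of which c are correct, estimate the chance that at least one correct attempt lands among k draws. The snippet below shows that calculation; the sample counts are invented for illustration and are not numbers from the paper.

```python
# Sketch of the pass@k metric discussed above, using the common unbiased
# estimator 1 - C(n - c, k) / C(n, k), where n attempts were sampled and
# c of them were correct. The example numbers below are made up.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn (without
    replacement) from n samples, c of them correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A base model that is right only 2 times in 100 has a very low pass@1,
# but a much higher pass@64: given enough attempts, it finds the answer.
print(round(pass_at_k(100, 2, 1), 3))   # 0.02
print(round(pass_at_k(100, 2, 64), 3))  # 0.873
```

This arithmetic illustrates the study's core contrast: a model can look weak on pass@1 yet still "know" the solution in the sense that wide sampling eventually surfaces it, which is exactly the behavior reported for the base models.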
The implications of these findings are significant for the AI industry, which has been heavily invested in the idea that techniques like RLVR are paving the way for more powerful reasoning. The research suggests that true advances in reasoning may not come from reinforcement learning alone but may require scaling up the models themselves through more extensive pre-training.[1] According to the study's lead author, Yang Yue, "RLVR is not as powerful as previously believed—it doesn't enable the model to solve problems that the base model can't solve".[1] An analysis of the reasoning paths generated by RLVR-trained models showed that these solutions were already likely to be produced by the base model, just less frequently.[2] In contrast, the study notes that techniques like distillation, where a model learns from a more powerful "teacher" model, can genuinely introduce new reasoning patterns and expand a model's capabilities beyond its original limits.[3][4][6]
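For readers unfamiliar with distillation, the toy sketch below illustrates the general idea of training a student model against a teacher's softened output distribution, which can inject response patterns the student would not produce on its own. It is a generic illustration under that assumption, not the specific procedure examined in the study, and the numbers are invented.

```python
# Generic sketch of knowledge distillation: the student is pushed to match
# the teacher's full (temperature-softened) distribution over next tokens,
# not just the single correct label. Illustrative only; not the study's setup.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temperature: float = 2.0) -> float:
    """Cross-entropy between the teacher's softened distribution and the
    student's, for a single next-token prediction."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum())

# Toy next-token logits over a 4-word vocabulary (made-up numbers).
teacher = np.array([4.0, 1.0, 0.5, 0.1])
aligned_student = np.array([3.5, 1.2, 0.4, 0.2])
misaligned_student = np.array([0.1, 0.5, 1.0, 4.0])
print(distillation_loss(aligned_student, teacher))     # small loss
print(distillation_loss(misaligned_student, teacher))  # large loss
```

The contrast with RLVR is that the learning signal here comes from a stronger external model rather than from re-weighting the student's own sampled outputs, which is why the study treats distillation as a route to genuinely new reasoning patterns.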
In conclusion, the research from Tsinghua and Shanghai Jiao Tong universities provides a critical perspective on the current state of AI reasoning. It reframes the conversation, suggesting that what the industry has labeled as enhanced reasoning may, in many cases, be a more refined and efficient search for answers already within a model's grasp. This challenges the narrative of rapid progress towards artificial general intelligence and underscores the need for more diverse and innovative approaches. While RLVR proves to be a valuable tool for improving the reliability and speed of AI models for specific tasks, the study makes it clear that creating machines that can truly reason in a novel and flexible way remains a formidable challenge, one that may not be solved by simply reinforcing what is already known.