AI Agent Out-Engineers Humans to Slash Reasoning Costs by 70 Percent
How an autonomous agent slashed AI reasoning costs by 70 percent in under three hours for just forty dollars
May 24, 2026

A team of artificial intelligence researchers from academic and industry institutions has demonstrated a shift in how large language models optimize their reasoning processes. By using an autonomous coding agent to search for new inference-time algorithms, the team discovered a control mechanism that cuts the computational cost of AI reasoning by approximately 70 percent while matching the accuracy of standard self-consistency. The research, from a consortium spanning the University of Maryland, the University of Virginia, Washington University in St. Louis, the University of North Carolina, Google, and Meta, introduces an automated, environment-driven framework called AutoTTS[1][2]. Rather than relying on human engineers to hand-craft rigid rules for allocating computational resources, the researchers constructed a simulated replay playground and let an AI agent autonomously hunt for better control strategies[1][2]. The entire automated discovery process ran in just 160 minutes and cost $39.90, yielding a sophisticated algorithm that human designers would arguably never have conceived[1][2]. The result is a milestone in agent-driven system design, suggesting that AI can out-engineer humans in complex, high-dimensional software optimization tasks[3].
At the heart of this research is a concept known as test-time scaling, which has become a primary driver of performance in state-of-the-art language models[1][3]. Test-time scaling allows a model to spend more computational power, or compute, at the moment of generating an answer, for example by running multiple reasoning paths in parallel or extending its internal chain of thought, in order to solve complex mathematical, scientific, and logical queries[1]. Historically, deciding when a model should branch out into a new path, when it should continue down an existing one, or when it should stop entirely has been governed by human-designed heuristics[1][3]. Traditional approaches like standard self-consistency run dozens of parallel paths uniformly, which is extremely expensive, while adaptive methods introduce manually tuned early-stopping or pruning thresholds based on simple intuition[3]. The researchers behind AutoTTS argued that these human-written strategies represent only a fraction of the possibilities in a vast, largely unexplored design space[1][3]. By shifting the role of the human from designing the specific control rules to designing the search environment itself, the consortium opened the door for an AI agent to navigate this space without human bias or cognitive limitations[1][2].
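To make the contrast between the two human-designed styles concrete, here is a minimal Python sketch of uniform self-consistency next to a fixed-threshold early-stopping heuristic. It is an illustration only: the function names, the `threshold` and `min_paths` values, and the toy answers are assumptions for this example and do not come from the AutoTTS work.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def self_consistency(sample_answers):
    """Uniform baseline: consume every sampled path, then vote."""
    answer, _ = majority_vote(sample_answers)
    return answer, len(sample_answers)


def early_stop_vote(sample_answers, threshold=0.8, min_paths=4):
    """Hand-tuned heuristic (hypothetical values): stop as soon as
    agreement among the paths seen so far reaches a fixed threshold."""
    seen = []
    for ans in sample_answers:
        seen.append(ans)
        if len(seen) >= min_paths:
            answer, share = majority_vote(seen)
            if share >= threshold:
                return answer, len(seen)
    answer, _ = majority_vote(seen)
    return answer, len(seen)


# Toy final answers from eight parallel reasoning paths (made-up data).
answers = ["42", "42", "17", "42", "42", "42", "17", "42"]
print(self_consistency(answers))  # uses all 8 paths
print(early_stop_vote(answers))   # stops after 5 paths with the same answer
```

Both functions return the chosen answer together with the number of paths consumed, which is the cost that a controller tries to reduce.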
The key to making this autonomous discovery both practical and cheap lies in the design of the AutoTTS framework, which reframes test-time scaling optimization as an agentic code search within a simulated environment[3][2]. Because running thousands of live language model calls during an algorithmic search would be prohibitively expensive and time-consuming, the researchers engineered an offline replay-based environment using pre-collected reasoning trajectories[4][2]. This system allowed the explorer agent to evaluate its newly generated controller algorithms on historical reasoning traces, with no new generation cost during evaluation[5][6]. The coding agent selected for this task was Anthropic's developer-focused assistant, Claude Code[1][7]. Operating in an iterative feedback loop, Claude Code proposed and edited Python-defined controllers, executed them on the replay environment, analyzed fine-grained execution trace feedback to diagnose why certain strategies failed, and refined the scripts[2][7]. To prevent overfitting and make the search space tractable, the researchers introduced beta parameterization, which allowed the system to sweep a range of accuracy-versus-cost trade-offs[8][2]. Through this self-improving feedback loop, the agent evaluated many algorithmic variations and settled on its final controller in under three hours for the cost of a cheap dinner[1][2].
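The replay idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the AutoTTS environment: the `REPLAY` records, the per-path token counts, and the way `beta` is mapped to an agreement threshold are all hypothetical stand-ins chosen to show how a controller can be scored on recorded traces without any new model calls.

```python
from collections import Counter

# Hypothetical replay data: for each question, the recorded paths as
# (final_answer, token_count) pairs plus the gold answer.
REPLAY = [
    {"gold": "7", "paths": [("7", 120), ("7", 90), ("3", 150), ("7", 110)]},
    {"gold": "12", "paths": [("5", 80), ("12", 200), ("12", 170), ("12", 140)]},
]


def threshold_controller(beta):
    """Build a stop rule from a single knob. Assumed mapping: beta is the
    agreement share required to stop, so higher beta spends more tokens."""
    def should_stop(answers_so_far):
        if len(answers_so_far) < 2:
            return False
        top_votes = Counter(answers_so_far).most_common(1)[0][1]
        return top_votes / len(answers_so_far) >= beta
    return should_stop


def replay_evaluate(should_stop, replay):
    """Score a controller on recorded paths only; no model is called."""
    correct, tokens = 0, 0
    for record in replay:
        seen = []
        for answer, cost in record["paths"]:
            seen.append(answer)
            tokens += cost  # tokens this path would have consumed
            if should_stop(seen):
                break
        prediction = Counter(seen).most_common(1)[0][0]
        correct += prediction == record["gold"]
    return correct / len(replay), tokens


def sweep(betas, replay):
    """Evaluate one controller family across several trade-off settings."""
    return {b: replay_evaluate(threshold_controller(b), replay) for b in betas}


print(sweep([0.5, 1.0], REPLAY))
```

On this toy data the looser setting stops early and gets one question wrong, while the stricter setting answers both correctly at a higher token count, which is the kind of accuracy-versus-cost curve a sweep is meant to expose.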
The central result of this automated search is an algorithm the researchers named the Confidence Momentum Controller, which coordinates several mechanisms in a way that departs from conventional human design patterns[5][6]. Instead of terminating the model's reasoning the moment the agreement among parallel paths reaches a static threshold, which is the standard approach in existing literature, this AI-designed controller implements trend-based stopping[9][6]. By tracking an exponential moving average of pool confidence over time, it terminates generation only when the overall confidence level is high and the trend is non-negative, preventing the model from stopping early due to transient confidence spikes[9]. Furthermore, the controller features a coupled width-depth mechanism in which the creation of new parallel branches and the depth of existing reasoning paths are actively linked: strong confidence gains suppress the spawning of new branches to save resources, while progress stagnation triggers branching[9]. The algorithm also implements alignment-aware depth allocation, which channels deeper steps to paths aligning with the current consensus, alongside conservative branch abandonment to deactivate stagnant branches[5]. Working in unison, these four non-obvious mechanisms allow the system to achieve the same accuracy as standard self-consistency while consuming roughly 69.5 percent fewer tokens[5].
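The difference between static and trend-based stopping is easier to see in code. The sketch below is an illustration of the general idea, not the published controller: pool confidence is assumed to be one number between 0 and 1 per step, and the smoothing factor `alpha`, the `threshold`, and the `stall_eps` cutoff are hypothetical values.

```python
def ema_series(values, alpha=0.5):
    """Exponential moving average; the first value seeds the average."""
    smoothed, ema = [], None
    for v in values:
        ema = v if ema is None else alpha * v + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed


def static_stop_step(confidences, threshold=0.8):
    """Baseline: stop at the first step whose raw confidence hits the threshold."""
    for step, conf in enumerate(confidences):
        if conf >= threshold:
            return step
    return None


def trend_stop_step(confidences, threshold=0.8, alpha=0.5):
    """Stop only when smoothed confidence is high and not falling."""
    smoothed = ema_series(confidences, alpha)
    for step in range(1, len(smoothed)):
        trend = smoothed[step] - smoothed[step - 1]
        if smoothed[step] >= threshold and trend >= 0:
            return step
    return None


def should_spawn_branch(prev_ema, curr_ema, stall_eps=0.01):
    """Coupled width-depth idea: branch when progress stalls, and hold
    back new branches while smoothed confidence is still gaining."""
    return curr_ema - prev_ema <= stall_eps


# Made-up confidence trace with a transient spike at step 1.
trace = [0.4, 0.9, 0.5, 0.7, 0.9, 0.95]
print(static_stop_step(trace))  # 1: fooled by the spike
print(trend_stop_step(trace))   # 5: waits for sustained confidence
```

The static rule halts on the early spike, while the smoothed rule keeps going until confidence is both high and stable, at the price of a few extra steps on this trace.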
Crucially, the strategies discovered by AutoTTS are not merely overfitted to a single model or a specific dataset; they generalize across different contexts[10][11]. The controller, initially optimized using mathematical reasoning benchmarks such as the American Invitational Mathematics Examination, was subsequently evaluated on held-out environments and unseen datasets, where it consistently outperformed handcrafted baselines[2][11]. Furthermore, the discovered policy generalized across four different backbone model scales, indicating that the underlying logic discovered by the AI is robust to model size[5][11]. For the AI industry, this portability is valuable[11]. As technology companies grapple with the soaring infrastructure, energy, and monetary costs of keeping up with AI scaling demands, a portable, highly optimized reasoning controller can substantially reduce operational expenditures. The ability to deploy a single controller across various models in a production pipeline without costly retuning offers an immediate competitive advantage for cloud providers and enterprises deploying high-throughput AI systems[11].
The success of AutoTTS and its agent-discovered controller signals a broader trend in the artificial intelligence industry in which models are increasingly tasked with engineering their own improvements[4]. By showing that an off-the-shelf coding assistant can out-engineer human experts in high-dimensional optimization tasks, this research points to a future where human engineers focus on defining objective functions and playground environments rather than writing rigid, hand-crafted software rules[2]. It suggests that the bottleneck of AI efficiency may lie not in hardware limits but in the boundaries of human intuition[3][2]. As autonomous agents become more sophisticated, the line between software creator and software consumer will continue to blur, ushering in an era of self-evolving systems that continuously optimize themselves for cost, speed, and accuracy[12]. Shifting the burden of algorithmic optimization to autonomous agents working inside frameworks like AutoTTS could ultimately prove to be one of the most important levers for making advanced, deep-thinking AI sustainable, affordable, and accessible at a global scale.
Sources
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]