Google DeepMind's Gemini 2.5 AI Sees and Controls Digital World Like Humans

Google's AI now autonomously navigates web and mobile interfaces like humans, unlocking unprecedented automation and advanced agentic capabilities.

October 8, 2025

Google DeepMind has introduced a significant advancement in artificial intelligence with the unveiling of its new Gemini 2.5 Computer Use model, an AI capable of autonomously operating web browsers and mobile applications.[1][2] Now available in preview through the Gemini API, this specialized model is built upon the visual understanding and reasoning capabilities of Gemini 2.5 Pro.[3][1] It empowers developers to create sophisticated AI agents that can interact with graphical user interfaces (GUIs) in a human-like manner, performing actions such as clicking, typing, and scrolling to complete complex tasks across different platforms.[3][4] This development marks a pivotal step towards creating more powerful and general-purpose AI agents that can navigate the digital world as humans do, moving beyond the limitations of structured APIs.[3][5]
The core functionality of the Gemini 2.5 Computer Use model lies in its ability to perceive and act upon visual information presented on a screen.[6] The system operates in a loop: it receives a user's request, a screenshot of the current digital environment, and a history of recent actions.[7] The model then analyzes these inputs to generate a command, such as clicking a specific button or typing text into a form.[3] This command, delivered as a function call, is then executed by client-side code.[7] After the action is performed, a new screenshot is captured and sent back to the model, and the cycle repeats until the user's objective is achieved.[3][7] This iterative process allows the AI to handle multi-step workflows, such as extracting information from one website and entering it into a customer relationship management system on another, all from a single user prompt.[3] While primarily optimized for web browsers, the model has also shown strong promise in controlling mobile app interfaces, though it is not yet designed for desktop operating system control.[3][7]
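To make that loop concrete, here is a minimal, self-contained sketch in Python. It illustrates the request–screenshot–action cycle described above; it is not Google's actual client code. The `Action` class and every helper function are hypothetical stand-ins: a real integration would call the Gemini API for the model step and a browser automation tool for the execute step.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A single UI action proposed by the model, e.g. click, type, scroll."""
    name: str
    args: dict = field(default_factory=dict)

def capture_screenshot() -> bytes:
    """Stand-in for grabbing a screenshot of the current browser state."""
    return b"<png bytes>"

def request_next_action(goal: str, screenshot: bytes, history: list[Action]) -> Action:
    """Stand-in for the model call: goal + screenshot + recent actions go in,
    one function call (the next UI action) comes back. Here we end at once."""
    return Action("done")

def execute_action(action: Action) -> None:
    """Stand-in for client-side code that performs the action in the browser."""
    print(f"executing {action.name} with {action.args}")

def run_agent(goal: str, max_steps: int = 20) -> None:
    history: list[Action] = []
    for _ in range(max_steps):
        action = request_next_action(goal, capture_screenshot(), history)
        if action.name == "done":   # model signals the task is complete
            break
        execute_action(action)
        history.append(action)      # recent actions are fed back each turn

run_agent("Find the pricing page and copy the Pro tier price")
```

The essential design point, as the article notes, is that the model never touches the browser directly: it only proposes actions as function calls, and client-side code decides how, and whether, to carry them out.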
This new model represents a substantial leap forward for AI-powered automation.[5] Early testers and Google's internal teams have already demonstrated its potential. For instance, the AI assistant company Poke.com reported that Gemini 2.5 Computer Use is often 50% faster and more effective than competing solutions.[3] Another company, Autotab, reported an 18% improvement in reliably parsing context in complex situations.[3] Google's own payments team has used the model to automatically repair over 60% of failed user interface tests, a process that previously took days to resolve manually.[3] The model has demonstrated leading performance on multiple web and mobile control benchmarks, outperforming alternatives while operating at lower latency.[3][8] This efficiency and reliability are crucial for deploying autonomous agents in real-world business processes, where mistakes can be costly.[3] Versions of this technology already power features in Project Mariner, a research prototype for human-agent interaction, and AI Mode in Google Search.[7][1][9]
The introduction of autonomous agents capable of controlling user interfaces has broad implications for the AI industry and the future of software development and automation.[4][10] Such technology can automate repetitive data entry, perform comprehensive testing of web applications, and streamline complex digital workflows, freeing up human workers to focus on more creative and strategic tasks.[6][4] This shift aligns with a growing trend in the AI field toward "agentic AI," where autonomous systems can independently plan and execute multi-step tasks to achieve a goal.[11][12] The global market for autonomous AI and agents is projected to grow significantly, indicating a major shift in how businesses operate.[10] However, the power of these autonomous systems also introduces new challenges and risks.[13] To address these, Google has built safety features directly into the model, including per-step risk checks and developer-controlled restrictions on high-risk actions to prevent misuse.[8][14]
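As a rough illustration of what a developer-controlled restriction might look like, the sketch below gates a proposed action behind human confirmation whenever it appears on a deny list. The action names and the `HIGH_RISK_ACTIONS` set are invented for this example and are not Gemini API identifiers; the per-step risk checks the article describes are built into the model itself, and a client-side pattern like this would merely complement them.

```python
from typing import Callable

# Hypothetical deny list: action types the developer treats as high-risk.
# These names are illustrative assumptions, not real API identifiers.
HIGH_RISK_ACTIONS = {"submit_payment", "delete_account", "send_message"}

def guarded_execute(action_name: str, execute: Callable[[], None]) -> bool:
    """Execute the action only if it is low-risk or a human approves it."""
    if action_name in HIGH_RISK_ACTIONS:
        answer = input(f"Agent wants to '{action_name}'. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            print("Blocked by guardrail.")
            return False
    execute()
    return True

# Example: a click passes through; a payment requires explicit confirmation.
guarded_execute("click", lambda: print("clicked"))
guarded_execute("submit_payment", lambda: print("payment submitted"))
```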
In conclusion, the release of Google's Gemini 2.5 Computer Use model is a landmark event, signaling a new era of human-computer interaction. By enabling AI to directly manipulate the graphical interfaces designed for humans, Google has unlocked a vast new potential for automation across countless digital tasks. While the technology is still in its preview stage and requires careful implementation with safety guardrails, its superior performance in early benchmarks and the tangible benefits reported by initial users highlight its transformative power.[3][6] As developers begin to build on this platform, the capabilities of AI agents are set to expand dramatically, further integrating intelligent automation into the fabric of our digital lives and business operations. This move pushes the entire AI industry forward, setting a new standard for what is possible with autonomous systems and redefining the boundaries of software interaction.[11][13]
