Google DeepMind AI agents now control software like humans

Google's Gemini 2.5 model empowers AI agents to become active digital participants, navigating and manipulating software through GUIs.

October 8, 2025

In a significant move toward creating more autonomous and capable artificial intelligence, Google DeepMind has released the Gemini 2.5 Computer Use model, a new system designed to power AI agents that can directly interact with graphical user interfaces.[1][2] This specialized model, built upon the visual understanding and reasoning prowess of Gemini 2.5 Pro, enables AI to navigate and manipulate software just as a human would, by clicking, typing, and scrolling through websites and applications.[3][4] The release, now in public preview for developers, marks a critical advancement in the quest to build general-purpose agents capable of handling complex, multi-step digital tasks, pushing the boundaries of automation beyond the limitations of structured APIs.[3][1] The new technology allows AI agents to perform tasks like filling out forms, manipulating interactive elements such as dropdown menus, and even operating behind user logins, a crucial step in making AI assistants more practical for real-world workflows.[3][1]
The core of the Gemini 2.5 Computer Use model is an iterative loop that combines visual analysis with action execution.[3][4] The process begins when a user provides a request, which is paired with a screenshot of the current digital environment and a history of recent actions. The model analyzes these inputs and generates a response in the form of a specific UI action, such as a command to click a button or type text into a field.[3][4][5] This command is then executed by client-side code. After the action is performed, a new screenshot is captured and sent back to the model, restarting the loop with updated context.[1][4] The cycle continues until the task is completed, an error occurs, or the process is terminated by the user or a safety protocol.[1] While the system is primarily optimized for web browsers, Google notes that it also shows strong potential for controlling mobile user interfaces, though it is not yet designed for desktop operating-system-level control.[3][1][2] This focus on browser-based tasks has reportedly paid dividends in performance.
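Conceptually, that loop amounts to a small piece of client-side orchestration around the model. The sketch below is a minimal illustration of the idea only, not Google's API: the take_screenshot, propose_action, and execute_action callables, the UIAction shape, and the step limit are hypothetical stand-ins for the actual Gemini Computer Use calls and for whatever browser automation a developer plugs in.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class UIAction:
    """One step proposed by the model (illustrative shape only)."""
    kind: str                  # e.g. "click", "type", "scroll", or "done"
    target: Optional[str] = None
    text: Optional[str] = None

def run_agent_loop(
    user_request: str,
    take_screenshot: Callable[[], bytes],                              # captures the current UI state
    propose_action: Callable[[str, bytes, list[UIAction]], UIAction],  # wraps the model call
    execute_action: Callable[[UIAction], None],                        # client-side executor (e.g. a browser driver)
    max_steps: int = 25,
) -> list[UIAction]:
    """Iterate: screenshot -> model proposes a UI action -> client executes -> repeat."""
    history: list[UIAction] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = propose_action(user_request, screenshot, history)
        if action.kind == "done":      # model signals the task is complete
            break
        execute_action(action)         # client code performs the click/type/scroll
        history.append(action)         # recent actions are fed back as context next turn
    return history
```

The division of labor is the point: the model only proposes actions from what it sees in the screenshot, while the developer's own code performs them and reports the result back as a fresh screenshot on the next iteration.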
Google DeepMind has substantiated the model's capabilities by highlighting its strong performance on several industry benchmarks for web and mobile control, including Online-Mind2Web, WebVoyager, and AndroidWorld.[3][4] The company asserts that its model not only delivers high accuracy, exceeding 70% in some tests, but does so with lower latency than leading alternatives, a critical factor for creating a seamless user experience.[3][1] Demonstrations showcase the model's practical applications, such as autonomously transferring pet-care data from a webpage into a customer relationship management (CRM) system and organizing a chaotic board of digital sticky notes into predefined categories.[3][4] These examples illustrate a shift from simple command execution to more sophisticated, agentic workflows where the AI can interpret context, make decisions, and execute tasks across different digital environments.[6] Early adopters have reported significant performance increases, with one firm noting an 18% improvement on its most difficult evaluations and another successfully using the model to autonomously recover from and fix over 60% of script execution failures that previously required manual intervention.[1]
The release of the Gemini 2.5 Computer Use model intensifies competition in the burgeoning field of AI agents. Companies like OpenAI and Anthropic have already introduced models with similar computer-control capabilities, leaving Google playing catch-up in some respects.[4] However, unlike some competing tools that aim for broader operating system control, Google's focused optimization for web browsers appears to have yielded superior performance within that domain.[4] The development of AI agents that can interact with GUIs is seen by many in the industry as a transformative step.[7] It removes the dependency on developers creating specific APIs for every application, instead allowing the AI to interact with software through the same visual interface designed for humans.[8] This capability is enabled by advances in multimodal large language models that can process visual inputs like screenshots, reason about the elements on the screen, and perform actions.[7][8] While significant progress has been made, challenges remain in reliably handling the dynamic and variable nature of user interfaces and in reducing inference delays to match human expectations.[7]
Recognizing the inherent risks of granting AI control over computer functions, Google DeepMind has emphasized the integration of safety features. The company acknowledges that such agents could be misused, exhibit unexpected behavior, or fall prey to web-based scams and prompt injections.[1] To mitigate these risks, safety guardrails have been trained directly into the model, and developers are provided with controls to prevent harmful actions.[3] These controls can be configured to require user confirmation before the agent performs certain sensitive actions or to refuse specific requests outright.[3] The Gemini 2.5 Computer Use model is now accessible to developers through the Gemini API in both Google AI Studio and Vertex AI, allowing the broader tech community to begin building and experimenting with these powerful new agentic capabilities.[3][1] This preview release signals Google's intent to rapidly iterate and improve upon its AI's ability to act as a truly helpful digital assistant.
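To make the confirmation-and-refusal controls concrete, here is a minimal sketch of what a developer-side gate around action execution could look like. It reuses the hypothetical UIAction shape from the earlier sketch; the action categories, function names, and policy are illustrative assumptions, not the actual controls exposed by the Gemini API.

```python
# Hypothetical policy: which kinds of proposed actions are refused outright,
# and which require explicit user confirmation before the client executes them.
BLOCKED_KINDS = {"purchase", "delete_account"}
CONFIRM_KINDS = {"submit_form", "send_message"}

def guarded_execute(action, execute_action, ask_user) -> bool:
    """Refuse blocked actions, ask the user before sensitive ones, run the rest.

    Returns True if the action was executed, False if it was refused or declined.
    """
    if action.kind in BLOCKED_KINDS:
        return False                                        # refuse outright
    if action.kind in CONFIRM_KINDS and not ask_user(f"Allow '{action.kind}'?"):
        return False                                        # user declined confirmation
    execute_action(action)                                  # otherwise execute as usual
    return True
```

A gate like this would sit between the model's proposed action and the executor inside the loop shown earlier, which is one plausible place for the per-action confirmation behavior the announcement describes.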
In conclusion, the launch of Google's Gemini 2.5 Computer Use model is a landmark development in the evolution of artificial intelligence. By enabling AI agents to understand and interact with graphical interfaces, it opens up a vast new landscape for automation, from streamlining complex business processes to simplifying everyday digital chores for consumers. While the technology is still in its early stages and faces competition and challenges related to safety and reliability, it represents a fundamental shift in how humans and machines collaborate. The ability to autonomously navigate the digital world through its native visual language moves AI from a passive tool to an active participant, heralding a future where intelligent agents can execute complex tasks on our behalf with increasing autonomy and efficiency.
