Microsoft Fara-7B: Small AI visually controls PC, outperforming larger models.
Fara-7B visually perceives your screen, automating tasks locally for enhanced privacy and speed.
November 29, 2025

Microsoft has unveiled a compact yet powerful artificial intelligence system, Fara-7B, engineered to operate a computer's user interface from purely visual input, much as a human would. The 7-billion-parameter model marks a significant stride toward capable on-device AI, offering the potential to automate complex tasks locally on consumer hardware and thereby enhance privacy and reduce processing delays. Described by the company as its first "agentic" small language model designed specifically for computer use, Fara-7B interprets on-screen information and executes actions by controlling the mouse and keyboard, in stark contrast to traditional AI assistants that rely on cloud computing and structured data feeds.[1][2][3][4] The experimental release points to a future in which AI agents seamlessly handle everyday digital chores such as booking travel, filling out forms, or comparing products online, all without sending sensitive data to external servers.[5][2][6]
At its core, Fara-7B is a "Computer Use Agent" (CUA), a category of AI that interacts directly with graphical user interfaces.[7][8][6] Its primary innovation lies in its methodology: the model perceives a webpage or application visually, through screenshots alone.[9][2][3] It does not require access to underlying code, accessibility trees, or other metadata that many automation tools depend on.[1][2][4] This lets Fara-7B work with the same modalities as a person, interpreting the layout, text, and icons on the screen to decide its next move, whether that is clicking on specific coordinates, typing text into a field, or scrolling through a page.[9][2] The entire process is handled by a single, end-to-end multimodal model, collapsing the chain of parsing, reasoning, and acting that often requires multiple larger AI systems working in concert.[7][10] The architecture is based on the Qwen2.5-VL-7B model, chosen for its strong grounding capabilities and a context window of up to 128,000 tokens, which lets the agent keep a history of its actions across multi-step tasks.[11][2][12]
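To make that perceive-act cycle concrete, the following is a minimal sketch of a screenshot-only control loop, assuming pyautogui for screen capture and input control. The query_fara function and its action-dictionary format are illustrative placeholders for calling the model locally, not Fara-7B's actual interface.

```python
# Minimal sketch of a screenshot-driven computer-use loop.
# query_fara() is a hypothetical stand-in for the model call;
# Fara-7B's real prompt and action formats may differ.
import pyautogui

def query_fara(task: str, history: list, screenshot) -> dict:
    """Placeholder: given the goal, the action history, and the current
    screen, return one grounded action, e.g. {"kind": "click", "x": 412, "y": 90}."""
    raise NotImplementedError("wire this to a local Fara-7B inference endpoint")

def run_agent(task: str, max_steps: int = 16) -> None:
    history: list[dict] = []
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()   # raw pixels only: no DOM, no accessibility tree
        action = query_fara(task, history, screenshot)
        if action["kind"] == "done":          # model reports the task is finished
            return
        elif action["kind"] == "click":
            pyautogui.click(action["x"], action["y"])  # click at predicted screen coordinates
        elif action["kind"] == "type":
            pyautogui.write(action["text"])            # type into the focused field
        elif action["kind"] == "scroll":
            pyautogui.scroll(action["dy"])             # positive scrolls up, negative down
        history.append(action)                # history feeds the model's long context
```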
Despite its relatively small size, Fara-7B delivers performance competitive with, and in some cases superior to, much larger and more resource-intensive AI systems.[1][2] On the WebVoyager benchmark, a test of web navigation, Fara-7B achieved a success rate of 73.5%, outperforming a configuration of OpenAI's significantly larger GPT-4o, which scored 65.1%.[9][2] Beyond raw success rates, the model is notably efficient, completing tasks in an average of about 16 steps versus 41 for a comparable 7B-parameter model.[2][13] That efficiency translates into lower computational costs and faster task completion.[14][15] The key advantage of its compact design is that it can run natively on a personal computer, with silicon-optimized versions planned for Windows 11 Copilot+ PCs.[9][5][3] This local execution is a fundamental differentiator from cloud-dependent assistants: all user data stays on the device, bolstering privacy and minimizing the latency of round-trips to data centers.[1][2][3]
The development of such a capable small model was made possible by an innovative approach to data generation. A key bottleneck in training computer control agents is the scarcity of high-quality, step-by-step data of human web interactions.[11][3] To overcome this, Microsoft developed FaraGen, a synthetic data pipeline that uses an AI agent system to perform tasks on real websites across tens of thousands of domains.[14][11] This system generated 145,630 verified sessions containing over one million individual actions, creating a massive and realistic dataset that taught the model to handle the complexities of human-like web navigation, including mistakes and retries.[14][11] In a move to spur further innovation, Microsoft has released Fara-7B as an open-weight model under a permissive MIT license, making it accessible to developers and researchers through platforms like Hugging Face and Microsoft Foundry.[7][14][5] This allows the community to experiment with and build upon the technology for a wide range of applications, from automating repetitive business workflows to creating new accessibility tools.[5][10][6]
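Because the checkpoint is open-weight and built on Qwen2.5-VL, it can presumably be loaded with Hugging Face transformers' standard Qwen2.5-VL classes. The sketch below assumes the repo id "microsoft/Fara-7B" and the stock Qwen2.5-VL processing pipeline; consult the model card for the repo id and prompt format Fara-7B actually expects.

```python
# Hedged sketch: loading the open weights with Hugging Face transformers.
# Assumes the checkpoint keeps the Qwen2.5-VL architecture and that the
# repo id is "microsoft/Fara-7B" -- verify both on the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "microsoft/Fara-7B"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One perceive-act turn: a screenshot plus the task prompt go in,
# and a textual action (e.g. a click with screen coordinates) comes out.
screenshot = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Task: add the cheapest USB-C cable to the cart."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```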
While Fara-7B represents a significant advance in on-device AI, Microsoft acknowledges that it remains experimental and has clear limitations.[2][16] The company's own testing shows the model can struggle with accuracy on more complex tasks, occasionally makes mistakes when following instructions, and is susceptible to hallucinations, a common challenge for current AI systems.[1][2][13] Safety was a core design consideration: the model is trained to refuse tasks involving illegal activities, impersonation, or other high-risk domains, and to halt at critical points, moments that require user permission or sensitive information.[7][12][3] Microsoft recommends running Fara-7B in a sandboxed, controlled environment where its execution can be monitored, and avoiding use with sensitive data.[5][2] The release of Fara-7B is a clear signal of a broader industry shift toward compact, private, and efficient AI agents that can be integrated more deeply, and more safely, into daily computing life.
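As one illustration of what monitored execution might look like in practice, a host application could wrap every proposed action in a confirmation gate. The keyword heuristic below is a toy stand-in for the model's own trained stopping behavior, not anything Microsoft ships.

```python
# Illustrative guard layer for a sandboxed agent run: actions that look
# like "critical points" (payments, credentials, irreversible submissions)
# are held until a human approves. All field names and the keyword list
# are hypothetical.
CRITICAL_KEYWORDS = ("password", "credit card", "cvv", "place order", "send")

def is_critical(action: dict) -> bool:
    text = str(action.get("text", "")).lower()
    typed_secret = action["kind"] == "type" and any(k in text for k in CRITICAL_KEYWORDS)
    risky_click = str(action.get("label", "")).lower() in {"buy now", "place order", "submit payment"}
    return typed_secret or risky_click

def guarded_execute(action: dict, execute) -> bool:
    """Run execute(action) only if it is safe or the user approves.
    Returns False on a veto so the agent loop can halt."""
    if is_critical(action):
        answer = input(f"Agent wants to perform {action!r}. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return False
    execute(action)
    return True
```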
Sources
[1]
[3]
[5]
[6]
[7]
[8]
[10]
[11]
[12]
[13]
[15]
[16]