China's DeepEyesV2 Makes AI Smarter, Outperforming Giants With External Tools

This Chinese innovation shows that smaller, tool-smart AI can beat bigger models, redefining the path to truly capable machine intelligence.

November 16, 2025

In a significant stride for artificial intelligence, Chinese researchers have developed a multimodal model, DeepEyesV2, that demonstrates a powerful new direction in AI design, favoring the intelligent use of external tools over the sheer accumulation of internal knowledge. This "agentic" model, which can analyze images, write and execute code, and search the web, has shown an ability to outperform larger, more data-intensive rivals on a variety of complex tasks. The development signals a potential shift in the AI industry, suggesting that the path to more capable and reliable AI may lie not just in building bigger models, but in creating smarter ones that can adeptly leverage external resources.
Developed by researchers at the technology company Xiaohongshu, DeepEyesV2 is built upon the open-source Qwen2.5-VL-7B model.[1][2] What sets it apart is a unified reasoning loop that integrates programmatic code execution and web retrieval as interleavable tools.[1][3][4] Unlike traditional models, which reason solely from their training data, DeepEyesV2 can actively decide when to use a tool: running Python code in a sandboxed environment to perform calculations, cropping an image for detailed analysis, or querying a web search API to gather real-time information.[1][2][4] This iterative process, in which the model plans a step, invokes a tool, incorporates the results, and continues its reasoning, allows for more dynamic and robust problem solving.[1][4] For example, when asked to identify a flower in a photograph, the model can first crop the relevant area, run a visual web search on the cropped image, and then use text search to verify the species, a multi-step process that mirrors human analytical methods.[1][5]
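The paper's own control code is not reproduced in this article, but the loop it describes maps onto a familiar agent pattern. The following is a minimal Python sketch of that plan-invoke-incorporate cycle; the message format, the `query_model` stub, and both tools here are hypothetical stand-ins, not DeepEyesV2's actual interface.

```python
# Minimal sketch of an interleaved tool-use loop (hypothetical names
# throughout; an illustration, not DeepEyesV2's real API).
import contextlib
import io

def run_python(code: str) -> str:
    """Toy 'sandboxed' interpreter: run code, capture stdout.
    A production system would isolate execution far more strictly."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # empty globals; builtins are injected automatically
    return buffer.getvalue()

def web_search(query: str) -> str:
    """Stub for a web-retrieval tool; a real agent would call a search API."""
    return f"[search results for {query!r}]"

TOOLS = {"run_python": run_python, "web_search": web_search}

def query_model(history: list) -> dict:
    """Stub for the multimodal model: emit one tool call, then answer
    from the tool's output. A real model decides this dynamically."""
    if not any(msg["role"] == "tool" for msg in history):
        return {"type": "tool", "name": "run_python",
                "args": {"code": "print(17 * 24)"}}
    return {"type": "answer", "content": history[-1]["content"].strip()}

def agent_loop(question: str, max_steps: int = 8) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = query_model(history)                  # plan the next step
        if step["type"] == "answer":                 # reasoning finished
            return step["content"]
        result = TOOLS[step["name"]](**step["args"])         # invoke tool
        history.append({"role": "tool", "content": result})  # feed back
    return "step budget exhausted"

print(agent_loop("What is 17 * 24?"))  # -> 408
```

A real implementation would add image tools such as cropping and a vision-capable model behind `query_model`, but the control flow is the same: the model, not a fixed pipeline, decides which tool to call next.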
The key to unlocking this capability is a two-stage training pipeline. The researchers found that reinforcement learning (RL) alone was not enough to teach the model to use tools effectively.[4][6][7] Instead, they first implemented a "cold-start" phase using supervised fine-tuning.[6][7] During this stage, the model was trained on a carefully curated dataset of moderately challenging problems where tool use is explicitly beneficial, establishing foundational tool-use patterns.[3][6][8] A reinforcement learning stage then refined these skills, allowing DeepEyesV2 to learn more complex tool combinations and to invoke tools selectively, based on the problem at hand.[4][6][8][9] This training regimen yields task-adaptive behavior: the model tends to use image operations for perception-based queries and numerical computation for mathematical reasoning tasks.[4][6]
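As a rough schematic of that recipe (not the paper's actual training code), the two stages can be separated as below; `Trajectory`, `cold_start_sft`, and `rl_refine` are placeholder names, and the real data curation, loss functions, and RL algorithm are omitted.

```python
# Schematic of the two-stage pipeline: cold-start SFT, then RL.
# All names are hypothetical placeholders for illustration only.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    prompt: str
    steps: List[str]  # interleaved reasoning, tool calls, and tool results
    answer: str

def cold_start_sft(train_step: Callable[[Trajectory], None],
                   curated_data: List[Trajectory]) -> None:
    """Stage 1: supervised fine-tuning on curated trajectories in which
    tool use is explicitly beneficial, to instill the basic pattern."""
    for trajectory in curated_data:
        train_step(trajectory)

def rl_refine(rollout: Callable[[str], Trajectory],
              policy_update: Callable[[Trajectory, float], None],
              reward: Callable[[Trajectory], float],
              tasks: List[str], episodes: int = 1000) -> None:
    """Stage 2: reinforcement learning that rewards correct outcomes,
    letting the model learn when (and when not) to invoke each tool."""
    for episode in range(episodes):
        trajectory = rollout(tasks[episode % len(tasks)])  # may call tools
        policy_update(trajectory, reward(trajectory))
```

The ordering matters: the cold start hands the RL stage a policy that already emits well-formed tool calls, so reinforcement learning can concentrate on when tools should be used rather than how.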
The performance of DeepEyesV2 has been validated across multiple benchmarks, including RealX-Bench, a suite the researchers built specifically to test the integration of perception, search, and reasoning.[5][6] On this new benchmark, DeepEyesV2 improved average accuracy by 6 percentage points over its base model.[1] The model also posted substantial gains on established tests of mathematical and real-world reasoning: a 7.1-percentage-point improvement on the MathVerse benchmark and an 11.5-percentage-point increase on the MMSearch information-retrieval benchmark, surpassing specialized models.[1] These results indicate that, by systematically integrating external tools, a smaller model can achieve more reliable and extensible reasoning than larger models that rely solely on internalized knowledge, which can be static or outdated.
The emergence of models like DeepEyesV2 carries significant implications for the future of AI development. It challenges the prevailing "bigger is better" paradigm, which has led to a race to create ever-larger language models that are computationally expensive to train and operate.[10] The tool-centric approach offers a more efficient and potentially more accurate path forward. By offloading tasks like complex calculations or fact-checking to specialized external tools, the core AI model can remain relatively small and agile. This approach not only enhances performance but also addresses the issue of AI "hallucinations" by grounding the model's responses in real-time, verifiable data from the web or precise outputs from a code interpreter.[10] As the project is fully open-source, with its code, datasets, and model weights publicly available, it provides a valuable resource for the wider research community to build upon and explore this promising agentic, tool-augmented framework.[1]
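To make the grounding point concrete, here is a small illustration (our own, not from the paper) of why offloading arithmetic helps: a tiny safe-expression evaluator, standing in for a sandboxed interpreter, returns exact results the model can quote instead of estimating from memory.

```python
# Hypothetical stand-in for a code-interpreter tool: a safe evaluator
# for plain arithmetic, so answers are computed rather than recalled.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, /, ** over numeric literals; reject anything else."""
    def evaluate(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

print(safe_eval("(7 ** 3 - 43) / 4"))  # exact, tool-computed: 75.0
```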
