Baidu ERNIE-4.5-VL: Open-Source Visual AI Outperforms Google, OpenAI Rivals

Baidu's ERNIE-4.5-VL brings human-like visual reasoning to open source, efficiently challenging proprietary AI giants.

November 12, 2025

In a significant move for the open-source artificial intelligence community, Chinese technology giant Baidu has released ERNIE-4.5-VL-28B-A3B-Thinking, a multimodal AI model capable of sophisticated visual reasoning. The model can process and interpret images as part of its reasoning process, a capability that has largely been the domain of proprietary systems from companies like OpenAI and Google. By offering a high-performance, cost-effective alternative that is freely available for commercial use, the release challenges the status quo and could accelerate the adoption of advanced AI applications across industries. It is not merely an incremental update; it represents a strategic push toward making powerful AI more accessible and efficient, with potentially wide-ranging implications for the global AI landscape.
At the core of ERNIE-4.5-VL's performance is its architecture. The model is built on a Mixture of Experts (MoE) framework, which is a key factor in its efficiency.[1] Although it contains roughly 28 billion parameters in total, it activates only about three billion for any given inference.[2][3][1][4][5] This "lightweight" approach significantly reduces the computational cost and memory footprint typically associated with large-scale AI models, allowing the model to run on a single 80 GB GPU such as an Nvidia A100.[4] The efficiency does not come at the expense of capability: the MoE design lets the model maintain a large knowledge capacity while using only a fraction of its parameters for any specific inference.[1] The architecture also features a heterogeneous design with shared parameters for text and vision alongside modality-specific experts, which enhances its ability to handle complex multimodal tasks.[3] Baidu has released the model under the permissive Apache 2.0 license, allowing both personal and commercial use and signaling a strong commitment to the open-source community.[1][4][5]
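To make the efficiency argument concrete, the toy Python sketch below illustrates the general idea behind top-k MoE routing: a router scores a set of expert networks for each token, and only the few highest-scoring experts actually run, so most of the layer's parameters stay idle on any given input. This is a generic illustration of the technique, not Baidu's implementation; the layer sizes, expert count, and top-k value are arbitrary placeholders.

```python
# Toy Mixture-of-Experts layer with top-k routing (illustrative only, not ERNIE code).
# Only top_k of num_experts expert MLPs run per token, so most parameters stay inactive.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)       # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        scores = self.router(x)                              # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                       # route each token to its chosen experts
            for e in range(len(self.experts)):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512]); only 2 of 8 expert MLPs ran per token
```

The point of the sketch is the ratio: the layer holds the parameters of all experts, but each token pays the compute cost of only two of them, which is the same principle that lets a ~28-billion-parameter model activate only about three billion parameters per inference.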
ERNIE-4.5-VL's capabilities extend far beyond simple image recognition. It is designed to understand and reason about dense, non-textual data often found in enterprise settings, such as engineering schematics, medical scans, and logistics dashboards.[2] One of its most notable features is "Thinking with Images," which allows the model to intelligently zoom in on specific regions of an image to analyze fine-grained details before synthesizing that information into a comprehensive answer.[3][1][4] This mimics a human-like approach to visual analysis. The model has demonstrated its ability to perform complex tasks such as solving a bridge circuit diagram by applying Ohm's and Kirchhoff's laws, interpreting charts to find optimal visiting hours, and extracting subtitles from videos and mapping them to precise timestamps.[2] Furthermore, ERNIE-4.5-VL can integrate with external tools; for example, if it encounters an unknown object in an image, it can trigger a web search to identify it.[2][3] This combination of visual grounding, tool utilization, and deep reasoning positions the model as a powerful asset for automating complex analytical and operational tasks.[2][3]
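The sketch below gives a rough picture of how such a "Thinking with Images" and tool-use loop might be orchestrated: the model can request a zoomed crop of an image region or an external web search before committing to a final answer. Every function and action name here (call_vlm, crop_region, web_search) is a hypothetical placeholder; this is not ERNIE's API, only an illustration of the control flow described above.

```python
# Hypothetical reasoning loop sketching "Thinking with Images" plus tool use.
# All names are placeholders; the stubs simulate model and tool behavior.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "zoom", "search", or "answer"
    argument: object

def call_vlm(image, question, context):
    """Stand-in for a multimodal model call that may request a tool before answering."""
    if "region" not in context:
        return Action("zoom", (100, 40, 360, 220))   # ask to inspect a sub-region in detail
    if "search_result" not in context:
        return Action("search", "unidentified component in schematic")
    return Action("answer", "final answer synthesized from the zoomed region and search result")

def crop_region(image, box):
    return f"{image}[crop {box}]"                    # stand-in for real image cropping

def web_search(query):
    return f"top results for: {query}"               # stand-in for a real search call

def reasoning_loop(image, question, max_steps=5):
    context = {}
    for _ in range(max_steps):
        action = call_vlm(image, question, context)
        if action.kind == "zoom":                    # model zooms in on fine-grained detail
            context["region"] = crop_region(image, action.argument)
        elif action.kind == "search":                # model consults an external tool
            context["search_result"] = web_search(action.argument)
        else:
            return action.argument                   # final synthesized answer
    return "no answer within step budget"

print(reasoning_loop("bridge_circuit.png", "What is the unknown resistance?"))
```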
The release of ERNIE-4.5-VL is poised to have a significant impact on the competitive AI industry. Baidu has released benchmark data suggesting that its new model outperforms larger, commercial counterparts like Google's Gemini 2.5 Pro and OpenAI's GPT-5 High on several key multimodal tests, including MathVista and ChartQA.[2] While these claims have yet to be independently verified, the reported performance of a lightweight, open-source model matching or exceeding that of industry-leading proprietary systems is a noteworthy development.[4] The cost-effectiveness of the ERNIE model is also a major disruptive factor. By offering comparable or superior performance at a fraction of the computational cost, Baidu is intensifying the AI price war and making sophisticated AI more affordable for a wider range of businesses and developers.[6][7] This shift could force Western AI leaders to innovate more rapidly and potentially lower their own costs, ultimately benefiting end-users with more powerful and accessible AI solutions.[6] The move is part of a broader trend of Chinese tech companies embracing open-source to drive AI development and diffusion, fostering a more collaborative and competitive global ecosystem.[8]
In conclusion, Baidu's ERNIE-4.5-VL-28B-A3B-Thinking represents a convergence of several key trends in artificial intelligence: advanced multimodal reasoning, efficient model architecture, and a commitment to open-source principles. By delivering a model that is both powerful and accessible, Baidu is not only challenging established players but also empowering a global community of developers and researchers to build the next generation of AI applications. The ability to reason with visual data has the potential to unlock valuable insights from the vast amount of non-textual information in the world, and by open-sourcing this technology, Baidu is accelerating that process. As the AI landscape continues to evolve, the impact of efficient, open-source models like ERNIE-4.5-VL will likely be a driving force in shaping a more democratized and innovative future for artificial intelligence.
