Baidu's ERNIE AI Surpasses OpenAI/Google, Delivers Efficient Multimodal Insights for Business.

Outperforming rivals on key benchmarks, Baidu's ERNIE 4.5 targets enterprise AI with strong multimodal processing at a reported 1% of the cost.

November 12, 2025

In a significant development within the competitive artificial intelligence landscape, Chinese technology giant Baidu has released a new multimodal AI model that is reportedly outperforming leading models from OpenAI and Google on several key industry benchmarks. The model, named ERNIE-4.5-VL-28B-A3B-Thinking, is distinguished not only by its performance but also by its highly efficient architecture, which is specifically designed to interpret and analyze complex, non-textual data crucial to enterprise operations. This move signals a direct challenge to the dominance of text-focused models and highlights a growing emphasis on AI that can unlock insights from a wider variety of data formats.
At the core of Baidu's latest offering is an efficient design. The new ERNIE model is built on a Mixture of Experts (MoE) architecture that contains roughly 28 to 30 billion parameters in total but activates only a fraction of them, around three billion, for any given input.[1][2][3] This "lightweight" approach sharply reduces the compute and associated cost of inference, a common barrier to adopting and scaling AI projects within large organizations.[2] Because only the selected experts run, the model can draw on the capacity of the full parameter set while keeping per-request computation, and therefore latency, closer to that of a much smaller model.[1][3] The model is part of the broader ERNIE 4.5 family, developed on Baidu's in-house PaddlePaddle deep learning platform, underscoring the company's commitment to building an independent AI ecosystem.[4][5] This focus on efficiency and cost-effectiveness represents a strategic bet that these factors will be critical drivers of enterprise adoption.[2][6]
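To make the idea of sparse activation concrete, here is a minimal toy sketch of top-k expert routing, the general mechanism behind Mixture of Experts models. It is not Baidu's implementation: the eight scalar "experts", the router scores, and the helper names (`route_top_k`, `moe_forward`) are all hypothetical and chosen only for illustration. The final lines work out the rough share of parameters active per input from the article's own figures (about 3 billion of about 28 billion).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize the router scores over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and blend their outputs by router weight."""
    selected = route_top_k(router_logits, k)
    output = sum(w * experts[i](x) for i, w in selected)
    return output, [i for i, _ in selected]

# Hypothetical toy setup: 8 experts, each a simple scalar function.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # assumed scores

output, used = moe_forward(2.0, experts, router_logits, k=2)
print("experts that ran:", used)  # the other 6 experts do no work

# Share of parameters active per input, using the article's approximate figures.
total_b, active_b = 28, 3
active_fraction = active_b / total_b
print(f"active share: {active_fraction:.1%}")
```

In a real model the active parameter count also includes components every input passes through, such as attention layers, so the ratio above is only a rough guide to the compute saved.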
Baidu supports its claims with results on several challenging multimodal benchmarks. In the reported comparisons, ERNIE-4.5-VL-28B-A3B-Thinking scored higher than GPT and Gemini models in tests designed to evaluate visual reasoning and data interpretation.[2] On the MathVista benchmark, which assesses visual mathematical reasoning, ERNIE scored 82.5, narrowly ahead of Gemini's 82.3 and GPT's 81.3.[2] The gap was wider on ChartQA, which tests the ability to understand and answer questions about charts: ERNIE scored 87.1, well ahead of GPT's 78.2 and Gemini's 76.3.[2] Similarly, on the "VLMs Are Blind" benchmark, ERNIE scored 77.3, surpassing both Gemini (76.5) and GPT (69.6).[2] Baidu also reported superior performance on other vision-language tests such as CCBench and OCRBench.[7] These results point to strong capabilities in handling dense, non-text data, an area where many businesses hold untapped value.
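For readers who want to compare the margins directly, the short sketch below computes ERNIE's lead over the best-scoring rival on each benchmark, using only the scores quoted above. The dictionary layout and the function name `margin_over_best_rival` are illustrative choices, not part of any published evaluation tooling.

```python
# Scores as quoted in the article (higher is better).
scores = {
    "MathVista": {"ERNIE": 82.5, "Gemini": 82.3, "GPT": 81.3},
    "ChartQA": {"ERNIE": 87.1, "Gemini": 76.3, "GPT": 78.2},
    "VLMs Are Blind": {"ERNIE": 77.3, "Gemini": 76.5, "GPT": 69.6},
}

def margin_over_best_rival(row, model="ERNIE"):
    """Points by which `model` leads the highest-scoring other model."""
    best_rival = max(v for name, v in row.items() if name != model)
    return round(row[model] - best_rival, 1)

margins = {bench: margin_over_best_rival(row) for bench, row in scores.items()}
for bench, m in margins.items():
    print(f"{bench}: {m:+.1f} points versus the best rival")
```

On these figures the lead is under one point on MathVista and "VLMs Are Blind" and close to nine points on ChartQA, which is why the chart-understanding result stands out.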
Beyond outperforming on standardized tests, the true impact of Baidu's new model lies in its potential for real-world enterprise applications. A vast amount of valuable business intelligence is often trapped in non-textual formats, such as engineering schematics, video feeds from factory floors, medical imagery, and complex logistics dashboards.[2] The ERNIE model is specifically engineered to process this type of information. Demonstrations have shown its ability to analyze intricate statistical charts to identify causal relationships, interpret "Peak Time Reminder" charts for resource scheduling, and even solve complex physics problems depicted in a bridge circuit diagram by applying Ohm's and Kirchhoff's laws.[1][2] A key feature is its "Thinking with Images" capability, which allows the model to zero in on specific regions of an image and conduct detailed reasoning on these cropped views.[3] Baidu's vision extends beyond simple data perception; the company aims to establish this technology as a foundation for sophisticated "multimodal agents" capable of not just understanding but also reasoning and taking action based on visual inputs.[2]
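The crop-and-inspect idea behind a feature like "Thinking with Images" can be illustrated with a toy sketch. The code below is not Baidu's method and does not call any ERNIE API. It assumes an image represented as a plain grid of pixel values and uses hypothetical helpers (`crop`, `upscale_nearest`) to show what zooming in on a region means in practice: select a bounding box, then enlarge it so fine detail takes up more of the view.

```python
def crop(image, box):
    """Return the sub-grid inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def upscale_nearest(image, factor):
    """Enlarge a grid by repeating each pixel `factor` times in both directions."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

# Hypothetical 4x6 "image" whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]

# Assumed region of interest, e.g. one small label on a chart.
region = crop(image, (1, 2, 3, 5))
zoomed = upscale_nearest(region, 2)

for row in zoomed:
    print(row)
```

In a full system the model itself would choose the box and then reason over the enlarged view. Here the box is hard-coded to keep the example self-contained.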
The release of this advanced ERNIE model has broader implications for the global AI industry. It represents a strategic push by Baidu to establish a strong domestic alternative at a time when access to some Western AI models is becoming restricted in China,[8] a push that supports a more self-reliant AI ecosystem within the country. Furthermore, Baidu is competing aggressively on cost, with reports suggesting that ERNIE 4.5 can outperform models like GPT-4.5 at just 1% of the price.[6][9][10] The decision to release the model under the open-source Apache 2.0 license further encourages widespread adoption, customization, and commercial application.[1] This combination of strong performance in key areas, low cost, and an open-source approach positions ERNIE as a formidable competitor in the international AI market.
In conclusion, Baidu's ERNIE-4.5-VL-28B-A3B-Thinking is more than just an incremental improvement in AI technology. Its efficient architecture directly addresses the critical enterprise challenges of cost and scalability, while its demonstrated strength in multimodal data processing unlocks new avenues for data-driven insights. By achieving top scores on key benchmarks and targeting the specific needs of businesses, Baidu has not only created a powerful tool but has also made a significant strategic move. This development intensifies the global AI race, challenging the established leaders with an accessible, powerful, and enterprise-ready model that could reshape how organizations leverage their most complex data.
