Baidu Open-Sources ERNIE 4.5, Sets New AI Benchmarks with Efficient MoE
Pioneering a Heterogeneous MoE, ERNIE 4.5 redefines multimodal AI with efficiency and open-sourced accessibility.
June 30, 2025

In a significant stride for the artificial intelligence sector, Chinese technology giant Baidu has unveiled its latest family of models, ERNIE 4.5, built on a novel 'Heterogeneous Mixture of Experts' (MoE) architecture. The release pushes the boundaries of multimodal AI while signaling a strategic bid to pair higher performance with greater computational efficiency. By open-sourcing the entire ERNIE 4.5 family, which spans a range of parameter counts, Baidu joins a growing trend of democratizing access to powerful AI tools and sets new benchmarks for the industry. The flagship language model, with 300 billion total parameters, reportedly outperforms other prominent models on several key benchmarks.[1]
At the heart of the ERNIE 4.5 series is its Heterogeneous Multimodal MoE architecture, a departure from traditional MoE designs that use uniform 'expert' models.[1][2][3] The framework is built from the ground up for multimodal learning, integrating text, image, and video data more seamlessly.[4][5] It features modality-isolated routing, which directs each data type to specialized processing pathways,[4] and it employs separate experts for text and vision alongside shared experts that integrate knowledge across modalities.[1][5] This design allows for what Baidu describes as "mutual reinforcement during training," where learning in one modality can enhance understanding in another.[1] For instance, joint training on visual and textual data helps the models capture finer nuances of information, improving performance on tasks that require cross-modal reasoning.[5][6] The approach has been shown to enhance multimodal understanding without compromising performance on text-only tasks, and in some cases even improving it.[5]
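To make the routing idea concrete, the following is a minimal PyTorch sketch of modality-isolated routing with separate text and vision experts plus shared experts. All module names, expert counts, dimensions, and the top-k value are illustrative assumptions, not Baidu's implementation; the sketch only mirrors the structure described above, including the one-third-size intermediate dimension for vision experts that the next paragraph quantifies.

```python
# Minimal sketch of a heterogeneous MoE layer with modality-isolated
# routing. Illustrative only; all sizes and names are assumptions.
import torch
import torch.nn as nn


class FeedForwardExpert(nn.Module):
    """A standard FFN expert; `d_ff` is its intermediate dimension."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class HeterogeneousMoELayer(nn.Module):
    """Routes text and vision tokens to modality-specific experts;
    shared experts process every token."""
    def __init__(self, d_model=512, d_ff_text=2048,
                 n_text=4, n_vision=4, n_shared=2, top_k=2):
        super().__init__()
        # Vision experts use a 1/3-size intermediate dimension,
        # mirroring the efficiency choice described in the article.
        d_ff_vision = d_ff_text // 3
        self.text_experts = nn.ModuleList(
            [FeedForwardExpert(d_model, d_ff_text) for _ in range(n_text)])
        self.vision_experts = nn.ModuleList(
            [FeedForwardExpert(d_model, d_ff_vision) for _ in range(n_vision)])
        self.shared_experts = nn.ModuleList(
            [FeedForwardExpert(d_model, d_ff_text) for _ in range(n_shared)])
        # Separate routers per modality: "modality-isolated routing".
        self.text_router = nn.Linear(d_model, n_text)
        self.vision_router = nn.Linear(d_model, n_vision)
        self.top_k = top_k

    def _route(self, x, router, experts):
        # Pick top-k experts per token; mix outputs by gate weight.
        gates = router(x).softmax(dim=-1)              # [tokens, n_experts]
        weights, idx = gates.topk(self.top_k, dim=-1)  # [tokens, k]
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

    def forward(self, x, is_vision):
        # x: [tokens, d_model]; is_vision: per-token bool mask.
        # Residual-style combination, kept simple for illustration.
        out = x.clone()
        if (~is_vision).any():
            out[~is_vision] += self._route(
                x[~is_vision], self.text_router, self.text_experts)
        if is_vision.any():  # skipped entirely for text-only inputs
            out[is_vision] += self._route(
                x[is_vision], self.vision_router, self.vision_experts)
        for expert in self.shared_experts:  # shared experts see all tokens
            out = out + expert(x)
        return out


layer = HeterogeneousMoELayer()
tokens = torch.randn(10, 512)
vision_mask = torch.tensor([False] * 6 + [True] * 4)
print(layer(tokens, vision_mask).shape)  # torch.Size([10, 512])
```

Note that in a layout like this, a text-only batch never touches the vision experts at all, which is the compute- and memory-saving behavior described below.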
The technical specifications of the Heterogeneous MoE architecture reveal a focus on computational efficiency. A key choice is the differing sizes of the expert models: visual experts use an intermediate dimension one-third that of the text experts,[1][5] which cuts the computational load for visual tokens by roughly 66%.[1] The architecture is also flexible: in text-only scenarios, the vision experts can be skipped entirely, reducing memory overhead.[1] Efficiency is further bolstered by a series of systems optimizations, including intra-node expert parallelism and FP8 mixed-precision training, which enabled the largest model in the family to reach 47% Model FLOPs Utilization (MFU) during pre-training on NVIDIA H800 GPUs.[1][5] Baidu claims these optimizations allow for "optimal training performance" even with limited compute resources, citing the use of around 96 GPUs for the largest ERNIE 4.5 model.[1]
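The roughly two-thirds figure follows directly from the dimension choice: an FFN expert's cost is dominated by its two projection matmuls, so its FLOPs scale linearly with the intermediate dimension. A quick back-of-envelope check (the hidden size and expansion factor below are illustrative assumptions, not ERNIE 4.5's actual values):

```python
# Back-of-envelope check of the ~66% per-token saving for visual tokens,
# assuming expert FLOPs scale linearly with the intermediate dimension.
d_model = 8192            # illustrative hidden size (assumption)
d_ff_text = 4 * d_model   # common 4x FFN expansion (assumption)
d_ff_vision = d_ff_text // 3

# Up- and down-projection matmuls: ~2 * (d_model * d_ff) MACs each.
flops = lambda d_ff: 2 * (2 * d_model * d_ff)
saving = 1 - flops(d_ff_vision) / flops(d_ff_text)
print(f"FFN FLOPs saving for vision experts: {saving:.0%}")  # ~67%
```

The exact reduction is two-thirds, consistent with the approximately 66% Baidu reports.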
The performance of the ERNIE 4.5 family, particularly its larger variants, positions it against leading models in the AI landscape. The 300-billion-parameter language model, ERNIE-4.5-300B-A47B-Base, outperformed DeepSeek-V3-671B on 22 of 28 benchmarks covering general reasoning, mathematics, and coding.[1][4] Even the smaller variants are competitive: the 21-billion-parameter ERNIE-4.5-21B-A3B-Base beat Alibaba's Qwen3-30B on several math and reasoning benchmarks despite having 30% fewer parameters.[1] The multimodal capabilities are also noteworthy, with the largest vision-language model (VLM) in the family, ERNIE-4.5-VL-424B-A47B, showing strong results across benchmarks that test visual perception, document understanding, and visual knowledge.[6] All of these models have been made publicly accessible under the Apache 2.0 license on platforms such as Hugging Face and AI Studio, complete with a 128K context window.[4]
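Because the checkpoints are published under Apache 2.0, they can in principle be pulled through the standard Hugging Face `transformers` API. The sketch below assumes a repository ID inferred from the model names above; the actual ID on the Hub may differ:

```python
# Illustrative loading of an open-sourced ERNIE 4.5 checkpoint via the
# standard Hugging Face `transformers` API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baidu/ERNIE-4.5-21B-A3B-Base"  # assumed repo ID; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",    # keep the checkpoint's native precision
    device_map="auto",     # requires the `accelerate` package
    trust_remote_code=True)

prompt = "Explain mixture-of-experts routing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```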
The release of ERNIE 4.5 and its underlying architecture has significant implications for the broader AI industry. By open-sourcing a diverse family of powerful models, Baidu is not only fostering further research and development but also intensifying the competitive landscape.[5] The emphasis on computational efficiency through the Heterogeneous MoE design addresses a critical challenge in the development of large-scale AI, potentially making advanced capabilities more accessible to a wider range of developers and organizations.[7] The company has also released ERNIEKit, an industrial-grade development toolkit to support fine-tuning, alignment, and deployment of the new models.[6][7] This strategic combination of high performance, architectural innovation, and open accessibility positions Baidu as a formidable player in the global AI race, pushing the industry towards more efficient and capable multimodal systems.