Alibaba Unveils Wan2.5: AI Generates Video with Native, Synchronized Audio
Alibaba breaks AI's visual-audio barrier with Wan2.5, generating immersive video clips complete with natively synchronized sound.
September 25, 2025

Alibaba has opened a new front in the generative artificial intelligence race with Wan2.5-Preview, a video generation model that creates short clips complete with synchronized, high-fidelity audio from simple text prompts or static images.[1][2] Announced at the company's annual Apsara Conference, the model marks a significant step forward in multimodal AI, aiming to streamline content creation by natively integrating sound and visuals.[3][4] It doubles the maximum clip length of its predecessors to ten seconds and offers resolutions up to 1080p, positioning Alibaba as a formidable competitor in the rapidly evolving field of AI-driven media.[1][5]
The standout feature of Wan2.5 is its ability to generate audio that is not merely added but is intrinsically linked to the video content.[6] This is achieved through a natively integrated multimodal architecture, where the model is jointly trained on a vast dataset of text, audio, and visual information.[1][4] This unified training process allows for what Alibaba calls "aligned multi-modal generation," ensuring that spoken dialogue, ambient sounds, and sound effects are synchronized with the on-screen action.[4][7] For creators, this eliminates the complex and time-consuming process of sourcing or generating separate audio tracks and manually syncing them in post-production.[7] The model can produce everything from character dialogue with corresponding lip movements to subtle background noises like the rustle of leaves or the roar of an engine, adding a layer of realism and immersion directly from the initial prompt.[8][7] This capability extends to both text-to-video and image-to-video functions, allowing a user to animate a still photo with accompanying sound or build a scene from a purely descriptive text.[4]
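To make that workflow concrete, here is a minimal sketch of what a text-to-video request with native audio might look like when the model is exposed over an HTTP API. The endpoint URL, model identifier, parameter names, and response fields below are illustrative assumptions, not Alibaba's published interface.

```python
import requests

# Hypothetical endpoint and parameters -- illustrative only, not Alibaba's published API.
API_URL = "https://example-provider.com/v1/video/generate"  # assumed third-party host
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "wan2.5-preview",          # model identifier (assumed naming)
    "prompt": (
        "A rainy street at night; a taxi pulls up and the driver says "
        "'Where to?' while neon signs buzz and rain patters on the roof."
    ),
    "resolution": "1080p",              # Wan2.5 supports resolutions up to 1080p
    "duration_seconds": 10,             # clips up to ten seconds long
    "audio": "native",                  # assumed flag: request synchronized audio in one pass
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=300,
)
response.raise_for_status()

# Assumed response shape: a URL to a clip that already contains the audio track.
print(response.json().get("video_url"))
```

In practice, access is currently offered through Alibaba's own cloud tooling and partner platforms, so the real request shape will depend on whichever service exposes the model; the point is that dialogue, ambient sound, and visuals come back as a single artifact rather than separate tracks to be aligned by hand.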
This advancement is built upon Alibaba's extensive work in large-scale AI, particularly its Qwen series of models.[1] The foundation for a tool like Wan2.5 was laid by the development of powerful Large Vision-Language Models (LVLMs) such as Qwen-VL-Max and omni-models like Qwen2.5-Omni.[9][10][11] These underlying models are designed to process and understand a wide array of inputs, including text, images, audio, and video, building a comprehensive picture of how different data types relate to one another.[10][12] Qwen2.5-Omni, for instance, features a distinctive "Thinker-Talker" architecture that allows it to process diverse inputs and generate real-time, streaming responses in both text and natural speech.[1][13] By leveraging this deep expertise in multimodal comprehension and generation, Wan2.5 can more effectively interpret prompts that combine visual descriptions, actions, and auditory cues, resulting in a more coherent and faithful final output.[7] This connection to a broader, continuously evolving AI ecosystem signals a strategic, full-stack approach from Alibaba, aiming to integrate powerful foundation models into a wide range of applications.[1]
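For readers curious how a "Thinker-Talker" split divides the work, the following is a purely conceptual Python sketch of the idea: one component reasons over mixed inputs and produces text, while a second streams that text out as speech. It illustrates the division of labor described above and is not Qwen's actual implementation; every class and function name here is invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MultimodalInput:
    """Container for the mixed inputs an omni-model can receive (all fields optional)."""
    text: str | None = None
    image_path: str | None = None
    audio_path: str | None = None
    video_path: str | None = None


class Thinker:
    """Conceptual stand-in: reasons over all modalities and produces a text response."""

    def respond(self, inputs: MultimodalInput) -> str:
        # A real model would fuse the modalities here; this placeholder just echoes the prompt.
        return f"(reasoned reply to: {inputs.text!r})"


class Talker:
    """Conceptual stand-in: converts the Thinker's text into streamed speech chunks."""

    def stream_speech(self, text: str) -> Iterable[bytes]:
        # A real Talker would emit audio frames incrementally as the text arrives.
        for word in text.split():
            yield word.encode("utf-8")  # placeholder "audio" chunk per word


if __name__ == "__main__":
    thinker, talker = Thinker(), Talker()
    reply = thinker.respond(MultimodalInput(text="Describe this street scene", image_path="scene.jpg"))
    for chunk in talker.stream_speech(reply):
        pass  # in practice, each chunk would be played back in real time
    print(reply)
```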
With the launch of Wan2.5-Preview, Alibaba intensifies the competition among major technology firms vying for dominance in the generative video space. The model's native audio-generation capability places it in direct comparison with a select few rivals, most notably Google's Veo 3, which also boasts synchronized audio and dialogue.[14][8] While other leading models like OpenAI's Sora and Runway's Gen-3 have demonstrated remarkable visual generation, the seamless integration of audio and video remains a key differentiator.[15][16] Wan2.5's ability to generate clips up to 10 seconds long with strong prompt adherence and stylistic flexibility further solidifies its position as a serious contender.[1][7] The model is already being made available through third-party platforms, suggesting a strategy to encourage widespread adoption and experimentation among creators and developers.[5][7] This move could accelerate innovation and push competitors to advance their own multimodal capabilities.
In conclusion, the introduction of Alibaba's Wan2.5-Preview represents a significant milestone in the quest for truly comprehensive AI content creation. By breaking down the barrier between visual and audio generation, the model offers a more intuitive and efficient workflow, empowering creators to produce richer, more immersive narratives. Its development underscores the critical importance of deep multimodal understanding, built upon a strong foundation of vision-language models. As this technology becomes more accessible, it promises to reshape the landscape of digital media, further democratizing high-quality video production while fueling the intense innovation race that defines the current era of artificial intelligence.