DeepMind Proposes Video Models Will Become Vision's LLMs

DeepMind's Veo 3 points to a shift toward general-purpose video models that could become vision's LLM, unifying visual understanding and reasoning.

September 29, 2025

Researchers at Google DeepMind are proposing a transformative shift in the landscape of artificial intelligence, suggesting that advanced video generation models could soon serve as the visual equivalent of large language models (LLMs).[1][2] Just as LLMs like Gemini have become general-purpose tools for a vast array of text-based tasks, from translation to coding, DeepMind envisions models such as Veo 3 achieving a similar status for visual understanding and manipulation.[2][3] This development signals a move away from specialized, single-task visual AI towards unified, flexible systems that can perceive, model, and reason about the world in motion. The core of this vision lies in the emergent, zero-shot capabilities of these video models, which are demonstrating proficiency in tasks for which they were not explicitly trained.[3]
The comparison to the evolution of LLMs is a crucial one. A few years ago, natural language processing was characterized by a fragmented ecosystem of models, each designed for a specific function like summarization or sentiment analysis. The advent of large-scale, generative models trained on web-scale data unified these capabilities into single, powerful platforms.[2] DeepMind researchers argue that the field of machine vision is on a similar trajectory.[2] Currently, it relies on an array of task-specific models for functions like object detection or image segmentation.[2] However, generative video models, trained on a massive and diverse dataset of visual information, are beginning to show they can handle a wide variety of these visual tasks simply through prompting, much like their text-based counterparts.[2][3] This suggests a future where a single, robust video model could become a foundational tool for nearly any visual problem.
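To make the contrast concrete, here is a minimal sketch of what a prompt-driven, unified interface could look like. It is purely illustrative: `call_video_model` is a hypothetical placeholder (DeepMind has not published a unified API of this kind), and the prompt wordings are assumptions.

```python
# Sketch: one generative video model, many vision tasks selected by prompt.
# `call_video_model` is a hypothetical placeholder, not a real API.

TASK_PROMPTS = {
    "segmentation": "Highlight each distinct object in this scene with a solid color mask.",
    "edge_detection": "Redraw this scene showing only the outlines of every object.",
    "deblurring": "Play this scene again with the camera in perfect focus.",
}

def call_video_model(prompt: str, conditioning_image: bytes) -> bytes:
    """Placeholder for a text+image-conditioned video generator (e.g., a model like Veo 3)."""
    raise NotImplementedError("Swap in a real video-generation API here.")

def run_visual_task(task: str, image: bytes) -> bytes:
    # The same model weights handle every task; only the prompt changes,
    # mirroring how one LLM replaced many task-specific NLP models.
    return call_video_model(TASK_PROMPTS[task], image)
```

The point of the sketch is the shape of the interface: task-specific behavior is selected by a natural-language instruction rather than by swapping model architectures.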
At the forefront of this shift is Google's Veo 3, a model that showcases significant advancements in generative video technology. Veo 3 is capable of producing high-resolution video with a sophisticated understanding of real-world physics and temporal consistency.[4][5] Its abilities extend beyond simple video generation; it can simulate complex physical interactions, maintain character consistency across scenes, and adhere to intricate, nuanced prompts that specify cinematic styles and camera movements.[4][5] Crucially, DeepMind has highlighted Veo 3's ability to perform zero-shot reasoning on visual tasks it was never explicitly trained for. An analysis of thousands of generated videos revealed the model's capacity for tasks ranging from segmentation and edge detection to understanding material properties and even solving visual puzzles like mazes.[2][3] This hints at a deeper, more generalized understanding of the visual world.
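How might such zero-shot ability be quantified? One plausible protocol, loosely in the spirit of the analysis described above (the exact metrics are in the cited work, so treat this as an illustration): prompt the model with an input image and a task instruction, take a frame of the generated video as the model's answer, and score it against ground truth, e.g. with intersection-over-union for segmentation:

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

# Toy check with synthetic masks; a real evaluation would extract the
# predicted mask from a frame of the generated video.
truth = np.zeros((64, 64), dtype=bool); truth[16:48, 16:48] = True
pred  = np.zeros((64, 64), dtype=bool); pred[20:52, 20:52] = True
print(f"IoU = {iou(pred, truth):.2f}")  # ~0.62 for these offset squares
```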
A key concept introduced by DeepMind researchers is "chain-of-frames" (CoF) reasoning, a parallel to the "chain-of-thought" (CoT) prompting that enhanced the reasoning abilities of LLMs.[2][6] While CoT allows language models to break down complex problems into intermediate textual steps, CoF enables video models to perform step-by-step visual reasoning across time and space.[2] By generating a sequence of frames, the model can effectively "think through" a visual problem, manipulating objects and scenarios over time to arrive at a solution.[6] This frame-by-frame process is what allows Veo 3 to tackle challenges that require temporal understanding, such as predicting the outcome of physical interactions or navigating a path.[2] Although specialized models still outperform these generalist video models on specific tasks, the rapid improvement from one generation to the next indicates a swift closing of that performance gap.[2]
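Chain-of-frames is a property of the generative model itself, but the underlying idea, solving a spatial problem by emitting a sequence of intermediate states rather than jumping straight to the answer, can be illustrated with a deliberately non-neural toy. The sketch below "solves" a maze by producing one grid snapshot per step along the path, a textual analogue of the frame sequence a video model would render:

```python
from collections import deque

MAZE = [
    "S.#.",
    ".##.",
    "..#.",
    "#..G",
]

def solve_frames(maze):
    """Breadth-first search that records one 'frame' (grid snapshot) per step
    along the found path -- a toy analogue of chain-of-frames reasoning."""
    rows, cols = len(maze), len(maze[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "S")
    goal = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "G")
    parent, queue = {start: None}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            break
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    path, cell = [], goal
    while cell is not None:  # walk back from goal to start
        path.append(cell)
        cell = parent[cell]
    frames = []
    for r, c in reversed(path):  # one snapshot per step, like video frames
        grid = [list(row) for row in maze]
        grid[r][c] = "*"
        frames.append("\n".join("".join(row) for row in grid))
    return frames

for frame in solve_frames(MAZE):
    print(frame, end="\n\n")
```

Each printed grid plays the role of one generated frame: the intermediate states are the reasoning, much as the intermediate tokens in a chain-of-thought are for an LLM.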
The implications of this transition toward general-purpose video models are vast and poised to impact numerous industries. For content creation, it democratizes the production of high-quality video, allowing filmmakers, marketers, and individual creators to generate complex scenes and special effects from simple text prompts.[7][8] Beyond entertainment, these models could serve as powerful world simulators, capable of creating realistic virtual environments for training robots, autonomous vehicles, or for scientific research.[9][10][11] In fields like education and design, they offer new tools for visualization and rapid prototyping.[12][13] However, this powerful technology also brings significant ethical challenges, including the potential for creating convincing deepfakes and spreading misinformation, which will necessitate the development of robust detection methods and regulatory frameworks.[14][15]
In conclusion, the proposition by Google DeepMind that video models will become foundational systems for vision represents a significant paradigm shift in artificial intelligence. The rapid evolution of models like Veo 3, with their emergent zero-shot capabilities and new forms of visual reasoning, strongly supports the analogy to the rise of LLMs in the text domain.[1][3] By moving from a collection of specialized tools to a single, general-purpose system, the AI industry is on the cusp of unlocking unprecedented capabilities in how machines understand and interact with the visual world. This will not only revolutionize video content creation but also provide a powerful new platform for simulation, research, and problem-solving across a multitude of disciplines, fundamentally altering the relationship between human creativity and machine intelligence.[7]
