Google DeepMind transforms Street View into interactive 3D worlds to train physical AI
By turning static Street View imagery into interactive 3D environments, Google is building a new training ground for physical AI.
May 20, 2026

Google DeepMind has narrowed the gap between virtual simulation and physical reality by integrating its state-of-the-art Genie world model with Google Maps Street View data[1][2]. The newly unveiled capability, available in the company’s Project Genie application, lets users select a location on a map and generate an interactive, walkable three-dimensional environment styled by artificial intelligence[3][4]. While the public-facing application functions as a sophisticated creative playground, the underlying technology signals a major step forward for the broader AI industry[5][6]. By transforming twenty years of real-world street photography into a responsive digital training ground, Google is positioning its massive spatial database as a strategic resource for the future of robotics, autonomous vehicles, and physical artificial intelligence[7][8].
The new integration relies on Genie 3, DeepMind’s general-purpose world model[9][2]. Through Project Genie, which is rolling out to Google AI Ultra subscribers, users can interact with these AI-generated spaces using standard video-game-like controls[9][3]. To begin, a user drops a pin on a map within the United States, chooses a visual style such as Desert Sands, Stone Age, Ocean World, or vintage black-and-white film, and describes a custom character to navigate the scene[3][4]. Genie 3 then draws on Google’s database of over 280 billion Street View images, collected across 110 countries and seven continents, and converts this static photography into an interactive simulation[4][10]. The resulting environment renders at 720p resolution and runs in real time at 20 to 24 frames per second[5][7]. Exploration sessions are currently limited to sixty seconds because of the processing power required, but within that window they maintain physical consistency, dynamic lighting, and object permanence as the user moves the camera around the avatar[11][5][3].
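To put the quoted figures in perspective, the short calculation below works out how many frames a single session implies. It is only back-of-the-envelope arithmetic on the numbers reported above; the function name is illustrative, and treating 720p as 1280x720 pixels is an assumption, since the article does not state the aspect ratio.

```python
# Figures quoted in the article.
SESSION_SECONDS = 60
FPS_LOW, FPS_HIGH = 20, 24

# Assumption: "720p" is taken to mean 1280x720; the article gives no aspect ratio.
WIDTH, HEIGHT = 1280, 720


def frames_per_session(seconds: int, fps: int) -> int:
    """Number of frames generated in one session at a constant frame rate."""
    return seconds * fps


low_frames = frames_per_session(SESSION_SECONDS, FPS_LOW)    # 1200
high_frames = frames_per_session(SESSION_SECONDS, FPS_HIGH)  # 1440
pixels_per_frame = WIDTH * HEIGHT                            # 921600

print(f"Frames per session: {low_frames} to {high_frames}")
print(f"Pixels per frame:   {pixels_per_frame}")
print(f"Pixels per session: {low_frames * pixels_per_frame} to "
      f"{high_frames * pixels_per_frame}")
```

Under these assumptions, one sixty-second session corresponds to between 1,200 and 1,440 generated frames, each of which must stay consistent with the frames before it.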
Beyond consumer entertainment, the capability to ground world models in real-world geography is already transforming industrial autonomous systems, most notably within Alphabet-owned self-driving unit Waymo[10][12]. Earlier this year, Waymo introduced its specialized Waymo World Model, built directly on the foundation of DeepMind’s Genie 3 architecture[13][14]. In the past, self-driving development relied almost entirely on data accumulated by physical test vehicles driving on real streets, which meant that rare, highly dangerous events were poorly represented in training datasets[15][16]. By leveraging Genie 3’s immense world knowledge, Waymo engineers can now simulate extreme or highly improbable driving scenarios—such as a sudden tornado crossing an intersection or an unexpected encounter with an elephant—without placing real vehicles or drivers at risk[14][12]. Crucially, the system goes beyond generating visual footage by producing coordinated multi-sensor simulations, including both high-resolution camera feeds and LiDAR data[13][15]. This allows engineers to systematically stress-test their autonomous software and convert ordinary dashcam footage into controllable "what-if" scenarios, accelerating safe deployments in complex urban environments[13][15].
This integration also highlights a technological shift in the AI sector from passive video generation to true world modeling, which many researchers consider a critical stepping stone on the path to artificial general intelligence[17][5]. For several years, generative AI has excelled at creating visually striking but fundamentally passive media, such as video tools that generate short, unalterable cinematic clips[5]. These older systems often lack an understanding of physical laws, causing objects to morph unnaturally or disappear when they leave the frame[5]. In contrast, Genie 3 operates as a functional world model, simulating the underlying rules of gravity, momentum, lighting, and fluid dynamics[11][5]. When a user explores a scene, the model remembers the geometry of the space, so details remain consistent even when the camera pans away and returns[11][7]. By learning these rules of cause and effect, the model lets AI agents predict how a physical environment will evolve and how their own actions will alter it, providing a framework for training intelligent systems before they are deployed in the real world[17][7].
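The difference between a passive clip and a world model can be illustrated with a toy example. The sketch below is not Genie 3 and makes no claim about how it works internally; every name in it (`WorldState`, `step`, `visible`, `rollout`) is hypothetical. It only shows the two properties described above: the next state depends on the agent's action, and objects that leave the view stay in the state, so they reappear when the agent returns.

```python
from dataclasses import dataclass

# Toy illustration only: a one-dimensional world with static objects.
ACTIONS = {"left": -1, "stay": 0, "right": 1}
VIEW_RADIUS = 2  # how far the agent can "see" in either direction


@dataclass(frozen=True)
class WorldState:
    agent_x: int
    objects: frozenset  # positions of static objects, kept even when unseen


def step(state: WorldState, action: str) -> WorldState:
    """Transition function: the next state depends on the state and the action."""
    return WorldState(state.agent_x + ACTIONS[action], state.objects)


def visible(state: WorldState) -> list:
    """Render only the objects in view; hidden objects remain in the state."""
    return sorted(x for x in state.objects
                  if abs(x - state.agent_x) <= VIEW_RADIUS)


def rollout(state: WorldState, actions: list):
    """Apply a sequence of actions and record what is visible after each one."""
    frames = []
    for action in actions:
        state = step(state, action)
        frames.append(visible(state))
    return state, frames


start = WorldState(agent_x=0, objects=frozenset({1, 5}))
final, frames = rollout(start, ["right"] * 4 + ["left"] * 4)
for i, frame in enumerate(frames, start=1):
    print(f"step {i}: visible objects at {frame}")
```

In this run the object at position 1 drops out of view on the fourth step and is visible again on the fifth, because it was never removed from the underlying state. A passive video generator has no such state to return to, which is the gap the paragraph above describes.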
Furthermore, the combination of Genie 3 and Street View highlights a major competitive advantage for Alphabet in the rapidly consolidating AI market[12]. As top labs struggle to find enough high-quality internet text and public video to train their increasingly large models, proprietary real-world data has become a key strategic moat. While competitors must rely on scraped web data or synthetic environments to teach their models about physical spaces, Google holds a two-decade archive of ground-truth imagery of the physical world, spanning 110 countries[8][10]. This dataset provides a training curriculum for physical AI systems, commonly referred to as embodied AI, that rivals will find hard to match. By grounding these virtual worlds in actual geographic coordinates, Google is creating localized simulators that could train household robots, industrial machines, and delivery drones in the same physical environments they will eventually operate in, giving the company a significant edge over its rivals[7][8][10].
Ultimately, the integration of Street View with Google DeepMind’s Project Genie represents a fundamental pivot in the generative AI era[1][2]. By turning the static imagery of the physical planet into a dynamic, endless curriculum of interactive simulations, Google is effectively laying the groundwork for how future AI systems will learn to understand our world[17][7]. The implications of this development will echo far beyond casual creative tools or virtual tourism[8][10]. As world models continue to mature, they will redefine the boundaries of game development, virtual reality, and physical automation, proving that the digital simulations of today are rapidly becoming the training grounds for the intelligent physical machines of tomorrow[17][7].
Sources
[1]
[4]
[5]
[6]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]