World models, also known as world simulators, are being touted by some as the next big thing in AI.
AI pioneer Fei-Fei Li’s World Labs has raised $230 million to build “large world models,” and DeepMind hired one of the creators of OpenAI’s video generator, Sora, to work on “world simulators.” (Sora was released on Monday.)
But what the heck are these things?
World models take inspiration from the mental models of the world that humans develop naturally. Our brains take the abstract representations from our senses and form them into a more concrete understanding of the world around us, producing what we called “models” long before AI adopted the phrase. The predictions our brains make based on these models influence how we perceive the world.
A paper by AI researchers David Ha and Jürgen Schmidhuber gives the example of a baseball batter. Batters have milliseconds to decide how to swing their bat, less time than it takes for visual signals to reach the brain. They’re able to hit a 100-mile-per-hour fastball because they can instinctively predict where the ball will go, Ha and Schmidhuber say.
“For professional players, this all happens subconsciously,” the research duo writes. “Their muscles reflexively swing the bat at the right time and location in line with their internal models’ predictions. They can quickly act on their predictions of the future without the need to consciously roll out possible future scenarios to form a plan.”
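To put a rough number on the batter example, here is a back-of-the-envelope sketch in Python. It assumes the ball covers the full regulation distance of 60 feet 6 inches at a constant 100 miles per hour; a real pitch is released closer to the plate and slows in flight, so treat the result as an approximation only. The function and constant names are illustrative and do not come from the paper.

```python
# Assumed figures: 60 ft 6 in is the regulation distance from the pitching
# rubber to home plate; the real release point is closer, and drag is ignored.
MOUND_TO_PLATE_FT = 60.5
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def flight_time_ms(speed_mph, distance_ft=MOUND_TO_PLATE_FT):
    """Time in milliseconds for a ball at constant speed to cover distance_ft."""
    speed_ft_per_s = speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
    return distance_ft / speed_ft_per_s * 1000


print(round(flight_time_ms(100), 1))  # about 412.5 ms for a 100 mph fastball
```

Under these assumptions the whole pitch lasts roughly four-tenths of a second, and the window for deciding how to swing is only a fraction of that, which is why prediction matters more than reaction.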
It’s these subconscious reasoning aspects of world models that some believe are prerequisites for human-level intelligence.
Modeling the world
While the concept has been around for decades, world models have gained popularity recently in part because of their promising applications in the field of generative video.
Most, if not all, AI-generated videos veer into uncanny valley territory. Watch them long enough and something bizarre will happen, like limbs twisting and merging into each other.
While a generative model trained on years of video might accurately predict that a basketball bounces, it doesn’t actually have any idea why, just as language models don’t really understand the concepts behind words and phrases. But a world model with even a basic grasp of why the basketball bounces the way it does will be better at depicting that bounce.
To enable this kind of insight, world models are trained on a range of data, including photos, audio, videos, and text, with the intent of creating internal representations of how the world works, and the ability to reason about the consequences of actions.
“A viewer expects that the world they’re watching behaves in a similar way to their reality,” Alex Mashrabov, Snap’s ex-chief of AI and the CEO of Higgsfield, which is building generative models for video, said. “If a feather drops with the weight of an anvil or a bowling ball shoots up hundreds of feet into the air, it’s jarring and takes the viewer out of the moment. With a strong world model, instead of a creator defining how each object is expected to move, which is tedious, cumbersome, and a poor use of time, the model will understand this.”
But better video generation is only the tip of the iceberg for world models. Researchers including Meta chief AI scientist Yann LeCun say the models could someday be used for sophisticated forecasting and planning in both the digital and physical realms.
In a talk earlier this year, LeCun described how a world model could help achieve a desired goal through reasoning. A model with a base representation of a “world” (e.g. a video of a dirty room), given an objective (a clean room), could come up with a sequence of actions to achieve that objective (deploy vacuums to sweep, clean the dishes, empty the trash) not because that’s a pattern it has observed but because it knows at a deeper level how to go from dirty to clean.
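The dirty-room example can be made concrete with a toy planner. The Python sketch below is only an illustration under heavy assumptions: the "world model" is a hand-written lookup table (`EFFECTS`) rather than anything learned from video, and every name in it (`predict`, `plan`, the action and problem labels) is hypothetical. It is not LeCun's method. What it shows is the general idea of searching through imagined futures, asking the model what each action would lead to, until a sequence that reaches the goal turns up.

```python
from collections import deque

# Hypothetical hand-written "world model": what each action is predicted to fix.
# A real world model would learn such dynamics from data instead.
EFFECTS = {
    "vacuum_floor": "dusty_floor",
    "wash_dishes": "dirty_dishes",
    "empty_trash": "full_trash",
}


def predict(state, action):
    """Predict the next state: the set of problems left after an action."""
    return state - {EFFECTS[action]}


def plan(start, goal):
    """Breadth-first search over imagined futures; returns a shortest action list."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action in EFFECTS:
            nxt = predict(state, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # no action sequence reaches the goal


dirty_room = frozenset({"dusty_floor", "dirty_dishes", "full_trash"})
clean_room = frozenset()
steps = plan(dirty_room, clean_room)
print(steps)
```

The planner never executes anything in the real room. It only consults `predict`, which is the role a world model would play, so the quality of the plan depends entirely on how well that model reflects reality.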
“We need machines that understand the world; [machines] that can remember things, that have intuition, have common sense — things that can reason and plan to the same level as humans,” LeCun said. “Despite what you might have heard from some of the most enthusiastic people, current AI systems are not capable of any of this.”
While LeCun estimates that we’re at least a decade away from the world models he envisions, today’s world models are showing promise as elementary physics simulators.
OpenAI notes in a blog post that Sora, which it considers to be a world model, can simulate actions like a painter leaving brush strokes on a canvas. Sora and similar models can also simulate video games effectively. For example, Sora can render a Minecraft-like UI and game world.
Future world models may be able to generate 3D worlds on demand for gaming, virtual photography, and more, World Labs co-founder Justin Johnson said on an episode of the a16z podcast.
“We already have the ability to create virtual, interactive worlds, but it costs hundreds and hundreds of millions of dollars and a ton of development time,” Johnson said. “[World models] will let you not just get an image or a clip out, but a fully simulated, vibrant, and interactive 3D world.”
High hurdles
While the concept is enticing, many technical challenges stand in the way.
Training and running world models requires massive compute power, even compared to the amount currently used by generative models. While some of the latest language models can run on a modern smartphone, Sora (arguably an early world model) would require thousands of GPUs to train and run, especially if use of such models becomes commonplace.
World models, like all AI models, also hallucinate — and internalize biases in their training data. A world model trained largely on videos of sunny weather in European cities might struggle to comprehend or depict Korean cities in snowy conditions, for example, or simply do so incorrectly.
A general lack of training data threatens to exacerbate these issues, says Mashrabov.
“We have seen models being really limited with generations of people of a certain type or race,” he said. “Training data for a world model must be broad enough to cover a diverse set of scenarios, but also highly specific to where the AI can deeply understand the nuances of those scenarios.”
In a recent post, AI startup Runway’s CEO, Cristóbal Valenzuela, says that data and engineering issues prevent today’s models from accurately capturing the behavior of a world’s inhabitants (e.g. humans and animals). “Models will need to generate consistent maps of the environment,” he said, “and the ability to navigate and interact in those environments.”
If all the major hurdles are overcome, though, Mashrabov believes that world models could “more robustly” bridge AI with the real world, leading to breakthroughs not only in virtual world generation but also in robotics and AI decision-making.
They could also spawn more capable robots.
Robots today are limited in what they can do because they don’t have an awareness of the world around them (or their own bodies). World models could give them that awareness, Mashrabov said — at least to a point.
“With an advanced world model, an AI could develop a personal understanding of whatever scenario it’s placed in,” he said, “and start to reason out possible solutions.”
This story was originally published October 28, 2024, and was updated December 14, 2024, with new information about Sora.