Why World Foundation Models Will Be Key to Advancing Physical AI

by Noah Kravitz

The next generation of physical AI systems depends on models that can accurately simulate and predict outcomes in physical, real-world environments.

Ming-Yu Liu, vice president of research at NVIDIA and an IEEE Fellow, joined the NVIDIA AI Podcast to discuss the significance of world foundation models (WFMs), powerful neural networks that can simulate physical environments. WFMs can generate detailed videos from text or image input and predict how a scene evolves by combining its current state (an image or video) with actions (such as prompts or control signals).
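To make the "current state plus actions" idea concrete, here is a minimal sketch in Python. It is not how Cosmos or any real WFM is implemented: the scene is reduced to a hypothetical (position, velocity) pair, the action is a single acceleration value, and `toy_world_model` is a hand-written stand-in for what would in practice be a learned neural network operating on images or video.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a toy scene state of (position, velocity). A real WFM would
# work with image or video frames, not two numbers.
State = Tuple[float, float]
# Assumption: a toy control signal, here a single acceleration value.
Action = float


def toy_world_model(state: State, action: Action, dt: float = 0.1) -> State:
    """Hypothetical stand-in for a learned model: predict the next state."""
    position, velocity = state
    new_velocity = velocity + action * dt
    new_position = position + new_velocity * dt
    return (new_position, new_velocity)


def rollout(
    model: Callable[[State, Action], State],
    state: State,
    actions: Sequence[Action],
) -> List[State]:
    """Apply the model step by step to predict how the scene evolves."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    # Start at rest and accelerate for three steps.
    for step, predicted in enumerate(rollout(toy_world_model, (0.0, 0.0), [1.0, 1.0, 1.0])):
        print(step, predicted)
```

The loop is the part that carries over: each predicted state is fed back in with the next action, which is what lets a developer "imagine" a future before acting in the real world.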

“World foundation models are important to physical AI developers,” said Liu. “They can imagine many different environments and can simulate the future, so we can make good decisions based on this simulation.”

This is particularly valuable for physical AI systems, such as robots and self-driving cars, which must interact safely and efficiently with the real world.

Why Are World Foundation Models Important?

Building world models often requires vast amounts of data, which can be difficult and expensive to collect. WFMs can generate synthetic data, providing a rich, varied dataset that enhances the training process.

In addition, training and testing physical AI systems in the real world can be resource-intensive. WFMs provide virtual, 3D environments where developers can simulate and test these systems in a controlled setting without the risks and costs associated with real-world trials.

Open Access to World Foundation Models

At the CES trade show, NVIDIA announced NVIDIA Cosmos, a platform of generative WFMs that accelerate the development of physical AI systems such as robots and self-driving cars.

The platform is designed to be open and accessible. It includes pretrained WFMs based on diffusion and autoregressive architectures, along with tokenizers that compress videos into tokens for transformer models.

Liu explained that with these open models, enterprises and developers have all the ingredients they need to build large-scale models. The open platform also provides teams with the flexibility to explore various options for training and fine-tuning models, or build their own based on specific needs.

Enhancing AI Workflows Across Industries

WFMs are expected to enhance AI workflows and development in various industries. Liu sees particularly significant impacts in two areas:

“The self-driving car industry and the humanoid [robot] industry will benefit a lot from world model development,” said Liu. “[WFMs] can simulate different environments that will be difficult to have in the real world, to make sure the agent behaves respectively.”

For self-driving cars, these models can simulate environments that allow for comprehensive testing and optimization. For example, a self-driving car can be tested in various simulated weather conditions and traffic scenarios to help ensure it performs safely and efficiently before deployment on roads.
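As a rough illustration of that kind of scenario sweep, the Python sketch below enumerates every combination of weather and traffic and flags the ones where a driving system scores below a safety threshold. Everything here is assumed for illustration: the condition lists, the `simulate_drive` function and its penalty values are hypothetical placeholders for a real WFM-backed simulation run, not part of any NVIDIA API.

```python
from itertools import product
from typing import List, Tuple

# Assumption: illustrative condition lists, not taken from any real test suite.
WEATHER = ["clear", "rain", "fog", "snow"]
TRAFFIC = ["light", "moderate", "heavy"]


def simulate_drive(weather: str, traffic: str) -> int:
    """Hypothetical placeholder for a simulated test drive.

    Returns a safety score from 0 to 100. A real pipeline would run the
    driving stack inside generated scenes and measure its behavior there.
    """
    weather_penalty = {"clear": 0, "rain": 10, "fog": 20, "snow": 30}
    traffic_penalty = {"light": 0, "moderate": 10, "heavy": 20}
    return 100 - weather_penalty[weather] - traffic_penalty[traffic]


def failing_scenarios(threshold: int = 60) -> List[Tuple[str, str]]:
    """Return every (weather, traffic) pair scoring below the threshold."""
    return [
        (weather, traffic)
        for weather, traffic in product(WEATHER, TRAFFIC)
        if simulate_drive(weather, traffic) < threshold
    ]


if __name__ == "__main__":
    print(failing_scenarios())
```

The value of the sweep is coverage: rare combinations such as heavy traffic in snow can be tested on demand rather than waited for on real roads.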

In robotics, WFMs can simulate and verify the behavior of robotic systems in different environments to make sure they perform tasks safely and efficiently before deployment.

NVIDIA is collaborating with companies like 1X, Huobi and XPENG to help address challenges in physical AI development and advance their systems.

“We are still in the infancy of world foundation model development — it’s useful, but we need to make it more useful,” Liu said. “We also need to study how to best integrate these world models into the physical AI systems in a way that can really benefit them.”

Listen to the podcast with Ming-Yu Liu, or read the transcript.

Learn more about NVIDIA Cosmos and the latest announcements in generative AI and robotics by watching the CES opening keynote by NVIDIA founder and CEO Jensen Huang, as well as joining NVIDIA sessions at the show.