Factlen ExplainerWorld ModelsExplainerJun 25, 2026, 1:21 AM· 5 min read· #3 of 3 in ai

Explainer: How 'World Models' Are Teaching AI the Intuitive Physics of Reality

Artificial intelligence is moving beyond text prediction to a new architecture called 'world models,' which learn the physical rules of reality by observing video. The shift is unlocking major breakthroughs in robotics and autonomous driving by giving machines the common sense of a human infant.

By Factlen Editorial Team

Share this story

World Model Pioneers 45%Applied Robotics Developers 30%Embodied AI Skeptics 25%

World Model Pioneers: Researchers who believe predicting latent states from video is the key to human-level AI.
Applied Robotics Developers: Engineers focused on using world models as practical simulation engines.
Embodied AI Skeptics: Critics who argue that true physical understanding requires physical interaction.

What's not represented

· Cognitive Psychologists
· Hardware Manufacturers

Why this matters

Large language models have hit a ceiling because they understand text, not reality. World models bridge this gap, giving AI the physical intuition required to safely operate robots, self-driving cars, and industrial machinery in the real world.

Key points

Large language models struggle with physical reasoning because they learn from text rather than physical reality.
World models learn 'intuitive physics' by watching video and predicting the next physical state of an environment.
The AI industry is heavily investing in world models, highlighted by Yann LeCun's €3 billion startup AMI Labs.
These models are unlocking zero-shot planning for robotics and generating highly realistic simulations for autonomous driving.

€3 billion

Valuation of LeCun's AMI Labs

1.2 billion

Parameters in Meta's V-JEPA 2

24 fps

Genie 3 real-time simulation speed

98%

V-JEPA accuracy on physical plausibility tests

For all their ability to write code, pass bar exams, and generate poetry, the world's most advanced large language models lack a fundamental cognitive skill possessed by a six-month-old human infant: they do not know that a ball rolling behind a couch still exists. This gap in "intuitive physics" has become the central bottleneck in artificial intelligence. Text-based models learn statistical patterns of language, but they have no grounded understanding of physical reality, leading to hallucinations when they are asked to reason about the physical world.[1]

In early 2026, the AI industry is executing a massive pivot to solve this problem. Researchers are moving beyond text-prediction engines to build "world models"—AI systems designed to build and maintain an internal representation of how the physical world works. By observing millions of hours of video, these models are learning the laws of gravity, object permanence, and momentum without ever being explicitly programmed with physics equations.[1][7]

The momentum behind this architectural shift reached a tipping point this year. Yann LeCun, a Turing Award winner and long-time chief AI scientist at Meta, departed the company after 12 years to found Advanced Machine Intelligence (AMI) Labs. The startup immediately raised €500 million at a €3 billion valuation to build world models, betting that this approach is the only viable path to human-level AI.[2][4]

The ecosystem is expanding rapidly alongside AMI Labs. Google DeepMind recently released Genie 3, an interactive world model capable of generating persistent 3D environments at 24 frames per second. Meanwhile, Fei-Fei Li's World Labs launched its Marble platform, and NVIDIA introduced the Cosmos World Foundation Model Platform to provide physics-aware training environments for robotics developers.[2][3][4]

Unlike language models that predict text, world models predict the next physical state of an environment.

To understand how world models work, it helps to understand why previous computer vision systems failed at physical reasoning. Historically, AI models tried to understand video by working in "pixel space," treating every pixel on a screen as equally important. If an AI was watching a dashcam video of a suburban street, it would spend massive amounts of computing power trying to predict the exact motion of rustling leaves on a tree, rather than focusing on the oncoming car.[1][5]

Modern world models solve this by abandoning pixels entirely. Instead, they operate in what researchers call a "latent representation space." When a model like Meta's Video Joint Embedding Predictive Architecture (V-JEPA 2) watches a video, an encoder compresses the visual information into a high-level, abstract mathematical summary. It ignores the rustling leaves and captures the core semantic state of the scene—the objects, their positions, and their trajectories.[1][5]

The actual learning happens through a sophisticated game of fill-in-the-blank. During training, the system masks out portions of a video—sometimes hiding the final few seconds entirely. A predictor module is then forced to guess the abstract representation of the missing frames based on the context it can see. By constantly predicting what happens next and adjusting its internal parameters when it guesses wrong, the AI naturally deduces the rules of the physical world.[1][5]

The actual learning happens through a sophisticated game of fill-in-the-blank.

This self-supervised learning process closely mirrors how human infants develop cognitive skills. Babies do not learn that a dropped cup will fall by reading a physics textbook; they learn by observing the world, making subconscious predictions, and experiencing surprise when reality defies their expectations. World models replicate this observational learning at a massive scale, ingesting thousands of years' worth of video data to build their physical intuition.[1][5]

The JEPA architecture forces AI to learn physics by predicting abstract representations of missing video frames.

To prove that these models are actually learning physics, researchers use a testing framework borrowed directly from developmental psychology: the "violation of expectation" test. In benchmarks like IntPhys, the AI is shown two videos. One depicts a normal physical event, while the other shows an impossible scenario, such as an object passing seamlessly through a solid wall.[1][4]

When presented with the impossible video, models like V-JEPA register a massive spike in prediction error—the mathematical equivalent of surprise. In recent evaluations, V-JEPA was nearly 98% accurate at identifying physically implausible events, vastly outperforming traditional pixel-based models and multimodal language models that merely guess based on surface-level visual patterns.[1]

The implications for physical AI are profound. For decades, training a robot to chop an onion or fold laundry required painstaking trial and error in the real world, or brittle hand-coded instructions. With a robust world model, a robot can engage in "zero-shot planning." It can imagine a sequence of actions in its internal simulation, predict the physical consequences of those actions, and execute the successful sequence in the real world without prior physical practice.[3][5]

Autonomous vehicle companies are already leveraging this capability. Waymo recently integrated DeepMind's Genie 3 to create a specialized driving simulator. Because the world model understands physics, it can generate highly realistic, synchronized camera and lidar data for dangerous "edge cases"—like a truck spilling its cargo on a highway—allowing the driving software to practice evasive maneuvers in scenarios that are too rare or dangerous to test on real roads.[4]

Modern world models can generate interactive 3D simulations in real time.

Despite the rapid progress, the world model paradigm faces significant skepticism from some corners of the AI research community. Critics argue that passively predicting video frames, even in an abstract latent space, is not the same as true physical comprehension. They point out that humans and animals learn through physical interaction and trial-and-error manipulation, not just by watching a screen.[6]

Furthermore, current world models still suffer from severe memory limitations. Many models operate on time scales of just three to four seconds, exhibiting a "goldfish memory" that prevents them from reasoning about long-term physical causality or complex, multi-step object interactions. While they understand that a ball rolling behind a wall still exists a second later, they struggle to track that object's state over a minute-long sequence.[6]

Nevertheless, the shift from text prediction to physical simulation represents one of the most significant architectural evolutions in artificial intelligence since the invention of the transformer. By giving machines the ability to imagine, predict, and understand the physical consequences of their actions, world models are laying the foundation for AI that can finally step out of the data center and safely navigate the real world.[2][7]

How we got here

2022
Yann LeCun proposes the Joint Embedding Predictive Architecture (JEPA) as a path to autonomous machine intelligence.
Feb 2024
Google DeepMind introduces Genie, an early interactive environment model.
2025
Meta releases V-JEPA, demonstrating AI can learn object permanence from video.
Early 2026
The ecosystem accelerates with the launch of AMI Labs, Genie 3, and NVIDIA Cosmos.

Viewpoints in depth

World Model Pioneers

Researchers who believe predicting latent states from video is the key to human-level AI.

This camp, led by figures like Yann LeCun, argues that the current paradigm of auto-regressive text prediction has hit a fundamental ceiling. They contend that true intelligence requires an internal model of reality. By forcing AI to predict abstract representations of missing video frames, they believe machines can develop the same intuitive physics that human infants acquire through observation, paving the way for Advanced Machine Intelligence (AMI).

Embodied AI Skeptics

Critics who argue that true physical understanding requires physical interaction.

Skeptics caution against anthropomorphizing the capabilities of world models. They argue that while models like V-JEPA are excellent at detecting statistical anomalies in video frames, this does not equate to a genuine understanding of physics. From this perspective, a human baby learns gravity not just by watching objects fall, but by dropping them and feeling their weight. They argue that until AI systems are embodied in robots that physically interact with their environments, their 'understanding' will remain superficial.

Applied Robotics Developers

Engineers focused on using world models as practical simulation engines.

For this group, the philosophical debate over whether an AI 'understands' physics is secondary to its practical utility. Developers at companies like Waymo and NVIDIA are leveraging world models to generate synthetic training data and highly realistic simulations. By using models like Genie 3 to simulate dangerous edge cases—such as a pedestrian stepping into traffic—they can safely train autonomous systems without risking physical hardware or human lives.

What we don't know

Whether passive video observation can ever fully replicate the physical intuition humans gain through tactile interaction.
How quickly researchers can solve the 'goldfish memory' problem that limits current models to short time horizons.
When world models will become computationally efficient enough to run in real-time on consumer-grade robotics hardware.

Key terms

World Model: An AI system that builds an internal representation of an environment to predict how it will change in response to actions.
Latent Space: An abstract, compressed mathematical representation of data, allowing the AI to ignore irrelevant details like pixel noise.
JEPA: Joint Embedding Predictive Architecture, a framework that trains AI to predict missing information in an abstract space rather than pixel-by-pixel.
Intuitive Physics: The innate understanding of basic physical laws, such as gravity and object permanence, typically developed by infants through observation.

Frequently asked

Why can't large language models understand physics?

Language models learn statistical patterns from text. While they can describe a ball falling, they do not possess a spatial or physical understanding of gravity, leading to hallucinations when reasoning about the physical world.

How do world models learn without being programmed?

They watch millions of hours of video and learn to predict what happens next. By minimizing their prediction errors over time, they naturally deduce rules like object permanence and momentum.

What are world models used for?

They are primarily used to train physical AI, allowing robots and autonomous vehicles to practice in highly accurate, AI-generated simulations before operating in the real world.

Sources

[1]Quanta MagazineWorld Model Pioneers
'World Models,' an Old Idea in AI, Mount a Comeback
Read on Quanta Magazine →
[2]IntrolWorld Model Pioneers
Yann LeCun's AMI Labs Raises €500M to Build AI World Models
Read on Introl →
[3]arXivApplied Robotics Developers
Cosmos World Foundation Model Platform for Physical AI
Read on arXiv →
[4]WikipediaApplied Robotics Developers
World model (artificial intelligence)
Read on Wikipedia →
[5]MetaWorld Model Pioneers
V-JEPA 2: A new world model for physical understanding
Read on Meta →
[6]MediumEmbodied AI Skeptics
Critique: Why Meta's V-JEPA Doesn't Actually Understand Physics
Read on Medium →
[7]Factlen Editorial TeamApplied Robotics Developers
Synthesis by Factlen editorial team
Read on Factlen Editorial Team →

Up next

AI Liability

Landmark Wrongful Death Lawsuit Tests AI Product Liability for Chatbot-Induced Homicide-Suicide

A California court is weighing whether OpenAI and Microsoft can be held liable for a 2025 murder-suicide, in a case that could determine if AI models are protected speech or defective products.

Every angle. Every day.

Get ai stories with full source coverage and perspective breakdowns delivered to your inbox.

Get the briefing →Browse ai