Factlen Explainer · Robotic AI · Jun 17, 2026, 9:44 AM · 6 min read

How Vision-Language-Action Models Are Teaching Robots to Understand the Physical World

A new class of AI models is transforming robotics from rigid, pre-programmed machines into adaptable systems that can learn physical tasks by watching humans. Vision-Language-Action (VLA) models bridge the gap between digital reasoning and physical movement.

By Factlen Editorial Team

End-to-End VLA Advocates 40% · Diffusion Policy Pioneers 35% · Generalist Synthesizers 25%

End-to-End VLA Advocates: Believe single neural networks should handle everything from perception to motor control.
Diffusion Policy Pioneers: Focus on generative AI to teach fluid, dexterous physical behaviors.
Generalist Synthesizers: Analyze the broader shift from programmed machines to learning agents.

What's not represented

· Industrial Safety Regulators
· Labor Economists

Why this matters

For decades, deploying a robot meant spending hundreds of hours hand-coding specific movements for a single, controlled environment. VLA models allow robots to generalize—meaning a machine can be told to 'clean up the kitchen' and autonomously figure out how to handle objects it has never seen before.

Key points

Vision-Language-Action (VLA) models allow robots to translate visual inputs and text instructions directly into physical movements.
Google DeepMind's RT-2 showed that internet-scale semantic knowledge can give robots physical 'chain-of-thought' reasoning.
Figure AI's Helix model expanded VLA capabilities to full upper-body continuous control on humanoid robots.
Toyota Research Institute is using 'Diffusion Policy' to teach robots dexterous skills without writing new code.
These models address the scaling problem, paving the way for general-purpose robots in unstructured environments like homes.

62%: RT-2 success rate on unseen tasks
60+: Dexterous skills taught by TRI without code
6-DoF: Degrees of freedom controlled by standard VLA action vectors

For decades, the robotics industry operated under a strict, unforgiving paradigm. Machines were confined to narrow, pre-programmed tasks in highly controlled environments like automotive assembly lines or sterile laboratories. If a factory needed a robotic arm to pick up a slightly different bolt, a team of engineers had to spend hours manually rewriting the code to adjust the machine's precise joint angles. The robots were incredibly strong and fast, but they were entirely blind to context and incapable of adapting to the unexpected.[6]

That paradigm is now being dismantled by a fundamental breakthrough in artificial intelligence. Vision-Language-Action (VLA) models are bridging the historic gap between digital reasoning and physical embodiment. By combining visual perception, natural language understanding, and action generation, these models are allowing robots to interpret messy, real-world scenes and guide their own physical actions without requiring a single line of task-specific code.[4][6]

To understand the shift, it helps to look at the evolution of foundational AI. Large Language Models (LLMs), such as those behind ChatGPT, process and generate text. Vision-Language Models (VLMs) add the ability to understand images, allowing an AI to caption a photo or answer questions about a video. VLA models take the crucial third step: they extend that multimodal understanding into the physical world by generating executable motor commands. They do not just describe what they see; they decide how to physically interact with it.[4][5]

Under the hood, a VLA operates in two primary stages. First, a pre-trained vision-language backbone serves as the perception and reasoning core. It takes an input image of the robot's surroundings—captured by onboard cameras—along with a natural language instruction from a human user. The model encodes both the visual data and the text into a sequence of tokens within a shared latent space, effectively translating the physical scene into a mathematical language the AI can process.[5]
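The first stage can be pictured with a toy sketch. The code below is an illustration only, not any lab's actual encoder: the names `EMBED_DIM`, `PATCH`, and `encode` are hypothetical, and random weights stand in for the learned parameters of a real vision-language backbone. It shows only the core idea that image patches and instruction words are both projected into vectors of the same width and joined into one token sequence.

```python
import random

EMBED_DIM = 8  # hypothetical width of the shared latent space
PATCH = 2      # hypothetical patch size, in pixels


def image_to_patches(image, patch=PATCH):
    """Split a 2D grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def project(vector, weights):
    """Linear projection: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def encode(image, instruction, rng):
    """Return one sequence of EMBED_DIM-wide tokens for image plus text.

    Random weights are placeholders for what a trained model would learn.
    """
    w_img = [
        [rng.uniform(-1, 1) for _ in range(PATCH * PATCH)]
        for _ in range(EMBED_DIM)
    ]
    words = instruction.lower().split()
    word_table = {
        word: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
        for word in sorted(set(words))
    }
    image_tokens = [project(p, w_img) for p in image_to_patches(image)]
    text_tokens = [word_table[word] for word in words]
    return image_tokens + text_tokens


if __name__ == "__main__":
    camera_frame = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
    sequence = encode(camera_frame, "pick up the rock", random.Random(0))
    print(len(sequence), "tokens of width", len(sequence[0]))
```

A 4x4 frame with 2x2 patches yields four image tokens, and the four-word instruction yields four text tokens, so the backbone would see a single sequence of eight vectors.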

How VLA models translate pixels and text into physical movement.

In the second stage, an action decoder maps those internal tokens to discrete symbols that represent physical movement. These symbols are then de-tokenized into continuous robot commands. Instead of outputting a word, the model outputs a vector that specifies the displacement of the robot's end-effector across its degrees of freedom (DoF), as well as the open or closed state of its gripper. In effect, the AI is "speaking" in the language of motor control.[5]
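The de-tokenization step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a specific model's decoder: the bin count `NUM_BINS`, the value ranges in `ACTION_RANGES`, and the function names are all hypothetical. Each discrete token is treated as a bin index that maps linearly onto a continuous range, giving six end-effector values plus one gripper value.

```python
NUM_BINS = 256  # assumption: number of discrete bins per action dimension

# Hypothetical ranges: 3 translations (m), 3 rotations (rad), 1 gripper state.
ACTION_RANGES = [(-0.05, 0.05)] * 3 + [(-0.25, 0.25)] * 3 + [(0.0, 1.0)]


def detokenize(tokens):
    """Map one bin index per dimension to a continuous command value."""
    if len(tokens) != len(ACTION_RANGES):
        raise ValueError("expected one token per action dimension")
    values = []
    for token, (low, high) in zip(tokens, ACTION_RANGES):
        if not 0 <= token < NUM_BINS:
            raise ValueError("token outside the bin range")
        values.append(low + (high - low) * token / (NUM_BINS - 1))
    return values


def tokenize(values):
    """Inverse mapping: snap each continuous value to its nearest bin."""
    tokens = []
    for value, (low, high) in zip(values, ACTION_RANGES):
        index = round((value - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(max(0, min(NUM_BINS - 1, index)))
    return tokens


if __name__ == "__main__":
    command = detokenize([0, 255, 128, 0, 255, 128, 255])
    print("end-effector delta:", command[:6])
    print("gripper closed:", command[6] >= 0.5)
```

The trade-off in a scheme like this is resolution: more bins give finer motion but a larger output vocabulary for the model to predict.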

The concept was pioneered in mid-2023 by Google DeepMind with the introduction of Robotic Transformer 2 (RT-2). DeepMind's researchers took massive vision-language models that had been trained on internet-scale data and fine-tuned them using real-world robot demonstration data. The goal was to see if the broad semantic knowledge the AI learned from the web could be transferred directly to a physical robot arm operating in a laboratory kitchen.[1][5]

Because RT-2 inherited the vast semantic knowledge of the internet, it demonstrated emergent "chain-of-thought" reasoning in the physical world. If a user told the robot to "pick up the improvised hammer," the robot could scan the table, identify a rock, and pick it up—even though it had never been explicitly programmed to associate a rock with a hammer. If told to select a drink for a tired person, it would autonomously reach for an energy drink over a bottle of water.[1]

The results marked a clear advance for the field. When tested on novel scenarios involving objects and backgrounds it had not encountered during training, RT-2 achieved a 62% success rate, nearly double the 32% success rate of its predecessor, RT-1. It proved that robots could generalize their skills, dramatically reducing the need to hand-code every possible edge case they might encounter in the real world.[1]

Google DeepMind's RT-2 demonstrated a massive leap in zero-shot generalization to novel tasks.

By 2025, the technology had evolved from simple table-top robotic arms to highly complex humanoid robots. Figure AI introduced "Helix," a generalist VLA model designed to overcome the longstanding challenges of operating a bipedal humanoid in unstructured environments. Helix demonstrated that the VLA architecture could scale to manage vastly more complex physical bodies.[2]

According to Figure, Helix marked a series of firsts for the industry. It was the first VLA to output high-rate continuous control of an entire humanoid upper body, managing the wrists, torso, head, and individual fingers simultaneously. Crucially, it achieved this while running entirely onboard the robot's embedded, low-power GPUs, untethering the machine from external server racks and making it more viable for commercial deployment.[2]

The model even enabled multi-robot collaboration. Using a single set of neural network weights, two Figure robots could operate simultaneously to solve a shared, long-horizon manipulation task—like putting away groceries—handling thousands of household items they had never encountered before, simply by following natural language prompts.[2]

While companies like DeepMind and Figure focus on transformer-based VLAs, others are exploring alternative generative architectures to achieve similar goals. The Toyota Research Institute (TRI) has pioneered a breakthrough approach to teaching robots dexterous skills using what they call a "Diffusion Policy."[3]

Diffusion Policy uses the same kind of stochastic denoising process that powers AI image generators like Stable Diffusion. But instead of refining random noise into a coherent picture, the TRI model refines noise into predicted, continuous robot actions. This approach is stable to train and well suited to high-dimensional, fluid movements that traditional programming struggles to replicate.[3]
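The denoising idea can be shown with a toy reverse-diffusion loop. This sketch is not TRI's implementation. It uses a standard DDPM-style update with a hypothetical linear noise schedule, and the function `oracle_noise_predictor` is a stand-in for a trained network: it computes the exact noise for a single known demonstration action, which a real policy would instead have to learn from data and condition on camera observations.

```python
import math
import random


def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (hypothetical values) and cumulative products."""
    if steps < 2:
        raise ValueError("need at least two diffusion steps")
    betas = [
        beta_start + (beta_end - beta_start) * i / (steps - 1)
        for i in range(steps)
    ]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for alpha in alphas:
        running *= alpha
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def oracle_noise_predictor(x_t, t, alpha_bars, demo_action):
    """Stand-in for a trained network: exact noise for one known action."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [(x - signal * a) / noise for x, a in zip(x_t, demo_action)]


def sample_action(demo_action, steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step into an action."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(steps)
    x = [rng.gauss(0.0, 1.0) for _ in demo_action]
    for t in reversed(range(steps)):
        eps = oracle_noise_predictor(x, t, alpha_bars, demo_action)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:  # fresh noise is added on every step except the last
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


if __name__ == "__main__":
    target = [0.10, -0.20, 0.05]  # hypothetical end-effector displacement
    print(sample_action(target))
```

Because the stand-in predictor knows the target exactly, the loop recovers it almost perfectly; with a learned predictor the output would be a plausible action drawn from the distribution of demonstrations rather than a copy of one.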

Generative AI techniques like Diffusion Policy allow robots to learn complex, fluid tasks like pouring liquids.

Using this approach, TRI successfully taught robots over 60 difficult, dexterous skills—including pouring liquids, using tools, and manipulating soft, deformable objects—without writing a single line of new code. Engineers simply provided the robot with haptic physical demonstrations and a language description of the goal, allowing the AI to autonomously deploy the new behavior.[3]

The ultimate goal of this research is the creation of Large Behavior Models (LBMs). Just as LLMs serve as foundational engines for text generation across countless applications, LBMs aim to serve as foundational engines for physical movement, providing a scalable architecture that can be dropped into any robot to grant it a baseline understanding of how to interact with the world.[3]

This technological leap targets the most stubborn barrier in robotics: the scaling problem. The home is an unstructured environment filled with unpredictable objects such as delicate glassware, crumpled clothing, and scattered toys. Classical robotics, which requires PhD-level experts to manually script behaviors, cannot scale to the home. By allowing robots to learn through observation and language, VLAs make general-purpose household assistants a realistic possibility.[2][6]

However, the path to general-purpose robotics still faces significant hurdles. The most pressing is the "data bottleneck." While text and image data are abundant on the internet, high-quality physical demonstration data—recordings of robots successfully completing physical tasks—is scarce and incredibly expensive to collect. AI models are only as good as their training data, and the physical world cannot be easily scraped.[6]

Hardware limitations also present a formidable challenge. Even with a perfect VLA brain capable of understanding any scene, a robot is still constrained by its physical body. Current battery densities limit operational time, actuator speeds dictate how fast a robot can move safely, and the sheer physical durability required for continuous real-world operation remains a difficult engineering problem.[6]

Despite these hurdles, the trajectory of the industry is unmistakable. The era of the blind, pre-programmed machine is ending. Driven by Vision-Language-Action models and generative diffusion policies, the next generation of robots will not just execute code—they will perceive, reason, and act, fundamentally changing how machines assist humans in the physical world.[6]

How we got here

Dec 2022: Google researchers introduce RT-1, showing that transformers can learn robotic tasks from demonstration data.
Jul 2023: DeepMind unveils RT-2, the first major Vision-Language-Action model, merging web-scale AI with physical control.
Sep 2023: Toyota Research Institute announces its Diffusion Policy approach, teaching robots dexterous skills without coding.
Feb 2025: Figure AI introduces Helix, a VLA capable of full upper-body continuous control running on embedded GPUs.

Viewpoints in depth

VLA Developers

Researchers building end-to-end neural networks for robotic control.

Proponents of Vision-Language-Action models argue that the only way to achieve general-purpose robotics is through end-to-end learning. By feeding pixels and text directly into a massive transformer model and outputting motor commands, they bypass the brittle, hand-coded logic of classical robotics. They point to the rapid generalization seen in models like RT-2 and Helix as proof that internet-scale semantic knowledge can successfully map to physical intuition.

Classical Roboticists

Engineers who emphasize structured programming and deterministic control.

Traditional roboticists acknowledge the power of VLAs for high-level reasoning but caution against abandoning structured control entirely. They argue that end-to-end neural networks are 'black boxes' that cannot guarantee safety or precision in critical moments. For industrial applications where millimeter-level accuracy and absolute reliability are required, many advocate for a hybrid approach: using AI for semantic understanding, but relying on classical control theory for the actual physical execution.
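The hybrid approach described here can be sketched as a thin deterministic layer wrapped around a learned proposal. Everything in this example is hypothetical, including the workspace limits in `WORKSPACE`, the per-tick limit `MAX_STEP`, and the function names. The assumption is that the AI model supplies only a target position, while a simple proportional controller with hard limits decides how the arm actually moves.

```python
# Hypothetical workspace limits in meters for x, y, z.
WORKSPACE = [(0.0, 0.6), (-0.3, 0.3), (0.02, 0.5)]
MAX_STEP = 0.01  # hypothetical limit on motion per control tick, in meters


def clamp(value, low, high):
    return max(low, min(high, value))


def safe_target(proposed):
    """Clip a proposed target so it always lies inside the workspace."""
    return [clamp(p, low, high) for p, (low, high) in zip(proposed, WORKSPACE)]


def step_toward(current, target, gain=0.5, max_step=MAX_STEP):
    """One proportional-control tick with a hard cap on step size."""
    return [
        c + clamp(gain * (t - c), -max_step, max_step)
        for c, t in zip(current, target)
    ]


def execute(current, proposed, ticks=200):
    """Drive toward the clipped target; the learned model never moves joints."""
    target = safe_target(proposed)
    for _ in range(ticks):
        current = step_toward(current, target)
    return current


if __name__ == "__main__":
    # The proposal below deliberately falls outside the workspace.
    print(execute([0.3, 0.0, 0.2], [1.0, 0.0, -0.2]))
```

The design choice is that the limits hold no matter what the learned model outputs, which is the kind of guarantee the classical camp argues an end-to-end network cannot offer on its own.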

Behavioral AI Researchers

Teams focusing on generative diffusion policies for physical skills.

Researchers at institutes like TRI believe that while VLAs are excellent for reasoning, the physical execution of complex, dexterous tasks requires a different mathematical approach. They champion 'Diffusion Policy'—using the same stochastic denoising processes found in AI image generators to predict continuous robot actions. They argue this method is far more stable to train for high-dimensional, fluid movements like handling soft objects or tying knots.

What we don't know

How quickly the robotics industry can overcome the 'data bottleneck' of collecting enough high-quality physical demonstration data to train massive models.
Whether end-to-end neural networks can be made reliable and safe enough to operate around humans without the hard-coded safety constraints of classical robotics.
When the hardware capabilities (battery density, actuator speed, and durability) will fully catch up to the rapid advancements in robotic AI brains.

Key terms

Vision-Language-Action (VLA) Model: An AI model that processes images and text to directly output physical motor commands for a robot.
Large Behavior Model (LBM): A foundational AI model designed to generate physical actions and behaviors, analogous to how LLMs generate text.
Diffusion Policy: A method of teaching robots that uses the same generative AI math behind image creators to predict fluid physical movements.
Degrees of Freedom (DoF): The number of independent parameters that define a robot's configuration or movement in space.
End-Effector: The device at the end of a robotic arm, such as a gripper or hand, designed to interact with the environment.

Frequently asked

How is a VLA different from ChatGPT?

While ChatGPT, which is built on an LLM, outputs text, a VLA outputs physical motor commands. It understands the text and images, but translates that understanding into instructions that move a robot's joints.

Do engineers still have to program the robots?

Increasingly, no. Instead of writing code for every movement, engineers provide the robot with physical demonstrations and natural language instructions, allowing the AI to learn the behavior autonomously.

Can these robots work in normal homes?

That is the ultimate goal. Because VLA models can generalize and handle objects they have never seen before, they are the key technology needed to move robots out of structured factories and into unpredictable human environments.

Sources

[1] Google DeepMind (End-to-End VLA Advocates): RT-2: New model translates vision and language into action
[2] Figure AI (End-to-End VLA Advocates): Introducing Helix: A Generalist Vision-Language-Action Model
[3] Toyota Research Institute (Diffusion Policy Pioneers): Toyota Research Institute Unveils Breakthrough in Teaching Robots New Behaviors
[4] Exxact Corp (End-to-End VLA Advocates): What are Vision-Language-Action (VLA) Models?
[5] Wikipedia (Generalist Synthesizers): Vision-language-action model
[6] Factlen Editorial Team (Generalist Synthesizers): Synthesis by Factlen editorial team
