Factlen Explainer · Robot Learning · Jun 8, 2026, 4:43 AM · 7 min read

How Vision-Language-Action Models Are Giving Robots 'Common Sense'

A new class of AI called Vision-Language-Action (VLA) models is replacing decades of rigid robot programming, allowing humanoids to learn physical tasks directly from internet data and human video.

Commercial Developers 35% · Open-Source Community 35% · Classical Roboticists 30%

Commercial Developers: Advocate for end-to-end neural networks to achieve general-purpose adaptability.
Open-Source Community: Focus on democratizing access to robotic foundation models and datasets.
Classical Roboticists: Prioritize safety, predictability, and mathematically verifiable control systems.

What's not represented

· Factory Floor Workers
· Regulators and Safety Certifiers
· Consumer Privacy Advocates

Why this matters

For decades, robots were confined to predictable factory floors because they couldn't adapt to unexpected changes. VLA models are the breakthrough needed to bring general-purpose robots into messy, unpredictable human environments like homes, hospitals, and small businesses.

Key points

Vision-Language-Action (VLA) models are replacing traditional, rigid robotic programming with adaptable, end-to-end neural networks.
By training on internet-scale data, VLAs allow robots to understand natural language commands and recognize novel objects without explicit programming.
Open-source models like OpenVLA and massive datasets like Open X-Embodiment are democratizing advanced robotics research.
Companies like Figure AI are successfully using VLAs to control complex humanoid robots, achieving milestones like zero-shot learning from human video.
Classical roboticists caution that the unpredictable nature of neural networks requires hybrid systems to ensure physical safety.

970,000: Robot trajectories in Open X-Embodiment
7 billion: Parameters in OpenVLA model
1,000 Hz: Figure AI motor adjustment frequency

For decades, the robotics industry has been defined by a frustrating paradox: machines that can assemble a car chassis with sub-millimeter precision are entirely defeated by a messy kitchen counter. Traditional robots are confined to highly predictable environments—such as factory floors, automated warehouses, and controlled laboratories—where every single movement is meticulously pre-programmed by human engineers. In these settings, the world is organized around the robot. If an object is moved a few inches out of place, or if the ambient lighting changes unexpectedly, the entire system typically fails and requires human intervention.[6]

This extreme brittleness stems directly from how robots have historically been built. Classical robotics relies on a highly segmented "relay race" architecture. A dedicated vision module identifies an object, a separate language module processes a human command, and a distinct control module calculates the exact joint angles needed to move the robotic arm. Because these separate systems barely communicate and rely on rigid hand-coded logic, the robot lacks a holistic understanding of what it is actually doing. It doesn't know what a cup is; it only knows a set of geometric coordinates.[6]
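
The "relay race" design can be sketched in a few lines of Python. This is a toy illustration, not code from any real robot stack: the module names, the `EXPECTED_POSITIONS` table, the 2 cm tolerance, and the single supported command are all hypothetical. It shows how hand-coded stages that pass along only labels and coordinates fail as soon as the object or the phrasing drifts from what the engineer anticipated.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    x: float
    y: float


# Hypothetical hand-coded table of where each known object should sit (meters).
EXPECTED_POSITIONS = {"cup": (0.40, 0.10)}
TOLERANCE_M = 0.02  # assumed tolerance, chosen only for illustration


def vision_module(scene):
    """Report only objects that sit where the program expects them."""
    found = []
    for label, (x, y) in scene.items():
        if label in EXPECTED_POSITIONS:
            ex, ey = EXPECTED_POSITIONS[label]
            if abs(x - ex) <= TOLERANCE_M and abs(y - ey) <= TOLERANCE_M:
                found.append(Detection(label, x, y))
    return found


def language_module(command):
    """Map a fixed set of phrases to a target label; reject anything else."""
    known_commands = {"pick up the cup": "cup"}
    return known_commands.get(command.lower().strip())


def control_module(target):
    """Pass on bare coordinates; nothing downstream knows what the object is."""
    return (target.x, target.y)


def run_pipeline(scene, command):
    """Chain the three modules; return a reach target or None on failure."""
    target_label = language_module(command)
    if target_label is None:
        return None
    for detection in vision_module(scene):
        if detection.label == target_label:
            return control_module(detection)
    return None
```

With the cup at its expected spot, `run_pipeline({"cup": (0.40, 0.10)}, "pick up the cup")` returns the target coordinates. Move the cup 8 cm, or rephrase the request as "grab the mug," and the same call returns `None`.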

That era of rigid, segmented programming is rapidly coming to an end. The robotics industry is currently undergoing a fundamental transformation driven by a new class of artificial intelligence: Vision-Language-Action (VLA) models. By unifying visual perception, natural language understanding, and physical action generation into a single, massive neural network, VLAs are finally giving robots the equivalent of physical "common sense." Instead of relying on thousands of lines of if-then code, these models learn how to interact with the world by absorbing massive amounts of data.[1][7]

How end-to-end neural networks replace the segmented modules of classical robotics.

To understand how a VLA works, it helps to look at large language models such as those behind ChatGPT. Language models generate coherent text by repeatedly predicting the most likely next token (roughly, a word or word fragment) in a sequence, based on their training data. VLAs use the same kind of underlying transformer architecture, but instead of predicting only words, they also predict "action tokens": discrete symbols that each stand for a small, low-level movement. When fed an image from a robot's onboard camera and a text instruction from a human, the VLA directly outputs a sequence of these tokens, which are decoded into low-level motor commands, such as "move arm left 0.5 meters" or "close gripper slightly," that can be immediately executed by the hardware.[1][2]
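
One common way to turn continuous motor values into tokens is uniform binning, sketched below. The bin count of 256 matches what the RT-2 and OpenVLA papers describe, but the normalized range of [-1, 1] and the function names `encode` and `decode` are assumptions made for illustration, not any model's actual API.

```python
N_BINS = 256            # bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def encode(action):
    """Map each continuous action value to an integer token in [0, N_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)           # keep values in range
        fraction = (clipped - LOW) / (HIGH - LOW)      # rescale to [0, 1]
        tokens.append(min(int(fraction * N_BINS), N_BINS - 1))
    return tokens


def decode(tokens):
    """Map each token back to the center of its bin."""
    width = (HIGH - LOW) / N_BINS
    return [LOW + (token + 0.5) * width for token in tokens]
```

Decoding returns the center of each bin, so a round trip through `encode` and `decode` loses at most half a bin width of precision per dimension. The model itself never sees the continuous values; it only predicts which of the 256 symbols comes next for each dimension.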

The conceptual foundation for this shift was pioneered in 2023 by Google DeepMind with the introduction of Robotics Transformer 2 (RT-2). DeepMind's breakthrough was realizing that a robot doesn't just need to learn from physical demonstrations; it can learn from the entire internet. By fine-tuning a massive vision-language model on robotic trajectory data, RT-2 successfully transferred broad web knowledge directly into physical control. The AI already understood the semantic concepts of the world from reading the internet, and the robotic data taught it how to physically interact with those concepts.[2]

This web-to-robot transfer unlocked emergent capabilities that hand-programmed robots could not match. For example, if a human asks a traditional robot to "throw away the trash," the robot must be explicitly programmed to identify every possible piece of garbage it might encounter. RT-2, having learned from web-scale data, can already recognize a banana peel, a crumpled wrapper, or an empty chip bag. It can even perform rudimentary reasoning, such as identifying that a heavy rock could be used as an improvised hammer if no actual tools are available on the table.[2]

VLAs translate pixels and text directly into low-level action tokens.

While Google's foundational models remain closed research artifacts, the open-source community has rapidly democratized this technology. In 2024, researchers introduced OpenVLA, a 7-billion-parameter open-source model that set new benchmarks for generalist robot manipulation. Because OpenVLA is open-source and relatively lightweight compared to massive corporate models, developers can fine-tune it for specific tasks using standard consumer graphics cards. This accessibility has drastically lowered the barrier to entry, allowing university labs and small startups to experiment with cutting-edge robotic intelligence without needing billions of dollars in computing infrastructure.[3][8]
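
A rough calculation shows why a 7-billion-parameter model is within reach of a single graphics card. The sketch below counts only the memory needed to hold the weights; activations, gradients, and optimizer state add more, which is why fine-tuning at this scale typically relies on reduced precision or parameter-efficient methods. The precisions listed are illustrative assumptions, not a statement of how OpenVLA is stored or trained.

```python
PARAMS = 7_000_000_000  # parameter count cited for OpenVLA

# Bytes needed to store one parameter at several common precisions (assumed).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for precision, size in BYTES_PER_PARAM.items():
        print(f"{precision}: {weight_memory_gb(PARAMS, size):.1f} GB")
```

At 16-bit precision the weights alone come to about 14 GB, and at 4-bit precision about 3.5 GB, compared with 28 GB at full 32-bit precision.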

The shift to VLA models has fundamentally changed the core challenge of robotics. The primary engineering problem is no longer "how do we mathematically program this movement?" but rather "how do we gather enough physical data to train the neural network?" To solve this bottleneck, the industry collaborated to create the Open X-Embodiment dataset, a massive collection of 970,000 real-world robot manipulation trajectories sourced from over 70 different environments. This shared data pool is the critical fuel that allowed models like OpenVLA to achieve their high success rates across diverse physical tasks.[3][7]

The Open X-Embodiment dataset provides the massive scale of physical data required to train VLAs.

The most dramatic and highly publicized application of VLA technology is currently happening in the humanoid sector. Companies like Figure AI are moving beyond simple tabletop robotic arms and applying end-to-end neural networks to full bipedal robots. Figure's proprietary VLA, known as Helix, acts as a unified brain for their Figure 03 humanoid. Rather than having one system for walking and another for grabbing, a single neural network controls everything from dexterous finger manipulation to navigating around a cluttered living room.[4]

Figure AI recently achieved a major milestone in this space known as "zero-shot human-to-robot transfer." Traditionally, training a robot required a human operator to physically guide the machine through a task hundreds of times using a teleoperation rig. Figure bypassed this tedious process by training its Helix model entirely on egocentric video: footage captured from cameras worn by people going about their daily lives at home. The AI watched the humans and worked out how to translate those biological movements into robotic motor commands without any robot-specific data.[4]

Because the neural network learns end-to-end, it can develop efficient, human-like behaviors that were never explicitly programmed by an engineer. During a recent demonstration, a Figure robot was observed closing a dishwasher drawer using a smooth bump from its hip, rather than awkwardly turning around to use its hands. This kind of improvised efficiency suggests that the model is adapting to its physical embodiment and its environment in real time, rather than just playing back a rigidly recorded script.[4][7]

End-to-end learning allows robots to develop unprogrammed, human-like efficiencies.

Despite these rapid and visually stunning advancements, the transition to end-to-end neural networks has sparked intense debate within the classical robotics community. Traditional roboticists point out that neural networks are essentially "black boxes." Because the model maps raw pixel inputs directly to motor outputs without explicit intermediate steps, it can be incredibly difficult to debug exactly why a robot made a specific, potentially dangerous movement. If a robot drops a plate, an engineer cannot simply look at a line of code to fix the error.[5][6]

In safety-critical applications—such as autonomous driving, heavy manufacturing, or robotic surgery—this unpredictability is unacceptable. If a VLA encounters a physical scenario vastly different from its training data, it might hallucinate a physical action, just as a language model might confidently hallucinate a fake historical fact. To mitigate this severe risk, the most reliable deployed systems today use a hybrid approach. The VLA acts as the intelligent "brain" suggesting actions, but its commands are filtered through a hand-engineered "safety cage" of classical control loops that physically prevent the robot from moving too fast or colliding with humans.[6][7]
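
The "safety cage" idea can be sketched as a small filter that sits between the model and the motors. This is a minimal illustration under stated assumptions: the `SafetyLimits` values, the function name `safety_filter`, and the use of a single human-distance reading are all hypothetical, and a real system would use certified controllers and far richer checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyLimits:
    max_speed_m_s: float = 0.25         # assumed speed cap
    min_human_distance_m: float = 0.50  # assumed stop distance


def safety_filter(proposed_velocity, human_distance_m, limits=SafetyLimits()):
    """Return a velocity command that respects the hand-coded limits.

    proposed_velocity is the (vx, vy, vz) command suggested by the model.
    """
    # Hard stop if a person is closer than the allowed distance.
    if human_distance_m < limits.min_human_distance_m:
        return (0.0, 0.0, 0.0)

    speed = sum(v * v for v in proposed_velocity) ** 0.5
    if speed <= limits.max_speed_m_s:
        return tuple(proposed_velocity)

    # Too fast: keep the direction, scale the magnitude down to the cap.
    scale = limits.max_speed_m_s / speed
    return tuple(v * scale for v in proposed_velocity)
```

The filter never needs to understand why the model proposed a motion. It only enforces limits that can be checked with simple arithmetic, which is what makes this layer predictable even when the network behind it is not.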

The successful commercial deployment of VLA models could reshape large parts of the global economy. The industrial robotics market, currently valued at over $44 billion, has historically been limited to highly structured manufacturing and logistics. By giving robots the ability to navigate clutter, understand natural language, and adapt to novel objects on the fly, VLAs could open new markets in dynamic warehousing, healthcare assistance, and eventually domestic household labor.[1][7]

We are witnessing the "ChatGPT moment" for physical machines. For decades, robots were trapped in the realm of rigid, brittle automation, waiting for the world to be perfectly organized and structured for them before they could be useful. With the rapid maturation of Vision-Language-Action models, that limitation is dissolving. By combining the vast semantic knowledge of the internet with massive datasets of physical movement, robots are finally learning to meet the world as it actually is—messy, unpredictable, and profoundly human.[7]

How we got here

December 2022
Google researchers introduce RT-1, demonstrating that transformer architectures can help robots learn many tasks from demonstrations.
July 2023
DeepMind announces RT-2, the first major Vision-Language-Action model capable of transferring web-scale knowledge directly to robotic control.
June 2024
Researchers release OpenVLA, a 7-billion-parameter open-source model that democratizes access to advanced robotic manipulation.
February 2025
Figure AI introduces Helix, a unified VLA model capable of controlling full bipedal humanoids and navigating complex environments.

Viewpoints in depth

Commercial Robotics Developers

Companies building humanoids argue that end-to-end neural networks are the only path to general-purpose robots.

Firms like Figure AI and Google DeepMind believe that classical robotics has hit a ceiling. They argue that the real world is too complex to be hand-coded with 'if-then' logic. By training massive Vision-Language-Action models on internet-scale data and human video, these developers aim to create robots that learn physical intuition the same way humans do—through observation and massive scale. For them, the 'black box' nature of neural networks is an acceptable trade-off for unprecedented adaptability.

Open-Source Advocates

Researchers focused on democratizing AI to prevent a few massive tech companies from monopolizing robotic intelligence.

The teams behind projects like OpenVLA emphasize that the future of physical AI must be accessible. Training a foundational robotics model requires millions of dollars in compute and data collection, threatening to lock academic researchers and startups out of the field. By releasing 7-billion-parameter models and massive datasets like Open X-Embodiment to the public, this camp ensures that innovation in robotic manipulation can happen in university labs and small startups, not just inside trillion-dollar corporations.

Classical Roboticists

Traditional engineers who emphasize mathematical guarantees, safety, and predictability in physical machines.

Veterans of the robotics industry view the rush toward end-to-end neural networks with caution. While they acknowledge the impressive generalization of VLA models, they warn that AI 'hallucinations' in the physical world can cause catastrophic damage. Classical roboticists advocate for hybrid systems, where a neural network might suggest a high-level plan, but traditional, mathematically verifiable control loops actually execute the motor movements to ensure the robot never violates strict safety parameters.

What we don't know

It remains unclear how quickly VLA-powered humanoids will transition from controlled demonstrations to reliable, commercial deployment in unpredictable homes.
The robotics industry has not yet established standardized safety certification protocols for robots controlled entirely by end-to-end neural networks.
Researchers are still determining the absolute limits of zero-shot transfer—how far a robot can generalize without needing physical, task-specific practice.

Key terms

Vision-Language-Action (VLA) model: A type of multimodal AI that integrates visual perception, language understanding, and physical action generation into a single system.
End-to-end learning: A machine learning approach where a single neural network maps raw inputs directly to final outputs without intermediate, hand-coded steps.
Action token: A discretized piece of data representing a specific, low-level robotic movement, treated by the AI much like a word in a sentence.
Open X-Embodiment: A massive, open-source dataset containing nearly a million real-world robot trajectories, used to train general-purpose robotic AI.
Zero-shot transfer: The ability of an AI model to successfully complete a task it has never explicitly encountered before, relying entirely on generalized prior knowledge.

Frequently asked

What is a Vision-Language-Action (VLA) model?

A VLA is an artificial intelligence model that takes in visual data and natural language instructions, and directly outputs low-level motor commands for a robot to execute.

How is a VLA different from ChatGPT?

While ChatGPT predicts the next word in a text sequence, a VLA uses a similar underlying architecture to predict the next 'action token,' translating digital reasoning into physical movement.

What is zero-shot transfer in robotics?

Zero-shot transfer occurs when a robot successfully performs a task it was never explicitly trained to do, often by applying general knowledge it learned from internet data or human video.

Are end-to-end neural networks safe for robots?

Safety remains a major concern, as neural networks can be unpredictable 'black boxes.' Many engineers use hybrid approaches, wrapping the AI's decisions in traditional, hand-coded safety limits to prevent accidents.

Sources

[1] Wikipedia, "Vision-language-action model" (Open-Source Community)
[2] Google DeepMind, "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (Commercial Developers)
[3] arXiv, "OpenVLA: An Open-Source Vision-Language-Action Model" (Open-Source Community)
[4] Figure AI, "Helix: A Single Neural Network for Humanoid Control" (Commercial Developers)
[5] MIT News, "A faster way to estimate uncertainty in AI-assisted decision-making" (Classical Roboticists)
[6] Dalhousie University, "Classical Robotics vs. End-to-End Deep Learning" (Classical Roboticists)
[7] Factlen Editorial Team, "Synthesis by Factlen editorial team" (Open-Source Community)
[8] Enterprise DNA, "OpenVLA: An open-source vision-language-action model for robotic manipulation" (Open-Source Community)
