Factlen Explainer · AI Architecture · Technology Explainer · Jun 17, 2026, 5:06 AM · 5 min read

How 'End-to-End' Neural Networks Are Rewriting the Rules of Autonomous Driving

The self-driving industry is abandoning millions of lines of hand-coded rules in favor of single, massive AI models that learn to drive by watching humans.

By Factlen Editorial Team

Pure End-to-End Advocates: 45%
Hybrid AI Researchers: 35%
Modular System Defenders: 20%

Pure End-to-End Advocates: Believe that scaling single neural networks with massive data is the only way to solve the infinite edge cases of real-world driving.
Hybrid AI Researchers: Seek to combine the generalization of neural networks with the interpretability of structured feature maps and rule-based guardrails.
Modular System Defenders: Prioritize explainability and deterministic safety, arguing that black-box AI must be heavily constrained in safety-critical physical systems.

What's not represented

· Insurance Actuaries
· Highway Safety Regulators

Why this matters

This architectural shift mirrors the 'ChatGPT moment' for physical robotics. By treating driving as a data problem rather than a logic problem, vehicles are learning to navigate complex, unscripted environments far more smoothly than traditional code ever allowed.

Key points

The autonomous driving industry is shifting from hand-coded rules to 'end-to-end' neural networks.
These systems learn to drive by watching millions of hours of human driving footage, a process called behavioral cloning.
End-to-end models bypass the traditional modular stack, reducing cascading errors between perception and planning systems.
The approach introduces a 'black box' problem, making it harder for engineers to debug exactly why an AI made a specific mistake.
Researchers are developing 'attention maps' to improve the interpretability of these massive neural networks.

300,000: Lines of C++ code removed in Tesla FSD v12
1.5 PB: Driving data processed per training cycle

For the past decade, the quest for the self-driving car was essentially a massive exercise in rule-writing. Engineers painstakingly coded 'if-then' statements for every conceivable scenario: if the light is red, stop; if a pedestrian steps off the curb, brake; if the speed limit drops, decelerate. It was a triumph of deterministic logic.[7]
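To make the rule-writing approach concrete, here is a minimal sketch of the three rules named above written as hand-coded logic. The `Scene` fields and the `rule_based_action` function are hypothetical names chosen for illustration; they are not taken from any production driving stack.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """Hypothetical, already-interpreted view of the road (illustrative only)."""
    light_is_red: bool
    pedestrian_off_curb: bool
    speed_mps: float
    speed_limit_mps: float


def rule_based_action(scene: Scene) -> str:
    """Return a driving action by checking hand-written rules in priority order."""
    if scene.pedestrian_off_curb:
        return "brake"
    if scene.light_is_red:
        return "stop"
    if scene.speed_mps > scene.speed_limit_mps:
        return "decelerate"
    # Anything the rules do not anticipate falls through to a default.
    return "maintain"
```

Each additional situation needs another field and another branch, which is where this style of programming becomes hard to maintain.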

But the real world is infinitely complex, and the number of edge cases—what the industry calls the 'long tail'—is practically limitless. A plastic bag blowing across the highway looks different to a sensor than a rock; a double-parked delivery truck requires a car to illegally cross a double-yellow line to proceed. Writing explicit C++ code for all of this eventually hits a wall of diminishing returns.[6]

Now, the autonomous vehicle industry is undergoing a radical architectural shift. Companies are abandoning the millions of lines of hand-written code in favor of 'end-to-end' (E2E) neural networks, fundamentally changing how machines interact with the physical world.[3]

In an end-to-end system, the artificial intelligence learns to drive much like a human teenager does: by watching and practicing. Raw sensor data—photons hitting the cameras—goes into one side of a massive neural network, and steering, braking, and acceleration commands come directly out the other.[1]
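The shape of that interface can be sketched in a few lines: one function that takes raw pixel values and returns control commands, with no labeled objects in between. This is only a toy under loud assumptions: the weights are random placeholders rather than trained values, the 16-pixel "image" is invented, and the names `make_layer` and `forward` are hypothetical. A real system would be trained and vastly larger.

```python
import math
import random


def make_layer(n_in: int, n_out: int, rng: random.Random) -> list[list[float]]:
    """Create an untrained weight matrix (random placeholder values)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def sigmoid(x: float) -> float:
    """Squash a real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(pixels: list[float], w1, w2) -> dict[str, float]:
    """Map raw pixel intensities straight to control commands."""
    hidden = [math.tanh(sum(w * p for w, p in zip(row, pixels))) for row in w1]
    raw = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return {
        "steering": math.tanh(raw[0]),  # -1 (full left) to 1 (full right)
        "throttle": sigmoid(raw[1]),    # 0 to 1
        "brake": sigmoid(raw[2]),       # 0 to 1
    }


rng = random.Random(0)
w1 = make_layer(16, 8, rng)  # 16 "pixels" in, 8 hidden units
w2 = make_layer(8, 3, rng)   # 8 hidden units in, 3 controls out
controls = forward([rng.random() for _ in range(16)], w1, w2)
```

Nothing in `forward` names a pedestrian, a lane, or a traffic light; whatever the network knows about them lives only in the weights.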

This approach collapses the traditional, modular 'AV 1.0' software stack. Historically, a self-driving car used separate, specialized modules: a perception module to identify objects, a prediction module to guess where they would go, and a planning module to plot the car's exact path.[4]

The architectural shift from modular subsystems to a single unified neural network.

The modular approach was logical and easy to debug, but it suffered from cascading errors. If the perception module misclassified a stop sign as a billboard, the planning module would never even know it needed to stop. End-to-end models bypass this brittle chain of command, processing the entire scene holistically.[3]
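The stop-sign failure can be reproduced with three stand-in modules. The sketch below is deliberately simplistic, and `perceive`, `predict`, `plan`, and `modular_drive` are hypothetical names; the point is only that each stage sees nothing except the previous stage's output.

```python
def perceive(raw_label: str) -> str:
    """Stand-in perception module: it reports a (possibly wrong) label."""
    return raw_label


def predict(objects: list[str]) -> list[str]:
    """Stand-in prediction module: static objects pass through unchanged."""
    return objects


def plan(objects: list[str]) -> str:
    """Stand-in planning module: stops only if a stop sign was reported."""
    return "stop" if "stop_sign" in objects else "proceed"


def modular_drive(raw_label: str) -> str:
    """Chain the modules; each stage sees only the previous stage's output."""
    return plan(predict([perceive(raw_label)]))
```

Calling `modular_drive("stop_sign")` yields a stop, while `modular_drive("billboard")` proceeds: once perception has mislabeled the sign, no later stage has the information needed to recover.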

Tesla has been the most visible proponent of this shift in the consumer market. With the release of its Full Self-Driving (FSD) version 12 architecture, the company deleted roughly 300,000 lines of explicit C++ control code, replacing it with a unified neural network.[1]

Instead of rules, Tesla's system relies on 'behavioral cloning.' The company feeds its neural networks millions of hours of high-quality video collected from its customer fleet. The AI learns the subtle, unwritten rules of human driving—like inching forward at a blind intersection to get a better view—simply by mimicking what human drivers do in similar situations.[1]
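At toy scale, behavioral cloning is ordinary supervised learning: fit a policy so that its output matches what the human did in the same situation. The sketch below fits a one-parameter-pair linear policy to synthetic demonstrations with gradient descent on mean squared error. Everything here is an assumption made for illustration, including the `clone_policy` name, the single "lane offset" input, and the invented demonstrations; it is not Tesla's training procedure.

```python
def clone_policy(demos, epochs=500, lr=0.1):
    """Fit steering = w * offset + b to (offset, human_steering) pairs."""
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        # Gradients of mean squared error between policy and human action.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in demos) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in demos) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic demonstrations: the "human" steers back toward the lane centre.
demos = [(x / 10, -0.5 * (x / 10)) for x in range(-10, 11)]
w, b = clone_policy(demos)
```

After training, `w` is close to -0.5 and `b` close to 0, so the policy has recovered the demonstrator's habit without anyone writing that rule down.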

But Tesla is far from alone in this pursuit. Wayve, a London-based AI company, was founded on this exact premise. They coined the term 'AV 2.0' to describe a generalization-first approach that relies entirely on deep learning rather than high-definition maps and rigid, city-specific rules.[4]

Wayve's architecture utilizes 'world models'—neural networks that don't just react to the current frame of video, but actively simulate and predict what will happen next in the physical environment. This allows the vehicle's AI to reason through complex, unseen scenarios before it even encounters them.[4]
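The idea of predicting ahead before acting can be illustrated with a hand-written transition function standing in for a learned world model. In this sketch the planner imagines several seconds of driving for each candidate acceleration and keeps the first one whose imagined future stays safe. The state layout, the `step`, `min_predicted_gap`, and `choose_accel` names, and all numeric values are hypothetical; a real world model would be a neural network learned from video, not three lines of kinematics.

```python
def step(state, accel, dt=0.5):
    """Stand-in for a learned world model: predict the next state."""
    gap, ego_speed, lead_speed = state
    ego_speed = max(0.0, ego_speed + accel * dt)
    gap = gap + (lead_speed - ego_speed) * dt
    return (gap, ego_speed, lead_speed)


def min_predicted_gap(state, accel, horizon=10):
    """Imagine `horizon` steps ahead and report the smallest gap seen."""
    smallest = state[0]
    for _ in range(horizon):
        state = step(state, accel)
        smallest = min(smallest, state[0])
    return smallest


def choose_accel(state, candidates=(1.0, 0.0, -3.0), safe_gap=5.0):
    """Pick the first candidate (most assertive first) whose rollout stays safe."""
    for accel in candidates:
        if min_predicted_gap(state, accel) >= safe_gap:
            return accel
    return min(candidates)  # fall back to the hardest braking option
```

With a state of `(gap, ego_speed, lead_speed)`, a clear road such as `(50.0, 10.0, 10.0)` selects gentle acceleration, while a stopped vehicle 40 metres ahead, `(40.0, 15.0, 0.0)`, selects braking, because only the braking rollout keeps the predicted gap above the threshold.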

The compute requirements for training end-to-end driving models have scaled exponentially.

Even the traditional giants of the modular approach are exploring the end-to-end frontier. Waymo, which operates commercial robotaxis using a highly refined modular stack and LiDAR, recently published research on a new model called EMMA (End-to-End Multimodal Model for Autonomous driving).[2]

Built on the foundation of Google's Gemini large language model, EMMA processes raw camera data and navigation instructions as a unified stream of text-like tokens, generating driving trajectories directly. While Waymo notes that EMMA is a research project and computationally expensive, it highlights the undeniable gravitational pull of the end-to-end philosophy.[2]
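One way to picture a trajectory expressed as text-like output is to serialize waypoints into a string and parse them back. The format below is purely an assumption for illustration, as are the `encode_trajectory` and `decode_trajectory` names; it is not EMMA's actual token format.

```python
def encode_trajectory(waypoints):
    """Serialise (x, y) waypoints in metres as a plain-text string."""
    return " ".join(f"({x:.2f},{y:.2f})" for x, y in waypoints)


def decode_trajectory(text):
    """Parse the text form back into numeric waypoints."""
    points = []
    for token in text.split():
        x, y = token.strip("()").split(",")
        points.append((float(x), float(y)))
    return points


planned = [(0.0, 0.0), (1.5, 0.1), (3.0, 0.35)]
as_text = encode_trajectory(planned)
```

Once a path is just a sequence of characters, a language model can in principle emit it the same way it emits a sentence, and a thin decoder turns the text back into numbers for the vehicle.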

The hardware industry is also pivoting to support this new paradigm. NVIDIA recently won the CVPR 2024 Autonomous Grand Challenge with Hydra-MDP, an end-to-end driving model that streamlines perception and planning into a single network using bird's-eye view feature maps, a result that suggests the architecture can work at scale.[3]

However, the transition to end-to-end AI introduces a massive new challenge: the 'black box' problem. When a modular system makes a mistake, engineers can look at the logs and see exactly which line of code or which specific perception module failed.[5]

When an end-to-end neural network makes a mistake, it is exceedingly difficult to understand why. The decision is buried in the mathematical weights and biases of billions of parameters. If the car abruptly brakes for no obvious reason, engineers cannot simply rewrite a rule to fix it.[5]

Researchers use attention maps to understand what a 'black box' neural network is looking at when making a driving decision.

To solve this, researchers are developing new interpretability techniques. By extracting 'attention maps' from the neural network, developers can see which specific pixels or objects in the camera's view most heavily influenced the AI's decision to steer or brake, bringing a measure of transparency back to the system.[5]
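The core arithmetic behind an attention map is a softmax: raw per-patch scores are normalized into positive weights that sum to 1, and the largest weights mark the regions that mattered most. The sketch below assumes the scores are already available; the score values and the `attention_map` helper are made up for illustration, and extracting real scores from a driving network is considerably more involved.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(patch_scores):
    """Normalise a 2D grid of patch scores into a 2D attention map."""
    flat = [s for row in patch_scores for s in row]
    weights = softmax(flat)
    width = len(patch_scores[0])
    return [weights[i:i + width] for i in range(0, len(weights), width)]


# Made-up scores for a 2x3 grid of image patches; the high value marks
# the patch that (hypothetically) contains a pedestrian.
scores = [[0.1, 0.2, 0.1],
          [0.3, 4.0, 0.2]]
amap = attention_map(scores)
```

Here the patch at row 1, column 1 receives most of the weight, which is the kind of signal an engineer would overlay on the camera frame to check whether the network was looking at the right thing.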

Furthermore, training these massive models requires staggering amounts of computing power. Processing petabytes of video data and running continuous simulations demands tens of thousands of advanced GPUs, making the price of entry into the AV 2.0 race astronomically high.[1]

Despite these hurdles, a growing part of the AI community is converging on one bet: that the same scaling laws that allowed Large Language Models to master human text can be applied to the physical world.[4]

By treating driving as a massive data problem rather than a rigid logic problem, the industry is betting that neural networks will eventually surpass the limitations of human programmers.[7]

The ultimate result could be autonomous vehicles that drive less like cautious, jerky robots and more like highly attentive, experienced humans—capable of smoothly navigating the chaotic, unscripted reality of our roads.[7]

How we got here

Pre-2020: The industry standardizes on the 'modular stack,' requiring millions of lines of explicit C++ code to handle perception, planning, and control.
2017: Wayve is founded with a contrarian vision to build autonomous systems entirely on end-to-end deep learning.
Late 2023: Tesla releases FSD v12, replacing 300,000 lines of explicit control code with a unified neural network trained on human driving data.
Mid 2024: NVIDIA wins the CVPR Autonomous Grand Challenge with Hydra-MDP, an end-to-end driving model.
Late 2024: Waymo publishes research on EMMA, an end-to-end multimodal driving model built on Google's Gemini architecture.

Viewpoints in depth

Pure End-to-End Advocates

Proponents argue that the physical world is too complex for hand-coded rules.

Companies like Tesla and Wayve argue that the 'AV 1.0' approach of writing explicit rules for every scenario is a dead end. They point out that the real world contains an infinite number of edge cases—the 'long tail'—that programmers can never fully anticipate. By feeding massive amounts of human driving data into a single neural network, they believe the AI can learn to generalize and handle unseen situations far better than rigid code. This camp views scaling compute and data as the only viable path to true, generalized autonomy.

Hybrid AI Researchers

Researchers seeking a middle ground between generalization and interpretability.

Many academic and industry researchers acknowledge the power of end-to-end learning but are uncomfortable with a pure 'black box' outputting steering commands. This camp advocates for hybrid architectures—like NVIDIA's Hydra-MDP—that use neural networks to process data end-to-end but still output intermediate, human-readable representations, such as bird's-eye view maps. They argue this provides the generalization benefits of deep learning while allowing engineers to apply rule-based safety guardrails before the final control command is sent to the wheels.
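A guardrail of the kind this camp describes can be sketched as a small rule-based filter that sits between the network's proposed command and the actuators, and that consults a human-readable intermediate output. In the sketch below, the occupancy grid layout, the steering limit, and the `guardrail` function are all hypothetical; this is not how Hydra-MDP or any named system implements its safety checks.

```python
def guardrail(proposed, bev_grid, max_steer=0.5):
    """Apply rule-based checks to a network's proposed command.

    proposed: dict with 'steering' (-1..1) and 'throttle' (0..1).
    bev_grid: rows of 0/1 occupancy, row 0 being the cells just ahead.
    """
    safe = dict(proposed)
    # Rule 1: clamp steering to a fixed envelope.
    safe["steering"] = max(-max_steer, min(max_steer, safe["steering"]))
    # Rule 2: if any cell directly ahead is occupied, cut throttle and brake.
    if any(bev_grid[0]):
        safe["throttle"] = 0.0
        safe["brake"] = 1.0
    else:
        safe.setdefault("brake", 0.0)
    return safe
```

Because the rules read the bird's-eye view grid rather than the network's internal weights, an engineer can audit exactly why a command was overridden, which is the interpretability benefit the hybrid approach is after.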

Modular System Defenders

Traditionalists who prioritize deterministic safety and explainability.

Defenders of the traditional modular stack argue that safety-critical physical systems require deterministic, provable logic. If a robotaxi crashes, regulators and engineers need to know exactly which line of code or which sensor failed. Because end-to-end neural networks distribute their decision-making across billions of mathematical weights, debugging a specific failure is incredibly difficult. This camp maintains that while end-to-end models are fascinating research projects, commercial deployment requires the strict, isolatable subsystems of the modular approach.

What we don't know

How regulatory bodies will certify the safety of 'black box' neural networks that lack explicit, readable code.
Whether end-to-end models will eventually require LiDAR and radar integration, or whether pure vision (cameras) will be sufficient.
At what point the scaling laws for physical AI will hit a plateau of diminishing returns.

Key terms

Behavioral Cloning: A machine learning technique where an AI learns to perform a task by mimicking human demonstrations, such as copying how human drivers steer and brake.
World Model: An AI system that can simulate and predict the physics and future state of its environment, allowing it to reason about what will happen next.
Modular Stack: The traditional approach to self-driving software, which divides the task into separate, human-coded subsystems like perception, prediction, and control.
Long-tail Scenarios: Rare, highly unusual driving situations (like a couch falling off a truck) that are difficult to anticipate and program explicit rules for.
Attention Map: A visualization tool used by researchers to see which specific parts of an image or data stream a neural network focused on when making a decision.

Frequently asked

What does 'end-to-end' mean in autonomous driving?

It means a single artificial intelligence model takes in raw sensor data (like camera video) and directly outputs driving commands (steering, braking), without relying on separate, hand-coded modules for perception and planning.

Why are companies moving away from traditional code?

The real world has too many unpredictable 'edge cases' to write a specific rule for every scenario. Neural networks learn to generalize from massive amounts of data instead of relying on rigid rules.

What is the 'black box' problem?

Because neural networks make decisions based on billions of mathematical weights, it is very difficult for engineers to understand exactly why an end-to-end system made a specific mistake, making debugging challenging.

Does this mean cars will drive like ChatGPT writes?

Conceptually, yes. Just as Large Language Models predict the next word based on vast training data, end-to-end driving models predict the next steering or braking action based on millions of hours of human driving video.

Sources

[1] Electrek (Pure End-to-End Advocates): Tesla pushes end-to-end neural networks for highway driving
[2] Waymo Research (Modular System Defenders): EMMA: End-to-End Multimodal Model for Autonomous Driving
[3] NVIDIA Technical Blog (Hybrid AI Researchers): End-to-End Driving at Scale with Hydra-MDP
[4] Wayve (Pure End-to-End Advocates): Pioneering AV2.0: End-to-End Deep Learning for Autonomous Driving
[5] Computer Vision Foundation (Hybrid AI Researchers): Enhancing Interpretability in End-to-End Autonomous Driving
[6] arXiv (Hybrid AI Researchers): WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
[7] Factlen Editorial Team (Modular System Defenders): Synthesis by Factlen editorial team
