Factlen Explainer · Agent Architecture · Jun 26, 2026, 11:24 PM · 8 min read

Anthropic Research Reveals Core LLMs Are Unreliable, Agent Intelligence Now Resides in External 'Scaffold'

New research and a massive code leak reveal that standalone AI models are highly unreliable for long tasks. The true intelligence of autonomous agents actually lives in the 'scaffold'—the external software wrapped around the model.

By Factlen Editorial Team

Harness Engineers 45% · Enterprise Adopters 30% · Foundation Model Labs 25%

Harness Engineers: Advocate for treating AI reliability as a traditional software engineering problem.
Enterprise Adopters: Focus on the necessity of deterministic rules and data governance for business AI.
Foundation Model Labs: Maintain that scaling core model capabilities remains the ultimate path to generalized intelligence.

What's not represented

· Hardware Providers
· Open-Source Model Developers

Why this matters

Understanding that AI is not a magic brain, but a software engineering problem, democratizes the technology. It means businesses and developers can build highly reliable autonomous systems today without waiting for trillion-dollar supercomputers.

Key points

Standalone large language models are highly unreliable for long-horizon tasks, acting like amnesiacs between sessions.
Anthropic research shows analytical accuracy jumps from 21% to 95% when a model is wrapped in a domain-specific scaffold.
A scaffold is the external code that manages an AI's memory, tool usage, and error correction.
A massive leak of Anthropic's Claude Code revealed a complex, 512,000-line architecture dedicated entirely to scaffolding.
The industry is shifting from 'prompt engineering' to 'harness engineering,' treating AI reliability as a traditional software problem.

21%: Standalone LLM accuracy on analytics
95%: Accuracy when wrapped in a scaffold
90.2%: Performance gain of multi-agent orchestrators
512,000: Lines of leaked scaffolding code

For the past three years, the technology industry has treated large language models as artificial brains—monolithic engines of intelligence that simply need to be prompted correctly to perform complex tasks. But a quiet consensus is emerging among the engineers actually building autonomous systems: the 'brain' is highly unreliable. Without strict external supervision, even the most advanced frontier models behave like brilliant but amnesiac interns. They lose track of long-term goals, hallucinate tool inputs, and spiral into chaotic loops when left to their own devices. Now, unprecedented transparency from leading AI labs has confirmed what many developers suspected. The true intelligence of an autonomous agent does not reside solely in the neural network's weights. It lives in the 'scaffold'—the rigid, deterministic software architecture wrapped around the model to keep it on track.[1][6]

The scale of this gap was quantified in recent internal benchmarks published by Anthropic, the creator of the Claude family of models. When asked to perform complex data analytics tasks using only its native reasoning capabilities, the core large language model achieved an accuracy rate of just 21 percent. It frequently selected the wrong database fields, relied on stale knowledge, or simply failed to find the relevant information. However, when that exact same model was embedded within a domain-specific scaffold (a system equipped with strict workflow rules, semantic data layers, and continuous validation loops), its aggregate accuracy rose to 95 percent. The model had not become smarter; it had been constrained and guided by traditional software engineering.[1][5]
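
To put the two figures in perspective, the short calculation below separates the absolute gain in percentage points from the relative change in accuracy and in error rate. It is an illustrative sketch only: the function name `compare` is hypothetical, and the sole inputs are the 21 and 95 percent figures quoted above.

```python
def compare(before_pct: float, after_pct: float) -> dict[str, float]:
    """Summarize an accuracy change given two percentages (0-100)."""
    return {
        # Absolute gain, in percentage points.
        "gain_points": after_pct - before_pct,
        # Relative improvement in accuracy.
        "accuracy_ratio": after_pct / before_pct,
        # How many times fewer wrong answers remain.
        "error_reduction": (100 - before_pct) / (100 - after_pct),
    }


summary = compare(21, 95)
print({name: round(value, 2) for name, value in summary.items()})
```

Read this way, the scaffold adds 74 percentage points, makes the system about 4.5 times as accurate, and cuts the error rate from 79 percent to 5 percent, roughly a sixteenfold reduction.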

This architectural revelation marks a fundamental shift in how artificial intelligence is deployed in 2026. The industry is rapidly pivoting away from 'prompt engineering'—the art of whispering the right words to a model—and toward 'harness engineering.' A harness, or scaffold, is the external infrastructure that manages everything the model cannot reliably handle itself. It controls memory retrieval, sequences tool invocations, enforces context window discipline, and runs human-in-the-loop checkpoints. If the large language model is the engine providing raw reasoning horsepower, the scaffold is the transmission, steering wheel, and braking system that actually makes the vehicle drivable.[4][6]

Anthropic's internal benchmarks reveal a massive leap in accuracy when models are constrained by domain-specific scaffolding.

The mechanics of production-grade scaffolding were thrust into the spotlight earlier this year when a massive debugging artifact was accidentally published to a public package registry. The leak exposed over 512,000 lines of TypeScript code detailing the complete internal architecture of Anthropic’s flagship agentic tool, Claude Code. For the global developer community, it was an unprecedented look under the hood of a frontier AI system. The codebase revealed that Anthropic does not rely on a single, omniscient model prompt. Instead, the system relies on a complex web of specialized sub-routines, race-condition protections, and strict memory hierarchies that treat the underlying model's outputs with deep skepticism.[3][6]

One of the most critical challenges the scaffold solves is the illusion of memory. By default, a large language model forgets everything between sessions; it has no persistent state. Early agent designs attempted to solve this by simply feeding the entire history of a task back into the model's context window. But context windows are expensive to process, and flooding them with irrelevant history quickly degrades the model's ability to reason. The leaked Anthropic architecture demonstrated a highly disciplined alternative: a three-tier memory hierarchy built entirely around bandwidth management. The scaffold maintains external progress logs, reads previous code commits, and injects only the most critical, condensed pointers into the model's active memory at any given time.[1][3]
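
The idea of injecting only condensed pointers can be sketched in a few lines. This is a minimal illustration, not the leaked implementation: the class `MemoryStore`, the three tier names, and the character budget are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical three-tier memory; tier names are illustrative only."""
    index: dict[str, str] = field(default_factory=dict)   # tier 1: short pointers, always offered
    notes: dict[str, str] = field(default_factory=dict)   # tier 2: topic notes, loaded on demand
    transcript: list[str] = field(default_factory=list)   # tier 3: full history, never injected whole

    def record(self, topic: str, summary: str, detail: str) -> None:
        """Store one event in all three tiers at different levels of detail."""
        self.index[topic] = summary
        self.notes[topic] = detail
        self.transcript.append(f"{topic}: {detail}")

    def build_context(self, task: str, budget_chars: int = 200) -> str:
        """Return pointers, plus notes for topics named in the task, within budget."""
        lines = [f"- {topic}: {summary}" for topic, summary in self.index.items()]
        for topic, detail in self.notes.items():
            if topic in task:  # naive relevance check, assumed for the sketch
                lines.append(f"[note:{topic}] {detail}")
        context = ""
        for line in lines:
            if len(context) + len(line) + 1 > budget_chars:
                break  # stop before the context exceeds its budget
            context += line + "\n"
        return context
```

The design choice worth noting is that the full transcript is kept but never passed to the model directly; only the short index and the notes relevant to the current task compete for the limited budget.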

This approach transforms the AI from a continuous, wandering thinker into a highly structured worker. In a properly scaffolded system, an 'initializer' agent first sets up the project, creating a comprehensive feature list, marking tasks as incomplete, and establishing a rigid directory structure. Then, a separate 'worker' agent is spun up for a single, isolated task. It reads the external progress log, picks exactly one failing feature, writes the code, runs the tests, updates the external log, and then terminates. The worker agent does not need to remember the grand vision of the project; it only needs to successfully execute the immediate step dictated by the scaffold.[1][4]
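
The initializer and worker roles described above can be sketched as two small functions that communicate only through a file on disk. The file layout, the function names `initialize` and `run_worker`, and the `implement` callback (a stand-in for "write the code and run the tests") are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def initialize(log_path: Path, features: list[str]) -> None:
    """Initializer: write the feature list with every item marked as not passing."""
    state = {
        "features": [{"name": name, "passes": False} for name in features],
        "notes": [],
    }
    log_path.write_text(json.dumps(state, indent=2))


def run_worker(log_path: Path, implement: Callable[[str], bool]) -> Optional[str]:
    """Worker: read the log, attempt exactly one failing feature, update the log, exit."""
    state = json.loads(log_path.read_text())
    pending = next((f for f in state["features"] if not f["passes"]), None)
    if pending is None:
        return None  # every feature passes; nothing left to do
    passed = implement(pending["name"])  # hypothetical hook: code + tests
    pending["passes"] = passed
    state["notes"].append(f"{pending['name']}: {'passed' if passed else 'failed'}")
    log_path.write_text(json.dumps(state, indent=2))
    return pending["name"]
```

Because each worker starts from the log rather than from conversation history, a fresh process can pick up where the previous one stopped without carrying any state of its own.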

When tasks become too complex for a single worker, modern scaffolds employ an orchestrator-worker pattern. Anthropic's internal research evaluations demonstrated that multi-agent systems—where a lead model coordinates the process while delegating to specialized subagents—dramatically outperform solitary models. In one benchmark, a multi-agent system utilizing a lead orchestrator and several parallel subagents outperformed a single, highly capable model by 90.2 percent. The orchestrator analyzes the user's intent, develops a multi-step strategy, and spawns specialized workers to explore different aspects of the problem simultaneously. The scaffold ensures these parallel threads are eventually synthesized into a coherent final output.[1][6]
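
A minimal sketch of the orchestrator-worker pattern follows. The `plan`, `subagent`, and `synthesize` callables are hypothetical stand-ins for model calls, and a thread pool is only one assumed way to run subagents in parallel; the source does not describe how Anthropic implements this.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def orchestrate(
    question: str,
    plan: Callable[[str], list[str]],
    subagent: Callable[[str], str],
    synthesize: Callable[[list[str]], str],
    max_workers: int = 4,
) -> str:
    """Lead agent: plan subtasks, fan them out in parallel, then merge the findings."""
    subtasks = plan(question)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in subtask order even though calls run concurrently.
        findings = list(pool.map(subagent, subtasks))
    return synthesize(findings)


# Usage with trivial stand-ins for the three model calls.
report = orchestrate(
    "compare A and B",
    plan=lambda q: [f"{q}: part {i}" for i in range(3)],
    subagent=lambda task: task.upper(),
    synthesize=lambda findings: " | ".join(findings),
)
```

The scaffold, not the model, owns the fan-out and the merge step, which is what keeps the parallel threads from drifting apart.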

A production-grade scaffold intercepts the model's outputs, manages memory, and enforces strict validation loops.

Crucially, the scaffold also serves as the ultimate defense against AI hallucinations. In an open-loop system, if a model hallucinates a tool command or misinterprets a database schema, the error compounds, eventually derailing the entire task. A robust scaffold operates as a closed-loop feedback system. It intercepts the model's proposed actions, runs them through deterministic guardrails, and forces the model to self-correct if the output violates established rules. The system is explicitly designed around the assumption that the model's initial output might be wrong. Memory and initial reasoning are treated as hints, not absolute truths, requiring the scaffold to verify the data before any permanent action is taken.[3][5]
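
The closed loop described above can be sketched as a retry wrapper around the model call. The names `guarded_step`, `propose`, and the validator signature are assumptions for illustration; a real harness would plug its own rules into the same shape.

```python
from typing import Callable, Optional

Action = dict
# A validator returns an error message, or None when the action is acceptable.
Validator = Callable[[Action], Optional[str]]


def guarded_step(
    propose: Callable[[str], Action],
    validators: list[Validator],
    prompt: str,
    max_attempts: int = 3,
) -> Action:
    """Ask for an action, check it deterministically, and feed any errors back."""
    feedback = ""
    for _ in range(max_attempts):
        action = propose(prompt + feedback)
        errors = [msg for check in validators if (msg := check(action)) is not None]
        if not errors:
            return action  # only validated actions are handed on for execution
        feedback = "\nPrevious attempt rejected: " + "; ".join(errors)
    raise RuntimeError("no valid action after retries; escalate to a human")
```

The key property is that nothing the model proposes is executed until every deterministic check has passed, and a rejected proposal comes back to the model with the reason attached.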

The academic community has increasingly validated this structural reality. Recent surveys and empirical studies published on arXiv argue that the execution harness is the binding constraint on long-horizon agent performance. Researchers have pointed out a flaw in how AI models are traditionally evaluated: leaderboards often report a single score and attribute it entirely to the underlying neural network. However, studies show that holding the model fixed while upgrading the surrounding scaffold can improve benchmark performance by up to 15 percentage points. The conclusion these researchers draw is that reliability over a long time horizon is a property of the external controller, not just the open-loop policy of the model itself.[2][6]

This shift in understanding has profound implications for the economics of the artificial intelligence industry. If the intelligence and reliability of a system reside primarily in the domain-specific scaffolding, the competitive moat for businesses changes dramatically. Companies no longer need to wait for frontier labs to release exponentially smarter, trillion-parameter models to achieve production-grade reliability. Instead, they can achieve exceptional results using smaller, faster, and cheaper models—including open-source alternatives—provided they invest the engineering effort into building robust, domain-specific harnesses. The value is migrating from the raw foundational weights to the proprietary workflows, semantic layers, and testing loops built around them.[3][4]

Multi-agent architectures, where a lead orchestrator delegates to specialized subagents, dramatically outperform solitary models.

A critical component of this engineering discipline is how the scaffold manages the model's access to external tools. When an agent is given a massive, unfiltered list of APIs and functions, it frequently becomes confused, selecting the wrong tool or providing improperly formatted arguments. Modern scaffolding solves this by dynamically routing tool access based on the specific context of the task. The harness provides explicit heuristics, ensuring the model only sees the precise tools required for the immediate step. Furthermore, the scaffold translates the model's natural language intent into the rigid, deterministic syntax required by external databases, acting as an essential semantic translation layer.[1][5]
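
One simple way to picture dynamic tool routing is a registry that tags each tool with the task stages in which it may be offered. The tool names, stage names, and the `tools_for` function below are hypothetical and chosen only to illustrate the filtering step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    stages: frozenset[str]  # task stages in which this tool may be offered


# Illustrative registry; a real harness would load this from configuration.
REGISTRY = [
    Tool("query_sales", "Run a read-only sales query", frozenset({"analyze"})),
    Tool("read_file", "Read a file from the workspace", frozenset({"plan", "implement"})),
    Tool("write_file", "Write a file to the workspace", frozenset({"implement"})),
    Tool("run_tests", "Run the test suite", frozenset({"implement", "verify"})),
]


def tools_for(stage: str, registry: list[Tool] = REGISTRY) -> list[Tool]:
    """Return only the tools the model should see at this stage."""
    return [tool for tool in registry if stage in tool.stages]
```

At the `verify` stage, for example, the model would be shown `run_tests` and nothing else, so it cannot pick a write or query tool by mistake.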

This semantic layer is particularly vital for overcoming the specific failure modes that plague standalone models. Anthropic's analysis identified that un-scaffolded models frequently fail due to incorrect field selection in databases or a reliance on stale, outdated knowledge embedded in their training weights. The scaffold bypasses these flaws entirely by forcing the model to query a single source of truth. It provides detailed reference documentation for data tables, filters, and keys directly into the active context, ensuring the model retrieves real-time, accurate data rather than guessing based on its pre-trained memory.[5][6]
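
A semantic layer of this kind can be approximated by compiling a structured request against a known schema and refusing anything the schema does not contain. The table, its columns, and the `build_query` function are invented for this sketch; the `?` placeholders follow the style used by drivers such as Python's `sqlite3`.

```python
# Hypothetical single source of truth for available tables and fields.
SCHEMA = {
    "orders": {"order_id", "region", "amount_usd", "created_at"},
}


def build_query(
    table: str, select: list[str], where: dict[str, object]
) -> tuple[str, list[object]]:
    """Compile a structured request into parameterized SQL, rejecting unknown fields."""
    if table not in SCHEMA:
        raise ValueError(f"unknown table: {table}")
    unknown = [col for col in [*select, *where] if col not in SCHEMA[table]]
    if unknown:
        raise ValueError(f"unknown fields for {table}: {unknown}")
    # Identifiers are safe to interpolate because they were checked against SCHEMA;
    # values are passed separately as parameters.
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in where)
    return sql, list(where.values())
```

If the model asks for a field such as `revenue` that does not exist, the request fails with an explicit error the harness can feed back, rather than producing a plausible but wrong query.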

For developers and enterprise adopters, this demystification of AI is highly empowering. It moves artificial intelligence out of the realm of unpredictable alchemy and places it firmly within the discipline of traditional software engineering. Building a reliable AI agent no longer requires hoping the model will magically understand a complex prompt. It requires writing clear tests, maintaining clean state management, and enforcing strict data governance—skills that the global software engineering workforce already possesses. By acknowledging the inherent unreliability of the core models, the industry has finally discovered the blueprint for making them genuinely useful.[5][6]

The shift toward 'harness engineering' places AI development firmly back into the realm of traditional software engineering.

Ultimately, the revelation that AI agents require extensive scaffolding is not a failure of the technology, but a maturation of its application. Just as a powerful engine requires a transmission to deliver useful torque to the wheels, a large language model requires a harness to deliver useful work in the real world. As the focus shifts from scaling parameter counts to refining execution loops, the next generation of autonomous systems will be defined not by how smart their underlying models are, but by how brilliantly they are constrained.[2][4][6]

How we got here

Late 2025
Early agent frameworks struggle with compounding errors and amnesia during long-running tasks.
March 2026
A massive source map leak reveals the 512,000-line internal scaffolding architecture of Anthropic's Claude Code.
May 2026
Academic researchers publish empirical data showing that benchmark scores depend heavily on the execution harness, not just the model.
June 2026
Anthropic formally details how domain-specific scaffolding increases analytical accuracy from 21% to 95%.

Viewpoints in depth

Harness Engineers

Advocate for treating AI reliability as a traditional software engineering problem.

This camp argues that the AI industry has over-indexed on building larger, more expensive foundation models while neglecting the execution environment. They view the LLM as merely a raw reasoning engine—powerful but inherently flawed. By focusing on 'harness engineering,' they believe developers can build highly reliable, production-grade systems using smaller, open-source models, shifting the competitive moat from raw compute power to proprietary workflow architecture.

Enterprise Adopters

Focus on the necessity of deterministic rules and data governance for business AI.

For enterprise leaders, the unpredictability of standalone LLMs is a non-starter for deployment in finance, healthcare, or critical operations. This perspective champions the scaffolding approach because it reintroduces deterministic control. By forcing the AI to interact through strict semantic layers and validation loops, enterprises can guarantee data accuracy and maintain compliance, ensuring the AI acts as a disciplined worker rather than a creative but erratic liability.

Foundation Model Labs

Maintain that scaling core model capabilities remains the ultimate path to generalized intelligence.

While acknowledging the immediate necessity of scaffolding for current-generation agents, researchers at frontier labs still view these harnesses as temporary bridges. They argue that as context windows expand to millions of tokens and models improve their native long-horizon reasoning, the need for complex, hand-coded external memory structures will diminish. In their view, the ultimate goal is to internalize the scaffold's discipline directly into the neural network's weights.

What we don't know

It remains unclear if future foundation models will eventually internalize these scaffolding capabilities, rendering external harnesses obsolete.
The industry has not yet standardized a universal scaffolding framework, leading to highly fragmented enterprise implementations.
It is unknown how the shift toward heavy scaffolding will impact the pricing power of major AI labs selling raw API access.

Key terms

Agent Harness (Scaffold): The external software infrastructure wrapped around an AI model that manages memory, tool use, and error correction.
Orchestrator-Worker Pattern: A multi-agent architecture where a lead AI plans a strategy and delegates specific, isolated tasks to sub-agents.
Semantic Layer: A translation system within the scaffold that converts the AI's natural language intent into the exact code or database queries required.
Context Window: The limited amount of active memory or text an AI model can process at one time before it begins to forget or degrade in performance.

Frequently asked

Why can't the AI model just remember things on its own?

Large language models are inherently stateless; they forget everything between sessions. To maintain long-term memory, they require an external scaffold to save and retrieve progress logs.

What happens when an AI hallucinates inside a scaffold?

A well-designed scaffold uses deterministic validation loops to catch the error. It intercepts the hallucinated command, blocks it from executing, and forces the model to self-correct.

Does this mean the base AI models don't matter anymore?

The base models still matter as the core reasoning engine, but they are no longer the sole differentiator. A mid-tier model with an excellent scaffold can outperform a frontier model with poor scaffolding.

Sources

[1] Anthropic Research (Foundation Model Labs). Building Effective Harnesses for Long-Running Agents.
[2] arXiv (Enterprise Adopters). The Execution Harness is the Binding Constraint on LLM-Agent Performance.
[3] Business Engineer (Harness Engineers). Anthropic's Leak & The Scaffolding Map of AI.
[4] Dev.to (Harness Engineers). The AI Industry Just Shifted to Harness Engineering.
[5] Sozai (Enterprise Adopters). Intelligence in AI Systems is Shifting from Pure LLMs to Integrated Scaffolding.
[6] Factlen Editorial Team (Harness Engineers). Synthesis by Factlen editorial team.
