Factlen ExplainerPrompt EngineeringExplainerJun 12, 2026, 3:23 AM· 6 min read· #7 of 59 in ai

How to Make AI Reason: The Science of Chain-of-Thought and ReAct Prompting

By forcing large language models to show their work step-by-step and interact with external tools, developers are unlocking unprecedented reasoning capabilities without retraining the underlying models.

By Factlen Editorial Team

Share this story

AI Researchers 35%Applied Developers 35%Prompt Engineers 30%

AI Researchers: Focused on scaling laws and emergent cognitive capabilities in neural networks.
Applied Developers: Focused on building reliable, cost-effective software systems using AI APIs.
Prompt Engineers: Focused on optimizing the linguistic interface and structural context provided to AI models.

What's not represented

· Enterprise IT Managers
· End-User Experience Designers

Why this matters

Understanding how to structure AI prompts moves users from passive consumers of text generation to active directors of intelligent agents, unlocking the ability to solve complex, multi-step problems reliably.

Key points

Chain-of-Thought (CoT) prompting forces AI models to show their work, drastically improving accuracy on complex tasks.
CoT is an emergent ability that only functions effectively in models with roughly 100 billion parameters or more.
The ReAct framework solves CoT's isolation problem by allowing the AI to interact with external tools and verify facts.
ReAct workflows operate on a continuous loop of Thought, Action, and Observation.
Newer 'reasoning models' internalize these processes, changing how developers must structure their prompts.
While powerful, agentic prompting increases API costs and latency due to the high volume of tokens generated.

34%

ReAct success rate boost on ALFWorld

CoT exemplars needed for GSM8K state-of-the-art

100B+

Parameters where CoT abilities emerge

For the first few years of the generative AI boom, interacting with a large language model felt like pulling a slot machine lever. Users would type a complex question, hit enter, and hope the statistical weights aligned to produce a correct answer. If the model hallucinated or skipped a crucial logical step, the user had little recourse other than to rephrase the question and try again. But as models grew larger, researchers discovered that the bottleneck wasn't always the AI's underlying intelligence—it was the way humans were asking it to think.[6]

The breakthrough came with the realization that language models, much like humans, struggle to solve multi-step math or logic problems in a single breath. If you ask a person to instantly calculate the trajectory of a moving train, they will likely guess wrong. If you give them a whiteboard and ask them to show their work, they can solve it. This exact dynamic applies to artificial neural networks, birthing a discipline known as prompt engineering, which shifts the focus from what to ask the AI, to how to structure its cognitive environment.[4][6]

The foundational pillar of this new discipline is "Chain-of-Thought" (CoT) prompting. Formally introduced in a landmark 2022 paper by Jason Wei and a team of Google researchers, CoT is a remarkably simple technique that forces a model to generate a series of intermediate reasoning steps before outputting a final answer. Instead of providing the model with a few examples of questions and direct answers, the user provides examples where the answer includes a step-by-step logical breakdown.[1]

The empirical results of this minor formatting tweak were staggering. The researchers found that prompting a 540-billion-parameter model with just eight Chain-of-Thought exemplars allowed it to achieve state-of-the-art accuracy on the GSM8K benchmark—a notoriously difficult dataset of grade-school math word problems. By simply showing its work, the off-the-shelf model surpassed even specialized systems that had been explicitly fine-tuned to solve math equations.[1]

Chain-of-Thought prompting forces the model to show its work, drastically improving accuracy on complex tasks.

The mechanics behind why Chain-of-Thought works are rooted in how language models process compute. Every word—or "token"—a model generates gives it another cycle of computational processing. By forcing the model to write out its intermediate steps, CoT effectively grants the AI more compute time to allocate to a complex problem. It prevents the model from rushing to a conclusion, allowing it to unpack the variables, state its assumptions, and follow a logical path to the end.[3][6]

Interestingly, researchers discovered that Chain-of-Thought is an "emergent ability" tied to model scale. When applied to smaller models, the technique often fails, as the models lack the internal knowledge to generate coherent logical chains. However, once a model crosses the threshold of roughly 100 billion parameters, the ability to reason step-by-step unlocks naturally, yielding massive performance gains across arithmetic, commonsense, and symbolic reasoning tasks.[1]

While Chain-of-Thought revolutionized static reasoning, it suffered from a critical limitation: isolation. A model using CoT relies entirely on its internal, pre-trained knowledge base. If it hallucinates a fact in step two of its reasoning chain, that error propagates through the rest of the thought process, leading to a confidently incorrect final answer. The model had no way to verify its assumptions against the real world.[2]

While Chain-of-Thought revolutionized static reasoning, it suffered from a critical limitation: isolation.

To solve this, researchers Shunyu Yao and colleagues introduced "ReAct"—a framework that synergizes Reasoning and Acting. ReAct transforms the AI from a static thinker into an interactive agent. Instead of just generating a chain of thought, a ReAct prompt instructs the model to alternate between thinking about what to do, executing an action using an external tool, and observing the result of that action before deciding on the next step.[2]

The ReAct loop typically follows a strict sequence: Thought, Action, Observation. For example, if asked a complex trivia question, the model might generate a Thought ("I need to find the birth year of the actor who played the lead in The Matrix"), execute an Action (searching Wikipedia for "Keanu Reeves"), process the Observation (reading the search snippet that says 1964), and then generate a new Thought based on that verified fact.[2][6]

The ReAct loop allows language models to verify their assumptions against external tools and environments.

By grounding the model's reasoning in external environments, ReAct drastically reduces hallucinations and error propagation. In testing on ALFWorld, an interactive decision-making benchmark, the ReAct framework outperformed both imitation and reinforcement learning methods by an absolute success rate of 34%. On the WebShop benchmark, which requires an agent to navigate an online store to purchase specific items, ReAct proved highly capable of dynamically adjusting its plans based on what items were actually in stock.[2]

The industry has rapidly adopted these frameworks, moving them from academic papers to official developer guidelines. Anthropic, the creator of the Claude models, explicitly advises developers to ask the AI to "think first" by providing it with designated scratchpad space—often using XML tags like `<thinking>`—before it generates its final output. This gives the AI the necessary space to consider constraints and potential approaches without cluttering the final response seen by the user.[5]

OpenAI has similarly integrated these concepts into its ecosystem. While traditional GPT models benefit immensely from explicit Chain-of-Thought instructions, OpenAI's newer "reasoning models" have internalized this process. These models automatically generate an invisible chain of thought during inference, spending extra time and compute to analyze complex tasks and multi-step planning before returning any text to the user.[4]

Grounding reasoning in external actions yields massive performance gains over static prompting.

For developers and prompt engineers, this evolution requires a shift in strategy. When using standard models, the prompt must act as a strict scaffolding, explicitly demanding step-by-step breakdowns and providing clear examples. When using native reasoning models, the prompt can focus more on defining the end goal, the persona, and the constraints, trusting the model's internal ReAct and CoT loops to handle the intermediate logic.[4][5]

Despite these advancements, agentic prompting is not a silver bullet. ReAct workflows can sometimes get stuck in repetitive loops, repeatedly issuing the same search query if the observation doesn't contain the exact phrasing the model expects. Furthermore, generating long reasoning chains consumes significantly more tokens, increasing both the financial cost and the latency of every API call.[2][6]

Nevertheless, the transition from zero-shot guessing to structured, agentic reasoning represents a fundamental maturation in how humans interact with artificial intelligence. By combining the internal logic of Chain-of-Thought with the external verification of ReAct, developers are building systems that don't just predict the next word, but actively investigate, plan, and solve problems in ways that increasingly mirror human cognition.[3][6]

How we got here

2020
GPT-3 demonstrates that large language models can learn new tasks via few-shot prompting without retraining.
Jan 2022
Google researchers publish the foundational paper on Chain-of-Thought prompting, revealing emergent reasoning capabilities.
Oct 2022
Researchers introduce the ReAct framework, allowing language models to synergize reasoning with external actions.
Late 2024
OpenAI releases the o1 model series, which bakes Chain-of-Thought reasoning directly into the inference process.
2025-2026
Agentic workflows and structured reasoning prompts become standard practice in enterprise AI deployments.

Viewpoints in depth

AI Researchers

Focused on scaling laws and emergent cognitive capabilities.

For the academic community, the most fascinating aspect of Chain-of-Thought is that it was not explicitly programmed into the models. Researchers view CoT as an 'emergent ability'—a capability that simply did not exist in models with fewer than 100 billion parameters, but suddenly unlocked at scale. They study these frameworks to understand the latent reasoning structures hidden within massive neural networks, using benchmarks like GSM8K and ALFWorld to quantify the boundary between statistical parroting and genuine problem-solving.

Applied Developers

Focused on building reliable, cost-effective software systems.

Software engineers building production applications view ReAct and CoT through a pragmatic lens of reliability versus cost. While these frameworks drastically reduce hallucinations and allow AI to interact with external APIs, they also consume significantly more tokens. Developers must constantly balance the need for high-fidelity reasoning against the latency and financial cost of generating long, step-by-step thought processes for every user query.

Prompt Engineers

Focused on optimizing the linguistic interface between humans and AI.

Prompt engineers treat language models as highly capable but easily confused collaborators. They argue that the quality of an AI's output is directly proportional to the structural clarity of the input. For this camp, techniques like CoT and ReAct are essential tools for providing 'cognitive scaffolding.' They advocate for iterative testing, using XML tags to separate instructions from data, and explicitly defining personas to guide the model's reasoning pathways.

What we don't know

Whether future model architectures will render explicit prompt engineering entirely obsolete.
How to completely prevent ReAct agents from getting stuck in infinite action-observation loops when tools fail.
The exact mathematical mechanism that causes Chain-of-Thought to emerge only at specific parameter scales.

Key terms

Chain-of-Thought (CoT): A prompting technique that instructs an AI to break down a complex problem into a series of intermediate logical steps before providing an answer.
ReAct: A framework that combines reasoning and acting, allowing an AI to alternate between thinking about a problem and using external tools (like search engines) to gather information.
Token: The fundamental unit of data processed by a language model, roughly equivalent to a word or part of a word.
Emergent Ability: A capability that is not present in smaller AI models but suddenly appears when the model reaches a certain scale or parameter count.
Hallucination: When an AI model confidently generates false or fabricated information because it lacks the correct data or logical grounding.
Zero-shot vs Few-shot: Zero-shot means asking an AI a question with no examples provided; few-shot means providing a few examples of the desired input and output format within the prompt.

Frequently asked

Do I need to use Chain-of-Thought for every prompt?

No. Chain-of-Thought is most useful for complex logic, math, or multi-step reasoning tasks. For simple factual retrieval or creative writing, standard prompting is usually sufficient and faster.

How is ReAct different from standard Chain-of-Thought?

While Chain-of-Thought relies entirely on the model's internal knowledge to reason, ReAct allows the model to pause its reasoning, take an action (like searching the web), observe the result, and then continue reasoning based on verified facts.

Why does asking an AI to 'think step-by-step' actually work?

Language models process information one token (word piece) at a time. By forcing the model to write out its steps, you are effectively giving it more computational time and space to process the variables before it commits to a final answer.

Are newer models making prompt engineering obsolete?

Not obsolete, but different. Newer 'reasoning models' handle the step-by-step logic internally, meaning prompt engineers now focus more on defining strict constraints, personas, and end goals rather than hand-holding the model through the logic.

Sources

[1]Semantic ScholarAI Researchers
Chain of Thought Prompting Elicits Reasoning in Large Language Models
Read on Semantic Scholar →
[2]arXivAI Researchers
ReAct: Synergizing Reasoning and Acting in Language Models
Read on arXiv →
[3]PromptingGuide.aiPrompt Engineers
Chain-of-Thought Prompting
Read on PromptingGuide.ai →
[4]OpenAIApplied Developers
Prompt engineering strategies
Read on OpenAI →
[5]AnthropicApplied Developers
6 Techniques for Effective Prompt Engineering
Read on Anthropic →
[6]Factlen Editorial TeamPrompt Engineers
Synthesis by Factlen editorial team
Read on Factlen Editorial Team →

Up next

On-Device AI

How Small Language Models Are Bringing Powerful AI Directly to Your Phone

A new generation of compact, highly efficient AI models is moving processing away from the cloud, offering users unprecedented privacy, speed, and cost savings on their own devices.

Every angle. Every day.

Get ai stories with full source coverage and perspective breakdowns delivered to your inbox.

Get the briefing →Browse ai