Chain-of-Thought vs ReAct vs Reflexion Agent Comparison — Pure-Reasoning vs Thought-Action-Observation Loops vs Self-Critique Retry Paradigms, Tool-Use Integration, Error-Recovery Mechanics, Benchmark Performance Deltas, and the Specific Agent Paradigm That Fits Each Workload as of 2026-04
You Built an LLM Agent That Answers Customer Questions Correctly 72% of the Time in Testing, Your Boss Wants 95%, and You Keep Adding Tools Thinking It Will Help — The Problem Isn’t the Tools, It’s That You’re Using Chain-of-Thought Where You Need ReAct, and Adding Tools Without a Thought-Action-Observation Loop Just Multiplies the Ways the Agent Can Get Confused
Agent reasoning paradigms are not interchangeable. Chain-of-Thought (CoT), ReAct (Reasoning + Acting), and Reflexion (self-critique + retry) each optimize for different constraint sets, and using the wrong paradigm for a workload produces predictable failure patterns. A CoT-only agent fails when the task requires real-world lookup. A ReAct agent fails when the observation space is too complex to reason about in-context. A Reflexion agent fails on tasks with no reliable self-critique signal. The right paradigm for each workload is stable — and getting it wrong is the single most common reason production LLM agents underperform their benchmarks.
As of 2026-04, the three paradigms represent three distinct answers to the question “what happens after the model generates its next token.” CoT chains tokens into a reasoning trace before committing to an answer. ReAct interleaves reasoning with tool invocations and observations from the environment. Reflexion adds an outer loop that inspects the final answer, self-critiques, and regenerates if the critique finds errors. Each adds latency, each adds cost, and each unlocks a different quality ceiling. The correct architecture composes them — simple queries get CoT, complex tool-use tasks get ReAct, high-stakes tasks with verifiable outputs get Reflexion on top of ReAct.
This article compares the three paradigms across nine decision dimensions — task fit, tool integration, error recovery, convergence guarantees, cost-per-task, latency, observability, benchmark performance, and failure modes. The framing is of paradigm classes rather than specific model recipes; the specific prompts that make Claude Opus 4.7 shine at ReAct today will shift with the next model generation, but the paradigm classes are stable primitives that compose.
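The composition rule described above — simple queries get CoT, tool tasks get ReAct, verifiable high-stakes tasks get a Reflexion overlay — can be sketched as a small dispatcher. This is an illustrative routing function, not a library API; the boolean inputs (`needs_tools`, `has_evaluator`, `multi_step`, `high_stakes`) are assumptions you derive from your own workload.

```python
def pick_paradigm(needs_tools: bool, has_evaluator: bool,
                  multi_step: bool, high_stakes: bool) -> str:
    """Route a task to a paradigm per this article's composition rule.

    - Tool-dependent tasks get ReAct; add a Reflexion outer loop only
      when the stakes justify it AND an automatic evaluator exists.
    - No-tool, single-step tasks answer directly (CoT can hurt here).
    - No-tool, multi-step tasks get CoT, optionally with Reflexion.
    """
    if needs_tools:
        return "react+reflexion" if (high_stakes and has_evaluator) else "react"
    if not multi_step:
        return "direct"  # e.g. few-class classification
    return "cot+reflexion" if (high_stakes and has_evaluator) else "cot"
```

The point of making this a function rather than prose is that the routing decision becomes testable and auditable, instead of being re-litigated per prompt.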
Three-paradigm comparison — core mechanics
| Paradigm | Core loop | Tool use | Self-correction | Typical latency |
|---|---|---|---|---|
| Chain-of-Thought (CoT) | Reason → Answer | No | No | 1× baseline |
| ReAct | Thought → Action → Observation → (repeat) → Answer | Yes | Within loop — re-plan on observation | 3-10× baseline |
| Reflexion | Act → Evaluate → Self-critique → Retry → Answer | Often yes (via ReAct) | Yes — outer loop | 2-5× baseline |
| CoT + Self-consistency | N parallel CoT → Majority vote | No | No — sampling-based | N× baseline |
| ReAct + Reflexion | ReAct inner + Reflexion outer | Yes | Yes — both levels | 5-20× baseline |
| Tree of Thoughts (ToT) | Search tree of partial reasoning paths | Optional | Via pruning | 5-50× baseline |
When each paradigm wins — task-to-paradigm mapping
| Task type | Best paradigm | Rationale | Typical accuracy delta |
|---|---|---|---|
| Simple math / logic | CoT | No tools needed | +20-30% vs direct answer |
| Multi-step arithmetic | CoT + Self-consistency | Majority vote smooths errors | +5-10% over plain CoT |
| Fact retrieval from knowledge | ReAct with search tool | External lookup needed | +15-25% vs parametric-only |
| Multi-hop QA (HotpotQA) | ReAct | Must search, read, combine | +20-30% vs CoT alone |
| Web navigation (WebShop) | ReAct | Environment interaction | +30-50% vs CoT alone |
| Code generation with tests | Reflexion | Test feedback enables retry | +10-20% vs single-shot |
| Household task (AlfWorld) | ReAct + Reflexion | Environment + verifiable outcome | +15-25% vs ReAct alone |
| Creative writing | CoT | No verifiable ground truth | No paradigm dominates |
| Classification with few classes | Direct (no CoT) | CoT overhead without benefit | CoT sometimes hurts |
| High-stakes decision (medical, legal) | Reflexion with expert-style critique | Self-critique adds safety | +5-10% accuracy, plus a safety margin |
Chain-of-Thought — when it helps and when it doesn’t
| Task characteristic | CoT benefit | Why |
|---|---|---|
| Multi-step reasoning required | Strong benefit | Model externalizes intermediate state |
| Single-step answer | No benefit; sometimes harmful | Verbosity introduces errors |
| Simple classification | No benefit; usually harmful | Forces justification that may be wrong |
| Math / arithmetic | Strong benefit | Step-by-step reduces errors |
| Commonsense reasoning | Strong benefit | Explicit inference chain |
| Language translation | No benefit | Direct task |
| Summarization | No benefit | Reasoning trace tends to crowd out the summary itself |
| Code explanation | Moderate benefit | Step-by-step trace useful |
CoT prompting variants
| Variant | Prompt pattern | Best use |
|---|---|---|
| Zero-shot CoT | "Let's think step by step." | General-purpose default |
| Few-shot CoT | Example reasoning traces in prompt | Task-specific accuracy boost |
| Self-consistency | Sample N times; majority vote | Math, structured reasoning |
| Plan-and-solve | "Let's first devise a plan…" | Multi-step tasks |
| Program-aided | Generate code; execute | Math, symbolic tasks |
| Least-to-most | Decompose into sub-problems | Complex multi-step |
| Analogical | Retrieve/generate analogous solved problem first | Novel tasks |
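The self-consistency variant is simple enough to show concretely. The sketch below takes a sampling function as a stand-in for one temperature>0 CoT completion (in production this would call your LLM client) and majority-votes the final answers; the agreement ratio doubles as a cheap confidence proxy.

```python
from collections import Counter

def self_consistency(sample_fn, n: int = 5):
    """Sample n independent CoT traces and majority-vote the answers.

    `sample_fn` is a placeholder that returns one final-answer string per
    call — swap in a real model call that extracts the answer from a
    full reasoning trace.
    """
    answers = [sample_fn() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n  # winning answer + agreement ratio
```

A low agreement ratio (e.g. below 0.5) is a useful signal to escalate the query to a more expensive paradigm rather than trust the vote.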
ReAct — Thought-Action-Observation loop anatomy
| Step | Purpose | Failure mode | Mitigation |
|---|---|---|---|
| Thought | Reason about current state | Hallucinated plan | Ground in observation |
| Action | Invoke tool with structured args | Wrong tool selection | Narrow tool set, descriptive names |
| Observation | Receive tool output | Output too large for context | Summarize/truncate |
| Repeat | Until goal reached or budget exhausted | Infinite loop | Max-iteration budget |
| Answer | Final response | Answer not grounded in observations | Require observation-citation in answer |
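The loop anatomy above reduces to a few lines of control flow. This is a minimal sketch, not a framework: `llm(transcript)` is an assumed interface returning either `{"thought", "action", "args"}` or `{"thought", "answer"}`, and `tools` is a plain dict of callables. A real implementation would add structured function-calling, observation truncation, and retries.

```python
def react_loop(llm, tools: dict, task: str, max_turns: int = 8):
    """Thought → Action → Observation until a final answer or budget end."""
    transcript = [f"Task: {task}"]
    for _ in range(max_turns):
        step = llm(transcript)
        transcript.append(f"Thought: {step['thought']}")
        if "answer" in step:                       # model decided it is done
            return step["answer"], transcript
        tool = tools.get(step["action"])
        if tool is None:                           # tool-hallucination guard
            observation = f"error: unknown tool {step['action']!r}"
        else:
            observation = tool(**step.get("args", {}))
        transcript.append(f"Action: {step['action']}")
        transcript.append(f"Observation: {observation}")
    return None, transcript                        # budget exhausted, no answer
```

Note the two mitigations from the table baked in: a hard `max_turns` budget, and an explicit error observation when the model names a tool that does not exist.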
ReAct failure modes — the five classic patterns
| Failure | Signature | Cause | Fix |
|---|---|---|---|
| Tool hallucination | Invokes tool that doesn't exist | Overlapping descriptions | Strict tool schema + enum tool_choice |
| Observation blind-spot | Repeats same action ignoring observation | Context-length dilution | Summarize prior observations |
| Premature termination | Answers without enough info | Confidence miscalibration | Require N observations before answer |
| Infinite loop | Same thought-action pair | No progress detection | Cycle detection + forced explore |
| Plan fragmentation | Each thought ignores prior plan | Lack of persistent plan state | Explicit plan field maintained across turns |
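The "cycle detection + forced explore" fix for infinite loops can be sketched in a few lines. This is a heuristic, not a standard algorithm: it flags the trace when the same (action, args) pair repeats several turns in a row, then injects an exploration hint into the next prompt. Both the window size and the hint wording are assumptions to tune per workload.

```python
def is_looping(history: list[tuple[str, str]], window: int = 3) -> bool:
    """Flag a ReAct trace as looping when the last `window` turns all
    used the identical (action, args_repr) pair — the no-progress
    signature from the table above."""
    if len(history) < window:
        return False
    return len(set(history[-window:])) == 1

def loop_breaker_hint(action: str) -> str:
    """Forced-exploration message to prepend to the next Thought prompt."""
    return (f"You have repeated {action!r} several times with no new "
            "information. Try a different tool or different arguments.")
```

More aggressive variants hash the whole thought-action pair or track observation deltas; exact-match on action plus arguments is the cheapest version that catches the common case.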
Reflexion — self-critique and retry mechanics
| Component | Role | Implementation |
|---|---|---|
| Actor | Generates candidate answer (often via ReAct) | Policy model |
| Evaluator | Scores answer against task | Heuristic rule, reward model, or test-runner |
| Self-reflection | Produces critique from failure | LLM-generated reflection in prose |
| Memory | Stores past reflections | Persistent buffer across attempts |
| Retry | Incorporates reflection into next attempt | Reflection prepended to next prompt |
| Termination | Stops on success or budget exhausted | Typically 3-5 iterations |
When Reflexion adds value vs wastes compute
| Condition | Reflexion benefit | Alternative |
|---|---|---|
| Verifiable ground truth | High — critique grounded | N/A |
| Unit-test-style evaluation | High — test results guide critique | N/A |
| Subjective quality | Low-medium — critique is a weak signal | Human review |
| No automatic evaluation | No benefit — self-critique unfounded | Don't use |
| High task cost (tokens) | Only if accuracy critical | Single-shot cheaper |
| Real-time latency | Usually too slow | Skip Reflexion; cache |
Cost-per-task comparison
| Paradigm | Typical token cost per task | Latency | Accuracy ceiling |
|---|---|---|---|
| Direct answer | 100-500 tokens | 1-3s | Workload baseline |
| CoT | 300-1500 tokens | 2-5s | Baseline +10-30% |
| CoT + Self-consistency (N=5) | 1500-7500 tokens | 5-15s parallel | CoT +5-10% |
| ReAct (3-5 turns) | 2000-8000 tokens | 10-30s | Baseline +20-50% |
| Reflexion (3 iterations) | 3000-15000 tokens | 20-60s | ReAct +5-15% |
| ReAct + Reflexion | 5000-25000 tokens | 30-120s | Baseline +30-60% |
| Tree of Thoughts | 5000-50000 tokens | 30-300s | ReAct +5-15% on structured |
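Turning the token ranges above into a budget is simple arithmetic, but worth writing down once. The helper below is a back-of-envelope estimator; the blended $/Mtok rate is a parameter you plug in from your provider's current pricing, not a figure this article asserts.

```python
def daily_paradigm_cost(tokens_per_task: int, tasks_per_day: int,
                        usd_per_mtok: float) -> float:
    """Estimated daily spend: tokens/task × tasks/day × blended $/Mtok."""
    return tokens_per_task * tasks_per_day * usd_per_mtok / 1_000_000

# e.g. ReAct+Reflexion at 5000 tokens/task, 10k tasks/day, $5/Mtok blended:
# daily_paradigm_cost(5000, 10_000, 5.0) -> 250.0 USD/day
```

Running this for each row of the table against your real traffic is usually enough to rule out ToT or ReAct + Reflexion as a default before any accuracy argument is needed.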
Cost-accuracy Pareto — which paradigm for what stakes
| Stakes level | Example | Recommended |
|---|---|---|
| Low (internal tool) | Ticket classification | Direct or CoT |
| Medium (user-facing Q&A) | Customer support FAQ | ReAct |
| High (agent executes action) | Booking flight, purchase | ReAct + human confirm |
| Very high (medical, legal) | Diagnosis aid, contract review | Reflexion + human review |
| Mission-critical | Financial trade, medication dosing | Not LLM-owned — advisory only |
Tool-use integration patterns across paradigms
| Pattern | CoT | ReAct | Reflexion |
|---|---|---|---|
| No tool use | Native | N/A | N/A |
| Single tool (search) | Via function calling | Core primitive | As inner loop |
| Multi-tool | Awkward in CoT | Native | Native |
| Parallel tool calls | Not applicable | Supported in modern APIs | Supported |
| Tool result streaming | N/A | Requires async | Requires async |
| Stateful tools | N/A | Requires session mgmt | Requires session mgmt |
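Since wrong-tool selection is the top ReAct failure, the tool definitions themselves are the first line of defense. The sketch below shows a deliberately narrow tool declaration in JSON-Schema style, the shape most function-calling APIs accept; the tool name, fields, and pattern are illustrative, not tied to any one provider.

```python
# Hypothetical order-lookup tool: narrow scope, descriptive name, enum'd
# field selector, and a closed schema that rejects hallucinated arguments.
SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": ("Look up a customer order by order ID. Use ONLY for "
                    "order-status questions, never for general search."),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"},
            "field": {                      # enum narrows the action space
                "type": "string",
                "enum": ["status", "eta", "items"],
            },
        },
        "required": ["order_id", "field"],
        "additionalProperties": False,      # reject invented parameters
    },
}
```

The description doing double duty — saying what the tool is for and what it is not for — is what disambiguates overlapping tools; the enum and `additionalProperties: False` then make most malformed calls fail loudly at validation instead of silently at execution.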
Benchmark performance deltas — representative numbers as of 2026-04
| Benchmark | CoT baseline | ReAct | Reflexion | Notes |
|---|---|---|---|---|
| GSM8K (math word problems) | 85-92% | 85-92% (no improvement — no tool) | 90-95% (with self-check) | CoT sufficient |
| HotpotQA (multi-hop QA) | 35-50% | 55-70% | 60-75% | ReAct with search dominates |
| AlfWorld (household tasks) | 15-25% | 50-70% | 60-80% | Environment interaction |
| WebShop (web navigation) | 10-20% | 35-55% | 45-65% | ReAct with web-nav tools |
| HumanEval (code) | 70-85% | 75-90% (with execution tool) | 80-92% | Reflexion with test feedback |
| MedQA (medical QA) | 65-75% | 70-80% (with lookup) | 72-82% | Domain-specific gains |
| LegalBench | 60-75% | 65-80% | 68-82% | Similar pattern |
| MMLU (multi-subject) | 70-85% | 72-86% | 73-87% | Marginal gains — knowledge-bound |
Observability requirements
| Paradigm | What to log | Why |
|---|---|---|
| CoT | Full reasoning trace | Debug wrong conclusions |
| ReAct | Every thought, action, observation | Audit decisions + compliance |
| Reflexion | All critique iterations + final | Track quality trajectory |
| All | Token counts per turn | Cost monitoring |
| All | Latency per turn | Performance regression |
| Agent actions | Tool-call args + results | Security + reproducibility |
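The logging requirements above fit naturally into one JSON-lines record per turn. The field names in this sketch are an illustrative convention, not a standard schema; the point is that every thought-action-observation tuple lands in the log together with its per-turn token count and latency.

```python
import json
import time

def log_turn(log: list, turn: int, thought: str, action: str,
             args: dict, observation: str, tokens: int, latency_s: float):
    """Append one thought-action-observation tuple as a JSON line."""
    log.append(json.dumps({
        "ts": time.time(),
        "turn": turn,
        "thought": thought,
        "action": action,
        "args": args,
        "observation": observation[:2000],  # truncate oversized tool outputs
        "tokens": tokens,                   # per-turn count for cost monitoring
        "latency_s": latency_s,             # per-turn latency for regressions
    }))
```

Writing these as JSON lines (to a list here, to a file or log pipeline in production) keeps traces greppable per turn while staying trivially parseable for cost dashboards and behavior-regression diffs.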
Failure-mode matrix
| Failure | CoT | ReAct | Reflexion |
|---|---|---|---|
| Hallucinated facts | Common | Mitigated by tool grounding | Mitigated if critic catches |
| Wrong tool | N/A | Common — top failure | Mitigated by critic |
| Looping | Rare | Common | Bounded by retry budget |
| Premature answer | Rare | Medium | Mitigated by evaluator |
| Over-reasoning | Medium (verbose) | Medium | High — many iterations |
| Silent failure | Rare | Medium (ignores obs) | Low — explicit critique |
| Cost blowout | Low | Medium | High |
When to compose paradigms
| Composition | Trade-off | Best for |
|---|---|---|
| CoT only | Fast, simple | Simple reasoning tasks |
| CoT + Self-consistency | Better math accuracy | Math, structured reasoning |
| ReAct | Tool-grounded | Most agent tasks |
| ReAct + Reflexion | Highest accuracy | High-stakes with verifiable outcomes |
| ReAct + ToT | Exploration + branching | Complex search problems |
| Reflexion alone (no tools) | Self-correcting CoT | Writing, code (with tests) |
| Human-in-the-loop | Safety + quality | Medical, legal, financial |
Quick Reference Summary
| Decision | Default | When to deviate |
|---|---|---|
| Paradigm | ReAct | CoT for no-tool; Reflexion overlay for high-stakes |
Default to ReAct for agent tasks with external tools — HotpotQA-style multi-hop QA, WebShop-style navigation, and multi-tool routing all favor ReAct over CoT by 20-50% accuracy.
Use plain CoT for no-tool reasoning tasks — math, commonsense, logical inference — and skip CoT entirely for simple classification where verbosity introduces errors.
Add Self-consistency (N=5) to CoT for math and structured reasoning — majority-voting across parallel reasoning traces adds 5-10% accuracy with 5× cost.
Add Reflexion overlay only when an automatic evaluator exists — unit tests for code, verifiers for math, expert-style rubric critics for high-stakes text. Without an evaluator, Reflexion wastes compute.
Cap ReAct iteration budgets at 5-10 turns per task with explicit cycle detection — infinite loops and observation blind-spots are the top-two ReAct failure modes.
Set tool schemas strictly — narrow tool sets, enum tool_choice when possible, descriptive names — the #1 ReAct failure is wrong-tool selection traceable to ambiguous tool descriptions.
Log every thought-action-observation tuple — agent observability is non-negotiable for compliance, debugging, and behavior regression detection.
Compose ReAct + Reflexion only when the task justifies 5000-25000 tokens per invocation and 30-120s latency — for most production workloads, ReAct alone is the right operating point.
Honest Limitations
Model-capability generations shift on 6-18 month cycles: The specific tool-use fidelity, self-critique quality, and in-context reasoning depth of Claude Opus 4.7, GPT-5, and Gemini 2.5 as of 2026-04 will change with successor models. Paradigm classes (CoT · ReAct · Reflexion · ToT · Self-consistency) remain stable primitives; model-specific accuracy numbers will shift.
Benchmark numbers are representative, not guaranteed: The 10-80% ranges above reflect typical published results across multiple model generations on GSM8K, HotpotQA, AlfWorld, WebShop, HumanEval, MedQA, LegalBench, and MMLU. Your specific workload will produce different absolute numbers; the relative paradigm ordering tends to hold.
Reflexion quality depends on evaluator quality: Self-critique with a weak evaluator (LLM-only, no ground truth, no tests) produces diminishing returns. With strong evaluators (unit tests, gold labels, expert rubrics), Reflexion is highly effective. Measure evaluator reliability before trusting Reflexion outputs.
ReAct is latency-heavy: 10-30s latency for a 3-5 turn ReAct loop is often unacceptable for real-time UX. Stream observations, prefetch common tool results, and cache where possible.
“Agent” in 2026 still means different things to different teams: Some treat agent = ReAct; others treat agent = autonomous multi-step workflow orchestration; others treat agent = RL-trained policy. This article treats agent as “LLM-driven tool-use loop” per the original ReAct paper framing.
Tool-use safety is not covered here: Agent paradigms that invoke tools that modify external state (sending email, making purchases, executing code) require safety layers — human confirmation, sandboxing, rate limiting, action auditing — that this article does not specify. These are non-negotiable for production agent deployment.