You Built an LLM Agent That Answers Customer Questions Correctly 72% of the Time in Testing, Your Boss Wants 95%, and You Keep Adding Tools Thinking It Will Help — The Problem Isn’t the Tools, It’s That You’re Using Chain-of-Thought Where You Need ReAct, and Adding Tools Without a Thought-Action-Observation Loop Just Multiplies the Ways the Agent Can Get Confused

Agent reasoning paradigms are not interchangeable. Chain-of-Thought (CoT), ReAct (Reasoning + Acting), and Reflexion (self-critique + retry) each optimize for different constraint sets, and using the wrong paradigm for a workload produces predictable failure patterns. A CoT-only agent fails when the task requires real-world lookup. A ReAct agent fails when the observation space is too complex to reason about in-context. A Reflexion agent fails on tasks with no reliable self-critique signal. The right paradigm for each workload is stable — and getting it wrong is the single most common reason production LLM agents underperform their benchmarks.

As of 2026-04, the three paradigms represent three distinct answers to the question “what happens after the model generates its next token.” CoT chains tokens into a reasoning trace before committing to an answer. ReAct interleaves reasoning with tool invocations and observations from the environment. Reflexion adds an outer loop that inspects the final answer, self-critiques, and regenerates if the critique finds errors. Each adds latency, each adds cost, and each unlocks a different quality ceiling. The correct architecture composes them — simple queries get CoT, complex tool-use tasks get ReAct, high-stakes tasks with verifiable outputs get Reflexion on top of ReAct.

This article compares the three paradigms across nine decision dimensions — task fit, tool integration, error recovery, convergence guarantees, cost-per-task, latency, observability, benchmark performance, and failure modes. The framing is in terms of paradigm classes rather than specific model recipes: the specific prompts that make Claude Opus 4.7 shine at ReAct today will shift with the next model generation, but the paradigm classes themselves are stable primitives that compose.

Three-paradigm comparison — core mechanics

| Paradigm | Core loop | Tool use | Self-correction | Typical latency |
| --- | --- | --- | --- | --- |
| Chain-of-Thought (CoT) | Reason → Answer | No | No | 1× baseline |
| ReAct | Thought → Action → Observation → (repeat) → Answer | Yes | Within loop — re-plan on observation | 3-10× baseline |
| Reflexion | Act → Evaluate → Self-critique → Retry → Answer | Often yes (via ReAct) | Yes — outer loop | 2-5× baseline |
| CoT + Self-consistency | N parallel CoT → Majority vote | No | No — sampling-based | N× baseline |
| ReAct + Reflexion | ReAct inner + Reflexion outer | Yes | Yes — both levels | 5-20× baseline |
| Tree of Thoughts (ToT) | Search tree of partial reasoning paths | Optional | Via pruning | 5-50× baseline |

When each paradigm wins — task-to-paradigm mapping

| Task type | Best paradigm | Rationale | Typical accuracy delta |
| --- | --- | --- | --- |
| Simple math / logic | CoT | No tools needed | +20-30% vs direct answer |
| Multi-step arithmetic | CoT + Self-consistency | Majority vote smooths errors | +5-10% over plain CoT |
| Fact retrieval from knowledge | ReAct with search tool | External lookup needed | +15-25% vs parametric-only |
| Multi-hop QA (HotpotQA) | ReAct | Must search, read, combine | +20-30% vs CoT alone |
| Web navigation (WebShop) | ReAct | Environment interaction | +30-50% vs CoT alone |
| Code generation with tests | Reflexion | Test feedback enables retry | +10-20% vs single-shot |
| Household task (AlfWorld) | ReAct + Reflexion | Environment + verifiable outcome | +15-25% vs ReAct alone |
| Creative writing | CoT | No verifiable ground truth | No paradigm dominates |
| Classification with few classes | Direct (no CoT) | CoT overhead without benefit | CoT sometimes hurts |
| High-stakes decision (medical, legal) | Reflexion with expert-style critique | Self-critique adds safety | +5-10% accuracy, plus a safety margin |

Chain-of-Thought — when it helps and when it doesn’t

| Task characteristic | CoT benefit | Why |
| --- | --- | --- |
| Multi-step reasoning required | Strong benefit | Model externalizes intermediate state |
| Single-step answer | No benefit; sometimes harmful | Verbosity introduces errors |
| Simple classification | No benefit; usually harmful | Forces justification that may be wrong |
| Math / arithmetic | Strong benefit | Step-by-step reduces errors |
| Commonsense reasoning | Strong benefit | Explicit inference chain |
| Language translation | No benefit | Direct task |
| Summarization | No benefit | The reasoning trace crowds out the answer itself |
| Code explanation | Moderate benefit | Step-by-step trace useful |

CoT prompting variants

| Variant | Prompt pattern | Best use |
| --- | --- | --- |
| Zero-shot CoT | “Let’s think step by step.” | General-purpose default |
| Few-shot CoT | Example reasoning traces in prompt | Task-specific accuracy boost |
| Self-consistency | Sample N times; majority vote | Math, structured reasoning |
| Plan-and-solve | “Let’s first devise a plan…” | Multi-step tasks |
| Program-aided | Generate code; execute | Math, symbolic tasks |
| Least-to-most | Decompose into sub-problems | Complex multi-step |
| Analogical | Retrieve/generate analogous solved problem first | Novel tasks |
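The self-consistency variant reduces to a sample-and-vote wrapper around any CoT call. A minimal sketch, assuming a hypothetical `sample_cot` callable that runs one chain-of-thought completion at temperature > 0 and returns only the extracted final answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_cot: Callable[[], str], n: int = 5) -> str:
    """Sample n independent CoT traces and majority-vote on the final answer."""
    answers = [sample_cot() for _ in range(n)]
    # Counter.most_common is stable, so ties break toward the earliest-seen answer.
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for sampled model calls, for illustration only.
fake_samples = iter(["12", "12", "13", "12", "11"])
print(self_consistency(lambda: next(fake_samples), n=5))  # "12"
```

Only the final answers are voted on; the reasoning traces themselves are discarded, which is why this variant helps on tasks with a short canonical answer (math, multiple choice) and not on free-form generation.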

ReAct — Thought-Action-Observation loop anatomy

| Step | Purpose | Failure mode | Mitigation |
| --- | --- | --- | --- |
| Thought | Reason about current state | Hallucinated plan | Ground in observation |
| Action | Invoke tool with structured args | Wrong tool selection | Narrow tool set, descriptive names |
| Observation | Receive tool output | Output too large for context | Summarize/truncate |
| Repeat | Until goal reached or budget exhausted | Infinite loop | Max-iteration budget |
| Answer | Final response | Answer not grounded in observations | Require observation-citation in answer |
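The anatomy above composes into a short loop. A minimal sketch, where `call_llm`, the step-dict shape, and the tool registry are hypothetical stand-ins for a real model call and parser:

```python
from typing import Callable, Dict

def react_loop(call_llm: Callable[[str], dict],
               tools: Dict[str, Callable[[str], str]],
               task: str, max_turns: int = 8) -> str:
    """Thought → Action → Observation loop with an iteration budget.

    call_llm is assumed to return either
      {"thought": ..., "action": "search", "args": "..."}  or
      {"thought": ..., "answer": "..."}.
    """
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        step = call_llm(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if "answer" in step:                      # model chose to terminate
            return step["answer"]
        tool = tools.get(step["action"])
        if tool is None:                          # tool-hallucination guard
            observation = f"Error: unknown tool {step['action']!r}"
        else:
            observation = tool(step["args"])
        transcript += f"Action: {step['action']}({step['args']})\n"
        transcript += f"Observation: {observation}\n"
    return "Budget exhausted without an answer."
```

Feeding the unknown-tool error back as an observation (rather than raising) is what lets the model re-plan within the loop, which is the self-correction property the comparison table attributes to ReAct.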

ReAct failure modes — the five classic patterns

| Failure | Signature | Cause | Fix |
| --- | --- | --- | --- |
| Tool hallucination | Invokes tool that doesn’t exist | Overlapping descriptions | Strict tool schema + enum tool_choice |
| Observation blind-spot | Repeats same action ignoring observation | Context-length dilution | Summarize prior observations |
| Premature termination | Answers without enough info | Confidence miscalibration | Require N observations before answer |
| Infinite loop | Same thought-action pair | No progress detection | Cycle detection + forced explore |
| Plan fragmentation | Each thought ignores prior plan | Lack of persistent plan state | Explicit plan field maintained across turns |
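The infinite-loop fix in the table (cycle detection plus a forced-explore nudge) can be sketched as a check on repeated action-argument pairs; the function and the nudge text are illustrative, not a fixed recipe:

```python
def is_cycling(history: list[tuple[str, str]], window: int = 3) -> bool:
    """Detect a stalled ReAct loop: the same (action, args) pair
    repeated for the last `window` turns signals no progress."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(pair == recent[0] for pair in recent)

# On detection, a forced-explore nudge is appended to the next prompt, e.g.:
NUDGE = ("Your last actions repeated without producing new information. "
         "Choose a different tool or different arguments.")
```

A stricter variant also hashes observations, so that identical actions returning identical results count as a cycle even when the thoughts differ.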

Reflexion — self-critique and retry mechanics

| Component | Role | Implementation |
| --- | --- | --- |
| Actor | Generates candidate answer (often via ReAct) | Policy model |
| Evaluator | Scores answer against task | Heuristic rule, reward model, or test-runner |
| Self-reflection | Produces critique from failure | LLM-generated reflection in prose |
| Memory | Stores past reflections | Persistent buffer across attempts |
| Retry | Incorporates reflection into next attempt | Reflection prepended to next prompt |
| Termination | Stops on success or budget exhausted | Typically 3-5 iterations |
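The component table composes into one outer loop. A minimal sketch, where `actor`, `evaluator`, and `reflect` are hypothetical callables standing in for the policy model, the test-runner or reward model, and the critique prompt:

```python
from typing import Callable, List, Tuple

def reflexion(actor: Callable[[str, List[str]], str],
              evaluator: Callable[[str], Tuple[bool, str]],
              reflect: Callable[[str, str], str],
              task: str, max_iters: int = 3) -> str:
    """Act → Evaluate → Self-reflect → Retry, with a persistent
    memory of past reflections fed into each new attempt."""
    memory: List[str] = []                 # reflections persist across attempts
    answer = actor(task, memory)
    for _ in range(max_iters):
        ok, feedback = evaluator(answer)   # e.g. unit-test pass/fail + output
        if ok:
            return answer                  # terminate on success
        memory.append(reflect(answer, feedback))
        answer = actor(task, memory)       # retry with reflections prepended
    return answer                          # budget exhausted: best effort
```

Everything hinges on `evaluator`: with a real signal (tests, a verifier) the loop converges; with an ungrounded LLM-only critique it mostly burns tokens, which is the point the next table makes.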

When Reflexion adds value vs wastes compute

| Condition | Reflexion benefit | Alternative |
| --- | --- | --- |
| Verifiable ground truth | High — critique grounded | N/A |
| Unit-test-style evaluation | High — test results guide critique | N/A |
| Subjective quality | Low-medium — critique is a weak signal | Human review |
| No automatic evaluation | No benefit — self-critique unfounded | Don’t use |
| High task cost (tokens) | Only if accuracy critical | Single-shot cheaper |
| Real-time latency | Usually too slow | Skip Reflexion; cache |

Cost-per-task comparison

| Paradigm | Typical token cost per task | Latency | Accuracy ceiling |
| --- | --- | --- | --- |
| Direct answer | 100-500 tokens | 1-3s | Workload baseline |
| CoT | 300-1500 tokens | 2-5s | Baseline +10-30% |
| CoT + Self-consistency (N=5) | 1500-7500 tokens | 5-15s (parallel) | CoT +5-10% |
| ReAct (3-5 turns) | 2000-8000 tokens | 10-30s | Baseline +20-50% |
| Reflexion (3 iterations) | 3000-15000 tokens | 20-60s | ReAct +5-15% |
| ReAct + Reflexion | 5000-25000 tokens | 30-120s | Baseline +30-60% |
| Tree of Thoughts | 5000-50000 tokens | 30-300s | ReAct +5-15% on structured tasks |

Cost-accuracy Pareto — which paradigm for what stakes

| Stakes level | Example | Recommended |
| --- | --- | --- |
| Low (internal tool) | Ticket classification | Direct or CoT |
| Medium (user-facing Q&A) | Customer support FAQ | ReAct |
| High (agent executes action) | Booking flight, purchase | ReAct + human confirm |
| Very high (medical, legal) | Diagnosis aid, contract review | Reflexion + human review |
| Mission-critical | Financial trade, medication dosing | Not LLM-owned — advisory only |

Tool-use integration patterns across paradigms

| Pattern | CoT | ReAct | Reflexion |
| --- | --- | --- | --- |
| No tool use | Native | N/A | N/A |
| Single tool (search) | Via function calling | Core primitive | As inner loop |
| Multi-tool | Awkward in CoT | Native | Native |
| Parallel tool calls | Not applicable | Supported in modern APIs | Supported |
| Tool result streaming | N/A | Requires async | Requires async |
| Stateful tools | N/A | Requires session mgmt | Requires session mgmt |
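The “strict tool schema” mitigation that recurs throughout can be illustrated with a JSON-Schema-style function definition of the kind most function-calling APIs accept. The tool name, fields, and validator below are hypothetical; the exact schema envelope varies by provider:

```python
# Narrow enums, required fields, and a descriptive name reduce both
# wrong-tool selection and hallucinated arguments.
SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": "Look up a customer's orders by order ID or email. "
                   "Use ONLY for order-status questions.",
    "parameters": {
        "type": "object",
        "properties": {
            "lookup_by": {"type": "string", "enum": ["order_id", "email"]},
            "value": {"type": "string"},
        },
        "required": ["lookup_by", "value"],
        "additionalProperties": False,
    },
}

def validate_call(tool: dict, args: dict) -> bool:
    """Minimal server-side check of a model-proposed call against the
    schema's required fields and enum constraints."""
    props = tool["parameters"]["properties"]
    required = tool["parameters"]["required"]
    if any(key not in args for key in required):
        return False
    for key, val in args.items():
        spec = props.get(key)
        if spec is None or ("enum" in spec and val not in spec["enum"]):
            return False
    return True
```

Validating the proposed call before execution, and feeding a rejection back as an observation, turns a silent wrong-tool failure into something the ReAct loop can recover from.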

Benchmark performance deltas — representative numbers as of 2026-04

| Benchmark | CoT baseline | ReAct | Reflexion | Notes |
| --- | --- | --- | --- | --- |
| GSM8K (math word problems) | 85-92% | 85-92% (no improvement — no tool) | 90-95% (with self-check) | CoT sufficient |
| HotpotQA (multi-hop QA) | 35-50% | 55-70% | 60-75% | ReAct with search dominates |
| AlfWorld (household tasks) | 15-25% | 50-70% | 60-80% | Environment interaction |
| WebShop (web navigation) | 10-20% | 35-55% | 45-65% | ReAct with web-nav tools |
| HumanEval (code) | 70-85% | 75-90% (with execution tool) | 80-92% | Reflexion with test feedback |
| MedQA (medical QA) | 65-75% | 70-80% (with lookup) | 72-82% | Domain-specific gains |
| LegalBench | 60-75% | 65-80% | 68-82% | Similar pattern |
| MMLU (multi-subject) | 70-85% | 72-86% | 73-87% | Marginal gains — knowledge-bound |

Observability requirements

| Paradigm | What to log | Why |
| --- | --- | --- |
| CoT | Full reasoning trace | Debug wrong conclusions |
| ReAct | Every thought, action, observation | Audit decisions + compliance |
| Reflexion | All critique iterations + final | Track quality trajectory |
| All | Token counts per turn | Cost monitoring |
| All | Latency per turn | Performance regression |
| Agent actions | Tool-call args + results | Security + reproducibility |
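The rows above boil down to one record per turn. A minimal sketch using only the standard library, with field names chosen for illustration:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class AgentTurnLog:
    """One thought-action-observation tuple plus cost/latency metadata,
    serialized as a JSONL line for downstream audit and regression checks."""
    turn: int
    thought: str
    action: str
    action_args: str
    observation: str
    tokens_in: int
    tokens_out: int
    latency_s: float
    ts: float = field(default_factory=time.time)

    def to_jsonl(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

log = AgentTurnLog(1, "need order status", "search_orders",
                   '{"lookup_by": "email"}', "3 orders found",
                   tokens_in=812, tokens_out=64, latency_s=1.7)
```

One JSONL line per turn keeps cost monitoring (sum token fields), latency regression (per-turn timings), and compliance audit (full tuples) all on the same append-only stream.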

Failure-mode matrix

| Failure | CoT | ReAct | Reflexion |
| --- | --- | --- | --- |
| Hallucinated facts | Common | Mitigated by tool grounding | Mitigated if critic catches |
| Wrong tool | N/A | Common — top failure | Mitigated by critic |
| Looping | Rare | Common | Bounded by retry budget |
| Premature answer | Rare | Medium | Mitigated by evaluator |
| Over-reasoning | Medium (verbose) | Medium | High — many iterations |
| Silent failure | Rare | Medium (ignores obs) | Low — explicit critique |
| Cost blowout | Low | Medium | High |

When to compose paradigms

| Composition | Trade-off | Best for |
| --- | --- | --- |
| CoT only | Fast, simple | Simple reasoning tasks |
| CoT + Self-consistency | Better math accuracy | Math, structured reasoning |
| ReAct | Tool-grounded | Most agent tasks |
| ReAct + Reflexion | Highest accuracy | High-stakes with verifiable outcomes |
| ReAct + ToT | Exploration + branching | Complex search problems |
| Reflexion alone (no tools) | Self-correcting CoT | Writing, code (with tests) |
| Human-in-the-loop | Safety + quality | Medical, legal, financial |

Quick Reference Summary

| Decision | Default | When to deviate |
| --- | --- | --- |
| Paradigm | ReAct | CoT for no-tool; Reflexion overlay for high-stakes |
| Max turns | 5-10 for ReAct | 3 for cost; 15 for complex |
| Reflection iterations | 3 | 5 for critical; 1 for cheap retry |
| Self-consistency N | 5 | 10 for math; 1 for non-reasoning |
| Tool budget | 10 calls per task | Lower for cost; higher for research |
| Critic type | LLM-based for Reflexion | Rule-based faster; test-runner strongest |
| Termination | Answer or budget | Add early-stop on confidence |
| Temperature | 0 for strict reasoning | 0.5+ for exploration (ToT) |
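The defaults in the table can live in one configuration object so every agent instance starts from the same operating point. The class and field names below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentBudgets:
    """Defaults mirroring the quick-reference table; tune per workload."""
    max_react_turns: int = 8        # 5-10 default; 3 for cost, 15 for complex
    reflexion_iters: int = 3        # 5 for critical; 1 for cheap retry
    self_consistency_n: int = 5     # 10 for math; 1 for non-reasoning
    tool_call_budget: int = 10      # lower for cost; higher for research
    temperature: float = 0.0        # 0.5+ only for exploration (ToT)

DEFAULTS = AgentBudgets()
```

Freezing the dataclass makes deviations explicit: a high-stakes workload constructs `AgentBudgets(reflexion_iters=5)` rather than mutating a shared default.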

How to apply this

Default to ReAct for agent tasks with external tools — HotpotQA-style multi-hop QA, WebShop-style navigation, and multi-tool routing all favor ReAct over CoT by 20-50% accuracy.

Use plain CoT for no-tool reasoning tasks — math, commonsense, logical inference — and skip CoT entirely for simple classification where verbosity introduces errors.

Add Self-consistency (N=5) to CoT for math and structured reasoning — majority-voting across parallel reasoning traces adds 5-10% accuracy with 5× cost.

Add Reflexion overlay only when an automatic evaluator exists — unit tests for code, verifiers for math, expert-style rubric critics for high-stakes text. Without an evaluator, Reflexion wastes compute.

Cap ReAct iteration budgets at 5-10 turns per task with explicit cycle detection — infinite loops and observation blind-spots are the top-two ReAct failure modes.

Set tool schemas strictly — narrow tool sets, enum tool_choice when possible, descriptive names — the #1 ReAct failure is wrong-tool selection traceable to ambiguous tool descriptions.

Log every thought-action-observation tuple — agent observability is non-negotiable for compliance, debugging, and behavior regression detection.

Compose ReAct + Reflexion only when the task justifies 5000-25000 tokens per invocation and 30-120s latency — for most production workloads, ReAct alone is the right operating point.

Honest Limitations

  • Model-capability generations shift on 6-18 month cycles: The specific tool-use fidelity, self-critique quality, and in-context reasoning depth of Claude Opus 4.7, GPT-5, and Gemini 2.5 as of 2026-04 will change with successor models. Paradigm classes (CoT · ReAct · Reflexion · ToT · Self-consistency) remain stable primitives; model-specific accuracy numbers will shift.
  • Benchmark numbers are representative, not guaranteed: The 10-80% ranges above reflect typical published results across multiple model generations on GSM8K, HotpotQA, AlfWorld, WebShop, HumanEval, MedQA, LegalBench, and MMLU. Your specific workload will produce different absolute numbers; the relative paradigm ordering tends to hold.
  • Reflexion quality depends on evaluator quality: Self-critique with a weak evaluator (LLM-only, no ground truth, no tests) produces diminishing returns. With strong evaluators (unit tests, gold labels, expert rubrics), Reflexion is highly effective. Measure evaluator reliability before trusting Reflexion outputs.
  • ReAct is latency-heavy: 10-30s latency for a 3-5 turn ReAct loop is often unacceptable for real-time UX. Stream observations, prefetch common tool results, and cache where possible.
  • “Agent” in 2026 still means different things to different teams: Some treat agent = ReAct; others treat agent = autonomous multi-step workflow orchestration; others treat agent = RL-trained policy. This article treats agent as “LLM-driven tool-use loop” per the original ReAct paper framing.
  • Tool-use safety is not covered here: Agent paradigms that invoke tools that modify external state (sending email, making purchases, executing code) require safety layers — human confirmation, sandboxing, rate limiting, action auditing — that this article does not specify. These are non-negotiable for production agent deployment.