You Built an LLM Agent That Answers Customer Questions Correctly 72% of the Time in Testing, Your Boss Wants 95%, and You Keep Adding Tools Thinking It Will Help — The Problem Isn’t the Tools, It’s That You’re Using Chain-of-Thought Where You Need ReAct, and Adding Tools Without a Thought-Action-Observation Loop Just Multiplies the Ways the Agent Can Get Confused

Agent reasoning paradigms are not interchangeable. Chain-of-Thought (CoT), ReAct (Reasoning + Acting), and Reflexion (self-critique + retry) each optimize for different constraint sets, and using the wrong paradigm for a workload produces predictable failure patterns. A CoT-only agent fails when the task requires real-world lookup. A ReAct agent fails when the observation space is too complex to reason about in-context. A Reflexion agent fails on tasks with no reliable self-critique signal. The right paradigm for each workload is stable — and getting it wrong is the single most common reason production LLM agents underperform their benchmarks.

As of 2026-04, the three paradigms represent three distinct answers to the question “what happens after the model generates its next token.” CoT chains tokens into a reasoning trace before committing to an answer. ReAct interleaves reasoning with tool invocations and observations from the environment. Reflexion adds an outer loop that inspects the final answer, self-critiques, and regenerates if the critique finds errors. Each adds latency, each adds cost, and each unlocks a different quality ceiling. The correct architecture composes them — simple queries get CoT, complex tool-use tasks get ReAct, high-stakes tasks with verifiable outputs get Reflexion on top of ReAct.

This article compares the three paradigms across nine decision dimensions — task fit, tool integration, error recovery, convergence guarantees, cost-per-task, latency, observability, benchmark performance, and failure modes. The framing is in terms of paradigm classes rather than specific model recipes: the specific prompts that make Claude Opus 4.7 shine at ReAct today will shift with the next model generation, but the paradigm classes themselves are stable primitives that compose.

Three-paradigm comparison — core mechanics

| Paradigm | Core loop | Tool use | Self-correction | Typical latency |
| --- | --- | --- | --- | --- |
| Chain-of-Thought (CoT) | Reason → Answer | No | No | 1× baseline |
| ReAct | Thought → Action → Observation → (repeat) → Answer | Yes | Within loop — re-plan on observation | 3-10× baseline |
| Reflexion | Act → Evaluate → Self-critique → Retry → Answer | Often yes (via ReAct) | Yes — outer loop | 2-5× baseline |
| CoT + Self-consistency | N parallel CoT → Majority vote | No | No — sampling-based | N× baseline |
| ReAct + Reflexion | ReAct inner + Reflexion outer | Yes | Yes — both levels | 5-20× baseline |
| Tree of Thoughts (ToT) | Search tree of partial reasoning paths | Optional | Via pruning | 5-50× baseline |

When each paradigm wins — task-to-paradigm mapping

| Task type | Best paradigm | Rationale | Typical accuracy delta |
| --- | --- | --- | --- |
| Simple math / logic | CoT | No tools needed | +20-30% vs direct answer |
| Multi-step arithmetic | CoT + Self-consistency | Majority vote smooths errors | +5-10% over plain CoT |
| Fact retrieval from knowledge | ReAct with search tool | External lookup needed | +15-25% vs parametric-only |
| Multi-hop QA (HotpotQA) | ReAct | Must search, read, combine | +20-30% vs CoT alone |
| Web navigation (WebShop) | ReAct | Environment interaction | +30-50% vs CoT alone |
| Code generation with tests | Reflexion | Test feedback enables retry | +10-20% vs single-shot |
| Household task (AlfWorld) | ReAct + Reflexion | Environment + verifiable outcome | +15-25% vs ReAct alone |
| Creative writing | CoT | No verifiable ground truth | No paradigm dominates |
| Classification with few classes | Direct (no CoT) | CoT overhead without benefit | CoT sometimes hurts |
| High-stakes decision (medical, legal) | Reflexion with expert-style critique | Self-critique adds safety | +5-10% accuracy, plus a safety margin |

Chain-of-Thought — when it helps and when it doesn’t

| Task characteristic | CoT benefit | Why |
| --- | --- | --- |
| Multi-step reasoning required | Strong benefit | Model externalizes intermediate state |
| Single-step answer | No benefit; sometimes harmful | Verbosity introduces errors |
| Simple classification | No benefit; usually harmful | Forces justification that may be wrong |
| Math / arithmetic | Strong benefit | Step-by-step reduces errors |
| Commonsense reasoning | Strong benefit | Explicit inference chain |
| Language translation | No benefit | Direct task |
| Summarization | No benefit | The reasoning trace crowds out the answer itself |
| Code explanation | Moderate benefit | Step-by-step trace useful |

CoT prompting variants

| Variant | Prompt pattern | Best use |
| --- | --- | --- |
| Zero-shot CoT | “Let’s think step by step.” | General-purpose default |
| Few-shot CoT | Example reasoning traces in prompt | Task-specific accuracy boost |
| Self-consistency | Sample N times; majority vote | Math, structured reasoning |
| Plan-and-solve | “Let’s first devise a plan…” | Multi-step tasks |
| Program-aided | Generate code; execute | Math, symbolic tasks |
| Least-to-most | Decompose into sub-problems | Complex multi-step |
| Analogical | Retrieve/generate analogous solved problem first | Novel tasks |
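The self-consistency variant reduces to a sample-and-vote wrapper around any CoT call. A minimal sketch, assuming a hypothetical `sample_cot` callable that runs one chain-of-thought completion at temperature > 0 and returns only the extracted final answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_cot: Callable[[], str], n: int = 5) -> str:
    """Sample n independent CoT traces and majority-vote on the final answer."""
    answers = [sample_cot() for _ in range(n)]
    # Counter.most_common is stable, so ties break toward the earliest-seen answer.
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for sampled model calls, for illustration only.
fake_samples = iter(["12", "12", "13", "12", "11"])
print(self_consistency(lambda: next(fake_samples), n=5))  # "12"
```

Only the final answers are voted on; the reasoning traces themselves are discarded, which is why this variant helps on tasks with a short canonical answer (math, multiple choice) and not on free-form generation.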

ReAct — Thought-Action-Observation loop anatomy

| Step | Purpose | Failure mode | Mitigation |
| --- | --- | --- | --- |
| Thought | Reason about current state | Hallucinated plan | Ground in observation |
| Action | Invoke tool with structured args | Wrong tool selection | Narrow tool set, descriptive names |
| Observation | Receive tool output | Output too large for context | Summarize/truncate |
| Repeat | Until goal reached or budget exhausted | Infinite loop | Max-iteration budget |
| Answer | Final response | Answer not grounded in observations | Require observation-citation in answer |
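The anatomy above composes into a short loop. A minimal sketch, where `call_llm`, the step-dict shape, and the tool registry are hypothetical stand-ins for a real model call and parser:

```python
from typing import Callable, Dict

def react_loop(call_llm: Callable[[str], dict],
               tools: Dict[str, Callable[[str], str]],
               task: str, max_turns: int = 8) -> str:
    """Thought → Action → Observation loop with an iteration budget.

    call_llm is assumed to return either
      {"thought": ..., "action": "search", "args": "..."}  or
      {"thought": ..., "answer": "..."}.
    """
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        step = call_llm(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if "answer" in step:                      # model chose to terminate
            return step["answer"]
        tool = tools.get(step["action"])
        if tool is None:                          # tool-hallucination guard
            observation = f"Error: unknown tool {step['action']!r}"
        else:
            observation = tool(step["args"])
        transcript += f"Action: {step['action']}({step['args']})\n"
        transcript += f"Observation: {observation}\n"
    return "Budget exhausted without an answer."
```

Feeding the unknown-tool error back as an observation (rather than raising) is what lets the model re-plan within the loop, which is the self-correction property the comparison table attributes to ReAct.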

ReAct failure modes — the five classic patterns

| Failure | Signature | Cause | Fix |
| --- | --- | --- | --- |
| Tool hallucination | Invokes tool that doesn’t exist | Overlapping descriptions | Strict tool schema + enum tool_choice |
| Observation blind-spot | Repeats same action ignoring observation | Context-length dilution | Summarize prior observations |
| Premature termination | Answers without enough info | Confidence miscalibration | Require N observations before answer |
| Infinite loop | Same thought-action pair | No progress detection | Cycle detection + forced explore |
| Plan fragmentation | Each thought ignores prior plan | Lack of persistent plan state | Explicit plan field maintained across turns |
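The infinite-loop fix in the table (cycle detection plus a forced-explore nudge) can be sketched as a check on repeated action-argument pairs; the function and the nudge text are illustrative, not a fixed recipe:

```python
def is_cycling(history: list[tuple[str, str]], window: int = 3) -> bool:
    """Detect a stalled ReAct loop: the same (action, args) pair
    repeated for the last `window` turns signals no progress."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return all(pair == recent[0] for pair in recent)

# On detection, a forced-explore nudge is appended to the next prompt, e.g.:
NUDGE = ("Your last actions repeated without producing new information. "
         "Choose a different tool or different arguments.")
```

A stricter variant also hashes observations, so that identical actions returning identical results count as a cycle even when the thoughts differ.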

Reflexion — self-critique and retry mechanics

| Component | Role | Implementation |
| --- | --- | --- |
| Actor | Generates candidate answer (often via ReAct) | Policy model |
| Evaluator | Scores answer against task | Heuristic rule, reward model, or test-runner |
| Self-reflection | Produces critique from failure | LLM-generated reflection in prose |
| Memory | Stores past reflections | Persistent buffer across attempts |
| Retry | Incorporates reflection into next attempt | Reflection prepended to next prompt |
| Termination | Stops on success or budget exhausted | Typically 3-5 iterations |
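The component table composes into one outer loop. A minimal sketch, where `actor`, `evaluator`, and `reflect` are hypothetical callables standing in for the policy model, the test-runner or reward model, and the critique prompt:

```python
from typing import Callable, List, Tuple

def reflexion(actor: Callable[[str, List[str]], str],
              evaluator: Callable[[str], Tuple[bool, str]],
              reflect: Callable[[str, str], str],
              task: str, max_iters: int = 3) -> str:
    """Act → Evaluate → Self-reflect → Retry, with a persistent
    memory of past reflections fed into each new attempt."""
    memory: List[str] = []                 # reflections persist across attempts
    answer = actor(task, memory)
    for _ in range(max_iters):
        ok, feedback = evaluator(answer)   # e.g. unit-test pass/fail + output
        if ok:
            return answer                  # terminate on success
        memory.append(reflect(answer, feedback))
        answer = actor(task, memory)       # retry with reflections prepended
    return answer                          # budget exhausted: best effort
```

Everything hinges on `evaluator`: with a real signal (tests, a verifier) the loop converges; with an ungrounded LLM-only critique it mostly burns tokens, which is the point the next table makes.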

When Reflexion adds value vs wastes compute

| Condition | Reflexion benefit | Alternative |
| --- | --- | --- |
| Verifiable ground truth | High — critique grounded | N/A |
| Unit-test-style evaluation | High — test results guide critique | N/A |
| Subjective quality | Low-medium — critique is a weak signal | Human review |
| No automatic evaluation | No benefit — self-critique unfounded | Don’t use |
| High task cost (tokens) | Only if accuracy critical | Single-shot cheaper |
| Real-time latency | Usually too slow | Skip Reflexion; cache |

Cost-per-task comparison

| Paradigm | Typical token cost per task | Latency | Accuracy ceiling |
| --- | --- | --- | --- |
| Direct answer | 100-500 tokens | 1-3s | Workload baseline |
| CoT | 300-1500 tokens | 2-5s | Baseline +10-30% |
| CoT + Self-consistency (N=5) | 1500-7500 tokens | 5-15s (parallel) | CoT +5-10% |
| ReAct (3-5 turns) | 2000-8000 tokens | 10-30s | Baseline +20-50% |
| Reflexion (3 iterations) | 3000-15000 tokens | 20-60s | ReAct +5-15% |
| ReAct + Reflexion | 5000-25000 tokens | 30-120s | Baseline +30-60% |
| Tree of Thoughts | 5000-50000 tokens | 30-300s | ReAct +5-15% on structured tasks |

Cost-accuracy Pareto — which paradigm for what stakes

| Stakes level | Example | Recommended |
| --- | --- | --- |
| Low (internal tool) | Ticket classification | Direct or CoT |
| Medium (user-facing Q&A) | Customer support FAQ | ReAct |
| High (agent executes action) | Booking flight, purchase | ReAct + human confirm |
| Very high (medical, legal) | Diagnosis aid, contract review | Reflexion + human review |
| Mission-critical | Financial trade, medication dosing | Not LLM-owned — advisory only |

Tool-use integration patterns across paradigms

| Pattern | CoT | ReAct | Reflexion |
| --- | --- | --- | --- |
| No tool use | Native | N/A | N/A |
| Single tool (search) | Via function calling | Core primitive | As inner loop |
| Multi-tool | Awkward in CoT | Native | Native |
| Parallel tool calls | Not applicable | Supported in modern APIs | Supported |
| Tool result streaming | N/A | Requires async | Requires async |
| Stateful tools | N/A | Requires session mgmt | Requires session mgmt |
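The “strict tool schema” mitigation that recurs throughout can be illustrated with a JSON-Schema-style function definition of the kind most function-calling APIs accept. The tool name, fields, and validator below are hypothetical; the exact schema envelope varies by provider:

```python
# Narrow enums, required fields, and a descriptive name reduce both
# wrong-tool selection and hallucinated arguments.
SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": "Look up a customer's orders by order ID or email. "
                   "Use ONLY for order-status questions.",
    "parameters": {
        "type": "object",
        "properties": {
            "lookup_by": {"type": "string", "enum": ["order_id", "email"]},
            "value": {"type": "string"},
        },
        "required": ["lookup_by", "value"],
        "additionalProperties": False,
    },
}

def validate_call(tool: dict, args: dict) -> bool:
    """Minimal server-side check of a model-proposed call against the
    schema's required fields and enum constraints."""
    props = tool["parameters"]["properties"]
    required = tool["parameters"]["required"]
    if any(key not in args for key in required):
        return False
    for key, val in args.items():
        spec = props.get(key)
        if spec is None or ("enum" in spec and val not in spec["enum"]):
            return False
    return True
```

Validating the proposed call before execution, and feeding a rejection back as an observation, turns a silent wrong-tool failure into something the ReAct loop can recover from.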

Benchmark performance deltas — representative numbers as of 2026-04

| Benchmark | CoT baseline | ReAct | Reflexion | Notes |
| --- | --- | --- | --- | --- |
| GSM8K (math word problems) | 85-92% | 85-92% (no improvement — no tool) | 90-95% (with self-check) | CoT sufficient |
| HotpotQA (multi-hop QA) | 35-50% | 55-70% | 60-75% | ReAct with search dominates |
| AlfWorld (household tasks) | 15-25% | 50-70% | 60-80% | Environment interaction |
| WebShop (web navigation) | 10-20% | 35-55% | 45-65% | ReAct with web-nav tools |
| HumanEval (code) | 70-85% | 75-90% (with execution tool) | 80-92% | Reflexion with test feedback |
| MedQA (medical QA) | 65-75% | 70-80% (with lookup) | 72-82% | Domain-specific gains |
| LegalBench | 60-75% | 65-80% | 68-82% | Similar pattern |
| MMLU (multi-subject) | 70-85% | 72-86% | 73-87% | Marginal gains — knowledge-bound |

Observability requirements

| Paradigm | What to log | Why |
| --- | --- | --- |
| CoT | Full reasoning trace | Debug wrong conclusions |
| ReAct | Every thought, action, observation | Audit decisions + compliance |
| Reflexion | All critique iterations + final | Track quality trajectory |
| All | Token counts per turn | Cost monitoring |
| All | Latency per turn | Performance regression |
| Agent actions | Tool-call args + results | Security + reproducibility |
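The rows above boil down to one record per turn. A minimal sketch using only the standard library, with field names chosen for illustration:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class AgentTurnLog:
    """One thought-action-observation tuple plus cost/latency metadata,
    serialized as a JSONL line for downstream audit and regression checks."""
    turn: int
    thought: str
    action: str
    action_args: str
    observation: str
    tokens_in: int
    tokens_out: int
    latency_s: float
    ts: float = field(default_factory=time.time)

    def to_jsonl(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

log = AgentTurnLog(1, "need order status", "search_orders",
                   '{"lookup_by": "email"}', "3 orders found",
                   tokens_in=812, tokens_out=64, latency_s=1.7)
```

One JSONL line per turn keeps cost monitoring (sum token fields), latency regression (per-turn timings), and compliance audit (full tuples) all on the same append-only stream.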

Failure-mode matrix

| Failure | CoT | ReAct | Reflexion |
| --- | --- | --- | --- |
| Hallucinated facts | Common | Mitigated by tool grounding | Mitigated if critic catches |
| Wrong tool | N/A | Common — top failure | Mitigated by critic |
| Looping | Rare | Common | Bounded by retry budget |
| Premature answer | Rare | Medium | Mitigated by evaluator |
| Over-reasoning | Medium (verbose) | Medium | High — many iterations |
| Silent failure | Rare | Medium (ignores obs) | Low — explicit critique |
| Cost blowout | Low | Medium | High |

When to compose paradigms

| Composition | Trade-off | Best for |
| --- | --- | --- |
| CoT only | Fast, simple | Simple reasoning tasks |
| CoT + Self-consistency | Better math accuracy | Math, structured reasoning |
| ReAct | Tool-grounded | Most agent tasks |
| ReAct + Reflexion | Highest accuracy | High-stakes with verifiable outcomes |
| ReAct + ToT | Exploration + branching | Complex search problems |
| Reflexion alone (no tools) | Self-correcting CoT | Writing, code (with tests) |
| Human-in-the-loop | Safety + quality | Medical, legal, financial |

Quick Reference Summary

| Decision | Default | When to deviate |
| --- | --- | --- |
| Paradigm | ReAct | CoT for no-tool; Reflexion overlay for high-stakes |
| Max turns | 5-10 for ReAct | 3 for cost; 15 for complex |
| Reflection iterations | 3 | 5 for critical; 1 for cheap retry |
| Self-consistency N | 5 | 10 for math; 1 for non-reasoning |
| Tool budget | 10 calls per task | Lower for cost; higher for research |
| Critic type | LLM-based for Reflexion | Rule-based faster; test-runner strongest |
| Termination | Answer or budget | Add early-stop on confidence |
| Temperature | 0 for strict reasoning | 0.5+ for exploration (ToT) |
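The defaults in the table can live in one configuration object so every agent instance starts from the same operating point. The class and field names below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentBudgets:
    """Defaults mirroring the quick-reference table; tune per workload."""
    max_react_turns: int = 8        # 5-10 default; 3 for cost, 15 for complex
    reflexion_iters: int = 3        # 5 for critical; 1 for cheap retry
    self_consistency_n: int = 5     # 10 for math; 1 for non-reasoning
    tool_call_budget: int = 10      # lower for cost; higher for research
    temperature: float = 0.0        # 0.5+ only for exploration (ToT)

DEFAULTS = AgentBudgets()
```

Freezing the dataclass makes deviations explicit: a high-stakes workload constructs `AgentBudgets(reflexion_iters=5)` rather than mutating a shared default.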

How to apply this

Default to ReAct for agent tasks with external tools — HotpotQA-style multi-hop QA, WebShop-style navigation, and multi-tool routing all favor ReAct over CoT by 20-50% accuracy.

Use plain CoT for no-tool reasoning tasks — math, commonsense, logical inference — and skip CoT entirely for simple classification where verbosity introduces errors.

Add Self-consistency (N=5) to CoT for math and structured reasoning — majority-voting across parallel reasoning traces adds 5-10% accuracy with 5× cost.

Add Reflexion overlay only when an automatic evaluator exists — unit tests for code, verifiers for math, expert-style rubric critics for high-stakes text. Without an evaluator, Reflexion wastes compute.

Cap ReAct iteration budgets at 5-10 turns per task with explicit cycle detection — infinite loops and observation blind-spots are the top-two ReAct failure modes.

Set tool schemas strictly — narrow tool sets, enum tool_choice when possible, descriptive names — the #1 ReAct failure is wrong-tool selection traceable to ambiguous tool descriptions.

Log every thought-action-observation tuple — agent observability is non-negotiable for compliance, debugging, and behavior regression detection.

Compose ReAct + Reflexion only when the task justifies 5000-25000 tokens per invocation and 30-120s latency — for most production workloads, ReAct alone is the right operating point.

Honest Limitations

  • Model-capability generations shift on 6-18 month cycles: The specific tool-use fidelity, self-critique quality, and in-context reasoning depth of Claude Opus 4.7, GPT-5, and Gemini 2.5 as of 2026-04 will change with successor models. Paradigm classes (CoT · ReAct · Reflexion · ToT · Self-consistency) remain stable primitives; model-specific accuracy numbers will shift.
  • Benchmark numbers are representative, not guaranteed: The 10-80% ranges above reflect typical published results across multiple model generations on GSM8K, HotpotQA, AlfWorld, WebShop, HumanEval, MedQA, LegalBench, and MMLU. Your specific workload will produce different absolute numbers; the relative paradigm ordering tends to hold.
  • Reflexion quality depends on evaluator quality: Self-critique with a weak evaluator (LLM-only, no ground truth, no tests) produces diminishing returns. With strong evaluators (unit tests, gold labels, expert rubrics), Reflexion is highly effective. Measure evaluator reliability before trusting Reflexion outputs.
  • ReAct is latency-heavy: 10-30s latency for a 3-5 turn ReAct loop is often unacceptable for real-time UX. Stream observations, prefetch common tool results, and cache where possible.
  • “Agent” in 2026 still means different things to different teams: Some treat agent = ReAct; others treat agent = autonomous multi-step workflow orchestration; others treat agent = RL-trained policy. This article treats agent as “LLM-driven tool-use loop” per the original ReAct paper framing.
  • Tool-use safety is not covered here: Agent paradigms that invoke tools that modify external state (sending email, making purchases, executing code) require safety layers — human confirmation, sandboxing, rate limiting, action auditing — that this article does not specify. These are non-negotiable for production agent deployment.