Your AI Agent Works Perfectly on Simple Tasks but Falls Apart on Multi-Step Problems — The Architecture Determines the Ceiling

AI agents — LLMs that can plan, use tools, and take actions — are the most powerful and most dangerous application pattern in production AI. Powerful because a well-designed agent can complete tasks that require 10+ steps, multiple tool calls, and dynamic decision-making. Dangerous because a poorly designed agent can loop infinitely, call the wrong tools, spend $500 on a task that should cost $0.50, and take irreversible actions based on hallucinated reasoning. The architecture pattern you choose determines both the capability ceiling and the failure floor. This guide compares the major agent architectures, documents their failure modes, and provides the decision matrix for production agent systems.

The Agent Architecture Taxonomy

Pattern 1 — ReAct (Reasoning + Action)

The model alternates between reasoning (thinking about what to do) and acting (calling tools). Each cycle: observe → think → act → observe result → think → act.

| Dimension | Value |
| --- | --- |
| Mechanism | Interleaved thought and tool-call steps |
| Planning horizon | One step at a time (greedy) |
| Tool integration | Native — each action is a tool call |
| Max effective steps | 5-15 steps before quality degrades |
| Cost per task | Medium (reasoning tokens + tool calls) |
| Latency | Medium (sequential think-act cycles) |
| Failure mode | Loops, tool misuse, premature termination |
| Best for | Information gathering, simple multi-step tasks |
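The observe → think → act cycle can be sketched as a simple loop. This is a minimal illustration, not a framework: `call_model` and the `TOOLS` registry are hypothetical stubs standing in for a real LLM client and real tools, and the hard step limit implements the max-step guard discussed later.

```python
def call_model(history):
    """Stub: a real implementation would call an LLM and parse its reply.
    Returns ("act", tool_name, args) or ("finish", answer, None)."""
    if not any(role == "tool" for role, _ in history):
        return ("act", "search", {"query": "capital of France"})
    return ("finish", "Paris", None)

# Hypothetical tool registry: name -> callable
TOOLS = {"search": lambda query: f"Top result for '{query}': Paris"}

def react_loop(task, max_steps=15):
    history = [("user", task)]
    for _ in range(max_steps):              # hard stop prevents infinite loops
        kind, payload, args = call_model(history)   # think
        if kind == "finish":
            return payload
        observation = TOOLS[payload](**args)        # act
        history.append(("tool", observation))       # observe
    raise RuntimeError("step limit exceeded")
```

The loop is greedy by design: each iteration decides only the next action, which is exactly why quality degrades past roughly 5-15 steps.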

Pattern 2 — Plan-and-Execute

The model first creates a complete plan, then executes each step. A separate planning phase identifies all required steps before any action is taken.

| Dimension | Value |
| --- | --- |
| Mechanism | Plan phase → execution phase (separate LLM calls) |
| Planning horizon | Full task decomposition upfront |
| Tool integration | Each plan step maps to one or more tool calls |
| Max effective steps | 10-30 steps (plan provides structure) |
| Cost per task | Higher (planning tokens + execution tokens) |
| Latency | Higher (planning adds latency before first action) |
| Failure mode | Rigid plan can’t adapt to unexpected tool results |
| Best for | Complex, well-defined tasks with predictable tool outputs |
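The separation between the two phases can be sketched as follows; `plan` and `execute_step` are hypothetical stubs for the planning LLM call and the per-step tool execution.

```python
def plan(task):
    """Stub planner: a real version asks the LLM for a full step list upfront."""
    return ["fetch sales data", "compute totals", "write summary"]

def execute_step(step, prior_results):
    """Stub executor: a real version maps each step to one or more tool calls."""
    return f"done: {step}"

def plan_and_execute(task):
    steps = plan(task)          # planning phase: one LLM call, full decomposition
    results = []
    for step in steps:          # execution phase: steps run in plan order
        results.append(execute_step(step, results))
    return results
```

Note the failure mode baked into the structure: the `for` loop never revisits `steps`, so an unexpected tool result cannot change the plan.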

Pattern 3 — Plan-and-Execute with Replanning

An extension of Plan-and-Execute in which the agent re-evaluates the plan after each step based on actual results.

| Dimension | Value |
| --- | --- |
| Mechanism | Plan → execute step → re-evaluate plan → execute next → … |
| Planning horizon | Dynamic — plan adapts to reality |
| Tool integration | Same as Plan-and-Execute |
| Max effective steps | 15-50 steps (replanning prevents derailment) |
| Cost per task | Highest (replanning at each step) |
| Latency | Highest (replanning adds latency per step) |
| Failure mode | Over-replanning (plan changes every step, never converges) |
| Best for | Complex tasks with uncertain outcomes or external dependencies |
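Replanning changes the control flow: the remaining plan is revised after every step. In this sketch, `initial_plan`, `execute_step`, and `revise` are hypothetical stubs; the stub replanner inserts one extra diagnostic step to show a plan adapting, and the step cap guards against the over-replanning failure mode.

```python
def initial_plan(task):
    """Stub planner."""
    return ["reproduce bug", "read logs", "apply fix"]

def execute_step(step):
    """Stub executor."""
    return f"result of {step}"

def revise(task, done, remaining):
    """Stub replanner: a real version asks the LLM to revise the remaining
    steps given actual results. Here, reading the logs reveals a new step."""
    if done and done[-1][0] == "read logs" and "inspect config" not in remaining:
        return ["inspect config"] + remaining
    return remaining

def plan_and_replan(task, max_steps=50):
    remaining = initial_plan(task)
    done = []
    while remaining and len(done) < max_steps:   # cap bounds over-replanning
        step = remaining.pop(0)
        done.append((step, execute_step(step)))
        remaining = revise(task, done, remaining)  # replanning call per step
    return done
```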

Pattern 4 — LLM Compiler (Parallel Execution)

The model identifies independent steps that can run in parallel, creates an execution DAG, and runs non-dependent steps concurrently.

| Dimension | Value |
| --- | --- |
| Mechanism | Dependency analysis → parallel execution of independent steps |
| Planning horizon | Full task with dependency graph |
| Tool integration | Multiple concurrent tool calls |
| Max effective steps | 10-20 (parallelism reduces effective depth) |
| Cost per task | Medium (fewer sequential LLM calls due to parallelism) |
| Latency | Lowest for multi-step tasks (parallel execution) |
| Failure mode | Incorrect dependency analysis; one failure blocks downstream |
| Best for | Tasks with independent subtasks (research, data gathering) |
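The execution engine beneath this pattern is a DAG scheduler: run every step whose dependencies are satisfied, in parallel, until the graph is exhausted. In this sketch the DAG is hand-written; in a real system the LLM emits it during the dependency-analysis phase.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks, deps):
    """tasks: name -> callable(results); deps: name -> set of prerequisites."""
    results, pending = {}, dict(deps)
    with ThreadPoolExecutor() as pool:
        while pending:
            # Steps whose prerequisites have all produced results
            ready = [n for n, d in pending.items() if d <= results.keys()]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
            futures = {n: pool.submit(tasks[n], results) for n in ready}
            for n, f in futures.items():
                results[n] = f.result()   # a failure here blocks downstream
                del pending[n]
    return results
```

Usage: with `deps = {"a": set(), "b": set(), "c": {"a", "b"}}`, steps `a` and `b` run concurrently and `c` runs once both have results, which is exactly where the latency win comes from.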

Head-to-Head Comparison

| Dimension | ReAct | Plan-Execute | Plan-Replan | LLM Compiler |
| --- | --- | --- | --- | --- |
| Task completion rate (simple) | 85-92% | 88-95% | 90-96% | 85-93% |
| Task completion rate (complex) | 50-65% | 65-78% | 72-85% | 60-75% |
| Avg cost per task (GPT-4o) | $0.05-0.30 | $0.10-0.50 | $0.15-0.80 | $0.08-0.40 |
| Avg latency (simple task) | 5-15s | 8-20s | 10-25s | 4-12s |
| Avg latency (complex task) | 30-120s | 20-60s | 25-90s | 15-45s |
| Reliability | Medium | Medium-high | High | Medium |
| Implementation complexity | Low | Medium | High | High |
| Debugging difficulty | Low (linear trace) | Medium | High (plan history) | High (parallel traces) |
| Recovery from tool failure | Good (retry next step) | Poor (plan assumes success) | Good (replan around failure) | Medium (must rebuild DAG) |

Tool Integration Patterns

Agents are only as capable as their tools. The tool design determines whether the agent succeeds or loops.

Tool Design Principles

| Principle | What it means | Why it matters | Violation consequence |
| --- | --- | --- | --- |
| Single responsibility | Each tool does one thing | Agent can compose tools; doesn’t need to understand complex tool behavior | Agent misuses multi-function tools, passes wrong parameters |
| Descriptive naming | Tool name describes the action (“search_database”, not “tool_3”) | Agent selects tools based on name + description match | Agent calls wrong tools; selection accuracy drops 15-25% |
| Typed parameters | Strict parameter schema with descriptions | Agent generates valid parameters more reliably | Invalid tool calls, JSON schema violations |
| Idempotent when possible | Same call twice produces same result | Safe retries; agent can re-call tools without side effects | Duplicate actions (double-send email, double-charge) |
| Error messages in natural language | “User not found: [email protected]”, not “Error 404” | Agent can reason about failures and adjust strategy | Agent can’t interpret error; loops or hallucinates fix |
| Bounded output | Tool returns structured, bounded output | Prevents context window overflow; agent can parse results | Long tool outputs consume context, degrade reasoning |

Tool Count vs. Agent Performance

| Number of tools | Selection accuracy | Task completion rate | Notes |
| --- | --- | --- | --- |
| 1-5 | 95-98% | 85-92% | Agent can reliably choose from small set |
| 6-10 | 88-94% | 78-88% | Selection accuracy starts to degrade |
| 11-20 | 78-88% | 70-82% | Need clear tool descriptions; ambiguity causes mis-selection |
| 21-50 | 65-80% | 55-72% | Retrieval-based tool selection recommended |
| 50+ | 45-65% | 40-58% | Must use tool retrieval; agent can’t reason about full set |

The practical limit: Without retrieval-augmented tool selection, agents reliably handle 5-10 tools. Beyond that, provide a tool-retrieval step where the agent first searches for relevant tools based on the current subtask, then selects from the filtered set.
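A tool-retrieval step can be sketched as below. Production systems usually rank by embedding similarity between the subtask and each tool description; plain word overlap keeps this sketch self-contained, and the function name is an assumption.

```python
def retrieve_tools(subtask, tools, k=5):
    """Return up to k tool names whose descriptions best match the subtask.
    tools: name -> description. Stand-in for embedding-based retrieval."""
    words = set(subtask.lower().split())
    scored = [(len(words & set(desc.lower().split())), name)
              for name, desc in tools.items()]
    scored.sort(reverse=True)                    # highest overlap first
    return [name for score, name in scored[:k] if score > 0]
```

The agent then selects only from this filtered set, keeping the effective tool count inside the reliable 5-10 range regardless of how many tools exist overall.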

Memory Architectures

Agents need memory to maintain context across steps and sessions. Three memory types serve different needs:

| Memory type | What it stores | Persistence | Implementation | Use case |
| --- | --- | --- | --- | --- |
| Working memory | Current task context, intermediate results | Within task only | System prompt + conversation history | Multi-step task execution |
| Short-term memory | Recent conversation turns, user preferences | Within session | Sliding window of recent messages | Chat agents, session context |
| Long-term memory | User profiles, past interactions, learned patterns | Across sessions | Vector DB or key-value store | Personalization, knowledge accumulation |

Memory Management Strategies

| Strategy | How it works | Context savings | Information loss | Best for |
| --- | --- | --- | --- | --- |
| Full history | Keep all messages in context | 0% | 0% | Short tasks (<10 steps) |
| Sliding window | Keep last N messages | 50-80% | Older context lost | Chat agents with recency bias |
| Summarization | LLM summarizes older context | 70-90% | Detail loss in summaries | Long conversations |
| Retrieval-augmented | Store all history, retrieve relevant parts | 80-95% | Relevance-dependent | Long-running agents with diverse topics |
| Structured state | Extract key-value pairs from conversation | 90-95% | Only structured facts retained | Task-oriented agents |
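The sliding-window strategy is the simplest to implement and worth seeing concretely: keep any system messages, drop everything but the last N turns. The message format (`role`/`content` dicts) follows the common chat-API convention.

```python
def sliding_window(messages, n=10):
    """Keep system messages plus the most recent n non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-n:]
```

This is where the "older context lost" trade-off in the table bites: anything older than the last N turns is simply gone, which is why long-running agents layer summarization or retrieval on top.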

The working memory limit: Current frontier models (GPT-4o, Claude Sonnet 4) with 128K-200K context windows can hold approximately 50-100 agent steps before quality degrades. Beyond that, summarization or retrieval-augmented memory is required.

Failure Mode Analysis

Agent failures are categorized by root cause, not symptom:

| Failure mode | Frequency | Root cause | Detection | Prevention |
| --- | --- | --- | --- | --- |
| Infinite loop | 8-15% of complex tasks | Agent repeats same action expecting different result | Step counter exceeding threshold | Max step limit (hard stop at N steps) |
| Tool misuse | 10-20% of tasks with 10+ tools | Agent calls tool with wrong parameters or wrong tool entirely | Schema validation failure; unexpected tool output | Typed schemas, clear descriptions, few-shot tool examples |
| Premature termination | 5-12% of tasks | Agent declares task complete before all steps are done | Compare output against task requirements | Completion verification step (separate LLM call) |
| Context overflow | 5-10% of long tasks | Too many steps exhaust context window | Token count monitoring | Summarization, structured state extraction |
| Hallucinated tool result | 3-8% of tasks | Agent “invents” a tool result instead of actually calling the tool | Verify tool call logs match agent’s claimed results | Strict tool-call-before-result enforcement |
| Cost explosion | 2-5% of tasks | Agent makes unnecessary tool calls or generates excessive reasoning | Real-time cost monitoring | Cost budget per task with hard cutoff |
| Irreversible action error | 1-3% of tasks with write tools | Agent performs wrong write action (wrong email, wrong deletion) | Post-action verification (often too late) | Confirmation step before any write/send action |
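The most common failure, the infinite loop, has a cheap detector: count identical tool calls and abort when the same call repeats. This is a sketch of that detection logic; the class name and threshold are illustrative.

```python
from collections import Counter

class LoopDetector:
    """Flags when the agent issues the same tool call with the same
    arguments more than `limit` times, the signature of an infinite loop."""

    def __init__(self, limit=3):
        self.counts = Counter()
        self.limit = limit

    def record(self, tool, args):
        key = (tool, tuple(sorted(args.items())))   # hashable call signature
        self.counts[key] += 1
        return self.counts[key] > self.limit        # True => abort the task
```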

Cost Control

| Mechanism | How it works | Recommended limit |
| --- | --- | --- |
| Max steps | Hard limit on agent loop iterations | 15-30 steps for most tasks |
| Cost budget | Track cumulative token cost; stop if exceeded | $1-5 per task for standard agents |
| Time budget | Stop if task exceeds time limit | 5-10 minutes for interactive; 30 min for background |
| Tool call budget | Limit total tool invocations | 20-50 calls per task |
| Human-in-the-loop | Require approval for high-risk actions | Any write/send/delete action |
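The first three mechanisms compose naturally into one guard object checked on every agent step. This is a sketch; the per-token prices are illustrative placeholders, not current list prices.

```python
import time

class BudgetGuard:
    """Raises once any of the step, cost, or time budgets is exceeded."""

    def __init__(self, max_steps=30, max_cost=2.0, max_seconds=300):
        self.steps, self.cost = 0, 0.0
        self.max_steps, self.max_cost = max_steps, max_cost
        self.deadline = time.monotonic() + max_seconds

    def charge(self, input_tokens, output_tokens,
               in_price=2.5e-6, out_price=1e-5):   # $/token, illustrative
        self.steps += 1
        self.cost += input_tokens * in_price + output_tokens * out_price
        if (self.steps > self.max_steps or self.cost > self.max_cost
                or time.monotonic() > self.deadline):
            raise RuntimeError(
                f"budget exceeded after {self.steps} steps, ${self.cost:.2f}")
```

Call `guard.charge(...)` after every LLM response; the exception is the hard cutoff, which is the point: a soft warning does not stop a runaway agent.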

Architecture Decision Matrix

| Your requirements | Recommended pattern | Why |
| --- | --- | --- |
| Simple Q&A with tool use (search, calculate) | ReAct | Low complexity; sequential reasoning sufficient |
| Multi-step research (gather data from multiple sources) | LLM Compiler | Independent searches can run in parallel |
| Complex task with clear steps (build report, analyze dataset) | Plan-and-Execute | Upfront planning structures the work |
| Complex task with uncertain outcomes (debugging, troubleshooting) | Plan-and-Replan | Replanning adapts to unexpected results |
| Cost-sensitive, high volume | ReAct | Lowest per-task cost; minimal overhead |
| Latency-sensitive | LLM Compiler | Parallelism reduces wall-clock time |
| Reliability-critical | Plan-and-Replan | Highest task completion rate on complex tasks |
| 50+ tools available | Any + tool retrieval layer | Tool retrieval is orthogonal to agent architecture |

How to Apply This

Use the token-counter tool to estimate per-step costs — each reasoning step consumes input tokens (conversation history) plus output tokens (reasoning + tool call). Multiply by average steps per task to project per-task cost.
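That multiplication can be made concrete. In this sketch the token counts and per-token prices are illustrative assumptions; substitute your own measurements. The key detail it captures is that input cost grows each step because the conversation history accumulates.

```python
def project_task_cost(steps, tokens_per_step=500, base_context=1000,
                      in_price=2.5e-6, out_price=1e-5):
    """Estimate per-task cost: each step pays for the growing history
    (input) plus reasoning and the tool call (output). Prices illustrative."""
    cost, context = 0.0, base_context
    for _ in range(steps):
        cost += context * in_price + tokens_per_step * out_price
        context += tokens_per_step          # history grows every step
    return cost
```

Because the history term compounds, cost grows faster than linearly in step count, one more reason long agent runs need summarization and budgets.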

Start with ReAct for your first agent. It’s the simplest architecture, easiest to debug, and sufficient for 80% of agent use cases. Graduate to Plan-and-Execute only when ReAct’s greedy planning fails on your specific tasks.

Implement cost controls from day one. Every production agent needs a max-step limit, cost budget, and time budget. An unconstrained agent on GPT-4o can spend $50+ on a single runaway task.

Design tools for agent consumption, not human consumption. Tools that return “Error 500” work for humans who can interpret the context. Agents need “Failed to create user: email address already exists in the system” to reason about next steps.

Add a completion verifier. A separate LLM call that checks “Did the agent actually complete all parts of the task?” catches premature termination — the most subtle and common failure mode.
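The verifier's shape can be sketched as below. A real verifier is a second LLM call asked "did the output satisfy every requirement?"; the keyword check here is a self-contained stand-in, and the function name and requirement format are assumptions.

```python
def verify_completion(requirements, output):
    """Stand-in completion check: a real version would be a separate LLM
    call judging whether `output` satisfies each requirement."""
    missing = [r for r in requirements if r.lower() not in output.lower()]
    return len(missing) == 0, missing   # (complete?, unmet requirements)
```

If verification fails, feed the missing requirements back to the agent as a follow-up task rather than returning the partial result to the user.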

Honest Limitations

Task completion rates are measured on standardized benchmarks (SWE-bench, ToolBench, HotpotQA); real-world tasks with ambiguous requirements and complex tool ecosystems have lower completion rates. Cost estimates assume GPT-4o pricing; costs vary 10x across models. The “50+ tools need retrieval” finding applies to current models; larger context windows and better tool selection may shift this threshold. Agent architectures are evolving rapidly — new patterns (multi-agent, hierarchical) may supersede the patterns described here within 12-18 months. The failure mode frequencies are based on research benchmarks and production observations; your specific failure distribution depends on your tools, prompts, and use cases. Human-in-the-loop for every write action creates friction that may be unacceptable for automated workflows — the right balance depends on the cost of errors vs. the cost of delays.