AI Agent Design Patterns — Tool Use, Planning, and Memory Architectures
Agent architecture decision matrix comparing ReAct, Plan-and-Execute, and Tree-of-Thought with tool integration patterns, memory systems, and failure mode analysis for production agent systems.
Your AI Agent Works Perfectly on Simple Tasks but Falls Apart on Multi-Step Problems — The Architecture Determines the Ceiling
AI agents — LLMs that can plan, use tools, and take actions — are the most powerful and most dangerous application pattern in production AI. Powerful because a well-designed agent can complete tasks that require 10+ steps, multiple tool calls, and dynamic decision-making. Dangerous because a poorly designed agent can loop infinitely, call the wrong tools, spend $500 on a task that should cost $0.50, and take irreversible actions based on hallucinated reasoning. The architecture pattern you choose determines both the capability ceiling and the failure floor. This guide compares the major agent architectures, documents their failure modes, and provides the decision matrix for production agent systems.
The Agent Architecture Taxonomy
Pattern 1 — ReAct (Reasoning + Action)
The model alternates between reasoning (thinking about what to do) and acting (calling tools). Each cycle: observe → think → act → observe result → think → act.
| Dimension | Value |
|---|---|
| Mechanism | Interleaved thought and tool-call steps |
| Planning horizon | One step at a time (greedy) |
| Tool integration | Native — each action is a tool call |
| Max effective steps | 5-15 steps before quality degrades |
| Cost per task | Medium (reasoning tokens + tool calls) |
| Latency | Medium (sequential think-act cycles) |
| Failure mode | Loops, tool misuse, premature termination |
| Best for | Information gathering, simple multi-step tasks |
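The observe → think → act cycle above can be sketched as a single loop. This is a minimal illustration, not any particular framework's API: `decide_next` stands in for the LLM call that either picks a tool or emits a final answer, and the tool registry holds one toy calculator.

```python
# Minimal ReAct loop sketch. `decide_next` is a stub for the LLM reasoning
# call; a real agent sends `history` to the model and parses its response.

TOOLS = {
    # Toy tool: evaluate an arithmetic expression (builtins disabled).
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def decide_next(history):
    # Stubbed "think" step: if we have no observation yet, act; else answer.
    if not any(step[0] == "observe" for step in history):
        return {"action": "calculate", "input": "6 * 7"}
    return {"final": history[-1][1]}  # answer with the last observation

def react(task, max_steps=15):
    history = [("task", task)]
    for _ in range(max_steps):            # hard step cap prevents infinite loops
        decision = decide_next(history)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["action"]](decision["input"])
        history.append(("observe", result))  # feed tool result back as context
    raise RuntimeError("step limit exceeded")
```

Note the two guardrails baked in even at this scale: the step cap (the table's 5-15 step ceiling) and the requirement that every answer trace back to an actual observation.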
Pattern 2 — Plan-and-Execute
The model first creates a complete plan, then executes each step. A separate planning phase identifies all required steps before any action is taken.
| Dimension | Value |
|---|---|
| Mechanism | Plan phase → execution phase (separate LLM calls) |
| Planning horizon | Full task decomposition upfront |
| Tool integration | Each plan step maps to one or more tool calls |
| Max effective steps | 10-30 steps (plan provides structure) |
| Cost per task | Higher (planning tokens + execution tokens) |
| Latency | Higher (planning adds latency before first action) |
| Failure mode | Rigid plan can’t adapt to unexpected tool results |
| Best for | Complex, well-defined tasks with predictable tool outputs |
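The separation between the planning phase and the execution phase can be made concrete with a short sketch. `make_plan` stubs the planner LLM call; the step names and state keys are illustrative.

```python
# Plan-and-Execute sketch: one planning call produces the full step list
# up front, then each step runs in order with no replanning.

def make_plan(task):
    # Stub: a real planner asks the LLM to decompose `task` into tool steps.
    return [("search", "quarterly revenue"), ("summarize", "search_result")]

def execute_step(action, arg, state):
    # Dispatch one plan step to its tool; results accumulate in `state`.
    if action == "search":
        state["search_result"] = f"found data for {arg}"
    elif action == "summarize":
        state["summary"] = f"summary of {state[arg]}"
    return state

def plan_and_execute(task):
    state = {}
    for action, arg in make_plan(task):   # plan is frozen before execution
        state = execute_step(action, arg, state)
    return state
```

The rigidity the table warns about is visible here: if `search` returned something unexpected, `summarize` would still run against it, because the plan never gets revisited.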
Pattern 3 — Plan-and-Execute with Replanning
Extension of Plan-and-Execute where the agent re-evaluates the plan after each step based on actual results.
| Dimension | Value |
|---|---|
| Mechanism | Plan → execute step → re-evaluate plan → execute next → … |
| Planning horizon | Dynamic — plan adapts to reality |
| Tool integration | Same as Plan-and-Execute |
| Max effective steps | 15-50 steps (replanning prevents derailment) |
| Cost per task | Highest (replanning at each step) |
| Latency | Highest (replanning adds latency per step) |
| Failure mode | Over-replanning (plan changes every step, never converges) |
| Best for | Complex tasks with uncertain outcomes or external dependencies |
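The replanning variant changes one thing in the execution loop: after every step, the remaining plan is re-evaluated against the actual result. A hedged sketch, with `replan` standing in for the replanning LLM call:

```python
# Plan-and-Replan sketch: the remaining plan is revisited after each step.
# `replan` and `run_step` are stubs for LLM and tool calls respectively.

def replan(remaining, last_result):
    # Stub: a real agent asks the LLM "given this result, is the plan valid?"
    if last_result == "rate_limited":
        return [("wait", 1)] + remaining   # splice in a recovery step
    return remaining

def run_step(action, arg):
    return "ok"  # stubbed tool call

def plan_and_replan(initial_plan, max_steps=50):
    plan, trace = list(initial_plan), []
    while plan and len(trace) < max_steps:  # step cap bounds over-replanning
        action, arg = plan.pop(0)
        result = run_step(action, arg)
        trace.append((action, result))
        plan = replan(plan, result)         # plan adapts to actual results
    return trace
```

The step cap doubles as the defense against the table's failure mode: an agent that replans forever never empties its plan, but it still terminates.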
Pattern 4 — LLM Compiler (Parallel Execution)
The model identifies independent steps that can run in parallel, creates an execution DAG, and runs non-dependent steps concurrently.
| Dimension | Value |
|---|---|
| Mechanism | Dependency analysis → parallel execution of independent steps |
| Planning horizon | Full task with dependency graph |
| Tool integration | Multiple concurrent tool calls |
| Max effective steps | 10-20 (parallelism reduces effective depth) |
| Cost per task | Medium (fewer sequential LLM calls due to parallelism) |
| Latency | Lowest for multi-step tasks (parallel execution) |
| Failure mode | Incorrect dependency analysis; one failure blocks downstream |
| Best for | Tasks with independent subtasks (research, data gathering) |
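The DAG execution model can be sketched with a level-by-level scheduler: every step whose dependencies are all complete runs concurrently in the current wave. The DAG here is hand-written; in an actual LLM Compiler setup, the model emits it during dependency analysis.

```python
# LLM Compiler-style sketch: steps with no unmet dependencies run in
# parallel. Step names and the `run` callable are illustrative stand-ins.

from concurrent.futures import ThreadPoolExecutor

def run_dag(steps, deps, run):
    """steps: step names; deps: name -> set of prerequisites; run: tool callable."""
    done, results = set(), {}
    while len(done) < len(steps):
        # A step is ready when every prerequisite has completed.
        ready = [s for s in steps if s not in done and deps.get(s, set()) <= done]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        with ThreadPoolExecutor() as pool:   # run the independent wave concurrently
            for s, out in zip(ready, pool.map(run, ready)):
                results[s] = out
        done.update(ready)
    return results

# Example: two independent searches feed one synthesis step.
results = run_dag(
    ["search_a", "search_b", "synthesize"],
    {"synthesize": {"search_a", "search_b"}},
    lambda step: f"result of {step}",
)
```

The `ready` computation is also where the table's failure mode lives: if the model's dependency analysis is wrong, either a step runs before its inputs exist or the scheduler deadlocks on a phantom cycle.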
Head-to-Head Comparison
| Dimension | ReAct | Plan-Execute | Plan-Replan | LLM Compiler |
|---|---|---|---|---|
| Task completion rate (simple) | 85-92% | 88-95% | 90-96% | 85-93% |
| Task completion rate (complex) | 50-65% | 65-78% | 72-85% | 60-75% |
| Avg cost per task (GPT-4o) | $0.05-0.30 | $0.10-0.50 | $0.15-0.80 | $0.08-0.40 |
| Avg latency (simple task) | 5-15s | 8-20s | 10-25s | 4-12s |
| Avg latency (complex task) | 30-120s | 20-60s | 25-90s | 15-45s |
| Reliability | Medium | Medium-high | High | Medium |
| Implementation complexity | Low | Medium | High | High |
| Debugging difficulty | Low (linear trace) | Medium | High (plan history) | High (parallel traces) |
| Recovery from tool failure | Good (retry next step) | Poor (plan assumes success) | Good (replan around failure) | Medium (must rebuild DAG) |
Tool Integration Patterns
Agents are only as capable as their tools. The tool design determines whether the agent succeeds or loops.
Tool Design Principles
| Principle | What it means | Why it matters | Violation consequence |
|---|---|---|---|
| Single responsibility | Each tool does one thing | Agent can compose tools; doesn’t need to understand complex tool behavior | Agent misuses multi-function tools, passes wrong parameters |
| Descriptive naming | Tool name describes the action (“search_database”, not “tool_3”) | Agent selects tools based on name + description match | Agent calls wrong tools; selection accuracy drops 15-25% |
| Typed parameters | Strict parameter schema with descriptions | Agent generates valid parameters more reliably | Invalid tool calls, JSON schema violations |
| Idempotent when possible | Same call twice produces same result | Safe retries; agent can re-call tools without side effects | Duplicate actions (double-send email, double-charge) |
| Error messages in natural language | “User not found: [email protected]” not “Error 404” | Agent can reason about failures and adjust strategy | Agent can’t interpret error; loops or hallucinates fix |
| Bounded output | Tool returns structured, bounded output | Prevents context window overflow; agent can parse results | Long tool outputs consume context, degrade reasoning |
Tool Count vs. Agent Performance
| Number of tools | Selection accuracy | Task completion rate | Notes |
|---|---|---|---|
| 1-5 | 95-98% | 85-92% | Agent can reliably choose from small set |
| 6-10 | 88-94% | 78-88% | Selection accuracy starts to degrade |
| 11-20 | 78-88% | 70-82% | Need clear tool descriptions; ambiguity causes mis-selection |
| 21-50 | 65-80% | 55-72% | Retrieval-based tool selection recommended |
| 50+ | 45-65% | 40-58% | Must use tool retrieval; agent can’t reason about full set |
The practical limit: Without retrieval-augmented tool selection, agents reliably handle 5-10 tools. Beyond that, provide a tool-retrieval step where the agent first searches for relevant tools based on the current subtask, then selects from the filtered set.
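A tool-retrieval step can be sketched as scoring every tool description against the current subtask and exposing only the top-k to the agent. Production systems typically score with embeddings; plain keyword overlap keeps this sketch self-contained, and the tool names are invented:

```python
# Retrieval-based tool selection sketch: filter a large tool set down to
# the few relevant to the current subtask before the agent chooses.
# Keyword overlap stands in for embedding similarity.

def retrieve_tools(subtask, tool_descriptions, k=5):
    query = set(subtask.lower().split())
    scored = sorted(
        tool_descriptions.items(),
        # Rank by word overlap between subtask and description (descending).
        key=lambda item: -len(query & set(item[1].lower().split())),
    )
    return [name for name, _ in scored[:k]]   # the filtered set the agent sees

tools = {
    "search_flights": "search available flights between two airports",
    "send_email": "send an email message to a recipient",
    "query_crm": "query customer records in the crm database",
}
top = retrieve_tools("search flights from SFO to JFK", tools, k=1)
```

The agent then runs its normal selection step against `top` instead of the full registry, which is why this layer composes with any of the four architectures.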
Memory Architectures
Agents need memory to maintain context across steps and sessions. Three memory types serve different needs:
| Memory type | What it stores | Persistence | Implementation | Use case |
|---|---|---|---|---|
| Working memory | Current task context, intermediate results | Within task only | System prompt + conversation history | Multi-step task execution |
| Short-term memory | Recent conversation turns, user preferences | Within session | Sliding window of recent messages | Chat agents, session context |
| Long-term memory | User profiles, past interactions, learned patterns | Across sessions | Vector DB or key-value store | Personalization, knowledge accumulation |
Memory Management Strategies
| Strategy | How it works | Context savings | Information loss | Best for |
|---|---|---|---|---|
| Full history | Keep all messages in context | 0% | 0% | Short tasks (<10 steps) |
| Sliding window | Keep last N messages | 50-80% | Older context lost | Chat agents with recency bias |
| Summarization | LLM summarizes older context | 70-90% | Detail loss in summaries | Long conversations |
| Retrieval-augmented | Store all history, retrieve relevant parts | 80-95% | Relevance-dependent | Long-running agents with diverse topics |
| Structured state | Extract key-value pairs from conversation | 90-95% | Only structured facts retained | Task-oriented agents |
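Two of the strategies above combine naturally: a sliding window for recency plus structured state extraction so durable facts survive window eviction. `extract_facts` stubs what an LLM extraction call would do; the class and its rule are illustrative:

```python
# Sliding-window + structured-state sketch: keep the last N messages
# verbatim, and promote durable facts into a key-value store that outlives
# the window. The extraction rule here is a toy stand-in for an LLM call.

from collections import deque

class AgentMemory:
    def __init__(self, window=4):
        self.recent = deque(maxlen=window)   # sliding window of raw messages
        self.state = {}                      # structured, long-lived facts

    def extract_facts(self, message):
        # Stub: a real system prompts an LLM for key-value extraction.
        if "my name is" in message.lower():
            self.state["user_name"] = message.split()[-1]

    def add(self, message):
        self.extract_facts(message)
        self.recent.append(message)          # oldest message falls off at maxlen

    def context(self):
        facts = [f"{k}: {v}" for k, v in self.state.items()]
        return facts + list(self.recent)     # facts survive window eviction

mem = AgentMemory(window=2)
for msg in ["my name is Dana", "what's the weather?", "and tomorrow?"]:
    mem.add(msg)
```

After three messages with a window of two, the introduction itself has been evicted, but the extracted fact still reaches the model on every turn.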
The working memory limit: Current frontier models (GPT-4o, Claude Sonnet 4) with 128K-200K context windows can hold approximately 50-100 agent steps before quality degrades. Beyond that, summarization or retrieval-augmented memory is required.
Failure Mode Analysis
Agent failures are categorized by root cause, not symptom:
| Failure mode | Frequency | Root cause | Detection | Prevention |
|---|---|---|---|---|
| Infinite loop | 8-15% of complex tasks | Agent repeats same action expecting different result | Step counter exceeding threshold | Max step limit (hard stop at N steps) |
| Tool misuse | 10-20% of tasks with 10+ tools | Agent calls tool with wrong parameters or wrong tool entirely | Schema validation failure; unexpected tool output | Typed schemas, clear descriptions, few-shot tool examples |
| Premature termination | 5-12% of tasks | Agent declares task complete before all steps are done | Compare output against task requirements | Completion verification step (separate LLM call) |
| Context overflow | 5-10% of long tasks | Too many steps exhaust context window | Token count monitoring | Summarization, structured state extraction |
| Hallucinated tool result | 3-8% of tasks | Agent “invents” a tool result instead of actually calling the tool | Verify tool call logs match agent’s claimed results | Strict tool-call-before-result enforcement |
| Cost explosion | 2-5% of tasks | Agent makes unnecessary tool calls or generates excessive reasoning | Real-time cost monitoring | Cost budget per task with hard cutoff |
| Irreversible action error | 1-3% of tasks with write tools | Agent performs wrong write action (wrong email, wrong deletion) | Post-action verification (often too late) | Confirmation step before any write/send action |
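The most common detector in the table, the loop check, is cheap to implement: hash each (tool, arguments) pair and trip when an identical action repeats. The threshold of three is illustrative:

```python
# Infinite-loop detection sketch: an agent repeating the exact same tool
# call is almost never making progress. Used alongside a hard step cap.

def detect_loop(actions, repeat_threshold=3):
    """Return True if any identical (tool, args) pair occurs `repeat_threshold` times."""
    counts = {}
    for tool, args in actions:
        key = (tool, str(sorted(args.items())))  # canonical form of the call
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= repeat_threshold:
            return True
    return False

looping = detect_loop([
    ("search", {"q": "foo"}),
    ("search", {"q": "foo"}),
    ("search", {"q": "foo"}),   # third identical call trips the detector
])
```

A variant worth considering is normalizing arguments before hashing (lowercasing queries, rounding numbers) so trivially rephrased repeats still count as loops.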
Cost Control
| Mechanism | How it works | Recommended limit |
|---|---|---|
| Max steps | Hard limit on agent loop iterations | 15-30 steps for most tasks |
| Cost budget | Track cumulative token cost; stop if exceeded | $1-5 per task for standard agents |
| Time budget | Stop if task exceeds time limit | 5-10 minutes for interactive; 30 min for background |
| Tool call budget | Limit total tool invocations | 20-50 calls per task |
| Human-in-the-loop | Require approval for high-risk actions | Any write/send/delete action |
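The first four mechanisms in the table can live in one guard object that the agent loop charges on every step. The class and its defaults are illustrative, using the table's recommended limits:

```python
# Budget-guard sketch: enforce step, cost, time, and tool-call limits in one
# place. Defaults mirror the table's recommendations; tune per workload.

import time

class BudgetExceeded(Exception):
    pass

class BudgetGuard:
    def __init__(self, max_steps=30, max_cost=5.0,
                 max_seconds=600, max_tool_calls=50):
        self.max_steps, self.max_cost = max_steps, max_cost
        self.max_seconds, self.max_tool_calls = max_seconds, max_tool_calls
        self.steps = self.tool_calls = 0
        self.cost = 0.0
        self.started = time.monotonic()

    def charge(self, cost=0.0, tool_call=False):
        # Called once per agent step with that step's token cost.
        self.steps += 1
        self.cost += cost
        self.tool_calls += int(tool_call)
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit")
        if self.cost > self.max_cost:
            raise BudgetExceeded("cost budget")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time budget")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool call budget")

guard = BudgetGuard(max_steps=3, max_cost=0.10)
```

Raising rather than returning a flag is deliberate: a runaway agent should be stopped by the infrastructure, not politely asked to stop by a check it might ignore.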
Architecture Decision Matrix
| Your requirements | Recommended pattern | Why |
|---|---|---|
| Simple Q&A with tool use (search, calculate) | ReAct | Low complexity; sequential reasoning sufficient |
| Multi-step research (gather data from multiple sources) | LLM Compiler | Independent searches can run in parallel |
| Complex task with clear steps (build report, analyze dataset) | Plan-and-Execute | Upfront planning structures the work |
| Complex task with uncertain outcomes (debugging, troubleshooting) | Plan-and-Replan | Replanning adapts to unexpected results |
| Cost-sensitive, high volume | ReAct | Lowest per-task cost; minimal overhead |
| Latency-sensitive | LLM Compiler | Parallelism reduces wall-clock time |
| Reliability-critical | Plan-and-Replan | Highest task completion rate on complex tasks |
| 50+ tools available | Any + tool retrieval layer | Tool retrieval is orthogonal to agent architecture |
How to Apply This
Use a token counter to estimate per-step costs — each reasoning step consumes input tokens (the growing conversation history) plus output tokens (reasoning and the tool call). Multiply by average steps per task to project per-task cost.
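The projection is simple arithmetic. The prices below are assumptions for illustration ($2.50/M input, $10/M output, roughly GPT-4o-class at the time of writing); substitute your model's actual rates:

```python
# Back-of-envelope per-task cost projection. Prices are illustrative
# assumptions, not current quotes; check your provider's pricing page.

def task_cost(avg_input_tokens, avg_output_tokens, steps,
              in_price_per_m=2.50, out_price_per_m=10.00):
    per_step = (avg_input_tokens * in_price_per_m
                + avg_output_tokens * out_price_per_m) / 1_000_000
    return per_step * steps

# 4K input tokens (growing history) + 300 output tokens per step, 10 steps:
cost = task_cost(4_000, 300, 10)
```

Note the input side dominates and grows with history length, which is why the memory-management strategies above double as cost controls.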
Start with ReAct for your first agent. It’s the simplest architecture, easiest to debug, and sufficient for 80% of agent use cases. Graduate to Plan-and-Execute only when ReAct’s greedy planning fails on your specific tasks.
Implement cost controls from day one. Every production agent needs a max-step limit, cost budget, and time budget. An unconstrained agent on GPT-4o can spend $50+ on a single runaway task.
Design tools for agent consumption, not human consumption. Tools that return “Error 500” work for humans who can interpret the context. Agents need “Failed to create user: email address already exists in the system” to reason about next steps.
Add a completion verifier. A separate LLM call that checks “Did the agent actually complete all parts of the task?” catches premature termination — the most subtle and common failure mode.
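The verifier's core check can be sketched without the LLM: compare the task's required outputs against what the agent actually produced. In production the comparison itself would be a second LLM call judging semantic completeness; the structural version below is the illustrative skeleton:

```python
# Completion-verifier sketch: did the agent produce every required output?
# A production verifier would use a separate LLM call for semantic checks;
# this structural version catches the plainest premature terminations.

def verify_completion(required_outputs, agent_state):
    missing = [k for k in required_outputs if k not in agent_state]
    return {"complete": not missing, "missing": missing}

check = verify_completion(
    ["summary", "sources"],
    {"summary": "Q3 revenue grew 12%"},   # agent stopped before citing sources
)
```

When `complete` is false, the cleanest recovery is to feed `missing` back to the agent as a new subtask rather than failing the whole run.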
Honest Limitations
Task completion rates are measured on standardized benchmarks (SWE-bench, ToolBench, HotpotQA); real-world tasks with ambiguous requirements and complex tool ecosystems have lower completion rates. Cost estimates assume GPT-4o pricing; costs vary 10x across models. The “50+ tools need retrieval” finding applies to current models; larger context windows and better tool selection may shift this threshold. Agent architectures are evolving rapidly — new patterns (multi-agent, hierarchical) may supersede the patterns described here within 12-18 months. The failure mode frequencies are based on research benchmarks and production observations; your specific failure distribution depends on your tools, prompts, and use cases. Human-in-the-loop for every write action creates friction that may be unacceptable for automated workflows — the right balance depends on the cost of errors vs. the cost of delays.