Your Chatbot Forgets What the User Said 5 Messages Ago — And Your “Solution” of Stuffing Every Message into Context Costs $0.50 Per Conversation

Multi-turn conversation is the most common LLM use case and the most poorly implemented. The naive approach — concatenating the entire conversation history into each API call — works for 3 turns and breaks at 30. Context windows fill up, costs scale linearly with conversation length, and the model starts ignoring early messages buried under thousands of tokens of later context. The alternative — aggressive truncation — loses critical context and produces responses that contradict what the user said earlier. This guide provides the context management strategies, the cost data at each conversation length, and the decision framework that balances memory, cost, and quality.

The Context Window Problem

Every message in a multi-turn conversation adds to the input token count. The cost and quality implications compound:

| Conversation turn | Tokens (cumulative, typical) | Input cost per turn (GPT-4o) | Quality issue |
|---|---|---|---|
| Turn 1 | 500 (system + user) | $0.0013 | None — full context available |
| Turn 5 | 2,500 | $0.0063 | None — well within context window |
| Turn 10 | 5,000 | $0.0125 | Minimal — most models handle well |
| Turn 20 | 12,000 | $0.0300 | Beginning of “lost in the middle” effect |
| Turn 30 | 20,000 | $0.0500 | Model starts ignoring middle turns |
| Turn 50 | 35,000 | $0.0875 | Significant quality degradation on early context |
| Turn 100 | 75,000 | $0.1875 | Only recent and very first turns reliably processed |

The “lost in the middle” problem: Research confirms that LLMs pay disproportionate attention to the beginning and end of their context window. Information in the middle of a long context is recalled 10-30% less accurately than information at the boundaries. At 30+ turns, critical earlier context buried in the middle is effectively invisible.

Cost at Scale

| Conversations/day | Avg turns | Naive approach (full history) | Managed approach | Monthly savings |
|---|---|---|---|---|
| 1,000 | 10 | $375/mo | $150/mo | $225 (60%) |
| 1,000 | 20 | $900/mo | $270/mo | $630 (70%) |
| 1,000 | 50 | $2,625/mo | $525/mo | $2,100 (80%) |
| 10,000 | 20 | $9,000/mo | $2,700/mo | $6,300 (70%) |
| 10,000 | 50 | $26,250/mo | $5,250/mo | $21,000 (80%) |

Figures use GPT-4o input pricing. Claude Sonnet 4 at $3/1M input tokens would run ~20% higher. Output costs are additional.

Context Management Strategies

Strategy 1 — Sliding Window

Keep the last N messages. Discard anything older.

| Dimension | Value |
|---|---|
| Implementation | Keep last N turns (typically 5-20) |
| Token usage | Fixed ceiling regardless of conversation length |
| Quality | Good for recent context; loses all early context |
| Cost | Predictable and capped |
| Best for | Casual chat, customer support, simple Q&A |
| Worst for | Any conversation where early decisions affect later turns |
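
A minimal sliding-window sketch (function and variable names are illustrative, as is the 10-turn default):

```python
def build_messages(system_prompt, history, window=10):
    """Keep the system prompt plus only the last `window` messages.

    `history` is a chronological list of {"role": ..., "content": ...}
    dicts; anything older than the window is simply dropped.
    """
    return [{"role": "system", "content": system_prompt}, *history[-window:]]

# A 30-message conversation is trimmed to the system prompt + last 10 messages.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(30)
]
messages = build_messages("You are a helpful assistant.", history)
```

The token ceiling is fixed because `window` never grows, which is exactly why early context is unrecoverable once it scrolls out.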

Strategy 2 — Summarization

Periodically summarize older messages and include the summary instead of raw history.

| Dimension | Value |
|---|---|
| Implementation | Every N turns, an LLM summarizes the conversation so far; the summary replaces older messages |
| Token usage | Summary (200-500 tokens) + recent messages (variable) |
| Quality | Preserves key context; loses verbatim details |
| Cost | Summary LLM call adds cost, but net savings at 20+ turns |
| Best for | Long conversations where high-level context matters |
| Worst for | Conversations requiring exact recall of earlier statements |
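
One way to sketch the periodic-summarization loop. The `summarize` callable stands in for a real LLM call, and the thresholds are illustrative:

```python
def compact_history(history, running_summary, summarize, keep_recent=6, trigger=20):
    """When the history exceeds `trigger` messages, fold everything except
    the last `keep_recent` messages into the running summary."""
    if len(history) <= trigger:
        return history, running_summary
    older, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    prompt = (
        f"Existing summary:\n{running_summary or '(none)'}\n\n"
        f"New messages:\n{transcript}\n\n"
        "Rewrite the summary to cover both."
    )
    return recent, summarize(prompt)  # summarize() would call your LLM

# With a stub summarizer, a 25-message history collapses to 6 recent messages.
fake_summarize = lambda prompt: "User is planning a trip; budget is $2,000."
history = [{"role": "user", "content": f"turn {i}"} for i in range(25)]
recent, summary = compact_history(history, None, fake_summarize)
```

Updating the existing summary rather than re-summarizing the whole transcript keeps the summary call cheap at any conversation length.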

Strategy 3 — Hierarchical Memory

Multiple memory levels: system prompt (permanent instructions), conversation summary (periodic updates), recent messages (last 5-10 turns), and key facts (extracted entities).

| Dimension | Value |
|---|---|
| Implementation | System prompt + running summary + last N turns + key-value fact store |
| Token usage | Controlled: 500 (system) + 300 (summary) + 2,000 (recent) + 200 (facts) = ~3,000 |
| Quality | Best overall — preserves critical context at all levels |
| Cost | Higher implementation cost; lower per-turn token cost than full history |
| Best for | Complex multi-session interactions, agents, personal assistants |
| Worst for | Simple chatbots where the implementation complexity isn’t justified |
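
Assembling the four levels into a single request might look like this (the structure and field names are assumptions, not a fixed API):

```python
def assemble_context(system_prompt, summary, facts, recent_messages):
    """Combine permanent instructions, a running summary, a key-value
    fact store, and recent turns into one message list."""
    blocks = [system_prompt]
    if summary:
        blocks.append(f"Conversation so far (summary):\n{summary}")
    if facts:
        lines = "\n".join(f"- {key}: {value}" for key, value in facts.items())
        blocks.append(f"Known facts about this user:\n{lines}")
    return [{"role": "system", "content": "\n\n".join(blocks)}, *recent_messages]

messages = assemble_context(
    "You are a travel assistant.",
    "User is planning a two-week Japan trip in April.",
    {"budget": "$2,000", "departure_city": "Chicago"},
    [{"role": "user", "content": "Which cities should I prioritize?"}],
)
```

Folding the summary and facts into the system message keeps them at the context boundary, where attention is strongest.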

Strategy 4 — Retrieval-Augmented Conversation

Store all messages in a vector database. Retrieve relevant past messages based on the current query.

| Dimension | Value |
|---|---|
| Implementation | Embed each message; retrieve top-k relevant messages per new query |
| Token usage | System prompt + retrieved context (k × avg message size) + recent messages |
| Quality | Excellent recall of relevant context; may miss conversational flow |
| Cost | Embedding + retrieval cost per turn; lower LLM input cost |
| Best for | Very long conversations (100+ turns), cross-session memory |
| Worst for | Short conversations where retrieval overhead exceeds the benefit |
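
A toy version of per-query retrieval, with word-overlap scoring standing in for real embedding similarity:

```python
def retrieve_relevant(query, archive, k=3):
    """Return the k archived messages sharing the most words with the query.

    A production system would use embeddings plus a vector index; lexical
    overlap is used here only to keep the sketch self-contained.
    """
    query_words = set(query.lower().split())

    def overlap(message):
        return len(query_words & set(message["content"].lower().split()))

    return sorted(archive, key=overlap, reverse=True)[:k]

archive = [
    {"role": "user", "content": "my budget for the japan trip is 2000 dollars"},
    {"role": "user", "content": "i prefer window seats on flights"},
    {"role": "user", "content": "what is the weather like in april"},
]
hits = retrieve_relevant("remind me of my trip budget", archive, k=1)
```

The retrieved messages are then spliced into the prompt alongside the recent turns, as in the token-budget table below.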

Strategy Comparison

| Dimension | Full history | Sliding window | Summarization | Hierarchical | Retrieval-augmented |
|---|---|---|---|---|---|
| Context quality (turns 1-10) | Excellent | Excellent | Excellent | Excellent | Excellent |
| Context quality (turns 20-50) | Degraded (lost in middle) | Poor (early context lost) | Good (summary preserves) | Good | Good (retrieves relevant) |
| Context quality (turns 100+) | Fails (exceeds context window) | Poor | Good | Good | Best |
| Cost per turn (at turn 50) | $0.09 | $0.02 | $0.03 | $0.03 | $0.025 |
| Implementation complexity | Trivial | Low | Medium | High | High |
| Latency impact | None | None | +200-500ms per summary | +200ms per turn | +100-300ms per retrieval |
| Max conversation length | Limited by context window | Unlimited | Unlimited | Unlimited | Unlimited |

Token Budget Allocation

For a managed context window, allocate tokens deliberately:

| Component | Token budget | Priority | What happens if reduced |
|---|---|---|---|
| System prompt | 500-2,000 | Non-negotiable | Model loses core behavior and constraints |
| Key facts / extracted state | 100-500 | High | Model forgets user preferences, prior decisions |
| Conversation summary | 200-500 | High | Model loses awareness of the conversation arc |
| Recent messages | 2,000-8,000 | Medium | Model loses immediate conversational flow |
| Retrieved context | 500-2,000 | Medium | Model can’t reference specific past exchanges |
| Remaining for generation | 500-4,000 | Output budget | Shorter responses, potential truncation |

Total managed budget: 4,000-17,000 tokens per turn, regardless of conversation length. Compare that to the naive full-history approach, which grows to 50,000-100,000 tokens by turn 50.

The Allocation Decision Tree

| Conversation length | System prompt | Summary | Recent turns | Facts | Total budget |
|---|---|---|---|---|---|
| 1-5 turns | 1,000 | None needed | All messages | None needed | ~3,500 |
| 5-15 turns | 1,000 | None needed | Last 10 turns | 200 | ~6,200 |
| 15-30 turns | 1,000 | 300 | Last 8 turns | 300 | ~5,600 |
| 30-100 turns | 1,000 | 500 | Last 5 turns | 500 | ~4,500 |
| 100+ turns | 1,000 | 500 | Last 5 turns + 3 retrieved | 500 | ~5,500 |
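
The decision tree encodes directly as a lookup (budgets are in tokens; `recent=None` means keep all messages; all names here are illustrative):

```python
def context_config(turn_count):
    """Map conversation length to a context-management configuration."""
    if turn_count <= 5:
        return {"system": 1000, "summary": 0, "recent": None, "facts": 0, "retrieved": 0}
    if turn_count <= 15:
        return {"system": 1000, "summary": 0, "recent": 10, "facts": 200, "retrieved": 0}
    if turn_count <= 30:
        return {"system": 1000, "summary": 300, "recent": 8, "facts": 300, "retrieved": 0}
    if turn_count <= 100:
        return {"system": 1000, "summary": 500, "recent": 5, "facts": 500, "retrieved": 0}
    # 100+ turns: add retrieval on top of summary + recent turns + facts
    return {"system": 1000, "summary": 500, "recent": 5, "facts": 500, "retrieved": 3}
```

Calling this per turn lets one conversation graduate through the strategies as it grows, instead of committing to one strategy up front.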

Conversation Reset Patterns

Not all conversations should persist. Strategic resets improve quality and reduce cost.

| Reset trigger | When to apply | How to implement | What to preserve |
|---|---|---|---|
| Topic change | User switches to an entirely new subject | Detect topic shift → archive → start fresh with user profile | User profile, preferences |
| Quality degradation | Model responses become repetitive or inconsistent | Detect via quality score → summarize → reset context | Conversation summary, key facts |
| Token budget exceeded | Context approaching model limit | Automatic summarization trigger | Summary + recent + facts |
| Session timeout | User returns after extended absence | Summarize previous session → start new with summary | Session summary, user state |
| Explicit user request | User says “start over” or “new topic” | Clear context; optionally retain user profile | User profile only |
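
A soft reset might be sketched like this (the `summarize` callable stands in for an LLM call; the state shape is an assumption):

```python
def soft_reset(state, summarize):
    """Archive the old history but carry a summary and extracted facts
    into the fresh conversation, so the model doesn't start amnesiac."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in state["history"])
    return {
        "history": [],                      # fresh message list
        "summary": summarize(transcript),   # compressed prior context
        "facts": state.get("facts", {}),    # user profile survives the reset
    }

state = {
    "history": [{"role": "user", "content": "I live in Chicago."}],
    "facts": {"home_city": "Chicago"},
}
new_state = soft_reset(state, lambda transcript: "User lives in Chicago.")
```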

Reset Quality Impact

| Reset strategy | Quality immediately after reset | Quality 5 turns later | User satisfaction |
|---|---|---|---|
| Hard reset (clear everything) | Low — model has no context | Medium — rebuilds from new messages | Low — feels like talking to an amnesiac |
| Soft reset (keep summary + facts) | Medium — has general context | High — summary provides a foundation | High — feels natural |
| No reset (let context degrade) | Decreasing over time | Poor — contradictions and repetition | Low — feels like the model is confused |

Conversation Quality Metrics

| Metric | What it measures | How to collect | Target |
|---|---|---|---|
| Context consistency | Does the response contradict earlier messages? | LLM-as-judge on sampled conversations | <3% contradiction rate |
| Reference accuracy | When the user references an earlier message, does the model recall it correctly? | Test with planted references at turns 10, 20, 50 | >90% recall at turn 20 |
| Topic coherence | Does the response stay on the conversation topic? | Topic classifier on responses | >95% on-topic |
| Repetition rate | Does the model repeat information already given? | Semantic similarity between responses in the same conversation | <5% repetition score |
| Turn efficiency | How many turns to complete the user’s task? | Count turns to task completion | Decreasing over time (the model learns the user’s patterns) |
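
Reference accuracy can be measured with a planted-fact harness along these lines (the `chat` callable stands in for your chatbot; all names and the echo stub are illustrative):

```python
def reference_accuracy(chat, planted_fact, probe, filler_turns, answer):
    """Plant a fact at turn 1, run filler turns, then probe recall.

    Returns True if the model's final response contains the expected answer.
    """
    history = [{"role": "user", "content": planted_fact}]
    history.append({"role": "assistant", "content": chat(history)})
    for i in range(filler_turns):
        history.append({"role": "user", "content": f"filler question {i}"})
        history.append({"role": "assistant", "content": chat(history)})
    history.append({"role": "user", "content": probe})
    return answer.lower() in chat(history).lower()

# With a stub bot that always echoes the first user message, recall succeeds.
echo_bot = lambda history: history[0]["content"]
ok = reference_accuracy(
    echo_bot, "My cat is named Mochi.", "What is my cat's name?", 18, "Mochi"
)
```

Run the same probe at several depths (e.g. 10, 20, and 50 filler turns) and track the recall rate per depth as your diagnostic metric.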

How to Apply This

Use the token-counter tool to measure your actual conversation token usage — this determines when context management becomes necessary and which strategy provides the most savings.

Start with sliding window (last 10 turns) for most applications. It’s the simplest strategy and sufficient for 80% of chatbot use cases. Add summarization only when users report the model “forgetting” earlier context.

Implement token budget allocation once conversations regularly exceed 15 turns. The allocation framework prevents the “gradually then suddenly” cost and quality failure that hits naive full-history implementations.

Measure reference accuracy at turn 20 and turn 50. This is the most diagnostic quality metric for context management. If your model can’t recall what the user said at turn 5 when asked at turn 20, your strategy needs upgrading.

Reset proactively on topic change. A soft reset (summarize + clear) on topic change improves both quality and cost. Users don’t expect a cooking chatbot to remember they asked about tax law 30 turns ago.

Honest Limitations

The “lost in the middle” effect varies by model and context length — newer models with longer training on long contexts show less degradation, but the effect is not eliminated. Cost calculations assume uniform message lengths; real conversations have high variance in per-turn token count. Summarization quality depends on the summarizer model — using a cheap model for summarization can lose critical details. Retrieval-augmented conversation memory adds embedding and search latency that may be unacceptable for real-time chat. The hierarchical memory pattern requires significant engineering investment — it’s not justified for simple chatbots. Token budget allocation assumes you can predict response length; in practice, models may generate shorter or longer responses than budgeted. These strategies focus on text conversations; multimodal conversations (images, documents, audio) have different context management challenges.