LLM Cost-Per-Query Optimization — Per-Query Cost Decomposition, Model-Routing Economics, Semantic-Cache ROI Math, Tiered-Architecture Breakpoint Analysis, Prompt-Compression Savings Table, and the Per-Decision Financial Model That Separates Real Wins From Engineering Traps
LLM cost-per-query optimization decision framework with per-query cost decomposition (input tokens + output tokens + retrieval + caching + fallback retries), model-routing economics across GPT-4o + Claude Sonnet + Haiku + Gemini tiers with per-task quality-cost ratios, semantic-cache hit-rate ROI math (break-even cache size + hit-rate threshold), tiered-architecture breakpoint analysis (when to upgrade from flat-model to routed architecture), prompt-compression techniques ranked by savings-per-engineering-hour, and the per-decision financial model that separates 30%-savings-2-hour wins from 5%-savings-40-hour traps.
Your AI Feature Cost $800 Last Month and $4,200 This Month Despite No Usage Growth, the Finance Team Wants a 50% Reduction, and Every Optimization Article Tells You to “Use a Smaller Model” — Here’s the Per-Query Cost Decomposition That Shows Which Lever Returns 30% for 2 Hours of Work Versus Which Returns 5% for 40 Hours
LLM cost is not a single number. It decomposes into input-token cost, output-token cost, retrieval/embedding cost, fallback-retry cost, and caching infrastructure cost — each with different optimization levers, different engineering effort to realize, and different quality trade-offs. The common failure pattern is optimizing the easiest lever first (switching to a smaller model) without measuring which lever actually drives the cost. Teams cut model size, quality drops, users complain, they revert — and the real cost driver (3× unnecessary retrieval calls per query) sits untouched. This guide builds the per-query cost decomposition that tells you which lever to pull first, the model-routing economics for picking the right model for the right task, and the semantic-cache ROI math that tells you whether caching pays back or just adds complexity.
The Per-Query Cost Decomposition
Every query has a cost anatomy. Measure this before optimizing:
| Cost component | How it accumulates | Typical share of total cost | Optimization levers | Engineering effort |
|---|---|---|---|---|
| Input tokens | System prompt + RAG context + conversation history + user query | 30-55% | Prompt compression, context window management, conversation truncation, retrieval chunk limits | Low-to-medium (1-8 hours per lever) |
| Output tokens | Generated response length | 35-55% | max_tokens caps, structured output (JSON schema), stop sequences, terseness prompting | Very low (1-2 hours) |
| Retrieval/embedding | Embedding generation + vector-DB query | 3-12% | Cache embeddings, reduce chunk count, hybrid retrieval with BM25 prefilter | Medium (4-16 hours) |
| Fallback retries | Failed responses re-run on stronger model | 2-15% (highly variable) | Better prompt engineering, structured validation with regeneration budget cap | Medium-high |
| Caching infrastructure | Cache storage + lookup compute | 1-5% (if implemented) | Exact-match + semantic cache tiering; TTL tuning | Medium-high |
| Judge/evaluation calls | Quality gates and structured checks | 2-10% | Deterministic validators first, LLM-judge only on ambiguous outputs | Medium |
The 80/20 of LLM cost: Output tokens + input tokens usually account for 75-90% of total cost. If you’re optimizing retrieval caching before you’ve capped output tokens with max_tokens + trimmed system prompt bloat, you’re optimizing the wrong lever.
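A minimal decomposition sketch of the table above. The per-1M-token prices are illustrative placeholders; substitute your provider's current rates, and feed in per-request telemetry rather than hand-entered numbers:

```python
from dataclasses import dataclass

# Illustrative per-1M-token prices -- substitute your provider's current rates.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

@dataclass
class QueryCost:
    input_tokens: int
    output_tokens: int
    retrieval_cost: float = 0.0  # embedding + vector-DB query, USD
    retry_cost: float = 0.0      # re-runs on fallback model, USD
    cache_cost: float = 0.0      # amortized cache lookup/storage, USD

    def components(self) -> dict:
        """Per-query cost broken into the table's components, in USD."""
        return {
            "input": self.input_tokens / 1e6 * INPUT_PRICE_PER_M,
            "output": self.output_tokens / 1e6 * OUTPUT_PRICE_PER_M,
            "retrieval": self.retrieval_cost,
            "retries": self.retry_cost,
            "cache": self.cache_cost,
        }

    def dominant_lever(self) -> str:
        """Name the largest component -- the lever to pull first."""
        comps = self.components()
        return max(comps, key=comps.get)
```

Aggregating `components()` across a day of production traffic tells you whether input tokens, output tokens, or retries dominate before you touch any optimization.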
Model-Routing Economics
Not every query needs the same model. The routing economics, by task profile:
| Task profile | Best-fit model (Q1 2026 pricing) | Per-1M input / output cost | When routing wins | When flat-model wins |
|---|---|---|---|---|
| Simple classification | GPT-4o-mini / Claude Haiku | $0.15 / $0.60 — $0.80 / $4.00 | Query volume > 50K/month; quality differential on classification < 3% | Volume < 5K/month — routing infrastructure doesn’t amortize |
| Extraction / structured output | GPT-4o-mini / Claude Haiku | $0.15 / $0.60 — $0.80 / $4.00 | Structured schema well-defined; quality parity achievable | Very novel schemas where strong model reduces retry rate |
| Summarization | Claude Haiku / Gemini Flash | $0.80 / $4.00 — $0.075 / $0.30 | Long-context documents; cost differential > 3× | Critical summaries where comprehension gap matters |
| RAG answer generation | GPT-4o / Claude Sonnet | $2.50 / $10.00 — $3.00 / $15.00 | Consistent-domain queries amenable to smaller models | Open-domain / multi-hop queries — fallback rate too high on small models |
| Multi-step reasoning | GPT-4o / Claude Sonnet / o-series reasoning | $2.50 / $10.00 — $3.00 / $15.00 — $15.00 / $60.00 | Reasoning depth predictable per task type | Variable-depth reasoning — router can’t classify reliably |
| Code generation | Claude Sonnet / GPT-4o | $3.00 / $15.00 — $2.50 / $10.00 | Scoped-task generation (unit tests, boilerplate) | Open-ended architectural code — quality gap costly |
| Creative long-form | Claude Sonnet / GPT-4o | $3.00 / $15.00 — $2.50 / $10.00 | Usually flat-model wins — routing doesn’t amortize | Default case |
Router-economics break-even: Building a router requires a classifier (another LLM call or a trained model) that costs tokens on every request. The routing-classifier cost is justified when the cost-savings-from-downgrade × queries-downgraded exceeds routing-classifier-cost × all-queries. For a 40%-downgrade-rate scenario with 10× cost differential, the break-even is ~5K queries/month — below that volume, the routing overhead eats the savings.
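The break-even arithmetic above can be sketched directly. All parameter values in the example are illustrative, and the function folds in a one-time build cost amortized over a payback window:

```python
def routing_breakeven_volume(
    flagship_cost: float,    # per-query cost on the strong model, USD
    downgrade_ratio: float,  # cheap-model cost as a fraction of flagship (0.1 = 10x cheaper)
    downgrade_rate: float,   # fraction of queries the router sends to the cheap model
    classifier_cost: float,  # routing-classifier cost per query, USD
    build_cost: float,       # one-time engineering cost to build the router, USD
    payback_months: int = 12,
) -> float:
    """Monthly query volume at which routing pays back within `payback_months`."""
    # Net per-query saving: downgrade savings minus the classifier tax paid on every query.
    per_query_saving = downgrade_rate * flagship_cost * (1 - downgrade_ratio) - classifier_cost
    if per_query_saving <= 0:
        return float("inf")  # routing never pays back at these parameters
    return build_cost / (per_query_saving * payback_months)
```

With a $0.05 flagship query, a 10× cheaper small model, a 40% downgrade rate, a $0.001/query classifier, and ~$1,200 of build cost, the break-even lands in the mid-thousands of queries per month, consistent with the ~5K heuristic above.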
Router Implementation Patterns
| Routing approach | Implementation effort | Accuracy on task classification | Per-query routing overhead |
|---|---|---|---|
| Rule-based (keyword + length) | 4-8 hours | 60-75% accuracy | $0 (no LLM call) |
| Embedding-based classifier (pre-trained) | 8-16 hours | 75-85% accuracy | $0.0001-0.0005 per query |
| Small-LLM classifier (Haiku / Flash) | 16-24 hours | 85-93% accuracy | $0.001-0.003 per query |
| Fine-tuned classifier | 40-80 hours + labeled data | 90-97% accuracy | $0.0005-0.002 per query |
Practical starting point: Embedding-based classifier with rule-based fallback. Gets you 80-85% routing accuracy at negligible per-query cost with a weekend of engineering.
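A toy sketch of that nearest-exemplar routing idea. Bag-of-words cosine similarity stands in for a real embedding model (e.g. a sentence-transformers encoder); the exemplar queries, tier names, and threshold are all illustrative, and the rule-based fallback is "default to the safe tier when nothing matches":

```python
import re
from collections import Counter
from math import sqrt

# Toy stand-in for a real embedding model: bag-of-words vectors.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Per-tier exemplar queries; in production these would be precomputed embedding centroids.
TIER_EXEMPLARS = {
    "mini": ["classify the sentiment of this review", "extract the invoice date"],
    "flagship": ["explain the trade-offs between these two architectures step by step"],
}

def route(query: str, threshold: float = 0.2) -> str:
    """Nearest-exemplar routing with a rule-based fallback to the safe tier."""
    best_tier, best_sim = "flagship", 0.0
    q = embed(query)
    for tier, examples in TIER_EXEMPLARS.items():
        for ex in examples:
            sim = cosine(q, embed(ex))
            if sim > best_sim:
                best_tier, best_sim = tier, sim
    # Below-threshold similarity means the classifier is guessing -- take the safe tier.
    return best_tier if best_sim >= threshold else "flagship"
```

Swapping `embed` for a real encoder and the exemplar lists for labeled production queries is the weekend-of-engineering version described above.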
Semantic-Cache ROI Math
Caching is the highest-impact optimization for systems with query-repetition patterns — and the worst waste of time for systems without them. The ROI math:
| Variable | Definition | Typical range |
|---|---|---|
| Cache hit rate (H) | Fraction of queries that match a cached entry within similarity threshold | 5% (low-repetition) to 60% (FAQ-style traffic) |
| Per-query cost (C_q) | Cost of a full LLM call | $0.005-$0.15 |
| Per-lookup cost (C_l) | Cost of semantic cache lookup (embedding + vector search) | $0.0001-$0.001 |
| Cache infrastructure cost (I) | Monthly cost of cache storage + compute | $20-$500 |
| Monthly query volume (V) | Queries per month | Varies |
Break-even formula:
Monthly savings = V × H × C_q - V × C_l - I
Break-even H = (V × C_l + I) / (V × C_q)
For V = 100,000 queries/month, C_q = $0.05, C_l = $0.0005, I = $100:
- Required hit rate for break-even: (100,000 × $0.0005 + $100) / (100,000 × $0.05) = $150 / $5,000 = 3%
- At 30% hit rate, savings = 100,000 × 0.30 × $0.05 − 100,000 × $0.0005 − $100 = $1,500 − $50 − $100 = $1,350/month
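The two formulas in code, reproducing the worked numbers above:

```python
def cache_roi(volume: int, hit_rate: float, per_query_cost: float,
              lookup_cost: float, infra_cost: float) -> tuple[float, float]:
    """Monthly savings and break-even hit rate for a semantic cache.

    Every query pays the lookup cost (hits and misses alike); only hits
    avoid the full LLM call.
    """
    savings = volume * hit_rate * per_query_cost - volume * lookup_cost - infra_cost
    breakeven_h = (volume * lookup_cost + infra_cost) / (volume * per_query_cost)
    return savings, breakeven_h
```

Running it with the example's inputs (V = 100K, C_q = $0.05, C_l = $0.0005, I = $100, H = 30%) returns $1,350/month savings against a 3% break-even hit rate.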
Hit-rate-threshold reality:
| Traffic pattern | Typical hit rate | Caching recommendation |
|---|---|---|
| FAQ / customer support | 40-70% | Strong fit — exact-match cache first, semantic cache second |
| Search / Q&A over docs | 15-35% | Moderate fit — semantic cache with TTL tied to corpus freshness |
| Personalized recommendations | 3-10% | Marginal — exact-match only, skip semantic |
| Creative generation | < 3% | Don’t cache — fundamental cache-miss workload |
| Tool-use / agent workflows | 5-15% | Tier cache by tool (stable tools high-hit, dynamic tools low-hit) |
The cache-invalidation tax: Semantic cache correctness requires invalidation when the underlying corpus changes. Teams that skip invalidation serve stale answers confidently. Build TTL into the cache design from day one; the operational cost of stale-answer incidents dwarfs cache hit-rate savings.
Prompt-Compression Savings Table
Prompt compression ranked by savings-per-engineering-hour:
| Technique | Typical savings | Engineering effort | Quality impact | Savings per hour |
|---|---|---|---|---|
| max_tokens cap on output | 15-30% | 30 minutes | Low (truncation risk if cap too aggressive) | Highest ROI lever |
| Remove unused system-prompt examples | 8-15% | 1-2 hours | None (they weren’t being used) | Very high |
| Trim RAG chunk count from 8→5 | 10-20% | 1 hour | Low-medium (quality test required) | Very high |
| Structured output with JSON schema | 12-25% | 2-4 hours | None or positive (structure reduces verbosity) | High |
| Conversation history truncation (sliding window 6 turns) | 15-40% | 4-8 hours | Medium (context loss risk) | High |
| Compress system prompt with LLMLingua / LongLLMLingua | 20-40% | 8-16 hours | Low-medium (systematic testing required) | Medium |
| Dynamic example selection (few-shot → retrieved few-shot) | 10-30% | 16-40 hours | Positive (more relevant examples) | Medium-low |
| Fine-tune to remove system prompt entirely | 40-70% | 60-120 hours + training data | Variable — can be positive or negative | Low for most teams |
Sequencing discipline: Apply max_tokens + structured output + chunk-count reduction first. These are days of work for double-digit percent savings. Fine-tuning for cost reduction is a last-resort lever after you’ve exhausted the high-ROI options.
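Of the levers above, conversation-history truncation is the one that needs code rather than a config flag. A minimal sliding-window sketch, assuming OpenAI-style message dicts and treating a turn as one user+assistant exchange:

```python
def truncate_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep the system prompt plus the last `max_turns` exchanges.

    Simplification: assumes strictly alternating user/assistant messages,
    so one turn is two messages. Summarization-based compression would
    replace the dropped prefix with a generated summary instead.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns * 2:]
```

By turn 15 this caps input tokens at a constant instead of letting them grow linearly, which is where the 15-40% savings in the table comes from.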
Tiered-Architecture Breakpoint Analysis
When does a flat-model architecture justify moving to tiers? The decision framework by scale:
| Monthly query volume | Architecture | Rationale |
|---|---|---|
| < 10K queries | Flat (one model for everything) | Routing overhead doesn’t amortize; engineering effort better spent elsewhere |
| 10K - 100K queries | Two-tier (strong model + small model) | Router pays back; semantic cache starts pulling weight if hit-rate > 15% |
| 100K - 1M queries | Three-tier (flagship + workhorse + mini) + semantic cache + prompt compression | All major levers pull weight; savings justify 40-80 hours engineering |
| > 1M queries | Routed + cached + fine-tuned specialist models for high-volume paths | Custom fine-tuning economically viable; cost/quality frontier moves |
Breakpoint heuristic: If LLM spend is < 0.5% of revenue, don’t over-engineer cost. If LLM spend is > 5% of revenue or > $10K/month, full optimization stack pays back within 2 months.
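The breakpoint table and heuristic as a small decision function. The thresholds are this article's heuristics, not universal constants:

```python
def architecture_tier(monthly_queries: int, monthly_spend: float,
                      monthly_revenue: float) -> str:
    """Recommend an architecture tier from volume and spend-to-revenue ratio."""
    # Spend-relative-to-revenue check first: cheap features don't deserve engineering.
    if monthly_revenue and monthly_spend / monthly_revenue < 0.005:
        return "flat (spend is noise relative to revenue)"
    if monthly_queries < 10_000:
        return "flat"
    if monthly_queries < 100_000:
        return "two-tier + semantic cache if hit rate > 15%"
    if monthly_queries < 1_000_000:
        return "three-tier + semantic cache + prompt compression"
    return "routed + cached + fine-tuned specialists on high-volume paths"
```

Re-run this quarterly against actual volume, since projected volume is usually wrong in one direction or the other.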
Hidden Cost Traps
| Hidden cost | How it accumulates | Detection | Mitigation |
|---|---|---|---|
| Conversation-history bloat | Each turn appends full history; by turn 15, input tokens 10× original | Per-turn input-token logging | Sliding-window truncation or summarization compression every N turns |
| Silent retry storms | Schema-validation fails → regenerate → fails again → infinite loop absent cap | Retry-count metric per query | Hard retry-cap (usually 2) with fallback to stronger model on final attempt |
| Embedding re-generation | Corpus doesn’t change but embeddings re-generated on every index rebuild | Monthly embedding-API bill comparison | Cache embeddings in object storage keyed by content hash |
| Fallback-to-flagship inflation | Router routes 20% to flagship; over time drifts to 45% as classifier degrades | Daily routing-decision-histogram | Re-calibrate router weekly; fix classifier drift |
| Streaming-token billing edge cases | Some providers charge for generated tokens even when user disconnects | Provider-specific billing docs | Server-side disconnect handling with generation cancellation |
| System-prompt version sprawl | 12 variants of system prompt in production from A/B experiments; each different token count | Prompt-registry with per-prompt token-cost tracking | Canonical prompt catalog with deprecation schedule |
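The silent-retry-storm trap in the table above is avoided with a hard retry cap plus a final-attempt fallback. A sketch with hypothetical model callables (`cheap_model`, `flagship_model`, and `validate` are stand-ins for your client wrappers and schema validator):

```python
def generate_with_retry_cap(prompt: str, cheap_model, flagship_model,
                            validate, max_retries: int = 2):
    """Run on the cheap model up to `max_retries` times, then fall back once.

    Returns (output, model_used, total_attempts) so retry cost is visible
    in telemetry rather than hidden inside the call.
    """
    for attempt in range(max_retries):
        out = cheap_model(prompt)
        if validate(out):
            return out, "cheap", attempt + 1
    # Final attempt: spend once on the flagship instead of looping forever.
    out = flagship_model(prompt)
    return out, "flagship", max_retries + 1
```

Logging the third element of the return value is what makes retry-cost accounting possible: 2 cheap calls plus a flagship call can cost more than going straight to the flagship, and you want that visible on a dashboard.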
Cost-Quality Trade-Off Measurement
You cannot optimize cost without measuring quality. The cost-quality measurement protocol:
| Step | Activity | Output |
|---|---|---|
| 1. Baseline quality | Run 100 representative queries on current production model | Quality metrics (accuracy, helpfulness, or domain-specific score) |
| 2. Candidate evaluation | Run same 100 queries on candidate cheaper model | Per-query quality delta |
| 3. Cost-quality Pareto | Plot each candidate on cost (x-axis) × quality (y-axis) | Pareto frontier visualization |
| 4. Quality threshold | Define minimum-acceptable quality per task type | Per-task thresholds |
| 5. Cost-efficient frontier | For each task type, pick the cheapest model that clears quality threshold | Task → model assignment |
| 6. A/B test in production | Validate on real traffic (synthetic doesn’t capture all failure modes) | Production-validated cost/quality pair |
The silent-quality-regression failure: Teams run offline evaluation, see “quality is within 2%,” ship the cheaper model — and three weeks later user-satisfaction scores have dropped 8% because the offline evaluation didn’t include the 15% of queries where the quality gap is catastrophic. Online A/B testing with user-outcome metrics is non-negotiable for production cost-cutting.
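Step 5 of the protocol (cost-efficient frontier) reduces to a few lines once you have per-model cost and quality numbers from steps 1-2. The candidate dict shape is an assumption for illustration:

```python
def cost_efficient_choice(candidates: dict, quality_floor: float):
    """Pick the cheapest model that clears the quality threshold.

    `candidates` maps model name -> (cost_per_query, quality_score).
    Returns None when nothing clears the bar, i.e. keep the incumbent.
    """
    eligible = {name: cost for name, (cost, quality) in candidates.items()
                if quality >= quality_floor}
    if not eligible:
        return None
    return min(eligible, key=eligible.get)
```

Run this per task type, then validate the chosen assignment with the online A/B in step 6 before full rollout.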
Per-Decision Financial Model
For each proposed optimization, run the financial model:
Annual savings = (per-query savings) × (annual query volume)
Implementation cost = (engineering hours) × (fully-loaded hourly rate)
Quality cost = (quality-regression %) × (user-value-at-risk)
Net annual value = Annual savings − Implementation cost − Quality cost
For a 15% cost reduction from prompt compression at 500K queries/month at $0.04 per query:
- Annual savings: 0.15 × 500,000 × 12 × $0.04 = $36,000
- Implementation cost: 16 hours × $150 = $2,400
- Quality cost: assume 0.5% quality regression × $10K value-at-risk = $50
- Net annual value: $33,550
Compare to switching from GPT-4o to GPT-4o-mini:
- Annual savings: ~65% × 500,000 × 12 × $0.04 = $156,000
- Implementation cost: 4 hours (config change) = $600
- Quality cost: if 5% quality regression on user-facing metric worth $200K/year = $10,000
- Net annual value: $145,400
The quality-cost term is where most “cost optimizations” secretly lose money. A 5% quality regression on a feature that drives $200K/year of user value is $10K/year of lost value — which can exceed the “savings” from cheaper models. Measure the quality cost, don’t assume.
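The full model as a function, re-running the prompt-compression example from above:

```python
def net_annual_value(pct_saving: float, monthly_volume: int, cost_per_query: float,
                     eng_hours: float, hourly_rate: float,
                     quality_regression: float, value_at_risk: float) -> float:
    """Net annual value of a proposed optimization, in USD.

    quality_regression is a fraction (0.005 = 0.5%); value_at_risk is the
    annual user value the affected feature drives.
    """
    annual_savings = pct_saving * monthly_volume * 12 * cost_per_query
    implementation = eng_hours * hourly_rate
    quality_cost = quality_regression * value_at_risk
    return annual_savings - implementation - quality_cost
```

Comparing candidate optimizations through this one function, with an honestly measured quality-regression term, is what separates the 30%-savings-2-hour wins from the traps.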
Anti-Patterns
| Anti-pattern | Why teams do it | Why it fails | Correct pattern |
|---|---|---|---|
| Model-switch-first optimization | Single-config change, seductive | Often fails quality gate; wastes 2-4 weeks on revert cycle | Measure per-query cost decomposition first; optimize dominant component |
| Caching without hit-rate measurement | Intuition says “our queries repeat” | Hit rate below break-even for the traffic pattern | Instrument cache-miss logging for 2 weeks BEFORE deploying cache |
| Token-optimization-as-theater | Shows effort; measurable in isolation | Small absolute savings (< 5%) on non-dominant cost component | Skip if output tokens are dominant and uncapped |
| No quality-cost measurement | Harder to quantify | Quality regressions eat the savings silently | Online A/B with user-outcome metric before full rollout |
| Optimizing pre-validation prototype | Premature | Architecture is about to change; optimization work becomes stranded | Defer cost optimization until product-market fit validates the feature |
| Fine-tuning for cost at low volume | Maximal theoretical savings | Amortization requires > 500K queries/month | Use prompt + caching + routing until volume justifies fine-tuning |
Honest Limitations
- Pricing tables decay fast. Q1 2026 pricing will not match Q1 2027 pricing. The decision framework is stable; the absolute numbers are not. Re-run the financial model quarterly.
- Per-query cost decomposition is harder than it looks. Instrumenting token-count, retrieval-count, retry-count per request requires observability plumbing that many teams lack. Budget 16-40 hours for decomposition telemetry before you can optimize.
- Quality measurement for cost decisions is often skipped. Teams ship the cheaper model based on offline eval and find out via user complaints that it was worse. The cost of silent quality regression can exceed the savings by 2-5× for user-facing features.
- Semantic-cache correctness is an operational discipline. Wrong cache hit is worse than a cache miss. Teams that add caching without TTL + invalidation discipline create worse problems than the cost they solved.
- Routing accuracy degrades as the model portfolio changes. When you add a new model or a provider releases a new tier, router classification drifts. Expect to re-calibrate routers every 60-90 days.
- Fallback retries can mask cost bugs. A query that retries 3× on a cheap model before falling back to flagship costs more than a direct flagship call. Retry-cost accounting is often missing from dashboards — build it in.
- Streaming tokens complicate cost attribution. For interactive UX that streams tokens, cost accumulates even if the user stops reading. Server-side cancellation + output-cap enforcement are required to avoid runaway bills.
- Volume projections drive architecture decisions but are usually wrong. Teams build tiered architecture for projected volume that doesn’t materialize, or use flat architecture past the point where tiered would have saved 40%. Track actual volume monthly; re-evaluate architecture breakpoint quarterly.
The Cost-Optimized Production Stack
A mature cost-optimized LLM system has:
- Per-query cost decomposition instrumented (input / output / retrieval / retries logged per request).
- max_tokens capped on every endpoint.
- System prompts versioned in a prompt registry with per-prompt cost tracking.
- Semantic cache where hit rate justifies it (>15% for most systems).
- Model routing for volume paths (>50K queries/month).
- Quality measurement on every cost-reduction change (online A/B, not offline-only).
- Monthly cost-quality Pareto re-evaluation.
- Explicit retry caps with fallback cost budgeted.
Cost optimization is continuous, not a one-time project. The systems that stay efficient have the measurement discipline; the systems that drift are the ones that optimized once and stopped looking.