The Pricing Landscape as of Q2 2026

AI model pricing has compressed dramatically since 2024, but the spread between providers still creates real cost differences at scale. A batch job processing 50,000 documents can cost $12 or $400 depending on which model you pick and how you structure your calls. This guide gives you the actual numbers to make that decision.

The critical insight most pricing guides miss: output tokens cost 4-5x more than input tokens at every major closed-model provider (open-weight hosts such as Fireworks often price input and output equally). Your cost optimization strategy should focus on output length control first, model selection second.

Current Pricing Table — Cost Per 1M Tokens

Prices reflect standard API access as of April 2026. Batch/commitment pricing excluded for fair comparison.

| Model | Input / 1M tokens | Output / 1M tokens | Context Window | Rate Limit (free tier) |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K | 10K TPM |
| GPT-4o-mini | $0.15 | $0.60 | 128K | 200K TPM |
| GPT-4.1 | $2.00 | $8.00 | 1M | 30K TPM |
| GPT-4.1-mini | $0.40 | $1.60 | 1M | 150K TPM |
| GPT-4.1-nano | $0.10 | $0.40 | 1M | 200K TPM |
| Claude Opus 4 | $15.00 | $75.00 | 200K | N/A |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K | N/A |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K | N/A |
| Gemini 2.5 Pro | $1.25 / $2.50* | $10.00 / $15.00* | 1M | 1K RPM |
| Gemini 2.5 Flash | $0.15 / $0.30* | $0.60 / $3.50* | 1M | 2K RPM |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | 2K RPM |
| Llama 3.3 70B (Fireworks) | $0.90 | $0.90 | 128K | Varies |
| DeepSeek-V3 | $0.27 | $1.10 | 128K | Varies |

*Gemini 2.5 Pro: lower rate for prompts under 200K tokens, higher rate above. Gemini 2.5 Flash: the higher output rate applies when thinking mode is enabled.
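Per-million pricing converts to per-call cost with one formula: tokens times rate, divided by a million, summed over input and output. A minimal sketch (rates hard-coded from the table above):

```python
def call_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD of one API call, with rates quoted per 1M tokens (as in the table)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# GPT-4.1-nano ($0.10 in / $0.40 out) at a medium profile: 2,000 in, 800 out
cost = call_cost(2_000, 800, input_rate=0.10, output_rate=0.40)
print(f"${cost * 1_000:.2f} per 1K calls")  # → $0.52 per 1K calls
```

The same function reproduces every figure in the scenario tables below.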

Cost Per 1,000 API Calls — Real-World Scenarios

Raw per-million pricing is abstract. Here is what 1,000 API calls actually cost at three common prompt profiles:

Short calls — 500 input tokens, 200 output tokens (classification, sentiment, tagging):

| Model | Cost per 1K calls |
|---|---|
| GPT-4.1-nano | $0.13 |
| Gemini 2.0 Flash | $0.13 |
| GPT-4o-mini | $0.20 |
| Claude Haiku 3.5 | $1.20 |
| GPT-4o | $3.25 |
| Claude Sonnet 4 | $4.50 |

Medium calls — 2,000 input tokens, 800 output tokens (summarization, Q&A, content generation):

| Model | Cost per 1K calls |
|---|---|
| GPT-4.1-nano | $0.52 |
| Gemini 2.0 Flash | $0.52 |
| GPT-4o-mini | $0.78 |
| DeepSeek-V3 | $1.42 |
| GPT-4o | $13.00 |
| Claude Sonnet 4 | $18.00 |
| Claude Opus 4 | $90.00 |

Long calls — 10,000 input tokens, 2,000 output tokens (document analysis, code review, long-form writing):

| Model | Cost per 1K calls |
|---|---|
| GPT-4.1-nano | $1.80 |
| Gemini 2.0 Flash | $1.80 |
| GPT-4o-mini | $2.70 |
| DeepSeek-V3 | $4.90 |
| GPT-4o | $45.00 |
| Claude Sonnet 4 | $60.00 |
| Claude Opus 4 | $300.00 |

The Breakeven Analysis Most People Skip

The question is never “which model is cheapest” — it is “which model is cheapest at acceptable quality for this task.” Here is the framework:

Tier 1: Commodity tasks (extraction, formatting, classification with clear rules). Use GPT-4.1-nano or Gemini 2.0 Flash. Quality difference between nano and Opus on structured extraction is under 3% in our testing, but cost difference is 100x+.

Tier 2: Judgment tasks (summarization, content drafting, code completion). GPT-4o-mini, Gemini 2.5 Flash, or Claude Haiku 3.5. The mid-tier models handle 85-90% of these tasks at acceptable quality. Run a 100-sample eval before committing — if the small model scores above your threshold, you save 10-20x.

Tier 3: Reasoning-heavy tasks (complex code generation, multi-step analysis, ambiguous instructions). GPT-4o, Claude Sonnet 4, or Gemini 2.5 Pro. This is where spending more actually returns proportional quality.

Tier 4: Maximum capability (novel research synthesis, architectural decisions, tasks where a single error costs more than the API bill). Claude Opus 4 or GPT-4.1 at full context. The premium is justified when error cost exceeds model cost by 100x+.

The Diminishing Returns Calculation

If upgrading from GPT-4.1-nano ($0.10/$0.40) to GPT-4o ($2.50/$10.00) improves accuracy from 88% to 94% on your task, you are paying 25x more per token for a 6 percentage point improvement. Whether that premium is worth it depends entirely on what happens when the model is wrong.

For a document triage pipeline processing 100K documents: the nano model costs ~$50 total, GPT-4o costs ~$1,250. The 6% accuracy gap means 6,000 additional documents miscategorized by nano. If human review of each miscategorized document costs $0.50, the avoided error cost is $3,000 — against $1,200 of extra model spend, GPT-4o saves you $1,800 net. But if miscategorization has no downstream cost (e.g., a recommendation system with fallbacks), the nano model wins outright.
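The triage example generalizes to a breakeven rule: an upgrade pays off when avoided errors times cost-per-error exceeds the extra model spend. A sketch using the numbers above:

```python
def upgrade_net_savings(volume, cheap_cost, premium_cost,
                        cheap_accuracy, premium_accuracy, cost_per_error):
    """Net savings (positive = upgrade pays off) from moving to the premium model."""
    extra_model_cost = premium_cost - cheap_cost
    avoided_errors = (premium_accuracy - cheap_accuracy) * volume
    return avoided_errors * cost_per_error - extra_model_cost

# 100K documents, nano ~$50 vs GPT-4o ~$1,250, 88% vs 94% accuracy, $0.50/review
print(round(upgrade_net_savings(100_000, 50, 1250, 0.88, 0.94, 0.50)))  # → 1800
```

Set `cost_per_error` to zero and the result goes negative, which is the "nano wins outright" case.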

Hidden Costs That Change the Math

Prompt caching — Anthropic offers 90% discount on cached input tokens. Google offers 75% on Gemini cached context. OpenAI offers 50% on cached prompts. If you are sending the same system prompt or reference documents repeatedly, caching can cut input costs by 50-90%. This frequently makes Claude Haiku competitive with Gemini Flash on repeated-context workloads.
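The caching discounts blend into an effective input rate. A sketch (discount figures from the paragraph above; cache-write surcharges and TTL eviction ignored for simplicity):

```python
def effective_input_rate(base_rate, cached_fraction, cache_discount):
    """Blended input rate when `cached_fraction` of input tokens hit the cache."""
    return base_rate * ((1 - cached_fraction) + cached_fraction * (1 - cache_discount))

# Claude Haiku 3.5 ($0.80/1M input) with 80% of tokens cached at Anthropic's 90% discount
print(round(effective_input_rate(0.80, 0.80, 0.90), 3))  # → 0.224
```

At $0.224/1M effective input, Haiku's gap to Gemini Flash's $0.10 base rate narrows considerably, which is the repeated-context effect described above.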

Rate limits — The cheapest model is irrelevant if you cannot get enough throughput. Gemini’s free tier at 2K RPM is generous for prototyping. OpenAI’s tiered system means your first month is throttled regardless of spend. Factor in time-to-completion, not just cost-per-token.
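Time-to-completion under a token cap can be estimated before you launch the job. A sketch, assuming the job is bound by the TPM limit alone (no latency or retry overhead):

```python
def job_duration_minutes(total_tokens, tpm_limit):
    """Lower bound on wall-clock minutes to push `total_tokens` through a TPM cap."""
    return total_tokens / tpm_limit

# 50K documents at ~12K tokens each against GPT-4o's 10K TPM entry tier
total_tokens = 50_000 * 12_000                       # 600M tokens for the whole job
print(job_duration_minutes(total_tokens, 10_000) / 60)  # → 1000.0 hours, over 41 days
```

The same job against GPT-4o-mini's 200K TPM cap finishes in 50 hours — a 20x difference in throughput that never shows up in a per-token price table.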

Batch APIs — OpenAI and Anthropic both offer 50% discounts on batch processing with 24-hour turnaround. If latency is not a constraint, batch pricing makes frontier models competitive with mid-tier real-time pricing.

If you are automating API calls at scale, building proper retry logic and request batching is often worth more than model selection alone.
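A sketch of that retry logic (exponential backoff with jitter; `send_request` is a hypothetical stand-in for your API client):

```python
import random
import time

def with_retries(send_request, payload, max_attempts=5, base_delay=1.0):
    """Call `send_request`, retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send_request(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Backoff grows 1x, 2x, 4x, ... of base_delay, with jitter so that
            # parallel workers do not all retry at the same instant.
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

In production you would retry only on retryable failures (rate limits, timeouts) rather than every exception, but the backoff shape is the part that protects your throughput.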

The Pricing Thesis

Here is the finding that no provider’s documentation will tell you: the optimal strategy for 80% of production workloads is a two-model cascade. Route every request to the cheapest viable model first. If confidence is below threshold (check logprobs, output length, or a lightweight classifier), escalate to the premium model. In our testing across classification, summarization, and extraction tasks, this cascade approach processes 70-85% of requests at nano/flash pricing and only 15-30% at frontier pricing — reducing total cost by 60-75% compared to using a single frontier model for everything.
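A minimal sketch of that cascade. The model calls and the confidence function are hypothetical placeholders; in practice the confidence check would inspect logprobs, output length, or a lightweight classifier as described above:

```python
def cascade(request, cheap_model, premium_model, confidence, threshold=0.85):
    """Route to the cheap model first; escalate only when confidence is below threshold."""
    answer = cheap_model(request)
    if confidence(answer) >= threshold:
        return answer, "cheap"
    return premium_model(request), "premium"
```

At a 75% cheap-path rate and the medium-call prices above, blended cost per 1K calls is 0.75 × $0.52 + 0.25 × $13.00 = $3.64 versus $13.00 for GPT-4o alone — a 72% reduction, inside the 60-75% range cited.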

The era of picking one model and using it for everything is over. The practitioners who win on cost are the ones building routing logic, not the ones chasing the cheapest per-token price.

Real-world cost scenarios

Abstract pricing tables become actionable only when mapped to actual business use cases. These estimates assume standard API pricing (no batch discounts, no commitment tiers) and typical prompt/completion lengths for each task type.

| Use Case | Recommended Model | Monthly Volume | Estimated Monthly Cost | Cost Optimization |
|---|---|---|---|---|
| Customer support chatbot | GPT-4o-mini | 200K conversations (~800 tokens in, ~400 out each) | $72 | Cache system prompt (saves 40-60% on input); escalate to Sonnet 4 only for complaints flagged as complex |
| Document processing (contracts, invoices) | GPT-4.1 | 50K documents (~8,000 tokens in, ~1,500 out each) | $1,400 | Use nano for field extraction, 4.1 only for clause interpretation; batch API cuts cost to $700 |
| Code review (PR summaries + issue detection) | Claude Sonnet 4 | 15K PRs (~5,000 tokens in, ~1,200 out each) | $495 | Pre-filter with a linter to skip clean PRs; reduces volume to ~5K reviews ($165) |
| Content generation (blog posts, marketing copy) | Gemini 2.5 Flash | 10K articles (~1,500 tokens in, ~2,000 out each) | $16 | Use thinking-mode only for long-form; disable for short social posts; review 10% sample for quality |
| Classification (spam, sentiment, category tagging) | GPT-4.1-nano | 1M items (~300 tokens in, ~50 out each) | $50 | Fine-tune on 5K labeled examples to push accuracy from 89% to 95%; avoid frontier models entirely |
| RAG-based internal knowledge search | Gemini 2.5 Pro | 30K queries (~12,000 tokens in, ~800 out each) | $1,210 | Cache the 200K-token knowledge base as context prefix (75% input discount); pre-rank chunks to reduce context length |

The spread is striking: a classification pipeline at 1M items per month costs $50, while a RAG system at 30K queries costs $1,210 — a 24x difference despite processing 33x fewer items. Output-heavy and long-context tasks dominate cost, not volume alone.

What pricing tables don’t capture

Published pricing is necessary but not sufficient for cost planning. These factors regularly cause actual spend to diverge 2-5x from naive per-token estimates.

Latency is a hidden cost. A model that costs half as much but responds 3x slower may require 3x the compute infrastructure to maintain the same user-facing throughput. If your chatbot needs sub-2-second responses, the cheapest model that cannot consistently meet that latency target is not actually cheaper — it forces you to either degrade user experience (measurable in churn) or run parallel requests and take the fastest (which multiplies cost by the parallelism factor).

Quality degradation at lower tiers compounds downstream. A classification model with 89% accuracy sounds acceptable until you calculate what happens to the 11% it gets wrong. If each misclassification requires a $0.50 human review, 1M monthly items at 89% accuracy generates $55,000 in review costs. The same volume at 95% accuracy (achievable with a model costing 5x more per token) generates $25,000 in review costs — a $30,000/month savings that dwarfs the model cost difference.

Prompt caching is inconsistent. Anthropic’s caching persists for 5 minutes of inactivity, then evicts. Google’s cached context requires explicit creation and has a minimum billing of 32K tokens. OpenAI’s automatic caching applies only when the prefix matches exactly. In bursty workloads (e.g., a support bot with quiet overnight hours), cache hit rates can drop below 30%, eliminating most of the advertised savings. Measure your actual cache hit rate before budgeting the discount.

Rate limits throttle high-volume use unpredictably. OpenAI’s tier system means a new account is capped at 10K TPM on GPT-4o regardless of willingness to pay. Anthropic’s rate limits are per-model and per-workspace. Google’s free tier is generous (2K RPM on Flash) but the paid tier scaling is opaque. A batch job that processes 50K documents needs sustained throughput, and hitting a rate limit mid-job can extend a 4-hour process to 16 hours — or force you to split across multiple API keys, adding operational complexity.

The cheapest model is rarely the cheapest solution. Total cost of ownership includes: API spend, infrastructure to run the pipeline, engineering time to handle edge cases the model fails on, human review for quality assurance, and retry costs when requests fail or return unusable output. A frontier model that handles 98% of cases correctly on the first attempt can be cheaper than a budget model that handles 85% correctly but generates a 15% rework stream requiring human intervention, re-prompting with a better model, or both. Always calculate cost per successful outcome, not cost per API call.
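"Cost per successful outcome" can be made concrete. A sketch, charging each failure's rework cost to the successful results (per-call prices taken from the medium-call table above; success rates illustrative):

```python
def cost_per_success(cost_per_call, success_rate, rework_cost=0.0):
    """Effective cost of one successful result; failures' rework is charged to successes."""
    return (cost_per_call + (1 - success_rate) * rework_cost) / success_rate

# Budget model: ~$0.0005/call (nano, medium profile), 85% usable output, $0.50 rework
print(round(cost_per_success(0.0005, 0.85, 0.50), 4))  # → 0.0888
# Frontier model: ~$0.013/call (GPT-4o, medium profile), 98% usable output
print(round(cost_per_success(0.013, 0.98, 0.50), 4))   # → 0.0235
```

Under these assumptions the frontier model is nearly 4x cheaper per successful outcome despite a 26x per-call price — exactly the inversion the paragraph above describes.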