LLM Rate-Limit + Throughput Planner — size concurrency, queue depth, tier before launch
Every production LLM deployment hits rate limits the first time traffic matches projections. The question is whether you designed for it or discovered it at 2 AM when 429 errors cascade into timeouts. Rate limits come in two flavors — RPM (request-count) and TPM (token-throughput) — and the bottleneck depends on YOUR request shape. Small fast classification requests hit RPM first; long RAG queries hit TPM first. Little's Law (N = throughput x latency) dictates your minimum concurrency. Without this math you either over-provision (waste money on unused tier headroom) or under-provision (outages under predictable load). This planner runs the math in-browser before you touch production.
Plan your LLM throughput before production burns through rate limits. Pick a provider tier or enter custom RPM/TPM, describe request size + target concurrency, see which limit bottlenecks first, required queue depth for burst smoothing, tier-upgrade trigger points, and batch-vs-real-time tradeoff.
Privacy: all inputs are rate + volume parameters — no prompts or data. Safe to persist + share via URL. Client-side only.
Input validation: RPM, TPM, and token values must be positive, and latency must be at least 100 ms. Use Reset to restore defaults if inputs get into a bad state.
Why LLM rate limits work differently from traditional APIs
Quick answer: traditional REST APIs limit by request count — 1000 req/min, period. LLM APIs limit by BOTH request count (RPM) AND token volume (TPM), because the expensive resource isn't the request overhead — it's the GPU time per generated token. A single 4K-token request is much more expensive than 10 small 200-token requests, so providers cap both axes. Your effective throughput is the MIN of the two: hit either ceiling and the provider returns 429.
Consequence: the optimization you need depends on WHICH axis you hit first. If you're RPM-bound (many small requests), you need more concurrency + higher tier. If you're TPM-bound (fewer large requests), adding concurrency gives ZERO additional throughput — you need to compress prompts, cache shared preamble, route long-form to a higher-TPM tier, or move non-interactive work to Batches API. Getting this wrong means you pay for capacity you can't use.
Little's Law — the concurrency math every LLM deployment needs
Quick answer: Little's Law says N = throughput x latency, where N is the number of concurrent in-flight requests. For an LLM API, if you want to sustain 5 req/sec with 2 seconds average latency, you need at least 10 concurrent connections. Under-provision concurrency and the arrival rate exceeds service rate — queue grows forever, requests timeout, P95 explodes. Over-provision and idle connections consume file descriptors + socket handles + TLS state for nothing.
The non-intuitive part: concurrency requirement scales with LATENCY, not just throughput. When you switch from Sonnet (1.5s average) to Opus (3.5s), same throughput target requires 2.3x more concurrency. When you enable prompt caching (reduces TTFT by ~30%), concurrency requirement drops proportionally. Tool surfaces this live so you see the interaction.
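The Little's Law calculation is two multiplications; a minimal sketch (function name is illustrative, latency figures are the examples above):

```python
import math

def required_concurrency(target_rps: float, avg_latency_s: float) -> int:
    """Little's Law: N = throughput x latency, rounded up to whole connections."""
    return math.ceil(target_rps * avg_latency_s)

# 5 req/sec at 2 s average latency -> 10 in-flight requests minimum
assert required_concurrency(5, 2.0) == 10

# Same throughput target, different model latency: latency drives concurrency
sonnet = required_concurrency(5, 1.5)  # 8  (7.5 rounded up)
opus = required_concurrency(5, 3.5)    # 18 (17.5 rounded up)
```

Rounding up matters: 7.5 concurrent connections is not a thing, and rounding down guarantees the queue grows under sustained load.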
RPM-bound vs TPM-bound — recognize which you have
Quick answer: compute both. RPM-bound throughput = RPM / 60. TPM-bound throughput = TPM / (input_tokens + output_tokens) / 60. Whichever is smaller is your bottleneck. RPM-bound profile: high-concurrency fleet of small requests — classification, triage, short-form Q&A, tool routing. Common in agent orchestration where many small sub-queries run in parallel. TPM-bound profile: fewer larger requests — RAG queries with large context, long-form generation, document summarization, multi-turn conversation with full history replay.
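The two formulas combine into a small bottleneck check. A sketch, with illustrative names; the tier values are the Anthropic Tier 4 figures used in this section:

```python
def effective_throughput_rps(rpm: float, tpm: float,
                             input_tokens: int, output_tokens: int):
    """Return (sustainable req/sec, limiting axis) for a given request shape."""
    rpm_bound = rpm / 60.0
    tpm_bound = tpm / (input_tokens + output_tokens) / 60.0
    if tpm_bound < rpm_bound:
        return tpm_bound, "TPM"
    return rpm_bound, "RPM"

# 4000 RPM / 400K TPM tier, 4K-token requests (3500 in + 500 out):
# TPM permits only 400K / 4000 = 100 req/min -> TPM-bound
rps, axis = effective_throughput_rps(4000, 400_000, 3500, 500)

# Same tier, 100-token requests: TPM permits 4000 req/min, the RPM cap -> RPM-bound
_, axis_small = effective_throughput_rps(4000, 400_000, 80, 20)
```

Whichever axis comes back tells you which optimization lever applies: concurrency and tier for RPM, prompt compression and caching for TPM.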
Worked example, Anthropic Tier 4: 4000 RPM / 400K TPM. With 100-token requests, TPM permits 400K / 100 = 4000 req/min, exactly the 4000 RPM cap (~67 req/sec), so anything smaller than 100 tokens per request is RPM-bound. With 4000-token requests, TPM permits only 400K / 4K = 100 req/min (~1.7 req/sec), far below the RPM cap: TPM-bound. Same tier, same money, 40x different sustainable throughput depending on request shape.
Burst absorption — why steady-state math under-counts
Quick answer: real traffic has peaks. Average 5 req/sec often means 50 req/sec during peak hour, 2 req/sec overnight. If you size capacity for the average, peak traffic either queues or drops. Two strategies: (1) size capacity for peak (expensive but simple); (2) size for average + add buffering to absorb peaks (cheaper but requires queue depth + graceful back-pressure). Rule of thumb: dimension for bursts of 2-3x average; beyond 5x, rate-shape at ingress rather than absorb with a queue.
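The queue depth needed to absorb a burst above steady-state capacity is just excess arrival rate times burst duration; a sketch (names and figures are illustrative):

```python
import math

def burst_queue_depth(peak_rps: float, capacity_rps: float,
                      burst_duration_s: float) -> int:
    """Requests that accumulate while arrivals exceed service capacity."""
    excess = max(0.0, peak_rps - capacity_rps)
    return math.ceil(excess * burst_duration_s)

# Sized for 10 req/sec, hit with a 30-second burst at 25 req/sec:
# (25 - 10) * 30 = 450 requests must queue (or be shed at ingress)
depth = burst_queue_depth(25, 10, 30)
```

If that number exceeds what your timeout budget tolerates (queued requests still have to finish within the client's deadline), buffering alone cannot save you and you are back to strategy (1) or ingress shaping.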
M/M/c queue theory gives the baseline math: with c concurrent workers + Poisson arrivals + exponential service time, expected queue length at utilization ρ = λ/(cμ) is computable. Real LLM traffic deviates: service time has heavy tails (80% quick, 20% long-generation), arrivals are bursty not Poisson. Tool uses M/M/c as a floor approximation — actual queue depth in production will be 1.5-3x what the model predicts; validate via load test at 1.5x projected peak.
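Under those assumptions (Poisson arrivals, exponential service, c workers) the expected queue length follows from the Erlang C formula; a sketch of the floor approximation described above:

```python
from math import factorial

def mmc_queue_length(arrival_rps: float, service_rps: float, c: int) -> float:
    """Expected queue length Lq for an M/M/c queue via the Erlang C formula.
    arrival_rps = lambda, service_rps = mu per worker, c = concurrent workers."""
    a = arrival_rps / service_rps              # offered load in Erlangs
    rho = a / c                                # utilization
    if rho >= 1:
        raise ValueError("unstable: arrival rate exceeds total service rate")
    erlang_c_num = a**c / (factorial(c) * (1 - rho))
    erlang_c_den = sum(a**k / factorial(k) for k in range(c)) + erlang_c_num
    p_wait = erlang_c_num / erlang_c_den       # P(arriving request must queue)
    return p_wait * rho / (1 - rho)            # expected queue length Lq

# 5 req/sec, 2 s mean service time (mu = 0.5/sec), 12 workers: rho ~ 0.83
lq = mmc_queue_length(5.0, 0.5, 12)
```

Per the caveat above, treat the result as a floor: multiply by 1.5-3x for heavy-tailed generation times and bursty arrivals, then confirm with a load test.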
Tier upgrade vs multi-key fan-out vs multi-provider
Quick answer: three scaling strategies when you outgrow a tier. (1) TIER UPGRADE — simplest, no code change, typically 2-10x capacity for 1.5-2x cost. Anthropic Tier 4 requires $400+ historical spend; OpenAI Tier 5 requires $1000+ spend + 30 days account age. (2) MULTI-KEY FAN-OUT — multiple API keys on same provider, round-robin'd via load balancer; doubles/triples cap without tier jump; needs key management + load balancer logic. (3) MULTI-PROVIDER FAN-OUT — Anthropic + OpenAI + Gemini in parallel; diversifies against single-provider outage; adds latency variance since providers respond at different speeds; needs provider-agnostic prompt shape OR per-provider adapters.
Ranking for most orgs: tier upgrade first (cheapest), multi-key if spend-gate blocks upgrade, multi-provider only when reliability requirement justifies complexity. Multi-provider is also the right answer for latency-critical paths where you race the same request against 2-3 providers and take the first response — adds cost but halves P95 latency.
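The round-robin core of strategy (2) is small; a sketch with placeholder key names (a production version also needs per-key 429 tracking and failover, which this omits):

```python
import itertools

class KeyRotator:
    """Round-robin across multiple API keys so load spreads over each key's
    separate rate pool. Key names here are placeholders, not credentials."""
    def __init__(self, keys: list[str]):
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)

rotator = KeyRotator(["key-a", "key-b", "key-c"])
picks = [rotator.next_key() for _ in range(6)]
# cycles key-a, key-b, key-c, key-a, key-b, key-c
```

Note that round-robin assumes roughly uniform request shapes; if one key serves the 4K-token RAG traffic it will hit its TPM pool long before the others, so shape-aware routing beats naive rotation in mixed workloads.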
Batches API — 50% off for patient workloads
Quick answer: both Anthropic + OpenAI offer Batches API for async processing at 50% discount. Submit multiple requests in a single file, 24h SLA for processing. Use cases: classification-at-scale, dataset labeling, overnight analytics, summarization of historical content, content-moderation post-hoc. Not for: user-facing interactive workflows, sub-minute latency paths, single-request debugging. Break-even: any workload where 24h latency is acceptable AND volume exceeds ~1000 requests (below that, per-request batch overhead isn't worth it).
Batch API often has SEPARATE rate pool from real-time API — so routing non-interactive work to batch frees up your real-time rate pool for user-facing requests. Double win: 50% cheaper AND doesn't compete with interactive traffic. Anthropic Batches: 24h max completion, no priority tier. OpenAI Batch: 24h max, separate from real-time TPM pool.
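The routing rule above reduces to a one-line predicate; a hedged sketch (the function name and the ~1000-request threshold are taken from the break-even discussion, not from any provider API):

```python
def route_to_batch(latency_tolerance_h: float, request_count: int,
                   min_batch_volume: int = 1000) -> bool:
    """Route work to the Batches API when a 24 h turnaround is acceptable
    and volume clears the break-even threshold."""
    return latency_tolerance_h >= 24 and request_count >= min_batch_volume

assert route_to_batch(48, 50_000)        # overnight labeling job: batch
assert not route_to_batch(0.01, 50_000)  # interactive traffic: real-time
assert not route_to_batch(48, 200)       # tiny job: overhead not worth it
```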
Prompt caching — throughput multiplier that looks like a cost optimization
Quick answer: prompt caching isn't only a cost-saver, it's a throughput-amplifier for TPM-bound workloads. Anthropic's explicit cache (5-min default TTL, 1-hour TTL available at a higher write premium) prices cache reads at 10% of the base input rate, and on models with cache-aware rate limits, cache-read tokens don't count against the input-TPM pool. OpenAI's caching is automatic for prompts above ~1024 tokens, discounts cached input (typically 50%), and retains prefixes for roughly 5-10 minutes of inactivity. For a workload with a 2000-token static preamble across many requests, caching roughly halves the TPM consumed per request even though the raw tokens sent look the same to your app. You go from 100 RPS TPM-bound to ~160-180 RPS without a tier upgrade.
On Anthropic, a cache write costs +25% of the base input rate (5-min TTL); reads cost 10%. Break-even arrives at the first cache read within the TTL: one write plus one read costs 1.35x versus 2.0x for two uncached sends. OpenAI's automatic caching has no write premium, so there is no break-even to clear. For production apps with many requests per TTL sharing a preamble: always on. For one-off requests: no benefit. Tool's TPM-bound recommendation flags the caching opportunity when input_tokens > 1000.
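The break-even arithmetic can be checked directly. A sketch using Anthropic-style multipliers (write 1.25x, read 0.10x); the function name is illustrative:

```python
def cached_vs_uncached_cost(n_requests: int, write_mult: float = 1.25,
                            read_mult: float = 0.10) -> tuple[float, float]:
    """Relative input cost of the shared prefix for n requests inside one TTL:
    one cache write plus (n - 1) reads, versus n full-price sends.
    Multipliers follow Anthropic-style cache pricing (an assumption to verify
    against current provider docs)."""
    cached = write_mult + read_mult * (n_requests - 1)
    uncached = float(n_requests)
    return cached, uncached

# Break-even at the very first cache read: 1.25 + 0.10 = 1.35 < 2.00
cached, uncached = cached_vs_uncached_cost(2)
```

At 10 requests per TTL the cached cost is 2.15x against 10.0x uncached, which is why "always on" is the right default for any shared preamble with real traffic.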
What this planner does NOT model
Four classes of behavior require different analysis. (1) STREAMING LATENCY — tool uses average latency as a scalar; streaming responses deliver TTFT (time-to-first-token) much faster than total-completion latency, which shifts perceived latency for user-facing apps. Use TTFT for streaming-focused SLO analysis. (2) PROVIDER TIER AUTO-PROMOTION — OpenAI automatically promotes tier based on spend + account age; projected capacity may grow over time without explicit upgrade. (3) SERVER-SIDE BACKPRESSURE — some providers return 429 before hitting published limits when overall system load is high (shared capacity across all tenants). Build retry-with-backoff regardless of math. (4) OUTPUT LENGTH VARIANCE — average output tokens hides heavy-tail distribution (median 200 tokens but p99 at 2000). TPM calculations should use p95 output length for safety margin.
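For point (4), a nearest-rank p95 helper shows why mean-based TPM sizing under-counts; the sample lengths below are hypothetical:

```python
import math

def p95(samples: list[int]) -> int:
    """Nearest-rank 95th percentile. Use this, not the mean, for TPM sizing."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

# Heavy-tailed output lengths: median ~200 tokens, occasional 2000-token run
lengths = [180, 200, 210, 190, 220, 205, 195, 215, 200, 2000]
# Mean is ~380, but p95 catches the tail that actually drains the TPM pool
```

Feeding p95 rather than mean output length into the TPM-bound formula builds in the safety margin the section recommends.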
Sources + further reading
Anthropic API Rate Limits (docs.anthropic.com/en/api/rate-limits) — canonical source for Anthropic tier RPM / ITPM / OTPM values + upgrade criteria. OpenAI Rate Limits (platform.openai.com/docs/guides/rate-limits) — OpenAI tier structure + auto-promotion rules + 429 error semantics. Google Gemini API rate limits (ai.google.dev) — free vs paid tier comparison. J.D.C. Little (1961) "A Proof for the Queuing Formula: L = λW," Operations Research 9(3) — original Little's Law paper. Harchol-Balter, M. (2013) Performance Modeling and Design of Computer Systems: Queueing Theory in Action (Cambridge University Press) — comprehensive M/M/c + queue-theory reference. For practical LLM deployment patterns: Chip Huyen's Designing Machine Learning Systems (O'Reilly, 2022) + Anthropic engineering blog + OpenAI production best-practices documentation (all updated continuously 2024-2026).
LLM Rate-Limit + Throughput Planner Tool v1 · canonical sources cited inline above · runs entirely client-side, no data transmitted
Informational & educational tool. Outputs are planning estimates for educational purposes and do not constitute financial, legal, or professional advice. Verify rate limits, pricing, and tier criteria against current provider documentation before relying on results for capacity, cost, or SLA decisions.