LLM Rate-Limit + Throughput Planner — size concurrency, queue depth, tier before launch
Every production LLM deployment hits rate limits the first time traffic matches projections. The question is whether you designed for it or discovered it at 2 AM when 429 errors cascade into timeouts. Rate limits come in two flavors — RPM (request-count) and TPM (token-throughput) — and the bottleneck depends on YOUR request shape. Small fast classification requests hit RPM first; long RAG queries hit TPM first. Little's Law (N = throughput x latency) dictates your minimum concurrency. Without this math you either over-provision (waste money on unused tier headroom) or under-provision (outages under predictable load). This planner runs the math in-browser before you touch production.
Plan your LLM throughput before production burns through rate limits. Pick a provider tier or enter custom RPM/TPM, describe request size + target concurrency, see which limit bottlenecks first, required queue depth for burst smoothing, tier-upgrade trigger points, and batch-vs-real-time tradeoff.
Privacy: all inputs are rate + volume parameters — no prompts or data. Safe to persist + share via URL. Client-side only.
Input validation: RPM, TPM, and token values must be positive, and latency must be at least 100 ms. Use Reset to restore defaults if inputs get into a bad state.
Why LLM rate limits work differently from traditional APIs
Quick answer: traditional REST APIs limit by request count — 1000 req/min, period. LLM APIs limit by BOTH request count (RPM) AND token volume (TPM), because the expensive resource isn't the request overhead — it's the GPU time per generated token. A single 4K-token request is much more expensive than 10 small 200-token requests, so providers cap both axes. Your effective throughput is the MIN of the two: hit either ceiling and the provider returns 429.
Consequence: the optimization you need depends on WHICH axis you hit first. If you're RPM-bound (many small requests), you need more concurrency + higher tier. If you're TPM-bound (fewer large requests), adding concurrency gives ZERO additional throughput — you need to compress prompts, cache shared preamble, route long-form to a higher-TPM tier, or move non-interactive work to Batches API. Getting this wrong means you pay for capacity you can't use.
Little's Law — the concurrency math every LLM deployment needs
Quick answer: Little's Law says N = throughput x latency, where N is the number of concurrent in-flight requests. For an LLM API, if you want to sustain 5 req/sec with 2 seconds average latency, you need at least 10 concurrent connections. Under-provision concurrency and the arrival rate exceeds service rate — queue grows forever, requests timeout, P95 explodes. Over-provision and idle connections consume file descriptors + socket handles + TLS state for nothing.
The non-intuitive part: concurrency requirement scales with LATENCY, not just throughput. When you switch from Sonnet (1.5s average) to Opus (3.5s), same throughput target requires 2.3x more concurrency. When you enable prompt caching (reduces TTFT by ~30%), concurrency requirement drops proportionally. Tool surfaces this live so you see the interaction.
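The Little's Law calculation is two multiplications; a minimal sketch (function name is illustrative, latency figures are the examples above):

```python
import math

def required_concurrency(target_rps: float, avg_latency_s: float) -> int:
    """Little's Law: N = throughput x latency, rounded up to whole connections."""
    return math.ceil(target_rps * avg_latency_s)

# 5 req/sec at 2 s average latency -> 10 in-flight requests minimum
assert required_concurrency(5, 2.0) == 10

# Same throughput target, different model latency: latency drives concurrency
sonnet = required_concurrency(5, 1.5)  # 8  (7.5 rounded up)
opus = required_concurrency(5, 3.5)    # 18 (17.5 rounded up)
```

Rounding up matters: 7.5 concurrent connections is not a thing, and rounding down guarantees the queue grows under sustained load.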
RPM-bound vs TPM-bound — recognize which you have
Quick answer: compute both. RPM-bound throughput = RPM / 60. TPM-bound throughput = TPM / (input_tokens + output_tokens) / 60. Whichever is smaller is your bottleneck. RPM-bound profile: high-concurrency fleet of small requests — classification, triage, short-form Q&A, tool routing. Common in agent orchestration where many small sub-queries run in parallel. TPM-bound profile: fewer larger requests — RAG queries with large context, long-form generation, document summarization, multi-turn conversation with full history replay.
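The two formulas combine into a small bottleneck check. A sketch, with illustrative names; the tier values are the Anthropic Tier 4 figures used in this section:

```python
def effective_throughput_rps(rpm: float, tpm: float,
                             input_tokens: int, output_tokens: int):
    """Return (sustainable req/sec, limiting axis) for a given request shape."""
    rpm_bound = rpm / 60.0
    tpm_bound = tpm / (input_tokens + output_tokens) / 60.0
    if tpm_bound < rpm_bound:
        return tpm_bound, "TPM"
    return rpm_bound, "RPM"

# 4000 RPM / 400K TPM tier, 4K-token requests (3500 in + 500 out):
# TPM permits only 400K / 4000 = 100 req/min -> TPM-bound
rps, axis = effective_throughput_rps(4000, 400_000, 3500, 500)

# Same tier, 100-token requests: TPM permits 4000 req/min, the RPM cap -> RPM-bound
_, axis_small = effective_throughput_rps(4000, 400_000, 80, 20)
```

Whichever axis comes back tells you which optimization lever applies: concurrency and tier for RPM, prompt compression and caching for TPM.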
Worked example, Anthropic Tier 4: 4000 RPM / 400K TPM. With 100-token requests, TPM permits 400K / 100 = 4000 req/min, exactly the 4000 RPM cap (~67 req/sec), so anything smaller than 100 tokens per request is RPM-bound. With 4000-token requests, TPM permits only 400K / 4K = 100 req/min (~1.7 req/sec), far below the RPM cap: TPM-bound. Same tier, same money, 40x different sustainable throughput depending on request shape.
Burst absorption — why steady-state math under-counts
Quick answer: real traffic has peaks. Average 5 req/sec often means 50 req/sec during peak hour, 2 req/sec overnight. If you size capacity for the average, peak traffic either queues or drops. Two strategies: (1) size capacity for peak (expensive but simple); (2) size for average + add buffering to absorb peaks (cheaper but requires queue depth + graceful back-pressure). Rule of thumb: dimension for bursts of 2-3x average; beyond 5x, rate-shape at ingress rather than absorb with a queue.
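The queue depth needed to absorb a burst above steady-state capacity is just excess arrival rate times burst duration; a sketch (names and figures are illustrative):

```python
import math

def burst_queue_depth(peak_rps: float, capacity_rps: float,
                      burst_duration_s: float) -> int:
    """Requests that accumulate while arrivals exceed service capacity."""
    excess = max(0.0, peak_rps - capacity_rps)
    return math.ceil(excess * burst_duration_s)

# Sized for 10 req/sec, hit with a 30-second burst at 25 req/sec:
# (25 - 10) * 30 = 450 requests must queue (or be shed at ingress)
depth = burst_queue_depth(25, 10, 30)
```

If that number exceeds what your timeout budget tolerates (queued requests still have to finish within the client's deadline), buffering alone cannot save you and you are back to strategy (1) or ingress shaping.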
M/M/c queue theory gives the baseline math: with c concurrent workers + Poisson arrivals + exponential service time, expected queue length at utilization ρ = λ/(cμ) is computable. Real LLM traffic deviates: service time has heavy tails (80% quick, 20% long-generation), arrivals are bursty not Poisson. Tool uses M/M/c as a floor approximation — actual queue depth in production will be 1.5-3x what the model predicts; validate via load test at 1.5x projected peak.
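Under those assumptions (Poisson arrivals, exponential service, c workers) the expected queue length follows from the Erlang C formula; a sketch of the floor approximation described above:

```python
from math import factorial

def mmc_queue_length(arrival_rps: float, service_rps: float, c: int) -> float:
    """Expected queue length Lq for an M/M/c queue via the Erlang C formula.
    arrival_rps = lambda, service_rps = mu per worker, c = concurrent workers."""
    a = arrival_rps / service_rps              # offered load in Erlangs
    rho = a / c                                # utilization
    if rho >= 1:
        raise ValueError("unstable: arrival rate exceeds total service rate")
    erlang_c_num = a**c / (factorial(c) * (1 - rho))
    erlang_c_den = sum(a**k / factorial(k) for k in range(c)) + erlang_c_num
    p_wait = erlang_c_num / erlang_c_den       # P(arriving request must queue)
    return p_wait * rho / (1 - rho)            # expected queue length Lq

# 5 req/sec, 2 s mean service time (mu = 0.5/sec), 12 workers: rho ~ 0.83
lq = mmc_queue_length(5.0, 0.5, 12)
```

Per the caveat above, treat the result as a floor: multiply by 1.5-3x for heavy-tailed generation times and bursty arrivals, then confirm with a load test.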
Tier upgrade vs multi-key fan-out vs multi-provider
Quick answer: three scaling strategies when you outgrow a tier. (1) TIER UPGRADE — simplest, no code change, typically 2-10x capacity for 1.5-2x cost. Anthropic Tier 4 requires $400+ historical spend; OpenAI Tier 5 requires $1000+ spend + 30 days account age. (2) MULTI-KEY FAN-OUT — multiple API keys on same provider, round-robin'd via load balancer; doubles/triples cap without tier jump; needs key management + load balancer logic. (3) MULTI-PROVIDER FAN-OUT — Anthropic + OpenAI + Gemini in parallel; diversifies against single-provider outage; adds latency variance since providers respond at different speeds; needs provider-agnostic prompt shape OR per-provider adapters.
Ranking for most orgs: tier upgrade first (cheapest), multi-key if spend-gate blocks upgrade, multi-provider only when reliability requirement justifies complexity. Multi-provider is also the right answer for latency-critical paths where you race the same request against 2-3 providers and take the first response — adds cost but halves P95 latency.
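The round-robin core of strategy (2) is small; a sketch with placeholder key names (a production version also needs per-key 429 tracking and failover, which this omits):

```python
import itertools

class KeyRotator:
    """Round-robin across multiple API keys so load spreads over each key's
    separate rate pool. Key names here are placeholders, not credentials."""
    def __init__(self, keys: list[str]):
        self._cycle = itertools.cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)

rotator = KeyRotator(["key-a", "key-b", "key-c"])
picks = [rotator.next_key() for _ in range(6)]
# cycles key-a, key-b, key-c, key-a, key-b, key-c
```

Note that round-robin assumes roughly uniform request shapes; if one key serves the 4K-token RAG traffic it will hit its TPM pool long before the others, so shape-aware routing beats naive rotation in mixed workloads.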
Batches API — 50% off for patient workloads
Quick answer: both Anthropic + OpenAI offer Batches API for async processing at 50% discount. Submit multiple requests in a single file, 24h SLA for processing. Use cases: classification-at-scale, dataset labeling, overnight analytics, summarization of historical content, content-moderation post-hoc. Not for: user-facing interactive workflows, sub-minute latency paths, single-request debugging. Break-even: any workload where 24h latency is acceptable AND volume exceeds ~1000 requests (below that, per-request batch overhead isn't worth it).
Batch API often has SEPARATE rate pool from real-time API — so routing non-interactive work to batch frees up your real-time rate pool for user-facing requests. Double win: 50% cheaper AND doesn't compete with interactive traffic. Anthropic Batches: 24h max completion, no priority tier. OpenAI Batch: 24h max, separate from real-time TPM pool.
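The routing rule above reduces to a one-line predicate; a hedged sketch (the function name and the ~1000-request threshold are taken from the break-even discussion, not from any provider API):

```python
def route_to_batch(latency_tolerance_h: float, request_count: int,
                   min_batch_volume: int = 1000) -> bool:
    """Route work to the Batches API when a 24 h turnaround is acceptable
    and volume clears the break-even threshold."""
    return latency_tolerance_h >= 24 and request_count >= min_batch_volume

assert route_to_batch(48, 50_000)        # overnight labeling job: batch
assert not route_to_batch(0.01, 50_000)  # interactive traffic: real-time
assert not route_to_batch(48, 200)       # tiny job: overhead not worth it
```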
Prompt caching — throughput multiplier that looks like a cost optimization
Quick answer: prompt caching isn't only a cost-saver, it's a throughput-amplifier for TPM-bound workloads. Anthropic's explicit cache (5-min default TTL, 1-hour TTL available at a higher write premium) prices cache reads at 10% of the base input rate, and on models with cache-aware rate limits, cache-read tokens don't count against the input-TPM pool. OpenAI's caching is automatic for prompts above ~1024 tokens, discounts cached input (typically 50%), and retains prefixes for roughly 5-10 minutes of inactivity. For a workload with a 2000-token static preamble across many requests, caching roughly halves the TPM consumed per request even though the raw tokens sent look the same to your app. You go from 100 RPS TPM-bound to ~160-180 RPS without a tier upgrade.
On Anthropic, a cache write costs +25% of the base input rate (5-min TTL); reads cost 10%. Break-even arrives at the first cache read within the TTL: one write plus one read costs 1.35x versus 2.0x for two uncached sends. OpenAI's automatic caching has no write premium, so there is no break-even to clear. For production apps with many requests per TTL sharing a preamble: always on. For one-off requests: no benefit. Tool's TPM-bound recommendation flags the caching opportunity when input_tokens > 1000.
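The break-even arithmetic can be checked directly. A sketch using Anthropic-style multipliers (write 1.25x, read 0.10x); the function name is illustrative:

```python
def cached_vs_uncached_cost(n_requests: int, write_mult: float = 1.25,
                            read_mult: float = 0.10) -> tuple[float, float]:
    """Relative input cost of the shared prefix for n requests inside one TTL:
    one cache write plus (n - 1) reads, versus n full-price sends.
    Multipliers follow Anthropic-style cache pricing (an assumption to verify
    against current provider docs)."""
    cached = write_mult + read_mult * (n_requests - 1)
    uncached = float(n_requests)
    return cached, uncached

# Break-even at the very first cache read: 1.25 + 0.10 = 1.35 < 2.00
cached, uncached = cached_vs_uncached_cost(2)
```

At 10 requests per TTL the cached cost is 2.15x against 10.0x uncached, which is why "always on" is the right default for any shared preamble with real traffic.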
What this planner does NOT model
Four classes of behavior require different analysis. (1) STREAMING LATENCY — tool uses average latency as a scalar; streaming responses deliver TTFT (time-to-first-token) much faster than total-completion latency, which shifts perceived latency for user-facing apps. Use TTFT for streaming-focused SLO analysis. (2) PROVIDER TIER AUTO-PROMOTION — OpenAI automatically promotes tier based on spend + account age; projected capacity may grow over time without explicit upgrade. (3) SERVER-SIDE BACKPRESSURE — some providers return 429 before hitting published limits when overall system load is high (shared capacity across all tenants). Build retry-with-backoff regardless of math. (4) OUTPUT LENGTH VARIANCE — average output tokens hides heavy-tail distribution (median 200 tokens but p99 at 2000). TPM calculations should use p95 output length for safety margin.
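For point (4), a nearest-rank p95 helper shows why mean-based TPM sizing under-counts; the sample lengths below are hypothetical:

```python
import math

def p95(samples: list[int]) -> int:
    """Nearest-rank 95th percentile. Use this, not the mean, for TPM sizing."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

# Heavy-tailed output lengths: median ~200 tokens, occasional 2000-token run
lengths = [180, 200, 210, 190, 220, 205, 195, 215, 200, 2000]
# Mean is ~380, but p95 catches the tail that actually drains the TPM pool
```

Feeding p95 rather than mean output length into the TPM-bound formula builds in the safety margin the section recommends.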
Sources + further reading
Anthropic API Rate Limits (docs.anthropic.com/en/api/rate-limits) — canonical source for Anthropic tier RPM / ITPM / OTPM values + upgrade criteria. OpenAI Rate Limits (platform.openai.com/docs/guides/rate-limits) — OpenAI tier structure + auto-promotion rules + 429 error semantics. Google Gemini API rate limits (ai.google.dev) — free vs paid tier comparison. J.D.C. Little (1961) "A Proof for the Queuing Formula: L = λW," Operations Research 9(3) — original Little's Law paper. Harchol-Balter, M. (2013) Performance Modeling and Design of Computer Systems: Queueing Theory in Action (Cambridge University Press) — comprehensive M/M/c + queue-theory reference. For practical LLM deployment patterns: Chip Huyen's Designing Machine Learning Systems (O'Reilly, 2022) + Anthropic engineering blog + OpenAI production best-practices documentation (all updated continuously 2024-2026).
LLM Rate-Limit + Throughput Planner Tool v1 · canonical sources cited inline above · runs entirely client-side, no data transmitted
Informational & educational tool. Outputs are planning estimates for educational purposes and do not constitute financial, legal, or professional advice. Verify rate limits, pricing, and tier criteria against current provider documentation before relying on results for capacity, cost, or SLA decisions.