The “open vs closed” framing misleads because it implies a binary. The reality is a spectrum: fully closed (GPT-4o, Claude — API only, no weights), open-weight (Llama 3.1, Mistral — weights available, restrictive license), truly open (OLMo, Pythia — weights, data, training code, permissive license). Each point on this spectrum has different cost structures, capability ceilings, privacy guarantees, and legal constraints. This guide provides the comparison data across cost, performance, control, and risk — so you can make the decision based on engineering requirements, not ideology.

The Model Openness Spectrum

| Category | Examples | What you get | What you don't get | License type |
|---|---|---|---|---|
| Fully closed | GPT-4o, Claude Opus/Sonnet, Gemini Pro | API access only | Weights, training data, architecture details | Proprietary API terms |
| Open-weight (restricted) | Llama 3.1, Mistral Large, Gemma 2 | Model weights for download | Training data, training code; commercial restrictions | Custom (Meta Community, Mistral Research) |
| Open-weight (permissive) | Mistral 7B (Apache 2.0), Qwen 2.5 | Model weights, permissive license | Training data, training code | Apache 2.0, MIT |
| Fully open | OLMo 2, Pythia, BLOOM | Weights, training data, training code | Nothing — full reproducibility | Apache 2.0 |

Key distinction: “Open-weight” models (Llama, Gemma) have usage restrictions. Llama 3.1’s license prohibits use by companies with >700M monthly active users and requires attribution. Mistral’s research license restricts commercial use of some models. Always read the license — “open” in the AI industry does not mean what it means in traditional open-source software.

Performance Comparison

Quality is measured across standard benchmarks (MMLU, HumanEval, MT-Bench) and practical tasks. Rankings shift quarterly as models update.

| Model | Parameters | MMLU | HumanEval | MT-Bench | Practical quality tier | API cost (per 1M output tokens) |
|---|---|---|---|---|---|---|
| GPT-4o | Unknown | 88.7 | 90.2 | 9.3 | Tier 1 (frontier) | $10.00 |
| Claude Opus 4 | Unknown | 89.1 | 91.5 | 9.4 | Tier 1 (frontier) | $75.00 |
| Claude Sonnet 4 | Unknown | 87.5 | 88.3 | 9.1 | Tier 1 (frontier) | $15.00 |
| Gemini 2.5 Pro | Unknown | 88.2 | 87.1 | 9.2 | Tier 1 (frontier) | $10.00 |
| GPT-4.1 | Unknown | 87.8 | 92.0 | 9.2 | Tier 1 (frontier) | $8.00 |
| Llama 3.1 405B | 405B | 86.0 | 80.4 | 8.7 | Tier 2 (near-frontier) | $2-5 (self-hosted) |
| Mistral Large 2 | ~123B | 84.0 | 78.5 | 8.5 | Tier 2 (near-frontier) | $4.00 (API) |
| Qwen 2.5 72B | 72B | 83.5 | 79.0 | 8.4 | Tier 2 (near-frontier) | $1-3 (self-hosted) |
| Llama 3.1 70B | 70B | 82.0 | 73.8 | 8.2 | Tier 2 | $1-3 (self-hosted) |
| Mistral 7B | 7B | 64.0 | 45.0 | 7.0 | Tier 3 (capable small) | $0.10-0.30 (self-hosted) |
| Llama 3.1 8B | 8B | 68.0 | 50.5 | 7.2 | Tier 3 (capable small) | $0.10-0.30 (self-hosted) |
| Gemma 2 9B | 9B | 71.3 | 54.0 | 7.5 | Tier 3 (capable small) | $0.10-0.30 (self-hosted) |

The quality gap: Frontier closed models (GPT-4o, Claude Sonnet 4) lead open-weight models by 3-10 points on benchmarks and 0.5-1.0 points on MT-Bench. This gap is meaningful for complex reasoning and creative tasks but negligible for classification, extraction, and structured output tasks where open 70B+ models perform comparably.

Total Cost of Ownership

API (Closed Model) Cost at Scale

| Monthly volume | GPT-4o-mini | GPT-4o | Claude Sonnet 4 | Claude Haiku 3.5 |
|---|---|---|---|---|
| 100K requests (1K tokens each) | $75 | $1,250 | $1,800 | $480 |
| 1M requests | $750 | $12,500 | $18,000 | $4,800 |
| 10M requests | $7,500 | $125,000 | $180,000 | $48,000 |
| 100M requests | $75,000 | $1,250,000 | $1,800,000 | $480,000 |
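API spend scales linearly with token volume, so the rows above reduce to one multiplication. A minimal sketch (the ~$12.50 blended rate per 1M tokens for GPT-4o is inferred from the 1M-request row above; adjust for your own input/output mix):

```python
def monthly_api_cost(requests: int, tokens_per_request: int,
                     price_per_1m_tokens: float) -> float:
    """Monthly API spend: total tokens divided by 1M, times the per-1M-token price."""
    return requests * tokens_per_request / 1_000_000 * price_per_1m_tokens

# 1M requests of ~1K tokens at a blended ~$12.50 per 1M tokens
print(monthly_api_cost(1_000_000, 1_000, 12.50))  # 12500.0
```

Because there is no fixed cost, halving volume halves spend, which is exactly what the self-hosted option cannot do.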

Self-Hosted (Open Model) Cost

| Model size | GPU required | Monthly cloud GPU cost | Monthly throughput | Effective $/1M output tokens |
|---|---|---|---|---|
| 7-8B (FP16) | 1× A10G (24GB) | $250-400 | ~5M requests | $0.05-0.08 |
| 7-8B (INT4) | 1× T4 (16GB) | $100-200 | ~3M requests | $0.03-0.07 |
| 70B (FP16) | 4× A100 (80GB) | $4,000-6,000 | ~500K requests | $0.50-1.20 |
| 70B (INT4) | 1× A100 (80GB) | $1,000-1,500 | ~200K requests | $0.50-0.75 |
| 405B (FP16) | 8× A100 (80GB) | $8,000-12,000 | ~100K requests | $2.00-5.00 |
| 405B (INT4) | 2× A100 (80GB) | $2,000-3,000 | ~50K requests | $1.00-3.00 |
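The effective per-token figure is just the fixed GPU lease divided by the tokens you actually push through it, which is why utilization matters so much for self-hosting. A minimal sketch (assumes ~1K output tokens per request, as in the API table):

```python
def effective_cost_per_1m_tokens(monthly_gpu_cost: float,
                                 monthly_requests: float,
                                 tokens_per_request: float = 1_000) -> float:
    """Effective self-hosted $/1M output tokens: fixed lease / millions of tokens served."""
    monthly_million_tokens = monthly_requests * tokens_per_request / 1_000_000
    return monthly_gpu_cost / monthly_million_tokens

# 7-8B FP16 on one A10G: $400/month lease at ~5M requests/month
print(round(effective_cost_per_1m_tokens(400, 5_000_000), 2))  # 0.08
```

The same formula run at half the throughput doubles the effective rate: an idle GPU costs the same as a busy one.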

The Cost Crossover Point

| Scenario | API cheaper until | Self-hosted cheaper after | Breakeven |
|---|---|---|---|
| 7-8B model vs GPT-4o-mini | Always use self-hosted (if you have ML ops) | From day 1 at scale | ~50K requests/month |
| 70B model vs GPT-4o | ~200K requests/month | >200K requests/month | $1,500/month GPU cost |
| 70B model vs Claude Sonnet 4 | ~100K requests/month | >100K requests/month | Earlier crossover due to higher API price |
| 405B model vs GPT-4o | ~500K requests/month | >500K requests/month | Only if quality is sufficient |
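The crossover point is where the API's linear cost curve meets self-hosting's fixed-plus-marginal curve. A minimal sketch of that algebra (assumes ~1K tokens per request; the $12.50 and $1,500 inputs below are illustrative values drawn from the tables above, not a pricing guarantee):

```python
def breakeven_requests(api_cost_per_1m_tokens: float,
                       monthly_fixed_cost: float,
                       self_hosted_cost_per_1m_tokens: float = 0.0,
                       tokens_per_request: float = 1_000) -> float:
    """Monthly request volume where self-hosting matches API spend.

    Solves api_rate * V = fixed + self_rate * V for V (millions of tokens),
    then converts token volume back to requests.
    """
    marginal_saving = api_cost_per_1m_tokens - self_hosted_cost_per_1m_tokens
    if marginal_saving <= 0:
        raise ValueError("self-hosting must be cheaper per token to ever break even")
    breakeven_million_tokens = monthly_fixed_cost / marginal_saving
    return breakeven_million_tokens * 1_000_000 / tokens_per_request

# Blended ~$12.50/1M API tokens vs a $1,500/month GPU lease
print(breakeven_requests(12.5, 1_500))  # 120000.0
```

Note this counts only the GPU lease as the fixed cost; folding in the hidden engineering costs discussed below pushes the breakeven volume substantially higher.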

Hidden costs of self-hosting: The GPU cost is the visible cost. Hidden costs include: ML engineering time ($10,000-20,000/month for dedicated ML ops), GPU orchestration and scaling ($500-2,000/month for infrastructure tooling), monitoring and debugging ($200-500/month for observability), model updates and fine-tuning ($1,000-5,000 per update cycle), and on-call support for GPU failures.

True Total Cost Comparison (1M requests/month)

| Cost component | GPT-4o (API) | Llama 3.1 70B (self-hosted) | Llama 3.1 8B (self-hosted) |
|---|---|---|---|
| Inference cost | $12,500 | $2,000 (GPU lease) | $300 (GPU lease) |
| ML engineering | $0 | $5,000 (partial FTE) | $2,000 (partial FTE) |
| Infrastructure | $0 | $500 (orchestration) | $200 (orchestration) |
| Monitoring | $0 | $300 | $200 |
| Total monthly | $12,500 | $7,800 | $2,700 |
| Quality tier | Tier 1 | Tier 2 | Tier 3 |

The real math: Self-hosted 70B is only 38% cheaper than GPT-4o at 1M requests — not the 80-90% savings the GPU cost alone suggests. At 10M requests, the savings percentage improves because engineering costs are fixed while inference costs scale linearly.
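The 38% figure falls straight out of summing the cost components in the table above. A minimal sketch of that arithmetic, using the table's own numbers:

```python
def total_monthly_cost(inference: float, engineering: float = 0.0,
                       infrastructure: float = 0.0, monitoring: float = 0.0) -> float:
    """True monthly TCO: inference plus the fixed operational overheads."""
    return inference + engineering + infrastructure + monitoring

api = total_monthly_cost(12_500)                          # GPT-4o at 1M requests
self_hosted_70b = total_monthly_cost(2_000, 5_000, 500, 300)  # Llama 3.1 70B
savings = 1 - self_hosted_70b / api
print(self_hosted_70b, f"{savings:.0%}")  # 7800.0 38%
```

Scaling the inference terms by 10x while holding the overheads fixed is what makes the savings percentage improve at higher volume.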

Decision Framework

| Your requirement | Open-weight | Closed API | Why |
|---|---|---|---|
| Data privacy (regulated industry) | Strongly preferred | Acceptable with DPA, BAA | Data stays on your infrastructure; no third-party processing |
| Cost at >1M requests/month | Preferred (if ML ops available) | Expensive | Self-hosted inference is cheaper at scale |
| Cost at <100K requests/month | More expensive (GPU overhead) | Preferred | API avoids fixed infrastructure costs |
| Maximum quality (complex reasoning) | Frontier gap remains | Preferred | GPT-4o, Claude Opus still lead on complex tasks |
| Maximum quality (classification/extraction) | Comparable | Comparable | 70B+ open models match frontier on structured tasks |
| Fine-tuning required | Required for full fine-tune | API fine-tuning available (limited) | Open weights allow LoRA, full fine-tune, custom training |
| No ML engineering team | Not feasible | Required | Self-hosting requires GPU ops expertise |
| Latency-sensitive (edge/on-device) | Required for edge | Not possible | API adds network latency; edge deployment requires open weights |
| Vendor independence | Required for independence | Creates provider dependency | Open weights can't be taken away |
| Compliance (EU, specific industries) | Often preferred | May require extensive DPA | Self-hosted simplifies data residency compliance |
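The framework above can be sketched as a decision function. This is an illustrative, deliberately simplified encoding (the rule ordering and return labels are my own, not an exhaustive policy; real decisions weigh the rows jointly):

```python
def deployment_recommendation(monthly_requests: int,
                              needs_data_privacy: bool,
                              has_ml_ops_team: bool,
                              needs_full_finetune: bool,
                              needs_edge: bool) -> str:
    """Illustrative encoding of the decision table: hard constraints first, cost last."""
    if needs_edge:
        return "self-hosted (edge, <8B quantized)"        # only open weights run on-device
    if needs_data_privacy and has_ml_ops_team:
        return "self-hosted (open-weight)"                # data stays on your infrastructure
    if needs_full_finetune or monthly_requests > 1_000_000:
        # open models win here; managed inference if you lack GPU ops expertise
        return "self-hosted (open-weight)" if has_ml_ops_team \
            else "managed inference (open-weight)"
    return "closed API"                                   # the default at low volume
```

Treat it as a starting checklist: any "strongly preferred" row in the table should override a cost-only conclusion.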

Deployment Architecture Options

| Architecture | Models supported | Complexity | Use case |
|---|---|---|---|
| Direct API | Closed models only | Low | Standard cloud deployment |
| Self-hosted (single GPU) | 7-8B models | Medium | Privacy-sensitive, moderate volume |
| Self-hosted (multi-GPU) | 70B-405B models | High | High volume, quality requirements |
| Hybrid (API + self-hosted) | Both | High | Route by task: complex→API, simple→self-hosted |
| Managed inference (Together, Fireworks, Groq) | Open-weight models | Low-medium | Self-hosted quality at API convenience |
| On-device/edge | <8B models (quantized) | High | Offline, privacy, real-time |
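The hybrid row reduces to a router that sends complex tasks to a frontier API and simple structured tasks to a self-hosted model. A minimal sketch of that shape (the classifier and both backends are assumptions you supply; in practice the classifier is often a cheap small model or a per-endpoint rule):

```python
from typing import Callable

def route(prompt: str,
          classify_complexity: Callable[[str], str],
          call_api: Callable[[str], str],
          call_self_hosted: Callable[[str], str]) -> str:
    """Hybrid routing: 'complex' prompts go to the frontier API, everything else stays local."""
    if classify_complexity(prompt) == "complex":
        return call_api(prompt)
    return call_self_hosted(prompt)
```

The design choice that matters is where the classifier errs: misrouting a complex task to the local model costs quality, misrouting a simple one to the API only costs money.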

Managed Inference Providers (Open Models via API)

| Provider | Models offered | Pricing model | Latency | Throughput |
|---|---|---|---|---|
| Together AI | Llama, Mistral, Qwen, + 100 others | Per-token (competitive) | Low | High |
| Fireworks AI | Llama, Mistral, custom fine-tunes | Per-token | Very low | Very high |
| Groq | Llama, Mistral, Gemma | Per-token (cheap) | Ultra-low (custom hardware) | Very high |
| Anyscale/Ray Serve | Any open model | GPU hours | Variable | Configurable |
| AWS Bedrock | Llama, Mistral, Titan | Per-token | Medium | High |

The managed inference middle ground: Together AI, Fireworks, and Groq offer open-weight models via API — giving you the cost benefits of open models without the operational burden of self-hosting. Groq’s custom LPU hardware delivers the lowest latency in the market.

How to Apply This

Use the token-counter tool to estimate your monthly token volume — this is the primary input to the cost crossover analysis between API and self-hosted deployment.
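If you cannot run a tokenizer over real traffic yet, a rough estimate from character counts is enough to feed the crossover math. A minimal sketch using the common ~4-characters-per-token heuristic for English text (the heuristic is an approximation; measure with your model's actual tokenizer before committing):

```python
def estimate_monthly_tokens(requests_per_month: int,
                            avg_prompt_chars: int,
                            avg_response_chars: int,
                            chars_per_token: float = 4.0) -> dict:
    """Rough monthly token volume from request counts and average message sizes."""
    return {
        "input": requests_per_month * avg_prompt_chars / chars_per_token,
        "output": requests_per_month * avg_response_chars / chars_per_token,
    }

# 1M requests/month, ~2K-char prompts, ~4K-char responses
print(estimate_monthly_tokens(1_000_000, 2_000, 4_000))
# roughly 500M input and 1B output tokens per month
```

Output tokens usually dominate cost at both closed-API and self-hosted pricing, so estimate the response size conservatively.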

Default to closed APIs unless you have a specific reason for open models. “Specific reason” means: data privacy requirements, >1M requests/month with ML ops team, need for full fine-tuning, or edge deployment. “Open source is better in principle” is not a sufficient reason to accept the operational complexity.

Evaluate on your actual tasks, not benchmarks. A 70B open model may match GPT-4o on your specific classification task while trailing by 10 points on reasoning benchmarks that don’t represent your workload.

Consider managed inference (Together, Fireworks, Groq) before self-hosting. You get open-model pricing and the ability to fine-tune without managing GPU infrastructure. Self-host only when managed inference doesn’t meet your privacy, latency, or cost requirements.

Plan for the gap to close. The quality gap between open and closed models has shrunk from 15-20 points (2023) to 3-5 points (2026). Technical decisions made today should account for continued narrowing — don’t build architecture that assumes closed models will always be better.

Honest Limitations

Benchmark scores (MMLU, HumanEval) correlate imperfectly with production task quality — always evaluate on your own data. Self-hosted cost estimates assume cloud GPU pricing; on-premise GPU economics differ significantly. The “hidden costs of self-hosting” estimates (ML engineering time, orchestration) vary enormously by team experience — an experienced ML team’s overhead is much lower. License terms for open-weight models can change — Meta’s Llama license has been modified between versions. Quantization (INT4, INT8) reduces self-hosting costs but introduces 1-3% quality degradation that varies by task. The performance comparison reflects a snapshot in time — new model releases can shift rankings within weeks. Some “open-weight” models have disputed training data provenance, which may create legal risk depending on jurisdiction.