Embedding Models Compared — Dimensions, Speed, Cost, and Retrieval Quality
20+ embedding models compared across 6 dimensions with MTEB benchmarks, domain-specific performance data, dimensionality analysis, and a selection decision framework.
You Chose Your Embedding Model Based on a Leaderboard — But Leaderboard Rankings Don’t Predict Your Retrieval Quality
The MTEB leaderboard ranks 200+ embedding models across 8 task categories. Teams pick the top-ranked model and discover their retrieval quality is mediocre — because leaderboard tasks don’t match production tasks, leaderboard datasets don’t match production data, and the ranking aggregates across categories where your application only uses one. The model that scores highest overall may rank 15th on the specific retrieval task you care about. This guide provides the comparison data across 20+ models on the dimensions that actually determine production retrieval quality, with domain-specific performance data and the selection framework that matches model capabilities to application requirements.
The Comparison Table
Models ranked by average retrieval performance (not overall MTEB score). Scores are NDCG@10 on retrieval-specific benchmarks. Prices as of early 2026.
| Model | Provider | Dimensions | Max tokens | Retrieval NDCG@10 | Speed (tokens/sec) | Price ($/1M tokens) | Hosting |
|---|---|---|---|---|---|---|---|
| voyage-3-large | Voyage AI | 1024 | 32,000 | 72.3 | 3,500 | $0.18 | API |
| GTE-Qwen2-7B | Alibaba | 3584 | 131,072 | 71.5 | 800 | Free (self-hosted) | Self-hosted |
| text-embedding-3-large | OpenAI | 3072 | 8,191 | 70.1 | 5,000 | $0.13 | API |
| embed-v4.0 | Cohere | 1024 | 512 | 69.8 | 6,000 | $0.10 | API |
| NV-Embed-v2 | NVIDIA | 4096 | 32,768 | 69.3 | 500 | Free (self-hosted) | Self-hosted |
| voyage-3 | Voyage AI | 1024 | 32,000 | 68.9 | 4,000 | $0.06 | API |
| jina-embeddings-v3 | Jina AI | 1024 | 8,192 | 68.1 | 4,000 | $0.02 | API/Self-hosted |
| BGE-en-ICL | BAAI | 4096 | 32,768 | 67.8 | 700 | Free (self-hosted) | Self-hosted |
| stella-en-1.5B-v5 | dunzhang | 1024 | 131,072 | 67.4 | 1,200 | Free (self-hosted) | Self-hosted |
| E5-mistral-7b-instruct | Microsoft | 4096 | 32,768 | 66.6 | 600 | Free (self-hosted) | Self-hosted |
| Gemini text-embedding-004 | Google | 768 | 2,048 | 66.0 | 7,000 | $0.004 | API |
| Cohere embed-english-v3 | Cohere | 1024 | 512 | 65.4 | 6,000 | $0.10 | API |
| mxbai-embed-large | Mixedbread | 1024 | 512 | 64.7 | 4,500 | Free (self-hosted) | Self-hosted |
| text-embedding-3-small | OpenAI | 1536 | 8,191 | 64.2 | 10,000 | $0.02 | API |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | 63.6 | 3,000 | Free (self-hosted) | Self-hosted |
| Nomic Embed v1.5 | Nomic AI | 768 | 8,192 | 62.3 | 5,000 | $0.01 | API/Self-hosted |
| Amazon Titan Text v2 | AWS | 1024 | 8,192 | 61.0 | 3,000 | $0.02 | API (Bedrock) |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 256 | 51.8 | 15,000 | Free (self-hosted) | Self-hosted |
Key insights: The top 5 models are within 4 NDCG points of each other — the gap between “best” and “5th best” is smaller than the gap between your domain data and benchmark data. Self-hosted 7B models (GTE-Qwen2, NV-Embed-v2) match or exceed hosted API models but require GPU infrastructure. all-MiniLM-L6-v2 is still widely used due to speed and simplicity but trails modern models by 10-20 NDCG points.
Domain-Specific Performance
Generic MTEB scores hide massive performance variance across domains. A model that excels on general web text may underperform on code, legal, or medical content.
| Model | General text | Code/technical | Legal/regulatory | Medical/biomedical | Financial | Conversational |
|---|---|---|---|---|---|---|
| voyage-3-large | 72.3 | 74.1 | 68.5 | 65.2 | 70.1 | 71.8 |
| text-embedding-3-large | 70.1 | 68.3 | 67.2 | 63.8 | 68.5 | 69.4 |
| voyage-3 (code variant) | 65.8 | 75.2 | 62.1 | 58.4 | 64.2 | 66.7 |
| GTE-Qwen2-7B | 71.5 | 72.0 | 69.8 | 67.3 | 71.2 | 68.5 |
| embed-v4.0 | 69.8 | 66.5 | 68.1 | 64.0 | 67.8 | 72.1 |
| BGE-large-en-v1.5 | 63.6 | 61.2 | 60.5 | 58.8 | 62.3 | 63.0 |
| text-embedding-3-small | 64.2 | 62.8 | 61.0 | 58.5 | 63.1 | 64.8 |
| Nomic Embed v1.5 | 62.3 | 63.5 | 58.2 | 55.1 | 60.8 | 62.0 |
The domain gap: Medical/biomedical performance is 5-10 NDCG points below general text for most models. This gap is real — general models haven’t seen enough domain-specific training data to capture medical terminology and relationships. If your domain is highly specialized, evaluate on your own data before committing.
Dimensionality — Does Higher Always Mean Better?
More dimensions capture more semantic nuance — but at a cost in storage, query latency, and index size.
| Dimensions | Storage per vector | Index size (1M vectors) | Query latency (HNSW) | Retrieval quality (relative) |
|---|---|---|---|---|
| 384 | 1.5 KB | 1.5 GB | 1-3 ms | Baseline |
| 768 | 3 KB | 3 GB | 2-5 ms | +5-8% over 384 |
| 1024 | 4 KB | 4 GB | 3-6 ms | +7-10% over 384 |
| 1536 | 6 KB | 6 GB | 4-8 ms | +8-12% over 384 |
| 3072 | 12 KB | 12 GB | 6-12 ms | +10-14% over 384 |
| 4096 | 16 KB | 16 GB | 8-15 ms | +11-15% over 384 |
The diminishing returns curve: Moving from 384 → 1024 dimensions gives the biggest quality gain per storage cost. Moving from 1024 → 3072 gives marginal quality improvement at 3x storage cost. Beyond 1024 dimensions, the quality-to-cost ratio flattens.
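The storage column follows directly from float32 vectors at 4 bytes per dimension. A quick sketch of the arithmetic (raw vector storage only; an HNSW graph adds overhead on top):

```python
def vector_storage(dims: int, n_vectors: int, bytes_per_float: int = 4) -> dict:
    """Raw float32 vector storage, excluding index graph overhead."""
    per_vector = dims * bytes_per_float          # e.g. 1024 dims * 4 bytes = 4096 bytes
    total = per_vector * n_vectors
    return {"per_vector_kb": per_vector / 1024,
            "index_gb": total / 1024**3}

# 1M vectors at 1024 dims: 4.0 KB each, ~3.81 GiB total
# (the table above rounds to decimal GB)
print(vector_storage(1024, 1_000_000))
```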
Matryoshka Embeddings (Variable Dimensions)
Some models (OpenAI text-embedding-3, Nomic Embed, jina-embeddings-v3) support Matryoshka representation — you can truncate embeddings to fewer dimensions at query time with controlled quality degradation.
| Model (native dims) | Truncated to 256 | Truncated to 512 | Truncated to 1024 | Full dimensions |
|---|---|---|---|---|
| text-embedding-3-large (3072) | 62.0 (-8.1) | 66.8 (-3.3) | 69.5 (-0.6) | 70.1 |
| text-embedding-3-small (1536) | 59.5 (-4.7) | 62.8 (-1.4) | 64.2 (native) | 64.2 |
| jina-embeddings-v3 (1024) | 63.2 (-4.9) | 66.5 (-1.6) | 68.1 (native) | 68.1 |
| Nomic Embed v1.5 (768) | 58.1 (-4.2) | 61.0 (-1.3) | N/A | 62.3 |
Practical application: Matryoshka lets you start with smaller vectors (lower cost) and increase dimensions later without re-embedding. Truncating text-embedding-3-large to 1024 gives 99% of full quality at 33% of storage cost.
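Matryoshka truncation itself is just slicing plus re-normalization. A minimal sketch in plain Python (the toy vector is illustrative, not a real embedding):

```python
import math

def truncate_matryoshka(embedding: list[float], target_dims: int) -> list[float]:
    """Truncate a Matryoshka embedding and re-normalize to unit length.

    Re-normalization matters: cosine similarity assumes unit vectors,
    and slicing changes the norm.
    """
    sliced = embedding[:target_dims]
    norm = math.sqrt(sum(x * x for x in sliced))
    return [x / norm for x in sliced]

# Toy 8-dim "embedding" truncated to 4 dims; result is unit-length again
vec = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
short = truncate_matryoshka(vec, 4)
print(len(short), round(sum(x * x for x in short), 6))  # 4 1.0
```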
Context Window — When It Matters
Most embedding models have 512-8K token context windows. Newer models push to 32K-128K. When does this matter?
| Context window | What you can embed | Use case | Risk of shorter windows |
|---|---|---|---|
| 256 tokens | ~1 paragraph | Sentence similarity, short answers | Truncates most document chunks |
| 512 tokens | ~2 paragraphs | Standard chunk embedding | Safe for 256-token chunks |
| 8,192 tokens | ~4-6 pages | Full document sections, long chunks | Adequate for most RAG applications |
| 32,768 tokens | ~20+ pages | Full documents, long-form content | Rarely needed; full-document embedding loses granularity |
| 131,072 tokens | Entire books | Document-level semantic search | Embedding an entire document into one vector loses paragraph-level precision |
Key insight: Longer context windows don’t mean you should embed longer text. A 128K-token model embedding an entire 50-page document into one 1024-dim vector cannot distinguish between paragraph-level concepts. Long context windows are useful for chunking strategies that need larger-than-usual chunks — not for eliminating chunking entirely.
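When larger-than-usual chunks are warranted, a token-budgeted chunker with overlap can be sketched as below. It uses a rough words-to-tokens heuristic (about 1.3 tokens per English word, an assumption); swap in your model's real tokenizer for exact counts.

```python
def chunk_by_tokens(text: str, max_tokens: int = 256, overlap: int = 32,
                    tokens_per_word: float = 1.3) -> list[str]:
    """Split text into overlapping chunks under an approximate token budget.

    The words-to-tokens ratio is a heuristic; use a real tokenizer
    for exact budget enforcement.
    """
    words = text.split()
    words_per_chunk = max(1, round(max_tokens / tokens_per_word))
    step = max(1, words_per_chunk - round(overlap / tokens_per_word))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
print(len(chunk_by_tokens(doc, max_tokens=130, overlap=13)))  # 6
```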
Cost Analysis at Scale
| Scale | text-embedding-3-small | text-embedding-3-large | voyage-3 | GTE-Qwen2 (self-hosted) | Gemini 004 |
|---|---|---|---|---|---|
| 100K documents (avg 2K tokens) | $4 | $26 | $12 | $0 + $5 GPU hours | $0.80 |
| 1M documents | $40 | $260 | $120 | $0 + $50 GPU hours | $8 |
| 10M documents | $400 | $2,600 | $1,200 | $0 + $500 GPU hours | $80 |
| Query cost (100K/month) | $4/mo | $26/mo | $12/mo | $100-300/mo GPU | $0.80/mo |
The self-hosted crossover: API embedding is cheaper up to ~5M documents. Beyond that, self-hosted GPU costs (A100 at $2-3/hr) become competitive — especially when query volume is high. The breakeven depends on your query volume vs. one-time embedding volume.
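The breakeven arithmetic can be sketched as follows. The GPU throughput and hourly rate are illustrative assumptions: batched embedding throughput far exceeds the per-query speeds in the comparison table, and ~28K tokens/sec at $2.50/hr (roughly A100 cloud pricing) is assumed here.

```python
def api_embedding_cost(n_docs: int, avg_tokens: int, price_per_m_tokens: float) -> float:
    """One-time cost to embed a corpus through a hosted API."""
    return n_docs * avg_tokens / 1_000_000 * price_per_m_tokens

def gpu_embedding_cost(n_docs: int, avg_tokens: int,
                       batch_tokens_per_sec: float = 28_000,
                       gpu_per_hour: float = 2.50) -> float:
    """One-time GPU cost to embed a corpus on rented hardware.

    ASSUMPTIONS: ~28K tokens/sec batched throughput, $2.50/hr GPU rate.
    Adjust both for your hardware and provider.
    """
    gpu_hours = n_docs * avg_tokens / batch_tokens_per_sec / 3600
    return gpu_hours * gpu_per_hour

# 10M docs at 2K tokens each: voyage-3 API vs. self-hosted embedding
print(api_embedding_cost(10_000_000, 2000, 0.06))   # 1200.0
print(round(gpu_embedding_cost(10_000_000, 2000)))  # 496
```

Note this covers only the one-time embedding pass; the ongoing GPU cost of serving query embeddings is what dominates the self-hosted TCO below.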
Total Cost of Ownership (1M docs, 100K queries/month, 12 months)
| Model | Embedding cost | Query cost (12 mo) | Vector DB storage | Total 12-month | Quality (NDCG@10) |
|---|---|---|---|---|---|
| text-embedding-3-small | $40 | $48 | $50 (6 GB) | $138 | 64.2 |
| text-embedding-3-large | $260 | $312 | $100 (12 GB) | $672 | 70.1 |
| voyage-3 | $120 | $144 | $35 (4 GB) | $299 | 68.9 |
| GTE-Qwen2 (self-hosted) | $50 (GPU) | $2,400 (GPU lease) | $35 (4 GB) | $2,485 | 71.5 |
| Gemini 004 | $8 | $10 | $25 (3 GB) | $43 | 66.0 |
| jina-embeddings-v3 | $40 | $48 | $35 (4 GB) | $123 | 68.1 |
Best value: Gemini text-embedding-004 at $43/year for 66.0 NDCG is the most cost-effective option if quality above 65 is sufficient. jina-embeddings-v3 at $123/year for 68.1 NDCG offers the best quality-per-dollar in the mid-range. Self-hosted GTE-Qwen2 has the highest quality but the highest cost due to continuous GPU requirements for query serving.
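The "best value" claim can be checked by computing marginal quality per dollar from the TCO table; the costs and scores below are copied from it.

```python
# 12-month TCO ($) and NDCG@10 figures from the table above
tco = {
    "text-embedding-3-small": (138, 64.2),
    "text-embedding-3-large": (672, 70.1),
    "voyage-3": (299, 68.9),
    "GTE-Qwen2 (self-hosted)": (2485, 71.5),
    "Gemini 004": (43, 66.0),
    "jina-embeddings-v3": (123, 68.1),
}

# NDCG gained per extra $100/year, relative to the cheapest option
base_cost, base_ndcg = tco["Gemini 004"]
for name, (cost, ndcg) in sorted(tco.items(), key=lambda kv: kv[1][0]):
    if cost == base_cost:
        continue
    marginal = (ndcg - base_ndcg) / (cost - base_cost) * 100
    print(f"{name}: {marginal:+.2f} NDCG per extra $100/yr")
```

On these numbers, jina-embeddings-v3 has the steepest quality gain per marginal dollar, consistent with the mid-range recommendation above.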
Selection Decision Framework
| Your requirement | Recommended model | Why |
|---|---|---|
| Lowest possible cost | Gemini text-embedding-004 | $0.004/1M tokens — 5x cheaper than next cheapest |
| Best retrieval quality (API) | voyage-3-large | Highest retrieval NDCG among hosted models |
| Best retrieval quality (any) | GTE-Qwen2-7B | Highest retrieval NDCG, requires GPU |
| Code/technical content | voyage-3 (code) | Trained specifically on code; highest code retrieval score |
| Multilingual | embed-v4.0 (Cohere) | 100+ languages with minimal quality degradation |
| Privacy/on-premise required | BGE-large-en-v1.5 or GTE-Qwen2 | Open-source, self-hosted, no data leaves your infrastructure |
| Variable dimension needs | text-embedding-3-large | Matryoshka support — truncate dimensions without re-embedding |
| Long documents (>8K tokens) | voyage-3-large or GTE-Qwen2 | 32K-128K context window |
| Speed-critical (real-time) | all-MiniLM-L6-v2 | 15K tokens/sec; acceptable for real-time with quality tradeoff |
| AWS ecosystem (Bedrock) | Amazon Titan Text v2 | Native Bedrock integration, no external API dependency |
How to Apply This
Use the token-counter tool to estimate your embedding corpus size in tokens — this determines the one-time embedding cost and ongoing query cost.
Always evaluate on your own data. The performance gap between benchmark data and your domain data can be 5-15 NDCG points. Embed 1,000 representative documents, run 100 representative queries, and measure retrieval relevance before committing to a model.
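For the relevance measurement step, a minimal NDCG@10 with binary relevance works well. This sketch assumes you have, per query, a ranked list of retrieved document IDs and a hand-labeled set of relevant IDs:

```python
import math

def ndcg_at_k(ranked_doc_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: 1 if the doc is in the labeled set, else 0."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_doc_ids[:k]) if doc in relevant)
    # Ideal DCG: all relevant docs ranked first
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# One query's top-10 against its labeled relevant docs
ranking = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
labels = {"d3", "d1", "d5"}
print(round(ndcg_at_k(ranking, labels), 3))  # 0.845
```

Average this over your 100 representative queries per candidate model, and compare those numbers rather than leaderboard scores.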
Start with Matryoshka-capable models (text-embedding-3-large, jina-embeddings-v3) if you’re uncertain about dimension requirements. You can truncate to 256 dimensions for prototyping and increase to full dimensions for production without re-embedding.
Don’t over-index on MTEB leaderboard rank. The difference between rank 1 and rank 10 on retrieval tasks is typically 3-5 NDCG points. Domain match, cost, and operational requirements matter more than leaderboard position.
Honest Limitations
MTEB scores update as models are re-evaluated; rankings shift quarterly. Domain-specific performance data is estimated from limited public benchmarks — your domain may differ. Self-hosted cost estimates assume cloud GPU pricing; on-premise GPU costs have different economics. Speed measurements vary by batch size, hardware, and quantization; the listed values are approximate mid-range. The “Gemini is cheapest” finding depends on Google’s current pricing, which has changed multiple times. New embedding models release monthly — any comparison table has a shelf life of 3-6 months. Matryoshka quality degradation varies by query type; some queries are more sensitive to dimensionality reduction than others.