Embedding Models Compared — Dimensions, Speed, Cost, and Retrieval Quality
20+ embedding models compared across 6 dimensions with MTEB benchmarks, domain-specific performance data, dimensionality analysis, and a selection decision framework.
You Chose Your Embedding Model Based on a Leaderboard — But Leaderboard Rankings Don’t Predict Your Retrieval Quality
The MTEB leaderboard ranks 200+ embedding models across 8 task categories. Teams pick the top-ranked model and discover their retrieval quality is mediocre — because leaderboard tasks don’t match production tasks, leaderboard datasets don’t match production data, and the ranking aggregates across categories where your application only uses one. The model that scores highest overall may rank 15th on the specific retrieval task you care about. This guide provides the comparison data across 20+ models on the dimensions that actually determine production retrieval quality, with domain-specific performance data and the selection framework that matches model capabilities to application requirements.
The Comparison Table
Models ranked by average retrieval performance (not overall MTEB score). Scores are NDCG@10 on retrieval-specific benchmarks. Prices as of early 2026.
| Model | Provider | Dimensions | Max tokens | Retrieval NDCG@10 | Speed (tokens/sec) | Price ($/1M tokens) | Hosting |
|---|---|---|---|---|---|---|---|
| voyage-3-large | Voyage AI | 1024 | 32,000 | 72.3 | 3,500 | $0.18 | API |
| GTE-Qwen2-7B | Alibaba | 3584 | 131,072 | 71.5 | 800 | Free (self-hosted) | Self-hosted |
| text-embedding-3-large | OpenAI | 3072 | 8,191 | 70.1 | 5,000 | $0.13 | API |
| embed-v4.0 | Cohere | 1024 | 512 | 69.8 | 6,000 | $0.10 | API |
| NV-Embed-v2 | NVIDIA | 4096 | 32,768 | 69.3 | 500 | Free (self-hosted) | Self-hosted |
| voyage-3 | Voyage AI | 1024 | 32,000 | 68.9 | 4,000 | $0.06 | API |
| jina-embeddings-v3 | Jina AI | 1024 | 8,192 | 68.1 | 4,000 | $0.02 | API/Self-hosted |
| BGE-en-ICL | BAAI | 4096 | 32,768 | 67.8 | 700 | Free (self-hosted) | Self-hosted |
| stella-en-1.5B-v5 | dunzhang | 1024 | 131,072 | 67.4 | 1,200 | Free (self-hosted) | Self-hosted |
| E5-mistral-7b-instruct | Microsoft | 4096 | 32,768 | 66.6 | 600 | Free (self-hosted) | Self-hosted |
| Gemini text-embedding-004 | Google | 768 | 2,048 | 66.0 | 7,000 | $0.004 | API |
| Cohere embed-english-v3 | Cohere | 1024 | 512 | 65.4 | 6,000 | $0.10 | API |
| mxbai-embed-large | Mixedbread | 1024 | 512 | 64.7 | 4,500 | Free (self-hosted) | Self-hosted |
| text-embedding-3-small | OpenAI | 1536 | 8,191 | 64.2 | 10,000 | $0.02 | API |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | 63.6 | 3,000 | Free (self-hosted) | Self-hosted |
| Nomic Embed v1.5 | Nomic AI | 768 | 8,192 | 62.3 | 5,000 | $0.01 | API/Self-hosted |
| Amazon Titan Text v2 | AWS | 1024 | 8,192 | 61.0 | 3,000 | $0.02 | API (Bedrock) |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 256 | 51.8 | 15,000 | Free (self-hosted) | Self-hosted |
Key insights: The top 5 models are within 4 NDCG points of each other — the gap between “best” and “5th best” is smaller than the gap between your domain data and benchmark data. Self-hosted 7B models (GTE-Qwen2, NV-Embed-v2) match or exceed hosted API models but require GPU infrastructure. all-MiniLM-L6-v2 is still widely used due to speed and simplicity but trails modern models by 10-20 NDCG points.
Domain-Specific Performance
Generic MTEB scores hide massive performance variance across domains. A model that excels on general web text may underperform on code, legal, or medical content.
| Model | General text | Code/technical | Legal/regulatory | Medical/biomedical | Financial | Conversational |
|---|---|---|---|---|---|---|
| voyage-3-large | 72.3 | 74.1 | 68.5 | 65.2 | 70.1 | 71.8 |
| text-embedding-3-large | 70.1 | 68.3 | 67.2 | 63.8 | 68.5 | 69.4 |
| voyage-3 (code variant) | 65.8 | 75.2 | 62.1 | 58.4 | 64.2 | 66.7 |
| GTE-Qwen2-7B | 71.5 | 72.0 | 69.8 | 67.3 | 71.2 | 68.5 |
| embed-v4.0 | 69.8 | 66.5 | 68.1 | 64.0 | 67.8 | 72.1 |
| BGE-large-en-v1.5 | 63.6 | 61.2 | 60.5 | 58.8 | 62.3 | 63.0 |
| text-embedding-3-small | 64.2 | 62.8 | 61.0 | 58.5 | 63.1 | 64.8 |
| Nomic Embed v1.5 | 62.3 | 63.5 | 58.2 | 55.1 | 60.8 | 62.0 |
The domain gap: Medical/biomedical performance is 5-10 NDCG points below general text for most models. This gap is real — general models haven’t seen enough domain-specific training data to capture medical terminology and relationships. If your domain is highly specialized, evaluate on your own data before committing.
Dimensionality — Does Higher Always Mean Better?
More dimensions capture more semantic nuance — but at a cost in storage, query latency, and index size.
| Dimensions | Storage per vector | Index size (1M vectors) | Query latency (HNSW) | Retrieval quality (relative) |
|---|---|---|---|---|
| 384 | 1.5 KB | 1.5 GB | 1-3 ms | Baseline |
| 768 | 3 KB | 3 GB | 2-5 ms | +5-8% over 384 |
| 1024 | 4 KB | 4 GB | 3-6 ms | +7-10% over 384 |
| 1536 | 6 KB | 6 GB | 4-8 ms | +8-12% over 384 |
| 3072 | 12 KB | 12 GB | 6-12 ms | +10-14% over 384 |
| 4096 | 16 KB | 16 GB | 8-15 ms | +11-15% over 384 |
The diminishing returns curve: Moving from 384 → 1024 dimensions gives the biggest quality gain per storage cost. Moving from 1024 → 3072 gives marginal quality improvement at 3x storage cost. Beyond 1024 dimensions, the quality-to-cost ratio flattens.
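The storage column follows directly from float32 vectors at 4 bytes per dimension. A quick sketch of the arithmetic (raw vector storage only; an HNSW graph adds overhead on top):

```python
def vector_storage(dims: int, n_vectors: int, bytes_per_float: int = 4) -> dict:
    """Raw float32 vector storage, excluding index graph overhead."""
    per_vector = dims * bytes_per_float          # e.g. 1024 dims * 4 bytes = 4096 bytes
    total = per_vector * n_vectors
    return {"per_vector_kb": per_vector / 1024,
            "index_gb": total / 1024**3}

# 1M vectors at 1024 dims: 4.0 KB each, ~3.81 GiB total
# (the table above rounds to decimal GB)
print(vector_storage(1024, 1_000_000))
```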
Matryoshka Embeddings (Variable Dimensions)
Some models (OpenAI text-embedding-3, Nomic Embed, jina-embeddings-v3) support Matryoshka representation — you can truncate embeddings to fewer dimensions at query time with controlled quality degradation.
| Model (native dims) | Truncated to 256 | Truncated to 512 | Truncated to 1024 | Full dimensions |
|---|---|---|---|---|
| text-embedding-3-large (3072) | 62.0 (-8.1) | 66.8 (-3.3) | 69.5 (-0.6) | 70.1 |
| text-embedding-3-small (1536) | 59.5 (-4.7) | 62.8 (-1.4) | 64.2 (native) | 64.2 |
| jina-embeddings-v3 (1024) | 63.2 (-4.9) | 66.5 (-1.6) | 68.1 (native) | 68.1 |
| Nomic Embed v1.5 (768) | 58.1 (-4.2) | 61.0 (-1.3) | N/A | 62.3 |
Practical application: Matryoshka lets you start with smaller vectors (lower cost) and increase dimensions later without re-embedding. Truncating text-embedding-3-large to 1024 gives 99% of full quality at 33% of storage cost.
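Matryoshka truncation itself is just slicing plus re-normalization. A minimal sketch in plain Python (the toy vector is illustrative, not a real embedding):

```python
import math

def truncate_matryoshka(embedding: list[float], target_dims: int) -> list[float]:
    """Truncate a Matryoshka embedding and re-normalize to unit length.

    Re-normalization matters: cosine similarity assumes unit vectors,
    and slicing changes the norm.
    """
    sliced = embedding[:target_dims]
    norm = math.sqrt(sum(x * x for x in sliced))
    return [x / norm for x in sliced]

# Toy 8-dim "embedding" truncated to 4 dims; result is unit-length again
vec = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
short = truncate_matryoshka(vec, 4)
print(len(short), round(sum(x * x for x in short), 6))  # 4 1.0
```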
Context Window — When It Matters
Most embedding models have 512-8K token context windows. Newer models push to 32K-128K. When does this matter?
| Context window | What you can embed | Use case | Risk of shorter windows |
|---|---|---|---|
| 256 tokens | ~1 paragraph | Sentence similarity, short answers | Truncates most document chunks |
| 512 tokens | ~2 paragraphs | Standard chunk embedding | Safe for 256-token chunks |
| 8,192 tokens | ~4-6 pages | Full document sections, long chunks | Adequate for most RAG applications |
| 32,768 tokens | ~20+ pages | Full documents, long-form content | Rarely needed; full-document embedding loses granularity |
| 131,072 tokens | Entire books | Document-level semantic search | Embedding an entire document into one vector loses paragraph-level precision |
Key insight: Longer context windows don’t mean you should embed longer text. A 128K-token model embedding an entire 50-page document into one 1024-dim vector cannot distinguish between paragraph-level concepts. Long context windows are useful for chunking strategies that need larger-than-usual chunks — not for eliminating chunking entirely.
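When larger-than-usual chunks are warranted, a token-budgeted chunker with overlap can be sketched as below. It uses a rough words-to-tokens heuristic (about 1.3 tokens per English word, an assumption); swap in your model's real tokenizer for exact counts.

```python
def chunk_by_tokens(text: str, max_tokens: int = 256, overlap: int = 32,
                    tokens_per_word: float = 1.3) -> list[str]:
    """Split text into overlapping chunks under an approximate token budget.

    The words-to-tokens ratio is a heuristic; use a real tokenizer
    for exact budget enforcement.
    """
    words = text.split()
    words_per_chunk = max(1, round(max_tokens / tokens_per_word))
    step = max(1, words_per_chunk - round(overlap / tokens_per_word))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
print(len(chunk_by_tokens(doc, max_tokens=130, overlap=13)))  # 6
```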
Cost Analysis at Scale
| Scale | text-embedding-3-small | text-embedding-3-large | voyage-3 | GTE-Qwen2 (self-hosted) | Gemini 004 |
|---|---|---|---|---|---|
| 100K documents (avg 2K tokens) | $4 | $26 | $12 | $0 + $5 GPU hours | $0.80 |
| 1M documents | $40 | $260 | $120 | $0 + $50 GPU hours | $8 |
| 10M documents | $400 | $2,600 | $1,200 | $0 + $500 GPU hours | $80 |
| Query cost (100K/month) | $4/mo | $26/mo | $12/mo | $100-300/mo GPU | $0.80/mo |
The self-hosted crossover: API embedding is cheaper up to ~5M documents. Beyond that, self-hosted GPU costs (A100 at $2-3/hr) become competitive — especially when query volume is high. The breakeven depends on your query volume vs. one-time embedding volume.
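The breakeven arithmetic can be sketched as follows. The GPU throughput and hourly rate are illustrative assumptions: batched embedding throughput far exceeds the per-query speeds in the comparison table, and ~28K tokens/sec at $2.50/hr (roughly A100 cloud pricing) is assumed here.

```python
def api_embedding_cost(n_docs: int, avg_tokens: int, price_per_m_tokens: float) -> float:
    """One-time cost to embed a corpus through a hosted API."""
    return n_docs * avg_tokens / 1_000_000 * price_per_m_tokens

def gpu_embedding_cost(n_docs: int, avg_tokens: int,
                       batch_tokens_per_sec: float = 28_000,
                       gpu_per_hour: float = 2.50) -> float:
    """One-time GPU cost to embed a corpus on rented hardware.

    ASSUMPTIONS: ~28K tokens/sec batched throughput, $2.50/hr GPU rate.
    Adjust both for your hardware and provider.
    """
    gpu_hours = n_docs * avg_tokens / batch_tokens_per_sec / 3600
    return gpu_hours * gpu_per_hour

# 10M docs at 2K tokens each: voyage-3 API vs. self-hosted embedding
print(api_embedding_cost(10_000_000, 2000, 0.06))   # 1200.0
print(round(gpu_embedding_cost(10_000_000, 2000)))  # 496
```

Note this covers only the one-time embedding pass; the ongoing GPU cost of serving query embeddings is what dominates the self-hosted TCO below.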
Total Cost of Ownership (1M docs, 100K queries/month, 12 months)
| Model | Embedding cost | Query cost (12 mo) | Vector DB storage | Total 12-month | Quality (NDCG@10) |
|---|---|---|---|---|---|
| text-embedding-3-small | $40 | $48 | $50 (6 GB) | $138 | 64.2 |
| text-embedding-3-large | $260 | $312 | $100 (12 GB) | $672 | 70.1 |
| voyage-3 | $120 | $144 | $35 (4 GB) | $299 | 68.9 |
| GTE-Qwen2 (self-hosted) | $50 (GPU) | $2,400 (GPU lease) | $35 (4 GB) | $2,485 | 71.5 |
| Gemini 004 | $8 | $10 | $25 (3 GB) | $43 | 66.0 |
| jina-embeddings-v3 | $40 | $48 | $35 (4 GB) | $123 | 68.1 |
Best value: Gemini text-embedding-004 at $43/year for 66.0 NDCG is the most cost-effective option if quality above 65 is sufficient. jina-embeddings-v3 at $123/year for 68.1 NDCG offers the best quality-per-dollar in the mid-range. Self-hosted GTE-Qwen2 has the highest quality but the highest cost due to continuous GPU requirements for query serving.
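The "best value" claim can be checked by computing marginal quality per dollar from the TCO table; the costs and scores below are copied from it.

```python
# 12-month TCO ($) and NDCG@10 figures from the table above
tco = {
    "text-embedding-3-small": (138, 64.2),
    "text-embedding-3-large": (672, 70.1),
    "voyage-3": (299, 68.9),
    "GTE-Qwen2 (self-hosted)": (2485, 71.5),
    "Gemini 004": (43, 66.0),
    "jina-embeddings-v3": (123, 68.1),
}

# NDCG gained per extra $100/year, relative to the cheapest option
base_cost, base_ndcg = tco["Gemini 004"]
for name, (cost, ndcg) in sorted(tco.items(), key=lambda kv: kv[1][0]):
    if cost == base_cost:
        continue
    marginal = (ndcg - base_ndcg) / (cost - base_cost) * 100
    print(f"{name}: {marginal:+.2f} NDCG per extra $100/yr")
```

On these numbers, jina-embeddings-v3 has the steepest quality gain per marginal dollar, consistent with the mid-range recommendation above.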
Selection Decision Framework
| Your requirement | Recommended model | Why |
|---|---|---|
| Lowest possible cost | Gemini text-embedding-004 | $0.004/1M tokens — 5x cheaper than next cheapest |
| Best retrieval quality (API) | voyage-3-large | Highest retrieval NDCG among hosted models |
| Best retrieval quality (any) | GTE-Qwen2-7B | Highest retrieval NDCG, requires GPU |
| Code/technical content | voyage-3 (code) | Trained specifically on code; highest code retrieval score |
| Multilingual | embed-v4.0 (Cohere) | 100+ languages with minimal quality degradation |
| Privacy/on-premise required | BGE-large-en-v1.5 or GTE-Qwen2 | Open-source, self-hosted, no data leaves your infrastructure |
| Variable dimension needs | text-embedding-3-large | Matryoshka support — truncate dimensions without re-embedding |
| Long documents (>8K tokens) | voyage-3-large or GTE-Qwen2 | 32K-128K context window |
| Speed-critical (real-time) | all-MiniLM-L6-v2 | 15K tokens/sec; acceptable for real-time with quality tradeoff |
| AWS ecosystem (Bedrock) | Amazon Titan Text v2 | Native Bedrock integration, no external API dependency |
How to Apply This
Use the token-counter tool to estimate your embedding corpus size in tokens — this determines the one-time embedding cost and ongoing query cost.
Always evaluate on your own data. The performance gap between benchmark data and your domain data can be 5-15 NDCG points. Embed 1,000 representative documents, run 100 representative queries, and measure retrieval relevance before committing to a model.
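For the relevance measurement step, a minimal NDCG@10 with binary relevance works well. This sketch assumes you have, per query, a ranked list of retrieved document IDs and a hand-labeled set of relevant IDs:

```python
import math

def ndcg_at_k(ranked_doc_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: 1 if the doc is in the labeled set, else 0."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_doc_ids[:k]) if doc in relevant)
    # Ideal DCG: all relevant docs ranked first
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# One query's top-10 against its labeled relevant docs
ranking = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
labels = {"d3", "d1", "d5"}
print(round(ndcg_at_k(ranking, labels), 3))  # 0.845
```

Average this over your 100 representative queries per candidate model, and compare those numbers rather than leaderboard scores.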
Start with Matryoshka-capable models (text-embedding-3-large, jina-embeddings-v3) if you’re uncertain about dimension requirements. You can truncate to 256 dimensions for prototyping and increase to full dimensions for production without re-embedding.
Don’t over-index on MTEB leaderboard rank. The difference between rank 1 and rank 10 on retrieval tasks is typically 3-5 NDCG points. Domain match, cost, and operational requirements matter more than leaderboard position.
Honest Limitations
MTEB scores update as models are re-evaluated; rankings shift quarterly. Domain-specific performance data is estimated from limited public benchmarks — your domain may differ. Self-hosted cost estimates assume cloud GPU pricing; on-premise GPU costs have different economics. Speed measurements vary by batch size, hardware, and quantization; the listed values are approximate mid-range. The “Gemini is cheapest” finding depends on Google’s current pricing, which has changed multiple times. New embedding models release monthly — any comparison table has a shelf life of 3-6 months. Matryoshka quality degradation varies by query type; some queries are more sensitive to dimensionality reduction than others.