You Chose Your Embedding Model Based on a Leaderboard — But Leaderboard Rankings Don’t Predict Your Retrieval Quality

The MTEB leaderboard ranks 200+ embedding models across 8 task categories. Teams pick the top-ranked model and discover their retrieval quality is mediocre — because leaderboard tasks don’t match production tasks, leaderboard datasets don’t match production data, and the ranking aggregates across categories where your application only uses one. The model that scores highest overall may rank 15th on the specific retrieval task you care about. This guide provides comparison data across 18 models on the dimensions that actually determine production retrieval quality, with domain-specific performance data and a selection framework that matches model capabilities to application requirements.

The Comparison Table

Models ranked by average retrieval performance (not overall MTEB score). Scores are NDCG@10 on retrieval-specific benchmarks. Prices as of early 2026.

| Model | Provider | Dimensions | Max tokens | Retrieval NDCG@10 | Speed (tokens/sec) | Price ($/1M tokens) | Hosting |
|---|---|---|---|---|---|---|---|
| voyage-3-large | Voyage AI | 1024 | 32,000 | 72.3 | 3,500 | $0.18 | API |
| GTE-Qwen2-7B | Alibaba | 3584 | 131,072 | 71.5 | 800 | Free | Self-hosted |
| text-embedding-3-large | OpenAI | 3072 | 8,191 | 70.1 | 5,000 | $0.13 | API |
| embed-v4.0 | Cohere | 1024 | 512 | 69.8 | 6,000 | $0.10 | API |
| NV-Embed-v2 | NVIDIA | 4096 | 32,768 | 69.3 | 500 | Free | Self-hosted |
| voyage-3 | Voyage AI | 1024 | 32,000 | 68.9 | 4,000 | $0.06 | API |
| jina-embeddings-v3 | Jina AI | 1024 | 8,192 | 68.1 | 4,000 | $0.02 | API/Self-hosted |
| BGE-en-ICL | BAAI | 4096 | 32,768 | 67.8 | 700 | Free | Self-hosted |
| stella-en-1.5B-v5 | dunzhang | 1024 | 131,072 | 67.4 | 1,200 | Free | Self-hosted |
| E5-mistral-7b-instruct | Microsoft | 4096 | 32,768 | 66.6 | 600 | Free | Self-hosted |
| Gemini text-embedding-004 | Google | 768 | 2,048 | 66.0 | 7,000 | $0.004 | API |
| Cohere embed-english-v3 | Cohere | 1024 | 512 | 65.4 | 6,000 | $0.10 | API |
| mxbai-embed-large | Mixedbread | 1024 | 512 | 64.7 | 4,500 | Free | Self-hosted |
| text-embedding-3-small | OpenAI | 1536 | 8,191 | 64.2 | 10,000 | $0.02 | API |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | 63.6 | 3,000 | Free | Self-hosted |
| Nomic Embed v1.5 | Nomic AI | 768 | 8,192 | 62.3 | 5,000 | $0.01 | API/Self-hosted |
| Amazon Titan Text v2 | AWS | 1024 | 8,192 | 61.0 | 3,000 | $0.02 | API (Bedrock) |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 256 | 51.8 | 15,000 | Free | Self-hosted |

Key insights: The top 5 models are within 4 NDCG points of each other — the gap between “best” and “5th best” is smaller than the gap between your domain data and benchmark data. Self-hosted 7B models (GTE-Qwen2, NV-Embed-v2) match or exceed hosted API models but require GPU infrastructure. all-MiniLM-L6-v2 is still widely used due to speed and simplicity but trails modern models by 10-20 NDCG points.

Domain-Specific Performance

Generic MTEB scores hide massive performance variance across domains. A model that excels on general web text may underperform on code, legal, or medical content.

| Model | General text | Code/technical | Legal/regulatory | Medical/biomedical | Financial | Conversational |
|---|---|---|---|---|---|---|
| voyage-3-large | 72.3 | 74.1 | 68.5 | 65.2 | 70.1 | 71.8 |
| text-embedding-3-large | 70.1 | 68.3 | 67.2 | 63.8 | 68.5 | 69.4 |
| voyage-3 (code variant) | 65.8 | 75.2 | 62.1 | 58.4 | 64.2 | 66.7 |
| GTE-Qwen2-7B | 71.5 | 72.0 | 69.8 | 67.3 | 71.2 | 68.5 |
| embed-v4.0 | 69.8 | 66.5 | 68.1 | 64.0 | 67.8 | 72.1 |
| BGE-large-en-v1.5 | 63.6 | 61.2 | 60.5 | 58.8 | 62.3 | 63.0 |
| text-embedding-3-small | 64.2 | 62.8 | 61.0 | 58.5 | 63.1 | 64.8 |
| Nomic Embed v1.5 | 62.3 | 63.5 | 58.2 | 55.1 | 60.8 | 62.0 |

The domain gap: Medical/biomedical performance is 5-10 NDCG points below general text for most models. This gap is real — general models haven’t seen enough domain-specific training data to capture medical terminology and relationships. If your domain is highly specialized, evaluate on your own data before committing.
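One cheap way to run that pre-commitment check is a hit-rate metric over your own queries. A minimal sketch, assuming you have retrieved rankings per query and a small set of known-relevant document ids (the `results`/`ground_truth` shapes here are illustrative, not a standard API):

```python
def recall_at_k(results: dict[str, list[str]],
                ground_truth: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of queries whose top-k retrieved ids contain at least
    one known-relevant document. Crude, but a fast first signal on
    whether a model handles your domain's vocabulary."""
    if not results:
        return 0.0
    hits = sum(1 for query, ranked in results.items()
               if ground_truth.get(query, set()) & set(ranked[:k]))
    return hits / len(results)
```

If this number drops sharply when you swap general web-text queries for domain-specific ones, you are looking at the domain gap directly.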

Dimensionality — Does Higher Always Mean Better?

More dimensions capture more semantic nuance — but at a cost in storage, query latency, and index size.

| Dimensions | Storage per vector (float32) | Index size (1M vectors) | Query latency (HNSW) | Retrieval quality (relative) |
|---|---|---|---|---|
| 384 | 1.5 KB | 1.5 GB | 1-3 ms | Baseline |
| 768 | 3 KB | 3 GB | 2-5 ms | +5-8% over 384 |
| 1024 | 4 KB | 4 GB | 3-6 ms | +7-10% over 384 |
| 1536 | 6 KB | 6 GB | 4-8 ms | +8-12% over 384 |
| 3072 | 12 KB | 12 GB | 6-12 ms | +10-14% over 384 |
| 4096 | 16 KB | 16 GB | 8-15 ms | +11-15% over 384 |

The diminishing returns curve: Moving from 384 → 1024 dimensions gives the biggest quality gain per storage cost. Moving from 1024 → 3072 gives marginal quality improvement at 3x storage cost. Beyond 1024 dimensions, the quality-to-cost ratio flattens.
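The storage column is just dimensions × 4 bytes (float32) per vector. A quick sketch for sizing your own index; note this is the raw-vector lower bound, since HNSW graph links typically add another 10-30% on top:

```python
def raw_vector_bytes(dims: int, n_vectors: int = 1, bytes_per_dim: int = 4) -> int:
    """Raw vector storage assuming float32 (4 bytes per dimension).
    HNSW graph links add overhead on top of this lower bound."""
    return dims * bytes_per_dim * n_vectors

print(raw_vector_bytes(384))                    # 1536 bytes (~1.5 KB per vector)
print(raw_vector_bytes(1024, 1_000_000) / 1e9)  # 4.096 (~4 GB for 1M vectors)
```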

Matryoshka Embeddings (Variable Dimensions)

Some models (OpenAI text-embedding-3, Nomic Embed, jina-embeddings-v3) support Matryoshka representation — you can truncate embeddings to fewer dimensions at query time with controlled quality degradation.

| Model (native dims) | Truncated to 256 | Truncated to 512 | Truncated to 1024 | Full dimensions |
|---|---|---|---|---|
| text-embedding-3-large (3072) | 62.0 (-8.1) | 66.8 (-3.3) | 69.5 (-0.6) | 70.1 |
| text-embedding-3-small (1536) | 59.5 (-4.7) | 62.8 (-1.4) | 64.2 (native) | 64.2 |
| jina-embeddings-v3 (1024) | 63.2 (-4.9) | 66.5 (-1.6) | 68.1 (native) | 68.1 |
| Nomic Embed v1.5 (768) | 58.1 (-4.2) | 61.0 (-1.3) | N/A | 62.3 |

Practical application: Matryoshka lets you start with smaller vectors (lower cost) and increase dimensions later without re-embedding. Truncating text-embedding-3-large to 1024 gives 99% of full quality at 33% of storage cost.
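Client-side, truncation is a slice plus re-normalization. A minimal sketch (with OpenAI's text-embedding-3 models you can instead pass the `dimensions` request parameter and receive truncated vectors directly):

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the leading `dims` components and re-normalize to unit
    length so cosine similarity stays meaningful after truncation.
    Only valid for Matryoshka-trained models, which front-load
    information into the early dimensions."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]

v = truncate_embedding([3.0, 4.0, 0.1, -0.2], 2)
print(v)  # [0.6, 0.8] -- unit length again
```

Applying this to embeddings from a non-Matryoshka model will silently destroy quality; check the model card before truncating.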

Context Window — When It Matters

Most embedding models have 512-8K token context windows. Newer models push to 32K-128K. When does this matter?

| Context window | What you can embed | Use case | Notes |
|---|---|---|---|
| 256 tokens | ~1 paragraph | Sentence similarity, short answers | Truncates most document chunks |
| 512 tokens | ~2 paragraphs | Standard chunk embedding | Safe for 256-token chunks |
| 8,192 tokens | ~4-6 pages | Full document sections, long chunks | Adequate for most RAG applications |
| 32,768 tokens | ~20+ pages | Full documents, long-form content | Rarely needed; full-document embedding loses granularity |
| 131,072 tokens | Entire books | Document-level semantic search | Embedding an entire document into one vector loses paragraph-level precision |

Key insight: Longer context windows don’t mean you should embed longer text. A 128K-token model embedding an entire 50-page document into one 1024-dim vector cannot distinguish between paragraph-level concepts. Long context windows are useful for chunking strategies that need larger-than-usual chunks — not for eliminating chunking entirely.
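A guardrail worth adding to any ingestion pipeline: verify that chunks fit the model window before embedding, since many APIs silently truncate overlong input. A sketch using the rough ~4-characters-per-token heuristic for English text; swap in a real tokenizer such as tiktoken for exact counts:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English-text heuristic (~4 chars/token); real tokenizers
    can differ substantially, especially on code or non-English text."""
    return max(1, round(len(text) / chars_per_token))

def fits_window(text: str, max_tokens: int, safety_margin: float = 0.9) -> bool:
    """True if the chunk fits the model's context window with a 10%
    margin for tokenizer estimation error."""
    return estimate_tokens(text) <= max_tokens * safety_margin

print(fits_window("word " * 200, 512))  # True  (~250 tokens vs 512)
print(fits_window("word " * 200, 256))  # False (over the 90% margin)
```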

Cost Analysis at Scale

| Scale | text-embedding-3-small | text-embedding-3-large | voyage-3 | GTE-Qwen2 (self-hosted) | Gemini 004 |
|---|---|---|---|---|---|
| 100K documents (avg 2K tokens) | $4 | $26 | $12 | $0 + $5 GPU hours | $0.80 |
| 1M documents | $40 | $260 | $120 | $0 + $50 GPU hours | $8 |
| 10M documents | $400 | $2,600 | $1,200 | $0 + $500 GPU hours | $80 |
| Query cost (100K/month) | $4/mo | $26/mo | $12/mo | $100-300/mo GPU | $0.80/mo |

The self-hosted crossover: API embedding is cheaper up to ~5M documents. Beyond that, self-hosted GPU costs (A100 at $2-3/hr) become competitive — especially when query volume is high. The breakeven depends on your query volume vs. one-time embedding volume.
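The crossover arithmetic is simple enough to run for your own numbers. A sketch; the 20,000 tokens/sec figure below is an illustrative assumption for batched throughput on a leased GPU (batched embedding runs far faster than the single-stream speeds in the comparison table), not a measured value:

```python
def api_embed_cost(n_docs: int, avg_tokens: int, price_per_1m_tokens: float) -> float:
    """One-time cost (USD) to embed the corpus through a hosted API."""
    return n_docs * avg_tokens / 1_000_000 * price_per_1m_tokens

def self_hosted_embed_cost(n_docs: int, avg_tokens: int,
                           batch_tokens_per_sec: float,
                           gpu_dollars_per_hour: float) -> float:
    """GPU-hours needed to embed the corpus, priced at the lease rate.
    Ignores setup time and the standing cost of serving queries."""
    gpu_hours = n_docs * avg_tokens / batch_tokens_per_sec / 3600
    return gpu_hours * gpu_dollars_per_hour

# 10M docs at ~2K tokens each
print(api_embed_cost(10_000_000, 2_000, 0.06))  # 1200.0 (voyage-3 pricing)
print(round(self_hosted_embed_cost(10_000_000, 2_000, 20_000, 2.50), 2))  # 694.44
```

Note that this compares one-time embedding only; if you self-host, the GPU must also stay up to embed queries, which is where the continuous cost comes from.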

Total Cost of Ownership (1M docs, 100K queries/month, 12 months)

| Model | Embedding cost | Query cost (12 mo) | Vector DB storage | Total 12-month | Quality (NDCG@10) |
|---|---|---|---|---|---|
| text-embedding-3-small | $40 | $48 | $50 (6 GB) | $138 | 64.2 |
| text-embedding-3-large | $260 | $312 | $100 (12 GB) | $672 | 70.1 |
| voyage-3 | $120 | $144 | $35 (4 GB) | $299 | 68.9 |
| GTE-Qwen2 (self-hosted) | $50 (GPU) | $2,400 (GPU lease) | $35 (4 GB) | $2,485 | 71.5 |
| Gemini 004 | $8 | $10 | $25 (3 GB) | $43 | 66.0 |
| jina-embeddings-v3 | $40 | $48 | $35 (4 GB) | $123 | 68.1 |

Best value: Gemini text-embedding-004 at $43/year for 66.0 NDCG is the most cost-effective option if quality above 65 is sufficient. jina-embeddings-v3 at $123/year for 68.1 NDCG offers the best quality-per-dollar in the mid-range. Self-hosted GTE-Qwen2 has the highest quality but the highest cost due to continuous GPU requirements for query serving.
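The totals above decompose into three terms. A trivial sketch for plugging in your own numbers, using the same convention as the table (one-time embedding, monthly query embedding, yearly storage):

```python
def tco_12_months(embed_cost: float, monthly_query_cost: float,
                  yearly_storage_cost: float) -> float:
    """12-month total cost of ownership: one-time corpus embedding +
    12 months of query embedding + a year of vector-DB storage."""
    return embed_cost + 12 * monthly_query_cost + yearly_storage_cost

# text-embedding-3-small row: $40 + 12 x $4 + $50
print(tco_12_months(40, 4, 50))  # 138.0
```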

Selection Decision Framework

| Your requirement | Recommended model | Why |
|---|---|---|
| Lowest possible cost | Gemini text-embedding-004 | $0.004/1M tokens, 2.5x cheaper than the next-cheapest API (Nomic Embed v1.5 at $0.01) |
| Best retrieval quality (API) | voyage-3-large | Highest retrieval NDCG among hosted models |
| Best retrieval quality (self-hosted) | GTE-Qwen2-7B | Highest retrieval NDCG among open-weight models; requires GPU |
| Code/technical content | voyage-3 (code) | Trained specifically on code; highest code retrieval score |
| Multilingual | embed-v4.0 (Cohere) | 100+ languages with minimal quality degradation |
| Privacy/on-premise required | BGE-large-en-v1.5 or GTE-Qwen2 | Open-source, self-hosted, no data leaves your infrastructure |
| Variable dimension needs | text-embedding-3-large | Matryoshka support — truncate dimensions without re-embedding |
| Long documents (>8K tokens) | voyage-3-large or GTE-Qwen2 | 32K-128K context windows |
| Speed-critical (real-time) | all-MiniLM-L6-v2 | 15K tokens/sec; acceptable for real-time with a quality tradeoff |
| AWS ecosystem (Bedrock) | Amazon Titan Text v2 | Native Bedrock integration, no external API dependency |

How to Apply This

Use the token-counter tool to estimate your embedding corpus size in tokens — this determines the one-time embedding cost and ongoing query cost.

Always evaluate on your own data. The performance gap between benchmark data and your domain data can be 5-15 NDCG points. Embed 1,000 representative documents, run 100 representative queries, and measure retrieval relevance before committing to a model.
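For the "measure retrieval relevance" step, NDCG@10 (the metric used throughout this guide) fits in a dozen lines. A sketch assuming graded relevance judgments per query (the id-to-grade mapping is an illustrative shape, not a standard format):

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain of the returned ranking,
    normalized by the gain of the ideal ranking. `ranked_ids` is the
    retriever's output (best first); `relevance` maps doc id to a
    graded judgment (0 or missing = irrelevant)."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2)
               for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(["a", "b", "c"], {"a": 2, "b": 1}))  # 1.0 -- perfect ordering
```

Average this over your 100 representative queries per candidate model; a 2-3 point spread on your own data is a far stronger signal than the same spread on MTEB.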

Start with Matryoshka-capable models (text-embedding-3-large, jina-embeddings-v3) if you’re uncertain about dimension requirements. You can truncate to 256 dimensions for prototyping and increase to full dimensions for production without re-embedding.

Don’t over-index on MTEB leaderboard rank. The difference between rank 1 and rank 10 on retrieval tasks is typically 3-5 NDCG points. Domain match, cost, and operational requirements matter more than leaderboard position.

Honest Limitations

MTEB scores update as models are re-evaluated; rankings shift quarterly. Domain-specific performance data is estimated from limited public benchmarks — your domain may differ. Self-hosted cost estimates assume cloud GPU pricing; on-premise GPU costs have different economics. Speed measurements vary by batch size, hardware, and quantization; the listed values are approximate mid-range. The “Gemini is cheapest” finding depends on Google’s current pricing, which has changed multiple times. New embedding models release monthly — any comparison table has a shelf life of 3-6 months. Matryoshka quality degradation varies by query type; some queries are more sensitive to dimensionality reduction than others.