Your RAG System Answers Correctly on Demo Queries, Fails Silently on Production Traffic, and Every Framework Comparison Article Lists the Same Five Libraries Without Telling You Which One Catches Which Failure — Here’s the Per-Metric, Per-Framework Decision Matrix That Maps Actual Failure Modes to the Measurement Approach That Detects Them

A RAG system has four failure modes that matter: it retrieves the wrong documents, it retrieves the right documents but cites the wrong parts, it generates text that contradicts the retrieved documents, or it generates text that is irrelevant to the user’s question. Measuring these failures requires four different metrics — and no single evaluation framework implements all four with equal rigor. The common mistake is picking one framework (RAGAS is the popular default), running it, and declaring the system “evaluated” — when the framework’s weak spots coincide with the failure modes most likely to bite in production. This guide maps each of the four RAG-specific metrics to what it catches, which frameworks implement it well, and how to construct the golden set that makes evaluation meaningful rather than theatrical.

The Four RAG-Specific Metrics

RAG evaluation decomposes into two axes (retrieval quality and generation quality) × two directions (precision and recall):

| Metric | What it measures | What it catches | Measurement approach |
| --- | --- | --- | --- |
| Context Precision | Of the retrieved chunks, what fraction is relevant to the query | Over-retrieval; retriever returning too many irrelevant chunks that dilute the context window | Rank-aware: relevant chunks should appear near the top of the retrieval list |
| Context Recall | Of the chunks that WOULD have answered the query, what fraction did we retrieve | Under-retrieval; retriever missing critical chunks | Requires ground-truth “should-have-retrieved” labels per query |
| Faithfulness | Of the claims in the generated answer, what fraction is supported by the retrieved context | Hallucination; generator making up facts not in the retrieved documents | Decompose answer into claims → check each claim against retrieved context |
| Answer Relevance | Does the generated answer address the user’s question | Off-topic answers; generator wandering or answering adjacent questions | Semantic similarity between generated answer and expected answer, penalizing off-topic drift |

Critical asymmetry: Context Precision is cheap (you already have the retrieved chunks and the query). Context Recall is expensive (it requires ground-truth labels — which chunks should have been retrieved). Teams that skip Context Recall because “labeling is hard” are blind to the most common RAG failure mode: the retriever silently missing the right document and the generator confidently answering with the second-best document.
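
Once a judge (LLM or human) has produced per-chunk relevance verdicts and per-claim support verdicts, three of the four scores reduce to simple ratios. A minimal sketch — the function names and the rank-aware precision formula (mean precision@k over relevant ranks) are illustrative; individual frameworks vary in details, and Answer Relevance additionally needs an embedding model, so it is omitted here:

```python
def context_precision(relevance_at_rank: list[bool]) -> float:
    """Rank-aware precision: mean of precision@k at each relevant rank,
    so relevant chunks buried low in the retrieval list are penalized."""
    hits, score = 0, 0.0
    for k, is_relevant in enumerate(relevance_at_rank, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k at this relevant rank
    return score / hits if hits else 0.0

def context_recall(retrieved_ids: set, ground_truth_ids: set) -> float:
    """Fraction of should-have-retrieved chunks actually retrieved.
    Needs ground-truth labels -- this is the expensive metric."""
    if not ground_truth_ids:
        return 1.0
    return len(ground_truth_ids & retrieved_ids) / len(ground_truth_ids)

def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of extracted claims supported by the retrieved context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 1.0
```

Note the asymmetry in the inputs: `context_precision` needs only judge verdicts on what was retrieved, while `context_recall` cannot be computed at all without the labeled `ground_truth_ids`.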

Metric Failure-Mode Mapping

Map each production symptom to the metric that detects it:

| Production symptom | Root cause | Metric that detects it | Why other metrics miss it |
| --- | --- | --- | --- |
| “Answer is about the right topic but facts are subtly wrong” | Hallucination — generator inventing facts not in retrieval | Faithfulness | Context Precision sees correct chunks; Answer Relevance sees an on-topic answer |
| “Answer cites a real document but the document doesn’t actually say that” | Misattribution — retrieval OK, generation distorts the source | Faithfulness (claim-level) | An aggregate faithfulness score can mask single-claim errors; claim decomposition required |
| “Answer is off-topic or answers a different question” | Query-understanding failure; retriever matched on surface keywords | Answer Relevance | Faithfulness can be high on off-topic retrieved chunks |
| “Answer says ‘I don’t know’ but the document is in the corpus” | Under-retrieval — retriever missed the relevant document | Context Recall | Faithfulness is 100% (no claims); Answer Relevance is low; root cause invisible without recall |
| “Answer is correct but uses the wrong version of the document” | Over-retrieval returned a stale version ranked higher | Context Precision (rank-aware) | Unranked precision counts both versions as relevant |
| “Answer synthesizes contradictory claims from different documents” | Multi-hop retrieval without conflict resolution | Faithfulness + claim-consistency cross-check | Standard faithfulness checks each claim independently |
| “Answer quality degrades on long conversations” | Context-window truncation removing earlier relevant retrieval | Turn-level Context Recall | Single-turn evaluation misses conversational drift |

The diagnostic value of an evaluation framework is the fraction of these symptoms it can isolate to their root cause. Frameworks that only implement aggregate metrics (a single “RAG score”) fail this test because multiple symptoms produce the same aggregate.
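
The table above is effectively a lookup from metric profile to likely root cause. A per-query triage sketch (the 0.8 threshold and the check order — retrieval failures before generation failures — are illustrative choices, not calibrated values):

```python
def triage(faithfulness: float, context_precision: float,
           context_recall: float, answer_relevance: float,
           bar: float = 0.8) -> str:
    """Map a failing query's metric profile to its likely root cause,
    following the symptom table above. Retrieval failures are checked
    first because they are upstream of generation failures."""
    if context_recall < bar:
        return "under-retrieval: relevant document never reached the generator"
    if context_precision < bar:
        return "over-retrieval: irrelevant or stale chunks diluting context"
    if faithfulness < bar:
        return "hallucination: retrieval fine, generator invented facts"
    if answer_relevance < bar:
        return "query-understanding failure: grounded but off-topic answer"
    return "no dominant failure at this threshold"
```

This is exactly the diagnostic-isolation property the paragraph above demands: a single aggregate score cannot dispatch to four different root causes, but four separate metrics can.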

The Four RAG Evaluation Frameworks Compared

| Framework | Faithfulness | Context Precision | Context Recall | Answer Relevance | Golden-set dependency | LLM-judge model configurability | Production monitoring |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAGAS | Yes (claim decomposition) | Yes (rank-aware) | Yes (requires ground-truth contexts) | Yes | Moderate — can run without ground truth for some metrics | High — swappable judge model | Limited — batch-oriented |
| TruLens | Yes (groundedness) | Yes | Indirect (via answer-relevance drop) | Yes | Low — can run on production traffic without a golden set | High — full judge swap | Strong — production-trace integration |
| ARES | Yes | Yes | Yes | Yes | High — trains small judge models on synthetic data | Limited — judge is a trained model | Weak — research-oriented |
| DeepEval RAG | Yes | Yes | Yes (requires ground truth) | Yes | High — requires a golden set | High | Moderate — CI/CD integration focus |

Framework Selection Decision Tree

  • You have a curated golden set of 100+ question-context-answer triples → RAGAS or DeepEval RAG. Both give you all four metrics with strongest rigor; DeepEval RAG wins if you prioritize CI/CD integration, RAGAS wins if you prioritize metric-level configurability.
  • You have production traces but no curated golden set → TruLens. Its groundedness and answer-relevance metrics work on live traffic without ground-truth labels; trades rigor for observability.
  • You can afford to label but don’t have labels yet → ARES. Trains a small judge model on synthetic data that scales cheaper than LLM-judge approaches for high-volume evaluation.
  • You need to gate production deploys on evaluation pass → DeepEval RAG. Its assertion-based API is designed for pytest-style integration into continuous-deployment pipelines.

No framework is strictly dominant. The right choice depends on where you are in the data-maturity curve.
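
The decision tree collapses to a short dispatch function. This is only a sketch of the prose above — the boolean inputs simplify a messier reality, and the ordering (a hard deploy-gating requirement trumps data maturity) is one defensible reading of the bullets, not the only one:

```python
def pick_framework(golden_set_size: int = 0,
                   has_production_traces: bool = False,
                   can_fund_labeling: bool = False,
                   must_gate_deploys: bool = False) -> str:
    """Encode the framework-selection decision tree above."""
    if must_gate_deploys:
        return "DeepEval RAG"   # pytest-style assertions gate deploys
    if golden_set_size >= 100:
        return "RAGAS"          # or DeepEval RAG if CI/CD matters more
    if has_production_traces:
        return "TruLens"        # label-free metrics on live traffic
    if can_fund_labeling:
        return "ARES"           # trains a cheap judge on synthetic data
    return "build a golden set before choosing a framework"
```

The fallthrough return is the honest answer for teams with no labels, no traces, and no labeling budget: no framework rescues an evaluation with nothing to evaluate against.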

Golden-Set Construction Protocol

The evaluation framework is only as good as the golden set. A 50-query golden set that covers only happy-path questions will give you a false sense of production readiness. The construction protocol:

| Step | Activity | Output | Quality gate |
| --- | --- | --- | --- |
| 1. Query sampling | Sample production queries from logs, stratified by intent cluster | Raw query list (aim for 300-500 before filtering) | Intent-cluster coverage — every cluster represented |
| 2. Intent taxonomy | Cluster queries by intent (factoid, multi-hop, comparative, temporal, numeric, contrastive) | Intent-labeled query set | Each intent cluster ≥ 20 queries |
| 3. Failure-mode seeding | Add adversarial queries targeting known failure modes (ambiguous pronouns, negations, superlatives, date-sensitive phrasing) | Adversarial query additions (~15-20% of total) | Each failure mode ≥ 5 queries |
| 4. Ground-truth context labeling | For each query, identify the chunks in the corpus that SHOULD be retrieved | Query → relevant-chunk-IDs mapping | Inter-annotator agreement ≥ 0.8 (Cohen’s kappa) |
| 5. Ground-truth answer labeling | Write the expected answer, citing specific chunks | Query → (answer, cited chunks) | Answer must be derivable from labeled chunks — no external knowledge |
| 6. Distribution check | Compare the intent-cluster distribution in the golden set vs the production query distribution | Distribution-fit report | Chi-square p-value > 0.05 for intent-cluster distribution |
| 7. Temporal refresh | Re-sample and re-label every 90 days | Refresh cadence calendar | No golden-set query older than 180 days |

Cost-to-construct reality: A 200-query golden set with full ground-truth chunk and answer labeling takes 40-80 hours of domain-expert time. Teams that try to short-circuit this by using LLM-generated synthetic queries get golden sets that systematically miss the ambiguous, malformed, and adversarial queries that define production difficulty.
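
Step 4's quality gate (inter-annotator agreement ≥ 0.8) is checkable with stdlib Python. A sketch of Cohen's kappa for two annotators assigning relevant/irrelevant labels to the same chunks:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two annotators on the same items, corrected for
    the agreement expected by chance given each annotator's label rates."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: each annotator used one label
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1 - expected)
```

Raw percent agreement overstates label quality when one class dominates (most chunks are irrelevant to most queries), which is why the gate is stated in kappa rather than agreement rate.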

Claim-Level Faithfulness — The Only Way to Catch Misattribution

Aggregate faithfulness scores (0.87 average) hide misattribution failures where a single claim in an otherwise-grounded answer is wrong. Claim-level faithfulness decomposes the answer into atomic claims and checks each one:

| Step | Operation | LLM-judge prompt structure |
| --- | --- | --- |
| 1. Claim extraction | Decompose the generated answer into atomic claims | “List each factual claim in this answer. One claim per line. A claim is a single assertion of fact.” |
| 2. Per-claim context check | For each claim, check whether the retrieved context supports it | “Given this retrieved context: {ctx}. Does it support this claim: {claim}? Yes / No / Partially — and cite the exact supporting sentence.” |
| 3. Claim aggregation | Faithfulness = fraction of claims supported | Numerator: supported claims. Denominator: total extracted claims. |
| 4. Single-claim alert | Flag answers with ≥ 1 unsupported claim regardless of aggregate | Gate ship on unsupported-claim count = 0, not on aggregate ≥ threshold |

The single-claim alert is load-bearing. An answer with 10 supported claims and 1 unsupported claim scores 0.91 faithfulness (a passing grade under most thresholds) while containing a hallucinated fact. Single-claim alerting catches what aggregate thresholds miss.
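
The gate logic as a sketch. The verdict strings follow the judge prompt in step 2; counting “Partially” as unsupported is a conservative choice made here for illustration, not a standard:

```python
def faithfulness_gate(claim_verdicts: list[str]) -> tuple[float, bool]:
    """Return (aggregate faithfulness, ship_ok). The ship gate fails on
    ANY claim the judge did not fully support ('yes'), regardless of
    how good the aggregate score looks."""
    supported = sum(v == "yes" for v in claim_verdicts)
    aggregate = supported / len(claim_verdicts)
    ship_ok = supported == len(claim_verdicts)
    return aggregate, ship_ok
```

The 10-supported-1-unsupported answer from the paragraph above scores an aggregate of 10/11 ≈ 0.91 yet fails the gate, which is the intended behavior.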

Regression-Detection Pipeline

Evaluation is not a one-time activity. The pipeline architecture that catches regressions:

| Pipeline stage | Trigger | Metric gate | Action on failure |
| --- | --- | --- | --- |
| Commit-time smoke test | Every PR | Faithfulness ≥ 0.80 on 20-query smoke set | Block merge |
| Pre-deploy full evaluation | Release candidate | All 4 metrics ≥ production baseline − 2% | Block deploy; notify |
| Canary evaluation | 10% traffic on new version | Faithfulness gap between canary and control < 3% | Roll back canary |
| Weekly full golden-set sweep | Cron | Trend analysis — detect drift over time | Investigate if 2+ metrics drop > 3% vs 4-week rolling baseline |
| Production sampled eval | Continuous, 0.5-1% of traffic sampled | Answer relevance, faithfulness via TruLens | Alert on 1-hour-window deviation > 5% from baseline |

Threshold calibration: Absolute thresholds (faithfulness ≥ 0.85) are less useful than delta thresholds (faithfulness drop ≥ 3% from baseline) because the baseline is what your system can achieve today — the relevant question for release decisions is “did this change make it worse,” not “is it above an arbitrary bar.”
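
A delta-from-baseline gate is a few lines of code. In this sketch the drop is measured in absolute percentage points (an assumption; relative drops are an equally defensible convention):

```python
def release_gate(current: dict[str, float], baseline: dict[str, float],
                 max_drop: float = 0.03) -> tuple[bool, dict[str, float]]:
    """Pass iff no metric dropped more than max_drop below its rolling
    baseline. Returns (passed, {offending metric: size of drop})."""
    regressions = {
        metric: round(baseline[metric] - current[metric], 4)
        for metric in baseline
        if baseline[metric] - current[metric] > max_drop
    }
    return not regressions, regressions
```

Because the gate compares against the rolling baseline rather than a fixed bar, it keeps working as the system improves: a faithfulness of 0.84 passes when the baseline is 0.85 but fails once the baseline has climbed to 0.90.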

Per-Stage Measurement Cost

Evaluation is not free. Cost breakdown per query per run:

| Metric | LLM-judge calls | Token volume per query | Approximate cost (GPT-4o judge) | Sampling strategy |
| --- | --- | --- | --- | --- |
| Context Precision | 1 call per retrieved chunk (relevance classification) | ~300-500 tokens per call | $0.003-0.008 per query (top-5 retrieval) | Full golden set every deploy |
| Context Recall | 1 call per ground-truth chunk (was-it-retrieved check) | ~200-400 tokens per call | $0.002-0.005 per query | Full golden set every deploy |
| Faithfulness (claim-level) | 1 claim-extraction call + N claim-verification calls | ~500 + 400×N tokens (N ≈ 5-10 claims) | $0.015-0.040 per query | Full golden set every deploy |
| Answer Relevance | 1 semantic-similarity call | ~400-600 tokens | $0.004-0.008 per query | Full golden set every deploy + sampled production |

200-query golden-set full evaluation cost: $4.80-12.20 per run at GPT-4o judge pricing (the sum of the per-query ranges above, times 200). Running weekly + every deploy (4-6 deploys per week) = $30-90/week evaluation budget at small scale. This is the floor — cheaper judge models (GPT-4o-mini, Haiku) cut this 5-10× with acceptable quality loss on most metrics except claim-level faithfulness, where the stronger judge is load-bearing.
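
The arithmetic, reproduced as a sketch so the budget can be re-derived for a different golden-set size (the per-query ranges come straight from the table above):

```python
PER_QUERY_COST_USD = {  # (low, high) per query, GPT-4o judge, from the table
    "context_precision": (0.003, 0.008),
    "context_recall":    (0.002, 0.005),
    "faithfulness":      (0.015, 0.040),
    "answer_relevance":  (0.004, 0.008),
}

def golden_set_run_cost(n_queries: int = 200) -> tuple[float, float]:
    """Cost range in USD for one full golden-set evaluation run."""
    low = n_queries * sum(lo for lo, _ in PER_QUERY_COST_USD.values())
    high = n_queries * sum(hi for _, hi in PER_QUERY_COST_USD.values())
    return round(low, 2), round(high, 2)
```

Swapping in a cheaper judge means editing the table of per-query ranges, not the formula: the cost structure (per-chunk calls for precision/recall, per-claim calls for faithfulness) stays the same.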

Production Monitoring Architecture

Offline golden-set evaluation catches release regressions. Production monitoring catches drift (the corpus changes, query distribution shifts, user behavior evolves):

| Layer | Technique | Cost per 1K requests | What it catches |
| --- | --- | --- | --- |
| Trace logging | Capture query + retrieved chunks + generated answer for every request | $0.01-0.05 (storage) | Audit trail for postmortems |
| Heuristic flags | “I don’t know” rate, answer-length distribution, citation presence | $0 (string checks) | Obvious failure surges |
| Sampled groundedness | TruLens groundedness on a 1% sample | $0.50-2.00 | Per-answer hallucination detection |
| Sampled answer relevance | TruLens answer relevance on a 1% sample | $0.40-1.50 | Off-topic drift |
| Embedding-drift detection | Track the embedding distribution of queries over time; alert on distribution shift | $0.02-0.10 | Query-distribution changes before metric regression manifests |
| Weekly production-labeled eval | Sample 50 production queries, human-label, compute all 4 metrics | $50-200/week labor | Ground-truth quality trend |

The embedding-drift detection is a leading indicator. By the time the faithfulness metric has dropped 5%, users have been getting degraded answers for days. Monitoring query-distribution drift catches the shift before it manifests as measurable quality loss.
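
A crude but serviceable drift detector: compare the centroid of a recent window of query embeddings against a frozen baseline centroid via cosine distance. This is a stdlib sketch under stated assumptions — the 0.05 threshold is illustrative, and production systems typically prefer proper distributional tests (e.g., population stability index or MMD) over a centroid shift:

```python
import math

def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a batch of embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def query_drift_alert(baseline_embeddings: list[list[float]],
                      window_embeddings: list[list[float]],
                      threshold: float = 0.05) -> bool:
    """True when this window's query centroid has drifted from baseline --
    a leading indicator that fires before quality metrics regress."""
    return cosine_distance(centroid(baseline_embeddings),
                           centroid(window_embeddings)) > threshold
```

The baseline centroid is computed once from the traffic the golden set was sampled from; a firing alert is also the signal to trigger the 90-day golden-set refresh early.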

Anti-Patterns

| Anti-pattern | Why teams do it | Why it fails | Correct pattern |
| --- | --- | --- | --- |
| Aggregate-score-only gating | Simpler threshold; dashboards look cleaner | Masks single-claim hallucinations and rare-intent failures | Claim-level faithfulness + per-intent-cluster breakdown |
| Synthetic golden set | No labeling cost | Systematically misses ambiguous production queries | Production-sampled golden set with stratified intent labeling |
| Single-framework dependence | Operational simplicity | Framework weak spots become blind spots | RAGAS offline + TruLens online for complementary coverage |
| Threshold without delta | Easy to explain | Absolute bars don’t adapt to model improvements | Delta-from-baseline thresholds tied to rolling averages |
| No golden-set refresh | Labor cost | Golden set drifts from the production distribution | 90-day refresh cadence with distribution-fit re-verification |
| LLM judge without calibration | Speed | Judge model has biases that correlate with generator biases | Human-calibration sample + inter-judge agreement checks |

Honest Limitations

  • LLM-judge metrics are not ground truth. Faithfulness scores from GPT-4o judging GPT-4o-generated answers have systematic biases — judges favor answers from similar models. Human calibration on 10-20% of the evaluation sample is required for high-stakes deployment.
  • Context Recall requires ground-truth labels that many teams will never have. If you cannot afford domain-expert labeling, accept that you will be blind to under-retrieval failures. Production monitoring of “I don’t know” rate and answer-length distribution is a partial substitute — not a replacement.
  • Evaluation cost scales with corpus change velocity. A corpus that updates weekly requires golden-set re-verification weekly (labeled chunks may have moved). Teams budget for evaluation build-out but not for evaluation maintenance.
  • Metric correlation is not orthogonality. Answer Relevance and Context Precision correlate highly in practice. A drop in one often shows up in the other. Treating the four metrics as fully independent overstates evaluation coverage.
  • Claim decomposition fails on conversational and creative outputs. Claim-level faithfulness assumes the generated answer decomposes cleanly into factual assertions. Narrative answers, summaries with qualifications, and hedged responses confuse the claim extractor. Expect 5-15% of production answers to have ambiguous claim structure.
  • Regression-detection thresholds require historical baseline. The first 30 days of evaluation produces unstable thresholds. During this period, treat all metric readings as directional rather than gate-worthy.
  • Framework benchmarks are not production benchmarks. RAGAS, TruLens, and ARES all have published accuracy numbers on academic datasets. Your production distribution differs from academic datasets. Calibrate the framework on your golden set before trusting published numbers.
  • Evaluation frameworks do not catch upstream data-quality issues. If your corpus contains duplicate documents with contradictory content, faithfulness scoring treats “answers match document A” and “answers match document B” both as supported — missing the fact that the corpus itself is inconsistent. Corpus-quality auditing is a separate discipline.

The Production-Readiness Bar

A RAG system is production-ready when:

  • All four metrics have been measured on a 150+ query golden set with production-distribution coverage.
  • Claim-level faithfulness is implemented — aggregate-only faithfulness fails the bar.
  • Regression-detection pipeline gates deploy on 2+ metrics against rolling baseline.
  • Production monitoring samples ≥0.5% of traffic for groundedness + answer-relevance.
  • Golden-set refresh cadence is documented and operational.
  • LLM-judge calibration against human labels has been performed within 90 days.

RAG systems that meet this bar still fail in production — but their failures are detectable, diagnosable, and bounded rather than silent.