Your RAG System Says It’s Grounded — But Is It Actually Faithful to the Retrieved Context?

A RAG system that retrieves the right document and then generates an answer that contradicts it is worse than no RAG at all — because the user trusts the answer more. Faithfulness measurement is the most important metric in production RAG, yet most teams either don’t measure it or measure it with tools they haven’t validated. This guide compares every major detection approach with accuracy data on standardized benchmarks, production latency measurements, cost per query, and the failure modes each tool misses.

The Detection Landscape in 2026

Hallucination detection has matured from research curiosity to production necessity. There are now four categories of detection methods, each with distinct strengths and failure modes:

| Category | How it works | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Lexical overlap | Token/n-gram matching between source and output | Fast, cheap, no model needed | Misses paraphrasing, semantic hallucination |
| Embedding similarity | Vector distance between source and output | Captures paraphrasing, moderate cost | Can’t distinguish faithful paraphrase from hallucination |
| NLI-based | Natural Language Inference model checks entailment | Strong at detecting contradictions | Misses unsupported (but non-contradicting) claims |
| LLM-as-judge | Separate LLM evaluates faithfulness | Highest accuracy, handles nuance | Expensive, slow, judges can hallucinate too |

The honest assessment: No single method achieves >90% accuracy alone. Production systems should layer methods — fast/cheap for screening, expensive/accurate for escalation.
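
As a reference point for the cheapest category, lexical overlap fits in a few lines of plain Python; the sketch below is illustrative and not taken from any of the tools compared here:

```python
def ngrams(tokens, n=3):
    """Return the set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_support(source: str, output: str, n: int = 3) -> float:
    """Fraction of output n-grams that also appear in the source.

    A low score means the output is largely ungrounded at the surface
    level; a high score does NOT guarantee faithfulness, since paraphrase
    and semantic errors pass this check.
    """
    src = ngrams(source.lower().split(), n)
    out = ngrams(output.lower().split(), n)
    if not out:
        return 0.0
    return len(out & src) / len(out)
```

This is the screening-tier building block: cheap enough to run on every query, and only useful as a first filter in front of more accurate methods.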

Tool Comparison — The Framework-Level View

RAGAS (Retrieval Augmented Generation Assessment)

| Metric | Score/Value |
| --- | --- |
| Faithfulness accuracy | 82% (on RAGAS benchmark), 74-78% (on custom enterprise data) |
| Latency per evaluation | 800-1,500ms (depends on LLM used as judge) |
| Cost per evaluation | $0.01-0.08 (varies by judge model) |
| Open source | Yes (MIT license) |
| Metrics included | Faithfulness, answer relevancy, context precision, context recall |
| Judge model support | Any LLM via LangChain (GPT-4o, Claude, open-source) |
| Failure modes | Sensitive to judge prompt template, inconsistent on long contexts |

RAGAS decomposes the generated answer into individual claims, then checks each claim against the retrieved context using an LLM judge. This claim-level decomposition is its core strength — it catches partial hallucinations where part of the answer is faithful and part is fabricated.

When to use RAGAS: When you need a comprehensive evaluation framework that covers retrieval quality AND generation quality. Best for offline evaluation pipelines and batch quality monitoring.

When RAGAS struggles: Real-time production detection (its latency is too high for inline use), very long contexts (claim decomposition becomes unreliable beyond ~8K context tokens), and numerical claims (LLM judges are weak at verifying numbers).

DeepEval

| Metric | Score/Value |
| --- | --- |
| Faithfulness accuracy | 84% (proprietary benchmark), 76-80% (independent evaluation) |
| Latency per evaluation | 600-1,200ms |
| Cost per evaluation | $0.01-0.06 |
| Open source | Yes (Apache 2.0) |
| Metrics included | Faithfulness, hallucination, answer relevancy, contextual precision/recall, bias, toxicity |
| Judge model support | GPT-4o (default), Claude, custom |
| Failure modes | Default prompts optimized for GPT-4o, reduced accuracy with other judges |

DeepEval positions itself as the pytest of LLM evaluation. Its API is developer-friendly — you write evaluation tests like unit tests. The hallucination metric specifically checks whether the LLM output contains information not present in the retrieved context.

When to use DeepEval: When you want to integrate hallucination detection into CI/CD pipelines. DeepEval’s test-runner pattern makes it natural to add test_no_hallucination() alongside traditional test suites.

When DeepEval struggles: Same fundamental limitation as RAGAS — LLM-as-judge accuracy caps at 80-85% on diverse enterprise data. Custom domain content (medical, legal, financial) requires prompt tuning to match your definition of “hallucination.”
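
The CI pattern itself is easy to sketch without committing to a framework. The scorer below is a deliberately crude stand-in (token coverage), not DeepEval's actual API; in practice you would swap in whichever detector you adopt and keep the test shape:

```python
def faithfulness_score(context: str, answer: str) -> float:
    """Placeholder detector: fraction of answer tokens found in the
    context. Replace with a real faithfulness metric in production."""
    ctx = set(context.lower().split())
    ans = answer.lower().split()
    return sum(tok in ctx for tok in ans) / max(len(ans), 1)

THRESHOLD = 0.7  # illustrative; calibrate on labeled data

def test_no_hallucination():
    """Runs in CI alongside ordinary unit tests."""
    context = "The warranty covers parts and labor for 24 months."
    answer = "The warranty covers parts and labor for 24 months."
    assert faithfulness_score(context, answer) >= THRESHOLD
```

Because the test is a plain function with an assertion, pytest discovers and runs it like any other regression test.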

Galileo

| Metric | Score/Value |
| --- | --- |
| Faithfulness accuracy | 86% (vendor-reported), 79-83% (third-party evaluation) |
| Latency per evaluation | 200-500ms (uses specialized models, not general LLMs) |
| Cost per evaluation | $0.005-0.02 (SaaS pricing) |
| Open source | No (commercial SaaS) |
| Metrics included | ChainPoll faithfulness, context adherence, completeness, chunk attribution |
| Approach | ChainPoll — multiple lightweight model passes, majority vote |
| Failure modes | Black-box scoring, less customizable, SaaS dependency |

Galileo’s ChainPoll approach is notable: instead of one expensive LLM judge call, it runs multiple lightweight model passes and takes the majority vote. This reduces per-call variance (the biggest problem with single-judge approaches) at moderate cost.
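
The majority-vote mechanism itself is simple to sketch. Galileo's judge models are proprietary, so the judge here is an arbitrary callable you supply:

```python
from collections import Counter

def majority_vote_faithful(judge, context: str, answer: str, k: int = 5) -> bool:
    """Run k independent judge passes and return the majority verdict.

    `judge` is any callable returning True (faithful) or False
    (hallucinated); in ChainPoll it is a lightweight model pass.
    An odd k avoids ties, and aggregating k noisy verdicts reduces
    the per-call variance of a single judge.
    """
    votes = Counter(judge(context, answer) for _ in range(k))
    return votes[True] > votes[False]
```

With a judge that is right most of the time but noisy, the aggregated verdict is more stable than any single pass.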

When to use Galileo: When you need production-speed detection (200-500ms is fast enough for inline use) with better accuracy than embedding-only methods. The SaaS model means no infrastructure to manage.

When Galileo struggles: Customization is limited compared to open-source alternatives. Domain-specific hallucination definitions can’t be easily tuned. Vendor lock-in for a quality-critical component.

TruLens

| Metric | Score/Value |
| --- | --- |
| Faithfulness accuracy | 80% (on standard benchmarks), 72-76% (custom enterprise data) |
| Latency per evaluation | 500-1,000ms |
| Cost per evaluation | $0.01-0.05 |
| Open source | Yes (MIT license) |
| Metrics included | Groundedness, relevance, moderation, custom feedback functions |
| Judge model support | Any LLM, plus embedding-based and custom functions |
| Failure modes | Less accurate than RAGAS/DeepEval on pure faithfulness, broader scope dilutes focus |

TruLens takes a broader approach — it’s an observability framework for LLM apps rather than a pure hallucination detector. Groundedness (its faithfulness metric) is one of many tracked signals.

When to use TruLens: When you need a full observability stack including hallucination detection, relevance tracking, and custom quality signals. Good for teams that want one framework covering monitoring, evaluation, and debugging.

When TruLens struggles: Pure detection accuracy is lower than specialized tools. If hallucination detection is your primary need, RAGAS or DeepEval are more focused.

Head-to-Head Comparison Matrix

| Dimension | RAGAS | DeepEval | Galileo | TruLens |
| --- | --- | --- | --- | --- |
| Faithfulness accuracy | 74-82% | 76-84% | 79-86% | 72-80% |
| Latency | 800-1,500ms | 600-1,200ms | 200-500ms | 500-1,000ms |
| Cost/eval | $0.01-0.08 | $0.01-0.06 | $0.005-0.02 | $0.01-0.05 |
| Open source | Yes | Yes | No | Yes |
| CI/CD integration | Manual | Native (pytest) | API-based | Manual |
| Real-time capable | No | No | Yes (with caveats) | No |
| Long context handling | Degrades >8K | Degrades >8K | Better (ChainPoll) | Degrades >8K |
| Numerical hallucination | Weak | Weak | Moderate | Weak |
| Customizability | High | High | Limited | High |
| Learning curve | Medium | Low | Low | Medium |

Building a Production Detection Pipeline

No single tool covers all detection needs. Production systems need a layered approach:

Layer 1 — Fast Screening (< 100ms, < $0.002/query)

| Check | Implementation | What it catches |
| --- | --- | --- |
| Length anomaly | If output length > 3x median for query type, flag | Runaway generation, excessive fabrication |
| Source coverage | % of output tokens that appear in retrieved context | Completely ungrounded responses |
| Confidence calibration | If model self-reports low confidence, flag | Some hallucinations (unreliable — models are poorly calibrated) |
| Regex pattern checks | Citation format validation, number range checks | Malformed citations, obviously wrong numbers |

Layer 1 catches 30-40% of hallucinations at near-zero cost. Its primary value is reducing the volume that reaches expensive Layer 2.
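
A minimal Layer 1 sketch combining the length-anomaly and source-coverage checks from the table; the 3x multiplier and 50% coverage threshold are illustrative and should be tuned on your own traffic:

```python
def layer1_flags(query_median_len: int, context: str, output: str) -> list[str]:
    """Cheap screening checks; returns the list of triggered flags."""
    flags = []
    out_tokens = output.split()
    # Length anomaly: runaway generation relative to the query type's median.
    if len(out_tokens) > 3 * query_median_len:
        flags.append("length_anomaly")
    # Source coverage: fraction of output tokens that appear in the context.
    ctx = set(context.lower().split())
    coverage = sum(t.lower() in ctx for t in out_tokens) / max(len(out_tokens), 1)
    if coverage < 0.5:
        flags.append("low_source_coverage")
    return flags
```

Any non-empty flag list routes the query to Layer 2 or 3 rather than blocking it outright.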

Layer 2 — NLI/Embedding Check (200-500ms, $0.003-0.01/query)

| Check | Implementation | What it catches |
| --- | --- | --- |
| NLI entailment | Classify each output sentence as entailed/neutral/contradicted by context | Contradictions and unsupported claims |
| Claim decomposition | Split output into atomic claims, check each | Partial hallucinations hidden in faithful responses |
| Cross-reference | Compare output claims against multiple retrieved chunks | Claims supported by wrong context |

Layer 2 catches an additional 30-40% of hallucinations that pass Layer 1. Run this on every query in production.
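
Claim decomposition plus entailment can be sketched with a pluggable classifier. The sentence splitter below is deliberately naive, and `entails` stands in for a real NLI model (e.g., a DeBERTa-v3 checkpoint fine-tuned on MNLI):

```python
import re

def split_claims(output: str) -> list[str]:
    """Naive sentence-level claim split; production systems use an LLM
    or a parser to get truly atomic claims."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]

def layer2_verdicts(entails, context: str, output: str) -> dict[str, str]:
    """Check each claim against the context.

    `entails` is any callable mapping (context, claim) to one of
    'entailed' / 'neutral' / 'contradicted'. The 'neutral' bucket is
    the unsupported-but-not-contradicting case that pure contradiction
    checks miss.
    """
    return {claim: entails(context, claim) for claim in split_claims(output)}
```

Per-claim verdicts are what let this layer catch partial hallucinations that a whole-response score would average away.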

Layer 3 — LLM Judge (800-2,000ms, $0.02-0.15/query)

| Check | Implementation | What it catches |
| --- | --- | --- |
| Full faithfulness audit | LLM evaluates output against context with structured rubric | Subtle hallucinations, reasoning errors |
| Consistency check | Generate answer 3 times, flag if responses disagree | Unstable hallucinations |
| Domain-specific rules | LLM checks against domain constraints | Domain rule violations |

Layer 3 is expensive. Run it on the 5-10% of queries flagged by Layer 1 or Layer 2, plus a random 1-2% sample for baseline monitoring.
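
The consistency check is straightforward to sketch. Exact-match comparison is used here for simplicity; production systems would compare the regenerated answers with embeddings or NLI:

```python
def consistency_flag(generate, query: str, n: int = 3) -> bool:
    """Regenerate the answer n times and flag if the runs disagree.

    `generate` is your RAG answer function. Returns True when the
    answers are unstable, which correlates with hallucination: a model
    fabricating content tends to fabricate differently each time.
    """
    answers = {generate(query) for _ in range(n)}
    return len(answers) > 1
```

Note this multiplies generation cost by n, which is part of why Layer 3 runs only on flagged and sampled queries.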

The Math

| Pipeline stage | Queries processed | Cost per query | Total cost (100K queries) |
| --- | --- | --- | --- |
| Layer 1 (all queries) | 100,000 | $0.001 | $100 |
| Layer 2 (all queries) | 100,000 | $0.005 | $500 |
| Layer 3 (7% flagged + 2% sample) | 9,000 | $0.08 | $720 |
| Total | | $0.013 avg | $1,320 |

This three-layer pipeline catches approximately 85-90% of hallucinations at $0.013 per query — a fraction of the cost of running LLM-as-judge on every query ($0.08 × 100K = $8,000).
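
The arithmetic behind the table, for adapting to your own volumes and prices (the per-query figures are this article's estimates, not vendor quotes):

```python
QUERIES = 100_000
layer1 = QUERIES * 0.001                 # fast screening, every query
layer2 = QUERIES * 0.005                 # NLI/embedding check, every query
layer3 = int(QUERIES * 0.09) * 0.08      # LLM judge: 7% flagged + 2% sample
total = layer1 + layer2 + layer3         # ~$1,320
avg = total / QUERIES                    # ~$0.013 per query
everything_judged = QUERIES * 0.08       # naive baseline: $8,000
```

Adjusting the flag rate or the Layer 3 judge price shows how sensitive the total is to escalation volume.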

Evaluation Methodology — How to Benchmark Detection

Your detection system is only as good as your evaluation of it. The standard methodology:

  1. Create a labeled dataset. 500 query-context-response triples, manually labeled as faithful/hallucinated. Include borderline cases — they’re where tools disagree.

  2. Measure precision and recall separately. High precision = few false alarms. High recall = few missed hallucinations. Most tools optimize for recall (catching more hallucinations) at the cost of precision (more false alarms).

| Tool configuration | Precision | Recall | F1 | False positive rate |
| --- | --- | --- | --- | --- |
| RAGAS (default) | 71% | 85% | 77% | 29% |
| RAGAS (strict threshold) | 82% | 72% | 77% | 18% |
| DeepEval (default) | 74% | 83% | 78% | 26% |
| Galileo ChainPoll | 79% | 84% | 81% | 21% |
| NLI-only (DeBERTa-v3) | 77% | 74% | 75% | 23% |
| 3-layer pipeline (above) | 80% | 88% | 84% | 20% |

  3. Measure on YOUR data, not benchmarks. Vendor-reported accuracy is on their benchmark. Your data has different context lengths, domain terminology, and hallucination patterns. Expect 5-10% lower accuracy on custom data.

  4. Monitor false positive rate in production. A 20% false positive rate means 1 in 5 faithful responses gets flagged. If flagged responses trigger user-visible warnings, this degrades trust in the system. If flagged responses get blocked, this blocks good responses.
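
Once the dataset is labeled, the metric computation in step 2 reduces to a small amount of bookkeeping. A sketch, with True meaning "hallucinated":

```python
def detection_metrics(labels, predictions):
    """Precision/recall/F1 for a labeled evaluation set.

    `labels` and `predictions` are parallel lists of booleans where
    True = hallucinated. Precision penalizes false alarms; recall
    penalizes missed hallucinations.
    """
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Running this per tool configuration on the same 500 triples produces a table like the one above for your own data.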

The Unsolved Problems

Three hallucination detection problems have no production-ready solution in 2026:

  1. Reasoning hallucination at scale. Detecting whether a logical chain is valid requires understanding the logic — current NLI and LLM judges catch contradictions but miss invalid inferences. No tool exceeds 75% accuracy on reasoning hallucination.

  2. Cross-document hallucination. When the model correctly retrieves information from multiple documents but incorrectly combines them (entity conflation, temporal confusion), current tools treat each claim in isolation and miss the combination error.

  3. Subtle numerical hallucination. A model generating “revenue grew 14.2%” when the source says “revenue grew 12.4%” is a 2-percentage-point numerical hallucination that passes every qualitative check. Only explicit numerical extraction and comparison catches this.
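
Of the three, numerical hallucination is the one a deterministic check addresses outright. A strict sketch, assuming no unit conversion or rounding between source and output:

```python
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def numbers_match(source: str, output: str) -> bool:
    """Flag numerical hallucination: every number in the output must
    appear verbatim in the source. Deliberately strict; rounding and
    unit conversion need domain-specific handling on top of this."""
    src_nums = set(NUM.findall(source))
    return all(n in src_nums for n in NUM.findall(output))
```

This catches the 14.2% vs. 12.4% case above that qualitative judges routinely miss.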

How to Apply This

Use the token-counter tool to estimate costs for adding detection layers to your inference pipeline.

Start with the three-layer pipeline architecture above — it provides the best accuracy-to-cost ratio for most production workloads.

Benchmark tools on your data before committing. Create 500 labeled examples from your production queries and test each tool’s precision, recall, and F1.

Choose your precision/recall tradeoff based on the cost of each error type. If false positives are expensive (blocking good responses), optimize for precision. If missed hallucinations are expensive (medical/legal), optimize for recall.

Deploy in shadow mode for 2 weeks before enabling blocking. Measure false positive rate on real traffic to calibrate thresholds.

Track detection accuracy weekly — hallucination patterns shift as users learn the system’s behavior and as model updates change output distributions.

Honest Limitations

All accuracy numbers are based on English-language evaluation; multilingual detection is significantly less reliable. Vendor-reported accuracy consistently exceeds independent evaluation by 5-10%. Detection latency assumes standard cloud deployment; self-hosted models have different performance profiles. The three-layer pipeline cost estimate assumes GPT-4o-mini for Layer 2 and Claude Sonnet for Layer 3; pricing changes directly affect the math. LLM-as-judge approaches inherit the judge model’s own hallucination tendencies — a judge that hallucinates 5% of the time sets a floor on detection accuracy. No detection system eliminates the need for domain-specific human review on high-stakes outputs.