Hallucination Detection Methods — RAG Faithfulness, Semantic Similarity, and Production Pipelines
Comparison of hallucination detection tools (RAGAS, DeepEval, Galileo, TruLens) with accuracy, cost, and latency data for production deployment.
Your RAG System Says It’s Grounded — But Is It Actually Faithful to the Retrieved Context?
A RAG system that retrieves the right document and then generates an answer that contradicts it is worse than no RAG at all — because the user trusts the answer more. Faithfulness measurement is the most important metric in production RAG, yet most teams either don’t measure it or measure it with tools they haven’t validated. This guide compares every major detection approach with accuracy data on standardized benchmarks, production latency measurements, cost per query, and the failure modes each tool misses.
The Detection Landscape in 2026
Hallucination detection has matured from research curiosity to production necessity. There are now four categories of detection methods, each with distinct strengths and failure modes:
| Category | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Lexical overlap | Token/n-gram matching between source and output | Fast, cheap, no model needed | Misses paraphrasing, semantic hallucination |
| Embedding similarity | Vector distance between source and output | Captures paraphrasing, moderate cost | Can’t distinguish faithful paraphrase from hallucination |
| NLI-based | Natural Language Inference model checks entailment | Strong at detecting contradictions | Misses unsupported (but non-contradicting) claims |
| LLM-as-judge | Separate LLM evaluates faithfulness | Highest accuracy, handles nuance | Expensive, slow, judges can hallucinate too |
The honest assessment: No single method achieves >90% accuracy alone. Production systems should layer methods — fast/cheap for screening, expensive/accurate for escalation.
Tool Comparison — The Framework-Level View
RAGAS (Retrieval Augmented Generation Assessment)
| Metric | Score/Value |
|---|---|
| Faithfulness accuracy | 82% (on RAGAS benchmark), 74-78% (on custom enterprise data) |
| Latency per evaluation | 800-1,500ms (depends on LLM used as judge) |
| Cost per evaluation | $0.01-0.08 (varies by judge model) |
| Open source | Yes (MIT license) |
| Metrics included | Faithfulness, answer relevancy, context precision, context recall |
| Judge model support | Any LLM via LangChain (GPT-4o, Claude, open-source) |
| Failure modes | Sensitive to judge prompt template, inconsistent on long contexts |
RAGAS decomposes the generated answer into individual claims, then checks each claim against the retrieved context using an LLM judge. This claim-level decomposition is its core strength — it catches partial hallucinations where part of the answer is faithful and part is fabricated.
When to use RAGAS: When you need a comprehensive evaluation framework that covers retrieval quality AND generation quality. Best for offline evaluation pipelines and batch quality monitoring.
When RAGAS struggles: Real-time production detection (too slow for inline), very long contexts (claim decomposition becomes unreliable beyond ~8K context tokens), and numerical claims (LLM judges are weak at verifying numbers).
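The claim-decomposition idea can be illustrated with a self-contained sketch. This is not RAGAS's implementation: the sentence split and the substring-containment "judge" are stand-ins for RAGAS's LLM-based claim extraction and entailment calls.

```python
from typing import Callable

def faithfulness_score(answer: str, context: str,
                       judge: Callable[[str, str], bool]) -> float:
    """Claim-level faithfulness: split the answer into claims,
    check each against the context, return the supported fraction."""
    # Naive sentence split; RAGAS uses an LLM to extract atomic claims.
    claims = [c.strip() for c in answer.split(".") if c.strip()]
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if judge(c, context))
    return supported / len(claims)

# Stub judge: substring containment stands in for an LLM entailment call.
stub_judge = lambda claim, ctx: claim.lower() in ctx.lower()

context = "The model was released in 2024. It supports 128K context."
answer = "The model was released in 2024. It runs on a single GPU."
print(faithfulness_score(answer, context, stub_judge))  # 0.5
```

The per-claim structure is what lets this style of metric catch partial hallucinations: the score above is 0.5 rather than a binary pass/fail, because one claim is grounded and one is fabricated.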
DeepEval
| Metric | Score/Value |
|---|---|
| Faithfulness accuracy | 84% (proprietary benchmark), 76-80% (independent evaluation) |
| Latency per evaluation | 600-1,200ms |
| Cost per evaluation | $0.01-0.06 |
| Open source | Yes (Apache 2.0) |
| Metrics included | Faithfulness, hallucination, answer relevancy, contextual precision/recall, bias, toxicity |
| Judge model support | GPT-4o (default), Claude, custom |
| Failure modes | Default prompts optimized for GPT-4o, reduced accuracy with other judges |
DeepEval positions itself as the pytest of LLM evaluation. Its API is developer-friendly — you write evaluation tests like unit tests. The hallucination metric specifically checks whether the LLM output contains information not present in the retrieved context.
When to use DeepEval: When you want to integrate hallucination detection into CI/CD pipelines. DeepEval’s test-runner pattern makes it natural to add test_no_hallucination() alongside traditional test suites.
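The test-runner pattern looks like this in outline. The sketch below deliberately avoids DeepEval's own API (which wires an LLM-backed faithfulness metric into assert-style tests) and uses a stub token-overlap scorer plus an illustrative 0.7 threshold, so it runs with the standard library alone.

```python
# A pytest-style faithfulness gate, with a stub scorer standing in
# for an LLM-backed metric such as DeepEval's faithfulness check.

def stub_faithfulness(output: str, context: str) -> float:
    """Token-overlap stand-in for an LLM judge: fraction of output
    words that also appear in the retrieved context."""
    out_words = output.lower().split()
    ctx_words = set(context.lower().split())
    return sum(w in ctx_words for w in out_words) / max(len(out_words), 1)

def test_no_hallucination():
    context = "Refunds are processed within 5 business days."
    output = "Refunds are processed within 5 business days."
    # Threshold is illustrative; tune it against labeled examples.
    assert stub_faithfulness(output, context) >= 0.7

test_no_hallucination()
print("faithfulness gate passed")
```

Because the gate is just a test function, it slots into an existing CI suite and fails the build when a regression pushes faithfulness below threshold.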
When DeepEval struggles: Same fundamental limitation as RAGAS — LLM-as-judge accuracy caps at 80-85% on diverse enterprise data. Custom domain content (medical, legal, financial) requires prompt tuning to match your definition of “hallucination.”
Galileo
| Metric | Score/Value |
|---|---|
| Faithfulness accuracy | 86% (vendor-reported), 79-83% (third-party evaluation) |
| Latency per evaluation | 200-500ms (uses specialized models, not general LLMs) |
| Cost per evaluation | $0.005-0.02 (SaaS pricing) |
| Open source | No (commercial SaaS) |
| Metrics included | ChainPoll faithfulness, context adherence, completeness, chunk attribution |
| Approach | ChainPoll — multiple lightweight model passes, majority vote |
| Failure modes | Black-box scoring, less customizable, SaaS dependency |
Galileo’s ChainPoll approach is notable: instead of one expensive LLM judge call, it runs multiple lightweight model passes and takes the majority vote. This reduces per-call variance (the biggest problem with single-judge approaches) at moderate cost.
When to use Galileo: When you need production-speed detection (200-500ms is fast enough for inline use) with better accuracy than embedding-only methods. The SaaS model means no infrastructure to manage.
When Galileo struggles: Customization is limited compared to open-source alternatives. Domain-specific hallucination definitions can’t be easily tuned. Vendor lock-in for a quality-critical component.
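The majority-vote idea behind ChainPoll can be sketched independently of Galileo. The three lambda "judges" below are toy stand-ins for independent lightweight model passes; the point is that averaging votes reduces single-judge variance.

```python
from statistics import mean
from typing import Callable, List

def chainpoll_style_score(answer: str, context: str,
                          judges: List[Callable[[str, str], bool]]) -> float:
    """Run several cheap judge passes and average the boolean votes;
    the mean vote acts as a groundedness-probability-style score."""
    votes = [judge(answer, context) for judge in judges]
    return mean(1.0 if v else 0.0 for v in votes)

# Three stub judges with different sensitivity, standing in for
# independent lightweight model passes.
judges = [
    lambda a, c: a.lower() in c.lower(),                               # strict containment
    lambda a, c: set(a.lower().split()) <= set(c.lower().split()),     # word subset
    lambda a, c: bool(set(a.lower().split()) & set(c.lower().split())),  # any overlap
]

print(chainpoll_style_score("the sky is blue", "the sky is blue today", judges))  # 1.0
```

With real judges the votes disagree on borderline cases, and the averaged score gives a calibration knob that a single pass/fail judgment lacks.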
TruLens
| Metric | Score/Value |
|---|---|
| Faithfulness accuracy | 80% (on standard benchmarks), 72-76% (custom enterprise data) |
| Latency per evaluation | 500-1,000ms |
| Cost per evaluation | $0.01-0.05 |
| Open source | Yes (MIT license) |
| Metrics included | Groundedness, relevance, moderation, custom feedback functions |
| Judge model support | Any LLM, plus embedding-based and custom functions |
| Failure modes | Less accurate than RAGAS/DeepEval on pure faithfulness, broader scope dilutes focus |
TruLens takes a broader approach — it’s an observability framework for LLM apps rather than a pure hallucination detector. Groundedness (its faithfulness metric) is one of many tracked signals.
When to use TruLens: When you need a full observability stack including hallucination detection, relevance tracking, and custom quality signals. Good for teams that want one framework covering monitoring, evaluation, and debugging.
When TruLens struggles: Pure detection accuracy is lower than specialized tools. If hallucination detection is your primary need, RAGAS or DeepEval are more focused.
Head-to-Head Comparison Matrix
| Dimension | RAGAS | DeepEval | Galileo | TruLens |
|---|---|---|---|---|
| Faithfulness accuracy | 74-82% | 76-84% | 79-86% | 72-80% |
| Latency | 800-1,500ms | 600-1,200ms | 200-500ms | 500-1,000ms |
| Cost/eval | $0.01-0.08 | $0.01-0.06 | $0.005-0.02 | $0.01-0.05 |
| Open source | Yes | Yes | No | Yes |
| CI/CD integration | Manual | Native (pytest) | API-based | Manual |
| Real-time capable | No | No | Yes (with caveats) | No |
| Long context handling | Degrades >8K | Degrades >8K | Better (ChainPoll) | Degrades >8K |
| Numerical hallucination | Weak | Weak | Moderate | Weak |
| Customizability | High | High | Limited | High |
| Learning curve | Medium | Low | Low | Medium |
Building a Production Detection Pipeline
No single tool covers all detection needs. Production systems need a layered approach:
Layer 1 — Fast Screening (< 100ms, < $0.002/query)
| Check | Implementation | What it catches |
|---|---|---|
| Length anomaly | If output length > 3x median for query type, flag | Runaway generation, excessive fabrication |
| Source coverage | % of output tokens that appear in retrieved context | Completely ungrounded responses |
| Confidence calibration | If model self-reports low confidence, flag | Some hallucinations (unreliable — models are poorly calibrated) |
| Regex pattern checks | Citation format validation, number range checks | Malformed citations, obviously wrong numbers |
Layer 1 catches 30-40% of hallucinations at near-zero cost. Its primary value is reducing the volume that reaches expensive Layer 2.
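A minimal sketch of the Layer 1 checks, assuming all thresholds (3x median, 50% coverage, 4 retrieved chunks) are illustrative and would be tuned on your traffic:

```python
import re
from statistics import median

def layer1_screen(output: str, context: str,
                  recent_lengths: list, num_chunks: int = 4) -> list:
    """Model-free screening flags; thresholds are illustrative."""
    flags = []
    # Length anomaly: output far longer than the recent median for this query type.
    if recent_lengths and len(output) > 3 * median(recent_lengths):
        flags.append("length_anomaly")
    # Source coverage: fraction of output words that appear in the context.
    out_words = output.lower().split()
    ctx_words = set(context.lower().split())
    coverage = sum(w in ctx_words for w in out_words) / max(len(out_words), 1)
    if coverage < 0.5:
        flags.append("low_source_coverage")
    # Citation format check: [n] must reference one of the retrieved chunks.
    for m in re.findall(r"\[(\d+)\]", output):
        if not 1 <= int(m) <= num_chunks:
            flags.append("bad_citation")
    return flags

print(layer1_screen("totally unrelated fabricated text [9]",
                    "the retrieved context", [40, 42, 38]))
# ['low_source_coverage', 'bad_citation']
```

Everything here is string arithmetic, so the whole layer runs in microseconds and costs nothing beyond CPU.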
Layer 2 — NLI/Embedding Check (200-500ms, $0.003-0.01/query)
| Check | Implementation | What it catches |
|---|---|---|
| NLI entailment | Classify each output sentence as entailed/neutral/contradicted by context | Contradictions and unsupported claims |
| Claim decomposition | Split output into atomic claims, check each | Partial hallucinations hidden in faithful responses |
| Cross-reference | Compare output claims against multiple retrieved chunks | Claims supported by wrong context |
Layer 2 catches an additional 30-40% of hallucinations that pass Layer 1. Run this on every query in production.
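The per-sentence entailment check can be sketched as follows. The stub classifier here is a word-subset heuristic; in production, `classify` would wrap a real NLI model (e.g. a DeBERTa-v3 entailment fine-tune) returning entailment/neutral/contradiction labels.

```python
def nli_layer(output: str, context: str, classify) -> dict:
    """Per-sentence entailment check. classify(premise, hypothesis)
    returns 'entailment', 'neutral', or 'contradiction'; anything
    other than entailment is flagged as unsupported."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    labels = {s: classify(context, s) for s in sentences}
    flagged = [s for s, lbl in labels.items() if lbl != "entailment"]
    return {"labels": labels, "flagged": flagged}

# Stub classifier: word-subset -> entailment, otherwise neutral.
def stub_classify(premise: str, hypothesis: str) -> str:
    return ("entailment"
            if set(hypothesis.lower().split()) <= set(premise.lower().split())
            else "neutral")

result = nli_layer("The plan costs $10. It includes support.",
                   "The plan costs $10 per month.", stub_classify)
print(result["flagged"])  # ['It includes support']
```

Note that the "neutral" label is what catches unsupported-but-non-contradicting claims, the class that contradiction-only checks miss.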
Layer 3 — LLM Judge (800-2,000ms, $0.02-0.15/query)
| Check | Implementation | What it catches |
|---|---|---|
| Full faithfulness audit | LLM evaluates output against context with structured rubric | Subtle hallucinations, reasoning errors |
| Consistency check | Generate answer 3 times, flag if responses disagree | Unstable hallucinations |
| Domain-specific rules | LLM checks against domain constraints | Domain rule violations |
Layer 3 is expensive. Run it on the 5-10% of queries flagged by Layer 1 or Layer 2, plus a random 1-2% sample for baseline monitoring.
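The consistency check in the table above can be sketched with word-overlap agreement. A real implementation would compare the sampled answers with embeddings or an LLM judge; the 0.6 agreement threshold and stub generator are illustrative.

```python
from itertools import combinations
from typing import Callable

def consistency_check(generate: Callable[[], str], n: int = 3,
                      min_agreement: float = 0.6) -> bool:
    """Sample the answer n times; return True (flag) when average
    pairwise Jaccard word overlap falls below the threshold."""
    answers = [generate() for _ in range(n)]
    def overlap(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)
    pairs = list(combinations(answers, 2))
    avg = sum(overlap(a, b) for a, b in pairs) / len(pairs)
    return avg < min_agreement

# Stub generator cycling through three inconsistent samples.
samples = iter(["Paris is the capital.", "Lyon is the capital.", "It is Nice."])
print(consistency_check(lambda: next(samples)))  # True -> flagged as unstable
```

The underlying assumption is that stable, grounded answers reproduce across samples while hallucinations vary, which is why disagreement is a useful (if imperfect) hallucination signal.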
The Math
| Pipeline stage | Queries processed | Cost per query | Total cost (100K queries) |
|---|---|---|---|
| Layer 1 (all queries) | 100,000 | $0.001 | $100 |
| Layer 2 (all queries) | 100,000 | $0.005 | $500 |
| Layer 3 (7% flagged + 2% sample) | 9,000 | $0.08 | $720 |
| Total | 100,000 | $0.013 avg | $1,320 |
This three-layer pipeline catches approximately 85-90% of hallucinations at $0.013 per query — a fraction of the cost of running LLM-as-judge on every query ($0.08 × 100K = $8,000).
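The table's arithmetic can be reproduced directly. Per-query prices are the figures assumed above (GPT-4o-mini-class for Layer 2, Claude-Sonnet-class for Layer 3); plug in your own pricing.

```python
def pipeline_cost(total_queries: int, l3_fraction: float = 0.09,
                  l1_cost: float = 0.001, l2_cost: float = 0.005,
                  l3_cost: float = 0.08) -> dict:
    """Three-layer cost model: Layers 1-2 run on every query,
    Layer 3 only on the flagged + sampled fraction."""
    l1 = total_queries * l1_cost
    l2 = total_queries * l2_cost
    l3 = total_queries * l3_fraction * l3_cost
    total = l1 + l2 + l3
    return {"layer1": l1, "layer2": l2, "layer3": l3,
            "total": total, "per_query": total / total_queries}

print(pipeline_cost(100_000))
```

Because Layer 3 dominates per-call cost, the escalation fraction is the main lever: doubling it from 9% to 18% roughly doubles the most expensive line item while leaving the rest unchanged.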
Evaluation Methodology — How to Benchmark Detection
Your detection system is only as good as your evaluation of it. The standard methodology:
1. Create a labeled dataset. 500 query-context-response triples, manually labeled as faithful/hallucinated. Include borderline cases — they’re where tools disagree.
2. Measure precision and recall separately. High precision = few false alarms. High recall = few missed hallucinations. Most tools optimize for recall (catching more hallucinations) at the cost of precision (more false alarms).
| Tool configuration | Precision | Recall | F1 | False positive rate |
|---|---|---|---|---|
| RAGAS (default) | 71% | 85% | 77% | 29% |
| RAGAS (strict threshold) | 82% | 72% | 77% | 18% |
| DeepEval (default) | 74% | 83% | 78% | 26% |
| Galileo ChainPoll | 79% | 84% | 81% | 21% |
| NLI-only (DeBERTa-v3) | 77% | 74% | 75% | 23% |
| 3-layer pipeline (above) | 80% | 88% | 84% | 20% |
3. Measure on YOUR data, not benchmarks. Vendor-reported accuracy is on their benchmark. Your data has different context lengths, domain terminology, and hallucination patterns. Expect 5-10% lower accuracy on custom data.
4. Monitor false positive rate in production. A 20% false positive rate means 1 in 5 faithful responses gets flagged. If flagged responses trigger user-visible warnings, this degrades trust in the system. If flagged responses get blocked, this blocks good responses.
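The precision/recall computation described above is worth implementing yourself rather than trusting framework-reported aggregates. A minimal sketch, treating "hallucinated" as the positive class:

```python
def detection_metrics(y_true: list, y_pred: list) -> dict:
    """Precision/recall/F1/FPR for hallucination detection, where
    True means 'hallucinated' (the positive class)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    negatives = sum(not t for t in y_true)
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_positive_rate": fp / negatives if negatives else 0.0}

# 6 labeled examples: 3 hallucinated, 3 faithful; the detector flags 3.
truth = [True, True, True, False, False, False]
pred  = [True, True, False, True, False, False]
print(detection_metrics(truth, pred))
```

Running this over your 500 labeled triples per tool and per threshold gives exactly the kind of table shown above, but grounded in your own data.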
The Unsolved Problems
Three hallucination detection problems have no production-ready solution in 2026:
1. Reasoning hallucination at scale. Detecting whether a logical chain is valid requires understanding the logic — current NLI and LLM judges catch contradictions but miss invalid inferences. No tool exceeds 75% accuracy on reasoning hallucination.
2. Cross-document hallucination. When the model correctly retrieves information from multiple documents but incorrectly combines them (entity conflation, temporal confusion), current tools treat each claim in isolation and miss the combination error.
3. Subtle numerical hallucination. A model generating “revenue grew 14.2%” when the source says “revenue grew 12.4%” has transposed digits — a 1.8-percentage-point numerical hallucination that passes every qualitative check. Only explicit numerical extraction and comparison catches this.
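Explicit numeric extraction and comparison is straightforward to bolt onto any pipeline. A minimal sketch, assuming exact matching (rel_tol=0) by default; loosen the tolerance for legitimately rounded figures.

```python
import re

def numeric_mismatches(output: str, source: str,
                       rel_tol: float = 0.0) -> list:
    """Extract numbers from the output and return any that never
    appear (within rel_tol) among the source's numbers."""
    pat = r"-?\d+(?:\.\d+)?"
    src_nums = [float(x) for x in re.findall(pat, source)]
    out_nums = [float(x) for x in re.findall(pat, output)]
    def matches(n: float) -> bool:
        return any(abs(n - s) <= rel_tol * max(abs(s), 1.0) for s in src_nums)
    return [n for n in out_nums if not matches(n)]

source = "Revenue grew 12.4% year over year to $2.1B."
output = "Revenue grew 14.2% year over year to $2.1B."
print(numeric_mismatches(output, source))  # [14.2]
```

This check is blunt (it ignores which number belongs to which claim, and misses unit conversions), but it reliably catches the digit-transposition class that qualitative judges pass.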
How to Apply This
Use the token-counter tool to estimate costs for adding detection layers to your inference pipeline.
Start with the three-layer pipeline architecture above — it provides the best accuracy-to-cost ratio for most production workloads.
Benchmark tools on your data before committing. Create 500 labeled examples from your production queries and test each tool’s precision, recall, and F1.
Choose your precision/recall tradeoff based on the cost of each error type. If false positives are expensive (blocking good responses), optimize for precision. If missed hallucinations are expensive (medical/legal), optimize for recall.
Deploy in shadow mode for 2 weeks before enabling blocking. Measure false positive rate on real traffic to calibrate thresholds.
Track detection accuracy weekly — hallucination patterns shift as users learn the system’s behavior and as model updates change output distributions.
Honest Limitations
All accuracy numbers are based on English-language evaluation; multilingual detection is significantly less reliable. Vendor-reported accuracy consistently exceeds independent evaluation by 5-10%. Detection latency assumes standard cloud deployment; self-hosted models have different performance profiles. The three-layer pipeline cost estimate assumes GPT-4o-mini for Layer 2 and Claude Sonnet for Layer 3; pricing changes directly affect the math. LLM-as-judge approaches inherit the judge model’s own hallucination tendencies — a judge that hallucinates 5% of the time sets a floor on detection accuracy. No detection system eliminates the need for domain-specific human review on high-stakes outputs.