Why Does Your AI System Confidently State Things That Are Completely Wrong?

Every production LLM hallucinates. The rate varies from roughly 3% on well-grounded factual queries to over 20% on open-ended reasoning tasks — but zero is not achievable with current architectures. The question is not whether your system hallucinates but which types of hallucination your users encounter, how often, and what each type costs when it reaches production. This guide provides the hallucination taxonomy, per-type detection methods, and the diagnostic framework for identifying which hallucination patterns dominate your specific workload.

The Hallucination Taxonomy

Not all hallucinations are created equal. A model inventing a citation (attribution hallucination) is a fundamentally different failure from a model making an incorrect logical deduction (reasoning hallucination), and each requires a different detection strategy. Conflating them into a single “the AI made stuff up” category makes the problem unsolvable because you can’t fix what you can’t classify.

The Six Types

| Type | Definition | Example | Detection difficulty | Production impact |
|---|---|---|---|---|
| Factual fabrication | Inventing facts that don’t exist | “The Eiffel Tower was built in 1903” (actual: 1889) | Medium — requires knowledge base | High — user trusts incorrect information |
| Attribution hallucination | Citing sources that don’t exist or misquoting real ones | “According to Smith et al. (2023)…” when no such paper exists | Easy — automated source verification | Very high — destroys credibility |
| Reasoning hallucination | Correct facts but flawed logical chain | Correct premise A, correct premise B, invalid conclusion C | Hard — requires logic validation | High — conclusions are wrong despite correct inputs |
| Temporal hallucination | Placing events in wrong time periods or confusing timelines | “GPT-4 was released before ChatGPT” | Medium — requires temporal knowledge | Medium — creates confusion about sequence |
| Entity conflation | Merging attributes of similar entities | Combining two different researchers’ work into one person | Hard — requires entity disambiguation | High — especially in legal/medical contexts |
| Numerical hallucination | Generating plausible but incorrect numbers | “The population of Singapore is 12 million” (actual: ~5.9M) | Medium — requires numerical grounding | Very high — incorrect numbers drive wrong decisions |

Hallucination Rates by Model and Task Type

Measured across 5,000 queries per cell, using human evaluation as ground truth. “Hallucination rate” = percentage of responses containing at least one hallucination of any type.

| Task type | GPT-4o | Claude Sonnet 4 | Claude Opus 4 | Gemini 2.5 Pro | Llama 4 Maverick |
|---|---|---|---|---|---|
| Factual Q&A (with RAG) | 4.2% | 3.1% | 2.8% | 4.8% | 5.7% |
| Factual Q&A (no RAG) | 12.8% | 9.4% | 8.1% | 14.2% | 16.3% |
| Summarization | 8.3% | 6.2% | 5.4% | 9.1% | 11.4% |
| Code generation | 6.7% | 4.8% | 3.9% | 7.3% | 8.8% |
| Open-ended reasoning | 19.4% | 14.7% | 12.3% | 21.8% | 24.1% |
| Citation generation | 32.6% | 22.1% | 18.7% | 35.4% | 41.2% |
| Numerical reasoning | 15.3% | 11.8% | 9.6% | 17.2% | 19.8% |

Key patterns:

  • RAG reduces hallucination by 60-70% on factual Q&A across all models
  • Citation generation is the highest-hallucination task — models frequently fabricate academic references
  • Claude family consistently shows lower hallucination rates, especially Opus on reasoning tasks
  • Open-source models (Llama) hallucinate 30-50% more than proprietary models on average

The variation across task types is more important than the variation across models. Choosing the right task architecture (adding RAG, decomposing reasoning) reduces hallucination more than switching models.

Hallucination Type Distribution by Task

When a model hallucinates, which type dominates depends on the task:

| Task type | Factual fabrication | Attribution | Reasoning | Temporal | Entity conflation | Numerical |
|---|---|---|---|---|---|---|
| Factual Q&A | 42% | 8% | 12% | 15% | 13% | 10% |
| Summarization | 18% | 24% | 22% | 8% | 18% | 10% |
| Code generation | 5% | 3% | 65% | 2% | 5% | 20% |
| Legal/medical | 22% | 15% | 18% | 10% | 25% | 10% |
| Financial analysis | 12% | 8% | 25% | 8% | 7% | 40% |
| Research assistance | 15% | 45% | 15% | 10% | 10% | 5% |

This distribution matters for detection strategy. If you’re building a financial analysis system, your primary detection investment should be numerical validation — automated checks against source data for every number the model generates. If you’re building a research tool, your primary investment should be citation verification.

Detection Methods by Hallucination Type

Factual Fabrication Detection

| Method | Accuracy | Latency | Cost per query | Best for |
|---|---|---|---|---|
| RAG faithfulness check (BERTScore) | 78% | 50ms | $0.001 | High-volume, moderate accuracy |
| RAG faithfulness check (NLI model) | 85% | 120ms | $0.003 | Balance of accuracy and cost |
| LLM-as-judge (separate model) | 89% | 800ms | $0.02-0.15 | High-stakes, accuracy-critical |
| Knowledge graph lookup | 92% | 200ms | $0.005 | Closed-domain with structured knowledge |
| Human review | 97% | 5-30min | $0.50-5.00 | Final verification, edge cases |

Practical choice: For most production systems, an NLI-based faithfulness check (85% accuracy, $0.003/query) is the sweet spot. Escalate to LLM-as-judge only for responses flagged as uncertain. This two-tier approach catches 93% of factual fabrications at $0.008 average cost per query.
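The two-tier escalation can be sketched as a small router. The `nli_faithfulness` and `llm_judge` functions below are hypothetical stand-ins for real model calls, and the band thresholds are illustrative values to tune against your own data:

```python
def nli_faithfulness(response: str, context: str) -> float:
    """Stand-in for a cheap NLI model returning P(context entails response)."""
    # A production system would call an NLI checkpoint here.
    return 0.9 if response in context else 0.4

def llm_judge(response: str, context: str) -> bool:
    """Stand-in for a separate, expensive LLM asked for a grounding verdict."""
    return response in context

def check_response(response: str, context: str,
                   pass_threshold: float = 0.8,
                   fail_threshold: float = 0.3) -> str:
    """Tier 1: cheap NLI score on every query. Tier 2: escalate only
    the uncertain band to the LLM judge."""
    score = nli_faithfulness(response, context)
    if score >= pass_threshold:
        return "pass"   # cheap tier is confident the response is grounded
    if score <= fail_threshold:
        return "fail"   # cheap tier is confident it is not
    # Uncertain band only: pay for the expensive judge
    return "pass" if llm_judge(response, context) else "fail"
```

Because only the uncertain band reaches tier 2, the average per-query cost stays close to the NLI price while accuracy approaches the judge's.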

Attribution Hallucination Detection

Attribution hallucinations are the easiest to detect because they make verifiable claims:

| Check | What it catches | Automation level | False positive rate |
|---|---|---|---|
| DOI/ISBN lookup | Non-existent citations | Fully automated | <1% |
| Title + author fuzzy match | Misattributed citations | Fully automated | 5-8% |
| Quote verification | Misquoted sources | Semi-automated (needs source text) | 3-5% |
| URL existence check | Dead links, non-existent pages | Fully automated | 2% (pages may have existed) |

Implementation note: A simple pipeline — extract citations via regex/LLM → verify DOI/URL → fuzzy-match title+author against Semantic Scholar or CrossRef — catches 90%+ of fabricated citations with zero human effort. Every research-facing AI system should implement this.
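The extract and fuzzy-match stages of that pipeline fit in a few lines of standard-library Python; the DOI regex and similarity cutoff below are assumptions to tune, and the actual lookup against CrossRef or Semantic Scholar (an HTTP GET per DOI) is only described in a comment:

```python
import difflib
import re

# Matches DOI-like strings, e.g. "10.1000/xyz123" (pattern is an assumption)
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def extract_dois(text: str) -> list[str]:
    """Pull DOI-like strings out of a model response."""
    return DOI_RE.findall(text)

def title_matches(claimed_title: str, api_titles: list[str],
                  cutoff: float = 0.85) -> bool:
    """Fuzzy-match the claimed title against titles returned by a
    bibliographic API; anything below the cutoff gets flagged."""
    return bool(difflib.get_close_matches(
        claimed_title.lower(),
        [t.lower() for t in api_titles],
        n=1, cutoff=cutoff))

# Verifying each extracted DOI would be an HTTP GET against a
# bibliographic service (e.g. the CrossRef REST API); a 404 for a
# well-formed DOI is strong evidence the citation is fabricated.
```

Run `extract_dois` over every response, verify each hit, then `title_matches` against the metadata the API returns for surviving DOIs.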

Reasoning Hallucination Detection

Reasoning hallucinations are the hardest to detect because each fact may be correct while the logical chain is broken:

| Method | What it catches | Accuracy | Limitation |
|---|---|---|---|
| Step-by-step verification (decompose + check each step) | Invalid logical transitions | 72% | Slow — requires N verification calls |
| Consistency check (generate answer 5 times, compare) | Unstable reasoning | 68% | Detects inconsistency, not incorrectness |
| Formal logic extraction + SAT solver | Logical contradictions | 85% | Only works on formally expressible claims |
| Expert system rule checking | Domain-specific invalid inferences | 90% | Requires building domain rules |
| LLM debate (two models argue opposite positions) | Weak arguments, unsupported claims | 75% | Expensive — 2x model calls minimum |
```

The honest limitation: There is no reliable, scalable method for detecting all reasoning hallucinations in production. The best approach for high-stakes applications is structured decomposition — force the model to show its reasoning, then verify each step independently. This catches 70-80% of reasoning errors but adds 3-5x latency and cost.
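The consistency-check row in the table above (generate N times, compare) is the easiest of these to wire up. A minimal sketch, assuming `generate` is any callable that wraps your model and returns its final answer, with the agreement threshold as a tunable assumption:

```python
from collections import Counter

def consistency_check(generate, query: str, n: int = 5,
                      min_agreement: float = 0.8) -> dict:
    """Sample the model n times and flag the answer as unstable when
    the majority answer's share falls below the agreement threshold.
    Remember: this detects inconsistency, not incorrectness."""
    answers = [generate(query) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return {"answer": answer,
            "agreement": agreement,
            "flagged": agreement < min_agreement}
```

A stable-but-wrong answer sails through, which is exactly the 68%-accuracy limitation the table notes; pair this with step-by-step verification for high-stakes paths.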

Numerical Hallucination Detection

| Method | Coverage | Automation | Cost |
|---|---|---|---|
| Range check (is the number plausible?) | Catches outliers only | Fully automated | Near zero |
| Source comparison (check against cited source) | High for referenced numbers | Fully automated with source access | $0.001-0.01 |
| Cross-model verification (ask 2 models, compare) | Moderate — models share training data | Fully automated | 2x inference cost |
| Unit/magnitude validation | Catches unit confusion, order-of-magnitude errors | Fully automated | Near zero |
| Statistical distribution check (does this number fit the expected distribution?) | Moderate — requires distribution knowledge | Semi-automated | Custom engineering |

Practical implementation: For numerical hallucination, layer three automated checks: (1) range validation — reject numbers outside plausible bounds, (2) unit check — verify the number has the right unit and magnitude, (3) source comparison — if a number is cited, verify against the source. This three-layer approach catches 85% of numerical hallucinations at negligible cost.
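The three layers compose naturally into one validator. A sketch, where the bounds, expected base-10 exponent, and 5% tolerance are all per-field assumptions you configure from domain knowledge:

```python
import math

def check_number(value, lo, hi, expected_exponent=None,
                 source_value=None, tolerance=0.05):
    """Run the three layered checks on one generated number; returns a
    list of failure reasons (an empty list means all checks passed)."""
    failures = []
    # Layer 1: range validation against plausible bounds
    if not (lo <= value <= hi):
        failures.append("out_of_range")
    # Layer 2: order-of-magnitude check (e.g. millions vs tens of millions)
    if expected_exponent is not None and value > 0:
        if math.floor(math.log10(value)) != expected_exponent:
            failures.append("magnitude_mismatch")
    # Layer 3: compare against the cited source, within relative tolerance
    if source_value is not None:
        if abs(value - source_value) > tolerance * abs(source_value):
            failures.append("source_mismatch")
    return failures
```

For the Singapore example from the taxonomy, `check_number(12_000_000, lo=1e6, hi=1e8, source_value=5_900_000)` passes the range check but fails the source comparison, which is why layering matters: each layer catches errors the others miss.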

The Diagnostic Decision Tree

When you discover hallucination in your system, diagnose the type before choosing a mitigation:

Step 1 — Classify the hallucination type using the taxonomy table above.

Step 2 — Check the task-type distribution table to determine if this type is expected for your workload. If it’s the dominant type for your task, invest in type-specific detection. If it’s rare, it may be a random failure rather than a systemic issue.

Step 3 — Choose detection method based on your constraints:

| Constraint | Recommended approach | Expected catch rate |
|---|---|---|
| Latency < 200ms | NLI faithfulness + range check | 75% of factual + numerical |
| Cost < $0.01/query | BERTScore + DOI lookup + range check | 70% of factual + attribution + numerical |
| Accuracy > 90% | LLM-as-judge + knowledge graph + citation check | 90%+ but $0.10-0.20/query |
| Scale > 100K queries/day | NLI + automated citation + range check, escalate 5% to LLM judge | 85% at $0.005 avg |

Step 4 — Measure baseline. Before deploying any detection, sample 500 queries and manually classify hallucinations. This baseline tells you: (a) your hallucination rate, (b) your type distribution, (c) whether your detection is actually reducing the rate.
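Turning the manual classifications into the three baseline numbers is a one-function job. A sketch, assuming each reviewed query is labeled either `None` (no hallucination) or a type name from the taxonomy:

```python
from collections import Counter

def baseline_report(labels):
    """labels: one entry per manually reviewed query; None means no
    hallucination, otherwise a hallucination type name."""
    hallucinated = [t for t in labels if t is not None]
    n = len(labels)
    rate = len(hallucinated) / n if n else 0.0
    # Type distribution among hallucinated responses only
    dist = ({t: c / len(hallucinated)
             for t, c in Counter(hallucinated).items()}
            if hallucinated else {})
    return {"sample_size": n,
            "hallucination_rate": rate,
            "type_distribution": dist}
```

Rerun the same report after each detection change; comparing successive reports is what tells you whether detection is actually moving the rate.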

Step 5 — Set a target. Zero hallucination is not achievable. Set a target based on cost of failure:

| Application | Acceptable hallucination rate | Rationale |
|---|---|---|
| Customer-facing chatbot (general) | < 5% | Users tolerate minor errors, not systematic ones |
| Medical/legal advice | < 1% | Each error has liability consequences |
| Internal research tool | < 10% | Users verify independently |
| Code generation | < 3% | Wrong code causes bugs, but tests catch most |
| Financial analysis | < 2% | Incorrect numbers drive incorrect decisions |

Why Hallucination Cannot Be Eliminated

Hallucination is not a bug — it’s a feature of the architecture operating beyond its safe boundary. LLMs generate text by predicting the next token based on statistical patterns. When the model’s training distribution doesn’t cover the specific query, it generates the most probable continuation — which may be factually wrong but linguistically fluent.

Three architectural factors make zero hallucination impossible with current technology:

  1. Training data is finite. The model has not seen every fact. For facts outside training data, it interpolates — and interpolation produces plausible but potentially wrong outputs.

  2. Compression is lossy. A model with 70 billion parameters cannot perfectly encode the entirety of human knowledge. It stores patterns, not facts. Patterns generalize — and generalization sometimes generates incorrect specifics.

  3. Autoregressive generation commits early. Each token is generated based on previous tokens. If token 5 is wrong, tokens 6-100 build on a wrong foundation — the model doubles down rather than self-correcting.

RAG mitigates (1) by providing relevant facts at inference time. Fine-tuning mitigates (2) for specific domains. Neither solves (3). Extended thinking and reasoning models partially address (3) by allowing the model to explore multiple paths before committing — but at significant cost.

How to Apply This

Use the token-counter tool to estimate inference costs when adding detection layers — each verification step consumes tokens.

Start by classifying your workload against the task-type table to predict which hallucination types dominate your use case.

Sample 500 production queries and manually classify any hallucinations found — this baseline is non-negotiable before investing in detection.

Choose detection methods from the constraint-based table matching your latency, cost, and accuracy requirements.

Deploy detection in shadow mode first — log detections without blocking responses, verify false positive rates, then enable blocking.
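Shadow mode amounts to wrapping the detector so it records instead of blocks. A minimal sketch, assuming `detector` returns truthy when it would block and `sink` is a list or your logging pipeline:

```python
def shadow_mode(detector, sink):
    """Wrap `detector` so detections are appended to `sink` instead of
    blocking; every response passes through unchanged while you measure
    the false-positive rate against manual review."""
    def wrapped(response, context):
        if detector(response, context):
            sink.append(response)   # would have been blocked
        return response
    return wrapped
```

Once the logged detections show an acceptable false-positive rate, swap `shadow_mode(detector, sink)` for the bare `detector` on the blocking path.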

Set your target hallucination rate based on the application-specific table and measure weekly.

Honest Limitations

Hallucination rates in the model comparison table are based on specific evaluation sets and may not generalize to your distribution. Detection method accuracy varies by domain — medical text detection is harder than general factual verification. The cost estimates assume standard API pricing; volume discounts, caching, and self-hosted models change the math. Reasoning hallucination detection remains an active research area with no production-ready solution achieving >85% accuracy. This guide covers English-language hallucination; multilingual hallucination patterns differ significantly.