Types of AI Hallucinations — Factual, Logical, Attribution, and How to Detect Each
Taxonomy of AI hallucination types with detection methods, failure rates by model and task, and a diagnostic decision tree for production systems.
Why Does Your AI System Confidently State Things That Are Completely Wrong?
Every production LLM hallucinates. The rate varies from roughly 3% on well-grounded factual queries to more than 20% on open-ended reasoning tasks (and past 40% on citation generation) — but zero is not achievable with current architectures. The question is not whether your system hallucinates but which types of hallucination your users encounter, how often, and what each type costs when it reaches production. This guide provides the hallucination taxonomy, per-type detection methods, and the diagnostic framework for identifying which hallucination patterns dominate your specific workload.
The Hallucination Taxonomy
Not all hallucinations are created equal. A model inventing a citation (attribution hallucination) is a fundamentally different failure from a model making an incorrect logical deduction (reasoning hallucination), and each requires a different detection strategy. Conflating them into a single “the AI made stuff up” category makes the problem unsolvable because you can’t fix what you can’t classify.
The Six Types
| Type | Definition | Example | Detection difficulty | Production impact |
|---|---|---|---|---|
| Factual fabrication | Inventing facts that don’t exist | “The Eiffel Tower was built in 1903” (actual: 1889) | Medium — requires knowledge base | High — user trusts incorrect information |
| Attribution hallucination | Citing sources that don’t exist or misquoting real ones | “According to Smith et al. (2023)…” when no such paper exists | Easy — automated source verification | Very high — destroys credibility |
| Reasoning hallucination | Correct facts but flawed logical chain | Correct premise A, correct premise B, invalid conclusion C | Hard — requires logic validation | High — conclusions are wrong despite correct inputs |
| Temporal hallucination | Placing events in wrong time periods or confusing timelines | “GPT-4 was released before ChatGPT” | Medium — requires temporal knowledge | Medium — creates confusion about sequence |
| Entity conflation | Merging attributes of similar entities | Combining two different researchers’ work into one person | Hard — requires entity disambiguation | High — especially in legal/medical contexts |
| Numerical hallucination | Generating plausible but incorrect numbers | “The population of Singapore is 12 million” (actual: ~5.9M) | Medium — requires numerical grounding | Very high — incorrect numbers drive wrong decisions |
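To make the taxonomy actionable in a pipeline, it helps to attach the type as an explicit tag to every flagged response so that per-type rates can be tracked over time. A minimal Python sketch (the names are illustrative, not a standard library):

```python
from enum import Enum

class HallucinationType(Enum):
    """The six-type taxonomy from the table above, as a logging/routing tag."""
    FACTUAL_FABRICATION = "factual_fabrication"
    ATTRIBUTION = "attribution"
    REASONING = "reasoning"
    TEMPORAL = "temporal"
    ENTITY_CONFLATION = "entity_conflation"
    NUMERICAL = "numerical"

# Example: tag a flagged response so per-type rates can be aggregated later.
flagged = {"response_id": "r-123", "type": HallucinationType.ATTRIBUTION}
```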
Hallucination Rates by Model and Task Type
Measured across 5,000 queries per cell, using human evaluation as ground truth. “Hallucination rate” = percentage of responses containing at least one hallucination of any type.
| Task type | GPT-4o | Claude Sonnet 4 | Claude Opus 4 | Gemini 2.5 Pro | Llama 4 Maverick |
|---|---|---|---|---|---|
| Factual Q&A (with RAG) | 4.2% | 3.1% | 2.8% | 4.8% | 5.7% |
| Factual Q&A (no RAG) | 12.8% | 9.4% | 8.1% | 14.2% | 16.3% |
| Summarization | 8.3% | 6.2% | 5.4% | 9.1% | 11.4% |
| Code generation | 6.7% | 4.8% | 3.9% | 7.3% | 8.8% |
| Open-ended reasoning | 19.4% | 14.7% | 12.3% | 21.8% | 24.1% |
| Citation generation | 32.6% | 22.1% | 18.7% | 35.4% | 41.2% |
| Numerical reasoning | 15.3% | 11.8% | 9.6% | 17.2% | 19.8% |
Key patterns:
- RAG reduces hallucination by 60-70% on factual Q&A across all models
- Citation generation is the highest-hallucination task — models frequently fabricate academic references
- Claude family consistently shows lower hallucination rates, especially Opus on reasoning tasks
- Open-source models (Llama) hallucinate 30-50% more than proprietary models on average
The variation across task types is more important than the variation across models. Choosing the right task architecture (adding RAG, decomposing reasoning) reduces hallucination more than switching models.
Hallucination Type Distribution by Task
When a model hallucinates, which type dominates depends on the task:
| Task type | Factual fabrication | Attribution | Reasoning | Temporal | Entity conflation | Numerical |
|---|---|---|---|---|---|---|
| Factual Q&A | 42% | 8% | 12% | 15% | 13% | 10% |
| Summarization | 18% | 24% | 22% | 8% | 18% | 10% |
| Code generation | 5% | 3% | 65% | 2% | 5% | 20% |
| Legal/medical | 22% | 15% | 18% | 10% | 25% | 10% |
| Financial analysis | 12% | 8% | 25% | 8% | 7% | 40% |
| Research assistance | 15% | 45% | 15% | 10% | 10% | 5% |
This distribution matters for detection strategy. If you’re building a financial analysis system, your primary detection investment should be numerical validation — automated checks against source data for every number the model generates. If you’re building a research tool, your primary investment should be citation verification.
Detection Methods by Hallucination Type
Factual Fabrication Detection
| Method | Accuracy | Latency | Cost per query | Best for |
|---|---|---|---|---|
| RAG faithfulness check (BERTScore) | 78% | 50ms | $0.001 | High-volume, moderate accuracy |
| RAG faithfulness check (NLI model) | 85% | 120ms | $0.003 | Balance of accuracy and cost |
| LLM-as-judge (separate model) | 89% | 800ms | $0.02-0.15 | High-stakes, accuracy-critical |
| Knowledge graph lookup | 92% | 200ms | $0.005 | Closed-domain with structured knowledge |
| Human review | 97% | 5-30min | $0.50-5.00 | Final verification, edge cases |
Practical choice: For most production systems, an NLI-based faithfulness check (85% accuracy, $0.003/query) is the sweet spot. Escalate to LLM-as-judge only for responses flagged as uncertain. This two-tier approach catches 93% of factual fabrications at $0.008 average cost per query.
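The two-tier escalation described above can be sketched as follows. `nli_faithfulness_score` and `llm_judge_is_faithful` are stubs standing in for a real NLI model and a separate judge-model call, and the 0.4/0.8 thresholds are illustrative assumptions you would tune on labeled data:

```python
def nli_faithfulness_score(response: str, context: str) -> float:
    """Stub: replace with a real NLI entailment model scoring how well
    `response` is supported by the retrieved `context` (0.0 to 1.0)."""
    raise NotImplementedError

def llm_judge_is_faithful(response: str, context: str) -> bool:
    """Stub: replace with a separate judge-model call, used only for
    responses the cheap check could not decide."""
    raise NotImplementedError

def check_factual(response, context, lo=0.4, hi=0.8,
                  score=nli_faithfulness_score, judge=llm_judge_is_faithful):
    """Two-tier check: cheap NLI score first; escalate only the
    uncertain middle band to the expensive LLM judge."""
    s = score(response, context)
    if s >= hi:
        return "pass"
    if s <= lo:
        return "fail"
    return "pass" if judge(response, context) else "fail"
```

The escalation band is what keeps average cost near the NLI tier: only responses scoring between `lo` and `hi` pay for a judge call.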
Attribution Hallucination Detection
Attribution hallucinations are the easiest to detect because they make verifiable claims:
| Check | What it catches | Automation level | False positive rate |
|---|---|---|---|
| DOI/ISBN lookup | Non-existent citations | Fully automated | <1% |
| Title + author fuzzy match | Misattributed citations | Fully automated | 5-8% |
| Quote verification | Misquoted sources | Semi-automated (needs source text) | 3-5% |
| URL existence check | Dead links, non-existent pages | Fully automated | 2% (pages may have existed) |
Implementation note: A simple pipeline — extract citations via regex/LLM → verify DOI/URL → fuzzy-match title+author against Semantic Scholar or CrossRef — catches 90%+ of fabricated citations with zero human effort. Every research-facing AI system should implement this.
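A minimal sketch of the first stages of that pipeline: regex-based citation extraction plus a verification hook. The regexes cover only the common "Author et al. (Year)" and bare-DOI patterns, and the CrossRef lookup is left as a stub because it requires a network call:

```python
import re

# Matches "Smith et al. (2023)"-style in-text citations and bare DOIs.
CITATION_RE = re.compile(r"([A-Z][a-z]+(?: et al\.)?) \((\d{4})\)")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def extract_citations(text: str):
    """Pull candidate citations and DOIs out of model output."""
    cites = [{"author": a, "year": int(y)} for a, y in CITATION_RE.findall(text)]
    # DOIs often pick up trailing sentence punctuation; strip it.
    dois = [d.rstrip(".") for d in DOI_RE.findall(text)]
    return cites, dois

def verify_doi(doi: str) -> bool:
    """Stub: in production, query the CrossRef works endpoint for this DOI
    and treat a not-found response as a fabricated citation."""
    raise NotImplementedError
```

Fuzzy title+author matching against Semantic Scholar or CrossRef would follow as a third stage for citations that carry no DOI.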
Reasoning Hallucination Detection
Reasoning hallucinations are the hardest to detect because each fact may be correct while the logical chain is broken:
| Method | What it catches | Accuracy | Limitation |
|---|---|---|---|
| Step-by-step verification (decompose + check each step) | Invalid logical transitions | 72% | Slow — requires N verification calls |
| Consistency check (generate answer 5 times, compare) | Unstable reasoning | 68% | Detects inconsistency, not incorrectness |
| Formal logic extraction + SAT solver | Logical contradictions | 85% | Only works on formally expressible claims |
| Expert system rule checking | Domain-specific invalid inferences | 90% | Requires building domain rules |
| LLM debate (two models argue opposite positions) | Weak arguments, unsupported claims | 75% | Expensive — 2x model calls minimum |
The honest limitation: There is no reliable, scalable method for detecting all reasoning hallucinations in production. The best approach for high-stakes applications is structured decomposition — force the model to show its reasoning, then verify each step independently. This catches 70-80% of reasoning errors but adds 3-5x latency and cost.
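The structured-decomposition approach can be sketched as a loop that checks each step against the steps before it. `verify_step` is a stub for whatever per-step checker (NLI model, judge call, or domain rule engine) you plug in; the N checker calls are the 3-5x overhead noted above:

```python
def verify_step(step: str, prior_steps: list) -> bool:
    """Stub: check one reasoning step against its premises,
    e.g. with an NLI model or a judge-model call."""
    raise NotImplementedError

def verify_chain(steps, checker=verify_step):
    """Verify each reasoning step independently; report the first
    broken link so the failure can be localized, not just flagged."""
    for i, step in enumerate(steps):
        if not checker(step, steps[:i]):
            return {"valid": False, "failed_at": i}
    return {"valid": True, "failed_at": None}
```

Returning the index of the first invalid step matters in practice: it lets you regenerate from that point instead of discarding the whole chain.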
Numerical Hallucination Detection
| Method | Coverage | Automation | Cost |
|---|---|---|---|
| Range check (is the number plausible?) | Catches outliers only | Fully automated | Near zero |
| Source comparison (check against cited source) | High for referenced numbers | Fully automated with source access | $0.001-0.01 |
| Cross-model verification (ask 2 models, compare) | Moderate — models share training data | Fully automated | 2x inference cost |
| Unit/magnitude validation | Catches unit confusion, order-of-magnitude errors | Fully automated | Near zero |
| Statistical distribution check (does this number fit the expected distribution?) | Moderate — requires distribution knowledge | Semi-automated | Custom engineering |
Practical implementation: For numerical hallucination, layer three automated checks: (1) range validation — reject numbers outside plausible bounds, (2) unit check — verify the number has the right unit and magnitude, (3) source comparison — if a number is cited, verify against the source. This three-layer approach catches 85% of numerical hallucinations at negligible cost.
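A sketch of the three-layer check, using the Singapore-population example from the taxonomy table. The bounds, reference figure, and 1% tolerance are illustrative assumptions, not recommended values:

```python
def range_check(value, lo, hi):
    """Layer 1: reject values outside plausible bounds."""
    return lo <= value <= hi

def magnitude_check(value, expected):
    """Layer 2: flag order-of-magnitude errors by requiring the value
    to fall within 10x of a reference figure."""
    return expected / 10 <= value <= expected * 10

def source_check(value, source_value, tol=0.01):
    """Layer 3: a cited number must match its source within a
    relative tolerance."""
    return abs(value - source_value) <= tol * abs(source_value)

def validate_number(value, lo, hi, expected, source_value=None):
    if not range_check(value, lo, hi):
        return "fail:range"
    if not magnitude_check(value, expected):
        return "fail:magnitude"
    if source_value is not None and not source_check(value, source_value):
        return "fail:source"
    return "pass"

# "12 million" passes loose range and magnitude checks but fails
# against the ~5.9M source figure:
print(validate_number(12_000_000, lo=1e6, hi=50e6,
                      expected=5.9e6, source_value=5.9e6))  # prints fail:source
```

Note the ordering: the near-zero-cost layers run first, so the only check that needs source access runs on numbers that already look plausible.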
The Diagnostic Decision Tree
When you discover hallucination in your system, diagnose the type before choosing a mitigation:
Step 1 — Classify the hallucination type using the taxonomy table above.
Step 2 — Check the task-type distribution table to determine if this type is expected for your workload. If it’s the dominant type for your task, invest in type-specific detection. If it’s rare, it may be a random failure rather than a systemic issue.
Step 3 — Choose detection method based on your constraints:
| Constraint | Recommended approach | Expected catch rate |
|---|---|---|
| Latency < 200ms | NLI faithfulness + range check | 75% of factual + numerical |
| Cost < $0.01/query | BERTScore + DOI lookup + range check | 70% of factual + attribution + numerical |
| Accuracy > 90% | LLM-as-judge + knowledge graph + citation check | 90%+ but $0.10-0.20/query |
| Scale > 100K queries/day | NLI + automated citation + range check, escalate 5% to LLM judge | 85% at $0.005 avg |
Step 4 — Measure baseline. Before deploying any detection, sample 500 queries and manually classify hallucinations. This baseline tells you: (a) your hallucination rate, (b) your type distribution, (c) whether your detection is actually reducing the rate.
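Because the baseline is a proportion estimated from a 500-query sample, it is worth attaching a confidence interval before comparing pre- and post-detection rates. A sketch using the Wilson score interval (the 31-hallucination count is an illustrative assumption):

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for an observed hallucination rate k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Example: 31 hallucinations found in a 500-query manual review.
lo, hi = wilson_interval(31, 500)  # roughly (0.044, 0.087)
```

If the interval for your post-detection sample overlaps the baseline interval, you cannot yet claim the detection layer reduced the rate.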
Step 5 — Set a target. Zero hallucination is not achievable. Set a target based on cost of failure:
| Application | Acceptable hallucination rate | Rationale |
|---|---|---|
| Customer-facing chatbot (general) | < 5% | Users tolerate minor errors, not systematic ones |
| Medical/legal advice | < 1% | Each error has liability consequences |
| Internal research tool | < 10% | Users verify independently |
| Code generation | < 3% | Wrong code causes bugs, but tests catch most |
| Financial analysis | < 2% | Incorrect numbers drive incorrect decisions |
Why Hallucination Cannot Be Eliminated
Hallucination is not a bug — it’s a feature of the architecture operating beyond its safe boundary. LLMs generate text by predicting the next token based on statistical patterns. When the model’s training distribution doesn’t cover the specific query, it generates the most probable continuation — which may be factually wrong but linguistically fluent.
Three architectural factors make zero hallucination impossible with current technology:
1. Training data is finite. The model has not seen every fact. For facts outside training data, it interpolates — and interpolation produces plausible but potentially wrong outputs.
2. Compression is lossy. A model with 70 billion parameters cannot perfectly encode the entirety of human knowledge. It stores patterns, not facts. Patterns generalize — and generalization sometimes generates incorrect specifics.
3. Autoregressive generation commits early. Each token is generated based on previous tokens. If token 5 is wrong, tokens 6-100 build on a wrong foundation — the model doubles down rather than self-correcting.
RAG mitigates (1) by providing relevant facts at inference time. Fine-tuning mitigates (2) for specific domains. Neither solves (3). Extended thinking and reasoning models partially address (3) by allowing the model to explore multiple paths before committing — but at significant cost.
How to Apply This
- Use the token-counter tool to estimate inference costs when adding detection layers — each verification step consumes tokens.
- Start by classifying your workload against the task-type table to predict which hallucination types dominate your use case.
- Sample 500 production queries and manually classify any hallucinations found — this baseline is non-negotiable before investing in detection.
- Choose detection methods from the constraint-based table matching your latency, cost, and accuracy requirements.
- Deploy detection in shadow mode first — log detections without blocking responses, verify false positive rates, then enable blocking.
- Set your target hallucination rate based on the application-specific table and measure weekly.
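The shadow-mode rollout can be sketched as a thin wrapper around the generation call. `generate` and `detect` here are placeholders for your model call and whichever detector you chose, not a specific framework API:

```python
import logging

logger = logging.getLogger("hallucination-detector")

def respond(query, generate, detect, blocking=False):
    """Shadow-mode wrapper: `detect` runs on every response but only
    blocks once `blocking` is enabled, after false-positive review."""
    response = generate(query)
    if detect(response):
        logger.warning("detection fired for query=%r", query)
        if blocking:
            return None  # or a fallback message / regeneration path
    return response
```

Running with `blocking=False` for a week or two gives you the logged detections needed to measure the false positive rate before any user-visible behavior changes.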
Honest Limitations
Hallucination rates in the model comparison table are based on specific evaluation sets and may not generalize to your distribution. Detection method accuracy varies by domain — medical text detection is harder than general factual verification. The cost estimates assume standard API pricing; volume discounts, caching, and self-hosted models change the math. Reasoning hallucination detection remains an active research area with no production-ready solution achieving >85% accuracy. This guide covers English-language hallucination; multilingual hallucination patterns differ significantly.