AI Bias Detection — Demographic Parity, Equal Opportunity, Calibration, and When Each Metric Applies
Fairness metric decision tree per use case, measurement methodology, regulatory requirements, and practical implementation for production AI systems.
Your AI System Treats Different Groups Differently — But Which Kind of “Differently” Actually Matters?
An AI hiring tool rejects 40% of male applicants and 42% of female applicants. Is it biased? What if the acceptance rate difference is 40% vs. 55%? What if the rates are equal but the false positive rate (hiring unqualified candidates) is 8% for one group and 15% for another? Each of these is a different kind of unfairness, measured by a different metric, and the metric you choose determines what your system optimizes away. This guide provides the fairness metric taxonomy, the decision tree for choosing which metric applies to your use case, and the regulatory requirements that may choose for you.
Why You Can’t Optimize for All Fairness Metrics Simultaneously
This is the most important and least discussed fact in AI fairness: most fairness metrics are mathematically incompatible. You cannot simultaneously achieve demographic parity, equalized odds, and calibration except in trivial cases where base rates are identical across groups or prediction is perfect.
The Chouldechova (2017) impossibility theorem proves that if the base rate differs between groups (e.g., if one demographic genuinely has a higher loan default rate), then no imperfect classifier can simultaneously satisfy false positive rate parity, false negative rate parity, and positive predictive value parity.
This means every fairness intervention is a choice — you’re choosing which type of unfairness to eliminate, knowing that other types will persist or worsen. The choice should be deliberate, documented, and aligned with the specific harms your system can cause.
The Fairness Metric Taxonomy
| Metric | What it measures | Formula (simplified) | Intuition |
|---|---|---|---|
| Demographic parity | Equal selection rates across groups | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | “Same percentage selected from each group” |
| Equal opportunity | Equal true positive rates across groups | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | “Qualified candidates have equal chance” |
| Equalized odds | Equal TPR AND FPR across groups | TPR and FPR equal across groups | “Equal accuracy for positives AND negatives” |
| Predictive parity | Equal precision across groups | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | “Positive predictions equally reliable” |
| Calibration | Predicted probability matches observed frequency per group | E[Y \| S=s, A=a] = E[Y \| S=s, A=b] for all s | “80% confidence means 80% correct for everyone” |
| Individual fairness | Similar individuals get similar predictions | d(Ŷᵢ, Ŷⱼ) ≤ L·d(xᵢ, xⱼ) | “Similar people treated similarly” |
| Counterfactual fairness | Prediction unchanged if protected attribute were different | Ŷ(A=a) = Ŷ(A=b), all else equal | “Would the decision change if you changed only the protected attribute?” |
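The parity conditions in the table reduce to simple rate comparisons over predictions and labels. A minimal sketch in pure Python; the function and key names are illustrative, not from any particular fairness library:

```python
def group_rates(y_true, y_pred, group, a):
    """Selection rate, TPR, FPR, and precision for the subgroup where group == a."""
    idx = [i for i, g in enumerate(group) if g == a]
    yt = [y_true[i] for i in idx]
    yp = [y_pred[i] for i in idx]
    tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
    pos, neg = sum(yt), len(yt) - sum(yt)
    sel = sum(yp) / len(yp)                   # P(Yhat=1 | A=a)
    tpr = tp / pos if pos else 0.0            # P(Yhat=1 | Y=1, A=a)
    fpr = fp / neg if neg else 0.0            # P(Yhat=1 | Y=0, A=a)
    ppv = tp / sum(yp) if sum(yp) else 0.0    # P(Y=1 | Yhat=1, A=a)
    return sel, tpr, fpr, ppv

def fairness_gaps(y_true, y_pred, group, a, b):
    """Absolute gap between groups a and b under each parity condition."""
    sel_a, tpr_a, fpr_a, ppv_a = group_rates(y_true, y_pred, group, a)
    sel_b, tpr_b, fpr_b, ppv_b = group_rates(y_true, y_pred, group, b)
    return {
        "demographic_parity": abs(sel_a - sel_b),      # selection rate gap
        "equal_opportunity": abs(tpr_a - tpr_b),       # TPR gap
        "equalized_odds": max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)),
        "predictive_parity": abs(ppv_a - ppv_b),       # precision gap
    }
```

Note how one set of predictions can pass one condition and fail another: equal selection rates (zero demographic parity gap) say nothing about the TPR or precision gaps.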
The Compatibility Matrix — Which Metrics Can Coexist
| | Demographic parity | Equal opportunity | Equalized odds | Predictive parity | Calibration |
|---|---|---|---|---|---|
| Demographic parity | — | Incompatible* | Incompatible* | Incompatible* | Incompatible* |
| Equal opportunity | Incompatible* | — | Compatible (subset) | Incompatible* | Incompatible* |
| Equalized odds | Incompatible* | Compatible (superset) | — | Incompatible* | Incompatible* |
| Predictive parity | Incompatible* | Incompatible* | Incompatible* | — | Compatible (related) |
| Calibration | Incompatible* | Incompatible* | Incompatible* | Compatible (related) | — |
*Incompatible when base rates differ between groups (which is almost always the case in real data).
Practical implication: When a stakeholder says “make the AI fair,” the first question is “fair by which definition?” — because the definitions conflict.
The Decision Tree — Which Metric for Which Use Case
Hiring and Admission Decisions
| Scenario | Recommended metric | Why | Legal basis |
|---|---|---|---|
| Initial screening (resume filter) | Demographic parity (4/5ths rule) | EEOC guidelines treat a selection rate ratio < 80% as evidence of adverse impact | Title VII, EEOC Guidelines |
| Interview scoring | Equal opportunity | Qualified candidates deserve equal chance | Same legal basis, but merit-based argument stronger |
| Final hiring decision | Equalized odds | Both false positives and false negatives matter | Most defensible in litigation |
The 4/5ths rule: if the selection rate for any group is less than 80% of the selection rate for the highest-selected group, there is adverse impact. Example: if 60% of Group A is selected and 40% of Group B is selected, the ratio is 40/60 = 66.7% < 80%, indicating adverse impact.
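The 4/5ths check is mechanical once per-group selection rates are known. A hedged sketch (function name is illustrative):

```python
def four_fifths_check(selection_rates):
    """selection_rates: dict of group -> selection rate.
    Returns (ratio, adverse_impact_flag) per the 4/5ths rule."""
    top = max(selection_rates.values())
    ratio = min(selection_rates.values()) / top
    return ratio, ratio < 0.80   # True means adverse impact is indicated
```

On the worked example above, `four_fifths_check({"A": 0.6, "B": 0.4})` yields a ratio of about 0.667 and flags adverse impact.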
Lending and Credit
| Scenario | Recommended metric | Why | Legal basis |
|---|---|---|---|
| Loan approval | Predictive parity + calibration | Risk scores must be equally reliable across groups | ECOA, Fair Housing Act |
| Interest rate setting | Calibration | Predicted risk must match actual risk per group | ECOA, disparate impact doctrine |
| Credit limit | Equal opportunity | Creditworthy applicants should have equal access | ECOA |
Healthcare
| Scenario | Recommended metric | Why | Regulatory basis |
|---|---|---|---|
| Diagnostic screening | Equal opportunity (sensitivity) | Missing a disease in one group is worse than false alarm | FDA guidance on clinical AI |
| Treatment recommendation | Calibration | Predicted outcomes must be accurate for all demographics | Medical ethics, informed consent |
| Resource allocation | Equalized odds | Both over-treatment and under-treatment cause harm | AMA ethics guidelines |
Content Moderation
| Scenario | Recommended metric | Why |
|---|---|---|
| Hate speech detection | Equalized odds | FPR disparity → censoring legitimate speech from minority communities |
| Content recommendation | Demographic parity (soft) | Exposure should not be disproportionately limited for any group |
| Search ranking | Calibration | Relevance scores should mean the same thing regardless of query demographic context |
Measuring Bias — The Practical Methodology
Step 1: Define Protected Attributes and Groups
| Attribute | Common groups | Data availability challenge |
|---|---|---|
| Race/ethnicity | Varies by jurisdiction | Often not collected; proxy detection unreliable |
| Gender | Male, female, non-binary | Binary data common; non-binary underrepresented |
| Age | Typically decade bins | Usually available |
| Disability | Present/not | Rarely disclosed |
| Geography | Region, country | Available via IP; may correlate with other protected attributes |
The proxy problem: Even if you don’t use race as a feature, ZIP code, name, and education institution are strong proxies. Removing the protected attribute is insufficient — you must test for disparate impact on outcomes.
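One cheap first-pass screen for proxies is to correlate each candidate feature with the protected attribute. The sketch below uses plain Pearson correlation against a 0/1-encoded attribute (all names are illustrative). The caveat runs in both directions: a high correlation flags a likely proxy, but a low one does not clear the feature, since proxies can be nonlinear or emerge only from feature combinations.

```python
def pearson(xs, ys):
    """Plain Pearson correlation; with a 0/1 attribute this is point-biserial."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0

def proxy_screen(features, protected, threshold=0.3):
    """features: dict of name -> values; protected: 0/1-encoded attribute.
    Returns the features whose |correlation| with the attribute exceeds threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) > threshold:
            flagged[name] = r
    return flagged
```

A stronger screen is to train a model to predict the protected attribute from the remaining features; high attainable accuracy means proxy information survives, but that requires an ML dependency and is omitted here.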
Step 2: Compute Metrics
For each protected group pair (A=a vs A=b), compute:
| Metric | Computation | Threshold for concern |
|---|---|---|
| Selection rate ratio | min(rate_a, rate_b) / max(rate_a, rate_b) | < 0.80 (4/5ths rule) |
| TPR difference | \|TPR_a - TPR_b\| | > 0.05 (5 percentage points) |
| FPR difference | \|FPR_a - FPR_b\| | > 0.05 |
| Calibration gap | max over score bins s of \|E[Y \| S=s, A=a] - E[Y \| S=s, A=b]\| | > 0.05 |
| Disparate impact ratio | (adverse outcome rate for protected group) / (adverse outcome rate for reference group) | > 1.25 (inverse of 4/5ths) |
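Once per-group rates are in hand, applying the table's concern thresholds is mechanical. A sketch with the 0.80 and 0.05 cutoffs taken from the table; all names are illustrative:

```python
# Concern thresholds from the table above.
THRESHOLDS = {"selection_ratio": 0.80, "tpr_diff": 0.05, "fpr_diff": 0.05}

def flag_concerns(m_a, m_b):
    """m_a, m_b: dicts with 'sel', 'tpr', 'fpr' for one group each."""
    flags = []
    ratio = min(m_a["sel"], m_b["sel"]) / max(m_a["sel"], m_b["sel"])
    if ratio < THRESHOLDS["selection_ratio"]:
        flags.append(f"selection rate ratio {ratio:.2f} fails 4/5ths rule")
    if abs(m_a["tpr"] - m_b["tpr"]) > THRESHOLDS["tpr_diff"]:
        flags.append("TPR difference exceeds 0.05")
    if abs(m_a["fpr"] - m_b["fpr"]) > THRESHOLDS["fpr_diff"]:
        flags.append("FPR difference exceeds 0.05")
    return flags

def calibration_gap(scores_a, y_a, scores_b, y_b, n_bins=10):
    """Max per-bin gap in observed positive rate, over bins both groups occupy."""
    def bin_means(scores, labels):
        bins = {}
        for s, y in zip(scores, labels):
            bins.setdefault(min(int(s * n_bins), n_bins - 1), []).append(y)
        return {b: sum(v) / len(v) for b, v in bins.items()}
    m_a, m_b = bin_means(scores_a, y_a), bin_means(scores_b, y_b)
    shared = set(m_a) & set(m_b)
    return max((abs(m_a[b] - m_b[b]) for b in shared), default=0.0)
```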
Step 3: Statistical Significance
A 3-percentage-point TPR difference on 50 samples is noise. The same difference on 50,000 samples is a signal. Always run a significance test and report confidence intervals; the minimum detectable TPR difference depends on per-group sample size:
| Sample size per group | Detectable TPR difference (95% CI, 80% power) |
|---|---|
| 100 | ≥ 15 percentage points |
| 500 | ≥ 7 percentage points |
| 1,000 | ≥ 5 percentage points |
| 5,000 | ≥ 2 percentage points |
| 50,000 | ≥ 0.7 percentage points |
Practical implication: If your minority group has <500 samples, you cannot reliably detect bias smaller than 7 percentage points. This is a measurement limitation, not proof of fairness.
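A quick significance check for a TPR gap is a two-proportion z-test. A stdlib-only sketch (illustrative names; the normal approximation makes this only a rough screen at small sample sizes):

```python
import math

def tpr_diff_z_test(tp_a, pos_a, tp_b, pos_b):
    """Two-sided pooled two-proportion z-test on TPRs.
    tp_* = true positives, pos_* = actual positives, per group."""
    p_a, p_b = tp_a / pos_a, tp_b / pos_b
    pooled = (tp_a + tp_b) / (pos_a + pos_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_a + 1 / pos_b))
    z = (p_a - p_b) / se if se else 0.0
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a - p_b, p_value
```

The same 6-point TPR gap is not significant with 50 positives per group but is overwhelming with 50,000, matching the sample size table above.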
Step 4: Root Cause Analysis
When bias is detected, identify the source:
| Source | Indicator | Mitigation |
|---|---|---|
| Training data imbalance | Group sizes differ >3x | Oversampling, synthetic data, re-weighting |
| Label bias | Labels assigned by biased humans | Label audit, multi-rater agreement |
| Feature encoding | Protected attribute proxy features | Proxy detection, feature ablation study |
| Sampling bias | Non-representative data collection | Stratified sampling, data augmentation |
| Measurement bias | Different measurement accuracy across groups | Separate calibration per group |
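The re-weighting mitigation in the table can be sketched in the style of Kamiran and Calders' reweighing: each (group, label) cell gets weight P(A=a)·P(Y=y) / P(A=a, Y=y), so that under the weights, group membership and label become statistically independent. Names are illustrative:

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-example weight P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    n = len(labels)
    count_a = Counter(groups)                 # marginal group counts
    count_y = Counter(labels)                 # marginal label counts
    count_ay = Counter(zip(groups, labels))   # joint counts
    return [
        (count_a[a] / n) * (count_y[y] / n) / (count_ay[(a, y)] / n)
        for a, y in zip(groups, labels)
    ]
```

Combinations that are underrepresented relative to independence get weights above 1; overrepresented ones get weights below 1. The weights feed into any trainer that accepts per-sample weights.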
Regulatory Requirements — What the Law Requires
| Regulation | Jurisdiction | What it requires | Penalty |
|---|---|---|---|
| EU AI Act (2026) | EU | Risk assessment, bias testing, documentation for high-risk AI | Up to 7% global revenue |
| NYC Local Law 144 | NYC | Annual bias audit for automated employment decision tools | $500 first violation; up to $1,500 per subsequent violation |
| ECOA / Reg B | US | Fair lending; no discrimination on protected characteristics | Penalties + litigation + consent orders |
| EEOC Guidelines | US | 4/5ths rule for employment selection | Litigation, back pay, injunctive relief |
| GDPR Art. 22 | EU | Right not to be subject to solely automated decisions; meaningful information about the logic involved | Up to 4% global revenue |
| Colorado AI Act (2026) | Colorado | Risk assessment for high-risk AI decisions | AG enforcement |
The compliance gap: Most regulations require bias testing but don’t specify which metrics to use. This creates ambiguity — and audit risk. Document your metric choice, your rationale, and the tradeoffs you accepted. A well-documented decision to optimize equal opportunity (with demographic parity data available) is more defensible than an undocumented claim of “fairness.”
Mitigation Techniques — What Actually Works
| Technique | Stage | Accuracy impact | Fairness improvement | Complexity |
|---|---|---|---|---|
| Re-weighting | Pre-processing | -0.5 to -2% | Moderate (demographic parity) | Low |
| Oversampling minority | Pre-processing | 0 to -1% | Moderate (reduces data imbalance) | Low |
| Adversarial debiasing | In-processing | -1 to -3% | High (equalized odds) | High |
| Calibrated equalized odds | Post-processing | -0.5 to -1.5% | High (equalized odds + calibration) | Medium |
| Threshold adjustment | Post-processing | 0 to -2% | Moderate (demographic parity) | Low |
| Reject option classification | Post-processing | +1% (on accepted) | High (routes uncertain cases to humans) | Medium |
The accuracy-fairness tradeoff: Nearly every mitigation technique reduces overall accuracy (reject option classification is the exception, and it pays in human review effort instead). The question is whether the accuracy loss is acceptable for the fairness gain. In most production systems, a 1-2% accuracy reduction to achieve regulatory compliance is a clear win.
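Threshold adjustment from the mitigation table can be sketched as a per-group search for the score cutoff whose selection rate best matches a shared target, trading some accuracy for demographic parity. All names are illustrative:

```python
def group_thresholds(scores_by_group, target_rate):
    """Pick, per group, the grid threshold whose selection rate is closest
    to target_rate. Returns a dict of group -> threshold."""
    grid = [i / 100 for i in range(1, 100)]
    chosen = {}
    for g, scores in scores_by_group.items():
        best_t, best_gap = 0.5, float("inf")
        for t in grid:
            rate = sum(s >= t for s in scores) / len(scores)
            if abs(rate - target_rate) < best_gap:
                best_t, best_gap = t, abs(rate - target_rate)
        chosen[g] = best_t
    return chosen
```

Group-specific thresholds equalize selection rates by construction, but (per the compatibility matrix) they generally worsen predictive parity when base rates differ, and their legality varies by jurisdiction and domain.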
How to Apply This
Use the token-counter tool to estimate costs for bias evaluation pipelines — each fairness check requires inference on evaluation datasets.
Start by identifying which metrics your jurisdiction requires from the regulatory requirements table. If no specific requirement exists, choose based on the use-case decision tree.
Compute the compatibility matrix for your chosen metrics — know in advance which types of unfairness you’re accepting when you optimize for your chosen metric.
Build a labeled evaluation dataset with demographic annotations (minimum 500 samples per group for statistically meaningful measurement).
Run statistical significance tests before reporting bias — use the sample size table to determine if your measured differences are signal or noise.
Document your metric choice, rationale, and accepted tradeoffs — this documentation is your primary compliance artifact.
Honest Limitations
- The impossibility theorem means no system achieves all fairness definitions simultaneously — this guide helps you choose the tradeoff, not eliminate it.
- Sample size requirements for reliable bias detection are large; most startups don’t have sufficient data for minority groups.
- Proxy attribute detection is imperfect — removing known proxies doesn’t guarantee the model hasn’t found other correlations.
- Regulatory requirements are evolving rapidly; this guide reflects the 2026 state and will need updates.
- Intersectional bias (combining multiple protected attributes) is harder to detect and requires exponentially more data.
- Mitigation techniques tested in research settings may perform differently on production data distributions.
Continue reading
AI Content Filtering — Guardrails That Block Without Breaking User Experience
False positive and negative rate comparison across filtering approaches, latency impact, implementation patterns, and the tradeoff between safety and usability.
Types of AI Hallucinations — Factual, Logical, Attribution, and How to Detect Each
Taxonomy of AI hallucination types with detection methods, failure rates by model and task, and a diagnostic decision tree for production systems.
AI Model Audit Guide — Pre-Deployment Testing for EU AI Act, NIST, and ISO 42001
Regulatory requirement mapping across EU AI Act, NIST AI RMF, and ISO 42001 with audit checklist, documentation templates, and compliance evidence collection methodology.