AI Bias Detection — Demographic Parity, Equal Opportunity, Calibration, and When Each Metric Applies
Fairness metric decision tree per use case, measurement methodology, regulatory requirements, and practical implementation for production AI systems.
Your AI System Treats Different Groups Differently — But Which Kind of “Differently” Actually Matters?
An AI hiring tool rejects 40% of male applicants and 42% of female applicants. Is it biased? What if the acceptance rate difference is 40% vs. 55%? What if the rates are equal but the false positive rate (hiring unqualified candidates) is 8% for one group and 15% for another? Each of these is a different kind of unfairness, measured by a different metric, and the metric you choose determines what your system optimizes away. This guide provides the fairness metric taxonomy, the decision tree for choosing which metric applies to your use case, and the regulatory requirements that may choose for you.
Why You Can’t Optimize for All Fairness Metrics Simultaneously
This is the most important and least discussed fact in AI fairness: most fairness metrics are mathematically incompatible. You cannot simultaneously achieve demographic parity, equalized odds, and calibration except in trivial cases where base rates are identical across groups or prediction is perfect.
The Chouldechova (2017) impossibility theorem proves that if the base rate differs between groups (e.g., if one demographic genuinely has a higher loan default rate), then no imperfect classifier can simultaneously satisfy false positive rate parity, false negative rate parity, and positive predictive value parity.
This means every fairness intervention is a choice — you’re choosing which type of unfairness to eliminate, knowing that other types will persist or worsen. The choice should be deliberate, documented, and aligned with the specific harms your system can cause.
The Fairness Metric Taxonomy
| Metric | What it measures | Formula (simplified) | Intuition |
|---|---|---|---|
| Demographic parity | Equal selection rates across groups | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | “Same percentage selected from each group” |
| Equal opportunity | Equal true positive rates across groups | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | “Qualified candidates have equal chance” |
| Equalized odds | Equal TPR AND FPR across groups | TPR and FPR equal across groups | “Equal accuracy for positives AND negatives” |
| Predictive parity | Equal precision across groups | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | “Positive predictions equally reliable” |
| Calibration | Predicted probability matches observed frequency per group | E[Y \| S=s, A=a] = E[Y \| S=s, A=b] for all s | “80% confidence means 80% correct for everyone” |
| Individual fairness | Similar individuals get similar predictions | d(Ŷᵢ, Ŷⱼ) ≤ L·d(xᵢ, xⱼ) | “Similar people treated similarly” |
| Counterfactual fairness | Prediction unchanged if protected attribute were different | Ŷ(A=a) = Ŷ(A=b), all else equal | “Would the decision change if you changed only the protected attribute?” |
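The parity conditions in the table reduce to simple rate comparisons over predictions and labels. A minimal sketch in pure Python; the function and key names are illustrative, not from any particular fairness library:

```python
def group_rates(y_true, y_pred, group, a):
    """Selection rate, TPR, FPR, and precision for the subgroup where group == a."""
    idx = [i for i, g in enumerate(group) if g == a]
    yt = [y_true[i] for i in idx]
    yp = [y_pred[i] for i in idx]
    tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
    pos, neg = sum(yt), len(yt) - sum(yt)
    sel = sum(yp) / len(yp)                   # P(Yhat=1 | A=a)
    tpr = tp / pos if pos else 0.0            # P(Yhat=1 | Y=1, A=a)
    fpr = fp / neg if neg else 0.0            # P(Yhat=1 | Y=0, A=a)
    ppv = tp / sum(yp) if sum(yp) else 0.0    # P(Y=1 | Yhat=1, A=a)
    return sel, tpr, fpr, ppv

def fairness_gaps(y_true, y_pred, group, a, b):
    """Absolute gap between groups a and b under each parity condition."""
    sel_a, tpr_a, fpr_a, ppv_a = group_rates(y_true, y_pred, group, a)
    sel_b, tpr_b, fpr_b, ppv_b = group_rates(y_true, y_pred, group, b)
    return {
        "demographic_parity": abs(sel_a - sel_b),      # selection rate gap
        "equal_opportunity": abs(tpr_a - tpr_b),       # TPR gap
        "equalized_odds": max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)),
        "predictive_parity": abs(ppv_a - ppv_b),       # precision gap
    }
```

Note how one set of predictions can pass one condition and fail another: equal selection rates (zero demographic parity gap) say nothing about the TPR or precision gaps.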
The Compatibility Matrix — Which Metrics Can Coexist
| | Demographic parity | Equal opportunity | Equalized odds | Predictive parity | Calibration |
|---|---|---|---|---|---|
| Demographic parity | — | Incompatible* | Incompatible* | Incompatible* | Incompatible* |
| Equal opportunity | Incompatible* | — | Compatible (subset) | Incompatible* | Incompatible* |
| Equalized odds | Incompatible* | Compatible (superset) | — | Incompatible* | Incompatible* |
| Predictive parity | Incompatible* | Incompatible* | Incompatible* | — | Compatible (related) |
| Calibration | Incompatible* | Incompatible* | Incompatible* | Compatible (related) | — |
*Incompatible when base rates differ between groups (which is almost always the case in real data).
Practical implication: When a stakeholder says “make the AI fair,” the first question is “fair by which definition?” — because the definitions conflict.
The Decision Tree — Which Metric for Which Use Case
Hiring and Admission Decisions
| Scenario | Recommended metric | Why | Legal basis |
|---|---|---|---|
| Initial screening (resume filter) | Demographic parity (4/5ths rule) | EEOC guidelines treat a selection rate ratio < 80% as evidence of adverse impact | Title VII, EEOC Guidelines |
| Interview scoring | Equal opportunity | Qualified candidates deserve equal chance | Same legal basis, but merit-based argument stronger |
| Final hiring decision | Equalized odds | Both false positives and false negatives matter | Most defensible in litigation |
The 4/5ths rule: if the selection rate for any group is less than 80% of the selection rate for the highest-selected group, there is adverse impact. Example: if 60% of Group A is selected and 40% of Group B is selected, the ratio is 40/60 = 66.7% < 80%, indicating adverse impact.
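The 4/5ths check is mechanical once per-group selection rates are known. A hedged sketch (function name is illustrative):

```python
def four_fifths_check(selection_rates):
    """selection_rates: dict of group -> selection rate.
    Returns (ratio, adverse_impact_flag) per the 4/5ths rule."""
    top = max(selection_rates.values())
    ratio = min(selection_rates.values()) / top
    return ratio, ratio < 0.80   # True means adverse impact is indicated
```

On the worked example above, `four_fifths_check({"A": 0.6, "B": 0.4})` yields a ratio of about 0.667 and flags adverse impact.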
Lending and Credit
| Scenario | Recommended metric | Why | Legal basis |
|---|---|---|---|
| Loan approval | Predictive parity + calibration | Risk scores must be equally reliable across groups | ECOA, Fair Housing Act |
| Interest rate setting | Calibration | Predicted risk must match actual risk per group | ECOA, disparate impact doctrine |
| Credit limit | Equal opportunity | Creditworthy applicants should have equal access | ECOA |
Healthcare
| Scenario | Recommended metric | Why | Regulatory basis |
|---|---|---|---|
| Diagnostic screening | Equal opportunity (sensitivity) | Missing a disease in one group is worse than false alarm | FDA guidance on clinical AI |
| Treatment recommendation | Calibration | Predicted outcomes must be accurate for all demographics | Medical ethics, informed consent |
| Resource allocation | Equalized odds | Both over-treatment and under-treatment cause harm | AMA ethics guidelines |
Content Moderation
| Scenario | Recommended metric | Why |
|---|---|---|
| Hate speech detection | Equalized odds | FPR disparity → censoring legitimate speech from minority communities |
| Content recommendation | Demographic parity (soft) | Exposure should not be disproportionately limited for any group |
| Search ranking | Calibration | Relevance scores should mean the same thing regardless of query demographic context |
Measuring Bias — The Practical Methodology
Step 1: Define Protected Attributes and Groups
| Attribute | Common groups | Data availability challenge |
|---|---|---|
| Race/ethnicity | Varies by jurisdiction | Often not collected; proxy detection unreliable |
| Gender | Male, female, non-binary | Binary data common; non-binary underrepresented |
| Age | Typically decade bins | Usually available |
| Disability | Present/not | Rarely disclosed |
| Geography | Region, country | Available via IP; may correlate with other protected attributes |
The proxy problem: Even if you don’t use race as a feature, ZIP code, name, and education institution are strong proxies. Removing the protected attribute is insufficient — you must test for disparate impact on outcomes.
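One cheap first-pass screen for proxies is to correlate each candidate feature with the protected attribute. The sketch below uses plain Pearson correlation against a 0/1-encoded attribute (all names are illustrative). The caveat runs in both directions: a high correlation flags a likely proxy, but a low one does not clear the feature, since proxies can be nonlinear or emerge only from feature combinations.

```python
def pearson(xs, ys):
    """Plain Pearson correlation; with a 0/1 attribute this is point-biserial."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0

def proxy_screen(features, protected, threshold=0.3):
    """features: dict of name -> values; protected: 0/1-encoded attribute.
    Returns the features whose |correlation| with the attribute exceeds threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) > threshold:
            flagged[name] = r
    return flagged
```

A stronger screen is to train a model to predict the protected attribute from the remaining features; high attainable accuracy means proxy information survives, but that requires an ML dependency and is omitted here.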
Step 2: Compute Metrics
For each protected group pair (A=a vs A=b), compute:
| Metric | Computation | Threshold for concern |
|---|---|---|
| Selection rate ratio | min(rate_a, rate_b) / max(rate_a, rate_b) | < 0.80 (4/5ths rule) |
| TPR difference | \|TPR_a - TPR_b\| | > 0.05 (5 percentage points) |
| FPR difference | \|FPR_a - FPR_b\| | > 0.05 |
| Calibration gap | max over score bins s of \|E[Y \| S=s, A=a] - E[Y \| S=s, A=b]\| | > 0.05 |
| Disparate impact ratio | (adverse outcome rate for protected group) / (adverse outcome rate for reference group) | > 1.25 (inverse of 4/5ths) |
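Once per-group rates are in hand, applying the table's concern thresholds is mechanical. A sketch with the 0.80 and 0.05 cutoffs taken from the table; all names are illustrative:

```python
# Concern thresholds from the table above.
THRESHOLDS = {"selection_ratio": 0.80, "tpr_diff": 0.05, "fpr_diff": 0.05}

def flag_concerns(m_a, m_b):
    """m_a, m_b: dicts with 'sel', 'tpr', 'fpr' for one group each."""
    flags = []
    ratio = min(m_a["sel"], m_b["sel"]) / max(m_a["sel"], m_b["sel"])
    if ratio < THRESHOLDS["selection_ratio"]:
        flags.append(f"selection rate ratio {ratio:.2f} fails 4/5ths rule")
    if abs(m_a["tpr"] - m_b["tpr"]) > THRESHOLDS["tpr_diff"]:
        flags.append("TPR difference exceeds 0.05")
    if abs(m_a["fpr"] - m_b["fpr"]) > THRESHOLDS["fpr_diff"]:
        flags.append("FPR difference exceeds 0.05")
    return flags

def calibration_gap(scores_a, y_a, scores_b, y_b, n_bins=10):
    """Max per-bin gap in observed positive rate, over bins both groups occupy."""
    def bin_means(scores, labels):
        bins = {}
        for s, y in zip(scores, labels):
            bins.setdefault(min(int(s * n_bins), n_bins - 1), []).append(y)
        return {b: sum(v) / len(v) for b, v in bins.items()}
    m_a, m_b = bin_means(scores_a, y_a), bin_means(scores_b, y_b)
    shared = set(m_a) & set(m_b)
    return max((abs(m_a[b] - m_b[b]) for b in shared), default=0.0)
```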
Step 3: Statistical Significance
A 3-percentage-point TPR difference on 50 samples is noise. The same difference on 50,000 samples is a signal. Always run a significance test and report confidence intervals; the minimum detectable TPR difference depends on per-group sample size:
| Sample size per group | Detectable TPR difference (95% CI, 80% power) |
|---|---|
| 100 | ≥ 15 percentage points |
| 500 | ≥ 7 percentage points |
| 1,000 | ≥ 5 percentage points |
| 5,000 | ≥ 2 percentage points |
| 50,000 | ≥ 0.7 percentage points |
Practical implication: If your minority group has <500 samples, you cannot reliably detect bias smaller than 7 percentage points. This is a measurement limitation, not proof of fairness.
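A quick significance check for a TPR gap is a two-proportion z-test. A stdlib-only sketch (illustrative names; the normal approximation makes this only a rough screen at small sample sizes):

```python
import math

def tpr_diff_z_test(tp_a, pos_a, tp_b, pos_b):
    """Two-sided pooled two-proportion z-test on TPRs.
    tp_* = true positives, pos_* = actual positives, per group."""
    p_a, p_b = tp_a / pos_a, tp_b / pos_b
    pooled = (tp_a + tp_b) / (pos_a + pos_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_a + 1 / pos_b))
    z = (p_a - p_b) / se if se else 0.0
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a - p_b, p_value
```

The same 6-point TPR gap is not significant with 50 positives per group but is overwhelming with 50,000, matching the sample size table above.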
Step 4: Root Cause Analysis
When bias is detected, identify the source:
| Source | Indicator | Mitigation |
|---|---|---|
| Training data imbalance | Group sizes differ >3x | Oversampling, synthetic data, re-weighting |
| Label bias | Labels assigned by biased humans | Label audit, multi-rater agreement |
| Feature encoding | Protected attribute proxy features | Proxy detection, feature ablation study |
| Sampling bias | Non-representative data collection | Stratified sampling, data augmentation |
| Measurement bias | Different measurement accuracy across groups | Separate calibration per group |
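The re-weighting mitigation in the table can be sketched in the style of Kamiran and Calders' reweighing: each (group, label) cell gets weight P(A=a)·P(Y=y) / P(A=a, Y=y), so that under the weights, group membership and label become statistically independent. Names are illustrative:

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-example weight P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    n = len(labels)
    count_a = Counter(groups)                 # marginal group counts
    count_y = Counter(labels)                 # marginal label counts
    count_ay = Counter(zip(groups, labels))   # joint counts
    return [
        (count_a[a] / n) * (count_y[y] / n) / (count_ay[(a, y)] / n)
        for a, y in zip(groups, labels)
    ]
```

Combinations that are underrepresented relative to independence get weights above 1; overrepresented ones get weights below 1. The weights feed into any trainer that accepts per-sample weights.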
Regulatory Requirements — What the Law Requires
| Regulation | Jurisdiction | What it requires | Penalty |
|---|---|---|---|
| EU AI Act (2026) | EU | Risk assessment, bias testing, documentation for high-risk AI | Up to 7% global revenue |
| NYC Local Law 144 | NYC | Annual bias audit for automated employment decision tools | $500 first violation; up to $1,500 per subsequent violation |
| ECOA / Reg B | US | Fair lending; no discrimination on protected characteristics | Penalties + litigation + consent orders |
| EEOC Guidelines | US | 4/5ths rule for employment selection | Litigation, back pay, injunctive relief |
| GDPR Art. 22 | EU | Right not to be subject to solely automated decisions; meaningful information about the logic involved | Up to 4% global revenue |
| Colorado AI Act (2026) | Colorado | Risk assessment for high-risk AI decisions | AG enforcement |
The compliance gap: Most regulations require bias testing but don’t specify which metrics to use. This creates ambiguity — and audit risk. Document your metric choice, your rationale, and the tradeoffs you accepted. A well-documented decision to optimize equal opportunity (with demographic parity data available) is more defensible than an undocumented claim of “fairness.”
Mitigation Techniques — What Actually Works
| Technique | Stage | Accuracy impact | Fairness improvement | Complexity |
|---|---|---|---|---|
| Re-weighting | Pre-processing | -0.5 to -2% | Moderate (demographic parity) | Low |
| Oversampling minority | Pre-processing | 0 to -1% | Moderate (reduces data imbalance) | Low |
| Adversarial debiasing | In-processing | -1 to -3% | High (equalized odds) | High |
| Calibrated equalized odds | Post-processing | -0.5 to -1.5% | High (equalized odds + calibration) | Medium |
| Threshold adjustment | Post-processing | 0 to -2% | Moderate (demographic parity) | Low |
| Reject option classification | Post-processing | +1% (on accepted) | High (routes uncertain cases to humans) | Medium |
The accuracy-fairness tradeoff: Nearly every mitigation technique reduces overall accuracy (reject option classification is the exception, and it pays in human review effort instead). The question is whether the accuracy loss is acceptable for the fairness gain. In most production systems, a 1-2% accuracy reduction to achieve regulatory compliance is a clear win.
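Threshold adjustment from the mitigation table can be sketched as a per-group search for the score cutoff whose selection rate best matches a shared target, trading some accuracy for demographic parity. All names are illustrative:

```python
def group_thresholds(scores_by_group, target_rate):
    """Pick, per group, the grid threshold whose selection rate is closest
    to target_rate. Returns a dict of group -> threshold."""
    grid = [i / 100 for i in range(1, 100)]
    chosen = {}
    for g, scores in scores_by_group.items():
        best_t, best_gap = 0.5, float("inf")
        for t in grid:
            rate = sum(s >= t for s in scores) / len(scores)
            if abs(rate - target_rate) < best_gap:
                best_t, best_gap = t, abs(rate - target_rate)
        chosen[g] = best_t
    return chosen
```

Group-specific thresholds equalize selection rates by construction, but (per the compatibility matrix) they generally worsen predictive parity when base rates differ, and their legality varies by jurisdiction and domain.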
How to Apply This
Use the token-counter tool to estimate costs for bias evaluation pipelines — each fairness check requires inference on evaluation datasets.
Start by identifying which metrics your jurisdiction requires from the regulatory requirements table. If no specific requirement exists, choose based on the use-case decision tree.
Compute the compatibility matrix for your chosen metrics — know in advance which types of unfairness you’re accepting when you optimize for your chosen metric.
Build a labeled evaluation dataset with demographic annotations (minimum 500 samples per group for statistically meaningful measurement).
Run statistical significance tests before reporting bias — use the sample size table to determine if your measured differences are signal or noise.
Document your metric choice, rationale, and accepted tradeoffs — this documentation is your primary compliance artifact.
Honest Limitations
- The impossibility theorem means no system achieves all fairness definitions simultaneously — this guide helps you choose the tradeoff, not eliminate it.
- Sample size requirements for reliable bias detection are large; most startups don’t have sufficient data for minority groups.
- Proxy attribute detection is imperfect — removing known proxies doesn’t guarantee the model hasn’t found other correlations.
- Regulatory requirements are evolving rapidly; this guide reflects the 2026 state and will need updates.
- Intersectional bias (combining multiple protected attributes) is harder to detect and requires exponentially more data.
- Mitigation techniques tested in research settings may perform differently on production data distributions.
Continue reading
AI Content Filtering — Guardrails That Block Without Breaking User Experience
False positive and negative rate comparison across filtering approaches, latency impact, implementation patterns, and the tradeoff between safety and usability.
Types of AI Hallucinations — Factual, Logical, Attribution, and How to Detect Each
Taxonomy of AI hallucination types with detection methods, failure rates by model and task, and a diagnostic decision tree for production systems.
AI Model Audit Guide — Pre-Deployment Testing for EU AI Act, NIST, and ISO 42001
Regulatory requirement mapping across EU AI Act, NIST AI RMF, and ISO 42001 with audit checklist, documentation templates, and compliance evidence collection methodology.