Your AI System Treats Different Groups Differently — But Which Kind of “Differently” Actually Matters?

An AI hiring tool rejects 40% of male applicants and 42% of female applicants. Is it biased? What if the acceptance rate difference is 40% vs. 55%? What if the rates are equal but the false positive rate (hiring unqualified candidates) is 8% for one group and 15% for another? Each of these is a different kind of unfairness, measured by a different metric, and the metric you choose determines what your system optimizes away. This guide provides the fairness metric taxonomy, the decision tree for choosing which metric applies to your use case, and the regulatory requirements that may choose for you.

Why You Can’t Optimize for All Fairness Metrics Simultaneously

This is the most important and least discussed fact in AI fairness: most fairness metrics are mathematically incompatible. You cannot simultaneously achieve demographic parity, equalized odds, and calibration except in trivial cases where the base rates are identical across groups.

The Chouldechova (2017) impossibility theorem proves that if the base rate differs between groups (e.g., if one demographic genuinely has a higher loan default rate), then no classifier can simultaneously satisfy false positive parity, false negative parity, and positive predictive value parity.

This means every fairness intervention is a choice — you’re choosing which type of unfairness to eliminate, knowing that other types will persist or worsen. The choice should be deliberate, documented, and aligned with the specific harms your system can cause.

The Fairness Metric Taxonomy

| Metric | What it measures | Formula (simplified) | Intuition |
|---|---|---|---|
| Demographic parity | Equal selection rates across groups | P(Ŷ=1 \| A=a) = P(Ŷ=1 \| A=b) | “Same percentage selected from each group” |
| Equal opportunity | Equal true positive rates across groups | P(Ŷ=1 \| Y=1, A=a) = P(Ŷ=1 \| Y=1, A=b) | “Qualified candidates have equal chance” |
| Equalized odds | Equal TPR and FPR across groups | TPR and FPR equal across groups | “Equal accuracy for positives AND negatives” |
| Predictive parity | Equal precision across groups | P(Y=1 \| Ŷ=1, A=a) = P(Y=1 \| Ŷ=1, A=b) | “Positive predictions equally reliable” |
| Calibration | Predicted probability matches observed frequency per group | E[Y \| S=s, A=a] = E[Y \| S=s, A=b] for all s | “80% confidence means 80% correct for everyone” |
| Individual fairness | Similar individuals get similar predictions | d(Ŷᵢ, Ŷⱼ) ≤ L·d(xᵢ, xⱼ) | “Similar people treated similarly” |
| Counterfactual fairness | Prediction unchanged if protected attribute were different | Ŷ(A=a) = Ŷ(A=b), all else equal | “Would the decision change if you changed only the protected attribute?” |
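
To make the first three definitions concrete, here is a minimal sketch that computes the group gaps from binary labels and predictions. The function names are illustrative, not from any particular library, and the sketch assumes both classes occur in each group:

```python
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Selection rate, TPR, and FPR for the subset selected by a boolean mask."""
    yt, yp = y_true[mask], y_pred[mask]
    selection = yp.mean()
    tpr = yp[yt == 1].mean()  # P(Ŷ=1 | Y=1) within the group
    fpr = yp[yt == 0].mean()  # P(Ŷ=1 | Y=0) within the group
    return selection, tpr, fpr

def fairness_gaps(y_true, y_pred, sensitive, group_a, group_b):
    """Absolute between-group gaps for three metrics from the taxonomy table."""
    sel_a, tpr_a, fpr_a = group_rates(y_true, y_pred, sensitive == group_a)
    sel_b, tpr_b, fpr_b = group_rates(y_true, y_pred, sensitive == group_b)
    return {
        "demographic_parity_gap": abs(sel_a - sel_b),
        "equal_opportunity_gap": abs(tpr_a - tpr_b),
        # Equalized odds requires BOTH rates to match, so take the worse gap.
        "equalized_odds_gap": max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)),
    }
```

A system can score zero on one gap and badly on another, which is the impossibility result in miniature: equal selection rates say nothing about equal TPRs.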

The Compatibility Matrix — Which Metrics Can Coexist

| | Demographic parity | Equal opportunity | Equalized odds | Predictive parity | Calibration |
|---|---|---|---|---|---|
| Demographic parity | — | Incompatible* | Incompatible* | Incompatible* | Incompatible* |
| Equal opportunity | Incompatible* | — | Compatible (subset) | Incompatible* | Incompatible* |
| Equalized odds | Incompatible* | Compatible (superset) | — | Incompatible* | Incompatible* |
| Predictive parity | Incompatible* | Incompatible* | Incompatible* | — | Compatible (related) |
| Calibration | Incompatible* | Incompatible* | Incompatible* | Compatible (related) | — |

*Incompatible when base rates differ between groups (which is almost always the case in real data).

Practical implication: When a stakeholder says “make the AI fair,” the first question is “fair by which definition?” — because the definitions conflict.

The Decision Tree — Which Metric for Which Use Case

Hiring and Admission Decisions

| Scenario | Recommended metric | Why | Legal basis |
|---|---|---|---|
| Initial screening (resume filter) | Demographic parity (4/5ths rule) | US EEOC requires selection rate ratio ≥ 80% | Title VII, EEOC Guidelines |
| Interview scoring | Equal opportunity | Qualified candidates deserve equal chance | Same legal basis, but merit-based argument stronger |
| Final hiring decision | Equalized odds | Both false positives and false negatives matter | Most defensible in litigation |

The 4/5ths rule: if the selection rate for any group is less than 80% of the selection rate for the highest-selected group, there is adverse impact. Example: if 60% of Group A is selected and 40% of Group B is selected, the ratio is 40/60 = 66.7% < 80%, indicating adverse impact.
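
The 4/5ths check reduces to a one-line ratio. A minimal sketch (the function name is illustrative), using the example from the text:

```python
def four_fifths_check(selection_rates):
    """EEOC 4/5ths rule: the lowest group selection rate must be at least
    80% of the highest; otherwise there is adverse impact."""
    ratio = min(selection_rates.values()) / max(selection_rates.values())
    return ratio, ratio < 0.80

# 60% of Group A selected vs. 40% of Group B, as in the example above.
ratio, adverse = four_fifths_check({"A": 0.60, "B": 0.40})
# ratio ≈ 0.667 < 0.80, so adverse is True
```

Note the rule compares ratios of rates, not absolute differences, so a 5-point gap at low selection rates (5% vs. 10%) is far worse than the same gap at high rates (90% vs. 95%).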

Lending and Credit

| Scenario | Recommended metric | Why | Legal basis |
|---|---|---|---|
| Loan approval | Predictive parity + calibration | Risk scores must be equally reliable across groups | ECOA, Fair Housing Act |
| Interest rate setting | Calibration | Predicted risk must match actual risk per group | ECOA, disparate impact doctrine |
| Credit limit | Equal opportunity | Creditworthy applicants should have equal access | ECOA |

Healthcare

| Scenario | Recommended metric | Why | Regulatory basis |
|---|---|---|---|
| Diagnostic screening | Equal opportunity (sensitivity) | Missing a disease in one group is worse than a false alarm | FDA guidance on clinical AI |
| Treatment recommendation | Calibration | Predicted outcomes must be accurate for all demographics | Medical ethics, informed consent |
| Resource allocation | Equalized odds | Both over-treatment and under-treatment cause harm | AMA ethics guidelines |

Content Moderation

| Scenario | Recommended metric | Why |
|---|---|---|
| Hate speech detection | Equalized odds | FPR disparity → censoring legitimate speech from minority communities |
| Content recommendation | Demographic parity (soft) | Exposure should not be disproportionately limited for any group |
| Search ranking | Calibration | Relevance scores should mean the same thing regardless of query demographic context |

Measuring Bias — The Practical Methodology

Step 1: Define Protected Attributes and Groups

| Attribute | Common groups | Data availability challenge |
|---|---|---|
| Race/ethnicity | Varies by jurisdiction | Often not collected; proxy detection unreliable |
| Gender | Male, female, non-binary | Binary data common; non-binary underrepresented |
| Age | Typically decade bins | Usually available |
| Disability | Present / not present | Rarely disclosed |
| Geography | Region, country | Available via IP; may correlate with other protected attributes |

The proxy problem: Even if you don’t use race as a feature, ZIP code, name, and education institution are strong proxies. Removing the protected attribute is insufficient — you must test for disparate impact on outcomes.
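
A crude first-pass proxy check is to measure how strongly each remaining feature correlates with the protected attribute. This sketch (illustrative names, numeric features, binary attribute; a model-based check catches nonlinear proxies that correlation misses) flags candidates for closer inspection:

```python
import numpy as np

def proxy_scan(X, feature_names, sensitive01, threshold=0.3):
    """Flag features whose Pearson correlation with a binary (0/1) protected
    attribute exceeds a threshold -- a first-pass proxy screen only."""
    flags = {}
    for j, name in enumerate(feature_names):
        r = np.corrcoef(X[:, j], sensitive01)[0, 1]
        if abs(r) > threshold:
            flags[name] = round(float(r), 3)
    return flags
```

Flagged features are not automatically disqualifying; the point is that removal of the attribute itself proves nothing until you have also checked what stands in for it.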

Step 2: Compute Metrics

For each protected group pair (A=a vs A=b), compute:

| Metric | Computation | Threshold for concern |
|---|---|---|
| Selection rate ratio | min(rate_a, rate_b) / max(rate_a, rate_b) | < 0.80 (4/5ths rule) |
| TPR difference | \|TPR_a - TPR_b\| | > 0.05 (5 percentage points) |
| FPR difference | \|FPR_a - FPR_b\| | > 0.05 |
| Calibration gap | max \|E[Y \| S=s, A=a] - E[Y \| S=s, A=b]\| across score bins | > 0.05 |
| Disparate impact ratio | (adverse outcome rate for protected group) / (adverse outcome rate for reference group) | > 1.25 (inverse of 4/5ths) |
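
The calibration gap is the least obvious entry to compute, because it requires binning the model's scores first. A minimal sketch under stated assumptions (scores in [0, 1], equal-width bins, function name illustrative):

```python
import numpy as np

def calibration_gap(y_true, scores, group_mask, bins=10):
    """Max per-bin gap in observed positive rate between a group and its
    complement. Bins with no members from one side are skipped."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & (scores < hi)  # note: excludes scores == 1.0
        a = in_bin & group_mask
        b = in_bin & ~group_mask
        if a.any() and b.any():
            gaps.append(abs(y_true[a].mean() - y_true[b].mean()))
    return max(gaps) if gaps else float("nan")
```

A gap above 0.05 in any well-populated bin means the same score carries different meaning for different groups, which is exactly what the lending and healthcare tables above warn against.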

Step 3: Statistical Significance

A 3-percentage-point TPR difference on 50 samples is noise. The same difference on 50,000 samples is a signal. Always compute confidence intervals:

| Sample size per group | Detectable TPR difference (95% CI, 80% power) |
|---|---|
| 100 | ≥ 15 percentage points |
| 500 | ≥ 7 percentage points |
| 1,000 | ≥ 5 percentage points |
| 5,000 | ≥ 2 percentage points |
| 50,000 | ≥ 0.7 percentage points |

Practical implication: If your minority group has <500 samples, you cannot reliably detect bias smaller than 7 percentage points. This is a measurement limitation, not proof of fairness.
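
The sample size table can be approximated with the standard normal-approximation formula for comparing two proportions. This sketch assumes the worst-case base rate p = 0.5; the table's exact figures depend on the assumed base rate, so expect small differences:

```python
import math

def min_detectable_diff(n_per_group, p=0.5, z_alpha=1.96, z_beta=0.84):
    """Approximate minimum detectable difference between two proportions
    (two-sided alpha = 0.05, power = 0.80, equal group sizes).
    z_alpha and z_beta are the standard normal quantiles for those settings."""
    return (z_alpha + z_beta) * math.sqrt(2 * p * (1 - p) / n_per_group)
```

Running this for n = 1,000 gives roughly a 6-point minimum detectable difference, in line with the table; quadrupling the sample size halves the detectable gap.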

Step 4: Root Cause Analysis

When bias is detected, identify the source:

| Source | Indicator | Mitigation |
|---|---|---|
| Training data imbalance | Group sizes differ > 3× | Oversampling, synthetic data, re-weighting |
| Label bias | Labels assigned by biased humans | Label audit, multi-rater agreement |
| Feature encoding | Protected attribute proxy features | Proxy detection, feature ablation study |
| Sampling bias | Non-representative data collection | Stratified sampling, data augmentation |
| Measurement bias | Different measurement accuracy across groups | Separate calibration per group |
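
The re-weighting mitigation in the first row has a standard form, due to Kamiran and Calders: weight each (group, label) cell so that group membership and label are statistically independent in the weighted training data. A self-contained sketch:

```python
from collections import Counter

def reweighing(groups, labels):
    """Kamiran-Calders reweighing: per-example weight
    P(group) * P(label) / P(group, label), computed from empirical counts.
    Cells that are over-represented relative to independence get weight < 1."""
    n = len(groups)
    g_count = Counter(groups)
    y_count = Counter(labels)
    gy_count = Counter(zip(groups, labels))
    return [
        (g_count[g] / n) * (y_count[y] / n) / (gy_count[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]
```

If group and label are already independent, every weight is exactly 1.0 and training is unchanged; the farther a cell sits from independence, the more its examples are up- or down-weighted.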

Regulatory Requirements — What the Law Requires

| Regulation | Jurisdiction | What it requires | Penalty |
|---|---|---|---|
| EU AI Act (2026) | EU | Risk assessment, bias testing, documentation for high-risk AI | Up to 7% global revenue |
| NYC Local Law 144 | NYC | Annual bias audit for automated employment decision tools | $500–1,500 per violation per day |
| ECOA / Reg B | US | Fair lending; no discrimination on protected characteristics | Penalties + litigation + consent orders |
| EEOC Guidelines | US | 4/5ths rule for employment selection | Litigation, back pay, injunctive relief |
| GDPR Art. 22 | EU | Safeguards (including human review and meaningful information) for solely automated decisions | Up to 4% global revenue |
| Colorado AI Act (2026) | Colorado | Risk assessment for high-risk AI decisions | AG enforcement |

The compliance gap: Most regulations require bias testing but don’t specify which metrics to use. This creates ambiguity — and audit risk. Document your metric choice, your rationale, and the tradeoffs you accepted. A well-documented decision to optimize equal opportunity, with your demographic parity numbers reported alongside, is more defensible than an undocumented claim of “fairness.”

Mitigation Techniques — What Actually Works

| Technique | Stage | Accuracy impact | Fairness improvement | Complexity |
|---|---|---|---|---|
| Re-weighting | Pre-processing | -0.5 to -2% | Moderate (demographic parity) | Low |
| Oversampling minority | Pre-processing | 0 to -1% | Moderate (reduces data imbalance) | Low |
| Adversarial debiasing | In-processing | -1 to -3% | High (equalized odds) | High |
| Calibrated equalized odds | Post-processing | -0.5 to -1.5% | High (equalized odds + calibration) | Medium |
| Threshold adjustment | Post-processing | 0 to -2% | Moderate (demographic parity) | Low |
| Reject option classification | Post-processing | +1% (on accepted) | High (routes uncertain cases to humans) | Medium |

The accuracy-fairness tradeoff: Nearly every mitigation technique reduces overall accuracy; reject option classification is the exception only because it improves accuracy on accepted cases by deferring uncertain ones to humans. The question is whether the accuracy loss is acceptable for the fairness gain. In most production systems, a 1–2% accuracy reduction to achieve regulatory compliance is a clear win.
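
The threshold adjustment row has the simplest possible implementation: pick a per-group score cutoff from each group's score quantiles so every group ends up with the same selection rate. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def parity_thresholds(scores, sensitive, target_rate):
    """Per-group score thresholds yielding (approximately) the same selection
    rate in every group -- demographic parity via post-processing."""
    return {
        g: float(np.quantile(scores[sensitive == g], 1 - target_rate))
        for g in np.unique(sensitive)
    }
```

Note what this buys and what it costs: selection rates equalize by construction, but the groups now face different cutoffs, so TPR and FPR gaps (and counterfactual fairness) may worsen — the impossibility theorem again.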

How to Apply This


1. Identify which metrics your jurisdiction requires from the regulatory requirements table. If no specific requirement exists, choose based on the use-case decision tree.

2. Check the compatibility matrix for your chosen metrics; know in advance which types of unfairness you’re accepting when you optimize for them.

3. Build a labeled evaluation dataset with demographic annotations (minimum 500 samples per group for statistically meaningful measurement).

4. Run statistical significance tests before reporting bias; use the sample size table to determine whether your measured differences are signal or noise.

5. Document your metric choice, rationale, and accepted tradeoffs; this documentation is your primary compliance artifact.

Honest Limitations

- The impossibility theorem means no system achieves all fairness definitions simultaneously — this guide helps you choose, not eliminate, the tradeoff.
- Sample size requirements for reliable bias detection are large; most startups don’t have sufficient data for minority groups.
- Proxy attribute detection is imperfect — removing known proxies doesn’t guarantee the model hasn’t found other correlations.
- Regulatory requirements are evolving rapidly; this guide reflects the 2026 state and will need updates.
- Intersectional bias (combining multiple protected attributes) is harder to detect and requires exponentially more data.
- Mitigation techniques tested in research settings may perform differently on production data distributions.