Responsible AI Deployment Checklist — 40 Points from Prototype to Production
Pre-deployment checklist with pass/fail criteria covering safety testing, bias audits, monitoring, documentation, and regulatory compliance for production AI systems.
Your AI System Passes All Benchmarks — Is It Actually Ready for Users?
Benchmark accuracy is necessary but not sufficient for production deployment. A model that scores 95% on your test set can still hallucinate on edge cases your test set doesn’t cover, amplify bias against demographic groups underrepresented in your training data, break when input distribution shifts, or violate regulations you didn’t know applied. This checklist covers the 40 verification points between “the model works in my notebook” and “the model is safe to serve to users” — grouped by category, with pass/fail criteria and regulatory references.
Why Checklists Beat Principles
The AI safety literature is full of principles: “AI should be fair,” “AI should be transparent,” “AI should benefit humanity.” These are admirable aspirations and useless for engineering. You cannot deploy a principle. You can deploy a system that passes 40 specific, measurable checks.
This checklist is modeled on aviation pre-flight checklists — not because AI is as safety-critical as aviation, but because the checklist methodology works: it catches the obvious failures that smart people miss under deadline pressure. The checklist doesn’t replace expertise; it ensures expertise is consistently applied.
Category 1 — Model Quality (10 checks)
| # | Check | Pass criteria | Fail action | Regulatory reference |
|---|---|---|---|---|
| 1.1 | Task-specific accuracy | Accuracy ≥ target on held-out test set (not validation set) | Re-train, tune prompts, or lower deployment scope | ISO 42001 §6.1 |
| 1.2 | Edge case coverage | Test on 200+ edge cases identified from error analysis | If >10% failure on edge cases, fix or document limitations | EU AI Act Art. 9 |
| 1.3 | Hallucination rate | Measured on 500+ samples, below target for application type | Add retrieval grounding, citation verification, or detection layer | NIST AI RMF MAP 1.5 |
| 1.4 | Consistency | Same query produces substantively similar answer 95%+ of the time | Lower temperature, add structured output, or use self-consistency | ISO 42001 §7.5 |
| 1.5 | Latency (p50/p95/p99) | p95 latency ≤ user-facing SLA | Optimize prompt, use faster model, add caching | Internal SLA |
| 1.6 | Throughput | Handles expected peak QPS with <5% error rate | Scale infrastructure, add rate limiting, implement queuing | Internal SLA |
| 1.7 | Cost per query | Below budget at projected volume | Optimize prompt length, route to cheaper models, batch where possible | Internal budget |
| 1.8 | Regression test suite | Automated test suite covering core functionality, runs on every deployment | Build test suite before deploying | ISO 42001 §8.1 |
| 1.9 | Output format validation | Structured outputs (JSON, etc.) pass schema validation 99%+ of the time | Add retry logic, output parsers, or format enforcement | Internal quality |
| 1.10 | Graceful degradation | System handles model unavailability (API timeout, rate limit) without crashing | Add fallback model, cached responses, or informative error messages | ISO 42001 §6.1 |
Category 2 — Safety and Harm Prevention (10 checks)
| # | Check | Pass criteria | Fail action | Regulatory reference |
|---|---|---|---|---|
| 2.1 | Content safety filtering | Input/output filters block harmful content with <2% false positive rate | Tune filter thresholds, add human escalation path | EU AI Act Art. 9 |
| 2.2 | Prompt injection resistance | System resists standard injection attacks (DAN, ignore instructions, delimiter bypass) | Add input sanitization, system prompt protection, output validation | OWASP LLM Top 10 |
| 2.3 | Data leakage prevention | System does not expose training data, system prompts, or PII in outputs | Add output filtering, test with extraction attacks | GDPR Art. 5, EU AI Act Art. 10 |
| 2.4 | Bias audit | Fairness metrics computed for all protected groups, documented, within acceptable thresholds | Apply mitigation techniques, document accepted tradeoffs | EU AI Act Art. 10, NYC LL144, ECOA |
| 2.5 | Toxicity screening | Output toxicity rate <0.1% on production-representative inputs | Add toxicity classifier, tune safety training, add guardrails | Platform policies, EU DSA |
| 2.6 | Over-reliance prevention | System clearly communicates uncertainty; doesn’t present hallucination as fact | Add confidence indicators, “AI-generated” labels, uncertainty language | EU AI Act Art. 13 |
| 2.7 | Misuse resistance | System cannot be easily used for harm (e.g., generating malware, fraud templates) | Add use-case restrictions, monitor for misuse patterns | EU AI Act Art. 5 |
| 2.8 | Child safety | If accessible to minors, additional content filtering and interaction limits in place | Implement age-appropriate guardrails, parental controls | COPPA, EU DSA Art. 28 |
| 2.9 | Cultural sensitivity | Tested across target cultural contexts; no systematically offensive outputs | Expand test set to cover cultural contexts, add cultural sensitivity review | EU AI Act Art. 9 |
| 2.10 | Escalation path | Human review mechanism exists for high-stakes or uncertain outputs | Build human-in-the-loop pipeline with SLA for review time | EU AI Act Art. 14 |
Category 3 — Transparency and Documentation (10 checks)
| # | Check | Pass criteria | Fail action | Regulatory reference |
|---|---|---|---|---|
| 3.1 | Model card | Documented: model identity, training data summary, intended use, known limitations | Create model card (Mitchell et al. format or ISO 42001 Annex) | EU AI Act Art. 11, ISO 42001 §7.5 |
| 3.2 | Data documentation | Training/fine-tuning data: source, size, composition, known gaps documented | Create data sheet (Gebru et al. format) | EU AI Act Art. 10 |
| 3.3 | AI disclosure to users | Users informed they are interacting with AI (not human) | Add disclosure language at point of interaction | EU AI Act Art. 52 |
| 3.4 | Decision explanation | For consequential decisions, explanation of factors available on request | Implement explanation pipeline (SHAP, LIME, or natural language) | GDPR Art. 22, EU AI Act Art. 13 |
| 3.5 | Version tracking | Model version, prompt version, and configuration tracked per deployment | Implement model registry, prompt version control | ISO 42001 §8.1 |
| 3.6 | Change log | All model updates, prompt changes, and guardrail modifications logged | Establish change management process | ISO 42001 §8.1 |
| 3.7 | Performance reporting | Regular accuracy, safety, and bias reports generated and reviewed | Build automated reporting pipeline | EU AI Act Art. 9 |
| 3.8 | Incident response plan | Documented procedure for AI-caused incidents (harmful outputs, data leaks) | Create incident response playbook with roles, escalation, and communication | NIST AI RMF GOVERN 1.5 |
| 3.9 | Terms of service | AI-specific terms covering limitations, liability, and data usage | Legal review of AI-specific TOS clauses | General commercial law |
| 3.10 | Regulatory mapping | Identified which regulations apply to your AI system in each jurisdiction | Complete regulatory assessment with legal counsel | EU AI Act Art. 6 |
Category 4 — Monitoring and Operations (10 checks)
| # | Check | Pass criteria | Fail action | Regulatory reference |
|---|---|---|---|---|
| 4.1 | Accuracy monitoring | Ongoing accuracy measurement on production data (not just test set) | Implement continuous evaluation with sampling strategy | ISO 42001 §9.1 |
| 4.2 | Drift detection | Alert when input distribution or output distribution shifts significantly | Deploy distribution monitoring (KL divergence, PSI, or embedding drift) | NIST AI RMF MEASURE 2.6 |
| 4.3 | Hallucination rate tracking | Hallucination rate measured weekly on production sample | Deploy detection pipeline from hallucination detection guide | ISO 42001 §9.1 |
| 4.4 | Bias monitoring | Fairness metrics recalculated monthly on production data | Automate bias reporting pipeline | EU AI Act Art. 9, NYC LL144 |
| 4.5 | User feedback loop | Mechanism for users to flag incorrect/harmful outputs | Implement thumbs-up/down, report button, or feedback form | EU AI Act Art. 14 |
| 4.6 | Cost monitoring | Per-query and total cost tracked with budget alerts | Dashboard with cost per model/task/time, budget threshold alerts | Internal budget |
| 4.7 | Latency monitoring | p50/p95/p99 latency tracked with SLA breach alerts | APM integration with latency dashboards | Internal SLA |
| 4.8 | Safety incident logging | Every content filter trigger, prompt injection attempt, and harmful output logged | Centralized safety event log with search/filter | NIST AI RMF MEASURE 2.8 |
| 4.9 | Model update policy | Documented policy for when to retrain, when to update prompts, when to switch models | Create update decision framework with trigger conditions | ISO 42001 §8.1 |
| 4.10 | Kill switch | Ability to disable AI features within minutes without full system outage | Feature flags, circuit breaker, or emergency model bypass | NIST AI RMF GOVERN 1.3 |
The Deployment Readiness Score
Count your passed checks across all 40 items:
| Score | Readiness level | Recommendation |
|---|---|---|
| 36-40 | Production ready | Deploy with standard monitoring |
| 30-35 | Conditionally ready | Deploy with enhanced monitoring + plan to close gaps within 30 days |
| 24-29 | Not ready | Address critical gaps (Category 2 failures are blockers) |
| 18-23 | Significant gaps | Major rework needed; do not deploy to external users |
| <18 | Early stage | Return to development; this is still a prototype |
Blocking criteria: Any failure in checks 2.1-2.5 (safety and harm prevention) is a deployment blocker regardless of total score. A system scoring 38/40 but failing prompt injection resistance (2.2) and bias audit (2.4) is not production-ready.
The Priority Matrix — Where to Invest First
| Priority | Category | Rationale |
|---|---|---|
| P0 (blocking) | Safety (2.1-2.5) | These failures cause direct harm to users |
| P1 (critical) | Model quality (1.1-1.3) + Monitoring (4.1-4.3) | These ensure the system works and you know when it stops |
| P2 (important) | Transparency (3.1-3.4) + Regulatory (3.10) | Required by law in many jurisdictions |
| P3 (recommended) | Remaining checks | Best practices that reduce risk and improve operations |
Regulatory Applicability by System Type
| System type | EU AI Act risk level | Required checks (minimum) | Recommended checks |
|---|---|---|---|
| General chatbot | Limited risk | 3.3 (disclosure), basic safety | All Category 1-2 |
| Customer service AI | Limited-high risk | 3.3, 3.4, 2.4, 2.6, 4.5 | All 40 |
| Hiring/recruitment AI | High risk | ALL 40 checks (mandatory) | — |
| Medical/diagnostic AI | High risk | ALL 40 checks + domain-specific validation | — |
| Credit/lending AI | High risk | ALL 40 checks + fair lending specific | — |
| Content moderation AI | High risk | 2.1-2.5, 2.9, 3.3, 3.4, 4.4, 4.5 | All 40 |
How to Apply This
Use the token-counter tool to estimate monitoring costs — hallucination rate tracking and bias monitoring consume inference tokens on evaluation samples.
Start with Category 2 (safety). These are blocking and non-negotiable. A system that isn’t safe cannot be compensated for with documentation or monitoring.
Build the regression test suite (1.8) before your first deployment — it’s the single highest-ROI investment. Every future deployment, prompt change, and model update runs against this suite.
Implement the kill switch (4.10) on day one. You will need it — every production AI team uses it eventually.
Document as you go (Category 3). Post-hoc documentation is always lower quality and more expensive than documentation written during development.
Automate monitoring (Category 4) before launch. Discovering problems from user complaints instead of dashboards is the most expensive way to find failures.
Honest Limitations
This checklist covers technical deployment readiness, not organizational readiness (team skills, culture, leadership support). Regulatory requirements are jurisdiction-specific; this checklist maps to EU AI Act and US frameworks but may not cover all applicable regulations in your jurisdiction. The pass/fail thresholds are guidelines, not absolutes — your specific application context may warrant stricter or more lenient criteria. Some checks (especially bias audits) require significant data volume to produce meaningful results. The checklist assumes a single AI system; multi-model architectures (routing, ensemble, cascade) require additional coordination checks. This is a point-in-time assessment — production systems require ongoing monitoring, not one-time verification.
Continue reading
AI Bias Detection — Demographic Parity, Equal Opportunity, Calibration, and When Each Metric Applies
Fairness metric decision tree per use case, measurement methodology, regulatory requirements, and practical implementation for production AI systems.
AI Content Filtering — Guardrails That Block Without Breaking User Experience
False positive and negative rate comparison across filtering approaches, latency impact, implementation patterns, and the tradeoff between safety and usability.
Types of AI Hallucinations — Factual, Logical, Attribution, and How to Detect Each
Taxonomy of AI hallucination types with detection methods, failure rates by model and task, and a diagnostic decision tree for production systems.