AI Safety Incident Response Runbook — Incident Classification Matrix Across Prompt-Injection + Data-Exfiltration + Harmful-Hallucination + Bias + Jailbreak + PII-Leak + Model-Evasion, Severity SLAs, Detect-Contain-Eradicate-Recover-Postmortem Playbook, GDPR Article 33 Notification Paths
This AI safety incident response runbook spans an incident classification matrix (prompt injection, data exfiltration, harmful hallucination, bias manifestation, jailbreak, PII leak, model evasion, and dependency compromise), severity levels SEV-1 through SEV-4 with SLA targets per severity, a detect-contain-eradicate-recover-postmortem playbook per incident class, regulator-notification paths (the GDPR Article 33 72-hour clock, EU AI Act reporting, FTC Section 5, and state attorney-general notification matrices), communication templates for internal, customer, regulator, and public audiences, a postmortem template with action-item tracking, and an on-call readiness checklist. Together these operationalize the difference between AI safety as documentation and AI safety as incident response.
Your Production LLM Just Returned a User’s Private Document Content in Response to an Unrelated User’s Query, a Customer Has Screenshots, Legal Is Asking About GDPR Article 33 Notification Obligations, Your CEO Wants to Know What to Say Publicly, and the Engineering Team Has Not Yet Determined Whether This Is a Prompt-Injection Attack, a Retrieval-Misattribution Bug, a Training-Data Memorization Leak, or a Cache-Key Collision — You Have 72 Hours
AI safety incidents are not the failure mode of model misbehavior — they are the failure mode of the incident response process that activates when model misbehavior manifests in production. The model will misbehave; that is a statistical certainty. What separates the organizations that survive AI incidents from the ones that compound them into lawsuits, fines, and public-trust collapse is the response runbook — the predefined classification, severity, containment, notification, and postmortem workflow that activates under pressure. This runbook covers the eight incident classes that matter, severity definitions with SLA targets, the detect→contain→eradicate→recover→postmortem playbook per class, regulator-notification paths, communication templates, and the postmortem structure that converts incidents into systematic improvement rather than blame assignment.
Incident Classification Matrix
Eight incident classes cover the operational AI-safety failure surface:
| Class | Definition | Detection signal | Typical discovery path | Representative example |
|---|---|---|---|---|
| Prompt injection | Attacker-crafted input hijacks system-prompt intent | Unusual output patterns; system-prompt leakage in responses; tool-call anomalies | User report, red-team probe, automated prompt-injection classifier | User input “ignore previous instructions” causes model to reveal system prompt |
| Data exfiltration | Sensitive data in training/fine-tuning/RAG corpus leaks through model output | PII appearing in outputs; specific document fragments matching known-sensitive docs | DLP pipeline alert, user complaint, regulator inquiry | Customer B’s account details appear in Customer A’s response |
| Harmful hallucination | Confident generation of false content causing user harm | High-confidence answers contradicted by ground truth; user-reported harm | User complaint, safety-review sampling, external investigation | Medical symptom-checker advises delaying ER visit for condition requiring immediate care |
| Bias manifestation | Systematic differential quality/outcomes across protected attributes | Demographic-parity metric divergence; fairness audit findings | Bias-monitoring pipeline, fairness audit, disparate user complaints | Hiring-assistant consistently scores one demographic lower with equal qualifications |
| Jailbreak exploit | Attacker bypasses content-safety filters via evasion technique | Content-filter hit rate change; output violating documented usage policies | Safety-filter monitoring, red-team, public jailbreak disclosure | Role-play prompt enables content generation that directly-asked form would refuse |
| PII leak | Personal data exposed through model output or training | PII-detection regex/classifier hit in outputs; memorization probing | DLP alert, subject-access request revealing retained PII | Model completes a partial name + address from memorized training data |
| Model evasion | Adversarial inputs cause misclassification (classification models) | Confidence-distribution anomalies; adversarial-input detection | Adversarial monitoring, abuse-pattern reports | Safety classifier consistently passes a class of disguised harmful content |
| Dependency compromise | Upstream model, dataset, library, or API compromised | Supply-chain alert; upstream-provider disclosure; unexpected behavior shift | Security-advisory feed, provider notice, behavioral regression | Embedding-provider had data-poisoning incident; downstream retrieval is affected |
The misclassification reality: An incident is often initially ambiguous between classes. A data-exfiltration event may look like prompt injection until investigation reveals the output matched training-data memorization rather than attacker-crafted extraction. Treat initial classification as provisional; the runbook advances on the most severe plausible classification and downgrades only after investigation confirms a narrower scope.
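The provisional-classification rule can be sketched as a small helper. This is a minimal sketch; the ordering of classes by assumed worst-case severity is an illustrative assumption, not part of the runbook, and should be ranked against your own risk model:

```python
from enum import IntEnum

class IncidentClass(IntEnum):
    # Ordered by assumed worst-case severity (illustrative ranking only;
    # adjust the ordering to your own risk model).
    DEPENDENCY_COMPROMISE = 1
    MODEL_EVASION = 2
    BIAS_MANIFESTATION = 3
    JAILBREAK = 4
    PROMPT_INJECTION = 5
    HARMFUL_HALLUCINATION = 6
    PII_LEAK = 7
    DATA_EXFILTRATION = 8

def working_classification(plausible: set) -> IncidentClass:
    """Advance the runbook on the most severe plausible class;
    downgrade only after investigation eliminates worse hypotheses."""
    if not plausible:
        raise ValueError("at least one plausible class is required")
    return max(plausible)
```

Keeping the full candidate set, rather than committing to one class at detection, is what lets the downgrade happen as evidence, not as opinion.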
Severity Levels With SLA Targets
| Severity | Definition | Examples | Response SLA | Escalation path |
|---|---|---|---|---|
| SEV-1 | Active harm to users, regulatory exposure, or public safety risk; active exploit in progress | Active data-exfiltration; harmful medical/legal advice being served; wide-impact bias affecting protected groups | Detect-to-containment: ≤30 minutes. Detect-to-notification: ≤2 hours internal, ≤72 hours regulator | Immediate: CTO, CEO, General Counsel, DPO, security on-call |
| SEV-2 | Contained harm or high-likelihood exploit; single-user impact with regulatory implications | Single-user PII leak; isolated jailbreak with limited blast radius; contained bias finding | Detect-to-containment: ≤2 hours. Detect-to-notification: ≤24 hours internal | Engineering director, Security, Legal, Product |
| SEV-3 | Elevated risk without active harm; degraded safety posture | Safety-filter degradation without confirmed harmful output; model-evasion finding in audit; dependency-vulnerability with no confirmed exploit | Detect-to-remediation: ≤5 business days | Engineering manager, Security |
| SEV-4 | Safety-hygiene issue; documentation or process gap | Documentation-only drift; audit-log completeness gap; expired credential for monitoring tool | Remediation: next sprint | Engineering team |
SLA discipline: Severity is assigned at detection based on the plausible worst case, not the confirmed scope. An ambiguous incident that could plausibly be SEV-1 is treated as SEV-1 until investigation downgrades it. The operational cost of a false SEV-1 is much smaller than the regulatory cost of a misclassified SEV-1 that was handled as SEV-3.
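The plausible-worst-case rule is mechanical enough to encode. A minimal sketch, assuming severities are represented as integers (1 = SEV-1, most severe) and using the containment SLA targets from the table above:

```python
from dataclasses import dataclass

# Containment SLA targets in minutes, from the severity table above.
# SEV-3 and SEV-4 define remediation windows rather than containment SLAs.
CONTAINMENT_SLA_MIN = {1: 30, 2: 120}

@dataclass
class SeverityCall:
    """Candidate severities: 1 = SEV-1 (most severe) through 4 = SEV-4."""
    plausible_severities: list

    def working_severity(self) -> int:
        # Plausible-worst-case rule: run the incident at the most severe
        # candidate until investigation confirms a downgrade.
        return min(self.plausible_severities)
```

The point of encoding it is that the SLA clock keys off `working_severity()`, so an ambiguous SEV-1/SEV-3 incident starts its 30-minute containment clock immediately.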
Detect → Contain → Eradicate → Recover → Postmortem
The five-phase incident response process applies to every class with class-specific variations.
Phase 1 — Detect
| Detection layer | Signal | Typical false-positive rate |
|---|---|---|
| Automated classifiers | Real-time scoring of inputs and outputs against safety-violation classifiers | Moderate (10-25%) — tune per-class |
| DLP pipelines | Regex + classifier for PII, sensitive-document fragments, API keys in outputs | Low to moderate (5-15%) |
| Heuristic alerts | Unusual output patterns (length outliers, content-filter dodges, unusual-token distributions) | Moderate-to-high (20-40%) — triage filter required |
| User reports | Customer support tickets flagged for safety review | Variable — quality depends on user composition |
| External disclosure | Researcher/journalist/competitor/regulator notifying of finding | Typically high severity — assume serious until investigated |
| Red-team discovery | Internal adversarial testing surfacing issue | Prevention-phase finding — treat as SEV-3 unless exploit-in-wild |
| Canary queries | Known-vulnerable test queries run on schedule | Low false-positive; signals regression |
Detection-to-alert latency target: < 5 minutes for automated layers; < 15 minutes for triaged user reports.
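The DLP layer's output scan can be sketched with regex-only detection. The patterns below are illustrative assumptions; a production DLP layer combines regexes with trained classifiers and fingerprints of known-sensitive documents, which is why the table lists a 5-15% false-positive rate rather than zero:

```python
import re

# Illustrative patterns only; tune and extend per deployment.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)_[A-Za-z0-9]{20,}\b"),
}

def scan_output(text: str) -> list:
    """Return the PII categories detected in a model output."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
```

A hit on any category routes the response for blocking or review before it reaches the user, and emits the detection signal that starts the SLA clock.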
Phase 2 — Contain
Containment stops the bleeding. Per-class containment actions:
| Incident class | First containment action | Fallback containment | Reversibility |
|---|---|---|---|
| Prompt injection | Deploy input-classifier in block mode; disable affected tool-use path | Disable feature entirely | Reversible |
| Data exfiltration | Block affected query class; purge cache; isolate affected tenant | Take feature offline | Reversible |
| Harmful hallucination | Add safety-filter override on topic; route to safe-fallback response | Disable feature for affected user segment | Reversible |
| Bias manifestation | Enable differential-outcome monitoring; pause deploy pipeline | Revert to prior-known-fair version | Reversible |
| Jailbreak exploit | Add jailbreak pattern to filter; enable strict-mode filtering | Disable content-sensitive features | Reversible |
| PII leak | Block affected output pattern; purge caches containing PII | Take endpoint offline | Partially reversible (PII already leaked is unrecoverable) |
| Model evasion | Tighten classifier threshold; route to stronger verification | Disable automated decision-making on affected class | Reversible |
| Dependency compromise | Switch to backup provider if available; disable dependent features | Take feature offline | Reversible |
Containment-first discipline: Preserve evidence during containment. Snapshot affected caches, logs, and model versions before modifying them. Post-incident investigation requires evidence; containment actions routinely destroy it when runbooks don’t explicitly require preservation.
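The evidence-preservation discipline is easiest to enforce when the containment tooling does the snapshot itself. A minimal sketch, where the snapshot root and manifest format are illustrative assumptions:

```python
import json
import shutil
import time
from pathlib import Path

def preserve_then_contain(incident_id, evidence_paths, contain_action,
                          snapshot_root=Path("/var/incident-evidence")):
    """Snapshot evidence (logs, cache dumps, model-version manifests)
    BEFORE running the containment action that may destroy it."""
    snap = snapshot_root / f"{incident_id}-{int(time.time())}"
    snap.mkdir(parents=True, exist_ok=True)
    manifest = []
    for p in evidence_paths:
        dest = snap / p.name
        shutil.copy2(p, dest)  # copy2 preserves mtimes for the timeline
        manifest.append({"source": str(p), "snapshot": str(dest)})
    (snap / "manifest.json").write_text(json.dumps(manifest, indent=2))
    contain_action()  # only now is it safe to purge caches / flip flags
    return snap
```

Because containment is wrapped rather than documented, a responder cannot purge a cache without first having snapshotted it, which is the failure mode the runbook paragraph above warns about.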
Phase 3 — Eradicate
Eradication addresses the root cause. Per-class eradication:
| Incident class | Root-cause investigation | Eradication action |
|---|---|---|
| Prompt injection | Attack-pattern analysis, system-prompt-design review | Redesign system-prompt isolation; tool-use sandboxing; output-validation gate |
| Data exfiltration | Corpus audit, access-control review, embedding-similarity leak analysis | Remove sensitive content from corpus; retrain embeddings; tenant isolation enforcement |
| Harmful hallucination | Training-data/RAG-corpus factuality audit for the affected topic | Add authoritative-source retrieval for topic; uncertainty calibration; refusal-on-unknown tuning |
| Bias manifestation | Fairness audit, training-data representation analysis, model-card review | Debias mitigation (data + post-processing + model), monitoring baseline reset |
| Jailbreak exploit | Attack-surface mapping, safety-filter architecture review | Multi-layer safety architecture; adversarial-training fine-tune; monitored outputs |
| PII leak | Training-data PII audit, memorization probing, RAG-corpus PII scan | Corpus-level PII redaction; differential-privacy retraining if systematic; RAG-filter layer |
| Model evasion | Adversarial-input analysis, classifier-threshold calibration | Robust-classifier fine-tune; input-preprocessing normalization |
| Dependency compromise | Supply-chain audit, provider-postmortem review | Switch provider, implement defense-in-depth for primary dependencies |
Eradication-versus-containment distinction: Containment can be complete while eradication is in progress. Do not declare the incident closed until eradication is verified — containment masks the issue; eradication removes the cause of recurrence.
Phase 4 — Recover
Recovery restores service. The recovery checklist:
| Step | Verification | Owner |
|---|---|---|
| Eradication-verified | Test suite includes regression for this incident class | Engineering |
| Monitoring-strengthened | Detection-layer added or tuned for this incident class | SRE / MLOps |
| Containment-removed | Block rules / feature-flags removed after verification | Engineering |
| Canary-verified | Canary traffic served for verification window without recurrence | SRE / MLOps |
| Customer-notified | Affected customers notified per severity requirements | Customer Success + Legal |
| Public-communicated | Public statement if required (SEV-1 / regulatory obligation) | Comms + Legal |
| Regulatory-notified | GDPR / EU AI Act / FTC / state AG notifications per applicable jurisdictions | DPO + Legal |
Premature-recovery cost: The cost of an incident that reopens because recovery was declared before eradication verification is 3-5× the cost of holding containment for another 4-24 hours. Verification windows exist to catch this.
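The verification window can be enforced in code rather than by calendar discipline. A minimal sketch, with the window expressed in seconds for testability; the 4-24 hour length is a policy choice per the note above:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CanaryWindow:
    """Hold containment removal until a full window passes with no recurrence."""
    window_s: float
    started_at: float = field(default_factory=time.monotonic)
    recurrences: int = 0

    def record_recurrence(self) -> None:
        # A recurrence during the window resets the clock; the count
        # survives for the postmortem timeline.
        self.recurrences += 1
        self.started_at = time.monotonic()

    def safe_to_recover(self) -> bool:
        """True once a full window has elapsed since the last recurrence."""
        return time.monotonic() - self.started_at >= self.window_s
```

Gating the containment-removed step on `safe_to_recover()` is what prevents the premature-recovery reopen described above.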
Phase 5 — Postmortem
Postmortem converts the incident into organizational learning. See postmortem template below.
Regulator-Notification Paths
Regulatory notification obligations are jurisdiction-dependent and run on tight deadlines. The notification matrix:
| Regulation | Trigger | Deadline | Content requirements | Who notifies |
|---|---|---|---|---|
| GDPR Article 33 | Personal-data breach likely to risk data-subject rights | 72 hours from awareness to supervisory authority | Nature, categories/approximate numbers, likely consequences, measures taken | DPO / Legal to lead supervisory authority |
| GDPR Article 34 | Personal-data breach with high-risk to data subjects | Without undue delay to affected individuals | Plain-language description, contact point, likely consequences, measures taken | DPO / Legal / Customer Success |
| EU AI Act (Art. 73) | Serious incident involving high-risk AI system | 15 days from awareness (2 days for widespread infringement or critical-infrastructure disruption; 10 days in the event of a death) to market surveillance authority | Incident description, high-risk system classification, corrective measures | Legal + Technical lead |
| FTC (Section 5) | Deceptive or unfair practice; substantial injury | No fixed deadline; disclosure in reports/filings | Varies by context; may trigger separately from breach | Legal |
| State AG (US state-by-state) | Varies — California CCPA, NY SHIELD, Virginia VCDPA, etc. | 30-45 days typical; some (MA) immediate | State-specific templates | Legal |
| HIPAA (if applicable) | Unsecured PHI breach | 60 days to HHS + affected individuals; annual for < 500 | HHS OCR format | Compliance + Legal |
| Sector-specific (SEC, DoD, healthcare providers) | Varies | Varies | Varies | Compliance + Legal |
The 72-hour clock reality: GDPR Article 33’s 72-hour clock starts at “awareness,” not at “investigation complete.” The clock runs in calendar hours, so an incident discovered Friday evening has a Monday-evening notification deadline. The runbook must operationalize the clock, including out-of-hours coverage. The supervisory-authority filing does not require complete information — it requires the information known at the time plus a commitment to follow-up filings as investigation progresses. Teams that delay notification waiting for “complete information” routinely miss the deadline.
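The clock arithmetic is trivial but worth making explicit, because the weekend intuition fails under pressure. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

GDPR_ART33_WINDOW = timedelta(hours=72)

def notification_deadline(awareness_utc: datetime) -> datetime:
    """Article 33 clock runs from awareness, in calendar hours;
    weekends and holidays do not pause it."""
    return awareness_utc + GDPR_ART33_WINDOW

# Awareness Friday 19:00 UTC -> deadline Monday 19:00 UTC.
aware = datetime(2025, 6, 6, 19, 0, tzinfo=timezone.utc)  # a Friday
deadline = notification_deadline(aware)
```

Wiring this into the war-room automation, so the deadline is displayed from the moment of SEV-1 declaration, removes the most common delay: nobody computing the deadline at all.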
Communication Templates
Incident communication requires pre-drafted templates that Legal and Comms have approved in advance. Drafting or heavily editing them under incident pressure produces legal mistakes.
Internal — Initial Notification
INCIDENT: SEV-{n} — {class} — {one-line-summary}
Detected: {timestamp}
Affected scope: {what we know now}
Containment status: {in-progress | contained}
Incident commander: {name}
Comms lead: {name}
Legal lead: {name}
Next update: {time}
War-room: {link}
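The internal template above can be rendered programmatically so a notification never goes out with a blank field. A minimal sketch; the placeholder names are adapted to valid Python identifiers and are otherwise illustrative:

```python
INTERNAL_TEMPLATE = """\
INCIDENT: SEV-{sev} — {cls} — {summary}
Detected: {detected}
Affected scope: {scope}
Containment status: {containment}
Incident commander: {ic}
Comms lead: {comms}
Legal lead: {legal}
Next update: {next_update}
War-room: {war_room}"""

def render_internal(**fields) -> str:
    # str.format raises KeyError on a missing field, so the notification
    # fails loudly instead of shipping with a blank incident commander.
    return INTERNAL_TEMPLATE.format(**fields)
```

Failing loudly on a missing field is deliberate: an internal notification without an incident commander or a next-update time is worse than a notification delayed by thirty seconds.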
Customer — Affected Notification
Subject: Important information about your {product} account
We are writing to inform you of a recent incident affecting your {product} account.
What happened: {plain-language description}
When: {time window}
What information was involved: {categories}
What we are doing: {containment + eradication summary}
What you can do: {user actions, e.g., review activity, reset credentials}
Contact: {dedicated contact path}
We regret this incident and are taking the following measures to prevent recurrence: {summary}.
Regulator — GDPR Article 33
Follow the supervisory authority’s published form. Typical content:
- Nature of the breach (class, vector)
- Categories and approximate number of data subjects
- Categories and approximate number of personal-data records
- Likely consequences
- Measures taken or proposed
- Contact details of DPO
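The typical-content list above can be carried as a structured payload so the initial filing and the follow-up filings stay consistent. A minimal sketch; the field names are illustrative, and the supervisory authority's published form governs the actual submission:

```python
from dataclasses import dataclass, asdict, field

@dataclass
class Article33Notification:
    """Fields mirror the typical-content list above (illustrative names)."""
    breach_nature: str                 # incident class + vector
    data_subject_categories: list
    approx_data_subjects: int
    record_categories: list
    approx_records: int
    likely_consequences: str
    measures_taken: str
    dpo_contact: str
    followup_expected: bool = True     # initial filing with known facts

def to_filing(n: Article33Notification) -> dict:
    """Serialize for the authority's form or portal submission."""
    return asdict(n)
```

Keeping `followup_expected` defaulted to `True` encodes the guidance above: file with the facts known at the time and commit to follow-up filings, rather than waiting for complete information.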
Public — Statement
Draft with Legal + Comms in advance for every severity-plausible incident class. Post-incident edits under deadline routinely produce statements that damage the organization further.
Postmortem Template
The postmortem is the artifact that converts the incident into organizational learning.
# {Incident ID} — {Class} — {Severity}
## Summary
One-paragraph plain-language description of what happened.
## Timeline
| Time (UTC) | Event | Source |
|------------|-------|--------|
| ... | ... | ... |
## Impact
- Users affected: {count}
- Data affected: {categories, counts}
- Duration: {start} to {containment} to {recovery}
- Regulatory notifications triggered: {list}
- Customer notifications: {count sent, response}
## Root cause
Technical root cause — not "user error" or "model misbehavior" without further specificity. Decompose to the process or system gap that allowed the incident.
## Contributing factors
- Factor 1
- Factor 2
- ...
## What went well
- Detection path worked
- Containment within SLA
- ...
## What went poorly
- Specific failures — blameless framing
- Specific delays
- ...
## Lessons learned
- Categorized: process, tooling, training, architecture
- Each lesson maps to an action item below
## Action items
| ID | Description | Owner | Due date | Status |
|----|-------------|-------|----------|--------|
| AI-1 | ... | ... | ... | Open |
## Near-miss sibling
Was there a near-miss that should have prevented this? Document to inform near-miss escalation.
## Detection improvement
What detection signal would have caught this earlier? Wire it in as an action item.
## Signoffs
- Incident commander: {name, date}
- Engineering director: {name, date}
- Security lead: {name, date}
- Legal: {name, date}
- DPO (if applicable): {name, date}
The blameless discipline: Postmortems blame processes, systems, and decisions — not individuals. An incident where the on-call responder made a reasonable call that proved wrong is a runbook-documentation failure, not a responder failure. Blame-seeking postmortems destroy the reporting incentive that surfaces incidents early.
On-Call Readiness Checklist
| Checklist item | Status | Verification |
|---|---|---|
| On-call rotation covers all 168 hours/week | ✓/✗ | PagerDuty/Opsgenie schedule published |
| Primary + secondary on-call per rotation | ✓/✗ | Escalation path tested quarterly |
| Runbook links accessible without VPN | ✓/✗ | Runbook portal tested from mobile |
| Severity-assignment decision tree posted | ✓/✗ | Published in runbook portal |
| Legal + DPO on-call contacts current | ✓/✗ | Quarterly review |
| Regulator-notification forms pre-drafted | ✓/✗ | Templates in runbook with Legal-approved placeholders |
| Communication templates pre-drafted | ✓/✗ | Comms-approved; in runbook |
| War-room setup automation | ✓/✗ | One-click war-room provisioning |
| Incident drill cadence | ✓/✗ | Quarterly tabletop; annual live drill per class |
| Postmortem template available | ✓/✗ | In runbook portal |
| Action-item tracking tool integrated | ✓/✗ | Jira/Linear project with postmortem-action-item label |
The drill discipline: Runbooks that have never been exercised under drill conditions routinely fail under real-incident conditions. Quarterly tabletop drills catch the procedural gaps that only manifest under pressure.
Anti-Patterns
| Anti-pattern | Why teams do it | Why it fails | Correct pattern |
|---|---|---|---|
| No predefined severity tree | Avoid “over-classifying” | Initial classification becomes ad-hoc; SLA clock starts late | Severity decision tree with plausible-worst-case rule |
| Containment destroys evidence | Operational focus on stopping harm | Investigation and postmortem blocked; root cause undetermined | Evidence-preservation snapshot before containment actions |
| Regulator-notification delayed for “complete info” | Desire to submit clean notification | Misses 72-hour deadline; additional violation | Initial notification with known facts + commitment to follow-up filings |
| Postmortem blames individuals | Cultural reflex | Destroys reporting incentive; future incidents surfaced later | Blameless structure; focus on processes and systems |
| Customer notification delayed to Legal review | Risk-aversion | Breach-notification-SLA violation compounds incident | Pre-approved templates; Legal pre-clears classes and provides on-call counsel |
| No drills | Production pressure | Runbook gaps only surface under real incidents | Quarterly tabletop drills per incident class |
| “It’s just a test” severity downgrade | Avoid noise | Red-team findings that reflect real exploit surface get ignored | Red-team findings classified as SEV-3 minimum; upgrade if exploit-in-wild |
Honest Limitations
- Runbook quality decays without exercise. A runbook written two years ago and never drilled retains perhaps 40-60% of its usefulness when a real incident fires. Budget quarterly drill time.
- Severity classification is often ambiguous at detection. The “plausible worst case” rule mitigates but does not eliminate the judgment call. Post-incident reviews should include “was severity correctly classified” as a standing question.
- Regulatory notification law changes faster than runbooks update. EU AI Act provisions (Article 73 serious-incident reporting) phase in through 2026-2027. US state privacy laws change quarterly. Legal review of the notification matrix every 90 days is required.
- Containment actions destroy evidence. Even with discipline, the operational pressure to stop harm routinely destroys investigation artifacts. Build evidence-preservation automation into the containment tools, not just into the runbook.
- Cross-jurisdiction notification is complex. A single incident affecting users in EU + California + New York triggers GDPR + CCPA + SHIELD notifications with different deadlines, different content requirements, different recipient authorities. The matrix assumes jurisdictional awareness that many engineering-led incident commanders lack.
- AI-specific incident classes evolve faster than traditional security classes. Prompt injection patterns in 2026 do not match 2024 patterns. Runbooks that hardcode attack patterns decay; runbooks that abstract to detection-action patterns scale.
- Postmortem action items close at 40-60% rate in most organizations. The most common failure after incident-response is the organizational failure to follow through on the structural changes identified in postmortem. Action-item tracking with accountability is non-negotiable.
- Blameless postmortems require cultural commitment that leadership must model. An organization whose leadership blames individuals in post-incident meetings will not have blameless postmortems regardless of documented process.
The Incident-Ready Production AI System
An AI system is incident-ready when:
- Detection layers cover all 8 incident classes with defined alert thresholds.
- Severity decision tree is published and drilled.
- Containment actions per class are pre-scripted and tested.
- Regulatory notification templates are Legal-approved and accessible under pressure.
- Communication templates (internal + customer + public) are pre-drafted.
- Postmortem template is standardized with blameless framing.
- Quarterly tabletop drill per class is scheduled.
- On-call rotation includes Legal + DPO + Security reachable within SLA.
- Action-item tracking from prior postmortems is current.
The goal is not to prevent incidents (impossible). The goal is to ensure that when an incident fires, the response is predictable, compliant, proportionate, and produces systemic improvement rather than organizational trauma.