AI Safety Incident Response Runbook — Incident Classification Matrix Across Prompt-Injection + Data-Exfiltration + Harmful-Hallucination + Bias + Jailbreak + PII-Leak + Model-Evasion, Severity SLAs, Detect-Contain-Eradicate-Recover-Postmortem Playbook, GDPR Article 33 Notification Paths
This AI safety incident response runbook spans an incident classification matrix (prompt injection, data exfiltration, harmful hallucination, bias manifestation, jailbreak, PII leak, model evasion, and dependency compromise), severity levels SEV-1 through SEV-4 with SLA targets per severity, a detect-contain-eradicate-recover-postmortem playbook per incident class, regulator-notification paths (the GDPR Article 33 72-hour clock, EU AI Act reporting, FTC Section 5, and state attorney-general notification matrices), communication templates for internal, customer, regulator, and public audiences, a postmortem template with action-item tracking, and an on-call readiness checklist. Together these operationalize the difference between AI safety as documentation and AI safety as incident response.
Your Production LLM Just Returned a User’s Private Document Content in Response to an Unrelated User’s Query, a Customer Has Screenshots, Legal Is Asking About GDPR Article 33 Notification Obligations, Your CEO Wants to Know What to Say Publicly, and the Engineering Team Has Not Yet Determined Whether This Is a Prompt-Injection Attack, a Retrieval-Misattribution Bug, a Training-Data Memorization Leak, or a Cache-Key Collision — You Have 72 Hours
AI safety incidents are not the failure mode of model misbehavior — they are the failure mode of the incident response process that activates when model misbehavior manifests in production. The model will misbehave; that is a statistical certainty. What separates the organizations that survive AI incidents from the ones that compound them into lawsuits, fines, and public-trust collapse is the response runbook — the predefined classification, severity, containment, notification, and postmortem workflow that activates under pressure. This runbook covers the eight incident classes that matter, severity definitions with SLA targets, the detect→contain→eradicate→recover→postmortem playbook per class, regulator-notification paths, communication templates, and the postmortem structure that converts incidents into systematic improvement rather than blame assignment.
Incident Classification Matrix
Eight incident classes cover the operational AI-safety failure surface:
| Class | Definition | Detection signal | Typical discovery path | Representative example |
|---|---|---|---|---|
| Prompt injection | Attacker-crafted input hijacks system-prompt intent | Unusual output patterns; system-prompt leakage in responses; tool-call anomalies | User report, red-team probe, automated prompt-injection classifier | User input “ignore previous instructions” causes model to reveal system prompt |
| Data exfiltration | Sensitive data in training/fine-tuning/RAG corpus leaks through model output | PII appearing in outputs; specific document fragments matching known-sensitive docs | DLP pipeline alert, user complaint, regulator inquiry | Customer B’s account details appear in Customer A’s response |
| Harmful hallucination | Confident generation of false content causing user harm | High-confidence answers contradicted by ground truth; user-reported harm | User complaint, safety-review sampling, external investigation | Medical symptom-checker advises delaying ER visit for condition requiring immediate care |
| Bias manifestation | Systematic differential quality/outcomes across protected attributes | Demographic-parity metric divergence; fairness audit findings | Bias-monitoring pipeline, fairness audit, disparate user complaints | Hiring-assistant consistently scores one demographic lower with equal qualifications |
| Jailbreak exploit | Attacker bypasses content-safety filters via evasion technique | Content-filter hit rate change; output violating documented usage policies | Safety-filter monitoring, red-team, public jailbreak disclosure | Role-play prompt enables content generation that directly-asked form would refuse |
| PII leak | Personal data exposed through model output or training | PII-detection regex/classifier hit in outputs; memorization probing | DLP alert, subject-access request revealing retained PII | Model completes a partial name + address from memorized training data |
| Model evasion | Adversarial inputs cause misclassification (classification models) | Confidence-distribution anomalies; adversarial-input detection | Adversarial monitoring, abuse-pattern reports | Safety classifier consistently passes a class of disguised harmful content |
| Dependency compromise | Upstream model, dataset, library, or API compromised | Supply-chain alert; upstream-provider disclosure; unexpected behavior shift | Security-advisory feed, provider notice, behavioral regression | Embedding-provider had data-poisoning incident; downstream retrieval is affected |
The misclassification reality: An incident is often initially ambiguous between classes. A data-exfiltration event may look like prompt injection until investigation reveals the output matched training-data memorization rather than attacker-crafted extraction. Treat initial classification as provisional; the runbook advances on the most severe plausible classification and downgrades only after investigation confirms a narrower scope.
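The provisional-classification rule can be sketched as a small helper. This is a minimal sketch; the ordering of classes by assumed worst-case severity is an illustrative assumption, not part of the runbook, and should be ranked against your own risk model:

```python
from enum import IntEnum

class IncidentClass(IntEnum):
    # Ordered by assumed worst-case severity (illustrative ranking only;
    # adjust the ordering to your own risk model).
    DEPENDENCY_COMPROMISE = 1
    MODEL_EVASION = 2
    BIAS_MANIFESTATION = 3
    JAILBREAK = 4
    PROMPT_INJECTION = 5
    HARMFUL_HALLUCINATION = 6
    PII_LEAK = 7
    DATA_EXFILTRATION = 8

def working_classification(plausible: set) -> IncidentClass:
    """Advance the runbook on the most severe plausible class;
    downgrade only after investigation eliminates worse hypotheses."""
    if not plausible:
        raise ValueError("at least one plausible class is required")
    return max(plausible)
```

Keeping the full candidate set, rather than committing to one class at detection, is what lets the downgrade happen as evidence, not as opinion.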
Severity Levels With SLA Targets
| Severity | Definition | Examples | Response SLA | Escalation path |
|---|---|---|---|---|
| SEV-1 | Active harm to users, regulatory exposure, or public safety risk; active exploit in progress | Active data-exfiltration; harmful medical/legal advice being served; wide-impact bias affecting protected groups | Detect-to-containment: ≤30 minutes. Detect-to-notification: ≤2 hours internal, ≤72 hours regulator | Immediate: CTO, CEO, General Counsel, DPO, security on-call |
| SEV-2 | Contained harm or high-likelihood exploit; single-user impact with regulatory implications | Single-user PII leak; isolated jailbreak with limited blast radius; contained bias finding | Detect-to-containment: ≤2 hours. Detect-to-notification: ≤24 hours internal | Engineering director, Security, Legal, Product |
| SEV-3 | Elevated risk without active harm; degraded safety posture | Safety-filter degradation without confirmed harmful output; model-evasion finding in audit; dependency-vulnerability with no confirmed exploit | Detect-to-remediation: ≤5 business days | Engineering manager, Security |
| SEV-4 | Safety-hygiene issue; documentation or process gap | Documentation-only drift; audit-log completeness gap; expired credential for monitoring tool | Remediation: next sprint | Engineering team |
SLA discipline: Severity is assigned at detection based on the plausible worst case, not the confirmed scope. An ambiguous incident that could plausibly be SEV-1 is treated as SEV-1 until investigation downgrades it. The operational cost of a false SEV-1 is much smaller than the regulatory cost of a misclassified SEV-1 that was handled as SEV-3.
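The plausible-worst-case rule is mechanical enough to encode. A minimal sketch, assuming severities are represented as integers (1 = SEV-1, most severe) and using the containment SLA targets from the table above:

```python
from dataclasses import dataclass

# Containment SLA targets in minutes, from the severity table above.
# SEV-3 and SEV-4 define remediation windows rather than containment SLAs.
CONTAINMENT_SLA_MIN = {1: 30, 2: 120}

@dataclass
class SeverityCall:
    """Candidate severities: 1 = SEV-1 (most severe) through 4 = SEV-4."""
    plausible_severities: list

    def working_severity(self) -> int:
        # Plausible-worst-case rule: run the incident at the most severe
        # candidate until investigation confirms a downgrade.
        return min(self.plausible_severities)
```

The point of encoding it is that the SLA clock keys off `working_severity()`, so an ambiguous SEV-1/SEV-3 incident starts its 30-minute containment clock immediately.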
Detect → Contain → Eradicate → Recover → Postmortem
The five-phase incident response process applies to every class with class-specific variations.
Phase 1 — Detect
| Detection layer | Signal | Typical false-positive rate |
|---|---|---|
| Automated classifiers | Real-time scoring of inputs and outputs against safety-violation classifiers | Moderate (10-25%) — tune per-class |
| DLP pipelines | Regex + classifier for PII, sensitive-document fragments, API keys in outputs | Low to moderate (5-15%) |
| Heuristic alerts | Unusual output patterns (length outliers, content-filter dodges, unusual-token distributions) | Moderate-to-high (20-40%) — triage filter required |
| User reports | Customer support tickets flagged for safety review | Variable — quality depends on user composition |
| External disclosure | Researcher/journalist/competitor/regulator notifying of finding | Typically high severity — assume serious until investigated |
| Red-team discovery | Internal adversarial testing surfacing issue | Prevention-phase finding — treat as SEV-3 unless exploit-in-wild |
| Canary queries | Known-vulnerable test queries run on schedule | Low false-positive; signals regression |
Detection-to-alert latency target: < 5 minutes for automated layers; < 15 minutes for triaged user reports.
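The DLP layer's output scan can be sketched with regex-only detection. The patterns below are illustrative assumptions; a production DLP layer combines regexes with trained classifiers and fingerprints of known-sensitive documents, which is why the table lists a 5-15% false-positive rate rather than zero:

```python
import re

# Illustrative patterns only; tune and extend per deployment.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)_[A-Za-z0-9]{20,}\b"),
}

def scan_output(text: str) -> list:
    """Return the PII categories detected in a model output."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
```

A hit on any category routes the response for blocking or review before it reaches the user, and emits the detection signal that starts the SLA clock.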
Phase 2 — Contain
Containment stops the bleeding. Per-class containment actions:
| Incident class | First containment action | Fallback containment | Reversibility |
|---|---|---|---|
| Prompt injection | Deploy input-classifier in block mode; disable affected tool-use path | Disable feature entirely | Reversible |
| Data exfiltration | Block affected query class; purge cache; isolate affected tenant | Take feature offline | Reversible |
| Harmful hallucination | Add safety-filter override on topic; route to safe-fallback response | Disable feature for affected user segment | Reversible |
| Bias manifestation | Enable differential-outcome monitoring; pause deploy pipeline | Revert to prior-known-fair version | Reversible |
| Jailbreak exploit | Add jailbreak pattern to filter; enable strict-mode filtering | Disable content-sensitive features | Reversible |
| PII leak | Block affected output pattern; purge caches containing PII | Take endpoint offline | Partially reversible (PII already leaked is unrecoverable) |
| Model evasion | Tighten classifier threshold; route to stronger verification | Disable automated decision-making on affected class | Reversible |
| Dependency compromise | Switch to backup provider if available; disable dependent features | Take feature offline | Reversible |
Containment-first discipline: Preserve evidence during containment. Snapshot affected caches, logs, and model versions before modifying them. Post-incident investigation requires evidence; containment actions routinely destroy it when runbooks don’t explicitly require preservation.
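The evidence-preservation discipline is easiest to enforce when the containment tooling does the snapshot itself. A minimal sketch, where the snapshot root and manifest format are illustrative assumptions:

```python
import json
import shutil
import time
from pathlib import Path

def preserve_then_contain(incident_id, evidence_paths, contain_action,
                          snapshot_root=Path("/var/incident-evidence")):
    """Snapshot evidence (logs, cache dumps, model-version manifests)
    BEFORE running the containment action that may destroy it."""
    snap = snapshot_root / f"{incident_id}-{int(time.time())}"
    snap.mkdir(parents=True, exist_ok=True)
    manifest = []
    for p in evidence_paths:
        dest = snap / p.name
        shutil.copy2(p, dest)  # copy2 preserves mtimes for the timeline
        manifest.append({"source": str(p), "snapshot": str(dest)})
    (snap / "manifest.json").write_text(json.dumps(manifest, indent=2))
    contain_action()  # only now is it safe to purge caches / flip flags
    return snap
```

Because containment is wrapped rather than documented, a responder cannot purge a cache without first having snapshotted it, which is the failure mode the runbook paragraph above warns about.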
Phase 3 — Eradicate
Eradication addresses the root cause. Per-class eradication:
| Incident class | Root-cause investigation | Eradication action |
|---|---|---|
| Prompt injection | Attack-pattern analysis, system-prompt-design review | Redesign system-prompt isolation; tool-use sandboxing; output-validation gate |
| Data exfiltration | Corpus audit, access-control review, embedding-similarity leak analysis | Remove sensitive content from corpus; retrain embeddings; tenant isolation enforcement |
| Harmful hallucination | Training-data/RAG-corpus factuality audit for the affected topic | Add authoritative-source retrieval for topic; uncertainty calibration; refusal-on-unknown tuning |
| Bias manifestation | Fairness audit, training-data representation analysis, model-card review | Debias mitigation (data + post-processing + model), monitoring baseline reset |
| Jailbreak exploit | Attack-surface mapping, safety-filter architecture review | Multi-layer safety architecture; adversarial-training fine-tune; monitored outputs |
| PII leak | Training-data PII audit, memorization probing, RAG-corpus PII scan | Corpus-level PII redaction; differential-privacy retraining if systematic; RAG-filter layer |
| Model evasion | Adversarial-input analysis, classifier-threshold calibration | Robust-classifier fine-tune; input-preprocessing normalization |
| Dependency compromise | Supply-chain audit, provider-postmortem review | Switch provider, implement defense-in-depth for primary dependencies |
Eradication-versus-containment distinction: Containment can be complete while eradication is in progress. Do not declare the incident closed until eradication is verified — containment masks the issue; eradication removes the cause of recurrence.
Phase 4 — Recover
Recovery restores service. The recovery checklist:
| Step | Verification | Owner |
|---|---|---|
| Eradication-verified | Test suite includes regression for this incident class | Engineering |
| Monitoring-strengthened | Detection-layer added or tuned for this incident class | SRE / MLOps |
| Containment-removed | Block rules / feature-flags removed after verification | Engineering |
| Canary-verified | Canary traffic served for verification window without recurrence | SRE / MLOps |
| Customer-notified | Affected customers notified per severity requirements | Customer Success + Legal |
| Public-communicated | Public statement if required (SEV-1 / regulatory obligation) | Comms + Legal |
| Regulatory-notified | GDPR / EU AI Act / FTC / state AG notifications per applicable jurisdictions | DPO + Legal |
Premature-recovery cost: The cost of an incident that reopens because recovery was declared before eradication verification is 3-5× the cost of holding containment for another 4-24 hours. Verification windows exist to catch this.
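The verification window can be enforced in code rather than by calendar discipline. A minimal sketch, with the window expressed in seconds for testability; the 4-24 hour length is a policy choice per the note above:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CanaryWindow:
    """Hold containment removal until a full window passes with no recurrence."""
    window_s: float
    started_at: float = field(default_factory=time.monotonic)
    recurrences: int = 0

    def record_recurrence(self) -> None:
        # A recurrence during the window resets the clock; the count
        # survives for the postmortem timeline.
        self.recurrences += 1
        self.started_at = time.monotonic()

    def safe_to_recover(self) -> bool:
        """True once a full window has elapsed since the last recurrence."""
        return time.monotonic() - self.started_at >= self.window_s
```

Gating the containment-removed step on `safe_to_recover()` is what prevents the premature-recovery reopen described above.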
Phase 5 — Postmortem
Postmortem converts the incident into organizational learning. See postmortem template below.
Regulator-Notification Paths
Regulatory notification obligations are jurisdiction-dependent and run on tight deadlines. The notification matrix:
| Regulation | Trigger | Deadline | Content requirements | Who notifies |
|---|---|---|---|---|
| GDPR Article 33 | Personal-data breach likely to risk data-subject rights | 72 hours from awareness to supervisory authority | Nature, categories/approximate numbers, likely consequences, measures taken | DPO / Legal to lead supervisory authority |
| GDPR Article 34 | Personal-data breach with high-risk to data subjects | Without undue delay to affected individuals | Plain-language description, contact point, likely consequences, measures taken | DPO / Legal / Customer Success |
| EU AI Act (Art. 73) | Serious incident involving high-risk AI system | 15 days from awareness (2 days for widespread infringement or critical-infrastructure disruption; 10 days in the event of a death) to market surveillance authority | Incident description, high-risk system classification, corrective measures | Legal + Technical lead |
| FTC (Section 5) | Deceptive or unfair practice; substantial injury | No fixed deadline; disclosure in reports/filings | Varies by context; may trigger separately from breach | Legal |
| State AG (US state-by-state) | Varies — California CCPA, NY SHIELD, Virginia VCDPA, etc. | 30-45 days typical; some (MA) immediate | State-specific templates | Legal |
| HIPAA (if applicable) | Unsecured PHI breach | 60 days to HHS + affected individuals; annual for < 500 | HHS OCR format | Compliance + Legal |
| Sector-specific (SEC, DoD, healthcare providers) | Varies | Varies | Varies | Compliance + Legal |
The 72-hour clock reality: GDPR Article 33’s 72-hour clock starts at “awareness,” not at “investigation complete.” The clock runs in calendar hours, so an incident discovered Friday evening has a Monday-evening notification deadline. The runbook must operationalize the clock, including out-of-hours coverage. The supervisory-authority filing does not require complete information — it requires the information known at the time plus a commitment to follow-up filings as investigation progresses. Teams that delay notification waiting for “complete information” routinely miss the deadline.
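The clock arithmetic is trivial but worth making explicit, because the weekend intuition fails under pressure. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

GDPR_ART33_WINDOW = timedelta(hours=72)

def notification_deadline(awareness_utc: datetime) -> datetime:
    """Article 33 clock runs from awareness, in calendar hours;
    weekends and holidays do not pause it."""
    return awareness_utc + GDPR_ART33_WINDOW

# Awareness Friday 19:00 UTC -> deadline Monday 19:00 UTC.
aware = datetime(2025, 6, 6, 19, 0, tzinfo=timezone.utc)  # a Friday
deadline = notification_deadline(aware)
```

Wiring this into the war-room automation, so the deadline is displayed from the moment of SEV-1 declaration, removes the most common delay: nobody computing the deadline at all.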
Communication Templates
Incident communication requires pre-drafted templates that Legal and Comms have approved in advance. Drafting or heavily editing them under incident pressure produces legal mistakes.
Internal — Initial Notification
INCIDENT: SEV-{n} — {class} — {one-line-summary}
Detected: {timestamp}
Affected scope: {what we know now}
Containment status: {in-progress | contained}
Incident commander: {name}
Comms lead: {name}
Legal lead: {name}
Next update: {time}
War-room: {link}
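The internal template above can be rendered programmatically so a notification never goes out with a blank field. A minimal sketch; the placeholder names are adapted to valid Python identifiers and are otherwise illustrative:

```python
INTERNAL_TEMPLATE = """\
INCIDENT: SEV-{sev} — {cls} — {summary}
Detected: {detected}
Affected scope: {scope}
Containment status: {containment}
Incident commander: {ic}
Comms lead: {comms}
Legal lead: {legal}
Next update: {next_update}
War-room: {war_room}"""

def render_internal(**fields) -> str:
    # str.format raises KeyError on a missing field, so the notification
    # fails loudly instead of shipping with a blank incident commander.
    return INTERNAL_TEMPLATE.format(**fields)
```

Failing loudly on a missing field is deliberate: an internal notification without an incident commander or a next-update time is worse than a notification delayed by thirty seconds.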
Customer — Affected Notification
Subject: Important information about your {product} account
We are writing to inform you of a recent incident affecting your {product} account.
What happened: {plain-language description}
When: {time window}
What information was involved: {categories}
What we are doing: {containment + eradication summary}
What you can do: {user actions, e.g., review activity, reset credentials}
Contact: {dedicated contact path}
We regret this incident and are taking the following measures to prevent recurrence: {summary}.
Regulator — GDPR Article 33
Follow the supervisory authority’s published form. Typical content:
- Nature of the breach (class, vector)
- Categories and approximate number of data subjects
- Categories and approximate number of personal-data records
- Likely consequences
- Measures taken or proposed
- Contact details of DPO
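The typical-content list above can be carried as a structured payload so the initial filing and the follow-up filings stay consistent. A minimal sketch; the field names are illustrative, and the supervisory authority's published form governs the actual submission:

```python
from dataclasses import dataclass, asdict, field

@dataclass
class Article33Notification:
    """Fields mirror the typical-content list above (illustrative names)."""
    breach_nature: str                 # incident class + vector
    data_subject_categories: list
    approx_data_subjects: int
    record_categories: list
    approx_records: int
    likely_consequences: str
    measures_taken: str
    dpo_contact: str
    followup_expected: bool = True     # initial filing with known facts

def to_filing(n: Article33Notification) -> dict:
    """Serialize for the authority's form or portal submission."""
    return asdict(n)
```

Keeping `followup_expected` defaulted to `True` encodes the guidance above: file with the facts known at the time and commit to follow-up filings, rather than waiting for complete information.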
Public — Statement
Draft with Legal + Comms in advance for every severity-plausible incident class. Post-incident edits under deadline routinely produce statements that damage the organization further.
Postmortem Template
The postmortem is the artifact that converts the incident into organizational learning.
# {Incident ID} — {Class} — {Severity}
## Summary
One-paragraph plain-language description of what happened.
## Timeline
| Time (UTC) | Event | Source |
|------------|-------|--------|
| ... | ... | ... |
## Impact
- Users affected: {count}
- Data affected: {categories, counts}
- Duration: {start} to {containment} to {recovery}
- Regulatory notifications triggered: {list}
- Customer notifications: {count sent, response}
## Root cause
Technical root cause — not "user error" or "model misbehavior" without further specificity. Decompose to the process or system gap that allowed the incident.
## Contributing factors
- Factor 1
- Factor 2
- ...
## What went well
- Detection path worked
- Containment within SLA
- ...
## What went poorly
- Specific failures — blameless framing
- Specific delays
- ...
## Lessons learned
- Categorized: process, tooling, training, architecture
- Each lesson maps to an action item below
## Action items
| ID | Description | Owner | Due date | Status |
|----|-------------|-------|----------|--------|
| AI-1 | ... | ... | ... | Open |
## Near-miss sibling
Was there a near-miss that should have prevented this? Document to inform near-miss escalation.
## Detection improvement
What detection signal would have caught this earlier? Wire it in as an action item.
## Signoffs
- Incident commander: {name, date}
- Engineering director: {name, date}
- Security lead: {name, date}
- Legal: {name, date}
- DPO (if applicable): {name, date}
The blameless discipline: Postmortems blame processes, systems, and decisions — not individuals. An incident where the on-call responder made a reasonable call that proved wrong is a runbook-documentation failure, not a responder failure. Blame-seeking postmortems destroy the reporting incentive that surfaces incidents early.
On-Call Readiness Checklist
| Checklist item | Status | Verification |
|---|---|---|
| On-call rotation covers all 168 hours/week | ✓/✗ | PagerDuty/Opsgenie schedule published |
| Primary + secondary on-call per rotation | ✓/✗ | Escalation path tested quarterly |
| Runbook links accessible without VPN | ✓/✗ | Runbook portal tested from mobile |
| Severity-assignment decision tree posted | ✓/✗ | Published in runbook portal |
| Legal + DPO on-call contacts current | ✓/✗ | Quarterly review |
| Regulator-notification forms pre-drafted | ✓/✗ | Templates in runbook with Legal-approved placeholders |
| Communication templates pre-drafted | ✓/✗ | Comms-approved; in runbook |
| War-room setup automation | ✓/✗ | One-click war-room provisioning |
| Incident drill cadence | ✓/✗ | Quarterly tabletop; annual live drill per class |
| Postmortem template available | ✓/✗ | In runbook portal |
| Action-item tracking tool integrated | ✓/✗ | Jira/Linear project with postmortem-action-item label |
The drill discipline: Runbooks that have never been exercised under drill conditions routinely fail under real-incident conditions. Quarterly tabletop drills catch the procedural gaps that only manifest under pressure.
Anti-Patterns
| Anti-pattern | Why teams do it | Why it fails | Correct pattern |
|---|---|---|---|
| No predefined severity tree | Avoid “over-classifying” | Initial classification becomes ad-hoc; SLA clock starts late | Severity decision tree with plausible-worst-case rule |
| Containment destroys evidence | Operational focus on stopping harm | Investigation and postmortem blocked; root cause undetermined | Evidence-preservation snapshot before containment actions |
| Regulator-notification delayed for “complete info” | Desire to submit clean notification | Misses 72-hour deadline; additional violation | Initial notification with known facts + commitment to follow-up filings |
| Postmortem blames individuals | Cultural reflex | Destroys reporting incentive; future incidents surfaced later | Blameless structure; focus on processes and systems |
| Customer notification delayed to Legal review | Risk-aversion | Breach-notification-SLA violation compounds incident | Pre-approved templates; Legal pre-clears classes and provides on-call counsel |
| No drills | Production pressure | Runbook gaps only surface under real incidents | Quarterly tabletop drills per incident class |
| “It’s just a test” severity downgrade | Avoid noise | Red-team findings that reflect real exploit surface get ignored | Red-team findings classified as SEV-3 minimum; upgrade if exploit-in-wild |
Honest Limitations
- Runbook quality decays without exercise. A runbook written two years ago and never drilled retains perhaps 40-60% of its usefulness when a real incident fires. Budget quarterly drill time.
- Severity classification is often ambiguous at detection. The “plausible worst case” rule mitigates but does not eliminate the judgment call. Post-incident reviews should include “was severity correctly classified” as a standing question.
- Regulatory notification law changes faster than runbooks update. EU AI Act provisions (Article 73 serious-incident reporting) phase in through 2026-2027. US state privacy laws change quarterly. Legal review of the notification matrix every 90 days is required.
- Containment actions destroy evidence. Even with discipline, the operational pressure to stop harm routinely destroys investigation artifacts. Build evidence-preservation automation into the containment tools, not just into the runbook.
- Cross-jurisdiction notification is complex. A single incident affecting users in EU + California + New York triggers GDPR + CCPA + SHIELD notifications with different deadlines, different content requirements, different recipient authorities. The matrix assumes jurisdictional awareness that many engineering-led incident commanders lack.
- AI-specific incident classes evolve faster than traditional security classes. Prompt injection patterns in 2026 do not match 2024 patterns. Runbooks that hardcode attack patterns decay; runbooks that abstract to detection-action patterns scale.
- Postmortem action items close at 40-60% rate in most organizations. The most common failure after incident-response is the organizational failure to follow through on the structural changes identified in postmortem. Action-item tracking with accountability is non-negotiable.
- Blameless postmortems require cultural commitment that leadership must model. An organization whose leadership blames individuals in post-incident meetings will not have blameless postmortems regardless of documented process.
The Incident-Ready Production AI System
An AI system is incident-ready when:
- Detection layers cover all 8 incident classes with defined alert thresholds.
- Severity decision tree is published and drilled.
- Containment actions per class are pre-scripted and tested.
- Regulatory notification templates are Legal-approved and accessible under pressure.
- Communication templates (internal + customer + public) are pre-drafted.
- Postmortem template is standardized with blameless framing.
- Quarterly tabletop drill per class is scheduled.
- On-call rotation includes Legal + DPO + Security reachable within SLA.
- Action-item tracking from prior postmortems is current.
The goal is not to prevent incidents (impossible). The goal is to ensure that when an incident fires, the response is predictable, compliant, proportionate, and produces systemic improvement rather than organizational trauma.