AI Observability in Production — What to Measure, When to Alert, and What to Ignore
Dashboard metric priority matrix covering alert vs log vs ignore decisions, LLM monitoring platform comparison, drift detection methodology, and cost-effective observability architecture.
Your AI Dashboard Has 47 Metrics and Zero Actionable Alerts — You’re Monitoring Everything and Observing Nothing
Traditional software monitoring maps neatly to AI systems for the infrastructure layer — latency, error rates, and uptime behave the same. But the quality layer is entirely new: model output degrades without any error code, user satisfaction drops without any server metric changing, and costs spike without any traffic increase. AI observability is the discipline of detecting these invisible failures. The problem is that most teams either monitor nothing (hoping benchmarks hold in production) or monitor everything (drowning in metrics that never trigger action). This guide provides the metric priority matrix that separates signal from noise, the platform comparison for LLM monitoring tools, and the architecture that detects quality drift before users complain.
The AI Observability Stack
AI observability has three layers, each catching different failure modes:
| Layer | What it monitors | Failure modes caught | Traditional equivalent |
|---|---|---|---|
| Infrastructure | Latency, throughput, errors, availability | API outages, rate limits, timeout spikes | Standard APM (Datadog, New Relic) |
| Quality | Output correctness, faithfulness, relevance, safety | Model degradation, prompt regression, drift | No traditional equivalent |
| Business | User satisfaction, task completion, cost per outcome | Silent quality failures, UX degradation | Product analytics (Amplitude, Mixpanel) |
Key insight: Infrastructure monitoring is necessary but insufficient. An AI system can have 99.99% uptime, sub-second latency, and zero errors — while producing increasingly wrong answers. The quality layer is what makes AI observability different from standard APM.
Metric Priority Matrix
Every metric falls into one of three categories: alert (wake someone up), log (investigate when other signals fire), or ignore (nice to have, never actionable).
Infrastructure Metrics
| Metric | Priority | Alert threshold | Why |
|---|---|---|---|
| Error rate (4xx/5xx) | ALERT | >1% sustained 5 min | Direct user impact; provider outage signal |
| Latency p50 | LOG | Track trend | Baseline for comparison; rarely actionable alone |
| Latency p95 | ALERT | >3x baseline for 5 min | User experience degradation for tail requests |
| Latency p99 | LOG | Track for capacity planning | Edge cases; usually not actionable |
| Rate limit hits (429) | ALERT | >5% of requests | Indicates capacity ceiling; users being throttled |
| Token throughput | LOG | Track trend | Cost correlation; rarely actionable alone |
| Availability (provider) | ALERT | Any downtime >30s | Trigger failover to backup model |
| Request queue depth | ALERT | >100 pending for 2 min | Backpressure building; may need scaling |
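The "sustained for N minutes" condition attached to the alert thresholds above is what separates an actionable page from a flap on a transient spike. A minimal sketch of that check in Python (class and parameter names are illustrative, not from any particular monitoring library):

```python
from collections import deque
import time

class SustainedRateAlert:
    """Fire only when the error rate stays above threshold for a full window."""

    def __init__(self, threshold=0.01, window_seconds=300):
        self.threshold = threshold          # e.g. 1% error rate
        self.window_seconds = window_seconds  # e.g. sustained for 5 minutes
        self.events = deque()               # (timestamp, is_error) pairs
        self.breach_started = None

    def record(self, is_error, now=None):
        """Record one request outcome; return True if the alert should fire."""
        now = now if now is not None else time.time()
        self.events.append((now, is_error))
        # Drop events that have aged out of the sliding window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        rate = errors / len(self.events)
        if rate > self.threshold:
            if self.breach_started is None:
                self.breach_started = now
            # Fire only once the breach has persisted for the whole window.
            return now - self.breach_started >= self.window_seconds
        self.breach_started = None
        return False
```

A single failing request starts the breach clock but never fires on its own; the alert fires only if every window check since then has also been over threshold.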
Quality Metrics
| Metric | Priority | Alert threshold | Collection method |
|---|---|---|---|
| User thumbs down rate | ALERT | >5% over 1-hour window | UI feedback button |
| Response regeneration rate | ALERT | >8% over 1-hour window | Track regenerate clicks |
| Hallucination rate (sampled) | ALERT | >10% on sampled responses | LLM-as-judge on 5-10% sample |
| Faithfulness score (RAG) | ALERT | <0.85 on sampled responses | RAGAS/custom metric on sample |
| Output format compliance | ALERT | <95% (structured output tasks) | Schema validation on every response |
| Safety filter triggers | LOG | Track trend; alert on 2x spike | Content filter logging |
| Output length distribution | LOG | Alert if mean shifts >20% | Token counting per response |
| Topic classification (output) | LOG | Alert on off-topic spike >3% | Lightweight classifier on output |
| Embedding drift | LOG | Monitor weekly | Compare embedding distribution of current vs baseline queries |
| Benchmark regression | LOG | Check weekly against eval suite | Automated eval pipeline |
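Format compliance is the one quality metric in the table cheap enough to validate on every response rather than a sample. A minimal sketch using only the standard library — a simplified stand-in for full JSON Schema validation, with hypothetical key names:

```python
import json

def format_compliant(response_text, required_keys):
    """True if the response parses as a JSON object containing every required key."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in required_keys)

def compliance_rate(responses, required_keys):
    """Fraction of responses that satisfy the expected structure."""
    if not responses:
        return 1.0
    ok = sum(format_compliant(r, required_keys) for r in responses)
    return ok / len(responses)
```

Feeding `compliance_rate` into an hourly check against the 95% threshold gives an early signal for provider model updates that silently change output formatting.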
Business Metrics
| Metric | Priority | Alert threshold | Collection method |
|---|---|---|---|
| Task completion rate | ALERT | <80% (or 10% drop from baseline) | End-to-end task tracking |
| Session abandonment | ALERT | >40% single-turn sessions | Session analytics |
| Cost per completed task | LOG | >2x baseline | Cost tracking + task completion join |
| Time to task completion | LOG | Track trend | Session timing |
| Feature adoption (AI vs non-AI path) | LOG | Track trend | A/B or feature flag analytics |
| Daily active users (AI feature) | LOG | Alert on >20% drop | Product analytics |
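Cost per completed task requires joining two data sources the table lists separately: per-request cost records and task-completion events. A hedged sketch of that join, keyed on session id (record and field names are illustrative assumptions, not a specific analytics schema):

```python
from collections import defaultdict

def cost_per_completed_task(requests, completed_sessions):
    """Join per-request cost records with completion events by session id.

    requests: iterable of dicts like {"session_id": ..., "cost_usd": ...}
    completed_sessions: set of session ids where the user finished the task.
    Returns (cost_per_completed_task, completion_rate).
    """
    cost_by_session = defaultdict(float)
    for req in requests:
        cost_by_session[req["session_id"]] += req["cost_usd"]
    total_cost = sum(cost_by_session.values())
    completed = sum(1 for s in cost_by_session if s in completed_sessions)
    if completed == 0:
        return float("inf"), 0.0
    # Total spend is divided by completions, so failed sessions raise the cost.
    return total_cost / completed, completed / len(cost_by_session)
```

Dividing total spend (not just the spend of successful sessions) by completions is the design choice that makes this metric catch silent quality failures: wasted requests in abandoned sessions inflate it.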
LLM Monitoring Platform Comparison
| Platform | Quality monitoring | Cost tracking | Trace logging | Eval integration | Pricing | Best for |
|---|---|---|---|---|---|---|
| Langfuse | Good (custom scores) | Excellent | Excellent | Good (custom evals) | Free (self-hosted), $59+/mo (cloud) | Open-source preference, full control |
| LangSmith | Good | Good | Excellent | Excellent (tight LangChain integration) | $39-400/mo | LangChain/LangGraph users |
| Braintrust | Excellent (built-in judges) | Good | Good | Excellent | $50-500/mo | Teams needing eval + monitoring |
| Helicone | Basic | Excellent | Good | Basic | Free tier, $50+/mo | Cost-focused monitoring |
| Arize Phoenix | Excellent (drift detection) | Good | Excellent | Good | Free (open-source) | ML teams familiar with ML observability |
| Datadog LLM Observability | Good | Good | Excellent | Basic | Part of Datadog pricing | Teams already using Datadog |
| Weights & Biases Weave | Good | Basic | Good | Excellent | Free tier, $50+/mo | ML teams with W&B workflow |
| OpenLLMetry | Basic (OpenTelemetry-based) | Good | Excellent | Basic | Free (open-source) | OpenTelemetry-native infrastructure |
Platform Selection Decision
| Your situation | Recommended platform | Why |
|---|---|---|
| Early stage, need basics fast | Langfuse (self-hosted) or Helicone | Free, quick setup, covers logging and cost tracking |
| Using LangChain | LangSmith | Native integration, deepest traces for chain/agent debugging |
| Need quality evaluation + monitoring | Braintrust | Best eval-to-monitoring pipeline; built-in LLM judges |
| Enterprise, existing Datadog | Datadog LLM Observability | Integrates with existing alerting, dashboards, and on-call |
| ML team, familiar with experiment tracking | Arize Phoenix or W&B Weave | Concepts (drift, embeddings, evals) map to ML workflow |
| OpenTelemetry-first architecture | OpenLLMetry | Standard OTEL spans for LLM calls; integrates with any OTEL backend |
Drift Detection — The Critical Quality Signal
AI systems degrade silently. Drift detection catches quality changes before users notice:
Types of Drift
| Drift type | What changes | Cause | Detection method | Detection latency |
|---|---|---|---|---|
| Input drift | User queries become different from training/eval distribution | Seasonal changes, new user segments, feature changes | Embedding distribution comparison | Days to weeks |
| Output drift | Model outputs change in length, format, or content patterns | Provider model updates, system prompt changes | Output feature monitoring (length, format compliance) | Hours to days |
| Quality drift | Answer correctness degrades on consistent inputs | Model regression, data staleness, retrieval degradation | Sampled evaluation on production traffic | Days to weeks |
| Cost drift | Per-request cost increases without traffic changes | Token count inflation, prompt bloat, model routing failure | Cost-per-request monitoring | Hours |
| Behavioral drift | User interaction patterns change | Poor responses → behavior change → metrics shift | Session analytics, abandonment rate | Days to weeks |
Drift Detection Methods
| Method | What it detects | Implementation complexity | Cost | Sensitivity |
|---|---|---|---|---|
| Statistical process control | Mean/variance shifts in numerical metrics | Low (standard statistics) | Negligible | Medium — detects large shifts |
| Embedding distribution monitoring | Input/output distribution shifts | Medium (requires embedding + comparison) | $50-200/mo | High — detects semantic shifts |
| Periodic evaluation | Quality regression on fixed eval set | Medium (scheduled eval pipeline) | $10-100/run (LLM-as-judge) | High — directly measures quality |
| A/B comparison | Difference between current and baseline versions | Medium (requires traffic splitting) | 2x LLM cost during A/B | Highest — controlled comparison |
| User signal correlation | Quality changes reflected in user behavior | Low (product analytics) | Negligible | Low — lagging indicator |
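The statistical process control row is the cheapest method to implement: it is the standard control-chart test applied to any numeric metric you already log (output length, cost per request, latency). A minimal sketch using only the standard library:

```python
import statistics

def mean_shift_detected(baseline, current, z_threshold=3.0):
    """Standard control-chart test: flag when the current window's mean sits
    more than z_threshold standard errors from the baseline mean.

    baseline: metric values from a known-good period (e.g. last week)
    current: metric values from the window being checked (e.g. last hour)
    """
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(current) != mu
    # Standard error of the current window's mean under the baseline distribution.
    standard_error = sigma / len(current) ** 0.5
    z_score = abs(statistics.mean(current) - mu) / standard_error
    return z_score > z_threshold
```

As the table notes, this only catches large shifts — a semantic drift that leaves output length and cost unchanged passes straight through, which is why embedding monitoring and sampled evaluation sit alongside it.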
The Drift Detection Pipeline
| Stage | Frequency | What it checks | Action if drift detected |
|---|---|---|---|
| Real-time | Every request | Format compliance, latency, errors | Immediate alert; failover if critical |
| Hourly | Aggregate hourly | Cost per request, output length distribution | Warning if >20% shift from 24h average |
| Daily | Sample 100-500 requests | LLM-as-judge quality scores | Alert if score drops >5% from baseline |
| Weekly | Full eval suite | Complete regression test on fixed dataset | Alert if any metric drops >3% from all-time baseline |
| Monthly | Embedding distribution | Input/output semantic shift analysis | Investigate; may indicate new user patterns (not always bad) |
Cost-Effective Observability Architecture
Full observability on every request is expensive. The cost-effective approach: monitor everything cheaply, evaluate selectively.
| Data tier | What to capture | Storage cost | Analysis cost | Retention |
|---|---|---|---|---|
| Tier 1: Metadata (every request) | Timestamp, model, latency, token count, status code, cost | $5-20/mo per 1M requests | Negligible | 90 days |
| Tier 2: Content (sampled, 10-20%) | Full prompt + response text | $50-200/mo per 1M requests | Moderate (storage-bound) | 30 days |
| Tier 3: Evaluation (sampled, 5-10%) | LLM-as-judge scores, faithfulness, relevance | $100-500/mo per 1M requests | High (LLM inference cost) | 90 days |
| Tier 4: Full trace (on-demand) | Complete request lifecycle with all intermediate steps | $200-1,000/mo per 1M requests | Highest | 7 days |
The 100/20/5 Rule
- 100% of requests: Log metadata (cost: negligible)
- 20% of requests: Store full content (cost: moderate)
- 5% of requests: Run quality evaluation (cost: highest per-request, manageable at 5%)
This achieves statistically significant quality monitoring (5% sample of 100K daily requests = 5,000 evaluated responses — more than sufficient for detecting 2-3% quality shifts) at a fraction of full-evaluation cost.
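One practical detail of implementing the 100/20/5 rule: sampling should be deterministic per request, so that the same request id always lands in the same tier and evaluated requests are guaranteed to have stored content to inspect. A sketch using a hash bucket (the nesting of tiers is a design assumption of this sketch, not a requirement of the rule):

```python
import hashlib

def sampling_tier(request_id, content_pct=20, eval_pct=5):
    """Deterministically assign a request to storage tiers by hashing its id.

    Every request gets metadata logging; a stable content_pct% slice also
    keeps full content; the first eval_pct% of that slice is additionally
    evaluated, so evaluated requests always have stored content behind them.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    tiers = {"metadata"}
    if bucket < content_pct:
        tiers.add("content")
    if bucket < eval_pct:
        tiers.add("evaluate")
    return tiers
```

Hash-based assignment also makes sampling reproducible across services: a separate evaluation worker can recompute the tier from the request id alone, without a shared random state.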
Alert Design
The difference between useful alerts and noise:
| Alert type | Good alert | Bad alert | Why the difference matters |
|---|---|---|---|
| Error rate | "Error rate exceeded 2% for 5 minutes (current: 3.2%)" | "Error occurred" | Sustained rate with threshold is actionable; individual errors are noise |
| Quality | "Thumbs-down rate increased from 3% to 8% over last 2 hours" | "Low quality response detected" | Trend with comparison is actionable; individual low-quality responses are expected |
| Cost | "Cost per request increased 40% vs 7-day average (current: $0.14, baseline: $0.10)" | "Expensive request: $0.50" | Trend shift is actionable; individual expensive requests are normal variance |
| Latency | "P95 latency exceeded 5s for 10 minutes (current: 7.2s, baseline: 3.1s)" | "Slow response: 8s" | Sustained degradation is actionable; individual slow responses are expected |
Alert Fatigue Prevention
| Practice | What it prevents | Implementation |
|---|---|---|
| Minimum duration | Flap alerts from transient spikes | Alert only if condition persists >5 minutes |
| Comparison to baseline | Static thresholds that don’t account for normal variation | Alert on deviation from rolling 7-day average |
| Grouped alerts | 50 alerts for the same root cause | Group alerts by service/feature within 10-minute window |
| Severity levels | Everything being “critical” | 3 levels: info (log), warning (Slack), critical (page) |
| Auto-resolve | Open alerts that no longer apply | Auto-close when metric returns to normal range |
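Three of these practices — minimum duration, comparison to a rolling baseline, and auto-resolve — compose naturally into one small state machine. A sketch with illustrative names (the rolling baseline is assumed to be computed elsewhere, e.g. a 7-day average):

```python
class BaselineAlert:
    """Fires when a metric exceeds N x its rolling baseline for a minimum
    duration, and auto-resolves when the metric recovers."""

    def __init__(self, deviation=2.0, min_duration=300):
        self.deviation = deviation        # fire at N x the rolling baseline
        self.min_duration = min_duration  # seconds the breach must persist
        self.breach_started = None
        self.firing = False

    def observe(self, value, baseline, now):
        """Returns "fire", "resolve", or None on each observation."""
        if value > baseline * self.deviation:
            if self.breach_started is None:
                self.breach_started = now
            if not self.firing and now - self.breach_started >= self.min_duration:
                self.firing = True
                return "fire"
        else:
            # Metric back in normal range: clear the clock, auto-resolve.
            self.breach_started = None
            if self.firing:
                self.firing = False
                return "resolve"
        return None
```

Because the threshold is relative to the baseline rather than a static number, the same alert object works for cost per request, p95 latency, or thumbs-down rate without per-metric tuning.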
How to Apply This
Use the token-counter tool to estimate the cost of your observability pipeline — LLM-as-judge evaluation on 5% of traffic consumes tokens that should be budgeted.
Start with the 100/20/5 architecture. Log metadata on everything, store content on 20%, evaluate 5%. This gives you cost tracking, debugging capability, and quality monitoring at manageable cost.
Implement three alerts on day one: error rate, user thumbs-down rate, and cost per request. These three signals catch the most critical failure modes across all three observability layers.
Choose your platform based on your existing stack. If you have Datadog, use Datadog LLM Observability. If you use LangChain, use LangSmith. Platform integration matters more than feature comparison — a tool you actually use beats a better tool you don’t.
Run weekly evaluation on your fixed eval set. This is the earliest warning for model quality regression. Provider model updates (which happen without notice) are the most common cause of quality drift.
Honest Limitations
- Platform comparison reflects features as of early 2026; LLM observability tools are evolving rapidly with monthly feature releases.
- The 100/20/5 sampling ratios are guidelines; applications with higher stakes (medical, financial) should evaluate a higher percentage.
- Drift detection latency depends on traffic volume; low-traffic applications may need weeks to detect statistically significant shifts.
- LLM-as-judge evaluation costs scale with the judge model used; GPT-4o at $0.01-0.05 per evaluation adds up at scale.
- Alert thresholds in this guide are starting points; calibrate to your specific baseline and acceptable variance.
- Some drift is expected and healthy (changing user patterns, seasonal effects); not all drift requires intervention.
- Self-hosted observability platforms (Langfuse, Phoenix) require infrastructure maintenance that’s not reflected in the “free” pricing.