AI Observability in Production — What to Measure, When to Alert, and What to Ignore
Dashboard metric priority matrix covering alert vs log vs ignore decisions, LLM monitoring platform comparison, drift detection methodology, and cost-effective observability architecture.
Your AI Dashboard Has 47 Metrics and Zero Actionable Alerts — You’re Monitoring Everything and Observing Nothing
Traditional software monitoring maps neatly to AI systems for the infrastructure layer — latency, error rates, and uptime behave the same. But the quality layer is entirely new: model output degrades without any error code, user satisfaction drops without any server metric changing, and costs spike without any traffic increase. AI observability is the discipline of detecting these invisible failures. The problem is that most teams either monitor nothing (hoping benchmarks hold in production) or monitor everything (drowning in metrics that never trigger action). This guide provides the metric priority matrix that separates signal from noise, the platform comparison for LLM monitoring tools, and the architecture that detects quality drift before users complain.
The AI Observability Stack
AI observability has three layers, each catching different failure modes:
| Layer | What it monitors | Failure modes caught | Traditional equivalent |
|---|---|---|---|
| Infrastructure | Latency, throughput, errors, availability | API outages, rate limits, timeout spikes | Standard APM (Datadog, New Relic) |
| Quality | Output correctness, faithfulness, relevance, safety | Model degradation, prompt regression, drift | No traditional equivalent |
| Business | User satisfaction, task completion, cost per outcome | Silent quality failures, UX degradation | Product analytics (Amplitude, Mixpanel) |
Key insight: Infrastructure monitoring is necessary but insufficient. An AI system can have 99.99% uptime, sub-second latency, and zero errors — while producing increasingly wrong answers. The quality layer is what makes AI observability different from standard APM.
Metric Priority Matrix
Every metric falls into one of three categories: alert (wake someone up), log (investigate when other signals fire), or ignore (nice to have, never actionable).
Infrastructure Metrics
| Metric | Priority | Alert threshold | Why |
|---|---|---|---|
| Error rate (4xx/5xx) | ALERT | >1% sustained 5 min | Direct user impact; provider outage signal |
| Latency p50 | LOG | Track trend | Baseline for comparison; rarely actionable alone |
| Latency p95 | ALERT | >3x baseline for 5 min | User experience degradation for tail requests |
| Latency p99 | LOG | Track for capacity planning | Edge cases; usually not actionable |
| Rate limit hits (429) | ALERT | >5% of requests | Indicates capacity ceiling; users being throttled |
| Token throughput | LOG | Track trend | Cost correlation; rarely actionable alone |
| Availability (provider) | ALERT | Any downtime >30s | Trigger failover to backup model |
| Request queue depth | ALERT | >100 pending for 2 min | Backpressure building; may need scaling |
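The "sustained for N minutes" condition attached to the alert thresholds above is what separates an actionable page from a flap on a transient spike. A minimal sketch of that check in Python (class and parameter names are illustrative, not from any particular monitoring library):

```python
from collections import deque
import time

class SustainedRateAlert:
    """Fire only when the error rate stays above threshold for a full window."""

    def __init__(self, threshold=0.01, window_seconds=300):
        self.threshold = threshold          # e.g. 1% error rate
        self.window_seconds = window_seconds  # e.g. sustained for 5 minutes
        self.events = deque()               # (timestamp, is_error) pairs
        self.breach_started = None

    def record(self, is_error, now=None):
        """Record one request outcome; return True if the alert should fire."""
        now = now if now is not None else time.time()
        self.events.append((now, is_error))
        # Drop events that have aged out of the sliding window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        rate = errors / len(self.events)
        if rate > self.threshold:
            if self.breach_started is None:
                self.breach_started = now
            # Fire only once the breach has persisted for the whole window.
            return now - self.breach_started >= self.window_seconds
        self.breach_started = None
        return False
```

A single failing request starts the breach clock but never fires on its own; the alert fires only if every window check since then has also been over threshold.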
Quality Metrics
| Metric | Priority | Alert threshold | Collection method |
|---|---|---|---|
| User thumbs down rate | ALERT | >5% over 1-hour window | UI feedback button |
| Response regeneration rate | ALERT | >8% over 1-hour window | Track regenerate clicks |
| Hallucination rate (sampled) | ALERT | >10% on sampled responses | LLM-as-judge on 5-10% sample |
| Faithfulness score (RAG) | ALERT | <0.85 on sampled responses | RAGAS/custom metric on sample |
| Output format compliance | ALERT | <95% (structured output tasks) | Schema validation on every response |
| Safety filter triggers | LOG | Track trend; alert on 2x spike | Content filter logging |
| Output length distribution | LOG | Alert if mean shifts >20% | Token counting per response |
| Topic classification (output) | LOG | Alert on off-topic spike >3% | Lightweight classifier on output |
| Embedding drift | LOG | Monitor weekly | Compare embedding distribution of current vs baseline queries |
| Benchmark regression | LOG | Check weekly against eval suite | Automated eval pipeline |
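Format compliance is the one quality metric in the table cheap enough to validate on every response rather than a sample. A minimal sketch using only the standard library — a simplified stand-in for full JSON Schema validation, with hypothetical key names:

```python
import json

def format_compliant(response_text, required_keys):
    """True if the response parses as a JSON object containing every required key."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in required_keys)

def compliance_rate(responses, required_keys):
    """Fraction of responses that satisfy the expected structure."""
    if not responses:
        return 1.0
    ok = sum(format_compliant(r, required_keys) for r in responses)
    return ok / len(responses)
```

Feeding `compliance_rate` into an hourly check against the 95% threshold gives an early signal for provider model updates that silently change output formatting.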
Business Metrics
| Metric | Priority | Alert threshold | Collection method |
|---|---|---|---|
| Task completion rate | ALERT | <80% (or 10% drop from baseline) | End-to-end task tracking |
| Session abandonment | ALERT | >40% single-turn sessions | Session analytics |
| Cost per completed task | LOG | >2x baseline | Cost tracking + task completion join |
| Time to task completion | LOG | Track trend | Session timing |
| Feature adoption (AI vs non-AI path) | LOG | Track trend | A/B or feature flag analytics |
| Daily active users (AI feature) | LOG | Alert on >20% drop | Product analytics |
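Cost per completed task requires joining two data sources the table lists separately: per-request cost records and task-completion events. A hedged sketch of that join, keyed on session id (record and field names are illustrative assumptions, not a specific analytics schema):

```python
from collections import defaultdict

def cost_per_completed_task(requests, completed_sessions):
    """Join per-request cost records with completion events by session id.

    requests: iterable of dicts like {"session_id": ..., "cost_usd": ...}
    completed_sessions: set of session ids where the user finished the task.
    Returns (cost_per_completed_task, completion_rate).
    """
    cost_by_session = defaultdict(float)
    for req in requests:
        cost_by_session[req["session_id"]] += req["cost_usd"]
    total_cost = sum(cost_by_session.values())
    completed = sum(1 for s in cost_by_session if s in completed_sessions)
    if completed == 0:
        return float("inf"), 0.0
    # Total spend is divided by completions, so failed sessions raise the cost.
    return total_cost / completed, completed / len(cost_by_session)
```

Dividing total spend (not just the spend of successful sessions) by completions is the design choice that makes this metric catch silent quality failures: wasted requests in abandoned sessions inflate it.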
LLM Monitoring Platform Comparison
| Platform | Quality monitoring | Cost tracking | Trace logging | Eval integration | Pricing | Best for |
|---|---|---|---|---|---|---|
| Langfuse | Good (custom scores) | Excellent | Excellent | Good (custom evals) | Free (self-hosted), $59+/mo (cloud) | Open-source preference, full control |
| LangSmith | Good | Good | Excellent | Excellent (tight LangChain integration) | $39-400/mo | LangChain/LangGraph users |
| Braintrust | Excellent (built-in judges) | Good | Good | Excellent | $50-500/mo | Teams needing eval + monitoring |
| Helicone | Basic | Excellent | Good | Basic | Free tier, $50+/mo | Cost-focused monitoring |
| Arize Phoenix | Excellent (drift detection) | Good | Excellent | Good | Free (open-source) | ML teams familiar with ML observability |
| Datadog LLM Observability | Good | Good | Excellent | Basic | Part of Datadog pricing | Teams already using Datadog |
| Weights & Biases Weave | Good | Basic | Good | Excellent | Free tier, $50+/mo | ML teams with W&B workflow |
| OpenLLMetry | Basic (OpenTelemetry-based) | Good | Excellent | Basic | Free (open-source) | OpenTelemetry-native infrastructure |
Platform Selection Decision
| Your situation | Recommended platform | Why |
|---|---|---|
| Early stage, need basics fast | Langfuse (self-hosted) or Helicone | Free, quick setup, covers logging and cost tracking |
| Using LangChain | LangSmith | Native integration, deepest traces for chain/agent debugging |
| Need quality evaluation + monitoring | Braintrust | Best eval-to-monitoring pipeline; built-in LLM judges |
| Enterprise, existing Datadog | Datadog LLM Observability | Integrates with existing alerting, dashboards, and on-call |
| ML team, familiar with experiment tracking | Arize Phoenix or W&B Weave | Concepts (drift, embeddings, evals) map to ML workflow |
| OpenTelemetry-first architecture | OpenLLMetry | Standard OTEL spans for LLM calls; integrates with any OTEL backend |
Drift Detection — The Critical Quality Signal
AI systems degrade silently. Drift detection catches quality changes before users notice:
Types of Drift
| Drift type | What changes | Cause | Detection method | Detection latency |
|---|---|---|---|---|
| Input drift | User queries become different from training/eval distribution | Seasonal changes, new user segments, feature changes | Embedding distribution comparison | Days to weeks |
| Output drift | Model outputs change in length, format, or content patterns | Provider model updates, system prompt changes | Output feature monitoring (length, format compliance) | Hours to days |
| Quality drift | Answer correctness degrades on consistent inputs | Model regression, data staleness, retrieval degradation | Sampled evaluation on production traffic | Days to weeks |
| Cost drift | Per-request cost increases without traffic changes | Token count inflation, prompt bloat, model routing failure | Cost-per-request monitoring | Hours |
| Behavioral drift | User interaction patterns change | Poor responses → behavior change → metrics shift | Session analytics, abandonment rate | Days to weeks |
Drift Detection Methods
| Method | What it detects | Implementation complexity | Cost | Sensitivity |
|---|---|---|---|---|
| Statistical process control | Mean/variance shifts in numerical metrics | Low (standard statistics) | Negligible | Medium — detects large shifts |
| Embedding distribution monitoring | Input/output distribution shifts | Medium (requires embedding + comparison) | $50-200/mo | High — detects semantic shifts |
| Periodic evaluation | Quality regression on fixed eval set | Medium (scheduled eval pipeline) | $10-100/run (LLM-as-judge) | High — directly measures quality |
| A/B comparison | Difference between current and baseline versions | Medium (requires traffic splitting) | 2x LLM cost during A/B | Highest — controlled comparison |
| User signal correlation | Quality changes reflected in user behavior | Low (product analytics) | Negligible | Low — lagging indicator |
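The statistical process control row is the cheapest method to implement: it is the standard control-chart test applied to any numeric metric you already log (output length, cost per request, latency). A minimal sketch using only the standard library:

```python
import statistics

def mean_shift_detected(baseline, current, z_threshold=3.0):
    """Standard control-chart test: flag when the current window's mean sits
    more than z_threshold standard errors from the baseline mean.

    baseline: metric values from a known-good period (e.g. last week)
    current: metric values from the window being checked (e.g. last hour)
    """
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(current) != mu
    # Standard error of the current window's mean under the baseline distribution.
    standard_error = sigma / len(current) ** 0.5
    z_score = abs(statistics.mean(current) - mu) / standard_error
    return z_score > z_threshold
```

As the table notes, this only catches large shifts — a semantic drift that leaves output length and cost unchanged passes straight through, which is why embedding monitoring and sampled evaluation sit alongside it.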
The Drift Detection Pipeline
| Stage | Frequency | What it checks | Action if drift detected |
|---|---|---|---|
| Real-time | Every request | Format compliance, latency, errors | Immediate alert; failover if critical |
| Hourly | Aggregate hourly | Cost per request, output length distribution | Warning if >20% shift from 24h average |
| Daily | Sample 100-500 requests | LLM-as-judge quality scores | Alert if score drops >5% from baseline |
| Weekly | Full eval suite | Complete regression test on fixed dataset | Alert if any metric drops >3% from all-time baseline |
| Monthly | Embedding distribution | Input/output semantic shift analysis | Investigate; may indicate new user patterns (not always bad) |
Cost-Effective Observability Architecture
Full observability on every request is expensive. The cost-effective approach: monitor everything cheaply, evaluate selectively.
| Data tier | What to capture | Storage cost | Analysis cost | Retention |
|---|---|---|---|---|
| Tier 1: Metadata (every request) | Timestamp, model, latency, token count, status code, cost | $5-20/mo per 1M requests | Negligible | 90 days |
| Tier 2: Content (sampled, 10-20%) | Full prompt + response text | $50-200/mo per 1M requests | Moderate (storage-bound) | 30 days |
| Tier 3: Evaluation (sampled, 5-10%) | LLM-as-judge scores, faithfulness, relevance | $100-500/mo per 1M requests | High (LLM inference cost) | 90 days |
| Tier 4: Full trace (on-demand) | Complete request lifecycle with all intermediate steps | $200-1,000/mo per 1M requests | Highest | 7 days |
The 100/20/5 Rule
- 100% of requests: Log metadata (cost: negligible)
- 20% of requests: Store full content (cost: moderate)
- 5% of requests: Run quality evaluation (cost: highest per-request, manageable at 5%)
This achieves statistically significant quality monitoring (5% sample of 100K daily requests = 5,000 evaluated responses — more than sufficient for detecting 2-3% quality shifts) at a fraction of full-evaluation cost.
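One practical detail of implementing the 100/20/5 rule: sampling should be deterministic per request, so that the same request id always lands in the same tier and evaluated requests are guaranteed to have stored content to inspect. A sketch using a hash bucket (the nesting of tiers is a design assumption of this sketch, not a requirement of the rule):

```python
import hashlib

def sampling_tier(request_id, content_pct=20, eval_pct=5):
    """Deterministically assign a request to storage tiers by hashing its id.

    Every request gets metadata logging; a stable content_pct% slice also
    keeps full content; the first eval_pct% of that slice is additionally
    evaluated, so evaluated requests always have stored content behind them.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    tiers = {"metadata"}
    if bucket < content_pct:
        tiers.add("content")
    if bucket < eval_pct:
        tiers.add("evaluate")
    return tiers
```

Hash-based assignment also makes sampling reproducible across services: a separate evaluation worker can recompute the tier from the request id alone, without a shared random state.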
Alert Design
The difference between useful alerts and noise:
| Alert type | Good alert | Bad alert | Why the difference matters |
|---|---|---|---|
| Error rate | "Error rate exceeded 2% for 5 minutes (current: 3.2%)" | "Error occurred" | Sustained rate with threshold is actionable; individual errors are noise |
| Quality | "Thumbs-down rate increased from 3% to 8% over last 2 hours" | "Low quality response detected" | Trend with comparison is actionable; individual low-quality responses are expected |
| Cost | "Cost per request increased 40% vs 7-day average (current: $0.14, baseline: $0.10)" | "Expensive request: $0.50" | Trend shift is actionable; individual expensive requests are normal variance |
| Latency | "P95 latency exceeded 5s for 10 minutes (current: 7.2s, baseline: 3.1s)" | "Slow response: 8s" | Sustained degradation is actionable; individual slow responses are expected |
Alert Fatigue Prevention
| Practice | What it prevents | Implementation |
|---|---|---|
| Minimum duration | Flap alerts from transient spikes | Alert only if condition persists >5 minutes |
| Comparison to baseline | Static thresholds that don’t account for normal variation | Alert on deviation from rolling 7-day average |
| Grouped alerts | 50 alerts for the same root cause | Group alerts by service/feature within 10-minute window |
| Severity levels | Everything being “critical” | 3 levels: info (log), warning (Slack), critical (page) |
| Auto-resolve | Open alerts that no longer apply | Auto-close when metric returns to normal range |
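Three of these practices — minimum duration, comparison to a rolling baseline, and auto-resolve — compose naturally into one small state machine. A sketch with illustrative names (the rolling baseline is assumed to be computed elsewhere, e.g. a 7-day average):

```python
class BaselineAlert:
    """Fires when a metric exceeds N x its rolling baseline for a minimum
    duration, and auto-resolves when the metric recovers."""

    def __init__(self, deviation=2.0, min_duration=300):
        self.deviation = deviation        # fire at N x the rolling baseline
        self.min_duration = min_duration  # seconds the breach must persist
        self.breach_started = None
        self.firing = False

    def observe(self, value, baseline, now):
        """Returns "fire", "resolve", or None on each observation."""
        if value > baseline * self.deviation:
            if self.breach_started is None:
                self.breach_started = now
            if not self.firing and now - self.breach_started >= self.min_duration:
                self.firing = True
                return "fire"
        else:
            # Metric back in normal range: clear the clock, auto-resolve.
            self.breach_started = None
            if self.firing:
                self.firing = False
                return "resolve"
        return None
```

Because the threshold is relative to the baseline rather than a static number, the same alert object works for cost per request, p95 latency, or thumbs-down rate without per-metric tuning.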
How to Apply This
Use the token-counter tool to estimate the cost of your observability pipeline — LLM-as-judge evaluation on 5% of traffic consumes tokens that should be budgeted.
Start with the 100/20/5 architecture. Log metadata on everything, store content on 20%, evaluate 5%. This gives you cost tracking, debugging capability, and quality monitoring at manageable cost.
Implement three alerts on day one: error rate, user thumbs-down rate, and cost per request. These three signals catch the most critical failure modes across all three observability layers.
Choose your platform based on your existing stack. If you have Datadog, use Datadog LLM Observability. If you use LangChain, use LangSmith. Platform integration matters more than feature comparison — a tool you actually use beats a better tool you don’t.
Run weekly evaluation on your fixed eval set. This is the earliest warning for model quality regression. Provider model updates (which happen without notice) are the most common cause of quality drift.
Honest Limitations
- Platform comparison reflects features as of early 2026; LLM observability tools are evolving rapidly with monthly feature releases.
- The 100/20/5 sampling ratios are guidelines; applications with higher stakes (medical, financial) should evaluate a higher percentage.
- Drift detection latency depends on traffic volume; low-traffic applications may need weeks to detect statistically significant shifts.
- LLM-as-judge evaluation costs scale with the judge model used; GPT-4o at $0.01-0.05 per evaluation adds up at scale.
- Alert thresholds in this guide are starting points; calibrate to your specific baseline and acceptable variance.
- Some drift is expected and healthy (changing user patterns, seasonal effects); not all drift requires intervention.
- Self-hosted observability platforms (Langfuse, Phoenix) require infrastructure maintenance that’s not reflected in the “free” pricing.