Your AI Dashboard Has 47 Metrics and Zero Actionable Alerts — You’re Monitoring Everything and Observing Nothing

Traditional software monitoring maps neatly onto AI systems at the infrastructure layer — latency, error rates, and uptime work the same way. But the quality layer is entirely new: model output degrades without any error code, user satisfaction drops without any server metric changing, and costs spike without any traffic increase. AI observability is the discipline of detecting these invisible failures. The problem is that most teams either monitor nothing (hoping benchmarks hold in production) or monitor everything (drowning in metrics that never trigger action). This guide provides the metric priority matrix that separates signal from noise, the platform comparison for LLM monitoring tools, and the architecture that detects quality drift before users complain.

The AI Observability Stack

AI observability has three layers, each catching different failure modes:

| Layer | What it monitors | Failure modes caught | Traditional equivalent |
|---|---|---|---|
| Infrastructure | Latency, throughput, errors, availability | API outages, rate limits, timeout spikes | Standard APM (Datadog, New Relic) |
| Quality | Output correctness, faithfulness, relevance, safety | Model degradation, prompt regression, drift | No traditional equivalent |
| Business | User satisfaction, task completion, cost per outcome | Silent quality failures, UX degradation | Product analytics (Amplitude, Mixpanel) |

Key insight: Infrastructure monitoring is necessary but insufficient. An AI system can have 99.99% uptime, sub-second latency, and zero errors — while producing increasingly wrong answers. The quality layer is what makes AI observability different from standard APM.

Metric Priority Matrix

Every metric falls into one of three categories: alert (wake someone up), log (investigate when other signals fire), or ignore (nice to have, never actionable).

Infrastructure Metrics

| Metric | Priority | Alert threshold | Why |
|---|---|---|---|
| Error rate (4xx/5xx) | ALERT | >1% sustained 5 min | Direct user impact; provider outage signal |
| Latency p50 | LOG | Track trend | Baseline for comparison; rarely actionable alone |
| Latency p95 | ALERT | >3x baseline for 5 min | User experience degradation for tail requests |
| Latency p99 | LOG | Track for capacity planning | Edge cases; usually not actionable |
| Rate limit hits (429) | ALERT | >5% of requests | Indicates capacity ceiling; users being throttled |
| Token throughput | LOG | Track trend | Cost correlation; rarely actionable alone |
| Availability (provider) | ALERT | Any downtime >30s | Trigger failover to backup model |
| Request queue depth | ALERT | >100 pending for 2 min | Backpressure building; may need scaling |
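
The "sustained for N minutes" qualifier on these thresholds is what separates an actionable alert from noise. A minimal sketch of that logic, assuming a rolling sample window (class and parameter names are illustrative, not from any particular library):

```python
from collections import deque

class SustainedRateAlert:
    """Fire only when an error rate stays above a threshold for a minimum
    duration, per the '>1% sustained 5 min' rule above."""

    def __init__(self, threshold=0.01, min_duration_s=300, window_s=300):
        self.threshold = threshold
        self.min_duration_s = min_duration_s
        self.window_s = window_s
        self.samples = deque()       # (timestamp, is_error) pairs
        self.breach_started = None   # when the rate first crossed threshold

    def record(self, is_error: bool, now: float) -> bool:
        """Record one request outcome; return True when the alert should fire."""
        self.samples.append((now, is_error))
        # Drop samples that have aged out of the rolling window
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        rate = sum(err for _, err in self.samples) / len(self.samples)
        if rate > self.threshold:
            if self.breach_started is None:
                self.breach_started = now
            return now - self.breach_started >= self.min_duration_s
        self.breach_started = None   # transient spike ended; reset
        return False
```

A single failing request returns False; only a breach that persists past `min_duration_s` fires, which is the flap-prevention behavior described in the alert-design section later.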

Quality Metrics

| Metric | Priority | Alert threshold | Collection method |
|---|---|---|---|
| User thumbs-down rate | ALERT | >5% over 1-hour window | UI feedback button |
| Response regeneration rate | ALERT | >8% over 1-hour window | Track regenerate clicks |
| Hallucination rate (sampled) | ALERT | >10% on sampled responses | LLM-as-judge on 5-10% sample |
| Faithfulness score (RAG) | ALERT | <0.85 on sampled responses | RAGAS/custom metric on sample |
| Output format compliance | ALERT | <95% (structured output tasks) | Schema validation on every response |
| Safety filter triggers | LOG | Track trend; alert on 2x spike | Content filter logging |
| Output length distribution | LOG | Alert if mean shifts >20% | Token counting per response |
| Topic classification (output) | LOG | Alert on off-topic spike >3% | Lightweight classifier on output |
| Embedding drift | LOG | Monitor weekly | Compare embedding distribution of current vs. baseline queries |
| Benchmark regression | LOG | Check weekly against eval suite | Automated eval pipeline |
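
Several of these quality metrics run on a 5-10% sample rather than every response. One way to pick the sample, sketched here as an assumption rather than a prescribed method, is deterministic hash-based sampling: the same request id always lands in the same bucket, so the decision is reproducible across retries and services.

```python
import hashlib

def should_evaluate(request_id: str, sample_rate: float = 0.05) -> bool:
    """Decide whether a request falls into the LLM-as-judge sample.

    Hash-based rather than random so the sampling decision is stable:
    `request_id` is whatever unique id your pipeline already assigns.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Anything downstream (the judge call, faithfulness scoring) runs only when this returns True, which is what keeps evaluation cost proportional to the sample rate.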

Business Metrics

| Metric | Priority | Alert threshold | Collection method |
|---|---|---|---|
| Task completion rate | ALERT | <80% (or 10% drop from baseline) | End-to-end task tracking |
| Session abandonment | ALERT | >40% single-turn sessions | Session analytics |
| Cost per completed task | LOG | >2x baseline | Cost tracking + task completion join |
| Time to task completion | LOG | Track trend | Session timing |
| Feature adoption (AI vs. non-AI path) | LOG | Track trend | A/B or feature flag analytics |
| Daily active users (AI feature) | LOG | Alert on >20% drop | Product analytics |

LLM Monitoring Platform Comparison

| Platform | Quality monitoring | Cost tracking | Trace logging | Eval integration | Pricing | Best for |
|---|---|---|---|---|---|---|
| Langfuse | Good (custom scores) | Excellent | Excellent | Good (custom evals) | Free (self-hosted), $59+/mo (cloud) | Open-source preference, full control |
| LangSmith | Good | Good | Excellent | Excellent (tight LangChain) | $39-400/mo | LangChain/LangGraph users |
| Braintrust | Excellent (built-in judges) | Good | Good | Excellent | $50-500/mo | Teams needing eval + monitoring |
| Helicone | Basic | Excellent | Good | Basic | Free tier, $50+/mo | Cost-focused monitoring |
| Arize Phoenix | Excellent (drift detection) | Good | Excellent | Good | Free (open-source) | ML teams familiar with ML observability |
| Datadog LLM Observability | Good | Good | Excellent | Basic | Part of Datadog pricing | Teams already using Datadog |
| Weights & Biases Weave | Good | Basic | Good | Excellent | Free tier, $50+/mo | ML teams with W&B workflow |
| OpenLLMetry | Basic (OpenTelemetry-based) | Good | Excellent | Basic | Free (open-source) | OpenTelemetry-native infrastructure |

Platform Selection Decision

| Your situation | Recommended platform | Why |
|---|---|---|
| Early stage, need basics fast | Langfuse (self-hosted) or Helicone | Free, quick setup, covers logging and cost tracking |
| Using LangChain | LangSmith | Native integration, deepest traces for chain/agent debugging |
| Need quality evaluation + monitoring | Braintrust | Best eval-to-monitoring pipeline; built-in LLM judges |
| Enterprise, existing Datadog | Datadog LLM Observability | Integrates with existing alerting, dashboards, on-call |
| ML team, familiar with experiment tracking | Arize Phoenix or W&B Weave | Concepts (drift, embeddings, evals) map to ML workflow |
| OpenTelemetry-first architecture | OpenLLMetry | Standard OTel spans for LLM calls; integrates with any OTel backend |

Drift Detection — The Critical Quality Signal

AI systems degrade silently. Drift detection catches quality changes before users notice:

Types of Drift

| Drift type | What changes | Cause | Detection method | Detection latency |
|---|---|---|---|---|
| Input drift | User queries become different from training/eval distribution | Seasonal changes, new user segments, feature changes | Embedding distribution comparison | Days to weeks |
| Output drift | Model outputs change in length, format, or content patterns | Provider model updates, system prompt changes | Output feature monitoring (length, format compliance) | Hours to days |
| Quality drift | Answer correctness degrades on consistent inputs | Model regression, data staleness, retrieval degradation | Sampled evaluation on production traffic | Days to weeks |
| Cost drift | Per-request cost increases without traffic changes | Token count inflation, prompt bloat, model routing failure | Cost-per-request monitoring | Hours |
| Behavioral drift | User interaction patterns change | Poor responses → behavior change → metrics shift | Session analytics, abandonment rate | Days to weeks |

Drift Detection Methods

| Method | What it detects | Implementation complexity | Cost | Sensitivity |
|---|---|---|---|---|
| Statistical process control | Mean/variance shifts in numerical metrics | Low (standard statistics) | Negligible | Medium — detects large shifts |
| Embedding distribution monitoring | Input/output distribution shifts | Medium (requires embedding + comparison) | $50-200/mo | High — detects semantic shifts |
| Periodic evaluation | Quality regression on fixed eval set | Medium (scheduled eval pipeline) | $10-100/run (LLM-as-judge) | High — directly measures quality |
| A/B comparison | Difference between current and baseline versions | Medium (requires traffic splitting) | 2x LLM cost during A/B | Highest — controlled comparison |
| User signal correlation | Quality changes reflected in user behavior | Low (product analytics) | Negligible | Low — lagging indicator |
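
The cheapest method in the table, statistical process control, needs only the standard library. A minimal Shewhart-style sketch (the function name and sigma default are illustrative choices, not a standard API):

```python
from statistics import mean, stdev

def spc_breach(history, current, sigma=3.0):
    """Flag `current` if it falls outside mean +/- sigma * stdev of recent
    history. Works on any numeric metric stream (latency, output length,
    cost per request). Only detects large shifts, which matches the
    'medium' sensitivity noted in the table."""
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        # Zero historical variance: any deviation at all is a breach
        return current != mu
    return abs(current - mu) > sigma * sd
```

Feed it a rolling window of hourly aggregates; a three-sigma breach on output length or cost per request is a reasonable trigger for the warning-level alerts described later.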

The Drift Detection Pipeline

| Stage | Scope | What it checks | Action if drift detected |
|---|---|---|---|
| Real-time | Every request | Format compliance, latency, errors | Immediate alert; failover if critical |
| Hourly | Aggregate hourly | Cost per request, output length distribution | Warning if >20% shift from 24h average |
| Daily | Sample 100-500 requests | LLM-as-judge quality scores | Alert if score drops >5% from baseline |
| Weekly | Full eval suite | Complete regression test on fixed dataset | Alert if any metric drops >3% from all-time baseline |
| Monthly | Embedding distribution | Input/output semantic shift analysis | Investigate; may indicate new user patterns (not always bad) |
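
The monthly embedding-distribution stage can start as simply as comparing population centroids. A deliberately crude sketch (production drift detectors typically use distribution-level tests such as population stability index or MMD instead; this function is an assumption, not an established API):

```python
import math

def centroid_drift(baseline_embs, current_embs):
    """Cosine distance between the centroids of two embedding populations.

    Returns ~0 when the query distributions point the same way, and values
    toward 1 as they shift apart. Inputs are lists of equal-length vectors,
    e.g. last month's query embeddings vs. this month's.
    """
    def centroid(embs):
        dims = len(embs[0])
        return [sum(e[i] for e in embs) / len(embs) for i in range(dims)]

    a, b = centroid(baseline_embs), centroid(current_embs)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm
```

As the table notes, a drift signal here warrants investigation rather than an automatic page: new user patterns produce the same shift as genuine degradation.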

Cost-Effective Observability Architecture

Full observability on every request is expensive. The cost-effective approach: monitor everything cheaply, evaluate selectively.

| Data tier | What to capture | Storage cost | Analysis cost | Retention |
|---|---|---|---|---|
| Tier 1: Metadata (every request) | Timestamp, model, latency, token count, status code, cost | $5-20/mo per 1M requests | Negligible | 90 days |
| Tier 2: Content (sampled, 10-20%) | Full prompt + response text | $50-200/mo per 1M requests | Moderate (storage-bound) | 30 days |
| Tier 3: Evaluation (sampled, 5-10%) | LLM-as-judge scores, faithfulness, relevance | $100-500/mo per 1M requests | High (LLM inference cost) | 90 days |
| Tier 4: Full trace (on-demand) | Complete request lifecycle with all intermediate steps | $200-1,000/mo per 1M requests | Highest | 7 days |

The 100/20/5 Rule

  • 100% of requests: Log metadata (cost: negligible)
  • 20% of requests: Store full content (cost: moderate)
  • 5% of requests: Run quality evaluation (cost: highest per-request, manageable at 5%)

This achieves statistically significant quality monitoring (5% sample of 100K daily requests = 5,000 evaluated responses — more than sufficient for detecting 2-3% quality shifts) at a fraction of full-evaluation cost.
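A nested sampler is one way to implement the rule: a single uniform draw decides all tiers at once, so the 5% evaluated slice is always a subset of the 20% content slice (every judged response also has its full text stored for debugging). A sketch under that assumption:

```python
import random

def assign_tiers(rng=random):
    """Assign one request to observability tiers per the 100/20/5 rule:
    metadata always, full content at 20%, evaluation at 5%. Because all
    thresholds share one draw, evaluation implies content capture."""
    r = rng.random()
    tiers = {"metadata"}
    if r < 0.20:
        tiers.add("content")
    if r < 0.05:
        tiers.add("evaluation")
    return tiers
```

Higher-stakes applications can raise the two cutoffs without changing the structure, per the limitations section at the end.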

Alert Design

The difference between useful alerts and noise:

| Alert type | Good alert | Bad alert | Why the difference matters |
|---|---|---|---|
| Error rate | "Error rate exceeded 2% for 5 minutes (current: 3.2%)" | "Error occurred" | Sustained rate with threshold is actionable; individual errors are noise |
| Quality | "Thumbs-down rate increased from 3% to 8% over last 2 hours" | "Low quality response detected" | Trend with comparison is actionable; individual low-quality responses are expected |
| Cost | "Cost per request increased 40% vs 7-day average (current: $0.14, baseline: $0.10)" | "Expensive request: $0.50" | Trend shift is actionable; individual expensive requests are normal variance |
| Latency | "P95 latency exceeded 5s for 10 minutes (current: 7.2s, baseline: 3.1s)" | "Slow response: 8s" | Sustained degradation is actionable; individual slow responses are expected |
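
Every "good alert" above shares a template: what changed, by how much, against which baseline, over what window. A sketch of that template as a helper (hypothetical function, shown only to make the pattern concrete):

```python
def format_alert(metric, current, baseline, window):
    """Render an actionable alert message: direction and size of the
    change, plus the current value and baseline for context."""
    change = (current - baseline) / baseline * 100
    return (f"{metric} is {change:+.0f}% vs {window} baseline "
            f"(current: {current:.2f}, baseline: {baseline:.2f})")
```

For example, `format_alert("Cost per request", 0.14, 0.10, "7-day")` reproduces the cost row above: the reader of the page knows immediately whether the deviation is worth acting on.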

Alert Fatigue Prevention

| Practice | What it prevents | Implementation |
|---|---|---|
| Minimum duration | Flap alerts from transient spikes | Alert only if condition persists >5 minutes |
| Comparison to baseline | Static thresholds that don't account for normal variation | Alert on deviation from rolling 7-day average |
| Grouped alerts | 50 alerts for the same root cause | Group alerts by service/feature within 10-minute window |
| Severity levels | Everything being "critical" | 3 levels: info (log), warning (Slack), critical (page) |
| Auto-resolve | Open alerts that no longer apply | Auto-close when metric returns to normal range |
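
The grouping practice can be sketched as a small deduplicator keyed on root cause (class and key choice are illustrative; real systems usually delegate this to the alerting backend):

```python
from collections import defaultdict

class AlertGrouper:
    """Suppress repeat alerts sharing a root-cause key inside a window.

    The first alert for a (service, metric) pair is delivered; repeats
    within `window_s` are counted but not sent, so one incident produces
    one page instead of fifty.
    """
    def __init__(self, window_s=600):
        self.window_s = window_s
        self.last_sent = {}                  # key -> timestamp of last send
        self.suppressed = defaultdict(int)   # key -> count of muted repeats

    def should_send(self, service: str, metric: str, now: float) -> bool:
        key = (service, metric)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_s:
            self.suppressed[key] += 1
            return False
        self.last_sent[key] = now
        return True
```

The suppressed count is worth attaching to the next delivered alert ("+17 similar in the last 10 minutes"), since it signals incident scale without generating extra pages.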

How to Apply This

Use the token-counter tool to estimate the cost of your observability pipeline — LLM-as-judge evaluation on 5% of traffic consumes tokens that should be budgeted.

Start with the 100/20/5 architecture. Log metadata on everything, store content on 20%, evaluate 5%. This gives you cost tracking, debugging capability, and quality monitoring at manageable cost.

Implement three alerts on day one: error rate, user thumbs-down rate, and cost per request. These three signals catch the most critical failure modes across all three observability layers.

Choose your platform based on your existing stack. If you have Datadog, use Datadog LLM Observability. If you use LangChain, use LangSmith. Platform integration matters more than feature comparison — a tool you actually use beats a better tool you don't.

Run weekly evaluation on your fixed eval set. This is the earliest warning for model quality regression. Provider model updates (which happen without notice) are the most common cause of quality drift.

Honest Limitations

Platform comparison reflects features as of early 2026; LLM observability tools are evolving rapidly with monthly feature releases. The 100/20/5 sampling ratios are guidelines — applications with higher stakes (medical, financial) should evaluate a higher percentage. Drift detection latency depends on traffic volume; low-traffic applications may need weeks to detect statistically significant shifts. LLM-as-judge evaluation costs scale with the judge model used — GPT-4o at $0.01-0.05 per evaluation adds up at scale. Alert thresholds in this guide are starting points; calibrate to your specific baseline and acceptable variance. Some drift is expected and healthy (changing user patterns, seasonal effects) — not all drift requires intervention. Self-hosted observability platforms (Langfuse, Phoenix) require infrastructure maintenance that’s not reflected in the “free” pricing.