You've deployed your AI system. It's processing queries, generating outputs, and your users seem happy. But you have no idea if it's actually working well. You can't see what it's retrieving, why it's generating specific responses, or whether quality is degrading. Welcome to the AI observability gap.
What You Need to Know
- AI observability is fundamentally different from traditional application monitoring. Standard metrics (uptime, latency, error rates) are necessary but insufficient. AI systems can be "up" while producing terrible results.
- The three layers of AI observability: operational (is it running?), quality (is it good?), and behavioural (is it changing?). Most teams only monitor the first.
- AI systems degrade silently. Unlike a crashed server, a degrading AI system keeps producing output - just worse output. Without quality monitoring, you discover problems when users complain.
- This guide covers what to measure, how to set alerts, and when human intervention is required.
Layer 1: Operational Observability
This is the familiar ground. Standard infrastructure monitoring applied to AI-specific components.
What to Measure
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| API latency (p50, p95, p99) | Response time distribution | p95 > 5s for interactive use cases |
| Error rate | System failures | > 1% of requests failing |
| Token usage | Cost and complexity | Unusual spikes (> 2x daily average) |
| Rate limit hits | Capacity constraints | Any rate limit hit during business hours |
| Embedding generation time | Data pipeline health | > 2x baseline for same document volume |
| Vector search latency | Retrieval performance | p95 > 500ms |
| Cache hit rate | Efficiency | Drop below 40% (for established systems) |
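As a concrete sketch of how thresholds like those above can be checked, here is a minimal percentile alert using only the standard library. The function names and the 5-second limit are illustrative, mirroring the table's interactive-use threshold:

```python
import statistics

def percentile(samples, pct):
    """Return the pct-th percentile (1-99) of a list of samples."""
    # n=100 quantiles yields 99 cut points, one per percentile
    return statistics.quantiles(sorted(samples), n=100)[pct - 1]

def check_latency(samples_s, p95_limit_s=5.0):
    """Alert if p95 latency exceeds the interactive-use threshold."""
    p95 = percentile(samples_s, 95)
    return {"p95": p95, "alert": p95 > p95_limit_s}
```

In practice your metrics backend computes percentiles for you; the point is that alerting on p95 rather than the mean catches tail latency that averages hide.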
How to Implement
Standard observability tools work here. Datadog, Grafana, CloudWatch - whatever your team already uses. The key addition: instrument the AI-specific components (embedding generation, vector search, model inference) as separate spans in your tracing, not lumped together as one "AI call."
```
Request → [Auth: 5ms] → [Retrieval: 120ms] → [Reranking: 45ms] → [Inference: 1200ms] → [Post-process: 30ms] → Response
```
When latency spikes, you need to know which component caused it. A 3-second response because inference was slow is a different problem than a 3-second response because retrieval searched too many documents.
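A library-free sketch of that per-component instrumentation (a real deployment would use OpenTelemetry or your APM's tracing API; the component names follow the trace above, and `handle_request` is a stand-in pipeline):

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for one request

@contextmanager
def span(name):
    """Record the wall-clock duration of one pipeline component."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

def handle_request(query):
    with span("retrieval"):
        docs = ["doc-1", "doc-2"]  # stand-in for vector search
    with span("inference"):
        answer = f"answer for {query!r} using {len(docs)} docs"
    return answer
```

Because each component gets its own span, a latency spike immediately shows whether retrieval or inference is to blame.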
Layer 2: Quality Observability
This is where most teams fall short. The system is running, but is it running well?
Retrieval Quality
The most impactful quality metric for RAG systems. If retrieval is poor, everything downstream suffers.
Metrics:
- Retrieval relevance score. For each query, how relevant were the retrieved documents? This requires either automated relevance scoring (using a separate model to judge) or sampled human evaluation.
- Source diversity. Are retrieved documents coming from a healthy range of sources, or is one document dominating? Over-concentration suggests an indexing or embedding problem.
- Empty retrieval rate. What percentage of queries return no relevant documents? This indicates gaps in your knowledge base.
- Retrieval stability. For the same query, do you get similar documents over time? Instability suggests embedding drift or index corruption.
How to implement: Log every retrieval result with the query, the documents returned, and their similarity scores. Run a nightly evaluation job that samples queries and scores retrieval quality. Alert on trend changes, not individual scores.
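The logging half of that can be sketched as a structured append-only record per retrieval; field names and the JSONL file are illustrative choices, not a prescribed schema:

```python
import json
import time

def log_retrieval(query, results, log_file="retrieval.jsonl"):
    """Append one structured retrieval record for the nightly eval job.
    `results` is a list of (doc_id, similarity_score) pairs."""
    record = {
        "ts": time.time(),
        "query": query,
        "docs": [{"id": d, "score": s} for d, s in results],
        "empty": len(results) == 0,  # feeds the empty-retrieval-rate metric
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The nightly job can then sample from this log, re-score the documents, and compute empty-retrieval rate and source diversity without touching the live system.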
Output Quality
Measuring whether the AI's responses are good. This is inherently harder than measuring operational metrics, but essential.
Automated signals:
- Confidence scores. If your system produces confidence estimates, track them over time. Declining average confidence suggests the system is encountering queries it's less equipped to handle.
- Output length distribution. Sudden changes in average response length can indicate prompt issues, retrieval problems, or model behaviour changes.
- Hallucination detection. Automated checks that verify claims in the AI output against the retrieved source documents. Not perfect, but catches obvious hallucinations.
- Refusal rate. How often does the AI decline to answer? Too low suggests inadequate guardrails. Too high suggests overly conservative constraints.
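As one example of an automated signal, output length shift can be flagged by comparing today's mean length against baseline variation. This is a coarse heuristic, not a statistical test; the z-limit of 3 is an assumed starting point to tune:

```python
import statistics

def length_shift_alert(baseline_lengths, todays_lengths, z_limit=3.0):
    """Flag a shift in mean output length relative to baseline variation."""
    mu = statistics.mean(baseline_lengths)
    sigma = statistics.stdev(baseline_lengths)
    today = statistics.mean(todays_lengths)
    z = abs(today - mu) / sigma if sigma else 0.0
    return {"baseline_mean": mu, "today_mean": today, "alert": z > z_limit}
```

The same pattern works for confidence scores and refusal rates: establish a baseline distribution, then alert on deviation rather than on any fixed absolute value.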
Human feedback signals:
- Thumbs up/down. Simple but effective. Track the ratio over time.
- Override rate. In systems where users can edit or override AI output, track how often they do and by how much.
- Escalation rate. How often do users escalate to manual processing because the AI output wasn't useful?
34% of enterprise AI systems have no quality monitoring beyond user complaints. (Source: Gartner, AI Operations Survey, Q2 2024)
Quality Alerting
Quality metrics are noisier than operational metrics. Set alerts on trends, not individual data points:
- Rolling average degradation. If the 7-day rolling average of user satisfaction drops more than 10%, investigate.
- Segment-specific drops. Quality might be fine overall but degrading for specific query types, user groups, or document categories. Monitor segments, not just aggregates.
- New content impact. When new documents are added to the knowledge base, monitor whether retrieval and output quality change. New content can dilute or improve results.
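The rolling-average rule above can be sketched directly; `daily_ratios` is assumed to be one thumbs-up ratio per day, newest last, and the 10% drop threshold matches the first bullet:

```python
def satisfaction_alert(daily_ratios, window=7, drop_pct=10.0):
    """Alert when the current 7-day rolling average falls more than
    drop_pct percent below the previous window's average."""
    if len(daily_ratios) < 2 * window:
        return None  # not enough history to compare trends
    recent = sum(daily_ratios[-window:]) / window
    prior = sum(daily_ratios[-2 * window:-window]) / window
    drop = (prior - recent) / prior * 100 if prior else 0.0
    return {"recent": recent, "prior": prior, "alert": drop > drop_pct}
```

Running the same function per segment (query type, user group, document category) covers the second bullet: an aggregate can look healthy while one segment quietly degrades.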
Layer 3: Behavioural Observability
How is the system's behaviour changing over time? This is the early warning system for problems that haven't manifested as quality degradation yet.
Data Drift
Your knowledge base changes. Documents are added, updated, removed. The distribution of queries changes as users discover new use cases or as business needs shift.
What to monitor:
- Embedding distribution shift. Are new document embeddings clustering differently than existing ones? This could indicate a change in document type or quality.
- Query distribution shift. Are users asking different types of questions than they were a month ago? This might require prompt adjustments or knowledge base expansion.
- Source freshness. Are responses drawing on up-to-date information, or are they relying on stale documents because newer versions weren't properly indexed?
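One simple (and admittedly coarse) way to detect embedding distribution shift is to compare the centroid of a new document batch against the existing corpus centroid; the 0.9 similarity floor is an assumed tuning point, and real drift detectors use richer statistics:

```python
import math

def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def embedding_shift_alert(existing, new_batch, min_similarity=0.9):
    """Low centroid similarity suggests the new documents cluster
    differently from the corpus (type or quality change)."""
    sim = cosine_similarity(centroid(existing), centroid(new_batch))
    return {"centroid_similarity": sim, "alert": sim < min_similarity}
```

Query distribution shift can be monitored the same way, using query embeddings instead of document embeddings.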
Model Behaviour Changes
Model providers update their models. Sometimes with notice. Sometimes without. These updates can change your system's behaviour in subtle ways.
What to monitor:
- Output consistency. Run a set of benchmark queries daily. Compare outputs to historical baselines. Significant changes indicate a model update.
- Token usage patterns. Model updates can change how verbose the model is, affecting both cost and user experience.
- Formatting changes. Model updates sometimes change how outputs are structured. If your system parses model output, formatting changes can break downstream processing.
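The daily benchmark comparison can be sketched with a plain text-similarity check; the 0.85 floor is an assumed threshold, and a production version would compare semantics as well as surface text:

```python
import difflib

def consistency_check(benchmark_outputs, baseline_outputs, min_ratio=0.85):
    """Compare today's benchmark-query outputs against stored baselines.
    A low average similarity ratio suggests the upstream model changed."""
    ratios = [
        difflib.SequenceMatcher(None, base, out).ratio()
        for base, out in zip(baseline_outputs, benchmark_outputs)
    ]
    avg = sum(ratios) / len(ratios)
    return {"avg_similarity": avg, "alert": avg < min_ratio}
```

Some day-to-day variation is normal with non-deterministic models, so alert on a sustained drop across the whole suite, not on a single changed answer.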
Usage Patterns
How users interact with the system reveals its health:
- Session depth. Are users asking follow-up questions (good engagement) or abandoning after one query (poor results)?
- Feature adoption. Which capabilities are used? Which are ignored? Ignored features are either undiscoverable or unhelpful.
- Time-to-resolution. For task-oriented AI, how long from first query to task completion? Increasing time suggests declining usefulness.
The Observability Stack
Putting it together:
| Layer | Tools | Frequency |
|---|---|---|
| Operational | Datadog/Grafana + custom spans | Real-time |
| Quality - automated | Custom evaluation pipeline | Daily batch + real-time sampling |
| Quality - human | Feedback widgets + periodic review | Continuous collection, weekly analysis |
| Behavioural | Custom drift detection | Daily batch |
When to Intervene
Not every alert requires action. Here's a decision framework:
Immediate intervention (minutes):
- Error rate spike above 5%
- Complete retrieval failure
- Model API outage
- Security alert (data exposure, prompt injection detected)
Same-day investigation (hours):
- Sustained latency increase
- Sudden confidence score drop
- Unusual token usage spike
- User escalation rate increase
Scheduled review (days):
- Gradual quality metric decline
- Query distribution shift
- Embedding drift detection
- Usage pattern changes
Periodic assessment (weekly/monthly):
- Overall quality trends
- Cost optimisation opportunities
- Feature usage analysis
- Model comparison evaluations
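A framework like this is only useful if alerts are routed automatically. A minimal sketch, with hypothetical alert-type names mapped to the four tiers above; note that unknown alerts default to the slowest tier rather than being silently dropped:

```python
# Hypothetical mapping from alert types to the intervention tiers above.
SEVERITY = {
    "error_rate_spike": "immediate",
    "retrieval_failure": "immediate",
    "model_api_outage": "immediate",
    "security_alert": "immediate",
    "latency_increase": "same_day",
    "confidence_drop": "same_day",
    "token_usage_spike": "same_day",
    "escalation_rate_increase": "same_day",
    "quality_decline": "scheduled",
    "query_drift": "scheduled",
    "embedding_drift": "scheduled",
    "usage_pattern_change": "scheduled",
}

def route_alert(alert_type):
    """Return the intervention tier for an alert type."""
    return SEVERITY.get(alert_type, "periodic")
```

Encoding the tiers in code (rather than tribal knowledge) also makes the framework auditable: you can review which alerts fired and whether they were handled within their tier's window.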
Actionable Takeaways
- Start with operational monitoring. Instrument every component. This is the foundation.
- Add quality monitoring within the first month of production. Don't wait for user complaints to tell you quality is degrading.
- Log everything. Every query, every retrieval result, every model response, every user action. Storage is cheap. Debugging without logs is expensive.
- Monitor trends, not individual data points. AI quality is inherently variable. Trends reveal real problems. Individual outliers don't.
- Build a benchmark suite. A set of queries with known-good answers that you run regularly. This is your canary in the coal mine.
- Budget time for observability. It's not overhead. It's the difference between a production system and a demo that happens to be live.
