
Why Your AI Needs an SRE

AI systems need site reliability engineering. Model drift, latency spikes, integration failures. How to apply SRE principles to production AI.
8 September 2025·8 min read
John Li
Chief Technology Officer
Your AI system works in staging. It passed all the tests. The demo was flawless. Then it hits production and you discover that AI systems fail in ways that traditional software does not. Model drift, prompt injection, cost runaway, latency spikes under load, and integration failures that look like AI failures. These are not bugs. They are operational realities that need an engineering discipline to manage.

AI Fails Differently

Traditional software failures are mostly deterministic. A null pointer exception, a database connection timeout, a memory leak. The failure is reproducible, the cause is identifiable, and the fix is testable.
AI system failures are often probabilistic. The model returns a plausible but incorrect answer. The response quality degrades gradually as the input distribution shifts. The system works perfectly for 99% of queries and fails catastrophically on the 1% that matter most.
This difference in failure modes means that traditional monitoring and incident response are insufficient. You need the same rigour, but applied to a different set of signals.
43%
of production AI systems experience meaningful performance degradation within 6 months of deployment without active monitoring
Source: Google Cloud, ML Operations Survey, 2025

The Failure Modes

Model Drift

The model was trained or configured for a specific input distribution. Over time, the real-world inputs shift. A document classification system trained on 2024 invoice formats encounters 2025 formats with different layouts. A customer service AI trained on last year's product line cannot handle questions about this year's.
Drift is insidious because the system does not crash. It just gets gradually worse. Without monitoring, drift shows up as user complaints, not system alerts.

Cost Runaway

AI inference has variable costs based on token volume, model selection, and API pricing. A workflow that costs $50/day under normal load can cost $5,000/day during a spike. A prompt that generates verbose responses burns more tokens than one that generates concise ones. A retry loop on a failed API call can multiply costs by 10x before anyone notices.
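A minimal sketch of the protections this implies: a daily budget check plus a hard retry cap. The class name, prices, and limits are illustrative, not real API pricing.

```python
class CostGuard:
    """Track estimated spend and cap retries, so a failing call
    cannot silently multiply costs. Illustrative sketch only."""

    def __init__(self, daily_budget_usd, max_retries=2):
        self.daily_budget_usd = daily_budget_usd
        self.max_retries = max_retries
        self.spend_today = 0.0

    def record(self, tokens, usd_per_1k_tokens=0.002):
        # usd_per_1k_tokens is a placeholder rate, not a real price.
        self.spend_today += tokens / 1000 * usd_per_1k_tokens
        return self.spend_today

    def over_budget(self):
        return self.spend_today >= self.daily_budget_usd

    def call_with_retries(self, fn):
        # A bounded retry loop: without the cap and the budget check,
        # retries on a failed call are exactly how the 10x surprise happens.
        for attempt in range(self.max_retries + 1):
            if self.over_budget():
                raise RuntimeError("daily budget exhausted")
            try:
                return fn()
            except Exception:
                if attempt == self.max_retries:
                    raise
```

The important design choice is that the budget check runs before every attempt, not once per request, so a spike stops mid-flight rather than at the next reporting cycle.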

Latency Spikes

AI systems are network-dependent in ways that traditional systems are not. A model API that averages 200ms response time can spike to 2,000ms during provider-side load events. When your AI system is embedded in a user-facing workflow, that spike turns into a workflow that feels broken.
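One way to keep a provider-side spike from breaking the workflow is a hard deadline with a degraded fallback. A sketch, assuming `fn` is your model call and `fallback` is whatever the degraded path returns (a cached answer, a simpler model, an honest "try again" message):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_with_deadline(fn, timeout_s=1.0, fallback=None):
    """Bound the latency of a model call: if the provider spikes from
    200ms to 2,000ms, return the fallback instead of hanging the user."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return fallback
    finally:
        # Don't block waiting for the slow call to finish.
        pool.shutdown(wait=False)
```

The deadline belongs on the full pipeline, not just the model call, for the same reason the latency SLO does.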

Integration Failures That Look Like AI Failures

This is the one that catches teams off guard. The AI model is working perfectly. The integration layer between the model and the enterprise system has a data mapping error. The user sees an AI output that is wrong. The incident report says "AI failure." The root cause is a data pipeline issue.
Without proper observability, every AI-adjacent failure gets attributed to "the AI," which erodes trust in the system faster than the actual error rate warrants.

Prompt Injection and Adversarial Inputs

Production AI systems receive inputs from users, and users are creative. Prompt injection (inputs designed to override the system's instructions), adversarial inputs (inputs designed to produce harmful outputs), and edge case inputs (inputs the system was not designed for) are operational realities, not theoretical risks.
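A deliberately naive screen for the most obvious injection attempts, to make the idea concrete. Real defences layer model-side guardrails, output filtering, and privilege separation; these patterns are illustrative, not exhaustive, and a pattern list alone is not a defence.

```python
import re

# Illustrative patterns only: real injection attempts are far more varied.
SUSPICIOUS_PATTERNS = [
    r"ignore (all|any|previous|prior) instructions",
    r"disregard (the|your) system prompt",
    r"you are now",
]

def looks_like_injection(user_input: str) -> bool:
    """Flag inputs that match known injection phrasings for review.
    A first tripwire, not a complete defence."""
    text = user_input.lower()
    return any(re.search(p, text) for p in SUSPICIOUS_PATTERNS)
```

Flagged inputs are worth logging even when they are blocked: the rate of injection attempts is itself an operational signal.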

SRE Principles for AI

Site Reliability Engineering (SRE), the discipline Google pioneered for managing complex distributed systems, translates well to AI operations with some adaptations.

Define SLOs for AI

Traditional SLOs measure availability and latency. AI SLOs add quality metrics:
  • Accuracy SLO: What percentage of AI outputs meet the quality threshold? Measured through sampling and automated evaluation.
  • Latency SLO: P50, P95, P99 response times for AI inference. These should include the full pipeline, not just the model call.
  • Cost SLO: Maximum cost per query, per day, per month. With upper bounds that trigger alerts, not just reporting.
  • Freshness SLO: For RAG systems, how current is the knowledge base? A system answering questions about last month's policies with last year's data is a quality failure.
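The four SLO types above can be evaluated together in one daily check. A sketch, with illustrative thresholds rather than recommended targets:

```python
from dataclasses import dataclass

@dataclass
class SLOReport:
    """One reporting window's metrics, checked against the four AI SLOs."""
    accuracy: float          # fraction of sampled outputs passing evaluation
    p95_latency_ms: float    # full pipeline, not just the model call
    cost_per_query_usd: float
    kb_age_days: int         # freshness of the RAG knowledge base

    def breaches(self, accuracy_slo=0.95, p95_slo_ms=1500,
                 cost_slo_usd=0.05, freshness_slo_days=30):
        # Returns the list of SLOs currently in breach.
        out = []
        if self.accuracy < accuracy_slo:
            out.append("accuracy")
        if self.p95_latency_ms > p95_slo_ms:
            out.append("latency")
        if self.cost_per_query_usd > cost_slo_usd:
            out.append("cost")
        if self.kb_age_days > freshness_slo_days:
            out.append("freshness")
        return out
```

A non-empty breach list is what should page someone, or at minimum open the weekly operational review.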

Monitor the Full Pipeline

AI observability is not just model monitoring. It is pipeline monitoring:
Input monitoring. Track the distribution of inputs over time. When the input distribution shifts, quality will follow. This is the early warning for drift.
Model monitoring. Track response quality, latency, and cost at the model level. Compare against baselines. Alert on deviation.
Integration monitoring. Track the data flows between the AI system and enterprise systems. Every data transformation, every API call, every mapping. This is where the invisible failures live.
Output monitoring. Sample and evaluate AI outputs against quality criteria. Automated evaluation catches obvious failures. Human evaluation catches subtle ones. Both are necessary.
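For the input-monitoring step, one common and simple drift signal is the Population Stability Index over category frequencies, comparing today's inputs against a baseline window. A rough convention: below 0.1 is stable, 0.1 to 0.25 is moderate shift, above 0.25 is significant drift.

```python
import math

def population_stability_index(baseline, current):
    """PSI between two categorical distributions, given as
    dicts of category -> count. An early-warning drift signal."""
    total_b = sum(baseline.values())
    total_c = sum(current.values())
    psi = 0.0
    for key in set(baseline) | set(current):
        # Smooth zero counts so the log stays defined.
        p = max(baseline.get(key, 0) / total_b, 1e-6)
        q = max(current.get(key, 0) / total_c, 1e-6)
        psi += (q - p) * math.log(q / p)
    return psi
```

Because PSI works on input categories rather than output quality, it fires before users notice anything, which is exactly the point of input monitoring.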

Error Budgets for AI

The SRE concept of error budgets works for AI with one modification: AI error budgets should include quality errors, not just availability errors.
If your accuracy SLO is 95%, your error budget is 5%. When the error budget is consumed (quality drops below 95%), new feature development pauses and the team focuses on reliability. This prevents the common pattern of shipping new AI capabilities while existing ones degrade.
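The arithmetic is simple enough to sketch. With a 95% accuracy SLO, the budget is 5% of the window's queries; observed quality errors consume it the same way availability errors do:

```python
def error_budget_remaining(slo_target, observed_quality, window_queries):
    """Remaining error budget in queries for a quality SLO.
    Negative means the budget is exhausted: pause feature work."""
    budget = (1 - slo_target) * window_queries
    consumed = (1 - observed_quality) * window_queries
    return budget - consumed
```

At 97% observed accuracy over 10,000 queries against a 95% SLO, 200 queries of budget remain; at 93%, the budget is 200 queries overspent and reliability work takes priority.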

Runbooks for AI Incidents

AI incidents need specific runbooks:
Model degradation runbook. Quality metrics declining. Steps: identify affected queries, check input distribution for drift, evaluate model performance on benchmark set, escalate to model team if performance has degraded, roll back to previous model version if available.
Cost spike runbook. AI spend exceeding budget. Steps: identify the source (specific workflow, prompt, model), check for retry loops, check for input volume spikes, implement rate limiting if necessary, notify stakeholders.
Integration failure runbook. AI outputs incorrect due to data pipeline issues. Steps: verify model output in isolation, check data pipeline integrity, identify the mapping or transformation error, fix and validate, communicate root cause (not "AI failure").
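The first step of the integration-failure runbook, verifying the model in isolation, can itself be automated. A sketch, where `model_fn` and `pipeline_fn` are placeholders for your raw model call and the full pipeline around it:

```python
def localise_failure(model_fn, pipeline_fn, test_input, expected):
    """Run the model alone, then through the pipeline, to tell an AI
    failure apart from a data-mapping one before writing the incident
    report. Illustrative sketch of the runbook's first step."""
    model_ok = model_fn(test_input) == expected
    pipeline_ok = pipeline_fn(test_input) == expected
    if model_ok and not pipeline_ok:
        return "integration failure"   # root cause: pipeline, not the AI
    if not model_ok:
        return "model failure"
    return "no failure reproduced"
```

Running this check before the incident report is written is what keeps a data-pipeline bug from being filed, and remembered, as "AI failure".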
60%
reduction in mean time to resolution for AI incidents when teams have AI-specific runbooks vs generic incident response
Source: RIVER, operational data across enterprise AI deployments, 2024-2025

The Team Model

AI reliability does not require a dedicated SRE team for most enterprises. It requires SRE thinking applied to AI operations:
Someone owns AI reliability. Not as a side project. As a defined responsibility. This person monitors the AI SLOs, maintains the runbooks, and leads incident response for AI-related issues.
The AI team and the platform team collaborate. AI reliability sits at the intersection of AI expertise (understanding model behaviour) and platform expertise (understanding infrastructure, monitoring, incident response). Neither team can do it alone.
Operational review is regular. Weekly review of AI SLOs, cost trends, quality metrics, and incident patterns. This is the feedback loop that catches drift before users do.

Production AI is not a deployment milestone. It is the start of an operational discipline. The organisations that apply SRE principles to their AI systems get reliability, cost control, and user trust. The ones that do not get incidents, budget surprises, and users who stop trusting the AI. The engineering is not glamorous. It is the difference between AI that works in demos and AI that works in production.