Building AI is the easy part. Running it is where organisations succeed or fail. AI operations, the discipline of monitoring, maintaining, updating, and governing production AI systems, is the capability gap that separates organisations with AI demos from organisations with AI value. This is the playbook.
Why AI Operations Matters
Traditional software, once deployed, is relatively stable. It does what it was built to do until someone changes the code. AI systems are different. They degrade. Models drift as the world changes and the training data becomes stale. Data quality fluctuates. User behaviour evolves. Upstream APIs change. An AI system that works brilliantly on deployment day can quietly deteriorate over weeks and months without anyone noticing.
AI operations exists to prevent this. It's the set of practices, tools, and roles that keep production AI systems performing at the level they were designed for.
62% of enterprise AI capabilities show measurable quality degradation within six months of deployment without active operations. (Source: McKinsey, State of AI in Enterprise, 2025)
The Four Pillars
1. Monitoring
AI monitoring goes beyond uptime and response time. You need to track:
Output quality. Are the AI's outputs still accurate, relevant, and useful? This requires domain-specific evaluation metrics, not just generic scores. For document summarisation, that might be completeness and factual accuracy. For classification, precision and recall. For generation, relevance and hallucination rate.
Model performance. Latency, throughput, token usage, cost per query. These are the operational metrics that affect both user experience and budget.
Data health. Are the data sources the AI depends on still clean, complete, and current? Degrading data quality is the most common cause of degrading output quality.
Drift detection. Is the distribution of inputs changing? Are the outputs shifting? Drift detection catches problems early, before they become visible to users.
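One common way to sketch input drift detection is a population stability index (PSI) over binned feature values. The bin count, smoothing constant, and the rule-of-thumb 0.2 threshold below are illustrative assumptions, not a standard your tooling must follow.

```python
import math
from collections import Counter

def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples.

    Bins are derived from the baseline's range; a PSI above roughly 0.2
    is a common rule-of-thumb signal that the input distribution shifted.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0

    def proportions(xs):
        counts = Counter(min(max(int((x - lo) / width), 0), bins - 1) for x in xs)
        # Smooth empty bins so the log term below is always defined.
        return [(counts.get(b, 0) + 0.5) / (len(xs) + 0.5 * bins) for b in range(bins)]

    p, q = proportions(baseline), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

stable = psi(list(range(100)), list(range(100)))       # identical inputs: near zero
shifted = psi(list(range(100)), list(range(50, 150)))  # shifted inputs: well above 0.2
```

Run weekly against a frozen baseline sample and alert when the score crosses your chosen threshold; the same shape of check works for output distributions.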
How to implement:
- Automated evaluation pipelines that run daily or weekly against representative test sets
- Real-time dashboards for operational metrics (latency, cost, throughput)
- Alerting thresholds for quality metrics, with escalation paths
- Monthly trend analysis to catch gradual degradation
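The alerting step above can be sketched as a comparison of evaluation results against per-metric floors. The metric names and threshold values here are hypothetical, standing in for whatever your domain experts define as "good enough".

```python
def check_thresholds(metrics, thresholds):
    """Return the names of metrics that fell below their alert floor."""
    return [name for name, floor in thresholds.items()
            if metrics.get(name, 0.0) < floor]

# Hypothetical daily evaluation results for a summarisation capability.
daily = {"factual_accuracy": 0.91, "completeness": 0.78, "relevance": 0.88}
floors = {"factual_accuracy": 0.90, "completeness": 0.85, "relevance": 0.80}

breaches = check_thresholds(daily, floors)
```

Anything returned here feeds the escalation path; an empty list means the daily run passed.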
2. Maintenance
AI systems need regular maintenance: not just when something breaks, but as ongoing operational hygiene.
Model updates. Foundation models release new versions regularly. Each update can change behaviour in subtle or dramatic ways. Production AI systems should pin model versions and test new versions in staging before deployment.
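Pinning can be as simple as keeping model identifiers in version-controlled configuration rather than scattered through code. The model identifiers and environment names below are illustrative placeholders, not real model versions.

```python
# Version-controlled pins (model identifiers are illustrative placeholders).
MODEL_PINS = {
    "production": "provider-large-2025-01-15",  # pinned snapshot: reproducible behaviour
    "staging": "provider-large-2025-05-01",     # candidate version under evaluation
}

def resolve_model(environment):
    """Look up the pinned model for an environment; fail loudly if unpinned."""
    if environment not in MODEL_PINS:
        raise ValueError(f"No model pinned for environment {environment!r}")
    return MODEL_PINS[environment]
```

Promoting the staging candidate to production is then a reviewed one-line change with a clear audit history, rather than an invisible upstream update.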
Data refreshes. Knowledge bases, embeddings, and reference data need regular updates. Documents are added, removed, or modified. Organisations change. The AI's knowledge must stay current.
Prompt refinement. Production prompts evolve based on observed performance. Edge cases surface new failure modes. User feedback reveals misunderstandings. Prompt engineering is not a one-time activity.
Dependency management. AI systems depend on APIs, data sources, and infrastructure that change independently. Regular dependency audits prevent surprises.
How to implement:
- Scheduled model version reviews (monthly)
- Automated data pipeline monitoring with freshness alerts
- Prompt version control with A/B testing for refinements
- Quarterly dependency audits
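The freshness alerts above can be sketched as a comparison of each source's last-update timestamp against a maximum allowed age. The source names and the seven-day limit are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def stale_sources(last_updated, max_age, now=None):
    """Return the data sources whose last update is older than max_age.

    last_updated: mapping of source name -> timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return [name for name, ts in last_updated.items() if now - ts > max_age]

checkpoint = datetime(2025, 6, 1, tzinfo=timezone.utc)
sources = {
    "policy_docs": datetime(2025, 5, 30, tzinfo=timezone.utc),  # two days old
    "product_kb": datetime(2025, 4, 20, tzinfo=timezone.utc),   # six weeks old
}
stale = stale_sources(sources, max_age=timedelta(days=7), now=checkpoint)
```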
3. Scaling
AI systems that work for ten users often break for a thousand. Scaling AI operations requires attention to:
Compute scaling. Inference costs scale with usage. Auto-scaling infrastructure, model routing (using cheaper models for simpler queries), and caching strategies manage costs as usage grows.
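Model routing can be sketched as a cheap decision that picks which tier handles a query. The word-count heuristic, marker list, and tier names below are deliberately naive assumptions, a stand-in for whatever complexity signal you actually trust.

```python
def route(query):
    """Route short, simple-looking queries to a cheaper model (naive heuristic)."""
    complex_markers = ("compare", "analyse", "summarise", "why")
    if len(query.split()) > 30 or any(m in query.lower() for m in complex_markers):
        return "large-model"  # hypothetical expensive tier
    return "small-model"      # hypothetical cheap tier

simple = route("What are your opening hours?")
hard = route("Compare our Q1 and Q2 churn drivers")
```

Even a crude router like this can cut inference spend substantially when most traffic is simple; replace the heuristic with a small classifier as volume justifies it.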
Knowledge scaling. As document volumes grow, retrieval quality can degrade. Indexing strategies, chunking approaches, and retrieval pipelines need to scale with the corpus.
Operational scaling. The team and processes that manage ten AI capabilities won't manage fifty. AI operations needs to scale its own practices: more automation, better tooling, clearer runbooks.
How to implement:
- Cost modelling per capability with projected growth curves
- Load testing at 3x current usage before scaling events
- Automated scaling policies with cost guardrails
- Operational playbooks for each capability type
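A cost guardrail from the list above can be sketched as a pre-scaling check that projects spend before approving the event. The linear-scaling assumption and the budget figures are illustrative.

```python
def within_cost_guardrail(current_daily_cost, scale_factor, daily_budget):
    """Approve a scaling event only if projected cost stays under budget.

    Assumes cost scales roughly linearly with traffic, which holds for
    per-token pricing but not for reserved capacity.
    """
    return current_daily_cost * scale_factor <= daily_budget

ok = within_cost_guardrail(120.0, scale_factor=3.0, daily_budget=500.0)
blocked = within_cost_guardrail(200.0, scale_factor=3.0, daily_budget=500.0)
```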
4. Governance
AI governance in operations is different from AI governance in strategy. Strategic governance asks "should we build this?" Operational governance asks "is this running safely?"
Continuous compliance. Automated checks that verify AI outputs against policy constraints. Content safety, data handling, decision boundaries, and access control should be verified continuously, not periodically.
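A continuous compliance check can be sketched as a set of pattern rules applied to every output before it leaves the system. The rule names and regex patterns below are hypothetical, NZ-flavoured examples, not a real policy set.

```python
import re

# Hypothetical policy constraints, checked on every output rather than
# in a periodic review.
POLICY_CHECKS = {
    "no_ird_numbers": re.compile(r"\b\d{3}-\d{3}-\d{3}\b"),  # illustrative IRD-style pattern
    "no_internal_urls": re.compile(r"https?://intranet\."),
}

def policy_violations(output):
    """Return the names of policy checks the output fails."""
    return [name for name, pattern in POLICY_CHECKS.items() if pattern.search(output)]

flagged = policy_violations("Your IRD number is 123-456-789")
clean = policy_violations("All good here")
```

Pattern rules only catch what you can express as patterns; pair them with sampled human review for the rest.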
Audit trails. Every AI decision that affects a human outcome needs to be traceable. What input came in, what the AI did with it, what output was produced, and what happened next. This is a regulatory requirement in many sectors and a best practice in all of them.
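An audit trail entry can be sketched as one append-only JSON-lines record per decision: what came in, what was produced, and what happened next. The field names and the input-hashing choice are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(capability, user_input, output, downstream_action):
    """Build one JSON-lines audit entry for an AI decision.

    Hashing the input avoids storing raw personal data while keeping the
    entry matchable against source systems that hold the original.
    """
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "capability": capability,
        "input_sha256": hashlib.sha256(user_input.encode()).hexdigest(),
        "output": output,
        "downstream_action": downstream_action,
    })

entry = audit_record("loan-triage", "application #841", "refer to human", "queued_for_review")
```

Appending each record to durable, access-controlled storage gives you the traceability regulators ask for without a heavyweight platform.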
Incident management. When AI systems produce harmful, incorrect, or unexpected outputs, the response needs to be faster than traditional software incident management. AI incidents can affect trust in ways that are hard to recover from.
Human oversight. Even highly automated AI systems need human oversight touchpoints. Regular output reviews, exception handling, and quality audits ensure that automation doesn't become neglect.
The AI Operations Team
Who does this work? In most organisations, AI operations is a shared responsibility:
AI engineers build monitoring, evaluation, and maintenance tooling. They're responsible for the technical infrastructure of AI operations.
Domain experts evaluate output quality and identify when the AI is producing results that are technically correct but practically wrong. They define "good" in ways that engineers can measure.
Operations staff manage day-to-day monitoring, respond to alerts, and execute maintenance procedures. They're the first line of response when something goes wrong.
Leadership sets quality standards, approves governance frameworks, and makes decisions about AI investment based on operational data.
In smaller organisations (which describes most NZ enterprises), these roles overlap. The key is that someone is explicitly responsible for each function, even if one person covers multiple roles.
Common Failure Modes
"Deploy and forget." The AI works on launch day. Nobody checks it again. Six months later, it's producing outdated, inaccurate, or irrelevant outputs. Users lose trust. The initiative is labelled a failure.
"Alert fatigue." Too many monitoring alerts, too loosely configured. The team ignores them. A real problem goes unnoticed because it's buried in noise.
"Manual governance." Governance is a quarterly review meeting, not a continuous process. Issues accumulate between reviews. By the time they're identified, the damage is done.
"Single point of knowledge." One person understands how the AI system works. When they leave or get busy, operations degrade. Documentation, runbooks, and shared knowledge prevent this.
Getting Started
If your organisation has AI in production but no formal AI operations practice:
- Instrument what you have. Add monitoring to existing AI capabilities. Start with output quality metrics and cost tracking.
- Define quality thresholds. What does "good enough" look like for each capability? Set thresholds and alert when they're breached.
- Schedule maintenance. Monthly model reviews, weekly data freshness checks, quarterly governance audits. Put them on the calendar.
- Document everything. Runbooks for common issues, architecture diagrams, dependency maps. If it's in one person's head, it's not operational.
- Assign ownership. Someone needs to be responsible for AI operations, even if it's 20% of their role. Unowned systems degrade.
AI operations isn't glamorous. It's monitoring dashboards, maintenance schedules, governance audits, and incident runbooks. It's the work that makes AI sustainable. Without it, every AI deployment is a demo with an expiry date.
With it, AI compounds.

