
How We Handle Production Incidents

Incident response for enterprise teams. Communication, triage, postmortem. No heroics required.
15 August 2022
John Li
Chief Technology Officer
At 2am on a Tuesday, you get a message: the system is down. What happens in the next thirty minutes determines whether this becomes a contained incident or a cascading disaster. We've handled enough production incidents across our client portfolio to know that the difference isn't technical skill. It's process.

What You Need to Know

  • Incident response quality is determined before the incident happens, not during it. Preparation beats improvisation every time.
  • Communication matters more than the speed of the fix. Stakeholders can tolerate downtime. They can't tolerate silence.
  • Postmortems that blame people produce cover-ups. Postmortems that examine systems produce improvements.
  • Heroics are a sign of a broken process, not a strong team.

The First Thirty Minutes

When something breaks in production, adrenaline kicks in. The instinct is to start fixing immediately. Resist that instinct. The first thirty minutes should follow a sequence that's been agreed on before the incident happens.
Minute 0-5: Confirm and classify. Is this real? What's the impact? How many users are affected? Is data at risk? We classify incidents into three levels: P1 (service down, data at risk), P2 (significant degradation, workaround available), P3 (minor issue, limited impact). The classification determines the response.
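The P1/P2/P3 triage above can be sketched as a small decision function. The level names and criteria follow this article; the field names (`service_down`, `data_at_risk`, and so on) are illustrative, not part of any real tooling.

```python
# Minimal sketch of the P1/P2/P3 classification described above.
# Field names are hypothetical; adapt them to your own incident form.

from dataclasses import dataclass

@dataclass
class Impact:
    service_down: bool
    data_at_risk: bool
    degraded: bool
    workaround_available: bool

def classify(impact: Impact) -> str:
    """Map observed impact to an incident priority level."""
    if impact.service_down or impact.data_at_risk:
        return "P1"  # service down or data at risk
    if impact.degraded and impact.workaround_available:
        return "P2"  # significant degradation, workaround available
    return "P3"      # minor issue, limited impact
```

Encoding the rules this way forces the first-five-minutes questions (Is it down? Is data at risk?) to be answered explicitly before anyone starts fixing.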
Minute 5-10: Communicate. Before you start fixing, tell people. The client, the team, anyone who needs to know. A short message: "We're aware of an issue affecting [description]. We're investigating. Next update in 30 minutes." That message buys you time and credibility.
Minute 10-30: Triage. Now start diagnosing. What changed recently? Check deployments, configuration changes, third-party service status. The most common cause of production incidents, by a wide margin, is something that changed. Find what changed and you've usually found the cause.
The worst incidents I've been involved in weren't the ones with the biggest technical problems. They were the ones where communication broke down. By the time someone sent an update, the client had assumed the worst and escalated to their board.
John Li
Chief Technology Officer

Communication Cadence

Once the incident is confirmed, communication runs on a fixed cadence regardless of progress. For P1 incidents, that's every 30 minutes. For P2, every hour. Even if the update is "still investigating, no change," send it.
The template is simple:
  • Status: Current state of the issue
  • Impact: What's affected, who's affected
  • Action: What we're doing right now
  • Next update: When to expect the next communication
This sounds bureaucratic during an emergency. It isn't. It prevents the thing that makes incidents worse: stakeholders calling, emailing, and messaging the people who should be focused on fixing the problem.
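The four-field template and the fixed cadence can be wired into a tiny helper so responders never have to compose updates from scratch. The 30-minute and 60-minute intervals come from this article; the function and its parameters are a hypothetical sketch.

```python
# Hypothetical helper rendering the Status/Impact/Action/Next-update
# template. Cadence values (P1 every 30 min, P2 hourly) follow the
# article; everything else is an assumption for illustration.

from datetime import datetime, timedelta

CADENCE_MINUTES = {"P1": 30, "P2": 60}

def status_update(priority: str, status: str, impact: str,
                  action: str, now: datetime) -> str:
    """Render one stakeholder update and compute when the next is due."""
    next_update = now + timedelta(minutes=CADENCE_MINUTES[priority])
    return (
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"Action: {action}\n"
        f"Next update: {next_update:%H:%M} UTC"
    )
```

Even a "still investigating, no change" update goes out on schedule; the helper just makes the deadline explicit.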

The Fix

Most production incidents fall into a small number of categories:
Deployment regression. A change in the latest release broke existing behavior. Rollback is usually the fastest fix; the root cause can then be investigated without time pressure.
Infrastructure failure. A server, database, or network component failed. Cloud providers have their own incident processes. Check their status pages first.
Third-party service outage. An external API or service your system depends on is down. You can't fix this, but you can implement graceful degradation to limit the impact.
Data issue. Bad data entered the system and is causing errors. This is often the hardest to diagnose because the symptoms can appear far from the cause.
62% of production incidents in cloud environments are caused by changes: deployments, config updates, and infrastructure modifications (Source: PagerDuty State of Digital Operations Report, 2021).
For each category, the approach is different, but the principle is the same: restore service first, investigate root cause second. A temporary fix that gets users working again is more valuable than a perfect fix that takes four hours.
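For the third-party outage category, "restore service first" usually means degrading gracefully instead of failing outright. Here is one minimal sketch: wrap the external call and fall back to a last-known-good value. The names `fetch_rates` and `CACHED_RATES` are hypothetical examples, not part of any real API.

```python
# Sketch of graceful degradation for a third-party outage:
# try the external call, fall back to cached data rather than
# failing the whole request. All names here are illustrative.

from typing import Callable, TypeVar

T = TypeVar("T")

def with_fallback(primary: Callable[[], T], fallback: T) -> T:
    """Restore service first: a stale answer beats an error page."""
    try:
        return primary()
    except Exception:
        # In real code, log and alert here so the degradation is visible.
        return fallback

CACHED_RATES = {"USD": 1.0, "EUR": 0.98}  # last known good values

def fetch_rates() -> dict:
    raise TimeoutError("third-party FX API is down")  # simulated outage

rates = with_fallback(fetch_rates, CACHED_RATES)
```

This is the temporary fix the paragraph above describes: users keep working on slightly stale data while the root cause is investigated without time pressure.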

Postmortems That Work

Every P1 and P2 incident gets a postmortem. Not optional. Not when we get around to it. Within 48 hours.
The postmortem has a specific structure:
  1. Timeline. What happened, when, in sequence. Facts only.
  2. Root cause. Not "John pushed a bad deploy" but "the deployment pipeline lacked automated testing for the integration that failed." Root causes are systemic, not personal.
  3. Contributing factors. What made the incident worse? Slow detection? Poor documentation? Missing monitoring?
  4. Action items. Specific, assigned, with deadlines. Not "improve monitoring" but "add alerting for API response time exceeding 500ms, assigned to Sarah, due Friday."
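The action-item rule above (specific, assigned, with deadlines) is easy to enforce mechanically: a postmortem isn't done until every item has an owner and a due date. A minimal sketch, with illustrative field names; the example item is the one from the text.

```python
# Sketch of the action-item discipline: no owner or no deadline
# means the postmortem is not complete. Field names are illustrative.

from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date

def postmortem_complete(items: list) -> bool:
    """Reject vague postmortems: every item needs an owner and a deadline."""
    return bool(items) and all(i.owner and i.due for i in items)

items = [ActionItem(
    description="Add alerting for API response time exceeding 500ms",
    owner="Sarah",
    due=date(2022, 8, 19),
)]
```

A checklist like this turns "improve monitoring" from an aspiration into something a reviewer can mark incomplete within the 48-hour window.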
The blameless part isn't a philosophical preference. It's practical. If people get blamed for incidents, they stop reporting near-misses. The near-misses are your early warning system. Lose those and you only find out about problems when they become incidents.

No Heroics

A culture of heroic incident response (one person staying up all night to save the day) feels good in the moment. It's also a sign that something is wrong.
If resolving incidents depends on one person's knowledge, you have a single point of failure. If it requires staying up all night, you have inadequate escalation procedures. If it happens regularly, you have systemic issues that postmortems should be catching.
Good incident response is boring. It follows a process. Multiple people can execute it. It doesn't require exceptional effort because the preparation was done beforehand.
That's what we aim for. Not heroes. Systems.