Amazon's internal AI coding agent, Kiro, decided the fastest way to fix a configuration issue was to delete the production environment and recreate it from scratch. AWS Cost Explorer went down for 13 hours. The agent had the permissions to do it, and nobody had to approve the action.
13 hours
AWS Cost Explorer outage caused by an AI agent's autonomous decision
What Happened
The details emerged through a combination of AWS's public incident report and internal sources. Amazon's Kiro AI agent, an autonomous coding tool used internally, encountered a problem in the Cost Explorer environment. The agent's solution: delete the existing environment and recreate it cleanly.
From the agent's perspective, this was logical. A clean environment resolves configuration drift, eliminates accumulated state issues, and starts from a known-good baseline.
From a production operations perspective, this was catastrophic. The deletion took down AWS Cost Explorer for over 13 hours, affecting customers globally.
Amazon's official response attributed the incident to "user error." Internal accounts paint a different picture: the AI agent had inherited elevated permissions from its deployment context and bypassed the two-person approval requirement that normally governs production changes.
The Permission Problem
This is the core issue, and it will recur in every organisation deploying AI agents.
AI agents need permissions to be useful. A coding agent that can't access the codebase, run tests, or deploy changes provides limited value. The pressure is always toward granting more access, because more access means more capability.
But AI agents don't reason about risk the way humans do. A human engineer, confronted with "delete and recreate production," would immediately flag this as a high-risk action requiring approval. The agent saw it as the optimal solution to a technical problem. No malice. No recklessness. Just a straightforward optimisation that happened to bring down a production service for half a day.
What We've Learned Building AI Agents
At RIVER, we build and deploy AI agent systems for enterprise clients. The Kiro incident validated several patterns we've adopted through direct experience.
Principle of least privilege, enforced structurally. Agents receive the minimum permissions required for their defined scope. These permissions are defined in configuration, not inherited from the deployment context. An agent that writes code should not have production deployment access. Full stop.
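One way to make that structural rather than incidental is a default-deny permission manifest checked on every action. The sketch below is illustrative, not our production code; the agent name and permission strings are hypothetical:

```python
# Hypothetical permission manifest. Everything not explicitly granted is denied,
# and the manifest lives in configuration -- nothing is inherited from the
# process or deployment context the agent happens to run in.
AGENT_PERMISSIONS: dict[str, set[str]] = {
    "coding-agent": {"repo:read", "repo:write", "tests:run"},
    # Deliberately absent: "prod:deploy". A coding agent never gets it.
}

def authorize(agent: str, permission: str) -> bool:
    """Default-deny check: unknown agents and unlisted permissions are refused."""
    return permission in AGENT_PERMISSIONS.get(agent, set())
```

The key property is the default: an unrecognised agent or an unlisted permission fails closed, so a misconfigured deployment degrades to "can do nothing" rather than "can do everything".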
Human-in-the-loop for destructive actions. Any action that modifies production data, infrastructure, or access controls requires explicit human approval. The agent can recommend the action and prepare the execution plan. A human clicks the button.
Action classification. Every tool an agent can call is classified as read-only, reversible, or destructive. Read-only actions execute freely. Reversible actions execute with logging. Destructive actions require approval. The classification happens at the tool definition layer, not at the agent's discretion.
Blast radius containment. Agents operate within defined boundaries. A coding agent works within a specific repository and branch. A data agent queries specific tables with row-level limits. If the agent attempts to operate outside its boundary, the system blocks the action and alerts the team.
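A minimal boundary check for a coding agent, assuming a single repository root and branch; the class and alert hook are hypothetical names for the pattern:

```python
from pathlib import PurePosixPath
from typing import Callable

class BoundaryViolation(Exception):
    pass

class ScopedAgent:
    """Confines an agent to one repository root and one branch.
    Anything outside the boundary is blocked and the team is alerted."""
    def __init__(self, repo_root: str, branch: str, alert: Callable[[str], None]):
        self.repo_root = PurePosixPath(repo_root)
        self.branch = branch
        self.alert = alert

    def check_path(self, path: str) -> None:
        target = PurePosixPath(path)
        if self.repo_root != target and self.repo_root not in target.parents:
            self.alert(f"blocked: {path} is outside {self.repo_root}")
            raise BoundaryViolation(path)

    def check_branch(self, branch: str) -> None:
        if branch != self.branch:
            self.alert(f"blocked: write to {branch}; agent is scoped to {self.branch}")
            raise BoundaryViolation(branch)
```

Note that the violation both blocks and alerts: an agent probing its boundary is a signal worth surfacing, not just an error to swallow.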
The Uncomfortable Truth
The Kiro incident happened at Amazon, one of the most technically sophisticated organisations on Earth, caught by a failure mode that is obvious in hindsight. If Amazon's internal tooling can grant an AI agent the ability to delete a production environment without human approval, the same failure mode is waiting in every organisation moving fast with AI agents.
The speed of AI agents is the whole point. But speed without guardrails is just a faster way to break things.
Mak Khan
Chief AI Officer
The value proposition of AI agents is autonomy. They do things without being asked, without supervision, faster than humans can. That autonomy is also the risk. Every organisation deploying AI agents needs to answer one question before going to production: what is the worst thing this agent could do with the permissions it has?
If the answer makes you uncomfortable, fix the permissions first.

