Why AI Agents Fail in Production (and the Math Nobody Mentions)

This week Microsoft 365 Copilot went down for a few hours, and a lot of companies discovered something uncomfortable: Copilot's uptime isn't actually covered under their standard Microsoft 365 SLA. The AI part, the part they'd started leaning on, had no guarantee behind it at all.

The same week, a round of reports landed showing that enterprise AI agents are failing in production at rates the demos never warned anyone about. I run an AI-first agency in Kashmir, so this is my inbox, not my Twitter feed, and it's the same conversation I keep having with clients who ask me for AI agents. I want to write down the part nobody puts on the sales slide: the math of why these things break.

The number that should scare everyone

Here's the figure that's been circulating, and it matches what I see on real projects. Researchers measured a 37% gap between how agents perform on lab benchmarks and how they perform in production. Reliability, measured as getting the same correct result across repeated runs, drops from around 60% in testing to about 25% once the agent is doing real work.

Sit with that. A system that looked like it worked six times out of ten in the demo works one time out of four when a real customer is on the other end. That's not a rounding error. That's the difference between a tool and a liability.

Why multi-step agents fall apart

The marketing word "agent" implies a thing that goes off and does a whole job for you: read the email, check the calendar, draft the reply, update the CRM, send the invoice. Five steps, hands off, magic. The problem is that those steps multiply, and not in your favour.

Say each step is 85% reliable. That sounds great. Most people would sign off on 85%. But a five-step workflow isn't 85% reliable. Assuming the steps fail independently, it's 0.85 to the power of five, which is about 44%. Stretch it to ten steps and you're at roughly 20%. Over ten steps, the agent that's "85% reliable at every step" completes the full job correctly one time in five.
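If you want to check the arithmetic yourself, here is a minimal sketch. It assumes every step fails independently and has the same reliability, which is a simplification; the function name chain_reliability is mine, not from any library.

```python
def chain_reliability(step_reliability: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step succeeds independently with the same probability.
    """
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return step_reliability ** steps


# Same per-step number, very different end-to-end result.
for steps in (1, 5, 10):
    print(f"{steps:>2} steps at 85% each: {chain_reliability(0.85, steps):.0%}")
```

Real steps are rarely independent, so treat the output as a rough guide rather than a forecast for any particular system.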

This is the compounding failure effect, and it's just arithmetic. Every handoff is a place where a small error gets passed forward and amplified. A human doing the same chain catches their own mistakes between steps because they have context and judgement. An agent confidently carries the mistake into step six and builds on it.

What this looks like on a real project

Last month a client wanted an "AI agent" to handle their entire customer support pipeline: categorise the message, look up the order, decide the resolution, issue the refund or reply, and log it. Five steps. Fully autonomous. They'd seen a demo somewhere and it looked done.

We built it the way the demo implied, end to end, just to measure it. On clean test tickets it was great. On a week of real messages, full of typos, half-questions, two complaints stuffed into one paragraph, and a person who just sent a voice note transcript, it made the right call start to finish on roughly a quarter of them. Not because any single step was bad. Because five "pretty good" steps stacked into one "mostly wrong" outcome.

And the failures weren't loud. That's the part that catches people. The agent didn't crash or throw an error you could catch in a log. It quietly categorised a refund request as a delivery question, looked up the wrong order with total confidence, and wrote a polite, fluent, completely wrong reply. A broken script tells you it's broken. A broken agent sounds exactly like a working one. In a small Kashmiri business where one annoyed customer tells the whole neighbourhood, that confident-and-wrong failure mode is worse than no automation at all.

How we actually ship it instead

So we don't ship the five-step autonomous agent. We break the chain and put a human at the one joint that matters. The AI does the boring, narrow, high-reliability parts: read the message, pull the order, suggest a category and a draft reply. Then it stops and hands a finished suggestion to a person, who approves or fixes it in two seconds and clicks send.
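The shape is easier to see in code. This is a hypothetical sketch, not our production system: Suggestion, prepare_suggestion, and handle_ticket are names I made up for illustration, and the lookup, categorise, draft, review, and send callables stand in for whatever model calls, order database, and inbox UI you actually have.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    """What the AI hands to the human: everything needed to approve at a glance."""
    message: str
    order_id: str
    category: str
    draft_reply: str


def prepare_suggestion(
    message: str,
    lookup_order: Callable[[str], str],
    categorise: Callable[[str], str],
    draft: Callable[[str, str, str], str],
) -> Suggestion:
    """Run only the narrow AI steps. Nothing here sends a reply or issues a refund."""
    order_id = lookup_order(message)
    category = categorise(message)
    return Suggestion(message, order_id, category, draft(message, order_id, category))


def handle_ticket(
    suggestion: Suggestion,
    review: Callable[[Suggestion], Optional[Suggestion]],
    send: Callable[[Suggestion], None],
) -> bool:
    """Human checkpoint: nothing goes out unless the reviewer returns a suggestion."""
    approved = review(suggestion)  # the person may edit fields before approving
    if approved is None:
        return False  # rejected: the ticket stays with the human
    send(approved)
    return True


# Stub callables so the sketch runs on its own; swap in real ones.
sent = []
suggestion = prepare_suggestion(
    "Where is my refund for order 1042?",
    lookup_order=lambda msg: "1042",
    categorise=lambda msg: "refund",
    draft=lambda msg, order_id, category: f"Your {category} for order {order_id} is on its way.",
)
handle_ticket(suggestion, review=lambda s: s, send=sent.append)
print(sent[0].draft_reply)
```

The design choice that matters is that send is only reachable through review. The AI steps can be wrong without anything wrong reaching the customer.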

Counter-intuitively, that one human checkpoint makes the whole thing more valuable, not less. You're no longer multiplying five uncertain steps into a success rate of one in four. You're running two or three reliable steps, catching the error before it compounds, and letting a person own the decision that carries risk. The support pipeline went from "right a quarter of the time, autonomously" to "right almost every time, with one human glance per ticket." The client still cut their handling time in half. They just didn't fire anyone.

This is the same lesson the big platforms are quietly relearning. The interesting line out of this month's enterprise AI coverage wasn't about smarter models. It was Microsoft framing the whole battle around reliability, not capability, and a wave of companies entering what people are calling the "rebuild era," tearing out the autonomous agents they rushed in last year and re-architecting them around state, recovery, and human checkpoints. The hype cycle is meeting the production cycle, and production is winning.

What I tell clients now

When someone asks me for an agent, I ask them one question back: how many steps does it have to get right in a row with no human in the loop? If the honest answer is more than two or three, I tell them what the math does to that number, and we redesign it on the spot. Almost always we land on the same shape: narrow AI steps, a human owning the risky decision, and a system that's boring and reliable instead of impressive and wrong.
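You can turn that question around and ask how many unattended steps a given target allows. A rough sketch, again assuming independent steps with equal reliability; max_unattended_steps is a hypothetical helper, and the 70% and 60% targets below are illustrative numbers, not figures from any client.

```python
def max_unattended_steps(step_reliability: float, target: float) -> int:
    """Largest number of back-to-back steps whose combined reliability
    stays at or above target, assuming independent, equally reliable steps.
    """
    if not 0.0 < step_reliability < 1.0:
        raise ValueError("step_reliability must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be greater than 0 and at most 1")
    steps = 0
    reliability = 1.0
    # Keep adding steps while the chain would still meet the target.
    while reliability * step_reliability >= target:
        reliability *= step_reliability
        steps += 1
    return steps


print(max_unattended_steps(0.85, 0.70))  # steps allowed at a 70% end-to-end target
print(max_unattended_steps(0.85, 0.60))  # steps allowed at a 60% end-to-end target
```

At 85% per step, those two targets allow two and three steps respectively, which is where the "two or three" rule of thumb comes from.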

None of this means AI in production is overrated, and I've written before about what AI actually changed and what it didn't. It means autonomy is the expensive, fragile part, and most businesses don't actually need it. They need the tedious 80% done fast and a person owning the 20% that matters. The gap between the demo and the deliverable is real, but it closes the moment you stop trying to remove the human and start trying to remove the tedium.

If you're building with AI in a real business and tired of demos that don't survive contact with real users, I'd genuinely like to compare notes: me@mehranshahmiri.com
