# Field notes from running customer support AI in production

The demo is easy. Anyone can wire a language model to a chat box and get a reply that sounds right. You show it to a room, people nod, someone says "this changes everything," and they are not wrong. But the gap between that demo and a system you trust with a customer's refund is enormous, and **almost none of it is about the model.**

I know this because we crossed that gap the hard way, from an unusual starting point. Customer support is an old and brutally competitive industry, and we have run it with human agents for years across these brands. But we have been **tech-first since day one**, in a business that mostly is not, and that is the reason we got anywhere in it. Most companies here are one or the other: legacy operators with little engineering, or AI startups that have never run a real queue. We were both from the start, which is why the AI platform was never a pivot. It is the most natural thing we have built.

Over the last ten months we took that operation from a few hundred AI-handled conversations a month to hundreds of thousands, across a few dozen brands running real commerce stacks: orders, subscriptions, carriers, returns. **More than half of the conversations our AI engages now close without a human ever touching them**, and the rest still go to people who do this for a living. None of that came from a smarter prompt. It came from a long list of things that broke.

Here are eight of them, and the lesson each one beat into us. They look like unrelated war stories. **By the end you will see they are the same story.**

## 1\. The one-line outage

One afternoon every single conversation, across every brand, stopped. The AI crashed on each inbound message and handed the ticket to a human. From the outside it looked like a total system failure. **The cause was one line.**

I had added an import to our central orchestrator, the function every message on every platform flows through. The name I imported happened to collide with a variable assigned later in the same function. Python quietly decided the name was local for the entire function, so a reference earlier in the code blew up before it was ever assigned. Classic, boring, and total.

The part that stung: I had "checked" the change. The file parsed. The code was valid. But **valid is not correct**, and a parse check cannot see a bug that only exists when the line runs. In an AI system the orchestrator is your single point of failure, and **its blast radius is every customer at once.** We now gate every deploy on a dataflow check that catches exactly this class, plus an actually-executed path, never just "it imports."

The lesson is unglamorous and it is the foundation of everything else: **the reliability of your AI is mostly the reliability of the very normal software wrapped around it.**

## 2\. The agent that quietly forgot its instructions

We made what looked like a harmless plumbing change. Instead of placing each workflow's instructions at the very front of the model's context, we moved them to sit after the reconstructed conversation history. Same words, same instructions, slightly different position.

![](https://cdn.hashnode.com/uploads/covers/67c1b1c6313ccbb4a8ec7d06/43e94337-8ebb-4181-848a-767e93623749.svg align="left")

The agent started ignoring its own workflow. Steps it was told to follow got skipped. Required fields in its output silently went missing. And it only happened **after the first turn**, because on turn one there was no history in front of the instructions yet, so it worked perfectly in every quick test and degraded only once a real conversation got going.

That is the maddening texture of building on language models. A change you would call cosmetic in any normal codebase, the order of two blocks of text, **measurably changed behaviour.** There was no error, no exception, no log line. The model simply paid less attention to instructions that were buried, the same way a person skims the middle of a long email.

The lesson: **with language models, where you put an instruction matters as much as what the instruction says.** Non-determinism does not announce itself. It hides in the plumbing you assumed was safe.

## 3\. The over-helpful agent, and the tempting wrong fix

A customer wrote in with two things in one message: my order never arrived, and also, cancel my subscription. This is not an edge case. Across our entire platform those are two of the highest-volume reasons customers contact a brand. They collide constantly.

Our router picked one intent, order tracking, and ran that workflow well. Then the agent reached the second request, found it had no tool to cancel a subscription in the workflow it was running, and **escalated the whole ticket to a human.** Including the delivery half it had already solved.

The obvious fix is right there and **it is a trap**: just give the agent all the tools, all the time. Then it could have cancelled the subscription itself.

We did not do that, and refusing to is one of the most important decisions in the product. A subscription cancellation does not live alone. It lives inside a workflow that carries policy: the retention offer, the confirmation step, the rules about what the customer is owed. If you let the model reach the cancel capability from outside that workflow, you have an AI cancelling subscriptions without any of the guardrails that are supposed to govern cancellations. We call that *freelancing*, and **it is precisely how you lose control of a support agent.**

![](https://cdn.hashnode.com/uploads/covers/67c1b1c6313ccbb4a8ec7d06/456261a3-08d9-41b8-bb79-9e4f083b7501.svg align="left")

The real problem was never a missing tool. **It was routing.** One message, two intents, and a router that only knew how to pick one. The fix is to run both of the right workflows, each carrying its own policy, not to tear down the walls between them. Over-escalation looked like a capability gap. **It was a routing gap.** Those are very different bugs, and confusing them is how good teams accidentally build an agent they cannot trust.

## 4\. The failure that poisoned the conversation

Here is one that genuinely surprised me. A tool the agent called, a carrier lookup, threw an error. Transient, the kind of thing that happens a thousand times a day to any system talking to third-party APIs. You would expect the agent to recover, apologise, retry, move on.

**Instead the conversation died.** Not that message. The conversation. Every subsequent message from that customer hard-errored, permanently.

![](https://cdn.hashnode.com/uploads/covers/67c1b1c6313ccbb4a8ec7d06/3b458e37-4ef4-4376-a663-3ad57cec7441.svg align="left")

The reason is specific to stateful AI agents and worth understanding. The framework had already saved a snapshot of the turn the instant the model decided to call the tool, an assistant message that says "I am about to call this tool." Then the tool blew up before any result could be recorded. Now the saved state contains **a promise with no answer**: a message announcing a tool call that has no matching result. To the model's API that is a malformed conversation, and every future message inherits the broken snapshot and is rejected on arrival. One transient hiccup became a dead conversation that could never heal itself.

The lesson: **failures in stateful AI systems are sticky. Corrupted state outlives the error that caused it.** In ordinary software a failed call is a bad second; here it can poison everything that comes after. You cannot treat tool failure as an exception you log and forget. **It is a first-class path**, and you have to engineer the recovery as carefully as you engineer the happy case.

## 5\. What counts as automated

![](https://cdn.hashnode.com/uploads/covers/67c1b1c6313ccbb4a8ec7d06/894a3d01-4df2-4d07-affd-7dfdfd2a0cd2.svg align="left")

Once the AI was resolving real volume, a question that sounds like a reporting detail turned out to be the hardest one in the product: **which conversations count as automated?** It matters because the answer is what we put on the invoice, and that is where trust with the customer is won or lost.

The tempting definition is generous to us. The AI replied, the customer went away happy, call it automated. But real conversations are messier. On plenty of tickets the AI did almost all the work and a human then sent a single closing line. On others the AI replied and a human replied too. Is that automated? Half-automated? **If we get to decide, every ambiguous case quietly rounds in our favour**, and the headline number drifts away from anything the customer would recognise as true.

So we built the accounting to be **conservative on purpose.** Every conversation is sorted into one of a few segments by who actually replied: the AI alone, the AI plus a human, a human alone, the AI deliberately closing a ticket without replying, or nobody. Only the first one, where the AI handled the whole thing and no human touched it, counts as fully automated, and in the strict billing mode that is the only segment we charge for. **A human sending one line is enough to make the ticket free.** The human here is not a failure to bill around; on the cases the AI should not close alone, a person finishing the job is the design, not the exception. We would rather undercount and be unarguably fair than inflate the number and let the customer find the seam later.

The lesson: **the definition of "automated" is a product decision, not a metric**, and drawing the line in the customer's favour is what makes the number worth trusting. It also points our own incentives the right way, because we only get paid when we genuinely removed the human, which is exactly the thing the customer is buying. **Honest accounting is a feature**, and like everything else here it is a constraint we chose to put on ourselves.

## 6\. The boring economics of depth

When you need shipment tracking, there is an easy path and a hard path. The easy path is a single aggregator that claims to track every carrier through one API. You integrate once and you are done. The hard path is integrating each major carrier directly, one painful API at a time.

We started down the easy path and **backed out.** The aggregator bills per lookup, it is rate-shared with everyone else using it, and the data it returns is shallow. At the volume of "where is my order," the single most common question in all of e-commerce support, that per-lookup tax adds up and the shallow data means more handovers. So we built native carrier integrations first and kept the aggregator only as a fallback for carriers where no direct API exists. **Richer data, not rate-shared, and cheaper.**

This is the least exciting work we do and **it is most of the moat.** Anyone can connect a model to a chat. Very few teams will do the unglamorous work of integrating thirty carrier and commerce APIs deeply enough that the AI can actually resolve the request instead of describing it. **The intelligence is a commodity. The depth of real operations behind it is not.**

## 7\. Every external string is hostile

Most people picture prompt injection as something a malicious customer types into the chat box. In production **the larger surface is everything the agent reads that is not the customer.** The agent ingests carrier API responses, order notes, ticket fields, product metadata: text from a dozen third parties, any of which can carry an instruction the model might decide to obey. **Your attack surface is not your input box. It is every byte the model reads**, and most of those bytes come from systems you do not control.

The customer channel is not just words either. We saw inbound messages padded with thousands of invisible characters, spam crafted to bloat the context and derail the agent. So we strip that padding before it ever reaches the model and **treat every external string, not just the obvious user input, as untrusted by default.**

The lesson: **sanitise at the boundary, and assume the adversary is not only the person typing.** In an ordinary backend you validate the form fields. In an LLM system you have to account for the fact that the model will read, and may act on, text that arrived through an API you have never thought of as user input.

## 8\. You cannot wall off a language model

Our first instinct on security was the one every engineer has: lock it down like a normal backend. Validate every input, deny by default, allow only what is explicitly permitted. It is the correct instinct for a CRUD application and **it is the wrong instinct here**, because the thing that makes the agent valuable is exactly the flexibility that deny-by-default destroys.

What we landed on is **two tiers**, and the distinction has become one of our core design principles. **Tier one is the hard boundaries**, the places where the model has no role at all: authentication, secrets, making sure one brand can never see another brand's data. Those are code-enforced, non-negotiable, and **the model's opinion is never consulted. Tier two is everything the model mediates**: how it interprets a request, which guardrails apply to a sensitive action, how it handles messy third-party content. Those you do not freeze in code on day one. You instrument them, you watch what the model actually does at scale, and you **tighten with evidence rather than with fear.**

The lesson: **an AI-first product needs a security posture that is part code and part instrumentation.** Hardening it like a traditional backend either breaks the product or, worse, gives you a false sense of safety while the real risks sit somewhere your validation rules were never looking.

## The obvious objection

I can hear the counterargument, because I have made it myself. **This all sounds old-fashioned.** Routers, workflows, hard boundaries, a model kept on a short leash: it looks like the scripted chatbots of 2018 wearing a newer coat. The industry is sprinting the other way, towards agents you hand a goal and trust to improvise, and every month the models get good enough to make that pitch more believable. So aren't we just building scaffolding that the next model generation tears down? Won't a clever enough agent simply not need the rails?

It is a fair challenge, and **part of it is true.** Some of our constraints exist purely because today's models are not reliable enough, and those will recede. As the models improve we fully expect to widen the second tier: give the agent more latitude inside each workflow, collapse routing steps it no longer needs, hand it judgement we currently make for it. The frontier moves, and we intend to move the line with it. **We would rather be the team that loosens a constraint when the evidence says it is safe than the team that shipped autonomy and learned the hard way in front of a customer.**

But the pitch conflates two things that are not the same, and that is where I think it is wrong. **Capability is not authority.** A more capable model can work out *how* to cancel a subscription; it still should not get to decide *whether* it is allowed to, on what terms, with which retention offer, under whose policy. That is not a gap in the model's intelligence. It is the business saying what it will and will not stand behind. A perfect model still has to be told the refund policy, still has to be auditable, still must never let one brand see another's data. **Those constraints are not distrust of the model. They are the shape of the product the customer is actually buying.** No amount of model progress removes them, because they were never model problems.

So yes, it is unmagical on purpose. The magical demo and the system you trust with a refund are different artefacts, and the distance between them is exactly the boring, deliberate, old-fashioned engineering this whole piece is about. **The autonomy that matters here is not the agent's. It is the customer's:** the freedom to hand over a queue and stop worrying about it. You earn that with constraint, not in spite of it.

## The one thread

Read those eight again. A one-line crash, a misplaced block of instructions, an over-eager router, a poisoned conversation, an honest automation count, a per-lookup tax, a hostile API payload, a security model. They sound unrelated. **They are the same lesson wearing eight outfits.**

**The hard part of production customer support AI is not capability. It is constraint.**

Every win we have had came from narrowing what the model can do at any given moment, isolating each capability behind the workflow that owns its policy, and engineering for failure as a normal path rather than a surprise. The industry is busy selling the opposite: agents you point at a problem and trust to improvise. Our data, across **seven figures of real conversations**, says **improvisation is the enemy.** The agents we trust are the ones we have most carefully fenced in.

![](https://cdn.hashnode.com/uploads/covers/67c1b1c6313ccbb4a8ec7d06/8e8493e6-92ae-46eb-b418-b4f46683afa3.svg align="left")

The economics make the point on their own. **Millions of AI-generated messages cost us a five-figure model bill: pennies per message**, and what we pay the model is a small fraction of what a resolved ticket is worth, a gap that only widens as inference gets cheaper every quarter. **The margin was never in the model.** It lives in the system around it, the workflows, the routing, the recovery paths, the tiered security, the part that took ten months of things breaking to build and the part that is actually the company. We run thousands of workflows in production, dozens per brand, the most sophisticated past two hundred.

This is what I mean when I say **customer care is an infrastructure problem, not a model problem.** The engine is brilliant and slightly unreliable, and everyone has the same one. The car you build around it, **software and people both**, is what customers actually feel, and it is the only thing that is hard to copy. The deepest part of it is the part no prompt can produce: **the credibility of a team that was already doing this job, by hand, well.** A brand does not hand its refund button to a demo; it hands it to an operation it already trusts, and lets the AI take the volume that operation has earned the right to automate.

**Control is not the thing slowing the AI down. Control is the product.** It is the entire reason a support leader can hand an agent the refund button and sleep at night.
