
The Hard Thing About AI Agents

The demo worked. The pilot impressed the board. Now your agent is hallucinating to customers at 3am. Here are the hard truths about deploying AI agents that nobody wants to tell you.

MMNTM Research
12 min read
Tags: AI Agents, Production, Failure Modes, Reliability

The Demo Worked. Then What?

The agent demo is flawless. Full autonomy. No human intervention. The model reasons, acts, delivers.

Then you deploy.

Three weeks later, you're staring at Slack at 2am because your "autonomous" agent just sent a customer a completely fabricated legal citation. The customer is a lawyer. The lawyer is not amused.


Gartner predicts 40% of agentic AI projects will be cancelled by 2027—not because the technology failed, but because teams didn't anticipate what happens when demos meet production.

Pilot failure rate: 90% of enterprise AI initiatives never reach production.

Building an agent is straightforward. Owning the fallout when it breaks is not.


The Five Ways Agents Die

Agents don't die from exotic edge cases. They die from predictable failure modes that every production system encounters.

1. Context Starvation

The agent gives generic, useless responses that technically "answer" the question but miss the point entirely. It lacks access to the information it needs—user history, domain knowledge, previous conversation context.

The trap: pilots use curated data. Production uses the messy reality of actual infrastructure. The gap between them is where projects die.

2. Tool Amnesia

The agent has tools but forgets to use them. Or uses the wrong ones. Tool descriptions are too vague, too similar, or buried in a prompt that's too long.

The fix isn't more capabilities. It's fewer tools, clearer descriptions, ruthless pruning. Specialization beats generalization.

3. Confidence Hallucination

The agent fabricates information and presents it as fact. Users trust it because it sounds authoritative. Legal AI tools hallucinate 17-33% of the time. Up to one-third of outputs are invented. Delivered with complete confidence.

Only 19% of organizations express high confidence in their ability to prevent hallucinations. The other 81% are hoping it doesn't happen to them.

This failure mode ends careers. A single confident hallucination to the wrong customer can cost more than a thousand correct answers are worth.

4. Infinite Loop Syndrome

The agent gets stuck in retry cycles, burning tokens and time without progress. No clear success criteria. No termination condition. Just an ever-growing invoice.

Agents have burned $50,000 in a single weekend because nobody implemented a step limit. The code was "working." The loop was infinite.

5. Cascade Failure

One agent error propagates through a multi-agent system, corrupting downstream agents. Agent A hallucinates. Agent B trusts Agent A. Agent C trusts Agent B. By the time output reaches the user, the fabrication has been laundered through three layers of apparent validation.

Preventing individual failures is tractable. Containing the blast radius when failures compound is where teams actually struggle.

For the full taxonomy, see Why Agents Die.


The Hallucination Tax

Every AI hallucination has a cost. Not a theoretical cost. A dollar cost.

Annual hallucination tax: $2.9M at an 8% error rate, 500 queries/day, and $200 cost per error.

The formula:

Hallucination Tax = Error Rate × Volume × Cost Per Error

At 8% error rate with 500 daily queries and $200 cost per error (30 minutes of expert time to correct), you're bleeding $8,000 per day. Every day.
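The arithmetic is simple enough to put in a few lines. This is a back-of-the-envelope sketch using the article's illustrative numbers, not a measurement of any real system:

```python
# Hallucination Tax = Error Rate x Volume x Cost Per Error.
# The inputs below are the article's example assumptions.

def hallucination_tax(error_rate: float, queries_per_day: int,
                      cost_per_error: float) -> dict:
    """Daily and annual cost of escaped hallucinations."""
    daily = error_rate * queries_per_day * cost_per_error
    return {"daily": daily, "annual": daily * 365}

tax = hallucination_tax(error_rate=0.08, queries_per_day=500, cost_per_error=200)
print(f"${tax['daily']:,.0f}/day, ${tax['annual'] / 1e6:.1f}M/year")
# -> $8,000/day, $2.9M/year
```

Plug in your own error rate and correction cost to see where your line item lands.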

That's not a rounding error. That's a line item.

And 8% is optimistic. Research shows hallucination rates range from 6.8% to 48% depending on model and task complexity.

The 5% that fails will cost you more than the 95% saves. Every point of error reduction has measurable dollar value. Every layer of validation reduces your exposure.

Calculate your tax. Then architect it down to something you can afford. See The Hallucination Tax for the full framework.


Demo Culture vs Production Reality

There are two modes of operating AI teams. Most organizations are stuck in the first.

Feature             | Demo Culture                      | Production Reality
Focus               | Ship demos. Impress stakeholders. | Ship reliability. Survive production.
Metrics             | Benchmark accuracy. Eval scores.  | Escaped hallucination rate. Cost per error.
Response to failure | We will fix it in the next sprint.| Stop the bleeding. Now.
Human oversight     | Full autonomy is the goal.        | Full autonomy is a myth.
Monitoring          | Check dashboards weekly.          | Get paged at 2am.
Success criteria    | The model works.                  | The business works.

Demo culture optimizes for model performance. Production reality optimizes for business outcomes.

Demo culture believes in full autonomy. Production reality knows that human-in-the-loop (HITL) strategies deliver 2x better ROI than unsupervised agents.

Demo culture runs evals. Production reality gets paged at 2am because the agent just sent 10,000 wrong emails and the VP of Customer Success wants answers.

If you're deploying AI agents to customers, you're already in production reality whether you've admitted it or not.


What Actually Works

When your agent is bleeding money at 3am, you don't need innovation. You need the boring infrastructure that actually works.

The checklist:

Circuit Breakers

  • Hard step limits (MAX_STEPS = 15). Most tasks complete in 5-10 steps. Terminating at step 15 prevents 80% of runaway costs.
  • Session budget caps. A hard $2.50 per run. When the threshold is hit, execution stops.
  • Semantic convergence detection. If cosine similarity exceeds 0.95, the agent is repeating itself. Kill it.

Human Checkpoints

  • Route low-confidence outputs to humans. Not all of them. Just the ones that matter.
  • Smart thresholds: >85% confidence auto-approves. 70-85% gets fast-track review. <70% gets full escalation.
  • The HITL Firewall delivers 85% cost reduction at 98% accuracy. Full autonomy delivers 85% accuracy at 100% liability.
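The routing logic behind those thresholds is deliberately boring. A sketch, assuming your model or validator produces a confidence score in [0, 1]:

```python
def route(output: str, confidence: float) -> str:
    """Confidence-based HITL routing using the thresholds above."""
    if confidence > 0.85:
        return "auto_approve"       # ship it
    if confidence >= 0.70:
        return "fast_track_review"  # quick human glance
    return "full_escalation"        # expert review before anything leaves
```

The point isn't the three lines of branching. It's that every output passes through a gate, and the gate's thresholds are numbers you can tune against your escaped-hallucination rate.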

Observability

  • Log every LLM call. Every tool invocation. Every decision point.
  • When your agent fails at scale, you need to reconstruct exactly what happened. Without tracing, you're debugging blind.
  • Build the dashboards before you need them. See Agent Observability.
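Tracing doesn't require a platform to start. A minimal sketch of a logging decorator — `lookup_order` is a hypothetical tool, and in production you'd ship these records to your log pipeline instead of stdout:

```python
import functools
import json
import time
import uuid

def traced(call_type: str):
    """Log every call with a trace id, truncated args/result, latency, and errors."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            record = {"trace_id": str(uuid.uuid4()), "type": call_type,
                      "fn": fn.__name__, "args": repr(args)[:200]}
            start = time.time()
            try:
                result = fn(*args, **kwargs)
                record["result"] = repr(result)[:200]
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                record["latency_ms"] = round((time.time() - start) * 1000, 1)
                print(json.dumps(record))  # replace with your log sink

        return inner
    return wrap

@traced("tool")
def lookup_order(order_id: str) -> str:
    """Hypothetical tool call, for illustration only."""
    return f"order {order_id}: shipped"
```

One decorator on every tool and LLM wrapper gives you the reconstruction trail before the 2am page, not after.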

Validation Layers

  • Don't let the model answer from memory. Force it through retrieval.
  • Citation requirements. No source, no claim.
  • Cross-reference outputs against known facts before delivery.
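"No source, no claim" is enforceable in code. A sketch, assuming answers cite sources as bracketed ids like `[doc-3]` (the citation format and `known_sources` set are assumptions — adapt to your retrieval layer):

```python
import re

def validate(answer: str, known_sources: set) -> tuple:
    """Reject answers with no citations, or citations outside the retrieval set."""
    cited = set(re.findall(r"\[([\w\-]+)\]", answer))  # assumed [source-id] style
    if not cited:
        return (False, "rejected: no citations")
    unknown = cited - known_sources
    if unknown:
        return (False, f"rejected: unverifiable sources {sorted(unknown)}")
    return (True, "ok")
```

Anything the validator rejects routes to the human checkpoint above instead of the customer.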

None of this is glamorous. None of this makes for a good demo. But this is the infrastructure that keeps agents alive in production.


The 10% Who Escape

The 10% of AI pilots that escape Pilot Purgatory don't escape because they had better models.

They escape because:

  • They built for production during the pilot, not after
  • They killed the projects that wouldn't make it
  • They measured business outcomes, not benchmark accuracy
  • They accepted that operational overhead is the cost of reliability

The demo isn't the hard part. Owning what comes after is.

Killing an agent project when it's burning money and the team is emotionally invested. Telling the board that the 40% cancellation rate applies to you too. Looking your customer in the eye after your agent hallucinated legal advice and explaining how you're going to make it right.

None of this gets easier. You just get better at doing it.