
The Agent Operations Playbook: SRE for AI Systems

Traditional SRE fails with non-deterministic systems. Here are the SLAs, incident response patterns, and deployment strategies that work for production AI agents.

MMNTM Research Team
9 min read
#AI Agents · #Operations · #SRE · #Observability · #Production

What is Agent Operations (LLMOps)?

Agent Operations (LLMOps) is SRE adapted for non-deterministic AI systems. Traditional monitoring fails because agents can return 200 OK while outputting wrong answers—semantic failures that succeed technically. Agent SLAs must track Task Success Rate (>95%), P95 latency (not averages), faithfulness scores, and Cost Per Task. Incident response requires distinguishing hard crashes (SRE-owned) from quality deviations (ML Engineer-owned), with circuit breakers at every agent handoff to prevent cascade failures in multi-agent systems.



Why Traditional SRE Fails

You deploy an agent. It returns 200 OK. Latency looks normal. Uptime is 99.9%.

And yet customers are getting wrong answers.

This is the fundamental challenge of agent operations: semantic failures that succeed technically. The API request completes successfully, but the agent confidently outputs incorrect information. Traditional APM tools miss this entirely because they monitor infrastructure, not intelligence.

Operations is one of the four pillars explored in The Agent Thesis, alongside architecture, economics, and security.

The failure taxonomy for agents expands beyond crashes:

| Category | Traditional Failure | Agent Failure |
|---|---|---|
| Reliability | Service unavailable | Latency spikes, format errors |
| Quality | N/A | Hallucinations, reasoning errors, context loss |
| Safety | N/A | Toxicity, PII leakage, prompt injection |

The most dangerous failure mode is silent degradation. When an agent hallucinates or loses track of a multi-step workflow, operational status stays green while business outcomes suffer. This dramatically increases Mean Time to Detection: thousands of bad outputs can ship before anyone notices.

The operational playbook must pivot from availability monitoring to semantic monitoring: confidence scores, quality drift, and LLM-as-Judge evaluations.
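To make that pivot concrete, a minimal LLM-as-Judge faithfulness check might look like the sketch below. It assumes an OpenAI-compatible client; the judge model name, rubric wording, and 0.85 threshold are placeholders rather than recommendations.

```python
# Minimal LLM-as-Judge sketch: score how faithful an answer is to retrieved context.
# Assumes an OpenAI-compatible client; model name and threshold are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator. Given CONTEXT and ANSWER, return only a number
between 0 and 1 indicating how faithful the ANSWER is to the CONTEXT (1 = fully grounded).

CONTEXT:
{context}

ANSWER:
{answer}
"""

def faithfulness_score(context: str, answer: str, model: str = "gpt-4o-mini") -> float:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

def is_faithful(context: str, answer: str, threshold: float = 0.85) -> bool:
    # Flag the interaction for review when the judge score falls below the quality SLO.
    return faithfulness_score(context, answer) >= threshold
```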

The Agent SLA Blueprint

Redefining Reliability

For a traditional API, 99.9% uptime is the gold standard. For goal-based agents, uptime is meaningless if task completion is 70%.

Agent SLAs must be outcome-centric. The core metric is Task Success Rate, not availability.

The Three Pillars of Agent SLIs

1. Performance SLIs

  • P50/P95/P99 Latency (percentiles, not averages)
  • Throughput (requests per minute)
  • Time-to-First-Token (for streaming responses)

2. Quality SLIs

  • Task Success Rate (goal completion)
  • Faithfulness/groundedness scores
  • Safety compliance (toxicity, bias, PII avoidance)

3. Financial SLIs

  • Cost Per Task (CPT)
  • Token consumption per request
  • Daily spend vs. budget caps

P95 Latency Benchmarks

Average latency is misleading for LLM systems because a few slow outliers skew the mean far from what users experience. Anchor SLOs to percentiles:

| Use Case | P50 Target | P95 Target |
|---|---|---|
| Simple Q&A (reactive) | <500ms | <1,000ms |
| Complex RAG (goal-based) | <2,000ms | <4,000ms |
| Multi-agent orchestration | <3,000ms | <6,000ms |

Critical insight: Agentic workflows impose cumulative latency penalties. An agent orchestrating RAG + reasoning + tool calls + summarization combines the latency of every step. If users tolerate 4 seconds total, you cannot afford 4 seconds at each stage. Set internal component SLOs much tighter (e.g., <500ms for RAG retrieval) to keep end-to-end P95 acceptable.
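Here is a quick sketch of that budgeting arithmetic. The stage names and weights are illustrative assumptions; in practice you would derive them from observed traces.

```python
# Sketch: split an end-to-end P95 latency budget across pipeline stages.
# Stage weights are illustrative; tune them from real trace data.
END_TO_END_P95_MS = 4000  # user-facing SLO for a complex RAG flow

STAGE_WEIGHTS = {
    "rag_retrieval": 0.125,   # ~500ms, matching the internal component SLO above
    "reasoning": 0.45,
    "tool_calls": 0.30,
    "summarization": 0.125,
}

def component_budgets(total_ms: int, weights: dict[str, float]) -> dict[str, int]:
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights must sum to 1"
    return {stage: int(total_ms * w) for stage, w in weights.items()}

print(component_budgets(END_TO_END_P95_MS, STAGE_WEIGHTS))
# {'rag_retrieval': 500, 'reasoning': 1800, 'tool_calls': 1200, 'summarization': 500}
```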

Quality and Cost SLOs

Task Success Rate: Target >95%. This is the number that matters.

Faithfulness Score: Start with 85-95% target. Use progressive threshold adjustment - increase targets by 2-5% each release cycle as the system matures.

Safety Metrics: Target 90-98% compliance for toxicity, bias, and PII avoidance. These cannot slip.

Cost Per Task: Audit rigorously against realized value. An agent that costs $0.10 per task while saving $50 per task is obviously ROI-positive. Track this ratio.

The Triple Constraint

Unlike traditional software where quality is fixed after deployment, agents trade off Quality, Latency, and Cost continuously. Reducing hallucinations may require a larger model (more cost) or complex retrieval (more latency).

The mandate: define SLOs that satisfy all three simultaneously. "Maintain 95% Task Success Rate at CPT below $0.05 and P95 latency below 2,000ms."
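One way to encode that mandate is as a single SLO object that every release must pass as a whole. The sketch below mirrors the thresholds quoted above; the class and field names are illustrative.

```python
# Sketch: the triple-constraint SLO as one object, checked together on every release.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSLO:
    min_task_success_rate: float = 0.95
    max_cost_per_task_usd: float = 0.05
    max_p95_latency_ms: int = 2000

    def is_met(self, success_rate: float, cost_per_task: float, p95_ms: float) -> bool:
        # All three constraints must hold simultaneously; improving one at the
        # expense of another is not a pass.
        return (
            success_rate >= self.min_task_success_rate
            and cost_per_task <= self.max_cost_per_task_usd
            and p95_ms <= self.max_p95_latency_ms
        )

slo = AgentSLO()
print(slo.is_met(success_rate=0.96, cost_per_task=0.04, p95_ms=1850))  # True
```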

Incident Response for Non-Deterministic Systems

The Two Failure Types

Rapid triage requires distinguishing between:

Hard Crashes - Infrastructure issues. Service outages, resource exhaustion, external API failures, rate limiting. Detected by traditional error rate and latency monitoring. Platform/SRE owns this.

Quality Deviations - Semantic failures. The agent is available but outputs are wrong, nonsensical, or malformed. These are the failure modes that kill agents, and they require specialized quality monitoring. ML Engineers own this.

| Failure Type | Detection | Alert Type | Owner |
|---|---|---|---|
| Hard Crash | Error rate >5% for 2 min | Page | Platform/SRE |
| Quality Deviation | Success rate <90% | Warning | ML Engineer |
| Safety Breach | Guardrail failure | Critical | Security/MLOps |

The 5-Step Triage Framework

Non-deterministic failures defy exception handling logic. This framework addresses unpredictable agent breakdowns:

Step 1: Detect via Anomaly

Traditional thresholds don't work. Implement dynamic confidence thresholds that flag outputs when confidence deviates more than 2σ from the rolling average. Track context drift (tokens consumed) - agents lose efficacy during long sessions. Force checkpointing at maximum context window.
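A minimal sketch of such a dynamic threshold, assuming a rolling window of recent confidence scores; the window size and warm-up count are arbitrary choices.

```python
# Sketch: flag outputs whose confidence deviates more than 2 standard deviations
# from a rolling window of recent scores. Window size and warm-up are assumptions.
from collections import deque
from statistics import mean, stdev

class ConfidenceMonitor:
    def __init__(self, window: int = 200, sigmas: float = 2.0):
        self.scores = deque(maxlen=window)
        self.sigmas = sigmas

    def check(self, confidence: float) -> bool:
        """Return True if this output should be flagged as anomalous."""
        flagged = False
        if len(self.scores) >= 30:  # need enough history for a stable baseline
            mu, sigma = mean(self.scores), stdev(self.scores)
            flagged = sigma > 0 and abs(confidence - mu) > self.sigmas * sigma
        self.scores.append(confidence)
        return flagged
```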

Step 2: Preserve Context

Capture the agent's working state at critical decision points (before API calls, agent handoffs) as lightweight JSON snapshots. Store reasoning chains - not just outcomes, but why decisions were made. This enables resuming from last known good state instead of restarting.
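A sketch of what such a snapshot might contain; the field names and storage path are illustrative, not a fixed schema.

```python
# Sketch: capture a lightweight JSON snapshot of the agent's working state at a
# decision point. Field names and storage location are illustrative only.
import json
import os
import time
import uuid

CHECKPOINT_DIR = "checkpoints"

def snapshot_state(task_id: str, step: str, reasoning: list[str], partial_results: dict) -> str:
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    checkpoint = {
        "checkpoint_id": str(uuid.uuid4()),
        "task_id": task_id,
        "step": step,                      # e.g. "before_api_call" or "handoff_to_next_agent"
        "reasoning_chain": reasoning,      # why decisions were made, not just what was decided
        "partial_results": partial_results,
        "timestamp": time.time(),
    }
    path = os.path.join(CHECKPOINT_DIR, f"{task_id}-{checkpoint['checkpoint_id']}.json")
    with open(path, "w") as f:
        json.dump(checkpoint, f)
    return path  # resume from this snapshot instead of restarting the whole task
```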

For production-grade workflows that must survive crashes and deployments without losing state, consider durable execution frameworks. Temporal automatically persists event histories and enables deterministic replay from any point—eliminating the "restart from scratch" problem entirely. See Temporal Deep Dive for the architectural patterns Netflix uses to run hundreds of thousands of workflows daily.

Step 3: Prevent Cascade

In multi-agent systems, isolation is key. Implement circuit breakers at every agent handoff. If upstream failure rate spikes, trip the circuit to prevent corrupted outputs from flowing downstream. Use message queues between agents as buffers.
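A minimal circuit breaker around an agent handoff might look like the sketch below; the failure threshold and cool-down period are illustrative defaults.

```python
# Sketch: a simple circuit breaker wrapping an agent handoff. Threshold and
# cool-down values are illustrative defaults.
import time

class HandoffCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, downstream_agent, payload):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: downstream agent quarantined")
            self.opened_at, self.failures = None, 0  # half-open: allow a probe request
        try:
            result = downstream_agent(payload)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip: stop corrupted output flowing downstream
            raise
```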

Step 4: Checkpoint Incrementally

For long-running processes (analyzing 50 documents), define transaction boundaries and save state after each logical unit completes. Recovery resumes from the last clean checkpoint, not the beginning.
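Sketched below for a document batch, assuming a local JSON file as the checkpoint store and a placeholder analyze() function.

```python
# Sketch: checkpoint after each logical unit so recovery resumes from the last
# clean document rather than document 1. analyze() and the local JSON store are
# placeholders for your own processing step and state backend.
import json
import os

CHECKPOINT_FILE = "batch_checkpoint.json"

def process_documents(doc_ids: list[str], analyze) -> dict:
    done: dict = {}
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            done = json.load(f)              # resume from the last clean checkpoint
    for doc_id in doc_ids:
        if doc_id in done:
            continue                         # already processed before the failure
        done[doc_id] = analyze(doc_id)       # one logical unit = one document
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(done, f)               # transaction boundary: persist after each unit
    return done
```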

Step 5: Escalate with Context

When escalating to humans, automatically pass partial results, confidence scores, and the full reasoning chain. The human should not need to reproduce the failure. For systematic human-in-the-loop architectures, see the HITL Firewall patterns.

On-Call Structure

ML engineers often assume full-cycle responsibility, including on-call. Clear demarcation of ownership matters:

  • Platform/SRE: Infrastructure, general availability, resource provisioning
  • ML Engineers: Quality metrics, model drift, prompt degradation
  • Security: Safety guardrails, adversarial attacks

Use AI-powered incident tools to accelerate root cause analysis - they correlate deployment history, config changes, and system anomalies faster than manual log diving.

PromptOps: Prompts as Code

Why This Matters

In production, prompts are not static suggestions - they are critical application logic. Untracked changes cascade into production issues, degrading quality across thousands of interactions without detection.

The PromptOps Mandate: Treat prompts with the same rigor as application code.

Version Control Integration

  • Store prompts in Git alongside application code
  • Maintain change history, timestamps, and authors
  • Link each version to: target model, parameter configs (temperature, top-p), environment, and performance metrics from evaluation runs

Separation from code: For operational flexibility, manage prompts via external configuration systems. This enables runtime updates without redeploying the entire application.
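As an illustration, the metadata tracked alongside each prompt version might look like the following; the field names and values are hypothetical, not a required schema.

```python
# Sketch: metadata stored in Git next to each prompt file, linking the prompt text
# to its target model, parameters, environment, and evaluation results.
# All names and values below are illustrative placeholders.
PROMPT_VERSION = {
    "prompt_file": "prompts/support_triage_v14.txt",
    "version": "v14",
    "author": "jdoe",
    "timestamp": "2025-06-01T12:00:00Z",
    "target_model": "gpt-4o-mini",
    "params": {"temperature": 0.2, "top_p": 0.9},
    "environment": "production",
    "eval_metrics": {                        # from the evaluation run that approved this version
        "task_success_rate": 0.96,
        "faithfulness": 0.91,
        "p95_latency_ms": 1700,
        "cost_per_task_usd": 0.04,
    },
}
```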

A/B Testing Non-Deterministic Systems

Testing agents is fundamentally harder than traditional software because the same prompt yields variable responses.

Minimum Detectable Effect (MDE): Calculate the smallest meaningful effect size based on current success rate and desired statistical power (80% power at 95% confidence). This determines required sample size.
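A back-of-the-envelope sketch using the standard two-proportion approximation, with z-values hardcoded for 95% confidence and 80% power; the baseline rate and MDE in the example are assumptions.

```python
# Sketch: per-arm sample size for detecting a lift in task success rate at
# 80% power and 95% confidence (two-proportion approximation).
from math import ceil

Z_ALPHA = 1.96   # two-sided 95% confidence
Z_BETA = 0.8416  # 80% power

def samples_per_arm(baseline_rate: float, mde: float) -> int:
    p1, p2 = baseline_rate, baseline_rate + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((Z_ALPHA + Z_BETA) ** 2) * variance / (mde ** 2))

# Example: detecting a 2-point lift from a 93% baseline needs roughly this many
# tasks per variant.
print(samples_per_arm(baseline_rate=0.93, mde=0.02))
```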

Guardrail Metrics: Every experiment must monitor latency, cost, and safety alongside the primary metric. A 2% accuracy improvement means nothing if it causes 50% latency regression.

Practical vs. Statistical Significance: With high-variability outputs and large samples, even tiny improvements (a 0.5% quality gain) often reach statistical significance. But if that gain requires a 50% cost increase, it's not practically justified. Mandate cost-benefit evaluation for every rollout decision.

Deployment Protocol

Canary Strategy: Release new prompts to 1-5% of traffic initially. Run canary and control side-by-side with identical inputs. Measure performance, quality, and cost differences.

Rollback Triggers:

  • Task success rate drops below 90%
  • P95 latency increases >20% vs control
  • Safety guardrail breach detected

The system must support one-click rollback to last known good prompt version without full application redeploy.
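A sketch of that rollback decision wired to the triggers above; the metric names are assumptions about what the canary dashboard exposes.

```python
# Sketch: rollback decision based on the canary triggers listed above.
# Metric names are assumptions about your canary/control dashboards.
def should_rollback(canary: dict, control: dict) -> bool:
    if canary["task_success_rate"] < 0.90:
        return True
    if canary["p95_latency_ms"] > control["p95_latency_ms"] * 1.20:
        return True
    if canary["safety_breaches"] > 0:
        return True
    return False

# Example: a canary whose P95 regressed 25% vs. control gets rolled back.
print(should_rollback(
    canary={"task_success_rate": 0.94, "p95_latency_ms": 2500, "safety_breaches": 0},
    control={"task_success_rate": 0.95, "p95_latency_ms": 2000},
))  # True
```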

Monitoring and Alerting Configuration

Dashboard Layout

Structure observability around rapid triage:

  1. System Health Overview - SLO status (success rate, P95, daily cost)
  2. Performance & Resource - Latency distribution, token usage trends
  3. Agent Behavior & Quality - Confidence score distribution, hallucination trends, tool success breakdown
  4. Change Management - Metrics correlated with deployed model/prompt versions
  5. Cost Monitoring - Real-time spend vs budget caps

Alerting Thresholds

| Metric | Target | Warning | Critical |
|---|---|---|---|
| Success Rate | >95% | <90% for 15 min | <85% for 5 min |
| P95 Latency (simple) | <1,000ms | >1,500ms for 10 min | >4,000ms |
| Provider Error Rate | <0.5% | >2% for 5 min | >5% for 2 min |
| Daily Budget | 100% | 70% reached | 100% reached |
| Confidence Score | >0.85 median | 2σ drop | Sustained drop |

Budget Enforcement

LLM costs are highly unpredictable - metered by tokens, subject to runaway loops.

Soft Caps: Alert at 70% and 90% of budget. Enables proactive investigation.

Hard Caps: At 100%, enforce boundaries - block requests, route to cheaper model, or shutdown. Prevents catastrophic overruns.
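A minimal sketch of soft and hard caps enforced in a request gate; the alert hook is stubbed with a print, and the percentages mirror those above.

```python
# Sketch: soft and hard budget caps enforced per request. The alert hook is a
# stub; real systems would page or post to an incident channel.
class BudgetGuard:
    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.spend = 0.0
        self.alerted: set[float] = set()

    def record(self, cost_usd: float) -> None:
        self.spend += cost_usd
        for pct in (0.70, 0.90):                  # soft caps: alert, keep serving
            if self.spend >= pct * self.daily_budget and pct not in self.alerted:
                self.alerted.add(pct)
                print(f"ALERT: {int(pct * 100)}% of daily budget consumed")

    def allow(self) -> bool:
        # Hard cap: at 100%, block the request (or route to a cheaper model, or shut down).
        return self.spend < self.daily_budget
```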

Cost Anomaly Detection: Sudden usage spikes often indicate security issues - DoS attacks or prompt injection causing massive token output. Configure cost anomalies as Critical alerts linked to security response.

For multi-agent systems where agents transact across vendors, financial monitoring extends to inter-agent payments. The billing infrastructure for autonomous settlements—micropayments, escrow, and outcome-based pricing—is covered in Agent Billing & Crypto.

The Bottom Line

Agent operations requires a paradigm shift from deterministic SRE to probabilistic MLOps. The challenge is not uptime - it's detecting silent quality degradation while the system reports healthy.

The prescriptions:

  1. Define quantifiable SLOs across the triple constraint: P95 latency, task success rate (>95%), and cost per task
  2. Instrument semantically - distributed tracing, reasoning chain logging, dynamic confidence thresholds (2σ deviation alerts)
  3. Mandate PromptOps - prompts as versioned code, decoupled from application for fast rollbacks
  4. Enforce safe deployments - A/B testing with MDE calculations, canary rollouts, immediate rollback triggers
  5. Automate cost governance - real-time anomaly detection, hard budget caps linked to critical alerting

The gap between "working demo" and "24/7 production" is where most agent deployments fail. This playbook bridges that gap.

For translating these operational metrics into executive-ready business reporting, see the Agent Scorecard. For automated prompt optimization, systems can close the loop between observability and improvement.
