What is Agent Operations (LLMOps)?
Agent Operations (LLMOps) is SRE adapted for non-deterministic AI systems. Traditional monitoring fails because agents can return 200 OK while outputting wrong answers—semantic failures that succeed technically. Agent SLAs must track Task Success Rate (>95%), P95 latency (not averages), faithfulness scores, and Cost Per Task. Incident response requires distinguishing hard crashes (SRE-owned) from quality deviations (ML Engineer-owned), with circuit breakers at every agent handoff to prevent cascade failures in multi-agent systems.
The Agent Operations Playbook: SRE for AI Systems
Why Traditional SRE Fails
You deploy an agent. It returns 200 OK. Latency looks normal. Uptime is 99.9%.
And yet customers are getting wrong answers.
This is the fundamental challenge of agent operations: semantic failures that succeed technically. The API request completes successfully, but the agent confidently outputs incorrect information. Traditional APM tools miss this entirely because they monitor infrastructure, not intelligence.
Operations is one of the four pillars explored in The Agent Thesis, alongside architecture, economics, and security.
The failure taxonomy for agents expands beyond crashes:
| Category | Traditional Failure | Agent Failure |
|---|---|---|
| Reliability | Service unavailable | Latency spikes, format errors |
| Quality | N/A | Hallucinations, reasoning errors, context loss |
| Safety | N/A | Toxicity, PII leakage, prompt injection |
The most dangerous failure mode is silent degradation. When an agent hallucinates or loses track of a multi-step workflow, operational status stays green while business outcomes suffer. This dramatically increases Mean Time to Detection - potentially thousands of bad outputs before anyone notices.
The operational playbook must pivot from availability monitoring to semantic monitoring: confidence scores, quality drift, and LLM-as-Judge evaluations.
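As a sketch of what that looks like in code, the check below scores one response for faithfulness with an LLM-as-Judge call; the judge prompt, the 0-1 rubric, the 0.85 threshold, and the `call_judge_model` callable are assumptions, not a specific vendor API.

```python
# Hypothetical LLM-as-Judge check: score an agent answer for faithfulness
# against its retrieved context. `call_judge_model` stands in for whatever
# LLM client is in use; the JSON rubric and threshold are assumptions.
import json

JUDGE_PROMPT = """You are a strict evaluator. Given CONTEXT and ANSWER, return
JSON of the form {{"faithfulness": <score from 0.0 to 1.0>, "reason": "<one sentence>"}}.
CONTEXT: {context}
ANSWER: {answer}"""

FAITHFULNESS_SLO = 0.85  # tune per use case


def check_faithfulness(context: str, answer: str, call_judge_model) -> dict:
    """Score one agent response and flag it if the judge score breaches the SLO."""
    raw = call_judge_model(JUDGE_PROMPT.format(context=context, answer=answer))
    verdict = json.loads(raw)  # assumes the judge returns pure JSON
    verdict["breaches_slo"] = verdict["faithfulness"] < FAITHFULNESS_SLO
    return verdict  # emit to the metrics pipeline alongside the request trace
```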
The Agent SLA Blueprint
Redefining Reliability
For a traditional API, 99.9% uptime is the gold standard. For goal-based agents, uptime is meaningless if task completion is 70%.
Agent SLAs must be outcome-centric. The core metric is Task Success Rate, not availability.
The Three Pillars of Agent SLIs
1. Performance SLIs
- P50/P95/P99 Latency (percentiles, not averages)
- Throughput (requests per minute)
- Time-to-First-Token (for streaming responses)
2. Quality SLIs
- Task Success Rate (the core metric) - see evaluation frameworks
- Faithfulness score (inverse of hallucination rate)
- Context relevance (for RAG systems)
- Tool success rate (external API calls)
3. Financial SLIs
- Token consumption per request
- Cost Per Task (CPT)
- Budget utilization (against caps)
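To make the three pillars concrete, one common pattern is to emit a single per-request record carrying all three and let the metrics pipeline aggregate it; the schema below is illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass
class AgentSLIRecord:
    """One request's worth of SLI data across the three pillars (illustrative schema)."""
    # Performance
    latency_ms: float
    time_to_first_token_ms: float | None  # streaming responses only
    # Quality
    task_succeeded: bool
    faithfulness: float                   # 0-1, from LLM-as-Judge or human label
    tool_calls_ok: int
    tool_calls_total: int
    # Financial
    input_tokens: int
    output_tokens: int
    cost_usd: float


def cost_per_task(records: list[AgentSLIRecord]) -> float:
    """Total spend divided by tasks attempted; swap the denominator for successes if preferred."""
    return sum(r.cost_usd for r in records) / max(len(records), 1)
```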
P95 Latency Benchmarks
Average latency is misleading for LLM systems because a few slow outliers skew the mean far from what users experience. Anchor SLOs to percentiles:
| Use Case | P50 Target | P95 Target |
|---|---|---|
| Simple Q&A (reactive) | <500ms | <1,000ms |
| Complex RAG (goal-based) | <2,000ms | <4,000ms |
| Multi-agent orchestration | <3,000ms | <6,000ms |
Critical insight: Agentic workflows impose cumulative latency penalties. An agent orchestrating RAG + reasoning + tool calls + summarization combines the latency of every step. If users tolerate 4 seconds total, you cannot afford 4 seconds at each stage. Set internal component SLOs much tighter (e.g., <500ms for RAG retrieval) to keep end-to-end P95 acceptable.
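A back-of-the-envelope budget check makes the point; the four stages and their numbers below are illustrative assumptions, not benchmarks.

```python
# Illustrative latency budget for a sequential agent pipeline. Stage budgets
# are assumptions; the point is that they must sum (plus headroom) to the
# end-to-end P95 target rather than each consuming the full target.
P95_TARGET_MS = 4_000

stage_budgets_ms = {
    "rag_retrieval": 500,
    "reasoning": 2_000,
    "tool_calls": 1_000,
    "summarization": 300,
}

total = sum(stage_budgets_ms.values())   # 3,800 ms
headroom_ms = P95_TARGET_MS - total      # 200 ms left for network and orchestration overhead
assert total <= P95_TARGET_MS, "stage budgets exceed the end-to-end P95 target"
```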
Quality and Cost SLOs
Task Success Rate: Target >95%. This is the number that matters.
Faithfulness Score: Start with an 85-95% target. Use progressive threshold adjustment - increase targets by 2-5% each release cycle as the system matures.
Safety Metrics: Target 90-98% compliance for toxicity, bias, and PII avoidance. These cannot slip.
Cost Per Task: Audit rigorously against realized value. An agent that costs $0.10 per task while saving $50 per task is obviously ROI-positive. Track this ratio.
The Triple Constraint
Unlike traditional software where quality is fixed after deployment, agents trade off Quality, Latency, and Cost continuously. Reducing hallucinations may require a larger model (more cost) or complex retrieval (more latency).
The mandate: define SLOs that satisfy all three simultaneously. "Maintain 95% Task Success Rate at CPT below $0.05 and P95 latency below 2,000ms."
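Enforced in code, the mandate is a single gate that only passes when all three constraints hold; the sketch below hard-codes the example thresholds above.

```python
def meets_slo(task_success_rate: float, cost_per_task_usd: float, p95_latency_ms: float) -> bool:
    """Triple-constraint gate: all three thresholds (from the example SLO) must hold at once."""
    return (
        task_success_rate >= 0.95
        and cost_per_task_usd <= 0.05
        and p95_latency_ms <= 2_000
    )


# A candidate that trades latency for quality still fails the gate.
assert meets_slo(0.96, 0.04, 1_800)
assert not meets_slo(0.97, 0.04, 2_600)
```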
Incident Response for Non-Deterministic Systems
The Two Failure Types
Rapid triage requires distinguishing between:
Hard Crashes - Infrastructure issues. Service outages, resource exhaustion, external API failures, rate limiting. Detected by traditional error rate and latency monitoring. Platform/SRE owns this.
Quality Deviations - Semantic failures. The agent is available, but outputs are wrong, nonsensical, or malformed. These are the failure modes that kill agents and require specialized quality monitoring. ML Engineers own this.
| Failure Type | Detection | Alert Type | Owner |
|---|---|---|---|
| Hard Crash | Error rate >5% for 2 min | Page | Platform/SRE |
| Quality Deviation | Success rate <90% | Warning | ML Engineer |
| Safety Breach | Guardrail failure | Critical | Security/MLOps |
The 5-Step Triage Framework
Non-deterministic failures defy conventional exception-handling logic. This framework addresses unpredictable agent breakdowns:
Step 1: Detect via Anomaly
Traditional static thresholds don't work. Implement dynamic confidence thresholds that flag outputs when confidence deviates more than 2σ from the rolling average. Track context drift (tokens consumed over the session) - agents lose efficacy during long sessions, so force a checkpoint before the context window is exhausted.
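A minimal rolling-window detector, assuming a per-response confidence score is available; the window size, warm-up count, and plain mean/standard deviation are assumptions (production systems often prefer exponentially weighted statistics).

```python
from collections import deque
from statistics import mean, pstdev


class ConfidenceAnomalyDetector:
    """Flag outputs whose confidence deviates more than 2σ from a rolling average."""

    def __init__(self, window: int = 200, sigmas: float = 2.0):
        self.scores = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, confidence: float) -> bool:
        """Return True if this observation is anomalous relative to the rolling window."""
        anomalous = False
        if len(self.scores) >= 30:  # require enough history for a stable baseline
            mu, sigma = mean(self.scores), pstdev(self.scores)
            anomalous = sigma > 0 and abs(confidence - mu) > self.sigmas * sigma
        self.scores.append(confidence)
        return anomalous
```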
Step 2: Preserve Context
Capture the agent's working state at critical decision points (before API calls, agent handoffs) as lightweight JSON snapshots. Store reasoning chains - not just outcomes, but why decisions were made. This enables resuming from last known good state instead of restarting.
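A lightweight snapshot might look like the sketch below; the field names and the key-value `store` interface are assumptions.

```python
import json
import time
import uuid


def snapshot_agent_state(task_id: str, step: str, working_memory: dict,
                         reasoning_chain: list[str], store) -> str:
    """Persist a lightweight snapshot before a risky step (API call, agent handoff).

    `store` is any key-value sink (object storage, Redis, a table); its interface is assumed.
    """
    snapshot_id = f"{task_id}:{step}:{uuid.uuid4().hex[:8]}"
    payload = {
        "snapshot_id": snapshot_id,
        "task_id": task_id,
        "step": step,
        "timestamp": time.time(),
        "working_memory": working_memory,    # last known good state
        "reasoning_chain": reasoning_chain,  # why decisions were made, not just outcomes
    }
    store.put(snapshot_id, json.dumps(payload))
    return snapshot_id
```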
For production-grade workflows that must survive crashes and deployments without losing state, consider durable execution frameworks. Temporal automatically persists event histories and enables deterministic replay from any point—eliminating the "restart from scratch" problem entirely. See Temporal Deep Dive for the architectural patterns Netflix uses to run hundreds of thousands of workflows daily.
Step 3: Prevent Cascade
In multi-agent systems, isolation is key. Implement circuit breakers at every agent handoff. If upstream failure rate spikes, trip the circuit to prevent corrupted outputs from flowing downstream. Use message queues between agents as buffers.
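A minimal circuit breaker for a handoff might look like this; the failure threshold and cool-down are illustrative defaults.

```python
import time


class HandoffCircuitBreaker:
    """Simple circuit breaker guarding an agent-to-agent handoff (illustrative thresholds)."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit tripped, None while closed

    def allow(self) -> bool:
        """Block handoffs while the circuit is open; go half-open after the cool-down."""
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_after_s:
            self.opened_at = None  # half-open: let a trial request through
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Count consecutive failures and trip the circuit at the threshold."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # trip: stop corrupted output flowing downstream
```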
Step 4: Checkpoint Incrementally
For long-running processes (analyzing 50 documents), define transaction boundaries and save state after each logical unit completes. Recovery resumes from the last clean checkpoint, not the beginning.
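A sketch of the pattern, assuming an `analyze` function and a `checkpoint_store` with simple get/put semantics:

```python
def process_documents(documents: list[str], analyze, checkpoint_store, task_id: str) -> list:
    """Analyze documents one logical unit at a time, checkpointing after each.

    `analyze` and `checkpoint_store` are assumed interfaces. On restart, work
    resumes from the last clean checkpoint instead of from document zero.
    """
    state = checkpoint_store.get(task_id) or {"completed": 0, "results": []}
    for index in range(state["completed"], len(documents)):
        state["results"].append(analyze(documents[index]))  # one transaction boundary
        state["completed"] = index + 1
        checkpoint_store.put(task_id, state)                # durable after each unit
    return state["results"]
```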
Step 5: Escalate with Context
When escalating to humans, automatically pass partial results, confidence scores, and the full reasoning chain. The human should not need to reproduce the failure. For systematic human-in-the-loop architectures, see the HITL Firewall patterns.
On-Call Structure
ML engineers often end up with full-cycle responsibility, including on-call. Clear demarcation of ownership matters:
- Platform/SRE: Infrastructure, general availability, resource provisioning
- ML Engineers: Quality metrics, model drift, prompt degradation
- Security: Safety guardrails, adversarial attacks
Use AI-powered incident tools to accelerate root cause analysis - they correlate deployment history, config changes, and system anomalies faster than manual log diving.
PromptOps: Prompts as Code
Why This Matters
In production, prompts are not static suggestions - they are critical application logic. Untracked changes cascade into production issues, degrading quality across thousands of interactions without detection.
The PromptOps Mandate: Treat prompts with the same rigor as application code.
Version Control Integration
- Store prompts in Git alongside application code
- Maintain change history, timestamps, and authors
- Link each version to: target model, parameter configs (temperature, top-p), environment, and performance metrics from evaluation runs
Separation from code: For operational flexibility, manage prompts via external configuration systems. This enables runtime updates without redeploying the entire application.
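One way to structure such a record is sketched below; the field names, values, and the `config_client` interface are assumptions, not a specific tool's schema.

```python
# Illustrative prompt-version record stored outside the application
# (a config service or database) so it can change without a redeploy.
PROMPT_VERSION = {
    "prompt_id": "support-triage",
    "version": "2026-01-15.3",
    "git_commit": "<sha of the prompt file in the application repo>",
    "target_model": "<model name pinned for this prompt version>",
    "params": {"temperature": 0.2, "top_p": 0.9},
    "environment": "production",
    "eval_metrics": {"task_success_rate": 0.96, "faithfulness": 0.91},  # illustrative eval results
}


def load_active_prompt(config_client, prompt_id: str) -> dict:
    """Fetch the currently active version at runtime; `config_client` is an assumed interface."""
    return config_client.get(f"prompts/{prompt_id}/active")
```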
A/B Testing Non-Deterministic Systems
Testing agents is fundamentally harder than traditional software because the same prompt yields variable responses.
Minimum Detectable Effect (MDE): Calculate the smallest meaningful effect size based on current success rate and desired statistical power (80% power at 95% confidence). This determines required sample size.
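The standard two-proportion approximation turns an MDE into a required sample size; the sketch below assumes task success rate as the primary metric and a two-sided test.

```python
from statistics import NormalDist


def samples_per_variant(baseline_rate: float, mde: float,
                        power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-variant sample size to detect an absolute lift of `mde`
    in task success rate (two-sided test, standard two-proportion approximation)."""
    p1, p2 = baseline_rate, baseline_rate + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(((z_alpha + z_beta) ** 2) * variance / (mde ** 2)) + 1


# Detecting a 2-point lift from a 93% baseline needs roughly 2,200 tasks per arm.
print(samples_per_variant(0.93, 0.02))
```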
Guardrail Metrics: Every experiment must monitor latency, cost, and safety alongside the primary metric. A 2% accuracy improvement means nothing if it causes 50% latency regression.
Practical vs. Statistical Significance: High-variability outputs force large sample sizes, and large samples can flag tiny improvements (a 0.5% quality gain) as statistically significant. But if that gain requires a 50% cost increase, it is not practically justified. Mandate cost-benefit evaluation alongside significance testing.
Deployment Protocol
Canary Strategy: Release new prompts to 1-5% of traffic initially. Run canary and control side-by-side with identical inputs. Measure performance, quality, and cost differences.
Rollback Triggers:
- Task success rate drops below 90%
- P95 latency increases >20% vs control
- Safety guardrail breach detected
The system must support one-click rollback to last known good prompt version without full application redeploy.
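The triggers above reduce to a small predicate evaluated by the deployment pipeline; the metric dictionaries below are an assumed shape from the monitoring system.

```python
def should_rollback(canary: dict, control: dict) -> bool:
    """Evaluate the rollback triggers against canary vs. control metrics.

    Both dicts are assumed to carry 'success_rate', 'p95_latency_ms', and
    'guardrail_breaches' from the monitoring pipeline.
    """
    return (
        canary["success_rate"] < 0.90
        or canary["p95_latency_ms"] > control["p95_latency_ms"] * 1.20
        or canary["guardrail_breaches"] > 0
    )
```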
Monitoring and Alerting Configuration
Dashboard Layout
Structure observability around rapid triage:
- System Health Overview - SLO status (success rate, P95, daily cost)
- Performance & Resource - Latency distribution, token usage trends
- Agent Behavior & Quality - Confidence score distribution, hallucination trends, tool success breakdown
- Change Management - Metrics correlated with deployed model/prompt versions
- Cost Monitoring - Real-time spend vs budget caps
Alerting Thresholds
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Success Rate | >95% | <90% for 15 min | <85% for 5 min |
| P95 Latency (simple) | <1,000ms | >1,500ms for 10 min | >4,000ms |
| Provider Error Rate | <0.5% | >2% for 5 min | >5% for 2 min |
| Daily Budget | <100% | 70% reached | 100% reached |
| Confidence Score | >0.85 median | 2σ drop | Sustained drop |
Budget Enforcement
LLM costs are highly unpredictable - metered by tokens, subject to runaway loops.
Soft Caps: Alert at 70% and 90% of budget. Enables proactive investigation.
Hard Caps: At 100%, enforce boundaries - block requests, route to cheaper model, or shutdown. Prevents catastrophic overruns.
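A minimal enforcement sketch, with alerting delegated to an assumed `notify` hook and the hard-cap action left to the caller:

```python
def budget_decision(spend_today_usd: float, daily_cap_usd: float, notify) -> str:
    """Soft caps alert; the hard cap enforces a boundary. `notify` is an assumed alert hook."""
    utilization = spend_today_usd / daily_cap_usd
    if utilization >= 1.0:
        notify("CRITICAL: daily budget exhausted")
        return "block"  # or route to a cheaper model, or shut the agent down
    if utilization >= 0.90:
        notify("WARNING: 90% of daily budget consumed")
    elif utilization >= 0.70:
        notify("INFO: 70% of daily budget consumed")
    return "allow"
```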
Cost Anomaly Detection: Sudden usage spikes often indicate security issues - DoS attacks or prompt injection causing massive token output. Configure cost anomalies as Critical alerts linked to security response.
For multi-agent systems where agents transact across vendors, financial monitoring extends to inter-agent payments. The billing infrastructure for autonomous settlements—micropayments, escrow, and outcome-based pricing—is covered in Agent Billing & Crypto.
The Bottom Line
Agent operations requires a paradigm shift from deterministic SRE to probabilistic MLOps. The challenge is not uptime - it's detecting silent quality degradation while the system reports healthy.
The prescriptions:
- Define quantifiable SLOs across the triple constraint: P95 latency, task success rate (>95%), and cost per task
- Instrument semantically - distributed tracing, reasoning chain logging, dynamic confidence thresholds (2σ deviation alerts)
- Mandate PromptOps - prompts as versioned code, decoupled from application for fast rollbacks
- Enforce safe deployments - A/B testing with MDE calculations, canary rollouts, immediate rollback triggers
- Automate cost governance - real-time anomaly detection, hard budget caps linked to critical alerting
The gap between "working demo" and "24/7 production" is where most agent deployments fail. This playbook bridges that gap.
For translating these operational metrics into executive-ready business reporting, see the Agent Scorecard. For automated prompt optimization, systems can close the loop between observability and improvement.