This is the canonical reference for building AI agents that work in production. Organized by layer, it provides a curated reading path through our research library.
How to use this guide:
- Start at Foundation if you're new to agent development
- Jump directly to a specific layer if you have targeted questions
- Each layer has a featured deep dive plus supporting articles
Before you orchestrate agents or worry about operations, you need to understand the foundation layer: how models work with external context, the tradeoffs of RAG, and the protocols connecting agents to tools.
Start here
RAG Is Oversold: The Gap Between Tutorial and Production
95% of RAG projects fail to reach production. The gap isn't infrastructure—it's retrieval accuracy, data processing, and reasoning. Naive RAG is obsolete; production requires rigorous engineering.
The Prompt DNA Hypothesis: Evolving Agent Instructions
What if we treated prompts like genetic code—subject to mutation, selection, and evolution? The best agent prompts aren't written. They're bred.
MCP: The Protocol That Won (For Now)
MCP solved the N×M integration crisis and achieved escape velocity through strategic open-sourcing and the Linux Foundation play. The de facto standard for AI connectivity—though not without costs.
The MCP Tax: When Standards Cost You 99% of Your Token Budget
The design decisions that grant MCP its universality—verbose schemas, data through context—create a compounding tax on tokens, latency, and model intelligence. Anthropic's own fixes prove the original architecture is broken.
The Probabilistic Stack: Engineering for Non-Determinism
LLMs break the fundamental assumption of software engineering: deterministic inputs produce deterministic outputs. New patterns required.
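One pattern that recurs throughout the probabilistic stack is validate-and-retry: treat every model response as a candidate, check it against a schema, and retry on failure instead of assuming determinism. A minimal sketch of that pattern; `call_model` is a hypothetical stand-in for whatever LLM client you use, not an API from the article:

```python
import json

MAX_RETRIES = 3

def call_model(prompt: str) -> str:
    """Placeholder for your LLM client call (hypothetical)."""
    raise NotImplementedError

def extract_order(prompt: str) -> dict:
    """Treat model output as probabilistic: validate it, retry on failure."""
    last_error = None
    for _ in range(MAX_RETRIES):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            # Schema check: required keys must be present with the right types.
            if isinstance(data.get("sku"), str) and isinstance(data.get("quantity"), int):
                return data
            last_error = f"missing or mistyped fields: {data}"
        except json.JSONDecodeError as exc:
            last_error = str(exc)
        # Tighten the instruction before the next attempt.
        prompt += "\nReturn ONLY valid JSON with string 'sku' and integer 'quantity'."
    raise ValueError(f"output failed validation after {MAX_RETRIES} attempts: {last_error}")
```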
Architecture decisions determine what your agent can do. Single agent or swarm? Stateless or stateful? Chat-based or graph-based? These choices compound.
Framework comparison
The Orchestration Decision: LangGraph vs AutoGen
Choosing the wrong agent framework costs months. LangGraph excels at production determinism. AutoGen excels at rapid prototyping. Here is when to use each, and why the answer is often both.
Agent Memory: From Stateless to Stateful AI
LLMs are stateless by design. Agents require state. The memory architectures—context management, vector stores, knowledge graphs—that transform amnesiacs into collaborators.
Swarm Patterns: When Agents Learn to Collaborate
Single agents hit ceilings. Multi-agent swarms break through them. Here are the coordination patterns separating toy demos from production systems.
The Graph Mandate: Why Chat-Based Agents Fail in Production
The "Chat Loop" is the "goto" statement of the AI era. 70-90% of enterprise AI projects stall in Pilot Purgatory. Graph architectures are the path to production.
The Durable Agent: Why Infrastructure Beats Prompts
A 15-minute task that crashes at 99% complete wastes $4.50 in compute. Temporal eliminates the Restart Tax and turns debugging into DVR replay.
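The Restart Tax is easy to reproduce on the back of an envelope. A sketch with an assumed compute rate; the $0.30/minute figure is illustrative, not from the article:

```python
# Restart Tax: compute lost when a long-running task crashes near completion
# and has to start over from scratch.
task_minutes = 15
progress_at_crash = 0.99      # crashed at 99% complete
cost_per_minute = 0.30        # assumed compute rate (illustrative)

wasted_cost = task_minutes * progress_at_crash * cost_per_minute
print(f"Compute wasted per crash: ${wasted_cost:.2f}")  # ~$4.46 under these assumptions

# With durable execution (checkpointing each step), a crash resumes from the
# last completed step instead of minute zero.
```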
The demo worked. Now ship it. This layer covers what breaks in production, how to see it breaking, and how to build systems that recover automatically.
Why pilots fail
Why 90% of AI Pilots Still Fail (And How to Beat the Odds)
Only 5-10% of enterprise AI initiatives escape pilot phase to deliver measurable ROI. The problem isn't the technology—it's data readiness, the performance illusion, and organizational deficits.
The 5 Agent Failure Modes (And How to Prevent Them)
Most AI agents fail silently in production. Here are the five failure modes killing your deployments—and the architecture patterns that prevent them.
Agent Observability: Monitoring AI Systems in Production
Evaluation ends at deployment. Observability begins. Distributed tracing, guardrails, and the monitoring stack that keeps production agents reliable.
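To make "distributed tracing for agents" concrete, the OpenTelemetry Python API is one common option. A minimal sketch; span names and attributes are illustrative, and the SDK/exporter setup is assumed to live elsewhere:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def call_model(prompt: str) -> str:
    """Placeholder for your LLM client call (hypothetical)."""
    return "stub response"

def run_agent_step(task_id: str, prompt: str) -> str:
    # Each agent step becomes a span, so a slow model response or a failed
    # tool call shows up in the same trace as the request that triggered it.
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("agent.task_id", task_id)
        span.set_attribute("agent.prompt_chars", len(prompt))
        response = call_model(prompt)
        span.set_attribute("agent.response_chars", len(response))
        return response
```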
The Self-Healing Agent: How AI Systems Learn to Fix Themselves
Static prompts in dynamic environments lead to performance decay. Here is the architecture for building agents that automatically analyze their failures and optimize themselves.
The Agent Operations Playbook: SRE for AI Systems
Traditional SRE fails with non-deterministic systems. Here are the SLAs, incident response patterns, and deployment strategies that work for production AI agents.
Agents that work but don't pay for themselves don't ship. Understanding unit economics separates production deployments from eternal pilots.
Unit economics
Agent Economics: The Unit Economics of Autonomous Work
Stop measuring cost per token. The metric that matters is Cost Per Completed Task. Here is the framework for measuring, optimizing, and governing the economics of AI agents.
The CPCT Standard: Why Cost-Per-Token is a Vanity Metric
Cost-per-token is the new "hits per second"—a vanity metric that obfuscates business health. The "cheap" model that fails 50% of the time costs 3.75x more than the premium alternative.
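The core of the CPCT argument is a one-line formula: divide the expected spend per attempt (model cost plus any cleanup of failures) by the rate at which attempts actually complete the task. A sketch with illustrative numbers, not the article's; the exact 3.75x figure depends on its specific cost assumptions:

```python
def cost_per_completed_task(model_cost: float, success_rate: float,
                            failure_cleanup_cost: float) -> float:
    """CPCT = expected spend per attempt / probability the attempt completes the task."""
    expected_cost_per_attempt = model_cost + (1 - success_rate) * failure_cleanup_cost
    return expected_cost_per_attempt / success_rate

# Illustrative numbers only.
cheap   = cost_per_completed_task(model_cost=0.02, success_rate=0.50, failure_cleanup_cost=0.50)
premium = cost_per_completed_task(model_cost=0.10, success_rate=0.95, failure_cleanup_cost=0.50)
print(f"cheap CPCT:   ${cheap:.3f}")    # $0.540
print(f"premium CPCT: ${premium:.3f}")  # $0.132
```

Under these assumed numbers the "cheap" model is roughly 4x more expensive per completed task, which is the shape of the argument even if the exact multiple differs.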
The Hallucination Tax: Calculating the True Cost of AI Errors
Every AI hallucination has a cost—lost trust, wasted time, incorrect decisions. Here's how to calculate yours and the architecture that minimizes it.
The Agent Scorecard: Translating Technical Metrics to Business ROI
Engineers track latency and tokens. Executives want ROI. Here is the framework for translating agent performance into board-ready business metrics.
Agents that can take actions can take wrong actions. The security layer isn't optional—it's the difference between a demo and something you'd let touch production data.
Threat model
The Agent Attack Surface: Security Beyond Safety
The shift from chat to agency creates a new threat model. AI Security differs from AI Safety. Prompt injection is unsolved—defense requires architectural containment, not prevention.
The Agent Safety Stack: Defense-in-Depth for Autonomous AI
Agents that take actions have different risk profiles than chatbots. Here is the defense-in-depth architecture: prompt injection defense, red teaming, kill switches, and guardrail benchmarks.
The HITL Firewall: How Human Oversight Doubles Your AI ROI
Full autonomy is a myth for high-stakes tasks. Smart thresholds with human review deliver 85% cost reduction at 98% accuracy. Here are the approval patterns that work.
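Most of the approval patterns the article describes reduce to threshold routing: the agent acts autonomously only when its confidence (or a verifier's score) clears a bar, and everything else goes to a human queue. A minimal sketch; the threshold value and the `Decision` shape are assumptions for illustration:

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD = 0.90   # illustrative; tune per task risk

@dataclass
class Decision:
    action: str
    confidence: float        # from a verifier model or a calibration step

def route(decision: Decision) -> str:
    """HITL firewall: auto-execute only high-confidence actions."""
    if decision.confidence >= APPROVAL_THRESHOLD:
        return "auto_execute"
    return "queue_for_human_review"

print(route(Decision(action="refund $12", confidence=0.97)))   # auto_execute
print(route(Decision(action="refund $900", confidence=0.71)))  # queue_for_human_review
```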
The Input Assurance Boundary: Treating Prompts Like SQL Injection
Prompt injection is not a bug. It is an architectural feature of LLMs. Security audits show 73% of systems are vulnerable. Safety is not a prompt. Safety is architecture.
'It seems to work' isn't a deployment criterion. Rigorous evaluation separates agents you trust from agents you hope work.
Understanding where the market is going helps you build for the right future. Vertical beats horizontal. Context beats capability.
Market thesis
Vertical Agents Are Eating Horizontal Agents
Harvey ($8B), Cursor ($29B), Abridge ($2.5B): vertical agents are winning. The "do anything" agent was a transitional form—enterprises buy solutions, not intelligence.
The Autonomous Revolution: AI Agents Rewriting Work
The workforce is evolving—literally. AI agents are no longer experimental tools but genetically optimized systems driving 50%+ of enterprise operations autonomously.
Solve Intelligence: The AI Operating System for Patent Law
Solve Intelligence exemplifies the vertical agent thesis—domain depth, proprietary fine-tuning, and workflow integration create moats that horizontal AI cannot replicate.
Why Legal AI Breaks Every Rule About Agent Adoption
In every vertical, small companies deploy AI faster than enterprises. Legal is the exception. Content moats and liability costs invert the landscape.
The State of Legal AI: When Research Takes Minutes and Arguments Write Themselves
Legal AI evolved from search engines to autonomous research partners. CoCounsel, Harvey, and the new wave are rebuilding the profession.
The Agent Ecosystem Map: A Buyer's Guide to Vendor Selection
The $7.6B agent market in three tiers: Foundational (Microsoft, Google), Orchestration (Kore.ai, Airia), and Vertical (Harvey, Devin). Vendor evaluation guide.
The Top 100 AI Agent Companies: A Strategic Directory
The definitive directory of 100 AI agent companies. Three tiers: Foundational platforms, Integration partners, and Vertical specialists for enterprise automation.
Voice AI is a distinct vertical with its own constraints: latency, streaming, turn-taking. A separate reading path for voice-first applications.
The 500ms rule
The 500ms Threshold: Why Latency Kills Voice AI
Voice AI has a hard latency ceiling. Exceed 500ms round-trip and users abandon. This shapes every architectural decision from model selection to interrupt handling.
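A useful way to internalize the 500ms ceiling is to write the round-trip as a budget and watch how quickly it gets spent. The component figures below are illustrative assumptions, not measurements from the article:

```python
# Voice round-trip latency budget (all figures illustrative).
BUDGET_MS = 500

pipeline_ms = {
    "network (both directions)": 60,
    "speech-to-text (streaming, endpoint detection)": 150,
    "LLM time-to-first-token": 200,
    "text-to-speech (first audio chunk)": 80,
}

total = sum(pipeline_ms.values())
print(f"total: {total}ms of a {BUDGET_MS}ms budget ({BUDGET_MS - total}ms headroom)")
# Every added hop (a tool call, a guardrail pass, a larger model) comes out of
# that remaining headroom, which is why latency drives architecture choices.
```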
Voice: The Universal API for Human-Computer Interaction
Voice is not a feature—it's an interface paradigm shift. The trajectory from CLI to Voice, and why getting turn management right matters more than raw speed.
ElevenLabs: The Voice Infrastructure Play
ElevenLabs pivoted from creative TTS tool to real-time voice infrastructure. At $3.3B valuation, they bet on becoming the "Voice OS" of the enterprise.