Demo costs $0.10. Production costs $100. Same task. Three orders of magnitude.
This is the prototype-to-production cliff. A spreadsheet-analysis agent runs beautifully in the lab. Deploy it to real users—ambiguous inputs, error handling, accumulated context—and the economics collapse.
The hundred-dollar task is the default state of unoptimized agents. But production teams at Klarna, Uber, and Zapier have figured out how to run agents profitably. The patterns are replicable.
Where Tokens Actually Go
The assumption: cost scales with query complexity. The reality: your agent burns most of its budget before it reads a single word of user input.
The Fixed Costs (Every Turn)
System Prompt: The Persona Tax
"You are a senior reliability engineer..." + operational constraints + safety guardrails + output formatting. Maybe 1,500 tokens. Now multiply by every inference call. A 10-step workflow = 15,000 tokens just to "remember who it is."
Tool Schemas: The Capability Tax
Function definitions are verbose. An agent connected to CRM, vector database, Slack, and code interpreter might carry 5,000+ tokens of tool schemas. Agents connected to massive APIs can hit 97% of their context window on schemas alone—before processing anything.
Both taxes are paid on every turn, whether or not the agent uses the capability.
The Variable Costs (Compounding)
Context Accumulation
Turn 1: System prompt + user query. Turn 5: All of the above + 4 cycles of thought/action/observation. Turn 10: Everything, everywhere, all at once.
The context grows linearly per turn, but the bill compounds: every later turn re-sends the full weight of history, so cumulative cost over a run grows roughly quadratically. Long-running agents are expensive by construction.
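A back-of-the-envelope sketch makes the compounding concrete. All numbers below are illustrative assumptions, not measurements from any particular model or pricing tier.

```python
# Illustrative assumptions: fixed overhead, per-turn growth, and price are made up.
SYSTEM_PROMPT = 1_500        # persona + guardrails, resent on every call
TOOL_SCHEMAS = 5_000         # function definitions, resent on every call
TURN_HISTORY = 800           # avg thought/action/observation appended per turn
PRICE_PER_1K_INPUT = 0.003   # assumed $/1K input tokens

total_input = 0
for turn in range(1, 11):
    # Each call re-sends the fixed prefix plus everything accumulated so far.
    context = SYSTEM_PROMPT + TOOL_SCHEMAS + TURN_HISTORY * (turn - 1)
    total_input += context
    print(f"turn {turn:2d}: {context:6,d} input tokens this call")

print(f"run total: {total_input:,} input tokens "
      f"(~${total_input / 1000 * PRICE_PER_1K_INPUT:.2f} before output tokens)")
```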
Reasoning Overhead
Chain-of-thought prompting improves accuracy by telling the model to "think step-by-step." But output tokens cost 3-5x input tokens. A 500-token internal monologue can cost more than the actual tool execution.
The Hidden Multipliers
Retry Storms
Production systems retry on failure. If step 3 generates malformed JSON, the agent retries with error context. Up to 3 attempts per step = worst-case 3x cost on that node. If the model consistently misunderstands a schema, you get a "retry storm"—maximum retries at every step before failing.
Paradox: failed tasks cost 3-4x more than successful ones.
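A minimal sketch of why that happens: each retry re-sends the full context plus the accumulated error messages. The token counts are assumptions for illustration.

```python
def attempt_cost(base_context: int, error_history: int, output: int = 500) -> int:
    """Tokens consumed by one attempt: full context in, reasoning and output out."""
    return base_context + error_history + output

base = 8_000        # system prompt + schemas + prior turns (assumed)
error_msg = 1_200   # malformed output + validation error fed back (assumed)

clean_step = attempt_cost(base, 0)
# Worst case: three attempts, each one carrying every previous error in context.
retry_storm = sum(attempt_cost(base, error_msg * i) for i in range(3))

print(f"clean step:  {clean_step:,} tokens")
print(f"retry storm: {retry_storm:,} tokens "
      f"({retry_storm / clean_step:.1f}x, and the step may still fail)")
```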
Verbose Tool Outputs
Query a customer database, get a 10,000-token JSON dump. All of it goes into the context window as "observation." One unoptimized API response can cost more than the entire rest of the conversation.
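One common mitigation is to cap or summarize tool output before it ever enters the context window. A minimal sketch; the character budget is an assumption, and fetch_page is a hypothetical follow-up tool.

```python
import json

MAX_OBSERVATION_CHARS = 2_000  # assumed budget; tune per tool

def compact_observation(raw_result: object) -> str:
    """Bound tool output before appending it to the agent's context."""
    text = json.dumps(raw_result, separators=(",", ":"), default=str)
    if len(text) <= MAX_OBSERVATION_CHARS:
        return text
    # Truncate and tell the model how to get more, instead of dumping everything.
    return (text[:MAX_OBSERVATION_CHARS]
            + f'... [truncated, {len(text):,} chars total; '
              'call fetch_page(offset) for the rest]')

# Example: a 10,000-row dump becomes a bounded observation.
rows = [{"id": i, "status": "ok"} for i in range(10_000)]
print(len(compact_observation(rows)))  # ~2,100 chars instead of several hundred thousand
```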
Multi-Agent Overhead
Manager delegates to Worker. Context serializes and passes. Worker reports back. Hand-off duplicates context. Clarification burns more tokens. In poorly orchestrated systems, inter-agent chatter can account for half of the total token budget.
The Breakdown
| Component | Token Range | Impact |
|---|---|---|
| System prompt | 500-2,000+ per call | Fixed floor |
| Tool schemas | 1,000-20,000+ per call | Fixed floor |
| Context history | Linear per step | Compounding |
| Tool outputs | High variance | Unpredictable spikes |
| Reasoning (CoT) | Output tokens, billed at 3-5x input price | Multiplier |
| Retries | 1-3x per failure | Multiplier |
Caching: First Line of Defense
The most economically efficient token is the one never generated.
Prompt Caching (Native)
Anthropic and DeepSeek offer native prompt caching. The API caches the static prefix—system instructions, tool definitions—and charges a fraction to reuse it.
The math: 10,000-token schema costs ~$0.15 per call. With caching: ~$0.015 after the first call. That's a 90% discount on the "capability tax."
Strategic Breakpoints
Structure prompts in layers:
- Static: System instructions, persona, global tools. Rarely changes. Cache with long TTL.
- Semi-static: User-specific context, uploaded documents. Constant within session.
- Dynamic: Conversation history, latest query.
Place cache breakpoints after each layer. The schema load amortizes across all users. Multi-turn conversations with specific documents become cheap.
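A hedged sketch of the layered structure using Anthropic-style cache_control breakpoints. The model id, token counts, and tool definition are assumptions; consult the current API docs for exact cache limits and TTLs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumed model id; use whatever you deploy
    max_tokens=1024,
    # Layer 1 (static): tool schemas, cached across all users and sessions.
    tools=[
        {
            "name": "query_crm",
            "description": "Look up a customer record by email.",
            "input_schema": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
            "cache_control": {"type": "ephemeral"},  # breakpoint after the static layer
        }
    ],
    # Layer 2 (semi-static): persona plus the session's uploaded document.
    system=[
        {"type": "text", "text": "You are a senior support engineer..."},
        {
            "type": "text",
            "text": "<contents of the user's uploaded contract>",
            "cache_control": {"type": "ephemeral"},  # breakpoint after the session layer
        },
    ],
    # Layer 3 (dynamic): conversation history and the latest query, never cached.
    messages=[{"role": "user", "content": "Is clause 7 enforceable?"}],
)
print(response.usage)  # compare cache_creation_input_tokens vs cache_read_input_tokens
```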
Semantic Caching
Prompt caching optimizes input. Semantic caching skips inference entirely.
The pattern: intercept query → embed → vector search for similar historical queries → return cached response if similarity exceeds threshold.
For support agents and FAQ bots, 30-60% hit rates are achievable. Cost drops to near-zero for cache hits.
The Risk: False semantic positives. "What's my balance?" and "What was my balance yesterday?" are semantically close but factually different.
Mitigation: Scope cache by user/tenant ID. Use a cheap verifier model to confirm functional equivalence. Set aggressive TTLs for volatile data.
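A minimal semantic-cache sketch with an in-memory store. The embedding model, similarity threshold, and TTL are assumptions; production versions usually live in a gateway in front of the agent.

```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
THRESHOLD = 0.92       # similarity cutoff; too low invites false positives
TTL_SECONDS = 15 * 60  # aggressive TTL for volatile data

# Cache entries: (embedding, response, tenant_id, created_at)
_cache: list[tuple[np.ndarray, str, str, float]] = []

def lookup(query: str, tenant_id: str) -> str | None:
    """Return a cached answer only if same tenant, still fresh, and similar enough."""
    q = embedder.encode(query, normalize_embeddings=True)
    now = time.time()
    for emb, response, tenant, created in _cache:
        if tenant != tenant_id or now - created > TTL_SECONDS:
            continue
        if float(np.dot(q, emb)) >= THRESHOLD:   # cosine similarity (normalized vectors)
            return response
    return None

def store(query: str, response: str, tenant_id: str) -> None:
    q = embedder.encode(query, normalize_embeddings=True)
    _cache.append((q, response, tenant_id, time.time()))
```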
KV Cache Reuse (Self-Hosted)
For self-hosted models (vLLM, TGI), keep the KV cache of system prompts in GPU memory across requests. Not just latency optimization—it increases request density per GPU by 50%+. Serve twice the traffic on the same hardware.
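With vLLM, automatic prefix caching keeps KV blocks for shared prefixes resident in GPU memory, so the static system prompt is not recomputed per request. A sketch, with the model choice as an assumption:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching keeps KV blocks for shared prefixes in GPU memory,
# so the 1,500-token system prompt is computed once, not once per request.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", enable_prefix_caching=True)

SYSTEM = "You are a senior reliability engineer..."  # shared static prefix
params = SamplingParams(max_tokens=256)

# Both requests share the prefix; the second reuses its cached KV blocks.
outputs = llm.generate(
    [SYSTEM + "\nUser: summarize last night's incidents.",
     SYSTEM + "\nUser: draft the postmortem outline."],
    params,
)
for out in outputs:
    print(out.outputs[0].text[:80])
```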
Model Routing: The Arbitrage
The price gap between frontier (GPT-4o, Claude Sonnet) and commodity (GPT-4o-mini, Claude Haiku) creates arbitrage opportunity. Route tasks to the cheapest model that can handle them.
The Cascade Pattern
Cheap model first. If it fails or returns low confidence, escalate to expensive model.
This captures the "easy" 60-80% of queries—"reset my password," "what are your hours?"—at commodity prices. Premium spend reserved for complex reasoning.
Data: 60% cost reduction, 95-99% quality retention.
Trade-off: latency penalty for escalated cases (wait for cheap model to fail).
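A cascade sketch against an OpenAI-style chat API. The model names, the self-reported ESCALATE signal, and the escalation criterion are assumptions; real systems often use log-probabilities or a validator instead.

```python
from openai import OpenAI

client = OpenAI()
CHEAP, FRONTIER = "gpt-4o-mini", "gpt-4o"   # assumed model tiers

def answer(query: str) -> str:
    # Attempt 1: commodity model, asked to admit when it is unsure.
    draft = client.chat.completions.create(
        model=CHEAP,
        messages=[
            {"role": "system",
             "content": "Answer the user. If you are not confident, reply exactly: ESCALATE"},
            {"role": "user", "content": query},
        ],
    ).choices[0].message.content
    if draft.strip() != "ESCALATE":
        return draft
    # Attempt 2: frontier model, only for the hard residue.
    return client.chat.completions.create(
        model=FRONTIER,
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content
```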
The Judge Pattern
A lightweight classifier routes intent before attempting the task. "Simple FAQ" → Haiku. "Complex Reasoning" → Sonnet.
Avoids the cascade latency penalty. The judge itself can be tiny (Llama-8B, BERT classifier). ROI remains positive even with the extra call.
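A routing sketch where a small local zero-shot classifier picks the tier before any frontier tokens are spent. The intent labels and model ids are assumptions.

```python
from transformers import pipeline

# A small local judge: no frontier tokens are spent on routing itself.
judge = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ROUTES = {"simple FAQ": "claude-3-5-haiku-latest",        # assumed model ids
          "complex reasoning": "claude-sonnet-4-20250514"}

def pick_model(query: str) -> str:
    result = judge(query, candidate_labels=list(ROUTES))
    return ROUTES[result["labels"][0]]  # highest-scoring intent wins

print(pick_model("what are your opening hours?"))    # -> cheap tier
print(pick_model("reconcile these three ledgers"))   # -> frontier tier
```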
Self-Hosting Break-Even
When does self-hosting beat APIs?
Rule of thumb: >1M tokens/day on repetitive tasks.
API: variable cost, scales linearly. Self-hosted: fixed cost (GPU rental), marginal cost per token ≈ $0.
Once the GPU is rented, you can be "wasteful"—verbose reasoning, parallel sampling, extensive self-correction loops that would bankrupt you on an API.
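A rough break-even calculation. Every figure below is a placeholder; plug in your own API pricing, GPU quote, and achievable throughput.

```python
# Assumed figures: swap in your provider's pricing and your own GPU quotes.
FRONTIER_PRICE_PER_M = 30.0   # blended $/1M tokens (input + output) on a managed API
GPU_MONTHLY_COST = 1_000.0    # rented GPU capable of serving a 7B model, $/month

# Break-even: the daily volume where the API bill matches the fixed GPU bill.
break_even = GPU_MONTHLY_COST / 30 / FRONTIER_PRICE_PER_M * 1e6
print(f"break-even: ~{break_even / 1e6:.1f}M tokens/day")

for tokens_per_day in (0.5e6, 1e6, 5e6, 20e6):
    api_monthly = tokens_per_day * 30 / 1e6 * FRONTIER_PRICE_PER_M
    winner = "API" if api_monthly < GPU_MONTHLY_COST else "self-host"
    print(f"{tokens_per_day / 1e6:>5.1f}M tok/day: API ${api_monthly:>8,.0f}/mo "
          f"vs GPU ${GPU_MONTHLY_COST:,.0f}/mo -> {winner}")
```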
Case data: Fintech chatbot switched from GPT-4 to hybrid (Haiku for FAQs, self-hosted 7B for batch summarization). Monthly costs: $47,000 → $8,000.
Prompt Optimization: Compress the Signal
Schema Pruning
OpenAPI specs are verbose. Agents often don't need the full descriptions, examples, or nested types. Strip non-essential fields, abbreviate keys, flatten structures.
Format matters too. YAML uses indentation instead of closing brackets and quotes—each is a distinct token. YAML can be 30-40% leaner than JSON while remaining machine-parsable.
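A pruning sketch: strip descriptions and examples from a verbose schema and emit YAML instead of JSON. The schema and the list of droppable fields are illustrative assumptions.

```python
import json
import yaml  # PyYAML

# A typical verbose tool schema: long descriptions, examples, nesting.
schema = {
    "name": "getUser",
    "description": "Retrieves a full user record including profile, billing, "
                   "preferences and audit metadata. Example: getUser(email=...)",
    "parameters": {
        "type": "object",
        "properties": {
            "email": {"type": "string",
                      "description": "The primary email address of the user account",
                      "examples": ["jane@example.com"]},
        },
        "required": ["email"],
    },
}

def prune(node):
    """Drop fields the model rarely needs in order to pick and fill a tool call."""
    if isinstance(node, dict):
        return {k: prune(v) for k, v in node.items()
                if k not in ("description", "examples")}
    if isinstance(node, list):
        return [prune(v) for v in node]
    return node

pruned = prune(schema) | {"description": "Fetch a user by email."}  # keep one short line
print(len(json.dumps(schema)), "chars as verbose JSON")
print(len(yaml.safe_dump(pruned, sort_keys=False)), "chars as pruned YAML")
```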
Algorithmic Compression
LLMLingua uses a small model to identify and remove "non-essential" tokens. It analyzes perplexity (how surprising each token is) and drops filler words and redundant sentences.
Benchmarks show 20x compression with minimal accuracy loss, particularly for RAG applications with high document redundancy.
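A hedged sketch of LLMLingua usage; the constructor defaults and return keys vary across versions, so treat the API details as assumptions and check the library's docs.

```python
from llmlingua import PromptCompressor

# The default compressor uses a small causal LM to score token-level perplexity;
# exact defaults differ by LLMLingua version, so treat this as a sketch.
compressor = PromptCompressor()

retrieved_docs = ["...long retrieved passage 1...", "...long retrieved passage 2..."]
result = compressor.compress_prompt(
    retrieved_docs,
    instruction="Answer using only the provided context.",
    question="What is the refund policy for annual plans?",
    target_token=300,   # hard budget for the compressed context
)

print(result["compressed_prompt"][:200])
print(result["origin_tokens"], "->", result["compressed_tokens"], "tokens")
```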
The Pointer Pattern
Anti-pattern: pass data through the LLM. Agent queries database, gets 10,000 rows, passes to analysis tool. LLM "reads" and "writes" those rows. Catastrophic for cost.
Optimization: use reference IDs. The tool returns dataset_id: 12345 instead of the rows; the LLM just routes that ID to the analysis tool. The actual data transfer happens in backend code, invisible to the LLM.
Result: 90% token reduction in data-heavy workflows.
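A sketch of the pointer pattern with two hypothetical tools, query_sales and analyze_dataset. The rows live in a backend store; the model only ever sees the handle.

```python
import uuid

# Backend-side store: the LLM never sees these rows, only the handle.
_DATASETS: dict[str, list[dict]] = {}

def query_sales(region: str) -> dict:
    """Tool #1: run the query, park the rows server-side, return a pointer."""
    rows = [{"order": i, "region": region, "total": i * 9.5} for i in range(10_000)]
    dataset_id = str(uuid.uuid4())
    _DATASETS[dataset_id] = rows
    # The observation the LLM sees is a few dozen tokens, not 10,000 rows.
    return {"dataset_id": dataset_id, "row_count": len(rows)}

def analyze_dataset(dataset_id: str, metric: str) -> dict:
    """Tool #2: resolve the pointer in backend code and compute the answer."""
    rows = _DATASETS[dataset_id]
    if metric == "sum_total":
        return {"metric": metric, "value": sum(r["total"] for r in rows)}
    raise ValueError(f"unknown metric: {metric}")

# The LLM's job shrinks to routing the ID between tools:
handle = query_sales("EMEA")
print(analyze_dataset(handle["dataset_id"], "sum_total"))
```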
Workflow Architecture
ReAct vs Plan-and-Execute
ReAct (Reason-Act-Observe) is flexible but verbose. The model outputs Thought, Action, Observation at every step. Lots of "thinking aloud" for routine tasks.
Plan-and-Execute separates planning from execution. The agent generates a complete 5-step plan in one call. A deterministic worker executes it. The LLM re-engages only to resolve dependencies between steps.
Fewer round-trips. Less context accumulation. Batched reasoning upfront.
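A plan-and-execute sketch: one planning call produces a JSON plan, and a deterministic worker runs it. The planner model, tool registry, and plan format are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# Deterministic tool registry: execution happens in code, not in the model.
TOOLS = {
    "fetch_report": lambda args: {"rows": 420},        # placeholder implementations
    "summarize":    lambda args: {"summary": "..."},
    "send_email":   lambda args: {"status": "sent"},
}

def plan(goal: str) -> list[dict]:
    """One planning call replaces a Thought/Action/Observation loop per step."""
    raw = client.chat.completions.create(
        model="gpt-4o",  # assumed planner model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": ('Return a JSON object {"steps": [{"tool": str, "args": object}]}. '
                         f"Available tools: {list(TOOLS)}. No prose.")},
            {"role": "user", "content": goal},
        ],
    ).choices[0].message.content
    return json.loads(raw)["steps"]

def execute(steps: list[dict]) -> list[dict]:
    # The worker loop never calls the LLM; it just runs the plan in order.
    return [TOOLS[step["tool"]](step.get("args", {})) for step in steps]

results = execute(plan("Email me a summary of yesterday's sales report."))
```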
Chain of Agents
For massive context (500-page documents), feeding everything into a 1M window is expensive—quadratic attention costs.
Instead: break input into chunks. Worker agents process each chunk, pass compressed "summary state" to the next. Context window stays small and constant. Cost scales linearly, not quadratically.
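A chain-of-agents sketch. The chunk size, worker and manager models, and the 200-word state budget are assumptions.

```python
from openai import OpenAI

client = OpenAI()
CHUNK_CHARS = 12_000   # assumed chunk size; keep well under the worker's window

def chain_of_agents(document: str, question: str) -> str:
    chunks = [document[i:i + CHUNK_CHARS] for i in range(0, len(document), CHUNK_CHARS)]
    state = "No findings yet."
    for chunk in chunks:
        # Each worker sees one chunk plus the compressed state, never the whole document.
        state = client.chat.completions.create(
            model="gpt-4o-mini",   # cheap worker model (assumed)
            messages=[{
                "role": "user",
                "content": (f"Question: {question}\n"
                            f"Findings so far: {state}\n"
                            f"New excerpt:\n{chunk}\n"
                            "Update the findings in under 200 words."),
            }],
        ).choices[0].message.content
    # A final manager call turns the accumulated state into the answer.
    return client.chat.completions.create(
        model="gpt-4o",            # assumed manager model
        messages=[{"role": "user",
                   "content": f"Question: {question}\nFindings: {state}\nAnswer concisely."}],
    ).choices[0].message.content
```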
Protocol Selection
REST APIs return full "hydrated" objects. Agent calls getUser(), gets 50 fields when it needed the email.
GraphQL and MCP let agents request exactly the fields they need, so the response payload shrinks to the minimum. Every saved byte is a saved token.
Apollo data: switching from REST to GraphQL reduced token usage by 75% for data-fetching tasks.
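For illustration, a GraphQL request that selects exactly one field instead of the full hydrated object. The endpoint and schema are hypothetical.

```python
import requests

# Hypothetical endpoint and schema; the point is the field selection.
query = """
query GetUserEmail($id: ID!) {
  user(id: $id) { email }      # only the field the agent actually needs
}
"""
resp = requests.post(
    "https://api.example.com/graphql",
    json={"query": query, "variables": {"id": "12345"}},
    timeout=10,
)
print(resp.json())   # {"data": {"user": {"email": "..."}}} instead of 50 fields
```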
Operational Levers
Batch APIs
OpenAI and Anthropic offer batch endpoints: 50% discount for 24-hour turnaround.
Separate interactive tasks (chat) from background tasks (daily reports, data cleansing). Route non-urgent volume to the half-price tier.
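A hedged sketch of routing background work through OpenAI's batch endpoint: build a JSONL file of requests, upload it, and submit with a 24-hour completion window. The model choice and task contents are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# Build one JSONL line per non-urgent task (daily report summaries, data cleanup).
with open("nightly_jobs.jsonl", "w") as f:
    for i, report in enumerate(["report A text...", "report B text..."]):
        f.write(json.dumps({
            "custom_id": f"summary-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user",
                                   "content": f"Summarize:\n{report}"}]},
        }) + "\n")

# Submit at the half-price batch tier; results come back within 24 hours.
batch_file = client.files.create(file=open("nightly_jobs.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```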
Speculative Execution
Counterintuitive: parallel attempts can be cheaper than sequential retry chains.
If task success rate is 50%, running three parallel attempts and taking the first success avoids the expensive debugging loops. Parallel attempts are "fresh" (light context). Retry chains carry full error history (heavy context).
Trade width for depth.
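A speculative-execution sketch: launch several fresh attempts in parallel and keep the first one that passes a cheap validation check. The model, attempt width, and validator are assumptions.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

def is_valid(result: str) -> bool:
    """Cheap deterministic check; stands in for schema or unit-test validation."""
    return result.strip().startswith("{")

async def one_attempt(task: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",   # assumed model
        messages=[{"role": "user", "content": task}],
        temperature=1.0,        # diversity across attempts
    )
    return resp.choices[0].message.content

async def speculative(task: str, width: int = 3) -> str | None:
    # Every attempt starts from a light, fresh context: no accumulated error history.
    attempts = [asyncio.create_task(one_attempt(task)) for _ in range(width)]
    for finished in asyncio.as_completed(attempts):
        result = await finished
        if is_valid(result):
            for t in attempts:
                t.cancel()       # stop paying for the losers
            return result
    return None                  # all attempts failed; escalate or report

# asyncio.run(speculative("Return the order as JSON: 3 lattes, 1 scone."))
```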
Budget Enforcement
Token budgets per task: If the agent exceeds its allocation, force wrap-up or termination. Prevents infinite loops.
User quotas: Free tier → cheap models, strict caps. Enterprise → GPT-4, higher limits. Protect margins at the gateway.
Cost alerts: A spike in cost-per-task usually signals a bug—agent looping, tool returning verbose errors. Cost becomes an operational health signal.
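A sketch of the enforcement loop: a hard per-task token ceiling with a wrap-up threshold, plus a cost-per-task alert. The limits and the 3x alert factor are assumptions.

```python
class TokenBudget:
    """Hard per-task ceiling: force wrap-up near the limit, terminate past it."""
    def __init__(self, max_tokens: int = 50_000, wrap_up_fraction: float = 0.8):
        self.max_tokens = max_tokens
        self.wrap_up_at = int(max_tokens * wrap_up_fraction)
        self.used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> str:
        self.used += prompt_tokens + completion_tokens
        if self.used >= self.max_tokens:
            return "terminate"   # stop the loop; return best effort or an error
        if self.used >= self.wrap_up_at:
            return "wrap_up"     # inject a "finish in one more step" instruction
        return "continue"

def cost_alert(cost_per_task: float, baseline: float, factor: float = 3.0) -> bool:
    """Treat a 3x spike over baseline as an operational incident, not a billing footnote."""
    return cost_per_task > baseline * factor

budget = TokenBudget()
print(budget.record(9_000, 700))    # "continue"
```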
Case Studies
Klarna: AI assistant handles 2/3 of customer service chats. Resolves errands in 2 minutes vs 11 minutes. Fewer turns = less context = lower cost. Equivalent of 700 FTE.
Uber: Fine-tuned Llama/Mistral on Uber Eats menus. Matched GPT-4 performance for their domain at a fraction of inference cost. GenAI Gateway manages routing and policy enforcement.
Zapier: Deterministic filters before AI steps. "Only draft a reply if email is a customer complaint." Prevents millions of unnecessary generations.
The Verdict
The gap between a $100 task and a $0.10 task is engineering, not magic.
The hierarchy:
- Caching: Prompt caching for fixed costs. Semantic caching for repeated queries.
- Routing: Cheap models for easy tasks. Frontier for hard ones. Self-host when volume justifies.
- Compression: Schema pruning. Algorithmic compression. Pointer pattern for data.
- Architecture: Plan-and-execute over ReAct. Chain of agents for long context. GraphQL over REST.
- Operations: Batch APIs. Budget enforcement. Cost as health signal.
Tokens are a scarce resource. Treat them with the same rigor as database connections or memory allocations. The production teams that win aren't the ones with the smartest agents—they're the ones that made agency economically viable.