What is Cost Per Completed Task?
Cost Per Completed Task (CPCT) is the true economic metric for AI agents: CPCT = (C_compute + C_tools) / P_success + (P_fail × C_human). It reveals that "cheap" models are often more expensive: a model costing $0.005 per run with a 50% success rate costs $3.00 per outcome once human remediation ($6.00 per failure) is factored in, versus $0.80 for a "premium" model at $0.50/run with 95% success. The three hidden taxes (retry, context bloat, and human remediation) make cost-per-token a vanity metric.
The CPCT Standard: Why Cost-Per-Token is a Vanity Metric
In the enterprise boardroom, "Cost Per Token" is the new "Hits Per Second"—a vanity metric that obfuscates the true health of the business.
When a CFO asks, "What does our AI workforce cost?", the engineering answer is typically: "o4-mini costs $0.10 per million tokens, while Claude Opus 4.5 is significantly more expensive."
This is the wrong answer. It frames AI as a software utility, like bandwidth or storage. But autonomous agents are not software utilities; they are digital labor. You do not measure a software engineer by the number of keystrokes they produce for a dollar; you measure them by the features they ship.
The prevailing obsession with minimizing inference costs has led enterprises into a "Race to the Bottom," optimizing for the cheapest model rather than the most capable one. This creates a false economy. A cheap model that requires five retries to parse a document is more expensive than a premium model that does it right the first time.
To build a sustainable AI strategy, we must adopt a new economic standard: Cost Per Completed Task (CPCT).
The Economics of Intelligence
The fundamental error in current AI budgeting is assuming that intelligence is a commodity. It is not. It is a probabilistic resource with varying degrees of reliability.
When you choose a "cheaper" model (e.g., o4-mini vs. GPT-5, or a self-hosted Llama 4 Scout vs. Claude Opus 4.5), you are trading capital (dollars per token) for operational risk (probability of failure).
In an agentic workflow—where an AI performs multi-step reasoning, tool use, and decision making—failures are not just "bad answers." They are process breaks. They result in loops, hallucinations, and, most expensively, human intervention.
The CPCT Formula
To visualize the true cost, we must expand the equation beyond the API bill.
CPCT = (C_compute + C_tools) / P_success + (P_fail × C_human)
Where:
- C_compute: The cost of tokens (inference) used in the attempt
- C_tools: The cost of any tool or API calls made during the attempt
- P_success: The probability (0.0 to 1.0) that the agent successfully completes the task without intervention
- P_fail: 1 − P_success, the probability that the task lands back on a human
- C_human: The cost of a human reviewing and fixing the failure
Dividing the attempt cost by P_success amortizes retries. When the agent gets a single attempt and any failure goes straight to a human, the division drops out and the formula reduces to C_compute + C_tools + (P_fail × C_human), the form used in the worked examples below.
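The formula is small enough to sanity-check in code. A minimal sketch in Python (names mirror the definitions above; the retries flag toggles between amortizing the attempt cost over the success rate and the single-attempt reduction):

```python
def cpct(c_compute: float, p_success: float, c_human: float,
         c_tools: float = 0.0, retries: bool = True) -> float:
    """Cost Per Completed Task.

    retries=True:  the agent re-attempts until it succeeds, so the
                   attempt cost is amortized over the success rate.
    retries=False: the agent gets one attempt, and any failure goes
                   straight to a human (the worked examples below).
    """
    attempt = c_compute + c_tools
    p_fail = 1.0 - p_success
    expected_compute = attempt / p_success if retries else attempt
    return expected_compute + p_fail * c_human
```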
The Inversion Principle
This formula reveals a critical insight MMNTM calls the Inversion Principle:
"As the complexity of a task rises, the cost of the model becomes the least significant variable in the economic equation."
A Concrete Example:
Consider a legal classification task.
- Model A (Cheap): Costs $0.01 to run. Success rate 80%.
- Model B (Smart): Costs $0.10 to run. Success rate 99%.
- Human Review: Costs $5.00 to fix a failure.
The Math:
- Model A CPCT: $0.01 + (0.20 × $5.00) = $1.01
- Model B CPCT: $0.10 + (0.01 × $5.00) = $0.15
The "expensive" model is actually more than 6x cheaper per outcome. The 10x savings on tokens were an illusion that cost the business nearly a dollar per task in remediation.
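Plugging the numbers into the cpct sketch above (tool costs assumed zero, one attempt per task):

```python
model_a = cpct(c_compute=0.01, p_success=0.80, c_human=5.00, retries=False)
model_b = cpct(c_compute=0.10, p_success=0.99, c_human=5.00, retries=False)
print(f"Model A: ${model_a:.2f} per outcome")  # $1.01
print(f"Model B: ${model_b:.2f} per outcome")  # $0.15
```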
The Three Hidden Taxes
Low-intelligence models impose three hidden taxes that inflate CPCT, often making them significantly more expensive than their "premium" counterparts despite the lower sticker price.
1. The Retry Tax
Autonomous agents are designed to self-correct. If a model tries to call a tool and fails (e.g., "Invalid JSON format"), it feeds the error back into the context window and tries again.
A low-capability model often lacks the reasoning depth to diagnose its own syntax errors. It enters a Retry Loop.
Scenario: An agent uses Llama 4 Scout ($0.21/1M tokens) to parse a complex PDF. It fails formatting 4 times before succeeding.
Cost Reality: You have paid for the input context five times, and for the latency of every attempt. Worse, each failed output and error message is appended to the context, so the prompt grows with every retry. The effective cost of the task has more than quintupled.
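To see the tax compound, price the loop out. A rough sketch; the token counts and the $0.85/1M output price are assumptions for illustration, not Llama 4 Scout's actual rate card:

```python
def retry_loop_cost(base_input: int, output_per_try: int, error_feedback: int,
                    attempts: int, in_price: float, out_price: float) -> float:
    """Total cost of a task that is re-submitted after each failure.

    Every retry re-sends the original context plus the accumulated
    failed output and error messages, so the input grows per attempt.
    """
    total, context = 0.0, base_input
    for _ in range(attempts):
        total += (context * in_price + output_per_try * out_price) / 1e6
        context += output_per_try + error_feedback  # transcript grows
    return total

one_shot = retry_loop_cost(8_000, 500, 50, attempts=1, in_price=0.21, out_price=0.85)
five_tries = retry_loop_cost(8_000, 500, 50, attempts=5, in_price=0.21, out_price=0.85)
print(f"{five_tries / one_shot:.1f}x the single-attempt cost")  # ~5.5x, not 5x
```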
2. The Context Bloat Tax
Smarter models are more concise. They can follow complex instructions in a single shot ("Zero-Shot").
Less capable models require "Few-Shot Prompting"—you must provide 5-10 examples of the desired output in the context window to get a reliable result.
This bloats the input prompt, and the example block is re-sent on every call. If you have to feed 4,000 tokens of examples to save roughly $0.50 per million tokens on the model price, the token bloat can outweigh the price gap entirely: you often end up paying more per call in aggregate bandwidth than if you had used a smarter model that understood the instruction natively, as the sketch below shows.
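A back-of-the-envelope check, using illustrative prices ($0.21/1M for the cheap model, $0.75/1M for the smarter one) and a hypothetical 500-token task:

```python
def input_cost(tokens: int, price_per_m: float) -> float:
    return tokens * price_per_m / 1e6

task = 500                                   # the instruction + document itself
few_shot = input_cost(task + 4_000, 0.21)    # examples re-sent on every call
zero_shot = input_cost(task, 0.75)           # no examples needed

print(f"few-shot:  ${few_shot:.6f} per call")   # $0.000945
print(f"zero-shot: ${zero_shot:.6f} per call")  # $0.000375
```

Here the 9x token bloat more than cancels the ~3.6x price advantage, and that is before counting the retries that few-shot prompting does not eliminate.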
3. The Human Remediation Tax (The Killer)
This is the single largest destroyer of AI ROI.
Consider a complex task: Reviewing a master service agreement (MSA).
- Model A (Claude Opus 4.5): Costs $0.50 per run. Success Rate: 95%.
- Model B (o3-mini): Costs $0.005 per run. Success Rate: 50%.
At face value, Model B is 100x cheaper. But consider the failure cost. When the model fails, a human lawyer ($300/hr) must spend several minutes fixing the output (assume $6.00 per failure).
- Model A (Opus 4.5) CPCT: $0.50 + (0.05 × $6.00) = $0.80
- Model B (o3-mini) CPCT: $0.005 + (0.50 × $6.00) ≈ $3.00
The "cheaper" model is 3.75x more expensive per outcome. The massive savings on tokens were an illusion that cost the business $2.20 per task in human labor.
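The remediation cost is the lever. A quick sensitivity sweep with the cpct sketch from earlier (same MSA numbers, varying only the human cost):

```python
for c_human in (1.00, 6.00, 25.00, 100.00):
    opus = cpct(c_compute=0.50, p_success=0.95, c_human=c_human, retries=False)
    mini = cpct(c_compute=0.005, p_success=0.50, c_human=c_human, retries=False)
    print(f"c_human ${c_human:>6.2f}: Opus 4.5 ${opus:.2f} vs o3-mini ${mini:.2f}")
```

Below roughly $1.10 of remediation cost, the cheap model wins. Above it, the premium model wins, and the gap widens with every dollar of human time.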
The Model Ladder Strategy
Enterprises should not default to the most expensive model, nor the cheapest. They should implement a Model Ladder architecture, routing traffic based on the "Price of Error."
Tier 1: The Utility Layer (o3-mini, Llama 4 Scout, Mistral Small 3.1)
- Pricing: ~$0.10 - $0.21 per 1M input tokens
- Role: Classification, simple extraction, routing, high-volume data cleaning
- Success Requirement: High for deterministic tasks
- Deployment: Use these to preprocess data before it hits the expensive reasoning engine
Tier 2: The Intelligence Layer (GPT-5, Claude Sonnet 4.5, Claude 4 Haiku)
- Pricing: ~$0.75 - $2.00 per 1M input tokens
- Role: General reasoning, planning, code generation, summarization
- Success Requirement: Critical
- Deployment: Default to these for standard user-facing interactions. They strike the balance on the "Cost-Quality Pareto Frontier."
Tier 3: The Reasoning Layer (Claude Opus 4.5, GPT-5.2 Thinking, Gemini 3 Pro)
- Pricing: ~$15.00+ per 1M input tokens
- Role: Architectural review, complex root cause analysis, high-stakes decision making (Legal/Medical)
- Success Requirement: Absolute
- Deployment: Use sparingly, but do not shy away from the cost. If an Opus 4.5 call costs $2.00 but prevents a $10,000 outage, that is a 5,000x return.
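Routing along the ladder can be mechanical. A minimal sketch reusing the cpct helper; the per-tier costs and success rates are hypothetical placeholders you would calibrate from your own evals:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    cost_per_run: float  # hypothetical average $ per task
    p_success: float     # hypothetical success rate for this task class

LADDER = [
    Tier("utility", 0.002, 0.90),
    Tier("intelligence", 0.020, 0.97),
    Tier("reasoning", 0.500, 0.995),
]

def route(c_human: float) -> Tier:
    """Pick the tier with the lowest expected CPCT for this price of error."""
    return min(LADDER, key=lambda t: cpct(t.cost_per_run, t.p_success,
                                          c_human, retries=False))

print(route(c_human=0.10).name)  # utility: a wrong label costs cents
print(route(c_human=5.00).name)  # intelligence: standard user-facing work
print(route(c_human=50.0).name)  # reasoning: high-stakes legal/medical
```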
The Bottom Line
We must stop treating AI cost optimization as a "procurement" problem (negotiating lower API rates) and start treating it as an engineering problem (optimizing for success rates).
As of late 2025, the gap between the "Smartest" model and the "Cheapest" model is widening, not shrinking.
- Stop counting tokens.
- Start measuring failures.
- Calculate your CPCT.
If you optimize for the cost of the input, you will hemorrhage margin on the cost of the output.
When agents execute thousands of outcome-based tasks daily, traditional payment rails break down. The shift to per-resolution pricing introduces fundamental infrastructure challenges around micropayments and cross-vendor settlements—explored in depth in Agent Billing & Crypto.
For the complete framework on budget governance and runaway cost prevention, see Agent Economics. For verification cost analysis, read The Hallucination Tax. And for human-in-the-loop patterns that conserve the expensive human resource, the rule is simple: route to review only when confidence is low.