
The Third Era of AI Coding Is an Operations Problem

Cursor reports 35% of PRs from autonomous agents and 15x growth. The models are ready. The factory floor — observability, economics, verification — is not.

Greg Salwitz
12 min read
#AI Agents #Software Development #Agent Operations #Cloud Agents #Developer Tools

Cursor's CEO describes the future correctly. He undersells the infrastructure it requires.


Michael Truell published a post last week that deserves more than a skim. The Cursor CEO laid out three eras of AI-assisted software development: Tab autocomplete, synchronous agents, and now cloud agents that run independently on their own VMs. The numbers behind the thesis are concrete.

| Metric | Value | Context |
| --- | --- | --- |
| Cursor PRs from agents | 35% | Merged autonomously |
| Agent usage growth | 15x | Year over year |
| Agent-to-Tab ratio | 2x | Flipped from 2.5x Tab-to-Agent in March 2025 |

The most revealing line: developers adopting cloud agents "spend their time breaking down problems, reviewing artifacts and code, and giving feedback." They write almost no code themselves.

Truell is right about the trajectory. But the post describes the destination while underestimating the infrastructure needed to get there. A factory without a factory floor is just a warehouse full of machines running unsupervised.


Era Three Is an Organizational Shift, Not a Tool Upgrade

Tab autocomplete was a tool. It changed keystrokes. You typed less, but the job was the same: writing code, character by character, with a smarter autocomplete.

Synchronous agents were a workflow change. You stopped typing and started directing. But you still sat in the loop, one agent at a time, watching it work. The interaction model was conversational — prompt, response, correction, prompt. The bottleneck shifted from typing speed to context quality, but the developer remained the single thread of execution.

Cloud agents are neither. They're an organizational change. The developer becomes a factory manager — decomposing problems, dispatching agents, reviewing output. As MMNTM's Claude Code Superuser guide put it: "The role shift is profound: you move from writing code to orchestrating agents. Instead of typing syntax, you architect context."

Three Eras of AI-Assisted Development

| Era | Interaction Model | Human Role | Primary Constraint | Failure Mode |
| --- | --- | --- | --- | --- |
| Era 1: Tab | Keystroke completion | Writer | Typing speed | Wrong suggestion accepted |
| Era 2: Sync agents | Prompt-response loop | Director | Context quality | Context starvation, loop divergence |
| Era 3: Cloud agents | Dispatch and review | Factory manager | Operations infrastructure | Silent failures at scale |

The parallel to manufacturing is useful. Artisans → assembly lines → automated factories. Each transition eliminated skills, created new ones, and — critically — required entirely new management infrastructure. You don't run a factory with the same tools you used in the workshop.

Truell's post implies this but doesn't confront it. "There is a lot of work left before this approach becomes standard," he writes. That's an understatement. The gap between "agents can do this" and "agents do this reliably at scale" is where the actual engineering happens.


The Verification Bottleneck

Here's the math that Truell doesn't do. If 35% of merged PRs come from agents today, and the trajectory continues, you're looking at 75%+ within 18 months. Each cloud agent runs independently for hours, then returns with "logs, video recordings, and live previews rather than diffs."

Better than raw diffs? Absolutely. But someone still has to review it. And when 10 agents return 10 PRs simultaneously, review becomes the serialization bottleneck in a supposedly parallel system.

The Review Paradox: Cloud agents remove the bottleneck of synchronous development. But if every agent's output requires human review, you've just moved the bottleneck from "writing code" to "reading code." The throughput gain depends entirely on how much review you can automate.

This is the HITL firewall problem. The solution isn't "review everything" or "review nothing." It's smart threshold review:

  • Auto-approve (confidence > 85%): Tests pass, no security flags, diff within expected scope. Ship it.
  • Fast-track review (70-85%): Summarized diff, key decisions highlighted, reviewer spends 5 minutes not 30.
  • Full review (< 70%): Agent hit ambiguity, made architectural choices, or touched sensitive code. Human reads everything.
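The tiers above reduce to a small routing function. A minimal sketch, assuming a hypothetical `AgentPR` record that carries a confidence score and the gating signals named in the bullets (none of these names come from Cursor's product):

```python
from dataclasses import dataclass

@dataclass
class AgentPR:
    """Hypothetical summary of a PR returned by a cloud agent."""
    confidence: float        # agent/verifier confidence, 0.0 to 1.0
    tests_pass: bool
    security_flags: int      # count of flagged findings
    in_expected_scope: bool  # diff stayed within the task's declared scope

def route_review(pr: AgentPR) -> str:
    """Route a PR into one of three review tiers."""
    # Hard gates: any failed gate forces full review regardless of confidence.
    if not pr.tests_pass or pr.security_flags > 0 or not pr.in_expected_scope:
        return "full_review"
    if pr.confidence > 0.85:
        return "auto_approve"
    if pr.confidence >= 0.70:
        return "fast_track"
    return "full_review"
```

The hard gates matter as much as the thresholds: a 95%-confidence PR that touches security-flagged code still gets a human reading everything.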

Organizations with HITL strategies are 2x more likely to achieve 75%+ cost savings compared to fully autonomous deployments. The "fully autonomous" dream sounds efficient until you're triaging 15 broken PRs on Monday morning.

The AI-Assisted Engineering Playbook captures the underlying dynamic: "The leverage comes from combining. A senior engineer directing AI moves faster than either alone. A junior engineer without decomposition and review skills gets AI-generated chaos — faster." Cloud agents amplify this. They don't resolve it.


The Factory Requires Operations

Truell concedes one line to the central unsolved problem: "At industrial scale, a flaky test or broken environment that a single developer can work around turns into a failure that interrupts every agent run."

This deserves more than a line. It's the whole game.

Agents don't fail like software. They fail like employees — doing technically correct work that produces wrong outcomes. An API call returns 200 OK, but the agent confidently generates incorrect logic. The tests pass because the tests don't cover that edge case. The PR gets merged. Three weeks later, someone notices.

Traditional SRE vs Agent Operations

| Approach | What It Monitors | What It Misses | What Breaks |
| --- | --- | --- | --- |
| Traditional SRE | Uptime, latency, error rates | Semantic correctness of output | Agent returns wrong answer with 200 OK |
| Agent operations | Task completion, decision quality, behavioral drift | Nothing (in theory) | Tooling doesn't exist yet (in practice) |

Five known failure modes get worse when you multiply agents:

  1. Context starvation — agents lose critical context over long autonomous runs. A single developer re-reads the file. Ten parallel agents each independently lose track.
  2. Infinite loops — one agent in a retry loop costs money. Ten agents in retry loops cost ten times as much, simultaneously.
  3. Silent failures — "Your agent has been failing silently for weeks. No errors, no alerts — just wrong answers delivered with confident authority." Multiply by fleet size.
  4. Cascade failures — Agent A's output becomes Agent B's input. Bad output propagates.
  5. Fleet divergence — the novel one. Ten agents solving overlapping problems produce ten incompatible approaches. No individual PR is wrong. The codebase becomes incoherent.

The fleet divergence problem is unique to era three. A single synchronous agent produces a consistent codebase because it sees every change it makes. Ten parallel agents each see a snapshot from before any of them started. Merge conflicts are the easy case. Architectural drift is the hard one.
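The loop failure mode, at least, admits a cheap mechanical guard: a per-task budget that halts retries before they compound across a fleet. A sketch; the attempt and dollar limits are purely illustrative:

```python
class BudgetExceeded(Exception):
    """Raised when a task's retry or spend budget is exhausted."""

class RetryBudget:
    """Caps attempts and estimated spend for a single agent task,
    so a retry loop fails loudly instead of burning VM-hours."""

    def __init__(self, max_attempts: int = 3, max_cost_usd: float = 5.0):
        self.max_attempts = max_attempts
        self.max_cost_usd = max_cost_usd
        self.attempts = 0
        self.spent = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one attempt; raise rather than allow a runaway loop."""
        self.attempts += 1
        self.spent += cost_usd
        if self.attempts > self.max_attempts or self.spent > self.max_cost_usd:
            raise BudgetExceeded(
                f"halted after {self.attempts} attempts, ${self.spent:.2f} spent"
            )
```

The point is not the specific numbers but the failure behavior: an exhausted budget surfaces as an alertable event instead of a line item on next month's invoice.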

The Agent Operations Playbook prescribes what's needed: agent-specific SLIs (task completion rate, not just uptime), decision traces (reconstructing why an agent chose what it chose), and PromptOps (treating prompts as critical application logic with versioning, canary deploys, and rollback). None of this ships with current agent IDEs.
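To make "task completion rate, not just uptime" concrete, here is a minimal sketch of that SLI computed over a rolling window of run records. The record fields are assumptions for illustration; the key distinction is that `completed` means the outcome was accepted, not that the process exited cleanly:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class AgentRun:
    """One autonomous run. 'completed' means the task outcome was
    accepted (e.g. PR merged), not merely that the agent exited 0."""
    agent_id: str
    finished_at: datetime
    completed: bool

def task_completion_rate(runs: list[AgentRun], window: timedelta,
                         now: datetime) -> float:
    """SLI: fraction of runs inside the window whose outcome was accepted."""
    recent = [r for r in runs if now - r.finished_at <= window]
    if not recent:
        return 1.0  # no data: treat as meeting the SLI rather than paging
    return sum(r.completed for r in recent) / len(recent)
```

A traditional uptime probe would score every one of these runs healthy; this metric is what catches the 200-OK-wrong-answer failure mode.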


The Economics Nobody's Doing

Cloud VMs aren't free. Ten parallel agents running for three hours each means 30 VM-hours plus API costs plus retries plus human review time. The economics shift from cost-per-developer-hour to cost-per-completed-task — and most teams have no framework for the latter.

Cost Inversion: "Cheap" models that fail 50% of the time cost 3.75x more per completed task than premium alternatives.

The CPCT formula makes this concrete: CPCT = (C_compute + C_tools) / P_success + (P_fail × C_human). As task complexity rises, the cost of the model becomes the least significant variable. A failed three-hour cloud agent run wastes the VM time, the API tokens, and the human time spent reviewing output that shouldn't have been produced.
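The formula is easy to operationalize. A sketch with illustrative inputs — the dollar figures and success rates below are assumptions chosen to show the inversion, not data from the post:

```python
def cpct(c_compute: float, c_tools: float, p_success: float,
         c_human: float) -> float:
    """Cost per completed task:
    CPCT = (C_compute + C_tools) / P_success + (P_fail * C_human),
    with P_fail = 1 - P_success."""
    return (c_compute + c_tools) / p_success + (1 - p_success) * c_human

# Illustrative: a cheap model at 50% task success vs a premium model
# at 95%, with $60 of engineer time burned per failed run.
cheap = cpct(c_compute=1.00, c_tools=0.0, p_success=0.50, c_human=60.0)
premium = cpct(c_compute=6.00, c_tools=0.0, p_success=0.95, c_human=60.0)
# cheap = $32.00 per completed task; premium ≈ $9.32 — with these
# assumptions the "cheap" model costs roughly 3.4x more once failure
# and remediation are priced in.
```

Note what dominates: the human remediation term, not the model price. That is the sense in which "the cost of the model becomes the least significant variable."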

The model ladder strategy — routing 80% of agent subtasks to cheaper models while reserving premium models for complex decisions — achieves 75% cost reduction. But this requires task classification infrastructure. Current cloud agent implementations treat every task identically: spin up a VM, run the same model, hope for the best.
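The missing piece is the classifier in front of the ladder. A minimal sketch, where the complexity heuristic and model names are placeholders — a real system would use richer signals (spec ambiguity, files touched, a small classifier model):

```python
def classify_complexity(task: dict) -> str:
    """Crude placeholder heuristic for task complexity."""
    if task.get("touches_architecture") or task.get("files_changed", 0) > 20:
        return "complex"
    return "routine"

# Placeholder model names; in practice these map to actual model endpoints.
MODEL_LADDER = {
    "routine": "cheap-model",    # the ~80% of subtasks
    "complex": "premium-model",  # reserved for hard decisions
}

def pick_model(task: dict) -> str:
    """Route a task to a rung of the model ladder."""
    return MODEL_LADDER[classify_complexity(task)]
```

Even this crude version is more routing infrastructure than the spin-up-a-VM-and-hope approach the current implementations ship with.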

The three hidden taxes compound at scale:

  • Retry tax: Failed tasks that rerun, doubling compute costs
  • Context bloat tax: Agents that accumulate irrelevant context, degrading performance over time
  • Human remediation tax: The engineer hours spent fixing what agents got wrong

Without cost-per-completed-task tracking, teams will discover the economics retroactively — via an invoice.


The Succession Problem

Truell describes developers who "write almost 100% of their code" with agents. They "spend their time breaking down problems, reviewing artifacts and code, and giving feedback." This works when those developers learned to code without agents.

What happens to the next generation?

The Hollow Firm pattern is already visible in professional services. AmLaw 100 firms slowed hiring. McKinsey's headcount dropped 11% from 2022 to 2024. BCG added 80% fewer employees in 2024 than 2022. The thesis: AI handles junior work, so you need fewer juniors.

The supervision paradox: Effective review of agent output requires knowing where agents fail. That knowledge comes from having done the work yourself. If tomorrow's seniors never debugged manually, never traced through unfamiliar code without AI, never built the intuition for "this looks wrong" — the quality control mechanism breaks down.

Software engineering hasn't confronted this yet, but it will. If era three means juniors start their careers as agent managers rather than code writers, the skill they're missing isn't syntax — it's judgment. The ability to look at agent output and know, without running it, that the error handling is wrong. That the race condition will surface under load. That the abstraction will calcify in six months.

Manufacturing solved this with apprenticeship programs that rotated trainees through every station on the factory floor before promoting them to supervisors. Software development has no equivalent yet.


What the Factory Floor Actually Looks Like

The infrastructure for era three exists in fragments. No single platform provides all of it. A complete factory floor spans at least four layers, each discussed above: verification (threshold-based review instead of all-or-nothing), observability (agent-specific SLIs and decision traces), economics (cost-per-completed-task tracking and model routing), and fleet management (catching divergence before ten agents ship ten incompatible architectures).


The Models Are Ready. The Operations Aren't.

Truell is right. The third era is arriving. Agent usage grew 15x. A third of PRs at one of the most technically sophisticated companies in the world come from autonomous agents. The developer-to-agent ratio has already flipped.

But a factory without operations is just a room full of unsupervised machines. The teams that build the factory floor — agent observability, cost-per-task economics, verification infrastructure, fleet management — will compound the productivity gains Truell describes.

The rest will have expensive VM bills and a review backlog.