What is "Verification Determines Territory"?
Verification Determines Territory is the thesis that AI market leadership follows verification. Domains with objective measurement (code passes tests, citations check out, math proofs are correct) see rapid progress and clear winners. Domains where quality is subjective (strategy, creative writing) remain contested. Whoever controls the benchmark controls the territory.

March 9, 2024. Simon Willison published a blog post titled "The GPT-4 barrier has been satisfyingly broken." It landed at position 5 on Techmeme. Not position 1. The tech press hadn't processed what happened.
”"For over a year, GPT-4 has been the undisputed champion of the LLM space... Claude 3 Opus is the first model I've used that genuinely feels like a peer to GPT-4."
Five days earlier, Anthropic had released Claude 3. CNBC ran: "Google-backed Anthropic debuts Claude 3, claims it beats GPT-4." Position 5. Still not the lead story.
By March 28, Ars Technica made it explicit: "The king is dead: Claude 3 surpasses GPT-4 on Chatbot Arena for the first time."
The 15-month OpenAI hegemony ended not with a bang, but a benchmark.
This is about what happened next—and what it reveals about how AI markets actually work. The shift wasn't about raw intelligence. It was about verification. Once capability could be measured objectively, the market reshuffled in 12 months.
The Vibes Economy
To understand why March 2024 mattered, you have to understand what preceded it: 15 months of market leadership determined by vibes.
- ChatGPT Launch (November 2022): OpenAI captures consumer imagination. 40 related articles on Techmeme.
- GPT-4 Launch (March 2023): Peak dominance. "OpenAI = AI" becomes the default frame.
- Gemini Launch (December 2023): First credible challenger. Media skeptical.
- Claude 3 / GPT-4 Barrier Breaks (March 2024): Narrative shifts. Benchmark comparisons become news.
- Claude 3.5 Sonnet (June 2024): 49% SWE-bench vs GPT-4o at 33%. Developer migration begins.
- Orion Slowdown Story (November 2024): The Information reports OpenAI's momentum slowing. Permission to question the leader.
- DeepSeek Moment (January 2025): 49 related articles. Full fragmentation acknowledged.
In the Hegemony Era (November 2022 through February 2024), the market ran on what you might call the Vibes Metric: conversational fluency, brand ubiquity, cherry-picked demos, Chatbot Arena Elo ratings.
The problem with vibes is they don't compound. Chatbot Arena measures preference—which model sounds more helpful. It doesn't measure utility—whether the model can do the work.
A model can be a delightful conversationalist and a terrible worker.
The HumanEval Trap
Nowhere was this failure more acute than in coding.
The industry standard was HumanEval—164 self-contained Python problems. Reverse a string. Implement bubble sort. By early 2024, frontier models scored 85-90%.
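For scale, a HumanEval-style problem fits in a dozen lines. The sketch below is illustrative rather than an actual benchmark item, but it captures the shape: a docstring as the prompt, a short completion, and assertions that grade it automatically.

```python
# Illustrative HumanEval-style task (not an actual benchmark item).
# The model sees the signature and docstring and must fill in the body.

def reverse_string(s: str) -> str:
    """Return the input string reversed.

    >>> reverse_string("hello")
    'olleh'
    """
    return s[::-1]  # a one-line completion; frontier models handle this class of task easily


# Grading is a handful of assertions: the completion either passes or it doesn't.
assert reverse_string("hello") == "olleh"
assert reverse_string("") == ""
```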
| Benchmark | GPT-4 Score | What It Measures |
|---|---|---|
| HumanEval | ~90% | Self-contained Python problems |
| SWE-bench | <2% | Real GitHub issue resolution |
This created the HumanEval Trap: buyers believed AI was ready to replace software engineers. Then models hallucinated libraries, overwrote critical files, and failed to understand codebase context in real deployments.
The delta between perceived capability (90%) and actual utility (under 2%) signaled that LLMs were excellent autocomplete engines but functionally illiterate at software engineering.

The Birth of SWE-bench
In October 2023, researchers introduced SWE-bench—2,294 task instances drawn from 12 Python repositories including Django, scikit-learn, and Flask.
To solve a SWE-bench task, a model is handed a real GitHub issue plus the repository it came from, and must produce a patch that applies cleanly and makes the repository's test suite pass, including the tests that verify the original fix.
This is the verification mechanism. Code has ground truth. Tests pass or they don't. The model can't talk its way out of a failing test suite.
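A minimal sketch of that mechanism, assuming a hypothetical local checkout and patch file rather than the real SWE-bench harness:

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str) -> bool:
    """Apply a model-generated patch and run the repository's tests.

    Sketch of a grading step: returns True only if the patch applies
    cleanly AND the full test suite passes. There is no partial credit.
    """
    # Step 1: apply the candidate patch; a malformed diff fails immediately.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False

    # Step 2: run the tests; the process exit code is the ground truth.
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return tests.returncode == 0
```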
In August 2024, OpenAI collaborated with the original authors to release SWE-bench Verified: 500 tasks manually validated to be solvable. This became the currency of technical competence.
The Claude Flip
The pivotal moment: June 2024, Claude 3.5 Sonnet.
Claude 3.5 Sonnet on SWE-bench Verified: 49% (June 2024)
This wasn't marginal. It was a 16-point lead over GPT-4o, and more than double Claude 3 Opus's ~22% score.
Developer tools responded immediately. Cursor and Replit integrated Sonnet as the default, citing context retention and complex refactoring capability. Technical buyers began swapping OpenAI API keys for Anthropic's.
By mid-2025, Menlo Ventures documented the result:
Enterprise API Market Share (2025)
| Provider | 2023 (Est.) | 2025 (Actual) |
|---|---|---|
| OpenAI | ~50% | 25% |
| Anthropic | ~10-15% | 32% |
| Google | Under 10% | 20% |
The market share flip coincides precisely with the period when Claude held a double-digit SWE-bench lead. Enterprise technical buyers tested models on their internal codebases. The model that passed more tests won the contract. Brand couldn't overcome verifiable performance.
Coding as Proxy for Trust
There's a subtler point here.
Even for non-coding tasks, enterprise buyers used coding benchmarks as a reliability heuristic. The logic: if a model can write bug-free Python—a strictly verifiable task involving logic, syntax, and planning—it's less likely to hallucinate when summarizing a legal contract or analyzing a financial report.
The Verification Heuristic: Buyers treat performance on verifiable tasks as a proxy for trustworthiness on unverifiable tasks. This is why coding benchmarks predict enterprise adoption even in non-technical domains.
Anthropic capitalized by positioning Claude as "reliable, steerable"—the alternative to OpenAI's black box. The SWE-bench score wasn't just a technical metric. It was a trust signal.
Wait. None of This Should Have Worked.
I need to pause because the story doesn't quite hold together.
If verification determines territory, OpenAI should have won coding. They had GPT-4 for 15 months. They had GitHub's code corpus—the world's largest. More compute, more engineers, earlier access to the key insight that language models excel at code.
Anthropic was a safety research lab. No code-specific product. Smaller training data. A fraction of OpenAI's compute budget.
And yet Anthropic won the coding market.
The verification thesis says "optimize for the test, win the territory." But OpenAI could have optimized for SWE-bench just as easily. They didn't. Why?
The answer: verification doesn't just enable optimization. It reveals what you were already optimizing for.
OpenAI optimized for consumer virality—the ChatGPT wow factor, the GPT-4 launch spectacle, the vibes. Anthropic optimized for enterprise reliability—Constitutional AI, refusal training, predictability.
When SWE-bench emerged, it didn't change what each company was doing. It measured what they'd been doing all along. And it turned out that reliable, predictable, steerable beats impressive, creative, surprising—when the output has to pass a test suite.
The benchmark revealed strategic DNA that was invisible before verification existed.
The Narrative Lag
Here's what the Techmeme data shows: narrative shifted 6+ months after technical parity.
Benchmark mentions in headlines increased roughly 16x from early 2023 to early 2025. But the "GPT-4 barrier" framing persisted for months after Claude 3 achieved parity. The Ars Technica "king is dead" headline appeared at Position 4—not Position 1.
The Orion Inflection: The narrative didn't break when challengers got better. It broke when OpenAI's pace visibly slowed. The Information's November 2024 story on Orion's diminishing returns hit Position 1. The market was waiting for permission to question the leader.
Technical reality leads. Narrative follows. The lag creates windows of opportunity—and vulnerability.
The Fragmentation Paradox
There's a case that "fragmentation" is overstated.
Despite the "no single leader" narrative, the Techmeme data shows OpenAI's share of coverage actually increased from 10.5% (2024) to 12.3% (2025). The Herfindahl-Hirschman Index for AI coverage remained above 5,000 across all three years—highly concentrated by any measure.
| Metric | Value | Note |
|---|---|---|
| OpenAI share of AI coverage | 12.3% | 2025, up from 10.5% in 2024 |
| HHI concentration index | >5,000 | Still highly concentrated |
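For reference, the HHI is simply the sum of squared market shares expressed in percentage points. The sketch below uses made-up shares, purely to show why a reading above 5,000 implies one dominant player.

```python
def hhi(shares_pct: list[float]) -> float:
    """Herfindahl-Hirschman Index: the sum of squared percentage shares.

    Ranges from near 0 (fully fragmented) to 10,000 (a single monopoly);
    antitrust guidelines treat anything above 2,500 as highly concentrated.
    """
    return sum(s ** 2 for s in shares_pct)


# Made-up shares for illustration only -- not the actual coverage breakdown.
print(hhi([70, 15, 10, 5]))    # 5250: one dominant player
print(hhi([25, 25, 25, 25]))   # 2500: four equal players
```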
Fragmentation is perception, not attention. The narrative of multiple credible competitors serves challengers—it's a strategic frame, not a statistical reality. OpenAI still dominates coverage even as it cedes market share.
The real fragmentation is happening at the domain level. Not "who leads AI?" but "who leads domain X?"
The Compression Problem
Here's where the thesis gets uncomfortable.
As of December 2025, Claude Opus 4.5 leads SWE-bench Verified at 80.9%. But the gap between first and fourth has collapsed to single digits. Gemini 3 Pro dominates LMArena (1501 Elo). GPT-5 claims "AGI-adjacent reasoning." DeepSeek V3.2 matches frontier performance at a fraction of the cost.
Claude won the 2024 window. By late 2025, that lead compressed into a knife fight where everyone is within striking distance.
This is what happens when verification becomes universal. Once everyone optimizes for the same test, the test stops differentiating. SWE-bench worked as a weapon when Anthropic was the only one aiming at it. Now that OpenAI, Google, and DeepSeek have all oriented training runs toward passing tests, the benchmark reveals less about real-world performance.
The frontier models are all good at SWE-bench. The question buyers actually care about—which one works for my codebase, my workflow, my edge cases—isn't answered by any public benchmark.
This is the paradox at the heart of verification: the more useful a benchmark becomes, the faster it gets gamed into uselessness.
Benchmarks as Weapons
Make no mistake: benchmarks are not neutral measurement. They are strategic weapons.
The Benchmark Arms Race: OpenAI co-authored SWE-bench Verified (August 2024)—defining the standard they'd compete on. Anthropic released its "Computer Use" benchmark to define agentic metrics. Enterprise customers build proprietary eval suites. They no longer trust public leaderboards; they trust their own data.
The LMArena gaming accusations (May 2025) and Gary Marcus's critiques of formal reasoning (October 2024) show benchmark skepticism has gone mainstream. But skepticism doesn't diminish strategic value—it intensifies the race to control the definition of quality.
Whoever defines the test defines the territory.
If you can create a benchmark that measures something you're good at, you've created a hill to climb that favors you. SWE-bench favored whoever could optimize for test-driven development. Companies that understood this earliest captured territory while OpenAI was still optimizing for "general intelligence."
The Verification Mechanism
To understand why verification determines territory, you need the technical mechanism.
Why Verification Creates Winners
| Dimension | Verifiable Domains | Unverifiable Domains |
|---|---|---|
| Ground Truth | Tests pass/fail | Subjective judgment |
| Training Signal | Dense (compiler feedback) | Sparse (human preference) |
| Self-Correction | Automated (retry until pass) | Manual (human review) |
| Progress Rate | Exponential | Linear |
| Clear Leader? | Yes (clear winners emerge) | No (remains contested) |
In verifiable domains, models can run test-time compute loops: generate a solution, run the test, observe the error, iterate. This is the mechanism behind "Thinking" models (o1, o3, Gemini Thinking)—they verify their own work before presenting it.
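A minimal sketch of that loop, with `generate_candidate` and `run_tests` as hypothetical stand-ins for the model call and the test harness:

```python
def solve_with_verification(task, generate_candidate, run_tests, budget=5):
    """Generate-verify-retry loop: the core of test-time compute.

    `generate_candidate(task, feedback)` and `run_tests(candidate)` are
    hypothetical stand-ins; `run_tests` is assumed to return
    (passed: bool, error_log: str).
    """
    feedback = None
    for _ in range(budget):
        candidate = generate_candidate(task, feedback)
        passed, error_log = run_tests(candidate)
        if passed:
            return candidate      # verified before being presented
        feedback = error_log      # dense signal: exactly which test failed and how
    return None                   # budget exhausted with no verified answer
```

Each failed attempt feeds its error log into the next generation; the budget caps how much test-time compute the loop is allowed to spend.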
In unverifiable domains, there's no automated signal to retry. A strategy memo has no compiler. "Is this good strategy?" is subjective. Without verification, there's no hill-climbing algorithm for quality.
This is why progress in coding feels exponential while progress in creative writing feels slower. It's not that models are worse at writing—it's that we can't measure "good writing" well enough to optimize for it.
For the technical details on how verification drives agent capability, see Building Agent Evals. For why this matters at the platform level, see The Context Aggregator.
What This Means

Your Evals Are a Trade Secret
Harvey, Abridge, Cursor—none of them use public benchmarks for model selection. They've built proprietary evaluation suites tuned to their failure modes.
When public benchmarks compress (everyone scores 75-80%), the only differentiation is how a model fails on your edge cases. That knowledge is a moat. If you're sharing benchmark results publicly, you're training competitors on where to optimize.
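A proprietary suite can start small. The sketch below assumes a hypothetical JSONL file of internal cases and a caller-supplied `model_fn`; the output that matters is pass rates bucketed by failure mode, the knowledge described above as a moat.

```python
import json

def run_private_evals(model_fn, cases_path="internal_evals.jsonl"):
    """Score a model on internal edge cases, bucketed by failure mode.

    `model_fn` is a caller-supplied function (prompt -> output). The file
    name and its {"prompt", "must_contain", "failure_mode"} schema are
    illustrative, not any vendor's format.
    """
    tallies = {}  # failure_mode -> [passed_count, total_count]
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)
            output = model_fn(case["prompt"])
            passed = case["must_contain"] in output
            bucket = tallies.setdefault(case["failure_mode"], [0, 0])
            bucket[0] += int(passed)
            bucket[1] += 1
    # Pass rate per failure mode: the edge-case map competitors never see.
    return {mode: ok / total for mode, (ok, total) in tallies.items()}
```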
The Sleeper Bet Is Eval Infrastructure
Everyone's investing in model companies. The smarter bet: whoever builds the testing substrate shapes what gets optimized.
SWE-bench was created by a small academic team. It redirected billions in training compute. The companies building domain-specific evaluation infrastructure—Braintrust, Langfuse, Vals.ai—are underpriced relative to their influence. They're not measuring models. They're defining what good means.
Unverifiable Domains Are the Last Moat
The contrarian take: maybe you don't want verification in your domain.
If quality can be objectively measured, you're in a race to the bottom. Coding has already commoditized: every frontier model is "good enough." The domains where AI can't be evaluated (strategy, creativity, taste) are where human judgment remains load-bearing.
The structural conclusion isn't just "control the definition of best." It's that defining best accelerates commoditization. The winners capture territory before verification arrives—then move to the next unverifiable frontier.
The Context Aggregator
Everyone assumes one AI agent platform will dominate. But there's no universal way to verify if AI output is 'good'—legal has citations, code has tests, general work has human judgment. The market fragments into specialized empires, not one winner.
Building Agent Evals: From Zero to Production
Why 40% of agent projects fail: the 5-level maturity model for production evals. Move beyond SWE-bench scores to measure task completion, error recovery, and ROI.
Vertical Agents Are Eating Horizontal Agents
Harvey ($8B), Cursor ($29B), Abridge ($2.5B): vertical agents are winning. The "do anything" agent was a transitional form—enterprises buy solutions, not intelligence.
RLVR: When Verification Became the Training Signal
How 2025's shift from RLHF to RLVR changed model training, created jagged intelligence, and unlocked test-time compute. The paradigm that replaced human feedback.