
Verification Determines Territory

The AI market reshuffled in 12 months. Not because models got smarter—but because we learned how to measure smart. Benchmarks are strategic weapons. Whoever defines the test owns the territory.

MMNTM Research · 12 min read
#AI Strategy · #Benchmarks · #Market Analysis · #Enterprise AI · #Anthropic · #OpenAI

What is "Verification Determines Territory"?

Verification Determines Territory is the thesis that AI market leadership follows verification. Domains with objective measurement (code passes tests, citations check out, math proofs are correct) see rapid progress and clear winners. Domains where quality is subjective (strategy, creative writing) remain contested. Whoever controls the benchmark controls the territory.

The benchmark battlefield: peaks formed by leaderboards, valleys shrouded in fog where verification hasn't arrived


March 9, 2024. Simon Willison published a blog post titled "The GPT-4 barrier has been satisfyingly broken." It landed at position 5 on Techmeme. Not position 1. The tech press hadn't processed what happened.

"For over a year, GPT-4 has been the undisputed champion of the LLM space... Claude 3 Opus is the first model I've used that genuinely feels like a peer to GPT-4."

Five days earlier, Anthropic had released Claude 3. CNBC ran: "Google-backed Anthropic debuts Claude 3, claims it beats GPT-4." Position 5. Still not the lead story.

By March 28, Ars Technica made it explicit: "The king is dead: Claude 3 surpasses GPT-4 on Chatbot Arena for the first time."

The 15-month OpenAI hegemony ended not with a bang, but a benchmark.

This is about what happened next—and what it reveals about how AI markets actually work. The shift wasn't about raw intelligence. It was about verification. Once capability could be measured objectively, the market reshuffled in 12 months.


The Vibes Economy

To understand why March 2024 mattered, you have to understand what preceded it: 15 months of market leadership determined by vibes.

ChatGPT Launch (November 2022): OpenAI captures consumer imagination. 40 related articles on Techmeme.

GPT-4 Launch (March 2023): Peak dominance. "OpenAI = AI" becomes the default frame.

Gemini Launch (December 2023): First credible challenger. Media skeptical.

Claude 3 / GPT-4 Barrier Breaks (March 2024): Narrative shifts. Benchmark comparisons become news.

Claude 3.5 Sonnet (June 2024): 49% SWE-bench vs GPT-4o at 33%. Developer migration begins.

Orion Slowdown Story (November 2024): The Information reports OpenAI momentum slowing. Permission to question.

DeepSeek Moment (January 2025): 49 related articles. Full fragmentation acknowledged.

In the Hegemony Era (November 2022 through February 2024), the market ran on what you might call the Vibes Metric: conversational fluency, brand ubiquity, cherry-picked demos, Chatbot Arena Elo ratings.

The problem with vibes is they don't compound. Chatbot Arena measures preference—which model sounds more helpful. It doesn't measure utility—whether the model can do the work.

A model can be a delightful conversationalist and a terrible worker.


The HumanEval Trap

Nowhere was this failure more acute than in coding.

The industry standard was HumanEval—164 self-contained Python problems. Reverse a string. Implement bubble sort. By early 2024, frontier models scored 85-90%.
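For reference, a HumanEval-style problem looks like the sketch below: a single self-contained function checked by a handful of assertions. This is an illustrative example in the benchmark's style, not an actual benchmark item.

```python
# Illustrative example in the style of a HumanEval problem (not an actual
# benchmark item): one self-contained function plus the kind of unit test
# used to grade it. Passing this says nothing about navigating a real codebase.

def reverse_words(sentence: str) -> str:
    """Return the sentence with its words in reverse order."""
    return " ".join(reversed(sentence.split()))

def check(candidate) -> None:
    assert candidate("hello world") == "world hello"
    assert candidate("a b c") == "c b a"
    assert candidate("single") == "single"

if __name__ == "__main__":
    check(reverse_words)
    print("All checks passed.")
```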

GPT-4 on HumanEval: ~90% (self-contained Python problems)

GPT-4 on SWE-bench: <2% (real GitHub issue resolution)

This created the HumanEval Trap: buyers believed AI was ready to replace software engineers. Then models hallucinated libraries, overwrote critical files, and failed to understand codebase context in real deployments.

The delta between perceived capability (90%) and actual utility (under 2%) signaled that LLMs were excellent autocomplete engines but functionally illiterate at software engineering.

Left: the solved puzzle of HumanEval. Right: the chaos of real codebases.


The Birth of SWE-bench

In October 2023, researchers introduced SWE-bench—2,294 task instances drawn from 12 Python repositories including Django, scikit-learn, and Flask.

To solve a SWE-bench task, a model is given a snapshot of the repository and the text of a real GitHub issue. It must produce a patch that resolves the issue, and the patch is judged by the repository's own test suite: the tests that reproduce the bug must now pass, and the tests that passed before must keep passing.

This is the verification mechanism. Code has ground truth. Tests pass or they don't. The model can't talk its way out of a failing test suite.
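A minimal sketch of that mechanism, assuming hypothetical repository paths and test IDs rather than the official SWE-bench harness: apply the model's patch to a checkout of the repository, run the designated tests, and score on nothing but the exit code.

```python
# Minimal sketch of a SWE-bench-style verifier (hypothetical inputs, not the
# official harness): apply a candidate patch, run the repository's tests,
# and report pass/fail. Plausible-looking code earns no partial credit.

import subprocess

def apply_patch(repo_dir: str, patch: str) -> bool:
    """Apply a unified diff to the checkout; False if it doesn't apply cleanly."""
    result = subprocess.run(
        ["git", "apply", "-"],
        cwd=repo_dir, input=patch, text=True, capture_output=True,
    )
    return result.returncode == 0

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the designated tests; only a clean exit counts as a pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def verify(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    """Ground truth: the issue's previously failing tests must now pass."""
    return apply_patch(repo_dir, patch) and run_tests(repo_dir, fail_to_pass)
```

In the real benchmark the harness also re-runs the tests that passed before the patch, so a fix can't simply break everything else.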

In August 2024, OpenAI collaborated with the original authors to release SWE-bench Verified: 500 tasks manually validated to be solvable. This became the currency of technical competence.


The Claude Flip

The pivotal moment: June 2024, Claude 3.5 Sonnet.

Claude 3.5 Sonnet on SWE-bench Verified (June 2024): 49%

GPT-4o: 33.2%

This wasn't marginal. A 16-point lead—more than doubling Claude 3 Opus's ~22% score and decisively outperforming GPT-4o.

Developer tools responded immediately. Cursor and Replit integrated Sonnet as the default, citing context retention and complex refactoring capability. Technical buyers began swapping OpenAI API keys for Anthropic's.

By mid-2025, Menlo Ventures documented the result:

Enterprise API Market Share (2025)

Provider      2023 (est.)      2025 (actual)
OpenAI        ~50%             25%
Anthropic     ~10-15%          32%
Google        under 10%        20%

The market share flip coincides precisely with the period when Claude held a double-digit SWE-bench lead. Enterprise technical buyers tested models on their internal codebases. The model that passed more tests won the contract. Brand couldn't overcome verifiable performance.


Coding as Proxy for Trust

There's a subtler point here.

Even for non-coding tasks, enterprise buyers used coding benchmarks as a reliability heuristic. The logic: if a model can write bug-free Python—a strictly verifiable task involving logic, syntax, and planning—it's less likely to hallucinate when summarizing a legal contract or analyzing a financial report.

The Verification Heuristic: Buyers treat performance on verifiable tasks as a proxy for trustworthiness on unverifiable tasks. This is why coding benchmarks predict enterprise adoption even in non-technical domains.

Anthropic capitalized by positioning Claude as "reliable, steerable"—the alternative to OpenAI's black box. The SWE-bench score wasn't just a technical metric. It was a trust signal.


Wait. None of This Should Have Worked.

I need to pause because the story doesn't quite hold together.

If verification determines territory, OpenAI should have won coding. They had GPT-4 for 15 months. They had GitHub's code corpus—the world's largest. More compute, more engineers, earlier access to the key insight that language models excel at code.

Anthropic was a safety research lab. No code-specific product. Smaller training data. A fraction of OpenAI's compute budget.

And yet Anthropic won the coding market.

The verification thesis says "optimize for the test, win the territory." But OpenAI could have optimized for SWE-bench just as easily. They didn't. Why?

The answer: verification doesn't just enable optimization. It reveals what you were already optimizing for.

OpenAI optimized for consumer virality—the ChatGPT wow factor, the GPT-4 launch spectacle, the vibes. Anthropic optimized for enterprise reliability—Constitutional AI, refusal training, predictability.

When SWE-bench emerged, it didn't change what each company was doing. It measured what they'd been doing all along. And it turned out that reliable, predictable, steerable beats impressive, creative, surprising—when the output has to pass a test suite.

The benchmark revealed strategic DNA that was invisible before verification existed.


The Narrative Lag

Here's what the Techmeme data shows: narrative shifted 6+ months after technical parity.

Benchmark mentions in headlines increased roughly 16x from early 2023 to early 2025. But the "GPT-4 barrier" framing persisted for months after Claude 3 achieved parity. The Ars Technica "king is dead" headline appeared at Position 4—not Position 1.

The Orion Inflection: The narrative didn't break when challengers got better. It broke when OpenAI's pace visibly slowed. The Information's November 2024 story on Orion's diminishing returns hit Position 1. The market was waiting for permission to question the leader.

Technical reality leads. Narrative follows. The lag creates windows of opportunity—and vulnerability.


The Fragmentation Paradox

There's a case that "fragmentation" is overstated.

Despite the "no single leader" narrative, the Techmeme data shows OpenAI's share of coverage actually increased from 10.5% (2024) to 12.3% (2025). The Herfindahl-Hirschman Index for AI coverage remained above 5,000 across all three years—highly concentrated by any measure.

OpenAI share of AI coverage: 12.3% in 2025, up from 10.5% in 2024

HHI concentration index: >5,000, still highly concentrated

Fragmentation is perception, not attention. The narrative of multiple credible competitors serves challengers—it's a strategic frame, not a statistical reality. OpenAI still dominates coverage even as it cedes market share.

The real fragmentation is happening at the domain level. Not "who leads AI?" but "who leads in domain X?"


The Compression Problem

Here's where the thesis gets uncomfortable.

As of December 2025, Claude Opus 4.5 leads SWE-bench Verified at 80.9%. But the gap between first and fourth has collapsed to single digits. Gemini 3 Pro dominates LMArena (1501 Elo). GPT-5 claims "AGI-adjacent reasoning." DeepSeek V3.2 matches frontier performance at a fraction of the cost.

Claude won the 2024 window. By late 2025, that lead compressed into a knife fight where everyone is within striking distance.

This is what happens when verification becomes universal. Once everyone optimizes for the same test, the test stops differentiating. SWE-bench worked as a weapon when Anthropic was the only one aiming at it. Now that OpenAI, Google, and DeepSeek have all oriented training runs toward passing tests, the benchmark reveals less about real-world performance.

The frontier models are all good at SWE-bench. The question buyers actually care about—which one works for my codebase, my workflow, my edge cases—isn't answered by any public benchmark.

This is the paradox at the heart of verification: the more useful a benchmark becomes, the faster it gets gamed into uselessness.


Benchmarks as Weapons

Make no mistake: benchmarks are not neutral measurement. They are strategic weapons.

The Benchmark Arms Race: OpenAI co-authored SWE-bench Verified (August 2024)—defining the standard they'd compete on. Anthropic released its "Computer Use" benchmark to define agentic metrics. Enterprise customers build proprietary eval suites. They no longer trust public leaderboards; they trust their own data.

The LMArena gaming accusations (May 2025) and Gary Marcus's critiques of formal reasoning (October 2024) show benchmark skepticism has gone mainstream. But skepticism doesn't diminish strategic value—it intensifies the race to control the definition of quality.

Whoever defines the test defines the territory.

If you can create a benchmark that measures something you're good at, you've created a hill to climb that favors you. SWE-bench favored whoever could optimize for test-driven development. Companies that understood this earliest captured territory while OpenAI was still optimizing for "general intelligence."


The Verification Mechanism

To understand why verification determines territory, you need the technical mechanism.

Why Verification Creates Winners

                   Verifiable domains              Unverifiable domains
Ground truth       Tests pass/fail                 Subjective judgment
Training signal    Dense (compiler feedback)       Sparse (human preference)
Self-correction    Automated (retry until pass)    Manual (human review)
Progress rate      Exponential                     Linear
Clear leader       Yes                             No, contested

In verifiable domains, models can run test-time compute loops: generate a solution, run the test, observe the error, iterate. This is the mechanism behind "Thinking" models (o1, o3, Gemini Thinking)—they verify their own work before presenting it.
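A minimal sketch of such a loop, with `generate_patch` and `run_tests` as hypothetical stand-ins for a model call and a test harness:

```python
# Minimal sketch of a test-time verification loop. generate_patch and
# run_tests are hypothetical stand-ins for a model call and a test harness;
# the dense pass/fail signal is what lets the model hill-climb.

def solve_with_verification(task, generate_patch, run_tests, max_attempts=5):
    feedback = ""
    for _ in range(max_attempts):
        # Ask the model for a candidate fix, including any prior test errors.
        patch = generate_patch(task, feedback)
        passed, error_log = run_tests(task, patch)
        if passed:
            return patch       # Verified solution: the test suite is the judge.
        feedback = error_log   # The failure becomes the next attempt's signal.
    return None                # No verified solution within the compute budget.
```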

In unverifiable domains, there's no automated signal to retry. A strategy memo has no compiler. "Is this good strategy?" is subjective. Without verification, there's no hill-climbing algorithm for quality.

This is why progress in coding feels exponential while progress in creative writing feels slower. It's not that models are worse at writing—it's that we can't measure "good writing" well enough to optimize for it.

For the technical details on how verification drives agent capability, see Building Agent Evals. For why this matters at the platform level, see The Context Aggregator.


What This Means

At the edge of mapped territory, the surveyor's shadow reaches into unmeasured space


Your Evals Are a Trade Secret

Harvey, Abridge, Cursor—none of them use public benchmarks for model selection. They've built proprietary evaluation suites tuned to their failure modes.

When public benchmarks compress (everyone scores 75-80%), the only differentiation is how a model fails on your edge cases. That knowledge is a moat. If you're sharing benchmark results publicly, you're training competitors on where to optimize.
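A proprietary suite does not need heavy infrastructure. A minimal sketch, with hypothetical cases and a stand-in `model` callable, is just a private table of your edge cases and the failure modes you score against:

```python
# Minimal sketch of a private eval suite (hypothetical cases, stand-in
# `model` callable). The moat is the cases themselves: your edge cases
# and failure modes, not the harness around them.

EVAL_CASES = [
    {"id": "date-edge-1",
     "prompt": "Parse '31/02/2025' as a date.",
     "must_include": "invalid", "must_not_include": "march 3"},
    {"id": "empty-input-1",
     "prompt": "Summarize the key obligations in this contract: ''",
     "must_include": "no contract text", "must_not_include": "the parties agree"},
]

def run_eval(model) -> float:
    """Score a model callable against the private cases; return the pass rate."""
    passed = 0
    for case in EVAL_CASES:
        output = model(case["prompt"]).lower()
        ok = (case["must_include"] in output
              and case["must_not_include"] not in output)
        passed += ok
    return passed / len(EVAL_CASES)
```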

The Sleeper Bet Is Eval Infrastructure

Everyone's investing in model companies. The smarter bet: whoever builds the testing substrate shapes what gets optimized.

SWE-bench was created by a small academic team. It redirected billions in training compute. The companies building domain-specific evaluation infrastructure—Braintrust, Langfuse, Vals.ai—are underpriced relative to their influence. They're not measuring models. They're defining what good means.

Unverifiable Domains Are the Last Moat

The contrarian take: maybe you don't want verification in your domain.

If quality can be objectively measured, you're in a race to the bottom. Coding has already commoditized: every frontier model is "good enough." The domains where AI can't be evaluated—strategy, creativity, taste—are where human judgment remains load-bearing.


The structural conclusion isn't just "control the definition of best." It's that defining best accelerates commoditization. The winners capture territory before verification arrives—then move to the next unverifiable frontier.

MMNTM Research · Dec 31, 2025
Verification Determines Territory: How Benchmarks Reshaped AI