In our LLM-as-Judge article, we showed how verification replaced $5,000 of human expert review with $10 of GPT-4 compute. In 2025, verification did something more fundamental: it replaced RLHF entirely.
The same "verifiability" that made LLM-as-judge scalable is now the bottleneck-remover for training. Objective rewards (math and code correctness) can't be gamed, enabling optimization runs roughly 10x longer than RLHF could sustain. This is Reinforcement Learning from Verifiable Rewards (RLVR), and it's the fourth stage in the production LLM stack.
Understanding RLVR is understanding how 2025's models—OpenAI's o1/o3, DeepSeek R1, Claude Opus 4.5—actually work. Andrej Karpathy's 2025 Year in Review frames this shift clearly: RLVR is where capability progress came from, not bigger models.
The Stack Evolution:
- 2020-2024: Pretraining (heavy) → SFT (light) → RLHF (light)
- 2025: Pretraining (lighter) → SFT (light) → RLVR (heavy)
Unlike SFT and RLHF (both comparatively light compute stages), RLVR trains against objective, non-gameable reward functions. This allows dramatically longer optimization runs, fed by compute that was originally intended for pretraining.
RLVR Gobbled the Pretraining Budget
2025's capability gains didn't come from scaling up model size. They came from scaling up RL runs. RLVR offers high capability/$ because verifiable rewards are ungameable, allowing deep optimization without the noise and drift of human feedback.
OpenAI's o1 (late 2024) was the first demonstration of an RLVR model. But o3 (early 2025) was the inflection point—where you could intuitively feel the difference. As Karpathy notes: "Running RLVR gobbled up the compute that was originally intended for pretraining."
The critical insight: Traditional RLHF hits a wall quickly. Human feedback is noisy, expensive, and gameable (models learn to sound helpful rather than be helpful). RLVR sidesteps this by using environments with objective outcomes. Did the code execute correctly? Did the math check out? These are binary, verifiable signals.
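A minimal sketch of what a verifiable reward looks like in practice (the function names are illustrative, not any lab's actual training code; `test_code` is assumed to raise on failure, e.g. via `assert`):

```python
import subprocess
import tempfile

def code_reward(generated_code: str, test_code: str) -> float:
    """Binary reward: 1.0 if the generated code passes the provided tests, else 0.0."""
    # Write the candidate solution and its tests into one temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        # In a real pipeline this runs inside a sandbox with resource limits.
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: exact match on the normalized final answer."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

There is no preference model in the loop, so there is nothing to flatter: the code runs or it doesn't, the answer matches or it doesn't.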
DeepSeek R1's paper (January 2025) demonstrated what this enables: spontaneous strategy development. The models weren't taught how to reason—they discovered strategies through optimization. Breaking down problems into steps. Backtracking when stuck. Self-correcting errors. These emerged from the reward landscape, not from supervised examples.
| | RLHF Era (2022-2024) | RLVR Era (2025) |
|---|---|---|
| Compute | ~90% pretraining, ~10% finetuning | ~70% pretraining, ~30% RLVR |
| Reward | Human preference (noisy, gameable) | Verifiable outcomes (objective) |
| Optimization | Short runs (convergence drift) | Long runs (ungameable signal) |
| Output | Sounds helpful, often superficial | Emergent reasoning strategies |
Technical detail: DeepSeek popularized GRPO (Group Relative Policy Optimization), which removed the need for a separate value function (critic) model—a memory bottleneck in standard PPO. By sampling multiple outputs for the same prompt and optimizing based on relative advantage within the group, GRPO enabled the massive scaling characteristic of 2025's RLVR runs.
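A simplified sketch of the group-relative advantage at the heart of GRPO, stripped of the policy-gradient loss and KL penalty that a full implementation needs (the epsilon and example group size below are arbitrary choices):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each reward within its own group.

    rewards: verifiable rewards for G completions sampled from the SAME prompt.
    No learned value function (critic) is needed -- the group mean is the baseline.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-6   # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: 8 samples for one math prompt, 3 of them verified correct.
group = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(group))
# Correct completions get positive advantage, incorrect ones negative;
# the policy gradient then upweights tokens from the positive-advantage samples.
```

Dropping the critic roughly halves the memory footprint per training step, which is part of why these runs could scale so far.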
The compute reallocation is real. Similar-sized models, dramatically longer RL training, measurably better reasoning.
Jagged Intelligence is a Feature, Not a Bug
Karpathy's framing is the key to understanding RLVR models: "We're not 'evolving animals,' we're 'summoning ghosts.'"
Everything about the LLM stack is different from biological intelligence—neural architecture, training data, optimization pressure. RLVR intensifies this. By optimizing against verifiable rewards in specific domains (math, code, logic puzzles), models spike in capability in the vicinity of those domains.
The result: "amusingly jagged performance characteristics—simultaneously a genius polymath and a confused grade schooler, seconds away from getting tricked by a jailbreak."
The Benchmark Problem: "Training on the test set is a new art form." — Karpathy
Because RLVR thrives on verifiable environments, benchmarks become training environments. Labs construct synthetic tasks adjacent to benchmark embedding space and grow "jaggies" to cover them. The model crushes SWE-bench but can't follow simple instructions. It aces MMLU but fails basic reasoning outside the distribution.
What does it look like to crush all the benchmarks but still not get AGI? This. Jagged intelligence optimized for verifiable domains.
Production implications:
- Verifiable domains (code, math, structured extraction): Trust the RLVR spike. These models genuinely excel here.
- Non-verifiable domains (tone, cultural nuance, subjective judgment): Still ghosts. Still weird failures. Human oversight required.
- Benchmarks: Treat with extreme skepticism. Correlation with real-world performance has degraded.
For teams building with these models, this means rethinking evaluation. Traditional benchmarks don't predict production performance. You need trajectory analysis, agent-specific evals, and domain-specific test harnesses that go beyond what the model was optimized for.
The jaggedness isn't a flaw. It's the signature of how RLVR works.
Test-Time Compute: The New Capability Knob
RLVR unlocked something unprecedented: capability as a function of thinking time, not just model size.
Models trained with RLVR learn to generate extended reasoning traces—intermediate calculations, self-checks, exploration of solution paths. By allowing the model to "think longer" at inference time, you can trade latency for accuracy.
OpenAI's o3 demonstrated this dramatically. On ARC-AGI (a benchmark designed to resist memorization), o3's performance scaled with test-time compute. More thinking tokens = higher accuracy. This is a new scaling law, orthogonal to parameter count or training compute.
- AIME 2025: 93.4% reasoning accuracy (o3)
- SWE-bench: 80.9% (Claude Opus 4.5)
- RL runs: ~10x longer vs the RLHF era (2022-2024)
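A related, more accessible way to buy accuracy with inference compute is parallel sampling plus verification: draw several candidates and keep one that passes an objective check. A hedged sketch, where `generate` and `verify` are placeholders for your own model call and checker:

```python
from typing import Callable, Optional

def best_of_n(generate: Callable[[str], str],
              verify: Callable[[str], bool],
              prompt: str,
              n: int) -> Optional[str]:
    """Trade inference compute for accuracy: sample up to n candidates
    and return the first one that passes an objective check."""
    for _ in range(n):
        candidate = generate(prompt)   # placeholder for your provider's sampling call
        if verify(candidate):          # e.g. tests pass, or math_reward(...) == 1.0
            return candidate
    return None                        # nothing verified: escalate or fall back
```

Larger `n` means more compute and a higher chance of a verified answer, which is the same latency-for-accuracy trade in a different shape.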
The reasoning efficiency paradox: RLVR models initially became slower, generating thousands of "thought" tokens, in order to be more accurate. Late 2025 brought a countervailing distillation trend: the goal shifted from "always think long" to "predict when to think long." Production systems now use fast models for roughly 80% of queries and escalate to reasoning models for the complex tail. This mirrors the pattern in agent-failure-modes: knowing when to escalate is the critical skill.
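A minimal sketch of that escalation pattern, with placeholder callables rather than any particular provider's API:

```python
from typing import Callable

def answer_with_escalation(query: str,
                           fast_generate: Callable[[str], str],
                           reasoning_generate: Callable[[str], str],
                           verify: Callable[[str], bool]) -> str:
    """Serve the routine majority with a cheap model; pay for reasoning only on the hard tail."""
    draft = fast_generate(query)        # fast, inexpensive first pass
    if verify(draft):                   # objective check: tests, schema, answer key
        return draft
    return reasoning_generate(query)    # escalate: longer thinking, higher latency and cost
```

The verifier doing double duty here is the point: the same check that scores the output also decides when to spend more compute.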
Spontaneous strategy development: Models trained with RLVR didn't just learn answers; they reinvented human algorithms. Divide and conquer. Proof by contradiction. This is the "Aha moment" phenomenon: the reward landscape funnels the model into discovering robust cognitive strategies without explicit teaching.
This is what RLVR unlocks: not just better answers, but emergent reasoning processes.
What This Means for Building
The RLVR era changes your evaluation strategy, not just your model choice.
Decision framework:
| Domain | Model Strategy | Eval Strategy |
|---|---|---|
| Verifiable tasks (code, math, structured extraction) | Use RLVR models (o3, Opus 4.5, DeepSeek R1). Leverage test-time compute for hard cases. | Unit tests, formal verification, execution-based evals. Trust the spike. |
| Subjective tasks (tone, creativity, empathy) | RLVR doesn't help. Use standard models with human-in-the-loop. | LLM-as-judge for scale, human review for edge cases. |
| Hybrid tasks (agent trajectories) | RLVR for reasoning steps, standard models for presentation. | Code verifies execution, LLM judges evaluate reasoning quality. See building-agent-evals. |
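For the hybrid row, a sketch of what "code verifies execution, LLM judges reasoning" can look like in practice; the `Trajectory` shape and both callables are assumptions for illustration, not a specific framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    steps: list[str]      # the agent's intermediate reasoning and tool calls
    final_patch: str      # the code change it ultimately produced

def eval_trajectory(traj: Trajectory,
                    run_tests: Callable[[str], bool],
                    judge_reasoning: Callable[[list[str]], float]) -> dict:
    """Hybrid eval: verify what can be verified, judge the rest."""
    execution_pass = run_tests(traj.final_patch)     # objective signal: trust it
    reasoning_score = judge_reasoning(traj.steps)    # LLM-as-judge: noisier, spot-check with humans
    return {"execution_pass": execution_pass, "reasoning_score": reasoning_score}
```

Keeping the two signals separate in the output makes the asymmetry explicit: one is ground truth, the other is an estimate.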
MMNTM's POV: We're building with the reality that verification is the training signal, not just the eval signal. This changes what works:
- Evals drive capability. If you can verify it, you can optimize for it. Design your tasks to be verifiable (see the schema-check sketch after this list).
- Jaggedness is predictable. Spike in verifiable domains, struggle elsewhere. Plan for this asymmetry.
- Test-time compute is a cost knob. Fast models for routine work, reasoning models for the tail. Build routing logic.
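The schema-check sketch referenced in the first point: for structured extraction, "verifiable" can be as simple as asserting that every required field is present with the right type. The invoice schema below is made up for illustration:

```python
def verify_extraction(output: dict, required_fields: dict) -> bool:
    """A task is verifiable if a mechanical check can score it.
    Here: every required field exists and has the expected type."""
    return all(
        field in output and isinstance(output[field], expected_type)
        for field, expected_type in required_fields.items()
    )

# Hypothetical invoice-extraction schema.
schema = {"invoice_number": str, "total": float, "line_items": list}
print(verify_extraction({"invoice_number": "INV-042", "total": 99.5, "line_items": []}, schema))  # True
```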
The safety wildcard: Karpathy's point about jagged intelligence—"simultaneously genius and cognitively challenged, seconds away from jailbreak"—is critical. But late 2025 research suggests safety-capability convergence. By treating safety constraints (e.g., "do not output dangerous chemicals") as verifiable rules during RLVR training, models can learn to reason about safety the same way they reason about math.
This breaks the old "alignment tax" theory (where safer models are dumber). RLVR models can be both capable and safe, if safety is part of the verifiable reward structure.
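In code terms, the idea reduces to a composite reward where the safety check is just another verifier. This is a conceptual sketch, not any lab's actual training setup; `violates_policy` stands in for a rule list, a classifier, or a constitution check:

```python
from typing import Callable

def composite_reward(output: str,
                     task_reward: float,
                     violates_policy: Callable[[str], bool]) -> float:
    """Safety as a verifiable constraint: a correct answer that breaks a rule earns zero,
    so longer optimization has no incentive to trade safety for capability."""
    if violates_policy(output):   # placeholder safety verifier
        return 0.0
    return task_reward            # otherwise pass through the verifiable task reward
```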
Verification All the Way Down
RLVR is the fourth stage in the production LLM stack. It gobbled pretraining compute, created jagged intelligence, and unlocked test-time scaling.
The meta-insight: The same verifiability that made LLM-as-judge economical made RLVR possible. Objective rewards enable ungameable optimization. Whether you're evaluating outputs or training models, verifiability is the lever.
What breaks: Traditional benchmarks. The models are training on what you're measuring. Benchmark performance no longer predicts production performance.
What works: Building evals that account for jaggedness. Knowing when to use reasoning models vs fast models. Treating verification as a design constraint, not an afterthought.
The paradigm shift is complete. Verification isn't just for evals anymore—it's the training signal that unlocks emergent reasoning. We're tracking how this changes production patterns. Follow our eval research for updates.