The result people focus on is 700 autonomous code changes in one overnight session — 20 of which improved the model more than weeks of manual tuning, producing an 11% leaderboard gain on a depth-12 model that was already well-tuned. Within ten days of Karpathy open-sourcing the setup, independent builders had pointed the same loop at financial markets, chess engines, rendering pipelines, and argumentative reasoning.
The thing worth studying is the constraint: five minutes, no exceptions.
Why the Clock Is the Intelligence
The setup is deliberately minimal: a 630-line training script, a prompt file describing the research direction, a Git branch per experiment. The agent modifies training code, runs a small language model for exactly five minutes, checks validation loss, commits the result, and loops.
The five-minute clock is not a performance optimization. It's an epistemic design choice.
Without a fixed time constraint, the agent can spend hours on a promising direction that turns out to be a dead end. With the clock, every dead end costs five minutes. The constraint makes each experiment disposable. Disposable experiments are the only kind you can run at scale.
The clock creates commensurability. Each experiment has the same cost. Comparing run 47 to run 12 is meaningful because the inputs are controlled for time. If experiments ran for arbitrary durations, the results would be incomparable — a good 3-minute run and a good 3-hour run can't be ranked on the same leaderboard.
The agent didn't try random mutations. It examined the sequence of prior results and planned subsequent experiments around what had worked. It found that the attention was too diffuse (a missing scale multiplier), the regularization was incomplete, the banded attention was too conservative, and the optimizer betas were misconfigured. None of this required intelligence about the problem domain. It required only a metric and a clock.
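Mechanically, the loop is a keep-or-revert hill climb against the metric. A minimal runnable sketch, with a toy quadratic standing in for the five-minute training run and `lr`/`wd` as stand-in knobs (both assumptions for illustration, not Karpathy's actual script):

```python
import random

def evaluate(params: dict) -> float:
    # Stand-in for one five-minute training run. The real loop launches the
    # training script under a hard timeout and parses validation loss; a toy
    # quadratic keeps this sketch runnable. Lower is better.
    return (params["lr"] - 0.3) ** 2 + (params["wd"] - 0.1) ** 2

def autoresearch(n_experiments: int, seed: int = 0) -> tuple[dict, float]:
    # Keep-or-revert loop: mutate, evaluate under the clock, commit or discard.
    rng = random.Random(seed)
    best = {"lr": 1.0, "wd": 1.0}  # illustrative starting config
    best_loss = evaluate(best)
    for _ in range(n_experiments):
        candidate = {k: v + rng.gauss(0, 0.1) for k, v in best.items()}
        loss = evaluate(candidate)            # one fixed-cost experiment
        if loss < best_loss:
            best, best_loss = candidate, loss  # keep: commit the branch
        # else: revert. The dead end cost one clock tick, nothing more.
    return best, best_loss

params, loss = autoresearch(n_experiments=700)
```

Because every iteration costs the same, run 47 and run 12 rank on the same leaderboard. That commensurability, not the mutation strategy, is what the clock buys.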
Lines of Code
630
The entire autoresearch setup
Autonomous Changes
700
One overnight session
Leaderboard Gain
11%
On top of prior manual tuning
Five Instantiations in Ten Days
The pattern transferred because it required no new infrastructure. A metric, a clock, a loop. The domain didn't matter.
Autoresearch Instantiations
| Instantiation | Domain | Metric | Result |
|---|---|---|---|
| ML training (Karpathy, day 0) | ML training | Validation loss | 700 changes, 11% leaderboard gain |
| Finance (@cworsey, day ~5) | Financial markets | Sharpe ratio | +22% over 173 days |
| Chess (@jnl, day ~7) | Chess engine | Elo vs. Stockfish | Expert → grandmaster |
| Rendering (day ~8) | Rendering pipeline | Frame time (ms) | 53% faster |
| Reasoning (AutoReason, day ~10) | Argumentative reasoning | Win rate in pairwise debates | Prompt mutations survived natural selection |
What the five domains share: a single unambiguous number, a fast evaluation, and a direct correlation between the metric and what you actually care about.
The financial markets case is the clearest demonstration of portability. @cworsey ran 25 agents debating macro, rates, commodities, sectors, and single stocks. Every recommendation scored against real outcomes. The worst agent by rolling Sharpe ratio got its prompt rewritten by the system. Keep or revert. The prompts are the weights. Sharpe is the loss function. Same loop, same logic — just a different domain.
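The same skeleton fits the trading setup. A hedged sketch of the culling step (function names and the return series are illustrative, not @cworsey's code): rank agents by Sharpe, pick the one whose prompt gets rewritten.

```python
import statistics

def sharpe(returns: list[float]) -> float:
    # Mean over standard deviation of per-period returns; annualization is
    # omitted because only the ranking matters for the cull.
    sd = statistics.pstdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0

def cull_worst(agents: dict[str, list[float]]) -> str:
    # The agent with the lowest rolling Sharpe gets its prompt rewritten.
    return min(agents, key=lambda name: sharpe(agents[name]))

book = {  # illustrative daily returns per agent
    "macro":  [0.010, -0.002, 0.004],
    "rates":  [-0.003, -0.001, 0.002],
    "energy": [0.006, 0.001, 0.003],
}
worst = cull_worst(book)  # this prompt is mutated; the rest survive unchanged
```

The prompt plays the role of weights, Sharpe plays the role of loss, and keep-or-revert is unchanged.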
The Eval Criteria
Karpathy almost throws away the key line: "any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm."
The inverse is the insight: if you can't write the eval, you can't run the loop.
Three properties make an eval autoresearchable:
Fast. The clock doesn't have to be five minutes, but it has to be short enough to run dozens of iterations overnight. A 30-minute evaluation cycle means 16 iterations in an 8-hour window. A 2-minute cycle means 240. The iteration ceiling is the clock. This is not a marginal difference — 16 iterations converges in months; 240 converges in days.
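The ceiling is pure arithmetic, which is worth making explicit:

```python
def iterations_per_night(eval_minutes: float, window_hours: float = 8) -> int:
    # The iteration ceiling is set entirely by the evaluation clock:
    # minutes available divided by minutes per experiment.
    return int(window_hours * 60 // eval_minutes)

iterations_per_night(30)  # 16
iterations_per_night(2)   # 240
```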
Unambiguous. The metric has to return a number that is clearly better or clearly worse. Validation loss is unambiguous. Sharpe ratio is unambiguous. "Is this email good?" is not unambiguous. The eval has to be computable, not judgeable.
Predictive. The metric has to correlate with what you actually care about. Validation loss predicts model quality. A proxy metric that doesn't predict downstream quality is worse than no metric at all — it optimizes in the wrong direction and produces a loop that converges on the wrong answer.
The Enterprise Eval Gap
Enterprise Multi-Agent Inquiry Surge
1,445%
Gartner, in the weeks after Karpathy's agentic engineering post
Enterprise demand for multi-agent systems has surged. The loops are available. The Karpathy blueprint is documented, open-sourced, and instantiated across five domains. The infrastructure cost has dropped to near zero.
What hasn't scaled is the eval.
Most knowledge work doesn't have a loss function. "Is this customer support response good?" "Is this contract well-drafted?" "Is this outreach email compelling?" These are questions humans can answer in ten seconds and disagree about 40% of the time. They're not computable. They require judgment that isn't reducible to a single number.
This is the enterprise eval gap: the distance between "the loop exists" and "we have a metric the loop can optimize." Most teams trying to deploy agentic automation are stuck at this gap, not at the loop. The agents are ready. The organizations aren't — because they haven't done the work of defining what "good" actually means.
A missing eval isn't a temporary blocker you work around by deploying anyway. An agent loop without an eval is not an agent that improves — it's an agent that runs. Deployment without an eval is the wrong order of operations.
Building Evals for Hard-to-Measure Tasks
The gap isn't closed by accepting ambiguity. It's closed by being more precise about what you actually care about. Four approaches:
Proxy metrics. You can't measure "email quality" directly. You can measure reply rate, link click rate, meeting booking rate. These are imperfect proxies, but they're computable and predictive. The question isn't "is this email good?" — it's "what does a good email produce, and can I measure that?"
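A proxy of this kind can be a one-liner. The composite below is an illustrative sketch (the function name and weights are assumptions, not an established formula) that turns raw outreach counts into a single computable score:

```python
def proxy_score(sent: int, replies: int, meetings: int,
                reply_weight: float = 1.0, meeting_weight: float = 5.0) -> float:
    # Hypothetical composite proxy for "email quality": weighted outcomes
    # per email sent. Computable and, one hopes, predictive. The weights
    # are illustrative, not tuned.
    if sent == 0:
        return 0.0
    return (reply_weight * replies + meeting_weight * meetings) / sent

proxy_score(sent=100, replies=12, meetings=3)  # 0.27
```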
Pairwise comparison. "Is A better than B?" is a far easier question than "how good is A?" Pairwise comparisons can be automated. Given two responses, which scores higher against a rubric? This produces an ordinal ranking rather than an absolute score — but ordinal ranking is enough for a loop. You don't need to know the absolute quality of response 47. You need to know whether it's better than response 46.
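A sketch of turning pairwise wins into an ordinal ranking. In practice the judge would be an LLM call scoring two responses against a rubric; here a toy word-count heuristic stands in so the sketch runs (both the heuristic and the function names are illustrative):

```python
from itertools import combinations

def rubric_score(text: str) -> int:
    # Stub judge: pretend longer answers score higher against the rubric.
    # A real loop would replace this with an automated rubric comparison.
    return len(text.split())

def rank_by_pairwise_wins(candidates: list[str]) -> list[str]:
    # "Is A better than B?" for every pair; count wins, sort by win count.
    # Produces an ordinal ranking with no absolute quality score anywhere.
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if rubric_score(a) >= rubric_score(b) else b
        wins[winner] += 1
    return sorted(candidates, key=wins.get, reverse=True)

ranked = rank_by_pairwise_wins(["one two three", "one", "one two"])
```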
Downstream outcomes. Measure what the work produces, not the work itself. If the agent is drafting support responses, don't evaluate the response — evaluate how quickly the ticket closes, how many follow-ups it generates, whether the customer reopens it. Downstream outcomes are harder to collect but more predictive. They're also harder to game.
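A sketch of scoring the ticket rather than the text (field names and penalty weights are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    hours_to_close: float
    followups: int
    reopened: bool

def downstream_score(tickets: list[Ticket]) -> float:
    # Score what the response produced, not the response itself: fast closes,
    # few follow-ups, no reopens. Higher is better; weights are illustrative.
    total = 0.0
    for t in tickets:
        total += -t.hours_to_close - 2.0 * t.followups - 10.0 * t.reopened
    return total / len(tickets)
```

Nothing in this score can be gamed by making the response merely *look* better; only the ticket's actual trajectory moves it.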
Consistency as prerequisite. Before optimizing for quality, establish that the agent produces consistent outputs under the same conditions. An agent with wildly variable outputs can't be optimized — the signal is lost in noise. Consistency is not the goal, but it's the gate.
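A minimal gate, assuming the agent's output can be scored and re-run on identical input (the coefficient-of-variation threshold is an illustrative choice):

```python
import statistics

def passes_consistency_gate(scores: list[float], max_cv: float = 0.1) -> bool:
    # Coefficient of variation across repeated runs on the same input must be
    # small; otherwise any metric delta between experiments is just noise.
    mean = statistics.mean(scores)
    if mean == 0:
        return False
    return statistics.pstdev(scores) / abs(mean) <= max_cv

passes_consistency_gate([0.81, 0.79, 0.80])  # stable: safe to optimize
passes_consistency_gate([0.2, 0.9, 0.5])     # noisy: fix consistency first
```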
Patterns Worth Stealing
Define the eval before building the agent. If you can't write the eval function before you start building, you're not ready. The eval is not a post-hoc quality check — it's the specification. The agent you build is a function that optimizes the eval you define.
The clock is the contract. Any agent loop needs a temporal constraint: how long does one experiment run? Without a fixed clock, experiments aren't comparable. Without comparability, the loop produces noise instead of signal.
Proxy metrics are not cheating. Reply rate is not email quality. But it's predictive of email quality and it's computable. Most real-world evals are proxies. The question is whether the proxy correlates with what you actually care about — not whether it's a perfect measure.
The iteration ceiling is the clock. A 30-minute eval cycle means 16 iterations per night. A 2-minute eval cycle means 240. The difference between those two numbers is not marginal. It's the difference between an agent that converges in months and one that converges in days.
See also: Building Agent Evals for a practical guide to eval construction, State of Evals 2026 for where the field currently stands, and The Hard Thing About Building Agents for the organizational problems that precede the technical ones.