Technical Deep Dive

Autoresearch: The Overnight Loop That Changed the Production Function

Karpathy left an agent running for two days and woke up to 700 code changes. Within ten days the same loop had been instantiated across ML, finance, chess, and rendering. The mechanics, the propagation, the boundary conditions, and what happens when the score is synthetic.

Greg Salwitz
13 min read
#Autoresearch · #Karpathy · #Agent Loops · #Optimization · #AutoReason

On March 7, Andrej Karpathy went to sleep. When he woke up, an AI agent had made 700 changes to his training code, found 20 improvements he'd missed after weeks of manual tuning, and committed each one to a Git branch. He open-sourced the setup — 630 lines of Python, a prompt file, and a five-minute clock. Within ten days, independent builders had pointed the same loop at financial markets, chess engines, rendering pipelines, and argumentative reasoning. Nobody coordinated this.

The structural conditions just happened to be right.


The Loop

The setup is deliberately minimal. A single GPU. A 630-line training script. A prompt file describing the research direction. The agent modifies the code, trains a small language model for exactly five minutes, checks the validation loss, commits the result to a Git branch, and loops.

Every dot in Karpathy's visualization is a complete training run. The constraint — five minutes, no exceptions — is what makes it work. Without a fixed clock, the agent spends hours on dead ends. With it, every dead end costs five minutes and every experiment is directly comparable to every other.
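The loop itself is small enough to sketch end to end. The block below is a toy, not Karpathy's script: `train_and_eval` fakes a timed training run with a quadratic loss in a single hyperparameter, `propose_edit` is a random nudge rather than a planning agent, and a plain list stands in for the Git branch. Only the shape (edit, fixed-budget run, score, record, keep the best) comes from the setup described above.

```python
import random
import time

TIME_BUDGET_S = 0.001  # stand-in for the real five-minute wall clock

def train_and_eval(params, budget_s):
    """Toy 'training run': every run costs the same fixed budget, and the
    'validation loss' is a quadratic minimized at lr = 3e-4."""
    time.sleep(budget_s)
    return (params["lr"] - 3e-4) ** 2 + 0.1

def propose_edit(params):
    """Toy 'agent': nudge one hyperparameter at random. A real agent would
    read the experiment log and plan the next edit."""
    return {"lr": params["lr"] + random.gauss(0.0, 1e-4)}

def autoresearch(n_iters):
    best_params = {"lr": 1e-3}
    best_loss = train_and_eval(best_params, TIME_BUDGET_S)
    log = []  # stand-in for the Git branch: one entry per experiment
    for i in range(n_iters):
        candidate = propose_edit(best_params)
        loss = train_and_eval(candidate, TIME_BUDGET_S)
        log.append((i, candidate, loss))   # every attempt is recorded
        if loss < best_loss:               # keep only improvements
            best_params, best_loss = candidate, loss
    return best_loss, log

random.seed(0)
best, log = autoresearch(200)
```

Because every run is capped at the same budget, all 200 log entries are directly comparable; removing the cap is what lets an agent sink hours into one dead end.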

The agent didn't try random mutations. It examined the sequence of prior results and planned subsequent experiments around what had worked. It found that the attention was too diffuse (a missing scalar multiplier), the regularization incomplete, the banded attention too conservative, and the optimizer betas misconfigured. 700 changes. 20 improvements. An 11% leaderboard gain on a depth-12 model that was already well-tuned, on top of weeks of manual work.

The key line Karpathy almost throws away: "any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm." If you have a number and a fast evaluation function, you have a loop. If you don't, you don't. Everything that followed was an exploration of that boundary.

Lines of Code: 630 (the entire autoresearch setup)

Autonomous Changes: 700 (two-day overnight run)

Leaderboard Gain: 11% (on top of prior manual tuning)


The Instantiation Cascade

What made the pattern portable is the fixed clock. Every experiment has the same cost. Results are directly comparable. The agent can never disappear into a multi-hour rabbit hole. This makes the pattern legible — you don't need Karpathy's specific setup, domain knowledge, or hardware. You need a metric, a clock, and a loop.

Financial markets. Chris Worsey took the autoresearch loop and pointed it at portfolio management. Twenty-five agents debating macro, rates, commodities, sectors, and single stocks. Every recommendation scored against real outcomes. The worst agent by rolling Sharpe ratio gets its prompt rewritten by the system. Prompts are the weights, Sharpe is the loss function. The substitution is precise: in Karpathy's version, the agent edits Python code and the validation loss evaluates the result. In Worsey's version, the agent edits English prompts and the market evaluates the result. The loop is identical. Only the substrate changed. Result: +22% over 173 days.
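Worsey's substitution can be sketched the same way. Everything below is a toy under stated assumptions: daily returns are simulated with `random.gauss` instead of coming from a market, `rolling_sharpe` is an unannualized mean-over-std, and `rewrite_prompt` just tags a revision where the real system would call an LLM. The structural move is the selection step: score every agent, rewrite the worst.

```python
import random

def rolling_sharpe(returns, window=10):
    """Unannualized Sharpe over the most recent window of returns."""
    r = returns[-window:]
    mean = sum(r) / len(r)
    var = sum((x - mean) ** 2 for x in r) / len(r)
    return mean / (var ** 0.5 + 1e-9)

def rewrite_prompt(prompt):
    """Placeholder: the real system has an LLM rewrite the losing prompt."""
    return prompt + " (rev)"

# Five toy agents; prompts are the weights, Sharpe is the loss function.
agents = {f"agent_{i}": {"prompt": f"strategy {i}", "returns": []}
          for i in range(5)}

random.seed(1)
for day in range(30):
    for a in agents.values():                  # the market scores everyone
        a["returns"].append(random.gauss(0.001, 0.01))
    if day >= 5:                               # need a short history first
        worst = min(agents, key=lambda k: rolling_sharpe(agents[k]["returns"]))
        agents[worst]["prompt"] = rewrite_prompt(agents[worst]["prompt"])
```

Nothing here evaluates whether a rewritten prompt is better; only the next window of returns does, which is exactly the point of using the market as the loss function.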

Chess. Deedy Das ran autoresearch on a vibe-coded Rust chess engine. Starting level: expert. After 70 autonomous experiments: top-50 grandmaster — engine #311 on the international leaderboard.

Rendering. Tobi Lütke ran autoresearch on Shopify's Liquid rendering engine and got 53% faster parse+render time and 61% fewer object allocations. He open-sourced the plugin. Kaspars Dancis ran it on a canvas rendering engine and saw a 10x improvement on the slowest test in hours.

Generic abstraction. Varun Mathur's team made the pattern fully domain-agnostic. Their system lets anyone propose an optimization problem in plain English and a distributed swarm spins up to solve it. 237 agents, 14,832 experiments across five domains, zero human intervention. The abstraction had separated entirely from Karpathy's implementation. The pattern was replicating itself — not through coordination but through structural inevitability.

Autoresearch Instantiations (first 10 days)

Area | Domain | Metric | Result
ML training | ML training (Karpathy) | Validation loss | 700 changes, 11% leaderboard gain
Finance | Portfolio management (Worsey) | Sharpe ratio | +22% over 173 days
Chess | Chess engine (Das) | Elo vs Stockfish | Expert → grandmaster (#311)
Rendering | Liquid engine (Lütke/Shopify) | Parse+render time | 53% faster, 61% fewer allocations
Reasoning | Argumentative reasoning (AutoReason) | Win rate in blind judging | Adversarial debate as fitness function

The Cost Convergence

The cascade happened now, not three years ago, because three costs converged.

Inference costs have fallen far enough that running hundreds of agent calls per hour is economically viable for individuals, not just companies. The organizational unit for research has shrunk from a lab to a laptop.

Evaluation infrastructure — loss functions, benchmarks, scoring APIs — has matured to the point where "check if this worked" can be automated for a growing set of domains. Validation loss requires a training setup. Sharpe ratio requires market data. Both are now accessible to individuals at near-zero cost.

Git provides a free, universal ledger for experiment tracking. The agent commits every attempt. The human reviews a log, not a process. There's no project management overhead. Every experiment is automatically versioned.
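The ledger claim is concrete enough to demonstrate. This sketch uses a throwaway temp repo and assumed helper names, and assumes a `git` binary on the PATH; it commits one line per experiment so the log becomes the history a human reviews in the morning.

```python
import os
import subprocess
import tempfile

def record_experiment(repo, name, loss):
    """Append one result and commit it: the branch is the ledger."""
    with open(os.path.join(repo, "results.log"), "a") as f:
        f.write(f"{name}: loss={loss:.4f}\n")
    subprocess.run(["git", "-C", repo, "add", "results.log"], check=True)
    subprocess.run(["git", "-C", repo, "commit", "-q", "-m",
                    f"{name}: loss={loss:.4f}"], check=True)

repo = tempfile.mkdtemp()
subprocess.run(["git", "-C", repo, "init", "-q"], check=True)
subprocess.run(["git", "-C", repo, "config", "user.email", "agent@example.com"], check=True)
subprocess.run(["git", "-C", repo, "config", "user.name", "agent"], check=True)

for i, loss in enumerate([0.92, 0.88, 0.91]):
    record_experiment(repo, f"experiment_{i}", loss)

history = subprocess.run(["git", "-C", repo, "log", "--oneline"],
                         capture_output=True, text=True, check=True).stdout
```

Every attempt, including the regression from 0.88 to 0.91, is versioned automatically; the human reads `history`, not a live process.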

Below the convergence threshold, running experiments overnight requires a team. Above it, a single person writes a prompt and goes to sleep. The organizational implication: the companies running overnight loops aren't announcing it — they're shipping the results at 9am. The delta compounds quietly.

The overnight shift isn't about speed. It's about what compounds while the org chart sleeps. An agent that runs 240 experiments overnight doesn't make you faster — it makes you structurally different from a team that ran 16.


Where the Loop Breaks

Autoresearch works because there is a number. Validation loss. Sharpe ratio. Frame rate. The loop compounds because the fitness function is unambiguous — lower is better, higher is better, and the agent doesn't need to understand why.

Most real work doesn't have a score.

The maintenance problem. Alibaba tested this boundary systematically. They ran 18 AI coding agents on 100 real codebases spanning 233 days each. The agents could pass tests — the equivalent of a validation score — on first attempt. But maintaining code over eight months, where the metric is "does everything still work after this change and the next fifty changes," was catastrophic: 75% of models broke previously working code during maintenance.

The loss function for a single commit is tractable. The loss function for a codebase over time is not — because the evaluation horizon extends across changes you haven't made yet.

Models That Broke Working Code: 75% (Alibaba study: AI agents on 100 codebases over 233 days)

The subjective work problem. Writing doesn't have a loss function. Strategy doesn't have a Sharpe ratio. Design doesn't have a validation metric that an agent can exploit without gaming it. The overnight shift is powerful precisely where work is reducible to a number, and silent where it isn't.

This is not a temporary limitation waiting for better models. It's a structural property of the domains. In ML training, validation loss is a sufficient statistic — it captures everything relevant about whether a change was good. In code maintenance, no single metric captures "this change is good for the system over the next six months." In strategy, no metric captures "this is the right direction." The score doesn't exist. The loop can't run.


The Frontier: AutoReason and the Synthetic Score

Which is why @SHL0MS's AutoReason project is the most interesting development in the cascade — not because it fully works, but because it identifies the right problem: how do you build a fitness function for work that resists quantification?

The mechanism: generate version A. A fresh agent attacks it as a strawman. A separate agent produces version B incorporating the critique. A third agent synthesizes A and B. A blind panel of judge-agents picks the strongest. The winner becomes the new A. The loop repeats until judges consistently prefer the incumbent.
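The control flow of that mechanism is easy to make concrete. The block below is a deliberately crude toy: `critique`, `revise`, and `synthesize` are string-mangling placeholders for LLM calls, and the "blind panel" simply prefers the longer text, a stand-in for whatever the real judges reward. Only the loop structure (attack, revise, synthesize, judge, promote or stop) follows AutoReason's description.

```python
def critique(text):
    """Placeholder: a fresh agent attacks the draft as a strawman."""
    return f"weakest point of [{text}]"

def revise(text, crit):
    """Placeholder: a separate agent folds the critique back in."""
    return f"{text}+fix"

def synthesize(a, b):
    """Placeholder: a third agent merges versions A and B."""
    return f"synth({a}|{b})"

def panel_prefers_incumbent(incumbent, challenger, n_judges=5):
    """Toy blind panel: each judge votes for the longer text. This is
    exactly the kind of gameable proxy the real design has to resist."""
    votes = sum(len(incumbent) >= len(challenger) for _ in range(n_judges))
    return votes > n_judges // 2

def autoreason(draft, max_rounds=10):
    a = draft
    for _ in range(max_rounds):
        b = revise(a, critique(a))      # version B answers the attack
        candidate = synthesize(a, b)    # merge A and B
        if panel_prefers_incumbent(a, candidate):
            return a                    # incumbent survives: converged
        a = candidate                   # the winner becomes the new A
    return a

result = autoreason("thesis")
```

With length-loving judges the incumbent never survives, so the loop runs all ten rounds and the text balloons: a miniature of the Goodhart failure mode, where the loop maximizes the proxy rather than the quality the proxy was meant to track.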

Instead of "lower loss is better," the fitness function becomes "survives adversarial scrutiny from independent evaluators." The score doesn't exist in the domain natively, so AutoReason manufactures a proxy by simulating the process humans use to evaluate subjective work: argument, counter-argument, synthesis, judgment.

The structural question remains: does the proxy capture what matters? In Karpathy's version, validation loss correlates with model quality because that's what validation loss measures. In AutoReason, "survives blind judging" correlates with... persuasiveness? Logical consistency? Rhetorical polish? Every proxy metric in history has eventually been Goodharted. The question is whether adversarial debate is robust enough to resist it.

The optimization pressure is real. A loop that runs thousands of iterations against a synthetic fitness function will find the maximum of that function. If the maximum of "survives blind judging" is maximally judge-friendly rather than maximally correct, the loop converges on persuasive garbage. The adversarial structure may slow this down. It doesn't structurally prevent it.

AutoReason is worth watching not because it solves the problem but because it's the first serious attempt at the right architecture for subjective compounding: manufacturing a score through simulation rather than finding one in the domain.


Patterns Worth Stealing

The clock is the architecture. The five-minute constraint is not a performance choice. It's the mechanism that makes experiments commensurate and results comparable. Any autoresearch deployment needs an explicit temporal contract: how long does one iteration run, and why?

The score is the prerequisite. Before deploying an autoresearch loop, identify whether your domain has an honest fitness function. Validation loss is honest. "Survives blind judging" may not be. The overnight shift only works where the score can't be gamed.

The iteration ceiling determines the convergence horizon. 16 iterations per night converges in months. 240 converges in days. This is not a marginal difference in speed — it's a different competitive posture. An agent with a 2-minute eval cycle deployed overnight is structurally different from a team with a 30-minute eval cycle running daytime experiments.
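The 240-versus-16 numbers are just the eval cycle divided into the shift, which is worth writing down because it shows how sensitive the convergence horizon is to cycle time (the eight-hour shift is an assumption):

```python
def overnight_iterations(eval_cycle_min, shift_hours=8):
    """Experiments that fit in one overnight shift at a fixed cycle time."""
    return int(shift_hours * 60 // eval_cycle_min)

agent = overnight_iterations(2)    # 2-minute eval cycle
team = overnight_iterations(30)    # 30-minute eval cycle
ratio = agent // team              # shots at the metric per night
```

Halving the eval cycle doubles the nightly experiment count, which is why the clock, not the model, is usually the first thing worth optimizing.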

Maintenance is harder than generation. The Alibaba study is the most important negative result in agent deployment. The fitness function for generating code is tractable. The fitness function for maintaining a codebase over time is not. If you're deploying agents on codebases, the overnight loop works for new features. It fails for long-horizon maintenance. These are different deployment patterns and they require different eval architectures.

The pattern is already separating from the implementation. Worsey's substitution — prompts as weights, Sharpe as loss function — is the key move. The loop Karpathy open-sourced is one instantiation of a structural pattern. The question for any domain is: what are my weights? What is my loss function?


See also: The Five-Minute Clock for the eval design implications and the enterprise eval gap, Building Agent Evals for production eval construction, and The Hard Thing About Building Agents for the organizational prerequisites.

Greg Salwitz · Apr 5, 2026