When AI Starts Building AI

More than 80% of the code Anthropic merged in May 2026 was written by Claude. Not assisted. Written. The engineers who built it were merging 8x more code per quarter than they were in 2024.

In early 2025, AI-written code was a rounding error — low single digits. Sixteen months later, it was the majority of what went into the codebase of the company building the AI.

This is what recursive self-improvement looks like before it becomes a philosophy debate.

For the rest of us: what is recursive self-improvement?

The term sounds like science fiction. The idea is simple.

An AI system helps build a better AI system. That better system helps build a better one still. Each iteration is faster, more capable, and more productive than the last. At some point — and this is the part that makes the concept serious — the humans in the loop are no longer the constraint on how fast progress happens. The AI is.

We are not there yet. But the Anthropic data published in 2026 suggests we are closer than most organisations are planning for.

The task horizon

In March 2024, Claude Opus 3 could reliably complete tasks that took about four minutes. A year later, Claude Sonnet 3.7 was managing 90-minute tasks. By March 2026, Claude Opus 4.6 was sustaining work for 12 hours without human intervention.

That’s a 180x increase in autonomous task duration in two years. Anthropic’s internal estimate is that the task horizon is doubling roughly every four months.
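The arithmetic behind those figures is easy to check. The sketch below is a minimal one: it takes only the three data points quoted above and assumes smooth exponential growth between them, which is a simplification. The names `POINTS` and `doubling_months` are illustrative and do not come from the Anthropic report.

```python
import math

# Task-horizon data points quoted in the text: (months since March 2024, minutes).
POINTS = [(0, 4), (12, 90), (24, 12 * 60)]

def doubling_months(p0, p1):
    """Doubling time between two points, assuming smooth exponential growth."""
    (t0, m0), (t1, m1) = p0, p1
    return (t1 - t0) / math.log2(m1 / m0)

overall_growth = POINTS[-1][1] / POINTS[0][1]
overall = doubling_months(POINTS[0], POINTS[-1])
first_year = doubling_months(POINTS[0], POINTS[1])
second_year = doubling_months(POINTS[1], POINTS[2])

print(f"overall growth: {overall_growth:.0f}x")                  # 180x
print(f"doubling time, full two years: {overall:.1f} months")    # 3.2 months
print(f"doubling time, 2024 to 2025: {first_year:.1f} months")   # 2.7 months
print(f"doubling time, 2025 to 2026: {second_year:.1f} months")  # 4.0 months
```

On these three points alone, the most recent year works out to one doubling every four months (90 minutes to 12 hours is an 8x increase, or three doublings in twelve months), which matches the quoted estimate. The first year was faster. Three points are too few to fit a trend with any confidence, so this is a consistency check rather than a forecast.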

SWE-bench — the coding benchmark used across the industry — went from single-digit scores to saturation in two years. CORE-Bench, which measures AI’s ability to reproduce scientific research, went from 20% success in 2024 to saturation in 15 months.

[Fig. 01 · Self-improvement timeline: "The pace no one planned for."
Panel 1, task horizon (how long AI can work autonomously): 4 min (Claude Opus 3, Mar 2024); 90 min (Claude Sonnet 3.7, Mar 2025); 12 hrs (Claude Opus 4.6, Mar 2026). Doubling roughly every four months.
Panel 2, the pivot: 80%+ of Anthropic's merged code written by Claude in May 2026, vs low single digits in early 2025; 8× more code merged per quarter vs 2024; 97% of performance gap recovered autonomously, vs 23% by human researchers in the same window; $18k compute cost for 800 hrs of autonomous research.
Source: Anthropic Institute, 2026]

The shape of things today is roughly: humans have ideas, and models implement, test, and evaluate them an order of magnitude faster. How long before models have the ideas too?

The autonomous research project

In April 2026, Anthropic ran an experiment that deserves more attention than it has received.

They set AI agents to work on a research problem where a performance gap existed between weak and strong model supervisors. Human researchers, working for one week, recovered 23% of that gap. The AI agents, running for 800 cumulative hours at a compute cost of approximately $18,000, recovered 97%.

Not 97% of what the humans found. 97% of the total possible gap.
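To put those numbers on a common footing, here is a small back-of-the-envelope sketch using only the figures quoted above. The variable names are illustrative. The conversion to weeks assumes a single agent running around the clock, which the source does not state: "cumulative" hours may well have been spread across parallel agents, so the wall-clock time could be much shorter.

```python
# Figures quoted in the text.
COMPUTE_COST_USD = 18_000
AGENT_HOURS = 800
AGENT_GAP_RECOVERED = 0.97
HUMAN_GAP_RECOVERED = 0.23

# Compute cost per cumulative agent-hour.
cost_per_agent_hour = COMPUTE_COST_USD / AGENT_HOURS

# How much more of the gap the agents recovered than the human researchers did.
recovery_ratio = AGENT_GAP_RECOVERED / HUMAN_GAP_RECOVERED

# ASSUMPTION: one agent running 24 hours a day, 7 days a week (168 hours per week).
agent_weeks_if_serial = AGENT_HOURS / (7 * 24)

print(f"cost per agent-hour: ${cost_per_agent_hour:.2f}")             # $22.50
print(f"gap recovered, agents vs humans: {recovery_ratio:.1f}x")      # 4.2x
print(f"weeks if run serially: {agent_weeks_if_serial:.1f}")          # 4.8
```

The comparison is not like-for-like: the human figure covers one week of work, while the agent figure covers 800 cumulative hours. The per-hour cost is the number that matters for budgeting.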

By November 2025, Claude Opus 4.5 was outperforming human researchers 51% of the time on next-step decisions in research tasks. By April 2026, Claude Mythos Preview was winning 64% of head-to-head comparisons with researchers on direction choices.

The code optimisation story is similarly striking. In May 2025, Claude Opus 4 achieved a roughly 3x speedup on optimisation tasks, against the 4x that human researchers typically reach in four to eight hours. By April 2026, Claude Mythos Preview was achieving 52x.

What “parity” actually means

An internal Anthropic survey in March 2026 asked 130 employees to estimate their output increase with Mythos Preview access. The median answer was 4x. One employee, who said they hadn’t written code themselves in five months, described leaning “hard into Claudifying” for the past year.

The company’s assessment of code quality: roughly at parity with human-written code by 2026, expected to be strictly better within the year.

This is not a productivity story. It’s a structural shift in what “human-in-the-loop” means. The current division of labour, in their own words, is: humans have ideas, and models implement, test, and evaluate them an order of magnitude faster. The comparative advantage of humans is “still in seeing the bigger picture and thinking beyond immediate task confines.”

The word “still” is doing a lot of work in that sentence.

Three futures Anthropic sees

The Anthropic paper outlines three possible trajectories:

The plateau. Progress stalls. The capabilities diffuse widely but don’t compound. Today’s tools become commoditised and the ecosystem normalises around them. This is the benign scenario — genuinely useful AI as a productivity layer, not a self-accelerating one.

Continued efficiency. Humans set direction; AI automates execution. A 100-person organisation does the work of a 10,000-person one. This is the intermediate case, already partially underway. It compresses competitive timelines dramatically and renders many conventional staffing strategies obsolete.

Full recursion. AI systems design and build their successors. Humans move to oversight roles. The pace of progress is determined by compute availability, not human ingenuity. This is the scenario that makes AI governance genuinely urgent.

Anthropic’s authors do not know which path we’re on. They argue for investigating whether international verification systems — analogous to nuclear arms control — could enable credible slowdowns or pauses in frontier development if the third trajectory becomes clear.

What this means for organisations

Most enterprise AI strategy is calibrated for the plateau scenario, even when it claims not to be. The planning assumptions (headcount, timelines, tooling cycles) reflect a world where AI is a productivity multiplier on existing workflows, not a structural change in which workflows require humans.

The 97% autonomous research result changes that calculus. It is not a demonstration that AI can replace researchers. It is a demonstration that the bottleneck in research can shift from human capacity to compute budget: $18,000 bought 800 hours of agent work that recovered 97% of the gap, where a week of human effort recovered 23%.

For anyone making decisions about AI capability over a two-to-three-year horizon, the doubling-every-four-months task trajectory is the most important number in this report. The question is not whether you believe it. The question is what it implies if it holds for another eight months.
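What "holds for another eight months" would mean in concrete terms can be sketched as a simple extrapolation. This is a hypothetical projection, not a figure from the report: it assumes the 12-hour horizon from March 2026 as the starting point and a constant four-month doubling time. The function name `projected_horizon_hours` is illustrative.

```python
def projected_horizon_hours(current_hours, months_ahead, doubling_months=4.0):
    """Extrapolate the task horizon, ASSUMING the doubling rate simply continues."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# Starting point quoted in the text: 12 hours as of March 2026.
projections = {m: projected_horizon_hours(12, m) for m in (4, 8, 12)}

for months, hours in projections.items():
    print(f"+{months} months: {hours:.0f} hours")
# +4 months: 24 hours
# +8 months: 48 hours
# +12 months: 96 hours
```

Under that assumption, eight more months takes the horizon from 12 hours to about two days of uninterrupted work. Nothing in the data guarantees the trend continues, which is exactly why the three scenarios above diverge.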

References

Anthropic Institute, “Recursive Self-Improvement” (2026) — anthropic.com/institute/recursive-self-improvement
Anthropic internal data: code authorship, task horizons, productivity surveys (Q1–Q2 2026)
SWE-bench benchmark results, 2024–2026
CORE-Bench benchmark results: 20% (2024) to saturation (2025)
Anthropic, autonomous research project data: 97% gap recovery, 800 hrs, $18,000 compute
