
Anthropic just told the world, on the record, that Claude now writes more than 80% of the code that gets merged into its own systems, up from "low single digits" before the latest model generation. This is the most concrete admission yet that the AI-coding-replaces-engineers narrative is now self-confirming at the model labs.
What You Need to Know: In a June 4, 2026 blog post titled "When AI Builds Itself," Anthropic's Institute team disclosed that Claude now authors more than 80% of the merged code across Anthropic's own codebase, and that the average engineer ships roughly 8× as much code per quarter as they did in 2021–2025. The same week, fresh coverage of two recent agent-coding benchmarks, SWE-Bench Pro and Terminal-Bench 2.0, cemented "agentic coding" as the next evaluation battleground.
Anthropic's Institute page on recursive self-improvement, published June 4, 2026, states the productivity figure in one line: "today, Anthropic engineers on average ship 8x as much code per quarter as they did from 2021-2025." The Scientific American summary, written by Chris Stokel-Walker, makes the related claim explicit: "Anthropic said Claude now writes more than 80 percent of the code merged into its systems, up from low single digits before the [latest model generation]." That stat is now part of the public record. Multiple outlets have reported it, including the Metaintro coverage and the Tom's Hardware analysis.
The post also walks through the four eras of internal AI usage: 2021–2023 (laptops), 2023–2025 (chatbots suggesting code), 2025–2026 (coding agents writing entire files), and "today" (autonomous agents delegating hours of work to other agents). That trajectory is the substance behind the 80% number.
Codingfleet's June 4, 2026 SWE-Bench Pro explainer lays out the new evaluation stack. SWE-Bench Verified, the 2024–2025 standard, was effectively saturated among frontier models by late 2025. SWE-Bench Pro (released by Scale AI in late 2025, updated in 2026) requires agents to handle long-horizon, multi-file tasks in full repositories, with private test sets. Terminal-Bench 2.0, from the Terminal-Bench paper on arXiv (January 2026), measures CLI and DevOps workflows, where "agents live or die" in production.
The current leaderboard picture, per Firecrawl's June 2026 ranking: GPT-5.5 at 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro. Claude Opus 4.6 sits at 80.8% on SWE-Bench Verified (the older metric), so the two models' figures here come from different benchmarks and are not a head-to-head comparison. The takeaway: the new benchmarks are doing their job, differentiating models that the old ones had flattened.
The 80% number is real, but it's not the whole story. What it actually says is that Anthropic's engineering team has reorganized around AI-first workflows: humans set the spec, Claude does the implementation, humans review. The 8× productivity multiplier is downstream of that reorganization: you don't get 8× by typing faster, you get it by changing what your job is. For the rest of the industry, the lesson isn't "fire your engineers." It's "stop writing code by hand for problems that an agent can implement and test against a spec you wrote."
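To make the "you don't get 8× by typing faster" point concrete, here is a rough back-of-the-envelope sketch in Python. It rests on an assumption the sources do not confirm: that the 8× figure and the 80% figure describe the same body of code over the same period, so they can be multiplied together. The function name `split_output` and its parameters are hypothetical, introduced only for this illustration.

```python
def split_output(multiplier: float, ai_share: float) -> tuple[float, float]:
    """Split per-engineer output into AI-authored and human-authored parts.

    Both results are in units of the old per-engineer baseline (1.0 = what an
    engineer shipped per quarter in the earlier period).

    ASSUMPTION (not from the source): `multiplier` and `ai_share` measure the
    same code over the same period, so multiplying them is meaningful.
    """
    if multiplier < 0:
        raise ValueError("multiplier must be non-negative")
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    ai_part = multiplier * ai_share
    human_part = multiplier - ai_part
    return ai_part, human_part


# Reported figures: roughly 8x output, with more than 80% authored by Claude.
# Using exactly 0.80 gives a lower bound on the AI part and an upper bound
# on the human part.
ai_part, human_part = split_output(multiplier=8.0, ai_share=0.80)
print(f"AI-authored: at least {ai_part:.1f}x baseline")
print(f"Human-authored: at most {human_part:.1f}x baseline")
```

Under that assumption, hand-written code per engineer is at most about 1.6× the old baseline, and the rest of the 8× comes from code Claude wrote. If the two figures were measured differently, the split would not hold.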
Anthropic says Claude now writes more than 80% of its own merged code, with engineers shipping 8× more per quarter than they did in 2021–2025. Two agent-coding benchmarks, SWE-Bench Pro and Terminal-Bench 2.0, are emerging as the field's yardsticks, with GPT-5.5 posting the headline scores on both and Claude Opus 4.6 cited only on the older SWE-Bench Verified.
Source: VentureBeat | mr.technology — The Master Skill Index