AI Models · 2026-06-15

Claude Fable 5 Just Hit 95% on SWE-Bench Verified. The Frontier Didn't Creep — It Jumped.

Anthropic shipped Claude Fable 5 on June 9, 2026 — 95.0% on SWE-Bench Verified, 80.3% on SWE-Bench Pro, 92.6% on GPQA Diamond, 84.3% on Terminal-Bench 2.1, Artificial Analysis Intelligence Index 65 vs. GPT-5.5 at 60. Twice the price of Opus 4.8. For everyone building agents, this is the week the abstraction moved.

Hey guys, Mr. Technology here.

Let's skip the politics. The 72-hour shutdown saga is a story for another column. The safety-classifier drama is a story for another column. This is the column about the numbers, because the numbers from the June 9, 2026 launch of Claude Fable 5 are the biggest capability jump on a generally-available model in the last two years, and if you are an engineer building anything with an agent in it, the implications land on you directly.

The Benchmark Sheet, End To End

Anthropic published a 244-page system card. Independent labs confirmed the headline. Here is what Fable 5 scored on the public API, with the safety classifier attached (it routes fewer than 5% of sessions back to Opus 4.8):

  • SWE-Bench Verified: 95.0% (Vals AI: 94.83% ± 0.98). For context: in mid-2023 the best model scored 4.4%. By late 2024, the best score was 71.7%. Fable 5 is the first model to reach 95% on real GitHub issues.
  • SWE-Bench Pro: 80.3% (the harder, contamination-resistant version). Opus 4.8 sat at 80.0%. GPT-5.5 sat at 77.8%. The leaderboard just rotated.
  • GPQA Diamond: 92.6% (graduate-level science). Opus 4.6 was 91.3%. Anthropic says GPQA is now saturated and plans to stop reporting it.
  • Terminal-Bench 2.1: 84.3% (Mythos 5, identical weights, scores 88.0%; the gap is the safety refusal rate on ~20.9% of cyber-adjacent trials).
  • Artificial Analysis Intelligence Index: 65. First place. GPT-5.5: 60. Gemini 3.1 Pro Preview: 57.
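
As a quick sanity check on the first bullet, the independent Vals AI result can be compared against Anthropic's headline figure. This is a minimal sketch that uses only the numbers quoted above; the function name is illustrative, and treating the ± value as a symmetric margin around the mean is an assumption.

```python
def within_interval(claimed: float, measured: float, margin: float) -> bool:
    """Return True if the claimed score lies inside measured +/- margin."""
    return (measured - margin) <= claimed <= (measured + margin)


# Figures quoted in the list above, in percent.
ANTHROPIC_SWE_BENCH_VERIFIED = 95.0
VALS_AI_MEAN = 94.83
VALS_AI_MARGIN = 0.98  # assumed to be a symmetric margin

low = VALS_AI_MEAN - VALS_AI_MARGIN
high = VALS_AI_MEAN + VALS_AI_MARGIN
print(f"Vals AI interval: {low:.2f}% to {high:.2f}%")
print(
    "Headline inside interval:",
    within_interval(ANTHROPIC_SWE_BENCH_VERIFIED, VALS_AI_MEAN, VALS_AI_MARGIN),
)
```

Under that reading, the 95.0% headline sits inside the independently measured range rather than above it.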

Read those numbers again. SWE-Bench Verified went from 4.4% in mid-2023 to 95.0% in June 2026. That is the slope of a generation. That is not "incremental progress." That is a different category of tool.

The pricing is direct: $10 per million input tokens, $50 per million output tokens — twice what Opus 4.8 costs, cheaper than GPT-5.5 Pro. Stripe reported it compressed a 50-million-line Ruby migration into a single day. Physical Superintelligence said it is the strongest model it has tested on frontier physics research, at one-third the reasoning tokens.

What The Numbers Actually Mean For Builders

SWE-Bench Verified is the closest public proxy for "can this model fix a real bug in a real codebase." At 4.4%, the answer was "no." At 95.0%, the answer is "yes, more often than your senior engineer on a Tuesday afternoon." That is a headcount line item. Any team that has been holding an agent behind a PR review because the failure rate is too high to trust unsupervised is going to revisit that decision this month.

Three things are happening in production this week because of those numbers:

1. Long-horizon tasks are now economically viable. Fable 5 is materially better at long-horizon memory management — keeping a thread across hours of work without losing the plot. That is what unblocks the "run an agent overnight to refactor a service" use case. The hour-three-to-hour-ten gap is closing.

2. Reasoning is now cheaper per task, even at 2x the token price. Physical Superintelligence reported Fable 5 hit the same research quality at one-third the reasoning tokens. Cost-of-intelligence, the only number procurement teams should care about, is lower than the headline rate suggests. For a class of workloads, this is the first Mythos-tier model where the cost curve has bent back down.

3. The agent failure mode is shifting. On a 95%-SWE-Bench model, the failure mode is no longer "the agent can't fix the bug." It is "the agent fixed the wrong bug" or "the agent fixed the right bug and broke a contract the test suite didn't cover." Better diffs, better test selection, better semantic review, better rollback — the harness is now the binding constraint. The model is no longer the bottleneck. Your harness is.
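
Point 2 is simple arithmetic, and it is worth running once. The sketch below is a rough illustration, not a pricing model. It assumes the one-third figure applies to all output tokens for a task, that the baseline is Opus 4.8 (the Physical Superintelligence quote does not name one), that Opus 4.8 output costs half of Fable 5's $50 per million (inferred from "twice what Opus 4.8 costs"), and it ignores input tokens. The 300,000-token task size is hypothetical.

```python
FABLE_OUTPUT_PRICE = 50.0  # USD per million output tokens, from the article
# Inferred from "twice what Opus 4.8 costs"; not a published figure here.
OPUS_OUTPUT_PRICE = FABLE_OUTPUT_PRICE / 2


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical task: the baseline model spends 300k output tokens on it.
baseline_tokens = 300_000
# Assumption: Fable 5 needs one-third of the tokens for the same result.
fable_tokens = baseline_tokens // 3

opus_cost = output_cost(baseline_tokens, OPUS_OUTPUT_PRICE)
fable_cost = output_cost(fable_tokens, FABLE_OUTPUT_PRICE)
ratio = fable_cost / opus_cost

print(f"Baseline cost per task: ${opus_cost:.2f}")
print(f"Fable 5 cost per task:  ${fable_cost:.2f}")
print(f"Fable 5 / baseline:     {ratio:.2f}")
```

Twice the price times one-third the tokens is two-thirds the cost per task. The saving disappears for any workload where the token reduction is smaller than half.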

The Real Headline Inside The Headlines

Two things got buried under the safety-classifier controversy that matter more than the classifier itself.

First, the capability curve did not flatten. For two years, the consensus has been that we are hitting a wall and benchmark gains are slowing. The numbers above are a direct falsification of that consensus, on the public-API axis at least. Anthropic got a one-cycle lead because they shipped. OpenAI, Google, and the rest of the frontier have about six months to respond before the procurement argument is over.

Second, the Fable-5-versus-Mythos-5 split changes enterprise procurement. Fable 5 is the public model with the safety classifier on top. Mythos 5 is the same weights with the classifier lifted in cyberdefense, biology, and high-stakes research domains, gated to verified partners: Project Glasswing, Mayo Clinic–style institutions, the named neoclouds. Anthropic is selling the classifier-lifted version behind a separate application and SLA. Expect a new line item in your next contract: "Mythos-tier access, customer-trained classifier overrides, $50/M input." It will be the most expensive API call in your stack, and you are going to pay it.

The Catch, And The Bill

The catch is the rate limit. The catch is the safety classifier on queries that mention a CVE, a binding site, a primer, or anything that smells like a wet-lab protocol. The catch is the export-control letter three days later that pulled the model for everyone. None of those are about the model. They are about who is allowed to use the model, and the fact that the second-best answer to "is this safe enough to ship?" is currently "ship it, and let the US Commerce Department decide on day three."

The bill is real. A 50-million-line migration is a few hundred million tokens of agent output. Stripe can afford it; most teams cannot. The rational pattern: Fable 5 for the hard, novel, agentic work where the model amortizes the rate; Opus 4.8 or Sonnet 4.6 for everything else. The two-tier cost architecture is the new default.
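
The two-tier pattern can be sketched as a small router with a cost estimate attached. Everything here is illustrative: the model labels are plain strings rather than real API model IDs, the `hard` flag stands in for whatever signal a team uses to mark novel agentic work, the Opus 4.8 price is inferred as half of Fable 5's, and 300 million output tokens is one hypothetical reading of "a few hundred million." Input tokens, retries, and tooling are excluded.

```python
from dataclasses import dataclass

# USD per million output tokens. Fable 5 is from the article;
# Opus 4.8 is inferred from "twice what Opus 4.8 costs".
PRICE_PER_M_OUTPUT = {"fable-5": 50.0, "opus-4.8": 25.0}


@dataclass
class Task:
    name: str
    output_tokens: int
    hard: bool  # hypothetical flag: novel, agentic, long-horizon work


def pick_model(task: Task) -> str:
    """Route hard tasks to the expensive tier, everything else to the cheap one."""
    return "fable-5" if task.hard else "opus-4.8"


def estimated_cost(task: Task) -> float:
    """Output-token cost in USD on the tier the router picks."""
    price = PRICE_PER_M_OUTPUT[pick_model(task)]
    return task.output_tokens / 1_000_000 * price


migration = Task("ruby-migration", 300_000_000, hard=True)
changelog = Task("changelog-summary", 2_000_000, hard=False)

for task in (migration, changelog):
    print(task.name, pick_model(task), f"${estimated_cost(task):,.2f}")
```

The routing rule is deliberately crude. In practice the interesting work is deciding what counts as `hard`, which is a harness question rather than a model question.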

The Take

The frontier did not creep this week. It jumped. SWE-Bench Verified 95.0%, GPQA Diamond 92.6%, Terminal-Bench 2.1 84.3%, Intelligence Index 65. Stripe ran a 50-million-line migration in a day. The model is twice the price of Opus 4.8 and still cheaper than GPT-5.5 Pro. The Mythos 5 split puts the most-capable weights behind a separate enterprise procurement line. The safety classifier means fewer than 5% of sessions fall back to Opus. The shutdown means the model is also a test case for real-time frontier regulation, and the regulators have set the timeline at 72 hours.

If you are building agents and you have not rerun your eval suite against Fable 5 yet, you are reading the wrong column. The model moved this week. The frontier moved with it. The harness is now the bottleneck. The harness is your job.

Mr. Technology

Sources:

  • Anthropic, "Claude Fable 5 and Mythos 5 are now available" — https://www.anthropic.com/news/claude-fable-5-mythos-5
  • Vals AI, "Claude Fable 5 — independent benchmark runs" — https://www.vals.ai/models/anthropic_claude-fable-5
  • CNBC, "Anthropic releases Mythos-like AI model to the public" — https://www.cnbc.com/2026/06/09/anthropic-mythos-claude-fable-5.html
  • Fortune, "Anthropic releases its first Mythos-class model to the public" — https://fortune.com/2026/06/09/anthropic-releases-its-first-mythos-model-to-the-public/
  • Nathan Lambert / Interconnects, "Claude Fable 5 and new safety fables" — https://www.interconnects.ai/p/claude-fable-5-and-new-ai-safety
  • R&D World, "How Claude Fable 5 stacks up against Opus 4.8 and GPT 5.5" — https://www.rdworldonline.com/how-claude-fable-5-stacks-up-against-opus-4-8-and-gpt-5-5/