On June 30, 2026, Anthropic shipped Claude Sonnet 5 and quietly closed most of the gap to Opus 4.8 at 40% to 60% of the list price. SWE-bench Pro 63.2%, OSWorld 81.2%, HLE 57.4%. The mid-tier is now the default for production agents.

Claude Sonnet 5 Is the First Sonnet That's Actually Worth Using for Agents

Hey guys, Mr. Technology here.

On June 30, 2026, Anthropic shipped Claude Sonnet 5. No keynote, no livestream, no Mythos-class naming stunt. Sonnet 5 is the first Sonnet genuinely worth deploying for agentic production workloads. Not a "small Opus." A new tier in everything but name.

The headline numbers:

SWE-bench Pro: 63.2% (Sonnet 4.6: 58.1%; Opus 4.8: 69.2%)
Terminal-Bench 2.1: 80.4% (Sonnet 4.6: 67.0%)
OSWorld-Verified: 81.2% (Sonnet 4.6: 78.5%)
Humanity's Last Exam (with tools): 57.4% (Sonnet 4.6: 46.8%; Opus 4.8: 57.9%)
GDPval-AA v2 (knowledge work): 1,618 — higher than Opus 4.8's 1,615

That last line is the one nobody is talking about. For the first time, Opus does not beat a Sonnet-class model on every evaluation Anthropic publishes: Sonnet 5 edges ahead on GDPval-AA v2.
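To see how close the two models sit, here is a quick sketch that divides each Sonnet 5 score by the matching Opus 4.8 score. It only uses the three benchmarks where a figure is quoted above for both models; the dictionary layout and function name are illustrative. GDPval-AA v2 is a rating rather than a percentage, so treat its ratio as a sign check, not a proportional gap.

```python
# Scores copied from the list above: (Sonnet 5, Opus 4.8).
# Only benchmarks with a quoted Opus 4.8 figure are included.
scores = {
    "SWE-bench Pro": (63.2, 69.2),
    "HLE (with tools)": (57.4, 57.9),
    "GDPval-AA v2": (1618, 1615),
}


def relative_to_opus(sonnet: float, opus: float) -> float:
    """Return the Sonnet 5 score as a fraction of the Opus 4.8 score."""
    return sonnet / opus


ratios = {name: relative_to_opus(s, o) for name, (s, o) in scores.items()}

# Benchmarks where Sonnet 5 scores above Opus 4.8.
sonnet_wins = [name for name, ratio in ratios.items() if ratio > 1.0]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of Opus 4.8")
print("Sonnet 5 ahead on:", sonnet_wins)
```

On these numbers Sonnet 5 lands at roughly 91% of Opus 4.8 on SWE-bench Pro and about 99% on HLE, and it is ahead only on GDPval-AA v2.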

The Price-Performance Map Just Shifted

Pricing is $2 / $10 per MTok through August 31, 2026, then $3 / $15 per MTok at standard. Opus 4.8 is $5 / $25. Even after the step-up, Sonnet 5 is roughly 40% cheaper than Opus 4.8 at list.
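A minimal sketch of the list-price arithmetic, using only the prices quoted above. The model keys and function names are labels made up for this example, not API identifiers, and the token counts in the usage lines are arbitrary.

```python
# List prices in USD per million tokens (input, output), as quoted above.
# The dictionary keys are illustrative labels, not real model IDs.
PRICES = {
    "sonnet-5-intro": (2.0, 10.0),
    "sonnet-5-standard": (3.0, 15.0),
    "opus-4-8": (5.0, 25.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def discount_vs_opus(model: str, input_tokens: int, output_tokens: int) -> float:
    """Fractional saving versus Opus 4.8 for the same token counts."""
    opus_cost = request_cost("opus-4-8", input_tokens, output_tokens)
    return 1 - request_cost(model, input_tokens, output_tokens) / opus_cost


for label in ("sonnet-5-intro", "sonnet-5-standard"):
    saving = discount_vs_opus(label, input_tokens=100_000, output_tokens=20_000)
    print(f"{label}: {saving:.0%} cheaper than Opus 4.8 at list")
```

Because both Sonnet 5 prices scale input and output by the same factor relative to Opus 4.8, the saving is the same for any input/output mix: about 60% at the introductory price and about 40% at standard. This ignores any difference in how many tokens each model needs for the same job.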

But the new tokenizer inflates token counts by 1.27x–1.42x for English and code. List price flat; effective price for the same text up 27–42%. Multiply your Sonnet 4.6 token budgets by roughly 1.3 and re-run the cost forecast before August 31.

| Document | Sonnet 4.6 tokens | Sonnet 5 tokens | Ratio |
| --- | --- | --- | --- |
| UDHR (English) | 2,356 | 3,341 | 1.42x |
| UDHR (Spanish) | 3,572 | 4,747 | 1.33x |
| UDHR (Mandarin) | 3,334 | 3,360 | 1.01x |
| sqlite-utils/db.py | 44,014 | 56,113 | 1.27x |

Mandarin at 1.01x is the tell: the tokenizer change hits English and code far harder than CJK. That pattern looks like a deliberate vocabulary trade-off rather than a regression. Not a bug. A choice.
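The ratios in the table, and what they do to the effective price, can be reproduced with a few lines. This is a sketch: the token counts come straight from the table, while the function names and the choice to apply the ratio to the $3 standard input price are my own framing.

```python
# Token counts from the table above: (Sonnet 4.6 tokens, Sonnet 5 tokens).
TOKEN_COUNTS = {
    "UDHR (English)": (2356, 3341),
    "UDHR (Spanish)": (3572, 4747),
    "UDHR (Mandarin)": (3334, 3360),
    "sqlite-utils/db.py": (44014, 56113),
}


def inflation(old_tokens: int, new_tokens: int) -> float:
    """How many Sonnet 5 tokens the same text needs per Sonnet 4.6 token."""
    return new_tokens / old_tokens


def effective_price(list_price_per_mtok: float, ratio: float) -> float:
    """List price scaled by tokenizer inflation, i.e. cost per 'old' MTok of text."""
    return list_price_per_mtok * ratio


for doc, (old, new) in TOKEN_COUNTS.items():
    ratio = inflation(old, new)
    print(f"{doc}: {ratio:.2f}x, effective input price ${effective_price(3.0, ratio):.2f}/MTok")
```

Swap in token counts measured on your own prompts before trusting the 1.3 rule of thumb; four documents are a small sample.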

Effort Levels Are the Real Story

Anthropic exposes four effort levels on Sonnet 5: low, medium, high, xhigh. Higher effort spends more tokens on reasoning, raising both quality and cost. **At xhigh, Sonnet 5 can cost more than Opus 4.8** for comparable quality.

Sonnet 5 is a single SKU covering a 3x cost band and a 20+ point capability band. Routing decides where it sits.

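```python
from dataclasses import dataclass

@dataclass
class Task:
    is_high_volume_latency_sensitive: bool = False

def select_model(task: Task, accuracy_critical: bool = False) -> str:
    """Route a task to a model tier; Sonnet 5 is the default."""
    if task.is_high_volume_latency_sensitive:
        return "claude-haiku-4-5"     # $1/$5 per MTok
    if accuracy_critical:
        return "claude-opus-4-8"       # $5/$25 per MTok
    return "claude-sonnet-5"           # default: $3/$15 (or $2/$10 intro) per MTok
```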

**Sonnet 5 absorbs both the Sonnet 4.6 use case and a chunk of the Opus 4.8 use case at medium–high effort.** Opus is no longer the default. It is the exception.
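If one SKU covers that much ground, the routing decision moves from "which model" to "which effort level". Here is a rough sketch of what that could look like. The four effort names come from the release; the `AgentTask` fields, the step threshold, and the mapping itself are assumptions for illustration, not anything Anthropic has published.

```python
from dataclasses import dataclass
from enum import Enum


class Effort(str, Enum):
    """The four effort levels named in the release."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    XHIGH = "xhigh"


@dataclass(frozen=True)
class AgentTask:
    """Hypothetical task descriptor; the field names are illustrative."""
    expected_steps: int
    accuracy_critical: bool = False
    latency_sensitive: bool = False


def select_effort(task: AgentTask, long_chain_threshold: int = 10) -> Effort:
    """Pick an effort level for a Sonnet 5 call (assumed policy, tune on your evals)."""
    if task.latency_sensitive:
        return Effort.LOW      # cheapest, fastest path
    if task.accuracy_critical:
        return Effort.XHIGH    # may cost more than Opus 4.8, so reserve it
    if task.expected_steps > long_chain_threshold:
        return Effort.HIGH     # long tool chains get more reasoning budget
    return Effort.MEDIUM       # default for everything else


print(select_effort(AgentTask(expected_steps=3)).value)
print(select_effort(AgentTask(expected_steps=25)).value)
```

The threshold of 10 steps is a placeholder. The useful part is the shape: latency first, accuracy second, chain length third, medium as the fallback.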

The Capability Wins That Matter

Multi-step debugging without prompting. Replit engineers described Sonnet 5 investigating a bug, writing a reproducing test, implementing the fix, then stashing it to confirm the bug came back without the change. Unprompted. The model is maintaining a hypothesis and verifying it.

Brownfield reliability. Cursor reported Sonnet 5 traced failures to root causes on messy legacy code rather than patching symptoms.

End-to-end business workflows. Zapier gave it a two-part job — update Salesforce tiers, send a launch email — and it finished without stalling halfway. Sonnet 5 finishes the chain.

Computer-use agents. Pace runs insurance workflows (submission intake, FNOL, loss runs) on production systems; Sonnet 5 is "consistently taking the right action." In 2024 the same workloads required Opus.

The Safety Posture Is Deliberately Lower

This is the part Anthropic flagged in the system card — and the part that explains why this model could ship publicly at all after the Fable 5 / Mythos 5 export-control detonation of June 12.

"Sonnet 5 is significantly less capable at cyber tasks than Mythos 5: its safeguards are thus similar to those we apply to Opus 4.7 and Opus 4.8."

On the Firefox exploit evaluation developed with Mozilla, Sonnet 5 was unable to develop a full working exploit (0.0%), while Mythos 5 and Opus 4.8 each produced one. By design, the model is less cyber-capable than Opus 4.8. Anthropic shipped a model that is good enough at agents but not good enough at cyber to trigger the same export-control logic that took Fable 5 offline.

The tradeoff: Sonnet 5 has a higher rate of misaligned behavior than Opus 4.8 or Mythos Preview on the automated behavioral audit. Lower than Sonnet 4.6, higher than the top tier. Accept it or don't ship it.

The Sampling Parameter Deprecation

This one is going to break things:

*"Sampling parameters temperature, top_p, top_k are no longer supported."*

The model is now non-deterministic by default, and you cannot dial it back. If your agent harness logs and replays tool calls for reproducibility, your replay path is now a soft contract. Setting "thinking": {"type": "disabled"} removes one source of variance, the reasoning trace, but it does not make sampling deterministic, so treat it as the lower-variance path rather than a reproducible one. Architect for non-determinism. It is no longer optional.
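One way to keep a replay path useful when you cannot re-sample the same output is to replay from the log instead of from the model. The sketch below records each tool call with its result, then re-executes the recorded calls and reports which results changed. Everything here (function names, log shape, the toy `add` tool) is hypothetical and no model API is called.

```python
import json
from typing import Any, Callable

ToolFn = Callable[..., Any]


def record_call(log: list, name: str, args: dict, result: Any) -> None:
    """Append one tool call and its observed result to the run log."""
    log.append({"tool": name, "args": args, "result": result})


def replay(log: list, tools: dict) -> list:
    """Re-run every logged tool call; return the indices whose result changed."""
    mismatches = []
    for index, entry in enumerate(log):
        fresh = tools[entry["tool"]](**entry["args"])
        if fresh != entry["result"]:
            mismatches.append(index)
    return mismatches


# Toy tool registry standing in for a real agent's tools.
tools = {"add": lambda a, b: a + b}

log: list = []
record_call(log, "add", {"a": 2, "b": 3}, tools["add"](a=2, b=3))

# Round-trip through JSON as if the log had been written to disk and read back.
restored = json.loads(json.dumps(log))
print("diverging calls:", replay(restored, tools))
```

The model's decisions are frozen in the log, so the replay checks your tools and environment rather than asking the model to repeat itself. Results must be JSON-serializable for the comparison to survive the round trip.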

The Strategic Read

Read 1: Anthropic is fighting the export-control regime with product segmentation. Fable 5 lasted 72 hours. Mythos 5 is restricted. Sonnet 5 is the model Anthropic can ship to every Free and Pro user in every country — nearly as capable on the workloads that matter. The flagship product is the one that is publicly available everywhere.

Read 2: The mid-tier is now the production tier. Sonnet 5 at medium undercuts Opus 4.8 by ~60% on list at the introductory price (~40% at standard) while delivering 91% of the SWE-bench Pro score. "Most of Opus for roughly half the price" is the right tradeoff for most agents.

Read 3: Effort levels are the new moat. Sonnet 5 at low competes with Haiku 4.5; at xhigh it competes with Opus 4.8. One SKU, a roughly 3x cost band. Every other frontier lab has to match the pattern or lose the routing economics.

The Practical Take

Default to Sonnet 5 for new agent builds. Opus 4.8 is the accuracy exception now.
Use the effort knob. medium is the new high. xhigh is the new Opus-call. Stop routing by model name; route by effort.
Recalibrate cost forecasts. New tokenizer inflates English/code tokens by ~30%. Budgets built against Sonnet 4.6 are wrong.
Architect for non-determinism. temperature is gone. Your replay path is soft. Build the eval harness accordingly.
Watch the export-control math. Sonnet 5 was tuned to stay below the cyber-capability threshold that took Fable 5 down. The next Mythos-class ship is a different regulatory negotiation.
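For the "architect for non-determinism" item, the simplest harness change is to stop scoring a task on one run and score it on a pass rate over many. A minimal sketch follows; `simulated_run` is a stand-in for a real agent invocation, and the 0.9 success probability and 0.95 release bar are invented numbers.

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[], bool], trials: int) -> float:
    """Fraction of independent trials in which the agent run succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return sum(1 for _ in range(trials) if run_once()) / trials


def is_release_ready(rate: float, threshold: float = 0.95) -> bool:
    """Gate a release on the measured pass rate (threshold is an assumption)."""
    return rate >= threshold


# Stand-in for a real agent run: succeeds with probability 0.9.
rng = random.Random(0)


def simulated_run() -> bool:
    return rng.random() < 0.9


rate = pass_rate(simulated_run, trials=200)
print(f"pass rate over 200 trials: {rate:.2f}")
print("release ready:", is_release_ready(rate))
```

Pick the trial count from how tight you need the estimate to be; a handful of runs will not separate a 90% agent from a 95% one.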

Sonnet 5 is not a flashy release. No Mythos-class naming, no 95% SWE-Bench Verified headline, no 72-hour shutdown drama. It is, however, the model 90% of production agents should now run on — shipped yesterday at the lowest list price Anthropic has offered on a frontier-tier SKU. The mid-tier is the new frontier. Sonnet 5 is the proof.

— Mr. Technology

*Released: June 30, 2026. Model: claude-sonnet-5. Pricing: $2/$10 per MTok introductory through August 31, 2026, then $3/$15 standard. Context: 1,000,000 input / 128,000 output. Adaptive thinking: on by default. Sampling parameters: temperature, top_p, top_k deprecated. Effort levels: low, medium, high, xhigh. Tokenizer: updated (same as Opus 4.7), 1.27x–1.42x inflation on English/code. Benchmarks: SWE-bench Pro 63.2%, Terminal-Bench 2.1 80.4%, OSWorld-Verified 81.2%, HLE-with-tools 57.4%, GDPval-AA v2 1,618 (Opus 4.8: 1,615). Cyber capability: 0.0% full-exploit success on Firefox 147 eval. Sources: Anthropic, Sonnet 5 System Card, Simon Willison, MarkTechPost, AWS, Thurrott.*