ai · 2026-06-11

Claude Fable 5, Gemini 3.5 Live Translate, scaling test tim

Anthropic shipped Fable 5 with a 244-page system card and a safety classifier that triggers in fewer than 5% of sessions. Google pushed Gemini 3.5 Live Translate into 70+ languages with a voice-preserving model, and the test-time-compute scaling papers keep piling up.

Anthropic, Google, and a quiet academic-industrial complex all pushed the same frontier this week — but in three very different directions.

What You Need to Know: Anthropic publicly released Claude Fable 5 (the first Mythos-class model safe enough for general use) on June 9 with a 244-page system card. Google launched Gemini 3.5 Live Translate for real-time speech-to-speech translation across 70+ languages. And a fresh round of papers on scaling test-time compute re-confirmed that "thinking longer at inference" is now the dominant lever for reasoning benchmarks.

Why It Matters

  • Fable 5's classifier is the real story, not the benchmarks. Anthropic shipped a safety classifier that triggers in fewer than 5% of sessions — small enough that the model is usable for real work, strict enough that the company is comfortable putting it on the public API. The Mythos 5 configuration remains gated to "verified government cyberdefenders and infrastructure providers." This is Anthropic's first answer to the question "how do you ship a Mythos-class model without it being Mythos-class dangerous?"
  • Gemini 3.5 Live Translate is voice-preserving — that's the under-reported detail. Google Meet and Translate now do real-time speech-to-speech across 70+ languages and 2,000+ language pairs, and the model attempts to preserve the speaker's voice, tone, and pace in the translated output. That last bit is the hard part, and the part most competitors don't even attempt.
  • Test-time compute is now the default scaling axis, not the alternative one. A new wave of papers and benchmarks (from Sebastian Raschka's survey work, the Structured Test-Time Scaling paper from Xinming Tu, and Anthropic's own Fable 5 results) all confirm: the next generation of reasoning gains is coming from letting models think longer at inference, not from bigger pretraining runs. If you're budgeting for inference capacity in 2026, plan for "more thinking per query," not just "more queries."

What Actually Happened

Anthropic ships Claude Fable 5 — Mythos for the public, Mythos 5 for cyberdefenders

On June 9, 2026, Anthropic launched Claude Fable 5 as a Mythos-class 1 model with "the same performance as Claude Mythos 5, except with much more strict guardrails in place to prevent" misuse, per Anthropic's own framing and Simon Willison's first-day testing. The launch post, the 244-page system card, and partner-model availability on Google Cloud's Gemini Enterprise Agent Platform all went out the same day.

The headline capability is that Fable 5 is now the first model to clear 90% on Hex's core analytics benchmark (a SQL + Python + chart-reasoning test Barry McCardel has been running against every frontier model since 2024). The headline risk story is the classifier: it triggers in fewer than 5% of sessions, and the system card goes deep on what those triggers catch (cyberoffense scaffolding, CBRN assistance, certain persuasion patterns) and how the classifier was evaluated.

Mythos 5 — the same model with the classifier turned down — remains exclusive to "verified government cyberdefenders and infrastructure providers." Anthropic's risk taxonomy in the system card is unusually explicit about the threat scenarios that justify the gating: Mythos 5 is "clearly willing to locate deep vulnerabilities in the world's most hardened systems," as Zvi Mowshowitz put it in his system-card read-through, and the company isn't yet comfortable exposing that capability to general traffic.

Fable 5 also lands at roughly 2x the per-token price of Claude Opus 4.8, a price point Anthropic says reflects the inference cost of the classifier and the reasoning profile. Early developer guides (Lushbinary, the Medium "12 use cases" piece) are mostly positive on code-and-data workloads, with the usual caveats about the classifier occasionally blocking legitimate use.

Google launches Gemini 3.5 Live Translate

Google rolled out Gemini 3.5 Live Translate on June 9 as a new audio model that does real-time speech-to-speech translation across 70+ languages, with Google Meet and Translate as the first shipping surfaces. The model attempts to preserve the speaker's voice, tone, and prosody in the translated output, which is a step beyond what most competitors do (Google's own previous Translate pipeline was text-then-TTS, with no preservation of voice identity).

The launch landed with the usual "20 years ago Google Translate started as..." LinkedIn narrative, but the technical delta is real: 2,000+ language pairs, sub-second latency on the demo, and an explicit voice-fingerprint preservation model that doesn't just translate words but tries to translate who's saying them. CNET's coverage highlighted the "real-life conversations" framing — the model is tuned for the messy, mid-sentence, code-switching way people actually talk, not for clean studio audio.

For developers, the model is exposed through Google's standard Live API surface, and Google says it will roll out to more surfaces (Pixel phone calls, Workspace integrations, third-party apps via the Live API) over the rest of the quarter.

Test-time compute keeps eating pretraining compute

The third thread in this digest, "scaling test time compute", is the same story Sebastian Raschka has been tracking since OpenAI shipped o1: reasoning gains are now coming primarily from inference-time compute, not from bigger pretraining runs. The latest round of papers and benchmarks (Raschka's "Categories of Inference-Time Scaling" survey, the Structured Test-Time Scaling paper from Xinming Tu, and Noam Brown's Latent Space conversation on "multi-agent civilizations") all reinforce the same picture: the frontier is moving toward "let the model think longer and in more structured ways at inference."

The practical implication is that API cost curves are decoupling from model size. A Fable 5 query with extended thinking can cost 5-10x a "fast" Fable 5 query on the same model, and Anthropic's pricing reflects that. For builders, the question is no longer "which model is the smartest" but "which model gives me the best thinking-per-dollar on this workload" — and that question has a very different answer for a one-shot classification job than it does for an agentic coding task.
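One way to make "thinking-per-dollar" concrete is to compare the expected spend per successful result across thinking modes. The sketch below is illustrative only: the `Mode` class, the dollar figure, and every success rate are hypothetical placeholders, not Anthropic pricing or benchmark results. The only number taken from the text is the 5x multiplier, the low end of the 5-10x range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    cost_multiplier: float  # cost relative to a "fast" query on the same model
    success_rate: float     # hypothetical rate, to be measured on your own evals


def cost_per_success(mode: Mode, fast_query_cost: float) -> float:
    """Expected spend per successful result, assuming failed queries are retried."""
    return (fast_query_cost * mode.cost_multiplier) / mode.success_rate


def best_mode(modes: list[Mode], fast_query_cost: float) -> Mode:
    """Pick the mode with the lowest expected spend per success."""
    return min(modes, key=lambda m: cost_per_success(m, fast_query_cost))


# All numbers below are made up for illustration.
FAST_QUERY_COST = 0.01  # hypothetical dollars per fast query

classification = [Mode("fast", 1.0, 0.97), Mode("extended", 5.0, 0.98)]
agentic_coding = [Mode("fast", 1.0, 0.10), Mode("extended", 5.0, 0.70)]

print(best_mode(classification, FAST_QUERY_COST).name)  # prints "fast"
print(best_mode(agentic_coding, FAST_QUERY_COST).name)  # prints "extended"
```

With these placeholder inputs, extended thinking barely moves accuracy on the classification job, so the fast mode wins, while the agentic coding task flips the other way. At a 10x multiplier the same coding numbers would favor the fast mode again, which is why the multiplier and success rates need to come from your own measurements.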

The Latent Space conversation with Noam Brown goes further: the next wave of gains is "scaling test-time compute to multi-agent civilizations" — letting multiple reasoning agents argue with each other for longer, not just letting one agent think longer. That's a research direction, not a product, but it tells you where the labs are spending their inference budget.

The Take

Anthropic's classifier is the most important product decision in this digest, and almost nobody is framing it that way. Shipping a Mythos-class model with a 5%-trigger classifier that actually holds in production is the first real demonstration that "safety" and "capability" don't have to be a zero-sum trade. If the pattern holds, expect every other frontier lab to copy the architecture within six months.

Gemini 3.5 Live Translate is the kind of model that looks like a feature and is actually a platform shift. Voice-preserving real-time translation is what makes "AI meeting buddy" products feel like real humans instead of text-to-speech demos. Google's bet is that owning this surface gives them a wedge into every cross-language business meeting, customer call, and content workflow for the next decade. It's a reasonable bet.

The test-time-compute thread is the one I'd bet on financially. If inference is the new training, then the model labs just turned into inference companies — and the entire 2024-era "AGI comes from bigger pretraining" narrative is dead. The infra, the eval suites, the pricing pages, the talent flows are all reorganizing around that reality. Builders who plan for it now (spend caps, caching, model-routing logic that accounts for thinking depth) will outcompete the ones still treating reasoning models like faster chat models.
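The planning items named above (spend caps, caching, routing that accounts for thinking depth) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's API: the `Router` class, the cent-denominated costs, the mode names, and the `call_model` callback are all hypothetical, and costs are kept as integer cents to avoid floating-point drift against the cap.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Router:
    spend_cap_cents: int              # hard budget (hypothetical units)
    fast_cost_cents: int = 1          # hypothetical cost of a fast query
    extended_cost_cents: int = 5      # assumes the 5x low end of the cited range
    spent_cents: int = 0
    cache: dict[str, str] = field(default_factory=dict)

    def pick_mode(self, needs_deep_reasoning: bool) -> str:
        """Choose thinking depth, downgrading when the budget cannot cover it."""
        if needs_deep_reasoning and self.spent_cents + self.extended_cost_cents <= self.spend_cap_cents:
            return "extended"
        if self.spent_cents + self.fast_cost_cents <= self.spend_cap_cents:
            return "fast"
        return "refuse"

    def run(self, prompt: str, needs_deep_reasoning: bool,
            call_model: Callable[[str, str], str]) -> str:
        if prompt in self.cache:      # a cache hit costs nothing
            return self.cache[prompt]
        mode = self.pick_mode(needs_deep_reasoning)
        if mode == "refuse":
            raise RuntimeError("spend cap reached")
        answer = call_model(prompt, mode)
        self.spent_cents += self.extended_cost_cents if mode == "extended" else self.fast_cost_cents
        self.cache[prompt] = answer
        return answer


def fake_model(prompt: str, mode: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"[{mode}] {prompt}"


router = Router(spend_cap_cents=6)
print(router.run("refactor module", True, fake_model))  # extended; 5 cents spent
print(router.run("refactor module", True, fake_model))  # cache hit; still 5 cents
print(router.run("plan migration", True, fake_model))   # downgraded to fast; 6 cents
```

The design choice worth noting is the downgrade path: when the remaining budget cannot cover extended thinking, the router falls back to a fast query instead of failing, and only refuses once even a fast query would exceed the cap.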

Quick Summary

Anthropic released Claude Fable 5 (Mythos-class 1, with a classifier that triggers in fewer than 5% of sessions) alongside a 244-page system card. Google launched Gemini 3.5 Live Translate with voice-preserving real-time translation in 70+ languages. The latest round of papers confirms that scaling test-time compute, not pretraining, is now the dominant lever for reasoning gains.
