Google's DiffusionGemma shipped June 10, 2026 as an open-weights 26B MoE text-diffusion model that drafts 256 tokens in parallel — 4x faster than autoregressive Gemma 4 (1,000+ tok/s on H100, 700+ on RTX 5090), Apache 2.0, 18GB VRAM. The benchmarks are lower. The architecture is the future.

Google Just Killed the Token-by-Token Status Quo, and Almost Nobody Is Talking About It

On June 10, 2026, Google quietly released DiffusionGemma, an experimental 26-billion-parameter mixture-of-experts model that generates text the way Stable Diffusion generates images. Instead of predicting one token at a time, it drafts 256 tokens in parallel, refines them across multiple passes, and delivers more than 1,000 tokens per second on a single H100. The press cycle is dominated by Anthropic Fable 5 and Apple AFM-3 this week. Google just shipped the most architecturally significant model of 2026, and most of the industry is going to wake up next month and act surprised.

This Is Not a Benchmark Story

Let's get the benchmark story out of the way. DiffusionGemma underperforms standard Gemma 4 26B on most quality benchmarks. Google is candid about it: the blog post says, plainly, that if you want maximum output quality, deploy standard Gemma 4. The quality gap is real, and Google's own messaging makes the trade-off explicit.

That is the right framing. DiffusionGemma is not trying to win Gemma 4's leaderboard. It is trying to win a different race: local, low-concurrency, interactive inference on consumer hardware. And in that race, it is currently alone on the track.

The Architecture Is the Story

Every LLM you have ever used is autoregressive. The model reads a sequence of tokens, predicts the next one, appends it, and repeats. The bottleneck is memory bandwidth: every new token requires another pass over the model's weights, so your GPU spends most of its time waiting on memory rather than computing. That is fine in the cloud, where a server batches thousands of users and amortizes the wait. It is wasteful for a single user with a single GPU, which is exactly where the next billion LLM users are going to live.
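To make the one-pass-per-token loop concrete, here is a minimal sketch. The `toy_next_token` function is a hypothetical stand-in for a real model's forward pass (it just replays a fixed sentence), so the only thing this shows is the control flow: each generated token costs one full pass.

```python
from typing import Callable, List, Tuple

def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for one forward pass: replays a fixed sentence."""
    sentence = ["the", "quick", "brown", "fox", "<eos>"]
    return sentence[min(len(prefix), len(sentence) - 1)]

def autoregressive_decode(
    next_token: Callable[[List[str]], str], max_tokens: int
) -> Tuple[List[str], int]:
    """Generate tokens one at a time and count the forward passes used."""
    tokens: List[str] = []
    passes = 0
    while len(tokens) < max_tokens:
        tok = next_token(tokens)  # one full forward pass per new token
        passes += 1
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens, passes

tokens, passes = autoregressive_decode(toy_next_token, max_tokens=16)
print(tokens, passes)
```

Four tokens of output take five passes here (the fifth produces the end-of-sequence marker). On a real model, each of those passes has to stream the active weights through the GPU again.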

DiffusionGemma flips the primitive. It starts with a canvas of 256 random placeholder tokens and refines the entire block across multiple forward passes. Every token attends to every other token. The model sees the whole paragraph, not just the prefix, and can correct its own mistakes on later passes. That is bi-directional attention at the output stage, and it unlocks tasks autoregressive models are bad at: closing markdown tags, generating amino acid sequences, solving Sudoku (Unsloth already shipped a Sudoku fine-tune that took a couple of hours to train), infilling code blocks.
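The refine-the-whole-block idea can be sketched with a toy. This is not DiffusionGemma's sampler, which the article does not describe in detail. It assumes a `<mask>` placeholder instead of random tokens, and it uses a commit-the-most-confident-positions schedule, which is one common strategy in masked text diffusion. The `toy_denoiser` is hypothetical and cheats by reading the target sentence. Unlike the real model, this toy never revises a token once it is committed.

```python
import random
from typing import List, Tuple

MASK = "<mask>"
TARGET = "the model drafts a whole block and then refines it".split()

def toy_denoiser(canvas: List[str], rng: random.Random) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for one forward pass over the whole block.
    Returns a (proposed token, confidence) pair for every position.
    A real model would derive these from bidirectional attention."""
    return [(TARGET[i], rng.random()) for i in range(len(canvas))]

def refine_block(block_len: int, steps: int, seed: int = 0) -> Tuple[List[str], int]:
    """Fill a fully masked block in at most `steps` forward passes."""
    rng = random.Random(seed)
    canvas = [MASK] * block_len
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, rng)  # one pass scores every position
        passes += 1
        # Commit an even share of the still-masked positions, most confident first.
        k = -(-len(masked) // (steps - step))  # ceiling division
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
    return canvas, passes

canvas, passes = refine_block(block_len=len(TARGET), steps=3)
print(" ".join(canvas), passes)
```

The point of the toy is the shape of the loop: ten tokens are produced in three passes, and every pass looks at the entire block rather than only the prefix.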

The numbers:

Metric	DiffusionGemma	Gemma 4 26B A4B (autoregressive)
Architecture	26B MoE, 3.8B active, text diffusion	26B MoE, autoregressive
Tokens/sec on H100	1,000+ (batch 4-8, T=10)	~250 (typical local)
Tokens/sec on RTX 5090	700+	~180
VRAM (quantized)	18 GB	~24 GB
Output quality	Lower (Google's framing)	Higher
License	Apache 2.0	Gemma license

The 4x speedup is not a paper claim. Simon Willison measured ~500 tokens/second on NVIDIA's NIM-hosted preview: his pelican-riding-a-bicycle prompt returned 2,409 tokens in 4.4 seconds. On consumer hardware, where the 5090 ships today, you are looking at 700 tokens/second. That is interactive AI on a desktop GPU, in real time, with a model you can download.
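The figures above are easy to sanity-check. The short sketch below only redoes the arithmetic from the numbers quoted in this article; it assumes the 4.4 seconds is end-to-end wall-clock time for the whole response.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """End-to-end throughput: output tokens divided by wall-clock seconds."""
    return tokens / seconds

# Simon Willison's run: 2,409 tokens in 4.4 seconds.
willison = tokens_per_second(2409, 4.4)

# Speedup ratios from the table: diffusion tok/s over autoregressive tok/s.
h100_speedup = 1000 / 250
rtx5090_speedup = 700 / 180

print(f"Willison run: {willison:.1f} tok/s")
print(f"H100 speedup: {h100_speedup:.1f}x, RTX 5090 speedup: {rtx5090_speedup:.1f}x")
```

The hosted run works out to a little under 550 tokens/second, in line with the ~500 figure, and both hardware rows in the table land at roughly 4x.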

Why Local Inference Is the Actual Prize

Cloud inference is a margin business. Local inference is a platform business. The reason Apple, Microsoft, Google, and Meta are racing to put capable models on devices is not that they are worried about cloud revenue. It is that the next layer of the OS — the one that replaces touch, search bars, and the file system — runs models locally. DiffusionGemma is the first open-weights model genuinely designed for that layer.

The trade is explicit. The blog post says: "DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs." That is a real caveat. The model is for one user, one box, one prompt at a time. The on-device future is the target, and the target is right.

The Tooling Shipped on Day 0

The reason this release is consequential is the day-0 ecosystem. DiffusionGemma weights are on Hugging Face under Apache 2.0. The model is officially supported in vLLM, Hugging Face Transformers, MLX, Unsloth, NVIDIA NeMo, and Red Hat's optimized builds. NVIDIA is hosting the model for free on NIM. Google's own Hackable Diffusion JAX toolbox shipped alongside it. llama.cpp support is "arriving soon." That is the most productionized experimental release I have seen in a year.

The Unsloth Sudoku fine-tune is the canary. If a community can fine-tune a diffusion LLM on a non-toy task in hours, the architecture is tractable for the rest of the open-weights world.

The Quiet Threat to Closed Labs

Here is what nobody is saying out loud. If diffusion-based text generation reaches parity with autoregressive models on quality — and the current gap is small enough that fine-tuning closes most of it — the closed-frontier advantage of "we have more inference compute than you" evaporates. Diffusion inference scales differently: more FLOPs per token, less memory bandwidth. A 26B MoE diffusion model on a 5090 runs circles around a 70B dense autoregressive model on the same box, not because it has more parameters, but because the hardware utilization is fundamentally different.
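The "more FLOPs per token, less memory bandwidth" trade can be put in rough numbers. The sketch below assumes the T=10 in the table means ten refinement passes per 256-token block, and compares against batch-1 autoregressive decoding with a KV cache, counted as one weight read and one position computed per token. It ignores attention cost and the fact that a 256-position pass through an MoE touches more experts than a single token does, so treat the ratios as a back-of-the-envelope estimate.

```python
from typing import Dict

def diffusion_cost_per_token(block_len: int, passes: int) -> Dict[str, float]:
    """Per-token cost of block diffusion, relative to batch-1 autoregressive
    decoding (taken as 1.0 weight read and 1.0 position computed per token)."""
    return {
        # Each pass reads the weights once and is shared by the whole block.
        "weight_reads": passes / block_len,
        # Every position in the block is recomputed on every pass.
        "positions_computed": float(passes),
    }

cost = diffusion_cost_per_token(block_len=256, passes=10)
print(cost)
```

Under these assumptions, memory traffic per token falls by about 25x while compute per token rises by about 10x. That is one plausible reading of why the measured speedup is around 4x rather than 25x: the workload shifts from being limited by bandwidth toward being limited by compute.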

The closed labs know this. Google published its Gemini Diffusion research in May 2025 and did nothing public with it for a year. Yesterday they shipped it open, in the Gemma family, with the tooling, under Apache 2.0. That is a calculated move to set the architectural standard before the closed labs finish their internal diffusion research programs.

What To Do With It Today

If you build interactive AI products: download the weights, run them on a 5090, and measure end-to-end latency on your real prompt distribution. The 4x speedup compounds with the 18GB VRAM footprint, so you can ship a real product on a single consumer card.

If you are a researcher: the quality gap to Gemma 4 26B is the most interesting research problem in open weights right now. Fine-tune, distill, push.

If you are on the closed frontier: your "we have more compute" pitch just got a 4x speedup shock. Time to take text diffusion seriously.

If you are an enterprise buyer: wait three months for fine-tunes to mature, then pilot on a 5090 cluster for any latency-sensitive workload.
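For the latency measurement suggested to product builders above, a small harness is enough to start. This is a sketch, not a benchmark suite: `generate` is a hypothetical callable that you would wire to whatever runtime you use, and `fake_generate` exists only so the example runs without a model. Swap in your own prompts to reflect your real distribution.

```python
import statistics
import time
from typing import Callable, Dict, List

def measure(generate: Callable[[str], List[str]], prompts: List[str]) -> Dict[str, float]:
    """Time each prompt end to end and summarize latency and throughput."""
    latencies: List[float] = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        output_tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output_tokens)
    latencies.sort()
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "tokens_per_s": total_tokens / sum(latencies),
    }

def fake_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for a model call: sleeps, then echoes the prompt."""
    time.sleep(0.01)
    return prompt.split()

stats = measure(fake_generate, ["summarize this ticket", "close the markdown table"] * 5)
print(stats)
```

Reporting p95 alongside the median matters for interactive products, because the slow tail is what users notice. Aggregate tokens/second alone can hide it.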

The Take

DiffusionGemma is the most architecturally significant open-weights release of the year, and it is shipping in the same week as Claude Fable 5 and Apple AFM-3, which is why most of the industry is going to sleepwalk past it. That is a mistake. The benchmark scores are lower. The speed is 4x. The license is Apache 2.0. The hardware footprint fits a 5090. The tooling shipped on day 0. The architecture is the one that runs on the next billion devices, and Google just handed it to the open community for free.

The autoregressive LLM is not dead. It will dominate the cloud for years. But on a desktop, in a browser, on a phone, with one user and one prompt, diffusion just became the right primitive, and the only open model that implements it is the one Google shipped yesterday. Watch the closed labs scramble.

— Mr. Technology

Release date: June 10, 2026.
Model: DiffusionGemma (google/diffusiongemma-26B-A4B-it).
Architecture: 26B total parameters, 3.8B active, MoE, text diffusion (256-token blocks, bi-directional attention, iterative refinement).
Speed: 1,000+ tok/s on H100, 700+ on RTX 5090, ~500 tok/s on NIM preview, 18GB VRAM quantized.
License: Apache 2.0.
Tools: Hugging Face, vLLM, MLX, Unsloth, NVIDIA NeMo, Red Hat, llama.cpp (soon), Hackable Diffusion (JAX).
Cloud: NVIDIA NIM (free).
Quality: lower than Gemma 4 26B A4B on standard benchmarks; gap closes with fine-tuning.
Sources: Google blog, Simon Willison, The New Stack, Ars Technica, Hugging Face, Gemini Diffusion research.