LLM Release · 2026-06-16

DiffusionGemma 26B Is the First Real Architectural Shake-Up in Text Models Since the Transformer — and It's Open Weights

Google's open-weights diffusion LLM skips token-by-token autoregression: up to 4x faster, 1,000+ tok/s on a single H100, and 18 GB of VRAM for a quantized build. The benchmark numbers aren't great. The architectural bet is.

I have been waiting for this. Not "waiting for a new frontier model" — we get one of those every nine minutes — but waiting for someone to do something architecturally different with text generation. On June 10, 2026, Google DeepMind did it. They open-sourced a language model that does not generate text the way every other language model on the planet does. It is called DiffusionGemma 26B A4B, and you can download the weights right now from Hugging Face under an Apache 2.0 license.

The headline is speed: up to 4x faster than comparable autoregressive models, hitting 700+ tokens per second on a GeForce RTX 5090 and over 1,000 tokens per second on a single NVIDIA H100. That is not a marketing rounding error. That is the difference between "I am waiting for my AI" and "the AI is waiting for me." The architectural bet underneath it is the real story, though. Let me break it down.

What the Spec Sheet Actually Says

  • 25.2B total parameters, MoE architecture with 8 of 128 experts active per token
  • 3.8B active parameters at inference — the same active budget as a small dense model, sitting inside a 25B expert pool
  • 30 layers, 1024-token sliding window, 256K token context length
  • Canvas length: 256 tokens — that number matters, I'll come back to it
  • 262K vocabulary, ~550M vision encoder, multimodal (text + image inputs)
  • Native system prompt support, configurable "thinking mode" reasoning
  • 18 GB VRAM for a quantized local deployment
  • Apache 2.0 license — yes, you read that right

The "A4B" in the model name is short for "Activated 4 Billion." Same naming convention Qwen uses. It is the active parameter count that determines your inference cost, not the total. So this is essentially a 4B-active model that lives in a 25B MoE shell.
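The active/total split is easy to check with back-of-the-envelope arithmetic. The sketch below uses only the two parameter counts from the spec list; the bits-per-weight values are nominal format widths and are my assumption, and the estimate deliberately ignores quantization scale factors, the KV cache, activations, and the vision encoder, all of which add to the real footprint.

```python
# Rough weight-memory arithmetic from the spec sheet figures.
# Assumption: nominal bits per weight only. Real quantized formats carry
# extra scale metadata, and KV cache / activations are not counted here.

TOTAL_PARAMS = 25.2e9   # every expert has to be resident in memory
ACTIVE_PARAMS = 3.8e9   # parameters actually used for one token

BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "4-bit": 4}


def weight_gb(params: float, bits: int) -> float:
    """Size of `params` weights stored at `bits` bits each, in decimal GB."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    resident = weight_gb(TOTAL_PARAMS, bits)
    touched = weight_gb(ACTIVE_PARAMS, bits)
    print(f"{name:>5}: {resident:5.1f} GB resident, {touched:4.1f} GB touched per token")
```

At 4 bits the full expert pool comes to roughly 12.6 GB of weights, which is in the right neighborhood for an 18 GB deployment once the uncounted overhead is added back. The MoE design lowers the work done per token, not the memory you need to hold the model.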

The Bit That Actually Matters: It Is Not Autoregressive

Every LLM you have ever used — GPT, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek, GLM, Kimi, Grok — generates text one token at a time, left to right. Token 47 cannot be generated before token 46. That is the autoregressive constraint, and it is the reason your GPU spends 90% of inference time shuffling weights from VRAM instead of doing math. Memory bandwidth, not compute, is the bottleneck.

DiffusionGemma throws that out. Instead of predicting the next token, it starts with a 256-token canvas of random placeholder tokens and iteratively denoises the entire block in parallel across multiple refinement passes. High-confidence tokens lock in first and help resolve the positions around them; low-confidence ones get re-noised and re-tried. The whole block "snaps into focus" the way an image diffusion model resolves a picture from noise.
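To make the refinement loop concrete, here is a toy sketch in Python. It is not DeepMind's algorithm: `toy_denoiser` is a hypothetical stand-in that cheats by reading the target string, and the confidence rule (random, boosted by resolved neighbors and by later passes) and the 0.8 threshold are invented for illustration. What it does show is the control flow described above: propose every position at once, commit the confident ones, send the rest back to noise.

```python
import random

MASK = "_"  # placeholder for a position that is still noise (target must not contain it)


def toy_denoiser(canvas, target, step, rng):
    """Hypothetical stand-in for the model: one parallel pass over the canvas.

    Returns a (token, confidence) proposal for every position at once.
    It cheats by reading `target`; a real model would predict the tokens.
    """
    proposals = []
    for i, tok in enumerate(canvas):
        if tok != MASK:
            proposals.append((tok, 1.0))  # already resolved, keep it
            continue
        # Toy rule: resolved neighbours and later passes raise confidence.
        resolved_neighbours = sum(
            canvas[j] != MASK for j in (i - 1, i + 1) if 0 <= j < len(canvas)
        )
        confidence = min(1.0, rng.random() + 0.3 * resolved_neighbours + 0.02 * step)
        proposals.append((target[i], confidence))
    return proposals


def denoise(target, threshold=0.8, max_steps=64, seed=0):
    """Refine a fully masked canvas until every position is committed."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    for step in range(1, max_steps + 1):
        proposals = toy_denoiser(canvas, target, step, rng)
        # Commit confident positions; the rest go back to noise for the next pass.
        canvas = [tok if conf >= threshold else MASK for tok, conf in proposals]
        if MASK not in canvas:
            return "".join(canvas), step
    return "".join(canvas), max_steps


text, steps = denoise("diffusion resolves text in parallel")
print(f"{text!r} after {steps} passes")
```

The number of passes is far smaller than the number of characters, which is the whole point: the cost scales with refinement steps, not with sequence length.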

For sequences longer than 256 tokens, the model uses what DeepMind calls Block Autoregressive Diffusion: it commits a fully-denoised 256-token block to the KV cache, then starts a fresh canvas conditioned on the committed history. You get parallel speed inside each block and sequential stability across blocks. It is the best of both worlds, and frankly a much cleaner design than the 12 hand-wringing blog posts I have read about "speculative decoding" this year.
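The outer loop is simple enough to sketch. In the Python below, `generate`, `BlockDenoiser`, and `counting_denoiser` are hypothetical names of mine, the plain list stands in for the KV cache, and shrinking the final canvas to fit the remaining budget is an assumption (the real model may well keep a fixed 256-token canvas and pad). Only the 256-token canvas length comes from the spec sheet.

```python
from typing import Callable, List

CANVAS_LEN = 256  # canvas length from the spec sheet

# Hypothetical interface: (committed history, block length) -> denoised block
BlockDenoiser = Callable[[List[int], int], List[int]]


def generate(denoise_block: BlockDenoiser, prompt: List[int],
             max_new_tokens: int, canvas_len: int = CANVAS_LEN) -> List[int]:
    """Produce tokens block by block: parallel inside a block, sequential across."""
    committed = list(prompt)  # stands in for the KV cache of committed history
    remaining = max_new_tokens
    while remaining > 0:
        block_len = min(canvas_len, remaining)
        # Inside the block: all positions are refined together,
        # conditioned on the frozen history.
        block = denoise_block(committed, block_len)
        assert len(block) == block_len
        # Across blocks: strictly sequential. A committed block is never revised.
        committed.extend(block)
        remaining -= block_len
    return committed[len(prompt):]


def counting_denoiser(history: List[int], block_len: int) -> List[int]:
    """Hypothetical stub: 'denoises' to integers that continue the history."""
    start = history[-1] + 1 if history else 0
    return list(range(start, start + block_len))


out = generate(counting_denoiser, prompt=[0, 1, 2], max_new_tokens=600)
print(len(out), out[0], out[-1])  # 600 3 602
```

Six hundred tokens take three calls to the block denoiser (256 + 256 + 88) instead of six hundred sequential decoding steps, though each call hides several refinement passes of its own.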

The second-order effect is huge: diffusion generation is compute-bound, not memory-bound. Your tensor cores — the part of the GPU that is supposed to be doing math — finally get to do math. That is why the speedup is real and not a trick of the benchmark.
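A rough way to see the shift is to count FLOPs per byte of weights read. The sketch below is an idealized dense-matmul model, not a measurement: the bandwidth constant is an assumed figure you should replace with your own GPU's, and the function names are mine. The 3.8B active parameters and the 256-token canvas are from the spec sheet.

```python
def arithmetic_intensity(positions_per_pass: int, bytes_per_weight: float) -> float:
    """FLOPs done per byte of weights read in one forward pass.

    Each weight does one multiply and one add (2 FLOPs) per position,
    and is read from memory once per pass regardless of position count.
    """
    return 2.0 * positions_per_pass / bytes_per_weight


def autoregressive_ceiling(bandwidth_bytes_s: float, active_params: float,
                           bytes_per_weight: float) -> float:
    """Upper bound on batch-1 tokens/s if every token re-reads the active weights."""
    return bandwidth_bytes_s / (active_params * bytes_per_weight)


ASSUMED_BANDWIDTH = 3.35e12  # bytes/s; an assumption, swap in your GPU's figure
ACTIVE_PARAMS = 3.8e9        # from the spec sheet
BF16_BYTES = 2.0

print(f"one token per pass : {arithmetic_intensity(1, BF16_BYTES):.0f} FLOP/byte")
print(f"256-token canvas   : {arithmetic_intensity(256, BF16_BYTES):.0f} FLOP/byte")
cap = autoregressive_ceiling(ASSUMED_BANDWIDTH, ACTIVE_PARAMS, BF16_BYTES)
print(f"autoregressive cap : {cap:.0f} tok/s")
```

Treat both numbers as ceilings. With MoE routing, 256 positions touch more experts than a single token does, each block needs several refinement passes, and KV-cache traffic is ignored, so the real gain in intensity is smaller than the ideal 256x.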

What It Is Actually Good At — and Where It Falls Down

I am going to be honest with you, because you read me for honesty. The benchmark numbers are not frontier, and DeepMind is not pretending they are. Here is the side-by-side from the Hugging Face model card, instruction-tuned variants:

Benchmark              DiffusionGemma 26B A4B   Gemma 4 26B A4B
MMLU Pro               77.6%                    82.6%
AIME 2026 (no tools)   69.1%                    88.3%
LiveCodeBench v6       69.1%                    77.1%
Codeforces Elo         1429                     1718
GPQA Diamond           73.2%                    82.3%
MMMU Pro (vision)      54.3%                    73.8%
HLE (no tools)         11.0%                    8.7%

Read that right. On HLE (Humanity's Last Exam), the diffusion model beats the autoregressive baseline, 11.0% to 8.7%. That is the hard-reasoning benchmark that every frontier lab is currently failing, run here without tools. On every other row, the autoregressive Gemma 4 26B wins: by 5 to 20 points on the percentage benchmarks and by 289 Elo on Codeforces. If you are shopping for a model to write your code or take your GRE, you are not shopping for DiffusionGemma.
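If you want to check the gaps yourself, the arithmetic is a few lines of Python. The scores are copied from the comparison above; nothing else is assumed.

```python
# Scores copied from the comparison table: (DiffusionGemma, Gemma 4), in percent.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026 (no tools)": (69.1, 88.3),
    "LiveCodeBench v6": (69.1, 77.1),
    "GPQA Diamond": (73.2, 82.3),
    "MMMU Pro (vision)": (54.3, 73.8),
    "HLE no tools": (11.0, 8.7),
}
CODEFORCES_ELO = (1429, 1718)  # a rating, not a percentage

# Positive gap means the autoregressive Gemma 4 is ahead.
gaps = {name: round(ar - dg, 1) for name, (dg, ar) in SCORES.items()}
elo_gap = CODEFORCES_ELO[1] - CODEFORCES_ELO[0]

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    leader = "Gemma 4" if gap > 0 else "DiffusionGemma"
    print(f"{name:<22} {leader} ahead by {abs(gap):.1f} points")
print(f"{'Codeforces':<22} Gemma 4 ahead by {elo_gap} Elo")
```

The smallest deficit is MMLU Pro and the largest are AIME and MMMU Pro, so the gap widens on multi-step math and vision.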

What DiffusionGemma is good at is the stuff the benchmarks do not measure well: low-latency interactive loops, parallel infilling, bidirectional editing, and constrained generation. The Sudoku demo DeepMind shipped with the release is telling. The base model cannot solve a Sudoku after 48 diffusion steps. A fine-tuned version with adaptive early stopping solves it in 12 steps. That is the use case: tight inner loops where the model iterates on a structure with global constraints, not free-form essay generation.
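Here is what "iterate on a structure with global constraints" looks like in miniature. This Python sketch is a toy, not the Sudoku demo and not DeepMind's method: it fills a single nine-cell row rather than a full grid, a random sampler stands in for the model, and `fill_row` and `find_conflicts` are hypothetical names. It illustrates three ideas from the paragraph above: given cells stay clamped, only positions that violate the constraint are re-noised, and the loop stops as soon as nothing is left to fix.

```python
import random


def find_conflicts(row, givens):
    """Return (free cells that repeat a value, set of values already placed)."""
    seen = set(givens.values())  # clamped values always win a clash
    conflicted = []
    for i, value in enumerate(row):
        if i in givens:
            continue
        if value in seen:
            conflicted.append(i)
        else:
            seen.add(value)
    return conflicted, seen


def fill_row(givens, size=9, max_steps=48, seed=0):
    """Fill one Sudoku-style row so that 1..size each appear exactly once.

    givens maps position -> fixed value (assumed distinct and in range).
    Returns (row, number of re-noising passes used).
    """
    rng = random.Random(seed)
    values = list(range(1, size + 1))
    # Start from noise everywhere except the clamped cells.
    row = [givens.get(i, rng.choice(values)) for i in range(size)]
    for step in range(max_steps):
        conflicted, seen = find_conflicts(row, givens)
        if not conflicted:
            return row, step  # adaptive early stop: the constraint already holds
        missing = [v for v in values if v not in seen]
        for i in conflicted:
            row[i] = rng.choice(missing)  # re-noise only the offending cells
    return row, max_steps


row, passes = fill_row({0: 5, 4: 9, 8: 1})
print(row, "after", passes, "re-noising passes")
```

The step budget is 48, but the loop exits well before that because every pass settles at least one more cell. That is the appeal of early stopping: you pay for the passes the problem needs, not for a fixed schedule.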

If you build coding agents, autocomplete tools, or anything that feels more like a fast search than a slow oracle, pay attention to this. The HN thread lit up with developers saying the same thing: "a worse fast model can outperform a far better slow model if you value time."

What It Means for Builders

Three things you can do this week:

1. Pull the BF16 and NVFP4 checkpoints from Hugging Face. NVFP4 is the new 4-bit format NVIDIA is pushing on Blackwell. If you have a 5090, an H100, or a DGX Spark, this is a free local model with 256K context, at 700+ tok/s on the 5090 and 1,000+ tok/s on the H100 going by the headline numbers. The DGX Spark community is already reporting 158 tok/s sustained on the NVFP4 build.
2. Try the Hackable Diffusion fine-tuning recipe in the Gemma GitHub repo. JAX-based, modular, designed to be hacked. If you have ever wanted to fine-tune a diffusion language model on a constrained problem (legal contracts, schema-bound JSON, SQL with referential integrity), this is your starting point.
3. Stop assuming diffusion LLMs are a research curiosity. Mercury from Inception Labs has been doing this commercially for a year. DiffusionGemma is the open-weights inflection point. The architectural direction is real, the weights are real, and the inference cost story is real. The benchmark gap will close.

The Take

Most "revolutionary" LLM releases are a 3% gain on MMLU and a new system card. DiffusionGemma is the first release in a long time where the architecture itself is the news. Text generation has been stuck on autoregressive decoding since 2017 because nobody could make a non-autoregressive language model that actually worked at scale. DeepMind just shipped one that works at scale, runs on a single GPU, and has permissive licensing.

It is not going to dethrone Claude Fable 5 or Gemini 3.5 Pro on your coding benchmark. It does not need to. The point of a release like this is to open a frontier, not to win it. The next 18 months of LLM research are going to be a lot more interesting than the last 18, and the reason is sitting on Hugging Face right now in 256-token blocks, denoising its way into production.

Build something with it. The 18GB VRAM requirement is not a moat, it is an invitation.

Mr. Technology
