← Back to Payloads
ai2026-06-07

Run Gemma 4 on Your Laptop

Google's Gemma 4 QAT checkpoints fit a usable 4B model in 6.72 GB of VRAM; the 26B-A4B MoE fits 256K context in ~11 GB at Q4. Combined with the collapse in API pricing, local AI just became a credible default for privacy-sensitive and cost-sensitive workloads.
Quick Access
Install command
$ mrt install ai
Browse related skills
Run Gemma 4 on Your Laptop

Run Gemma 4 on Your Laptop

Google just made local AI a real option for the average developer laptop. The Agent Roundup's June 6, 2026 write-up covered it: Gemma 4's new Quantization-Aware Trained (QAT) checkpoints need only 6.72 GB of VRAM. That's not a workstation number. That's "the GPU that shipped in your Dell last year" territory.

What You Need to Know: Google's Gemma 4 family — including the 4B E4B IT and the 26B-A4B IT MoE — now ships QAT (quantization-aware trained) checkpoints that drop the runtime VRAM requirement to 6.72 GB for a usable 4B model and ~11 GB for the 26B MoE at Q4. For comparison, the equivalent full-precision Gemma 4 26B needs 50+ GB and a desktop GPU.

Why It Matters

  • Local LLM just crossed a usability threshold. A 4B QAT model that runs on a 6-year-old RTX 3060 with 8 GB VRAM changes the calculus for privacy-sensitive workloads, dev work, and anyone tired of paying per-token for trivial tasks.
  • The "AI is a commodity" thesis gets reinforced. Agent Roundup's editor Tobias makes the point bluntly: DeepSeek V4 Pro costs 28× less than Claude Opus 4.8 for "maybe 20% worse" output. When the gap to frontier shrinks that fast, owning the inference starts to make sense.
  • MoE helps, but QAT is the real unlock. The Gemma 4 26B-A4B IT is a Mixture-of-Experts model — only 4B active params per token, even though the model is 26B. That architecture, combined with Q4 quantization, fits a 256K-context model in ~11 GB VRAM.
  • Privacy and lock-in flip the cost equation. When the subsidized API price disappears (and it will), running a QAT model on hardware you already own is cheaper per token than renting from anyone. That's the on-prem calculus that has stayed niche — until now.

What Actually Happened

The Gemma 4 QAT release and what it actually buys you

Google's Gemma 4 lineup spans a few distinct sizes. The smallest, Gemma 4 E2B IT, is a 2B dense model that runs at "120+ tokens per second on an RTX 3060/4060, 100+ TPS on Apple M3 Max" per Made By Agents' reference model page. The mid-tier Gemma 4 E4B IT is 4B parameters, also dense, and runs "exceptionally accessible for consumer-grade hardware" with VRAM as the only real constraint. The flagship for local use is the Gemma 4 26B-A4B IT, a 26B-parameter MoE that activates only 4B per token, with a 256K context window and a 68.3 benchmark score (BB rating). At Q4 quantization, that's about 11 GB VRAM.

The new QAT checkpoints — the ones that put "Run Gemma 4 on Your Laptop" in reach — are quantization-aware trained rather than post-hoc quantized. That distinction matters: a model trained with quantization in the loop preserves accuracy at INT4 in a way that post-training quantization usually doesn't. The 6.72 GB VRAM number that The Agent Roundup highlighted is the practical lower bound: a QAT 4B at INT4 with KV cache.

The bigger picture: local AI is winning

The Agent Roundup's June 6, 2026 newsletter, "Run Gemma 4 on Your Laptop", pairs the Gemma 4 news with a longer argument: cheap local inference is the endgame. The newsletter's central claim is that AI inference has become a commodity — DeepSeek V4 Pro is "28 times less" than Claude Opus 4.8 for "maybe 20% worse" output, and the gap is closing. The author, Tobias, explicitly compares the moment to Bitcoin mining: whoever runs the cheapest compute wins. When investor subsidies on API pricing end, on-prem hardware pays off.

The Take

For most developers, the real question is no longer "can I run a useful LLM locally" — it's "what's the cheapest hardware I can put under my desk and forget about." A 6.72 GB VRAM floor means a refurbished RTX 3060 12GB (or an M-series Mac with 16+ GB unified memory) is now sufficient for serious work. The frontier model is still in the API, but for the long tail of "I want a private copilot that doesn't phone home," the answer in June 2026 is "buy a GPU once, run Gemma 4 forever."

Quick Summary

Google's Gemma 4 QAT checkpoints drop the runtime VRAM requirement to 6.72 GB for a 4B model. The MoE 26B-A4B fits 256K context in ~11 GB at Q4. Local AI is now a credible alternative to API-rent for the privacy-sensitive and the cost-sensitive.

Sources


Source: Newsletter | mr.technology — The Master Skill Index

Related Dispatches