
Google just made local AI a real option for the average developer laptop. The Agent Roundup's June 6, 2026 write-up covered it: Gemma 4's new Quantization-Aware Trained (QAT) checkpoints need only 6.72 GB of VRAM. That's not a workstation number. That's "the GPU that shipped in your Dell last year" territory.
What You Need to Know: Google's Gemma 4 family, including the 4B-parameter E4B IT and the 26B-A4B IT mixture-of-experts (MoE) model, now ships QAT checkpoints that drop the runtime VRAM requirement to 6.72 GB for a usable 4B model and about 11 GB for the 26B MoE at Q4. For comparison, the equivalent full-precision Gemma 4 26B needs 50+ GB and a desktop GPU.
Google's Gemma 4 lineup spans a few distinct sizes. The smallest, Gemma 4 E2B IT, is a 2B dense model that runs at "120+ tokens per second on an RTX 3060/4060, 100+ TPS on Apple M3 Max" per Made By Agents' reference model page. The mid-tier Gemma 4 E4B IT is a 4B-parameter model, also dense, that the same page describes as "exceptionally accessible for consumer-grade hardware", with VRAM as the only real constraint. The flagship for local use is the Gemma 4 26B-A4B IT, a 26B-parameter MoE that activates only 4B parameters per token, with a 256K context window and a 68.3 benchmark score (BB rating). At Q4 quantization, it needs about 11 GB of VRAM.
The new QAT checkpoints, the ones that put "Run Gemma 4 on Your Laptop" within reach, are quantization-aware trained rather than quantized after the fact. That distinction matters: a model trained with quantization in the loop learns weights that tolerate the reduced precision, so it preserves accuracy at INT4 in a way that post-training quantization usually doesn't. The 6.72 GB VRAM number that The Agent Roundup highlighted is the practical lower bound: a QAT 4B model at INT4, KV cache included.
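As a rough sanity check on these figures, the weights-only footprint of a model can be estimated from its parameter count and bits per parameter. The sketch below is a back-of-envelope estimate, not a measurement: the function names are illustrative, and the layer count, KV-head count, and head dimension passed to the KV cache helper are hypothetical placeholders, since the source gives no architecture details for Gemma 4. The source also does not break down the 6.72 GB figure, so the gap between the 4B weights-only estimate and 6.72 GB is presumably KV cache, activations, runtime overhead, and any layers kept at higher precision.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billions * bits_per_param / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Standard transformer KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# 26B parameters at 16 bits each: in line with the "50+ GB" full-precision figure.
print(weight_memory_gb(26, 16))  # 52.0

# 4B parameters at INT4: weights alone, before KV cache and overhead.
print(weight_memory_gb(4, 4))    # 2.0

# Hypothetical architecture (NOT Gemma 4's real configuration), 8K context, fp16 cache.
print(round(kv_cache_gb(layers=32, kv_heads=8, head_dim=128, context_tokens=8192), 2))  # 1.07
```

Because the KV cache grows linearly with context length, a long context window can end up costing more memory than the quantized weights themselves, which is why a single VRAM number is only meaningful alongside the context length it was measured at.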
The Agent Roundup's June 6, 2026 newsletter, "Run Gemma 4 on Your Laptop", pairs the Gemma 4 news with a longer argument: cheap local inference is the endgame. Its central claim is that AI inference has become a commodity: DeepSeek V4 Pro costs "28 times less" than Claude Opus 4.8 for "maybe 20% worse" output, and the gap is closing. The author, Tobias, explicitly compares the moment to Bitcoin mining, where whoever runs the cheapest compute wins. Once investor subsidies on API pricing end, the argument goes, on-prem hardware pays off.
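The "on-prem hardware pays off" argument comes down to a break-even calculation. The sketch below shows one simple way to frame it; the function name and every dollar amount are hypothetical placeholders for illustration, not figures from the newsletter, and the model ignores resale value, maintenance, and quality differences between local and hosted models.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      monthly_api_spend: float,
                      monthly_power_cost: float) -> Optional[float]:
    """Months until a one-time hardware purchase is repaid by avoided API spend.

    Returns None when running locally saves nothing per month,
    in which case the purchase never pays for itself.
    """
    monthly_savings = monthly_api_spend - monthly_power_cost
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# Placeholder inputs: a 300-unit GPU, 40 per month of API usage, 10 per month of electricity.
print(break_even_months(300.0, 40.0, 10.0))  # 10.0

# If electricity costs as much as the API bill, there is no break-even point.
print(break_even_months(300.0, 10.0, 10.0))  # None
```

The result is most sensitive to the monthly API spend, so the calculation favors local hardware most for heavy, steady workloads and least for occasional use.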
For most developers, the real question is no longer "can I run a useful LLM locally" — it's "what's the cheapest hardware I can put under my desk and forget about." A 6.72 GB VRAM floor means a refurbished RTX 3060 12GB (or an M-series Mac with 16+ GB unified memory) is now sufficient for serious work. The frontier model is still in the API, but for the long tail of "I want a private copilot that doesn't phone home," the answer in June 2026 is "buy a GPU once, run Gemma 4 forever."
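The hardware question can be framed as a simple fit check against the VRAM figures quoted in this article. The sketch below is illustrative only: the helper name and the `headroom_gb` parameter are assumptions, and the 50.0 entry is the lower bound of the "50+ GB" figure rather than an exact requirement.

```python
# VRAM requirements as reported in this article, in GB.
REPORTED_VRAM_GB = {
    "Gemma 4 E4B IT (QAT, INT4)": 6.72,
    "Gemma 4 26B-A4B IT (Q4)": 11.0,
    "Gemma 4 26B (full precision)": 50.0,  # "50+ GB": treated as a lower bound
}


def models_that_fit(gpu_vram_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Return the models whose reported requirement fits in the available VRAM.

    headroom_gb reserves memory for the OS, display, and other processes
    (an assumption; the right value depends on the machine).
    """
    budget = gpu_vram_gb - headroom_gb
    return [name for name, need in REPORTED_VRAM_GB.items() if need <= budget]


# A 12 GB card such as the RTX 3060 12GB, with nothing reserved.
print(models_that_fit(12))
# ['Gemma 4 E4B IT (QAT, INT4)', 'Gemma 4 26B-A4B IT (Q4)']

# The same card with 1.5 GB held back for the desktop and other processes.
print(models_that_fit(12, headroom_gb=1.5))
# ['Gemma 4 E4B IT (QAT, INT4)']
```

On a 12 GB card the 26B MoE fits only with little to spare, so the 4B QAT checkpoint is the safer default. On unified-memory machines the headroom matters more, because the operating system and the model share the same pool.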
Google's Gemma 4 QAT checkpoints drop the runtime VRAM requirement to 6.72 GB for a 4B model. The 26B-A4B MoE, which supports a 256K context window, needs about 11 GB at Q4. Local AI is now a credible alternative to renting API access for the privacy-sensitive and the cost-sensitive.
Source: Newsletter | mr.technology — The Master Skill Index