At WWDC 2026, Apple launched its third-generation Apple Foundation Models — headlined by a 20B sparse on-device LLM activating only 1–4B parameters per prompt via Instruction-Following Pruning.

Apple Just Shipped a 20B Sparse LLM That Runs on Your iPhone. The IFP Trick Is the Real Story.

Apple held WWDC 2026 on June 8. The headline was the Siri AI rebrand — a standalone Siri app, cross-app context, agentic tool use — and that is the wrong story. The real story is the third generation of Apple Foundation Models (AFM 3), anchored by a 20B sparse on-device LLM that fits in flash memory, activates only 1–4B parameters per prompt, and runs natively on iPhone silicon. Apple quietly put the first production-scale dynamic-sparse LLM in consumers' pockets, and the press treated it as a marketing event.

The Five-Model Lineup

Apple's research post names five models. Two on-device, three in Private Cloud Compute. Two sizes are disclosed; three are not.

AFM 3 Core — on-device, 3B dense, fast NLU and routing.
AFM 3 Core Advanced — on-device, 20B sparse (1–4B active per prompt), multimodal, powers the new Siri, dictation, TTS, image understanding.
AFM 3 Cloud — Private Cloud Compute on Apple silicon, main cloud text + image-understanding workhorse. Size undisclosed.
ADM 3 Cloud — Private Cloud Compute, image generation. Powers Image Playground, Reframe, Extend, Cleanup. Size undisclosed.
AFM 3 Cloud Pro — Private Cloud Compute on NVIDIA GPUs in Google Cloud, used for complex reasoning and agentic tool use. Size undisclosed.

Three cloud models with no disclosed parameter counts. Apple is treating capability as the product and architecture as the moat.

Instruction-Following Pruning Is the Trick

The interesting model is AFM 3 Core Advanced, and the interesting thing is not the 20B size. It is that 20B lives in flash and only 1–4B ever touches DRAM.
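To put rough numbers on the flash-versus-DRAM split, here is a back-of-envelope sketch. The 4-bit weight width is an assumption made purely for illustration (no quantization level is given here), and the helper name footprint_gib is made up for this example.

```python
def footprint_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a parameter count and bit-width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

BITS = 4  # assumption: illustrative bit-width, not a published figure

full_model = footprint_gib(20, BITS)   # what has to sit in flash
active_low = footprint_gib(1, BITS)    # smallest active subset in DRAM
active_high = footprint_gib(4, BITS)   # largest active subset in DRAM

print(f"full model in flash:   {full_model:.2f} GiB")
print(f"active subset in DRAM: {active_low:.2f} to {active_high:.2f} GiB")
print(f"flash-to-DRAM ratio at the high end: {full_model / active_high:.0f}x")
```

Whatever the real bit-width is, the ratio is the point: with 1 to 4B of 20B parameters active, DRAM holds between a twentieth and a fifth of the weights.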

The technique is Instruction-Following Pruning (IFP), originally published by Apple Research in January 2025 and presented at ICML 2025. A small dense predictor reads the prompt, decides which rows and columns of the feed-forward matrices are useful for that request, and patches the chosen "routed experts" into a dense model in DRAM alongside a fixed set of "shared experts" that are always active. The full 20B model sits in NAND; the active subset is loaded into DRAM only when needed.
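A toy sketch of that flow, under loud assumptions: the sizes are tiny, the weights and the predictor are random stand-ins rather than anything learned, the activation is a plain ReLU, and every name (select_units, load_active, ffn) is hypothetical. It only illustrates the shape of the idea: score once per prompt, copy the chosen rows and columns into a small dense layer, then reuse that layer for every token.

```python
import random

HIDDEN = 8          # toy model width
FFN_UNITS = 16      # feed-forward units in the full model (kept in "flash")
SHARED = 4          # units 0..3 are always active ("shared experts")
ROUTED_BUDGET = 4   # extra units chosen per prompt ("routed experts")

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# "Flash": unit j owns row j of W_IN and row j of W_OUT_T (a column of W_out).
W_IN = rand_matrix(FFN_UNITS, HIDDEN)
W_OUT_T = rand_matrix(FFN_UNITS, HIDDEN)
# Stand-in for the small dense predictor: one scoring vector per unit.
PREDICTOR = rand_matrix(FFN_UNITS, HIDDEN)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_units(prompt_embedding):
    """Run once per prompt: shared units plus the top-scoring routed units."""
    routed = range(SHARED, FFN_UNITS)
    ranked = sorted(routed, key=lambda j: dot(PREDICTOR[j], prompt_embedding),
                    reverse=True)
    return list(range(SHARED)) + sorted(ranked[:ROUTED_BUDGET])

def load_active(units):
    """Copy only the chosen rows/columns from 'flash' into a small dense FFN."""
    return [W_IN[j] for j in units], [W_OUT_T[j] for j in units]

def ffn(x, w_in, w_out_t):
    """Dense feed-forward pass over the active units only (ReLU for simplicity)."""
    hidden = [max(0.0, dot(row, x)) for row in w_in]
    return [sum(h * col[d] for h, col in zip(hidden, w_out_t))
            for d in range(HIDDEN)]

prompt_embedding = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
units = select_units(prompt_embedding)             # one routing decision per prompt
w_in, w_out_t = load_active(units)                 # one batched load
tokens = rand_matrix(5, HIDDEN)                    # five toy token vectors
outputs = [ffn(t, w_in, w_out_t) for t in tokens]  # all tokens reuse the same weights
```

After load_active runs, nothing in the per-token loop touches the full matrices. That is the property that lets the bulk of the model stay in flash.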

The published IFP result that matters: a 3B-activated configuration beat a 3B dense baseline by 5 to 8 absolute points on math and coding, and matched a 9B dense model while activating only 3B parameters per prompt. That is 9B-class quality inside a 3B active memory footprint, with the rest of the model staying in flash.

This is not classic MoE. Standard MoE picks K-of-N experts per token, so weights held in flash would have to be swapped token by token. IFP routes once per prompt and loads the selected experts in a single batch, which is what makes the flash-to-DRAM story viable. Research papers on dynamic sparsity go back years; what is new today is shipping it at production scale in a consumer device.
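The difference in flash traffic can be sketched with a simple counter. The assumptions are deliberately crude: DRAM holds only the currently active experts, and the per-token router picks experts uniformly at random (real routers are correlated from token to token, so this overstates the churn). The function names and the 64-expert, top-8 configuration are hypothetical.

```python
import random

def per_token_fetches(num_tokens, num_experts, k, seed=0):
    """Count expert loads when every token picks its own k experts.

    Only the previous token's experts are assumed to be resident in DRAM,
    so any newly chosen expert costs one fetch from flash.
    """
    rng = random.Random(seed)
    resident = set()
    fetches = 0
    for _ in range(num_tokens):
        chosen = set(rng.sample(range(num_experts), k))
        fetches += len(chosen - resident)
        resident = chosen
    return fetches

def per_prompt_fetches(num_experts, k, seed=0):
    """Count expert loads when k experts are chosen once for the whole prompt."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(num_experts), k))
    return len(chosen)

token_level = per_token_fetches(num_tokens=200, num_experts=64, k=8)
prompt_level = per_prompt_fetches(num_experts=64, k=8)
print(f"per-token routing:  {token_level} expert fetches over 200 tokens")
print(f"per-prompt routing: {prompt_level} expert fetches, reused for every token")
```

Per-prompt routing pays the load cost once, however long the response runs. Per-token routing pays again whenever the chosen set changes.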

What Apple does not claim: no published comparison against GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, Qwen 3.7, or Llama 4. Every benchmark in the research post is side-by-side against Apple's 2025 baseline. Treat the eval numbers as generational evidence, not a competitive ranking.

What Apple Published

Side-by-side blind human preference, verbatim from the research post:

Text, AFM 3 Core: preferred 45.6% of prompts vs 23.3% for the 2025 baseline.
Text, AFM 3 Cloud: 64.7% vs 8.7%.
Dictation, AFM 3 Core Advanced: 44.7% vs 17.6%.
AFM 3 Cloud Pro adds +10% relative preference on text, +14% on math, +14% on image understanding over Cloud.
On-device TTS MOS: 3.87 → 4.15 (general), 3.82 → 4.24 (conversational).
Cloud response satisfaction: ~36% relative improvement over the 2025 AFM Server model.
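A quick way to read the side-by-side figures above is to split them into "no preference" and "win share among decided prompts". This sketch assumes the percentages not accounted for are ties or no-preference votes, which is not stated above, so treat the derived numbers as illustrative. The function name summarize is made up.

```python
def summarize(new_pct, old_pct):
    """Return (undecided share, new model's share of decided prompts), in percent."""
    undecided = 100.0 - new_pct - old_pct
    decided_win = 100.0 * new_pct / (new_pct + old_pct)
    return round(undecided, 1), round(decided_win, 1)

results = {
    "Text, AFM 3 Core": (45.6, 23.3),
    "Text, AFM 3 Cloud": (64.7, 8.7),
    "Dictation, AFM 3 Core Advanced": (44.7, 17.6),
}

for name, (new, old) in results.items():
    undecided, win = summarize(new, old)
    print(f"{name}: {undecided}% undecided, {win}% of decided prompts prefer the new model")

# Absolute MOS gains for on-device TTS, taken from the figures above.
mos_general = round(4.15 - 3.87, 2)
mos_conversational = round(4.24 - 3.82, 2)
print(f"TTS MOS gain: +{mos_general} general, +{mos_conversational} conversational")
```

Under that assumption, roughly a quarter to more than a third of prompts show no preference either way, which is worth keeping in mind when reading the headline percentages.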

No third-party benchmarks: no MMLU, no SWE-Bench, no GPQA. Side-by-side preference is a loose signal for technical work. It tells you "the human liked the answer better," not "the code compiled."

The Apple-Google-Gemini Question

Craig Federighi told 9to5Mac: "The amount of the Google Assistant we use is none." Amar Subramanya, Apple AI VP, told CNBC: "All of these are custom builds for Apple Silicon, trained using proprietary data, and refined using outputs from Gemini frontier models."

Both are true. The model is Apple's. The post-training signal is Gemini's. Apple is not running Gemini in production; it is using Gemini outputs the same way everyone in 2026 uses frontier models, as a teacher signal in a distillation-style post-training loop. Apple is the largest distribution channel to publicly admit the dependency.

Private Cloud Compute Now Runs on NVIDIA in Google Cloud

The 2024 version of Private Cloud Compute was a privacy story built on Apple Silicon servers with cryptographic attestation. The 2026 version extends PCC to NVIDIA GPUs hosted in Google Cloud, with Apple claiming the same data-handling guarantees. Reporting suggests Apple tried Cloud Pro on its own PCC hardware first and the model was too slow. NVIDIA capacity on Google Cloud was the path that shipped.

The engineering substance is the cryptographic attestation chain, not the geographic location of the GPUs. Moving the substrate to NVIDIA-in-GCP does not break the attestation model — but the trust boundary now spans more vendors than the 2024 version.
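Why location drops out of the argument can be shown with a deliberately simplified check. This is not Apple's protocol: the allowlist, the key, and the function names are hypothetical, and HMAC stands in for the asymmetric, hardware-rooted signatures a real attestation chain would use.

```python
import hashlib
import hmac

# Hypothetical allowlist: digests of software images that were published for audit.
PUBLISHED_MEASUREMENTS = {hashlib.sha256(b"pcc-release-1").hexdigest()}

def sign(key: bytes, measurement: str) -> str:
    """Stand-in signature over a software measurement (HMAC, for illustration only)."""
    return hmac.new(key, measurement.encode(), hashlib.sha256).hexdigest()

def node_is_trusted(measurement: str, signature: str, hardware_key: bytes) -> bool:
    """Accept a node only if its measurement is signed and publicly listed.

    Note what is NOT an input: the vendor or data center the node runs in.
    """
    expected = sign(hardware_key, measurement)
    signed_ok = hmac.compare_digest(expected, signature)
    return signed_ok and measurement in PUBLISHED_MEASUREMENTS

key = b"hypothetical-hardware-root-key"
good = hashlib.sha256(b"pcc-release-1").hexdigest()
unknown = hashlib.sha256(b"unreviewed-build").hexdigest()

print(node_is_trusted(good, sign(key, good), key))           # listed and signed
print(node_is_trusted(unknown, sign(key, unknown), key))     # signed but not listed
print(node_is_trusted(good, sign(b"other-key", good), key))  # listed but bad signature
```

The wider trust boundary shows up in the inputs. Every additional hardware vendor is another root key that has to be trusted to produce honest measurements.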

What Developers Get

The Foundation Models framework is the under-covered part. Introduced in iOS 26, it is the Swift API that gives any third-party app direct access to the on-device model — no API key, no network, no per-token cost. The 2026 update adds image input: developers can pass images alongside text for captioning, structured extraction, UI element classification, all without a cloud round-trip.

The realistic pattern for fall 2026 is a hybrid: Foundation Models for fast, free, offline work; a cloud model for everything that needs frontier reasoning, long context, or fresh world knowledge. The 3B on-device model is not a ChatGPT replacement. It is a free, private, offline inference layer for product features that would otherwise require an API call.
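In code, the hybrid pattern is mostly a routing decision. The sketch below is language-agnostic logic written in Python rather than the Swift framework API. The Request fields, the 4096-token threshold, and the route function are all assumptions for illustration, not documented limits.

```python
from dataclasses import dataclass

ON_DEVICE_CONTEXT_LIMIT = 4096  # hypothetical threshold, not a documented limit

@dataclass
class Request:
    prompt_tokens: int
    needs_fresh_knowledge: bool = False
    needs_deep_reasoning: bool = False
    offline: bool = False

def route(req: Request) -> str:
    """Pick a backend: on-device by default, cloud only when the task demands it."""
    if req.offline:
        # No network: the on-device model is the only option, best effort.
        return "on_device"
    if req.needs_fresh_knowledge or req.needs_deep_reasoning:
        return "cloud"
    if req.prompt_tokens > ON_DEVICE_CONTEXT_LIMIT:
        return "cloud"
    return "on_device"

print(route(Request(prompt_tokens=300)))                             # short, local task
print(route(Request(prompt_tokens=300, needs_deep_reasoning=True)))  # frontier reasoning
print(route(Request(prompt_tokens=20000)))                           # long context
print(route(Request(prompt_tokens=20000, offline=True)))             # no network
```

Defaulting to on-device and escalating only on explicit signals keeps the free, private path as the common case and makes the paid calls easy to audit.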

The Geography Problem

Siri AI is not launching in the EU on iPhone or iPad, and not in mainland China on any device. Mac, Apple Watch, and Vision Pro get it in the EU. The DMA's interoperability requirements for designated gatekeeper AI features are the official reason; the practical reason is that Apple has chosen not to ship third-party AI assistant access at parity with Siri, and the EU could fine it up to 10% of global annual revenue for non-compliance. Apple's share price dropped close to 2% the day of the announcement.

The Take

The 20B sparse on-device model with Instruction-Following Pruning is the most consequential LLM architecture to ship in the last seven days. Not a benchmark win — Apple did not publish a benchmark win — but a deployment win. The first production-scale dynamic-sparse LLM in a consumer device is, in the long run, more important than another 0.5% on MMLU. The on-device path is not a research curiosity; it is a shipping product with a 20B ceiling and a sane battery profile.

The Gemini refinement dependency is the part I am watching. The moment Gemini's outputs become a strategic bottleneck — pricing, rate limits, output quality — Apple will either need a frontier teacher signal in-house or accept a long-term dependency on a direct competitor in mobile. Neither is comfortable.

If you ship iOS apps, build the offline layer of your product against the Foundation Models framework this quarter. Image input, structured Swift output, tool calling, zero per-token cost. The 3B on-device model is not going to win coding benchmarks. It is going to replace an enormous amount of expensive cloud inference in shipped apps by Q1 2027.

If you do not ship iOS apps, the headline is still worth knowing. Sparse activation in flash is a template. The next 18 months of "on-device AI" roadmaps from Qualcomm, MediaTek, and Samsung are going to look like Apple's playbook. Apple just changed what an on-device model is, and the implications are not going to land until 2027.

— Mr. Technology

Released: June 8, 2026, at WWDC 2026.
Models: AFM 3 Core (3B dense on-device), AFM 3 Core Advanced (20B sparse, 1–4B active, IFP-routed), AFM 3 Cloud (PCC, Apple Silicon, size undisclosed), ADM 3 Cloud (PCC image gen, size undisclosed), AFM 3 Cloud Pro (PCC on NVIDIA-in-GCP, size undisclosed).
Training: AXLearn framework, latest-gen cloud TPUs, multi-stage RL post-training, refined using outputs from Google's Gemini frontier models.
Availability: US English later this year; EU on Mac/Watch/Vision Pro only; not in mainland China at launch.
Foundation Models framework gains image input.
Sources: Apple ML Research — Third Generation AFM, Apple Security — Expanding PCC, ICML 2025 — IFP paper, 9to5Mac on the Apple-Google collaboration, CNBC on Apple-Google-NVIDIA, MacRumors on EU/China launch carve-out, TechCrunch WWDC 2026 recap, ofox.ai developer read on AFM 3.