AI Models | 2026-06-01

Llama 4 Behemoth Finally Shipped and That's Not Actually the News

Meta dropped the 2-trillion parameter Behemoth on May 27 — 14 months after previewing it. The 2T is the headline, but the distillation cascade to Scout and Maverick is the release that actually moves the open-weight ecosystem. Closed labs are about to have an uncomfortable pricing conversation.

Meta finally released Llama 4 Behemoth on May 27, 2026 — 14 months after they previewed it alongside Scout and Maverick. The 2-trillion parameter model, with 288 billion active parameters per token via mixture of experts, is the largest openly distributed language model ever released. It's the kind of announcement that writes its own headline. It's also not the part of the release that actually matters.

The news isn't that Behemoth exists. The news is what 14 months of distillation from a 2T teacher did to Scout and Maverick, and what that implies for the open-weight ecosystem at large.

The Architecture: A Quick Refresher

Behemoth is a mixture-of-experts model with 16 expert FFN blocks per layer and 2 experts active per token. The 288B active footprint sits inside a 2T total — meaning roughly 14% of the network is engaged for any given forward pass. Pre-training ran in FP8 on Meta's H100 clusters, consuming an estimated 22 trillion tokens across 38 languages. The text-and-image multimodal training was done jointly from the start, not bolted on with a CLIP-style adapter after the fact.
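To make the routing arithmetic concrete, here is a minimal sketch of top-2 gating over 16 experts, plus the active-fraction calculation. The function names and router scores are made up for illustration, and the softmax-over-selected-experts weighting is a common convention that the article does not confirm for Behemoth.

```python
import math

NUM_EXPERTS = 16   # experts per MoE layer, per the article
TOP_K = 2          # experts active per token, per the article


def top_k_route(router_logits, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over only the selected logits, so they sum to 1.
    This weighting scheme is an assumption, not a documented Behemoth detail.
    """
    if len(router_logits) < k:
        raise ValueError("need at least k router logits")
    # Indices of the k largest logits, highest first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def active_fraction(active_params, total_params):
    """Share of all parameters used in one forward pass."""
    return active_params / total_params


if __name__ == "__main__":
    logits = [0.0] * NUM_EXPERTS
    logits[3], logits[11] = 2.0, 1.0   # hypothetical router scores
    print(top_k_route(logits))
    print(f"{active_fraction(288e9, 2e12):.1%}")
```

Two of 16 experts is 12.5% of the expert weights. The quoted figure of roughly 14% is a little higher, which is consistent with attention and other shared weights running for every token regardless of routing.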

That distinction matters. Late-fusion multimodal models, where you add a vision encoder to a text-only base, produce a system that talks about images. Early-fusion models like Behemoth produce one that thinks with images. Text and image tokens share a single representational space; the model doesn't translate between modalities at inference time.

The context length is 10 million tokens via Meta's iRoPE scheme, which interleaves rotary position embeddings across local and global attention windows. If you've been waiting for a frontier model that can ingest an entire monorepo or a multi-week code review history without chunking — this is the first one that doesn't lie about the limit in the marketing copy.
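The article does not spell out how iRoPE interleaves its layers, so the sketch below only shows the building block: plain rotary position embedding (RoPE) applied to one vector. The `base` value of 10000 is the usual default from the RoPE literature, not a confirmed Behemoth setting, and nothing here models the local/global interleaving.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of an even-length vector by position-dependent angles."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair j = i / 2 uses frequency base^(-2j/d), which equals base^(-i/d).
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.1, 0.4, -0.6, 0.9]
    # Same offset (2) at very different absolute positions.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 102), rope(k, 100)))
```

The property that matters for long context is that the query-key score depends only on the distance between the two positions, not on where they sit in the sequence, which is why the two printed values agree up to floating-point error.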

The Benchmarks Are Fine. The Distillation Is the Story.

Here's what the tutorials won't tell you: nobody runs Behemoth in production. 2T parameters means roughly 2 TB of weights in FP8 just to load the model, about three times the 640 GB of HBM on a fully populated 8xH100 node with 80 GB cards, and the 288B active parameters still mean roughly 288 GB of weights read per token. Inference throughput is measured in tens of tokens per second per user. Behemoth is a teacher, not a product.
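As a back-of-the-envelope check on weight memory, the sketch below multiplies parameter count by bytes per parameter, with FP8 at one byte each. The helper names are invented for illustration, and the 80 GB per-GPU default assumes the common H100 and A100 80 GB variants.

```python
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2, "fp32": 4}


def weight_gb(num_params, dtype="fp8"):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_node(num_params, dtype="fp8", gpus=8, gb_per_gpu=80):
    """True if the weights alone fit in the node's combined GPU memory.

    Ignores KV cache, activations, and framework overhead, so a True
    result is necessary but not sufficient for actually serving the model.
    """
    return weight_gb(num_params, dtype) <= gpus * gb_per_gpu


if __name__ == "__main__":
    for name, params in [("Behemoth total", 2e12),
                         ("Behemoth active", 288e9),
                         ("Scout total", 109e9)]:
        print(name, weight_gb(params), "GB in FP8,",
              "fits on 8x80GB:", fits_on_node(params))
```

Under these assumptions the full 2T parameters need about 2,000 GB in FP8, the 288B active parameters account for about 288 GB of that per token, and Scout's 109B total fits on a single eight-GPU node with room left for the KV cache.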

The real release is the May 27 refresh of Llama 4 Scout — same 17B active / 109B total MoE shape as the original, but now distilled from Behemoth across roughly 500 billion tokens of synthetic data generated at the teacher's temperature. The distillation corpus includes the standard chat-style preference data, but it also includes hard negative mining, code execution traces, and tool-call traces that the teacher generated with explicit chain-of-thought the student can compress.
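For readers who haven't looked at distillation up close, here is a minimal sketch of the classic temperature-scaled KL loss between teacher and student logits. This is the generic textbook formulation, not Meta's recipe: the article describes training on teacher-generated synthetic data, and the actual loss, temperature, and data mix are not public here. All names and values are illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5, -2.0]   # hypothetical next-token logits
    student = [2.5, 1.5, 0.0, -1.0]
    print(distillation_loss(teacher, student))
```

A higher temperature flattens the teacher's distribution so the student also learns from the relative ranking of the wrong answers, which is the information a hard label throws away.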

The numbers move. Scout jumps from 78% to 86% on MMLU-Pro, and HumanEval+ goes from 67% to 79%. Maverick's scores shift similarly across the reasoning and code benchmarks. Behemoth itself lands at 92% on GPQA Diamond and 89% on MATH-500, slightly behind Claude Opus 4.8 and GPT-5.5 Pro on raw capability, but the gap matters less than the cost ratio. Maverick can produce 100 tokens for what it costs Behemoth to produce 1, and Scout is another order of magnitude cheaper than that.

The Open-Weights Implication

The honest take: Meta just forced every closed lab to re-price inference. When a 17B-active open-weight model with 79% on HumanEval+ is downloadable from Hugging Face under a permissive license, the $25 per million output tokens that Opus 4.8 charges starts to look like a tax on ignorance rather than a premium on quality. The Scout refresh isn't a new model — it's an existence proof that frontier capability is now distillable into a footprint you can serve on a single 8xA100 node.

The counterargument is the same one we heard when Llama 3 70B dropped in early 2024: the open weights are lagging, the safety tuning is thin, the license restricts commercial use for the largest variants. All true. Also increasingly irrelevant as the gap shrinks with each distillation cycle.

The real question for 2026 isn't "open vs closed" — it's "are you paying 10x for a 5% benchmark gap, and why?"
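That question is easy to turn into arithmetic for your own workload. The sketch below is one rough way to frame it; the function names are invented, and apart from the $25 per million output tokens quoted above, every number in the usage example is a placeholder to replace with your own prices and eval scores.

```python
def price_multiple(closed_price, open_price):
    """How many times more the closed model costs per token."""
    return closed_price / open_price


def relative_gap(closed_score, open_score):
    """Benchmark gap as a fraction of the closed model's score."""
    return (closed_score - open_score) / closed_score


def monthly_premium(closed_price, open_price, output_tokens_per_month):
    """Extra dollars per month; prices are USD per million output tokens."""
    return (closed_price - open_price) * output_tokens_per_month / 1_000_000


if __name__ == "__main__":
    closed, hosted_open = 25.0, 2.5   # 2.5 is a hypothetical open-weight price
    print(price_multiple(closed, hosted_open))
    print(relative_gap(100.0, 95.0))  # placeholder scores, not real results
    print(monthly_premium(closed, hosted_open, 1_000_000_000))
```

The point of separating the three numbers is that the price multiple and the gap are fixed by the vendors, while the monthly premium scales with your volume, and that is the figure a finance team will actually ask about.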

The Things I Don't Love

The Behemoth license restricts use by companies with more than 700 million monthly active users — a clause aimed squarely at preventing a hyperscaler from grabbing the weights and undercutting Meta's own inference business. That's defensible from a corporate strategy angle. From a research angle, it's the same lock-in pattern as Llama 3, and it means the most interesting frontier open model still doesn't get fine-tuned by the teams that might extract the most value from it.

The iRoPE 10M context is also — let's be honest — a marketing number as much as a capability. Behemoth performs well on needle-in-a-haystack evaluations at 10M tokens. Real long-context workloads (multi-document reasoning, code review across large diffs, agent loops with substantial tool history) still degrade in the 1–2M token range where most of the attention weight has to actually generalize to unfamiliar structure. The 10M figure is true. It's also not where the model is useful. Scout tops out at 1M and Maverick at 512K, which is plenty for the workloads people actually run.

The other limitation: Behemoth's release doesn't include the training data, the training code, or the eval suite used to validate it. Meta released the weights and a 40-page system card. You cannot reproduce this model. You can fine-tune it, you can distill from it, you can serve it — but you cannot learn from how it was built, which is the part the research community actually needs.

The Take

Llama 4 Behemoth is the model Meta needed to release to make Scout and Maverick competitive, and they did it. The next six months will be interesting, because every open-weight lab is going to distill from this thing, and the closed labs are going to have to decide whether to keep charging 10x for what is rapidly becoming a 5% capability premium.

If you're shipping a product, the Scout refresh is the model to evaluate this week. If you're training a fine-tune, Behemoth is the distillation teacher to study. If you're buying inference at scale, use the existence of this release as leverage in your next vendor negotiation.

The news isn't the 2 trillion parameters. The news is that they don't matter anymore except as the source of a much smaller, much cheaper thing that does.


Llama 4 Behemoth released May 27, 2026 by Meta. 2T total / 288B active parameters, 2-of-16 MoE routing, 10M token context via iRoPE, 38 languages, native multimodal pre-training. Scout and Maverick refreshes distilled from Behemoth ship alongside. License: Llama 4 Community (700M MAU restriction applies to Behemoth). Weights on Hugging Face, llama.com, AWS Bedrock, and Azure AI Foundry.
