NVIDIA's Nemotron 3 Ultra dropped on June 4, 2026 — a 550B-parameter MoE with 55B active, 48 points on the Artificial Analysis intelligence index, and a 5x throughput lead over the rest of the open-weights field. The architecture (hybrid Mamba-Transformer, LatentMoE, NVFP4, MTP) is the most interesting American open release of the year, and it lands at the moment when the open-weights business model is being written off by everyone else.

NVIDIA Shipped the Smartest Open US Model, and the 5x Inference Lead Is the Real Story

NVIDIA released Nemotron 3 Ultra on June 4, 2026: a 550B-parameter Mixture-of-Experts model with 55B active per token, released under the Linux Foundation's new OpenMDW-1.1 license and available on day one on Hugging Face, OpenRouter, DeepInfra, Perplexity, SageMaker JumpStart, Google Cloud, Microsoft Foundry, and Oracle Cloud. On the Artificial Analysis intelligence index it scores 48, the highest of any open US model, and on DeepInfra it sustains more than 300 tokens per second. Comparably sized open models from DeepSeek and Moonshot manage 50 to 100 tokens per second. That is not a benchmark gap. That is a generational gap, and it is the actual story.

The press will frame this as "NVIDIA joins the open-weights race." That framing is wrong. NVIDIA did not join a race. NVIDIA already owns the only race that mattered (GPUs), and it just used that position to ship a frontier-scale open model with an inference profile nobody in the open-weights ecosystem can match. This is not just a model release. It is a thesis about how open weights compete with closed APIs in 2026 and 2027.

The Architecture Is The Story, Not The Size

A 550B/55B MoE is table stakes for a frontier release. The interesting decisions are what NVIDIA did to make a model that size run fast and run long.

Hybrid Mamba-Transformer layers. Ultra interleaves Mamba state-space layers with standard self-attention. Mamba handles long-context workloads in linear time; attention preserves precise recall where it is needed. For a 1M-token context window (Ultra hits 95% on RULER@1M) this is the difference between a model that can read a real codebase and a model that can pretend to read one. Nemotron 3 Ultra quietly added a state-space backbone the rest of the open-weights tier is still treating as research.
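To make the linear-versus-quadratic point concrete, here is a toy sketch in Python. It is not Mamba's selective scan and not NVIDIA's code: `ssm_scan` is a hypothetical scalar recurrence with made-up coefficients, and the two counting helpers only compare how much work each layer type does as the context grows.

```python
# Toy comparison, not Mamba itself: a scalar linear recurrence carries a
# fixed-size state, while causal attention scores every earlier token again.

def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t over the sequence.

    One update per token, so work grows linearly with len(xs) and the
    carried state stays a single number regardless of context length.
    The coefficients a, b, c are illustrative values, not learned weights.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pair_count(n):
    """Query-key pairs scored by causal self-attention over n tokens."""
    return n * (n + 1) // 2


def recurrence_step_count(n):
    """State updates performed by the recurrence over n tokens."""
    return n


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        ratio = attention_pair_count(n) / recurrence_step_count(n)
        print(f"{n:>9} tokens: attention scores {ratio:,.1f}x more pairs than the scan has steps")
```

The gap widens with context length, which is why a hybrid keeps attention for a minority of layers and lets the recurrent layers carry most of the sequence.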

LatentMoE routing. LatentMoE does the routing in a learned latent space, so the model handles workflows that mix reasoning, code, tool calls, and domain logic without collapsing onto the same handful of experts. Quality holds across the actual workflows agents run, not just on the benchmarks those workflows get reduced to.
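The launch material only says that routing happens in a learned latent space, so the following is a minimal sketch of what that could look like, not NVIDIA's implementation. The projection matrix `w`, the per-expert `expert_keys`, the top-k choice, and the softmax over the selected scores are all illustrative assumptions.

```python
import math


def project(token, w):
    """Map a token vector into the (smaller) latent space: z = W @ token."""
    return [sum(wi * ti for wi, ti in zip(row, token)) for row in w]


def route(token, w, expert_keys, k=2):
    """Pick the top-k experts by dot product in latent space.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    Every name and shape here is hypothetical.
    """
    z = project(token, w)
    scores = [sum(zi * ki for zi, ki in zip(z, key)) for key in expert_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected experts only, shifted for numerical stability.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    w = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]                    # 4-d token -> 2-d latent
    expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(route([2.0, 1.0, 5.0, 5.0], w, expert_keys))
```

Scoring in a low-dimensional latent rather than on the raw hidden state is one plausible way to keep routing cheap while spreading mixed workloads across more experts.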

Multi-Token Prediction and NVFP4. MTP predicts multiple future tokens in a single forward pass, compounding the speedup on long agent outputs. NVFP4 is the move only NVIDIA could make — a 4-bit floating-point format with specialized kernels that runs the same checkpoint on Hopper, Blackwell, and Ampere without re-quantization, delivering up to 5x higher throughput per GPU on Blackwell at the same interactivity compared to BF16. The company that sells the picks and shovels is using its own model as the proof.
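For readers who have not worked with 4-bit floats, here is a rough sketch of block-scaled FP4 quantization. The eight magnitudes are the values an E2M1 float (1 sign, 2 exponent, 1 mantissa bit) can represent. The block size of 16 and the full-precision per-block scale are simplifying assumptions for illustration; the real format and its kernels are more involved, and none of these function names come from an NVIDIA library.

```python
# Magnitudes representable by a 4-bit E2M1 float.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size for this sketch


def quantize_block(values):
    """Quantize one block: return (scale, codes), codes being signed grid values."""
    max_abs = max(abs(v) for v in values)
    # Scale so the largest magnitude in the block lands on the top grid value.
    scale = max_abs / FP4_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from one quantized block."""
    return [scale * c for c in codes]


def quantize(values, block=BLOCK):
    """Split a flat list of weights into blocks and quantize each one."""
    return [quantize_block(values[i:i + block])
            for i in range(0, len(values), block)]


if __name__ == "__main__":
    scale, codes = quantize_block([0.1, -0.7, 2.9, 6.0])
    print(scale, codes, dequantize_block(scale, codes))
```

Each weight costs 4 bits plus a small share of the block scale, which is where the memory and bandwidth savings over BF16 come from.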

Post-Trained For Agents, Not Chat

The training pipeline is the more important piece. Ultra was post-trained using Multi-Teacher On-Policy Distillation (MOPD) — more than 10 specialized teacher models, each with its own domain pipeline, scoring the student in their area of expertise while the student generates its own rollouts asynchronously. The result is a model that holds up across agent harnesses, not a model that benchmarks well on the harness the lab trained it on.
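The mechanics of MOPD are not spelled out beyond that description, so treat the following as a generic on-policy distillation sketch rather than NVIDIA's objective. It assumes each rollout is tagged with a domain, that the matching teacher exposes a next-token distribution for any prefix, and that the per-token signal is a reverse KL divergence; `reverse_kl`, `score_rollout`, and the `teachers` mapping are hypothetical names.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) for one next-token distribution.

    Assumes both are probability lists over the same vocabulary and that
    the teacher assigns non-zero probability wherever the student does.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def score_rollout(domain, steps, teachers):
    """Average per-token divergence for a rollout the student generated itself.

    steps: list of (prefix, student_probs) pairs taken from the student's rollout.
    teachers: dict mapping a domain name to a callable prefix -> teacher_probs.
    """
    teacher = teachers[domain]  # only the specialist for this domain scores it
    losses = [reverse_kl(probs, teacher(prefix)) for prefix, probs in steps]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teachers = {"code": lambda prefix: [0.5, 0.5]}
    agrees = [((), [0.5, 0.5])]
    disagrees = [((), [0.9, 0.1])]
    print(score_rollout("code", agrees, teachers))     # zero when they match
    print(score_rollout("code", disagrees, teachers))  # positive otherwise
```

The point of the on-policy part is that the teacher grades text the student actually produced, so the student is corrected on its own mistakes rather than on the teacher's phrasing.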

The proof is in the consistency. SWE-Bench Verified scores range from 65% to 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent, a 5.4-point spread across five very different harnesses. A model that scores 75% on one harness and 50% on the others is a model trained to that harness. Ultra is trained to the workload.
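If you want to run the same consistency check on your own evaluation results, the metric is just the gap between the best and worst harness. A small sketch follows; only the two endpoints of Ultra's range are given above, so the per-harness breakdown is not reproduced here, and the second list is the hypothetical overfit model from the paragraph.

```python
def harness_spread(scores):
    """Gap in percentage points between the best and worst harness score."""
    return round(max(scores) - min(scores), 1)


# Only the endpoints of the reported range are known, not all five scores.
reported_range = [65.0, 70.4]
# The hypothetical harness-overfit model described in the text.
overfit_example = [75.0, 50.0]

if __name__ == "__main__":
    print("Ultra spread:", harness_spread(reported_range))
    print("Overfit spread:", harness_spread(overfit_example))
```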

NVIDIA's agent-economics benchmarks show Ultra completing Terminal-Bench 2.0 using fewer total tokens and fewer tokens per turn than comparably sized models. Cost-to-task-completion is 30% lower. Throughput on DeepInfra is roughly 5x that of comparably sized open models. The agent cost curve just bent.
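A quick back-of-the-envelope check on what the throughput figures mean for a single agent run. The 300 tokens per second (treated as a floor) and the 50 to 100 range are the figures quoted earlier in this piece; the 30,000-token task is a hypothetical workload chosen only for illustration.

```python
def seconds_to_generate(tokens, tokens_per_second):
    """Wall-clock decode time for one task, ignoring prompt processing and queueing."""
    return tokens / tokens_per_second


def speedup(fast_tps, slow_tps):
    """How many times faster the first endpoint decodes than the second."""
    return fast_tps / slow_tps


ULTRA_TPS = 300        # "more than 300" in the text, used here as a floor
PEER_TPS = (50, 100)   # quoted range for comparably sized open models
TASK_TOKENS = 30_000   # hypothetical long agent task

if __name__ == "__main__":
    print(f"Ultra: {seconds_to_generate(TASK_TOKENS, ULTRA_TPS):.0f} s")
    for peer in PEER_TPS:
        print(f"Peer at {peer} tok/s: "
              f"{seconds_to_generate(TASK_TOKENS, peer):.0f} s "
              f"({speedup(ULTRA_TPS, peer):.0f}x slower)")
```

On those numbers the implied lead is 3x to 6x, so the headline 5x sits inside the range rather than at its edge.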

The Benchmarks, Honestly

The Artificial Analysis intelligence index is the cleanest cross-lab number. Nemotron 3 Ultra at 48 sits behind Kimi K2.6 (54) and Claude Opus 4.8 (61). It is ahead of GLM 5.1, Qwen3.5 397B, and every other US open-weights release including Gemma 4 31B (39) and gpt-oss-120b (33). Six months ago the open US tier was not in the conversation. Now it is within six points of Kimi K2.6 on intelligence, ahead on speed, and ahead on cost.

On agent-specific benchmarks from the NVIDIA launch post, Ultra is at parity with Kimi K2.6 on PinchBench (91%), behind on Terminal-Bench 2.0 (54% vs Kimi's 67%), and ahead of the comparison set on IFBench (82%), ProfBench Search (56%), and RULER@1M (95%). Honest read: Ultra is not the best coding model in the world, but it is one of the strongest open-weights models on agent harnesses and the best open US model on raw intelligence. Defensible product position.

Pricing is set by inference providers, not NVIDIA. OpenRouter has a free tier alongside paid endpoints; DeepInfra, Baseten, and Perplexity set per-token rates that undercut the closed frontier by 10x to 25x. The closed labs sell tokens. The open labs sell throughput. NVIDIA just gave the open labs a model that wins on throughput.

The Open-Weights Bet That Actually Has A Business Model

The other open-weights labs are having a rough 2026. Alibaba closed Qwen 3.7 to API-only on June 2. Mistral is licensing rather than open-sourcing the Magnitude line. Meta is shipping Llama 4 with restrictive community licenses. DeepSeek V4 is open but commoditized. The "open weights are the future" narrative is being replaced by the "open weights are a feature, not a business" narrative, and the loudest voices saying so are the labs that used to lead the open movement.

NVIDIA is the counterexample. NVIDIA does not need to make money on the model. NVIDIA makes money on the GPUs that run the model. The Nemotron family is the proof-of-concept that shows NVIDIA's customers what the next two years of inference hardware demand is going to look like. Every team that runs Nemotron 3 Ultra on Blackwell is a customer for the next generation of NVIDIA hardware. Every team that adopts NVFP4 is locking in the NVIDIA inference stack. Every team that uses the Dynamo deployment recipes is buying more NVIDIA systems, not fewer.

What To Do With It Today

If you build agents in production: download the NVFP4 checkpoint, deploy it on a Blackwell instance through NVIDIA NIM, and benchmark it against the closed model you are calling today on your actual harness. The 5x throughput claim is real, the 30% lower cost-to-completion is real, and the harness-consistent SWE-Bench numbers are real. You will not switch off the closed model for every workload, but you should switch it off for the 60% of your traffic that is routine agent execution, and reserve the closed budget for the 10% that actually requires the frontier. If you are on Hopper or Ampere, the NVFP4 checkpoint is the same artifact: you leave throughput on the table, but you are still running the smartest open US model in the world at a fraction of closed-frontier cost. NVIDIA is not just shipping a model. It is shipping the open-weights template.
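To size the savings before you run the benchmark, here is a rough blended-cost estimate. It uses the 60% routine-traffic share from the paragraph above and the 10x to 25x price gap quoted in the pricing discussion, and it assumes every request consumes the same number of tokens regardless of which model serves it, which your own traffic may not bear out. `blended_cost` is a hypothetical helper, not part of any SDK.

```python
def blended_cost(open_share, price_ratio):
    """Spend relative to sending all traffic to the closed model (1.0 = no change).

    open_share: fraction of traffic moved to the open model, between 0 and 1.
    price_ratio: closed per-token price divided by open per-token price.
    """
    if not 0.0 <= open_share <= 1.0:
        raise ValueError("open_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return (1.0 - open_share) + open_share / price_ratio


if __name__ == "__main__":
    for ratio in (10, 25):
        cost = blended_cost(0.6, ratio)
        print(f"60% of traffic on open weights at {ratio}x cheaper: "
              f"{cost:.1%} of the all-closed bill")
```

Under these assumptions the bill drops to a little under half, and most of what remains is the traffic you kept on the closed model, so the size of the price gap matters less than how much traffic you can move.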

The Take

Nemotron 3 Ultra is the most consequential LLM release of the past seven days because it is the first open-weights release where the open lab has a defensible business model for staying open. NVIDIA does not need the model to be the business. The model is the demo for the GPU business. That changes the open-weights calculus in a way the rest of the field has not yet internalized, and the architectural choices (Mamba hybrid, LatentMoE, MTP, NVFP4, MOPD) are the template every other open lab is going to be copying for the next twelve months.

The closed frontier is still ahead on raw intelligence (Opus 4.8 hits 61 on the Artificial Analysis index). The open Chinese frontier is still ahead on absolute score (Kimi K2.6 hits 54). But the gap just closed, the throughput lead is real, and the business model behind this open release is the one the other open labs do not have. The smartest open US model of 2026 is also the open-weights release that finally answers "how does open compete with closed when the closed labs have all the capital." The answer: it does not compete on capital. It competes on hardware economics. NVIDIA just proved the model. If you are building on the open-weights tier, anchor your 2026 stack to this release. If you are building on the closed frontier, this is the release that should make you rethink what you are paying for.

— Mr. Technology

Release date: June 4, 2026.
Architecture: 550B total / 55B active MoE, hybrid Mamba-Transformer, LatentMoE, MTP, NVFP4 (4-bit) checkpoint.
Context: 1M tokens (RULER@1M 95%).
Training: 10T token base, 212B new tokens (legal + Wiki + GitHub), MOPD with 10+ teacher models.
SWE-Bench Verified: 65-70.4% across Pi, OpenHands, Hermes, OpenCode, Mini SWE Agent.
Artificial Analysis intelligence index: 48 (highest US open).
Throughput: 300+ tok/s on DeepInfra (5x comparable open models).
Cost-to-completion: 30% lower.
License: OpenMDW-1.1 (Linux Foundation).
Day-0 inference: NVIDIA NIM, SGLang, vLLM, TRT-LLM, OpenRouter, DeepInfra, Perplexity, Baseten, SageMaker JumpStart, Google Cloud, Microsoft Foundry, Oracle Cloud.
Sources: NVIDIA developer blog, Decoder coverage, SGLang day-0 support, vLLM day-0 support, OpenRouter pricing, NVIDIA model card.