
The closed-API LLM market is a $50B-a-year business built on a tax that does not have to exist. By 2027, 60% of LLM tokens will be served from infrastructure the buyer owns — self-hosted on owned GPUs, on-prem, or on a "self-hosted" tier at a hyperscaler. The closed labs are not ready. The teams still treating the API tier as the default in 2026 are the teams paying 10x what they should in 2027.
I ran the math on a real workload: a customer-support agent at 80M tokens a day at peak. On GPT-5.5 at list price, that is $1,200/day on input alone. On Qwen 3.7 Plus or DeepSeek V4 Pro on owned H100s, the same workload runs at $140/day amortized, with near-zero marginal cost per query. Break-even on a 4xH100 node: under 90 days.
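For anyone who wants to rerun that arithmetic on their own numbers, here is a minimal sketch. It uses only the figures quoted above ($1,200/day, $140/day, 80M tokens/day, 90 days). One loud assumption: it treats the $140/day as a pure running cost, although the figure is described as amortized and may already fold in some hardware. The function names are hypothetical, and no node price is assumed; the sketch instead derives the hardware budget that a 90-day break-even would imply.

```python
def implied_price_per_million(daily_cost_usd: float, tokens_per_day: float) -> float:
    """Per-million-token price implied by a daily bill."""
    return daily_cost_usd / (tokens_per_day / 1_000_000)


def daily_saving(api_per_day: float, self_host_per_day: float) -> float:
    """Money kept each day by moving the workload off the API."""
    saving = api_per_day - self_host_per_day
    if saving <= 0:
        raise ValueError("self-hosting never pays back at these rates")
    return saving


def break_even_days(hardware_cost_usd: float, api_per_day: float,
                    self_host_per_day: float) -> float:
    """Days until cumulative savings cover the up-front hardware spend."""
    return hardware_cost_usd / daily_saving(api_per_day, self_host_per_day)


def max_hardware_budget(days: float, api_per_day: float,
                        self_host_per_day: float) -> float:
    """Largest up-front spend that still breaks even within `days`."""
    return days * daily_saving(api_per_day, self_host_per_day)


# Figures quoted in the text above.
API_PER_DAY = 1_200.0
SELF_HOST_PER_DAY = 140.0  # assumption: treated here as running cost only
TOKENS_PER_DAY = 80_000_000

if __name__ == "__main__":
    # 1,200 / 80 = 15.0 dollars per million input tokens
    print(implied_price_per_million(API_PER_DAY, TOKENS_PER_DAY))
    # 1,060 * 90 = 95,400 dollars of hardware budget for a 90-day payback
    print(max_hardware_budget(90, API_PER_DAY, SELF_HOST_PER_DAY))
```

Plug in your actual node quote with `break_even_days` to see whether the 90-day figure holds for your purchase price, power, and staffing costs.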
Hugging Face reported in Q1 2026 that 42% of new LLM deployments they onboarded were self-hosted, up from 12% in 2024. The closed labs are pricing into a market that is shrinking under their feet.
Qwen 3.7 Plus closed 88% of the SWE-Bench Pro gap to Claude Opus 4.8. DeepSeek V4 Pro hit 56%. Llama 4 Behemoth and MiniMax M3 are in the high 50s. The closed labs are at 80%. That gap matters for a hard multi-day coding migration. It does not matter for summarizing support tickets, classifying documents, extracting structured data, or running a chat interface. The 80% is the part migrating.
The closed labs know this. OpenAI's Ona acquisition is the admission. The bundle (runtime, security, enterprise procurement, long-running agent substrate) is the response. It is the right move. It also locks OpenAI and Anthropic into a high-cost tier the mass market is not going to pay.
EU AI Act enforcement started in August 2025. US frontier-weight export controls have been tightening for two years. HIPAA, PCI-DSS, FedRAMP, state AI laws — every regulated industry hits the same wall: the agent cannot call out to a third-party API with the data. Not a preference. A "we cannot ship" blocker. Every regulated-industry buyer I have talked to in the last six months has killed at least one API-first agent project for this reason. The 60% number is conservative because the compliance migration is not optional.
The 2023-era "self-hosting is an SRE nightmare" argument is dead. With vLLM 0.20, SGLang, llama.cpp with MTP speculative decoding, and AWQ and GPTQ 4-bit quantization, production-grade inference on owned hardware is a two-week project for a competent team in 2026. All three deployments I stood up this quarter beat the API tier on cost within 90 days. None required specialist hires.
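To see why 4-bit quantization changes the hardware picture, here is a rough sizing sketch. It estimates memory for the weights alone as parameters times bits per weight. The 70B parameter count, the 80 GB per-GPU figure, and the 70% headroom fraction (leaving the rest for KV cache and activations) are all assumptions for illustration. Real AWQ and GPTQ checkpoints also carry scale and zero-point metadata that this ignores.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params: float, bits_per_weight: int, gpu_count: int,
         gpu_memory_gb: float = 80.0, weight_fraction: float = 0.7) -> bool:
    """True if the weights fit in the share of GPU memory reserved for them.

    gpu_memory_gb and weight_fraction are illustrative assumptions; the
    remaining memory is left for KV cache, activations, and runtime overhead.
    """
    budget_gb = gpu_count * gpu_memory_gb * weight_fraction
    return weight_memory_gb(num_params, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    print(weight_memory_gb(params, 16))  # 140.0 GB at 16-bit
    print(weight_memory_gb(params, 4))   # 35.0 GB at 4-bit
    print(fits(params, 16, gpu_count=1))  # False: 140 GB exceeds one GPU's budget
    print(fits(params, 4, gpu_count=1))   # True: 35 GB fits within 56 GB
```

This is a first-pass filter only. Throughput at your batch sizes and context lengths still has to be measured on the actual serving stack.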
The hyperscalers know this. AWS Bedrock custom models, Azure AI Foundry, GCP Vertex Model Garden now market "self-hosted on our metal" as a first-class tier. The hyperscalers capture most of the migration. The closed labs lose it.
Four moves follow from this. One: profile your workload on the top three open-weights models via an OpenRouter or LiteLLM gateway. Measure on your real eval set, not public benchmarks. The gap is often invisible in production.
Two: build a model-agnostic abstraction layer. Make the model swappable. Teams that commit to a single provider for 24 months in 2026 are the teams that pay for it in 2027.
Three: self-host one workload end-to-end. A team of 4-5 engineers can stand up a 70B-class model in two weeks. Prove the economics. Then expand.
Four: stop signing multi-year API commits. The 2027 pricing curve is going to look very different. Pay monthly. Stay portable.
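Steps one and two can share a single harness. The sketch below treats a model as nothing more than a function from prompt to completion, so the same eval set can be run against any backend and the backend can be swapped without touching callers. Everything here is hypothetical: the two stub backends stand in for real model calls (in practice each would wrap a client pointed at your gateway, which is not shown), the eval cases are invented, and exact-match scoring is an assumption that only suits classification-style tasks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A backend is any callable that maps a prompt to a completion.
# Swapping providers means swapping this one callable.
Backend = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def accuracy(backend: Backend, cases: List[EvalCase]) -> float:
    """Fraction of cases where the backend's answer matches exactly
    (case-insensitive, surrounding whitespace ignored)."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(
        1 for case in cases
        if backend(case.prompt).strip().lower() == case.expected.lower()
    )
    return hits / len(cases)


def profile(backends: Dict[str, Backend],
            cases: List[EvalCase]) -> List[Tuple[str, float]]:
    """Score every backend on the same eval set, best first."""
    scores = [(name, accuracy(fn, cases)) for name, fn in backends.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins for real model calls.
def keyword_stub(prompt: str) -> str:
    return "billing" if "invoice" in prompt.lower() else "other"


def constant_stub(prompt: str) -> str:
    return "other"


CASES = [
    EvalCase("Classify: my invoice is wrong", "billing"),
    EvalCase("Classify: the app crashes on login", "other"),
]

if __name__ == "__main__":
    ranking = profile(
        {"keyword_stub": keyword_stub, "constant_stub": constant_stub}, CASES
    )
    for name, score in ranking:
        print(f"{name}: {score:.2f}")
```

Keeping the interface this narrow is the point: if application code only ever sees a `Backend`, moving a workload from an API to an owned node is a configuration change, not a rewrite.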
The closed-API LLM market is going to compress from $50B to $20B by 2028. The Ona acquisition is the admission. The bundle is the response. The mass market is going to migrate to self-hosted open weights, and 60% is the conservative estimate of where the migration lands by end of 2027.
The closed labs will be fine on the high end — Mythos-class models, long-running agents, the enterprise bundle. The mid-market, the regulated industries, the cost-sensitive workloads, and the bulk-token use cases are going to leave. The teams still 100% API-locked in 2026 are the teams paying 10x what they should in 2027.
The API tier is the new on-prem. Self-hosting is the new cloud. The transition is already happening. Most teams just have not started.
— Mr. Technology
Sources: OpenAI Ona acquisition announcement (June 11, 2026); Hugging Face Enterprise Q1 2026 deployment statistics; Qwen 3.7 Plus / DeepSeek V4 Pro / Llama 4 Behemoth / MiniMax M3 model release notes and SWE-Bench Pro published scores; vLLM 0.20 and SGLang inference framework documentation; llama.cpp MTP speculative decoding; AWQ / GPTQ 4-bit quantization; AWS Bedrock custom models, Azure AI Foundry, GCP Vertex Model Garden product pages; EU AI Act enforcement timeline (effective August 2025); US frontier-weight export controls 2024-2026.