2026-07-02

The Cheap-Token Era Just Ended. Almost Nobody Has Noticed.

Tokens got 16x cheaper between 2023 and 2026. The next three years will deliver single-digit percentage gains. HBM, power, and latency have run out of headroom — and every AI unit-economics deck is still priced for the wrong world.

Work through the numbers with me for a moment. In 2023, an A100 inference hour cost about $2.00 retail and served roughly 90,000 GPT-3.5-class tokens. Today, July 2026, an H200 hour runs about $1.10 and a frontier 200B-class model serves around 280,000 tokens per hour. That is a 16x improvement in three years: jaw-dropping, and the basis of every AI unit-economics deck you'll be pitched this year.
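
For readers who want to check the arithmetic, here is a minimal sketch that converts the hourly figures quoted above into dollars per million tokens. The inputs are the article's rough retail numbers, not measurements, and the helper name is made up for illustration. On these two data points alone the raw ratio comes out near 5.7x, so the 16x headline must rest on inputs beyond them (for example, crediting the jump from a GPT-3.5-class model to a frontier 200B-class one); that reading is an assumption here, not something the figures state.

```python
# Figures quoted in the paragraph above (rough retail numbers, not measurements).
A100_2023 = {"usd_per_hour": 2.00, "tokens_per_hour": 90_000}
H200_2026 = {"usd_per_hour": 1.10, "tokens_per_hour": 280_000}


def usd_per_million_tokens(usd_per_hour: float, tokens_per_hour: float) -> float:
    """Convert an hourly rental price and hourly throughput into $ per 1M tokens."""
    return usd_per_hour / tokens_per_hour * 1_000_000


cost_2023 = usd_per_million_tokens(**A100_2023)
cost_2026 = usd_per_million_tokens(**H200_2026)

# Raw price-per-token ratio, with no adjustment for model quality.
raw_ratio = cost_2023 / cost_2026

print(f"2023: ${cost_2023:.2f} per 1M tokens")
print(f"2026: ${cost_2026:.2f} per 1M tokens")
print(f"raw improvement: {raw_ratio:.1f}x")
```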

The next three years? Five percent cheaper. Maybe twelve in a best case. The cheap-token era just ended, and most VCs haven't noticed yet.

The counter-argument, and the one every "AI infra" founder has on standby: "Hopper to Blackwell, then Rubin. What about the next two silicon generations?" That's real, and Blackwell is pulling a gain of maybe 40% per rack. After that, the picture collapses, for three reasons.

HBM is the new ceiling

First: HBM. We've moved from HBM2e in 2022 to HBM3e at production scale today, and HBM4 is still on Samsung's roadmap without volume. Every generation costs more die area and more packaging complexity, and SK Hynix has signaled that HBM4 capacity will be supply-constrained into 2027. Without faster memory, the silicon doesn't matter: on inference of long-context models we are memory-bandwidth-bound, not compute-bound. A modern request against a 200B model streams weights through HBM at roughly 2 TB/s during decode, because every generated token has to read the full weight set again. We can't shrink those models fast enough to keep up with the demand curve, and we can't widen HBM fast enough to keep up with model size. Both halves of that ratio are pinned.
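
The ratio in question can be sketched in a few lines. This is a back-of-envelope bound under loud assumptions, not a benchmark: it assumes 8-bit weights (1 byte per parameter), an 8-GPU tensor-parallel node, and 4.8 TB/s per GPU (the published peak HBM3e bandwidth of an H200), and it ignores KV-cache traffic, interconnect, and compute time. The function name is hypothetical.

```python
def decode_steps_per_second(params: float, bytes_per_param: float,
                            hbm_bytes_per_s: float, n_gpus: int) -> float:
    """Upper bound on decode steps per second when each step reads all weights once.

    Ignores KV-cache reads, interconnect, and compute time, so real
    throughput sits below this ceiling.
    """
    weight_bytes = params * bytes_per_param
    aggregate_bandwidth = hbm_bytes_per_s * n_gpus
    return aggregate_bandwidth / weight_bytes


# Assumed setup: 200B parameters (from the text), 8-bit weights,
# 4.8 TB/s per GPU, 8 GPUs sharing the weights.
baseline = decode_steps_per_second(200e9, 1.0, 4.8e12, 8)

# The two levers the paragraph calls pinned: wider memory, or a smaller model.
double_bandwidth = decode_steps_per_second(200e9, 1.0, 9.6e12, 8)
double_model = decode_steps_per_second(400e9, 1.0, 4.8e12, 8)

print(f"baseline ceiling:        {baseline:.0f} steps/s")
print(f"2x HBM bandwidth:        {double_bandwidth:.0f} steps/s")
print(f"2x model size, same HBM: {double_model:.0f} steps/s")
```

The ceiling moves only with the bandwidth-to-model-size ratio; extra compute does not appear in the formula at all.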

Power, not silicon, is the bottleneck

Second: power. Northern Virginia's grid added 2.4 GW of new load last year. Phoenix added 1.9 GW. Dublin is at the ceiling. Loudoun County's Dominion interconnection queue is 4.1 GW deep as of Q2, and hyperscalers have signed power purchase agreements (PPAs) fast enough that new substation build-out is now a 36-month permitting queue, not a 12-month capex problem. New data center capacity is flat while inference volume is growing 4x year-over-year. Those long-term PPAs are now being signed at tariffs 30–40% above 2023 spot. That cost has to land somewhere.

Latency pins the model size

Third: latency. The "AI will replace search" thesis depends on sub-200ms responses. Sub-200ms on a 70B-plus model means batch=1, which means either a frontier model on dedicated silicon or a smaller model you already own. At batch=1 nothing amortizes the weight reads across users, so the data-center cost per frontier query is essentially fixed by silicon and memory cost, not by software improvements. We are out of obvious wins.
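
To see why batch=1 is so expensive, here is a rough sketch of cost per million tokens as a function of batch size. It assumes a purely memory-bound decode step (weights read once per step, shared by the whole batch), a 70B model with 8-bit weights on one GPU at 4.8 TB/s, and the $1.10 hourly price quoted earlier in the article. KV-cache traffic and the compute ceiling are ignored, so the large-batch numbers are optimistic. All names are illustrative.

```python
def step_seconds(params: float, bytes_per_param: float,
                 hbm_bytes_per_s: float) -> float:
    """Time for one memory-bound decode step: read every weight once."""
    return params * bytes_per_param / hbm_bytes_per_s


def usd_per_million_tokens(usd_per_hour: float, step_s: float, batch: int) -> float:
    """Each step emits one token per sequence in the batch."""
    tokens_per_hour = batch * 3600 / step_s
    return usd_per_hour / tokens_per_hour * 1_000_000


# Assumed: 70B parameters, 1 byte per parameter, 4.8 TB/s, $1.10 per GPU-hour.
step_s = step_seconds(70e9, 1.0, 4.8e12)
cost_batch_1 = usd_per_million_tokens(1.10, step_s, batch=1)
cost_batch_32 = usd_per_million_tokens(1.10, step_s, batch=32)

print(f"per-token step time: {step_s * 1000:.1f} ms")
print(f"batch=1:  ${cost_batch_1:.2f} per 1M tokens")
print(f"batch=32: ${cost_batch_32:.2f} per 1M tokens")
```

Under these assumptions the same hardware hour is split across 32 times fewer tokens at batch=1, which is the sense in which latency targets pin the per-query cost.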

I've watched three AI-native vertical SaaS companies pitch me on "tokens will be 10x cheaper in 18 months, so we can lose money now and print it later." None of them have updated the model. Two of them will run out of runway before Q2 2027.

The honest move for builders: treat today's inference cost as roughly the floor. If your unit economics don't work at $0.80 per million output tokens for a frontier-class model, they are not going to work in 2028 either, and they may get worse. Distillation, smaller models, caching, batching, quantization: all of these are real, and most of that work has already been done. The 2x in extra savings from here is yours; the 10x is gone.
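
The floor test above is easy to run against your own numbers. The sketch below uses the $0.80 price and the 5% and 12% improvement scenarios from this article; the product itself ($30 per user per month, 50 million output tokens per user per month) is entirely hypothetical and only there to make the comparison concrete.

```python
FLOOR_USD_PER_M = 0.80  # today's frontier-class output price, from the text


def gross_margin(revenue_per_user: float, output_tokens_per_user: float,
                 usd_per_million: float) -> float:
    """Gross margin after inference cost only (no other COGS modeled)."""
    inference_cost = output_tokens_per_user / 1_000_000 * usd_per_million
    return (revenue_per_user - inference_cost) / revenue_per_user


# Price scenarios: the article's forecasts versus the pitch-deck assumption.
scenarios = {
    "today": FLOOR_USD_PER_M,
    "5% cheaper": FLOOR_USD_PER_M * 0.95,
    "12% cheaper (best case)": FLOOR_USD_PER_M * 0.88,
    "10x cheaper (pitch deck)": FLOOR_USD_PER_M / 10,
}

# Hypothetical product: $30 per user per month, 50M output tokens per user per month.
margins = {
    name: gross_margin(30.0, 50_000_000, price)
    for name, price in scenarios.items()
}

for name, margin in margins.items():
    print(f"{name:>26}: {margin:+.0%} gross margin")
```

For this made-up product, margin is negative in every scenario except the 10x one, which is the pattern to look for in a model priced on future token costs.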

If you're an investor pricing an AI-native SaaS at 2030 margins, redo the model. If you're a founder, build the local-fallback path now. The cheap-token future died in a substation in Loudoun County, and almost nobody is paying attention.
