
I'll take the heat: OpenAI's o1, o3, DeepSeek's R1, the entire category of "reasoning models" — they're a transitional hack. The real breakthrough isn't the reasoning model as a SKU. It's the training method that produced them, and in 18 months dedicated reasoning SKUs will look as quaint as GPT-3 fine-tuning.
Here's the part nobody wants to say out loud: a reasoning model is just a base model with more test-time compute and a few thousand RL steps on chain-of-thought traces. That's it. There's no architectural leap. There's no fundamentally new capability. There are just more inference tokens and a training signal that finally knows how to use them.
OpenAI shipped o1 as a separate model because that's the cheapest way to gate expensive inference. Anthropic shipped "extended thinking" because they had to compete. DeepSeek open-sourced R1 because the cost was already collapsing. But every one of these is the same underlying trick — same transformer, more tokens at inference, RL on chains of thought.
This is fine. It works. o1 is genuinely better than GPT-4o at math and code. R1 roughly matched it with freely downloadable weights. But calling this a "new class of model" is marketing. The model didn't learn to think. It learned to spend more tokens when the question is hard, and the training signal taught it which tokens to spend.
Once you see that, the trajectory is obvious. The reasoning capability will collapse into the base model. Why pay for two SKUs when one model can learn to allocate compute adaptively? The next generation of frontier models will reason when the prompt warrants it, answer when it doesn't, and bill accordingly — all from a single checkpoint.
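To make "one checkpoint, adaptive compute" concrete, here is a minimal sketch of a request that carries a thinking budget instead of a model name. Everything in it is hypothetical: `estimate_difficulty`, `plan_request`, the budget tiers, and the keyword heuristic are invented for illustration and describe no vendor's API. In the scenario above the model would make this call itself; the sketch puts the decision in client code only to show the shape of the interface.

```python
from dataclasses import dataclass

# Hypothetical budgets, in reasoning tokens; real values would need tuning.
BUDGETS = {"none": 0, "short": 1024, "long": 8192}

# Crude stand-in for whatever signal decides that a prompt is "hard".
HARD_HINTS = ("prove", "debug", "optimize", "derive", "step by step")


@dataclass(frozen=True)
class RequestPlan:
    model: str            # the same checkpoint for every request
    thinking_tokens: int  # 0 means "answer directly"


def estimate_difficulty(prompt: str) -> str:
    """Toy heuristic: count hint words to pick a budget tier."""
    text = prompt.lower()
    hits = sum(hint in text for hint in HARD_HINTS)
    if hits == 0:
        return "none"
    return "short" if hits == 1 else "long"


def plan_request(prompt: str, model: str = "single-checkpoint") -> RequestPlan:
    """Attach a thinking budget to a request without switching models."""
    tier = estimate_difficulty(prompt)
    return RequestPlan(model=model, thinking_tokens=BUDGETS[tier])
```

The point of the design is that the model field never changes. Only the budget does, which is also what billing would track.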
The thing everyone is missing is that RL on chain-of-thought (RLVR — RL with verifiable rewards) is a general-purpose training technique, not a model category. It's the first time we have a gradient signal that actually teaches a model to plan. Supervised fine-tuning teaches format. RLHF teaches taste. RL on CoT teaches strategy.
This is enormous. It means:
I'm going to say something that sounds crazy: the frontier lab that figures out how to do RL on CoT at pretraining scale — not post-training scale — wins the next decade. The labs currently applying this trick in post-training are leaving 90% of the gain on the table. The labs that bake it into the foundation will own the next platform shift.
If you're building products on top of these models, three things:
1. Don't architect around a specific reasoning SKU. The capability is migrating into the base model. Build for the capability, not the wrapper.
2. Invest in your own verifiers. RL on CoT needs ground truth. If you have a domain with checkable answers (invoice parsing, test cases, schema validation, code execution), you can fine-tune a model yourself that beats the frontier on your task for pennies. The frontier labs aren't going to do this for you.
3. Stop paying the reasoning-model tax. The price premium for "reasoning" is a temporary moat. As RL-on-CoT becomes a standard post-training step (it will, by Q4 2026), the capability becomes table stakes and the pricing follows.
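A verifier in this sense can be very small. The sketch below scores a model's invoice-parsing output with a binary reward, assuming the model emits JSON with hypothetical `invoice_id`, `line_items`, and `total` fields and that an invoice counts as correct when its line amounts sum to the total. The field names, the consistency rule, and the all-or-nothing reward are illustrative assumptions, and the RL loop that would consume the reward is not shown.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema: the fields a parsed invoice must contain.
REQUIRED_FIELDS = ("invoice_id", "line_items", "total")


def invoice_reward(model_output: str) -> float:
    """Return 1.0 for a well-formed, internally consistent invoice, else 0.0."""
    try:
        invoice = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # not JSON at all
    if not isinstance(invoice, dict):
        return 0.0
    if any(field not in invoice for field in REQUIRED_FIELDS):
        return 0.0  # schema check failed
    items = invoice["line_items"]
    if not isinstance(items, list) or not items:
        return 0.0
    try:
        # Going through str() into Decimal avoids float rounding on currency.
        line_sum = sum(Decimal(str(item["amount"])) for item in items)
        total = Decimal(str(invoice["total"]))
        return 1.0 if line_sum == total else 0.0
    except (KeyError, TypeError, InvalidOperation):
        return 0.0  # malformed line item or non-numeric amount
```

Because the check is deterministic and cheap, it can score as many sampled outputs as needed, which is the property that makes a domain a candidate for this kind of training.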
Reasoning models are a SKU. RL on chain-of-thought is a paradigm. The former is going to be absorbed into the latter the same way "AI assistant" got absorbed into "model." The labs charging a premium for a separate reasoning checkpoint are pricing in a moat that evaporates the moment their own researchers figure out how to scale the training signal. Those researchers are working on it right now, in public, and openly publishing the recipes.
I might be wrong about the timeline. I don't think I'm wrong about the destination.
Reasoning models are a SKU, not a paradigm — and the labs pricing them as a moat are about to be undercut by their own research. RL on chain-of-thought is the breakthrough. The separate "thinking" checkpoint is just the bill for not having shipped it yet.