Opinion · 2026-06-01

Reasoning Models Are a Dead End — RL on Chain-of-Thought Is the Real Breakthrough

OpenAI's o1, o3, DeepSeek's R1 — the whole category of 'reasoning models' is a transitional hack. The real breakthrough isn't the reasoning SKU. It's the training method underneath it, and dedicated reasoning checkpoints are about to be absorbed into the base model.

I'll take the heat: OpenAI's o1, o3, DeepSeek's R1, the entire category of "reasoning models" — they're a transitional hack. The real breakthrough isn't the reasoning model as a SKU. It's the training method that produced them, and in 18 months dedicated reasoning SKUs will look as quaint as GPT-3 fine-tuning.

Here's the part nobody wants to say out loud: a reasoning model is just a base model with more test-time compute and a few thousand RL steps on chain-of-thought traces. That's it. There's no architectural leap. There's no fundamental new capability. There are just more inference tokens and a training signal that finally knows how to use them.

The "Reasoning Model" Is a SKU, Not a Paradigm

OpenAI shipped o1 as a separate model because that's the cheapest way to gate expensive inference. Anthropic shipped "extended thinking" because they had to compete. DeepSeek open-sourced R1 because the cost was already collapsing. But every one of these is the same underlying trick — same transformer, more tokens at inference, RL on chains of thought.

This is fine. It works. o1 is genuinely better than GPT-4o at math and code. R1 matched it for free. But calling this a "new class of model" is marketing. The model didn't learn to think. It learned to spend more tokens when the question is hard, and the training signal taught it which tokens to spend.

Once you see that, the trajectory is obvious. The reasoning capability will collapse into the base model. Why pay for two SKUs when one model can learn to allocate compute adaptively? The next generation of frontier models will reason when the prompt warrants it, answer when it doesn't, and bill accordingly — all from a single checkpoint.

The Real Breakthrough Is the Training Signal

The thing everyone is missing is that RL on chain-of-thought (RLVR — RL with verifiable rewards) is a general-purpose training technique, not a model category. It's the first time we have a gradient signal that actually teaches a model to plan. Supervised fine-tuning teaches format. RLHF teaches taste. RL on CoT teaches strategy.

This is enormous. It means:

  • A small model trained with RL on CoT can outperform a 10x larger model trained the old way, on the same task. The 2025 numbers on this are unambiguous.
  • The training signal is verifiable in domains where correctness has a checkable answer — code, math, tool use, structured output. Those are the highest-value domains in production AI.
  • The signal compounds with scale. Bigger models get more out of the same RL data, not less.
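To make "verifiable" concrete, here is a minimal sketch of the kind of reward function this style of training leans on. It assumes a convention I'm inventing for illustration: the model's chain of thought ends with a line like "Final answer: 42". The function name and the regex are hypothetical, not any lab's actual recipe.

```python
import re

# Hypothetical convention: the trace states its result as "Final answer: <number>".
FINAL_ANSWER = re.compile(r"final answer:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def verifiable_reward(trace: str, expected: float, tol: float = 1e-9) -> float:
    """Return 1.0 if the last stated final answer matches `expected`, else 0.0."""
    matches = FINAL_ANSWER.findall(trace)
    if not matches:
        return 0.0  # no parseable answer means no reward
    answer = float(matches[-1])  # score only the last answer the trace commits to
    return 1.0 if abs(answer - expected) <= tol else 0.0


if __name__ == "__main__":
    trace = "17 + 25 is 42.\nFinal answer: 42"
    print(verifiable_reward(trace, 42))
```

The point of the sketch is that nothing here judges the reasoning itself. The reward only checks the outcome, and the RL step is left to work out which intermediate tokens were worth spending.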

I'm going to say something that sounds crazy: the frontier lab that figures out how to do RL on CoT at pretraining scale — not post-training scale — wins the next decade. The labs currently applying this trick in post-training are leaving 90% of the gain on the table. The labs that bake it into the foundation will own the next platform shift.

What This Means If You Ship AI

If you're building products on top of these models, three things:

1. Don't architect around a specific reasoning SKU. The capability is migrating into the base model. Build for the capability, not the wrapper.
2. Invest in your own verifiers. RL on CoT needs ground truth. If you have a domain with checkable answers (invoice parsing, test cases, schema validation, code execution), you can fine-tune yourself a model that beats the frontier on your task for pennies. The frontier labs aren't going to do this for you.
3. Stop paying the reasoning-model tax. The price premium for "reasoning" is a temporary moat. As RL-on-CoT becomes a standard post-training step (it will, by Q4 2026), the capability becomes table stakes and the pricing follows.
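For the second point, a domain verifier can be very small. The sketch below scores a model's invoice extraction, assuming a made-up JSON schema (`invoice_id`, `line_items` with `quantity` and `unit_price`, and `total`). The field names and the 0.5 partial-credit tier are my assumptions, not a standard; adjust both to your own data.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = ("invoice_id", "line_items", "total")


def invoice_reward(model_output: str) -> float:
    """Score an invoice extraction as 0.0, 0.5, or 1.0."""
    try:
        # Parse decimals exactly so currency math avoids float rounding.
        invoice = json.loads(model_output, parse_float=Decimal)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON
    if not isinstance(invoice, dict) or any(f not in invoice for f in REQUIRED_FIELDS):
        return 0.0  # required fields missing
    try:
        computed = sum(
            (Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
             for item in invoice["line_items"]),
            Decimal("0"),
        )
        stated = Decimal(str(invoice["total"]))
    except (KeyError, TypeError, InvalidOperation):
        return 0.5  # right shape, but the numbers cannot be checked
    # Full credit only when the line items actually add up to the stated total.
    return 1.0 if computed == stated else 0.5


if __name__ == "__main__":
    sample = '{"invoice_id": "A-1", "line_items": [{"quantity": 2, "unit_price": 19.99}], "total": 39.98}'
    print(invoice_reward(sample))
```

The partial-credit tier is a design choice: it rewards getting the structure right before the arithmetic is right, which may or may not help in your setup. A stricter all-or-nothing version just returns 0.0 in place of 0.5.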

The Bottom Line

Reasoning models are a SKU. RL on chain-of-thought is a paradigm. The former is going to be absorbed into the latter the same way "AI assistant" got absorbed into "model." The labs charging a premium for a separate reasoning checkpoint are pricing in a moat that evaporates the moment their own researchers figure out how to scale the training signal. They're working on that right now, in public, and openly publishing the recipes.

I might be wrong about the timeline. I don't think I'm wrong about the destination.


Reasoning models are a SKU, not a paradigm — and the labs pricing them as a moat are about to be undercut by their own research. RL on chain-of-thought is the breakthrough. The separate "thinking" checkpoint is just the bill for not having shipped it yet.
