Opinion · 2026-06-16

Synthetic Data Is Going to Break the Model Training Flywheel. The "Data Wall" Fix Is a Trap.

Every frontier lab is leaning on synthetic data to break through the human-data ceiling. Two to three more training generations and the next models will be measurably worse — not from compute, not from architecture, but from rotten training signal.

The whole industry is leaning on synthetic data to break through the human-data ceiling, and I'm going to say something the industry doesn't want to hear: two to three more training generations and the next frontier models will be measurably worse than the current ones. Not because of compute. Not because of architecture. Because the training signal is going rotten, and nobody wants to be the first lab to admit it.

Hey guys, Mr. Technology here.

  • The Mainstream View
  • Why It's Wrong
  • What the Right Answer Is
  • Bottom Line

The Mainstream View

The story is clean. We have scraped the web, the books, the code, the papers. We are running out. The fix is to let models generate their own training data: reasoning traces, multi-step solutions, simulated dialogues, tool-use transcripts. Phi-4 was trained largely on synthetic data generated by a larger teacher model. DeepSeek V3 leaned on distilled chain-of-thought data. Llama 3.1 used rejection-sampled synthetic data in post-training. Frontier labs now openly describe synthetic data as a major component of the next checkpoint. The papers say it works. The benchmarks go up. Ship it.

Why It's Wrong

The papers are testing one hop. Train on synthetic. Evaluate on real. Score improves. The industry is not doing one hop. It is running an indefinite chain, where each new model generates synthetic data for the next, which generates synthetic data for the one after that. The 2024 "model collapse" research showed what happens by generation five or six: outputs collapse toward the most-probable tokens, the tail of rare-but-correct reasoning paths vanishes, and the model becomes confidently narrow.

Three things break first:

  • Diversity loss. Synthetic data mirrors the model's own distribution. Real human data is jagged, weird, contradictory, and rich. Synthetic data is the average of the average. Train on it twice and the average gets narrower.
  • Distillation echo. Distillation pipelines already over-weight whatever the teacher prefers. Stack two distillations and the student converges on the teacher's blind spots as if they were ground truth.
  • Verifier collapse. Self-play pipelines assume the model can judge its own outputs. It cannot. The verifier is the same model. Errors get reinforced as features.
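
To make the diversity-loss point concrete, here is a toy sketch. It is not a reproduction of any lab's pipeline or of the published collapse experiments. It assumes a "model" is nothing more than a categorical distribution over 200 hypothetical items with Zipf-like weights, and that each generation is re-estimated purely from samples drawn from the previous one. The function names and every parameter are invented for illustration.

```python
import random
from collections import Counter


def zipf_weights(n_items, exponent=1.0):
    """Normalized Zipf-like weights: a few common items, a long rare tail."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def run_generations(n_items=200, samples_per_gen=1000, generations=6, seed=0):
    """Refit a categorical 'model' on its own samples, generation after generation.

    Returns the number of items with nonzero probability after each step,
    starting with the original (generation 0) distribution.
    """
    rng = random.Random(seed)
    items = list(range(n_items))
    weights = zipf_weights(n_items)
    support_sizes = [sum(1 for w in weights if w > 0)]

    for _ in range(generations):
        # The current model generates the next generation's training set.
        draws = rng.choices(items, weights=weights, k=samples_per_gen)
        counts = Counter(draws)
        # The next model is the empirical distribution of that synthetic set.
        weights = [counts.get(i, 0) / samples_per_gen for i in items]
        support_sizes.append(sum(1 for w in weights if w > 0))

    return support_sizes


if __name__ == "__main__":
    for gen, size in enumerate(run_generations()):
        print(f"generation {gen}: {size} items still reachable")
```

An item that draws zero samples gets zero weight and never comes back, so the set of reachable items cannot grow from one generation to the next. Real models are far more complicated, but this is the tail-loss mechanism in miniature.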

By 2027 — maybe 2028 — the public model releases that lean hardest on synthetic data will show a specific failure pattern: high benchmark scores, brittle production behavior, and an inability to handle inputs that fall outside the synthetic distribution. Eval loss will look fine. User trust will not.

What the Right Answer Is

Stop pretending synthetic data is free. Treat it like a reagent.

  • Cap the synthetic share at 30% or less. Above that you start trading generalization for in-distribution polish.
  • Buy real, messy, adversarial human data. RLHF on actual user feedback. Real support transcripts. Real code reviews. Real arguments on forums. The signal is in the chaos.
  • Build verifiers from different distributions. If the generator and the judge are the same model family, you do not have a judge. You have a mirror.
  • Track tails, not means. Benchmark averages hide collapse. Monitor rare-but-correct outputs. If the long tail is shrinking, your model is shrinking.
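
Two of the rules above are easy to sketch in code: the cap on synthetic share and the tail check. This is a minimal illustration under stated assumptions, not anyone's production tooling. It assumes training examples are plain Python lists and that each model output has already been mapped to a category label (say, a reasoning-strategy cluster) by some external process. `cap_synthetic_share`, `tail_coverage`, and their parameters are hypothetical names, and the 30% default is simply the figure from the list.

```python
import random
from collections import Counter


def cap_synthetic_share(human, synthetic, max_share=0.30, seed=0):
    """Return human + synthetic examples with synthetic at most max_share of the mix."""
    if not 0 <= max_share < 1:
        raise ValueError("max_share must be in [0, 1)")
    # s / (h + s) <= p  is equivalent to  s <= p * h / (1 - p)
    max_synth = int(max_share * len(human) / (1 - max_share))
    synthetic = list(synthetic)
    if len(synthetic) > max_synth:
        # Downsample at random rather than truncating, to avoid order bias.
        synthetic = random.Random(seed).sample(synthetic, max_synth)
    return list(human) + synthetic


def tail_coverage(reference_labels, output_labels, tail_fraction=0.5):
    """Fraction of the reference's rarest categories that still appear in the outputs.

    The rarest `tail_fraction` of distinct reference categories form the tail.
    """
    if not reference_labels:
        raise ValueError("reference_labels must not be empty")
    ref_counts = Counter(reference_labels)
    # Rarest first; ties are broken by label text so the result is deterministic.
    ranked = sorted(ref_counts, key=lambda label: (ref_counts[label], str(label)))
    n_tail = max(1, int(len(ranked) * tail_fraction))
    tail = set(ranked[:n_tail])
    return len(tail & set(output_labels)) / len(tail)


if __name__ == "__main__":
    human = [f"h{i}" for i in range(100)]
    synthetic = [f"s{i}" for i in range(500)]
    mix = cap_synthetic_share(human, synthetic)
    print(f"mix size: {len(mix)}, synthetic share: {(len(mix) - len(human)) / len(mix):.3f}")
```

Computing `tail_coverage` on each checkpoint against the same human reference set gives a number that can fall while the benchmark average holds steady, which is exactly the case the average hides.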

The labs that win the next cycle will not be the ones with the most synthetic tokens. They will be the ones that refused to over-rely on them.

Bottom Line

Synthetic data is a useful tool and a strategic poison. The industry is treating it as a fuel source. It is a catalyst — use too much and the reaction eats itself. I'm calling it now: the next frontier model that ships with a 70%+ synthetic data mix is going to quietly underperform its benchmarks in production. And nobody will admit why.

Mr. Technology
