
The whole industry is leaning on synthetic data to break through the human-data ceiling, and I'm going to say something the industry doesn't want to hear: two to three more training generations and the next frontier models will be measurably worse than the current ones. Not because of compute. Not because of architecture. Because the training signal is going rotten, and nobody wants to be the first lab to admit it.
Hey guys, Mr. Technology here.
The story is clean. We have scraped the web, the books, the code, the papers. We are running out. The fix is to let models generate their own training data: reasoning traces, multi-step solutions, simulated dialogues, tool-use transcripts. Phi-4 was distilled from a larger teacher. DeepSeek V3 leaned on chain-of-thought synthesis. Llama 3.1 used rejection-sampled synthetic data. Every frontier lab now admits synthetic data is a major component of the next checkpoint. The papers say it works. The benchmarks go up. Ship it.
The papers are testing one hop. Train on synthetic. Evaluate on real. Score improves. The industry is not doing one hop. It is running an indefinite chain, where each new model generates synthetic data for the next, which generates synthetic data for the one after that. The 2024 "model collapse" research (Shumailov et al., in Nature) showed what happens when a model is trained recursively on its predecessors' output. Within a handful of generations, outputs collapse toward the most-probable tokens, the tail of rare-but-correct reasoning paths vanishes, and the model becomes confidently narrow.
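Here is a toy sketch of that chain, not a reproduction of the paper. The "model" is just a probability table over 50 tokens with a long tail, and each generation is refit on samples drawn from the previous one. The function names, the Zipf-like starting distribution, the vocabulary size, and the sample count are all my assumptions, chosen only to make the effect visible.

```python
import random
from collections import Counter


def zipf_distribution(vocab_size):
    """Zipf-like start: token k gets weight 1/(k+1), so the tail is rare."""
    weights = [1.0 / (k + 1) for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def next_generation(probs, num_samples, rng):
    """Sample from the current 'model', then refit on its own samples."""
    tokens = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    counts = Counter(tokens)
    # A token that was never sampled gets probability 0 and can never return.
    return [counts[t] / num_samples for t in range(len(probs))]


def simulate_collapse(vocab_size=50, num_samples=200, generations=10, seed=0):
    """Return the number of tokens with nonzero probability per generation."""
    rng = random.Random(seed)
    probs = zipf_distribution(vocab_size)
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        probs = next_generation(probs, num_samples, rng)
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes


if __name__ == "__main__":
    print(simulate_collapse())
```

The list it prints can only go down. Once a rare token fails to show up in one generation's samples, no later generation can bring it back. That is the one-way door, in about thirty lines.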
Three things break first:
By 2027 — maybe 2028 — the public model releases that lean hardest on synthetic data will show a specific failure pattern: high benchmark scores, brittle production behavior, and an inability to handle inputs that fall outside the synthetic distribution. Eval loss will look fine. User trust will not.
Stop pretending synthetic data is free. Treat it like a reagent.
The labs that win the next cycle will not be the ones with the most synthetic tokens. They will be the ones that refused to over-rely on them.
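If you want the reagent idea in concrete form, one option is a hard cap on the synthetic share of a training mix. This is a minimal sketch under my own assumptions: `build_training_mix` and `max_synthetic_fraction` are hypothetical names, and the 0.3 in the usage line is an arbitrary number for illustration, not a threshold any lab has published.

```python
import random


def build_training_mix(real, synthetic, max_synthetic_fraction, seed=0):
    """Combine real and synthetic examples, capping the synthetic share.

    Hypothetical helper: the cap is a value you pick and then measure,
    not a published recommendation.
    """
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    f = max_synthetic_fraction
    # Largest s with s / (len(real) + s) <= f, i.e. s <= f * R / (1 - f).
    cap = int(f * len(real) / (1 - f))
    # Guard against floating-point rounding pushing the share over the cap.
    while cap > 0 and cap / (len(real) + cap) > f:
        cap -= 1
    cap = min(cap, len(synthetic))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, cap)  # random subset of synthetic examples
    mix = list(real) + chosen
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    real = [f"real-{i}" for i in range(100)]
    synth = [f"synth-{i}" for i in range(1000)]
    mix = build_training_mix(real, synth, max_synthetic_fraction=0.3)
    print(len(mix))
```

The point of the design is that the real data sets the budget. You can generate as many synthetic tokens as you like, but the amount that reaches the training mix is bounded by how much real signal you have to dilute it with.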
Synthetic data is a useful tool and a strategic poison. The industry is treating it as a fuel source. It is a catalyst — use too much and the reaction eats itself. I'm calling it now: the next frontier model that ships with a 70%+ synthetic data mix is going to quietly underperform its benchmarks in production. And nobody will admit why.
— Mr. Technology