
Model collapse is not coming to save you

Recently I was sitting in a hotel restaurant, talking with a few people in my profession about AI and software development. As usual, opinions were all over the place.

At some point somebody brought up "model collapse" and how that will eventually stop these things from getting better, maybe make them worse.

I had not heard the term in a while. It was big news maybe two years ago, when a paper got a lot of coverage arguing that if you keep training models on their own generated output, they start losing the rare and unusual parts of what they learned. The weird edges get washed out first, then quality degrades further with each generation. People latched onto the idea quickly, for obvious reasons.

I remember thinking back then what I still think now: yes, that can happen, but only under very specific conditions that the real world does not actually satisfy.

The loop has to stay closed

The concerning version of model collapse requires one key thing: that real data gets replaced by synthetic data over time. Train on your own output, ignore new human-generated content, repeat. Under those conditions, yes, things degrade. The math on that is fairly clear.

But that assumption is just wrong as a description of how the internet or the AI industry works.

People keep producing new content whether AI exists or not. Software fails in new ways. News happens. Research gets published. Arguments break out on forums. Code gets written, compiled, tested, deployed and debugged. None of that stops. And all of it generates new signal that is grounded in reality, not in the output of the previous model generation.

Training data accumulates. It does not get replaced. The older human-generated material stays in the mix. And research that followed the initial wave of warnings confirmed exactly this: if you keep accumulating real data alongside synthetic data, collapse does not happen. The error stays bounded.
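The distinction is easy to see in a toy simulation. The sketch below is my own illustration with made-up numbers, not an experiment from any of those papers: a "model" here is just an empirical token distribution that gets refit each generation. In the closed loop, a rare token that draws zero samples once is gone forever. With accumulation, fresh real data keeps the tail alive.

```python
# Toy illustration, not the paper's experiment: a "model" is just an
# empirical token distribution refit on samples each generation.
# All numbers (vocab size, sample size, generations) are invented.
import numpy as np

rng = np.random.default_rng(42)
vocab = 100
# Zipf-like starting distribution: a few common tokens, a long rare tail.
probs = 1.0 / np.arange(1, vocab + 1)
probs /= probs.sum()

def refit(p, n=500):
    """One 'training generation': sample n tokens, keep empirical frequencies."""
    counts = np.bincount(rng.choice(vocab, size=n, p=p), minlength=vocab)
    return counts / counts.sum()

# Closed loop: each generation sees only the previous generation's output.
p = probs
for _ in range(30):
    p = refit(p)
print("closed loop, surviving tokens:", np.count_nonzero(p), "of", vocab)
# A rare token that draws zero samples once can never come back: tails die first.

# Accumulating loop: keep all old data and keep adding fresh real data.
counts = np.zeros(vocab)
p = probs
for _ in range(30):
    counts += np.bincount(rng.choice(vocab, size=500, p=p), minlength=vocab)
    counts += np.bincount(rng.choice(vocab, size=500, p=probs), minlength=vocab)
    p = counts / counts.sum()
print("accumulating, surviving tokens:", np.count_nonzero(p), "of", vocab)
```

The exact counts depend on the random seed, but the direction does not: replacement makes losing a rare token an absorbing state, accumulation does not.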

Bad output does not reproduce well

The evolution angle makes this even clearer.

Evolution does not guarantee improvement and it produces a lot of dead ends. But open systems with variation, competition and selection do not just average themselves into nothing either. The key is that not everything survives equally.

The same thing already happens with model outputs. Useful ones get copied, cited, incorporated into real work, built upon. Bad outputs get ignored, deleted, downvoted or simply never referenced again. A code snippet that does not compile does not end up in a lot of training data. A blog post that is just incoherent noise does not get linked. The web is a terrible curator but it is still a curator.

Most of the novel ideas people come up with are also bad. We just do not remember those for long. We mostly remember and keep the ones that survived contact with reality. Same process.
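Selection is easy to mimic in the same toy style. Again the numbers are invented, and the filter below is a crude stand-in for compilers, test suites, readers and downvotes: a model retrained only on outputs that pass an external reality check stays anchored, while one retrained on everything just drifts.

```python
# Toy illustration of selection, with invented numbers: the "filter" is a
# crude stand-in for compilers, tests, readers and downvotes.
import numpy as np

rng = np.random.default_rng(7)
reality = 0.0     # the ground truth that the filter can check against
tolerance = 1.0

def survives(outputs):
    """Keep only outputs that pass the external reality check."""
    return outputs[np.abs(outputs - reality) < tolerance]

# Filtered loop: retrain only on outputs that survived selection.
model = 0.0
for _ in range(50):
    outputs = model + rng.normal(0.0, 2.0, size=200)  # noisy generations
    kept = survives(outputs)
    if kept.size:
        model = kept.mean()
print(f"with selection: model = {model:+.3f}")    # bounded by the tolerance

# Unfiltered loop: retrain on everything the previous generation produced.
model = 0.0
for _ in range(50):
    model = (model + rng.normal(0.0, 2.0, size=200)).mean()
print(f"without selection: model = {model:+.3f}")  # unanchored random walk
```

The filter does not need to be smart. It only needs to be grounded in something outside the loop.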

Money is a pretty good filter

Training a frontier model costs a lot of money. That alone is a strong reason for labs not to poison their own training data.

If one lab cuts corners and trains mostly on synthetic slop while another invests in curation, fresh data, human feedback, and filtering, the second one wins. The incentives push strongly against the closed-loop scenario.

The same goes for the platforms where that content lives. A forum or search result page that fills with generated garbage loses its audience. Some of that is already happening, but the response is also already happening. Spam filters, quality signals, provenance markers, community curation. People adapt because they have reasons to adapt.

Things will change. Please do not resist

Still, model collapse is a real phenomenon. Small closed loops should be avoided.

Some parts of the web may become harder to use as they fill up with generated slop.

But the truly scary (some would say good) version of this story, where AI eats the internet, then itself, and then the whole thing collapses, requires ignoring competition, fresh content, curation, money, and the simple fact that humans, and AI by proxy, keep interacting with reality.

The successful models (and people) will still get fresh signals from the real world and keep improving.

That is what I expect. And it looks a lot more like an ecosystem finding a new shape than anything like collapse.
