What is model collapse and why is it a risk for enterprise AI?
Model collapse is a looming nightmare for AI companies and users alike: AI models trained on AI-generated data lose quality with each successive generation
Ever since generative AI became mainstream, AI-generated content has become increasingly common. From text to audio, images, and video, synthetic data is easier than ever to produce.
A study by Ahrefs, published in 2025, analyzed 900,000 English-language web pages. About 2.5% of the pages were purely AI-generated, while 71.7% mixed human-written and AI-generated content.
It's clear that the percentage of AI-generated and mixed content is only going to increase. In some respects, this is great news for enterprises, which can generate large, sanitized datasets that closely mimic their specific enterprise data without exposing actual private information to AI firms.
“Synthetic data shouldn’t be about replacing real data entirely but instead enhancing and extending it to build great AI models and agents,” says Dael Williamson, EMEA CTO at Databricks.
Ironically, however, the explosion of synthetic data has problematic implications for training future generations of AI models. Some firms, such as Gartner, are already sounding the alarm over the massive volumes of AI-generated data coming down the pipeline. The concern is that training on AI data could create a kind of algorithmic feedback loop, in which AI models become increasingly distanced from real data and more prone to hallucinations.
A blend of AI-generated and human-written information is therefore important. But what happens if AI-generated content crowds out real content beyond a critical limit?
What is model collapse?
In a joint study, Ilia Shumailov from the University of Oxford and Zakhar Shumaylov coined the term “model collapse”.
Model collapse is defined as a long-term statistical degenerative process in which an AI model gradually misperceives reality, bets on improbable events, and starts to generate low-quality or wrong outputs. An AI model begins to collapse when its training data gets contaminated by its output or other AI-generated information.
For example, model A trains purely on human-generated information, then generates a large volume of synthetic data. A new model B uses this synthetic data as part of its training. Then model C trains on synthetic data from models A and B, and so on. With each generation, errors compound and the training data distribution drifts further from reality.
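The degradation in this A-to-B-to-C chain can be illustrated with a minimal, hypothetical simulation (the distribution, sample sizes, and seed below are illustrative assumptions, not from any cited study): each "generation" fits a Gaussian to the previous generation's output and samples fresh synthetic data from that fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=1000)

stds = []
for gen in range(10):
    # Each new model fits a Gaussian to the previous generation's output...
    mu, sigma = data.mean(), data.std()
    stds.append(float(sigma))
    # ...then emits fresh "synthetic" training data sampled from that fit,
    # which the next generation trains on. Because each fit is made from a
    # finite sample, estimation error compounds and the fitted distribution
    # tends to drift away from the original.
    data = rng.normal(loc=mu, scale=sigma, size=1000)
```

Run over many generations, the fitted mean and standard deviation wander away from the true values of 0 and 1, with the distribution's tails typically the first casualty.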
Anyone who ever tried to copy a VHS tape more than once has experienced generation loss, in which each reproduction of data produces a lesser copy filled with more and more artifacts. Model collapse could happen in much the same way, with AI developers forced to run more and more costly training runs to eke out performance increases in subsequent generations of frontier AI models.
Model collapse isn’t limited to LLMs: neural network architectures such as variational autoencoders (VAEs) and probabilistic models such as Gaussian mixture models are also susceptible.
A real-world example can be found in OPT-125M, an open large language model created by Meta AI researchers. To test model collapse, the researchers trained OPT-125M on its own outputs, then prompted it to answer questions about historical architecture. In response, the model produced nonsense about jackrabbits instead. Future generations of AI models will have to fight off just this kind of collapse.
Williamson uses a real-world example to explain model collapse:
“Imagine a locksmith who keeps duplicating keys from the latest copy, not the original. Early on, the key still turns most locks.
“But after many rounds, the key may look right at a glance, yet it jams, scrapes, or spins uselessly. Copying copies collapses fidelity, but copying while anchored to the master preserves it,” Williamson adds.
Model collapse would encourage hallucinations within LLMs, along with the loss of key training information, poorer decision-making, and degraded AI guardrails, all of which would erode business trust in affected AI models.
The three stages of model collapse
Models already undergo partial synthetic training, and for now the proportion of real data tends to remain higher than that of synthetic data. However, models can collapse when the percentage of synthetic data exceeds that of real data. There are three stages of model collapse:
Early collapse: The model starts to skip over or 'forget' rare, uncommon, and unusual information. Early collapse is unlikely to be noticed because the model still generates quality outputs.
Late collapse: The model loses much of the structure of the original data. It blends different patterns to generate outputs that may sound plausible but no longer resemble real-world situations, producing vague and low-quality responses.
Total collapse: A hypothetical end state reached when a very high proportion of synthetic data enters the training environment. The model enters a “self-consuming loop” and relies only on synthetic data.
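The 'forgetting' of rare information in early collapse can be sketched with a toy loop (the categories, probabilities, and sample size are invented for illustration): each generation re-estimates category frequencies from a finite sample of its own output, and the rare category tends to disappear.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "facts" in the data; index 2 is rare knowledge (1% of real data).
probs = np.array([0.94, 0.05, 0.01])

for gen in range(20):
    # Each generation trains on a finite sample of the previous
    # generation's output and re-estimates the category frequencies.
    sample = rng.choice(3, size=100, p=probs)
    counts = np.bincount(sample, minlength=3)
    probs = counts / counts.sum()
    # If the rare category draws zero samples in any generation, its
    # estimated probability becomes 0 and it can never be generated again.
```

The key property is that the loss is absorbing: once the rare category's estimated probability hits zero, no later generation can recover it, which is why early collapse goes unnoticed until the damage is done.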
Errors causing the collapse
Errors in training data lead to model collapse, and the size of the training sample influences error rates. In theory, with an infinite number of samples, statistical approximation errors would vanish entirely. In practice, of course, infinite data does not exist on the internet: every dataset is finite.
This is why synthetic data, used correctly, is actually good for training quality AI models. The key is preparing the right mix of organic and synthetic data at every stage of the training process.
Every time a model learns from sampled data, some approximation error occurs. When a large volume of synthetic data enters the training cycle, statistical errors appear more frequently, and these small errors accumulate over time, leading to model collapse.
Statistical approximation errors are the primary cause of model collapse. Models perform regression tasks, essentially learning to draw links between disparate data points. Errors are deviations between true values and the approximations a model derives, which lead to models confidently outputting overgeneralized or flat-out incorrect answers. A collapsing model generates ever-greater deviations from true values.
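The link between sample size and statistical approximation error can be shown in a few lines (all numbers here are illustrative): the average error of an estimated mean shrinks roughly as 1/√n as the sample grows, which is why finite samples always leave residual error for collapse to feed on.

```python
import numpy as np

rng = np.random.default_rng(42)

TRUE_MEAN, TRUE_STD, TRIALS = 5.0, 2.0, 200

# Average absolute error of the sample mean, over many trials, for
# increasing sample sizes: it shrinks roughly like 1/sqrt(n), but
# never reaches zero for any finite sample.
avg_error = {}
for n in (10, 1_000, 100_000):
    errs = [abs(rng.normal(TRUE_MEAN, TRUE_STD, n).mean() - TRUE_MEAN)
            for _ in range(TRIALS)]
    avg_error[n] = float(np.mean(errs))
```

Each hundredfold increase in sample size cuts the average error by roughly a factor of ten, and only an (impossible) infinite sample would eliminate it.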
All models assign weights within a neural network; the higher a weight, the more influence that input has on the output. But the data on which AI is trained is only ever an approximation of a complex reality, and due to functional approximation errors, models can draw incorrect links within training data, leading them to generate false data outside the training distribution down the line. Functional approximation errors are a secondary cause of model collapse, and unlike statistical errors, they don't grow across generations.
Data poisoning is another, indirect route to model collapse. In this type of attack, hackers intentionally insert misleading data into training loops, so that models learn wrong patterns and generate false information. While this isn't a major risk right now, it's something that will have to be guarded against as future models are trained.
“Most organizations do not have enough high-quality data to train a model to the precision they need for good business outcomes, and few build foundation models from scratch,” says Christopher Royles, field CTO EMEA at Cloudera.
“The best way is to start with model selection,” he says, advocating for a firm business data strategy that deploys a blend of organic and synthetic data.
Even after selecting optimal training data, Royles says, it is impossible to prevent synthetic data from entering the loop. “Begin with real-world samples, labelled and enriched with metadata that capture quality,” he advises, sharing a step-by-step approach to dealing with model collapse: “Cluster the data to understand the statistical patterns within it, then generate synthetic records that mirror those patterns. LLM-as-a-judge can help here.”
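A minimal numpy sketch of this anchoring idea (the "retail" and "wholesale" segments, their statistics, and the record counts are all hypothetical; a real pipeline would use proper clustering and LLM-based validation): synthetic records are generated from statistics measured on labelled real data, so every generation stays anchored to the "master key" rather than to a previous model's output.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical labelled "real" records: two business segments whose
# two columns might be, say, order value and item count.
real = {
    "retail":    rng.normal([50.0, 3.0], [10.0, 1.0], size=(500, 2)),
    "wholesale": rng.normal([400.0, 20.0], [50.0, 5.0], size=(500, 2)),
}

synthetic = {}
for label, records in real.items():
    # Measure each segment's statistical pattern on the real data...
    mu = records.mean(axis=0)
    cov = np.cov(records, rowvar=False)
    # ...and generate synthetic records anchored to those measured
    # statistics, not to an earlier model's synthetic output.
    synthetic[label] = rng.multivariate_normal(mu, cov, size=500)
```

Because every synthetic batch is derived from statistics of the original data, repeated rounds of generation do not compound earlier approximation errors, in contrast with the A-to-B-to-C chain described earlier.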
Model collapse may prove inevitable, but it can be delayed as long as human-generated information maintains the larger share of training data, thanks in part to the years of big data accumulated before generative AI went mainstream.

Venus is a freelance technology writer specializing in IT, quantum physics, electronics, and other technical fields. She holds a degree in Electronics and Telecommunications Engineering from Mumbai University, India.
With years of experience in writing for global media brands and IT companies, she enjoys translating complex content into engaging stories. When she’s not writing about the latest IT trends, Venus can be found tracking enterprise trends or the newest processor in town.