What is model collapse and why is it a risk for enterprise AI?
Model collapse is a looming nightmare for AI companies and users alike: AI models trained on AI-generated data lose quality with each successive generation
Ever since generative AI became mainstream, AI-generated content has become increasingly common. From text to audio, images, and video, synthetic data is easier than ever to produce.
A study by Ahrefs, published in 2025, analyzed 900,000 English-language web pages. About 2.5% of the pages were purely AI-generated, while 71.7% mixed human-written and AI-generated content.
It's clear that the percentage of AI-generated and mixed content is only going to increase. In some respects, this is great news for enterprises, which can generate large, sanitized datasets that closely mimic their specific enterprise data without exposing actual private information to AI firms.
“Synthetic data shouldn’t be about replacing real data entirely but instead enhancing and extending it to build great AI models and agents,” says Dael Williamson, EMEA CTO at Databricks.
Ironically, however, the explosion of synthetic data has problematic implications for training future generations of AI models. Some firms, such as Gartner, are already sounding the alarm over the massive volumes of AI-generated data coming down the pipeline. The concern is that training on AI data could create a kind of algorithmic feedback loop, in which AI models become increasingly distanced from real data and more prone to hallucinations.
A blend of AI-generated and human-written information is therefore important. But what happens if AI-generated content crowds out real content beyond a critical limit?
What is model collapse?
In a joint study, Ilia Shumailov from the University of Oxford and Zakhar Shumaylov coined the term “model collapse”.
Model collapse is defined as a long-term statistical degenerative process in which an AI model gradually misperceives reality, bets on improbable events, and starts to generate low-quality or wrong outputs. An AI model begins to collapse when its training data gets contaminated by its output or other AI-generated information.
For example, model A trains purely on human-generated information, then generates a large volume of synthetic data. A new model B uses this synthetic data as part of its training. Then model C trains on synthetic data from models A and B, and so on. With each generation, errors compound and the training data distribution drifts further from reality.
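The degradation in this A-to-B-to-C chain can be illustrated with a minimal, hypothetical simulation (the distribution, sample sizes, and seed below are illustrative assumptions, not from any cited study): each "generation" fits a Gaussian to the previous generation's output and samples fresh synthetic data from that fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=1000)

stds = []
for gen in range(10):
    # Each new model fits a Gaussian to the previous generation's output...
    mu, sigma = data.mean(), data.std()
    stds.append(float(sigma))
    # ...then emits fresh "synthetic" training data sampled from that fit,
    # which the next generation trains on. Because each fit is made from a
    # finite sample, estimation error compounds and the fitted distribution
    # tends to drift away from the original.
    data = rng.normal(loc=mu, scale=sigma, size=1000)
```

Run over many generations, the fitted mean and standard deviation wander away from the true values of 0 and 1, with the distribution's tails typically the first casualty.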
Anyone who ever tried to copy a VHS tape more than once has experienced generation loss, in which each reproduction of data produces a lesser copy filled with more and more artifacts. Model collapse could happen in much the same way, with AI developers forced to run more and more costly training runs to eke out performance increases in subsequent generations of frontier AI models.
Model collapse isn’t limited to LLMs: neural network architectures such as variational autoencoders (VAEs) and probabilistic models such as Gaussian mixture models are also susceptible.
A real-world example can be found in OPT-125M, an open large language model created by Meta AI researchers. To test model collapse, the researchers trained OPT-125M on its own outputs, then prompted it to answer questions about historical architecture. In response, the model produced nonsense about jackrabbits instead. Future generations of AI models will have to fight off just this kind of collapse.
Williamson uses a real-world example to explain model collapse:
“Imagine a locksmith who keeps duplicating keys from the latest copy, not the original. Early on, the key still turns most locks.
“But after many rounds, the key may look right at a glance, yet it jams, scrapes, or spins uselessly. Copying copies collapses fidelity, but copying while anchored to the master preserves it,” Williamson adds.
Model collapse would encourage hallucinations within LLMs, along with the loss of key training information, poorer decision-making, and degraded AI guardrails, all of which would erode business trust in affected AI models.
The three stages of model collapse
Models already undergo partial synthetic training, and for now the proportion of real data tends to remain higher than that of synthetic data. However, models can collapse when the percentage of synthetic data exceeds that of real data. There are three stages of model collapse:
Early collapse: The model starts to skip over or 'forget' rare, uncommon, and unusual information. Early collapse is unlikely to be noticed because the model still generates quality outputs.
Late collapse: The model loses much of the structure of the original data. It blends different patterns to generate outputs that may sound plausible but no longer resemble real-world situations, producing vague and low-quality responses.
Total collapse: A hypothetical end state reached when a very high proportion of synthetic data enters the training environment. The model enters a “self-consuming loop” and relies only on synthetic data.
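The 'forgetting' of rare information in early collapse can be sketched with a toy loop (the categories, probabilities, and sample size are invented for illustration): each generation re-estimates category frequencies from a finite sample of its own output, and the rare category tends to disappear.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "facts" in the data; index 2 is rare knowledge (1% of real data).
probs = np.array([0.94, 0.05, 0.01])

for gen in range(20):
    # Each generation trains on a finite sample of the previous
    # generation's output and re-estimates the category frequencies.
    sample = rng.choice(3, size=100, p=probs)
    counts = np.bincount(sample, minlength=3)
    probs = counts / counts.sum()
    # If the rare category draws zero samples in any generation, its
    # estimated probability becomes 0 and it can never be generated again.
```

The key property is that the loss is absorbing: once the rare category's estimated probability hits zero, no later generation can recover it, which is why early collapse goes unnoticed until the damage is done.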
Errors causing the collapse
Errors in training data lead to model collapse, and the size of the training sample influences error rates. In theory, with an infinite number of samples, statistical approximation errors would vanish entirely. In practice, of course, infinite data does not exist on the internet: every dataset is finite.
This is why synthetic data, used correctly, is actually good for training quality AI models. The key is preparing the right mix of organic and synthetic data at every stage of the training process.
Every time a model learns from sampled data, some approximation error occurs. When a large volume of synthetic data enters the training cycle, statistical errors appear more frequently, and these small errors accumulate over time, leading to model collapse.
Statistical approximation errors are the primary cause of model collapse. Models perform regression tasks, essentially learning to draw links between disparate data points. Errors are deviations between true values and the approximations a model derives, which lead to models confidently outputting overgeneralized or flat-out incorrect answers. A collapsing model generates ever-greater deviations from true values.
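The link between sample size and statistical approximation error can be shown in a few lines (all numbers here are illustrative): the average error of an estimated mean shrinks roughly as 1/√n as the sample grows, which is why finite samples always leave residual error for collapse to feed on.

```python
import numpy as np

rng = np.random.default_rng(42)

TRUE_MEAN, TRUE_STD, TRIALS = 5.0, 2.0, 200

# Average absolute error of the sample mean, over many trials, for
# increasing sample sizes: it shrinks roughly like 1/sqrt(n), but
# never reaches zero for any finite sample.
avg_error = {}
for n in (10, 1_000, 100_000):
    errs = [abs(rng.normal(TRUE_MEAN, TRUE_STD, n).mean() - TRUE_MEAN)
            for _ in range(TRIALS)]
    avg_error[n] = float(np.mean(errs))
```

Each hundredfold increase in sample size cuts the average error by roughly a factor of ten, and only an (impossible) infinite sample would eliminate it.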
All models assign weights within a neural network; the higher a weight, the more influence that input has on the output. But the data on which AI is trained is only ever an approximation of a complex reality, and due to functional approximation errors, models can draw incorrect links within training data, leading them to generate false data outside the training distribution down the line. Functional approximation errors are a secondary cause of model collapse, and unlike statistical errors, they don't grow across generations.
Data poisoning is another, indirect route to model collapse. In this type of attack, hackers intentionally insert misleading data into training loops, so that models learn wrong patterns and generate false information. While this isn't a major risk right now, it's something that will have to be guarded against as future models are trained.
“Most organizations do not have enough high-quality data to train a model to the precision they need for good business outcomes, and few build foundation models from scratch,” says Christopher Royles, field CTO EMEA at Cloudera.
“The best way is to start with model selection,” he says, advocating for a firm business data strategy that deploys a blend of organic and synthetic data.
Even after selecting optimal training data, Royles says, it is impossible to prevent synthetic data from entering the loop. “Begin with real-world samples, labelled and enriched with metadata that capture quality,” he advises, sharing a step-by-step approach to dealing with model collapse: “Cluster the data to understand the statistical patterns within it, then generate synthetic records that mirror those patterns. LLM-as-a-judge can help here.”
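A minimal numpy sketch of this anchoring idea (the "retail" and "wholesale" segments, their statistics, and the record counts are all hypothetical; a real pipeline would use proper clustering and LLM-based validation): synthetic records are generated from statistics measured on labelled real data, so every generation stays anchored to the "master key" rather than to a previous model's output.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical labelled "real" records: two business segments whose
# two columns might be, say, order value and item count.
real = {
    "retail":    rng.normal([50.0, 3.0], [10.0, 1.0], size=(500, 2)),
    "wholesale": rng.normal([400.0, 20.0], [50.0, 5.0], size=(500, 2)),
}

synthetic = {}
for label, records in real.items():
    # Measure each segment's statistical pattern on the real data...
    mu = records.mean(axis=0)
    cov = np.cov(records, rowvar=False)
    # ...and generate synthetic records anchored to those measured
    # statistics, not to an earlier model's synthetic output.
    synthetic[label] = rng.multivariate_normal(mu, cov, size=500)
```

Because every synthetic batch is derived from statistics of the original data, repeated rounds of generation do not compound earlier approximation errors, in contrast with the A-to-B-to-C chain described earlier.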
Model collapse may prove inevitable, but it can be delayed as long as human-generated information maintains the larger share of training data, thanks in part to the years of big data accumulated before generative AI went mainstream.

Venus is a freelance technology writer specializing in IT, quantum physics, electronics, and other technical fields. She holds a degree in Electronics and Telecommunications Engineering from Mumbai University, India.
With years of experience in writing for global media brands and IT companies, she enjoys translating complex content into engaging stories. When she’s not writing about the latest IT trends, Venus can be found tracking enterprise trends or the newest processor in town.