New study warns of ‘model collapse’ as AI tools train on AI-generated content

Artificial intelligence (AI) models could soon face a new problem as AI-generated content increasingly populates the Internet.

Large language models (LLMs) such as OpenAI's have relied on data available online for training and improvement.

However, as these models exhaust the available online data, or face increased restrictions on data access, they may train on AI-generated content.

This could result in a degradation of model performance, eventually leading to the production of gibberish content, a phenomenon referred to as “model collapse,” according to a new study.

“Over time, we expect that it will get harder to train the models, even though we are likely to have more data, just because it's very easy to sample data from the models,” Ilia Shumailov, junior research fellow at the University of Oxford and co-author of the study, told Euronews Next.

“But what's going to be happening is that it's going to be harder to find a population of data that is not actually biased,” he added.

The study, published in the journal Nature, discusses what happens when models are trained on data generated by AI over multiple cycles.

The research found that after a few loops of AI models generating and then being trained on AI-generated content, the systems start making significant errors and fall into nonsense.

A separate paper by Duke University researcher Emily Wenger demonstrates this through an experiment where an AI model is continuously trained on AI-generated content.

In the experiment, an AI model was given a set of data containing pictures of different dog breeds, with an overrepresentation of golden retrievers.

The study found that the model was more likely to generate images of golden retrievers than of other, less-represented dog breeds. As the cycle continued, it gradually left out the other breeds entirely, until it eventually started generating nonsense.
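The loss of rare breeds can be illustrated with a toy simulation. This is a minimal sketch, not the experiment described in the paper: it assumes each "model" simply memorises breed frequencies and that the next generation is trained on a fixed-size sample drawn from those frequencies. The function name, breed names, and counts are hypothetical.

```python
import random
from collections import Counter


def resample_generations(counts, generations, sample_size, seed=0):
    """Repeatedly 'retrain' on samples drawn from the previous generation.

    counts: dict mapping breed name -> number of training images.
    Returns a list of dicts, one per generation (including the original data).
    """
    rng = random.Random(seed)
    current = dict(counts)
    history = [current]
    for _ in range(generations):
        breeds = list(current)
        weights = [current[b] for b in breeds]
        # The next generation only sees what the previous one produced.
        sample = rng.choices(breeds, weights=weights, k=sample_size)
        current = dict(Counter(sample))  # breeds with zero draws disappear
        history.append(current)
    return history


# Hypothetical starting data with golden retrievers overrepresented.
start = {"golden retriever": 70, "dalmatian": 20, "pug": 10}
for generation, data in enumerate(resample_generations(start, 50, 100, seed=1)):
    if generation % 10 == 0:
        print(generation, data)
```

Once a breed draws zero samples in one generation, it can never return in later ones, so the number of breeds can only stay the same or shrink. The sketch does not reproduce the final "nonsense" stage, which depends on how a real generative model accumulates errors.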

Stages of ‘model collapse’

“Model collapse is basically defined by two stages. The first stage is what we call the early-stage model collapse, and what happens here is when a model learns from another model, you first observe a reduction in variance,” Shumailov said.

In this stage, aspects not initially fully understood by the original model will also be poorly understood by the subsequent model trained on the previous one's outputs.

This results in oversampling the well-understood aspects while neglecting other important ones simply because they were not fully clear to the initial model.
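The reduction in variance can be sketched with a very simple stand-in for a model. The example below assumes each generation is just a Gaussian fitted to a small sample drawn from the previous generation's Gaussian; the function name and parameter values are illustrative and are not taken from the study.

```python
import random
import statistics


def refit_gaussian(generations, sample_size, seed=0):
    """Fit a Gaussian to samples from the previous fit, over and over.

    Returns the fitted standard deviation at each generation,
    starting from the 'real' distribution with mean 0 and std 1.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    stds = [std]
    for _ in range(generations):
        # Sample from the current model instead of from real data.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Refit: the new model only knows what the sample contained.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        stds.append(std)
    return stds


stds = refit_gaussian(generations=200, sample_size=10, seed=0)
for generation in (0, 50, 100, 200):
    print(generation, stds[generation])
```

With small samples, each refit tends to underestimate the spread slightly, and the tails of the distribution are the first thing to go missing. Over many generations these small losses compound, so the fitted distribution narrows even though no single step looks dramatic.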

Then comes the late-stage model collapse.

This is when AI models are no longer useful due to earlier models introducing their own errors into the data.

The errors present in the initial data are passed on to the next model, which adds its own set of errors and passes them on as well.

As the data is continuously produced and recycled, the models start misinterpreting reality and making more errors.

“If there are some errors inside of the data that were generated by model one, they basically propagate into the next model. And ultimately this results in the model basically misperceiving the reality,” Shumailov explained.

Types of AI model errors

According to Shumailov, there are three types of errors that models could make: architecture errors, learning process errors, and statistical errors.

Architecture errors occur when the structure of the AI model is not fit to capture all the complexities in the data it is provided with, leading to inaccuracies as some parts are misunderstood or oversimplified by the model.

Learning process errors happen when the methods used to train the models have inherent biases, which pushes the model to make certain types of mistakes.

Finally, statistical errors emerge when there isn't enough data to accurately represent what the model is trying to learn. This can push the model to make predictions from incomplete information, resulting in errors.
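A back-of-the-envelope calculation shows why limited data hits rare cases hardest. The sketch below assumes training examples are drawn independently, so an item that appears with probability p is missing from a sample of n examples with probability (1 - p) to the power n. The function name is hypothetical.

```python
def prob_never_sampled(p, n):
    """Chance that an item with per-example probability p is absent from n independent examples."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return (1.0 - p) ** n


# How often is a 1%-frequency item completely absent from the sample?
for n in (10, 100, 1000):
    print(n, prob_never_sampled(0.01, n))
```

Under these assumptions, an item that makes up 1% of the data is completely absent from a sample of 100 examples about 37% of the time, and a model cannot learn what it never sees.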

Implications of ‘model collapse’

When models collapse, the main concern is that the rate of improvements in their performance may slow down.

AI models rely heavily on the quality of the data they're trained on.

However, when they're trained on AI-generated content, this data continuously introduces errors into the system.

“It's likely that we'll have to spend additional effort in basically filtering out the data. And this probably will mean that there may be a slowdown in improvement,” Shumailov said.

Moreover, as variance decreases and the data becomes less diverse, underrepresented data are expected to be disproportionately affected, which raises concerns over the inclusivity of the AI models.

“We need to be extremely careful in making sure that our models are fair and that they don't lose track of the minority data inside of them,” Shumailov said.
