The Doom Loop of Synthetic Data
Apparently AI systems will continue to improve forever — but is that even possible?
Hi, and welcome back to Untangled, a newsletter and podcast about technology, people, and power. Let’s take a jaunt through July, shall we?
I wrote about the concept of emergence and how AI chatbots are really just knowledge sausages, explained why ‘Will AI take your job?’ is the wrong question to ask, previewed a new essay on ‘the artificial gaze’ and solicited your help, and urged all of us to stop comparing ourselves to AI.
I launched a workshop on managing conflict with my colleague Kate Krontiris — you still have two weeks to sign up!
I previewed Untangled’s first-ever ‘tiny book,’ called “AI Untangled.”
Now on to the show.
If you’ve only got a minute this Sunday, here’s the essay in three bullets:
The latest narrative in AI is that it will continue to improve in perpetuity: sure, these systems might produce unreliable or harmful results now, but just you wait, they’ll get better.
But new research forecasts that we’re likely to run out of high-quality data in the not-so-distant future.
In this world, companies are likely to use synthetic outputs (data generated by the models) as training data, leading to a degradation of the models themselves and, ya know, our shared reality.
Sometimes it feels like I’m playing a game of ‘narrative whac-a-mole.’ A silly AI narrative pops up — e.g. it has emergent properties! it hallucinates! — and then I (and many others) give it a whack. And this happens over and over.
The latest narrative in AI? Continuous improvement: sure, these systems might produce unreliable or harmful results now, but they’ll continue to get better. But what if that narrative isn’t true? What if we run out of high-quality data to train AI systems? That might sound like an absurd hypothetical — but a recent paper actually argues otherwise, predicting that we’ll run into a “bottleneck for training data” between 2030 and 2050.
The improvements to large language models we’ve seen over the last few years come from training them on lots of data. In particular, high-quality data — from books, scientific papers, public code repositories like GitHub, and other high-quality datasets like Wikipedia. The problem? It’s possible that GPT-4 was trained on trillions of words. As Ross Andersen of The Atlantic writes, “Ten trillion words is enough to encompass all of humanity’s digitized books, all of our digitized scientific papers, and much of the blogosphere.” So where will the next ten trillion words come from, how will they be digitized and stored, and who will have access to them?
In “Will we run out of data?” Pablo Villalobos et al. use a set of assumptions about the growth of cultural production, internet penetration rates, and compute availability to forecast that high-quality language data will be exhausted by 2027. The paper concludes, “If our assumptions are correct, data will become the main bottleneck for scaling ML models, and we might see a slowdown in AI progress as a result.” Even if their assumptions aren’t exactly right, it still stands to reason that it’s only a matter of time. ChatGPT is trained on data scraped from the web. As ChatGPT outputs make up a larger share of that web, future models will increasingly be trained on data produced by prior models — it’s like a snake eating its own tail.
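To make the shape of that forecast concrete, here’s a purely illustrative back-of-envelope sketch in Python. The numbers are placeholders I’ve chosen to show the dynamic, not the paper’s actual estimates: the stock of high-quality text grows slowly, while the amount of data used to train each new frontier model grows much faster, so the two curves cross within a few years.

```python
# Purely illustrative, with made-up growth rates (not Villalobos et al.'s estimates):
# a slowly growing stock of high-quality text vs. fast-growing training-data demand.

stock = 10e12          # assumed current stock of high-quality words (~10 trillion)
stock_growth = 1.07    # assume ~7% more high-quality text is produced each year
demand = 1e12          # assumed words used to train this year's largest model
demand_growth = 2.0    # assume training datasets roughly double each year

year = 2023
while demand < stock:
    stock *= stock_growth
    demand *= demand_growth
    year += 1

print(f"Under these toy assumptions, demand outstrips the stock around {year}.")
```

You can quibble with every one of those numbers, but as long as demand grows faster than supply, changing them shifts the crossover by years, not decades.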
Imagine a future where there isn’t enough high-quality data — what might companies building these systems do? Villalobos speculates on different ways we might be encouraged to record our own speech and share it with companies. For example, he imagines a future where we wear dongles around our necks! As ridiculous as that sounds, we already passively record data about ourselves for the benefit of profit-making companies just by having phones, browsing the web, and using IoT devices. So it’s not too hard to imagine a situation where companies market new technologies to us under the guise of wellness or productivity (or something), in order to capture even more data.
Technologists will likely find new ways to increase storage and compute availability, but it’s the high-quality data itself that is the bottleneck. We can’t just turn on the cultural production faucet and rapidly produce more books and scientific papers — though, I wouldn’t be surprised to see tech CEOs start funding these efforts as a bank-shot attempt to meet their data needs. Nor do I expect tech companies to just say ‘ah, well, we gave it our best shot, time to give up on this whole AI pipe dream.’ So the only other option is to use data of ever-dwindling quality.
The first stop on the search for lower-quality data will be user-generated content. Your texts, posts, Tweets, Stories, and Notes will be what train our future AI overlords. Already, OpenAI wants you to upload your own data and files to help train ChatGPT.
But Villalobos actually predicts that low-quality language data will be exhausted between 2030 and 2050. If low-quality data isn’t enough, what happens next?
Companies will start using synthetic data — the outputs generated by the models themselves — as subsequent training data. This might already be happening: a former Google AI engineer left for OpenAI after alleging that Google trained Bard, its chatbot, on data generated by ChatGPT. This is a big problem — consider the difference between training a model on scientific articles and training a model on probabilistic outputs, which are often inaccurate. Writing about image generation, Ted Chiang likens this process to “the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.” This, of course, hinges on the assumption that the quality of the generated data is worse than that of the original training data.
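Chiang’s analogy is easy to simulate. Here’s a minimal sketch under toy assumptions of my own: the “model” is just a Gaussian fitted to its training data, and each new generation is trained only on samples drawn from the previous one. That’s nothing like how real language models are trained, but it’s enough to show how error compounds.

```python
# A toy "photocopy of a photocopy": each generation's model is a Gaussian
# fitted to its training data, and each new generation trains only on
# samples drawn from the previous generation's model.
import numpy as np

rng = np.random.default_rng(seed=42)

# Generation 0 is trained on "real" data: mean 0, standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

for generation in range(1, 21):
    mu, sigma = data.mean(), data.std()        # "train" a model on the current data
    data = rng.normal(mu, sigma, size=200)     # its outputs become the next training set
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean = {data.mean():+.2f}, std = {data.std():.2f}")

# Each generation inherits the previous one's sampling error and adds its own,
# so the fitted distribution drifts away from the original data, with nothing
# pulling it back.
```

Real models and real data pipelines are vastly more complicated, but the core dynamic is the same: a model feeding on its own outputs, with no fresh signal from the world.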
Enter the idea of feedback loops from complex systems. In systems thinking, feedback refers to “any reciprocal flow of influence,” according to Peter Senge, author of The Fifth Discipline. In other words, Senge writes, “Every influence is both cause and effect. Nothing is ever influenced in just one direction.” In its current state, ChatGPT has been trained on more high-quality data than low-quality data. But over time that proportion is going to flip. This is the beginning of our loop:
An increase in low-quality data relative to high-quality data >>
Leads to a relative increase in the use of lower-quality data to train AI models >>
Leads to a relative increase in lower-quality data outputs >>