The Doom Loop of Synthetic Data
Apparently AI systems will continue to improve forever — but is that even possible?
Hi, and welcome back to Untangled, a newsletter and podcast about technology, people, and power. Let’s take a jaunt through July, shall we?
I wrote about the concept of emergence and how AI chatbots are really just knowledge sausages, explained why ‘Will AI take your job?’ is the wrong question to ask, previewed a new essay on ‘the artificial gaze’ and solicited your help, and urged all of us to stop comparing ourselves to AI.
I launched a workshop on managing conflict with my colleague Kate Krontiris — you still have two weeks to sign up!
I previewed Untangled’s first-ever ‘tiny book,’ called “AI Untangled.”
Now on to the show.
If you’ve only got a minute this Sunday, here’s the essay in three bullets:
The latest narrative in AI is that it will continue to improve in perpetuity: sure, these systems might produce unreliable or harmful results now, but just you wait, they’ll get better.
But new research forecasts that we’re likely to run out of high-quality data in the not-so-distant future.
In this world, companies are likely to use synthetic outputs (data generated by the models) as training data, leading to a degradation of the models themselves and, ya know, our shared reality.
Sometimes it feels like I’m playing a game of ‘narrative whac-a-mole.’ A silly AI narrative pops up — e.g. it has emergent properties! it hallucinates! — and then I (and many others) give it a whack. And this happens over and over.
The latest narrative in AI? Continuous improvement: sure, these systems might produce unreliable or harmful results now, but they’ll continue to get better. But what if that narrative isn’t true? What if we run out of high-quality data to train AI systems? That might sound like an absurd hypothetical — but a recent paper actually argues otherwise, predicting that we’ll run into a “bottleneck for training data” between 2030 and 2050.
The improvements to large language models we’ve seen over the last few years come from training them on lots of data. In particular, high-quality data — from books, scientific papers, public code repositories like GitHub, and other high-quality datasets like Wikipedia. The problem? It’s possible that GPT-4 was trained on trillions of words. As Ross Andersen of The Atlantic writes, “Ten trillion words is enough to encompass all of humanity’s digitized books, all of our digitized scientific papers, and much of the blogosphere.” So where will the next ten trillion words come from, how will they be digitized and stored, and who will have access to them?
In “Will we run out of data?” Pablo Villalobos et al. use a set of assumptions about the growth of cultural production, internet penetration rates, and compute availability to forecast that high-quality language data will be exhausted by 2027. The paper concludes, “If our assumptions are correct, data will become the main bottleneck for scaling ML models, and we might see a slowdown in AI progress as a result.” Even if their assumptions aren’t exactly right, it still stands to reason that it’s only a matter of time. ChatGPT is trained on data scraped from the web. As ChatGPT outputs make up a larger share of that web, future models will increasingly be trained on data produced by prior models — it’s like a snake eating its own tail.
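To make the shape of that forecast concrete, here’s a purely illustrative back-of-envelope sketch in Python. The numbers are placeholders I’ve chosen to show the dynamic, not the paper’s actual estimates: the stock of high-quality text grows slowly, while the amount of data used to train each new frontier model grows much faster, so the two curves cross within a few years.

```python
# Purely illustrative, with made-up growth rates (not Villalobos et al.'s estimates):
# a slowly growing stock of high-quality text vs. fast-growing training-data demand.

stock = 10e12          # assumed current stock of high-quality words (~10 trillion)
stock_growth = 1.07    # assume ~7% more high-quality text is produced each year
demand = 1e12          # assumed words used to train this year's largest model
demand_growth = 2.0    # assume training datasets roughly double each year

year = 2023
while demand < stock:
    stock *= stock_growth
    demand *= demand_growth
    year += 1

print(f"Under these toy assumptions, demand outstrips the stock around {year}.")
```

You can quibble with every one of those numbers, but as long as demand grows faster than supply, changing them shifts the crossover by years, not decades.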
Imagine a future where there isn’t enough high-quality data — what might companies building these systems do? Villalobos speculates on different ways we might be encouraged to record our own speech and share it with companies. For example, he imagines a future where we wear dongles around our necks! As ridiculous as that sounds, we already passively record data about ourselves for the benefit of profit-making companies just by having phones, browsing the web, and using IoT devices. So it’s not too hard to imagine a situation where companies market new technologies to us under the guise of wellness or productivity (or something), in order to capture even more data.
Technologists will likely find new ways to increase storage and compute availability, but it’s the high-quality data itself that is the bottleneck. We can’t just turn on the cultural production faucet and rapidly produce more books and scientific papers — though, I wouldn’t be surprised to see tech CEOs start funding these efforts as a bank-shot attempt to meet their data needs. Nor do I expect tech companies to just say ‘ah, well, we gave it our best shot, time to give up on this whole AI pipe dream.’ So the only other option is to use data of ever-dwindling quality.
The first stop on the search for lower-quality data will be user-generated content. Your texts, posts, Tweets, Stories, and Notes will be what train our future AI overlords. Already, OpenAI wants you to upload your own data and files to help train ChatGPT.
But Villalobos actually predicts that low-quality language data will be exhausted between 2030 and 2050. If low-quality data isn’t enough, what happens next?
Companies will start using synthetic data — the outputs generated by the models themselves — as subsequent training data. This might already be happening: a former Google AI engineer left for OpenAI after alleging that Google trained Bard, its chatbot, on data generated by ChatGPT. This is a big problem — consider the difference between training a model on scientific articles and training a model on probabilistic outputs, which are often inaccurate. Writing about image generation, Ted Chiang likens this process to “the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.” This, of course, hinges on the assumption that the quality of the generated data is worse than that of the original training data.
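Chiang’s analogy is easy to simulate. Here’s a minimal sketch under toy assumptions of my own: the “model” is just a Gaussian fitted to its training data, and each new generation is trained only on samples drawn from the previous one. That’s nothing like how real language models are trained, but it’s enough to show how error compounds.

```python
# A toy "photocopy of a photocopy": each generation's model is a Gaussian
# fitted to its training data, and each new generation trains only on
# samples drawn from the previous generation's model.
import numpy as np

rng = np.random.default_rng(seed=42)

# Generation 0 is trained on "real" data: mean 0, standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

for generation in range(1, 21):
    mu, sigma = data.mean(), data.std()        # "train" a model on the current data
    data = rng.normal(mu, sigma, size=200)     # its outputs become the next training set
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean = {data.mean():+.2f}, std = {data.std():.2f}")

# Each generation inherits the previous one's sampling error and adds its own,
# so the fitted distribution drifts away from the original data, with nothing
# pulling it back.
```

Real models and real data pipelines are vastly more complicated, but the core dynamic is the same: a model feeding on its own outputs, with no fresh signal from the world.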
Enter the idea of feedback loops from complex systems. In systems thinking, feedback refers to “any reciprocal flow of influence,” according to Peter Senge, author of The Fifth Discipline. In other words, Senge writes, “Every influence is both cause and effect. Nothing is ever influenced in just one direction.” In its current state, ChatGPT has been trained on more high-quality data than low-quality data. But over time that proportion is going to flip. This is the beginning of our loop:
An increase in low-quality data relative to high-quality data >>
Leads to a relative increase in the use of lower-quality data to train AI models >>
Leads to a relative increase in lower-quality data outputs >>