What does it mean to ‘train’ AI anyway?
A survey of how individual data points become large training datasets
Hi, it’s Charley, and this is Untangled, a newsletter about technology and power.
👇ICYMI
March was busy at Untangled HQ:
I analyzed the interventions offered by research papers addressing the anti-democratic state of the internet.
I argued that automation is a structural and political problem, not an individualized one, and that power lies with those who frame the problem.
I offered my take on the potential TikTok ban and explained how researchers embed our misunderstandings of AI into the scientific process.
I explained why Google’s Gemini isn’t ‘woke.’ Rather, Google papered over a systemic problem and it backfired.
I published an essay about a new category of crypto project — Decentralized Physical Infrastructure Networks, or DePIN — and how everything is a Ponzi in crypto, until sometimes, it’s not.
On to the show!
One year ago, I used 3,000 words to answer the question: what even is technology? This year, I want to follow up with another doozy: what is data? In this special issue, I outline how, especially in the frame of AI:
Data are made by us.
Data are classified into value-laden boxes.
Datasets are contorted by scale, shortcuts, and mental models.
This isn’t an academic exercise. AI is all about ‘the data’: who has it, whether it’s high quality, and so on. Tech giants like Google, Microsoft, and Facebook have had a head start in the sector because they already hoovered up all our data; OpenAI and other start-ups are testing the boundaries of copyright infringement by scraping the open web for data. We’re also seeing crypto projects find ways to decentralize data scraping.
These data are then turned into training datasets, which are used to train AI models that offer guesses about the future. If we don’t understand how data are made, we can’t understand AI and its attempts to recreate the world. Let’s dig in.
Data are made by us
Data don’t fall from the sky. Data are made by you and me. We interact and transact with one another. We do things out in the world. We engage with institutions. We click and scroll online. In the collection of essays “Raw Data” Is an Oxymoron, Lisa Gitelman and Virginia Jackson make clear that data are never raw but always situated in a historical and social context. The data generated by the actions we take are constrained, or nudged along, by the social norms, belief systems, institutional practices, policies, and cultural context of the day.
Let’s make this practical. In The Condemnation of Blackness, Khalil Gibran Muhammad argues that “From the beginning, the collection and dissemination of racial crime data was a eugenics project, reflecting the supremacist beliefs of those who created them.” As I wrote in Building Alternative Futures, “Historically, while social scientists conflated blackness with criminality, white criminality was explained away by structural inequities and poverty.” Muhammad sums it up this way:
“Crime statistics have never been just about behavior no matter how obvious it may seem that numbers speak for themselves. They are proxies for beliefs, a way of defining reality and seeing things. Whatever truth they represent in counting actual arrests or real prisoners is itself a reflection of intense social and political struggles.”
What gets recorded as ‘data’ isn’t self-evident or objective; it’s the output of a contestation of beliefs, norms, and power.
Old assumption: Data are raw and self-evident.
New assumption: Data are made via interactions with social systems.
Data are classified into value-laden boxes
Okay, so we take an action — big or small — in the world, and out pops a data point. Then that data point gets recorded in a particular way, or ‘classified.’ Take the US Census as an example. As I explained in Who are you?, the boxes we can check aren’t an afterthought; they’re inherently political.
“In the 1990s, a group of Americans argued that ‘multiracial’ should replace the problematic ‘other’ category in the race and ethnicity section. This was seen as more appropriate by these advocates because, as Bowker and Star write, it would ‘not force individuals to choose between parts of themselves.’ […] But many civil rights leaders disagreed on the grounds that if everyone selected ‘multiracial’ that would mean lost information and lost resources for specific groups. Distinct categories are needed so that oppressed groups and communities receive the political and economic resources they deserve. The Clinton Administration decided that it would allow people to check more than one box but would disallow the inclusion of ‘multiracial.’ When we check boxes in surveys, these decisions may seem inconsequential, but the boxes available to us and the ones we pick are laced with political choices beyond our reckoning.”
Of course, there are other times when classification occurs without political struggle, in a way that is invisible to us and impossible to question. That’s right, I’m talking about Google, Facebook, and nearly every other modern technology company that uses algorithms and millions of data points to construct and then situate us in li’l boxes. These companies use our data to classify our gender, needs, and values. But the data aren’t actually ‘ours’ in any meaningful sense — the companies own them and don’t care how we self-identify. No matter what mode of data collection we’re talking about (a form, an algorithm, or something else), how individual data points get recorded reflects choices made by someone, somewhere.
Once a classification scheme is in place, the not-at-all ‘raw’ data can be labeled to train a machine. These days, companies building large AI models often skip this step altogether and rely on unsupervised training (i.e., training on correlations and patterns in unlabeled data). Building a ‘supervised’ model (i.e., one trained on labeled data) takes time and money. In the great new resource “Models All the Way Down,” Christo Buschek and Jer Thorp estimate that it would take 781 years — working full time — to look at each image in the influential AI training set LAION-5B. Rather than do the labeling up front — and to accommodate this ridiculous scale — companies will try to convince us that data are objective and that if we stitch together a big enough dataset, it can represent a ground truth of reality. Other times, companies outsource the labeling: remember when OpenAI paid Kenyan laborers meager sums to label data that included violence, hate speech, and sexual abuse?
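To get a feel for that scale, here’s a rough back-of-the-envelope sketch in Python. The inputs are assumptions for illustration: LAION-5B contains roughly 5.85 billion image–text pairs, and I’m assuming one second of review per image and a standard 40-hour work week, which lands right around the 781-year figure.

```python
# Back-of-the-envelope: how long would it take one person to eyeball LAION-5B?
# Assumptions (for illustration only): ~5.85 billion image-text pairs,
# 1 second per image, 40 hours per week, 52 weeks per year.

NUM_IMAGES = 5_850_000_000                 # approximate size of LAION-5B
SECONDS_PER_IMAGE = 1                      # assumed review time per image
SECONDS_PER_WORK_YEAR = 40 * 60 * 60 * 52  # one year of full-time work, in seconds

years = NUM_IMAGES * SECONDS_PER_IMAGE / SECONDS_PER_WORK_YEAR
print(f"{years:,.0f} years of full-time work")  # prints roughly 781
```

The point isn’t the exact number; it’s that no one can meaningfully review a dataset of this size by hand.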
While investing time and money to label training data might minimize the problems above, it would generate others. The labeling process is infused with social and cultural biases: even if a team of developers in Silicon Valley can agree on a label like ‘hate speech,’ people across the world will have very different perspectives on whether a given piece of data deserves it.
Old assumption: Data are objective.
New assumption: Data reflect the values, biases, and decisions of people in power.
Datasets are contorted by scale, shortcuts, and mental models
Training datasets are the foundation upon which AI is built. They provide the examples from which systems generate supposed truths. But they don’t materialize out of thin air — they, too, are created, maintained, and managed by people working in organizations and companies with their own dynamics and incentive systems. In “The social construction of datasets: On the practices, processes, and challenges of dataset creation for machine learning,” Will Orr and Kate Crawford interview dataset creators, who describe creating high-quality datasets as “a practice that is in flux, and requires considerable individual judgment, hard work and an ongoing struggle for resources and ever-increasing scale.”