Good afternoon. I'm Carol Christ, the Chancellor of the University of California at Berkeley. It gives me great pleasure to welcome all of you virtually to the Berkeley campus this afternoon for the Tanner Lectures on Human Values. This distinguished lecture series is presented annually at each of nine universities, including UC Berkeley. The others are Cambridge and Oxford and, here in the United States, Harvard, Michigan, Princeton, Stanford, Utah, and Yale. The series was founded in 1978 by the American scholar, industrialist, and philanthropist Obert Clark Tanner, who was a member of the faculty of philosophy at the University of Utah and an honorary fellow of the British Academy. Tanner's goal in establishing the lectures through the Tanner Philanthropies was to promote the search for a better understanding of human behavior and human values. He hoped that the lectures would advance scholarly and scientific learning in the area of human values and contribute to the intellectual and moral life of humankind. Human values are defined as broadly as possible, and lecturers may be chosen from any discipline. The lectureships are international and transcend national, religious, and ideological distinctions. Lecturers are chosen for their uncommon achievement and outstanding abilities in the field of human values. Today's lecture is one of seven special Obert C. Tanner Lectures on AI and Human Values. The other special lectures in this series will be presented at Cambridge, Michigan, Oxford, Stanford, Utah, and Yale. The Tanner Lectures Board plans to publish a volume collecting all seven of these special lectures. The special lecture at Berkeley was organized through a collaboration of the Berkeley Tanner Lectures Committee with Jennifer Chayes, Associate Provost of the Division of Computing, Data Science, and Society, and Ken Goldberg, Professor in the Department of Industrial Engineering and Operations Research. I'd like to thank them for their work on this event and for their excellent decision to invite Kate Crawford to present this special lecture at Berkeley. Like many events over the past two years, this event has been affected by the ongoing pandemic, which has already caused us to postpone it twice. Rather than postpone it for a third time, we've decided to host it today in virtual format, which will enable us to broadcast it to a larger audience. Now let me call on my distinguished colleague, Professor Jennifer Chayes, to introduce Kate Crawford. Jennifer will also moderate the discussion that follows.

Thank you, Chancellor Christ. I'm delighted to be with you all today and very excited to present today's lecture by our Tanner Lecturer, Kate Crawford. Her lecture is entitled Excavating Ground Truth in AI: Epistemologies and Politics in Training Datasets. This should be a fascinating look at artificial intelligence and its far-reaching implications. Today Kate Crawford will present her lecture on this subject, and tomorrow at the same time we will hold the seminar discussion for this lecture, Art, Activism, and AI. That will be more of a conversation with expert commentators: Marion Fourcade, Professor of Sociology; Sonia Katyal, Associate Dean for Faculty Development and Research in the Law School; and Trevor Paglen, artist and geographer. And now some information about our lecturer. I will begin with a few personal remarks. Kate Crawford is a dear friend of mine.
I met her about 12 years ago when she was a visitor from the University of Sydney to one of the institutes I founded and led at Microsoft Research. I was immediately struck by Kate's stunning ability to see the deep societal impacts of emerging technologies. Within a year, I had hired Kate away from the University of Sydney to join our institute. It has been wonderful to be Kate's manager, mentor, and friend for the past 12 years, and to see her grow into one of the world's leading scholars of the societal implications of AI. Now for a more formal introduction. Professor Kate Crawford is a leading scholar of the societal impacts of artificial intelligence. She is a research professor at USC Annenberg, a senior principal researcher at Microsoft Research, and an honorary professor at the University of Sydney. She is also the inaugural visiting chair for AI and Justice at the École Normale Supérieure in Paris, where she co-leads the International Working Group on the Foundations of Machine Learning. Over her 20-year research career, Kate has produced groundbreaking research in publications such as Nature, Science, Technology and Human Values, and AI and Society. She has advised policymakers at the United Nations, the White House, the Federal Trade Commission, and the European Parliament, where she is on the International Advisory Board of the Panel for the Future of Science and Technology. She received the Ayrton Prize from the British Society for the History of Science for her project Excavating AI with Trevor Paglen, who will be one of the discussants tomorrow, and her large-scale investigative project Anatomy of an AI System, with Vladan Joler, won the Beazley Design of the Year award in 2019 and is in the permanent collections of both the Museum of Modern Art in New York and the V&A in London. Professor Crawford's latest book, Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence, has been described as a fascinating history of data by the New Yorker and an urgent contribution by Science, and was named one of the best books of 2021 by the Financial Times and New Scientist. Before the lecture begins, a little bit of housekeeping due to the virtual format of the lecture. Because our audience is viewing the lecture online, we will have to gather our questions in that format. On the event page for this lecture, there is a button for submitting questions. If you find during the lecture that you would like to ask a question, please submit your question via the button and form. We will be sure to view all the questions submitted on the form and pose as many as we can during the Q&A session at the end of the lecture, possibly also answering some of them tomorrow. We encourage you to ask questions that relate to this lecture. And now, without further delay, I am pleased to present to you our Obert C. Tanner Lecturer on Artificial Intelligence and Human Values, Kate Crawford.

Thank you so much, Dean Chayes. I'm now going to try the wonderful moment of switching to slides. So let's see how we go. Well, first, I want to begin by saying that I am deeply honoured to be joining you here today. And I'd like to start with some thank yous. First, my particular thanks to Chancellor Christ for inviting me to give this Tanner special lecture, along with the Tanner committee. And my thanks, of course, to Dean Chayes, both for her kind introduction and her personal mentorship.
I'm deeply grateful also to the extraordinary trio of scholars and artists who will be joining our discussion tomorrow, Marion Fourcade, Sonia Katyal, and Trevor Paglen. And finally, I'd also like to thank Jay Wallace, Jane Fink, and Ken Goldberg for all of their planning, without which none of this would be possible. My only wish is that we could be gathered together in person. But as we know, these are unusual times. Nonetheless, it is an important moment for us to be considering the values imported into our technical systems. So today, I want to talk about how truths are made. This has long been a topic in philosophy, the history of science, and science and technology studies, where every tale comes with a twist, and the idea of universal truths is complex and deeply fraught. But it has become increasingly urgent that we look more closely at how truth operates within machine learning. Machine learning systems are materially influencing billions of people every day, from recommending who gets a job interview and who is offered bail, to directing newsfeeds, police resources, and autonomous weapons. How we train them to interpret and make decisions about the world is not only a technical question; it is also increasingly a part of the stuff of everyday life. So my talk has several interwoven threads. One is a tale of epistemology, of how AI systems are making new forms of knowledge. It is also a proposition about ontology, how reality is constructed at the level of large collections of data. And it's a planetary story about how technical designs have direct impacts on the earth and the wider ecologies that we live within, at a time of overlapping crises, where autocrats are violently seizing power, where the world is still reeling from a pandemic, and with a climate catastrophe already well underway all around us. How we live in this world and how we find ground truth and common ground are amongst some of the most urgent problems that we face. So back in 2008, Google launched Project Ground Truth, which aimed to harvest all the data from Street View and satellite imagery along with multiple sources of local information. An engineer described it this way: we are building a mirror of the real world; anything that you can see in the real world needs to be in our databases. Indeed, much activity in machine learning can be thought of as a practice of trying to mirror the world, of taking data and from it producing a highly detailed machinic portrait. Now, this may sound relatively unremarkable to our ears in 2022, but it is at its root a profoundly radical project. It echoes the kind of seismic shift brought about by the invention of photography in the 19th century and the invention of artificial perspective in the 15th century, ways of seeing that were highly constructed but came to be seen as natural. When Alberti outlined artificial perspective in 1435, it was gradually accepted across Western Europe and ultimately came to be seen as an infallible method of representation, what W.J.T. Mitchell called a mechanical production of truth about the material and mental worlds. It was a profound invention that at the same time denied its own artifice by laying claim to being just the way things really look. And as artificial perspective completely transformed our understanding of vision in ways that are now completely naturalized, so too is artificial intelligence altering the relationships between humans and the world under the banner of science and objectivity.
It has created a new regime of truth. As Iris Murdoch once wrote, ethics and epistemology are always very closely related, and if we want to understand our ethics, we must look at our epistemology. And this is why we'll begin here at the epistemological level, to see how worldviews and values are built in at the technical layer. By digging deeper and excavating into the idea of ground truth in AI, we can begin to see the ethical limits, the assumptions, and the ways of seeing that are encoded about humans and the wider world. So my aim here is to critically examine these ideas in the context of the history of AI, and then widen out the grounds in order to contend with the impacts of planetary-scale computation, to include labor, minerals, energy, and the ecologies from which these systems draw so much. So today we'll begin here, to look closely at the idea of ground truth in AI, and to see what it produces in the world and how it might be imagined differently. So what do I mean by ground truth? Well, the phrase originates from the German word Grundwahrheit, literally foundational truth. And historically, the concept of ground truth comes from multiple domains. The earliest uses in the 1800s are theological, referring to the truth that is grounded on the earth, that is, the experience of humans in the world on the ground, as opposed to the truth of God. In the environmental sciences in the 20th century, it came to mean the truth of the ground in crop mapping and remote sensing, such as measuring fields of wheat, barley, potatoes, and other staple crops. But it was most strongly adopted in military aerial reconnaissance in the mid 20th century. In the CIA archives, in fact, ground truth is first defined in a 1964 paper as the actual state of the terrestrial surface environment in support of airborne remote sensor operations. Now, capturing ground truth was a project hotly pursued by the CIA and the US Air Force with their first reconnaissance satellite program, known as Corona. The first Corona satellites went into polar orbit in 1959 and took a series of quite extraordinarily detailed photographs of Soviet terrain using giant rolls of 70mm film. When the film cartridges were full, the satellites would somewhat spectacularly eject them, to be caught by Hercules aircraft that were fitted with giant nets. These images provided an early example of ground truth of Soviet military bases in the 1960s, caught as they fell literally from the sky. So in these disparate contexts, ground truth signifies information drawn from the direct observation of the earth. We're speaking most literally of the ground itself. It's compared against maps and models as a way of verifying the situated and specific realities of a terrain. But in the last 50 years, the term has been enthusiastically adopted in the field of computer science, and here it comes to mean something quite different. In machine learning, the term ground truth refers to a set of training data that is used to train an algorithmic model to identify patterns in the world. In supervised machine learning, which is what I'll mainly be focusing on today, training data is collected and then commonly labeled by humans, using a distributed labor force on platforms like Amazon Mechanical Turk. More training data can be used to validate a model, by providing it with data it hasn't encountered before to see how well it performs.
And finally, once a model is built, testing data is used to assess whether the model can make accurate predictions and to compare its probabilistic similarity to ground truth. So training data is a foundational part of how machine learning works today. These datasets might be filled with images, words, sounds, policing records, demographic information, credit histories, and geolocation data, depending on the task at hand, be it recognizing faces, filtering email, or directing autonomous vehicles. It is the substance from which machine learning models make meaning. Now, some of the largest and most well-known datasets today, like ImageNet or Common Crawl, are created by scraping the internet, trawling through millions of images or texts that end up captured in enormous databases. But unlike military satellite images or remote crop sensing, these dragnets of data are not being calibrated against an external reality. They are produced. They do not drop from the sky. The theorists Abelardo Gil-Fournier and Jussi Parikka observe that ground truth in machine learning does not distinguish between sources of data, but refers rather to the distinction between outcomes produced by a model and its expected values. So in this sense, it is a form of ground truth that is read from a mass of images rather than verified against the ground itself. They are talking here about computer vision, on which more later, but it does reveal the kind of self-referential loop that can occur when datasets become ultimately ungrounded from their material origins. So consider this common assignment that many students are given in computer vision courses: build a system to classify a group of images into two categories, say those that contain a dog and those that do not. Now, to many of us, this might sound like an obvious or overly basic issue, a solved problem, but there is no platonic form that dictates these decisions. They are subjective and have considerable gray areas. For example, have a look at this collection of images. Which ones would you label as a dog? Do toy dogs count? What about wolves or cartoons or prairie dogs or hot dogs? Where do you draw the line? Well, there are no universally right answers here, which is why the designer of an AI system holds the power to decide what the truth of the world will be as defined by a training set. This process is literally called ground truthing, an active practice of making data materials into what will be the foundational truth for a technical model. And it is, from the very outset, a very human, highly cultural, and highly subjective construction. Categorization questions also plague contemporary vision systems when they try to disambiguate similar images that may seem dramatically different to us, such as the classic example of Chihuahuas and blueberry muffins. As one popular textbook on computer vision observes, compared to other topics, little formal analytic work has been published to guide the creation of ground truth data. In general, the size of the dataset is key to its accuracy, and the larger, the better. Training data, then, is at the core of truth claims in machine learning. But the authority of training data comes not from calibrating measurements in the real world, but through scale. The larger it is, so the story goes, the more reliable it becomes. But this too raises questions. When is the data big enough to be sufficient to encompass all the variety in the world? Here we begin to glimpse the uncertainties within the practices of machine learning.
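To make those mechanics concrete, here is a minimal sketch of how ground truth functions in a supervised pipeline. This is my own illustration, not code from any system discussed in this lecture: the features and labels are synthetic, and the point is that "accuracy" measures nothing more than agreement with whatever labels humans assigned.

```python
# A minimal sketch of supervised learning's relationship to ground truth.
# All data here is synthetic; the labels stand in for human annotators'
# judgment calls (is a cartoon a dog? a wolf? a toy?).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))  # stand-in for extracted image features

# The "ground truth": a vector of human-assigned labels. Every contested
# boundary decision is frozen into these 0s and 1s before training begins.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Held-out data plays the role of ground truth at evaluation time.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# "Accuracy" means agreement with the annotators' decisions, not
# correspondence with any external reality.
print(f"Agreement with labeled ground truth: {model.score(X_test, y_test):.2%}")
```

Nothing in this pipeline can register whether the labels themselves were good answers to the dog question above; the model is scored only against the choices already made.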
As Florian Jaton observed in his study of the practice of ground truthing in technical labs, we get the algorithms of our ground truths. Yet the practice of making ground truth is one of the least explored in ML, and it raises, I think, the most profound epistemic and political questions. What we count as ground truth matters. So tonight, we'll consider the trouble with truth claims in AI. I'll argue first that the reason we need a much deeper engagement with training data is that it has become such an important foundation to how machine learning produces a worldview. It is the layer at which we can see political outlines take shape. Second, I'll argue that the turn to large-scale training data freights along with it ethical and environmental consequences. But as we'll see, they are deeply enmeshed in ways that are rarely considered. Ultimately, this is a call for a material ethics of AI: that we should look critically and closely at the data material that becomes ground truth, and see it as necessarily connected to the labor practices behind it, the energy requirements of a model, and the ultimate ends to which a system is used. So I'll consider this concept of ground truth tonight in four chapters, in order to show you that it is far less grounded than is commonly assumed. First, we'll go to the training ground, and look at the histories of data and AI that brought us to this point. Then to slippery ground, to look at the instability of concepts in training data and the role of human labor in making datasets. Then contested ground: how scientifically controversial ideas end up being built into technical systems as though they are received facts. And finally, poisoned ground: the ecological and environmental consequences of planetary-scale computation. Chapter one, training ground. Let's begin with voice. While assistants like Amazon's Alexa or Apple's Siri have become commonplace, they represent a particular approach to how data should be gathered and used. The story behind speech recognition gives us a glimpse into how we got to this moment, what's often called the statistical turn: the move away from trying to get computers to understand us, toward programming them to use data to predict us. The story begins here at IBM's computational speech recognition lab in the 1970s. Now, at this point in the history of AI, knowledge-based or expert systems approaches were in fashion, modeling human language production and perception and then teaching computers grammatical principles and linguistic features. But that all changed in 1972, when Fred Jelinek was hired to lead a new group. Instead of building AI based on human-derived expert knowledge, Jelinek believed in a data-driven approach above all else. In an excellent study of the CSR group, Xiaochang Li shows how he began using statistical methods to analyze how often words appeared in relation to one another. But making this statistical approach work required an enormous amount of data. Natural speech data back then was hard to come by. They tried IBM technical manuals, children's books, patents for laser technology, books for the blind, and even the typed correspondence of the IBM fellow Dick Garwin, who created the first hydrogen bomb design. But it just didn't sound like natural speech, and it wasn't enough words. But then they hit a jackpot: a major federal antitrust lawsuit against IBM.
The proceedings lasted for 13 years, a thousand witnesses were called, and this created a corpus of 100 million words, which became the test bed for their work. And then they began to get results. Robert Mercer became a key figure in the CSR group, the man who would much later become famous as the reclusive billionaire who backed Donald Trump, Steve Bannon, Breitbart News, and Cambridge Analytica. But back in 1985 he coined a phrase that would stay with the field: there's no data like more data. The CSR group represents a sea change in the computational sciences. In Li's words, they're emblematic of an approach that would repeat for decades: the reduction from context to data, from meaning to statistical pattern recognition. The aim was to strip away human expertise in preference for data-driven probabilistic techniques, or so-called brute force approaches. Jelinek described the shift this way: physicists study physical phenomena, linguists study language phenomena; engineers learn to take advantage of the insight of physicists but are yet to make use of the insight of linguists. Well, I think this is an excellent description of the type of disciplinary worldview that is at the root of these approaches, prioritizing computational pattern recognition in the manner of a physicist because it was producing quicker results and at scale. Jelinek had a favorite joke that he liked to tell about the lab. Whenever I fire a linguist, he would say, our system performance improves. Today we see the philosophy behind those early experiments at IBM deployed at scale in neural network-based large language models like GPT-3. GPT-3 is trained on a huge corpus of text known as the Common Crawl, which was scraped from 60 million domains across the internet, including mainstream sources like the New York Times and the BBC, but others less so, like Reddit and teen chat forums. If you give GPT-3 text, it will predict what sentences are most likely to come next. But even the designers of GPT-3 will admit that it has significant problems, because there is, for example, so much content on the web that sexualizes women. GPT-3 places words like beautiful, naughty, or sucked near female pronouns, while men get adjectives like fantastic, jolly, and personable. When it comes to religion, Islam is most commonly placed near words like terrorism. Now, when Luciano Floridi and Massimo Chiriatti tested GPT-3, and you can see a prompt that they used on this slide, they found it to reiterate what they called humanity's worst tendencies, as it kept making dehumanizing comments about women, Jews, black people, and so on. But their conclusion was that only confused people would use GPT-3 to get ethical advice, and that they would be better off relying on their own moral compass. This understates, to my mind, the seriousness of what happens when these engines are being built into the fabric of text work, such as automating copywriting and headline generation, and employed in multiple kinds of white-collar office tasks. We are looking here, I believe, at a foundational problem in the way that the statistical approach to language removes meaning, context, source reliability, or culture. It becomes a predictive engine for showing frequency, where the vast flotsam and jetsam of the internet is its ground truth. As we've seen, these epistemic shifts in the history of AI, away from expert systems and towards data-driven probabilistic techniques, also come with different ways of manufacturing ground truth.
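For the technically minded, the statistical turn can be captured in a few lines. The sketch below is my own toy illustration, not IBM's system or GPT-3: it predicts the next word purely from co-occurrence frequencies in a corpus, with no grammar, meaning, or context involved.

```python
# A toy frequency-based language model in the spirit of the statistical
# turn: the next word is whatever most often followed the current word.
from collections import Counter, defaultdict

# A tiny hypothetical corpus; Jelinek's group had 100 million words.
corpus = "the court called a witness and the court heard the witness testify".split()

# Count bigrams: how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation: frequency as 'truth'."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # -> 'court' (the most frequent continuation)
```

Scale this counting up by many orders of magnitude and add neural network weights, and you have the family resemblance to today's large language models: whatever appears most often in the scraped corpus becomes the most probable, and therefore the most spoken, answer.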
And when large datasets become ground truth, they gain enormous power in shaping technical systems. And this is why a close focus on that data is needed, what Trevor Paglen and I have called an archaeology of datasets: carefully sifting through the material layers, cataloging their taxonomies and principles, and analyzing what normative patterns of life they assume and reproduce. This is a fairly atypical practice, as training data is rarely looked at very closely by ML designers. It has taken on the status of an infrastructure, to be applied as an aggregate mass to solve a problem. But by paying close attention to the ground truth data of machine learning, we can see how these systems also produce new social and political perspectives. We can think of this as a shift to when prediction becomes production. So let's consider how this phenomenon works in computer vision, and in particular in face and object detection, in order to look at the shifts in these ground-truthing practices in just the last half century. In the 1960s, Woody Bledsoe became one of the first people to attempt facial recognition. He had been funded by the CIA to see if he could find some sort of mathematical approach to detecting the same face across different photographs. His data was a collection of just a dozen photographs of men, which he used to mathematically derive the facial landmarks of individuals. By 1970, we have one of the first public demonstrations of face recognition, at the Osaka World's Fair. At a highly popular attraction made by NEC, visitors would sit in front of a television camera, have their picture taken, and a computer program would extract several feature points from the face. It would then tell the visitor which of the seven celebrities they had on file they most resembled. The thousands of people who visited that exhibit had their faces scanned, and unbeknownst to them, their faces would become part of a ground-truth training set for the team to test algorithms for years. The exhibit, foreshadowing what was to come, was called Computer Physiognomy. The need to acquire face images by whatever means possible also drove the FERET program in the 1990s. The Department of Defense Counterdrug Technology Program, along with DARPA and the Army Research Laboratory, wanted a high-resolution, large-scale collection of human faces to see if face recognition was ready to be applied to intelligence and law enforcement. They created a training set of just over 14,000 portrait images, using people from nearby military bases, and these images became a standard benchmark, a shared measurement of ground truth to compare different algorithms. As the biometrics scholar Kelly Gates has observed, a government-produced benchmark had the direct result of lending legitimacy to the technology and creating commercial interest in making face recognition profitable. However, the arrival of the internet in so many ways changed everything. Researchers and developers came to see it as a natural resource, there to be mined. By the 2000s, training sets began to reach a size that scientists in the 1980s could only dream of. Millions of images and trillions of lines of text were published every day. It was all grist to the mills of ML, and it was an epistemic shift away from constructing datasets by staging photo shoots and acquiring participants' consent. As Lorraine Daston and Peter Galison have observed in their history of objectivity, major shifts in scientific production are also about the creation of new epistemic virtues.
The turn to internet-scraped datasets represents a virtue of scale, of ground truth at arm's length, over a ground truth that was once constructed with people as participants. This brings us to one of the most well-known and influential image training sets ever made. ImageNet is a colossus of object recognition in machine learning and a benchmark. It was first conceptualized by Fei-Fei Li, who published it with her research team in 2009 with the grand aim of mapping out the entire world of objects. And this is where we see an enormous shift in scale and a new practice for ground truthing. The ImageNet team scraped search engines like Yahoo to mass-harvest over 14 million images, which would then be organized into more than 20,000 categories. The underlying structure was adopted from the semantic structure of WordNet, a database of word classifications developed at Princeton in the 1980s. The ImageNet researchers selected only the nouns, on the idea that nouns are things that pictures can represent, and that this would be sufficient to train machines to recognize objects. But how could they hand-label that many images into noun categories? Well, the first plan was to hire undergraduate students for $10 an hour to find images manually and add them to the dataset. But that would have taken decades to complete, and so the idea was abandoned. The team contemplated using an algorithm to try to cluster the data, but it was quickly realized that the quality of labeling would be seriously compromised. The answer came with the launch of what was then a new service, Amazon Mechanical Turk. This platform meant that it was suddenly possible to access a distributed labor force to do online tasks like labeling and sorting images at scale and at low cost. This was a different kind of ground truthing, with workers being asked to sort up to 50 images a minute into categories, with very little context as to what it was all for. These crowd workers would become surrogates of ground truth. They were charged with making the labels for the ImageNet team, on the assumption that no specific expertise was needed and that all people will have the same contextual understanding of images regardless of cultural context. This assumption is even stranger when we realize that the 49,000 click workers who worked on ImageNet came from over 150 countries. These surrogates weren't particularly trusted either. The ImageNet team would task multiple workers to label the same image, and then use an algorithm to determine if there was sufficient agreement around a particular label. Just as at IBM, where the rules of grammar and linguistic properties were discarded, so too was expertise from the social sciences, philosophy, or art history, the many disciplines that study the relationship between images, labels, and the larger world. Instead, what they did was download a mass of images and outsource the work of placing them into predetermined categories. The click workers were not given any option to interpret or to reject labels altogether, or given the context of what they were building. Their analysis was less important than their verification. It's analogous in some ways to Fred Jelinek firing his linguists. Ultimately, the faith was placed in scale: with enough images and enough crowd workers, you produce ground truth statistically. Humans were a sorting mechanism for ground truth that could be algorithmically ranked.
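That agreement algorithm is worth pausing on, because it is where interpretation gives way to statistical consensus. Below is a deliberately simplified sketch of the idea; the fixed threshold and the example labels are hypothetical, and as I understand it, ImageNet's actual pipeline varied the required number of agreeing workers per category rather than using a single cutoff.

```python
# A simplified majority-vote aggregator for crowd labels. Several workers
# label the same image; a label is accepted only if enough of them concur.
from collections import Counter

def aggregate(worker_labels, min_agreement=0.75):
    """Accept the majority label if agreement meets the threshold."""
    label, count = Counter(worker_labels).most_common(1)[0]
    return label if count / len(worker_labels) >= min_agreement else None

print(aggregate(["dog", "dog", "dog", "wolf"]))  # -> 'dog': 3 of 4 agree
print(aggregate(["dog", "wolf", "toy", "dog"]))  # -> None: no consensus
```

Note what this function cannot express: a worker who thinks the category itself is wrong has no way to say so. Disagreement simply disappears into the denominator.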
Here we see how images, like speech, are assumed to be self-explanatory. If we think about this as a practice of ground truthing, it's neither an expert systems approach nor a solely data-driven method. It is a human-machine hybrid, but one where the humans are given very little control or time; ground truth would be made by severely constraining what humans could do. And this returns us to the argument that the practice of how ground truth data is made is itself a form of politics, where those who are designing a system have the power to decide the worldview, while distributing the labor of production to those who do not have the same agency as full participants. Which brings us to chapter two, slippery ground. How can we see the ways in which ground truth collapses in machine learning? Well, let's return to the example of ImageNet and look more closely at the images and labels themselves, to get a richer sense of how these epistemological slippages occur. ImageNet's structure is labyrinthine, vast, and filled with curiosities. The first indication of the true strangeness of ImageNet's worldview is the nine top-level categories that it drew from WordNet: plant, geological formation, natural object, sport, artifact, fungus, person, animal, and miscellaneous. Now, these are pretty curious categories into which all else must be ordered. Below that, it spawns into thousands of strange and specific nested classes into which millions of images are housed. There are categories for hot pants, hot plates, hot pots, hot rods, hot sauce, hot springs, hot toddies, hot tubs, hot air balloons, and hot fudge sauce. It is a riot of words, reminiscent of Borges's mythical encyclopedia. At the level of images, it looks like madness. Some images are high-resolution stock photography. Others are blurry phone photographs. Some are cartoons. There are pinups, religious icons, pornography, children's school photos, Hollywood celebrities, and politicians. It veers wildly from the professional to the amateur, from the sacred to the profane. In the case of the over 20,000 categories that were originally in the ImageNet hierarchy, noun classes such as apple might seem relatively uncontroversial, but not all nouns are created equal. To borrow an idea from the linguist George Lakoff, the concept of an apple is a more nouny noun than the concept of, say, debt. Nouns occupy various places on an axis from the concrete to the abstract, from the descriptive to the judgmental. And we see this at work in the almost 3,000 categories used to classify people. The categories with the most associated pictures are things like gal, where some of the images are very sexualized, and chief executive officer, which you can see here, where the majority are men in suits. With these highly populated categories, you can already begin to see the outlines of a worldview. ImageNet classifies people into a huge range of types, including race, nationality, profession, economic status, character, and even morality. And some of the concepts aren't even visual at all, as you can see from this one, the concept of a debtor. What does that mean as a visual concept, and how can you see a person's bank account? And gradually the nouns become moral judgments, as we see here in the category of bad person. Others are more unpleasant. Here there are labels such as phonies and embezzlers, swindlers, forgers, and traitors.
Then there are many, many categories that are genuinely offensive, deeply racist, misogynist, and ableist, that are simply not repeatable in this talk. So let's remember the people contained in the categories we've been thinking about tonight: they had no idea that their graduation photos or their holiday snaps are in one of the most influential training datasets in AI history, under these kinds of categories. This becomes particularly concerning when models are trained on ImageNet's person category, which has been happening for over a decade. So here you can see the famous picture from the Situation Room during the Bin Laden raid, as labeled by a model trained on ImageNet's person category. We can see men in ties being listed as microeconomists, but Hillary Clinton is labeled as a sick person. And you can also see President Biden listed as an incurable. In late 2019, in response to mounting criticism, the ImageNet team revised the dataset. First, they created labels to classify content as offensive, sensitive, or safe, and they asked 12 graduate students to put the images into these categories. Once they removed the offensive and sensitive categories, they repopulated what was left with, in their terminology, more diverse images: more images of different assumed gender, race, and skin tone. The result is that more than half of the categories disappeared, along with over 600,000 images. The rest were deemed safe. But what constitutes safe when it comes to classifying people? The removal of dehumanizing content is not wrong, but it doesn't contend with the epistemic question. The taxonomy of ImageNet reveals the complexities and dangers of human classification. In the metaphysics of ImageNet, terms like microeconomist or basketball player may initially seem less concerning or offensive than labels like unskilled person or redneck. But when we look at the people who are labeled in these categories, we see many troubling stereotypes. In fact, there are no neutral categories in ImageNet, because the selection of images always interacts with the meaning of words. So politics are baked into our classificatory logics, even when the words aren't offensive. When we look closely at ImageNet, everything is flattened out and pinned to a label, like taxidermied butterflies in a display case. While this approach has the tone and aesthetics of objectivity, it is nonetheless a profoundly ideological exercise. This is a lesson in what happens when people are categorized like objects. This brings us to another kind of slippage in machine learning, where paradigms slide from one domain to another, where tools that may have had success at one thing, such as face recognition, are pushed into an entirely different domain, such as race recognition. After the terrorist attacks in the US on September 11, 2001, an enormous amount of defense spending moved into the domain of automating facial recognition. This was a turning point, and the research applications widened out from a focus on domestic law enforcement to controlling people crossing borders and beyond. And similarly, the research ambitions began to expand, from detecting individual faces to trying to detect gender, race, emotions, and sexuality. This is a slippage away from biometric signatures into relational and cultural concepts, and so a series of scientific claims and folk theories emerged about what we can see in an image of a face.
For example, this is a paper titled Do They All Look the Same? Deciphering Chinese, Japanese, and Koreans by Fine-Grained Deep Learning. It claims to have created a way to accurately classify the features of each ethnicity, based on a collection of photographs taken from Twitter and the CelebA dataset. And yet the actual features their model detected were things like hairstyles, smiling, and eyebrow grooming. And there are hundreds of similar papers claiming that they can do race detection, gender detection, or both. A core problem is that these systems seek to reduce human identity to a mathematical model. As Simone Browne has written, this process of digital epidermalization is an imposition of race on the body, which colonial powers have enforced on populations for centuries. Above all, the concept of a pure race signifier has always been in dispute. In her writing about race, Donna Haraway notes that despite the work of the little machines for clarifying and separating categories, the entity that always eluded the classifier was simple: race itself. The pure type which animates dreams, sciences, and terrors keeps slipping through and endlessly multiplying. Again, we have a collapse of concepts and changing social constructions. Yet in machine learning, the myth of the pure type has emerged once more, claiming the authority of objective ground truth. So machine learning systems are, in a very real way, constructing race and gender. They are reification machines, in that they produce what they name. When technical systems make identity claims about people, it becomes a statistical ouroboros, a self-reinforcing discrimination process that amplifies social inequalities under the guise of technical neutrality. One side of this will to classify people is the production of a kind of colonial ordering, where identity, meaning, and judgment are imposed from above. But the other aspect is how these recognitions and misrecognitions feed into a wider political economy. At the level of industrial applications, these tools are designed to extract value from images, from targeted advertising to modulating insurance rates to generating financial scores. And this is where an image is not just an image. It's an ongoing process of moral and economic valuation. So here, at the level of political economy, again we see this deep entanglement of the epistemic, the ethical, and the material consequences of how machine learning works. Chapter three, contested ground. So we've seen how there have been these slippages of concepts in machine learning, between the metaphor and the noun, the relational and the fixed, the concrete and the abstract. But at a higher level, there are also methodological assumptions that have been imported wholesale from different fields, even when they're highly controversial or rejected in their home disciplines. This is a system designed by a company called Four Little Trees to be used in schools during the COVID-19 pandemic. The system claims to detect children's emotions and classifies them into feeling one of six things: happiness, sadness, anger, surprise, fear, or neutral. The caption reads, if the corners of their mouth are raised, the machine detects happiness. Well, any of us who've worked in the service industry can certainly tell you that that isn't always the case. There are deeper questions to ask here, too. Why does this system assume that there are six emotions?
Or why are these expressions assumed to be uniform across people, regardless of age, culture, or context? So where does this taxonomy come from, and this idea that emotions are packaged into six universal categories? The story takes us back to a psychologist by the name of Paul Ekman. This is him in Papua New Guinea in the 1960s. Funded by the Department of Defense, he went there to do cross-cultural research with the Fore people. He wanted to show that facial expressions were universal, regardless of culture. Ekman would show his research subjects photographs like this, and a subject would be asked to match it to a label, such as happiness or surprise. This forced-choice format would later be criticized for priming subjects as to what the emotions might be. But it was an important moment for the propagation of the idea of six universal emotions. Now, Ekman, of course, was not the first to try to capture emotion and fix it to categories. It has a much longer history. He, in fact, even cites Duchenne, who in the 1860s used electroshock on patients in asylums to try to capture on photographic plates what he saw as universal taxonomies of inner states. Duchenne described the human face as a map, with each muscle representing a movement of the soul. These ideas then began to be built into early AI. This is an example of a training set made by the psychologist Jeffrey Cohn and the computer vision researcher Takeo Kanade of Carnegie Mellon. Ekman's system was an early inspiration because it offered two things essential for later machine learning: a stable, discrete, and finite set of categories that humans could use to classify photographs of faces, and a system for producing measurements. But in the history of automating emotion detection, there is this deep vein of dispute, going back to the anthropologist Margaret Mead, who critiqued the idea of the universality of emotions back in the 1960s, through, most recently, the work of the psychologist Lisa Feldman Barrett. Barrett's team showed that there were no consistent relationships between a facial expression and an internal emotion. To quote Barrett, it is not possible to confidently infer happiness from a smile, anger from a scowl, or sadness from a frown, as much as the current technology tries to do so when applying what are mistakenly believed to be scientific facts. So why, despite all of this evidence against the idea of inferring emotions from faces, does it persist? Well, in one sense, Ekman's paradigm became popular because the theory fit what the tools could do. It assumes a transparent relationship between appearances and essences, and that a series of discrete and universal categorizations could be mapped to a simplified ground truth. Further, it was an idea that could scale. The more complex issues of context, relationality, and culture are not readily interpretable by machine learning tools. Now, emotion recognition is dispersed widely, with companies like Unilever using it in job interviews to judge whether someone will be a good employee, in shopping malls to predict who might be a shoplifter, and in the automotive industry, inside cars, to detect if a driver is feeling negative emotions. One startup even recently offered to alert nearby police to drivers who might be looking sufficiently agitated or distracted. This is, I think, one of the great dangers of emotion detection AI, and more broadly of accepting ground truth classifications at face value, if you will.
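To see how such a taxonomy gets hard-coded, consider this schematic sketch. It is mine, deliberately simplified, and not Four Little Trees' system: the point is that the six categories are fixed in the output layer before any face is ever seen, so the model can only ever answer from Ekman's menu.

```python
# A schematic six-way emotion classifier. Whatever a face expresses,
# the softmax can only distribute belief across these fixed categories;
# context, culture, and "none of the above" are unrepresentable.
import numpy as np

EMOTIONS = ["happiness", "sadness", "anger", "surprise", "fear", "neutral"]

def classify(face_embedding: np.ndarray, weights: np.ndarray) -> str:
    """Map a face embedding to one of six predetermined labels."""
    logits = weights @ face_embedding
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return EMOTIONS[int(np.argmax(probs))]

# Hypothetical usage: a 6 x 128 weight matrix and a 128-dim embedding.
rng = np.random.default_rng(1)
print(classify(rng.normal(size=128), rng.normal(size=(6, 128))))
```

The taxonomy is an architectural decision, made before training, and no amount of additional data can make the system answer a question its output layer cannot ask.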
This phrenological impulse, this desire to know more about a person than they choose to reveal, has been a common goal for systems of policing and surveillance. In the early 20th century, the polygraph, the lie detector, was designed to wrest interior truths from a subject. But after decades of use, it was found to be entirely scientifically unreliable. And yet it continues to be used for screening applicants and monitoring employees in the United States government today, particularly as a way to assess people's sexuality, in ways that are both invasive and heteronormative. Highly contested systems can persist, even when the scientific theories that bolster them no longer hold. So this desire to extract ground truth from subjects' involuntary bodily signals is now also seen in recent ML papers that claim they can detect sexuality, for example, where the researchers labeled images from dating sites into the reductive binary of straight or gay, and in criminality detection algorithms that claim they can predict who will be a criminal using driver's license photographs as a training set. This is where we start to see physiognomy and phrenology getting a rerun in machine learning. So we can see how the field makes a conceptual error when it classifies people's race, gender, or sexual identity based on their face. It is inherently a confusion of fluid and relational categories with fixed objects like a dog or a chair, and it poses new classificatory harms. The papers commonly justify what they're doing by saying that it's important to show that these kinds of things can be done with off-the-shelf tools. But I'd pose the opposite argument: that we have an ethical obligation not to engage in this kind of computer physiognomy and phrenological practice. These claims of predicting whether someone is angry or sad, or whether they're gay or a criminal, are being made at a time of rising autocratic power and political conflict, when many would love to deploy an unaccountable system of classification and control. As Stuart Hall reminds us, our systems of classification become the objects of the disposition of power. So we've seen the epistemological slippages and the hidden assumptions that can underlie ground truth in machine learning, and we've mapped the statistical turn which has driven this rapacious desire for vast collections of training sets. But these approaches are not immaterial or abstract mathematics. They are physical infrastructures, with models that bring with them enormous demands for data and energy. So in order to see and understand how ground truth is made and the problems in its construction, we now have to move beyond the data to speak of the labor, the working conditions, the economies, the geopolitics, and the greater environments on which they depend. This is the basis of a material ethics that is so often missing from the stories of how machine learning innovations are made. A final chapter: poisoned ground. This is Baotou, the largest industrial city in Inner Mongolia, where you can see an artificial lake filled with toxic black mud. It reeks of sulfur and stretches as far as the eye can see, and it contains more than 180 million tons of waste powder from the nearby Bayan Obo mines, the largest deposit of rare earth minerals on the planet. Rare earth minerals are essential for making everything from iPhones to hard drives to LCD displays.
Now, while they're relatively common in the Earth's crust, the ratio of usable materials to waste toxins is extreme. Refining just one ton of rare earth elements produces 75,000 liters of acidic water and one ton of radioactive residue. Those waste products end up dumped back into lakes like this one. And here we have the coltan mine near Rubaya in the Democratic Republic of the Congo, where extensive mining of minerals for smartphones and rechargeable lithium batteries takes place. In addition to coltan, this mine produces radioactive waste that leaches into drinking water, and it has been linked to extreme human rights abuses and the exploitation of child laborers. So in these places, we see a very different kind of ground truth of how the tech sector is affecting the planet, from the chemical composition of the ground to the air that we breathe. In just the past five years, we've seen an enormous increase in computational demands. According to OpenAI's data, since 2012 the amount of compute used in the largest AI training runs has been increasing exponentially, with a 3.4-month doubling time. By comparison, Moore's law was said to have a two-year doubling period. What that means is that in 10 years, computational demands have grown by more than 300,000 times. This has produced a computational culture which has recently turned to gargantuan datasets, such as Clearview AI's collection of over 10 billion faces for the purpose of selling facial recognition to defense and police. It has had the accompanying effect that the tech sector has overtaken the aviation industry in terms of its carbon footprint. Now, this is still a nascent area of research, but one of the early papers on this topic, by Emma Strubell and colleagues, found that training just a single NLP transformer model in an academic lab produced more than 600,000 pounds of carbon dioxide emissions. That's the equivalent of around 125 round-trip flights from New York to Beijing. And worse, this does not reflect the true scale of the commercial models built by the likes of Facebook and Amazon; the exact details of that energy consumption are unknown, kept as a highly guarded corporate secret. The philosophers Michael Hardt and Antonio Negri call this the dual operation of abstraction and extraction: abstracting away the material conditions of production while extracting more information and resources. The way in which the AI industry abstracts away environmental harms is commensurate with the way it imagines ground truth, stripped of context, removed from any harms in its making. The AI industry has abstracted away the energy, labor, and data that it relies upon. From the perspective of material ethics, we can see how these epistemological, ethical, and environmental concerns are in fact tightly interwoven. The data-heavy, deep learning approach of indiscriminately extracting as much data from the internet as possible has simultaneously produced serious social, political, and environmental harms. So, to conclude, where do we locate the ground truth of AI? We can see it in the earth's crust and the atmosphere. We can see it in the bodies of the workers all along the supply chain, from the mines to the click farms. We can see it in the assumptions, politics, and contestations of meaning built into machine learning systems, and in the very notion of ground truth datasets themselves. So I've made three core arguments here. First, that we need to critically engage with how truth is manufactured in machine learning.
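For anyone who wants to check the arithmetic behind those growth figures, here is a back-of-envelope sketch of my own, assuming OpenAI's reported 3.4-month doubling time. At that rate a 300,000-fold increase arrives in roughly five years, which is why the ten-year figure quoted above is, if anything, conservative.

```python
# Back-of-envelope arithmetic for exponential compute growth, assuming
# the reported 3.4-month doubling time for the largest training runs.
import math

DOUBLING_MONTHS = 3.4

# How long does a 300,000x increase take at that rate?
doublings = math.log2(300_000)               # about 18.2 doublings
months = doublings * DOUBLING_MONTHS
print(f"300,000x takes about {months:.0f} months (~{months / 12:.1f} years)")

# Moore's law (a two-year doubling) over the same span, for contrast.
print(f"Moore's law over the same period: about {2 ** (months / 24):.0f}x")
```

The contrast is the point: over the same five-ish years in which the largest training runs grew 300,000-fold, a Moore's-law trajectory would have delivered roughly a sixfold increase.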
We can start by engaging deeply with training data, by studying how knowledge is constructed and deployed, and by whom: what the sociologist Karin Knorr Cetina calls the epistemic machinery. We also need to develop better practices of data stewardship, of data investigation, curation, and care. The making and maintenance of data is one of the most overlooked domains in machine learning, despite its foundational role. Second, datasets aren't simply raw material to feed algorithms. They produce what they name. The dataset, in Geoffrey Bowker's words, ultimately shapes the world in its image. So this practice of collecting data, categorizing it, and labeling it is itself a form of politics, with questions about who gets to decide what things will mean. As such, much of the discussion around bias in AI systems misses the mark. There is no neutral or apolitical vantage point that training data can be built upon. Removing harmful labels or adding pictures with different skin tones does not reform the fundamentally flawed classificatory assumptions and paradigms that are being taken at face value. This reminds me of what Terry Winograd, one of the founding figures in natural language processing, once said about AI: that it had projected a three-dimensional world onto a two-dimensional plane. By systematically eliminating dimensions, it is both simplifying and distorting. The particular dimensions we eliminate or preserve in this exercise are not idiosyncratic accidents. They reflect a philosophy that precedes them and that they serve to amplify and extend. This is such a powerful reminder of how we so commonly see machine learning as a flatland, where the vivid color and complexity of the world has been diminished. These decisions are based in a politics, in a worldview, that relentlessly prioritizes value extraction for commercial and military applications in a radical redrawing of civic life. Which brings me to my final argument: that the epistemological, ethical, and environmental consequences of large-scale AI are co-constitutive, so tightly entangled that we must consider them together. It's only by dismantling this narrow conception of ground truth, and contending with the underlying dynamics and political economies that drive machine learning, that we can see these larger topologies at work. This, at its heart, is a call for a material ethics of AI, one that looks beyond a dataset and beyond a machine learning system to the full-stack consequences of what we build. A material ethics would mean that we would not train a model without thinking about what worldviews are being crystallized, how workers will be paid to sort the data, how much energy the model will consume, and what kinds of institutions, with their own structural legacies of inequality and power, will deploy it, and to what ends. The civil rights activist and law professor Derrick Bell said it like this: to see things as they really are, you must imagine them for what they might be. The legacy of ground truth in AI reveals a failure of imagination, a reduction of human meaning, and a loss of perspective. But this failure also represents a chance, a concrete opportunity to reimagine how machine learning works from the ground up, to think anew about where it should and should not be applied, in a way that emphasizes human creativity, solidarity, and potentiality. This work of imagination requires us to expand our perception of what is possible, both for our species and our creations.
How well we succeed will determine if we can support the flourishing of life on an already imperiled planet. Thank you so much. I look forward to your thoughts and our ongoing discussion.

Kate, I was just blown away by your lecture. You know, I first encountered you 12 years ago, and your stunning ability to bring all of this together is really, really amazing. And I must say one thing, sorry to bring this back to what we're doing at Berkeley, but we are forming, hopefully quite soon, a new College of Computing, Data Science, and Society. And you have more eloquently articulated why we need society in there as an absolutely equal partner to computing and data science than anyone I have ever heard. I mean, this is just incredible. I think we are scheduled to take a short break. Please do submit your questions. I'm not sure if the question tool is working or not, so when we come back 10 minutes from now, those of us who will be discussants or participants tomorrow also have some questions to ask Kate, in case the submission tool for the YouTube questions is not working. So I'll see you back here in 10 minutes.

Absolutely breathtaking, Kate. I am happy to welcome all of you back. We did get some very, very nice questions, very stimulating questions. I think we probably have time for two of these questions now. The first is from Al from Southern California. Al says: I can see AI rules will be made by the biggest player, and they will always be favorable to him. Will there ever be a fair game?

Thank you for this question, Al. And I note your use of the gender pronoun, which I think is very appropriate given, if we look at the tech billionaires, exactly who is running the game, as you say. You point to, I think, one of the most serious political formulations that we need to contend with, which is just how concentrated tech industry power has become. I think in some ways you really have to go very far back historically to find parallels. You could certainly think of the early railways in the United States, or perhaps Standard Oil, just in terms of their wealth and power; since the pandemic, it has been extraordinary to see that shift. And so the question really is, what kinds of mechanisms do we have to reorient back to civic life? Now, this remains an unanswered question. We've seen attempts to have regulatory frameworks. We've seen legislation drafted in the EU, the first ever draft act for AI, an omnibus bill. And of course, here in California there are some of the strongest privacy laws, and there's a series of new federal algorithmic accountability bills. But I certainly don't think this is enough. I think what we're looking at is actually something far more structural and problematic, in terms of what has happened when we have technology companies that have effectively become like power states, these sorts of transnational entities that have exceeded the power of nation states to regulate them. In some ways, what you're really pointing to is the need for international governance. And of course, this is all happening historically at a time when all of our international governing bodies have been so profoundly weakened. And we're seeing the horrifying legacies of that right now, of course, with the latest invasion of Ukraine.
So in terms of thinking about what mechanisms we have for controlling power, I return us to the types of collective action that we've seen over the centuries, which is: how do people say, this is enough? How do people say, these systems are not suitable for these purposes, and they're only suitable for these? You're seeing a little bit of it here and there, where people are doing things like banning facial recognition in some cities and banning predictive policing. I think those are really important case studies to look to. But certainly that activity of change on the ground, and solidarity on the ground, is the only way I think we get lasting accountability.

Thank you, Kate. The second question comes from Odette from Berkeley, and she asks: where do you see the locations and the communities of human engagement for this work of reimagination?

Thanks, Odette. Certainly part of this work of reimagining technical systems is happening within the fields themselves. We're seeing emergent communities and conferences. There are books on design justice by people like Sasha Costanza-Chock, books on race and technology by Ruha Benjamin, ideas around how we re-center questions of justice. So that's starting to happen. But I also want to think about completely different communities. And this will be one of our topics for discussion tomorrow afternoon with Trevor Paglen, Sonia Katyal, and Marion Fourcade, where we'll be thinking about the role of artistic communities and activist communities. It's interesting, actually: certainly for myself, some of the collaborations that have really transformed me as a researcher and as a person working on these topics have been collaborations with artists. I'm thinking here of my early work with Vladan Joler back in 2017 and 2018, where we tried to map the entire life cycle of a single Amazon Echo. And we thought we were just going to look at the technical pipelines, which we were trying to sketch out. But we realized very quickly that to truly understand its entire lifespan, you have to go back to the mines. You have to look at the smelting practices. You have to look at the container ships. You have to go all the way through to the end of life, where these devices get discarded in the ground in places like Ghana and Pakistan. So for me, trying to imagine very different communities taking part in this practice of reimagining is a really important one. We so often center the tech industry and the tech fields as being the spaces where solutions will be found, where we will have change. But I think there's a hubris there. And I think we have to look much more widely, not just within academia but beyond it, to try to think about what kinds of worlds we want to live in.

Thank you so much, Kate. There are a few more questions, which maybe we can address in the discussion tomorrow. I am sure I speak for everyone viewing this when I say it was a fascinating, breathtaking presentation today. We're all really looking forward to the discussion tomorrow. I want to encourage our viewing audience to return at 4 p.m. West Coast time tomorrow to view Kate Crawford again, this time with the expert commentators Marion Fourcade, Sonia Katyal, and Trevor Paglen. Tomorrow's discussion should provide additional perspective and raise some new questions. You won't want to miss it. For now, this concludes our presentation today.