This is the BIPA suit, under the Illinois Biometric Information Privacy Act, brought by Steven Vance and a few other plaintiffs who sued IBM. I think this lawsuit has since been paused as a class action; I'm not 100% sure. The Illinois statute says that biometrics are unlike other unique identifiers that are used to access finances or other sensitive information (it goes on to list a few), because biometrics are biologically unique to the individual; therefore, once compromised, the individual has no recourse. So the plaintiffs accuse IBM of taking these unique facial measurements of thousands of Illinois citizens in flagrant violation of BIPA's requirements, and they go on to say that the defendant, IBM, is violating people and subjecting them to increased surveillance, stalking, identity theft, and other invasions of privacy or fraud.

I even want to go ahead and problematize the Excavating AI article. Crawford and Paglen say in Excavating AI that this is certainly not a fix, and that there are still half a million people's photos there, without their knowledge or consent, classified in ways they'd likely reject; and, as they explain there, this history raises problems of its own. But in their own archaeology, who are the people in these data sets, and what productive work does excavating these AI data sets perform?

On this point, Michael Lyons, the creator of one of the data sets discussed in Excavating AI, the Japanese Female Facial Expression (JAFFE) data set, writes that images from it were exhibited at art shows in Paris and Milan by Crawford and Paglen and served as a teaser photo on Twitter and other social media, notably without the consent of the subjects of the data set. And you can see that I'm not showing the subjects of these data sets in this presentation in any identifiable manner. So Lyons notes an ethical double standard in that work: using the data set as a means of demonstrating data extraction, while falling short of obtaining the subjects' consent in their own practice.

And this, I think, is probably the biggest stretch in this talk, but please, please bear with me. You've been with me 35 minutes; at least stay with me for the last ten. Five minutes? Oh, okay, all right. It's cool, all I have is five minutes. So Everest Pipkin, in their work Lacework, describes the act of watching this data set, called Moments in Time, which consists of three-second video clips classified into discrete actions by human annotators. Each video is slowed down, interpolated, and upscaled, filled in with imagined detail. The stream of videos is haunting, ghost-like. The faces are rendered ghastly by the resampling process, but that is commonplace, given how poorly AI-powered generative models seem to do with human features. Pipkin planned on watching only a small percentage of the videos from the data set but ended up watching all one million three-second videos in their entirety. They remark on their hours of watching and on the violence in these clips: in the archive there are moments of extreme emotion and personal vulnerability. Tears, screaming in pain, moments of questionable consent including pornography, racist and fascist imagery, animal cruelty and torture. And worse: "I saw horrible images. I saw dead bodies. I saw human lives end." Here Pipkin taps into the nerve of the matter: machine learning data sets are a violent archive of faces, actions, and moments taken without context. And the frameworks available for people to contest being included in this archive are limited.
They can't really ask to be removed, whether through informed consent as a scientific mechanism, through data subject rights under US and EU privacy and data protection regulation, or through other liberal traditions protecting one's likeness. And so I think my biggest claim here, the thing I'm working through in this thought and what I think this book is about, is that there is a whole body of cultural labor of data maintenance here. Here I'm riffing a lot on Dylan Mulvin, a cultural historian and science and technology studies scholar who wrote an amazing book called Proxies. In the way we think about data, data cleaning, data munging, there is a whole host of cultural labor, of physical labor, that maintains the illusion that data sets are clean; that pretends there is an ontology that can sprout, as if by magic, from the head of Zeus. It doesn't happen that way. Data is messy; data takes work to appear clean. And I think that work is done over and over, whether in developing the ontologies by which the data are labeled or in relation to the data subjects, the people whose lives are encapsulated in these image data sets. With that, thanks so much. I really appreciate it.

So thank you very much, Alex. That was a wonderful talk, and you're wonderful. We have barely five minutes for questions; who wants to start? If folks want to ask something in English or in Spanish, I can also translate, I can offer that. Yeah, who wants to ask a question? Yes.

Hello there. So let's suppose that we have to build a new data set from scratch. It's not going to be easy, I know, and you're in charge of it. How can we do it in a sustainable way? In a proper way?

That's not fair; I said I don't want to propose any solutions. Well. I always want to refuse to answer the question, but I will not do that. Why are you collecting the data? I just want to ask that. Why are you collecting it? That's my first question, and it's probably going to be my only answer. I'm sorry.

Anyone else? Questions? I promise Alex will answer that one.

I'll answer it, sorry. Just don't expect me to give you any guidance; if you want to write an R pipeline or something, I don't know, ask someone else.

Yeah, Shani.

Thank you for the talk, it was amazing. How can we as a community help hold the industry that is creating and misusing these data sets accountable? I know you don't promise solutions.

No, no, it's good. I mean, there are a lot of things here, right? One of the things this is a call for is an understanding of where data come from, and to ask for data histories. Dylan, earlier in this talk, talked about data sheets and data documentation. I would say that data sheets as an artifact are a start, and data sheets are a means of being reflective. So understanding where these things come from and what histories they rely on is, I think, critical. That's a question of self-practice. Holding other people to account, I think, is much harder, right? That sits much more in the realm of regulation and policy and whatnot. But those are tools that could be acted upon to hold big players accountable.

Thank you.
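A minimal sketch of the kind of datasheet-style data documentation mentioned in that answer, assuming a Python representation; the field names and example values are illustrative assumptions, not a schema from the talk or from any standard.

```python
# Illustrative sketch of a machine-readable datasheet-style record for a data set.
# The fields below are assumptions chosen to echo the questions raised in the talk
# (provenance, consent, labeling ontology, contestation), not an established schema.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Datasheet:
    name: str
    motivation: str            # why was the data set created, and by whom?
    collection_process: str    # how were the images or videos obtained?
    consent: str               # what consent, if any, did the data subjects give?
    labeling: str              # who annotated it, and under what ontology?
    known_harms: str           # documented risks, biases, contested uses
    removal_contact: str       # how can a data subject request removal?
    citations: list[str] = field(default_factory=list)


# Example usage with placeholder values (hypothetical data set):
sheet = Datasheet(
    name="example-faces-v1",
    motivation="Benchmark for face detection research.",
    collection_process="Scraped from photo-sharing sites under permissive licenses.",
    consent="No individual consent obtained from the people pictured.",
    labeling="Crowdworkers assigned labels from a fixed taxonomy.",
    known_harms="Labels encode contested demographic categories.",
    removal_contact="dataset-maintainers@example.org",
)
print(json.dumps(asdict(sheet), indent=2))
```

Filling in a record like this honestly for an existing data set is itself part of the maintenance labor described earlier in the talk: the answers are rarely clean.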
We have time for one or two more questions. Anyone? Yes.

Well, thank you for the talk. I always get so frustrated because ImageNet keeps getting used and nobody really holds anyone accountable. There's the incentive structure as well: benchmarks are the main route to academic acceptance. So as much as we publish papers about how ImageNet and the rest are terrible, how do we still go to conferences that celebrate people who submit and use these terrible data sets? How do we still do that?

Oh, God. I mean, that's what your dissertation's about, right? Sorry. How are we going to solve this, yeah? I mean, it's about incentives, right? There's a problem in that there's, what is it called, not necessarily a Matthew effect, but a dynamic in which it's so entrenched to use a certain kind of thing as a benchmark, right? So it seems like there need to be more moves away from benchmarking as a practice and toward other types of analysis. The problem is that, more and more, especially with ChatGPT, even the benchmarks are not stable; they seem to keep moving, right? So what are ways to build alternative modes of evaluation? We can propose that, and we can try to find alternative venues, but we also have to look at the political economy of what it means to win a benchmark. Sutskever and Hinton went off to find fame, but so did Matt Zeiler, who went on to found Clarifai; people who win these competitions and can say they beat SOTA on whatever get millions and millions of dollars, right? So it becomes not only a question of scientific integrity but a question of political economy and of allocation, of who gets what. So I think that is a problem, and it's going to take a lot more inquiry than just intervening on scientific grounds.

Okay, unless there's a very short question that requires a very short answer. Yes: do you have a working title for the book?

Oh, gosh, no, I'm really bad at titles. What? Yeah, Fuck ImageNet. It could be Abolish ImageNet, or it could be, you know, ImageNets of X or something. I don't know. Yeah, like I said, if someone wants to suggest a title... I don't know. I'm sure you're better at this, Daniel. I don't know.

All right, then, thank you very much, Alex. Thank you.