on, and it's digging into and understanding this history, and focusing on what we needed for these particular tasks, that gave us the motivation to change the skin tone scale that was available to us.

So as the third thing, you should always be able to create a new and optimal classification scheme that perfectly suits every research task you ever encounter. Except that's obviously not reasonable, right? Sometimes you already have data. You're just given a data set. Or you don't have the institutional power to challenge any of these things. And so in these situations, the thing that I find most helpful, and the thing that I always try to encourage, is to document the histories of the classification schemes that you encounter, and to impose limitations on the ways that you, or users of your data sets, can use these data sets when the classifications are taken out of context.

So I'm going to give one more example, with tweets that are detected as written in Amharic. I have a data set of a bunch of different tweets. And when we got tweets from Twitter, you would get a ton of different labels associated with every single tweet, things like the author ID, who they were replying to, and so on. And one of the fields you get is the language detected by Twitter. In this case, the tweet has been detected to be in Amharic, a language spoken in Ethiopia that is written in a script called Ge'ez. So let's dig into what this detected-language column actually does.

The language detection model that they use is not public, so we don't actually know what they're doing: where their training data is coming from, who labeled it, how they labeled it. And looking deep into the documentation, you find that there are 70 different categories: 69 languages, plus unknown. Only one of those languages is typically written using the Ge'ez script. But Ge'ez is used to write something like 20 different languages. And this made us really suspicious, because we have examples of lots of different tweets that are written in Ge'ez but are not in Amharic, which made us suspect that everything written in Ge'ez was being labeled as Amharic.

And it turns out, with a little more digging, we did find this to be the case. The detector did not reliably identify Amharic; it labeled all the other Ge'ez-script languages as Amharic. And we suspect, based on the documentation that they provide (and by documentation, I mean a blog post from around 2015), that there are a lot of other edge cases that it can't handle. Because Ge'ez is hard to type, a lot of people write Amharic in the Latin alphabet. People code-switch a lot, language-switch a lot. So this category actually ends up being a really complex one that there isn't a clear way to understand.

And so it's really critical that we document these things. It's in this vein that work like Datasheets for Datasets comes in, which you should look up if this is the kind of thing that you do. It's something like nutritional information for your data sets. And much like nutritional information is supposed to tell you how something should be consumed, we can use the information we have to stop our data, or hopefully stop our data, from being used outside of the contexts where its classification schemes are useful. There are options, such as: just don't use it.
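As a rough illustration of the kind of check that confirmed our suspicion, and of how you might flag the affected tweets before deciding what to do with them, here is a minimal Python sketch. The tweet structure, the 0.5 threshold, and the restriction to the core Unicode Ethiopic block are my own illustrative assumptions, not Twitter's actual pipeline:

```python
# Minimal sketch: flag tweets whose text is mostly in the Ge'ez (Ethiopic)
# script but which a detector has labeled as Amharic ("am"). The tweet
# dicts and the 0.5 threshold are illustrative assumptions, and only the
# core Ethiopic block (U+1200..U+137F) is checked here.

def is_ethiopic(ch: str) -> bool:
    """True if the character falls in the core Unicode Ethiopic block."""
    return "\u1200" <= ch <= "\u137F"

def ethiopic_ratio(text: str) -> float:
    """Fraction of non-space characters written in Ethiopic script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_ethiopic(c) for c in chars) / len(chars)

tweets = [
    {"id": "1", "text": "ሰላም ለዓለም", "lang": "am"},  # could be Amharic, Tigrinya, ...
    {"id": "2", "text": "hello world", "lang": "en"},
]

# Every Ethiopic-script tweet labeled "am" deserves a second look:
# roughly twenty languages share this script.
suspicious = [t for t in tweets
              if t["lang"] == "am" and ethiopic_ratio(t["text"]) > 0.5]
print(len(suspicious), "tweets labeled Amharic, possibly another Ge'ez-script language")
```

Every tweet this flags is one where the "am" label may really mean "some Ge'ez-script language," which is exactly the population you need to find before exercising any of the options below.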
So for example, you could choose not to do analysis on tweets that are labeled Amharic, or not to use this language field altogether. This wasn't an option for us, because we cared about these tweets.

You can also improve upon the classifications yourself. If you find that a classification scheme doesn't work, you can simply not use it and do something else: you can rerun your own models that detect these languages, or you can hand-label the tweets. And again, I say "with X model, using X method," because documenting these processes is really, really critical for, again, establishing context for future uses.

And finally, you can encode limitations on how things can be used, sometimes at a very granular level, encoding them into the actual access to your data. Instead of providing open access, you might use an API whose limitations can stop things from being used out of context; a minimal sketch of this idea follows.
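To make that last point concrete, here is a minimal, hypothetical Python sketch. None of these names come from any real API; it just illustrates refusing to serve a field whose classification scheme is known to be unreliable until the caller acknowledges its documented limitations:

```python
# Hypothetical sketch of encoding usage limitations into data access.
# Nothing here is a real API; it illustrates refusing to serve a field
# whose classification scheme is unreliable out of context.

KNOWN_LIMITATIONS = {
    "detected_lang": (
        "The upstream detector labels many Ge'ez-script languages as 'am'; "
        "do not treat this field as ground truth for Amharic."
    ),
}

class GuardedDataset:
    def __init__(self, records):
        self._records = records

    def get_field(self, record_id, field, acknowledged_limitations=()):
        """Serve a field only if the caller acknowledges its documented limits."""
        if field in KNOWN_LIMITATIONS and field not in acknowledged_limitations:
            raise PermissionError(
                f"Field '{field}' has documented limitations: "
                f"{KNOWN_LIMITATIONS[field]}"
            )
        return self._records[record_id][field]

ds = GuardedDataset({"1": {"text": "ሰላም", "detected_lang": "am"}})
# ds.get_field("1", "detected_lang")  # raises PermissionError with the history
lang = ds.get_field("1", "detected_lang",
                    acknowledged_limitations={"detected_lang"})
```

The point of the sketch is that the documentation travels with the data: a user cannot consume the unreliable field without first being shown its history.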
So ultimately, to summarize: classification has consequences, in that you're choosing what to make important and what gets erased or abstracted away. So when you're dealing with data collection, it can be really critical to define the classifications that you actually need to make in this particular context. Sometimes this can be in tension with relying on previous literature and things that you might need to do, but if you have the institutional leverage to move away from an existing scheme, that can be really powerful. Next, digging into the histories of the schemes that you have is, again, really helpful. And finally, above all else, documenting those histories and the reasons you have for using the schemes that you use, and imposing limitations, is really useful for mitigating the fact that classification is complex. Thank you so much for your time.

All right, thank you so much for that excellent talk. We do have time for a few questions. Since we started late, I'm only going to give you all a few minutes to go to the next room between talks, so we definitely have time for questions. Who would like to get us started?

Do any examples jump to mind of organizations doing this particularly well, or adopting some of the tips that you mentioned?

Oh, that's such a good question. Even just doing one thing well... I'm truly drawing a blank, but there definitely are some. OK, I don't want to glamorize Google, because I have so many issues with this, but the fact that they removed gender detection from their Vision API is one: they used to let you label things like "man" and "woman," and they took that out and replaced it with "person." There are so many other terrible things in that API, but I thought that was one good example, and hopefully something that other people have followed. There are others. Again, ImageNet is very problematic in a lot of ways, but they removed a lot of problematic labels that were in the data set. They kept lots of other ones, but there are instances of people taking out problematic categorizations.

This is maybe contradictory, but do you think there's a somewhat objective way to label the subjectivity of a label? I'm imagining something where you could look at a data set and say, OK, this "doctor, yes or no" categorization obviously fails some tests, if the training data and labeling were open. Some way to sort of rate it, rank and categorize it.

I mean, I think, no, I think there are other fields of study that are better suited to qualifying and comparing those types of things. I would ask a social scientist how to do that better, because I think ranking and quantifying isn't necessarily the right approach there. But there are more specific and nuanced ways of analyzing the differences between things. I don't know, ask Alex. Don't ask Alex, just check it out a bit. Don't classify anything.

Thank you for this amazing talk, that was super useful. I'm super interested in classification systems and things like that, but I'm kind of a newbie in machine learning myself. So, and this might be a dumb question, but is there a way to see if there are problematic tags or labels in a massive data set? I'm just curious, because if you have a massive amount of things...

Absolutely. Yeah, with the advent of massive data sets this can be a really difficult question, but it's something that's very critical, because that is what your model is: the data that's under it. There are things like the Know Your Data tool, and I think Hugging Face is also now putting out a lot of other tooling for visualizing and understanding what models are doing, and sometimes what's going on in data sets. So there are definitely tools out there for uncovering these types of things. But it definitely should be easier and better documented, because there aren't always pressures for people to document these things. But yeah, different tools for different data sets.

Cool, thanks.

That is all the time we have. Let's give Dylan another round of applause. Thank you so much.