My name is Evan Daski. I'm a data scientist and program officer at the Rockefeller Foundation. In that role I do a bit of data science in service of our teams and our grantees, folks who wouldn't otherwise be able to afford data science services. I also fund tools and infrastructure, and help people stand up products, to make sure machine learning tools are accessible to people working on development and humanitarian problems.

Across this work, one of the biggest problems we face is building labeled data sets for machine learning. So today I'm going to talk a little bit about that. We took a step back and did some research over the past couple of months, and I'm going to share that research with you.

Just to get everyone on the same level, here's a one-minute introduction to supervised machine learning for those who haven't encountered the field before. The basic idea is that we show an algorithm, a computer, a set of objects, let's say pictures, and we show it a set of labels, say what's in each picture, maybe with a bounding box around where the cat or the dog actually is. We show it a lot of those so it can pick up on features, on general patterns. Then we show it something out in the real world. It might be a slightly different cat, a winking cat, or a dog with a different coloring, and the algorithm makes a guess and says, I think that looks a lot like what I've learned a cat or a dog to be.

This is only one subfield of machine learning, but it tends to be where we spend a lot of our time. It's a big part of what we're doing when we say we want to automate diagnosis from medical imagery. When I fund a team that wants to detect crop yields from space across whole continents in Africa, that's what they're doing: a lot of supervised machine learning.

The problem is that in the environments where we work, the data sets that do exist are incredibly biased, with incredibly low representation. Take ImageNet or its rough successor, Open Images, probably the two biggest machine learning imagery data sets: 60% of those images come from just six countries in North America and Northern Europe. That means when you show a classifier a wedding scene from Pakistan and ask, is this a wedding, the classifier won't know that it's a wedding. That's fairly trivial in that case, but transport it into the medical field, identifying cancer, identifying symptoms of heart disease or heart failure, and it becomes catastrophic. It becomes tools we can't even begin to use.

Similarly, in many places where we're getting transactional data, even open data from our cities, that data encodes a lot of bias, a lot of racial bias. There's a great study, which I'll link in my write-up, from Rashida Richardson, looking at police departments that were under consent decrees, meaning they were literally being investigated for racist policing, and that were then using that data directly to build classifiers about where they should police in the future. So even if those police departments fixed their policies and their practices, they weren't fixing the algorithms built on those data. These patterns repeat over and over again in the places where we work. So we said, let's take a step back.
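[Editor's note: to make that one-minute introduction concrete, here is a minimal sketch of the supervised learning loop in scikit-learn. The digits data set stands in for the cat-and-dog pictures from the talk; everything else is standard library usage, not the speaker's code.]

```python
# Minimal supervised learning sketch: learn from labeled examples,
# then make a best guess on examples never seen before.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Labeled data: images (features) paired with labels a human assigned.
X, y = load_digits(return_X_y=True)

# Hold out some examples to stand in for "the real world".
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The algorithm picks up general patterns from the labeled examples...
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ...then guesses labels for things it has never seen.
print(accuracy_score(y_test, model.predict(X_test)))
```

Everything in the talk about data set quality is about the `X` and `y` going into a loop like this one: the model can only be as good, and as representative, as those labeled pairs.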
Let's talk to some of the people who are building these data sets: people building them in industry, academic researchers, critical thinkers who are working through this. Let's ask them a set of semi-structured questions and work through it: how do you build these data sets? What motivates you? What technical challenges are you running into? Which data sets have you seen be really effective? Which data sets have disappeared quickly? We spoke with about 20 folks. I'm not going to attribute anything to anybody, because a lot of them are working on new, potentially IP-related things. I'll also note that the interviewees were mostly focused on text or imagery applications, so we don't have anyone here looking at, say, audio. Those are unique domains and I don't want to discount them, but most of these folks are working on text or images.

Here are some of those lovely folks, and they're doing really awesome stuff. For example, Desmond Patton is building a classifier that looks at Instagram posts and tags whether there is a true threat, not for police departments, but for harm-reduction folks working on this in Chicago. The idea of giving communities access to algorithmic decision-making so they can help preserve their own communities is an amazing thing. Daniel Mutembesa is building classifiers for crop diseases like wheat rust and cassava disease, and that means going out and working with farmers, collecting imagery from them, building incentive structures to get them to send imagery, making sure they aren't gaming those submissions, and so on. These are just amazing folks, and I'd be happy to point you to any of their research. It's really great stuff.

Today I'm going to share five lessons from this work, and these are the more general lessons. In addition to each of these overarching lessons, we have a fair amount of research that goes into each specific domain, but I want to step back and ask: across all of these initiatives, what are the five things we're learning, the general patterns?

The first is that motivation shapes data sets more than anything else. When we dug in and asked people, wait, why are you doing this, we found people at completely cross purposes. There's a set of motivations I'll describe as commercial: essentially people who want to build a new product. These people are saying, I've got an idea, I've probably pitched it to someone somewhere, and I need just enough data to build the minimum viable product that's going to get me the funding to go do the thing.
That's one set of folks. Another set of folks are saying, I need this data set as my moat: access to these data, these labeled records, is what ensures no one is ever going to be able to come for my market. And there's a further group working in commercial settings who know they have a problem with their algorithm. These would be folks whose algorithms don't work to detect hate speech, or don't work to detect bias in hiring, who know they have a critical problem and are working commercially to fix it before someone calls them on it.

The second set of motivations is what I'll call methodological. These are folks who, frankly, care less about the substance of the data. What they care about is that the data set will be a benchmark, that it will let them compete with others to prove out a new method, to publish, or simply to pursue curiosity. This is where a lot of computer scientists find themselves. Of course there are many computer scientists with many motivations, but generally these are the folks approaching the work with a little abstraction.

And finally, and this is where I find myself as a data set funder, there are the applied folks. What we're thinking about is: there's no one in Africa building a classifier that lets community health workers determine whether a certain disease exists in geography X, and I need to solve that problem.

None of these motivations is distinctly bad or distinctly good, though some lead to better or worse outcomes. On any given project you'll tend to see different people with different motivations, and managing those motivations is what putting together a multi-skilled team looks like. You'll want some computer scientists who are interested in the pure problem, and this is often why it's hard to attract computer scientists to problems that, for us in the applied setting, are super interesting: there's nothing methodologically new to be done, it's a fully explored field and an application problem, so going through the process of speccing out a data set with you feels like a lost quarter to them. Similarly, the commercial folks may be happy to talk with you about building a data set, but the moment they find out you're going to require a certain license, say that we would make them license it openly, they'll walk right away. They also potentially have a lower bar for quality: they might get the data set just to the point where they can pitch or build with it, but that won't be sufficient for science, or for us to release it.

The second thing we found is that transactional labels are really worse than you think. When you're getting labels out of an electronic health record system, or out of a power billing system, those labels are bad because they were built to do one specific task, and that task was not training a classifier down the road. They were built so that a medical billing office could efficiently send out bills to people, not so that you could determine, within 20 minutes of someone arriving at the ER, whether they're at higher or lower risk of heart failure.
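[Editor's note: here is a toy sketch of that gap between a transactional label and the thing you actually want to predict. The table, fields, and code list are hypothetical, invented for illustration rather than taken from any real EHR system, though the ICD-10-style codes are real billing codes.]

```python
import pandas as pd

# Hypothetical billing records: built so a billing office can send
# invoices efficiently, not so you can train a risk classifier.
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "billing_code": ["I50.9", "Z00.00", "I50.9", "E11.9"],  # illustrative ICD-10 codes
})

HEART_FAILURE_CODES = {"I50.9"}

# The naive "label": was this patient ever *billed* for heart failure?
labels = (
    claims.assign(hf=claims.billing_code.isin(HEART_FAILURE_CODES))
          .groupby("patient_id")["hf"].any()
)
print(labels)

# What this actually encodes is billing behavior: coding practices,
# reimbursement incentives, and who gets seen at all. It is not ground
# truth about who had heart failure, and that gap is where the
# systemic bias the talk describes lives.
```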
And that is always the case with transactional labels. Transactional labels, more so than any other type of labeling, carry the systemic biases of a system with them, and those biases can be harder to see and to get out. So if you're going to use transactional labels, and of course people do, of course there's value there, the basic recommendation is that you have to embed with that team for six months to a year. You have to really understand why decisions are being made, what the different labels mean, and what incentives sit behind them, and only then can you judge whether the labels are useful. In the power setting, we actually funded someone who had worked with a large power company in Africa for two years and knew exactly where all those data sets were broken. He was able to build a reliable demand prediction algorithm, but only because he had that depth of knowledge, not just because he had access to the data.

The third thing is that there's essentially a labeling spectrum emerging. On one end is stuff anyone can label: find the school bus in this image, highlight all the universities in this document. Simple stuff you can parcel out to crowd workers. The issues there are simply making sure you're treating people fairly, paying them what their time is worth, giving them enough guidelines and rules, and setting aside a gold-standard data set to compare their performance against (there's a small sketch of that check just below). Teams are getting really good at that, and it's totally possible, but it only addresses a small set of the problems where we want to build data sets.

On the other end we have the experts-only category. This is medical imagery; this is a lot of the hard sciences. This is stuff where even becoming an annotator sits at the end of a long process, potentially a PhD, potentially a medical degree. There's actually less interesting stuff at either of those two ends, and I think a lot of the most interesting stuff is in the messy middle. These are things where, yeah, you could probably get some undergrads to do it, you could probably contract with some people on Mechanical Turk to do it. Take the example of imagery of plant diseases: farmers in Africa working in agricultural settings, but without any formal education, can be trained to tag and identify diseases, and can also be trained to tag and identify pests. The problem is that it's in this messy middle. It's not automatic, not everyone can do it, not everyone wants to spend the time doing it, and it's not necessarily a good use of time for folks at the expert end.
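[Editor's note: before the rules for that messy middle, here is the gold-standard check mentioned above for the crowd-work end of the spectrum, as a minimal sketch. The images, labels, and worker names are all invented.]

```python
# Sketch: score each crowd worker against a small gold-standard set
# that was labeled carefully ahead of time.
gold = {"img_01": "school_bus", "img_02": "no_bus", "img_03": "school_bus"}

worker_labels = {
    "worker_a": {"img_01": "school_bus", "img_02": "no_bus", "img_03": "no_bus"},
    "worker_b": {"img_01": "school_bus", "img_02": "no_bus", "img_03": "school_bus"},
}

for worker, labels in worker_labels.items():
    # Fraction of gold items this worker labeled the same way.
    agreement = sum(labels[i] == gold[i] for i in gold) / len(gold)
    print(worker, f"{agreement:.0%}")  # flag workers below some threshold
```

In practice the gold items are mixed invisibly into the regular task queue, so agreement can be tracked continuously rather than as a one-off test.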
So the three rules we're thinking about for that messy middle are these. First, incentives, and here monetary incentives, while necessary for a lot of this work, are really not sufficient for improved performance. Going back to that wheat rust labeling example: what that team found is that simply giving immediate feedback on whether the algorithm, at a basic level, thought the disease was or wasn't present increased farmer compliance with giving more samples. If people are getting value in that moment, you're shortening the feedback loop, even if you caveat it and say, we're not totally sure, you should also call your farm extension worker and have them come check it out. If you can shorten that loop, that's what gets people coming back, submitting more samples, submitting higher-quality samples, coming to your meetings and trainings, and getting better at tagging things.

The second is new tools. Simply building tooling that makes problems easier for people, creating workflows that do a lot of the pre-processing so images look more distinct and it's not as hard to trace boundaries around things, can really increase the number of problems people can tackle without much formal training.

The final one is to rethink from scratch whether you can reclassify in a different way. Instead of asking doctors, who may disagree about the presence of sepsis or something like that in an emergency health setting, can you use 60-day readmission for this problem? Can you use death within one year? Can you train a classifier that gets you the type of answers you want in the moment, but with a totally different, much easier classification problem?
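[Editor's note: here is a sketch of that reclassification idea, deriving a 60-day readmission label directly from an admission log instead of asking experts to adjudicate a diagnosis. The table and field names are hypothetical.]

```python
import pandas as pd

# Hypothetical admission log: one row per hospital admission.
admissions = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "admitted": pd.to_datetime(
        ["2024-01-05", "2024-02-20", "2024-01-10", "2024-03-01"]
    ),
}).sort_values(["patient_id", "admitted"])

# For each admission, find the same patient's next admission date.
next_admit = admissions.groupby("patient_id")["admitted"].shift(-1)

# Label: was this admission followed by another within 60 days?
# No disagreement between annotators; the label falls out of the records.
admissions["readmit_60d"] = (next_admit - admissions["admitted"]).dt.days <= 60

print(admissions)
```

The trade-off is that the label is a proxy: you're no longer predicting sepsis itself, but an outcome that is cheap, unambiguous, and often close enough to the decision you need to make in the moment.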
Fourth, don't ignore shelf life. This one was actually somewhat surprising to me. When I asked people, how long is your data set good for, if you cut it off today and it gets a certain F-score across the board, and in five years we pour in new data and score it again, how long is this a viable data set for classification, the answer was generally a lot shorter than I initially thought. There are differences across domains, but there are two basic rules. One: is this data set operating in an adversarial environment? Does someone have a motivation to adjust their behavior in relation to being scored or classified? If so, you're almost on a weekly cycle, and you need a system to get new labels into your pipeline really quickly. Two: how stable are the biological and social processes that go into it? An emergency room is really unstable. Things change in an emergency room all the time: staff come and go, people come and go. In that setting you're on a similarly short cycle; the data has a very short shelf life, and you need to be funding new labels every year, every quarter, every time a new cohort of medical practitioners comes through.

And then finally, fifth, the overarching theme here: we have to think about machine learning data sets as infrastructure, not as research projects. This is something a lot of practitioners will tell you, but when pushed, there's some ambiguity about what they mean by infrastructure. Essentially what they were saying is, we need more money for longer periods of time to do more of the same thing we're doing. I get why they're saying that, but it doesn't give us anything new to work with. Pushing a little further, we got to this definition of what makes a data set infrastructure.

One, it's ubiquitous, not bespoke. It's not for one use case or one geography; it works across a whole bunch of them. And it's for a community. Think about ImageNet: ImageNet was for the computer vision community, and that's why there was so much uptake. They knew their community and they spoke to it; it wasn't for just one team within that community.

Another sign is treating the data set as a separate budget, not a budget line. This is a little bit of me talking to myself as a funder, but if I see a research project come in where curation is just one budget line, unless they have a lot of prior work and that line is for maintenance, that's not going to build an effective data set. We need to separate out that budget and say, we really care about this domain and area, so let's fund the data set ahead of time and then work with a set of researchers downstream to maintain it and keep it going.

And finally, major contributions. One really attractive aspect of machine learning data sets is that we can actually see how much new data improves performance. Creating economies around that, rewarding people, tracking usage, and making sure people are compensated on that basis is a huge part of what makes this infrastructure and not a research project.

So with that, I'll close up. I would love to talk to anyone here who is building data sets and has thought about this, and specifically anyone who has thought about how these data sets differ from open source software communities, or from the open data community, because there's a lot of alignment but little clarity about which lessons carry over. If that's you, let's talk. Thank you.