Hello everybody. Hi. My name is Frank and I'm with Hope One Source, a nonprofit organization that does one thing, and with that I want to ask you all one question: who in this room right now knows that Drupal is being used to help end homelessness and save lives? All right. Well, that's news for everyone in the room. That's awesome. So take Angie, for example, a single mother of two who was living in her car under a bridge. She signed up for Hope One Source, an open source, Drupal-powered, completely volunteer-built text messaging platform. As soon as she signed up she started receiving text alerts about emergency shelter and food for her children, and later on, childcare and a job. That is beautiful, and because we are entirely volunteer based, here's what I want to offer you: I'm going to be handing out two tablets, and if at any point you would like to get involved, whether it's an hour during a lazy Sunday or just one line of code, you can help the many Angies in the community get themselves out of homelessness and, of course, save their lives. So thank you very much, and I'm going to be passing out the tablets if you folks are interested. Thank you. Thank you so much.

Hi everyone. Thanks for joining me this afternoon. My name is Lauren Maffeo and I'm an associate principal analyst with GetApp, which is like TripAdvisor for B2B software for small and mid-sized businesses. The idea with GetApp is that if you're a small or mid-sized business owner and you're looking for any type of software for your business, whether it's HR or CMS software, you'd go to GetApp.com, read LinkedIn-verified reviews of products, filter for the features you need, and then choose the tool that's the best fit for your business. At GetApp I cover cloud business intelligence software for small and mid-sized businesses, and I also research the ways that new technologies like AI and blockchain are integrated into small and mid-sized businesses. I've also contributed to opensource.com and spoken at several open source conferences like ATO and Drupal GovCon, so I'm really happy to be with you today, especially because I see some familiar faces from GovCon last summer. These are a few other places you can find me online: a link to my GitHub account is there, along with my Twitter account and, finally, an interview that I did with the Association for Computing Machinery on the topic that I'm going to talk about with you today. I'm also giving this talk at the local ACM DC meetup on May 2nd, so if you're in the city I'd love to see you there if you want to see this again. On the right is a picture of me running my first marathon in 2016.

To start off this presentation I want to take us back to 2015. Who remembers a website called HowOld.net? A couple people. It was a website that let you upload photos of yourself, and the idea was that HowOld.net would use facial recognition software to guess how old you were. I had some fun with it last summer and uploaded a photo of myself taken that same month, and I was flattered when HowOld.net seemed to think that I was 24 years old, because that's a fair bit younger than I actually am.
I was less flattered when I uploaded a second photo taken just a few months prior and the website thought that I was 38 years old, so I aged 14 years going back three months. HowOld.net made the news pretty widely as an example of how facial recognition software powered by machine learning meant well but still had a long way to go toward maturity. The thing is, machine learning gaps aren't always funny. They can have serious consequences for end users of products that are built on ML data sets, and one of the most infamous examples of this is a product called COMPAS. COMPAS is an ML algorithm that predicts defendants' likelihood of recidivism, but research from ProPublica, which is a nonprofit journalism outlet, found that COMPAS made biased predictions based on race. The algorithm was two times more likely to predict that black defendants were high risk for recommitting crimes, and it was also two times more likely to predict that white defendants were low risk for recommitting crimes, but there was a big problem with both of these predictions: they were incorrect. And this isn't a hypothetical scenario. COMPAS has been used by judges in over a dozen US states to inform prison sentences, including the lengths of those sentences and whether defendants were released on parole or not. Based partially on COMPAS's recommendation, a Wisconsin judge denied probation to a man named Eric Loomis. Instead, that judge gave Loomis a six-year prison sentence for driving a car that had been used in a recent shooting. But when Loomis tried to take his case to the Supreme Court, the justices refused to give it a hearing, and this was problematic because their choice not to hear his case signified that most justices condoned the algorithm's use without understanding how it reached conclusions that were often incorrect. That sets a dangerous legal precedent, especially because we don't have a lot of legislation or laws around AI, and confusion about AI shows no signs of slowing down.

The question is, did someone do this on purpose? Probably not. It's more likely that this was an unconscious mistake. The COMPAS algorithm probably wasn't trained on data that adequately represented diverse groups of defendants. It's also possible that the algorithm drew from historical data about rates of arrests between people who have black skin versus people who have white skin and made an incorrect correlation of skin tone with recidivism. We don't know for sure, because COMPAS is what's known as a black box algorithm, which means we can't know exactly how it makes decisions or what data went into it. But we can infer from its flawed results that machines have their own biases, just like people do, and that's really what machine bias is: programming that assumes the prejudice of its creators or its data, whether that bias is intentional or not.

This is a problem, and it's a big one, but we'll start to solve it by answering four questions today. I'll go into some more detail about what machine bias is. I'll move on to explaining why it's dangerous for end users. We'll go into the root cause of machine bias, and finally I'll give you six concrete steps you can take to add bias testing to your product dev life cycles. Let's start by asking again what machine bias is. We just went over it at a low level, but it's worth unpacking a bit more, because we all have bias whether we know it or not. I'm no less biased than you are just because I'm standing up here.
By virtue of being human I have my own bias, and some say that because that bias is what makes us human, we should outsource big decisions to machines. We all know the stories about how you shouldn't get surgery on a Friday, or how you don't want to have your case heard in the late afternoon when judges want to go home. Machines don't get tired, cranky, or irrational, and so the thought process says that outsourcing big decisions to them will yield more objective results. But as we already learned, that's not the case. Machines have their own biases just like humans do, and this will continue the more lifelike these machines become.

This bias manifests itself in two distinct ways. The first is through direct bias. This is when models make predictions based on sensitive or prohibited attributes, so things like race, religion, gender, and sexual orientation all constitute examples of direct bias. But believe it or not, that's actually not the form of bias you have to be most concerned about. The bigger issue is indirect bias, because it's harder to detect. This is a byproduct of nonsensitive attributes that correlate with sensitive attributes, and this is why the COMPAS algorithm's predictions correlated with race. The outcomes for one group are compared with another, and if the difference exceeds an agreed-upon threshold, the model is considered unacceptably biased. So if the difference in incarceration rates varied by more than some percentage between prisoners of different ethnicities, you can make a reasonable inference that the algorithm is biased.

Machine bias is also dependent on two important factors: the priorities set by the algorithm's designers, and which methods of fairness were accounted for. If one or both of these is absent at the start of the product life cycle, your risk of bias creeping into the data set goes up dramatically, and because these are often absent at the start, that's where problems can creep in later. Machine bias is also based on measures of fairness, and these define how to treat all participants in a system with equal respect. There's a wide range of possible measures here, and they often have negative correlations with each other: as fairness increases by one measure, it decreases by another. Just like with any product management scenario, this is going to lead to tradeoffs and competing priorities, especially because sometimes as fairness increases, the accuracy of the model can decrease. So let's go back to the criminal justice example. In this case the algorithm's designers have to make some tradeoffs at the start and decide which results are the most fair. They have to ask themselves: should ten innocent people go to jail to make sure one guilty person isn't released, or should ten guilty people be released in order to keep one innocent person out of jail? Regardless of which choice the designers make, their product's users will feel the impact of that decision.
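To make that threshold idea concrete, here's a minimal sketch of the kind of check a team could run against a scored validation set, assuming a binary classifier whose predictions sit in a pandas DataFrame; the column names and the ten-point threshold are hypothetical choices, not anything prescribed here.

```python
# Minimal sketch: flag a model as unacceptably biased when the gap in
# positive-prediction rates between groups exceeds an agreed-upon threshold.
# Column names and the 0.10 threshold are illustrative assumptions.
import pandas as pd

def demographic_parity_gap(df: pd.DataFrame, group_col: str, pred_col: str) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = df.groupby(group_col)[pred_col].mean()
    return float(rates.max() - rates.min())

def is_unacceptably_biased(df: pd.DataFrame, group_col: str, pred_col: str,
                           threshold: float = 0.10) -> bool:
    """Apply the agreed-upon threshold the designers documented up front."""
    return demographic_parity_gap(df, group_col, pred_col) > threshold

# Hypothetical model output scored on a validation set:
predictions = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "a"],
    "predicted_high_risk": [1, 0, 1, 1, 1, 0],
})
print(demographic_parity_gap(predictions, "group", "predicted_high_risk"))  # ~0.67
print(is_unacceptably_biased(predictions, "group", "predicted_high_risk"))  # True
```

A real audit would also compare false positive and false negative rates per group, which is closer to what ProPublica measured for COMPAS, but the idea is the same: pick the measures and the threshold up front, then test against them.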
And that leads us to the next question: why is machine bias dangerous? There are several answers, but I'll give you three to start off with. The first is that machine bias reinforces human bias. Machines are like toddlers; they only learn from the data they're given. So if machines are fed any data that reflects its creators' unconscious biases, the machines themselves will reflect that bias, and this isn't a new problem. St. George's medical school in London used an algorithm in the 80s to take a first look at applicants. The idea was that they would outsource the initial review process to machines, and the machines would decide which students were worthy of coming on campus for med school interviews. The algorithm was trained on past admissions decisions and then adjusted until its decisions matched the characteristics that the admissions team wanted to see. But four years later, two doctors realized that the algorithm was biased. They realized that it was rejecting applicants with female and non-European-sounding names for interviews even when those applicants had all of the other qualifications. So they were qualified to be interviewed based on merit, but they weren't being flagged through to the interview stage purely because of their names. These doctors found that up to 60 applicants were being rejected each year just because of their names, not because of their merit.

The second reason that machine bias is dangerous is that it doesn't account for context. Algorithms learn from the data they're trained on, and that means they don't understand nuances like regulatory change. They also don't inherently know if their data is biased or homogenous, and so this increases the risk that they'll either reinforce early biases or learn new biases once they're deployed. One example of this is the practice of redlining in Portland. From 1859 through 1990, people of color were banned from buying property in certain Portland neighborhoods. That practice is illegal now, but we still have more than a century's worth of real estate data based on it. As a result, any predictive models trained on that data set are at risk of perpetuating the bias of an illegal practice, and this bias can impact who gets approved to buy homes, what the loan terms are, and where those homes are located. So again, the impact of these decisions is very real for the people these products are used on.

The third and most extreme reason machine bias is dangerous is that it can put users' lives at risk. Again, this isn't something that's inherently new. We all put our trust in the designers and engineers of these systems, but if those teams are too homogenous, they won't build for diverse user needs. One example of this from the past is crash test dummies. In the 60s, crash test dummies were modeled after the average male: the average male's height, weight, and so on. As a result, female drivers today are 47% more likely to be injured in a car crash, and female crash test dummies weren't brought to market until 2011. So a decision made 50 years ago had ripple effects that still impact citizen safety today. Now consider what would happen if speech recognition software in an autonomous car can't recognize different types of voices because it was trained on a narrow set of speech patterns. That sounds extreme, but again, it's not hypothetical. There are multiple instances of voice user interfaces that don't understand different accents and inflections because they weren't trained on enough voice variables. Dr. Carol Reiley really said it best: if these systems don't recognize people of every race as human, there will be serious safety implications for the users of these products.

We can't solve the problem, though, without knowing what its root cause is. So why does bias exist in the first place, especially if most of it is indirect and unintentional? A big cause of the issue is black box algorithms, which is what COMPAS is. It's hard for us to know if an algorithm's results are truly fair.
And this is hard even for the algorithm's creators, because machines often behave differently in deployment than they did in development, since they're exposed to new data in deployment. The result is that in extreme cases none of us can see inside an algorithm to learn how it made decisions. That's what we know as the black box problem: the inability to discern exactly what machines are doing when they're teaching themselves novel skills through reinforcement learning. When we don't know how algorithms make decisions, we can't ultimately trust them. And so in the near future I think companies are going to have no choice but to be more transparent about their results. We're already seeing legislation in Europe that would fine large tech firms for not revealing how their algorithms work. Google put an alert out to investors that ethical concerns about their products could impact revenue. So tech companies know that this is something users have their eyes on now. And as extreme as regulation sounds, it's actually what users want. Research from the University of Chicago and the University of Pennsylvania showed that users have more trust in modifiable algorithms than in those built by experts, and that's even when the modifiable algorithms are wrong. The point here isn't related to accuracy. It's more about seeing how the algorithms work and being able to adjust how they make decisions; if you have the expertise to contribute, then you could theoretically fix the problem. And so this really supports the crucial role that open source plays in public trust of tech, because as the research showed, people prefer algorithms when they're modifiable and they can see how they work, even if they turn out to be incorrect.

With that said, there are some valid reasons why you would want to keep your algorithms private. I led a BoF session on this topic at Open Source Summit North America last August, and I talked with developers about some of the reasons you would want to do this. The obvious one is that these algorithms are considered proprietary, and revealing how they work could hand a competitive advantage to their competitors, which they obviously don't want. Another reason is that they're built on countless neurons that each have their own weights and biases, so picking apart every single neuron is maybe possible in theory, but it's just not practical for most teams. And the third is that these systems are at risk of manipulation by bad actors. There was a program manager at Twitter in that BoF session, and he said that his team is on constant guard against people who want to manipulate their systems with malicious intent, so for that reason it obviously makes sense to keep them private. But for transparency's sake, it's still a good idea to open them where possible, and here's where Drupal actually has two big advantages: a large volume of content, and content that is tagged and structured. Volume and tagging are both really essential for data sets that you can use to train ML products that are less biased, and I'll get into that a bit later in the presentation. Frank Kerry gave a presentation at DrupalCon Baltimore two years ago explaining why Drupal data is ideal for training deep learning and neural networks, so if you haven't looked it up online yet, I highly recommend that talk. These networks can make a big range of improvements to your site, from generating alt tags automatically to image captioning, so again, the data that is generated through Drupal can be really valuable here.

And luckily, we're seeing more open source toolkits coming out these days to help address this problem. One of them is a Python toolkit called Lime, based at the University of Washington. It doesn't try to dissect every single factor that influences an algorithm's decisions; instead, it treats every model as a black box. It uses a pick step to select a representative set of predictions or conclusions that it wants to explain, approximates the model close to those predictions, then manipulates the inputs to the model and measures how the predictions change. So here's an example of a Lime classifier from text classification. The researchers took two classes, atheism and Christian, that are tougher to distinguish since they share so many words. They trained a random forest with 500 trees and got a test accuracy of 92.4 percent, so if accuracy were your core measure of trust, you would be able to trust this classifier.
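If you want to try that workflow yourself, here's a minimal sketch using the open source lime and scikit-learn packages; the data set, pipeline, and parameters are my own assumptions for illustration rather than the researchers' exact setup.

```python
# Minimal sketch: explain individual text-classification predictions with Lime.
# Assumes the open source `lime` and `scikit-learn` packages are installed;
# the 20 Newsgroups data set and this pipeline are illustrative choices.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

categories = ["alt.atheism", "soc.religion.christian"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# A 500-tree random forest on TF-IDF features: accurate, but hard to inspect.
pipeline = make_pipeline(TfidfVectorizer(lowercase=False),
                         RandomForestClassifier(n_estimators=500))
pipeline.fit(train.data, train.target)

# Lime perturbs one document at a time, watches how the predictions change,
# and reports which words pushed the model toward each class.
explainer = LimeTextExplainer(class_names=["atheism", "christian"])
explanation = explainer.explain_instance(test.data[0], pipeline.predict_proba,
                                         num_features=6)
print(explanation.as_list())
```

The useful output is the list of word weights: it shows whether the classifier is keying on meaningful words or on incidental ones, which is the same failure mode as the wolves-versus-snow example coming up later.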
And ultimately we're here because responsibility for algorithmic choices doesn't start with regulation. It starts with all of you. It starts with strong data, development, and product teams collaborating to make sure the parameters are as fair as they can be from the outset. And the good news is that while bias is inevitable, it's not impossible to overcome. There are several steps you can start taking with your teams to boost the health of your data sets and mitigate bias from the start.

The first step is to document your priorities up front, and this requires you to answer two key questions: which methods of fairness you're going to use, and how you're going to prioritize them. Sensitive attributes should be identified and declared out of bounds here unless you've made an explicit justification for including them. You're also going to want to define a minimum threshold of acceptable functioning before you deploy your product. These are helpful criteria both for internal teams and for choosing external vendors: if you are a business wanting to use an ML product, you want to make sure that the vendor's product teams are looking at the data, and you want to get some insight into how they're making decisions.

The second step is to train your data under fairness constraints. This is tough, because when you try to control or eliminate both direct and indirect bias, you'll find yourself in a catch-22. If you train exclusively on nonsensitive attributes, you'll eliminate direct discrimination but introduce or reinforce indirect bias. But if you train classifiers for each sensitive feature, you'll reintroduce direct discrimination. Another challenge here is that detection can only occur after you train the model, and when that happens your only recourse is to scrap the model and retrain it from scratch. So to alleviate that, don't just measure average acceptance and rejection across sensitive groups. Instead, you'll want to use limits on the learning process to determine what is or isn't included in your model; this expresses discrimination tests as restrictions and limitations on the learning process.
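As one concrete, hedged illustration of what that can look like in practice (the talk doesn't name a specific tool for this step): the open source Fairlearn library can wrap an ordinary scikit-learn classifier in a reduction that enforces a demographic-parity constraint during training. The synthetic data, estimator, and constraint choice below are all illustrative assumptions.

```python
# Minimal sketch: train under a fairness constraint with the open source
# Fairlearn library. The synthetic features, labels, and sensitive attribute
# are hypothetical stand-ins for a real data set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from fairlearn.metrics import demographic_parity_difference

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # nonsensitive features
sensitive = rng.integers(0, 2, size=500)      # e.g. a protected group label
y = (X[:, 0] + 0.5 * sensitive + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Wrap a plain classifier in a reduction that limits how far
# positive-prediction rates may diverge across sensitive groups.
mitigator = ExponentiatedGradient(estimator=LogisticRegression(),
                                  constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=sensitive)

preds = mitigator.predict(X)
print("demographic parity difference:",
      demographic_parity_difference(y, preds, sensitive_features=sensitive))
```

The tradeoff described earlier shows up directly here: tightening the constraint usually costs some accuracy, which is why the acceptable threshold belongs in the product spec rather than being decided after the fact.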
The third step is to appoint someone to monitor your data sets throughout the product's life cycle, not just during development. That's because developers tend to build training sets based on the data they hope their models will encounter in production, but many don't monitor the data that their creations receive from the real world. And this is problematic when ML systems use reinforcement learning to constantly update and refine their results. Bias should be reduced as much as possible here while maintaining an acceptable level of accuracy, as defined in your product spec. It's also not uncommon for the algorithm to be updated without the model itself being reevaluated. So to alleviate that, you're going to want to appoint someone to monitor the source, history, and context of the data received in both development and production. And if that person detects unacceptable bias or behavior, the model should be rolled back to an earlier state, prior to the first time you saw the bias.

The fourth step is to tag your training data. Tagging refers to labeling which classes are present in an image and where they're located. But doing this at scale and with accuracy is a huge challenge that's prohibitive for a lot of teams, because the cost of tagging is proportional to the time spent on it. Tagging sounds easy until you think about how much time it would take to hand-draw boundaries around all of the homes in a data set, for example. And tagged data sets often have their own biases, which can carry over into the model. The good news is that more products are coming to market to decrease the time and cost of tagging. One of them is a SaaS product called Brain Builder. It's from a company called Neurala that's based in Boston, and they use open source frameworks like TensorFlow to help users manage and annotate their training data.

Brain Builder also aims to bring diverse class examples to data sets, and this is another key step in data training. You'll want to use diverse class examples in your algorithms, because training data needs positive and negative examples of classes to learn what things are and what they're not. So consider the example of homes within a data set. If the algorithm contains only images of homes in North America, it won't know how to recognize homes in Japan, Morocco, or anywhere else once it's in deployment, and its concept of a home is thus limited. So training data needs positive and negative examples, and if you want specific classes of objects, you need the negative examples too, because this more closely mimics the data that the algorithm will encounter in deployment. There are barriers to this, mostly the fact that a lot of the data sets used to train these algorithms are too small. Again, the volume of data is really important for training models that are accurate, and that's why we see relatively few players in the tech world dominating the AI space: they have enormous amounts of data they can use to train their systems. The good news is that 2018 in particular saw an increase in the number of open source AI data sets. Synced has a helpful list of 10 open source data sets you can use, and if you're looking for public data sets by industry, GitHub has an extensive list as well. I have URLs to both of these in the presentation, so if you're interested in getting them afterwards, please come talk to me and I'll email the presentation to you so you can get the links.
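As a rough illustration of the kind of audit these steps imply, here's a minimal sketch that counts examples per class and per context tag in a training set and flags anything under-represented; the record format, the region tag, and the 25 percent floor are hypothetical.

```python
# Minimal sketch: audit a tagged training set for under-represented classes
# or contexts. The records, tags, and the 25% floor are illustrative assumptions.
from collections import Counter

def underrepresented(labels, floor=0.25):
    """Return each label whose share of the data set falls below the floor."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()
            if count / total < floor}

# Hypothetical image metadata: a class tag plus a context tag such as region.
training_set = [
    {"label": "home", "region": "north_america"},
    {"label": "home", "region": "north_america"},
    {"label": "home", "region": "north_america"},
    {"label": "home", "region": "japan"},
    {"label": "not_home", "region": "north_america"},
]

print(underrepresented([x["label"] for x in training_set]))   # {'not_home': 0.2}
print(underrepresented([x["region"] for x in training_set]))  # {'japan': 0.2}
```

In a real pipeline this kind of check would run every time new data arrives from deployment, so the person monitoring the data set has something concrete to roll back against.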
The final tip I want to hit home is to focus on the subject, not the context. So let's start by looking at the training set on the left. This is an example of a biased training set: the wolves in this set were tagged standing in snow, but the model wasn't shown images of dogs. So when dogs were eventually introduced, the model started tagging them as wolves, because both animals were standing in snow. Classification errors like this mean that the model being trained was either incomplete or inaccurate, and the person monitoring the data should be rolling back the model at this point. On the right we have a training set focused on the subject, which is dogs. So again, you want to look at how much weight the algorithm is giving to different aspects of the data and make sure that it's applying weight appropriately.

I just threw a lot of information at you, and I don't expect you all to remember everything, but if you leave here remembering only one thing, I want it to be this: ethical debt is technical debt. Building data sets that are equitable for everyone isn't just a nice thing to do. Bias leads to poor data, which leads to poorly trained neural networks, which leads to a bad user experience that in extreme cases can be deadly for end users. And on your end, when bias is found, your only recourse is to scrap the model and retrain it from scratch, which is an enormous undertaking for you and your team, and it creates a lot more work when you're already working on a very complicated technical problem. I also want to apologize, because the title of my talk is a little misleading, I think. I promised that you could erase machine bias from your AI data sets. You really can't. Bias is inevitable for a lot of the reasons we discussed, and the more complicated these data sets become, the more likely it is to be introduced. What I do want to hit home is that while machine bias is unavoidable, it's also manageable. Like every problem in product management, you're going to have competing tradeoffs and priorities, and you're going to be working with cross-functional teams that all have their own goals, but that's why you have to plan ahead for bias from the start. If you design products at the beginning with the end in mind, you'll know how to spot problems and you'll be able to refine those issues over time before they get to the end user, and everyone will thank you as a result.

I have some resources if you want to learn more about this topic. If you want to get specific about how to use TensorFlow with Drupal, there's a great article on Medium about that. My colleagues at GetApp, which is a Gartner company, have also written about machine bias. If you want to sign up for a free trial of Brain Builder and use your own data to build models, you can do that here, and if you want to learn more about how ProPublica analyzed the COMPAS algorithm I talked about, a link to that is here. Again, if you want any of these URLs, feel free to come talk to me afterwards and I'll email the presentation to you. Thank you. I think we have a few minutes for questions if anyone has one.

Yeah, so the question is: at the beginning you said that some of these data sets and algorithms should be proprietary because a bad actor could get in there, but how does that work? Are you telling us that a company with its proprietary system is ethical and we don't even need to look at those things, when without that there would be no ongoing inspection, which I think there should be? So the question was around whether these systems should be proprietary or not, or whether the proprietary rationale should be a case for why these systems should stay private.
I don't think that's the case. I was actually talking about the reasons developers give for keeping them private. That was a point that was made when I led a BoF session on this topic a few months ago: I was talking to developers about the reasons why you would want to keep this data private from their perspective, and they brought up the proprietary nature. So that's a rationale for keeping them private, but for the reasons outlined, I think any time you can make them open source, that's a better outcome for everyone, because people can see how the system is making decisions, and if it's an open source project, theoretically someone could come in and solve an issue if they see it.

Exactly, because we seem to have a problem: schools are requiring people to get iPads and such, and we wrote to the newspaper about it in an op-ed, and one of the responses was very scary, from a parent who said my kid does not need to know what's going on inside the computer, Apple knows what he needs.

Right, and that's a huge issue, and again I think that's why the Supreme Court case is so problematic, or rather I should say the lack of a Supreme Court case, because the justices, who are the most powerful lawyers in the country, were condoning an algorithm that not only makes incorrect decisions but whose decision-making they couldn't even see, and that sets a very dangerous legal precedent. When we can't see how those systems work, they do involve unintended consequences. I wish I could give this presentation to that parent; as usual, the people who need to hear it most probably aren't in the room, but you can try. Anything else? Okay, thanks for coming, and enjoy the rest of the conference.