Welcome, everybody, to this talk. My name is Chelsea Barabas. I am from the Media Lab, which is one half of the equation for this AI ethics and governance initiative that we are jointly pursuing with the Harvard Berkman Klein crew. And I am very happy today to have the opportunity to introduce Julia. Julia is coming to present some of her undergraduate research work, actually, which is phenomenal, that she's able to do work like this. She recently graduated from Dartmouth College and is now working as a software engineer in Silicon Valley. Her background is in both computer science and women's, gender, and sexuality studies. Yeah, so kick-ass, right? I think her work really embodies the power of combining disciplines like that, and it is helping to interrogate a lot of the underlying assumptions around the ethics of using actuarial tools to improve decision making, by probing some of the most fundamental questions: do these predictions actually help us make better, more accurate decisions moving forward? So super excited to hear your talk, and then we'll have Q&A afterwards. Just some points of housekeeping: we are recording this, and it's also being webcast live, so if you're watching online, you can participate by asking questions at the hashtag BKCHarvard. We're going to wait to have questions until the end, and when we do, we're going to pass around the mic, so please speak into the mic so everybody who's watching via the recording can hear. OK, thanks, Julia.

Thank you. Hi, are we good on mic? All right. Thank you so much for having me. I'm going to be speaking about my research on the accuracy, fairness, and limits of predicting recidivism.

With the rise of big data and the prevalence of technology in everything we do, we have become very frequent subjects of algorithms. Tons of data is collected on all of us on a daily basis, and a lot of our daily experiences may be determined by algorithmic systems. A couple of examples: Google has algorithms to determine what search results to give you and which ads to show you. Spotify has personalized song recommendations. Amazon has algorithms to determine which products to show which user. And even banks and online lenders like ZestFinance use algorithms to determine who to give loans to.

Recently, algorithms have crept their way into the criminal justice system. For example, the police force in Atlanta uses predictive software called PredPol to predict where in the city a crime is most likely to occur at each hour of the day. In Chicago, the police force actually has technology to make a list of the top 400 riskiest individuals in the city, ranked by their likelihood of committing a homicide. Tools known as risk assessment instruments take into account a defendant's criminal history and demographic information to make predictions about how risky a defendant is. Recidivism prediction instruments are a type of risk prediction, and they are meant to predict recidivism, which is when a criminal defendant reoffends at some point in the future. The COMPAS recidivism risk scale is a recidivism prediction instrument built by the for-profit company Equivant, which until recently was called Northpointe and has been rebranded as Equivant. COMPAS is used in courts in many states in this country to predict recidivism. Specifically, the tool was built to predict a new misdemeanor or felony within two years of when the score is calculated.
COMPAS is used in pretrial, sentencing, and parole decisions, and since it was built in 1998 it has been used to assess over one million criminal defendants. In May of 2016, which was a little less than two years ago, we came across a study by ProPublica on the fairness of COMPAS. What ProPublica found was that although COMPAS has pretty much the same accuracy for white and black defendants, it makes very different types of mistakes on these two groups. It found that if you're black, you are almost twice as likely as a white defendant to be classified as high risk and then not reoffend. And the opposite is also true: white defendants were almost twice as likely to be classified as low risk and then actually go on to reoffend. So ProPublica released the analysis saying that COMPAS was biased against black defendants, because the tool was under-predicting recidivism for white defendants and over-predicting recidivism for black defendants.

Tools like COMPAS are introduced in criminal justice because we assume that they're gonna be more accurate and more objective than the people who are tasked with making these types of predictions. People hear words like big data and machine learning, and they sound really impressive, and people are usually pretty quick to assume that tools that use these methods are accurate and objective simply because they were built on a lot of data. We started doing our own research into COMPAS and recidivism prediction algorithms in general after this ProPublica piece came out, and noticed that there was an assumption underlying the whole conversation about algorithms in general, which is that algorithmic prediction is inherently superior to human prediction just because it's built on data. But we actually couldn't find any research that proved that algorithmic recidivism prediction was definitively superior to human recidivism prediction. COMPAS's accuracy is around 65%, which seems pretty low, and this tool has very serious consequences. The decisions that it helps make have very severe implications for a defendant's future, for their life. So we wanted to confirm: is COMPAS outperforming human predictions?

For this study we had a dataset of over 7,000 criminal defendants from Broward County, Florida. This is the same dataset that ProPublica used in their study. This dataset has the defendants' demographic information, their criminal history, the crime they had most recently committed, and the COMPAS recidivism risk score that they were given at their pretrial assessment. It also had whether or not they actually recidivated over a two-year period after they were assessed by COMPAS, and this two-year period did not include any time that the person may have spent incarcerated.

To assess human performance at predicting recidivism, we used Amazon Mechanical Turk, which is an online crowdsourcing marketplace where people are paid to perform short tasks. The short task for our study was that participants would read a small paragraph about a criminal defendant and then were asked to predict: do you think this person will commit a new crime within the next two years? These are the seven variables that were in each of those paragraphs about a criminal defendant. They had the defendant's gender, their age, the crime that they committed, and then a small description of what that crime was.
Whether or not the crime was classified as a misdemeanor or a felony, how many juvenile misdemeanors and felonies were on their record, and then the number of prior crimes on their whole record. You should note that the participants did not see a defendant's race when making these predictions.

For the study we randomly selected a subset of 1,000 defendants from that larger dataset of 7,000, and we picked this subset so that COMPAS's performance on it was representative of how COMPAS performs on the larger dataset of over 7,000 defendants. Our study had 400 participants, and the participants were each assigned to see a random block of 50 defendants. They were paid $1 to complete the task, and we wanted to incentivize them to pay attention and to do their best. People who are doing tasks on Mechanical Turk make more money the more tasks they complete, so we wanted to incentivize them to slow down, take their time, and try their best, and so we offered them a $5 bonus if their accuracy was greater than 65%. It's also important to remember that the people in our study were people getting paid to answer an online survey, and we assume they have little to no experience in the criminal justice system.

Here are the results from this study. Here's the distribution of the 400 participant accuracies. These participants achieved an average accuracy of 62.1%, with a median accuracy of 64%. And statistically, this was just barely less accurate than COMPAS's accuracy on this dataset. So these individuals responding to the survey did pretty well compared to COMPAS, but statistically were just barely less accurate than COMPAS.

There's a concept called the wisdom of the crowd, which is that the judgment of a crowd of individuals can actually outperform the individual judgments of those people, even if those individual people aren't experts in what they're judging. And we wanted to know if there was any wisdom in the crowd of people that we had assembled for this study. (Oh no, that looks awful. Okay, well, this is a better graphic; I'm sorry, I don't know what happened with that.) Each of the 1,000 defendants was seen by a small crowd of 20 participants. Each of those participants made a yes-or-no prediction on that defendant: yes, this person will recidivate, or no, this person will not recidivate. If over 50% of the crowd predicted that someone would recidivate, we considered that a crowd prediction that this person would commit a new crime. And if 50% or more predicted no, that was considered a crowd prediction that that individual would not recidivate. By pooling together this crowd wisdom, the crowd was able to achieve an accuracy of 67%, and this is statistically indistinguishable from COMPAS. So it's pretty crazy that by just pooling together 20 responses from people who responded to this online survey, we were able to achieve the same accuracy that COMPAS achieves.

There are a couple of different ways that we can compare the performance of the human crowd and COMPAS. First, like I said, in terms of accuracy, performance was the same. We can also look at the AUC-ROC value, which, as you might be familiar with, is a pretty standard way to measure predictive validity in this type of prediction. In terms of this AUC-ROC value, the crowd and COMPAS had the exact same performance. We can also assess accuracy in terms of sensitivity and bias, which are signal detection measurements.
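Before moving on to the fairness comparison, here is a minimal sketch, in Python, of the crowd-pooling rule just described. The table layout and column names are my own assumptions for illustration, not the study's released data format or code; it only shows the majority-vote logic (with ties counted as a "no" prediction) and the resulting crowd accuracy.

```python
# Sketch of pooling the ~20 yes/no Mechanical Turk votes per defendant into a
# single crowd prediction, then scoring it against the observed two-year outcome.
# Column names here are hypothetical, not from the study's data.
import pandas as pd

def crowd_accuracy(responses: pd.DataFrame) -> float:
    """responses has one row per (defendant, participant) pair with columns:
    defendant_id, predicted_recidivate (0/1), did_recidivate (0/1)."""
    per_defendant = responses.groupby("defendant_id").agg(
        yes_fraction=("predicted_recidivate", "mean"),
        outcome=("did_recidivate", "first"),
    )
    # Majority vote: strictly more than half of the raters must say "yes";
    # an exact tie is treated as a "will not recidivate" prediction.
    crowd_says_yes = (per_defendant["yes_fraction"] > 0.5).astype(int)
    return float((crowd_says_yes == per_defendant["outcome"]).mean())
```

Pooling the 20 votes per defendant in this way is what yields the 67% crowd accuracy reported above.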
In terms of both sensitivity and bias, the performance was also the same. We also compared how this crowd performance compared to COMPAS in terms of racial fairness.

Before I go into fairness, I wanna define a couple of key things you need to know to understand how we looked at fairness. In this sort of prediction scheme, there are two things that can be predicted: either someone will recidivate or they will not recidivate. And then there is what actually happens: either someone does go on to commit a new crime or they do not. If somebody is predicted to recidivate and they actually go on to do so, this is called a true positive. If someone is predicted not to recidivate and then they don't recidivate, this is called a true negative. What's important to look at is when the predictions are wrong. There are two different types of errors that can happen here. The first is called a false positive. A false positive error is when someone is predicted to recidivate but they do not. In terms of recidivism prediction, this is a detrimental error from the point of view of a criminal defendant. A defendant who is given a false positive error was scored as high risk even though they did not actually go on to recidivate. And because these scores are used in pretrial, sentencing, and parole decisions, they were scored as riskier than they actually were and could have faced harsher consequences because of that. The other type of error is a false negative error, which is when the prediction is that someone will not recidivate and then they actually do. From the point of view of a defendant, this is, I guess, a beneficial error to have happen to you, because you were scored as less risky than you actually were.

So first, in terms of fairness, we looked at how the human crowd and COMPAS did in terms of accuracy on these two groups — black defendant accuracies on the right, white defendant accuracies on the left. Pretty much the same: both the humans and COMPAS had pretty much the same accuracy on both white and black defendants. However, if we look at those false positive and false negative errors, they made very different types of errors on these two groups. Both the human predictions and COMPAS's predictions suffered from significantly higher false positive rates on black defendants than on white defendants. And the opposite was also true: both the human predictions and COMPAS's predictions suffered from significantly higher false negative rates on white defendants than on black defendants. This is the same type of bias that ProPublica initially found in the COMPAS predictions. And this shows that in terms of this measurement of fairness, the COMPAS predictions and the human predictions are indistinguishable. COMPAS isn't being more fair than these human predictions are.

There are a lot of different ways to measure fairness in an algorithm, and the debate about the best way to look at fairness is actually still ongoing. There hasn't been consensus about which measurement is the best for establishing whether an algorithm is fair or not. We looked at three other standard forms of fairness that I'm not gonna go into for the sake of time, but in every single way that we compared the fairness of the human predictions to the fairness of the COMPAS predictions, the predictions were the same. There was no category in which humans were more fair or COMPAS was more fair.
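To make those error-rate comparisons concrete, here is a small Python sketch (my own illustration, not the study's code) that computes false positive and false negative rates separately for each group, given arrays of 0/1 predictions, 0/1 observed outcomes, and group labels:

```python
# False positive rate: of the people who did NOT recidivate, the fraction
# predicted to recidivate. False negative rate: of the people who DID
# recidivate, the fraction predicted not to. Computed here per group.
import numpy as np

def error_rates(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # flagged high risk, did not reoffend
    fn = np.sum((y_pred == 0) & (y_true == 1))  # flagged low risk, did reoffend
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return {"false_positive_rate": fp / (fp + tn),
            "false_negative_rate": fn / (fn + tp)}

def error_rates_by_group(y_pred, y_true, group):
    y_pred, y_true, group = map(np.asarray, (y_pred, y_true, group))
    return {g: error_rates(y_pred[group == g], y_true[group == g])
            for g in np.unique(group)}
```

The disparity described above shows up as the false positive rate being roughly twice as high for black defendants as for white defendants, and the reverse for the false negative rate, for both COMPAS and the pooled crowd.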
To summarize, these results suggest two things. First, COMPAS is not more accurate than human prediction: these people responding to an online survey were able to achieve the same accuracy as COMPAS. The second is that COMPAS is not more objective than human prediction: in all of the ways that we looked at, COMPAS's predictions offered no form of fairness or objectivity that the human predictions were lacking. So to wrap this up, we've shown that this commercial software, which is used in many states in this country to predict recidivism, is not more accurate and not more fair than random people responding to an online survey making the same predictions.

One of the major concerns about bringing algorithms into the criminal justice system is that they're what are considered black boxes. A black box is a term for a machine that takes in a series of inputs and produces an output from a series of unknown or unexplained steps; you don't know what's happening inside the black box. Judges who see scores like COMPAS's don't always know how a defendant's identity and criminal history were used to create those scores. Equivant, the company that builds COMPAS, does not share the inner workings of the COMPAS algorithm, and they actually fought in court for their right to keep the algorithm secret because it is proprietary, a trade secret.

Because we found that these non-experts were as accurate as COMPAS, we started to wonder what was going on here. What is COMPAS doing? Why is it not more accurate than these people? We wanted to know what the inner workings of COMPAS were, but Equivant doesn't publish this information, so we decided to try to reverse engineer it and see what was happening. So we built some of our own classifiers. We built the classifiers on the full dataset of over 7,000 defendants, using the same seven features that all of our participants saw in the paragraphs they used to make predictions. On this dataset, COMPAS achieves an accuracy of 65.4%, which is slightly different from its accuracy on that smaller dataset. We started by training a powerful nonlinear support vector machine classifier, which can learn nonlinear separations in data. And, shockingly, this produced an average accuracy of 65.2%. The way we built these classifiers was over 1,000 iterations of 80-20 training-testing splits, so this is the average accuracy over those 1,000 iterations. So this wasn't any better than COMPAS, which was a little shocking. We said, okay, we replicated it, but maybe it's simpler than this. Maybe it's even simpler than a nonlinear approach; maybe a linear approach can do the same thing. So then we built a simple linear classifier, logistic regression, using those seven features, and again we were able to achieve the same accuracy, 66.6%.

So then our next question was: okay, what's the absolute simplest thing we can build that still reaches this accuracy? Again, these were the seven features that we were training our classifiers on, so we trained classifiers on every single subset of these seven features. And what we found is that a classifier that only looks at a defendant's age and the number of prior convictions on their record produces the same accuracy as all of those other classifiers. So these are, again, COMPAS's accuracy, the nonlinear support vector machine, logistic regression trained on all seven features, and then logistic regression trained on just age and number of priors.
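A minimal sketch of this classifier comparison might look like the following, assuming the Broward County data is available as a CSV with numerically encoded versions of the seven features and a two-year recidivism outcome column. The file name, column names, and hyperparameters here are my assumptions, not the published study's implementation.

```python
# Repeated 80/20 splits; mean test accuracy for a nonlinear (RBF-kernel) SVM
# and logistic regression on all seven features, and logistic regression on
# just age and number of priors. Feature/column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEVEN_FEATURES = ["sex", "age", "charge_id", "charge_degree",
                  "juv_misdemeanor_count", "juv_felony_count", "priors_count"]
TWO_FEATURES = ["age", "priors_count"]

def mean_accuracy(df, features, make_model, n_iter=1000):
    """Average test accuracy over n_iter random 80/20 train/test splits."""
    X = df[features].to_numpy(dtype=float)   # assumes features are already numeric
    y = df["two_year_recid"].to_numpy()
    scores = []
    for seed in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        scores.append(make_model().fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(scores))

df = pd.read_csv("broward_defendants.csv")   # hypothetical file name
svm = lambda: make_pipeline(StandardScaler(), SVC(kernel="rbf"))
logreg = lambda: make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

print("Nonlinear SVM, 7 features:", mean_accuracy(df, SEVEN_FEATURES, svm))
print("LogReg, 7 features:       ", mean_accuracy(df, SEVEN_FEATURES, logreg))
print("LogReg, age + priors only:", mean_accuracy(df, TWO_FEATURES, logreg))
```

This only illustrates the procedure; the specific accuracies quoted in the talk come from the authors' own implementation and feature encoding.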
Given that these all reached an accuracy ceiling of around 66%, it's important to start thinking about whether or not any improvement on this accuracy is even possible.

To pull all of this together, this research has three main findings. First, COMPAS is not more accurate than human prediction. Second, COMPAS is not more objective than human prediction. And third, no matter what's inside COMPAS's black-box algorithm, it's equivalent to a very simple classifier built on only two features.

A couple of conclusions from this whole study. First, when we're talking about the use of COMPAS in the courtroom, we have to ask ourselves: would we place these same very serious decisions in the hands of 20 people whom we pay $1 to $6 to answer an online survey? Because what we showed is that these two approaches are functionally equivalent. The second is that we can't take any algorithm's accuracy or fairness for granted. This only shows that COMPAS is equivalent to this small crowd of people, but I think what's important to take away is that we can't assume that any algorithmic approach is going to be super accurate, or more fair than human predictions, just because it was built on data. Finally, we need to hold all of this type of software accountable. We need to have systems in place that test this stuff, that make sure it's achieving high accuracies, that make sure it's fair, and that make sure it's actually performing as we expect it to before we let it loose on the criminal justice system. Currently there is a lot of software like COMPAS already in use all over the country, and if these tools are gonna be in the courts, it's essential that judges know how accurate they are. If they're gonna still see these scores, it's important that they know when something is only 65% accurate, that it's not more accurate than normal people guessing, instead of assuming that these scores are a magic solution with perfect prediction.

As we move into the discussion, I wanted to pose a couple of questions for the group, and I'm interested in hearing people's thoughts on these. The first: what is more important, accuracy or transparency? If I have a tool that I can prove to you has an accuracy of, say, 99%, but I can't tell you what's in it — I can't tell you what features it uses, I can't tell you what weight anything is given, but I can prove that it's 99% accurate — are we happy with that? Do we want to use that in the criminal justice system? Or would we rather have a tool that's 75% accurate, 70% accurate, but I can tell you exactly how it works? If I can tell you it uses these three things, these are the weights, this is how you build it, but it's only 70% accurate, is that better? The second: if we can't have perfect prediction, what's an error rate that we're comfortable with? Do we want 70% accuracy, 80% accuracy? What is gonna be the threshold for accurate enough, where we feel ethically okay making decisions about people's lives with this? And the third: who should be responsible for regulating and verifying this type of software? At what point in the system should there be intervention to make sure it's performing as we expect it to, and who should be held accountable when it's not? Thank you.

Professor? That was a great talk, thank you. And thanks for that research. My name is Kade Crockford; I work with the ACLU here in Boston. Really interested in what you've done here.
I'm definitely gonna use it in my work at the state house here. I just wanted to point out to everybody in the room that one word that didn't come up in your presentation is justice, right? I mean, there's accuracy and there's fairness. But say we had a system that predicted someone's recidivism with 100% accuracy, right? I think this is sort of what you're getting at with your first question. Would we as a society think it's acceptable to cage someone on the basis of the possibility that they may commit a future crime? I think that sort of flips our justice system, at least the way it's supposed to work, on its head. So I just wanted to raise that issue and get your response. Thanks.

Yeah, absolutely. I think that is a really good question to take into account no matter what the accuracy is: at every point, we're making decisions based on what someone might do in the future. And I think a good response to that would be that maybe we need to shift how we're thinking about why we're trying to predict this in the first place. Are we trying to predict this because we're trying to keep people locked up? There's been a lot of research showing that the length of a sentence doesn't help you not commit new crimes in the future. So if you're gonna keep someone in jail or in prison indefinitely because they might commit a new crime in the future, that seems like maybe not the best approach. Maybe, if we know someone is 100% gonna commit a crime, we should ask how we can put resources into helping that person not do that thing and change the course of their life. That'd be my answer to that.

Hi. Come over here. Hi. That was great, and I loved that question as well. But outside of that, wouldn't we expect that COMPAS would have the same accuracy rates, because it's training on human data anyway? And also, what I got from the two features being the most predictive is that maybe those two things are the most predictive and we should just be using them. But another thing is that maybe we should be thinking about why prior convictions is something that is so predictive, when it's clearly highly racialized as well. So is there a way to look at that causally? And also, even though it's so simple, maybe we should be using that because it makes sense?

So, a couple of questions there. The first one is: did we expect COMPAS to be as accurate as humans because it was trained on human data? It was trained on recidivism data, not human prediction data, if that makes sense. So it was trained on: these are individuals, this is their demographic information, this is their criminal history, and then this is whether or not they committed a new crime.

Sorry, yeah, also: recidivism also requires that a police officer catch someone, which is also racialized. And if someone reoffends but doesn't get caught, versus if they went to prison and would have recidivated but didn't because they were in prison, how do you distinguish between those types of features?

Yeah, and you bring up a really good point, which is that this shows, and this confirms the work of a lot of different researchers, that prior crimes is one of the most indicative variables of future recidivism, but it's not perfectly predictive. And, as you mentioned, it's correlated with race.
And that's where we see the racial bias in the false positive and false negative predictions: the black individuals in this dataset, and nationwide, are more likely to have prior crimes, and more prior crimes on their record. And because that's one of the most predictive variables, they're more likely to be classified as high risk even if they won't actually recidivate. And if you are a white individual in this dataset, and in the nation, you're more likely to have fewer prior crimes on your record, which means you're more likely to be classified as low risk even if you will actually commit a new crime in the future. So there are absolutely so many stages where human decisions and human biases are baked into that data and have produced the data that the classifiers are being trained on.

A question for you about the crowdsourcing element of this. Yeah. Did you find a relationship between unanimity among the people who were voting — so, if all 20 people thought someone would recidivate, did that predict a higher accuracy than just one person?

I don't remember if we looked at that, but that is definitely something that is worth looking at. I don't think that there were many cases where everybody had the same answer, but I could be wrong. That is something that I will look into. So that's a great question. Thank you.

Like Kade was saying, and also when you were saying — we talk about race and we talk about racialized, but I'd like for us to actually discuss the word racism, because there is racism inherent not just in the people that you ask to code: whether or not you put in the race of the criminal, they have somebody in their mind that is a criminal, and it probably ain't white. And so the data also is biased, because it's the collection of all these other racism-infused interactions and media portrayals. So, drawing from Safiya Noble's book, Algorithms of Oppression, and also a discussion that we've been having internally in Kathy's Ethical Tech group: do you as the researcher believe that algorithms themselves can be characterized as racist, or do you believe that algorithms are somehow objective and are simply used in racist ways?

So I think that the algorithms themselves — unless there's somebody writing "if black, predict yes," unless that type of thing is happening — the algorithm itself isn't racist, but the algorithm can be trained on data that produces racist outcomes, and that data can be the result of racist policing, sentencing, et cetera. So the algorithm itself, the math, isn't racist in its own right, but that doesn't mean that it can't produce racist outcomes.

Just a very simple question: is COMPAS aware of your results, and have they said anything?

Yes, they have. They released a press release right when the paper came out. The main things they said were, first — we had reported that COMPAS uses 137 features in its prediction, because that was the number we had available to us given the documents they have made public — that COMPAS only uses six features in its predictions. And the other thing they said was that the study confirms that COMPAS is predictively valid because it has that AUC-ROC value of 0.71. It's important to understand that 0.71 is at the bottom of the threshold for satisfactory performance; satisfactory performance starts at 0.7. And that doesn't negate the fact that the human crowd produced that same AUC-ROC value.
And this is an important distinction, something I've tried to emphasize in the reporting on this article: we're not trying to celebrate human prediction. The takeaway from this study shouldn't be, oh, humans are great at this, because neither of these is great at predicting; they're reaching accuracies around 65%. So it's important to remember that the takeaway is that COMPAS isn't more accurate than humans, not that humans are as accurate as COMPAS.

Thank you so much for your presentation. I wanted to ask about your use of the phrase human prediction as the category, because obviously in the criminal justice context the relevant actor, or consumer of the product, is not just humans in general, it's judges. And so there are two scenarios that one could imagine. One is that because judges are professionals, in the sense that they see thousands of cases, maybe they might somehow have a better sense of what outcomes might be than a random selection of humans. But then there are other scenarios you could imagine where the judges maybe are racist, or, especially in cases where judges are elected, they're not only focused on accuracy, whereas your question was framed around trying to increase accuracy. Maybe judges are concerned about accuracy, but if they're elected maybe they're also concerned with trying to avoid false negatives, because that could play very poorly — if someone that they let out on bail then goes and murders someone — whereas false positives are less likely to cause political repercussions. In which case, in that scenario, you could imagine a situation where we might actually view judges as being worse than a random selection of humans. And so, in terms of your research — obviously you weren't able to run this with a thousand judges — I was wondering how you thought about that relationship between humans in general and the actual people who would be using these tools, or whom these tools could very well replace.

That's a great question. It's something that we thought a lot about, and we are trying to figure out if we could possibly run a study with judges, because it's a question we've gotten a lot, and I think it's an important one to ask: what would judges' performance be on this same task? In terms of the use of the word human, the effort here was to figure out: what's the baseline? What are we working with? If we're trying to make an algorithm that's more accurate than people, well, how accurate are people? And we couldn't really find any information about what that human baseline was. Obviously these tools are meant to assist judges' predictions, and these aren't judges' predictions, but again, it's important to remember that this is trying to show that this software isn't performing as well as we expect it to. The human baseline shouldn't be used as a measurement of what a judge's predictive ability should be; it should be used as a way to show that this software needs to be held to a higher standard, and that there needs to be transparency about what its accuracy actually is.

Hi, I am Kathy, a fellow computer scientist. I have a comment that will lead into my question. I actually respectfully disagree that algorithms aren't racist. We might not explicitly see "if black then this, if white then that," but we can have "if neighborhood then this" or "if income level then this," and sometimes those secondary features that are in the algorithms can inherently make the algorithm racist.
So that leads into my question, which is: what are the ways you think developers — for example, the developers who made the COMPAS tool, or other developers who make the many tools that are deployed across many governments — can do better? How can they know that, say, the number of prior crimes isn't actually a proxy for something else? What are your thoughts on getting developers and people who make these tools to do better?

So, to the first point, I think that even something like a zip code, or anything that is a proxy for race, is a product of the data, not the algorithm. Maybe this is just a distinction of what part of the algorithm you're talking about, but in having zip code in the data, in having prior crimes, you have a variable in there that is essentially putting race into the algorithm, and that has racially biased outcomes. In terms of what technologists can do in creating this, I think the first thing is awareness: not operating under the assumption that what you're building is perfect, not operating under the assumption that you're gonna be devoid of bias if you put even more data in. Even more data might not make this better, and so I think it's about asking the right questions. How can we mathematically try to level the playing field, with an awareness of what the differences are in prior crime rates amongst these different demographics? If we know that that is what we're working with, can we try to tweak these numbers so that when you put someone in, the model can't tell what the race of a person is, even if that variable in the past was a proxy for race? So I think it's first just having an awareness of what the consequences could be, and then trying mathematically to build so that that doesn't happen.

Hi, up here. I'm wondering, with the prior crimes — whether in the data that COMPAS or other algorithms out there, or the one that you used, rely on — you disaggregated what the prior crimes were. Because there are different recidivism rates if it's the same crime that the person is repeating over and over again, versus various crimes over their life. So, for predictability purposes, whether or not you think it's worth disaggregating data points that actually do present differences in the real world, and whether you tried to do that.

So the dataset that we had unfortunately didn't have every single crime they had committed; it only had the number of priors. We had the most recent crime they had committed and, if they recidivated, the new crime that they committed, but we didn't have that full history. But I think it's important, because if you're trying to predict whether someone will shoplift versus murder someone, those are very different outcomes, and so having possibly crime-specific classifiers, or trying to predict a specific kind of risk depending on what type of crime you think someone might commit — I think that's definitely a worthy question to ask and something to look into.

I actually wanted to follow up on Kathy's question about race and algorithms and data. One thing that's occurred to me is that I agree with you that often it's bad data that causes these racist outcomes in algorithms, but what about — and this is, I guess, specific to the context of machine learning, which we're seeing underlies a lot of these — the choice of what variables you're optimizing for, right?
So the idea that preventing reoffense is the primary goal here is not an entirely objective one; it's laden with cultural values, which are likely associated with race as well as income level and possibly even things like level of education. So, say you did something like COMPAS, but instead of using "did the person reoffend within two years" you used "did the person subjectively see their quality of life as higher after five years," right? Then maybe you're optimizing for a different social value. So I guess, what is your thought about developers not just looking at the data, but at the choices they make about how to set up their machine learning goals? What's the responsibility there to look beyond the status quo?

Yeah, I think that the responsibility lies at every single step of the way, which is why I'm grateful for the liberal arts education that I did get, and for the fact that I was a double major in computer science and women's, gender, and sexuality studies. This other major, the one not related to computer science, was the one that taught me to ask these questions and to be aware of how certain factors influence people differently based on their identities. We need computer scientists, in education at every level, to be taking classes that make them think about the consequences of what they're building. They can't say, "Oh, I just made the thing they told me to and passed it on; that's someone else's problem; it's not my job to be asking the ethical questions." I don't agree. I think that someone who is making an algorithm that is going to affect other people should be responsible for the decisions and for what they're optimizing for. And then, at other steps of the way, people should be verifying that what was built is actually optimizing for what they said they were building it for.

I don't know where the mic is. You built several systems trying to mimic COMPAS — did you find that those systems all had the same racial bias as COMPAS?

Yeah, and that's pretty much because the most predictive variable is prior crimes, and that is correlated with race, and so the racially biased outcome happens in pretty much everything.

And you said neither COMPAS nor the humans were explicitly given race, nor were they given photographs or names or anything else that might let them derive race. So basically, despite attempting not to put that factor in, it got there anyway. Exactly.

Yeah, the differences between false positives and false negatives by race were pretty revealing. Has there been any comparable research to see if such differences exist when broken down by gender?

Oh, I bet there has been. I don't have a definitive answer for that, but I'm sure there has. There's been a lot of research on gender differences, also because there are very distinct gender differences in criminal behavior in general, and so there's a whole different conversation about whether or not there should be different prediction software for men and women, because these two genders have very different criminal histories. But no, I'm not sure if that specific work has been done. I bet that it has, but I'm not positive.
Hi. So, I think you, and people in their questions, have mentioned a couple of times how race becomes an item in the algorithm even though it isn't explicitly stated. And, if I remember correctly — I'm paraphrasing — one thing you said was that the hope is that we could somehow take race out of the algorithms. But I'm wondering if there's also a way to allow race to exist in the algorithms, or allow all of these factors that put race into the algorithms, and do better work on recognizing the bias in the outcomes and how we can then adjust for that bias. So, knowing that race is gonna seep its way into these algorithms, and that we then see the higher rates of false positives for black people: can we actually look at those results and tell judges that there's a high rate of false positives, so that they assume there's that bias in the information and use the information differently based on the race of the individual?

Yeah, absolutely. I think what's important is that there's awareness that even if you don't put race in as a factor, that doesn't mean there can't be racially biased results. So maybe we should stop kidding ourselves by leaving race out of the algorithm, and maybe use it as a variable to try to even that playing field mathematically. I know that there are a lot of people working on this, so hopefully there will be research about how we can have an awareness of how someone's identity may have affected their entire life's history, and whether we can use that accordingly in making these predictions. And then, absolutely, I think that in terms of judges there should be transparency about the success rate and the false positive and false negative rates — there should be transparency about the overall performance of these tools, so that judges can interpret those values as they should, and interpret them accordingly.

Hi, great presentation, really interesting research. I was wondering if you could tell us how your research, or the COMPAS tool in general, came to have binary categories for race, and also whether you know how the racial categories were collected — whether it was self-reported or whether an official in some capacity chose those categories.

So the categories for race weren't binary, in the dataset or in the study, but the group of white defendants and the group of black defendants were the two largest groups in the dataset, so we only looked at those two. I think that there were six categories of race in the dataset and in our study. I'm not sure how those were collected; that was the data that came in from that county in Florida where the data is from. That's how the race classification happened. Was there a second part to your question? How it was collected — but do you happen to know what the accuracy rates were for other groups? Not off the top of my head, I'm not sure. Thank you.

Thanks for your presentation, it was great. You may have touched on this — apologies if I'm re-asking the question — but does Mechanical Turk capture demographic data, and was the sample that you used representative of a population of judges, or the United States, or some other population?

So Mechanical Turk doesn't capture demographic information, but we asked: we asked for race, gender, age, and education level — there might have been more, but that was it — and our demographic was definitely not representative of the nation at all.
There was a majority white demographic, but we did have ages from 18 to over 74 and all different education levels. So, racially, not as diverse as we wanted it to be with the respondents, but we did have a good spread of other demographics.

I was just curious if you could speak to the more general problem of using algorithms in the sense of using group-level data to predict individual recidivism, versus the more typical model in the past of having a judge make an individual determination.

Yeah, that's a great question. There was a case in Wisconsin where an individual was given a sentence, and that sentence took into account the COMPAS score they were given, and one of their arguments was that the court was using group-level data to make a decision about an individual person. I think that is really a question of whether or not we should be using these algorithms at all, because you are using historical data about people other than that individual to make a prediction about that individual. So that is just a basic question of whether or not these should be used in the criminal justice system in general.

Actually, I just have two questions. One is: is there an effort being made to educate judges about your research? And then the other question is... Okay, answer the first one.

There's not an official effort to educate judges; we're hoping that the media attention this got brought it to people's attention. We have gotten some emails from judges about it, and one state — I don't remember which — was undergoing a kind of committee review, where they had something up for a vote to mandate the use of one of these types of tools, and so they reached out to us asking for an opinion. So, as far as I know, there's no official effort to make judges aware of this. But it's also specific to what tool they're using, and that's why it's important that there's transparency about any software that's in the criminal justice system, so that when a judge sees a score, they know where it came from, they know whether it's only 60% accurate or 80% accurate, and they have an awareness of how reliable it is.

And I just remembered my second question, which is: are litigators using your research on behalf of defendants, to get courts to recognize that maybe this isn't the best predictor? Not that I know of.

I just have a — sorry, Sebastian Diaz from the Berkman Klein Center. I just have a question about the data specifically. Is the score that COMPAS gives binary, or is it on a scale?

Good question. COMPAS's scores actually range from one to 10. There were a lot of details about the study that I didn't get to go into for the sake of time, but yes, COMPAS gives scores from one to 10: one to four is low risk, five to seven is medium risk, and eight to 10 is high risk. The decision to make it a binary cutoff was to be able to compare it directly to the human predictions. We wanted the human prediction to be as simple as possible and not have people giving scores from one to 10, and the cutoff was based off of the practitioner's guide for COMPAS, which says that scores above a four should be considered an indication that that person will recidivate. So the scores aren't binary, but they have a binary interpretation in practice.
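For concreteness, the two score conversions being discussed here can be sketched as follows. This is my own illustration; in particular, the exact way the study binned crowd vote fractions onto a 1-to-10 scale is an assumption on my part.

```python
# Binarizing a COMPAS decile score at the practitioner's-guide cutoff, and one
# plausible way to map a crowd's yes-vote fraction onto a 1-10 score.
import numpy as np

def compas_binary(decile_score: int) -> int:
    """Scores above 4 are treated as a prediction that the person will recidivate."""
    return int(decile_score > 4)

def crowd_score(yes_fraction: float) -> int:
    """Map the fraction of 'yes' votes (0.0-1.0) onto a 1-to-10 score."""
    return int(np.clip(np.ceil(yes_fraction * 10), 1, 10))

def crowd_binary(yes_fraction: float) -> int:
    """Majority vote, matching the cutoff used for the accuracy comparison."""
    return int(yes_fraction > 0.5)
```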
And this is a follow-up: since they're not binary, did your group metrics match the scores that were given by COMPAS? Did our group metrics... So when you had one to 20, I imagine if you divide that by two you would have one to 10.

Yeah, so we did translate the crowd votes into a score from one to 10 so that we could calculate those AUC-ROC values. The crowd score was just the percentage of people who guessed yes, binned into one to 10; that was the score we assigned based on the crowd votes. Did that match COMPAS? In binary terms, it didn't completely match COMPAS's predictions. The crowd predictions and COMPAS's predictions matched in, I think, 669 of the 1,000 cases — it's somewhere around that number. So when it was binary, the crowd predictions and COMPAS's predictions were in agreement about 70% of the time.

Hi, thank you. There are a lot of ways to reduce recidivism — social work, education, job opportunities. Is COMPAS blind to those programs?

I'm not sure. Actually, in the assessment for COMPAS, an individual has to answer those 137 questions, and a lot of those are what are called needs assessments. They only use those six variables in the score, which are probably very similar to the variables we used — I'm not sure, but they probably are. And a lot of those other questions give the court, and anyone involved in creating a plan for the defendant, an idea of different places where they need support. So I'm not sure. I know that COMPAS does provide those questions so that the judge knows a little bit more about that defendant, but I'm not sure if they are used this way. Equivant, the company that makes COMPAS, has a lot of different programs and software available, so they might be doing more in that sphere, but I'm not sure.

I just thought I'd follow up on that and say there's a great paper called Zombie Predictions by the guys at Upturn, which actually talks about this specifically — one of the weaknesses being that risk assessments right now don't incorporate learnings around risk reduction strategies. So you're likely to get zombie predictions: things that are based on older data that existed prior to some sort of risk reduction intervention happening, which would therefore skew the predictive capacity of the results. So I thought maybe I'd ask, as a question off of that: how would you like to see people incorporating, I guess, risk reduction as a strategy into these tools? Are there ways we could change risk assessments to be more intentional about thinking about risk as this thing that changes and that we could actually reduce, as opposed to just asking the binary question of lock up or release?

I think maybe that's related to this question about what we are optimizing for. What are we trying to predict? What kinds of errors do we prefer over others? Maybe we could be optimizing for: if they're going into this program, how did that play out two to three years later — and maybe "play out" not in terms of recidivism, but in terms of, do they have a job two to three years later? So asking what we are trying to predict, and what features we're putting in, could be a good way to start.

Hi, my name is Sarah. I'm working at MIT on AI for public interest projects. I'm curious — you mentioned earlier different methods for measuring fairness, algorithmic fairness.
I was wondering if you could say a bit more about that — whether you've done research on it, and what the trade-offs might be in utilizing different measures as we, in a way, audit the public sector's use of algorithms.

That's a great question. So, what came out of the ProPublica article about COMPAS: ProPublica said these false positive and false negative errors show a discrepancy in terms of race; this is biased. And Northpointe — at the time Northpointe, now Equivant — responded saying that their tool was well calibrated. What calibration means is that any recidivism risk score should mean the same thing regardless of race. COMPAS's scores range from one to 10, and a score of three should mean the same thing no matter whether you're white or black. So what they responded was that the scores are actually well calibrated: at every score, your chance of recidivism is the same regardless of race. And they are accurate in saying that their scores are well calibrated.

This sparked a debate over how this can be true: how can both of these seemingly valid ways to measure fairness contradict each other? And what people figured out is that under certain circumstances, satisfying these two types of fairness requirements is mathematically impossible. If the recidivism rate in two different groups is different, you can't have a tool that is both well calibrated and has the same false positive and false negative rates across groups. That's a math thing that would take longer to explain, but basically what's important is that certain fairness requirements are mathematically impossible to satisfy simultaneously in certain situations — which is kind of crazy, because we want both of those things. We want a score to mean the same thing no matter whether you're white or black. We also don't want black individuals to suffer false positive rates twice as high as white people's. Both of these measurements are ethically valid things to want in an algorithm, and if the recidivism rate isn't the same, we can't have both of them. And this is why the debate about how to measure fairness is so important: there are so many different fairness measurements and fairness constraints that we want in these algorithms, all of them totally valid to desire, and we have to figure out what we're optimizing for, what is important in a specific context, and maybe which fairness measurement matters most with these scores. There are a couple of other fairness measures, but that's the main one that we also looked at, yeah.
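For anyone who wants the math behind that impossibility, the standard way to see it — this follows Chouldechova's 2017 analysis, stated here in my own notation rather than anything from the talk — ties together the false positive rate, false negative rate, positive predictive value, and a group's base rate of recidivism p in one identity:

```latex
% For a group with recidivism base rate p, where PPV is the positive predictive
% value (the calibration-style quantity: of those flagged high risk, the
% fraction who actually reoffend):
\[
  \mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\]
% If two groups have different base rates p, then holding PPV equal across
% groups (calibration) while also equalizing FNR forces their FPRs to differ;
% equalizing the error rates instead forces PPV to differ. Both criteria can
% hold together only when the base rates are equal or the predictor is perfect.
```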