Thank you, Chris, and thank you to the Berkman Klein Center for having me here, and thank you to all of you for showing up. I'll be talking about the role of algorithms in our justice system, and hopefully I can help point the way forward to a realistic and ethically responsible way of using them to improve our justice systems. I'm happy to have questions wait until the end, but if you have a burning question and you want to ask it, go ahead; I'm happy to take a question in the middle of the presentation as well.

So let's start by stating the obvious. We humans are not perfect decision makers. Our decisions fluctuate with how tired we are, with the weather, with our general moods. Last I checked, Wikipedia's list of cognitive biases stood at almost 200, and a lot of those cognitive biases are really stubborn. They're not the kinds of things that are going to go away with more training. For example, I had a layover in Vegas a few weeks ago, and because they had their first little bit of snow in 11 years, I was stuck there for two days, and I was alone, so I did a fair amount of gambling. I've taken a lot of statistics courses. I've done a lot of statistical studies, and I've read a lot of statistics papers. I think I understand the idea of an independent draw pretty well. When I was gambling, was I affected by the gambler's fallacy? Absolutely. Right: I just lost ten bets in a row, no way I can lose the next one. I knew it was nonsense, but that didn't mean I was immune to it, and I don't think there's any good reason to believe that these deficiencies in reasoning go away when we give someone legal training and a black robe. We now have evidence suggesting that judges are in fact affected by things like the cases they decided immediately before, the temperature, the proximity of their last meal, the political polarization that occurs during election seasons, and their workloads.

Judges are prone to these types of deficiencies just like the rest of us, and I think that's a really big problem in the adjudication system. We rely on a massive number of decision makers to go out and decide individual cases. That means a lot of different people with a lot of different biases activated under a lot of different circumstances, and what we end up with is a lot of noise. Justice begins to look more like a coin flip. Some individuals who should prevail don't; some individuals who shouldn't prevail do. But I think it's even worse than that. As Justice Brandeis famously put it, in most matters it is more important that the law be settled than that it be settled right. If businesses and people can't predict how the law is going to apply to them, they're going to struggle to plan their affairs. I think researchers are starting to sketch the scope of this problem. In a working paper with Ryan Hubert, I show that at least 40% of cases in the Ninth Circuit Court of Appeals could be decided differently based simply on whether they're assigned to one panel rather than another. There's a lot of other research showing disturbing amounts of inconsistency in asylum decisions, social security disability decisions, and criminal sentencing. So given all this, it's not surprising that a lot of people are excited by the prospect of algorithms improving decision making.
But there's also been a lot of pushback, and what I want to do is quickly review three broad classes of algorithms, point to their weaknesses, and then hopefully point towards a proposal for, again, realistically and ethically responsibly using algorithms in our justice system.

The first class I think about is legal reasoning algorithms. These are attempts to code the law into a computer program. The law specifies the results: give it some facts, and the law says what the legal result should be. These legal reasoning algorithms try to replicate that. TurboTax is a really good example. We give TurboTax our basic factual inputs and it spits out a legal answer. TurboTax is great. It's really useful for much of society. It excels at those easy, standard cases that don't have hard issues. It even excels at mechanically complex calculations. But where it starts to suffer is in the kind of contested cases that make it to court, where the facts, and the mapping of facts to a legal result, are just hard and require the kind of discretionary, judgment-laden investigation that the if-then statements embedded in TurboTax just aren't good at yet. At least without a major leap in technology, we're not going to be able to rely on these types of legal reasoning algorithms to improve our decision-making. Maybe a really simple way of saying that is that TurboTax is a really long way from being able to help tax courts make decisions.

There's been a lot of focus on a second class of algorithms, what I call input prediction algorithms. These types of algorithms focus on predicting some core input that's relevant to the proper legal result. The big examples right now are the models predicting an individual's risk of recidivating. This is purely predictive pattern matching, relying on just statistical associations, not legal logic or any causal relationship between variables. For example, whether or not one's parents are divorced might be predictive of whether or not someone is going to commit another crime. That does not mean at all, of course, that whether or not someone's parents are divorced is causally related to someone committing a crime. It's just a predictive exercise. It's this type of predictive exercise that I think has everyone excited about algorithms. With large data sets, lots of measurements on variables, and machine learning that can get in there and detect minute statistical patterns, we can get some impressively accurate predictions.

But there's also been a lot of pushback on these algorithms, so let's try to understand a few of the reasons why. First, some scholars have termed this the selective labels problem. Whenever we're engaged in a predictive exercise, we want to make sure there's a reasonable match between the data we're using to build a model and the data we're applying that model to. That's a real problem when we're trying to estimate something like recidivism, because we can only see whether or not someone recidivates when we release them. If judges release someone, we can observe whether or not they commit another crime. If we detain defendants, we can't observe whether or not they would have committed another crime. Maybe we can observe them a few years later, but by then all sorts of things have changed, and age is a major driver of criminal risk. So we're stuck in the position of building the algorithm on released defendants and then trying to apply it to both detained defendants and released defendants, in order to figure out who should be a released defendant.
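To make that mismatch concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the column names (age, priors, released, recidivated), the synthetic data, and the simple logistic model are stand-ins rather than anyone's actual risk tool. The only point is that the labeled outcomes come exclusively from released defendants, while the resulting scores get applied to everyone.

```python
# A minimal, purely illustrative sketch of the selective labels problem.
# All names and data are made up; this is not any deployed risk model.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "priors": rng.poisson(2, n),
})
# Release is selective, not random: defendants with more priors are detained more often.
df["released"] = rng.random(n) < 1 / (1 + np.exp(df["priors"] - 2))
# Recidivism is only ever observed for released defendants.
latent_risk = 1 / (1 + np.exp(0.08 * (df["age"] - 30) + 0.3 * (2 - df["priors"])))
df["recidivated"] = np.where(df["released"], (rng.random(n) < latent_risk).astype(float), np.nan)

# The only labeled data we can train on is the released subset...
train = df[df["released"]]
model = LogisticRegression().fit(train[["age", "priors"]], train["recidivated"].astype(int))

# ...but the scores get applied to everyone, including detained defendants whose
# counterfactual outcomes were never observed. That mismatch is the selective labels problem.
df["risk_score"] = model.predict_proba(df[["age", "priors"]])[:, 1]
```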
There are some creative ways to get around this kind of problem, but it's a really difficult problem. A second form of technical bias, which I think a lot of people are familiar with, is that we might just not have very reliable measures of the thing we're trying to predict. We're trying to predict criminal recidivism, criminal activity. Say we have equal numbers of white and black individuals committing crimes, but the black individuals are more likely to get detected and sent to prison. If that's our measure of recidivism, it's going to artificially inflate black risk scores relative to white risk scores.

But that's technical bias. Let's pretend we can get rid of all that technical bias: we've solved all the problems, we're done with that. We're still left with claims of normative bias. Even an algorithm that is accurate in the classical sense may involve troublesome normative disparities. One way to put it is that algorithms basically stereotype, and even if those stereotypes were true, we might be uncomfortable making use of them. Now, there's a lot of literature on this, and I'm not going to pretend to have a final resolution on algorithmic fairness and how we should deal with these kinds of issues. But I think at least a couple of things are clear. One, just removing troublesome variables from the algorithm won't solve the problem. We can't just remove race and gender from the algorithm and feel good about ourselves. Basically everything else is statistically associated with race and gender, so taking race and gender out doesn't do anything, because all the other variables that are related to race and gender just put the information back in. And a second point: we should be careful not to evaluate algorithms in a vacuum. Human decisions can also involve troublesome normative disparities, and if an algorithm can improve decision making without making those disparities worse, I think that counts as a win for the algorithm. But I do think there's a pretty serious concern that once those kinds of normative biases are in an algorithm, they become more resistant to change. Humans can adapt and make moral progress, but once we have an algorithm, the bias is a little more hidden, and it may be a little more difficult to get that algorithm to move with the times.

Finally, there's the black box problem. I think the black box problem occurs at basically three levels whenever we're in this predictive exercise. The first level is that we're doing prediction, and prediction does not give any causal explanation. It's just letting us know what the statistical associations are. So even a regression, not fancy machine learning, just a regression being used to predict something, is a black box. We might take a little comfort because we can look at the coefficient on a variable and say, oh, controlling for all these other variables, your parents being divorced results in a 0.2% higher chance of engaging in crime. Modern social scientists do not view that as a reasonable way to interpret that coefficient. We have no idea what that coefficient really means. Throw another variable in there, and that coefficient bounces around and changes. Maybe it goes negative.
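Here's a rough, purely illustrative sketch of that coefficient instability. The variable names (parents_divorced, family_instability) and the synthetic data are invented for the example; the only point is how much a single coefficient can move once a correlated variable enters the regression.

```python
# Illustrative only: how the coefficient on one predictor shifts dramatically
# once a correlated variable is added. Names and data are entirely made up.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
family_instability = rng.normal(size=n)   # hypothetical unobserved driver
parents_divorced = (family_instability + rng.normal(size=n) > 0).astype(float)
recidivism = (family_instability + rng.normal(size=n) > 0).astype(float)

def ols_coefs(predictors, y):
    X = np.column_stack([np.ones(len(y))] + predictors)   # add an intercept
    return np.linalg.lstsq(X, y, rcond=None)[0]

# "Effect" of parents_divorced when it is the only predictor in the regression...
print(ols_coefs([parents_divorced], recidivism)[1])
# ...versus after adding the correlated variable: the coefficient collapses toward zero.
print(ols_coefs([parents_divorced, family_instability], recidivism)[1])
```

The first number comes out as a sizable positive "effect"; the second is close to zero, even though nothing about the underlying world changed, only the set of controls.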
The second level comes once we move to these more complicated machine learning models. Now we can't even understand what's happening at a statistical level. Not only can we not meaningfully interpret a coefficient, but it's really hard to figure out what's going on inside the machine learning algorithm at all. There are tons of interactions, and if people's lives are being decided by these algorithms, there's at least an intuition that they'd like to be able to understand what's going on. And then third, probably the most ridiculous level of black box, and one I don't think will last too much longer if I had to guess: a lot of these prediction algorithms are built and maintained by businesses that make trade secret claims. That means litigants rarely even get the chance to try to look at the complicated machine learning algorithm to figure out what's going on.

And I think there's one other big problem with these kinds of input prediction algorithms: they don't really offer much of a way forward for most of our other adjudication systems. It works a little bit in the criminal context. We find this really important variable, criminal recidivism; let's predict that, and it really feeds into what the ultimate decision should be. But what do we do in social security cases? It's really hard to think of something we could plausibly measure and predict that would feed into the judge's decision. Same with contractual disputes, sexual harassment cases, securities litigation. Basically, if we focus only on those input prediction models, we're leaving a whole bunch of other litigation contexts to fend for themselves.

So what else can we do? I want to focus on what I call decision predictive algorithms. Here we're just going to skip the whole idea of trying to predict some core input to the decision, and instead try to predict the ultimate legal decision itself. Most of the development of these kinds of decision predictive algorithms has happened in the private sphere; Lex Machina is a prominent example. Now, I think the value of these types of algorithms in the private sphere is a little lower than is generally appreciated, and it comes down to another mismatch problem. To build these kinds of predictive algorithms, what's the data we build them on? We build them on litigated cases. The non-litigated cases Lex Machina never sees, so it can't incorporate them into its models. So we build these models on litigated cases, and then what do we apply them to? Well, I think the idea with these legal analytics companies is to help litigators decide: is it worth going forward with this case, is it worth going forward with this motion, what's my probability of winning this motion. But litigants on that edge were previously part of the non-litigated population. There's very little reason to think that an algorithm built on litigated cases would apply accurately to the population of non-litigated cases. These are all statistical associations. There's a set of statistical associations that exists within litigated cases; we have no idea whether that same set of associations also exists in the non-litigated cases.

So I think the value of these decision prediction algorithms is really in the public sphere, and let me try to motivate that argument a little bit. (Didn't quite mean to jump that far ahead with the slides; there we go.) In a given case, we usually only get to see one decision. For example, we get to see Judge Judy decide a case after her football team won, you know, on a rainy day, after she happened to rule favorably in her five most recent cases, and who knows what else.
We get to see this one outcome. But maybe the result would have been different under different circumstances. Maybe she would have ruled positively, for example, if she had decided the case after the football team won, on a sunny day, and after she had issued negative rulings in her last three cases. Of course, we don't get to see these other possible outcomes. And it's worse than that: we don't get to see how any of the other judges would have decided this case under any circumstances. But this information is really valuable. Instead of deciding a case's fate on the basis of just one randomly observed decision, that X in the top left corner with Judge Judy after her football team won, wouldn't it be nice if we could somehow get information on how all of the judges would have decided under a wide variety of circumstances? That's basically what a decision predictive algorithm can give us. By simply dropping the random factors out of the algorithm, we can basically estimate the percentage of green check marks on this board. And that seems like a much more reasonable way to decide cases than relying on just one random decision. A co-author and I often think of this as synthetic crowdsourcing. With a decision predictive model, what we're trying to do is simulate a world where every judge votes multiple times, and then estimate the frequency with which a given case would have succeeded.

Now you might be thinking: why does dropping these variables, the football win, the weather, the judge, the decision streak, all the other possible noise variables that exist out there in the world, work to eliminate their influence in an algorithm, when I just pointed out that removing variables like race and gender doesn't work? The answer is that, unlike race and gender, these noise variables are not statistically associated with the case characteristics. So there's no proxy variable that can just reintroduce those associations. And the first nice feature of this kind of approach is that we finally have a nice match: we're building the predictive algorithm with litigated cases, and then we're applying it to litigated cases.

But before you get too excited, there are some concerns. If we're using this kind of algorithm to automate or even recommend decisions to judges, litigants can react and respond. There may be people who would not have litigated before. They stepped back and thought, ah, my case isn't good enough, I don't think I have a good chance of winning, so I'm not going to litigate. Once an algorithm is employed, though, say to recommend decisions to judges, that litigant may say, wow, my facts spit out a really high number, I'm going to go join the litigated cases. But the algorithm is not going to be particularly accurate with respect to that individual, because individuals like that were not part of the data set used to build the predictive model. And there's a second way litigants can game the algorithm. Predicted losers, those people with low probabilities of winning, might find easy little ways to trick the algorithm into thinking theirs is the type of case that has a high chance of winning. A simple example: maybe the algorithm uses law firms as one of the predictor variables, and one law firm is associated with a high win rate.
Maybe one of the predicted losers says, ah, I want to pretend I'm one of those litigants that would have hired the top law firm and had a strong case, so I'm going to go hire that top law firm. They're masquerading as litigants with a high probability of success. And there's still the black box problem. If we're using these kinds of algorithms to guide or automate decisions, there's still the concern that a black box we don't understand is spitting out our answer, and that raises some due process concerns. And then finally, we've gotten rid of all the noise variables, but by getting rid of noise, we can actually increase some of the bias. If judges are widely biased against a particular type of case, the algorithm will embed that bias, because it's built from judges' decisions, and it could even inflate that bias. So again, if we're using these algorithms to guide or automate decisions, we've got some concerns, especially, once more, the concern that once we've got biases in an algorithm, they may be more resistant to being fixed and adjusted.

So in light of all these problems, I am not going to suggest that we use these decision predictive algorithms to automate or guide decisions. Instead, I want to suggest that we use them to revitalize what I think of as our traditional noise control system: the appellate system. Right now, our appellate systems are struggling under intense caseloads. We don't know how to sift through the massive number of cases to find the cases that need the most attention, the cases that were most likely decided incorrectly. We basically rely on an ad hoc mix of two strategies. One, we set up fees or other barriers. The idea here is that we can't give full appellate resources to everybody at the first level, but maybe if we set up some screening procedure, some barrier, either financial or Byzantine paperwork, then only the litigants who evaluate their own case and think they have a high probability of succeeding will pay the cost and go into the appellate system. That might work, but there are some obvious equity problems. Do we really want people being able to buy their way into the appellate court? And regardless, it's pretty clear that courts are not willing to set fees anywhere close to where they would need to be to incentivize efficient screening. Then we have the second option, which is, well, let's take a peek. Let's get someone to provide a preliminary assessment of a case to figure out whether or not it has a good probability of being reversed. We hire staff attorneys in our courts of appeals, we hire other staff, to do the work of taking a preliminary look and saying, yeah, judges, we think this should go up, we think you might be interested in reversing this. That kind of preliminary assessment is probably going to be full of mistakes. It's preliminary, and it's being done by non-judges, so it's not likely to be perfect. I compare it to a modern bank trying to detect credit card fraud with just a bunch of human eyeballs. It's not going to work. It's crazy. No modern bank would do that now, and I don't think modern adjudication systems should either. I think we should use algorithms to identify the cases that were most likely decided wrong at the lower level.
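Before getting to the examples, here is a minimal sketch of what a decision predictive model along these lines might look like. The file name, column names, and the particular gradient boosting model are all hypothetical choices made for illustration; the substantive point is just that case characteristics go into the feature set while the noise variables (judge identity, weather, recent streaks) are deliberately left out.

```python
# A sketch of a decision predictive ("synthetic crowdsourcing") model.
# The data file, columns, and model choice are hypothetical stand-ins.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

cases = pd.read_csv("litigated_cases.csv")        # hypothetical data set of past decisions

case_features = ["claim_type", "record_length", "counsel_type", "filing_year"]
noise_variables = ["judge_id", "weather", "recent_ruling_streak"]   # deliberately excluded

X = pd.get_dummies(cases[case_features])          # encode categorical case characteristics
y = cases["decision"]                             # 1 = relief granted, 0 = denied

# Out-of-sample predicted probability that a case with these characteristics succeeds:
# an estimate of the share of "green check marks" across judges and circumstances.
model = GradientBoostingClassifier()
cases["p_success"] = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
```

The synthetic crowdsourcing reading of p_success is that it approximates how often a case like this one would succeed if many judges decided it under many different circumstances.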
So let me give you a couple of examples. The California Board of Parole Hearings rarely reviews its own decisions. For every case, two parole commissioners are assigned to hear the inmate's case, and the only time it's reviewed, with one exception I'll tell you about in a little bit, is when one of the two sitting commissioners suggests that the full board review the case. I think that should change. I do not think we should so cavalierly let someone's freedom depend on the idiosyncrasies and moods of the two commissioners who just happened to be assigned to them that day. So Hannah Laqueur and I wanted to get a little proof of concept and try to show that maybe algorithms could help the California Board of Parole Hearings develop a more robust review system. We got the population of hearing transcripts and used a computer script to pull out a large number of variables. Then we built a predictive model, a decision predictive model, leaving out all of the noise variables. We didn't include who the judge was. We didn't include the weather that day. We didn't include whether football teams had won or lost, or what the judge's last meal was. We dropped all those variables, and we're just trying to estimate the remaining underlying signal.

So we do that, and for each case we have a probability that the case will succeed, say 70%. Now we take that 70% and calculate a degree of error. The degree of error is just the difference between the predicted probability and the actual outcome. So if we have a case with a 70% chance of success and it gets a zero, it gets denied, that's a large degree of error. That is not what we expected, and it's a sign that maybe these noise variables are making the case come out in a way it wouldn't usually come out. If it's a 70% probability of success and the outcome is a one, a grant, that's not a very high degree of error, just 0.3. So we would prioritize reviewing the former decision and not the latter.

Now, the Board of Parole Hearings, as I noted, does not really review its own decisions. But a little oddly, the governor does review a subset of decisions. So we used this as a kind of empirical test, to see whether or not our estimated degrees of error were picking up on cases that were probably outliers, cases that most judges under most circumstances would reverse. And we see that at least Governor Jerry Brown likes our model: as the degree of error goes up, Governor Brown is much more likely to reverse the case.
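To put numbers on the degree-of-error idea just described, here is a tiny hypothetical worked example; the probabilities and outcomes are invented for illustration.

```python
# Hypothetical worked example: degree of error = |actual outcome - predicted probability|.
import pandas as pd

cases = pd.DataFrame({
    "p_success": [0.70, 0.70, 0.05, 0.92],   # predicted probability of a grant
    "decision":  [0,    1,    0,    1],      # 1 = granted, 0 = denied (made-up outcomes)
})
cases["degree_of_error"] = (cases["decision"] - cases["p_success"]).abs()

# 0.70 predicted but denied  -> error 0.70 (flag for review)
# 0.70 predicted and granted -> error 0.30
# Reviews get prioritized from the largest degree of error down.
print(cases.sort_values("degree_of_error", ascending=False))
```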
And let me show you one more example. This is the Ninth Circuit Court of Appeals, same basic story. We scraped the population of docket sheets for the years up there, 1995 to 2012, civil appeals, and pulled out a whole bunch of variables; these are just a handful of examples. We're doing this because the Ninth Circuit Court of Appeals is struggling under intense caseloads. Historically, it has served as the main noise control body for the Ninth Circuit: it stands ready to fix any outlier district court decisions, any noise-driven district court decisions. It's been there to try to fix those cases. But the Ninth Circuit's docket has exploded in size. Its caseload has increased by 1,400 percent over the last 50 years, while judgeships have increased by only 250 percent. And even with that increase, we now have at least 3,600 different combinations of judges that could be deciding a case. In response, judges are not doing as much of the work as they once did. They're hiring staff attorneys, they're hiring people straight out of law school, they're hiring law clerks to come in and actually make a lot of these decisions, which then get maybe some cursory review by the panel judges. We're suggesting an algorithm that would let the Ninth Circuit better identify, again, the bad decisions, the decisions with a high degree of error, the decisions that are much different than we would expect. We want to identify the single X up in the corner when everything else is a green check mark. Those are the cases we want to get in front of a review board, to try to make sure that X was a right, or at least reasonable, X.

And again, as a little test of whether these estimated degrees of error successfully capture, or point towards, decisions that most judges would think are bad decisions, we get some nice results. For every additional degree of error, the probability of dissent increases by 0.3%. For negative appellate history, which includes things like en banc review or Supreme Court review, every additional degree of error brings a 0.4% increase. And cases with high degrees of error also receive more negative analysis in future opinions: future judges look back at those opinions and criticize them as bad decisions. So we think the Ninth Circuit could do a lot more to prevent and regulate these types of bad decisions before they even happen, with the aid of an algorithm.
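As a sketch of the kind of validation check just described, one can ask whether a decision's estimated degree of error predicts markers of a problematic decision, such as a dissent. The data file and column names below are hypothetical; this is just the shape of the exercise, not our actual code.

```python
# Sketch of a validation check: does the estimated degree of error predict
# markers of a problematic decision, such as a dissent? File and columns are hypothetical.
import pandas as pd
import statsmodels.api as sm

appeals = pd.read_csv("ninth_circuit_decisions.csv")    # hypothetical data set

X = sm.add_constant(appeals[["degree_of_error"]])
dissent_model = sm.Logit(appeals["dissent"], X).fit()   # dissent: 1 if any panel judge dissented
print(dissent_model.summary())

# The same check can be run against other proxies for a bad decision, such as
# negative appellate history (en banc or Supreme Court review) or negative
# treatment in later opinions.
```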
And that will end it. Thank you. And I, yeah, thank you.

Thanks for that, that was nuanced, fantastic. A question about two things. The part at the end seemed to be less about injustice and more about the court being self-inconsistent, so you're not going to identify systematic injustices, only inconsistencies of the court. And what it seemed like you were doing on the Judge Judy side is kind of classic econometrics, causal inference: we don't observe the counterfactuals, so we try to find judge leniency or comparable cases in order to estimate the effects of irrelevant factors. So can you say how those two things link up?

Yeah. So I'm not advancing the idea that we should try to go out and measure noise directly. It's fine if other researchers do; let's do it, it's fun, we find all the ways that noise influences decisions, it makes headlines. But I don't think that approach gets us very far towards a better adjudication system, in part because there's just so much noise. We can pick out some particular variables to go look at, but the sources of noise are nearly infinite, so I don't know how successful that project would be, and I didn't mean to suggest I was heading down that road. And you're right: my definition of injustice is pegged to the judges' collective views. I think it's very hard to get another empirically robust definition from outside. I think most of us respect judges; they're trying to do a good job; on average, hopefully more than 50% of the time, they get the right decision. So the idea is, let's try to aggregate all those votes. Now we've got Condorcet's jury theorem working for us: we have this representation, like, oh, 63% of votes would be cast in favor of this case. A decision that goes away from that is, in the end, just a decision that's incompatible with this judge's colleagues' decisions. But I'm happy enough, and I think most of us, most of the time, would be happy enough, to call that an unjust decision. There are going to be times where there's an outlier decision and you think, yeah, that was a courageous decision: all the other judges are getting this wrong, and this judge, this time, went out on a limb and made a courageous decision. Those are going to happen sometimes, but I think they're really rare. For most of us, most of the time, when we look at one of these outlier decisions, these decisions that are far deviations from where we expect most judges to decide, it's not going to be a courageous decision. It's going to be a bad decision.

Thanks so much for this. I think my question builds off of what you just said. I'm wondering a little bit how you see judges and the judiciary using this type of information to actually change these patterns. And I also wonder about the impact of this emphasis on consistency on the efficacy of novel legal theories to advance certain types of legal evolution, because it seems like this would generally trend towards very consistent interpretations of specific topics.

All right, sorry, remind me of your first question. I don't have a pen up here.

It was just more about how you see this information being used by the judiciary.

Yeah, in a lot of different ways. So in the paper I'm writing on the Ninth Circuit, one way I talk about implementing it is this: in the Ninth Circuit, there's basically already a two-track system of justice. Pro se appeals are automatically put over into one camp. They're not going to get in front of the judges; they're going to be decided by staff attorneys. They don't need the extra attention; they're easy cases. And then there's the other set of cases, the more serious, contested ones, with plaintiffs and defendants represented by law firms. Those are going to get oral argument, things of that sort. So I look at that and I think the court is already using an algorithm. It's just a really, really bad algorithm: there's one simple variable, and they're splitting cases on it. Instead, if you estimated these degrees of error, or estimated a case's probability of succeeding, you'd find that there's a non-trivial subset of pro se cases that belong in the other category. They should be getting more immediate attention from judges because they have a high probability of winning. As far as how this contributes to new and developing legal theories, I don't know. I don't know how much it does. This has been part of what we've done all along with appellate systems. Appellate systems are basically tools to bring the outliers within the norm, and I think what I'm suggesting is just how we can do that a lot better.

We need it for the recording, sir. Could you please speak into the microphone?

Oh, I'm just... oh, my god. [Audience question, largely inaudible, contrasting routine matters with very serious cases where getting the decision right really matters.]

Yes, OK. So I take that as a good point, and it might be true. But let me push back just a little bit more. Those hard cases, yes, judges are going to be thinking a lot harder about those cases. They're going to be trying, and maybe we would think these noise variables are going to affect them less often because they're more concentrated; they really care about getting the case right. But another way of putting it is that it's a 50-50 case.
It's a really hard case. So maybe that's exactly where we would expect these noise variables, even if they're minimized by intense concentration, to matter: they don't need to have much of an impact to get a judge over the line and rule one way rather than the other.

Thank you for the great talk. The one thing that I was missing is: so you take all the history of the cases for your training data, the way they were ruled, you eliminate the noise variables, and then you try to output what the error is for a new case. Is that correct?

The first step is building the model on the training data set and then generating a predicted probability of winning for the test set. From there, we look at the actual outcome and calculate the degree of error.

My question was, I did not understand the Jerry Brown graph. Is it more that you try to predict how likely the case is to win, and whether it was actually ruled that way? Is that what the graph was showing?

So on the x-axis, we have degree of error. This is how far away the actual outcome at the Board of Parole Hearings level, not the governor level, was from the predicted probability of success. That gives us a degree of error: some sense of how much of an outlier this case is, how weird it is, and whether we should have some suspicion that it was driven largely by noise. And then we're just using the governor graph to show that where we think the initial decision has been driven by a lot of noise, the governor seems to agree with us. We'd love a better comparison. The governor is not perfect; we don't think this is a good system of review, but it's what we had to test against.

Thank you. You used pro se appeals as an example of one place where you can improve on an existing bad algorithm. Is there any reason why you couldn't combine that probability of error with the severity or harm of that error, for example in a case where it's a couple of years in prison versus life in prison, or something like that? And would there be reasons why you'd want to recognize that?

Oh, yes, I think so. I'm not suggesting that the Ninth Circuit only focus on cases with high degrees of error. Even a death penalty case that has a really low degree of error probably deserves more attention from judges than a small contractual dispute with a really high degree of error. So yeah, I'm not suggesting one size fits all.

Hi. I'm just curious if you've shared this with judges and appellate judges, how they've reacted to it, and what they think about this being part of the decision-making formula in terms of where they spend their time and attention.

I have not... I've shared it with... I think they were maybe a little overwhelmed. I mean, I didn't really think we were going to get them to take this on. I have not shared it with any federal judges, to be perfectly honest. I'm much more concerned with getting it published. Maybe after that accomplishment, I'll turn towards, you know, making a dent in the world rather than just an academic dent.

Thank you. Thank you.