Should we get started? OK. Cameras rolling, and I think this is streaming out. All right. Well, thank you for joining. I know this talk is a little bit out of the way, but you found it, if you were looking for this talk, that is. I think this will be a good session. I actually like that it's a little small; maybe we can make it more conversational and interactive, because it's an interesting topic: the idea of bias in machine learning. What is it? How do we find it? And of course, what are we going to do about it? Along the way, this being the open source microconference, we're going to talk about open source tools you could use to help find and solve these problems in your machine learning projects. I myself am from Databricks. Databricks also does a lot with open source; you may know Databricks from Apache Spark, for example, or Delta Lake and MLflow. This talk is not about Databricks. I will show you things in action in Databricks, but you could do all of this anywhere. As for me, I work on data science and machine learning at Databricks. I used to do the same thing at Cloudera, so I've been in the open source business at the enterprise level for a while. If you know Apache Spark, I've been working on that as a committer for about eight years; we even wrote a book. And actually, unlike a lot of people in Austin, I'm from Austin, so this is a nice quick trip for me, and I'm glad to be here. But anyway, enough about me.

The problem statement here is really this: I think that my machine learning model is biased in some sense. There's been a lot of talk about what that means and maybe how to detect it, but I think there hasn't been as much attention paid to what to do about it. This is not a problem that, say, Databricks solves directly; it's not our direct line of work. But being customer facing, we interact with customers doing tons of stuff, usually with open source software on the platform, and they have questions like this. So we often have to go outside the walls a little bit and research: what are the basics here? What is the state of the art? What can I put together with modern ideas and modern open source packages to answer these questions? A lot of what we end up doing is encouraging and enabling people to use open source software to solve these problems. So again, everything I'll show you today is totally open source. You can do it in Jupyter; I'll show you a little bit of Databricks along the way for fun, though.

I think one of the issues we face when talking about bias is: what does it mean? What do we mean when we say our model is biased, or its predictions are biased? Intuitively, we might say it's when the model is making consistently different predictions for different subsets of the data in a way that doesn't make sense, in a way we don't think it should. Commonly that comes up when the data are about people; bias has some connotation of ethics and morality, and that comes up when we're thinking about what models do to people. But it's not necessarily limited to that.
For example, someone gave me a good example of a classifier that was learning to predict something about batteries, in some genomics or life sciences problem, and it turned out that the manufacturer of some part of the battery was highly predictive of the results, and that doesn't make any sense; it shouldn't be. So sure, some of these ideas can come up in different contexts, but here I think we're talking about fairness and bias with respect to people: fairness to people, bias for or against people. Let's take that as read. Intuitively, bias is somehow when the wrong inputs or facts affect the prediction, maybe your demographic information. That intuition seems right. Or, maybe more broadly, it's when outcomes are unequal, when the model does something demonstrably different for different subsets of people. These are loose, intuitive ideas, and I think they're all directionally right, but they're actually different ideas, and they don't necessarily lead to the same conclusions.

Where does bias come from? Well, obviously, the real world. The real world is not fair. We know that, and the data we collect about the real world can reflect that real-world unfairness, and the models that learn from that data might learn to repeat that unfairness. Unfortunately, that's where a lot of the bias we talk about comes from. It can also come from the data itself: the data may be incomplete, or inaccurate in ways that affect how models behave down the line. I don't think it's really the models. The models themselves aren't really the ones doing the biasing; they are, I think, actually the heroes in this story, or can be, because they can be used to help detect and correct some of the biases we've learned from the real world or learned inadvertently through the data collection process. So I try not to say "model bias." I don't think that's accurate. It's more bias in the machine learning process or outcomes.

Now, at this point this is a well trodden, popular topic, and it's reasonably well understood. But even so, when I started to think and learn a little about this space, I found myself a little confused: what is fairness? And that became more obvious when I talked to customers about what they meant when they said "fair." To give a slightly anonymized example, imagine you are an auto insurance company and you've got a model that predicts the right premium, or the risk, for auto insurance customers. Basically it assesses: how likely is this person to make a claim? And of course we want that to be fair. This is a model that has an effect on people's lives, and we want to make sure it isn't somehow biased against men or women or people from different parts of the country. So what would you have to see about such a model to say, yeah, that seems fair? For example, what if I told you that the average premium the model predicts for men and for women is the same, that on average it's not seeing a different level of risk for men versus women? Is that fair? It sounds sort of right: it's not different for men and women. But is that enough? Is it too much? I don't know, because there are other criteria we might throw out there. What if I told you the model predicting the premium just doesn't use, say, gender as an input, so it doesn't know what the person's gender is? Is that enough? Is that fair?
That seems like a good idea, but it's a different idea, and it's not clear it's sufficient. Or what if I said that, yes, the model knows gender, but the effect on the predictions looks to be about zero? The model says: I don't think this feature matters that much, even though I'm using it. Is that OK? Maybe. That sounds reasonable. And the last one, maybe a variation on the first: what if I said, more narrowly, that if you look at just the people who actually had an accident and made a claim, the average premium the model predicted was the same across demographic groups like men and women? Now, these all sound kind of reasonable on the surface, but they're different ideas, and they're in tension. These aren't the same answers; these aren't the same criteria. As I read some of the literature, it became clear that there are standard answers, but not just one answer. So I think the right definition of bias will depend on the problem you're solving and what you believe fair means in that context. I hope that gives you a taste of how this isn't quite as simple as it seems, even just pinning down what it is we're trying to solve.

So, OK, what can we do about it? Let's say we picked a definition and decided that, yes, we have a problem according to one of those criteria. Where can we go to fix that? We could of course try to fix the real world. In a lot of cases that's where the problem is coming from, and sure, it would be great if data science teams could go change the world. However, it's hard to take that on as an objective for the quarter. So yes, but let's put that aside as out of scope here. Now, data collection: the data itself is sometimes the issue. It could be that we have collected data differently for different subsets of people, for example. Maybe inadvertently; maybe we just have sparser or less accurate data for some groups of people than for others. In this case the reality we're describing with the data is just fine; we simply don't have quite the right data. This came up famously in facial recognition software: if you don't have good pictures of people from all different walks of life, those models unfortunately don't perform as well at recognizing faces for some groups of people. The pictures are fine, the people are fine; we just don't have enough of certain types of data. That's not where I want to go with this particular talk, though. Often it's hard to change the data we have. Maybe you, as a data science team, and maybe some in the room are data science people, don't have a ton of control over the datasets you've collected. We can't go back and magically fix datasets, especially ones built up over decades. The best we can do sometimes is think about ignoring some of the data we do have; we can always ignore data, we can't necessarily make up new data. So sure, this is something to think about; it's just not the angle I want to take today. Instead I want to look at what we can do in the model, which is, I think, not where the problem comes from, but maybe where we can enforce some changes that ultimately counteract or resolve some of the effects we're trying to fix when we talk about correcting for bias. And there are really two main things you can do with models.
Number one: measure some metrics for the model that assess its fairness, and then try to optimize for those metrics instead of just accuracy or lowest loss. We'll get into that class of answers. The other class of answers, which I personally find compelling, is to correct for bias directly: figure out the effects of, let's say, demographic features, and then back them out of the prediction. That's totally possible. Now, for those of you in the room who have worked in this space or seen it before, I don't think what I'm going to show you next is news, but for those who haven't encountered this problem or some of the basic ideas here, I hope this is a good introduction to the 101: the basic, easy things you can do to reason about this and then fix it with software in your modeling process.

So let's take those options one by one. A common instinct is to just ignore the sensitive data. If I don't want my model to be biased against people in one state versus another, or men versus women, or by race, then maybe I should just ignore that information: let's pretend I don't know, let's not feed the model that information. Very tempting, and not unreasonable, but it's not at all clear it's sufficient. After all, this is a pretty narrow definition of fair. It doesn't say anything about the outcomes; it just says we don't believe anything sensitive went into the model's input explicitly. And of course that doesn't mean this information didn't work its way in. For example, if I know your income, I can probably infer something about your age; it's unlikely a 12-year-old makes as much money as a 50-year-old. Likewise, your location, down to your zip code, says a lot about your demographics. So even if I'm not using that information directly, I might be reasoning about those features indirectly through these proxies. How much, and which, features are correlated with the sensitive features is sometimes hard to assess; it's not linear, it's not obvious. So we will take a look at the effect of just ignoring sensitive data, but to give you a preview, I don't think it's going to be enough. If that has been your instinct, it's a good idea, but you'll probably need to go a little further.

At the other end of the spectrum, maybe the most aggressive answer is to force the output of the model to parity. As I said, the goal in this approach is to have some measure of fairness or equality in the outcome of the model and then optimize for it. Of course, we're already optimizing for something when we build models; we're probably minimizing a loss, or maximizing F1 score or accuracy. So there's going to be a tension here; we probably can't optimize for everything at once. The common fairness metrics you will probably hear about, and there are a lot of them, are, number one, equal opportunity. The equal opportunity criterion says that we should see the same true positive rate, or recall, across all groups in the data, conditional on the label. To give an example, and this is for the binary classifier case, which is what we'll be looking at today (something similar goes for continuous predictions in regression): if I look at the data points where the true label is positive, then I should see that the recall measured across that subset of the data is the same no matter how I break it down by these sensitive classes.
We'll see a little more on this to make it concrete, but note that the condition is not just the same metric across all sensitive breakdowns of the data, by age or race, say; it's conditional on the label, and I think that's an important difference. There's also equalized odds: the same idea, but demanding an equal, or close, true positive rate and false positive rate, so a slightly stronger condition. This is a pretty broad, or aggressive, definition of fair. It says: I don't care so much about what the inputs were or what the model was doing; I just care about the outputs, and one way or another I'm going to make sure the model's outputs are fair, even if it means trading off something else, like some accuracy. So that's an aggressive definition. And then there's something in the middle, which I'm personally interested in, and that is to target the effects of these sensitive features, like age or race, in your machine learning problem and, having isolated those effects, back them out. As we'll see later in the example, if I could figure out how much effect your particular age had on the model's prediction of your probability of having an accident, or getting a bank loan, or whatever, I could just subtract that out, and that would maybe more directly undo the effects of things I don't think should affect the model.

With those basic ideas in play, let's take a look at a particular problem. If you've ever looked at this space, you've almost certainly come across this dataset and this problem. The dataset is called the COMPAS dataset, and COMPAS is a system used in Florida to predict whether people in jail are going to commit another crime. So it's predicting recidivism; I think the verb is "to recidivate," or re-offend, maybe. Obviously this matters: we do want to accurately predict whether people will re-offend, because we don't want to keep people in jail if they're not a threat, and if they are going to commit another crime, we do want to keep them in jail. This definitely impacts people's lives, and it's definitely a place where bias could come up: real-world bias, and bias in the data collection and prediction process as well. And indeed, ProPublica studied this dataset and the system several years ago and came out with a report saying: this looks biased. They did a great job taking a bunch of data, cleaning it up, and making it available, which we'll see in a minute. If you look at their analysis, they say it looks like this system is biased against African-American inmates, or defendants. For example, COMPAS actually predicts a decile score, and here's the decile score breakdown, where 10 means most likely to re-offend, for black defendants and white defendants. Those distributions look different, for sure; the one on the right, for white defendants, is clearly more skewed toward "won't re-offend." If you look further, you also see things like: the system is twice as likely to produce a false positive for black defendants as for others. Seems pretty open and shut. They make a strong case that this system is problematic and biased. But even so, if you read a little further, there are other people who analyzed the same dataset and said: no, it's not biased; you're just asking the wrong question.
For example, here's another paper (links are in the slides, which I'll share) that says: if you actually break this down conditional on the label, looking at people who did and did not re-offend, the decile scores are very similar across race. So two groups of people, with slightly different definitions of what fairness means, come to completely different conclusions about this system. Now, I'm not here to argue about whether this is or isn't biased, but let's try to do this ourselves. Let's see if we can build a model that pretty accurately predicts whether people will re-offend, decide whether that model seems to be biased or not, and then experiment with some techniques to undo that effect.

A quick word about the dataset. I'll switch over to Databricks in a minute to show you some details and some code. The good news is that ProPublica already did a lot of the hard work to clean up this dataset and make it ready for machine learning, so this was not hard to model. It has about 12,000 data points; about 40% of the people in the dataset did re-offend over some period; and it has a lot of useful features: demographics, for sure, like age, race, and gender; priors, the number of times they've been arrested before; and some other basic information. So it seems like we might be able to come up with a predictor that's reasonably good here, and in this particular case we're going to try to predict non-violent recidivism. It's a pretty straightforward classification problem. With that, let's start going through some things we can do.

One thing we can do is nothing. What if we just threw all this data into a model and asked: how fair, or how bad, is it if we do nothing? Let me not spoil it; let me go to Databricks. This is Databricks, for those who haven't seen it. If you are a data scientist, the thing you need to know is that it's notebooks in the cloud. It's a lot more than that, but the primary interface is notebooks, so this won't be different if you're doing all of it in Jupyter or whatever. Let me skip through some of the setup. Let's just say I read the data and wrote some code to build a model. I went a little fancy here: I did some hyperparameter tuning with Spark and Hyperopt. And, as a quick aside, if you've ever heard of MLflow, it's integrated nicely into Databricks for model tracking and model management, and you can use it outside Databricks too. If you run standard open source tooling like XGBoost and Hyperopt, it logs all the work pretty much automatically. So as I build this first model, I can look at what the hyperparameter tuning process looked like, drill into the results, and even compare them. I'm not going to say much about this; it's just a taste. If you like MLOps, take a look at MLflow; that's what I used here to track and manage all the models. But the point isn't the modeling process itself. Suffice to say, I threw some standard open source tools at it, XGBoost did pretty well, I let a Spark cluster tune it for an hour or two, and I got a model out. I could show you the metrics; its accuracy is OK, around 70%. But what I'm interested in is stuff like this.
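(For reference, here is a minimal sketch of the kind of baseline training loop described above: an XGBoost classifier tuned with Hyperopt and tracked with MLflow. The label column name, search space, and encoding step are assumptions for illustration, not the exact notebook code.)

```python
import mlflow
import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.autolog()  # auto-log params, metrics, and models for supported libraries

# Assumed: `df` is the cleaned ProPublica COMPAS table with categorical features
# already encoded numerically, and a hypothetical binary label column.
X = df.drop(columns=["two_year_recid"])
y = df["two_year_recid"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(params):
    # Train one candidate model and report its (negated) accuracy as the loss.
    with mlflow.start_run(nested=True):
        model = xgb.XGBClassifier(
            max_depth=int(params["max_depth"]),
            learning_rate=params["learning_rate"],
            n_estimators=200,
        )
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        return {"loss": -acc, "status": STATUS_OK}

search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

# On Databricks you could pass trials=hyperopt.SparkTrials(...) to parallelize
# the search across a Spark cluster, as described in the talk.
best = fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=50)
```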
These are confusion matrices for this classifier, broken down by race; in particular, defendants who are not African-American and who are African-American. Now, race is not the only salient factor in this dataset, but it is the one the ProPublica study was looking at, so I figured we'd look at that question. Obviously it's no more or less important than unfairness based on age or gender, for example. As is typical of confusion matrices, the predicted label runs across the columns and the true label across the rows. And you can already see something is different here: in the middle, the right column is more blue, so the model is clearly more willing to predict "will re-offend" for African-American defendants. This is just the difference of the two confusion matrices, and you can see the difference here, a 9% to 16% higher rate. That doesn't sound right; that looks unfair. But, as we said, it kind of depends on your definition of unfair. So instead of just eyeballing some confusion matrices, why don't we compute some of those standard fairness metrics? I skipped over a little bit of utility code, which I may come back to; the good news is there are tools for that. One tool I want to call your attention to is Fairlearn. Let me call it up here. Fairlearn is an open source package from Microsoft, and it's a toolkit that does a number of things we'll see in this example. Among other things, it gives you tools to compute these fairness metrics, like equalized odds. So in this example I use Fairlearn to compute those metrics; a sketch of what that computation might look like follows below. Let's take a look at them; it's a little hard to see, so I'll zoom in. This breaks down a few important metrics by the actual label (did they recidivate? that's the zero on the left) and then by race. Recall, the true positive rate, and the false positive rate are the two things I think we're interested in here, and you can see they differ by race. For example, the false positive rate is about 20% higher if you look at the group that did not actually go on to commit another crime, and among those that did, the true positive rate is about 22% higher. So there's a difference in those metrics, and those are more the metrics people generally accept as the thing you look at to decide whether this is fair. That's a big difference, and I think most people would say that's a problem; there's something going on here.

OK, so what do we do about this? How can we fix it? How can we get those numbers closer? Well, as I mentioned at the outset, one thing you might do is just throw out the demographic features. Let's throw out race, let's throw out gender from the input and see what happens. Now the model doesn't know, so it can't use that information as a predictor. Let's say I did all that again: I retuned my XGBoost model, the same deal, just with less input. What do we get? It looks pretty similar; it's a very similar story. The differences are smaller, and if you look at the differences in the actual false positive and true positive rates, yes, they're smaller, but they're still pretty much there: 18% and about 20% for false positive and true positive rate. So it didn't do much. And maybe that's surprising.
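(A minimal sketch of computing those per-group metrics with Fairlearn's MetricFrame, assuming `model`, `X_test`, and `y_test` from the sketch above, and a hypothetical `race_test` Series holding each test row's group label; the exact grouping and column names in the notebook may differ.)

```python
from fairlearn.metrics import MetricFrame, false_positive_rate, true_positive_rate

y_pred = model.predict(X_test)

# True positive rate (recall) and false positive rate, broken down by group.
# Each metric is already conditional on the true label, which is the key part
# of the equalized-odds style comparison described in the talk.
mf = MetricFrame(
    metrics={
        "true_positive_rate": true_positive_rate,
        "false_positive_rate": false_positive_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=race_test,  # assumed: group labels aligned with X_test
)

print(mf.by_group)      # per-group values
print(mf.difference())  # largest between-group gap for each metric
```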
You may have been sitting there thinking: surely, if the model doesn't even know each person's race, how can it come up with different answers? At this point, we might dig in a little deeper and try to understand what the model is doing with its input. This is where one of my favorite tools comes in. It's right there in the title: it's called SHAP, for SHapley Additive exPlanations. If you haven't seen this, you've got to use it. It's a great tool, it's easy to use, it makes beautiful plots, and it does something really cool: it explains what models are doing at the prediction level. It decomposes the model's prediction and attributes it to the inputs. Because your age was this, that's good for, say, a plus 3% chance that you re-offend; and because your gender was that, that's good for minus 5%. It does a lot of great stuff, but the most important thing it does is compute the SHAP values, decompositions of the prediction attributed to the inputs, for every input to the model. Again, it's an open source package, it's everywhere, and it's easy to use anywhere. Try it.

So let's throw it at this problem. I'm going to use SHAP to explain my XGBoost model, and I get a plot like this. It tells me which features are most important to the model, from top to bottom, and the most important feature, it says, is priors count: the number of times the person has been charged or convicted before. How to read this: every dot is a person in the input. The greenish dots are where that feature's value is high, so people with many priors are in green; where it's low, maybe zero, they're in that purplish color. Their horizontal position is the SHAP value, the effect on the model's output. The nice thing is that SHAP values are in the units of the model's output; here it's probability, so these values are percentages. You could basically say that people with zero priors, probably these people here, get about a minus 5% chance of re-offending attributed to that, on the whole. And this feature is the most important because its positive and negative effects have the largest average absolute value. Do demographics show up? Yes: it looks like SHAP attributes about a plus 1 to 2% chance of re-offending to being male; that's when the feature is in green. Race is here too: for African-American the effect is small, plus 1 or 2%, and even being Caucasian apparently gets a slightly smaller positive contribution, according to the model. So one general takeaway is that the model is saying these demographic features don't really matter that much, so maybe it's not surprising that when we take them out, the model doesn't change much. You might stop there and say: OK, that tells me this isn't that biased; the model says these features aren't important, I could take them out or leave them in, they're not doing a lot, the absolute SHAP values are just small. Maybe; that's one valid way to reason about it. I should say you don't draw causal conclusions from this. This isn't saying that being male means you are more likely to commit crimes. It's explaining the effect of being male on the model's prediction, which is related, but let's be careful about our conclusions.
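(A minimal sketch of the SHAP step described above, assuming the XGBoost model and data splits from the earlier sketch. Passing background data with `model_output="probability"` is one way to get values in probability units, matching the reading given in the talk; treat the details as illustrative rather than the exact notebook code.)

```python
import shap

# Explain the trained XGBoost model. With background data supplied and
# model_output="probability", the SHAP values are additive in probability
# space rather than log-odds.
explainer = shap.TreeExplainer(
    model,
    data=X_train.sample(1000, random_state=0),
    model_output="probability",
)
shap_values = explainer.shap_values(X_test)

# Summary ("beeswarm") plot: features ranked by mean |SHAP value|,
# each dot is one person, color is the feature value, x position is the
# contribution to that person's predicted probability.
shap.summary_plot(shap_values, X_test)
```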
Along the way, it's also worth thinking a little harder about what this data is actually saying. We're trying to predict whether someone will go out and commit another crime, right? But the dataset tells us whether they were arrested, and that's not the same thing. It's not necessarily true that being arrested happens exactly when you actually committed a crime. So we also have to be careful about causal conclusions for that reason. The dataset isn't quite representing what we would like it to; it's close, but not the same. So be careful with direct causal interpretations here, but it might still be enough to help us reason about bias.

OK, so if the effect of these demographics isn't that big to begin with, what else can we do? As advertised, maybe we can force the model to equalize those metrics: prioritize a similar false positive and true positive rate across the different sensitive subsets of the data, while also trying to maintain, say, accuracy. These things are in tension, but it can do its best. And that's where Fairlearn comes in again. The thing I like about it is that it's pretty easy to slap onto your existing modeling process. Here's my modeling process; there's really not much to it, especially for people who have seen XGBoost: it's essentially just building an XGBoost classifier. Fairlearn has a couple of tools you can wrap around your modeling process to change how it prioritizes things. One thing it can do is vary the thresholds it uses to decide whether a given probability is high enough to call a positive prediction rather than a negative one, and have that threshold differ across different subsets of the data. Let me go to the slides to explain that. For example, we might have input like this, we build this XGBoost model, and XGBoost produces probabilities. Normally we might say: if the probability is over 50%, we'll say yes, this person is going to re-offend. We can of course choose that threshold differently; it doesn't have to be 50%. But it also doesn't have to be 50% for all inputs. We could choose different bars, different thresholds, for different subsets of the data. Maybe you have to see greater than 55%, say, if the input says the person is male, in order to predict that they will re-offend. That's how this particular tool works: it finds thresholds that optimize these fairness metrics while doing as little damage as possible to overall accuracy. Now, you might look at that and say: that doesn't seem right. We're literally setting a different bar for different people. Is that OK? Maybe, maybe not; I don't know. In this case, maybe the ends justify the means; or maybe you say that's directly unfair. I think it depends on the problem you're solving whether this feels appropriate, or is appropriate. But that's what it does, and this is one way to do it. Fairlearn has a couple of other tools like this; for example, it can also try to achieve this by reweighting the inputs during training. That's a little more involved and takes longer, so I didn't do that. But let's see what happens.
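(A minimal sketch of the thresholding approach described above, using Fairlearn's ThresholdOptimizer. The choice of race as the sensitive feature and the hypothetical `race_train` / `race_test` Series are assumptions based on the talk, not the exact notebook code.)

```python
from fairlearn.postprocessing import ThresholdOptimizer
import xgboost as xgb

base = xgb.XGBClassifier(max_depth=5, n_estimators=200)

# Wrap the base classifier. ThresholdOptimizer learns per-group decision
# thresholds that approximately satisfy equalized odds (similar TPR and FPR
# across groups) while keeping accuracy as high as it can.
mitigator = ThresholdOptimizer(
    estimator=base,
    constraints="equalized_odds",
    objective="accuracy_score",
    prefit=False,  # fit the inner XGBoost model as part of fit()
)

# The sensitive feature is passed separately; it need not be a model input.
mitigator.fit(X_train, y_train, sensitive_features=race_train)
y_pred_fair = mitigator.predict(X_test, sensitive_features=race_test)
```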
So we can slap this into our modeling process and then optimize this overall model, the XGBoost classifier wrapped in this optimizer that fits and adjusts the thresholds internally. And again, it was logged with MLflow and tuned with Hyperopt and Spark. OK, this looks better. Just glancing at these confusion matrices, they look a lot more similar; the differences are a lot smaller. Good, this feels better. And if you look at the fairness metrics, they're much closer: the false positive rate and true positive rate differences are about 5%. In fact, you can tell Fairlearn roughly how close you want this to get; I think the default is 5%, so it has tried to keep as much accuracy in the model as it could while getting the difference down under 5%, and you can vary that, of course. So what's the downside? Overall accuracy. For example, I think if you look, you'll see that accuracy for the people who did actually re-offend is lower now. Qualitatively, you can see it already: the model is just generally less willing to say, yes, this person will re-offend; the numbers in the right columns, those percentages, are just lower. So this model probably has fewer false positives but more false negatives, and overall the accuracy is a little lower. Maybe that's a good trade-off, because the problem here was the preponderance of false positives for a certain group of people, and maybe that's acceptable. But you are making a trade-off by doing this; there's no free lunch here.

Last thing; I know we've only got a couple of minutes, time goes fast. I want to introduce one last idea that I think is simple and maybe even more compelling in some ways. What if we use SHAP for this? Remember, SHAP helped us break down the prediction and attribute to each value in the input its contribution to the overall prediction. So why don't I throw in all these features, like race and gender, let SHAP tell me how much they added to the outcome, and then back that out? What SHAP does, for example, is take some inputs and their predictions and decompose the overall probability: maybe being age 26 was good for plus 5% for one person, and being age 41 was good for minus 10% for another. Age and gender here are sensitive features, so why don't I just add up their contributions to get the aggregate effect, subtract it from the raw probability the booster comes up with, and then make the decision on that? That's all this is doing. It seems more principled, in a way I like. And it's not a new idea; it's in the SHAP documentation, just not quite in so many words. It's easy to implement, too: you just have to add an explainer, and this is where we adjust the model's predictions; this part is just a custom wrapper so I can log and deploy it with MLflow. It's a couple of lines of code. And what happens? Take a second to guess what might happen if we do this. The answer: we're still kind of back where we started. Remember, we said the effect of race, gender, and age was small to begin with, according to the model, so backing it out didn't change things much. It is a little closer than even that second example, which I think is good.
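(A minimal sketch of the "back out the sensitive contributions" idea described above. The sensitive column names and the 50% decision threshold are assumptions; the talk's actual MLflow wrapper for logging and deploying the adjusted model is not shown.)

```python
import numpy as np
import shap

# Assumed: the sensitive attributes are (encoded) feature columns of X_test.
sensitive_cols = ["race", "sex", "age"]
sensitive_idx = [X_test.columns.get_loc(c) for c in sensitive_cols]

explainer = shap.TreeExplainer(
    model,
    data=X_train.sample(1000, random_state=0),
    model_output="probability",
)
shap_values = explainer.shap_values(X_test)  # shape: (n_rows, n_features)

# Aggregate contribution of the sensitive features to each prediction...
sensitive_contrib = shap_values[:, sensitive_idx].sum(axis=1)

# ...and subtract it from the raw predicted probability before thresholding.
raw_prob = model.predict_proba(X_test)[:, 1]
adjusted_prob = np.clip(raw_prob - sensitive_contrib, 0.0, 1.0)
y_pred_adjusted = (adjusted_prob >= 0.5).astype(int)
```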
That difference in the fairness metrics, though, is still there. It's smaller, but only a little. In some contexts you might say that's not good enough; in others you might say, OK, that's better, and this feels like a more principled thing to do: I'm actually trying to isolate the effect, the bias, that the model is somehow attributing to these sensitive features, and just yanking it out. It's easy to do with tools like SHAP and a little bit of code. At the end I tried to plot the results, and I think the overall takeaway is that unless you force the metrics to be the same, in a case like this they're going to come out roughly the same no matter what you do. That won't be true in all cases, though.

Last bonus round, which I don't have time to get into, unfortunately. One really cool thing you can do with SHAP is use it to cluster people, not on their attributes, but on how the model treats them. You can cluster their SHAP values; it's a very natural thing to do, since they're all in the dimensions and units of the model's output. This might be useful because, on average, I might say, as in this case, race and gender aren't that big a deal, but that does not mean they're not a big deal for everyone in the dataset. There may be some outliers where, for some reason, the model attributes a lot of the outcome to their demographics, and those are probably the handful of cases we're really interested in; we need to go investigate those. So you could cluster these. Here's the clustering run through t-SNE, and that's another interesting result, I think. You could use SHAP to go find the cases that need more investigation. And that, I think, is all I have time for. Thank you very much. Since we have one minute, if there's a question, I'm happy to answer it live. I'm also happy to receive email if you'd like a copy of the notebook or the slides; happy to share those as well. So go for it.

One quick question: do you think the bias issue is more of a policy issue or more of a technical issue? What's your perception?

Is it a policy issue or a technical issue? Oh boy. Where does the bias come from? As I said, I think bias comes from the real world for the most part, and we're just learning about it through the data and the models. I don't think it's largely a technical issue, though it can come up through the data collection process. The technology, if anything, is the hero: we can detect the problems, highlight them, and maybe try to counteract them, but it also isn't the solution in and of itself. I mean, let's say we found there was rampant bias here; we can't go fix Broward County, Florida's jail system, but at least we can investigate it. One more? Do we have time? Oh, she's got a mic for you there. OK, thank you.

Is there a metric for fairness? Like we have the Kappa score for agreement, is there a standard metric for fairness? If we just rely on recall, or on true positive or false positive rate, then we're only saying we don't see the difference we saw in the first place. So is there a customized metric for fairness in the model, anything you came across while exploring this?

I'm not sure, so: there are certainly different definitions of fair and different metrics.
I've put links to some key papers in the slides. There's at least one paper that goes through about 20 different definitions, and they're often variations on a theme. The common ones I see most often are equalized odds and equal opportunity, and a lot of these are supported in Fairlearn, so you can just look at Fairlearn's docs and see the options. I imagine you could customize them too. But you're right: you really need to think clearly about what you're trying to optimize for in these frameworks before you go optimize for it. I just defaulted to equalized odds, but in another case maybe that's not appropriate; maybe I'm only interested in the true positive rate, for example. So it depends. One more, then I'll take other questions offline.

Just curious, have you ever rerun SHAP after you used Fairlearn to readjust the model? Do you know whether the race factor alone is what's shaping the final result, or do other factors get adjusted as well?

So the SHAP values on the raw, plain-vanilla model already show that the demographic features aren't that big a factor. I didn't show it here, but yes, you can rerun SHAP on the corrected model from Fairlearn, and the values move much closer to zero. Actually, I'm not sure I did that for the Fairlearn model; I did it after correcting the SHAP values, which by construction makes them just about zero. But I think that's what you would find.

Yeah, because what I imagine is, if you have a 20% difference when you're doing nothing, then only zeroing out that 1% or so isn't going to do much. I was just wondering what the other factors are that actually come into play; maybe those are the clear factors over all the others.

If you take them out, it's very similar. If you do this for the model without the sensitive features, it's very similar; you just don't see the features that no longer exist. The ordering is the same; it's mostly priors count and age that seem to matter. Well, thank you, everyone. I appreciate it. I'm happy to talk offline if you have more questions. Thank you.