Welcome to this tutorial introduction to Responsible ML, prepared for useR! 2021. There are four tutors, and I will introduce them briefly. We are from the MI² DataLab at Warsaw University of Technology. MI² is a larger group of people; we develop a lot of software and do a lot of data analysis, and you will learn more about our activities later today. But let's start with the tutors. There are four of us, so we will begin this tutorial with short introductions. It is easiest for me to start with myself. I work as an Associate Professor at Warsaw University of Technology. I am primarily interested in Responsible Machine Learning, and I am also very interested in visualization, both data visualization and model visualization. Recently I have been focused on machine learning operations, which I think is a very important part of Responsible Machine Learning. I am also a software developer; I maintain and develop a few R packages, and you may know some of them. During this workshop I will show you one of them, called DALEX, but you will also learn about others. Feel free to contact me: there is a link to my LinkedIn page, and it is quite easy to find my email address, so if you have a question, please get in touch and I will be very happy to respond. That's about me. Hubert, maybe you will continue. Sure. My name is Hubert Baniecki. I am a master's student at Warsaw University of Technology, where I study data science, and I am currently most interested in explainable machine learning. I am also developing some tools for interactive model analysis, which I will be showing today. Feel free to contact me on LinkedIn or by email, and I hope we will have some fun today. Okay. Hello, everyone. My name is Anna Kozak. I graduated in Mathematical Statistics at the Warsaw University of Technology. I am interested in explainable AI and data visualization.
Feel free to contact me. Okay, so hi everybody. My name is Jakub Wiśniewski and I am a research software engineer at the MI² DataLab and also a data science student at Warsaw University of Technology. I am interested in fairness and responsible applications of deep learning. Feel free to contact me as well. As you can see, we are sitting in the same room, so it is easier for us to communicate, but it also means that if there is any problem with the internet, you will lose all of us at once. Let's hope for the best. Okay, so today we have about three hours, and these three hours are divided into four parts. Before the break we are going to tackle two elements of responsible ML. First, I will briefly talk about the lifecycle of predictive models, and then I will talk about explainability in machine learning, showing the pyramid of explanation methods and some deeper analysis and exploration of models. After the break, we will tackle two very new, very fresh solutions: one is related to fairness and the "do not harm" principle, and the second is related to automation. We will see how to use modelStudio and Arena to automate all the things that we cover during the first part of the talk. If you have any questions, let us know, either here in the chat (I see that people are already using it, so it's working) or on the Slack channel. With Slack it is easier to maintain the materials and discussions after the workshop, but during the workshop the Zoom chat is also fine. All materials are on a GitHub web page; I will just switch my screen to this page and show you where you can find and download them. Okay. So with this link, tinyurl.com/RML2021, you will find a web page, and the parts of the tutorial are linked in the materials section, where you will find the PDFs. You can also access our code and, in some cases, the knitted HTML.
So you are welcome to execute all the commands and do the exercises along with us, but if you prefer to skip some parts, or to read in advance what will happen in a few minutes, you are of course free to do so. The data is linked here, so you can download it, and there are also links to other tutorials that we gave recently. You can easily access this web page. So we started with an introduction and a few slides that we have already covered, and now we will move to the second part, the model lifecycle. Let me again change the screen. If you have any problems accessing the materials, let us know; I hope it will be easy, but in any case, let us know. The first part will be related to modeling and explainability. To be honest, we prepared something special and unusual for this part; it is a kind of experiment, and you are the guinea pigs we are experimenting on, because we moved the notes for this part into a comic book combined with a real book. You will find a link to this PDF on the web page. Please download it; it is about 30 megabytes, so it may take some time, but it will make it easier to follow the tutorial. The idea behind using a comic book is that we would like to present three perspectives, and it is easier to distinguish these perspectives if you use different visual styles to show them separately. The first perspective is that when you talk about responsible machine learning, you cannot forget about the math and the algorithms, so on the following pages we will also look a bit at the math and the algorithms; there need to be some formulas. The second perspective is that to do responsible machine learning you need good tools. As R users we already have a very good platform, but there are a lot of packages, and we will be making some recommendations about which packages to choose. Sometimes you can find books or materials that are focused exclusively either on algorithms or on software.
So here we would like to share with you also the third component, the process, because it is important. You can have very good tools, and you can show the behavior of different methods or models, but still, data science exploration is unique in that it is very iterative, so we would like to talk about this process as well. On the following pages you will find a bit of each of these three components, presented in different visual styles: the parts related to the process are sketched as a comic book, split into episodes every two or three pages; the software is marked with "R snippet" sections; and the algorithms in most cases come just after the comic part. In the first part of this tutorial, I will be using this PDF to go through all these elements. We will do some hands-on exercises, but all the materials are here. Also, I really like using visual memory, so for me, instead of putting all these notes and commands on slides, I will be using this PDF as a guide, to remember what is where and why we are talking about a given part right now. These three elements will be tackled throughout the workshop. The idea is to follow the adventures of two characters, Beta and Bit, and this is important because they also represent two very different approaches to modeling: Beta is more mathematical, maybe more calm, more oriented towards theory, while Bit is a hot-headed person who, instead of reading the theory first, just tries things out. We will not read the comic aloud, but as you go through it you will see what the process looks like. During this workshop, we are going to use a data set related to the COVID pandemic, first of all because it is a current, topical subject.
All of us have experienced the pandemic, and in our lab we actually had the occasion to work on real data related to mortality. Based on this real data we have prepared some artificial, but realistic, data sets on which we can practice both modeling and model exploration. So it is a very interesting use case, but please remember that the whole pipeline, the whole stack of tools and methods, also applies to other problems: whether you are doing credit scoring, survival analysis, or something else, the same tools and the same approaches work. Of course the details differ a little, but the process is similar. So the use case will be related to COVID. We will do some data exploration, then some model assembly, and then some model evaluation, and different parts of this tutorial will be more focused on exploration, on model assembly, or, closer to the end of the first part, on model evaluation and explanation. Then we would also like to tackle the last mile, the last step: delivery and automation. So you will see a bit of all of these elements. And if anything is presented too quickly (we have only three hours), remember that you have all the materials, you can read them later, and if anything is still unclear, let us know. So this is the plan; let's start. We will start with a first model, and it might be a surprise, because usually when we think about models, we think that we need to have data. If you think about data science competitions, or about other courses related to machine learning, they quite often start with a data set: you have a data set, let's explore it, let's train a model. But in many cases, even before you have any data, you can find a lot of interesting information on the internet, and that is the case here as well.
So our goal in this part of the workshop will be to prepare a machine learning model for COVID mortality prediction. But before we start doing this on a real data set, I would like to show you how to create a model without data. Here is how. Quite often you can find some published statistics; in this case, the CDC publishes statistics related to COVID mortality, and this is a screenshot from their website. In this screenshot you can see how they assess the relation between mortality and age group. If you visit the website today, you will find slightly different numbers, because they update this information every few months, but the example is prepared for the data shown here. So let's create a model in R and validate it based on this information: without data, starting from domain knowledge alone. First, in this tutorial we treat a model as a function, just a standard mathematical function that takes an observation, a d-dimensional vector, and turns this observation into some prediction, some score. So, to create a model, we need to prepare a function, and here is the R function. It is quite easy: it takes a data frame which is supposed to have an age column, and in these lines the age is checked against a set of intervals; according to the age group, we assign the corresponding relative risk to the patient. So this is a standard R function. In fact, all the models we will be working with are just R functions. In most models, some of the parameters are trained on data; here we have the luxury that all these parameters were already known and published on the website. But it is still a machine learning model.
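A minimal sketch of such an age-based risk function may look as follows. Note that the age cutoffs and relative-risk values below are illustrative placeholders, not the actual CDC figures used in the tutorial.

```r
# A "model without data": a plain R function that maps age to a relative-risk
# score. The cutoffs and risk values are illustrative placeholders only.
cdc_risk <- function(x) {
  ifelse(x$age < 30, 1,          # reference age group
  ifelse(x$age < 50, 4,          # each branch returns the group's relative risk
  ifelse(x$age < 70, 25,
                     90)))
}

patients <- data.frame(age = c(25, 45, 81))
cdc_risk(patients)
```

This is all a model needs to be here: a function from a data frame of features to a numeric score.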
In this tutorial, to work with such models, we need to create a wrapper around them, because different models have different structures. This one is just an R function, but some models are complex structures, sometimes S3, S4, or R6 classes, depending on the library used to train the model. To have a uniform interface around models, we will be using the DALEX package. With DALEX you can take a model, wrap it with the explain function, and specify how predictions should be extracted from it. Here it is easy, because the model is just a function, so it is enough to call it, but you will see later that it can be more complicated: for some models you need specific arguments and parameters to extract predictions. So this is our first iteration, and the important points are these. You don't always need data to create a model; some models can be learned from the literature. We will treat models as R functions. But, importantly for this tutorial, we also need this extra layer of abstraction over models, which unifies different models: using these DALEX wrappers, regardless of the internal structure of a model, you can always use the predict function to calculate predictions from it. I guess those of you who have experience with R know that some models by default return vectors of numbers, some return data frames, and some return categorical variables; this is why the explainer is needed, to have a uniform interface. Okay, so this is the first model. Now what should we do with it? Of course, we should check how good it is. So in the next part of the story, Beta and Bit try to find suitable data for validation. Here we have the luxury that we will in fact have two data sets: one will be used for training and the second for validation.
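Wrapping the hand-written function with DALEX might look like the sketch below. The risk function is a placeholder redefined here so the snippet is self-contained; for a plain function, the predict function is simply a call to the function itself.

```r
# Wrapping a plain R function in a DALEX explainer gives it the same uniform
# interface as any trained model.
library(DALEX)

cdc_risk <- function(x) ifelse(x$age < 65, 1, 10)  # placeholder risk function

model_cdc <- DALEX::explain(
  cdc_risk,
  predict_function = function(model, newdata) model(newdata),
  label = "CDC"
)

# predict() now works uniformly, whatever the model's internal structure:
predict(model_cdc, data.frame(age = c(30, 70)))
```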
Of course, sometimes you have just one data set to work with, and you need specific techniques to split it into training and validation parts. But in this use case we have the luxury of two independent data sets, and in this story it is even nicer, because these are so-called out-of-time data sets. If you just split a single data set into two subsets, which is quite common, then the two subsets are very similar to each other. But in the real world it is quite frequent that over time some relations change, or some distributions change; this is called data drift or model drift. So here we have samples of data from two different time intervals, one from spring and the second from summer, which lets us validate the model on so-called out-of-time data. Okay, so let's validate. First we need to read the data, which you can download from the GitHub web page. There are two CSV files: one is called covid_spring and the second covid_summer. We will be using covid_spring for training and covid_summer for validation. These are standard CSV files, and in the remaining part of the materials you will work with these two objects. There will be some time for hands-on exercises, but if you would like to follow these examples along with your console, typing all the commands as we go, you can copy them from the PDF, although that is not ideal, because the assignment arrow is not copied correctly. The better approach is to open the Rmd file: all the commands and descriptions are in the Rmd file (you will not find the comic book there), and you can execute the commands line by line. So you can just follow these examples line by line.
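Reading the two files is standard; a sketch, assuming the CSVs were downloaded to the working directory:

```r
# Reading the two out-of-time data sets. File names follow the tutorial;
# adjust the paths to wherever you downloaded the materials.
covid_spring <- read.csv("covid_spring.csv")  # used for training
covid_summer <- read.csv("covid_summer.csv")  # used for out-of-time validation

# quick sanity check that both files share the same columns
stopifnot(identical(names(covid_spring), names(covid_summer)))
```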
And again, if you have any problem accessing the materials, let us know; we have a few people on the chat who can assist you. So where are we? We have two data sets. Then, of course, as good statisticians, we should explore the data. Here the exploration is relatively simple; we just draw a few basic summaries. The classical approach would be to do a lot of exploration and only then head for modeling, but in our process the iterations are very quick, so the exploration at the beginning is just a simple look at the data; later we will show you how to use model exploration to learn, even more quickly, the relations captured by the model. So here we start with a histogram for the two groups, and what we see is expected: we know from the media and other sources that older people are more exposed to COVID, and the red color here stands for people who died. So we see that the older age groups are in fact more exposed to the virus. The data also contains some interesting features such as comorbidities, like diabetes or other diseases, so you can also use a mosaic plot or other techniques to see the link between a given feature and the target variable. Okay. I have found that in medicine it is very frequent that medical papers contain a so-called Table 1, a table that summarizes the features of the studied subjects. Because it is requested by medical journals, you will also find a few very good R packages that prepare such summaries. Here you can use the tableone package. It has a very simple function, CreateTableOne: you just specify the variables, the data, and the stratifying column, and out of this you get a nice HTML table, a LaTeX table, or just a markdown table with the summary. So here you have all the features and variables that are present in the data.
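A sketch of such a "Table 1" with the tableone package; the column names below (Age, Diabetes, Death) are illustrative placeholders, so substitute the columns actually present in covid_spring.

```r
# Sketch of a "Table 1" summary with the tableone package.
library(tableone)

covid_spring <- data.frame(                      # placeholder data
  Age      = c(34, 71, 58, 45, 80, 62),
  Diabetes = c("No", "Yes", "No", "No", "Yes", "No"),
  Death    = c("No", "Yes", "No", "No", "Yes", "No")
)

tab1 <- CreateTableOne(
  vars   = c("Age", "Diabetes"),  # features to summarize
  strata = "Death",               # split the summary by the target
  data   = covid_spring
)
print(tab1)
```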
Then you have information about the number of cases in both groups, those who survived and those who died. What is important here, and it is very common, is that you should not model without checking what is in the data, because sometimes the data contains information you must not use. The backstory here is that we would like to prepare a model that will suggest who should be vaccinated: who is more exposed to the risk, assuming that whoever is most at risk should be vaccinated first. Now, some of these features, as you see, were collected during interviews by the Polish counterpart of the NIH, but we should not use some of these variables, for example the last few, like hospitalization or fever. These are very important variables; if you calculate p-values or summary statistics, you will see that they are probably the most predictive ones, but we should not use them, because we will not have access to these features in advance. You can build a model with very high performance using such forbidden features: during training they may be in the data, and during validation, perhaps because of some quirk of the process, they may also be in the data, but in a real application they will not be available at prediction time. So in this exercise we use the exploration both to recognize that there is some relation between the target variable and particular variables, and to decide that some features should be removed. This is a very important step: there are many examples in which one can get a model with very high performance, but, because of some data leakage, the performance is too optimistic. So the first iteration ends with a selection of the features that can be used for modeling. Okay, so we have our CDC model, a basic model, and we have data that can be used for validation. Let's validate this model. Validation is very important.
In a second you will see, and probably you already know, that there are a lot of different measures that can be used for validation. What I would like to highlight here is that the selection of the measure is very important, and it should of course be related to the research question we have in mind, because using the wrong measure may lead to a wrong model. So let's focus for a moment on the different choices that we have, and then I will explain why I see one of them as the best choice in our case. For regression you probably know measures like mean squared error or root mean squared error, but we are working on a classification problem, and I find it quite amusing to use a pregnancy test as an example for explaining the confusion matrix in binary classification. Talking about different measures of model performance, you will find many of them: sensitivity, specificity, precision, recall, accuracy, F1, and so on. All these measures are actually based on four values: true positives, false positives, false negatives, and true negatives. To quickly define these, imagine a very simple test for pregnancy. There is a binary state, someone is pregnant or not, and a binary symptom, someone has morning sickness or not. These two are related, and taking the statistics reported for this example, you can see that this simple morning-sickness test has a given sensitivity and a given specificity: it is not that sensitive, but it has quite high specificity. Sensitivity is calculated as the fraction of true positives among all true positives and false negatives, while specificity is the fraction of true negatives among all true negatives and false positives. Of course, you can define other measures as well, and here you have a short table of them.
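The two definitions just given reduce to simple arithmetic on the four cells of the confusion matrix. A small helper, with made-up counts for illustration:

```r
# Sensitivity and specificity computed from confusion-matrix counts.
confusion_measures <- function(tp, fp, fn, tn) {
  c(sensitivity = tp / (tp + fn),  # true positives among all actual positives
    specificity = tn / (tn + fp))  # true negatives among all actual negatives
}

confusion_measures(tp = 40, fp = 5, fn = 10, tn = 45)
# sensitivity = 0.8, specificity = 0.9
```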
These measures work pretty well when the test has two outcomes, like has the sickness or has not. But in our case, you remember, what we learned from the CDC web page were actually risk scores, continuous risk scores. In such a case, in order to use these statistics, you need to binarize the continuous score into two classes, lower or higher than some cutoff. This is how the area under the curve is defined: the ROC curve is calculated as a set of points for different cutoffs, where each point corresponds to a different choice of cutoff. So consider choosing different cutoffs: on the previous pages you had relative risks like 10 times larger, 45 times larger, and so on. If you put a cutoff at, say, 50, saying that one group has a risk more than 50 times higher than the reference group and the other group has a risk lower than 50 times the reference, then you can divide all the patients into two subgroups, above and below the cutoff. For different cutoffs you get different tables like this one; for different tables you get different coefficients, like sensitivity and specificity; and from all these coefficients together you get the whole curve, the ROC curve. This curve summarizes how good the underlying test is: if it is random, the curve will be very close to the diagonal, while if the test has very high specificity and very high sensitivity regardless of the cutoff, the curve will be close to a rectangle. So just by calculating the area under this curve, we get a very nice description of how good the ranking is, how good the scores are at sorting patients according to their risk. This is why in these examples we will be using AUC as the performance measure. It is just one of many possible choices; you can find lots of different statistics, but for this use case it looks like a very good choice.
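The cutoff-sweeping construction just described can be sketched in a few lines of base R; the scores and labels here are made up for illustration.

```r
# How a ROC curve arises from sweeping a cutoff over a continuous score.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   0,   1,   0,   0)   # 1 = event (e.g. death)

roc_points <- t(sapply(sort(unique(scores)), function(cutoff) {
  pred <- as.numeric(scores >= cutoff)              # binarize at this cutoff
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & labels == 1) / sum(labels == 1),
    fpr         = sum(pred == 1 & labels == 0) / sum(labels == 0))
}))
roc_points  # each row is one point of the ROC curve (FPR vs sensitivity)
```

Plotting sensitivity against the false positive rate across all cutoffs traces exactly the curve discussed above.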
So we have a data set, we have a model, and we have a measure to assess model performance. Now let me show how to use this model and data to actually calculate the performance. I hope you remember, or even executed, the example with the CDC model created as an R function. Here I have the same lines with two additional arguments: to estimate the performance of a model I need the validation data, because the performance will be evaluated on it, here on covid_summer. So to this explainer I have added two lines that specify the data and the target variable from the validation set. Once this object is prepared (and again, this is a kind of additional cost: in addition to your R function you need to create this extra abstraction, the explainer), you can easily use a large collection of functions prepared for model exploration. For example, the first function we are going to use is model_performance. It is enough to call model_performance on the created explainer to calculate various statistics. The function knows that it is a classification problem because that was specified with the type argument, so the appropriate statistics are calculated and we can see a simple summary of how good this model is on the validation data set. And again, all the examples I am going to show work the same way: first you create an explainer for a model, then you calculate some model explanation, and then all these explanations can be visualized with the plot function. That is what is happening here: you call the generic plot function on the calculated explanation, and it creates a visual summary. There are a few summaries implemented, like the ROC curve, a box plot, or the lift chart; in different domains different summaries are more prevalent.
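The whole performance workflow just described can be sketched as below; the risk function and data frame are placeholders so the snippet stands alone.

```r
# explainer with validation data -> model_performance -> plot
library(DALEX)

cdc_risk <- function(x) ifelse(x$age < 65, 0.1, 0.8)   # placeholder model
covid_summer <- data.frame(age = c(30, 72, 55, 81),    # placeholder data
                           death = c(0, 1, 0, 1))

model_cdc <- DALEX::explain(
  cdc_risk,
  data = covid_summer,               # performance is evaluated on this data
  y = covid_summer$death,            # target variable from the validation set
  predict_function = function(model, newdata) model(newdata),
  type = "classification",
  label = "CDC"
)

mp <- model_performance(model_cdc)   # AUC, recall, precision, F1, ...
mp
plot(mp, geom = "roc")               # other geoms: "boxplot", "lift", ...
```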
For example, in credit scoring it is very common to use lift charts, which summarize the lift in risk of credit default within a subgroup of customers, but we will be using ROC charts. Again, the different statistics can be easily selected with the geom argument. Okay, so what we have is an assessment of how good the model is: the AUC is 0.9. That is pretty high. I told you that a random model would have an AUC equal to 0.5, or close to 0.5, so this is quite close to one. On one hand you may say that's enough: we have a good model, it is confirmed by the CDC, so it comes with official statistics; we could just use the model and that's all. But in the data scientist's spirit there should be a force that pushes you to try other approaches. Maybe you will find a better model; maybe by changing something, changing hyperparameters or changing the structure or class of models being used, you can get better results. So let's see whether for this data we can build a better model with some known machine learning approaches. Probably the first choice would be logistic regression, but here, for a reason I will explain later, I chose tree-based models. Maybe we can get a better model with classification trees. At the beginning of this book you will find links to very good source materials, and of course three hours are not enough to cover the details of every, or even a few, machine learning approaches, so I am not going to tell you in detail how decision trees work. I guess some of you know this already, and some of you can read about it, because it is sketched here. I will just present a very short intuition, because it will be important later, as we build more and more complex models. Currently we are using very simple models that were proposed some 45 years ago; they are pretty old.
Again, you will find a lot of different R packages that can be used to train a decision tree model; I prefer to use the partykit package, which is really good because, first, you have very fine control over different arguments, and second, you can get a pretty good visualization of the model, which is also important. Training a model is super simple with partykit: you just specify the formula, the data set, and some additional hyperparameters, and with one line you train the model; with another line you visualize it. We have a few variables, and the way decision trees are grown is that, first, every variable is checked for possible splits, and the variable that gives the best split is chosen; you can use different measures for the quality of a split, such as Gini or entropy (there is a short discussion on the left page), but here we just use the defaults to see what happens with the model. It turns out that the best candidate for the first split was the age variable, and the best cutoff was 67. Then the procedure is repeated: in the subgroup of patients older than 67, it turns out that the second best candidate for a split is cardiovascular diseases, and the procedure is repeated again. Finally, with the given hyperparameters, you end up with a decision tree with seven nodes. What is very nice about decision trees is that they are very transparent, especially small trees; this one is relatively small, and you can easily see which features are used for specific decisions. So you can see that while the CDC model used only age, it turns out that for people younger than 67 other diseases are also important, like cancer status, while for people who are older, cardiovascular diseases are important.
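A sketch of these two lines with partykit, assuming a covid_spring data frame with a factor Death column; the maxdepth control is an illustrative choice.

```r
# One line to train, one line to visualize a conditional inference tree.
library(partykit)

tree <- ctree(
  Death ~ .,                             # predict Death from all other columns
  data = covid_spring,
  control = ctree_control(maxdepth = 3)  # keep the tree small and readable
)

plot(tree)
```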
So we see that, of course, both conditions contribute to higher risk, but for different age groups either cancer or cardiovascular diseases are the second most important feature. Okay, we have a model; this model was trained on the covid_spring data, so now we can validate the model, and this is where the DALEX explainer pays off: again we create an additional wrapper around the model, but with this wrapper it will be very easy for us to compare different models. Here we create the wrapper using the explain function: the first argument is the decision tree, and then we need to specify how probabilities will be extracted from the tree. Different models have different interfaces for extracting scores. DALEX will figure out the right function (it has predefined functions for known model classes), but here I have specified it explicitly just to show you how this works. You could leave this line out and it would still work, but just to know what is happening underneath, it is good to have it here as well. Once you create this predict function, then in all the following steps you can just use the explainer, and it will always return scores, so you don't need to worry that different models have different structures. Okay, we are using the same data for validation, covid_summer; it is a classification model, and in the plots we like to use the label "tree" to make it easier to recognize different models. The model itself is totally different from the first one: the CDC model was based purely on domain knowledge, while this model is based on data. But the way we explore the model is totally uniform: you can use the model_performance function, calculate all the statistics, and plot them with the plot function. Just to highlight what is important here: a more complex model, and a higher AUC. That's good.
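Wrapping the tree might look like the sketch below. The explicit predict function shows what DALEX would otherwise infer on its own; the "Yes" level of the target is an assumption about how the data is coded.

```r
# DALEX explainer for the partykit tree, with an explicit predict function:
# ask the tree for class probabilities and keep the positive class.
library(DALEX)
library(partykit)

model_tree <- DALEX::explain(
  tree,                                   # the ctree trained above
  data = covid_summer,                    # out-of-time validation data
  y = covid_summer$Death == "Yes",        # assuming a Yes/No target column
  predict_function = function(m, x)
    predict(m, x, type = "prob")[, "Yes"],
  type = "classification",
  label = "tree"
)
```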
When you use the plot function, you can specify as many explainers and explanations as you wish, and they will all be plotted on the same chart. It is a very frequent situation that you develop several models with totally different structures and then need to select one of them, so it is pretty convenient to put them on the same chart and compare them against each other. Okay, but decision trees are pretty old; let's try something more sophisticated. The natural next step is an ensemble of decision trees: the random forest. Some 20 years ago, Breiman wrote a beautiful paper on how you can reduce the variance of decision trees by using bootstrap aggregating, with some additional very nice features. This is not the time to go into the details of how random forests work, but just to give you some idea: we combine trees trained on different bootstrap samples. The trees are different because they were trained on slightly different data sets, and then they are aggregated together, by averaging, by some voting, or something like that. So let me now train such a model in R. It sounds complex, because you need to sample, say, 100 or 500 bootstrap samples, but again, you can do this easily with existing packages like ranger or randomForest. Here I am using mlr3, because mlr3 will give some additional benefits that will pay off in a few slides. This line defines a task: this is how mlr3 approaches predictive modeling. You need to specify the id of the task and the data set, which is called the backend because it may be large or may even sit in a database, as well as the target variable. Because it is a binary classification task, you also need to specify which class is the positive one, yes or no. Having defined the task, in the second step you need to create a learner.
A learner is a definition of the algorithm that will be used for training, with all required parameters. So here for the learner we are using ranger, with 25 trees. The third step in training with mlr3 is training itself: you can just use the train function of this learner, and the train function, in place, will find the coefficients for all these decision trees. So it will sample bootstrap data sets, it will train trees, and it will create the whole ensemble. It's kind of a sophisticated procedure, but as you saw, we can do this in like three steps, and then you can wrap this model with the explainer. Again, in the explain function you can just specify the model, and then you can select the right predict function to extract scores from the mlr3 model. You see that this function is different than in the two other examples. For the CDC model we have one predict function, for the decision tree a different one, and here also a different one. DALEX will guess the right predict function, but just to show you how it works, here is an example. So, okay, we have three models. Again, with the same model_performance function you can calculate the performance of this model, and you can plot all three models in a single chart. Fortunately, the AUC is higher and higher. We started with 0.90 for the CDC model, then we have 0.91 for the tree-based model, and here we have 0.94. So it's a pretty good improvement. Random forest is a very good model; it's very frequently used as a default benchmark for tabular data. It's pretty robust, and it has a lot of good statistical properties. But you need to specify some hyperparameters, like the number of trees, the procedure for sampling particular rows, the minimum node size and so on. And this manual tuning can be replaced by some AutoML tool. So for the fourth model, I would like to show you how to use the mlr3tuning package to find the optimal subset of hyperparameters.
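The three mlr3 steps, task, learner, train, plus the DALEX wrapper could be sketched like this; the data frame and column names are assumptions:

```r
library(mlr3)
library(mlr3learners)
library(DALEX)

# step 1: the task - id, backend (data), target, positive class
task <- TaskClassif$new(id = "covid", backend = covid_spring,
                        target = "death", positive = "yes")

# step 2: the learner - ranger random forest with 25 trees
learner <- lrn("classif.ranger", num.trees = 25, predict_type = "prob")

# step 3: training
learner$train(task)

# the predict function for mlr3 differs from the rpart one
explainer_rf <- explain(
  learner,
  data  = covid_summer,
  y     = covid_summer$death == "yes",
  predict_function = function(m, x) m$predict_newdata(x)$prob[, "yes"],
  label = "ranger"
)
```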
So this is probably the most advanced model that will be covered today. And again the whole procedure is described in this sketch. The idea of hyperparameter tuning is to have some method that will suggest hyperparameters, sample hyperparameters, and then you need a good measure to evaluate how good these suggestions are, and you iterate this process until some stopping criterion is met. So you need to specify three things: the space of hyperparameters and the way they will be sampled, the evaluation criterion, and the terminator that decides when the search should stop. So how to do this with the mlr3tuning and paradox libraries? First you need to specify the search space. This search space says that we have the hyperparameter number of trees: let's try different values from 50 to 500. Then we have the hyperparameter maximum depth: let's try values from 1 to 10. And so on for the other hyperparameters, like the split rule: let's see whether it is better to use the Gini split rule or extratrees. Having defined the hyperparameter space, now you can define the whole setting for our model. Here you have the learner; it describes what family of algorithms we are going to use. The resampling is the way the model will be evaluated. We cannot use the validation data for these internal evaluations, so these two lines say that we'll be using AUC, but calculated with five-fold cross-validation. So internally the data will be split into five folds and the performance of the model will be measured this way. Just a small comment: the learning curve here is maybe pretty steep. We started with a very simple model and now we are talking about a very sophisticated machine for AutoML. Just wait a few minutes; I will return to more basic applications in a few minutes. I just want to cover the whole spectrum of possible solutions. So we have the search space, we have the evaluation metric, and we need the terminator. You can specify different stopping criteria.
Here we use a very simple one: the number of evaluations. So just try 10 random hyperparameter sets and choose the best out of these 10. And here we are using random search for selecting these candidate hyperparameters. Having all this defined, again you can just use the train function; in place, it will select the best hyperparameters and estimate the coefficients for the trees in the random forest model. So with this, you have a model, and now we can check how good it is. To do this, again, we need to create a DALEX explainer. The predict function is again slightly different for some reason, but once we wrap the AutoML model with a DALEX explainer, we can forget about these internal differences because we have this uniform interface. So having this explainer, we can now easily calculate model performance and plot the ROC curve. So what do we see here? It's not like this always happens, but in this example, with the AutoML model, you have better results than with default parameters. The random forest model with default parameters was slightly worse. It is not a big difference, because we are very close to one, but still, with AutoML you can get higher performance. So we have four models. Should we stop here? You can ask the question: we trained four models, is it enough? Should we say that because the AUC is the highest, there is nothing else that we can do? I think not. I think that here, actually, the real fun begins. And what you can do more is a more sophisticated analysis of these models. They are complex, and they are very different. With a single tree, you can just plot the tree, but with 100 decision trees in a forest, it's much harder to plot them. Of course, you can do this, but it will be hard to figure out what is happening in the model. And if you additionally have these tuned hyperparameters or other sophisticated models, it will be very complex to understand what is happening.
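The whole tuning setup, search space, internal resampling, measure, terminator and random search, might look roughly like this (exact class and function names can differ between mlr3tuning versions, so treat this as a sketch):

```r
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(paradox)

# search space: number of trees, maximum depth, split rule
search_space <- ps(
  num.trees = p_int(lower = 50, upper = 500),
  max.depth = p_int(lower = 1, upper = 10),
  splitrule = p_fct(c("gini", "extratrees"))
)

tuned_learner <- AutoTuner$new(
  learner      = lrn("classif.ranger", predict_type = "prob"),
  resampling   = rsmp("cv", folds = 5),       # internal 5-fold CV
  measure      = msr("classif.auc"),          # evaluation criterion
  search_space = search_space,
  terminator   = trm("evals", n_evals = 10),  # stop after 10 evaluations
  tuner        = tnr("random_search")         # how candidates are sampled
)

tuned_learner$train(task)  # tunes hyperparameters, then refits the model
```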
So to really understand what is happening, I would like to present to you a few procedures linked with model explainability. This is an important part of responsible machine learning. This is the explainability pyramid; let's see what is there. We already covered the part related to model performance. At the top of the pyramid, you can measure whether the performance of the model is high or not, but there is much more to discover. So in the remaining examples, we will just go down this pyramid, to dig deeper and deeper into the model and to learn more and more about the relations that are in the model. Okay, so the first step will be related to feature importance. For some models we have existing methods for the assessment of feature importance, but they are model specific. So you will have different values for linear models, different values for decision trees, different values for random forests. If you would like to compare different models in terms of the importance of features, you need something that is model agnostic. Fortunately, we have a measure like that. One of them is called permutational variable importance. It works in such a way that you check how much the model performance degrades if a selected column is permuted. So you can imagine our COVID data, and you can see that the initial performance was 0.95; but then, if a column X is permuted, there is a drop in performance, and by checking how large the drop is, you can assess how significant this variable was. So this is a very simple technique, but very, very useful. I think these are really some sleeping beauties that are just waiting for us to discover and use. In DALEX, once you have an explainer, you can simply use the model_parts function: model, because this is a model-level analysis, a global analysis, and parts, because it's related to the importance of parts of the model.
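The mechanism of permutational variable importance can be reproduced in a few lines of base R on toy data. This is only an illustration of the idea, not the DALEX implementation:

```r
set.seed(42)

# toy data: y depends strongly on x1 and not at all on x2
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- df$x1 + rnorm(n, sd = 0.1)

model <- lm(y ~ x1 + x2, data = df)
rmse  <- function(y, yhat) sqrt(mean((y - yhat)^2))

# importance = increase in loss after permuting a single column
permutation_importance <- function(model, data, target, loss) {
  base_loss <- loss(data[[target]], predict(model, data))
  vars <- setdiff(names(data), target)
  sapply(vars, function(v) {
    shuffled <- data
    shuffled[[v]] <- sample(shuffled[[v]])   # "blind" the model to v
    loss(data[[target]], predict(model, shuffled)) - base_loss
  })
}

imp <- permutation_importance(model, df, "y", rmse)
imp  # the drop for x1 is large, for x2 it is close to zero
```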
And with this model_parts function, we can calculate how important particular features are. The idea is that you calculate the model performance with all features without any change, then you permute a column and you see how much the performance dropped; for some columns the drop is almost zero. It's not the case for age: you see that it is a very important variable, and if you permute values in the age column, then the performance of the model drops a lot. So it's an important variable, because if you blind the model to this variable, then the performance will be much lower. So this works for all models. We have four candidate models; let's compare them. Now we can forget about all the differences: these models have very different structures, but once they are wrapped in explainers, you can just calculate the importance characteristics with the model_parts function, and you can visualize these importances with the generic plot function, just by specifying these four explanations. So here is the summary. In these plots, the beginning of each bar is the initial one minus AUC, so it's the initial model performance. And then, the longer the bar, the more you will lose if this particular column is blinded. For the CDC model, the only important variable is age, because, as we remember, it was just a table with a few columns and they were all age groups. In the function that we have created, there is nothing except age, so of course all other variables are not important at all. For the tree-based model it's slightly different, because we have three variables there. When you plot the tree, you see that there are age, cardiovascular diseases, and cancer, and they all have some importance. The initial performance is slightly better than for the CDC model. Age is the most important feature, but there are others as well.
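In DALEX this boils down to one call per model; the explainer names are placeholders for the four explainers created earlier:

```r
library(DALEX)

# permutational importance, measured as the drop in 1 - AUC
vip_tree <- model_parts(explainer_tree, loss_function = loss_one_minus_auc)
plot(vip_tree)

# all four models in a single chart
plot(model_parts(explainer_cdc),
     model_parts(explainer_tree),
     model_parts(explainer_rf),
     model_parts(explainer_at))
```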
And for the random forest model and for the automatically tuned random forest model, you also see that the initial performance is better, higher, so one minus AUC is lower, but also the bar for age is much longer. So for both these models, we see that the most important variable is age. There is some effect of other features, it's not zero, it's a positive effect, but the dominant effect is age. Okay, so this technique was the next level in this pyramid of model exploration. We started with the global assessment of the model performance, which was AUC, and now we dug deeper to decompose this global performance into parts that can be attributed to particular variables. So we learned that age is the most important variable for each model. Some of these models do not use anything else, like the CDC one. Some of these models use something else, but age is the most important. So let's go deeper. Let's see what more we can find in these models. Okay. In the next step, you can not only see which variable is important, but you can learn how it's related to the target. What has been learned by the model? What kind of relation has been learned by the model? These models do not assume any specific family of relations: in linear models, you need to specify either splines or some polynomials or something like that, while tree-based models are just splits, but there is a lot of them, so these are pretty flexible models. So let's see what has been learned by these models. To do this, you can use PDP, partial dependence profiles. There are other techniques as well, but here I will focus on partial dependence because of its simplicity. How do they work? You can just imagine that you'd like to see what the expected model response is if a particular variable is replaced by some value.
So here we replace the variable X_i by the value t, and we see how the model's expected, or average, because we use an estimator, response changes after this replacement. So this is an example of how it looks. For the variable age, we are replacing all values of the age variable by the value 10, 15, 20, 25 and so on. And then we check what the average model prediction is, and it looks like this. So this is the partial dependence profile. It behaves kind of along our expectations, because for young people the average prediction is low and for older people it is high. What we can also learn is where the steepest increase is. And this is kind of interesting: the curve is not monotonic. So you see that here we have this non-monotonic behavior. In some models, you can enforce a monotonicity constraint, but in random forests you cannot do this by default, so there is some variance for these age groups; there are just a few patients there, so this is where the variance comes from. And how to do this in DALEX? Again, it's very, very simple. The only thing that you need to do is to use the model_profile function. Model, because you are doing the global analysis: in this pyramid, you are on the left side, and once you are on the left side, it's model. And profile, because you are interested in the profile. Then you need to specify the explainer; by default, profiles for all variables are calculated, but if you'd like to make it quicker, you can specify the variables that are of interest to you. Here we'll focus on the age variable. Having this explanation, you can use the plot function. This is nice, but the largest gain is from the comparison of different models. So you can do the same for all four models that we have trained, and you can combine them in a single chart and see what has been learned by these models. So let's maybe focus on these results.
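A partial dependence profile for age, and the four-model comparison, might be computed like this; the variable name `Age` and the explainer names are assumptions:

```r
library(DALEX)

pdp_rf <- model_profile(explainer_rf, variables = "Age")
plot(pdp_rf)

# combine the profiles of all four models in one chart
plot(model_profile(explainer_cdc,  variables = "Age"),
     model_profile(explainer_tree, variables = "Age"),
     model_profile(explainer_rf,   variables = "Age"),
     model_profile(explainer_at,   variables = "Age"))
```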
The CDC model was this table from the CDC webpage. So it behaves as we expected: you see jumps at the cutoffs that define this table. Other variables are not important. So here, the expected value is pretty easy to calculate. The tree model actually has only a few splits related to age. And you can see these splits: one is here, the second is here, and there should be a third one somewhere that is less visible. For more sophisticated models, like ranger or ranger with automatically tuned hyperparameters, these curves are smoother. There are always trees underneath, but because they are aggregated over a large number of trees, these profiles are smoother. And you see that these three models trained on the data, let me use a different color, have this high increase in mortality around age 60. While for the CDC model, the highest increase in mortality was after age 75, it was like 78 or something. So you see different models and different relations. And of course, if you are working on predictive modeling, and if you know more about the domain, it's easier to say why they are different, what might be the source of differences. We should remember that the training data for these three models come from a different population, while the CDC model was built on CDC data, so it was for the United States population. So it's a very different age structure and race structure; gender is more or less 50/50, but other features are different. So we can compare these models and what has been learned by them. And it's pretty useful. You can do this, and you can do even more sophisticated analysis: you can plot these partial dependence profiles split into subgroups. So, in addition to showing the average or expected value, you can calculate the expected value independently for people with diabetes and without diabetes.
We expect that there is some interaction, because with age, some of these diseases become more frequent, like cardiovascular diseases. After 50, the chances that you have this disease are much higher than for young people. So these features are correlated, and of course there are people that have many diseases and people that have none of them. So it's very useful to explore, to visualize, model behavior for subgroups. Here you can easily plot age grouped by diabetes to see what the model response is for these two groups. These two curves are pretty parallel, so it looks like the effects of these features were learned in a more or less additive fashion, but for different features it might be different. So it's always worth doing this. There is much more: I'm not going to present all of it, because we have just 20 minutes for this part, but there is much more that you can do with these profiles, and you can read about all the other approaches and possibilities in these materials. But, like in statistics, I would say we, but maybe I should say I: I was kind of restricted to thinking about models from a global perspective. We are interested in average behaviors: how the model behaves on average, is it good or bad on average, is the relation increasing or decreasing. Meanwhile, quite often in high-stakes decisions, what is really important is a particular prediction. So you have a particular patient, customer, an individual, and you are making a single prediction, and you would like to understand better what is happening with this prediction. So this is more and more interesting, and new methods are being developed for local analysis of models. And in the machine learning world, this perspective is actually even more prevalent. The global model behavior is sometimes not that interesting, because instances are very, very different from each other.
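The subgroup version is the same model_profile call with a `groups` argument; the column names `Age` and `Diabetes` are assumptions:

```r
library(DALEX)

# one partial dependence profile per value of Diabetes,
# instead of a single averaged profile
pdp_grouped <- model_profile(explainer_rf,
                             variables = "Age",
                             groups    = "Diabetes")
plot(pdp_grouped)
```

If the two curves are roughly parallel, the model has learned the effects of age and diabetes in a close-to-additive way; crossing or diverging curves would suggest an interaction.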
What is of interest is the analysis of a single instance. So let's see how to do this. Here we will be using the example of Steve. Steve is a person; here you have some data about Steve, and we'll see what we can learn about the model behavior for Steve. So now we'll be talking about this right part of the pyramid. Right, that's a single prediction. We can use the predict function to calculate this prediction, but now let's see how exactly it is being calculated, which variables contribute to this prediction, and which procedures can be used to understand the factors that contribute to the prediction. In this short workshop, I'm going to show you just two approaches, which are closely interconnected. We would like to calculate the effects of particular variables on the model response, and we'll be doing this with a sequence of conditionings. So let's imagine a situation like this: you start with the kind of average model response. You have 10,000 patients, and on average, the average risk score is like 0.01. And now we can say, okay, but my case is different, I'm not an average person, my age is like 75. So let's condition this average on age equal to 75. So here is a sequence of conditionings, and with every new conditioning, you will get a slightly different average. You can repeat this conditioning for every variable, and at the end, you will have conditioned on all variables, so you will end up at the prediction for the single observation, having started from the average over the population. So through these conditionings, you see how this average became this prediction. And from these conditionings, we can learn which variables change the average most significantly. Maybe I will show you an example for our Steve. Here we have all the data; this is the distribution of scores for every patient in our 10,000-patient data set. And then we can condition on age; on age and cardiovascular diseases; on age, cardiovascular diseases and gender; and so on.
And with each conditioning, you see that the distribution of predictions changes. The red dot shows you where the average, the expected value, of these conditional distributions is. And the changes are larger at the beginning, for the most important variable, age, and much smaller at the end. Unfortunately, for non-additive models, these changes will be different depending on the order. So it's not obvious how you can deal with different attributions for different orderings. But you can imagine two very simple strategies. One: let's use some heuristic to guess the variables with the highest impact, and condition at the beginning on the variables with the highest impact. Then the first contribution will be large and the remaining ones will be smaller. This approach in the DALEX package is called Break Down, and there is a heuristic that does this conditioning by looking for some efficient, close-to-optimal ordering. The second approach is: let's try all of these orderings and average across all possible orderings. There is a lot of them, a very large number, but we are statisticians, so we can average and we can control the variance. So let's average across a large number of orderings. This approach is called Shapley values. Both are based on these conditional orderings, but in one case you are using some heuristic to put the important variables at the beginning, while Shapley values just average across many different possible orderings. Both things can be easily executed in DALEX. You need, of course, to create the instance of interest: here we define our Steve, a man, 76 years old, with cardiovascular diseases, without diabetes and so on. And then, having this observation, we can calculate the score, for example for the ranger model, and then we can use the predict_parts function to calculate these contributions.
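Defining Steve and computing both attributions might look like this; Steve's exact columns and levels are assumptions, not the workshop's real schema:

```r
library(DALEX)

# the instance of interest (hypothetical column names and levels)
steve <- data.frame(Age            = 76,
                    Cardiovascular = "yes",
                    Diabetes       = "no",
                    Gender         = "male")

# Break Down: a heuristic, single-ordering decomposition
bd <- predict_parts(explainer_rf, new_observation = steve,
                    type = "break_down")

# Shapley values: averaged over many random orderings
sh <- predict_parts(explainer_rf, new_observation = steve,
                    type = "shap")

plot(bd)  # waterfall plot of variable contributions
plot(sh)  # bars with box plots showing variability across orderings
```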
Predict, because we are on the right side of the pyramid, and parts, because we are interested in the parts: which variables influence the prediction. So these are the individual contributions. You can see that some of these features significantly increase the risk of death, like, of course, cardiovascular diseases, as they should. Some of them decrease or do not change the risk. So with the plot function, you can simply plot these values, and you will get either the Shapley values or the Break Down decomposition, depending on the type argument. You can get the Shapley values with box plots, so you can see how accurate this estimation of the variable attributions is. And we have the Break Down plot, which has slightly different values, because there we are using just a single ordering. Here we are using the waterfall plot to see how particular effects contribute to the final model prediction. For Steve it's 0.32. So it's pretty high, and it's high because he's relatively old. Not that old, but from the COVID perspective. Okay, can we do more? Of course we can, and we still have like 15 minutes, so let's use these 15 minutes. We have our pyramid, and in the pyramid there is the next level, on which we can do the profile analysis of the prediction. From the previous level, we learned which variables are important, but we still don't know how these features are linked with the model prediction. With the profile analysis, you can do this. To execute the profile analysis, again, we have a single function: predict, because it's for the single prediction, and profile, because you are interested in the profile. So you can calculate this profile and then you can use the plot function to visualize it. Here we do this for the continuous variable age and for the categorical variable cardiovascular diseases. And here are these profiles. So these are conditional profiles for Steve.
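The local profiles use the predict_profile function; again, the variable and explainer names are assumptions:

```r
library(DALEX)

# what-if profiles for Steve: vary one variable, keep the rest fixed
cp <- predict_profile(explainer_rf,
                      new_observation = steve,
                      variables = c("Age", "Cardiovascular"))

plot(cp, variables = "Age")                  # continuous variable: a curve
plot(cp, variables = "Cardiovascular",
     variable_type = "categorical")          # two bars plus Steve's value
```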
So this is Steve, he's 76 years old, and his predicted risk was like 0.32, if I remember correctly. So we can check how the prediction would change if the age variable were lower or higher. This plot shows you the individual response, not an average-based response, and you see that, in fact, the ranger model learned that being below 60 makes a big difference, and being above 65 also makes a big difference, but in the other direction. So this was the individual profile for age, and this is the one for cardiovascular diseases. There are just two possible values there, so there are just two averages. And the dot shows you where Steve is: Steve has these cardiovascular diseases, so the dot is here. So if you dig deeper, you can explore your models in detail, and you are not limited to a single-model analysis: because we have trained four models, you can easily compare the perspectives given by all these four models. And again, here you have four colors; I stripped the legend because it used too much space, but there are four models: the CDC model, the tree model, ranger, and the automatically tuned ranger. And you see, from these different perspectives, what the possible risk for Steve is and how it's linked with the most important variable, age, and you see that these models give pretty different perspectives. There are more parameters that you can use. In these materials, you will find these parameters explained in maybe deeper detail. Still, it's just 42 pages, in which we'd like to show you the mathematical part, the software part and the process part. So, of course, it's just a small portion of the things that you can get out of this software and methodology. I strongly recommend you to go deeper; there are links, you saw them on the margins, and they refer to these three books. The first link is to an excellent book related to the methodology.
There you will find lots of discussion about this trade-off between variance and bias, lots of examples, and a discussion of why evaluation should be done on a different data set, what the benefits and problems of validation are. That's a very good read. The second one is a very nice book that shows you how to use the sometimes complex features of the mlr3 package. There is a lot of different packages in R, and probably some of you are using caret or tidymodels, or some other solution, but at least here you will find a few nice examples, especially well prepared for this AutoML approach in which you can tune hyperparameters in an automated fashion. And the last link is to Explanatory Model Analysis; it's like 300 pages, so it's longer than this tutorial, but it will introduce you to the world of model exploration. So if you'd like to learn more about different techniques, how to plot them, how to evaluate them, how to calculate particular metrics, how you can change them, what their parameters are and how these parameters affect the behavior, go there and you will find a much deeper description with code snippets for both R and Python. Of course, I'm biased, because I am one of the authors, but I highly recommend reading it. And I think that with this we are close to the end of the first part. In the next two parts you will learn more about automation. Right now we spent a lot of time typing all these commands: you need to remember the names of these functions and some additional arguments. All these things can be automated. So in the third part you will learn about modelStudio and Arena for automation. And in the second part, just after the break, you will learn about another set of techniques, very important in responsible machine learning, related to fairness.
So Jakub will introduce you to fairness, and Hubert will introduce you to the automation. We still have like five minutes before the break, so if there are some questions that remain, I am more than happy to answer them. Together with my colleagues, we are working on this approach to the presentation of explainable machine learning through software and the upcoming book; if after this workshop you have time to read these materials and you have any ideas about what should be changed or included, please contact us. We are more than happy to hear from you; your opinion is very, very important for us. Let's keep in touch. These materials that you have here, the Polish slides and English slides, are not final. You will have the final materials closer to the fall, because sometimes there are some dirty tricks inside, so please do not share this draft version. You will get the clean version closer to the fall, and that will be the version that we will be happy to share. All our materials are open: you can access them and you can use them for teaching. For the EMA book, you can get the paper copy, but the online copy is free. These materials are currently still in progress, but once they are public, they will be under an open license. So you are more than welcome to use them, and of course even more than welcome to share your experience from teaching with them with us. So please do so; we are more than happy to encourage that. There was a question about whether the presentation is available in Polish. Not yet, but the book will have two language versions, Polish and English, with code in both R and Python. Of course we are focused on these, but we know that there are other communities out there. Okay, I do not see other questions or comments. So, let's have a break. Oh, hi. Hi.
Yeah. Okay. So let's make a short break, like 10 minutes. Let's meet again after 10 minutes, and then we will go to the two other parts. Sorry that we didn't have enough time to execute everything in the console; we actually had an RStudio session prepared, but I see that the time has kind of run out. In the remaining two parts, there will be plenty of opportunities to actually execute some R code in the console. So, yeah, see you again in 10 minutes, and let's play with fairness and packages for automation. Okay, so I am Jakub Wiśniewski and I am the developer of fairmodels. Today we'll learn about this tool, how to use it, and gain some intuition. So, let's get to work. First of all, why is this all important? Well, machine learning models and decision systems have had a history of discrimination and somewhat shady practices. For example, in the work of ProPublica, they found that the software used across the United States for predicting whether someone would be a recidivist or not was biased against Black defendants. And that is the case we will be focusing on in the practical part today. In Gender Shades, the researchers found that popular gender classifiers were biased against darker-skinned females: the gap in accuracy between lighter-skinned males and darker-skinned females was around 30%. And Google "fixed" its racist algorithm by removing the label "gorillas", because their model classified Black people as gorillas. So as you can see, there are many harms to be made, so we have to watch out. What is this bias? What do we mean by bias? Well, bias can have many sources, for example historical data or some labeling process. And we can define it as different treatment of some groups of people by the model. This collection of subgroups will be called a protected vector. As you can see on the right side of the slide, there are two examples of such protected vectors.
It can be, for example, ethnicity or sex. And one of the subgroups will be called privileged, for example male or female. Bias can be described by some non-discrimination criteria, like separation, independence and sufficiency. They are defined as independence, or independence given some other variables, and they measure the bias in a mathematical way. But this mathematical way isn't really practical for us; we would like something that is not approximated, but calculated directly. So we have to use some metrics. With the help of a confusion matrix for each subgroup of people, we'll be taking metrics derived from the confusion matrix, like TPR, positive predictive value, false positive rate, precision, etc. And these metrics are either relaxations or equivalents of the independence, separation and sufficiency criteria. So let's get some intuition. Imagine that we have two subgroups, one of them privileged and one of them unprivileged, and we are deciding whether to grant a credit. Group A has an acceptance rate of 80% and group B of 50%. You may say, okay, maybe group A has a better credit history or is wealthier than group B. Okay, but it turns out that from group A, 90% of good credit seekers got the credit, while in group B this percentage was only 60%. So these are two different ways of measuring the bias. In the first case, the used metric was the statistical parity ratio, and in the second one it was the true positive rate ratio. So how to do this easily? Of course, with fairmodels. It uses the DALEX package as a backbone, and it is designed to work on group fairness metrics. As you may guess, any classification model will work. Why classification? Well, this flow is designed for classification models, but recently you can also use regression models. And it has this iterative approach: as you can see, you start with a model, you explain it, and you make a fairness check.
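The two ratios from the credit example can be computed directly in a few lines. The 0.8 threshold below is the common four-fifths rule, added here as an assumption about a typical default, not something stated in the talk:

```r
# acceptance rates from the example
acc_privileged   <- 0.80   # group A
acc_unprivileged <- 0.50   # group B
statistical_parity_ratio <- acc_unprivileged / acc_privileged  # 0.625

# true positive rates: fraction of good credit seekers who got the credit
tpr_privileged   <- 0.90
tpr_unprivileged <- 0.60
tpr_ratio <- tpr_unprivileged / tpr_privileged                 # ~0.667

# four-fifths rule: a ratio is acceptable if it falls in (0.8, 1.25)
passes <- function(ratio, epsilon = 0.8) {
  ratio > epsilon & ratio < 1 / epsilon
}

passes(statistical_parity_ratio)  # FALSE - group B is disadvantaged
passes(tpr_ratio)                 # FALSE
```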
If the model passes the fairness check, great; if not, you can use some bias mitigation methods. If you have more than one model, for example a dozen models, a better way is to visualize these metrics with the other visualization tools, and you pick the one with the smallest parity loss: either that one passes the fairness check, or at least it is the model with the least bias. As you can see, this iterative approach is easy for both prototyping and testing. Let's see how it looks in the code. In the GitHub repository you have a file called fairmodels.R, and you can follow the code along with me. Let's import the needed libraries; maybe I will make the font bigger. Then you have to run this line, which will download the data, because the data bundled with fairmodels is a little smaller, and this data set has a lot of different columns. As in the ProPublica case, we will be predicting whether someone will be a recidivist within two years or not, and we will have information about the number of priors, the decile scores, et cetera. I hope that you have downloaded this data by now. We will process it, and we end up with data that looks like this: we have information about the age, the charge degree, race and sex, how many days this person was in jail, and of course whether he or she is a recidivist or not. First of all, in fairmodels we would like to predict the favorable outcome, so we have to flip the labels: we predict not whether someone will be a recidivist, but the event that he or she will not be one. So we flip the labels and fit a simple logistic regression model. As you know, the explain function creates an explainer for the model, given the data and the target. We check the model performance; it's quite all right. And now we'll make a fairness check. It goes like this: we have to provide an explainer.
The protected vector, which in our case is the vector with race, and we have to provide the privileged parameter, which in our case is Caucasian, the group we suspect to be the most privileged. So let's run this code. It gives us a DALEX-like interface where we can see the types of the parameters that we passed, how they changed, how many explainers there were in total, what the cutoffs for those explainers are, and how many metrics were calculated properly. Let's assign the output of the function to a variable; we'll call it fobject, and it is of class fairness object. This variable can be printed and plotted. So maybe I will make it a little bit bigger. We'll go through everything that is on the plot. As you can see, there are bars for each of the subgroups, and there are five metrics that are calculated, shown both with the fairness metric name and the way to derive it from the confusion matrix. There are two fields, green and red, and of course, intuitively, we would like all bars to be within the green area. There is also a score on the x-axis, but we'll get to this later. First, let's see what this fairness object consists of. There is a field with group confusion matrices, and it is self-explanatory: we have a confusion matrix for each model and for each subgroup. There is also a groups data field, similar to the confusion matrices, but holding the metrics for each model and each subgroup, like the TPR, statistical parity, false positive rate, et cetera. We also have the privileged parameter, which we passed and can see right here; the protected vector changed to a factor; the cutoff, also for each model and for each subgroup, because it can be changed for each of the subgroups individually; and an epsilon value. We will also get to this later. But first, we will just make this plot a little bit more readable.
So we'll focus on two of the subgroups, Caucasian and African American. As you can see, we filtered the others out, and we will make the model and the explainer again, and just like before the fairness object, and we'll plot it. Yes, it is more visible now. So now let's focus on the x-axis and how to read it. Let me get back to the presentation, where my slides are, where we have the exact same plot. We'll take the value of the TPR bar and see what the meaning of this bar is. It's calculated like this: we take the true positive rate for African American and divide it by the true positive rate for Caucasian. So it's a ratio of those two metrics, and intuitively, the closer to one, the better. Here, of course, Caucasian is the privileged subgroup, and we divide each subgroup's score by the privileged subgroup's score. The epsilon value is the boundary between the green and red fields. By default it is set to 0.8 due to the US Equal Employment Opportunity Commission's so-called four-fifths rule: basically, anything less than 80% of the selection rate can be judged as adverse impact. This epsilon value can of course be adjusted to the user's needs; if you feel there should be more or less bias allowed, just tweak this parameter. Here you can see the exact boundaries of the green field: the ratio of the metrics should be between epsilon and one divided by epsilon. Let's see how it works. Now we are providing the epsilon argument and we plot it, and, as expected, it shrinks the green field. But let's revert for now; I will just maybe make it a little bit smaller. We will make more models. So we fit a ranger model with 100 trees, a set maximum depth and a seed, so you can get the same model as me. We make a model, we make an explainer, and let's see the model performance.
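A quick aside before comparing the models: the epsilon boundary described a moment ago amounts to a one-line check. A sketch in Python (the workshop code is in R; the function name here is mine):

```python
def within_fairness_bounds(ratio, epsilon=0.8):
    """Green-area test: the metric ratio must lie in [epsilon, 1/epsilon].

    With the default epsilon = 0.8 (the four-fifths rule), the allowed
    interval is [0.8, 1.25].
    """
    return epsilon <= ratio <= 1 / epsilon

print(within_fairness_bounds(0.85))               # True
print(within_fairness_bounds(1.20))               # True, since 1/0.8 = 1.25
print(within_fairness_bounds(0.70))               # False - adverse impact
print(within_fairness_bounds(0.85, epsilon=0.9))  # False with a stricter epsilon
```

Note the interval is symmetric on a ratio scale: a subgroup doing 25% "too well" fails the check just like one doing 20% too poorly.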
The ranger model's performance is a little bit higher than the logistic regression's. And we would now like to compare these models. There are a few ways to do it. The first one is to call fairness check with the new explainer and the fairness object created earlier. We may of course provide the protected and privileged arguments, but in fact we don't have to, because they are already stored in the fairness object, so we may just run it like this; we get the information that they were taken from the first fairness object. Or we can have two fairness objects and merge them together, for example like this: we make a second one and merge it with the first. We can of course also pass multiple explainers, the logistic regression and the random forest, and then we have to provide protected and privileged. In all these ways we get the same object, which can now be plotted. As you can see, we can now compare the metrics across the models. Ranger was slightly better at the equal opportunity ratio but slightly worse at the predictive equality ratio; in fact, it passes more metrics than the linear model. We can see it like that: yes, ranger passes three metrics and the linear model passes two. The total loss is the summed height of the bars. Now you may ask: well, what about the race column in the data, does it matter at all? We may see how it affects the predictions. We make a data frame with only a few columns, we make a model and an explainer, and let's see the model performance; it is similar to the first random forest. Can we now compare those models like this? No, because fairness check has to have explainers with distinct labels. So we will provide the label parameter. This label parameter can be either one value or a vector of values; it has to have the same length as the number of explainers. Here we have one explainer, so the label will be just one value. So let's run it and let's plot it.
We get a little bit less bias, but the change isn't that significant. Later we'll see how we can mitigate that bias, but first let's learn about different ways of visualizing it. For example, if you want to see the raw scores of the metrics, we have a pipeline where we pass the fairness object to some visualization function, in this case metric scores, and then we plot it. Here we have a shape for each subgroup and a color for each model, and the Caucasian subgroup is represented by this vertical line. Intuitively, the bigger the distance of a point to the vertical line, the worse. In the case of accuracy the distances are small, so it is a less biased metric than, for example, the false positive rate. But there are other ways to visualize this bias, and we'll now learn a tool that allows for more flexibility. This tool is called parity loss. For example, let's take the true positive rate metric. If we would like to calculate the parity loss of this metric, it is the sum of absolute values of the natural logarithm of the ratios. But let's not get into the math, let's get the intuition: the closer a ratio is to one, the smaller the parity loss will be. Parity loss aggregates a metric over all of the subgroups into one number. Okay, now we'll see how to make those visualizations. First, this parity loss can be accessed in our fairness object like this. Let's make the first visualization: it will be a fairness radar. We have five metrics, and all the values are the parity loss. Each of the visualization methods has its own documentation, and the fairness metrics visible in the plot can be changed; let's change them for visibility to these ones. Now we have only three, and it is a little bit easier to compare the models. What if we would like to have all the metrics? We can do it with the help of the fairness heatmap.
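As an aside, the parity loss formula just mentioned can be written out directly. A self-contained Python sketch (the fairmodels implementation is in R; the function name and the illustrative numbers are mine):

```python
import math

def parity_loss(metric_by_group, privileged):
    """Sum over unprivileged subgroups of |ln(metric_g / metric_privileged)|.

    The closer each ratio is to 1, the smaller the loss; a ratio of exactly 1
    contributes nothing, since ln(1) = 0.
    """
    base = metric_by_group[privileged]
    return sum(
        abs(math.log(value / base))
        for group, value in metric_by_group.items()
        if group != privileged
    )

# Illustrative TPR values per subgroup (made up, not the COMPAS numbers).
tpr = {"Caucasian": 0.90, "African_American": 0.60, "Hispanic": 0.80}
print(round(parity_loss(tpr, "Caucasian"), 3))  # 0.523
```

Taking the logarithm makes the loss symmetric: a subgroup scoring half the privileged value contributes the same as one scoring double it.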
The heatmap shows us the parity loss for all of the metrics, with dendrograms which show both the similarity between the metrics and the similarity between the models. And what if we would like to see summarized metrics? We have to stack them first: there is the stacked metrics plot, which shows the accumulated parity loss over all of the metrics. And what if we would like to see both the performance and a fairness metric at the same time? We have something called performance and fairness, and we have to pass a fairness metric and a performance metric to it, just like that. Here we have accuracy on the x-axis and the inverted false positive rate parity loss on the y-axis. Note that this is an inverted value; it is done because, intuitively, we would like our models to be in the top right corner. If some model were there, it would mean that it has the least parity loss and the most accuracy. Now, these are a lot of functions and it is hard to remember them all. Fortunately, we have a function for that, which shows the available ways of visualizing the bias, and we just have to pass the name of the type to the plotting function. If we do it with stacked metrics, we get the same result, fortunately. And now, okay, we have this bias, and the question is: what can we do about it? Well, there is actually a lot that we can do. There are two bias mitigation strategies implemented in fairmodels: data pre-processing and explainer post-processing. Today we'll cover resampling, reweighting, reject option-based classification pivot and cutoff manipulation. Resampling focuses on mitigating the statistical parity ratio: it duplicates the underrepresented observations from unprivileged subgroups and removes the overrepresented observations from privileged subgroups.
It is actually based on reweighting, which computes weights by dividing the theoretical probability of assigning the favorable label to a subgroup by the observed probability. And the reject option-based classification pivot, just as the last word suggests, pivots the probabilities close to the cutoff to the other side of the cutoff. So say the cutoff is the default 0.5 and we define theta to be 0.1, so the critical area is 0.4 to 0.6. If an observation from the unprivileged subgroup receives a probabilistic response within the left proximity of the cutoff, it will be switched, pivoted to the other side. And if the opposite happens, when an observation from the privileged subgroup falls on the right side of the cutoff, it will be pivoted to the left side. So let's see how to do it in fairmodels. We'll make a fairness object with the random forest explainer. First, resampling: we have to provide the protected vector and the target of the model. We get the indices from the resample output and we can index the data with them. So we create the resampled data frame and use it for model creation. We make an explainer out of it with the label resampled, we make a fairness check just like before, adding the explainer to the fairness object, and we plot it to compare. As you can see, the bias is less significant, but it's still visible. So let's go to reweighting, and it is similar to resampling: we pass the protected vector and the target of the model, and in response we get the weights, which we provide as case weights to our ranger model. We of course make an explainer and a fairness object, we just add it incrementally, and we plot it. The bias is even less significant, but there are still two metrics that do not fit into the four-fifths rule. So now we move on to the post-processing, and we have to provide the random forest explainer.
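The two mitigation ideas just described can be sketched in a few lines. This is a conceptual Python sketch, not the fairmodels R code; function names, interval handling details and the toy data are mine:

```python
from collections import Counter

def reweight(protected, y):
    """Kamiran & Calders-style reweighing: weight(group, label) =
    P(group) * P(label) / P(group, label), so underrepresented
    (group, label) combinations get weights above 1."""
    n = len(y)
    n_group = Counter(protected)
    n_label = Counter(y)
    n_joint = Counter(zip(protected, y))
    return [(n_group[g] * n_label[t]) / (n * n_joint[(g, t)])
            for g, t in zip(protected, y)]

def roc_pivot(probs, protected, privileged, cutoff=0.5, theta=0.1):
    """Reject option-based classification pivot: mirror scores that fall
    within theta of the cutoff across it - unprivileged scores just below
    the cutoff go up, privileged scores just above it go down."""
    out = []
    for p, g in zip(probs, protected):
        if g != privileged and cutoff - theta < p < cutoff:
            p = 2 * cutoff - p
        elif g == privileged and cutoff <= p < cutoff + theta:
            p = 2 * cutoff - p
        out.append(p)
    return out

protected = ["A", "A", "A", "B", "B", "B"]   # A privileged, B unprivileged
y         = [ 1,   1,   0,   1,   0,   0 ]   # 1 = favorable label
print(reweight(protected, y))  # [0.75, 0.75, 1.5, 1.5, 0.75, 0.75]

probs = [0.45, 0.55, 0.45, 0.55]
groups = ["A", "A", "B", "B"]
print([round(p, 2) for p in roc_pivot(probs, groups, privileged="A")])
# [0.45, 0.45, 0.55, 0.55] - both borderline scores were pivoted
```

In the reweighting output, the rare (A, unfavorable) and (B, favorable) combinations get weight 1.5, while the common combinations get 0.75, which is exactly the effect the text describes.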
Any explainer can be passed to this post-processing, a reweighted one, a resampled one, et cetera, but for now we take the base one. We make a fairness check with the label ROC and we plot it. Our explainer, with these flipped probabilities, now has even less bias and actually fits in the green area, which is very nice. And now we have the cutoff manipulation method. We provide the explainer to the fairness check, so we have a brand new fairness object, and we pass it to ceteris paribus cutoff with the subgroup African American. I will explain what it does: it changes the cutoff only for the African American subgroup, and that's why it is called ceteris paribus cutoff, because the other cutoffs are held constant. The cutoff with the lowest parity loss came out to be 0.36. So we now may add the random forest explainer, but with a different cutoff for the subgroup and a different label. Now we have even less bias, the least bias there can be, I think. We may see this in the plot and with the print method: yes, the ranger-cutoff model has the least total parity loss. Now we may, for example, check the false positive rate and the accuracy of the model, and here you can see something called the fairness-performance trade-off. We have the biggest accuracy for the ranger model, slightly less accuracy for the slightly less biased models, and the lowest accuracy, and the least bias, for the ranger-cutoff model that has the minimal bias here. The difference in accuracy isn't significant, so depending on your needs, you may choose the most suitable model for your case. Now we have an exercise; let me check how much time we have. Okay, I still have a little bit more to go through, so I will just post the solution on GitHub in, for example, an hour. That is an exercise for you: you may do it and check it against my solution. And now I will introduce you briefly to the regression module.
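One quick aside before the regression module: the ceteris-paribus cutoff search just shown is, at heart, a grid search over cutoffs for one subgroup. This Python sketch uses only the TPR ratio for brevity (fairmodels sums the parity loss over several metrics); the function names and toy data are mine:

```python
import math

def tpr_at(scores, labels, cutoff):
    """True positive rate of thresholded scores for one subgroup."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    return tp / sum(labels)

def best_subgroup_cutoff(groups, privileged, target, cutoffs, priv_cutoff=0.5):
    """Ceteris-paribus cutoff idea: vary the cutoff for one subgroup only,
    keep the privileged cutoff fixed, and pick the value with the smallest
    parity loss (here just the TPR ratio term)."""
    priv_tpr = tpr_at(*groups[privileged], priv_cutoff)

    def loss(c):
        t = tpr_at(*groups[target], c)
        return abs(math.log(t / priv_tpr)) if t > 0 else float("inf")

    return min(cutoffs, key=loss)

# Toy scores and labels per subgroup, invented for illustration.
groups = {
    "priv":   ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "unpriv": ([0.6, 0.45, 0.4, 0.2], [1, 1, 0, 0]),
}
best = best_subgroup_cutoff(groups, "priv", "unpriv", [0.5, 0.45, 0.4, 0.35])
print(best)  # 0.45 - lowering only the unprivileged cutoff equalizes TPR
```

At the shared cutoff of 0.5 the unprivileged TPR is only 0.5 versus 1.0 for the privileged group; lowering just the unprivileged cutoff to 0.45 brings the ratio to one, which is the same effect as the 0.36 found for the COMPAS model in the tutorial.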
We have the decile score, which is a value assigned by the Northpointe system that was analyzed by ProPublica. For Caucasians, we see that the decile scores are lower; they are a way of representing the probability of reoffending. For African Americans, these scores are more or less uniform. So, just so you know, the target here will be a regression target, but it will not be a true ground-truth target; it will be a possibly somewhat biased column that we will be trying to predict. Let's filter out two of the columns that we will no longer need, and now we have the data frame for the regression problem. We'll train a ranger model, we'll make an explainer, and we will make a fairness check. But note that the function name for the fairness check in regression is fairness check regression. So we do it just like this, and as you can see, there are only three metrics that we will be checking. Let's assign it to a variable and plot it. These are the three non-discrimination notions that are approximated by a logistic regression model inside fairmodels. This approach is more or less experimental, so if you use it and have some thoughts, please share them with me; I will be really honored. So this sums up our fairness tutorial. There is a landing page, fairmodels.drwhy.ai; you can reach it for example with this QR code. There is an article, there are blogs, documentation, tutorials, et cetera. And there is also a fairmodels implementation in Python, as part of the dalex module. That will be all for me. Okay, great, so maybe I will stop sharing now. I guess the first question was: is there a reason why the predictive parity ratio is lower when mitigation strategies are used, compared to the base model? Yes, because the base model tries to minimize the cost function.
So it is the most optimal model in the sense of the data, and these mitigation methods change the output, change this optimal setting, so the performance is lower. The second question: is there a hierarchy to the fairness check measures, is one more important than the others? Great question. I don't believe there is any hierarchy. The most cited ones are the statistical parity ratio and the equal opportunity ratio, or the equalized odds; because the equalized odds combine the true positive rate and the false positive rate, it is like a combined metric. So these are the most cited ones, but you have to adjust your metrics, your perspective, to your problem; there is nothing universal for everything. But these five metrics should be a sufficient basis for the start of the fairness exploration and analysis. Sorry, I can't see the text, can you repeat? Okay, okay, I found it. Are there any analogous tools for text data in responsible ML? For text data in fairness, I didn't encounter one. I don't know if you heard that, Przemek. But if you have a probabilistic output, the ground truth for a binary label, and the protected vector, then you can make a fairness analysis. Is the model able to take in more than one potentially biased variable? For example, if I have more than one, say race, religion and gender, do I have to check them one by one? Not at all; you can merge these variables together, so you have for example race_religion_gender. So we go back to the data where we have multiple indicators of subgroups; a subgroup would then be, for example, Caucasian Christian male. Is it possible to use fairmodels with other mitigation methods, such as Bayesian optimization methods? I don't know if you heard that, Przemek, I will repeat it. Yeah, you can.
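To make the merging answer concrete: combining several protected attributes into one vector is just string concatenation. A sketch in Python (in R, paste(race, religion, sex, sep = "_") does the same; the data below is invented):

```python
# Each person gets one combined subgroup label built from all protected
# attributes, so the fairness check sees the intersectional subgroups.
race     = ["Caucasian", "African_American", "Caucasian"]
religion = ["Christian", "Muslim", "Atheist"]
sex      = ["male", "female", "female"]

protected = ["_".join(parts) for parts in zip(race, religion, sex)]
print(protected[0])          # Caucasian_Christian_male
print(len(set(protected)))   # 3 distinct intersectional subgroups here
```

One caveat worth keeping in mind: the number of intersectional subgroups grows multiplicatively, so some combinations may have very few observations and noisy metrics.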
Sorry, I was muted. Okay, so in this case, Jakub showed that you can actually compare a few models, so with Bayesian optimization you can do any mitigation strategy on your own, and then just have a collection of different models, and you can compare these models to check whether the Bayesian optimization was more efficient than the other mitigation strategies. So some of these mitigation techniques are implemented in fairmodels, but you can always do this on your own with some other technique. If you have more questions, then please ask them and I will answer. Okay, so I guess now it's Hubert's time. Thank you. Great. Okay, I hope that you can all hear me. So, once again, I hope that many of you are still here and listening to us. We have a repository where you can find all the resources and materials that we are using today. We're focusing on showing you two tools, two R packages, that will help you automate your explanatory model analysis. First we will focus on modelStudio, and I will rather show the whole idea and not run the code, because it would take too long, but you can follow the code in the corresponding file. And then I will run some Arena code; I will try to run it in the R console, and we'll go from there. Actually, I would like to answer one question concerning the analysis: how to automate it. We talked about a lot of explanations; you can learn about them, for example, in the Explanatory Model Analysis book or in the materials that we prepared for this workshop, and Kuba was also showing you a lot of code. I would like to briefly show the motivation behind creating the modelStudio and Arena packages, and I will start by introducing you to a simple workflow using the DALEX package.
So I will load the libraries that will be needed. Przemek was talking about mortality prediction, Kuba was talking about recidivism; let's maybe use a happier data set to assess our models right now. We're using the World Happiness Report data from Kaggle. It will also involve out-of-time validation: the train data set is combined from the years 2015 to, I believe, 2018, and the test data is from 2019; all the details about how these data sets are combined are in our repository. Here we see that for a given country we have its happiness score, and also some attributes that are quite intuitive to assess. I believe it's a good example to show how we can explore our models. So here we want to create a black-box model and then explain it. I use the ranger package, which is great for training random forests. And, as was seen many times before, it always begins with the explain function from the DALEX package, where you pass your data and also the target variable vector. Usually, when I start my model exploration analysis, I want to find the most important features, as well as see what the dependence of these features on the model prediction is. Okay, I might say that I'm an advanced DALEX user, so I can write some code that looks like this, also supporting it with the great patchwork package, which lets you combine your plots. And here we have a great visualization of what is happening in our model on the global level. We have feature importance, which shows us the most important features in our model, as well as partial dependence profiles, which show us how the model prediction depends on a given variable. These plots are quite complementary, one supports the other: first we want to find the most important features, and then we want to assess their dependence.
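For intuition, the feature importance plot just mentioned is typically computed by permutation, which is the idea behind DALEX's model parts function. A self-contained Python sketch of that idea, with a toy model and names of my own (not the DALEX API):

```python
import random

def permutation_importance(predict, X, y, loss, n_repeats=5, seed=0):
    """Permutation feature importance: how much the loss grows after
    shuffling a single feature column, averaged over several shuffles."""
    rng = random.Random(seed)
    base = loss(y, [predict(row) for row in X])
    importances = {}
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
            drops.append(loss(y, [predict(row) for row in X_perm]) - base)
        importances[j] = sum(drops) / n_repeats
    return importances

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy "model": uses feature 0 heavily and ignores feature 1 entirely.
model = lambda row: 2.0 * row[0]
X = [[float(i), float(i % 3)] for i in range(30)]
y = [model(row) for row in X]

imp = permutation_importance(model, X, y, mse)
print(imp[1])           # 0.0 - shuffling an unused feature changes nothing
print(imp[0] > imp[1])  # True - the used feature dominates
```

The same logic works for any black-box model, which is why the plot pairs so naturally with partial dependence: importance tells you where to look, dependence tells you what the model does there.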
Usually I would proceed and do the instance-level explanation, so for a given observation; here I choose the first observation from the test set. It's actually Finland, which in the given year had the highest happiness score. We will do the same, but use instance-level explanations. So I combine the break-down profile, which shows the explanation of the prediction, with the what-if analysis: we have a given country, and what if freedom to make life choices lowers, will the prediction also lower? Or we have another variable, let's say social support, and what if it lowers? But now, there are several problems with the approach that I'm showing. The first one is that we need to know how to write a lot of code, and these are only two plots, so our report or HTML file with the analysis may become really long. Another problem is that I'm showing an explanation of only one observation. What if you need to explain your model for 100 patients, or 100 clients of a bank? Then this report will probably have like 50 pages of such plots, of such explanations for each observation. The third point, after the long code and the long reports, is that we would often want to compare the instance-level explanations with the global, model-level explanations. Okay, if I had a huge screen, maybe I could fit all of these on one screen, but otherwise I need to scroll to compare, for example, the break-down profile and the feature importance plots, to see if this instance with Finland somehow varies from the overall behavior. Also, if I want to compare the ceteris-paribus profile with the partial dependence profile, I need to have them together on one page, somehow next to each other.
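As an aside, the what-if analysis described at the start of this passage is simple to state in code: vary one feature of a single observation, keep all the others fixed, and watch the prediction. The toy model, weights and feature names below are invented for illustration; they are not the happiness model from the workshop.

```python
# What-if (ceteris-paribus) analysis in miniature.

def predict(obs):
    # toy "happiness" model: a weighted sum of two features
    return 2.0 * obs["social_support"] + 1.0 * obs["freedom"]

finland = {"social_support": 1.5, "freedom": 0.6}
base = round(predict(finland), 2)
print(base)  # 3.6

# Ceteris-paribus profile for "freedom": everything else held constant.
profile = [(v, round(predict(dict(finland, freedom=v)), 2))
           for v in (0.2, 0.4, 0.6, 0.8)]
print(profile)  # [(0.2, 3.2), (0.4, 3.4), (0.6, 3.6), (0.8, 3.8)]
```

For a linear toy model the profile is a straight line; for a real random forest it would trace whatever shape the model has learned, which is exactly what the what-if plot shows.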
Because of these three points, we wanted to create a low-code library that could be used by everyone, that could automate the whole process, and that could show you more explanations in a more convenient way. This is where the modelStudio package comes in. You just load the modelStudio library and use the modelStudio function on the explainer object. You can see that I also specify one option here; it's a spoiler, I will talk about it in a minute, but it's also for convenience in the visualization. So I use the modelStudio function, and what it does is automatically compute a lot of explanations and create an interactive dashboard from them. The crucial part here is that this interactive dashboard works without a server, meaning that all the explanations are computed first, and then you can explore them in an HTML file. It's quite convenient; we can also insert this dashboard inside a rendered vignette, as I am doing now, but usually in the R console you would just get an HTML file which you can open with, let's say, a browser. After running this function we are presented with such a dashboard, which has different panels; I would like to talk about them briefly. There are multiple plots that we can choose to explore in this modelStudio. Let's go through the example that I was showing before: maybe I will choose the feature importance plot and the partial dependence plot. Here we can try to do the break-down analysis and the ceteris-paribus plot. Okay, that was quite easy; I didn't need to code like 20 lines, I'm just exploring the model. Less time coding, more time exploring. By default, the modelStudio dashboard chooses a few observations that it will let you explore. And we can analyze the break-down plot next to our feature importance plot, and the feature importance next to our partial dependence.
Maybe I would like to change the variable to analyze some other relationship. Okay, nothing interesting here. Oh, this plot looks more interesting: now I'm analyzing the dependence of the average model prediction on healthy life expectancy. Here is the PDP plot, but we also have the what-if analysis for a given instance, and we see that the PDP plot is quite consistent with the what-if analysis for this instance, which is great. Given time, we could also go and look for other variables and other observations to analyze. There are three things that I would like to highlight about this dashboard before we proceed to talk about more parameters of this function. The first is that there are also some EDA plots, meaning exploratory data analysis, which I guess is quite important in machine learning too: to analyze your data. So maybe I'll show the feature distribution plot. Usually we don't think about it and it's not often presented this way, but we now know that it's really important to analyze our explanations next to the data. So if you have the partial dependence plot, you can check whether the feature distribution somehow relates to it. We can also choose another plot, target versus feature. Here we see the data analysis plot for the relationship between the target and the variable, and we can somehow see that the model imitates this relationship with its dependence profile. Right here is a small footnote where you can see some measures that were calculated for this model on the given data set, which is in the explainer. This is a regression model, therefore we show the regression measures. The second thing in the footnote is this stamp, I would say, where we see which version of the package
this dashboard comes from, but also the date and time; it's often useful to know when your analysis was conducted. Of course, some of these features can be customized. The third point that I would like to discuss briefly is, I would say, a feature in an experimental stage, but it's quite useful for people who at first glance might not know the meaning of the explanations. These are the descriptions here, which you can hover over to get a textual description of your explanation. Pretty simple. So, for example, the description says how many of the variables are important for the model's prediction, and that the variables of the highest importance are listed here. For the PDP you can see another textual description; it doesn't always work perfectly. The text explains the plot: it says what the mean prediction on the given validation data is and where the highest prediction occurs. Also, sometimes there might be some break points in the explanation that the description points out. So it's, I would say, an experimental feature, so maybe it won't be greatly accurate, but it might be useful for people who are seeing these plots for the first time and don't yet know what they mean. Okay, so this was the default; this is what you will see if you copy and paste any example from the documentation. Now I would like to mention some parameters of the modelStudio function. Of course, these parameters are well described in the documentation as well as in the package vignettes, but let's get started. We didn't pass any observations before; there is of course a possibility to choose the observations that you would like to use for local explanations. So here I choose these observations and pass them as the new observation parameter.
I can also specify their true score, so here, in the drop-down box, apart from the observation name, which is actually the row name from your data frame, you'll also see the target, and we can compare predictions with the original values. Okay, but if you don't want to explore any particular observations, you can just increase the number of observations that the local explanations will be calculated for, for example to 10. Okay, but some people might want to focus on only a few plots, or they have smaller devices to watch these dashboards on. Therefore, there is one of the most important parameters, the facet dimensions parameter, where you can set how many panels there will be in the grid. So here, I would say, we have a small modelStudio with only two panels, which is for example convenient if you want to analyze instance-level explanations. What you would usually do is have a break-down plot, which shows the attributions to the observation's prediction, but you are also interested in the what-if analysis. Clicking on those bars will make the what-if analysis change, and you can clearly see what would happen if a given variable increased. For example, for this country, if the generosity variable increases, the prediction won't change much; so this variable is not really used for this instance. Then we have the social support variable, which actually will lower the prediction if it lowers. So this is a smaller version of the modelStudio, but you can also increase the dimensions if you have a huge monitor and want to analyze all six explanations next to each other, or really dive deep into exploration: you can increase the dimensions of the dashboard. I'm aware that you might not clearly see all the text here, but I'm just presenting the general idea of these panels.
So if you are, I would say, a more advanced user and you would like to really dive deep into the exploration process, then you can create a larger dashboard and analyze more plots in it. It might be quite convenient. Right, so here is a larger modelStudio. Now, coming to the parameters: of course this function has a lot more parameters that you can customize. Among the most important are the N and B parameters, which stand for the number of samples and the number of bootstrap rounds. As we know, the explanations are estimated from the data, but they might have long computation times; therefore, if you have, say, 100,000 observations, modelStudio samples only 1,000 of them to create some of these explanations. The N parameter tells you how many observations will be used for the partial dependence and accumulated dependence profiles. You can increase this number, which of course will prolong the computation time of the function, or you can decrease it. There is a corresponding parameter for the feature importance method; there, 3,000 observations, or fewer if you have fewer in your data, will be sampled to calculate this importance. The B value sets the number of bootstrap rounds of the feature importance method. Maybe I have a plot here which I can show you: the box plots correspond to the different rounds in which we estimate the feature importances. So if you want more stable results, you increase the B value, but it will take more time to compute. The same goes for the Shapley values: they might be quite unstable, so increasing the B parameter might help. Then of course the dashboard will take longer to compute, but you'll get better explanations. Okay, so there are these four parameters that you should probably have in mind, and if in doubt, check in the documentation how they affect the creation of the dashboard. Now, I know a lot of power users of modelStudio who like to customize everything.
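A sketch of these sampling parameters (the names, including the separate feature-importance ones, follow the modelStudio documentation as I recall it; `explainer` as before):

```r
modelStudio(explainer,
            N = 500,   # observations sampled for the partial dependence /
                       # accumulated dependence profiles
                       # (fewer -> faster, but noisier curves)
            B = 25)    # bootstrap rounds for feature importance and
                       # Shapley values
                       # (more -> more stable bars and boxes, but slower)

# the feature-importance sample size and rounds can be set separately,
# via N_fi and B_fi (the defaults derive from N and B)
modelStudio(explainer, N = 500, B = 10, N_fi = 3000, B_fi = 10)
```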
And the good news is that there are a lot of parameters that you can really customize. Just going through a few of them: the max_features parameter tells you how many variables will be present in the feature importance, Break Down, and Shapley values plots. Usually we are only interested in, say, the three to five most important variables, and then it's useful to lower this number; the default value is 10. Some of you may not like the animations of the visualizations; then you can set the time parameter to zero and they will disappear. Here I lowered the time value to make the animations faster. And here is a really crucial parameter that we should keep in mind. Of course, we want to be responsible machine learning engineers, so we want to compare our explanations with data exploration visualizations, but there are some cases where you don't want to do that. Precisely, this is the situation where you want to share the explanations of your model with others but don't want to share the data. Maybe you are in a company and want to share the explanations of your model but not the data of your clients, or maybe you are working with some medical data that is classified. Remember that this dashboard is a standalone HTML file that you can simply send to someone by email. If it includes these exploratory data analysis visualizations, then the whole data set, or a subset of it, will be present in this HTML file: you will be sharing the visualizations, but you will also be sharing the data inside the HTML file, and if someone wants, they will be able to extract it. So if you are working with classified data and you are sharing your model analysis with someone, then it is probably wise to set the eda parameter to FALSE.
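Put together, the three parameters from this passage look roughly like this (a sketch with the same assumed `explainer`):

```r
modelStudio(explainer,
            max_features = 4,   # show only the 4 most important variables in
                                # the importance / Break Down / Shapley plots
            time = 0,           # disable the plot animations entirely
            eda = FALSE)        # omit the EDA plots, so no raw data is
                                # embedded in the standalone HTML you share
```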
More parameters for customizing the appearance of the dashboard are available through the options argument: using the ms_options function you can overwrite the defaults, which are listed in the documentation. Here I'm using the option I was talking about before: margin_left is probably one of the most popular options to change; when you have very long variable names, you increase this value. Of course, you can change the default, rather generic title of the dashboard. If you want it to be more accessible to others, you can think about changing the line sizes and point sizes, increasing them. And if you like other colors, you can change the colors of the explanations. Okay, so here I have the dashboard generated with these options; I hope it fits on my screen. The colors changed and the animations are faster, I don't know if you can see that. Also, there are no EDA plots here, which I mentioned before, and in the feature importance plot there are at most four variables. Okay. There are also functions that you can use to update your modelStudio, ms_update_observations and ms_update_options: if you forget to set any of these parameters, you can, for example, add more observations to your analysis. Also, maybe you want to save these dashboards as RDS files, that is as R objects rather than HTML files; you can then load them later and change, for example, the appearance of the dashboard. So I'd say it is quite convenient to look these two functions up in the documentation. Okay, I believe that this is quite an introduction to the modelStudio dashboard. I would also like to highlight another package, the Arena dashboard, which is quite an advanced tool. The most important thing is that it can compare multiple models. Here I have the code, taken from GitHub; it's quite short, about 100 lines, and I will go through it really fast.
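A sketch of the appearance options and the two update functions (the option names follow the ms_options documentation as I remember it; double-check them against your installed version):

```r
ms <- modelStudio(explainer,
                  options = ms_options(
                    margin_left = 200,              # room for long variable names
                    ms_title    = "Happiness model" # replace the generic title
                  ))

# keep the dashboard as an R object rather than an HTML file
saveRDS(ms, "dashboard.rds")

# ...load it later and tweak it without recomputing the explanations
ms <- readRDS("dashboard.rds")
ms <- ms_update_options(ms, margin_left = 250)
ms <- ms_update_observations(ms, explainer,
                             new_observation = happiness["Poland", ])
```

Here `happiness` is again a hypothetical stand-in for your own data frame.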
So I load the libraries and the data, and since I want to brute-force this predictive problem, I just fit all the models and create explainers from them. I was mentioning before that we can create all these explanations by hand, taking six models and creating visualizations for them, but that's maybe not quite convenient; you can see this plot is quite clunky. What we would like to do instead is create an Arena application, push a lot of explainers to it, meaning add them into the dashboard, and also push observations into the dashboard. It's quite similar to modelStudio, but here you add more models. Finally, we run the Arena. Let's see how it works. Okay, here I have this example running now. This is an advanced dashboard, the Arena dashboard, in which you can explore multiple models. It has great documentation resources, which are under the arena.drwhy.ai domain, but I would like to briefly show how to use it. You can choose multiple models to explain, and the first thing that most of us would do is compare the performance of the models. So here are the metrics, just calculated, and now I'm free to experiment with these plots, make them larger or smaller, and so on; there is customization. In this plot, for example, I can see which model performs best, and I would also like to compare it with, I guess, the Ranger model, so I can already discard the models that are maybe not suited for me due to their performance. I can leave this plot for later. And now I would like to do exactly the same analysis that I was doing with modelStudio, but for multiple models. The first part is the variable importance, which compares the importance of the variables across multiple models. The plot is small at first, but we can always make it full screen and analyze it more deeply.
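The Arena code just described boils down to a few piped calls (a sketch based on the arenar package; `explainer_gbm`, `explainer_ranger`, and `observations` are placeholders for the explainers and data frame created earlier in the script):

```r
library(arenar)
library(magrittr)   # for the %>% pipe

# live Arena: explanations are computed on demand while you browse
create_arena(live = TRUE) %>%
  push_model(explainer_gbm) %>%       # push each DALEX explainer...
  push_model(explainer_ranger) %>%
  push_observations(observations) %>% # ...plus rows for local explanations
  run_server(port = 8181)             # opens the dashboard in the browser
```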
Okay, so I can see where these models vary in their use of the predictors, but overall it's quite similar. The best feature of this dashboard, apart from the fact that we can analyze multiple models, is actually that we can have many more pages. Here I click the plus sign and create another page, and the previous analysis is not lost. I will create another page, and I will also change the name of the project, maybe to useR 21, as that's where we are now. For these three models, I would like to compare their partial dependence for a given variable, which is quite convenient. The interface is quite similar: we can change the observation and change the variable. So we can choose another variable, say healthy life expectancy. We can put the previous plot in the background and have another plot, and we can lock a plot on a given variable. We couldn't do that in modelStudio; here we can lock it for a given variable and then change the variable back to GDP per capita. And here we have another plot. Of course they don't all fit here, because I enlarged the whole dashboard so you can see better, but when I decrease the zoom it should fit better. So here you can see the model analysis for different variables; of course you can compare it with the variable importance. So far I have been showing only model-level explanations, but I can also compare different explanations for local-level understanding. Here we have a particular observation; maybe I'd also like to compare the Shapley values for the Poland observation. It takes some time to compute, but after a while we have these plots. What I'm showing now is a live dashboard that computes all the explanations on the fly, but you can also save all the data and then analyze it later. There is another part of the code that I won't be running now, because it would take some time to compute; it saves all the data in a JSON file. So I would run this code.
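The static variant mentioned here would, as far as I recall the arenar API, swap the server call for a save step (function names should be verified against the arenar documentation; the explainers and `observations` are placeholders as before):

```r
# static Arena: precompute every explanation and dump it to a JSON file,
# which can later be uploaded at arena.drwhy.ai or hosted on your own server
create_arena(live = FALSE) %>%
  push_model(explainer_gbm) %>%
  push_model(explainer_ranger) %>%
  push_observations(observations) %>%
  save_arena("arena_data.json")
```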
It's quite similar, but instead of running the server I will save the data. Then there are two options: you can either upload this data to some server and just add its URL here, so that the dashboard downloads the data, or you can add the file as it is and explore all the possibilities. There are also a lot of parameters in this Arena dashboard. I won't have time to go through them, but I really recommend going to the documentation, where all of these concepts and all of these caveats are really well explained; there are even videos where you can see, for example, how to annotate plots using a marker, and all the options are explained there as well. So I think this is a great resource, and if you think this dashboard will help you with your work, then consider using this documentation as your next resource. I would say that this wraps up the automation part.