My name is Sergio Peña-Fiel, I'm from Chile, and today's talk is the first one of the series. It's about model evaluation in general, but focusing on interpretability, which is the topic of the series itself. A little information about me: I have a master's in computer science from the University of Chile, and my master's thesis was the classification model that we will be reviewing in this series. I am also the tech lead of the data science unit at the Arturo López Pérez Foundation, the cancer institute we have been talking about, and I also have a startup, because we have some entrepreneurial spirit; the startup is about healthcare as well, and it brings technology, interoperability and things like that to health institutions. So one of my main interests is applying these methods in healthcare; this is mainly what I do.

For today we have this agenda: we will talk about performance evaluation first, which is more or less what we already know about comparing and evaluating models, and then we will review the different strategies for interpretability evaluation. Among these we will look at the different types of interpretability, some agreement metrics, and sensitivity analysis, which are automatic methods to evaluate interpretability, but we will also review expert evaluation, which requires people (users, experts on the domain) to judge whether the model's results are good or not. We will see an example of all of this applied to a real use case at the end of the presentation.

The motivation for all of this is that in the machine learning field we have many different models. As you may know, we have artificial neural networks, random forests, support vector machines, many, many models that solve these tasks; in supervised learning, for example, they solve classification and regression tasks. So we need a way to know which model is better, a framework or methodology to say that this one is better than that one and have an answer for that. Models can be evaluated on different aspects, not only performance, and these are some of the most common dimensions we are interested in evaluating. First we have performance, which is how accurate and correct the results of the model are compared to the real outcomes; this is by far the most common aspect used to evaluate models, and I think 80% of articles in machine learning only report a performance evaluation and forget about the others. But we have other ones, like interpretability, which is the ability to explain the results of the model: you have a machine that not only says that an instance belongs to a certain class, it also gives an explanation for that output.
So this is interpretability, and as Nelson said before, there are use cases where interpretability is required, for example when there are laws or compliance requirements the model has to meet, or when you have a critical process that cannot be handed to a black-box machine without knowing what it is doing. But we also have other dimensions besides interpretability and performance. For example, we have complexity, which is the amount of resources (time, memory, space in general) that a model takes to make predictions. In some use cases, for example if you are deploying your model to a very limited machine, as in Internet of Things scenarios, you have to address this issue of complexity, so you will prefer a less complex model over the others. Other aspects include scalability: there are models, for example, that need a lot of data and need that data all in one place to make predictions, while other models can be distributed across different machines with no problem. And the last one is consistency, which means the model gives similar outputs for similar inputs. This matters when we are retraining models over time, for example: when you deploy a new model trained on new data, you expect it to behave similarly to the previous one, so in that scenario you need consistency more than the other properties.

We will talk about the first two. Performance is the most straightforward way to evaluate models. Since we are talking about supervised learning, we know that a certain record in our data has a certain class, label or value, which is the real value collected from the process we are modelling. So typically what we do is separate the dataset into two groups: one is used for training the model and the other one is used for testing, for evaluating the model. The idea is to separate the data at the beginning so the model never sees the test set, and then we can compute several metrics and indicators on it. For classification, to name only a few, we have accuracy, which is the proportion of records in the test set that were correctly predicted, and the F1 score, which is the harmonic mean of precision and recall, so it is a better indicator when we have imbalanced datasets. For regression we have, for example, the mean squared error, which is the average squared distance between the real values and the predicted values, and the R-squared score, also called the coefficient of determination, which is a statistical indicator of how much of the variance of the target is explained by the predictions rather than by chance. Beyond these there are many other metrics. All of these metrics have a formula, so you can take the data, compute the formula and you have a result; that makes performance easy to evaluate, because you can measure it and compare models in the same scenario. You may also have seen visualizations like confusion matrices, receiver operating characteristic curves and actual-versus-predicted plots; these are very common.
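To make this concrete, here is a minimal sketch of a hold-out evaluation with scikit-learn; the synthetic datasets and the models are purely illustrative, not the ones from this talk.

```python
# Minimal sketch of a hold-out performance evaluation with scikit-learn.
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score

# Classification: keep a test set the model never sees during training.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))   # proportion of correct predictions
print("F1:", f1_score(y_test, y_pred))               # harmonic mean of precision and recall

# Regression: mean squared error and R-squared play the same role.
Xr, yr = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.3, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
yr_pred = reg.predict(Xr_test)
print("MSE:", mean_squared_error(yr_test, yr_pred))
print("R^2:", r2_score(yr_test, yr_pred))
```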
But if we go to interpretability, which is the main thing we will be reviewing here, there is no simple way to evaluate it; unlike performance, we don't have formulas, we don't have anything simple we can just compute. So first we have to choose a definition of interpretability, to know what it actually is. Here we have two definitions from two authors from recent years: the first says that interpretability is the degree to which a human can understand the cause of a decision, and the second says that interpretability is the degree to which a human can consistently predict the model's result. What these two have in common is that there is a human in the definition. In performance you only have a formula, you plug the data in and that's it, but here we have a human who needs to understand some part of what the model does, and that is what makes interpretability harder to evaluate. The conclusion is that there is no fully rigorous way to do it, and we are also dealing with human opinion: some experts will agree with certain results and others will not, and you get all of those kinds of problems. Still, we will review some strategies for evaluating interpretability.

First we need to know the degrees of interpretability that models typically have. There are five groups, from the most interpretable to the least. First we have model transparency, where the model is so simple that the user can understand what it does and could make the prediction alongside the model. This is the case of decision trees and linear regression, for example, where we can look at the tree, descend through the branches and see the result; these are the most interpretable models because they are like white boxes, you can see exactly what is happening. Then we have global interpretability, where the model provides an explanation that works for all of the data, for all instances: you get a statement such that, whenever you apply it, it matches what the model does. In this case the model may do fairly complex things internally, but it outputs this global statement as its interpretation. This is the case, for example, of the Dempster-Shafer gradient descent model that I created and that we will be reviewing in this series, and also of naive Bayes and some other models. The third group is group interpretability, where the model can find patterns that apply to certain groups of instances in the data. This is, I think, the most common kind of interpretability, but it is not as strong as global interpretability. Here we have, for example, k-nearest neighbors, where you can see who your neighbors are and somehow explain the prediction; the same with Bayesian networks, where you can know which attributes depend on which others; and random forests with feature importance also give you some kind of interpretability. The next group is local interpretability, where the model can explain one instance at a time; this is the weakest interpretability of them all, and it can be achieved, for example, with sensitivity analysis and other methods that we will review later in the presentation. Finally we have the models with no interpretability at all, which is mainly the case of deep learning and other kinds of models that perform a lot of non-linear computations. If we need to group these: the first two groups are what we call interpretable models; the next one, group interpretability, gives us slightly interpretable models, where we can get some insights but not exactly what is happening; and the last ones are mainly considered not interpretable. So this is the classification, and there is a trade-off between accuracy and interpretability. If we put these groups on one axis and accuracy on the other, and we ask these models to solve the same task, we typically see a chart where the most accurate models are the ones that are not interpretable and the least accurate models are the ones that are highly interpretable. So we have this trade-off: if you go for interpretability you will lose performance, and the other way around. This is something that happens in real scenarios.

Now we will review decision trees a little, because decision trees are, as we said before, at the top of the interpretable models. As you may know, decision trees are classification models in which we have a tree whose inner nodes are attributes of your dataset with a condition on them; we descend through one of the two branches depending on whether the condition is true or false, until we reach an outer node, a leaf, which gives the class predicted for that input. What is interesting about decision trees is the building process, because it follows a specific methodology, the maximization of information gain. When you are building a decision tree for a particular dataset, you need to decide which attribute to put in the root, the first node. What the algorithm does is search over all the possible splits of the data, for all the attributes (it is quite expensive), and for each one it computes a formula called the Gini impurity, which tells you how pure or impure the two partitions generated by that split are. You try to minimize this impurity, so that the split produces the cleanest possible separation. By iterating over all the possible attributes you find the best one and you take it; it is a greedy algorithm, you always take the best available split, and then you repeat the process recursively for each of the resulting partitions.
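As a rough illustration of that split-search step, here is a simplified sketch in plain Python and NumPy for a single numeric attribute; real implementations such as CART handle many attributes, stopping criteria and more, so this is only the core idea.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of one partition: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity of the two partitions produced by the split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

def best_split_for_attribute(x, y):
    """Greedy step: try every observed value as a threshold and keep the purest split."""
    scores = [(split_impurity(x, y, t), t) for t in np.unique(x)]
    return min(scores)  # (impurity, threshold) with the lowest impurity

# Toy example: one numeric attribute and a binary class.
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split_for_attribute(x, y))  # the threshold 3.0 separates the classes perfectly
```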
What is interesting about this process is that, if you follow this methodology, the resulting tree is essentially the best decision tree this procedure can produce for the dataset you provide: exchanging some of its nodes, for example, will give you worse performance. If decision trees have this property, we can treat them as a baseline for interpretable models, and this is the first thing we can do to evaluate interpretability. If we assume that the decision tree is the best interpretable model, and we know that the decision tree procedure always generates its best possible tree, we can compare our model against a decision tree and ask how similar or how different it is. That gives us our first interpretability metric. To measure whether they are similar or not we have what we call agreement metrics: we compare some aspect of the model to the decision tree, and these metrics give us formulas we can use to quantify interpretability. Here are three of these agreement metrics. The first one is feature rank correlation. In a decision tree we know that the attributes in the root or in the first layers are more important than the ones near the bottom, so we can rank the features by their distance from the root. Many models also have their own way of ranking features, like feature importance or feature selection methods, so we can compare these two rankings using statistics: there is a metric called rank correlation that, given two rankings of the same set, tells you how similar they are. So this is the first agreement metric, our first interpretability metric.
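A small sketch of what this comparison could look like, assuming scikit-learn and SciPy; the depth-based ranking below is a simplified stand-in for however you choose to rank the tree's features, and the random forest simply plays the role of the model being evaluated.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
n_features = X.shape[1]

# Baseline decision tree: rank features by the shallowest depth at which they are used.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = tree.tree_
min_depth = np.full(n_features, float(n_features))  # unused features rank as least important

def record_depths(node, depth):
    if t.children_left[node] == -1:                 # leaf node, nothing to record
        return
    f = t.feature[node]
    min_depth[f] = min(min_depth[f], depth)         # closer to the root = more important
    record_depths(t.children_left[node], depth + 1)
    record_depths(t.children_right[node], depth + 1)

record_depths(0, 0)

# Model under evaluation: here a random forest, ranked by its feature importances.
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Shallow depth should line up with high importance, so correlate negated depths
# with the importances; rho close to 1 means the two rankings agree.
rho, _ = spearmanr(-min_depth, forest.feature_importances_)
print("feature rank correlation:", rho)
```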
The second one is rule equivalence, and this applies to rule-based models. If you have a model that can produce rules, like a random forest or gradient boosting or things like that, you can also order those rules, so that, as in the decision tree, the ones at the top are the most important. If your model gives you this, you can check whether its rules are equivalent to rules in the decision tree, and count how many of them match between the two models; that gives you another metric. This one is generally simpler to compute, because you only have to match one thing against another. The last one is subset differentiation, and the important thing here is that it can be applied to any model. You have your interpretable model and you want to compare it with the decision tree: you ask the model to produce its interpretable result for the whole dataset and it gives you an answer; then you descend one step in the tree, so you have two different sub-datasets, and you ask the model to produce the interpretable result for each of these two subsets. The idea is that, because the split is so strong, the interpretability of the two subsets should be very different from the interpretability of the whole set. You can measure this, and if it holds for your model, then your model behaves like a decision tree, because it is changing in the same way. So here we have three metrics that are not as simple as the ones before, as you can see, but they can give you insights about interpretability.
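Here is a rough sketch of the subset differentiation idea, under two assumptions that are mine rather than part of the method as presented: the model's interpretable result is represented by a feature-importance vector, and the difference between explanations is measured with a simple Euclidean distance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6, random_state=0)

# Reference decision tree: take the split made at its root node.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]

def explanation(X_sub, y_sub):
    """Stand-in for the model's interpretable result: a feature-importance vector."""
    return RandomForestClassifier(random_state=0).fit(X_sub, y_sub).feature_importances_

# Explanation for the whole dataset, then for each side of the root split.
whole = explanation(X, y)
left_mask = X[:, root_feature] <= root_threshold
left = explanation(X[left_mask], y[left_mask])
right = explanation(X[~left_mask], y[~left_mask])

# If the model behaves like the decision tree, the subset explanations should
# differ noticeably from the whole-dataset explanation.
print("difference on left subset: ", np.linalg.norm(whole - left))
print("difference on right subset:", np.linalg.norm(whole - right))
```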
Moving to another topic, we have something called feature importance, which we also saw on the previous slide. It is a common technique that many models provide, especially the models that can rank their features, and it gives you a measure of how important each feature is. This is helpful for feature selection, for saying that certain attributes are not important and can be dropped from the dataset to get a simpler, better model, but the drawback is that it doesn't tell you what actually happens when you vary one of those attributes. For example, if we change the first attribute, the most important one, and increase its value, what happens? We cannot tell from this chart what the result of that change would be, and that is the limitation of feature importance.

Then we have another kind of interpretability, aimed at the group of the least interpretable models, which is sensitivity analysis. Sensitivity analysis is a technique that does not require the model itself to be interpretable, because you work on its inputs: you make small changes to the values of an input and you observe whether the output changes with them, and from that you learn the interaction or dependency between the two. There are two sensitivity analyses that are very popular: one is the partial dependency plot and the other is local prediction boundaries, and we will see both. Partial dependency plots are charts that explain how a certain attribute of the dataset changes your output. You start with your whole testing set, for example, and you force the value of the attribute to a certain value, say zero, and you compute the prediction for all of the records and plot them on the chart; you do the same for 10, 20 and so on across the full range, and then you connect the points, so each line is one record of your testing set in which you have varied that one attribute. Then you can see whether changes in the attribute change the output: if the lines tend to go up, you know that when the temperature, for example, is higher, the target value tends to be higher. You can also plot the average of all these lines to get a better summary of the chart. That is a partial dependency plot. You can also produce a partial dependency plot for two variables, with one variable on each axis and the change in the output shown as a heat map or a contour chart, so you can see the dependency of the output on those two attributes together. So this is sensitivity analysis with partial dependency plots: it gives you interpretability attribute by attribute, it tells you more than feature importance does, and it is not so hard to produce, so it is a good alternative for obtaining interpretability.
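A minimal sketch of how a one-variable partial dependency curve can be computed by hand is below; scikit-learn also provides this through its inspection module, and the dataset and model here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=800, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

feature = 0  # the attribute whose effect we want to see
grid = np.linspace(X_test[:, feature].min(), X_test[:, feature].max(), 20)

# For each grid value, force that attribute to the value for every test record,
# predict, and average: the averaged curve is the partial dependency plot.
curve = []
for value in grid:
    X_forced = X_test.copy()
    X_forced[:, feature] = value
    curve.append(model.predict(X_forced).mean())

for value, avg in zip(grid, curve):
    print(f"feature value {value:8.3f} -> average prediction {avg:8.3f}")
```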
The next one is local interpretability, which tries to find the boundary between the classes or the values. What we do here: there is a method called LIME, Local Interpretable Model-agnostic Explanations, that works on one record at a time, so this is local interpretability. You take one instance and you perturb the values of that record in all dimensions, in all attributes, you feed these perturbed points to the model and you record its outputs. With those results you build a new dataset, the perturbations on one side and the predictions on the other, and you fit it with a simple linear algorithm, for example a linear regression or a logistic regression. Because those models are very simple and interpretable, you can read their coefficients, and they represent the locality of the record you are looking at. Using this figure we can explain it better: if we have this point right here and we perturb it to produce new points near it, we can see that the model predicts red for all of these but blue for this one, so the boundary is here. Then you fit the linear model and it gives you this line, and its coefficients tell you which attributes matter for the explanation. In this case, for example, moving along this axis changes the prediction, because with a small jump in this direction we change the prediction completely, while the other direction is not so important: if we go from here to here we are still in the red class. So this is the local explanation, but as I said, it is valid only for this record. If we apply the same methodology to another record, for example in this other region, the dependency is the opposite, it is this direction that varies less than the other, and this is the problem with local interpretability: we cannot generalize the results.

What is important about this method is that we can use it with unstructured data. We can apply the same methodology to images, for example blocking some regions of the image and checking whether the prediction changes. If this is an object classification network and the picture is obviously a frog, if we hide these regions, will the model still predict the frog? That is the question, and with it we can find exactly the key, critical regions that explain the prediction. We can do the same with text: if you have free text you can delete, erase or hide some words and ask the model again; for example, if this is a network that says whether a review of a product is good or bad, we delete some words and check whether the output is the same as before. In the case of images this gives you, as I said, the key critical regions of the image: for predicting the frog we see that this part matters, because when we give the model only this region it predicts a pool table, for example, and when we give it this other region it predicts yet another class, so we can see exactly which regions the model relies on. But again, this is local; it applies only to this photo.
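As a minimal sketch of the mechanism just described, here is a hand-rolled local surrogate rather than the actual LIME library, which handles sampling, distance kernels and feature selection much more carefully; the perturbation scale and the proximity kernel below are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=1000, n_features=6, n_informative=4, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# One record to explain, and random perturbations around it.
x0 = X[0]
rng = np.random.default_rng(0)
perturbations = x0 + rng.normal(scale=0.5, size=(500, X.shape[1]))

# Black-box outputs for the perturbed points (probability of the positive class).
probs = black_box.predict_proba(perturbations)[:, 1]

# Weight the perturbations by proximity to x0, then fit a simple linear surrogate.
distances = np.linalg.norm(perturbations - x0, axis=1)
weights = np.exp(-(distances ** 2) / 2.0)
surrogate = Ridge(alpha=1.0).fit(perturbations, probs, sample_weight=weights)

# The surrogate's coefficients are the local explanation for x0.
for i, coef in enumerate(surrogate.coef_):
    print(f"feature {i}: local weight {coef:+.3f}")
```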
The next strategy is, as I said before, using actual people, experts, to evaluate these models; after all, the definition of interpretability has a human in it. There are two main lines for evaluating with experts. One is a literature review, which does not require experts directly: you only need to look for articles and papers that support the same rules that your model, your interpretable results, give you. If your interpretable results say something and you find a paper that says the same, you can use that paper to support your results. The other one is, if you have access to experts, to present the results to them and ask whether they think they make sense or not; you can do this in a focus group or in a survey, and there are many strategies for that.

Finally, we will see an application of everything presented here to the stroke risk prediction problem. As Nelson said at the beginning of this talk, this is a problem we addressed around five years ago, and the idea is to predict whether a patient will have a stroke in the next year, the next three years or the next five years, because stroke is one of the main causes of death in Japan and worldwide, and we are interested in detecting it as early as possible. So this is a binary classification problem between the stroke and non-stroke classes, and the data we use includes patient demographics like age, gender and body mass index, the history of diseases (all the diseases a patient has had), and some exam results like blood tests, urine tests, things like that. We applied our model, the Dempster-Shafer gradient descent, to this scenario, and one of the things the model outputs are these rule tables. We will see them in detail tomorrow; for now you only need to know that this is one of the outputs of the model. The model gives you rules that explain the prediction for each class. For the stroke class, for example, one rule says that if the patient has a history of cerebrovascular disease, then they are very likely to have a stroke; this is the top rule. We have another rule saying, for example, that if the patient has diabetes, their risk of a stroke increases, and so on, and we have the same for the non-stroke class.

Since the model outputs these rules, we can apply what we saw earlier: comparing to a decision tree (we will use the second methodology and compare rule by rule), comparing the results to LIME, looking at partial dependency plots, and also working with experts by checking the medical literature and running a survey. If we solve the same problem, the same task with the same dataset, with a decision tree, this is the tree we obtain, and the second metric is rule equivalence: taking our rules from the previous slide, we check whether each one is equivalent to one of the rules in the other model. We can see that the first rule is the same as the root node, so this is a match, and this is good for interpretability; then, descending the tree, the third rule appears here, while the second rule does not appear in the tree at all. So we have a match for the first and third rules, which is good; in general you do not get many matches between a decision tree and other models, so having one is good and having two is better. That was the first evaluation.

The second one is using LIME. As I said before, we can present one instance; here we have two different instances of patients who had a stroke, and we can see the local boundary, which is given by the coefficients of the variables, and we can also do this kind of rule equivalence between them. The first rule repeats there and there, which tells us that it is basically one of the most important rules; the second one also appears in both, and you can find other matches, for example diabetes up here and here, but not all rules match, and that is okay, it is expected because these are very different models. We can do the same for the other class: for non-stroke we find fewer matches, because here we have very diverse real-world patients, unlike the group of patients who had a stroke, which has very specific characteristics, so we do not find as many matches, but this is fine.

The next one is partial dependency plots. As I said before, we can take one attribute, force all the records to a range of values and look at the contribution. For example, this chart tells us that the higher the body fat a patient has, the higher the risk, which is one of the rules we also had before; for platelets we have a break point here, where below this value the risk is higher and above it the risk is lower, and you can read other interpretable results from these charts.

Moving to the expert side of interpretability evaluation, we can compare with the medical literature: for each of these rules we can check the literature, papers and articles, to see whether they have something to say about it. For the first one, having had a cerebrovascular disease in the past, we have strong evidence: here are three articles, all of which say that the risk is increased, by 25% in one case and up to 5 times in others. So this is very well known in the medical field, and the model also extracts it from the data.
We do not know anything about medicine ourselves, obviously, so this is an interesting result. Another one, just to show you: for diabetes we also have articles stating that the stroke risk is twice as high in patients who have diabetes as in those who do not, and these are medical studies that recruit patients and use all the biomedical analysis needed to reduce randomness, so they are probably very reliable. But we also ran an expert survey with actual physicians and neurologists, and we asked them whether they agree or disagree with the rules we are presenting. We take a rule, we write it in a more comprehensible way (this is in Spanish, by the way) and we ask them whether they consider it true, false, or not applicable. We did this for the most important rules, and this is the result: for the first rule, for example, all the experts agree that it is actually true, so that is fine, and for the other three rules there are some differences but in general they agree; in the last column you have the validation rate. But there are other rules, beyond the ones listed here, that the experts think are incorrect, where the validation rate is less than 50%, meaning they believe the opposite should be true. This is one of the most interesting features of the model, I think, because you can check against the data that the rule holds, and this challenges the knowledge of these experts, or at least makes them wonder whether the statement is really true. In the medical field many things are derived from experience: if someone before said this is true, then I consider it true, without always following a rigorously scientific way of proving the statement. So this is interesting, and at the very least it opens new research lines for them; we can run another study to see whether the attribute really behaves this way, and the model is what points to it, which is quite valuable in my opinion.

To end this presentation: we have seen that evaluating interpretability is not as easy as evaluating performance. You have to consider two kinds of approaches: one is automatic, based on comparing with other models, and the other relies on experts, the real users of these models. Doing interpretability analysis can also help you build better models, because you understand better what a model is doing instead of just putting data in and getting results out, and this is good for any task, I think. The last thing that is important to highlight is that it is not necessary for every rule of the model to be validated; as we saw with one of the techniques before, we can have attributes or rules that are new for the study or particular to this case, and that is okay. As with performance, we do not aim for 100%, just for a high value, not a perfect one. So this was my presentation, thank you for your time; I think we have time for questions.

Yes, the best way is to perform another study to test exactly that statement. If you find, for example, that high-density cholesterol is good instead of bad, you can take patients, separate them into groups and run a new study to prove whether this is true or false; that is the best way to know for sure.

About the partial dependence plot: if it is a regression, the vertical axis is your target, and if it is a classification, it is the probability of belonging to the positive class; here, the probability of having a stroke, of belonging to the stroke class, is increasing. Let me go back to this slide.
In decision trees you have nodes, and each node splits your data into two groups: the one that satisfies the rule and the one that does not. The model produces interpretability, so you can check the interpretability before the split (say this was the explanation of the model before the split) and then check it again after the split, for each of the two subsets. The idea of subset differentiation is that these subsets should be very different from the original, because the split at the node is strong, so you can measure how different these interpretable results are with respect to the first one, and that gives you a metric of similarity to the decision tree.

Yes, it is a bit like removing one feature, but done in a smarter way, because you have the support of the decision tree behind it.

You have different degrees of matches: there is the full match, where the rules are exactly the same, but you can also have partial matches, for example the same attribute but different ranges, and then you can compute the overlap between the two rules and use that as a metric of similarity. An easy way to do it is the split strategy: you apply each rule to the whole dataset, you see how many records fall on each side, and then you can check whether the two rules behave equally or similarly; you can use the rules themselves for the check.

Yes, that is something intrinsic to the model. With artificial neural networks, for example, you cannot derive rules directly; there are techniques for it, but they usually do not work so well. So this is a precondition for applying these metrics: your model has to be able to produce rules.

I have never heard of that strategy, but it could work; at least it is interesting to try. There are other methods, as I said, that you can apply on top of any model and that output rules automatically, doing something very similar to what you are describing, but I think that touches the state of the art; all of these techniques are very new, they are about five years old.