Good afternoon, everyone. This is the last lecture of the series, and it's about extending the Dempster-Shafer classifier model that we were reviewing in the previous lectures to other applications: other problems or tasks, and other use cases. Well, you know me, I'm Sergio Peñafiel. Today we will review the classifier that we saw in the series, but also look at the limitations it has, because while we have been talking about the strengths of the classifier, the interpretability and all of that, it has several drawbacks that we need to address to make it better. Then we will see how to use Dempster-Shafer theory to solve regression tasks; we will see two different models that we have developed along this line. We will also review the Dempster-Shafer classifier in geographical scenarios, where you have coordinates with latitude and longitude. And finally, we will see how to use Dempster-Shafer to combine any models.

Right, so in this series we proposed a classification model that uses Dempster-Shafer theory and gradient descent optimization for the mass values. It's a rule-based model with several highlights, as you can see there: optimization using gradient descent, globally interpretable results, and accuracy comparable to other machine learning classifiers, not the best, not the worst, right? The ability to input expert knowledge as rules into the model is another highlight. One of the most important ones is that the model can handle missing values, which is not so common in other models, like artificial neural networks; you can handle missing values with this model out of the box, and it can be applied to any tabular dataset. Those are the highlights, the pros of the model, and in the chart that I showed you before, we sit in the highly interpretable group of models, with an accuracy comparable to, say, a support vector machine or k-nearest neighbors.

But as I said, the model has drawbacks, and the main limitations are listed here. The first one concerns the Dempster-Shafer implementation we used, which I also showed you in the last lecture: uncertainty is simplified to a single value. The theory allows mass to be assigned to any combination of outcomes, but we use only one uncertainty value in the mass assignment function. That is a limitation because we don't have the full expressiveness of Dempster-Shafer theory, but we do it mainly for performance: if you keep the full array of subset masses, every step involves computations of exponential length, and that is very expensive. The second drawback, one of the most important I think, is that the rules are defined at the beginning, before training, and the model itself doesn't learn the rules. It adjusts the values of the rules and determines which rules are important and which are not, but it cannot produce rules; you have to set them up at the start. This is a big limitation because you are constrained to whatever you put in the rules. In some cases that is fine, but in others you would like a more automatic way of finding these rules. This is something we need to improve to make the model fully automatic, but today it is one of its main limitations.
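To make that first limitation concrete before continuing with the list, here is a minimal sketch (my own illustration, not code from the package) contrasting the full power-set mass assignment that the theory allows with the simplified one the classifier uses:

```python
from itertools import combinations

classes = ["A", "B", "C"]

# Full Dempster-Shafer: a mass for every non-empty subset of the frame
# of discernment, i.e. 2^n - 1 values, exponential in the class count.
full_focal_sets = [
    frozenset(s)
    for r in range(1, len(classes) + 1)
    for s in combinations(classes, r)
]
print(len(full_focal_sets))  # 7 focal sets for just 3 classes

# Simplified scheme used by the classifier: one mass per singleton class
# plus a single mass on the whole frame, the "uncertainty" value.
simplified = {frozenset({c}): 0.2 for c in classes}
simplified[frozenset(classes)] = 0.4  # m(Theta); all masses sum to 1
```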
The third one is related to the first: the model is slow. You saw in the implementation we went through in the last lecture that even a simple model with eight rules took a long time to train. When we ran the training cell, the for loop, it took a minute or two for a dataset with hundreds of records, and if you have hundreds of thousands or millions of records, this becomes super slow; it starts to take hours or days to train. So it's important to address this performance problem and make the computation faster. We have applied some tricks, numerical simplifications of the operations the model performs internally, for example, but that is not enough, and we think we can find new ways to compute the same rules with fewer resources. So this is another thing that can be improved.

The fourth is that rules define hard boundaries for records. What I mean is that if a rule says some attribute has to be greater than 20, a record with a value of 20.001 triggers the rule, but a record with a value of 19.999 doesn't. It's a huge step; there is no soft transition between a rule applying and not applying. This is a problem for the expressiveness of the model, and it's related to the last drawback, because we would like a way to move between rules more smoothly.

And the last one: because of these hard boundaries, the model can only produce 2^R different outcomes, where R is the number of rules. So if your model has eight rules, as in the previous example, it can only produce 2^8 = 256 different outcomes. You cannot generate any other outcome from these rules, because we are ultimately discretizing the input space into complying with each rule or not. This limitation doesn't exist in models like artificial neural networks, where you can, for example, multiply real-valued inputs and obtain any other value, but you do find it in random forest algorithms; they have the same problem.

We do have possible improvements for these problems, and these are the three I would like to implement; because of time, basically, I didn't, but we can try. The first addresses the uncertainty problem, that we only have one value for the uncertainty. We can explore using all the subsets, but to do that realistically we need an approximation algorithm for computing the Dempster combinations, because using the exact formula that Dempster-Shafer theory proposes would be very, very slow. The second is to soften the rule behavior. We propose a metric of degree of belonging to a rule. Today a rule is either true for a record or not; it's binary, two possible options. But we can extend that to a degree of belonging, so for a given rule we can say a record belongs to it with a degree of, for example, 0.1, and then you have a softer way to move between having a rule and not having it.
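To illustrate what such a degree of belonging could look like, here is a minimal sketch of my own (not the model's actual implementation): the hard threshold is replaced by a sigmoid, which has the advantage of being differentiable.

```python
import numpy as np

def hard_rule(x, threshold=20.0):
    """Current behavior: a record either satisfies the rule or it doesn't."""
    return 1.0 if x > threshold else 0.0

def soft_rule(x, threshold=20.0, steepness=2.0):
    """Degree of belonging: a smooth, differentiable transition
    around the threshold instead of a hard step."""
    return 1.0 / (1.0 + np.exp(-steepness * (x - threshold)))

print(hard_rule(19.999), hard_rule(20.001))  # 0.0 1.0 -- a huge step
print(soft_rule(19.999), soft_rule(20.001))  # both approximately 0.5
```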
This would also improve the performance of the model; we have run some experiments on that. But we need to fit it into the optimization method as well: the degree of belonging has to be a function whose derivative we can compute for the optimization to work. It's not so easy, but it could be a great improvement, especially for the performance of the model. And the last improvement addresses the problem of having the rules defined at the beginning. We can let the model drop rules or combine rules during the training phase. If after some iterations a rule's certainty keeps decreasing, its uncertainty increasing, that rule may not be informative, and we can drop it. And if one rule is strongly correlated with another, so that in every iteration where one increases its values the other does too, we can merge them. Those are two options, dropping rules and combining rules, to end up with fewer rules. And if we end up with fewer rules, we can afford to put more rules in at the beginning, because it will not blow up the training time. These are three of the main improvements that can be made to the model, and maybe in future releases they will be implemented; that's the idea. We would also like, for example, to learn the rules themselves to some degree. Take the same example as before: if you have a rule like age greater than 20, you would like to be able to change that value of 20 and make it a parameter of the model. To do that, you need to be able to compute the gradient all the way back to this value. If you remember, in the last lecture we had a method called select_rules, or something like that, which used an if statement to decide whether a rule applies or not, and an if is not something we can differentiate to obtain a gradient. If you replace it with a function, you can differentiate through the condition. That is the idea behind the degree of belonging. These are improvements we will be working on, and we hope that in the next iteration of the model they will be in place.

Moving on to extensions of the Dempster-Shafer classifier: so far we have only talked about classification, but the other main task of supervised learning is regression. For regression tasks it's not so clear, not so straightforward, how to use the Dempster-Shafer classifier to produce a real value in an unconstrained domain. In classification you have the frame of discernment, the possible outcomes in Dempster-Shafer, and a one-to-one mapping from those outcomes to the classes of the problem; for regression it is not so clear how to use the Dempster-Shafer masses and all this machinery to produce the target value that a regression needs. We have explored two different methods to address this, which I will present here. The first is the embedded interpretable regression, and the idea is to create an embedded model where you combine more than one model: first the Dempster-Shafer classifier, the same classifier we have seen, and then a regressor, any regressor; it could be gradient boosting, even a linear regressor, it doesn't matter. Say the target value ranges from 0 to 10. You can create groups over this variable: for example, if the target value is less than 2, you assign one class, the lowest class.
If the value is between 2 and 5 you assign another class, from 5 to 7 another, and from 7 to 10 another. So you discretize the target value into these classes, and once you have classes you can apply the Dempster-Shafer classifier to them; you now have a classification problem. If you apply the Dempster-Shafer classification model to these groups of target values, you get the classification and, along with it, the interpretability for each group: the most important rules for the lowest values of the target, for example, and likewise for the mid and higher values. So we get interpretability from these groups. It's an approximation, of course, but it works. After that, we apply a regressor to each group: we divide the data into the four groups and train one regressor on each subset. So we have four regressors, one per group; the regressors' task is to predict the value, and the classifier's task is to produce the interpretability. We combine the classification with several regressors to produce both a target value and an interpretable result. That's the idea of the embedded interpretable regressor, and we actually have an implementation that I would like to show you.

Similar to the classification model, we have this embedded interpretable regressor as a Python package. First we need to install the classifier, because it's a dependency of the regressor, and then the regressor itself. The regressor is also in a publicly available GitHub repository. It doesn't have a great README, but the documentation inside is okay, so you can check that. To use it in Colab we clone the repository, and for this example we'll use a toy dataset: the insurance dataset, which has some variables about a patient, like the age, the body mass index, and so on, and the last column, which is our target, is the charges of the insurance. We would like to predict this value and also get interpretable results about the prediction. We can make some charts just to look at the data: we have about 1,300 records; the age attribute is balanced, and the charges are not so balanced, but that's not a big deal. For this problem we will apply the same methodology as before and split the dataset into a training set and a testing set; in this case we are using 33% of the data for testing. So here we have, for example, our training features. They are all numerical, because we applied pandas' get_dummies function, which converts categorical features to numerical ones. This is mainly a limitation of the regressor, because the classifier itself can handle categorical values. And here we have the regressor, the embedded interpretable regressor, the EI regressor. If you look at the docs of this class, you first pass a regressor model, then the parameters for that regressor, then the number of buckets, which is how many groups you divide your target value into to perform the classification. Then you pass the bucketing method, that is, how these groups are created. We have different alternatives: "range" means the target range is split evenly by value; "quantile" splits so that every group has the same number of records; and "max score" chooses the groups that maximize a score over the target values in each group. Then we have arguments for the regressor and arguments for the classifier.
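To illustrate the bucketing step on its own, here is a minimal sketch using pandas (my own illustration; the package's internals may differ) of what the "range" and "quantile" strategies amount to:

```python
import numpy as np
import pandas as pd

# Toy target values in [0, 10]
y = pd.Series(np.random.uniform(0, 10, size=1000))

# "range" strategy: split the target range into equal-width intervals
range_buckets = pd.cut(y, bins=4, labels=False)

# "quantile" strategy: split so every bucket has the same number of records
quantile_buckets = pd.qcut(y, q=4, labels=False)

print(range_buckets.value_counts().sort_index())     # counts vary per bucket
print(quantile_buckets.value_counts().sort_index())  # 250 records per bucket
```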
We can see here how to use it. We need to import the embedded interpretable regressor, and we also need to import the regressor that we will use. Here, for example, we are using the gradient boosting regressor, but you can use linear regression, a random forest regressor, anything that exposes the fit/predict interface will work. After choosing a regressor, we instantiate the embedded interpretable regressor: we pass the regressor we are using, then some arguments for that regressor, then the arguments for the bucketing (here we are using three buckets with the quantile strategy), and then the parameters for the classifier: the number of iterations, the loss function, the learning rate, all that stuff. Once instantiated, it is used like a normal model with the fit and predict methods. fit does the training: the classifier is trained, and also the regressors for each group; we see the rule strategies that we saw before, and there it goes, it fits the classifier and the regressors. For predictions you only call the predict method on the testing set, and there we have y_pred. Looking at it, each record in the testing set has been assigned one value, and these are real numbers, not classes as before. Now that we have a regression model, we have new metrics, like the R-squared and the mean absolute error. If we evaluate... oops, I don't know what's happening... there: the regressor has an R-squared of 0.84, which is high, and a mean absolute error of about 1,700. These are the metrics for the full embedded regressor, the one with all the buckets and groups inside. But we can also inspect the classifier: it produced this confusion matrix, as you can see, and this is the accuracy and the F1 score, I guess. The classifier achieves 78% accuracy, which is decent, not great. The interesting thing about this embedded regressor is that it doesn't depend much on the classification accuracy, because after the classifier there is a regressor that can fix the classifier's mistakes; even if this value is not so high, the model can still achieve high values on the regression metrics. And all of this is in service of finding the most important rules for each class. Here we have the same method the classifier has, print most important rules, and we can see the most important rules for each of the three buckets, classes 0, 1, and 2. Class 0 is the lowest class, the group with the lowest target value, so the lowest insurance charges, and we can read the rules it produces: for example, these tell us that people who don't smoke (smoker_yes equal to 0) pay low insurance; also people who don't have children; young people again; and so on. The rules kind of make sense for this class. If we look at the class that pays the most, class 2, we see people with a lot of children (four children), people who are old, and people who smoke. So it makes sense, and we can achieve this kind of interpretability for our regression even though we grouped the records to apply the classifier. It's an interesting trick, and it works sometimes. So this is one of the main ways to handle a regression task while keeping the interpretability we had before. You will also have this implementation available, one you can apply to a regression problem.
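Putting that walkthrough together, the usage looks roughly like the sketch below. Note that the class and argument names here (EmbeddedInterpretableRegressor, n_buckets, bucketing_method, print_most_important_rules, and the import path) are my approximations of the API described in the lecture, not verbatim from the package; check the repository for the exact signatures.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical import; check the repository for the real module and class names.
from ei_regressor import EmbeddedInterpretableRegressor

# Stand-in data; in the lecture this is the insurance dataset after get_dummies.
X = np.random.rand(1300, 6)
y = 10_000 * X[:, 0] + 2_000 * np.random.rand(1300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

model = EmbeddedInterpretableRegressor(
    regressor=GradientBoostingRegressor,     # any fit/predict regressor works
    regressor_args={"n_estimators": 200},    # forwarded to the regressor
    n_buckets=3,                             # how many target groups to form
    bucketing_method="quantile",             # equal-count groups
    classifier_args={"learning_rate": 0.01}, # Dempster-Shafer classifier setup
)

model.fit(X_train, y_train)         # trains the classifier plus one regressor per bucket
y_pred = model.predict(X_test)      # real-valued predictions
print(r2_score(y_test, y_pred), mean_absolute_error(y_test, y_pred))
model.print_most_important_rules()  # per-bucket interpretability (name assumed)
```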
OK. So, moving back to the presentation: this is not the only regression model that we have tried. Another one is the weighted evidential interpretable regression. It's an extension of another method, the evidential regressor, which combines a k-nearest neighbors regressor with Dempster-Shafer theory. The way it works is by finding the k nearest neighbors of a record and using their target values to produce the prediction we are looking for; but instead of taking the average or the mode or something like that, the model uses a custom distance function that assigns a different weight to each dimension, so you can, for example, shrink one dimension relative to another, and the distances change accordingly. Also, the distances are not treated as the plain point-to-point Euclidean distances we know; they are evidence in the Dempster-Shafer sense. If a neighbor is close to me according to this distance function, we say we are quite certain that this record is similar to the one I want to predict; if a record is far from the record I am predicting, it carries high uncertainty. We pass these through mass assignment functions, apply Dempster's rule, and then average, or apply some function, to compute the target value. So we combine two things: the weights of the dimensions, which are parameters of the model, by the way, learned by gradient descent like in the previous example; and the Dempster-Shafer decision step that produces the target value. Unfortunately, this model only has local and group interpretability; it has no global interpretability, and that is its main limitation. But we have applied it in real applications, for example predicting healthcare costs like in the example I showed you before, and it can produce good results: as you can see in this table, it outperforms gradient boosting and some artificial neural network architectures, with a better R-squared and a lower mean absolute error. And it can produce rules that don't apply to the full dataset but can still give us insights about the data. Another feature I didn't mention is that the uncertainty computed in the Dempster-Shafer decision for the target value can also be used to build a confidence interval around it: high uncertainty gives a wider confidence interval, low uncertainty a narrower one. We applied this, for example, to predicting retail indicators, conversion rates, sales, the number of visitors to a store, and it worked well; here is a chart that shows the actual value against the predicted value. This method is more complicated than the previous one, so I can't show it in action here, but it's also publicly available, so you can fork it and use it.
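As a rough illustration of this distance-to-evidence idea, here is a minimal sketch of my own (not the published model): each neighbor's weighted distance is mapped to a simple mass assignment in which farther neighbors contribute mostly uncertainty. The names dim_weights and gamma are mine.

```python
import numpy as np

def neighbor_mass(x_query, x_neighbor, dim_weights, gamma=1.0):
    """Turn a weighted distance into a simple mass assignment: mass on
    "this neighbor's target is informative" versus mass on "unknown".
    dim_weights play the role of the learnable per-dimension weights."""
    d = np.sqrt(np.sum(dim_weights * (x_query - x_neighbor) ** 2))
    support = np.exp(-gamma * d)  # close neighbor -> strong evidence
    return {"value": support, "uncertain": 1.0 - support}

w = np.array([1.0, 0.1, 2.0])  # e.g., shrink dimension 1, stretch dimension 2
m_near = neighbor_mass(np.zeros(3), np.array([0.1, 0.1, 0.0]), w)
m_far = neighbor_mass(np.zeros(3), np.array([3.0, 3.0, 3.0]), w)
print(m_near, m_far)  # the far neighbor carries high uncertainty
```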
Another use of Dempster-Shafer, again for classification but in a different context than the tabular datasets we were working with before, is Dempster-Shafer for geographical data; we have proposed a way to handle this type of dataset with the theory. Geographical datasets are those that have latitude and longitude among their attributes, so you can put the records on a map, and the task is usually to produce a value that also depends on these coordinates. One of the important changes we made to the model to handle geographical data is to associate distance with evidence, similar to the previous model: if some event, a record from the training set, happened in a place, and we are predicting a value near that place, we can state that this record is informative and has low uncertainty; if we are far from it, it has higher uncertainty. It's similar to the k-nearest-neighbors strategy, but applied here to the geographical coordinates. The other important thing about geographical datasets is that they can be augmented with public datasets. If we are predicting something that is happening here in Yerevan, for example, we can use public information about where the restaurants are, where the bus stops are, where the banks are, and feed this to the model as evidence, or as rules, so the model can decide whether it is informative or not. And we have a way to combine this geographical information, this evidence, using Dempster's rule. As a case study, we applied the Dempster-Shafer geographical model to predicting crime occurrences in Chile. We worked with the police there and applied the model to a region, a certain quarter of the city; they wanted to know the hotspots, the places where crime is most likely to occur, so they could be there and prevent it. We trained the model, applied the whole methodology, and the output is a heat map that shows the most likely spots for crime occurrences. We can check, against these stars that come from the testing set, whether our predictions were right; we achieved around 80% accuracy, or something like that. This is one of the applications we developed for Dempster-Shafer, and as you can see, you can extend this model to regression and to other types of classification beyond tables. But there is another way to use Dempster-Shafer that is not inside the model itself but in the decision made after the models produce their results. As you have noticed, Dempster-Shafer theory is a theory for decision making: you have evidence, you have uncertainty, and you can make a decision with that, following the procedure.
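To make the geographical adaptation concrete, here is a minimal sketch of my own (not the published model) that turns the great-circle distance between a training event and a queried location into an evidence strength; scale_km is an assumed tuning parameter.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def event_evidence(event_coord, query_coord, scale_km=1.0):
    """Nearby training events are informative (low uncertainty);
    distant events contribute mostly uncertainty."""
    d = haversine_km(*event_coord, *query_coord)
    support = np.exp(-d / scale_km)
    return {"crime": support, "uncertain": 1.0 - support}

# A past event roughly 200 m away vs. 5 km away from the queried spot
print(event_evidence((40.1792, 44.4991), (40.1810, 44.4991)))
print(event_evidence((40.1792, 44.4991), (40.2242, 44.4991)))
```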
So one thing we can do, for any task, is ask existing models to solve it. We have a regression, a classification, anything, and we ask several different models to solve the same task. What we typically do is compare these models, find the best one, take it, and deploy it. But another thing you can do with the results of these models is to say: each model is giving me evidence about the result of the task I want to solve, and this evidence can be combined using Dempster-Shafer theory. So if model 1 gives me a result, say in a classification problem model 1 says class 1 is the one for this record, with some probability distribution over the classes, we can convert that into a mass assignment function with an uncertainty for the model, depending on the model's output; and we can do the same for model 2 and model 3. If we have, for example, gradient boosting, an artificial neural network, and a support vector machine, anything that gives the same kind of output, possibly different results for the same problem, we can transform these outputs into mass assignment functions and combine them using Dempster's rule, so we get a mass for the combination and then produce a result that is the fusion of the three original models. We are using Dempster-Shafer theory here, like I said, to make a decision about which model to trust for each record. And it's very simple; you can code it yourself and use it, for example, to get a better model. This strategy usually works better than just picking the single best model in the first place, which is really interesting: you can use Dempster-Shafer theory not only to build models but also to combine the results of models, as sketched below. So that ends this presentation. The last thing I would like to share are the articles and papers that explain all the models I presented in this lecture and the previous one; they are all published, so you can check them. There are many, so you have something to read. And that's it.
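As a parting illustration of that last idea, here is a minimal sketch of my own (not code from the lecture) that discounts each model's class probabilities into a mass function and fuses them with Dempster's rule on a two-class frame; the confidence parameter is an assumed per-model trust value.

```python
import numpy as np

def to_mass(probs, confidence):
    """Discount a model's class probabilities into a mass function:
    'confidence' is how much we trust the model; the remainder is
    assigned to the whole frame as uncertainty."""
    m = {("A",): confidence * probs[0], ("B",): confidence * probs[1]}
    m[("A", "B")] = 1.0 - sum(m.values())
    return m

def dempster_combine(m1, m2):
    """Dempster's rule of combination on the two-class frame {A, B}."""
    combined = {("A",): 0.0, ("B",): 0.0, ("A", "B"): 0.0}
    conflict = 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = tuple(sorted(set(s1) & set(s2)))
            if inter:
                combined[inter] += v1 * v2
            else:
                conflict += v1 * v2  # mass on contradictory evidence
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Three hypothetical models scoring one record (e.g., GBM, ANN, SVM)
outputs = [np.array([0.7, 0.3]), np.array([0.6, 0.4]), np.array([0.2, 0.8])]
masses = [to_mass(p, confidence=0.8) for p in outputs]

fused = masses[0]
for m in masses[1:]:
    fused = dempster_combine(fused, m)
print(fused)  # decide by the largest singleton mass
```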