So, good morning. Thank you for joining us for the second lecture. Today we will have a short presentation at the beginning about the model, and then a tutorial on how to use it, so the idea is for you to follow along and learn how to use it. And, unlike yesterday's presentation, if you have any questions you can interrupt me during the talk; the idea is for this to be a more interactive session.

So first we will be reviewing the classification model that I presented yesterday, which is an interpretable classifier based on Dempster-Shafer theory. Well, you know me: I'm Sergio Peñafiel, I have a Master's in Computer Science from Chile, and this model was part of my thesis work.

The first thing we need to know is what Dempster-Shafer theory is, which is the mathematical background behind the model. It is a mathematical framework that some authors proposed decades ago for making inferences and reasoning under uncertainty. It is like the probability theory, the Bayesian theory, that we may already know, where you assign probabilities to events and can then compute, for example, expected values or the probability of an event, and make inferences from that. Dempster-Shafer theory does the same, but builds all the elements from another perspective, taking into account the certainty of the events and the uncertainty present in the whole process. Some authors call Dempster-Shafer theory a generalization of Bayesian probability theory, because every scenario in the Bayesian theory is also a scenario in Dempster-Shafer theory, so everything you can express in ordinary probability theory you can express here, plus other scenarios. This theory is widely used in decision support systems, because you can ask, for example, several experts whether they agree with a certain statement or condition, and you have a procedure to combine the evidence from all your sources to produce an output.

In Dempster-Shafer theory the basic element is what we call a mass assignment function. A mass assignment function is the equivalent of a probability distribution in the Bayesian theory: you assign a value, called a mass, to the outcomes of a certain process. For example, here we want to know whether tomorrow will be rainy or sunny. In classical probability theory you would assign, say, 40 percent to one of the outcomes and 60 to the other. In Dempster-Shafer theory you instead assign masses to the power set of the outcomes, and the power set is all the possible choices you could make if you were allowed to pick more than one element. So in this case, with two outcomes, we have four subsets. The first is the null set, which would mean that tomorrow will be neither sunny nor rainy; that would be a contradiction, so generally the null set has a mass of zero. Then we have the singletons: if I pick only the rainy subset, I assign it a mass, and that expresses the certainty I have that tomorrow will be rainy; the same for the sunny singleton. And finally we have the complete set, sunny or rainy, which means that either of the two outcomes will happen. You may be wondering why this is an option in Dempster-Shafer theory, and that is because of the uncertainty: this mass is, in the end, a measure of the uncertainty I have about the problem.
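For reference, the usual formal definition: a mass assignment function m over a frame of outcomes \Theta assigns a value to every subset, with

m : 2^\Theta \to [0,1], \qquad m(\emptyset) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1

and the belief and plausibility mentioned below are

\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \mathrm{Pl}(A) = \sum_{B \cap A \neq \emptyset} m(B).

For the weather example, \Theta = {rainy, sunny}, and one assignment consistent with the ranges quoted later is m({rainy}) = 0.32, m({sunny}) = 0.08, m({rainy, sunny}) = 0.60, which gives Bel({rainy}) = 0.32 and Pl({rainy}) = 0.32 + 0.60 = 0.92.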
For example, if I am not a meteorologist and I don't know anything about the weather, maybe I will set this value very high, because I don't know the process, I am not an expert. But if I am certain about some of the outcomes, I can move mass from this uncertainty set to one of the other ones, and this allows us to express the uncertainty explicitly in this package of masses. Dempster-Shafer theory also proposes a procedure, called the Dempster rule, with which you can combine different mass assignment functions. So if one meteorologist says this will be the setup for tomorrow, and another one gives other masses, you can combine them to produce a new mass assignment function that represents the combination of the two.

So that is the basic setup of the theory; you can go deeper and work through all the mathematics, but that's the basis. A mass assignment function has some restrictions: the mass values must sum to one, as in probability theory, and the null set is always zero, because there is always some outcome, so the null set expresses nothing. There are also other metrics, such as the belief, which is the minimum support for an outcome, and the plausibility, which is the maximum support for an outcome. The theory can then be viewed as working with probabilities, but with ranges instead of a single number. For example, you can say it will be rainy with a probability between 0.32 and 0.92, which is this mass plus this one, so that is the probability range for the first outcome, and you have another range for the other, from 0.08 to 0.68. So some authors see the theory as working with probabilities, but using ranges instead of numbers.

So we took this theory and put it into a classifier, and that is what I proposed. This was the chart I showed yesterday with the trade-off between accuracy and interpretability, and we saw that no proposed model was both highly accurate and highly interpretable. We would like to be in this part of the chart: in the mid zone of accuracy, not among the super accurate models, but interpretable. To do that we proposed this model, the Dempster-Shafer classification model, and we use gradient descent to optimize its values; we will see that in a minute. But some highlights of the model, since I am proposing a new model: it can address some of the most common problems in classification. For example, the model can handle missing values without needing to impute them or use any other strategy, and we have a procedure to include expert knowledge in the model directly, while staying interpretable and extensible, many of the dimensions we would like a model to have.

So, how does the model work? The first thing is that the model is rule-based, and a rule, in the context of the model, is defined as the combination of a statement or condition that we can verify with the data, and a mass assignment function attached to that condition. In this example the condition is x greater than four, but we could have, say, a condition stating that the age of the patient is greater than 20 years, so that rule would apply to adult patients. And the idea of the attached mass assignment function is that if the condition is true, then we have this piece of evidence about the problem we are solving.
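To picture a rule as data, it is a predicate plus a mass vector; this is only an illustration of the idea in plain Python, not the library's internal representation:

# One rule: a readable caption, a condition evaluated on a record, and a
# mass assignment over [outcome 1, outcome 2, uncertainty].
rule = {
    "caption": "age > 20",
    "condition": lambda x: x["age"] > 20,
    "masses": [0.05, 0.05, 0.90],  # e.g. stroke, no stroke, uncertainty (high before training)
}

record = {"age": 34}
if rule["condition"](record):
    evidence = rule["masses"]  # this rule contributes its masses for this record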
For example, in the stroke risk prediction problem that we saw yesterday you have two options, stroke or no stroke, plus the uncertainty. So you can have one rule saying "the patient has diabetes", which we know is one of the most important ones, and then these values provide the evidence supporting one of the outcomes whenever the condition of the rule holds. That's the idea of a rule in the context of the model: evidence attached to conditions.

So the model works with these rules, and defining them is the first step of working with the model. That can be done in two ways. We can use expert knowledge to provide the rules, asking experts what they think is important for the prediction and putting that into the conditions, or we can generate them automatically using a statistic, for example: you look at the range of one variable, you make splits, and you create one rule for each split. The conditions must be set in this first step and remain unchanged during the whole process, but the mass values are learned by the model during training. In the beginning the model starts with a high uncertainty value and the other masses close to zero, because we don't know anything about the outcomes in advance, and during training these mass values are adjusted to the optimal values for the prediction.

Yes, it depends on the number of outcomes you have. If you have only two, as in this example, there is a single row for the uncertainty, but if you have three, you have the power set of them, which is eight, so you would have a row for each combination of outcomes. In this implementation we use the simpler alternative of one row for the uncertainty; the model supports the other one, but mainly because of the exponential growth of this table, I mean, if you are classifying into 10 classes you get more than a thousand rows and it becomes very expensive to compute, we keep one row for the uncertainty. We also did binary classification, where the two are equivalent. And every one of these masses is a parameter the model has to tune, so if you have many of them you need more data to converge; otherwise it's not feasible.

So the prediction process of the model works like this, this is the chart. We have the input vector x, and we have our rule set, all the rules defined in the previous step. The first thing the model does is select only the rules whose conditions are satisfied by the record we are looking at; in this example, these two rules are true for this record. Then we combine their mass assignment functions using the Dempster rule I mentioned before, to obtain a single mass assignment function for the combination. With this combined mass assignment function we estimate the class with maximum probability, because we can go from Dempster-Shafer masses to probabilities; there is a transformation that allows us to do that. So we transform the mass assignment function into a probability distribution, and the model outputs the class with the maximum estimated probability. In the first iteration of this, obviously, the uncertainty is very high, so the resulting estimated probability will be practically all uncertainty.
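A minimal sketch of that combination step, under the "one mass per class plus one uncertainty mass" simplification described above; this is illustrative code, not the library's implementation:

def dempster_combine(m1, m2):
    """Combine two mass vectors [m_class_0, ..., m_class_{k-1}, m_uncertainty]
    with Dempster's rule, assuming the sources are not fully conflicting."""
    k = len(m1) - 1
    combined = [0.0] * (k + 1)
    conflict = 0.0
    for i in range(k):
        # Sets intersect on class i: both sources agree, or one is uncertain.
        combined[i] = m1[i] * m2[i] + m1[i] * m2[k] + m1[k] * m2[i]
        for j in range(k):
            if i != j:
                conflict += m1[i] * m2[j]  # contradictory evidence
    combined[k] = m1[k] * m2[k]  # both sources uncertain
    return [m / (1.0 - conflict) for m in combined]  # Dempster normalization

# Two rules fire for a record; their evidence is fused into one mass vector.
fused = dempster_combine([0.7, 0.1, 0.2], [0.1, 0.2, 0.7])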
But we can apply the gradient descent technique to optimize the mass values. For the training process, and this is supervised learning, so we know the class of each record in the training set, we do the same: we predict the class of x using the same procedure, but once we have the estimated class we can compare it with the actual class of the record and compute the loss between the two. The loss function could be the mean squared error or the cross entropy, anything. Then we update the initial mass values, the values in the mass assignment functions of the rules, using gradient descent: we compute the derivative of the loss with respect to every one of these values and apply some optimizer, for example plain gradient descent, stochastic gradient descent, Adam, anything, to update them.

You repeat this for all the records in your dataset, and then you repeat the whole thing multiple times, like in neural networks: processing the full training set once is called one iteration or one epoch, and you do it again and again until the loss, or the variation in the loss, is low, meaning the model has converged to the optimal values. After this training process, the mass values of all the original rules have been updated to the optimal values for this prediction task.

Yes, you can see it this way: if this mass is a symbolic value, the combination is a formula, the transformation is another formula, and the loss is yet another formula, so in the end you have a very large expression, but this original value appears somewhere in the formula, so you can compute the derivative with respect to it and update the mass accordingly. We do that using automatic differentiation, which is something we will talk about in the next presentation; we will see this particular process in detail in the implementation. For now it's enough to know that the masses are updated to the optimal values using this procedure.

Once the masses have been updated to their optimal values after training, we can sort the rules according to an indicator. This is the one we use: the geometric mean between the mass of one class and the complement of the uncertainty. That gives us an ordering of the rules, which rule is more important for the prediction of a given class and which are less important, and with that we have our interpretability result: after sorting the rules by this indicator, we know which is the most important of all the rules we proposed in the first place, so we can show that as an interpretable result, and this produces the tables we saw yesterday. This gamma indicator measures the importance of a rule for the prediction of class k. It is something we propose; maybe there are other indicators that measure the same thing, but the idea is that if a rule has a high mass for class k and a low uncertainty, the geometric mean is high and the rule is sorted first; the table is sorted descending by this value. So these are the three tasks the model can do: the prediction, of course, the training process, and the interpretability, which is the new part.
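Written out as a formula, reconstructed from that description (the exact definition is in the paper), the indicator for a rule r and class k would be

\gamma_k(r) = \sqrt{\, m_r(C_k) \cdot \left( 1 - m_r(\Theta) \right) \,}

where m_r(C_k) is the mass that rule r assigns to class k and m_r(\Theta) is its uncertainty mass; rules are reported in descending order of \gamma_k.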
As I said in the beginning, the idea is to show you not only how this model works in theory but also in practice, so for today we have a demo using actual code, the implementation of the model, and how to use it. We have here a Jupyter notebook in Python; the implementation is in Python. You can find the model here, this is the repository of the model, which is public, so you can go to this URL, which is my repo. I don't know how to send you this link right now, but okay, we can continue. This is the repository for the method I was talking about, the classifier using Dempster-Shafer theory and gradient descent, and in the README you have a little about the implementation and how to install it. You can install it like any other package, using pip install or another installer, and then import it. We have different implementations of the model, three of them; the last one, the newest, is the best one, because it performs better and can handle multi-class classification, and you can use it like any other machine learning model. For example, if you use scikit-learn, you know you have the methods fit and predict to train and to predict new values, and this model works the same way. The details are here, but I would like to show you just a simple example.

So the first step is to install the model. As I said before, you can use pip install, putting the URL of the repo there, and this will install the dependencies and the model itself, so there should be no problem installing it.

For this example we will be doing the hello world of classification, which is the prediction of the iris dataset. We take this dataset, which is very well known: you have only four attributes describing the features of these flowers, and then the species that corresponds to each record. You may know all of this; we will just apply the model to it as a basic example. There is some visualization of the dataset here, and the idea is to predict the species, the last column, using the other four attributes as inputs. Ah, yes, that's a feature of Google Colab: you can click this button and it generates charts for you, just to visualize the information.

Here we follow the normal procedure for training a classifier. The first step is to separate the data from the target value, the class, and to separate a training set from a testing set. So what we do here is: first we shuffle the dataset, because the data comes sorted in the source. One limitation of the model is that the classes must be integers from 0 up to the number of classes minus one, so we replace the species names, assigning them integer values, and we force all columns to be numeric. We use 30 percent for testing and 70 percent for training, we split them, and after that we have these four variables. One holds the training features, a matrix of the characteristics where every row is a record and every column is a feature, and we have the classes associated with them, the training targets. They must be NumPy arrays and matrices in order to work with the model. So this is the data preprocessing.
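For reference, the preprocessing just described could be sketched like this; I am loading iris from scikit-learn to keep it self-contained (the notebook reads it from a file and encodes the species column by hand, while scikit-learn's copy already comes encoded as 0, 1, 2):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X: 4 numeric features per flower; y: species encoded as integers 0..2,
# which matches the model's requirement of integer classes starting at 0.
X, y = load_iris(return_X_y=True)

# 70/30 split; train_test_split shuffles by default, which replaces the
# manual shuffle of the sorted source data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# The model works with plain NumPy arrays.
X_train = np.asarray(X_train, dtype=float)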
The model itself can be imported as you see here. Like I said before, we have different implementations of the model: first a very naive one, which follows the procedure I described without doing anything smart, and then we apply some tricks to make it faster, and the fastest one is this, the DSClassifierMultiQ. You can import it from the dsgd library we installed in the beginning. There is some documentation you can check to see which parameters you can pass to the model, the hyperparameters; you can also check the GitHub for them.

Then we need to instantiate the model, so we create a new instance of DSClassifierMultiQ and pass these parameters, which are control parameters, mainly for the training. The first one is the only required parameter; it tells the number of classes, because depending on this value the classifier takes different strategies. Then we have the iterations, the number of epochs; we can define a range, for example here we force a minimum of 50 epochs and a maximum of 400. We have the debug mode flag, to see what the model is doing while we run the training. We can change the loss function; here we use the mean squared error, but you can use cross entropy as well. The num workers is the number of threads the model will use to parallelize the computation, and the min delta loss is a threshold value: if the loss changes by less than that, we consider that the model has converged. So these are control attributes for the training process, like I said.

This model implements the scikit-learn interface for classification models, so every method you use in the scikit-learn library, like fit, predict, predict_proba and many others, will work here; we implemented that interface. This is the fit method, which trains the model: all the computation of the Dempster rule, the selection of the rules and the optimization of the values with gradient descent is packaged in this method, and you pass it the training set, the features and the target classes. Then there are some other arguments to define the rules, because, like I said before, you can input rules manually; between these two steps you can add rules to the model, there is a method called add_rule that allows you to do that. But you can also generate the rules automatically, and we have these flags and parameters to control the rule generation. If the model has no rules when we ask it to train, it generates them automatically. You can tell the model to add single rules, that is, for one attribute you look at its range, you make splits, and you create one rule per split; the number of splits, or breaks, is controlled by the next argument, the single rule breaks, in this case three breaks per attribute. Then there are other modes, for example this add mult rules means the model can also find which pairs of attributes have higher variance, say, and build combination rules for those pairs too, but here it's set to false, so we will only use single rules. And then there are parameters to pass the column names, so the model prints the rules correctly, and this one to print the progress during the training.
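Putting that together, and continuing the preprocessing sketch above, the training cell looks roughly like this. I am reconstructing the argument names from the description above, so treat them as approximate and check the repo's README for the exact signature:

from dsgd import DSClassifierMultiQ  # import path may differ; see the README

model = DSClassifierMultiQ(
    3,                 # number of classes: the only required parameter
    min_iter=50,       # train for at least 50 epochs...
    max_iter=400,      # ...and at most 400
    debug_mode=True,   # report what the model is doing during training
    lossfn="MSE",      # mean squared error; cross entropy is also available
    num_workers=4,     # threads used to parallelize the computation
    min_dloss=1e-4,    # convergence threshold on the change in loss
)

# With no rules added beforehand (e.g. via add_rule), fit generates them:
model.fit(X_train, y_train,
          add_single_rules=True,    # one-attribute rules from range splits
          single_rules_breaks=3,    # three breaks per attribute
          add_mult_rules=False,     # no pairwise combination rules
          column_names=["sepal length", "sepal width",
                        "petal length", "petal width"])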
So if we run this, you can see that the model is training; this is the number of epochs the model has processed, and this is the loss at each one. It takes some time; this is a very small dataset, only about 150 rows, and it took about nine seconds, so the model is not super fast like other models you may be used to, but it works. The loss, I can show you this again, starts at one and decreases over the iterations; well, this run converges faster, ah yes, because we are retraining, the model had already converged.

We can see the actual rules the model uses if we set this flag to true. So here they are: for the first attribute, the sepal length, for example, it generates all these rules, four of them, and each rule is associated with the mass assignment function you see here. This is like the table I showed in the beginning: the first class, the second class, the third class, and the uncertainty. So here we have all the rules the model generated and the optimal values it reached after the fitting. In this example, with this setting, we get around 50 or 60 rules over all these attributes.

Ah, yes, this is done automatically by looking at the training set and figuring out where the best splits are, following some strategy. For example, you can take the range and split it evenly, I mean, if the attribute goes from zero to ten and you make splits you might place them at 2.5, 5 and 7.5, or you can do it by frequency, so that the groups have an even number of instances. You can control this rule generation too, but the default is the frequency-based one. And yes, one of the main limitations of the model, in my opinion, is that the rule conditions are defined at the beginning and never change during training: this threshold value is fixed when the rule is generated, and the only values that change in training are these, the ones in the mass assignment function. That is something that could be improved in the model.

Ah, yes, this is the mass assignment function associated with this rule: the mass for the first class, the mass for the second class, the mass for the third class. The first class is the one we defined here, the setosa species; the second one is virginica and the other versicolor. It's like the table I showed before, the values for the first class, the second class, and here we have three classes so we have another row, and the value for the uncertainty. No, I mean, these values are optimized during the training process: in the beginning the masses are set to random values, or to a high uncertainty with low values for the rest, and during training they change and converge to the optimal values for the prediction.

It's like a probability, but it also tells you the uncertainty of the rule. For example, we may find a rule with high uncertainty, like this one or this one, and there the model is telling you that if a record has a sepal length between these two values, you cannot assign it a class, because that range doesn't give you any information; the uncertainty is too high. That's the meaning of these rules, and we can see that they are very interpretable: we can just go and read them and extract knowledge from them.
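As an illustration of the frequency-based splitting just described (again my sketch, not the library's code), generating the break points for one attribute could look like this:

import numpy as np

def single_rule_breaks(values, n_breaks=3):
    # Interior quantiles, so each of the resulting groups holds roughly
    # the same number of instances (frequency-based splitting).
    qs = np.linspace(0, 1, n_breaks + 2)[1:-1]
    return np.quantile(values, qs)

# 3 breaks on sepal length -> 4 intervals -> 4 rules of the form
# "sepal length < b1", "b1 <= sepal length < b2", ..., "sepal length >= b3"
breaks = single_rule_breaks(X_train[:, 0], n_breaks=3)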
Yes, that's the next step. We have these rules fitted with the optimal values, but we need to know whether the model is accurate, because if the model is not accurate then these rules don't apply to the task. So we predict on the testing set, which is what we usually do to evaluate performance, and we run the performance evaluation between the predictions on the testing set and its actual classes. Here we have some results: the overall accuracy is 0.94, which is high, not extremely high, and we have other metrics, for example the f1 score ranges from 0.92 to 0.97 across the classes, so in general the classification is good for this problem. You may point out that other models, gradient boosting for example, give you 99 percent; that's fine, but this model gives you the rules, the interpretability, and that's the win. You can also see the confusion matrix here: most of the records are on the diagonal, a few are misclassified. Ah, the support metric is the number of records in the testing set that belong to each class. For unbalanced datasets this is an important metric, because you may have few records of one class and many of another, and the support shows that imbalance; in this case it's very balanced. So the model is fine; not the most accurate, but good enough that we can trust the rules it produces.

And instead of looking through this long table, searching for the rules that matter most for each class, we have this other method, print most important rules, which gives you a report of them. For each class, take the first one, the setosa species: these are the top rules for predicting it. We have, for example, rule 6, which says the sepal width is between these values; the mass for the setosa singleton is very high, 0.96, and the uncertainty is very low, as you can see, so this is the most important rule for this class, and you can start making sense of these rules, checking them against what you know about the problem. Then the second most important rule is this one, and so on. If you look, these first four rules are the most important for this class; you could probably train a model with only these rules and it would perform similarly. You have the same for the next class, the virginica class, where the most important rule is rule 11, saying the petal length is greater than some value, followed by another high-mass, low-uncertainty rule, so you can inspect the characteristics of this group. And the same for the last one; here there is a rule with a mass of one for the singleton, which means that in the training set probably every record in this value range belongs to that class, so it converged to that. So you have a representation, a characterization, of each class through these rules, and you can tell a user, an expert depending on the problem, that for this outcome these are the most important rules or features we see. This is the power of the interpretability of the model.
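Continuing the earlier sketch, the evaluation cell is standard scikit-learn material; the name of the report method is as spoken in the lecture, so verify it against the repo:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # accuracy, per-class f1, support
print(confusion_matrix(y_test, y_pred))       # misclassifications off-diagonal

# Per-class report of the top rules, sorted by the gamma indicator.
model.print_most_important_rules()            # name as spoken; check the README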
And note that you don't have to compute the Dempster rule between all the rules, only between the rules that apply to the record. For example, if one record satisfies rule 10 but also satisfies, I don't know, rule 4, you combine those two, this mass assignment function with that one, and that gives you a new mass assignment function which is the combination of the two, and that is the reported outcome for this record.

Yeah, I mean, in the beginning the rules don't belong to any class; they carry masses for all the classes, as you can see here. These are the most important ones for this class, but they also have values for the first, the second and the third class. This is the picture after the training, when you know which rules are important for which class; in the beginning all the rules look the same, and the model figures out the important masses during training.

Oh yeah, well, from these two rules we can infer that this break is somewhat artificial: in general, if the sepal length is between these two values you can say the record is from this class; that's the conclusion from this view. But all these values are independent, I mean, each of them is a parameter of the model, symbolic values over which we compute the loss and make the gradient descent update, so it was a coincidence that they converged to the same value, though it can probably be explained by the artificial inclusion of an extra break in the middle.

So this is, by what we saw yesterday, an interpretability result, and a global one, because these rules apply to all instances. This is not like local interpretability, LIME, or the sensitivity analysis things we talked about yesterday, where an explanation applies only to one specific record; these rules apply to the whole dataset, and that's why this is global interpretability. But from global interpretability you can also derive local interpretability. For example, here we sample an instance from the dataset, one record with these values, and we can check the predicted class, which is the third one, and the actual class, which of course was the same. And we can see all the rules that apply to this particular instance, with the mass of each of them and the uncertainty, so you can see more clearly what the model does: this is the selected subset of rules that the record satisfies, with their mass values, and you combine them using the Dempster rule into one final mass assignment function, transform that into a probability, and that gives you this result. I mean, we have this record, and if you look, the sepal length is 5.6, so the rule saying the sepal length is between 5.2 and 5.7 is true for this record; here we list only the rules that are true for the record, with their mass values.

No, it can vary; in this case it was there, but it could be otherwise; for example, this one is very close to one but not exactly one, and the other one is 0.002. In the step before this output we compute the combination of all these rules, and that combination still has the uncertainty, but to match the prediction interface of most other computational programs we need to assign one value per class, and there is no room for an extra uncertainty value. That's why we transform the mass assignment function into a probability distribution; we have a procedure to do that, and that is what is reported. But we can extract the uncertainty; I don't have a handy method for it, but it is possible, it's something that is computed in the process.
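I don't know exactly which transformation the library applies here, but a standard way of turning a mass assignment into a probability distribution is the pignistic transform, which splits the uncertainty mass evenly among the classes; as a sketch:

def pignistic(masses):
    """[m_class_0, ..., m_class_{k-1}, m_uncertainty] -> probabilities.
    The uncertainty mass is shared equally among the k classes."""
    k = len(masses) - 1
    return [m + masses[k] / k for m in masses[:k]]

pignistic([0.7, 0.1, 0.2])  # -> [0.8, 0.2]; the explicit uncertainty is gone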
Yes, we could report that, and it would be very interesting, I mean, otherwise you miss information that is in the model and could be presented here. This probability distribution comes from the mass assignment function of the combination, which carries the uncertainty, so they are equivalent in the sense that one depends on the other; it's another representation, you just lose that explicit piece of information. Yeah, we miss it, although here the uncertainty is very small anyway; it's something we could report.

So that ends the first part; I think we are on time, right. This model, like I said in the beginning, is available here and you can use it; it's very simple, as you may have noticed, it's like using any other model, and you gain this interpretability result. So we encourage you to use it, and if you do, let me know, because it's always nice to have another person using it in a new scenario. I will somehow share this Colab with you, and the link to the repo, so you can try it at home.

Yes, this is something we will be talking about in the next lecture. Of these three methods, like I said before, this is the most naive one, the one that implements the procedure by following the steps literally; then you can apply some strategies to skip parts and make the computation faster, and these are newer versions of the same model. The first one is very limited, it only works for binary classification; then we added support for multi-class classification, and finally we added this commonality transformation, which is a technique to make the computation faster for this model. We will review this in detail in Thursday's presentation, where I will show you how I built this method, how you can build machine learning models like this, and we will talk about the complexity and how you can improve the implementation.

Yeah, we have another track on regression, but we don't have time for that. No, I use the PyTorch library, but at a very low level; we will also see this on Thursday. I use a method called automatic differentiation, I don't know if you know it, which is a technique to compute the gradient of an expression numerically while you evaluate it, and it's very helpful for computing derivatives with respect to a single variable. This is implemented at a very low level of the PyTorch library, and I used it for building the model.

No, that comes from the model itself, because, as Nelson said, you don't have to apply each rule the same number of times. You can, for example, have one rule that applies to only two records, the ones that are not missing in this column, and that's okay, the model supports that.
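As a quick preview of the automatic differentiation mechanism mentioned above, a toy example in PyTorch (not the model's code):

import torch

# A mass value stored as a tensor that tracks gradients.
m = torch.tensor(0.3, requires_grad=True)

# Some loss expression built from it, e.g. a squared error to a target.
loss = (m * 0.8 + 0.1 - 1.0) ** 2

loss.backward()  # autodiff computes d(loss)/d(m) through the expression
print(m.grad)    # analytically: 2 * (0.3*0.8 + 0.1 - 1.0) * 0.8 = -1.056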