Thank you for joining us today. This is the third class of the series, and for me it is the most exciting one, because it is about the real implementation of parametric models, or of models in general. This will be an introduction, because it is an extensive topic, but after this presentation you will have some insight into how to perform these tasks. Well, you know me: my name is Sergei Beniafele, and I'm a master's student at the University of Chile.

Today we will review the types of machine learning models, because there are different types, and we will focus on one of them. Then we will talk about automatic differentiation, which I mentioned in the previous presentations and which is the mathematical framework that makes these methods possible. We will talk about gradient descent, of course, and we will implement, or try to implement, two different parametric models. The first one is linear regression, the simplest parametric model, and the second is a simple version of the Dempster-Shafer classifier.

In machine learning, and in supervised learning in particular, we have two types of models, depending on how the model learns: parametric models and non-parametric models. Parametric models follow a mathematical model of the problem: the output is defined by a mathematical expression that combines the input values with other values that need to be adjusted, which are called parameters. These parameters are adjusted by looking at the training set and applying some technique to find the optimal values, for example minimizing the loss. Non-parametric models are models that use all the data, the training set, to build a data structure.
For example, decision trees are non-parametric: when you have your data set you produce the decision tree with a procedure, and it is a tree, a binary tree, so it doesn't have values that need to be adjusted. The same goes for k-nearest neighbors, where you keep the whole training set and look for the records nearest to a given record, so you don't have parameters there either. In this presentation we will talk about the first kind, the parametric models, because in a certain way they are the most flexible models. They have gained interest in recent years because of deep learning (all deep learning models are parametric models) and because we now have tools that make these models easier to produce than before.

To optimize the values of these parameters we need gradient descent, which is a procedure to change a value according to the derivative of some function, and to do that we need automatic differentiation, which is something I have been mentioning a lot.

Here is an example of automatic differentiation. (Sorry for the technical difficulties.) This is an expression tree for a mathematical expression. You can follow the tree down to the leaves and see how they are combined: this branch is x squared times y, then a plus, and on the other branch we have the exponential of x plus the logarithm of y. So the expression is f(x, y) = x^2 * y + (exp(x) + log(y)), and we would like to know its derivative with respect to both variables, x and y.

To do that, automatic differentiation builds this table, an expression table that names one sub-expression at a time. To build the table you go through the tree down to the leaves, like a depth-first process. If you go all the way down here, the first thing you find is x squared, and you call this e1. As you know the value of x, which is 2 in this example, you can compute it at this point, so you know this is 4. Since this is called e1, you can replace this whole subtree with e1, and then you find e1 times y, which is exactly the next expression in the tree. As you know the value of e1 and the value of y, you can compute the value of e2. In each step you only need one computation to build the table and the numerical value of the expression. We continue the process: the next leaf in the tree is the exponential of x, so you compute its value and assign it to e3. The next leaf is the logarithm of y, which is e4. Then you have e3 plus e4, which is e5, and then the final node, which adds e2 and e5 and gives e6. So this is the evaluation of an expression. We don't use any derivative here, but this table is the important thing in automatic differentiation.

What can you do with this table? This is the same table as before. You can compute the derivative with respect to one of the variables for every one of these expressions, following the derivative rules, the chain rule and the others. For example, if you take the derivative of e1 with respect to x, you know that e1 is x squared, so the derivative is 2x times dx/dx, which is 1; as the value of x is 2, you know this is 4. In the same way you can compute the derivative of e2 by applying the product rule, which says that it is the derivative of the first factor times the second one plus the first one times the derivative of the second one. You know all of these values: the derivative of e1 is exactly what we computed in the first row, which is 4; y is 3; e1 is 4; and dy/dx is 0. So when you compute this table you have all the values you need to substitute into the symbolic expression to get the numerical value. If you follow this procedure you will find the derivative of e6 with respect to x, which is the derivative of the whole expression, and the numerical value there is the one you are looking for. It is a very straightforward procedure. If you want to do this for y, you have to do the same again, differentiating with respect to y in every step.

This is called forward automatic differentiation, because you go from e1 to e6 in this order, the natural order. But there is another possibility, called backward automatic differentiation, where you start with the derivative of f, the whole expression, with respect to the intermediate expressions, and you go backwards: with respect to e6, e5, e4, and so on down to e1, and then with respect to x or y. Usually this is much simpler. The first one is always 1, because you are differentiating the whole expression with respect to itself. For the next one you have to look at where e5 appears in the table. It appears in e6, so we need the derivative of f with respect to e6 times the derivative of e6 with respect to e5. The first factor is 1, as we know from the last row of the table, and if you take e6, which is this formula, and differentiate it with respect to e5, that is also 1. So this is 1, and if you continue you may notice that most of the values are ones, until you reach an expression that multiplies by a variable, such as e2, where you find that the derivative with respect to e1 is y.

When you reach this step and you have all of this information, you can compute the derivative of f with respect to x using the same procedure. You look at where x appears in the table, which is in e1 and e3, so you compute the derivative of f with respect to e3 times e3 with respect to x, plus f with respect to e1 times e1 with respect to x. All of these values you know in advance: f with respect to e3 is 1; e3 with respect to x is e to the x; f with respect to e1 is 3; and e1 with respect to x is 2x. If you compute this you find the same answer as before, which is e squared plus 12.

For a human this may be more difficult, but for a machine it is easier, because many of the values you are working with are ones, and once you reach this point you only need one more step to get the derivative you are looking for. For example, if you need the derivative of f with respect to y, you don't need to do all of this again, because none of it depends on the variable; you only need the last step. So it is also faster. If we call the number of rows n and the number of parameters m, the forward algorithm is on the order of n times m, while the backward one is n plus m, so it is linear and cheaper to compute. That is backward automatic differentiation. I don't expect you to understand all the mathematics behind it, only to have some insight into what the models are doing, why this works, and why it is fast.
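The table-based procedure described above can be sketched in plain Python for the lecture's expression, x squared times y plus the exponential of x plus the logarithm of y, at x = 2 and y = 3. This is only an illustrative sketch: the variable names (`d1`..`d6` for the forward derivatives, `g1`..`g6` for the backward ones) are assumptions introduced here, and a real automatic differentiation library builds these rows for you.

```python
import math

x, y = 2.0, 3.0

# Evaluation table: one elementary operation per row.
e1 = x ** 2          # x squared
e2 = e1 * y          # x squared times y
e3 = math.exp(x)     # exponential of x
e4 = math.log(y)     # logarithm of y
e5 = e3 + e4
e6 = e2 + e5         # value of the whole expression f

# Forward mode with respect to x: differentiate each row in order,
# reusing the values already stored in the table.
dx, dy = 1.0, 0.0            # dx/dx and dy/dx
d1 = 2 * x * dx
d2 = d1 * y + e1 * dy        # product rule
d3 = e3 * dx
d4 = dy / y
d5 = d3 + d4
d6 = d2 + d5                 # df/dx

# Backward mode: derivative of f with respect to each row, last row first.
g6 = 1.0                     # df/de6
g5 = g6 * 1.0                # e6 = e2 + e5
g2 = g6 * 1.0
g4 = g5 * 1.0                # e5 = e3 + e4
g3 = g5 * 1.0
g1 = g2 * y                  # e2 = e1 * y

# One extra step per variable: sum over the rows where it appears.
df_dx = g1 * 2 * x + g3 * e3     # x appears in e1 and e3
df_dy = g2 * e1 + g4 / y         # y appears in e2 and e4

print(d6, df_dx, df_dy)
```

Both modes give the same derivative with respect to x, e squared plus 12. The backward pass reuses `g1`..`g6` for every variable, which is why adding a second variable costs only one more line.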
Having this, we have a procedure to compute derivatives of any expression, including the expressions we define for our models and their parameters. As I said before, we will be using gradient descent, and gradient descent is basically this formula, which states that the next value of a parameter (here x is a parameter) is the previous value minus a term, and this term is the derivative of a function, which we will call the loss, multiplied by a factor called the learning rate, alpha.

In this chart, which plots the loss function on the z-axis, you can see that if you start here and apply this formula several times you find your way to one of the minima of the function. Every time you apply the formula you should get a lower loss, which is what we expect. The problem is that this is very dependent on the alpha value. Here are three values of the learning rate. If we use a very low learning rate, the blue line, we see practically no difference in the loss: it is erratic but mostly constant, so the loss isn't decreasing. If we select a very high learning rate, the green line, we see this seemingly random behavior in the loss, where you can reach a very low loss in a few epochs, but after that the loss can grow and drop again. And if you find a learning rate that is just right, the orange line, you see the typical loss curve, which may go up sometimes but in general follows this decreasing pattern. The problem is how to know which learning rate to use, because, as you can see, if I select it wrong we may have a good model that is not learning, that is not converging to the optimal values.

For that we have what we call optimization methods. There are many of them, but the idea is to extend the gradient descent formula with other factors that reduce the risk of having a bad learning rate. For example, RMSprop, which stands for root mean square propagation, follows the same idea of updating the value from the previous value plus a term that depends on the derivative, but it also keeps these extra expressions, which depend on a running average of the squared gradient of the loss (not on a second derivative). It is one of the most reliable optimization methods, but you have to track an extra quantity for every parameter, so it is good but it costs more. Another one that follows the same idea is the Adam optimization method, which also keeps running averages of the gradient. Here we have other parameters to control the convergence of the model, and sometimes this is the way to go. But for our examples we will stay with naive gradient descent for now.

This class will not be only theory; we will also have an implementation. To do automatic differentiation we have many alternatives. I use the torch library, provided by the PyTorch team, which is one of the main machine learning frameworks for building models. You have probably used it to build neural networks, but at the low level torch has automatic differentiation implemented. Just to see how to use it, take a simple expression, q = a squared plus 3b, where a and b are parameters, and suppose we would like to know dq/da and dq/db for a equal to 2 and b equal to 1. To do that in torch we need to define tensors. A tensor is like a multi-dimensional matrix, but it can also be a single value, with zero dimensions. So we define these tensors and assign the values we are interested in, 2 and 1, and if you pass this flag, the requires_grad parameter, it triggers all the automatic differentiation machinery. If we look at them, we have a tensor with 2 and a tensor with 1, both with gradient computation turned on. Then we define our expression, a squared plus 3b, and if we evaluate it we see that it is 7, of course: substituting the values gives 4 plus 3. The interesting thing is that this tensor also carries gradient information as its second field, but instead of the requires_grad flag that the leaves, the variables, have, it has a gradient function. The gradient function tells this tensor how to compute the derivative backwards, using the second alternative we saw. So we can compute the derivative of q using the backward method. We need to pass a direction, but here it is just 1. After that, the values of the derivatives are stored in the grad property of the leaf nodes. If we look at a.grad and b.grad we find 4 and 3. If we compute the derivatives by hand, dq/da is 2a, and a is 2, so this is 4, which matches, and dq/db is just 3, which also matches. We can compare them, this is the comparison, and they agree.

So this is a basic example showing that we can compute derivatives very easily with automatic differentiation in torch. You only need to define which variables are the leaves, that is, the parameters with respect to which you want the derivatives, and then you call the backward method, and all the differentiation happens behind the scenes. This could be any expression, and you will get the derivative with respect to those values with no problem.

How can we use exactly this to create parametric models? This is the second part: we will implement a linear regression using this automatic differentiation approach. For that we define a toy data set: x ranges from -10 to 10, and y is the line 2x minus 1, to which we add some random noise. After that we have these points, which clearly follow a line. In linear regression the expression is that the output is the input multiplied by some constant plus a bias, y = a x + b, where a and b are the numbers we don't know, the parameters of the model. We will start with random values, or fixed values, and apply the procedure: differentiate the loss, apply gradient descent, and update the values to see whether they converge to the optimal values. We do that in the same way as before: we define a and b as two tensors that require gradient computation (we can start with a fixed value of 1 or a random value, it doesn't matter) and we compute the expression for the linear regression.

The next thing we need is the loss. The loss measures how much our model's output differs from the real output. There are many loss functions, but one of the most used is the mean squared error, defined as the squared distance between the output and the real value, averaged over all points. This is the formula: you compute the difference between your prediction and the real value, square it, and take the mean. To be clearer about what y and yr are: y is this array, the values plotted on the y axis of this chart.
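The toy data set, the linear model, and the mean squared error described above, together with the gradient descent update from earlier, can be sketched in plain Python. This is a minimal sketch and not the torch code from the lecture: the gradients are written out by hand instead of coming from automatic differentiation, and the seed, noise level, learning rate, and number of steps are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (assumption)

# Toy data: points scattered around the line y = 2x - 1.
xs = [float(x) for x in range(-10, 11)]
ys = [2.0 * x - 1.0 + random.gauss(0.0, 0.5) for x in xs]

def predict(a, b):
    """Linear model: output = a * input + b."""
    return [a * x + b for x in xs]

def mse(pred):
    """Mean squared error between predictions and the real values."""
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)

def gradients(a, b):
    """Hand-derived derivatives of the MSE with respect to a and b."""
    n = len(xs)
    pred = predict(a, b)
    grad_a = sum(2.0 * (p - y) * x for p, y, x in zip(pred, ys, xs)) / n
    grad_b = sum(2.0 * (p - y) for p, y in zip(pred, ys)) / n
    return grad_a, grad_b

a, b = 1.0, 1.0          # fixed starting values, as in the lecture
learning_rate = 0.01     # assumed value, not taken from the lecture
initial_loss = mse(predict(a, b))

for _ in range(500):
    grad_a, grad_b = gradients(a, b)
    # Gradient descent: new value = old value - learning rate * gradient.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

final_loss = mse(predict(a, b))
print(a, b, initial_loss, final_loss)
```

With these settings the parameters should end up near the values that generated the data, a close to 2 and b close to -1. A much larger learning rate would make the updates for `a` overshoot and diverge on this data.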
And yr is our linear regression, this expression right here, which is also an array but with other values, because we have these fixed numbers that are not right yet. So we compute the loss by taking the difference between these two arrays, squaring it, and taking the mean, and this outputs a single value. Right now the loss is 39, which is high, and we would like to decrease it as much as possible. The interesting thing is that, since we started from variables that are leaves requiring gradient, and applied one mathematical expression and then another, the gradient information is kept. This tensor not only has this value; it also knows how to compute the derivative of this expression with respect to a and b. That is exactly what we do in the next cell: just as in the previous example we computed the gradient of q, we call backward on the loss, and then in a.grad and b.grad we have these two tensors. What they tell us is that, from the current values, a equal to 1 and b equal to 1, if you follow gradient descent you need to increase a, because its gradient is negative, and decrease b, to move toward better values for the model.

That is what we do here. We use a.grad and b.grad, together with a learning rate, to compute the new values of a and b: the new value is the old value minus the learning rate times the gradient, which is the gradient descent formula. We do this for both parameters, and after that we have new values: a is 1.06 and b is 0.99, a little different from the ones we started with, and they should be better, meaning that the loss should be lower. If we compute the linear regression and the loss again with the new values, we end up with 35. We started with 39, so we decreased the loss by four points with just one iteration of gradient descent.

We can do this again: we compute the gradients, which are different this time, and we can repeat this many times. Here is a for loop that does the previous steps 200 times; maybe we can start with 20, just to print the loss. We apply the same procedure several times, the loss is printed here, and it starts to decrease. After all of this we have new values for a and b. If we plot them, the blue dots are the points of the data set and the orange line is our linear regression, and it starts to resemble the shape of the data set, but not quite: the best line would start maybe here and go there. That's because the values have not converged yet; as you can see, the loss is still decreasing. So we can run this 200 times, and another 200 times if you like, and then this is a better line to approximate the data. You can repeat this many times, and you will find one of the most accurate lines for it. The final values of a and b, 1.91 and -1.03, are very close to the ones that generated the data in the first place: we would expect a to be 2 and b to be -1. So using this procedure, which is the same one I showed you in the presentation, we can actually find the optimal values of a model. This is a very simple model with two parameters, but we can find them, and that is the idea of this whole procedure.

Now that you know how to define parametric models using these parameters and a function, how to compute the loss, and how to apply gradient descent, we can move on to the implementation of the Dempster-Shafer classifier. To do that we need to review some terms of Dempster-Shafer theory. The first one is the mass assignment function. A mass assignment function goes from the power set of the outcomes to the interval [0, 1]; the mass of the empty set is zero, and the sum over all subsets must be one. This is what we saw in the first presentation, or in the second one, I guess.

What is new is the pignistic transformation. I said in one of the lectures that a mass assignment function can be transformed into a probability distribution, and there are many different transformations. One of them is the pignistic transformation, which is this formula right here: you take the mass of a subset and divide it by the size of the subset, the number of items it has, and you sum this over all the subsets that contain a given outcome, and that gives you the probability distribution.

The final thing we need to define is Dempster's rule. This is the combination rule that I mentioned, which allows you to combine two different mass assignment functions. If you have masses m1 and m2, you combine them using this formula. First you calculate the constant K: you take all the pairs of subsets whose intersection is empty, multiply their masses, and sum all of these products; this gives your conflict coefficient K. When you have K you can compute Dempster's rule, which multiplies the masses of all pairs of subsets whose intersection is the set you are interested in. The formula may seem a little long to explain, but in our case what it does is this: you multiply the certainty of one mass with the certainty of the other, or you multiply the certainty of m1 with the uncertainty of m2, and the other way around, and you sum all of these terms.
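For reference, the three definitions reviewed above can be written out as follows. The notation is an assumption, since the slides are not part of this transcript: \Omega stands for the set of outcomes, BetP for the pignistic probability, and \oplus for the combination of two mass assignment functions.

```latex
% Mass assignment function over the set of outcomes \Omega
m : 2^{\Omega} \to [0, 1], \qquad m(\emptyset) = 0, \qquad
\sum_{A \subseteq \Omega} m(A) = 1

% Pignistic transformation: share the mass of each subset equally
% among the outcomes it contains
\mathrm{BetP}(\omega) = \sum_{A \subseteq \Omega,\; \omega \in A} \frac{m(A)}{|A|}

% Dempster's rule: conflict coefficient K, then the normalized combination
K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)

(m_1 \oplus m_2)(A) = \frac{1}{1 - K} \sum_{B \cap C = A} m_1(B)\, m_2(C),
\qquad A \neq \emptyset
```

When every mass sits either on a single class or on the whole set \Omega (the uncertainty), the only pairs that intersect in a class are certainty with certainty of the same class and certainty with uncertainty, which is the simplification described in the lecture.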
That is the idea behind Dempster's rule, and we can put these things into our implementation. The classifier implementation we will review here is a very simple one. It is not the one implemented in the library, but it shows you the first steps. The first step is to decide how to represent a mass assignment function. A mass assignment function will be an array: if we have n different options, the array will have n plus one entries; the first n are the masses of the singletons, the certainty we have for each value, and the last one is the uncertainty.

Here we have different functions that let us do some computations. The first one creates a random mass assignment function of length k with some uncertainty. As you can see, it simply generates random values for the whole array except the last entry, which is fixed to the uncertainty value we pass, and it outputs this as a tensor with the gradient turned on, as in the previous example. For example, if we call this function it outputs a tensor of length four, with random values for the first three entries, which are the classes: this is the certainty for class one, this is the certainty for class two, this is the certainty for class three, and the last value is the uncertainty. It is just an array with the gradient on. If we use a higher value of uncertainty, the other values are lower. This lets us create a random mass assignment function, which will be the starting point for the model.

The next function is the Dempster rule, which computes the same formula I presented in the slides. As I said before, you only need to multiply the certainties of the two masses, the uncertainty of the first with the certainty of the second, and the uncertainty of the second with the certainty of the first. This computation applies to all entries except the last one, the uncertainty, which you have to divide by a constant, in this case three. And there you have it: you don't need to compute the K expression explicitly, because you can normalize by computing the sum of all the values and dividing by it, which is easier than computing K. So this is the Dempster rule. For example, we can create another random mass function and then compute the Dempster rule: here we have one mass assignment function created by the random function, another one, and then the combination of the two. The combination is these values; the certainty has increased, but I forgot to normalize them. Yes, this is the value. So we have implemented the Dempster rule using its mathematical definition.

The next function is the chain Dempster rule, which takes a list of masses and computes the Dempster rule over all of them. Instead of doing this one by one, we can for example create three masses, put them into an array, and combine the three with one function call. The next one is the probability transformation, the other formula I showed you, and it is implemented here: it is the masses of all entries except the uncertainty, divided by this constant, and if you do the computation you will find that this is the constant you have to divide by. Very simple. There is another function, for one-hot arrays, which we will review in a moment. We can apply the transformation, for example, to the result of combining these three masses: we compute the probability transformation and now we have one value for each class, which is a probability distribution over the classes. We no longer have the uncertainty here, but this is what the model needs in order to classify.

So this is all the Dempster-Shafer theory we need, and we can start using it in a real example. We will use the same data set as before, the Iris data set, but in this case we will do everything step by step. The first thing our model needs, if you remember, is a rule set, which is a collection of masses, each associated with some condition. For this simple example we create eight random mass assignment functions, so the rule set is these eight mass assignment functions put in an array. And we have this function that selects the mass assignment functions that apply to a record. If you remember the model, you need to define a function that, given a record, outputs the subset of the rule set with just the rules that apply to that record, the ones that are true for it. In this case the function is also very simple: we fix the condition of each rule to be lower or greater than the mean of a column. Our data set has four attributes, and for each attribute we have two rules: if the value is less than the mean of the column we use one rule, and if it is greater we use the other. That's why we have eight mass assignment functions: we need two per attribute. What the function does is iterate over the four attributes and check whether the value of the record in this attribute is lower than the mean of the column. If this is true, we append the rule at position 2i of the rule set, and if not, we append the rule at position 2i plus 1. So imagine the eight rules laid out like this: for each record we select four of the eight, depending on its values.

Let me show you. Here we have one sample, with some values for the attributes. If we apply select rules to this sample, what we get is a subset of the rule set: this is the full rule set, and this is the subset of rules that apply to this record. If we change to another record with other values, we should get a different subset. So this is the first step of the model. We have defined a rule set of eight rules, and we have defined this function to select the subset of rules. The next step, once we have the subset, is to combine it with Dempster's rule, and we already have a method for that, the chain Dempster rule. If we apply the chain Dempster rule to the selection for a record, then instead of the subset of the rule set we get the combination of that subset: one mass assignment function with some values. The next step is to find the probability distribution, and we also have a method for that, the probability transformation. If we apply it we finally get a probability distribution for this record. So we start with a record and we produce an output, and we can compute the arg max, the position of the maximum, and say that this record is from class three.

Now we have our prediction process. It is obviously very bad, because these are random values: as in the first iteration of the linear regression, they are values that haven't learned anything about the data. But we can do the learning process now. We split the data set into training and testing sets, the same as before, and here we have the training process, which does the same as before. Let's start at this line: we compute what we did in the last cell, the prediction, which is the probability transformation of the chain Dempster rule of the selection for the record. This gives us an array with the probability distribution. Then we compute the one-hot encoding of the real class: if y is 1, meaning the record belongs to class one, one-hot gives you an array with a one in the first position and zeros in the other two; if y is 2 it gives 0 1 0, and if it is 3 it gives 0 0 1. Having these two, we compute the loss with the same formula as before: the difference between the prediction and the real value, squared, and then the mean. And we can call backward on this loss. The only difference here is that we are using Adam optimization instead of naive gradient descent. For that we use the Adam implementation that torch provides, and we need to tell it which are our parameters, in our case the rule set, and the initial learning rate. With the step method it updates the values of all the parameters. We repeat this procedure for all the records in the training set, which is the inner for loop, and we repeat that several times.
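The prediction pipeline described above (select the rules for a record, combine them with Dempster's rule, transform the result to probabilities, take the arg max) can be sketched in plain Python. This is a hedged sketch rather than the lecture's torch code or the library's implementation: all function names, the rescaling inside `create_random_mass`, the column means, and the use of the pignistic formula for `prob_transform` are assumptions made here, and training is left out because it needs automatic differentiation.

```python
import random

random.seed(1)  # fixed seed for repeatability (assumption)

def create_random_mass(k, uncertainty=0.8):
    """k random singleton masses plus a fixed uncertainty as the last entry.
    Rescaling so the entries sum to one is an assumption of this sketch."""
    values = [random.random() for _ in range(k)]
    scale = (1.0 - uncertainty) / sum(values)
    return [v * scale for v in values] + [uncertainty]

def dempster_rule(m1, m2):
    """Dempster's rule when mass sits only on singletons and on the full set."""
    n = len(m1) - 1
    u1, u2 = m1[-1], m2[-1]
    combined = [m1[i] * m2[i] + m1[i] * u2 + u1 * m2[i] for i in range(n)]
    combined.append(u1 * u2)
    total = sum(combined)            # equals 1 - K, so K is never computed
    return [v / total for v in combined]

def chain_dempster_rule(masses):
    """Combine a list of masses one after another."""
    result = masses[0]
    for m in masses[1:]:
        result = dempster_rule(result, m)
    return result

def prob_transform(m):
    """Pignistic transformation: share the uncertainty equally among classes."""
    n = len(m) - 1
    return [m[i] + m[-1] / n for i in range(n)]

def select_rules(record, rule_set, column_means):
    """Two rules per attribute: 2i if below the column mean, else 2i + 1."""
    selected = []
    for i, value in enumerate(record):
        index = 2 * i if value < column_means[i] else 2 * i + 1
        selected.append(rule_set[index])
    return selected

rule_set = [create_random_mass(3) for _ in range(8)]   # 4 attributes x 2 rules
column_means = [5.8, 3.0, 3.8, 1.2]    # illustrative values, not computed here
record = [5.1, 3.5, 1.4, 0.2]          # one example record with 4 attributes

probs = prob_transform(chain_dempster_rule(select_rules(record, rule_set, column_means)))
predicted = probs.index(max(probs))    # 0-based class index
print(probs, predicted)
```

Because the masses are random, the predicted class carries no information until the rule set is trained; the sketch only shows that a record goes in and a probability distribution comes out.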
which is the outer for, which is 50 iterations. If we do this, we can see that the loss is decreasing over the iterations and it reaches some final value here. As you can see, it's very slow, also because of the implementation. And if we see the new rule set, it's very different from the rule set before: the values have changed to new values that may be optimal for this task.
We can test this. We can, for example, make the predictions for all the testing records, which I'm putting into an array, so here we have the predicted classes for all of the testing set, and we can use any metric to report on the classifier. So this very simple, naive implementation reaches an accuracy of 77. Not so bad for doing it in 10 minutes and by hand, from scratch. So not so bad, not so good; there is a lot of room for improvement here, but we have built a new model. This is not an artificial neural network, this is not something that you import and use; it doesn't even have a fit and predict method. All was done by hand.
So this is an introduction to how the model works. Under the hood it is very similar to this, but it handles many edge cases and has other tricks to improve the performance. For example, this loops a lot, and we can make the computation faster. But this is basically it. It is an introduction to what you can do. We put here the Dempster-Shafer definitions and rules and all the stuff, but you can bring another mathematical framework or another mathematical modeling for a problem; you can set up parameters, create your own model and train it using this methodology. It will be very, very similar. So this is today's presentation.
Yeah, that's a great question, and this is because, if you see, there are negative values here. This is because when we are optimizing these values in this loop, we are not forcing them to be positive or to be
anything, or to have any restrictions, so the optimizer said that the optimal value is this minus zero zero five. The optimizer obviously doesn't know anything about the Dempster-Shafer restrictions, and we have to apply them manually. So after this step, for example, we probably need to check whether the constraints hold or not, and fix them if needed, by hand. This is something that I didn't put here, but the real implementation has this constraint fixing. I think it's better to do it in the intermediate steps, because all of these computations and results make sense only if the constraints are met. If we don't have this, you can probably have some sort of problem; for example, the probability distribution after all of these computations may also be negative, and that's a problem because we know that probabilities can't be negative. So it's better to fix these values in the in-between steps.
Yes, but after this step you can enforce these two restrictions: the values have to be positive and they have to add up to one. You have to update both in the same process. I don't know if I answered your question. Maybe we can see the real implementation. This is the real implementation, by the way, and you can check it. These are the same, very similar: select rules and all the stuff. And here is the normalize method. What the normalize method does is clamp the values between zero and one, and after that, if the sum is less than one, it divides by the sum, and if it's not less than one, which is greater than or equal, it will divide by the sum. We repeat this for all parameters, and after this all of the conditions must be true.
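The constraint fixing described in the answer can be illustrated with a small sketch. This is not the real implementation shown in the talk, only a minimal version of the two steps that were named: clamp every value between zero and one, then divide by the sum. The plain-list representation of a rule, the sample values, and the uniform fallback for an all-zero rule are assumptions made for the example.

```python
def normalize(masses):
    """Clamp each mass to [0, 1], then rescale so the masses add up to one.

    `masses` is a plain list of floats standing in for one rule
    (an assumption; the talk stores rules as optimizer parameters).
    """
    # Step 1: clamp, so a value such as -0.05 becomes 0.0.
    clamped = [min(max(m, 0.0), 1.0) for m in masses]

    # Step 2: divide by the sum so the values add up to one.
    total = sum(clamped)
    if total == 0.0:
        # Assumption: if every value was clamped to zero, fall back to a
        # uniform assignment to avoid dividing by zero.
        return [1.0 / len(clamped)] * len(clamped)
    return [m / total for m in clamped]


# Hypothetical rule after an optimizer step, with one negative value.
rule = [0.40, -0.05, 0.30, 0.50]
fixed = normalize(rule)
print(fixed)
```

Calling a function like this on every rule right after each optimizer step would keep the intermediate computations inside the constraints, which is the ordering recommended in the answer.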