Namaste. In the last session we studied what a machine learning model looks like, and in the session before that we studied what the input data looks like. Now that we know what the training data in machine learning is and what the model looks like, we will change gears to understand how to estimate the parameters of the model. Let us begin. We looked at the linear regression model, which has the form b plus w1 x1 plus w2 x2, all the way up to wm xm. I can compactly represent this as b plus the sum over i equal to 1 to m of wi xi, and we normally denote it h_w(x). So this is the regression model. We also looked at logistic regression. What does the logistic regression model look like? It predicts the probability that y equals 1 given x as 1 over 1 plus exponential of minus z, where z is nothing but a linear combination of the features and their parameters. And in the case of a neural network model, we learn some complex function which also has lots of parameters. Now that we have this model and we have training data, our job is to come up with values for these parameters. So let us try to build an intuition for this based on a very simple example: linear regression with a single variable. Say we have a single feature x1 on one axis and the output value, or label, y on the other, and we have some points plotted. We want to fit a line, since a linear regression represents a line. Let us say we somehow got a line which passes through the origin. The general equation of a line is y equals b plus w1 x1, and here b is 0 because the line passes through the origin, so the equation of this line is y equals w1 x1. Now our job is to estimate the value of w1, and if I use a different value of w1 it will result in a different line.
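To make the two models concrete, here is a minimal sketch in Python. The function names and the example numbers are my own, chosen purely for illustration; the formulas are the ones from the lecture, h_w(x) = b + w.x for linear regression and 1 / (1 + e^(-z)) for logistic regression.

```python
import numpy as np

def linear_regression(x, w, b):
    """Linear regression prediction: h(x) = b + sum_i w_i * x_i."""
    return b + np.dot(w, x)

def logistic_regression(x, w, b):
    """Logistic regression: P(y = 1 | x) = 1 / (1 + exp(-z)),
    where z is the same linear combination b + w.x."""
    z = b + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 3.0])    # one example with two features
w = np.array([0.5, -1.0])   # arbitrary parameter values
b = 1.0

print(linear_regression(x, w, b))    # 1 + (0.5*2) + (-1*3) = -1.0
print(logistic_regression(x, w, b))  # sigmoid of z = -1, a value in (0, 1)
```

Note that both models share the same linear combination z; logistic regression simply squashes it through the sigmoid so the output can be read as a probability.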
So here we have a line which passes through the origin with some slope. If I change the value of the slope, I can draw some other line from the origin, or yet another line, and so on. Just by changing the value of w1 I can get different models, or different functions. Which of these functions is the most appropriate for our purpose is the central question in front of us. We have the training data, we have fixed up our model, and now the task is to estimate the model parameters. One of the tools we use to estimate model parameters is called a loss function. We normally denote the loss function by J, and the loss function is a function of the parameter values: depending on what parameters we choose, we get a particular model, and because of that model we incur some loss. Let us see what loss means in the context of linear regression. For a particular value of x1 there is a true value y, but if we use, say, the red line as our model, the model gives a different, predicted value. So at this value of x1 we have an actual value and a predicted value, and we incur some error: the error is the difference between the actual value and the predicted value. In the same manner we incur some error at every point, some larger, some negligible, if we use the red line as the model. If we use some other line as the model, say the orange line, then there will be different errors at the same points. How do we measure this loss cumulatively? We find the loss at every point and sum it across all the points. Let us write down mathematically how we do it for a single point: we look at the prediction at that point minus the actual value of y.
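The idea that different slopes give different total errors can be seen on a tiny made-up dataset. The numbers below are illustrative only, chosen so the points lie roughly on y = 2*x1; the sum-of-squared-errors formula is the one the lecture builds toward.

```python
import numpy as np

# Toy data: points scattered near the line y = 2*x1 through the origin.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
y  = np.array([2.1, 3.9, 6.2, 7.8])

def total_squared_error(w1):
    """Cumulative loss for the model y_hat = w1 * x1:
    sum over all points of (predicted - actual)^2."""
    y_hat = w1 * x1
    return np.sum((y_hat - y) ** 2)

# Each candidate slope is a different model; each incurs a different loss.
for w1 in [0.5, 1.0, 2.0, 3.0]:
    print(f"w1 = {w1}: total squared error = {total_squared_error(w1):.2f}")
```

Running this shows the loss is smallest near w1 = 2, the slope that best matches the data, which is exactly the intuition behind choosing parameters that minimize the loss.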
And since we do not care about the sign, we square the difference so that positive and negative errors are measured in the same manner. So the loss at a single point is the actual value minus the predicted value, squared. If the equation looks a bit scary, just read it as: this is the actual value, this is the predicted value, and we square the difference. We then sum this loss across all n points, and we add a factor of one half as a mathematical convenience. Is this clear to everyone? What we are doing here is calculating the loss at every individual point and then summing the loss across all the points. This is the total loss that we incur because we chose the parameters w and b: because of w and b we get a model, and because of the model we incur some loss. That is the relationship. If we expand this for single-variable linear regression, the term for the ith point is nothing but (b plus w1 x1 minus yi), the whole thing squared. So it is now pretty obvious that J, the loss function, is indeed a function of the parameter values. The loss function is the central piece that helps us identify model parameters such that the loss is minimized: we try to identify parameters in such a way that we minimize our loss, and we will see how to minimize the loss in the next section. So this is the loss we compute for linear regression. Let us now see how we can formulate a loss function for classification problems. In a classification problem we have two labels, so let us prepare a table of the actual label y and the predicted label y hat. If the actual value is 1 and we predict 0, there is an error, and vice versa: if the actual value is 0 and we predict 1, that is an error. If the actual value is 1 and we predict 1, that is fine, and likewise for 0 and 0. So there are four combinations in total, and two of them are error situations.
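The full loss function described above, with the one-half factor, can be written as a short function. This is a sketch; the function name and the toy data are my own, while the formula J(w1, b) = 1/2 * sum of (b + w1*x1 - y)^2 is exactly the one from the lecture.

```python
import numpy as np

def loss_J(w1, b, x1, y):
    """J(w1, b) = 1/2 * sum_i (b + w1*x1_i - y_i)^2
    for single-variable linear regression."""
    residuals = (b + w1 * x1) - y   # predicted minus actual, per point
    return 0.5 * np.sum(residuals ** 2)

x1 = np.array([1.0, 2.0, 3.0])
y  = np.array([2.0, 4.0, 6.0])   # lies exactly on y = 2*x1

print(loss_J(2.0, 0.0, x1, y))   # perfect fit, so the loss is 0.0
print(loss_J(1.0, 0.0, x1, y))   # 0.5 * (1 + 4 + 9) = 7.0
```

The one-half factor does not change which parameters minimize J; it is there only so that, when we later differentiate J, the 2 from the square cancels cleanly.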
There is an error in two of the combinations and no error in the other two. So let us try to develop an intuition for the loss in the case of classification. If the actual value of y is 0 and we predict something close to 1, we want to give a very large penalty, and if the actual value of y is 1 and we predict something close to 0, we want to give a very large penalty as well, so we have a very similar curve in both cases. We write this mathematically as follows: if y equals 1, we use the term minus y log p, where p is the predicted probability, and if y equals 0, we use the term minus (1 minus y) log (1 minus p). Combining them gives us what is called the cross entropy loss, written as minus y log p minus (1 minus y) log (1 minus p). Let us try to understand what happens when y equals 1: since (1 minus y) becomes 0, the second term vanishes, and we are left with only minus log p. For y equals 0, the first term becomes 0 and we are left with only the second term, minus log (1 minus p). So this is a clever representation of the two losses in a single equation. This is the cross entropy loss that we try to minimize while solving classification problems. Cross entropy loss in this form is used specifically for binary classification problems. For a multi-class classification problem we use the categorical cross entropy loss, and if we represent our labels as integers we use what is called the sparse categorical cross entropy loss. These are the loss functions used for classification tasks, whether with logistic regression or with neural network models. So, having defined a loss function, we know how to measure the error once we fix the parameters of the model. Now our job is to find the optimal values of the parameters such that the loss function is minimized.
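The behaviour of the cross entropy loss can be checked numerically with a few lines. The function name is my own; the formula L = -y log(p) - (1 - y) log(1 - p) is the one derived above.

```python
import math

def binary_cross_entropy(y, p):
    """Cross entropy loss for one example with true label y in {0, 1}
    and predicted probability p in (0, 1)."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Confident, correct predictions incur a tiny loss...
print(binary_cross_entropy(1, 0.99))  # only -log(0.99) survives, about 0.01
print(binary_cross_entropy(0, 0.01))  # only -log(0.99) survives, about 0.01

# ...while confident but wrong predictions incur a very large penalty.
print(binary_cross_entropy(1, 0.01))  # -log(0.01), about 4.6
print(binary_cross_entropy(0, 0.99))  # -log(0.01), about 4.6
```

Note how exactly one of the two terms survives for each label, matching the case analysis for y = 1 and y = 0 above, and how the penalty grows without bound as the prediction approaches the wrong extreme.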
How do we actually solve this problem? This is where optimization techniques, or optimization algorithms, help us tackle it.