Namaste, welcome to the next session of our course Practical Machine Learning with TensorFlow 2.0. In this session, we will study optimization algorithms and their role in machine learning. Specifically, we will study an optimization algorithm called gradient descent and its variations, mainly mini-batch gradient descent and stochastic gradient descent. Earlier we studied the loss function of a machine learning model, for which we used the symbol J(W, B). The loss function is parameterized by W and B, which are the parameters of our model. Just to remind you, the model we considered for linear regression was as follows: the output y equals b, which is the bias term, plus w1 x1 plus w2 x2, all the way up to wM xM. Here we have training data with M features, and we have a model with M + 1 parameters; the list of parameters is b, w1, w2, all the way up to wM. We will use W as a vector to represent all these M weights. That is why we say the loss is a function of the parameters. In the case of linear regression, we calculated the loss at each point as the difference between the predicted value on the ith input and the actual value, squared; the superscript i indicates that these are the ith input and the ith label. Geometrically, with M + 1 parameters, the loss function can be pictured with the loss on one axis and the parameters b, w1, w2, w3, and so on up to wM on the remaining axes, so the loss is a surface in (M + 2)-dimensional space.
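The model and the squared loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the course; the function names and the toy data are my own, chosen so that one candidate W gives a perfect fit and hence zero loss.

```python
import numpy as np

def predict(b, w, x):
    # Linear model: y = b + w1*x1 + ... + wM*xM
    return b + np.dot(w, x)

def squared_loss(b, w, X, y):
    # Sum over all n points of (predicted - actual)^2
    y_hat = b + X @ w          # predictions for every data point at once
    return np.sum((y_hat - y) ** 2)

# Toy data: n = 3 points, M = 2 features, labels generated with b = 0, w = [1, 1]
X = np.array([[1.0, 2.0], [2.0, 0.0], [0.0, 1.0]])
y = np.array([3.0, 2.0, 1.0])
print(squared_loss(0.0, np.array([1.0, 1.0]), X, y))  # perfect fit -> 0.0
```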
So, it will be some kind of hyper-bowl shape. In order to understand this, we will take a simplified model and try to visualize the loss function in 2D space. Let us take a simplified model where there is a single feature, and our data is of the following form: the data D consists of pairs of a feature and a label, and we have n such data points. So we can compactly represent D as n ordered pairs of features and labels, where each data point has a single feature x1 and the label y is a real number. This is a regression problem, and the model we use is y = b + w1 x1. So the loss function J(b, w1) is one half times the sum, over all points, of the square of the difference between the predicted value and the actual value. Geometrically, this loss function looks something like this: we have J(b, w1), the loss, on the vertical axis, the weight w1 on one axis and the bias b on the other axis. Since this is a quadratic function, it will be a bowl-shaped surface. What we have to do is find the optimal point on this bowl. Let us spend a couple of minutes to understand the duality between the loss space and the model space. We have a model with input x1, and we want to predict the value y; let us show some data points here. Each point on this surface corresponds to a line: if we take this particular point, it is represented by two numbers, w1 and b, and for that w1 and b there is a specific loss, while our model, which also has the two parameters w1 and b, takes a specific form. So for this specific value of w1 and b, we get this particular model.
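The single-feature loss J(b, w1) described above is easy to write down directly. This is a small sketch of my own, not from the lecture; it keeps the 1/2 factor in front of the sum of squared differences exactly as in the definition above.

```python
def loss(b, w1, xs, ys):
    # J(b, w1) = 1/2 * sum over points of (predicted - actual)^2
    return 0.5 * sum((b + w1 * x - y) ** 2 for x, y in zip(xs, ys))

# Toy data generated with b = 0, w1 = 2
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(loss(0.0, 2.0, xs, ys))  # perfect fit -> 0.0
print(loss(0.0, 1.0, xs, ys))  # a worse line incurs a positive loss
```

Evaluating `loss` at different (b, w1) pairs is exactly the duality described above: each pair is one point on the bowl-shaped surface and, at the same time, one candidate line in the model space.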
If we select another point on this surface, say this particular point, it will represent some other line; let us call it point 2, with parameters w1,2 and b2, while the earlier one was point 1. If we choose yet another point, it might represent some other line; call it point number 3, which is the line with parameters w1,3 and b3. For each of these points there is a loss, and just to recall what the loss is: for the third line, at each data point the loss involves the difference between the actual value and the predicted value, squared. We find all these squared differences and sum them up, and that sum is the loss corresponding to w1,3 and b3, which is some number on the loss axis. It is very important to understand this duality between the loss space and the model space: a point in the loss space represents a model, and we want the model that gives us the minimum possible loss. Is our objective clear? Our objective is to find a model, or model parameters, such that the loss incurred by those parameters is minimized. Of course, you might wonder whether you can try some kind of brute-force approach, exploring this space to find the point that gives the minimum value of the loss. But this is not efficient. If we take a loss function with M parameters where M is very large, and large M is routine in day-to-day machine learning problems, then brute force is practically impossible. So we cannot really do brute force.
So, we have to do something more intelligent. The way we are phrasing this problem, we are saying that we want to find the values of the parameters such that the loss function is minimized; this is a minimization problem: find W and b, or find the parameters, such that the loss function is minimized. Let us see how to do that. We will first develop the intuition and then get into the details of our first optimization algorithm, gradient descent, which is the workhorse of machine learning; we will see that in a minute after understanding the intuition behind it. So what is our learning problem? The learning problem is: find W and b such that the loss, which we will denote by J, is minimized. To understand how we can do this, I will again consider our linear regression model y = b + w1 x1, and in order to keep the expression simple we will assume that b = 0. So we get a very simple model, y = w1 x1, and the optimization problem becomes: find w1 such that J(w1) is minimized. There is exactly one parameter here because we have already set b to 0. Let us first visualize how the loss function looks. The loss function is parameterized by the value of w1: for each value of w1 we get some loss, and since we have squared error as the loss function, you can see that it is a bowl-shaped function. In the language of mathematics, such a function is called a convex function. So essentially, for each value of w1 there is a corresponding value of the loss, and you can visually see the point where the value of the loss is minimum.
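For this one-parameter model we can actually see the bowl numerically by evaluating J(w1) on a grid of candidate values, the brute-force idea mentioned above, which is only feasible because there is a single parameter. This is an illustrative sketch of my own with toy data generated at w1 = 2:

```python
import numpy as np

# Toy data generated with y = 2 * x1, so the true minimum is at w1 = 2
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])

def J(w1):
    # Convex, bowl-shaped squared-error loss in the single parameter w1
    return 0.5 * np.sum((w1 * xs - ys) ** 2)

# Brute force over a grid of candidate w1 values
grid = np.linspace(0.0, 4.0, 81)
losses = [J(w) for w in grid]
best = grid[int(np.argmin(losses))]
print(best)  # the grid point with minimum loss, near w1 = 2
```

With M parameters the grid would need exponentially many points, which is exactly why this brute-force search does not scale and we need gradient descent instead.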
So since this is a problem with a single variable, or a single parameter, we can find the minimum visually; but if we have multiple parameters, we cannot even visualize the loss function, so how can we algorithmically or programmatically find this particular point? What is given to us is the loss function, and we want to find the value of the parameter such that the loss is minimized. There is a method called gradient descent that helps us do this programmatically. Let us try to understand gradient descent intuitively before getting into the details of the steps involved in the process. What happens here is, we first initialize the value of w1 to some random value. Let us say I select this particular value of w1; for this particular value of w1 there is a loss associated with it. So this is the point where my initial guess landed, and at this point we calculate the loss. What does this represent? By randomly setting the value of w1, we actually define a model. Remember the duality of the loss space and the model space: we have x1 on one axis and y on the other, let us say these are the data points, and we have a model which is a line passing through the origin. Now, there is a difference between the predicted values and the actual values; we calculate the square of each difference and sum them up across all the points to get the loss corresponding to this particular model. So the first thing we did was randomly initialize w1; then we calculated the loss, so we know the loss. Now, there is a point, the minimum, that we want to reach.
How do we really reach that point from here? Think of this as a task analogous to climbing down from a mountain top. While climbing down, at a specific point we look around and find the direction that will take us down to the valley. In gradient descent, we do exactly this at a particular point on the loss surface: we calculate the slope at this point, the slope of the tangent, and that gives us the direction of the slope. We then move in the direction opposite to the slope. The slope is negative here, so moving opposite to it means we have to move in the positive direction, this direction. So first we calculate the loss, and second we calculate the gradient. Once we know the gradient, the next question is how far we move from the original point so that we reach the valley. We have multiple options: we can take longer strides or shorter strides. The length of the stride is decided by a parameter called the learning rate, which we denote by the letter alpha. The learning rate helps us control how long a stride we take from a particular point. So let us say we are at this point, and with some learning rate we take a stride and end up over here. This becomes the new point, and we repeat the same process we did at the previous point: we first get the predictions from this particular model, we calculate the loss, and then we calculate the gradient. In order to get the loss we also need predictions, so we get predictions at every step.
As you can see here, for this particular point there is a predicted value and an actual value, because in order to calculate the loss we need to know both. So we first make the predictions: we substitute the current parameter value into the model and compute the predictions, and based on the predictions we calculate the loss. After calculating the loss, we calculate the gradient of the loss. We again do the same thing at this particular point: we see the direction of the slope and move against it by taking some step; let us say we come here. Now you can see that as we approach this particular point, our golden point where the loss has its minimum value, the derivative, or the gradient, becomes smaller and smaller. At the minimum point itself the gradient is 0, because it is a minimum. So as we move closer and closer to the minimum, the gradient value becomes smaller and smaller. Our update is: w1 new = w1 old minus the learning rate times the gradient. You can see that the learning rate is constant, but the gradient becomes smaller and smaller, so eventually our strides become shorter and shorter as we approach the actual minimum. Is this clear? This is how we reach this particular point. Now you can take a pause here and work out yourself how this calculation works if we randomly initialize the point on the other side: you can see that there the slope is positive, and we will be moving in the opposite direction of the gradient.
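The loop described above, predict, compute the loss gradient, then update w1 new = w1 old minus alpha times the gradient, can be sketched for the one-parameter model. This is my own minimal illustration with toy data whose true answer is w1 = 2; the gradient formula follows from differentiating J(w1) = 1/2 * sum((w1*x - y)^2):

```python
import numpy as np

# Toy data generated with y = 2 * x1
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])

def grad(w1):
    # dJ/dw1 for J(w1) = 1/2 * sum((w1*x - y)^2)  ->  sum((w1*x - y) * x)
    return np.sum((w1 * xs - ys) * xs)

alpha = 0.05   # learning rate: controls the length of each stride
w1 = 5.0       # random initial guess, far from the minimum
for _ in range(100):
    w1 = w1 - alpha * grad(w1)   # stride in the direction opposite to the gradient
print(round(w1, 4))  # converges towards 2.0
```

Note how the stride automatically shrinks near the minimum: alpha stays constant, but `grad(w1)` goes to zero as w1 approaches 2, exactly as described above.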
Since the gradient is positive there, we will effectively be moving in the negative direction: we take the value of w1 old and subtract something positive from it, so we move in this particular direction, towards smaller values, if we start from there. On the other hand, when we started from the first point, the gradient was negative, and we subtract the gradient; negative times negative becomes positive, so we are adding something to the old value and we move in the positive direction, getting a value greater than the old value. So if we start where the gradient is negative, we get a value of w1 greater than the previous value; if we start where the gradient is positive, we get a value of w1 less than the previous value. This is the intuition behind gradient descent. I hope you understand this clearly; if not, it might be a good idea to take a pause, go back in the video and check it out. Once you understand this, we will move to the next part, where we write down the steps involved in the gradient descent algorithm. So let us write down the steps in gradient descent. We will now try to generalize gradient descent to the M-parameter case. When we were establishing the intuition of gradient descent, we intentionally did it with a single parameter so that it is easy to show geometrically what is happening, but in the M-parameter setting it becomes difficult to visualize. So let us consider a setting where we have M + 1 parameters in the model, b, w1, w2, all the way up to wM, and we are trying to solve a regression problem.
So the model is y = b plus the sum over i of wi xi; this is a linear combination of the parameters and the feature values, and this summation is a shorthand form of writing the model. Obviously, our loss function is the squared loss, computed over n examples. The first step is: randomly initialize b, w1, w2, all the way up to wM, that is, all the parameter values. Then we repeat until convergence; we will see later how to identify that we have converged to a solution. Inside the loop, we first use all these parameter values to predict y hat for each data point; so we calculate y hat, the predictions, first. Then we calculate the loss, J(b, W): we know the predicted values and we know the actual values, so based on those we can calculate the loss. Then we calculate the gradient of the loss with respect to each parameter. The next step is the updates: b new = b old minus alpha times the gradient with respect to b, w1 new = w1 old minus alpha times the gradient with respect to w1, and so on for all the parameter values up to wM new = wM old minus alpha times the gradient with respect to wM, simultaneously. This part is very, very important. A couple of points to note here: we are calculating these gradients with respect to each of the parameter values, and notice that this is not an ordinary equals sign; it is an assignment, where the effect is that we set the value of b to the new value coming from the right-hand side. And when we calculate the gradient with respect to w1, we are not going to use b new; we still use b old and the old values for all the other parameters, and we switch the values from old to new only right at the end, in the final step. This is what is called a simultaneous update.
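The steps above can be sketched as a small NumPy routine. This is my own illustrative implementation, not the course's code; the gradient formulas come from differentiating the squared loss, and the single tuple assignment at the end of the loop is what makes the update simultaneous: both gradients are computed from the old values of b and w before either parameter changes.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, steps=1000):
    n, m = X.shape
    b = 0.0
    w = np.zeros(m)                 # step 1: initialize the parameters
    for _ in range(steps):          # step 2: repeat (until convergence)
        y_hat = b + X @ w           # step 3: predict y hat for every point
        err = y_hat - y             # residuals used by the loss and gradients
        grad_b = np.sum(err)        # dJ/db, using only the OLD parameter values
        grad_w = X.T @ err          # dJ/dw, using only the OLD parameter values
        # Simultaneous update: both right-hand sides use the old b and w
        b, w = b - alpha * grad_b, w - alpha * grad_w
    return b, w

# Toy data generated with b = 1, w1 = 2
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])
b, w = gradient_descent(X, y, alpha=0.05, steps=5000)
print(round(b, 3), round(w[0], 3))  # approaches 1.0 and 2.0
```

Here convergence is approximated by a fixed number of steps; a practical stopping rule, which the lecture defers to the next session, would instead watch the change in the loss or the size of the gradient.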
So it is important to note that all these parameters have to be updated simultaneously at the end of the loop. Then we go back to step 2, check for convergence, and essentially repeat. What happens after this step is that we get a new point on the loss surface, and we repeat the same process at that particular point. This is the basic gradient descent algorithm, or a sketch of the gradient descent algorithm, for your reference. We have not yet talked about how to determine convergence or how to calculate the gradients; we will see that in the next session. Thank you.