Namaste. In the last class we studied gradient descent: we understood the intuition behind it and looked at the broad steps involved. In this class we will continue with gradient descent, covering some of its nuances and some of the variations that are employed in practical machine learning training. Let us begin. This is where we stopped in our previous class, so let us quickly recap this scheme. We looked at gradient descent, which has seven steps. We randomly initialize the parameter values, and then we repeat the following steps until convergence. First, the randomly initialized parameter values give us a point in the parameter space, and this point corresponds to a model in the model space. Since these parameter values give us a model, we can predict the output for each data point in the training data using that model. Then we calculate the loss J(w, b): the loss is the difference between the predicted value and the actual value; we square this difference, sum it across all n points, and multiply by a factor of half for mathematical convenience. That is how we get our loss function. After calculating the loss, we calculate the gradient of the loss function at those parameter values, and we use those gradients to update each parameter to a new value by subtracting the product of the learning rate and the gradient with respect to that parameter. Once we have calculated the gradient with respect to each parameter, we update all the parameters to their new values simultaneously, and then we go back to step 2: we are at a new point, and we repeat the process. We continue this iteration until the algorithm has converged.
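The recap above can be sketched in code. This is a minimal illustration for linear regression; the synthetic data, learning rate, and step count are my own assumptions for the demo, not something from the lecture.

```python
import numpy as np

# A minimal sketch of the gradient descent scheme recapped above, for
# linear regression with loss J(w, b) = (1/2) * sum_i (yhat_i - y_i)^2.
# The data set and hyperparameters here are illustrative assumptions.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 points, m = 3 features
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b                  # noiseless labels for the demo

w = rng.normal(size=3)                   # step 1: random initialization
b = 0.0
alpha = 0.005                            # learning rate

for step in range(1000):                 # repeat until convergence
    y_hat = X @ w + b                    # predict with the current model
    loss = 0.5 * np.sum((y_hat - y) ** 2)   # calculate the loss J(w, b)
    grad_w = X.T @ (y_hat - y)           # gradient w.r.t. each w_j
    grad_b = np.sum(y_hat - y)           # gradient w.r.t. the bias b
    w = w - alpha * grad_w               # update all parameters
    b = b - alpha * grad_b               # simultaneously
```

On this noiseless toy data the loop recovers the true parameters almost exactly.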
Let us try to understand the notion of convergence, and then we will come back to how exactly we calculate the gradient. So how does convergence work in gradient descent? We have the loss on the y axis and the value of the parameter on the x axis; this is the loss function in the context of linear regression, and this is the minimum. What happens at the minimum? At the minimum the value of the gradient is 0, and because the gradient is 0 we do not see any change in the parameter value. This is how we calculate the change in the parameter value: we take the old value and subtract the product of the learning rate and the gradient from it, and if the gradient is 0 there is nothing to subtract, so the value does not change. So this is the first way in which we detect convergence: by observing that there is no change in the parameter value, either in subsequent iterations or over the past few iterations. We can also apply the gradient descent algorithm to some tricky functions that may not be convex, so we add another safeguard. Let us say we have a function that becomes flat after some point. What will happen is that we will not see any change in the parameter value after some time, or the change will be very, very slow, and we do not want to keep making updates and running this algorithm forever. So we also define a maximum number of steps, a parameter specifying how long we want to run. We check for these two conditions, and if either one is met, we stop the gradient descent updates. The first condition is no change in the parameter value in subsequent iterations or over the past few iterations; the second is that the number of steps we decided to run the gradient descent algorithm for is done. After that we do not make any more updates to the parameter values. This is how we handle the convergence part.
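The two stopping conditions just described can be sketched as follows; the function name, tolerance, and one-dimensional toy loss are assumptions for illustration.

```python
# Sketch of the two stopping conditions described above:
# (1) the parameter barely changes between iterations (gradient ~ 0), or
# (2) a maximum number of steps is exhausted.

def gradient_descent_1d(grad, w0, alpha=0.1, tol=1e-8, max_steps=10_000):
    w = w0
    for step in range(max_steps):
        w_new = w - alpha * grad(w)      # standard gradient descent update
        if abs(w_new - w) < tol:         # condition 1: no change in w
            return w_new, step + 1
        w = w_new
    return w, max_steps                  # condition 2: step budget used up

# Toy loss J(w) = (w - 3)^2 has its minimum at w = 3; its gradient is 2*(w - 3).
w_star, steps = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0)
```

Here the first condition fires well before the step budget runs out, because the gradient shrinks to essentially zero near the minimum.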
Now let us try to understand how we handle the gradient calculation. As I said earlier, the gradient is nothing but the derivative of the loss function with respect to the parameter value. Just for your reference, I am going to write the loss function again: J(w, b) = (1/2) Σ_{i=1}^{n} (ŷ^(i) − y^(i))^2, the predicted value minus the actual value, squared, and summed across all n points. Remember that w is actually a bunch of parameters: w_1, w_2, all the way up to w_m. What we do is calculate the partial derivative with respect to one parameter value, say w_1: ∂J(w, b)/∂w_1. So let us apply the derivative operator on both sides and work the whole thing out; the sum runs from i = 1 to n over all examples. We take the half out, and the derivative of the squared term brings down a factor of 2 times the difference. It looks slightly complex, though it is not, so have some patience and be with me. This is the mathematical convenience I was talking about: this 2 cancels with the factor of half, which is exactly why we deliberately added the half to the loss function. So what we are left with is the predicted value minus the actual value. Now let us do a small expansion: the predicted value ŷ^(i) is nothing but b + Σ_{j=1}^{m} w_j x_j^(i), that is, b + w_1 x_1 + w_2 x_2 + ... + w_m x_m for each data point. If we calculate the derivative of this whole expression with respect to w_1, all the other terms are constant, and the derivative of a constant is 0.
So only this particular term survives: the term containing the parameter w_1, with respect to which we are calculating the partial derivative; for all other terms the derivative is 0. The derivative of this term with respect to w_1 is x_1, so the derivative of the loss with respect to w_1 turns out to be ∂J/∂w_1 = Σ_{i=1}^{n} (ŷ^(i) − y^(i)) x_1^(i), the summation over the difference times the value of the feature x_1. This is the gradient with respect to w_1. Now let us also write the equation of the gradient with respect to b, the bias term: ∂J/∂b = Σ_{i=1}^{n} (ŷ^(i) − y^(i)), the difference between the predicted value and the actual value multiplied by 1. For all the other parameters we had ∂J/∂w_j = Σ_{i=1}^{n} (ŷ^(i) − y^(i)) x_j^(i); I am writing them together here so that you can see them on a single screen and compare them easily. You can see that the only difference is the factor x_j^(i) for each parameter, whereas in the case of the bias this factor is effectively 1. So this is how we calculate the gradient with respect to a particular parameter w_j. You can also see that this gradient calculation involves summing over all the input data points: we first make a prediction on each input data point, take the difference between the prediction and the actual value, and sum across all of them. Now think of doing this on a very large training data set, and this becomes the bottleneck: this particular step takes quite a lot of time. Hence we are going to study a couple of techniques in a short while that will help us do this step more efficiently.
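The two gradient formulas above can be checked numerically against finite differences of the loss; the tiny random data set and variable names below are assumptions for the check.

```python
import numpy as np

# Numerical check of the gradient formulas derived above:
# dJ/dw_j = sum_i (yhat_i - y_i) * x_j_i  and  dJ/db = sum_i (yhat_i - y_i).

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))              # n = 5 points, m = 2 features
y = rng.normal(size=5)
w, b = np.array([0.4, -0.7]), 0.2

def loss(w, b):
    return 0.5 * np.sum((X @ w + b - y) ** 2)

residual = X @ w + b - y                 # (yhat - y) for every data point
grad_w = X.T @ residual                  # analytic gradient w.r.t. each w_j
grad_b = np.sum(residual)                # analytic gradient w.r.t. b

eps = 1e-6                               # central finite differences
num_w0 = (loss(w + [eps, 0], b) - loss(w - [eps, 0], b)) / (2 * eps)
num_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
```

The analytic and numerical values agree to several decimal places, which is a handy sanity check whenever you derive a gradient by hand.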
This version of gradient descent, where we consider all n data points to calculate the gradient, is called batch gradient descent. We have a couple of variations of this gradient descent idea. In one variation, instead of using all n points we use a batch of k points, where k is much smaller than n. What happens now is that instead of calculating the gradient on all the input points, the sum gets reduced to k points: we still sum the difference times x_j^(i) for each data point i, but only over the k points in the batch. So what is happening is: let us say this is our training data; we divide it into smaller chunks, each chunk with exactly k elements, and this k is called the batch size. We normally use a batch size that is a power of 2: 32, 64, 512, 1024, and so on. This version of gradient descent is called mini-batch gradient descent, where the only difference is that instead of taking all n points, we select a chunk of k data points and calculate the gradient update from it. Now, a couple of new terms come into play here. In each iteration we use exactly k points to calculate the gradient descent update, and when we have completed gradient descent updates covering all the data, we call that one epoch. So one epoch is a full pass over the complete training data of n points, and within an epoch we will have n/k batches, that is, n/k iterations. So in an epoch we will have multiple iterations, n/k to be precise, and we will usually run multiple epochs, so there will be many more iterations. Normally we specify the length of training in terms of epochs.
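The chunking just described is easy to see in code; the values of n and k below are illustrative assumptions.

```python
import numpy as np

# Sketch of how one epoch splits n training points into chunks of size k,
# as described above; the numbers are illustrative assumptions.

n, k = 1024, 64                          # n data points, batch size k
X = np.arange(n)                         # stand-in for the training data

batches = [X[i:i + k] for i in range(0, n, k)]
# One epoch = one full pass over the data = n / k iterations (here 16),
# each iteration computing a gradient update from exactly k points.
```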
So it is very important to understand what an epoch is, and we also specify the batch size, which is an important parameter in mini-batch gradient descent. You will see, when we go into neural networks and other machine learning models, that mini-batch gradient descent performs better, or learns more efficiently, than batch gradient descent. Extending this idea further: if we use k = 1, we have the very special case where we make a gradient update by looking at just a single example, and this is called stochastic gradient descent (SGD). So in the case of stochastic gradient descent the value of k is equal to 1. Overall we have this scenario: at batch size 1 we have SGD; when we use the complete training data to make one gradient descent update, it is called batch gradient descent; and mini-batch is somewhere in between. That is why mini-batch is generally preferred over SGD or full batch gradient descent. But if you have an extremely large amount of training data, some people use SGD, and you will see later that we use some variation of SGD for learning neural networks. So let us write down the gradient descent process again in the context of mini-batch and stochastic gradient descent. We first randomly initialize the parameters, and then, until convergence, we repeat the following: draw a batch of k examples, where each example contains the features and the label; obtain predictions from the model; calculate the loss on the batch; calculate the gradient of the loss with respect to each parameter; compute each parameter's new value from its old value; and finally update all parameters simultaneously. So you can see that the only difference from batch gradient descent is the first step, where we select a batch of examples and calculate everything on that batch.
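The mini-batch loop written out above can be sketched for linear regression as follows. The data, batch size, learning rate, and epoch count are assumptions for the demo; with k = 1 the same loop becomes stochastic gradient descent, and with k = n it becomes batch gradient descent.

```python
import numpy as np

# Sketch of mini-batch gradient descent for linear regression: shuffle,
# draw a batch of k examples, predict, compute the gradients from the
# batch residuals, and update all parameters simultaneously.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = X @ np.array([1.5, -2.0]) + 0.5      # noiseless targets for the demo

w, b = np.zeros(2), 0.0                  # parameter initialization
alpha, k = 0.01, 32                      # learning rate and batch size

for epoch in range(40):                  # one epoch = full pass over data
    order = rng.permutation(len(X))      # shuffle before each epoch
    for start in range(0, len(X), k):    # n / k iterations per epoch
        idx = order[start:start + k]     # draw a batch of k examples
        Xb, yb = X[idx], y[idx]
        residual = Xb @ w + b - yb       # predictions minus labels
        w = w - alpha * (Xb.T @ residual)    # simultaneous update of w ...
        b = b - alpha * np.sum(residual)     # ... and the bias b
```

Shuffling each epoch is a common design choice: it keeps any ordering in the data from biasing the sequence of batch gradients.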
The same process can be applied here: if instead of k examples we select all n examples and calculate the loss on the batch of all the examples, then it is the same process as batch gradient descent. So we have studied gradient descent and a couple of its variations, and this makes us ready to train any machine learning model; you can use one of these techniques to do it. Now the key question is: how do we know whether the machine learning model is actually getting trained, or whether it is facing some issues, and if it is facing some issues, how do we fix them? Before watching the video further, I would strongly encourage you to stop here and think for a couple of minutes: how would you measure, monitor, or infer that your model is learning by looking at the successive losses across gradient descent updates? It will be a good exercise for you to pause here for a couple of minutes and try to think about how you can solve this problem. Remember what is happening in gradient descent: this is our toy example, where we have the parameter value on the x axis, the loss J on the y axis, and this is our loss function. If your model is training correctly, we start somewhere, and with each update we can see that the loss comes down: the loss was high to begin with, and after each update it becomes lesser and lesser. So if you plot a graph of the iteration number against the loss, then ideally, as we train, the loss should come down smoothly towards 0. In real life, unfortunately, getting exactly 0 loss is often difficult because of noise in the training data, and that noise occurs due to issues in labeling or in data collection.
So what happens if you look at such a curve? These are the iteration numbers: at iteration 1 we had a very high loss, then iteration 2, iteration 3, and so on. If your model is learning, you should see the loss gradually decrease over subsequent iterations. This particular curve is called a learning curve, and it can have different characteristics; you can see that the shape of the learning curve depends strongly on the learning rate. If you have a very small learning rate, you will be taking baby steps to reach the minimum, so your learning will be very, very slow; the graph shows the loss decreasing, but very slowly. What if you have a slightly larger learning rate? Then you will be making longer strides, and you will come down faster; with a larger learning rate you may reach near-zero error quite fast, whereas with the small learning rate you are learning very slowly. You might then wonder: if I make the learning rate very high, can I reach the minimum in no time? Unfortunately, that is not the case. Recall how we set the update: w_new = w_old − α × gradient. If the learning rate α is too large, then the update term becomes too large: let us say I start on one side of the minimum; with a very large step I might end up crossing the minimum to the other side, and when I calculate the gradient there, I will probably overshoot back again. So I may not converge: I will keep going from one side of the minimum to the other.
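The three regimes just described can be reproduced on a one-dimensional toy loss; the loss J(w) = w^2 (gradient 2w) and the three rates below are my assumptions for illustration.

```python
# Effect of the learning rate on J(w) = w^2, whose minimum is at w = 0:
# a small rate converges slowly, a moderate one quickly, and a too-large
# one makes w overshoot the minimum and the loss grow instead of shrink.

def run(alpha, steps=50, w0=1.0):
    w = w0
    losses = []
    for _ in range(steps):
        w = w - alpha * 2 * w            # w_new = w_old - alpha * gradient
        losses.append(w ** 2)            # record the loss after each update
    return losses

slow = run(alpha=0.01)                   # baby steps: loss falls very slowly
good = run(alpha=0.4)                    # fast, smooth decrease towards 0
diverging = run(alpha=1.1)               # overshoots: loss grows every step
```

Plotting these three lists against the iteration number reproduces the learning curve shapes described above.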
So what you will observe in that case is that either you will not learn at all, or your loss will come down for a while and then go up again, oscillating. This kind of learning curve is problematic: it tells you that something is wrong with your learning rate, and you should probably reduce it if you see this kind of pattern in your learning curve. If you see the very slow pattern, you have room to increase the learning rate, and if you see a smooth, fast decrease, that is the pattern you want: you have probably found the right learning rate that helps you reach the goal faster. Ideally, you should start with some small learning rate like 0.001 and then increase it by an order of magnitude at a time. This is how you can diagnose whether your learning rate is too high or too low and take corrective measures so that your machine learning algorithm trains efficiently; and even before efficiency, you have to make sure that it is training and learning as per your expectations. This brings us to the end of this particular module. In this module we studied different optimization algorithms, mainly of the flavor of gradient descent and its variations, and we also understood how to monitor and diagnose problems during training and how to fix them. See you in the next session. Thank you.