In this video I want to take a closer look at gradient descent. Now, this is a series for healthcare workers and healthcare professionals, for everyone really, perhaps including people without an in-depth knowledge of mathematics and computer science who nonetheless really want to contribute to this field. So this video is not absolutely necessary for you to be able to write the code in the end, but I still feel it's important and worth watching. In the video I'll link to up here, where we looked at linear regression to develop our intuition, we built up this cost function. Remember, we have a set of feature variables, we multiply them by parameters, and we get a result. That result will differ from the target, so we subtract the two from each other, square the difference, and then go through all of the samples in our data set; one of the most common choices is simply to average those squared errors. That gives us a cost function, and that cost function is a function of all of those parameters. When we had just a single feature variable, that still gave us two parameters to find, so we were working in two-dimensional parameter space, and with more features we move into higher-dimensional space. So we have this function in multi-dimensional space, and what we want mathematically is the bottom of this function, the minimum. If we boil it down to a single variable, which is not realistic but helps the intuition, then on this axis I have my values for beta and on that axis my cost, and I want to see where the cost is at its absolute minimum. As I said, this is idealized; I'm bringing it down to a function of a single variable.
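A minimal sketch of that cost function in code, assuming a simple linear model. The names `predict`, `cost`, `X`, `betas` and `targets` are my own, and the tiny data set is made up purely for illustration:

```python
import numpy as np

def predict(X, betas):
    """Linear model: multiply each feature by its parameter and sum.
    X is an (n_samples, n_features) array whose first column is all
    ones, so betas[0] acts as the intercept."""
    return X @ betas

def cost(X, betas, targets):
    """Mean squared error: subtract the target from the prediction,
    square, and average over all samples."""
    errors = predict(X, betas) - targets
    return np.mean(errors ** 2)

# Tiny made-up data set: a column of ones plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
targets = np.array([2.0, 4.0, 6.0])

print(cost(X, np.array([0.0, 2.0]), targets))  # a perfect fit: cost 0.0
print(cost(X, np.array([0.0, 0.0]), targets))  # a bad fit: cost > 0
```

The cost depends only on the parameters once the data is fixed, which is exactly why we can treat it as a function of the betas and look for its minimum.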
So we have y equals f of x, and here say f of x equals x squared minus 3x plus 2, something to that extent; the exact function doesn't matter. We have this cost function and I want the point where the cost is at its lowest. For me it is beautiful that we can break this problem down, where we want to create a model that is as accurate as possible, into finding the minimum of some function. So when we have that, there must be a way to begin: initially we just give beta a random value and say we are starting right there. That beta value is going to give me a certain cost, and there's the cost. That starting point was chosen at random, and that is how it's done in deep learning. Now I want to get down there. How do I get there? Well, I have to use what I have, and I have to update this value. In reality, this is the point I am looking for, so I have to move from this point to that point. How do I do that? We use this idea of gradient descent. Obviously, if I'm standing on top of a hill, there's a gradient down; I want to descend along some gradient. And a gradient at a specific point on a graph is nothing but a derivative, which is why we looked at derivatives: I need a derivative. Let me show you how this really works, because I can use this derivative to generate a new value for beta, which will be here, which will give me a lower cost. But I've got to get from there to there, and eventually, in these tiny little steps, all the way down. As I say, just by looking at this graph it's easy to see where the minimum must be, but in multi-dimensional space with lots of beta values that's impossible to do by eye. How do I actually do it?
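Using the example curve above, f(x) = x² − 3x + 2, here is a quick sketch of picking a random starting point and reading off the slope there. The derivative is f′(x) = 2x − 3, so the true minimum sits at x = 1.5; the function is standing in for a cost and x for a beta value:

```python
import random

def f(x):
    # The example curve: x^2 - 3x + 2, standing in for a cost function.
    return x ** 2 - 3 * x + 2

def df(x):
    # Its derivative, the slope of the tangent line: 2x - 3.
    return 2 * x - 3

random.seed(0)
beta = random.uniform(-5.0, 5.0)  # start at a random value, as in deep learning
print(beta, f(beta), df(beta))    # the point, its cost, and the slope there
```

Notice that the slope is negative anywhere left of 1.5 and positive anywhere right of it; that sign is what the update rule exploits.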
Because what I will have then is a better beta value, which I put into my cost function, running again through all of my samples and averaging all my errors, and I'll have a new, slightly better cost with this new beta, and better and better, and I'll run through it again and again until I eventually approach this point. So what I've done here is expand the graph so that we have a better view. I start with this random value of beta right there. I have some value of beta which I've plugged into my cost function, and that puts me wherever this is; it doesn't matter which side of the minimum I start on. Now, how do I get from this beta to this one, which might be slightly better? The trick is to take this point and get the derivative of the function there, so I'm going to take the derivative of my cost function with respect to this parameter of mine, the first derivative. And remember, the first derivative is nothing other than a slope, so I get a tangent line which touches the curve at just that point. That's all I have. And what I'm going to do is say I have an old beta, call it beta old, the one I start with, and I'm going to update it by subtracting from it some learning rate, for which I'll write the Greek letter alpha (we'll see what that is about), multiplied by this derivative, dC/dβ. And why the negative, what is this minus sign doing? Well, I'm going to say beta new, the new value I move towards, equals wherever I started minus the product of those two things. Now, if I'm on this side of the ideal point, my slope will always be negative.
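The update rule described here, beta new equals beta old minus alpha times dC/dβ, is a single line of code, and repeating it again and again walks towards the minimum. A sketch, again using x² − 3x + 2 as a stand-in for the cost (its minimum is at beta = 1.5):

```python
def df(beta):
    # Derivative of the stand-in cost x^2 - 3x + 2 (minimum at beta = 1.5).
    return 2 * beta - 3

beta = 5.0    # a random-ish starting value
alpha = 0.1   # the learning rate

# Repeat the update again and again; each new beta gives a lower cost.
for _ in range(100):
    beta = beta - alpha * df(beta)

print(beta)  # very close to 1.5, the bottom of the curve
```

In a real model the only extra work is that `df` would be the derivative of the averaged cost over all samples, recomputed at each new beta.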
Remember, this is a negative slope and that is a positive slope, because we read the graph from left to right: going downhill is a negative slope, going uphill is a positive slope. So if I have a negative slope, this derivative here is negative, and a negative times a negative is a positive. My new beta is going to be the old one plus something, so I move slightly in this direction, towards the minimum. If I were on the other side, the slope would be positive, so I'd have a negative times a positive, which is a negative, and my new beta will be the old beta minus something, so I move in this direction, again towards the minimum. It works out beautifully every time: no matter which side I'm on, this minus sign means I always move in the right direction. This alpha, as I said, is called a learning rate, and it determines how big the steps are that we take. Usually we have alpha values of, say, 0.001 or 0.01, maybe even 0.1, with these orders of magnitude between them. Andrew Ng, for instance, likes values such as 0.003, 0.03 and 0.3; he just multiplies by roughly 3 every time and goes up from there.
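The sign argument can be checked directly: left of the minimum the slope is negative, so subtracting alpha times a negative number increases beta; right of the minimum the slope is positive, so beta decreases. A small sketch with the same stand-in cost x² − 3x + 2:

```python
def df(beta):
    # Slope of the stand-in cost x^2 - 3x + 2; its minimum is at beta = 1.5.
    return 2 * beta - 3

alpha = 0.1

# Left of the minimum: negative slope, so the update moves beta to the right.
left = 0.0
print(left - alpha * df(left))    # about 0.3, i.e. towards 1.5

# Right of the minimum: positive slope, so the update moves beta to the left.
right = 3.0
print(right - alpha * df(right))  # about 2.7, i.e. towards 1.5
```

Either way the minus sign in the update pushes the new beta towards the minimum.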
It doesn't matter exactly; it's in this sort of range, and that ensures the step is not too big, and I'll show you why we don't want a big step. But this is the thing of beauty: I now have a new beta which sits right here, I can go up and see what the cost is there, and I see the cost is less. This now becomes my old beta, and at this place I take the derivative, which I plug in there, multiply by alpha, subtract from the old beta (which is this one now), and get a new one, which will now be here. And I repeat that story every time, going through the whole algorithm, every time updating my beta using whatever the slope, the derivative, is at that point. And please remember that this graph is just the graph that comes from the creation of my cost function. Now, to show you why we don't want to take big steps, because you would think, well, then we can get there very quickly: what happens is we overshoot. We go from this point to that point, then back to this point, and back to that point, and you keep on overshooting, and you can never settle on this lowest point. We also don't want alpha to be too small, otherwise we take so many tiny steps that the computation takes a very long time, and our program might never even get there. So that is all fairly nice, and I hope you really get it, that you have an intuitive understanding of what is going on here. If you just think of every step that we've constructed here, it is such a beautiful thing that we can construct this and go looking for the minimum of our cost function, which, remember, is really just an average of all the errors that we make. It is a beautiful system. But please understand that we use the derivative at every point, which is the slope at that
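Overshooting is easy to demonstrate. With the stand-in cost x² − 3x + 2, a moderate alpha settles onto the minimum at 1.5, while too large an alpha makes each step land on the other side of the minimum, farther away than before, so the iterates bounce back and forth and diverge:

```python
def df(beta):
    # Slope of the stand-in cost x^2 - 3x + 2 (minimum at beta = 1.5).
    return 2 * beta - 3

def run(alpha, steps=20, start=5.0):
    # Repeated gradient-descent updates, recording every beta along the way.
    beta = start
    history = [beta]
    for _ in range(steps):
        beta = beta - alpha * df(beta)
        history.append(beta)
    return history

# A sensible learning rate settles towards the minimum at 1.5...
print(run(0.1)[-1])

# ...but a step that is too large overshoots: here each update maps beta to
# 1.5 - 1.1 * (beta - 1.5), so it jumps across the minimum and ends up 10%
# farther away each time, bouncing from side to side and diverging.
print(run(1.05)[-1])
```

Too small an alpha has the opposite failure: the run converges, but so slowly that you may run out of compute budget before getting close.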
point, and build it into this thing, and we keep on updating: we get a new beta by taking the old one, where we started, and subtracting from it 0.001, or whatever we choose for alpha, times the slope. The slope is negative on this side and positive on that side, which guarantees we always walk in the right direction. Now here I'm trying to draw in three dimensions, which is very difficult for me; I haven't an ounce of artistic ability. Anyway, this is the x-axis, this the y-axis, and then the z-axis, and what I'm trying to show with these dotted lines is a plane that is parallel to the x-axis. Because unfortunately it's not always that easy: we don't just have this simple curve. Even for two variables we can have a shape like this bowl here in three dimensions. So what do we do? Well, we have a water crisis in Cape Town at the moment, so in my office I drink from paper cups. Take a little paper cup like this; I'm keeping this one because I want another cup of coffee and I'm trying to spare it, and we drink coffee because we're saving water, we're all doing our bit. But imagine this cup is my bowl (it won't look exactly like this; imagine it's round at the bottom), and imagine I cut through it. That is what I'm drawing here: there's this three-dimensional object, and there's this plane cutting through it. Think about that for a moment. This plane lies parallel to the x-axis, so it allows movement along x, but it cuts the y-axis at one point; I'm keeping y fixed, it is not moving at all. And if I were to cut through the bowl, like pushing a piece of paper through it, the line of intersection along that cutting edge, like a knife's edge, is going to be a very nice little graph. And that is what you do
with partial derivatives, if you looked at the video on partial derivatives. If I take the partial derivative of my cost function with respect to x, it means I keep y constant, and any term involving only that constant disappears when I differentiate; its derivative is zero. So it tells me in what direction I have to move along this axis. Then I flip it around and cut through there keeping the x-axis constant, somewhere here, going through a specific spot, and again there will be another little graph along that line of intersection, and I can check what the slope is there. And remember when we talked about vectors: if I wanted to get over there, I could first walk in this direction and then walk in that direction. That is exactly what we do in deep learning. No matter how many dimensions there are, we keep all the other dimensions constant and look at how far to walk in this direction, then how far to walk in that direction, and so on. In multi-dimensional space, even multi-million-dimensional space, you can break it all down by keeping all the other variables constant. And think about the analogy of the cup: if I were to cut through it, that cut line is going to be a nice little graph for me. So no matter how many dimensions I work in, there is always going to be that little bit of slope, and through this analogy we can walk in the right direction to eventually get to the bottom, where my cost function is at a minimum. I really hope that this helped you. As I said, I'm not aiming this series at the hardcore mathematician or computer scientist who has studied these things; I want everyone involved to help us use deep learning to solve medical problems, healthcare problems. You don't necessarily have to know this, but I think there's an elegance, a beauty in this, and it would be lovely if you did
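The paper-cup picture, a bowl in three dimensions, can be sketched with a two-variable stand-in cost such as c(x, y) = x² + y². Keeping y constant and differentiating with respect to x gives ∂c/∂x = 2x (the y² term disappears), and likewise ∂c/∂y = 2y; updating each coordinate with its own partial derivative walks down to the bottom of the bowl:

```python
def dc_dx(x, y):
    # Partial derivative of c = x^2 + y^2 with respect to x: hold y
    # constant, so the y**2 term vanishes and we are left with 2x.
    return 2 * x

def dc_dy(x, y):
    # Partial derivative with respect to y: hold x constant, giving 2y.
    return 2 * y

x, y = 3.0, -4.0  # a starting point somewhere on the side of the bowl
alpha = 0.1

# Update every dimension using its own partial derivative; the same
# recipe works no matter how many dimensions there are.
for _ in range(100):
    x, y = x - alpha * dc_dx(x, y), y - alpha * dc_dy(x, y)

print(x, y)  # both very close to 0.0, the bottom of the bowl
```

With millions of parameters the loop body just becomes one such update per parameter, each computed with all the others held constant.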
understand it. Leave me some lines in the comments if I can explain anything in a different way, because I really want you to grasp these concepts; they are beautiful. Otherwise, I'll speak to you in the next video lecture.