So in this video, I want to show you a little bit about gradient descent. We're going to use a very simple model, linear regression: a single feature variable predicting a continuous numerical variable. Gradient descent really is at the heart of all supervised learning, at least of deep learning models. We're living in this era of the functional approach to AI, where we can write down the function that we're trying to minimize, and that function is all about the difference between what our model predicts and what the actual value is. If we can express that difference between the prediction and the actual value as a function, then, as with any function, we can find its minimum. Think of y equals x squared: we all know that right down at the bottom, where x equals 0, the function is at its minimum. And that's exactly how we learn. We have these parameters, the unknowns, the variables, sitting in multidimensional space, and we're trying to minimize with respect to them. That is what gradient descent is all about. Think about standing somewhere in a valley, looking down: you want to get right down to the bottom, but you can't just walk straight down, you have to zigzag your way down a little bit at a time, and that's exactly what it is. If we have a multivariable function, of course that's going to live in hyperspace; we can't picture that, we can only picture three-dimensional space, but imagine that multidimensional space: we are just going to zigzag our way down to the minimum. And at that minimum, each of our parameter values, our unknowns, takes the value for which this penalty, this cost function, the difference between the actual and the predicted value, is smallest. 
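The y equals x squared picture can be written out: the minimum sits exactly where the derivative vanishes.

```latex
f(x) = x^2, \qquad f'(x) = 2x, \qquad f'(x) = 0 \;\Longrightarrow\; x = 0 .
```

Gradient descent finds this same point numerically, by repeatedly stepping against the sign of f'(x) instead of solving f'(x) = 0 in closed form.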
And that is really at the heart of it. If we do linear regression, we can write a very simple gradient descent, so I'm going to show you how to do that in Julia, but I'm also going to put some pen to the screen on my tablet and show you the basics of the algebra and the calculus behind the partial derivatives. I'm not going to teach you how to do partial derivatives or matrix-vector multiplication; I'm going to assume that you know how to do that. If not, put something down in the comments and I'll make a video, because it's just a couple of simple things that you need to know about linear algebra and matrix-vector multiplication, and then it should all make sense. I'm trusting that you do know that; if not, let me know in the comments. I'm going to show you how we go from the code to what's actually happening behind the scenes, and from what's happening behind the scenes back to the code, so that you understand it's not just a function or two that we write without knowing where they come from. I want you to know where it really comes from, because that makes the code so much easier to understand, it makes gradient descent so much easier to understand, and it will help you along as you develop your knowledge of deep learning, because that is what we're going to do in supervised learning: we are just going to do gradient descent. So let me open up a Jupyter Notebook, and we're going to use Julia just to make life easy for ourselves. Let's have a look. I've opened a Jupyter Notebook here using IJulia, with the notebook() function inside the REPL, and we have our Julia kernel running inside the notebook. You can see I'm running Julia 1.4.0 up in the top right corner. Let's see what the libraries are that we're going to use. 
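As an aside, a minimal sketch of the setup, assuming the packages named in this walkthrough (Statistics and Random ship with Julia, so they need no install):

```julia
# Sketch of the setup described in the video, run from the Julia REPL.
using Pkg
Pkg.add(["GLM", "DataFrames", "Plots", "IJulia"])  # installable packages used here

using IJulia
notebook()  # launches Jupyter with the Julia kernel
```

This is environment setup, not part of the model itself; once the notebook is open, everything below runs inside it.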
So I'm using GLM, that's generalized linear models, plus Statistics, DataFrames, Random, and Plots, and I'm using the Plotly backend, so make sure that you've installed all of these through the REPL. Just a few short notes then: what is linear regression? Well, it's a type of modelling that's all about continuous numerical variables. We're going to have a set of variables we call feature variables, or independent variables, and then we're going to have a target variable, or dependent variable. For each sample, which would be a row in a spreadsheet, the model uses the values in that row, for that one specific sample, and it's got to predict as close as possible to the value in the target variable, the dependent variable. If we have as many samples as possible, we can run through all of them and use the feature variables to predict each and every one of the target values. So it's a continuous numerical variable that we're trying to predict, and that makes it a regression problem, and we're going to keep our independent variables, our feature variables, continuous numerical as well. In this first instance, all we're going to do is have a single feature variable and try to predict the target variable. So I'm using Random.seed! here in my first code cell, as you can see, and I'm setting that to 1; if you set it to 1, you're going to get the same random values selected. I'm going to have a computer variable that I call uppercase X, and that's going to hold random values, floating point, and I want 50 of them, so 50 random values between 0 and 1. Then for the target variable, which I call lowercase y, I'm just going to add a bit of random noise to a constant multiple of each of these. I'm taking X, so that's each of those 50 values, dot-multiplied by 10, which means each of them gets multiplied by 10, and 
then to each of them I'm going to add a bit of random noise. On this side I've got 50 values, so I'd better add 50 values on this side too. Strictly speaking I didn't have to have the dot there, because it's just a scalar times an array, which already multiplies each element, but what I do need is the dot plus; that's very important, because I've got another 50 values there. That's an array of 50 values, but this time from a normal distribution, randn(50), multiplied by 2 just to make it a bit bigger. So that's just adding a bit of random noise to 10 times the x value. Let's run that, and now I've got my feature variable in X and my target variable in y. In case that didn't make too much sense, let's have a look at visualizing the data; that always helps. I'm going to use the plot function, and I'm going to plot my array of feature values against my array of target values, and the series type is going to be scatter, so I might as well have used the scatter function. I'm adding a title and some labels. Remember, the first time you run this it's going to take a while. And there we go: we see our values between 0 and 1 on the x axis along the bottom, and on the y axis roughly 10 times that value plus a bit of random noise, so it's not all on a straight line, but you can certainly see there is a trend: as my independent variable goes up, so does the dependent variable. If I hover over any one of these points, I can see in black at the bottom the independent variable value inside of X, 0.58, and given that, I get a dependent variable, or y value, of 9.27113. You can hover over each of them: in black we see what the independent, or x, variable value was, and in blue at the top we're going 
to see what the y, the dependent variable value, was. So what a model would like to do, since we can see there is a correlation, a positive correlation, is draw a straight line, and a straight line makes it a linear model, such that given any x value I can go up to that line and read off a predicted value, a predicted dependent variable. You remember from school the equation of a straight line: y equals mx plus c, where m is the slope and c is the y intercept, so if x is 0 the mx term becomes 0 and y is just the intercept. Now we're not going to use m and c; we're going to use different symbols, just to confuse things, because that's the way it works. For c we're going to use beta sub 0, and for m we're going to use beta sub 1, so that we have y equals beta sub 0 plus beta sub 1 times x, and that's still a straight line. But you can well imagine we need to know a slope and an intercept for our straight line; those are the unknowns. How would we know what the best model would be? What I'm going to do here is just a random guess: I'm going to set beta sub 0 equal to 0, so when my input variable is 0 my output variable is 0, my y intercept is 0, and I'm going to make beta sub 1, my slope, 10. I chose that because that's how we designed the data up at the start, so my guess isn't going to be too bad. That's what I'm plotting here: I'm using plot with the bang, the exclamation mark, which means it plots on top of a plot that's already in memory, and I'm using values collected from 0 to 1 in steps of 0.01 so that we get a straight line going. And if we plot this, there we go, there's my model. You can see I have this red line as my model, and my model is y equals 0 plus 10x, so beta sub 0 is 0 and beta sub 
1 is 10, and that gives me this line. That means, given any input value here on the x axis, I go upward to the red line; say there, if I hover just right I get a little red readout, and the model says 3.8 for 0.38, and of course it's 3.8 because it's just 10 times the input. So now, if I get new data, I've got a model: you give me the input variable and I can predict the target variable for you. It's not going to be terribly accurate, because here, in amongst the real data we have, close to 0.59 the model says 5.9 but the real value is actually 6.67, so it makes a little mistake. If I draw vertical lines from each of these data points down, or up, to the red line, those are the errors we make, and we can well imagine asking: what would be the best possible line we can draw here? I need the best possible beta sub 0, and you can see mine is 0 here: when x is 0, the y intercept is also 0. And what would be the best possible beta sub 1, so that I make the least amount of error? The error is the difference between what the red line predicts for an x value and what the actual value was. We're trying to minimize this, and that's what machine learning, or deep learning, is all about. We've gone through various ages in artificial intelligence, and we're now in the functional age of AI, where we create a function, which we call a cost function, and we try to minimize that cost function. We're very fortunate that we can do it, and I'll show you exactly how that works, and why this functional age of AI works so well. But just to remind you, right at the bottom you see I have a y hat here, because y was my vector of target variable values, and that's going to be slightly different from the values my model predicts, so I call those y hat; that's a very common thing to do. And y hat is going to be beta sub 0 plus beta sub 1 
times x, and x is going to be either a matrix or, in this instance, still a vector, because I only have one feature variable. If I multiply beta sub 1 with that, beta sub 1 is 10 and it can just be a scalar, but if I want to add beta sub 0 to that vector I've got a bit of a problem: strictly speaking, beta sub 0 must be a vector of the same length as x, and we'll see how to deal with that. Just below that you see y hat sub i: for any sample in my spreadsheet, I take the x value in that same row, multiply it by 10 and add 0, and that gives me a predicted value. So let's see how well it does for x sub 1. If I execute X[1], square brackets, so we're just using indexing, the first of my 50 values is a 64-bit float, 0.166, and let's see what the actual value was: the actual value was 2.616. Now I'll get the prediction; remember, this is 10 times that first x value, and that gives me 1.699, which is quite a bit different from the actual 2.616, so I've made a bit of an error. And you see the error symbol there, the epsilon. The term we usually use, particularly in statistics, is the residual, but in deep learning we'll just say that's the error. The error is going to be my actual value minus the predicted value, or you could do it the other way around; it doesn't really matter, because in the end we're going to square those differences. But you see, to get the real y sub i you have to add that specific error term, the residual, in each and every case. So let's do all 50 of them. I'm going to store the predictions in a computer variable called y_pred, which is 10 times X; it's a scalar times a vector, so I didn't have to put dot-multiply as I did before, a scalar times a vector broadcasts to every value in the vector. My 
residuals are then going to be the actual values minus the predicted ones, so let's just run that. That's stored now, and that difference, y minus y_pred, is a vector minus a vector, so it's elementwise subtraction, and res is now going to be a vector of all my residuals. Let's look at what they look like. This is going to be quite important later, when we talk about using parametric tests and so on, but for the moment you can see I have this kind of normal distribution to my residuals. A better way to look at it is a scatter plot; let me show you what that looks like. What we have to remember is that there's the zero line right down the middle. Dots that fall on, or close to, the zero line have very small residuals: the actual value was very close to the prediction. Some are quite far away. If our model is the best possible, then these residuals will be as close to the zero line as possible. You've got to imagine a line from every point straight down to the zero line, or from this bottom one straight up to the zero line; those are all my residuals. That brings us to the idea of a loss function. Somehow I've got to quantify how bad my predictions were, and to do that we have a loss function. We have the same thing in deep learning: there is some difference between my vector of predictions and my vector of actual values, and because this is the functional age of AI, I can create a function to express the difference between those two vectors. In linear regression, one of the common ones we use is the mean squared error, MSE, and that just means I square all those differences, all those residuals. You'll see I've done it this way around, predicted minus actual. We take the first sample, the first row in the spreadsheet that contains the data, or in this 
instance, our two vectors; either way, I subtract one from the other and then square, so it doesn't really matter in which order I subtract them. I square the differences, then I add all of them up and divide by how many there are, and that gives me a mean; a mean is just adding everything and dividing by how many there are. So that's the mean squared error; the name says it all. That little symbol there means summing from 1 to m, where m is the number of cases, the number of subjects, the sub-i's, and in our instance that's 50. But let's just be clear how we get to this y hat. If we expand this little summation expression on the right-hand side, what every single one of these terms really is, is beta sub 0 plus beta sub 1 times x sub 1, minus y sub 1, squared. That is how I get each and every one of them: beta sub 0 plus beta sub 1 x sub i, minus y sub i, squared. There's my first sample, there's my second sample, there's my third sample, and so on up to m, which in our instance is 50, and then I divide by m. So remember, that is what happens behind the scenes; not the little shorthand version, but every one of those y hats written out in full; it's quite a thing. So let's just have a look at the first three predicted values, remember we had that 1.699, and what the actual values were, and what the feature variable values were. There we have them. So let's see how we did for all of those: it's 1.699, the predicted value, minus 2.617, the actual value, squared, and I square all of those; it's predicted minus actual, squared, predicted minus actual, squared, but remember each of these predictions had to be calculated first. So there we go, I've written it all out again for you, to show you it's 0 plus 10 times that, minus that, squared. So it's all of those. 
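Written out with the symbols used above, the cost being described is:

```latex
\mathrm{MSE} \;=\; \frac{1}{m}\sum_{i=1}^{m}\bigl(\hat{y}_i - y_i\bigr)^2
            \;=\; \frac{1}{m}\sum_{i=1}^{m}\bigl(\beta_0 + \beta_1 x_i - y_i\bigr)^2 ,
```

with m = 50 in this example, and the guessed parameters beta sub 0 = 0 and beta sub 1 = 10 plugged into every y hat sub i.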
Now, the 0 is something we just guessed for the moment, and so is the 10; we don't know these to start off with. Especially if we have five, six, seven, eight, nine, ten-plus feature variables, we're going to have a lot of these unknowns, because every feature gets a term like that x sub 1 there; there will be more of them, and each will have its own beta in front of it. So we're going to run into problems: we won't be able to just look at the picture, as we did there in two dimensions with a single feature variable, and guess roughly where the line should be. But let's create a function in Julia: the function keyword, then the name mse; it takes y_hat and y, and it's the sum of y_hat minus y, dot-caret 2. The dot before the power means elementwise: y_hat is a vector, y is a vector, the minus broadcasts elementwise, and each of those elements gets squared, and then I divide by how many there are. So that's just a code representation of my mean squared error. Let's look at the mean squared error for my model at the moment, and we see it's 3.97. That's the average squared error, the average squared residual we have at the moment; because we just square all of those and never take the square root, it's the mean of the squares. Now, that was just my guess, and I'm going to step aside into normal statistics for a moment, away from what we're trying to achieve here, which is for you to understand gradient descent and how to code it in Julia. Let's just go back to statistics a little bit. 
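The cells described so far can be sketched end to end like this; with a fixed seed the exact draws depend on the Julia version, so treat the 3.97 from the video as illustrative rather than exact.

```julia
using Random

Random.seed!(1)
X = rand(50)                    # 50 feature values in [0, 1)
y = 10 .* X .+ 2 .* randn(50)   # target: 10x plus Gaussian noise (std 2)

# Mean squared error between a vector of predictions and the actual values
function mse(y_hat, y)
    return sum((y_hat .- y) .^ 2) / length(y)
end

y_pred = 0 .+ 10 .* X           # the guessed model: beta_0 = 0, beta_1 = 10
res = y .- y_pred               # residuals, one per sample
println(mse(y_pred, y))         # near 4, since the noise variance is 2^2
```

A perfect prediction gives mse(y, y) == 0, which is a handy sanity check on the function.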
The worst case model, well, not quite the worst, you can draw very bad lines which are even worse, but the base model we always compare against in statistics, is the one that just uses the mean as its prediction. So I'm taking all my target values here and calculating the mean, which is 4.7, and all I'm going to do now is repeat that 50 times, so that I have a vector of 50 values that all look exactly the same. Let's look at the mean squared error if all my predictions are just the mean of my target variable, and we see it's quite high: the mean squared error of this mean model is 12.28. That's quite a bit higher, so the average of my squared residuals is quite a bit larger. We can compare a model we've created to this baseline model, and there is that comparison; I'm just writing out the word comparison there: you take the mean squared error of your base model, the one that uses the mean for all its predictions, no beta sub 0 plus beta sub 1 times x, just the mean, you subtract from that your model's MSE, and you divide by the base. That gives you a comparison between your model and the base: how much better is it? That comparison, as you can see here, is usually called R squared, or the coefficient of determination, and if we multiply by 100 we get it as a percentage, and we can say that our model explains so much of the variance in our target variable. We see 67.7, so we can say that our model, with a beta sub 0 of 0 and a beta sub 1 of 10, explains 67.7% of the variance in the target variable. That's just normal statistics; that is how we would evaluate our model. We'd want, of course, to explain 100% of the variance, which would be an R squared of 1, the highest it can go, and 0 is the lowest it can meaningfully go; that's a 
very bad model then; it's exactly the same as the base model. So there we go, that's just a little bit of statistics. Back to this idea of gradient descent, this functional age of AI. Remember I said we can express how bad we are with a function. Now let's create a very easy function to work with, and there we go: f of x equals x squared minus 10 times the sine of 2x, plus 12, and you can see that green curve there. Now, just imagine for one second that this is our cost function. We know it's not our real cost function, because our real one has two unknowns; remember, when we expanded it up here, we had beta sub 0's and beta sub 1's in there. Those are the unknowns in my function. The x sub 1 is given; I know for every sample what the x value, the independent value, is, and what the y value is, so don't get your x's and y's confused with what we did in school algebra: the unknowns are beta sub 0 and beta sub 1, or in y equals mx plus c, the m and the c. Those are the unknowns I'm looking for. So imagine I don't have two of them; imagine I'm only looking for one of them, either beta sub 0 or beta sub 1, and imagine that my whole loss function looked like this, which means I can plot it as a single-variable graph, and there you see the green curve. What we're trying to achieve, remember, is to minimize this function. Looking at it, it's very easy: we see the point m right at the bottom, that's the minimum, so whatever value I read off there, for beta sub 0 or beta sub 1 as the case may be, that is the value for that variable that gives me the lowest cost. Now I should just go back and quickly say what the difference is between a loss and a cost function; people use 
them interchangeably. The loss is that y hat minus y, squared: the squared difference between the predicted value and the actual value for one sample. If I sum over all of them and divide, that's my total cost: all the individual losses combined. So that's the difference: a loss is a function that looks at one sample, and if I combine all the samples, all the losses, together, that's the cost, the cost function. So here we are with a cost function, and it's very easy to find the minimum just looking at it like this; I can look at the picture and tell you it's down there at m. That is easy because on the y axis I have the cost, and on the x axis I have the possible values for my unknown, my beta sub 0 or beta sub 1, and I can just see where the bottom is. In reality, of course, it's not that easy. So how do we get to this point if we don't have this nice graph to look at? Because we've got other problems here too: there are other local minima, there at i, k, and o; those are local minima, but the global minimum is at m. So how do we get there if we can't see this graph? Well, we just start somewhere, randomly, and in this instance I've started at b. I can start anywhere, but I'm starting at b, and I'm going to use the derivative. You've got to know a bit of linear algebra here, as I mentioned, and a bit of differentiation, of calculus. So I've just started here at b, blindly, but imagine this is a valley and I'm just trying to walk to the bottom of it. How do I get to the bottom of this valley? Well, I'm obviously standing on a slope, and I know I want to get to the bottom, so I just go down the slope; as easy as that. In real life, if you're walking down in a valley, a slope is a slope: you want to walk downhill to get to the bottom, you 
don't walk uphill to get to the bottom. Sometimes you have to, you can see here that from o I'd have to go up over n to get down to m, but let's just imagine for now that you only want to walk downhill. Fortunately, in calculus we also have the slope: it's the first derivative. So if I know what the first derivative is there at b, it slants downhill, and I can use that fact to get to a better value. If I drop straight down from b, I'm at about 4-point-something, but I know I want to get to this one here, around 0.7, somewhere there. This slope here, this tangent line, you remember from school calculus, is a positive slope, a positive value. Can I use that somehow to iteratively change where I am at the moment, at about 4.1, into something smaller? My slope is positive at the moment, so to move x to a smaller value I've got to subtract something from it. So let me just subtract the slope I have there, and that's what we get down here: my x_new is my x_old minus df/dx, the first derivative of my cost function with respect to x. Because df/dx where we're standing at b is positive, I take x_old and subtract a positive value, which means my x_new is going to be slightly smaller, so I get going in the right direction. If I were way over on the other side, say up on the far left-hand side, to the left of i, my slope would be negative, and subtracting a negative is adding a positive, so from way out there on the negative side I would move towards the right. So the slope is always going to send me in the right direction. But we do add this little step 
size there, because we don't want to take a giant leap. Imagine you're a giant, much bigger than you are now, walking in this valley: you can take a huge step, but you make it so big that you step straight over to the other side of the valley. That's not what you want, because every step now just jumps you back and forth, back and forth, and that's a very bad thing in gradient descent. So we add this little step size, we call it the learning rate, and we usually make it quite small, maybe starting at 0.03, so that we take itty-bitty little steps, because what we want is to not overshoot the minimum; we always multiply the derivative by it. But I think you can clearly see how we use the first derivative, the slope, to guide us in the right direction. Now, you can see the two problems here. One we've discussed: you can end up in a local minimum, not the global minimum, and that is a problem in gradient descent and deep learning, and there are all sorts of extra things we can do, regularization etc. 
to help us out with that. In practice, though, it's not the worst thing, inasmuch as we don't actually want a perfect model: we never want our model to memorize the training data, we want it to generalize to unseen data. So it's not as big a problem as you might imagine, although some models can still be severely affected by continually ending up in a local minimum instead of the global minimum. For now, just remember it's not the worst thing, and that's all we do. But now you're going to tell me: we don't have just one unknown, we have two in this instance, a beta sub 0 and a beta sub 1, whose best values we have to find in our cost function. So here we have a very simple x squared plus y squared; that's what I used to create this little bowl here, and it's now in three-dimensional space, because I have two variables, not just one anymore. But fortunately we have partial derivatives; I hope you can remember what partial derivatives are. You can see that if I keep one of my two variables constant, which is represented by either this pink plane or the green one, the plane cuts through my shape. You can see this black line that comes down and becomes dotted on this side: that's where the green plane intersects my cost function, and there I just have a nice parabola again; and this red one intersects here in the front, and that's a parabola too. So I can look at the variables individually and just step in the right direction for each one of them, and if I combine those steps I am really going downhill, towards the bottom of my cost function. And no matter if I'm in hyperspace, with many, many more feature variables, which means many, many more unknowns, I can handle each of them one step at a time, keeping all the others constant, in other words taking the partial derivative with respect to the one I'm interested in, and so I just go through all of them 
individually and update each in turn, because this x_new here represents either beta sub 0 or beta sub 1, and if I had more, beta sub 2, beta sub 3, whatever, I'd update all of them individually, keeping everything else constant. That's what we do with the partial derivative, and partial derivatives are very easy for a multivariable function, because all the other terms are just constants, and taking the derivative of a constant is very easy. So no problem there: we can scale this up as large as we want, and if we continuously walk down the slope, that is how we find our minimum. Now comes the bit of linear algebra; you've got to understand a bit of linear algebra now. Remember I said we had a bit of a problem with beta sub 0, because we want to add two vectors together, and beta sub 0 is just a single value, which was easy while it was 0, but I've actually got to have a whole vector of them, because I can only add two vectors to each other, or subtract them, and that was my loss function, resulting in my cost function, if the two vectors are of equal size. My y_pred and my actual y have got to be exactly the same size, because we need elementwise subtraction. That's easy to fix: right at the beginning we add a whole new column of ones, a constant of 1, because beta sub 0 times 1 is still beta sub 0, but now I have a scalar beta sub 0 multiplying a vector of all ones, which gives me a vector of the length we need. So I'm changing my X, which was a vector of 50 elements, into a matrix of m times 2; m is 50, so this will be a 50 by 2 matrix, 50 rows and 2 columns. I still have x sub 1, x sub 2, all the way down, all my feature variable values, but I've added a column of ones. And then I'm going to make another column vector, and that vector holds my two unknowns, beta sub 0 and beta sub 1, and I'm going to call that theta, which is a 
vector, you see the underline there. Then my prediction becomes a matrix times a vector, and if you know anything about linear algebra, you know that in this order the column number of the first has got to equal the row number of the second, otherwise you can't do matrix-vector multiplication, and the result takes its rows from the first and its columns from the second. So we're going to end up with an m times 1, or 50 times 1, vector, and that's exactly what we want. y hat is just going to be the matrix x, which for us is a 50 times 2 matrix, times a 2 by 1 column vector, and that gives us a 50 times 1 column vector, exactly what we want: those 50 predictions. So if we write our feature variables as a matrix, adding the ones right in front, and we write our unknowns as a column vector, and we do matrix-vector multiplication, we get exactly what we want: beta sub 0 times 1 plus beta sub 1 times x sub 1, then beta sub 0 times 1 plus beta sub 1 times x sub 2; that's what matrix-vector multiplication does. Since anything times 1 we can just drop, it reads beta sub 0 plus beta sub 1 x sub 1, beta sub 0 plus beta sub 1 x sub 2, et cetera, all the way down to 50. That is how we get our prediction, and that makes for very easy code, because it's very easy to do matrix-vector multiplication in code. Then there's my loss function, remember, which I'm going to write here; it's going to be another vector. It takes theta as an input and results in an m times 1, and that's this y hat minus y, a 50-element vector minus a 50-element vector, which is still an m times 1, or 50 times 1, vector. So that's beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, and then the second one and the third one, which makes it very easy to create these things. My prediction is x times theta, and my loss is that x times theta, which is y hat, minus
y. Very easy. So let's look then at this mean squared error thing. Now, we don't use mean squared error, we actually use half mean squared error. Remember, it was everything summed over all the samples and divided by m, but we actually multiply all of that by a half. That doesn't make a difference to finding the minimum, because all I'm doing is scaling my cost function by a scalar; it's not going to change where the minimum of this cost function is. But putting the half there makes for much easier partial derivatives, and you'll see that later. If I write it out, I'm going to have this l, which was my loss, and I take its transpose times itself. Now, if I take the dot product between a vector's transpose and the vector itself, what am I doing? I'm squaring each element, because I'm multiplying each term by itself. Take a piece of paper and convince yourself of that: if you take a column vector and take its transpose, which makes it a row vector, then a row vector times that same column vector is just element-wise squaring of each entry and adding all those squares up. That's what a dot product does: element times element, plus element times element, plus element times element, and it's such a neat way to get the sum of squared differences. And if we were to write it out, remember, this is what it would look like: it's 1 over 2m times beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, all squared, then the second sample, all the way down to the 50th, the last one. That is what we have at the moment if we just did this transpose times itself, with all the matrix multiplications and vector subtractions I've talked about. Now let's square all of these, and you see I end up with something like this, this
whole long term: beta sub 0 squared, plus beta sub 0 beta sub 1 x sub 1, minus beta sub 0 y sub 1, plus beta sub 0 beta sub 1 x sub 1 again, plus beta sub 1 squared x sub 1 squared, and so on. Just convince yourself by squaring all of those out; that's what we end up with, and we still have the 1 over m, but with the half, so 1 over 2m. That eventually gives us a cost function, and if you look at all these ones and twos and threes as subscripts, I can write it all as a summation, and there is my summation. Now it becomes very easy to take partial derivatives. Remember, this whole long thing is my multidimensional cost function, and I'm trying to minimize it. So I take the partial derivative of the cost function with respect to beta sub 0, and the partial derivative of the cost function with respect to beta sub 1, and you can see I've done both for you there; there's the one, there's the other one. Pause and see if you can do it yourself; it's actually very easy straight from the summation notation if you think about it. If you take this one and take the partial derivative with respect to beta sub 0, you're going to get that, and you see if I bring the twos out, that two cancels, and that's why we put the half there, because now we have this very neat update. Remember, this is the derivative, the slope, that's going to help me walk in the right direction. We do the same for beta sub 1, and now we have this beautiful thing where we have the old column vector, beta sub 0 and beta sub 1, and I subtract from it, element-wise, my little learning rate times 1 over m times these partial derivatives, with respect to beta sub 0 and beta sub 1. So this is a column vector equals a column vector minus another column vector, each of them 2 by 1, and that's it, as simple as that. And then in code, if we go back all the way, that
is what it looks like. All the steps up till now, that is exactly what we have. We have this loss, which is x times theta minus y, and the cost function is then the 1 over 2m, and to get all those element-wise squares it's the loss transpose, that little apostrophe is transpose, times the loss. To update, I take my old beta sub 0 and beta sub 1, that theta there, and subtract from it alpha over m times x transpose times the predicted minus y; that's the gradient of the cost function. Now we're going to move away from the notebook. I'm going to do it with pen and paper, well, actually I have a tablet screen that I can write on, instead of recording pen and paper, and I'm going to convince you that this line of code is nothing other than all of these steps, that they are the same. What I want to convince you of is that this piece of code we saw in the notebook, where I take theta, which is just this two-element column vector, beta 0 and beta 1, our unknowns, and set it equal to, remember, in the world of computer languages we evaluate the right-hand side and then assign it to the left-hand side, the old theta that we have, minus alpha, our step size or learning rate, divided by m, the sample size, multiplied by the transpose of x, and that multiplied by the difference between the prediction and y. You know what pred and y and x and alpha and m and theta are from the code, and I want to show you that that's exactly the same as what we did here. What we have is just a bunch of column vectors. We're saying this old column vector, where we stand at the moment, minus, and here we have a constant, this alpha divided by m, and if we multiply that by a column vector it's just like broadcasting, it's going to scale the multiplication with a vector
so it's just going to multiply each of the two. On top we have the partial derivative of the cost function with respect to beta sub 0, and below it the partial derivative of the cost function with respect to beta sub 1. So I have two 2 by 1 column vectors and I subtract one from the other. What we're doing here in essence, remember it's line by line, is: beta 0 equals whatever beta 0 is at the moment, minus alpha over m times the partial derivative of the cost function with respect to beta 0; and separately from that, beta sub 1 equals beta sub 1 minus alpha over m times the partial derivative of the cost function with respect to beta sub 1. Those are two separate equations, it's a linear system, and hence we can write it as these column vectors. I want to convince you that this is nothing other than that; that is what we are writing. So let's start off, let's go back to blue and do the pred first. We see pred there, and remember we said that was equal to x times theta in our code. Let's just have a look at the dimensionality of this: x was an m times 2 matrix, and theta is a 2 by 1 column vector. If we multiply those two out, what is it going to look like? Remember, x was this 1, 1, 1, all the way to 1 in the first column, and then x sub 1, x sub 2, x sub 3, all the way down to x sub m; that was our matrix. We multiply that by this column vector of beta sub 0 and beta sub 1, and what does that multiplication give us? We've got a matrix times a vector, and we can do that multiplication because that 2 equals that 2, and what we get out of it is an m by 1 column vector, exactly what we want. So let's just do that; what we're going to
have here in the end is that very simple thing: we just multiply those out, so that's beta sub 0 plus beta sub 1 x sub 1, then beta sub 0 plus beta sub 1 x sub 2, and remember there should be a beta sub 0 multiplied by 1 every time, but anything multiplied by 1 is just itself, and at the end we have beta sub 0 plus beta sub 1 x sub m. From that we subtract y, again element-wise, so this is minus y sub 1, minus y sub 2, all the way down to minus y sub m, and this is a column vector, an m times 1 matrix, or column vector. So that makes it really easy. Now let's see what is happening on the other side: alpha divided by m times the transpose of x. What does the transpose of x look like? Well, that's just the columns and rows swapped around, so the first row is 1, 1, 1, all the way to 1, and the second row is x sub 1, x sub 2, x sub 3, all the way to x sub m; it now has only two rows and is m long. We're going to multiply that by this alpha over m, so all we need on this side is alpha over m, and if we do that, the first row is alpha over m, alpha over m, alpha over m, all the way to the end, and the second row is alpha over m x sub 1, alpha over m x sub 2, alpha over m x sub 3, all the way to alpha over m x sub m, and there is our 2 by m matrix. Now we have to multiply that by what we had before; these two have to be multiplied by each other. So this is a 2 by m, and that is an m by 1; where are we going to end up? Let's keep it in pink: we're going to end up with a 2 by 1, and that's exactly what we want, because the thing we subtract it from is 2 by 1, our beta sub 0, beta sub 1 column vector. So if we multiply these two out with each other, that's
just going to be quite a long stretch of things to do, so let's have a look at it. For the first row, we multiply each alpha over m with each element of the column vector, so we end up with alpha over m times beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, plus alpha over m times beta sub 0 plus beta sub 1 x sub 2 minus y sub 2, and it just carries on and on, all the way down. Then for the second row we also have the alpha over m, but now with the x's as well: alpha over m x sub 1 times beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, plus alpha over m x sub 2 times beta sub 0 plus beta sub 1 x sub 2 minus y sub 2, plus alpha over m x sub 3 times the next one, let me just make some space, and we carry on like that to the end. What we're left with in the end is exactly a 2 by 1 matrix. Most importantly, we can shorten this: we have a beta sub 0 and a beta sub 1 on this side, and that is going to equal beta sub 0, beta sub 1 minus what we have here, these two, and if we shorten that, it's basically alpha over m, because that's just a constant multiple with broadcasting, times, for the first row, the sum from i equals 1 to m, because we've got m samples, of beta sub 0 plus beta sub 1 x sub i minus y sub i; that's what we have every time, it's the 1 and the 1, then the 2 and the 2, that's the only thing that changes. And for beta sub 1 we have the sum from i equals 1 to m, where all we add is an x sub i in front, so x sub i times beta sub 0 plus beta sub 1 x sub i minus y sub i
and that is still a 2 by 1 column vector. So we have a 2 by 1 column vector, minus this constant multiple of a 2 by 1 column vector, and everything works out; we can subtract two 2 by 1 column vectors from each other. Now let's go about it from the other way. Remember we had the mean squared error, actually the half mean squared error, so let's have a look at that: it's 1 over 2m times the sum from i equals 1 to m of y hat sub i minus y sub i, those two vectors subtracted from each other element by element, squared. That's exactly what we want. And remember, let's go back to orange, it's still 1 over 2m times the sum from i equals 1 to m, and how did we get y hat? All we did was take x, which remember was an m by 2, and multiply it by theta, which is a 2 by 1, so we end up with an m by 1, which is exactly what we want, and then we subtract from that the vector y, and of course we square each of those entries. Remember that we're going through each of those i's, and this multiplication gives us an m by 1, so it's just: take each one of them, square it, and then sum them. Let's have a look if this works. If we take y hat sub i, remember, that is beta sub 0 plus beta sub 1 x sub i, exactly what we have there, and if we look at the whole y hat as a vector, that's beta sub 0 plus beta sub 1 x sub 1, then beta sub 0 plus beta sub 1 x sub 2, all the way down to beta sub 0 plus beta sub 1 x sub m. So that's our whole column vector, for all of them. And what did we do then? Well, we had this loss, this idea of a loss function, and that was just y hat minus y. If we think about that as two column vectors, all that is is just
all of these: beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, then beta sub 0 plus beta sub 1 x sub 2 minus y sub 2, all the way down to beta sub 0 plus beta sub 1 x sub m minus y sub m, and that's still an m by 1 column vector. That was our loss, remember, but we didn't want that; we want each of those squared. So what do we do? We take the loss as it stands, take its transpose, and multiply it by the loss again. The transpose of this is going to be this whole long thing: beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, comma, beta sub 0 plus beta sub 1 x sub 2 minus y sub 2, comma, all the way along, so now this is a 1 times m. Then we multiply it by the m times 1: beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, beta sub 0 plus beta sub 1 x sub 2 minus y sub 2, all the way down to beta sub 0 plus beta sub 1 x sub m minus y sub m. There we go. And what are we going to end up with? A 1 by m times an m by 1 gives a 1 by 1, and that's exactly what we want; that is our part of the mean squared error, we just have to multiply by 1 over 2m. How this works is, it's the square: that first entry is multiplied by that same first entry, and the second by the second, so we're just squaring each of those and adding them all up, which is exactly what we want, because in the end we have this idea of the loss squared. So let's do exactly that, let's make this loss squared and see what it looks like. It's this one multiplied by that one, and this one by that one, so it's going to be beta sub 0 multiplied out: beta sub 0 times beta sub 0 is beta sub 0 squared, plus beta sub 0 beta sub 1 x sub 1, minus beta sub 0 y sub 1. So I've multiplied the first term through, and then for the next one, that's going to be beta sub 0 beta sub 1 x sub 1, plus beta sub 1
squared x sub 1 squared, minus beta sub 1 x sub 1 y sub 1, then minus beta sub 0 y sub 1, minus beta sub 1 x sub 1 y sub 1, plus y sub 1 squared. So it's all those three terms multiplied throughout by all those three terms; in other words, beta sub 0 plus beta sub 1 x sub 1 minus y sub 1, multiplied by itself, beta sub 0 plus beta sub 1 x sub 1 minus y sub 1. If you multiply all of that out, that's what you're going to get, and it obviously carries on, because now we do the second term, and the third term, and the fourth term, et cetera. So in the end, for the mean squared error, that was going to be 1 over 2m times the sum from i equals 1 to m of all of them, and if we group some of these terms together, remembering that we go from 1, 2, 3, all the way to m, we can bring them all together in a summation. So it's going to be beta sub 0 squared, plus beta sub 0 beta sub 1 x sub i, minus beta sub 0 y sub i, you see, and we have another beta sub 0 beta sub 1 x sub i.
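For reference, here is the multiplication just carried out for the first sample, typeset as one equation; it is the same six-term expansion described above:

```latex
(\beta_0 + \beta_1 x_1 - y_1)^2
  = \beta_0^2 + 2\beta_0\beta_1 x_1 - 2\beta_0 y_1
  + \beta_1^2 x_1^2 - 2\beta_1 x_1 y_1 + y_1^2
```

and summing this over all m samples, with the half, gives the cost function

```latex
J(\beta_0, \beta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\beta_0 + \beta_1 x_i - y_i)^2
```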
Let's just do all of those, and we can see here we're going to group all of them in the end. We have a, let's give ourselves some more space, beta sub 1 squared x sub i squared, minus beta sub 1 x sub i y sub i, and we have another, you need a lot of space for this, minus beta sub 0 y sub i, we've got that one, and minus beta sub 1 x sub i y sub i, so we'll have two of each of those in the end, and then all the way at the end we have y sub i squared. So let's clean those up, because we can certainly group some of those terms. We still have 1 over 2m times the sum from i equals 1 to m, and if we group them all together, we get beta sub 0 squared, plus 2 times beta sub 0 beta sub 1 x sub i, minus 2 times beta sub 0 y sub i, plus beta sub 1 squared x sub i squared, minus 2 times beta sub 1 x sub i y sub i, plus y sub i squared. That's all neatly cleaned up; all we've done is notice there's one term, there's another of the same term, so there's two of them, et cetera, and group them. So that's our mean squared error; that, in the end, is our cost function of two unknowns, for all i equals 1, 2, 3 up to m. All we have to do now is take the partial derivative with respect to each of them separately, so let's do that. Let's take the partial derivative of the cost function with respect to beta sub 0. Well, that's 1 over 2m, the summation stays exactly the same for this kind of derivative, and we're treating beta sub 1, our other unknown, as a constant. So the first term gives 2 times beta sub 0, the derivative of beta sub 0 squared; from the next one we get plus 2 times beta sub 1 x sub i, because there's a 1 power
there; we bring it to the front, the exponent becomes 0, so the beta sub 0 goes away. Then we have minus 2 times y sub i, and that's all we have there, because the remaining terms don't contain beta sub 0 at all. Now look at all this: that's why we put the half there, because I can take these 2s, all of them, away, and they're all gone, and there's my partial derivative with respect to beta sub 0, very simply. Let's do the partial derivative of the cost function with respect to beta sub 1. That's 1 over 2m times the sum over i equals 1 to m; the beta sub 0 squared term is a constant with respect to beta sub 1, so its derivative is 0. From the next term we get 2 times beta sub 0 x sub i, as simple as that; the minus 2 beta sub 0 y sub i term is also a constant and falls away; then we have plus 2 times beta sub 1 x sub i squared, and right at the end, minus 2 times x sub i y sub i, because when we take the derivative of that last term the beta sub 1 disappears, of course there should be nothing left of it there, and that looks much better. Again I can get rid of the 2s, that's why we did the half; this makes it all nice and neat. The other thing I can do, you see they all have a common factor of x sub i, so I can say that the partial derivative of the cost with respect to beta sub 1 is 1 over m times the sum from i equals 1 to m of x sub i times beta sub 0 plus beta sub 1 x sub i minus y sub i. And if we go all the way back up, that's exactly what we have here, nothing other than that. So with our code, and with doing it by hand, we get to exactly the same thing. In other words, we really have, let me write it neatly: beta sub 0, beta sub 1 equals beta sub 0, beta sub 1, and we subtract from that alpha over m
times this idea of the partial derivative of the cost function with respect to beta sub 0, and the partial derivative of the cost function with respect to beta sub 1. That's exactly what we're doing; it's a linear system in two equations, and we see that's exactly the same thing that comes from this line of code right up here. That line of code gives us exactly this thing, and that is what we want to do with gradient descent, so let's get back to the code. Now that you're convinced that they're the same, let's have a look at what it looks like. The function is the cost function; it takes x, y and theta. My m is just the size of x, the first element of the tuple, because size returns rows comma columns. The loss is this x times theta, which gives me my y-pred, minus y, which makes my cost function exactly what I showed you right up there; it's a simple line like that. So if we run that, I have a cost function, and now I've got to do gradient descent. I'm going to define my gradient-descent function here; you can see the name, and you can give it any name you like. It takes x, y and alpha, then a fit-intercept, which defaults to true, and the number of iterations, which we're going to set to 5,000. The length of y is still going to be my number of cases, m, and if fit-intercept, which we are going to leave as true, I'm just adding a constant: a column of 50 ones, at the moment, that I'm horizontally concatenating onto x. That's how I get all of those ones; if not, it just uses x as-is. You'll see that option quite commonly, but in our instance we want those ones, otherwise things are not going to work out for us. So n is the number of columns, that's size of x, the second element, rows comma columns, which for us at the moment is 2, and then I've got to have starting guesses.
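The notebook itself is in Julia; as a cross-check, here is the cost function just described sketched in NumPy. The name `half_mse` and the toy data are mine, not the notebook's:

```python
import numpy as np

def half_mse(X, y, theta):
    """Half mean squared error: (1 / 2m) * loss' * loss, with loss = X@theta - y."""
    m = X.shape[0]           # number of samples: first element of the shape tuple
    loss = X @ theta - y     # y-pred minus y, an (m,) vector
    return (loss @ loss) / (2 * m)   # dot product squares and sums the elements

# toy check: 50 samples of y = 1 + 2x with no noise
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])   # prepend the column of ones
y = 1.0 + 2.0 * x
print(half_mse(X, y, np.array([1.0, 2.0])))  # exact parameters, so cost is (essentially) zero
print(half_mse(X, y, np.array([0.0, 0.0])))  # bad parameters give a positive cost
```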
I've got to start beta sub 0 and beta sub 1 somewhere, and I'm just going to start with two zeros, a column vector of two zeros. Then I'm going to have a little placeholder of 5,000 cost values, because each of the 5,000 times I run through this loop I'm going to get a cost value; remember, every time I get better and better values for beta sub 0 and beta sub 1, because my gradient descent is walking towards the minimum of the cost function. So I'm just creating a vector of all zeros, and then for iter, that's my variable name there, running through this range of 1 to 5,000: the pred, remember, is x times theta, and then I overwrite each one of those zero placeholder values with the cost function, given what my best theta is at the moment, and we're starting at 0, 0. Then we update it so that we have a new beta sub 0 and beta sub 1, and we run through all of that again. The next time, my pred is going to be slightly better, and my cost value slightly lower. I run through this 5,000 times, and at the end I return a tuple of my vector of unknowns, beta sub 0 and beta sub 1, and the 5,000 values of the cost function, which hopefully is going to come down. So let's run that here. We're going to run it 5,000 times, and because I'm returning a tuple, I'm going to split the results up so that I have them separately: for the cost values I just want the second element. And my time steps are going to be 5,000 of them as well, because I want to plot this. See how quick that was. Let's plot it. And there we go.
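Here is the whole loop sketched in NumPy, so you can see the structure described above in one place. The actual notebook is Julia; the function name, defaults, and toy data here are my own illustrative choices:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, fit_intercept=True, n_iter=5000):
    """Batch gradient descent for simple linear regression."""
    m = len(y)
    # horizontally concatenate a column of ones so beta_0 gets its own column
    X = np.column_stack([np.ones(m), x]) if fit_intercept else x.reshape(-1, 1)
    theta = np.zeros(X.shape[1])   # start beta_0, beta_1 at zero
    costs = np.zeros(n_iter)       # placeholder for the 5,000 cost values
    for it in range(n_iter):
        pred = X @ theta                             # y-hat
        loss = pred - y
        costs[it] = (loss @ loss) / (2 * m)          # half-MSE at the current theta
        theta = theta - (alpha / m) * (X.T @ loss)   # the update rule from the derivation
    return theta, costs            # a tuple, as in the video

# noiseless toy line y = 1 + 2x, so the true parameters are [1, 2]
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x
theta, costs = gradient_descent(x, y)
```

With these settings the cost curve falls steadily and theta converges to the true parameters; a too-large alpha would instead make the cost diverge.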
My cost started quite high, and with every step we took it came down, down, down, until by 5,000 iterations it was very low, at 1.92. That's a beautiful iteration. I see I didn't make use of this time-step variable; I could have put it there, but I just used the range 1 to 5,000, so I needn't have created it. But I want my 5,000 cost values, and you can see how they come down. Let's see what my theta values look like; that was the first element of the tuple that got returned. Let's store it in a computer variable called parm, short for parameters, and we see a beta sub 0 of negative 0.03 and a beta sub 1 of 9.43. Isn't that beautiful? Let's plot that, and you see there is my model, very close to the one I guessed before, but this is now after 5,000 iterations, so it's really getting close to the minimum for those two unknowns. We've used gradient descent to get to this model. Now let's just use GLM to see another way to do this. I'm not going to run through the explanation of that, because this video is about gradient descent and how to code it in Julia, but let's just use ordinary least squares. I'm not going to explain what ordinary least squares is; I've got a YouTube video on that, but it's another way to get to the best values of theta. It takes x transpose times x, the inverse of that, times x transpose times y. Or, if you break that up, and that's the video where I really explain how to get beta sub 0 and beta sub 1, there are closed-form equations, but only for a single feature variable, and you can see there how to do beta sub 0 and beta sub 1. So we've imported the GLM package. Well, first I should say we've got to write this as a data frame, so I'm going to have two columns in my data frame, feature and target: feature is going to be all the x values, and target is going to be all the y values.
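The ordinary least squares formula quoted above, theta equals the inverse of x transpose x, times x transpose y, can be sketched like this (again NumPy rather than the notebook's Julia and GLM; the data is mine):

```python
import numpy as np

x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x                        # noiseless, so OLS recovers [1, 2] exactly
X = np.column_stack([np.ones_like(x), x])

# theta = (X'X)^{-1} X'y; solving the 2x2 linear system is numerically
# preferable to forming the explicit inverse
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```

Unlike gradient descent, this closed-form solution needs no learning rate and no iterations, but it requires inverting (or solving with) x transpose x, which gradient descent avoids.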
And then I'm going to use the lm function in GLM; that just stands for linear model. We use the @formula macro and say target dependent on feature, in the data frame df; that's just the syntax. I'm going to store it in a computer variable called ols, because this is ordinary least squares; it's going to make use of this equation, the ordinary least squares solution for beta sub 1 and beta sub 0. Let's have a look at that. My ols cell is still running because it has to load GLM, and there we go: it finds these estimates. The intercept is minus 0.15, and for my feature, my x variable, beta sub 1, the value is 9.64. Those are the best values. We can see a p-value for each, and the lower and upper bounds of the 95% confidence intervals there. But let's also use those two closed-form equations; I've written them out in code, and you can definitely have a look at that. Then for beta sub 0 and beta sub 1 it's going to be 9.12 and 9.59, so very close to the optimized code used inside of generalized linear models here; it's slightly optimized compared to ours, but you can see it's very, very close. So now we can look at the best model: we can store all of those values, look at its MSE, and use that MSE against the base model's to get the R squared, the comparison to the base model that we did before. We use the r2 function here, and we see 0.686, exactly the same, so GLM uses exactly the same comparison function as we did. This is back to normal statistics: we can say the best-fit model explains 68.6% of the variance in our target. So that is gradient descent. I hope you now understand how to go from the concept to the code and why the code is so easy, but also understand gradient descent itself.
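The R squared comparison against the base model, which always predicts the mean, can be sketched as follows. The data here is synthetic, so it won't reproduce the 0.686 from the video; it only shows the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 50)   # a noisy line
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)       # OLS fit

resid = y - X @ theta
ss_res = resid @ resid                  # best-fit model's sum of squared errors
ss_tot = ((y - y.mean()) ** 2).sum()    # base model: always predict the mean
r2 = 1.0 - ss_res / ss_tot              # fraction of variance explained
print(r2)
```

An r2 near 1 means the fitted line explains almost all the variability that the mean-only base model leaves unexplained.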
The takeaway is that in AI today, and I'm talking specifically about deep learning, we can write a function that is a representation of how far away we are from the best possible prediction, and we can iterate over it, making the unknowns, the parameters as we call them, better and better, until, given those parameter values, we can predict an outcome quite accurately. That would be exactly the same for regression problems and classification problems in supervised learning. Remember, for classification we have a categorical variable as our target variable; here we have regression, with a numerical variable as our target variable. Try this yourself. Play around with Julia. Create some data, or at least bring in some of your own, and see if you can run linear regression as a model.