Welcome back. We're going to continue our look here into linear regression, really to cement our understanding of deep neural networks. I want to apologize again for all the banging that you're going to hear. They're building a neuroscience centre right outside my door. They are here very early in the morning and they stay until very late, so whether I record these before work or after work, it doesn't matter: the banging goes on every time I come to my office. If I have a free moment here, I almost have to bring ear plugs. It's just an absolute nightmare. So if you hear all the banging, there's nothing I can do about that. The reason behind these videos, remember: I want everyone to get involved with using deep neural networks to solve health care problems. I'm a surgeon; doesn't matter. I know about deep learning, I can use deep learning in my research, and I want everyone to be able to do that. I want people with domain knowledge in health care, or those very interested in health care, to reach out and work with people in computer science and mathematics who are already doing deep learning. I want us to work together, but it is us, with the domain knowledge and the interest, who really have to bridge that gap. We have to learn about deep learning. There is no excuse; the time is now to make the effort and learn about this. Now, these videos are not only for health care professionals or those involved in health care. Even if your interest is well outside of that and you only want to learn about deep learning, these videos will serve you well too. Now, as I mentioned in the previous video, this is an RStudio document written in Markdown that I've published on RPubs. I will write about it on Twitter, I will mention it on LinkedIn, the videos are on YouTube, the document is available here on RPubs, and the actual files are available on GitHub.
So subscribe on YouTube, follow me on Twitter, connect with me on LinkedIn, look at all my files here on RPubs, and also download the files for your own use on GitHub. I really want to spread the word about learning about deep neural networks. So in this video I'm going to continue looking at linear regression, just as an example of the basic concepts behind deep learning, before we get into any models. I'm going to use linear algebra and multivariable derivatives in this video. Don't run away. I'm going to make two extra videos just on the very basics of linear algebra and the very basics of derivative calculus, just to show you what it's about. You really don't need to understand it, though; you're going to write a simple line of code and the computer is going to do it all for you. But I think it's worthwhile at least to watch those two, just to remind yourself if you've seen it before, or just to get behind understanding it before you're using it. I think it's worth the effort, although it really is not absolutely necessary. Now, if you really want to know about linear algebra and multivariable calculus, I have two playlists on YouTube and I'll link to them; each of those playlists has, I think, over 100 lectures. If you're wondering about linear algebra or multivariable calculus, watch my videos on that. Now, back to linear regression and how it's going to help us. I'm not going to read from this; you can read it all on your own. In case you don't want to read it and would rather hear my voice explain a few extra things, continue watching this video. Remember that we are trying to predict some outcome, called a target variable. We can also call those actual values the ground truth. We're going to try and predict those ground truths by creating some model. When we have only a single input variable to predict the target variable, this is how we can predict it; this is a mathematical model of that.
So given any of the input variables, if I multiply it by some unknown called beta sub 1 here, and I add to that some other unknown, beta sub 0, it's going to give me a prediction of what it thinks the target value must be. And remember, if I add an error term to that, I get the actual value. This model is a model of a straight line. Remember from school: y equals minus a half x plus 2. That is the slope and that is the intercept, and that's exactly what this is. Beta sub 1 is the slope and beta sub 0 is the y-intercept. It is a model of a straight line. My model for predicting an outcome variable, a target variable, the ground truth, if I only have a single variable, is a straight line. As simple as that. We've looked before at the error that I make every time. So if I take whatever I predicted, this y hat sub i, given some input value, minus the actual ground truth value, and I square that, we're now going to call that error something else: a loss function. So if my target value is a number and my prediction is also a number, I can subtract those two. My loss function is just the squared difference between the two. And remember, I square it because I don't want positives and negatives to cancel out when I sum over all of the errors. But I'm going to call that a loss function. If my prediction was something else, you know, is this a malignant nodule on a CT scan or is it not, my loss function would look different; this would not be the loss function for that. So every kind of problem that we develop a deep neural network for will have a different loss function. It just depends on the type of variables that we're dealing with. This is not a CT scan; these are two sets of numbers, a real number I'm predicting and a real-number ground truth. So this would be one loss function; every problem will have its own type of loss function. Now, some people make a distinction between two terms here.
Some say loss function and cost function are the same thing. You know, it's a new field; there are so many terms and there's no real standardization of what they mean. So I'm separating these two terms out just for clarity's sake. I'm saying this is my loss function. But remember, I have more than one sample, so I've got to somehow combine and express all the errors in one number, and one way to do that is this: I sum up all my errors, each of them over each of the i's in the sample, and divide by how many there are. And I'm going to call this my cost function. So the cost combines all the individual losses. What is it, really? Well, it's just the average of all the individual losses. And you get other cost functions. This cost function is particular to this problem, where I have a numerical target which is predicted by a similarly numerical value. So this would be one cost function; other loss functions will have other cost functions, and we can look at them. And the computer code in the end is going to be very simple: they each have names, you just type in the name, and the computer knows what to do. No problem. But I want you to understand the concept. Now, this L, remember, was the loss. So I plug the model into the loss, because the y hat is actually beta sub 0 plus beta sub 1 times the input, and I plug that into the cost, and this is what I'm left with: the full thing. So the cost function, given these two unknowns, beta sub 0 and beta sub 1, is this. Beta sub 0 and beta sub 1 stay the same over all the i values: sample one, sample two, sample three. For each sample I do the multiplication, I do the addition, I subtract the real value, and I square that; then I sum over all of these and divide by how many there are. And that gives me the cost. Now, let's put this into action.
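To make the loss-versus-cost distinction concrete: the course code is in R, but the same arithmetic in a few lines of Python (the function names here are my own, purely for illustration) looks like this.

```python
# Squared-error loss for a single sample, and the cost as the
# average of the losses over all samples (the mean squared error).

def loss(y_hat, y):
    """Squared difference between one prediction and its ground truth."""
    return (y_hat - y) ** 2

def cost(predictions, targets):
    """Average of the individual losses over all samples."""
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / len(targets)
```

A different kind of problem, say classifying a nodule as malignant or not, would swap in a different loss, but the cost would still combine the per-sample losses in the same way.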
What we're trying to do here is get values for beta sub 0 and beta sub 1 so that we have the very best line, the one that makes the minimum error. The value of my cost function has to be as small as possible, and if I pick the right values for beta sub 0 and beta sub 1, I'm going to have the smallest possible cost. That is my aim. And putting it that way makes it very easy to transform into mathematical equations, and that is really what we want. So let's create a contrived example. I'm going to have five values: 1.3, 2.1, 2.9, 3.1, 3.3. Those are my feature variables. And I'm going to add some random noise to that (you can look at the line of code), so that I have a target value based on the feature value. I know that there's some connection between the two. And right here is where we plot it. So here's my first one: the input variable was 1.3 and the output variable is 0.7. Then it's 2.1 for my feature variable and my target value is 2.2. For this one it's 2.9 feature, 3.4 target; 3.1 feature, 1.9 target; 3.3 feature, 3.5 target. And I can actually just write them out here. You see the 1.3: I have to create something that, given 1.3, will predict 0.7. Given 2.1, it has to predict 2.2. Given 2.9, I want to predict 3.4. Given 3.1, I want to predict 1.9. Given 3.3, I want to predict 3.5. So let's plug that into this equation that we have here. This equation is just the sum over all of the x, y pairs that we've got here; each of these is an x, y pair. So let's plug them all in. There's the first one: the 1.3, with 0.7 as the true value. 2.1 is my predictor, 2.2 the true value. 2.9, 3.4. We subtract, we square each of those, we sum, and we divide by how many there are. And there's my cost function. Now, you can do all of that on paper and you'll see it comes out to this. There's my equation.
My cost is 6.55, minus 4.68 beta 0, plus beta 0 squared, minus 13.1 beta 1, et cetera, et cetera. This is an equation in two unknowns, beta sub 0 and beta sub 1, and I can graph it: it's going to be some surface over a two-dimensional plane, two dimensions because I have two unknowns, with the cost on a third, vertical axis. And this is what it actually looks like. So how can I minimize this cost function? C is here on my z-axis, my up-and-down axis. Well, it will be a minimum when these two values, beta sub 0 and beta sub 1, give me the lowest value on my z-axis. Somewhere along this surface (you can see, from left to right, it curves up, and it curves a little bit there) there are values for beta 1 and beta 0 that will make C on my z-axis the lowest possible. Those are the values I'm after. I'm trying to learn what these two values should be, beta sub 0 and beta sub 1, so that my cost function is at its lowest. That gives me the best values for my parameters, beta sub 0 and beta sub 1, and remember, I can use those as intercept and slope, so I can draw a line through these points that minimizes the error. Because remember, if the line goes here, then for a given input of 2.1 it might predict 1.75 right here, and there's a difference between 1.75 and the actual 2.2. But I want all of these errors along this line to be minimized, and the way to do that is to turn it into a mathematical equation. Here we have two unknowns; they draw a shape for me in three-dimensional space, and I want to know what value for beta sub 0 here, and beta sub 1 there, will give me a point down here somewhere: the lowest point of C on the z-axis. How do I do that? We do that through taking what we call partial derivatives.
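For reference, if you carry that expansion out on paper with the five data pairs above (this is my own working, rounded to two decimals, so treat the exact figures as approximate), the full equation behind the "et cetera" is:

```latex
C(\beta_0, \beta_1) = \frac{1}{5}\sum_{i=1}^{5}\left(\beta_0 + \beta_1 x_i - y_i\right)^2
\approx 6.55 \;-\; 4.68\,\beta_0 \;+\; \beta_0^2 \;-\; 13.13\,\beta_1 \;+\; 5.08\,\beta_0\beta_1 \;+\; 7.00\,\beta_1^2
```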
Now, I'm going to make that little video on partial derivatives for you, but here's the idea: if I take the partial derivative of C with respect to beta sub 0, I keep beta sub 1 as a constant; if I take the derivative with respect to beta sub 1, I keep beta sub 0 as a constant. And there I get the two derivatives. Now, what is a derivative? It is a slope. And I'm very interested in those slopes, because at the minimum my slopes are going to be 0. I'll give you a better explanation of that further down. But those are my two partial derivatives, and I want to set each of them equal to 0. So there's the first one, and there's the second one, and I'm setting them to 0, because I want to know where the first derivatives are 0. So what I can do is take the minus 4.68 over to the other side, and the negative 13.1 over to the other side, and I'm left with these two equations: this one here and this one here. So there are two equations in two unknowns, and I can write that as an augmented matrix. That's why we need linear algebra; we need both the calculus of derivatives, multivariable derivatives, and linear algebra. I can do elementary row operations and reduce this to reduced row echelon form (and again, watch the video on that), which gives me a solution: beta sub 0, which is about negative 0.53, and beta sub 1, which is about 1.13. So there we have it: beta sub 0 is about negative 0.53 and beta sub 1 is about 1.13. That's the intercept and the slope for the best possible line, because setting the slopes to 0 (don't worry about it for now) gives me the values for the best possible line. Let's cheat by using the linear model function inside of R: I give it a little formula, and we see the same two values there, the minus 0.53 and the 1.13, for my intercept and my slope. No problem. That's the basics of it. How did it do that?
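If you want to check that solution yourself: the course does this step in R, with row reduction and then lm() as confirmation, but here is a plain-Python sketch of the same two equations (both partial derivatives set to zero, the so-called normal equations), solved as a 2-by-2 system with Cramer's rule rather than row reduction. The variable names are my own.

```python
# Setting the two partial derivatives of the cost to zero gives two
# linear equations in beta_0 and beta_1 (the "normal equations").
# Solving that 2x2 system recovers the intercept and slope.
x = [1.3, 2.1, 2.9, 3.1, 3.3]   # feature values
y = [0.7, 2.2, 3.4, 1.9, 3.5]   # target values (ground truth)
n = len(x)

sx = sum(x)                          # sum of features
sy = sum(y)                          # sum of targets
sxx = sum(v * v for v in x)          # sum of squared features
sxy = sum(a * b for a, b in zip(x, y))

# Normal equations:  n*b0 + sx*b1 = sy   and   sx*b0 + sxx*b1 = sxy
det = n * sxx - sx * sx
beta0 = (sy * sxx - sx * sxy) / det  # intercept, about -0.53
beta1 = (n * sxy - sx * sy) / det    # slope, about 1.13
```

This is exactly what R's lm() does behind the scenes for a single predictor, just written out by hand.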
Instead of this three-dimensional space, which we have because there are two unknowns, let's reduce this to a single variable, so that we have two-dimensional space and not three-dimensional space. And I want to make it as simple as possible, with a very simple equation that you must have seen at school: y equals x squared. There it is. For every x that I put in, I land on the curve: 1 squared is 1, so I land at 1 there, and 2 squared is 4, so I land up at 4 there, et cetera. Very simple. Now, don't get this x and y confused with what we had above, the y being the target. No, no, no. I'm using x and y from school, but here the x represents just a beta 1. In fact, I'm going to use the x here to be beta 1, and I'm trying to find where on this beta 1 line, this x line, I must go to get this curve at its minimum, just as my model up there created that three-dimensional curve through my loss and cost functions. Imagine that my cost function is now just this thing, and I want to know where on the x-axis it is at its minimum. Now, it's very easy to see: the minimum is right down here, where x is 0. But imagine it's a convoluted shape, as we had up there; then it's not easy just to see where the minimum is going to be. And as we add more and more predictive values, feature variables, it's going to be a convoluted shape in multi-dimensional space. You wouldn't know where to start to get to the minimum, and it's that minimum we're after. Now, one way to go about it is what is called gradient descent, as opposed to what I used up there. Gradient descent says: let me just blindly start anywhere. Let's start here at negative 2. What is the slope at negative 2? Well, I've got to take the derivative of x squared, which is just 2x (watch the video on derivatives if you can't remember). And at negative 2, 2 times negative 2 is minus 4.
So if I draw a line that just touches this point here, where x is negative 2 (or I should say, in our analogy, beta 1 is negative 2), that will be the slope right here, and the slope is negative 4. You can see that a tangent line that just touches the minimum would have a slope of 0. So we're trying to get to a place where the slope is not negative 4 but 0. How do we get there from this point? Again, as I say, just looking at this it's easy, but imagine in multi-million-dimensional space: it's not that easy. But I can use this slope here to get closer to that idealised point where x is 0. And the way I do that: I take this negative 4 and I multiply it by a tiny little step, what we call a learning rate. Say that's 0.01. So 0.01 times the negative 4 gives me negative 0.04, and I subtract that from where I am, negative 2, which gives me negative 1.96. That is my step. So from negative 2, using my slope, which was negative 4, I step 0.04 places to the right, which brings me right about there, to negative 1.96. At negative 1.96 I get the slope again, I plug that slope in here, multiply it by the learning rate, and subtract that from where I was before, at negative 1.96, which brings me closer down. And so I go closer and closer down, and every time I see my slope getting closer and closer to 0, until I approach this spot here where the slope is 0. I've used gradient descent, using the derivative at each point, to get to a lower and lower and lower point. And just think about it for a moment: I can construct a cost function in no matter how many dimensions, and I can then use this idea of gradient descent to go down the slope. As we go up and up in dimensions, we have to use partial derivatives, not just a single derivative as here.
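Those few steps are easy to mimic in code. Here is a minimal sketch (in Python rather than the course's R; the variable names are mine) of gradient descent on y equals x squared, starting blindly at negative 2 with a learning rate of 0.01:

```python
# Gradient descent on y = x**2: start at x = -2, where the slope 2*x
# is -4, and repeatedly step by (learning rate) * (slope) downhill.
lr = 0.01      # the tiny step size: the learning rate
xpos = -2.0    # the blind starting point

for _ in range(2000):
    slope = 2 * xpos           # derivative of x**2 at the current point
    xpos = xpos - lr * slope   # first step: -2 - 0.01 * (-4) = -1.96
```

Each pass shrinks x by a factor of 0.98, so the sequence runs -2, -1.96, -1.9208, and so on, creeping toward the minimum at 0 as the slope itself shrinks toward 0.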
Here the derivative of y with respect to x is 2x; with more dimensions I have to use partial derivatives, and I have to walk in each direction. So if I go back up here: I just randomly start at a point, and that's what we do, we randomly start at a point. Then I walk in this direction and I walk in that direction. If I take those two steps separately, so I walk from here to here (I hope you can see the cursor) and then from there to there, that would have been the same as just walking straight down there. I'm still walking downhill, but instead of one single step I take two steps: one in this beta sub 0 direction and one in the beta sub 1 direction. And if I had multi-dimensional space, I would take all of those steps separately, and all of them together would eventually put me somewhere lower down the slope. So through all the partial derivatives, in all the directions, I combine all of them and walk further and further down the slope until I get to a point where the slope is zero. Think about it this way: you're standing somewhere in a valley, you close your eyes, and you have to walk down to the bottom. What you do is pick a direction and draw a straight line through your feet, and ask, just along that line, which way goes up and which way goes down. You take a step in the downhill direction along that line. Then you turn 90 degrees, orthogonal, and draw another line through your feet, perpendicular to the first one. Along that line you again see which direction is up the slope and which is down, and you take a small step down. Combining those two little steps would have been the same as stepping once, straight downhill. Now you again draw a line parallel to the very first line you had, decide which side is up and which is down, walk a little bit down, turn that line 90 degrees, and again ask which way is up and which is down, and take a little step down. Those two combined will be a step down. And so you can combine these steps at 90 degrees to each other, say left, right, left, right, left, right, and work your way, blindfolded, to the bottom of this valley, just thinking about where you're standing and which way is down and which way is up, in two different directions. If you were able to live in multi-dimensional space, of course, you would have to do this for each dimension. That's what we're doing here with partial derivatives: we look at a single direction and take a step for that one, then another single direction and a step for that one, and all of those steps combined would have been one big leap. That's what we're doing, and I'm simplifying it here by showing you how we step down using the slope, the partial derivative, in each direction. Eventually we're going to get to the bottom, and that minimum is where the values we are after, the parameters we are after, are at their best: where the prediction is going to be as close as possible to the ground truth. So that's it. We've really looked deeply now into linear regression, and how it fundamentally, in our heads, makes us understand that we can create this cost function and minimize it by walking down its slope, to get the very best values and minimize our error. And that is what we're going to do with deep learning. We're going to create a model, and in some of these models there are millions of parameters, not just beta sub 0 and beta sub 1. There are millions of them, and we call them weights in deep learning. Same thing: we're going to have millions of them, and we have to optimize them so that all of their values are at their very best, so that in the end our cost function is at a minimum and the error we make in our prediction is as close to the ground truth as possible. And there you have deep learning. So from here, we're going to start looking
at deep neural networks themselves. But hold on tight to these very basics. Watch the video on linear algebra and the video on derivatives; they will help you out. You don't have to, as I said, but if you really want to get behind this, doing research on deep learning itself as opposed to applied deep learning (which is what we do in health care: we apply it to a problem), then you've got to understand multivariable calculus and you've got to understand linear algebra. I've got those videos out, I've made plenty of those lectures, you can watch them, and I'll put the links down below. I'll speak to you again.
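As a closing footnote, the blindfolded valley walk for our two-parameter cost function can be sketched in a few lines (Python rather than the course's R; the learning rate and iteration count are my own choices). It lands on roughly the same intercept and slope, about minus 0.53 and 1.13, that setting the derivatives to zero gave.

```python
# Gradient descent on the two-parameter cost: take the two partial
# derivatives, step a little downhill in each direction, and repeat.
x = [1.3, 2.1, 2.9, 3.1, 3.3]   # feature values
y = [0.7, 2.2, 3.4, 1.9, 3.5]   # target values (ground truth)
n = len(x)

b0, b1 = 0.0, 0.0   # start anywhere, say at the origin
lr = 0.05           # the small step: the learning rate

for _ in range(20000):
    # error of the current line on each sample
    err = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
    # partial derivatives of the mean-squared-error cost
    grad_b0 = 2 * sum(err) / n
    grad_b1 = 2 * sum(e * xi for e, xi in zip(err, x)) / n
    # one small step in each direction; together, one step downhill
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

# b0 is about -0.53 (intercept) and b1 about 1.13 (slope)
```

With millions of weights instead of two, this loop (with a cleverer step rule) is exactly what training a deep neural network does.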