In the previous video, we introduced the idea of a neural network as a regression tool that takes x's to y's through a hidden layer of z's: input layer, hidden layer, output layer. We described why one would want to produce the z's from the x's in the form of a sigmoid of a linear combination: that gives you a differentiable version of a decision tree cut, except that because the linear combination is arbitrary, you can cut along any hyperplane, not just a coordinate hyperplane. And instead of making a discrete, sharp cut, we use a sigmoid function so that the cut varies smoothly across the boundary in a differentiable way.

Now, we want the output variable to depend on the z's as a linear combination for the following very simple reason. Suppose you want to describe some function of y versus x1 through xp, and it has different bumps and regions; let me draw a picture. Here's x1, here's x2, and here are my y's, and I have some wobbly surface with various bumps and holes. I want a way to produce this function. What you can do is think of each z as a wave crest, and by having those wave crests interfere, you can make any combination of bumps and valleys. So maybe I make one cut and have the surface swing up from 0 to 1 on one side of it, make another cut in a different direction with its own little sigmoid swinging up from 0 to 1, and a third cut with a little sigmoid of its own. In the region that sits on the "up" side of all three cuts, the sum has a height of approximately 3, while the other regions add up to about 2, or 1, or 0, depending on which side of each cut they lie. So we get a graph with a high spot in one sector and other regions that are lower by various amounts, just from adding up three sigmoid functions. If you take a linear combination instead of a plain sum, you can amplify one and suppress another, and by adding more and more sigmoids going in different directions, you can make a peak in some region or a valley in another, and in the long run essentially fit, or at least approximate, any integrable function this way.

This is very much like the idea from decision trees: with enough depth in your tree, you can approximate any Riemann integrable function, because after all, what you're doing is cutting the domain into rectangles and adding up rectangles. We're doing the same thing here, except that instead of straight-up rectangles, we're using high and low spots separated by a differentiable transition, and those high and low spots are cut along a hyperplane that can point in any direction. So it's reasonable to expect that you can approximate many things this way. So this is an ansatz, right? It's not the only possible way to model things. It's simply one way, and it's mathematically convenient.
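To make that picture concrete, here is a minimal Python sketch, with made-up weights and cut directions, of how a sum of sigmoids of linear combinations produces a surface with a high plateau where the cuts overlap. None of these numbers come from the lecture; they are purely illustrative.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Three hypothetical cuts b_m (hyperplane directions) in the (x1, x2) plane.
B = np.array([[4.0,  0.0],     # a cut across x1
              [0.0,  4.0],     # a cut across x2
              [3.0, -1.0]])    # a slanted cut
a = np.array([1.0, 1.0, 1.0])  # equal weight on each sigmoid "wave crest"

def f(x):
    # Sum of sigmoids of linear combinations: f(x) = sum_m a_m * sigmoid(b_m . x)
    return a @ sigmoid(B @ x)

print(f(np.array([ 2.0,  2.0])))  # ~3: on the "up" side of all three cuts
print(f(np.array([-2.0, -2.0])))  # ~0: on the "down" side of all three cuts
```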
Linear combinations of sigmoids of linear combinations, where we use a cutoff function to make a bump and let the bumps form interference patterns. With a single z, this seems like you're effectively doing a support vector machine kind of idea, but by introducing more and more z's, you can make more and more complicated functions. Now, why is this computable? That's really the question, right? This seems like an okay, but maybe kind of crazy, way to do things. With decision trees, the fitting wasn't too hard, because we could search along each coordinate separately and do some sort of logarithmic search process. With support vector machines, we're looking for a single hyperplane, so there's a simple optimization problem to solve. But here we're searching for more and more and more hyperplanes, and we have lots and lots of coefficients to decide. How could we possibly do that? The answer, as with so many things in optimization theory, is the notion of gradient descent.

So what we're going to do is think of the error of this model as a function that depends on the coefficients, and we'll simply apply gradient descent to that function. Remember, we actually have honest data here: data points (xi, yi), where this is the ith data point of our training data. And once we've built a model, we can compute the mean squared error, which I'll call r: it's 1 over n, where n is the number of data points, times the sum from i equals 1 to n of ri, where by ri I simply mean take the actual response yi minus the version predicted by our model, f hat of xi, and square it. So r is the residual sum of squares, averaged over the number of data points. Of course, the 1 over n is not super important mathematically; it's simply useful if you want to think of this as an average. And ri is the squared error of a single data point.

What we want to do is use r as a height function, not in terms of the x's or the y's, but in terms of the coefficients a and b. So think of r as a function of a and b; it's a multivariable problem. r is a function that takes in coefficients, the m a's and the m times p b's, and outputs a real number, and we apply gradient descent to it. Remember that gradient descent is a very simple idea from multivariable calculus. The gradient of a differentiable function points in the direction of most rapid increase; the negative gradient points in the direction of most rapid decrease. So if you want to minimize a function of some variables, you first make a guess, and then you take a small step in the direction of the negative gradient. By linear approximation, since the function is differentiable, the value there should be slightly less than it was before. Then you repeat and repeat and repeat: one step at a time in the direction of the negative gradient, with some small step length. That should gradually take you to a critical point, which ought to be a local minimum of the function. So we're going to minimize r by moving a and b. All we need is an initial guess and a way to compute the gradient. In fact, if you look at this equation, I don't really need to work with r as a whole; I can think about minimizing it on each individual data point, which is a more convenient way to write it down, because after all r is just a sum, and when you take a partial derivative of the sum you pick up the individual ri terms. So we'll minimize the ri's across all the data points by moving a and b. So that's the layout.
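Before applying it to the network, here is what that gradient descent loop looks like in isolation, as a minimal Python sketch on a made-up two-variable function. The function g, its gradient, the starting point, and the step size are all invented for illustration, but the update rule is exactly the one just described.

```python
import numpy as np

def g(w):
    """A toy differentiable function to minimize: g(w) = (w0 - 1)^2 + 3*(w1 + 2)^2."""
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2

def grad_g(w):
    """Gradient of g, computed by hand from the formula above."""
    return np.array([2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)])

w = np.array([5.0, 5.0])   # initial guess
tau = 0.1                  # small step size
for _ in range(200):
    w -= tau * grad_g(w)   # step in the direction of the negative gradient

print(w)                   # should be close to the minimizer (1, -2)
```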
We're going to apply gradient descent to the residual mean squared error of our model, starting with some initial guess of the coefficients, and then simply iterating until we're happy with the quality of the model, either because the gradient has gone to 0 or because we've achieved some reasonable notion of minimum error. This process, by the way, as we'll see, is called backpropagation in a neural network. It sounds fancy, but all it really means is applying gradient descent to a model which is almost linear: it's linear with a sigmoid function interjected in there. So backpropagation in neural networks is simply gradient descent from calculus.

Okay, let's do a little bit of analysis. The thing we want to know is the gradient itself, so we need the partial derivatives. The partial derivative of ri with respect to the coefficient am: well, remember that ri was the quantity yi minus f hat of xi, squared. So by the chain rule we have minus 2 times yi minus f hat of xi, to the first power; that's the power rule. Then we need to differentiate f hat of xi with respect to am. If you think it through, the model for y was a dot z, so the coefficient on am is zm. But what I'm actually going to write is z sub i comma m, and what I mean by that is: if you have a data point (xi, yi) and a current model a and b, you get an intermediate hidden variable zi, and z sub i, m is the mth component of that zi. So the partial derivative of the residual ri with respect to am is minus 2 times (yi minus f hat of xi) times z sub i, m.

Similarly, if you want to know how this varies in terms of the b's, well, the b's are on the inside of the function, right? So it's the chain rule: derivative of the outside function times derivative of the inside function. We still have minus 2 times (yi minus f hat of xi) to the first power, but then we have to differentiate inside z to get at the b's. So I need am times the partial of z sub i, m with respect to b sub m, p, because am is the coefficient sitting on that z, and b sub m, p lives inside it; that's the chain rule on the inside function. If I expand that a little, this partial derivative is straightforward: by the chain rule it's sigma prime of the quantity sum over q of b sub m, q times x sub i, q, all times x sub i, p. That sum is the input to sigma; I'm using a different index q there so it doesn't get confused with p, because p is fixed in this equation. I know this looks complicated written out like this, but just write down the equation for ri in terms of yi and f hat of xi, write down the definitions of how y depends on z and how z depends on x in terms of the a's, the b's, and sigma, and then apply the chain rule from calculus.

The thing that's very interesting about these equations, and this is not really special to neural networks but is something that often happens in gradient descent problems, is that because of the way the chain rule works, the term minus 2 times (yi minus f hat of xi) appears in both. That has to happen, by the chain rule: the derivative of the outside function is the same in both cases. Let's give it a name: call it delta i, noting that it depends on which data point i you're working with right now. Let's also give a name to that entire inner expression on the b side. What does it depend on? It depends on i and also on m, so call it epsilon i m, and it equals am times sigma prime of that inner sum, times delta i. So in fact we have a pair of equations that's very interesting. The partial of ri with respect to am is delta i times z i m, and that's just multiplication, right? i is fixed in this equation. And the partial of ri with respect to b m p is epsilon i m times x i p, and that's also just multiplication.
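Collected as equations, with f hat of xi being the sum over m of am times z sub i, m and z sub i, m equal to sigma of the sum over q of b sub m, q times x sub i, q, the pieces we just derived are:

```latex
\[
\delta_i = -2\bigl(y_i - \hat f(x_i)\bigr), \qquad
\varepsilon_{im} = a_m\,\sigma'\!\Bigl(\sum_q b_{mq}\, x_{iq}\Bigr)\,\delta_i,
\]
\[
\frac{\partial r_i}{\partial a_m} = \delta_i\, z_{im}, \qquad
\frac{\partial r_i}{\partial b_{mp}} = \varepsilon_{im}\, x_{ip}.
\]
```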
Okay, so together these are the backpropagation equations. The reason they're called that is because they make the gradient descent problem very easy to work with numerically. Here's what you do. You start with some initial guess for a and b; maybe you choose zeros or ones or something very simple, or maybe you have some a priori information about the system that you want to start with. So you guess a and b, and in the classic gradient descent way you compute the gradient: you ask how r depends on a and how r depends on b. But in the process of doing that, once you choose a and b, you get the f hats and all these intermediate quantities; they are computed when you evaluate your model for that a and b. Therefore, what you can do is take a and b, compute the predictions and the intermediate quantities, use those to compute delta, and use delta to compute epsilon. So you choose a and b, you get delta, you get epsilon, and once you have delta and epsilon, you have the gradient written down explicitly.

So what you do is a simple loop. Step one: guess a and b. Step two: compute r and the intermediates, that is, the sigmas and so on, for that a and b; that's running the particular model forward through this rule. Step three: compute delta i for the ith data entry, use that to compute epsilon i m, and use those two to compute partial ri partial am and partial ri partial b m p; those are the partial derivatives that make up the gradient. Step four: update, with a minus-equals; say am minus-equals some small step size tau times the corresponding partial derivative. Standard numerical analysis theory applies here, and you would want to choose a step size based on concavity and so on, but choose a small step size, update your model based on those partial derivatives you just computed, and now repeat. Of course, like any gradient descent problem, you simply repeat until you think you've found the minimum in a reasonable way, maybe because the gradient is shrinking to zero or maybe because you've hit some acceptable level of residual squared error.

So this is why neural networks are a useful concept. This system, going back to the original page, is called a feedforward neural network, which is solved by backpropagation. That sounds like a mouthful, but really what we're saying is that it's a feedforward network because we're introducing a hidden composition layer and feeding information from x's to z's and from z's to y's; it's feedforward in the sense that you move from x's to z's to y's, producing y's from x's. And solving it by backpropagation means that when you choose a's and b's and compute, you get an error on y, which shows you how to adjust a, which in turn shows you how to adjust b; the error propagates backwards and you recompute by gradient descent. So computing forward is feedforward, and moving the error back for gradient descent is backpropagation. This is the way you solve neural networks.
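To see that this loop really is computable, here is a minimal NumPy sketch of the whole procedure under the assumptions of this video: one hidden layer of M sigmoid units, a single linear output, squared error, a fixed small step size tau, and no constant terms (those come up in a moment). The data, sizes, names, and step size are made up for illustration; this is a sketch of the idea, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Made-up training data: n points with p inputs and one response.
n, p, M = 200, 3, 8                          # M is the number of hidden z's
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2     # some wobbly target to fit

# Step 1: guess a and b (small random numbers here).
a = rng.normal(scale=0.1, size=M)            # output coefficients a_m
b = rng.normal(scale=0.1, size=(M, p))       # hidden coefficients b_{mp}
tau = 0.01                                   # small step size

for sweep in range(500):                     # repeat until happy with the error
    for i in range(n):
        # Step 2: feed forward -- compute the intermediates for data point i.
        s = b @ X[i]                          # s_m = sum_q b_{mq} x_{iq}
        z = sigmoid(s)                        # z_{im}
        f_hat = a @ z                         # the model's prediction

        # Step 3: back propagate -- delta, epsilon, and the partial derivatives.
        delta = -2.0 * (y[i] - f_hat)         # delta_i
        eps = a * z * (1.0 - z) * delta       # epsilon_{im}; sigma'(s) = sigma(s)(1 - sigma(s))
        grad_a = delta * z                    # partial r_i / partial a_m
        grad_b = np.outer(eps, X[i])          # partial r_i / partial b_{mp}

        # Step 4: take a small step against the gradient, then repeat.
        a -= tau * grad_a
        b -= tau * grad_b

r = np.mean((y - sigmoid(X @ b.T) @ a) ** 2)  # residual mean squared error of the fit
print("mean squared error:", r)
```

Stepping one data point at a time, as this sketch does, is exactly the "minimize each ri" idea from above; you could equally well sum the gradients over all i before taking each step.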
Now, this classic neural network is great, but there are a couple of enhancements worth mentioning right now; these are written up in the notes I have posted on D2L. First of all, you don't have to have a single output y. If you think about what we're doing here, all we're saying is that you have linear combinations of sigmoids of linear combinations, so you could have many output variables y. For most regression problems you have a single y, but you can imagine having multiple outputs, which actually makes this much more powerful than many of our other regression methods, which can only deal with a single output. Secondly, I've been leaving off the detail of shifting by a constant. I can either think of that as including a column of ones in my original variables x, or as tacking on an extra constant term here. The equations don't change in any significant way; they look more complicated because you have a constant subtracted in every term, and that shows up in all your partial derivatives, but from a mathematical analysis point of view it's the same thing. It's just an affine function instead of a linear function. The other possibility, I should say, is that when you go from z's to y's, you could add an extra little function: you could say y is g of that linear combination. Especially if you're doing classification, there may be a reason to squash the output by a sigmoid function or some other classification function. Or maybe you have an a priori guess that y is modeled by some classic distribution, like, say, a normal distribution or a chi-square distribution, and you're evaluating what you expect some relative probability to be based on that. So maybe the function for y is a little more complicated than just a linear combination; it might be some function of a linear combination. But again, as long as this function g is something you can differentiate smoothly and apply the chain rule to, the whole idea of backpropagation still works. Okay, that is classic neural networks. The next video will be about the general idea of recurrent neural networks, which is doing this over and over and over again in time. And then after that, we'll wrap up with the notion of a long short-term memory network, which is an enhancement to a recurrent neural network that avoids some of the problems with the gradient computation.
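A footnote to that output-function enhancement: if the output really is squashed by such a g, the only thing that changes in the earlier training-loop sketch is delta, which picks up one extra chain-rule factor. A minimal illustration, assuming g is itself a sigmoid and reusing the names from the sketch above:

```python
# Hypothetical change if y is modeled as g of the linear combination (here g = sigmoid):
u = a @ z                                   # the linear combination of the z's
f_hat = sigmoid(u)                          # y is now g of that combination
g_prime = f_hat * (1.0 - f_hat)             # g'(u) for the sigmoid
delta = -2.0 * (y[i] - f_hat) * g_prime     # the extra chain-rule factor lands in delta_i
# epsilon, grad_a, and grad_b are then computed from delta exactly as before.
```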