This video is an introduction to classic neural networks. We're going to talk about neural networks for the purposes of regression, but as will be clear from the algorithm, they are also very good at classification. In fact, most applications of neural networks nowadays are classification problems. But to keep things clear, we'll talk about regression. Suppose, as always, that we have predictors x, with n observations of P different predictor variables, and we want to produce a regression, call it f hat, that does a good job of predicting the response variable y, of which there are also n values. So we're looking for a function y equals f hat of x, and the idea of a neural network is to break this up into two steps through composition. So what we're going to do is think of each predictor x as an input to some function. I'll draw those as circles. I'll use capital P for the number of predictors and lowercase p as an index, so I have x1 through xP, and we want to produce the variable y. But to do so, we're going to introduce an intermediate layer of variables, which we'll call z, so we go from x to z to y through composition. And let's suppose we introduce capital M of those z's. We want the z's to depend on the x's, and then y will depend on the z's. In other words, we're going to describe our function as f hat of x equals y of z of x, for some z, which is the vector z1 up to zM.

So why do we draw it this way? Well, it's traditional in the world of neural networks to think about what's actually going to be a linear map as a pile of coefficients: if each arrow here represents a coefficient, then y is a linear combination of the z's and each z is a linear combination of the x's. In fact, we're going to make an ansatz, that is, a hypothesis, a premise for our model, of the following form. We assume that y of z is of the form: the sum over m of coefficients am times zm. So y is a linear combination of the z's. And we're going to assume that each zm, in terms of the x's, is, well, I want it to be a linear combination, but there's one detail we're going to add, and that will be the subject of the next slide: I'm going to add up, from p equals 1 to capital P, coefficients bmp times xp, where xp is the pth coordinate. That sum is a matrix b times the vector x in some sense, but I'm going to take that output and adjust it by a sigmoid function.

Just some terminology: in the language of neural networks, the predictor variables, the inputs, are called the input layer; the y's are reasonably called the output layer; and the z's are called the hidden layer. I will discuss in the next slide why we want to lay things out this way. But overall, the idea is to predict y based on x by factoring through an intermediate variable that we call z. The y will be a linear combination of the z's, and the z's will be linear combinations of the x's, except that they'll also be adjusted by a function sigma. So we go from input layer to hidden layer to output layer. Each of the arrows from input to hidden layer represents one of the coefficients bmp, and there are capital M times capital P of those; each of the arrows from hidden layer to output represents a coefficient am. So if the function sigma is fixed and known in advance, what we're really trying to do is determine these capital M coefficients and these capital M times capital P coefficients. All right.
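To make that structure concrete, here is a minimal numpy sketch of the forward pass just described, one input vector at a time. The names forward, B, and a are my own, and the random coefficients simply stand in for whatever values a fitting procedure would eventually produce.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, B, a):
    """Forward pass of the single-hidden-layer network sketched above.

    x : shape (P,)   -- the input layer (predictors x1..xP)
    B : shape (M, P) -- coefficients bmp from input layer to hidden layer
    a : shape (M,)   -- coefficients am from hidden layer to output
    """
    z = sigmoid(B @ x)   # hidden layer: zm = sigma(sum_p bmp * xp)
    y_hat = a @ z        # output layer: y_hat = sum_m am * zm
    return y_hat

# Tiny example with P = 3 predictors and M = 4 hidden units (arbitrary values).
rng = np.random.default_rng(0)
x = rng.normal(size=3)
B = rng.normal(size=(4, 3))
a = rng.normal(size=4)
print(forward(x, B, a))
```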
Why is a model like this useful? I think it's very important to think of this in the context of decision trees and random forests, and also in the context of support vector machines. If you think about doing a regression problem with decision trees, the problem is that if you have, say, two variables x1 and x2, each decision tree cutoff is only allowed to be of the form x1 equals a constant or x2 equals a constant. So the key limitation of decision trees, in addition to the fact that they overfit very easily and so on, is that in terms of the actual structure, the model relies on the original coordinate system. If your data happens to be spread out in some diagonal fashion, or in something more complicated, it's going to be very hard: you'll have to make lots of vertical and horizontal cuts to capture that diagonal behavior. So how could we improve this? What you do instead is say: what I really want is to allow myself to make cuts and regressions based not just on coordinate hyperplanes, where each individual variable is set to a constant, but on any hyperplane, in the same way we do for a support vector machine. So I might want to draw, say, a hyperplane like this, or maybe a hyperplane like that, or some combination of those. I want to match my data not on a particular specified family of hyperplanes, but across all possible hyperplanes. So what does a hyperplane look like in x1, x2, or in x1 through xP? A hyperplane looks like b1 x1 plus b2 x2 plus dot dot dot plus bP xP equals some constant, call it b0. That's, after all, the standard equation of a hyperplane. So think back to the previous slide. We said that we want zm to be sigma of the sum, from little p equals 1 to capital P, of bmp xp. Ignoring the sigma for a moment, this is exactly the function part of the equation of a hyperplane. So what's going to end up happening here is that we'll make decision boundaries based on level sets of arbitrary hyperplanes, not only on the coordinate hyperplanes that you use in a decision tree.

Now, this is where the sigma comes in. Sigma has to be what's called a sigmoid function. Sigmoid is actually a class of functions, but usually it refers to one particular function, sigma of t equals e to the t over 1 plus e to the t, which can also be written as 1 over 1 plus e to the minus t. This function has very nice properties: it's the logistic curve, the classic s-shaped curve, which approaches 0 as you go to minus infinity, approaches 1 as you go to positive infinity, and equals one half if you plug in 0. It's smooth, in fact infinitely differentiable, so it's a great function to work with for calculus purposes, and it looks something like the classic indicator function, the so-called Heaviside function, which jumps from 0 to 1 at the origin. There are other examples of sigmoid functions; popular choices include the arctangent, which maps the whole real line onto the interval from minus pi over 2 to pi over 2, just like inverting the tangent of an angle, and the hyperbolic tangent, which is closely related to the original logistic function. The idea here is that when we make a cut with a hyperplane, we want to talk about being on one side versus the other side of a certain level of that hyperplane.
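As a rough illustration of those sigmoid choices, here is a small sketch comparing the logistic function with rescaled versions of arctangent and hyperbolic tangent, alongside the Heaviside step they all approximate. The rescalings onto the interval (0, 1) are my own normalization, not something fixed by the lecture.

```python
import numpy as np

def logistic(t):
    """sigma(t) = e^t / (1 + e^t) = 1 / (1 + e^{-t}); 0 at -inf, 1 at +inf, 1/2 at 0."""
    return 1.0 / (1.0 + np.exp(-t))

def arctan_sigmoid(t):
    """Arctangent, rescaled from (-pi/2, pi/2) onto (0, 1)."""
    return np.arctan(t) / np.pi + 0.5

def tanh_sigmoid(t):
    """Hyperbolic tangent, rescaled from (-1, 1) onto (0, 1).
    Note: 0.5 * (tanh(t/2) + 1) is exactly the logistic function."""
    return 0.5 * (np.tanh(t) + 1.0)

def heaviside(t):
    """The discontinuous 0/1 indicator that the smooth sigmoids mimic."""
    return np.heaviside(t, 0.5)

t = np.linspace(-6, 6, 7)
for f in (logistic, arctan_sigmoid, tanh_sigmoid, heaviside):
    print(f.__name__, np.round(f(t), 3))
```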
So let me draw a picture of this. Suppose I'm working in two-dimensional space x1, x2, so in this case capital P equals 2, and I consider a function of the form b1 x1 plus b2 x2, where b1 and b2 are some fixed numbers. If I draw the level sets of this function, I get a family of parallel hyperplanes. Well, they're supposed to be drawn parallel, anyway. Now suppose I'm working on a data problem where the data is concentrated up here and does not appear down there, or maybe it takes a different value down there; maybe there's a class distinction between this region and that region. What I'd like to do is make a decision cut right about here and say that the property of being an x versus an o is true here versus false here, or rather that the function takes one value here and a different value there. So we will say yes and no by running a sigmoid function along these levels. Imagine you have a sigmoid function, which makes that transition from 0 to 1, and in this picture we're feeding it this value so that it cuts off at some level. So we could take the sigmoid of b1 x1 plus b2 x2 minus b0, where b0 is the level at which you want to cut; if you leave that off, you'd always be slicing through the origin. So combining a sigmoid function with a linear combination is a way to express the idea of making a decision boundary with two important properties. One, it's a smooth function: the cutoff is not discrete, it's a differentiable function with derivatives of all orders, and it's very easy to work with because it's based on exponentials. Two, you can use any level of any family of hyperplanes. Those two properties together are why zm takes this form, a sigmoid of a linear expression like that: feeding what's effectively a family of parallel hyperplanes into a sigmoid gives you a picture like this. Now, to be precise, you might want to include the minus bm0 term here, or you can fold it into your definition of sigma, or you can include it by having one of your x-coordinates be identically 1. It doesn't really matter; there are many ways to include that constant. For simplicity of presentation I'm suppressing a lot of constant terms, but there are many ways of including them with minimal disruption. Okay, I'll pause the video here and we'll pick it up in the next one. But this is why we want to introduce the hidden variable z, which depends on the x's through a sigmoid of a linear combination. The next question is how y depends on the z's.
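Here is a small sketch of that two-dimensional picture. The coefficients b1, b2 and the cutoff level b0 are arbitrary numbers chosen for illustration, not values fitted to any data.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# One hidden unit z = sigma(b1*x1 + b2*x2 - b0): a smooth yes/no for being on
# one side of the hyperplane b1*x1 + b2*x2 = b0.  Arbitrary illustration values.
b1, b2, b0 = 1.0, 2.0, 3.0

def z(x1, x2):
    return sigmoid(b1 * x1 + b2 * x2 - b0)

print(z(4.0, 4.0))    # far on one side of the line      -> close to 1
print(z(-4.0, -4.0))  # far on the other side            -> close to 0
print(z(1.0, 1.0))    # exactly on the line (1*1 + 2*1 = 3) -> 0.5
```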