Okay guys, so Reza is not here today, which means you get me. So before we start, there were a bunch of questions about how to turn in the homework, and I guess there was a little ambiguity in the slide from the first lecture. If it's a programming assignment, send me a copy of your source code. If it's not a programming assignment, if it's just handwritten work, I don't need an emailed copy of that. But I do need hard copies of everything: your code, any figures that your code generates, any responses to the questions, and then obviously any mathematical proofs that Reza asked you to do. Does that make sense? So I only need copies of the source code, sent to the email account that's on the course website. Okay. You want a hard copy of the figures generated? Yes. Okay. I kind of panicked on that. Can I submit that later? For this time, I'll let it go. But did you submit the figures? I submitted the figures. Okay, for this one I'll let it go. In the future, please print out your code, any figures that your code generates, and any responses to the questions. I grade the hard copies; it's just easier for me to go through and grade the hard copies and give them back to you. The emailed copies are there just to make sure, one, that your stuff works and I don't have to retype it all, and two, in case there's any question of cheating or anything like that, we have a digital copy that we can go back to and no one has anything to say about it. Okay, does that make sense? That's okay with everyone? Yeah. You can use whatever programming language you want. I guarantee you, if you can write it, I can read it. That said, the vast majority of people use MATLAB, and this first homework assignment could have been done in about five lines of MATLAB code.
So choose whatever language you want, but some of the programming assignments are much easier in one language versus another. So it's up to you. But if you can write it, I can read it. I'm not worried. Okay, any other questions before we get started? Okay. So last class, or two classes ago, Reza talked about the perceptron, and you did a homework assignment on it. So in the perceptron, as you recall from homework one, what you had was some series of inputs, x1, x2, all the way up to xn, and they provided some weighted sum. So you had weights w1, w2, all the way up to wn, as the input to a summation, and then you passed the result of that summation through some nonlinear function, in this case the sign function, to get some estimate of your output, which we called y hat. Right, so that's sort of the simplest of all these linear classifiers, and what it does is linear classification. You want to learn the weights such that you minimize some error between y hat, the estimated output, and the true output y, which in the case of the homework assignment was given to you: it was whether or not the person liked the movies, based on the features that we had selected previously. So our goal here is to find some vector of weights w, which I'll denote w with the bar underneath it, meaning a vector (it's the same notation that Reza uses), equal to w1, w2, all the way up to some w sub m, transposed. And so given a new set of data: we provided you with a training set which featured 10 features for every movie, and you fit them. But given a new data set, you want to see how well your algorithm classifies that new data, right? You have a training set on which you find weights given some known outputs, and then you apply those weights to new data: you want to predict whether or not people will like a new movie. And so how does this generalize to other examples?
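Since the perceptron from the homework comes up again below, here is a minimal sketch of the prediction and update steps just described, written in Python/NumPy rather than MATLAB; the function names and the learning-rate parameter are my own choices, not from the assignment:

```python
import numpy as np

def perceptron_predict(x, w):
    # y_hat = sign(w . x); the sign function is the nonlinearity
    return np.sign(np.dot(w, x))

def perceptron_update(x, y, w, lr=1.0):
    # Classic perceptron rule: adjust the weights only when the
    # prediction is wrong, moving w toward (or away from) x.
    y_hat = perceptron_predict(x, w)
    if y_hat != y:
        w = w + lr * y * x
    return w
```

Repeatedly applying `perceptron_update` over a labeled training set is the iterative learning loop contrasted later with the one-shot normal equation.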
So I'll call it: how does this generalize to new samples? Right. And this is the problem of validation, which we're going to talk about today. Okay. So before we get into talking about validation, we need a construct that describes how much things cost us when we fail to predict accurately. Our first goal is to find w, and then, given some vector of weights, to ask how well the w that we found generalizes to new data. So the first step before we can even talk about validation is finding w. And to find w, we're going to, for the rest of the class really, talk about cost functions or, as Reza uses the word, loss functions. So a loss function, which we're always going to denote J, assigns a cost for being wrong. And what you want to do is attempt to minimize this loss function, i.e., find w to minimize J. And so the theme, obviously, that we're going to keep coming back to through this class is: how do we minimize J? Well, in classical calculus, J is going to be some function of the inputs and of our vector of weights. And so you're going to see, multiple times, that we take the derivative of J with respect to w, set that equal to 0, and find the minimum point of that function. So we're going to find w where the derivative of J with respect to w equals 0. Does that make sense? So imagine we have some data set D. D is composed of a series of x vectors: some vector x at some point in time, I'll call it 1, and some output, which can be a vector but in this case I'm going to call it a scalar, y of 1, all the way up to x vector at some time point n and y of n. So we have a data set in which we get pairs of values, the x's being inputs and the y's being the outputs of some unknown process. So the y's are our outputs, and the y hats are going to be our estimated outputs.
And so we can come up with a very simple loss function. We can say our loss is going to be equal to 0 when our estimated output, y hat in this particular case, equals our actual output. So there's no cost to being right. But you can imagine that there would be a cost to being wrong: any time y hat is not equal to y, any time our estimate does not equal our true output, we have some loss associated with our estimate. And what we want to do is minimize this loss function, the loss being a function of some input vector x and some weight vector w. Does that make sense? So we're going to adjust the weights of this function f, which has inputs x and w, to minimize our loss function. We may want to minimize our loss function over some training set, just like we gave you in the perceptron learning homework. We have some training set, we've collected some sample of data, and then we want to find the weights that minimize that loss function. So in that particular case, the function that we're going to be minimizing, which I'll call J, is going to be equal to 1 over n, n being the number of samples that we have in our training set, times the sum from i equals 1 to n of some loss function. In this particular case, we're going to make it a quadratic loss function, and we'll see that quadratic loss functions have some nice properties that we can use later on. So in this case, our loss function is going to be y, our output, minus our estimated output y hat, which in this case is some function of x and w, and we're going to square it. Because it's quadratic, we'll see it has nice properties: basically, we can find the derivative, and the zero point of the derivative is going to be the minimum of that function. So a lot of times you'll see cost functions that are squared. There are cases where you don't have squared cost functions. In those cases, you still do the same thing.
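The quadratic loss just written out can be sketched in a few lines; this assumes NumPy and treats the estimates y hat as already computed:

```python
import numpy as np

def quadratic_loss(y, y_hat):
    # J = (1/n) * sum_i (y_i - y_hat_i)^2, the mean squared error
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)
```

Note the loss is exactly zero when every estimate matches its output, and grows quadratically with each error, which is what makes the derivative-based minimization below work out so cleanly.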
You're going to find the derivative and set it equal to zero, as we'll see in this example, but you need to do a little bit more work on the back end to make sure it's actually a minimum. So we're going to find the w that minimizes J. Anyone have any questions so far? So we have some y. This was the output of our system, whatever we were given in our training set. And we have some estimated values. So what we're doing here is minimizing J, the cost function, over some training set. Why are we minimizing the cost function over some training set when we really actually want to test this on another set of data? We have a training set on which we learn the weights, but eventually we're going to want to apply those weights to something else, some different set of data. So why aren't we just fitting that test set of data? Well, for one thing, most of the time we don't actually have the y values, the output values, for that test set of data. We only have a limited training sample. But in addition, we make some assumptions about that test sample. So why are we minimizing J over a training sample as opposed to a test sample or some other sample of data? Well, we assume that the training sample and the test sample come from the same distribution; they're independent samples from the same distribution. If they're from different distributions, learning weights over our training sample isn't going to help us very much on our test sample. But if they're from the same distribution, if we assume that pairs of inputs and their associated outputs are drawn independently and at random from the same distribution in both cases, it shouldn't matter whether we minimize J over our training set or our test set; we're going to come up with the same solution. So we assume that the training sample and the test sample are drawn independently from the same distribution.
So now that we've covered all the things that we need to cover, we're going to go through some particular examples. We're going to be calling these regression problems, and we're going to go through a couple of different algorithms to solve them. So in order to solve a regression problem, you need to do three things. The first is you need to have some hypothesis about the underlying model of the data. What that means is: is this linear? Is this quadratic? Is this some fifth-order polynomial? You have to have some hypothesis about how the data was actually generated, how x's result in y's. So what is the relationship between the weights w and our output? Is it linear, quadratic, et cetera? So that's the first thing: you have to have some hypothesis about how the data was generated, or you can't really go on from there. The second thing you need is to specify your loss function: how do you penalize a wrong output from your generative model? And three, now that you have those two things, you actually need to solve the minimization problem. So for example, let's assume that we have some loss function Jn, which is a function of our weights w, given in the same quadratic terms as before: 1 over n times the sum over all n observations of y sub i, the output of our real process, minus some function of x and w, squared. Same thing I had over there. So what does this mean exactly? Well, if we assume that the underlying model that generated the data is linear, we can rewrite this using our linear notation: y at the i-th data point, minus some w naught, which is a bias, minus some w1 times our input x1 at time point i, all squared. So this is the simplest linear case. You have a bias term, and you have a slope term.
So your weight vector, if you thought of the weights as a vector here, is a two-dimensional vector, has two elements in it, one, the bias term, and two, the slope term. So things to note is that this function is quadratic in terms of W, which, as I said before, gives us some nice properties of the derivative. Basically, we can find the derivative, and we know that the point where that derivative is equal to 0 is going to be some minimum. So what we really want to do in this particular case, we have two unknowns, W naught, which is our bias, and W1, which is the slope of our line with respect to X1. What we want to do is we want to find the derivative, D, with respect to W naught of Jn, I'll just call it Jn of W, equal to 0, and some derivative also with respect to the first weight, the slope weight of Jn of W equal to 0. And so we have two equations, and we're going to set the derivatives equal to 0. We have two equations, we have two unknowns, we solve these two equations, we get our two unknowns, and we find basically the best W naught and W1 that minimize the error, the squared error, between our observation and our predicted values. I'm going to stop there and see if anyone has questions. Anyone? No? So that makes sense. We're just taking the derivatives with respect to W naught and W1 and setting them equal to 0. So we have a minimum at some point in W space, in our weight space. OK, so let's actually run through this problem then, just because it's probably been a long time since calculus for you guys. So we're going to run through this one, and then we're going to do the same thing, but in matrix notation, which hopefully will be a review of matrix derivatives for you guys. So given this function, this quadratic loss function, which assumes the linear model, let's find the W naught and the W1 that effectively minimize it. 
So if we take the derivative (the board is sneaking up on me), if we take the derivative of the function with respect to, which one did I take first? Let's do w1 first. If we take the derivative, remember the derivative is a linear operator, so when I take the derivative of this function, I can pass the derivative right through the 1 over n. The 1 over n is just a scalar, times the derivative with respect to w1 of the sum over i equals 1 to n of our loss, y sub i minus w naught minus w1 x1 sub i, squared, right? And again, the summation is also a linear operator, just like the derivative, so we can take the derivative inside the summation, and you get 1 over n times the summation from i equals 1 to n of the derivative with respect to w1 of y sub i minus w naught minus w1 x1 sub i, squared. And now we can apply the derivative, remembering the chain rule, and we get 1 over n times the sum of, because this is squared, our function on the outside: 2 times the quantity y sub i minus w naught minus w1 x1 sub i, times, anyone? Times minus x1 sub i, right? Everyone happy with that? Now we set that equal to 0. Let's do the same thing with respect to w naught. This one's a little bit easier, so I'm going to do it in two steps instead of three. We get 1 over n times the sum of the derivative with respect to w naught of y sub i minus w naught minus w1 x1 sub i, squared. And this is going to be equal to minus 2 over n times the sum of y sub i minus w naught minus w1 x1 sub i. The 2 came out in front, as did the negative sign from multiplying by the negative 1, right? Chain rule. Okay, so we have two equations; I'll call this equation 1 and equation 2, where this is equal to 0 as well, right, from here. We have two equations and two unknowns. We can substitute one into the other, or however you like to solve these, and you'll end up with the w naught and w1 that minimize this loss function.
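Solving those two equations by substitution gives the familiar closed form for the intercept and slope (w1 is the sample covariance of x and y over the variance of x, and w0 is chosen so the fit passes through the means); a sketch, with variable names of my own choosing:

```python
import numpy as np

def fit_line(x, y):
    # Setting dJ/dw1 = 0 and dJ/dw0 = 0 and solving the pair of
    # equations yields these closed-form least-squares estimates.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    w0 = y.mean() - w1 * x.mean()  # line passes through (x_bar, y_bar)
    return w0, w1
```

For data lying exactly on a line, the recovered w0 and w1 are exact; with noisy data they are the least-squares estimates.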
Does that make sense? Okay. So what's nice about these two equations is that they actually tell us a little something about what the residuals of our model fit look like. And when I say residuals, what I mean is those pieces left over, right? Yes? So, just to prove that this minimum exists: might this still not have a solution for w0 and w1? I think it's always going to have a solution, because it's a continuous function, and it's squared, so the derivative is known. Both of those two equations are linear in w1 and w0. Correct, there should not be any place where w is undefined in this particular case. Does that answer your question? Okay. And that's one of the benefits of a quadratic loss function as opposed to other types of loss functions you could use. With the quadratic loss function, the way we set it up with a linear model, as you said, you're always linear in w, so you can find the w naught and w1 that satisfy the equations. Is it always the case that you end up with a convex problem? In this particular case we have a convex function, but do you run across cases where it's not convex, where you have to use different techniques? In this class, I can pretty much assure you that it's always going to be convex, and you're going to come up with the minimum. However, if your model is different, and we'll talk about different models at the end, you'll see that it's no longer linear in terms of the x's, but it is still linear in terms of the w's, based on the way you set up the problem. So as long as you have a quadratic loss function, I think what I can say is it's always going to be convex, and you're always going to find a minimum. Okay? I'll get back to you on it, but I'm fairly certain I can say that. Okay. So a little bit of notation, just so everyone's on the same page. So we've always had y.
This is the output of our actual process, whatever that is. Our inputs have been x's, which have been a vector. These are the inputs. And we have had some predicted output, which we've always been calling y hat. Similarly, you can have W, which is a vector. And that's the actual weights used to generate the process, and W hat, which are the predicted weights. So if you knew the underlying process exactly, if you knew it was a linear process and had some W's, you would know W without the hat. Whereas what we're doing, the problem that we're talking about today, is estimating our W's, estimating our y hats given some series of inputs. Does that make sense? So the difference between y and y hat, the actual output and the predicted output, I'm going to define triple equals sign to be equal to y tilde. So y tilde is the error. You can think of this as the actual, it's not the loss function, but it's the loss. It's the error that's associated with your actual output and your prediction. And so what we've had, by definition thus far, is that continuing our problem up there, y tilde, by definition, is going to be equal to y sub i, so y tilde sub i, minus W naught minus W1 x sub i. Does that make sense? All I did was substitute in y hat, which we said was a linear function. So based on that, what do we know thus far? Well, from equation one, I can substitute in y tilde. So I'm bringing down equation one, which was 1 over n times the summation. And so what I can say is that this is equal to 2 over n. I'm bringing the 2 out in front times the summation of i equals 1 to n of y tilde sub i, the error, times minus x sub i, and by definition, that has to be equal to 0. Right? Everyone see what I did? All I did was substitute in the definition for y tilde. Remember, y tilde is our error. So what does this imply? Well, what it implies is that the 2 over n doesn't really matter in this particular case, because it's all equal to 0. 
But this says that the residuals, which is the error between the prediction and the observation, y, is not a linear function of x. OK? So there's going to be no linear trends. Similarly, from equation two, and everyone can still see that, from equation two, what we know, I can do the exact same thing. And I can say this is minus 2 over n times the sum of i equals 1 to n substituting it in. Now, I just get y tilde equal to 0. What does this imply? Well, remember, this is just taking the mean. It implies that the mean of all residuals is going to be equal to 0. Does that make sense? So what does that look like? Well, so if you had some x and some y tilde, the error, I'm going to extend this axis. You're going to have points whose mean is 0. These are the residual values for all of my x's. It's not going to be linear in x. I can't draw a line straight through my residuals when this is 0. But notice that this has nothing about higher order functions of x. So I drew it in a very specific way. This looks, I don't know, sinusoidal or something. So it says nothing about the higher order powers of anything. It just says that there are no linear, there's no linear trend in our residuals from the first equation. And it says that the mean of our residuals is equal to 0. So those two properties come out. Does anyone have any questions so far? So what we've done thus far, just to review, is we've talked about loss functions. In particular, we've talked about quadratic loss functions. And we've derived a very simple set of equations based on a linear model. And we've taken the derivative of that loss function of our quadratic loss function, assuming a linear model, and found w0 and w1 that minimize that loss function. Now, the way we're typically going to do that in this class is instead of looking at summations because they're rather cumbersome, is we're actually going to be looking at it in matrix notation. It's a little bit easier to look at. 
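Those two residual properties (zero mean, no linear trend in x) are easy to check numerically. This sketch fits a line by least squares on synthetic data; the data, seed, and variable names are invented for illustration:

```python
import numpy as np

# Make noisy linear data y = 2 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

X = np.column_stack([np.ones_like(x), x])      # design matrix with bias column
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
y_tilde = y - X @ w_hat                        # residuals

mean_resid = y_tilde.mean()   # equation 2: mean of residuals is 0
corr_with_x = y_tilde @ x     # equation 1: residuals orthogonal to x, no linear trend
```

Both quantities come out at machine precision; nothing, however, forces the residuals' higher-order structure to vanish, exactly as noted above.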
And so I'm going to do the same thing that we did over there. No new concepts introduced, but I'm going to talk about it in terms of matrix notation. We're going to do the exact same process: we're going to define our model, we're going to define our loss function, and then we're going to solve the minimization problem by taking the derivative of that loss function with respect to, now, a vector w, and setting that equal to zero. Does that make sense? Okay, so a couple of definitions, just like we had over there. Let's let y, our vector of outputs over all trials, be equal to y sub 1 all the way up to y sub n. And just because this is the first time we've really talked about matrix notation, I'm going to put down the sizes so that you can see how everything cancels out. So this is an n by 1 (I'll use lowercase) matrix, right? And my matrix x is going to have 1s in the first column, corresponding to the bias term in our weights, followed by some x on trial 1, x on trial 2, all the way up to x on trial n. Okay? Does this make sense? And the size here is n by m; the y was n by 1, right? And our vector w of weights is going to be equal to w naught, which is the same thing as over there, our bias term, all the way up to some w sub m. So this is m by 1. So what is our generative model in this particular case? Well, we can say that y is equal to our x matrix times our weights. Right? That's our generative model. y is n by 1, x is n by m, w is m by 1, and so what we get out is n by 1. Does that make sense? So w naught is being multiplied by the ones column, and w sub m is being multiplied by the m-th column of the matrix x. Okay? So given our generative model, what is our loss function here?
Well, our loss function, using the same notation as before, Jn of some vector of weights w, was equal to, we've written it a bunch now, 1 over n times the sum from i equals 1 to n of y sub i minus w naught minus w1 x1 sub i, squared. Right? Which we can write in matrix notation as 1 over n times the quantity y minus x times w, transposed, times y minus x times w. So notice I got rid of the summation, because the summation is now included in the matrix multiplication. Right? And if you go through the sizes, you'll see that it all works out. And it is still a quadratic loss function. Okay? What this is equivalent to is 1 over n times y tilde transpose y tilde. So what do we need to do in this particular case? Well, just like we did over there, we need to take the derivative with respect to w and set it equal to 0. Reza does this in the notes for the class; I'm going to do it in a slightly different way than Reza does it. Whichever way you choose to understand it is perfectly acceptable. This is the one that makes the most sense to me, but you can take a look through Reza's notes and see if that one makes sense to you. So Jn of w is 1 over n times the quantity y minus xw, transposed, times y minus xw, where those are all vectors. I'm just going to apply the transpose operator. If you recall from your linear algebra course, to apply the transpose I can bring it inside the parentheses, but when I take the transpose of a product, I flip the order. So it's now going to be y transpose minus w transpose x transpose, times y minus xw. Does that make sense? Okay. I'm just going to multiply that out real fast: 1 over n times y transpose y, minus y transpose xw, minus w transpose x transpose y, plus w transpose x transpose xw. Plus. Thank you, Max. Plus. Notice that the middle terms are scalars: a 1 by n, y transpose, times an n by m, times an m by 1 gives you a 1 by 1. So each of those is a scalar, and so to make things simple, I'm going to combine those two terms together.
It's going to be y transpose y, minus 2 times w transpose x transpose y, plus w transpose x transpose xw. Okay. I'm going to take the derivative with respect now to my vector w: d by dw of Jn of w is equal to 1 over n times the derivative with respect to w of this quantity. Okay. Yes? Two lines below, you say that quantity is zero? What quantity is zero? Here? Yeah. Okay. Well, you want this to be equal to zero, right? This is your error, the error between the predicted value and, if I was being accurate with the notation, I would say these are w hats. Very true. Yeah, these are w hats; we're estimating w's. Okay, sorry about that. Yes? A two? Yes, thank you. A two: I combined the two terms together and didn't write the 2. Okay. You can say these are all hats, right? We're estimating our w's. So let's take the derivative with respect to w. Just to write it again: d by dw of Jn of w is equal to 1 over n times the following. The derivative of y transpose y with respect to w is just zero, right? There are no w's there. So it's zero, minus 2 times, in this case the w transpose is going to disappear because we're taking the derivative with respect to w, and we end up with x transpose y. Right. And then we add the derivative of w hat transpose x transpose x w hat, which is quadratic in terms of w hat, so the 2 is going to come down and we end up with 2 x transpose x w hat. Everyone happy with that? It's quadratic in terms of w, and I'm taking the vector derivative with respect to w. Okay, if that does not make sense, please stop me. Then we set that equal to zero, just as we did before. What do I want to do? Well, I just want to solve for w. So in this particular case the 1 over n disappears, right?
And I'm left with 2 x transpose y equals 2 x transpose x w hat. The twos disappear, and I have x transpose y equals x transpose x w hat. Okay. And so to find w hat, what am I going to do? I'm going to pre-multiply by x transpose x, that quantity inverse. So x transpose x, quantity inverse, times x transpose y is equal to w hat. This is called the normal equation. You're going to see it about a hundred times through the course of this class, and you're going to see that the normal equation is the result of a very large number of learning algorithms if you take those learning algorithms to their limit. So what does this mean exactly? Think about the perceptron we used in class, right? In the perceptron, at every point in time we learned a little bit: you went through an iteration, there was some error, you updated your weights on the next iteration, and hopefully you were closer to the actual value that was coming out. Here, if we have all of our data and it's labeled, if we have y and we know our output, we can immediately come up with the values for our weights, w hat, just by applying this equation. No learning necessary. One shot, you have your weights. Does that make sense? Okay. Does that generalize to a multivariate y, just by making y and w respectively matrices? Correct. Okay. Interestingly enough, a small aside: one other way to solve this. Remember that y hat equals x times w hat, right? Or we could say more generally that we want our y's to equal xw, and so if we have our y's and we have our x's, you guys may know something called the pseudo-inverse. You can take the pseudo-inverse of x, which I'll denote x dagger, where the pseudo-inverse x dagger times x is equal to the identity matrix. And so if you pre-multiply both sides by the pseudo-inverse, you get x dagger y equals x dagger x times w, which is the identity times w, so again in one shot you can find your w's.
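The normal equation and the pseudo-inverse route give the same w hat; here is a small sketch (the three data points are invented so the fit is exact, and `normal_equation` is my own helper name):

```python
import numpy as np

def normal_equation(X, y):
    # w_hat = (X^T X)^{-1} X^T y, the one-shot least-squares solution
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

# Design matrix with a ones column for the bias term; y = 1 + 2x exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

w_hat = normal_equation(X, y)
# np.linalg.pinv computes the pseudo-inverse X dagger, giving the same weights.
w_pinv = np.linalg.pinv(X) @ y
```

In practice one would use `np.linalg.lstsq` or the pseudo-inverse rather than forming the explicit inverse, which can be numerically fragile, but the explicit form matches the equation on the board.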
It turns out that if you look at the definition of the pseudo-inverse operator, it equals x transpose x, quantity inverse, times x transpose, okay? Your homework today will be using the normal equation to find the weights in one shot for a simple classification example. Okay? That one is again due next Monday. Okay, so let's talk briefly about polynomial regression. So far we've done only linear regression. To do polynomial regression, the only thing you really need to do is change the definition of your matrix x. Previously we had ones in our first column, corresponding to the bias term, and we had x at trial one, x at trial two, up to x at trial n. If we wanted a quadratic term in there, all that is is another column with x at trial one quantity squared, x at trial two quantity squared, all the way up to x at trial n quantity squared. And now our w vector is going to be equal to w naught, w one, all the way up to w m, just like it was before. Does that make sense? So to do polynomial regression, all you need to do is change your x matrix, right? This will do a quadratic polynomial. The normal equation remains the same. Okay. So just to give you a brief introduction to fitting with polynomials. Can everyone see the screen? Just a little bit. Generally the fit improves with increasing order of the polynomial. So in the first panel you have a linear polynomial, then you have a quadratic, then a fifth-order polynomial, and here a tenth-order polynomial, all with the same data. Note that the residuals go down in each one of these cases, but in general, as your polynomial order increases, you are likely overfitting the data. Basically you're fitting the noise. So you want to find the best possible model to describe the data. And in this particular case what you can do is what's called a leave-one-out scenario, to do cross-validation.
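Building that modified x matrix is a one-liner per column; a sketch where `order` (an assumed parameter name) sets the highest power, with the zeroth power giving the ones column for the bias:

```python
import numpy as np

def poly_design_matrix(x, order):
    # Columns: 1, x, x^2, ..., x^order. Polynomial regression only
    # changes this matrix; the normal equation itself is unchanged.
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(order + 1)])
```

Feeding this matrix into the normal equation from above performs the polynomial fit, which is why the model is still linear in the weights even though it is nonlinear in x.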
So if you fit your original data set and then leave one point out, what you'll notice is that your error dramatically increases. If you're overfitting, if your polynomial is basically fitting noise, then when you leave one of those data points out, you're going to end up with a drastically different amount of error. Whereas if the fit is appropriate, leaving that one data point out doesn't make much of a difference at all. So one way to estimate whether or not you're overfitting is cross-validation, which is quite easy to do. Basically, you go through your data, you leave out a data point, and you fit your model, likely using the normal equation or whatever model-fitting technique you're using. So you fit that model to, in this case the notation here is w of not-i: you find w for all of the data except for that i-th point, you fit the model, and then you find the error at that point. You do that for every data point, each time leaving one out, and you take the average error over all of the left-out points. And so you can find your cross-validation error, which is the mean error at the data points that were left out. And if your model is overfitting, what you'll end up with is curves that look sort of like this. In general, if the model is a good fit and you're not overfitting the data, increasing the order of the polynomial is going to bring down the error with each increase in the model order. However, as soon as you start fitting noise more than data as your model order increases, you'll see a large increase in the cross-validation error. Does that make sense? So in this particular case, the actual data was generated by a second-order polynomial, and you see that the cross-validation error is at its minimum for a second-order polynomial and then increases on either side of it. Does that make sense?
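The leave-one-out procedure just described can be sketched as follows; the helper name `loocv_error` is my own, and the fit uses ordinary least squares on the polynomial design matrix:

```python
import numpy as np

def loocv_error(x, y, order):
    # Leave-one-out cross-validation: refit with point i held out
    # (giving w of not-i), measure the squared error at that point,
    # and average over all i.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        X_train = np.column_stack([x[mask] ** p for p in range(order + 1)])
        w_noti, *_ = np.linalg.lstsq(X_train, y[mask], rcond=None)
        x_i = np.array([x[i] ** p for p in range(order + 1)])
        errs.append((y[i] - x_i @ w_noti) ** 2)
    return float(np.mean(errs))
```

Sweeping `order` and plotting `loocv_error` against it reproduces the U-shaped curve from the slides, with the minimum at the true model order.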
So it's one way to find the appropriate model for your data, at least in terms of polynomial fits. Any questions? Okay. That is it for that. So we're going to talk now about a particular learning scenario called LMS, and in order to do that (where did I leave off? I left off over here) we need to talk about the difference between online learning and batch learning. This class goes to 4:15, right? Reza stuck me with one of the longest lectures of the semester, so I'm going to get to enjoy that. Okay, so in batch learning you're given, basically, a training set: some inputs and some outputs, and you're given them all at the same time. You have some big long list of your inputs and some big long list of your outputs, given all at once. That's as opposed to an online learning algorithm, in which you're given a single data point at a time, a pair of x and y at some time i (x might be a vector, depending on what you're doing), and you need to update your model, or your weight estimates, accordingly. So our normal equation assumed that we had a big long list of y's and a big long list of x's, and therefore in one shot we could figure out what the estimate of our weights needed to be. In this next problem we're going to talk about the LMS algorithm, which stands for the least mean squares algorithm; it has a number of different names. It's also called the delta rule, and if you're in psychology it's called the Rescorla-Wagner rule; they all mean the same thing. Basically, imagine that we have some space, say with x1 and x2 on our axes, and we get some input x at time i, which we can plot as a vector in this space, just like Reza did during the first class. I'm going to actually make this a little bit longer so things are easier to see. So this is just a vector; it has some components x1 and x2 that I've plotted in my two-dimensional space.
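To make the batch/online distinction concrete, here is a sketch in Python/NumPy with made-up data: the batch fit sees all the (x, y) pairs at once, while the online fit sees one pair at a time and updates after each. The online step used here is the normalized delta-rule update that the lecture derives next.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0])       # made-up generating weights
X = rng.standard_normal((200, 2))    # 200 input vectors
y = X @ w_true                       # noiseless outputs, for illustration

# Batch learning: all inputs and outputs at once, one-shot normal equation.
w_batch = np.linalg.pinv(X) @ y

# Online learning: one (x, y) pair at a time, updating the weights after each.
w_online = np.zeros(2)
for x_n, y_n in zip(X, y):
    err = y_n - w_online @ x_n                 # error on this single trial
    w_online = w_online + err * x_n / (x_n @ x_n)  # step parallel to x (delta rule)
```

Both estimates end up close to the generating weights here; the difference is purely in how the data arrives.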
So we can imagine that there is some output y, and what I'm drawing here is the line of all weight vectors whose dot product with x equals y; the projection of any such vector onto x has length equal to y normalized by the length of x. So this is going to be y sub i. I'm going to change my notation ever so slightly and call this x at n and y at n; it's not going to make a difference, but just to be consistent with Reza's notation. So imagine that we also have some weight vector at trial n. Previously we found only one set of weights. Now, in this online learning algorithm, we're going to be trying to come up with a new set of weights, the optimal set of weights given the data that we have seen thus far, right? And so you can imagine that if our model is given by y hat equals w transpose x, just like it was before, then what you want is for the projection of w onto x to also fall on this line y, right? Does that make sense? So just to give you some intuition about it, we can call the angle between w and x alpha. And now the question is: if I project w onto x, what is that distance? Different marker, not useful at all. What is this distance p? Right?
So in this particular case we're just doing the dot product, so p here is going to be equal to the magnitude of w times the cosine of alpha, the angle between them, where the double bars denote the L2 norm: the double bars of x equal the square root of the first element of x squared, plus the second element squared, and so on, all the way up to the m-th element of x squared. That's the L2 norm, right; it's your distance formula. Okay, so given that we have this model, we can also say that w of n transpose (to be complete, we should say these are all at some time point n) times x of n is equal to the magnitude of w times the magnitude of x times the cosine of the angle between them, which is again alpha. So p is equal to the magnitude of w times cosine alpha, and w transpose x is the product of the magnitudes of w and x times the cosine of the angle between them. And so by simple substitution, cosine of alpha is equal to w of n transpose x over the magnitude of w times the magnitude of x. Boy, that's a lot of lines. So what is this?
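These identities are easy to check numerically; a quick sketch with made-up vectors (Python/NumPy):

```python
import numpy as np

w = np.array([2.0, 1.0])
x = np.array([1.0, 3.0])

# cos(alpha) = w^T x / (||w|| ||x||), where || . || is the L2 norm.
cos_alpha = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))

# Projection length p = ||w|| cos(alpha), which equals (w^T x) / ||x||.
p = np.linalg.norm(w) * cos_alpha
assert np.isclose(p, (w @ x) / np.linalg.norm(x))
```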
Well, that top part is y hat at time point n. So this distance p, the projection of w onto x, is equal to y hat of n divided by the magnitude of x. So what do we want? In this particular case our output was y, somewhere on this line, and our predicted output was the projection of w of n onto x at time point n. So we made a mistake; there's an error here. And so in order to get the output to fall on this line, we need to change our weights on the next trial by some amount, so that when we project our new w of n plus 1 onto x, it falls on our line y. So we need to add some delta w such that w of n plus 1 results in the projection falling on the line, and one of the easiest ways to do that is to move along a path parallel to x, and that's what we're going to do here. I'll put that one up at the top. If you're at all confused about the picture, there's a very nice explanation of it in Reza's book, if you've picked it up already. So what is delta; how much do we need to change our weights by? Well, the magnitude of delta on trial n is equal to the error: y of n over the magnitude of x, minus y hat of n over the magnitude of x. That was just that distance; that's how much I need to add on. But in what direction do I need to add it on? Well, I need to add it on in the direction of x, and to do that I just multiply by x at trial n divided by the magnitude of x at trial n, which is a unit vector in the direction of x. So the first factor gives me the magnitude of the error, the second gives me the direction of the error, and I'm adding it on parallel to x. Does that make sense? And so very simply now, w at time point n plus 1 is w at n plus delta: w of n plus 1 equals w of n plus 1 over the magnitude of x of n squared, times y tilde of n, the error at n, times x of n. So that's pretty much the LMS algorithm right there, right? We change our weights in the direction of x based on our error. Okay, so we can derive
something called the steepest descent algorithm. The steepest descent algorithm assigns a cost function for being wrong, just like we did before. It's equal to 1 over 2N times the sum from n equals 1 to capital N (this is why I changed the notation) of y on trial n minus w hat transpose x of n, quantity squared. Very similar to what we had before, right? So we can take the derivative with respect to some w sub i, the same as before: it's minus 1 over N times the summation from n equals 1 to N of y of n minus w hat transpose x of n, times the i-th element of x of n. The 2 cancels; I was sneaky, I put the 2 out in front so it cancels. And if you wanted to take the derivative for all the w's at once, it would be the same equation, except now x of n is a vector. This quantity represents the average error over all data points. So in the LMS algorithm up here we have the error on a single data point, whereas in the steepest descent algorithm what we're doing is minimizing the cost over all previously seen data points. Does that make sense? So LMS approximates the error over a local region, the particular x that you've seen on this trial, whereas steepest descent changes our weights based on our entire history of inputs x. Does that make sense? So typically, if you wanted to write the learning rule out, you're going to say that my w on trial n plus 1 is equal to w on trial n, as in the LMS case, plus eta, which is my learning rate, times 1 over N times the summation from n equals 1 to N of y of n minus w transpose x of n, times x of n. The higher eta is, the more I'll learn from my error; if eta is 0, I'll learn nothing from my error and w of n plus 1 equals w of n. So what we really
want to know, in the last seven minutes or so, is this. With the normal equation, we found our w's in one fell swoop. With steepest descent, we receive new inputs x at trial n and keep updating our weights based on the history of what we've seen in the past. And with the LMS algorithm, the delta rule, we update our weights based only on the error seen on this particular trial. What we really want to know is what w does at infinite time: given an infinite number of inputs, does w converge to some known value? Do we get some stable solution for w? So what we really want to ask is: does w converge? I'm going to assume you have some base knowledge, and I'm just going to give you the matrix series definitions that you need in order to complete this problem. I think there's a homework assigned on Wednesday in which you're required to actually go through the derivations; I haven't looked at the homework for Wednesday, but I think it has something to do with this, or at least it used to. So I'm just going to give you the definitions; you can believe them or not, but they do work. So imagine we have some weight at time point one, and that's equal to the weight at time point zero plus, given our learning rule, eta times the sum from n equals 1 to N of y of n minus w naught transpose x of n, times x of n. I'm going to get rid of my hats now, and I've absorbed the 1 over N into eta. So we have some starting weight, and some weight on the first iteration through, based on the error that we receive, right? And so this is equal to w naught plus eta times X transpose times the quantity y minus X w naught. What I've done here is that I really don't want to play around with the summations, so I've stuck everything into a matrix which contains my whole history of values. So now I'm using capital
X's instead of vectors of x's; these are matrices now, so I can just do matrix multiplication and get rid of my summations. These are exactly the same equations; I've just gotten rid of the summation. And so w one in this particular case is equal to the quantity identity matrix minus eta X transpose X, times w naught, plus eta times X transpose y. All I've done is rearrange the equation: I've multiplied my eta through and factored out my w sub zero. Is that okay? So if we look at w two, it's the same equation; we update our weights based on our previous values, except now where we had w naught we use w one. So w two is equal to w one plus eta times X transpose times the quantity y minus X w one, which I'll write a different way: w two equals the quantity I minus eta X transpose X, times w one, plus eta X transpose y. And substituting in our expression for w one, that equals the quantity I minus eta X transpose X, times the quantity I minus eta X transpose X times w naught plus eta X transpose y, plus eta X transpose y. All I've done is substitute w one into this equation. Okay? So if I actually wanted to write w on the infinite trial, you'll notice that this is a series, and I can form a rule that describes it: w at infinity equals the sum from i equals 1 to infinity of the quantity I minus eta X transpose X, to the power i minus 1, times eta X transpose y, plus the quantity I minus eta X transpose X, to the power infinity, times w naught. Does that make sense? So does this series converge? This is in fact a matrix series, so you need to know two definitions. First, the sum from i equals 1 to infinity of some matrix A to the power i minus 1 is equal to the quantity I minus A, inverse, which is equal to