Okay, so last time we looked at the singular value decomposition and generalized inverses. Today we will connect these ideas of generalized inverses to least squares problems. So what are these least squares problems? Consider a matrix A of size m by n, a vector b of length m, and a vector x of length n, and consider the problem Ax = b. This is m equations in n unknowns, and regardless of what m and n are, there are exactly three possibilities: there exists a unique solution, there exist infinitely many solutions, or there is no solution. There is nothing in between; for example, you can never have exactly two solutions. In the case where there is no solution, it is natural to ask what would be the best possible approximate solution you can find. In this part of the course we will focus on the case where m is greater than or equal to n, so you have at least as many equations as unknowns. The case m < n is something I cover, in some sense, in the course on compressed sensing. So with m >= n, we will focus on this problem: minimize with respect to x the L2 norm of Ax - b. This is called the least squares problem. Minimizing the L2 norm of Ax - b is the same as minimizing the square of the L2 norm, because squaring is a monotonic function: whatever x minimizes one cost also minimizes the other. And the squared norm is just the sum of the squares of all the individual entries of Ax - b, which is why it is called the least squares problem: you are finding the solution that achieves the least squared error, in the Euclidean sense, between Ax and b. So there are several questions one can ask. First, does this problem have a solution? Can you always solve it?
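As a small sketch of the problem being set up, here is a numpy example with a random tall matrix (the matrix, vector, and sizes are made up for illustration). `np.linalg.lstsq` solves exactly the minimization described above:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 6, 3                      # more equations than unknowns (m >= n)
A = rng.standard_normal((m, n))  # tall matrix A
b = rng.standard_normal(m)       # right-hand side b

# Least squares: minimize ||Ax - b||_2 over x
x_ls, residual, rank, svals = np.linalg.lstsq(A, b, rcond=None)

# lstsq reports the residual as ||A x_ls - b||_2 squared
assert np.isclose(residual[0], np.linalg.norm(A @ x_ls - b) ** 2)
```

Since a random 6 by 3 Gaussian matrix has full column rank with probability one, Ax = b generically has no exact solution here, and the least squares solution leaves a strictly positive residual.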
Second, is that solution unique, or are there other solutions which are equally good? And the third question is: why are we choosing to use the L2 norm here? The third question is the easiest to answer, and the answer is admittedly hand-waving. The L2 norm is the most natural and easiest-to-understand notion of distance: it is the Euclidean distance, the one we are most familiar with. More importantly, from an analytical point of view, ||Ax - b||^2 is a continuously differentiable function of x, so you can compute its derivative, and it is far more amenable to optimization than other cost functions you could imagine. And thirdly, the L2 norm is invariant to orthogonal transformations: ||Ax - b||_2 = ||QAx - Qb||_2 for any orthogonal matrix Q. We will see that this is a very useful property for simplifying the problem we want to solve. Before actually solving the problem, I want to make one small point: the solution to minimizing a norm of Ax - b does depend on which norm you choose. Take a very simple example where A is a 3 by 1 vector of all ones, and b has three components b1, b2, b3 with b1 >= b2 >= b3 >= 0, so they are nonnegative numbers in decreasing order. Now let's look at the Lp norm. Here x is a scalar, since Ax is just this number times the all-ones vector, and we want it to be as close to b as possible. So if I look for the x that minimizes the L1 norm of Ax - b, what is that optimal x? Writing it out, the L1 cost is |x - b1| + |x - b2| + |x - b3|.
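The invariance property mentioned above is easy to check numerically. This is a small sketch with made-up random data, building an orthogonal Q from a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))
x = rng.standard_normal(2)
b = rng.standard_normal(5)

# Build an orthogonal Q from the QR factorization of a random matrix
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))

# ||Q A x - Q b||_2 == ||A x - b||_2 for any orthogonal Q
assert np.isclose(np.linalg.norm(Q @ (A @ x - b)),
                  np.linalg.norm(A @ x - b))
```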
If you look at this for a few minutes, given that b1 >= b2 >= b3 >= 0, you can reason out that the minimizing x must equal b2. If x lies between b3 and b1, the first and last terms add up to the constant b1 - b3, while |x - b2| is strictly positive unless x = b2. And if x lies outside that interval, for example if x is bigger than b1 or less than b3, the total cost is even higher. So the L1 solution is x* = b2, the median. Similarly, if I take the two-norm, the squared cost is (x - b1)^2 + (x - b2)^2 + (x - b3)^2. If you differentiate with respect to x, set the derivative equal to zero and solve, you can easily show that the solution is (b1 + b2 + b3)/3: it is just the average, the value that minimizes the mean squared error. And if I take the L-infinity norm, I am minimizing the largest among |x - b1|, |x - b2| and |x - b3|. You will have to reflect on it for a minute, but you can show this is minimized by choosing x* = (b1 + b3)/2, the midpoint of the range. So this is just to illustrate that the solution to the problem does depend on which norm you are considering. But the key advantage of ||Ax - b||_2 is that its square is a differentiable function of x, so we can find minimizers by differentiating, setting the derivative equal to zero, and solving. And this derivative, as I will show in a minute, yields an easily constructed linear system, and this linear system is positive definite if A has full column rank. That is the property we will use. Before I solve this least squares problem, I just want to mention that the origin of this least squares problem is in linear regression.
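The three closed-form answers above (median, mean, midrange) can be checked by brute force on a grid. This is just a verification sketch with made-up values b = (6, 3, 1):

```python
import numpy as np

b = np.array([6.0, 3.0, 1.0])    # b1 >= b2 >= b3 >= 0

def cost(x, p):
    """||x * ones - b||_p as a function of the scalar x."""
    return np.linalg.norm(x - b, ord=p)

xs = np.linspace(0.0, 7.0, 70001)   # grid with spacing 1e-4

# L1 cost is minimized by the median b2 = 3
x1 = xs[np.argmin([cost(x, 1) for x in xs])]
# L2 cost is minimized by the mean (b1 + b2 + b3)/3 = 10/3
x2 = xs[np.argmin([cost(x, 2) for x in xs])]
# L-infinity cost is minimized by the midrange (b1 + b3)/2 = 3.5
xinf = xs[np.argmin([cost(x, np.inf) for x in xs])]

assert abs(x1 - 3.0) < 1e-3
assert abs(x2 - 10.0 / 3.0) < 1e-3
assert abs(xinf - 3.5) < 1e-3
```

Three different norms, three different "best" answers for the same data.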
Okay, so here you are given points in the two-dimensional plane: (x1, y1), (x2, y2), and so on. These points need not lie on a straight line; they could be scattered all over. But we believe there is a linear relationship between x and y, so we want to find a line that fits these points and explains this relationship: a relationship of the form y = alpha*x + beta for some alpha and beta. For instance, suppose x is the time and y is the velocity of a vehicle under constant acceleration alpha. Then the velocity at time x is alpha times x plus beta, where beta is the initial velocity, the velocity at x = 0. So we go out, observe a vehicle moving, and record the time and the velocity at that time; then we want to infer this relationship between velocity and time and determine the parameters alpha and beta. The so-called straight line of best fit between x and y can be obtained by this method of least squares, where you ask: what alpha and beta minimize the sum of squared errors between yi and alpha*xi + beta? This in turn is a special case of the general least squares problem we just wrote down: set A to be the matrix with all ones as its first column and x1 through xm as its second column, let b be the vector containing y1 through ym as its entries, and let x be the two-dimensional vector with beta and alpha as its two entries. It has just two variables. Okay. Another small remark: for the problem of minimizing ||Ax - b||_2, if the rank of A is less than the number of variables in x, then the solution to this problem is not unique. Okay.
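The regression setup described above, with the ones column for the intercept, looks like this in numpy. The data here is hypothetical (roughly y = 2x + 1 with small perturbations), just to show the shape of the design matrix:

```python
import numpy as np

# Hypothetical measurements: times x_i and velocities y_i, assumed to
# follow y = alpha * x + beta (alpha, beta unknown)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 8.8])

# Design matrix: ones in the first column (for beta), x_i in the second
A = np.column_stack([np.ones_like(x), x])

# Solve min || A (beta, alpha)^T - y ||_2
(beta, alpha), *_ = np.linalg.lstsq(A, y, rcond=None)

assert abs(alpha - 1.96) < 1e-6   # slope close to the underlying 2
assert abs(beta - 1.08) < 1e-6    # intercept close to the underlying 1
```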
But we can make the solution unique by considering, among all the least squares solutions x_LS for which ||A x_LS - b||_2 attains its minimum value, the one with the smallest norm. This is something I will come back to in just a few minutes: there is a way to single out a unique solution, unique in the sense that it is the least-length least squares solution. Okay. Now, coming back to our problem: minimize ||Ax - b||^2 with respect to x. Of course, this would be the same as minimizing ||Ax - b||^3 or ||Ax - b||^4, or in fact any monotonic function of ||Ax - b||, but the square turns out to be very convenient for us, so that is what I will consider here. The squared norm is nothing but (Ax - b)^T (Ax - b), and if we expand it out, we get b^T b + x^T A^T A x - x^T A^T b - b^T A x. Okay. Now, what we do is differentiate with respect to x. What I am writing here is actually the vector derivative. Vector derivatives work in a very similar way to scalar derivatives, but unfortunately I don't have the time to teach you a module on how to find them. I'll just mention that if f(x) is a scalar function of a vector x, then the gradient of f with respect to x is the vector containing the partial derivatives; that is how you differentiate with respect to a vector. If you apply this idea, it is actually very elementary to show that the derivative of x^T A^T A x is 2 A^T A x, and the derivative of the sum of the last two terms is -2 A^T b. So if we set the derivative equal to zero, we get A^T A x = A^T b. Basically, we need to solve for the x that satisfies this system.
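The system derived above can be solved directly, and with random full-rank data it agrees with a general-purpose least squares solver. A small sketch (all data made up):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 3))   # full column rank with probability 1
b = rng.standard_normal(8)

# Normal equations: A^T A x = A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Same answer as numpy's least squares solver
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_ne, x_ls)
```

In practice solvers avoid forming A^T A explicitly (it squares the condition number), but the two routes give the same minimizer here.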
So this is again a linear system of equations, and in fact these are called the normal equations. To see why, we will go through a small geometric argument. What I will show is that if z is a solution to the least squares problem, then the error vector Az - b must be perpendicular to every vector in the range space of A. This idea is called the orthogonality principle. And the fact that Az - b must be perpendicular, or normal, to every vector in R(A) is exactly the reason this set of equations is called the normal equations: they represent this normality between Az - b and any vector in the range space of A. Pictorially, it looks like this. For simplicity I am drawing the range space of A, the set of all vectors Ax, as a straight line, and b is some other vector which may not lie in that range space. If b lies in the range space of A, you know you can find an x such that Ax exactly equals b, and then ||Ax - b||^2 = 0: there is no residual error once you find the least squares solution. But if b is outside the range space of A, essentially what we are doing is projecting b onto the range space of A. The projection is Az, and what is left over, the vector Az - b, is the residual. These two vectors are perpendicular; in fact, every vector in the range space of A has to be perpendicular to Az - b, because otherwise you could improve your solution by picking a different z. That's the basic idea, a simple geometric argument.
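The orthogonality principle is also easy to see numerically: the residual of a least squares solution is perpendicular to every column of A, hence to all of R(A). A quick sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 2))
b = rng.standard_normal(7)

z, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = A @ z - b

# The residual is normal to the range space of A: A^T (A z - b) = 0,
# so it is perpendicular to every column of A, and so to every A y.
assert np.allclose(A.T @ residual, 0)
```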
So in other words, what we are saying is that Ay should be perpendicular to Az - b for every y in R^n, which is the same as saying that (Ay)^T (Az - b) = y^T A^T (Az - b), the inner product between Ay and Az - b, must equal 0 for every y in R^n. Now, if this is true for every y in R^n, it means that A^T (Az - b) itself must equal the zero vector, which is the same as saying A^T A z = A^T b, and that brings us back to the normal equations. In fact, all of these arguments are reversible, and that is why the set of equations A^T A z = A^T b is called the normal equations: if z satisfies these normal equations, then it solves the least squares problem. Okay, so we have the following theorem. Let A be in R^(m by n) with m >= n, x in R^n and b in R^m. Then the problem of minimizing the L2 norm of Ax - b with respect to x always has a solution: there is no case where you can't solve this problem. It has a unique solution if and only if rank of A equals n, that is, A has full column rank. And this unique solution is given by x = A†b, where A† is the Moore-Penrose pseudoinverse of A. (Someone asked whether this Penrose is the same person who got the Nobel Prize, and yes, I checked: Penrose got the Nobel Prize in Physics in 2020.) If rank of A is less than n, so A is rank deficient, then there is an infinite set of solutions to the least squares problem. However, within this set, the least-length element is unique, and it is still given by A† times b.
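The theorem's full-column-rank case can be checked directly with numpy's pseudoinverse. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))   # full column rank with probability 1
b = rng.standard_normal(6)

# Moore-Penrose pseudoinverse solution x = A† b
x_dag = np.linalg.pinv(A) @ b

# With full column rank, it is THE least squares solution
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_dag, x_ls)
```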
So this A†b turns out to be a very beautiful solution. It is always a solution to this least squares problem; it is the unique solution if rank of A equals n; and if rank of A is less than n, there are infinitely many solutions, but A†b is the unique least-length one. The way we show this is to first show, using the geometric argument, that A†b is always a solution to the least squares problem. For this it suffices to show that A†b satisfies the normal equations: we arrived at the normal equations by differentiating ||Ax - b||^2 and setting the derivative equal to zero, so if A†b solves the normal equations, it minimizes this optimization problem. Also recall from the previous class, where we wrote down the pseudoinverse: if A = U Σ V^T, we can write A† = V Σ1 U^T, where Σ1 is the matrix whose top-left r by r block contains the inverses of the entries of the r by r top-left diagonal block of Σ (the nonzero singular values), with everything else equal to zero. Σ is of size m by n, and Σ1 is of size n by m. So now let x = A†b and look at what happens to A^T A x; that is the same as A^T A A† b. Substituting the formulas A^T = V Σ^T U^T, A = U Σ V^T and A† = V Σ1 U^T, and using the facts that U^T U is the identity matrix and V^T V is the identity matrix, this becomes V Σ^T Σ Σ1 U^T b. Now, Σ1 is the pseudoinverse of Σ, and you can check that Σ^T Σ Σ1 is just Σ^T. So what we have is V Σ^T U^T b, which is nothing but A^T b.
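The construction of A† from the SVD, and the identity A^T A A†b = A^T b just derived, can be sketched as follows (random made-up data; Σ1 is built as an n by m matrix as described):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Full SVD: A = U S V^T, with s the vector of singular values
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Sigma1 (n x m): inverses of the nonzero singular values on the
# top-left diagonal block, zeros everywhere else
Sigma1 = np.zeros((n, m))
Sigma1[:n, :n] = np.diag(1.0 / s)

A_dag = Vt.T @ Sigma1 @ U.T       # A† = V Sigma1 U^T

# x = A† b satisfies the normal equations A^T A x = A^T b
x = A_dag @ b
assert np.allclose(A.T @ A @ x, A.T @ b)

# And it matches numpy's built-in pseudoinverse
assert np.allclose(A_dag, np.linalg.pinv(A))
```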
So A^T A x = A^T b when x = A†b: A†b satisfies the normal equations, and hence solves the least squares problem. Now look at the system A^T A x = A^T b. A is m by n, so A^T A is of size n by n: this is a system of n linear equations in n unknowns. And we know it has a unique solution if and only if the matrix A^T A is invertible, which is true if and only if rank of A equals n. So the least squares problem has a unique solution if and only if rank of A equals n. Okay, now suppose rank of A is some number r less than n. Define y = V^T x. Then, going forward I am not going to write the subscript 2 on every norm, but everything here is the L2 norm: ||Ax - b|| = ||U Σ V^T x - b|| = ||Σ y - U^T b||. Here I am using the property that U is a unitary matrix, so multiplying the whole expression by U^T does not change the value of the norm. What this means is that x minimizes ||Ax - b|| if and only if y = V^T x minimizes ||Σ y - c||, where c is the vector U^T b; this is a unitary transformation, so it is one-to-one. But this is a beautifully simple system, and it is very easy to see what is going on here. Write Σ with σ1 through σr on its diagonal and zeros everywhere else.
These σi occupy the top-left r by r block, a diagonal matrix. Then if I expand ||Σ y - c||^2, it is the sum from i = 1 to r of (σi yi - ci)^2, plus the sum from i = r+1 to m, where the σi are zero, of just ci^2. Now look at what happens as we choose different possible values for the yi. By choosing yi = ci / σi for the first r entries, I can make the first group of terms zero, but y does not touch the remaining terms at all. So the minimum value of ||Σ y - c||^2 equals the second sum, from i = r+1 to m, of ci^2, and it occurs when yi = ci / σi for i = 1 to r, while the remaining yi, for i = r+1 to n, can take any values we wish. Since r is less than n, there are infinitely many solutions for y. But the one with minimum Euclidean norm is obtained by putting yi = 0 for i = r+1 to n: ||y||^2 is the sum of the squares of the first r entries, which are fixed, plus y_{r+1}^2 + y_{r+2}^2 + ... + y_n^2, and that second part is minimized by choosing all of those entries equal to zero. In fact, this minimizing y can be written as y = Σ1 c. And this actually yields the x of minimum Euclidean norm as well, since ||x|| = ||y||: V is an orthogonal matrix, so whichever y minimizes ||y||, the corresponding x = V y is the x with minimum Euclidean norm. So the minimum-norm, least-length solution is x = V y with y = Σ1 c and c = U^T b, that is, x = V Σ1 U^T b. But V Σ1 U^T is nothing but A†. So the least-length solution is A† times b, which completes the proof.
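The rank-deficient case just proved can be sketched numerically: build a matrix whose third column is the sum of the first two (so rank 2 < n = 3), and compare the pseudoinverse solution with another solution obtained by adding a null-space vector. All data here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
# Rank-deficient A: third column is the sum of the first two
B = rng.standard_normal((6, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1]])
b = rng.standard_normal(6)

# Minimum-norm least squares solution via the pseudoinverse
x_min = np.linalg.pinv(A) @ b

# Any other solution: add a vector from the null space of A
null_dir = np.array([1.0, 1.0, -1.0])    # A @ null_dir = 0 by construction
x_other = x_min + 0.5 * null_dir

# Both achieve the same residual, but x_min has the smaller norm
assert np.allclose(np.linalg.norm(A @ x_min - b),
                   np.linalg.norm(A @ x_other - b))
assert np.linalg.norm(x_min) < np.linalg.norm(x_other)
```

The minimum-norm solution lies in the row space of A, orthogonal to the null space, which is why adding any nonzero null-space component can only increase its length.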