So, in the previous lecture we were talking about this projection map. We saw that if you have a subspace U contained inside an inner product space V, and if there is a notion of a best possible approximation in U of any vector in V, then this best approximation is obtained through a map which we showed is linear, and which we showed is a projection map. This projection therefore takes a vector v and maps it to v̂ = Pv, the best approximation of v in U; of course, it is a mapping from V to U.

If I remember correctly, we have also shown that the image of this projection map is the entirety of U. In fact, any vector already in U is mapped to itself: if you try to project a vector that is already sitting inside the subspace, to get its best possible approximation within that same subspace, even common sense says the result must be the vector itself. That is what this means, and we also proved it through our idea of best approximation, where we had the relation that the inner product between the error and any vector in the subspace must vanish: ⟨v − Pv, u⟩ = 0 for all u in U. So we showed that this is indeed true.

The next object of interest is the kernel of this projection map, and we are going to claim that it is nothing but the orthogonal complement of U. Why should this be true? It is again the same old trick: to show two subspaces are equal, show containment either way. Suppose v belongs to the kernel of P, which means Pv = 0. But what exactly is Pv? By the very definition, Pv is the best possible approximation of v inside the subspace U. Therefore the error v − Pv has vanishing inner product with every vector inside U. But we know Pv is already 0, so plugging that in, this says ⟨v, u⟩ = 0 for every u in U. Then where must v come from? v exactly fits the bill for a member of the orthogonal complement of U: no matter which u you pick out of the subspace U, the inner product of v with that u is 0, so v is orthogonal to every vector inside U, and hence v must belong to U⊥. So I started with the assumption that v belongs to ker P, and I observe that this implies v belongs to U⊥. That is precisely the definition of something belonging to U⊥: the orthogonal complement is that subspace which contains all the vectors orthogonal to every vector in the given subspace U. This part, till here, is it clear?
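Written compactly, the chain of implications we just walked through is the following (this is only a restatement of the argument above, in LaTeX):

```latex
v \in \ker P
\;\Longrightarrow\; Pv = 0
\;\Longrightarrow\; \langle v, u \rangle = \langle v - Pv,\, u \rangle = 0 \quad \forall\, u \in U
\;\Longrightarrow\; v \in U^{\perp},
\qquad \text{hence}\quad \ker P \subseteq U^{\perp}.
```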
Based on this, I can claim that the kernel of P must be contained inside the orthogonal complement of U. But I have to show equality, so I will have to show the other containment as well. So, on the other hand, suppose v now belongs to the orthogonal complement of U. What can I say about Pv then? Where does it belong? The job of P is to project anything onto U, and the image of P is U itself, so Pv belongs to U; this is naturally true. Now think about the object v − Pv. What can you say about it? Pv is the best possible approximation of v inside U, because that is how the map acts. Therefore v − Pv is what? It is the error, and where is the error supposed to belong? To U⊥, the orthogonal complement of U: we have ⟨v − Pv, u⟩ = 0 for all u in U, implying v − Pv belongs to U⊥.

So v belongs to U⊥ by my basic premise, and v − Pv belongs to U⊥ by the very nature of the projection map, which maps to a best possible approximation and thereby makes this the error. Both of these objects belong to the orthogonal complement, and therefore their difference, whichever way you take it, also belongs to U⊥. But v − (v − Pv) = Pv, which means that Pv now belongs to U⊥. So I have two things: Pv belongs to U, which is true naturally (is there any doubt about this? the projection map always maps into U, so anything acted upon by it lives inside U), and Pv belongs to U⊥. So Pv is an object that simultaneously lives inside U and U⊥. What does this mean? What will its inner product with itself be? Consider ⟨Pv, Pv⟩, and without loss of generality take the first copy of Pv to come from U and the other to come from U⊥; it is a matter of taste really which one you consider to come from where. Either way, this inner product is 0. But this means the norm of Pv is 0, and when can that be true? Only if Pv = 0, implying that v belongs to ker P.

So when v starts out with the premise that it belongs to the orthogonal complement of U, it ends up being in the kernel of P. From this we can infer that U⊥ must be contained inside ker P. Earlier we had ker P ⊆ U⊥; now we have U⊥ ⊆ ker P; combining these two together we get ker P = U⊥. Savvy? Okay.
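Here is a quick numerical illustration of ker P = U⊥. This is a minimal sketch that assumes the standard fact, which we have not derived in lecture yet, that if the columns of Q form an orthonormal basis for U, then the matrix QQᵀ implements this orthogonal projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# V = R^5; U = column span of a random 5x2 matrix M.
M = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(M)          # columns of Q: an orthonormal basis for U
P = Q @ Q.T                     # assumed form of the orthogonal projector onto U

u = M @ rng.standard_normal(2)  # a vector already inside U
v = rng.standard_normal(5)      # an arbitrary vector in V
w = v - P @ v                   # the error vector, which should live in U-perp

print(np.allclose(P @ u, u))    # True: vectors in U are their own best approximation
print(np.allclose(Q.T @ w, 0))  # True: w is orthogonal to every vector in U
print(np.allclose(P @ w, 0))    # True: so U-perp sits inside the kernel of P
```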
So, now what I leave to you as an exercise is this: assume that P is this projection map from V to U. By now you know the definition of the projection map; it sends any vector in V to its best approximation inside U, and for P defined in this manner it is a projection, let me just state that upfront. You have to verify that I − P is a projection map from V to the orthogonal complement of U. What do you have to show? The four properties we have seen, essentially: first, that the map is linear, as we have done with the projection; second, that it is idempotent, which we saw the other day, because objects on the target subspace themselves get mapped to the same objects; third, that the image is the entirety of the subspace to which it projects; and fourth, that the kernel is nothing but the orthogonal complement of that subspace. So verify that I − P is also going to be an orthogonal projection; you can take V to be finite dimensional for your verification. Just try this out as an exercise (there is a numerical sketch after this discussion); it will solidify the concept and let you realize where you stand so far as the concepts are concerned.

Any doubts about this derivation that we have done so far today? Why must v − Pv lie in the orthogonal complement? Because Pv is the best possible approximation of v inside U. By the development in the preceding lecture, if Pv is supposed to be the best possible approximation, then v − Pv is the error vector. Remember the example we gave of the child running after a balloon that is flying high: the closest the child can get is right below the balloon, which is when the vector joining the child to the balloon is perpendicular to the flat earth. It is the same idea here. Pv is where the child is standing, v is where the balloon is, so v − Pv is the vector joining the balloon to the child, and that is orthogonal to any vector on the flat earth, which is where the child can go; any position that can be assumed by the child is a vector on the flat earth. We proved this in the previous lecture, and that is what we are just using here.
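If you want a quick numerical sanity check of the exercise before proving it, here is a sketch using the same assumed QQᵀ form of the projector as above:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))
P = Q @ Q.T                     # projector onto U (same assumed form as before)
IP = np.eye(5) - P              # the candidate projector onto U-perp

u = Q @ rng.standard_normal(2)  # a vector in U
v = rng.standard_normal(5)      # an arbitrary vector in V

print(np.allclose(IP @ IP, IP))        # idempotent: (I - P)^2 = I - P
print(np.allclose(IP @ u, 0))          # U sits inside ker(I - P)
print(np.allclose(Q.T @ (IP @ v), 0))  # the image of I - P is orthogonal to U
```

Linearity of I − P is immediate here since it is a matrix; the prints check idempotence, the kernel, and the image.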
So now we will immediately see an application of this idea of orthogonal projection, although we will not necessarily use this important result about the inner product; we will approach it quite directly and eventually show that we land on a projection matrix, in fact an orthogonal projection. The problem is one that is very close to our hearts, because it is the theme that has connected the first half of our course: the solution of systems of equations of the form Ax = b. So let us get back to that; from these abstract vector spaces we return to very tangible vector spaces and their subspaces. So let us write Ax = b again, where A is a real m × n matrix, b is an m-tuple, and x, the unknown, is an n-tuple.

Let us further assume that m is greater than n. What do we call such systems? Overdetermined: the A matrix is tall, and there are more equations than unknowns. This is what you most often encounter when you perform experiments to determine some unknown parameters or coefficients; you obviously do not perform the experiment only as many times as there are variables. The simplest case is verification of Ohm's law, or obtaining the resistance, which is just one parameter: v = Ri. When you perform the experiment, the plot of v against i is going to be a straight line. So you could have just performed the experiment once, with some current, checked the voltage, and the slope would have given you the resistance. But we do not do that, do we? We actually go ahead and carry out this experiment multiple times, and due to noise and other things we end up with several different data points. And you do not go and fit the data so as to pass through every point; we do not do that, because from the physics we know this is going to be a linear relationship, so we draw the best possible line, the best approximation. What I am going to tell you is that this idea is very similar, except that now you do not have just one unknown coefficient, the resistance in this case; you can have multiple unknown parameters. Think of it as performing the experiment m times, with m greater than n, to obtain n different parameters. Makes sense? That gives it some sort of physical outlook or interpretation.

So overdetermined is mostly what you get in experiments, unless your experiments are very costly to set up and you want to perform fewer experiments than the number of variables and then do the best job that you can. In that case, of course, you have to be very precise: you normally do not expect much noise, disturbance, or parametric uncertainty; everything is very precise. One way or the other you have to pay the penalty somehow. Anyway, let us suppose that this is the case where the system is overdetermined. Also suppose (we will talk about situations where some of these assumptions are removed later on) that A has full column rank, which is to say that rank(A) = n. With this being the situation, it is not trivial now how you can obtain a solution; in fact, you do not even know whether a solution exists.
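As a concrete sketch of the Ohm's-law scenario, with hypothetical data (np.linalg.lstsq here computes the best-line fit whose machinery we are about to build up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical Ohm's-law data: true resistance R = 10 ohms, v = R*i plus noise.
i = np.linspace(0.1, 1.0, 8)                 # 8 runs for 1 unknown: overdetermined
v = 10.0 * i + 0.2 * rng.standard_normal(8)  # measured voltages, corrupted by noise

A = i.reshape(-1, 1)                         # tall 8x1 matrix, full column rank
R_hat, *_ = np.linalg.lstsq(A, v, rcond=None)
print(R_hat)                                 # close to 10.0, but not exactly 10.0
```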
If you take the row reduced echelon form of A, what is it going to look like? How many nonzero rows can you have? At most n. So a bunch of those rows, precisely m − n of them, will be all zeros, and if you perform the same operations on b and any one of those last m − n elements of the transformed b turns out to be nonzero, you end up with the conclusion that there is no solution. Recall that at the very beginning of this course we had discussed this possibility: if you do not have compatibility, that is, if the rank of the augmented object [A | b] and the rank of A alone do not match, then you have no solution. You may know this already from an earlier matrix theory course or whatever preliminary courses you might have done, and we have also visited this result right at the beginning of this course, in the first few lectures, and told you why it must be so; hopefully you now have a better understanding of why it must be true.

So, the same thing here: you might not even have a solution. But the point is, if you perform an experiment, you cannot just say it is either the perfect data set or nothing at all; it does not make sense that way. We will still have to make do with the best possible approximation that we can get. You might also say: here is another solution, if the data do not fit, then discard that round of the experiment. But that defeats the purpose of performing multiple experiments; it essentially means you are choosing just n arbitrary runs of the experiment for the n parameters to be evaluated, and again the rest is a waste. We would not want to do that. This curve fitting business is very important even in the modern day. There is a serious problem called overfitting, and there is a problem on interpolation that you have probably already seen; once this lecture is done, if you attempt that problem you will see something very interesting. I will not reveal to you yet what it is, but the question itself should make apparent what I am hinting towards: it has something to do with this overfitting. That is the hint. Anyway.

So let us focus on this picture, shall we? What do we have here? We are basically saying that A is the matrix whose columns can be written as a1, a2, ..., an, each of them an m-tuple of numbers. Now, what we are asking for is an x such that x1 a1 + x2 a2 + ... + xn an = b. In other words, you are asking for an x that leads to this equality, and that is only going to be possible if b belongs to the span of the columns: if b belongs to the column span of A, we have a solution. But that is the boring case; anyone can see that. What if it does not? If b does not belong to the column span of A, what do we do? We approximate. Approximate what to what? What is it that we are going to do? The best possible approximation of what? Flesh it out for me, because it is a very tangible problem: project b onto the column space of A, and then check out what happens.
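Whether b lies in the column span is exactly the rank compatibility test from a moment ago; here is a minimal sketch with made-up numbers:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # 3x2 tall matrix, full column rank

b_good = np.array([1.0, 2.0, 3.0])    # equals 1*a1 + 2*a2: lies in the column span
b_bad  = np.array([1.0, 2.0, 0.0])    # lies outside the column span

for b in (b_good, b_bad):
    Ab = np.column_stack([A, b])      # the augmented matrix [A | b]
    print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(Ab))
# prints "2 2" (compatible: exact solution) then "2 3" (no solution: we approximate)
```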
So, what is that projection map? It is not very straightforward now, is it? I mean, to say that something is there up in the air and we can get something is all well and good, but what is this projection map after all? Is there a straightforward way to produce it? And on top of that, the way we have defined it, it must satisfy those four properties; you cannot just think this through in your head like Tesla apparently did with his induction machine. You have to really work it through. So let us not take the inner product approach here; let us just think about the basics. For want of a better demonstration, let me take a 2D subspace inside a 3D space. So suppose this is b, and this b̂ is the best projection, obtained when the error vector is the normal to the plane. But remember, we have seen that the best approximation is also unique; that is why we can call it the best approximation and not just a best approximation: whenever it exists, it is unique. So if you take other candidates, say b1, b2, b3, b4, all of these will entail a larger error; the minimum error is in fact this one.

So what we are essentially looking for is: can we get x̂ in R^n such that the error between Ax̂ and b is smallest? Where is Ax̂ living? In the column span, right: any time you post-multiply A by a vector, all you are doing is taking a linear combination of the columns of A. So by letting x̂ be arbitrary, R^n becomes my search space; this is now an optimization problem, and that is how we are going to formulate it, because we have seen that the projection attains the lowest value of the error among all possible vectors you can pick out of the column span of A. So what are we going to say? That the norm of the error between Ax̂ and b is minimized.
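In symbols, the problem we have just posed is the following (a restatement of the formulation above; squaring the norm does not change the minimizer, and it makes the calculus cleaner):

```latex
\hat{x} \;=\; \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \; \lVert Ax - b \rVert^2 ,
\qquad
A\hat{x} = \hat{b} \;\; \text{is then the projection of } b \text{ onto the column span of } A.
```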
I will presuppose a bit of knowledge about multivariable calculus here. One of the primary things is that this is a multivariable function; you agree, because x̂ has n variables, so this is not an optimization over a single variable. What is the first-order condition for optimality of a function such as this? The gradient of the function has to vanish. And what does the gradient of a scalar function look like? It becomes a vector. What is going to be the second-order condition, I wonder? This is where things start getting a little more interesting. The second-order object is something we call a Hessian, and that is a matrix, because now you have a vector, the gradient, and you operate on it with another derivative, if I may call it that, and that results in a matrix. So you start with a scalar function and take its derivative with respect to a vector; that is the gradient, itself a vector. Now, there is no unanimity about whether the gradient should be a column or a row. Most often it is taken as a column vector, but in some textbooks you might encounter a situation where it is dealt with as a row vector, because quite often the gradient's dot product, or inner product, is taken with some other column vector. So, prima facie, they just define it as a row vector; but let us not get caught up in that, they are ultimately the same objects. You take this function and differentiate it with respect to the first variable, that is the first entry of the n-tuple, then the second variable, and so on; you take the partial derivatives of the function with respect to each of the variables and stack them up together to get a vector. That is the gradient.

So the first-order condition for optimality is that the gradient must vanish. The second-order condition is just like in the single-variable case (I am not going into the details of why, because that would again take us too far afield): there, the first derivative must vanish, and the second derivative must be positive for a minimum, negative for a maximum. In the case of matrices we have something called a positive definite matrix, which, by the way, as an upshot of our foray into this topic, we will also be in a position to define. So just know that the gradient has to be 0, and at that point the Hessian must be a positive definite matrix. That is the condition. So we will investigate next how we can solve this problem using these two ideas.
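For the curious: the Hessian of ‖Ax − b‖² works out to 2AᵀA (we will derive this when we solve the problem), and for a full-column-rank A this matrix is positive definite. A quick numeric check under that assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))  # tall matrix; full column rank with probability 1

H = 2.0 * A.T @ A                # Hessian of ||Ax - b||^2 (assumed; derived in lecture soon)
eig = np.linalg.eigvalsh(H)      # H is symmetric, so eigvalsh is the right tool
print(eig)                       # all eigenvalues strictly positive
print(bool(np.all(eig > 0)))     # True: positive definite, so the critical point is a minimum
```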