So, in our last lecture we looked at unconstrained multivariable optimization. We derived necessary and sufficient conditions for optimality, and in the lecture after that we applied them to solving the linear-in-parameters least squares problem. What is nice about a linear-in-parameters least squares problem is that the optimal solution, the optimal value of the parameters, the least square estimate, can be computed analytically. So, just to recap the solution of linear least squares: we have collected data and we have some model, in general of the form

u = θ₁ f₁(z) + θ₂ f₂(z) + ... + θ_m f_m(z) + error,

where f₁, f₂, ..., f_m are known functions. The simplest case we looked at was polynomials, but they need not be polynomials; they can be any known functions, and then you are doing a function approximation. If they are polynomials, it is a polynomial approximation. One correction: last time we did not use x, we used z, so let me stick with that. Here z is the independent variable; it need not be space, it can be any independent variable, and we looked at many examples from chemical engineering where we could use this method. We have collected data u₁, u₂, ..., u_n at points z₁, z₂, ..., z_n, and using this data we wrote a number of linear equations and finally put them in the matrix form u = Aθ + e, where e is the modelling error. We then found θ_LS by minimizing with respect to θ. Now I am going to make a small modification compared to the previous development. I want to solve

minimize over θ:  φ(θ) = e^T W e,  where e = u − Aθ,

and W is a symmetric positive definite matrix. Earlier we looked at the special case e^T e, where W was the identity matrix; the identity is symmetric positive definite, so that was a special case of this, and I am just generalizing. Why a general W? In some situations, when you collect data and fit a model, you know that a particular observation is more reliable or less reliable. So you can attach a small positive weight to a less reliable observation and a large weight to a more reliable one, so that when you optimize, the optimizer gives more importance to the accurate measurements and less importance to the less accurate ones. You can tune this, and sometimes you need to because the variables have different scales and so on. So this is the general formulation, with a weighting matrix W that is symmetric positive definite; as you know, such a matrix defines a 2-norm. So if I have this φ = e^T W e = (u − Aθ)^T W (u − Aθ), then I apply the necessary condition for optimality, ∂φ/∂θ = 0, using our rules for differentiating a scalar function with respect to a vector.
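To make that setup concrete, here is a minimal sketch in Python of building A from known basis functions and forming the weighted objective. The data values, the quadratic basis, and the weights below are all made up for illustration; they are not from the lecture.

```python
import numpy as np

# Hypothetical data: measurements u_1..u_n collected at points z_1..z_n.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
u = np.array([1.1, 2.9, 7.2, 12.8, 21.1])

# Known basis functions f_1, f_2, f_3 (a quadratic polynomial basis here,
# but they need not be polynomials).
basis = [lambda z: np.ones_like(z), lambda z: z, lambda z: z**2]

# Each column of A is one basis function evaluated at all the data points,
# so the model u = A theta + e stacks the n linear equations.
A = np.column_stack([f(z) for f in basis])   # n x m, here 5 x 3

# Weighted objective phi(theta) = e^T W e with e = u - A theta; a small
# weight down-weights a measurement we trust less.
W = np.diag([1.0, 1.0, 0.2, 1.0, 1.0])

def phi(theta):
    e = u - A @ theta
    return e @ W @ e
```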
So, this φ is a scalar function from ℝ^m to ℝ; θ is an m-dimensional vector in general, and φ is a function of θ. We had rules for differentiating a scalar function with respect to a vector, and using those we arrived at the normal equations

A^T W A θ_LS = A^T W u.

Finally, we argued that if the columns of A are linearly independent, then A^T W A is invertible; it is a symmetric positive definite matrix, a very nice matrix, and so my least square estimate can be written as

θ_LS = (A^T W A)^{-1} A^T W u.

We also looked at the special case W = I, where (A^T A)^{-1} A^T is called the pseudo-inverse of the matrix A. Remember that A here is non-square: A is an n × m matrix, θ is an m × 1 vector, and u is an n × 1 vector. Since A is non-square we cannot talk of its inverse, so we talk of its pseudo-inverse, defined as (A^T A)^{-1} A^T. So far so good. One of your classmates asked me after the last lecture about the regression formula for least square estimates that she knows, written using summations. That formula is not different; the two are one and the same, and deriving the summation formula starting from this matrix formula is one of the exercise problems; the next exercise set I give you for least squares estimation will include it. Those two things are not different; this is just a more elegant, compact way of expressing the same thing. The regression formula you know, with its multiple summations, is expressed very elegantly through (A^T A)^{-1} A^T; when you solve the problem you will realize it is exactly identical. What I want to do now is this: so far we have done a lot of algebra. We found the condition for optimality, and then we checked the second derivative. What is the second derivative here? It is

∂²φ/∂θ² = 2 A^T W A.

This second derivative is always symmetric positive definite, which means the stationary point you obtained, the point at which the gradient is zero, is a global minimum. So this is fine, but it is a lot of algebra. I want to give some geometric insight into what is really happening, and relate it to your school geometry; that is what I want to elaborate next. What was the situation here? You had a non-square matrix: maybe 100 data points and only 3 parameters. We saw the example of c_p versus temperature.
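As a quick numerical sanity check of the closed form, the sketch below, with toy numbers and an arbitrarily chosen W, solves the normal equations and confirms that for W = I the estimate matches applying numpy's pseudo-inverse.

```python
import numpy as np

# Toy straight-line fit: columns of A are [1, z]; u is made-up data.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
u = np.array([0.9, 2.1, 2.9, 4.2])
W = np.diag([1.0, 1.0, 0.5, 1.0])

# theta_LS = (A^T W A)^{-1} A^T W u, solved without forming the inverse.
theta_ls = np.linalg.solve(A.T @ W @ A, A.T @ W @ u)

# Special case W = I: (A^T A)^{-1} A^T is the pseudo-inverse of A.
theta_I = np.linalg.solve(A.T @ A, A.T @ u)
print(np.allclose(theta_I, np.linalg.pinv(A) @ u))   # True
```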
So, m was small and n was large: you are fitting some function, some polynomial, and the number of parameters is much, much less than the number of equations. To get an insight into what is happening, let us not work with 100-dimensional spaces; we cannot visualize in 100 dimensions, but in 3 dimensions I can. So I am going to create a very simple dummy problem, 3 equations in 2 unknowns, as a representative of this situation. What is the main feature? The number of equations is more than the number of unknowns, and you are not able to satisfy all the equations simultaneously. If you were able to satisfy all of them simultaneously, the error would be 0; but the error is not 0, and you are finding a least square solution. What is the error, and how do you compute it? If you substitute θ_LS back into the equation, you get the error vector:

e = u − A (A^T W A)^{-1} A^T W u = (I − A (A^T W A)^{-1} A^T W) u.

Unless this matrix times u is zero, you will not get exact satisfaction. These equations are such that you cannot satisfy all of them simultaneously; you are trying to find a least square solution, and there is always going to be an error vector, such that probably no single equation is satisfied exactly. So the curve you get will probably not pass through any of the data points; quite likely it only captures the tendency in the data, not going through every point.
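Since the expression e = (I − A (A^T W A)^{-1} A^T W) u is easy to misread, here is a small check, reusing the same toy numbers as above, that this matrix annihilates the column space of A, so the error vanishes exactly when u is a combination of the columns.

```python
import numpy as np

A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
W = np.diag([1.0, 1.0, 0.5, 1.0])
u = np.array([0.9, 2.1, 2.9, 4.2])

# e = (I - A (A^T W A)^{-1} A^T W) u, the residual of the weighted fit.
M = np.eye(A.shape[0]) - A @ np.linalg.solve(A.T @ W @ A, A.T @ W)
e = M @ u

# M sends anything already in the column space of A to zero, so e = 0
# exactly when u is a linear combination of the columns.
print(np.allclose(M @ A, 0.0))   # True
```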
So, now let us try to get some insight by taking a simple dummy problem. I want to solve

[1, 2, 3]^T = θ₁ [1, 0, 1]^T + θ₂ [1, 1, 0]^T + [e₁, e₂, e₃]^T.

I hope the left-hand vector cannot be obtained as a linear combination of these two columns; I have just created this problem right now, so I do not know for sure. I have taken a very simple problem: 3 equations and 2 parameter unknowns. Actually there are 5 unknowns, e₁, e₂, e₃, θ₁, θ₂, but as far as the parameters are concerned there are 2 unknowns, θ₁ and θ₂. If I write it like this, do you get some more insight? Let us call the first column vector v₁ and the second v₂. When can you solve this exactly, when will the error be zero? Can you say something about the span? Do not forget the last quiz: all possible linear combinations of v₁ and v₂, what will they give you? They span a plane passing through the origin. So when will the error be zero? Say it louder: when the vector u on the left-hand side lies in that plane, the span of these two vectors; any vector in this 2-dimensional plane can be generated by a linear combination. And if u does not lie in the plane, where is it? Outside. So picture it: this is my u vector, and it does not lie in the plane. So what do I need to find? A least square approximation. Ultimately I am finding an approximation θ₁v₁ + θ₂v₂, and where will that approximation lie? A vector of the form θ₁v₁ + θ₂v₂ cannot leave this plane; it is a linear combination of these two vectors. Now, in your school geometry you studied this problem: the point in a plane closest to a given point. How do you get it? Drop a perpendicular. I just want to show that what we have done so far by so-called least squares is nothing but dropping a perpendicular. The point you hit when you drop the perpendicular is the best approximation in the least square sense; I will show that this point is exactly the least square approximation of the vector u in this plane. I want the best approximation of this vector in this plane in the least square sense, and from school geometry I know: just drop a perpendicular, and the foot of the perpendicular is the point of the plane closest to u. And what do I actually get by solving for the least square θ? θ_LS = (A^T A)^{-1} A^T u, and this particular approximation, let us call it û, is û = A θ_LS.
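The numbers in the board example are as I have reconstructed them (they are arbitrary in any case); the sketch below computes θ_LS and û for that 3 × 2 problem and verifies the "drop a perpendicular" claim numerically.

```python
import numpy as np

# The 3-equations, 2-unknowns example: u = theta1*v1 + theta2*v2 + e.
u = np.array([1.0, 2.0, 3.0])
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 1.0, 0.0])
A = np.column_stack([v1, v2])

theta_ls = np.linalg.solve(A.T @ A, A.T @ u)   # (A^T A)^{-1} A^T u, W = I
u_hat = A @ theta_ls                           # foot of the perpendicular
e = u - u_hat                                  # nonzero: u is not in the plane

# "Drop a perpendicular": the error is orthogonal to both spanning vectors.
print(e @ v1, e @ v2)   # both 0 (up to rounding)
```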
So the error is e = u − A θ_LS = u − û, and if I complete the parallelogram, the law of vector addition, this error vector in the least square approximation is perpendicular to the plane; this you know from your school geometry. I am just generalizing that result. What is very elegant is that the same result holds in any inner product space, in any Hilbert space: the result you know from school geometry, finding the minimum distance of a point from a plane, carries over. Now suppose you had 10 equations in two unknowns. I cannot visualize in 10 dimensions, but what do the linear combinations of two vectors in 10 dimensions look like? Still a plane, like this one. If I were a creature in 10 dimensions, it might be possible for me to visualize a 9-dimensional hyperplane, but we cannot; still, it would look something like this, and what you are doing is finding the point in the plane at minimum distance from a point lying outside the plane. Geometrically, what you are doing is called a projection: I am projecting this vector onto this plane. You have probably done projections in your engineering drawing, so projection is something you know from your engineering education, or right from school. So even though you cannot visualize the 10-dimensional case, conceptually it is no different; if you were able to visualize it, it would look almost the same. So I just want to show that if I proceed through these geometric ideas, I will get the same formula; that is my next task. Let me do a general derivation. In my notes there is a derivation with only one vector, which means the distance of a point from a line; I start with that and then generalize to the distance of a point from a subspace. In general there could be three vectors, so let me modify this problem, with a left-hand side that is not coming from some physical problem; I am just arbitrarily creating a set of equations.
Let us say I have this modified problem with three columns. Now what can you say? The least square approximation will lie in the subspace spanned by these three column vectors, and the component outside that subspace is given by the error. So geometrically speaking, we are able to split a vector into two components, one lying in the subspace and one orthogonal to the subspace; we are able to split the vector through least squares. What is the subspace? The subspace is spanned by linear combinations of the columns, and the error components e₁, ..., e_n give me the part outside the subspace; together they form the whole vector. Now I want to generalize this and show that we recover what we derived, but I am going to take the case W = I, the weighting matrix equal to the identity; I am not going to complicate life, and that will give me some handle on the problem. So, in my notes I am calling these column vectors a₁, a₂, a₃. In general my original model is u = Aθ + e, and I write my A matrix in terms of its columns as A = [a₁ a₂ ... a_m]; there are m columns, and they are column vectors (in this example, 3 columns). So I can write the model as the vector equation

u = θ₁a₁ + θ₂a₂ + ... + θ_m a_m + e.

Now, to simplify life, let us take the case of only 3 vectors; the generalization to m is not so difficult. So

u = θ₁a₁ + θ₂a₂ + θ₃a₃ + e,

the case m = 3: there are 3 parameters θ₁, θ₂, θ₃, and I want to find their least square estimates. I am going to call the vector

p = θ₁a₁ + θ₂a₂ + θ₃a₃

the projection vector. It is a linear combination of a₁, a₂, a₃, so this p vector has to lie in the subspace spanned by a₁, a₂, a₃; it is the projection of u onto that subspace. What is it that I want to minimize, and how do I find this? I find it by minimizing the square of the distance: find θ such that the 2-norm of the error is minimum. So I want to minimize

φ = ||e||² = ||u − p||²,

and since this is a 2-norm, we are working in an inner product space; the 2-norm squared is related to the inner product:

φ = ⟨u − p, u − p⟩ = ⟨u − (θ₁a₁ + θ₂a₂ + θ₃a₃), u − p⟩,

where in place of the second p I can again substitute θ₁a₁ + θ₂a₂ + θ₃a₃; it will just be a longer expression.
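To see the splitting concretely, here is a short sketch with made-up columns a₁, a₂, a₃ and W = I: least squares hands back a component p inside the span and a component e orthogonal to it.

```python
import numpy as np

# Made-up columns a1, a2, a3 (linearly independent) and a data vector u.
a1 = np.array([1.0, 0.0, 1.0, 2.0])
a2 = np.array([0.0, 1.0, 1.0, 0.0])
a3 = np.array([1.0, 1.0, 0.0, 1.0])
A = np.column_stack([a1, a2, a3])
u = np.array([3.0, 1.0, 4.0, 1.0])

theta = np.linalg.solve(A.T @ A, A.T @ u)
p = A @ theta      # component of u inside span{a1, a2, a3}
e = u - p          # component of u orthogonal to the subspace

# The two components reassemble u, and e is orthogonal to every column.
print(np.allclose(p + e, u), np.allclose(A.T @ e, 0.0))   # True True
```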
Now how do you find the minimum; what is the necessary condition? We will have to expand this, but before expanding let us apply the necessary condition. My necessary conditions for optimality are

∂φ/∂θ₁ = 0,  ∂φ/∂θ₂ = 0,  ∂φ/∂θ₃ = 0.

These are my three conditions; is everyone with me on this? If I differentiate this φ with respect to θ₁, what will I get? Just look at it: you are differentiating only once, and when you differentiate the terms involving the first vector, only a₁ remains. You can check this by patiently expanding the entire inner product, element by element, using the rules for expanding inner products; you will get many terms. I am skipping the in-between steps and writing the final result:

∂φ/∂θ₁ = 0  gives  ⟨a₁, u − p⟩ = 0,

where p is my projection vector. What is u − p? The error vector. So what this says is that the error vector is perpendicular to the direction a₁. The same thing with θ₂: setting ∂φ/∂θ₂ = 0 gives ⟨a₂, u − p⟩ = 0. And the third equation, from ∂φ/∂θ₃ = 0, is ⟨a₃, u − p⟩ = 0. So what is the geometric meaning of this? The error vector is perpendicular to a₁, a₂ and a₃. And if the error vector is perpendicular to a₁, a₂, a₃, it will be perpendicular to any linear combination of a₁, a₂, a₃. What are all linear combinations of a₁, a₂, a₃? The span of a₁, a₂, a₃, the subspace. So the error vector is perpendicular to the span of a₁, a₂, a₃, which means the best approximation of u in the span of a₁, a₂, a₃ is obtained just by dropping a perpendicular. This is just the generalization of the result from your school geometry, dropping a perpendicular from a point to a plane: the foot of the perpendicular is the best approximation of that vector in the plane. That is all we are doing, generalized to a general function space, a general inner product space. Now, when is this possible? Only when you have an inner product defined; that is very, very important. Why do we like least square approximation so much, as against, say, 1-norm or infinity-norm approximation? Because least square approximation comes attached with an inner product, and the inner product can be related to geometry: you can talk about perpendicularity, you can talk about projections, the same idea of projections you use in three dimensions. Distance of a point from a plane: just drop a perpendicular. The same idea is at work here.
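If you want to verify the differentiation steps I skipped, here is a symbolic check (sympy, with the same made-up vectors as above) that each stationarity condition ∂φ/∂θᵢ = 0 is exactly −2⟨aᵢ, u − p⟩ = 0.

```python
import sympy as sp

t1, t2, t3 = sp.symbols('theta1 theta2 theta3')
a1 = sp.Matrix([1, 0, 1, 2])
a2 = sp.Matrix([0, 1, 1, 0])
a3 = sp.Matrix([1, 1, 0, 1])
u = sp.Matrix([3, 1, 4, 1])

p = t1*a1 + t2*a2 + t3*a3        # projection vector
phi = (u - p).dot(u - p)         # phi = <u - p, u - p>

# d(phi)/d(theta_i) = -2 <a_i, u - p>, so each sum below simplifies to 0.
for ti, ai in [(t1, a1), (t2, a2), (t3, a3)]:
    print(sp.simplify(sp.diff(phi, ti) + 2*ai.dot(u - p)))   # 0, 0, 0
```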
So all that we have proved is this: if I redraw the figure, with the plane spanned by, say, a₁ and a₂, and a vector u, then p is the projection vector, and the error vector e = u − p is perpendicular to the plane. This is the special vector with that property: if I take any other approximating vector in the plane, the resulting error will not be perpendicular. There are so many ways of approximating u in this plane: I could approximate here, or here, and somebody could call any of those an approximation. What is the least square approximation? The perpendicular one. Now I need to go back from here and show that starting from these three equations, what you get is nothing but the same (A^T A)^{-1} A^T formula; I have only shown the geometry so far. How do I get θ₁, θ₂, θ₃? By solving the three equations

⟨aᵢ, u − (θ₁a₁ + θ₂a₂ + θ₃a₃)⟩ = 0,  i = 1, 2, 3.

In general, in n dimensions with real-valued vectors, what is the inner product? Simply ⟨a, b⟩ = a^T b for two vectors a and b. So the first equation, skipping one step in between, I can rewrite as

a₁^T u = θ₁ a₁^T a₁ + θ₂ a₁^T a₂ + θ₃ a₁^T a₃:

a₁ with a₁, a₁ with a₂, and a₁ with a₃. What is the second equation? ⟨a₂, u − p⟩ = 0. What will it give you?

a₂^T u = θ₁ a₂^T a₁ + θ₂ a₂^T a₂ + θ₃ a₂^T a₃.

By the way, is a₁^T a₁ a scalar? Always a scalar. Is a₁^T a₂ a scalar? Always a scalar. Just write the third equation following the same pattern:

a₃^T u = θ₁ a₃^T a₁ + θ₂ a₃^T a₂ + θ₃ a₃^T a₃.

Is everyone with me on this? I have just written out the equations I started with geometrically: the error vector is perpendicular to a₁, perpendicular to a₂, and, in this case, perpendicular to a₃. How many equations and how many unknowns? Three equations and three unknowns. What are the three unknowns? θ₁, θ₂, θ₃.
So what do I get if I collect this in vector-matrix form?

[ a₁^T a₁  a₁^T a₂  a₁^T a₃ ] [θ₁]   [ a₁^T u ]
[ a₂^T a₁  a₂^T a₂  a₂^T a₃ ] [θ₂] = [ a₂^T u ]
[ a₃^T a₁  a₃^T a₂  a₃^T a₃ ] [θ₃]   [ a₃^T u ]

Just check: if I start with A = [a₁ a₂ a₃], with these three columns of the A matrix, it is very easy to verify that this 3 × 3 matrix is nothing but A^T A. The algebra ties in with the geometry very nicely. And look at the right-hand side: it is a₁^T u, a₂^T u, a₃^T u stacked, which is A^T u. So algebraically we arrive at the condition

A^T A θ = A^T u,

which is exactly the result we got purely by doing the algebra of the necessary condition for optimality; this θ is the least square solution. I started with the geometric argument of projection onto a subspace and was able to recover the least square formula A^T A θ = A^T u. What is the least square estimate? This θ; of course, since we call it an estimate, we give it a hat. There is no unique way of finding θ in general; there are multiple ways of approximating, but the least square estimate is unique. How is it computed? From A^T A θ = A^T u, and if the columns of A are linearly independent, A^T A is always invertible: a square, symmetric, positive definite, invertible matrix, a very nice matrix. This matrix, which we study in linear algebra, appears in a very practical application: we are trying to fit a curve, or a function, to some data. So this particular result, A^T A θ_LS = A^T u, we could recover purely through geometric arguments of projection. We said that basically what is happening is that we are finding the vector in the plane spanned by a₁, a₂ that is at the closest distance from a vector lying outside this plane. Why do we need least squares? Because u is not lying in this plane. What would happen if u were lying in this plane? The error would be 0. And what is the extreme situation? u perpendicular to the plane: then the best approximation is 0. If u is perpendicular, you have probably done the modeling wrong; the error should be small, not the whole vector. So this is a very nice result; the algebra ties in with the geometry.
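And to tie the 3 × 3 system back to A^T A numerically: the sketch below (same made-up columns as before) rebuilds the matrix of inner products entry by entry, checks that it equals A^T A, and confirms positive definiteness via a Cholesky factorization.

```python
import numpy as np

a1 = np.array([1.0, 0.0, 1.0, 2.0])
a2 = np.array([0.0, 1.0, 1.0, 0.0])
a3 = np.array([1.0, 1.0, 0.0, 1.0])
A = np.column_stack([a1, a2, a3])

# Entry (i, j) of the 3x3 matrix is a_i^T a_j; it is exactly A^T A.
G = np.array([[ai @ aj for aj in (a1, a2, a3)] for ai in (a1, a2, a3)])
print(np.allclose(G, A.T @ A))   # True

# Cholesky succeeds only for symmetric positive definite matrices, which
# A^T A is whenever the columns of A are linearly independent.
np.linalg.cholesky(A.T @ A)
```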
There is actually one more angle to all of this. We derived this result through algebra, we derived it through geometry, and we can derive the same result through statistics, where you will get those summations you are familiar with. I am not going to go into the statistical interpretation here; I will upload my notes on it. If I went into the statistical interpretation of this result, it would take at least two weeks, because I would have to introduce so many concepts, but finally you would derive the same result, and you would get a different insight from the statistics viewpoint, just as I got a different insight into the same result from the geometric viewpoint. The algebra did not tell me much: it just said set the derivative equal to zero. Here I can relate it to my school geometry, and that is very important. That is the beauty of this result: you can actually show that the least square estimate obtained algebraically is nothing but the projection of a point onto a subspace, finding the vector in a subspace that is at minimum distance from a point outside the subspace. That is why we work with inner product spaces, Hilbert spaces: the notion of angle comes attached with the inner product, so you can talk of orthogonality. All these things are not possible in other normed spaces, say 1-norm spaces, and that is why Hilbert spaces are so special. In the next lecture we will continue with some more algebraic properties of least squares; I will show something more, and then we will move to a variety of engineering applications, namely function approximation and also solving partial differential equations and boundary value problems; we will revisit our problems again through this least squares approximation.