So, now we want to look at the problem of approximating a function using least squares approximation, or optimization. As I said in my last lecture, this is different from what we have looked at till now. We have been looking at interpolation, and in interpolation you wanted the approximating polynomial to pass through every point. The difference now is that I want a lower order polynomial, which may not pass through every point, but which is the best fit in some sense. So let me first formulate the problem, then we will move on to necessary and sufficient conditions for optimality, and then we will come back to the solution; we need to do some extra work in between before we actually solve the problem.

Now, why least squares and why not something else? You have probably used least squares in your undergraduate programme, in experimental methods, for fitting curves and fitting lines and so on. Why do we not use the sum of absolute errors? Does the 2-norm have something special? After teaching you about norms, am I telling you that only the 2-norm defines distance? No. But the definition of angle comes free with the 2-norm: you buy the 2-norm, you get angle free. That is the advantage. You do not get orthogonality and all those definitions when you use the 1-norm or the infinity-norm. One could very much formulate the problem as fitting a function in the 1-norm or in the infinity-norm instead of least squares, but the 2-norm has something special. We are going to see why, but before that we have to do some work.

So let me restate the problem. I have some points, which need not be equispaced, at which I know the values of a function u(z). The function is defined over some domain; it can be 0 to 1, or some a to b, it does not matter. I know the values of this function at different locations: the dependent variable is u_i = u(z_i), where the index i goes from 1 to n. The main difference now is that I do not want to fit a polynomial of order n or n − 1; I want to fit a polynomial of a lower order.
So, a typical problem is that I want to fit a polynomial p_m(z) = α_0 + α_1 z + α_2 z^2 + ... + α_m z^m. Earlier, in interpolation, we said that the polynomial should pass through every point; I cannot say that now. What I am going to say now is that the error e_i = u_i − p_m(z_i), which is nothing but u(z_i) − p_m(z_i), has to be small in some sense. But this error is not defined at just one point; it is defined for i = 1 to n, so you have a vector of errors, not one error. This n could be a large number of points: n could be 100, n could be 1000. I know the function value at a large number of points, but I want to fit a polynomial of order 2 or order 3.

A classic example from chemical engineering is c_p as a function of temperature. We fit c_p as a + bT + cT^2, or sometimes as a + bT + cT^2 + dT^3. There is no unique way of choosing: depending upon what range of c_p values and what range of temperatures you are considering, the choice of polynomial will differ. You may have 100 values of c_p: c_p,1 at temperature T_1, c_p,2 at temperature T_2, and so on up to c_p,n at temperature T_n. These are different measurement points; I know c_p at a large number of temperatures.

Now, if I actually try to fit a quadratic through all of these points, I will have a problem: there are more equations than unknowns. The number of coefficients is m + 1, but i goes from 1 to n, not 1 to m. When I write the model equation, it is not exact; there is actually one more term missing. Let us concentrate right now on the quadratic; the cubic we will worry about later. What I have to say is that c_p equals the quadratic plus an error: this is not an exact representation, it is an approximation, and this e is the approximation error. Why am I allowed to fit a polynomial at all? Because the Weierstrass theorem applies: a continuous function can be approximated by a polynomial function. That is why I am fitting a polynomial. But the fit is not exact, so if I take the data points and start writing, I get

c_p,1 = a + bT_1 + cT_1^2 + e_1
c_p,2 = a + bT_2 + cT_2^2 + e_2
c_p,3 = a + bT_3 + cT_3^2 + e_3
...
c_p,n = a + bT_n + cT_n^2 + e_n.

The polynomial gives you most of the variation, but each equation carries some error. These are my equations.
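To make the counting concrete, here is a minimal sketch in Python. The data are synthetic and the names T and cp are my own, purely for illustration: stacking the n quadratic-fit equations gives a tall matrix with far more rows (equations) than coefficient columns.

```python
import numpy as np

# Hypothetical cp-versus-temperature "measurements" (synthetic data;
# the coefficients and noise level are arbitrary illustration values).
rng = np.random.default_rng(0)
T = np.linspace(300.0, 800.0, 100)                     # T_1 ... T_n
cp = 1.0 + 2e-3 * T + 5e-7 * T**2 + 0.01 * rng.normal(size=T.size)

# Quadratic model cp_i = a + b*T_i + c*T_i^2 + e_i, i = 1..n.
# In matrix form: cp = X @ [a, b, c] + e, one row per data point.
X = np.column_stack([np.ones_like(T), T, T**2])

print(X.shape)   # (100, 3): n = 100 equations, only 3 coefficients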
Can I solve these equations? Is there a problem here? How many equations and how many variables do I have? I have n equations: the number of equations equals the number of data points. How many variables? n + 3. What are they? The errors e_1, e_2, ..., e_n, plus a, b and c. So there are three unknown coefficients, a, b, c, plus the errors, which are also unknown. Now the trouble is, how do I solve this? There are infinitely many solutions. Why? If I fix a, b, c by some means, then I get one value each for e_1, e_2, ..., e_n, because once I specify a, b, c I have n equations in n unknowns and I can solve them. The trouble is how to fix a, b and c. Why not make the plain sum of all the errors small? Not a great idea: is the plain sum a norm? It might happen that the errors all sum up to 0 while the individual errors are large, so a zero sum does not mean a good fit.

I need to generate three more equations to complete the problem: I have n + 3 variables and n equations, so to make the problem exact I need three more equations. What are these equations? So let us look at how to formulate the problem; I can formulate it in multiple ways. I can define an index φ_2 = e_1^2 + e_2^2 + ... + e_n^2, which I can write as eᵀe, where e is the vector (e_1, e_2, ..., e_n). Then I propose: find a, b, c such that you minimize φ_2 with respect to a, b, c, so that the sum of the squares of the errors is as small as possible. This is the individual error in each equation, squared and summed up; I want the values that give me the minimum sum of squared errors. I still do not know whether doing this is going to give me the three additional equations; I have to generate them.

This is not the only way of formulating the problem. Somebody might say: why the sum of the squares of the errors, why not the sum of the absolute errors? Nobody stops you from doing that. Define φ_1 = |e_1| + |e_2| + ... + |e_n|, which is nothing but the 1-norm of e (whereas φ_2 is the square of the 2-norm of e). Instead of minimizing φ_2, I could pose the problem as minimizing φ_1. And somebody else might say: well, I do not believe in either of these; I would like to minimize φ_∞ = max_i |e_i|, the infinity-norm of the error vector — minimize the maximum error, the maximum deviation. Any one of them is fine.
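As a small sketch of the three candidate objectives (using the same hypothetical T and cp arrays as above), each one maps the whole error vector to a single scalar:

```python
import numpy as np

def errors(coeffs, T, cp):
    """Error vector e_i = cp_i - (a + b*T_i + c*T_i^2)."""
    a, b, c = coeffs
    return cp - (a + b * T + c * T**2)

def phi_2(coeffs, T, cp):      # sum of squared errors: e^T e = ||e||_2^2
    e = errors(coeffs, T, cp)
    return e @ e

def phi_1(coeffs, T, cp):      # sum of absolute errors: ||e||_1
    return np.abs(errors(coeffs, T, cp)).sum()

def phi_inf(coeffs, T, cp):    # maximum absolute error: ||e||_inf
    return np.abs(errors(coeffs, T, cp)).max()
```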
The question is how to solve this problem. First of all, you should notice that this is an optimization problem: you want to minimize the sum of the squares of the errors with respect to these parameters. There is something different here from what you have done in your undergraduate course. Maybe some of you have done advanced things, but most of you have studied maximizing or minimizing a function of one variable. Here you have a function of three variables, so we need to generalize what we studied in undergraduate: how do I minimize a function which is multi-dimensional?

Let us push this c_p versus temperature business to the background; we will come back to it a little later. Let us look at an abstract problem now, and why just three variables — let us look at the general case. What are these functions φ_1, φ_2, φ_∞? They are scalar functions; a norm is always a scalar function. But I need not always define an objective through a norm; there could be other ways in some other problem. So in general I am worried about minimizing a scalar function φ(x) from R^m to R, from m dimensions to one dimension. This is my objective function, some objective of a vector x. I want to minimize (or maximize) φ(x) with respect to x_1, x_2, ..., x_m. Now, do not confuse this x with the error vector e: when I generalize and abstract the problem, x is the vector of unknown parameters. In this particular c_p case, m would be 3 and x would be nothing but (a, b, c). I want to minimize some function φ(x) — be it built from the 1-norm, the infinity-norm, the 2-norm, whatever — with respect to x, where in this particular case x happens to be (a, b, c).

How do we solve this problem? I need some background definitions here. Why am I talking only about the minimization problem? Because if you want to maximize φ(x), it is the same as minimizing −φ(x). So I can just talk about minimization; it includes maximization, and we do not have to worry separately about the maximization problem.

Now, the first concept that we need to know is a global minimum. A point x = x* is called a global minimum if φ(x*) < φ(x) for any other value of x (note that equality is not there). So what is the global minimum for (a, b, c)? There is some special value of (a, b, c) for which this particular objective function assumes its minimum value; there is no other value of (a, b, c) for which you will get a smaller sum of the squares of the errors. If such a point exists and you can reach it, that is called a global minimum.
Now, when you do optimization you may not be able to reach a global minimum. It may happen that you start doing a search and you end up at a local minimum. Imagine that you are standing on a mountain and there are multiple valleys: you might reach a valley which is not the global minimum, which is just a local minimum. So we also have to worry about local minima. For a local minimum we do not talk about every x in R^m; we talk about some neighbourhood of a particular point — some ε-neighbourhood (you have defined neighbourhoods earlier; think of a ball around x̄). If you can show that φ(x̄) ≤ φ(x) for any other x in that neighbourhood, then x̄ is called a local minimum. By contrast, x* is called a global minimum because anywhere you go in the space you cannot get a value of x for which φ is smaller: the smallest value of φ is obtained at the global minimum.

In one dimension it is easier to draw. Plot φ(x) versus x for a function with two valleys: one valley bottom is the global minimum, the other is only a local minimum. In some neighbourhood of the local minimum, take any point: you cannot get a value of φ less than the value at the bottom. But that is not the case if you look at a much larger domain; there is some other point where φ is much lower.

We cannot actually derive general conditions for a global minimum, and in a general optimization problem what we find is typically a local minimum. If it happens to be the global minimum in some cases, great, you have achieved what you wanted to do. But in many situations you do optimization by numerical search, and then the solution depends upon your initial guess. If you are far away from the global minimum, you may not reach it: starting with one guess, the search might hit the local minimum, and your numerical method will declare that you have reached a minimum; if you happen to take an initial guess somewhere else, maybe you will reach the global one. With numerical methods it is hard to predict what will happen; which valley you land in depends on where you start.
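A quick numerical illustration of this initial-guess dependence (the toy function here is my own example, not from the notes, and I am assuming scipy is available): a gradient-based search started in different places lands in different valleys.

```python
import numpy as np
from scipy.optimize import minimize

def phi(x):
    # Toy function with a local minimum near x = +1.1 and the
    # global minimum near x = -1.3.
    return x[0]**4 - 3 * x[0]**2 + x[0]

for x0 in (-2.0, +2.0):                  # two different initial guesses
    res = minimize(phi, np.array([x0]))
    print(f"start {x0:+.1f} -> x* = {res.x[0]:+.3f}, phi = {res.fun:+.3f}")
```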
The conditions that we are going to derive now pertain to a local minimum: what is the necessary condition for a point to be a local minimum, and what is the sufficient condition? That is what I am going to derive. To qualify a point as a local or a global minimum, I need some more definitions. See, what was the key thing when you talked about the minimum of a single-variable function? First, the derivative of φ with respect to x becomes 0 at that point: there is no change locally, the tangent is parallel to the x-axis. How do I extend this to the multi-dimensional case, where the objective function is a function of m unknowns, not just one unknown? I should derive an equivalent condition. Along the way I have to make the assumption that φ is differentiable: with the 1-norm and the infinity-norm there is trouble, they are not differentiable functions; the squared 2-norm, xᵀx, the sum of the squares, is a differentiable function. Second, what qualified one point to be a minimum and another point to be a maximum? The second derivative: when it was positive you had a minimum, when it was negative you had a maximum. Now, in this particular case, if φ(x) is twice differentiable, the first derivative with respect to x is going to be a vector, and the second derivative is going to be a matrix. So I need some additional structure which will help me qualify a point as a maximum or a minimum. This is where you need to define matrices with certain special properties: positive definite and negative definite matrices.

So before I proceed, I define a positive definite matrix, a negative definite matrix and an indefinite matrix. If you understand the geometric connections of positive definiteness and indefiniteness, these concepts will make much more sense later when you use them. Many times, in a course on linear algebra, they are introduced without connecting them to geometry: a positive definite matrix is just one which has all eigenvalues positive. But why we need this animal with all positive eigenvalues is not clear. This is where it will become clear. Well, it is not really a funny matrix; it happens to be a very nice matrix, it helps us in many ways, and it is going to help us throughout the course.

Right now I am digressing from the main theme into a little bit of linear algebra, so it may appear disconnected. The first definition is the positive definite matrix. Take a real matrix A of size m × m (not n × n — we are talking about φ(x) where x is m-dimensional). A is positive definite if xᵀAx > 0 for any x ≠ 0 belonging to R^m. The most important words here are "for any x": even if you find one nonzero x for which xᵀAx is equal to 0 or less than 0, the matrix is not positive definite. Give me any nonzero vector in the space, and xᵀAx is greater than 0; it will be equal to 0 only when x = 0. Remember that. Well, there is one more matrix that you have probably studied in your undergraduate: the positive semi-definite matrix. If the condition changes from strict inequality to xᵀAx ≥ 0 for any x, so that there are some nonzero vectors x for which xᵀAx = 0, then A is positive semi-definite. This will happen when the matrix A is singular: think about it — its columns are linearly dependent, the null space is not just the zero vector, and for any x in the null space you get xᵀAx = 0. This basic definition translates to the eigenvalues being non-negative; we will look at that a little later.

What about a negative definite matrix? A is negative definite if xᵀAx < 0 for any nonzero x belonging to R^m: any nonzero vector you give me, xᵀAx will be strictly less than 0. And the fourth one, of course, is negative semi-definite: xᵀAx ≤ 0 for any x belonging to R^m. This is just the background work that I need to proceed further.
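Here is a small sketch that classifies a symmetric matrix by the signs of its eigenvalues — equivalent to testing the quadratic form xᵀAx over all nonzero x, with the eigenvalue connection taken up later in the course. The function name and tolerance are my own choices.

```python
import numpy as np

def definiteness(A, tol=1e-12):
    """Classify a symmetric matrix A from the signs of its eigenvalues
    (equivalent to testing x^T A x for every nonzero x)."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam >= -tol):
        return "positive semi-definite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam <= tol):
        return "negative semi-definite"
    return "indefinite"

print(definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
print(definiteness(np.diag([1.0, -1.0])))                 # indefinite
print(definiteness(np.array([[1.0, 1.0], [1.0, 1.0]])))   # positive semi-definite (singular)
```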
Well, the necessary condition for optimality is something that I would like to quickly derive in the class. I am not going to write every step, because it is there in the notes; I will just give you an outline of how it is done. The necessary condition is given by a theorem. First of all, we assume that φ(x) is twice differentiable; only then can we proceed. If you want to know where this appears in the notes, it is on page 80, in the appendix, Section 8.2. To make arguments about local optimality I am going to use the Taylor series approximation; the Taylor series is one of the main tools that we use to prove many, many things. The formal statement of the theorem is given in the notes; I am just stating the main result. If x = x̄ is to qualify as a minimum or a maximum — we do not yet know which it is to be, so to be precise we call it a stationary point; it could be a minimum, a maximum, or neither, depending upon some more conditions — then the gradient of φ with respect to x must vanish:

∂φ/∂x_1 = ∂φ/∂x_2 = ... = ∂φ/∂x_m = 0.

(We are working in m-dimensional space; remind me if I keep writing n.) If φ is a twice differentiable function, what we can prove is that the necessary condition for optimality is that the first derivative of φ with respect to each of the variables is equal to 0. What is the specific condition for the c_p problem? It will be

∂φ/∂a = 0, ∂φ/∂b = 0, ∂φ/∂c = 0.

This particular set of equations is the necessary condition for optimality. Well, I started by saying that for the c_p problem we need three extra equations, and here they are: ∂φ/∂a = 0, ∂φ/∂b = 0, ∂φ/∂c = 0. In general, if there are m such parameters with respect to which we want to optimize, then we have m equations coming from setting the first-order derivatives exactly equal to 0 at the optimum point.
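To see concretely where the three extra equations lead, here is a hedged sketch for the quadratic c_p fit (same hypothetical X and cp construction as before; the detailed derivation for the polynomial case comes later). For φ = eᵀe with e = cp − Xθ, the condition ∂φ/∂a = ∂φ/∂b = ∂φ/∂c = 0 reduces to three linear equations, XᵀXθ = Xᵀcp, often called the normal equations.

```python
import numpy as np

def fit_quadratic(T, cp):
    """Least-squares fit of cp = a + b*T + c*T^2.
    Setting dphi/da = dphi/db = dphi/dc = 0 for phi = e^T e with
    e = cp - X @ theta gives the 3 linear equations
    (X^T X) theta = X^T cp."""
    X = np.column_stack([np.ones_like(T), T, T**2])
    theta = np.linalg.solve(X.T @ X, X.T @ cp)
    return theta                      # theta = [a, b, c]
```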
Now, the proof of this goes by contradiction. What we do is allow a particular variable to vary in only one direction, keeping all the other values constant — say along x_1 or x_2 — and we assume that this condition does not hold while the point is still a minimum. What you see is that if you assume the derivative is not 0 and the point is a minimum, you are led to a contradiction: the two cannot both be true. You should read the full proof in the notes; the way it is done is this. I take a small perturbation Δx, and let us say x̄ is the local minimum. Using the Taylor series expansion I can write

φ(x̄ + Δx) = φ(x̄) + Σ_{i=1}^{m} (∂φ/∂x_i)|_{x̄} Δx_i + R_2(x̄, Δx),

where R_2 is the second-order residual term. Now suppose the perturbation is along one direction only: Δx_i ≠ 0, but Δx_1 = Δx_2 = ... = Δx_{i−1} = Δx_{i+1} = ... = Δx_m = 0. Only one of them is nonzero; I am choosing to perturb only the i-th variable. You have done this kind of thing before: if you have programmed a numerical Jacobian, you keep all the variables constant and perturb just one value at a time. We did this in the programming; it is the same thing here.

Then the difference φ(x̄ + Δx) − φ(x̄) is dominated by (∂φ/∂x_i) Δx_i: if you take Δx_i to be very, very small, the second-order term is insignificant. Now tell me, what should happen if x̄ is a minimum and I move away from it? The objective function should not decrease: just imagine you are in a valley — whichever way you go from the minimum point, your height increases; it never decreases. So this difference must be non-negative for any move I make. Now take the situation that this gradient component is not 0. If ∂φ/∂x_i is negative, I can choose Δx_i positive; the product is negative, which means the difference is negative, which means I can move in one particular direction and reduce φ further, which means x̄ is not a minimum. If instead ∂φ/∂x_i is positive, I can choose Δx_i negative; the product again becomes negative, so again I can reduce φ by moving away, and again x̄ cannot be a minimum.
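The one-variable-at-a-time perturbation in this argument is exactly how a finite-difference gradient is programmed. A minimal sketch (the step size h is a typical but arbitrary choice):

```python
import numpy as np

def numerical_gradient(phi, x, h=1e-6):
    """Approximate dphi/dx_i by perturbing one variable at a time,
    keeping all other variables fixed, as in the Taylor argument."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        dx = np.zeros_like(x, dtype=float)
        dx[i] = h                     # only delta_x_i is nonzero
        g[i] = (phi(x + dx) - phi(x)) / h
    return g
```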
You can argue like this for each variable. So the only way this point can be a minimum is if the derivative with respect to every variable is zero, because if any derivative is non-zero you will be able to move a little bit and go further down, which cannot happen at a minimum. That is the necessary condition; look at the proof in the notes.

To derive the sufficient condition, we look at the second derivative. The second derivative ∇²φ here is the so-called Hessian matrix, the matrix of second partial derivatives ∂²φ/(∂x_i ∂x_j). In this case the Hessian will be an m × m matrix (I keep writing n; it is m, we are working in m-dimensional space). Why do we have to look at the Hessian? Let us go back to the Taylor expansion. We said the only way x̄ is a minimum is that all the first derivatives are zero. If all the derivatives are zero at x̄, the first-order term in the Taylor series expansion vanishes, so we have to look at the second-derivative part:

φ(x̄ + Δx) ≈ φ(x̄) + ½ Δxᵀ [∇²φ(x̄)] Δx.

This is what governs the local behaviour of the function in the neighbourhood of the stationary point: the first derivative is zero, so look at the second derivative. Note that this particular Hessian matrix is computed at x = x̄. What should happen if you move away from x̄? The objective should increase. When will it increase? If this Hessian, computed at x̄, is positive definite, then Δxᵀ[∇²φ(x̄)]Δx > 0 for every nonzero Δx: whichever way I move away from the point, the objective function value only increases. Such a point will be a minimum point. So if ∇²φ computed at x̄ is positive definite, then x = x̄ is a minimum. (If the Hessian is only positive semi-definite, the function does not decrease to second order, but the second-order test by itself is not conclusive.)
What if this matrix is indefinite? What is the meaning of indefiniteness? For some Δx the quadratic form is positive, and for some Δx it is negative. So in some directions the function will increase and in some directions it will decrease, even though the derivative is zero: you know this as a saddle point. If the Hessian is indefinite, then I cannot call the point a maximum or a minimum; it is neither. So: if the Hessian happens to be positive definite, the point is a minimum; if it happens to be negative definite, it is a maximum; and if the Hessian is neither positive definite nor negative definite — sometimes the quadratic form is positive, sometimes negative — it is an indefinite matrix and the point is a saddle point, neither a minimum nor a maximum. (As on the minimum side, the semi-definite cases are borderline; the second-order test alone does not settle them.)

So these two together are the sufficient conditions that qualify a point as an optimum. First, the gradient should be equal to zero; second, the Hessian should be positive definite for a minimum, or negative definite for a maximum. Then we can qualify a point to be an optimum point. This is the generalization of the result that you know from one dimension. In the next lecture we will apply this to the specific problem of polynomial approximation.
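For the least-squares objective itself, the sufficient condition is easy to check. A sketch (the temperature is scaled to [0, 1] here so the numerical test is well conditioned; the scaling is my choice, not from the lecture):

```python
import numpy as np

# For phi(theta) = e^T e with e = cp - X @ theta, the Hessian is the
# constant matrix del^2 phi = 2 X^T X; it is positive definite whenever
# the columns of X are linearly independent, so the stationary point
# given by the gradient condition is indeed a minimum.
t = np.linspace(0.0, 1.0, 100)                  # scaled temperature
X = np.column_stack([np.ones_like(t), t, t**2])
H = 2.0 * (X.T @ X)
print(np.all(np.linalg.eigvalsh(H) > 0.0))      # True -> positive definite
```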