Welcome everyone; we will continue with our study of optimization. I had some questions from the previous class which I want to address here. Some of you asked me about the difference between a sufficient condition and a necessary condition, so let us recall this.

We had concluded that if x* is a local minimum of an optimization problem of the form: minimize f(x) over x in an open set S, then the gradient of f at x* must be equal to 0, that is, ∇f(x*) = 0. This has to hold, and this is what we call a necessary condition: if x* is a solution, then it must satisfy this. We also derived a stronger necessary condition, written here in blue, which says that if x* is a local solution, then the Hessian of f evaluated at x*, ∇²f(x*), must be positive semidefinite. This is also a necessary condition: if x* is a local minimum, then the Hessian must be positive semidefinite.

There is a condition that goes in the opposite direction, which says that if such and such condition is satisfied, then x* is a local minimum. That is what we mean by a sufficient condition, and it is written here in red. So, a sufficient condition for a local minimum: suppose x* is a point in S, where we are in the same setting as before, S is an open set and f is a twice differentiable function. Suppose that ∇f(x*) = 0 and, moreover, that the Hessian is positive definite, which means that vᵀ∇²f(x*)v > 0 for all v ≠ 0. Then x* must be a local minimum. So these two conditions together guarantee that x* is a local minimum. But not every local minimum satisfies this; there are local minima that violate it. They must, however, satisfy the necessary conditions: the gradient must be 0 and the Hessian must be positive semidefinite.

Let me also give a quick outline of how the proof of this sufficient condition works. Remember that by Taylor's theorem we had argued that

f(x* + δh) = f(x*) + (1/2) δ² hᵀ∇²f(x*) h + o(δ²),

where δ is a small positive quantity and h is a direction in Rⁿ. How did we get here? The first-order term, the term linear in h, is δ∇f(x*)ᵀh, and it vanishes because ∇f(x*) = 0; that term disappeared and you are left with only the quadratic term. Now look at the difference between the left-hand side and the right-hand side:

f(x* + δh) − f(x*) = (1/2) δ² hᵀ∇²f(x*) h + o(δ²).

What must happen when δ is small enough? Remember, we are now arguing in the opposite direction.
We want to say that if the Hessian is positive definite, then x* must be a local minimum, which means we want to show that for δ small enough the quantity on the left-hand side is positive. The way to see that is to take δ² common:

f(x* + δh) − f(x*) = δ² [ (1/2) hᵀ∇²f(x*) h + o(δ²)/δ² ],

for δ positive and small enough. Now, when h ≠ 0, the underlined term (1/2) hᵀ∇²f(x*) h is strictly positive; this is precisely because the Hessian at x* is positive definite. And what can we say about the other term, the one I have circled? Its numerator is o(δ²), and I have divided it by δ²; by definition, o(δ²)/δ² goes to 0 as δ goes to 0. So as δ becomes small, this term eventually becomes smaller in magnitude than the first term; whether it is positive or negative does not matter to us. Consequently, whatever is in the bracket adds up to something positive for δ small enough, and outside the bracket I have δ², which is also positive. In short, we have got that the difference f(x* + δh) − f(x*) is strictly positive for every h ≠ 0 once δ is small enough, which implies that x* is a local minimum. So this condition guarantees that x* is a local minimum.
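To make this concrete, here is a minimal numerical sketch in Python. The function f, the candidate point x*, and the direction h are all made up for illustration; they are not from the lecture. It checks the first-order condition, checks positive definiteness of the Hessian through its eigenvalues, and prints the key ratio from the proof.

```python
import numpy as np

# A made-up test function: f(x1, x2) = x1^2 + x1*x2 + 2*x2^2 + x1^3.
def f(x):
    return x[0] ** 2 + x[0] * x[1] + 2 * x[1] ** 2 + x[0] ** 3

def grad_f(x):
    # analytic gradient of f
    return np.array([2 * x[0] + x[1] + 3 * x[0] ** 2, x[0] + 4 * x[1]])

def hess_f(x):
    # analytic Hessian of f
    return np.array([[2.0 + 6 * x[0], 1.0], [1.0, 4.0]])

x_star = np.zeros(2)

# First-order necessary condition: the gradient vanishes at x*.
assert np.allclose(grad_f(x_star), 0.0)

# Sufficient condition: the (symmetric) Hessian is positive definite,
# i.e. all of its eigenvalues are strictly positive.
eigvals = np.linalg.eigvalsh(hess_f(x_star))
print("Hessian eigenvalues:", eigvals)  # both > 0 here

# The key ratio from the proof: (f(x* + d*h) - f(x*)) / d^2 tends to
# (1/2) h^T Hess h > 0 as d -> 0, so the difference is eventually positive.
h = np.array([1.0, -1.0])
for d in (1e-1, 1e-2, 1e-3):
    print(d, (f(x_star + d * h) - f(x_star)) / d ** 2)
print("limit:", 0.5 * h @ hess_f(x_star) @ h)
```

In this example the o(δ²) part is the cubic term, so the printed ratios 2.1, 2.01, 2.001 visibly approach the limit 2, exactly as the argument above predicts.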
Now let us do a couple of applications of this. In the first set of applications it is evident that the optimization problem is an optimization over an open set. In the second case it will not be so evident; the problem has to be mathematically transformed into that form, and that will also lead us to our next class of optimization problems.

The first category is the problem of least squares regression. The question in this kind of problem is the following: you have a bunch of input, or independent, variables which affect an output, and we want to infer a relation between input and output. The relation is not always exact; the inputs and outputs are related to each other possibly in a noisy fashion, but we do not have a model for the noise, and there are probably many other unmodeled elements in the problem which affect the output. Consequently we do not have a perfect relationship. What we want to infer is a relationship that is good enough for predictive purposes. It is not enough that the relationship matches very well on the data that we have; rather, our goal is that it should perform well on unseen data and give us good predictions there.

I will explain this with an example. Suppose we are talking of the price of a used car. Used cars come in all kinds of varieties, but we can think that the price of a used car depends on a few input variables: for example, the age of the car, the brand or make of the car, whether it is a petrol or diesel variant, the condition of the car, the number of kilometers it has run, and so on. In the language of machine learning these are called features; features are what I was calling input variables. The price of the used car is what we want to infer using the features. So the goal is: I give you a sample used car with values for all these features, it has a certain age, I tell you its brand, whether it is a petrol or diesel car, its condition, and the number of kilometers it has run, and I want you to predict at what price it will sell in the market. Of course you cannot do this without any data, so I will also give you some previous data points. These data points give you values for the features and they tell you the price at which the car sold. This price is what we call a label: you are given some previous values of the features and also the values of the labels corresponding to them. The goal is that, using all this data, you come up with a predictor for a new sample. A new sample does not mean a brand new car; it means an unseen sample of a used car, possibly with different values of the features, and you use the predictor to estimate its price. This sort of problem is a basic problem in machine learning.

Let us introduce a bit of notation. Suppose I have n features, and I will assume they take values in a continuous space. Features of the fuel type or brand type, which are discrete, let us ignore for the moment; assume these are n features taking values in a continuous space, such as the age of the car, some way of measuring its condition, the number of kilometers it has run, and so on. So there are n real-valued features, and I am going to give you m samples, or data points. Since there are n features, we will stack all of them into a vector: for the i-th sample, the values of the features are denoted a_i(1), a_i(2), ..., a_i(n).
We will write a_i for the column vector formed by these values. The price, or label, that we have for the i-th sample is denoted b_i, and this is also a real number. Typically one also has, in addition to this, a bias feature, but I will ignore that for the moment.

So, for simplicity, the picture is this: here is my feature space, the space in which the a_i take their values. I have, say, an a_1 here with the corresponding label b_1, an a_2 with its b_2, an a_3 with its b_3, and so on; you have these m samples. Does everyone understand this? There are m samples, so m is the number of dots that I have in this picture, and n is the length of the feature vector. Now here is a new point in the feature space which I have never seen before, and I would like to know the right value of the price, the label, for it. If I want to do this kind of prediction, I need to build a model using the data that I have.

What we will posit is a linear relationship: let us posit that b depends on a in a linear fashion. What this means is that you compare the true label b_i with the value of a linear function of the features, denoted a_iᵀx. The coefficients here are the vector x = (x_1, ..., x_n); as I vary the coefficients I get a different value for aᵀx, and I am hoping to choose the coefficients in such a way that when I input a new, unseen set of features, it gives me a good approximation, a good prediction, for the price.

Now the trouble is, if you look at the sort of figure I drew here, because there are so many unmodeled elements there need not really be an exact linear relationship between the label and the features. So what one does is introduce a cost function, or loss function: you look at the mismatch, the residual a_iᵀx − b_i, and you try to find an x such that these residuals, as i goes from 1 to m, are as low as possible. This is what is called the training problem: you try to find the x's, and thereby define the linear function through those x's, so that the residuals are as small as possible. A popular way of setting this up is to write A for the matrix whose rows are the observed feature vectors of the m data points, and b for the column vector of their observed labels.
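To picture this data layout, here is a small sketch with made-up numbers; the two hypothetical features and the prices are invented purely for illustration.

```python
import numpy as np

# Made-up data. Hypothetical features per sample:
# [condition score (0-10), kilometres run in units of 10,000].
a_1 = np.array([8.0, 4.5])
a_2 = np.array([4.0, 9.0])
a_3 = np.array([9.0, 1.2])

# Stack the feature vectors as rows: A has shape (m, n) = (3, 2).
A = np.vstack([a_1, a_2, a_3])
# Observed labels: the price each car sold at (in some fixed currency unit).
b = np.array([8.1, 2.2, 10.5])
```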
So A is now a matrix in R^(m×n) and b is a vector in R^m, and because there need not really be an exact linear relationship, what we will try to do is find an x for which the residuals are collectively minimized. Now the residuals can take either sign, so it does not make sense to simply add them; you need to take a norm of them. What I am taking here is the 2-norm: I square the residuals and sum over the m samples. So I am looking for an x in Rⁿ that minimizes this particular loss. Another way of writing the same thing is to say that you minimize the L2 norm of the vector Ax − b, that is, ‖Ax − b‖₂. What this problem gives you is the optimal x. The x's have different names: in some communities they are called regression coefficients, by others they are called weights. What matters is what they define for you: once the optimal x, call it x*, has been found, then in order to predict the price of an unseen sample, say a new sample ã, all I need to do is compute ãᵀx*. This is the idea behind it.
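Here is a minimal sketch of this training and prediction step in Python, using the same made-up numbers as in the earlier sketch; np.linalg.lstsq minimizes exactly this loss, and the unseen sample ã is again invented for illustration.

```python
import numpy as np

# Same made-up data as in the earlier sketch.
A = np.array([[8.0, 4.5],
              [4.0, 9.0],
              [9.0, 1.2]])
b = np.array([8.1, 2.2, 10.5])

# np.linalg.lstsq returns an x minimizing ||A x - b||_2.
x_star, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print("regression coefficients x*:", x_star)

# Predict the label of a hypothetical unseen sample via a~^T x*.
a_tilde = np.array([6.0, 5.0])
print("predicted price:", a_tilde @ x_star)
```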
Yes, the question is about other norms. Different norms have different properties and different shapes; this goes more and more into data science and I was not prepared to discuss it right now, but I will give you the general idea. When you get a bunch of data points like this, some of them can be what you can call outliers: a data point which is well out of the trend. The general hypothesis in this sort of work is that the data comprises a trend plus a bit of noise, and we want to capture the trend but not the noise. Because the noise is random, there will occasionally be an outlier like this, and that kind of outlier can have an adverse effect on your choice of x*: because of the way x* has been found, the outlier can skew it, to the point where the noise starts playing a bigger role than the trend. Now, you probably do not know what is an outlier to begin with, so instead of hand-picking and saying this is an outlier and that is an outlier, one way out is to look for a different loss metric itself. So instead of the Euclidean norm, take a different norm, say an Lp norm: raise the absolute residuals to some power p instead of the power 2. There are also other types of loss functions, which penalize large deviations less severely so that outliers matter less, and you can choose one of them; a small sketch of this idea appears below. How far you go with this depends on the purpose of the study and the nature of the data. But this is just a simple illustration that I wanted to show you.
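Here is a sketch of that robust-loss idea, with a deliberately planted outlier added to the made-up data; the choice of an L1 loss and of a generic derivative-free solver are my illustration, not methods prescribed in the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data as before, plus a fourth sample that is a deliberate outlier.
A = np.array([[8.0, 4.5], [4.0, 9.0], [9.0, 1.2], [5.0, 5.0]])
b = np.array([8.1, 2.2, 10.5, 20.0])  # 20.0 is well out of the trend

# Squared (L2) loss versus L1 loss on the residuals A x - b.
loss_l2 = lambda x: np.sum((A @ x - b) ** 2)
loss_l1 = lambda x: np.sum(np.abs(A @ x - b))

# Nelder-Mead is derivative-free, so it copes with the non-smooth L1 loss.
x_l2 = minimize(loss_l2, x0=np.zeros(2), method="Nelder-Mead").x
x_l1 = minimize(loss_l1, x0=np.zeros(2), method="Nelder-Mead").x

print("L2 coefficients:", x_l2)  # pulled toward the outlier
print("L1 coefficients:", x_l1)  # typically influenced much less by it
```

On this toy data the L2 fit should be dragged toward the planted outlier while the L1 fit stays close to the trend of the three clean points; which loss is the right one depends, as said above, on the purpose of the study and the nature of the data.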