Now we will begin our study of the first class of optimization problems: optimization of a function over an open set. What this means is that the feasible region is an open set, and the objective function will be assumed to be differentiable. Since we are going to talk about differentiable functions, I need to tell you a little bit about my notation for derivatives, second derivatives and so on. Let f be a function from Rⁿ to R. Any x in Rⁿ will be written with its coordinates as x = (x₁, …, xₙ)ᵀ; that is, every vector x in Rⁿ is interpreted as a column vector. When I write something like f_x(x̂), this means the derivative of f evaluated at x̂. The subscript x in this notation simply says that the derivative is with respect to x. This will be important when we are talking about functions of more than one variable, where we may want to take the derivative with respect to only one of the variables. Another notation for the same thing is ∂f/∂x evaluated at x̂. Now the important thing to note is that this quantity f_x(x̂) is a row vector, defined as follows: you take the partial derivatives of f with respect to each of the components of x, arrange them into a row vector, and evaluate all of them at x̂. So f_x(x̂) = [∂f/∂x₁  ∂f/∂x₂  …  ∂f/∂xₙ] evaluated at x̂. There is a related quantity, denoted ∇f(x̂) or, to be more explicit about the variable we are differentiating with respect to, ∇ₓf(x̂), and that is simply the transpose of this derivative: a column vector, called the gradient of f at x̂. Once again, the subscript x simply denotes that the gradient is with respect to the variable x.
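As a quick sketch of this convention (my own illustration, not from the lecture), here is how one might compute the row-vector derivative f_x(x̂) numerically by finite differences and take its transpose to get the column-vector gradient; the function f and the point x̂ are made-up examples:

```python
import numpy as np

def f(x):
    # illustrative scalar-valued function f: R^2 -> R (an assumption for the demo)
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def f_x(f, x_hat, eps=1e-6):
    """Central-difference approximation of f_x(x_hat): a ROW vector of partials."""
    n = x_hat.size
    row = np.zeros((1, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        row[0, i] = (f(x_hat + e) - f(x_hat - e)) / (2 * eps)
    return row

x_hat = np.array([1.0, 2.0])
row = f_x(f, x_hat)   # shape (1, n): the derivative f_x(x_hat) is a row vector
grad = row.T          # shape (n, 1): the gradient is its transpose, a column vector
print(row)            # approximately [[8., 3.]], since df/dx1 = 2x1 + 3x2, df/dx2 = 3x1
print(grad)
```

The shapes make the row-versus-column distinction explicit, which matters once we stack these rows into matrices below.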
Suppose I have a function f from Rⁿ × Rᵐ to R, so the x variable lives in Rⁿ and the y variable lives in Rᵐ. Then if I write something like f_x(x̂, ŷ), can someone tell me: is this a row vector or a column vector? It is a row vector, with n components, because I am taking the derivative with respect to x here and x has n components. So f_x(x̂, ŷ) = [∂f/∂x₁ … ∂f/∂xₙ] evaluated at (x̂, ŷ), also denoted ∂f/∂x. Now, in both of these cases, here as well as earlier, f was a function that mapped into R; f itself did not have multiple components, it had just one, so f was a scalar-valued function. What if f is vector-valued? Suppose f is a function from Rⁿ to some Rᵐ, and we think of every point in the image of f as itself a column vector. So f should be thought of like this: f(x) = (f₁(x), f₂(x), …, fₘ(x))ᵀ; it has m components. Now if I take the derivative of f with respect to x, what I need to do is take the derivative of each of these components with respect to x. Each component of f is scalar-valued, so its derivative is a row vector as before, and what you get by stacking these row vectors one below the other is an m × n matrix: f_x(x̂) = [∂fᵢ/∂xⱼ] evaluated at x̂, for i = 1, …, m and j = 1, …, n. There are multiple names for this; one of them is the derivative of f at x̂.
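To make the stacking concrete, here is a small numerical sketch (my own illustration, with the components of f chosen arbitrarily) that builds this m × n matrix out of the finite-difference derivatives of each component fᵢ:

```python
import numpy as np

def f(x):
    # illustrative vector-valued function f: R^2 -> R^3 (an assumption for the demo)
    return np.array([x[0] * x[1],
                     x[0] ** 2,
                     np.sin(x[1])])

def jacobian(f, x_hat, eps=1e-6):
    """Stack the derivative row vectors of each component f_i into an m x n matrix."""
    m = f(x_hat).size
    n = x_hat.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        # central difference in the j-th coordinate fills the j-th column of J
        J[:, j] = (f(x_hat + e) - f(x_hat - e)) / (2 * eps)
    return J

x_hat = np.array([1.0, 0.0])
J = jacobian(f, x_hat)   # 3 x 2 matrix: row i is the derivative of f_i at x_hat
print(J)
```

Row i of J is exactly the row vector (fᵢ)_x(x̂) from the scalar-valued case, placed one below the other.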
Another name for it is the Jacobian of f at x̂. We can also write a higher order derivative; this is what is called the Hessian. The Hessian is the derivative of the derivative, but let me put it more carefully: I take the derivative of the gradient and evaluate that at x̂. So one way of writing it is f_xx(x̂) = (∇ₓf)_x evaluated at x̂. Because the gradient is a column vector with n components, taking its derivative gives me a matrix again, the n × n matrix of second partial derivatives: f_xx(x̂) = [∂²f/∂xᵢ∂xⱼ] evaluated at x̂, for i, j = 1, …, n. Another notation for the same thing is simply ∇²f(x̂), alright. Along with derivatives, we also need an important theorem that pertains to derivatives and approximations of differentiable functions. This theorem is probably known to you in some form or another already: it is Taylor's theorem. Taylor's theorem basically says the following: take a differentiable function, fix a reference point, and look at the value of the function very close to that reference point. Then, close to the reference point, the value of the function is very well approximated by a linear function that you can construct from the value of the function at the point and the derivative of the function at the point. What does "well approximated" mean, and in what sense does the approximation hold? That is what is made precise by Taylor's theorem, okay.
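As a numerical sketch of the Hessian (again my own illustration, with f chosen for the demo), one can approximate the matrix of second partials by central differences and check that it comes out symmetric:

```python
import numpy as np

def f(x):
    # illustrative scalar-valued function (an assumption for the demo)
    return x[0] ** 2 * x[1] + x[1] ** 3

def hessian(f, x_hat, eps=1e-4):
    """Central-difference approximation of the n x n matrix d^2 f / (dx_i dx_j)."""
    n = x_hat.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x_hat + ei + ej) - f(x_hat + ei - ej)
                       - f(x_hat - ei + ej) + f(x_hat - ei - ej)) / (4 * eps ** 2)
    return H

x_hat = np.array([1.0, 2.0])
H = hessian(f, x_hat)   # n x n, and symmetric when the second partials are continuous
print(H)
```

For this f the exact Hessian at (1, 2) is [[2x₂, 2x₁], [2x₁, 6x₂]] = [[4, 2], [2, 12]], which the approximation matches closely.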
And the main idea is basically this: if your function is differentiable and you want to know how it behaves near a point, you have some reference point and you want to know how the function behaves near it, then the theorem tells you that a linear approximation can be obtained, tells you what that approximation is, and tells you this in a very precise sense. The theorem is the following. Let f: Rⁿ → R be differentiable and let a be a point in Rⁿ; this is my reference point. What I want to know is how the function behaves at another point. Let me denote that other point by y rather than x, to keep it distinct from the variable. Then the theorem says there exists a function h such that

f(y) = f(a) + f_x(a)(y − a) + h(y)‖y − a‖,

where f_x(a) is the row vector derivative of f with respect to x evaluated at a (equivalently ∇f(a)ᵀ), and, this is the important part, h(y) tends to 0 as y tends to a, okay. Now I want you to appreciate what this theorem is actually saying; it will become evident as we go into the next main result as well, but just for clarity I want you to see it now. As y tends to a, of course it is true that f(y) and f(a) come close to each other; f(y) will approach f(a). That by itself is not saying anything here, okay.
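As an illustrative numerical check of the theorem (my own sketch, with f, the reference point a, and the direction chosen arbitrarily), one can watch the remainder coefficient h(y) = (f(y) − f(a) − ∇f(a)ᵀ(y − a)) / ‖y − a‖ shrink as y approaches a:

```python
import numpy as np

def f(x):
    # illustrative differentiable function (an assumption for the demo)
    return np.exp(x[0]) + x[0] * x[1] ** 2

def grad_f(x):
    # its exact gradient, written as a 1-D array
    return np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])

a = np.array([0.0, 1.0])
d = np.array([1.0, -2.0])   # an arbitrary fixed direction; y = a + t*d approaches a as t -> 0

hs = []
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    y = a + t * d
    lin = f(a) + grad_f(a) @ (y - a)            # the linear approximation from Taylor's theorem
    h = (f(y) - lin) / np.linalg.norm(y - a)    # the remainder coefficient h(y)
    hs.append(abs(h))
print(hs)   # decreasing toward 0 as y -> a
```

The printed values shrink roughly in proportion to t, which is exactly the h(y) → 0 statement: the error of the linear approximation is smaller than first order in ‖y − a‖.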
What this theorem is saying is the following: as y tends to a, if you look at f(y) − f(a) and divide that whole difference by y − a (take n = 1 for the moment, so this is an ordinary quotient), then that quotient starts behaving more and more like f_x(a). It is of course true that f(y) tends to f(a) as y tends to a; what this theorem is actually telling you is how fast f(y) tends to f(a). It is giving you a measure on this difference; it is telling you that

(f(y) − f(a)) / (y − a) = f_x(a) + h(y),

where h(y) is a quantity that becomes smaller and smaller, close to 0, as y tends to a. Is this clear? So the theorem is not only telling you that the function values near a are close to f(a), which we already know, but also how fast they approach f(a), okay. They approach f(a) at a rate that is linear in y − a, and the constant of linearity is roughly f_x(a); precisely, it is f_x(a) plus this h(y), where h(y) becomes smaller and smaller as y comes close to a. All of optimization is about how fast different quantities converge, okay. The relative rates at which different quantities converge are something we keep exploiting all the time in optimization. That is why an estimate like this one, which comes from Taylor's theorem, is, you could say, a cornerstone of optimization, okay.
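For a one-variable illustration of this rate (my own numerical sketch, not part of the lecture, with f chosen for the demo), the quotient (f(y) − f(a))/(y − a) settles onto f_x(a) + h(y), with h(y) vanishing linearly in y − a:

```python
import math

def f(x):
    # illustrative one-variable function (an assumption for the demo)
    return math.exp(x)

a = 0.0
fprime_a = 1.0   # exact derivative of exp at 0

errors = []
for t in [1e-1, 1e-2, 1e-3]:
    y = a + t
    quotient = (f(y) - f(a)) / (y - a)       # behaves more and more like f'(a)
    errors.append(abs(quotient - fprime_a))  # this is |h(y)|, shrinking roughly like t/2
print(errors)
```

Each tenfold step of y toward a shrinks |h(y)| by roughly a factor of ten, which is the "rate linear in y − a" that the lecture emphasizes.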