Now let us revisit what we mean by a convex function. We defined it in the last class: a convex function is one for which, at any given point, I can get a lower bound on it. That lower bound came from a tangent passing through the given point, and the tangent was defined in terms of a subgradient at the point w that you are interested in. So that tangent function was a lower bound on my function everywhere. But now suppose we allow ourselves a bit more flexibility: if the gap between the convex function and that linear lower bound grows fast enough, then we will have better control over the regret. So let us first define what a strongly convex function is. We say that f is strongly convex with parameter sigma, with respect to some norm, if for all u and w we have f(u) >= f(w) + <z, u - w> + (sigma/2) ||u - w||^2, where z is a subgradient of f at w. The first two terms on the right are the part you have already seen: f is convex if and only if f(u) >= f(w) + <z, u - w> holds, and f(w) + <z, u - w> is exactly the tangent passing through the point w. What we are saying now is: yes, that tangent was my initial lower bound on the function, but even if I raise the lower bound by the extra amount (sigma/2) ||u - w||^2, the whole expression continues to be a lower bound on f. And notice what this extra term does: as u moves away from w, this term only increases, so the difference f(u) - f(w) must also be growing at least that fast.
In a picture: suppose this is my convex function and I pick some point w; the tangent at w is the line f(w) + <z, u - w>. What I want now is the extra term (sigma/2) ||u - w||^2 on top of that line. Take any other point u and look at the difference f(u) - f(w). Earlier, for a plain convex function, that difference was only at least <z, u - w>; now it must additionally grow by (sigma/2) ||u - w||^2. So as you take u further away from w, you expect the gap between the function and the tangent line to keep enlarging, and enlarging means by at least this amount. Note the norm here: whatever norm the definition is stated with respect to, I take that norm of u - w and square it. Strong convexity is always defined with respect to some norm. So by bringing in sigma-strong convexity we are demanding that the separation between the function and its tangent grows much faster, and this will make it easier for us to control these functions. Now let us look at some more properties of strongly convex functions. Here is a lemma: instead of taking an arbitrary point w, let us take a specific point, namely a minimizer of the convex function.
The lemma says: if w is a minimizer of a sigma-strongly convex function f, then for all u we have f(u) - f(w) >= (sigma/2) ||u - w||^2. So instead of looking at an arbitrary point, look at the point where the function is minimized and consider the tangent passing through it. We are saying that the difference f(u) - f(w) grows at least at this rate as you move away from the minimizer. This property will be useful because in our algorithm, every time we find w_t, we are looking at the minimizer of some convex function, and we will want to know how the values at other points differ from the value at that minimizer. One remark: we do not need to assume that f is differentiable for this result; the function need not be differentiable at any point. But let us understand why the statement holds in the case where f is differentiable everywhere. If w is a minimum of f, what is the gradient at that point? It is 0, right? So if I take w to be the minimizer, then the subgradient z in the definition is just the gradient at that point, and z is the zero vector.
If you plug z = 0 into the definition of strong convexity, the inner-product term vanishes and what is left is exactly f(u) >= f(w) + (sigma/2) ||u - w||^2, which is the claimed lower bound. But as I said, the result is not restricted to the differentiable case; when f is not differentiable and we only have subgradients, it still holds. Fine. Now, how do we check that a function is strongly convex? First, how do you check that a function is convex? You can check the definition directly, but is there a test for convexity? Yes: if the argument is real, you differentiate twice and check whether the second derivative is nonnegative. But here we are talking about convex functions that take vectors as inputs, so what is the test? The Hessian, right, and what should its property be? It should be positive semidefinite; all we need is that the Hessian is positive semidefinite (or positive definite, if you want strict convexity). Is there an analogous test for strong convexity? Yes: take the Hessian at w, and if for every direction x we have x^T (Hessian at w) x >= sigma ||x||^2, then the function is sigma-strongly convex with respect to that norm. Note again that the definition is stated with respect to some norm; if the Hessian condition holds with respect to that norm, then the definition holds and f is sigma-strongly convex with respect to it. Now let us see, I think earlier I made a mistake regarding the Euclidean regularizer; let us check whether this property holds for it.
So what is our Euclidean regularizer? We said R(w) = (1/2) ||w||_2^2 (I am leaving out the eta term here). Now let us work out what its Hessian looks like. The Hessian is the d x d matrix whose (i, j) entry is the second partial derivative of R with respect to w_i and w_j. If you compute these entries for R(w) = (1/2) ||w||_2^2, the factor of 2 from differentiating gets knocked off by the 1/2, and you are left with the identity matrix. Now apply the test: for any direction x, x^T I x = <x, x>, and since we have specified this R with respect to the Euclidean norm, that is exactly ||x||_2^2, so x^T I x >= 1 * ||x||_2^2. So the Euclidean regularizer is strongly convex, but that is not a complete description: strongly convex with what sigma, and with respect to what norm? What we have shown is that R(w) is 1-strongly convex with respect to the L2 norm. By the way, maybe we messed up earlier when we tried to see whether this regularizer is Lipschitz; we said it is Lipschitz but did not complete the argument, so please go and verify whether it is Lipschitz, and if so, with what constant L.
So at least now we know that this regularizer is not just convex; it is strongly convex with respect to the L2 norm. That is one regularizer; today I am going to discuss one more, called the entropy regularizer. How many of you have heard about the entropy function? You heard it in deep learning. Is there a difference between entropy and cross entropy? We are usually talking about cross entropy there, not entropy. We will revisit this a bit later; you are also using the term KL divergence, so you are throwing in many related terms, and we will discuss how they fit together. Before I write it down, recall the way we have been updating the w's. The w's come from my set U, because I am minimizing my function over U. When my losses are linear, what did w_t turn out to be? Every time it was essentially a scaled sum of all the gradients so far: after simplification, w_{t+1} = w_t - eta z_t. If these gradients come from a convex set, their average also lies in some convex set. But z_t was not in my control; it was chosen by the environment, the adversary. So here, can you guarantee that w_{t+1} is a probability vector?
It need not be, right? Why should it be? It can be any vector. But if I want to cover the cases where my w's have to be probability vectors, then with such an update rule w_t will not continue to be a probability vector. Go back and recall prediction with expert advice: what were the w_t's there in every round? They were probability vectors; you actually constructed them to be probability vectors in every round. So if you use a regularizer that yields this kind of update, you are not going to get probability vectors in each round, and because of that you will not be able to recover the prediction-with-expert-advice setting from this convex framework. Remember how we got here: we pitched online convex optimization as a generalization of prediction with expert advice, so it should also cater to that setting, but so far we cannot recover it because these w_t's are not probability vectors. So suppose you want to restrict your w_t's to always be probability vectors; what kind of regularizers would you be interested in? This is where KL divergence and cross entropy come in, because these make sense when you are dealing with probability vectors. Let us define the KL divergence between two probability vectors: p is one probability vector and q is another, both defined on the same sample space with, say, k elements. Then I define D(p || q) = sum_i p_i log(p_i / q_i). Can we simplify this?
If you expand, D(p || q) = sum_i p_i log p_i + sum_i p_i log(1 / q_i). Notice that the first part depends only on the distribution p, whereas the second part depends on both p and q. The quantity sum_i p_i log(1 / p_i), which is the first part without the minus sign, is what we call the entropy; we will discuss later how it measures the amount of information contained when you generate a symbol according to that distribution. Now we are interested in this kind of regularizer, and that is what I mean by the entropy regularizer: define R(w) = sum_{i=1}^{d} w_i log w_i, over the d components. Right now I have not said that w has to be a probability vector; for any vector w with positive components I define the entropy regularizer like this. Now we want to see: what kind of function is this R(w)? Is it strongly convex, and if so, with what sigma and with respect to what norm? First let us say the w's come from the set of all x in R^d such that x_i > 0 for all i from 1 to d and sum_i x_i = 1. What is this set? If I take a vector whose components are all positive and sum to 1, it is the probability simplex, right? More generally, instead of defining it like that, I am going to define S to be all x whose components are positive and whose L1 norm is bounded by B.
So the earlier set is the special case of this with B = 1. And what is the L1 norm of x? It is not simply the sum of the x_i's; it is the sum of |x_i|. But since you have already ensured that the x_i's are positive, I do not need the absolute values. Now, if I define R(w) like this, with the w's coming from this set S, is it strongly convex for some parameters? It so happens that R is (1/B)-strongly convex. I have specified sigma; what is the norm? It is strongly convex with respect to the L1 norm. I will leave it to you to check this: again, compute the Hessian and verify that x^T (Hessian) x >= (1/B) ||x||_1^2. So we have now defined two regularizers which are both strongly convex, but with respect to two different norms: the Euclidean regularizer is strongly convex in the L2 norm, and the entropy regularizer is strongly convex in the L1 norm. We will see how choosing regularizers that are strongly convex with respect to different norms affects the bounds. Now let us come back to follow-the-regularized-leader with strongly convex regularizers; we will also be interested in loss functions which are Lipschitz. I will just state the result and we will look at the proof next time.
So I am now just stating what the role of strong convexity of the regularizer is in the bounds we get when applying the follow-the-regularized-leader algorithm. Remember, to bound the regret of follow-the-regularized-leader we needed to bound the difference of the function f_t in round t at the points w_t and w_{t+1}. If f_t is Lipschitz with constant L_t, then f_t(w_t) - f_t(w_{t+1}) <= L_t ||w_t - w_{t+1}||, and if in addition the regularizer R is sigma-strongly convex, this can be further upper bounded by L_t^2 / sigma. The first inequality is just Lipschitzness; the second comes from the strong convexity of R. So the statement is: take a regularizer R which is sigma-strongly convex with respect to some norm, and assume each f_t is Lipschitz with constant L_t with respect to the same norm (we have to specify both with respect to the same norm), and suppose your algorithm predicts w_1, w_2, and so on; then this is the bound on f_t(w_t) - f_t(w_{t+1}) in every round. In the next class we will prove this and see how it helps in getting regret bounds for both kinds of regularizers.