So, we have been discussing this online convex optimization problem, and we saw that if we just do Follow the Leader, in some cases we may end up with bad regret; we showed this with an example where the loss functions were linear. Then we showed that if you somehow regularize, things can be better, but we have to choose our regularizer appropriately. For the case when the loss functions are linear, $f_t(w) = \langle w, z_t \rangle$, and we use the regularizer $R(w) = \frac{1}{2\eta}\|w\|_2^2$ (there was a 2 here, the square), we showed that we get a simple update mechanism which is basically gradient descent: you update your weights taking into account the gradient at each step, decreasing the weights by $\eta$ times the gradient, not simply by the gradient.

So, based on that we had this simple online gradient descent algorithm: we start with an input parameter $\eta > 0$, initialize $w_1 = 0$ (the zero vector), and then the update rule is $w_{t+1} = w_t - \eta z_t$. And what is this $z_t$ here? We say $z_t$ is a subgradient of $f_t$ computed at $w_t$. If the function $f_t$ is differentiable, then this is simply the derivative at that point, $z_t = \nabla f_t(w_t)$. (A small code sketch of this loop appears below.)

And for this algorithm, what is the bound we showed? Finally, for this setup, we had shown that the regret with respect to any $u$ is upper bounded as
$$\mathrm{Regret}_T(u) \;\le\; \frac{1}{2\eta}\|u\|_2^2 + \eta \sum_{t=1}^{T} \|z_t\|_2^2.$$
Then we made an assumption here: we said let us say $\|u\|$ is upper bounded by $B$, and also that the squared norms of these subgradients are upper bounded by $L$, that is $\|z_t\|^2 \le L$; then this upper bound turned out to be $\frac{B^2}{2\eta} + \eta T L$. After this, what did we do? We optimized this over $\eta$, because $\eta$ was an input parameter, and by the appropriate choice $\eta = B/\sqrt{2TL}$ the best upper bound we got is $B\sqrt{2TL}$.

Now, from this expression, as you see, the regret depends on the size of these gradients. If the gradients are allowed to be very large, then your regret is also going to be large. So that is why we assumed that the gradients could be anything, but we ensured that they are bounded: the adversary cannot choose arbitrarily large gradients. And with that we got this bound. So it looks like, to get a good bound, one has to control the size of these gradients. Now we will see that these gradients, which are of course a property of your function $f$, can be connected to the Lipschitz property of the function.

So, now we are going to set up a little bit of notation. I am going to assume the functions $f$ are all real-valued functions. We are going to call $f$ $L$-Lipschitz if, when you take the difference between the function values at points $x$ and $y$, it is upper bounded as $|f(x) - f(y)| \le L\,\|x - y\|$ for all $x, y$. Notice that this real-valued function takes vectors as arguments, so $x$ and $y$ can be any vectors, and $L$ is a constant which does not depend on the values of $x$ and $y$. And now you see that I have written $x - y$ here, and I am interested in some norm of it.
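To make the algorithm concrete, here is a minimal runnable sketch of the online gradient descent loop promised above, in Python; the linear losses, the dimensions, and the random $z_t$ sequence are illustrative choices of mine, not part of the lecture.

```python
import numpy as np

def online_gradient_descent(subgradients, eta):
    """w_{t+1} = w_t - eta * z_t, starting from w_1 = 0."""
    d = len(subgradients[0])
    w = np.zeros(d)                      # w_1 = 0
    iterates = [w.copy()]
    for z in subgradients:               # z_t: subgradient of f_t at w_t
        w = w - eta * z                  # the online gradient step
        iterates.append(w.copy())
    return iterates

# Illustrative run with linear losses f_t(w) = <w, z_t>.
rng = np.random.default_rng(0)
T, d, B = 100, 5, 1.0
zs = [rng.uniform(-1.0, 1.0, size=d) for _ in range(T)]
L = max(float(np.sum(z * z)) for z in zs)    # bound on ||z_t||_2^2
eta = B / np.sqrt(2 * T * L)                 # the optimized step size
ws = online_gradient_descent(zs, eta)

# Regret against the best fixed u with ||u|| <= B:
g = np.sum(zs, axis=0)
u = -B * g / np.linalg.norm(g)
regret = sum(np.dot(w, z) for w, z in zip(ws[:-1], zs)) \
         - sum(np.dot(u, z) for z in zs)
assert regret <= B * np.sqrt(2 * T * L) + 1e-9   # the bound from above
```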
What is this norm, and how is it defined? I am going to say $f$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$, and right now just think of $\|\cdot\|$ as an operator: if you tell me the norm and you say your function is $L$-Lipschitz with respect to that norm, then the inequality above is what I mean; this is the definition.

Now, what are the norms we are interested in? In general, the $p$-norm of a vector $x$ is defined as $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$. If $p = 1$, we are going to call it the $\ell_1$ norm. How does the $\ell_1$ norm look? It is simply the sum of the absolute values of the individual components, $\|x\|_1 = \sum_i |x_i|$. Now if you take $p = 2$, this is called the $\ell_2$ norm: $\|x\|_2 = \left(\sum_i x_i^2\right)^{1/2}$, and this is the one we are most familiar with. And in general, any norm with parameter $p$ is defined by the expression above. When I write $\|x\|$ for a generic norm, I just intend to say this is a norm of the vector $x$, but I have not specified whether it is the $\ell_1$ norm, the $\ell_2$ norm, or the $p$-th norm; depending on what we are interested in, we will say $p$ is 1 or 2.

Now, there is a notion of a dual norm. Take any generic norm $\|\cdot\|$ that has been given to you; I am going to define its dual norm as
$$\|z\|_* = \max_{w \,:\, \|w\| \le 1} \langle w, z \rangle,$$
the maximum of this inner product over all $w$ whose norm is at most 1. Suppose I am interested in finding the dual norm of $\ell_1$: then the norm in the constraint is simply the $\ell_1$ norm. And if I am interested in the dual norm of $\ell_2$, then the constraint is taken with the $\ell_2$ norm. So this is the generic definition of the dual norm when you have been given a general norm, whatever it is.

You can verify, I am just leaving it to you, that if I have two vectors $w$ and $z$, the inner product can always be upper bounded as $\langle w, z \rangle \le \|w\|\,\|z\|_*$: the first factor is whatever generic norm you have, let us say the $\ell_1$ norm, and the second factor is the dual of that $\ell_1$ norm. So here the dual corresponds to the associated generic norm: if the first factor is the $\ell_2$ norm, then the second is the associated dual norm of $\ell_2$. And you see that I can interchange $w$ and $z$: I could take the generic norm on $z$ here and its dual norm on $w$.

In most of our analysis we will mostly be interested in $p = 1$ and $p = 2$, but it so happens that if you have $p, q \ge 1$ such that $\frac{1}{p} + \frac{1}{q} = 1$, then the $\ell_p$ and $\ell_q$ norms are dual to each other: if you take the $\ell_p$ norm, its dual norm is the $\ell_q$ norm whenever $\frac{1}{p} + \frac{1}{q} = 1$ is satisfied.
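As a quick sanity check of the inequality $\langle w, z \rangle \le \|w\|_p\,\|z\|_q$ for dual pairs (my own numerical illustration, not from the lecture), one can test it on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_norm(x, p):
    """||x||_p = (sum_i |x_i|^p)^(1/p); p = np.inf gives max_i |x_i|."""
    return np.linalg.norm(x, ord=p)

# Check <w, z> <= ||w||_p * ||z||_q on random vectors, for dual
# pairs (p, q) satisfying 1/p + 1/q = 1.
for p, q in [(1, np.inf), (2, 2), (3, 1.5)]:
    for _ in range(1000):
        w, z = rng.normal(size=4), rng.normal(size=4)
        assert np.dot(w, z) <= p_norm(w, p) * p_norm(z, q) + 1e-12
```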
So, suppose now let us do this: suppose I want to take $p = 1$, so that is $\ell_1$; then what is the $q$ that is going to satisfy this equation? If I take $p = 1$, the value of $q$ that satisfies it is infinity. So the dual of the $\ell_1$ norm is the $\ell_\infty$ norm. Now, let us define what the $\ell_\infty$ norm is. By the definition it would be $\left(\sum_i |x_i|^\infty\right)^{1/\infty}$, and the way to interpret that is simply $\|x\|_\infty = \max_i |x_i|$, the maximum component in absolute value. Think of it intuitively: you are raising the components to the power $p$ and then taking the $1/p$-th power; if you let $p$ go to infinity, the exponent $1/p$ converges to 0, and in this case the component with the largest absolute value dominates. So you can argue that the definition turns into this max over the components by letting $p$ tend to infinity.

The other thing: now let us take $\ell_2$. If I take $p = 2$, what is the value of $q$ that satisfies the equation? It is 2. So if my generic norm is $\ell_2$, its dual norm is again $\ell_2$: the $\ell_2$ norm is dual to itself. That is why the $\ell_2$ norm is the simplest to handle: if I just say $\ell_2$ norm, I really need not worry whether I am working in the original norm space or in the dual space. That is also why it is important, whenever I define Lipschitzness, to say with respect to which norm, because it could be $\ell_1$, $\ell_2$, or whatever, and depending on that the constant $L$ could change: the function may satisfy the definition with some $L$ for the $\ell_1$ norm, and if I take the $\ell_2$ norm instead, this $L$ could be different.

So, with this notation, let us move back to what we are interested in. We were interested in the fact that these $z_t$ here, which are subgradients, should be bounded. Now, are these subgradients somehow related to the Lipschitzness of this function, and if that is the case, with what parameter? To make all these things work and get the sublinear bound, I needed the condition that all the subgradients are bounded. Is that equivalent to saying my function is Lipschitz? Is that true? We will see that yes, that is true. So here is the result; I am going to state it and we will skip the proof. Let $f : S \to \mathbb{R}$ be a convex real-valued function. Then $f$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$ if and only if for every point $w \in S$ and every associated subgradient $z \in \partial f(w)$, we have $\|z\|_* \le L$. That is, the subgradients, measured in the dual norm, are also upper bounded by $L$. So in a way, what we are saying is: if my function is Lipschitz, then my subgradients are also upper bounded by $L$. So now let us take this norm to be the $\ell_2$ norm; then we know the dual norm is also the $\ell_2$ norm. If my function here is Lipschitz with respect to the $\ell_2$ norm, then all the subgradients are also uniformly bounded, in $\ell_2$ norm, by that same constant $L$.
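To see the theorem in action on a concrete case (an example of my own choosing, not from the lecture), take $f(w) = \|w\|_1$, which is 1-Lipschitz with respect to the $\ell_1$ norm by the triangle inequality; its subgradients are sign vectors, whose dual ($\ell_\infty$) norm is at most 1:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda w: np.linalg.norm(w, ord=1)   # f(w) = ||w||_1
subgrad = lambda w: np.sign(w)           # a subgradient of ||.||_1 at w

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    # Lipschitz condition with L = 1 in the l1 norm ...
    assert abs(f(x) - f(y)) <= np.linalg.norm(x - y, ord=1) + 1e-12
    # ... and the matching dual-norm (l-infinity) bound on subgradients.
    assert np.linalg.norm(subgrad(x), ord=np.inf) <= 1.0 + 1e-12
```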
So, in a way, everything we did here works out: instead of saying that the dual norms of the subgradients are upper bounded by $L$, if I say that the functions $f_1, f_2, \ldots$ that I am going to see are all $L$-Lipschitz, then everything goes through. So henceforth, instead of worrying about whether my subgradients are bounded in the dual space, I will just worry about whether my function is Lipschitz in the generic norm; I will assume that my functions are Lipschitz with some constant, and that automatically implies that my subgradients are also bounded. Just notice here one small thing: when I derived the regret bound, I assumed that the squared norms $\|z_t\|^2$ are upper bounded by $L$, but the theorem here bounds the dual norm itself, with no square. So basically, henceforth we have to replace our $L$ by $\sqrt{L}$; that is the translation we have to do. Or maybe what I should have done, when I wrote all of that, was to use the unsquared value from the start; but anyway, let us not worry about it. We will just keep in mind that the $\sqrt{L}$ translation has to be done when we compare our bounds with what we achieved here.

So far, all these things worked out nicely when we had a specific regularizer. What was that regularizer? The regularizer we defined was $R(w) = \frac{1}{2\eta}\|w\|_2^2$. So this is the specific regularizer we took. Now, this regularizer, of course it is a convex function, but was it Lipschitz? Just see: if I take $f = R$ in the definition, is it Lipschitz, and if it is, you have to tell me for what $L$ and with respect to what norm. Just plug in $R$ and simplify; let us do that. I am going to get
$$|R(x) - R(y)| = \frac{1}{2\eta}\left|\,\|x\|_2^2 - \|y\|_2^2\,\right|,$$
since the constant $\frac{1}{2\eta}$ comes out. So, can you simplify this any further? What is the definition of $\|x\|^2$ here? In the $\ell_2$ norm it is $\sum_i x_i^2$, and similarly $\|y\|^2 = \sum_i y_i^2$, so what I have here involves the terms $x_i^2 - y_i^2$; but what I want on the right-hand side is the norm of the difference, which involves $\sum_i (x_i - y_i)^2$. So check this out: expand, manipulate, and apply a suitable inequality so that you end up with something you can write in terms of the norm of the difference $x - y$; see the sketch below.
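One way to complete this exercise (a sketch of mine, under the extra assumption, not stated at this point in the lecture, that all points live in a ball of radius $B$, i.e. $\|x\|_2, \|y\|_2 \le B$):
$$
\begin{aligned}
|R(x) - R(y)| &= \frac{1}{2\eta}\bigl|\,\|x\|_2^2 - \|y\|_2^2\,\bigr| = \frac{1}{2\eta}\bigl|\langle x - y,\; x + y\rangle\bigr| \\
&\le \frac{1}{2\eta}\,\|x - y\|_2\,\|x + y\|_2 \le \frac{\|x\|_2 + \|y\|_2}{2\eta}\,\|x - y\|_2 \le \frac{B}{\eta}\,\|x - y\|_2,
\end{aligned}
$$
using the generic-norm/dual-norm inequality with $p = q = 2$ (Cauchy-Schwarz) and then the triangle inequality. So on such a bounded domain, $R$ is Lipschitz with respect to the $\ell_2$ norm, with a constant driven by the $\frac{1}{2\eta}$ factor in front; without a bound on the domain, $R$ is not globally Lipschitz.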
So, what we actually have is that this function here is indeed Lipschitz, with the $\frac{1}{2\eta}$ factor in front driving the constant. You just apply the definition: what we want is a bound in terms of $\|x - y\|_2$, and you have to manipulate the expression and apply a suitable inequality to get there. This is not the final step; I am keeping it for you to work out, so you complete this.

So, we have one regularizing function here which is convex, and we can look at other regularizers as well. That this regularizing function was convex is good, because I was adding it to a sum of convex functions and minimizing: if I am adding the regularizer to a sum of convex functions and the regularizer is also a convex function, then the whole objective is convex for me, and it is easier for me to minimize. That is a property I desire. Now, what we are saying is that in addition to this, we would also like all of these $f_t$ functions to be Lipschitz, because then we get the regret bound. And $R$, the regularization function, which we treated like a loss $f_0$ received in the 0th round, we also thought of as a convex function, so we want it to be Lipschitz with some constant as well. So now let us look at other regularizing functions, let us say both convex and Lipschitz, and see what kind of bound one can expect and what other possibilities we have. In particular, we will be interested in what we are going to call strongly convex functions, not just convex but something more than that, and for those we will be able to derive better bounds.