So, in the last class we started discussing the online convex optimization problem, and there we introduced the Follow The Leader (FTL) algorithm. Recall what happened when the convex functions were quadratic: taking f_t(w) = (1/2) ||w - z_t||^2, we saw that the regret of FTL is (2L)^2 (log n + 1), so of order log n up to a constant. What did we assume? We assumed ||z_t|| is upper bounded by L for every t; with that, the relevant squared terms are bounded by L^2, and we get a bound of this form. Where did the log n come from? From the summation sum_{t=1}^{n} 1/t, which we loosely wrote as log n; the right upper bound is log n + 1, so we should write log n + 1. So, good: this is what you get for the quadratic function. And what did the algorithm give us? In every round it plays w_t = (1/(t-1)) sum_{i=1}^{t-1} z_i; it basically says, take the average of all the z_i's you have observed so far. Now let us take a different kind of loss function, the linear loss f_t(w) = <w, z_t>, which we said can be interpreted as the expected loss once the learner randomizes his strategy. How about applying the FTL algorithm in this setup? We will now argue that if you blindly apply FTL here, you may end up with very bad performance. Let us see why.
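As a small sanity check of the quadratic-loss update (this sketch is mine, not from the lecture; the function name and the choice of playing 0 in the empty first round are my own), the minimizer of the cumulative quadratic loss is exactly the running average of the observed z_i's:

```python
def ftl_quadratic(zs):
    """FTL for f_t(w) = 0.5*(w - z_t)^2: in round t, play the minimizer
    of sum_{i<t} f_i(w), which is the average of z_1, ..., z_{t-1}."""
    plays = []
    for t in range(1, len(zs) + 1):
        past = zs[:t - 1]
        # empty history in round 1: play an arbitrary point, say 0
        w_t = sum(past) / len(past) if past else 0.0
        plays.append(w_t)
    return plays

print(ftl_quadratic([1.0, 3.0, 2.0]))  # [0.0, 1.0, 2.0]
```

So FTL with quadratic losses just tracks the empirical mean of the past observations, which is why its plays move slowly.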
So, for example, take your convex set S to be [-1, 1]; I am only interested in scalar values now, so the dimension is 1 and I can simply write the loss as f_t(w) = w z_t, with both w and z_t scalars; instead of some dimension d, I am setting d = 1. Now, suppose the z_t's chosen by the environment, which it can choose in whatever way it wants, are as follows: z_1 = -0.5 in the first round, and for the subsequent rounds z_t = +1 if t is even and z_t = -1 if t is odd and greater than 1. Let us see what happens if we apply the FTL algorithm here. What will FTL do in round t? It plays w_t = argmin_{w in S} sum_{i=1}^{t-1} f_i(w). For w_1 the sum is empty, so you play something arbitrary; I am not worried about it. Now, what is w_2 going to be? w_2 is the minimizer of F_1(w), and F_1(w) = w z_1 = -0.5 w, because z_1 is defined to be -0.5; over S = [-1, 1] this is minimized at w_2 = +1. Keep going: what is w_3? Now we minimize F_1(w) + F_2(w) = -0.5 w + w = 0.5 w, since F_2(w) = w z_2 = w.
And if you continue one more time, the cumulative sum becomes 0.5 w - w = -0.5 w, and then 0.5 w again; do you see the pattern? The cumulative loss alternates between +0.5 w and -0.5 w, so the argmin over S = [-1, 1] alternates too: w_3 = argmin_w 0.5 w = -1, then w_4 = +1, w_5 = -1, and it keeps going like that. So you see what is happening: your w_t's alternate from one round to another. Now let us compute the loss incurred; we are interested in the regret, sum_{t=1}^{n} f_t(w_t) - sum_{t=1}^{n} f_t(u). What is f_t(w_t) over the n rounds? Let us ignore the first round, where the play was arbitrary. What is f_2(w_2)? It is w_2 z_2 = 1 * 1 = 1. And when t is odd, w_t z_t = (-1)(-1) = 1 again. So every term from round 2 onward is 1, 1, 1, ..., giving a total of n - 1; and for the time being assume the first-round loss is also 1 (it could be anything, I am simply putting it as 1). So the total loss incurred is almost n: we incurred a loss in every round. And what is the other quantity? It is nothing but sum_{t=1}^{n} u z_t = u sum_{t=1}^{n} z_t.
And what is sum z_t going to look like? Plus, minus, plus, minus; whatever it is, I am interested in its smallest multiple: this was for a given u, but I want to minimize u sum z_t over u in [-1, 1]. So let us write down what happens. What is the sum of the z_t's? Take n odd and greater than 1, say n = 3, 5, 7, ...: after cancelling everything, the sum is always -0.5; and if n is even, it is +0.5. So depending on whether my n is odd or even, this whole quantity is -0.5 or +0.5. Suppose it turns out to be the positive one: can I make u times this quantity small? Certainly; choosing u = 0 makes it 0, irrespective of what the sum is. So what I am saying is: the total loss you incurred is n, but the smallest loss a fixed u could have incurred is at most 0. So, instead of a given u, let me write it as a min over u.
In fact we can do slightly better than u = 0: when the sum is +0.5, choose u = -1 and the comparator's loss becomes -0.5; and in the other case, when the sum is -0.5, choose u = +1 and it is again -0.5. Either way the best fixed loss is -0.5, while the learner's loss is n, so the regret is still of order n. So, if you are going to use Follow The Leader on a linear function like this, you will end up with very bad regret. What is the issue here? It works so well for the quadratic functions, but it is doing so badly for the linear ones. Can we say something about this? What happened, basically, is this: when we do the minimization, argmin_{w in S} w sum_{i=1}^{t-1} z_i, the argmin changes in every round depending on the sign of the partial sum; it was becoming plus, minus, plus, minus, as we argued. Whereas in the quadratic case the change was not so abrupt. Why was that? For the quadratic loss, the w_t found by this method turned out to be the average of the past z_i's. Because it is an average, every update is influenced by what happened in the past samples; everything from the past keeps accumulating, so things do not change that drastically in the next round. It is a running average, and a running average usually does not change suddenly.
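To make the failure concrete, here is a small simulation of the example (my own sketch; the names are mine, and I let the learner play 0 in the arbitrary first round, so its total loss is n - 1 rather than n): FTL's plays flip between +1 and -1 every round, its cumulative loss grows like n, while the best fixed u in [-1, 1] pays only -0.5.

```python
def adversarial_z(n):
    """The environment's sequence from the example: z_1 = -0.5,
    then z_t = +1 for even t and z_t = -1 for odd t > 1."""
    return [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, n + 1)]

def ftl_linear(zs):
    """FTL for f_t(w) = w * z_t over S = [-1, 1]: the objective is linear
    in w, so the minimizer is the endpoint opposing the partial sum."""
    plays, cum = [], 0.0
    for z in zs:
        # argmin_{w in [-1,1]} of w*cum is -sign(cum); play 0 on empty history
        w = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
        plays.append(w)
        cum += z
    return plays

n = 101
zs = adversarial_z(n)
ws = ftl_linear(zs)
learner = sum(w * z for w, z in zip(ws, zs))
best_fixed = min(u * sum(zs) for u in (-1.0, 0.0, 1.0))  # linear in u: endpoints suffice
print(learner, best_fixed)  # 100.0 -0.5
```

The learner pays 1 in every round after the first, totalling n - 1 = 100, while the best fixed comparator pays -0.5: linear regret, exactly as argued above.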
Whereas with this linear function things were changing very rapidly: plus, minus, plus, minus, like that. In a way, whatever had happened in the past was getting nullified; it had no impact on the current decision. Such abrupt changes were making the algorithm unstable, whereas there is a kind of stability in the quadratic case because the updates are getting averaged over the past observations. So, in a sense, what this is telling us is: for your FTL algorithm to give good performance, your weights should not change abruptly; that was the case with the quadratic loss function, but it is not happening with the linear function here. So now the question is: is it possible that, even if I am going to observe a sequence like this, my w_t's will not change so abruptly, so that the updates are in some sense stable? Maybe if we can bring that kind of stability in, even for the linear function, the FTL algorithm may do better. And usually the way to bring in that kind of stability is through regularizing functions. How many of you have heard about regularizers? What kinds are used, say LASSO versus the L2 regularizer, and what is the difference? One of them tries to make your weights sparse; if a weight is very small or zero, the corresponding feature has little effect and you may as well not use it. But did doing that improve performance? You say it did on the test set, and the reason you give is that it kind of avoids overfitting. But did anybody come across anything related to stability there? LASSO is ensuring that not all the weights are big, or that not all the features are treated as relevant.
So, some of them need not be given importance. But why the squared norm? You are adding it to the loss and trying to minimize the sum, so weights are kept small unless they are really important. The other way of looking at it: by adding the L2 norm to the loss itself, you are trying to make sure that the L2 norm does not become large; you can think of it like a constraint on the norm, constraining it to be within some value. Fine; there you are restricting your weights with the goal of making sure you do not overfit to the training data. But here, in this online version, our goal is to make sure the updates are not becoming too erratic, in the sense that they do not change too much from one round to the next. We want them to be stable: if they change abruptly, in a way that means we have started ignoring the past; if they are stable, we are not ignoring the past, we are taking it in and allowing the solution to vary only slowly. Ok, fine. So, what is the regularized version of my Follow The Leader algorithm? We are going to study something called Follow The Regularized Leader. What we will do in this case is, instead of simply finding the minimum of the sum of the losses observed so far, we do the minimization after adding a regularizing term. We denote this algorithm FoReL, where "Re" stands for Regularized: for all t, it plays w_t = argmin_{w in S} [ sum_{i=1}^{t-1} f_i(w) + R(w) ], and the term R(w) we are going to call the regularizer. We are going to see different regularizers as we go on in the class, and we will see how they affect our performance.
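As a generic sketch of this definition (mine, not from the lecture; a crude grid search over S stands in for the argmin, and the choice eta = 0.5 is arbitrary), FoReL just minimizes the regularized cumulative loss each round:

```python
def forel_play(past_losses, R, S_grid):
    """FoReL: play the argmin over S of (sum of past losses) + R(w).
    Here the argmin is approximated by searching a discretized S."""
    return min(S_grid, key=lambda w: sum(f(w) for f in past_losses) + R(w))

# example: two linear losses seen so far, and an L2 regularizer with eta = 0.5
S_grid = [i / 100.0 for i in range(-100, 101)]       # discretized S = [-1, 1]
R = lambda w: w * w / (2 * 0.5)                      # R(w) = ||w||^2 / (2*eta)
past = [lambda w, z=z: w * z for z in (-0.5, 1.0)]   # f_i(w) = w * z_i
print(forel_play(past, R, S_grid))  # -0.25
```

Note the play -0.25 is an interior point of S, not an endpoint: the regularizer has already tempered the all-or-nothing behavior that broke plain FTL.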
So, let us take the specific example of the L2 regularizer, where I define R(w) = (1/(2 eta)) ||w||^2, with a parameter eta > 0. I am adding this quantity directly to the loss, and I want it to be small; so I do not allow the minimizer to become too big, and because of that I am, in some way, controlling its variations. Now, with this, take f_t(w) to be simply the linear function f_t(w) = <w, z_t>, where z_t is the parameter in round t. If I plug this in, can you tell me what w_t I am going to get? Just put the linear losses and R(w) into the FoReL objective, differentiate with respect to w, and set the derivative to zero: sum_{i=1}^{t-1} z_i + w/eta = 0, so the minimizer is w_t = -eta sum_{i=1}^{t-1} z_i. Can I now write this iteratively? Split the summation into i running from 1 to t-2 and separate out the last term, -eta z_{t-1}: the first part is nothing but w_{t-1}. So what I have basically done is write the update iteratively: if the update I got in round t-1 was w_{t-1}, and I then observed z_{t-1}, the new value is expressed as w_t = w_{t-1} - eta z_{t-1}. Now, look into this linear function again: what is z_t here? My variable is w, so how can I interpret z_t? Can I take it to be the slope, the gradient, of the function f_t? Yes.
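As a quick check of this derivation (a sketch with my own naming; unconstrained, i.e., ignoring any projection onto S, exactly as in the differentiation above), the closed form w_t = -eta * sum_{i<t} z_i and the iterative form w_t = w_{t-1} - eta * z_{t-1} produce the same plays:

```python
def forel_l2(zs, eta):
    """FoReL with R(w) = ||w||^2 / (2*eta) on linear losses f_t(w) = w*z_t
    (unconstrained): closed-form minimizer w_t = -eta * sum_{i<t} z_i."""
    return [-eta * sum(zs[:t]) for t in range(len(zs))]

def forel_l2_recursive(zs, eta):
    """The same plays written iteratively: w_t = w_{t-1} - eta * z_{t-1}."""
    ws, w = [], 0.0
    for z in zs:
        ws.append(w)
        w = w - eta * z  # gradient step: z_{t-1} is the gradient of f_{t-1}
    return ws

zs = [-0.5, 1.0, -1.0, 1.0]
print(forel_l2(zs, eta=0.1))
print(forel_l2_recursive(zs, eta=0.1))  # the two lists agree
```

Splitting off the last term of the sum is all that separates the two forms, which is exactly the manipulation done above.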
So, what I am doing in this update, to get w_t from w_{t-1}, is basically subtracting the gradient (the slope) of the function f_{t-1}, scaled by the coefficient eta, which is the term coming from my regularization function. The new update is the previous update minus eta times the gradient of the function at round t-1. So you see that my weights will not change drastically here, because they depend on the previous update and change only according to the gradient of the function. Because of this nature, if I use a regularizer like this with a linear loss, the update rule I get is a rule we call gradient descent: this is my gradient, and I am adjusting my weights by a step against it. So we are updating as per the gradient descent method. Alternatively, what we are effectively doing is w_t = w_{t-1} - eta * grad f_{t-1}(w_{t-1}), since z_{t-1} is the gradient of f_{t-1} (for a linear function the gradient is the same at every point). Because of this, we are going to call this method online gradient descent. Any question about this? So, we have ended up with a simple update rule, which adjusts the weights according to the gradient.
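Putting the pieces together, here is a sketch of this online gradient descent (mine, with an arbitrary step size eta = 0.1, and with a clipping step back into S = [-1, 1], a projection detail we have not formally introduced) run on the same adversarial sequence that broke FTL; the plays now hover near 0 instead of flipping between +1 and -1, and the regret no longer grows linearly:

```python
def ogd(grads, eta, lo=-1.0, hi=1.0):
    """Online (projected) gradient descent: w_t = clip(w_{t-1} - eta * g_{t-1}).
    For the linear losses f_t(w) = w * z_t, the gradient g_t is just z_t."""
    ws, w = [], 0.0
    for g in grads:
        ws.append(w)
        w = min(hi, max(lo, w - eta * g))  # gradient step + projection onto S
    return ws

# the environment's sequence from the example
n = 101
zs = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, n + 1)]
ws = ogd(zs, eta=0.1)
learner = sum(w * z for w, z in zip(ws, zs))
best_fixed = min(u * sum(zs) for u in (-1.0, 1.0))
print(learner - best_fixed)  # regret stays O(1) here, unlike FTL's order-n regret
```

On this sequence the plays settle into a small oscillation around 0 (of size about eta/2 per round), so the cumulative loss, and hence the regret, stays bounded by a constant rather than growing with n.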