Okay, today I just want to cover the proof of the weighted majority algorithm we discussed last time. My throat is not good, so we will just do that part; I do not want to strain it more. From next week onwards we will start with a new setup called adversarial bandits, but let us try to complete the proof of the weighted majority algorithm today. If you recall, the weighted majority algorithm we did in the last class was in a setting we called prediction with expert advice: in each round, the environment assigns a loss value to each of the d experts, and the learner picks one of the experts according to some distribution, which it keeps updating every round based on the past observations. We defined the expected regret to be the total expected loss the learner incurs over the n rounds minus the total loss of the best of the d experts, and we said that this is upper bounded by sqrt(2 n log d). Today let us briefly discuss why this is true. One of you asked the question: is it necessary to assume that the adversary generates the labels according to some hypothesis h, where h need not be in my hypothesis class? Recall that whatever setup we study, we said it would be mapped to this setup of prediction with expert advice. There, the loss assigned to expert i in round t, which we write as v_i^t, was defined in our online classification setting as the loss of applying the i-th hypothesis to x_t and comparing with y_t. But once we are in the expert setting, v^t is just a vector of losses, generated arbitrarily; I do not know how it is being generated by the environment.
In binary classification the loss was the 0-1 loss on (x_t, y_t), but when I moved to the expert setting I allowed v^t to be any loss vector. Once v_i^t can be arbitrary, the entire analysis goes through no matter how the y_t's are generated. Whether they come from some specific hypothesis, or from some arbitrary rule that no hypothesis can describe, I do not care, because v^t could already be arbitrary. So even in the binary classification setting, while it makes sense to assume the adversary generates the labels according to some hypothesis which need not be in my hypothesis class, we can relax that condition entirely: the adversary may generate labels according to some rule that I cannot even characterize by any specific hypothesis. Okay, now let us try to prove the bound. First, we are going to prove it under the condition that the number of rounds n is at least 2 log d. Remember, I told you that n and d are inputs to my weighted majority algorithm, so I will write it as weighted majority with inputs n and d. What was d? The number of experts. And what was n? The number of rounds. We are saying that if n is at least twice the logarithm of d, then this bound holds; if it is not, I do not know. Also notice that we defined a parameter η in the weighted majority algorithm: η = sqrt(2 log d / n). Now, under the condition n ≥ 2 log d,
what will this value of η be? It is going to be at most 1, right? So we are going to prove the bound under this setup, where η ≤ 1. The way we are going to do the proof is: we consider the quantity log(Z_{t+1}/Z_t), find a lower bound and an upper bound on it, and then manipulate these to get what we want; finally we will end up with the regret bound. So, how did we define Z_{t+1}? Z_{t+1} is nothing but the sum of the weights w̃_i^{t+1}. Remember that we update the w̃_i's every round, and then take their sum. We usually call Z_{t+1} the potential in round t+1: the sum of all the weights in that round. We had two sets of weights: the unnormalized weights w̃_i, and the normalized weights w_i = w̃_i / Z, which form a probability distribution; the w̃_i's themselves are not a probability distribution. So Z_{t+1} is the sum of the w̃_i^{t+1}'s over i. Now let us substitute into the ratio: Z_{t+1}/Z_t = Σ_i w̃_i^{t+1} / Z_t. What was the relation between w̃_i^{t+1} and w̃_i^t? We defined it as w̃_i^{t+1} = w̃_i^t e^{-η v_i^t}; this is already defined in the weighted majority algorithm, this is exactly how it works. So Z_{t+1}/Z_t = Σ_i (w̃_i^t / Z_t) e^{-η v_i^t}.
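The update and potential just described can be sketched in code. This is a minimal sketch of the exponential-weights update, not the lecture's own code; the function name `exponential_weights` and the toy loss sequence are my own choices.

```python
import math

def exponential_weights(losses, eta):
    """Run the weighted-majority (exponential weights) update on a loss
    sequence. `losses` is a list of n rounds, each a list of d per-expert
    losses in [0, 1]; returns the learner's cumulative expected loss."""
    d = len(losses[0])
    w = [1.0] * d                     # initialise w~_i^1 = 1 for every expert
    total = 0.0
    for v in losses:
        z = sum(w)                    # potential Z_t = sum_i w~_i^t
        p = [wi / z for wi in w]      # sampling distribution w_i^t = w~_i^t / Z_t
        total += sum(pi * vi for pi, vi in zip(p, v))
        # update rule: w~_i^{t+1} = w~_i^t * exp(-eta * v_i^t)
        w = [wi * math.exp(-eta * vi) for wi, vi in zip(w, v)]
    return total

# Toy example: expert 0 always incurs loss 0, expert 1 always loss 1.
n, d = 100, 2
losses = [[0.0, 1.0]] * n
eta = math.sqrt(2 * math.log(d) / n)
regret = exponential_weights(losses, eta) - 0.0   # best expert's total loss is 0
assert regret <= math.sqrt(2 * n * math.log(d))   # the bound we are about to prove
```

On this toy sequence the learner quickly shifts its mass to expert 0, and its regret stays well within the √(2 n log d) bound.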
And now, if I look at this ratio w̃_i^t / Z_t, by definition what is this quantity for us? It is w_i^t, without the tilde. So Z_{t+1}/Z_t = Σ_i w_i^t e^{-η v_i^t}, and we are going to do a series of manipulations on this. By the way, when you want to prove a bound like this, it is a priori not clear from looking at the algorithm what the right steps are; but once you write down the steps, it is clear that you will get the bound. So the proof here is quite simple, it is just not obvious a priori how to find it. These are classical proof techniques which we are going to use many, many times, so just try to follow all the steps we are doing here. Now, let us apply some inequalities. The first is: for a between 0 and 1, e^{-a} is lower bounded by 1 - a and upper bounded by 1 - a + a²/2. Is this true? Let us see. At a = 0, e^{-a} is 1, and at a = 1 it is 1/e. How does the graph of 1 - a look? It starts at 1, falls linearly, and reaches 0 at a = 1; and it lies below e^{-a} on the whole interval [0, 1], so it is a lower bound there. Now, what about the upper bound 1 - a + a²/2? Can you check whether it is convex or concave in a? How do you check? Differentiate twice: the second derivative is 1, which is positive. If it is positive, the function is convex, right? So convex means it has a minimum at one point.
Where is the minimum happening? At a = 1: the derivative of 1 - a + a²/2 is a - 1, so the function decreases on [0, 1] and increases after that. So on the interval [0, 1] both bounds sit pretty tight around e^{-a}, and we are going to use them. Now take the factor e^{-η v_i^t} and treat the whole exponent as a, so a = η v_i^t. Notice that η ≤ 1 by my assumption, and v_i^t ≤ 1 because the losses come from the interval [0, 1]. So a = η v_i^t is a nonnegative quantity at most 1, and I can apply the upper bound: Z_{t+1}/Z_t ≤ Σ_i w_i^t (1 - η v_i^t + η² (v_i^t)²/2). Further, the w_i^t are probability values: if I take the 1 inside and sum it over all i, that part is just 1. So I can write Z_{t+1}/Z_t ≤ 1 - Σ_i w_i^t (η v_i^t - η² (v_i^t)²/2); there is a minus sign, which I have taken inside the bracket. Now let us define this bracketed quantity to be a for us: a = Σ_i w_i^t (η v_i^t - η² (v_i^t)²/2). My claim is that this entire quantity a is again between 0 and 1. Let us argue why. η ≤ 1 and v_i^t ≤ 1, so η v_i^t is at most 1 and η² (v_i^t)²/2 is at most 1; and since the w_i^t are probabilities, taking this expectation keeps the quantity at most 1. Less than 1 is clear; is it also greater than 0? Is this quantity nonnegative? Why?
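As a quick numeric sanity check of these two bounds (a throwaway sketch, just checking the claimed inequalities on a grid):

```python
import math

# Check 1 - a <= e^{-a} <= 1 - a + a^2/2 at 101 points across [0, 1].
for k in range(101):
    a = k / 100
    assert 1 - a <= math.exp(-a) <= 1 - a + a * a / 2
```

At a = 0 both bounds meet e^{-a} exactly, and at a = 1 they bracket 1/e ≈ 0.368 between 0 and 0.5, matching the picture drawn on the board.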
Because η v_i^t is at most 1: if you square it, it gets smaller, and you are further dividing by 2, so η² (v_i^t)²/2 ≤ η v_i^t and each term of a is nonnegative. Now, can I apply the other direction, the lower bound 1 - a ≤ e^{-a}? If I do, because of where it sits, I still get an upper bound on the ratio: Z_{t+1}/Z_t ≤ 1 - a ≤ e^{-a}. Taking logs, log(Z_{t+1}/Z_t) ≤ log e^{-a} = -a; the log and the exponential cancel, and the minus sign stays. Now I am going to simplify -a. The first part is -η times the inner product between w^t and v^t, and then there is the remaining term +(η²/2) Σ_i w_i^t (v_i^t)². One more simplification: I pull η²/2 outside, and whatever remains, Σ_i w_i^t (v_i^t)², is still a quantity at most 1, so I can upper bound it by 1 and still have an upper bound. That is why I retain only η²/2 and bound the rest by 1, giving log(Z_{t+1}/Z_t) ≤ -η ⟨w^t, v^t⟩ + η²/2. Fine. We have done this for one ratio, and it is true for any t. Now what we will do is sum it over all t from 1 to n. Look at the left-hand side: I am adding log terms, so the sum of the logs is the log of the product. And if I take the product of these ratios Z_{t+1}/Z_t, see how they are?
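The per-round inequality we just derived can be checked numerically on arbitrary losses. This is a sanity-check sketch with made-up random losses, not part of the algorithm itself:

```python
import math
import random

# Check log(Z_{t+1}/Z_t) <= -eta*<w^t, v^t> + eta^2/2 round by round.
random.seed(1)
d, eta = 4, 0.5
w = [1.0] * d                            # w~_i^1 = 1
for _ in range(30):
    v = [random.random() for _ in range(d)]      # arbitrary losses in [0, 1]
    z = sum(w)                                   # Z_t
    p = [wi / z for wi in w]                     # w_i^t
    exp_loss = sum(pi * vi for pi, vi in zip(p, v))   # <w^t, v^t>
    new_w = [wi * math.exp(-eta * vi) for wi, vi in zip(w, v)]
    assert math.log(sum(new_w) / z) <= -eta * exp_loss + eta**2 / 2
    w = new_w
```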
They cancel out, and what finally remains, if you simplify, is Z_{n+1}/Z_1; so the left-hand side is nothing but log Z_{n+1} - log Z_1. Now look into Z_1. What is the definition of Z_1? Z_1 is the sum of the w̃_i^1's. How did we define those? Go and see in the algorithm: we initialize each of these quantities to 1, and Z_1 is the sum of all of them. So what is Z_1? It is going to be d. So from this relation, what I finally get is log Z_{n+1} ≤ log d - η Σ_{t=1}^n ⟨w^t, v^t⟩ + n η²/2. That is one bound we have. Now let us try to get a lower bound on the same quantity, log Z_{n+1}. By definition, log Z_{n+1} = log Σ_i w̃_i^{n+1}, where the sum is over i. And by our update rule, w̃_i^{n+1} = w̃_i^n e^{-η v_i^n}; I have just substituted the definition of w̃_i^{n+1}. Is this correct? Now I am going to keep unfolding this. I know that w̃_i^{n+1} is defined in terms of the previous quantities: express w̃_i^{n+1} in terms of w̃_i^n, which contributes a factor e^{-η v_i^n}; then go back and replace w̃_i^n in terms of w̃_i^{n-1}, which gives another factor e^{-η v_i^{n-1}}; and keep going backwards like this. Since w̃_i^1 = 1, what I eventually get is w̃_i^{n+1} = e^{-η Σ_t v_i^t}, with the sum over t.
So that is why we are getting the sum of all the v_i^t's in the exponent; this is just by definition. Because of that, we end up with log Z_{n+1} = log Σ_i e^{-η Σ_t v_i^t}. Now let us try to play with this summation. As I said, I want a lower bound on this. The sum is over i from 1 to d; instead of taking the sum over all of them, if I retain only one index and throw away everybody else, I get a lower bound, and the right choice is to keep the maximum element in the sum: log Z_{n+1} ≥ log max_i e^{-η Σ_t v_i^t}. And taking the max of e to the power of a quantity is the same as e to the power of the max of that quantity; but there is a minus sign in the exponent, so if I take that minus outside, the max becomes a min: this equals log e^{-η min_i Σ_t v_i^t}. Is this step of the manipulation clear? Finally, the log and the exponential cancel, so the lower bound simplifies to -η min_i Σ_{t=1}^n v_i^t. I should be careful here: the summation inside is over t, not over i, and the minimization is over i. So now, on the same quantity log Z_{n+1}, I have this upper bound and this lower bound. Combining the two equations: -η min_i Σ_t v_i^t ≤ log d - η Σ_t ⟨w^t, v^t⟩ + n η²/2. Now let us readjust this to get the desired quantity we are interested in.
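Both bounds on log Z_{n+1} can be verified numerically end to end. This is a sanity-check sketch with made-up random losses; the variable names are my own:

```python
import math
import random

# Check: -eta * min_i sum_t v_i^t  <=  log Z_{n+1}  <=
#        log d - eta * sum_t <w^t, v^t> + n * eta^2 / 2
random.seed(0)
n, d, eta = 50, 5, 0.3
V = [[random.random() for _ in range(d)] for _ in range(n)]

w = [1.0] * d                        # w~_i^1 = 1, so Z_1 = d
upper = math.log(d)
for v in V:
    z = sum(w)
    p = [wi / z for wi in w]
    upper += -eta * sum(pi * vi for pi, vi in zip(p, v)) + eta**2 / 2
    w = [wi * math.exp(-eta * vi) for wi, vi in zip(w, v)]

log_Z = math.log(sum(w))             # log Z_{n+1}, computed directly
cum = [sum(V[t][i] for t in range(n)) for i in range(d)]
lower = -eta * min(cum)              # -eta * min_i sum_t v_i^t

assert lower <= log_Z <= upper
```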
We are interested in the difference between Σ_t ⟨w^t, v^t⟩ and the minimum cumulative loss min_i Σ_t v_i^t, right? So I take the inner-product term to the left-hand side, and I eventually end up with η (Σ_t ⟨w^t, v^t⟩ - min_i Σ_t v_i^t) ≤ log d + n η²/2. Fine, I got this quantity. Now I want to show that from this I can get the claimed bound. First, I divide throughout by η. We know that η is a positive quantity, so I can divide both sides and the relation still holds: Σ_t ⟨w^t, v^t⟩ - min_i Σ_t v_i^t ≤ log d/η + η n/2. Now, we have taken η to be a specific value: η = sqrt(2 log d / n). Can you substitute that value and compute what this quantity is? Did you get sqrt(2 n log d)? Now you may be wondering why the algorithm preferred to choose η like this at all. Let us say you got this upper bound, which holds true, with n and d given to you and η a design choice that you get to set. How are you going to choose η? The left side is your expected regret, and you have this upper bound on it; naturally you want to make the upper bound as small as possible, because you want the regret small. So if η is the parameter you have to choose, you would like to choose the η that minimizes this quantity. So take the bound to be a function of η. Can you find the η that minimizes it? How will you do it? Differentiate with respect to η and see what value of η you get.
You will see that if you minimize log d/η + η n/2 with respect to η, the minimizer is exactly η = sqrt(2 log d / n). That is why η has been set like this in your algorithm. So in a way, η is controlling how much importance you give to the losses you have observed. When I update the weights, I do not take the observed loss v_i^t at face value; I weigh it by the factor η. This is how much importance I give to the samples I have while updating, and one has to choose that weight carefully; if you do not, you may not get good performance. You can see how the bound changes as η changes. Suppose you choose η to be very large, close to 1: then the second term is getting multiplied by n, so you are saying your regret is upper bounded by almost n, order n, which is of no use to me. And if you take η very close to 0, you make that term smaller, but the log d/η term blows up. So in a way this η is balancing what we call exploration and exploitation, which we discussed a little bit in the first class and will talk about more later. This parameter is saying: these are the losses I have been observing, but maybe initially I observed a few small losses on some expert, and I need not necessarily latch on to that expert and start assigning it high weight. I will be cautious about that.
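The optimization of η can be checked directly. This sketch evaluates the bound log d/η + η n/2 at the closed-form choice and compares it against a grid of alternatives (the function name `bound` and the values of n, d are my own):

```python
import math

def bound(eta, n, d):
    """The right-hand side log(d)/eta + eta*n/2 of the regret bound."""
    return math.log(d) / eta + eta * n / 2

n, d = 1000, 10
eta_star = math.sqrt(2 * math.log(d) / n)     # the algorithm's choice

# The closed-form eta achieves the claimed value sqrt(2 n log d) ...
assert abs(bound(eta_star, n, d) - math.sqrt(2 * n * math.log(d))) < 1e-9
# ... and no eta on a coarse grid over (0, 2] does better.
for k in range(1, 201):
    assert bound(eta_star, n, d) <= bound(k / 100, n, d) + 1e-12
```

At the optimum, the two terms log d/η and η n/2 are exactly equal, each contributing sqrt(n log d / 2); that balance is what the differentiation gives you.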
I will only take its value with this much weight, the η weight. So this parameter η strikes a fine balance in how I explore the experts, and it has to be carefully chosen. And you see that if my n is going to be large, what am I basically doing? I am setting a small η. Setting a small η means I am giving less significance to the losses I am observing. So am I forcing exploration here, or preferring exploitation, when η is small? You are basically forcing exploration, right? Because you are not giving too much importance to the samples you have already observed. So when η is small, how does the distribution look? When η is small, the update factor e^{-η v_i^t} is e to the power of something small, which is almost 1, so all the w̃_i's stay roughly equal. That means you are giving equal importance to all the experts, which means you are forcing more exploration. But if your n is large, that might be okay: if you have many, many rounds to play, you can afford to do a little more exploration while you figure out which expert is good. But if n is small, you do not have that luxury of doing a lot of exploration initially. You want to start right away taking the observations you have made more seriously, by giving them a good weight. That is why η has to be very carefully balanced, and that balance necessarily depends on the number of rounds I am dealing with: if you have many rounds, you may be freer to explore, because you do not care initially, you have a lot of time and will eventually find out; but with fewer rounds, you have to be more careful.
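The effect of η on the distribution over experts can be seen directly. A small sketch with hypothetical cumulative losses (the numbers and the function name `distribution` are mine):

```python
import math

def distribution(cum_losses, eta):
    """Normalised exponential weights for given cumulative losses."""
    w = [math.exp(-eta * c) for c in cum_losses]
    z = sum(w)
    return [wi / z for wi in w]

cum = [3.0, 7.0, 11.0]                 # hypothetical cumulative losses

p_small = distribution(cum, 0.01)      # tiny eta: nearly uniform -> exploration
p_large = distribution(cum, 5.0)       # large eta: latches on to expert 0

assert max(p_small) - min(p_small) < 0.05   # all experts get similar weight
assert p_large[0] > 0.99                    # almost all mass on the best expert
```

With η = 0.01 the three probabilities are all within a few percent of 1/3, while with η = 5 the best expert gets essentially all the mass: exactly the exploration-versus-exploitation trade-off just described.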
So this algorithm is doing exactly this kind of balancing of exploration and exploitation, by choosing η appropriately, ok.