So, our goal now is to show that there exists an algorithm under which our hypothesis class becomes learnable. What is that algorithm? Before we understand it, we are going to make a slight detour: we will look at a setup called prediction with expert advice, study an algorithm called weighted majority, and then show that the weighted majority algorithm is in fact what makes our hypothesis class learnable.

First, let us try to understand the prediction with expert advice setting. Suppose you are consulting some n experts. They are all experts in the problem you are interested in, but even among experts the level of expertise can differ. What you want to do is identify the best expert among them. How does this work? Think of these experts as our hypothesis class: each hypothesis is like an expert, and, as we have been doing so far, our goal is to identify a hypothesis, that is, an expert, which minimizes the total loss.

So in this setting there are some n experts, and in each round, when an instance arrives, I ask each of these experts: tell me your prediction on this instance. Each of them gives a prediction, but I have to decide which expert's prediction to go with. I could select one expert and give its prediction as my output, but we know that if I do that deterministically, the adversary can make me incur a loss in every round. So instead, whatever predictions we get from the experts, we will assign a probability to each expert, select one of them according to that distribution, and declare the selected expert's prediction as our own. It is not that I deterministically pick one expert and declare its prediction; I select an expert according to some distribution and declare whatever that expert suggests.

But, as I said, all these experts may not be at the same level. So at some point you want to start selecting the good experts with higher probability. What you can do is assign weights to the experts and update those weights, based on whether their predictions turn out to be correct, as you observe the outcomes.

Let us make this a bit more formal; we will no longer speak only in the language of predictions. There are n experts, and the interaction between you and the environment happens in the following fashion. In each round, the environment assigns a loss value to each expert. You assign weights, that is, probabilities, to the experts, select an expert according to that distribution, and declare its prediction as your label. In round t, let v_t be a vector of dimension n, v_t ∈ [0,1]^n. You all understand this notation: [0,1]^n means the Cartesian product of the interval [0,1] with itself n times. Here v_t(i) is the loss assigned to expert i.
So, in round t the environment has assigned losses to all the experts, and now the learner chooses a vector w_t, which is a probability vector, and selects one of the experts according to this distribution. Do you see this point? For simplicity, let us say n = 3 and w_t = (0.2, 0.5, 0.3). If the learner assigns these weights to the experts, it means he selects the first expert with probability 0.2, the second with probability 0.5, and the third with probability 0.3.

If the environment assigns these losses to the experts and you select one of them with these probabilities, what is the expected loss you incur? If you select the i-th expert, your loss is v_t(i); but you are selecting an expert at random according to the distribution w_t. So the expected loss you incur in round t is Σ_i w_t(i) v_t(i), which is nothing but the inner product ⟨w_t, v_t⟩. Right now I am speaking purely in the language of losses, which lie in the interval [0,1]: if you pick the i-th expert in that round, the environment, which is possibly adversarial, has assigned that much loss to it, and you incur it. And since you are picking the expert at random, I am interested in the expected loss, and this inner product is exactly that quantity.

If this is the case, I want to minimize my expected loss incurred over all the rounds, which is Σ_t ⟨w_t, v_t⟩, the sum over rounds of the per-round expected loss. Why the expectation? Because we select the expert according to a probability distribution, we are not interested in one particular realization; we have modified our regret to expected regret, and that is why we are interested in the expected loss here. You are right that in each round I finally select only one expert and incur the loss with respect to it. But because we select it with some probability, it could just as well have been another expert; so what we want to guarantee is how the algorithm performs on average, not on one particular realization.

Now I want to compare this against what I would get if I selected a particular expert throughout. Say my strategy is deterministic and in each round I select only the i-th expert; my total loss would then be Σ_t v_t(i). So what I am doing, basically, is comparing the expected loss over all the rounds against the loss I would get by playing a single expert in all the rounds. But what I am really interested in is the best I could get from the experts. So fix some sequence of loss vectors v_1, v_2, …, the losses generated by the environment.
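To make the expected-loss computation concrete, here is a minimal Python sketch using the three-expert example above. The loss vector v_t here is an assumed illustration, not a value from the lecture:

```python
import numpy as np

# Learner's probability vector over the 3 experts (the example above).
w_t = np.array([0.2, 0.5, 0.3])

# Losses the environment assigned this round (assumed values in [0, 1]).
v_t = np.array([1.0, 0.0, 0.5])

# Expected loss of sampling an expert from w_t: the inner product <w_t, v_t>.
expected_loss = np.dot(w_t, v_t)
print(expected_loss)  # 0.2*1.0 + 0.5*0.0 + 0.3*0.5 = 0.35
```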
By applying your algorithm, that is, by selecting the weights and choosing an expert according to them, you incur the expected loss above; I now compare this against the best loss I could have incurred: among all the experts, on the loss sequence I have observed, who would have given me the smallest total loss? That, remember, is my benchmark, and I compare my performance against it. The goal in prediction with expert advice is to minimize the difference

Σ_t ⟨w_t, v_t⟩ − min_i Σ_t v_t(i).

Is that clear? Now let us see what a good algorithm is for minimizing this quantity. We are going to discuss a classical algorithm called weighted majority. Notice that the environment can generate the loss vectors v_t by an arbitrary process; I did not put any restriction on them. All I am saying is that the environment assigns some loss to each expert.

Let us now try to work this out. In this course we will try to maintain the convention that a capital letter denotes a random variable and a small letter denotes a fixed, deterministic quantity. The number of experts is a fixed, hence deterministic, quantity, so I am going to use a slightly different notation for it: d.

Fine, let us discuss the algorithm. Its inputs are the number of experts d and the number of rounds n you want to run it for, which this algorithm needs you to specify a priori. We will see later how this can be relaxed, so that we need not fix the horizon in advance but can stop whenever we want and still get the same performance guarantee we are going to derive for this setup. For the time being, assume we have been told the number of rounds. Based on the quantities d and n, we set a parameter η = √(2 log d / n).

Now, this algorithm maintains weights. In this setup I have no control over the way the environment or the adversary chooses the loss vectors; what I do control is how I choose the weight vectors w_t, which are under the learner's control. Depending on the way the learner chooses w_t we get different algorithms, and weighted majority is nothing but a specific way of choosing this weight vector w_t in each round.

So let us discuss how this algorithm chooses its weights in each round. There are two quantities here: w_t(i) and the unnormalized weight w̃_t(i). Initially I start with w̃_1 = (1, 1, …, 1), meaning I give equal weight to all the experts. When I start, I convert these weights into probabilities in the following fashion: I take z_t to be the sum of all the w̃_t(i) and set w_t(i) = w̃_t(i) / z_t. Can you see that the w_t(i), over i, make a probability distribution? If you add up all the w_t(i), the sum is 1, so w_t is indeed a probability distribution.
Now, you choose an expert at random with probability w_t(i): in every round you have anyway assigned a probability vector over the experts, and you sample one of them accordingly. After you do that, you get to see the loss vector v_t from the environment, and, as we have already discussed, ⟨w_t, v_t⟩ is the expected loss you incur. Now comes the main part of the algorithm: how the weights are updated. The update is w̃_{t+1}(i) = w̃_t(i) · e^(−η v_t(i)), where η is the parameter we set at the beginning. Notice what this does: if the loss of an expert is high, its weight comes down sharply in the update; if the loss is small, it does not come down that much; the experts with high losses definitely come down significantly. So, in a way, this algorithm keeps track of a weight for every expert and in every round updates it depending on whether that expert gave a high loss or a low loss; if the loss is high, its weight is decreased significantly. Notice also that I get to see the entire vector v_t, so I know the loss of every expert; that is why I know every v_t(i) and can perform this update. After the update these weights need not form a probability vector, so you again make them one by dividing by the sum of the components, and then you again play an expert according to this probability distribution.

Do you think this algorithm should give good performance, if it updates its weights like this? We will show that it indeed does the job we wanted. Let us write down its performance guarantee. This is one of the celebrated algorithms in the adversarial setting; most of the bounds we derive later, for various settings, will be based on this idea. The idea is simple: decrease an expert's weight when its loss is high, keep it up when the loss is small, build a distribution accordingly, and select the experts accordingly. In a way that makes sense: if I observe a high loss for some expert, I want to give it less weight; if somebody keeps giving me a small loss, I want to play it more, since that may be the expert who always incurs the least loss.

Now, what is the guarantee? We are going to show that this algorithm satisfies

Σ_{t=1}^{n} ⟨w_t, v_t⟩ − min_i Σ_{t=1}^{n} v_t(i) ≤ √(2 n log d),

that is, dividing through by n, an average regret of at most √(2 log d / n). Notice that this bound is for the case where the w_t are chosen according to this algorithm; if you choose the w_t according to some other logic, the bound need not hold. We need to show this, and we will do so in the next class. For the time being, let us assume the result is true and see how we can use it: how can I map this result to the result we want? To argue this, I am going to focus on the case where my hypothesis class is finite.
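As a concrete reference for the update just described, here is a minimal Python sketch, assuming the losses are given as an n × d array; the random losses in the usage part are an illustration, not anything from the lecture:

```python
import numpy as np

def exponential_weights(losses, eta):
    """Run the exponentially weighted update described above.

    losses: array of shape (n, d), where losses[t, i] = v_t(i) in [0, 1].
    eta:    learning rate, e.g. sqrt(2 * log(d) / n).
    Returns the total expected loss sum_t <w_t, v_t>.
    """
    n, d = losses.shape
    w_tilde = np.ones(d)                 # unnormalized weights, all equal at the start
    total_expected_loss = 0.0
    for t in range(n):
        w = w_tilde / w_tilde.sum()      # normalize: w_t is a probability vector
        total_expected_loss += float(np.dot(w, losses[t]))  # <w_t, v_t>
        w_tilde = w_tilde * np.exp(-eta * losses[t])        # multiplicative update
    return total_expected_loss

# Usage on assumed random losses, comparing the regret against the bound:
rng = np.random.default_rng(0)
n, d = 1000, 10
losses = rng.uniform(0.0, 1.0, size=(n, d))
eta = np.sqrt(2 * np.log(d) / n)
regret = exponential_weights(losses, eta) - losses.sum(axis=0).min()
print(regret, np.sqrt(2 * n * np.log(d)))  # regret should stay below sqrt(2 n log d)
```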
So, let us say this hypothesis class H has d elements in it, whatever its size is, call it d, and enumerate them h_1, h_2, …, h_d. I am now going to treat each of these hypotheses as an expert. So what happens in each round? According to the expert setting, the environment assigns losses to each of them; that was v_t. In our setup, however, the environment only chooses (x_t, y_t) in round t. So I have to map the (x_t, y_t) chosen in round t to the v_t of the expert setting. I do this by defining v_t(i) = |h_i(x_t) − y_t|. Can I do this? I told you the v_t can be generated arbitrarily; so if (x_t, y_t) is the sample generated in round t, I can treat |h_i(x_t) − y_t| as the loss of the i-th expert in round t. So now I have the vector v_t.

Now, if these hypotheses are the experts, what I am saying is that when a point x_t comes, I can feed it to all of them, see their predictions, and take a weighted combination as my actual output. Remember that here p_t is something in [0,1]; it is not necessarily 0 or 1, it can be anything in between. In this case I will simply define p_t = Σ_i w_t(i) h_i(x_t). Can I define it like this, and is it still true that p_t is the probability that ŷ_t = 1? Here h_i(x_t) is the prediction given by the i-th hypothesis on the point x_t; some of them may say 0 and some may say 1. Those who say 0 contribute nothing to the summation; only those who say 1 contribute. So, basically, what I am doing is adding up the weights of all the hypotheses that say 1; that sum of the w_t(i) over the hypotheses predicting 1 is exactly the probability that I predict label 1 in round t. That is why I can write p_t like this. Is that clear?

Now let us express |p_t − y_t|; I need this quantity, because I am trying to express everything in terms of my w_t and v_t. I already have p_t, and y_t is given. We will now do a series of simple manipulations; just try to follow them. Notice that w_t is a probability vector and y_t is a constant in round t. So I can manipulate it like this:

|p_t − y_t| = |Σ_i w_t(i) h_i(x_t) − y_t| = |Σ_i w_t(i) (h_i(x_t) − y_t)|.

The first part is clear: for the inner term, Σ_i w_t(i) y_t, you pull out the constant y_t, and since the w_t(i) sum to 1, this equals y_t; that is why the equality holds. But this is not yet in the shape we want: v_t(i) is the absolute value of h_i(x_t) − y_t, whereas here the absolute value sits outside the sum. Converting it to that shape will be our next task: how can I bring the absolute value inside? Our claim is that

|Σ_i w_t(i) (h_i(x_t) − y_t)| = Σ_i w_t(i) |h_i(x_t) − y_t|,

that is, instead of taking the absolute value outside the sum, we can take it inside and get the same thing.
Now let us understand why this is true. In general, the absolute value of a sum need not equal the sum of absolute values; but if we can show that all the quantities inside this summation are of the same sign, all nonnegative or all nonpositive, then it is true. If all of them are nonnegative there is no problem; if all are nonpositive, no problem; the only issue arises if some are positive and some are negative. Now, y_t is a binary value, 0 or 1. Suppose y_t = 0: then each term is w_t(i) h_i(x_t), where h_i(x_t) is 0 or 1 and the w_t(i) are always nonnegative, so every term inside is nonnegative, and the claim holds for y_t = 0. And when y_t = 1, each term w_t(i) (h_i(x_t) − 1) is nonpositive, because h_i(x_t) is 0 or 1; so the claim holds in this case too.

Now we are almost done. What is this quantity equal to? We have |p_t − y_t| = Σ_i w_t(i) |h_i(x_t) − y_t| = Σ_i w_t(i) v_t(i) = ⟨w_t, v_t⟩. So what we have basically shown is that the first term in the classification regret maps to the first term in the experts regret, and the second term maps to the other term, because the v_t(i) are nothing but the |h_i(x_t) − y_t|. And since the experts bound is independent of the particular sequence and loss vectors observed, I could just as well write the benchmark as the infimum over h in H. Because of that, can I claim that applying this algorithm here gives me this regret bound? I have already shown you that applying this algorithm in the prediction with expert advice setting gives this bound, and we have just argued that the online classification setting is nothing but the expert setting, by appropriately defining your p_t and your v_t(i). So we get the bound, where d is nothing but the cardinality of H, and that is what we wanted to show.

Is η an internal parameter? Yes, it is an internal parameter of the algorithm; use whatever internal parameter you want. What I care about is, given your input parameters, what your final outcome is. No, I am not saying the proof does not use it: when I say weighted majority, it does take parameters like this, and to show the bound we do use it; as I have said, we will do that in the next class. But the final bound we get is what I have written here. And now we have connected the prediction with expert advice setting to the online binary classification problem we were dealing with, by mapping the expert advice setting to our online binary classification problem. So our claim is that with this bound we achieve what we basically wanted to show: there exists an algorithm. This is that algorithm; this is the algorithm that gives you this bound. And once you have this, the class is online learnable, because the regret we are getting is sublinear, which is great.
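To summarize the derivation in one place (in the notation above, with d = |H|), the chain we just argued is:

```latex
\mathbb{E}\,\lvert \hat{y}_t - y_t \rvert
  = \lvert p_t - y_t \rvert
  = \Bigl\lvert \sum_{i=1}^{d} w_t(i)\,\bigl(h_i(x_t) - y_t\bigr) \Bigr\rvert
  = \sum_{i=1}^{d} w_t(i)\,\lvert h_i(x_t) - y_t \rvert
  = \langle w_t, v_t \rangle ,
```

and summing over the n rounds and applying the experts bound gives

```latex
\sum_{t=1}^{n} \mathbb{E}\,\lvert \hat{y}_t - y_t \rvert
  \;-\; \inf_{h \in \mathcal{H}} \sum_{t=1}^{n} \lvert h(x_t) - y_t \rvert
  \;=\; \sum_{t=1}^{n} \langle w_t, v_t \rangle
  \;-\; \min_{1 \le i \le d} \sum_{t=1}^{n} v_t(i)
  \;\le\; \sqrt{2\,n \log d}.
```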
So, in the next class we will basically go through the proof of this bound. We will not prove the case where the cardinality of H is infinite; that proof is a bit more involved, so we will skip it, but we will complete this part. Let us stop here.