So, I am just going to state this lemma, which is called the Chernoff bound. Let X_1, X_2, ... be a sequence of i.i.d. random variables, each Bernoulli distributed with parameter mu, and let mu hat denote the sample-mean estimate using t samples. Then for any epsilon between 0 and 1 - mu, the probability that the sample mean mu hat exceeds mu + epsilon is upper bounded by exp(-t KL(mu + epsilon, mu)). Now let us see what this range of epsilon means: when epsilon is 0, mu + epsilon is simply mu, and when epsilon takes its largest value 1 - mu, mu + epsilon is 1, so mu + epsilon never exceeds 1. Similarly, when epsilon is between 0 and mu, we can argue that the probability that mu hat is less than or equal to mu - epsilon is at most exp(-t KL(mu - epsilon, mu)); note that on this side it is mu minus epsilon in both places. Let me just write the first statement cleanly: for epsilon between 0 and 1 - mu, the probability that mu hat is greater than or equal to mu + epsilon is upper bounded by exp(-t KL(mu + epsilon, mu)). The first question is: is this concentration bound any better than what we had using Hoeffding's inequality? What was the bound from Hoeffding's inequality? It was exp(-2 t epsilon^2). Now, if you apply Pinsker's inequality to this divergence, KL(mu + epsilon, mu) is lower bounded by 2 epsilon^2, since the gap between the two parameters is epsilon. So I have a lower bound on the divergence, and because of the minus sign in the exponent, a larger exponent gives a smaller probability bound.
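To see the improvement concretely, here is a small sketch comparing the two tail bounds numerically. The helper name kl_bernoulli and the particular values of mu, epsilon, and t are my own illustrative choices, not from the lecture.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu, eps, t = 0.3, 0.1, 100

# Chernoff bound: exp(-t * KL(mu + eps, mu))
chernoff = math.exp(-t * kl_bernoulli(mu + eps, mu))
# Hoeffding bound: exp(-2 * t * eps^2)
hoeffding = math.exp(-2 * t * eps**2)

print(chernoff, hoeffding)
# By Pinsker's inequality, KL(mu+eps, mu) >= 2*eps^2,
# so the Chernoff bound is never larger than the Hoeffding bound.
assert chernoff <= hoeffding
```

The assertion at the end is exactly the Pinsker argument from the lecture: the divergence dominates 2 epsilon^2, so the KL-based tail bound is at least as tight.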
So you see that if I can restrict myself to Bernoulli rewards, then I have a deviation bound which is tighter than what I would have got using Hoeffding's inequality. KL-UCB exploits exactly this, and that is why it is able to give all its bounds in terms of the divergence, and these then happen to be tighter than the UCB bounds. So let us quickly go through the proof of this; it is a good exercise for us to repeat some of what we already did, not last class but the class before that. What do we have? We have the probability that mu hat is greater than or equal to mu + epsilon. What I will do is unravel this quantity: mu hat is the sample mean, the sum over t = 1 to T of X_t divided by T, so the event becomes the sum over t of (X_t - mu) being greater than or equal to T epsilon; I have just taken the denominator T to the other side. Now let us multiply both sides by some lambda > 0, as we did in the last class, and exponentiate both sides; doing that does not change the probability. Is this manipulation correct? All I have done is substitute the value of mu hat, which is nothing but the sample mean, take T to the other side so the event is in terms of the sum of (X_t - mu), multiply by lambda > 0, and exponentiate both sides. Now I am going to use our usual tool, Markov's inequality. What do I get if I apply Markov's inequality? It gives me the expectation of exp(lambda times the sum of (X_t - mu)), divided by exp(lambda T epsilon). Now, this is the expectation of the exponential of a sum, and what is our assumption? The X_t form an i.i.d. sequence, so the expectation of this exponential of the sum can be written as the product of the expectations of the individual exponential terms.
Now, I am going to expand this. What is this expectation? I know that X_t is 1 with probability mu and 0 with probability 1 - mu. So each factor is mu times exp(lambda (1 - mu)) plus (1 - mu) times exp(-lambda mu), and then we still have the exp(-lambda T epsilon) outside. What does this give? Notice that the term in the square brackets is the same for every t from 1 to T. The outside factor exp(-lambda T epsilon) can be seen as exp(-lambda epsilon) appearing T times, one copy per factor, so I can take each copy inside; it contributes -lambda epsilon to each exponent, and then the entire thing is raised to the power T: we get [mu exp(lambda (1 - mu - epsilon)) + (1 - mu) exp(-lambda (mu + epsilon))]^T. Now, this is a function of lambda, and it holds for any lambda > 0, right? So what can I do? I can look for the lambda that minimizes this quantity. If you differentiate and set the derivative to zero, you can see that setting lambda = log[(mu + epsilon)(1 - mu) / (mu (1 - mu - epsilon))] is the choice that minimizes this exponent. Once you plug this in and simplify (I am skipping the details), you get that the quantity is upper bounded by exp(-T KL(mu + epsilon, mu)). Note that although I wrote equality in the intermediate steps, the overall relation is an inequality, because of the Markov inequality step.
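Putting the steps above together, the chain of the proof can be written out as follows. This is a sketch of the standard Chernoff argument; the optimal lambda is obtained by differentiating the exponent, as described above.

```latex
\begin{align*}
\Pr\!\left(\hat{\mu} \ge \mu + \epsilon\right)
  &\le e^{-\lambda T(\mu+\epsilon)}\,
       \mathbb{E}\!\left[e^{\lambda \sum_{t=1}^{T} X_t}\right]
   && \text{(Markov's inequality, any } \lambda > 0\text{)} \\
  &= \left( e^{-\lambda(\mu+\epsilon)}
     \left( \mu e^{\lambda} + 1 - \mu \right) \right)^{T}
   && \text{(i.i.d.\ Bernoulli factorization)} \\
\intertext{Minimizing over $\lambda$ gives
$\lambda^{*} = \log\frac{(\mu+\epsilon)(1-\mu)}{\mu(1-\mu-\epsilon)}$,
and substituting $\lambda^{*}$:}
  &= \left( \left(\tfrac{\mu}{\mu+\epsilon}\right)^{\mu+\epsilon}
            \left(\tfrac{1-\mu}{1-\mu-\epsilon}\right)^{1-\mu-\epsilon}
     \right)^{T}
   = e^{-T\,\mathrm{KL}(\mu+\epsilon,\,\mu)} .
\end{align*}
```

The last equality is just the definition of the Bernoulli KL divergence applied to the pair (mu + epsilon, mu), which is the simplification being skipped in the lecture.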
So, when I plug in that lambda and simplify, what I get is [(mu / (mu + epsilon))^(mu + epsilon) times ((1 - mu) / (1 - mu - epsilon))^(1 - mu - epsilon)], and this whole thing raised to the power T. All I have done is substitute the minimizing lambda and simplify. Now you can go back and notice that the exponent here is nothing but minus the divergence between mu + epsilon and mu, and this is what we wanted to show. I have skipped that step, but you know the formula for KL(mu + epsilon, mu); just go and check that this bound can be expressed in that form. That is why we could make this claim. Also, make sure that for this choice of epsilon, since epsilon is less than 1 - mu, the quantity 1 - mu - epsilon is non-negative, so every quantity here is well defined and you can write this. Similarly, when epsilon happens to be between 0 and mu, you can show the corresponding bound for the lower deviation in the same way. So I want to stop the discussion of KL-UCB at this point. What we have discussed about KL-UCB is: we have given the algorithm, and we have discussed that it is based on this tighter concentration bound, and this tighter bound is available because we are restricting to Bernoulli distributions. Once we have this bound, we can expect the divergence to appear in the analysis, and we have already stated that the upper bound on the expected number of pulls comes in terms of this divergence, and this happens to be tighter than what we have for UCB.
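As a quick sanity check on the lemma just proved, the bound can also be verified empirically. This is a rough sketch with arbitrary illustrative parameters (mu, epsilon, t, number of trials are my own choices), simulating many sample means and comparing the observed tail frequency against exp(-t KL(mu + epsilon, mu)).

```python
import math
import random

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

random.seed(0)
mu, eps, t, trials = 0.3, 0.1, 50, 20000

# Monte Carlo estimate of P(mu_hat >= mu + eps):
# draw t Bernoulli(mu) samples, check whether the sample mean deviates.
hits = 0
for _ in range(trials):
    sample_mean = sum(random.random() < mu for _ in range(t)) / t
    if sample_mean >= mu + eps:
        hits += 1
empirical = hits / trials

bound = math.exp(-t * kl_bernoulli(mu + eps, mu))
print(empirical, bound)
assert empirical <= bound  # the Chernoff bound holds with room to spare
```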
But you need to note one point here. What was the primary difference between UCB and KL-UCB? The way we are choosing the index. So let us rewrite this. The index in the case of UCB was mu hat_i plus sqrt(alpha log t / n_i), where n_i is the number of pulls; say this is the index for arm i under UCB. Whereas for KL-UCB, the index is the maximum over q in the allowed range such that N_i(t-1) times the divergence between S_i(t-1)/N_i(t-1) and q is less than or equal to the threshold. So earlier, finding the index of arm i was just evaluating a formula; now, to find the index of an arm, you are basically solving an optimization problem: you are trying to find the largest q satisfying this condition. As I already discussed, for q larger than the empirical mean S_i(t-1)/N_i(t-1), the divergence is an increasing function of q: when q equals exactly this empirical mean the divergence is 0, and as you increase q beyond it, the divergence increases. So we know that if you keep increasing q, at some point the condition has to be violated, and the point right where it is violated is the maximizer. But in general, if you just implement this as a plain optimization problem, it could be computationally intensive. When you are doing it in code you have to be careful: a naive implementation is going to take a lot of time to run. You need to exploit the fact that the divergence is increasing in q, see at what point the condition is violated, and find the index based on that. In fact, for a given p, the divergence viewed as a function of q is a convex function.
So it is a convex function. If it is convex, you can always try to see at what point it crosses, that is, violates, the condition; but you should not just plainly call some module which implements convex optimization and have it search for q. If you hand this to some module that uses generic convex optimization methods, it could take quite some time to solve. Instead, try to exploit the monotonicity property of this divergence and, based on that, find this q quickly. If you do that, you should be able to reduce the running time of the algorithm significantly. (Question from the class: where does it blow up?) At q equal to 1, because of the (1 - p) log((1 - p)/(1 - q)) term, the divergence blows up. And why does the index change for every arm? Because t is changing every round, and since the divergence is an increasing function as you increase q, at some point it must exceed the threshold; that crossing point is the maximizer. So whatever the maximizer is, call it q star, the arg max, that value is the index of the arm. Is it increasing? Yes: since the first argument is fixed, the divergence is increasing in q beyond the empirical mean, so you can identify exactly the point where the condition stops holding; up to that point we have less than or equal to. Even if you want to find the exact q that achieves equality, that is like finding a zero crossing of a function, and even that could take some time; so see whether you can use this property to figure out this q quickly.
So all I am saying is: try to find a good way to implement this, because it could take some time. Use the monotonicity property: you can equate the constraint to its threshold and find the zero crossing if that gives you the q value faster, or you can do a kind of search, rapidly increasing q, checking whether it crosses, and then bringing down your step size. Even if you just call a standard root-finding routine, it will do something similar, but see if you can speed it up with your own method. (Question from the class: could the crossing happen on the other side?) Yes, that could possibly happen. Let us say the empirical ratio, call it simply mu hat_i at that time, is where the divergence is zero; the divergence grows as q moves away from it, and the threshold, say log t plus log log t or whatever it is, is a constant for this round. So the point at which the constraint is violated may be on either side of mu hat_i. Which one do you choose? The upper point. Why not the lower point? Because you are maximizing over all feasible points, so you already know you have to take the upper crossing; just focus on that part. One more remark: if you do not exploit this property and instead hand a standard optimization module the constraint, "keep this function below the threshold and find the maximum q", it will find the extremum, but it could take quite some time. Whereas, knowing the property, you do not have to worry about all that: the function is increasing, you just want to see where it crosses the threshold, and you find that point quickly. (Question about the tolerance:) Yes, you need to keep the tolerance small, but do not go for an arbitrarily small tolerance, and also do not keep it too large.
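The search described above can be sketched as a simple bisection. This is my own minimal implementation, not code from the lecture: the function names are hypothetical, the threshold is taken as log t (dropping the log log t correction for simplicity), and the monotonicity of KL(mu_hat, q) in q above mu_hat is what makes bisection valid.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    e = 1e-12  # clamp away from 0 and 1 to avoid log(0) at the boundary
    p = min(max(p, e), 1 - e)
    q = min(max(q, e), 1 - e)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, tol=1e-6):
    """Largest q in [mu_hat, 1] with n_pulls * KL(mu_hat, q) <= log t.

    Since KL(mu_hat, q) is increasing in q for q >= mu_hat, the feasible
    set is an interval [mu_hat, q*], so bisection finds the upper crossing.
    """
    target = math.log(t) / n_pulls
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= target:
            lo = mid   # still feasible: the index is at least mid
        else:
            hi = mid   # constraint violated: the index is below mid
    return lo

print(kl_ucb_index(0.5, 10, 100))
```

Each halving shrinks the bracket by a factor of 2, so the cost is about log2(1/tol) divergence evaluations per arm per round, which is exactly the cheap alternative to calling a generic convex optimization module that the lecture recommends.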
It is not specified; we leave that to your judgment, what a good value should be. If you choose the tolerance very small, you have to do too many halvings and it is going to take too much time; if you keep the tolerance very large, you will find a point quickly, but it may be pretty far off, it may just end up somewhere else entirely. Fine. Now I want to quickly discuss Thompson sampling; we will not go into much detail on its analysis. Thompson sampling is another algorithm, and it is based on a Bayesian approach. What does that mean? When I say environment, it is nothing but a set of parameters defining the means of the reward distributions. Thompson sampling assumes that those parameters are themselves drawn from some distribution: it assumes some prior on them, and based on your observations it keeps updating that prior, draws parameters from the updated distribution, and looks for the arm whose drawn parameter has the highest value. So I am just going to quickly write the algorithm and we will discuss it. Do all of you know the beta distribution? The beta distribution with parameters alpha and beta has the pdf f(x) = [Gamma(alpha + beta) / (Gamma(alpha) Gamma(beta))] x^(alpha - 1) (1 - x)^(beta - 1). You understand this notation; all of you know what the gamma function is, and this ratio of gamma functions is the normalizing constant that makes it a proper pdf. And it is defined for which x? For x between 0 and 1. Thompson sampling uses this beta distribution; the simple version I define here is for the case of Bernoulli reward distributions. What it does is, for every arm, keep track of how many 1s and 0s it has observed from that arm's distribution.
So 1s correspond to successes and 0s correspond to failures. The algorithm assigns to each arm i a beta distribution with parameters (s_i + 1, f_i + 1), where s_i is the number of successes, the number of 1s it has observed from arm i, and f_i is the number of failures, the number of 0s. It then draws a sample from each arm's beta distribution, sees which arm has the highest sampled value, and plays that arm; if it observes a 1, it increments that arm's success count by 1, otherwise it increments the failure count by 1. So what it is basically doing is assigning a prior to each arm through these beta distributions, updating those priors every round based on the successes and failures observed for the arms, and then repeating this process, each time playing the arm that gives the highest sampled value from its beta distribution.
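The steps just described can be sketched as follows. This is a minimal illustrative implementation for Bernoulli rewards; the function name, the use of a simulated environment with known true means, and the uniform Beta(1, 1) prior are my own choices for the sketch.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors.

    Keeps success/failure counts s_i, f_i per arm; the posterior for
    arm i is Beta(s_i + 1, f_i + 1). Returns the pull counts per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one sample from each arm's posterior and play the argmax.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        # Simulated Bernoulli reward; in practice this is the observation.
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
print(pulls)  # the 0.8 arm should receive the bulk of the pulls
```

Note how the update is just a counter increment: because the beta prior is conjugate to the Bernoulli likelihood, the posterior after observing a 1 or a 0 is again a beta distribution with one parameter bumped by 1, which is what makes this version of the algorithm so cheap per round.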