So far, I said we are restricting ourselves to stochastic bandits, right? Do we expect stochastic bandits to be easier to learn than the adversarial ones, or the adversarial ones to be easier than the stochastic ones? The stochastic ones. Why? Because the distributions are fixed. Once you have distributions, they will possibly be governed by some parameters, right? So, if you can somehow figure out these parameters, then you know the distribution according to which your environment is drawing the samples, and you can start selecting the best arm. And what are we trying to do? Our regret is defined such that we want to identify the arm which has the highest mean. This mean is not going to change: because the distributions are fixed, the means remain constant throughout for all the arms. So, if you can somehow figure out these means, you already know which is the best action to take. Now, the question is how you can figure out these means. So, if your aim is to minimize the regret, the whole stochastic bandit problem boils down to how quickly you can identify the means of each of these arms, right? Now, what is a good way to identify the means of the arms? What you get from an arm is just a sample, a number. But if I collect rewards from some arm and take their average, and if I have sufficiently many samples, that should give me the mean value. Why is that? The strong law of large numbers, right? We know from the strong law of large numbers that if I take samples from this particular arm many times and average them, that should give me the mean value. So, for this result to hold, what kind of assumptions do I need on the sampling? The samples need to be IID, independent and identically distributed.
So, for a fixed arm, the identical distribution part is already there, right? Because for a fixed arm the distribution is not going to change. So, the only thing we need is that the samples drawn are independent, and that is one of the assumptions we are going to make. When I said that if you play arm $i_t$ in round $t$ the environment draws a sample corresponding to that arm, that sample is drawn independently of the past samples from the same arm, and not only that, it is also independent of the samples from the other arms, ok? So, we are going to make this assumption: when you pull an arm, you get a sample drawn from the distribution associated with that arm, and that sample is independent of the past pulls of that arm and also of the pulls of the other arms. That is to say, when you pull that arm, the past samples and the other arms did not influence the reward you got from that arm at that point, ok? Now, by making this assumption, I perfectly meet the requirements of the strong law of large numbers, right? To apply the strong law of large numbers you need the IID assumption, and that I am assuming to hold. Now, all I need to know is, when I estimate my mean value, how quickly I can get a good estimate. The law of large numbers says that if you get many, many samples and take their average, you will get the correct value. But I am not going to do this forever, right? I may be stopping after a certain number of rounds. So, how many samples do I need, at least, before I have good estimates of these means? For that, we are now going to make a slight detour to understand concentration bounds, ok?
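To see the law of large numbers at work on a single arm, here is a minimal simulation sketch. It assumes, purely for illustration, that the arm gives Bernoulli rewards with mean 0.6; the function name `running_means`, the mean, and the seed are hypothetical choices and not part of the lecture.

```python
import random

def running_means(p, n, seed=0):
    """Running averages of n independent Bernoulli(p) rewards from one arm."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    means = []
    for t in range(1, n + 1):
        reward = 1.0 if rng.random() < p else 0.0  # one IID pull of the arm
        total += reward
        means.append(total / t)  # average of the first t rewards
    return means

# Assumed true mean of the arm (illustrative value only).
means = running_means(p=0.6, n=20000)
for t in (10, 100, 1000, 20000):
    print(t, round(means[t - 1], 4))
```

The printed averages should drift toward 0.6 as the number of pulls grows, but the simulation by itself does not say how many pulls are enough. That is the question the concentration bounds answer.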
Now, it is clear to you, right? In stochastic bandits, if I have to identify the arm with the highest mean, I need to quickly identify the means, and the way to identify the means is to take the average of the samples I collect. So, I want to understand how quickly these averages concentrate around the true mean value. I know that if I take infinitely many samples, the average is going to be exactly the mean, but I do not have that luxury. So, I want to see how quickly I get close to the true value of the mean, and for that we are going to study how quickly my average concentrates around the true value. Let us call it concentration of measure; some books also call it concentration inequalities. We are going to revisit and discuss some of the main concentration properties. So, let $X_1, X_2, \dots, X_n$ be IID random variables with mean $\mu$ and variance $\sigma^2$. Because $X_1, X_2, \dots, X_n$ are IID, $\mu$ is their common mean and $\sigma^2$ is their common variance. If I want to estimate $\mu$, my natural estimator is the average of these $n$ samples, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$. This $\hat{\mu}$ is a random quantity, right? Because it is the average of $n$ random variables. So, what is the expected value of $\hat{\mu}$? It is the expectation of this average, but I can also write it as $\frac{1}{n}\sum_{i=1}^{n} E[X_i]$. What is the expectation of $X_i$? It is $\mu$, and because of that, $E[\hat{\mu}] = \mu$. So, when this happens, what do we call this estimator?
So, this is an unbiased estimator; we already talked about this in the adversarial case also. And this is true for any $n$: even if you have one sample this is the case, with two samples this is the case, with three samples this is the case. It says that in expectation $\hat{\mu}$ gives me the true value. But what about the variance? The variance of $\hat{\mu}$ is $E[(\hat{\mu} - \mu)^2] = E\big[\big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\big)^2\big] = \frac{1}{n^2} E\big[\big(\sum_{i=1}^{n} (X_i - \mu)\big)^2\big]$. Is this correct? I have just pulled the $1/n^2$ outside; the $\mu$ then becomes $n\mu$, and I have put it back inside the summation, one $\mu$ for each $X_i$. Now you can verify that, because these guys are identically distributed and also independent, when we expand the square the cross terms vanish, each remaining term is $\sigma^2$, and once you simplify you get $\sigma^2 / n$, ok? So, as $n$ increases, this quantity goes down inversely in $n$, right? What it is saying is that even though this is an unbiased estimator, and the expected value of $\hat{\mu}$ is $\mu$ for all $n$, the mean squared error, which is basically the variance, goes down only when you have a large number of samples, ok? If you have a small number of samples, your error, that is, how far $\hat{\mu}$ is from the true mean, could be very large. That is why the number of samples is important: if you want this mean squared error to be small, you cannot get that for just any $n$; you need $n$ to be large, even though the estimator is unbiased for every $n$. And this mean squared error goes to 0 only as $n$ goes to infinity, right?
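The $\sigma^2/n$ claim is easy to check numerically. The sketch below assumes Uniform(0, 1) samples, for which $\sigma^2 = 1/12$; the distribution, the function name `variance_of_sample_mean`, and the trial count are illustrative assumptions, not something fixed by the lecture.

```python
import random

def variance_of_sample_mean(n, trials=20000, seed=1):
    """Empirical variance of the average of n Uniform(0, 1) samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # One realization of mu_hat: the average of n IID samples.
        estimates.append(sum(rng.random() for _ in range(n)) / n)
    center = sum(estimates) / trials
    # Average squared deviation of mu_hat across the repeated experiments.
    return sum((m - center) ** 2 for m in estimates) / trials

SIGMA2 = 1.0 / 12.0  # variance of a single Uniform(0, 1) sample
for n in (1, 10, 100):
    # Columns: n, empirical variance of mu_hat, predicted sigma^2 / n.
    print(n, variance_of_sample_mean(n), SIGMA2 / n)
```

The second and third columns should roughly agree, and both should shrink by about a factor of 10 each time $n$ grows by a factor of 10.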
But often we would be happy even if $\hat{\mu}$ is not exactly the same as $\mu$; as long as $\hat{\mu}$ is very close to $\mu$, it is fine. So, you may be interested in tail probabilities, where you ask questions like: what is $P(\hat{\mu} \ge \mu + \epsilon)$, or what is $P(\hat{\mu} \le \mu - \epsilon)$? In the first one you are asking that $\hat{\mu}$ is larger than $\mu + \epsilon$, and in the second one that $\hat{\mu}$ is less than the true value minus $\epsilon$, that is, $\hat{\mu} - \mu \le -\epsilon$, ok? So, what are you asking? You have your true value $\mu$ here; take this point to be $\mu - \epsilon$ and this point to be $\mu + \epsilon$. You want to know how likely it is that $\hat{\mu}$ does not lie between them, that is, $\hat{\mu}$ is below $\mu - \epsilon$ or above $\mu + \epsilon$. These are the bad cases for you, right? You would be happy if $\hat{\mu}$ is in the middle region, but if it is in either of the outer regions you are not happy, and that is the question you are asking. So, the first one is called the upper tail probability and the second one the lower tail probability. When you ask whether your estimate falls below $\mu - \epsilon$, you are asking for a lower tail probability, and when you ask whether it falls above $\mu + \epsilon$, you are asking for an upper tail probability. Or you can combine both and ask: what is $P(|\hat{\mu} - \mu| \ge \epsilon)$? That is the probability that $\hat{\mu}$ is in either of the two outer regions, and this one you call the two-sided tail probability.
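The three tail probabilities can be estimated by brute force, which makes the definitions concrete. This is only a sketch under assumed values: Bernoulli samples with mean 0.5, $\epsilon = 0.1$, and the hypothetical helper name `tail_probabilities` are all illustrative choices.

```python
import random

def tail_probabilities(p, n, eps, trials=5000, seed=2):
    """Monte Carlo estimates of the upper, lower and two-sided tail
    probabilities of the average of n Bernoulli(p) samples."""
    rng = random.Random(seed)
    upper = lower = 0
    for _ in range(trials):
        mu_hat = sum(1 for _ in range(n) if rng.random() < p) / n
        if mu_hat >= p + eps:      # bad case: estimate too high
            upper += 1
        elif mu_hat <= p - eps:    # bad case: estimate too low
            lower += 1
    # For eps > 0 the two bad events are disjoint, so the two-sided
    # probability is the sum of the two one-sided ones.
    return upper / trials, lower / trials, (upper + lower) / trials

for n in (50, 200):
    print(n, tail_probabilities(p=0.5, n=n, eps=0.1))
```

With the same $\epsilon$, the estimates for $n = 200$ should come out much smaller than those for $n = 50$.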
So, basically we are interested in how small these probabilities are and how they depend on the number of samples. Of course, if I have enough samples these probabilities could be small, but how small? So, now we are interested in bounding these probabilities, ok? The complement of the two-sided event is $|\hat{\mu} - \mu| < \epsilon$, that is, $\hat{\mu}$ lies in the region around $\mu$. So, we may be interested in asking how many samples I need to collect so that, if I take their average, the estimate I get is within an $\epsilon$ neighborhood of my true value, right? If the tail probability is very small, then I can be very confident that my estimate is in this region. How to choose this $\epsilon$ we will discuss later; whatever $\epsilon$ you pass on, this is the kind of result we are interested in, ok? Now we want to see how we can bound these probabilities. How many of you know the Markov inequality? At least all the 611 guys should know it, right? So, what is the Markov inequality? Suppose I want to bound $P(|X| \ge \epsilon)$. Is the bound in terms of the expectation of just $X$ or of $|X|$? I want $|X|$, because I want this quantity to be a nonnegative random variable, so I have deliberately put the mod here. The bound is $P(|X| \ge \epsilon) \le E[|X|] / \epsilon$; this is my Markov inequality. So, what about the Chebyshev inequality? How many of you know the Chebyshev inequality? The others do not know it? So, the 611 guys do not know the Chebyshev inequality? And how do you get the Chebyshev inequality from this one? Say I have $X - \mu$ and I want to bound $P(|X - \mu| \ge \epsilon)$. What is the bound on this? The variance of $X$ divided by something: is it $\epsilon^2$ or $\epsilon$ here?
Variance divided by $\epsilon^2$: $P(|X - \mu| \ge \epsilon) \le \mathrm{Var}(X) / \epsilon^2$, which you get by applying the Markov inequality to $(X - \mu)^2$. Let me just write it correctly here: the $\mu$ here is the expectation of $X$. Now, we already know that the expected value of $\hat{\mu}$ is $\mu$, and we are interested in $P(|\hat{\mu} - \mu| \ge \epsilon)$. So, by taking the random variable $X$ to be my $\hat{\mu}$, I already have one bound here, right? Let us compute it. What is the variance of $\hat{\mu}$? We already computed it to be $\sigma^2 / n$, so the bound is $P(|\hat{\mu} - \mu| \ge \epsilon) \le \sigma^2 / (n \epsilon^2)$. So, what does this say? Suppose you fix an $\epsilon$; then as we increase $n$, this bound becomes smaller and smaller, right? As you get more and more samples, the probability that my estimator falls outside this interval comes down. Now, let us fix $n$ and look at $\epsilon$. Once you fix $n$, this upper bound changes inversely with $\epsilon^2$, right? So, if you take $\epsilon$ to be small, what happens to the bound? It becomes bigger and bigger. If you ask for a small interval around $\mu$, the upper bound is going to be very large; that is, you are no longer guaranteeing that this probability is small, it can be a large value when $\epsilon$ is small, right? So, we are saying: fix $n$ and shrink this $\epsilon$. Now I am asking, what is the probability that $\hat{\mu}$ lies in this smaller interval?
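The Chebyshev bound $\sigma^2 / (n \epsilon^2)$ can be turned around to answer the "how many samples" question: requiring the bound to be at most some target $\delta$ gives $n \ge \sigma^2 / (\delta \epsilon^2)$. The sketch below assumes $\sigma^2 = 0.25$ (the largest variance a Bernoulli reward can have); the helper names and the numeric values are illustrative assumptions.

```python
import math

def chebyshev_bound(sigma2, n, eps):
    """Chebyshev upper bound on P(|mu_hat - mu| >= eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

def samples_needed(sigma2, eps, delta):
    """Smallest n for which the Chebyshev bound is at most delta
    (up to floating-point rounding)."""
    return math.ceil(sigma2 / (delta * eps ** 2))

sigma2 = 0.25  # assumed variance of one sample (worst case for Bernoulli)

# Fix eps and grow n: the bound shrinks inversely in n.
for n in (10, 100, 1000):
    print("n =", n, "bound =", chebyshev_bound(sigma2, n, eps=0.1))

# Fix the target delta and shrink eps: the sample requirement grows as 1 / eps^2.
for eps in (0.2, 0.1, 0.05):
    print("eps =", eps, "n needed =", samples_needed(sigma2, eps, delta=0.05))
```

Halving $\epsilon$ should roughly quadruple the number of samples, which is exactly the trade-off between the interval width and $n$ described above.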
With $n$ fixed, I can only guarantee that with smaller and smaller probability, right? That is because as I make $\epsilon$ smaller, the bound $\sigma^2 / (n \epsilon^2)$ becomes larger and larger, ok? So, what we are saying is: if you fix the interval and increase $n$, the probability that your estimate falls outside the interval becomes very small. But if you fix $n$ and want $\hat{\mu}$ to lie in a smaller interval, that is going to happen only with smaller probability, ok? Fine. Now, the question is, is this the best bound one can have? Is it that this error goes down just inversely in $n$, or can it go down much faster? After all, this is just an upper bound, right? We do not know whether this upper bound is tight or whether one can get a smaller upper bound. There is a lot of study that probabilists do just to get better bounds on such quantities, and once you get a better bound, maybe that will help you understand the smallest number of samples you need so that the probability of this bad event is small, ok? So, what better bound can we seek? We are going to discuss one more bound on this using the central limit theorem. What is the central limit theorem? Again I have an IID sequence. If I take their sum, center it by subtracting the mean, and then normalize it, $\frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n \sigma^2}}$, what does this converge to? A normal distribution with mean 0 and variance 1, and this convergence is in distribution, right? Now, our $\hat{\mu}$ is also of this shape, an average of $n$ samples. So, let us see how, using the CLT result, we can get a bound on the probability of $\hat{\mu} - \mu$ being greater than $\epsilon$, ok? Fine.
So, what does the CLT say? If you let the number of samples go to infinity, the sum, when you scale it in this fashion, converges to a standard normal; this is an asymptotic result. So, now, let us ask the question, and let me ask only the one-sided question: what is $P(\hat{\mu} - \mu \ge \epsilon)$? Let me plug in the quantities. $\hat{\mu}$ is $\frac{1}{n}\sum_{i=1}^{n} X_i$, so this is $P\big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu \ge \epsilon\big)$. I just do some manipulation and write it as $P\big(\sum_{i=1}^{n} X_i - n\mu \ge n\epsilon\big)$. Now I want it in the CLT form, so I divide both sides by $\sqrt{n \sigma^2}$. This is the same as $P\big(\frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n \sigma^2}} \ge \epsilon \sqrt{n / \sigma^2}\big)$, ok? I know that the quantity on the left, when $n$ is sufficiently large, looks like a normal distribution with mean 0 and variance 1. Now, pretend that $n$ is sufficiently large and this is already a Gaussian distribution with parameters 0 and 1. Then this probability is basically a tail probability: what is the probability that my standard normal random variable takes a value greater than $\epsilon \sqrt{n / \sigma^2}$? So, I just plug in my Gaussian density: it is $\int_{\epsilon \sqrt{n / \sigma^2}}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx$, where the variance in the density is 1 because asymptotically this quantity has variance 1. The integrand is the pdf of a Gaussian random variable with mean 0 and variance 1. For notational purposes, let me call the lower limit simply $u = \epsilon \sqrt{n / \sigma^2}$. So, what is this integral basically?
We have a special name for this, right, and also a special symbol. What was that? Yes, $1 - \Phi(u)$, ok? In general this integral does not have a closed-form expression; it is hard to integrate. So, what we will do is just look at an upper bound on it. I have this $u$ here, and I am integrating from $u$ to infinity, which means I am integrating over a region where $x$ is larger than $u$, ok? And yes, $u$ is $\epsilon \sqrt{n / \sigma^2}$. Yes, we are saying that this is going to be Gaussian only asymptotically; what we are saying is, for any given $n$, just assume it is sufficiently large, but still finite, and that this has already converged to a Gaussian distribution. Just pretend that and then apply this. Because of this, maybe we should say that this is an approximation here, ok? So, now, what I did is basically this: everything remains the same, I have just added the extra factor $x/u$ inside the integral. Your integration variable $x$ is larger than $u$, so $x/u$ is at least 1, and because of that we get an upper bound: $1 - \Phi(u) \le \int_{u}^{\infty} \frac{x}{u} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx$. And $x e^{-x^2/2}$ is easy to integrate. After integration you get $\frac{1}{u\sqrt{2\pi}} e^{-u^2/2} = \sqrt{\frac{\sigma^2}{2\pi n \epsilon^2}} \exp\big(-\frac{n \epsilon^2}{2\sigma^2}\big)$; I skip the integration, you can verify this. So, now notice what is happening here: this probability, which was our interest, is now decaying exponentially in $n$. Earlier we could only get it to decay inversely in $n$. This is exponential in $n$, so it is going to decay much faster, right? But the caveat here is that we made an approximation, just assuming that $n$ is sufficiently large. So, we are saying that if $n$ is large enough for this convergence to have happened, this probability can actually decay exponentially in $n$, not just inversely in $n$. So, based on this result, we are going to prove some more concentration results along these lines.
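To get a feel for the gap between the two rates, the sketch below compares the Chebyshev bound with the Gaussian tail bound $\frac{1}{u\sqrt{2\pi}} e^{-u^2/2}$ at $u = \epsilon\sqrt{n/\sigma^2}$, and with the exact standard normal tail computed through the complementary error function. The values $\sigma^2 = 0.25$ and $\epsilon = 0.1$ and the function names are illustrative assumptions. Note that the Chebyshev bound is two-sided while the Gaussian quantities are one-sided, and that the Gaussian numbers rely on the CLT approximation, so this is a comparison of rates rather than a guarantee.

```python
import math

def gaussian_tail(u):
    """Exact standard normal tail 1 - Phi(u), via the complementary error function."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def gaussian_tail_bound(u):
    """Upper bound exp(-u^2 / 2) / (u * sqrt(2 * pi)) on 1 - Phi(u), for u > 0."""
    return math.exp(-u * u / 2.0) / (u * math.sqrt(2.0 * math.pi))

def chebyshev_bound(sigma2, n, eps):
    """Chebyshev upper bound on P(|mu_hat - mu| >= eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

sigma2, eps = 0.25, 0.1  # assumed variance and accuracy (illustrative values)
for n in (25, 100, 400, 1600):
    u = eps * math.sqrt(n / sigma2)
    # Columns: n, Chebyshev bound, Gaussian tail bound, exact Gaussian tail.
    print(n, chebyshev_bound(sigma2, n, eps), gaussian_tail_bound(u), gaussian_tail(u))
```

For small $n$ the Gaussian bound is loose (it can even exceed 1 when $u$ is small), but as $n$ grows it drops many orders of magnitude below the Chebyshev bound.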
So, this result basically gave us the intuition that this probability can actually decay exponentially in $n$, not just inversely in $n$, even though we made some approximations here. So, in the next one we will talk about Hoeffding's and other inequalities, where we actually get that exponential decay, ok? So, let us stop here.