side effect, and seven of them did not. And we want to bound the expected proportion of patients that will have a side effect if we continue giving this same drug to new patients. So this is a motivating question. And when we ask this question, there are two assumptions that we are making. The first assumption is that the patients in the sample are drawn independently from the same distribution. And this assumption I will get back to soon. The second assumption is that the new patients are sampled from the same distribution as the patients that we have evaluated the treatment on. And this assumption is quite obvious. If you have data, for example, from adults, and then you give the same drug to children, you probably don't expect to see the same rate of side effects. Or if you train your language model on Shakespeare and then apply it to Twitter, you probably also don't expect the same kind of performance when you completely transfer the domain. So it's quite an obvious assumption, but it's pretty often ignored, leading to many different problems. Now a little bit of formal background. We have a random variable z, and we use the expectation of z, which is the sum over all possible values (or an integral, if you have a continuous random variable) of the value times the probability of observing that value. Throughout the talk, we will use mu to denote the expectation of this random variable. For Bernoulli random variables, so if the domain is {0, 1}, the expectation of z is 1 times the probability of observing 1 plus 0 times the probability of observing 0, which is just the probability of observing 1. So in the case of a Bernoulli random variable, the expectation is equal to the probability of observing 1. Now we will formalize the question that we asked. We have a sequence of random variables, z_1 up to z_n, where the z_i's are in the set {0, 1}. We will also consider, more generally, the interval [0, 1].
So these are the patients that received the drug and either got or didn't get the side effect, so 1 or 0. And we assume that z_1 up to z_n are independent, identically distributed. And we use mu hat n to denote the average of these observations. So let's say that a side effect is 1; then 3 out of 10, that's 0.3, is the empirical mean. And we are interested in inferring the expectation: what's the expected rate of side effects, based on the observed average of side effects? OK. And well, you may be quite tempted to say, OK, I had 30% side effects, so that's what I'm going to observe on new samples. But the size of the sample comes into play. Take the limit case when n equals 1: you give the drug to just one patient, and you observe the outcome, which is either 0 or 1. So if n equals 1, you don't really expect this mu hat to be close to the expectation. Your sample is too small, right? As you increase the sample, you expect that this mu hat n will get gradually closer to mu. And the question is, how fast? How fast does this empirical average approach the true mean of that random variable? If anyone has any questions, raise your hand and ask. OK. Now, I don't know your exact backgrounds. Yes. Yes. So in my example, we have 10 patients. We have 10 random variables, n equals 10. For each of them, if a patient got a side effect, then z_i equals 1. And if the patient didn't get a side effect, then z_i equals 0, no side effect. And you want to know the expected rate of side effects that you will have if you continue giving this drug to new patients. OK. Good. Any other questions? Good. So again, I'm not sure what sort of background you are coming from, but there are generally two big camps in statistics: the frequentist camp and the Bayesian camp.
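To make this concrete, here is a small simulation sketch (assuming Python; the side-effect probability mu = 0.3 is an illustrative choice matching the 3-out-of-10 example) showing how the empirical mean mu hat n behaves as n grows:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def empirical_mean(n, mu):
    """Average of n independent Bernoulli(mu) draws (1 = side effect)."""
    return sum(random.random() < mu for _ in range(n)) / n

# mu = 0.3 is illustrative, as in the 3-out-of-10 example.
for n in (1, 10, 100, 10_000):
    print(n, empirical_mean(n, 0.3))
```

With n = 1 the estimate is forced to be 0 or 1; with large n it settles near 0.3, and how fast that happens is exactly the question.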
In the Bayesian camp, the reasoning is that the parameters of the distribution, for example the mean mu, are sampled from some unknown distribution. Or, if we're talking about this Bernoulli random variable, the probability of z being equal to 1, which is mu, is sampled from some unknown distribution. Bayesians start with some prior distribution over these parameters, so some distribution over the possible values that mu may take. And then they apply Bayes' rule: they update their belief that mu equals some value, given the sample, using Bayes' formula. OK. And in this case the probabilities, and we have many probabilities here, are over both the observations z_1 up to z_n and the parameter mu. So both the observations and the parameters are treated as random variables, and the probability is over both. Now, if the prior distribution over the parameters doesn't match reality, then the result falls apart. So the Bayesians, in a sense, believe in their prior. If the prior matches the true generating distribution of the parameters, everything is fine. If not, then it's not very clear what you're doing. In frequentist reasoning, the parameters are unknown but fixed. They are not random variables. So there is some probability of having a side effect, but it's not sampled from some distribution or anything. It's fixed but unknown. And the frequentists bound the probability that the observation, this empirical average, deviates strongly from the true value. What's the probability that it underestimates by more than epsilon? Or that it overestimates by more than epsilon? Or that it deviates in either direction by more than epsilon? OK. And the random variable in these probability expressions is the empirical mean, or the z_i's, but not mu. The probability is over the empirical mean, not over the parameter.
So essentially, in the frequentist formulation of the problem, we want to bound the probability that the observed empirical mean significantly underestimates the true mean (in the case where underestimation is what we want to upper bound). So what's the probability that this empirical average underestimates the true mean by more than epsilon? This is the frequentist way of asking the question, and this is the formulation that we will work with. Any questions about that? So with this formulation in mind, I'm now going to show a few basic concentration inequalities, a few inequalities that bound this probability that the observed empirical average significantly underestimates the true average that we will observe on new samples. The most basic one is Markov's inequality. Markov's inequality tells us that if you have a non-negative random variable z, then for any epsilon larger than zero, the probability that z exceeds epsilon is bounded by the expectation of z over epsilon. Maybe just to check where we are generally: who has seen this at some point before? OK. Who has never seen this? OK. Good. So I'll show you a proof of this one. It's quite simple, although it may still be hard to get in two minutes while you are sitting here, but you can check the slides later. For the proof, we define a random variable w as the indicator that z exceeds epsilon. So w equals 1 if z is larger than epsilon, and it's 0 otherwise. In this case, we have that w is smaller than z over epsilon. OK. So I have an illustration here: this is the value of z, and this is the value of w. If z is smaller than epsilon, w is 0, so it's smaller than z over epsilon. If z equals epsilon, then w equals 1, so we have equality. And beyond that, w is smaller than z over epsilon. And w is a Bernoulli random variable, so it's either 1 or 0, and so the probability of w being equal to 1 is equal to the expectation of w.
And then we're looking at the probability that z exceeds epsilon. That's the same as the probability that w equals 1, which is the same as the expectation of w. And now w is bounded by z over epsilon, so this is bounded by the expectation of z over epsilon. And we have obtained the inequality that we wanted. OK. So how do we use this inequality? Here is the inequality that we have just derived, and we have z_1 up to z_n that are Bernoulli i.i.d. Again, you can think about these patients getting the treatment and getting the side effects. And we want to bound the probability that the empirical mean underestimates the true mean by more than epsilon. That's the way we write it. First of all, we want to bring this to the form of Markov's inequality, so we want the random variable on one side and the epsilon on the other side. The random variable is mu hat n, so we can take mu to the other side, and we get the probability that minus mu hat n exceeds epsilon minus mu. That's the same thing. Now this has the form of Markov's inequality, but we have assumed that z has to be non-negative, and minus the empirical mean of Bernoulli random variables is negative. So in order to fix it, we add 1 on both sides of the inequality: I have 1 minus the empirical mean here and 1 minus mu plus epsilon on the other side. This random variable is now non-negative; 1 minus the empirical mean of Bernoulli random variables is between 0 and 1. And this whole thing on the other side is my new epsilon in Markov's inequality. So this is my random variable, this is the epsilon, and this is bounded by the expectation of the random variable over the new epsilon. The expectation of the empirical mean is the true mean, so we get this. And if you do a bit more calculation, you get that this is bounded by 1 over (epsilon plus 1). So this is the bound that you get, and you can play with some numbers if you want.
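As a quick sanity check (a sketch, assuming Python; mu = 0.3 and eps = 0.2 are arbitrary illustrative values), we can compare the empirical frequency of the underestimation event against the Markov-derived bound 1/(1 + epsilon):

```python
import random

random.seed(1)

def underestimate_freq(n, mu, eps, trials=20_000):
    """Fraction of trials in which mu_hat_n <= mu - eps."""
    bad = 0
    for _ in range(trials):
        mu_hat = sum(random.random() < mu for _ in range(n)) / n
        if mu_hat <= mu - eps:
            bad += 1
    return bad / trials

mu, eps = 0.3, 0.2
bound = 1 / (1 + eps)  # the Markov-derived bound; note it ignores n
for n in (10, 100):
    print(n, underestimate_freq(n, mu, eps), "<=", bound)
```

The observed frequency drops quickly as n grows, while the bound stays fixed at about 0.83, which is exactly its weakness.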
But the point is that the concentration we get from Markov's inequality does not improve with n. It doesn't matter how many samples you have; you get the same bound on the probability. So it's not improving as we increase the number of samples, even though we would like something that shows that the more we flip this coin, or the more patients we observe, the closer we get to the true mean. We are going to get there. Another point to pay attention to: we used an upper bound on z in this step of the derivation. We used the fact that z is upper bounded by 1 in order to bring this random variable under control. And we did not use independence in this derivation, so the derivation holds even if the random variables are dependent. But again, the bound does not improve with n, and we'll get back to this point a bit later. Any questions? Yes. I will repeat the question: why does it have to be a non-negative random variable here? Well, if z could go negative, the inequality w smaller than z over epsilon would break down, and the whole proof breaks down with it. So it doesn't work if z can be negative. You can flip the sign, so you can bound the probability that minus z is smaller than minus epsilon, but that's all you can do with this inequality. Any other questions? No? OK. Moving to the next step, we have Chebyshev's inequality. Chebyshev's inequality tells us that for any positive epsilon, the probability that a random variable deviates from its expectation in absolute value, so in either direction, by more than epsilon is bounded by the variance of z over epsilon squared. This is Chebyshev's inequality. It's also not very difficult to prove, so I'll show you the proof. We look at the probability that the absolute value of z minus the expectation of z exceeds epsilon. Both sides are positive, so we can take a square.
It's the same as the probability that the square of z minus the expectation of z exceeds epsilon squared; we just squared both sides of the inequality. And now this is the Markov's inequality that we used previously: we take this squared deviation as our random variable and epsilon squared as our epsilon. Applying Markov's inequality, we get that this is bounded by the expectation of the random variable, which is now the squared difference of z and the expectation of z, over the parameter, which is now epsilon squared. And as you know, by the definition of the variance, this numerator is the variance of the random variable z. Questions? How do we use this inequality? Again, we have a sequence of i.i.d. random variables, and now we can even look at the absolute deviation, so we look at the probability that the empirical mean deviates from the true one by more than epsilon. By Chebyshev's inequality, this is bounded by the variance of the empirical mean over epsilon squared. And now I expand what this empirical mean is: it's the average of the random variables. So this is a direct application of Chebyshev's inequality. And now I remind you: if we have independent random variables, then the variance of a sum is the sum of the variances. And for any constant, the variance of a constant times a random variable is the constant squared times the variance of that random variable. So we have this constant of one over n; if we take it outside of the variance, it becomes one over n squared. And we have the sum of the random variables, so the variance of the sum is the sum of the variances, and they are i.i.d., so it's n times the variance of any one of them. And we had n squared from the one over n, so we get that this is equal to the variance of any one of them, say the first one, over n times epsilon squared.
So the probability that the empirical mean of these n i.i.d. variables deviates by more than epsilon from the true mean is bounded by the variance of any single random variable over n epsilon squared. So we got a concentration that now improves with n: the more times we flip the coin, or the more patients we observe, the closer, with high probability, the average gets to the true mean of that coin. And now we have used independence. In this step, we used the independence of the random variables in order to take the variance inside the summation. Any questions about this? Yes, yes, yes. So mu is fixed, and mu hat n is the random variable. And again, we observe this empirical mean; we have observed that three out of 10 patients got a side effect. And the question is, what's the probability that this observation deviates significantly from the true mean, from the true expectation of having a side effect? But in Chebyshev's inequality, isn't mu playing the role of z when we want to apply it, and mu is not a random variable, no? So in Chebyshev's inequality, z is the same as mu hat n; they are inside the absolute value, so we can swap them, that's the same. And mu is the expectation of mu hat n. In expectation, the average is the true mean, but it has some deviations. If you look at the expectation of mu hat n, it's mu. So mu is the expectation of z, and mu hat n is the z in Chebyshev's inequality. Okay, thank you. Okay? Yes. So they are independent, identically distributed, so they have the same distribution, which means that the variance of any one of them is the same. So you can put any index here, okay? Any other questions? Okay. Moving forward, we have Hoeffding's inequality; we are trying to get tighter and tighter.
So now we have Hoeffding's inequality: if we have i.i.d. random variables in the [0, 1] interval with expectation mu, then for any epsilon greater than zero, the probability that the empirical mean overestimates the true mean by more than epsilon is bounded by the exponent of minus 2 n epsilon squared, and the probability that it underestimates the true mean is also bounded by the same quantity. These two are known as one-sided Hoeffding's inequalities, and they don't hold simultaneously. If you want a simultaneous statement, you have the corollary, which is the two-sided Hoeffding's inequality: the probability that mu hat deviates from mu on either side by epsilon is bounded by the probability that it deviates to the left plus the probability that it deviates to the right, and so it's bounded by twice the exponent. This is a union bound: the probability that event A or event B happens is bounded by the sum of the probabilities of the two events, okay? So let's make another check. Who has seen Hoeffding's inequality before? More. Okay. Who has not seen it? Okay, so we are moving forward a little bit faster. Anyone has any immediate questions? Yes. So, yeah. You can generalize it to any bounded random variable; you need boundedness, and then the scale, the range of the random variable, will show up in the inequality. I am doing it for [0, 1] to make it a little bit simpler, but yes, you can have the range in there. Any other questions? Anything that we do for random variables bounded in the [0, 1] interval can obviously be extended to any bounded random variable by just rescaling it to the [0, 1] interval, applying the bound, and then rescaling back. And what we got by Hoeffding's inequality is that the empirical mean converges to the true mean exponentially fast in n, okay? The probability that it deviates by more than epsilon decreases exponentially fast as the number of samples n grows. So that's quite a powerful thing.
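To see how much tighter each step is, here is a small numeric comparison (a sketch, assuming Python; mu = 0.3 and eps = 0.1 are arbitrary illustrative values) of the three bounds on the probability that the empirical mean of n Bernoulli(mu) variables deviates by more than eps:

```python
import math

mu, eps = 0.3, 0.1  # illustrative values

def markov_bound(n):
    # From the Markov derivation above; note it does not depend on n.
    return 1 / (1 + eps)

def chebyshev_bound(n):
    # Variance of a Bernoulli(mu) is mu * (1 - mu); decays like 1/n.
    return mu * (1 - mu) / (n * eps ** 2)

def hoeffding_bound(n):
    # One-sided Hoeffding; decays exponentially in n.
    return math.exp(-2 * n * eps ** 2)

for n in (10, 100, 1000):
    print(n, markov_bound(n), chebyshev_bound(n), hoeffding_bound(n))
```

At n = 1000, for example, the Chebyshev bound is 0.021 while the Hoeffding bound is exp(-20), on the order of 1e-9.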
I'm also going to show you the proof, but before that, let's understand the bound a little bit. What does it tell us, and how do we work with it? So this is the bound: the probability that the empirical mean underestimates the true mean by more than epsilon is bounded by this exponent, and we denote the right-hand side by delta. So this is a bound on the probability that the empirical mean runs too far to the left of the true mean. If I solve the equation exp(-2 n epsilon squared) = delta, then I get that epsilon equals the square root of log(1/delta) over 2n. You just take this equality and solve for epsilon. If you want the two-sided version, then you will have 2/delta instead of 1/delta. Substituting this epsilon back in, we get that the probability that the empirical mean underestimates the true mean by more than the square root of log(1/delta) over 2n is bounded by delta. So this is another way of looking at this inequality. And if I take the negation of this event, then with probability at least 1 minus delta, mu hat n does not fall below the true mean by more than this square root. So with probability at least 1 minus delta, mu hat n is within the square root of log(1/delta) over 2n of the true mean. This square root is known as the precision: how precisely mu hat n estimates mu. And this 1 minus delta is known as the confidence: how confident we are that the empirical mean is within that distance of the true mean. And again, you can also put an absolute value here by replacing 1/delta with 2/delta. So here is an illustration. We have the true mean.
We have the number of samples n, and as the number of samples grows, the empirical mean with high probability stays within this range of the true mean, where the range decreases at the rate of the square root of log(1/delta) over 2n. Any questions? Okay. And there is a trade-off: we can take a smaller delta, and then the interval gets larger. So if you want high confidence, high 1 minus delta, you have to compromise on precision, and the other way around: if you want high precision, you have to compromise on confidence. There is an interplay between these two things that depends on how many samples you have. And if you take the two extreme cases: if you take delta equal to zero, so you want confidence of one, then you can't control the empirical mean at all; all you can say is that the empirical mean is within infinity of the true mean. And if you don't require any confidence, delta equal to one, the precision becomes zero, but the statement holds with probability at least zero, so it tells you nothing. This goes a little bit aside from online reinforcement learning, but this inequality gives rise to the so-called probably approximately correct (PAC) learning framework, which is a theoretical learning framework, more in offline learning. And why is it "probably approximately correct"? With high probability, with probability 1 minus delta, the empirical mean is approximately equal to the true mean. And again, to remind you, the probability here is over mu hat n, which is the random variable, and not over mu, which is a deterministic quantity. So even though these inequalities are often written in the form of bounding mu in terms of mu hat n, the random variable is mu hat n, and it's the probability that the empirical observation deviates from the true value. Questions? Yes? So you want to get a similar result, but for some probability.
And like we want to estimate a density, and get an epsilon-delta PAC estimate like you did here. Or, to rephrase: let's say I want to estimate a functional. We have a model and we want to estimate a functional of this model. Do we use similar tools? So if you want to estimate the density, which means that you want to estimate the whole distribution, that gives you more information than estimating just the mean, and it requires different tools to achieve. What we show here are tools for estimating the mean of a distribution based on observations, not the whole distribution. Okay. Any other questions? And well, there are different ways of using this bound. Again, to remind you, we have this delta, which is the confidence parameter, and that's equal to exp(-2 n epsilon squared). If we solve this equation for epsilon, we have the precision expressed in terms of the confidence and the number of samples. And we can also express n in this equation based on delta and epsilon. So we can fix any two parameters out of the three; the three parameters are epsilon, delta, and n: precision, confidence, and sample size. The inequality then gives us the value of the third parameter. So how can we see it? Let's say that we fix n and epsilon, and then we get delta. What's the question that we are answering? What is the probability that mu hat n underestimates mu by more than epsilon, given that we have n samples? So n and epsilon are fixed, and the inequality gives us a bound on the probability. We can also fix n and delta and then ask about epsilon: what is the maximal underestimation of mu by mu hat n that can be guaranteed with probability at least 1 minus delta, given a sample of size n? So again, you have a fixed sample, and then someone tells you: give me a guarantee, holding with probability 95%, that the sample is not deviating too strongly.
So how good a guarantee can you get? You substitute your confidence, you substitute n, and you get the precision that you can achieve with a given sample size at the desired confidence. And then finally, if we fix epsilon and delta, we have this question: how many samples do we need in order to guarantee that mu hat n doesn't underestimate mu by more than epsilon, with probability at least 1 minus delta? So you fix the precision. You say: I want to be within 0.01 of the true mean with probability 99%. How many patients do you have to collect in order to achieve this target precision with the target confidence, okay? So these are the three ways to use the bound: you can fix any two parameters, precision, confidence, or sample size, and get the answer for the third parameter that you have not fixed. Questions? How are we doing with time? No questions. Okay, let's show a proof of this inequality, and then I will give you a short break, and then we will continue, okay? So this is the inequality. In order to prove it, we use Hoeffding's lemma, which I will not prove, but Hoeffding's lemma says that if we have a random variable in the [0, 1] interval and a positive parameter lambda, the expectation of the exponent of lambda times the centered variable is bounded by the exponent of lambda squared over eight. Now, the proof is a bit longer. I don't expect that within two minutes you can follow the whole proof, but I want to show you a couple of key points in it. So, the proof of Hoeffding's inequality. This is the inequality we want to prove; I've taken the deviation to the other side. We can multiply both sides by n, which gives us this expression. And now the first thing we do is apply Chernoff's bounding technique: for any parameter lambda larger than zero, we have that x is larger than y if and only if the exponent of lambda x is larger than the exponent of lambda y.
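The three ways of using the bound can be written down directly (a sketch, assuming Python; the one-sided relation exp(-2 n eps^2) = delta is solved for each parameter in turn):

```python
import math

def delta_from(n, eps):
    """Fix n and eps: bound on P(mu_hat underestimates mu by more than eps)."""
    return math.exp(-2 * n * eps ** 2)

def eps_from(n, delta):
    """Fix n and delta: achievable precision at confidence 1 - delta."""
    return math.sqrt(math.log(1 / delta) / (2 * n))

def n_from(eps, delta):
    """Fix eps and delta: sample size needed for the target guarantee."""
    return math.ceil(math.log(1 / delta) / (2 * eps ** 2))

# The example from the lecture: precision 0.01 with confidence 99%.
print(n_from(0.01, 0.01))  # about 23,026 patients
```

The three functions are just algebraic rearrangements of the same equality, so fixing any two of (n, eps, delta) determines the third.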
So I'm exponentiating both sides of the expression with the parameter lambda. And again, one holds if and only if the other holds, so this doesn't change the probability. Now I can apply Markov's inequality: this exponential is my random variable, so I'm looking at the expectation of the left-hand side, and the exponent of lambda n epsilon is my epsilon in the denominator; I just write it as a negative exponent here. And here I have an exponent of a summation, and an exponent of a summation is a product of exponents. So I have the expectation of the product of the exponents of lambda times these terms. So I've separated the summation over the random variables into a product over the random variables. And now comes the critical step, where we use independence. For independent random variables, the expectation of xy equals the expectation of x times the expectation of y. This holds for independent random variables; it does not necessarily hold for random variables that are dependent. So if I apply this, I can take the product outside of the expectation, and I get the product of the expectations of these terms. Now I can apply Hoeffding's lemma, which I had here, one by one to each of the factors, and I get an exponent depending on lambda. And then I can find the lambda that minimizes the bound: this lambda star equals 4 epsilon. If I substitute this lambda all the way through, I get the bound that I wanted. Again, you can sit at home and go through this one more time, but the critical point here is that we have used independence in order to prove this inequality. Any questions? Yeah, there is a question here. Okay, good. Would we get a tighter bound on the absolute value if we used Chebyshev's inequality, if we look at the probability of the absolute value deviating? I mean, because we had to use a very loose bound to bound the probability of z minus mu in absolute value exceeding epsilon.
But here, can we directly use Chebyshev's inequality? Is there something similar? You can apply Chebyshev, and then you will get Chebyshev's bound. But at least I don't see immediately why it would help. Yes, yeah. Well, you have to have this product property. Yeah, I understand that. If you are asking about uncorrelated variables: yes, if they are uncorrelated, you will get the same thing. If this product property holds, and it holds for independent variables but does not hold in general, then you can push the inequality through. No, I'm saying if it does not hold, but there's a bound on the pairwise covariance. There's an epsilon by which independence is violated, but that epsilon is given. You would have to go through the derivation and see what you get there. So there's no known result, at least that I can quote. Yeah, yeah. Okay, yes. When you try to derive a bound, what is usually the thought process by which you arrive at a result? What would be the exploration of your thoughts when you try to derive one? So, a general question: what would be the thought process? How do you get it? That's a difficult question. After you do many of them, you get some intuition for how to derive new ones, but I'm not sure I can formalize it. And I mean, you can derive things just for the fun of it, but usually there are certain deviations that you need to bound for different applications. And then, if you need to bound a certain deviation, you start staring at it and trying to bound it in one way or another. Okay, so it's based on experience then. Based on experience, based on the needs. Depending on what you need to control, you try to control that thing, and either you succeed, or you can sometimes prove that you cannot control certain things. That also happens, okay? Yeah. Any other questions? There is one more question down here.
So if we have a Hoeffding-like inequality for a random variable, can we deduce a Hoeffding-like inequality for a function of those random variables? It could be, for example, the entropy, or either a univariate property or a bivariate property like the KL divergence or something like that. So, again, this inequality only gives you bounds on how the empirical average deviates from the true average. If you want more, there are different tools for other things, but you have to study the problem you have, and the quantity you have to control, and see how you can control the quantity that you are interested in. Again, this inequality only controls the deviation of the empirical mean from the true mean. If you want other parameters, then you need to look at those other parameters and see how they deviate, and so on. Any other questions? Okay. One more important point here is that this lambda star is independent of the random variables. And, well, before letting you go on a break, I'll give you a question that you can try to solve. We have talked about the importance of independence; we have used it in Hoeffding's inequality, we have used it in Chebyshev's inequality. But we have tried different inequalities, getting better and better, so maybe we could have some other inequality that doesn't require independence and still provides concentration, okay? So here is a question: construct an example of dependent random variables, z_1 up to z_n, such that these random variables take value zero or one and they all have the same expectation mu (I didn't put it on the slide, but they all have expectation mu), but they are dependent. And you want to construct this set of dependent random variables such that the empirical mean of the random variables always, so with probability one, deviates from the true mean by at least one half. Is the question clear?
So, if we look at Hoeffding's inequality, it tells us that the probability that the empirical mean deviates from the true mean by more than epsilon decreases with the number of samples if we have independence. Now you are asked to construct an example where the distance between the empirical mean and the true mean does not decrease with n; it's always one half, no matter how many random variables you take, with probability one. Is the question clear to everyone, or does anyone have questions? You need to construct a distribution over {0, 1}^n where the variables are dependent, okay? So you need to construct dependent random variables. Yes. I'm not yet asking for an answer, yes. I'm not really from this background, but is it some kind of distribution where, when some event happens for one variable, the other ones become zero or something like that? You are on the right track, but with that construction, if all the others are zero, they will not have mean mu. They all have to have the same mean, and they may be dependent, okay? So I suggest I give you all a break of eight minutes, until half past ten. You can think about this, or you can just relax. At half past ten we continue, okay? So, in the interest of time, I'll just tell you an answer to this problem. Let's say that I take a bunch of coins, I tie them all together, and then I flip them together, okay? So I have n random variables. They all have the same distribution, they all have the same mean, but they are dependent: they all have the same value. Okay? So I have n coins, I have tied them together, and then I flip them all together. So the empirical mean will be either zero or one, no matter how many coins I have in this stack, okay? Assuming that the true mean is one half, the empirical mean here is either zero or one, while the true mean is one half.
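The tied-coins construction can be sketched in a few lines (assuming Python; mu = 0.5 as in the example, so each z_i is Bernoulli(1/2) but all are copies of one flip):

```python
import random

random.seed(2)

def tied_coins(n, mu=0.5):
    """n coins tied together: one Bernoulli(mu) flip decides all of them.
    Each z_i is marginally Bernoulli(mu), but they are fully dependent."""
    flip = 1 if random.random() < mu else 0
    return [flip] * n

for n in (10, 1000):
    zs = tied_coins(n)
    mu_hat = sum(zs) / n
    print(n, mu_hat, abs(mu_hat - 0.5))  # the distance is always exactly 0.5
```

No matter how large n gets, mu hat n is 0 or 1, so its distance from the true mean 1/2 never shrinks.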
So the distance to the mean is one half, no matter how many coins I take, okay? And what's the problem? The problem is that the coins that I add are not giving me any more information. They're all dependent, they don't add any information, so it doesn't matter how many I take, I'm staying at the same distance from the mean. When they are independent, then every flip of a coin gives me a little bit of information about the mean, and if I average over many coins, eventually I get close to the mean. But when they are dependent, they're not providing new information and the average is not going to the mean. So independence is crucial in order to get an estimate of the mean that eventually converges with the growing number of samples. Any questions? No questions? Okay. It sounds like there is some microphone somewhere working, no? Okay, maybe not. Okay, so now another important point. So far we have talked about, sort of, flipping a single coin multiple times and how the empirical mean of that coin converges to the true value. If we start doing selection, then we need to take a union bound, okay? And I will illustrate this. For the first illustration I'm taking a maybe slightly weird example: I have a bag of coins and they all have the same bias mu. So the expectation of each coin is mu. I take one coin out of that bag, I flip it n times, I get the empirical mean. I take another coin, flip it n times, get the second empirical mean, and so on; I try it k times. I get k different empirical means. So for any fixed i, for any coin, the probability that the empirical mean deviates from the true mean is bounded by Hoeffding's inequality. So we have that, with probability at least one minus delta, the empirical mean doesn't deviate from the true mean by more than the square root term, okay? Now let's say that I want to select the best coin. So I select i star, which is the argmin over i of the empirical means.
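The tied-coins construction is easy to check numerically. Here is a small Python sketch of my own (not from the lecture materials): one shared fair flip decides all n coins, so each coin individually is Bernoulli with mean one half, yet the empirical mean never gets closer to it.

```python
import random

def tied_coins_deviation(n, trials=1000):
    """Flip n fully dependent fair coins: a single shared flip decides
    all of them. Returns |mu_hat - mu| for each trial, where mu = 1/2."""
    deviations = []
    for _ in range(trials):
        flip = random.randint(0, 1)   # one flip shared by all n coins
        mu_hat = sum([flip] * n) / n  # empirical mean is either 0 or 1
        deviations.append(abs(mu_hat - 0.5))
    return deviations

# No matter how large n is, the deviation is exactly 1/2 with probability one.
assert all(d == 0.5 for d in tied_coins_deviation(n=10_000))
```

With independent coins the same experiment would show the deviation shrinking like one over square root of n; here it is stuck at one half.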
So out of these outcomes, I select the one that gave me the best result, the smallest empirical mean. Now if I look at the expectation of the empirical mean of i star, it will not be equal to mu, okay? So each mean individually is an unbiased estimate and it concentrates, but the empirical mean of the best coin is not an unbiased estimate of the true mean of that coin. If you repeat it too many times, you will, with high probability, have a coin that always came up zero, for example, and zero is not the mean. And since mu is not the mean of this mu hat i, sorry, it should have been mu hat n of i star, we cannot apply Hoeffding's inequality, because Hoeffding's inequality assumes that mu is the mean of the random variable in question. So if we want to bound the probability that the empirical mean of the best coin deviates from the true mean by more than something, we have to correct the inequality, and we have to add this k, which, as will become clear in a few steps, comes from a union bound. So instead of this log of one over delta, you have to put log of k over delta. Now the probability that the empirical mean of the best coin deviates from the true mean by more than this square root is bounded by the probability that there exists any i, any coin, for which the empirical mean deviates from the true mean by more than this square root. Okay, the probability that it happens for the best one is bounded by the probability that it happens for any of them. And now I'm taking a union bound, so the probability that there exists such a coin, for which the empirical mean deviates from the true mean by more than the square root, is bounded by the sum over the coins of the probabilities that it deviates for each particular coin by more than this square root. And now to this quantity I can apply Hoeffding's inequality, because I have decoupled the selection from the individual coins: each term now concerns one fixed coin.
So now I consider these coins individually again, and for any fixed one I have the inequality, but I have the inequality with k over delta instead of one over delta. So by Hoeffding I get that this probability is bounded by delta over k, because I have replaced one over delta by k over delta; so I have delta over k, and if I sum it over the k coins, I get delta. So I get that the probability that the mean of the best coin deviates from the true one by more than a square root, in which we have this k, which again comes from the union bound, is bounded by delta. So now I am in control of the deviations of the best coin from the true mean. Questions? If I am not taking too many coins, this log of k over delta is not going to be too large and I will be able to control this deviation. If I'm repeating it too many times, if log k becomes of the same order as n, then yeah, then I will lose control of the concentration. Which is also intuitive: if you repeat this experiment again and again and again, at some point you will, with quite high probability, say, get all zeros. Okay. Okay, now a little bit more interesting example, but of the same flavor. So now I have a bag of coins, let's say drugs or treatments, with different biases. There are some people talking and that's a little bit disturbing. So I have a bag of coins with different biases, and now I want to select the best treatment out of this bag of treatments. I don't know the quality of each treatment or each drug. I take each of these drugs, I try it n times, I get the empirical means, and now I want to select which is the best drug, okay? And again, for any fixed drug, the probability that the empirical mean deviates from the true mean by more than the square root is bounded by delta, okay? But now I take the best drug, the one that minimizes the empirical mean, which is, well, a natural thing to do.
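As a sanity check on the selection effect, here is a quick Python sketch of my own (the values of n, k, and delta are arbitrary choices): the minimum of k unbiased empirical means is biased downward, and the corrected Hoeffding radius uses log of k over delta in place of log of one over delta.

```python
import math
import random

def average_best_of_k(n=50, k=100, trials=500):
    """Flip k independent fair coins n times each and keep the smallest
    empirical mean. Each mu_hat_i is unbiased for mu = 1/2, but the
    minimum over k coins is biased downward -- the price of selection."""
    total = 0.0
    for _ in range(trials):
        mu_hats = [sum(random.randint(0, 1) for _ in range(n)) / n
                   for _ in range(k)]
        total += min(mu_hats)
    return total / trials

def selection_radius(n, k, delta):
    """Hoeffding radius after the union bound over k coins:
    sqrt(log(k / delta) / (2 n))."""
    return math.sqrt(math.log(k / delta) / (2 * n))

avg_best = average_best_of_k()              # clearly below mu = 0.5
radius = selection_radius(50, 100, delta=0.05)
```

With n = 50 and k = 100 the selected coin's average lands well below one half, but with probability at least one minus delta the deviation of the selected coin still stays within the corrected radius.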
Then again, the expectation, and I forgot an n again here, but the expectation of the empirical mean of the best one is not equal to the true mean of that one, because we have done selection based on the outcomes. It's the same logic as we had on the slide just before, when they were all identical, and we can't apply Hoeffding directly again, but we can do the same trick as before. So we can look at the probability that the empirical mean of the treatment that was selected deviates from the true mean by more than the square root, and again we have the k here; this is bounded by the probability that for any treatment there is this deviation, then we take the union bound, we apply Hoeffding's inequality, and this is bounded by delta. And again, this k that we have to put here instead of one, this is the price of doing the selection, and the bound is meaningful if the amount of selection, the number of different drugs that we're trying, is significantly smaller than the exponential of the number of samples that we have. If the number of drugs we are selecting from is exponential in the number of samples, then we are getting something that's larger than one here, and that's not very interesting, because, yeah, well, we know that probabilities are bounded by one anyway. Yes. So k is the number of coins that I select from, the number of experiments that you have done, yeah; the number of coins, or the number of drugs, yes, and that's the union bound that you have to take. You have to take the union bound over all the selections that you have done, okay? Any other questions? Good.
So, a mid-summary of what we have done so far. We have shown Hoeffding's inequality: for i.i.d. random variables in the zero one interval, the probability that the empirical mean underestimates the true mean by more than the square root is bounded by delta. Independence is crucial here, so if there is no independence, things don't work, and if you do selection, if you select from multiple experiments, then you have to take a union bound over the experiments that you did, okay? Good. I have many slides. I guess I'm not going to do all of them, but yeah, we will see. So now, well, we started with Markov's inequality, then Chebyshev, then Hoeffding. It was getting tighter and tighter and tighter. Can we get something that is even tighter than Hoeffding's inequality? Well, here is one example: the KL inequality. In order to define the KL inequality, I have to define the Kullback-Leibler divergence, or relative entropy, which is a distance measure between probability distributions P and Q, defined in this way: it's the expectation with respect to Z drawn according to P of the logarithm of P of Z over Q of Z. And it has certain properties: the distance between P and P is zero, it's convex, and it's asymmetric. And we define the binary KL function: if p and q are biases of Bernoulli random variables, then, well, this is the definition of the KL restricted to that case. Now, we're not going to do anything explicitly with the KL, so I'm going to simplify it later; if you see it for the first time, well, don't worry. We're moving to the next slide, where I will show simplifications of this KL that will be a bit easier to digest. And we have the KL inequality. The KL inequality tells us that if we have i.i.d. random variables in the zero one interval with mean mu, then for any parameter delta in zero one, the probability that the KL between the empirical mean and the true mean exceeds log of one over delta over n is bounded by delta, yes.
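For concreteness, the binary KL can be written in a few lines of Python (a sketch I'm adding for illustration; it requires 0 < q < 1 and uses the usual convention that 0 log 0 = 0):

```python
import math

def binary_kl(p, q):
    """kl(p || q) = p log(p / q) + (1 - p) log((1 - p) / (1 - q))
    for Bernoulli biases p in [0, 1] and q in (0, 1)."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

binary_kl(0.5, 0.5)   # 0.0: the divergence from a distribution to itself is zero
binary_kl(0.1, 0.5)   # positive, and not equal to binary_kl(0.5, 0.1): asymmetric
```

The three properties mentioned, zero at p = q, convexity, and asymmetry, are all easy to observe by evaluating this function on a few pairs.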
Mu is the expectation of the random variables. What are p and q in this case? You mean in the binary KL? Yeah. So in the binary KL, again, p and q are just numbers, biases of Bernoulli random variables, and this is the definition of the binary KL where p and q are numbers in zero one, okay? Yeah. And again, I mean, if you see it for the first time, don't worry, I'm going to simplify this on the next slide. I will make it more digestible on the next slide. So, yeah. So this is the KL inequality. And again, don't worry if you see this KL for the first time; it's not the most important part here, because we are going to relax it. So we have the KL inequality: the probability that this KL is smaller than log of one over delta over n is at least one minus delta. And we have Pinsker's inequality, which says that the KL is lower bounded by twice the squared difference between mu and mu hat, which means that I can take a square root of this and put it on the left side of the inequality, and I get, as a corollary of the KL inequality, that with probability at least one minus delta the distance between the true mean and the empirical mean does not exceed the square root. Okay? And if you recall Hoeffding's inequality, well, it's essentially the same bound, with the same one over two inside the square root. So the KL inequality is always at least as tight as Hoeffding's inequality, okay? Well, that's good, but yeah, sort of, why introduce complicated quantities if we get the same thing as before? So there is refined Pinsker's inequality, which says that for mu larger than mu hat we have another relaxation, and if we do this relaxation and a little bit of calculation, then from the KL inequality we can also get this other inequality, okay? And this is something relatively easy to digest, and more interesting.
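Pinsker's inequality, kl(p || q) >= 2 (p - q)^2, can be sanity-checked on a grid with a small self-contained Python sketch of my own:

```python
import math

def binary_kl(p, q):
    """Binary KL divergence, convention 0 log 0 = 0; requires 0 < q < 1."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

# Pinsker: kl(p || q) >= 2 (p - q)^2 for all Bernoulli biases p, q.
for i in range(101):
    for j in range(1, 100):
        p, q = i / 100, j / 100
        assert binary_kl(p, q) >= 2 * (p - q) ** 2 - 1e-12
```

Combining this lower bound with the KL inequality, kl(mu_hat || mu) <= log(1/delta) / n, immediately gives |mu - mu_hat| <= sqrt(log(1/delta) / (2n)), which is exactly the Hoeffding-style form.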
So what we get here is this: if we look at, say, this inequality, the distance between mu and mu hat decreases at the rate of the square root of log of one over delta over two n. Here, if mu hat n is close to zero, or if it is zero, we get that the distance decreases at the rate of one over n instead of one over square root of n, okay? So we get what are called fast convergence rates, rates of one over n rather than one over square root of n, and this may be significantly tighter than Hoeffding if the empirical mean is much smaller than one over eight, okay? So this is another inequality: instead of getting the same rate of convergence for any random variable in the zero one interval, it tells us that if the empirical mean happens to be close to zero, we are even more confident that it's close to the true mean, okay? So when the empirical mean is close to zero, we have even better bounds on being close to the true mean. So if you have a coin and you flip it 100 times and you always get zero, you are more sure that this empirical mean of zero is close to the true mean than if you take a coin, flip it 100 times, and get an average of one half. When you have an average of one half, there is quite high deviation to the left and to the right; sometimes you get zero, sometimes you get one. If everything is constantly zero, you have higher confidence that the true mean is also close to zero, okay? This is the message of this inequality. Questions? Yes, you up there. This fact feels a bit surprising to me. I would have thought that if you shift a random variable by a constant in either direction, then none of the concentration inequalities change. But that doesn't seem to be the case here, right? It's not a matter of shifting, it's a matter of the variance of this random variable. If you have a Bernoulli random variable with mean one half, it has high variance.
If you have a Bernoulli random variable with mean zero, which means it's always zero, then it has zero variance. And random variables with small variance have stronger concentration than random variables with high variance. Okay, that makes sense. Okay, thanks. Any other questions? Yes. Yes, the same, but yeah, you also have Pinsker's inequality in the other direction. So if the mean is close to zero, or if it's close to one, and we're talking about the zero one interval, we get better convergence. You don't see it from this inequality, but you have it from the KL: with a different derivation, you can also get it. If it's close to one, you also get faster convergence. Any other questions? Good. Yeah, and well, this was a relaxation of the KL inequality, and the KL inequality itself is even tighter than what you see in this relaxation, but the idea is the same. And well, this is just a slide, again: you can, in principle, directly invert the KL inequality, because it's convex, so you can do binary search and so on, and if you invert it directly you get something even tighter than what we have seen on the previous slide. But yeah, I will skip this one. Can we be even tighter than that? Well, the next step is Bernstein's inequality. Bernstein's inequality tells us that, again, we have i.i.d. random variables that are bounded by C, and we have the second moment or the variance, depending on which formulation you are using. And it tells us that the probability that, sorry, the empirical mean underestimates the true mean by more than the square root of twice the variance of the individual random variables over n, plus some extra additive term, is bounded by delta. So previously we were looking at the mean, and we were saying that if the mean is close to zero we have tighter concentration. Here we have that if the variance is close to zero, we are getting better concentration. But there is a bit of a challenge: the variance may be unknown.
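The direct inversion of the KL inequality can be sketched with a simple binary search (my own illustration; `kl_upper_inverse` is a hypothetical helper name). For mu >= mu_hat, kl(mu_hat || mu) is increasing in mu, so the tightest upper confidence bound is where it hits log(1/delta) / n:

```python
import math

def binary_kl(p, q):
    """Binary KL divergence, convention 0 log 0 = 0; requires 0 < q < 1."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def kl_upper_inverse(mu_hat, budget, tol=1e-10):
    """Largest mu >= mu_hat with kl(mu_hat || mu) <= budget, found by
    binary search (kl is convex and increasing in mu on [mu_hat, 1))."""
    lo, hi = mu_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_kl(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# 100 flips, all zeros, delta = 0.05: the KL upper bound decays like 1/n ...
n, delta = 100, 0.05
kl_upper = kl_upper_inverse(0.0, math.log(1 / delta) / n)   # about 0.03
# ... while the Hoeffding radius only decays like 1/sqrt(n):
hoeffding = math.sqrt(math.log(1 / delta) / (2 * n))        # about 0.12
```

This makes the fast-rate claim concrete: with an all-zeros sample the inverted KL bound is roughly four times tighter than Hoeffding at these parameter values.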
So we don't know the distribution, we don't know the variance, but this problem is in principle solvable. There is the empirical Bernstein inequality, where the idea is to bound the variance in terms of the empirical variance and then plug it back into Bernstein's inequality. And then you get this inequality, which, well, if you look at it and compare it with Bernstein's inequality, is essentially the same thing: the true variance is replaced by the empirical variance, and you have slightly worse constants on the additive term, but not by much. So you have seven over three instead of one. And you have this log of two over delta because you are taking a union bound: you have a bound on the variance and then you have a bound on the random variable, so you have to take a union bound, and then you get log of two over delta instead of log of one over delta. Okay, questions? No. And there is also an alternative approach known as the unexpected Bernstein inequality. There you can do a direct derivation based on the empirical second moment, instead of first bounding the variance and then bounding the random variable itself, and you get a somewhat complicated result that I will not go into. But the point is that the direct derivation often gives you a slightly tighter bound than this two-step procedure of first bounding the variance and then bounding the deviation. So how do these empirical and unexpected Bernstein inequalities compare to Hoeffding's inequality? So here, and I will take random variables in the minus one half to one half interval, so I will shift them a bit just to make the comparison easier. So this is Hoeffding's inequality and this is Bernstein's inequality. And for a random variable in this interval, the variance is bounded by one quarter. So in the worst case, from Bernstein's inequality we get this inequality, which is a little bit weaker than Hoeffding.
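To see the trade-off numerically, here is a sketch of my own using the shape described here, with the constants as stated in the lecture (7/3 on the additive term and log(2/delta) from the union bound); published versions differ slightly in their constants, so treat this as an illustration rather than a canonical formula:

```python
import math

def hoeffding_radius(n, delta):
    """Hoeffding for [0, 1]-valued i.i.d. variables:
    deviation <= sqrt(log(1/delta) / (2 n)) w.p. >= 1 - delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def empirical_bernstein_radius(var_hat, n, delta):
    """Empirical Bernstein radius in the form sketched in the lecture:
    the empirical variance var_hat replaces the true variance, the
    additive term carries a 7/3 constant, and the union bound over the
    variance and mean events gives log(2/delta)."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var_hat * log_term / n) + 7.0 * log_term / (3.0 * n)

n, delta = 1000, 0.05
small_var = empirical_bernstein_radius(0.01, n, delta)   # tiny empirical variance
worst_var = empirical_bernstein_radius(0.25, n, delta)   # worst case for [0, 1]
h = hoeffding_radius(n, delta)
# small_var < h < worst_var: much tighter when the variance is small,
# only slightly worse in the worst case.
```

This mirrors the comparison on the slide: the square-root term shrinks with the empirical variance, while the additive 1/n term is the fixed price you pay.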
So in the worst case, yeah; but if the variance is really small, then we get this one over n, we get fast convergence, and the empirical and unexpected Bernstein inequalities are significantly tighter than Hoeffding. And if the variance is approximately one quarter, then it's a little bit worse than Hoeffding, because we have this additive term. But in general, it's not much worse than Hoeffding, and sometimes it may be much better, if the variance is small. So it can exploit small variance. How does Bernstein compare with the KL? Well, we take the relaxation of the KL, which exploits a small mean, and we have Bernstein; I'm taking just the normal Bernstein, based on the variance. The variance, if we're talking about random variables in the zero one interval, is bounded by the mean. So the variance is smaller than the mean, and so we can get a similar-looking bound from Bernstein's inequality. So in principle, Bernstein's inequality is typically not much worse than the KL inequality, because the variance that's used here is smaller than the empirical mean or the mean. But for Bernoulli random variables, the additive terms are better in the KL: for Bernoulli, the variance is essentially determined by the mean, and the KL inequality is always tighter. But if there is a large probability mass inside the zero one interval, so if the variance is small but the mean is not small, then Bernstein's inequality is better, and maybe significantly better: when the mean is large but the variance is small. Yeah, and if you want the best of both, there is, for example, the split-KL inequality, where you can split the random variable and apply KL to each part, but yeah, this I will also skip. Okay, so a quick summary of these inequalities. We have Hoeffding's inequality, which is a zeroth-order inequality: it doesn't exploit any properties of the distribution, it's only based on the range of the random variable, and it has the slow rates.
We have the KL inequality, which is a first-order inequality, and it has fast rates if the empirical mean is small. And it's the best inequality if you have Bernoulli random variables; if you have Bernoulli, then work with the KL inequality. And we have the empirical and unexpected Bernstein inequalities, which are second-order concentration inequalities, and they give fast rates if the empirical variance is small, okay? Yeah. So I guess I'm getting close to the end of my time. I'll tell you very, very quickly what other things are possible, without, again, going into too much detail. So far we have talked about concentration inequalities for bounded random variables, random variables bounded in some interval. It can be any interval; we can rescale it to the zero one interval that we have mainly discussed. If you have unbounded random variables, you have to assume some other form of control on the distribution. One possibility is Hoeffding's inequality for sub-Gaussian random variables. So you assume that the tails of the distribution of the random variable decay at least as fast as those of a Gaussian distribution; this is one way to define a sub-Gaussian distribution, and you can see that this condition is very similar to Hoeffding's lemma, which was used in the proof of Hoeffding's inequality, and you can go back through the proof and get an inequality that is essentially the same as Hoeffding's inequality, with the variance factor. So if you have an unbounded random variable whose tails decay at least as fast as a Gaussian's, then you can also have something similar to Hoeffding's inequality. And the last thing: we have mentioned the importance of independence, and we have talked independence, independence, independence. If you don't have independence, everything breaks down; but there is one exception, where you may have a certain form of dependence and things still work, and that's martingales. So there is the Hoeffding-Azuma inequality for martingales.
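As a toy illustration of the martingale case (my own sketch, not the lecture's example): a walk whose step sizes depend on the past, but whose conditional mean is always zero, still concentrates, and the Hoeffding-Azuma tail bound applies to it.

```python
import math
import random

def azuma_tail(n, eps, c=1.0):
    """Hoeffding-Azuma: for a martingale with increments bounded by c,
    P(|S_n| >= eps) <= 2 exp(-eps^2 / (2 n c^2))."""
    return 2.0 * math.exp(-eps ** 2 / (2.0 * n * c ** 2))

def dependent_walk(n):
    """Martingale with dependent increments: after an up-step the next
    step is +/- 0.5, otherwise +/- 1, each sign with probability 1/2.
    The step size depends on the past, but the conditional mean is
    always 0 and every increment is bounded by 1."""
    s, last_up = 0.0, False
    for _ in range(n):
        size = 0.5 if last_up else 1.0
        step = size if random.random() < 0.5 else -size
        last_up = step > 0
        s += step
    return s

n, eps, trials = 1000, 100.0, 2000
freq = sum(abs(dependent_walk(n)) >= eps for _ in range(trials)) / trials
bound = azuma_tail(n, eps)   # 2 * exp(-5), roughly 0.013
```

In simulation the empirical frequency of large deviations stays below the Azuma bound, even though consecutive steps are not independent; only the conditional mean needs to be pinned down.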
So if you have a sequence of random variables where the conditional expectation of each random variable stays the same, then you have the same inequality as Hoeffding's inequality. And most of the others, the KL inequality, Bernstein's inequality, you can generalize to martingales. So if you have this sequential dependence, the distribution is not the same for every consecutive random variable, they may depend on each other, but the conditional mean stays the same, then you can have the same inequalities again. So, a summary of what I showed you today. We have shown a bunch of inequalities: Hoeffding's, which is zeroth-order and depends only on the range; the KL inequality, which is first-order, exploits a small mean, and is tight for Bernoulli random variables; the empirical and unexpected Bernstein inequalities, which are second-order and good if you have small variance. We touched upon unbounded random variables: there you have to assume some form of niceness of the distribution, for example sub-Gaussian tails, and then you can also have Hoeffding-like inequalities. We have talked about the importance of independence: if you have dependent random variables, things break down, except if you have martingales, so if you have sequential dependence where the conditional mean stays the same, then you can generalize everything that was mentioned here to martingales. And finally, if you are doing selection, remember to take union bounds. So if you are selecting out of different experiments, remember to take union bounds. And if you want to read more, I have another slide with a bunch of references, and, well, I guess my material will be uploaded, so you will see the references and you can read more there. Thank you.