So, the next thing we are going to study is Chernoff bounds. We will just do a quick pass on them; we will not go much deeper. Henceforth I am assuming an IID sequence X_1, X_2, ...; if I am not assuming that, I will make special mention of it, so by default the sequence is IID. Let S_n = X_1 + ... + X_n be the sum of the n random variables, and look at their average S_n/n. I want to ask the question: what is the probability that S_n/n ≥ a, for large n?

Now, what do I know about S_n/n from the law of large numbers? S_n/n converges to the mean μ = E[X_1], in probability and also in the almost sure sense. Suppose a > μ. What do you think should happen as n → ∞? As n → ∞, S_n/n goes to μ, so asking it to take a value beyond a, which is larger than μ, is asking for something that stops happening in the limit: P(S_n/n ≥ a) → 0. And if a < μ, what happens? The probability goes to 1, because S_n/n is going to take values around μ, which is already greater than a. So picture the real line with μ marked. In the first case a sits to the right of μ: the average essentially never lands beyond a, because it keeps hitting near μ. In the second case a sits to the left of μ: the average exceeds a with probability going to 1, since it is going to take values near μ, which is already greater than a.

Now take a different question: what does P((S_n - nμ)/√n ≥ c) go to as n → ∞? From the CLT, the Central Limit Theorem, (S_n - nμ)/√n approaches a Gaussian distribution; for n sufficiently large we are already close to a Gaussian with mean 0 and variance σ². So we are asking for this Gaussian to be greater than or equal to c, which is nothing but the complement of the CDF of the Gaussian with mean 0 and variance σ², evaluated at c. It settles at some nonzero value. Let us reorganize the event. If you reorganize, dividing throughout by √n, the event (S_n - nμ)/√n ≥ c becomes S_n/n ≥ μ + c/√n. Check this; can you quickly verify it is indeed correct? So, put alternatively, the Central Limit Theorem is answering the question: as n → ∞, what is the probability that S_n/n ≥ μ + c/√n, for some c? We know S_n/n converges to the mean as n → ∞, and now we are asking it to be away from the mean by the quantity c/√n, which is itself diminishing to 0 as n grows, so the threshold μ + c/√n is itself going to μ. Questions of this type, where S_n/n is away from its mean value by such a vanishing quantity, are usually called normal deviations.
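To see the rearrangement in one line, here is the normal-deviations event written out as a small worked step (a sketch using only the quantities defined above; Φ denotes the standard normal CDF):

```latex
\[
\Pr\!\left(\frac{S_n - n\mu}{\sqrt{n}} \ge c\right)
= \Pr\!\left(\frac{S_n}{n} \ge \mu + \frac{c}{\sqrt{n}}\right)
\;\xrightarrow[\;n\to\infty\;]{\text{CLT}}\;
1 - \Phi\!\left(\frac{c}{\sigma}\right),
\qquad \sigma^2 = \operatorname{Var}(X_1).
\]
```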
Because we already know that S_n/n is anyway going to converge to μ, there we are only asking about S_n/n being away from μ by a small, shrinking neighborhood. Whereas in questions of the form P(S_n/n ≥ a), the threshold a is arbitrary; all we are saying is that a is some value greater than μ, and it could be far from μ. In the normal-deviations question, by contrast, you can think of the threshold as a sequence a_n = μ + c/√n, and for n sufficiently large this a_n is very close to μ. So the Central Limit Theorem is basically answering the normal-deviations question, while questions about P(S_n/n ≥ a) for a fixed a > μ are usually referred to as large deviations. And both large deviations and normal deviations try to answer the question: I know this probability goes to 0 as n → ∞, but at what rate does it go to 0? Basically, I want to see how this quantity behaves for a finite n. Let us see whether we are able to get a bound on it; make sure you understand the difference between the two kinds of deviation.

Now we come to the Chernoff bound, which gives a bound on such large deviations. These bounds are very important when you are doing analysis of machine learning algorithms. In machine learning you will often have large samples of data from which you are trying to estimate some parameters, and often those parameters are related to mean values of random variables. Say you have n samples and take their average: if n is sufficiently large, you understand that this will be close to the correct value of the parameter, the mean value. But if you do not have enough samples, enough data points, what is the probability that you are away from that mean value? For example, say you have a lot of data points from weather records, the daily recorded temperatures of a city: on this day the temperature was this, on that day it was that, collected over the first 10 or 20 years. You take those values and average them to estimate the average temperature of the city. Fine, you know that the average of a finite number of samples will not give the exact value; but the question is, with what probability is it away from the true mean by more than some margin? So take a = μ + ε, where ε is some positive quantity, and ask whether the estimate is more than ε above the true value. The Chernoff bound answers with what probability that happens, and you will see that this probability actually falls exponentially in the number of samples; the small simulation below illustrates the decay.
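Before the derivation, here is a minimal Monte Carlo sketch of the phenomenon (our own illustration, not from the lecture; the choice of Bernoulli(1/2) samples, ε = 0.1, and the trial count are all arbitrary):

```python
import numpy as np

# Estimate P(S_n/n >= mu + eps) by Monte Carlo for IID Bernoulli(1/2)
# samples, for a few values of n, to see the roughly exponential decay.
rng = np.random.default_rng(0)
mu, eps, trials = 0.5, 0.1, 200_000

for n in [10, 50, 100, 200]:
    # each draw of binomial(n, mu) is one realization of the sum S_n
    averages = rng.binomial(n, mu, size=trials) / n
    p_hat = np.mean(averages >= mu + eps)
    print(f"n = {n:3d}:  P(S_n/n >= {mu + eps:.1f}) ~= {p_hat:.1e}")
```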
So, what we want is to bound P(S_n/n ≥ a). How do we bound it? First simplify the event: S_n/n ≥ a is the same as X_1 + X_2 + ... + X_n - na ≥ 0; I just took na to the other side and replaced S_n by the sum. Now take a parameter θ that is strictly positive and multiply both sides by it; since θ is strictly positive, the ordering does not change.

Now let us exponentiate both sides: the event becomes e^{θ(X_1 + ... + X_n - na)} ≥ 1. Why did I exponentiate? Any guess? Think of which result we want to apply here: Markov's inequality. To apply Markov's inequality, what do we need the random variable to be? Non-negative. And once I have exponentiated, the left side is a non-negative random variable; I can treat the whole thing as a single random variable and ask for it to be greater than or equal to 1. So by Markov's inequality, P(e^{θ(S_n - na)} ≥ 1) ≤ E[e^{θ(S_n - na)}], divided by 1 on the right; I will just leave it like this.

Now this expectation can be written as E[e^{θ(S_n - na)}] = (∏_{i=1}^n E[e^{θX_i}]) e^{-θna}. Why could I write it as a product like this? Because the X_i are independent random variables. And why did we multiply by θ in the first place? Because we want to optimize over θ later; right now all I have said is θ > 0, so what we get is some upper bound, and it holds for any such θ. Since the X_i are also identically distributed, this can be further written as (E[e^{θX_1}])^n e^{-θna}. The quantity E[e^{θX_1}] is called the moment generating function of X_1. What is the difference between the moment generating function and the characteristic function? There is just no i in the exponent here. It is often denoted M(θ).

Putting all of this together: P(S_n/n ≥ a) ≤ M(θ)^n e^{-θna}. Now I will write the right side in an exponential format: it equals e^{n log M(θ) - θna} = e^{n(log M(θ) - θa)}, where the log is to base e. Is this correct? Check it. The whole chain is summarized below.
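Collecting the steps (a worked summary of the derivation above, using only quantities already defined):

```latex
\begin{align*}
\Pr\!\left(\tfrac{S_n}{n} \ge a\right)
&= \Pr\!\left(e^{\theta(X_1+\cdots+X_n-na)} \ge 1\right)
   && (\theta > 0,\ \text{exp is increasing}) \\
&\le \mathbb{E}\!\left[e^{\theta(X_1+\cdots+X_n-na)}\right]
   && \text{(Markov's inequality)} \\
&= \left(\mathbb{E}\!\left[e^{\theta X_1}\right]\right)^{n} e^{-\theta n a}
   && \text{(IID)} \\
&= e^{\,n\left(\log M(\theta) - \theta a\right)}.
\end{align*}
```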
Now this bound is true for any θ > 0; I just chose θ arbitrarily, strictly positive. I want to find the θ that makes the bound smallest. How can I do that? By minimizing the exponent over θ. If instead of a fixed θ I write min over θ of e^{n(log M(θ) - θa)}, do you think that is still an upper bound on the probability? It is, because the bound holds anyway for every θ; all I am doing is looking for the θ that makes it smallest. And is it true that I can take the min into the exponent? Yes, because the exponential is increasing, so min_θ e^{n(log M(θ) - θa)} = e^{n min_θ (log M(θ) - θa)}.

So the optimization problem boils down to minimizing this quantity in the exponent. I can pull the n outside as well, since n does not depend on θ. I will also take the minus sign outside: minimizing log M(θ) - θa is the same as the negative of maximizing θa - log M(θ); the min becomes a max when the minus comes out. This maximum is often denoted L(a): L(a) = max_{θ>0} (θa - log M(θ)).

Now let us understand this. The function log M(θ) is in general convex in θ; you can verify this by taking the second derivative, which turns out to be greater than or equal to 0, so it is a convex function. And what is the negative of a convex function? A concave function. The term θa is linear in θ. If you add a linear function to a convex function it remains convex, and if you add a linear function to a concave function it remains concave; you will recognize this if you have done an optimization course. So θa - log M(θ) is a concave function plus a linear function, hence concave, and we are maximizing a concave function, so this optimization is well defined.

So finally what we have ended up showing is: P(S_n/n ≥ a) ≤ e^{-nL(a)}, where L(a) is the quantity defined above. It is some constant; whatever it is, it depends on the distribution of the random variable through the moment generating function, but it is a constant. So we have bounded this probability in this manner, and how is the bound decaying? Exponentially: as you increase n, it decays like e^{-n × constant}. So if you take the threshold a to be strictly greater than μ, the deviation of the average beyond a has probability decaying exponentially in n, in exactly this form. This L(a) is often known as the rate function. A small numerical sketch of L(a) follows.
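As a quick illustration (our own sketch, not from the lecture): for Bernoulli random variables the moment generating function is explicit, so L(a) can be computed both numerically and in closed form. The parameter values and function names below are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: rate function L(a) = max_theta [theta*a - log M(theta)]
# for X_i ~ Bernoulli(p), where M(theta) = 1 - p + p*e^theta.
p, a = 0.5, 0.6  # illustrative values with a > mu = p

def neg_objective(theta):
    # negative of theta*a - log M(theta); minimizing this maximizes the original
    return -(theta * a - np.log(1 - p + p * np.exp(theta)))

L_numeric = -minimize_scalar(neg_objective).fun

# For Bernoulli, L(a) also has a closed form (a KL divergence):
L_closed = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

print(L_numeric, L_closed)      # both ~ 0.0201
# Chernoff bound: P(S_n/n >= a) <= exp(-n * L(a)); e.g. for n = 100:
print(np.exp(-100 * L_closed))  # ~ 0.134
```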
Now take log on both sides and normalize: (1/n) log P(S_n/n ≥ a) ≤ -L(a), or, taking minus on both sides, -(1/n) log P(S_n/n ≥ a) ≥ L(a). So what we have is an upper bound on this probability, and it is called the Chernoff bound. It so happens that the Chernoff bound is actually tight: you can end up showing a similar result for the lower bound, and that is Cramér's theorem. Cramér's theorem says: if a > μ, then for every ε > 0 there exists n_ε such that for all n ≥ n_ε, P(S_n/n ≥ a) ≥ e^{-n(L(a) + ε)}.

Notice the difference between the two results. The upper bound here is true for any n; I made no assumption about the value of n. The lower bound given by Cramér looks almost the same, but there is an extra ε in the exponent, and it holds only for n large enough, for all n ≥ n_ε; it is true only after some point.

Now combine the two results: I have an upper bound and a lower bound, e^{-n(L(a)+ε)} ≤ P(S_n/n ≥ a) ≤ e^{-nL(a)} for all n ≥ n_ε. And this is true for any ε > 0, so I can make ε arbitrarily small. Because of that, if you look at (1/n) log P(S_n/n ≥ a) and take the limit as n → ∞, the upper bound says the limit is at most -L(a), and the lower bound says it is at least -(L(a) + ε) for every ε, so you can finally conclude that lim_{n→∞} (1/n) log P(S_n/n ≥ a) = -L(a). So what is this? The normalized exponential rate at which this deviation probability decays is the constant -L(a): the probability behaves like e^{-nL(a)} to first order in the exponent, and the 1/n is there because we are normalizing; only then are we left with -L(a).

These results are very basic and very fundamental, and they are very important: people have built much more advanced results on these ideas, which are heavily used in areas like machine learning. If you want to understand machine learning deeply, how to analyze algorithms and how your algorithm performs on a given data set, you need results of this flavor or much more advanced versions of them. This was just to give a flavor of how, from basic random variables and basic random processes, we can derive such results. Okay, let us stop here; from next class onwards we are going to start Markov chains.