Let us start our discussion. In the last class, we started discussing convergence of sums of random variables. We said S_n is the sum of n random variables, and we were interested in what value S_n / n converges to. We noticed that if I have a sequence of random variables which are IID, then S_n / n converges in probability to the expected value of the random variables, which is E[X_1]. Notice that what we are saying here is that the limiting random variable is a constant, and what is that constant? It is the mean value of the random variables. This we called the weak law of large numbers. We could also show that S_n / n converges in the almost sure sense, but we proved it only under the special case that the fourth moment is finite, and this version we called the strong law of large numbers.

These limiting results are very important in applications. Can somebody think of an example of why they should be useful? To predict the future? What future, and what distribution are you talking about? I am saying it is an IID sequence, so every time it has the same distribution; what is there to predict? What we are talking about is S_n / n, where S_n is the sum of the random variables and n is how many of them there are. He gave the example of fertility rate. Let us say we look at the number of children per woman, and we want to calculate the average number of children per woman. What can you do? You can take many samples: talk to many women, take a survey, and then average. If you have taken sufficiently many samples, the average is going to be close to that value. Another simple case could be the voting pattern. Everybody has a preference for one party or the other.
Let us call them party A and party B, and now you want to see with which probability each person is going to prefer each party. Let us say all of them have the same kind of preference: each person prefers party A with probability p and party B with probability 1 minus p. If this is the same across all the people, how am I going to calculate that value p? I will just ask everybody, okay, tell me your preference, party A or party B, take a large sample, and then just do the average; then I am going to get this value. You might have seen that after elections many people say this party is going to get this many seats and that party this many, and they say it is going to happen with a certain probability. How do they do it? Mostly they apply some variant of these results. And usually when they say this, they also say what sample size they used to make the claim. They say, okay, we talked to 100 people, and based on this our survey says that this party is going to win with this probability. Most of the time these surveys are believable if the sample size is large, that is, if they have got this input from more people. If they have just talked to 100 people and are concluding from that, we may not take it seriously. But if they have talked to maybe 10,000 or 1 lakh people, that is more believable, because our result says that this average converges to the mean, and the approximation is more accurate if n is large.

Similarly, take something as simple as deciding the success rate of a coin: what is the probability that when I toss a coin, it comes up heads? How are you going to find that?
Let us say I give you a coin that has some bias for coming up heads, and you want to compute the probability of heads. How are you going to do that? You just keep tossing it and see how many times heads comes up: map heads to 1 and tails to 0, count how many times you get 1, and then divide by n. The law of large numbers is telling you that as you take more and more samples, this should go close to the value p. So whenever we have parameter estimation of this kind to do, this is what we do.

Suppose all my random variables X_1, X_2, ... are IID and each one is exponentially distributed with parameter lambda. If I take many samples of them, let us say I do the experiment 10,000 times, get the samples and then take the average, what will this value be? Let us say I have an experiment whose outcome has an exponential distribution. I repeat it many times, each time independently of the previous ones. Whatever sample I get the first time, let me call it X_1; whatever sample I get the second time, X_2; and so on, until I have 10,000 samples. If I add all 10,000 samples and divide by 10,000, approximately what value do you expect? 1 by lambda. So suppose you know a priori that the distribution is exponential but you do not know the rate. Then you can do this experiment many times and take the average; that gives you a value close to 1 by lambda, and its reciprocal is an estimate of lambda.
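The two estimation examples above (the coin bias p and the exponential rate lambda) can be tried numerically. The following is a minimal Python sketch, not something from the lecture: the true values p_true and lam_true, the seed, and the helper name sample_mean are all assumptions chosen for illustration.

```python
import random

def sample_mean(draw, n):
    """Average of n independent draws produced by the zero-argument function draw."""
    return sum(draw() for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

p_true = 0.3     # hypothetical coin bias, treated as unknown by the estimator
lam_true = 2.0   # hypothetical exponential rate, treated as unknown by the estimator
n = 10_000       # number of samples, as in the lecture's example

# Coin: map heads to 1 and tails to 0, then average.
p_hat = sample_mean(lambda: 1 if rng.random() < p_true else 0, n)

# Exponential: the sample mean estimates 1/lambda, so its reciprocal estimates lambda.
mean_hat = sample_mean(lambda: rng.expovariate(lam_true), n)
lam_hat = 1.0 / mean_hat

print(f"estimated p        = {p_hat:.3f} (true {p_true})")
print(f"estimated 1/lambda = {mean_hat:.3f} (true {1 / lam_true})")
print(f"estimated lambda   = {lam_hat:.3f} (true {lam_true})")
```

Rerunning with a smaller n (say 100) should show visibly noisier estimates, which matches the remark that a survey of 100 people is less believable than one of 10,000.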
For example, you might have studied this in class 12: the lifetimes of atoms are exponentially distributed, but you do not know the rate at which they decay. Then you collect many samples, and by doing this you can estimate the rate lambda.

Today we are going to see one more important result on the convergence of sums of random variables, called the central limit theorem. First let me write the result and then we will discuss it. Let us say we have a sequence of random variables that are IID, with mean mu and variance sigma^2. I look at their sum S_n and then at the normalized sum (S_n - n mu) / sqrt(n). Why do I call it that? The mean of S_n is n mu, so by subtracting n mu I am centering it: S_n - n mu is a zero-mean random variable. Then I am normalizing it by sqrt(n). So this is a centered and normalized random variable, and the theorem says that whatever distribution your sequence has, as long as it is IID, this quantity converges in distribution to a normal distribution with mean zero and variance sigma^2.

What does this result say? Earlier, when I just took the average S_n / n, it always converged to a fixed constant. Here the limit is not a fixed constant; it is a distribution, which is Gaussian with parameters zero and sigma^2.

Let us try to understand why this result is true. To prove this result the characteristic function comes to our help, and we just use it here. I am going to prove the result assuming mu equals zero. If mu is not zero you can still conclude it in a straightforward way. If mu is zero, the quantity is just S_n / sqrt(n). If mu is not zero, I can write the numerator as the sum over i from 1 to n of (X_i - mu).
So I have just replaced S_n - n mu by the sum of the X_i - mu; this is by definition. Now I can treat each X_i - mu as a new random variable. What is the mean of this random variable? Zero. So what I am adding is n random variables, each with zero mean, and they still all have the same distribution. Because of that I can as well assume that mu is zero and proceed to prove the theorem.

So I just need to worry about S_n / sqrt(n), and let us find its characteristic function at a point u, which is the expectation of e to the power j u S_n / sqrt(n). Now, if you expand this, what are you going to get? S_n is a sum of n random variables which are independent, so I get the product over i from 1 to n of the expectation of e to the power j u X_i / sqrt(n). Next, I treat each factor as the characteristic function of X_i computed at the value u / sqrt(n); that is my argument here. And because the X_i are identically distributed, I can write the product as phi_{X_1}(u / sqrt(n)) raised to the power n. Notice that if random variables have identical distributions, their characteristic functions are necessarily the same, and that is the property we have used: all of them have the same characteristic function as X_1.

Now let us try to understand how this quantity behaves. How many of you know the Taylor expansion of a function? Let us say I have a function f(x) and I want to expand it around a point x_0.
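Written out, the factorization of the characteristic function used above looks as follows. The notation \phi_X(u) = E[e^{juX}] is assumed here, with j the imaginary unit as in the lecture, and mu is taken to be zero.

```latex
\begin{align*}
\phi_{S_n/\sqrt{n}}(u)
  &= E\!\left[e^{\,j u S_n/\sqrt{n}}\right]
   = E\!\left[\prod_{i=1}^{n} e^{\,j (u/\sqrt{n}) X_i}\right] \\
  &= \prod_{i=1}^{n} E\!\left[e^{\,j (u/\sqrt{n}) X_i}\right]
   && \text{(independence)} \\
  &= \prod_{i=1}^{n} \phi_{X_i}\!\left(\tfrac{u}{\sqrt{n}}\right)
   = \left[\phi_{X_1}\!\left(\tfrac{u}{\sqrt{n}}\right)\right]^{n}
   && \text{(identical distributions)}
\end{align*}
```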
In general I can write f(x) as f(x_0) plus the sum of the terms f^(i)(x_0) (x - x_0)^i / i!, and over what range does the index i run? 1 to infinity. But there is an alternate form of the Taylor expansion: instead of taking the infinite series, I look at a truncated version of it. I keep the terms up to i = n - 1, and the nth term is f^(n)(y_n) (x - x_0)^n / n!, with the derivative evaluated at some point y_n lying between x and x_0. So this is the truncated version of my Taylor series. You understand what I mean by the superscript (i) here: it is the i-th derivative of the function f.

Now, the Taylor expansion involves the first derivative, the second derivative and higher order derivatives. But we will be interested only in the first and second derivatives, because I only have first order and second order statistics here, the mean and the variance, and for those I need to care only about the first and second derivatives of the characteristic function.

So let us see what the value of phi_{X_1} is when the argument is 0. For any random variable X_1, what is this going to be? 1. And what is the first derivative of the characteristic function of X_1 at the point 0? If you differentiate the characteristic function and put the argument equal to 0, what do you get? Here you get 0. How do you compute the mean of a random variable from its characteristic function? I cannot hear you. You take the first derivative at 0, which gives j times the mean. So what exactly am I doing?
I took the first derivative at 0; in general that gives j times mu, and mu I have assumed to be 0 here. And if I take the second derivative and put 0, what am I going to get? The expectation of X_1 squared. In this case our mean is 0, so that is the variance. But with what sign? With a negative sign. Just check it: when you take the second derivative, because of the complex term j involved there you pick up j squared, so we get minus sigma^2. So for this characteristic function I have been able to express the mean and the variance in terms of its first and second derivatives, and I know these values.

So let us do the Taylor expansion of phi_{X_1} with order 2 (this is the order of the expansion, not the number of samples n), around the origin. What will the expansion be? phi_{X_1}(u / sqrt(n)) is phi_{X_1}(0), plus (u / sqrt(n)) times phi'_{X_1}(0), plus (u^2 / (2n)) times the second derivative. I am writing out the first-order term, but I could as well have skipped it. And the second derivative is evaluated not at 0 but at a value y_2, which lies between 0 and u / sqrt(n). Is this correct?

When I wrote the Taylor expansion, I assumed that my function f is real. But here my function phi could be real or it could be complex. When I have a complex function, wherever I have the derivative term, I have to write it as its real part plus j times its imaginary part, applying the expansion to the real part and to the imaginary part separately. So that is the replacement I make here, because my function phi is complex. Now, let us look at this.
For phi_{X_1}, the first derivative at 0 is 0 here, so both its real and imaginary parts are 0, and because of that the first-order term vanishes. What remains is the second-order part. So phi_{X_1}(u / sqrt(n)) is 1, since phi_{X_1}(0) is 1, plus u^2 / (2n) times the real part of phi''_{X_1} plus j times the imaginary part of phi''_{X_1}. Now, the value of phi''_{X_1} at 0 is minus sigma^2, so at 0 it has only a real part and no imaginary part. Because of that, I will ignore the imaginary part; it vanishes in the limit. If I substitute the value at 0, I get 1 plus u^2 / (2n) times minus sigma^2.

Oh, right, the second derivative should be evaluated at y_2, so let me correct this. y_2 lies between 0 and u / sqrt(n), as we said, and what multiplies u^2 / (2n) is the real part of phi''_{X_1}(y_2). And what I am interested in is this quantity raised to the power n, because that is my phi_{S_n / sqrt(n)}(u): it is [1 + (u^2 / (2n)) phi''_{X_1}(y_2)] raised to the power n.

Now look at this sequence. Are you familiar with what such a sequence converges to as n goes to infinity? e raised to the quantity that sits over n. But that quantity is itself changing with n.
If you let n go to infinity, this term itself is changing: y_2 goes to 0, and the value of phi''_{X_1} at 0 is minus sigma^2. Then, as n goes to infinity, where will this limit go? It is going to be e raised to minus u^2 sigma^2 / 2. You understand why I got minus sigma^2 here. Yesterday we discussed that if I have a sequence of the form (1 + z_n / n)^n where z_n converges to z, then this sequence converges to e to the power z. Now look at this part and think of it as z_n: u^2 / 2 stays as it is as n goes to infinity, and phi''_{X_1}(y_2) goes to phi''_{X_1}(0), whose value is exactly minus sigma^2. That is why we get minus u^2 sigma^2 / 2 in the exponent.

So this characteristic function has converged to the function e^(-u^2 sigma^2 / 2), and what is this function? It is the characteristic function of a normal distribution with mean 0 and variance sigma^2. And we know that when the characteristic functions converge like this, the random variables converge in distribution, so S_n / sqrt(n) converges in distribution to a normal with mean 0 and variance sigma^2. One more piece of terminology: in general, if the mean is 0 and the variance is 1, we call it the standard normal. In this case it is mean 0 and variance sigma^2. Is it clear?

Now, why is this central limit theorem important? In a way it says that if you aggregate a large number of samples, center them and normalize by sqrt(n), then you can treat that aggregate as approximately Gaussian distributed if your n is sufficiently large.
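The statement of the central limit theorem can be checked by simulation. The following is a rough Python sketch under assumed values, not part of the lecture: the exponential rate lam, the number of terms n, the number of trials, the seed, and the helper name normalized_sum are all chosen for illustration. It draws many centered and normalized sums of exponential samples and compares their mean, variance and one-sigma mass with those of a normal distribution with mean 0 and variance sigma^2.

```python
import math
import random
import statistics

def normalized_sum(rng, n, lam):
    """Return (S_n - n*mu) / sqrt(n) for n IID Exponential(lam) draws, where mu = 1/lam."""
    mu = 1.0 / lam
    s_n = sum(rng.expovariate(lam) for _ in range(n))
    return (s_n - n * mu) / math.sqrt(n)

rng = random.Random(1)   # fixed seed so the run is repeatable
lam = 2.0                # hypothetical rate: mean 1/lam, variance sigma^2 = 1/lam^2
sigma = 1.0 / lam
n = 400                  # number of terms in each sum
trials = 5_000           # number of independent normalized sums

z = [normalized_sum(rng, n, lam) for _ in range(trials)]

emp_mean = statistics.fmean(z)
emp_var = statistics.pvariance(z)
# For a normal with mean 0 and variance sigma^2, about 68.3% of the mass
# lies within one sigma of zero.
frac_within_sigma = sum(abs(v) <= sigma for v in z) / trials

print(f"mean            = {emp_mean:.4f} (limit: 0)")
print(f"variance        = {emp_var:.4f} (limit: {sigma ** 2})")
print(f"P(|Z| <= sigma) = {frac_within_sigma:.3f} (normal: about 0.683)")
```

Swapping rng.expovariate for another distribution with finite variance (and adjusting mu and sigma to match) should give the same qualitative picture, which is the point of the theorem.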
Whatever the X_i distributions are, I do not care: they could all be exponentially distributed, or all uniformly distributed. As long as they are IID with finite variance, if I look at their normalized aggregation and take sufficiently many samples, it behaves as if it is normally distributed. So in a way I already know how the distribution of the aggregate is going to behave with sufficiently many samples.

One last thing I want to cover on this topic is Jensen's inequality and the Chernoff bounds. Let us quickly discuss Jensen's inequality. How many of you know Jensen's inequality? What is it? The expectation of f(X) is greater than or equal to f of the expectation of X, where f is a convex function. What is a convex function, then? Define it. Let us take any lambda between 0 and 1 and two values x and y. A function f is convex if f(lambda x + (1 - lambda) y) is less than or equal to lambda f(x) + (1 - lambda) f(y). If a function satisfies this for all lambda in [0, 1] and all x, y in the domain of the function, then we call it a convex function.

How does a convex function look? Let us take two points, x here and y here. Where does lambda x + (1 - lambda) y lie, and where does lambda f(x) + (1 - lambda) f(y) lie? The latter is a linear combination of the values f(x) and f(y): f(x) is here, f(y) is here, and the combination lies on the line joining them. To be more careful, let us say this point is lambda x + (1 - lambda) y. What is my function value at this point, and what is lambda f(x) + (1 - lambda) f(y)?
It is going to be on that line. And what are we saying? The function value at this point is going to be no larger than the linear combination of the two function values. So for a convex function, it is always the case that the function value at the linear combination of the two points is less than or equal to the linear combination of the function values at those two points. Now the question is, if my function f looks like this, why is this true? We now understand how a convex function looks; that is the consequence of this definition.

There are other properties one can think of for a convex function. One is that the second derivative is non-negative for all x, when it exists. Why is this true? Take a convex function like this and look at its slope as you move from left to right. At this point the slope is negative; at this next point it is again negative, but less so; at this point the slope is 0; at this point it is positive; and at this point it is more positive. So what is happening? The slope itself is increasing. And if the slope is increasing, that means the double derivative is non-negative. That is one property.

Another property of a convex function is that at every point I can draw a tangent line that touches the function at that point and lies below it everywhere. Give me any point on this function, and I can draw a straight line that acts as a tangent to f there. From this property we can right away write Jensen's inequality. Let us see why. Take some points: let me take this x and this y, and this is my function f(x).
At the given point y I can draw a tangent, and that means this line is a lower bound on my function. What does this mean? If I call the line l(x), then f(x) is greater than or equal to l(x), and this is true for all x, because at any point this line lies below the function f. Now let us take expectations, and in particular I am interested in taking y to be the expectation of X. The expectation of X is going to be somewhere inside my domain, so let us take that point to be y here. At this point we know that the function f and the line l have the same value. What is l(x)? It is a line with some slope, so l(x) = a + b x. Now take the expectation of l(X): this is a + b E[X]. Is this true? I just applied expectation to both sides of f(X) >= a + b X, so E[f(X)] >= a + b E[X]. What does a + b E[X] mean? It is nothing but the linear function l computed at the point E[X]. And at that point, which is the y where I drew the tangent, the line has the same value as the function, so a + b E[X] is nothing but f(E[X]). That is why the expectation of f(X) is greater than or equal to f of the expectation of X. This is exactly what we stated as Jensen's inequality. Yes, l is the tangent.
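The tangent-line argument above can be summarized in three lines. The symbols a and b for the intercept and slope of the tangent l are the ones used in the lecture; the layout below is only one way of writing the steps down.

```latex
\begin{align*}
&\text{Tangent at } y = E[X]: && l(x) = a + b\,x, \qquad l(E[X]) = f(E[X]). \\
&\text{Convexity:} && f(x) \ge l(x) = a + b\,x \quad \text{for all } x. \\
&\text{Take expectations:} && E[f(X)] \ge E[a + bX] = a + b\,E[X] = l(E[X]) = f(E[X]).
\end{align*}
```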
I have chosen this tangent at a particular point, which is the expectation of X. Give me any point on this curve; the convexity property tells me that at this point I can draw a tangent. What I did is choose the point y to be the expectation of X. At that point I have a linear function like this, and I know that this linear function always lower bounds my function f(x); that is the first property I used. Then I used the fact that f and l have the same value at the expectation of X. That is why Jensen's inequality holds.

Because of this you can derive some properties very quickly. For example, I want to find the relation between the expectation of X^2 and the square of the expectation of X: which is bigger? The expectation of X^2 is bigger. Why? Can you use Jensen's inequality? What function are you going to use here? x^2. So I take f(x) = x^2, which is convex; then E[f(X)] is E[X^2] and f(E[X]) is (E[X])^2, and Jensen's inequality gives E[X^2] >= (E[X])^2. And as you said, we already know this is true, because the variance is a non-negative quantity and the variance is E[X^2] minus (E[X])^2. So Jensen's inequality comes in handy to prove many inequalities of this sort. Some of them are already there in your exercises or assignment; try to solve them.
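As a quick numerical illustration of Jensen's inequality, here is a small Python sketch. It is an assumption-laden example rather than anything from the lecture: the Uniform(-1, 3) sample, the seed, and the helper names mean and jensen_gap are all hypothetical. It computes E[f(X)] - f(E[X]) under the empirical distribution of a sample for a few convex functions; for f(x) = x^2 this gap is exactly the variance of the sample.

```python
import math
import random

rng = random.Random(2)  # fixed seed so the run is repeatable
# Hypothetical sample: 10,000 draws from Uniform(-1, 3); any distribution would do.
xs = [rng.uniform(-1.0, 3.0) for _ in range(10_000)]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def jensen_gap(f, xs):
    """E[f(X)] - f(E[X]) under the empirical distribution of xs.

    Jensen's inequality says this is non-negative whenever f is convex.
    """
    return mean([f(x) for x in xs]) - f(mean(xs))

gap_square = jensen_gap(lambda x: x * x, xs)  # equals the variance of the sample
gap_exp = jensen_gap(math.exp, xs)            # exp is convex
gap_abs = jensen_gap(abs, xs)                 # |x| is convex

print(f"E[X^2] - (E[X])^2    = {gap_square:.4f}")
print(f"E[exp(X)] - exp(E[X]) = {gap_exp:.4f}")
print(f"E[|X|] - |E[X]|       = {gap_abs:.4f}")
```

All three gaps should come out non-negative; replacing f by a concave function such as the negative of x^2 should flip the sign.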