Now, most of the time when we start looking into system models, we may not be able to compute exact probabilities, but we are happy if we can get some bound on them. This often happens: if I give you a very complex function to integrate, you may not be able to evaluate the integral exactly, but if I just ask for a bound on it, that may be much easier. Similarly, when dealing with complex systems you may only want to know the worst case, and then bounds are enough. So we will start looking into some bounds on probabilities, and the first one is called Markov's inequality. What Markov's inequality says is: suppose X is a non-negative random variable and you take any value a > 0. Then the probability that X takes a value at least a is upper bounded by the expectation of X divided by a, that is, P(X >= a) <= E[X]/a. So if you know the expected value of your random variable X, you can easily compute this bound. Picture it this way: say X takes values in some interval [m, n] and its mean lies somewhere inside. You are asking for the probability that X lands beyond some point a in this range, that is, P(X >= a), and the claim is that this probability is at most E[X]/a. This is true only for non-negative random variables. Naturally, the bound is useful only when a is greater than E[X]. If a is less than E[X], what is going to happen?
The ratio becomes greater than 1, which is a trivial upper bound, since we already know a probability cannot exceed 1. So this is useful only when a is greater than E[X]. It looks very simple, and the proof is also simple, but it becomes a building block for much of advanced probability, where many different varieties of bounds are derived. Ultimately they all emerge from the ideas used in Markov's inequality and in something called Chebyshev's inequality, which we are going to see on the next slide. Now let us see how the proof goes. A random variable X, I can split into two parts: one part where X < a and another part where X >= a. You understand the meaning of this splitting: since expectation is a linear operator, E[X] is the sum of the expectation restricted to X < a and the expectation restricted to X >= a. Now, X being non-negative, can the first term ever be lower than 0? No, it has to be at least 0, so I lower bound it by 0. And on the event X >= a, the value of X is at least a, so the second term is at least a times P(X >= a). Putting these together, E[X] >= a * P(X >= a), and taking a to the denominator gives P(X >= a) <= E[X]/a, which is exactly the claim. Now, suppose you know something more: instead of just the mean, you also know the variance. Then we can say something more, and that is called Chebyshev's inequality.
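The bound we just proved can be checked numerically. Below is a minimal Monte Carlo sketch: the exponential distribution with mean 2 and the thresholds a = 3, 5, 10 are illustrative choices, not values from the lecture.

```python
import random

# Monte Carlo check of Markov's inequality for a non-negative random
# variable: P(X >= a) <= E[X] / a.
# Here X ~ Exponential with mean 2 (rate 0.5), an illustrative choice.
random.seed(0)
n = 100_000
samples = [random.expovariate(0.5) for _ in range(n)]  # mean = 1/0.5 = 2

mean_x = sum(samples) / n
for a in [3, 5, 10]:
    empirical = sum(1 for x in samples if x >= a) / n  # empirical P(X >= a)
    bound = mean_x / a                                 # Markov bound E[X]/a
    print(f"a={a}: P(X>=a) ~ {empirical:.4f} <= bound {bound:.4f}")
    assert empirical <= bound
```

Notice how loose the bound is (for a = 10 the true probability is about e^-5 ≈ 0.007 while the bound is 0.2); Markov only uses the mean, which is exactly why it is so widely applicable.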
What Chebyshev's inequality says is: if you have a random variable X, and notice that X here need not be non-negative, it can be any random variable with finite variance, and you are given some value d > 0, then the probability that the absolute deviation of X from its mean is at least d is upper bounded by the variance of X divided by d squared: P(|X - E[X]| >= d) <= Var(X)/d^2. We can call |X - E[X]| the deviation from the mean, so we are asking: what is the probability that this deviation is at least d? Picture X taking values in some range with its expectation somewhere inside. What is an alternate way of writing the event |X - E[X]| >= d? Take the case where the quantity inside the absolute value is positive: then the event is X - E[X] >= d, that is, X >= E[X] + d. If it is negative, the event is X - E[X] <= -d, that is, X <= E[X] - d. Combining these two, |X - E[X]| >= d exactly means X >= E[X] + d or X <= E[X] - d. So mark the two points E[X] + d and E[X] - d, each at distance d from the mean, one on each side.
What you are asking is that X does not lie in the d-neighborhood of your mean. The region within distance d of the mean is the d-neighborhood; you are asking for the probability that X takes a value outside it, and that is what this bound controls. Is this clear? Now, why is this bound true? It again simply follows from Markov's inequality. I start from the quantity I am interested in, P(|X - E[X]| >= d). What I do is square both sides of the inequality inside the probability. Are the two probabilities, P(|X - E[X]| >= d) and P((X - E[X])^2 >= d^2), going to remain the same? They are. Notice that because of the absolute value the left side is non-negative, and d is assumed positive, so squaring preserves the comparison: the set of omegas satisfying one condition is exactly the set satisfying the other. The event remains the same in both cases, so the two probabilities are equal; if you are not sure about it, please check this. Once I have this, I go back to Markov's inequality. I treat (X - E[X])^2 as a new random variable, which is anyway non-negative because of the square; let us call this entire thing Y. Now I am asking: what is the probability that Y >= d^2, with Y non-negative? By Markov's inequality, this is nothing but E[Y]/d^2. And what is E[Y]? By definition it is E[(X - E[X])^2], and the expectation of the squared deviation of X from its mean is exactly the variance. That is why we get Var(X)/d^2.
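The same kind of numerical check works for Chebyshev. A minimal sketch follows; the Normal(10, 2) distribution and the values of d are illustrative assumptions, chosen only so the variance and the deviations are easy to eyeball.

```python
import random

# Monte Carlo check of Chebyshev's inequality: for any X with finite
# variance and any d > 0,  P(|X - E[X]| >= d) <= Var(X) / d**2.
# Here X ~ Normal(mean=10, std=2), an illustrative choice, so Var(X) ~ 4.
random.seed(1)
n = 100_000
samples = [random.gauss(10, 2) for _ in range(n)]

mu = sum(samples) / n
var = sum((x - mu) ** 2 for x in samples) / n

for d in [3, 4, 6]:
    empirical = sum(1 for x in samples if abs(x - mu) >= d) / n
    bound = var / d ** 2
    print(f"d={d}: P(|X-mu|>=d) ~ {empirical:.4f} <= bound {bound:.4f}")
    assert empirical <= bound
```

Again the bound is loose for a normal distribution, but Chebyshev holds for any distribution with this variance, which is its whole point.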
So, as I said earlier, Chebyshev bounds the deviation of your random variable from its mean value, while Markov bounds the probability of your random variable taking a value larger than some threshold. And we just obtained Chebyshev from Markov itself. You can obtain many more advanced inequalities, often called concentration inequalities, building on them. Now let us quickly look at one example of how this could be useful. Consider the factory output example here. A factory produces a certain number of items every day; the output depends on how much raw material is available that day, so the number produced varies. Let us say we know that on average the factory produces 500 items every week. Now I ask: what is the probability that the number of items produced in a week is at least 1000? So the mean was 500 every week, and I am asking about producing more than 1000. Can Markov's inequality come to our help in this case? We are asking for P(X >= 1000), which by Markov is at most E[X]/1000 = 500/1000 = 0.5. So we know that the probability that more than 1000 units will be produced is less than half. The next question: suppose we also know that the variance of the weekly output is about 100. What is variance? It captures the deviation about the mean: by definition, Var(X) = E[(X - E[X])^2], so it is again a kind of squared deviation from the mean.
While the mean gives the central value, the variance captures the variation around the mean. Suppose the variance is known to be 100. Now, what is the probability that the production this week is going to be between 400 and 600? This is a more refined question. Since I know the mean value is 500, I can pose it like this: asking for between 400 and 600 means the deviation from the mean, on both the left side and the right side, is less than 100. So I am asking for P(|X - 500| < 100). When we have such a deviation, what immediately strikes the mind is Chebyshev's inequality, since it gives bounds on exactly such deviations. But Chebyshev bounds the probability of the deviation being at least 100, so let us compute that first: Var(X)/d^2 with variance 100 and d also 100 gives 100/100^2 = 1/100. But how do we get the value we want from this? Just take the complement. Chebyshev bounds the probability of falling outside the range, and subtracting that from 1 gives the probability of falling inside: P(|X - 500| < 100) >= 1 - 1/100 = 99/100. What this says is that, under the given randomness, with at least 99 percent chance the output will lie within the 400 to 600 range.
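The whole factory calculation fits in a few lines; this sketch just replays the numbers from the lecture (mean 500, variance 100).

```python
# The factory example worked out numerically.  Weekly output X has
# E[X] = 500 and Var(X) = 100, as given in the lecture.

mean_x, var_x = 500, 100

# Markov: P(X >= 1000) <= E[X] / 1000
markov_bound = mean_x / 1000          # = 0.5

# Chebyshev with d = 100: P(|X - 500| >= 100) <= Var(X) / 100**2
cheb_bound = var_x / 100 ** 2         # = 1/100 = 0.01

# Complement: P(400 < X < 600) >= 1 - 1/100
prob_in_range = 1 - cheb_bound        # = 0.99

print(f"P(X >= 1000) <= {markov_bound}")
print(f"P(400 < X < 600) >= {prob_in_range}")
```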
So through this example you see how Markov's and Chebyshev's inequalities can come to our help. This is a very toy example; when you analyze a more complex system, you first need to appropriately define what the random variable is, define what deviation you are talking about, and then apply these results. Next: limit theorems. These are some of the fundamental results in probability. The first of them is called the law of large numbers. What the law of large numbers says is: if you have a sequence of i.i.d. random variables X1, X2, ... with common mean mu, and I denote the running sum Sn = X1 + ... + Xn, then the running average Sn/n converges, with probability 1, to the mean mu. Notice that E[Xi] is exactly mu for each i, because all of them have the same mean. For now, ignore what "with probability 1" means precisely; this mode of convergence, called almost sure convergence, will be made precise in the IE 621 course. All of you know the limit of a sequence: when we say a sequence converges to A, both the sequence elements and A are deterministic quantities. But is Sn/n a deterministic quantity? Sn is a sum of n random variables, so it is a random quantity. What the law of large numbers says is that even though Sn/n is random, its limit is a deterministic constant. In a way, if you add a lot of i.i.d. random variables and take their average, the average is essentially going to approach a constant.
And you will see that this is one of the fundamental theorems connecting probability and statistics. The expected value mu is a parameter of our distribution, and the theorem connects this parameter with the data we are going to observe, that data being the sequence of random variables. So what it connects is your data with the underlying parameters, and that is why it is a crucial link for parameter estimation, which we will see is a critical part of statistics. Now, to understand the law of large numbers intuitively: suppose you are interested in an event E, and in your experiment, if E happens you assign the value 1 to a random variable, otherwise 0. You keep repeating the experiment again and again, and for the i-th trial you call the random variable Xi, taking values 1 and 0. Are the Xi identically distributed? Yes, because it is the same trial repeated. And they are independent anyway, because the repetitions have no influence on each other. Now, if you form the sum of these random variables and look at Sn/n, the law of large numbers says Sn/n converges to E[X1]. But what is E[Xi] in this case? It is exactly the probability that event E occurs. So what you are counting is how many times the event occurred when you repeated the experiment multiple times, and the fraction of times that event E occurs converges to its probability in i.i.d. trials.
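The indicator-variable intuition can be simulated directly. In this sketch the event E is "a fair die shows 6", so P(E) = 1/6; the die is an illustrative choice, not from the lecture.

```python
import random

# Law of large numbers with indicator variables: repeat a trial, set
# X_i = 1 when event E occurs and 0 otherwise.  The running average
# S_n / n then approaches P(E).  Here E = "a fair die shows 6".
random.seed(2)

n = 200_000
indicators = [1 if random.randint(1, 6) == 6 else 0 for _ in range(n)]
running_avg = sum(indicators) / n   # fraction of trials where E occurred

print(f"S_n/n after {n} trials: {running_avg:.4f}  (P(E) = {1/6:.4f})")
```

The fraction of trials in which E occurs lands very close to 1/6, which is exactly the frequentist notion of probability the law of large numbers formalizes.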
We talked about this when we discussed the frequentist notion of probability; if you recall the first or second lecture, this is exactly that frequency notion. This is the simplest version: the Xi need not always be indicators like this, they can be arbitrary, and the theorem holds for arbitrary Xi as long as they have a finite mean. Let us look at some examples now. Suppose I have a sequence of i.i.d. random variables, all exponential with parameter lambda. If you take their sum and average it, the law of large numbers says it should converge to the mean of that exponential distribution. What is that mean? 1/lambda. Similarly, if the Xi are all Poisson with parameter lambda, the average should converge to lambda, because lambda is the mean of the Poisson distribution. Now suppose you are dealing with i.i.d. random variables whose parameter you do not know a priori, but you know that the running average converges to the mean. If I give you samples drawn from an exponential distribution, but I do not tell you the parameter lambda, only that the samples are exponentially distributed, can you tell me what lambda is? Take the sum, take the running average, and I will give you as many samples as you want. As n goes to infinity, that value approaches 1/lambda, and its reciprocal gives you the parameter of your exponential distribution. So you see that if you know the data comes from a certain distribution but you do not know the parameter, the law of large numbers provides one means to find that parameter.
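This recipe for recovering lambda can be sketched in a few lines. The true lambda = 2.5 below is an illustrative value that we pretend is hidden; everything the "estimator" uses is the sample itself.

```python
import random

# Using the law of large numbers to recover an unknown parameter:
# samples come from Exponential(lambda) with lambda hidden from us.
# Since S_n/n -> 1/lambda, the reciprocal of the sample mean estimates
# lambda.  The true lambda = 2.5 here is an illustrative choice.
random.seed(3)
true_lambda = 2.5

n = 500_000
samples = [random.expovariate(true_lambda) for _ in range(n)]
sample_mean = sum(samples) / n        # approaches 1/lambda = 0.4

lambda_hat = 1 / sample_mean          # the estimator for lambda
print(f"estimated lambda = {lambda_hat:.3f}  (true = {true_lambda})")
```

This is the simplest instance of the parameter-estimation link mentioned above: the data alone, averaged, pins down the distribution's parameter.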
Next: even though I said I will give you as many samples as you want and let n go to infinity, that is a hypothetical thing. In reality it never happens that we keep generating forever; if you keep asking, maybe I will give you 100, then 200, and after that I will say I cannot do more. So let us say we have to deal with finite samples. For a particular n, let us compute Sn/n and denote it by mu-hat-n. Now, do you expect mu-hat-n to be the same as mu? It need not be: the limit as n tends to infinity is mu, but for a particular n it need not be mu. There is going to be some difference between mu-hat-n and mu. As n goes to infinity, that difference goes to 0, but for a particular finite n it need not be 0. And since in real life we deal only with finite n, we want to understand: if I have a certain number n of samples, how large is this difference? We know it goes to 0, but how fast does it decay? For that, another celebrated theorem of probability comes into the picture, called the central limit theorem. What the central limit theorem says is: if you have a sequence of i.i.d. random variables, each with mean mu and variance sigma^2, and as usual Sn is the sum of the Xi, then the quantity (Sn - n*mu) / sqrt(n*sigma^2) converges not to a number but in distribution: the values to which it converges follow the standard normal distribution. You understand this point? Let us quickly do one example.
Take a coin toss: I toss a coin and get 1, I toss the same coin a second time and get 0, then a third, fourth, fifth, sixth, seventh time, and so on, and let us say the coin is Bernoulli(0.6). These outcomes are x1, x2, x3, x4, and so on. If I take (1/n) times the sum of the xi and let n go to infinity, this converges to 0.6. Now, instead of that, I take (sum of xi - n*mu) / sqrt(n*sigma^2), where mu = 0.6 and sigma^2 for a Bernoulli(p) is p(1-p) = 0.6 * 0.4 = 0.24. As n tends to infinity, for this particular realization this quantity may come out to some value; call it z1. I do not know what z1 is, and it is specific to this particular sequence. Now let us say I generate another sequence where x1 = 0, x2 = 0, x3 = 1, x4 = 1, x5 = 1, x6 = 1, and so on. On this sequence too I can do the same thing: the running average (1/n) times the sum of the xi converges to 0.6 again. So notice that irrespective of the sequence, we always get 0.6; that is what the law of large numbers says, as long as you average enough i.i.d. samples. But if you now compute the standardized quantity on this new sequence, it need not be z1 again; it could be some different value z2. What the central limit theorem says is that these points z1, z2, and so on appear as if they follow a Gaussian random variable with mean 0 and variance 1. That is the difference between the law of large numbers and the central limit theorem. Fine, let us stop here.
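The Bernoulli(0.6) picture can be sketched by simulation: generate many independent sequences, standardize each one's sum, and look at the collection of resulting z values. The sequence length and repetition count below are illustrative choices.

```python
import random

# Central limit theorem for Bernoulli(0.6) tosses: for each of many
# independent sequences, form  Z = (S_n - n*mu) / sqrt(n * sigma^2)
# with mu = 0.6 and sigma^2 = 0.6 * 0.4 = 0.24.  Each sequence gives a
# different Z, but across sequences the Z values look Normal(0, 1).
random.seed(4)
mu, sigma2 = 0.6, 0.6 * 0.4
n = 1_000            # tosses per sequence
reps = 4_000         # number of independent sequences

zs = []
for _ in range(reps):
    s_n = sum(1 if random.random() < mu else 0 for _ in range(n))
    zs.append((s_n - n * mu) / (n * sigma2) ** 0.5)

z_mean = sum(zs) / reps
z_var = sum((z - z_mean) ** 2 for z in zs) / reps
print(f"mean of Z ~ {z_mean:.3f} (expect 0), variance ~ {z_var:.3f} (expect 1)")
```

Every sequence's running average sits near 0.6, exactly as the law of large numbers predicts, while the standardized values scatter with mean about 0 and variance about 1, which is the central limit theorem in action.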