The complete theory of conditional expectation is a fairly deep topic. What we will do is develop a somewhat elementary version of conditional expectation for discrete random variables and for jointly continuous random variables. For the discrete case and the jointly continuous case we have, respectively, the conditional PMF and the conditional pdf, and we will simply define the conditional expectation as the expectation taken with respect to the conditional PMF or the conditional pdf. That is a very elementary way of developing the topic. The conditional expectation is actually a concept that is defined in much fuller generality, but developing that requires more tools from measure theory and more properties of $L^2$ spaces, which is a little too advanced for our purposes. So I will develop it in an elementary fashion for discrete and jointly continuous random variables and then just hint at what it takes to generalize it; you do not have to worry about the most general treatment.

To start with, let $X$ and $Y$ be discrete random variables with joint PMF $p_{X,Y}(x,y)$. If $X$ and $Y$ are discrete, the joint PMF tells us everything we need to know about them. Recall that the conditional PMF was defined as

$$p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)},$$

assuming the denominator $p_Y(y)$ is strictly positive. You are fixing $Y = y$ and looking at the distribution of $X$ given $Y = y$. If you recall the picture: you have discrete probability masses sitting on $\mathbb{R}^2$, you fix $Y = y$, scale the joint law on that slice by $p_Y(y)$, and look at the resulting distribution of $X$. That is the interpretation.

Now, the conditional expectation of $X$ given $Y = y$ will simply be the expectation taken with respect to this conditional PMF. Define

$$E[X \mid Y = y] = \sum_x x \, p_{X|Y}(x \mid y),$$

where the sum runs over all the relevant values of $x$. This is a number which depends on the value of $y$: if you fix one $y$ you get some answer, and if you fix another $y$ you get some other answer. So you can think of this summation as some function of $y$; let me call it $\psi(y)$.

Now comes the key step. The random variable $\psi(Y)$ is called the conditional expectation of $X$ given $Y$, and is denoted $E[X \mid Y]$. Note the distinction: $E[X \mid Y = y]$ is not the conditional expectation, it is the conditional expectation of $X$ given $Y = y$, a number. This object changes whenever $y$ changes; depending on what value capital $Y$ takes, you get different values, and I am calling that function $\psi(y)$.
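To make this concrete, here is a minimal sketch in Python (my own illustration, not from the lecture; the joint PMF table is a made-up example) that computes $\psi(y) = E[X \mid Y = y]$ for each value of $y$ from a joint PMF.

```python
import numpy as np

# Hypothetical joint PMF of (X, Y): rows index x-values, columns index y-values.
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([0.0, 1.0])
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.10],
                 [0.10, 0.20]])   # p_xy[i, j] = P(X = x_vals[i], Y = y_vals[j])

p_y = p_xy.sum(axis=0)            # marginal PMF of Y

# psi(y) = E[X | Y = y] = sum_x x * p_{X|Y}(x | y), defined whenever p_Y(y) > 0.
psi = (x_vals @ p_xy) / p_y
for y, val in zip(y_vals, psi):
    print(f"E[X | Y = {y}] = {val:.4f}")

# The conditional expectation E[X | Y] is the random variable psi(Y):
# it takes the value psi[j] with probability p_y[j].
```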
Now, this little $y$ is itself a value taken by the random variable capital $Y$, so I am considering the random variable $\psi(Y)$; do you understand what I mean? For a fixed $y$, $\psi(y)$ is just a number; that is not the conditional expectation, that is the conditional expectation of $X$ given $Y = y$. But $\psi$ is some function of $y$, which you may be able to calculate explicitly; replace small $y$ with capital $Y$ and you get a random variable, and that random variable is called the conditional expectation of $X$ given $Y$. It is a random variable, and it is a random variable that depends only on $Y$. (This definition is only for the discrete case, by the way.)

To say it a little more precisely: $\psi(Y)$ is a function of the random variable $Y$, with a very specific definition. But you know that any function of $Y$ is a random variable measurable with respect to the sigma algebra generated by $Y$. So $\psi(Y)$ is measurable not just under $\mathcal{F}$, the bigger sigma algebra, but under $\sigma(Y)$. That is something I think you proved in a homework or exam: for any measurable $f$, $f(X)$ is measurable under $\sigma(X)$. So you can think of $E[X \mid Y]$ as a random variable measurable under $\sigma(Y)$.

Similarly, you can define $E[Y \mid X]$: take the conditional PMF the other way, define $E[Y \mid X = x]$, which will be some function of little $x$, and replace little $x$ with capital $X$; that is called the conditional expectation of $Y$ given $X$, with the analogous notation. It is a random variable, not a number; do not be fooled by the expectation symbol. Normally when you take an expectation of something you get a number, but here you get a random variable which is a function only of what is on the right side of the vertical bar. Essentially, you average out the variable on the left of the bar, and what remains is some function of the variable on the right of the bar. Are you with me? I will give an example after defining it for the jointly continuous case as well.

Next, let $X$ and $Y$ be jointly continuous with joint probability density function $f_{X,Y}(x,y)$. Recall the conditional pdf: it was defined as the joint over the marginal of $Y$,

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)},$$

defined whenever the denominator is strictly positive. Now I am going to define the conditional expectation of $X$ given $Y = y$ as the expectation with respect to this conditional pdf:

$$E[X \mid Y = y] = \int x \, f_{X|Y}(x \mid y) \, d\lambda(x),$$

or $dx$ if you like. I am taking the conditional density and calculating the expectation with respect to it. That is what I call the conditional expectation of $X$ given $Y = y$, but this is not the conditional expectation; again, it is a function of little $y$, as usual, because I am integrating $x$ out.
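As a quick illustration of the continuous case (my own sketch, not from the lecture; the density $f_{X,Y}(x,y) = x + y$ on the unit square is a hypothetical example), you can approximate $E[X \mid Y = y]$ numerically by discretizing the joint density on a grid.

```python
import numpy as np

# Hypothetical joint density f(x, y) = x + y on the unit square (integrates to 1).
def f_xy(x, y):
    return x + y

dx = 1e-4
xs = np.arange(dx / 2, 1.0, dx)            # midpoint grid on (0, 1)

def cond_exp_x_given_y(y):
    """Approximate E[X | Y = y] = (int x f(x,y) dx) / (int f(x,y) dx)."""
    fx = f_xy(xs, y)
    return np.sum(xs * fx) / np.sum(fx)    # the dx factors cancel in the ratio

for y in (0.1, 0.5, 0.9):
    exact = (2 + 3 * y) / (3 * (1 + 2 * y))   # closed form for this density
    print(f"y={y}: numeric {cond_exp_x_given_y(y):.4f}, exact {exact:.4f}")
```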
What remains will be some function of little $y$; whatever function of $y$ you get, you replace little $y$ blindly with capital $Y$. Call it $\psi(y)$ again; I am using the same symbol, although of course this $\psi$ will not actually be the same function as in the discrete case. The random variable $\psi(Y)$ is called the conditional expectation, denoted the same way, $E[X \mid Y]$.

Is that clear so far? If you are given either discrete random variables or jointly continuous random variables, you go ahead and find the conditional PMF or the conditional pdf, calculate the expectation with respect to it, and what you get is a function of the variable you are conditioning on. Replace that little $y$ or little $x$ with big $Y$ or big $X$, and call that random variable the conditional expectation. So expectation is always a number; conditional expectation is always a random variable. Furthermore, it is a random variable measurable with respect to the sigma algebra generated by the variable you are conditioning on, $\sigma(Y)$ in this case. This viewpoint is good to keep in mind, because the general definition of conditional expectation is in terms of the existence of a random variable which is $\sigma(Y)$-measurable. For now, though, we have just the discrete case and the jointly continuous case.

If you notice, very little of what we did earlier in this course was specific to discrete or continuous random variables; expectation itself we developed in full generality. Conditional expectation I am not developing in full generality, because it requires a few more mathematical tools. What we will do is discuss some properties of the conditional expectation that we have so far defined for these two special cases. Some of these key properties will actually help us understand how conditional expectation is defined in the general case, where $X$ and $Y$ may not be discrete or jointly continuous; they may be singular, mixtures, what have you.

Before going on to the properties, let me give you an example for the jointly continuous case, so that you know exactly what I am talking about. Let

$$f_{X,Y}(x,y) = \frac{1}{x}, \qquad 0 < y \le x \le 1,$$

and the problem is: find $E[Y \mid X]$. I am giving you a joint density and asking you to find $E[Y \mid X]$, which will be a random variable, a function of capital $X$. Let us see what this density looks like. First of all, where does it live? We have $x$ between 0 and 1 and $0 < y \le x$, so the support is the triangle below the diagonal of the unit square. On this range the density takes the value $1/x$. You can verify that this is a valid density: integrate $y$ from 0 to $x$, which gives 1, then integrate $x$ from 0 to 1, and you get 1. Note it is not uniform; it is $1/x$ on this range. Good. So what do I have to do now?
First of all, I need a marginal so that I can compute the conditional density, and here I need the marginal density of $X$. To get $f_X(x)$ I have to integrate $y$ out, with $y$ going from 0 to $x$:

$$f_X(x) = \int_0^x \frac{1}{x} \, dy = 1, \qquad 0 < x \le 1.$$

So $X$ itself is uniform on $(0,1]$. Did I make any mistake? No. Great. Now I have to find the conditional density of $Y$ given $X$:

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{1}{x}, \qquad 0 < y \le x,$$

which means that given $X = x$, $Y$ is uniform on $(0, x]$. With me? Great. Now, if I want to find $E[Y \mid X = x]$, I integrate $y$ times $1/x$ with respect to $y$, with $y$ going from 0 to $x$:

$$E[Y \mid X = x] = \int_0^x y \cdot \frac{1}{x} \, dy = \frac{y^2}{2x} \bigg|_0^x = \frac{x}{2}.$$

Therefore, what is the conditional expectation? This $x/2$ is not yet the conditional expectation; I want $E[Y \mid X]$, so I simply replace the little $x$ with big $X$: $E[Y \mid X] = X/2$. As I told you, the answer is some function of the random variable on the right side of the vertical bar. So $X/2$ is the conditional expectation here, and it is a random variable. If you wanted $E[X \mid Y]$ instead, you would do the same thing the other way around: find the marginal of $Y$, find the conditional density of $X$ given $Y$, and integrate; you would get some other function of $Y$. Understood so far? As in the discrete case, you find some expression in terms of whatever value you are conditioning on and blindly replace the small letter by the capital letter, so that it becomes a random variable. It is as simple as that; it is almost mechanical. So now, for jointly continuous and for discrete random variables, you know how to compute the conditional expectation, which is a random variable, not a number.
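Here is a quick Monte Carlo sanity check of this example (my own sketch, not part of the lecture): sample from the joint density by drawing $X$ uniform on $(0,1]$ and then $Y$ uniform on $(0,X]$, and compare the empirical mean of $Y$ near a given $x_0$ with $x_0/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample from f_{X,Y}(x,y) = 1/x on 0 < y <= x <= 1:
# X is uniform on (0,1], and given X = x, Y is uniform on (0,x].
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(0.0, x)

# Estimate E[Y | X = x0] by averaging y over samples with X near x0.
for x0 in (0.2, 0.5, 0.8):
    mask = np.abs(x - x0) < 0.01
    print(f"x0={x0}: empirical {y[mask].mean():.4f}, predicted x0/2 = {x0 / 2:.4f}")
```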
A student asks: what is the notion of expectation here, if the "expectation" comes out as a random variable? If you see expectation as an average, shouldn't it come out as a number? The answer: the interpretation of expectation as an average is given by the law of large numbers, which we will do much later, towards the end of this course. As far as we are concerned so far, expectation is simply a number, the integral $\int X \, dP$; I have deliberately not given it any interpretation. And the conditional expectation is a random variable; that is all I have said so far. If you do want to compare with the averaging picture: given $X = x$, you can view $E[Y \mid X = x]$ as "what is my expected $Y$, given $X = x$". But this little $x$ itself changes; what I mean is that capital $X$ can take different values $x$. That is why the conditional expectation is a random variable: it changes value depending on what value $X$ takes.

Now that we have done this example, let me move on to properties. Again, let $\psi(X) = E[Y \mid X]$; this is a random variable which is a function of $X$ alone (it could equally well be $E[X \mid Y]$; it does not matter).

Theorem (law of iterated expectations): $E[\psi(X)] = E[Y]$.

This is a very important result. $\psi(X)$ is a $\sigma(X)$-measurable random variable, so its expectation is well defined: if I average it over the distribution of $X$, I get a number, and what happens is that I get exactly the expectation of $Y$. It is a fairly remarkable, important property of conditional expectations. The fancier way of writing it is

$$E\big[E[Y \mid X]\big] = E[Y],$$

which is why it is called the law of iterated expectations: you have two iterated expectations. Written this way it is slightly less transparent. The interpretation is that the inner expectation, $E[Y \mid X]$, is some random variable which is a function of what is on the right side of the bar, namely $X$, and the outer expectation is then taken with respect to $X$. Sometimes people write a subscript $X$ on the outer expectation just to make explicit that it is taken with respect to the distribution of $X$; it cannot be anything else, by the way, since this is a random variable that is a function of $X$ and $Y$ has already been averaged out, so to speak. If you do that, you get back the expectation of $Y$.

This is a fairly important, deep property of conditional expectation. So far we have only defined conditional expectation for discrete and jointly continuous random variables, so any proof we can supply will only work for those cases, in which case we just write out the expressions. Let me do the discrete case; you do the jointly continuous one. Beyond those two cases we do not know anything at this point. The proof is actually fairly mechanical:

$$E[\psi(X)] = \sum_x p_X(x) \, \psi(x) = \sum_x p_X(x) \, E[Y \mid X = x].$$

$\psi(X)$ is a random variable that is a function of $X$, and $X$ itself takes a discrete set of values, so this makes sense. Now write out the inner conditional expectation:

$$E[\psi(X)] = \sum_x p_X(x) \sum_y y \, p_{Y|X}(y \mid x).$$
Now write $p_{Y|X}(y \mid x) = p_{X,Y}(x,y) / p_X(x)$; the $p_X(x)$ factors cancel, and you get

$$E[\psi(X)] = \sum_x \sum_y y \, \frac{p_{X,Y}(x,y)}{p_X(x)} \, p_X(x) = \sum_{x,y} y \, p_{X,Y}(x,y) = E[Y].$$

I have been a little sloppy about the order of summation. If the random variables are all nonnegative there is no problem, and likewise when the expectations are well defined, that is, when $E|X|$ and $E|Y|$ are finite, you can interchange the summations without any problem. I have been a little sloppy, but this is the essence of the proof.

Intuitively, this is actually not all that difficult to comprehend. At a very elementary level, suppose you have a class with several sections. You can compute the overall class average by averaging each section separately and then taking a weighted average of the section averages, weighted by the number of students in each section. That is roughly what is happening here: what you are conditioning on plays the role of the section, inside each section you compute an average, and then you take a weighted average across the sections. It is then not surprising that you get the overall average. Similarly, you can prove the result for the jointly continuous case, except all the summations become integrals. This is an important result: the law of iterated expectations, $E[E[Y \mid X]] = E[Y]$.
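A minimal numerical check of the law of iterated expectations on a made-up discrete joint PMF (my sketch, not from the lecture):

```python
import numpy as np

# Hypothetical joint PMF: rows index x-values, columns index y-values.
x_vals = np.array([1.0, 2.0])
y_vals = np.array([0.0, 1.0, 4.0])
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)                 # marginal PMF of X
psi = (p_xy @ y_vals) / p_x            # psi(x) = E[Y | X = x]

lhs = psi @ p_x                        # E[ E[Y|X] ] = sum_x psi(x) p_X(x)
rhs = y_vals @ p_xy.sum(axis=0)        # E[Y] directly from the marginal of Y
print(lhs, rhs)                        # the two agree
```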
Let me give you an application of this law of iterated expectations. Remember random sums of random variables: $S_N = \sum_{i=1}^{N} X_i$, where the $X_i$ are i.i.d. and $N$ is a random variable independent of all the $X_i$. If you want to compute $E[S_N]$, you can do the following. By the law of iterated expectations, $E[S_N] = E\big[E[S_N \mid N]\big]$. Now, $E[S_N \mid N]$ will be some function of the random variable capital $N$, so before computing it, fix $N = n$ and consider

$$E[S_N \mid N = n] = E\Big[\sum_{i=1}^{N} X_i \,\Big|\, N = n\Big] = E\Big[\sum_{i=1}^{n} X_i \,\Big|\, N = n\Big],$$

where I can put the little $n$ in the upper limit because I am conditioning on $N = n$. Now I use the fact that $N$ is independent of all the $X_i$: conditioning on $N = n$ does not change the distribution of the $X_i$, and therefore does not change their expectations. So the conditioning can simply be removed, and the right-hand side becomes $\sum_{i=1}^n E[X_i] = n\,E[X]$, since the $X_i$ all have the same mean. Thus $E[S_N \mid N]$ is obtained by blindly replacing small $n$ with capital $N$:

$$E[S_N \mid N] = N\,E[X].$$

What kind of object is this? It has to be a random variable: it is the conditional expectation, a random variable that is a function of $N$; $E[X]$ is just a number. Therefore

$$E[S_N] = E\big[E[S_N \mid N]\big] = E\big[N\,E[X]\big] = E[N]\,E[X].$$

So if you are summing i.i.d. random variables, where the number of terms in the sum is itself random but independent of the $X_i$, the expectation of the sum is the product of the expectation of $N$ and the expectation of each $X_i$, which seems perfectly intuitive.

As an aside: as I mentioned earlier, there are scenarios of interest where capital $N$ depends on the $X_i$, such as gambling, where the number of times you gamble may depend on how much you have won and how much you have lost. In those cases you cannot apply the argument above: if you condition on $N = n$, the distribution of the $X_i$ changes, so you cannot remove the conditioning. But with somewhat more sophisticated techniques of analysis one can show that when $N$ is what is known as a stopping rule, exactly the same equation holds; for example, $N$ could be the first time you have either lost 100 dollars or won 100 dollars. That result is called Wald's equality. Such an $N$, which depends on the previous $X_i$, is called a stopping rule; it does not follow from our proof, and we will not cover it in this course. I just wanted to mention it as an aside.
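Here is a quick simulation of the random-sum identity (my own sketch; the particular distributions are arbitrary choices, not from the lecture): $N$ geometric and $X_i$ exponential, with $N$ independent of the $X_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000

# N ~ Geometric(p) on {1, 2, ...}, independent of the X_i; X_i ~ Exponential(mean 2).
p, x_mean = 0.25, 2.0
n = rng.geometric(p, size=trials)
s = np.array([rng.exponential(x_mean, size=k).sum() for k in n])

print("E[S_N] empirical:", s.mean())
print("E[N] * E[X]     :", (1 / p) * x_mean)   # 4 * 2 = 8
```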
Now I want to state a generalization of this law of iterated expectations. More generally, we can show that for any measurable $g$ (with the relevant expectations finite),

$$E\big[(Y - \psi(X))\, g(X)\big] = 0,$$

where $\psi(X) = E[Y \mid X]$. In other words,

$$E\big[Y\, g(X)\big] = E\big[E[Y \mid X]\, g(X)\big].$$

The earlier version is simply the case $g(X) = 1$: the law of iterated expectations is this identity with $g \equiv 1$. Now I am saying that you can put in any function of $X$ you want and the identity still holds. How will you show this? After all, we only know the definition for discrete and jointly continuous random variables; just write both sides out and check that it works, whenever the expectations are defined. You will be able to prove it, so I do not want to spend time on it now, because I want to give a very nice geometric interpretation.

It is useful to think of $E[Y \mid X]$ as some kind of estimate of $Y$ given $X$. Think of $X$ and $Y$ as two dependent random variables: if they are dependent, then knowing $X$ generally tells you something about $Y$. Say $X$ is today's humidity and $Y$ is the temperature; if I give you only the humidity, you may be able to give some estimate of the temperature. That is the kind of scenario we are looking at: $Y$ you do not know, $X$ you know, and you can think of $E[Y \mid X]$ as an estimator of $Y$ that depends only on $X$, because only $X$ is known to you. What our equation is saying is that $Y - E[Y \mid X]$, which is the error in the estimate, so to speak, is uncorrelated with every function of $X$: the covariance between the error and $g(X)$ is zero. If you want to interpret this geometrically, the error is orthogonal to any function of $X$.

Suppose all these random variables live in $L^2$, that is, they have finite variance. I will draw the picture as if it were a Euclidean space, but it is really a Hilbert space. You are looking for an estimate of $Y$, and you are constrained to a subspace: since you are given $X$, any estimate you make must be a $\sigma(X)$-measurable random variable, measurable with respect to the set of all events determined by the realization of $X$. The $\sigma(X)$-measurable random variables form a subspace of this Hilbert space; that you can show. So here is $Y$, and $g(X)$ is any random variable in this subspace. $\psi(X)$ is a function of $X$, so it sits somewhere in the subspace, and what we are saying is that the difference between $Y$ and the estimate is orthogonal to every vector in this subspace. Which means that, in a geometric sense, the conditional expectation is the foot of the perpendicular from $Y$ onto the subspace of $\sigma(X)$-measurable random variables. Do not ask me what the axes are; this is some Hilbert space that I am just drawing like $\mathbb{R}^2$ for easy visualization. So $Y - E[Y \mid X]$, the estimation error, is orthogonal to the space of $\sigma(X)$-measurable random variables.

That is what this result is saying, which brings me to the point of defining conditional expectation in full generality. In full generality, the conditional expectation is defined using this very equation: there exists a $\sigma(X)$-measurable random variable $\psi(X)$ satisfying the equation for every measurable $g$, and such a $\psi(X)$ is called the conditional expectation. That $\psi(X)$ can be shown to exist and to be unique up to almost sure equality, and it has exactly the interpretation above: it is defined as the foot of the perpendicular onto the space of $\sigma(X)$-measurable random variables.
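As a sanity check of the orthogonality property (my sketch, reusing the worked example above, where $E[Y \mid X] = X/2$): the estimation error $Y - X/2$ should be uncorrelated with any reasonable function of $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Worked example from the lecture: X uniform on (0,1], Y | X=x uniform on (0,x],
# so E[Y | X] = X / 2 and the estimation error is Y - X/2.
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(0.0, x)
err = y - x / 2

# E[(Y - psi(X)) g(X)] should be ~0 for any (nice) function g of X.
for name, g in [("g(x)=x", x), ("g(x)=x**2", x**2), ("g(x)=cos(x)", np.cos(x))]:
    print(f"{name}: E[err * g(X)] = {np.mean(err * g):+.5f}")
```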
This orthogonality relation is what is used as the defining property of conditional expectation when $X$ and $Y$ are not necessarily discrete or jointly continuous or any such thing; they can be whatever they want. Kolmogorov proved that such a random variable $\psi(X)$ exists and is almost surely unique, and that is taken as the conditional expectation. It cannot be given by an explicit formula; you can only show that such a random variable exists.

That brings me to another geometric point. I said that $E[Y \mid X]$ is your estimate of $Y$ given $X$, lying in the space of $\sigma(X)$-measurable random variables. We can show that $E[Y \mid X]$ is, in a precise sense, the best estimator in the squared-error sense: if you take any function whatsoever, some $h(X)$, which is a $\sigma(X)$-measurable random variable, its squared error from $Y$ will always be bigger than or equal to the squared error of the conditional expectation. Geometrically this makes perfect sense: suppose you are given a point $Y$ outside the subspace and you want to find the point in the subspace closest to $Y$; in Euclidean space you would just take the foot of the perpendicular, and the same intuition goes through in this Hilbert space of random variables.

Theorem: the conditional expectation $E[Y \mid X]$ is the minimum mean squared error (MMSE) estimator of $Y$ given $X$; that is, for any measurable $h$,

$$E\big[(Y - E[Y \mid X])^2\big] \le E\big[(Y - h(X))^2\big].$$

Here $E[(Y - h(X))^2]$ is the squared norm, and you know that this behaves like a norm. So the left side is less than or equal to the right side; that is what we are saying. I have to stop here because I am out of time; we will continue with the proof, which is actually fairly straightforward, in the next class, and then move on to the next topic.
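As a closing sketch (mine, appended after the lecture, not from it), here is the MMSE property illustrated on the same worked example: no competitor $h(X)$ beats $h(X) = X/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Worked example again: X uniform on (0,1], Y | X=x uniform on (0,x].
x = rng.uniform(0.0, 1.0, size=n)
y = rng.uniform(0.0, x)

# Compare mean squared errors of several estimators h(X) against E[Y|X] = X/2;
# the exact MSE of X/2 is E[X^2/12] = 1/36 ~ 0.02778, and every other h does worse.
candidates = {"X/2 (cond. exp.)": x / 2,
              "X/3": x / 3,
              "0.25 (constant)": np.full(n, 0.25),
              "X**2": x**2}
for name, h in candidates.items():
    print(f"{name:18s} MSE = {np.mean((y - h) ** 2):.5f}")
```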