Hello everyone, this is Alice Gao. In this video, I'm going to discuss the two questions on slides 7 and 8 of the lecture 11 slides. These questions ask you to calculate the minimum number of probabilities needed to represent a joint distribution, sometimes when we know nothing about the random variables and other times when we know some conditional or unconditional independence relationships between them.

First, let's think about the joint distribution when we know nothing about these variables. That's part one of both questions. There are two ways of deciding how many probabilities we need to represent the distribution.

The first way is a simple counting argument. We have three random variables, all Boolean, so to specify the joint distribution we have to specify a probability for every combination of values. There are two possibilities for each variable and three variables, so there are a total of 2^3 = 8 probabilities in the distribution. Now, because it's a distribution, the eight probabilities sum to one, so we only need to specify seven of them: the last one is just one minus the sum of the first seven. That's why seven probabilities are enough to specify the distribution.

The second way of thinking might sound a little strange at first, but it will help you connect to the material later when we learn about Bayesian networks. By the chain rule, we can expand the joint probability: the probability of A and B and C equals the probability of A, multiplied by the probability of B given A, multiplied by the probability of C given A and B. If we write it out like this, another way of thinking about it is to represent it graphically, using nodes for the variables and arrows for the conditioning relationships between them.
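The first counting argument can be sketched in a few lines of Python (my own illustration, not from the lecture):

```python
from itertools import product

# Enumerate every assignment of true/false to the three Boolean variables.
assignments = list(product([True, False], repeat=3))
print(len(assignments))       # 8 entries in the joint distribution

# The 8 probabilities must sum to 1, so only 7 of them are free.
free_parameters = len(assignments) - 1
print(free_parameters)        # 7
```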
So we have A, B, and C. A stands on its own, and B is conditioned on knowing A, so let's use an arrow from A to B to represent that; you can imagine this as "A causes B", or the value of A influencing the probability of B. Then C is conditioned on both A and B, which we represent with two directed edges, one from A to C and one from B to C. So this is one way of representing the relationship between the three variables.

Now, given this picture, how many numbers do we need to specify all the probabilities? Well, A does not depend on anything, so to specify everything about A we would need the probability that A is true and the probability that A is false; but of the two, we only need one, because the other is just one minus it. So specifying, say, the probability that A is true is enough.

What about B? B is conditioned on the value of A, so technically we need four numbers: B is true given A is true, B is false given A is true, B is true given A is false, and B is false given A is false. But again, notice that, for example, the probability of B given A plus the probability of not B given A equals one, so of those two we only need to specify one, and the same goes for the other pair. Suppose we specify the probability of B given not A; then the probability of not B given not A is just one minus that. So because B is conditioned on A, we need to specify two probabilities out of the four.

For C you can see it's similar, except C is conditioned on two variables, so we end up needing four probabilities to cover all the combinations: C given A and B, C given A and not B, C given not A and B, and C given not A and not B.
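This per-node counting generalizes: a Boolean variable with k Boolean parents needs 2^k free probabilities, one for each combination of parent values. A small sketch (the function name is my own, not from the lecture):

```python
# Free probabilities for a Boolean node: 2 ** (number of Boolean parents);
# the complementary probabilities follow from summing to one.
def free_params(num_parents):
    return 2 ** num_parents

# Chain-rule structure: A has no parents, B has parent A, C has parents A and B.
counts = {"A": free_params(0), "B": free_params(1), "C": free_params(2)}
print(counts)                 # {'A': 1, 'B': 2, 'C': 4}
print(sum(counts.values()))   # 7
```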
And all of the versions with the probability of not C given something are just one minus the probability of C given something. OK, so this is another counting argument: based on the picture, the total number of probabilities we need is one for A, two for B, and four for C, which is again a total of seven.

Let's now look at part two. In part two, we assume that A, B, and C are all independent. In this case, if you remember the definition of unconditional independence, it means we can take the joint probability of the three variables and write it as a product of the individual probabilities for each variable: the probability of A and B and C equals the probability of A, multiplied by the probability of B, multiplied by the probability of C.

You can see this intuitively as well if you compare it with the chain-rule expression from before. The first term, the probability of A, is the same in both. For the second term, this is exactly the definition: if A and B are independent, then the probability of B given A equals the probability of B. The third term is not exactly the definition, but you can see the same thing happening: if C is independent of both A and B, then knowing the values of A and B does not influence our belief about C, so we can get rid of A and B, and the probability of C given A and B is just the probability of C alone.

Another way of thinking about this is to go back to our picture. Before, B depended on A, and C depended on both A and B. What about now? Well, A is just by itself, that's fine. B is independent of A, which means the probability of B no longer depends on A, so we can effectively delete that link and everything is still fine. Similarly, the probability of C no longer depends on either A or B.
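To see that three numbers really do pin down the whole joint under full independence, here is a quick numerical check (the marginal values 0.7, 0.4, and 0.9 are made up purely for illustration):

```python
from itertools import product

# Under full independence, the three marginals are the only free parameters.
p_a, p_b, p_c = 0.7, 0.4, 0.9

def joint(a, b, c):
    # P(A=a) * P(B=b) * P(C=c)
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c if c else 1 - p_c
    return pa * pb * pc

# All eight joint probabilities are recovered from 3 numbers, and they sum to one.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
print(round(total, 10))       # 1.0
```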
So we can delete both of those links as well. You can see the magic of these independence assumptions: they allow us to simplify our graphical model. Once I delete the links, we get a much simpler model. And given this model, what numbers do we need to specify the probabilities? We need the probability of A; we need the probability of B, since B no longer depends on A; and we need the probability of C, since C no longer depends on A or B. So the total is one for A, one for B, and one for C: a total of three probabilities.

Let's look at the second question. The first part is the same as before, so I'm going to skip it. In the second part we make a different assumption: now we know that A and B are conditionally independent given C. I drew another graphical model to represent the joint distribution. You might notice that the direction of the arrows is a little different from the previous example; the reason is that, given the way I've written this conditional independence assumption, this model is more convenient to explain.

Let's write out the joint distribution mathematically given this graphical model, using the chain rule and following the direction of the arrows in the picture: the probability of A and B and C equals the probability of C, multiplied by the probability of A given C, multiplied by the probability of B given C and A.

All right. Now let's apply our conditional independence assumption, which says that A and B are conditionally independent given C. One way of seeing this is to look at the third term in the expression. C is already on the right-hand side of the conditioning bar, so given C, also knowing the value of A should not influence our belief about B. That means we can remove A from this expression and everything stays the same.
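We can sanity-check this step numerically: build a joint from P(C), P(A | C), and P(B | C), then compute P(B | A, C) directly from the joint and confirm it matches P(B | C). (All the numbers below are hypothetical, chosen just for the check.)

```python
# Hypothetical conditional probabilities (not from the lecture).
p_c = 0.5
p_a_given_c = {True: 0.8, False: 0.3}   # P(A=true | C=c)
p_b_given_c = {True: 0.6, False: 0.2}   # P(B=true | C=c)

def joint(a, b, c):
    # P(A, B, C) = P(C) * P(A | C) * P(B | C), assuming A and B are
    # conditionally independent given C.
    pc = p_c if c else 1 - p_c
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    return pc * pa * pb

# P(B=true | A=true, C=true), computed directly from the joint:
numerator = joint(True, True, True)
denominator = joint(True, True, True) + joint(True, False, True)
print(round(numerator / denominator, 10))   # 0.6, the same as P(B=true | C=true)
```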
So this expression is equal to the probability of B given C; every other term I'll just copy. What does this mean graphically? Well, the third term used to mean that two arrows point to B, one coming from C and one coming from A. But it is equivalent to just the probability of B given C, which means that if B already depends on C, we can delete the arrow from A to B: once we know the value of C, also knowing the value of A does not influence our belief about B. So let me delete that arrow now.

After deleting the arrow, we can count the number of probabilities we need. For C, we just need one. For A, I need A given C and A given not C. For B, I need B given C and B given not C. So that is one for C, two for A, and two for B, a total of five.

The graphical models I've introduced here are essentially Bayesian networks, except that I haven't defined them formally yet. My reason for using both the mathematical derivation and the graphical representation is to show you different ways of thinking about the same problem. Either way, the important message is that if we know some independence or conditional independence assumptions about the variables, we can use fewer probabilities to represent the same joint distribution.

That's everything for this video. I will see you in the next one. Bye for now.
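To recap the three cases from the video, here is a small sketch that counts the free parameters for each structure from its parent sets (the helper function and variable names are my own, not from the lecture):

```python
# Each Boolean node contributes 2 ** (number of parents) free probabilities.
def total_free_params(parents):
    return sum(2 ** len(p) for p in parents.values())

no_assumptions    = {"A": [], "B": ["A"], "C": ["A", "B"]}   # full chain rule
fully_independent = {"A": [], "B": [], "C": []}              # question 1, part 2
cond_independent  = {"C": [], "A": ["C"], "B": ["C"]}        # question 2, part 2

print(total_free_params(no_assumptions))     # 7
print(total_free_params(fully_independent))  # 3
print(total_free_params(cond_independent))   # 5
```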