Hello everyone, this is Alice Gao. Welcome to another video on probabilities. In the previous video, I talked about two rules, the sum rule and the product rule, and we can use both of these rules to perform inference using the joint distribution. It's really nice when we have the joint distribution, but most of the time in practice, we don't have it. It's actually difficult to come up with the joint distribution for a particular scenario. So instead, we often have some prior (or unconditional) and conditional probabilities that we can use to describe the random variables we're interested in. In this video, I'm going to talk about two more rules, the chain rule and Bayes' rule, that we can use to perform inference using the prior and conditional probabilities.

Here's an example of some prior and conditional probabilities we can use to describe our Holmes scenario. Remember, we've simplified it so that we're only thinking about three variables: Alarm, Watson and Gibbon. These numbers are carefully constructed with an underlying model in mind, but you might not be able to see it yet. We will talk more about it when I introduce Bayesian networks. The kinds of tasks we are going to do with these probabilities are the same as before. So we are still wondering: how do we calculate a probability over a subset of the variables, and how do we calculate a conditional probability? For the first one, the chain rule is going to be useful, and for the second one, Bayes' rule is going to be useful.

Let's look at the chain rule first. I have several examples here: one for two variables, one for three variables, and then the general case for any number of variables. Let's look at the two-variable case first. This formula right here looks exactly like the product rule that we've seen earlier. So for the simple case, they are equivalent. Another version of this is that you can order A and B differently, and then you get an equivalent version, where this is equal to the probability of B given A multiplied by the probability of A. So both of these are the same.

Now for the three-variable version, we can begin to see a general pattern for how the chain rule works. If you look at this three-variable version, it seems like we are taking these three variables and ordering them in some way. And then given this order, we come up with this expression. You can think about building the expression backwards: for the last variable in the order, we have just the prior or unconditional probability of C. Then for the second-to-last variable, we have the conditional probability of B given C. And finally, for the first variable, we have the conditional probability of A given B and C. So if you try to extract the general pattern, it seems like, given some ordering of all of the variables, every variable in the sequence is going to be conditioned on all of the variables that come before it in the sequence. This is exactly the general pattern.

For the most general case, we have n variables, and you can think about ordering them. Again, the order is from right to left, right? So x1 up to xn. And then given this ordering, we have a giant product, so we can decompose this joint probability into a product of prior and conditional probabilities. In general, you can write this product in this form, where the big notation in the front denotes the product. And then you can see every variable in this product. So xi is the current variable that we're looking at, and every variable is conditioned on all of the variables that come before it in the ordering. If you want to write out this entire expression, it looks like what I have on the next slide. You can make up an example for yourself with four variables or even more and then try to write out this expression.
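To make the three-variable chain rule concrete, here is a small Python sketch that checks it numerically. The joint distribution in it is made up for illustration (it is not the table from the slides), and the helper names `marginal`, `conditional`, `chain_rule` and `chain_rule_other_order` are hypothetical. It builds each factor from the joint using the sum rule and the definition of conditional probability, then confirms that the product recovers every joint entry under two different orderings.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (A, B, C).
# These eight numbers are invented for illustration; they sum to 1.
values = [0.06, 0.14, 0.09, 0.21, 0.10, 0.15, 0.05, 0.20]
joint = dict(zip(product([True, False], repeat=3), values))


def marginal(fixed):
    """Sum rule: add up the joint entries that agree with `fixed`,
    a dict mapping a variable index (0=A, 1=B, 2=C) to a truth value."""
    return sum(p for world, p in joint.items()
               if all(world[i] == v for i, v in fixed.items()))


def conditional(target, given):
    """P(target | given) = P(target, given) / P(given)."""
    return marginal({**target, **given}) / marginal(given)


def chain_rule(world):
    """P(a, b, c) = P(a | b, c) * P(b | c) * P(c)."""
    a, b, c = world
    p_c = marginal({2: c})
    p_b_given_c = conditional({1: b}, {2: c})
    p_a_given_bc = conditional({0: a}, {1: b, 2: c})
    return p_a_given_bc * p_b_given_c * p_c


def chain_rule_other_order(world):
    """Same joint, different ordering: P(c | a, b) * P(b | a) * P(a)."""
    a, b, c = world
    p_a = marginal({0: a})
    p_b_given_a = conditional({1: b}, {0: a})
    p_c_given_ab = conditional({2: c}, {0: a, 1: b})
    return p_c_given_ab * p_b_given_a * p_a


# Both orderings reproduce every entry of the joint distribution.
for world, p in joint.items():
    assert abs(chain_rule(world) - p) < 1e-12
    assert abs(chain_rule_other_order(world) - p) < 1e-12
```

In practice the direction is reversed: we start from the prior and conditional probabilities and multiply them to obtain a joint probability. The sketch goes the other way only so that the identity can be checked against a known joint.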
And notice that there are many versions of this expression, and which one you get depends on how you've ordered the variables.

Let's practice using the chain rule. I have two clicker questions on this slide and the next slide for you to practice using the chain rule. Same as before, I will only tell you the correct answers in this video, and you can watch a separate short video to see how I derive the correct answers. For this first question, we want to calculate the probability that the alarm is going, Dr. Watson is calling and Mrs. Gibbon is not calling. You can go back to the previous slide and look at the prior and conditional probabilities that I gave you, but to make all the information available right here, I picked out some of the probabilities that could be important for this question and put them on the slide. So for this question, we want to calculate the probability of A and W and not G, and the correct answer is 0.063. Question number two: what is the probability that the alarm is not going, Dr. Watson is not calling and Mrs. Gibbon is not calling? For this question, we want to calculate the probability of not A and not W and not G, and the correct answer is 0.486. Even if you got the correct answers, it might be worthwhile to watch the separate short video anyway, because I will try to explain different ways of solving the same problem using the chain rule.

That's it for our first task. In order to calculate a joint probability, maybe over a subset of the variables, we can use the chain rule. Next, let's think about how we calculate a conditional probability. In particular, the task we have in mind is that we want to flip a conditional probability. So when do we actually want to flip a conditional probability? This is actually quite a common scenario. Think about it in real life. Often we have some knowledge that one thing could cause another. So we have some knowledge about the possible causes of a particular phenomenon.
For example, we would know that a particular disease is going to cause some particular symptom in a person, or we know that if there's a fire, then the alarm is going to go off, right? So this is the kind of causal knowledge that we often have. It's great to have this kind of knowledge, but it is often not directly helpful in practice when we're asking questions. Because when we're asking questions, we often do not know whether a person has a disease, or we do not directly observe whether there's a fire or not. Instead, we often observe the evidence or the symptom. So in the case of a medical diagnosis, we can often observe that a person has some symptoms, and then we want to ask how likely it is that this person has a particular disease based on the symptoms observed. Similarly, in the fire alarm setting, we often observe whether the alarm is going off or not, and we want to infer whether there's a fire going on. So you can see that the kind of knowledge we have does not really match up with the kind of reasoning we want to do, which means that, based on our knowledge, we want to be able to flip the conditional probability and calculate the conditional probability the other way around.

This is the famous Bayes' rule. It allows us to take the probability of y given x and flip it to get the probability of x given y, and in order to do this, we need the probability of y given x and the probability of x. You will see that we don't directly need to know the probability of y in the denominator, because the numerators, computed for x and for not x, already give us enough information to derive it. Now, you should not need to memorize this rule. At any time you should be able to derive it, for example, when you're stuck on a deserted island. So here is how you can derive it. Remember the two different ways of writing out the product rule for two variables. If we have the probability of x and y, depending on the ordering, we can write this out as the probability of x given y multiplied by the probability of y.
This is ordering y first and then x next. Or, if we choose the other ordering, this is equal to the probability of y given x multiplied by the probability of x. Given these two ways of writing out the same expression, which you can see on the right, how do we get to our version of Bayes' rule? We can set the two equal and then simply take the probability of y here and move it to the right, that is, divide both sides by the probability of y. This is going to give us our version of Bayes' rule.

The second point I want to make is regarding the probability of y in the denominator. In fact, in order to calculate the probability of x given y, it's not necessary to directly know the probability of y. To understand this, first of all, let's do a little bit of mathematical derivation. I'm going to take the expression of Bayes' rule and expand the denominator. The numerator stays the same, and then we take the denominator and apply the sum rule in reverse. So we expand it by adding x into the expression: p of y is equal to p of y and x plus p of y and not x. The probability of y does not involve x, so when we introduce x, we need to include all possible values of this variable. And then let's expand the two terms in the denominator further by using the product rule again. The numerator again stays the same, and the first term in the denominator actually ends up being the same as the numerator: the probability of y given x multiplied by the probability of x. For the second term, we'll also order things so that not x comes first and then y comes next, which gives the probability of y given not x multiplied by the probability of not x.

This is one part of the story. Now, when we use Bayes' rule with this formula, we end up calculating a probability distribution. This equation is only showing you one probability in the distribution. The other probability in the distribution is the probability of not x given y. I can do a similar derivation, but let me just write down the end result.
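Written out, the two results look like the following. This is a reconstruction in LaTeX from the spoken derivation, since the slide itself is not part of this transcript; the notation P(·) and ¬x for "not x" is an assumed choice.

```latex
\begin{align*}
P(x \mid y)
  &= \frac{P(y \mid x)\,P(x)}{P(y)}
   = \frac{P(y \mid x)\,P(x)}{P(y \land x) + P(y \land \lnot x)}
   = \frac{P(y \mid x)\,P(x)}
          {P(y \mid x)\,P(x) + P(y \mid \lnot x)\,P(\lnot x)} \\[6pt]
P(\lnot x \mid y)
  &= \frac{P(y \mid \lnot x)\,P(\lnot x)}
          {P(y \mid x)\,P(x) + P(y \mid \lnot x)\,P(\lnot x)}
\end{align*}
```

The two fractions share one denominator, and each numerator is one of the two terms in that denominator.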
Writing down the end result is actually quite easy, because the denominator is going to be the same as the denominator that I just derived above, whereas the numerator ends up being the second term in the denominator.

Let's look at what's happening with these two expressions that we just derived. We are trying to calculate a distribution. One probability in the distribution is the probability of x given y. The other one is the probability of not x given y. So this distribution says: let's assume y is true. Among all the worlds in which y is true, what's the probability that x is true versus the probability that x is false? We derived two expressions, and notice that if you compare them, the denominators are exactly the same. As for the numerators, in one expression we have the first term in the denominator, and in the other expression we have the second term in the denominator. So here is an alternative way of thinking about how to calculate this distribution. We can calculate the two numerators, and after calculating them, we just need to take these two numbers and normalize them. By normalizing, I mean take these two numbers and divide each of them by their sum. This is exactly what's happening when we divide both of them by the denominator.

If you think about this way of calculating the distribution, you will realize that p of y is simply a normalization constant. The idea is that we have a bunch of numbers, and these numbers might not make up a probability distribution because they might not sum to one. What do we do if we want to convert them into a probability distribution? We take all of these numbers and divide each number by their sum, and that sum is the normalization constant. This is one way to convert them into a valid probability distribution.
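The "calculate the two numerators, then normalize" idea can be sketched in a few lines of Python. The function names `normalize` and `bayes_posterior` and the disease and symptom numbers below are hypothetical, chosen only for illustration; they are not values from the slides.

```python
def normalize(values):
    """Divide each number by the sum so that the results sum to one."""
    total = sum(values)
    if total <= 0:
        raise ValueError("cannot normalize: values must have a positive sum")
    return [v / total for v in values]


def bayes_posterior(p_x, p_y_given_x, p_y_given_not_x):
    """Return (P(x | y), P(not x | y)) without ever being given P(y).

    Step 1: compute the two numerators P(y | x) P(x) and P(y | not x) P(not x).
    Step 2: normalize them; their sum is exactly P(y).
    """
    p_not_x = 1 - p_x
    numerators = [p_y_given_x * p_x, p_y_given_not_x * p_not_x]
    p_x_given_y, p_not_x_given_y = normalize(numerators)
    return p_x_given_y, p_not_x_given_y


# Hypothetical example: x = "has the disease", y = "shows the symptom".
posterior = bayes_posterior(p_x=0.01, p_y_given_x=0.9, p_y_given_not_x=0.1)
print(posterior)  # roughly 0.083 for the disease and 0.917 for no disease
```

Only three inputs are required: p of x, p of y given x, and p of y given not x. The value of p of y appears only implicitly, as the sum computed inside `normalize`.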
So this derivation gives us a new way of thinking about Bayes' rule and also gives us a new way of calculating this distribution. This new procedure has two steps. First, we calculate the two corresponding values, which are these two: the probability of y given x multiplied by p of x, and the probability of y given not x multiplied by p of not x. Then, in the second step, we normalize these two values so that they sum to one. From this description, you can see the minimum set of values we need to calculate this distribution. We do not need to know p of y to start with. The quantities that we need are p of x (we do not need p of not x separately, because p of not x is just 1 minus p of x), p of y given x, and p of y given not x. Given these three quantities, we have enough information to calculate p of x given y and p of not x given y. I hope this gave you a new way of thinking about Bayes' rule.

Let's practice using Bayes' rule. Again, I have two questions, and I will only tell you the correct answers in this video; you can watch the separate video for the process. For this first question, we want to calculate the probability that the alarm is not going given that Dr. Watson is calling. The correct answer here is 0.8, so option C. For the second question, we want to calculate the probability that the alarm is going given that Mrs. Gibbon is not calling. The correct answer here is 0.08, which is option E.

That's everything for this video. After watching this video, you should be able to do the following. Calculate a joint probability over a subset of the variables using the chain rule. Calculate a conditional probability using Bayes' rule. Interpret Bayes' rule in a different way, where you think of the denominator in the equation as a normalization constant. Thank you very much for watching. I will see you in the next video. Bye for now.