Let's finish our discussion of mathematics and the basic principles of data science by looking at something called Bayes' theorem. If you're familiar with regular inferential probability testing, you can think of Bayes' theorem as the flip side of the coin. You can also think of it in terms of intersections. So, for instance, standard inferential tests and calculations give you the probability of the data (that's our D) given the hypothesis. If you assume a null hypothesis is true, this gives you the probability of the data arising by chance. The trick is that most people actually want the opposite of that: they want the probability of the hypothesis given the data. And unfortunately, those two things can be very different in many circumstances. Fortunately, there's a way of dealing with it, and Bayes' theorem does it. This is our guy right here, the Reverend Thomas Bayes, an 18th-century English minister and statistician. He developed a method of getting what are called posterior probabilities. It uses prior probabilities and test information, things like base rates (how common something is overall), to get the posterior, or after-the-fact, probability. Here's the general recipe for how this works. You start with the probability of the data given the hypothesis, which is the likelihood of the data; that's what you get from a standard inferential test. To that you add the probability of the hypothesis, or the cause, being true; that's called the prior, or the prior probability. You also add the probability of D, the probability of the data overall; that's called the marginal probability. Then you combine those in a special way to get the probability of the hypothesis given the data, or the posterior probability. Now if you want to write it as an equation, you can write it in words like this: posterior is equal to likelihood times prior divided by marginal.
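That word equation can be sketched directly in code. This is just a minimal illustration of the recipe above; the function name and arguments are mine, not from the lecture:

```python
def bayes_posterior(likelihood, prior, marginal):
    """Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D),
    i.e. posterior = likelihood * prior / marginal."""
    return likelihood * prior / marginal

# Example: likelihood 0.9, prior 0.33, marginal 0.364
print(bayes_posterior(0.9, 0.33, 0.364))
```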
You can also write it in symbols, like this: the probability of H given D, the probability of the hypothesis given the data (that's the posterior probability), is equal to the probability of the data given the hypothesis (that's the likelihood), multiplied by the probability of the hypothesis, and divided by the probability of the data overall. But this is a lot easier if we look at a visual version of it. So let's go to this example. Say we have a square that represents 100% of all people, and we're looking at a medical condition. This group up here represents people who have the disease; that's a portion of all people. Then we say we've got a test, and of the people with the disease, 90% will test positive. They're marked in red. That means that over here on the far left, we have the people with the disease who test negative; that's 10%. Those are our false negatives. So if the test catches 90% of the people who have the disease, that's good, right? Well, let's look at it this way. Let me ask a basic question: if a person tests positive for the disease, what is the probability that they really have the disease? If you want a hint, I'll give you one: it's not 90%. Here's how it goes. This is the information I gave you before: 90% of the people who have the disease, that's a conditional probability, test positive. But what about the other people, the people in the big white area below, the rest of all people? We need to look at them and see whether any of them ever test positive. Do we ever get false positives? With any test, you are going to get false positives. So let's say that of our people without the disease, 90% of them test negative, the way they should, but 10% of the people who don't have the disease test positive. Those are false positives.
And so if you really want to answer the question "if you test positive, do you have the disease?", here's what you need. You need the number of people with the disease who test positive, divided by all people who test positive. Let's look at it this way. Here's our information. We've got 29.7% of all people in this darker red box; those are the people who have the disease and test positive. All right, that's good. And then we have 6.7% of the entire group; those are the people without the disease who test positive. So to get the probability of disease given a positive test, we take the percentage who have the disease and test positive, and divide that by all people who test positive. That bottom part is made up of two things: the people who have the disease and test positive, and the people who don't have the disease and test positive. Now we can take our numbers and start plugging them in. Those who have the disease and test positive, that's 29.7% of the total population of everybody. We can also put that number right here; that's fine. But we also need the percentage of the total population who do not have the disease and test positive, that 6.7%. So we just rearrange, add those two numbers on the bottom to get 36.4%, and do a little bit of division. The number we get is 81.6%. Here's what that means: a positive test result still means only an 81.6% probability of having the disease. The test is advertised as having 90% accuracy, but if you test positive, there's really only an 82% chance that you have the disease. Now, that's not a really big difference. But consider this: what if the numbers change? For instance, what if the probability of the disease changes? Here's what we originally had. Let's move it around a little bit and make the disease much less common. Now 4.5% of all people are people who have the disease and test positive.
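The arithmetic above can be checked with a short script. Note that the 33% prevalence is my back-calculation from the figures in the lecture (29.7% / 90% = 33%); it isn't stated explicitly there:

```python
prevalence = 0.33      # P(disease), inferred from the 29.7% true-positive figure
sensitivity = 0.90     # P(test+ | disease)
false_positive = 0.10  # P(test+ | no disease)

true_positives = sensitivity * prevalence             # 29.7% of everyone
false_positives = false_positive * (1 - prevalence)   # 6.7% of everyone

# Posterior: people with the disease who test positive,
# divided by all people who test positive.
posterior = true_positives / (true_positives + false_positives)
print(round(posterior * 100, 1))  # 81.6
```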
And then, because there's a larger number of people who don't have the disease, we're going to have a relatively larger proportion of false positives. Again, compared to the entire population, it's going to be 9.5% of everybody. So we go back to our formula in words and start plugging in numbers. We get 4.5% right there, and right there. Then we add in our other number, the false positives; that's 9.5%. We rearrange and start adding things up; that's 14%. And when we do the division, we get 32.1%. Here's what that number means. A positive test result, so you get a positive on the test, now means that you have a probability of only 32.1% of having the disease. That's two thirds less than the advertised accuracy of 90%. And in case you can't tell, that's a really big difference. That is why Bayes' theorem matters: it answers the question that people want answered, and the answer can be dramatically different depending on the base rate of the thing you're talking about. So, in sum, we can say this: Bayes' theorem allows you to answer the right question. People really want to know "what's the probability that I have the disease?", as opposed to "what's the probability of getting a positive if I have the disease?"; they want to know whether they have the disease. Now to do this, and this is the big trick, you need prior probabilities: you need to know how common the disease is, and you need to know how many people get positive test results overall. But if you can get that information and run it through the formula, it can change your answers, and the emotional significance of what you're dealing with, dramatically.
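To see how sharply the answer depends on the base rate, we can wrap the whole calculation in one function and try several prevalences. This is a sketch of the lecture's point, not anything from it; the 1% case is an extra prevalence I've added for illustration:

```python
def ppv(prevalence, sensitivity=0.90, false_positive_rate=0.10):
    """Positive predictive value, P(disease | test+), via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same 90%-sensitive test, three different base rates:
for p in (0.33, 0.05, 0.01):
    print(f"prevalence {p:.0%} -> P(disease | test+) = {ppv(p):.1%}")
```

With a 33% base rate the posterior is about 81.6%; at 5% it drops to about 32.1%, matching the figures above; at 1% it falls below 10%, even though the test itself never changed.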