 Bayesian inference is a way to make guesses about what your data mean based on sometimes very little data. The way it works is tricky, but it's not magic. It's definitely something that you can wrap your head around and it's not impossible to do so. My goal is that by the time we're done talking you'll have a pretty crisp picture of how it works. Bayesian inference is just guessing in the style of Thomas Bayes who was a non-conformist Presbyterian minister. He wrote a couple of books, one about religion and one about probability. So a Bayesian inference is just guessing in the style of Bayes. So it illustrated, imagine that you're at the movies and someone drops a ticket. You pick it up and you can see them from behind. You know they have long hair, but you don't know whether they're a man or a woman, so you have to make a guess. Based on what you know about the attendees at your movie theater, you might say, excuse me ma'am, is this your ticket? Now imagine instead that this person is standing in line for the men's restroom. Knowing this extra piece of information, you might make a different guess. Bayesian inference is a way to capture this common sense knowledge about the situation and help you to make better guesses. So to put numbers to this dilemma at the movie theater, let's assume out of 100 women at the movies, 50 have short hair, 50 have long, and out of 100 men at the movies, 96 have short hair and 4 have long. In this case we can see that there are definitely more women with long hair than men with long hair. So it's a safe bet to assume this person's a woman. Now we just made a subtle assumption that there are about the same number of men and women at the movies. This assumption no longer holds when we move to the men's restroom line. Here, let's say there are two women out of every 100 people and 98 men, maybe women keeping their partners company. There's still one with short hair and one with long hair. There's still half and half long and short hair. But now there are four times as many men with long hair than women with long hair in this group. Now the safe money is to bet that this person is a man. So to draw this a little differently, out of 100 people at the movies overall, we'll make this assumption explicit that 50 of them are women, 50 of them are men, so this is how the different categories break down. In the line for the men's restroom then, they break down a little differently. So to translate this to math, the probability that a person is a woman is the total number of women divided by the total number of people, 50%, similarly for men. Moving to the men's restroom line, the probability that someone is a woman is 2%, 98% for men. Now Bayes' theorem is a little bit tricky, so to be very precise we're going to have to talk math. So if you bear with me for just three probability concepts, we'll lay the foundation for presenting Bayes' theorem. The first one is conditional probabilities. If I know that a person is a woman, that's the condition, what's the probability that that person has long hair? So it's written as probability of long hair given that a person is a woman. So to get this, we just divide the number of women with long hair by the total number of women, 50%. And this doesn't change whether there's 50 women in your group or 2 women in the group, still if we know that a person is a woman, the probability that they have long hair is 50%. We can do the same thing with men, probability that someone has long hair given that they're a man is 4%. So conditional probabilities, if I know that b is the case, what's the probability that a is also the case? This is not, you can't reverse b and a and have this be true. So for instance, if I know that the thing I'm holding is a puppy, what's the probability that it's cute? The probability is very high. If I know that the thing I'm holding is cute, what's the probability that it's a puppy? Well, it might be a puppy, it might be a kitten, it might be a hedgehog, it might be a small human. There's lots of things that it could be, so the probability there is less moderate. So these things are not interchangeable in conditional probabilities. Now concept 2, joint probabilities. So what's the probability that a person is both a woman and has short hair? So to calculate a joint probability, you first find their conditional probability. Well, if I know that they're a woman, what's the probability that they have short hair? And then you multiply that by the probability that they're a woman. So in this case, 0.5 times 0.5, we get a 0.25 which is exactly what we knew it was going to be. And the same is true for all of our different conditions. So we can do this for the men's restroom too. The probability that someone is a man and has long hair, 4%. Someone is a woman and has long hair, 1%. Joint probabilities are different than conditional probabilities. Here, the probability that A and B is the case is the same that the probability that B and A is the case. So the probability that I'm having a jelly donut with my milk is the same as the probability that I'm having a milk with my jelly donut. These two conditions, these two situations are identical. So we can swap the order. And finally, concept 3, marginal probabilities. If I wanted to say, figure out the probability that someone has long hair, I just add up all of the different ways that someone can have long hair. They can be a woman with long hair or a man with long hair. In the men's restroom line, that's a 1% probability plus a 4% probability or a 5% probability overall. And you can do the same thing for short hair, 95%. Now, this last concept finishes our foundation. We can get to what we really care about. We know that this person has long hair. What's the probability that they are a man or a woman? This is a conditional probability, but it's the reverse of the one that we know. We don't know how to answer this yet. So this is where Thomas Bayes comes in. He noticed something really cool. You can calculate the joint probability that someone is a man and has long hair using the formula we saw before. And you can also calculate the joint probability that someone has long hair and is a man. Now, these are different formulas, but we remember joint probabilities don't care about the order. So these two things are equal, which means the stuff that they're equal to on the other side are also equal to each other. And we can do a little algebraic sleight of hand. And now we have a formula for what we want to know. If someone has long hair, what's the probability that they are a man? We have this expression to the right. We can genericize that with A's and B's, and now we have Bayes theorem. One of the top ten most popular math tattoos of all time. So using this formula, we can go back to the movie theater and plug in what we know. We know that the probability that someone is a man. We know the probability that if they're a man, they have long hair. And we know the conditional probability, or sorry, the marginal probability that someone has long hair, which is just the probability that someone's a woman with long hair plus the probability that someone's a man with long hair. And we plug all that in and we can say if someone has long hair at the movie theaters, there is a 7% chance that they are a man. Similarly, 93% chance that they are a woman. Now, if they're in line for the men's restroom because some of those probabilities change, that conditional probability changes. Someone's in line for the men's restroom and has long hair, there's an 80% chance that they are a man. And this is consistent with what we saw before. We know that there are four men and one woman for every 100 people in line for the men's restroom that have long hair. So 4 out of 5 long hair people are men, 80%. It all adds up. So this example shows the mechanics of how to get Bayes' theorem and how it works. In practice, it's usually used a little differently. So to show this, we'll have to do a little bit of a detour and first talk about probability distributions. You can think of probability like a pot with just one cup of coffee in it. You can fill up, if you just have one cup to fill up, you can fill it all the way to the top. But if you have more than one cup, you have to share it around or distribute it. And you can share it with any proportion you want. So for instance, if we're representing the number of men and women at the movies, we could share it 50-50. But it will always add up to 100%. We could even share it further into different categories. So here we see the joint probabilities of all of our four different categories that we were working with. And you can see that this is just another representation of the distribution representation that we were looking at before. Now usually when we look at this, they're side by side probability instead of percentage and shown in a histogram like this. It can be really helpful to think of these as beliefs. So for instance, if I flip a coin and hide the result from you, you might half believe its heads and half believe its tails until I tell you what it is. If I roll a die and hide the result from you, you might believe about one-sixth that it's a one or a two or three or four or five or six until I show you the result. So these are what you believe probabilities can represent what you believe about something before you measure it. Similarly for powerball tickets and even for things that are more complicated, like let's say the height of adults in the United States in centimeters. You might believe that there's a very small chance that they'll be over 210 centimeters and a smallest chance that they're less than 150 centimeters. And then assign various amounts of this belief to all of the ranges in between. It still adds up to one, it's like a bunch of cups of coffee lined up in a row and you put a little bit in each one, the cups in the middle have more. And it shows how your belief about some individual is distributed before you actually measured them. Now you can take these bins and chop them more and more finely again and again. And if you keep doing this, you can get to something that's actually perfectly smooth. So it's as if you had an infinite number of very tiny cups and you put a tiny bit, infinitesimal amount of coffee in each one. At this point it's probably no longer helpful to think of it in that terms, but just thinking of it as a continuous distribution. Showing for all these heights, where am I placing my bets? What do I believe and how much? So once you have your beliefs you can use it to answer questions about typical heights, the average, the median value, the most common value or the mode. Now we'll use this in weighing my dog. I have a shitsuit named Rain of Terror. She's a little mischievous and when we go to the veterinarian rain squirms on the scale. So every time we weigh her we get a different weight. This last time we got 13.9 pounds, 17.5 pounds, and 14.1 pounds. What we want to know is how much rain weighs. And this will be the basis for a decision. This is important. If her weight has gone up, her food intake will have to go down. And for her this is a matter of life and death. So we don't want to make the wrong assumption and draw the wrong conclusion. So if you've ever taken a statistics class before, you know you can take these measurements, add them up, get the average, 15.2 pounds. You can calculate the standard deviation of these three measurements and also the standard error and come up with a 1.16 pound standard error, which when you show it graphically this red curve now shows the belief that results from those three measurements, the distribution. The peak of that hill is at 15.2 pounds and one standard deviation on that curve is our standard error of 1.2 pounds. So you can see looking at this that yes it's most likely that she's 15.2 pounds, but there's a lot of that curve that sits outside of the range of 14 to 16. So yeah, she's probably between 14 and 16 pounds, most likely between 13 and 17 pounds, but she might even be lower than 12 and higher than 18. That is a really wide range and it's not a great basis for making a decision. Now you can see the three measurements there, those three white vertical lines. And you can see why our belief is so dispersed because those three measurements are pretty dispersed. It's hard to capture all that in one distribution. So let's try it again with Bayes theorem. So the way we'll do this is instead of A and B we'll substitute in W for her actual weight and M for the measurements that we took. Now this term over here, the probability distribution of the actual weight is our prior. This is what we believe about her weight before we put her on the scale. The probability given a weight of getting certain measurements are the likelihood associated with those measurements. And then the posterior is what we believe about her weight given those measurements. So you can think of this as we start with a belief, we take some measurements and we update it, and then we have a new belief when we're done. This term on the bottom we're going to ignore for the most part, it'll be a constant, but it's called the marginal likelihood. So we're going to start by not assuming anything about her weight. It could be one pound, 10 pounds, 20 pounds, 100 pounds. We're going to let this be uniform and we're going to let the data speak. So now our formula looks like this, we can further simplify it. And so we want to calculate this. We want to calculate the probability of our measurements occurring given a weight. And we want to do this for all of the possible weights. And then we'll end up with a new distribution, which is our belief. What's the probability of each of those weights occurring given the measurements? These two things are identical. So let's start for instance by assuming what if she weighed 17 pounds in reality. Now because our measurement process is very noisy, as we saw, if she weighed 17 pounds, we would expect those measurements to be distributed as shown here. Some would be up way above 18 pounds, some would be down around 14 pounds where we actually measured, but not very many of them would be. So to calculate now the probability of our measurements occurring given this weight, we find what the probability of each individual weight is of occurring. And we multiply that times that times that. Now these two are pretty small when you multiply two small things together. They're very small. So the probability of her being at 17 pounds is pretty small. We shift our belief over and say well what if she was 16 and a half pounds? What if she was 16 pounds? And we recalculate it each time, multiplying all of those actual probabilities together. And then by the time we're done, this is what we've measured at each of those weights. This is the likelihood of each of those occurring. And you can see that the maximum likelihood occurs at 15.2 pounds. And this is commonly called the maximum likelihood estimate, where you start with uniform assumptions on your weight. And it just so happens that the standard error on this is exactly what we calculated before. A very cool thing connection here. When you take the average and calculate standard deviation and standard error, it gives you the likelihood that you would get by doing Bayes method and assuming a uniform prior. Not assuming anything about what the results going to be. So we've already established though that that's a really broad result and not helpful. So we need to start over now and let's start with what we know. So some background information. Rain was 14.2 pounds the last time we went into the vet. And she doesn't seem noticeably more heavy to me. My arm is not that well calibrated. But I'm going to assume that she's within about a pound of where she was before. So I take that prior and this is the form that it takes. A normal distribution centered on 14.2 pounds. And you can see that most of that bulk is within plus or minus a pound. And it extends a little bit further out. Allow for the possibility that she's actually gained a lot or lost a lot of weight. But probably she's pretty close. This is what I believe before I even put her on the scale. This is the probability, the prior, the probability of her being a given weight. So this time we're not neglecting the prior term. We're not setting it constant. We're going to use it. And the way this plays out now is we assume, okay, what if she were 17 pounds? Well, we need to multiply that now by the probability of our prior showing that she's 17 pounds. Which actually makes that quite small. Now we calculate and multiply the three probabilities of our measurements occurring. So now we have something small times something very small times something very small. So we get a very small result probability that she actually weighs 17 pounds. And now we repeat this process at 16 and a half pounds and 16 pounds and 15 and a half pounds and 15 pounds all the way through. And then by the time we're done, we tally up all of those and we get this new posterior distribution. It's normally distributed at about 14.1 pounds and it has a standard error of less than a pound. You'll notice it's even narrower than our original prior. So we've taken our original belief and we've been able to sharpen it up just a bit. And so incidentally, the peak of this curve is called the maximum a posteriori result. If we had to choose one value to represent our belief, that's not a bad one to choose. And now we compare this with our original estimate. It's labeled non-Bayesian here, but more accurately it could be Bayesian with a uniform prior. You can see that it is much broader and also the peak of that curve is in an entirely different place. So the answer that we got, it's more confident because it's more centered and it's probably based on what we know closer to being correct. So this is how Bayes' theorem is used most often in data science or in analysis. It's a prior that you then update based on your measurements to sharpen up and get a revised set of beliefs. So there's a lot of times that it makes sense to use Bayesian inference. Sometimes we just know things, like if we're measuring age, we know that everyone is more than zero years old. And so we can take that information and build it in and we can get sharper estimates with fewer measurements. Now, it should, if you're paying attention, make you a little bit nervous. We humans are actually not always aware of what we believe and putting it into a mathematical distribution can be very tricky. More importantly, the reason we're measuring something is because we want to learn about it. We want to be able to be surprised by our data. So if we believe something that's not true, it can make it hard or impossible to learn from our data. I like how Mark Twain phrased this. He says, it ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so. So the way to avoid this pitfall is to always believe things that we think are impossible, at least just a little bit. So by leaving this room for something to be impossible, we can do like Sherlock Holmes says, and once you've excluded the impossible then whatever remains, however improbable, must be the truth. We don't want to exclude the improbable out of hand because then we're left with nothing. Alice in a conversation with the Red Queen summed it up well too. There's no use in trying, she said. One can't believe impossible things. I daresay you haven't had much practice, said the Queen. When I was younger, I always did it for half an hour a day. Why sometimes I've believed as many as six impossible things before breakfast. So the secret to using Bayesian inference as well is to keep believing impossible things. Thanks for your attention. Here's how you can get in touch with me if you'd like to carry on the conversation. I look forward to talking with you again soon.