Okay, welcome back everybody to the second lecture of Statistical Rethinking, 2019 edition. We're going to pick up exactly where we left off. I've got two main objectives today. The first is to give you an intuition for how Bayesian updating works, how a Bayesian golem learns from experience, and the second is to give you some intuition about how you build these models as well, and we'll flip back and forth between doing both of these things because they're intertwined. The assumptions that you build into your model also explain how it learns, because that's the information it uses, that structures experience, right? There's no learning without assumptions, and so when you make assumptions, at the same time you're building up how learning works. So I think I ended on this slide last time, was that right? This is what I think of as our schematic of how we do statistical data science. We design a model using the scientific background we have about how the data are born, that's our data story. Then we update the model conditional on the data, and that's always done the same way, Bayesian updating. There's only one way to do it, there's only one kind of estimator in Bayesian inference, which is nice. Computationally that can be tricky, and we'll deal with that later in the course. Then you have to be critical, because of the small world, large world distinction. The things that the model believes are only credible inferences if the model is credible, and the model may not be credible. So there's this loop where you have to supervise; you never relinquish control to the golem in this business. So let's actually go through an example where we have some data and we have an initial model, and then the model updates in light of the data. In chapter two of the book I go through this example, and I'm going to go through it with you in lecture today, where we throw an inflatable globe. And I really used to do this: I taught undergraduate statistics at the University of California, Davis, and we threw the globe. Do I have your permission to throw it at you? Yeah, some people are like, no. All right, well, Hun wants it, right? So here's the goal. You catch it and then you look at your right index finger. Is it on water or land? Water. Okay, water. So we've got a water sample. Ah-ha, I predicted it. My slide says the first one will be water. All right, now we're going to go on with this data stream, but you can toss it to someone else, pick a friend or enemy. Nice, just float it down to them as if it was meant for their hands. And water or land? Land. Land, ah-ha. We're on a roll here. Okay, let's do one more throw and then this tiresome exercise will end. Whoa, very nice. Right index. Yeah, ah-ha, very nice. Okay, perfect. All right, you can hang on to that, as long as I get it back later. I love that quote. So here's the idea: we would continue this sampling. Eventually, the live samples from the audience would deviate from my pre-formatted vector, I expect. Otherwise, I'm going to start to get worried. But the idea is that this is a sampling process. The data are getting generated, and I like this example because you can see exactly how they're getting generated. And this also carries forward the small world, large world distinction. We're throwing the small world around to get sample data, so we can estimate some property of the large world. What are we trying to do here? We're trying to estimate the proportion of the Earth's surface that is covered by water. That's the goal.
And you can, I assert, get a pretty good estimate by repeatedly tossing this globe. The question is how many times? And so in your homework, I will ask you to consider how many times you need to toss this globe to get an estimate of a certain precision. But for today in lecture, I just want to use this to teach you how Bayesian updating works mechanically, so that you understand, for any sample size, what the information justifies. So imagine we toss it. How many times is this? One, two, three, four, five, six, seven, eight, nine times. We get this sequence where there are three land samples and six water samples. And now the question is how we should use this information to construct an estimate of the proportion of the globe that is covered in water. Yeah. Leave aside until later the issue that, well, this globe and the Earth are a little bit different. This is better than throwing darts at a 2D map, which would be a terrible idea. You could try that as well. Okay. Think back to the model design cycle. We're in the design phase now. We're going to spend several slides on this. I'll give you an outline first, design, condition, evaluate, and then we're going to turn to each in some depth and work through it. Okay? So what's the data story here that lets us build a model? What do you know? You know a lot about how these data have arisen. You saw it with your own eyes. A professor threw something into the audience and someone caught it. We are willing to believe that the process of where Hun's right index finger landed is totally random with respect to my intent. There's no way that I could rig the throw. Well, now you wonder, because I got the first three exactly right, but there's no way I could rig the throw so that I could guarantee you would catch it and get a water sample. So each throw is random, and the probability that you get a water sample at any given catch should be approximately the proportion of the globe that's covered with water. Does that make sense? Because it's a chaotic system, right? It's like a coin flip. It's a deterministic physical system, but it's chaotic, so tiny differences in the initial conditions result in essentially random outcomes of the system. It's not that it's truly random; nature is deterministic, at least at this scale. Let's ignore quantum mechanics for the moment, actually for the whole course, let's ignore quantum mechanics. But at the scales and low energies that we live at, we mere mortals, the universe is deterministic, but it's a chaotic system, and so it appears random because we're deeply ignorant beings, right? And that's why coin flips work as randomization devices: because you can't possibly measure the initial velocity and position of the spin well enough to predict how it's going to land. That's the only reason. Newton wrote down most of what you need to get it right. So that's true of our globe too. So we say there's this random number that's generated, but each toss should have a probability p of turning up water, where p is the proportion of the globe covered in water. And then finally, all the tosses are independent of one another. It doesn't matter if the last one was water; you've got the same probability p of water the next time. There's no strong autocorrelation between the throws. This is an assumption. We would need to demonstrate this and be critical of it, right?
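As a sketch of that data story in code, here's one way to simulate the sampling process in R. The value 0.71 is just an assumed illustrative proportion for the simulation, not something we know yet; that's the thing we're trying to estimate.

```r
# Simulate nine catches: water with probability p_true, land otherwise
p_true <- 0.71  # assumed for illustration only
tosses <- sample(c("W", "L"), size = 9, replace = TRUE,
                 prob = c(p_true, 1 - p_true))
tosses              # something like "W" "L" "W" ...
sum(tosses == "W")  # the number of water samples
```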
And I think in the chapter, there's a section in which I show you how to do a calculation to inspect that independence assumption. But I'll leave it out of the lecture. Okay, does this make some sense, what we're doing? This is the simplest possible example I could think of, but we'll do these kinds of stories for a lot of models in the course, so you understand where the assumptions come from. And what we're doing with this information is we're building up the garden of forking data. We're trying to say: what are the relative numbers of ways we could get the data we actually got, given the process that generates all the possible data sets that could have been born from the process? That's what we need, the garden of forking data. And actually, we want a mathematical function for that, so that we don't have to draw the garden, like we did with the marbles. This example is like the marbles; it's just that there's an infinite number of marbles in the bag, sort of, right? Because we're estimating p as a continuous thing. Okay. The conditioning step is the use of Bayes' theorem: Bayesian updating. I'm going to build this up for you in this lecture, first graphically. We're going to animate through the whole Bayesian updating with the globe tossing, one toss at a time, and see how the model learns on each toss. And then we'll do the calculations, and I'll show you how to do it computationally. Right? And you're going to use those computational steps in the homework, which is already up on the website. There are three problems in which you'll use those basic calculations. You'll do it in R. You're basically cranking through Bayes' theorem by hand. Well, your computer will do it by hand; you'll be scripting it, and then it'll do it. So the idea is that you give your model some initial plausibility for each of the values of p. In this case, we're going to say, quite incredibly, that every possible value between zero and one is equally plausible before we toss the globe. This is incredible because, of course, you are an Earthling. You know a lot about the Earth. Before I've ever tossed the globe, if I asked you to give me a guess, I assume most of you know that more than half of the Earth is water. Covered in water, rather. And so we could do better than the flat zero-to-one prior, and I will ask you to do that in your homework. But for lecture, we'll do the bad prior, and then in homework, you can do a little better. OK. And then there's this magical thing called conditioning. For the moment, just treat that as a black box, and later today I'll show you that it's not very mysterious at all. It's very easy what conditioning means here. Basically, it means slicing off possibilities, just like in the garden of forking data. There were some things that became impossible because of the data you saw, and when you remove those, other things become more possible. And that's what happens in Bayesian updating. So let's think about this graphically. Again, I'm just going to walk through this like a cartoon, and we'll do the calculations later. So before the globe is tossed, we've got this, or rather, the model, not you, the model has some prior set of information states about each of the possible values of p. And it's assigned equal plausibility to all of them. That's what this prior is here, this horizontal dashed line: the prior probability of every value of p. There's an infinite number of them between zero and one. Yeah?
Those are all the possible states the globe could really be in. Imagine all the possible planets. And then the first data point arrives, and the model's going to update on this. And what happens in this case, and again, I'll show you the calculations to reproduce this graph later on, and the book has all the code, so you can look at the book and reproduce this graph, what happens is that the prior is transformed into a new distribution called the posterior. But it's the same type of thing. This is what Bayesian updating does. Every time you observe something, if there's any information in that observation relevant to the thing you want to learn about, in this case the value of p, and observing a water sample contains information relevant to that, the model figures that out automatically. You don't have to. One of the things I also like about Bayesian inference is that it obviates the need to be clever. So you don't have to know if something's relevant or not. If it's not relevant, the model will just tell you exactly what it believed before. And then you're like, oh, I guess that was not relevant. Then you have to figure out why your intuitions were wrong. But it's a nice feature of Bayesian inference. In this case, it's definitely relevant. If you keep observing water, that means there's a lot of water. And so now we get this slanted line. Higher values of p are now more plausible than lower values of p. And p equals 0 is now exactly impossible. Why? Because you've observed water. Yeah? This is not Mercury. You're on some planet that has some water. We do the second sample. Now I'm going to animate through them all up top. So I put, in small form, our n equals 1 transformation from dashed prior to solid posterior in the upper left. And now the second plot here is getting our second sample, our land sample. Now the dashed diagonal line is the posterior from the previous figure. This is how Bayesian updating works. Every posterior is a prior for the next observation. And it just keeps getting updated. Now we observe land, and the posterior now is this hill, this gentle hill, where one half is the most plausible value. But there are a bunch of values near a half which have very similar plausibilities, because you've only got two data points. Come on. You're not going to get a grant with this. There's still a lot of plausibility over a very wide range. But now p equals 1 is also impossible. It's been ruled out by the data. The garden of forking data has got zero ways to produce two samples with a water and a land if p equals 1. Yeah? What about the third one? We get our third sample. It's also water. So now we go from the dashed symmetrical hill to an asymmetric hill that is biased towards more water, because now we've got two waters and one land. There's something exact about the shape of that bias, of course. It's not arbitrary. It's deduced using the rules of probability theory. But you can get an intuition about why it's shifted to the right without knowing those details. Yeah? Does this make sense? So, more quickly now, three more samples in the middle row. N equals 4, we get another water. It shifts even more to the right. N equals 5, another water, and it shifts even more to the right. At this point, the model has overestimated the coverage of water on the planet, right? Things have gone wrong. It's a small sample. It's overestimating, but also notice it's not very certain about anything.
There's a lot of plausibility assigned to a very large range of values. But at this point, it's pretty confident that more than half the planet is covered in water. Which, of course, you knew before you threw the globe. But the model knows nothing when you start it out, because that's the way we programmed it. Then at N equals 6, we get another land sample, finally. And it shifts back to the left as a consequence of that. You see how it jitters around every time there's another sample. And the posterior distribution from the previous plot is the prior distribution in the next plot. Does this make sense? And then the last three, N equals 7, 8, and 9. Same story as before. We get a water, it shifts a little to the right. We get a land, it shifts a little to the left. We get another water, it shifts a little to the right. Notice it's shifting less and less, because there's more weight of experience already embodied in the posterior distribution the larger your sample gets. And so each additional data point contributes less marginal information, the more the model has already learned. And the model gets that right automatically, just because probability theory makes sure it does. Again, that's why I like Bayesian inference: it means I don't have to be clever. I don't have to realize that I need this phenomenon to happen. It just happens as a consequence of the way probability theory works. It's nice. Good? Again, what I want you to get from this is just some intuition about what updating is. We'll do the calculations later, and you can reproduce this graph. OK. So this is the conditioning step. Just to summarize what we've done, we've gone through each data point, and the posterior distribution at each step becomes the prior for the next data point that's observed. We did this one data point at a time. But usually when we do statistics, we don't do that. We just throw the whole sample into the model, update all at once, and get the same answer. It works the same way. You can trickle the data out to the model one at a time, and it sits hungrily, like a cat waiting for individual treats. There must be other cat owners here. We'll use a lot of cat examples in this course. Or you could just give it all the treats at once. It'll get just as fat. And that's the way these models work.
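Here's a minimal sketch of that treat-by-treat updating on a grid in R, assuming the lecture's toss sequence of six waters and three lands. The grid and loop are my own illustration of the idea, not the book's exact script.

```r
# One-toss-at-a-time Bayesian updating on a grid (illustrative sketch)
p_grid <- seq(from = 0, to = 1, length.out = 1000)
prior <- rep(1, 1000)  # flat prior to start
data_seq <- c("W", "L", "W", "W", "W", "L", "W", "L", "W")  # the lecture's 9 tosses
for (obs in data_seq) {
  # probability of this single toss, for every value of p on the grid
  likelihood <- if (obs == "W") p_grid else 1 - p_grid
  posterior <- likelihood * prior
  posterior <- posterior / sum(posterior)  # standardize
  plot(p_grid, posterior, type = "l")      # the hill shifts with each toss
  prior <- posterior                       # posterior becomes the next prior
}
```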
There's a special assumption in this model that the order is invariant, because the tosses are independent, but that may not be true in all models. It's still true that the sequence is observed, and in general you can't break up that sequence; it's just a special property of this model that you can shuffle the sequence and it won't change the answer. But still, the data do have a sequence, and you should pay attention to that. OK. Oh, a note at the bottom. I just wanted to say this, because I often get the question: where is the sample size? The sample size is this sort of looming thing in a lot of non-Bayesian analyses; you've got to keep track of your degrees of freedom and your sample size and all that. The sample size is embodied in the shape of the posterior distribution. You don't have to track it. It's already there. Say someone's done an analysis and you've got their posterior distribution. You can use that as a prior, and you don't need to know independently what their sample size was. It's already been taken into account properly. I know this is weird. But it's great. Again, you don't have to be clever. So what happens is that the sample size makes the thing more and more peaked, and how peaked it is embodies the sample size in a particular way. But there's no additional information in knowing the exact sample size that is going to help you estimate p. OK. Finally, the evaluation stage. You should be critical of all models here. You tossed the small world, but we want to make an inference about the large world. None of us actually cares about the coverage of water on my inflatable globe, and it's probably not exactly the same as the real world anyway. So you have to be critical about this. Is there something about the way the data are collected, the measurement process, that might create biases? Is there autocorrelation in the throws, and so on? Is there a colorblind person in the audience who can't tell blue from green? Maybe there was a blue country; someone put their finger on it and called it water. There are biases that may happen. These are silly examples, but in our actual science, less laughable things happen, and we should pay attention to those things. So what we're going to do a lot in this course is what are called posterior predictive checks. Those are ways of checking the sensibility of what the model has learned. You have to supervise the thing. You can't just let it run and then trust it. Well, you could, but that would be a bad idea. OK. Now let's turn to what I call the construction perspective. Let's really build this thing up and make the model. What I just showed you was all behind the scenes; there are scripts that generate those graphs, and they're in the chapter. Now let's think about what's required in that first step of building the model, going from the data story to actually building it up. And the construction process is pretty stereotyped across different situations, even though the scientific details are very different. So it'll feel different, but there's a set of steps you can go through which will help you in every case. And as this is the simplest model that I know about, we're going to do it with this. A bit of an internet joke here, probably an old enough joke that no one gets it anymore. Question marks, profit, anybody recognize it? No? One person. All right. So in step one, you list all the variables. The variables are the things you can observe and the things you can't observe that you want to know, or need to make inferences about to learn the things you need to know. I'll list these on the next slide for this case, and it'll become less mysterious. And then you have to define the relationships among these variables. Which ones do you need to know to say how plausible particular values of the other variables are? Again, I'll show you exactly how we do this in this case. And then, yeah, you have to do some other steps before you profit. That's the Bayesian updating part. When you build the model, what you're building is the joint prior distribution. Joint here means the prior probability of the data and the parameters, because the data also have a prior probability. Before you've thrown the globe, you still know things about the possible data sets that can come out. How many outcomes are possible? Two. And you have some sense of the scatter over multiple samples, because they will follow a distribution, a sampling distribution. And that sampling distribution, you may recognize this, those of you who've done a lot of non-Bayesian statistics: that sampling distribution is a prior for the data. When you get the actual data, it will be part of that sampling distribution.
It will be one slice from that garden of forking data which comprises the sampling distribution. But the sampling distribution is a prior probability distribution in Bayesian inference. OK. And then what we deduce is the joint posterior, by slicing off the stuff that has been ruled out by the exact observation we get. So, to put some meat on this: we've got exactly three variables in this case, which is nice. There are no others that we're going to worry about. N, p, and W. N is the number of tosses. Think of this as the sample size, if you want: the number of times we toss the globe, and that's 9 in this case. There's p, which is the true proportion of water on this inflatable globe. And there's W, which is the number of waters we observed. You could also be redundant. You could have a variable L, which is the number of lands you observed, but you can compute that from N and W. Or you could get rid of N and have W and L, and then W plus L is N. Am I making sense? And so in the chapter, I do these two configurations, in case that's any help for you. The arrows are meant to show that, generatively, N and p cause W. The relative numbers of ways that W could get a particular value, like 6 in this case, depend upon the values of N and p. But the values of N and p don't depend upon W causally. The value of p does depend upon W inferentially. So these arrows are about causation. They're not about your inference. What Bayesian inference does is go backwards against the causal arrows. Something has been caused, and then we trace the arrows backwards to make inferences about the cause. I know that sounds a little bit weird. Yeah, OK. If you're confused, that's just because you're paying attention. This is not how organisms are designed to think. But it's a really useful thing to do. So be patient with me. I guarantee this will pay off. So two of these variables have been observed: N and W. You've seen them. You've measured them. And if you believe your lying eyes, then you've got values for those. But one of them has not been observed, and that's p. We have to infer it from the other two. That's the job of the model. Does this make sense? This observed, unobserved distinction is a property of your measurement process. The same variable can be observed or unobserved in different studies. So N, for example, is observed here. But there's lots of ecology work where N is the population size, and you don't get to observe it, and that's your inferential target. Any ecologists in the audience, you can do these sorts of models, and I love models like that. Maybe we'll do one in the second half of the course. Same basic model structure, but you must infer how many times the globe has been tossed. Good times. And it's doable. This is mark-recapture. Those of you who've heard of mark-recapture can do mark-recapture this way. We're not going to do that today. OK. So, the definition of N: I should say it's fixed by the experiment, not estimated. We determined it by how many times we threw the globe. What's the definition of W? What we want to calculate is the relative number of ways we could see W given N and p. We want a function for that, a mathematical function, so that we don't have to draw the garden, the infinitely large garden. Usually these functions, which assign the relative number of ways to see any specific value of W, are called probability distributions.
In this case, this is a probability distribution I think a lot of you know. But indulge me for a second. I just want to spend a minute building it up. I know most of you will have seen this before, but if you haven't, then maybe you'll enjoy this a little bit. So this turns out to be a famous probability distribution, but it's very easy to build up from first principles. Let's consider a sequence of only three tosses, the first three in our sequence: water, land, water. And then I ask, what's the relative number of ways we could see this sequence, given some value of p? We don't know p, but just given some value of p, any value between 0 and 1, well, it must be that you had a chance p of getting the first blue, and then you had a chance 1 minus p of getting the green, because 1 minus p is the probability of land. Agreed? There are no other outcomes. We're not counting lava, or, what else could there be, outer space. And then there's another chance p for the second water. These are multiplied because they're independent. When things are independent in probability theory, you just multiply them: the chance that independent events happen together is the product. It's called the product rule. Why do we multiply? Think back to last lecture: multiplication is just counting. It's the way to count up all the stuff in the garden of forking data. Remember we got to that step where I multiplied to compress the counting? That's exactly where the product rule comes from. It's nothing but counting up steps. And then we can compress this: it's p squared times (1 minus p) to the 1, where the 2 and the 1 are just the numbers of each type of thing you observed. You with me so far? This is all it is. This is just basic probability. The final step we need to get our definition, our probability distribution for W, is to note that these things could be shuffled. There are other ways to get a sequence with only one land and two waters. There are different orderings of these things, and we have to account for how many orderings this could happen in as well, because we don't care about orderings in this model. The tosses aren't autocorrelated; there's no cause spilling over from one to the next. So what are all the ways we could get two waters and one land in three throws? You have to account for all the possible orderings that could happen. Sure, you've only seen exactly one of them, but it could have been any of the others. So if all you're doing is predicting the two out of three, you have to account for all the orderings. And in this case, there are three possible ways this could happen. Yeah, for really long sequences, there's a really terrifying number of different orderings that can produce any particular composition, and that's what combinatorics is for, and that's why we have formulas. But in this particular case, the answer ends up being that the probability of two waters in three throws, with probability p of water on each throw, is 3 times p squared times (1 minus p) to the 1. And this formula, if you extrapolate it out to any length of sequence, gives you a very famous distribution called the binomial distribution. And again, I know all of you have seen this thing before, maybe once upon a time. But it's just built up from that same argument. And this monster standing in front, with the factorials, is just the combinatorics term. It counts all the different orderings that give you a sequence with a certain composition of water and land. That's its job. It gets big really fast.
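As a quick numerical check of that build-up, with p = 0.5 chosen arbitrarily for illustration:

```r
p <- 0.5  # any value between 0 and 1 works here
3 * p^2 * (1 - p)^1           # three orderings, each with probability p^2 * (1 - p)
choose(3, 2) * p^2 * (1 - p)  # the same thing via the combinatorics term
dbinom(2, size = 3, prob = p) # R's binomial function agrees: 2 waters in 3 tosses
```

All three lines return the same number, 0.375 in this case.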
That combinatorics term is called a multiplicity. We'll talk about it in the second half of the course, because it's very special in Bayesian inference. It has a very strong role in the garden of forking data: it governs the size, the branching possibilities, of the garden of forking data. But you can postpone thinking about that too hard. So here, all I want you to get is that, given our data story, conditional on how we think the data came to be, this is the mathematical function which will give us the relative number of ways to get any particular observation of W given N and p. That's all it is. It's just a compressed mathematical formula for the garden of forking data that we did last time, the one we had to count up by hand. Remember that, with the marbles in the bag? This is the same. That was also a binomial system. This formula would describe those counts. You with me? Yeah? OK. We will not be doing much analytical math in this course. In fact, I think zero analytical math in this course. However, I'll be asking you to do a lot of coding. Why? Because you're all scientists, and that's what you came here to learn. You're going to be applied statisticians, and so you're going to be producing inferences out of your computer. And many of the problems we will be inspecting will just be too onerous to do by hand anyway. So you need the computer. So we're just going to start with the computer. We're going to use R, not because I have any strong affection for R, but because R, much like the English language, is spoken by a large number of people, and so we might as well use it. It has no inherent virtues relative to other languages. But since everybody knows it, why don't we use it? It's just kind of like that. If everybody was speaking Latin, we'd use Latin. There was a time when that was true, but it's not true anymore. So R is great in that regard. And you can send it to colleagues, who can download it for free and run it. So it's good for open science processes in that regard. So we're going to use R. And R is great for doing statistics. It was built exactly for that purpose. So it has built into it a bunch of handy functions that compute these relative numbers of ways of getting different data sets. In this case, dbinom: the d is for distribution or density, binom for binomial, and then the different parameter arguments correspond to our mathematical formula. And you'll try this out, if you haven't already, when you go home and do your homework. The last thing we need in our model is the prior probability for the parameter p. And as I said before, we're assuming this flat prior. This is not a very smart prior, because you have more information than this. But it's easy to think about, at least. And there's a probability distribution for this. It's called the uniform. You just assign the same plausibility to every possible value. There you go. And these two sets of assumptions together, the probability of W from the previous slide and the probability of p, define the prior predictive distribution. Before the model's even seen anything, you can force it to make predictions. You can force simulated data out of it. That's called the prior predictive distribution. In this case, it would just predict that the water counts would be uniformly distributed over the whole space. It's not very interesting. In later chapters, this is going to be a very interesting object, because it's a way to design sensible priors or detect really crazy priors.
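Here's a minimal sketch of that prior predictive simulation under the flat prior: draw values of p from the uniform, then simulate a count of waters for each draw.

```r
# Prior predictive distribution for 9 tosses under the Uniform(0, 1) prior
p_samples <- runif(1e4, min = 0, max = 1)         # draws from the prior
w_sim <- rbinom(1e4, size = 9, prob = p_samples)  # one simulated count per draw
table(w_sim) / 1e4  # roughly uniform over 0..9 waters, as described above
```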
And there will be examples of this as we go. Right now, it's not super interesting, but later, I promise you, it's going to be fun. I'm going to show you some crazy regression lines that are impossible. And there's a really big literature on prior choice, and we will be ignoring most of it. I'm going to focus on the construction perspective in this course, where we use prior predictive distributions to understand the implications of priors. Right now, it's easy, because there's exactly one parameter. But in later models, right, in a typical regression model, you could have a dozen parameters, no problem. And they all interact to produce predictions. So what do the priors mean? You've got to simulate, I think, to figure things out. There's just no other way to understand the implications than to simulate predictions. Before your model has seen the data, what does it believe? That's the prior predictive distribution. And it's in the outcome space that we live, as scientists, not in the internal probability space of the golem. OK. There are many, many words, lovely words, the best words, in the chapter about this. What we end up with is a summary at this point. There's this stereotyped way of doing statistical model notation, and we're going to use it a lot in the course, because I want you to be able to read these things. People put these things in their papers, and once you learn to read these sorts of model summaries, there's a lot of power to it. It's a standardized way of communicating. So in this case, what we'd say is: W is distributed binomial with N trials and probability p on each trial, and p is distributed uniform between 0 and 1. This is the simplest possible Bayesian model I can imagine. It's actually Laplace's famous model, the rule of succession. What do you do with these statements? That previous slide is just notation; the code is going to look a little bit different, but we'll be writing models that look a lot like that. What you do mechanically is compute the posterior probability. Bayesian statistics doesn't have a bunch of estimators, like non-Bayesian statistics does. In non-Bayesian stats, you've got to make some decision about how you're going to get an estimate. You have an estimator, and then you evaluate it using frequentist criteria. And that's fine. Lots of good stats get done that way. It's not a criticism. In Bayes, you don't have a choice. There's one estimator, and that's the posterior distribution. And it doesn't produce a point. It produces a distribution of plausibilities over all the possible values. That's your estimator, and you've got no choice. That's the only logical thing to do, given the way that probability theory works. So that's always your target, which is nice. You don't have a choice to make, and that's comforting. I don't like choices. Does anybody else like choices? Choices are terrible. That's why I like fancy restaurants, because there's no choice. Whatever the chef is going to give you, you get. So that's Bayes. It's like, no, no. Would you like a posterior distribution today? Yes, you would. Here's a posterior distribution, friends. You can have it with or without parsley. And so you use this thing, Bayes' theorem. And again, there's a lot more detail in the chapter about this. What I want you to get right now is that Bayes' theorem is hard to remember. There's a bunch of symbols, and they're all shuffled around, and like, what's the exact order? And it's fine. You'll get it in time. You'll understand it.
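To have it in front of you, here's the model summary in that notation, and the form of Bayes' theorem it implies; this is just the lecture's verbal statement written down.

```
W ~ Binomial(N, p)
p ~ Uniform(0, 1)
```

And the updating rule, in math:

```latex
\Pr(p \mid W, N) = \frac{\Pr(W \mid N, p)\,\Pr(p)}{\int_0^1 \Pr(W \mid N, q)\,\Pr(q)\,dq}
```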
What I want you to understand is that the form it takes is just the garden of forking data we did last time. We're multiplying the prior times the relative number of ways the data could have arisen, and then we're standardizing, so that all the numbers are between 0 and 1 and sum to 1 again. That's all we're doing. So it's just the products, like in the garden of forking data. We multiply the probability of the observed variables by the prior probability, and then we normalize it. And you normalize by summing up all those numerators, all the top parts, all those products. And in mathematical notation, that's what it ends up looking like. This is the famous thing called Bayes' theorem. But its job is just to count up the relative numbers of ways, given each cause, each conjecture, each potential value of p, that you could see the data. That's all it's doing. And the normalization is mathematically ugly, but that's all it does, just normalize. In fact, you don't have to do it, because the relative number of ways is what you care about. You can skip normalization. Then it's not a probability, though, and then careless mistakes could be made. So you should normalize, and your computer will do it for you. But technically, it's not even required. I say more about the probability notation in the book, but it's not super important right now for us. Let me give you the picture version of this, so you understand what's going on. So the model we just updated looks like this. And I want to emphasize this product nature. You're multiplying a prior times the probability of the data, which is often called a likelihood, the word I put at the top. And what this means is you take a vertical slice through these two functions at each value, multiply the values at each point, and you get a new function. And it'll have a shape. In this case, since the prior is the same value in every place, if you multiply anything by a constant, you get the same thing back. Well, proportionally the same thing. It might get bigger, but everything gets bigger at the same rate. So you get the same shape back. So there's no change. So in the model we just did, the data are running the show. There's nothing about the prior that is doing any work here. And again, I've already tried to argue that that's not the best thing. We could do better. We could use our prior information, for example, that the Earth is mostly water. Everybody was taught this in elementary school. So we could construct a prior which is zero at every value below a half, and then some value, any value, but always the same value, above a half. And then you get a truncated posterior distribution, where you slice off all the impossible stuff. Because you know that a priori. The model doesn't know it purely from the data; the data can't tell you that. But if you knew it already, you can embody it in the prior, and that lets you get to the answer faster as a consequence. And then finally, you can do some really weird stuff. You can make any assumption you want. You can make bad assumptions. Here's a weird assumption, where we have this prior that's peaked on a half. Maybe we learned our lessons in elementary school very badly, and we learned that the Earth is exactly half water. So we want to put a peak there, and we choose this prior. This is called a double exponential prior. I actually love the way it looks. It looks very nice. And then you get this weird skewed thing going on.
But that shape is a logical consequence of multiplying these two functions together. And that's what posterior distributions always are. It's about the relative number of ways each value of p could be true, given the data and the prior probability. So, computationally, there are a number of different ways to get approximations of the posterior distribution. You have no choice about the posterior, but you do have a choice about how you approximate it. And there's an analytical approach, which I'm not showing you, because, well, it's not very useful. There's a very small number of models where you can analytically solve for the posterior distribution. This is one of them, though. You could do it in this case, and the book has a note where I show you how. I made the graphs earlier using the analytical solution. For any even slightly more complicated model, it's impossible. Nobody can do it, because there are integrals. And those of you who've done integral calculus know, integrals are like a wild land of possibilities, and there are a whole lot of integrals that just can't be closed. In fact, there are still unsolved integrals in math where you get famous if you solve them. We have science to do, so we're not going to do that. Instead, we make our computer do the integrals. We do it numerically. So we're doing numerical integration. You'll be doing fancy calculus, but you won't even realize it. And there are different ways to do this fancy calculus. The only one I'm going to talk about today is grid approximation. It's not very useful in general; we're only going to use it for two models in the course. But it's great for teaching, because it forces you, mechanically, to see that we're just counting up the branches in the garden of forking data. We're just going to count them up, doing the multiplication. Then, for the first part of the course, we're going to switch to our third option here, called quadratic approximation. I'll explain that on Monday when you come back, and we'll start using it. And I'll just say that quadratic approximation is also called the Laplace approximation, because Laplace did it. And then with the fourth one, in the second half of the course, we'll be doing lots of Markov chains. I know, for some of you, this is why you've come, because you want to do Markov chains. And we're going to wait, so that we're only fighting with one thing at a time. But Markov chains solve a whole lot of really important problems in this business. And really, the growth of Bayesian inference in the sciences since the 1980s is largely attributable to really nice Markov chain Monte Carlo algorithms that you can run on your laptop. OK, what is grid approximation? So, you know the posterior probability is the standardized product of the probability of the data and the prior probability. Chant that to yourself. Write it down before you go to sleep. That's it. That's Bayesian inference right there. And standardize just means you add up all the products and you divide by the sum. We're going to do that in code in a second, so hang on; this will be clear once you see the code. What grid approximation does is, instead of considering every infinitesimal value of p that could be true, and doing some integral over them all, which is what you'd have to do if you were doing this analytically: you'd have to sum over an infinite number of possibilities, and that means taking an integral.
Instead of doing that, we're going to consider only a finite number of them, based on a grid. But if we make the grid very fine, so that there's a large number of them, we get a very nice approximation of the integral. This is called grid approximation. For some of you, this will help you understand integrals, which is another bonus of the whole thing. OK, grid approximation is great here. Once you get more than a few parameters, though, it becomes basically impossible. The sun will go nova and swallow the Earth before your laptop finishes computing the grid. Combinatorics is just not nice; it has no regard for human life. But in this case, it's great. It's very easy. So let's start out with a very simple grid. Here's the space we're worried about. The proportion of water is on the bottom; that's the thing we want to make an inference about. There's an infinite number of possibilities between 0 and 1. Let's consider only three of them, right? 0, one half, and 1. Say those are the only possibilities. This is like the marbles in the bag. Then we can count up the relative numbers of ways that we could see six waters given each of those possibilities. And we immediately rule out 0 and 1. Those are impossible. Why? Because we've seen water and land. So those go to zero. The only one left standing is a half. So it's the winner. So even if your grid is only three values, you've already got some information here. Let's consider a few more, though. Let's consider five. With five values, now we've got some other than 0 and 1: a quarter, a half, and three quarters. And now we see that three quarters is more plausible than one quarter, because we have more water than land in the data. Does that make sense? It's the same calculation we did before, but now we're doing it on a grid. If we increase now to 10 values, it's looking more and more like the thing we computed before. Here's 20 values. That looks pretty good, actually. I'd stop there. That looks pretty nice. But we're drunk on power, so let's do 1,000. Now, this is still a finite grid. 1,000 is a lot smaller than infinity. Agreed? In fact, it's infinitely smaller than infinity. I'm glad people laugh at my jokes sometimes. So this is just an approximation. But it's a very good approximation in this case, and we can verify that with the analytical solution. Here's the code to do this. I'm going to step through it piece by piece for you. I won't normally step through the code piece by piece for all the examples as we go, but I think it's useful here, because this grid approximation presentation is about understanding Bayes' theorem. That's what it's about, and this approximation is useful for that. So step one of our grid approximation is to define the grid. That is, what are all the values of p you're going to consider? So what I do here is generate a sequence; seq in R is just a way of making a sequence, from 0 to 1, and I want 1,000 equally spaced values. Then we're going to define the prior. That's the next thing. Oh yeah, then I plot our grid on the left. Those are our equally spaced values of p. Then we're going to define the prior probability of p. And we're just going to assign the value 1, 1,000 times. Why 1,000 times? Because that's how many values of p we're considering, so each of them gets the same prior. And for this uniform distribution, you end up assigning the value 1 to every one of them. Why? Because then the integral sums to 1.
If you don't want to think about that, don't. Sorry I mentioned it. For those of you who are thinking about it: the area under the distribution has to sum to 1, right? So you have to assign the value 1 to all of them to get the area underneath to be 1. And now the probability of the data. This is our dbinom friend again: dbinom of 6, for 6 waters in 9 tosses. And then we put the whole p_grid in there. p_grid has 1,000 values in it. And this is why you're going to love R: you can give it a vector of numbers, and it does the same operation on every one of them and gives you a vector back. It's just built to do this. It was built by statisticians to do statistics. So when you give it p_grid, prob_data has 1,000 values in it, one for each potential value of p. Good. And then I plot this down here, and you've seen this shape before, right? You'll see it in your grid. Then we compute the posterior by multiplying the probability of the data by the prior probability of p. Nothing exciting happens, because the prior probability of p is the same in every case. It's multiplied by one. You get the same shape back in this example. And then we standardize by just dividing by the sum of the posterior. And again, the same shape. Yeah, I know, it keeps the same shape, but that's because the final step doesn't change the relative values. It just changes the area under the curve. That's all standardization does. Okay? The important thing is that multiplication step. That's what gives the posterior distribution its shape. And there are many algorithms for estimating the posterior distribution where you ignore the standardization. Markov chain Monte Carlo is one of them; there we omit it from the calculation. You don't need the standardization at all. And in fact, the fact that you can't compute it is the reason we use Markov chain Monte Carlo. But you can wait to understand that until chapter nine or ten; I'll tell you which chapter it is later. And then we'll do some fancy Markov chain animations, and you'll understand what's going on. Okay.
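Before moving on, here's the grid approximation script just walked through, collected in one place; it should match the chapter's code up to variable names.

```r
# Grid approximation of the posterior for 6 waters in 9 tosses
p_grid <- seq(from = 0, to = 1, length.out = 1000)  # step 1: define the grid
prior <- rep(1, 1000)                               # step 2: the flat prior
prob_data <- dbinom(6, size = 9, prob = p_grid)     # step 3: probability of the data
posterior <- prob_data * prior                      # step 4: multiply by the prior
posterior <- posterior / sum(posterior)             # step 5: standardize
plot(p_grid, posterior, type = "l")                 # the familiar hill
```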
The last bit that I wanted to get across to you today is that when I work with Bayesian models, I'm nearly always working with stuff like this: a bunch of numbers, a bunch of random numbers, random numbers drawn from the posterior distribution. And this is actually nice, in a sense, because it takes what is a hard calculus problem, integral calculus, and transforms it into data summary. And most of you are probably a lot better at data summary, apologies, than you are at integral calculus, right? Every human is, right? How many numbers are between one value and another? You just count, and then divide by the total number, right? Then you've got the proportion in a certain interval. That is actually a calculus problem, which we just did by counting, if you've got things to count. Random numbers have this extremely important inferential role in contemporary statistics, but it's worth recognizing that this is a very recent development, the idea that we would use our computers to generate a bunch of random numbers and then make inferences from them about how to get rockets to Mars and things like that. It almost seems like madness, but it's true. And so, starting in the 20th century, people started publishing books with pages like this. They were just full of random numbers, because you need them to do science. But it seems like madness, right? You can imagine having a time machine and taking one of those books back to, like, ancient Athens, and telling them, we run our society on random number generators. And they'd say, this is why you screwed up, right? This is why you've come back to us for the answer. That's right, obviously. But it really is a cognitive prosthetic that helps us understand things, because it transforms hard calculus problems into easier data summary problems. So here's the slide that just tries to summarize what I just said. We're going to be sampling from posterior distributions in this course all the time, even before we have to. Two reasons. First of all, it makes a hard problem into an easier problem. And second, once you start using Markov chain Monte Carlo, you're only going to get samples. That's all you've got anyway. And so you might as well learn how to work with them now. Yeah? That's my gambit. And it's much easier to think with samples. Okay, so let's sample from our grid-approximate posterior. Here's the recipe for sampling. First, you get an approximation of the posterior, the grid approximation in this case. Then you sample with replacement from the posterior, and then you compute stuff. We'll compute some things. Again, this is all of chapter three. It's all examples of how to compute stuff with the grid-approximate posterior for the globe tossing model. All kinds of things. I want to give you a few examples of how this looks. There's one line in R that is sufficient to do all the sampling. We can sample 10,000 values from the grid-approximate posterior with the function sample. So: sample p_grid, which is the list of all the possible values, our grid, with prob equal to posterior, which is the list of posterior probabilities. It's going to sample each value of p in proportion to the probabilities in posterior. We're going to do it 10,000 times, and we're going to do it with replacement. And now you get a big bag of numbers, and those numbers are present in proportion to their posterior probabilities. And then you can start doing summaries. And now you're just summarizing: how many values are within an interval, how many are above a half, and things like that. And that's much easier to think about than trying to do some integral over the distribution. If you think about plotting, on the left here, the sample number on the horizontal versus the value that was sampled on the vertical, there's this big slew of points across, but you'll see that it's like you're looking top-down on the posterior distribution as if it were a hill. There's some topography to it. And if we look at this hill sideways, you get this kernel density estimate, which is what you're looking at on the right. And that's an approximation of the thing we had before. When you start doing Markov chains, you're only going to get these samples. That's all you get. They just spit them out. They're engines for spitting these things out. They're beautiful engines. They just spit out these samples. And you will already know how to work with those samples when we get there. Okay, now compute stuff, as I say. What might you want to compute? Well, that depends upon your scientific question. You have some questions, and you have to decide how you want to summarize the evidence. So I don't want to be too specific here, but in the chapter, I give you a bunch of examples of what you might do. Very commonly, people want to construct intervals, or say how much of the posterior probability is above or below some value.
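Continuing from the grid script above, the sampling line plus two of the counting summaries just described might look like this:

```r
# Draw 10,000 values of p, each in proportion to its posterior probability
samples <- sample(p_grid, size = 1e4, replace = TRUE, prob = posterior)
mean(samples < 0.5)             # posterior probability that p is below one half
quantile(samples, c(0.1, 0.9))  # boundaries of the middle 80% interval
```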
So I give you examples of how to do these things. You literally just count up the values that satisfy the criterion and then divide by the total number, and then you've got the probability. It's that easy. Let me say a little bit more about point estimates and confidence intervals before we close up today. You can think of there being two general kinds of intervals that people like to construct, and this is what I mentioned before. There are intervals of defined boundaries: so we might ask, in the upper left, what's the probability that less than half of the Earth is covered in water, given these data? That's the blue-shaded area in the upper left. And you compute that by counting up all the samples that satisfy that criterion and dividing by the number of samples. Is that easy? Yeah. This is a silly example, I know, but at least it's transparent. You can also do it with 0.75 as the boundary, right? In the lower left, what I'm showing you is the lower 80%. So you're finding the value on the horizontal axis below which there's 80% of the total probability mass. That's a slightly different calculation, but again, it's just counting. And then in the right column, I'm showing you intervals of defined mass. Well, the top right is still defined boundaries, between one half and 0.75. The lower right is showing you the middle 80%. There's an infinite number of 80% intervals, and we've got to pick one, right? You can just slide that blue mass around, and it gets wider and narrower depending on where it is exactly, but there's an infinite number of 80% intervals. Starting to feel comfortable yet? Right, so papers always report intervals, and they report only one. What is that thing they're reporting, given there's an infinite number of intervals which have the same mass? This is a question not often answered. Let me try to give you an idea. There are two basic kinds of specified-mass intervals in statistics, at least in basic statistics. Frequentist confidence intervals are a whole other thing; we won't talk about them, they're weird. But you have intervals, which we'll call percentile intervals, that put equal mass in each tail. This gives you the central area, so these are the ones you're probably used to. If you think about a 95% interval, which you might call a confidence interval, and I'll say a little bit about that term in a second, you've got 5% left over and you put 2.5% in each tail, so the mass is in the middle. And those are most of the intervals you see, those kinds of intervals. Percentile intervals is what I call them. Percentile intervals are not necessarily the right thing to use. They're great summaries; all intervals are just summaries. They have no magical properties, right? Where the boundaries are, they're just summaries of where the action is. A percentile interval is a nice summary when you have a symmetric distribution. What about when you have an asymmetric distribution, like my example here? Imagine a different data set where there was more water present. No land has ever been observed in our sample, and so now p equals one is a possibility, and we get this really skewed posterior distribution that just keeps rising, right? Now a 50% percentile interval leaves out the most plausible value. It omits p equals one. That seems bad, right?
If your summary of the distribution omits the highest value, the point where it's maximized, I assert, I can't prove, but I assert, that's a bad statistic to report, right? So you want to consider something else. And in this case, there's another kind of interval which doesn't have this flaw. It's called the highest posterior density interval. I say much more about this in the book. It will always include that highest point, yeah? But it has other problems. There's no grand solution here. What I want you to understand is that intervals are just ways of summarizing the shape. They compress the information, but the shape is what matters. There's nothing about the boundaries of the interval which has magical decision properties. They're just summaries for communication. And so if you look at the posterior distribution and you see it's highly skewed, then you can choose a good summary, but you have to look, yeah? Now, in typical regression models, you're not going to get a skewed posterior like this, but there are lots of places in science where you do. It's not that weird, actually, to get something like this. There's code to do this in my rethinking package. There are two functions, PI and HPDI, which will compute these from samples. You just give them the samples from the posterior distribution and the mass you want, and they give you the boundaries. So you don't have to toy around with the algorithms to do it. Okay. Point estimates. What I want to say about point estimates is that usually there's no point to them. We care about uncertainty, and the posterior distribution is a distribution, and you want to summarize that. There's a bunch of chapter three that I'd like you to read which is about justifying a point estimate. And what you need to do that is a decision analysis. You need to assign costs and benefits to different kinds of mistakes you'd make. And again, there's a big section in chapter three that I refer you to about that. Typically in the sciences, we don't do this, because we're building a research program and we're going to accumulate evidence over many, many independent studies, and so we don't preemptively make a policy decision. But I have a lot of colleagues who do applied work, and they have to make conservation decisions, for example. Do you list the species? Do you de-list the species? And then the costs and benefits of these decisions matter a lot. And then you really want to say, what's our estimate, right? And you use all the information to construct it, and you can get a different result. Hurricane forecasters do the same thing. Do we order an evacuation, yes or no? Yes is the answer. Yes. Okay, a couple of final things to talk about. So there's a lot of confusing language about intervals, and I have nothing but sympathy here. It's a big terminological jumble out there. And I'm not very picky about what people call intervals, because they're just summaries of the shape, and I don't have superstitions about, you know, 5% or anything like that. So, whatever. But the term confidence interval, well, look, you can't have confidence in this interval. The term oversells it. It's like an Orwellian Newspeak term, right? It's trying to say, like, believe me, this is a great interval. The best interval, right? So this is political language, and I don't like it. It doesn't even definitionally fit. It's a non-Bayesian term, and it doesn't fit the construction of what's called a confidence interval.
They should be called something like repeated sampling intervals instead. In the Bayesian literature, you'll often see credible interval. Can I be grumpy again and just say I don't like this term either? Again, it's political speak. It seems to assume too much: it assumes that the model is credible. This interval is only credible if you believe the model and the data. But hang on, that's sort of jumping the horse here. Jumping the horse? I need a better metaphor. Cart before the horse, that's it. And again, it's a bit too Orwellian for me. Lately I'm trying out the term compatibility interval. I like this because it emphasizes the small world, large world distinction. This interval contains the values which are most compatible with the model and the data. Now, whether you choose to believe the model, that's up to you, and that's another discussion to have. Whether this interval is also credible is another conversation we're going to have to have next. But what I can be sure of mathematically is that this interval is compatible. Yeah, does that sound good? I don't know, I'm just trying this out. This is a new thing for me; you're the first victims of it. Okay. The last concept I wanted to get across, and it's 11 o'clock now, is the predictive distribution, the posterior predictive distribution. So I think the thing to do is one more slide, if I can keep you for another minute. Yeah, well, the people who are complaining can walk out. I apologize, though; I think this is a very important step. We've got the model, and now we want to inspect what it believes after seeing the data. One of the best ways to do that is to make predictions: treat the model as a causal model now and simulate data by sampling observations out of it. So let's do that. There's a big section in chapter three where I walk you through the whole thing, but let's think about it as a cartoon again. We've got a posterior distribution here for the globe tossing model. Let's consider, just for a second, three values from it, A, B, and C, drawn at those vertical lines. We're going to consider all the values, but I just want to show you each of these as an example first, before we consider the infinite number of them. So let's think about this: if the true value were A, and we simulated a bunch of globe tosses, what would the data sets look like? We'd get an ensemble of water counts, and what would the distribution of those counts look like? This is called a sampling distribution, same thing as in frequentist statistics. Conditioning on p being 0.38, which is where that A line is located, you get a sampling distribution that looks like this. You expect mostly three and four water out of nine tosses, but there's going to be a lot of spread, a lot of variability. That gives you an idea of what to predict in the future. So if I ask you, okay, now we're going to do this whole globe tossing thing for another class next door, everybody make a bet, and the person who gets closest to what actually happens gets 20 euros or something like that, or I'll plant a spruce tree in your name or whatever makes you happy, right? Then all of you would dutifully calculate the sampling distribution and choose three or four, right? And that would be the right thing to do.
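As a minimal sketch of that fixed-p sampling distribution (0.38 is just where the A line happens to sit in the figure):

```r
# sampling distribution of water counts, conditioning on p = 0.38
dummy_w <- rbinom(1e4, size = 9, prob = 0.38)
table(dummy_w) / 1e4   # mostly 3s and 4s, but with plenty of spread
```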
But if it were B instead, you'd get another sampling distribution, centered on six. And if it's C, a very high value, you get a very skewed sampling distribution now, where eight is the most plausible value; eight or nine should be your bet, and you would bet eight if I forced you to pick only one. What we want is a posterior predictive distribution, which mixes all of these together in proportion to the posterior probability of each value of p. I'll say that again. The actual predictions of the model are not any one of these sampling distributions; they're all of them, mixed together with the proper weights, so that implausible values of p contribute very little and highly plausible values of p contribute a lot. This is called the posterior predictive distribution, and it just, I'm going to say, emerges from the mixing. I give you the code to do this on the next slide, and in the sketch below; it's super easy. You get a new sampling distribution where six is most plausible, but there's tons of spread. This is an overdispersed binomial distribution, because the model is not very confident about what the true value of p is, and that's all embodied in these predictions. It's being automatically and properly cautious about the predictions. Assuming the model is right. What does this look like in code? You need one line. You can use rbinom: the r prefix means random, binom for the binomial distribution. You ask for 10,000 simulations of nine tosses each, and the probabilities come from the samples from the posterior distribution. So now there's one simulation for every sample from the posterior distribution, 10,000 of them. You get a bunch of simulated water counts, we plot them, and that's where that graph comes from. Does this make sense? With more complicated models this gets harder, because there are more parameters involved, but the operation is the same, and there will be lots of examples in the course of how to do this. Okay, I was going to say some stuff about hurricanes and more about predictive checks, but out of respect for your time, we're four minutes over, so let me cut to the fun part: homework. The homework is already up on the website, three problems. You're going to work with the globe tossing situation, using basically the code you've seen here with little modifications to it. You've got a week to do it, and it'll always be that way: I'll assign homework on Fridays, please turn it in to me the next Friday. Take your time, but start early. You're welcome to work in groups, just turn in individual assignments. This is an open-world assignment, right? Because that's how science is. You get to use the full potential of the internet and your colleagues to help you. Come back on Monday for the exciting transition to linear models; we'll start doing regression, or, as I call them, geocentric models. Be sure to update your book: typos have been corrected. Brett gave me some good typos, I corrected them, and I pushed the commit up. With that, thank you for your indulgence, I'll see you on Monday.
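For reference, here's that one line of posterior predictive code as a minimal sketch, reusing the `samples` vector from the grid-approximation sketch earlier:

```r
# posterior predictive distribution: one simulated set of 9 tosses per
# posterior sample, so uncertainty about p flows into the predictions
w <- rbinom(1e4, size = 9, prob = samples)

# plot the simulated water counts; simplehist() is from the rethinking
# package, but table(w) or hist(w) works in base R too
simplehist(w)
```

Because each draw uses a different plausible value of p, the result is wider than any single binomial sampling distribution, which is the overdispersion described above.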