Welcome back. I want to keep going with the introductory material. We're into Chapter 2 now, and before we resume where we left off, a point of order about homework. Your first homework assignment is up on the class website. There are two problems with three parts each, which are meant to help you practice the skills I'm going to introduce to you today. There's a little bit of computation and a lot of concepts. To make this work right, you're going to need to install the R package that goes with the course. It's called rethinking. There's a link on the course website. You see here on the left of this slide — it doesn't have the red box around it on the website; that's my annotation. You click on that, though, and it will take you to the thing on the right, which you can also arrive at by Googling my last name. You'll find my professional website, and there's this software tab, and you'll find software that I write, and then there are these instructions and these three lines of R code. If you paste them into R, it will install the package, and nothing else, I promise. I don't work for the NSA, and I will not steal your information, anything like that. If you have trouble, it's probably RStudio's fault — that's my default response — but seriously, if you have trouble, let me know and we'll figure it out. Most people, most of the time, have no trouble. You only need this package for some of the utility functions. This week, all you need it for is to load the data you're going to use in your homework problems; in later weeks, there will be utility functions that will make your life a little bit easier.
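For reference, the three lines look roughly like this. This is a sketch from memory — the exact package list lives on the website, so trust the site over me if they disagree:

```r
# Sketch of the install instructions (from memory; check the course website).
# First the dependencies, then the rethinking package itself from GitHub.
install.packages(c("coda", "mvtnorm", "devtools"))
library(devtools)
install_github("rmcelreath/rethinking")
```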
To continue forward, this is where we ended last time. I used this toy example of pulling marbles from a bag to attempt to remove all superstition from probability theory. Probability theory is this incredibly simple-minded and powerful way of counting up possibilities — possible ways that events could happen. We have some observations, we have conjectures about processes that could produce those observations, and there are a number of different ways each conjecture could produce the data. Conjectures that can produce the observations in more ways are more plausible explanations of those observations. And it really is that simple: it's just counting. The things we call probabilities — in Bayesian inference, we talk about them as measuring relative plausibility, because the relative counts are higher or lower — are just standardized counts of those ways. We standardize them so that a set of them always sums to one, and that makes the math easier. It really does. Then you don't have to keep doing the combinatorics over and over again, which would be monstrous — your computer doesn't have enough memory to do some of the combinatoric problems in this course that way. And that's really all it is. Despite being this simple-minded, it's the unique logical extension — the continuous extension — of discrete logic. And as logic, it's powerful, and it's very vulnerable to bad assumptions. So you have to observe the garbage-in, garbage-out principle. That's what I mean by no superstition about this. We have no illusions that the small-world calculations have some magical access to the real world, because they don't. They exist in the realm of your mind, which is a beautiful place, but it is only in your head. And nevertheless, thoughts, when they have good premises, can be very powerful in the real, or large, world. And that's why statistical methods of all sorts, not just Bayesian ones, have been so useful. But it also explains why they've sometimes been so destructive: these assumptions matter. So what we're going to do for the first half of the day, at least, is take that same sort of story and work with probabilities now instead of counts. I'll show you the standard representation of the way this works and the way we're going to work with it for the rest of the quarter. Instead of counting marbles, we'll be looking at continuous possibilities and how they're traditionally represented in a statistical framework. So we're going to work with this typical model-building cycle, which I advertise as a very useful way to avoid tricking yourself and to continue to observe the small world/large world distinction. The first thing we do is tell a data story. Bayesian models are generative, which means they all correspond to some way that you could simulate the observations of interest. Not all statistical models are generative. That doesn't mean the others are second rate, but it does mean it's more awkward to do certain things with them and you have to build them in different ways. The nice thing about the Bayesian method is that you can tell the story forward first. You can think actively, as a scientist, about what process in the real world you think generates these sorts of observations, and that becomes a statistical model. This is the model design part. I'm going to give you an example of this in a very simple context in a minute. The next thing you do is what we call conditioning on the data, or Bayesian updating. This is the part where we count the paths through the garden of forking data that remain, given the observations. I'll show you how we do that in the typical probability formalism. Finally, we critique the model, or evaluate it. All models are false — rather, they're hammers. They're only wrong in the sense that a hammer can be wrong: there are better hammers for the task. So there's always some way in which the model is doing a bad job, and you want to find ways to inspect that. I'm going to give you some examples of that. As we go through the course, there'll be a lot of that model critique, or model checking. And then, through many cycles of science — not necessarily in one paper or even two — we improve the predictive accuracy and information content of our models, their correspondence value, by going over and over through this loop. So, the toy example I want to use today — I have this prop, an inflatable globe, and I forgot to bring it. I'll bring it next week. I always forget to bring it. And I'll throw it at you, so we can do the simulation of this sampling. So imagine I actually had this inflatable globe. Those of you who've been to my office have seen it; it's always sitting uncomfortably in the chair next to my desk, right? Like, why is there an inflatable globe next to me? This is why. So imagine you had an inflatable globe and you were interested in the question: what proportion of the surface of this globe is covered by water? That is, you're interested in measuring what proportion of the surface of the earth is covered by water.
And let's assume, for the sake of argument, that the globe is a good representation of that. One way we could take point samples and get some data to estimate the proportion of the surface covered in water is to throw this globe in the air. Each time you catch it, you see where your left index finger is: if it's over water, you write down a W, and if it's over land, you write down an L. And let's ignore those cases like swamps and estuaries — the ecologists in the audience, someone always brings that up, right? There are more than two types of things. Yes, but it's a simple case. So let's say we've done this: we've tossed it nine times and we get this sequence of data — a water, then a land, then three waters, a land, a water, a land, and a water. From this sequence of observations, how do we translate this into a logical estimate of the proportion of water covering the globe? Let me walk you through the simple Bayesian way to do this. We'll go through the three steps in the loop: design, conditioning, and evaluation. The design part is that we tell a data story that corresponds to how we think the data arose. How did the data come to be? In this case, it's pretty simple; I think we'll all agree how the data came to be. But you'll see, as we go forward in the course, that there will be interesting disagreements, different candidate options. So the data story motivates how the data arise. For this sequence of data, the Ws and Ls, here are the facts I think we can agree upon, at least for the sake of the example. There's some true proportion of water covering this globe. Let's call it P for the moment; we'll use that symbol to represent that proportion. The thing we want to know is the question we're asking. It's like we're programming our little machine, our little golem, to answer this question — tell me P — and it's going to develop estimates of it. The globe is tossed up in the air. Given that the angular momentum of this thing is chaotic, like a coin flip, there's no correlation between successive tosses, and the initial conditions have only the weakest effect on the data that actually arise. So we treat it, quote unquote, as random — meaning we have no information that would let us predict whether water or land comes up, given how we hold it before the toss. That's what random means. Remember, randomness is in us or in our machines, not in the world, right? If randomness were in the world, it would be a crazy and fun place. Randomness represents uncertainty, or lack of information. So under these assumptions there's a probability P that we observe a W, because that's the proportion of the globe covered in water, and a probability one minus P of not. Each toss is independent of the other tosses. That's an assumption. It may not be true. We'll revisit that — you can definitely revisit it in the notes, and we may get to it today. Now, the hard part. Most people are cool with this part: like, yeah, I could have told that story, you're amazing. The hard part is taking this verbal version of the data story and making it into equations. And you're going to get really good at that during the term, at least in a conventional form. So let's do that. We'll go through the story, and then we'll come back and write it as formal functions as we go. Right now I just want to go through the outline of it.
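Since the story is generative, you can already simulate it. Here's a minimal sketch in R; p_true is a made-up value for the simulation only, since the real proportion is exactly the thing we want to estimate:

```r
# Simulate nine globe tosses: each catch shows water (W) with probability
# p_true, independently of the others. p_true = 0.7 is hypothetical.
p_true <- 0.7
tosses <- sample(c("W", "L"), size = 9, replace = TRUE,
                 prob = c(p_true, 1 - p_true))
tosses  # e.g. "W" "L" "W" "W" "W" "L" "W" "L" "W"
```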
The second part is conditioning, or Bayesian updating. Conditioning is a term in probability theory you're going to get very comfortable with. When you condition on something, you learn something, and your beliefs are now conditional on that information, as we say. Conditional on having seen something, your beliefs — the estimates — may change. Bayesian updating is the process of properly conditioning on information that you've received. The information here is data, right? So your machine starts with some initial estimate of the value of P for the globe — in this case it's going to be uniform; we'll talk about that. You receive new information in the form of data, and the estimates get conditioned on that data through a process called Bayesian updating. And Bayesian updating, again, is defined uniquely by the rules of logic, through that counting procedure we did on Tuesday. It's the optimal way to learn in the small world, meaning no other algorithm is going to do better, given that all the assumptions are true. That's why it's the small world. In the large world, nothing's optimal, right? I have this little box in the notes about squirrels and whether they're Bayesian. Squirrels are very clever, if you watch them much. They're pretty good, but I submit as a conjecture that they're not Bayesian. So real animals do a pretty good job getting through their lives without being optimal Bayesian decision makers. Don't get the idea that you have to be a Bayesian in order to make good decisions. But it is at least a warm comfort that, in the small world, you're doing the optimal thing — conditional on the assumptions. Okay, so our golem has some initial information state. For each possible value of P that exists between zero and one — there are an infinite number of them, which is why, again, we don't want to do explicit counting; it's an infinite number of boxes of conjectures now, but your golem can keep track of all of them — we need some initial information state. And we're going to use what I'm later going to argue is never the best initial information state. These initial information states are called priors, and here it's going to be a flat, or uniform, prior between zero and one. Equal weighting. So we say we've got a machine, and it has no prior expectations about how much water covers this globe. Now, you do, because you live on the planet Earth and you're obviously on land. You know that P is not one, right? Because you're not swimming right now. But your machine doesn't know that. We'll think about tutoring your machine later on. That's the prior. It gets conditioned on the data, and the relative confidence in all the infinite values of P changes. You'll learn how to do this calculation, and you'll do it in your homework. It's actually pretty easy. Let me walk you through the cartoon version of how this works. Here's the graphical representation — the way your machine sees the problem. Along the horizontal axis, we have all the infinite conjectures about the process that generated the data: P, in truth, could be anything between zero and one, inclusive, and we don't know what it is. The prior is this horizontal dashed line that your machine starts with. It assigns equal probability — equal plausibility — to all of them. Again, you don't, but your machine does. We'll figure out how to give it better priors later. And the vertical axis is our plausibility scale; in this case, it's probability.
Interpret it as the relative plausibility of generating the data. Then we see the first data point. We can take them one at a time, and we're going to go through them and animate each one, updating one at a time. So we observe the W, and we have a sample size of one. We get a new — well, you haven't learned how to do the conditioning, the Bayesian updating, yet; I'll show you that after the cartoon. But trust me, something happens. Some math magic happens, and the line changes. This line — the curve, though in this case it's still a line — represents the new relative plausibilities of all of the possible conjectures that could have produced the data. We call this the posterior. It's posterior logically to the prior: we had some information prior to seeing the data, and after seeing the data, we have posterior plausibilities. That's all it is. Prior and posterior are just logical — just your perspective relative to when the data arrived. It has nothing to do with real time. In this case, we observed a W, so more posterior probability gets amassed on the higher values of P, because we saw some water. Now, one thing you can see about this — you may not be able to understand why it has to be a straight line; don't worry about that — is that there's zero plausibility on zero now. Why? Because we've seen some water. P can't be zero, because we saw some water. That makes sense. So there's the logic part; you can see the discrete logic still in there. Let's add another data point. It'll get more interesting. We're going to start stacking them up on the screen. Now we see a land. We've seen one water and one land, and we get a perfectly symmetrical posterior curve which assigns highest plausibility to 0.5, because half the evidence supports water and half supports land. There's an exact reason it has to be exactly that shape, because logic determines what the posterior looks like. What I want you to see now is that the diagonal dashed line that was the posterior when n equals one is now the prior for n equals two. Because you're just updating: you see each data point and you update on them one at a time. For most of the exercises in this course, we'll do what we usually do — take all the data en masse and do one updating event. But you can always break it apart into whatever smaller pieces you like, and the process is logically exactly the same. I'll reiterate this when we get to the end of the exercise. So we see a third data point. The third data point's another water. The curve shifts a little bit towards the more-water values of P. Notice that no single value of P ever takes over, ever has all the plausibility. There's always some range of them. And this is something you're going to get used to, and you'll see it as I start adding more of these boxes on the screen: in Bayesian statistics, points are not special. The estimate is the distribution — the whole curve. Initially you're like, WTF, how's a curve an estimate? You'll get used to it. It'll feel natural. There's nothing special about any particular one of the infinite number of points in there. We need them all, because these are relative plausibilities. We add the fourth, fifth, and sixth points. What I want you to see now is that the curve is getting steeper; its maximum elevation is higher. The amount of evidence you have seen is, in Bayesian terms, represented by how peaked the posterior is. All the sample size is stored in that.
The posterior distribution memorizes all the evidence that has been seen, and embodies it in its shape. So you don't have to keep track of degrees of freedom as some artificial thing. If you don't know what degrees of freedom are, it's okay — we don't need them. They're a fossil from classical statistics. But the peakedness here is the thing you want to think of as representing, in a sense, how much sample size has come in. And then the last three data points in the sample of nine. You see, it wiggles around a little bit each time. What I want you to see, of course, is that there are nine data points and six of them — one, two, three, four, five, six — are water. Six out of nine. And that's where the higher probabilities are assigned: around six out of nine. So you might have had this informal idea: well, if I had nine tosses and six of them were water, then my estimate of P would be six out of nine. That's not bad, but it's not the whole story. The problem is that there are lots of values near six out of nine that are nearly equally plausible. There's nothing special about that particular value. And depending upon other things you might have known before these data arrived, six out of nine might not be special at all. So let me try to summarize the cartoon version before we go into the more algorithmic approach. One thing to notice here is that the data order in this example is irrelevant. We could have done the updating in any order. We can shuffle the data points and it won't matter. The reason is that the model assumes the data points are independent of one another: when we observe a water, that neither increases nor decreases the probability of seeing a water on the next toss of the globe. And I think that's probably true here, because I've actually tossed this globe a lot with an undergrad class, to sample data, and I think the tosses really are independent. I couldn't detect a correlation. But it is an assumption, right? It's an assumption you can check with a lot of data — in the notes I show you a way to go about checking it, in fact — but it's an assumption. On that assumption, the data order is irrelevant. There are cases where data order is not irrelevant. For most of the examples in this class it will be, because we make this conventional assumption of independence. But you can also do the updating all at once — feed it all in en masse. That's basically how we're going to do it, because it's less annoying. Imagine you had 10,000 cases in your data set, which will not be unusual in your careers — some of you will have millions; Tony was analyzing a million chess games the other day for fun, as far as I can tell. When your data set has a million rows, you don't want to do a million updating bits, right? You want to do it all at once. And that's how we'll usually do it. But keep in mind that you can always break it down like this and think of it as a machine that is learning. Every time you feed it a bit of information, it conditions the prior distribution on that information, and that produces a new posterior. If you give it a big chunk of information at once, it does a lot of learning at once. If you give it a little drip, it'll do just a little drip of learning. Does that make some sense? But it is learning. Bayesian updating is a purely logical form of learning. And every posterior is a prior when new data arrive.
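To make the drip-at-a-time picture concrete, here's a minimal sketch of sequential updating on a grid of values of p, using the W/L sequence from this example. Each pass through the loop conditions the current state of information on one more toss:

```r
p_grid <- seq(from = 0, to = 1, length.out = 100)  # conjectured values of p
belief <- rep(1, length(p_grid))                   # flat prior to start
belief <- belief / sum(belief)

tosses <- c("W", "L", "W", "W", "W", "L", "W", "L", "W")
for (obs in tosses) {
  # likelihood of this single toss, at every grid value of p
  likelihood <- dbinom(as.integer(obs == "W"), size = 1, prob = p_grid)
  belief <- belief * likelihood    # yesterday's posterior is today's prior
  belief <- belief / sum(belief)   # standardize so it sums to one
}
# 'belief' is now the same posterior you get by updating on all nine at once
```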
So it just accumulates information in this way — in this optimal way, given the assumptions. And therefore every prior is, in essence, the posterior of some other analysis. It implies observations that have happened in the past. It's like ghost data, right? This will help us understand priors as we go forward. So an equal weighting corresponds to some set of observations that would have produced it, making all values of P equally plausible. I'll let you think about that. Maybe I'll give you a homework problem about it next week, when your legs are a little stronger on the Bayesian sea, so to speak. Okay, and then we have to evaluate. I'm going to encourage you to think about Bayesian inference — or at least the little Bayesian golems, the stat models — as providing purely logical answers to questions that you design into them. The data story defines a question, and the things in it that are unknown, like P, are the questions you're asking of the data. The data story tells the machine how to extract information from the data relevant to those little questions like P. And it does it sort of automatically, which is a very satisfying thing. One of the things that has always attracted me to Bayesian inference — especially if you start out as a theorist — is that you've got a data-generating model, some model of the natural system, and then you get data. The nice thing about Bayesian inference is that the data story automatically implies how observations are relevant to making inferences about the adjustable bits of the system. That's a nice thing. You don't have to do any extra steps. It's automatic. It's a really nice thing. The problem, of course, is that the answers come in the form of this distribution, and while they're purely logical, the question may be terrible. And that's your job to deal with. The machine has no wisdom, so you have to supervise it. And it is easy to make mistakes. I don't want to scare you — but now, actually, I do want to scare you, because there's a lot of bad statistics out in the world and I don't want you guys to do any of it. For basic problems, the sort we do in this class, you'll be able to diagnose the mistakes. But all of us make them. Sometimes they're just programming mistakes. Sometimes they're just silly — what we call brain farts; that's a professional term. And you can recognize them by checking your work, in a sense. You get a posterior distribution and you can see if it's nonsensical. The machine itself will never think it's nonsensical, so we always supervise. We ask whether the software malfunctioned. That won't happen for a while, but around the middle of the course you'll get some malfunctions. It'll be fun. Does the answer make any sense at all? Often, seeing a nonsensical answer — even though it's purely logical — will lead you to change the data story, because you realize the data story was impossible, that it didn't make sense in the first place. And sometimes you want to check the sensitivity. We know that all models are wrong, just as no hammer is perfect for making a table. So you might want to ask: do small variations in the assumptions of the data story make big changes in inference, or not? And you want to report that to your colleagues, right? If little changes in the model make a big difference in what the model believes, that should be told. And if they don't, you should also say so.
And this is what's called sensitivity analysis. We'll do some of it when we get to chapter six. Okay, let's do it the more formal way now. Or maybe I should pause and ask for questions about that. What we just did is what I call cartoon Bayesian inference. There was no math there — only, like, the specter of math. But that's what it is in spirit. If you can tell me a story for how the data came to be, Bayesian updating is an optimal way — conditional on all those assumptions being true — to extract the information from the data that's relevant to the unknown parts of the process. I'll say that again. If you can give me a story for how the data came to be, Bayesian inference is an optimal way, conditional on the story being true, of extracting information from the data to learn about the unknown parts of the process. Those unknown parts we usually call parameters; we'll formalize that, as you can see on this slide, going forward. That's the conventional name for them. But that's the basic goal. Does that make some sense? And this is why this is an old logical form of inference, from the 1700s. So, the conventional statistical way of accomplishing this is to make three categories of assumptions that define the data story mathematically and actually make this machine go. Now you get down and do the engineering, right? You've sketched out your data story. The first thing we need is something called a likelihood. The second thing is to decide what the parameters are — usually they're inside the likelihood; the parameters are the things we want to estimate. And the third thing is to pick a prior: a state of information for the machine, what it thinks originally, the initial plausibilities it assigns to the different parameter values. We're going to walk through this in the context of the globe-tossing example and define each piece. In the notes, you should run this for yourself before you attempt the homework and go through the code — you do all of this in R as well, especially when we get to chapter three. From these — once you've defined these three sets of assumptions — the posterior distribution is the unique implication of the assumptions. It's not a stochastic thing. It's not sampled. It's not estimated. It's computed. It is unique and deduced. It is deductive, right? Because it's logic, there's a unique answer. But it depends upon the assumptions being true. That makes some sense? Okay: the likelihood. The likelihood — I'm going to buck convention a little bit and ask you to remember it as the probability of the data conditional on the assumptions. I say I buck convention because there's a convention of saying that likelihoods are not probabilities, and if you're interested in that, there's a little box in the book about it. But let's just say it's a semantic battle with no content that matters for actually doing statistics. It's easier to think of the likelihood as a mathematical expression that counts up the relative number of ways you could see the data, given a conjecture. It's what we did with the marble bag. You made a conjecture — you assumed the contents were something like one blue marble and three white marbles — and then, conditional on that assumption, you counted up the number of ways through the garden of forking data that are consistent with a set of observations. That's what likelihood functions do.
They're mathematical compressions of that counting — functions that give us the relative number of ways through the garden for all the different conjectures. That's all they do. Does that make some sense? At least right now, until you leave the room? Yeah, I know. You're making neural connections. It'll make sense, like, tomorrow. So in this case, we've got an infinite number of different conjectures. And it turns out, given the data story, there is a well-known mathematical function which can express and count up all the relative numbers of ways through the garden of forking data for the case of tossing the globe. The data, in the globe case, are waters and lands appearing. There are more ways to observe water when P is large, close to one, and more ways to observe land when P is small. You want to count all these up, and the binomial probability distribution expresses this. Most of you have probably had a course where you've worked with it before, so I'm not going to agonize over it. If you do have questions about it, let me know and I'll help you out. You can think of it as the coin-tossing distribution. The first part of it here, with the factorials, is just the number of permutations that give you NW waters out of N tosses — and there are different ones, because you can get them in different orders. This is just combinatorics. You did this in high school, and then you blacked out and forgot it, or you reached drinking age and ruined that part of your brain, something like that. And then this is the interesting part: P to the NW is the probability of getting that many waters, and one-minus-P to the N-minus-NW is the probability of getting that many lands, and then the factorial piece is just the number of ways of getting them. So it really is just counting. It looks mathematically complicated if you don't spend your life reading probability expressions, but it's just counting. It really is counting garden paths. And it does all the counting at once, for any possible P. That's what it's about, and that's what likelihood functions do: they're mathematical expressions, algorithms, for that counting process. Nothing superstitious or mystical about them. And they have assumptions, right? In this case, this function embodies the independence assumption: every toss is independent of the others. That's an assumption — not necessarily true. You guys with me? Makes sense? Okay, we're going to work with lots of different likelihoods, but we're mainly going to force your computer to compute them, and I'll show you that in a minute. So let me walk you through how this thing works a little bit. Here, NW is data: it's our count of waters. It's a fact about our data, at least. N is also data: it's the number of tosses — the number of trials, typically. And P is the thing we want to estimate. We don't know it. Likelihoods are conditional on things we don't know. In fact, I had a great probability theory teacher in grad school who taught me: if you want to know something, Richard, condition on it. And that's sort of the whole premise of writing down likelihood functions. You don't know P, so assume you knew it — and then say what the data would look like if you did know it.
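That "what would happen if you knew it" is exactly what the formula on the slide computes. Written out in standard notation (reconstructed here from the verbal description):

$$\Pr(n_W \mid N, p) \;=\; \frac{N!}{n_W!\,(N - n_W)!}\; p^{\,n_W}\,(1-p)^{\,N - n_W}$$

The factorial piece counts the orderings that give you $n_W$ waters out of $N$ tosses; the powers of $p$ and $1-p$ give the probability of any one such sequence, conditional on an assumed value of $p$.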
And then you adopt a bunch of those possible conjectures, and that's how we do Bayesian inference. So I'll repeat it: if you want to know something, condition on it. That sounds insane, but it's actually how Bayesian inference is propelled — by writing out these functions that say what would happen if you knew the thing you want to know. It's also how you solve things like the Monty Hall problem and other probability brain teasers. I have some of those in the homework later on, just to torture you. Now, I think it's very good — if the mathematics are slippery, but even if you're really good with the mathematics — to figure out plain-English ways to say these things. So in this case, I'm going to put these little bubbles at the bottom down here: the count of Ws, of water observations, is distributed binomially, with probability P of a W on each toss and N tosses in total. Make sense? I know it doesn't slip off the tongue — it's not poetic — but you want to be able to talk these things out. Later on, we're going to compress these model definitions down to hide a lot of the mathematics, so you still want to be able to say this, so you know what you're designing. You know what the data story is. We're going to work with these in R code. One of the things that's nice about using R for your statistical programming is that it has built in most of the likelihood functions we're going to use. You don't have to write the binomial formula yourself in R, although it wouldn't be hard; there's no need to do so. There are these density functions, as they're called. They start with a d, and then they have some abbreviation — in this case binom, for binomial. I'm just showing you, with the arrows, the correspondence here. The six is the number of waters we observed in the actual data, right? Six out of nine. The N is nine, and they call it size — that's a convention: N is the size, the number of trials. And prob is our P, the probability of a water on each trial. And again, you have to condition: you don't know P, but you have to plug in a value. This is the awkward part right now, when we're just using the likelihood. You don't know P, you want to know it — condition on it, assume some value. So I encourage you to play around with this. If you're new to R — if you're not new to R, you're bored already; I'm sorry — you should go into this, put in the 0.5, and you get this number out here. Put in some other number, move it around, and get some idea that the numbers it's spitting out are relative numbers of ways to see six waters, conditional on that particular value of P. It's the relative number of paths through the garden of forking data. Does that make some sense? So you want to think about them that way.
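Here's that experiment in code; the alternative prob values are just ones to try:

```r
# Binomial likelihood of seeing 6 waters in 9 tosses, conditional on
# different assumed values of p. Bigger output = more paths through the garden.
dbinom(6, size = 9, prob = 0.5)   # 0.1640625
dbinom(6, size = 9, prob = 0.7)   # about 0.267 -- more ways at p = 0.7
dbinom(6, size = 9, prob = 0.2)   # about 0.003 -- very few ways at p = 0.2
```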
Okay, let's talk a little bit about parameters, because there's some room for confusion here. I know some of you in here work with mark-recapture models, and those are also binomial models, right? How many people here have worked with mark-recapture, or at least run the program MARK? Yeah, that counts — running MARK on your computer is using mark-recapture models; that's sort of how it goes. Mark-recapture models are binomial models too, but what's known in them is different than in this example. So let me problematize this a little bit, just so you understand that what counts as a parameter depends upon your focus, the question you're asking, and what data you have available. Likelihood functions contain a number of symbols, and these symbols can be data or parameters, depending upon what information we have available to us; different contexts change that. In this case, NW and N are data: we know their values. We know the globe was tossed nine times, and we know that six of those times it turned up water. But suppose you had a clumsy lab assistant, and they lost the number of times the globe was tossed. Then you'd have the six, but you wouldn't have N. The first thing you do is fire your assistant, but then you try to do the data analysis. It turns out mark-recapture is basically that problem. Mark-recapture is a problem where you don't know N — the population size of the animals. You're trying to get a demographic census, and the six is like trappings, right? And you've got to estimate P from this. Now you're thinking, well, that's impossible — and it would be, except that you get recaptures of the same individuals. They're uniquely identifiable things being observed, and that's how you can do the census in mark-recapture. But it is still a binomial likelihood; it's just that what's known and what's unknown are different. So in this case, this thing P is the only unknown we have in our example. It defines our question. It's what gets updated. It represents the conjectures — in this case, all the possible ways the data get generated. And I already talked about mark-recapture at the bottom; we'll have some examples of this as we go along. All right. So, priors. I'm going to trickle out advice about priors throughout the quarter as we go. In this case, it's easy to beat the flat prior. All I'm going to say about flat priors right now is that they're never the best prior, and I'll improve upon that advice as we go along. Why never? Because you always have some better state of information than "everything is equally plausible." I already mentioned you're on dry land, right? So you know there's some land on the globe, so P equals one is not a possibility a priori. Often we can do better than that. When we get into more complicated models, there'll be more subtle advice about it. And when we get to chapter six, I'm going to have a lot to say about using priors to calm your model down. I mentioned on Tuesday that these machines get very excited by your sample. You need to calm them down so they don't get overexcited. This is the only world they've ever seen — it's all the data they've ever seen, and they love it, right? It's precious. It's like their first love, and they imprint on it. You want to reduce that imprinting tendency through the use of priors. So we're going to adopt conservative priors that prevent the model from getting overexcited. It turns out that helps out-of-sample prediction quite a lot, and that's why we do it. It's not philosophical; it's a pragmatic attitude. But I'm going to postpone all that to chapter six. I say this now to apologize for not saying more about priors, because there's tons to say about priors. Absolutely tons. Right now, just understand that the prior decides what the golem believes — not what you believe, but what the golem believes before it sees the data. It's what gets updated. Okay. So in this case, the formal version of this prior is just a uniform prior.
This is how you write a uniform probability distribution: the probability of P is the same everywhere, and it turns out, in this case, that means you assign the value one to every value of P. If you haven't done a lot of probability work before, you should be asking yourself right now: how can the probability be one everywhere if they've got to sum to one? That's because, for a continuous distribution, it's the area under the curve — the integral — that must be one. On the interval between zero and one, if the density is one everywhere, then the area under it is one. You get a rectangle, right? And its area is one. That's the answer. Some of you are nodding like, yeah, and the rest of you are like, I don't care, I don't see it. But if you do care, let me know and I'll show you what I mean. The intermediate part, one over one-minus-zero: if you look up the uniform probability distribution on Wikipedia, it'll explain that the interval bounds define it. Probability distributions are defined by summing to one — but the individual density values can actually be greater than one. It's not that the values sum to one; it's that the area under the curve, across the interval, sums to one. This will make more sense when we get to other things. I apologize for the awkwardness, but it'll come together. So in this case, you would say: the prior probability of P is assumed to be uniform on the interval from zero to one. Okay, so here's what I was driving at. There's a really big literature on choosing priors, and it's important. In this course we're going to move towards conservative priors that reduce the over-excitement of models — what we'll call overfitting when we get to chapter six. All I want to say right now is that the prior doesn't have to be exactly right. There's no correct prior. I think there's this legacy of the way classical statistics is taught, especially to undergrads, that there's one unique statistical analysis that is right for any particular experiment or study, and if you do anything else, it's completely wrong. That's not true. Like I said before: just as there are many ways to make airplanes, and many, many more bad ways to make airplanes, there are many good ways to conduct a statistical analysis of a scientific study, and many, many more bad ways to do it. The issue with choosing priors is not that you get the right one, because there is no right one. The world doesn't have a prior. The prior is your machine's initial relative plausibility assignments for the answers, and your job is just to do a little better than flat. The priors we'll end up adopting in this course will nudge you in that direction of doing better, by being a bit more conservative — because the problem with models is that they get overexcited, and we get false positives from them. Okay. Once you've got those three assumptions — the likelihood, what the parameters and the data are, and priors assigned to each of your parameters (you don't need priors on data, right? although we'll complicate that in the last week) — then you can deduce the posterior. What we call an estimate in Bayesian inference is really just a logically computed posterior distribution over the parameters. You can say it's the inverse of the likelihood: the likelihood is the probability of the data conditional on the parameters.
You have to assume the thing you want to know — remember, you condition on the thing you want to know in this business. Once you've made all those assumptions, there's a mathematical formula for inverting that probability and getting the posterior, which is the probability of the parameters conditional on the data. They've been updated — conditioned. You start with a distribution like the probability of P; that's the prior. After conditioning, you get the probability of P conditional on the data. That means it's learned. If we get more data, we use this posterior as a prior, and so on. That's how it goes. So here we write the probability of P, then this vertical bar — it's called a pipe; that's its name in typography, and in Unix. If there are any Unix people here, you pipe things, right? Some people are looking at me like, what? It's okay, don't worry about it. That's a bygone era, with Sparc stations, when people did a lot of piping — not so much anymore. So: the probability of P conditional on the observations. And the answer to how to do the updating is Bayes' theorem. You saw this coming, surely. This is what gives Bayesian inference its name, right? I have a lot of commentary about Bayes' theorem in the notes, and I'll give you the crucial parts here. The left-hand side in the mathematical notation — the thing we want to know, conditional on the evidence — is defined uniquely, by the axioms of probability, by the stuff on the right. And about that stuff: this piece is the likelihood, this is the prior, and this bottom piece is the thing called the average likelihood. I give it to you below in English form, and I'll say more about the average likelihood in a minute. You already know the likelihood and the prior. Conceptually, I want you to think about this as what we did when we were drawing marbles from the bag. It just does the counting — it really does — and the standardization. Remember, we multiplied the paths together. So we're doing multiplication on top; it's the same thing. If you had a prior number of ways something could happen, you multiply that by the number of paths consistent with the observations — that's the likelihood. It's the same multiplication we were doing when we were counting marbles. And the bottom part is the standardization. That's the part that makes it a probability — that normalizes, or standardizes, it. Does that make some sense? I can say it again six different ways; those of you who know me know that. I've got more than six, in fact. The average likelihood is just this: for every value of P, you compute the product up top, and then you sum all those products up. That's the average likelihood — it literally is the average of the likelihood. The top piece is a likelihood, and the prior is, in a sense, the relative weight of that likelihood; if you construct a weighted average of the likelihoods, you get the average likelihood. It makes sure that the right-hand side sums to one, which makes sure that the posterior distribution really is a probability distribution, with area one under it. Does that make some sense? Most of the time this will be heavily automated. The standardization to one is needed to make all the mathematics work correctly, and that's good. But the conceptual insight comes just from the relative values.
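On a grid, the whole theorem is a few lines of R — a minimal sketch, with the grid size of 1000 being an arbitrary choice:

```r
p_grid <- seq(from = 0, to = 1, length.out = 1000)
prior <- rep(1, 1000) / 1000                 # uniform prior, as grid probabilities
likelihood <- dbinom(6, size = 9, prob = p_grid)
avg_likelihood <- sum(likelihood * prior)    # the denominator of Bayes' theorem
posterior <- likelihood * prior / avg_likelihood
sum(posterior)                               # 1, as promised
```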
If you remember, we're just counting the relative number of ways things can happen — the relative weight of evidence for the different possible values of P. So the key insight is that the relative evidential support for any particular value of P is the product of the likelihood and the prior. Or rather, it's proportional to the product of the likelihood and the prior; the relative support literally is the product. So you can think about it as multiplying across. Let me try to show you that visually on this slide. Say we started out with our uniform prior, there on the left. We see data. That data implies a likelihood function, which I've drawn out here: for every unique value of P, there's a relative number of ways you could observe the data. Notice it's peaked near six-ninths, which is the most plausible value — the largest number of ways through the garden comes when the coverage of the globe is exactly six-ninths water. So you get a peak. The shape is determined by the relative number of ways each value of P can produce the data you have seen. Does that make sense? Yeah — some of you were nodding, and some of you were looking at me in a weird way. I'll learn who you are and get used to you; I'm trying to coax you back here. So that's all the likelihood is doing. Again, it's the relative number of ways through those paths in the garden of forking data. When you multiply the two, you get a thing that's proportional to the posterior distribution. If you rescale it so the area under the curve is one, it is the posterior distribution — but the shape stays the same either way. And you notice here there's no change: when you have a flat prior, the likelihood is everything. The flat prior is equal weighting. It's like saying: I don't know anything about this globe you handed me and told me to throw, so I'm going to weight every possible value the same — and then the data dominates. That's what you get. When you already have some expectation about what's going on, you may not get that from the multiplication. So let me give you a truncation case, where what you do know is that you're on the planet Earth, and somebody told you in grade school that the Earth is mostly water. Right? You remember that? Someone probably told you that. Carl Sagan probably told you that — the pale blue dot, right? That's the thing. It's a pale blue dot, not a pale smoggy dot or whatever you want to think of it as. So let's construct a prior where there's zero plausibility for everything below one half, and equal plausibility for everything above one half — a step prior. This is a perfectly fine prior; there's nothing wrong with it. It's just another uniform prior, but uniform between one half and one. And now, when you do the multiplication, you get this truncation. Above one half, the shape is the same — you can kind of see that, though asking you to see multiplication by eye is like asking you to be psychic. You'll get better at this when you see some examples. Final example. Say we had some prior where you thought values near one half were much more plausible, with declining plausibility away from that. This is actually a double-exponential function, if you're curious — it doesn't really matter; I just chose it for the sake of the example. (Both of these priors are sketched in code below.)
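Both of those priors are one-liners on a grid. A sketch — the rate of 5 in the double exponential is my own choice for the picture, nothing canonical:

```r
p_grid <- seq(from = 0, to = 1, length.out = 1000)
likelihood <- dbinom(6, size = 9, prob = p_grid)

step_prior <- ifelse(p_grid < 0.5, 0, 1)      # zero plausibility below one half
laplace_prior <- exp(-5 * abs(p_grid - 0.5))  # double exponential, peaked at 0.5

post_step <- step_prior * likelihood
post_step <- post_step / sum(post_step)            # the truncated posterior
post_laplace <- laplace_prior * likelihood
post_laplace <- post_laplace / sum(post_laplace)   # the skewed posterior
```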
That prior's kind of pretty. And when you multiply them, now you get this skewing of the posterior distribution. The posterior is dragged, in a sense, towards the one-half mark, because those values are so much more plausible given the machine's initial state of information. Now, in this case I don't think the double-exponential prior is a good idea; I just wanted to show you the consequences. Of course, if we got more data, any of the posterior distributions in the right column here would become priors, and they'd be multiplied by the likelihood of the new data to give you the new, logically consistent posterior distribution. It always works the same. Bayesian updating is uniquely defined for every model; it's just that the likelihood varies a lot across models, and that's where the action is. And this brings up a point I want to remind myself to make: non-Bayesian statistics is also mainly focused on writing likelihood functions. That's why non-Bayesian procedures so often give basically the same inferences — both approaches are focused on the likelihood — and for most users, like you guys in your lives, you won't notice the difference. I hope to convince you that there are some real advantages to the Bayesian approach going forward, for things like including measurement error in your models; I'll try to convince you of this as we go. But for conventional simple regressions, there's almost no difference. And it's because of this relationship: the non-Bayesian approaches are, from a Bayesian perspective, assuming flat priors, so the likelihood determines everything. In the Bayesian approach, we're free to use other kinds of weighting as well. Okay. So I've given you the cartoon version of what's going on here, and the mathematical version, at least informally. To actually do this now, we need some algorithm for doing the calculations. For really simple models like this one, it's possible to do it analytically — you just do it with pencil and paper; well, if you're good at integral calculus, at least, you can do it. It turns out you can write down Bayes' theorem and plug in your probability distributions, the uniform one and the binomial one, and this is a well-known homework problem in a lot of Bayesian statistics courses. If you take the Bayesian stats course in the stats department here, the one that focuses more on math stats, you'll do it — it's an early thing, and if you can do integral calculus, it's no problem at all. It does force you, though, to make a particular choice of prior, which is one of the reasons I don't teach it that way. More generally, we use a bunch of different computational engines — conditioning engines — to do this, because for most interesting models, nobody can do the analytical math. Nobody. I think this is why there was such a long period — the Dark Ages — when Bayesian inference was known in books and revered by mathematicians, but not being practiced a whole lot: the schmucks out in the sciences didn't have powerful enough computers, and no one had yet invented algorithms that could make it work. Now we're in a luxury period. As I hinted on Tuesday, I was a grad student in the mid-90s when the Markov chain Monte Carlo desktop revolution was starting up, and people were just super happy, over the nerdiest thing ever, that they could sample from these models now. But all of these engines have the same kind of design, in principle.
If you want to think about your Markov chain Monte Carlo sampler, or the MAP engines we're going to be using for the first half of the course: they're just like this jet engine here, represented as a jet turbine. They're not nearly so dangerous — they won't suck seagulls into one end or anything like that. The prior is one of the things that goes in one end, and it gets transformed and comes out the other end. In a jet engine, cool, loose, low-pressure air goes in one end, gets heated and compressed, and flies out the other end, and that's what propels the aircraft forward. In this metaphor, the prior goes in one end, it's mixed with some fuel — which is our likelihood; the data determine how hot it gets, so to speak, what happens to it — and out the other end comes a transformation of the prior: the posterior. The metaphor is imperfect, because you're not going to put another jet engine right on the other end of this and recondition. Well — I once said that in lecture, and someone in the class who was an aerospace engineer told me that they actually do make jets that way, just stacking jet engines on one another. And I said, do they fly? And he said, they mainly explode. But we don't need to carry this metaphor very far; I just want you to think about it that way. Functionally, you define your model, and there's a separation between the model definition and how it's fit. They're different choices — that's what I want to emphasize here. The prior and the likelihood come from your analytical intentions, and then you choose an engine that is convenient and practical in your case for getting the posterior. But you want all engines to produce approximately the same posterior. So you can remember it this way: the posterior is the prior conditioned on the evidence, and it goes through some engine that does the conditioning. In this class, we're going to think about three main conditioning engines, plus one that I've already eliminated from our consideration: the analytical approach. We're not going to work with it, because there aren't very many interesting problems that can be solved analytically. That's sad, but it's just true — it's often impossible to do Bayesian inference analytically. Or at least, if you want to do it, you have to pre-commit to being coerced in which priors and likelihoods you use. And you want a framework where the science is free to determine the structure of the model, not the analytical constraints. That's why we're not going to do it that way. Today, and for your homework this weekend, you're going to use a simple engine called grid approximation, which is very straightforward. I'm going to show you in the next slides how it's done. It's very powerful, but it only works when you have a couple of parameters — you'll understand why when I start explaining. The reason I'm going to teach it to you, even though you'll probably never use it again in your life (until you teach your own children Bayesian inference at the crib), is that when you do it, it really forces you to learn what the deduction of the posterior distribution is like. There's no dodging understanding what it is. You get it right away. You're building the full jet engine yourself.
It's very computationally intensive for high-dimensional settings, though, so we won't use it after this week, basically. Next week I'll show you another example, in a two-dimensional setting, and then you'll see exactly why we're going to stop — because it gets pretty awful. For the first half of the course, we're mainly going to use an approach called quadratic approximation, where we get an approximation of the posterior. I'll explain what that means. It's very useful for lots of simple kinds of models, the ones we'll mainly be working with in the first half of the course, and it's unreasonably effective even when models get complicated. In the second half of the course, we'll alternate back and forth between it and Markov chain Monte Carlo, so you get a sense of when the quadratic approximation breaks. One thing I want to say about quadratic approximation is that it's really common in non-Bayesian inference too: maximum likelihood estimates are nearly always some kind of quadratic approximation. I'll explain what the quadratic thing means, because that's probably not a word that's rich in your vocabulary. And finally, Markov chain Monte Carlo, which is computationally intensive — but not like grid approximation; it's orders of magnitude more efficient than that, and very, very capable. We'll do lots of it in the second half of the course. Okay, what is grid approximation? Grid approximation is a computational, algorithmic way to get arbitrarily good approximations of the posterior distribution, depending upon how good you want them to be. Here's the strategy. Remember, the posterior distribution is just the standardized product of the likelihood and the prior. So if you've got a likelihood function, you can plug in any value of the parameters and get a likelihood value. Then you can multiply that by the prior probability of those same parameter values, and you've got a product. That product is the relative posterior value that you want — the relative number of ways, again. Then, if you sum all those products up and divide each one by the total, they're standardized, and you have a posterior distribution. That's all you need. And in a way, all you really need is the relative values, so even the standardization is often dispensable, if you want to do it quick and dirty. The reason it's called grid approximation is this: you're not going to evaluate the infinite number of possible values of P — all the values of water coverage that are possible — because you have a finite lifespan. Instead, we construct an evenly spaced grid of the values of P we're going to consider, and we get a coarse picture of the posterior distribution, which will be plenty good for government work, as they say. So here's the recipe in code first, and then I'll show you what it looks like in an animation. This is what you're going to do for your homework this weekend, so it's what you're really going to learn, right, when you start cussing at your computer. It won't be so bad; it's the simplest thing.
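A minimal sketch of the recipe, with 20 grid points as one choice among many:

```r
# Grid approximation of the posterior for the globe data (6 W in 9 tosses).
p_grid <- seq(from = 0, to = 1, length.out = 20)    # the grid of conjectures
prior <- rep(1, 20)                                 # flat prior at each point
likelihood <- dbinom(6, size = 9, prob = p_grid)    # relative ways to see the data
unstd_posterior <- likelihood * prior               # product, as in Bayes' theorem
posterior <- unstd_posterior / sum(unstd_posterior) # standardize to sum to one
plot(p_grid, posterior, type = "b")                 # coarse picture of the posterior
```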
And of course, the earth wouldn't be green if there were no water. So now I'm second guessing the colors; it should be brown. But sorry. On the far right, it's blue: it's water world. And the vertical is the posterior probability of these different values. Say we construct a grid with only three values of p: zero, one half, and one. Those are the only ones we consider. So we compute the likelihood at p equals zero, at p equals a half, and at p equals one. You get those values, whatever they are; R will happily generate them for you, and I'll show you how in a couple slides. Then you compute the prior probability at each of those. In this case, they're all one, because we're using a uniform prior. Multiply those sets of numbers together. In this case, you multiply each likelihood by one, so they come out the same; in general they wouldn't. Then plot them, after you standardize them: sum them together and divide each by the total. So you get this. There's no probability that it's zero, there's no probability that it's one, so all the posterior probability ends up on a half, because these are the only three conjectures. With our bag of marbles, this is like only considering zero, half, and one as the possible proportions inside the bag. Make sense? You've limited your analysis: these are the only parameter values you're gonna consider. This is a very sparse grid, and it's not very good, right? So let's do five values. Same data: it's still six waters out of nine tosses, the same data we've worked with before. Zero and one are still impossible. Now we expand the grid so that we're also inspecting 0.25 and 0.75 (there's a code version of this sparse grid just after this paragraph). 0.25 is not terribly likely, because there was more water than land in the sample, but it's not impossible either, because you could have just gotten a weird sample, right? It's the relative numbers of ways: you could get six out of nine even if 0.25 were the true value. 0.5 no longer gets all the posterior probability, and 0.75 is actually slightly more probable than 0.5 now, because there was more water than land, while 0.25 is much less likely. It's starting to take on the shape we had seen before. For 10 values, it looks better, right? For 20 values, it's pretty hard, at least for the human eye, to see the difference, and for most applications it wouldn't make any difference. For a thousand values, the points all overdraw one another and you get a very smooth curve. So this is why I say arbitrarily precise: you can pack this grid as densely as you want and get it as good as you want. In this case, 20 is plenty good, because the curve doesn't change shape very rapidly, so you don't need a lot of points to draw it. But this is what the grid is: the grid is the spacing of the values of the parameter that you choose. Does this make sense, in a cartoon way? You're gonna really get this when you do the code; that's how you always get this stuff, and we will do the code today. Let me give you the introduction to quadratic approximation, so you know what it is. It gets introduced in chapter two; you're told about it. In chapter four, you're gonna start doing it, so you don't have to worry about doing it right now. But this is a very important approach. The fellow on the right is Pierre-Simon Laplace, who is often credited with the invention of it. As always, the person credited with the invention of a thing is almost certainly not the person that invented it.
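Here, as a minimal sketch in R, is the three-value grid just described, assuming the same data as before, six waters in nine tosses, and a flat prior. The variable names are a guess at the slide code, not the slides themselves:

p_grid <- c(0, 0.5, 1)                            # the three conjectures for the proportion of water
prior <- rep(1, 3)                                # uniform prior: all ones
likelihood <- dbinom(6, size = 9, prob = p_grid)  # likelihood of 6 waters in 9 tosses at each conjecture
posterior <- likelihood * prior                   # relative plausibility, unstandardized
posterior <- posterior / sum(posterior)           # standardize: comes out (0, 1, 0)
# For the five-value grid, use p_grid <- c(0, 0.25, 0.5, 0.75, 1) and prior <- rep(1, 5);
# then 0.75 ends up with a bit more posterior probability than 0.5.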
But regardless of who invented it, Laplace did a lot of awesome stuff for Bayesian inference, and the quadratic approximation is also sometimes called the Laplace approximation as a consequence. So what is it? The quadratic, or Laplace, approximation is the assumption that the posterior distribution is normal, that is, Gaussian. We'll work with Gaussian distributions quite a lot next week, and most of you have heard of normal distributions before. When you make that assumption, it's very convenient, because normal distributions are completely described by exactly two numbers: their mean and their standard deviation, or variance if you prefer. And so in the Bayesian context, we say that the posterior distribution, if it is normal, can be completely described by two numbers. One is its peak, which is called the maximum a posteriori, or MAP, as we're gonna call it. That's the top of the mountain that is the posterior. And the standard deviation of that same distribution is the other number we need. This simplifies the calculations quite a lot, for reasons I'll focus on more next week. Basically, it reduces the algorithm to finding the peak, so it becomes an optimization problem, and computers are great at optimization. The engine just keeps multiplying likelihoods by priors and climbing uphill. That's all it does. It's like a near-sighted climber on the mountain here: it climbs until it gets to the top, then it figures out the curvature at the top and assumes that curvature continues out forever, which is what makes the shape Gaussian. It can't see very far, but it assumes. That's why it's an approximation. And then it's done. Now, it turns out that for lots of models you get exactly the right posterior this way, and for lots of other models you get a perfectly good posterior this way. It's unreasonably useful in lots of cases. That said, in the second half of the course I'm gonna give you lots of cases where it's unreasonably unsuccessful as well. So just hang on; we're gonna use it a lot starting next week (there's a small mechanical sketch of it after this paragraph). Oh yeah, and I wanna say: maximum likelihood estimates are effectively identical to this, not logically identical, but practically identical, as long as you use flat priors. Okay, so let's transition into chapter three, where we start to do calculations, which is a reminder that you should install R on your machines, if you haven't already, and the rethinking package from before. This is not a mathematical statistics course. So in order to teach you Bayesian inference without teaching you math stats at the same time, we need some computational substitute for doing integral calculus. Why? Because integral calculus is the way that you add up things that are in an infinite number of infinitesimally small boxes. It's just summation over some range of infinitesimally small boxes; that's what integrals do. And in probability theory, integrals are nearly always some way to do averaging. You sum things up so you can calculate averages. You might think, well, what does an average have to do with summing? You compute averages by adding up all the things and dividing by the number of things. Integrals sum up all the things and then divide by the number of things too; it's just that the things are infinite in number now, right? Which actually makes it easier. So we're gonna compute our integrals computationally, by tricking R into doing the calculus. And the benefit of this is that it will be much easier conceptually for people who are nervous about integral calculus.
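And since the quadratic approximation just came up, here is a minimal mechanical sketch of it in base R rather than the course package, so nothing is hidden. It assumes the globe-tossing data, six waters in nine tosses, with a flat prior, so the prior drops out of the log posterior:

neg_log_post <- function(p) -dbinom(6, size = 9, prob = p, log = TRUE)  # flat prior: likelihood only
fit <- optim(par = 0.5, fn = neg_log_post, method = "L-BFGS-B",
             lower = 1e-5, upper = 1 - 1e-5, hessian = TRUE)  # climb to the peak, then measure curvature
map_estimate <- fit$par                      # the MAP, about 0.67, that is, 6/9
posterior_sd <- sqrt(1 / fit$hessian[1, 1])  # curvature at the top gives the sd, about 0.16
# The quadratic approximation is then the Normal(map_estimate, posterior_sd) distribution.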
We'll get to the integrals as we go. So what am I showing you on the screen here? This is a page from a book of random numbers from early in the 20th century. Mainly it's a book for experimental design, for designing agricultural field trials. The idea was you pick one of these boxes by random number from another page, and then you place your treatments in the boxes by the random numbers that are shown. What's interesting about this is that prior to this era, prior to the late 1800s and especially the early 1900s, the idea of doing something like this was just insane. The idea that randomization was a way to learn about the world would have struck the ancients as madness; they might have killed you as a public menace for suggesting it. Chance was not the personification of wisdom. As I talk about in the book, the Greeks and Romans had two goddesses for these things. One was the personification of chance: this is where the Wheel of Fortune actually comes from, from Fortuna. She held this Wheel of Fortune, and she liked to trick people by letting them ride up the wheel and then watching them fall off and land badly. So the Wheel of Fortune was not a good metaphor, right? It's a bad thing. You don't want to ride the Wheel of Fortune; that's the idea. And in opposition to Fortuna was Minerva, born from the head of Jupiter, right? You remember the story. They were in opposition; they were really different. If you wanted to learn about the world, you hung out with Minerva and prayed to Minerva. If you were desperate and wanted to take your chances, you rode the Wheel of Fortune. And the Wheel of Fortune would always kill your kids later, or something like that; that's how those stories always went. So when science started using random numbers to actually learn about the world, to do Minerva's work, it was a big shift in how people reasoned about their interaction with nature. Now all of you accept this; it makes a ton of sense. Randomization is often talked about as this gold-standard way of designing experiments, and it has many great features. But it is a historically quite novel thing. And to help you in this class process all this uncertainty, we're gonna use random numbers to learn about the posterior distribution. We're gonna sample from it randomly. And we'll do this because it will let you do integral calculus without realizing you're doing integral calculus (there's a one-line illustration of that after this paragraph). That's why I teach it this way. And the bonus is that when you start using Markov chain Monte Carlo to fit your models, this is the only option, because you don't have any description of the posterior distribution except random samples from it. You never get the distribution any other way. And yet it's efficient. So we're gonna work with samples from the start, before we have to. Halfway through the course, we'll have to work with samples, but by then you'll already be really good at it, and you'll only have had to learn one thing at a time. So we're gonna sample random values from the posterior. I'll show you how to do that, and then I wanna show you some examples of how to use that to visualize the uncertainty that's embodied in these relative posterior probabilities of parameter values, and to compute confidence intervals and simulate observations. So, all right. Yeah, I made this little picture right before I came over. It's supposed to be the evil integral symbol.
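To see the integrals-as-averages idea in one concrete line before the real code: this sketch leans on a fact the lecture doesn't state, namely that the exact analytical posterior for six waters in nine tosses under a flat prior is a Beta(7, 4) distribution. The posterior mean is an integral, but with random samples it's just an average:

draws <- rbeta(1e5, 7, 4)  # random draws standing in for the posterior
mean(draws)                # approximates the integral defining the posterior mean; exactly 7/11, about 0.64
mean(draws < 0.5)          # the probability mass below one half, another integral, done by counting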
Yeah, so people don't like integrals and they panic. But you're gonna be doing these things without realizing it, so you'll have defeated the serpent. Those of you who know Trogdor: this was supposed to be Trogdor. Anybody know Trogdor? My postdocs didn't, and I'm gonna demote them. So. Okay, so how do we sample from the posterior? Here's the recipe. One: compute or approximate the posterior your chosen way. Two: sample with replacement from the posterior. When you use MCMC, you start at step two, because samples are all you get; that's how it works. Three: compute stuff. This is like those internet lists, right? Where there's a question mark and then profit at the end, and no one is sure what the intermediate step is. But I'll show you what I mean by compute stuff as we go. We've got 15 minutes, so we're in good shape here. First thing: compute the posterior. This is how you do the grid approximation in R code. I'm gonna walk you through it, and it really is just five lines, each doing one particular job (the lines are reproduced just after this paragraph). The first line defines the grid: the list of values of p you wanna compute likelihoods for. That's all it does. seq is the sequence function in R, and it gives you a list of values from zero to one, in this case a thousand of them, equally spaced. That's your grid of values of p. In the next line we define the prior: a thousand values of one, one for each value of p in the grid, because every value has the same prior probability of one. That's all that does; rep means repeat. There's the prior, plotted: it's just all ones. Then we make a list called likelihood, using dbinom. The nice thing about R is that it works on whole lists of values at once. The grid symbol in this line of code has a thousand values in it, so when you pass it to dbinom, it returns a thousand answers, one for every value. This is why we use R to do statistics, right? Because it obviates writing a bunch of loops. Then the posterior is proportional to the product of likelihood and prior. So first we multiply the two, and then we standardize by dividing each value by the sum of all of them. And that's all there is, okay? Now, you'll wanna run through this on your own. You don't have to understand every little detail, but if you look at the intermediate answers as you go through the code, it'll help you understand it, which is why I've plotted them here. Then we can randomly sample. We don't strictly have to in this case, since we've got the posterior, but I wanna train you to do the random sampling and to work with samples, because it makes a lot of summary tasks easier. So we make a list called samples. Using the sample function, we sample parameter values of p in proportion to their posterior probability: the probability of drawing each p is equal to its posterior probability, which we computed on the previous slide. Just what's up top. 1e4 means 10,000, right? It's easier than writing 10,000. And replace = TRUE means it's with replacement: the fact that we've drawn 0.3 once doesn't mean 0.3 is out of bounds now; it can come back, in proportion to its probability. Then what I've plotted down here is the samples.
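Here are the lines just described, reconstructed from the walkthrough above; the variable names are a best guess at the slide code:

p_grid <- seq(from = 0, to = 1, length.out = 1000)  # the grid: 1000 equally spaced values of p
prior <- rep(1, 1000)                               # flat prior: a one for every grid value
likelihood <- dbinom(6, size = 9, prob = p_grid)    # probability of 6 waters in 9 tosses at each p
posterior <- likelihood * prior                     # multiply likelihood by prior
posterior <- posterior / sum(posterior)             # standardize so the values sum to one
samples <- sample(p_grid, size = 1e4, replace = TRUE, prob = posterior)  # 10,000 draws from the posterior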
So the graph on the left is just the stream of samples, where the horizontal axis is the sample number and the vertical is the value. I plotted them with some transparency because they overlay a lot; there are so many of them. You can see it's like looking down on top of the posterior distribution. It's like a mountain, with a peak right around six out of nine. And if we look at this mountain in silhouette, which is what we get on the right: if you wanna plot densities in R, there's a density function which calculates them, and my rethinking package has an abbreviated function called dens, which will just make these plots using density. So that's what you can use to do it (there's a short plotting sketch after this paragraph). It's looking at the silhouette: if you looked at this mountain from the side and kind of compressed it together, you'd see its outline, its altitude variation. And that's the posterior distribution. You'll notice it's jiggly, because these are samples, right? But we're not launching space shuttles, so the imprecision from the sampling is not gonna make any O-rings crack, and there won't be any national tragedy. I don't know if people remember the Challenger; the joke is getting old. Yeah, okay, sorry, I'm a child of the 80s. I was one of those kids sitting in a classroom when the Challenger exploded, right? Question from the class: can you go back a slide? Sure. In the prior line, what's the 1,000? That's how many: you repeat one 1,000 times. Okay. Yeah. If you have questions about how these functions work in R, put a question mark in front of the function name and it'll give you the help page, which tells you what all the arguments do. So just ?rep at the R prompt will give you the help page for the rep function and tell you what that second argument is. Okay, so then we usually want to compute stuff, given these samples. Having the samples themselves may make you feel like you've achieved something, but you really haven't yet. Usually you still have a question, and this is the frustrating thing about doing statistics: the Bayesian model gives you the posterior distribution, and that's its answer. And then you have to, like a counselor, sit down with that answer and figure out what it means. Does it mean your parents rushed your potty training, or whatever. The machine is not exactly human, so it won't answer your questions directly, and a lot of skill is required to get at these things. For these simple models, it's not so bad, though. So let me run you through the standard things. Usually we have direct questions like: how much probability is below or above some particular value? We're gonna do a summary task like that in a moment. Or: which parameter values contain 50%, 80%, 95% of the posterior probability, which you can think of, crudely, as weight of evidence? These are often called "confidence" intervals. I put confidence in quotes because lots of Bayesians don't wanna use confidence in this context, since Bayesian intervals are not the same as non-Bayesian intervals. That's the last thing I'm gonna say about that, because lots of Bayesians, including Andrew Gelman, use confidence all the time to refer to Bayesian intervals, so it's just semantics. But just remember: Bayesian intervals are not the same as non-Bayesian intervals. They refer to different things. If you have questions about that, I can get to it when there's an example in class to worry about.
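A minimal sketch of the two plots described above, in base R; the dens function from the rethinking package is a shortcut for the second:

plot(samples, pch = 16, col = rgb(0, 0, 1, 0.1))  # the stream of samples: the mountain seen from above
plot(density(samples))                            # the silhouette: a smoothed density of the samples
# With the rethinking package loaded, dens(samples) draws the same kind of plot.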
Another question: which parameter value maximizes the posterior probability? I wanna convince you, though I may not get to it today, that that's not an interesting question. No unique point summarizes the whole distribution. Even when you're doing MAP estimation, the top is just a convenient way to locate the center of the distribution. It's not special, because there are lots of values near it which are nearly equally probable. In the end, you decide the question, so there's no unique way to report the results. But usually we're interested in things like, for example: how much of the posterior probability is, say, below a half? What I've drawn up at the top here are intervals of defined boundaries. We wanna get the blue region there. And if you've got samples from this distribution, you just count the number of samples below a half and divide by the total number of samples. I'll say that again. You want the blue area in the left-hand graph there, and you have samples from the posterior distribution. Count how many samples are less than a half; your computer will happily do that, it's good at it. Then divide by the number of samples. That's what I've done in the first code box on this slide: sum the posterior where the grid is less than a half. That one, sorry, is for the grid-approximate posterior, computed directly. Then, counting up samples, we sum the samples that are less than a half and divide by 10,000, and we get the same answer within the error, close enough for government work (both versions are written out after this paragraph). Does that make some sense? When you execute the code, you'll force yourself to understand it, and not until then. But this is how you get it. You ask a simple question. It's like walking into a bar and asking: what proportion of the people are shorter than five feet tall? Well, you count the people that are shorter than five feet and divide by the number of people, right? This is the same thing. I'm asking: what proportion of the posterior probability is below a half? Count up all the samples below a half and divide by the number of samples. It's the same kind of question. So think of it this way: it's discrete counting rather than trying to do integrals. The math stats way is to calculate the definite integral between zero and a half. The one on the far right there is also a definite integral; that's a case where you want to know how much is greater than a half and less than 0.75, and there's an area there. The code changes just a little bit, and the example's in the notes, so I encourage you to take a look and see how the code changes. And you can ask a bunch of questions this way, depending upon your interest. Yeah? So the question was: how do you interpret the 0.17 here; is it the probability that the parameter value is less than 0.5? The logical interpretation is: given the initial state of information you programmed into this statistical model, that is, that prior, these are the relative plausibilities after seeing the data. So 17% of the relative plausibility, the weight of evidence, is in that interval, that is, below a half. It is not a statement about whether that's true in the world.
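The two code boxes as described, using the grid and samples built above, plus the interval from the right-hand plot:

sum(posterior[p_grid < 0.5])               # grid version: posterior probability below one half, about 0.17
sum(samples < 0.5) / 1e4                   # sample version: count the draws below a half, divide by the draws
sum(samples > 0.5 & samples < 0.75) / 1e4  # the defined-boundary question from the right-hand plot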
You could say this is a calibration issue, which is gonna have to wait until I get to calibration, so I'm punting on it a little bit. But the logical answer is: it's the proportion of the posterior plausibility that is in that interval. If you give the machine some initial set of assumptions and then it sees this data, that's what it thinks afterwards. Yeah? So where your question is going, though, is: as we get more and more data, what do we expect these probabilities to do? They do amass on the true value. Yes, they do. Regardless of the prior you have, well, I should say for any reasonable prior, the priors we're gonna use in this class, as you get infinite amounts of data, the posterior becomes infinitely concentrated on the true value. That isn't necessarily what you want in the short term, though, because we never have infinite amounts of data. So I'm gonna punt on that a little bit too. Yeah? The question was: this seems great if you don't wanna do integral calculus, so when would you need integral calculus? You don't, necessarily. Integral calculus will give you a more precise answer, and you won't have to do as much computation. In practice, you can't do the formal integral calculus for any of the models we'll be interested in later in the course, so it doesn't much matter; it's just more precise, and it'll save you computational work. The sampling approach will have Monte Carlo error, randomization error, to it. So if you were launching space shuttles, you might worry about that. For us, doing ecological applications, it's probably not a big deal. That's a good question. All right, I've got four minutes. The other kind of boundaries, which are usually what people are more interested in, are what I call intervals of defined mass. People sometimes ask: I'm interested in how much probability is less than zero, say; I want to know the probability that the parameter is above zero or below zero. That's a good question, and people often ask it; it's the same kind of defined-boundary question as the 0.5 example just before. But much more common is trying to say: where is most of the relative plausibility? So we use these things, which are called either confidence or credible intervals, to summarize that. They're regions: the high-probability regions of the posterior distribution. And we do this just to summarize, right? If you wanted to give the whole information, you'd just report the whole distribution to your colleagues; you could give them an R dump of your samples if you want, and people who fit Monte Carlo models do exactly that. But we have these intervals of defined mass. We say: show me where the highest 80% is, right? And usually these are middle intervals. So I wanted to show you the difference: the lower 80% and the middle 80% are different intervals (a two-line example follows this paragraph), and usually we want the middle, because that's where most of it is; the most plausible values are around the most probable point. And it's just a summary. There's nothing magical about it, nothing special about the 80%. It's just a way to summarize for your readers where most of the weight of evidence is. You can use higher values too, up to 99%. So there are two kinds of intervals, and unfortunately you need both. One is called a percentile interval, and you'll read about the distinction in your notes.
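The lower-80% versus middle-80% comparison, sketched with base R's quantile function on the samples:

quantile(samples, 0.8)          # lower 80%: the boundary below which 80% of the probability sits
quantile(samples, c(0.1, 0.9))  # middle 80%: boundaries leaving 10% of the probability in each tail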
Percentile intervals put equal amounts of probability in each tail. So for the middle 50% interval I'm showing you here, for this very skewed posterior distribution, the most probable value is not even in the interval, which is bad, right? It's like somebody giving you really terrible advice: pointing you at values that are less likely than a bunch of stuff they left out. Not the greatest advice. It happens routinely with highly skewed distributions like this one. If the posterior distribution is symmetrical, then the middle is a perfectly good summary. My advice is that if you've got a posterior distribution that's skewed like this, you don't want to use intervals at all, because intervals are just terrible summaries then. You probably just want to plot the distribution, so people know where the uncertainty is. There's another kind of interval, called the highest posterior density interval, which always contains the peak, the highest point in the distribution. R will calculate both of these for you, and I provide functions in the rethinking package to do it. They're called PI and HPDI; you give them lists of samples, and they figure out where the high-probability regions are just by counting. That's all they do; that's all that goes into it (there's a two-line example at the end of this section). So before you guys go, let me say a little bit more; I'll take an extra 30 seconds, is that okay? You're lovin' this, right? You'd stay all day, I'm sure. So there's this attraction to point estimates, which I think is baggage. It comes from past horrifying experiences in undergraduate physics courses. And in other traditions, point estimates are very important, because a point estimate is the only kind of estimate you get: you get a collection of point estimates, and the scatter among them is the only measure of uncertainty you get. That is not our business here. Our estimate is the posterior distribution, and it's not a sample from some population of posterior distributions. It's the posterior distribution for your model and data. I'll say that again: it is not a sample from some population of posterior distributions with some scatter to them. It is not an estimate in that sense. It is uniquely and logically defined by your model and data, right? So no point in it is of any particular interest, unless you have a scientific interest in that specific point. Say you have a hypothesis that says the Earth must be covered 0.71 by water; then you're focused on the 0.71. Otherwise, there's nothing special about any point, okay? I say much more about this in chapter three of the notes. If you do care about particular points, you need to do formal decision analysis, and we're not gonna do that in this course, but I give you some examples of how it's done in the notes, so that when you come to a point where you might wanna do that, like applying a restoration policy or something, there is a way to take a whole posterior distribution and use all the uncertainty in it to make an informed decision based upon cost-benefit analysis. We're not gonna do that in this course, I apologize, but that's when point estimates become more interesting, because you need some way to decide how you're going to behave, right? Okay, read the predictive-check section of the notes, because the last part of your homework asks you to go through that simulation exercise, and I apologize for not getting to it yet, but my excuse is that I like to spend more time in the lectures on the conceptual bits.
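And the two interval functions named above, from the rethinking package, both applied to the list of samples:

library(rethinking)
PI(samples, prob = 0.5)    # percentile interval: leaves equal probability out of each tail
HPDI(samples, prob = 0.5)  # highest posterior density interval: the narrowest region holding 50%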
I'll start with the predictive checks next time, though, because I wanna make sure I don't sell them short. Thank you for your indulgence. If you have questions about the homework, please email me, okay? Thank you.