Test, yes. Are you recording? Good. Am I in the shot? Wonderful. I think we're missing... aha, a few missing people. Okay, I'll get started. I'm sure someone else is going to come in, and then you can all point at them when they walk in. Okay, welcome back. This is Statistical Rethinking, I'm Richard, and we're in the last week. This has been a long struggle uphill. We've gone through a bunch of material. In the most recent weeks, we did multi-level models. Multi-level models were the goal for most of you, as for me. What I'd like to do now, in the last week, is show you how some of the conceptual and modeling tools you used to build multi-level models open up a world of other, exotic sorts of models that are, in essence, also multi-level, but whose purpose is different from merely achieving partial pooling among exchangeable clusters. They do accomplish that as well, because pooling is just a thing that happens when you set up the model right. It's not something you have to specifically tell the model to do. If you set up assumptions that make sense and establish relationships among variables, you will get pooling, and you will get it because it is rational. Now, if the model is bad, then that rationality will lead you into a false world. As I keep saying, there are small worlds and large worlds. But pooling is a general phenomenon. It's not exotic at all. It just arises from the way learning works. It just happens. It's a weird historical artifact, I think, of the way statistics is taught that it's presented like it's something strange. But it's regression to the mean. That's what it is, the same thing you get in an ordinary regression. Okay, so with that, we're going to talk about missing data and other opportunities.
These are things you can do with the concepts you learned for multi-level models: push them through Markov chains and be more inferentially honest in your scientific work. So before we get to the boring computational part and the examples, let me take you through a couple of general points to set up this material. So, forget statistics for a second and think about pancakes. These are pancakes. Yeah? Possible? Yeah, so my art is not always persuasive. These are pancakes. And pancakes have two sides, I assert. We'll ignore the edge. Two sides. And I want you to imagine that there are three pancakes that have been prepared. Maybe I tried to make some pancakes for you, and initially the skillet was too hot. So the first pancake I cook is burnt on both sides. That's what the black hatching is supposed to represent. So there's the burnt pancake. Second pancake: the skillet is now cooling down. The first side ends up burnt and the second side is just right. So now we've got pancake number one, which is burnt-burnt, and pancake two, which is burnt-correct. Burnt, not burnt. Unburnt. I'm not sure what the word is. And then pancake three is exactly right. It's not burnt on either side. These are the three pancakes. Now I serve you a pancake, and you don't get to see which one it was, because you weren't in the kitchen. I just bring a plate out to you and put it down. And what you see is that the up side of this pancake is burnt. Now I ask you a question, because you've been in my statistics course for 10 weeks now. Tell me: what's the probability that the other side of this pancake is burnt? Now, you don't want to eat this pancake. I understand that. But you do want to answer this question. Why? Because it represents the structure of lots of inferential problems. It's a way to help yourself understand things. This is a version of a fairly famous logic and probability puzzle, and I give you citations for its history in the textbook, in the chapter.
It's not easy, actually. That's why it's famous. It's got a simple setup. You can hold all the information in your head. And yet your intuitions... I shouldn't say you. Most people's intuitions are wrong. So what's your intuition? Now I've just told you it's wrong. Most people's intuition, according to all the data, is, just like mine the first time I saw this problem, that it's one half. Why? Because there are two pancakes that have burnt sides. One of them is burnt on both sides, and the other is only burnt on one side. So the probability that the other side is burnt is a half. That is wrong, as I've told you. It's wrong. But that's the intuition we get. What's the right answer? Well, we'll work up to that. The question is, how would you figure this out? Figure out that it's wrong. And the lesson I want you to take away from this is not that we're bad at this. Because you knew that. We're all bad at this. We're humans, right? So this thing I keep saying: computers are good at things we're not good at, and they're bad at things we're good at. Yeah? They're good at chess, bad at walking, right? It's hard to teach a robot to walk. That's cutting-edge science, making a robot walk. Making it play chess? No problem. I think chess and Go are, at the top level, essentially played among computers now. People are just out of it, right? It's just uninteresting. So how should you figure these things out? Well, stop trying to be clever, is what I want to encourage in everybody. Don't use your intuition. Just stop. Don't even try to be clever. Just use the logic and figure it out. That's what's great about Bayesian inference. It's just logic. It's just extended logic. That means it's garbage in, garbage out. That's what logic is. But it means you don't have to be clever. And the way these probability puzzles are often presented in books, it's as if they're solved through some clever realization, some kind of mathematical insight.
And so I think this teaches the wrong lesson. It teaches you that you need to be fantastically clever, like the author of your textbook, to figure these things out. The author of your textbook was not fantastically clever. They are just presenting themselves that way. How do we actually figure these things out? We use cold, brute-force logic. That's what we do. We use probability theory. That's the idea. Just avoid being clever. If you avoid being clever, you will appear very clever to people who are trying to be clever. So intuition is a terrible guide to probability. It's not intuitive at all. But you don't need to be clever to solve these problems. You just need to be, as I say here, ruthless. Just apply conditional probability. Conditional probability has very few rules. They're the laws of probability. You already know them. I'll remind you of them in a moment. And when you're fitting statistical models in this class, this is all you're using as well. When you set up the model definition, you're defining all the assumptions. And then the laws of conditional probability let you find the implications of those assumptions. The implications are already there. You've stated them. To intuit them is very hard, but it's very easy for logic to draw them out and represent them. All the work we've done in this course is really just that. You don't have to be clever and realize, for example, that you need pooling. You get pooling for free. It emerges from your assumptions. It isn't something where you say, I want pooling here. No. If you specify a population of cafes, then you get pooling. And then you have to understand, and this is the hard thing for us mortals, why it happens. Why is that a necessary consequence of the assumptions? And that's incredibly useful. So the people in my department know I didn't start off doing statistics in science. I started off doing evolutionary theory. And it's the same business there.
We make assumptions about evolutionary contexts and environmental contexts, and then we'd like to deduce the implications of those assumptions. And we learned a tremendous amount through those exercises, even though the consequences are, of course, just consequences of the assumptions. They're already there when you state the assumptions, but you can't see them. And that's why we need the logic: to reveal the consequences of assumptions that are not at all intuitive. It's the same business in statistics. Okay, that's sermon number 73, I think, in this course, right? So let's figure this out. What does conditional probability mean? Conditional probability is the probability of something you want to know, conditional on something you already know. This is our goal. This is a posterior probability, right? This is all the things we've done before. We'd like to know the plausibility, the probability, of this thing, the state of the world that we want to know about, conditional on something we do know about. Now, sometimes the things you know about give you no information, but you figure that out through the same procedure. So it's all the same business. So let's do this with pancakes, okay? You all like pancakes. Yes, a nod, thank you. I thought that was a very good nod. I appreciate that. And everyone else is kind of halfway nodding. Pancakes are excellent, people. You should be enthusiastic about this. So here's our pancake. It's a mystery pancake. You cannot peek under the pancake. You just have to use logic to figure this out. We want to know the probability the down side is burnt. What we do know is that the up side is burnt. So what's the probability the down side is burnt, conditional on the up side being burnt? That's the question. And we know from the rules of probability that that is defined in terms of two other things that we might be able to calculate.
The joint probability that the up side is burnt and the down side is burnt (that is, the probability the pancake is burnt on both sides; that's how you can read it), divided by the probability the up side is burnt. That's just the definition of a conditional probability. That's all it is. If you ever forget it (sorry, I'm not tall enough to reach my own slide here), if you move the denominator on the right-hand side over by multiplying both sides by it, the joint probability is just equal to the conditional times the probability of the thing you're conditioning on. That's an easy way to remember it. And then you can just divide and move it back under. Yeah? Does that help? Sure. Yeah, whatever you say. You're going to get good at this, because Anna is going to drill some of you folks on this soon. Anna will make you dream in probability theory. So there are really like three rules. That's it. That's almost all of it. So let's start filling in the stuff we know. We need the denominator here: the probability the up side is burnt. There are three states the world can be in, the three pancakes, and for each we need the probability that the burnt side is up. If it's the first pancake, burnt-burnt (BB means burnt on both sides), then the probability the up side is burnt is one. If it's instead the burnt-unburnt pancake, then it's a half chance. Imagine I'm the cook in the kitchen and I'm just flipping the pancake randomly onto the plate. Either side is equally likely to be up. That's the assumption here. If it's the burnt-burnt pancake, the burnt side will always be up. If it's the burnt-unburnt pancake, it's a half chance.
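In symbols, what was just described is the definition of conditional probability, writing B_up for "up side burnt" and B_down for "down side burnt":

```latex
P(B_{\text{down}} \mid B_{\text{up}})
  = \frac{P(B_{\text{up}},\, B_{\text{down}})}{P(B_{\text{up}})}
\qquad\Longleftrightarrow\qquad
P(B_{\text{up}},\, B_{\text{down}})
  = P(B_{\text{down}} \mid B_{\text{up}})\, P(B_{\text{up}})
```

The second form is the multiplication trick mentioned above: the joint probability equals the conditional times the probability of the thing you condition on.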
And then there's the last pancake, the good one, the one you wish I had served you but did not. It's unburnt on both sides, so there's no chance we would observe a burnt side up. You with me? Okay. And then you can calculate these things. Let's assume that the prior on the pancakes is that I selected one at random. That's an assumption we need to introduce here, but we always need a prior. Priors let you average. They're required. And so that gives us a probability of one-third for each pancake. So this becomes one-third times one, plus one-third times a half, which is a half. There's also one-third times zero; I dropped that term. Okay. Let's move on. I know you're loving this, right? Some math in the morning. But I'm showing you that this is how I solve these problems. I'm not clever at all. I'm just ruthless. I just plug stuff in and go through the rules. That's the whole business. So now we can get the conditional probability, because we know both things. What's the probability of a burnt-burnt pancake? I just told you: it's one-third. Yeah? Because that's the probability that the cook selected it. So the conditional probability that the down side is burnt, conditional on the up side being burnt, is one-third divided by a half, which is two-thirds. It's not a half. That's the answer. Now we've got the answer. It didn't match your intuitions. And now the question is why? What was wrong with your intuition? And here's the moment where you can learn something. Or at least when I first encountered this, I felt like I learned something. And the thing you learn is that the mistake is to focus on pancakes. What you want to focus on is sides. Sides of pancakes. If you focus on pancakes, you note there are two pancakes with at least one side burnt, and then you count the wrong stuff. But we're asking a question about a side, not about a pancake. So we've got to count the sides.
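This arithmetic can be checked mechanically. Here is a minimal sketch in Python (the course code is in R, but the computation is the same in any language); the pancake encoding is just an illustrative choice:

```python
# Three pancakes: each is a pair of side states.
# Prior: each pancake equally likely to be served (1/3 each), and
# either side equally likely to land face up (1/2 each).
pancakes = [("burnt", "burnt"), ("burnt", "ok"), ("ok", "ok")]

# P(up burnt) = sum over pancakes of P(pancake) * P(burnt side up | pancake)
p_up_burnt = sum((1/3) * (sides.count("burnt") / 2) for sides in pancakes)

# P(up burnt AND down burnt) = P(burnt-burnt pancake) * 1
p_both_burnt = (1/3) * 1.0

# Conditional probability, straight from the definition
p_down_given_up = p_both_burnt / p_up_burnt
print(p_down_given_up)  # 2/3, not 1/2
```

No cleverness anywhere: state the prior, state the flipping assumption, apply the definition.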
What are the relevant sides that could be on the other side? Well, there are three sides that are relevant, and two of them are burnt. And so it's this counting procedure. The whole business in probability theory is a matter of figuring out what it is you're trying to count over, and then counting up all the ways it could happen. What are all the sides that could have ended up underneath? And how many of those sides are burnt? The answer is two out of three. So you should take a look through this, it's the opening to Chapter 14, and work through it yourself on paper. It's not a hard problem, but this is basically what your computer is doing automatically for you when you specify Bayesian models and it does the calculations. This is what the Markov chains are for: to obviate the need to be a clever person. You don't want to do that. Your computer is clever. You tell it what to do. Okay, so this is the whole gambit in this business. We express the information as constraints and distributions. This is how we define statistical models and logic puzzles. And then you just let logic discover the implications. When you discover those implications, you may decide quite reasonably that the model is silly. That's perfectly fair. Logic is garbage in, garbage out. That's all it does: it tells you the implications of assumptions. But the good news is you don't have to be a clever person. And in fact, trying to be clever famously results in lots of errors. Some of you will know the famous story about a newspaper columnist named Marilyn vos Savant and the Monty Hall problem. No? Yeah, some of you do. Depending on how fast I go today, maybe I'll end with that story, but I have a goal for how much material I want to get through.
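The side-counting logic can also be verified by brute-force simulation, which is close in spirit to what the Markov chains do for you. A sketch, again in Python, with an arbitrary seed:

```python
import random

random.seed(1)
pancakes = [("burnt", "burnt"), ("burnt", "ok"), ("ok", "ok")]

n_up_burnt = 0
n_both_burnt = 0
for _ in range(100_000):
    pancake = random.choice(pancakes)     # cook grabs a pancake at random
    up, down = random.sample(pancake, 2)  # flips it randomly onto the plate
    if up == "burnt":
        n_up_burnt += 1
        if down == "burnt":
            n_both_burnt += 1

# Among servings where the up side is burnt, how often is the down side burnt?
print(n_both_burnt / n_up_burnt)  # approximately 2/3
```

Note that the simulation counts sides-up events, not pancakes; that is exactly the distinction the intuition gets wrong.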
The short version is that there was a newspaper columnist who got a probability puzzle correct, and then a bunch of mathematicians wrote in to tell her she was wrong. It's a fun story about avoiding being clever. And the logic puzzle in question is called the Monty Hall problem, based on a game show that was on in the '50s or '60s. There were donkeys. Anyway, depending on how fast I go today, I can tell you the whole story. Okay, so let me give you some examples of this getting-ruthless business. We've been doing this the whole course, but there are two cases where, in my experience, it's difficult to intuit the approach: how you deal with measurement error, and a very extreme form of measurement error where the data are missing entirely. That's error too. It's hard to intuit what you should do in those cases, but you don't have to. You don't have to be clever. Just state the information, state what the error is, and state what the missingness is. And then the model automatically gives you an approach that is a consequence of the assumptions. And that's what we're going to do today. The other thing that will be useful as a conceptual introduction to this material, and it also is true with measurement error and missing data, is this thing that I like to jokingly, cheekily, call decolonizing Bayes, decolonizing Bayesian inference. So here's the basic story. Bayesians didn't call Bayesian inference Bayesian. That's a term due to Ronald Fisher, who was an opponent of Bayesian inference. And usually when we teach Bayes, we use non-Bayesian terminology to teach it. Why? Because in the early 20th century especially, Bayesians were in the minority. And people like Fisher, especially in the mid-20th century, worked really hard to exterminate Bayes. Bayesians were an embattled minority fighting against a powerful anglophone dictator, pictured here. So this is a bit tongue in cheek, right?
I don't mean this to be taken literally, but the consequence is that there are lots of awkward things about how Bayes is taught and how it's learned that are purely historical accident, and we could rewrite them given the current situation, where, if anything, in stats departments Bayes is in the powerful position. That's not true in the sciences yet, but in research statistics departments, Bayes is taken much more seriously than non-Bayesian approaches in general. But that's a flip that happened during the 20th century: an inversion of the power relationship. Dennis Lindley was a very well-known Bayesian statistician from the 20th century, and he became chair of statistics at Oxford or Cambridge, sorry, I forget the exact detail. And when it happened, people were like, oh my God, this is like the Ottomans taking over the Vatican or something. People were really in shock. How could this possibly happen? Dennis Lindley is now chair of statistics? And it was a big turning point in the history of these things. And now, of course, Bayesian approaches are taken extremely seriously, to say the least, and this whole decolonizing rhetoric doesn't seem to make much sense. But it's part of the history of what happened that we still teach Bayes using Fisher's terms. So this is useful to point out, I think, because it helps you steer around conceptual obstacles: some of the thought patterns that you developed in your earlier statistics courses, learning non-Bayesian statistics, just don't actually apply to the Bayesian framework. But I've been cheating by using those terms anyway. So I'll give you examples. Data, in the Bayesian framework, are just an observed variable. There's nothing fundamentally different, when you set up the model, about something you can observe and something you can't. You assign them probability distributions. They look the same, right, when you set them up in your model. There's no distinction.
Whether something is a parameter or data, you can't tell just from the model definition; either way, you give it a distribution. How do you know? Well, if you feed the values in, then it's data. If you don't, then you get samples for it instead. And that is something that will be very important to us today. So a parameter, the other side of this coin, is just an unobserved variable. And whether a symbol is data or parameter can change across analyses for exactly the same model. And that's what we're going to exploit today. How can it change? Because it depends on whether you observe it. If you don't observe it, you have to infer it, and the implications of the model, the assumptions you've given, allow you to infer it. You can't infer it exactly in those cases, but you can narrow down your uncertainty about it. And likewise, sometimes you do get to observe things, and then you don't have to infer them, because you observe them. But the distinction isn't made until you've made the observation. Once the observation exists, the distinction is incredibly important, because it changes how you do the calculations, how you set up the Markov chain, and all kinds of other stuff. And that's extremely important. Likewise, likelihood and prior, following on from this: a likelihood is the distribution of an observed variable. It's a prior for data. And a prior is just the distributional assignment for an unobserved variable. But they look the same in the model definition, because logically they are the same until you know what's observed and what isn't. Yeah? Now, this will be blasphemy in non-Bayesian statistics, where the distinction between data and parameter is some fundamental fact, like it's a quantum particle or something, fundamentally different. But in Bayes, well... And that's fine. I have nothing against the non-Bayesian approach. You can do lots of great work that way. It's just that this is a Bayesian course, and so my job is to help you get the right intuitions about Bayes.
And so all of this awkwardness arises, I think, from a deeper difference that is often very hard to teach. I've always struggled with it. It's that in Bayes, probability is not ontological. It's epistemological. Meaning there's nothing random in the real world in the Bayesian view of inference. Randomness is always just a proxy for our lack of knowledge about what is determining events. We use probability distributions because we can't measure, or we don't know, the deterministic processes. And so we summarize our ignorance with a big distribution. But there's nothing ontological about it. You're not claiming, when you assign a probability distribution to a variable, that the world is fundamentally random. You're not going to upset Einstein, right? You know the Einstein quote: God doesn't play dice with the universe. Anyway, we won't review the history. I'm always tempted to. Those of you who know me know you can get me ranting on history of science really fast. But I get the impression that in introductory stats courses, people are often taught that randomness is a thing in the world, a property of some process. I don't really believe that. As a scientist, I think that's wrong, absolutely wrong. And so you get misled into thinking that probability distributions are claims about how the process works. They aren't. They're just a summary of our ignorance. And that's the way we've used them in this course. And that turns out to be constructive, because it gives you permission to summarize your ignorance and then see the implications of that. And that's what we're going to do here. Anyway, so that's my little tongue-in-cheek joke about decolonization. So let's exploit these facts and move on to some examples. I think I can do all of the measurement error material in the remaining time today, and then we'll do missing data on Friday. And I'll do a course wrap-up as well, with some, you know, hopefully wise things to say about statistics. Okay?
Yeah, I know, I'm skeptical too. But measurement error. Measurement always entails error to some extent, and how that error gets into the model affects how the inferences are drawn from it. So what I want to do here is build on what you already know about measurement error, because you've been modeling measurement error the whole course, actually. There's always measurement error in these models. It's just that in the traditional, simple sorts of models we do, we don't talk about it that way. You deal with the problems of measurement automatically. But sometimes you get measurement error in less convenient ways, and then you need to know how to deal with it as well. So for example, in a typical linear regression, the sigma parameter we're estimating, that residual variance, is like a kind of error. It's a precision issue. We assume it's constant across all the observations. But what if the error isn't constant? What if some observations have a lot more error than others, because they were derived from fewer measurements, for example? That is a really common problem. Or, those of you who are psychologists: you have multiple raters, and some of them suck. So you've got inter-rater reliability. You want to use all the ratings instead of throwing them away, but you need to downweight ratings from certain individuals. How could you do that? I think this is the way you could do that. If you're willing to assign reliabilities to individuals, which psychologists in my experience are, they're very eager with such things, then you can do this. So we're going to do that first. We're going to look at an example where there's variable error across observations on the outcome in a model, and I want to show you how to set up that model. And then second, we'll finish, I think I can do this today, with a case where there's error on the predictors as well. And it can also vary across cases. That is, your right-hand-side variables are measured with error.
So we don't know the data value. Instead, we have a range of data values that could be there. And this happens frighteningly often in the statistical problems that concern me the most in my field; there are some people in the room who've fought with this. When you do anthropological field work, you rarely know a person's age exactly. It's just a fact. Why? Because in most parts of the world, the human species does not care about birthdays. Birthdays are a very strange thing, especially here in Germany. Sorry, I say it as an immigrant. There's this parasitic birthday-party rotation thing where students bankrupt themselves throwing birthday parties; you guys should figure out a better equilibrium soon. There's a better way to do this. Now, sorry, I'm going to be ethnocentric for a moment. In general, birthdays are not tracked, and most people don't know their age, in most parts of the world, for most periods of human history. There are proxies. You can ask people their age, and they'll give you a number. There are ways to do this, and anthropologists have spent a lot of time trying to develop ways to hone in and narrow down what a person's age might be, but usually there's still residual error. You can bracket by siblings and things like that. There are biological constraints on how human populations are generated, and that actually helps you a ton with this. You can do a lot, but you can rarely get the error to zero. So we deal with this problem routinely. And there are some analyses that put in fixed ages as if they were known. You might be a little concerned that those results are anti-conservative, just a little bit. Same for the primatologists in the room. This is a problem that I know you worry about. Primatologists actually worry about this much more than anthropologists in general do, to their credit. So, error on the outcome. Think back to the Waffle House data. You remember these data from chapter 5? Yeah. You guys... It's my book.
I should know, but I think it's chapter 5. One of those chapters. When we introduced multiple regression. So the basic idea is we're trying to predict the divorce rate in various states, and we have various things to do this with. We're not going to use the Waffle House variable in this example, but it turns out Waffle Houses are correlated with divorce. Instead we'll look at the median age of marriage, which is one of the better predictors in this data set. Sociologically speaking, to remind you: the earlier people get married, the more likely they are to divorce. Fill that in with whatever causal hypothesis you like, but it's kind of true. And in those original data, I ignored it back in chapter 5, but there's a column for the standard error on the divorce rate. The divorce rate is estimated from a finite sample in each state. That's how these official statistics in the States are done. It would be very expensive to do total surveys, and those of you who've spent time in the United States know the United States is barely integrated better than the European Union. Right? So now I've insulted two political assemblies at the same time. But what I mean is that the individual states in the United States keep their records in completely different ways. You cross a border into a whole new world of record keeping, or the absence thereof. Marriage can be legally a completely different entity. It's like a bunch of little countries with open borders, which is why I make this comparison to certain parts of the European Union. And so if you try to get official statistics like this, sometimes it requires some creativity, and finite samples are often taken within the states. So responsible sociologists report standard errors on these measurements: the standard error of the estimate.
So we've included these data in this data set, and the thing that's interesting is that there's lots of heterogeneity in the error. For some states, the estimates are much more precise than others. Why? Typically, in the larger states you can estimate the divorce rate better, because each year there are only so many divorces. We're trying to talk about the average divorce rate in the long run, over some window, say a decade. Obviously in the really long term it changes, it's not a constant, but over some window. Given the same window of data, in a big state you've got a lot more data with which to estimate the divorce rate than in a small state. In a state like Utah, which is not only small but where the divorce rate is low, it'll be very hard to know what the rate is in the long run. From the survey results, the standard errors have been calculated, and I give them to you. In a future course, maybe we'll start with the raw survey results and build all the way out. Then you'd have a multi-level model: you just start with the individual counts and the population sizes and build it all the way out. Maybe another time. As a consequence, here I'll show you on the right the way the data look. On the top, divorce rate with line segments to indicate the standard error, plus or minus one standard error on both sides. You'll see that some of those bars are much bigger than others. That's the point. On the bottom, what I'm trying to show you is that if you plot those divorce rates, with their errors indicated, against the log population of the state, you see that there's a very strong relationship. In big states, each year gives you a lot more information about the divorce rate. And the same goes for every other rate: the mortality rate, the birth rate. The sample size is bigger. Yeah, you with me? Is this exciting? Okay. I'm excited by this. I love this. This is cool stuff. This is the music of the spheres.
This is what drives the universe. Can't you hear it, the celestial music, right now? I've been reading about Gauss, so I'm thinking about the celestial music all the time. So let's put the error in the outcome first, because I think this is the easier one to think about. Focus on the divorce rate variable. The key insight is to realize that we don't know the actual divorce rate in any state. So that's a parameter now. We have to estimate it. What we do have as data is an estimate of it, and that estimate has some error. So the thing to realize is that our observation of the divorce rate in a particular state is going to have some distribution. It will probably be normal, because of the central limit theorem, at least if you're not willing to impose other assumptions. The normal, remember, is the maximum entropy distributional assignment. It is the most conservative. It spreads probability as evenly as possible, consistent with whatever assumptions you've made. So that's the reason to use it: it's conservative. Then there's some true value we don't know. That's the center of this Gaussian distribution. So imagine you've taken a sample from a particular year from this distribution. The central tendency is the true value, because the errors are symmetric on both sides. So we want to know the mean, the location of this thing, but we haven't observed it. Instead we've got one sample from this distribution. Yeah? And it turns out that's enough to learn stuff, as I'll show you. And then we have this other thing that's observed, which is the standard error. So that's observed, and the divorce estimate is observed, and the true rate is not. The true rate is what we want on the left-hand side of our regression. It's now a parameter inside the distributional assignment. You with me? Yeah? We'll stick it all together. Here's the full model. What I want you to notice is that the estimated divorce rates appear in two places.
At the very top of this model, it looks like a linear regression, except the thing on the left is a parameter, something we haven't observed. But that's what we want: the regression on the real divorce rate, which we can't see, not on the observed divorce rate, because we know the observed divorce rate is measured with error, and we have information about that error. Yeah, this seems crazy, right? It's not. It works. So the top part looks like a linear regression: the true divorce rate is normally distributed with mean mu and standard deviation sigma. Then we have our good old linear model right there; it needs no introduction. And then we assign a distribution to our observations. This also looks like a regression model, except we're not predicting it with anything other than the identity of each state. Each state's observed divorce rate comes from a normal distribution centered on some true average divorce rate for that state, with the standard error of our observation, which is a product of the way the observation is made. That's why we can put a known value in there. Does it make sense? Does it make enough sense? And then some priors. All right. Sorry, I have all this annotation and then I just tell you stuff before I animate it. So this is what I just said. Now here's our chance to talk about decolonizing Bayes. Think about the vocabulary distinctions and the awkwardness here. Whether something is a likelihood or a prior depends upon whether the thing it describes is observed or not. We used to call the top line in this model a likelihood, and I'm happy to still call it that, except the symbol on the left is a parameter now. So it's actually a prior, right? But it functions as a likelihood for our inferential goals. You see the awkwardness of this. The terms likelihood and prior actually come from a distinction which is fundamental to non-Bayesian approaches.
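Because both pieces of this model are Gaussian, you can see the mechanics in miniature. If you condition on the regression prediction mu and its spread sigma (in the full model these are inferred jointly, not fixed, so this is only a sketch with made-up numbers), the posterior for a state's true rate is a precision-weighted average of the observation and the regression line:

```python
def true_rate_posterior(d_obs, se, mu, sigma):
    """Posterior for the true rate D_true when
       D_obs  ~ Normal(D_true, se)    (the measurement)
       D_true ~ Normal(mu, sigma)     (the regression line, acting as a prior).
    Both are Gaussian, so the posterior is Gaussian with a
    precision-weighted mean."""
    w_obs = 1.0 / se**2      # precision of the observation
    w_reg = 1.0 / sigma**2   # precision of the regression prediction
    mean = (w_obs * d_obs + w_reg * mu) / (w_obs + w_reg)
    sd = (w_obs + w_reg) ** -0.5
    return mean, sd

# A noisy state (big SE) gets pulled strongly toward the line at mu = 1.0;
# a precisely measured state barely moves from its observed value of 2.0.
noisy_mean, _ = true_rate_posterior(d_obs=2.0, se=0.8, mu=1.0, sigma=0.5)
precise_mean, _ = true_rate_posterior(d_obs=2.0, se=0.1, mu=1.0, sigma=0.5)
print(noisy_mean, precise_mean)
```

This is the shrinkage that shows up in the figures later: the weight on the regression line grows as the measurement gets noisier.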
The distinction matters in the Bayesian approach too; you have to know what's observed to do the calculations correctly. But in models like this you start to see it gets awkward real fast, and it'd be nice to have some other vocabulary, which I have none to provide. I'm sorry. It's hard to make any vocabulary stick. And then we have another likelihood, the likelihood for each observation down below. That looks like a more traditional likelihood, because the thing on the left is data. But now we also have data inside the distribution. Good times. Does this make sense? You with me? John's stare always makes me nervous. I know you're not angry, but you've got a real focus of concentration there. Alright. So how does this look in code? Exactly like I said before. You prep your data list, no problem, and set up the model. Now the estimated divorce rate is going to be a parameter, and, you'll see down here, we're going to give it some initial values so that Stan knows how long the vector should be. It's normally distributed with mean mu and standard deviation sigma, and we define mu as our regular linear model: an intercept, a coefficient times age at marriage, and a coefficient times the marriage rate in each state. Then the observed divorce rate in each state is normally distributed with mean the unknown true divorce rate, the thing we'd like to know, which we have to infer, and with the measured standard error, which is just a standard deviation. And then some weakly regularizing priors. You see the symmetry? The only trick here is that, in this way of doing the code, you need to tell Stan how many elements there are in the vector of estimated divorce rates. So I initialize this vector of parameters with the observed values. That's just the initialization of the chains; they wander from there. They move. You with me? Is this good?
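To make the structure concrete without the Stan code itself, here is a minimal sketch of the model's unnormalized log posterior in plain Python. The data values, prior scales, and variable names are invented for illustration; the point is just that the true divorce rates sit in the parameter list, appearing once as the "outcome" of the regression and once as the mean of each observation.

```python
import math

def normal_lpdf(x, mu, sd):
    # log density of Normal(mu, sd) evaluated at x
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def log_posterior(params, data):
    """Unnormalized log posterior for the error-in-outcome model:
       D_obs[i]  ~ Normal(D_true[i], D_se[i])   (measurement)
       D_true[i] ~ Normal(a + bA*A[i] + bM*M[i], sigma)
    plus weakly regularizing priors. A sampler would explore this
    function; here we only evaluate it."""
    a, bA, bM, sigma = params["a"], params["bA"], params["bM"], params["sigma"]
    D_true = params["D_true"]
    if sigma <= 0:
        return float("-inf")
    lp = normal_lpdf(a, 0, 0.2) + normal_lpdf(bA, 0, 0.5) + normal_lpdf(bM, 0, 0.5)
    lp += -sigma  # Exponential(1) prior on sigma, up to a constant
    for i in range(len(data["D_obs"])):
        mu = a + bA * data["A"][i] + bM * data["M"][i]
        lp += normal_lpdf(D_true[i], mu, sigma)  # the "prior" that acts like a likelihood
        lp += normal_lpdf(data["D_obs"][i], D_true[i], data["D_se"][i])  # measurement
    return lp

# Tiny fake data set (standardized-looking, invented values):
data = {"D_obs": [1.2, 0.6], "D_se": [0.3, 0.8], "A": [0.1, -0.4], "M": [0.2, -0.1]}
params = {"a": 0.0, "bA": -0.5, "bM": 0.0, "sigma": 1.0, "D_true": [1.2, 0.6]}
print(log_posterior(params, data))
```

Initializing `D_true` at the observed values, as in the lecture, just gives the chains a sensible starting point on this surface.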
I'm very sympathetic to the idea that this is a bit strange, because the way statistics is usually taught does not prepare you for things like this. It seems illegal. And if I can pause for a moment for sermon number 74, this will be a brief one: for most routine regression problems in the sciences, linear regression, logistic regression, whatever, it makes little difference whether you go Bayesian or not, provided you have some reasonable amount of data and you don't have much background information. It hardly makes a difference, and often people will tell me this: well, it doesn't seem to matter, I did it with the Markov chain and I did it the other way and I got the same answer. And I'm like, that's fine, because that's the structure of the problem. You're using tools that are commonplace because they were developed by non-Bayesians, and then we replicate them in the Bayesian framework, and they behave very similarly, because it's the same model, just with priors now, and the priors get washed out. But there are lots of neighboring problems which are very, very difficult to access in a non-Bayesian framework and trivial to access in the Bayesian framework. So the tools can be interchangeable in one context, where it makes little difference which you use, but if you modify the problem a little bit, it can be much, much easier to modify one approach than the other. For me as a scientist, that's what I like about the Bayesian approach. It's not that it's fundamentally more correct than the other, or that the other is broken. It's that, once you've learned it, it's easy to make modifications and solve other kinds of problems, like dealing with measurement error. Dealing with measurement error outside the Bayesian approach is possible, but it's very awkward, and it involves a bunch of what I call ad hockery, ad hockery from the Latin ad hoc. Ad hockery is when people just invent some procedure and then see if it works. And you can do that, that's fine. I mean, that's mostly how science works: we just try something and see if we can make
predictions. So I'm not against that approach, but I think the Bayesian approach is more productive, because you can move into different model spaces without awkwardness. And this is an example of that. To get the measurement error model, we just describe the measurement error inside the model. That's all we did. What do those standard errors mean? They mean that for the thing we observed, there's some true average value and then an error around it. That's what they mean. You with me? Yeah, that was sermon 74. I'm not going to step through every detail in the description in the book; there's a lot more you could say about the details of running this and inspecting the values. Let me just summarize for you what happens. Well, what happens is that there's a difference from the original observed divorce rates, which I'm plotting on the left. On the left I'm just showing you the scatter plot of the data: the observed divorce rate against the median age of marriage, with the standard errors plotted. On the right I'm showing you the posterior distributions of what the model thinks are the true divorce rates in each state. That's the vertical axis. It's not the data; it's the parameters. And now the bars are the widths of the posterior distributions, I think one standard deviation in each direction. And you'll notice it's all shrunk, right? Why? That's the action of the regression line on some of these points, and here's the part that won't surprise you: that's regression to the mean. Some states are outliers, and when you enforce some relationship among all states, some of them move closer to the line. That's not exciting, right?
That's just regression to the mean. That's normal, nothing new. The exciting part is that there's a pattern: the states with less certain measurements of the divorce rate shrink more. So, to the slide I should have been on. The divorce rate estimates move from the observed values. The model ends up thinking that the best estimate of the divorce rate in each state is not always centered on the observed value for that state, and that's a consequence of the assumptions of the model. The question is why. This is like the pancake thing: you find out your intuition is wrong, and now you can learn something from that experience. That's what we want. So why is that? The major reason is the regression relationship. If a state has a really extreme divorce rate, the model is skeptical that it's really that extreme, and it shrinks it back towards the line. But the states move at different rates, as it were, towards the line, towards one another. Part of that is where they are, where their median age of marriage is, of course. But the exciting bit, the opportunity to learn here, comes from the fact that in small states the divorce rate is highly uncertain, and so you get much more flow of information from the total sample: the information in the whole data set flows into and informs the estimates of the divorce rate in the small states. Very little of that happens for the big states like New York, because there are a lot of people and there's a lot of divorce. That used to be the story, actually. If I can insert an anecdote: it used to be nearly impossible to get divorced in New York, because they wanted to have a very low divorce rate and appear very moral, and so they exported their divorces to other states. Nevada in particular took up the call, and there used to be whole resorts in Nevada where New Yorkers would go to get divorced. There was a residency requirement; you had to live in Nevada for a couple of weeks or something. So there were these
residence hotels for New Yorkers, and they took vacations, divorce vacations, and went to Reno. I think Reno was the first. It's the coolest story. So historically New York has a low divorce rate on paper, but the actual divorce rate was exported to another state. I told you, it's a crazy world. The music of the spheres. So why did these divorce rates move? This is pooling again, our friend pooling, and it happens automatically. Look at this; this is the idea. What I've plotted on the vertical is the difference between the estimated divorce rate, that is, the posterior mean of each state's divorce rate, and the observed value. So this is a measure of how much it moved. I've drawn a horizontal dashed line at zero. If a state is on zero, that means its posterior estimate of its divorce rate is, on average, exactly the same as the observed value. That doesn't happen for very many states, but you do get it for some of the bigger ones. And then on the horizontal I've plotted the observed standard error on the divorce rate. So big states are over here, small states are over here, and you see that there's this blossoming of shrinkage. Some small states move down and others move up, but they're moving more. Why? Because that's the only logical thing to do given the assumptions of the model. It's the necessary consequence. Okay, does this make sense?
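The blossoming pattern in that plot can be mimicked with a precision-weighted average: hold the regression prediction fixed and vary only the standard error. The prediction mu, its spread sigma, and the observed values below are all invented, and the real model infers the line jointly rather than fixing it, but the monotone pattern is the point: bigger standard error, bigger move.

```python
# Four synthetic "states" that all observe the same rate (1.8) but with
# different standard errors, shrunk toward a fixed regression prediction.
mu, sigma = 1.0, 0.5
states = [(1.8, 0.1), (1.8, 0.3), (1.8, 0.6), (1.8, 1.0)]
diffs = []
for d_obs, se in states:
    w_obs, w_reg = 1 / se**2, 1 / sigma**2   # precisions of observation and prediction
    post_mean = (w_obs * d_obs + w_reg * mu) / (w_obs + w_reg)
    diffs.append(post_mean - d_obs)          # the quantity on the plot's vertical axis
    print(f"se={se:4.1f}  posterior mean={post_mean:6.3f}  moved={post_mean - d_obs:+.3f}")
```

Each difference is negative here because every synthetic observation sits above the line; in the real data some states sit below the line and move up instead.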
It's worth running this model and poking through these estimates and taking a look. Okay, we're making good time here, so let's go one step further. Often we also have error on right-hand-side variables. No problem: let's just add that information to the model and then see what the implications are. You don't have to be clever. In fact, in general I think it's a mistake to try to be clever. Just put the information that you have inside the model and then let logic figure out what the implications are. The implications might be ridiculous, and then your information is ridiculous, but you still learn something. I was talking about ad hockery; this is what I should have said in sermon 74, for this slide. There are all kinds of creative and useful ways that statisticians have invented to deal with measurement error on predictor variables: reduced major axis regression, things called total least squares. These are ad hoc procedures that don't arise from any basic generative model of how the data came about. They can work, but they're very fragile, because if you modify the data context a little bit, they cease to give you good inferences. That's the danger of the ad hockery of it all. It would be much nicer if we could go from the information about the data generating process to an inferential model, and that's the Bayesian strategy. So our approach will be merely logical: we state the information we have about errors on variables, and then we deduce the implications. And of course, if it's garbage in, you know. So here's our new model. It's like the previous model, but we've got another likelihood, if you will, deep inside the model. There are three likelihood functions in one model now. I'm tempted to make an Xzibit joke, but nobody remembers MTV's Pimp My Ride: yo dawg, I heard you like likelihoods, so I put a likelihood in your likelihood. I'm getting old. When I was young, MTV had music. That was last century. So let's look at this model. You understand the top part now. Look what I've done: we're
going to focus on marriage rate, which is also measured with error, for the same reasons that divorce rate is: you've got a finite sample every year, so the marriage rate has a standard error associated with it. In the linear model, we replace the observed marriage rate with a parameter for each state, the true marriage rate, which we can't see. And then we create a likelihood for the observed marriage rate, which again is normally distributed, with the true marriage rate as its mean for each state i, and with the measured standard error, which again is a complicated thing, but it arises mainly from the size of the state, the size of the sample. Good. So this is just what we did before. We set up the assumptions and let it go, with weakly regularizing priors. So what happens? Well, the same stuff happens as before. You get pooling as a consequence, and you get more pooling in the small states. The filled circles here are observed values, the open circles are estimates, and the lines connect points for the same states. It's like those shrinkage graphs I made in chapters 12 and 13; you remember those. We're plotting the marriage rate posterior against the divorce rate posterior, so both axes here are posterior means. There's no data, except, well, the blue points are data, but the open points are posterior means. And you see what's happened. Like this one over here, which has a very low marriage rate, I forget which state that is, and its divorce rate is also extremely low. The model shrinks it dramatically towards the regression line, because there's a big standard error on it. The model says: well, it's very extreme, it's an outlier, so shrinkage kicks in and it moves. States that are instead typical, really close to where the regression line ends up, don't move very much from the raw data, because they make sense given the overall relationship between these variables that's present in the data. The cool thing about this, well, there are
a bunch of cool things about it. One of the cool things is that the posterior distribution has dealt with all of this simultaneously: information is moving in every direction, every state informs every other state's estimate, and the intercept and slope are influencing the estimates for the states, but those intercepts and slopes are themselves inferred from parameters. Remember, down there there's a parameter that is multiplied by a slope inside the regression model. That's mind-blowing. This linear regression up top has b sub R times R for state s: a parameter times a parameter. That's okay; Bayes doesn't mind, and it works. And it's not magic. There's residual uncertainty in these things, because the job of Bayes, what's a good way to say this, the job of Bayes is to tell us what conclusions we can justify, and sometimes the answer is: none. But that's important. You want a statistical framework which can tell you that you have no business making conclusions. All statistical frameworks can do that; it's just a question of how you see it. In Bayes, it's the width of the posterior distribution. If you've got a wide, flat posterior distribution, then lots of values are possible, and you get that for free. In many other statistical frameworks you can get it, but it's a secondary procedure after you do the fitting. In machine learning this is sort of legendary: there are a bunch of machine learning analyses which are just point estimates, and then the point estimates are plugged into other analytical procedures, and then I start screaming. In those methods you can get it right, but it's a separate procedure; you don't do it while you're fitting the model. In Bayes there's no separate procedure. The estimator in Bayes is the posterior distribution, and it's a function; it's smooth like this. Sorry, this is sermon 75, but you've heard this one before. So all of that's going on simultaneously. Information is flowing in every
direction. Every dimension in the posterior is informing all the others; they're jointly constraining one another. I continue to think that this is one of the coolest things about logic: it can all happen at once. The only thing that moves faster than the speed of light is information, and that's because it doesn't move. It's already there. We need mathematics to tell us the implications, to tell us where it is. That's what we do. Remember, we're just counting up all the ways stuff can happen according to our assumptions. But that counting is the procedure; the information is already there, because as soon as you write the assumptions down, the implications hold. Sorry, there's maybe too much poetry. Poetry meant ironically. Does this make sense? Now, the details here won't transfer, and in general I can't tell you, if you've got errors on your variables, whether predictors or outcomes, exactly what pattern of pooling will happen. It will depend upon the details. But you can set up the model in the same strategic way and see what the implications are. Sometimes measurement error doesn't do anything to your analysis, and you'd be fine ignoring it, but you won't know that until you put the measurement error in. And as I keep saying, I think this was sermon number 7, the public is not paying us to do second best, right? So I feel embarrassed for our discipline all the time when people say: yeah, I could do that, but it's a big bother, and, you know, it probably won't make a difference. It's like: let's bring in Mr.
Taxpayer and ask him what he thinks of that attitude. I disagree with it. We don't even need the taxpayer; I definitely disagree with it. This is a professional discipline, and we're here to do the best job we can. Sorry, sermon number 7. There are also authority issues with all this, and you're all here because you agree about it, but you know what I'm talking about; I'm sure you've heard this before. That's why it's so important to make tools that help people who are not statistical specialists do the best thing. That's very important. I do believe in division of labor. Okay, let me try to summarize this. With error on a predictor, you get shrinkage for both divorce rate and marriage rate, but divorce rate shrinks way more than marriage rate, and the question is why. When you run this example from the text, in the comfort of your own home with your cup of tea or whatever, take a look at this: you get shrinkage in both, but the divorce rates move way more from the observed values than the marriage rates do. Why is the question. This is like the pancake thing: it's a consequence of our assumptions, and the opportunity for learning arises from trying to figure that out. The answer is that the beta coefficient that measures the strength of the association between divorce rate and marriage rate is not very big. Remember, in these data it's median age of marriage that is the good predictor. If there were one thing you could know about a state to help you predict its divorce rate, you would want to know the median age of marriage. It's the first thing you should ask. Other things matter too, there's a ton of residual variation, but that's the first thing you'd ask. And marriage rate has almost no effect once you've accounted for the median age of marriage. If you don't know the median age of marriage, the marriage rate is predictive. This is the partial association thing: the coefficient is approximately zero, with some mass on both sides of
zero. As a consequence, the model doesn't know how to pool the marriage rates. It's got no way to do better for them, because there's effectively no relationship between the marriage rates and anything else in the model, so they don't get tugged in any particular direction. It's fundamentally different from what happens to the divorce rates, because the divorce rates are strongly associated with the age of marriage. So if a state has a particular age of marriage, and its observed divorce rate is inconsistent with the regression relationship, then the model will move the observed divorce rate to some posterior estimate that is different. Does this make some sense? That sentence I just gave was like a paragraph long and there wasn't enough grammar in it; I felt that as I was speaking. Too many rates and rates, the rate of the rate. So this is a general sort of phenomenon: the model figured out that beta coefficient, which, like in your previous multilevel models, determines how much pooling you get, how much information is moved between the different variables. In the pooling models, remember, when we did random slopes two weeks ago, there were correlation parameters that determined how much information moved across intercepts and slopes. The beta coefficient on the marriage rate, b sub R, sorry, whichever slide it's on, is acting somewhat analogously, because it moves information between the estimates for divorce rate and marriage rate. We don't know either one. It's like intercepts and slopes: you didn't know the random intercepts and slopes of each group, and now, for each state, there's an unknown divorce rate and an unknown marriage rate. These are like random intercepts and slopes, and if the
correlation between them is approximately zero, as it is here, then no information moves between them. And that's why you don't get much shrinkage on the marriage rates. Anyway, I hope that was useful. Okay, I'm right on time here, so here's my summary slide on measurement error. This is a really common malady: lots of data, even experimental data, has measurement error, and you can deal with it this way. Just some quick examples before I close. Prediction with averages: this is the example we've worked with here. You don't have to use the average; you can instead use the posterior distribution of the average, which is what we've done in this case. DNA sequence data: you should respect the error rate in the sequencer. This is a very common problem, and it's why you do a bunch of reads, these giant read numbers on genomes. But there are still problems where, say, you're looking for mutational hotspots: if mutation rates are low, they can be approximately the same as the error rate on sequencing, and then, given that you've seen a mutation, it's about a coin flip whether it's actually a mutation. So you need reads; this is why you do multiple reads. This is about respecting error rates. Parentage analysis: similar thing. There's a probability distribution over possible parents; usually you can't know parents for sure in wild populations. This is also true for humans. Phylogenetic trees: same thing. There's not a tree; there's a posterior distribution of trees, and it's terrifying. Once you get over your terror, you can settle in and do good stuff with it. There are a whole lot of trees in there. It's like: oh my god, it's full of little trees. And archaeology, paleontology, forensics: obviously, if there's a hole in the ground and you dug something out and you're trying to figure out how old it is, there might be some uncertainty. Now, I focus on things that are important in this building, where we do evolutionary anthropology, and we have problems like this all the time,
but I think for those of you who do experimental psychology, I'm sure you can think of ways that this is relevant as well. Okay, so with that, I'm going to stop here. When you come back on Friday, we're going to pick up on this slide and talk about missing data. Thank you for your indulgence. I'll see you on Friday.
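As a closing check on the sequencing remark from the summary slide: the coin-flip claim is just Bayes' rule. This sketch treats a single read and ignores the negligible chance that a real mutation is itself misread, so the rates below are illustrative, not real sequencer specs.

```python
def prob_real_mutation(mutation_rate, error_rate):
    """P(true mutation | a read shows a mutation), for one read.
    A read shows a mutation if there is a real mutation read
    correctly, or no mutation but a sequencing error."""
    p_obs = mutation_rate * (1 - error_rate) + (1 - mutation_rate) * error_rate
    return mutation_rate * (1 - error_rate) / p_obs

# When the mutation rate equals the error rate, a single observed
# mutation is exactly a coin flip:
print(prob_real_mutation(1e-3, 1e-3))  # 0.5
# With a much lower error rate, the observation becomes trustworthy:
print(prob_real_mutation(1e-3, 1e-4))
```

Multiple independent reads multiply the evidence, which is why deep coverage rescues the inference.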