All right. Welcome back, everybody. I have a lot of slides. This is going to be a great lecture. This is the last lecture, but certainly not the least. We are going to cover some topics that I think are extremely important for doing applied statistics. Before we get to that, let's talk about pancakes. Okay, these are my very bad drawings of pancakes. It's just three pancakes, and the hatching on a side of a pancake indicates that that side is burnt. So imagine you come over to my apartment for pancakes, and I turn on the griddle, and the temperature isn't right. It's too hot, and the first couple of pancakes get burnt. This is actually how it happens, right? And so by the third pancake it's perfect. So the first pancake is burnt on both sides, the second pancake is burnt on only one side, and the third pancake is just right. Now, I serve you, at random, a pancake with a random side up, and you look down at your pancake and it's burnt. And I want to ask you, since you've taken my stats course: what's the posterior probability that the other side is also burnt? I'll let you think about it for a second. All the information you need to solve this problem is on the slide. You don't know which of the three pancakes it is, right? You do know which one it's not: it's not number three. So what's the probability the other side is burnt? What's the answer? Anybody think a half? We got two votes for a half. Two out of three? Yes, the answer is two out of three. It's not a half. Half is the standard intuition, and it's a perfectly reasonable one, don't worry. That just means you're human. The intuition is half because there are two pancakes it could be, right? You think, ah, it could be either one, it's random. No, it's two out of three, and let me show you how to figure that out. The point of this wasn't to make you feel bad. Not at all. My intuition was a half too, the first time I saw this problem. This is a famous logic problem, actually. It wasn't originally done with pancakes, but I like it better with pancakes. I think originally it was boxes with balls in them or something stupid like that. So pancakes, everybody likes pancakes, right? Yeah, okay. Good answer. The point of these logic puzzles is not to make you feel bad. It's to teach you some intuitions, or correct your intuitions, and teach you methods for solving these things. And as I've said many times in this course, one of the values of learning probability theory is that it means you don't have to be clever. You can just ruthlessly apply the rules of conditioning, and you don't have to feel like you need to intuit the right answer. So don't trust your intuitions would be my advice. Instead, be ruthless. Ruthlessly condition. What do I mean? The way we figure things out in probability theory, or Bayesian inference, which is just probability theory, is we want to know the probability of something, and so we condition on what we know and see if that updates the probability. If there's any information in what we already know for inferring the thing we'd like to know, then when we compute the probability of the thing we want to know, conditional on the stuff we already know, it'll be there. And the rules of probability tell us the only way to do that. So let's do that for the pancakes, all right?
Here's your burnt pancake. We want to know the probability that the side we can't see is burnt, conditional on what we know, which is that the side up is burnt. Now, of course, you could peek, but this is a probability puzzle, yeah? The rules of probability tell us that a conditional probability is defined as the probability that both things are true, divided by the probability that the thing we condition on is true. That's what the expression on the right-hand side is: the probability burnt up and burnt down, that both sides are burnt, divided by the probability burnt up. This is Bayes' theorem, by the way, just disguised. It's just the definition of conditional probability. If you ever forget this, just Google "definition of conditional probability"; there'll be a nice Wikipedia page for you. We have all the information needed to compute this. First, we need the probability that the burnt side is up. Let's think about it methodically. There are three pancakes: a BB pancake, a BU pancake, and a UU pancake. What does that mean? BB is burnt-burnt, BU is burnt-unburnt, and UU is unburnt-unburnt. Three pancakes. The probability that you would see a burnt side up, if the pancake were burnt-burnt, is one, because either side will show you burnt. So that's why it's probability BB times one. Then probability BU times 0.5, because only one of the sides of that pancake is burnt, so if I'm randomly flopping it down on your plate, there's a random chance the up side is burnt. And for the unburnt pancake, there's no way, so you already eliminated that. There your intuition was working. Each of these pancakes has a one-third chance of being served to you. So there's a one-half chance that you get a burnt side up on your pancake. And the last thing is the numerator: what's the probability of burnt up and burnt down? It's one-third, because there are three pancakes, I served you one at random, and only one of them is burnt on both sides. So it's one-third divided by one-half, and the answer is two-thirds. Weird, right? Let me give you some intuition about why, now that probability theory has given you the right answer. In the text, there's a simulation of this, an individual-based pancake simulation. You'll appreciate this: it's like agent-based, but they're pancakes. It proves to you that it's two-thirds, if you didn't believe this calculation. But let me try to give you a little intuition, now that I've just told you you don't need to be clever. In fact, I'm not clever at all; I just ruthlessly apply conditional probability and figure things out. But if you want some intuition at the end, which is a perfectly legitimate goal, here it is. The mistake is focusing on pancakes. You want to focus on sides. Look at this picture. There's a side at the bottom, and we want to know what the other side is. If you look at the top, there are three burnt sides that the side at the bottom could be. So there are three possibilities. Of the other sides of those three possibilities, how many are also burnt? The answer's two, so the answer's two out of three. You've got to focus on sides. We're talking about sides, not pancakes. That's the mistake. Pancakes are so delicious you were focusing on pancakes and not sides. Mental bias. I forgive you. All right.
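Written out, the calculation just described is:

\[
\Pr(\text{burnt down} \mid \text{burnt up})
= \frac{\Pr(\text{burnt up, burnt down})}{\Pr(\text{burnt up})}
= \frac{1/3}{\tfrac{1}{3}(1) + \tfrac{1}{3}(0.5) + \tfrac{1}{3}(0)}
= \frac{1/3}{1/2} = \frac{2}{3}
\]

And the simulation mentioned above looks roughly like this, along the lines of the code in the text:

```r
# simulate serving pancakes: 1 = burnt side, 0 = unburnt side
sim_pancake <- function() {
    pancake <- sample(1:3, 1)                 # pick one of the three pancakes
    sides <- matrix(c(1,1, 1,0, 0,0), 2, 3)[, pancake]
    sample(sides)                             # random side up
}

pancakes <- replicate(1e4, sim_pancake())
up <- pancakes[1, ]
down <- pancakes[2, ]

# among servings with a burnt side up, how often is the down side burnt?
sum(up == 1 & down == 1) / sum(up == 1)       # converges to 2/3
```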
So the point of all this is that everything we've done in this course, everything you do in any statistics course, really, is underlain by the idea of ruthlessly applying conditional probability to solve problems. There's some information we have. There's some other thing we'd like to infer. If there's any evidence in the information we have about the thing we'd like to infer, you reveal it through conditioning, through the rules of probability. Or another way to think about this: we express our information as constraints and distributions, and then we just let logic do the rest. And that's what Bayesian inference is. It's just logic. So for the rest of the time today, I want to take this approach and show you how it produces automatic solutions to two very common problems which historically have been very hard to solve in statistics, because people tried to be clever. If you avoid being clever and just ruthlessly apply conditioning, you can get useful solutions to two very big problems. These are measurement error, which is always present in data and usually ignored, and missing data, which is a special kind of measurement error: the data's not there at all. It's just the extreme version of measurement error. These are really common. So let's think about measurement error first. There's always some error in measurement, for sure. And when you do an ordinary linear regression, it's captured in the residual variance. There's this sigma thing that's supposed to capture the fact that, even if you had the true relationship, you don't expect a perfect fit. That's what that sigma error variance on the end is supposed to be. And that's fine if the only error is on the outcome variable. But what if there's also error on the predictor variables? And imagine, for example, that the error is not uniform across the cases or the variables. Then you're in trouble. You could be in trouble. There are a whole bunch of ad hoc procedures to try to deal with this, like reduced major axis regression and such, and none of them is terribly reliable. So let's think about an approach that avoids trying to be clever and just conditions on what we know to get there. Think back to waffles. We had pancakes; now waffles, the waffle divorce data set from early in the course. In that data set, we had a bunch of states in the United States of America. For each one, we have the measured divorce rate, median age of marriage, and the marriage rate. And if you go back and look at the data set, you'll see there are columns with standard errors on a couple of the variables. That is, these are taken from partial census data, from records. They're measured with some error, and that error has been quantified in terms of the standard error on each of those values. We don't know the divorce rate in a particular state. What we have is an estimate of it, and we also have a number that tells us our confidence in it, the standard error. There's a lot of heterogeneity in this error in this data set. So what I'm showing you on this graph is the divorce rate on the vertical, median age of marriage on the horizontal. And the line segments show you the standard error of each state's divorce rate. Some of these are big. The reason some of them are big is that some of the states are small, and so they produce relatively little evidence in any given census period. So here's another way to look at these data. The left graph is the graph that was on the previous slide.
It's just divorce rate against median age of marriage, with the line segments for the standard error of each divorce rate. And then on the right, the divorce rate on the vertical axis is the same, but now the horizontal is the log population of that state. So really big states are on the right. California is the one all the way on the right. And in California, in any particular year, you get so many of the events that you get a fantastic measure of the true divorce rate in the state, if you want to say the true rate. Because there are just so many Californians. They're all over the place. I'm one. We're just all over the world. And on the left, you've got smaller states. I won't single any of them out. Well, later I will. But, you know, pick your favorite small state. There are states with very small populations, particularly west of the Mississippi, where there is far more livestock than people in many parts of the Western United States. And in those states, there are so few of these census events in any given period that you're going to have error. Your confidence in having estimated the long-run rate is very low, and so the standard errors on those states are big. Does this make sense? Let's think about this in terms of a causal model. Measurement error can be put into a DAG. Almost anything can be put into a DAG. Not anything, but almost anything. So we want to think about the observed divorce rate here, which I named D sub observed, as being a function of the true divorce rate, which we don't get to observe; that's D. And the population size of the state, which I call N. In the data set, the N part of this has already been summarized as a standard error, a reliability. But in general, this is what's going on: your observed variable for divorce rate is a function of these two things, and that's what generates the measurement error. Does that make sense? So how can we approach this statistically? Well, let's not be clever. Let's just be ruthless and apply conditional probability. We state what we know, and we see if it'll figure out what we don't know. So here's the idea about how the observed variables are generated. There's some true divorce rate, which we haven't observed, and we'd like to use that as our outcome variable, not this crummy measured thing, which is highly unreliable for some states. So, thinking generatively, our observed divorce rate is sampled from some normal distribution. The central limit theorem makes this almost true in this case. The mean of this normal distribution is the true rate. And then there's a standard deviation, which is inversely related to the population size and has been summarized for you as the standard error. Saving you some work. You could do this yourself; somebody did it for you. There's a formula for this. Does this make sense, conceptually, what's going on? You sample this process every month. Divorces are happening in a given state, but in California you get thousands and thousands in any given month. In Idaho, mostly zero every month. But in the long run, there's some D true that you'd observe, averaging over a really long data set. In any finite period, there's error, and that error will be inversely related to the population size. It's a function of the population size, and that's what the standard error is. So the thing you observe comes out of this generative process.
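In symbols, the observation process just described is, for each state \(i\):

\[
D_{\text{obs},i} \sim \text{Normal}(D_{\text{true},i},\ \text{SE}_i)
\]

where \(\text{SE}_i\) is the standard error column already provided in the data set.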
Does that make sense? Now let's think about the statistical model. So this was our DAG before. To remind you: A is median age of marriage, M is the marriage rate, D is the divorce rate. We were evaluating the direct and indirect effects of age of marriage. And this is the statistical model we had before, where the thing on the top is D true. We acted as if we knew the divorce rate. But now we don't. So we need to add something else to this model. What are we going to add? We're going to put a line on top of it, which is the observation process. You just add this to the model. And now D true is a vector of parameters, because we don't know it. We replace all those things, which were previously the outcomes, with a bunch of unknown parameters, and then the observation line at the top estimates them for us, because we have information with which to estimate them. What is that information? Well, we have the standard error. Okay, that's great. But if that were all we had, it would not be enough. We also have the whole regression. The whole regression relationship is going to pin down values for the different states. And you may be getting a tingling sensation in your skull, which tells you that shrinkage is going to happen. Right? Because it's my class, and there's always shrinkage. So let me pause here and ask: does this make sense? What's happening? One way to think about this is, if you were going to simulate measurement error, this is the model you could use. You just write down the same model, and then it runs backwards. Bayesian models are generative, and you can run them in both directions. If you run them forwards, you simulate fake data. If you run them in reverse, they spit out a posterior distribution. You feed in a distribution, they spit out data; you feed in data, they spit out a distribution. You can run them in both directions. So if you were going to simulate measurement error, you'd do it with the top line: you'd have some true value, but it would be sampled from a distribution, so it would have some error on it. Mechanically, where's the error coming from, if this is census data? Like, you're not mismeasuring the events. Because in a finite time period, imagine the census period was a day. In a big state like California, you get many more events, and so that day provides much more information about the true rate. The variance from day to day will be higher in a small state. It's like sample size, right? So imagine you pick a village instead of Los Angeles, and you try to estimate the divorce rate. It's easier if you pick Los Angeles. Does that make sense? It's just sample size. California is bigger than Idaho, and if the census period is the same for both, it's a sample size effect. You never observe the rate; the rate is unobservable, always. And so the sample size constrains your precision. Sleep on it. It's just sample size. It's nothing but sample size. All right, how do we do this in a model? Exactly as it was on the previous slide. The only trick here is you've got to define D true as a vector of parameters as long as the whole data set. For every state, there's a D true, and that's what the vector[N] thing is in this model. So at the top we've got D true, which is a vector of parameters. In the second line we define it as a vector of length N. And it gets the likelihood.
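In code, it looks roughly like this, along the lines of model m15.1 in the text (a sketch, not the exact slide code):

```r
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

dlist <- list(
    D_obs = standardize(d$Divorce),
    D_sd  = d$Divorce.SE / sd(d$Divorce),   # standard errors on the standardized scale
    M     = standardize(d$Marriage),
    A     = standardize(d$MedianAgeMarriage),
    N     = nrow(d)
)

m15.1 <- ulam(
    alist(
        # observation process: what we saw, given the true rate
        D_obs ~ dnorm(D_true, D_sd),
        # the regression, now on the unobserved true rates
        vector[N]:D_true ~ dnorm(mu, sigma),
        mu <- a + bA*A + bM*M,
        a ~ dnorm(0, 0.2),
        bA ~ dnorm(0, 0.5),
        bM ~ dnorm(0, 0.5),
        sigma ~ dexp(1)
    ), data=dlist, chains=4, cores=4
)
```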
But there's a parameter on the left now, not an observed variable. It works the same, though. Remember, the distinction between likelihoods and priors in a Bayesian model is cognitive. That's something you know. Probability theory doesn't care. It works the same on unobserved and observed variables. It doesn't care at all. So when something in your data set becomes unobserved, the model doesn't change. Yeah? This is a brain-bending thing for people, because you're taught that data and parameters are fundamentally different things, but they're not. They're just variables, and sometimes you observe them and sometimes you don't. And this is the part where I say you don't have to be clever, you just have to be ruthless. The model exists before you know the sample. It's a representation of the generative process. The fact that you haven't observed some of the data doesn't mean the model changes. It just means that now you have parameters there, because you haven't observed it yet, because a parameter is just an unobserved variable. The fact that you could observe it makes it feel really different from something you can't observe, like a rate. Yeah? Okay. And then the rest is just the old priors. Good? Yeah? Some of you are nodding, and some of you are like, I don't like this. It's okay to be unhappy. And what happens in this regression? There's shrinkage. As I said, the tingling sensation in your skull is shrinkage. Now I'm plotting the relationship between median age of marriage and the divorce rate; you know there's a strong relationship in these data between the two. Both of these are standardized variables. The blue points are the values that were observed for divorce rate in the data set, the ones we previously ran the regression on, back in chapter five or something. And the open circles are the values from the posterior distribution, the posterior means for each of the D trues that have been estimated. The line segments connect them for each state. You with me? So what has happened here? Well, there's shrinkage. Some have moved more than others. And since you're pros at shrinkage now, you can explain this pattern. What has happened? Why have some of these moved way more than others? And why have they moved where they've moved? They've moved toward the regression line, because that's the expectation. If some state's observed divorce rate is really far from the regression line, then it will shrink more. But how much it shrinks is also a function of its standard error. So a really big state like California could stick far from the regression line, because it has such a precisely measured divorce rate. It doesn't here; it's actually pretty typical. But a small state like Idaho, look on the left of this slide, you see Idaho. That's ID, on the far left. Idaho is more sheep than people. All right, mostly potatoes. Just potatoes and mountains for the most part. It's a beautiful state. And it has a very imprecisely measured divorce rate. It gets shrunk. It's still off the regression line by a good amount, but this model says, given this relationship and these variables, that measured rate is way too extreme to be believable. It's probably due partly to sampling error. And it shrinks a lot. You get similar effects for North Dakota down at the bottom. Wyoming's an interesting one there.
That's the next one over, WY. It's not so far from the line, but it's so uncertain that it gets shrunk directly to the line. Because Wyoming is another one of these states. It's mostly sheep. Another beautiful state. And so on. Maine has an extremely high divorce rate; it gets shrunk a lot too. And Rhode Island's a small state, etc. You can look at this. Again, compare it to this plot with the standard deviations, where we calculate the shrinkage. On the left side of this plot, what I've done is take the difference between the estimated divorce rate, the mean of the posterior distribution of the D trues, and the observed rate in the data set. That's the vertical axis. On the horizontal axis I put the standard deviation that's in your data set, the standard error of the measure. A state with a difference of zero has no shrinkage. So on the far left you've got California, for example. And then as you move to the right, as the standard deviation increases, you get more movement. That's what the shrinkage is: there's a bigger difference between the observed rate and what the model thinks is plausible. Does it make sense? Okay. That's error on the outcome, when it's not constant and varies across cases. You can also, of course, have error on predictor variables, and those can go into the model the same way. Again, be ruthless. Don't be clever. You've got a generative model. Imagine sampling an observed version of a predictor variable, but now with error, and then insert that observation process into your model. That's what we're going to do here. On this slide, now we're looking at the marriage rate in each state. That's also measured with error, for the same reason. And again, it's plotted on the horizontal against log population. California's on the far right. Idaho and Wyoming are on the far left. The line segments get bigger because of the sample size issues. Here's the model. I'll do some notation on this on the next slide, so don't panic. The top part is the same as you saw before. The very first line is the observation process on divorce rate. Then we've got the regression of the true divorce rate on age at marriage and marriage rate, which is M, but now inside the regression, the linear model, we have M true, not the observed M. What is M true? It's a parameter. There's a parameter for each state here, and it goes in. So this last term is a parameter times a parameter. Don't you like the course now? Yeah. So every state is going to have one of these M trues. We don't know it, but it goes into this thing. Exactly the same. It's the same model, because it's the same generative process. You haven't observed it, but that doesn't change the model. And then we've got what you might call the likelihood for the observed rate: the M observed, the marriage rate we saw for each state, comes from this sampling process. Again, normal with mean M true and the standard error, which is also in the data set. Good? Yeah? Again, you've got to go home and draw the owl, of course, but this is really all it is. You just think generatively about the process and write down the stats model. You can run it in both directions. One last thing to talk about before I show you how this model behaves: you've got to put in a prior for these M trues. You can do lots of interesting things here. In this example in the text, I set it to normal(0, 1), because it's a standardized variable. It's not terrible.
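Roughly, along the lines of model m15.2 in the text (a sketch; the normal(0, 1) prior on M true is the one just mentioned):

```r
dlist <- list(
    D_obs = standardize(d$Divorce),
    D_sd  = d$Divorce.SE / sd(d$Divorce),
    M_obs = standardize(d$Marriage),
    M_sd  = d$Marriage.SE / sd(d$Marriage),
    A     = standardize(d$MedianAgeMarriage),
    N     = nrow(d)
)

m15.2 <- ulam(
    alist(
        # observation process on the outcome
        D_obs ~ dnorm(D_true, D_sd),
        # regression of true divorce rate on A and the true marriage rate
        vector[N]:D_true ~ dnorm(mu, sigma),
        mu <- a + bA*A + bM*M_true[i],
        # observation process on the predictor
        M_obs ~ dnorm(M_true, M_sd),
        vector[N]:M_true ~ dnorm(0, 1),
        a ~ dnorm(0, 0.2),
        bA ~ dnorm(0, 0.5),
        bM ~ dnorm(0, 0.5),
        sigma ~ dexp(1)
    ), data=dlist, chains=4, cores=4
)
```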
What happens as a consequence of that prior, though, is that you're ignoring information in the data, because if I tell you the age at marriage in a state, you get information about the marriage rate. Look at the DAG on this slide. If you believe this DAG, and you do, yeah, you don't have to, but there's lots of evidence this is a reasonable DAG for the system: age at marriage influences marriage rate. And so we could have a better prior for the marriage rate if we used all of the regression information, put the whole DAG into this model. I'll do an example of that, not with this data, but later on today. We put the whole DAG into the stats model. All the variables have relationships, and if we do it all at once, there's even more information that can help us pin down and get better estimates of the true values. But this won't be awful. Okay. I know there's a lot going on in this graph. It's not the most beautiful graph in the course. What's happening here? We've got two variables now which are observed with error, and I'm plotting them against one another. We've got divorce rate on the vertical, standardized, and marriage rate on the horizontal, also standardized. Blue points are the observed values in the data set: the combination of the observed marriage rate and the observed divorce rate. And the open points, connected by line segments, are the corresponding pairs of posterior means for the estimated true rates, in both cases. So you've got shrinkage in two directions now, both of them toward some regression relationship. I haven't drawn it, but you can kind of see it there. It's like a constellation. This is like the Milky Way. And some of these shrink a lot more than others. The first thing you're going to notice is that if you're really far from the regression line, you shrink more. You expected that, and you see it here right away. There's a more subtle thing going on, which is that there's more shrinkage on the divorce rate than on the marriage rate. Look at a case like the one in the upper left. There's some state up there. I should have labeled this. Let's guess that's Wyoming or something like that. Well, no, it's high; it's probably Maine. It's extreme in both: it has an extremely low marriage rate, yeah, this is almost certainly Maine, and an extremely high divorce rate. And it comes down really far, but it comes down a lot more on divorce rate than it does on marriage rate. Why? You can see in the other cases this is also true, for the most part; all those things at the top, similarly. Why does that happen? The answer is that the regression says marriage rate is not really strongly related to divorce rate. So the shrinkage doesn't happen as strongly for marriage rate, because there's not as much information in the regression to move it. The model doesn't know where to move it, exactly. It isn't attracted to the regression relationship as strongly, because the real causal effect in this model is age at marriage. Remember? There's an association between these two variables that arises through that other, backdoor path. But that means you don't get as much shrinkage on this predictor. Does that make sense? I thought this was a cool effect when I made up this example. Anyway, go home and draw the owl. You will have fun with this. I am absolutely sure.
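If you want to start on that owl, a minimal sketch of this two-direction shrinkage plot, assuming m15.2 and dlist from the sketch above:

```r
post <- extract.samples(m15.2)
D_true <- apply(post$D_true, 2, mean)
M_true <- apply(post$M_true, 2, mean)

# observed values in blue, posterior means as open points,
# line segments connecting each state's pair
plot(dlist$M_obs, dlist$D_obs, pch=16, col=rangi2,
     xlab="marriage rate (std)", ylab="divorce rate (std)")
points(M_true, D_true)
for (i in 1:dlist$N)
    lines(c(dlist$M_obs[i], M_true[i]), c(dlist$D_obs[i], D_true[i]))
```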
Let me try to put some context around this before we move on to the next topic. Measurement error comes in many disguises, and the version I just gave you is the simplest, where there's a variable in your data set that gives you the error. Sometimes you're lucky and you get that: some expert has told you, here's the measurement error in this variable. Then you can proceed like this. But there are many, many more subtle forms as well. One of the things I see a lot, and I think is a waste of information, is that people will pre-average some variable and then enter that average as a predictor in a regression. That throws away the fact that you had a finite sample with which to estimate that mean. People do this all the time: you've got some sample from a state, and you just create state averages and put them in the data set. That takes variation out of the data set. And if you're doing that consciously, as a way to get your p-values to be smaller, I would call it cheating. But I think usually people aren't cheating; they're just doing what they were taught to do. But we don't have to do that. What else could you do? You could just use a multilevel model. Right? That's what you could do. And then the means are varying effects; they're parameters. You don't have to do any averaging. Do the averaging in the model, not outside the model. And then all of the uncertainty that has to do with different sample sizes is taken care of. Parentage analysis is a fun case. This is done in people, but it's also done in other interesting animals. You've got some population of, say, wild rodents, and you're trying to figure out who sired whom, who whose parents are. And you can't interview them, unlike people. So you get their genotypes and you try to figure out if they're related, and there's uncertainty. You can do exclusion, but then there'll be some number of individuals who could be the parents, with different probabilities. And this is a standard sort of thing in this field. Phylogenetics. When I did the example last week, I used a single tree. There's a strong fiction there. We don't know the history of primates. Phylogenies are rarely very certain. And so on the right-hand side of this slide, there's this trend now to plot phylogenies like this. I'm a really big fan of this; I see it in papers all the time now, and I like it a lot. You're showing the whole posterior distribution of trees. It's a big fuzzy graph. In some cases, there's massive uncertainty in some parts of trees, and you want to keep that in mind. And you can do the analysis over the whole distribution of trees, actually. You imagine you just feed it into Bayes like this, and now you've got this whole distribution you need to average over. It works like lots of other ways we have distributions in Bayes. In archaeology, paleontology, forensics, measurement error is the norm. It's absolutely standard. Your data are decimated in some way. Think about radiocarbon dating. You don't know the radiocarbon date. You've got, you know, a few-hundred-year, a thousand-year period. The archaeologists in the audience know the pain of this. And people take this very seriously now. At least most people do. Take this very, very seriously. I had sexing in here. I had a colleague back in California who was trying to sex fossils. And this is no joke, right? To try and do this correctly. You can never be sure.
It's very, very difficult to sex a primate fossil, but you can assign probabilities. Absolutely you can. Okay. Determining ages is another issue. In my department, we work in places where people don't keep track of birthdays. So you can ask people their age and they'll give you a number, but you don't want to naively use it. So you use biological facts, like average birth spacing and things like that, and you can reduce the error on those estimates. Okay. Let's shift to a very related example. Grown-up measurement error is missing data. Lots of things are mechanically similar about this, but it feels really different, because often when the data is missing it feels like there's nothing you can do. You want to do something about missing data, typically, and I want to teach you today why. This is really common, right? You're used to this. Most of the standard regression tools in software packages will automatically remove cases with missing values without saying anything to you. So if you run lm or glm in R and any of the variables you include in the regression has a missing value, it deletes that whole case. All of the variables for that case are removed from the data set. This squanders information. That's the first thing to worry about. But it can also create confounds. Missingness can create confounds. Complete case deletion is not harmless, and that's what I want to convince you of. There are ways to deal with this, though sometimes there's no guarantee. So how do people deal with this? There are lots of different approaches. The worst approach, and you should absolutely never do this, as I say on this slide, is to replace the missing values with the mean of that column. This is tragic. This is a really, really bad idea. Why? Because the model will interpret it as if you knew that value. You don't know that value. You want something with error there. That's what you want. If you just put the mean in there, you cognitively know that you don't know that value, but the model doesn't. The model thinks it's the mean, and then bad things happen, really bad things. So don't do this. I haven't seen this in a very long time, which is nice. The word has gotten out. You should absolutely never do this. What else could you do? There's this procedure called multiple imputation, which works. It's one of these things which shouldn't work but works really, really well. It's a frequentist technique for imitating what we're going to do today. In fact, it was invented by a Bayesian, Don Rubin, back when desktop computing power couldn't do these things, and so he created a frequentist technique. It works unreasonably well. It's really effective. Basically, you run the model multiple times on different samples from some stochastic model of that variable. That's what multiple imputation is. It turns out you don't need very many multiples to get a really good estimate of the uncertainty; there's a sketch of the workflow below. We're going to go full-flavor Bayesian imputation here and just put in the probability statement about how this works. Okay, impute. What does impute mean? Impute means to assert some feature of a thing. In law, it's usually a crime that you're imputing to somebody. But here there's no valence like that implied.
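For reference, the classical multiple-imputation workflow mentioned above looks roughly like this in R, using the mice package (the data frame and variable names here are hypothetical):

```r
library(mice)

# d is a data frame with missing values in some columns (hypothetical)
imp <- mice(d, m = 5, printFlag = FALSE)      # build 5 imputed data sets

# fit the same model to each imputed data set
fits <- with(imp, lm(kcal ~ neocortex + logmass))

# pool the estimates across imputations (Rubin's rules)
summary(pool(fits))
```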
So before we get to the mechanics of how to run a model that does imputation, that tries to guess the value of a missing variable, let's talk about DAGs again and try to get missingness into the DAG. This is a literature that is deeply confusing, because the terminology is really awful. It's just about the worst terminology I've ever come across in any region of statistics, and I will convince you of that on the next slide, I am confident. Let's think about the primate milk data again as an example. This is a small data set. We've got a number of primate species, and we're interested in understanding why the energy content of milk varies so much across species. We're focusing on body mass and the proportion of the brain that is neocortex. So here in this DAG, M is body mass; B, for brain, is the proportion of the brain that's neocortex. Why are we focused on that? Because humans have a lot of neocortex, and we focus on things that we have a lot of. It's this kind of narcissism of our species; the whole field of anthropology is an exercise in narcissism. K is the milk energy, kilocalories of milk. U is some unobserved thing that is generating a positive correlation between body mass and the proportion of the brain that's neocortex. There is a strong positive correlation across species in these two things, but we don't know the mechanism, so I'm just putting a U there to say I don't know, something is generating it. It's not a direct causal arrow from M to B, because it's not just allometry. If we were just talking brain size, maybe there would just be an arrow from M to B, but that's not what's going on here. You can have brains the same size, but some of them will have more gray stuff than others. So there's something going on here; we just don't know what it is. Now let's talk about the three types of missingness. All types of missingness can be usefully classified in this taxonomy, because it tells us what we need to do. That's why this taxonomy exists, and this is where the totally confusing vocabulary comes in. You've got to know this vocabulary, or at least recognize it so you can come back and check the notes and figure out what's going on. It's horrible, right? On the left we've got something called missing completely at random. I'll walk through that starting on the next slide and explain what it means. It's abbreviated MCAR, and I'll explain its DAG to you. In the middle we've got the second type. It's called missing at random. Does that sound different to you than missing completely at random? It sounds about the same to me, but it's totally different in terms of its consequences, and I'll explain that as well, after I explain MCAR. And then there's MNAR, sometimes written NMAR instead, because, you know, in English you can put a "not" anywhere and it has some random effect on the meaning of the sentence. This is missing not at random. I don't know about you, but as a native speaker of English, when I hear the phrases missing completely at random, missing at random, and missing not at random, I don't think of the processes I'm about to explain to you, at all. This is a tragedy. This is another example of the law that statisticians should not be allowed to make terminology. But you can understand what's going on here, and these three types are incredibly important, because they tell you what you need to do to make an unconfounded inference. So let's start with the first.
Here's our milk energy DAG. The triangle at the bottom is the basic DAG, and now we're not going to get to see B, the true proportion of neocortex, anymore, because it has missing values. In this data set it does: almost half of the values are missing. I think there are 12 missing values in this data set. There are lots of primates that no one's ever measured this for. Maybe they measured brain mass, but they didn't measure the percent neocortex. So instead we've got this variable B observed, which has deletions in it, gaps, missing values. And we know B observed is partly caused by B; that's why there's an arrow coming up from B. But it's also caused by this missingness mechanism, whatever it is, that places missing values on particular species. We want to name this thing. In this literature, these missingness processes are given the letter R, I guess for random, with a subscript to say which variable it's affecting. So R sub B is the missingness mechanism that creates missing values in B observed. Think about what DAGs always say: for any variable, the arrows entering it are the arguments to some function which generates that variable. So this says B observed is a function of B and the missingness mechanism. That's all it says. Makes sense? Now, since you're all experts at graph analysis, I ask you: are there any backdoors from B observed to K? Why is that the question? Because we're going to condition on B observed. We can't condition on B, because we don't have it. The graph stays the same, and now K is our outcome and B observed is our predictor. Are there backdoors from B observed to K? I'll give you a moment. Any backdoors? Remember what a backdoor was. It does go through M, but that's not a backdoor, it's a front door. So that's the distinction. There are two things to think about here. There's no backdoor; the answer is no. It's only a backdoor when the path enters the back of the variable. There's no backdoor, but there are two paths from B observed to K: a direct effect and an indirect effect. The total causal effect of B observed on K can be estimated by a simple regression with just B observed on the right. Right? Yeah, exactly, an indirect path through M. So there's some other path there, but there's no backdoor. Makes sense? In fact, you can condition on M here and get the direct effect of B. But there's no path that takes you through R sub B. No backdoor takes you through it. You see, it's just stuck on the end of the graph. And this means the missingness mechanism is ignorable. It's not a confound. You analyze it just like any other variable. There's something influencing the thing you're going to condition on, but it doesn't create any backdoor confound, and so you can ignore it. Remember the rules: you don't have to condition on it. And that's true here. You don't need to know it, which is nice, because usually we can't discover the missingness mechanism. So this is the benign case, what's called missing completely at random. It's MCAR when K is unconditionally independent of the missingness mechanism. You don't have to condition on anything in this graph to keep your inference about K independent of the missingness mechanism.
You're safe, and that means it's ignorable. So this is the benign case. It's what everybody hopes for when they find missing values; people pretend, oh, it's MCAR. But let's pause for a second and think about this. Sorry, yeah? Yes, you can. Complete case analysis will not create a confound when this is true. This assumption is what licenses you to drop all those cases, and it will turn out this is the only situation where that's okay. However, if you do the imputation, which we will, you can do even better, because you get more power. You'll get a more precise estimate of the causal effect if you do the imputation. So you still don't want to drop the values. You could drop them; it's not a sin. But you can do even better if you don't drop the values and impute instead. That's the MCAR case: in MCAR you don't have to impute, but you should. Yeah, right, exactly, and that's why: you're losing power otherwise. You with me? Before we leave MCAR, I want us to think about whether this ever happens. What would this graph mean? The only way you can get MCAR is if, like, your research assistant used a random number generator to delete values in the spreadsheet. What could do this? Now, maybe there are cases where the missingness could truly be unrelated to every other variable in the graph. That's what this means: the missingness mechanism is not influenced by anything else we know, or anything else we might need to know, even an unobserved variable. That would be, yeah, your research assistant, your HiWi, comes in and just randomly deletes some values. It could happen. It probably has happened. But this is the monkeys-on-typewriters sort of missingness mechanism. I assert that this is highly implausible in most real research situations. I'm sorry to say, but if you can come up with a convincing example for your data set, congratulations; it's really hard to think of one where this happens. In this case, what else is probably going on? Well, here's one proposal: M is influencing the missingness mechanism, perhaps. This gives us the situation called missing at random. I know: we had missing completely at random, and now it's merely at random, something like that. It's missing at random. And I should say, the person who came up with these terminologies is Don Rubin, who is an absolute genius, just not with terminology. He's the first person to analyze these cases and work out the conditional probability requirements for each. It's a super achievement, but this terminology is not helping to spread the gospel. I don't mean to make fun of the topic. So, M is now entering R sub B, influencing it. What does this mean? The missingness mechanism depends upon the body mass values: species that have particularly large or small body masses are more likely to have missing neocortex values. How about that? How might that happen? Well, anthropologists have different research interests, and they find different kinds of species attractive to study and measure. In particular, for example, maybe small ones are really hard to measure neocortex for, or we're just not as interested in them. There's a bunch of callitrichids out there, little and furry and cute, but there's not nearly as much effort on measuring them as, say, chimpanzees, where there are whole armies of people in this building studying them.
And that makes sense, but it generates a pattern where some features of a species predict missing values in other variables, and are causally associated with that missingness. Does that make sense? I think this is extremely common, and in this case we get the missing at random case. Now I ask you the same question as before: is there a backdoor path from B observed to K? And I won't go through the Socratic thing of waiting for you to say something. Yes, there is now, because there's an arrow entering R sub B from the back. You've got a complete path all the way from B observed, around through M, to K. How can you close that backdoor? You condition on M, as it says at the bottom of the slide. Sorry, I shouldn't have put that there. You condition on M, and that shuts the path. But still, you should look at it. I know it's rusty, but remember all the path-closing procedures. If we condition on M, we block that fork. M forms a fork, and you close a fork by conditioning on the middle of it. So we close the fork by conditioning on M, and then, again, the missingness mechanism is ignorable. We don't have to know it, but we do have to condition on M, and we do have to impute; otherwise there will be bias in the estimate. But you don't have to know R, which is nice, because R is usually not knowable in any detailed sense. I've got a summary on the next slide. But you do need to do imputation; otherwise you're going to have a confounded, biased estimate of the causal effect. So what is MAR, missing at random? Missing merely at random, maybe. It's any case in which K is conditionally independent of the missingness mechanism. That is, there's some variable or set of variables in the graph we can condition on to separate the two, to d-separate them, if you remember that term from way back, chapter six or so. This is missing at random, and it's a nice situation to be in. It's probably the most common situation: there is something else in our system which is associated with, and causing, the missingness. If we can condition on that and do imputation, we have hope of getting a good causal inference out of it. I'll show you how to do this. Why do you need to impute? Because otherwise you're polluting all the other variables with this associated missingness pattern, and that can create really strong biases in the inference. And the final case, which is the worst case to be in, is called missing not at random. I know, the last one wasn't at random either, but this one's really not at random. There are a couple of ways to get this; let me show you the most obvious. The variable itself causes the missingness: particular values of neocortex percent are more likely to go missing than others. How could this happen? Well, in this case I can't think of an example where this is literally true, and I've got another mechanism that could do it in this data set. But maybe species with low neocortex, you guess that from background information, and so you don't measure those, and so you don't have any estimate for those species. In that case it would be the actual value that's doing it. This is nasty, because you get a backdoor that you can't close. The missingness mechanism is not ignorable in this case. There's nothing you can condition on that will shut it, because you don't know the missingness mechanism.
If you did know the missingness mechanism, you could shut that backdoor path, and that's what's required. Your only hope in this case is to model the missingness mechanism and thereby condition on it. If you've got enough scientific information about how the missingness works, you can do that, if you're lucky, in the right case. But there's no guarantee. The other way you can get this effect isn't an arrow from B to R sub B: you could have a latent variable that does it. Good times, right? So on the right I've drawn another version of missing not at random. There's another unobserved variable, U2, and it's a fork which influences both neocortex percent and missingness. What could this be? It could be phylogeny. Imagine: humans, since we're narcissistic, like to study things that are closely related to us, and things that are closely related to us have brains with a lot of neocortex. So in this case U2, if it's phylogenetic proximity to humans, will influence the neocortex percent, and it will also influence missingness. Good times. This happens, like, all the time in primatology. Good. Okay, so there's my summary of missing not at random. This is the case where K is unconditionally dependent on R sub B, and there's nothing you can condition on, except the missingness mechanism itself, that will shut that backdoor. This happens too. I can't say how commonly, but when you find yourself in this situation, your hope is to model the missingness mechanism, which can be done.
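Collecting the three cases as conditional independence statements, with \(R_B\) the missingness mechanism for \(B\):

\[
\begin{aligned}
\text{MCAR:}\quad & K \perp\!\!\!\perp R_B \\
\text{MAR:}\quad & K \not\perp\!\!\!\perp R_B, \text{ but } K \perp\!\!\!\perp R_B \mid M \\
\text{MNAR:}\quad & K \not\perp\!\!\!\perp R_B \text{ conditional on any set of observed variables}
\end{aligned}
\]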
Okay, I've got ten minutes to go. Let me do this slide, because it's about the concept, and then there are a bunch of mechanical slides to come, which I can go quickly through, because all of that's in the text; it's just how to run the model. So here's my other attempt at redefining these; I'm trying to develop a way to teach this stuff. Let's think about dogs eating homework. In many parts of the English-speaking world, this is a way we talk about people lying about why they don't have their homework done: my dog ate it. It's a standard joke, my dog ate my homework. Sometimes your dog does eat your homework, though. It happens. It happened to me when I was a kid; my dog ate my homework, it's true. You can imagine me, I was a grade-A student, going to the teacher: my dog ate my homework. She's like, really, you, Richard? I thought better of you. But it really happened. My cats are better; they don't eat your homework. So imagine a DAG, not a dog, a DAG, with four variables in it, labeled H, H star, A, and D. H is your homework: the score it's worth, the quality of your homework, as a quantitative variable. H star is the version with missing values; a bunch of students are coming in turning in their homework, and some of them are missing. A is some attribute of the student which causally influences the quality of their homework: attention span, working memory, Adderall, whatever. And D is your dog, the missingness mechanism. It was R on the previous graphs; it's now a dog. On the left we've got missing completely at random, which I've relabeled: the dog eats any homework. And this is the DAG for it. The attribute influences homework; on top of that, the true state of the homework influences H star, the version of homework with missing values; and the dog influences H star. But nothing influences the dog. The dog will eat any homework. It's not selective. That's missing completely at random. You with me? Okay, I'm working on this; I just need some development, I know. In the middle, we've got missing at random: the dog eats particular students' homework. The dog cares about the student, the attribute of the student. So now it's like the dogs of students who have particular values of this attribute are more likely to eat their homework. I do have a mechanism for this. The attribute could be something like attention span, and so if you don't pay close attention to your homework and you turn away, the dog eats it. Something like that. So it's an attribute of the student, not of the homework. Of course, getting your homework eaten is correlated with the score on the homework, but not because of the score on the homework; it's because of the attribute of the student who was working on it. That's missing at random, and that's the thing I said is really common in science. Incredibly common in science. Okay, finally the worst case: the dog only eats bad homework. The dog sniffs the homework, assesses its score, eats it. Or, more likely, the homework's bad and the student feeds it to the dog; that's another way this could happen. But it depends upon the score on the homework. And so that's missing not at random, or dog eats bad homework. Now in the DAG we've got an arrow directly from the true H to the missingness mechanism D. Does this help? I'm working on this. I think I'll add it to the book. It might be a little too weird, but that's never stopped me before. Okay, so let me show you a little bit about the mechanics of this. This is all in the book, the code to do this, so I'll necessarily move a little faster and just give you the first step of drawing the owl, the conceptual bit. The key insight, again, is that we think about the generative process and we write down the same model. Every missing value just gets a parameter now, because we haven't observed it, so it becomes a parameter. The model stays the same. We run the model. That's it, basically. Well, there's all this drawing-the-owl part in between that has to do with the algorithm, but let me give you the intuition. There are 12 missing values for neocortex in this data set. On the right, I'm showing you the whole data set. The last column on the far right is the neocortex percent; each of those NAs is a missing value. We're going to assume missing at random: that M, the body mass, is influencing the missingness. Conceptually, what we're going to do is replace each of the NAs in this column with a parameter, and then we're going to get posterior distributions for each of the missing values. The information in those will also flow into the regression, so you'll get different slopes out of this too. So let me show you how this works. The idea is that each of these gaps now gets assigned a parameter, because it's unobserved, and unobserved variables are by definition parameters. That's what they are in Bayes. And these things will be imputed by the model. This is what the model looks like. B is now a vector in which some positions are observed values and some are parameters, but it's all mixed together, and we're going to stick that mix of things into the linear model, in an ordinary regression model. The only additional thing we have to do, shown in blue here, is have some prior for the B values. You can think about this as the model of B, and sometimes your DAG will tell you this. What is B caused by? Your DAG will give you some information about this.
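In code, it looks roughly like this, along the lines of the milk imputation model in the text (a sketch; the model name is mine):

```r
library(rethinking)
data(milk)
d <- milk
d$neocortex.prop <- d$neocortex.perc / 100
d$logmass <- log(d$mass)

dat_list <- list(
    K = standardize(d$kcal.per.g),
    B = standardize(d$neocortex.prop),   # contains NAs; ulam will impute them
    M = standardize(d$logmass)
)

m_impute <- ulam(
    alist(
        K ~ dnorm(mu, sigma),
        mu <- a + bB*B + bM*M,
        # model of B: likelihood where observed, prior where missing
        B ~ dnorm(nu, sigma_B),
        c(a, nu) ~ dnorm(0, 0.5),
        c(bB, bM) ~ dnorm(0, 0.5),
        sigma_B ~ dexp(1),
        sigma ~ dexp(1)
    ), data=dat_list, chains=4, cores=4
)
```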
When B is observed, it informs the parameters inside this model of B. So nu, that's the Greek letter nu, that little v, it's not a v, is the mean of the B values, of the neocortex values. This is a standardized variable, so it'll be very near zero. And sigma sub B is the standard deviation. Those will be estimated from the observed values. When a value's not observed, this becomes a prior for that value; it keeps it from being any old thing. Mind blowing. I can see there's at least one mind blown. Excellent. But it's the same model as before. Here's my annotation saying the same thing: B is a mix, and when B is observed, this line is a likelihood; when it's not observed, it's a prior. In code form, it looks exactly the same, except we add this prior, as above. What ulam is going to do is detect the NAs and construct that mixed vector for you. It tries to help you; it automates this. And once you're out in the posterior distribution, there's a parameter for every missing value. So you see 12 B imputes here, and each of these is an imputed neocortex value. The question on all your minds, I'm sure, is: what does this do to the slopes in this model? Now we've added 12 cases to the data. That's nice; it almost doubles our sample size. So let's compare to the same model where you just delete the cases that have missing values and rerun it. You don't have to change the model code at all; the model stays exactly the same. And now we can compare the slopes. Remember, this was one of these masking-effect cases: we've got two predictors that are positively associated with one another, but one is negatively associated with the outcome and the other is positively associated. Model 15.3 is our new imputed model that uses the full sample, and 15.4 is the old one from way back in the previous chapter. Notice what has happened: the estimates have gotten more precise. They've also shrunk a little bit towards zero, so the previous model was probably overestimating the influence of each of these. But we've gained precision by adding 12 parameters for things we don't know. We've got more precise estimates of the slopes. This is what you expect in a missing at random, or dog-eats-particular-students'-homework, situation: you de-confound and get extra precision by doing the imputation. Let me show you what happens as a pattern in the data. We can plot these imputed values mixed in with all the observed values, but they're going to have standard errors on them, and that's what I'm showing you here. We don't know exactly what these values are; the posterior distributions are pretty wide. Despite that, they help us understand the slopes more. On this graph, we've got neocortex percent on the horizontal. Each of those open circles is an imputed value; the blue circles are observed values. They follow the regression line: the posterior means follow the regression trend, because the regression informs them. If you've got some species with a big body mass, that tells you something about its neocortex percent, because those two variables are strongly associated in all of the observed cases. And so the model automatically accounts for this. You don't have to be clever. Cool, right? I don't like being clever. It's very hard. The disappointing thing about this model, though, is that the relationship between the imputed values and the other predictor is zero, which is wrong. If you look at this graph, you see a regression trend for the blue points, the observed points: there's a strong positive correlation between log body mass and neocortex percent. But the imputed values don't follow this at all, and that's because we didn't tell the model that these two things are associated. So in the text I show you how to fix this, just very quickly, because I know I'm out of time. We do this by saying M and B come from a multivariate normal, and we model their correlation. We did this just like an instrumental variable; it's the same trick, the same kind of code. But you can do it now even though B has missing values inside of it. It's a mixed vector of parameters and observed values. There's code to show you how to do this; you just have to manually construct this mixed vector of things, and there's some code to do that. That's the drawing-the-owl part. And then, at the end, happy days: they're associated, and you get even more precision in the estimates of these things.
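For reference, the fixed model looks roughly like this, along the lines of the multivariate-normal version in the text (a sketch; merge_missing is the rethinking helper that builds the mixed observed/imputed vector, and 29 is the number of species in the milk data):

```r
m_impute_mvn <- ulam(
    alist(
        # K as a function of B and M
        K ~ dnorm(mu, sigma),
        mu <- a + bB*B_merge + bM*M,
        # M and B drawn together, so their correlation is modeled
        MB ~ multi_normal(c(muM, muB), Rho_BM, Sigma_BM),
        matrix[29,2]:MB <<- append_col(M, B_merge),
        # B_merge mixes observed values with imputed parameters
        vector[29]:B_merge <- merge_missing(B, B_impute),
        # priors
        c(a, muB, muM) ~ dnorm(0, 0.5),
        c(bB, bM) ~ dnorm(0, 0.5),
        sigma ~ dexp(1),
        Rho_BM ~ lkj_corr(2),
        Sigma_BM ~ dexp(1)
    ), data=dat_list, chains=4, cores=4
)
```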
Okay, this is a really big topic, missing data, and there are lots of things which are kind of like missing data but don't feel exactly the same. One of the areas I think is most important is a family of models in ecology called occupancy models, and mark-recapture methods. These are really missing data problems, in a sense. They're kind of like measurement error: there's a true occupancy, whether the species is there, but you can't observe it, and the zeros are not trustworthy. And so there's this special relationship: you do imputations of the true states. There's this latent thing, whether the species is really there or not, and you need to impute it. That's how these models work. They're like missing data models, but they have special structure, which comes from the detection process that you model. Okay, I realize I'm out of time. I'm going to put up a final homework later this afternoon, after I do it myself. I think it's good, but we'll see after I try it. In it, you will do some imputation practice with some primates, and it's due in a week. Even though the course ends today, please turn it in within a week for a full sense of satisfaction, and if you want some certificate of completion, I'll be happy to give you one as well. So with that, thank you for your indulgence for the last ten weeks. We've gone a long way from the Golem of Prague, and as you go home and deploy your golems, I just want you to remain humble in their presence. I hope you've learned something valuable. Thank you.