Welcome to Lecture 17. I want to start by talking about rituals. March 1st this year, 2022, is Shrove Tuesday. And on Shrove Tuesday, many Christians burn things. For example, palm leaves, or in places where palm leaves aren't available, other kinds of leaves. What does this ritual mean? Well, of course, it has a symbolic meaning that's tied to Christian history. But what it really means is its function to the community that practices it. That's the importance of the ritual. And it can have that function even if the people participating know nothing about the biblical story in which it originates. The other thing that is traditionally burned on Shrove Tuesday is pancakes. They're not intentionally burned, not usually, but nevertheless, lots of them end up burned. And this leads us to another aspect of meaning and understanding and function. That is, the function of probability theory does not typically require that we understand the results it produces. Let me try to explain that with some pancakes. So imagine that I cook three pancakes, and I burn two of them. One of them has been burned on both sides, there at the bottom of the screen. Another, I've only burned on one side. I was starting to get better. That one's at the top of the screen. And then, by the third one, I had actually figured it out, and I have a perfect pancake that's not burned on either side. So now, you come to my house and I serve you one of these pancakes at random, and the side facing up is burned. And now the question is, what's the probability the other side of this pancake is also burned? This is a version of a classic probability paradox known as Bertrand's Box Paradox, but we can call it Bertrand's Pancake Paradox. It's a paradox because intuition does not work here. It's extremely difficult for people to intuit the right answer. And even when the right answer is explained, it's often very hard for people to hold on to it. But probability theory gives us the right answer. So let me walk through this just to start. To remind you, here's the setup. There are three pancakes. The first one on the left here is burned on both sides. The second one is burned on only one side. And the third is not burned on either side. You are served a pancake and you can only see the side of it that is facing up on your plate. And that side is burned. What's the probability the other side is also burnt? Most people, the vast majority, even those educated in probability theory, say that the probability is one half. And that's wrong. The intuition is that there are two pancakes that have burnt sides, so it has to be one of the two, and it's just random, like a coin flip. But this is not the right answer. Let's derive the right answer, and we're going to do it by resisting the temptation to be clever. We're just going to use the axioms of probability. So let's start with the definition of conditional probability. The whole deal with probabilistic inference is that we condition on what we've learned to figure out what we don't know. I'll say that again. The whole principle of probabilistic inference is to condition on what we have learned to see what the implications are for what we haven't learned. And so that means we use conditional probability. And the definition of conditional probability gives us the formula we need in this case: the probability that the other side of the pancake, the down side, is also burnt, conditional on the up side being burnt.
That is the question, and it is equal to the probability that both sides are burnt — the joint probability of burnt up and burnt down — divided by the probability that you get a burnt side facing up. And we can calculate these two terms. The top one here, the probability of burnt up and burnt down, is just one-third, because only one of the three pancakes has both sides burnt. So the top here is one-third. What's the bottom? We can calculate the bottom again using the axioms of probability, by averaging over the possibilities. There's a one-third chance of each pancake. And if it's the first one, there's a 100% chance that the burnt side is up, right? If you're served, at random, the pancake where both sides are burnt, then whichever side ends up flipped up will be burnt. Next, in the middle, if you get served the pancake where only one side is burnt, then only half the time will the up side be burnt. And then finally, on the right, if you luckily get served the good pancake, then no matter which way it's flipped, the up side will never be burnt. You do the arithmetic here, and the probability, if you're randomly served a pancake, that the up side is burnt is one-half. We can put these two bits of the answer together. One-third divided by one-half is two-thirds, and that is the correct answer to Bertrand's pancake paradox. It's paradoxical, again, just because intuition does not grasp it. Intuition leads us very strongly to another answer, and even once we've got this answer, it somehow doesn't seem right. If you look in the book, I give you a short simulation to prove another way that two-thirds is the right answer, and yet still for many people, myself included, having seen this answer and derived it multiple ways, it does not seem intuitive. If you want a little bit of intuition, though, a trick that only works in this case is to realize that it isn't the number of pancakes that are burnt that matters. It's the number of sides that matters. We're interested in the sides, and there are three burnt sides among the six pancake sides here. The reason I introduced this pancake story is to prime you to think about how probability theory works and how its function does not really rely upon us grasping its meaning in a particular way. In fact, it's not clear that it's even a meaningful question to ask what the meaning is of a set of logical axioms. Being clever in scientific reasoning is incredibly unreliable. It's also opaque, and that opacity means that no one else should trust your clever solution. It might work in some particular case, but not in another. It's much better in nearly all cases to resist the urge to be clever, to show off, to signal, and instead be ruthless: ruthlessly follow the axioms of probability. Probability theory provides solutions to problems like the pancake problem, but also real problems, if we'll only follow the rules. And in this lecture, I'm going to show you some very important and extremely common data analysis problems where probability theory gives us essentially automatic solutions, if we'll only follow the rules of probability theory. And often, when we follow these solutions that we get just by following the rules, the results are sometimes paradoxical. It's not entirely clear why it happens the way it did, because our intuitions would have guessed something else would happen. And again, that doesn't mean anything is wrong.
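As an aside, here is a minimal simulation in the spirit of the one in the book — a sketch I'm adding here, not the book's exact code — that checks the two-thirds answer by brute force:

```r
# Serve a random pancake with a random side up, many times, and count how
# often the down side is burnt given that the up side is burnt.
sim_pancake <- function() {
  pancake <- sample(1:3, 1)                             # pick one of the three pancakes
  sides <- matrix(c(1, 1, 1, 0, 0, 0), 2, 3)[, pancake] # 1 = burnt side, 0 = not burnt
  sample(sides)                                         # shuffle which side faces up
}
serves <- replicate(1e5, sim_pancake())
up <- serves[1, ]
down <- serves[2, ]
sum(up == 1 & down == 1) / sum(up == 1)                 # approaches 2/3
```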
Often there's nothing else to understand in this business, except that the meaning of the results is the axioms themselves. They're the things that engender the results; it's just the logic that follows from them. In the lectures last week, I already showed you an example that involves the rich elements of real data analysis: these Gaussian process regressions where we have measurement error, and underneath it there's some complex process going on that we're trying to learn about, and there are many, many possibilities for where it's coming from. In the lecture today, we're going to pick up from there and think about measurement error as a big area on its own. It's present in nearly all scientific projects and all research projects, and we should take it seriously and think about some principled ways to deal with it. This will also take us all the way back to earlier weeks in the course. I think it was the third week where I introduced the four elemental confounds, and the fourth one here, the descendant, has not been mentioned as often as the others. But at the time I did say that it has something to do with proxy variables and measurement error, and it's going to play a leading role today. The truth is that many of the variables we analyze in statistical projects are actually just proxies of the causes of interest. They're imperfect measurements, ghosts of the things that we'd really like to see. And it's extremely common, as in many of the examples up to this point, to simply ignore measurement error. But we know that that is not always safe, and so there are also many ad hoc procedures for dealing with measurement error. These can be unreliable because they rely upon cleverness and intuition, and while they may work in isolated contexts, they will not generalize very well. It's better to think causally, because of course measurement is a causal process, and the mechanisms that produce measurements are things we can model generatively, just like everything else in the systems we study. If we're willing to do that — make generative causal models of measurement — then we can lean on probability theory to tell us how we should analyze those measurements. I'd like to prime our thinking about measurement error with a very simple example, and then we'll build up into more complicated measurement error projects that are more realistic. In this example I want to attack a particular myth that is common in the sciences, which is that measurement error can only reduce the size of an estimated effect, never increase it. And so you'll read, for example, that if there had been measurement error, it has simply reduced the size of the effect, and so the true effect must be bigger than what is reported in the paper. This is not true, not in general. Here's a simple example to show you why, though it's not the only mechanism that will produce this. Suppose we're interested in estimating the causal influence of parent income on child income. However, parent income is not really observed. Instead we have a proxy of it, and that proxy comes from the child's recall of what their parents' income was like when the parents were the same age. This parent income variable, which I've written here as P star, is a function of both the true parent income, shown there with the dashed circle around it to indicate that it's unobserved, and some error, e sub P, on this graph. And we can add it to our DAGs like anything else.
And to add a little bit more complexity here, but realistic complexity, the child's income itself biases recall and influences the error, such that children tend to recall their parents' income as more similar to their own. In cases like this, where we have some variable that's measured after the outcome it influences, these kinds of errors are not rare, or at least they're thought not to be rare. What do they do? Well, they can do lots of things, but they always induce some kind of bias in our estimate of the causal effect. So let's look at this very simple code example here. This is like the other code examples in previous lectures. I'm going to take you through the little bits of the code and show you how each bit corresponds to the DAG on the right. The first thing we do is simulate parents' income. It has no parents of its own, the variable, so we simulate it first. Then we can simulate the causal effect of parents' income on child income. And just for the sake of this example, I've made the causal effect zero. You'll see in the code on the left I'm multiplying the parents' income by zero, so that there's no causal association at all between the child's income and the parents' income. It's not realistic. It's just a provocative example. Then we simulate measurement, and that's the rest of the graph. What I'm doing is creating a mixture of 80% of the parents' income and 20% of the child's income, and that's the mean of P star. If you wanted to calculate e sub P, remember, it's just the difference between the true parent income P and P star. And then we can run the model. This is an ordinary linear regression programmed in ulam, and we get an estimate. And what you see is that for the slope coefficient b here, all the probability mass is above zero, even though the true causal effect is zero in the simulation. This is not a complicated example. This sort of stuff is typical, and it's meant to highlight that even in very simple data analysis problems these sorts of measurement processes can lead us astray. In this particular example, the bias depends upon the size of the original causal effect. So if the causal effect is zero, it biases us up. But if it's large — let's change the simulation to make the causal effect so that the actual coefficient is 0.75 — now the estimate is below that. As you can see, I've updated the posterior distribution plot in the lower right again, and the vertical dashed line is the true effect, at 0.75. And you'll see that we've gotten nowhere near it, and that's the consequence of the measurement error. So what's a general way to do this? You could probably intuit from the previous example that what we're going to do is try to add the measurement process to our generative models, whether those generative models are the sort of heuristic DAG types or much more specific simulations. In order to understand measurement and deal with it in a transparent and justifiable way, we need to model it. And that's what we're going to start to do. In the rest of this lecture, I'm going to give you just two examples of ordinary sorts of measurement error problems, and then at the very end, I'll mention the wider horizons of ways that the statistical community has tried to deal with measurement error in principled ways. We're going to return to the marriage, divorce, and age-at-marriage example from the first half of the course, so I don't have to reintroduce the data set.
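To make that concrete before we turn to the divorce data, here is a compact sketch of the kind of simulation just described. It is my reconstruction, not the lecture script: the variable names, sample size, priors, and the 80/20 mixture are assumptions matching the verbal description above.

```r
library(rethinking)
set.seed(17)
N <- 500
P <- rnorm(N)                          # true parent income (standardized)
C <- rnorm(N, 0 * P)                   # child income; true causal effect set to zero
P_star <- rnorm(N, 0.8 * P + 0.2 * C)  # recalled parent income, biased by child's income

dat <- list(C = C, P_star = P_star)
m_recall <- ulam(
  alist(
    C ~ dnorm(mu, sigma),
    mu <- a + b * P_star,              # naive regression on the error-laden proxy
    a ~ dnorm(0, 1),
    b ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = dat, chains = 4)
precis(m_recall)                       # b tends to sit above zero despite a zero true effect
```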
There are a couple of issues that arise from the fact that all of the variables of interest here — the divorce rate, the marriage rate, and the median age of marriage in each state — are measured with error. They're measured with error because the data were collected in a finite time window, and only so many marriages and divorces occur within this time window. And this means that we don't know exactly what the long-run expected rate is in each place. Of course, those rates may also be changing, but we can leave that aside and recognize that from any finite amount of data we get an imperfect estimate, one we have updated our prior with. So uncertainty remains about each of these things, these averages. Remember, averages are never observed. Rates are never observed. They must be estimated, and so typically they always have uncertainty. Now, measurement error sometimes can be ignored. For example, if every entity in your model has the same amount of measurement error, then it all gets absorbed in the sigma. But there are lots of cases where that's not true. We're going to think about two here. The first is when there's imbalance in the measurement error — imbalance in evidence, in this case. And in this data set, there's really strong imbalance. Some states have much better estimates than others. The second is that measurement error can be related to other variables in the graph, and then, as in the income example we started with, you have a potential for confounding. So let's look again at the divorce and marriage data, but now let's add the errors in the data frame. I've provided standard errors for the divorce rate and the marriage rate, and I've plotted these together with the variables on the graph on the left of this slide. You can see that there's a lot of variation in the size of these errors, the uncertainty in each point. There's a general correlation between the error in marriage rate and the error in divorce rate, but they're not the same thing either. What's driving this variation in uncertainty across states? Well, it's mostly, not exclusively, but mostly just the population size of the state. You can see this on the right, where I'm showing you the divorce rate on the vertical, with the standard error interval plotted with it, and the log population size of the state on the horizontal. Why does this arise? Because states with fewer people have fewer divorces and marriages every year, and so they provide less data, and so you'd need a longer time series of data to get the same precision of estimate as a large state like New York or California or Texas. So let's think generatively about this measurement error process and add it to the heuristic causal model. Let's start with just one variable at a time to keep it easy. We'll begin by taking account of the fact that the outcome variable, divorce, is not really observed. The divorce rate is not observed. Instead, we simply have some pale ghost of it: the D star here, which is the observed divorce variable. It is a function both of the true divorce rate D, which we can't observe, and of some measurement process which generates errors, a different error for every state. Of course, the other variables also are not directly observed, and so this structure is mirrored for the other variables, making a quite complicated-looking heuristic causal model. Don't panic, because we're just going to back up a bit in a moment and deal with one variable at a time.
But for now, let's think conceptually about other issues. With what we've got on the screen right now, if the errors vary from state to state, that means that if we fail to take the measurement error into account, we will be giving undue weight to lousy data in some cases, and that may lead us astray. So that's already a problem. But as I showed you, it's extremely plausible that the errors are influenced by the population size in each state. And in this case, we start to see that there's potential for noncausal paths that are opened up when we start conditioning on the star variables, each of which is a collider on the path it sits on. Remember, we can't regress on the true variables in this graph. We have to regress on the observed ones. And so potentially we can get into trouble. Suppose, for example, that population size also influences divorce rate in some other direct way. That is, in small states, the marriage market seems really different for people, and they may be more or less likely to get divorced. The same applies for marriage. In that case, it's possible that you can get all sorts of backdoor and other noncausal paths that cause us difficulties, and we want to think very carefully about these things. You may intuit here that we could condition on population. We might also be able to incorporate the measurement error itself into the graph and make it work. How are we going to develop a solution to something that looks this complicated? What I want you to do is think back to the pancakes. Always trust the pancakes. What I mean by that in this case is: try not to think like a regression. Resist the urge to be clever, and instead think like a graph — what I'm going to call, for this lecture, thinking like a graph. The problem with thinking like a regression is that when you think like a regression, the idea is that you've committed to a particular, very narrow range of statistical models, in which your job is to figure out which predictor variables to use and how to combine them on the right-hand side of a linear model. And that is a very narrow part of statistics, because it's a very narrow set of approximations for modeling natural systems. What we want to do instead is think like a graph, by which I mean: how do we model the network of causes? This could involve multiple simultaneous equations, not just one. And it takes direct inspiration from your causal model, whether it's heuristic like a DAG, or something much more algorithmic, like a structural causal model or a set of dynamic equations. So let's think like a graph. That means we're going to break this thing up and think about it as a collection of submodels which are all operating simultaneously. And we can build a statistical analysis from that which deals with measurement issues as well as the available information allows, but does it in a transparent and justifiable way. So I said I'd simplify again, and I will. Let's pretend we've observed A and M, the age at marriage and the marriage rate, without error. Just pretend, for the moment, to keep it simple — but the outcome variable, divorce, has not been observed. Instead, we've got this proxy D star. What I want you to realize is that we've got a couple of submodels inside this causal graph. The first is the model for divorce, how the real unobserved divorce rates are produced. They're influenced, according to this model, by the marriage rate and the age at marriage.
The other submodel is the model for the observations, the measurement model for D star. D star is a function of the true divorce rate D and the measurement process, which gives the errors, e sub D, in this example. Of course, there's a third submodel here, which we're not going to deal with just at this moment. We'll fold it in a little bit later in the lecture. That's the model for M: M is produced in part by the age at marriage, according to this graph. So this simple little DAG here has three submodels, which implies three statistical models, but they are joint statistical models. They have implications for one another because they share parameters and data. So let's write those submodels. Again, just for now, we'll focus on just the red and blue subgraphs here and build up a statistical approach, and then we'll fold in additional complexity as we go. I'm doing it this way not just so it's easier to understand, but because this is how you should work when you do your own projects. We know where we want to go, and where we want to go is pretty complicated. So if we're going to successfully get there, to have a real credible path to make it, we need to go one piece at a time. We start with the red part, and this is a linear regression of the true divorce rate on A and M. Of course, we haven't observed the true divorce rate, so it's not clear how we're going to do this yet, but hang on, we'll get there. This is the model that's implied by the graph, assuming you're happy with a linear regression. You don't have to be; this could be nonlinear. And then the measurement model, in blue: the observed D star is just an additive combination of the true D in each state, D sub i, and the error for that particular state. For the errors, we have a bit of information from the statistical bureau that produced these estimates: the error on each state's divorce rate estimate is normally distributed with a mean of zero and a standard deviation that they have provided to us as data. This is what we'll call S sub i, what is sometimes called the standard error of the measurement. We can rearrange that equation by folding the error distribution in, so that the distribution of the D star observations is normal with a mean of D sub i, that is, the true value, and a standard deviation of S sub i, the reported standard error. Let's take a break. There are going to be a couple of breaks in this lecture because, even though it's not a computationally complicated lecture, it's conceptually complicated. So let's take a break already, and please review the previous slides, make sure you're ready to keep going, but then really do take a break, take a walk, have a drink. And when you come back, I'll still be here. Before the break, we were looking at the submodels of this small causal network that contains measurement error, and I asserted that we have these two submodels: the red one, which is the generative model of the divorce rates, and the blue one, which is the measurement model for the observed divorce rates. Now let's do something statistical. We can put both of these models together in the same Markov chain and run them together, and that's important to do because they share a symbol — they share the D sub i values that we haven't observed — and we're going to make those unobserved values into parameters. Why can we do that? Because in Bayesian probability, unobserved variables are parameters. That's all they mean. And observed variables are data.
That's all it means. And importantly, for a variable like a divorce rate, if it's measured with error, then the true value is unobserved and it's a parameter. If it's measured without error, then it's data. And in the same vector, sometimes some of the states will be observed with error and others without. You can have a mixture of data and parameters, and that's all okay. It's totally orthodox. So it's going to provide an automatic solution for us here, if we resist the intuition that data and parameters are somehow fundamentally different things. Now to be clear, before I explain the code: of course, when we do the calculations, data and parameters really are treated differently. But the structure of the model, the equations on the right, happened before the sample does. That is, the scientific structure of the model should be independent of the sample we get. And so whether some variable has been observed or not in the sample we have is what determines whether the variable is a parameter or data. In the model it's just a variable, and when we make generative models, there's really no distinction between parameters and data. It's only after the sampling, when we build the statistical model, that the distinction becomes important for the calculations. Okay. This will become clearer as we work through the examples today. But remember, think about the pancakes. You don't have to understand; you just have to follow the rules. And the meaning of the results is the rules themselves, the axioms of probability. Okay. The first model here is the observation model, and this is just a regular old regression with some parameters inside of it and some data. In this case, D star is an observed variable, the observed divorce rates. We assign that a normal distribution where the mean is the unknown true divorce rate in each state — so that's a parameter — and D_sd holds the reported standard errors in each case, which are also data. And then we have the submodel where we treat these unknown true divorce rates as the outcome in another regression. It has the same structure as every other regression you've run, except that now the outcome values are unobserved. But the model looks the same. The rest of it is just a linear regression, just as before. And we make this a vector of length N, because there are N states whose true divorce rates we haven't observed. It turns out there's a lot of information in the data. These divorce rates are only observed with error, but there's a lot of information in the data that can be used to get more precise estimates of them, through the regression relationship of divorce with the other variables. So what we end up inferring are 50 different true divorce rates, and each of them is inferred as a posterior distribution. That is, we don't know its precise value, but we get a range. And the true value is not, in general, centered on the observed value. I'll show you why. Here are the observed divorce rates plotted against the observed median age of marriage values for each state, just as in the earlier examples in the course. And there was a regression relationship between these variables, right? There's a strong, potentially causal relationship between the median age of marriage in each state and its divorce rate — a negative relationship.
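For reference, here is roughly what this first model — measurement error on the divorce rate only — looks like in ulam. This is a sketch in the style of the model in the book's measurement-error chapter; the exact names and priors here are my assumptions, not the lecture slides verbatim.

```r
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

dlist <- list(
  D_obs = standardize(d$Divorce),
  D_sd  = d$Divorce.SE / sd(d$Divorce),   # reported standard errors, rescaled
  M     = standardize(d$Marriage),
  A     = standardize(d$MedianAgeMarriage),
  N     = nrow(d)
)

m_D_err <- ulam(
  alist(
    # measurement model: observed rate centered on the unobserved true rate
    D_obs ~ dnorm(D_true, D_sd),
    # generative model: the true rates get the regression as their prior
    vector[N]:D_true ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ), data = dlist, chains = 4, cores = 4)
```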
And then what happens when we introduce to the model the fact that the divorce rates are measured with error, and we run the measurement model at the same time as we run this regression? What happens is that the divorce rate estimates shrink towards the regression line. And they shrink towards the regression line because of partial pooling. What we have done is create a varying effects model. We didn't know we were doing it — it was forced on us by the logic of the causal graph — but this is what has happened. And we have gotten partial pooling, which has shrunk the estimates towards the regression line for exactly the same reasons that you get partial pooling in the previous examples. So what I've done now is highlight in red and pink the new estimates, including the new regression estimate there with the pink tilted bow tie. The raw data are still the black points. And you can see that a lot of points have moved quite a lot, especially the ones that are furthest from the regression line, because that's how partial pooling works, right? The most extreme points, the ones least consistent with the statistical inference, are shrunk the most towards the mean. Oh, and I wanted to say: the regression relationship itself has also changed a little bit, but only a little bit. There are lots of states that haven't moved very much, and there's been a lot of movement on both sides. But you'll see that the pink bow tie part, which is the new regression relationship, is a little bit steeper than the previous one, which is shown by those thin dashed contours — that's the previous regression relationship. Maybe if I run the animation again, it'll help give you an idea of how this is working. Yeah, and then you can see the regression relationship changes a little bit, not much. And again, it changed exactly as much as it should, given the assumptions of the model and the data, because probability theory has done the only thing it is allowed to do: it has followed the rules. Okay, let's fold in another piece of complexity. You know that there are other variables here that are measured with uncertainty, like the marriage rate, and the marriage rate is symmetrical to the divorce rate. It's a predictor now — a cause, not an outcome — but it also has a measurement model that's structurally the same as the measurement model for divorce. Now we have more submodels, and we're going to take them all and put them into the statistical model just as before. To remind you, here are the two we've already dealt with: the measurement model for divorce in blue, and the generative model for divorce in red. And then we have two new models that we're going to add to this: the measurement model for the marriage rate in black on the left, and the generative model for the marriage rate in cyan there on the bottom. And we're going to put all four of these submodels together in the same Markov chain now. So I'm going to walk through this — don't panic — one step at a time. At the top we have the measurement model for divorce. This just repeats the structure from before, but I'll talk through it again, because I appreciate that it takes more than one example for these things to sink in. It's a paradoxical set of pancakes. So the observed divorce rates in each state have a normal distribution where the mean is the true value, and the standard deviation is the reported standard error.
And then the unobserved true divorce rates in each state are a vector of parameters, one for each state, and the prior for this vector of parameters is literally a regression equation. It's a normal distribution with some mean and sigma, where the mean is a linear model, and this is fine. So this looks exactly like a likelihood, a probability of data, because probabilistically it is the same. It's just that the outcome is unknown, but probability theory will still process it using all the legal rules of probability theory. And then this structure is mirrored now for the marriage rate variable. We have the observation model for marriage rate, which again says the observed marriage rates are normally distributed around their true values with some known standard error that comes from the population size and the survey process and so on. And then the unobserved true values of the marriage rate in each state: again it's like a linear regression, but the outcome is a vector of parameters instead, and this linear regression is the prior for the true value in each case. Okay, you run that Markov chain, it runs uneventfully, it samples quite efficiently. One of the consequences now of having both a predictor variable and an outcome variable that are measured with uncertainty — let me flip back here and show you something I skipped over until now. If you look at the model for D, there's a new bit: in the linear model for the true divorce values, we no longer have the observed marriage values, but the true ones. You see that in the linear model line — it's alpha, plus a slope for age at marriage times the observed age at marriage, plus a slope for marriage rate times the true marriage rates, which are not known, so that's a parameter. So what this model does, what the Markov chain does as it runs, is average over the uncertainty in the unknown marriage rates while estimating the relationship to the unknown divorce rates. That's a complicated thing to do, and it would be for a human, but the computer does not find it any more challenging than any other probability theory problem. And this is a necessary, correct implication of probability theory and our causal assumptions. This is how, as I said, if we're willing to relax enough and suppress our desire to be clever, we can get essentially automatic solutions to seemingly incomprehensible problems, like how to incorporate measurement error into our models. So, back to this weird-looking graph. The consequence of doing this — of having a regression where both the outcome and one of the predictors are actually parameters, are unknowns — is that you get partial pooling in multiple directions at the same time. And so now the marriage rates on the horizontal and the divorce rates on the vertical are both shrunk towards the regression relationship. The black points here are the original observed values, and the red points are the posterior means, in each case, for each state. You'll see again that the more extreme ones are shrunk the most, but there's a bit of a helter-skelter pattern to where the shrinkage goes, and that's because there's another variable here, the age at marriage, which has a strong relationship with both of these variables, and it's pulling in a third dimension that I have not plotted. We want to look at the coefficients, though: what happened to the regression relationship between divorce rate and the two explanatory variables, that is, age at marriage and marriage rate?
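Before the coefficients, here is a hedged sketch of what the joint model with all four submodels can look like in one ulam call, continuing from the data frame d loaded in the earlier sketch. Again, the names and priors are my assumptions; in particular, I've given the true marriage rates a regression on age at marriage as their prior, matching the verbal description above.

```r
dlist2 <- list(
  D_obs = standardize(d$Divorce),
  D_sd  = d$Divorce.SE / sd(d$Divorce),
  M_obs = standardize(d$Marriage),
  M_sd  = d$Marriage.SE / sd(d$Marriage),
  A     = standardize(d$MedianAgeMarriage),
  N     = nrow(d)
)

m_DM_err <- ulam(
  alist(
    # divorce measurement model
    D_obs ~ dnorm(D_true, D_sd),
    # divorce generative model (true marriage rate appears as a parameter)
    vector[N]:D_true ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M_true[i],
    # marriage rate measurement model
    M_obs ~ dnorm(M_true, M_sd),
    # marriage rate generative model: its prior is itself a regression on A
    vector[N]:M_true ~ dnorm(nu, tau),
    nu <- aM + bAM * A,
    a ~ dnorm(0, 0.2),
    aM ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    bAM ~ dnorm(0, 0.5),
    sigma ~ dexp(1),
    tau ~ dexp(1)
  ), data = dlist2, chains = 4, cores = 4)
```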
So I show you this on the left of this slide. In the upper right, we have the posterior distributions of the coefficient on age at marriage, and in gray is the posterior distribution we got long ago, in the lecture where I first introduced this data set. If we ignore the measurement error, we get a strong negative relationship between age at marriage and divorce. When we incorporate the measurement error, it actually becomes a little bit weaker in this particular regression. Still, almost all the posterior mass is below zero, so we can still be pretty confident that if this is a causal effect, it's a negative one, and still pretty substantial. There's been a bigger change for the coefficient on marriage rate, in the lower right. So now the coefficient on marriage rate — and this is the one, remember, where we just incorporated some uncertainty for it. When we ignored the error, the posterior distribution straddled zero, and it was reasonably narrow around it. When we add the error, it gets wider, which maybe doesn't surprise you: if you have less information about a variable, then a wider range of causal effects is consistent with the data. But it has shifted upwards as well. So it isn't just that it's gotten wider with the mean staying in the same place; it's actually become such that most of the probability mass is above zero now. Why did this happen? Well, as always in applied statistics, if you understand your context well enough, you can understand why probability theory produced the results it did in that particular case. In this case, when we include error on the marriage rate, that ends up down-weighting some of the states that had particularly unreliable and misleading estimates, according to the model and these data, and that is what produced the change, so that there's now more evidence of a positive relationship between marriage rate and divorce, even though we included uncertainty on the variable. But this is a good example that, in other contexts, the opposite could happen, or nothing could happen, when you include measurement error. When you add measurement error to a model, it can either hurt or help your analysis from the perspective of the results you're after. Either way, the only honest option is to do so. Okay, I'd actually like to take another break at this point. Again, the computational details here aren't really new, but conceptually this is wild new territory. So please scan back through the previous section, maybe all the way back to the beginning of the lecture, make sure you've got the key concepts in order, then take a walk, and when you come back, I will be here. Okay, welcome back. Let's start the final example of this lecture: another measurement error example, another data set. This data set comes from northern Namibia, a place called the Kaokoveld, and this is a very arid landscape with more livestock than people. The people we're going to be interested in are the Himba. The Himba are a successful group of Bantu agro-pastoralists who have a really interesting kinship system and family system from an anthropological perspective. It's a very small number of people for a very big area, only 50,000 people in what is almost 50,000 square kilometers. And their family system has something called bilineal descent.
If you'll tolerate this for a moment: I am an anthropologist, and these things I think are really important to understanding human societies, so please bear with me while I explain what this means and why it's important to understanding the data we're going to look at. Bilineal descent is very rare in human societies, but it's also not unheard of. It's found in many places. It means that people trace their descent through both the male and female line at the same time. They keep track of who their ancestors were through their mother and who their ancestors were through their father, and they get different rights and privileges through both sides. So in the case of the Himba, wealth is passed through the mother — this is called matrilineal inheritance — and so access to livestock and material resources comes through who your mother was. For example, young men receive inheritance from their mother's brothers, not from their fathers. However, they receive status through their father, particularly as it relates to religious rituals and rights. And they're also patrilocal, so women move to their husband's family location when they get married. Polygyny is common in the Himba, as it is for many pastoralists in Africa, meaning that men typically, by the time they're senior, have multiple wives at the same time, who live in large cooperative households together. The Himba also have a system which is rare, but again not unheard of, of what in Europe we'd call open marriages: it is normative, and indeed expected, that both men and women will have romantic and sexual relationships outside of their marriages. On the right of this slide, I'm showing you some data for the contemporary Himba. This is data collected by Professor Brooke Scelza. You can see that among young people, middle-aged people, and people over the age of 45, majorities of both men and women report having extramarital relationships. And this is totally normative among the Himba. So the question is, what does this do to the structure of families? How do they work? The social norms among the Himba are that the social father — that is, the man who is married to the woman — provides for all of her kids, no matter who their fathers are. This is the norm, and it's accepted. Westerners often have trouble accepting that this could be true, and one of the things this has led to is an attempt to actually do paternity testing among the Himba. The Himba don't want this — they don't care — but they'll tolerate it. It's the Western scientists who want to know the actual rate of what's called extra-pair paternity among children in the Himba. So this is our estimand: what is the proportion of children who are fathered by extra-pair men in these marriages? And remember, this is not cuckoldry, because it's culturally sanctioned among the Himba. It's okay, it's expected, and lots of social fathers provide for kids that are not their biological kids. It's just their job. But we're going to try to estimate the rate of it using genetic paternity tests. The thing about these tests, though, is that they're not 100% accurate, and in this particular case, even though the geneticists used the latest, fanciest genomic techniques to do the paternity testing, there's about a 5% false positive rate for extra-pair paternity. That is to say, about 5% of the time a child will be assessed by the protocol to not be the social father's child, even though he really is.
And this is an example of what we call in statistics misclassification. It's a kind of measurement error where the variable is discrete: you have a categorical variable and you end up misclassifying it. So how do we include misclassification in a statistical model? This is a very common kind of measurement error, so I wanted to show you an example of it as well. And the code machinery, the way you make the golem do this, is very different than how you do it for continuous measurement error. So hang on, I'll walk you through it. Of course there's a DAG. We won't do too much work with this DAG, but I wanted to give you an idea again of the structure we're talking about here. The idea is that there's some outcome variable X for extra-pair paternity. This is a 0/1 outcome variable, and it applies to a child. That child has a mother, which you observe there on the right, the M, and a social father, who is the mother's socially recognized husband, there in the lower left, the F. And these two individuals also form a dyad — this is like a social network — and that mother-father dyad is in the graph as well. There is divorce and remarriage among the Himba, as in most human societies, and so women and men can be in multiple marriage dyads over their lifetimes. And all of these things can have an influence on extra-pair paternity, through the quality of the relationship or the time they've been together or any number of other things. What we're positing here is that X is not observed directly. It's an unobserved variable. Instead, we get its pale reflection, X star, which is influenced both by the true value of X and by some measurement error. And we're going to make a model, as before. Now, the general structure with the submodels is the same here — or at least sufficiently the same — as it was before the break in the marriage example. The action here is going to be in how we have to code this up. A discrete error is different, so bear with me. The first submodel is the generative model at the top of the slide, and I don't think there are any surprises here. I'm going to include in this one the mom as a varying effect, so that each mom may have a different rate of extra-pair paternity among her kids. That may be because she's less close or closer to her boyfriend than she is to her husband. And then we're going to have dyad-level effects as well, represented by the delta here. These are standard varying effects, as in the previous lectures — nothing really surprising going on. We do need to use non-centered coding for these, but you're a pro at that by now. The measurement model at the bottom is the new bit, and this is going to take some explaining, so let's focus on that. To make the measurement model work, what we need to be able to say is how each of the possible values of X star can be produced through the measurement process. I've written two equations to express that, and maybe you already intuit why they have the structure they do. But if you're like me, you need to remind yourself every time you do one of these models why it needs to have this structure, and that's important. Remember: trust the axioms. Avoid being clever. It's nice with these sorts of problems to make a probability tree, where there are branching possibilities, and then we can count up the possibilities and get the equations we need to express the probabilities of each event. So the first thing that happens when a measurement is generated is the actual causal process.
And so I'm showing you, in this first little probability fork in the middle of this slide, that there are two possible true values for the X variable: either it equals one or it equals zero. When it equals one, the child is the result of extra-pair paternity. That means one of the boyfriends is the biological father. The social father will still care for this child among the Himba. Or X sub i equals zero, in which case the social father is also the biological father. These events happen inside the model with probabilities p sub i and one minus p sub i. This p sub i will be our probability, from the generative model, of extra-pair paternity. Then, if the true value is actually zero, there's a chance of a false positive. The false positive probability is going to be represented by the symbol f here, and it's about 5%, about 0.05, for this particular example. And then we produce X star, finally, at the bottom down here. In the case that X sub i equals one, there's no possibility of measurement error in this example. The idea is that there are no false negatives, but there are false positives, and that generates this tree. So now we can use this tree to justify the equations at the top. Let's start with the first one, the probability that X star i equals one. Well, there are two ways that could happen. Either the true value X i equals one, in which case we will always observe a one, and that happens with probability p sub i; or — and remember, whenever you say "or" in probability theory, you add, you put a plus in — one minus p sub i times f, and that's the other possibility, the other way to go down this probability tree and get to X star equals one. And then for X star equals zero, the only way to reach that is for there to actually be a zero and for there not to be a false positive. So it is one minus p sub i times one minus f: going down the right branch twice. So we take these probability expressions, which again are just expressions of the rules of probability theory, nothing really exotic, just applied to this particular case, and we can program them into a ulam model. And this is where the golem gets grumpy. We have to do this in a particular way, and I want to take a little bit of time in this example to focus on the code, which is not something I've done a lot in this class. I want to do that because, for these sorts of problems, it often pays to focus on the code a bit, because there are different ways to do these calculations and some ways are better than others. So let's start with something that I hope is fairly transparent. In this particular case, we can express these two probability expressions rather directly in the ulam code. The trick for doing this in ulam is that you can put this little pipe next to an outcome variable and say that there's a probability expression that applies only to particular values of it. So here we have X pipe X equals 1, which means the distribution of X, conditional on X equaling 1, is — and then I have this custom thing which just lets you directly inject Stan code into the model, so this is some raw probability function that we're going to write — the logarithm of the expression on the right, that is, p plus one minus p times f. It's the logarithm because Stan does all of its math on the log scale, and that's the way statistics is done nearly always. The reason is that it's more precise. I'll talk a little bit more about this in a moment.
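For reference, written out, the two measurement probabilities that these custom lines have to encode are

$$\Pr(X^*_i = 1) = p_i + (1 - p_i)\,f, \qquad \Pr(X^*_i = 0) = (1 - p_i)(1 - f),$$

where $p_i$ is the generative model's probability of extra-pair paternity for child $i$ and $f$ is the false-positive rate.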
And then for X equals 0, the same sort of line, but with the corresponding function: the log of 1 minus p times 1 minus f. Okay, this model works, it works fine, and we'll look at the results in a little bit. I want to spend the next several slides talking about other ways to code this that are better, because if you get into this business and you start making your own misclassification models, or other models that have these sorts of expressions in them — and there are lots — this way of coding it can be very bad, because it can lead to numerical problems. So there's a better way to do this, and I want to slide you into it slowly. The basic problem is that the way computers do math is crazy, to speak casually. That is, math in the computer is a different system of constructing the real numbers than the way mathematics actually constructs them, the way you learned in secondary school. Computers use something called floating-point arithmetic. Computers can't really represent an infinite number of values, or an infinity of infinities, and the other crazy things that happen in actual mathematics. But they use this very clever system called floating-point arithmetic to approximate all that. And for most calculations, most of the time, you don't have to pay attention to the differences between floating-point and real math, but sometimes you do. In particular, when you're doing probability calculations, those differences can cause some real grief. That's because probability calculations, because they're tightly bounded between zero and one, have a tendency to underflow to zero. That is, if a probability gets really small, close to zero, the computer will often round it to exactly zero, and that's catastrophic in a probability calculation. Think about it: in a very large data set, any particular outcome is extremely unlikely, even if you know the true process that produces it — think about the garden of forking paths from early in the course. And so probabilities for any large data set are often very small, but they're not zero, and it's important to preserve the fact that they're not zero. The other problem is on the other end: if probabilities get close to one, which doesn't happen as often, but it can happen, they may round to one. These underflow and overflow situations destroy everything in your calculations. Often the model won't even run, or worse yet, it'll run and you'll get the wrong answer. The solution is to do every step of the arithmetic on the log scale, because these underflow and overflow problems are much less likely on that scale. That's why we took the logarithms already in the previous code, but we can also work with all of the probabilities on the log scale to begin with, and not even work on the raw probability scale at all. And there are a bunch of specialized functions in probabilistic programming languages like Stan and Julia for doing exactly this. They look a bit odd, ancient and incomprehensible, these things like log_sum_exp and log1m and log1m_exp. But they're indispensable if you do this business for very long; you get to know them, and then they become your friends. So let me show you an example of mixing some of these ancient friends into our code, and we're going to make a better version of the extra-pair paternity model. Same probability expressions at the top here, new code, and I'm going to focus on each of these lines and explain what we've done.
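To see concretely what these functions compute, here is a small, self-contained illustration in plain R. The helper definitions are stand-ins I'm writing here to mirror the Stan functions; they are not part of the lecture code, and Stan's versions are the carefully implemented, numerically stable ones.

```r
# Plain-R stand-ins for Stan's numerical helpers
log1m <- function(x) log1p(-x)          # log(1 - x), accurate when x is tiny
log_sum_exp <- function(a, b) {         # log(exp(a) + exp(b)) without leaving the log scale
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Why the log scale matters: the joint probability of many observations
# underflows to zero as a raw product, but is fine as a sum of logs.
p_each <- rep(0.01, 200)
prod(p_each)        # 0 : catastrophic underflow
sum(log(p_each))    # about -921 : no problem on the log scale

# The two misclassification probabilities, computed both ways
p <- 0.4; f <- 0.05
log(p + (1 - p) * f)                    # direct, risky when the terms are tiny
log_sum_exp(log(p), log1m(p) + log(f))  # same value, all on the log scale
log((1 - p) * (1 - f))                  # Pr(X* = 0), direct
log1m(p) + log1m(f)                     # same value, on the log scale throughout
```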
So I've re-expressed this p plus 1 minus p times f line as a much longer bit of code that now has two weird functions in it: log_sum_exp and log1m. What are these doing? Well, notice first of all that inside this log_sum_exp there are two arguments, two terms, and these are the two terms in the expression at the top. The first one is p — but the logarithm of p, obviously, as it is here. And the second one is this log1m(p) plus log(f), and this is the logarithm of the second term. log1m is a function that returns the logarithm of 1 minus what is inside of it. I'll say that again: log1m is a function that returns the logarithm of 1 minus whatever is inside of it. This function is nice because it does that calculation in a way that is much less likely to produce a catastrophic loss of precision, which in this case is the danger, right? If p is a very small number and you do 1 minus it, the result might round to exactly 1, and that would be very bad. So this is a nice little tool, one of these friends that we use all the time in this business. And then we add the log of f. Why do we add? Because on the log scale, multiplication is addition. And then what does log_sum_exp do? This is a workhorse tool in this business. If you've got a long probability expression where different products are added together — because they're or's, and remember, in probability theory when you say "or", you add — what you want to do for numerical stability is compute the log of each of the terms that are separated by pluses, and then combine them all using log_sum_exp. What is log_sum_exp? It returns the log of the sum of the exponentials of each thing inside of it. That means that if we have log probabilities inside of it, which we do, exponentiating each term takes it off the log scale onto the probability scale again, the terms are added, which does the calculation that's at the top of the slide, and then we take the log of the whole thing. And that is what we want to add to the target, which gives us the surface that the Hamiltonian simulation is going to cruise around on. I know this is weird, but there's really no avoiding this stuff, and after you work on a project that uses it, it seems a lot less weird, actually. But numerically, it's much, much more stable, and there are many projects which you just can't successfully complete unless you're willing to program the model this way. The other term, for X equals zero: same logic, but it's a simpler expression. We just have one term on the log scale. It is log1m(p), that is, the log of 1 minus p, plus log1m(f), the log of 1 minus f. And this model will run too, and it gives you the same answer as the previous one, because in this example you can get away with being sloppy. But I want to show you the right way to do it in case you need it. We can go one step further. I want to show you that we can gradually take everything off the raw probability scale and just do all the calculations, from the very beginning, on the log scale. That often helps. So another thing I've done here is take the logit part of this model and do it on the log scale from the start. In all the calculations for the X's, we don't need p, we need the log of p — so why are we computing p? Let's compute the log of p. So this is what I've done now: I've changed the linear model so that it is the log of p.
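Two more helpers show up in this fully log-scale version, log_inv_logit and log1m_exp. To see what they compute, here are plain-R stand-ins, again my own definitions for illustration; Stan's built-in versions are the numerically stable implementations you would actually rely on.

```r
# Plain-R stand-ins for the remaining Stan helpers (illustrative only)
log_inv_logit <- function(x) -log1p(exp(-x))   # log of inv_logit(x), staying on the log scale
log1m_exp <- function(la) log1p(-exp(la))      # log(1 - exp(la)), i.e. log(1 - a) from log(a)

a_lin <- -3.2                                  # some value of the linear predictor
log_p <- log_inv_logit(a_lin)                  # log probability, computed directly
log(plogis(a_lin))                             # same value, via the probability scale
log1m_exp(log_p)                               # log(1 - p), computed from log(p)
log(1 - plogis(a_lin))                         # same value, the round-trip way
```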
And then the inverse link now is log_inv_logit, which is a function that Stan provides. It's like the inverse logit, but it doesn't go all the way to the probability scale, so it's more numerically stable than going to the probability scale and then taking the log. And then it's just a linear model inside of it, as before, nothing too surprising there. Then, in the custom expressions for X conditional on X equals 1 and X conditional on X equals 0, I'm now including log p instead of p. And log1m has now been changed to log1m_exp, which means we get the logarithm of 1 minus the exponential of what's inside of it. Maybe this sounds like a lot of extra operations, but actually it's not. This function does the calculation in a way that uses the fewest operations possible and is the most numerically stable. So it's better to do this than the other options that I've shown you in this lecture. Okay, all three of these different varieties of the model will produce almost identical posterior distributions, because in this example you don't get underflow or overflow. But sometimes you do, so it's good to know how to deal with it in case it happens to you. So here are the posterior distributions for the misclassification model, shown in red, and a version of it that ignores error, which I haven't shown you, but I'm confident you could code that version on your own — the code for it is in the script for this week. The thing that we're estimating here is the probability of extra-pair paternity. This is per child, and it's clustered by women and dyads. And you'll see that it's around 50%, a little bit below on average, but that there's a wide range. So in this particular sample of Himba, it can range anywhere between about 0.3 and about 0.7 in the high-density region of the posterior. And you'll see that the effect of incorporating misclassification has been to reduce the estimate a little bit. This matters, because we want to get the right answer, and we want to do it in a transparent and justified way. This is very high compared to comparable estimates in Europe, by the way, where they're always below 5%. But this is socially sanctioned among the Himba; we were expecting this. At least the anthropologists were — maybe it's surprising to others. Okay, this has been a lot. This is a conceptually deep lecture, I appreciate that. It would be a good time again to review and look at the bits that might remain confusing, and to run the examples and play with them a bit to help fold in some of the understanding. Measurement problems are extremely common in real research, and there are lots of related problems that are about measurement which may not look exactly like the examples in this lecture. So in closing, I just want to mention some of them. Lots of statistical tasks involve ratings and assessments. You have judges, who are scoring wines or students or dancers or teachers, and you have tests of students or job applicants or any number of other things. And judges and tests are noisy instruments. They don't exactly measure the thing of interest. The thing of interest typically cannot really be observed, whether it's some talent or knowledge, how much potential somebody has, or how good the wine will be for some generalized audience that might consume it. There's a family of statistical models that are really just deployed industrially now.
They're successful workhorse sorts of devices, a fleet of deployed golems. They're known as item response theory models, which are actually a special case of a broad area known as factor analysis. And these models are all explicitly measurement models. They posit that the thing we really want to know is unobserved, and they try to estimate how different measurements related to it give us information about it. So they're simultaneously trying to construct a model of the measurements themselves, so that they can get better measurements of the thing of interest. There are also models called hurdle models. In a hurdle model there's something you're trying to measure, but if it isn't present in high enough concentration, then you can't detect it at all. That doesn't mean it's not there, so there's a kind of noise for small values: the zeros may not really be zeros in hurdle models. And typically in these sorts of models, you need to model the zeros and the non-zeros simultaneously, using separate distributions, so they're measurement models too. Occupancy models are workhorse models in parts of ecology, but also in other fields. Often, just because you didn't detect something, that's not a good reason to assume it isn't there. We can approach this in a generative, statistical way, and there's a big family of models known as occupancy models which do that. Okay, this has been Lecture 17. We're in Week 9 of Statistical Rethinking 2022. In the next lecture, we're going to extend the concepts here into missing data. Missing data is a particular kind of measurement error, if you will, and so we'll carry on and build on what we've learned in this lecture to see how to deal with that as well. And I'll see you there.