I don't know. All right, we've got to do some good things today. All right, let's bring this train to the station so we can all get off. Welcome back, everyone. We're going to resume where we left off. In the measurement error unit, we had just reached putting an error on a predictor variable. We finished putting it on the outcome, and that was only a minor nuance on things we've done before. In a sense, we've always been implicitly assuming there's error on outcomes, or at least it was smuggled into the model in a very epistemological way. The nuance was that the error was heterogeneous across the cases, so we needed to do something special to deal with that. What I hope I convinced you of is that if you state what you know about the error on each case, you can just let the model figure out the implications. It's a wonderful thing, provided you've got a Markov chain, or some other algorithm, to figure out the implications for you.

Now we're going to continue with this and put error on a predictor variable simultaneously with the error on the outcome. This is a longstanding problem that scientists have worried about for a long time, so there are a bunch of procedures that have been used. For example, in economics there's this thing called errors-in-variables, which I admit I don't completely understand, because in a lot of those literatures you can't get the model definition. There's some procedure described, and I can't ever figure out what the model is. That is my limitation, perhaps: I need to see a model to understand what the assumptions are. Very old, there's this thing called reduced major axis regression, which I would advise you not to use, because it's super vague about how it works. And also, I think, total least squares. These are all attempts to deal with different kinds of error structures in the data. But all of these procedures are procedures. They are inventive ways of dealing vaguely with the possibility of error on predictors as well, but they don't let you input the information you have about the predictor and outcome variables simultaneously.

Our approach will instead be, as I keep saying, logical. That doesn't mean it's guaranteed to give you the right answer. It's still garbage in, garbage out. So we're going to put in something that is hopefully not garbage. We may still get garbage out the other side, but I think in this case we can do better than that. As I keep saying about logic: logic is nice because it reveals the implications of your assumptions, but it doesn't guarantee that your assumptions are good. So we're going to state the information we have about the error on predictors simultaneously with the error on the outcome, and then we'll let the laws of probability deduce the implications.

So here's the new model. I'll explain this, as I usually do, bit by bit, so you can see what's going on. The first thing to note, and I'll take the static, fixed priors off the slide so I can do some labeling: this is the measurement error model from before for the divorce data. The only thing I've added is the bottom line. Now we have a distributional assumption for the predictor, capital R, which in this case is the marriage rate in each state i. These are all measured with error for exactly the same reason that the divorce rates are: they're taken from samples. Large states have less error in the measurement; small states have more.
And that error is expressed in the reported standard error that came from the Census Bureau, which is in the data set that I collected. The way you can think about this: up there in the second line of this model is the ordinary linear model, the old regression from back in chapter four or so. But what's in the regression model is not the data. These are the parameters, the estimates. Those are distributions that sit in the linear model. That means we're not sure of the values, we want to integrate over the uncertainty, and the model is going to do that for us; rather, the Markov chain will do it using the axioms of probability. The third line is what we did on Tuesday, which states our knowledge about the error on each observed value i. We imagine it's sampled from a Gaussian distribution with some unknown mean, which is the true divorce rate in each state (that's what we want to know), and a known standard error for each one. So we get a posterior distribution for each of the Ds, and that's the outcome variable up there at the top. And now we've added a symmetrical line for the predictor variable. Same idea: we imagine that what we observed as the marriage rate in each state is a sample from a Gaussian distribution with an unknown mean, the true value, and the standard error that has been reported. Does this make sense? There's really no concept that's new here. We just keep going; we're going to keep adding more turtles, right? Yeah, okay.

The code is unsurprising: you just add the extra line. That's all you do. It's in the book, and there's a sketch of it below, and I encourage you to run these things yourself.

And here's what you get out of it. I want to show you that we get two-dimensional shrinkage, our friend shrinkage, now in two dimensions. Why? Because we've got posterior distributions for two vectors of parameters: the posterior distribution of divorce rate plotted on the vertical, and the posterior distribution of marriage rate on the horizontal, for each state. So these are pairs of points. The filled blue points are the raw observed estimates, what was reported as observed, what we used as data back in chapter five. The open circles are the posterior means that we get from this model. There's also uncertainty around each of those; you could plot that too, but it would make the graph pretty complicated. And the lines connect the two points for each state.

What I think you can see here: remember, marriage rate and divorce rate are only very weakly associated in this model, especially because median age of marriage is driving most of it. So there is joint information about them, mainly through median age of marriage, because there's a correlation between marriage rate and median age of marriage across the states. What happens is there's this kind of invisible regression line in this graph that is creating gravity, shrinking the estimates towards it. Most of the shrinkage is the divorce rate being shrunk towards that regression line. There's a weak correlation, so the lines are angled; marriage rate is moving as well, but not as much. Why? Because, accounting for median age of marriage, marriage rate and divorce rate aren't terribly correlated.
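For reference, here's roughly what the map2stan code looks like. Treat it as a sketch: I'm writing it from memory, so check the book for the exact priors and sampler settings. The variable names come from the WaffleDivorce data frame in the rethinking package.

```r
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce

# observed rates and their reported standard errors
dlist <- list(
    div_obs = d$Divorce,
    div_sd  = d$Divorce.SE,
    mar_obs = d$Marriage,
    mar_sd  = d$Marriage.SE,
    A       = d$MedianAgeMarriage
)

m_err <- map2stan(
    alist(
        # regression uses the estimated (true) rates, not the observed ones
        div_est ~ dnorm(mu, sigma),
        mu <- a + bA*A + bR*mar_est[i],
        # measurement sub-models: observed value ~ true value with known error
        div_obs ~ dnorm(div_est, div_sd),
        mar_obs ~ dnorm(mar_est, mar_sd),
        a ~ dnorm(0, 10),
        bA ~ dnorm(0, 10),
        bR ~ dnorm(0, 10),
        sigma ~ dcauchy(0, 2.5)
    ),
    data = dlist,
    start = list(div_est = dlist$div_obs, mar_est = dlist$mar_obs),
    WAIC = FALSE, iter = 5000, warmup = 1000, chains = 2,
    control = list(adapt_delta = 0.95)
)
precis(m_err, depth = 2)
```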
So most of the shrinkage is in the vertical dimension. I'll say that again, because there are too many marriage and divorce words going on here. Remember, this is a regression with three variables in it. The outcome variable is divorce rate, and the predictor variables are marriage rate and median age of marriage. What we learned before is that once you know the median age of marriage in a state, learning its marriage rate tells you very little. Marriage rate and divorce rate are correlated, but mainly because there's a correlation between marriage rate and median age of marriage; that's what's driving it. So what we got before, before we put error on marriage rate, was shrinkage of divorce rate towards the regression line, and we're seeing that again. We're also getting a tiny bit of shrinkage in marriage rate, but very little. Notice that the shrinkage is almost entirely vertical, in the divorce rate dimension. The reason, as I'm trying to explain, is that there's not much relationship between these two variables once you've accounted for median age of marriage.

Here's an exercise for the student: take median age of marriage out of this model. The shrinkage plot will look different, because now there will be a correlation, since you don't have the median age of marriage information, and you'll get much more shrinkage in the other dimension as well. Does this make some sense? The possibility of shrinkage doesn't mean you always get it. It depends upon the correlations among the different dimensions; remember the rho that we spent all that time estimating last week. We don't get much shrinkage of marriage rate here because, after accounting for median age of marriage, there's very little relationship between these two axes. But there's a lot of shrinkage in divorce rate, because there's a strong relationship between median age of marriage and divorce rate. So the regression improves those estimates; it induces shrinkage in the posterior distributions of divorce rate. Much less for marriage rate, because the relationship between these variables tells us very little about the true marriage rates, so they don't move as much. Yeah?

Can the shrinkage move an estimate outside of the observed range? I'm not sure I understand the question. When we just had measurement error, some of the points ended up outside of where the measured standard error was; how do you interpret that? It can happen, and I don't have an intuition that it shouldn't. The standard error is just the estimate of where the thing was, but the regression equation also contains information that flows back into those estimates. For estimates with big standard errors, they can move a lot, because there can be a lot more information in all the other 49 states that pushes them quite far, given that state's predictor values. The regression equation can tug the estimate really far from where it started. So the measured standard error doesn't constrain the whole range of where the thing can move. It's information that goes in, and then the implication comes out. And I can tell this is great, because it's counterintuitive to you, which is nice; different things are counterintuitive to different people. It actually can move outside that range, and it's just a consequence of the logic that information is also flowing out of the regression line.
The information provided by the predictors can move these estimates out of where we thought they were bounded in the naive estimate, before we accounted for the overall relationship, across all 50 states, between the two predictor variables and the outcome. And then we realize there's no way that, say, Vermont has that divorce rate. Which I guess is a bad example in this case, because Vermont has a low standard error and doesn't move much, I think. But take Utah or something. Yeah, there was one point that had a low median age and a low divorce rate. Exactly. And the big shrinkage is the model saying it's very skeptical of that value.

Now, what you're not seeing on here is that these are posterior distributions, and usually the ones that move the farthest have the widest posterior distributions. You should run this for yourself and check. I think you'll see that's what I always find when I run these models: the things that shrink the most often have the widest posterior distributions, because that's exactly why they shrink the most. There's the most uncertainty about them, and then the gravity of the regression line is strongest on them. Does that make some sense? So don't get too excited by these posterior means. I actually plotted this with little ellipses around all the points, because there's a two-dimensional Gaussian uncertainty distribution on each state, and it was unreadable. I almost posted it; there's a Twitter feed for accidental R art, you know what I'm talking about? It's great, actually. I almost posted mine there. I'd like to put some colors on it first, so I might revise it and put it up. It was beautiful, but useless. But yeah, this plot doesn't show the uncertainty, and the ones that shrink the most almost certainly have the widest posterior distributions. Which overlap the original standard error, I'm sure.

Okay, other questions? Is this cool? Yes? This is easy to do; there's no reason not to do it. When you've got estimates of measurement error, you can incorporate them quite easily into these models. The new plot in the bottom right of the slide is just showing you the shrinkage in marriage rate, with marriage rate standard error on the horizontal. Obviously the ones that had big standard errors have shrunk more; there's been more information coming from the regression line. The overall scale on the vertical, the amount of shrinkage, is a lot smaller than it was for divorce rate, and I'll ask you to do that comparison yourself. That's because there's less relationship, less information about marriage rate, since it's less related to the outcome after accounting for median age of marriage.

Any questions about measurement error at this point? Yeah, Cody. Does identifiability come into this? I had a measurement error model for months, and I was struggling to figure out why it wouldn't converge, because all of the data had measurement error, so there was a whole set of parameters being multiplied by another parameter. You can actually get multiple modes from that, and it was really hard to figure out. So how can you tell when doing this is going to improve inference, and when it might just lead you down endless rabbit holes?
Well, if you fit your model for months, you're probably in a rabbit hole. My general advice would be, as I told you before: in principle, in Bayesian inference everything is identifiable, because you can use prior information to get rid of those ridges. Regularization saves your bacon. Can I just copyright that? That would be so good. And I like that: regularization saves your bacon. But if you don't have good prior information to constrain the coefficients, for example, then yeah, you could be sunk. And I'm going to make the point later, hopefully I get to it, that this business is just logic. Given the information you have and the model you're interested in, there's no guarantee that the logic will tell you something useful. And actually I like that about Bayesian inference: if you get the prior back, or you get multiple modes, that's useful. It's telling you that this information and this model are insufficient to learn what you'd like to know. And that's an advance, usually, I think. There are lots of procedures that will always give you an answer, and so I like the ability of a procedure to give you the big shrug. The flat posterior distribution is a nice thing. I've had problems like that too, and usually it tells me that the design won't work; you can't possibly learn what you want from it. These days I'm better about doing power analysis before I collect data, so I can see things a little bit ahead of time, but my dissertation had a problem like this, and it happens, right? Your design can't tell you what you want to know. It happens to a lot of people, I think. Okay, anything else?

All right, let me try to summarize measurement error. This is a very common issue, and let me try to generalize it a little bit beyond the strict definition we've had here. It's routinely the case that there's some uncertainty about the data, the number we're putting in there, but people condense it down to a single number because that's what the procedure they're using requires. If your software requires a specific value, you will create one, or one will be created for you. We can do better than that now; there's no reason to do that. One of the reasons we should worry about this is that ignoring uncertainty leads to overconfidence, and it can lead to overconfidence in every direction. I think usually it leads to false results, false positives, and I have something to say about that later today, but it can also lead you into false negatives. It can make you overconfident about something not happening as well. So it's just due diligence to try to put all the information you have into the model, and Bayesian inference makes this much easier than other procedures do.

So let me give you some commonplace examples. One that I mentioned in the introductory chapter, as one of the virtues of multilevel models (and these measurement error models are examples of multilevel models, you probably caught that, they look a lot like the others): people use averages to do predictions. Say you measure the body mass of three female rhesus macaques. You average that and use the average as data in your model. Why not instead use a sub-model that computes the posterior distribution of the average body mass, and then put that posterior distribution in the regression model? Fit both models simultaneously.
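A minimal sketch of that idea, with made-up variable names (mass_obs is the observed average of the few measured individuals for each species, mass_se is the standard error of that average): structurally it's the same trick as the divorce model, just with the observed average standing in for the reported rate.

```r
library(rethinking)

# hypothetical data: one row per species, with a noisy average body mass
dlist <- list(
    kcal     = d$kcal,           # outcome per species (hypothetical name)
    mass_obs = d$mass_mean_obs,  # observed average of, say, three individuals
    mass_se  = d$mass_se         # its standard error, sd/sqrt(n)
)

m_avg <- map2stan(
    alist(
        # higher-level regression uses the unknown true mean mass
        kcal ~ dnorm(mu, sigma),
        mu <- a + b*mass_est[i],
        # sub-model: the observed average is a noisy estimate of the true mean
        mass_obs ~ dnorm(mass_est, mass_se),
        a ~ dnorm(0, 10),
        b ~ dnorm(0, 10),
        sigma ~ dcauchy(0, 2)
    ),
    data = dlist,
    start = list(mass_est = dlist$mass_obs)
)
```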
That kind of sub-model is a measurement error model much like the divorce one. You're fitting two regressions at once, so that you can use the whole posterior distribution, all the information, in the higher-level inference. This is what we call propagating uncertainty forward, and it's a great thing to do. You don't lose anything: you carry forward all the imprecision in your knowledge, and all the sample size information, the fact that the average was computed from three individuals and not 300. It's in the model already, and it will automatically be taken account of by the logic of probability theory.

DNA sequence data: this is a famous issue in that business. I think they've got a handle on it, but it's a big headache. You have to respect the error rate in sequencing. This is a horrible issue, and for many of the most interesting problems, say studying mutation rates (I know Jonathan Eisen used to talk about this stuff, it's sort of fascinating), the error rate of the sequencing machine is on the same order as the expected mutation rate. So conditional on getting a hit on a mutation, the probability is only about a half that it's actually a mutation. It's terrifying, isn't it? To realize that yourself requires cleverness, but the model can just figure it out for you if you put the sequencer error rate into the model. It will do that calculation for you. Does that make some sense? Now the sequencers are getting better all the time, but this is why with genome studies you get these 30X things. You see this now: I just sequenced something at 30X. What does that mean? It sounds impressive, right? It just means 30 times. Why? Because the same error is very unlikely to happen at the same site in multiple scans, and that's why you do it. For a whole genome study you can get a lot of errors, so you need a lot of repeat sequencing, and that's what makes it expensive. I keep joking that in ten years we'll have things on the sides of toasters on our kitchen tables that can sequence any human DNA in the world. Why not sequence with my toaster? The prices have come down so much.

Parentage analysis has the same problem. There's a probability distribution over possible parents, rather than a certain assignment. Rarely are things perfectly diagnostic in these studies. Phylogenetics: often people create some consensus tree and then plug that into later analyses to look at the evolution of traits. This is cheating, and lots of people in the phylogenetics community are aware of it, but it's hard to get software that will let you use all the trees. You have a posterior distribution of trees; that's what you have. We don't know the true history of the species. We have an inferred history, and we'd like to propagate that uncertainty forward, and in some cases that's really important. People have long known about this problem, but it's a pain in the butt, it really is. But you can imagine repeating the analysis for every tree in the posterior distribution and then weighting the results by those posterior probabilities; that's essentially what happens.
Even better would be to estimate the phylogeny simultaneously with whatever you want to use the phylogeny to make inferences about. Then information flows in both directions, and the evolutionary relationships among the traits on the tree can inform your inference about the evolutionary history of the species at the same time. There are people who've done analyses like that as well. I'm not saying it's easy, but we know how to do it. We just need better software tools that make it easier for people to do.

And I've already talked many times about archaeology, paleontology, and forensics. There are all kinds of measurement error; in fact, it's the norm. Everything is measured with error. I think the forensics people are the furthest behind on this, actually, because there's this illusion of lab precision in all these TV shows like CSI, which make it look amazing. Zoom in on the license plate, get a piece of hair: this was a male Caucasian, age 32, he smoked. That's the way it is on CSI. It's a lot worse than that, and signal detection is a terrible, terrible business. But we know how it works logically, and we can do a lot better if we respect the fact that there's uncertainty in identification. The whole goal is to propagate uncertainty instead of throwing it away. That's the idea. Make sense? All right, that's enough of my sermon. I know this class has been a long series of sermons, but they're secular sermons, so hopefully they're non-offensive. Okay.

All right, let's switch to a related issue and go one meta level further. Measurement error is easy enough to understand. We've got an observation, we know that the true value is somewhere near it, we get information from it, and we've got some guess about the error in measurement. So it's not so weird to think that we can use that error information to calibrate our inference, to down-weight the things that are measured with high error. But what about the case where there are variables with values that are just missing completely? We don't have data at all. What can be done? The answer is: a lot. A lot can be done, if you're willing to model the predictors, or more generally the variables for which there are missing values. And usually you can. In fact, what I want to show you is that the same assumptions required for what people nearly always do, which is called complete case analysis, also let you do better. Complete case analysis: you drop any cases, any rows in your data, for which there are missing values in the predictors you want to use. Or the outcome. The outcome is a special case, because then it's just a prediction problem, so we'll ignore missing values in the outcome for now. That's just posterior prediction; you know how to do that. What if there are missing values in predictors? Usually what people do is just drop all those cases. If there's a single missing value in any of the predictor variables, that whole row goes away. And that is sad. An angel cries, right? Every data point is precious, Jesus tells me so. Right. And I have a dear Catholic friend who sings that song, actually. So, complete case analysis: the same assumptions that allow you to get unbiased estimates of the relationships between the predictors and the outcome also allow you to do what I'm going to show you today, which is Bayesian imputation, where we infer the missing values.
And that means we get to use the observed values on all the other predictors where they're present, and we get more power. We're not throwing away data. Every data point is precious.

There are a lot of alternative methods here, and some of them are perfectly fine and others aren't. Sometimes what people do is replace the missing value with the mean of that column of data. Never, ever do this. Oh my God, no. This is bad because there's now a specific value in those cells with no uncertainty around it. The model thinks that, yes, it's exactly the mean. That creates a lot of false conservatism in the estimate. It makes the relationship between that predictor and the outcome look a lot weaker than it would otherwise, because variation has been removed. It could be better to drop those cases than to do this, absolutely. Multiple imputation is a Bayesian-inspired but non-Bayesian algorithm, mainly associated with Don Rubin, who's a contemporary and quite famous Bayesian statistician at Harvard. I say the procedure is non-Bayesian, but the math that inspired it was, because when Don invented this procedure it was pretty hard to run Markov chains on desktops. In multiple imputation you use a distributional model of the variable to create multiple data sets, you run the analysis on all of them, and then you combine the results. It works really well, but it has a drawback compared to what we're going to do, which I'll talk about later. We're going to do Bayesian imputation, which means we make assumptions about the distribution of the predictor and we let the model figure out the implications. And then there are lots of other approaches as well; this is a common problem people worry about.

I would say this word, impute, is weird. It's probably not a word that rolls off your tongue; it's kind of a legal term. I consulted my dictionary, and it told me that impute means to represent something as being done, caused, or possessed by someone: for example, the crimes imputed to Richard. That's what my dictionary says. And I learned it in finance, in an econ class when I was an undergrad: you assign a value to something by inference from the value of the products or processes to which it contributes. That's what we're doing here. There's information in the other cases which helps us guess the missing values, and all the other variables help us guess those values through what we're assuming about the regression model. The theology one is great too: Christ's righteousness is imputed to us through his relationship to God. So that will help you understand the weirdness of the word impute.

Okay, let's return to the milk energy example, so I don't have to teach you another data set. There are missing values in the neocortex column. I'm showing you the three variables we're going to work with on the right-hand side of this slide, and there's a bunch of NAs in the neocortex column. Those are species for which we don't have measurements of the proportion of the brain that is neocortex. We dropped all of those before, and as you can see, that's about half the data. There are 12 NAs in this data set, and there are only 29 rows. So that's sad. Really sad. And the sad thing about it is that we lose a bunch of perfectly good values in the other two columns. So now we'd like to do the imputation approach, so we can use all the data.
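If you want to look at the missingness yourself, here's a quick sketch with the milk data from the rethinking package (the column names are the ones in that data frame):

```r
library(rethinking)
data(milk)
d <- milk

# rescale neocortex percent to a proportion, and log the body mass
d$neocortex.prop <- d$neocortex.perc / 100
d$logmass <- log(d$mass)

# 12 of the 29 rows are missing the neocortex measurement
sum(is.na(d$neocortex.prop))
d[ , c("kcal.per.g", "neocortex.prop", "mass")]
```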
What we're going to do is what's called a missing completely at random, or MCAR, analysis. MCAR assumes that the missing values are sprinkled around just randomly. It's like your lab assistant got drunk and deleted a few entries in the column, right? Then you fire them, and you sit down and do an imputation to save your project. You can do better than this if you're willing to make more assumptions, but if you're willing to do complete case analysis, you should be willing to do this as well. The distribution of the observed values does provide information about the plausible range of the missing values, though. Think about it: you already know this. I bet you none of those neocortex percents are like five, right? Because these are all primates, and all of these percents are in the 50s and 60s, with a few in the 70s. Down at the bottom, in the apes, we get into the 70s. At the top we've got prosimians, strepsirrhines, and they're in the 50s. So there's plain information in the observed distribution. But what we have to do is model the predictor.

So, I already picked on the undergrad assistant, right? Your undergrad assistant lost those neocortex values. Consider just one neocortex value: what is your best guess of each missing value? You might say that if we had a prior for each of the missing values, it would be the posterior distribution of the mean of the observed values. So there's a Gaussian distribution for this variable, and that's our prior for each of the missing values. We're going to state that prior, and we're going to estimate it from the data. We're going to create parameters for the mean and standard deviation of this column, of this variable, put them into a Gaussian distribution, and state that as the prior for each of the missing values. Then we let logic figure things out. What you're going to see is that the imputed values will move away from that prior, because the regression equation contains information about each case: this variable is related to the outcome, so the posterior distribution gets updated by the regression. That's our goal. Conceptually, does that make sense? Yeah, for the moment. I'll show you the model in a minute.

Mechanically, what we need to do is replace every missing value with a parameter. In Bayesian inference, the difference between a datum and a parameter is that a datum is a parameter where all the probability mass piles up on a single value. It has no adjustability, and you can't learn anything extra about it. It's like a maximally strong prior. Now we have something more wishy-washy: we need posterior distributions for each of the missing values, so we make a parameter for each. These parameters are pooled by having a common prior, so there's going to be a lot of shrinkage. Don't get too excited thinking this adds 12 degrees of freedom to your model. Degrees of freedom are a classical early 20th century concept. They work for lots of classical statistical procedures, but they don't generalize beyond that. Now we're in shrinkage land, and parameter counts don't tell you how flexible the model is anymore. Not once you're in multilevel models, and once you have priors, it doesn't work that way. And it's not just a feature of Bayesian inference; the same is true for lots of non-Bayesian machine learning methods, neural networks, all those things.
There's a ton of regularization going on, and counting how many adjustable bits are in the system doesn't tell you how flexible the system is. They mutually constrain one another in complex ways. So we're going to place a unique parameter at each missing value, and these are the things that will be imputed from the properties of the other elements in the model.

You can think about it on this slide. There's this vector, capital N, which holds the neocortex proportion values. I've rescaled them so they're proportions; it's easier to work with that way. And this vector is now a mix of data and parameters. Where it's been measured, there's an actual number: for the first species in the data table, there's 55% neocortex. Then there are three missing values, so there are three parameters we need at positions two, three, and four. And then we've got some more measured values all the way to the end of the table. There are 12 parameters in that vector, and 29 minus 12 data values; somebody can do the arithmetic.

And this is the model. I can help you see it if I take all the fixed priors out for a second. At the top is just our old linear regression. We're going to predict kilocalories of milk for each species i through this linear model of the mean. It's the same linear model as before, but now neocortex, N_i, is a mix of observed values and parameters. What does that mean? When it's an observed value, it's the same thing as before: when you make a prediction for that case, you just plug in the observed value, multiply by the coefficients, and you get a prediction. What about when it's a parameter, a distribution? That means you don't know the value of the data, and so the prediction must average over the uncertainty in the input value. That would mean taking an integral if you were doing this by hand. It means doing an integral here too, and that's what your Markov chain loves to do. It loves doing integrals; it makes it happy. When the fans turn on on my computer and it starts whirring, I think of it as the Markov chain purring. It's saying thank you, thank you for this difficult integral. Right? Does this make some sense? So it's a case-by-case thing: as the model considers each case i, there's a different integration required, depending on whether it sees a distribution or just a data value. Technically it's the same operation, but with more or less uncertainty. For an observed value, there's no uncertainty to average over; all the probability mass is on one value. As a posterior distribution widens, there's more to average over. You can imagine taking samples from the posterior distribution of an imputed value, computing the linear model for each of them, and then you get a distribution of linear model values. That's what the Markov chain does, and that's what you get. And that automatically propagates the uncertainty into the regression relationship. Yeah? As long as you're willing to hang on to the roller coaster a little bit longer here.

So then we have the combined likelihood and prior for this vector, which is a mix of data and parameters. When N_i is an observed value, you can interpret this line as a likelihood, because it's the probability of an observation, the probability of data, conditional on some parameters.
And nu there, that weird-looking v, that's nu, and sigma_N: those are parameters we're going to estimate, which describe the distribution of neocortex proportion across the whole data set. So that's the prior. We're going to learn it from the data, train it on the present values; that's the only information we've got. If this were a simple regression, you'd just be estimating the mean and standard deviation from the observed values, but you're getting a posterior distribution for them that respects the sample size, the amount of data you have. When N_i is a parameter instead, this line is a prior. But it's a flexible prior, an adaptive prior, because it's a prior that has parameters inside it. So there will be shrinkage, and other things will happen as a result. You don't have to anticipate any of it; you just have to state this relationship, the information you have. You with me? I understand this is pretty meta stuff, so if you're not with me, you can scream about it, and that's perfectly fine. I don't expect this to be perfectly crisp. Your computer understands it, but unfortunately your computer gives you the answer, as always, in the least convenient format, a posterior distribution. It's not the most useful form of answer we might want.

To fit this model, as you might expect, you can do it directly in map2stan. The reason is that I spent some time making map2stan detect missing values in predictor variables. If it finds them, it then looks for a prior, some distributional assumption placed on that predictor. If it finds one, it goes ahead and constructs that mixed vector, replaces all the missing values with parameters, and keeps on trucking. If it doesn't find one, it gives you an error and says: hey, what are you trying to do? You can't fit a model with missing values in it. So either take that variable out or add a distributional assumption for it. Stan will not automatically do this for you. If you want to see algorithmically what's actually happening here, you can run this code and then, as always, type stancode on the map2stan fit and it will show you the raw Stan code. It really just builds that vector by replacing missing values with parameter names. That's all it does: it makes a vector of parameters, one for each missing value, and sticks them in the right places with a loop. And then the model looks exactly the same, because Stan doesn't care. Remember, data is just a special case of a distribution. You with me? It's looking for NA, right? Exactly, it's looking for R's NA values in the input. And it has to intercept them before Stan sees them, because if Stan sees an NA, it will just refuse: no, no, no, I can't do anything with this. Which is what you want it to do, because it doesn't understand them. So there's this abstraction layer in between the two where I try to do housekeeping like this, and that's what it does. This reproduces what BUGS and JAGS do automatically: if you give them a variable that has missing values, it's exactly the same Bayesian procedure that gets used, but it's automated inside those tools. With JAGS and BUGS it's a little easier, because they're interpreted engines, and that's also why they run so much slower than Stan.
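The map2stan code for this naive imputation model is roughly the following sketch (again from memory, so check the book for the exact priors): map2stan sees the NAs in neocortex, finds the dnorm(nu, sigma_N) line attached to that variable, and builds the mixed vector of data and parameters for you.

```r
library(rethinking)
data(milk)
d <- milk
d$neocortex.prop <- d$neocortex.perc / 100
d$logmass <- log(d$mass)

data_list <- list(
    kcal      = d$kcal.per.g,
    neocortex = d$neocortex.prop,   # contains 12 NAs
    logmass   = d$logmass
)

m_impute1 <- map2stan(
    alist(
        kcal ~ dnorm(mu, sigma),
        mu <- a + bN*neocortex + bM*logmass,
        # distributional assumption on the predictor: prior for the missing values
        neocortex ~ dnorm(nu, sigma_N),
        a ~ dnorm(0, 100),
        c(bN, bM) ~ dnorm(0, 10),
        nu ~ dnorm(0.5, 1),
        sigma_N ~ dcauchy(0, 1),
        sigma ~ dcauchy(0, 1)
    ),
    data = data_list, iter = 1e4, chains = 2
)

stancode(m_impute1)   # inspect the raw Stan code it generated
```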
Stan could do this automatically too. I think they'll get around to it, but they don't like to automate stuff so much. Okay. You fit this, and this model mixes amazingly well. You'd think it would be a monster, but everything's Gaussian, so this is no sweat for Stan. It's like: are you kidding me? Give me a challenge. The fans don't even turn on on my computer, so it's not happy; no purring.

The first thing to say is that in this particular case, the consequence of using all the data is that you get reduced estimates of the relationships between both predictors and the outcome. The cases that you were dropping are less strongly related to the outcome than the ones that are complete. So I think there's value in using all the cases. There are still positive relationships, but they've been moderated a lot by this. And I think the reason is that the data are more complete for apes than for all the other clades. Why? Because of egocentrism, right? Some kind of species-ism. We're apes, and so we measure apes a lot. Every ape has been measured to death. You can publish a whole paper on one chimpanzee. The primatologists will back me up on this: if it's a great ape, you can do anything, right? Get it to paint: that's an instant publication. If you're studying a prosimian, it's kind of harder. Anyway, we spend a lot of time in the anthropology department ragging on ourselves; it's kind of a hobby.

So I assume if you're using the standard regression tools, you can't use those incomplete cases? It'll drop them. It automatically does complete case analysis. And I think that's bad news. I think the black box tools should refuse to work when there are missing values in any of the predictors you put in, and make you drop the cases yourself. I talked about this when we talked about information criteria. This horrible thing happens where you've got a model comparison set, and one model has another predictor in it, but that predictor has missing values. You didn't quite notice, because you're busy. And according to the information criteria, support for that model is by far the best; it's the slam-dunk model. But the reason is that half the data were dropped, so there's less to predict, and the deviance is always smaller when you predict fewer things. It's like moving the target really close to you: yep, I hit it. That doesn't tell you anything, but the structure of the model seems better. For model comparison using information criteria, you must always use the same number of cases. It's harder to predict more cases, and deviances are scaled that way; they're sums of the error across all the observations. So you have to be careful about this. I think it's really bad news that the standard tools in R automatically and invisibly drop cases with missing values. It's really terrible, and I bet there are a whole lot of wrong papers published as a consequence. I've caught this in things I've reviewed, where it's happened in AIC tables. I've only caught some, so you guys start policing your neighborhoods too. It's an innocent mistake, so I don't think we should blame anyone it happens to. The software should be more annoying, in a sense. Was that a hand? Yeah. Given that you speculated that the apes are measured more, then this is not missing at random? That's what I think. So then, well, not missing completely at random.
There's another category, sadly, called missing at random. So there's missing completely at random, and then there's missing at random, where the missingness can be related to other variables you've observed. But in general, it's better not to fixate on these categories, and instead think about how you believe the missingness arises, and model that. There are a lot of missing data models where you do exactly that: you model how things get into the sample and how the measurements arise. There's no guarantee that once you introduce those assumptions you can figure out what you want to know, but the model will tell you that. So this is a special case of a much wider area: thinking about how the sample comes to be. That's a big issue in some fields. Epidemiologists think about this quite hard, right? This issue of whether people actually get treated or not, and how people get into the sample for treatment. It's a constant problem; the people in treatment groups don't do what you want them to do. I'm tempted to go off and tell entertaining stories about that, and I think Cody has a bunch of these too, but I won't, because I'm not sure I have time. There's a general issue here. Like the World War II bombers, I think I mentioned this on Tuesday, or maybe it was in chapter seven on interactions. That's a case where even being observable tells you something: there's a relationship between where the damage was and whether you get to observe the plane at all. Or the manatees, right? Conditional on observing a living manatee, that tells you something about what happened to it. That's what fools us. But if you simultaneously model sample selection, how something gets into your data set or how the measurements arise, you're less likely to be tricked. Now, the model may tell you that you just can't know what you want to know, but that's an advance over false knowledge. Yeah, did I answer the question? Does that make sense? Okay.

All right. You end up with a posterior distribution for each imputed value. What I want you to see is that the posterior distributions are quite wide, because, hey, this isn't magic. We're not sure where they are, because we haven't got a lot of information to nail them down. But they're all between about 0.55 and 0.8, because that's the range of observed values. And they've shifted. They're not all the same, not all identical to the prior, because the regression line has moved them, based on the values of the other variables in each case. Information flows back out of the regression line into these missing values and moves them around a bit. Let me give you an intuition for that real quick. The imputed values end up weakly tracking the regression, weakly in this case because there's a lot of imprecision about them, and the regression relationship isn't all that strong. So there's not a whole lot of shrinkage or movement induced here, but shrinkage towards the regression line is what it is. You can think of it this way: the observed neocortex values are associated with milk energy. There is a consistent, almost certainly positive relationship in these data, accounting for body mass, between milk energy and neocortex proportion: bigger-brained animals have more energetic milk, controlling for body size. So when there's a value to impute, you should use that information to adjust it.
You should move it away from the prior, which is just the mean and standard deviation of all the observed values. You should adjust it, conditional on that knowledge, because you know the outcome for that animal, and that helps you predict its neocortex proportion. Of course, we'd rather just measure the thing. We wouldn't be in this business if someone had measured all of those brain sizes. That's harder to publish than apes, I guess, or they're just small and hard to measure. I don't know. Something like that.

What I'm trying to show you here, and again this is my main point, is that the model figured all of that out. You didn't have to be clever to figure it out; it's figured out automatically, because it's an implication of the model under the laws of probability. So what I'm showing you here is neocortex proportion against kilocalories per gram, and the counterfactual posterior regression line that's been inferred, which is positive. The blue points are observed pairs of values for neocortex and kilocalories of milk. The open ones are cases where neocortex was not observed. (And I'm curious how that will show up on my microphone later; it'll sound like an explosion or something. He dropped his coffee cup. That would be alarming. Microphones are weird with loud noises, right? We went to 11.) You'll see that there's a weak tilt to the centers of gravity of the imputed values, because they're tracking the regression line. They're not tracking it very strongly, because they're highly uncertain: there are lots of values that would be consistent with the regression, given the uncertainty in the imputed values. But the tracking is a consequence of information flowing back out of the regression model into the estimates.

Let me try to summarize this real quick. The observed neocortex is probably associated with the observed body mass. And here's a weird thing about this model; let's do due diligence and model-check. The imputed neocortex is not associated with the observed body mass. That's what I'm showing you on the right-hand side of this slide, where I'm now plotting log body mass, which was the other predictor. Remember, this is a case of masking, where these two predictors work in opposite directions: neocortex proportion and log mass are positively associated with one another, but they're differently associated with the outcome. The blue points are the observed pairs, and then we have the posterior distributions of the imputed values. What you notice is that the imputed values are not tracking the relationship between the two predictors. That's because our model assumed there's no relationship; it didn't connect them directly in the imputation. But there is information in the fact that log mass and neocortex are correlated with each other. The predictors have a correlation structure, and we haven't exploited it yet. So for the last thing we do, let's exploit that information. Let's add it to the model, because we know that as well. And this is as easy as adding another regression to the model.
So if you know something about how these two predictors are related, or you can model their relationship and estimate it, then you can get that information into the model just as easily, by adding it and estimating it simultaneously. Our naive imputation model that we just used is the first distributional assumption you see on this slide: each neocortex value N_i is normally distributed with some unknown mean nu and standard deviation sigma_N, and we estimate those from the data. This becomes the prior for the imputed values. There's nothing in here about the correlation between the N_i's and log mass. Slightly less naive would be to say there's a regression: if we were going to predict the neocortex value using body mass, we'd just write the thing on the bottom. That's a linear regression of neocortex on body mass. Some of our neocortex values aren't known, but this means we get to use the overall association between these two variables to improve the imputation of the missing neocortex values. Yeah? Some of you are nodding and some of you are squinting a lot, trying to figure out what that is.

Yeah, we could use them. The question was: if we had more variables that were related, could you model that? Absolutely, if they're strongly correlated, there's structure there. Often what people do, and I'm not showing you that example here because I thought this would be more transparent, is to put a joint multivariate Gaussian prior on all the predictors simultaneously. You estimate the covariance structure among them all, you get pooling in all directions among them, and that can propagate up. I didn't show that because I thought it might be a little too mind-blowing, and it doesn't generalize to non-linear relationships. Say you think that log body mass and neocortex aren't linearly related: this approach will generalize. You can put any kind of function in here you like, like a negative exponential relationship, which almost certainly has to be true, because neocortex proportion is bounded between zero and one, so as body mass gets bigger, this can't possibly stay linear. Even here, a beta regression would be better. I'm thinking of putting a homework problem at the end of this chapter where you do this as a beta regression, which you guys won't get, because you're getting your final today. But I haven't taught you beta regression, so I'm not going to put it out there. But yeah, this is the general case: you can extrapolate out, but very often people just model the predictors with a giant multivariate Gaussian. That helps; it exploits information. As long as linear relationships among the predictors are all you care about, you're okay. Yeah, Cody? Is this only defining the N_i's that are missing, and not the other N_i's? Right, the other N_i's don't move. They just go into the model as data, because they're like infinitely strong priors: all the probability mass is on one value. Sure, so this is added; so are there actually two probability statements defining the distributions of the missing N_i's, this one and the one in the main regression model that covers both real and missing data? Well, these two lines replace that one in the model; that's what happens, yeah. So basically we're just adding the linear model that puts more general information in.
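In code, the change is just the extra linear model for nu. A sketch, assuming the same data_list as before, and using gM as the name for the gamma_M slope (check the book for the exact priors):

```r
# same data_list as the naive model; the only change is nu gets its own linear model
m_impute2 <- map2stan(
    alist(
        kcal ~ dnorm(mu, sigma),
        mu <- a + bN*neocortex + bM*logmass,
        # imputation sub-model now uses the relationship between the predictors
        neocortex ~ dnorm(nu, sigma_N),
        nu <- a_N + gM*logmass,
        a ~ dnorm(0, 100),
        c(bN, bM, gM) ~ dnorm(0, 10),
        a_N ~ dnorm(0.5, 1),
        sigma_N ~ dcauchy(0, 1),
        sigma ~ dcauchy(0, 1)
    ),
    data = data_list, iter = 1e4, chains = 2
)
```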
Provided, I mean, if this gamma_M parameter comes out straddling zero, then you're not going to get anything extra here. But it doesn't, as you'll see. There was another hand; it's a great question. Let me repeat it back to you, and you tell me if I got it right. The question was: why didn't we get this for free before? Because we had a regression model already that had all three variables in it. And the answer is: because that other model only gives you the association between each predictor and the outcome, assuming you already know the other predictor. There's no joint distribution among all three yet. Now we have a joint distribution among all three. Does that make some sense? Yeah. That's a great question. I should have put that in here; I'll try to add something to the chapter that brings it up. Good, there's another hand.

So if the goal of doing this analysis is to see whether your predictor variable is associated with, or can predict, your outcome: with the naive model you're generating the missing data assuming no relationships, assuming it's random. But when you bring that predictor relationship in, is it possible that you're biasing the analysis somehow, by saying the predictor should be predicting this outcome, instead of going in naively and assuming nothing? In the top one, you're assuming there's no relationship between the two predictors. And if you were going to go out and collect those data in the field, you would initially assume that there's no relationship. Well, when you go out and collect them, you don't have to make an assumption about the relationship. But you would want to be testing whether there is one. You'd want to go in naively and not assume there are relationships, so you're not biasing things. But if there is no relationship, then gamma is just going to be zero, right? Yeah. So this covers the previous case as well. Yeah: the top case assumes there's no relationship; the bottom case estimates the relationship. Okay, I guess my question was: just by doing that at all, are you somehow biasing it? Biasing? Well, provided that there's useful information in the linear relationship between the two predictors, this will improve the estimates. I don't want to lead you astray, but there's no free information. Did I understand your question right? I think so.

Yeah, the word bias. I would encourage you guys never to use the word bias in statistical discussions, and the reason is that it's not clear what it means. Even in statistics, bias means like six things. I'm sure if you think about this a little, you can tell me what your specific concern is, and it's probably a good one, but the word bias is this overloaded term, and that's why I'm freezing: I'm not sure what you mean. If we're talking about bias in terms of the overfitting-underfitting tradeoff, often called the bias-variance tradeoff, then you need bias to get more efficient estimators. You want bias. And so there's this whole tradition in classical statistics of gravitating to unbiased estimators, and it's a tragic tradition, because the only good thing about an unbiased estimator is that if you had infinite data, it would tell you the true value. But we don't have infinite data.
Anyway, I won't go off on that, because it would take too much time. It's very confusing, though. There's nothing good about unbiased estimators in practical usage; that's the truth of it. Unfortunately, the word bias sounds bad, so you have to be careful about it. We're just trying to use the information we have: this is garbage in, garbage out. If you like the assumptions, then you must like the results. Yeah. So, this model assumes that there's no relationship between the two predictors, and therefore it doesn't estimate it and can't use the information about that relationship. The bottom model estimates the relationship and uses that information to improve the guesses of the missing values. Is that what you were asking? I was just wondering what you had said; she was asking earlier how the model got the information. Oh, your question, okay, sorry. So, that question: remember, an ordinary linear regression is just telling you, given that we already know all the other predictors, what's the linear association between this predictor and the outcome? So the only information that can flow back out of the regression model is the overall association between neocortex and the outcome. It isn't estimating the relationship between the two predictors anywhere. How do you know that? There's no parameter for it. So it's not estimating it; there's no joint probability distribution among the predictors in the first model. Now there is; it's right there on the screen. That's a joint probability distribution between the two predictors. Okay, I will work on something for the chapter that talks about this.

All right. Model two: the code is unremarkable, much like the sketch above, and it's in the book. Take a look at it; just add that one extra linear model and I think you're done. And then map2stan, if it finds your NAs, makes a vector, passes it off to Stan, and Stan starts humming and purring, because it likes those integrals. It starts sampling, and this samples great, everything's Gaussian, no problem at all. One of the things that happens here is that the slopes end up being steeper now. We recover some of the strength of association between the predictors and the outcome, because there's information about the joint distribution of the two. Remember, they're masking one another in the original example. So appreciating the joint distribution helps you with the imputation; it helps you not get confounded. The intervals on the imputed values are tighter as a consequence: we have more information now, and it helps us narrow down the values, because not only do we have the overall relationship between neocortex and kilocalories of milk to help us impute values, we also have the relationship between the two predictors. So the joint probability distribution involves all three variables inside the model, and this information updates the imputed values, through both the association with milk energy and the association with log body mass. There isn't much to say about the estimates on the screen, except to show you that you should expect a parameter for every missing value.
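To actually look at those parameters, a quick sketch: the imputed values show up in the posterior as a vector of parameters, which map2stan names with an _impute suffix, something like neocortex_impute, if I remember right.

```r
precis(m_impute2, depth = 2)            # includes one row per imputed neocortex value

post <- extract.samples(m_impute2)
str(post$neocortex_impute)              # samples: one column per missing value
apply(post$neocortex_impute, 2, mean)   # posterior means of the imputed values
apply(post$neocortex_impute, 2, sd)     # and their posterior standard deviations
```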
I had a paper I published last year where there were something like 7,000 missing values in the data set, so there were 7,000 extra parameters. And Stan was like, give me something that will actually make me sweat. It was no problem at all, because these parameters are massively pooled — they have a common prior, right? There's a lot of shrinkage in the model. So the effective number of degrees of freedom for all these things is way less than the number of parameters. I forget what it is here, but I'll leave it as an exercise to the student to look at WAIC for this model. Okay. Questions? Yeah? How do you go from this to filling in your missing values? Well, you don't. Because there's no single point — these are distributions. What the model is telling you is that each of these imputed neocortex values is a posterior distribution, with a mean and standard deviation like these. They could be non-Gaussian, actually. In this case they're pretty Gaussian, because everything's Gaussian in the model, but they don't have to be. They're distributions. Don't pick a single point out of them, because no single point summarizes them. Does that make sense? So there's no principled way to, say, just take the means out and plug them in. You might want to report that the imputed values had these means and standard deviations — list them — or you can just upload all your samples to GitHub and point people at them. That's a perfectly reasonable thing to do. At least until GitHub starts to complain about gigabytes of samples on their server, which hasn't happened to me yet, but they might complain to you about that. They don't like big data files, right? They don't like binaries. It costs them a lot in bandwidth, I think. I mean, I pay, but I think most people don't; I have a premium account because I like to have a million private repositories. I don't know what their business model is, to be honest, but they're a great service. Anyway, just report the full uncertainty, I think — or this is like the horoscope problem again: in the context of your problem, you can decide what you need to report and how you want to act upon it. If you have to make a prediction, I would try to propagate the uncertainty forward somehow. It's tough. It really is tough. Okay, other questions? All right, let me show you the posterior predictions for the last model here. The ranges of the imputed values are still pretty wide. They are narrower than before. And so this is just to remind you that Bayesian inference is not magic, right? No one thinks it is, but there's this idea sometimes that if the statistical model doesn't give us what we want, then it's wrong somehow. Like, damn you, Bayes, you're supposed to be magic. And no — the fact that it tells us it doesn't know is information. It's one of the things I really like about it. Other statistical procedures also aren't magic, but we shouldn't expect logical statistical procedures to always give us what we want. And they don't. In this case, we can impute values, but the imputation just makes our uncertainty honest. It doesn't narrow down exactly what's going on. And just to echo the lesson here: imputation is a logical consequence of defining a full joint model for the outcome and the predictors. When you do that, you get it for free. It's just logic. It was always there.
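Going back to the reporting question for a moment: if you do want to summarize those imputed distributions, here is a rough sketch, again assuming the model object from the sketch above, and assuming map2stan names the vector of imputed values by appending _impute to the variable name, as in the book's examples.

post <- extract.samples(m_impute)   # posterior samples for all parameters
precis(m_impute, depth = 2)         # means, sds, intervals, one row per imputed value

# report each imputed value as a mean and a standard deviation...
apply(post$neo_impute, 2, mean)
apply(post$neo_impute, 2, sd)
# ...or share the full samples, so others can propagate the uncertainty themselves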
And it's there provided you're willing to actually define a distribution for the predictors as well. In all the models up to this point, we've been acting like the predictors had no distributions of their own. Only outcomes did, right? Our outcomes always came from some family; they were random variables that we were modeling very aggressively, in very complicated ways. And those poor little predictors were ignored, left crying in the corner, right? They got to appear at one little point in the model and then we said nothing else about them. But usually we know other things. Predictors are born out of some process as well, right? And in fact, some predictors produce other predictors in a sequence. This is like structural equation models, which often have that structure, or causal diagrams more broadly, where you have some idea about the causal structure among all your variables. You can write that down in these models, and there will be logical implications that arise from it, whether or not you anticipate them. The last thing I want to say about imputation before we move on is the big advantage of Bayesian imputation over the alternatives. The drawback is that it's computationally expensive and seemingly mystical, so your audience may not understand what you're doing. But use the word Bayesian and they may just let you get away with murder, right? Which is not good, I should say, but that's what usually happens with my reviewers: oh, it's Bayesian, okay. Which is not good — it isn't, it's horrible. But the good thing about it is that you get information flow in all directions from the joint posterior distribution. The inferred regression relationship updates the imputed values. With other methods, like multiple imputation, where you simulate the missing values from a model of the variable and then run the regression multiple times, there's no feedback from the regression model back into the imputed values. There can't be — there's no step in the procedure that lets that information flow back. And this is bad, actually, because you're throwing information away. It means the analyses are illogical in the strict definition of the term. That doesn't mean they're useless. They're unreasonably useful, actually, given their illogic. The bad thing, I think, is when people have a Bayesian engine and they deliberately try to stop the feedback. This happens a lot. Stan doesn't let you do this, although people keep requesting it, and the developers keep saying: do you know what you're asking? You're asking it to be illogical. There's a literature on this in BUGS, which does have a cut function, which lets you stop the feedback. I would encourage you never to use it. And if you see a model that has it in it, take it out before you use that model. The reason is that it stops the feedback of information from the regression into the imputed values, so the information only flows one way. And if you ask why people justify this, I think mainly it's because they want results that look like multiple imputation results, which means the information only flows in one direction. Some people will say they don't trust the regression model, but they do trust the error model. That seems like a weird justification to me: you like some of your assumptions but not others, yet they all affect inference. It just seems weird to cut off information from flowing in one direction. But overall, regardless of the justification, it's very bad news.
So Martyn Plummer, the author of JAGS, did a bunch of simulations with BUGS and the cut command, showing that when you use cut, the posterior distribution is no longer valid. It doesn't even give you the right estimates for the regression parameters. If you have good guesses for where to start the chain and you don't run it very long, you probably get vaguely the right answer, but it's not actually Bayesian inference anymore. So this is really bad news. It's a scandal in the software. And I think people believe they're doing Bayesian inference at that point, but they're not; they're doing something else. That doesn't mean it's useless, but it's not what they advertise it to be. One way you can think about this metaphorically: remember the cafés from chapter 12? If we were a robot going from café to café, one of the things we'd do is, when we got to Berlin, we'd learn about Berlin, and we'd update our prior from Paris with the data from Berlin. Logic also demands that we simultaneously update our previous estimate for the Paris café using the data from Berlin, because the order we visit them in is irrelevant. But if the robot refuses to let the information flow back in time, it's illogical. And that's what cut is doing. It's refusing to let the information flow back, so it throws information away. And it results in pathologies in inference that people don't quite appreciate. Anyway, hopefully that was sufficiently scary, right? I should put on a hockey mask when I give this lecture. No, this is a big deal. I think it is. All right, let me try to wrap up this course in 15 minutes. I don't know how many hours we have spent together now, but a lot. So we started with the Golem of Prague, which is this metaphor that statistical models are a special kind of machine. We don't often understand the details of their operation, but it is incumbent on us to be responsible for them, and to recognize that they have no automatic access to truth or nobility. They don't access reality in any way. They're devices for processing information, for doing jobs, and their behavior may be counterintuitive, but it's always an implication of the programming we give them. And I like this metaphor because it's a bit monstrous, right? It makes you a little bit wary about things. You could use a robot too, but I think people think of robots as nice safe things — unless you work in a factory, and then you're terrified of them. So on this note, I wanted to spend a little bit of time here, maybe the next 10 minutes, talking about something which I think is obvious to all of you, which is that statistics is not a substitute for science. I don't think anybody ever claims that, but behaviorally lots of people act like it is. Let me try to deliver on that statement. What I mean is, people trust things in journals just because they're significant, right? And they act like, well, so-and-so has shown in a Nature publication in 2012, blah, blah, blah. Now, next year that publication might go down the drain, but it's amazing how trusting people are of the published literature, right? Not everybody is. There are fields, like public health, epidemiology, and clinical medicine, where people are very suspicious of the published literature, because things don't replicate. What they have learned, when they try to follow up on things, is that a majority of published results are false. This should terrify you.
And the same is turning out to be true in social psychology. Just this week, another release came out of the Many Labs project, where they're replicating famous textbook social psychology phenomena, and a majority of them do not replicate — nine out of ten do not replicate. Most of them had never been the subject of a replication attempt before, as far as we can tell, and they were in textbooks. We should be terrified by this. How can this happen? How can our statistics betray us so much, right? And this is why I say there's something going on where people behave as if statistics were a substitute for science, even though they would never say the sentence at the top of this slide. No one would ever say stats is a substitute for science, but there's a behavioral sense in which lots of people — not everyone, but lots of people — act as if it is. So let me give you a thought experiment to help you understand why this is: why, even if you do all the stats the best you can — say you use the true data-generating model for your process, somehow you knew it, and you do everything right — you still need science, by which I mean repeated learning over multiple studies, in a community of people arguing about results. No single study can ever tell you for sure what has gone on, no matter how significant the relationships. So let's assume, for example, that the probability of a false positive finding is 5%. That corresponds to our p less than 5% threshold, the thing I keep telling you to ignore, right? Because it's like, what — there was some bony-finned fish, some lungfish relative, that crawled up onto land a few hundred million years ago, and because it had five bony fins, we have five digits, so we like five, and five is now enshrined. I think that's basically the only justification you can get for it. And yet it dominates scientific inference. It's crazy, absolutely crazy. Those people in Congress who are after the NSF should be picking on that, and not the other things, I think. Well, don't let them hear this lecture, maybe. And the probability of a true positive finding, let's say it's 80%. When people do prospective power analysis, this is what they usually assume. It means that if the hypothesis is actually true, then 80% of the time you will get a positive signal — a significant result, or in the Bayesian version, a posterior distribution that doesn't overlap zero, however you want to think about the significance concept. We usually call this power. The first thing to say is that power is nearly always lower than that. Actual audit studies of power show that in many fields, power is below 50% most of the time. So even if the hypothesis is true, our chance of detecting it is less than half. That is routinely the case. But 80% is what people usually assume. Now, conditional on a positive finding — so say you get an asterisk, R spits out an asterisk, and you get happy, you're like, yay, publication, let's go drink — what is the probability that the finding is actually true? Can you answer this question? Most people can't. I think you guys should be able to, because you've had Bayesian blood pumping all quarter. You do this with Bayes' theorem. It's a conditional probability, right? Conditional on getting a positive signal, what's the probability that the hypothesis is actually true? That's a conditional probability.
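Written out, the calculation being described is just Bayes' theorem. In plain notation — this is my transcription of the verbal description, with b standing for the base rate, power for Pr(positive | true), and fp for the false positive rate:

Pr(true | positive) = Pr(positive | true) * Pr(true) / Pr(positive)
                    = (power * b) / (power * b + fp * (1 - b))

So with power 0.8, a false positive rate of 0.05, and a base rate of 0.10, the numerator is 0.08 and the denominator is 0.08 + 0.045, giving about 0.64 — the "only about 60%" figure that comes up below.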
So what we need — well, I should say, what most people say is 80%, right? Most people just report the power back. That is so wrong. What you actually need is Bayes' theorem. The probability that the thing is actually true, conditional on a positive result, is equal to the likelihood — the probability of a positive result conditional on it being true; the positive result is the data, and "true" is the actual state of the world that we want to know — times the prior, the probability that it's true at all. And then in Bayes' theorem we divide that by the probability of the data, the marginal likelihood: the overall probability of getting a positive result. This expands out: in the denominator we have to average over all the ways we could get a positive signal, a significant result. So there are two terms. One way is that the thing is actually true and we detected it. The other way is that it's actually false and we nevertheless get a false positive, right? So this is the false positive probability, 5%, and this is the power. And what's the probability of true? Well, the probability of false is one minus the probability of true. But what is the probability of true? This is the base rate, and this is the thing we don't know. Of all the hypotheses we test in any scientific field, we don't know what proportion are actually true. Nobody knows this. And everybody in their own field seems to think it's really high — Paul and I have been talking to lots of people about this lately, because we've been working on a manuscript about it. Oh yeah, that's a problem in social psych, those schmucks over there, but our base rate's like 80%. And then you ask the psychologists, and they're like, yeah, that's a problem in medicine, but our base rate's really high, it's like 80%. I'm not sure what it is, but let me characterize what these calculations look like. Here are the calculations: we just take Bayes' theorem and do the posterior calculation across unknown base rates, for different assumptions about power and false positive rates. On the left, I fix the false positive rate at the conventional 5% and vary the power a bit. The thing to notice is that if the base rate is small, the posterior probability that a thing is actually true, conditional on a positive result, is easily less than a half, right? So say only 10% of your hypotheses are true; then conditional on observing a positive result, there's only about a 60% chance that it's actually right, which is a little less exciting. It's like saying: new result in Nature, only a 60% chance it's actually true, right? Imagine that in the abstract. Kind of deflationary — and that's my goal here, this should be a bit deflationary. Power doesn't have a big effect; that's the other thing I want you to see here. Adjusting across 50% power, 80% power, 100% power doesn't have that big an effect, actually, except in basically the middle range of base rates. The false positive rate has a much bigger impact. And this has big consequences, because lots of the behavioral ways people do statistics elevate false positive rates. You drop outliers, you develop your hypothesis after collecting the data — all of that is a formula for elevating false positive rates, right? You look at the data and then you make a hypothesis about it. That's a way of getting false hypotheses, right, because there's a feedback loop.
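Before moving on from those curves: here is a minimal sketch of the calculation behind them — my own few lines of R, not the slide's code — in case you want to reproduce the shape yourself.

# posterior probability a finding is true, given a positive result
prob_true <- function(base_rate, power = 0.8, false_pos = 0.05) {
    (power * base_rate) /
        (power * base_rate + false_pos * (1 - base_rate))
}

base_rate <- seq(0.01, 1, length.out = 100)
plot(base_rate, prob_true(base_rate, power = 0.8), type = "l", ylim = c(0, 1),
     xlab = "base rate", ylab = "Pr(true | positive)")
lines(base_rate, prob_true(base_rate, power = 0.5), lty = 2)   # lower power
lines(base_rate, prob_true(base_rate, power = 1.0), lty = 3)   # perfect power

prob_true(0.10)   # about 0.64: the "60% chance" example above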
Dropping outliers removes variation from the data; that reduces variation within treatment groups, which makes it easier to get significant results. (I'll sketch a toy simulation of one version of this below.) These are routine behaviors in statistics in the wilds of the sciences. In every stats department, people will scream if you want to do these things, but out in the wilds of the sciences we go around doing them all the time, right? And yeah, I'm tempted to riff on that a little bit, but I will not. When you guys laugh at my jokes, it just encourages bad behavior. So we don't know the base rate. What is it? I think it's often pretty low. And that's because there have been audit studies where people try to replicate famous published results — things that are in textbooks — and the rates end up pretty low. So just this week, this is the Many Labs study trying to replicate famous social psychology results. Each row is a different famous social psychology result. The x-axis is Cohen's d, which is a standardized effect size, and this is the zero point. The green dots are the original famous published studies, and each of the other points is a separate lab replicating the result. This is pretty depressing. Don't think this is just social psychology, though. Go into an ecology journal sometime and look at how much is just, oh, we measured all of these things about the plots and then we regressed them all, and look, we got these significant results. How much of that do you think replicates? I suspect very little, right? But you can get pretty far in your career doing that stuff, even though a lot of it just can't be replicated. Yeah — that was that year, exactly. There's always a story. Well, one of the interesting things about that in this context is that there was this bit of folklore in social psychology called the end-of-semester effect, which gets used to explain away non-replications: at the end of the semester, if you run a treatment, the students are lazy and they don't behave right. That effect was also included in this study, and there's no end-of-semester effect in these data. It doesn't replicate either. They did replicate some things, though. The Stroop effect is real. If you don't know what the Stroop effect is, it's awesome, Google it — it's really cool, it's very powerful, and it has been replicated many, many times before. So that's just a check that the whole exercise is working. If they hadn't replicated the Stroop effect, I wouldn't believe the study. I really wouldn't. The Stroop effect is incredibly powerful. So look it up if you don't know it. There are other audits like this in the clinical medicine literature, as I've said. For lots of, say, cancer treatments that have been developed, when they go into clinical trials, 80% of them don't turn out to work. This has turned into a panic for the pharmaceutical industry, because they waste a lot of money following up on initial positive results, and the vast majority don't work at all, and they spend millions and billions of dollars on it. So the NIH is in full panic mode over this issue, and people are arguing about it as a statistical issue — and maybe it is, because of elevated false positive rates — but I think there's this burden put on statistics, as if it's supposed to solve the whole problem. The real problem is that people aren't replicating. That's the major problem. We trust the statistics to tell us something it really can't. Samples are noisy, they're partial, and we don't know all the issues yet.
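About that outlier-dropping point above: here is the toy simulation I mentioned. It's entirely my own illustration, not anything from the slides, and it shows one data-dependent version of the practice — drop the single most extreme observation only when the first test isn't significant — under a true null effect, counting how often a "significant" result comes out anyway.

set.seed(1)
n_sims <- 5000
n <- 20
hits <- 0
for (s in 1:n_sims) {
    y <- rnorm(n)                          # no true effect at all
    g <- factor(rep(0:1, each = n/2))      # two arbitrary groups
    p <- t.test(y ~ g)$p.value
    if (p >= 0.05) {
        drop <- which.max(abs(y - mean(y)))    # most extreme observation
        p <- t.test(y[-drop] ~ g[-drop])$p.value
    }
    if (p < 0.05) hits <- hits + 1
}
hits / n_sims    # comes out above the nominal 0.05

The second test only ever gets used when the first one fails, so the false positive rate can only go up from 5%.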
So to bring this home and show that this isn't just a contemporary problem — this is a great book, I started reading it a couple of months ago: The Lost Elements. It's the story of hundreds of chemical elements, candidates for the periodic table, that don't actually exist but were "discovered", and people went to their graves arguing that they existed. It's a fascinating book. Chemistry was a mess, a real huge mess. Now we think of it as a completed field. There's a table of elements on walls all over campus, and those elements do exist, we're pretty sure — at least for a fraction of a second, for the ones down at the bottom of the table, all right? But this is fascinating. It was a mess, and all the ego-driven nonsense and lab accidents and everything else that goes on in social psych and medicine now was going on in chemistry in the 1700s and 1800s. It's really fascinating. So it could happen to any of us. It's not just something about the social sciences and biological sciences; it could happen anywhere. And I think the point is that individual studies aren't that diagnostic, and we shouldn't put that heavy a burden on them. The awful flip side of that, though, is that most of us propel our careers off of single studies. It's like we're getting illegitimate credit for what are actually tiny little bricks in the wall of science. So there's a readjustment of norms coming, I think — and people may think I'm a crackpot for saying so. Maybe I am a crackpot, but anyway. So what I want to say, the good news, is that it's not just us; it has always been this way. Replication is always necessary, and communication is always suspect. You can't trust the literature, because it's only a partial representation. When people don't get positive results, they tend not to publish them. Try publishing a negative result sometime. Actually, you guys, don't — wait till you have tenure, then try publishing a negative result. You see, I just gave you corrupting advice, right? The incentive system pushes us to keep doing this. Anyway, I don't want this to be depressing. All right, I've got a few minutes left. It's okay if I go over a few minutes, because I want to give you some useful, horoscopic advice here at the end. This might take me a couple of minutes over. I always resist giving you recipes, but I understand that when you're starting out they're incredibly useful. So I want to give you a few pieces of broad advice that I trust in the vast majority of cases. But, as always, if your domain knowledge overrules this, trust yourself and not me. And I think the general issue is that the reason we like recipes is that when we start out, we're not experts. We know our science, but we don't know mathematical statistics very well. We want to be told by an authority what to do. And often the recipes we get are broadly useful, but we could do better if we did more, and it takes time to get there. I would just encourage you to relax about it. There's this emotional feedback loop where our anxiety over these things leads to compulsive hand-washing behavior, like the significance ritual, the null hypothesis ritual. Why 5%? It's obviously crazy to use 5% as a threshold. There's no justification for it. And yet the vast majority of scientists do it. And why? Well, that's what everybody does. Try using 6% sometime and watch how people react, right? I think when I revise the book, I'm going to change all the confidence intervals to prime numbers.
Just to disagree with people — I'm going to troll hard. It's going to be fun. And then when people ask me why, I'll look at them very seriously and go, oh, because they're prime. I told you I was going to troll hard. But no, I mean, it's not anyone's fault. We're all subject to these incentives. It's the system that is coercing us, and individuals wouldn't have chosen this. Science is evolved, not designed. And so we're all prisoners of the system; we just have to step in and steer it a little bit. And I think the sociology of this is that the field of statistics became autonomous in the middle of the 20th century, and then its incentives shifted from helping scientists do their work to developing statistical procedures. And those things do help us. But statisticians don't get promoted for discovering scientific principles; they get promoted for developing fancier statistical procedures. And I'm a big-time consumer of those, so I don't criticize them for that. And their curriculum is designed to train more statisticians, because if they don't do that, then nobody will. So they don't train us. And this is a horrible incentive problem, and it's nobody's fault. But I think it's a side effect of statistics leaving the fields it was born in and becoming an autonomous unit; it changes the incentives. And so in the sciences, we have since evolved our own statistical practices in the wild, and we have naturally selected for bad statistics. That's the way I think about it. Why? Because if your lab has some procedure that elevates false positives, you get great publications, and other people copy those procedures, and off they go. I know this is dark. And it gets darker — I study cultural evolution, so I've got stories. Anyway. These words objective and subjective are really awful too, and they're thrown around a ton in statistics and in science in general. This is like that line from The Princess Bride: you keep using that word; I do not think it means what you think it means. I think the usual thing in science is that when people say objective, all it actually means is that everyone does it the same way. It just means it's conformist. And that's safe, right? You don't get penalized for that. And science is incredibly conformist. I don't think that's always bad — I think some conformity is a good thing. But subjective describes a case where expertise matters. If you're an expert in something, your subjective opinion is of value; it's of more value than somebody else's. So subjective sounds bad in science, but it's not necessarily bad. And inference from data is always alternating between objective procedures — like Bayesian inference: conditional on the assumptions, Bayesian inference is done the same way every time, provided the machine works right, so it's objective — and subjective interpretation of the output. And we need that, because that's our chance to use our expertise and say, hey, that model is crazy, right? Something like that. Or for your colleagues to say, but you left out this important variable. That's subjectivity, and it's indispensable. So on that: recipes and mantras, very quickly. Bayesian analysis is a recipe. You define a model, fit the model, check the fit, critique the model, and repeat, right? We need to iterate.
Hopefully you get the model exogenously, from something other than the data. That would be ideal, right? And it always depends on context. Now, by the way, that's a plane I built with my son. It's awesome, I really recommend it. Has working wings and everything. So, for choosing likelihood functions: figure out the constraints on the variable and invoke maximum entropy. That's a conservative approach. It's maximally conservative and introduces no additional information that you have not explicitly stated, because maximum entropy gives you the flattest distribution consistent with your assumptions. That's what makes it conservative, and that reduces false positives. Ask yourself what aspects of the data you care about. What can you actually calculate and understand? You can use multiple models — you can use multiple likelihood functions and compare them. That's called sensitivity analysis. When you're not sure what to do, you don't have to choose: do multiple things and report the variation in results. That calibrates you, and it gives you an idea of what the next step is. For choosing priors: flat is almost never best. You want to guard against overfitting; that should be your default position — regularization. If you let the data drive, you risk it driving drunk, right? That's the problem: don't trust the sample. We know that overfitting nearly always happens, so expect it. If you have a meaningful parameter, try to get information into it. One way to do that is to exploit maximum entropy again, to get a maximally conservative prior consistent with the constraints you can state about it. If that doesn't make total sense, when you get to a problem like that, come to me; I'll help you with it, and we can figure it out. All right, last thing, and then I'll let you guys go, because there's another class coming in. A little bit jokingly: if you want some relaxing meditation, I have this book of Zen koans on my desk, and in the morning I read a little koan to relax myself, because, you know, I have an admin meeting and I want to kill someone. So I pick up my little book of Zen koans and it's relaxing; it's like a little mind puzzle. So let's end with some statistical koans, one-liners at least. What we should be doing is assuming there's an effect and estimating it. Our representations of reality are not reality itself, and in our representations there's a continuous range of possible effects. So let's assume there's some effect and measure how big it is. It isn't a dichotomous issue of whether the effect is there or not, because our senses don't access reality directly. We should embrace and propagate uncertainty. Uncertainty is your friend; it guards you against mistakes, so trust it. Fitting is easy and prediction is hard. I hope this course has gotten that across, right? We can fit models — you guys are experts at that now. Prediction is a monster. There is no right, only less wrong. You don't need the right model, you just need a less wrong model. You don't need the right prior, you just need a less wrong prior. That's always our situation here. And then: math is not real — only then can it be real. What I mean by that is, people tend to act like math accesses the truth somehow, but it doesn't. Math accesses a logical world of symbols, and it's of incredible value because it's a mental prosthetic. It can do things for us, process information,
in ways that are very hard for us to do on our own. But we can't mistake it for the real large world. So remember the small world, large world distinction. If you want to harness math instead of letting it harness you, you must remember that it's not real. It's an invention of people that we use to process information, okay? And that should be a hopeful message as we go. Anyway, on that note I will stop. Your final exam is on the website; I think it's very well documented, and there's a lot of instruction about it. The data are already built into the rethinking package. If you have questions, email them to me. Thank you guys for a great quarter. You worked really hard. You're a very impressive group, and I'm sure I'll see you in my office soon.