Welcome to lecture seven of Statistical Rethinking 2023. In this lecture, we're going to focus more on the details of estimation and all the problems that arise after you have an estimator in place. This is a little cartoon that I showed you in an earlier lecture of the Earth orbiting the sun and Mars also orbiting the sun. It's a cartoonish explanation of the phenomenon called retrograde motion: from the perspective of an observer on Earth, Mars and the other planets appear to zigzag in the night sky. Figuring out why this happens was a major preoccupation in previous centuries, and the Ptolemaic model, which was geocentric, used epicycles to do it. But this fellow, Mikołaj Kopernik, Nicolaus Copernicus as he's usually called in English, decided that another way you could explain retrograde motion is by putting the sun at the center of the solar system, the so-called heliocentric model. This turns out to be a clever move, because the actual structure of the solar system does have the sun near its center. The thing that's not often appreciated about the Copernican, or heliocentric, model is that it still uses epicycles. The major problem is that the orbits of the planets are not actually circles, and the sun is not at the center of them; it's offset. Copernicus realized that was true, but he thought that everything in the celestial sky was made of circles, so circles were still his building blocks. Epicycles were not in ill repute yet, and so he used them. This is simplified compared to his actual model, which had even more epicycles, but you can get a pretty good elliptical orbit, closer to what the Earth actually does, by putting one circle on another like this, and get many of the features of the orbits that you need. Copernicus' actual model had a bunch of epicycles. What's distinctive about it, though, is that it had fewer epicycles than the predictively equivalent geocentric model. So to remind you: on the left is the geocentric model, like the Ptolemaic model. Again, this is just a simplified cartoon. You put circles on circles, and you can explain retrograde motion this way. Copernicus explained it through a heliocentric model, but he still needed epicycles to get a prediction of the positions of the planets in the sky, because he didn't realize the orbits were ellipses. Both of these models were exactly equivalent. At the time that Copernicus made his heliocentric model, it was no better or worse at predicting the positions of objects in the night sky than the Ptolemaic model. So why prefer it? Well, Copernicus' argument, skipping a bunch of details, was basically that it's simpler. The heliocentric model requires fewer circles. This principle, which some people call parsimony or Occam's razor, comes up a lot in the sciences: the idea that we should prefer simpler explanations. But the problem is that in almost any realistic modeling context, we're not choosing between a simpler and a more complex explanation of a phenomenon, both of which make equally good predictions. Typically, we're trading off simplicity against accuracy. And there's, unfortunately, no good justification for preferring simple things anyway. What we care about is why simplicity should be related to accuracy at all. We're gonna think about that today in this lecture, but I wanna come into it from a bit of a sideways direction.
The problem that stands before us at this point in the course is that, for any given sample, there are many different causal models that could be compatible with it. And when we design our estimators from any particular causal model, telling us how we should use the sample to get an estimate, we still face a lot of engineering problems in estimation, because the sample is finite. The do-calculus doesn't tell us how to cope with that at all, how to get an efficient estimator. And there are different estimators for any given task, and some will be better than others. There are better and worse ways to use data. So you can think of this as two related struggles, both of which are happening at the same time in all of our scientific data analysis projects. The first is the struggle against causation: how do we use causal assumptions to design estimators, and how do we contrast alternative causal models with one another? That is, which kinds of observations that we could use in estimation will distinguish the models at all? That's something I talked about in the first lecture of this course. The second struggle is the struggle against data itself and its finite nature: how do we make the estimators work? Just because the DAG and the do-calculus tell you an estimator is possible doesn't mean that it's practical and useful. Existence is not enough. And so we need to think hard about the engineering aspects of how we deal with estimation, and that's what we're gonna talk about today. Let me give you a really simple data analysis example to build up the story for you. I wanna talk about different things that we might mean by prediction. So sometimes what we're interested in doing is figuring out what function describes some points. I'm gonna work with the example on this slide, which is just a few data points, one, two, three, four, five, six, seven data points, each of which is some hominin, that is, a human relative, plotting body mass against brain volume. The point near the top is us humans. We're not the heaviest, but we have the largest brains, and there seems to be some extremely vague positive relationship among these points. If we were gonna find a function to describe these points, what should it be? This kind of question is not necessarily a question about causal inference, but it's just as interesting and useful. You can think of it as curve fitting or compression. That is, instead of just describing every point, can we get a relatively accurate compression of them through some mathematical expression? That's often a very valuable thing to do in the sciences. Another kind of thing we might be interested in doing with data is asking what function explains the points, and that's causal inference, and that's what we were focused on last week. Another thing we might ask is what would happen if we changed a point's mass. This is also causal inference, but it's about intervention now, and it's distinct from the inferential part and requires different kinds of tools, and different detailed problems arise in coding and implementation. Finally, what we're gonna spend the time in this lecture up to the break on is what we call prediction in the most typical sense: what is the next observation from the same process?
Some process is generating these points. We're not trying to infer that process, we're not doing causal inference, but we'd like to be able to accurately predict the next observation we might sample from this process. There are lots of legitimate tasks in science and outside of science that are of this kind. You're not necessarily interested in inferring the generative model, but you'd like to have a good expectation of what will happen next. This is in the absence of intervention. In the first lecture, I talked about causal inference as being defined by interventions, in one heuristic sense of defining cause; but if you're not going to intervene in a system, you can do a lot with statistics to predict what will happen next. It's intervention that requires causal inference. Okay, so let's think about this. There are lots of different functions we could fit to these data, these seven little data points, to try and predict the next point we might sample. And we don't have more than seven. Well, we do in anthropology have more than seven, but let's say we only had these seven. One thing we can do to get an idea of how good any particular function is at predicting the next ones is to drop one point at a time, fit the function to the other six points, and then predict the point we dropped, as if we had blinded ourselves to it. So let me walk you through this procedure. This is a procedure known as leave-one-out cross-validation. It's incredibly common in machine learning, and common in some parts of the sciences as well, and it's a technique you use to assess the predictive accuracy, the expected predictive accuracy, of a statistical procedure. Okay, so first we drop one point. The red point on the right is the first point I'm gonna drop, and then we fit the line to the others. That's the red line there. Then we predict the dropped point, and what we assess is its distance from the prediction line, which is the little vertical dashed line that I've added to the plot. This is, in a sense, the prediction error, because if the line predicted the dropped point perfectly, the dropped point would be on the line. I'll say that again. If the fit curve, that is the line here, had predicted the dropped point perfectly, the point would be on the line. Since it's not on the line, we measure the vertical distance. That is, the badness of the prediction is the length of that dashed line segment. We then repeat this procedure with the next point. For point two, we do the same thing: we fit a new line. In this particular case, just by coincidence, the left-out point is below the prediction line this time. Yeah, it's a little bit better. And we can do it with all the points in turn, and we end up with seven different regression lines and seven different errors, and we sum up those errors, and that's the expected accuracy of using a line on all seven points. And that's this blue line. And the way we talk about this is that there are two scores: the score in and the score out. The score in is the fit, that is, how well the function fits the sample we have. This is what we do when we run quap in this course, or if you just run an ordinary linear regression using least squares: you minimize the squared errors in the sample, and that's the in score. And the fit is always better than the predictive accuracy, because we haven't been able to train the function on the points we haven't seen yet.
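Here's a minimal sketch in base R of that leave-one-out procedure for the straight line, using squared prediction error for simplicity rather than the log score the lecture moves to in a moment. The data values approximately follow the book's seven-hominin example; treat them as illustrative placeholders.

```r
# Leave-one-out cross-validation for a straight-line fit to 7 (mass, brain) points.
# Values roughly as in the book's hominin example; substitute real data as needed.
d <- data.frame(
  mass  = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5),   # body mass (kg)
  brain = c(438, 452, 612, 521, 752, 871, 1350)           # brain volume (cc)
)

loo_error <- 0
for (i in 1:nrow(d)) {
  fit  <- lm(brain ~ mass, data = d[-i, ])                    # fit to the other six points
  pred <- predict(fit, newdata = d[i, , drop = FALSE])        # predict the dropped point
  loo_error <- loo_error + as.numeric((d$brain[i] - pred)^2)  # accumulate prediction error
}

in_error <- sum(residuals(lm(brain ~ mass, data = d))^2)      # in-sample fit ("score in")
c(score_in = in_error, score_out = loo_error)                 # score out is typically worse
```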
And so when we compare functions on their predictive accuracy, we don't wanna compare them on the in score, their fit to the sample. We must never do that. What we want to do instead is compare them on their scores out, and one way to do that is through cross-validation. So let's consider other functions, so you can see what that means. Oh, before I move on to the other functions, I wanna mention that, of course, we're Bayesians in this course, we're good Bayesians, and like good Bayesians, we don't use points. We always use distributions. Estimates are distributions; points are decisions. Estimates are distributions; points are decisions. And we're not making decisions in this course. If you wanna do decision theory, that's fine. But usually in research, what we wanna do is communicate the estimate and let our colleagues make up their own minds about what decisions are implied. So we're gonna use the whole posterior. And so when we assess the fit, both in and out of sample, we use this thing called the LPPD, the log pointwise predictive density. And for the cross-validation sort of task I just showed you on the previous slide, it's this sum of sums. Let me take just a moment to explain this horrible-looking expression to you, to demystify it. The point of explaining it is not that you're gonna have to write the code for this; it's already built into the software. It's so that you can start to understand these expressions and read them without any sort of angst. There's a structural language to these things that you get used to very quickly the more time you spend at this business. Okay, the LPPD is the log pointwise predictive density. The predictive density part is that we're trying to get the predictive distribution for some particular model; density here refers to a probability distribution. It's pointwise because we're considering each point independently in the cross-validation; that is, we're assessing the accuracy at each point. And it's log because in statistics everything is typically done on the log scale, because it's more numerically stable. Okay, all the junk on the right: N is just the number of data points in your sample, and S is the number of samples we draw from the posterior, which can be made arbitrarily large to sharpen our approximation of the LPPD. And the two sums, all they do is this: for each data point, we take the average over all the samples; that's what the one-over-S is doing in there. And for each sample we draw from the posterior distribution, we compute a prediction, and that's what this probability is: the probability of each observation i, computed with a posterior that omits point i. So that's the cross-validation. Yeah, we're asking a posterior distribution to predict a point it hasn't seen, and so hasn't been updated in light of. That's the cross-validation idea. But we have to drop each point separately, and so if we have N data points we end up with N posterior distributions. It's a lot of computation. Oh yeah, and the whole thing here is just an average. Okay, it's a lot of computation, but you can automate it, and I've done that, and I'm gonna show you some animations, just looking at posterior means here for simplicity. On the left I'm repeating the linear estimation for these seven points, and the score in is 318. Bigger numbers are bad, because they're distances, right? They're total errors. And the score out is 619.
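To make that sum of sums concrete, here is a small base-R sketch of the pointwise LPPD computed from posterior samples. It assumes you already have an S-by-N matrix loglik, where loglik[s, i] is the log probability of observation i under posterior sample s; for the cross-validated score on the slide, each column would come from a posterior that was fit without that point.

```r
# Pointwise LPPD from an S-by-N matrix of log-likelihoods,
# where loglik[s, i] = log Pr(y_i | Theta_s) for posterior sample s.
lppd_pointwise <- function(loglik) {
  S <- nrow(loglik)
  # For each point i: log( (1/S) * sum_s exp(loglik[s, i]) ),
  # computed on the log scale (log-sum-exp) for numerical stability.
  apply(loglik, 2, function(ll) log(sum(exp(ll - max(ll)))) + max(ll) - log(S))
}
# sum(lppd_pointwise(loglik)) gives the in-sample score; the cross-validated
# version repeats this, point by point, using posteriors that omit each point.
```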
So for the straight line, the score out of sample is about twice as bad as the score in. On the right of the slide I'm fitting a parabola. Now, in a previous lecture I told you not to use parabolas, and I'm still gonna stick with that advice. But lots of people do use parabolas to fit data, so I'm gonna use this as a simple example. It's also an example we can extend by just adding more and more polynomial terms. So the second-order polynomial on the right, the parabola, can be fit to the same seven points, and you'll see now it bends to accommodate, right? And we end up with a final parabola. In sample it has a better score, it fits the data better because it's more flexible than a straight line, at 289. But its score out of sample, that is, the error assessed by summing over the points that are dropped one by one, is worse than the straight line's: 865. And this general pattern, that more flexible functions do better in sample and worse out of sample, is very general, at least for simple statistical models of the sorts that we've seen so far in this course. When you get to statistical models in which the parameters are hierarchically embedded in one another, like we'll see in multilevel models in the second half of the course, this relationship no longer holds in this way. Nevertheless, the intuition you're gonna get from this lesson does hold, because you're gonna learn something about how flexibility works in models in and out of sample. So let me show you the extension now. Let's consider even fancier stuff. On the left I'm just repeating the straight line and the parabola from the previous slide, and now the larger curve here is a cubic function, the third-order polynomial. It's even better in sample. We're down to 201 now; we're getting a pretty good fit, because you can see it bends twice, unlike the parabola. But its error out of sample is astronomically worse now, because that flexibility in sample lets it bend really far away from points that are dropped. And you'll see the gray curves, which are the independent curves estimated for each dropped point, vary a lot more. They're much more different from one another than the different parabolas or the different straight lines are. The gray lines in the upper left of this slide are, for the most part, aside from just one, very similar. And the parabolas are quite similar, aside from that one which bends the opposite direction. But the cubic curves on the right are all over the place, and that's the variance. The cubic is susceptible to varying all over the space when you drop any particular point. This continues: we can add a fourth-order polynomial, and the same pattern holds. Every time we add another polynomial term, the curve fits the sample better but has even worse out-of-sample performance, until we get to the fifth-order polynomial here on the far right of this slide, which fits the data set almost perfectly. It has an error in sample of only seven, but out of sample it has really extraordinarily bad performance, which means, again, it's very bad at predicting any particular point that's dropped. So let me try to summarize some of that. For simple models, that is, the sort of models we've seen so far, parameters aren't conditional on one another. They all have a direct functional role in computing the probability of the observation. There are no parameters in these models which give you probabilities of other parameters.
That'll come in the second half of the course. So for simple models of that sort, this relationship is very reliably true: as you add more parameters, you improve fit to the sample, but you may also do worse out of sample. That flexibility in sample is often a curse out of sample, but it's a trade-off. There's no Occam's razor here, because we still must assess accuracy. We want the model that is most accurate out of sample, and it just turns out that the models that are most accurate out of sample, purely for the purpose of prediction (remember, we're not doing causal inference, we're not after a mechanism here), trade off flexibility for accuracy, and it's important to understand why that's the case. This phenomenon is called overfitting. I'll try to illustrate it with the graph on the right for this toy example with the seven data points. The blue trend there is the relative error in sample. This is what people might call the fit, and with every polynomial term we added in the example, the error in sample declined, the fit got better. But the opposite happened out of sample: every polynomial term we added increased error out of sample, which makes it seem like the straight line is the best. And in this particular example, with only seven data points, that's true, but that's just a feature of this example. Let me show you an example in the same context but now with more data points. Again, we're looking at mass on the horizontal against brain volume on the vertical for a bunch of fossil hominin species, and we're going to fit the same set of functions to them, with the same cross-validation. Now there are more points, but it's the same idea, and we're gonna do the same assessment of in and out of sample. We're gonna go all the way up to a sixth-degree polynomial on the right. You can see this is a very flexible function, and it's all over the place. On the left is the least flexible one, the first-degree polynomial. Well, the least flexible one would just be an intercept, I mean no slope at all, but I haven't shown you that one. So, in this particular example, the function that gives you the best predictions by the cross-validation score, that is, dropping each point one at a time, measuring the prediction error for that one left-out point, and then summing those errors together, is the second-order polynomial, as I show you here. Because again, the blue trend always declines: every time you add a polynomial term, it goes down. But the red goes down, meaning it gets better, before going back up again at cubic. And you'll often see this in data analysis, in prediction problems: there is some optimal complexity, or another way to say this, some optimal flexibility of the model, which allows the model to learn the important, or we say regular, features of the process from the sample, and then it can use those regular features to make good predictions. So, let me highlight this issue. I'm going to drop the sixth-order polynomial so we can see the graph on the right a little better, and I've drawn this black loop around the second-order polynomial. You can see its error in sample is not the best, right? The fifth-order polynomial does a lot better in sample. But it is the best out of sample of all the functions considered in this example. It trades off flexibility for accuracy.
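If you want to see that pattern for yourself, here is a rough base-R sketch that loops over polynomial degrees and compares in-sample error with leave-one-out error, again using squared error rather than the LPPD, just to show the shape of the trade-off. It assumes a data frame d with columns mass and brain, like the one in the earlier leave-one-out sketch.

```r
# In-sample vs leave-one-out error as polynomial degree grows.
# Assumes a data frame d with numeric columns mass and brain (see earlier sketch).
d$mass_s <- as.numeric(scale(d$mass))   # standardize mass for numerical stability

scores <- sapply(1:4, function(deg) {
  f <- brain ~ poly(mass_s, deg, raw = TRUE)
  in_err <- sum(residuals(lm(f, data = d))^2)
  out_err <- sum(sapply(1:nrow(d), function(i) {
    fit <- lm(f, data = d[-i, ])                                   # fit without point i
    (d$brain[i] - predict(fit, newdata = d[i, , drop = FALSE]))^2  # error at point i
  }))
  c(in_sample = in_err, out_of_sample = out_err)
})
colnames(scores) <- paste0("degree_", 1:4)
round(scores)   # in-sample error keeps shrinking; out-of-sample error does not
```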
So, as I said, the idea is to have some function which is regular, that is, it can learn the regular features of the process that's generating the points, so that it makes good predictions in the future. And being regular means not being too excitable about the sample, because not every feature of the sample is regular. I'll say that again. What's important is not to be too excited about every point in the sample, because not every feature of a sample is regular, that is, it doesn't represent the long-run expected value of the generative process. So, in Bayesian modeling, we have a really nice tool for dealing with this, for making models less flexible, and that is priors. We can use skeptical priors that regularize inference, that downweight extremely unlikely observations, so that when a sample arises containing such observations, the model is less excited by them and can make better predictions in the future. Cross-validation measures predictive accuracy, but it doesn't do anything about it. Regularization is the procedure of designing a model so that it actually produces good predictions, so that it will have a good cross-validation score. They go hand in hand. I'll say that again. Cross-validation by itself does nothing to produce good models. You can use it to compare models and choose one that will make good out-of-sample predictions, but for any given model you can make it even better by using regularization, using skeptical priors that downweight extremely unlikely observations. Skeptical models tend to do better. Let me show you what this looks like. We're gonna take the same idea, in and out of sample, the same polynomials, the same data set, and we can think about repeating that whole set of animations I just did. Don't worry, I'm not gonna show you the animations again. You've seen enough of those, the dancing curves. I'll do this in the background, and I just wanna show you the results in terms of the blue trend in sample and the red trend out of sample. Now, excuse me, I'm gonna use different priors than in the examples so far. I used these extremely broad normal(0, 10) priors, that's a normal distribution with a standard deviation of 10, so a variance of 100, which is essentially flat. There's no prior information for the slopes within the reasonable range. But we can constrain them and see what happens. So, the first thing to do is try normal(0, 1), a much better choice I think, and what I want you to see is that on the blue curves, it actually makes things a little bit worse. I know this is hard to see, but there are two blue curves there, one slightly above the other. The normal(0, 1) blue curve has worse error in sample for every polynomial model than the normal(0, 10) one, and that's because priors hurt your fit in sample: narrower priors give you less ability to fit the sample perfectly. Yeah, if your goal were simply to encode the sample, to basically compress it, then you wouldn't want tight priors, right? You'd want loose priors. But that's not our goal. Our goal is not to simply compress or encode the sample using a model; our goal is to make predictions, at least here it is. So we get the opposite phenomenon for the red curves. The blue curve got worse when we made the prior tighter, but the red curve gets better. That is, the red trend for normal(0, 1) is below the trend for normal(0, 10), which is better, because remember, smaller numbers mean less error. And we can keep going. We can use something even narrower: a normal(0, 0.5) prior, with a standard deviation of one half, does even better. (A minimal code sketch of what these regularizing priors look like follows below.)
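Here is a rough sketch of what changing the prior width looks like with the rethinking package's quap, using the same little data frame d from the first sketch. The scaling and the variable names B and M are just one reasonable choice, not the lecture's exact code; the only thing that matters here is the standard deviation on the slope prior b.

```r
# Sketch (rethinking package): the same linear model under a flat and a
# regularizing prior on the slope. B and M are placeholder names.
library(rethinking)

d$M <- standardize(d$mass)
d$B <- d$brain / max(d$brain)   # one simple rescaling of brain volume

m_flat <- quap(
  alist(
    B ~ dnorm( mu , sigma ) ,
    mu <- a + b*M ,
    a ~ dnorm( 0.5 , 1 ) ,
    b ~ dnorm( 0 , 10 ) ,    # essentially flat prior: overfits more easily
    sigma ~ dexp( 1 )
  ) , data = d )

m_reg <- quap(
  alist(
    B ~ dnorm( mu , sigma ) ,
    mu <- a + b*M ,
    a ~ dnorm( 0.5 , 1 ) ,
    b ~ dnorm( 0 , 0.5 ) ,   # skeptical, regularizing prior on the slope
    sigma ~ dexp( 1 )
  ) , data = d )
```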
Back to the curves: notice that the normal(0, 0.5) model is noticeably worse in sample, because it's less flexible. No matter the number of polynomial terms, models with this narrower prior cannot bend as much, and so they fit the sample less well, and you see that the blue curve is now higher. But again, we've gotten a uniform, systematic improvement out of sample for every polynomial model by constraining the flexibility through these priors. This is regularization. It's the idea that we want to choose skeptical priors that allow the model to learn the regular features but not the irregular features, that is, the random variation that has nothing to do with the long-term expected trend of the process. You can make priors too tight, though. If we use normal(0, 0.1) priors here, then the in-sample fit gets quite a bit worse, and so does out-of-sample prediction. In this case, the sample size is too small and these priors are too constraining, so the model is extremely skeptical. If you fed it more data, eventually it would overcome the skepticism and these priors wouldn't be bad, but for this sample size this is too skeptical. So how do you choose the width of your prior, then? Well, if we were engaged in a causal inference task, as in many of the examples in this course, you use science: you think about it, and this is what we did in previous weeks when we thought about things like human height. You know the impossible and possible ranges of these sorts of things, and for most scientific data analysis you always have enough domain expertise to say something about that, to put good regularizing bounds on your priors, on what sorts of slopes are possible. For pure prediction, you can actually tune the prior using cross-validation. You can find a prior that in simulation gives you good out-of-sample performance, and this is often done in machine learning. In practice and in research, most tasks are a mix of inference and prediction. For any given causal inference task, there are going to be choices of functions in those models, in the generative model, that are not purely determined by our background causal theories, and we will need to be skeptical of overfitting in those cases. And so in practical use we're always doing causal inference and dealing with overfitting and regularization at the same time. People are often quite anxious about the choice of priors, because it feels like it's hard to justify in a particular case. So if it relaxes you, one way to put this is that the worst possible prior you could choose is a really flat one. If you made it a little bit narrower, it would be better in almost every case. And we're not trying to be perfect. The success of your analysis does not depend upon having the perfect model structure: not the perfect prior, not the perfect combination of terms or predictors. We just want to do better than the worst-case default, and that's easy to do. And again, one of the reasons I'm spending so much time in this course teaching you to write generative simulations of your models is so you can address these things in simulation. You're not left guessing and making up stories and hand-waving at your colleagues about your choices. Okay, that's been a lot. Let's take a break. I encourage you to go back and review the slides in the first half of this lecture. Collect some questions and try to answer them for yourself, or bring them to me in discussion at the end of the week. And then you should take a walk, relax, have a cup of coffee. And when you come back, I will still be here. Welcome back.
In the second half of this lecture, I want to take the foundation from the first half, about overfitting and regularization and their relationship to priors, and talk about some very useful metrics that actually estimate the expected out-of-sample accuracy of a particular model. So remember the blue and red curves from before the break. The blue curve is the in-sample error, the fit it's sometimes called, and smaller numbers are better. So the fit is always improving with model complexity in this example. And the red curve is out of sample. This is what we really care about. We only want to compare models on their out-of-sample performance for prediction. And the distance between the blue and the red for any particular model is the penalty, if you will, for trying to make predictions out of sample. What I want you to see, having drawn these black bars, is that the penalty gets bigger as we add complexity to the polynomial. So if we just take the vertical lengths of those black bars on the left and draw them on this new graph on the right, you'll see that as the polynomial models get more complex, the out-of-sample penalty gets bigger and bigger. So what's the problem with cross-validation? Of course, cross-validation is fantastic. No complaints about it in general; it's a wonderful method. And for small samples, you can just do it like I've done in these cases. If you've only got 12 data points, it's no big deal to fit the model 12 times. It goes really fast. But for really big samples, this becomes painful. By the end of this course, we're going to be fitting models that can take a half hour to fit. Fitting one of those once is not such a big deal, but at that point you don't want to fit the model 12 times, or more like hundreds of times, because there are hundreds and hundreds of data points in these big models. Cross-validation just becomes prohibitive. So wouldn't it be nice if we could estimate this out-of-sample penalty for any particular model from a single posterior distribution, just by fitting the model once? And for once the universe is benign and not hostile to us, because there's good news: you can. There are multiple methods that are quite accurate for the simple task of prediction in the absence of intervention, which is what we're talking about in this lecture. I want to talk about two of them. The first is an importance sampling approximation of cross-validation, or Pareto-smoothed importance sampling, PSIS, as I call it in the book and in these lectures. The second is an information criterion, the WAIC. Many listeners will have heard of AIC, the Akaike information criterion. AIC is only of historical interest now. You should never be using it, because it has been eclipsed by WAIC. WAIC is much more general, applying to many different kinds of models, and more accurate, and it gives you a standard error, so you can assess its accuracy in any particular case. Both of these tools are trying to do the same thing. They try to estimate the out-of-sample accuracy of a model, and they essentially measure this penalty and then add the penalty to the in-sample fit. They're both very accurate, and in most cases they give you the same answer. So if you're interested in the details of how PSIS and WAIC work, I refer you to the book. But in the lecture here, I want to show you how they're used and talk about things you shouldn't use them for. So to remind you, on the left we've got this graph that's going to be burned into your nightmares now, the blue curve going down and the red curve going up.
And on the right, I show you how PSIS, the purple curve, and WAIC, the new red curve, do in estimating this leave-one-out cross-validation score. You'll see they don't get it exactly right, but they get the trend right. They capture how the trend works; they get the right inflection point. So if you selected a model based upon PSIS, for example, you would get the quadratic model again, which does best, and its estimate of that model's out-of-sample accuracy is quite good. So tools like WAIC and PSIS, and cross-validation if you've got the computing time to do it, measure overfitting. Remember, they don't do anything about it. They estimate the out-of-sample performance, and that's a way of measuring overfitting through the penalty term. Regularization manages it. And neither of these tools, although they're important components of any scientific project, directly addresses causal inference. So what you should never do, under any circumstance, is choose a structural model and a causal estimate by comparing WAIC or PSIS or cross-validation scores. That should never be done, because these are simply predictive metrics. They're about prediction in the absence of intervention. And remember, causal inference is about knowing what would happen, being able to predict the consequences of an intervention, and that simply requires different tools. That said, WAIC and PSIS and related measures are extremely useful for understanding how models work, even if you're not gonna use them for model selection, because they teach you more about the estimation part. That is, you've got some DAG and you design an estimator, but that whole intellectual exercise is done in the imaginary realm of infinite data. We don't live in that realm. We live in the realm of finite samples, and so we need tools like this, and we need regularization, to actually get estimates. Underfitting and overfitting are both bad. We wanna get something in the middle. I wanna spend some time hammering home this point I just made, that WAIC and PSIS and cross-validation are not legitimate ways to choose causal models, to choose a causal estimate. And I wanna hammer this because you see it done. You see publications where people have a bunch of alternative models with different predictors in them, fit them all, compare them with WAIC or PSIS, and then choose the one with the best predictive performance as the explanatory model. And this is just logically incoherent. WAIC and PSIS are just ways to approximate the cross-validation score. That's what they're for. They're pure predictive tools. This is not prediction in the presence of intervention, which is what causal inference is about. These are predictions in the absence of intervention. And so what happens if you do do the wrong thing? Well, very often you will end up doing the worst thing possible, which is choosing a confounded model, confounded from the perspective of trying to get the right causal inference. And the reason is that predictive criteria prefer confounds: confounds, colliders, conditioning on post-treatment variables. All those things will actually improve your predictive accuracy out of sample. I'll say that again. Conditioning on colliders, and including post-treatment variables, which create misleading bias, actually do help you in prediction, in the absence of intervention. Causal inference is about prediction in the presence of intervention.
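Before the example, here's a minimal base-R sketch of what WAIC actually computes from a single posterior fit, so it isn't a black box. It assumes the same kind of S-by-N pointwise log-likelihood matrix as the LPPD sketch earlier; the rethinking package's WAIC() function does all of this for you.

```r
# WAIC from a single posterior fit, given an S-by-N log-likelihood matrix
# where loglik[s, i] = log Pr(y_i | Theta_s).
waic <- function(loglik) {
  S <- nrow(loglik)
  lppd_i <- apply(loglik, 2, function(ll)
    log(sum(exp(ll - max(ll)))) + max(ll) - log(S))   # pointwise lppd (in-sample fit)
  p_waic <- apply(loglik, 2, var)                      # penalty: variance of log-lik per point
  -2 * (sum(lppd_i) - sum(p_waic))                     # deviance scale: smaller is better
}
```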
Prediction and causal inference are just different tasks, so you can't switch the tools between them. Let's go back to an example from last week, the plant growth example (you may remember it), and carry through with this. We're gonna do the wrong thing, so you can see how WAIC and PSIS and cross-validation will prefer the confounded model. So, to remind you how this experiment works: there's a treatment T, the treatment is some antifungal compound, and it's applied to a bunch of replicate plants. These plants had some height H0 at the start of the experiment, when the treatment is applied. The treatment may have a direct influence on the growth of the plants by the end of the experiment, which is H1, but it also has an indirect effect on growth through suppressing fungus, which is F in the graph there. F mediates the effect of treatment on plant growth. And what we want to do in this experiment is assess the total causal effect of the treatment, both its direct effect and its indirect effect, because that tells us its value. Yeah, and what you saw last week is that if you condition on the fungus F, you block the effect of the treatment and you make the wrong inference about the experiment. This is called conditioning on a post-treatment variable, and it's very often a bad idea. So the model on the left of this slide is the one where we stratify by fungus. We condition on the post-treatment variable. It's the wrong adjustment set. If you took the DAG in the middle and used the backdoor criterion to estimate the total causal effect of treatment, you would not end up stratifying by F, because there are no backdoor paths in this DAG. It's an experiment. It's not confounded. You don't need to stratify by fungus. And then on the right, we have the proper model for the causal inference. What I didn't show you last week, because we didn't have the tools yet, is that the model on the left, the wrong model from the perspective of getting the right causal estimate, the one where you end up concluding the treatment doesn't work when it actually does, predicts better. If your goal were to predict plant growth, then the model on the left is the one you'd wanna use. So let me show you that, first in terms of the posterior distributions we fit. You can go back to the code in the book or from the lecture last week, fit these models, and reproduce these posterior distributions for yourself. The model on the left, which I'm gonna call the confounded model because it includes the post-treatment variable (the fungus is a consequence of the treatment), is wrong. It's biased. The true total causal effect of the treatment is about 0.1, but when you include fungus, the estimate straddles zero. Stratifying by fungus essentially cancels any observable effect of the treatment. On the right, when we don't stratify by fungus, which is the right thing to do, we get the correct estimate. Now we can compare these models using Pareto-smoothed importance sampling, which is usually my preference, for reasons we'll talk about in the next example. But if you did this with WAIC instead, you'd get the same values. They're functionally equivalent for this example, and in most examples, actually. So this graph that I've shown here is the comparison, and it's a bit confusing.
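Roughly, a comparison like the one on this slide comes from something like the code below, assuming the two models have already been fit with quap as in last week's code. The names m_fungus (the post-treatment model) and m_treat (the treatment-only model) are placeholders for illustration, not the book's model names.

```r
# Sketch (rethinking package): comparing the two fungus-experiment models with PSIS.
# m_fungus stratifies by fungus (post-treatment variable); m_treat does not.
library(rethinking)
cmp <- compare( m_fungus , m_treat , func = PSIS )
cmp          # the model that conditions on fungus will usually score better here
plot( cmp )  # produces the kind of deviance plot described on this slide
```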
So let me label each point one at a time, so you understand what's going on. Each row is a model, and the horizontal axis is the deviance. You can think of this as badness: it's a sum of errors, so big numbers are bad, small numbers are good. We want models on the left. The filled points are the scores in sample; that's the in-sample fit, the in-sample error. You can see that in sample, the model that includes fungus is better. And there's a penalty. The open points are the scores out of sample, the expected score out of sample after adding the expected penalty, and that is always higher. And then there's this bar, which is the standard error of that score. One of the nice things about Pareto-smoothed importance sampling and WAIC is that you can get an approximate standard error, to understand when models are distinguishable at all on statistical grounds. These standard errors are just approximate, because they require reasonably sized samples, but it's better to have some guidance than none. Then this little triangle with an interval is the contrast, which is how we actually want to compare the models: the contrast and its standard error. And this contrast does not get anywhere near the vertical gray bar in this graph. So the model with fungus is substantially better at prediction, by this metric, than the one that omits it, which sits far to the right. This makes sense. Let me try to help it make sense to you. When I say it makes sense, I mean it makes sense that the confounded model, the one that includes fungus, is better at making predictions than the one that leaves it out. The one that leaves it out was not designed to make predictions; it was designed to make the proper causal inference. When you include fungus, though, you make better predictions, because the fungus is a better predictor. Let me show you that in terms of the data. Here I've plotted, for a particular simulation of this experiment, the fungus status of the plant on the horizontal axis. We've got two categories, and I've jittered the points just so you can see how many plants there are in each category. I've colored the points red and blue for treatment and control, respectively. On the vertical, we've got the growth of each plant, that is, H1 minus H0, how much it grew during the experiment. And you can see that it's easy to distinguish the plants with fungus from those without, because the ones with fungus are a lot lower. They had a lot less growth, because fungus inhibits the growth of the plant. The fungus is a parasite. In contrast, treatment is way less diagnostic of the growth of the plant. So we can rotate this graph: take the treatment out of the color and put it on the horizontal axis, and put fungus status in the color. In the plot on the right now, red indicates fungus on the plant, blue indicates no fungus. Notice that the red points are all lower, or for the most part lower, because fungus is bad. But knowing whether a plant was in the control or treatment group isn't nearly as diagnostic of growth as the fungal status is, right? The treatment does affect the growth of fungus, because notice that there are fewer red points in the treatment group. I'll say that again: notice there are fewer red points, and red points indicate fungus, in the treatment group on the far right. So the treatment works, as we learned from the model that omits fungus. But for predicting growth, what you'd really wanna know is the fungus, not the treatment, yeah?
Because it's a more proximate cause of growth. Okay, what's the point of all this? Do not ever in your life use predictive criteria to choose a causal estimate, okay? That's all. It's a very simple rule. Always follow it and you'll do fine. For causal tasks, use causal tools, like DAGs, generative models, and the backdoor criterion; do not use predictive tools. For predictive tasks, do not use causal tools. It works the other way around too. These tools are all fantastic, they're great achievements, they're signs that the universe is sometimes benign to human life, but you have to use them in the proper circumstances. And again, why am I getting exercised about this? Because you can see, in the best journals, in all the fields, people using predictive criteria to choose among different causal models, and that's bad. Okay, in reality, most analyses are some mix of these things. There are aspects of our models which involve estimating functions, and the shape of those functions is not completely determined by the background science. And so we end up using a mix of causal tools and predictive tools at the same time. So you need to learn how to use them both and respect them in their proper roles. Okay, another use of these predictive criteria, extremely useful and in some sense outside of what you'd expect them to be for, is that they provide diagnostics of things like outliers. There are some points in any particular sample which are more influential than the others. And what do I mean by influential? They have a bigger impact on the shape of the posterior distribution than the other points do. And typically, for most model types, the points that are more influential are the ones that have low probability, that are way out in the tails of the distribution. They're extreme points. Sometimes people use the term outliers for such points. These are observations in the tails of the predictive distribution, okay? They're surprising in a statistical sense, and surprising things in statistics lead to large updates to the posterior distribution. So outliers have a big influence, sometimes called leverage, on the inference. One thing about outliers is that if they have a big influence, it may indicate that your model is mis-specified, that it's missing something important. You probably don't wanna drop the outliers, which is what people often do. The outliers are information, yeah? But there are things we can do to our models to make them more robust to these points, so that the models don't get too excited. It's a kind of regularization that is done structurally, through changing the assumed distribution of the observations. Let me show you what I mean. The basic problem is that the model is too excitable. It's not sufficiently skeptical. Just as narrower priors can often help, wider predictive distributions can often help in making good predictions. Dropping outliers, as I just said, is a bad idea. They didn't commit any crimes. It's the model that's wrong, not the data, yeah? Dropping outliers just ignores the problem. The predictions of your model are still going to be bad. I'll say that again. Dropping the outliers doesn't fix anything. It just ignores the problem. Your predictions will still be bad. The model's wrong, not the data. So the first thing we wanna do is quantify the influence of each point. And it turns out this can be done using cross-validation, or using the intuitions that come from cross-validation. That is, the influence of each point on the posterior distribution.
How much does the posterior distribution shift when we drop or include each observation? That quantifies it. And then we can identify outliers in a principled way, not by calculating standard deviations or eyeballing, which in a complex model with lots of variables won't work anyway, because it's not some simple bivariate thing. So we get a principled measure, and I'll show you that on the next slides. Then the second thing we can do is use a better model, some kind of mixture model or robust regression, and I'll show you in the context of linear regression what that would be like. In the second half of the course, when we use nonlinear models, there will be nonlinear versions of robust regression as well. So let's go back to a previous example, so I don't have to teach you a whole new one. Remember the divorce rate example. Age at marriage is strongly negatively related to divorce rate in the different regions of the United States, and for the most part, the states cling quite tightly to this trend. There really seems to be some strong causal influence of age at marriage on divorce rate, at least demographers think so. Demographers and marriage counselors, I should say; there's no such thing as a divorce counselor, is there? But there are two states which really defy this trend, and in different directions. The first is Maine, which has a quite high divorce rate. In the particular year these data come from, it was the second highest in the United States. That's very unusual given everything else about it; other states like it in all the other ways have much lower divorce rates. And then there's Idaho, in the other direction. Idaho has a very low average age at marriage, but one of the lowest divorce rates in the country. What is the influence of points like these on the inference about the relationship between age at marriage and divorce rate? Well, we can begin by actually quantifying the statistical influence of these outliers on the posterior distribution. Remember, we don't want to just eyeball them and sentence them to death. That's not what we do with outliers. First of all, we need some principled way to say that they are outliers. What I would recommend is that you use the Pareto-smoothed importance sampling k statistic. This is part of the output of using PSIS, and I show you in the book how to do this. It's something that gets calculated automatically in the calculation of the cross-validation score. WAIC provides a penalty term which is very strongly correlated with this k statistic as well. So if you really want to use WAIC, you could do it by looking at the penalty for each point, and the points with big penalties are the ones inducing the most overfitting. Those are the outliers. Yeah, just to show you the scatter plot again: Idaho and Maine have big k values from PSIS and big WAIC penalty scores, and they're the ones that are defying the trend. They're influential points, and the posterior distribution is influenced by these two points much more than by any other point in the sample. So one thing that's almost certainly going on in examples like this is that there are lots of unmodeled sources of variation. That is, the error distribution around the trend is assumed to be a homogeneous Gaussian distribution, but different regions of the country and different states have lots of other influences, and so there's no reason to really think that this error distribution should be constant.
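Here's a short sketch, with the rethinking package, of how you can pull those pointwise diagnostics, assuming m is the divorce-rate regression fit with quap; the model name is a placeholder. This mirrors the approach described in the book.

```r
# Sketch (rethinking package): pointwise outlier diagnostics for a fitted model m.
library(rethinking)
psis_m <- PSIS( m , pointwise = TRUE )   # per-observation Pareto k values
waic_m <- WAIC( m , pointwise = TRUE )   # per-observation WAIC penalties
plot( psis_m$k , waic_m$penalty ,
      xlab = "PSIS Pareto k" , ylab = "WAIC penalty" )
# Influential points like Idaho and Maine show up in the upper right:
# big k, big penalty.
```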
Now, when the error distribution is not constant across the whole sample, lots of things can happen. But one thing we can do to cope with this is to assert that this population, the different regions of this country, is actually a mixture of different Gaussian distributions, all kind of piled together. And what happens when you mix Gaussian distributions with the same mean but different variances is that you get thicker tails. You get something from the so-called Student-t family of distributions, which you may have heard of from the t-test that's common in all sorts of biostatistics and also psychology. So here, comparing a Gaussian distribution to a Student-t distribution with two degrees of freedom, I think it is, you'll see that the tails are thicker. That is, the Student-t is higher than the Gaussian out in the tails. The Gaussian distribution has very thin tails; it's extremely skeptical of anything outside of two standard deviations, and the horizontal axis on this graph is in standard deviations. The Student-t distribution, in contrast, is not so skeptical, and as a consequence, not so surprised by extreme observations, and then, as a further consequence, not so influenced by them. So the Student-t distribution will not be perturbed by outliers nearly as much as the standard Gaussian distribution is, and this is what makes it robust to these kinds of inhomogeneous samples that are under-theorized, with all kinds of unmeasured sources of variation in them. How do you fit the Student-t model? Well, you just swap out normal for Student-t, and you can see the code on the left here. This is the linear regression that we did in chapter five for the divorce rate example. You see D ~ dnorm, that is, the divorce rates are modeled as Gaussian, as normal distributions. And in the bottom model, the t version, we just replace that with D ~ dstudent, the Student-t distribution. I've chosen two degrees of freedom to make the tails thick in this case. And let's compare the posterior distributions for the coefficient of interest. The coefficient of interest here is bA, remember, the effect of age at marriage on divorce. And you'll see that in the more skeptical Student-t model, the posterior distribution shifts down. It doesn't think the relationship is as strong as in the other. Oh, no, sorry, it gets even stronger. It gets further from zero than the other. The outliers were actually pulling the posterior distribution closer to zero in this case. It could go the other way. There's nothing about outliers that has any consistent effect one way or the other; it depends upon where they are and how far they are from the trend. Okay, let me try to summarize that. The basic problem is that there's lots of unobserved heterogeneity among the cases in these sorts of studies. Sometimes that heterogeneity is associated with things we could have measured, but we have no hope of figuring out what they are. And in these sorts of cases, what you end up with is a kind of mixture of Gaussian error in your observations, and thicker tails for the observation model will often protect you against overly confident estimates. Essentially, thick tails mean the model is less surprised. It assigns more probability to extreme observations. It's less confident that everything should be close to some regression relationship, and so it's tugged around less by Idaho and Maine in this particular case.
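As a sketch of what that swap looks like in code, here are the two models side by side in quap, assuming standardized divorce rate D, marriage rate M, and median age at marriage A in a data frame d, following the chapter five setup. Only the likelihood line changes; the priors here are one reasonable choice, not necessarily the slide's exact values.

```r
# Sketch (rethinking package): Gaussian vs Student-t likelihood for the divorce model.
library(rethinking)

m_gauss <- quap(
  alist(
    D ~ dnorm( mu , sigma ) ,
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = d )

m_robust <- quap(
  alist(
    D ~ dstudent( 2 , mu , sigma ) ,   # Student-t with 2 degrees of freedom: thick tails
    mu <- a + bM*M + bA*A ,
    a ~ dnorm( 0 , 0.2 ) ,
    bM ~ dnorm( 0 , 0.5 ) ,
    bA ~ dnorm( 0 , 0.5 ) ,
    sigma ~ dexp( 1 )
  ) , data = d )

precis( m_robust )   # compare bA here with the Gaussian model's estimate
```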
Okay, in the previous slide, I just chose two degrees of freedom for the t-distribution, and maybe that bothers you. Good, if it does bother you; it just shows you're paying attention. We can't really estimate the thickness of the tails in most applied regression examples where we want to use robust regression. We just have to choose the thickness of the tails, and that's what the degrees-of-freedom term in the t-distribution does: it adjusts how much probability is in the tails. We just have to choose it based upon, well, some sort of risk assessment of how skeptical we want to be, or we can try a range of values, see how the estimate changes, and then report them all to our colleagues. That would be the principled thing to do. The reason we can't usually estimate the degrees of freedom is because there aren't enough extreme values. I'll say that again. The reason we cannot usually estimate the degrees of freedom in the Student-t distribution is because there aren't enough extreme values. That's why they're called extreme. Outliers should be rare, right? Most of your sample can't be outliers, and as a consequence, empirically, it's just not plausible for most samples that we're going to be able to estimate the size of the tails. Now, of course, in the really long run, if we had tons and tons of data measured with fine enough precision, we could do so, because it's just an empirical problem. But the thing about processes that have thick tails is that the extreme values are still rare, and they can be very, very large. And so for any particular finite sample, we will not have observed the really extreme values that have big effects on the long-term mean of the process. This is just how it is. This is the universe we live in. Lots of people think, and I tend in this direction, that Student-t regression is actually a very good default for under-theorized domains, because a single homogeneous Gaussian distribution really does have thin tails and really is extremely skeptical of extreme values, and in many cases that's going to do damage to our estimates. Here's an example of a real-world situation of great importance where thinking about extreme values is quite useful. This is a graph that, more than five years ago, was shown on lots of media sites because it was the focus of an academic debate. That debate is not necessarily the interesting thing; it's the statistical principle at the heart of the debate that is valuable to learn. The horizontal axis in this graph is decades, but you can see it's broken down by years within each decade. And the bars on the vertical axis are worldwide battle deaths per 100,000 people. So this graph is meant to give you an empirical overview of the lethality of armed conflict in human societies from the middle of the 20th century into the start of the 21st. And the debate here is whether there's been an outbreak of peace. As you can see, the 40s were bad. There was a certain war then, yeah, World War II. Then, of course, the 20th century saw lots of other conflicts, with lots of fatalities as well, but not as extreme as World War II. And the 90s and 2000s have seen fewer. Now the debate, going forward, is: can we expect this declining trend to continue? And if you believe in the idea that there are thick-tailed distributions, and that this may be one, then this sample tells us nothing, absolutely nothing at all.
And the reason is that, in the long run, battle deaths are mostly determined by rare, really big conflicts like a world war, not by the relatively more peaceful periods between world wars. And so you can't say anything from a short-range sample like this one about long-term trends, because whether a regression line fit to this graph goes up or down is meaningless; it just has to do with the presence or absence of some particular rare event like a world war, which happens to fall at the beginning of the series in this case. Wars are almost certainly thick-tailed, or we should say the lethality of armed conflict is a thick-tailed distribution, because wars are autocatalytic, right? They run away with themselves. World wars happen because more and more combatants join them, and they get bigger and bigger for a while until they're resolved. That sort of process produces thick-tailed distributions, and so this is yet another chance for me to advertise the importance of making generative models. If you wanted to model this process, you would need to think about that sort of autocatalytic effect, and the fact that wars last more than one year, so the years here are not independent of one another, right? They carry on, because there's a bunch of geopolitical forces flowing through this graph. There are many other examples of thick-tailed distributions out there, the most famous one, of course, being investments. Stock prices are also very thick-tailed. They do engage in big fluctuations from time to time, and investors are keenly aware of that problem. Okay, this lecture has been about prediction. Prediction is just as important as inference, and in most applied research, both inference and prediction problems are present, and it's important to realize that they require their own tools, even if you use them together in the same project. There's some good news here. If our goal is to predict the next observation from some process that we have a sample from, there are some handy tools already on your computer to help you do that: tools to calculate the expected out-of-sample accuracy of a range of different models, ways to assess how narrow your prior should be to optimize that predictive accuracy, and so on. These tools, like Pareto-smoothed importance sampling, are real achievements in applied statistics that are in active use every day, both in the sciences and in industry, and so you should make good use of them. Okay, that has been your lecture on overfitting. In the next lecture this week, we're gonna talk about Markov chain Monte Carlo, and that's gonna be important for helping us fit more kinds of models more effectively going forward in the rest of the course, because that's what we're gonna turn to next. I'll see you there.