Welcome to lecture seven of Statistical Rethinking 2022. We're going to rewind a little bit to get started today. Back in the third lecture, in the second week, I showed you this animation as part of a story about how statistical models can produce arbitrarily accurate approximations without actually capturing the causal structure of the phenomenon at all. This is a cartoon representation of the heliocentric or Copernican model of the solar system and how it explains retrograde motion of planets like Mars in the night sky. They appear to zigzag against the background of the stars because of the way orbits are embedded in the solar system: planets like the Earth are faster than planets like Mars and therefore pass it by, creating the illusion that Mars has changed direction in the sky.

This story, though, folds back on itself in an interesting way that we're going to talk about today. The genius who made the heliocentric model of the solar system, Nicolaus Copernicus as his name is usually said in English, found a very powerful structural change to the Ptolemaic model, that is, putting the Sun at the center. And that is of course true: the solar system we live in has the Sun at the center. All solar systems have a star at the center, at least one. But there are other aspects of the model which remain geocentric, as it were, by which I mean they're merely approximations. So here's another cartoon animation of the Copernican model. Now we're looking at the Sun there in the middle and the Earth's orbit as it goes around it. Copernicus, like everybody else at the time, was committed to circles. The sky was full of perfect circles, and so in order to fit the data of the motions of the planets in the sky, he still needed epicycles. So here what you see is that the Earth is on an orbit that goes around a point which is not even the Sun. It's offset from the Sun. That's the little cross in the middle of this diagram, and then the Earth itself is on another orbit on that orbit, an epicycle. But this is actually quite ingenious on Copernicus's part, because it is a method of approximation which explains a very interesting fact about the orbits of the planets: they're a little bit lopsided with respect to the Sun. They're not perfect circles. They're slightly oval, ellipses, and the planets accelerate. They're moving fastest when they're closest to the Sun, and Copernicus was able to reproduce this observation just by using epicycles. But as a consequence, this model is still very complicated and very wrong in a structural sense. There's this great victory in the usual story of this model, that Copernicus figured the whole thing out. But he didn't. The model is still largely just a Fourier approximation of the solar system.

I want to use this story not just to praise Copernicus's genius, of course he was a genius, but to highlight that in scientific modeling we're always dealing with two big problems at the same time. And what are those problems? The first is causal inference: what's the structure of the system that we're studying that causes the phenomena we observe? And the second: for the pieces of the system that we don't have good structural theory for yet, how do we meaningfully approximate their structure so that we can move forward? It's important to work on both of these things all the time, because they're always present in real scientific problems. And if we can't deal with both of them successfully, we can't move forward. The good news is we can move forward.
The bad news is that the tools that help us do causal inference are not always the same as the tools that help us construct good approximations of natural systems. What we're going to talk about today is that second part: how to construct good statistical approximations.

To begin, let's think about all the meanings of prediction in statistical science so we can narrow down the topic. One thing we might do with statistical models is try to find some function that describes the points. And this is often what we mean by fitting. So the example I'm going to use to start this lecture today is pictured on the right of this slide. It's just seven data points, and each data point is a fossil hominin, a bipedal ape that once lived. The horizontal axis is body mass, usually approximated, and the vertical axis is brain volume in cubic centimeters. There's some relationship among these points: larger species tended to have larger brains. And if we were going to describe these points, we'd want some algorithm that could compress them. That's what fitting is. It's a purely in-sample kind of job. It's a different thing to ask what explains the points. That's been the topic of the previous lectures, last week, and we usually think of this as a causal inference issue. If we had a successful explanation of these points, we could explain why they turned out the way they did in the history of these species. We might also ask a related causal inference question about interventions: what would happen if we changed a point's mass? This is a counterfactual thing. It relies upon having a good explanation, but the calculations are distinct. And finally, what we're going to talk about today is something a little bit different. It's forecasting: what is the next observation from the same process? This is one meaning of the word prediction. And what I'm going to show you, and I'm sure you're prepared to hear this, is that this kind of task doesn't require successful causal inference at all. But to do it successfully, to construct successful predictions, you do have to worry about other problems that have to do with how statistical models function.

So let's take these seven points and think about what happens when we leave each of them out of the sample and then fit a function to the remaining six points. This is a way of thinking about prediction. That is, we're going to take one of these species and set it aside. That's step one on this slide: drop one point. So the red point on the right there is the one we've dropped. The black points remain in the sample. And now we're going to fit a line, just an ordinary regression, to the remaining six points. This procedure that I'm going to develop for you on this slide is called leave-one-out cross-validation, and it illustrates some very powerful things about how statistical model fitting works. So bear with me while I develop this narrative. After we've dropped the point and fit the line, we're going to predict the dropped point using the line. So this little dashed red line that I've drawn, and the filled red point on the line, that's the line's prediction of where a species with a body mass a little below 35 kilograms should be in brain volume. It should be near the line. Not necessarily exactly on the line. Remember, it's a regression, so points are scattered. But the expectation is that it's on the line. And we're going to repeat this with the next point. So now we've dropped the second point, and we fit the line again.
There's a new prediction. Now that we've dropped the second point and included the first one, the line tilts a little bit, because the point on the far left pulls on it. And we do this with all the points. We score them, we get predictions, and we draw the dashed red error lines, as it were. The blue line that's showing on the slide now, that's the line you get by using the whole sample, all seven points. And you could score the regression, that is, say how good it is, by using all seven points and the blue line. But that's a bad idea, and that's what I want to show you. Instead, what we're going to do is score this linear regression on the dashed red lines, that is, on the error from the dropped points. So we take each of those dashed red lines, and for linear regression, you square the distance. These squared residuals give you a valid measure of deviance, but only for linear regressions. We'll talk about a more general scoring approach in a little bit. And then you compare. So what I've shown you here is that the in-sample error is 318. That's a sum of squares, for those of you who are regression or ANOVA aficionados. And the out-of-sample score is 619. It's much worse. Now, it's always going to be worse, you shouldn't be surprised by that, but it's typically a lot worse. There's a pattern to this as we change the model and change its flexibility, and that's what I want to show you next.

Before we get to the other functions though, this cartoon example is meant to be instructive, and so I'm hiding some of the complexity in the calculations; all the rest of the complexity is explained in the book. But it's worth focusing a little bit on what it means to do this cross-validation task in a Bayesian way. Of course, Bayesian calculations use the entire posterior; we don't use point predictions. And so the point predictions in that animation are really distributions of predictions, a posterior predictive distribution, just like we constructed in the second lecture in the first week. In that case, the cross-validation score requires a little bit more averaging. The reason we worry about this is because of the density on the right-hand side of this slide. This shows you the log posterior probability of an observation. This is the plausibility of some particular data point from the perspective of a Bayesian model. And you see that it's uncertain. It's not concentrated on a single point. This is a predictive distribution. And you'll notice that it's skewed: the lower tail is much longer than the upper tail. This is quite common in predictions, that the uncertainty is not symmetric. And so you need to average over this whole shape to understand the predictive risk in any particular prediction task. Thankfully, this is pretty easy. There's this thing called the log pointwise predictive density that we use for Bayesian predictive tasks to assess their uncertainty. It's "log" because we work on the logarithmic scale. That's not strictly required, but it makes the calculations much easier, and in the computer it's often essential to use logarithms, for the reason I mention in the book: it makes the calculations more numerically stable. It's "pointwise" because we consider each point one at a time, like in the leave-one-out cross-validation task. And it's a "predictive density", meaning that it's a probability density, but it's constructed on observable values.
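In symbols, the leave-one-out version of this score looks something like this (this follows the book's notation: S samples from the posterior, and theta_{-i,s} a sample from the posterior computed with observation i left out):

$$\text{lppd}_{\text{CV}} = \sum_{i=1}^{N} \frac{1}{S} \sum_{s=1}^{S} \log \Pr\!\left(y_i \mid \theta_{-i,s}\right)$$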
The predictive density tries to predict where observations would be, according to the model's perspective. So a Bayesian cross-validation score using the log pointwise predictive density has to average over two things. It has to average over the n data points, and it has to average over capital-S samples from the posterior. For each of those samples, you compute the log probability of each data point, but you want to do it with the posterior distribution that has omitted that point. That's the whole exercise where we drop the point and then predict it using all the other points, that is, the posterior distribution from the other points. And then this whole thing just averages that log probability. What's the average over? It's over the posterior distribution, and it gets that average because we've taken a bunch of samples from the posterior. We do the calculation for each of those samples, and then we divide by the number of samples. And while this maybe looks unfamiliar as a formula, it's exactly what we've been doing since the first week of this course, where we take samples from the posterior and then do our calculation for each of those samples. We end up with a density that is properly averaged over the posterior distribution. It's a way of doing integral calculus without actually appearing to do integral calculus. There's a lot more detail about that in the book if you want to take a look at it, but you don't need to understand the calculation details to understand the point I'm about to make here.

So now we've got our line back, our cross-validation lines, on the left of this slide, with the in-sample and out-of-sample scores. And I've added a slightly more complicated function, a quadratic function, that is, a parabola: a second-order polynomial model. We do the same thing with it: drop each point in turn, fit another parabola, score the left-out point on the parabola. And then we get an in-sample score, the blue curve that shows at the end of the animation, and an out-of-sample score that is the sum of the error across the left-out points. And you'll see here, the in-sample score for the second-order polynomial on the right is better than the score for the line. It's smaller, and small numbers are good, because these are deviances; they're prediction error. This is typical, for simple models at least: as you add complexity and flexibility to a model, it fits the sample better, and so the score gets smaller. But out of sample it's gotten worse. And this is also quite typical. Not always true, but it's quite typical that more flexible models frequently make worse predictions out of sample, even though they'll fit the sample much better. And this is a fundamental challenge, in some sense the fundamental challenge, of prediction.

Let's build it up and take this to its logical absurdity. Here's a third-order polynomial, a cubic model as it's often called. This function is a lot more flexible than the other two. It's even better in sample, with a score of 201, which is smaller than the other two models shown in the left column. But its out-of-sample error is huge now, much, much worse. And of course this has to do with the last point: you'll see when we predict the last point, the prediction is so bad that it's off the bottom of the y-axis. But this is how polynomial models work, as I explained in an earlier lecture. They often behave very strangely at the limits of the observed data range.
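As an aside, the whole classical version of this experiment fits in a few lines of R. This is a minimal sketch, assuming a data frame d with columns mass and brain (the data here are hypothetical); lm() with squared residuals stands in for the Bayesian fits in the animation:

```r
# Sketch: in-sample vs. leave-one-out error for polynomial degrees 1..5.
# Assumes a data frame d with columns mass and brain (hypothetical data);
# lm() stands in for the Bayesian models used in the lecture.
scores <- sapply(1:5, function(m) {
  fit_all <- lm(brain ~ poly(mass, m), data = d)
  in_sse  <- sum(resid(fit_all)^2)                      # in-sample sum of squares
  out_sse <- sum(sapply(1:nrow(d), function(i) {
    fit_i <- lm(brain ~ poly(mass, m), data = d[-i, ])  # drop point i, refit
    (d$brain[i] - predict(fit_i, newdata = d[i, ]))^2   # squared error at i
  }))
  c(in_sample = in_sse, out_of_sample = out_sse)
})
colnames(scores) <- paste0("degree_", 1:5)
scores
```

The two rows of the result correspond to the blue (in-sample) and red (out-of-sample) trends, up to the details of the Bayesian fitting.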
And we can add two more, a fourth-order polynomial and then a fifth-order polynomial. The pattern continues: as we add more flexibility to the curve, it gets better in sample. The blue score gets smaller. But the out-of-sample error gets bigger and bigger. So by the time we've arrived at the fifth-order polynomial, it almost fits the data exactly. Its in-sample error is only seven. But out of sample it's catastrophically terrible.

Okay, let's think about all these curves together and plot their errors in and out of sample so you can see what's happening. The general pattern to understand here holds for simple models, and what I mean by simple is that their parameters have only simple relationships to one another. They appear in the linear model, but they're not embedded in hierarchical ways like we'll see in the second half of the course. So when we get to that, we're going to have to revisit this topic. But for now, just think about ordinary kinds of regression models without hierarchical structures of parameters. In these models, as you add parameters, the fit to sample nearly always improves. That is, the in-sample error gets smaller. But out of sample it may not improve, and the flexibility to fit any particular sample may actually hurt you in prediction. So what you're seeing here on the right is a plot of the blue and red scores from the previous animations. On the horizontal axis are the models that I showed you, by the number of polynomial terms. So one represents the linear model, two the quadratic model, and so on up to the fifth-order polynomial. And the vertical axis is the relative error. I've just taken those scores and divided each by the sum of all of them, so that we can think of them in terms of relative magnitudes. The blue scores decline, as you see, and that means that the more complicated models describe the sample better and better. They're essentially compressing it, describing it with the parameters. But the red trend shows you that each additional polynomial term you add actually makes out-of-sample prediction worse, catastrophically so at the fifth-order term. It's really quite bad. And this is the basic trade-off that we see: model accuracy has to trade off against flexibility. The function needs to be able to bend and fit the data, but that comes with the risk of something we call overfitting. That is, you can overfit the sample: learn features of the sample which are not regular and recurrent in samples produced by the same process. And when that happens, when we overfit, we tend to make bad predictions.

Seven data points is a very special case, so let me show you how this trade-off works using a slightly bigger sample. So here's, I think, 20 points, using the same variables: mass on the horizontal axis, brain volume on the vertical. And we're going to do leave-one-out cross-validation with these points again, same thing. I'm just showing you two functions here: the linear model on the left, with leave-one-out cross-validation run across it, and then the sixth-order, or sixth-degree, polynomial on the right. You'll see one of the features of flexible models like the sixth-degree polynomial is that the curves fit to each sub-sample, after we leave each point out, can be quite different. So the shadow curves, the gray curves that are left behind in this animation, are quite diverse on the right.
Whereas for the linear model on the left, the lines are nearly always the same. They just pile up on one another, because the model's not very flexible. And this is what you see as a consequence: the sixth-degree polynomial can fit the sample really, really well, much, much better than the straight line. But out of sample, as you anticipate, it tends to be quite bad. So we fit all the models from the first through the sixth degree, and we can look at the general pattern. Often what we see in these kinds of cross-validation tasks is that there's an intermediate model complexity which is best, which makes the best out-of-sample predictions. For very small samples, that'll typically be the simplest model, like it was when we only had seven data points. But now, with a slightly bigger sample, we find that the second-degree polynomial is best for describing these data. This doesn't mean it explains them; it merely means it will be best at pure prediction in the absence of intervention. If you want to predict interventions, as I keep saying in this course, you have to understand something more about causes. But for non-intervention prediction, that is, when you're just standing outside the system and trying to place some bets on the next point from it, you don't have to get the causal structure right.

So on the right of this slide, I'm showing you the relevant blue (error in sample) and red (error out of sample) curves for this larger sample of body mass and brain volume measurements. Let's zoom in on this a bit, because it's hard to see the detail. The sixth-order polynomial is so bad out of sample that it makes the other parts of these trends look flat. So let's drop the sixth-order model from the graph so that we can zoom in and see what's going on. You'll see that the second-order polynomial is best out of sample. That's where the score is smallest. And, of course, it's better in sample too than the first-order model. But it's the score out of sample that we're using to judge these models against one another. Fitting is easy and prediction is hard. That's the slogan here to remember. As we move on to the third-order model, the error out of sample gets worse again, even though, of course, the fit in sample gets better and better with every parameter we add. And this is the overfitting problem.

So cross-validation, like I just showed you, measures overfitting. It gives us a way to compare models in terms of their out-of-sample predictive accuracy, or rather, how we expect them to behave if new data arrive from the same process. And it's achieved by leaving data out and actually trying to predict it. We leave out each point one at a time because we don't want to privilege any particular point. We want to leave them out in turn and use the whole sample that way. But cross-validation doesn't really do anything about overfitting. It finds a model, like the second-order model in the previous example, that does better than the other models. But it doesn't improve the model in any particular way. So how would we actually design models consciously so that their structure is better at prediction? This is a topic called regularization. In the Bayesian perspective, one of the things overfitting depends on (it isn't the only thing) is the priors. Overfitting results from flexibility, and models get their flexibility in a lot of different ways.
One of them is their parametric structure, how those parameters are related to one another and how they can build up curves. And the other thing is the priors and how informative the priors are. With priors that don't contain much information, the model will be much more flexible. But you could also use something called skeptical priors. Skeptical priors have tighter variances, and they reduce the flexibility of a model. And this can actually improve predictions. Regularization in this literature gets its name because we're thinking about the regular features of the process. The process is producing a series of samples, and those samples share certain features, aspects of their distributions and the associations in them, which are consequences of the process that we're trying to predict points from. And then there are irregular features in the samples. A good model, a good approximation, will learn only the regular features. This is the basic challenge of fitting: overfitting first, but we can also underfit. And I want to show you what's going on there with some examples where we change the priors. But in general, we should be skeptical, like Olaf on the right of this slide. We should be skeptical until enough evidence arises to justify increased flexibility. And good priors for prediction are often a lot tighter than people think they are.

Let me show you the example. For these polynomial models, the structure can be represented with a simple sum, which is what I'm showing you on the right of this slide. The plot on the left is just the in- and out-of-sample error that I showed you before. For the models plotted here, the equation for mu, the expected value of the outcome, is some intercept plus a sum over a bunch of terms, where there's a coefficient beta-sub-j for each of the m terms in the polynomial. So m will be one, two, three, four, or five on this slide. And the predictor is raised to that same power j. Now, these beta-sub-j coefficients, we can give them a bunch of different priors. In the example I showed you already, I just gave them Normal(0, 10) priors. These are very wide priors. That's a standard deviation of 10, so a variance of 100. It's essentially flat from the perspective of standardized data. Almost no information at all. What if we tighten this prior? What do you think is going to happen? Let's start out by making it Normal(0, 1). This is a lot tighter: a standard deviation of one, and a variance of one. So we've gone from a variance of 100 to a variance of one. And you might think that this smothers the model. And indeed, in sample, it makes the fit slightly worse, but only slightly. The blue curves overlap one another; you can see at the bottom of this plot that I've plotted them both, but they're in essentially the same place, so you can't tell them apart. But now there are two red curves here. The one that's slightly thinner is the one for the Normal(0, 1) prior. It's thinner metaphorically, because the prior is narrower than the wide one. And you'll notice that this is better. So we've improved prediction out of sample by using tighter priors. That's because we've reduced the flexibility of the models. The ordering of the models is still the same, but we've improved all of them a little bit. So these are better priors. And then we can make it even tighter and get another improvement.
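In symbols, the model for the mean is

$$\mu_i = \alpha + \sum_{j=1}^{m} \beta_j \, x_i^{\,j}$$

and tightening the priors on the betas is a small change in code. Here's a minimal sketch for the second-degree case, in the style of the book's brain-volume models; the variable scaling and prior values here are illustrative:

```r
library(rethinking)
# Assumes a data frame d with columns mass and brain, as in the lecture.
d$mass_std  <- (d$mass - mean(d$mass)) / sd(d$mass)  # standardize the predictor
d$brain_std <- d$brain / max(d$brain)                # rescale the outcome to 0..1
m_reg <- quap(
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)),
    mu <- a + b1*mass_std + b2*mass_std^2,
    a  ~ dnorm(0.5, 1),
    b1 ~ dnorm(0, 1),   # tighter than dnorm(0, 10): a skeptical prior
    b2 ~ dnorm(0, 1),   # widen or narrow these to trace the curves above
    log_sigma ~ dnorm(0, 1)
  ), data = d )
precis(m_reg)
```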
So how about a Normal(0, 0.5) prior: a standard deviation of 0.5, so a variance of 0.25. A very narrow normal distribution. And notice that the fit in sample, at the bottom, the thin blue trend labeled Normal(0, 0.5), is worse. It's higher than the others, which is bad. Remember, the y-axis is error, so low values are good. So we've made the fit to sample worse with this prior, and you might think this is bad if you only assess models by their fit to sample. But if you look at the out-of-sample error, you see they've all come down again, substantially so for the highly complicated models. And now the fourth-order polynomial with the narrow prior is better than all the polynomials with the wide prior. So this is what using regularizing priors, or more skeptical priors, can do for you. It improves the predictions of models, whatever their complexity. There will still be a best model, but you can make it even better if you use skeptical priors.

Of course, you do have to be careful. This slide shows you the problem of overfitting and how we can address it with skeptical priors. But if the prior is too skeptical, if it's completely inflexible, then you get the opposite problem, and it can be equally catastrophic. So suppose we use a Normal(0, 0.1) prior, a very, very tight prior. The thin trends here show you cross-validation executed on the same models using this very tight prior. And they're all very bad. The in-sample fits, unsurprisingly, are much, much worse. But we're not going to score the models by that. It's the thin red trend that starts at the very top here that we're concerned about: all the models are quite bad now. However, notice that the fifth-order polynomial with this very tight prior actually performs better than it did with the loose prior. So it all depends upon the model complexity. But for any particular model complexity, you can make the priors too tight. However, the good news, so you don't stress about it, is that it's always possible to do better than a flat prior.

So how do you choose the width of a prior? Well, for causal inference, it's a different topic. You should use science. You should think about what scientific information you have to constrain the prior, and then probably use a little bit less than that, so that you're not accidentally baking in results. And I should say that's what we've been doing so far: we think about the outcome range of the data, what biologically or physically possible observations are allowed by the model, and we use priors that cover that whole range. So they're a little bit skeptical from a scientific perspective, but they don't bake in any particular result or relationship among the variables. For pure prediction, however, you can actually use cross-validation to tune your priors. Once you understand the general setup, you can do that quite quickly. This is done quite a lot in machine learning, to tune regularization on a task. Of course, we care about both of these things at the same time, because many scientific tasks, as I said at the beginning when I talked about Copernicus, are a mix of inference and prediction. So to remind you of the point of the Copernicus example: the heliocentric system he constructed is trying to do structural causal inference by putting the Sun in the middle, and claiming that that makes a huge difference in getting the predictions right.
But he still had other aspects of the system that he hadn't figured out, and he needed to approximate those. And so avoiding overfitting, for many parts of a statistical model, is essential at the same time that you're working on the causal, structural aspects of the model. You really can't dodge either of these jobs. However, to relax you a bit: you don't have to be perfect. There's no perfect prior. There's no perfect model. You just have to do better than the sort of uninformed, totally flexible approach to approximating a function would lead you to do. Okay, that's a lot. Let's take a break. And I encourage you, before you proceed through the rest of the lecture, to take a look back through the examples, figure out which parts of them were confusing, and make some notes about that before continuing on, because we're going to build from there.

Okay, welcome back. Let's pick up, heading out the door, from the example we had before. What I've shown you so far in this lecture is that fitting in sample is easy, as it were: you increase the flexibility of a model, and the fit to sample gets better and better and better. But for the fit out of sample, typically there'll be some intermediate model complexity, some intermediate flexibility, which does best in prediction. And the distance shown on the left here between the red and blue curves for each model, you can think of as the prediction penalty. It's how much worse a model does out of sample than in sample. What I want you to see by looking at these black line segments is that they get bigger and bigger for all the models as you increase model complexity. And this is quite typical for simple models. Again, later in the course, we'll see models that don't behave exactly this way. But for simple regression models and generalized linear models, this is true on average: as you increase the model's flexibility, it does relatively worse out of sample, relative to its own in-sample fit. The prediction penalty gets bigger and bigger.

Wouldn't it be nice to be able to estimate this penalty some efficient way? Because cross-validation is very expensive. For n data points, a sample of size n, you have to run n models. Why? Because we're leaving out each point one at a time. So for the example I started with in this lecture, of seven points, that's not a big deal. It's just seven regressions. But say you had a really big sample of 100 or 1000; that's a lot of models to run. And the models we've done so far in this course run pretty fast, so maybe that's not so frightening. But it's not unusual to have models that take 10, 20, 30 minutes, or even a day or two, to run. And so it'd be nice to have some fast way to do this. So what if we could compute the prediction penalty of a particular model from a single model fit? What that would give us: we have the in-sample fit, and then we just need to add the prediction penalty to it to get the expected performance of the model out of sample. That's why we care about the penalty, and that's why what we're really trying to estimate is the prediction penalty of a model or a model family. And there's really good news here. There is good news in statistics sometimes. You can do this. There are a range of metrics that let us do this, and they all perform quite well in expectation. I'm going to talk about two of them. The first is a family of strategies called importance sampling, and a particular member of this family called Pareto-smoothed importance sampling, or PSIS. And then, information criteria.
There are lots of information criteria, but we're going to focus on the best one, which is the widely applicable information criterion, or WAIC. So let's talk about importance sampling first. What is importance sampling? Importance sampling is a really slick method for taking a single posterior distribution and using it to calculate the posterior distributions for the so-called holdout models, that is, the models where we have dropped one point. So remember, in cross-validation for a sample of size n, we would normally need n posterior distributions, fit as n models. But it turns out that with importance sampling, you can use the implications of probability theory to fit the model to the whole sample and then, purely in the mathematics, virtually weight out one observation and get the implied posterior distribution if that observation had been left out. This saves a lot of calculation.

So, the key idea, to give you an intuition. There are more details in the book, and this is a huge topic in statistics, not just in Bayesian statistics, but the key idea is easy to grasp, I think. The key idea is that any data point with low probability, according to the model, has a strong influence on the posterior distribution. And so that strong influence can be used to approximate what the posterior distribution would look like if the point had been omitted. Let me give you a visual way to look at that. Let's think about this made-up example here on the slide. We've got some posterior predictive distribution shown in black there, and we've got the four observations that it was constructed from, shown as vertical bars to make them easy to see, but they're just points on the horizontal axis. Each of these points has contributed a different amount of information to the posterior distribution because of its position. So let's think about them one at a time, and imagine doing leave-one-out cross-validation with this tiny little sample of four points. We leave out the point on the left first. I've colored it in gray to signal that I've left it out, and I've recalculated the posterior distribution. And I show you the original posterior distribution as the dashed density there, so you can see where it was. When we leave out the point on the far left, the posterior distribution shifts a little bit to the right. For the next one, the second point from the left, there's almost no change at all. It's essentially the same. This point has contributed very little to the posterior distribution. If it gets left out, we have essentially the same inference as before. The same goes for the third point from the left, the one right in the middle.

What is it about these first three points that leads them to have such a small influence on the posterior distribution when we leave them out one at a time? Of course, if we left them all out, there would be a big change. But if we leave them out one at a time, no single one of them induces a big change in the posterior distribution. And the reason is that they're all highly probable according to the posterior distribution. They're right in the middle. They're not at all surprising to the model. And so when you remove one of them, you haven't removed anything that has taught the model very much. Whatever information it contributed is also contributed by the other points. There's nothing surprising about them. But it's quite different for the last point, the one on the right. We leave that one out.
There's a big change in the posterior distribution now. It shifts a lot, becomes much, much narrower, and piles up on the three points, which are very similar to one another. It's this last point, the one on the far right, that is a highly influential point. It's very important to the posterior distribution, and so when it's omitted, the posterior distribution changes a lot. And we can calculate that, it turns out, from its posterior probability, or rather, the posterior probability of that observation calculated from the entire sample, and then infer the posterior distribution that would arise without it. The calculations are a little bit involved. We talk about them in the book, and there are lots of guides to doing importance sampling, but those details aren't really important to getting the intuition. The intuition is that this is a great thing.

However, importance sampling tends to be unreliable, because we're estimating the new posterior distribution. We don't actually have it; we just have an estimate. And as an estimator, importance sampling can have really high variance. It can be quite unstable, especially in small samples with highly influential points. So there are a number of ways that have been invented to deal with this. The best one, in my opinion, is the tactic called Pareto-smoothed importance sampling, or PSIS. It's much more stable. It has lower variance than vanilla importance sampling, as it were, without modifications. And even better, it provides diagnostics. So when it is unreliable, when it has poor variance, it will tell you. It's nice to have an estimator that complains when it's bad and is honest with us. Pareto-smoothed importance sampling is the child of Professor Aki Vehtari in Helsinki, pictured there. He's a real smooth estimator. You can access PSIS through my rethinking package, and I show you examples in the book. Aki has written a lot of software that makes use of this as well, and it always provides very useful diagnostics. One of the most useful things about it is that it gives you pointwise measures of the importance of each point to the posterior distribution. Some of you will have heard these called high-leverage points, or outliers. We're going to talk more about that at the end of this lecture.

Okay, that's importance sampling. That's one way to get a cross-validation score, an estimate of the model's performance out of sample, without actually doing the cross-validation. Saves a lot of time. The other strategy is to use information criteria. And all information criteria stem from the work of the man shown on the right of this slide, Professor Hirotugu Akaike, who passed away in 2009. It was a real stroke of genius, coming up with the tactic of using an information-theoretic argument to get at what is actually, from an information-theoretic perspective, the correct measure of out-of-sample predictive accuracy, and that is the Kullback-Leibler distance, or divergence, of a model. And this is a very famous result called AIC, the Akaike Information Criterion: for simple models like linear regressions with flat priors, in sufficiently large samples, you can get the expected out-of-sample performance of a model by taking the log pointwise predictive density, multiplying it by minus two (that's just a scaling thing, there for historical compatibility), and then adding two times the number of parameters. That's the penalty term, as it were.
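In symbols, with k the number of free parameters:

$$\text{AIC} = -2\,\text{lppd} + 2k$$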
AIC is famous and has been used a lot, but it's very restrictive, because it's derived under quite strong constraints: flat priors, large samples, and the assumption that the posterior distribution is approximately Gaussian. For simple linear regressions that's fine, but we're going to want to use more informative priors than flat priors, and so AIC doesn't get us anywhere. I honestly think it's of historical interest now, but of course without Professor Akaike's work we wouldn't have the newer information criteria that are better than it. So what do we use instead? Well, there's another one, the widely applicable information criterion, which is much, much more general; that's why it's called the widely applicable information criterion. This one's due to Sumio Watanabe, and it has a much more complicated formula, as you can see on this slide, but the same basic structure: we still have minus two times the log pointwise predictive density, and then there's a penalty term. In this case, the penalty term is a function of the variance of the log probability of each point. It just turns out that this is the theoretically appropriate way to compute the penalty term when the priors can be informative and the posterior distribution can have any shape. In practice, WAIC and Pareto-smoothed importance sampling are extremely similar. Sometimes they're practically exactly the same value, which is a quite incredible result given the quite different ways they're computed. So WAIC is fantastic, and you should feel comfortable using WAIC or Pareto-smoothed importance sampling. One big advantage of Pareto-smoothed importance sampling, though, is that it gives you a bunch of automatic diagnostics, whereas WAIC does not. In practice they perform quite similarly, and it's the diagnostic value of PSIS that makes it my default, at least when I have prediction tasks.

Okay, let's look at an example and see how these things behave. On the left of this slide, I'm showing you the red and blue curves you've become friends with during this lecture. This is the basic cross-validation exercise we did before the break, actually doing the cross-validation; those curves are measured cross-validation curves. On the right, I show you what PSIS and WAIC think of the same models. The blue curve is in-sample fit. We get that just by fitting the curve to the data, so there's nothing special about it; it's the same curve as on the left. And then the red and the purple curves are the out-of-sample accuracy curves as estimated by, not measured by: WAIC shown in red, and Pareto-smoothed importance sampling shown in purple. You'll see they reach the same conclusion in terms of which model is best. That is number two, which is what we found in the cross-validation. But both of them tend to underestimate how bad the fifth model is. PSIS is a little bit more accurate in this particular case, because it's more skeptical. But both of these metrics have the same basic feel for what's going on in this prediction task.

Okay, let's summarize a bit, and then talk about actually using these things; I want to use a couple of examples. WAIC and Pareto-smoothed importance sampling and cross-validation are all trying to do the same thing. They're trying to measure overfitting, as traded off against underfitting, and we have to worry about both. But the point is that these criteria measure overfitting and underfitting.
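To circle back to the formula for a moment: Watanabe's criterion, in the book's notation, is

$$\text{WAIC}(y, \Theta) = -2\left(\text{lppd} \;-\; \sum_{i} \operatorname{var}_{\theta}\, \log p(y_i \mid \theta)\right)$$

where the second term, the summed posterior variance of each point's log probability, is the penalty.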
Measuring, though, is not fixing: these criteria don't really do anything about overfitting. Regularization, whether we do it by constraining priors or by changing the structure of the model (we'll talk about that later in the course), actually does something about overfitting. It manages it, so we can take any particular model, reduce the amount of overfitting it does, and thereby improve its predictions out of sample. Neither of these things directly addresses causal inference. That's a whole different problem; we've talked about it in previous weeks and we'll talk more about it in weeks to come. But all of these things, the predictive criteria and regularization, we want to use because they do useful things for us. We also study them because they teach us general lessons about how statistical inference works, and that's worth just as much, I think, in many cases. So even if you're someone, like myself actually, who doesn't usually do pure prediction tasks. I'm a scientist, and I'm often interested in explanations. Of course, I do have to deal with overfitting and underfitting at the same time, just like Copernicus did, but I don't usually do pure prediction tasks. I don't work in business. I'm not trying to predict what someone wants to watch on Netflix. Nevertheless, studying these things teaches us really important general lessons about how models behave, and that's worth it even if you're not going to use WAIC or Pareto-smoothed importance sampling in your own work.

Let's talk about a couple of examples of how these things behave and what they can and can't do for you. The first thing to really warn you off is the idea that you can use predictive criteria like WAIC to choose a causal estimate. I will try hard to warn you off this, because people do do this, and I don't want there to be any ambiguity that it is a bad idea. So let me give you an example of how it can mess things up. The thing about predictive criteria is that they actually like confounded models: models that contain confounds and colliders, or, as in the example here, that are stratified on post-treatment variables. Those sorts of models lead you to biased causal estimates, but purely predictive criteria will typically prefer them over the models that give you a correct causal inference. I want to show you an example and try to give you some intuition about why.

We're going to return to the plant growth experiment example from last week. I used this example to warn you off stratifying on post-treatment variables. To remind you, we have the DAG on the right of this slide. There's an experiment with 100 plants, and we measure their initial height, h0. We then apply some treatment, an antifungal treatment, represented by the letter t in this graph, and that influences their final height, h1. The treatment works by reducing fungus, and that's represented by F here. So there's a potentially direct effect of the treatment (it might be toxic) and an indirect effect that the treatment has by reducing fungus. And what you saw was that if we stratify by fungus, it makes it look like the treatment doesn't work, because we've blocked the mediating path. That's called stratifying on a post-treatment variable, and it's usually a bad idea. What do criteria like WAIC and PSIS think about this task? I want to show you now. If you've looked at this example in the book, you're familiar with these model notations. On the left, I'm showing you the wrong model, as it were, for getting a correct causal inference.
It has the wrong adjustment set for the graph. Remember the way you decide what's an appropriate adjustment set for this causal model, the DAG in the middle: you name the treatment, in this case t, and the outcome of interest, in this case h1. Our estimand is the total causal effect of t on h1, as shown through all those red arrows. And then you apply the backdoor criterion to find an adjustment set. In this case, the adjustment set is empty. You just want to stratify by the treatment and measure the association between the treatment and the outcome. That's all you need to do. It's dead easy. So the model on the left of this slide has the wrong adjustment set, because it also stratifies by fungus. And this ends up blocking the mediating path; that was the point of this example in the previous lecture. On the right, we have the correct model, with no adjustment set. The correct adjustment set is no adjustment set. And this gives us an identified measure of the treatment's effect on the outcome.

Now let's calculate the expected out-of-sample predictive accuracy of these two models. Oh, sorry, before we get to that: this is just to show you that the incorrect adjustment set gives you the wrong estimate. At the bottom here are the posterior distributions of the effect of the treatment. For the biased model with the wrong adjustment set, you'll see it straddles zero. It looks like the treatment doesn't work. But for the correct one, the treatment definitely works in this example. Now let's compute the Pareto-smoothed importance sampling scores for both of these models. I'm going to explain this weird forest plot in a moment, so hang on. Each row here is one of the models, and I've named them in this graph with the R formula notation that I'm sure you're accustomed to: h1 is a function of h0 and the treatment and the fungus, that's the model in the upper left of this slide; or, on the second row, h1 is a function of h0 and the treatment only, and that's the correct model to use. The points on here represent different things, so let me walk you through it a little. First, the filled points are the score in sample. The horizontal axis here is prediction error; this is often called deviance in statistics. So the filled points are the score of each particular model in sample. That's here to show you the penalty, how big the penalty is, but it's not how we judge the models. The open points, the focus of our interest, are the scores out of sample. That's how we compare the models. There's uncertainty about the score out of sample; we're not sure how well each model is going to perform. It turns out you can get a standard error for the out-of-sample performance from the Pareto-smoothed importance sampling score, and I show you that here with the line segments as well. But of course, what we want is not to ask whether these line segments overlap. Remember, I was quite strict about this in an earlier lecture: we need a contrast. We want to compare the distributions of the predictive error out of sample. And so we can compute the Pareto-smoothed importance sampling contrast between these two models. That's shown with the little delta there and its own line segment, and that's the distribution of the difference between these models.
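In code, this kind of comparison is available through rethinking's compare function. A minimal sketch, where m_wrong and m_right are hypothetical names for the two models just described:

```r
# Sketch: compare the two plant models on estimated out-of-sample accuracy.
# m_wrong stratifies by fungus (the post-treatment variable); m_right does not.
# Model object names are hypothetical; the book shows the full model code.
compare(m_wrong, m_right, func = PSIS)   # or func = WAIC
```

The output includes each model's PSIS score with its standard error, and the standard error of the difference, which is the contrast just described.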
And you'll see that the second model is much, much worse on average. There's really no contest here. The wrong model, the one that gives you the biased inference about the causal effect of the treatment, vastly outperforms the good model, the one that gives you the right inference about the treatment, when we make predictions out of sample. How can that be, that the model that gives us the wrong causal inference outperforms in prediction? The answer is a bit obvious once we think it through, but it's not at all obvious before you're introduced to this topic. So let me walk it out for you.

Why does the wrong model win at prediction? Let's break down the data and color it by treatment in this first graph. On the horizontal axis, in this graph in the middle of the slide, I've got two categories: no fungus, when no fungus appeared on the plant during the experiment, and yes fungus, when fungus did appear. The vertical axis is growth. And I've colored the points by whether the plant was treated with the antifungal or was a control plant. What you can see, of course, is that the plants that had fungus, on the right of this plot, grew a lot less. They're shorter, and most of them are blue; most of them were control plants. But there are some red points, treatment plants, in that group, and they also are smaller on average than the no-fungus plants, because they got fungus. That is, the treatment didn't work for those particular plants. Yeah, they got mildew anyway. And then on the left, there are lots of blue points mixed in there, too. Lots of control plants that managed not to get fungus, because fungus doesn't strike every plant like destiny; plants have their own resistance. And in that case, there's really no difference. Once we look within the no-fungus group, there's really no difference between treatment and control. So this is why stratifying by fungus makes it look like the treatment doesn't matter: you look within each fungus group at the association between treatment and growth, and within the no-fungus group, as you can see, there's no effect. There's no difference between the red and blue points. They overlap a lot. That's the statistical effect that's going on.

We can look at this data another way and think about which predictor, the treatment or the fungal status, is the better predictor of the outcome, the plant's height. And I hope you'll see, with this complementary figure I've put on the right, that it's always the fungus and not the treatment. So now I've sort of pivoted this graph and made the horizontal axis control or treatment, and I've made the colors refer to fungus and no fungus. You'll see that this segregates the data really powerfully, so that within both the control and the treatment group, the red points are at the bottom. They're shorter, because they got fungus, and fungus is actually what's affecting growth here. And the blue points are at the top. So if you had to choose one of these predictor variables, either the presence or absence of fungus or the treatment/control distinction, just for the sake of describing the height of plants, you should always choose the fungus. It's more informative. It's much more strongly associated with the height of the plants, their growth, than the treatment is, because the treatment is imperfect and it's the fungus that's actually affecting growth.
The treatment only works by reducing the probability that a plant gets fungus. It doesn't always prevent a plant from getting fungus; it only provides resistance. This is why the wrong model works better for prediction. Well, to say it's wrong is not really fair. It's wrong for causal inference, because it leads to a confounded inference, a mistaken inference that the treatment doesn't work. But for prediction, you want to use a predictor that is causally, as it were, more proximate to the outcome. The fungus is the more proximate cause. It's the mediator, and it has a stronger association with the outcome, growth, as a result. And so this is why the wrong model for causal inference can actually be better at prediction out of sample, and the Pareto-smoothed importance sampling score, or the cross-validation score, or any information criterion, is going to tell you the same thing: that the treatment is not necessary for predicting the growth of plants. Rather, you'd want to measure the fungus. But that doesn't tell you the treatment doesn't work. That's a completely different inference. So the upshot of this: please do not ever use predictive criteria to choose a causal estimate. Of course, as I keep saying, many analyses are mixes of inferential and predictive chores. For any structural causal model, there will be functions in it that you have to treat as approximations, because you don't have a strong scientific reason to choose any particular function. And so we do often mix causal inference and predictive criteria in the same analysis, for that reason. But you have to do it using your domain expertise as a researcher.

Okay, the last thing I want to talk about in this lecture is the topic of outliers. We met these when we talked about importance sampling, and I want to talk a little bit more about them, because people often discriminate against outliers in quite unfair ways, and I want to show you what I consider to be a much more principled and fair way to deal with them. When I introduced importance sampling, I pointed out that some data points are much more influential than others. These are sometimes called outliers in statistics, because they lie in the tails of the predictive distribution of the model. If a model has outliers, this indicates that the model's predictions are possibly overconfident and unreliable, because the process that's generating observations is producing these extreme points. It doesn't produce them very often, but when it does produce them, the model will always predict them quite badly, and that can make the predictions unreliable. In any kind of risk-management situation, the outliers might be incredibly important, and so being overly surprised by them can be catastrophically bad. Imagine, for example, you're interested in predicting the strength of hurricanes. The damage done by hurricanes is typically quite small, but some storms are really, really bad. They will be outliers in the historical time series of tropical storms. Dropping those outliers, or using a model that considers them extremely unlikely, is a bad idea. You want to take them seriously, because those are the ones you really need to plan for. There are many processes like this, where we're trying to estimate the shape of some process and the outliers are part of it. So the general way to think about this is that it's not the data points' fault that they're outliers. It's the model's fault.
The model doesn't expect as much variation as there is in the process that generates the data. So how do we deal with this? Well, you don't want to drop the outliers. I hope I've convinced you of that. Don't discriminate against the data points; it's not their fault. From a statistical perspective, it's bad to drop outliers because you're just ignoring the problem. The model's predictions aren't going to get any better just when you ignore the points that the model can't predict. I hope that makes sense. I'll say it again: the model's predictions aren't going to get any better just because you've removed the points that the model can't predict. It's the model that's wrong, not the data.

So how do we deal with it? First, it's good to quantify the influence of each point, and Pareto-smoothed importance sampling will do this for you. There are other metrics, but Pareto-smoothed importance sampling provides a good, theoretically sound way to measure influence in a Bayesian context. And then, second, a largely statistical way to deal with this is to use some sort of mixture model, a robust regression. I'm going to show you a simple example that's accessible to you as an alternative to ordinary linear regression, which can be very sensitive to outliers, as I'm sure you've heard.

Let's revisit a previous example that you're already familiar with. This is the divorce rate example from last week. In that example, there were a couple of states that did not fit the trend, outliers as it were. I've colored them in the graph on the right here. They are Idaho and Maine, and they're outliers in different ways. Maine has a very high divorce rate relative to the observed trend among the other points. It has an average age at marriage, but one of the highest divorce rates in the United States. Idaho, in contrast, has one of the lowest, actually the lowest, median age at marriage, and one of the lowest divorce rates. Extremely low. Both of these points are outliers, and both of them are extremely influential on the posterior distribution as a result. We can quantify this a couple of ways. I recommend using the Pareto-smoothed importance sampling k statistic. In the book, I show you the code for extracting this using my rethinking package, but Pareto-smoothed importance sampling is available in lots of scripting languages; you can look at the documentation for your particular case and figure this out. You can also use the penalty term as calculated by WAIC. If you remember the definition of WAIC, at least in a vague sense, the penalty term was computed from the variance of the log probabilities of the points. Sometimes this penalty term for information criteria is called an effective number of parameters. It's a measure of, essentially, the flexibility that each data point induces in the model, a measure of the importance of that particular point to the shape of the posterior distribution. So on the right, I've plotted these two measurements against one another, to show you that they have a general association, with the Pareto-smoothed importance sampling k statistic on the horizontal axis. Maine and Idaho have much, much bigger k values than the other points; they're influential. In fact, when you run this in R, it'll probably give you a warning about these points. And the WAIC penalty terms for each point show the same general pattern.
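Here's a sketch of how to extract both per-point measures with the rethinking package, assuming a fitted model object like the book's m5.3; the pointwise=TRUE argument returns per-observation values:

```r
# Sketch: per-point PSIS k values and WAIC penalties for a fitted model.
# Assumes m5.3, the divorce model from the book, has already been fit.
PSIS_m5.3 <- PSIS(m5.3, pointwise = TRUE)   # includes a Pareto k per point
WAIC_m5.3 <- WAIC(m5.3, pointwise = TRUE)   # includes a penalty per point
plot(PSIS_m5.3$k, WAIC_m5.3$penalty,
     xlab = "PSIS Pareto k", ylab = "WAIC penalty")
```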
These two metrics don't contain exactly the same information. They're not 100% correlated, as you can see. But both of them behave similarly, because they're getting at the same basic thing: these are the points that are dragging the posterior distribution around. They're outliers. And you can see from their positions why: they don't fit the trend.

Okay, what's a mixture model? In these sorts of situations, one of the things that's often going on is that the regression model is simply far too surprised by outliers, because it has this very strong assumption that the variation around the expectation is the same everywhere. But for some particular regions of the United States, the variation might be bigger than in other places. We don't have data to help us predict those differences, but we can still teach our model to expect more variation. One way to think about this is that the predictive distribution is actually a mixture of Gaussian distributions with different variances. That's what I'm showing you on this slide: when you take a bunch of Gaussian distributions with different variances and pile them together into a single distribution, you get something called a Student-t distribution. And the Student-t distribution has thick tails. As I'm showing you here, the black curve is an ordinary Gaussian with a standard deviation of one. The Student-t distribution also has a standard deviation of one, but it has another parameter called nu, which I talk about in the book, and which makes the tails thicker. This means the Student-t distribution expects more extreme points. It's less surprised by them, and it's less vulnerable to making bad predictions.

So in code form, we can take the linear regression for the divorce example, with marriage rate and age at marriage as predictors (this is model 5.3 from the book) and make a Student-t version of it. All you have to do is replace the normal distribution in the first line of the model formula with a Student-t distribution. In my rethinking package, the Student-t density is dstudent. For nu, the parameter that controls the thickness of the tails, I've just used two here. As I explain in the book, in most cases where we're modeling outliers, we haven't observed enough of them to estimate the shape of the tails. So you just have to make some judicious, cautious assumption about how thick the tails are, to reduce your prediction risk, so that you aren't unnecessarily surprised by outliers in the future. Because outliers are rare, it's very hard to estimate the shape of the tails of the distribution in a robust regression. I'm sorry, but that's just how it works, and there's not much to do about it. If you had a lot of data, though, like a financial time series, you probably could estimate the shape of the tails, and you'd find that they're much, much thicker than a Gaussian distribution expects. Lots of processes are like this, because the distribution is not homogeneous: there are a bunch of different distributions with different variances, and we observe them as one distribution, and that results in thick tails.
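Here's that change in code, as a sketch: first a quick demonstration that piling up Gaussians with heterogeneous variances produces thick tails (drawing the precision from a gamma distribution yields exactly a Student-t margin, a standard result I'm adding here for illustration), and then the robust version of the divorce model, reusing the data frame d from above, with the Gaussian likelihood swapped for dstudent and nu fixed at 2.

```r
# scale mixture demo: Gaussians with different variances pile up into a
# thick-tailed distribution; with gamma-distributed precision the result
# is exactly Student-t with nu degrees of freedom
nu <- 2
lambda <- rgamma(1e5, shape = nu/2, rate = nu/2)  # heterogeneous precisions
x <- rnorm(1e5, 0, 1/sqrt(lambda))                # marginally Student-t

# robust regression: same structure as m5.3, but with a Student-t likelihood
m5.3t <- quap(
    alist(
        D ~ dstudent(2, mu, sigma),   # nu fixed at 2 -> thick tails
        mu <- a + bM * M + bA * A,
        a ~ dnorm(0, 0.2),
        bM ~ dnorm(0, 0.5),
        bA ~ dnorm(0, 0.5),
        sigma ~ dexp(1)
    ), data = d)

# compare the marginal posteriors of the two models
precis(m5.3)
precis(m5.3t)
```

In the precis output, the coefficient bA should be more negative and have a narrower interval in m5.3t, which is exactly the comparison discussed next.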
When we compare the inferences of these two models, look at the posterior distribution for the effect of age at marriage. Remember, that's what's driving the trend in this example; it's age at marriage that has the strong effect on divorce rates. In regions of the United States where people get married younger, they tend to divorce more. There's lots of speculation about why, but it's a very robust result. What happens when we use the Student-t distribution, which is less sensitive to outliers, is that this association actually gets stronger: the posterior distribution moves further away from zero, becoming more negative, and it also gets narrower, so its uncertainty is reduced. This is a consequence of the fact that Idaho was polluting the inference, making the model overly skeptical, because the model did not expect a point like that. The Student-t model expects that there may be some particular instances which are influenced strongly by other, unmeasured processes. Its error distribution is wider, so it's less surprised by points that don't fit the trend we're modeling, and this can result in better inferences. And better predictions.

Okay, let me try to summarize that. One of the problems in lots of regression problems, especially with observational systems but also in experiments, is that there's a lot of unobserved heterogeneity. There are many causes of the variation in the data that we have not measured. I'll say that again: there are many causes of the variation in the data that we have not measured. One way to think about this statistically is that the outcome distribution, the distribution of the observable measure, is not a single Gaussian distribution but some mixture of Gaussian distributions with different variances. And what that implies is that the outcome distribution, the predictive distribution, will have thicker tails, and so it will be less surprised by extreme events. There are many applied cases where this is a really good way to proceed, and Student-t regression is a very useful default in many areas of research. Less surprise means possibly better predictions, because your predictive distribution will expect extreme points, which you have sometimes observed, as in the case of Idaho and Maine.

Okay. Let me wrap this lecture up. We've come a long way. The general problems of prediction are many, they come in many different dimensions, and they're related to one another, but each of them needs its own tools to manage. The whole focus of this lecture has been a kind of pure prediction perspective: we want some statistical tools so that we can say what the next observation from the same process might look like. That's the pure prediction task, and there are many, many important research tasks which are like this, both in the sciences and in industry. It's possible to make very good predictions without knowing the causes of things. That is, black box statistical models can work. There are lots of really powerful machine learning techniques that make good predictions without our understanding even how the model works, much less the model providing any explanation of the data. However, you have to manage those black box devices, like polynomial models or neural networks, so that you can optimize prediction. Optimizing prediction doesn't reliably reveal causes, and it comes with its own trade-off between underfitting and overfitting, which we can manage both by measuring it, using cross-validation, and by trying to reduce it, through regularization. Okay, we're in week four, and that was a lecture on overfitting.
Next lecture, I'm going to introduce Markov chain Monte Carlo, which will be an opportunity both to discuss some interesting things about statistical modeling and to lay a foundation, an infrastructure, for the rest of the course, where we'll fit more complicated models going forward. I'll see you there.