Hello everyone! Welcome back! We have lots of exciting things to get through today. I want to pick up with this problem we have of figuring out which kinds of models will make good predictions out of sample, because of course in sample we can always make perfect predictions if we want. Just add some parameters. So where we left off last time, I had said that what we're going to need to do is appeal to information theory, because the way machine prediction works is it follows the laws of information theory. So if we learn a little bit of information theory, it'll give us a reasonable and coherent framework to do this in. And the goal in this first bit is going to be to derive the gold standard way to score a model's accuracy. Any type of model, it doesn't matter, it doesn't have to be Gaussian, anything, all the models we'll do later. It doesn't matter. Frequentist, Bayesian, all of them. There's a single gold standard way to score a model, and it comes from information theory. And I want to motivate that for you in a heuristic fashion this morning so that you don't feel that what's to come is arbitrary.

Okay, what is information theory? Here's the basic problem information theory sets out to address. When we don't know some future event, or some event that's happened but we haven't learned what happened yet, there's uncertainty. Everybody feels that, and every language has a word to express it. There's something we could learn, and when we learn this thing that uncertainty will be removed. The question is, there's some scale of uncertainty. You can be more or less uncertain, and we need some metric for this uncertainty. And information theory is a framework in which we derive a rigorous and principled metric for uncertainty, a way of saying when something is more uncertain than something else. And it turns out there's a unique way to do this.

So take predicting the weather, something we all do. If it's sunny today and we don't know what it is tomorrow, there's uncertainty about the weather tomorrow. We might use various cues from the weather today to make a prediction, and then when we find out what happens tomorrow, some uncertainty will be removed. Let me motivate for you how uncertainty can be different in different circumstances depending upon the statistical distribution of weather events, because this is where we get our metric from. So imagine you're in Los Angeles. Presumably most of you have not lived in Los Angeles. I went to graduate school in Los Angeles, but you know that Los Angeles has no weather. It's just always sunny and slightly smoggy. Actually the air quality is excellent now, because of aggressive environmentalist movements in California. But it's always sunny and about 15 to 20 degrees and it's wonderful, right? That's why the real estate is not affordable. So it's sunny today and I ask you what the weather is going to be tomorrow. It's probably going to be sunny, so you have very little uncertainty. Even though we don't know for sure what's going to happen, the uncertainty is small. Why? Because it's nearly always sunny. If it does rain, you're going to be shocked, and it will be on the news in Los Angeles when it rains. It's like a weather alert: there's some rain. Los Angeles doesn't know how to drive in the rain. I can attest to it. So contrast this with Glasgow. Again, maybe you haven't been to Glasgow, but you have some stereotypes of it and they're probably accurate. It rains a lot in Glasgow.
And in fact there's more rain than not. And as a consequence, if it's raining today and I ask you what it's going to be like tomorrow, well, it's probably rain. You just carry your umbrella all the time, or you just get wet. You get used to being wet all the time, a deep dampness in your soul. And then New York has highly variable weather. It rains. It's also sunny. Anything could happen. A random parade. Just whatever might happen. And so there's great uncertainty about what weather we'll get in New York, in contrast to both Los Angeles and Glasgow, where there's low uncertainty. And that difference arises from the frequency distributions of the different weather events that nature is drawing from in these microclimates.

So a formal way to deal with this, not just for weather but for anything, in particular for communication, was developed by Claude Shannon in 1948. He was working specifically on communication, telegraph and telephone. And he derived this metric that we call information entropy, which I put up here, and I want you to pay attention to the English translation of this equation: the uncertainty H on the left in a probability distribution p, that is, H(p) = -sum over events i of p_i * log(p_i). The p is a vector of the probabilities of different events that could happen. So for every possible kind of weather that could occur, there's a probability it could occur on any given day, and p is a vector of all those weather probabilities. The uncertainty in that distribution is just the average log probability of an event. There's this minus in front, which just makes it positive instead of negative. That's all. It doesn't do any work. It's just the average log probability of an event. And this is a unique criterion. It's not arbitrary at all. In the book I give you more background on this, but basically if you want a reasonable measure of uncertainty and surprise, you have to adopt something that is this or proportional to this. There's nothing about this expression which is arbitrary at all. This is a huge achievement, and your mobile phones in your pockets only work because of it. There's all of this fancy error correction that goes on, Bayesian error correction, in your mobile phones, and 3G and above, all the good new networks, depend upon information theory and developments from it. Encryption of course depends upon all this as well.

So entropy, information entropy, is a measure of the uncertainty in a distribution. You could think of it as the potential surprise. So what's your potential for surprise in a place where the weather is highly variable? It's very low, because you don't have any strong expectations. Surprise is the flip side of entropy. So in Los Angeles your potential for surprise is actually quite high, right? The entropy is low. And vice versa, the other way around. We are interested in this because in applied stats we've got some model of, say, the weather distribution, and we're going to make predictions, and we want some way to score our predictive model on its accuracy in the future. This is our problem, remember. So we can calculate the entropy of our model: there's a probability distribution of the events that the model expects, and that has an entropy. And then there's the entropy of the true distribution, the actual events that will arise. The entropy of nature.
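To make that concrete, here's a tiny sketch of the entropy calculation in R, using made-up weather frequencies for the three cities (the numbers are illustrative assumptions, not data):

# information entropy: H(p) = -sum(p * log(p))
H <- function(p) -sum(p * log(p))
p_LA      <- c(sun = 0.95, rain = 0.05)   # almost always sunny
p_Glasgow <- c(sun = 0.15, rain = 0.85)   # almost always wet
p_NYC     <- c(sun = 0.50, rain = 0.50)   # anything could happen
H(p_LA); H(p_Glasgow); H(p_NYC)           # New York has the highest entropy; LA and Glasgow are low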
And the difference between these two things is what we're trying to minimize. We want to make that difference zero. And in information theory there's a particular way to calculate differences, and it's this thing. It's called the KL divergence. There are some special properties of this, and I want to spend probably the next five minutes talking about it just to give you some intuition. Again, there's more in the book about this. We've got two probability distributions, p and q. I want you to imagine that p is nature. It's true. So p is a vector of probabilities, which are say the frequencies of weather events in New York or Los Angeles or wherever. q is our model, our weather forecast. And if we want to score q on its accuracy in a principled way, what we should do is look at this thing that's called the divergence. The KL divergence: K is for Kullback, Solomon Kullback, and L is for Leibler. Kullback and Leibler were information theorists, contemporaries of Shannon actually, who applied a lot of this work to statistical inference.

And the intuition here is that this divergence from p to q is this thing right here: D(p, q) = sum over events i of p_i * (log(p_i) - log(q_i)). The p in front is just an averaging. We're averaging over the true frequency of events, and then we're taking the differences between the log probabilities of each event. So the divergence from p to q is the average difference in log probability. This has some very important properties. It's a kind of distance, but it's not symmetric. And in your homework the first problem is going to force you to confront this and have fun with it. So you don't have to understand all of this right now. You will after the homework, I promise you. It'll be like rainbows and everything. It'll be fantastic.

It's easy to calculate, so I can demystify this, and then I'll show you some code. And you can use this code in your homework. That's the math version of it; if you have a vector p and a vector q, you can take this function, my function here dkl, and just sum: we sum p times the difference between log p and log q, and that sum is the divergence from p to q. And then I imagine over here a bunch of alternative models q, and I show you how the divergence behaves. It's only zero where q equals p, where the model equals the truth, and it's positive everywhere else. So small values are better. Divergence is bad, right? Doesn't it sound bad? It's divergent. Isn't that a science fiction series? I keep seeing it suggested to me on Netflix or something. I haven't given in to the temptation.

So there are interesting properties about how this behaves, given both p and q, that you'll explore in your homework. Let me give you the cartoon version of it right now. I want you to imagine you're an astronaut and you're leaving Earth, heading to Mars or a Mars-like planet. But you don't really know much about the planet you're heading to, and you're going to use the frequencies of water and land on the Earth as a prediction of what you'll land on on the planet when you get there. Say you can't control your rocket exactly, I know this story is a little bit weird. You're going to land at some random point on Mars and you want to predict whether it's going to be water or land, because it matters to you. And you don't know much about the distribution of land and water on the planet you're headed to.
So you just use your only model, which is Earth. Earth is a high entropy planet in terms of water and land. Why? Because it has a lot of both. It's 70% water, but that's still a lot of land. So you're not going to be particularly surprised by either type of thing when you get to the other planet. You're going to expect both and you plan for both, because if you use the Earth as a model, there's a lot of land and there's a lot of water. Be ready, right? Have floating escape slides, whatever you need to handle it. Now imagine you go the other way. In some bizarre universe, you're leaving Mars and heading to Earth, and you've never been to Earth before, and you use Mars as a model. Now your potential for surprise is very high. The reason is that there's not much water on Mars. Most of it's frozen and underground. And so when you get to the Earth and you discover all this blue liquid stuff everywhere, you'll be very surprised. And that potential for surprise arises from the fact that Mars is low entropy. It's like Los Angeles. It's the Los Angeles of planets. It has a huge asymmetry between the frequencies of the different types of terrain.

And as a consequence, in terms of divergence, the information distance from Earth to Mars is smaller than the information distance from Mars to Earth. Yeah, I know. I'll say it again. The information distance from Earth to Mars is smaller than the information distance from Mars to Earth. Why? Because if your model is the Earth, it expects all kinds of events, and so it's less surprised by any particular event that happens. This means its prediction error is lower on average across a huge number of potential planets in the universe than if you come from Mars, where you're going to be surprised by water all the time. Does this make some sense? Again, there's a homework problem where you'll go through this and I'll ask you to do some calculations.

This is an essential thing about why simpler models work better: they have higher entropy. The distance from a simple model to reality is, on average, other things equal, shorter, because the simple model expects all this stuff more equally. And we're going to use this. This has massive ramifications for how all machine learning works. It really does. Today we're going to use it in a very precise way going forward. And we're going to loop back to this when we do generalized linear models, because the probability distributions we use to build generalized linear models come from the same principle, from the idea that we want to choose distributions that have high entropy, because then the distance from those distributions to the truth will be shorter. This is a weird fact, I know. Give yourself time to soak it in. But all machine learning works this way, regardless of whether it's Bayesian or not. I think it's a really cool thing.

Okay. How do we estimate this in practice? There's a lot more in the chapter about this. You can skim over it with a cup of coffee or a glass of wine, whatever helps. Let me jump to the conclusion here. We want our gold standard way to score the accuracy of a model. The problem is we don't know the truth. You never know p. This is the whole problem. We don't know the truth. We can't handle the truth. It's not there. But we have the model. It turns out you don't need the truth part, because it's just an additive term.
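Here's a minimal sketch of the divergence calculation and the Earth/Mars asymmetry, and of how that additive "truth" term cancels when you compare two candidate models of the same truth. The land/water frequencies are made-up numbers for illustration:

# KL divergence from p (truth) to q (model): the average extra surprise from using q
dkl <- function(p, q) sum(p * (log(p) - log(q)))

earth <- c(water = 0.70, land = 0.30)   # high entropy planet
mars  <- c(water = 0.01, land = 0.99)   # low entropy planet (illustrative number)

dkl(mars, earth)   # truth = Mars, model = Earth: fairly small
dkl(earth, mars)   # truth = Earth, model = Mars: much larger; the Mars model is shocked by water

# comparing two models q1, q2 of the same truth p: the sum(p * log(p)) part cancels,
# so the difference depends only on the models' average log probabilities
q1 <- c(water = 0.6, land = 0.4)
q2 <- c(water = 0.2, land = 0.8)
dkl(earth, q1) - dkl(earth, q2)
-sum(earth * log(q1)) + sum(earth * log(q2))   # same number, no entropy of the truth needed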
And so you can get the relative differences between models without knowing the truth. This is an amazing, benign fact about the universe, I think. And so you can use this thing. The log score is the gold standard way to score models, and log probability scoring is the standard, whether you're Bayesian or not, for how to do it. There are lots of other ways to score models and they're all wrong. They're usually correlated with this, but they're wrong. And R squared turns out to be a special case of this. The problem with R squared is that people use it in sample; if you used it out of sample, it'd be okay. But it turns out to be a special case of the log score, a transformation of it.

Okay. In the book, there's lots of code to show you how to calculate this thing from a model's output. In practice, since we're Bayesian in this course, there's not a single log score; there's a distribution of log scores, because you're used to this fact now, right? Everything in Richard's class has a distribution. You're welcome. And so does the log score. So we want the average log score, averaging over the posterior distribution. And that's this thing which in the literature is unfortunately called the lppd, the log pointwise predictive density. We do the averaging on each point by itself. So what this equation means is: for each observation i, we take the average over posterior samples s of the probability the model assigns to that observation, then we take the log, and we sum across all of the observations. In symbols, lppd = sum over i of log( (1/S) * sum over samples s of Pr(y_i | theta_s) ). There's a whole box in the chapter that walks you through a calculation of this thing, and there's a function in rethinking called lppd which will do it for your model. But it's worth understanding what's going on. This is the right way to do the Bayesian log score.

Okay. Let me show you why all this matters in a practical sense. We're all practical people. The first thing I want to show you is that everybody overfits. This is sort of an inevitable consequence. But we have weapons to fight against overfitting. We can understand it, that's the first thing we need to do, and we can also measure it. Let me show you what it looks like, first of all, this difference between in sample and out of sample. In the graphs to come, the vertical axis is all going to be on this scale that's called deviance. This is the log score times minus two. This is extremely conventional in statistics, and I'm sorry; I just want to prepare you to read other things. Why minus two? Well, there's this thing called the chi-squared distribution that this comes from, in log likelihood ratio tests and stuff. So there's history here, but it's totally arbitrary. It doesn't do anything. It doesn't change the rankings of models. But it means that smaller values are better. So smaller is better. And it can go negative; that's still better. Zero is not a special point on the deviance scale. It's just zero. It doesn't matter. You can go negative. The more negative it is, the better it is. So smaller is always better. It means less divergence. Does that make sense? And now I'm going to walk you through a simulation exercise, what I call the metamodel of forecasting.
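Before the simulation, here is a minimal sketch of that lppd calculation, assuming you already have a matrix of pointwise likelihoods with one row per posterior sample and one column per observation (all the names here are illustrative; the chapter's box is the authoritative version and works on the log scale for numerical stability):

# lppd = sum over observations of log( average over samples of Pr(y_i | theta_s) )
# `like` is an S x N matrix: like[s, i] = probability of observation i under posterior sample s
lppd_by_hand <- function(like) sum(log(colMeans(like)))

# toy example: 1000 pretend posterior samples for a Gaussian model of 5 observations
y    <- c(-0.2, 0.4, 1.1, -0.9, 0.3)
mu_s <- rnorm(1000, 0, 0.1)                          # pretend posterior samples of mu
like <- sapply(y, function(yi) dnorm(yi, mu_s, 1))   # 1000 x 5 matrix of pointwise likelihoods
lppd_by_hand(like)                                   # the Bayesian log score
-2 * lppd_by_hand(like)                              # the same thing on the deviance scale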
Imagine we have two samples from the same generative process. One is our training set; we're going to fit our model on that. And the other is the testing set; that's what we're graded on. They're both going to be of the same size N. So we fit our model to the training sample and we get a deviance D_train. Then we use the posterior distribution from the training sample to compute D_test. We force it to predict the out-of-sample thing. We don't refit it. We make the training sample predict the test sample. And then the difference between the test and train deviance is our measure of overfitting.

Let me show you what this looks like. I'm going to come up with some toy linear regressions here. I'm going to assume some truth. The true data generating process is a simple linear regression where the mean is a function of two predictor variables, x1 and x2, and I fix the coefficients at 0.15 on one and minus 0.4 on the other. The standard deviation of the outcome is one. That's the truth, the thing you can never know. But you know it here, because I'm making it all up. And then I'm going to consider five different candidate models. I'll use flat priors for now; that's going to change later. We'll do one step at a time. We'll start with flat priors. You know I don't like flat priors, but we'll fix that. The first one is just the intercept-only model. Really simple. One parameter. No predictor variables. Then we add an x1, then we add an x2, then we add an x3. There are always more predictors. Just go to the internet. You can find predictors; they're out there. And then x4.

Okay, let's see what happens. Let me show you what happens in sample. That's what we're looking at here. I think this is 10,000 simulations of this train-and-test exercise. Across the horizontal axis are the different models as I showed them before, ranked by their number of parameters. The first model had one parameter, the last one had five parameters. And the deviance scale is on the vertical. Remember, lower is better. That means smaller divergence, the shorter distance to the other planet. So I want to show you on the left. The point is the average across all the simulations, and I think that's one standard deviation on both sides for the bars. The third one is the data generating model. I want you to notice that it doesn't have the lowest deviance on average. The more complicated ones do. This is what I showed you before. If you add parameters, you fit better. So the more complicated model is always going to fit better in sample, at least for these models with flat priors. So you were expecting this. It makes sense. There's a lot of variation from simulation to simulation, that's what the bars are, but you see the trend on average. Now, there is a big jump at three. You've got a clue there. You get a lot of improvement right there at three, and then very little improvement after three. And that is a hint. But we can do better than a hint. We can think about what happens out of sample.

So here's the pattern out of sample now: the black points and black line segments. These are all paired. For each simulation, we've got exactly this difference, and we can look at the error, what's called the generalization error. And now, unsurprisingly, everything does worse out of sample. That's overfitting. Everybody overfits. It's okay. We all do it. This is a safe place. You can talk about it.
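If you want to reproduce this kind of train/test exercise yourself, here's a minimal sketch of one round of it, using plain lm as a stand-in for a flat-prior regression (these details are illustrative assumptions, not the book's exact simulation code):

# one round of the train/test metamodel of forecasting, flat priors ~ maximum likelihood
sim_round <- function(n = 20) {
  make_data <- function(n) {
    x1 <- rnorm(n); x2 <- rnorm(n)
    data.frame(x1 = x1, x2 = x2, x3 = rnorm(n), x4 = rnorm(n),
               y = rnorm(n, 0.15 * x1 - 0.4 * x2, 1))   # the assumed "truth"
  }
  train <- make_data(n); test <- make_data(n)
  deviance_of <- function(fit, dat) {
    mu    <- predict(fit, newdata = dat)
    sigma <- sqrt(mean(residuals(fit)^2))                 # ML estimate of sigma
    -2 * sum(dnorm(dat$y, mu, sigma, log = TRUE))         # deviance = -2 * log score
  }
  fit <- lm(y ~ x1 + x2, data = train)   # one candidate; vary the formula for the others
  c(train = deviance_of(fit, train), test = deviance_of(fit, test))
}
rowMeans(replicate(1e3, sim_round()))    # test deviance exceeds train deviance on average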
And there's a pattern to the amount of overfitting that I think you can see. You'll see that model three is actually best out of sample. It's not a lot better than model one, but it is better on average than model one. You see it? And then models four and five, which were better in sample, get progressively worse out of sample, because they're fitting noise. They're fitting irregular features of the sample that have nothing to do with the generating process. They help you explain the sample, because they just encode the sample for you using parameters, but then when you go out of sample, they've led you astray, and now your predictions are actually worse for having ever noticed those variables exist. Does this make sense? I think this is cool.

Okay. That's for an N of 20. That's a relatively small sample, or I guess that's a standard treatment size in some branches of psychology, right? In a factorial experiment, isn't the convention something like 20? And in anthropology, we're happy to get 20, because the historical record is devastated. We take what we've got. It's even worse. I'm not picking on anybody in particular. And then at N equals 100, the same pattern is there, but it's greatly muted, because even if you've got a wrong model, with a large sample you can very precisely estimate that a predictor doesn't matter, and then it won't hurt you as much out of sample. So that's what's happening now with models four and five. In sample, they're only barely better, and that's because the true coefficients on those predictors are zero, and you have a big sample, so you can estimate them to be really, really close to zero. And then out of sample, you're predicting that those things don't matter, and so they're only slightly worse. The bigger your sample gets, the less overfitting risk there is, because you can get a really good posterior distribution on the true effects, even with an overly complicated model. Does it make some sense? But the pattern is the same. Overfitting is still there.

Yeah? The question is about the magnitude of overfitting: is it the difference between the in-sample and out-of-sample deviance, so the five-parameter model overfits the most? Yes. And in fact, there's a very special pattern here. I wasn't going to mention this till later, but this is a great prompt. There's a very special pattern to the distances between these points, and I haven't highlighted it, to let you see it. But if you think about it, on the left graph here, the distance between the blue dot and the black dot, in each case, you'll see it's growing, and it is approximately twice the number of parameters in each case. This is a super awesome fact, which I'll get to in a later slide. Just hold that in your mind.

Okay, the first thing we want to do is regularize. So we don't want to use flat priors. And I've been waiting for months and months to use the skeptical hamster, so I finally got to put it in a slide. I call him Olaf. I stumbled across this on German Twitter, and I've been waiting for this lecture for a long time to use it. Sorry. So we have to be skeptical, like Olaf. Be like Olaf, and be skeptical of your models. And one way to do that is to build skepticism into how they're trained on the sample, and we can do that through regularizing priors.
And this is what I've been nudging you to do through prior predictive simulations so far in the course: to choose priors which can only produce possible outcomes. There are outcomes that we know are impossible before we've even seen the sample, and if we tune our priors so that they constrain the predictions like that, it helps to regularize. It helps to reduce overfitting. We can often be even more aggressive than that. So let me run you through the example we just did, but now with regularizing priors instead.

We're going to consider three different priors on the regression coefficients. This is the model on the left; it's just a linear regression model. I'm going to have something essentially flat on the intercept, because we're not interested in that. I want to focus on the slopes. And we're going to use different standard deviations on this Gaussian prior on the slopes to induce different amounts of regularization, that is, skepticism about large effects. So three different ones are considered: the Normal(0, 1), which is that dashed curve there; then a standard deviation of a half, which is the middle, more peaked one; and then a standard deviation of 0.2, which is the very peaked one. Form some intuitions about what's going to happen, about which of these will be best.

So here's what happens at N equals 20 again. In sample, I'm showing you all five models with all three different kinds of priors; again, I think this is 10,000 simulations. On average, how do they do in sample? The dots are the previous thing, that's flat priors. The dashed is Normal(0, 1). The thin solid is Normal(0, 0.5). And then the thick solid one there on top is the narrowest prior, the most regularizing, the most skeptical. What happens out of sample? Exactly the opposite pattern. So in sample, the more skeptical prior does worse. Why? Because it learns less from the sample. It's skeptical. Out of sample, the more skeptical prior predicts best, because it ignored irregular distractions in the sample. Cool, huh?

Now, in any particular problem, the pattern might be different, or the regularizing prior could be too strong. I could give you a really, really spiked prior and then you'll learn nothing from the sample, and that's bad too. You can overshoot. But this is the regularization effect. Some skepticism helps you make good predictions. And this is why I always say flat priors are always bad. You can always do better than a flat prior. Just make it slightly not flat. You'll do better, guaranteed.

So let me show you what it looks like with a large sample. With a large sample, it hardly matters. With a large sample, these priors are overcome. They all make essentially the same predictions. You'll see that they're basically stacked. The order is the same, but the differences are tiny on average. You see that? That's because if you have enough data, the regularization isn't doing any powerful work for you, just as I said before. But if you have a small sample, then regularization does a lot for you. In complicated structural models, like multilevel models, we're going to have to revisit this and think about it, because even with really, really big sample sizes, in a multilevel model there are some parameters for which you don't have much data, like variance components. So we'll loop back to that when we get there in a later week. Does this make sense?
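As a sketch of what a regularizing prior looks like in model code, here's one of the candidate models with the most skeptical prior on the slopes, in quap syntax from the rethinking package (the data frame d, the flat intercept prior, and the exponential prior on sigma are assumptions for illustration):

library(rethinking)
m_reg <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b1 * x1 + b2 * x2,
    a ~ dnorm(0, 100),       # essentially flat on the intercept
    b1 ~ dnorm(0, 0.2),      # skeptical: slopes are probably small
    b2 ~ dnorm(0, 0.2),
    sigma ~ dexp(1)
  ), data = d)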
You don't have to understand all of this right now. You're going to read the chapter too, and think about it, and you'll absorb it osmotically and it'll feel great. You'll get this idea. So let me stimulate you a bit by musing. Most of the time in science, we don't do much regularization. In industry, there's a lot of it. There's all of this cross-validation, because they're scored on prediction. They've got to actually drive customers to points of sale and things like that. And maybe they're screwing that up too; my colleagues in data science and industry tell me that all kinds of things go wrong. But they do care about prediction. That's their benchmark: predictive accuracy. In science, we don't usually think about that very hard, and I think there's an interesting sociological set of questions about why. I have no answer to this, but the first reason we don't do it is that we're not taught to. It's not a big part of stats curricula in scientific programs. Functionally, it makes getting significant results harder: if you use regularizing priors, you get fewer significant results, and then you'll publish less and then you go into industry. No, I mean, I've done okay. You can make it without significant results; I haven't had a p-value in a paper in a long time, and I do fine. But when you're junior, you can get a lot of coercion from senior people to do the wrong thing, to not regularize, because it makes it harder to get a flashy publication, to fish for asterisks. And maybe the biggest thing is that we're just not judged on this. Our career success doesn't depend upon the accuracy of future prediction. We're focused on inferences, and we don't have a strong philosophy about how those two things are connected. I don't have a crisp answer for what to do about that, but I just wanted to put this here to give you nightmares. No, I think there are good problems to solve here and lots of tools at our disposal to solve them.

So back on the tour. What I just showed you is the phenomenon of overfitting and that regularization can fight against it. If we regularize, and we regularize correctly, we'll do better out of sample. Is there a quick question? Yeah? Yeah, yeah, that's reasonable too. Sometimes it's iterative: you realize that you have to regularize because you forgot, and then the thing can't converge because there's not enough information. Sometimes regularization is the only way to get an estimate that's reliable. Well, there is that risk. You could prior hack with this kind of model. Absolutely. That's a good question. My plan for the last slide of this lecture is to talk about that, so this is a great prompt. We'll loop back. Let me try to stay on schedule and get through it, but I care deeply about that question. Absolutely.

Okay. So what I just showed you is the phenomenon of overfitting, and that regularization helps against it. But we can actually predict the amount of overfitting as well. Even in an applied case where you don't have the out-of-sample data to predict and score yourself on, it turns out that on the basis of theoretical considerations you can predict how well a model will do out of sample. Now, this is all small world stuff, what I'm about to show you, so you have to be like Olaf. Be skeptical.
But in theory it works really well, and it provides great advice. It gives us a principled way to talk about the complexity of a model in relation to its overfitting risk. And that's what I think it's useful for: scoring models on overfitting risk. Okay. And there are two major families. The first is cross-validation, and the second is information criteria. I want to give you a quick definition of each, and then in the last part of this lecture I want to work through two examples where I take some data sets and show you what happens when you apply these criteria.

Okay, cross-validation. Cross-validation means leaving out some part of your observed sample, fitting the model on the part you didn't leave out, and then predicting the left-out part. If you do this across a bunch of different left-out bits of your sample, and take the average out-of-sample performance on the left-out bits, that turns out to be a really good approximation of the log score of a model. Because you're doing the right thing: you're scoring the model on things it wasn't trained on. That's the whole principle. And as I say, this is the thing that's done quite often in industry, these prediction contests. Sometimes it's done so that you don't have access to the out-of-sample data at all: you have to submit your model to a team, and then they give you back your log score. You have these competitions; the Netflix competition was like that. Anybody here pay attention to that? And you can do this with your own samples as well. I motivated this a bit on Monday when I talked about the difference between an underfit model and an overfit model. You can leave each point out, fit the model on the points that remain, and then predict the left-out point, and you see how much variation there is in this. This is cross-validation. There's a function in the rethinking package to do this for quap models. If the sample's large, it'll take a while, because you're leaving out each point, fitting on all the others, and predicting, over and over again, and then you take the log scores and sum across all of those isolated predictions.

The most common way to do this is to leave one out at a time, but you can leave out big chunks, too, and they have different properties. There's a huge literature on what the optimal size is for what you leave out. There's not a single answer across domains for that, by the way. It depends upon the nature of the data and the phenomenon. But the general idea that you can do this, and that this is a useful metric, is a super cool thing, because this is accessible. This is something that you can do in practice for a large range of different kinds of problems. In a big dataset, though, you're never going to want to fit the model that many times. If you leave one out, which is the most common way to do this, you leave one observation out at a time, so you'll have to fit the model as many times as you have data points. You need that many posterior distributions. That's a lot of computer time, unless you've got a bunch of cores, which we do. In my department, how many cores do we have now? 180 cores or something like that downstairs? You guys use them all day somehow. But maybe you don't have that at your disposal, and you want to do something else. It turns out there are really good analytical approximations of the cross-validation score, with this cool technique called importance sampling.
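Before we get to that shortcut, for concreteness, here's a sketch of the brute-force leave-one-out procedure just described, refitting once per left-out observation. The model, with a single predictor x and outcome y in a data frame d, is an illustrative assumption:

# brute-force leave-one-out cross-validation (slow: one refit per observation)
library(rethinking)
flist <- alist(
  y ~ dnorm(mu, sigma),
  mu <- a + b * x,
  a ~ dnorm(0, 1), b ~ dnorm(0, 0.5), sigma ~ dexp(1))
cv_log_scores <- sapply(1:nrow(d), function(i) {
  fit  <- quap(flist, data = d[-i, ])            # fit without point i
  post <- extract.samples(fit)
  mu_i <- post$a + post$b * d$x[i]               # predictions for the left-out point
  log(mean(dnorm(d$y[i], mu_i, post$sigma)))     # average over posterior samples, then log
})
-2 * sum(cv_log_scores)   # the cross-validation score on the deviance scale; smaller is better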
And these days, the best thing to do is use Pareto-smoothed importance sampling, developed by Aki Vehtari and his colleagues in Helsinki. It's a smoothed, regularized version of the importance-sampling leave-one-out cross-validation score, and it's incredibly accurate. Incredibly accurate. So there's a function in the rethinking package, capital LOO, which will calculate this given a model fit. You just need a posterior distribution, a single posterior distribution, to do it. And Aki Vehtari has an R package, lowercase loo, which will do it and provide a bunch of additional diagnostic information. One of the best things about the Pareto-smoothed importance sampling score is that you get lots of diagnostic information, and in later lectures, not today, we'll talk about that and I'll focus on how useful it is. It can identify high leverage points for you, which is a really useful thing. So we'll have examples in later lectures where I highlight that usefulness.

The other major technique is information criteria, which obviously has some relationship to information scoring. This whole approach really stems from this fellow, Hirotugu Akaike, who came up with the most famous one, AIC, the Akaike Information Criterion. It's an estimate of the KL distance. In theory, you do a Taylor expansion of the out-of-sample KL distance; that's how you derive this thing. And to get an analytical approximation, of course, lots of assumptions were made, and I list those assumptions in the book. The most important one is that you need a Gaussian posterior distribution. That's one of the assumptions in the derivation: you assume all the dimensions of the posterior are Gaussian. And if that's true, then you get a really nice approximation of the out-of-sample log score, that is, the test deviance. It's just the training deviance plus twice the number of parameters: AIC = D_train + 2k. And this is that pattern that appeared on the previous graph with the flat priors, which are another assumption of this, by the way. If your priors are flat and your posterior is Gaussian, the expected distance between in-sample and out-of-sample performance is twice the number of parameters. It's a benign universe that it turned out that way. It could have been 1.7 times, or pi times, or something like that, but it wasn't. It's just twice. The two actually comes from the scaling of deviance; on the information theoretic scale, it's just the number of parameters, and the two is arbitrary. So this is super cool, and it's an incredible achievement, actually.

These days, I think AIC is of historical interest, because it's been eclipsed by the metrics that were developed afterwards. Because of AIC, a bunch of people started working on these problems, on information criteria and predicting out-of-sample performance. And another theoretical statistician, Sumio Watanabe, has developed a new, more capable version of AIC, WAIC, which he says stands for Widely Applicable Information Criterion. People have started to call it the Watanabe-Akaike Information Criterion, but he didn't call it that. Akaike didn't name the thing after himself either; in the original paper it's called An Information Criterion, AIC. But that didn't last very long. People named the thing after the person. So WAIC, call it what you want. Watanabe wants it to be called Widely Applicable, so I'm going to use that until I lose.
I expect fully to lose this battle. And this thing looks complicated. There's more in the text, but actually it's pretty simple. The lppd is that Bayesian log score, the in-sample piece; that's all it is. And the penalty term on the right is the pointwise variance of the log probability of each observation, so roughly WAIC = -2 * ( lppd - sum over observations i of the variance over samples of log Pr(y_i | theta_s) ). And it turns out that penalty is the generalized parameter count that you want once priors aren't flat. So this works for non-Gaussian posteriors, non-flat priors, anything, and it's very capable. In general, the parameter count isn't what's relevant; it's the variance in the posterior distribution, and you get these generalized penalty terms. In models with flat priors and Gaussian posterior distributions, this reduces to AIC: it'll give you the same value as AIC, which I think is also really cool. But in general we're not going to use flat priors, and so we use this. And you'll see that the penalty term often has some interesting information in it. Again, there's a function in rethinking to calculate this for you, and there's also a box in the chapter where I show you step by step how to do it, so you can really understand it by walking through the code. You just need one posterior distribution.

Okay, let's compare these things. Let me show you in simulation again how they do at predicting out-of-sample accuracy. That's their job, right? So we go back to that same simulation tournament I had before, and now we're going to compute these criteria and score them on their error. We know from the simulation the actual prediction error, which I called the generalization error, and WAIC, approximate leave-one-out cross-validation, and true cross-validation are all supposed to predict that generalization error. How close do they get? That's what I'm going to show you here. Same models on the horizontal, same scale on the vertical. There are a bunch of lines, I apologize; I'll walk you through them. Focus on the top lines first. Those are all models with flat priors. The open circles are the actual generalization error, the out-of-sample deviance, in each case, on average, and then each trend line is a different metric for predicting it. The black one is WAIC. The blue dashed is what I'm calling LOOIC, the LOO information criterion, that's the Pareto-smoothed leave-one-out cross-validation estimate. And the solid one is actual cross-validation, where I forced my computer to churn through the full leave-one-out procedure and compute the log score.

As you see, WAIC is getting closer to it, but the difference is really small. LOOIC is an amazing approximation of the true cross-validation score. Remember, that's what it's trying to do: the Pareto-smoothed cross-validation score is trying to tell you the actual cross-validation score you'd get if you really cross-validated, and it does amazingly well. WAIC is trying to tell you the generalization error, and it's doing incredibly well. They're both doing their respective jobs almost perfectly, in a sample of 20. At the bottom, we've got regularizing priors, and you'll see everything get pulled down. The error goes down, because we're predicting better out of sample now that we're regularizing. But the differences are about the same. We're getting really close; a unit difference on the vertical is a tiny amount of error here. Now, this is only on average. In any particular case, you could be really off.
But on average, you're right on the money. It's amazing. It's nice to look at this example, too, not on the absolute deviance scale, but on the scale of the absolute error from the target we're trying to hit. The target we're trying to hit is the out-of-sample generalization error, the out-of-sample deviance. Same lines, same meanings. This is where I can show you that WAIC is slightly more accurate than cross-validation. In theory, it's trying to estimate exactly this thing, and it's a really good theoretical approximation of it. These differences are tiny, though, and they're much, much smaller than the error of each criterion across instances. So these differences are tiny. All of these things work extremely well, amazingly well. And I can show you that at N equals 100, now that the samples are large, all of these criteria work identically on average. They do the same. So there's no big fight to have about which one you use. They all perform basically identically in a wide range of circumstances. When they disagree, that indicates there's some high leverage, highly influential observation, and the different metrics are processing it differently. That's a hint for you to then look carefully at what's going on in the fit. We're going to have examples of that in later weeks, so hang on to that. Disagreement between these metrics is usually an indication that you should trust none of them. Does that make sense? But then there's potential to learn, because you can figure things out about where the high leverage points are.

Okay. Yeah? Is the concept of underfitting at all applicable in this setup? Is there something you can point to and say, that's underfitting? Well, yeah. The models on the left are underfitting. They don't have enough parameters. How do you know that? Because if we add parameters, we do better out of sample. Model three is the best, and models one and two are underfit. They didn't learn enough from the sample. There was extra information left on the table that they did not harvest, and as a consequence, out of sample, they do worse than model three. They're underfit. And then models four and five are overfit. Model three is just right, because it's the true data generating process. You can never beat it.

Okay. I want to give some motivating examples here. How do we use these things? I want to counsel you to avoid model selection. This is not some tournament, you know, a cage match where models enter and only one model can leave. We want to score the expected overfitting of models to understand their properties. And in particular, in the sciences, we usually have an inferential objective, not a prediction objective. Now, if you do have a pure applied prediction objective and you don't want to understand your sample at all, then by all means, just select the model that has the best score, and you can probably do only mild damage with that. But if you intend to intervene in the world as a consequence of selecting a model, then you don't want to use these criteria to select models. You want to use these criteria to understand the models and compare them to one another. So let me give you some examples to reinforce that point.

Okay, let's revisit the fungal experiment example from last week. You remember this. It's a cooked-up example to show you the problem when you condition on a mediating variable.
You can knock out the treatment and lead yourself to believe that something that works doesn't work. WAIC and cross-validation will get this all wrong, in the sense that they will counsel you to do the wrong thing. Why? Because the inferentially incorrect model in this circumstance makes better predictions. Let me try to motivate that for you. In this case, if you look at the example in chapter six, there are three models that I fit to the simulated data. Model 6.6 only has an intercept; it ignores all the predictors. Model 6.7 contains the treatment, which does work. Why do I know that? Because I simulated it. And then the fungus, which is a consequence of treatment, is correlated with the treatment, but not perfectly. The fungus is what actually reduces plant growth, which is our outcome measure. So we fit these three models. The middle one has treatment and fungus. The last one, treatment only. Treatment only is the right model to fit if you're trying to analyze the experiment to figure out whether the treatment works. That's what you should do: you should omit fungus from the model. Remember that lesson? You don't condition on it. Don't block the path. There's a pipe here and you don't want to block it.

But now we compare these three models. This compare function is in rethinking; if you give it quap models, it computes WAIC for them by default. And remember, smaller numbers are better. The top model is 6.7, which contains the fungus. It's doing a lot better than the others. In the text, I walk you very slowly through this table, so you should spend some time with it and get all the info from it. But you can probably see the difference here. This dWAIC column is the difference of each model from the best model, and there's a really big difference between the other two and the top one. Why? The top one will make better predictions, because the fungus is what's causal. It's the proximate cause, and if you knew the fungus, you could make better predictions. But you'll make the wrong inference about the treatment if you do this model comparison exercise, because it tells you, well, you're not going to make better predictions by knowing the treatment. That's true: once you know the fungus, you don't need to know the treatment. But that doesn't mean the treatment doesn't work. So inference about cause and finding a predictively accurate model are separable tasks. They're not the same thing. Yeah? Do you feel sufficiently cautioned? I can tell you're enjoying this. The terror says no. I think this is cool. You're terrified, I can tell. But no, this is great.

So you could do both. You need to do both. But you've got to keep in mind that they're different. If you do model selection, if you just took the smallest-WAIC model, that is no argument that you've inferred the cause correctly, because you could have blocked a pipe, or conditioned on a collider, or any number of other things. And in all of the examples I gave you last week in which you reach an invalid causal inference by doing something like that, all of the invalid, confounded models will make better predictions. Yeah, I know. Happy weekend. We'll have lots of examples. We're going to use this going forward in the course, and I'm going to keep iterating on it, and you'll develop a minimal level of comfort with it. So the question is, would I actually say that even spurious correlations are useful? Yes, even spurious correlations are useful, because there's real information in them.
They're picking up the haunted parts of the DAG, and there's information there that is correlated with outcomes. That's the whole point: this correlation arises, and you'd be a fool not to use it. You'd just be a fool to use it for causal inference. And the distinction matters, because if you intervene in the system, if you go adjust a variable, you won't predict what happens if you selected the lowest-WAIC model, because the confounding really matters there. You go in and suddenly give someone the drug, or give the plant the treatment, and the highest-predictive-accuracy model will not necessarily predict what happens in that case, when you intervene. If it's just nature generating who takes the drug and so on, then the confounded model can do fine. Yeah, is this fine? Is this sinking in? Okay.

All right, I've got about 10 minutes here. Let me see if I can do a decent job of talking about my favorite monkey. So, there's something about Cebus. I know there are primatologists in the audience who will get this joke. All right, I thought it was funny. So there is something about Cebus. Cebus lives a long time: a small-bodied South American primate that can live 40, 50 years in captivity. One of the biggest brains for its body size in the whole primate order. Incredibly clever and diabolical. So in anthropology and evolutionary biology in general, we're interested in understanding a topic called life history evolution: how different characters like brain size and lifespan and maternal care co-evolve. And to understand human evolution, we need to understand why it is we live so long and have big brains and have such a long period of dependency, and whether these things go together. And all the primates are weird in this regard. All the primates, the monkeys and the apes and the prosimians, stand out against the other mammals in having long lifespans and big brains. So there's something to understand by looking at the whole field. So people like me spend a lot of time staring at big tables of primate life history characteristics. It's just what we do as a hobby.

So here's a data set that I want you to consider as an example. There's a whole literature looking at the evolution of lifespan as an outcome. Why does lifespan vary so much across mammals? And it does: primates have long lifespans, rodents have short lifespans, a tremendous amount of variation. And a typical kind of conceptual model you'll find, going back to the 70s and forward, is this idea that, obviously, body mass is a positive influence on lifespan. If you're bigger, fewer things kill you, and your organism invests in living longer. That makes a lot of sense, actually. And brain size might also help you live longer, because it makes you smart, and if you're smart, you can avoid danger. This may be one of the things that leads to selection for brain size, actually, and then it encourages lifespan. And then, just because I'm kind, I put some unobserved confounds on this DAG. Just to remind you that you should always season your DAG with an unobserved confound or two. Imagine a few of them. Feel the flavor of it. We're not going to be able to do anything about those unobserved confounds, but whenever you read a paper about these topics, you should imagine that there are unobserved confounds creating correlations between these characteristics and try to imagine what's going on. So let me show you what happens in this case.
And again, all the code to run through this example is in the chapter. The original dataset has 301 primate species in it. It's called Primates301. After you remove all the missing values for the three variables of interest here, we're down to only 112, unfortunately. There's still a lot of measurement to do in primatology. And we're going to fit three models. The first is log lifespan as a linear function of log body mass and log brain size, so M for mass and B for brain. And that's, as we say, the industry standard model that everybody expects to be the right model in this case. We expect lifespan to be a function of body mass, so if we want to figure out the influence of brain size, we've got to block the back door path through body mass. You've got to include body mass if you believe this DAG. Make sense? Remember all that? Blocking a path? There's a back door from brain to lifespan through mass, so we've got to include mass to block that back door. Okay. And then the two simpler models, because we're interested in the predictive differences. These log scores are like R squared, but they're on the right scale: they're out of sample, and so they give us differences which are informative about the expected predictive improvements of the relative models. How much does it actually help to add a predictor? You can use them this way, cautiously. Again, it's all small world stuff. We actually don't know what would happen; we're not going to generate new primates. So it's a weird thought experiment.

So what happens? Here are the WAIC scores presented graphically. The black dots are the in-sample fits. Unsurprisingly, the more complicated model is the best. It's on the left, which means smallest; that's the best. The open points are the WAIC scores, and you can ignore the bars for a second; those are standard errors. I walk you through all this in the chapter. What I want to show you is that 7.8 and 7.9 are basically equivalent out of sample. They're almost identical in their out-of-sample predictions, and both do a lot better than the model that only has body mass. So there's something about brains going on here. When you see something like this, two models that have different predictors but are almost the same out of sample, you should see it as an invitation to poke inside them, and you can use information criteria to do that poking.

So let me show you the summary of what's happening to the estimated coefficients, to give you some idea. bM is the slope for log mass, and bB is the slope for log brain size. Model 7.10 only has body mass, and it says there's a very positive relationship between body mass and lifespan. And there is, in general, across all of life: big things live longer. Yeah? And 7.9 is the model that only has brain size, and you'll notice it says there's a positive relationship between brain size and lifespan. And there is, across a wide range of organisms. It's not always perfect; there are big things with small brains. But generally, blue whales have massive brains. Huge. Heavier than yours. A blue whale's brain is many times heavier than one of yours. And in the model with both, there's this catastrophic flipping and spinning that goes on. What is happening here? Now body mass is negative, with a wide standard error. So wait, so now suddenly, if we control for brain size, smaller things live longer? Seems weird.
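Here's roughly what those three models and the comparison look like in code, following the chapter (treat the exact priors and names as a sketch; the chapter's code is the authoritative version):

library(rethinking)
data(Primates301)
d <- Primates301
d <- d[complete.cases(d$longevity, d$body, d$brain), ]   # down to the 112 complete species
dat <- list(
  L = standardize(log(d$longevity)),
  M = standardize(log(d$body)),
  B = standardize(log(d$brain)))
m7.8 <- quap(alist(                      # both predictors
  L ~ dnorm(mu, sigma),
  mu <- a + bM * M + bB * B,
  a ~ dnorm(0, 0.1), bM ~ dnorm(0, 0.5), bB ~ dnorm(0, 0.5), sigma ~ dexp(1)
), data = dat)
m7.9 <- quap(alist(                      # brain size only
  L ~ dnorm(mu, sigma), mu <- a + bB * B,
  a ~ dnorm(0, 0.1), bB ~ dnorm(0, 0.5), sigma ~ dexp(1)), data = dat)
m7.10 <- quap(alist(                     # body mass only
  L ~ dnorm(mu, sigma), mu <- a + bM * M,
  a ~ dnorm(0, 0.1), bM ~ dnorm(0, 0.5), sigma ~ dexp(1)), data = dat)
compare(m7.8, m7.9, m7.10)   # WAIC by default; smaller is better
precis(m7.8)                 # shows bM flipping negative once brain size is in the model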
Now, this is something you'll see in lots of published papers that do exactly this regression. The same thing happens. And in 7.8, the brain coefficient is still positive, but with a bigger standard error. So the model with both and the model that only has brain, 7.8 and 7.9, out of sample, WAIC expects them to be about equal. What's going on here? Since I have one more minute, I'm wondering if I have time to explain this. Probably not. So let me do the quick version, and then I think when you come back next week I'll do justice to it. But I want to motivate it, and then you'll read about it in the chapter.

The thing to do here is that you can do WAIC pointwise. For each species in the sample, you can calculate a WAIC for each model. WAIC is separable; it's pointwise. So you can say, for any species, like say a capuchin monkey, which has those life history characteristics: which model is expected to do best out of sample on organisms with those same kinds of covariates? Or you can think about it this way: these are entropy scores, divergence scores. That is, how surprised is this model by a capuchin monkey? And so what I've plotted up for you is the relative surprise of these two models across all of the species in the data set. On the left are species where the model with both predictors does better, and on the right are species where the model with only brain size does better. So on the left, the model with brain plus mass is doing better, and the capuchins are up there. Why? Because they have small brains, but their brains are really big for their body size. So if you don't control for body size, you can't explain their longevity. The model without body size is really surprised by Cebus; it's totally confused by Cebus. The model with body size is not so surprised by Cebus. It can explain the extraordinary longevity of Cebus, because it has a big brain relative to its body. But it doesn't have a big brain in any absolute sense, because they're tiny, right? Condensed evil, right, Brendan? Sorry, they're not evil, but they're aggressive little monkeys. At the other extreme on that same side, I won't talk much about Lepilemur, but they have small brains and extremely short lifespans. So it's the flip case: you'll still be surprised by them if you ignore body size.

On the other side, we've got a bunch of species where ignoring body size actually helps you make better predictions, and this model is less surprised by them than the model with body size. And this includes things like gorillas. Gorillas have really big brains, but they also have really big bodies, and so either one is a proxy for the other, and you make a fine prediction just using raw brain size. And then there are these two in the middle; I talk about them in the text, and I encourage you to read that story.

So what's the point here? It's to say that you can understand how the models perform. The fact that they're expected to be equally accurate out of sample doesn't mean they make the same predictions. So if you look pointwise, you can see exactly how they see the sample, having penalized the accuracy for the overfitting risk of each. So this is a very principled way to inspect and understand your golem. What is going on inside the golem? You look at the pointwise predictions of the different models and see what's going on.
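And here's a sketch of how that pointwise inspection might look with the models above, assuming rethinking's WAIC function with its pointwise argument and that the filtered data frame d still carries the species name column (the exact output format can vary by package version):

w_both  <- WAIC(m7.8, pointwise = TRUE)   # one value per species
w_brain <- WAIC(m7.9, pointwise = TRUE)
dW <- w_both$WAIC - w_brain$WAIC          # negative: the both-predictor model is less surprised by that species
head(d$name[order(dW)])                   # species like Cebus, best explained once body size is included
head(d$name[order(-dW)])                  # species like Gorilla, where brain size alone does fine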
This is also a way to find your high leverage points, which we'll do in follow-up weeks. Okay, I really, really wanted to talk about Cebus and the collider, but I should let you go. So let me hold that for when you come back. I think what the literature has been doing in this case is conditioning on a collider, for about 25 years. So I'll talk about that when you come back on Monday. Let me say we're going to go onwards. You should be patient. You don't have to understand this all at once. It comes with time and practice. It's like learning a foreign language: you can speak badly and still do good things, right? And in time you get better, but you have to be patient with yourself. Homework four is up. I like this homework. You're going to calculate some entropies and understand divergence better, and then you're going to revisit previous data sets and look at the contrast between causal inference and out-of-sample accuracy. Coming up next week we're going to do interactions, deeper dependencies. I'll introduce Markov chains, and this will set us up for lots of exciting things to come. All right, thank you for your indulgence, and I'll see you on Monday.