Hello everyone. Welcome back. This week is my favorite week. Let's begin by talking about Copernicus. Here's the Copernican solar system, reduced down to one planet. We're tracking Mars. The Earth is the blue dot going along its circular orbit, and Mars is the red dot here. One thing I want you to notice is that there are a bunch of circles in here, and the one at the center is the Sun. This is supposed to be the key element of the Copernican Revolution, this heretical idea that we put the Sun in the center: the heliocentric solar system. What I want you to notice, however, and what's rarely said about it, is that this model is completely wrong. Really completely wrong. The planets don't orbit in circles, first of all. Second, every circle here has a different center; things are orbiting at weird offsets. The Earth is orbiting one point, and the Sun isn't quite at that point. And then there's this weird green line. What is that thing? Well, that's the thing Mars is orbiting around. Mars is going around a little circle of its own, centered on an invisible point in space that moves along the green circle. There are epicycles in the Copernican model of the solar system. They have to be there to make it at all accurate, because the planets don't orbit in circles. Copernicus, like everybody at the time, thought that the mind of God set up the solar system, and God likes perfect geometrical shapes, so it's got to be circles. Kepler, too, suffered for a long time trying to make perfect circles work, and eventually gave up and used ellipses. He thought he was going to go to hell for it. In his diaries it's like: oh my God, God forgive me, ellipses. This can't be right. It turned out to be right, at least to some order of approximation. Now, don't get me wrong.
This is an amazing mathematical achievement, this system, and it predicts the position of Mars in the night sky quite well. It predicts the retrograde motion, which you can see with that trace at the outer margin: along the zodiac it's tracing the apparent position of Mars in the night sky. There's a point at which it's going to zig back. It's when the Earth passes Mars, boom, we get that retrograde motion, the zig back. And that was the big mystery about the wanderers, the planets. But here's the thing. This model makes exactly the same predictions as the geocentric model. Exactly. It's no more accurate than the geocentric model of the solar system, which is wrong in its own particular way. They both have epicycles. They're both Fourier series, is what they are: ways of decomposing periodic functions into a series of circles. And no matter what the actual structure of the solar system is, you can construct a geocentric or heliocentric model with a bunch of epicycles that will perfectly predict the position of Mars and its retrograde motion. So why, historically, are we so excited about these models? Well, one thing about the Copernican system that is different from the geocentric model is that it gets the same predictions with fewer circles. So it's simpler than the geocentric system. It still has little epicycles, which from our modern perspective, having launched probes through the actual solar system, seem ridiculous. But it's no more accurate. It's merely less complicated. And complexity of theoretical explanations is a common criterion by which scientists choose among competing explanations. Very commonly, people cite an informal heuristic called Occam's razor, named after this fellow here, William of Occam, immortalized in glass, who lived in the 1300s. And Occam wrote about a bunch of things, as many clergy did at the time.
He was among the most educated people in his region, kept libraries, and worked on mathematics and tons of other things. As far as anybody can tell from the surviving writings, this is the only thing he ever said about this, although in his writings, as I've read in summary, he used the principle quite often without stating it clearly: that plurality should never be posited without necessity. Those of you who took Latin in high school, maybe a couple of you, will be able to decode the Latin there. I want to suggest that this is a pretty goofy principle. It's not that it's necessarily going to lead you astray, but it doesn't give me a warm feeling of confidence about how to choose among theories. And the major problem, I suggest, is that it doesn't bring up the trade-off. Our problem is actually to trade off empirical accuracy against the complicatedness of the theory. In the Copernican and geocentric systems you've got a perfect case where they're equally accurate, and so Occam's razor can be applied. You can say: okay, we prefer the Copernican system because it's simpler, and the two models are equally accurate. But often we find ourselves, in our own research, in a much worse situation. We've got models that differ in both dimensions. They differ both in their complexity and in their accuracy, and then we have to trade those off against one another. In that case the razor is of no help at all, because it only speaks to one dimension. So we need something to replace it. Let me give you an alternative metaphor, also from the classics. I keep reminding you guys that I minored in classics in college, for random reasons that I now regret. I mean, I could have been learning other things at the time. But let's use the voyage of Ulysses to get another metaphor for this. Way back, you folks were forced to read this, maybe? Or maybe you were spared?
Actually, it's pretty good literature if you get a good translation. Anyway, Ulysses goes on this voyage, and as he goes around being heroic, eventually he gets to what we now think of as the boot of Italy and sails through the little strait there between Sicily and the boot of Italy, which I've circled in red. And that little strait is treacherous water. The tides cause heavy currents, and there are rocks that ships have run against over many centuries. In the story of Ulysses, this is the point at which he loses sailors to the many-headed monster Scylla, who lives on the rocks on one side there. He has to sacrifice them, because otherwise he'll lose the whole boat down the whirlpool of Charybdis. He has to navigate between these two threats, and he trades the lives of some sailors to save the whole ship. And then he sails back through and does something different, I forget what, it was a long time ago. But I want to use this as a metaphor. If you think about Ulysses' compass, there's a trade-off involved in navigating between two hazards, and both are bad. Both lead to the choice of bad theories in science. These hazards, metaphorically, are too simple and too complex. There are problems with models that don't learn enough from the data, is the way I'm going to explain it in today's lecture. Those models are too simple. They don't find the features in the data that are representative of the process that generated them. And there are also hazards for models that are too complicated, and that's more subtle to explain. So that will be our major work today: to focus on the complex side, just because it's conceptually more difficult to understand what the hazards of complexity are. I think most of us get the idea of what a too-simple model does. It ignores things, right? And that's the problem.
So I want to encourage you to think of Ulysses' compass instead of Occam's razor. Let's replace our razors with compasses, grow our beards, and be good sailors. And I have to put this in here: mainly in this course I don't say anything about p-values, because there just isn't time. But one thing I do want to say, independent of any general attitude about p-values, whether you want to use them in general or not: you should not use them to choose among competing models. That is not what they're for. Let me present this in a standard kind of way that I think everybody has to agree with. There's nothing about the 5% threshold that optimizes anything about the predictive accuracy of models. It's just a convention. Fisher's original justification of it was, well, it's about two standard deviations, we'll take 5%. It's not a huge endorsement, right? I give you that quote in chapter one of the book, I think, or chapter two. But it is extremely common. I think the most common way scientists decide which variables to include in a model is to pare the model down until all the predictors have asterisks next to them. Statisticians have a tongue-in-cheek, snarky way of talking about this. They call it star gazing. I can't find the primordial reference for the phrase, but in every stats department people know what it means. Star gazing means getting asterisks, right? This is bad news. Don't do it. Why? Because nobody knows what it's actually doing. You're choosing a model, but on what criterion? That it has stars? Basically, you're choosing on the criterion that it can get you published, I think. And maybe you have to do that, but it isn't helping the predictive accuracy of the model at all. So today we'll be trying to replace star gazing with something that first chooses its objective and then finds a procedure of model comparison that meets that objective. That's what we're going to be building up towards.
So we're going to start by exploring more deeply this Ulysses' compass problem, the trade-off between overfitting and underfitting in models, where overfitting is something that too-complicated models do, and underfitting is something that too-simple models do. Then we're going to learn about information criteria. These IC things, of which AIC is the most famous, have become really common in biology journals. Once in a while I see them in social science journals too, but in biology journals they're everywhere now. I think every issue of the top ecology and evolutionary biology journals has AIC articles in it now. That's a big difference from five years ago. More satisfyingly Bayesian are DIC and WAIC. These are all information criteria which aim at the same objective: to help us in model comparison. They measure overfitting, in a sense, and we're going to construct them from first principles as a way to compare models. I'm also going to talk about a complementary strategy called regularizing priors, which I've been mentioning in previous weeks; now we'll actually look at them comparatively. These strategies, information criteria and regularizing priors, are complementary: information criteria measure overfitting, regularizing priors reduce it. So you can use them together, and you probably should, and I'll show you examples of this. Then, of course, for the rest of the term we're going to practice these things in our homeworks, and you'll love it. It'll be tons of fun. And the final task, this will happen on Thursday, is that we'll learn how to use information criteria to logically average the predictions across multiple models, to construct prediction ensembles that guard against overconfidence in any particular model. You might be thinking, why would we do that? The reason is that, just as there's uncertainty about parameters, there's uncertainty about models. So we want to calibrate for that.
We don't want to use just one parameter value from the posterior distribution; we want to use the whole posterior. Just the same, we'd like to use all the models in a set rather than just one. That guards against overconfidence, instead of throwing away some of the information we have. So that's a lot to do. Before I get into the work of it, let me explain it with a metaphor, and we'll come back to this metaphor as we move through. Instead of thinking the goal is to pick a single model, think of fitting multiple models to the same data as being kind of like a horse race. Some models finish faster than others, and of course we'd like to know which horse is fastest. But we also want to know the speeds of the other horses, right? We're not going to kill all the other horses and keep only the fastest one. Well, maybe they do that. Horse racing is a horrible sport. So maybe this is a bad example. But that's not going to happen, right? They're not going to shoot all the others. Although the faster the horse, the more money it's worth, right? And if, say, you couldn't buy the fastest horse, you'd want to buy the next fastest, and so on. But here's the thing. Any particular study is a single race, on a single track, under a single set of weather conditions. So say some particular horse, horse A here, nearly nosed out horse B. Do they say "nosed out" in horse racing? I've already said horse racing is immoral, so none of you is going to admit to following it. But I like horses; they're pretty awesome animals, and this seems cruel. Anyway, A is slightly faster than B, maybe on average, but we only saw this one race. And here's the thing: the distance between their finishing times is informative about what will happen on average.
Over more and more studies, the distribution of those finishing times will give us even better information. So in any particular study, what information criteria are going to let us do is use these relative differences in finishing times, so to speak, to construct a principled probabilistic estimate of the expected performance, on average, in the next race. That will be useful, but it will be heuristic, and there will be a lot of uncertainty about it. That's just the nature of it, just as if you used the finishing times of horses from one particular day. Maybe it was raining on that particular day, and horse A does great on a wet track, but when it's dry horse B will beat it, something like that. The same is unfortunately true of models. So, let's talk about overfitting. That will probably be half of today's lecture, and then we'll get started on information theory, which puts us on the road to information criteria. The problem with parameters is that when you add them to models, they always improve the fit of the model to the sample. I'll say that again. The problem with parameters is that as you add more and more parameters to a model, making it more complex, the fit of the model to the sample always improves. That is true even if the predictor associated with the parameter, say these parameters are beta coefficients, is random noise. You can generate a list of random numbers, add it as a regression predictor, and the fit to the sample will go up. Absolutely guaranteed. So let me spend a little bit of time demonstrating this phenomenon, which is known as overfitting. It's the opposite of what's known as underfitting. Underfitting is quite intuitive, and we'll say less about it, but my goal in this next series of slides is to explain both of them relative to one another.
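To make the "absolutely guaranteed" claim concrete, here's a minimal sketch in Python (the lecture's own examples use R; the data and variable names here are invented for illustration). We fit ordinary least squares with more and more columns of pure random noise as predictors. Adding a column can never increase the residual sum of squares, so R-squared on the training sample can only go up, even though every predictor is meaningless.

```python
# Sketch: adding pure-noise predictors to an ordinary least-squares fit
# can never increase the residual sum of squares, so R^2 on the
# training sample is non-decreasing.
import numpy as np

rng = np.random.default_rng(1)
n = 50
y = rng.normal(size=n)                      # outcome: pure noise
X = np.ones((n, 1))                         # start with intercept only

r2 = []
for _ in range(10):
    # fit by least squares and record R^2 on the sample
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2.append(1 - resid @ resid / ((y - y.mean()) @ (y - y.mean())))
    # add one more column of random numbers as a "predictor"
    X = np.column_stack([X, rng.normal(size=n)])

# R^2 only goes up, even though every predictor is noise
assert all(b >= a - 1e-12 for a, b in zip(r2, r2[1:]))
print([round(v, 3) for v in r2])
```

The point is exactly the lecture's: fit to the sample is a guaranteed non-decreasing function of parameter count, so it cannot by itself be a criterion for choosing models.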
So think of underfitting as the case where a model learns too little from the sample and ignores what we're going to end up calling the regular features of the sample. As a consequence of learning too little, it predicts the next sample quite badly, because it doesn't learn enough about its structure. In contrast, overfitting is what results when a model learns too much from the data. It learns things that are just irregular features of this particular sample, things that will not be apparent in future data; we call those irregular features. Complex models always fit the sample at hand better, because every parameter gives you a chance to grab some more of the noise in the data. But they often predict worse, because they're overfitting the sample; they're learning irrelevant things about it. Our goal is to navigate between these, as with Ulysses' compass. So let me give you a graphical demonstration of this overfitting phenomenon, and then this will lead us into solutions to the problem. I wanted a really simple data set. Here's one. There are only one, two, three, four, five, six, seven data points. Did I count right? Four, five, six, yes. Seven data points. Two variables: this is hominin brain volume in cc's, the volume of your head, so to speak. You're all hominins, I'm sure. And it's plotted against body mass in kilograms. There is a relationship here across hominins. Hominins are extinct pre-human or parallel-to-human ancestors, right? There's a relationship here, but it's complicated. What I've put up here are just species averages. This is the kind of thing I always tell grad students not to do, but for the sake of the example we'll do it. You really want to use individual specimens so you get the noise in there. But this will be sufficient. It's a nice simple data set.
I'm going to go from the simplest model up to the most complex model to predict this, and I want to show you that as the complexity increases, the fit gets better, but the predictions get increasingly ridiculous. You won't even need a formal comparison to realize you wouldn't do this, right? But it illustrates what happens in cases where you don't notice it as easily. The simplest model we might consider here is a simple linear regression, where we just posit a linear relationship between body mass and brain volume. We're going to fit these with lm, just to show you how to do them. We talked about this at the end of last week. Remember, this implies flat priors. We'll talk about the impact of priors later today, if we get to it, but it'll suffice for the sake of the example. We fit this, and there's some vague positive relationship. You knew that by looking at the scatter plot. Okay. But there's nothing about a straight line that makes real biological sense here. In fact, some of you are already thinking: wait, body mass, we should take the log of that, right? Didn't we have a homework problem where we did that? Yeah, you probably should take the log of it. But that's not where I'm going, and it won't help. The problem I'm going to show you would arise no matter what transformations you did to the axes. So let's consider a parabola. Let's just march forward atheoretically and start considering polynomials. Here's the parabola you'd fit to these data, with the gray region showing the 95% bound on the mean of the parabola, of the regression lines. It looks like a weird shape because the uncertainty in some regions exceeds that in others. We fit this the same way.
The only thing to say about this code, and I mention this in the book, is that little I() wrapped around mass squared. That's the "as is" function in R, and it constructs a new variable inline. You need to do that, or the model won't be fit right. This is only a feature of the built-in model-fitting stuff in R, not of my package. But why stop there? The parabola fits better than the straight line. You can see that from the R-squared; we'll do a comparison a couple of slides on. The R-squared has gone up. R-squared is a common measure which I think most of you know. If you don't, I explain it in the book in all the detail you need. People talk about it as the variance explained by the model. We can consider higher-order polynomials, though. We can go all the way up to this one, a sixth-order polynomial, and this will be the most complicated model we consider with these data, because it gives us as many parameters in the linear model as there are data points. Beyond that, you can't get any better. This model, as I want to show you, will give you perfect predictions of the sample. But the next fossil, I think you'll be able to see, is very unlikely to be near the curve that results. That's where we're headed. So here's the linear model again. It has an R-squared of 0.49. We did the parabola; it has an R-squared of 0.54, so it's gone up a little. We can do the cubic. A cubic gets to turn twice, so now we've got a kind of funny S there, and it's doing a little better by this measure. But it thinks that at really heavy body sizes brain volume dives down again. And we can go to fourth order and fifth order. It gets increasingly weird. Notice that the uncertainty around the line is also kind of funny. By the time you get to the fifth order, it's pretty confident, right? Conditional on the fifth-order polynomial, it thinks the curve is right there. It's got to be that one.
This is also a thing I'll keep reminding you about: models can have really narrow confidence bounds, but that's because they take the model for granted. Models never question themselves; that's your job. So yes, those confidence bounds are tight, conditional on this being the right model. If you're committed to a fifth-order polynomial, that's the right one. And we can go up to the sixth-order polynomial, as I said. You can't go any further than this, and the uncertainty now shrinks down and collapses. There's no error variance left to explain, because at this point there's a parameter for every hominin in the dataset. We've just encrypted the data in a different format, one where there's a parameter for every case in the data. This is the best fit you can get, and R-squared is now one up here. But I've changed the scale and put a dashed line at zero so you can see that the curve now actually predicts negative brain volumes. There's this little sweet spot in body mass where, you know, the hominin gets lobotomized. Crazy model. Can't be right. You don't need to do any kind of model comparison to realize this is a bad model. This sort of thing is happening all the time, I think, because people don't plot their data, though most people would never use a parameter per data point, I hope. Although R's lm fits this and returns the results; it's perfectly copacetic with it. But what has happened here? As I said, one way to think about it is that you've simply re-encrypted the data in an alternative format. You've created a function and a set of new data inputs which pump back out the raw data. If you ask this final model, model F down here in the figure, to make predictions, and you give it the same body masses as before, it'll spit the data back out. Exactly. That's what it does. If you give it new body masses, it produces nonsense. Absolute nonsense.
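The ladder of polynomial fits can be sketched as follows, in Python rather than the lecture's R. The seven mass and brain-volume values are the species averages used in the book's version of this example; mass is standardized only to keep the high-degree fits numerically stable. Fit to the sample improves monotonically with degree, and the degree-6 polynomial, one parameter per data point, reproduces the sample essentially exactly.

```python
# Sketch of the lecture's polynomial ladder, degrees 1 through 6,
# fit to the seven hominin species averages.
import numpy as np

mass  = np.array([37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5])  # kg
brain = np.array([438, 452, 612, 521, 752, 871, 1350.0])      # cc

m = (mass - mass.mean()) / mass.std()   # standardize for numerical stability

def r_squared(degree):
    coefs = np.polyfit(m, brain, degree)
    resid = brain - np.polyval(coefs, m)
    return 1 - resid @ resid / ((brain - brain.mean()) @ (brain - brain.mean()))

r2 = [r_squared(d) for d in range(1, 7)]
# fit to the sample only improves with degree, reaching ~1.0 at degree 6,
# where there is one parameter per data point
assert all(b >= a - 1e-6 for a, b in zip(r2, r2[1:]))
assert r2[-1] > 0.999
print([round(v, 3) for v in r2])
```

The degree-6 "model" is just the data re-encrypted as seven coefficients, which is exactly the point of the slide.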
But that's the prediction issue, not the fitting issue. Any measure of retrodiction, of how accurately the model recapitulates the sample, is going to overfit if you maximize it. I'll say that again. Any measure, not just R-squared, of how well the model retrodicts the sample you fit it to: if you maximize that criterion, you will always overfit. You will end up with a model that is maximally nonsensical. You will crash against the monster Scylla. It will eat all your sailors, so to speak. In more material terms, you'll make bad predictions. One way to think about what's going on here, back to the metaphor about learning: underfitting is a problem of insensitivity to the exact data, insufficient sensitivity to the sample. One way to see this is an exercise where we take this data set, with its seven cases, drop them one at a time, and refit the linear regression each time. We end up with seven linear regression lines, each fit to six data points, with a different hominin left out in each case. Those are these seven lines, shown over the raw data. They don't vary much. There's one that varies the most, the one at the bottom, and that's because dropping Homo sapiens changes the fit the most. But mainly this simple linear model is pretty insensitive to the exact data. It's not learning very much from the sample, and that's why it's insensitive. It's also probably not capturing what we need here. Now the more complex model: here's the fifth-order polynomial, fit in the same routine, dropping each hominin one at a time. Why fifth order now? Because with a point dropped there are only six data points, so that's where we max out the parameter count against the cases. And now we get these wildly varying curves. This is actually art, right?
We get a bunch of different curves swinging around all over the place. This is the symptom of overfitting. The model is extremely sensitive, in crazy ways, to the exact composition of the data, because really all it's doing is re-encrypting your data set in another set of numbers. We want something in between, something that captures the regular biological features of these data. That will probably mean giving up on polynomials, right? But does this contrast of overfitting and underfitting make sense? Overfitting is one of those things that all statisticians learn about, and it's emphasized; but in the sciences I think it's often under-emphasized, by accident, I guess. It may be the most famous thing in statistics that scientists haven't heard about: incredibly important in the field of statistics, and not so much in practice. A former student of mine was at a conference once where he tried to tell an economist that the economist had probably overfitted his models, and he asserted this because he'd learned it from me. To be clear, the statement was that as you add parameters, the fit to the sample always goes up. And the economist denied that this was true. This actually happened. It led my former student to get out some paper and start working through it, because, you know, Brett. This was Brett. It's a better story when Brett tells it. This is just to say that the economist is a smart person but wasn't trained in this basic problem. And it is a pragmatic issue in using models to make predictions. We've got a sample; we don't have the future data that we really care about. Or rather, even if you're not going to do forecasting, and you're just trying to understand what caused the sample at hand, there are still irregular features of the sample that have nothing to do with the information you want to learn from it.
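The drop-one-hominin experiment can be sketched the same way; this is my Python rendition of the idea, not the lecture's R code, using the same species averages as above. With one point dropped, a fifth-order polynomial interpolates the remaining six points exactly, so its leave-one-out curves swing far more at any fixed body mass than the refit straight lines do.

```python
# Sketch of the drop-one-row experiment: refit a line and a 5th-order
# polynomial with each point left out, then compare how much the fitted
# curves disagree at one fixed (standardized) body mass.
import numpy as np

mass  = np.array([37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5])
brain = np.array([438, 452, 612, 521, 752, 871, 1350.0])
m = (mass - mass.mean()) / mass.std()

def loo_predictions(degree, at):
    preds = []
    for i in range(len(m)):
        keep = np.arange(len(m)) != i          # drop the i-th hominin
        coefs = np.polyfit(m[keep], brain[keep], degree)
        preds.append(np.polyval(coefs, at))
    return np.array(preds)

at = 0.0  # an interior point between the observed masses
spread_line = loo_predictions(1, at).std()
spread_poly = loo_predictions(5, at).std()

# the degree-5 fits interpolate their six points exactly, so they are
# far more sensitive to which hominin was dropped
assert spread_poly > spread_line
print(round(spread_line, 1), round(spread_poly, 1))
```

High leave-one-out sensitivity is exactly the symptom the slide shows: the overfit model has memorized the sample, so changing one case changes everything.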
So this is the way we think about it in statistics. What you want are the regular features of the sample, the features that reoccur because of the underlying process that generates the data. That's usually what we're after. There are a bunch of strategies, all of them useful in various contexts, for coping with this and discovering the regular features of the sample. One of the oldest, and still very useful, is called cross-validation. We're not going to do cross-validation here, but cross-validation means holding out some of your data, fitting on the rest, and then evaluating your models by their accuracy in predicting the data you held out. The model hasn't been trained at all on the data you set aside, but that's where you make the models race. You make the horses race on the left-out track, so to speak: train them on one track, race them on another. Cross-validation isn't used a whole lot, because it means you have to hold some of your data out. You can also do cross-validation within your whole sample, as I showed a few slides back, by leaving one data point out at a time; you've still got the problem there that the samples don't vary a whole lot. There's a big, interesting literature on how to optimally cross-validate in different situations, and it's worthwhile if you take the time to get into it. But instead I'm going to focus on regularizing priors and information criteria this week, and we're going to use those throughout the rest of the course. Regularizing priors are a way of making a model relax and not learn as much from the sample. We've been using them in an ad hoc way so far; we'll be more principled about it starting this week, and do some comparisons so you can see the effect. And in parentheses there I have the phrase penalized likelihood.
In likelihood-based statistics, and most non-Bayesian model-based statistics is in this likelihoodist framework, they do the same thing. They use regularizing priors, but they're not called regularizing priors; they're called penalized likelihoods. Analytically, it's exactly the same thing. So as you learn the Bayesian view of it, you will simultaneously be learning how people who don't do Bayesian statistics do the same things. Penalized likelihood is big in machine learning for the same reason: you've got a lot of data, and big data has big overfitting problems, so penalized likelihood is a big deal there. Information criteria are a way of measuring overfitting, and we're going to spend most of our time constructing them. Then, finally, and we won't be doing much of this in class: science is a great thing to do, and you should do as much of it as possible. The important thing about science is that you do it in public, so that your mistakes don't only help you; you learn from your own mistakes and successes, and everybody else learns from them as well. I want to assert that most of what's wrong with the practice of statistics in science isn't what's done with the statistics; it's that people want the statistics to do more than they possibly can. No particular study or statistical procedure can tell you what's true. That's hard, and it's a really big burden to put on individual methods or the people who employ them. We always need iterative group learning. Any particular study can be misleading. There is no substitute; there is no mathematical procedure which will tell you what's true. Yet in the long run we've learned some stuff in science, right? A little bit. In every field. The history of chemistry, I was telling someone recently, is littered with discovered elements that turned out not to be real. Hundreds of them, actually.
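Here's a minimal sketch of the penalized-likelihood idea, in Python with made-up data. Ridge regression adds a penalty of lambda times the sum of squared coefficients to the least-squares objective, which is analytically the same as putting a Gaussian prior centered on zero on each coefficient: a larger penalty corresponds to a tighter prior, and the estimates learn less from the sample.

```python
# Sketch: ridge regression = penalized likelihood = Gaussian regularizing
# prior on the coefficients. Stronger penalty -> more shrinkage toward zero.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 8
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(size=n)   # only the first predictor matters

def ridge(X, y, lam):
    # penalized least squares solution: (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
# stronger penalty shrinks the coefficient vector toward zero
assert all(b < a for a, b in zip(norms, norms[1:]))
print([round(v, 2) for v in norms])
```

At lambda = 0 this is ordinary least squares (flat priors); increasing lambda is the "relax, don't learn so much from the sample" knob the lecture describes.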
And so, you know, chemistry looks complete and beautiful right now, but there was a time when it looked like anthropology. I only pick on anthropology because I'm an anthropologist, and because it's easy. But no, I try to be constructive and say that science is a population process. If we frame it as if it's up to us individually to do everything right, so that we learn everything from our own studies, that just can't work. That isn't the way it happens. Nevertheless, we want to do the best job we can, of course, with the data at hand. Okay. So let's start building up information criteria. The typical way information criteria are taught is diagrammed as on this slide. There's this sunny beach resort of information criteria that's going to solve problems for us, and we'd like to go there. "I want to go to there," to quote Tina Fey. You know that phrase. Usually, and I know this because I learned about information criteria by reading about them, in most books you'll get this: we start with a problem, you'd like to compare models; hey, information criteria, here you go, use them. And in between there's just this arrow. And the question is: but how? Where do these things come from? If you're like me, you worry about these things, perhaps too much. So we're not going to take the straight highway to the beach. We are going to get to the beach, I promise you; on Thursday we will arrive at the beach, and we will have mai tais or something. But we will instead go by a dark road. I'm going to put you all in the back of my beat-up pickup truck, and we're going to go along this dark, twisted road with no other cars in sight. At some point, I'm going to suddenly swerve off the road onto a dirt track.
And then we're going to pull up to this dark cabin, which has lots of fresh dirt behind it. And we're going to spend the night. And I promise you, in the morning, you will all wake up alive—really, you will not be zombies or anything like that. And then we'll continue on to the beach. But along this path, we will talk about information theory and information criteria, and you will learn why, when we get to the beach, you can use these things, and you'll be wiser for it. Okay, so this is to say, there's a long and treacherous road to go here. And I'm going to present it in the most palatable way I can. But there are many steps to the argument. So let me outline them for you. We're on the road to information criteria. We're going to talk this week about AIC, DIC, and WAIC. And here are the steps we need to go through on this dark road through the woods. First, we're going to have to decide what a good target is for a model. So when we say that one model is better than another, what is the criterion we should use for that? What defines a good prediction? And that is actually a really subtle problem. There's a lot more in the book about this than I have time to do in lecture this week. And I encourage you to read it, because there's a big literature on this. And it's easy to go wrong. I think frequently people quote the wrong thing. Like average predictive accuracy, which I talk about in the book, is almost always the wrong criterion for a model. I won't have time in lecture to go over that, but it's worth focusing on. Then we're going to deal with how we measure the distance of a model from that target. And this is also subtle, because prediction can be very complicated and have a bunch of dimensions to it. Then we need a way to actually estimate that distance. That'll be the next step on this road. I think that's when we get to the cabin. That's when we reach that point.
And then we're going to need to deal with the problem that overfitting is still there, no matter what criterion we pick and what way we measure the distance of a model from the target we like. We still need a way to deal with the fact that models get excited by the sample. And it turns out, in the realm of pure thought, you can estimate the overfitting of a model. And that's where we end up at the sunny beach of information criteria. Very quickly, as I said, there's about three pages on this topic in the book that I simply don't have time to do this week in lecture. So instead I'm just going to assert the answer. The target we want—if you want to define truth, in italics, because that's the only way I can; as an anthropologist, you have to use truth ironically, so the italics mean irony—the truth is the real joint probability of events. Think of it this way: if we knew what was going to happen in the world, given any particular information state we're in, we would know the probability distribution of events that would arise. That would be the truth. It's the best we could possibly do. That's the truth we want. And it's joint probabilities, not average probability. And I say more about that in the book. So truth defines a probability distribution. If we had the perfect model, we would know that probability distribution, but of course we never will. Our model defines a different probability distribution, and our job then is to figure out how far our model's predictions—the probability distribution it defines—are from the true probability distribution of events in this context that we're interested in. This is subtle because it's a distribution. So distance is a weird thing. How do you measure the distance of a distribution from another distribution? This is complicated just to begin to think about. At the bottom of the slide here, I present this weather forecasting metaphor that I use in the book.
In this case, we only have to predict rain or shine, but there are many kinds of weather. And as there are more kinds of weather you'd like to predict, the probability distributions to predict them get more and more highly dimensional and complicated. So the way we measure distance between probability distributions must take into account the complexity of the prediction task, in subtle ways. Another metaphor I use in the book—remember, metaphors are dangerous, but we can't think without them—is to say: if you were an archer and you were trying to hit a target, there's a two-dimensional problem initially. That is, you want to hit the bullseye. So there are two dimensions in which you can be off from the target. But now imagine we also have a stopwatch, and we want you to hit the bullseye at exactly the right moment. Now there's a third dimension. So measuring the distance then means you have to take account of an extra time dimension. How can you do this sort of thing? There are a number of answers, and we're going to pursue the one that leads us to information criteria. And that path is: we're going to construct a way to talk about the distance between two probability distributions as constructed by a field of work called information theory. So let's begin with the definition. What is information? Sadly, the technical literature uses this word, which is used in common language, right? Information means a bunch of different things, and it's just horrible that statistics is full of common-sense words given very precise definitions—like significant, right? Bias. It's just horrible. It's like the plague of the sciences that these terms are used this way. Information is one of these. So you do have to be careful. There are a bunch of different definitions of information in the sciences, and we're going to focus on this particular one, which leads to a particular tradition called information theory.
Information is defined in this tradition as the reduction in uncertainty caused by learning an outcome. And let's think about this in the context of weather prediction, because I think it's probably the only thing everybody's familiar with. Say today it's sunny. This is a poor choice, right? Because it's not sunny today. Say today it's sunny, and your job is to say what's going to happen tomorrow. Information: there's some degree to which you're uncertain about what's going to happen tomorrow. And information is the reduction in that uncertainty when you wake up tomorrow and you find out what the weather really is, right? That's what it is. So now you're still like, well, okay, how do I measure that? Yeah, that's where we're going. We're going to measure it. It turns out there is a principled way to do this. There are actually a number of principled ways to do it. So, yeah, it turns out to be sunny. Yeah, I meant to do that. Okay. So one way to make progress on this is to realize that the uncertainty is a property of the probability distribution of events. It is. It's an inherent property of it. So think about it this way. If we lived in Los Angeles, and it's sunny today, how uncertain are you about the weather tomorrow? Anybody here from Los Angeles? It's always sunny. That's right. It's always sunny in Philadelphia, right? So I should put Philadelphia there. That's a good joke. It was raining yesterday, though. Yes, it was. Yeah. And there's a mild bit of snow in Boston, I understand, instead of a blizzard. Everybody on Facebook is like, WTF, where's my blizzard? All right. So it's going to be sunny tomorrow. You don't need any information. You're not uncertain about the weather, because it's nearly always sunny in Los Angeles, right? It's a probability distribution where it's like 99% sunny all the time, or smoggy. Those are your two things, right? Now, there are Santa Anas too. There is some weather occasionally.
So L.A. is kind of a lovely place, actually, once you get over the smell. It's a nice place. That didn't sound like a genuine endorsement, but I meant it, actually. Seattle, in contrast, is the flip side, right? So it's raining in Seattle today, I bet. I don't know for sure, but I bet. What about tomorrow? Probably raining, right? Because it nearly always rains in Seattle—that little kind of drizzly rain that they get. It's not a real rain. It's not a credible rain; it's just kind of a moisten-your-skin kind of rain. But it's always there. In contrast, you get a pretty interesting city like Atlanta. Now, in Atlanta, there's no telling what the weather could be like in any particular hour of the day. You never know what's going to happen. You're highly uncertain what's going to happen tomorrow. So now let's go back up the column and think about information. Information is the reduction in your uncertainty. It's also a measure of how much you stand to learn from data in a particular case. So there's very little uncertainty about the weather in Los Angeles tomorrow. So there's very little to be learned, because there's not a lot of uncertainty in the distribution of weather events. So there's very little information that I can give you that would allow you to reduce your uncertainty, because there's almost no uncertainty. The maximum reduction in uncertainty is small. So the value of information is small in Los Angeles. In Seattle, the same is true on the opposite end of the probability distribution, right? It nearly always rains. In Atlanta, in contrast, you're highly uncertain about the weather tomorrow. It could be raining, could be a thunderstorm, could be a tornado, right? Could be a snow flurry for some parts of the year. And so when you learn the weather tomorrow, your uncertainty has been reduced a lot. You've gained a lot of information, right? And that's a measure of it.
Does that at least make sense intuitively for the moment? This is one of those slippery things that may take you years to kind of wrap your head around. But you'll do it through exercise. So come back to our definition. Information is the reduction in uncertainty caused by learning an outcome. How can we quantify this? Now we get to—I'm going to take you verbally through the historical derivation of this. And the way this literature starts is, a fellow named Claude Shannon, who I'll show you on the next slide, said, okay, well, here's our problem. We'd like to measure this uncertainty thing and how much it's been reduced. That will give us a definition of information. How can we write down a mathematical function which lets us take a probability distribution as an input and get out of it the amount of uncertainty in that distribution? And then the flip of that, the reciprocal of the uncertainty, is the amount of information that we can gain. And he comes up with three intuitive desiderata, or criteria. He says, first, it should be continuous, this measure of uncertainty. Otherwise, an arbitrarily small change in the probability distribution could make an infinitely large change in the uncertainty. And that just wouldn't be useful. So we want something that's smooth and continuous. Second, the uncertainty should increase with the number of possible events. If I ask you to predict rain and shine, you'd be uncertain about that. But then if I say also snow, your uncertainty should go up, right? Because there's more stuff to do now. And finally, it should be additive. I say more about this in the book. This is the hardest one to get intuitively. The easiest way I know to explain this is to say, look, the way we bin up the categories of events can be done in different ways. You can think about decomposing weather events into, say, a combination of temperature and precipitation and make different events. So it could be hot and cold and rain and shine.
You can think of that as two categories of events with two events each, or as four events, right, from the combinations of both of them. You would want that arbitrary choice of categorization not to affect the answer. And it turns out that only an additive function will guarantee that. I say more about it in the book. This is really subtle. So if you're like, okay, wizard, whatever—I'm with you completely. But it's true. It's required to make this work. So from these, Shannon derived a unique function, which I'll show you on the next slide, which gives us a definition of the uncertainty in a probability distribution. And I want to say, before I show it to you, that these criteria are intuitive. Or at least they were to Shannon, and there aren't too many people who do this kind of analysis. But that's not the reason we use the resulting function. We use the resulting function because it works, right? Because lots of intuition is not a guarantee that things work. Intuition is a thing we use to derive things. We have intuitions. We carry forward with them to see if we can get to something useful. And then once we've got that thing, we have to actually test it and make sure it works. Information theory has proven to be indispensable in all sorts of things—I mean, the internet exists because of it, in a lot of ways. And, you know, Shannon worked on telegraphy, and on other things, in code breaking. But if information theory hadn't worked, if these criteria hadn't led to a function that was useful, we wouldn't be using it anymore. And I think the same is true of Bayesian inference. Bayesian inference strikes many people, including myself, as highly intuitive. It seems like, well, yeah, of course you're going to do this. You just count up all the ways things can happen and contrast them, right? And that's Bayesian inference. But the reason we use it is because it's effective. It's proven useful in many contexts, not because it's intuitive. The intuitiveness helps you learn it.
And historically, that was the way it was derived. But I don't think that's a justification. Okay. Here's our patron, Claude Shannon. Now, this is kind of a creepy picture, but if you stare at it long enough, it gives you a warm feeling. Now, Shannon was brilliant. I mean, almost all of information theory sprung forth from his head like Athena from Zeus, right? And he did so much, so fast. A lot of it was, until recently, classified. He did all kinds of stuff, like optimal encryption theory, and really important stuff in code breaking. He worked with Alan Turing during the Second World War. Turing took a transatlantic cruise to come and visit Shannon at Princeton, I think it was where they were, at the Institute for Advanced Study. Anyway, it's worth reading the Wikipedia biography on Shannon, at least, to see. And in 1948, he published his paper in which he defines this thing called information entropy, which is the function we seek. The only things that have gone into this are those three criteria that were on the previous slide. It turns out there's only one function that satisfies them. Well, there's an infinite number of functions, which are all proportional to this one. There's a constant that goes in front, which is irrelevant. And this is usually written H(p), where p is a probability distribution. So it's a vector of probabilities that defines the probability of each event that is possible in the distribution. And the function is the negative expected log probability of each event, H(p) = −Σ pᵢ log(pᵢ). You can think of it this way: the E means expectation, or average. How do you say this in English? At the bottom of the slide, I've tried to do some translations for you. The uncertainty in a probability distribution is the average log probability of an event. The minus is there for convenience, to make this positive, because otherwise it would be negative. That's all. But it doesn't actually matter. You're not meant to peer into this expression and learn anything at all.
It's just how it is. It turns out that the right scale to measure information on is log probability, not probability. And you're like, where does the log come from? That's what makes it additive. Because log probabilities are additive, where probabilities are multiplicative. That's where the log comes from. It's from the additivity constraint. So everything about this function arises from those criteria. It turns out to be really useful in a lot of ways. Solomon Kullback and Richard Leibler—here's Kullback's book—quickly applied this to statistics back in the 50s. And one of the applications they came up with is this thing we now call the Kullback-Leibler divergence, which is the thing we're looking for. It's a way to measure the distance from one probability distribution to another. And you can derive it from the definition of information entropy. There's a box in the book where I explain where this comes from, in a crude sense. And so here's the idea. We have two probability distributions, P and Q. If you like, you can think about P as being the true probability distribution of events, and Q is our model. Our model outputs probabilities of events. And we'd like to now say, how far is it from Q to P, from our model to the truth? How accurate is Q for describing P? And it turns out the answer to this is something called the divergence. We write this as D_KL—KL for Kullback-Leibler, the two really smart dudes who figured this out. And it turns out to be the average difference in log probability between P and Q, which is what that expression says. We sum, and we weight each event by its probability, and then we have the difference in log probability between the two distributions. So the distance from Q to P is the average difference in log probability—that's the way to think about it. And again, you're not meant to peer at this equation and be like, oh, I get it now. Yeah, of course. No, we're going to use it.
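Both definitions are short enough to compute directly. Here's a minimal sketch in Python (the course's own code is in R, but the arithmetic is identical); the example distributions here are made up for illustration:

```python
import math

def entropy(p):
    """Information entropy H(p) = -sum(p_i * log(p_i))."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence from model q to target p:
    the average difference in log probability, weighted by p."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

p = [0.3, 0.7]   # the "true" distribution of two events
q = [0.25, 0.75] # a model's distribution

print(entropy(p))           # uncertainty inherent in p
print(kl_divergence(p, q))  # distance from q to p; zero only when q equals p
```

Note the `if pi > 0` guard: events with zero probability contribute nothing to the entropy, since p log p goes to zero in the limit.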
And let me give you a little bit of intuition about how it works, though. A property of this thing is that the only time it equals zero is when P and Q are equal. And that's great, because otherwise it wouldn't be a good measure of distance, right? We want it to be zero when they're the same. And that is what happens with this. When P and Q are the same distribution, and only then, the divergence is equal to zero. So I've tried to show you this. I'll give you a little bit of R code that you can step through in your free time if you like. This code makes this graph. At least it makes a really ugly version of this graph. This is nicer. What I've plotted on the bottom here: we're considering two probability distributions, P and Q, where P is defined on the slide. There are two events. The probability in truth of the first is point three, and the probability of the second is point seven. And then we have a family of approximating distributions, Q. And what I'm showing you on this graph is the divergence from each of those alternative distributions to the truth. Since there are only two events, I can put all the Q distributions on one axis, right? Because the probability of the first event tells us the probability of the second event—they must sum to one, right? The law of total probability gives us that. So going all the way from the probability of the first event in our model being zero to one: first the divergence is high, then it decreases until we get right to point three. At that point P and Q are equal, and we hit zero. And then it increases again. And it actually goes to infinity if you're at exactly zero or one, which I'm not showing you. I can't really show infinity on slides. I haven't figured that out yet. But this is just to show you how divergence operates. And so if you minimize divergence, you find the truth, right? The problem is, of course, we don't know the truth. And we're going to get there.
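The computation behind that curve can be sketched in a few lines. This is Python rather than the R code from the slide, and the grid of candidate Q values is just for illustration:

```python
import math

p = [0.3, 0.7]  # the "true" distribution of the two events

def dkl(p, q):
    # divergence from model q to target p: average difference in log probability
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# sweep the family of models Q = (q1, 1 - q1);
# q1 pins down the whole distribution, since probabilities sum to one
for q1 in [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]:
    print(q1, round(dkl(p, [q1, 1 - q1]), 3))

# the divergence hits zero exactly at q1 = 0.3, where Q equals P,
# and blows up toward infinity as q1 approaches 0 or 1
```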
Not only can we not handle it, but we don't know it, right? Before we move on to that—I thought somebody laughed at that. Thank you. I like that joke. All right. So before we move on, though, let me give you a little bit more intuition about this. There's a funny thing about divergence: it's directional. It's not a true distance, because true distances are symmetric, right? The distance between you and me is the same regardless of which of us starts walking, right? It's the same in both directions. Divergence is not like that. It matters which thing you're using to do the approximation. So let me give you this in allegorical form, in the form of an imaginary astronaut who launches off the Earth and is going to land on Mars, and wants to predict, using the Earth as a model, whether they land in water or on dry land on Mars. And then that astronaut, being educated now about the geography of Mars, will launch again and come back to the Earth, and now they will use Mars as a model for the Earth, and they want to predict, using that model, whether they land in water or on land. And the divergence in the two directions is different. I think this helps. These are the kinds of exercises I did for myself on napkins when I was learning this theory, and it helped me at least. There's no guarantee—I'm kind of crazy, so there's no guarantee this will be useful to you, but other people have told me this works, mainly my spouse. But they may be lying to make me feel good. Okay, so we're going to leave Earth first and go to Mars. Here's what I want you to think about. The Earth is 70 percent water and 30 percent land. One way to think about that is as a probability distribution over water and land, right? Say I picked a random pair of latitude and longitude coordinates on the planet Earth. Tell me if it's water or land.
It's highly uncertain, because there's a lot of water and a lot of land. Now, there's more water, absolutely, but it's a highly uncertain thing. You're like, well, I don't know, it depends. Can you give me a hemisphere? No, any point. It's highly uncertain. There's a lot of information entropy about the water or land of any particular location on Earth's surface. In contrast, Mars has very low information entropy, because it's almost entirely dry land. And for the sake of the example, let's count the polar ice caps of Mars as water, okay, because otherwise there's probably no water at all. So we'll count ice. So let's say Mars is like 1% water. I think that's the example I use in the calculations. There's a box in the book in which I do the calculations for this. So now we leave Earth and we go to Mars, and we use the approximating distribution Q for the Earth: 0.3 land and 0.7 water. And now we're going to predict where we land on Mars. On this graph, the way to think about it is that the information entropy of the Earth is at a certain position, and we go to Mars, where the information entropy is really low. So the distance is not so big, because we're highly uncertain. So the distance isn't so far. One way to say it is that the divergence is almost a measure of the surprise when things don't turn out the way we expected. So imagine your expression of surprise on the radio to home base when you land on Mars and it's inevitably dry land. You're like, hey guys, you told me that I was going to land in some water here, and I just wrecked my capsule or whatever. What happened? But you won't be as surprised as you would have been going the other way. Now you leave Mars, and it's all land, so you're really, really confident you're going to land on land on the Earth, but you probably land in water. You could land on land, and then your surprise would be reduced.
The distance is different than if you go the other direction. I know I'm not explaining this very well, but the box in the book does a better job of it. So let me try one more time to do it in abbreviated form. If you develop your model-based expectations from the Earth, you have high uncertainty about what you're going to land in. So then you land on Mars, and it's land, and you're like, okay, yeah, that could happen. My model said that was possible. No big deal. The distance isn't that great as a result, because the potential surprise is reduced, because your expectations were entropic. They didn't have a lot of certainty in them. But now you train yourself on Mars and you go the other direction. Mars has very low information entropy, because if I asked you to predict whether any particular combination of latitude and longitude on Mars is land or water, you'll say it's land. You'll be wrong one percent of the time, but so what? So your information entropy is low. That means your surprise potential is very high. So the distance now between Mars and some other distribution can be much greater, because the surprise value is much greater. So now you go to Earth, and the approximation is way worse. The distance, the divergence, between the two is much greater, because the surprise potential is much, much greater, because you're probably going to land in water. Did that help? Yeah, okay, good. At least you're nodding. Thank you. And there's the box in the book. So one way to say it: it has to do with the imbalance, and it has to do with the direction of the imbalance—which model you're using to do the approximation. If your model is highly uncertain, you won't be terribly surprised. And so the distance between your highly uncertain model and whatever the truth is will be lower. That's one way to say what divergence is measuring. It's measuring, in a sense, the surprise potential.
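You can check both directions with the numbers from the example (Earth 70% water, Mars 1% water counting the ice caps). A Python sketch of the box-in-the-book calculation:

```python
import math

def dkl(p, q):
    """Divergence from model q to target p. Directional: dkl(p, q) != dkl(q, p)."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

earth = [0.7, 0.3]    # P(water), P(land) on Earth
mars  = [0.01, 0.99]  # counting the polar ice caps as the 1% water

# leave Earth, predict Mars: Earth is the model (Q), Mars is the target (P)
print(dkl(mars, earth))
# leave Mars, predict Earth: Mars is the model, Earth is the target
print(dkl(earth, mars))

# the second number is much larger: the confident Mars-trained model
# is badly surprised by all that water on Earth
```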
The surprise is directional, because there's a thing that you're developing expectations from. And there's a huge imbalance here, because the Earth has high entropy in this example and Mars has low entropy. So you can think of it this way: if you learn from Mars, you're really sure that every planet has got a lot of land on it. And then you can be really wrong. And so the distance between the high-certainty Mars distribution and the truth can be much bigger. You haven't quite understood? Okay, yeah, talk to me about it afterwards and I'll run through it. If you're like me, you need to, like, write down the numbers and work it through. But the best heuristic I can leave you with, and what I want this lesson to get across to you guys, is that if your model is really uncertain about what's going to happen, then it won't have as big a divergence from other things. And the more certain your model gets, the greater the potential for it to be more and more wrong, to have greater and greater divergence from the truth. But of course, if we get it close to the truth, the fact that it gets more and more certain is great. And this matters, because sometimes the truth is highly entropic, and that's an easier target to hit. Lots of models will do a good job predicting highly uncertain phenomena, like species extinctions. Which ones are probably going to go extinct this week? I don't know, your bet's as good as mine. Some will, though. On the other hand, there are some things where the truth has very low entropy. Some things will happen and some other things will never happen, and then not all models will do reasonably good jobs with that. That's one way to think about it. Yeah, Katrina. So in this analogy, the planet you're leaving from is analogous to Q? Yes. And the one that you're going to is analogous to the true value? To P, yeah, which is P. That's the thing you need to make the predictions about. So, exactly, the question was:
The planet you're leaving from is Q, and the planet you're landing on is P. Exactly. And we have to make predictions about the true distribution of events, which is P, and we're using a model, Q, to do it. Anyway, I hope this was useful. It helped me when I was learning this stuff to construct my imaginary astronaut, who was unbelievably stupid, but that's what we need. Models are dumb, but they're smarter than us still, right? Okay. So how do we estimate this thing? Here's where we get to the rub. I'm just repeating the definition of Kullback-Leibler divergence on the slide. It's the average difference in log probability of events. The trouble in computing this thing, of course, is that we never know P. If we knew P, we wouldn't need a model, right? If we knew the truth, we wouldn't need models. And so we don't know it. To calculate this thing, obviously, one of its arguments is P, and we don't have that. It turns out, though, that we don't actually need it, because when we're comparing models, all we need is their relative difference in divergence. What we want is the difference between two models in their divergence, which I'm showing at the bottom of this slide. Say we have two approximating distributions, two models, Q and R. Why R? Because it comes after Q. And we can compute the divergence for each of them: D(P, Q) and D(P, R). The difference between the two—we can write that down. Now here comes the magic. The truth subtracts out. That's the great thing about math, for the most part. There's a log(pᵢ) term in both of them, and it's in the same place. So if you start factoring this difference, they cancel one another, because there's a minus in the middle. And you end up with just a simplified version at the bottom, where what we want is the difference between the average log probabilities of each model. That's all we need. This just gives us the difference in divergence between the two.
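The cancellation is easy to verify numerically. A sketch with made-up distributions for the truth P and the two models Q and R:

```python
import math

def dkl(p, q):
    # divergence from model q to target p
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def avg_log_prob(p, q):
    """Expected log probability of model q, averaged over the truth p."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.3, 0.7]    # the truth (unknown in practice)
q = [0.25, 0.75]  # model Q
r = [0.5, 0.5]    # model R

# D(P,Q) - D(P,R): the log(p_i) terms cancel, leaving only the models' log probs
lhs = dkl(p, q) - dkl(p, r)
rhs = avg_log_prob(p, r) - avg_log_prob(p, q)
print(lhs, rhs)  # identical, up to floating-point noise
```

So the comparison needs only each model's average log probability: nature supplies the averaging, and P itself never has to be known.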
It doesn't tell us the divergence of either model from the truth. It's a relative difference, and that's what matters. Does this make some sense? You can follow the algebra in your own free time and have fun with it. It really is just factoring things out. Yeah. What is R again? R is another model. Q and R are alternative models that we want to use. Yeah. So which is closer to the target can be measured by this expression. So we can actually approximate this. And the conventional way to think about this is: well, you could just calculate the average log probability, because you've got the Qs. Now, where does the average come from? That's the events that are happening. There's a pᵢ in there, but nature is making those for you. The events are happening, and then your model says, oh, I predicted that event would happen this amount of the time. So you've got the Qs, and nature, which hides the Ps so to speak, does the averaging for you. So we can calculate the average log probability for a probability distribution, and conventionally in statistics we use something that's proportional to that, rather than that itself, and this thing is called the deviance. It's usually just written D. The deviance of a probability distribution is minus 2 times the sum of the log probabilities of the events. It's proportional to the average, because averages are just sums divided by a constant, right? So it measures the same thing—confusingly, it's not the same thing, but it functionally works the same way. It's a measure of the relative information divergence of a model. Relative to what? To other models, on the same events. Makes sense? Okay. There are some interesting expressions on faces out there, which I take for thinking. Thinking faces. Just to re-emphasize what you've been saying: Q and R are equivalent? Yes—they are models, but they're different models. They're different models, but they're equivalent conceptually. Yeah.
They're both approximations, and we'd like to know which is better. Exactly. The question was, Q and R are equivalent in the sense that they're both models, and we want to know which is best. If we calculate the deviance for each, we can use those as relative measures of their information divergence. That's the magic of it. Okay. So yeah, this is strange, because it's the sum rather than the average, but it's still proportional. It's still the thing we need. And deviance came into statistics before this information-theoretic paradigm really arose—well, around a similar time, but it's a different tradition—and so it was already computed by a bunch of software. You might ask, as I say at the bottom, why is deviance the sum log probability of the data with a minus two? Because reasons. The minus two is there purely for historical reasons; it has to do with chi-squared distributions. Minus two times the log probability has a chi-squared distribution under fairly mild assumptions, and so it had a life in chi-squared testing, model fit testing, before this use. And so software—Minitab, I think—used to spit this thing out, and that's why people adopted deviance as a measure. Really, it's just because reasons, right? It doesn't matter. It's just there, and you've got to remember to do it, and it's really annoying, but reasons. That's how it goes. Okay, we're making good time. This feels good, right? We're not at the cabin yet, but we're almost there. So maybe we're coming up to the cabin. So we've got deviance. How do you compute this thing mechanically? There's a box in the book where I walk you through, in a detailed way, how to do it. Usually R will do this for you, because it's made to do statistics, and for most of the model fitting algorithms in it, you just ask for the deviance—the deviance function—and give it your fit model, and it'll give it to you. If it doesn't do that, you can use logLik, L-O-G-L-I-K, which stands for log likelihood.
The log likelihood is the sum of the log probabilities of the data under a fit model, so it's that thing up top in the deviance without the minus 2 in front. If you take the log likelihood and multiply it by minus 2, you've got the deviance. So, in recipe format: you compute the log probability of each observation (those are model-based predictions; your model gives you a probability distribution, the q_i's), then you sum all those log probabilities, and you multiply the sum by minus 2, and you've got the deviance. Fairly easy to do. For map models, and later on map2stan models, it'll compute this for you; you can already type deviance and get it. Still, it's good to know how to do it yourself, because you might have to at some point, and it really is that easy. I want to say: deviance here is calculated at the MAP parameter values. But if the predictions of a model are uncertain, there's actually a distribution of them, so our real problem in a Bayesian framework is not that there's a single q_i; there's a distribution of Qs, and we want to know the distribution of divergences for that distribution of Qs. We'll do that on Thursday, actually. It'll be great, trust me; you'll feel awesome when you do it. Well, your computer will do all the hard parts, but you will feel awesome, because you'll understand what it's doing, and you'll tell it, like, heel boy, calculate that distribution of distributions for me. We will do this, like I said, on Thursday, when we use WAIC. Okay, so this is the dark road we've been down. What's a good prediction? The answer is the joint probability of events. How far is the model from the target? The answer is: use the divergence to figure that out. How can we estimate that distance? The answer is deviance. Deviance will work, but it's unsatisfying, because it's just relative among multiple candidate models.
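The recipe above can be sketched in a few lines. The book's code is in R, where you would call deviance or logLik on a fit model; here is a hypothetical Python analogue with made-up data, fitting a Gaussian by maximum likelihood and then following the recipe exactly: log probability of each observation, summed, times minus 2.

```python
import math

def gaussian_logpdf(y, mu, sigma):
    """Log probability density of y under Normal(mu, sigma)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def deviance(data, mu, sigma):
    """The recipe: sum the log probabilities, then multiply by -2."""
    loglik = sum(gaussian_logpdf(y, mu, sigma) for y in data)
    return -2 * loglik

# Toy observations (made-up numbers) and their maximum-likelihood Gaussian fit
data = [1.1, 0.4, -0.2, 2.0, 0.9]
mu = sum(data) / len(data)
sigma = math.sqrt(sum((y - mu) ** 2 for y in data) / len(data))

D = deviance(data, mu, sigma)
print(D)
```

The deviance printed here is computed at the point estimates (mu, sigma), which corresponds to the MAP-value shortcut mentioned above; the full Bayesian version averages over the posterior distribution of parameters instead.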
This is the point where calling it a distance loses all intuition. It's like our archer metaphor from before: the archer didn't hit the bullseye, but this is the case where we don't know where the bullseye is; we only know that some archers got closer to it than others. So now the whole bullseye metaphor becomes, like, a transcendental bullseye or something. But that's what deviance is like. You can say that one of the archers is better than the others, but you can't say how good, in any absolute sense, any of them are. And that's what models are like: you can say, in some particular context, which one is better, but you can't say in any absolute terms how good they are at predicting things. Why? Because we don't know the target; we don't know the truth. That's the problem. Okay, the last step is overfitting. Even deviance overfits. As I said, anything, if you make it retrodict the sample, is going to eventually encode the sample, if you let models get more and more complicated. Deviance, you want it to be small; you can tell by the name, right? Usually we want to minimize deviance, unless you're an anthropologist, then you maximize it all the time. But you want to minimize deviance, because it's a relative measure of the divergence from the truth, so you want small numbers. But if you do that across a set of alternative models, you will always overfit; you'll always pick the most complicated, because the most complicated model always has the smallest deviance in sample. We're interested in out-of-sample prediction, in generalization. So how do we do this? We're going to do this by considering what I call a metamodel of forecasting; we're going to make a model of models, and this is the road to information criteria that we're headed down. First step in our metamodel of forecasting: suppose we have two samples. We've got a training sample, and in the future there will be a
testing sample. They're both going to be of the same size, just to make the argument easy to understand; initially this will be 20, and on the next slide we'll actually conduct this. The training sample is the data you usually call data: it's what you get, and what you fit your model to. The test sample is what you want to do with your model: the events you really want to be good at. So you fit the model to the training sample, and we measure the deviance of the model on that training sample; we call this D_train. Then we use the fit to the training sample, the probability distribution of events we get from the trained model, to compute the deviance on the test sample. We don't fit the model to the test sample; we merely take the parameter estimates from the training sample, the model after it's been educated, and we use those to get probabilities of events on the new data, and we compute the deviance using those. Does that make sense? Yeah, at least for the moment. The difference between D_test and D_train is a measure of the overfitting. It's the error: the extent to which the model learned too much from the training sample. And models always overfit; that is, on average, they do worse on the test sample than they do on the training sample. Only on average. Sometimes they do better, because the sample that you train on could be a weird sample. It could be, like, a sample of anthropologists, and you're trying to learn something about human psychology; then your training sample leads you astray, and the lower-ranked models might actually do better in that case. But on average, because of the phenomenon of overfitting, we do worse out of sample than in sample. So let's think about this in a simulation experiment. I give you the code to conduct these experiments in the book, if you want to do it yourself; I'm just going to talk through them in the last 10 minutes we have here. So in this
example, we're going to know the truth. That's the great thing about simulation: you can know what the true thing is. Here's the true data-generating process, in the form of a model as you know it: a simple linear regression with a fixed standard deviation and variance of one. There are two predictor variables, creatively called x_1 and x_2, that influence the mean of the outcome, and they have different importances, but these are the only two things that matter. That is the whole data-generating process right there: whatever the values are in x_1 and x_2, they get multiplied by those coefficients, that gives you an expected value, and then there's a random Gaussian deviation that's observed. That's how the data are generated. Then we're going to go through the recipe on the previous slide: we're going to get two samples from this process, train models on one of them, and force them to predict the other one, and then we can compete them. This is the horse race. We're going to compare these five models, in order of complexity. The first one just has an intercept. We're going to use flat priors here, just to make it conceptually easy; we'll add informative priors on Thursday. The second one is a simple linear regression with one predictor. The third one is a linear regression with two different predictors, so you might say the third one is sort of the data-generating model, if you let alpha be zero. Then the fourth one has three predictors; the third predictor you can regard as random numbers from the perspective of the true model. It doesn't actually matter, but we're going to estimate a coefficient for it, and it won't be zero; it'll be something. And then the final one has four predictors. So which of these is, in some sense, the true model? It's the middle one, the third one, the one with the two predictors x_1 and x_2. That's what we'd like to
identify as the best model for prediction out of sample. Let's see what happens, though; there are a lot of subtle issues about how this works. So here are the results of, I think, 10,000 simulations, where we fit all five models to a training sample, measure their deviance in sample on the training data (that's D_train), and then measure their deviance out of sample on the test sample (that's D_test). Right now I'm just showing you the deviance in sample. The vertical axis on this graph is deviance, in units of deviance. And units of deviance, what are those? Well, they're just units of deviance. Sometimes people call these things decibans, and they're in units of information, with a base depending upon whichever base of logarithm you like; that's why they're weird. The scale is not very informative; that's the only thing to say. The horizontal axis is our models, listed by their number of parameters, where on the far left we have the intercept-only model, three is the data-generating model, and four and five have the pretender predictors, the things that we'd like to drop out. What I'm showing you here with the blue points is the average deviance in sample for each of these models across 10,000 simulations, and the bars are plus and minus one standard deviation of the distribution of deviance in sample; it can vary quite a lot depending upon the details. This is all in sample so far, and this is for a sample of 20 observations. Now let's compare that to the out-of-sample performance. Oh, what I wanted to say, let me go back: notice that the deviance always goes down in sample as the models get more complicated; that's what I asserted before. And out of sample it looks quite different. The open circles are the average deviance out of sample, from the fit models, over 10,000
replicates; again the bars are plus or minus one standard deviation. Notice that we do get overlap: sometimes you get really lucky and you do better out of sample than you did in sample, but on average you do worse. Notice now that the models don't always perform better the more complicated they are. In fact, the lowest on average is the data-generating model. It does the best out of sample because it's the truth, or close to it in a sense; not exactly, because you're still estimating the parameters, so there's still error, but on average it does better, even with only 20 data points. And you can see the climb in models four and five out there, as they get worse on average. But notice this is like a horse race: in any particular horse race, the wrong model could win, because there's a lot of variance going on, especially for small samples. Does this make some sense? Yeah. Let's look at the same phenomenon with a hundred data points. Now, with a hundred data points, the overfitting risk is actually a lot smaller, because each sample is a better representation of the data-generating process; you have more information about it. So, ironically, or not ironically, intuitively I guess, the overfitting risk goes down the more data you have. Again, let's look at the blue points: the deviance is still always going down, but it goes down very, very little in models four and five out here. What's going on there? And they go up a little bit out of sample; they are a little bit worse, but not much. What's going on here is that the true coefficients, so to speak, for these extra predictors are zero. With a lot of data, you won't ever actually estimate them to be zero, but you'll get really close to that, and so, since you have so much information, you end up making good predictions out of sample, because you can estimate precisely that these pretender predictors mostly don't matter. That's the thing we'd like regression to do, and with a lot of data it'll do that for us. Out of
sample, you see model three is still the best, but the difference between it and models four and five is much smaller now; they don't climb as fast, because the overfitting risk has gone down. It is possible, although you don't quite see it here (you can almost see it on the graph on the left), for a model other than the data-generating model to be the best for making predictions out of sample. With an even smaller sample, if you went down to 10 data points, and I encourage you to try this at home, because you have nothing better to do, right, except run a bunch of simulations (I give you the code; it's in a box in the book; you just have to change n equals 20 to n equals 10 and run it), model one will be the best out of sample. Why? Because there's very little data in the sample, so it's hard to learn anything regular about the process from it, and the model that doesn't try to learn much does best in prediction. And so this is where, philosophically, you need to drink a couple of beers, or whatever your drink of choice is, stiff coffee if you don't drink alcohol, and think about that. This is what model-based statistics is like: sometimes we don't have enough data to identify the truth, and we have to be pragmatic instead. There are lots of applied sciences where this has been known about for a long time. Conservation is the one that I know best: fisheries management, where we may know the true biological model of the fishery, but we don't care about it, because we can't possibly estimate all the parameters in it, because fish vary in their growth rates and all kinds of things across age in very fancy and interesting ways. We could know about that biology in some abstract sense, but then you've got to estimate those parameters from data, and fish are frustrating, because they're under the waves; you can't really interview them. You get landings, as they're called: there are some dead fish on your boat, let's count them. And that's a
biased sample. So, pragmatically, you can often do better in prediction for population dynamics by using the wrong model, because there are fewer parameters to estimate. And that could happen here too, in a simple linear regression scenario, if you have a small sample size. Does that make some sense? Questions about this will come up while you read the book. This takes us into regularization, and I think I have enough time to get this started and do something useful. Regularization is the first tool that we can use to deal with this problem. What I've just showed you is in the realm of pure thought, where we knew the data-generating process. What if we don't? What's a way that we can redesign our models so that they do a better job? One of the ways is taking a particular model, with some set of predictors, and making it conservative, so that it learns less from the sample. We've done this before, with what we call regularizing priors. These will reduce overfitting because they make the model learn less from the sample; you can think of them as expressing a certain amount of skepticism about extreme values of things. Now, if the data are diagnostic, they'll overcome these priors, absolutely, and as your sample size increases you can overpower these priors really easily; I encourage you to explore that by trying it out. The risk here, of course, is that if you make the prior too conservative, you won't learn anything from the sample, and that's no good either; then you underfit. So it's the Ulysses' compass problem, in a sense. But remember, you don't have to get the right prior, because there is no right prior; priors aren't part of nature, they're part of the information problem that we use to learn about nature. You just have to do better. In that sense, a prior that's modestly conservative is always going to do better than a flat prior, because it will reduce overfitting. Now, how conservative it should be, you do want to worry about that, and
information criteria, as you'll see, can measure that; we'll get to that on Thursday. So what's a regularizing prior? In this case the simplest is a Gaussian prior with some small standard deviation. What's the right one? Well, it depends upon the scale of the data, upon the units of the outcome; that's what conservative means here, so you'll have to use your domain expertise about that. In the graph in the lower right, I show you three Gaussian priors across parameter values, to give you some intuition. Say this is a regression coefficient, and it measures, you know, the number of centimeters taller someone is for every additional pound or kilogram of weight. These different priors differ only in their standard deviations; they're all centered on zero. The dashed one there has a standard deviation of one; then a standard deviation of a half is more concentrated; and a standard deviation of point two is even more concentrated yet. You're piling up prior plausibility, before you see the data, on zero. This expresses skepticism, and then you need more evidence to overcome it, and it turns out this helps out-of-sample prediction. That's what I'll leave you with. We can repeat our simulation experiment: the same thing, 10,000 times. We generate a sample, we fit models to that sample, we also generate an out-of-sample set from the same process, which is known in simulation, and we measure the performance of the models out of sample, again using deviance. What I'm showing you here is just in-sample performance for the moment; I'll show you out of sample in a second. It's the same five models as before, but now for each of them we use three different priors. On the previous slide I used flat priors, which actually means a standard deviation of 100, which is effectively flat over the range of these data. So let's say the points here represent what you saw before; the points are what was on the graph
before, on the previous slide, for a flat prior. The dashed line is the most modest prior, the one with a standard deviation of one. It's slightly worse at most values, but it's basically the same; there's almost nothing in it. Then the solid line is the standard deviation of a half. It's worse: the deviance is higher on average for all of the models, in sample. In sample, it fits the data worse. Why does it fit the data worse? Because it learns less from the data, so it makes worse predictions about the sample. And then the thick solid line is the most conservative one, with the smallest standard deviation, and it does a lot worse on average in the sample. But in sample, who cares about in sample? All in sample does is get you published, right? We're interested in out of sample. I mean that only slightly snarkily; I think that's actually a sad sociological fact about a lot of journals. But let me show you the out-of-sample effect. It's completely reversed, completely flipped; now it's just mirrored on the other side, well, also morphed a bit. Model three is the best in all cases, but notice that the conservative prior does better on average for all of the models. They all make better predictions out of sample when they're made to be conservative, to be skeptical. And the reason this pretty intense skepticism matters here is because there's not a lot of data. There's not a lot of data, and so we shouldn't expect to be able to learn reliable and lasting inferences about this process from only 20 data points. And with that note, I'll ask you to think about that the next time you see a social psychology study quoted at some point. All right, not to pick on social psychology. Well, yeah, actually, to pick on social psychology. All right, thank you for your indulgence, and I'll see you on Thursday.
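The regularization experiment described above can be sketched in miniature. The book's simulation code is in R; this is a hypothetical Python analogue with made-up settings: a known noise standard deviation of 1, and a single "pretender" predictor whose true coefficient is zero. A Gaussian prior Normal(0, tau) on the slope, with known noise, is equivalent to ridge-style shrinkage with penalty sigma^2/tau^2, which is what the fit below uses.

```python
import math
import random

random.seed(1)

def deviance(xs, ys, slope, sigma=1.0):
    """-2 times the summed Gaussian log probabilities, mean = slope * x."""
    const = math.log(2 * math.pi * sigma ** 2)
    return sum(const + (y - slope * x) ** 2 / sigma ** 2
               for x, y in zip(xs, ys))

def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope; penalty = sigma^2 / tau^2 implements a
    Gaussian prior Normal(0, tau) as ridge-style shrinkage toward zero."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

n, sims = 20, 1000
flat_train = reg_train = flat_test = reg_test = 0.0
for _ in range(sims):
    # One pretender predictor: its true coefficient is zero,
    # so the outcome is pure noise.
    xtr = [random.gauss(0, 1) for _ in range(n)]
    ytr = [random.gauss(0, 1) for _ in range(n)]
    xte = [random.gauss(0, 1) for _ in range(n)]
    yte = [random.gauss(0, 1) for _ in range(n)]
    b_flat = fit_slope(xtr, ytr)              # flat prior
    b_reg = fit_slope(xtr, ytr, penalty=4.0)  # Normal(0, 0.5) prior, sigma = 1
    flat_train += deviance(xtr, ytr, b_flat)
    reg_train += deviance(xtr, ytr, b_reg)
    flat_test += deviance(xte, yte, b_flat)
    reg_test += deviance(xte, yte, b_reg)

# On average, the flat prior fits the training sample better (smaller
# D_train) but predicts the test sample worse (larger D_test).
print(flat_train / sims, reg_train / sims)
print(flat_test / sims, reg_test / sims)
```

This reproduces the lecture's pattern on a tiny scale: in sample the unregularized fit always wins, because least squares minimizes the training deviance by construction, while out of sample the skeptical prior wins on average, because it refuses to chase noise in the pretender predictor.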