Hello! We were discussing illuminated manuscripts in the front row here, so I will attempt to put something in a future lecture just for you. Alright, let's pick up right where we left off. Just to remind you what the ambition is: this week we're trying to understand overfitting and underfitting, and two of the most common ways to address these phenomena. We need some way to trade off overfitting, which is learning too much from the sample, versus underfitting, which is learning too little. We have walked the road to getting to information criteria as ways to guard against overfitting, and we're about to start that. We had just gotten to regularizing priors, so we're going to pick up there on the next slide. And later today we'll talk about how to use all the models in a set, and their relative estimated out-of-sample accuracy, to average predictions from them and create something called an ensemble, which guards against overconfidence in predictions across models in the same way that using the whole posterior within a model guards against overconfidence in any single parameter value. So back to regularization. A regularizing prior is, conventionally, a conservative prior. It's informative. It's meant to tamp the model down, make it calm down and not be too excited by the sample. You can think of it as expressing skepticism about large values of regression coefficients. That's the conventional use we're seeing here. So these will reduce the amount of learning from the sample, but if you have a lot of data, remember, these priors will have essentially no effect. When the likelihood is very, very peaked, which is what happens when you have a lot of evidence, even a very strong prior will get swamped. And one way you can keep reminding yourself of that is that inside a Bayesian model, priors are just the result of previous learning. That's how they work in the calculus. And once the evidence is in, remember, way back in the first week: if you had a posterior, you could always put that in as a prior in a later analysis, and then put the new data on top and update. And that would be equivalent to using all the data at once with the primordial priors that we started with originally. So logically, every prior is a posterior from some analysis. And that's one way, in fact, you can decode priors and figure out what they mean. It depends upon the prior, so it's a relative issue. But it's a good question. The question was: what counts as a lot of evidence, enough to overcome the prior? And it depends. If the prior is incredibly peaked, with a really, really narrow variance, then you're going to need a lot of data to do it. The regularizing priors we use in this course, and you'll see in your homework you can play around with this, are pretty easy to overcome, even with relatively modest samples. Nevertheless, they can still have an impact depending upon which part of the posterior distribution you're in, because they're mainly going to have an effect on extreme values. They put a lot of probability mass around zero, and they're skeptical of stuff like, you know, 10,000. We're going to get a lot more practice with these. And I always encourage you guys to do experiments when you're doing your homework, or when you're going through the chapter: change the priors and play around. See how strong you have to make some of these priors in the example analyses before the data stops doing anything. And that's the way to figure this stuff out, I think. That said, of course, there are analytical ways to address that question.
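A minimal sketch of the kind of experiment being encouraged here, using the rethinking package's map function. The toy data and values are my own for illustration, not code from the lecture: fit the same small-sample regression with increasingly skeptical priors on the slope, and watch how far the data can still pull the estimate.

```r
library(rethinking)

# Toy data: 20 cases, true slope of 1.5 (all values here are assumptions for illustration).
set.seed(11)
N <- 20
x <- rnorm(N)
y <- rnorm(N, mean = 1.5 * x, sd = 1)

slope_for_prior_sd <- function(s) {
  m <- map(
    alist(
      y ~ dnorm(mu, sigma),
      mu <- a + b * x,
      a ~ dnorm(0, 10),
      b ~ dnorm(0, s),        # regularizing prior on the slope; s is passed in as data
      sigma ~ dunif(0, 10)
    ),
    data = list(y = y, x = x, s = s)
  )
  coef(m)["b"]                 # MAP estimate of the slope
}

# From effectively flat to very skeptical: the estimate gets dragged toward zero
# as the prior tightens, and a modest sample can still overcome the milder priors.
sapply(c(10, 1, 0.5, 0.2, 0.05), slope_for_prior_sd)
```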
But I think those analytical approaches are typically limited to particular kinds of models, because the likelihoods can vary a lot. So in any particular analysis you're doing, you want to learn the skill of sensitivity analysis: seeing how sensitive the result is to the priors. Another thing to say, and we'll have examples of this, not this week but in future weeks, is that it's often a good idea, once you get your posterior distribution, to compare it to the priors and see what the data has done. The visual comparison is often informative. We'll do this later on when we do Gaussian processes, especially, because those get very complicated. There, remember, I asserted this before, you're going to get a covariance function, a spatial covariance function, out of the posterior distribution. So there's a distribution inside your distribution, and a bunch of parameters go into defining it. We'll work on it, and it will make sense. But those kinds of comparisons help a lot as well; you can see how much the model has learned. Okay, so here's an example of regularizing priors on these beta coefficients. I've been foisting these on you guys in previous weeks, magically. Now we're going to work with them. I gave you three examples with different levels of skepticism. The Normal(0, 1) prior is not tremendously skeptical. But of course it depends upon the units of the parameter, on the scale, right? So it's all relative. This is why you need to play around or have some domain knowledge. And then there's one that's more peaked, and then the 0.2 standard deviation one is more peaked yet. And this is the slide we had just gotten to and ended on, so let me reintroduce it. On the horizontal axis in the plot on the left, we have five models that vary in the number of parameters and therefore the number of regression predictors they're using. And in sample, the more complicated models always fit better, if your machine works; it should do that. Sometimes your machine will not work, and one of the ways you can tell, actually, is that in your model set the more complicated models fit worse in sample; then something's probably wrong with how you did the fit, right? And the points on here are for the perfectly flat prior, what you get when the prior is effectively completely flat. And the different lines correspond to the different regularizing priors on all the coefficients, all the regression coefficients, not the intercept. The intercept is still flat in these, but the slopes, so to speak, get the regularizing priors. And you'll see the stronger the prior gets, the worse the fit in sample, because the model extracts less from the sample, right? Because it's more skeptical; it needs more evidence to move as far. Yet when we look out of sample, the whole relationship is flipped over, because the more skeptical priors overfit less. And there's a lot of overfitting risk here, because this is a sample of only 20. And the model with five parameters has four predictor variables in it, right? It's a multiple regression with four predictors, and there are 20 data points. You might think, I would never do that. Well, let me show you an issue of the Journal of the Royal Society sometime. I bet you can find one. Actually, one of my former grad students sent me a great paper that was in Nature about growth in dinosaurs, where they had five fossils and there were six parameters in the model. So this does happen. I'll put that on the website, actually; you guys will like it, it's great. I thought about making a homework problem from it. It's great.
Besides, dinosaurs are cool, right? Who doesn't like dinosaurs? So I'll try to remember to put that up. So you get overfitting risk because, in some sense, it's the model complexity relative to the amount of data you have. With a lot of data, there's less overfitting risk, because you can estimate small coefficients very precisely. Or you can think of each sample as a better representation of the general process when there's a lot of data. Does this make some sense so far? You guys with me? What's going on? Notice model three is still the best overall, but with the really strongly regularizing prior, that's the thick black line on top out of sample, there's almost no additional risk introduced by the more complex models, because the prior on each regression coefficient is so skeptical, its standard deviation so small, that it effectively squashes those estimates down to nearly zero, which is what they are in reality. I know, because I made the reality; it's a simulation. Let's look at what happens when you have more evidence. So here's the same analysis, the same 10,000 simulations for each model, but with 100 data points in the samples. Now, it's the same basic relationship in sample: the more skeptical regularizing priors always do worse. It isn't as big a difference now, though; the curve doesn't descend as rapidly. And the overfitting, the difference out of sample among the models, isn't as great now, but it's still the same ordering, flipped over in the other direction. Why? Well, there's a lot more evidence now, so the overfitting risk is lower. Priors in general have less effect, and you don't need them as much in this context. But this is a case where they do no harm, and they do us a little bit of good, because regularization still helps a little. Does this make sense? Now, of course, the issue is that if you make the prior too tight, then you won't learn anything from the sample. So there is a tuning problem here. We're going to talk about the tuning problem a little bit today, and in your homework you will do some tuning. Or rather, I will guide you through some tuning, and you will do it, even if not actively in your cognition; at least physically you will do some tuning. Okay? And you will get it. It's like playing a sport: first you do it and then you understand it, right? That's how these things go, riding a bicycle, stuff like that. Okay. So often we need a way to measure the expected overfitting from something. And this is where information criteria come in, and all that work we've done so far with the dark road and the cabin, right? And the beach: now we're at the beach. Or rather, we're almost at the beach; you can see the beach now. We're out of the woods. You know I'm not going to kill you all and put you in a mass grave, sort of thing. So we'd like a way to estimate out-of-sample deviance, because all models, even with very intelligent regularizing priors, are likely to overfit some. Why? Because every sample has something about it that is not going to generalize usefully to more samples in the future. And so we expect some overfitting to go on, always. And we'd like some measure, given a model structure, of how risky it is to use a model of that type. And one of these measures of risk is how flexible it is in fitting. And it turns out that information theory, all that stuff we did on Tuesday, provides us a way to do this.
Or rather, we can continue with this line of using deviance as our target. We'd like to measure the deviance out of sample, which is what we've been using even today on the previous slides. And there's a way to calculate what the expected deviance out of sample is, given a particular model: that is, a likelihood and a set of priors, and then the posterior that results from them. And that's what we're going to do. And I'm going to assert a lot of the mathematics and instead give you the magical version of it. But if you have questions about it, I'd be happy to explain. I think the simulations work better, and the same simulation strategy we've already used will explain how these things work. Okay, so in theory, information criteria let us estimate out-of-sample deviance. That's the goal of every legitimate information criterion. There are things people call information criteria, like the Bayesian information criterion, that aren't. It's not an information criterion, and it's probably not Bayesian either; it depends upon who you ask. If you ask me, it's not. Why? Because it uses only MAP estimates; it doesn't use the whole posterior distribution. But that aside, it's not an information criterion because it doesn't have out-of-sample deviance as its target. It uses something else. That doesn't mean it's not useful. It just means it's for a different task, one we'll talk about, actually; I'll mention it today. And there's a little box on the Bayesian information criterion in the book if you're interested in it. But I'm not going to advertise it beyond that. So we call these things information criteria because they're based on information theory. That's what nominates deviance as the target; it keeps us in probability-distribution land. And they're criteria because they're used to compare models. Smaller deviances are better. And there are a bunch of these now. The most common ones in Bayesian work are AIC, DIC, and WAIC. But there are lots of others for special model types; you can come up with ones that deal with the particular features of those models. WAIC is a generalized one. It's called the Widely Applicable Information Criterion because it's very good at adapting to any kind of crazy model form you like, with some restrictions we'll talk about in later weeks. So let's start with the ancestral one, the Akaike information criterion. This was the first one of its kind. And according to its creator, Hirotugu Akaike, the idea came to him while he was riding the subway to work. He realized in a flash that he could get an estimate of the out-of-sample deviance because of the symmetry of the problem, and so on. It's one of those things where you read it and you're like, wow, okay, you're kind of smart. Or maybe I should ride the subway more, or something like that. But it's one of these flashes where the insight comes to a unique individual in a unique circumstance, and it has a big effect: it creates a new area of work in a field. Although people had come up with related ideas before, his was very principled. So it's the same meta-model of forecasting we talked about before when we talked about overfitting, and in the exercise I did with regularizing priors just earlier: we imagine two samples coming from a common process, the training sample and the testing sample. We can observe the training sample and train our model on it, that is, fit the model to it. For the sake of convenience, we say these samples are the same size. They don't have to be, in principle.
Obviously, you can make predictions for smaller or larger samples in the future if you like; it's just convenient to do it this way. We fit the model to the training sample. We get the deviance on the training sample. That is overfit: it's smaller than the average deviance out of sample is going to be. And so it overestimates how good the model is; it underestimates the information divergence. Because remember, deviance is an estimate of the Kullback-Leibler divergence. Then we take the model fit, the posterior distribution from the model fit to the training sample, and we, in a sense, predict the data in the test sample. And we can get its deviance, its estimated Kullback-Leibler divergence on the future data. We did this when we did the simulations. What AIC wants to do is get the test part analytically, mathematically. And that's what Akaike figured out: a way to do it analytically, given some assumptions, right? This is not magic. This is still a golem. It's just a meta-golem, a golem of golems, something like that. So under some fairly strict assumptions, you get this famous expression: AIC is the deviance in sample, on the training sample, plus twice the number of parameters. And that's the expected value, the average. Remember, there's a distribution here, and in some cases you can actually define that distribution too. But the average is simply the deviance in sample plus twice the number of parameters. Which is pretty nice, right? It could have been something much more hostile. And it's a nice thing that it works out that way. I should say, for the sake of history here, that Akaike did not call this the Akaike information criterion. He called it An Information Criterion. And he says, actually, later on, that his secretary suggested that to him: just call it an information criterion. AIC; you've got to have an acronym, right? So he used it. He credits it to the secretary. This is back when professors had secretaries, I think. Now we're our own secretaries. My computer is my secretary. It's a terrible one. But anyway, so you see, he wasn't a raving egotist. That's what I want to point out. By all accounts, a very generous person with his time and his mind. Okay. So there are some conditions, and it's worth knowing about them. First thing: you have to like the AIC forecasting model. What does that mean? That means the scenario that I outlined on the previous slide, the one we've used before. On the next slide, I'll give you a hint about why that's not completely general, even though it reflects a lot of what we as scientists do, where we get some data and we try to do some learning from it to think about a generalizable phenomenon. There are other kinds of things we like to do with statistical models. AIC assumes flat priors. Now, you can use it with non-flat priors as long as they're pretty weak, right? Because then its approximation will be excellent. And you can try this out on your own in the homework, because I'm going to give you information criteria that generalize beyond flat priors in a moment. But when people use AIC, there's this implicit assumption that the priors are flat. As a consequence, this actually excludes multi-level models. Or rather, there's a subtle issue here: it ignores all the multi-level aspects of them. You'll see lots of people use AIC with multi-level models in the literature, and I go back and forth on whether that's naughty or not. Ask me next week, and maybe I'll have made up my mind.
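A hedged sketch of that claim, using my own toy simulation rather than the lecture's slide code: simulate many training/test pairs, fit by maximum likelihood (flat priors), and check that the out-of-sample deviance exceeds the in-sample deviance by roughly twice the number of parameters.

```r
set.seed(2015)
n <- 20                      # size of both the training and the test sample
b_true <- c(0.15, -0.4)      # two true slopes (arbitrary values for illustration)

sim_once <- function() {
  X_tr <- matrix(rnorm(n * 2), n, 2)
  X_te <- matrix(rnorm(n * 2), n, 2)
  y_tr <- rnorm(n, X_tr %*% b_true, 1)
  y_te <- rnorm(n, X_te %*% b_true, 1)
  fit  <- lm(y_tr ~ X_tr)                        # flat-prior (maximum likelihood) fit to training data
  beta <- coef(fit)
  sig  <- sqrt(sum(residuals(fit)^2) / n)        # ML estimate of sigma
  dev_train <- -2 * sum(dnorm(y_tr, cbind(1, X_tr) %*% beta, sig, log = TRUE))
  dev_test  <- -2 * sum(dnorm(y_te, cbind(1, X_te) %*% beta, sig, log = TRUE))
  c(train = dev_train, test = dev_test)
}

devs <- replicate(1e4, sim_once())

# Four parameters here (intercept, two slopes, sigma), so AIC says the gap
# should be about 2 * 4 = 8. At n = 20 it comes out somewhat larger, which is
# the small-sample issue discussed next.
mean(devs["test", ] - devs["train", ])
```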
But I want to punt on the multi-level question until we get to multi-level models, when it will be useful. The reason, just for now, that AIC strictly excludes varying-effects or mixed-effects or random-effects models is that in those models, the deeper levels in the model are effectively priors for the parameters higher up. And they're informative, if the multi-level structure does anything good for you. They're often highly informative. They regularize the higher-level parameters, and that's why we use them: multi-level models adaptively regularize. They learn the proper amount of regularization from the data. I'll say that again: they learn the proper amount of regularization from the data. And that's why they're awesome. So we're going to punt on saying more about that until we get there, in week 2000 or whatever that is, coming up. Okay. AIC assumes the quadratic approximation to the posterior, that is, that it's multivariate Gaussian. This is the least troublesome of the assumptions, because it's very regularly satisfied. Thank the universe for the central limit theorem: posterior distributions are routinely Gaussian. Nevertheless, a couple of weeks from now, going into generalized linear models, I'll show you cases where the Gaussian assumption is violated. It's not too unusual, actually, once you get to things like logistic regression; there are lots of cases where it won't be multivariate Gaussian. And it assumes that the number of parameters in the model is much less than the number of cases you're fitting the model with. There is a generalization of this, which is quite popular and I think rightly so: for a Gaussian likelihood model, like the multiple linear regressions we've been using, there's a small-sample correction to AIC called AICc, where the penalty term isn't 2k, it's 2kn/(n - k - 1); there's a small numerical sketch of that penalty after this paragraph. And as k approaches n, that penalty goes to infinity. That is, the model infinitely overfits. Why? Because that's like the model we fit to the hominin species on Tuesday: when you had the same number of parameters as data points, there was no error variance left in the model, so anything off the curve was impossible according to the model. So yeah, that's infinite overfitting, and that's bad. Now, you would think you'd never use a model like that, except for that dinosaur model that was published in Nature, which I'll upload for you guys later. So this correction is pretty useful. It's also conservative. I've done a bunch of simulations with these things, just to learn them for myself, in cases where I know the answer. And I've always found AICc is conservative relative to the actual error: it overestimates the amount of overfitting risk that comes as you approach k equal to n. But I'd rather have that than not use it at all. Just keep that in mind. When people are using AICc, that's what it's about. Okay. So I wanted to say why we might not like the AIC prediction scenario. Those of us who do professional science, or semi-professional science, or whatever it is we do, often find ourselves with some data, we've got some hypotheses, and we do some fitting. Our lives are discrete in that way. There's a degree you're trying to get, and there's some data in your life that you're analyzing. And any forecasting or predictions you're going to do are at some distant time in the future, and someone else might have to do them, like your students or something like that.
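A small numerical sketch of the penalty just mentioned (the choice of n = 20 is just an example): the AICc penalty 2kn/(n - k - 1), compared with plain AIC's 2k, blowing up as k approaches n.

```r
n <- 20
k <- 1:18
cbind(k,
      aic_penalty  = 2 * k,
      aicc_penalty = 2 * k * n / (n - k - 1))   # grows without bound as k approaches n - 1
```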
So our lives as researchers are organized in that discrete, sequential fashion, the same way the training and test samples are in the AIC scenario. And that's fine. But there are lots of real-world applications of statistical modeling and statistical learning which don't have that convenient, discretized structure. Here's one from my life. I have strong opinions about how to cook bratwurst. Just FYI. And those of you who don't: you should get opinions about this. It's very important. It's way more important than which political party you vote for. So anyway, I like the casing, the intestine on the outside, to be crispy. But that aside, I'm oversharing, I know. In this kind of task, you don't want to think of it as: cook a bunch of bratwurst one way and a bunch another way, then compare them and get a treatment effect or something like that. You could do that. But that would waste a lot of bratwurst. And I want to eat these bratwurst that I'm experimenting on. So I would like to do an efficient form of statistical learning as I eat my dinner, as I grill these things, or preferably boil them and then grill them; that's the correct answer, by the way. So there are ways of optimizing this too, different kinds of criteria. But they're not the standard information criteria we are using, because the information criteria, the central ones that I'm talking about, use this discretized kind of train-test scenario. Instead, the form of learning with the bratwurst that I'm interested in is called prequential error, and it's the accumulated error over the cycle of learning as you go. So it's like: I start with some prior. I cook the first bratwurst one way. I eat it. I decide, hmm, that wasn't very crispy. I learned something. I update my posterior. Next: so your sample size is growing as you go, you're learning as you go, and you'd like to minimize the error on that learning trajectory. So this is a learning problem and a search problem at the same time. You're trying to discover something and minimize the error along the way. It's efficiency in sampling and learning bundled together. And this isn't purely something that has to do with me and grilling bratwurst. In World War II, there was a very serious application of this issue. It had to do with finding German U-boats in the Bay of Biscay. And U-boats are hard to see, because they're underwater, especially since the Allies really didn't have the technology at the time. And Bayesian statistics was applied to do search passes in the Bay of Biscay. And you're learning something when you look in an area and you don't find a boat: you learn something about where you should look next. AIC is not the right scenario for these things, but there is a big and very successful literature on this. This is also how we find plane crashes. They apply the same algorithms to finding, you know, various Asian airliners that go down. Too soon; this year has been a bad year for certain airlines. Anyway, there's a fascinating literature about this stuff. You need a different way of thinking about how to choose models and do model comparison if this is your goal. And there's a big literature on it. It's quite popular in machine learning: you're constantly getting behavior out of the system, and you're optimizing as you go. You don't get a discretized data-collection round while you're doing your degree. Okay, that said, I'm not going to teach you that, which is a tragedy. But yeah, you have a question?
So just to make sure I understand: with AIC, you want to compare the same model but fit to two different data sets? No. The question was, with AIC, do you want to look at the same model fit to two different data sets. No: the goal is to predict how any particular model will perform out of sample. But you've only got its performance in sample. So AIC is our crystal ball, which is a metaphor I will undermine later on today. And the idea is you're going to have multiple models, and there'll be an example of this later on, and if we can compute AIC for all of them, it provides a criterion by which we can rank them and compare them on what we care about, which is their out-of-sample performance. Because in sample, they're bound to overfit. So we don't actually have this test sample? No. It's like your data, right? I mean, I don't know your data. Well, actually, I do know something about your data. But you've got some data you're going to work with, and you're going to fit some models to it. And you'd like to choose among them based not upon how they work on your sample, but on some generalization about the process that sample came from. And that's what AIC aims to help you do. Did I answer your question? Yeah. So you're saying with the prequential setting, you can't use these criteria? There are other things. They're not information criteria, but there are other criteria that use the prequential framework. They accumulate the error as you go over learning cycles. They're big in machine learning. They go by various names. But if you find yourself with a problem of that kind, let me know. Send me an email or stop by my office and I can give you some stuff about it. There's a big, successful literature on that stuff, too. OK. All right. So let's look at AIC and how good a job it does. Same story with these graphs; you know these graphs now, right? I run 10,000 simulations. My computer chugs along for a minute or so, because that's all it takes these days to fit a million regressions and pull out the deviances. And so along the bottom are the same five models you've been using. Model number three there is, in some sense, the true model; it's the data-generating model. And deviance is on the vertical axis. That's the estimate of the KL divergence; we'd like that to be small. The blue points are the in-sample deviances, on average across the simulations. Notice they always get smaller as the model gets more complex. The open points are the average performance out of sample of each of the model types. Model three is the best out of sample. That's reassuring. But again, it won't always be. If you have very little data, you're better off using a deliberately unrealistic, simple model, because you can't estimate the parameters of the true model. I should show you an example of that, but I haven't here. And then the dashed trend there is what AIC predicts for each of the models. And how does it get that? It just takes the blue point and adds twice the number of parameters. That's where the dashed line is. It's not bad. It's good enough for government work, as they say. It is also a little bit conservative for the true model, although it's a tiny amount. It's less than one unit of deviance off, which is not something we should get excited about. On average; these are just averages. And with the line segments with the numbers by them, I've shown you the actual distance. AIC expects the length of those line segments to be twice the number of parameters. So for some of them, it's right on.
For the first model, it is: it's two. And the second one is pretty close, 4.1. These are converged, by the way; I doubled the sample size of the simulations, and they didn't move. So I'm just showing you that. And then it's a little off for some of the others. But it's pretty close: it's about twice the number of parameters. One of the reasons AIC doesn't get this exactly right, and we'll have reason to talk about this later, is that the actual overfitting is data dependent. It depends upon your sample and some other things. And AIC doesn't think about the particular, actual sample you have. Others, like WAIC, do. That's one of the reasons it does better. We'll get there as we go forward. For the moment, do you get what AIC is doing? It's not an obvious thing. This is subtle and weird, even though you see it in every journal now, right? It's just like the new stargazing. I don't want you guys to go forth and do that. But it's trying to estimate the out-of-sample deviance. And at least when its assumptions are met, which is what we're looking at here, it does a great job. Sadly, in the real world, assumptions may not be met. But if you look at it when you have a lot of evidence, same basic story, the same identity for the line that's AIC, it does a pretty good job. It's a little bit over-conservative, but by a very small amount. It's just one or two units of deviance, which is not much to get excited about. Again, these are just averages. It's entirely possible, for any particular sample you have, that the actual data-generating model is going to rank worse than some other model, because there's variance; these are distributions. So we shouldn't get too overconfident about it. That said, it performs quite well in simulation. It lives up to what it advertises on the box. Okay. Let's talk about two other information criteria, the ones that I'm going to encourage you to use. AIC is great. And if you're, I'll get your question in a second, David, if you're using multiple regression models with almost completely flat priors or with a lot of data, go ahead and use AIC. Rock out. There's nothing wrong with that. But often there are good reasons not to use flat priors, especially when we get to multi-level models, where the whole point is not to have flat priors. So you're going to want something else there. So I'm going to show you two that can deal with that. David, you have a question. Are we seeing the sequential thing yet? Is it part of the story you're telling here, or is it something else? We are not seeing the sequential thing. You will not see anything about the sequential thing. Yes, I just didn't want to leave it like this is the only way to decide if a model is good. Right, it depends on your purpose, still. I'll have a little bit more to say about that at the end today. But I think there are lots of prediction scenarios that are different. Like, if I get a chance, I want to talk about the blizzard in New York recently. Because of the assumption of flat priors, is that why people use AIC with frequentist methods? Well, that's a good question, and I don't want to go off on it, because there's a lot of philosophy here. The question was: because of the assumption of flat priors, is that why people will use AIC with frequentist methods? Not exactly. You can derive AIC from a Bayesian starting point or a non-Bayesian one and end up with the same criterion. So is it really either-or? No. I mean, there are different starting points.
It's like how linear regression can be justified a bunch of different ways, too. There are versions of justifying linear regression that don't even have likelihood functions, and they're perfectly fine. So is linear regression Bayesian or not? It's not a good question. Does that make sense? Yeah. There was another hand. Did I suppress it? Was it happy? Was it pacified? Yeah. All right, let's get back to the awesome David Spiegelhalter, coolest living statistician. So the deviance information criterion is a generalization of AIC that became rapidly popular because it was built into this software package called BUGS, which stands for Bayesian inference Using Gibbs Sampling. It does Markov chain Monte Carlo, the kind of thing we'll start using in the second half of this course. You calculate DIC with samples from the posterior distribution, like the kinds you've been working with already. You can also calculate it analytically, but it's great that it can be calculated with samples. DIC is nice because it doesn't require flat priors. That was the whole idea; it's more intelligent about that. It does require a reasonably Gaussian posterior. I mean reasonably, right? If it's really skewed, then DIC will freak out. And you will usually notice, because it'll tell you something like it has a negative number of parameters, which is a nice warning. It also requires that the effective number of parameters be much less than the number of cases you're fitting. I'll say more about what effective means here in a moment. And in the rethinking package, there's a function called DIC which you can give a map or map2stan model to, and it'll give you the DIC value. It'll sample from the posterior distribution of the model and use those samples to compute this thing. So how do you compute it? I want to show you its formula, because it's not actually that awful, and it's very much like AIC. In fact, if you impose the assumption of flat priors on this, it reduces to AIC right away; AIC is a special case of DIC. So DIC, let's look at the top expression on the slide first, is D-hat, which is the deviance at the average parameter values in the posterior distribution. The hat there is a peak, right? So it's like the MAP estimates. If you pull the MAP estimates out of a multivariate Gaussian posterior distribution and plug them into the likelihood function, you get the deviance at the MAP estimates. And then you add to that twice the difference between D-bar and D-hat. D-bar is the average deviance. What that means is you take a bunch of samples from the posterior distribution, compute the deviance for each of them, each set of sampled parameter values, and then you average across those deviances. So think of it this way: parameters have a posterior distribution; therefore, so does the deviance. D-bar is the average of the posterior distribution of the deviance. It makes sense. So it turns out that this is approximately the expected deviance on test data, just like AIC was. But this works when your priors are not flat. The second version of the formula is mathematically the same. I leave that as an exercise for the student. Why? Because algebra is good for you. You should do a little bit every day with breakfast. The people laughing over here took my course last quarter, and you're scarred. That course is ten weeks of algebra, and twenty hours of algebra at home every day, something like that. Not every day, sorry. Every week. And here we do almost no algebra.
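Here is the expression being described, reconstructed (the notation is mine, since the slide isn't shown): D-hat is the deviance at the posterior mean of the parameters, and D-bar is the mean of the posterior distribution of the deviance.

```latex
\mathrm{DIC} = \hat{D} + 2(\bar{D} - \hat{D}) = \bar{D} + p_D,
\qquad p_D = \bar{D} - \hat{D}
```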
And I feel constantly guilty about the lack of algebra, because algebra is good for your soul. But anyway, the two versions are the same; you can confirm it. And what's nice about this is that the difference between D-bar and D-hat is often called p-sub-D, the effective number of parameters. It's analogous to the k in AIC, the number of parameters. I'll say more about this on the next slide. Yeah, question. So the question was: are D-hat and D-bar often very different from one another? They can be. With flat priors, that difference, on expectation, is the number of parameters. So the more complex the model gets, the bigger that difference gets. It's very strict; there's almost no variation in the estimate when you have flat priors. It grows proportionally. I'll show you on the next slide, actually, something where this happens, in a sense; I'll show you the flexibility and what priors do to it. pD is called the effective number of parameters because, when you have priors, parameters don't have full flexibility: they can't extract a whole unit of flexibility from the data. With flat priors, the AIC formula is telling us that for every parameter you add, it's like adding a dimension to the posterior, and you get a unit of overfitting. And in the deviance calculation, you double that unit of overfitting, and that's why it's 2k in AIC. But when you have priors, adding a dimension to the posterior doesn't necessarily give you a whole degree of freedom, a whole new unit of overfitting, because the priors will damp down how much information can be extracted from the sample. So the more informative the prior is, the less the model learns, and the fewer effective parameters you typically get. That's the goal of regularizing priors: to make the effective number of parameters less than the actual number of parameters. And in your homework, you will have a lot of fun with this, I anticipate. It's like the Jedi mind trick, right, sort of thing. These are the parameters you're looking for, sort of thing. Sorry, that was awful, but it's about average for my go-to stuff. So, all right. I'll show you what this means. Here are some simulations I did last night while I was watching Pacific Rim with my son, so I can't attest to being fully cogent, but no, I did it correctly, I'm pretty sure. So what I did, and this is the code that generates the different points on this plot, is fit a very simple regression model to some simulated Gaussian data. It doesn't matter what it is. And we know sigma; let's say I know sigma, so we're only estimating one parameter in the model, just to keep it intelligible as an example. And what I'm varying on the horizontal axis of this graph is the standard deviation of the prior on the Gaussian mean in this model. That's all. Here I've just called it s, and I wanted to show you that in map you can actually pass parameter values in as data, because they just get plugged in there, right? That's how R is: replace numbers with symbols, and it works fine. And then at the end, we extract DIC. And what I'm plotting here is pD, the effective number of parameters. When you call DIC, it'll give you both DIC and pD as separate outputs. So you can... And what I want to show you is that when the prior standard deviation is large, and one is enough here, what DIC tells you is that the effective number of parameters is the same as the actual number of parameters, right? Because it's fully flexible. It's as flexible as it can get. Does that make some sense? Because it's an uninformative prior.
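A sketch along the lines of the simulation just described; it is not the actual slide code, and the toy values are my own. One Gaussian mean parameter, sigma treated as known, the prior standard deviation s passed in through the data list as described, and pD computed by hand from posterior samples as D-bar minus D-hat.

```r
library(rethinking)

set.seed(95)
y <- rnorm(20, mean = 0.5, sd = 1)   # simulated Gaussian data

pD_for_prior_sd <- function(s) {
  m <- map(
    alist(
      y ~ dnorm(mu, 1),       # sigma known, so mu is the only parameter
      mu ~ dnorm(0, s)        # s supplied through the data list
    ),
    data = list(y = y, s = s)
  )
  post <- extract.samples(m, n = 2000)
  dev  <- sapply(post$mu, function(mu) -2 * sum(dnorm(y, mu, 1, log = TRUE)))
  dbar <- mean(dev)                                          # average deviance over the posterior
  dhat <- -2 * sum(dnorm(y, mean(post$mu), 1, log = TRUE))   # deviance at the average parameter value
  dbar - dhat                                                # pD, the effective number of parameters
}

# pD is near 1 (the actual number of parameters) when the prior is wide,
# and falls toward 0 as the prior becomes very strict:
sapply(c(1, 0.5, 0.2, 0.05, 0.01), pD_for_prior_sd)
```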
A prior that wide is acting effectively flat. So the Gaussian prior with a standard deviation of one is pretty uninformative, sufficiently uninformative here that it doesn't reduce the effective number of parameters. But then, as we make this prior stricter and stricter, moving to the left, the effective number of parameters drops. There's jiggle there because, I think, I did a hundred simulations at each value, something like that. I was watching Pacific Rim at the time. Forgive me. Well, I was doing this and not watching Pacific Rim. But it goes down. Eventually, the prior gets so strict that your model effectively has zero parameters, because it can't learn anything. That's when the posterior distribution collapses to a singularity; basically, that's what happens, effectively. Does this make some sense? What's going on? It just has to make a little bit of sense right now. When you do your homework, you'll get a lot more practice with this, and it'll make more sense. I should pause for questions. Yeah. So down on the left side of the graph, then, that means that's like the big peak that we had when you were looking at different priors that we can give it? Oh, yeah. So the question was: on the far left, that's a prior that's really peaked. Yes. When the standard deviation of this Gaussian prior is small, it's a very peaked mountain, very high. Yeah. So that would have been too conservative? We can't say here, because we don't have a right answer to compare to. So we don't know. That'll depend. But yeah, that's when you get into that risk. Absolutely. It all depends. That's a good question. Okay. Other questions before I move on? No? Okay. If something's unclear, let me know. Before we look at the performance of DIC, let's get the other one out, WAIC, and then we can compare them and put them both up at the same time. So DIC is good, but it still has some restrictions. Most annoyingly, it still requires a reasonably multivariate Gaussian posterior distribution. Lots of model types don't produce that. Although later in the course, I'll show you some tricks, maybe, for taking things that are non-Gaussian and coaxing them into being Gaussian; you can do that by rescaling the model. Better would be this more recent information criterion, the so-called Widely Applicable Information Criterion, or WAIC, however you want to pronounce it; it's up to you. This was published around 2010 by Sumio Watanabe, who's carrying on the work of Akaike and others. And this is his frightening book on the right. It's a great book, actually, but it's algebraic geometry. So if you're not into algebraic geometry, don't get it. But in it, he does a lot of foundational work to get up to this information criterion. And the reason I'll introduce you to this is that it performs really well. It really beats the pants off DIC. And I think it's not yet huge, but in the Bayesian statistics community, everybody knows about this measure now, and it's going to be big once it gets into more packages. So I wrote it into rethinking, and you just need to type WAIC: give it your fit model, and it will compute it. Here's its formula.
Oh, I wanted to say: yeah, just like with the Akaike information criterion, people have started naming this one after its inventor. Watanabe did not name it after himself, but you'll often see people call this the Watanabe-Akaike information criterion, just because, I don't know, people do that. But this is what we've got. So the formula is deceptively simple. Deceptively, as I'll reveal on the next slides. It's minus two times something called the lppd. That's a Bayesian log-likelihood: the log-likelihood of the model averaged over the whole posterior distribution. WAIC performs better because it uses the whole posterior in a way that the previous information criteria don't. What's the minus two for? Remember the minus two on deviance? Because reasons, right? It was there. Yeah, that's why, because reasons. And then we add twice the effective number of parameters again, but the way that's computed is different: it's the p-sub-WAIC. And you don't need to make any assumptions here about the shape of the posterior distribution. In fact, WAIC was invented to handle models with singularities, so it can do some really slick stuff. Yeah, can you say more about what the lppd is? On the next slide, I'm going to say more than you ever want to know about it, but it's a fair question. Yeah, lppd; I'll say what it stands for in a second. But it's a Bayesian log-likelihood of the data, the computational part of the deviance. Bayesian, because it averages over the posterior. Let me show you what it does. Oh, and by the way, you compute this from samples from the posterior distribution, like with DIC. I'm going to say a little bit on the next couple of slides about how to do that. Although in practice, you probably won't have to do this yourself; you can lean on packages. But if you ever do have to do it yourself, it's not that hard, actually. And if, in your future, you fit a model and you've got a bunch of posterior samples for it, and you want to compute this by hand, I can send you a script example that does it. People do this a lot, and it's not that tough, actually. It's pretty easy to do. Let me walk you through the verbal version of how it works, each of the pieces, to give you some idea of what it's doing. WAIC performs better because it uses all the uncertainty in the posterior distribution to compute its pieces. This thing, lppd, is the so-called log pointwise predictive density. And it's the action part of what we call a Bayesian deviance: the deviance averaged over the whole posterior distribution. And better than that, it's pointwise. It takes each data point by itself, because the model assumes they're independent, so you need to average the uncertainty on them independently. And one of the cool things about this is that some of your data points are more problematic than others; they're the ones that are hard to fit. And WAIC figures that out from this. It figures out that some of those data points have a really horrible deviance and some of them have a really good one, and it uses that information. So here's the expression for it. Let me just talk you through it. You're not going to have to work with this expression, but it's worth understanding, in some sense, what it's doing. The big sigma here in the expression is just summing over all the cases in the data, i from 1 to n, where n is your sample size.
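A reconstruction of the expression being referred to (the slide isn't reproduced here, so the notation is mine): with S samples theta_s from the posterior and n observations y_i,

```latex
\mathrm{lppd} \;=\; \sum_{i=1}^{n} \log\!\left( \frac{1}{S} \sum_{s=1}^{S} \Pr(y_i \mid \theta_s) \right),
\qquad
\mathrm{WAIC} \;=\; -2\left( \mathrm{lppd} - p_{\mathrm{WAIC}} \right)
```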
So what this expression means is: for each case in your data, take the log of an average. Why an average? Because in probability theory, integrals almost always mean averages. Average over what? Well, you go to the end of the expression to figure that out, and there we're averaging over the posterior distribution of the parameter theta. This part is the posterior probability, and this part is the likelihood: the probability of that case, conditional on the parameter value. So for each case, we compute the likelihood at each sample from the posterior distribution. Average those likelihoods. Then take the log of that average. And then sum all those logs, and we get the lppd. And you're like, why would we want to do this? Because it's the right way to compute a Bayesian deviance, that's why. The averaging is done for each prediction. So you use the posterior distribution to average your predictions. The predictions are on a probability scale, right? So we do the averaging there, and then we take the log. There was a hand. Yeah. So this is different from the regular log-likelihood? Right. So the question was: this is different from the ordinary log-likelihood, because we're using the whole posterior distribution. Yes. And in the Bayesian literature, you'll often hear the ordinary log-likelihood called the plug-in estimator, because you just plug in the MAP estimate, is the idea. And we've been using that too. And often it's extremely useful. But the advantages of WAIC arise from using all this extra stuff. So it's pointwise. Another thing: those of you who've done some cross-validation might recognize this as a cross-validation issue. You can consider the error, the training and test error, one case at a time, because some of the cases are problematic and some of them aren't. Some of them are informing your parameters and some of them aren't. And WAIC figures this out by considering the uncertainty case by case; that's what it's doing. It makes this computationally annoying, though. So here, let me talk you through the recipe. We do all the calculations pointwise. For each separable piece of data y, we compute the likelihood, the probability of that piece of data conditional on the parameter values, for each sample from the posterior. We average those likelihoods together. Then we take the log of each average. And then we sum them all together, and we get the lppd. This generates a lot of computations, as you can imagine. Say we've got 1,000 data points and 5,000 samples from the posterior distribution: that's 5 million likelihood calculations. But hey, your computer is pretty awesome, right? It can handle it. But this is what it does. And when you use the WAIC function in rethinking, you may sometimes have to wait a little bit. But it'll give you a progress update. It'll be like, hang on, I'm calculating, and you'll get this little progress bar. That's there for your benefit. Really big models might take some time. For most of the models in this course, it'll just be a minute; it won't be a big deal. OK. The next piece is the one that estimates the effective number of parameters. And it's also pointwise, because in the true view of it, every separable piece of data has its own effective number of parameters, and they can differ. I'm going to let that be a little bit mysterious right now, and when we get later on in the course, I'll have some cool examples, I think, of when this comes into play.
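Here's a sketch of that recipe as code. The toy model and data are my own, not from the lecture; it computes the lppd and the pointwise penalty term (whose recipe is described just below) from posterior samples, and checks against the rethinking package's WAIC function.

```r
library(rethinking)

set.seed(12)
N <- 50
x <- rnorm(N)
y <- rnorm(N, 1 + 0.7 * x, 1)
d <- data.frame(x = x, y = y)

m <- map(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 1),
    sigma ~ dunif(0, 10)
  ),
  data = d
)

post <- extract.samples(m, n = 3000)

# Log-likelihood of each observation under each posterior sample:
# rows are samples, columns are observations.
ll <- sapply(1:N, function(i)
  dnorm(d$y[i],
        mean = post$a + post$b * d$x[i],
        sd   = post$sigma,
        log  = TRUE))

lppd  <- sum(log(colMeans(exp(ll))))   # average likelihood per case, then log, then sum
pWAIC <- sum(apply(ll, 2, var))        # variance of the log-likelihood per case, summed
-2 * (lppd - pWAIC)                    # WAIC
WAIC(m)                                # the packaged version, for comparison
```

With 50 cases and 3,000 samples, that's 150,000 likelihood calculations: small here, but you can see how it scales.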
The easy way to think about how observations can differ in their effective number of parameters is this: some observations are right in the middle of the high-density region of the likelihood, and they're extremely likely. Others are kind of out in the tail. And when you're out in the tail, a small adjustment in a parameter value can radically change the likelihood of that observation. So those observations are way more sensitive to the parameters than others are. So it's data dependent in a way that AIC and DIC were not, and that's one of the reasons it's better: it does a better job of anticipating overfitting. Because your exact sample can mislead you, and conditional on your model and your exact data, WAIC can figure out some of these risky cases. Now, again, later on in the course, we'll have some more misbehaving models. Right now, I can't give you good examples, because everything's Gaussian, and Gaussian stuff behaves well. Anyway, maybe next week I can put something in. OK. So how do we calculate this piece? You compute the log-likelihood of each observation for each sample of theta. You compute the variance across those log-likelihoods. And you do this for each observation and then sum them. The variance is a measure of the width of the posterior distribution for that particular observation. And that width is the flexibility; it's the uncertainty of the model about it, the spread that you get. So the total, across all the data you used to fit the model, is a measure of the flexibility of the model, just like the other penalty terms. When the priors are informative, this will sum to something less than your number of parameters. When your priors are flat, it'll be almost exactly the same. It won't always be exactly the same, because it takes account of the actual data now, and depending upon the data, the effective number of parameters may be different, really different, from the actual number. OK. So that's part of why AIC is not always perfectly accurate, for example. OK. You don't have to understand the exact details right now. Yeah. Here's what I want to assert, what I've been asserting all along: WAIC is better than DIC, typically, even though it's a pain to calculate. Well, it won't be a pain for you. You type WAIC. It's just one extra letter for you. It won't be that big a deal. And often they agree. So in your homework, I encourage you to check them both, actually, and you'll see that they give the same recommendations. That's because we're in Gaussian land right now, and everything's very well-behaved. In fact, everything's so well-behaved right now that the frequentist methods would give you the same rankings of things. It will make almost no difference. Once we get into multi-level models, that won't be the case. But everything's well-behaved and Gaussian right now. The issue here, if you want to think about it, is that when the mean is not a good summary of the posterior distribution, DIC does poorly, because it assumes the mean is central. When the posterior distribution is Gaussian, it's symmetrical and declines in an expected way as you move away from the middle. So the mean is a great summary. The mean and the variance are all you need, in fact, to describe the whole distribution. And that's what DIC assumes. When that's not true, it goes, as I say, squirrely, which is, by the way, the word that is hardest for non-native English speakers to say in English. Search on YouTube for Germans trying to pronounce squirrel. I grew up in Germany, so I like to pick on my German friends this way. It's like, squir-wheel.
Squir-wheel? Is it squir-wheel? Come on, Richard, how do you say it? No, it's squirrel. It's like, squir-wheel? No. I love you guys, I do. I know you're listening, but it's great. Now, I mean, there are German words for it, too. The German word for squirrel, actually, is pretty hard for English speakers. We'll do that next time. So when that happens, when the posterior isn't Gaussian, DIC can actually tell you a negative number of parameters, and mixture models also cause a lot of problems for DIC, because they have these points in the likelihood which are singularities. And WAIC was partly designed to deal with that. For now, DIC works great; WAIC and DIC are going to give you basically the same answers. So don't assume that, you know, just because there's a bunch of published papers that use DIC, they did it wrong somehow. They're probably fine, especially if they checked for a Gaussian posterior. But you might as well just use WAIC. But definitely don't mix them: choose one criterion and use it for all the models you want to compare, fit to the same data. We'll have an example in a moment. The major drawback to WAIC, though, and sometimes you have to fall back on DIC, is that it requires separability of your data. And there are model types where that's a difficult thing to think through, like a time series. In a time series, you can't just take out a sample in the middle of the series, because it's a cause of all the points after it; they're not independent of one another in the way the examples we've been working with are. So in those cases, WAIC is awkward to use, and probably the best thing to do, at this level of the course, is just to fall back on DIC, or on cross-validation of some kind designed for time series. We'll have an example of a network model later, too, where basically all the observations are co-varying, all co-determining, and we have to think hard about what's going on. So there is no oracle in this business. What encourages me is that when I started graduate school, neither DIC nor WAIC existed, and things have gotten a lot better. There are also a whole lot of new Markov chain Monte Carlo algorithms that didn't exist then. A lot of people are doing Bayesian inference, and we're learning a lot about Bayesian inference pretty rapidly. It's a good time to be Bayesian, so to speak. Katrina. So what are the assumptions of WAIC? It assumes that the effective number of parameters is a lot less than the sample size. Yeah. That's it. Yeah, yeah. Well, it also assumes the data are separable, which is a pretty big limitation, I think. That's the one I wanted to emphasize, for that reason. I'm working on models right now where I can't use it. It's annoying, but there you go. Okay, let me show you the comparison. So I'm not showing in-sample performance on these plots, which are the familiar kind of plots you've already seen today and on Tuesday, because that's boring. This is just out-of-sample performance; it's like the top parts of those previous graphs. And the open points, again, are actual performance out of sample. And what I'm showing you on top is DIC, with trend lines for two different regularizing priors. The blue trend line is the Gaussian prior with a standard deviation of a half, and the black line is an effectively flat prior. So the regularizing prior helps, and DIC anticipates that it's going to help: it ranks things right. It is a little off, but look at the units on the vertical axis. It's a tiny amount of deviance that they differ by.
So it's like less than one unit off in most cases. It gets the ranking right, which is what matters, right? It gets the relative ordering right. It does a great job of that. WAIC, same story, same simulation conditions. It's more accurate, right? It gets closer to the actual performance. Same models. Make sense? If you're unsure, let me know. There's a lot of sagely head-nodding going on in parts of the room. Yeah, there's a question. Sorry, what was the... I didn't catch what the blue and black were. Blue and black are different regularizing priors in the models. So in the blue, we're using a regularizing prior, and in the black, we're not. So we expect less overfitting, and that's what we get. And both of these information criteria anticipate that correctly. They get the rankings right, and they get the approximate improvement about right, which is encouraging. That's what they're designed to do. And in these simulations, their assumptions are met. That's a pretty good job. So I wanted to say: don't get cocky, because this is still a model of forecasting, and your actual forecasting situation makes a bunch of assumptions, like the uniformitarian assumption that the future is like the past, right? This is like the curse of science, that we have to assume the future will be like the past. And it's a pretty good assumption in geology. But in the social sciences, it's probably not, right? The things we measure in the social sciences are, let's face it, goofy. They're just convenient, ephemeral measurements, like political party ID. Next week it can mean something completely different to people. And in biology, if you think hard about it, it's not too hard to see that we measure things like that too; weird metrics in community ecology are often like that. So you should be skeptical. Remember, this is still a model. It's not a crystal ball. That's what the Legos are there for, right? This was the only image licensed for noncommercial reuse that had crystal ball attached to it, by the way. So I think that's Lego Gandalf. Is that what that is? I'm not really sure. Anyway, the point is, this is not a crystal ball. It's still a model: a model of model performance out of sample. But if your forecasting situation is different from this, you can still study it. In principle, you just need to do the simulations under that kind of design, right? If you have some specialized sort of thing, like if you're designing a reserve system, I would encourage you to think in detail about the performance criteria of the reserve that you want. Accuracy is not the issue there, but something else about the functioning of the system. Being right is not the question when we intervene in the world; being effective is. And I'll say something more about that in a little bit. I think I've got time today. Okay. We finally reached the beach. And let me try to summarize a little bit. Before we move on, I'll give you an example of working through a common data set and a set of models. We'll fit them all to the same data, we'll calculate WAIC for them all, and I'll show you how to work through it in an example. And then your homework will be more practice, along with fun. And in future weeks, we will keep using this over and over again; we're just going to keep accumulating all the way through. So let me try to summarize. I think underfitting is possible. Sometimes we just don't know any of the variables that matter, and we end up with an overly simple model.
But at least in the fields I work in, there's a ton of things about the system we can measure, and the question is instead, which of these things are relevant? Population biology and the social sciences are both like that. You can download a bunch of data or measure a bunch of things; it's not hard to get variables. The question is, which ones actually matter? Ideally theory would tell you which ones matter, but theory is imperfect. So underfitting is a risk. I say underfitting is possible, but overfitting is inevitable. Because even with a model that omits key predictor variables, the estimates for the ones that are there will probably be overfit to your sample. Overfitting happens for any coefficient in a model unless you get the regularization right. So overfitting is basically inevitable. You just have to expect it. It's like death. Underfitting is taxes and overfitting is death. Regularizing priors can reduce overfitting. So as a consequence, overfitting is the one you really have to worry about. Because if you just maximize the fit to your sample, you'll always overfit, and in practice, people do that. So we have these strategies of using regularizing priors to reduce overfitting. Regularize too hard, though, and you reduce learning from the sample to the point where it becomes pathological. So you do have to tune things. And information criteria don't, in and of themselves, do anything about overfitting. They merely measure it, right? They're a weird multi-dimensional ruler that uses the squishiness of your posterior distribution, what the model learned from the data, to guess how much risk there is in generalizing from the estimates. These things work great together. Don't think you only have to use one; you should be using both of them. You want to reduce overfitting, and you want to measure what's going on with that reduction. Like if you're on a diet: you eat less, and you buy a scale. That's how these things work. Questions before we do the example? If not, questions will come to you during the example, and we can work through them. I've tried to leave enough time that we can comfortably get through this, so I think we're basically exactly where I want to be. Okay, let's use them. First thing I want to say is that you want to avoid model selection. Oh yeah, there's a demo there I forgot I'd put in. So, avoid model selection. Model selection is what this literature is usually called. The idea is you compute AIC or DIC or WAIC for all of your models, and then you take the model with the lowest value of that criterion and you throw all the others away. I want to encourage you never to do this. Why? Because this is like throwing away the posterior distribution. It throws away the full information about the uncertainty. Now, of course, there are times when one model vastly outranks all the others, and effectively there's no harm in throwing the others away. That's analogous to the case where your posterior distribution is incredibly peaked over a particular value, and in that case just using the MAP value is probably okay. But if you just use the general procedure and keep them all, then you don't have to make a decision about that, and there's no harm in using them all. So instead I want to encourage you to always do model comparison. Retain the whole set. Think about all the models you want to fit, and use comparisons among them to learn why some models outperform others. We're going to talk about that as we go through examples, both today and in future weeks.
And again, I keep saying it this way: the differences in deviance among your models, or I should say the differences in expected out-of-sample deviance among your models, are informative about the set of models in the same way that differences in posterior probability are informative about parameter values. So it's helpful. And then, when you retain them all, you can do model averaging. Model averaging means that for each of the models in the set you simulate predictions from it, and then you use the relative expected out-of-sample deviance of those models to build an ensemble prediction that uses the predictions of all the models together. This is really common in actual real-world prediction scenarios, like weather forecasting, actually. Except in New York. We'll talk about that later. Okay. Does it make some sense for now? Alright. So it's not too late to WAIC it. WAIC it good. Sorry, this slide just came to me. Alright, let's move forward. Sorry. I can do this all day. So, primate milk. Let's go back to a familiar data set so I don't have to introduce a new one. The primate milk data again. Nothing new about it; this was from last week. I'm just going to load the data set. The only thing different is that I'm going to take the neocortex percent and divide it by 100, so it's the proportion of brain mass that is neocortex. And I'm doing this for a reason that will come up later; I can teach you something new about parameter interpretation. Let's fit a series of simple linear regression models to these data. In these I'm actually going to use perfectly flat priors, partly because I just want to show you how to do it; I had to show you an example of that. If you leave the priors out, you can do that with map, and then it basically reduces to a frequentist procedure. Nothing wrong with that. It doesn't make you a bad person. In this case there's enough data in the set that you could use regularizing priors and it wouldn't make any difference, actually, to the estimates. But when you leave out the priors, you have to provide start values with this optional start list down below. When you include priors and leave out the start list, map samples starting values from the priors and then starts crawling uphill. When you don't have a prior, there's an infinite range of values it could sample from, and it's like, no, I won't do that. So you've got to give it a start value. But it works exactly the same way. The first model is the intercept-only model. There's just the alpha embedded right there in the likelihood; nothing fancy going on. In the next model we actually get a linear model: we add in neocortex, but only neocortex, to predict milk energy. In the third model we put in only the log body mass of the mothers in each species. And remember, the point of this example last week was that there's a masking effect: these two predictors are correlated with the outcome, but in different directions, and they're correlated with one another, right? Because species with bigger brains also tend to be heavier. Remember that story? So those are the models that have only one of the covariates, and they're vulnerable to the masking problem. And then here's the model that has them both, the model we looked at last week. I want to show you what information criteria do with this model set. Now, you could take each of those models, pass them to the WAIC function, and process it all by hand. But there's a convenience function called compare; the whole thing, data prep through compare, is sketched below.
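For concreteness, here is a minimal sketch of the data prep and the four models just described, along the lines of the book's code. It assumes the milk data that ships with the rethinking package, with columns kcal.per.g, neocortex.perc, and mass, and the particular start values are just one sensible choice.

```r
# A sketch of the data prep and the four milk models (assumed column
# names: kcal.per.g, neocortex.perc, mass).
library(rethinking)
data(milk)
d <- milk[complete.cases(milk), ]       # drop species with missing neocortex
d$neocortex <- d$neocortex.perc / 100   # percent -> proportion

# with flat priors, map() needs explicit start values
a.start <- mean(d$kcal.per.g)
sigma.start <- log(sd(d$kcal.per.g))    # sigma is fit on the log scale

# intercept-only model
m6.11 <- map(
    alist(kcal.per.g ~ dnorm(a, exp(log.sigma))),
    data = d, start = list(a = a.start, log.sigma = sigma.start))

# neocortex proportion only
m6.12 <- map(
    alist(kcal.per.g ~ dnorm(mu, exp(log.sigma)),
          mu <- a + bn * neocortex),
    data = d, start = list(a = a.start, bn = 0, log.sigma = sigma.start))

# log of maternal body mass only
m6.13 <- map(
    alist(kcal.per.g ~ dnorm(mu, exp(log.sigma)),
          mu <- a + bm * log(mass)),
    data = d, start = list(a = a.start, bm = 0, log.sigma = sigma.start))

# both predictors
m6.14 <- map(
    alist(kcal.per.g ~ dnorm(mu, exp(log.sigma)),
          mu <- a + bn * neocortex + bm * log(mass)),
    data = d,
    start = list(a = a.start, bn = 0, bm = 0, log.sigma = sigma.start))

# compare all four models by WAIC
compare(m6.11, m6.12, m6.13, m6.14)
```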
You give it a bunch of fit models, map models or, later on, map2stan models, and it computes WAIC for each of them just by calling WAIC, and then it makes a nice little summary table for you where each of the models is listed. If you want DIC instead, you can also do that; check the help file. Let me tell you what these columns mean, though. That'll help you understand how to use the output. The first column is WAIC itself. This is an estimate of the expected out-of-sample deviance. The only thing to say about it, and I used this example for this reason, is that it's negative. That's okay. Why can that happen? Because the deviance is minus two times a sum of log-probabilities, and for continuous outcomes these are log densities. Densities can exceed one: a Gaussian with a small standard deviation can easily assign density greater than one to the data. The log of a number greater than one is positive, and minus two times a positive total is negative. So there's nothing naughty about a negative WAIC. Smaller is still better, which means more negative is better. So the best model here is model 6.14, which has both predictors. See that? Okay. I wanted to use an example like this because this happens a lot and people freak out. They freak out understandably, because in all my other examples it was a positive number. But there's nothing wrong with it. It's a symptom of the fact that the scale of the deviance is uninformative. Its absolute value is uninformative; only the relative values are informative. Okay. The second column is the effective number of parameters. These are close to the actual number of parameters, but not exactly the same, although almost exactly the same here. In the top model it's close to five, and it's about two in model 6.11, the intercept-only model, which has two parameters: the alpha intercept and sigma. Remember, sigma is also fit; it's a dimension of the posterior. The other two single-predictor models each have three. And somehow WAIC magically knows roughly the number of parameters in your model. It gets at that from the squishiness of the posterior distribution. Yeah. You're probably going to go over this later, but can you compare models that are the same except for different priors? The question was, can you compare models that are identical except that they have different priors? Yeah, in principle you can. In my experience, it doesn't do a great job there. I think you should decide your priors based on some domain knowledge or some sense of the overfitting risk instead, and use this to compare different likelihood structures. That's what these criteria are good at; that's what they perform well on, in my experience. With priors, it does work to some extent: I showed you simulations where it anticipates the reduction in overfitting from stronger priors. But you can probably do better than that. And in your last homework problem you'll get a whiff of this, because you're going to tune a prior, but you're going to do it out of sample. So there's a case where you can actually do the exercise and see what you think. Next column: dWAIC is just the difference of each model's WAIC from the smallest WAIC in the set. So the best model gets a zero, and then for every other model it's how many units of deviance worse that model is than the best one. You can see the differences. If you want to see the nuts and bolts behind those first two columns, there's a sketch below.
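A hand-rolled sketch of the WAIC computation, for intuition only; the loglik matrix and its layout (posterior samples by observations) are assumptions for illustration, since WAIC() and compare() handle all of this internally.

```r
# A sketch of WAIC computed by hand. 'loglik' is assumed to be an
# S-by-N matrix of log-likelihoods: S posterior samples (rows) by
# N observations (columns).
waic_by_hand <- function(loglik) {
    # lppd: average the likelihood (not the log) over samples for each
    # observation, then take the log. For continuous outcomes these are
    # densities, which can exceed 1, so each term can be positive.
    # (In practice a log-sum-exp trick is used for numerical stability.)
    lppd <- apply(loglik, 2, function(x) log(mean(exp(x))))

    # pWAIC: variance of the log-likelihood across samples, one term per
    # observation -- the "effective number of parameters" penalty.
    p_waic <- apply(loglik, 2, var)

    # put it on the deviance scale; smaller (possibly negative) is better
    -2 * (sum(lppd) - sum(p_waic))
}
```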
And the weight column I'm going to explain on the next slide, and then the last two columns I'll explain on the slide after that. Alright. So what are these weights? You see these in papers all the time. They're called Akaike weights, and I'll call them that too. They are model weights computed from the information criterion values. The information criterion value is on the deviance scale: it's an estimate of relative divergence, in units of minus two times log probability. You can undo that transformation and put things back on the probability scale. When you do that for a set of models, you end up with these weights, which are relative amounts of evidence, out of the total evidence, for each model in the set. It can't tell you anything about models you didn't fit. But what it says is: conditional on this set of models and these data, here are the relative weights of evidence for each. In simulation exercises you can often interpret these approximately as probabilities, although they're not exactly probabilities, which is why I'm nervous about it. Heuristically, it works out in simulation that if you run a bunch of out-of-sample tests in a setting like this, then in about 93% of them model 6.14 will make the best predictions on average. People haven't figured out exactly why that works yet, in my opinion; there's still active debate about it. So that's why I'm nervous. I want you to think of these weights as heuristic. They're easier to compare and think about than the raw deviance values. The transformation itself is simple; there's a sketch of it below.
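A minimal sketch of that transformation; the function name akaike_weights is just illustrative, and the input is assumed to be a plain numeric vector of WAIC values, one per model.

```r
# Akaike weights from a vector of WAIC values, sketched by hand.
akaike_weights <- function(waic) {
    dwaic <- waic - min(waic)   # difference of each model from the best
    w <- exp(-0.5 * dwaic)      # undo the -2 * log-probability scaling
    w / sum(w)                  # normalize so the weights sum to 1
}
```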
The thing is, though, we need some estimate of the error on all this. WAIC does vary, right? There's a sampling distribution for it: imagine a bunch of test cases with a bunch of different samples you're trying to predict. Sometimes your estimates will perform well, sometimes they won't. So there's a sampling distribution of the out-of-sample deviance, and the last two columns give us information about it. SE is the standard error of WAIC, and dSE is the standard error of the difference between the WAICs of two models. I'm going to spend the next few slides talking about these, because I think they're really important: they guard against overconfidence in the model rankings. So on the slide I've highlighted the standard error columns. If you're interested in how to compute them, I talk about it in a box in the book. If you plot the output of this compare table, you get this little dot chart down here. I encourage you to use these plots; it's easier than reading numbers. Let me walk you through what they mean. The first thing to note is that the filled points in each row of the dot chart are the in-sample deviances of each model. The most complex models are always furthest to the left, and being to the left is good, because that means smaller deviance. The open points are the out-of-sample deviances as estimated by WAIC, the WAIC values. They're larger, as we expect from overfitting, right? And the distance between the filled and open points is the effective number of parameters. Then what are these lines? The lines are the values in those last columns: the standard errors, extending in each direction, of the sampling distributions of WAIC. This first one, the SE, is 7.5. On this scale, that's a lot. So you should expect your out-of-sample prediction to vary a lot given these data, because this is a small data set. That's why the standard error is big relative to the distance between these models: there are only 12 species left after you drop the missing values from the data set, I think. And the other models have their own standard errors, all centered on their open points. Does that make some sense so far? Now, again, you don't want to use the overlap between these intervals to do the comparisons. You want the contrast, right? That's always the thing. If you want to know the sampling distribution of a difference between two models, you need to compute the distribution of that difference, not eyeball the overlap between intervals. It's always naughty to do that. Naughty-naughty. People do it all the time, and they go to purgatory for it, right? So I've put the differences on the same plot. I've interspersed them in a perhaps ugly way, but the package is free; this is what you get, and I'm looking for suggestions to improve it, honestly. I put these little delta symbols in between to mark the differences. Those are the differences between the top-ranked model and each of the other models, shown with one standard error of the difference. And that's why there's an NA at the top, because the top model has no difference from itself, right? But it is different from the others. So you can see, I wouldn't be too confident about 6.14 outperforming the null model. On expectation it does, and by a good amount, but the difference is comparable to its standard error. You can think of that as your relative risk. It's not nothing; it's still a lot. All the evidence supports 6.14 as the best model, but it's not a slam dunk; the intervals do overlap. And then the other models are basically all equally bad. Still, you learn something from these data, and that's good. Does this make some sense? This plot is a tool to help you visualize it. In future weeks there will be examples where it is a slam dunk, where it's really clear, so you'll be able to see the contrast. If you're curious where those difference intervals come from, there's a rough sketch below.
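A rough sketch of where those differences and their standard errors come from, assuming, as the book's box describes, that WAIC() accepts a pointwise=TRUE argument and returns one value per observation in the version of rethinking used here.

```r
# Difference between two models' WAICs and its standard error (dSE),
# assuming pointwise WAIC values are available as plain vectors.
w_full <- WAIC(m6.14, pointwise = TRUE)   # model with both predictors
w_null <- WAIC(m6.11, pointwise = TRUE)   # intercept-only model

diff <- w_full - w_null   # pointwise differences between the two models
n <- length(diff)

sum(diff)             # overall difference in WAIC between the two models
sqrt(n * var(diff))   # standard error of that difference (the dSE column)
```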
Yeah, a question: how many actual parameters are there in 6.14? Five, I think. Five? Okay, I thought there were four. Well, hang on: there's an intercept, there's sigma, and there are two regression parameters. Yeah, so there are only four. So there are more effective parameters than actual ones here. Yeah. And that can happen, because the effective number is data dependent. If there's a species that's an outlier, for example, the posterior distribution can end up very wide and the penalty grows. We could look at it point-wise, in fact; I'll try to do that next week. Another question: can you just round them? The question was, what's the significance of having a non-integer number of parameters? You don't round them; leave them as they are. These aren't parameter counts. The degrees-of-freedom concept that you see in classical statistics, and that appears in AIC, is not generalizable beyond classical models. It's just a coincidence that this quantity equals, in expectation, the parameter count in the AIC setting. It doesn't have to be an integer. It's unfortunate that the history leads us to attach the word parameter to it. Obviously it's related to the number of parameters, because that determines the dimensionality of the posterior distribution, but it's better to think of it as a measure of the flexibility of the model. There's nothing special about integers here. Does that make some sense? So you can get more effective parameters than actual parameters, or a lot fewer, depending on how the data are distributed. Later on, we'll have cases where there's an observation right at a boundary. If the data can't go below zero and you have a lot of data right near zero, that effectively removes half the flexibility of the posterior, because those parameters can only push the predictions in one direction. That happens a lot in nonlinear regressions. But in this case, predictions can slide in both directions forever. Okay, I've got, like, four minutes. Back to the horse race. Remember the horse race: these things are heuristic advice. On this track, with these horses, some horse won. The amount by which it won is informative about how well it might run on the next track. But there might be new models on the next track. In fact, we should strive to have new models on the next track, right, and do even better each time. But don't, you know, kill all the other horses and select only the first one. Right? Okay, that's my sermon about this. I love horses. They're wonderful animals. You should not ride them. You should just pet them and feed them. Okay, you can ride them too; they seem to like it, actually. Just don't race them until they drop dead. That's my only wish, please. So, comparing estimates. I don't have enough time to get through all of these slides, so let me get into it, and I'll finish it up early next week. You can start on your homework without having finished this part. But I'm not going to drop it; I won't quit on you guys. We'll get through it, and I'll leave one thing out of next week, maybe, so we can do it all. We want to compare all these models instead of selecting among them, because we get more out of the comparison. Let me show you one of the things you can do. Often, you want to compare the parameter estimates across all the models, and you can do this in table form. There's a function in rethinking called coeftab, which makes a table of coefficients, where each column is a model and each row is a parameter name. By scanning across, you can see how the values of the MAP estimates change as you change the structure of the model. Often this is very instructive. You can see masking effects in it: in the last column here, the model with both predictors, both of these estimates have gotten further from zero, as a consequence of the masking effect we saw before. So it's a way to see the sensitivity of a coefficient estimate to the structure of the model. This sort of presentation is really popular in economics. Economists don't care much for plots, but they like these massive tables, and they get used to reading them; there's a lot you can get out of them. If you plot the coeftab, you get the dot chart on the right there, which is easier to read. Both calls are sketched below.
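A minimal sketch of both calls, assuming the four milk models fit earlier are still in the workspace.

```r
# Coefficient table: one column per model, one row per parameter (MAP estimates).
ct <- coeftab(m6.11, m6.12, m6.13, m6.14)
ct

# Dot-chart version: estimates with posterior intervals, so you can see
# how each coefficient shifts as the model structure changes.
plot(ct)
```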
The bars in that chart are the 95% posterior intervals for each marginal posterior distribution of each parameter. Let me say a little bit about these and what you can do to compare across them. The top line, of course, is the intercept, and there is nothing to be said about intercepts here, as usual. You just get them by default; there they are. You don't usually have any expectations about where they should be, because the intercept is the expected outcome when all the predictors are zero, and that's often not even an observable data point. That's true here as well. We don't normally care much about the standard deviation either. Here I estimated the standard deviation on the log scale, because I wanted to show you an example of a trick; I say more about this in the book. The only thing I want to say about it is: notice that the model that fits the sample best has the smallest value of sigma, the residual standard deviation. That's because it's fitting the data better, so there's less error variance left over at the end. You can see that diagnostically. Then we get to the regression parameters. You can see that they're getting further away from zero in 6.14, but this is why I rescaled neocortex to a proportion. The scale of a parameter is up to you to choose, and it has an effect on your ability to graphically inspect what's going on in the comparison. The intervals here are so small, relative to the scale of the plot, that the plotting circle is bigger than the interval. So I want to re-standardize and show you what happens. If we instead standardize the predictors, meaning take both predictors, log mass and neocortex, subtract the mean from each, and divide by its standard deviation, then they're just z-scores. Then we put them back in the models, fit all the models again, and redo the coeftab. Now we can see what happens. In the top model, 6.14, the estimates get further from zero, right? Whereas in the other models, the ones that have each predictor by itself, they're both closer to zero and their 95% intervals overlap it. It's much easier to see. So if you want to interpret marginal posterior distributions, z-scores are your thing, really. Otherwise you're fighting the measurement scale all the time. Or maybe I just speak for myself: always fighting measurement scales. The problem, of course, is when you plot your data and you want it back on the natural measurement scale. So you've got to develop some skills with shuffling back and forth. But you guys are very capable, so I know you're able to do it. Okay. That's all the time we have today. When you come back on Tuesday, I will teach you how to make an ensemble, and we will talk about meteorology a little bit. Have a good weekend. Your homework is already up on the website.