Alright, I'm going to get my brain back in the stats, and I know you guys have told your brains to get back in the stats. So we're going to pick up exactly where we left off. Here's the outline. We had gotten to masked association under the good. Then we're going to do the bad today as well. Next week is the ugly, if you know the movie formula, which is really the last bullet here: overfitting. Next week is all about overfitting and how to deal with it. So there might be questions that come up, which are perfectly reasonable, and I'll kind of brush them off, because you'll get a lot of it next week. So, we were on masked association. This is the second really useful thing that multivariate models can do for us, despite the fact that, remember, we're still in geocentric land. These linear regressions are descriptive. They're very powerful descriptive engines. They rarely have anything mechanistically to do with anything, right? Nevertheless, they're useful, but it's worth cultivating this feeling of consummate embarrassment every time I use one in a publication, just to try to wean myself off of them. They're so useful and convenient. It's easy to do them, but it's also often easy to do better. So, often we have an association between some outcome and a predictor, but it's masked by some other variable. Let me give you the abstract version of this, and then we'll have a couple of examples to work through. The sort of technical way to think about it, the most common case in which it arises: you have one predictor variable, and it's associated with the outcome in the opposite direction of some other predictor. Both may be meaningfully associated with the outcome, but in opposite directions. And here's the important part: they're also associated with one another. So they tend to cancel one another out. Only if you put them both in the model can you partial out these effects.
Because, remember, these multivariate models address the question: once I know the other predictors, what's the value in knowing this one? So if you only put one in by itself, everything looks like a wash. I'll show you examples of this. But when you put them both in, you can see what's going on. Let me try to provide a simple real-world illustration. Consider, for example, something like a preschool intervention in the inner city. You weren't expecting that to come out of my mouth, but I hang out with social scientists, so I read the follow-up studies. So, you go in and you do a Head Start program or something, and you look at some behavioral measure, some outcome measure, for the kids in those programs. And you might compare them to kids that weren't in the program. And it might turn out that it looks like the program does nothing; say that they're worse off than kids who weren't in the program. But the reason that might occur is because you didn't account for the fact that the program is targeted at needy kids, the kids who need the most help, the ones who are furthest behind at the beginning. Their rate of improvement may have been greater; they may have improved more educationally than kids who didn't have the program. But since you're ignoring the variable of where they started, you don't get to see it. It'd be like a plant growth experiment where you've got plants that start out small and you give them fertilizer, and they're still smaller than the plants that started out taller, because you ignored where they started. In fact, they were the needy plants, and that's why you gave them fertilizer. So it looks like fertilizer doesn't work, but if you also put in the variable of how tall they were at the start of the experiment, then you can see that the fertilizer did work, perhaps. This stuff can happen all the time. In those examples, hopefully you would plan the research so that you collected where they started.
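The fertilizer story can be checked with a quick simulation. This isn't the course's code (the lecture uses R and the rethinking package); it's just a hypothetical sketch in Python with made-up numbers, showing how the treatment looks harmful until you condition on the baseline:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
start = rng.normal(10.0, 2.0, n)            # starting plant height
treat = (start < 10.0).astype(float)        # fertilizer targeted at the small plants
# fertilizer truly adds about 2 units of growth
end = start + 2.0 * treat + rng.normal(0.0, 1.0, n)

# Bivariate: regress final height on treatment alone
X1 = np.column_stack([np.ones(n), treat])
b1 = np.linalg.lstsq(X1, end, rcond=None)[0]

# Multivariate: also include starting height
X2 = np.column_stack([np.ones(n), treat, start])
b2 = np.linalg.lstsq(X2, end, rcond=None)[0]

print(b1[1])  # negative: fertilizer "looks" harmful on its own
print(b2[1])  # near 2: true effect recovered once baseline is in the model
```

The treatment coefficient flips sign because treatment and starting height are negatively associated by design, exactly the masking structure described above.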
But often we work in systems where we don't understand the causation well enough, and we can miss these things. So there's no substitute for thinking, and you have to do the thinking just to nominate the variables in the first place. The statistical model can be a real help once you've got the notion, but it won't necessarily find it for you. So let's look at this in the case of a data set that's in the rethinking package, about the composition of primate milk and brain size, particularly neocortex size, which is the part of the brain that primates are proud of. That's the only way I guess we can talk about it, right? It's the part of the brain that is biggest in humans, so it's the part of the brain we think matters, and we measure it a lot. I think marsupials have a different opinion about which part of the brain is important, and birds yet another, and so on. But from our egocentric primate view, the neocortex is the wrinkly part that we like, so we measure it. And so across different primate species, you can measure milk composition. We're going to look just, for now, at energy in milk, and this varies a lot across mammals. So here's a pretty typical lemur, with about 0.5 kilocalories per gram of milk, and 55% of its brain mass is neocortex. For a mammal, that's really high, but, you know, primates. Humans: really interesting primate. We have more enriched milk than most prosimians, and we are very neocortex heavy; three-quarters of our brain is neocortex by mass. Cebus apella is the brainiest monkey, people say. This is Marcel from Friends, right? Marcel was a different subspecies, but that's the same basic thing. People remember that show, Friends? Yes. There was a monkey named Marcel on it, right? Or was that a character? I'm getting it mixed up. But anyway: really energetic milk, very encephalized. They're the most encephalized monkey, last time I checked the data on this.
So is there a systematic relationship between these? If you're interested in life history evolution among mammals, primates are a good place to look at the relationship between slow life histories and parental investment. Milk is expensive. It gets more expensive the more you put into it. Does this have to do with long juvenile periods and brain growth? This is a standard sort of question. It remains unanswered, by the way. This is a fun teaching data set because no one actually knows the answer, as far as I can tell, at the end. We're going to look at just three of the variables in here. And I've plotted up something here called a pairs plot. Pairs plots are ways of looking at all the bivariate relationships among the variables in a data set; there's a command in R called pairs. So this is a good example of how to use it. These are pretty useful for seeing the relationships. So we're going to look at the relationship between kilocalories per gram of milk, which will be our outcome variable, the log of body mass of females in the species, and the percent of brain mass that is neocortex, at the bottom. What I want you to see here is that if you look at, on the slide, the intersection of, say, the log mass row here and neocortex, this is log mass on the vertical, neocortex percent on the horizontal, you can see they're positively correlated. Pretty strongly. It's not clear that there's any clear correlation, bivariately, between log mass and kilocalories per gram, or neocortex percent and kilocalories per gram. You can kind of catch a whiff of it here if you squint a lot and you really want it. Nevertheless, what I'm going to show you is that there's a pretty strong relationship between both of these variables and the outcome. And this is a classic masking effect, where they're correlated with the outcome in different directions, so in the bivariate relationships, you can't see it. This is like the Head Start program sort of problem, in a sense. So let's take a look. Oh, I had this animation to show you that.
Yeah, there's kilocalories per gram. You can't see anything with those two. Log mass and neocortex percent, nevertheless, are clearly positively associated with one another. And that's the thing that creates the masking effect here: their association across species. Okay. So, just some bookkeeping here. I have deliberately left this data set so that it has missing values for the neocortex percent variable. It's not because I removed them; it's because for those species, we don't have it. It just hasn't been measured in this particular way. But I didn't take those cases out, because often you will have missing values in some variable in your data set, and you're going to need to do something about that, make some decision about how to deal with the missing values. So now we're going to do the customary thing, which is almost always the worst thing, but we're going to do the customary thing and just drop those cases. This is called a complete case analysis. The simplest way to do this is this one line of R code here, box 5.16. You can just select complete cases in the data frame d. So what that says is: give me the data frame d, all the rows that are complete, and all of the columns. That's the way you can read that line. In the last week, I'm going to come back to this, and we're going to do imputation instead. We're going to impute the missing values. That's nearly always a better idea. But for now, this will do. If you don't do this, and I show you in the book what happens, you get this mysterious error, something about a non-finite finite-difference value or something like that. Convenience functions in R will automatically drop cases with missing values and do the complete case analysis. But it's best to know that this is happening, for reasons we'll see next week. I want you to get in the habit of doing this manually when you start an analysis. Recognize if you're trying to use variables with missing values, and make your decision about the complete case analysis or not.
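The R line in box 5.16 is d[complete.cases(d), ]. If you happen to work in Python instead, the pandas equivalent is dropna(). Here's a sketch with a tiny made-up frame; the column names mimic the milk data, but the numbers are invented:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the milk data; the real data ship with the
# rethinking R package (this frame is made up for illustration)
d = pd.DataFrame({
    "kcal.per.g":     [0.49, 0.51, 0.47, 0.56],
    "neocortex.perc": [55.16, np.nan, np.nan, 64.54],
    "log.mass":       [0.67, 1.66, 1.68, 1.46],
})

# R's d[complete.cases(d), ] == pandas' dropna(): keep only the rows
# with no missing value in any column
dcc = d.dropna()
print(len(dcc))  # 2 rows survive
```

As in R, nothing stops you from silently losing rows this way, which is exactly why it's worth doing explicitly rather than letting a fitting function do it for you.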
The reason is because we're going to be comparing models next week, and you always want to compare models fit to the same data, not to data sets of different size. Because if there's more data to predict, you will do a worse job: every case you add is another source of error. So that will make some of the models mysteriously seem better than others, but it will just be an accident. This trips up a lot of people I know from consultations, so I'm trying to head it off. Notice if you've got missing values first, before you start the analysis. Okay, so we fit the bivariate models. I'm not going to belabor the code here, because I think you guys are pros at this by now. I know because you're sending me your homeworks, and I see that you can do this, or at least you can copy stuff from the book and do it. Understanding always comes later. And that's fine. So notice there's a weak positive relationship, on the left, between kilocalories per gram of milk and neocortex percent, but it's highly uncertain, right? Lots of the highly plausible lines in the posterior distribution are negative, actually; you see that in the bowtie curve. But it's not a strong relationship. It's pretty weak. There's a negative relationship, but not a particularly strong one, between kilocalories per gram of milk and the log of body mass of mothers in a species. But it's still quite uncertain. What I want to show you now is that if we make a model with both of these, both of these relationships get steeper, as a consequence of the masking effect that's going on: both of these predictors are positively associated with one another, and they go in different directions with the outcome, right? There's a positive relationship on the left and a negative relationship on the right, and their positive association means they tend to cancel out across species, and that's what's going on. So here's the multivariate model. k sub i here stands for kilocalories per gram of milk.
N sub i, in the mu line, the second line of this model, is the neocortex percent, and then the log of m sub i; m sub i is mother's mass. Okay? I think it was in kilograms. Then, note the relatively flat priors and weakly regularizing priors on the beta coefficients. Again, I'll just keep saying it: next week we'll talk a lot more about those priors. They have very little effect, even in a data set as small as this one, actually. I encourage you to experiment with that. We fit it exactly as you'd expect, right? Yeah? Or if not, let me know. If you don't expect this, please speak up about it, because if one of you doesn't expect it, then probably ten of you don't. But it's the same sort of thing. We create parameters with any name we like; those will be the coefficients. We multiply those by the particular variables. We make the list of formulas. map has no trouble doing this, because it's a linear problem. And now, when we plot predictions, you can see how things change. So at the top, I'm just repeating the bivariate regressions with the posterior predictions superimposed on them. On the bottom, I'm showing you the counterfactual predictions from the multivariate model. The data's not in there, right? Because we'd have to do the residual plots; we'd have to partial them out and put that in. So I just did the counterfactuals. And you can see now that both relationships have gotten a lot steeper. The model is saying there's definitely a positive relationship between milk energy and neocortex percent of brain, and definitely a strong, almost equally strong, but negative relationship with mother's mass. So brainier species produce richer milk. Heavier species produce less rich milk. But unfortunately, across primate species, brainier is correlated with being big. So the apes are the big problem in that data set, right? Apes like us: we produce richer milk, but we're also heavier. And these things, across species, are antagonistic.
So then you're asking: but how does the model figure this out, then? Because they're not perfectly correlated. Again, it's that partialing out. So you could do the residuals exercise, where you regress mother's mass on neocortex percent and look at the variation that's left over. That's the variation this model is picking up on. That makes some sense. If that is confusing, you do want to get in there and do it yourself. Follow the recipe from the start of the chapter, where we did the divorce examples, but do it for these variables. You'll enjoy it. Yes. Was that convincing? You'll enjoy it? No, it wasn't. Okay. You'll enjoy it. But that is all that's going on. It's exactly the same mathematical procedure. It uses the residual variation. After you've learned neocortex percent, there's still variation in milk energy left over, and that remaining variation is correlated with mother's mass. So the fact that they're not perfectly associated is what helps. But across species, they are strongly associated, and that's where the masking effect comes from. Apes are brainy and apes are heavy. That's the big problem in the data set. Does this make sense? A classic masking effect. And again, this can arise often. This one's a case where you may not anticipate it unless you know something about the biology of these organisms. But there are lots of cases where it has to do with baseline measurements: you miss the baseline measurement, treatment effects get applied to the neediest cases, and you won't see the improvement, because they started out at some lower point. In an experiment, we don't worry about this as much, because we randomly assign the treatments. But in the real world, a lot of you are population biologists, and you go to war with the data you have, to paraphrase Don Rumsfeld. So we try to do the best we can in observational studies, when we have to do them.
So, I kept saying on Tuesday that I was going to show you how to do these counterfactual plots in code. All the examples of counterfactual plots that I've shown in lecture, the code is in the book. But here's one where I want to show you the trick, just so you're clear on what it is, because I know it's easier when I talk you through the code than when you have to go through it yourself. So here, the way I've done this counterfactual plot, showing the model's prediction of milk energy against neocortex percent, controlling for mother's mass, is I've set mother's mass to the mean log mass in the data set. On the first line of code here, mean.log.mass is the mean of the log mass in the data set. And I'm going to use that, the same value, in calculating all the predictions that are shown on the right. The way to do this, if you want to do it with link, is I then make np.seq, the neocortex percent sequence, from 0 to 100 in steps of one; that's what the colon means. And then I make a new data list that we use to do the counterfactual prediction, where neocortex percent is just that sequence of numbers and log mass is just this one value. And link is smart enough to recycle that one value to make it work. R does it automatically for me; I didn't have to do anything. So now you've got the two predictor variables you need for model m5.7, which has both those predictors in it. You pass the prediction data into link. It computes the samples. We summarize them. We get the mean and the percentile interval. And then the plotting is just as you'd expect. I do it all here with lines rather than those nice fashionable shaded regions, just in case you want to do it that way. Does this make sense? It's that easy.
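To make the mechanics concrete outside of R, here's a hedged Python sketch of the same counterfactual trick: fit a model with both predictors, on made-up milk-like data, then sweep neocortex percent over 0:100 while pinning log mass at its mean. It uses plain least squares instead of link and a posterior, so it only mimics the mean prediction, not the uncertainty interval:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake milk-like data (hypothetical numbers, just to show the mechanics):
# neocortex and log mass are positively associated, with opposite effects
n = 200
neocortex = rng.uniform(55, 76, n)
log_mass = 0.1 * neocortex + rng.normal(0, 1, n)
kcal = 0.03 * neocortex - 0.1 * log_mass + rng.normal(0, 0.05, n)

# Fit the two-predictor linear model by least squares
X = np.column_stack([np.ones(n), neocortex, log_mass])
a, bn, bm = np.linalg.lstsq(X, kcal, rcond=None)[0]

# Counterfactual: sweep neocortex 0..100, hold log mass at its mean
mean_log_mass = log_mass.mean()
np_seq = np.arange(0, 101)                  # R's 0:100
mu = a + bn * np_seq + bm * mean_log_mass   # predicted mean at each value
```

In the R version, link does the recycling of the single held-constant value for you; here the scalar mean_log_mass broadcasts over the sequence the same way.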
So, as I said before, and a number of you asked me about it afterwards, for interpretation of these purely linear Gaussian models, it really doesn't matter what constant value you assign to the other predictors you're not plotting. It really doesn't. The only thing that's going to change are the numbers that appear on the vertical axis, because it'll shift the whole line up and down. Since it's constant across all values on the x-axis, it's just moving the whole thing up and down. It's like changing the intercept: it moves it up and down. So it isn't a big decision you need to have any angst about. When we get to generalized linear models, you'll want to try different values, I think, because we're going to encounter a phenomenon called ceiling and floor effects. The outcomes are constrained. Gaussian outcomes are unconstrained: theoretically, they could take any value between negative and positive infinity. Practically they don't, right? But theoretically they could. So you don't get ceiling and floor effects. But with something like a count, you definitely do, because counts can't go below zero. That's called a floor effect. Well, I'll show you what it means when we get there, but it means you will want to think about what value you assign. And the mean is one thing to do, but you might want to try others as well, to get an idea of what's going on. Again, I'm giving you horoscope advice here: I'm trying to give you some of the most generally useful advice, the kind that will keep you buying the newspaper, so to speak. No, I'm trying to be useful, in a sense, but that's hard with statistics, because a lot of your questions, a lot of the ambiguities in an analysis, get resolved by your domain expertise. You can do better than a statistician in the stats department, because you know your science. They can give you valuable advice on technical stuff, quite often, but you're the tiebreaker, because you know things about your system, right?
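You can verify the "it only shifts the line" claim directly. In a purely linear model, changing the held-constant value of the other predictor changes every prediction by the same amount, so the shape of the curve is untouched. The coefficients here are made up:

```python
import numpy as np

# For a linear model mu = a + bn*N + bm*M, holding M at different
# constants just shifts the whole line up or down by bm * (M2 - M1)
a, bn, bm = 0.5, 0.03, -0.1     # hypothetical coefficients
np_seq = np.arange(0, 101)      # sweep the plotted predictor

mu_at_mean = a + bn * np_seq + bm * 1.5   # hold M at, say, its mean
mu_at_big  = a + bn * np_seq + bm * 3.0   # hold M at a larger value

shift = mu_at_big - mu_at_mean
print(shift[0])  # the same constant, bm * (3.0 - 1.5), at every x
```

With a nonlinear link function, as in the GLMs mentioned above, this constancy breaks down, which is exactly why the choice of held value starts to matter there.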
You know your flies or your weeds or whatever it is you're studying, right? And that breaks the ties, breaks the ambiguities, in deciding how to do the analysis, and it gets easier. So if you feel confused in any of the examples, that's perfectly normal, as I keep saying; it just means you're paying attention. And in your own science it gets easier, and you can break convention, because the conventions, well, they're like horoscopes, right? They're broadly useful, but you can nearly always do better when you get more information about the case at hand. Okay, here's the way I think of regression in my own work: it's a wicked oracle. It's like the oracle at Delphi, right? You can go up to it, and it does give you useful advice. It tells you the truth, always. These models always tell you the truth, provided you fit them correctly, which you are. The problem is that the advice they give is not always exactly what you want. So in our case, we would like these models to tell us what's causing milk energy. But these aren't actually causal models. They're descriptive models, and when we interpret them to get causal information out of them, well, there's nothing inherently causal about them, actually. They're just measuring the partial associations among these variables. That's really useful. They can process information way better than I can, so we lean on them quite a lot. And they're sort of magical, in the sense that they automatically find the most informative cases. You can load your data set with cases that are uninformative, and the model automatically ignores them. It finds the ones that are most useful. What do I mean by that? It does the partialing out. It finds the cases where there's residual variation that is correlated with the outcome. And all those cases where everything is perfectly associated with all the predictors, it just ends up throwing them out, actually, and they don't do anything to the model, because they're uninformative.
This is part of the magic of Bayesian inference: it just automatically finds all the information in the data that is relevant to the model. I'll say that again. Bayesian inference always finds all the information in the data that is relevant to the model. However, you're responsible for designing a model that is good. The oracle will not help you with that at all. So if you ask it a dumb question, you will get the right answer to a dumb question. It will use all the information optimally to give you an answer to a dumb question, which is probably a waste of everybody's time. But it will do that. And that's great. That automatic discovery of relevance in the data set is something we lean on quite a lot in information processing here. But here are the things you often want to look out for, the things that make this a wicked oracle. It won't tell you about omitted variables. It won't say, okay, yeah, you asked me about the correlation between this thing and that thing. You asked me about the correlation, for example, between rate of marriage and rate of divorce. The model will tell you: yeah, they're positively related. But what it won't do is say: oh, by the way, there's this big literature on the age of marriage, and you might want to look at that; it might be that this is being driven by this other variable. It will never tell you that. And that's what we call omitted variable bias. The omitted variables that may be correlated with the predictors of interest or the outcome, those are up to you to figure out. And this is what makes science fun, why they pay us the modest dollars, right? But no, it's part of the iterative process of it. And in post-publication peer review, often these things come up. Post-publication peer review is just, that's the engine, right? You put things out there, and then someone says, eh, you know, you forgot age of marriage, and you're like, oh, thank you. And then you rerun it. You get another paper. That's how it goes.
But no, seriously, often you can't expect to think of everything, and that's why it takes time to figure things out. It really does. There's this other issue we'll start talking about today: non-identifiability. It's possible to write down models which ask really strange sorts of questions. Rather, it's the combination of the model and the data in particular. I'm going to show you two examples today. In the non-Bayesian literature, this is usually called non-identifiability; in the Bayesian literature, technically everything's identifiable, and I'll say what that means when we get to our slide about it. But that doesn't make the answer any easier to interpret. We'll get there. So this brings us to the issue: why don't we just add all the predictors in? If we're worried about omitted variable bias, can we just dump everything into the model? You sometimes see this, both in the social sciences and in population biology. In the social sciences, you know, you go online, you get the General Social Survey, and it's a big longitudinal data set asking Americans a bunch of stuff. Just all kinds of things. And you just dump it all in there. You've got thousands of predictors, actually. All kinds of stuff. And in a time series. So you can just get drunk with power. Just dump it all into a linear model. And linear models are easy to fit, so you could actually probably get estimates. Nevertheless, this is bad news, nearly always bad news. And there are a number of issues that arise from this. The first, which I'll spend a good amount of time on today, is this problem of multicollinearity, which is related to the non-identifiability issue I mentioned on the previous slide. We're going to spend some time on this; there'll be two examples to work through. The second issue, easier to understand and less technical, is just loss of interpretability. When you have many, many predictors, all of them are associated to some amount with the outcome.
And there are all of these complicated masking issues that go on. Even worse, and I think the major threat to inference in this business, is false positives. So, if you try enough predictors in your particular sample, one of them will be strongly associated with the outcome, but it will just be a feature of your sample. Why? It's a base rate neglect problem, right? We'll talk about this in the last week, if I can make time for it. But most stuff is not causing the thing of interest. I'll say that again. Most stuff in the world is not causing the thing you're interested in. That's how it is. We don't, for the most part in our sciences, live in the world of butterfly effects, right? Where a butterfly flaps its wings in Brazil and then a raccoon goes extinct. If that were true, then I think the history of science would be pretty different. So the base rate of predictors being associated with your outcome is low, right? But if you try enough of them, then, however low your false positive rate is, you're going to get a bunch of false positives. So you're asking: is this a big deal? Yeah, it's a huge deal. In some fields, it's the main worry. Whole-genome studies: doctors are really big into scanning genomes for sites associated with congenital diseases. The problem is most loci aren't associated. So say they scan like 10 to the 6 loci, no exaggeration, they do this, and there are plausibly maybe 10 loci associated with the phenotype of interest. Even if you have a 1% false positive rate, nearly all of your positive hits are false, because you're scanning 10 to the 6 loci. Does this make some sense? This can happen in more benign and less medical fields as well, all the time. It happens in the social sciences when you download the General Social Survey. It happens in ecology when you take some big matrix of stuff you've measured about your quadrats and you just use it all as predictors, right?
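The genome-scan arithmetic is worth doing once by hand. With the numbers from the lecture, 10 real loci out of 10^6 scanned and a 1% false-positive rate, even perfect power leaves roughly a thousand false hits for every true one:

```python
# Base rate neglect, with the lecture's numbers:
# of 1e6 loci scanned, suppose only 10 truly matter
loci = 10**6
true_loci = 10
fpr = 0.01     # false positive rate
power = 1.0    # optimistically assume every true locus is detected

false_hits = fpr * (loci - true_loci)   # about 10,000 false positives
true_hits = power * true_loci           # 10 true positives

# Fraction of "hits" that are actually real (positive predictive value)
ppv = true_hits / (true_hits + false_hits)
print(ppv)  # about 0.001: roughly 1 in 1,000 positives is real
```

The false-positive rate isn't the problem; the base rate is. Shrinking the candidate set with theory before fitting raises the base rate, which is exactly the fix suggested below.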
So you'll find stuff associated with the presence of whichever invasive weed you're studying. But chances are, it's a false positive, because at the base rate, most things don't really meaningfully predict the presence of that weed, right? The association could be real, right? It could happen. But anyway, does this make some sense? I think this is a major hazard, and it's an intuition issue that you have to deal with. How do you fix it? Well, you have to theorize. I mean, this is the main job of theory: to eliminate variables before we put them in the model. And that helps with interpretability, and it helps us do more meaningful hypothesis testing. You lose precision when you put more variables in, because you're trying to use the same number of cases to estimate more parameters. So you get noisier estimates for every parameter you put in. We'll talk about that in more detail next week, when we talk about overfitting. It's also an issue. So, pragmatically, even if a variable is associated with an outcome, sometimes we want to leave it out. And again, overfitting is all next week. We'll do it a lot. So let's look at the first of two examples of the multicollinearity slash non-identifiability problem. Here's some data that I simulated. I give you the code to simulate this in the notes, in arbitrary units. I don't know, these are like decimeters or something. We have a bunch of individuals. I think the default in the simulation is 100; I'm just showing you the first 20: their height, and then the length of their left leg and the length of their right leg. Now, the lengths of the left and right legs are highly correlated, as they are for most of you in this room. However, they're not exactly the same, also just like everybody in this room.
I think, the last time I read something about this, your right leg tends to be longer if you're right-handed, because your right leg is more muscular. People are not only handed, they're also footed: if you're right-handed, you're also right-footed, and you lead with your right foot when you go upstairs. There's a big literature on this. Strangely enough, with so many people doing science, there's a big literature on everything. So that's why the two legs are not exactly the same numbers. These are measured with arbitrary precision, and you can look at the formulas. As this course goes on, I'm going to put in more of these fake data, or dummy data as I call it in the book, cases, because this is how you do power analysis. This is how you plan and preregister your study: you simulate the data as you expect to be able to collect it, design your analysis, then go collect the data, run what you planned, and you will have preregistered it. And when your hypothesis turns out to be true, you'll be a rock star. And if it turns out to be false, you can say: it's not my fault, it was well planned, I preregistered, it was all public. People clap you on the back and say, try again. But this is a good thing to do. It helps you do the design better. This is prospective power analysis, and so we'll have more of these dummy data cases as we go through, and I have a number of them in boxes as well. Maybe on a second read you look at them; you don't have to agonize about them right now. But it's a good skill to pick up, because through your scientific career, you'll want to do more and more of this.
I assume lots of journals will require preregistration, at least if I have my way, which almost certainly means it won't happen. But no, I think increasingly this is the zeitgeist in contemporary science: preregistration is needed, because otherwise you can just fish until you get a star, and as soon as you've got your star, you've got a publication. Journals are starting to get really scared about this, so that's something to worry about. Does the structure of this data make sense? Yeah, okay. So I assert, and I simulated the data so I know it, that you can predict a person's height knowing the length of either of their legs. However, what I want to show you is that a statistical model, a linear regression, seemingly cannot predict their height with both legs. It's a funny thing, or at least it'll look that way. So let me talk you through this. Again, I'm using the simulation case because we know the right answer, so you'll know that the model is, well, not really getting it wrong, you'll see. So let's fit a linear regression here. We take that simulated data; we model height as a Gaussian, with a linear model of the left leg and the right leg. We have separate coefficients for them, right? So both of those should turn out to be positive, right? Because both of them predict height. But wait, hang on here. I'm showing you the results at the bottom, and they don't turn out that way. For this particular simulated data, yours will look a little different, but you'll get the same ridiculous outcome: the coefficient for the left leg is very nearly zero, and you'll see that the interval is about equally on both sides of zero. Big standard error, big standard deviation rather. For the right leg, it's about two, but it has a really big standard deviation, so it still overlaps both sides. You can see it there on the dot chart, if you can see it from where you are.
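Here is a sketch of the legs example in Python rather than the book's R. The simulation recipe mirrors the book's: legs are roughly 40-50% of height plus a little measurement noise, though the exact numbers here are my own. Ordinary least squares stands in for the Bayesian fit, but the qualitative result is the same: one leg alone gets a tight slope near 2, while both legs together give wild individual coefficients with huge standard errors:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100
height = rng.normal(10, 2, N)
leg_prop = rng.uniform(0.4, 0.5, N)                   # legs ~45% of height
leg_left = leg_prop * height + rng.normal(0, 0.02, N)  # small measurement error
leg_right = leg_prop * height + rng.normal(0, 0.02, N)

def ols(X, y):
    """Least-squares fit with classical standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return b, se

ones = np.ones(N)
b1, se1 = ols(np.column_stack([ones, leg_left]), height)             # one leg
b2, se2 = ols(np.column_stack([ones, leg_left, leg_right]), height)  # both legs

print(b1[1], se1[1])    # single leg: slope near 2, tight
print(b2[1:], se2[1:])  # both legs: individual slopes wild, huge SEs
```

The individual coefficients are nearly unidentified, yet their sum stays close to the single-leg slope, which is the model's honest answer to the question it was asked.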
The model is saying, like, there doesn't seem to be a lot going on with legs here. I mean, I'm not going to tell you these people's heights, but these legs, I don't know what to do with them. What has happened here? Did the model do something wrong? No. The model answered the question you asked it. Remember what question multiple regression gives you an answer to. These models are asking the question, conditional on knowing, say, the left leg, is there any additional value in learning the right leg? And simultaneously for the other one: once I know the right leg, is there any additional value in learning the left leg? And the answer to that question is no, right? You only need one of these legs, and the other is redundant. Now, of course, you would never run this model, right? But I'll show you, the second example is a real-data example, and one that people did run in the literature; I learned about it from a colleague of mine. So we'll get there. This can happen, and it can happen innocently to perfectly smart people, right? This can happen all the time. This is the right answer, though, to the question. And in fact, the model is predictively fine. Here's what happens. We've got to look at the posterior a little bit more closely. So here I've restated the question. The model is asking, what is the value of knowing the right leg once we already know the left leg, and vice versa? And the answer to both those questions is nothing, pretty much. But we have these really big standard deviations on the posterior. What you're seeing there in the lower left, I'm showing you, I think this is like 10,000 samples from the posterior distribution of the two coefficients, for the left leg on the vertical and the right leg on the horizontal. You see they're really strongly negatively correlated with one another. There's a long ridge of combinations of values that are equally plausible.
Remember, the posterior distribution is ranking the relative plausibilities of all the combinations of parameter values. So this is what it's done. What's going on here? What it's saying is, what matters is the sum of these two things. And there is a particular sum of these two things which is highly plausible. How do we find that out? We just add them together in the posterior distribution. Even though they're strongly negatively correlated, when we add them together, we get a marginal posterior distribution for the sum of the two, and it's pretty sure where it should be. It should be right around two, which is the simulated value I used. Actually, it's doing a great job. So what has happened here? Oh, and I should say, as a consequence, although I'll show you this on the next slide, as a consequence, the model actually makes good predictions. If you asked it to predict an individual's height and you gave it left-leg and right-leg inputs, it would do a pretty good job of it, magically. And yet if you read the table of coefficients, you'd think that neither matters. I have seen this in published papers. And the error arises from ignoring the correlation between the parameters. I'll say that again. If you just look at the table of coefficients, you might conclude here that neither the left nor the right leg can predict height, but that's a mistake. And the mistake arises from ignoring the correlation between the parameters. The uncertainty is correlated. Yeah. So would you recommend... Yeah, so the question was, would I recommend, before putting things in, examining bivariate correlations? Yeah, that's a great idea. Of course, presumably you will have already screened down to some range of theoretically plausible and interesting ones. And then, for ones that are highly correlated, you probably only want to put one of them in at a time. More broadly, what I'm going to recommend, and we'll talk about this next week, is fitting models that leave predictors out and comparing them.
And then you can really detect what's going on in this case. Then it's easy. So imagine fitting models here with only the left leg. You'd find that it does a really good job predicting. And then only the right leg. It does a really good job predicting. Then you put them both in, and the standard deviations on those two parameters massively inflate. That's this. That arises from this problem, that they're largely redundant information. Yeah. And again, this happens in ecology, tons. I remember looking at something like a California wildflower dataset, and everything's correlated with moisture. Right? So it's like all this stuff, it's just moisture. And so any one of those variables that's correlated with moisture will do. And then you can predict basically the communities of wildflowers that you get. Right? Or serpentine, serpentine soil, as you all said. It was like moisture and serpentine. Right? So you know serpentine. You know what I'm talking about. But anyway, did I answer your question satisfactorily? Okay. Let me show you algebraically what's going on here. And this is one of these cases where I think the math really helps to see what's going on. This is your linear regression model, the leg model, re-represented in abstract form, where the y's are outcomes. So that's like height in this example. And now the two legs are really the same thing. So I've just put the same x variable in there, but we're going to estimate two coefficients for it. This is really what you've done in the leg model. Right? Because they're not exactly the same leg length, but pretty much. And in any case, whatever small difference there is between them isn't really correlated with height anyway. So this is the model you've actually expressed. So from the Golem's perspective, from the little robot's perspective, this is the question you asked it. You asked it to estimate simultaneously two coefficients for the same predictor variable. All right.
This is the perfect case of this. If the two predictor variables aren't perfectly correlated, the problem is less severe, but it can still arise. Right? So, again, it's not necessarily a problem. It's a problem in interpretation. But the model fit correctly, and it makes okay predictions. So if we do a little bit of algebra, you just factor that x out, right? You can see that this is the model you're actually fitting. And you're estimating the sum of the two coefficients. And that was the thing that the model reliably estimated. And it got the right answer for that sum, which is the coefficient relating leg length to height, the ratio of height to legs in this case. So this is the way you can think of it. So it gave you the right answer to that question, which is, what's the posterior distribution of that sum? But the results were expressed differently. And that's why it looked like neither of them mattered. Right? Because of the correlation between the two. And there are an infinite number of combinations of b sub 1 here and b sub 2 which will make the same sum. And that's why you get that long ridge in the posterior distribution. Right? There's an infinite number of ways to make any sum. Right? You'll accept that, hopefully. That's how real numbers work, sadly. So when we look at the posterior distribution again, now you can see it's estimated the thing on the right. That's the marginal posterior distribution of the sum. But the components of it, the two components of the sum, can be anywhere along that long ridge of values, as long as the values along that ridge sum up to values around 2. Does it make some sense? If it doesn't make sense quite yet, when you read over it, it'll take a little while. It's fine. We're deep in the realm of the small world logic. Right?
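The factoring described above can be written out explicitly. Treating both legs as (nearly) the same predictor x sub i:

```latex
\begin{aligned}
y_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i \\
      &= \alpha + (\beta_1 + \beta_2)\, x_i
\end{aligned}
```

Only the sum beta 1 plus beta 2 multiplies the data, so only the sum is well constrained; any pair of values with the right sum is equally plausible, which is exactly the ridge in the posterior.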
But this is, I think this is, increasingly I think this is a serious problem, because often people just inspect the table of coefficients, and you can find multiple parameters that overlap zero a lot. But if you don't also inspect the correlations between those parameters, you can wrongly conclude that neither of them is associated with the outcome, when in fact they're both very strongly associated with it. You miss it completely. We'll have another example of this later in the course, where it arises in a real data context again. Okay. So just to show you that the model still makes good predictions, I won't belabor this. You can spend some time at home with it if you want. In the code in the upper left, I've taken the model estimates, the posterior, and I've constructed the leg total as a predictor. And I pushed that through the model and made that plot for all the cases. For each individual, I plot their height against their total leg length. This is a weird thing, I know. But this is the question you ask the model. Right? So, or I ask the model. I don't mean to saddle you with my idiocy here. But then I just use link, and I push left and right legs through again. And then I process out the predictions, and I plot them against the totals. Actually, it's just the left leg times two or something. See, the model actually knows how tall people will be given leg length. It does. It just can't tell you correctly, because you're asking for a table of coefficients. Or rather, I asked it for a table of coefficients, and tricked you into thinking that's all there is. But you won't be tricked. You will inspect posterior predictions, and then you won't be fooled. Yeah, David. So is it the priors that's keeping the posteriors on the coefficients from going to positive and negative? Yes. So the question was, is it the priors that are keeping the posterior on the two coefficients from going to positive and negative infinity?
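The "predictions are still fine" claim is easy to check yourself. Here's a Python sketch (standing in for the lecture's R code with link), using least squares and the same invented simulation numbers as before: fit with both legs, push the legs back through the model, and compare predictions to the true heights.

```python
import numpy as np

rng = np.random.default_rng(3)

# Re-create the leg-length setup (numbers are my guesses, as before).
N = 100
height = rng.normal(10, 2, size=N)
prop = rng.uniform(0.4, 0.5, size=N)
left = prop * height + rng.normal(0, 0.02, size=N)
right = prop * height + rng.normal(0, 0.02, size=N)

# Fit with both (redundant) legs in the model.
X = np.column_stack([np.ones(N), left, right])
beta, *_ = np.linalg.lstsq(X, height, rcond=None)

# Push both legs back through the fitted model, the way the lecture
# pushes samples through link, and compare predictions to the truth.
predicted = X @ beta
r = np.corrcoef(predicted, height)[0, 1]
```

Despite the useless-looking coefficient table, the predictions track the true heights closely.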
Yes. They're very weakly informative priors. You still get a very long ridge, but it doesn't run away forever. And this is the issue: if you tried to fit this model without any priors, perfectly flat priors, if you tried to do it in the likelihood framework, the model would be technically non-identifiable, because there would be infinite variance in your parameter estimates. And that's a classic problem. And that's why I said in Bayesian inference, everything's identifiable. It just may not be the answer to the question you wanted. Right? So usually you don't want to do this. You want to ask it in a way that's easier to interpret. But yeah, that's exactly what's going on. Absolutely. And we know that there's not an infinitely strong association between leg length and height. So there's a particular ratio that we're trying to estimate. So those priors aren't distorting inference, they're just making it possible. We'll have some examples of this. When we get to the Markov chain Monte Carlo chapter, I'll show you how this is needed to do estimation too. So one of the reasons that biology has gotten so Bayesian lately, actually, is for that reason, just the estimation power of it, identification of the model. Yeah. When you mentioned that in the Bayesian framework everything's technically identifiable, what does that mean exactly? Because if you're multiplying parameters together, you have this reflection invariance, where you don't know whether both parameters are positive or negative. I mean, is it still identified? Yeah, I guess that depends on what identified means. You get a posterior distribution for anything. So in Bayesian language usually, although I bet you can find someone who defines it differently, most people define identifiability as: is your posterior distribution proper? If it is, the model's identified.
So even in your reflection invariance cases, right, you can get, like, the Gaussian mixture model or something like that, you can get a posterior, a proper posterior, it's just you're going to get two modes. Yeah, Merry Christmas, right? That's how good it is. It doesn't mean that the case is benign or easy, which is why I think identifiability is really a non-Bayesian concept, but so many papers and textbooks talk about it that I felt like I should say something about it. And if you use improper priors... Yeah. Is that still a Bayesian model? Yeah, it can be. As long as your posterior's proper, it's okay. Yeah, absolutely. And then, anyway, we won't go off on to that, but some people are like, what, proper or improper? That sounds very normative. Like, can we wear a petticoat? Or are we going line dancing? What is going on here? Yeah, I hope we're going line dancing. That would be a great final exam, but we don't need to worry about that necessarily, because I'm not going to show you any improper priors in this course, and I don't want anybody to use them, actually. If you do know about that and you want me to defend that sometime, buy me a cup of coffee and I'll defend it, but you don't need to worry about it necessarily. Okay. Maybe I could just say: improper means that the distribution doesn't integrate to one. It's not a proper probability distribution. Nevertheless, you can use them as priors sometimes. Sometimes they're useful, but I don't want to lead anyone down that street. Okay. Here's a real data example of the same issue, where we get multicollinearity, that is, highly correlated predictors that are mostly redundant information. And so the marginal posterior distributions of the coefficients on each look like neither is associated with the outcome, but they're both strongly associated. Here's a case that's more subtle. It comes from a real data set. I got this example from Dr. Katie Hinde, who went to graduate school with me, actually.
So right after I left, she started; she used to be around here. You guys may know Katie. And she alerted me to this, and it's a great teaching example. And it wasn't her mistake, by the way. It was made by people in the literature before her. So what Katie and her colleagues do is, they're interested in milk composition across a bunch of mammals, and it turns out that we don't know a lot about the evolution of milk composition. So it's an interesting area of work. And in the same data set, you have these columns: the percent fat content of the energy in milk and the percent lactose content. These are the main energetic constituents in milk. It's mostly water with some fat and sugar in it. It's that particular sugar, lactose, that makes people have indigestion when they're adults. Unless you come from northern Europe like me, in which case you can just drink milk all the time, digest milk naturally. It's a wonderful thing. Well, I'm getting ahead of myself, but when I did my postdoc, I had a Japanese office mate, Masanori. You may be listening. Hello, Masanori. And Masanori once made this joke. He said, like, you know, I like Americans a lot, but you guys smell like milk all the time. You know, I think you might be right about that. But anyway, contemplate that sometimes. Where was I? All right. So yes, percent fat and percent lactose. So here's the story, just to run to the punchline. The total energy content is measured in kilocalories per gram, and milk is just a trade-off between these two. If a species puts more fat into it, it puts less sugar into it. If they put more sugar into it, they put less fat into it. The species are mainly just trading off those two things. And that's the main thing that goes on. Fat gets replaced with sugar, or sugar gets replaced with fat, depending upon the species. And so if you look at the bivariate correlations, this is fat against kilocalories, fat on the horizontal, kilocalories on the vertical.
So the coefficient for fat is positive and reliably above zero. The units here, by the way, are really small. I did the example this way because these are percents. They're numbers from zero to 100. So the coefficients end up being small, but that's okay. There's nothing meaningful about the scale of that. It's just the units you use. And then the lactose content is negatively associated with kilocalories in milk, you see it like this across species, and that's also reliably below zero. Now, why is this true? Because fat has more energy in it, right? This is why it tastes so good. It's like butter. It's very good for you. So if you put both of them in a model, this is just like the leg example. We put them both into a multiple regression. The coefficient table at the bottom tells us that, well, for fat content, it's very close to zero, and the interval is on both sides of zero. So maybe there's probably not much going on. The standard deviation is on the same order as the estimate, and that almost always means there's probably not a lot going on. The one for lactose is below zero, but it also has a big standard deviation. But when we looked at the bivariate plots, we saw much stronger relationships than that. And this arises from exactly the same problem. They're correlated with one another. Because species mainly trade them off against one another. Or rather, you can think of it as: species that are putting a lot of energy in their milk are adding fat. And these things trade off, in a sense. They're negatively correlated, but they're both associated with the outcome. The same sort of issue arises in a real data set. You really only want one of these. There's a single dimension here: how much investment in milk is there? And you're using two variables to describe it. That's what's going on. Does this make some sense? Yeah. You may need to play around with it a little more, but it's the same issue.
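The trade-off structure is easy to mimic with fake data. This Python sketch (not the real milk data; all numbers are invented) builds fat and lactose as near mirror images and shows that each, taken alone, is strongly associated with energy content, in opposite directions:

```python
import numpy as np

rng = np.random.default_rng(11)

# A toy version of the milk trade-off (all numbers invented): species
# trade fat against lactose, and kilocalories track the fat fraction.
n = 30
fat = rng.uniform(5, 55, size=n)               # percent fat
lactose = 60 - fat + rng.normal(0, 2, size=n)  # percent lactose, trades off
kcal = 0.01 * fat + rng.normal(0, 0.05, size=n)

def slope(x, y):
    """OLS slope of y on x."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc * xc)

# Each predictor alone is clearly associated with kcal, in opposite
# directions, because the two are nearly mirror images of one another.
b_fat = slope(fat, kcal)          # positive
b_lactose = slope(lactose, kcal)  # negative
```

Put both into the same multiple regression and, just as with the legs, the standard deviations on the two coefficients inflate, because they carry mostly redundant information about the single underlying dimension.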
The leg case was cleaner. All right. We're making good time. All right. Categorical variables are subtle. So I've got a few examples of these to work through in our last half hour here. Very often, we have predictor variables which are just discrete, unordered categories. By discrete, I mean it's not a continuous measurement like height or leg length or the percent of fat in milk. It's just whether or not a case is something. So for example, sex or gender is one of the most common. So in the Howell data set shown here, you have the sex of the individual in the column male, whether or not they're male. This is called an indicator or dummy variable. And the values in it don't strictly matter to the machine, as we'll see as we process through this. Cognitively, though, you want to use zeros and ones, because it's a way to think of turning on and off a parameter. The purpose of a variable like this is just to turn on and off a coefficient. When it's a zero, it turns it off. It has no effect on prediction. When it's a one, it turns it on. As a consequence, you could actually make the values different and it'll still work. It'll make the same predictions, but it'll be harder to understand. So I advise you to use the zero-one, so-called dummy variable or indicator approach. And it doesn't matter which category you use. I think for mammals, there's an argument for using male as the derived state, because male is the derived sex biologically, developmentally, right? Some genes have to turn on, and male development is more often broken, because you have to turn on more genes to make a male. It's the opposite for birds, right? It goes the other direction, but in mammals, this is the way it is. So often we want to make predictions contingent on these indicator-variable categories. So we're going to start with this simple case, just the sex of the individual in the dataset, and show you how to put it in the model using a dummy variable.
Before we look at the code though, really all you need to know is it's exactly the same as any other predictor. It really is. The model doesn't see any difference. It's just you've coded it in a way that makes the interpretation easier for you. But it looks the same. As far as the model is concerned, it's the same kind of thing. In your brain, it's not. And so you will have wrinkled eyebrows about this for many years. But algebraically, it's the same thing. Maddeningly, it is so. And then after you've wrapped your head around the case where we've only got two categories, I'll do more categories. You can have as many categories as you like. And when we get to multi-level models, what those are, for the most part, are ways of cleverly interacting continuous predictors with any number of categories you like and doing a good job of estimating those interactions. And so we're going to have cases where every individual is their own category in the data, and we will estimate parameters for every individual like that. And that'll be an extension of this strategy when we get there. But you don't have to understand that now. I just wanted to say we're headed towards the Golden Road here. And we're going to start with simple categorical examples. So dummy variables effectively allow each category in the data to have its own intercept. All they do is push the regression line up and down, because they add or don't add a constant, that is, the coefficient, to the prediction. That's all they do. So the structure of this model looks just like a linear regression, but now I've put in m sub i, which is whether or not the individual on row i is male. That's your dummy variable. And beta sub m is the coefficient for it. So I think I have animation. Yeah, that's the 0-1 variable male. Alpha is the intercept when you have a female individual in the data set. I'll say that again. Alpha is the intercept when you have a female individual in the data set.
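The model just described can be written out. With h sub i the height of individual i and m sub i the 0-1 male indicator:

```latex
\begin{aligned}
h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_m m_i
\end{aligned}
```

When m sub i is 0 (female), the mean is just alpha; when m sub i is 1 (male), the mean is alpha plus beta sub m. So beta sub m is the contrast, the difference between the two category intercepts.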
What's the intercept when it's a male individual? It's alpha plus beta sub m. So you can think of beta sub m as the change in the intercept when the individual is male. It's what's called a contrast. It's a difference. It's the difference in average height between a male and a female individual. I'll say it again. It's the difference in average height between a male and a female individual. Those sorts of parameters are called contrasts. They're differences. Does this make sense? So effectively, you've got two intercepts in this model, two parameters. And actually, in the book, I show you how to write this model so that there really are two intercept parameters. It's the same exact model, same exact posterior distribution, but factored a different way. There's a little box in the book to show you how to do that. In my opinion, it's easier to fit the model that way, with multiple intercept parameters. And when we get to multi-level models, that's how we're going to do it every time. But this is the conventional way to do linear regression. So I'm a prisoner of convention. Just like you guys are. I want you guys to be able to read the literature, so I need you to be able to read this unfortunately maladaptive, historically embedded way of writing linear models, where there's one category that gets assigned to the intercept. In that case, this is female. Or in this case, that is female. Too many demonstratives. In this case, the default category is female. And then all the others are expressed as contrasts between a particular category and the default category, which in this case is female. That makes it difficult to answer questions in many cases. But I'm going to show you how to process estimates out of a model like this into any number of contrasts you like. It's pretty easy. We're going to do it the same way we've always done it, by sampling from the posterior and then writing a little bit of R code. And it'll work out. And you won't even have to refit the model.
You can do all the re-expressions. So that's where we're headed. Here's how you fit this model. Nothing too surprising. Male is the dummy variable for male. It's already in this data set. Just write bm for the coefficient. It fits exactly the same. And we get a table of estimates. And you read this the same way as before. There's a posterior mean for the coefficient for male cases in the data. It's around seven. So males are on average seven centimeters taller in this sample. This is not controlling for age. So this is a pretty terrible model. There's huge age variation in here. You want to account for that. And there's a lot of filtering of males by age in this data, because guys don't live as long. So you really need to deal with that. That might be a homework problem at some point. Sounds like a good homework problem, doesn't it? And there's a lot of uncertainty, almost certainly arising from the fact that, yeah, there's all this age variation that's mixed in here. But I'm confident you folks know how to interpret a model like this. It's just a bivariate regression. Here's how you can think about the meaning of these parameters. The mean for a male at the MAP, the maximum a posteriori estimate, is around 135 plus seven. I'm just rounding those values, which is about 142. And the mean for females? About 135. You get that straight from the alpha estimate. Does it make sense? Yeah. Yeah. Some unenthusiastic nodding out there. It's like, yeah. Sort of. So what about the interval? To calculate the interval of the mean for males, this is why it's usually easier to fit these models where there's an intercept for each category. Then we already have the answer. We have a parameter for the average female height and a parameter for the average male height, and we have intervals on them, and we'd be done. We don't have that in this traditional form of fitting, where you have a contrast. So you have to compute it.
But you can't compute it by just, say, adding the standard deviations of the marginals. Because remember, these parameters are correlated with one another. And it's like, oh my god, they're correlated? What do I do? Sample from the posterior and add them up. And when you add across the samples, the correlation structure is automatically preserved, and you get the right answer. So again, you're doing integral calculus, but you don't know it. And that's my mission. It's to trick people into doing integral calculus. That's what I live for. So here's how you do it. Extract samples from this model. Write the linear model. Or rather, write the definition of the mean for a male. The mean for a male is the sum of alpha and beta sub m. And we just write that. So we're going to iterate over the rows in the posterior. So say we have 1,000 samples. We get 1,000 values for the mean for a male. And we just look at the interval of that distribution. Parameters have distributions, so everything you calculate from them has a distribution. And this is the same thing that's going on. Why did we add them together like this, a plus bm? Because that's the model you stated. That when it's a male, when male equals 1, then the linear model is a plus bm. When male is 0, it's just a. Does that make sense? Yeah. Can you compute it up there from the standard deviations? In this case... in general, no, you can't. When they're uncorrelated, you get lucky and you can. But in general, they'll be correlated. Or they could be correlated. And so always do this. Because if they're highly correlated, it won't be as benign. Yeah, in this case, this is a bad example. You could just look up there and you're like, you know, I could figure that out just by adding the intervals. And yes, you can. But in general, well, not quite actually. It doesn't actually work, does it? So no, in general, you can't, because they're correlated. The posterior, I should have put this up here.
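The sample-and-add recipe looks like this. The lecture uses R's extract.samples on the fitted model; here's a Python stand-in where the correlated posterior samples are drawn from an invented multivariate normal, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for posterior samples of alpha and beta_m. The negative
# covariance is typical of an intercept-plus-contrast model; the
# exact numbers here are invented for illustration.
n_samples = 10_000
cov = [[4.0, -3.0], [-3.0, 9.0]]
a, bm = rng.multivariate_normal([135.0, 7.0], cov, size=n_samples).T

# The mean height for a male is alpha + beta_m, computed per sample,
# so the correlation between the parameters is preserved automatically.
mu_male = a + bm

# An 89% percentile interval of that derived distribution.
interval = np.percentile(mu_male, [5.5, 94.5])
```

Notice that the spread of `mu_male` is narrower than what you'd get by naively stacking the two marginal standard deviations, because the negative correlation partially cancels.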
If you plot the samples for a and bm, you'll see that they're correlated. So it's not a big round cloud, and you can't just add up the marginal uncertainties. You have to do the sums. The distribution of the sum is what you want a confidence interval for. Yeah? Has this made some sense? You've got to go in and calculate it. Let me run through the logic again. Our goal is to calculate the mean, what the model predicts for an average male. Now of course you could do this by pushing it through link, right? But I'm showing it to you in the logical way, so make sure you understand what's going on. You can't get that from the table of coefficients, because there's no parameter for the average male. There's two parameters that describe the average male. And the height of a male is alpha plus bm. Alpha and bm have distributions, and those distributions are correlated. So to get the uncertainty of that sum, you calculate the distribution of that sum, which is what we've done in the R code at the bottom. And then you calculate the interval on that. And then you've got an interval for it. Yeah, I've got like eight other ways to say this. I'll try some more as we go. You do have to be patient with yourself about this because there's nothing... I don't even know how to say this. There's nothing cognitively natural about this style of work. It's something you get used to over time. It's like learning a foreign language, and you get better at it over time. Say you take your first class in French. I'm sure some people here have studied French and probably forgot it or something like that. And after your first class in French, you don't really speak French. You kind of speak at French or something like that, or French speaks at you, and you get little bits of it. But at least you know where the words are. When you started, it was just a long stream of sound. You couldn't pick out anything. And then you find the words.
And this is like that, but actually much easier than that. But you do have to be patient with yourself, because it does take time, and you're making all kinds of new synaptic connections and developing new mental habits. And you do have to be patient. It takes time. I've been doing this nonstop for, I guess, almost 20 years now. And so it does come a little too easy for me. I feel like actually there's this optimal range for teaching, where you can still remember being confused, but you're actually an expert. At some point you forget when you were confused, and then your teaching goes downhill. I'm worried that this will happen soon as I age. It's my pessimistic theory of teaching skill. But no, the empathy part of it is important. And I'm saying I can still cling to that empathy, and I remember this being an awkward thing to do. So I'm sympathetic, and you should never be embarrassed to admit you don't understand something in here. Because I remember not understanding. And let's face it, statistics as a culture is full of all kinds of arbitrary conventions. And one of them is this thing about contrasts and intercepts, and I would change it if I could. And when we get to the varying effects models, we will change that. We'll do it the unconventional way there, which is saner. But for now we're kind of stuck with this, and I've got to teach it to you. So on that note, let's do a more complicated example of an intercept with categorical contrasts, using the milk data set again. In the milk data set we've got, for each species in the data set, what I just call clade here. These are these broad, kind of old-fashioned primate taxonomic categories, right? Although they're mainly phylogenetically okay. At least in this data set. Although who knows, next year they might publish a new phylogeny. Then it's all over. As far as I can tell, with New World monkeys, we have no idea how they're related, right? But regardless, we're not going to look within that category. We just look between them.
So we're just going to look at four categories to keep it easy. As a teaching example, we're not putting a phylogeny on this. That's something we'll talk about in the last week of class, how to use a phylogeny in here. But the strepsirrhines are at the top, then New World monkeys, Old World monkeys, and apes like us. We're at the very bottom, right? Because we're miserable. And then we list the species. So we've got four categories. How do we get that in there? Well, the answer is you need three dummy variables. I'll generalize this in a couple slides: if you have K categories, you need K minus one dummy variables to code them. And the reason is because you're turning on and off parameters. It's the same thing as before. Each dummy variable turns on and off a parameter unique to one category, so that category can have its own intercept. So in this case, what that'll look like is, I'm going to define apes as the default category. They're the intercept. Why? Because we're apes. We like apes, right? We're the default. There's the egotistical modeling principle. And so then we need a dummy variable for the New World monkeys, the Old World monkeys, and the strepsirrhines, shown at the top here. So for all the strepsirrhines, what this implies is that the dummy variables for New World monkeys and Old World monkeys are zero in all those cases, and there's a one in the clade.S column in every case, right? And that codes that these cases are strepsirrhines. Does that make sense? And when I show you the model code, that will end up turning off the parameters for the contrasts for New World monkeys and Old World monkeys and turning on the one for the strepsirrhines. Yeah. Okay. Likewise for the other cases: for the New World monkeys, we have a one in the column for the New World monkey dummy variable, zeros in the others. For the Old World monkeys, we have a one in the Old World monkey column, zeros in the others.
So for apes, there are zeros for all three. Because apes are the default category. They get the intercept. We want to turn all of these other parameters off to make a prediction about apes. Yeah. Okay. I know this isn't the most exciting thing in the world. There are lots of ways to make these dummy variables. I'm going to show you the crudest and most logically transparent way, which is just to use ifelse. And all you're saying when you code these up is: for the dummy variable for New World monkeys, if the value in clade is "New World Monkey" — that's what it says, because that's how the data is coded in here — then store a one here, else store a zero. Right. That's how you do it. There are automated ways to do this, but you know me, I like to show you the non-automated ways to do things, because at some point the automation breaks down and you want to actually understand what's going on. Does this make sense so far? With me? At least enough to keep riding the roller coaster forward? Yeah. So let me show you the model now. It still looks like a linear regression model. Now we have three contrasts. There's the intercept alpha, which is going to be the average milk energy for apes. And then we've created three contrast parameters that adjust that intercept, depending on which category the case is in. There's one for New World monkeys, one for Old World monkeys, and one for strepsirrhines, each multiplied by the corresponding dummy variable. So you can see that NWM sub i and OWM sub i and S sub i are the dummy variables we made on the previous slide. And then we have the difference of each of these categories from the ape category. These are going to be contrasts. So the way you can think about this is that these models effectively just change the linear model depending upon which category you're in. And this table is meant to help you see that. The implication is, when the case is an ape, the linear model is just alpha. 
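For concreteness, here's a sketch in R of the dummy coding and the model just described, using the rethinking package. The parameter names and the specific priors are my reconstruction for illustration, not necessarily the exact ones on the slides.

```r
library(rethinking)
data(milk)
d <- milk

# One dummy variable per non-default category; apes are the default (intercept).
d$clade.NWM <- ifelse(d$clade == "New World Monkey", 1, 0)
d$clade.OWM <- ifelse(d$clade == "Old World Monkey", 1, 0)
d$clade.S   <- ifelse(d$clade == "Strepsirrhine", 1, 0)

# Intercept a is the ape mean; each beta is a contrast from apes.
m <- map(
    alist(
        kcal.per.g ~ dnorm(mu, sigma),
        mu <- a + b.NWM*clade.NWM + b.OWM*clade.OWM + b.S*clade.S,
        a ~ dnorm(0.6, 10),
        c(b.NWM, b.OWM, b.S) ~ dnorm(0, 1),
        sigma ~ dunif(0, 10)
    ),
    data = d
)
precis(m)
```

The ones and zeros in the dummy columns are what turn each contrast parameter on or off for a given case.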
The prediction is generated purely from alpha: mu sub i equals alpha. When it's a New World monkey, it's alpha plus b sub NWM, because it's that contrast. It's that extra amount — or lesser amount, in this case — of milk energy for New World monkeys, on average. For Old World monkeys, we turn on only one of them still, the Old World monkey contrast. And for strepsirrhines, it's just the strepsirrhine contrast. That's what the ones and zeros do. Okay, so fitting it in code: no surprises. I put regularizing priors on all these things, because that's my habit. And at the bottom now we get a table of coefficients. And again, these beta coefficients here are differences from apes, which makes it hard to get at the thing you want. You want to know: on average, what does the model say is the milk energy for each of these categories? So then we post-process. Same way as before, we're going to sample from the posterior. The fact that you know the model means you can add these things together and get posterior distributions for the mean of each category. So let's do that, just so there's no mysticism about it. Because this is what you'll want to do to interpret these models. You also do this when you plot, right? So this will be the thing you do when you plot predictions for each category. But you may also just want to report them in your paper, right? The model says that the distribution of average milk energy for strepsirrhines is the following. So we extract samples from the fit model. And then these lines of R code here are exactly the same equations from that table I showed. They're the implied linear models for each clade. So we only use alpha for apes. That's kind of redundant, but I wanted to show it to you so it was transparent, right? And for New World monkeys it's the sum of alpha and the beta coefficient for New World monkeys. And for Old World monkeys, likewise, except now with the Old World monkey coefficient, and then strepsirrhines, same arrangement. 
Yeah. I know I'm, like, hitting this six ways, but some of you are liking it, I can tell, right? It's like Stockholm Syndrome at this point. You're like, please, sir, another example. I have more, so don't worry. All right, then we can summarize. precis is this function I wrote for myself years ago, actually, because I wanted really simple summaries of tables and variables. And so it summarizes map models, but it will also do data frames. So you just put these posterior distributions in — these are lists of samples, right? These mu-dot symbols are lists of samples from the posterior distribution of the mean for each of those categories. You can make them into a data frame using the data.frame function, pass that to precis, and it gives you this table, just like before. And then they look like parameters. And these are the estimates you would have gotten if you had fit the model with a separate intercept for each category, which is how we'll do it when we get to varying effects much later in the course. So you're like, why do I care? Well, you probably want to report these, right? But most software only gives you contrasts. Does this make sense? Okay, yeah, question. Well, we did do that up here, right? So we... I mean, in the example before with males and females, we just... Well, we didn't do that. I'm trying to persuade you not to do that, because it doesn't quite work. You can get the map estimate that way, but to get the uncertainty you need the samples. So let me... I think I understand your question. Your question was, can we get the average milk energy for a strepsirrhine just by adding? You can get the map average milk energy for a strepsirrhine that way. But not the distribution, so you can't get the interval correctly. That's right. To get the interval correctly, you need to do this kind of thing and then get the summary. So you get these mean values just by doing the addition. 
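The post-processing being described can be sketched like this, assuming a fitted model `m` with an intercept `a` and contrast parameters `b.NWM`, `b.OWM`, `b.S` (my names for illustration):

```r
library(rethinking)
# assumes a fitted model m with parameters a, b.NWM, b.OWM, b.S
post <- extract.samples(m)

# Implied linear model for each clade, applied to every posterior sample:
mu.ape <- post$a                 # apes are the default category
mu.NWM <- post$a + post$b.NWM
mu.OWM <- post$a + post$b.OWM
mu.S   <- post$a + post$b.S

# Summarize the posterior mean milk energy for each clade:
precis(data.frame(mu.ape, mu.NWM, mu.OWM, mu.S))
```

Because the sums are computed sample by sample, the uncertainty propagates correctly into the intervals, which is exactly what adding map point estimates cannot give you.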
Yeah, yeah, you understand. Yeah, okay. So I didn't understand your question. Yeah, so you can get this column, the mean column — that's easy. Just add the map values. But you won't get the intervals right. And you won't get the standard deviations correctly. And the uncertainty matters. I hope, you know, I can drill that in. Yeah, yeah. Okay. So, all right. It ends up looking like this: you can make a plot like this where, for each clade, we plot the means with uncertainty bars. Now, the last issue with this, the last key issue, is about differences between categories. So when you fit this model in the conventional contrast form, you do have contrast parameters. Each of those little beta coefficients is the posterior distribution of the difference between, say, New World monkeys and apes. But what if you want the difference between the two kinds of monkeys, between New World monkeys and Old World monkeys? Now what do you do? Well, you know the model, so you can calculate it. So let me show you how to do that. This is computing your own custom contrasts. And this is something to worry about, because it hinges on a common fallacy in statistical interpretation: the idea that you can tell whether two distributions are different from one another just by checking whether each is different from some arbitrary point like zero. That is not the same as looking at the distribution of their difference. This comes up in published statistics very often, because people say, well, you know, I have this one predictor and it was statistically significantly different from zero and another wasn't. Therefore, those two things are different. Right? One was significant and the other wasn't, so they're different. That is an invalid inference. If what you actually want to know is whether the two things are different from one another, you must compute their contrast — the distribution of their difference. So let me put it to you in plain English. Well, I shouldn't say plain English. Something like English, right? 
Which is: if your question is about the difference between two categories, then compute the distribution of that difference. You want a parameter that asks that question. And maybe you didn't fit the model with that particular contrast in it, but we can compute it after the fact, because we have the posterior distribution. So I want to show you how to do that. So in this graph, the way to think about it is: if one and two are beta coefficients for two different categories, right, and both of them overlap zero so much that you think, eh, you know, they're not so different from zero — nevertheless, their difference can be very reliably different from zero, shown in blue. It's a different distribution. They're not the same thing. Some of you who have had a math-stats course before know that for Gaussian distributions there's a simple formula for computing the distribution of the difference: you add the variances, right? So the standard deviation of the difference is the square root of the sum of the variances. You don't have to memorize that, but that's the sort of mathematical consequence to think about for why you can't just check that each one is different from some fixed point. One way to think about this is: statistical significance is only about a difference from a fixed point, usually zero. It isn't about a difference from some other parameter. If you want to make an inference about the difference between two parameters, calculate the distribution of the difference between the two parameters. Does that make sense? So in the context of this model, the differences between clades are still called contrasts. They have their own distributions. Here's an example where we calculate it for the two monkey categories. It's exactly what you think. We have the distribution of the mean value for New World monkeys and the distribution of the mean value for Old World monkeys. We just subtract one from the other. 
Now we have the distribution of the difference, right? And it's a distribution. Exactly as before, all the uncertainty in the samples is being propagated forward through all these calculations. And then I just used the quantile function, because I realized I hadn't shown you how to do this before. I don't think so. Or maybe we did it in, like, chapter two or something. You just give it the list of values and it gives you — this is the 95% interval and the median, the 50% quantile. And that's the distribution here. So that's the distribution of the difference, right? That's the idea. That New World monkeys are slightly smaller, right? It overlaps zero, but smaller. Does this make some sense? I see nodding. Yeah. Categorical variables are really important and we're going to keep working with them. Like I said, when we get to varying effects later on in the course, that's the main engine: creating a bunch of unique intercepts and slopes for different categories. And it's a very powerful sort of thing to do. And ironically, it's easier when you do it later on, because we won't have this aliasing thing where we pick a default category. We'll just have an intercept for each of them, and then you'll see that form. But I need to teach you the conventional form. I really do. Otherwise, you won't be able to make sense of anyone else's tables. Right? Okay. Last thing I hope to do this week is tell you a little bit about the way these linear models are usually fit, which is using a different algorithm — but one that still gives you posterior distributions, if you give it a Bayesian interpretation, even though it's not usually presented as a Bayesian method. Let me talk you through that a little bit so you understand what's going on. 
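The custom contrast just described can be sketched like this, assuming `mu.NWM` and `mu.OWM` are vectors of posterior samples of each clade's mean, as computed earlier (names are illustrative):

```r
# assumes mu.NWM and mu.OWM: posterior samples of each clade's mean milk energy
diff.NWM.OWM <- mu.NWM - mu.OWM   # distribution of the difference, sample by sample

# 95% interval and the median (50% quantile) of the contrast:
quantile(diff.NWM.OWM, probs = c(0.025, 0.5, 0.975))
```

Subtracting the samples element-wise is what carries the joint uncertainty forward; summarizing each mean separately and comparing intervals would not.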
So most of you will have heard that the way most linear regressions, bivariate or multivariate, are fit to data is using an algorithm called ordinary least squares. And this is a very effective way to find map estimates — seen through a Bayesian lens, it finds the map estimates, assuming you use flat priors, or really flat priors. So it'll work fine for most of the stuff you guys have been doing here so far. If you want informative priors, it's not going to work, but OLS will work otherwise. And there's even a way to trick OLS into using priors, but we're not going to go into that. But there is a way to do it. What's interesting about the intellectual history here, of course, is that Gauss — who was one of the people who independently discovered OLS, least squares estimation as it's sometimes called — used a Bayesian justification for it. So now we often think of OLS as a non-Bayesian method, but originally it was justified in Bayesian terms. This was before anybody called this stuff Bayesian. It was just the way you did stats back then, around 1800. And that's where it comes from. So it's worth looking a little bit at how it works, to demystify the machine a little. And it's fine, if you want to use flat priors or nearly flat priors, to fit these models using lm, which is the linear model fitter in R. You'll get a summary table, which you can interpret as a posterior distribution under those very weakly informative priors. And extract.samples will still work on those models. I wrote it that way. It pulls out the variance-covariance matrix, and you can still get samples and process all this stuff the same way. So let me talk you through a little bit of that. Just a little bit. Ordinary least squares finds the parameters that minimize the sum of squared distances between the observations and the model's expected value. And this turns out to be the same thing as finding the map point, the maximum a posteriori estimate. 
For these perfectly linear models, it can be done analytically, and this is what Gauss proved. So what is the sum of squares? Well, imagine drawing a horizontal regression line through the height and weight data shown on this slide here. If we calculate the squared distance between each observed data point and the line, and then add them all up, we get this value labeled here SS, which is the sum of squares. And now here's the trick. If we start tilting this line, we want to find the line that minimizes that sum of squares. And it turns out that line will be the line with the maximum a posteriori parameter values. That's what Gauss proved. So let me show you in cartoon fashion. We start there — in the upper left is the figure I just showed you. At the bottom, I'm showing you the whole surface: across all values of the slope we're interested in here, what the sum of squares is. So we're starting at a terrible place. Why? Because the line is horizontal, and obviously there's a relationship between the two. And also because it's an example. You always start at the bad place and improve. You don't start at the good place and go bad in examples. That's discouraging. So now we start tilting the line up. Why up? Because I know which direction to go. We could go the other way, and then we'd get very despondent. But the lines start to get shorter. Not all of them — some of them get longer — but it's the sum that matters. And you may ask yourself, why do we square them? That's a consequence of the square in the Gaussian density function: it's a quadratic function of the difference between the mean and the observed value. So we keep tilting. We've overshot when we've gone over there. And there's a value right there at the bottom, at the nadir. This minimizes the sum of squares. It minimizes the expected error in prediction. 
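Here's a minimal sketch of that tilting search, with simulated height-and-weight-style data (the data and the grid of candidate slopes are made up for illustration):

```r
set.seed(1)
x <- rnorm(50, mean = 150, sd = 10)       # fake "height"
y <- -50 + 0.6*x + rnorm(50, sd = 2)      # fake "weight"

# Sum of squared distances between the data and a line with slope b,
# with the intercept chosen so the line passes through the means:
ss <- function(b) sum((y - (mean(y) + b*(x - mean(x))))^2)

# Tilt the line across a grid of slopes and find the nadir:
slopes <- seq(-1, 2, length.out = 301)
best <- slopes[which.min(sapply(slopes, ss))]
best                  # close to the true slope of 0.6
coef(lm(y ~ x))[2]    # lm's least-squares slope, essentially the same value
```

The slope at the bottom of that sum-of-squares curve is the same slope lm returns analytically.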
It gives you the parameter values most consistent with these data, in this model. And that's why they're the maximum a posteriori values. And then quadratic approximation can give you the curvature, assuming the posterior is normal. Same thing. And that's what lm will do, actually. It'll do that for you. So for Gaussian additive models, you can find the map values this way. There's nothing wrong with this. This amounts to using a lot of quote-unquote non-Bayesian software to compute Bayesian posterior distributions. You really can. It turns out at this level, for Gaussian models, while there's a logical difference between Bayesian estimates and non-Bayesian estimates, they're often quantitatively almost the same. This will not be true for almost any other kind of regression. But it is true in this case. A couple of you came to my office hours and I ended up explaining this to you accidentally, so I'll just very quickly say it again. This will sound a little bit mystical, but I like to leave you with mystical things. This happens in the Gaussian because there's this little term, mu minus y squared, inside the exponentiation. Those of you at home — I'm gesturing in the air. It's not helping anybody in the room either. There's this squared-difference term in the Gaussian density function. In that term, you have a parameter, mu, which is the mean, and you have an observed value. Those are exchangeable, in the sense that if you swap one for the other, you get the same value out, because it's a squared difference. It doesn't matter which you subtract from the other, because you're going to square it. That gets rid of the minus sign. As a consequence, the Gaussian probability of an observed value, say y, conditional on mu, is the same as the probability of mu conditional on y, using a flat prior. 
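You can check that symmetry directly in R: dnorm gives the same density whether you treat 3 as the observation and 5 as the mean, or the other way around, because the squared difference is the same either way.

```r
# The squared difference makes observation and mean exchangeable:
dnorm(3, mean = 5, sd = 1)   # density of the value 3 given mu = 5
dnorm(5, mean = 3, sd = 1)   # density of the value 5 given mu = 3
dnorm(3, mean = 5, sd = 1) == dnorm(5, mean = 3, sd = 1)   # TRUE
```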
That's why you can get a Bayesian posterior using lm, even though it wasn't built to do that. That's the symmetry Gauss used to justify it. I think that's kind of neat, actually — that for the most common kind of regression method, the two are basically the same. We'll still interpret them differently. You still need to worry about everything the same way. I still encourage you to process the samples so you don't get lost, but there's nothing wrong with using quote-unquote non-Bayesian software. It's a different conditioning engine. It's OLS. It's a different little machine that powers up your robot. Okay, so how do you use it? Just two minutes, then. Some of you already know this because you've fought with R already. You arrived in graduate school and the first thing someone told you is you must learn R, and you're like, oh my God. What is that? I know my alphabet. What's the letter R? Then you probably encountered these weird things I call design formulas, although there's another name for them. Maybe I should find another name. But what these are is abbreviated ways to specify these linear models. So in the simplest case, if we have a linear regression between y and x, you'd write it as a design formula in R as y ~ 1 + x. The one means intercept. Those of you who've learned design matrices in a previous math-stats course will hum right along here — you know exactly where this comes from. If you haven't, you don't need to know it. Don't worry about it. Well, if you do need to know it, let me know and I'll explain it to you, but I don't think you do. For a slightly more complicated one, if you have more terms, you just add them on. So for a multiple regression with x, z, and w, it's y ~ 1 + x + z + w. And to use it with lm, you just put those design formulas inside lm and then tell it the data. 
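For example (the data here are simulated just so there's something to fit):

```r
set.seed(1)
d <- data.frame(x = runif(100), z = runif(100), w = runif(100))
d$y <- 1 + 2*d$x - 1*d$z + rnorm(100, sd = 0.5)

# Simple linear regression: intercept (the 1) plus a slope on x
m1 <- lm(y ~ 1 + x, data = d)

# Multiple regression: just add terms to the formula
m2 <- lm(y ~ 1 + x + z + w, data = d)

summary(m2)   # under (implicit) flat priors, read this as a posterior summary
```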
So it looks a lot like map fitting, but you don't have that long, inconvenient list that I make you write in your homework, right? The trade-off here is you can't use priors. Effectively, that means you're assuming a flat prior from minus infinity to infinity for every parameter. And if you've got a reasonable amount of data, that's cool. It's not going to be a big deal. But sometimes it'll explode, and we will have examples later on. For Gaussian models, you can get away with it. For non-Gaussian models, it can be a severe issue. So: the intercept is optional, but it's always assumed. These two lines of code specify the same model. I encourage you to put the intercept in explicitly. If you want to get rid of the intercept, how do you do it? Well, you have to, ironically, either put a zero in the linear model — so in model 5.14 here, it's y ~ 0 + x, which fixes alpha at zero so it doesn't estimate it — or you have to subtract one. That's the maddening form, right? It's like, I'm subtracting one from the model? But it just gets rid of the intercept. And you're saying, why would I want to get rid of the intercept? If you're in a case where you want to get rid of the intercept, you'll know it. And if you've forgotten how to do this, let me know — send me an email and I'll remind you how to do it. Last thing to say: in the rethinking package, there is a function in progress. It seems to work, but I have not done a lot of testing with it. You give it these design formulas, and it spits out the map formula list. So you can always remind yourself what the model is actually assuming. And it sticks in default priors, so you can edit those. It just gives you this list, and then you can edit it as you like and make whatever priors you want. So if you want to add priors, or if you're just lazy and want a template for a map model, you can use glimmer and type in the design formula. 
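Sketching those intercept conventions with a made-up data frame:

```r
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 0.5*d$x + rnorm(20)

# These two formulas specify the same model; the intercept is implied:
coef(lm(y ~ x, data = d))
coef(lm(y ~ 1 + x, data = d))

# Two equivalent ways to drop the intercept, fixing alpha at zero:
coef(lm(y ~ 0 + x, data = d))
coef(lm(y ~ x - 1, data = d))
```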
And this will work later for GLMs and for varying effects models. We'll be able to do multilevel models this way, too — you can put in multilevel formulas. So glimmer mainly works. I'm saying mainly. I don't use it, but I'm working on it. So if you folks want to start using it and telling me about errors in it, I'd appreciate it. But so far, my test cases all pass, which only means that I don't have good testers. So we'll keep working on it. But that's the idea: it's provided as a scaffold. So if you forget all the details or want to save yourself some time, you can use it. Okay. Paul has some homework requests to put in here. Include your code, please. Some of you forgot your code. If you upload it later, that's cool. We're very forgiving. Just give us your code. And also explain the code. Don't just give us code, because then we're like, oh, nice code. What is this doing? And include the plots, please, so we don't have to make them ourselves — or rather, so Paul doesn't have to make them himself. He has a young baby, so any time you can save him is more time he can be a daddy. So if you don't plot, his child will cry. And especially, do not give Paul MS Word documents, because they make his computer crash. Or rather, we can't reliably see your plots when you give us an MS Word document. We don't actually discriminate against Word. It's just software, and I try not to have emotional attitudes toward machines. It's unhealthy. The problem is that there's no guarantee we can see your Microsoft Word document the way you do. That's how Microsoft Word is. So just export a PDF, please. If you forget, it's no problem. Paul will complain, and you'll upload a PDF. It'll be awesome. All right. Thank you all. Next week, overfitting.