Okay, so let's start with multi-level models. This will be mainly conceptual work today, and then starting on Tuesday next week we can hit the ground running and actually do some work. There are lots of ways to justify multi-level models, but the most general kind of epistemological justification, the one that I prefer, is this one. Let me introduce you to it through the biography of a really impressive individual, the musicologist Clive Wearing. He's still alive. A British musicologist and conductor, he was at the height of his career when he got a herpes encephalitis infection. Herpes is a strange virus in the sense that depending on which tissue it infects, it manifests as a completely different disease, and he got one that went to the brain. When herpes goes to the brain, it causes really severe brain damage. It's often deadly, but he survived. They saved his life, but he lost large parts of his prefrontal cortex and his hippocampus, and as a consequence he has anterograde amnesia.

It's funny, his motor memory, his procedural memory, is largely intact. So he knows how to make coffee, but the first time he tastes it, he's like, wow, this is the first time I've ever had coffee. He doesn't remember ever having had coffee, but he knows how to make it. Likewise, he was a very accomplished musicologist, conductor, and music historian, and he can still play piano and conduct symphonies, but if you ask him whether he can do these things, he'll insist that he's never done them before. And as soon as he stops playing the piano, one minute later, he will have forgotten that he just played it. He's like the proverbial goldfish in the bowl. Every time he turns around, the castle is there again. He's like, ooh, what a nice castle. He spins around and doesn't see it. Ooh, nice castle. It's really fascinating, and there are a bunch of clinical studies of this fellow, Clive Wearing, and individuals like him. They're rare, but people are interested, of course. You learn a lot about the brain when it's not working right; often the best way to study a system is when it's broken. And his wife wrote a fantastic and very touching book about their life together, because when he got this illness, physicians really didn't know how to treat people like this; it's incredibly rare. His wife did a lot of work figuring out how to help him, and they've made great progress, as I understand it, in helping him lead a much more normal life. It'll never be a normal life, of course, but a much more normal life than before. I give you citations to that in the notes, by the way. It's pretty interesting, and I think there was a documentary made about it.

So I wanted you to think about that. This is my mnemonic device to help you think about statistical modeling. Clive Wearing lives in a world where he can't remember what happened a minute ago. So he learns things in the short term. He makes some coffee, and while he's making the coffee he learns whether he likes coffee, for a moment. Then he puts the coffee down, and he forgets he ever had coffee. The next cup of coffee is a new cup of coffee. Of course, we're not like this, because we don't have anterograde amnesia. We can learn going forward, and our past experiences inform our expectations for current events. So let's think beyond Clive Wearing's cup of coffee, to when you get cups of coffee at cafes in different parts of the world. I'm showing here a cafe in Paris on the left and a cafe in Berlin on the right.
And let's say we take some simple measurement, like how long it takes you to get your cup of coffee. But it could be anything you want to record about your cup of coffee. You go to the Paris cafe; say it takes seven minutes for your extremely rude waiter to bring you a cup of coffee. If you've ever spent time in Paris, you know this is accurate. Yes, no, true, exactly. This is a tourist photo, right? And I think the Berlin one is more accurate, because it kind of looks like it's about to rain, which is exactly what is probably about to happen. So it takes you like seven minutes to get your cup of coffee. Then the next week you're in Berlin, you're at a cafe, and you order a cup of coffee. Before you get your cup of coffee, what's your expectation for how long it's going to take? Now, seven informs it, but here's the thing: you're not sure it's going to be exactly the same, because it's a different cafe. And not only that, even at the same cafe it won't take exactly the same amount of time every day. So you know there's variation. But nevertheless, your experience at the Paris cafe is informative. It's useful to remember it, even though you're in a new context. It's not a completely different context. It's still a cafe. It's still just a cup of coffee, right? Anterograde amnesia would have you completely forget the Paris cafe and act like the coffee in Berlin is the first cup of coffee you've ever had.

Now here's the trick, though. It isn't just that Paris informs Berlin when you get the cup of coffee in Berlin. Once you receive your cup of coffee in Berlin, you should also update your estimate of how long, on average, it takes to get a cup of coffee at that cafe in Paris. Right? Because that was a finite sample too. Now you have equal amounts of data on both. The fact that you visited one first is irrelevant to the learning. Information moves forward and backward in time inside a statistical model. The order of the visits really doesn't matter, because it's a joint probability distribution. And so you're compelled logically, if you are a Bayesian or a Bayesian model, to update in both directions, regardless of the order you visit them in. To do otherwise would be to commit yourself to anterograde amnesia and live the life of the goldfish. Does this make some sense?

So all the models we've looked at so far in the course are goldfish. They have anterograde amnesia. As they move from one cluster to another, whether it's an academic department or an individual or any of the other kinds of clusters in the data sets we've had, like nations, they don't use any of the information from the other clusters to improve the estimates for the one they're in. And all those clusters are finite samples, so there's pooling of information available that could improve the estimates for the whole collection of parameters. This is what multi-level models do. They break the anterograde amnesia of classical statistical models, and they use all the information optimally, conditional on the assumptions of the model, of course. This is not magic that lets you learn across clusters for free. But you don't have to get all the assumptions exactly right. You just have to do better than being an amnesiac, right? Use some information. How much pooling you want to do, that is, how much you let your experience in Berlin influence your estimate for Paris and vice versa, will of course depend upon your estimate of the population of cafes. What is that population like? If cafes are all pretty much the same, and in Europe they are, in the U.S.
they're a completely different experience, because you're expected to take your cup of coffee in your hand and walk out, right? It's like an assembly line. But in Europe you're expected to sit down with your coffee like a civilized human being and drink it. You guys know what I mean. So European cafes are all pretty much the same, and you don't expect the amount of time you'll wait for your cup of coffee to be wildly different. That means you pool a lot of information, because you're approaching, you don't quite meet, but you're approaching the case where they're all the same cafe, epistemologically speaking. You still expect them to be somewhat variable, but in the limit where they're all the same, you would just pool all the data. Yeah. Is that a hand back there? No. Okay. I'm just seeing hands everywhere.

There are cases where you don't want to pool very much at all. Here's another example to think about. There are things that are much more variable than cafes. These are goat chili peppers; they're called goat peppers. I ate a bunch of these when I was doing field work in Tanzania, and I always had them in my pockets, because these protect you against intestinal infections. That's the main thing: spicy food kills gut parasites of all sorts. So I spiked all my food with these things, and I stayed very healthy after a while. The thing about goat peppers is they're only barely domesticated, I would say. Even peppers from the same plant are wildly different in how spicy they are. So you have to gingerly taste the first bit of an individual goat pepper to figure out how much pain you might be in if you put the whole thing in your rice. Some of them are really mild, even from the same plant. It's very frustrating. So I had a pocket full of them, and I would slice off a little piece to taste, and be like, okay, this is a spicy one, I need this much of it in my rice. And oh, this one's really mild, the whole thing goes in, right? Or maybe two of them. In this case, the pepper you just ate gives you very little information. You take a plant and pick an individual pepper off of it. The first pepper you try gives you very little information about the next pepper, very little information about the average on the plant. The plants do vary in their averages, because they have been artificially selected to be spicier than they would be naturally. But there's still so much variation that the benefit of pooling information across tastes of peppers is much lower. The amnesiac models do pretty well in these cases, because generalization isn't very valuable. So clusters can be nearly interchangeable, like the cafes, or radically different, like the goat peppers. Does this make sense?

So multi-level models also, as a consequence, attempt to estimate these features of the population of objects. They have new parameters in them, at a minimum the mean and the standard deviation of these populations of things, whether they be cafes or individuals or goat peppers. And learning about the population while you estimate the features of each cluster in the population helps a ton with accuracy of estimation. This is the basic justification I want to give you for multi-level models. They'll do other good things for you too, but the primary thing is that they use the data better. They get more information out of the data than if you were an amnesiac, right? Here's a tiny numerical sketch of how the amount of pooling depends on the population.
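This is a minimal toy sketch of the textbook Gaussian-Gaussian case, where the partially pooled estimate for a cluster is a precision-weighted average of the cluster's own data and the population mean. All the numbers (waiting times, sigma, tau) are invented for illustration.

```r
# Toy sketch of Gaussian-Gaussian partial pooling. Invented numbers.
pooled_estimate <- function(ybar, n, sigma, mu_pop, tau) {
    w_data <- n / sigma^2   # precision of this cafe's own observations
    w_pop  <- 1 / tau^2     # precision contributed by the population of cafes
    (w_data * ybar + w_pop * mu_pop) / (w_data + w_pop)
}

# cafes nearly identical (small tau): pool a lot, so one 7-minute visit
# barely moves you away from the population mean of 5 minutes
pooled_estimate(ybar = 7, n = 1, sigma = 2, mu_pop = 5, tau = 0.5)  # about 5.1

# clusters wildly variable (large tau), like the goat peppers:
# pool almost nothing, so the estimate stays near the raw observation
pooled_estimate(ybar = 7, n = 1, sigma = 2, mu_pop = 5, tau = 10)   # about 6.9
```

The two calls show the two extremes: when the population is tight, a single observation gets shrunk hard toward the population mean; when the population is wildly variable, you stay close to the amnesiac estimate.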
Okay, so here's an example from earlier, when we looked at the academic departments in the UC Berkeley data. Remember, we have these different departments. When we estimate an intercept for each, we do much better in posterior prediction, and we don't get tricked by the fact that female applications went mainly to departments with really low admissions rates. Nevertheless, we left information on the table, because the numbers of applications submitted to different departments vary quite a lot. And look, they're all academic departments, and there's sampling variation in each. So it's like the cafes, right? There's variation among them, but if we can learn about that variation, then we can decide how much information to transfer across departments to improve the estimates. It's a little bit magical right now, I'm sure, because we haven't looked at the machinery of it yet; we'll do that in a little bit. Right now I'm trying to give you the idea of recognizing a situation in which it might be profitable.

So there's a standard language here. The models we've been using up to this point in the course are often called fixed-effects models. None of the parameters are... well, it's the opposite of varying effects, so I'm going to punt on what that means until I show you an example. Let's just say everything we've done so far is the simple kind. And these models have anterograde amnesia, like Clive Wearing, or the goldfish in the bowl, excited about that castle every single time. Every new cluster, an individual, a pond (we're going to do tadpoles today, so that's where the ponds come in), a road, a classroom, is some new world to the model. It uses only the data from that particular cluster to estimate the parameters that are unique to that cluster. That sounds reasonable, until you realize it forces you to be an amnesiac. Because if you just came from a different pond or a different road or a different classroom, you learned something that's generalizable, and the question is how much. To answer how much you can generalize, you need to know something about the population of these clusters. That's what multi-level models do. They remember information across clusters, and they pool that information in savvy ways determined by their estimates of the shape of the population. The clusters have properties, and those properties come from some statistical population, but we can't see it directly. The cafes or the goat peppers or the classrooms or the individuals are all samples from some statistical population. If we knew, at a minimum, the average and the variation of that population, we could figure out optimally, in a Bayesian way, how to pool information across the clusters in the sample.

I appreciate that this could seem a little mystical. As is typical with these kinds of abstract thought experiments, you can get some insight by thinking about extreme cases. So imagine a case, for example, where you study a school, like UC Davis, and you've got information on average test scores in a bunch of classrooms, and then there's a new classroom for which you don't have any data yet. And I ask you: what's your prediction for that classroom? Well, you'll use the posterior distribution for the population of classrooms you've already studied. That gives you your prior for this new classroom. Now, when you examine that new classroom, you get to update that prior. The prior is the population distribution, right? Because each classroom is sampled from that population. If you've got a posterior distribution for the shape of that population, you can statistically sample a classroom from it. That gives you a prior for the features of each classroom. Once you observe the test scores for that classroom, you update that to a posterior distribution for the classroom. But then you have to feed that information back into your estimate of the population as well. And so you get a posterior of the prior for the classroom, which is, yes, turtles all the way down. I told you we'd get there. That's how the pooling works, jointly. Here's a tiny sketch of the first step, predicting the new classroom.
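A minimal sketch of that prediction step, assuming we already have posterior samples for the population mean and spread; the samples here are invented stand-ins for what a fitted multi-level model would give you.

```r
# Toy sketch: prior for a brand-new classroom, built from (hypothetical)
# posterior samples of the classroom population's mean and spread.
mu_post  <- rnorm(1e4, mean = 75, sd = 2)     # invented posterior: population mean score
tau_post <- abs(rnorm(1e4, mean = 8, sd = 1)) # invented posterior: population spread

# for each posterior draw, sample one classroom from that population;
# the result is the prior (predictive) distribution for the new classroom
new_class_prior <- rnorm(1e4, mu_post, tau_post)
quantile(new_class_prior, c(0.05, 0.5, 0.95))
```

Once the new classroom's scores arrive, you'd update this prior to a posterior for that classroom, and in a joint model that update also flows back into the population parameters, which is the feedback just described.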
Now I'm going to show you the model, and maybe this will make a little bit more sense. This is a fundamentally, psychologically difficult thing to think about. The cool thing is that your brain does this. I don't think it does it analytically in exactly the same way these models do, but you're not an amnesiac. You do pool information across similar things, as you define them cognitively. Absolutely: tables, cafes, Starbucks, right? These things are categories, and people generalize automatically across them. We'd just like models that are at least as savvy as undergraduates. There are no undergrads here, so I guess I can say that. OK.

So let's shift gears a little bit, although I'll tell you why this is relevant. What you're looking at here is a chart of organ donation consent percentages across different nations. There are four here on the left where the legal policy is that people have to opt in. They have to, you know, sign the back of their driver's license or send in a form to say, yes, when I die, please use my organs to save other people's lives, something like that. The United States is also opt-in; I think maybe it varies from state to state, it probably does, like everything else in this crazy country, but I think we're almost entirely opt-in nationwide. And those nations' proportions of organ donation consent are really low, because you have to do something to opt in. Yet if you poll in these nations and ask people whether they think organ donation is a good idea, almost 100% say it is. Not quite 100%, but almost.

The countries in blue are cases where the law is opt-out. You sign the back of your driver's license to say, no, do not harvest my kidneys, or I guess you could write that on the back of your driver's license. In these countries, consent is almost at 100%, except in Sweden, where there's an amazingly high opt-out rate; it's interesting. And public opinion across these nations, at least on this slide, barely differs in terms of how much people think it's a good idea to be an organ donor. In almost all of these countries, nearly 90% agree that it's a great idea to be an organ donor.

The point of this is that defaults are very powerful things. Not just in legal systems and healthcare systems, which is really what this chart is about, but also in statistical and scientific methods: the default procedure, the one you must opt out of, exerts an incredible force on the nature of our work. And so here I want to argue something that may seem unreasonable, though I think over time it will not seem so unreasonable: multi-level regression deserves to be the default form of regression. There was a time when ordinary linear regression seemed fancy to people. There are still a few people on campus who feel that way, but not very many. And then multiple regression seemed like this crazy fancy thing.
People would say, no, I'm not doing fancy statistics, you and your multiple predictor variables. And now we're in a situation where multiple regression is pretty much the default. You can't say that's fancy, right? It's a standard thing. So generation by generation, the default ratchets itself up. And I think it's time we ratcheted up to multi-level, because the software to fit these models is available, practically on your phones. It's not that much harder, and you get better estimates in nearly every data context from a multi-level model. It's the power of defaults. That said, there are definitely situations in which you don't need multi-level models. So opting out is okay; there's nothing morally wrong with it. But again, it's the power of defaults: let's force people to justify opting out. No, don't take my kidneys. No, these cafes are so different from one another. Or, I have so much data that it won't matter. And I'll give you illustrations as we work through the data examples of cases where opting out is fine. But the default is a powerful thing, and I want to encourage it in that direction.

Okay. So, practical goals for the slides to follow and all of next week. I'm going to introduce multi-level models to you in a mechanistic way. The main concepts to get are called shrinkage and pooling. They are aspects of the same process: not forgetting as you move from cluster to cluster, but also generalizing only the right amount, the right amount according to what you've learned about the population as you go. My goal is to give you intuition about why these models produce better estimates, as a consequence of remembering things while moving from cluster to cluster. I'm going to show you how to fit these models with map2stan. Plain map is dead to us; it will not fit these models. Well, for the pure Gaussian-Gaussian model it actually wouldn't do such a bad job, but that's the special case. Outside of that, you're done. I'll show you methods for plotting and comparing, as always.

And then, I think at the end of next week, I may get started on it, but certainly at the beginning of the week after, I want to generalize this to the case where we have continuous categories. This is a family of models often called Gaussian process regression, where we want to do shrinkage and pooling across continuous categorizations, where individuals or units take values along some continuous dimension that governs how similar they are. Like age groups, right? Or location, which is the one the ecologists are really going to be interested in: spatial autocorrelation problems. Phylogenetic distance is another example. These are categories of a sort, but there are an infinite number of them, and some are closer to each other than others. That's different from the categories we've worked with so far, because all those categories are equally dissimilar from one another; they are unordered categories. And you want to learn across them. I'll show you how to do that with Gaussian processes. It's a really useful thing. That's my megalomaniacal objective.

Okay. So, functionally speaking, as end users of these models, you'll want to use them when you have clustering in the data. And this happens with the vast majority of data sets, which is why I say this deserves to be the default form of regression. Classrooms within schools, students within classrooms, grades within students, questions within exams. I once did a multi-level analysis of exams like this. The TAs hated it.
But you learn a lot about your test from these things. The basic idea here is that we have repeat measures of units, right? Biologists learn about this as pseudo-replication, which makes it sound like a disease. But repeat measures are a good thing: you want data sets with pseudo-replication in that sense. The problem is using a model that ignores this fact when you have repeat measures of the same unit. There's nothing wrong with having repeat measures of the same unit; you want to aim for that, because more data is good. When you use a model that doesn't match the structure of the data generation, then you get into trouble, and that's usually what pseudo-replication refers to in biology.

Often we end up with imbalance in sampling as well, such that some of the clusters have been sampled a lot more than others. It's like some students don't come to class very often, so you have fewer test scores for them, which means you have a less precise estimate of how good a student they are. So if you just naively ignore the clustering structure, some students dominate the inference, because you have more data from them. This happens in lots of data sets, especially observational ones, but even in experiments, because you can design the best experiment in the world, and that ain't how it's going to play out. Let me tell you, when you actually collect the data, some units are not going to work on some days, or along will come a bear and empty your trap, or whatever it is that happens in your system. It doesn't always work out quite right, so expect imbalance.

We had some examples earlier where we could have used multi-level modeling: individuals and families in the Kalahari San data, species and clades in the primate milk data, nations and continents, and actors and blocks in the chimpanzee data. We're going to reanalyze the chimp data in the multi-level context, and we'll also come back to applicants and departments in a multi-level way.

Let me give you the first worked example; I think I've got 20 minutes here, so there's time. There's a data set in the rethinking library called reedfrogs. This is Ben Bolker's reed frog experimental data. These are field experiments where they collected reed frog eggs and put them into experimental... they were buckets, but they're called tanks in the data. They're buckets, basically. The eggs are laid on leaves, and the tadpoles drop down into these buckets suspended under them, and then the researchers can move tadpoles around and manipulate the densities in the buckets. So it's a great kind of field experiment on predator defense and aggregation and so on. I cite the paper in the book; take a look if you're interested in things like this. We're going to be interested in it because there's clear clustering, and there's lots of heterogeneity across the tanks in the outcomes. The outcome of interest is survival. Why survival? Well, tadpoles get eaten by things. There are vicious arthropod predators, like the one displayed here, that love delicious tadpoles. Tadpoles eat tadpoles too, although in this species I don't think cannibalism is very common. So there are a bunch of experimental treatments where they vary the density, the initial size of the tadpoles, and the presence or absence of predators. We're going to ignore those predictors for now; we might come back to them in a homework problem next week. Instead, I'm just going to say: those treatment effects generate a lot of variation in survival of tadpoles across the buckets.
And so you can view each tadpole's outcome, one or zero, whether it lives, as one cup of coffee from a cafe. Each bucket is a cafe. You learn something about each bucket, and that information helps you understand survival in the other buckets as we go. So we're going to build a multi-level model of tadpole survival in this context. Here's the conceptual structure of it: we've got tadpoles in tanks, there are different treatments, and the outcome is the number surviving. We're going to look at two basic models. First, we're going to fit ye olde dummy variable model, where we have a dummy variable for each tank. That's the amnesiac model; it does no pooling. I'm going to remind you how that works, and you've seen models like that before: we did it with the UC Berkeley admissions data, and I'm just going to replicate that structure. Then we're going to do the multi-level model, which is barely different, but does a much, much better job of coping with these data, and you learn a lot more from it. It'll help me illustrate shrinkage and pooling and why they help with accuracy. Okay. And no, I don't know what a spunky tadpole is, but I Googled 'tadpole,' and it was there. So, you're welcome. When there are only words on a slide, it makes me uncomfortable. That's just how it is; I need something on my slides.

Here's the tadpole model that just has a dummy variable for each tank. We're estimating a unique intercept for each tank. And I'm going to do this in the form where we don't explicitly code dummy variables, but instead use the trick of indexing: we have a vector of intercepts and a tank index variable. I'll show you the code in a moment, in case you've forgotten. We did this with the Berkeley data, didn't we? Or the chimp data? It was the chimpanzee actor data. We did it with the chimpanzee actors, right? Okay. So S sub i is the number of individuals surviving in tank i. This is a binomial distribution, and we know how many tadpoles were there on day one, when they were placed in the tank. N sub i is the initial density, the number of tadpoles that could have survived. And p sub i is what we're going to model. We put a logit link on it, and it's equal to alpha for tank i, which is the log-odds of survival in that tank. That's what we're going to estimate. Then we assign all the intercepts a common regularizing prior, a pretty weak one centered on zero. Remember, a log-odds of zero is a half chance of survival, and five is a big standard deviation on the log-odds scale; it's reasonably flat over the whole range of plausible probabilities. So, weakly regularizing, not too much, and they all get the same prior.

We can fit this model; you've seen code like this before, and there's a sketch of it below. Load the reedfrogs data. The next thing to do is construct the tank index. In this data set, every row is a tank, an experimental tank, and this is an aggregated binomial model: the outcome is a count and can be greater than one. So we can just use the row number; the index variable is just one through the number of rows, because every row is a tank. And then you fit the model. You just write alpha bracket tank, and map2stan knows what you want to do. It makes a vector of parameters for you, and you get the results from it, just like the chimpanzee actor example. Makes sense? We're not going to inspect the output yet.
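Here's a sketch of that no-pooling model in R with the rethinking package, along the lines of the course code; take the column names surv and density as assumptions about the reedfrogs data frame.

```r
library(rethinking)
data(reedfrogs)
d <- reedfrogs

# every row is one tank, so the row numbers serve as the tank index
d$tank <- 1:nrow(d)

# amnesiac model: a unique intercept per tank, no pooling across tanks
m_nopool <- map2stan(
    alist(
        surv ~ dbinom(density, p),   # aggregated binomial: S_i ~ Binomial(N_i, p_i)
        logit(p) <- a_tank[tank],    # log-odds of survival in tank i
        a_tank[tank] ~ dnorm(0, 5)   # same weakly regularizing prior for every tank
    ),
    data = d
)
```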
Can we get to the other model? Yeah? Okay. This is the multi-level model, and it's got two extra lines, and they're both priors. So let me walk you through, step by step, what's going on here. The thing that's new, in a sense, is that instead of a standard regularizing prior, which has a zero and a five in it, or any numbers you like, but they're constants, there are now parameters inside this prior. It gets an adaptively regularizing prior: a prior that gets learned from the data as you fit the model. This is what makes these models multi-level. So now you're like, but if it's a prior, then how are we learning what it is? Well, welcome to Bayesian land. It's turtles all the way down. Whether this thing is a prior or a posterior depends upon which parameter you are. So imagine yourself as a parameter. If you're an intercept for a tank, then this distribution is a prior for you, because it's the initial thing that the data from that tank updates, just like before. In the previous model, that normal distribution with mean zero and standard deviation five was the prior for every tank's intercept. So if you're an intercept for a tank, this is your prior. However, if you're the alpha and sigma that make it up, it's a posterior distribution. The whole thing is being learned. The prior acts like a likelihood for the intercepts, in a sense, and the intercepts inform the linear model, which gives you the likelihood for the data. So these are often called two-level models, because you've got two things that can be squinted at and look like likelihoods. At the top level, there's the data model, and then in the middle we've got this secondary, second-level parameter model for the intercepts. But it might be easier to think of it as the prior that gets learned. There are adjustable bits of it, and those are alpha and sigma, the parameters inside it. I'll come back to those in a second and explain them a little better.

This is called a varying intercepts model, and the alpha tanks are now varying intercepts. And I have to say something about terminology, because the terminology in this literature is a real mess. This is why you have to insist upon mathematical definitions of a model, I think, to be sure about what people are actually doing. I'm going to call these things varying intercepts, because I think that's the least bad term, but it's still not very good. I'd rather call them adaptively regularizing priors or something, but that doesn't mean anything to anybody, right? So I won't. These are also often called random intercepts; you've probably heard that as well, in random effects models. I like that way less, because what does random mean? People get so superstitious and weird about the word random that I avoid it. Random, what is random? At the scale of the world where we do our science, the only thing that's 'random' is our information. We believe in a deterministic universe at the scale of the stuff we study. In fact, I believe in a deterministic universe at the quantum level too, but there's a fight in physics about that. In biology and the social sciences, though, you don't believe anything is actually random. Random just refers to our lack of information about things. That's the Bayesian premise about using probability to encode information. So it's not clear exactly what 'random' means here, and it just gets confusing. But when you see someone talk about a random intercept model, they're talking about the same thing we're using, nearly always.
So, it's weird, because there's a tradition in experimental design of talking about random effects models, which has to do with how you designed the experiment, rather than the information content of what you want to learn. Let's just say that's not what's going on here. And that's also never the right way to decide whether to use a random effect, a varying effect. You use a varying effect because you don't want to be an amnesiac. It doesn't matter how the data were generated, in a sense: if you want better estimates and you have clusters in the data, you can use a mixed effects model, or a random effects model, or a varying intercepts model. They're all the same thing. 'Varying' is not perfect either, because even ordinary dummy variables vary across clusters, right? We saw that when we analyzed the chimpanzee actors. The whole point was: look, there's variation across the chimpanzees. There's an intercept for each of them, and look how different they are. They obviously vary. So the variation doesn't magically appear just because we use this weird prior. You just have to live with the vocabulary. Statistics is full of terrible terms, the worst being 'significant.' That word is dead to me now, for that reason. You just can't use it for anything. It's a great word, fantastic Latin root, almost magically good sounding. It would be great in poetry, but it's dead to me. It's just dead to me. And you can't do anything about it. You can't fight vocabulary; it's like spitting into the wind. Just don't do it. So I'm just going to tell you the different vocabulary.

Okay, so here's my bid. If we invent time travel, after I go back in time and kill Hitler, I would rename varying intercepts models 'mnestic' models, because they remember things. But not everybody studied Greek in college, so I'm not sure anybody cares about that naming. Mnestic means of memory: a mnestic model remembers something. I like it because it labels the other models as amnestic: they have amnesia, they forget things. It's the kind of labeling that characterizes the opposition, right? It's like saying, 'some people don't love our country,' and then you're left to imagine who those people are. Some models don't use all the data optimally. Yeah, those are the fixed effects models, and they shouldn't be the default. Okay. All right. Yes. Okay.

Let's finish explaining this model. We've got two parameters inside this weird prior, the varying intercepts prior. Remember, all we've changed is that we took the zero out and put alpha in its place, and we took the five out and put sigma there. These are the mean and standard deviation of the normal prior for each tank's intercept, and we're going to estimate them from the data. They're ordinary parameters; they'll get estimated. This is why we need the Markov chains: they do a better job here, because you can't write down a unified likelihood function for a model like this. And since they're parameters, they need priors of their own, right? Turtles all the way down. So we define weakly regularizing priors for alpha and sigma, fixed ones, of the ordinary sort. Now there's going to be a posterior distribution for alpha, sigma, and all the alpha tanks. They're all going to get updated from the data. They could end up really different from these priors, and they will. But it's important to understand what the objective here is.
The idea is that there's a population of tanks. That population has a mean alpha and a standard deviation sigma, and we can estimate them from the data. We can learn how to pool information across the tanks, and we can get better estimates for each tank. Because for some of the tanks, as you'll see, there's not a lot of data, since there weren't a lot of tadpoles in them. So stochasticity gives us, you know, measurement error in those tanks. And there are other tanks with lots of tadpoles in them. Those tanks have a lot of information, which we can pool over to the smaller tanks to improve the estimates in the smaller ones. This is what I'm going to reveal to you happens. And it's possible because we estimate the population distribution of the log-odds of survival. That's what that prior is. But it's learned from the data. Yeah, question? Is that saying that each tank will have a distribution of the log-odds of survival? No. It's saying each tank has its own intercept, but the intercepts across all the tanks come from a common distribution, a population of tanks in which survival varies. And we'd like to learn about that population at the same time we estimate survival in each tank, because doing so improves the estimates for each tank; it lets us pool. There are more slides coming where I explain what's going to happen. It's called shrinkage, and I'll explain how it works. So hang on, and we'll get there. I've only got six minutes, so I may not finish this, but that's okay; this is like a three-lecture series for me.

Okay, so here's the first kind of take-home thing. Survival across tanks has a distribution. That distribution is the prior for each tank, and that distribution also needs its own prior. This is the model that defines all those things, and it's the minimal multi-level model. You can use any likelihood at the top you like; the structure beneath it will be nearly identical for the basic case. Okay, how do we fit this? You just write it in map2stan, and it looks almost exactly the same. You swap out the zero for an alpha, or whatever you want to call it, remember, and the five for a sigma. Put the fixed priors on those two new parameters, and let it run. And I sample it to death here; you don't necessarily need to do that, but I'm trying to teach you early on to sample things to death so you can get some understanding of what's going on. And this fits no problem, because you have all finally succeeded in getting Stan installed, yay. Right? Little victories. And was that like a maybe? Oh, yes, yeah, exactly. Yeah, I am with you. Let's just say the first time I used Stan, it wasn't yet released; it was like version 0.9. I had to compile it myself, download what's called the tarball, and get it going. It was definitely worth it, but I was an early beta tester, in a sense. It's gotten a lot better, is all I'm saying. But yeah, there are still little victories involved. Here's a sketch of the fitting code.
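This is a sketch of the varying intercepts model in map2stan, in the spirit of the course code. The specific fixed priors on the two new parameters (a normal for alpha, a half-Cauchy for sigma) and the sampling settings are reasonable choices rather than a transcription of the slide.

```r
# Varying intercepts (partial pooling) model. Only the last two lines
# are new relative to the no-pooling model: the prior for each tank's
# intercept is now learned from the data.
m_varying <- map2stan(
    alist(
        surv ~ dbinom(density, p),
        logit(p) <- a_tank[tank],
        a_tank[tank] ~ dnorm(a, sigma),  # adaptive prior, estimated from the data
        a ~ dnorm(0, 1),                 # fixed prior: population mean (log-odds)
        sigma ~ dcauchy(0, 1)            # fixed prior: population standard deviation
    ),
    data = d,
    iter = 1e4, warmup = 1000            # "sample it to death"
)

# effective number of parameters (pWAIC) will come out well below the
# literal count, because the adaptive prior regularizes the intercepts
compare(m_nopool, m_varying)
```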
So the first thing I want to say about this is that if you compute WAIC for these models, the effective number of parameters is typically going to be a lot less than the literal number of parameters, because these varying effect priors are regularizing. The estimated variation is, well, whatever it is, it's not infinite. And one way you can think about the fixed model is that it assumes infinite variation across the tanks or the departments or the schools or whatever it is, because that would mean you pool nothing across them: there's no information in the population about how any of them relate to the others. That's like there being infinite variation in the population of tanks. Sigma here, I haven't shown you the estimate yet, but it's a lot less than infinity. In fact, it's pretty small; I think it comes out to be about 1.3. That corresponds to a pretty aggressive regularizing prior, much stronger than the weak one we put on the intercepts in the original model. A typical regularizing prior would have a standard deviation of about 1.0. Now the caveat is that there's a whole posterior distribution for sigma, and the estimates for all the parameters average over it, because this is full Bayesian inference. But it'll still help your thinking to look at the MAP value for sigma. So, as a consequence of that regularization, WAIC pegs the effective number of parameters at 38, instead of the 50 literal parameters in the model. If it turned out that the tanks were really similar in their survival, sigma could be really close to zero, and the effective number of parameters could be something like 12. We'll have examples like that, where multi-level models aggressively pool, because what they've learned from the data is that these things don't vary much, so they treat them as basically the same thing. And then you get really strong regularization around the mean alpha, which has also been learned from the data. If this makes just a little bit of sense right now, good; you're going to be inundated with examples, and you're going to fight with your software.

Let me show you what this population looks like. These are a bunch of Gaussian distributions that have been sampled from the posterior distribution of this model. We sample alpha and sigma values, which are correlated, you'd better believe it, and for each pair we plot a Gaussian curve here; I think this is 100 of them. You can see they do vary. There's a posterior distribution of tank populations being sampled here. These are distributions of the log-odds of survival. The mean is a little over one, and the standard deviation, I think, is about 1.3, but you can check: you'll fit this model yourself for fun, right, and see. To get the implication on the probability scale, you take these log-odds and push them through the logistic. When you do that, what you see is that these tadpoles survive quite well: in most tanks, nearly all the tadpoles survive. And that's true in the data as well. But there's this long, sloping tail of death and misery down towards zero, especially as you add predators and have small tadpoles. That's the connection: this gets translated into a population. You can imagine sampling tanks from these Gaussian distributions and then plotting the distribution of the probability of survival in those tanks. That's the distribution of the probability of survival implied for the population of tanks, a posterior predictive distribution. It makes some sense once we've learned about the population. Here's a sketch of code for these plots.
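A rough sketch of the plots just described, assuming the m_varying fit from above; the plotting ranges and the number of simulated tanks are arbitrary choices.

```r
post <- extract.samples(m_varying)

# 100 Gaussian population distributions of the log-odds of survival,
# one curve per posterior sample of (a, sigma)
plot(NULL, xlim = c(-3, 6), ylim = c(0, 0.4),
     xlab = "log-odds of survival", ylab = "density")
for (i in 1:100)
    curve(dnorm(x, post$a[i], post$sigma[i]), add = TRUE,
          col = col.alpha("black", 0.2))

# implied population distribution of the probability of survival:
# sample tanks from the population, then push through the logistic
sim_tanks <- rnorm(8000, post$a, post$sigma)
dens(logistic(sim_tanks), xlab = "probability of survival")
```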
I've got like 20 seconds here, so I'm going to put this slide up and promise that this is where we'll resume on Tuesday. I will explain why we want to do this, and what happens, by trying to explain the magical effect that arises from knowing about the population: we get something called shrinkage, which is the mnestic transfer of information across the tanks. That's my promise for next week. Your homework is up on the website. You'll have a lot of fun with it. Parks and Estimation.