For many of you, like me, this week is where the whole point of your participation materializes: multilevel models. When I was a graduate student, which was sometime last century, multilevel models were this cutting-edge thing, and only if you were some kind of quantitative hot shot in the sciences would you bother to learn them. It was considered showing off to learn and use multilevel models. That is no longer true, and I'm happy to say it is now a requisite skill for everybody doing a PhD in any science to understand multilevel models. And I'm not joking. I'm not saying that to be a hot shot or to show off. It's just true. And this has been true, in the sense of the scaffolding of statistical skills, for many generations. One of my PhD committee members, Nicholas Gordon-Jones, who's a behavioral ecologist, counseled me in my first year of grad school to learn multilevel models, because he said, look, when I was in grad school, Richard, multiple regression was fancy. If you were using multiple predictor variables, that was really showing off. You're a hot shot. Are correlations not good enough for you? Partial correlations? And then, of course, in the generation after his, everybody had to use multiple regression. You were an idiot if you didn't use multiple regression. Now, of course, you can be an idiot no matter what statistical method you use, right? It has nothing to do with what statistical method you use. But there are these standards, and every generation gets more sophisticated than the last in what the default tools are. And that's because the new defaults are incredibly useful, both in general and for individual situations. And once you get trained to use them, you learn them in a way that's simpler than the path of their discovery. That's the way technological evolution tends to work. So multilevel models are our new default. And they're not substantially more complicated than ordinary multiple regression, but it does depend upon having the right set of metaphors and frameworks to think about them. So let me work on that a bit here to start, and step away from models for just a moment.

This is a story that I start chapter 12 with in the book, and in the chapter I give you all the citations for it. Clive Wearing is a real person. He's a musicologist, quite accomplished, a very successful conductor and music theorist, and a pianist as well. And at some point in his life, he contracted a herpes virus. As those of you who know something about herpes viruses will know, depending on which tissue it infects in your body, it manifests as a very, very different disease. This particular herpes virus was the cold-sore kind. He got it in his head area, and it happened to travel into his brain and gave him the worst kind of herpes you can get, which is the brain-infection version. He suffered a serious and permanent form of brain damage from this, and he lost the ability to form new long-term memories. But I'm telling you this story for a reason. The interesting thing about this case: Clive Wearing has been intensely studied by neurologists ever since, because there are only a very small number of cases like this where you get a very specific site of damage. Those of you who are psychologists, right? You know you've read lots of books about patients like this.
And Clive Wearing is a really interesting case because he remains incredibly cognitively competent for all the memories that were formed before his disease. He can still play piano, but if you ask him whether he can play piano, he has no memory of ever learning it. He can't learn new songs, but he can play old songs. He can still conduct orchestras; that's just something he used to do for a living. But he doesn't remember doing it, and as soon as he finishes conducting the orchestra, he will have no memory of having done it. It's a very strange sort of thing, and a fascinating case. And he keeps these fascinating diaries. I'll show you an excerpt of one on this slide, where every morning when he wakes up, it's like the first time he's ever woken up, and he writes these bizarre notebook entries about finally feeling alive. Oh, this is amazing. It's like, you know, imagine the first time you ever saw a Star Wars movie and it's the first time a Death Star ever blew up, right? And then in every other movie, when another Death Star blows up, it seems less fresh. Sorry, I'm taking a jab at Star Wars here because it sucks. Inviting hate mail, I know. But no, all the movies are the same, right? There's always a Death Star, and it always blows up. But for him, it's fresh every time. It's exciting, right? Because the plot is good the first time, and it just gets old after a while. It's a very fascinating case. So this form of amnesia is called anterograde amnesia. It's the inability to form new long-term memories. Every cup of coffee Clive Wearing experiences is the first, most exciting cup of coffee he's ever had, right? Now, of course, your first cup of coffee usually tastes bad, but you get used to it after a while.

So what I want to argue is that typical multiple regression, what you might call fixed-effect regression, all the models we've done so far in this course, also has anterograde amnesia. These models have the inability to form new long-term memories when they're considering new clusters within the data. They forget everything they've learned when they move from one cluster to the next, and pretend that none of the other data are relevant. And this is deeply irrational. You do not want to be an organism that thinks like this, because you'd be like Clive Wearing, which is not a good state to be in. And you don't want to program your little golems to act like this either. So let me give you an idea of what I mean by cluster. Clusters in the data are things like individuals, ponds, I'll talk about ponds later today, roads, classrooms. And obviously classrooms are different from one another, so there are classroom-specific effects, say if you're measuring test scores or something. Or all the individuals are different from one another, so there are correlations among all the data from one individual, because individuals have different tendencies. Those are what we call clusters. Nevertheless, when you meet a new individual, after you've sampled a bunch of students in a classroom and you're told, here's another student from that same classroom, you have a prior that's been informed by the other individuals you've met. Individuals are not maximally different from one another. You get to use information. This is like when you visit a cafe: the fact that you've been to other cafes in your life gives you expectations, largely valid ones, about what will happen in that cafe, like how long you might wait for your coffee to come out. That is remembering things and using information.
Fixed-effects models do not do this. You can put in unique intercepts, so-called fixed effects, for every cluster, but only the data from that cluster informs each of those parameters. And that's bad, very bad, because you're ignoring information that would be useful. I'll build on this metaphor in a second. Multilevel models are better than this because they remember their past experiences, so to speak, and they use that to make better educated guesses about new clusters. So they learn the true values of every cluster faster, because they pool information among clusters. They have memory: they form new long-term memories and they use them to learn, in fact to learn optimally in the small world. Think back to chapter 2 and the metaphor of small worlds and large worlds: remember, Bayesian models are optimal conditional on the model, in the small world. Of course the model's wrong, so you have to be cautious about that. But this is the way the robot learns optimally, given the assumptions. The strategy multilevel models use to get this pooling of information is to assume that the properties of clusters come from some quote-unquote population. This population is a statistical population. It just means that it makes sense to use some of your estimate from one cluster when you're making a guess about the next. I'll build on this on the next slide. The inference about the population creates the pooling phenomenon that I'll show you pictures of in today's lecture. And so all the previous clusters actually improve your guess about any new cluster.

So often people ask themselves whether they should use a multilevel model, and there are lots of strange conventions for figuring this out. But I think the last one on this slide is the question you want to ask yourself. This is the decisive thing. Suppose you've visited a bunch of classrooms and you've got estimates of the average test score in each, and now you've got some new classroom. Before you see the test scores in that classroom, I ask you to make a guess about its average test score. If you think that all the other classrooms help you make that guess, then you should use a multilevel model. If you think they don't help you make the guess, then you want the fixed-effects model. I think in the real world, most of the time, you're in the former situation, not the latter, because classrooms are not all the same, but they're not completely different either. Same with individuals, ponds, roads, and other things of that sort. This is an important device, because there are all kinds of weird conventions out there: people appeal to aspects of the design of the experiment and other things. All of that is irrelevant. It's just about information. It's about whether the other clusters will improve your guess about a new cluster.

Okay. So here's the metaphor I want to use to help you think about this, and why the statistical population, and the pooling it creates, is sensible and useful for us. I want you to imagine, as I do at the beginning of chapter 12, that we've got some robot that we're programming. We want it to learn how long it should expect to wait for a cup of coffee at a cafe. It's going to visit some cafes. It's like a Roomba: you get a Roomba, you pop it open and reprogram it, and you're going to make it visit some cafes and order coffee. And it's going to record the wait times.
How should this coffee Roomba optimally use the information in the data it collects? Imagine you get two of these, for the sake of the example. You have two of these Roombas programmed this way, and they visit the same two cafes, but in opposite orders. Here I've got pictures of a cafe, and I think that's Paris. I made these slides several years ago. Does that look like Paris on the left? Yeah, it could be Paris. It's Europe. On the right, that's definitely Berlin, because it's a dirty alley. There's a cafe in a dirty alley. Sorry, the locals understand what I'm talking about. I love Berlin. Your cafe might be in a dirty alley with broken bottles. Our Roombas are going to visit both of these cafes, but in opposite orders. Roomba 1 starts in Paris, orders some coffee there, and gets that data. I mean, in Paris you can wait a long time, actually, to get the attention of the staff. And Robot 2 starts in Berlin and orders a coffee, and then they travel crossways and visit one another's cafes.

So here's the question now. If you were Robot 1, you've got your, say, 7 minutes from the Paris cafe, and you arrive in Berlin before you've ordered your coffee. What's your expectation? The fact that you waited 7 minutes gives you information. Now, you don't expect it to be exactly 7 minutes, because it's not a lot of data to go on, but it gives you information. And then you order your coffee in the Berlin alley and it takes, well, it's probably faster, if I can use my stereotypes. And then you can do better than the 7-minute guess. You update that 7-minute guess, but you start with a prior that comes from the previous cafe, and that makes sense. It's a very vague prior; it's not that you expect it to be exactly 7 minutes, there's a lot of variance around it. And then you observe, say, 3 minutes, and you update that prior. The other robot has the opposite experience. It started out with a 3-minute coffee, then it goes to Paris and expects 3 minutes, and then it takes 7. Again, it starts with a vague prior centered on 3, gets to update that, and it goes the other way.

But hang on: your inference shouldn't depend upon the order you experience the data in. So the robots need to update the old cafe when they get the new data. Robot 1 travels from Paris to Berlin; when it gets its Berlin coffee, it needs to update its estimate of the Paris cafe too, because the other robot, travelling in the other direction, updated its Paris estimate. And that's where the statistical population comes in. It's the idea that you're estimating the average wait time in the population of cafes, and the variance among them, and through that inference you get to do the time reversal of the inference here, so that the exact sequence you visit the cafes in doesn't matter, so that you can do the reverse update. You'll see how this works when I draw it up, but this is the sort of rationality constraint: the time sequence can't matter. Your robots would be irrational if the exact sequence they visited the cafes in affected their inferences. You don't want that, and the way to avoid it is to have them estimate the properties of the population of cafes and then use that as the prior for each cafe. Each cafe can be different, but you're estimating, at the same time, properties of the whole population.
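To make the order-invariance point concrete, here is a minimal sketch in R, not from the lecture: it treats both wait times as informing a single population-level mean, assumes, purely for illustration, normal wait times with a known standard deviation of 2 minutes and a vague Normal(5, 10) prior on that mean, and the helper update_mean is made up for this example.

    obs_sd <- 2   # assumed known observation noise, for illustration only

    # conjugate normal-normal update of a prior on the mean wait time
    update_mean <- function(mu, sd, y) {
        post_var <- 1 / (1 / sd^2 + 1 / obs_sd^2)
        post_mu  <- post_var * (mu / sd^2 + y / obs_sd^2)
        c(post_mu, sqrt(post_var))          # returns c(posterior mean, posterior sd)
    }

    r1 <- update_mean(5, 10, 7)             # Robot 1: Paris first (7 minutes)...
    r1 <- update_mean(r1[1], r1[2], 3)      # ...then Berlin (3 minutes)
    r2 <- update_mean(5, 10, 3)             # Robot 2: Berlin first...
    r2 <- update_mean(r2[1], r2[2], 7)      # ...then Paris
    print(r1); print(r2)                    # identical posteriors: the visiting order doesn't matter

Because Bayesian updating just multiplies likelihoods into the prior, both robots end up with the same posterior no matter which cafe they visited first; the multilevel model exploits exactly this fact at the level of the population of cafes.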
The variation in the population is going to have a very strong effect on how strong the prior is, that is, how much pooling, as we're going to call it, you do, that is, how similar the inferences will be to the fixed-effects model. So here's an example to help you think about it. Cafes are a bad example, because they're all pretty much alike, actually. I mean, the scatter in how long it takes to get a cup of coffee doesn't vary that much internationally. Having traveled the world and ordered a lot of coffee, I can attest to this from experience. You never end up waiting two hours; it doesn't vary that much. You never get it immediately, unless it's really bad coffee, I guess. Let me give you another example where there's a lot of variation, and that means the prior has a very small effect, because it's extremely vague: when you estimate the population, it's extremely flat as a distribution. I used to do fieldwork in East Africa, and one of the major challenges I experienced, and I know many of my colleagues did too, living in East Africa, is intestinal infections. You've got to keep yourself healthy, and you're eating a lot of stuff that's not perfectly clean. The solution I found was to constantly spike my food with spice, constantly, just so that my mouth was always on fire, but I was healthy. So I always had a pocket full of these peppers, what we call goat peppers. They grow all over the place, every household's got them, and I always had a pocket full of them, and I would put half of one into whatever I was eating, and that kept me healthy, because capsaicin, the chemical that makes these things spicy, is very, very strongly antibacterial and antiviral. It's a really, really good medicine. The thing about these goat peppers, though, is that peppers from the same plant can have radically different spiciness. They're not perfectly domesticated, is the way I would explain it. And so I would be very careful about tasting a particular pepper to figure out how much to put into my food, because one pepper could be very low on spice and I'd have to put the whole thing in, and another would kill me, drive me into fits of tears. So there's variation among peppers; there's a population of peppers, is the point. And when you pick up a new pepper and you're trying to guess how spicy it's going to be, the population does give you some expectation, but it doesn't tell you too much about this exact pepper, right? The multilevel model accommodates this by learning the variation as you visit the clusters, as you sample the peppers, estimating the variation among peppers, and that tunes the prior. So multilevel models work by estimating the prior. This is the first set of models where the prior distribution will be learned from the data, and that's what's multilevel about them. In effect there are two likelihood functions, but one of them is called a prior.

So I argued that multilevel models are the new default, and defaults are powerful things, and I want to help you remember this by giving an example. Some of you know these data already, about organ donation. There's a lot of variation in Europe about it. The public of basically every country that's ever been polled is largely in support of organ donation, by a majority. There are international shortages of replacement organs, and given the public support for it, there shouldn't be. There's a mismatch between the stated support and the supply, and that's because in many countries it's an opt-in system: you have to do something, check a box, to say that when you die your organs can be
donated. And in opt-in countries the supply is vastly smaller than the actual support for it: many people say they would like their organs donated when they're dead, but they never get around to checking the box. In other countries it's an opt-out policy, and then there's not a shortage. So I show you the data here. We have opt-in countries on the left, including Germany, and in Germany public support for organ donation is over 50%, it's quite high; in the book I give you a citation to the polling data. And then in the opt-out countries you see donation rates are almost uniformly high, except for Sweden, which is an interesting case: lots of people opt out in Sweden, and I don't know what the sociology of that is. Anyway, the point is, defaults are very powerful things, and so it's important to establish statistical defaults that do good things for our inference. But it's perfectly fine to opt out. There are times when a multilevel model is totally unnecessary, and that's fine; there's no shame in not using one. It's just that, as a default, it makes more sense to ask people to excuse not using one than to ask them to excuse using one, because if there are clusters, multilevel models are inferentially better than models that don't use pooling.

Okay, this is just what I said. Here's what we want to do today and on Friday. I'm going to introduce multilevel models with a slow example. The whole goal is to explain to you these mystical things called shrinkage and pooling, which arise from the way the robot forms a prior about the population. These are statistical properties, and I'll show you what they look like visually. We want shrinkage and pooling of our estimates. It makes the estimates better, and it does so for reasons you've already learned, because it trades off overfitting and underfitting: these models overfit less because they use the whole sample rather than just isolated pieces of the sample. I'm going to show you how to fit these models with map2stan. They look very much like the models you've already fit, except now the prior will have parameters inside of it rather than just fixed numbers. That's it, that's all you do: if you replace the fixed numbers in your prior with parameters, it's going to be a multilevel model. That's really basically the only thing you do. And I'll show you some methods for plotting and comparing these models. And then, probably next week, because I don't think we'll get to this on Friday: this pooling strategy isn't something you can only do with discrete clusters. You can do it with lots of structured variables. Ordered categorical variables, or perfectly continuous variables like age, can also be pooled. There's a device called a Gaussian process, which is the worst name ever, two vague terms, Gaussian process, no hints about what it means. It's a way to extend shrinkage and pooling to continuous categories like age. What that means is that ages that are similar to one another, you expect them to be more similar in the things that happen to them, so there's pooling, but it's local, and it fades as you move away from any particular point. So for example, in some areas of political science, individuals in birth cohorts tend to vote in similar ways because of historical things that happened around the time they turned 18. I'm from the Reagan kid generation in the United States, and yeah, we're bad. We're not all bad, we're wonderful people, but we're politically different from other groups. We were in the Cold War and stuff like that; we got kind of drunk on it; stuff happened. I can make fun of my own generation. But just to say that there are cohort effects, and those
cohort effects are persistent, they're lifelong effects, but they fade as you move away from any particular historical event. So there are categories here, and you want to do pooling to deal with those, random effects of age groups as it were, but those effects bleed over; the categories aren't discrete, and Gaussian processes let you deal with that. They let you do pooling on continuous categories, where there's similarity. Another way to think about it is distance. There are points in the landscape emitting things like toxins, and so proximity to those emission points affects the outcomes in different areas, but it's a continuous process, and you want to measure covariation across space. Gaussian processes model things like that as well. But it's just an extension of this pooling strategy.

Okay, that's a lot of promises. Let's ground this with some examples. When do we use multilevel models? We use them when there's some clustering in the data. Clusters are things like classrooms. You can have classrooms and you can have schools: there are students in classrooms and classrooms in schools. This is the classic example in education; people there use multilevel models quite a lot, owing to a certain British tradition. So classrooms within schools are one type of cluster, students within classrooms are another type of cluster, grades within students are another type of cluster. This is like Russian dolls, clusters within clusters within clusters, and you can do pooling on all of these at the same time inside a multilevel model. And then of course questions within exams: there's clustering in that too. Any time you've got repeat observations on the same entity, you've got clustering, and if you use a multilevel model you can make stronger, better inferences about the tendencies of each cluster. We worry about this most intensely when there's what we call imbalance in sampling, so that some students or schools are sampled more than others. Those of you who do fieldwork, this is always true in field data. For primatologists this is the norm: if you do any kind of behavioral ecology, whether it's on humans or other primates, you get massively imbalanced samples, because some individuals end up in your data way more often. Multilevel models let you deal with that. They don't bias your inference based just upon the frequency with which a particular cluster appears in the data. So in behavioral ecology that's the usual reason we appeal to these, because we have massively imbalanced data. Those of you doing laboratory experiments, congratulations, you usually don't have imbalance, but you will still get better estimates even in the absence of imbalance, because you'll have pooling between clusters. It's just a particular inferential threat when there's imbalance. We had examples earlier in the course where we could have used pooling: individuals and families in the home data, species within clades when we talked about primate milk earlier on, nations within continents when I talked about ruggedness and the economy, and applicants within departments very recently, when we talked about binomial models.

So the example I want to work with today is a new data set where I can show you the pooling. This is in the rethinking package; it's the reedfrogs data. This is a field experiment. The outcome variable is the number of surviving reed frog tadpoles, which were raised in buckets, but out in the wild. Reed frogs lay their eggs on leaves, and what the field experimenters did, if I remember the experiment right, and I hope you'll look it up and embarrass me if not, is they hung the buckets under the leaves
and the tadpoles were supposed to fall into the water, but instead they fell into a bucket. That's life. The experiment is set up so that there would be different densities, different numbers of eggs on the leaves, and some of these buckets were spiked with predators, or rather, I think they were shielded or not, so that predators could or couldn't get into them; I think that's the manipulation. What are the predators here? The thing on the right-hand side, I think that's a damselfly larva. Dragonflies and damselflies spend substantial portions of their lives as aquatic larvae, and they eat a lot of tadpoles. That's just, you know, nature. Not in Disney movies; you don't see things like this in Disney movies, but this is how it goes. So this is a field ecology experiment, and it gives us a lot of information about anti-predator strategy. Tadpoles can mob predators and do other things to defend themselves, and that's what the experiment was actually about: these density-dependent effects and how these things trade off. We're not going to go deep into the analysis of all of the treatment effects today. I just want to show you, when we leave out the treatment effects, how we can use the multilevel model to analyze variation and do pooling. What are the things we pool over, the different so-called ponds? These are the buckets the tadpoles are in. So we're going to be interested in how the number of surviving tadpoles varies across buckets. I guess here I called them tanks; they were buckets, I remember seeing a photo of this. So we have tadpoles in tanks, at different densities, and the outcome here is the number surviving. This is a binomial model, like the UC Berkeley admissions data. Think of each tadpole as being an applicant, and whether it lives or not as whether it's admitted. The model structure is the same, but now instead of departments we have tanks. Some tanks have more tadpoles, like some departments have more applications, and they have survival rates, like acceptance rates, and we want to model the variation and get the best estimates we can of the survival probability in each tank. We're going to use a multilevel model to do this.

First let me show you the fixed-effect version of this model, without the pooling, so you can get the structure in place. What this means is there's going to be a dummy variable for each tank, so we get a unique intercept for each tank, and we only use the data from that tank to estimate that parameter. This is like models we've done before. Then we'll turn to the multilevel model, and we'll use what are called varying intercepts by tank instead. So here's the fixed-effects model. The number surviving in tank i, that's s_i on the left there, is a binomially distributed variable, according to the maximum entropy considerations, and n_i is the initial density, the number of tadpoles that hatched into that tank, and p_i is the probability that any particular tadpole survives. We model this with a logit link, with an intercept on the log-odds scale. That's alpha sub tank, where tank is a number from one to the number of tanks; I think there are 48 tanks in these data, so there's alpha sub 1 through alpha sub 48. And we are going to regularize: we have a fixed prior for alpha tank, Normal(0, 5) on the log-odds scale. This is a prior centered at zero, and zero on the log-odds scale is 50%, yeah, remember this from before. And 5 is like the whole space, basically the whole log-odds space, so this is a very uninformative prior, but it makes infinite values effectively impossible, which is extremely important. But it's a fixed prior, and all of these parameters are independent of one another. Here's the code to fit
it. You guys are pros at this now, right? It looks basically the same. Here's your alpha tank: you just write the intercept with a [tank] index, and then it figures out there are 48 of those, makes a vector of 48 parameters, and estimates them. We're not going to look at those estimates yet; we'll plot them up later. Let's fit the multilevel model now. So, the previous model has regularization: the Normal(0, 5) prior on the intercepts is there to regularize inference, to reduce overfitting, just like all the other weakly regularizing priors that have been introduced in this course. But it's not adaptive, because the prior is not learned from the sample; it's fixed. You just stick it into the model and hope it's a good guess. The multilevel model instead is going to have an adaptive prior that is learned from the data, and that means it has parameters inside of it. So here's the adaptively regularizing model. The alpha tank parameters are varying intercepts. The "varying" just cues you, it's not good terminology, but it cues you to the fact that there's an adaptive prior attached. Notice the model looks very similar, but now this normal distribution isn't Normal(0, 5), it's Normal(alpha, sigma). Those are two new symbols, and they are free parameters, so we're going to have to learn them, and they get their own priors just below; alpha has a Normal(0, 1). And what is alpha? It's the average survival rate across tanks, on the log-odds scale. And what is sigma? Well, it's the variation, the standard deviation, across tanks on the log-odds scale. So Normal(alpha, sigma) is the distribution of tank survival probabilities on the log-odds scale. Yeah? You with me?

Okay, a little bit about terminology. These things are often called varying intercepts; that's the terminology I tend to favor. They're also called random intercepts. These terms can mean different things depending upon speaker and context. It's really awful, it's a terrible thing about statistics, and it makes me think applied statistics needs a total reset just to cleanse the vocabulary. So it's normal if you're confused by this. Sometimes people use random intercepts to mean something different, like the sort of thing that happens in certain kinds of ANOVA, but that's not necessarily the same thing statistically. Neither of these terms makes a lot of sense in terms of what it actually denotes. What does random mean? If you've been paying attention in this course, you've figured out that I've tried to convince you that all "random" ever means is that we don't know something; it's an epistemological state, because the universe is deterministic. That's a pre-commitment to doing science: the universe is deterministic, and random just means we don't know something that would let us predict the outcome, so we average over the stuff we don't know, and that's what creates distributions. This is true of every random number generator example we've ever had in this course, even the soccer field thing where I first generated a Gaussian distribution. If you knew everything about the physics of the coin flips, you could exactly predict the distribution of step lengths on that soccer field, because coin flips are deterministic. Everything about the physics of coin flips is deterministic; it's just that it's a chaotic system, so it's incredibly sensitive to initial conditions. That's why you can't predict coin flips: because you can't measure them precisely enough. But the physics are deterministic. There's nothing inherently random about coin flips; it's a property of us that they're random. We use them
as a device because no one can guess whether heads or tails will turn up, if you flip it right and catch it. If you let it hit the table, it's not random anymore. There might be a massive eagle on one side of the coin, for example, that will bias which side falls down. It's true, those eagle coins are biased, because the eagle is heavy. But if you catch it, it's fair, and that's because you can't predict it, because of the chaotic nature of the physics. It's not inherently random; it's a property of us that makes it random. Does that make sense? So the word "random" here isn't doing any work, because everything about this model is random from that perspective. There's tons of stuff we don't know; that's why we use distributions, to model our ignorance. But if we could perfectly measure everything, we wouldn't need distributions. That's the gambit of this way of thinking, and you're welcome to think about that all day, and that's fine. But the point is, every parameter is random, every data point is random, until we know it, until we've measured it precisely. "Varying" isn't much help either, because ordinary dummy variables also vary across clusters. In the model I just showed you before, where there wasn't an adaptive prior, the whole point of the model is that you want a unique intercept for each tank; obviously the intercept is varying. So why are these ones suddenly called varying? The answer is there's no good reason for it. This is just convention, and instead of inventing new words, I've just used the old ones, and then I give you speeches like this, so I hope you appreciate it. It's just awful. I give you a citation in the book to a great paper by Andrew Gelman on analysis of variance; in the second half of that paper he's got a list, it fills almost half a page, of all the different definitions of random effects. So, you're welcome. This is just how it goes. It's not your fault if you're confused; it means you're paying attention. What's distinctive about so-called varying intercepts, or random intercepts, is that they learn from one another, that they have memory, that they exhibit pooling. So if I could rewind time I'd do lots of things, but one of the things I'd do is erase this terminology, and maybe we'd call them mnestic, from the Greek, for memory. These are mnestic intercepts, because they remember what they've learned from other clusters. I know that's never going to catch on. What's that line from Mean Girls or something, stop trying to make it happen? If someone knows my joke, that's enough. So how does this work?
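For those who want to run these at home, here is a minimal sketch of both models in R with the rethinking package's map2stan, using the reedfrogs column names surv and density. Treat the details, in particular the half-Cauchy choice for sigma and the iteration settings, as reasonable defaults for illustration rather than an exact copy of the slides.

    library(rethinking)
    data(reedfrogs)
    d <- reedfrogs
    d$tank <- 1:nrow(d)   # one index number per tank (bucket)

    # fixed-effect version: a unique intercept per tank with a fixed Normal(0, 5) prior
    m12.1 <- map2stan(
        alist(
            surv ~ dbinom(density, p),      # number surviving out of the initial density
            logit(p) <- a_tank[tank],
            a_tank[tank] ~ dnorm(0, 5)
        ),
        data = d
    )

    # multilevel version: the prior for the tank intercepts is learned from the data
    m12.2 <- map2stan(
        alist(
            surv ~ dbinom(density, p),
            logit(p) <- a_tank[tank],
            a_tank[tank] ~ dnorm(a, sigma), # adaptive prior
            a ~ dnorm(0, 1),                # average log-odds of survival across tanks
            sigma ~ dcauchy(0, 1)           # variation among tanks (one common prior choice)
        ),
        data = d, iter = 4000, chains = 4
    )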
These mnestic intercepts here are the varying intercepts, alpha sub tank. Alpha is the mean, sigma is the standard deviation, as I said before, and we give those their own priors. And as a consequence of this, survival across tanks now has a distribution, and that distribution serves as the prior for each tank. But simultaneously the model is learning that prior; it's doing it all at the same time. You get an adaptive prior that informs the estimate for each tank. It constrains, it regularizes, the inference for each tank, because in a population where tanks don't vary much, and this is what I'm going to show you in a bit, if you get a tank that's a real outlier, it's probably an accident, and you want to regularize that outlier back towards the mean. But if instead the tanks vary a lot from one another and you get an extreme tank, then it's plausible. There's a famous example in this literature from American baseball, which I know is not the most popular sport on this continent, about batting averages: most players in professional baseball are about the same on batting average, but there are some famous historical cases of individuals who are really strong outliers, and the question is, is that a fluke, and how many seasons does it take to rule out a fluke? There's a lot of money riding on this, because you've got to recruit individuals to teams when you don't have a lot of data on them, so you use multilevel models to do that business. This is called Moneyball. They use Bayes and multilevel models to do Moneyball.

Okay, so this is what the model looks like in map2stan. It's the same as before, but now we put parameters inside the prior for the tank intercepts, and then we put priors for the new symbols, alpha and sigma, just below. You're going to have to run this with map2stan; it won't work in map. But the formula is very familiar to you at this point. And by the way, you really need to do this at home: run both of these models yourself. It brings it home, it puts the knowledge in your body, it's very important to do. Then you can look at all the estimates and play around with them, get scary warning messages, all the things that make this feel real. But skipping all those steps, let's just do the comparison, the WAIC comparison, for these two models, and I want to talk about the effective number of parameters estimate for a second. So, to remind you about WAIC: it's an estimate of the out-of-sample performance on a relative scale. It's not an absolute measure, so it's one model relative to the other, and smaller numbers are better, because it's a measure of badness. Big numbers are worse, sorry, I have an 8-year-old so my English is decaying. And this pWAIC is the so-called penalty term. It's an estimate that comes from the flexibility of the model; we often call it the effective number of parameters. Big numbers here mean that the model has the potential to fit the data more flexibly, and that's bad in some sense, because there's a trade-off: you don't want this to be 0, because then you underfit, but you don't want it to be infinity either, because then you maximally overfit. With these Bayesian models, this number is typically lower than the actual number of parameters in the model. Not always, but typically lower. Why?
Because we're using regularizing priors. So the parameter count in a model is not a pure measure of the flexibility of the model, because we use priors to constrain the parameters; they're not free. And that's good, that reduces the overfitting. But you'll notice that model 12.1, the fixed-effect model with no pooling in it, has how many parameters? It has 48 parameters, and the estimate of its flexibility is actually slightly higher here. That has to do with features of the data, because the flexibility depends upon the sample too. In this case it's about the same; this is close enough for government work, and 49 is about 48. It's a little bit higher, but nothing to get too excited about. This happens, by the way, because there were a few tanks where almost all the tadpoles died, and then you're on the edge of logit space and you can't tell what the parameter value is. It could be infinitely low, and the only thing constraining it is the prior, so you get a lot of flexibility for those cases, and that's why you can get cases like this where the estimated number of parameters is actually greater than the literal number of parameters.

The point to really focus on, though, is that to get the multilevel model we add two parameters: we go from 48 parameters to 50. But we end up with radically fewer effective degrees of freedom: it's now about 38 instead of 49. How can that be true? This is a weird case: we added parameters, but the model overfits less now. This violates everything you learn in basic statistics, right? Well, in statistics courses they don't teach Bayes, but still. There's nothing weird about this at all. The flexibility of a model, the overfitting risk, is a complex feature of the relationships among the parameters. In classical models, where all the parameters are on a single level and there are no priors, it's true that every time you add a parameter the fit to the data gets better. That was true back in chapter 6, when I first introduced that terror about overfitting to you. You should still be terrified, by the way. But it is no longer automatically true: with multilevel models you can add parameters and get a worse fit to the data but better predictions out of sample. That is the whole point of multilevel models: we add parameters to make the model more complicated, and it fits the data worse. Why does it fit the data worse? Because now the estimate for each tank is pooled to be more similar to the population of tanks, so it fits the data from that particular tank worse; it makes worse predictions in the sample as a consequence. This is exactly like when I introduced regularizing priors to you in chapter 6: as we made the priors stronger and stronger, the model fit the sample worse and worse, but the predictions out of sample got better and better. Remember that slide?
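For reference, the comparison just described can be run with the rethinking package's compare function; a minimal sketch, assuming the m12.1 and m12.2 fits from the earlier code:

    # reports WAIC (smaller is better) and pWAIC, the effective number of parameters
    compare(m12.1, m12.2)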
That's exactly what happens here. The prior is doing the same thing, but now we're learning it from the data, in a multilevel way. That's what's awesome about these models, and yes, awesome is a technical term. No guarantees, of course: if the model is badly structured, there are all the usual threats to causal inference that should keep you up at night. But this is doing better than a model that doesn't do pooling.

So here's what the estimates actually look like, that pooling phenomenon I've so far only tried to explain to you with words. What I'm showing you here are all 48 tanks from the reedfrog data. On the horizontal axis we have the tank's index number; there are 48 tanks, numbered 1 to 48, and they're in three groups by the initial densities. There were small tanks, medium tanks, and large tanks that varied in the number of tadpoles initially. Sorry, I forget how many that is, but I think it's like 10, 20, 40; it's in the data set, so when you run this yourself you can take a look. On the vertical axis we have the proportion that survived. I've taken the counts and converted them to proportions so that we can view all the tanks on the same scatterplot. With me so far? Now, the points: there are dark filled points and open points. The filled points you can think of as the raw data, the fixed-effect estimates. These are the estimates that come out of the first model, model 12.1, the model that doesn't do any pooling; the prior is sufficiently flat that there's no regularization towards the mean going on there. So you can think of these as if you took just the data from each tank and calculated a raw proportion surviving; that's the dark dot for each tank. You with me? And the open points are the multilevel estimates for each tank. The dashed horizontal line on this slide is alpha: it's the estimated average survival probability across tanks, which is quite high. Good news for tadpoles, a lot of them survived in this experiment, probably because buckets are safer than real ponds.

But regardless of that, what I want you to see now is that the raw mean, which I've also drawn up here, is different from the population mean. If we just pooled all the tadpoles together across the buckets, imagine emptying all the buckets into one common giant pond and just counting up the proportion surviving in the whole population of tadpoles, it's different from the dashed horizontal line. And the reason is that there's variation among the tanks, and there's imbalance in the data set because of the experimental design: there are more tadpoles in the big tanks. But if you're trying to estimate the average survival across tanks, you don't want to let those big tanks dominate the estimate, right?
It's easy to think about if you imagine you did the experiment with one bucket that had a million tadpoles in it, and then you pooled the data: all of your estimate would come from that one bucket. Yeah, that's bad. So multilevel models fix that problem for you. The raw mean there is not the best estimate of the average survival across tanks; that is the dashed line, which is the parameter alpha from the model. So let me walk through the different categories now and show you the pooling phenomenon.

Let's start with the small tanks. What I want you to see is that all of the open circles are displaced from the filled circles towards the dashed line. Right, so for the tanks where the filled circle is above the dashed line, the open circle is below it, closer to the line, and the inverse when you're below the dashed line. Yeah? You with me? This is what we call shrinkage: all of the estimates are shrunk towards the mean, like you just shrank everything. But it's not uniform shrinkage. The distance that the points are displaced from the raw estimate is proportional to how far the raw estimate is from the mean. So for the tanks that were really extreme relative to the mean, the estimates move more, because the model is more skeptical: it thinks that sampling variation is what created that, because that's a very unlikely kind of bucket. Very few buckets are like that, with these really extreme, perfect survival rates. You'll notice that up there, there are three small tanks where every tadpole lived. But is the best estimate of survival in those tanks 100%? If you could run a multiverse experiment where you replicated those exact buckets again, it could turn out differently. It wouldn't, because it's deterministic, but bear with me: same conditions, but there are micro-conditions that would vary. You wouldn't expect them all to survive again. It's a fluke; there's some mortality possible in those tanks. And likewise on the other end, for the small tanks where more than half of the tadpoles died, a fairly unlikely outcome, those get shrunk upwards. Those buckets probably aren't that bad on average, so it makes sense. The middle tanks are just intermediate between small and large, so I'll jump to the large tanks to get the whole lesson in. We see the same shrinkage phenomenon, but one of the things that's happening now is that there's a lot less shrinkage. Notice that the open points are all closer to the raw estimates, for every bucket. Why? Because there's more data per bucket. I think there are four times as many tadpoles in the large tanks as in the small tanks, and so the data from each tank has more weight in the inference for that tank. Again, you can do thought experiments and think about the rationality of this. Imagine a tank with one tadpole in it. Now most of your estimate for that tank will come from the population, not from the one tadpole, right? Your best estimate of the mortality in that bucket isn't the one tadpole's life history, it's the population of tanks. In contrast, if you think about our bucket with a million tadpoles in it again, the pooling has essentially no effect on your estimate for that bucket, because you've got so many tadpoles in that bucket that you know what the mortality rate is in that bucket. Does that make sense?
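If you want to reproduce a plot along these lines at home, here is a minimal sketch, assuming the m12.2 fit from the earlier code and the reedfrogs column propsurv; the layout choices, like where the dividers between tank sizes sit, are just one way to draw it and not necessarily identical to the slide.

    post <- extract.samples(m12.2)
    d$propsurv.est <- logistic(apply(post$a_tank, 2, median))   # posterior median survival per tank

    # filled points: raw proportions surviving; open points: multilevel (pooled) estimates
    plot(d$propsurv, ylim = c(0, 1), pch = 16, col = rangi2,
         xlab = "tank", ylab = "proportion surviving")
    points(d$propsurv.est)
    abline(h = logistic(median(post$a)), lty = 2)   # dashed line: estimated population average
    abline(v = 16.5); abline(v = 32.5)              # dividers between small, medium, large tanks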
And the model handles this. This is the wonderful thing about Bayes: you don't have to intuit that you need this. You set up the assumptions, and then logic gives you the right answer. The great thing about Bayes is you don't have to be clever, you just have to set up the assumptions; you don't have to intuit what the consequences of the assumptions are. That's the point of Bayesian inference: it is the motor that figures out the implications of your assumptions. You just make the assumptions. Now, if the implications of the assumptions are insane, that tells you that something's wrong with your assumptions. That's also a nice bonus. You don't have to simply obey the consequences; you can say, oh wow, I set up the model badly, I can tell because the conclusions are clearly crazy. That happens to me every day. It's another feature of this. This is why we use logic, and I've said this before earlier in the course, but let me repeat the sermon: Bayesian inference is an extension of ordinary true/false logic, of truth tables, an extension to continuous plausibilities, and you want to use it the same way. It's garbage in, garbage out. If you make bad assumptions, you get bad inferences. But that's often a way to learn: you learn about the consequences of your assumptions from these things. But you don't have to obey them. It's a golem.

Okay, so we're actually on schedule; this is exactly what I hoped to get to. A point of warning. There are many points of warning about multilevel models, but in my experience one of the most important concerns the alpha parameter in this model, the average survival rate across tanks. It's not the same as the alpha in a model that has no clustering. It's a different parameter. Don't be fooled by the fact that it has the same name; the name is just something you give to it, the machine doesn't care. It has a different meaning: it's now the average of a population. And so here's the typical phenomenon. Imagine you fit a model that ignores the clusters entirely. It's like the model where you pour all the tadpoles into a single bucket, and then you estimate the average rate of survival on the log-odds scale. You get an alpha, and you'll estimate it incredibly precisely. If we do that in this case, we get, as shown on the bottom here, this very peaked distribution, and that's just the average survival probability across all the tanks, treating them all as if they were identical. In the multilevel model we also have a parameter called alpha, but now it means something different: now it's the pooling estimator. There's this statistical population of tanks, and we're estimating its mean. It's a different question. The question in the first model is: if we assume all the buckets are identical in their probabilities of survival, what's the average?
And in the second, we're saying tanks vary; in the population of varying tanks, what's the typical tank like? And the consequence of asking a different question is that there's a lot more uncertainty about it. So the posterior distribution for alpha in the varying-intercept model is much vaguer. And this is incredibly normal; this is the usual outcome. If you start with a model that has no varying effects in it at all, no varying intercepts, and you fit that, and then you fit a model with varying intercepts, there will be parameters that have the same name, and you think of them as the average effect, but they mean different things, and in the multilevel model the estimate is much vaguer. Errors in inference arise from this, because suddenly people will say, if this were a slope, which we'll do starting on Friday and next week, people will suddenly say, well, now it's not significant, because the average effect crosses zero. Now, you should never make that kind of inference anyway, but it's also just not true. Why is this vaguer? The reason is that now there are many combinations of the specific varying-intercept estimates and alpha and sigma that give you the same predictions, so when you look at the marginal posterior for any one of those parameters, there's a lot of uncertainty about it. But the combinations of them, the sums, are actually determined with the same precision as in the original model. This is the problem again with the tide prediction engine: the predictions are what you want to look at, not the gears. Yeah, I'll never get tired of that metaphor, no apology. But it's this thing: parameters will mislead you. You need to look at the posterior predictions of your model to understand what it actually thinks. Okay, sorry, I know that's my 300th sermon on these things.

Okay, let me sum up now, in the last few minutes here, the phenomena we've been talking about and the names they go by. Shrinkage is this phenomenon of the migration, if you will, of the multilevel estimates towards the estimated population mean. The further the raw estimate, the tendency of that cluster, is from the estimated mean, the more shrinkage you get, because it's less plausible according to the distributional assumptions you've made. The fewer data you have in a particular cluster, the more shrinkage you get, because now the population has more information that's relevant than the cluster does. The more data you have in a particular cluster, the more the population gets effectively ignored. But it works just like all Bayesian inference: you're starting with a prior and then you're updating it. The more data you have for that particular case, the more you update the prior, and eventually the prior gets washed out. It's just like it's always been, from the first example with globe tossing. It's the same as regression to the mean, at a higher level. When you run an ordinary OLS regression, you don't make the prediction for every case be exactly the outcome from that case; you want the predictions to shrink towards the population mean, and that's called regression to the mean, and it makes better predictions. It's exactly the same phenomenon, because each case has some random variation which is not a property of the actual entity itself, but just of the sampling. We're doing regression towards the mean at a second level of inference, not just at the individual outcomes but at the tendencies, the average tendencies, of each unit in the data. But it's the same concept as regression to the mean that you've been using all along. So it's
not far-fetched at all. Shrinkage is what we see happen to the estimates. The phenomenon behind it is called pooling, pooling of information: shrinkage arises from pooling, and pooling is what we want. So I like this one, it's one of my favorite points: shrinkage arises from pooling, and each tank informs the estimates of the other tanks. What's being pooled here is information, among tanks. It's like the coffee robot moving around among cafes: it pools information among cafes. The model doesn't have amnesia, as a result of the pooling, as it moves from cluster to cluster. The pooling, as I tried to show you in these examples, is influenced by the amount of data in each cluster and by the amount of variation among clusters; the sigma is estimated as well. If the clusters vary a lot, you don't get much pooling. Why? Because the model won't be skeptical of an extreme cluster; it'll be plausible, and then nothing's extreme, a cluster can be anything it wants. If instead there's very little variation among the clusters, then there'll be a lot more pooling, because then the model infers, hopefully correctly, that the variation you've observed is largely a consequence of sampling and not of the regular features of the data-generating process. Does this make sense? All of these benefits arise from everything we learned in chapter 6, the trade-off between underfitting and overfitting. We're regularizing here, but we're learning the amount of regularization we need, instead of having to guess it or get it from scientific principles, which is the best way to do it when you can, obviously. But you can also learn it from the data if you have a multilevel structure, and that's what's going on here: you're learning the prior so that you can regularize your inferences, and you're learning that prior from the data. The reason the estimates are better is that they're trading off underfitting and overfitting. So when you come back on Friday, I will pick up exactly here, and I'll try to connect this tadpole example to the underfitting-overfitting trade-off, so you understand that. And then we'll do another example where we have even more varying intercepts, so that you can see that we can have all kinds of clusters in the same model, as many as you like, because that's normal: in experiments you've got all kinds of clusters, there's experimental block and there's treatment and a bunch of other stuff, and they all happen at the same time in a rich inferential mix, and you want to be able to have them all together. So I'm going to show you an example of that on Friday. Thank you for your indulgence, and stay dry.