Welcome back everybody. We have reached the point in the course that some of you have been waiting for, where we actually talk about multi-level models. The way I'd like to justify multi-level modeling is with the basic principle that it is better to remember things than not to. And so let me tell you this memory anchor story to help you understand. There's a musicologist, I think he's about 80 years old now, named Clive Wearing. In 1985 he got a herpes infection in his brain. Some of you know herpes is an odd virus: it manifests as a different disease depending upon which tissue it infects, but it's all the same virus. If it gets into your brain, it causes brain damage in whatever part of your brain it eats. In Wearing's case it ate large parts of his hippocampus, and he developed a kind of amnesia called anterograde amnesia. He has his old memories; he didn't lose his memory. He has lost the ability to form new long-term memories. He's a very interesting clinical case. His spouse wrote a whole book about living with him and how strange it is, but also about how much she loves him. It's a very touching book. And he has these amazing notebooks; there's an excerpt from one in which he writes his short-term memory experiences. Every time he has a cup of coffee now, it's like the first time he's ever had a cup of coffee. Which means it's terrible, because the first cup of coffee is always bad; it's only later, when the addiction kicks in, that you like it. It's a truly fascinating clinical case: he can't remember what happened one minute ago, but he lives very intensely in the current moment. It's an odd case.

What I want you to consider is that basically most of the statistical models we have considered in this course up to this point are like this. They have anterograde amnesia. They forget things about the data set as they move from one case to another. In statistical language, the models we've considered so far have all been fixed effects models. This is not great terminology, but today I'll try to give you a sense of why it's called that. These models have amnesia in the sense that every time you move to a new cluster in the data set, like an individual, or a pond (we'll talk about ponds today), or a chimpanzee, or an experimental block, the model forgets everything it's already seen about the previous clusters it's visited; it has no ability to form new memories. This is bad in the same sense that it is bad for Clive Wearing not to be able to form new memories. Learning depends upon not being a slave to our past experience, but obviously on using it: it develops expectations, and that helps us learn. Multi-level models are models that remember. It's not that they treat every cluster as the same, but they develop expectations about the whole population of clusters in the data by visiting the clusters. They learn in a way that is actually invariant to the order in which they might visit them, and that is the right thing to do. It is demonstrably the optimal way to learn.

Let me try to give you some metaphors to latch on to here, and then we'll do the formal modeling version of this. Imagine you're visiting some cafes. You travel around various European countries and you always end up in a cafe somehow. The cafe experience is largely the same; I know every country has things that are special in some way. Here I'm going to contrast two different European countries and their different cafe styles. You may recognize this already. On the left we have Paris, a typical tourist cafe in Paris.
On the right, that's Berlin. You can tell because it's a dirty alley and a tiny table. I'm definitely on the right-hand side here. That's me. This is it. These are different cafe experiences, but you get your order of coffee in both places. Let's focus on one particular aspect of ordering coffee at a cafe, and that is how long you wait for your coffee to come. If you've never been to a cafe before, you have no expectation at all about how long it would take. Say you're in Paris, you visit your first cafe, and it takes five minutes to get your coffee. Now when you go to Berlin and you order a coffee, you don't forget that experience, but you also don't think it's going to be exactly the same. You have an expectation that is informed by your prior experience. This is what I mean by a model remembering. It treats the cafes as a population, and you can transfer information among the units in that population to learn more effectively about them. This is going to be really useful for developing better estimates of every cafe. By better, I mean more predictively accurate.

The other thing about this metaphor to keep in mind: think of it this way. You're in Paris, five-minute cup of coffee; you go to Berlin and it takes what, seven, eight, ten minutes, half an hour, something like that, and it depends. You had your five-minute prior, so to speak, for the Berlin cafe. You update that with Bayesian updating when you actually receive your cup of coffee, but the time order here should be irrelevant to your learning. Now you need to update Paris too, because you have a limited sample from Paris and you've gotten data in Berlin. You won't treat them as exactly the same, but you should say: if there are average waiting times to get a cup of coffee in these two cafes, data from both are relevant for updating both of them, because they come from a common population of things which we call cafes and they have features in common. Does this make some sense? We're going to do this statistically today, but I want you to get this idea in place, and also the time invariance: it shouldn't matter in what order you visit them; you have to update it all simultaneously. This is what machines are good for.

How much information you transfer across units like cafes depends critically upon how variable they are, and you need to learn this variance as you experience the different units. So let me use a different metaphor than cafes for a moment. I used to do field work in East Africa, and I know there are some people in the room here who have also, and one of the strategies that I had for not getting intestinal infections while in the field was to constantly eat spicy food. I had a pocket full of these things everywhere I went. Oh Jeff, yeah, this is the strategy. These are goat peppers; they're called pilipili mbuzi in Swahili, and they're great. You buy them for pennies in any market, they're everywhere, and they do fight infection; they kill germs in your gut. Really, I recommend this: as you travel the world, eat spicy food. However, these peppers are not completely domesticated, in my experience. An unpredictable and exciting thing about them is that their spiciness is quite random. In any particular handful of peppers you might get in the market, one of them could be a dud, where you need the whole pepper to taste anything, and then the next one will kill you. This is always exciting, and so I would take a thin slice off of one and try it before I dumped the whole thing in my food.
Peppers are highly variable. So when you're learning the spiciness of this particular species of pepper, or say one particular plant, you can use your expectation from the whole population, but because different plants are so variable in how spicy they are, it's very hard to transfer information, because the population is just so variable. It'd be as if, with cafes, some cafes give you your coffee basically instantly and other cafes make you wait half an hour, so no particular cafe experience tells you much about any other cafe in a population like that. If instead cafes are basically all the same, your waiting times are between one and five minutes, then data from any cafe helps you estimate any other cafe more precisely, because they're all really the same. Does this make sense? But you have to learn this variation. It's another thing we have to learn within the statistical model.

To pull back a bit, just for a second here: in this course, what I'm trying to get across is that we've got to wage our statistical battles on two fronts, always at the same time. We've got to worry about the causal inference issue, which is not purely a statistical issue; it's a framing problem that our statistical models are embedded in. This is what I call: avoid causal salad. What is causal salad? Causal salad is the typical way causal inference is done in the sciences. You've got a bunch of factors, and you toss them in a bowl, add them as predictors to the model, see if some coefficients change, and then tell a story. That's causal salad, and it's a disaster. A waste of public funds; it's just a disaster. We can do a lot better than that. Having a DAG you believe in, or at least one you will playfully consider for the moment, is a small victory, but it isn't the only thing. You've still got real statistical battles as well. Estimation is no joke, and getting precise estimates is a whole separate set of technologies in addition to the DAG. These are different things we have to deal with at the same time. And so today we're going to be talking about the second front here, the estimation. If you have a DAG you believe in, it implies a bunch of functions you have to find, to estimate from the data, and that's its own set of challenges. And if we can use the data in more powerful ways, then that's what we'd like to do, to get more precise causal estimates. That's what today's lecture is about. In fact, this whole week and next week are about ways to use the same data sets to get more precise estimates.

What I want to try to convince you of is that there are some really good defaults. There are lots of good choices; there's no single approach that always works well, but there are some really nice defaults, which are unfortunately not currently the defaults. And the default I'm going to argue for is that you should always, as a default, use multi-level regression. The standard default, single-level regression, is a bad choice in nearly every case. There are cases where it's fine, where the answers will be the same between the multi-level and single-level models; that's nice. But in that case, there's no harm in using the multi-level model. And there is a much wider range of cases where multi-level is demonstrably better. So it deserves to be our default. What are you looking at on the screen here? It's my favorite example of defaults.
Organ donation: as some of you know, in some countries you are by default an organ donor once you reach adulthood, automatically, for example when you register to vote; there are different mechanisms in different places. In other countries you have to opt in. So what I'm showing you here are different countries in Europe. The ones colored in blue are the so-called opt-out countries, countries where everybody is by default an organ donor unless they fill out some form that says they want to keep their organs when they die. In the other countries, including the one we're currently in here in Germany, you have to opt in to give your kidneys. The bars are showing you the organ donation consent percentage that actually results from this. And these don't reflect attitudes: if you just ask people in Germany whether they should donate their organs, the vast majority say yes. But only a tiny percentage actually legally consent, because defaults are powerful things. And so in statistics, as with organ donation, you want to think hard, collectively, about what the good default should be. Multi-level models are like opting out; they're like donating your kidneys by default. There are plenty of good reasons maybe to keep your kidneys. This metaphor is falling apart, sorry. There are always reasons not to use the default, but still, defaults are important. So I'm going to argue, as I just said, that we want to use multi-level models as our default, because typically they're better. And of course you can find situations where you can justify not using them, but that's not the point. The point is to have a better default.

Here's what I'm going to get across today and on Friday this week. I'll introduce you to multi-level models: what they're about, how shrinkage and pooling work, which are these funny terms that arise from the way multi-level models work, and why these things are good. Shrinkage sounds bad, right? But it's actually a very good thing in statistics. I'm going to show you how to do this using our ulam tool, and show you how to plot and compare these models. Going forward, in the weeks after, this will open up a bunch of new kinds of model types we can do. There are lots of models which are really just some kind of multi-level model, or we can glue together different bits of models at different levels and do lots of fun things. So for example, factor models are a kind of multi-level model. Lots of things are multi-level models, and this opens it all up. As I say, it's just turtles all the way down, right? It's just parameters all the way down. You can make a model of any particular parameter, and then you've got a model inside a model, and that's what multi-level models are like.

Okay, so what are they for? As applied scientists here, and as scientists we're applied statisticians, we're interested in what we get from these things. What have they done for us lately? Multi-level models are useful because they help us deal with clustering in our data set. That's one way to recognize what they're for. So for example, you have classrooms within schools, students within classrooms, grades within students, questions within exams. In a single data set on educational tests, you could have a bunch of different levels nested within one another, and repeat observations at each of those levels, and these are like cafes or chili peppers or any number of other things. There's a population of items, and they have variation.
You have a finite sample of each. You can learn more effectively about each item if you pool information across the items. This is especially important when there's something called imbalance in sampling: some of the clusters, some of the classrooms or students or cafes, have been visited more than others, and you don't want that imbalance to let the most-visited items dominate inference. You want to treat the clusters fairly, rather than treating each one completely separately, and these models will let you deal with that. In biology, there's this term pseudo-replication, which I'm not a big fan of, but this is this terror that you're taught to avoid when you're a young biologist. These models handle that; you don't have to worry about pseudo-replication. There's a great paper actually called something like "pseudo-replication is a pseudo-problem," something like that. Anyway, just use multi-level models, and you'll be fine.

Okay, so we already had examples in this course. There were individuals within families, there are species within clades, primate species within clades, nations within continents, applicants within departments. All of these things are clusters, and because these clusters are members of a population, they have some properties in common.

So let me give you a data example now. Here's another data set that's built into rethinking. This is a reed frog tadpole data set. Where these data come from: this is a quasi-experimental, it's a field experiment with reed frog tadpoles, and what was done is eggs were suspended on leaves above buckets, which I'm going to call ponds. I just have to think the bucket is a pond. When the eggs hatch, in the natural ecology of a reed frog, the tadpoles then fall down into the water below, where they swim around and eat and grow into happy frogs. The cycle of life; it's beautiful. In this experiment they fall into the bucket, and they're later released. Why? Because the buckets are ways to create little microcosms where you can study the effects of different densities, the presence and absence of predators, food supply, and other things, to understand the growth and development of these tadpoles. It's a cool experimental data set. If you look at the data set, you have variables about the counts. The outcome we're going to be interested in is the number of surviving tadpoles in each little bucket, where we call them ponds, in each pond. There are a bunch of explanatory variables like density, the size of the tadpoles (some are bigger, some are smaller), and the presence and absence of predators, like this insect larva here chowing down on the tadpole. I'm not sure what that is, a damselfly or dragonfly larva; someone will tell me after they listen to the lecture. We're going to focus for now not on the predictor variables but just on the variation, and find a credible way to characterize the variation among tanks.

Here's the structure of the data set. We've got tadpoles in tanks, buckets; we'll call them tanks. They're at different densities; there are different numbers of hatched tadpoles in each, so there's a different maximum number that can survive in each tank. The outcome we're going to predict is the number surviving. As you can probably guess, this will be a binomial regression, because we've got a certain number of tadpoles that start and a certain number at the end, which could be the same or smaller. We're going to fit two basic models to this: a model with an index (dummy) variable for each tank, the kind of model we've done before, where there's an index for each tank and we fit an intercept for each, and a multi-level model with varying intercepts. I'll show you what the difference is.
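If you want to follow along in R, loading the data looks something like this (a minimal sketch; the reedfrogs data ships with the rethinking package, and the column names in the comment are what that data set uses):

```r
library(rethinking)
data(reedfrogs)
d <- reedfrogs
str(d)
# 48 tanks, with columns: density (starting tadpoles), pred, size,
# surv (number surviving), and propsurv (raw proportion surviving)
```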
So here's the kind of model we've done before. We have an index variable for the tank, from one to however many tanks there are in this data set, all of them. So there's a different alpha for each tank, and each alpha is assigned the same Normal(0, 1.5) prior. This is a regularizing prior, because it's not flat, so extreme mortality or survival probabilities will be viewed skeptically by this model because of that 1.5. Remember, in log-odds (you're trained on log-odds now), a log-odds of four is "always" and a log-odds of minus four is "never." This is a fine model, but it's not a multi-level model. It has amnesia as you move from tank to tank. The only data in the data set that inform each alpha are the data for that tank only, and it ignores all the others. This is the anterograde amnesia, right? So the model learns about tank one; there are seven tadpoles in it; using those seven tadpoles, it estimates that alpha. Not a lot of data, right? It moves to the next tank and forgets all about those tadpoles and what it learned about tadpole mortality there. It's as if it had never seen a tadpole before: oh, a tadpole, how interesting. Then it counts them and how many survive, estimates the next alpha, and then the third one: oh look, tadpoles, right? And over and over again. This model has amnesia. We can do better than this.

Here's how we fit this model, though, just to remind you of the code. Look down at the bottom. At the top, I just set up the tank variable from one to the number of rows in the data set; each row is a tank. Set up a data list, and pass this into ulam. You've seen all this before, right? No surprises.

Now let's try something different. I've added some stuff to this model, and I've colored it blue; hopefully that's helpful. Let me step through the new bits piece by piece. Alpha J is still there, but what I've done is make it into this magical thing called varying intercepts. And I've done this by inserting parameters inside the prior. So where there used to be a zero and a 1.5, there are now two new parameters. There's this alpha with a line over it. When you see a line over a parameter, you should say the word "bar," right? Because it's a bar. That means an average; it usually indicates there's some average. This is a parameter by itself. It's not an average of the other parameters; it's a parameter in and of itself that we're going to estimate. It's the mean alpha in a statistical population that we haven't observed, but that we sampled tanks from. So, alpha bar. And then sigma, which is the standard deviation in this population. So each alpha J, where J is a tank, has this prior with mean alpha bar and standard deviation sigma. And then for these new parameters, we have to give them priors, so it's priors inside of priors. We give alpha bar our standard logit-scale regularizing prior, Normal(0, 1.5), and sigma gets an Exponential(1) regularizing prior.

Why would we do this? What are varying intercepts? These are also called random intercepts. Those of you who've spent much time with me know I dislike the word "random" intensely, because it's just a mind trick. What does random mean? It means you can't predict it. There are no random physical processes that we study.
I prefer the term varying intercepts, but that doesn't really tell you what they're about either, because the other intercepts also vary; why are these suddenly "varying"? This is just terminology, and you just have to learn it. What's distinctive about this model, the fact that there are parameters inside the prior, is that it means you learn the prior from the data. I'll say that again. What's distinctive about this version of the model, the fact that there are parameters in the prior, is that now we're going to learn the prior from the data. So we're going to regularize. We like regularization; it gives us better predictions. But before, you just had to make up how much regularization you were going to do. Now we're going to learn that amount of regularization from the data set itself, and this is like visiting the cafes: by looking at the variation across them, you figure out how variable they are. As you're learning the variation among them, you pool more or less information across them, because basically you're learning whether they're all the same or not. That does the pooling. You're learning the prior as you go, at the same time.

Here's my trying-to-make-fetch-happen moment again at the bottom of this slide, where I fail to create new terminology. If the other model has amnesia, then this is an anamnestic model, a model that remembers. Yeah, I know, fetch is not going to happen. But still, it'd be nice if we could go back in time and change terminology; we can't. So here's what I just told you. We've got alpha sub j, which is normal now with a parameter for its mean and a parameter for its standard deviation, and we get posterior distributions for those parameters from the data. And then each of those has a prior. The prior of a prior, a prior for a parameter that's inside a prior, is called a hyper-prior. I know. Like I said, it's turtles all the way down. It's just parameters inside parameters. They're all just priors, and you already know how they behave. It's just that they're feeding upward now, and there are multiple levels of inference going on at the same time. And that's why we call these multi-level models: there are two levels in the model. You'll notice that the line for alpha j looks like the top line, right? You've got some variable on the left, and it has a distribution which is a function of other variables, just like the top level. So this is why it's a two-level model now. The fact that the alpha j's aren't observed is irrelevant; Bayesian inference works the same on observed and unobserved variables.

Okay. The way to read this in English would be: the survival across tanks has some distribution, and this distribution is the prior for each tank. And that distribution has its own prior, which is the alpha bar and sigma priors. In code, you can just modify the previous model to get it. So here's the multi-level model version. You just write what's up there into ulam, as literally as you like. Again, the logit line stays exactly the same. Then for the alpha tank line, you just stick symbols in where there used to be numbers; so now we have a-bar and sigma. And then you write the two priors, for a-bar and for sigma. And that's it. And then it works. The Markov chain will obey this model just like it obeys the other one, and it will deal with both levels simultaneously.
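For reference, here is roughly what those two models look like in code, along the lines of the chapter's examples (a sketch, assuming the reedfrogs data loaded above; the names m13.1 and m13.2 follow the book's numbering, and log_lik=TRUE is there so we can compute WAIC in a moment):

```r
d$tank <- 1:nrow(d)   # one intercept index per row/tank
dat <- list(S = d$surv, N = d$density, tank = d$tank)

# fixed-effect ("amnesia") model: one alpha per tank, fixed Normal(0, 1.5) prior
m13.1 <- ulam(
    alist(
        S ~ dbinom(N, p),
        logit(p) <- a[tank],
        a[tank] ~ dnorm(0, 1.5)
    ), data = dat, chains = 4, log_lik = TRUE)

# multi-level model: varying intercepts with a learned prior (a_bar, sigma)
m13.2 <- ulam(
    alist(
        S ~ dbinom(N, p),
        logit(p) <- a[tank],
        a[tank] ~ dnorm(a_bar, sigma),
        a_bar ~ dnorm(0, 1.5),
        sigma ~ dexp(1)
    ), data = dat, chains = 4, log_lik = TRUE)
```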
Does this make sense? For the moment, good. There are faces of concentration in the audience, which is always flattering. Good. I appreciate that this is weird stuff. I think one of the challenges in learning multi-level models is that there are the changes to the model, like what you see on the slide, and then there's all the terminology. Suddenly there are all these new terms, and they don't seem to make much sense. The terms just come from history, and there's nothing I can do about them; I have to teach them to you. I'd be betraying your trust if I didn't teach you all this terminology. Nevertheless, the terminology is not great. Understanding the mechanics of how the model works and the justification for the structure, that's enough. You can let the terminology slide and just ask people what exactly they mean when they use a term like "random effects." Say: show me the model. Don't just tell me it's a random effects model; show me the model. So if you focus on understanding the models, you can just ask people to show you their models, and then you're okay. That's always been my strategy. If you look in the literature, Andrew Gelman has a great paper on analysis of variance where, about midway through the paper, he's got this whole page of different, mutually incompatible definitions of random effects. So stats, it's wild out there, right? But if you focus on the model structure, you really can understand things. If you focus on terminology, you get confused very fast.

All right, we've got these two models now: the multi-level tadpole model and the plain old regularizing fixed-effect tadpole model, as it would be called. Let's compare them. We're not doing this model comparison in order to select one; instead, I want to show you something about flexibility in random effects. So 13.1 is the fixed-effects model and 13.2 is the multi-level model that was on the previous slide. What you see here is that in terms of WAIC they're very similar; there's not a big difference. The multi-level one is a little bit better. Not hugely better, but a little bit better. But look at the effective number of parameters. How many parameters do these models actually have? Model 13.1, the fixed-effect model, has 48 parameters, because there are 48 tanks in the data set and there's one alpha for each. That's it; those are all the parameters in that model. Each of those parameters is informed only by the data linked to that tank; that's what makes it a fixed-effect model. It ends up with 25 (24.8, which we're going to call 25 in my course) effective parameters. Why? Because the prior is regularizing; it wasn't flat. There are tanks in which all the tadpoles survived. What's the log-odds of that? It's infinite. On the log-odds scale, the survival rate was infinite in that tank. But because you have a regularizing prior, you don't end up with an alpha of infinity. That's good; that's a necessary thing in logit models. So there's a regularizing force: even though it was a fixed prior, you still got regularization, and the effective number of parameters is less than the literal number. That's just what we learned about back when I did all that Odysseus stuff, the weird Greek metaphors, all that. What's happening with 13.2 is that we've added two more parameters. This model has 50 parameters, because we added alpha bar and sigma. But the effective number of parameters went down. It actually has fewer effective parameters than the other model.
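On the code side, the comparison on the slide comes from something like this (a sketch; compare() is the rethinking function we've used before, and its pWAIC column is the effective number of parameters I'm talking about):

```r
# WAIC comparison of the fixed-effect and multi-level tadpole models
compare(m13.1, m13.2)
# look at the WAIC column (similar for both models) and the pWAIC column:
# the multi-level model has more literal parameters but fewer effective ones
```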
There was a hint of this back when I taught you about overfitting, in chapter 7, weeks ago. There was this caveat where I said that every time you add a parameter to a model, the fit to the sample improves, unless you have a multi-level model. Right? What multi-level models do is change the structure among the parameters. Now, when we add parameters, we can actually get a worse fit to the sample but a better expected out-of-sample performance, because the extra parameters are regularizing devices: they learn how regular to be from the data itself. So in this case we add two parameters, but we shed three effective parameters, because the learned regularizing prior ends up being narrower, from understanding the population. Is this cool? Maybe you don't understand it exactly, but I think this is one of the coolest facts in statistics. And it betrays a lot of the basic stuff you're taught in intro courses about how parameters function, which takes a limited view of how machines actually learn from data. How machines actually learn from data involves all kinds of complicated embeddings and deep learning. That's a phrase I'm not a big fan of; I like deep learning models, I just don't like the phrase "deep learning." It sounds like sorcery, like snake oil. Deep learning works because it's deep, because you've got parameters stacked on parameters, lots of layers of neurons and things, and they do stuff like this: they regularize and do other kinds of cool things inside the network. Okay, let me stop this sermon and move on.

Let me show you graphically what this looks like. We're going to compare the fixed-effect tadpole model to the multi-level tadpole model. What you're looking at here, across the horizontal axis, is all 48 tanks. Buckets, but we say tanks to glamorize it a little bit. They come in three sizes in the data set, and when you go home and draw the owl for yourself and look at the data set, you'll see that experimentally they were set up as small, medium, and large tanks. What does that mean? That's the initial density. This was set up experimentally: the field experimenters put different numbers of eggs on the leaves that were suspended over each bucket. It's a cool experiment; I think this is a very cool experiment. In the medium tanks you have a medium density, and in the large ones you start out with a lot of tadpoles. On the vertical axis we're looking at the proportion that survived, so this is the outcome scale; you can think of it as the probability of survival if you want to take a forward-looking view. The blue dots are the raw data: you take the survivors in each tank and divide by the initial number, and that's the blue dot. That's essentially the fixed-effect estimate: in the fixed-effect model you apply the inverse logit to alpha, and that's roughly where the blue dot goes. Well, almost: this is raw, not regularized at all. I want to show you the raw data. You'll see, for example, up top there are three tanks where all the lucky tadpoles lived. Usually that's not how it goes. What are the open circles? The open circles are the multi-level estimates: those are the alphas for each tank, inverse-logited to be on the probability scale. There's a pattern here that I want you to see, which explains all this weird language like shrinkage. The phenomenon you're viewing on this slide is shrinkage. You'll notice that the model is not retrodicting the sample, and that is why it's a good model. I'll say that again.
The model is not retrodicting the sample, and that is why it's a good model. Remember, the thing about overfitting is that if you want to retrodict the sample, you're bound to make bad predictions out of sample. You want to learn the regular features of the sample; that's the whole point of regularization. What a multi-level model does is use a regularizing prior that it has learned from the data, by learning how variable the units are, so that it knows how much to treat all the units as the same and how much information to pool across them. And that generates this phenomenon called shrinkage. We'll move slowly through this. Just focus on the small tanks for a second. Actually, wait, I'll focus on those in a second. First: what's that dashed line? That is the population mean that we have estimated from the data, the alpha bar that we've estimated. The raw mean, which I've now put in here in red, is in a different place. Why are these in different places? Why is alpha bar, the mean of the statistical population of tanks that we've estimated from the data, in a different place than the raw mean? Where does that raw mean come from? It comes from taking all the tadpoles, putting them in one big pile, and just averaging the survivorship. There are more tadpoles in the big tanks on the right, and so they bias your population estimate, because the survival rate is lower at high densities. In a population of tanks, the mean is different, because the densities vary. So if you just pool all the data together and estimate an overall survival rate, you're estimating the wrong thing. It's like estimating the admission probability across all departments: it's the wrong thing. It's a fine question for some purposes; maybe that was your question. What's the total survival rate in the whole population of tadpoles? For that, the red bar is the right thing. But if you want the mean in a population of individual tanks, it's not the right question. Does that make some sense? Okay.

Now let's focus on the small tanks. What's going on here? Notice the differences between the open points and the blue points. (Excuse me, my voice is going because I've had a cold all weekend. I will make it through this lecture. I will. He coughs confidently.) If you're above the dashed line, the open circles are shifted off the blue points, towards the dashed line. It's like they've been pulled by gravity towards it, and the closer you are to it, the less difference between the two. If you're below the dashed line, you go up instead. This is shrinkage: shrinkage towards the population mean. And that shrinkage depends upon how extreme the raw data are relative to the population mean. If it's very far, then the model says that's very unlikely, and it's more skeptical that the true survival rate in that bucket would really be like that if we ran that bucket over again. So it shrinks more. This is what regularization does. If we look at the large tanks, you see the same pattern, but the total amount of shrinkage is smaller. Why? Because there's more evidence in each tank. There are more tadpoles in each tank, so you get a more precise estimate for each tank, and each tank can overwhelm the information about the population. To understand why this makes sense, you have to imagine a tank that had one tadpole in it. That was not done, because this was a good experiment.
But imagine it was a bad experiment and you started out with some tanks where just one egg hatched, so there's one lonely tadpole in the tank. (Tadpoles are social, by the way, so it probably would be a lonely tadpole.) Then the only possible observations for those tanks would be zero or one. In that case, you really want to use information from the population to estimate the survival probabilities; otherwise you're going to have massive overfitting in each tank. In those cases you get a lot of shrinkage away from zero and one. On the other hand, if you had a hundred tadpoles in a bucket, then for that bucket you get a really precise estimate of the survival rate, the population is nearly irrelevant, and so there's less shrinkage. Does this make sense? I want you to get some intuition. In this particular data set and model, the exact quantitative amount of shrinkage is a product of a bunch of things balancing inside the model, and you're never going to be able to calculate it exactly with your pencil. That's fine; that's what Markov chains are for. What you want is some intuition for why you want to see a pattern like this. This pattern is not a malfunction. It's reducing the fit to the sample to improve predictive accuracy, just like regularization. We know what happens if we reduce fit to sample in the right way: we get better predictions out of sample. I know this is a big sermon, but there are lots of principles here to work through.

This is my summary slide for all the stuff I just tried to say. What do varying effects estimates do? They shrink towards the mean of the population, in this case alpha bar. The further from the mean, the more shrinkage you get. And the less data you have in any particular cluster (fewer data, sorry, I'm trying to be grammatically correct; almost nobody says "fewer data" anymore, it sounds very pretentious, doesn't it, but it's correct), the fewer data in a cluster, the more shrinkage you get, because there's less evidence at that cafe. Think about the cafes: if you only order one cup of coffee at a cafe, you want to shrink your estimate towards the mean of cafes. But the more cups of coffee you order at that cafe, the more you are informed about that particular cafe, and you don't need the population anymore. That's what's happening with these tanks.
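If you want to reproduce a plot like this one yourself, a sketch along the lines of the chapter code looks like the following (assuming m13.2 from above; rangi2 and logistic are helpers from the rethinking package):

```r
post <- extract.samples(m13.2)

# posterior mean survival probability for each tank (the open circles)
d$propsurv.est <- logistic(apply(post$a, 2, mean))

plot(d$propsurv, ylim = c(0, 1), pch = 16, col = rangi2,
     xlab = "tank", ylab = "proportion survived")
points(d$propsurv.est)                           # partial-pooling estimates
abline(h = mean(logistic(post$a_bar)), lty = 2)  # estimated population mean
abline(v = 16.5); abline(v = 32.5)               # divide small / medium / large tanks
```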
This phenomenon of shrinkage is really nothing mystical; you've seen it all along. It's a phenomenon called regression to the mean. In a one-level regression, you get regression to the mean because your future prediction for cases with the same covariate values is expected to be closer to the mean. This is the regression-to-the-mean phenomenon that we always observe in regression: values that are extreme, in terms of distance from the mean, are unlikely, and so if you're making predictions, you expect new cases with those same covariate values to be closer to the mean in the future. That's regression to the mean. This is regression to the mean as well: statistically, you should be skeptical about predicting extreme values in the future, and these estimates obey that. We're just using regression to the mean at a lower level, among the parameters, as well.

Shrinkage is one term; the other term you need to recognize in the literature is pooling. Shrinkage arises from pooling: pooling is the process, and shrinkage is the pattern. I've got a slide coming up where I try to help you understand what that means. Okay, all of this is really just dealing with overfitting and underfitting. Varying effects are more accurate than fixed effects. Varying effects estimates are called partial pooling; I'll explain what that means in the next series of slides. Fixed effects models are called no pooling: there's no pooling, because there's no information being exchanged between the clusters. Varying effects are partial pooling, because there's some information being exchanged among the clusters. How much? Well, that depends upon how variable the clusters are, and you have to learn that from the data too. And then there's complete pooling: the grand mean of all the data, which is maximum underfitting. Imagine we ignored the tanks and just said, okay, we're going to pile all the tadpoles together in one statistical bucket, count the number that survived, and divide that by the total number of tadpoles in the experiment. That's total pooling: you treat all the buckets as the same, which ignores the heterogeneity, and that gives you the grand mean. That's a very useful statistical quantity, but it's the complete pooling estimate, where you ignore all the heterogeneity in the model. If you just use the grand mean, that's maximum underfitting.
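As a tiny illustration (a sketch, using the reedfrogs data loaded earlier), the complete pooling estimate is just one number:

```r
# complete pooling: ignore tanks entirely, one grand survival proportion
sum(d$surv) / sum(d$density)
```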
Why is that maximum underfitting? Because there's a lot of variation and you haven't described it, so when you try to predict any particular tank in the future, you'll have a terrible prediction, because you've ignored all the heterogeneity in the population. That's maximum underfitting; it's too simple. You need more parameters, you really do, to adequately describe what's going on. The opposite is maximum overfitting. The fixed-effects model has a parameter for every bucket and doesn't use any information across buckets, so you're using a tiny amount of data to train each parameter. That's maximum overfitting. Well, not literally maximum; you could always do worse, right? As you get good at statistics, you learn there's always a way to do worse. But with varying effects, we're trying to navigate between the two monsters of overfitting and underfitting, the Scylla and Charybdis of statistics, through adaptive regularization. Adaptive means we learn the amount of regularization from the data set itself.

Let me try to back this up with a picture. Here's our multi-level tadpole model again at the top, still with the multi-level part colored in blue, and I want us to focus on sigma now. Sigma is the parameter that estimates the variation in the population of tanks; if this were the cafes, it would be the variation in wait times. Sigma, on the left, has a minimum of zero. If we set sigma to zero in this model, say we fix it, or put a really, really strong prior on it so that it's all piled up on zero, this model converges to the complete pooling model. Why? Because it says all the tanks are the same; there's no variation between them. That's like dumping all of the tadpoles together in one bucket. All the alphas become the same value, and the grand mean of the whole population of tadpoles is what they'll converge to. Does that make sense? Statistically, you can take the multi-level model, and in the limit as sigma goes to zero, it becomes the complete pooling model, the model with effectively one parameter for alpha. Every alpha will still be there, but they all have the same value. Why? Because sigma is zero; they're all the same. Does this make sense? On the other extreme we have infinity. Imagine that's infinity over there; I know it's off the slide, but for illustrative purposes, that's infinity. If sigma goes to infinity, you get the no-pooling model, the fixed-effect model. What a fixed-effect model is effectively, statistically, assuming is that all the tanks are infinitely different from one another, because that's the only distribution that would justify learning nothing among them, transferring no information among them. Does that make sense?

To understand this, you have to set aside your intuitions for a second, because you're a lot smarter than a statistical model is. Intuitively, as a vertebrate (vertebrate chauvinism here, let's roll with it, I know I'm going to get hate mail), as you go from cafe to cafe, you do this pooling. I'm not saying you do it exactly like this model does, but you do it, and you would never just ignore the variation; your brain won't let you. But the statistical model will not do that unless you program it to. This is the difficult thing about statistics: you've got to train the model, the robot, the golem, to do things which to you are intuitive, and you never realized you were doing a very hard thing when you started doing statistical modeling, I think. So in showing you the sigma line, what I'm trying to tell you is that if you want to program a little robot to borrow information across clusters, then you can't let sigma be infinity.
Letting sigma be infinity gives the kind of golem that won't transfer any information across clusters, and sigma zero is also a bad golem, one that doesn't recognize any differences. Everything in between is some different amount of regularization, a pooling of information between the buckets, where the estimate for any particular alpha, for any particular tank, is going to be a mix of the data in that tank and the population, the rest of the data in all the other tanks. The weight of that mix, how much of it comes from the data in that particular tank and how much comes from the rest of the tanks, depends upon the variation among the tanks. If there's no variation among the tanks, then the whole population is all that matters, and the data in that particular tank are ignored; the whole population is used in their place. In the case where sigma goes to infinity, you ignore the population. In between, it's some combination of the two, some carefully balanced Bayesian dance of how it all works out.

So in this particular model, we estimate sigma, of course, from the data, and it turns out to be this posterior distribution here. The mean is about 1.6, but it's not certain: we don't know sigma, we have a posterior distribution for sigma, and your estimates are going to average over that uncertainty, because it's Bayesian; everything gets averaged at the same time. I show you the prior here too; that's our Exponential(1) prior. It has a really long tail; it goes on for a while. There's a lot of data here, so we get an almost Gaussian (not quite Gaussian, but almost) posterior for sigma, and that is the amount of regularization that gets induced. The posterior mean, about 1.6, is very close to the 1.5 we had fixed before, but now it could be bigger, it could be smaller: we've learned the variation from the data itself. Is this good? Does this make some sense?

What does this population look like? We've learned an alpha bar and a sigma. I keep saying the word population. It's not a real population; it's a statistical population, an information population. There's some generative process that generates buckets with tadpoles (it's called an ecologist), but those processes generate a population with real mortality effects; there are real cause-and-effect relationships here. And the statistical population, we can draw it. Remember, the posterior distribution of a regression is full of lines: it's got alphas and betas in it, and those combinations of alphas and betas produce lines in the posterior. Those alphas and betas are correlated, and that's why you need to draw correlated samples from the posterior to accurately represent what the model thinks. This is true in all models. Now the posterior distribution does not only have lines in it; it has whole functions, whole distributions. It's now a distribution of distributions. Yeah, good times. So alpha bar and sigma define a distribution of mortality effects that would exist in a population of tanks, and we don't know that distribution for sure; instead we've got a distribution of distributions. So let's draw that distribution of distributions (this is where English grammar starts to strain). We can just draw correlated pairs of alpha bar and sigma from the posterior distribution and then draw them as densities, and that's what I've done on the left here. This is that Gaussian distribution. Why Gaussian? Because we said it was Gaussian: a Gaussian distribution of the log-odds of survival (mortality, or rather survival) in the population of tanks. Remember, zero on the log-odds scale is 50%, so you'll see that most of the tadpoles survive in this happy experiment; they were sheltered by a bucket, right? But there's a lot of heterogeneity in the experiment, in the whole population of tanks. In some tanks they were nearly wiped out; in some tanks all of them survived. So there's a lot of heterogeneity, and you can see that in this distribution. Then on the right, I've transformed that whole population onto the outcome scale, a probability of survival. If you transform this with the inverse logit, you get a prediction that there's a long tail of tanks, but a lot of the tanks, about half of them, have really high survival rates, and that's what we see in the data. Does this make some sense? This is what you need to do to understand what the model thinks: you can just plot this distribution and see it, and we'll do this with future examples as well.
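If you want to draw pictures like these yourself, a rough sketch (assuming m13.2 and the rethinking helpers dens, col.alpha, and inv_logit) could look like this:

```r
post <- extract.samples(m13.2)

# posterior for sigma, with its Exponential(1) prior overlaid
dens(post$sigma, xlab = "sigma", lwd = 2)
curve(dexp(x, 1), add = TRUE, lty = 2)

# the inferred population of tanks: 100 posterior draws of the
# Gaussian distribution of survival on the log-odds scale
plot(NULL, xlim = c(-3, 4), ylim = c(0, 0.35),
     xlab = "log-odds survive", ylab = "Density")
for (i in 1:100)
    curve(dnorm(x, post$a_bar[i], post$sigma[i]), add = TRUE,
          col = col.alpha("black", 0.2))

# the same population, pushed through the inverse logit onto the probability scale
sim_tanks <- rnorm(8000, post$a_bar, post$sigma)
dens(inv_logit(sim_tanks), xlab = "probability survive", lwd = 2)
```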
Okay, so I keep asserting that these shrinkage estimates are better, but let me demonstrate that to you. One of the things about statistical methods is that when we don't know the true state of the world, it's really hard to say if something is better or not. With enough investigation in any scientific area, you can demonstrate in the long run whether something works better. So statistical methods that improve, say, mobile phone reception started out as theoretical, based upon principles suggesting they would improve mobile phone reception, but then you can prove it, because you get lots of mobile phones deployed. On the theory side of varying effects, we can demonstrate with simulation that if there's a generative process, and we draw a sample from it, and we use the multi-level model, then we do better in prediction. Remember back in the overfitting week, I showed you this for WAIC: WAIC works, in theory; it almost gets it exactly right on average, in this magical way, right? It predicts the out-of-sample accuracy. And I also used simulation to demonstrate to you that regularization is good: even though it reduces your fit to sample, you'll make better predictions out of sample with narrower priors than with flat priors. That was all a simulation exercise. So let's do a simulation exercise again with varying effects, to give you some idea of how this behaves.

All of the code to run the simulation is in the chapter; I'm going to move quickly through this so we can just talk about the picture. We're going to simulate a bunch of what I'll call ponds now, 60 of them, with different densities of tadpoles: 5, 10, 25, or 35 tadpoles, 15 ponds at each density. Okay, this is the simulation. If you run the code in the chapter, you get a data set like this, where on the left you've got a pond number; Ni is the number of tadpoles initially in the pond, simulated tadpoles; true_a is the true log-odds of survival, which I've simulated. I've said, in this particular pond, because of the environmental circumstances and the amount of food and everything else, this is the log-odds of survival. I then simulate mortality events from that, survival events I should say: the glass is half full, not half empty; the pond is half full. Okay, I'll stop. There are some wipeouts in there; pond 5 is a total wipeout, but that's just a random outcome of the simulation. The next two columns are statistical estimates. p_nopool is the no-pooling estimate: this is when you just use that pond alone, the raw fixed-effect estimate from that pond, just the proportion that survived. Makes sense: no pooling. Then the multi-level model estimate is p_partpool, from partial pooling across the ponds. Good. And then there's p_true, which is just the inverse logit of true_a. We want to compare these last three columns to understand the properties of these estimates.
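The chapter code for this simulation looks roughly like the following (a sketch following the book; the model here is the same varying-intercepts structure as m13.2, and the column names match the ones just described):

```r
# simulate 60 ponds from a known population of survival log-odds
a_bar <- 1.5
sigma <- 1.5
nponds <- 60
Ni <- as.integer(rep(c(5, 10, 25, 35), each = 15))  # initial tadpoles per pond
a_pond <- rnorm(nponds, mean = a_bar, sd = sigma)   # true log-odds for each pond
dsim <- data.frame(pond = 1:nponds, Ni = Ni, true_a = a_pond)

# simulate survivors and compute the no-pooling (raw proportion) estimates
dsim$Si <- rbinom(nponds, prob = logistic(dsim$true_a), size = dsim$Ni)
dsim$p_nopool <- dsim$Si / dsim$Ni

# fit the varying-intercepts model to the simulated ponds
dat <- list(Si = dsim$Si, Ni = dsim$Ni, pond = dsim$pond)
m13.3 <- ulam(
    alist(
        Si ~ dbinom(Ni, p),
        logit(p) <- a_pond[pond],
        a_pond[pond] ~ dnorm(a_bar, sigma),
        a_bar ~ dnorm(0, 1.5),
        sigma ~ dexp(1)
    ), data = dat, chains = 4)

# partial-pooling estimates and the true survival probabilities
post <- extract.samples(m13.3)
dsim$p_partpool <- apply(inv_logit(post$a_pond), 2, mean)
dsim$p_true <- inv_logit(dsim$true_a)
```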
So now we can assess whether the partial pooling estimates are better, because we can compare them to the truth. Let's look at this graphically, because you don't want to look at a big table of numbers, right? Graphically now, just like our previous figure, we're going to have ponds across the bottom, as we had tanks across the bottom before. On the left we've got what I'm going to call tiny ponds, with only five tadpoles in them. The blue points are the raw proportions surviving in each pond, and the open points are the partial pooling estimates. The vertical scale here is the absolute error from the true value. The true value, meaning, causally speaking, there was a generative process that was generating mortality under those environmental conditions, and that's the true thing that we want to estimate, because we don't care about this particular pond; we're scientists, dammit, we care about the causal process. So we need to get at the true thing. The blue point is honest and objective about the sample, but we don't care only about the sample; we care about the causal process that generated the sample. So if you're at the bottom, at zero, you hit it right on, and the lower down you are on the axis, the better. There's a lot of error here, on the absolute error scale (this is on the probability scale), because you've only got five tadpoles. It's hard to estimate the rate of heads if you only flip the coin five times; you could convince yourself that a euro coin is quite biased if you only flip it five times. The blue horizontal bar is the average error for the raw estimates, and the dashed one is the average for the multi-level model estimates. So even though the multi-level estimates aren't perfect, they're better than just believing only the sample. They reduce your fit to sample, but that's good, because they give you better estimates of the underlying process. Does this make sense? And this is all shrinkage doing this for you. You can't see the shrinkage here, because I've transformed the scale to an error scale; you don't see the shrinkage pattern as we did on the previous graph. This is to show what shrinkage is doing and why it's better.

Good, let's look at the others. The other panels show the small, medium, and large ponds. You can see the pattern holds as you go across, but notice in general what happens: the overall amount of error declines as the ponds get larger, because we've got more data, so we get better estimates. And also the difference, the advantage of the multi-level model, shrinks as you get more and more data. If you've got a lot of data per unit, what the multi-level model does for you is not necessarily improve your predictions, because with lots of tadpoles in a pond you can estimate that pond well on its own. But it still estimates the population for you, and that's really important for generalization: if the population is highly variable and you want to predict what happens in the future, you need to understand the population's composition. So the multi-level model does something for you even when it doesn't give you better unit-level predictions: it allows you to make predictions in the right way, because it gives you a population to represent the future.
Okay. I have an infinite number of further slides, but this is an excellent time to stop, because my voice is about to give up. Let me queue up what happens when you come back on Friday: we're going to return to my favorite data set in the course, the pro-social chimpanzees, and we're going to make multi-level chimpanzees all over the place. I'll show you how this works in a data set you already understand, so we can compare it to what you already saw. All right, thank you, and I'll see you on Friday.