Welcome back, everyone. We're going to start the lecture today by talking about memory and learning. Clive Wearing is one of the most curious clinical cases of anterograde amnesia that has ever been carefully documented. Anterograde amnesia means you cannot form new long-term memories. Clive Wearing was a musicologist until, in the mid-1980s, a serious infection destroyed his ability to form new long-term memories. It's an interesting case because he can still play music, but he doesn't remember ever having learned to play it. His wife, Deborah, seen on the screen here, has written a book about their life together and carefully documented what it is like to live with someone with anterograde amnesia. Here's Clive playing the piano. Just minutes before, he insisted that he had never learned to play the piano, and just a few minutes after this, he will have forgotten that he was able to play it. How is it that learning works without memory?

Okay, but why are we talking about anterograde amnesia? Well, it turns out to be very relevant to statistical models. We're going to pick up with the problem we ended on in the previous lecture and try to develop a solution to it in this lecture. To remind you, we were working with the trolley problem data, and we had tried to work through the different complex moderators of the treatment effects that you can see in the DAG on the right of this screen. But we had stopped when we reached the last two, and those are the 12 different stories in the data set, the scenarios that are used to construct specific trolley problems in the experiment, and the 331 individuals who volunteered their time to rate the different trolley problems. How are we going to put these categories into our model? There are lots of choices to make, and we'd like to make a very principled one.

Let's look at the data from a more natural kind of perspective. What you see on the left of this slide are the 12 different stories laid out on the horizontal axis, and on the vertical axis, the response variable from the trolley data, that is, the 1-to-7 rating of how appropriate the action in the story was. The pink intervals are the 50% intervals for each story. On the right of the slide I'm showing you the same sort of plot, but for participants in the study, just the first 50, because there are 331 and that doesn't fit on the slide in a nice way. In both cases, what I want you to see is that there's variation both across stories and especially across participants. Some participants assign almost everything a rating of 7 and others almost everything a rating of 1. Other individuals use the whole range, and some stick around 4 all the time and play it safe. But also, for stories, some stories appear to be consistently lower rated than others. So how are we going to model this and put it into our statistical approach?

Well, the most obvious way would be to take our GLM strategy and create a vector of parameters for the stories. Let's call it beta, for lack of a better name. There would be 12 parameters in this vector, and then we just use the story index as a regular index variable to pick out which parameter we need for each particular response. This approach can work, it's not doomed, but the problem with it is that it has anterograde amnesia.
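To make that concrete, here is a minimal sketch of what that no-memory, index-variable approach might look like in ulam code. This is an illustration, not the lecture's actual model: it uses the Trolley data from the rethinking package but leaves out action, intention, contact, and the other moderators from the previous lecture, and the prior choices are placeholders.

```r
library(rethinking)
data(Trolley)
d <- Trolley

dat <- list(
    R = d$response,              # the 1-to-7 rating
    S = as.integer(d$story)      # story index, 1..12
)

# one independent parameter per story, fixed prior, no memory across stories
m_story_nopool <- ulam(
    alist(
        R ~ dordlogit(phi, cutpoints),
        phi <- bS[S],
        bS[S] ~ dnorm(0, 1),     # nothing learned about one story informs another
        cutpoints ~ dnorm(0, 1.5)
    ), data = dat, chains = 4, cores = 4
)
```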
As the model moves from one response to the next, from one story to the next, it doesn't use anything it's learned about the previous stories to help it learn about the next story. Maybe that sounds a little odd, so I'm going to build this up over the next several slides. What we want are models that remember and use those memories to efficiently learn about new cases as they arise. And this is what multi-level models do. Multi-level models are models within models. The first kind of model in a multi-level model is the type that we've been using all along, the generalized linear model we've been using for weeks, and this is a model of the observed groups or individuals in the data. The second model in a multi-level model is a model of the population of those groups and individuals, and that includes unobserved groups and individuals. Why would that be useful if we haven't observed them? Because the population model creates a kind of memory. And that memory is extremely useful because it generates expectations for the first observation and the second observation and so on for new groups. Expectations help us learn faster, because when expectations are violated, that surprising result tells us to update more. Or rather, it tells our little Bayesian machines to update more.

There are really two perspectives on why we would like our models to have a kind of memory that creates expectations and can have them violated, and they're both simultaneously true. It's like that old illustration of the duck and the rabbit that you see on the right of the slide here. If you look at it one way, it's a duck with the beak on the left. The other way, it's a rabbit with ears on the left and its face on the right. And this is true of these perspectives about models with memory, multi-level models, as well. You can think of them as being superior to single-level models because models with memory learn faster: they make more efficient use of the data you have. The other perspective is that models with memory resist overfitting. They regularize. If you think back to our previous lecture about overfitting, we're going to tie all that back in today to the models here. Both of these things are simultaneously true, and they arise for the same reasons. So by studying these models, you can learn something more about how statistical learning works.

Okay, let's take up the first of those perspectives. Suppose we go to Prague and we have a golem, a coffee golem, that's going to visit various cafes. Here's one of them. It's going to order coffee and record how long the wait is to get the coffee. By going around the city and ordering coffees, the golem is going to build up a database of the distributions of waiting times at different cafes. How should we program our golem to do this optimally? Let's do this in a computational sense. By that I mean I've done the calculations and I'm going to show you the cartoon version of them, how they look on the screen. What I'm showing you here is just one cafe. We're going to call it Cafe Alpha. The horizontal axis in the two plots on the screen represents waiting times, and this is the dimension of the outcome we're interested in: how long it takes to get your coffee after you order it. The vertical axis is density, because we're showing posterior distributions.
And so the red curve in the middle here is the golem's initial programmed prior, if you will, for the waiting times at Cafe Alpha. It hasn't visited Cafe Alpha yet; we're just letting it out the door. The blue density represents the initial memory the golem has, a set of prior expectations about the population of cafes. It expects cafes to vary, but it thinks that the vast majority of them will have waiting times of less than 10 minutes. Now let's let the golem visit a cafe. It goes to Cafe Alpha on its first visit and it gets its coffee in about two minutes. This is just one visit. It updates. The golem is Bayesian, so it uses Bayesian updating to do this, of course, and it gets a new posterior distribution. It gets one for Cafe Alpha, as you see in the middle, updated to the red curve, and I show the gray curve, the previous distribution, the prior, so you can see what's changed. But the population of cafes has changed too. The blue curve is now spiked at what the golem thinks Cafe Alpha's average waiting time is. This is after only one visit, though, so it will be easily overwhelmed by future data. So far this is just an example of Bayesian updating, and there's really nothing new except for the fact that, of course, the visit to any one cafe has given the golem information about the population of cafes. So it has changed two things: it's changed its expectations about a particular cafe, and it's created a memory, if you will, about all cafes and how they're distributed.

Now let's consider a second cafe that the golem's going to go visit, Cafe Beta. Before the golem has been to Cafe Beta, it has an expectation about it. And that expectation is not the original distribution of cafes; it's not the same expectation it had before it visited Cafe Alpha. Its expectation about Cafe Beta is now informed by its visit to Cafe Alpha, because it remembers that visit. It has used that memory to update its prior for Cafe Beta. This is what happens when there's memory in the model, when you have an explicit population model as well as just a model for the observations themselves. And so Cafe Beta's expectation, you can see here, is not the gray curve where all the cafes started, but the red curve. It looks a lot like Cafe Alpha, but it's not exactly the same; it's a little bit closer to the population distribution. Now our golem's going to visit Cafe Beta, and it gets a really long wait time. Cafe Beta is not such a well-staffed cafe. There's been a big change in the posterior distribution from the prior for Cafe Beta; you can see that. However, the red curve for Cafe Beta is not piled up on the vertical black line, the observation, because the prior is exerting influence. And this is just one coffee, right? So it's a limited amount of evidence. What I want you to notice, though, is that the estimate for Cafe Alpha has also changed. That has to be true if our golem is going to learn optimally, and the reason is that the optimal golem will remember that it visited Cafe Alpha. The new information from Cafe Beta forces it to revise its estimate for Cafe Alpha, even though it's not in Cafe Alpha anymore. It hasn't forgotten it, and it hasn't forgotten it because the estimate for Cafe Alpha depends upon the estimate for the whole population. And when the golem visited Cafe Beta, it updated its estimate of the population of cafes.
You can see that on the left in blue: the distribution has been pushed out to the right and it's flatter. The golem expects more variation, because the wait time at Cafe Alpha, the first cafe, was short, only about two minutes, and this one is about, what's that, 18 minutes or so. So now the golem thinks that maybe cafes are quite different from one another; they're not all the same. You can come back to Cafe Alpha and get a second data point, and Bayesian updating proceeds. At every visit, the golem, following its optimal programming, updates both every cafe it's visited and the population of cafes, its posterior distribution for the population of cafes. We can keep doing this. After three visits to Cafe Alpha, it sure looks like Cafe Alpha is the better cafe of the two. But we can think about adding even more cafes, so maybe three others here: Cafe Gamma, Delta, and Omega. The golem hasn't been to them yet, but you see it has a prior now, which is informed by the population distribution. It now expects more variation, and so the prior for these three cafes it hasn't been to yet is flat and long. It doesn't know what to expect. Will it get a Cafe Alpha, or a Cafe Beta where service is not so great? It can go to the different ones and get its coffees, and at each step it updates. What you notice is that all the cafes move a little bit, no matter which cafe the golem goes to, and the estimate for the population of cafes also moves. That's because, for the optimally learning golem with a memory, every visit has implications for the estimates of every cafe in the population. But as more and more data accumulate, of course, the evidence for each cafe will eventually overwhelm any information from the population. It's when you're just starting out, when you don't have a lot of experience, that it makes a huge difference to have memory and use it to develop expectations. At the end here, the golem's going to visit Cafe Beta again, the one with the long wait time. And again, it gets a long wait time, much longer than all the others. The posterior distribution is now pushed back up to the right, away from where it had been in the gray curve, which had shrunk back towards the other cafes, because the golem was becoming increasingly skeptical that the one coffee it ordered early on at Cafe Beta was representative of Cafe Beta; the other cafes had all given it a good experience, as you can see. But in the end, we get another long, slow coffee from Cafe Beta, and the golem updates appropriately.

Okay, that's one perspective on what memory does. We're going to deal with the computational structure of all that in a bit; just be patient. I want to show you the other perspective first, though, and that is the regularization perspective. It arises from the same logic, but it really feels different, just like, again, the duck or the rabbit. So the other reason often given to use multi-level models is that they adaptively regularize your estimates. You remember from the overfitting lecture that regularization means finding a set of estimates that are neither overfit nor underfit; they make good out-of-sample predictions. Cross-validation was one method we discussed to do that. Another way to talk about this, from a statistical perspective, is that there are three kinds of models we might use to learn from a data set that we're going to make predictions with. The first is the complete pooling perspective.
In this case, we treat all clusters in the data as identical. What clusters means here is stories, individuals, any kind of unit with repeat observations. Cafes, if you will. In the complete pooling kind of statistical model, you treat all of the different cafes as identical, and this tends to result in underfitting if there's any interesting variation among them. Think about the cafes again. There are cafes that are quite a lot worse than the others in terms of how fast the coffee comes. Maybe it's worth waiting for that slow coffee because it's better, but the waiting times are different. If you collapse all the cafes together and have only one estimate for the amount of time to wait, you're underfitting the variation in the data. You'll make poor predictions because your model wasn't flexible enough. In the no pooling perspective, you instead treat all clusters as if they were unrelated. You have a unique parameter for each and no memory in the model, so that none of your visits to Cafe Alpha will inform your estimate for Cafe Beta. It's like every time the golem goes to a new cafe, it forgets that it's ever been to a cafe. And that's not good for learning. The compromise solution is called partial pooling, and this is when the model tries to adaptively learn the amount of variation in the population of cafes or individuals or stories. That's what the coffee golem was doing.

Let me show you what this looks like and try to give you some intuition for why partial pooling is a good idea. I'm going to use a data example, the same data example that I used to motivate multi-level models in the book. It's a relatively small data set, 48 rows, in data(reedfrogs) in my rethinking package. Each row is a group of tadpoles. I call them tanks, but maybe it would be better to call them buckets. These are experimental groups of tadpoles that develop together, and there are three experimental treatments across these 48 groups. One is the density of tadpoles in each tank, that is, the number of tadpoles in the tank. The second is size, which is how big the tadpoles were when they were put in the experimental unit. And the third is the presence or absence of predators, which was manipulated by covering and protecting the tank or not. The outcome of interest is survival, the number of tadpoles who survive. We're going to use these data to think about overfitting and regularization and the value of models with memory.

Let's DAG it out here a bit. This is an experiment, and the reason I've chosen this particular empirical example is because it is a controlled experiment. So for the most part we can put the issues of confounding and such behind us and just think about the estimation problem for now. That's the goal of this lecture. Of course, in real studies, confounding is always haunting you, but let's put that aside; we've dealt with it for weeks now. Let's think hard about getting good estimates. In the DAG, we have in the middle our outcome of interest, survival. In the data, this is the number of tadpoles who survive. Then we have the three experimental treatments: density, size, and predators. And then there's also the tank. The reason we're interested in the tank as a variable is that there may be unobserved things about each tank that explain the number of tadpoles that survived in each case. There would be some heterogeneity among them due to other things we haven't observed. Think of it as something like a competing cause.
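Here's a small sketch of loading these data and drawing something like the raw-data plot described next. The plotting details are my own; only the data set and its columns come from the rethinking package.

```r
library(rethinking)
data(reedfrogs)
d <- reedfrogs
str(d)   # 48 rows: density, pred, size, surv, propsurv

d$tank <- 1:nrow(d)    # tank index, 1..48

# proportion surviving in each tank, with the average of the tank-level proportions
plot(d$tank, d$surv / d$density, ylim = c(0, 1),
     xlab = "tank", ylab = "proportion surviving")
abline(h = mean(d$surv / d$density), lty = 2)
```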
Okay, so here are the raw data, just plotted out for you so you can see what the structure of the data set looks like. The horizontal axis is the tank index, and I've grouped the tanks: on the left you have the low-density tanks, which I'm going to call the small tanks, in the middle the medium-density tanks, and on the right the high-density tanks. The vertical axis is the proportion of tadpoles in each tank that survive, and the black dots show you that proportion. The horizontal dashed black line is the average survival across tanks. That means the average of these black points; it's not the rate of survival across the whole experiment. If you take the rate of survival in each tank and then average those rates, you get that black line. It's a feature of the population of tanks, marginalizing over their densities and so on. We're going to model these data and use them to think about regularization and the value of multi-level models.

Okay, so let's make a little model, and let's make a model that doesn't have any memory. This is going to be an ordinary binomial GLM, like from a few lectures ago. But now we're going to think about the prior in a little more detail, explicitly vary it, and see what that variation does to our estimates. Just to show you on the right, and to review: we have S_i, which is our outcome variable in this case, and that is a binomially distributed variable because it's a count with a known maximum. That maximum is the density D of tadpoles in that particular tank. Then we have a probability p_i, and we're going to model that on the log-odds scale with the logit link. The linear model is really simple here: it's just a log-odds parameter alpha specific to each tank. Each row i has a tank attached to it, and we index that with T[i]. So we have a vector of those alpha parameters, and each element of that vector gets a normal prior. It'll have some mean, alpha bar, because there's an average log-odds of survival across all tanks and we want to model that. And the question that's going to interest us here is what kind of standard deviation we should assign this prior.

There's a lot on this slide, but I'm going to take it slow, and we're going to get a lot out of it. At the top I've plotted the data again; I think you'll recognize it. Then I've added a bunch of estimates, the little red dots with the pink confidence regions on them. Those are 89% confidence regions. These are estimates for each tank, the little alphas for each tank, and they have been estimated for the sigma shown at the very top. I've plugged in a standard deviation; it isn't a parameter, it's just a number that I plugged in for the standard deviation of the prior for each of the alphas. And I made it very small to start with, 0.1, one tenth. What happens as a consequence is that we get a really narrow range of estimates hugging the mean, alpha bar, near the black dashed line. In the bottom plot I'm showing you the corresponding population distribution, the distribution of the alphas, which is the prior for each of the tanks. Now what I'm going to do is let sigma increase, all the way up to 5 and then back again. You'll see that as we manipulate sigma, fixing it at different values, it lets the distribution of per-tank estimates change.
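Here's a sketch of that fixed-prior model in ulam code, with the standard deviation plugged in as a constant rather than estimated. The data list names and the way the constant is passed are my own choices, not the slide's code, and they assume the d and tank index set up in the earlier sketch.

```r
dat <- list(S = d$surv, D = d$density, tank = d$tank)

# no-memory binomial model: one log-odds intercept per tank,
# with the prior's standard deviation plugged in as a fixed constant
m_fixed <- ulam(
    alist(
        S ~ dbinom(D, p),
        logit(p) <- a[tank],
        a[tank] ~ dnorm(a_bar, sigma_fixed),
        a_bar ~ dnorm(0, 1.5)
    ),
    data = c(dat, list(sigma_fixed = 0.1)),   # start with the very tight prior
    chains = 4, cores = 4, log_lik = TRUE
)
```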
So here we can increase it up to 5. I'll freeze it here for a second, and you'll see we get basically the opposite result from what we started with. Now all of these individual tank estimates, the individual alphas, are quite dispersed, and they're much closer to the black dots, because now the prior isn't constraining the estimates to be near the mean. It's a very flat prior. It's much, much more flexible, and so the model fits the sample much, much better. At the bottom you can see what's going on here as well: each of the vertical bars is one of the alpha estimates that's also shown in the top, so you can see how they're distributed relative to the prior. Something to note here, and I think this is important because it'll come up again and again, is that even though the prior for each alpha is normal, Gaussian, the posterior distribution of a Bayesian model is not forced to be normal by that prior. You can see that the vertical red lines at the bottom, underneath the Gaussian curve, don't form a Gaussian distribution. You've got those high ones up there on the right, you've got clusters on the left, and that's just because there's nothing that says a posterior distribution has to be the same shape as a prior distribution. So when we use a Gaussian prior for any particular collection of parameters, like these alphas, that's not forcing them to be Gaussian in the posterior. They can have whatever distribution they need to. Okay. I should also say that you'll see alpha bar has moved now. It has shifted because of the dispersion, if you will, of the individual alpha estimates, as you can see there.

Now we can let this thing go back and forth a few times so you get a sense of how it looks and how it works, as the estimates shrink together and then spread out again while sigma goes up to 5 and back down. It's clear that when sigma is 0.1, it's not a good model. That prior is way too tight, and the model is not fitting the data. It's underfit, radically underfit, and we'd expect it to make really bad predictions about any additional experimental groups of tadpoles that we would try to understand. On the other end, when sigma is a very large number, 5, it could possibly overfit the data. The model may be too flexible, insufficiently skeptical, and I've been encouraging you for weeks now to choose more skeptical priors than something like a sigma of 5. So what's the right sigma? There are different ways to approach this, but I'm going to use an approach that you've met in a previous lecture. Let's think about cross-validation. Which of these sigmas gives us a good cross-validation score? We can take this whole sequence again and compute the PSIS, the Pareto-smoothed importance sampling cross-validation estimate, for each sigma that we might use. At the bottom here I've replaced the plot of the prior distribution of alphas with all the sigma values that we're going to vary over. This is a log scale, the way the spacing is done, but I've labeled it with the natural-scale sigma values, all the way from 0.1 up to 5 on the far right. So right now we've got 0.1, and you can see this gives us a PSIS score somewhere above 500; remember, lower scores are better. Now I'm going to start the animation again, and we're going to calculate the PSIS scores for the different sigma values. There's a pattern here: very, very low sigma is quite bad.
That is a radically underfit model, and as sigma increases, the PSIS score improves, meaning it goes down. But eventually, starting around 1.2, there cease to be any substantial improvements, and after a little while, above 3, you can start to see the PSIS score actually go up again. If we used even larger values of sigma, you'd see that trend continue. Okay, so you probably expected this, or at least we should have, from the information presented in the overfitting lecture. Priors can do a lot to regularize estimates, and tighter priors often give us better out-of-sample predictions even though they make the model fit the sample worse. However, if the prior is too tight, as it is on the far left here, then that's also bad, or even much, much worse, as in this case. So the best value is right around 1.8, although there are lots of values in the same area that are essentially equivalent; it's a bit of a plateau. This is the value that, in expectation at least, maximizes our out-of-sample accuracy. It's a good regularizing value.

If we zoom in on this area now and look at the estimates, there's some interesting stuff to see. To remind you, the black points are the raw data, the empirically observed proportion of tadpoles that survived in each tank. The red dots with the vertical intervals on them are the estimates: the red dot is the posterior mean, where the model thinks the expectation is for that tank, or a tank with those features; the smaller red interval is the 50% compatibility interval; and the lighter pink intervals are 89% intervals, creating a kind of top-down view of the posterior for each tank. What I want you to see is that for the tanks with unusually low or high proportions of survival, the centers of the posterior distributions are not on the data. This is a very common thing with regularized models.
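The full sweep over sigma described above could be coded as a loop, refitting the fixed-sigma model for each value and recording its PSIS score. The grid below is illustrative, and I'm assuming, as in recent versions of rethinking, that PSIS() returns a table with a PSIS column.

```r
sigma_seq <- c(0.1, 0.3, 0.5, 1, 1.8, 3, 5)   # illustrative grid of fixed sigma values
psis_scores <- sapply(sigma_seq, function(s) {
    m <- ulam(
        alist(
            S ~ dbinom(D, p),
            logit(p) <- a[tank],
            a[tank] ~ dnorm(a_bar, sigma_fixed),
            a_bar ~ dnorm(0, 1.5)
        ),
        data = c(dat, list(sigma_fixed = s)),
        chains = 4, cores = 4, log_lik = TRUE
    )
    PSIS(m)$PSIS
})

plot(sigma_seq, psis_scores, log = "x",
     xlab = "sigma (fixed)", ylab = "PSIS score (lower is better)")
```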
Remember, we're fitting the sample worse so that we can make better predictions, but there's a pattern to this fitting worse. It's the extreme tanks, the ones with unusually low or high survival, that the model is most skeptical of, and there's a pattern to the direction of its skepticism. It thinks that those tanks, if you're going to make a prediction for them for next time, would be closer to the mean, closer to the horizontal dashed line. This is the partial pooling effect, the regularizing effect. If we have a good skeptical prior with a properly chosen sigma, and it turns out to be about 1.8 here, though in other cases it'd be a different value, then the resulting estimates will tend to have this pattern, where the extreme units, the extreme tanks in this case, or extreme cafes in the case of our coffee golem story, are shrunk towards the global mean. By doing that, we often end up making better predictions in the future.

This whole pattern results from the same thing our coffee golem got by having memory about the population, and it's the prior for the tanks that is that memory. But it only becomes a memory when we do the next trick, and the next trick is that we'd like some way to do this whole thing inside the model. In the example I just finished, I manipulated the value of sigma, ran the model over and over again for different values of sigma, and showed the resulting estimates. It would be nice if we could run the model once and have the model tell us the value of sigma that gives us the best expected out-of-sample accuracy. Could we actually learn that value of sigma, actually learn the prior, as it were, from the data? The answer is yes, and after the break I'll show you how. This material is pretty complicated, and even though this lecture isn't particularly technical, I encourage you to make good use of this break to review the preceding slides, maybe watch them again, and make sure you've got the minimal understanding that'll let you keep going forward. When you come back, I will be here.

Welcome back. Let's pick up where we left off. We were looking at the tadpole data, the reedfrog tadpole data, and we were building up to the argument about how the use of partial pooling and models with memory allows us to find models that will perform well in cross-validation. I had just shown you the cross-validation across different levels of sigma. Now what we have on the slide is an actual multi-level model. It's a slightly modified form of the models we've used before, and I've highlighted the new bits in red in the model notation on the right: the prior for the individual tank alphas. These alpha parameters are the log-odds of survival for each tank, and this prior defines the population of tanks. It holds the collective memory for the machine. There's an average tank effect, alpha bar, and some unknown standard deviation among the population of tanks, sigma. That's the variable I had fixed in the previous examples before the break, and now we're going to learn it from the data. To learn it from the data, we also need to give it a prior, and so I've added a prior for sigma, an exponential prior, at the bottom. Why the exponential prior? I've used these before. The exponential is a good default because it's a very easy prior to understand: the only information it contains is the average displacement from zero. In this case,
that's one. The mean of the exponential is one over its rate parameter, its average displacement, and so this is one over one, which is still one. So it's easy to understand. It does have a long tail, though, so sometimes you want to use something else; we'll have examples later on. Otherwise, the features of this model are just as before; it's just that we've put a new model inside it. To help you understand why I keep saying that: if you look at the line for the alphas, it looks like a likelihood. If the alphas were data, we would be modeling a normal distribution of observations of them. Remember, all the priors up to this point in the course have had fixed numbers in them, not other variables, not other parameters. But now we have a model for the alphas where the shape of the prior comes from two other parameters, each of which has to be learned itself from the data. And the priors for alpha bar and sigma are often called hyperpriors. I say a little bit more about this terminology in the book. It's not essential that you use it when you discuss these models, but if you hear it or see it in writing, it's good to know what's being referred to.

Okay, let's code this model up. Apologies, before we code the model up, let's look at the implications in a more human way. It's often nice to remember to do prior predictive simulations, so let's do this for the tank alphas. What does the prior distribution of tanks look like here? Since it depends upon more than one parameter, we need to mush together more than one parameter to get its shape. In the top left I'm showing you the exponential distribution with a rate of one, and alpha bar is Normal(0, 1.5), an ordinary Gaussian distribution. On the bottom left we have them mixed together. That is, if we sample a large number of values from a normal distribution where the mean, alpha bar, is itself drawn from a normal distribution with mean 0 and standard deviation 1.5, and the standard deviation, sigma, is itself drawn from an exponential distribution as shown in the upper left, then we get the shape that I show you on the bottom left. And that is not a normal distribution, because its tails are too thick. You'll notice it's more dispersed than a normal distribution with the same mean and variance, and that's the extra variation from alpha bar and from sigma making the tails thicker. But in any event, that's the implied prior distribution, and as I'm going to say multiple times, that's just a prior. The distribution of tank survival rates doesn't have to look Gaussian just because the prior is Gaussian. Remember, at minimum, all a Gaussian prior means is that we're saying some collection of values has a mean and finite variance, and that's it.
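A prior predictive simulation like the one just described can be done in a few lines. This is a sketch using the priors just stated; dens() and inv_logit() come from the rethinking package.

```r
library(rethinking)

# sample the implied prior for an individual tank's log-odds of survival
n <- 1e4
a_bar <- rnorm(n, 0, 1.5)          # prior for the population mean
sigma <- rexp(n, 1)                # prior for the population standard deviation
a_tank <- rnorm(n, a_bar, sigma)   # implied prior for a single tank

dens(a_tank)                # thicker tails than a plain Gaussian
dens(inv_logit(a_tank))     # the same prior on the probability scale
```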
Okay, here's some code. I show you on the left the ulam code for this model, and on the right the ordinary mathematical statistics notation for a model of this kind. I hope you can see the correspondence between the two. There's nothing really too surprising here: you can just put a_bar and sigma into the normal prior inside the ulam model and it'll proceed as usual. Notice at the very bottom of this code I've added log_lik=TRUE, that's for the log likelihood, because I'm going to do some model comparison using WAIC and importance sampling in a little bit, to make a point about model flexibility. Okay, run this model. It samples extremely efficiently in Stan, and if you spit out the whole precis table, you get the monster that I show you on the slide. There are 48 alphas in addition to a_bar and sigma, so there are 50 parameters total. Remember, there are only 48 rows in the data set, but of course in Bayes we can have more parameters than observations; what matters is how they're structured. It's a point I'll make again in a bit.

Let's focus instead on sigma and a_bar, because these are the parameters that describe the population shape, and the point of this example is not really to estimate the survival rate in each tadpole bucket, but to understand how the population model inside the multi-level model creates memory and improves inferences. So what sigma has it estimated here? The posterior mean, as you see, is about 1.6, but it ranges from about 1.3 to 2; there's quite a range. Let's revisit our cross-validation exercise, where I varied sigma by plugging in fixed values, fit the model over and over, and calculated the importance sampling score, which is, remember, an estimate of the out-of-sample performance of a model, a way to understand the overfitting risk. If we mush these two things together, now you see that the multi-level model identifies the same range of sigmas as the cross-validation exercise did. I'll say that again: the multi-level model identifies the same range of sigmas as the cross-validation exercise does. The multi-level model learns a prior that is expected to provide the best out-of-sample accuracy for these units. And that's pretty cool.

It'll be useful to pull a little more conceptual insight out of this example, and it'll be useful if we also fit a model that has no memory. So in the upper left of this slide I'm just repeating the multi-level tadpole model, and just below it I've added a version of it, called mST_nomem, where I fixed sigma to 1, just to help you understand what's happening here. This would have been like the kinds of models I used to do the cross-validation exercise. For both of them I have log_lik=TRUE, so that I can compare them with WAIC at the bottom.
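Here is a sketch of what those two models and the comparison might look like in ulam code. The model names follow the lecture's labels only approximately, and the data list reuses the reedfrogs variables set up earlier.

```r
dat <- list(S = d$surv, D = d$density, tank = d$tank)   # as in the earlier sketch

# multi-level model: sigma is learned from the data (the model with memory)
mST <- ulam(
    alist(
        S ~ dbinom(D, p),
        logit(p) <- a[tank],
        a[tank] ~ dnorm(a_bar, sigma),   # adaptive prior = the population model
        a_bar ~ dnorm(0, 1.5),
        sigma ~ dexp(1)
    ), data = dat, chains = 4, cores = 4, log_lik = TRUE
)

# no-memory version: sigma fixed at 1 instead of learned
mST_nomem <- ulam(
    alist(
        S ~ dbinom(D, p),
        logit(p) <- a[tank],
        a[tank] ~ dnorm(a_bar, 1),
        a_bar ~ dnorm(0, 1.5)
    ), data = dat, chains = 4, cores = 4, log_lik = TRUE
)

precis(mST, depth = 2)              # 48 alphas plus a_bar and sigma
compare(mST, mST_nomem, func = WAIC)
```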
Why WAIC? Because WAIC is going to give us an effective number of parameters measure, a penalty term that is a kind of measure of the flexibility of the model. In ordinary non-Bayesian models, the penalty term is typically equal to the number of parameters, but it won't be in Bayesian models, because of the priors. Let's see what it does with these two models. Here's the compare table, and unsurprisingly the memory model, mST, does better; it has a smaller WAIC. You expected that, because it has a better sigma value, so it has lower overfitting risk. But look at the pWAIC column. That's the penalty term, sometimes called the effective number of parameters. There are 50 parameters in the mST model and only one fewer in mST_nomem, but both of them have way fewer effective parameters than that, half or less. And notice that mST, even though it has one more parameter than mST_nomem, actually has fewer effective parameters. I'll say that again: even though mST, the multi-level model, has one more parameter, sigma, and sigma is free in that model, free meaning flexible, it ends up with fewer effective parameters than the other model, mST_nomem. That's because, by fitting sigma, it ends up being a less flexible model, or rather a model with the proper amount of flexibility, and so it makes better predictions. This is a very weird phenomenon in Bayesian inference, but it's not unusual: adding parameters can actually reduce the overfitting risk of a model. What matters is not the number of parameters, but how they're structured. Once you start making priors depend upon other priors, as we've done here, a wide world of really interesting statistical things can happen which are not dreamt of in introductory statistics courses. So I do encourage you to discard the lessons, well, discard is a strong word: do not assume that the lessons you learned about basic regression models, or basic tests like t-tests or ANOVAs, apply to more complicated models, multi-level models, neural networks, and the like, because typically they don't.

Okay, let's interrogate the model a bit more; there's still more to learn from this simple model. What I've done here is again plot the proportion surviving in each tank using the black dots, the empirical results, and then in red and pink I've put up the estimates, the alphas for each tank: the pink compatibility region is the 89% region for each, and the circle is the posterior mean. This is all on the probability scale now. What I want you to see, peering at this, and I know it's a bit confusing, is the following. Focus on the left side for a second, where the small tanks are, and look at the distance in each case between the corresponding empirical result, the black dot, and the red estimate, the posterior mean. You'll see that there are lots of cases where they don't overlap. And you understand why; we talked about this before the break. This is a result of regularization, of trying to avoid overfitting. But measure the size of those distances with your eyes for a second, and then gaze quickly over to the far right and make the same comparison. The large tanks on the far right exhibit the same phenomenon, that is, especially the more extreme tanks near the bottom have greater distances between the empirical black point and the red estimate, just as on the far left. However, the distances are much smaller in absolute size on the far right than on the far left, and that's because there's more evidence
in the large tanks. There are more tadpoles, and so there's more information with which to inform the estimate for each tank on the far right than on the far left. As a result, the estimates on the far left are informed more by the population estimate; that is, the memory part of the model is having a stronger effect on the small tanks, because the small tanks contain less data. So there are two things to see here, in summary. The first is that, under optimal estimation, extreme estimates tend to show a greater distance between the empirical result and what the model expects in the long run, in future data from that unit. That's the gap between the red circle and the black dot. And second, the more information there is in a particular unit, the less the population distribution matters, the more that unit contributes to the population, and therefore the more that unit will influence other units that have less data. Think about the coffee golem. The coffee golem really exploited this property, but it did so automatically, just through the structure of the model. You don't have to be clever to make all these cool estimation properties work; you just have to get the scientific structure of the model right, and then probability theory does the rest.

Okay, another thing to notice about this simple model: this is an experiment, there are treatments, and I omitted them from the model. Nevertheless, they show up. It's like the estimates for each tank are, in a sense, haunted by the treatments, because estimates from this kind of model are very flexible, and they can find treatments, if the treatments have large effects, even if you don't tell the model that the treatments exist. All I've done here is repeat the previous slide but color the points by treatment, focusing on the predator treatment. The blue tanks are the ones that were shielded from predators, I believe in the original experiment they were covered, and the red ones were not protected from predators. You'll notice that the red ones tend to suffer lower average survival. Unsurprising. However, this model knows nothing about the predator treatment, because I did not even give it that variable, and it has found it nevertheless.

There's an interesting thing that comes from now comparing these estimates to a model where we do tell it about predators. So let's do that now. Let's stratify the mean of the population by predators. This is kind of like saying there are going to be two populations of tanks now, with different means: the non-predator mean and the predator mean. There are different ways to code this; again, we're in the realm of choosing functions now, so your imagination is the only limit. But I'm going to do the simplest thing and just use a dummy variable P_i for the presence of predators and a coefficient beta_P for the incremental effect of the presence of predators on the mean. Now, you're expecting this to be negative, but I'm just going to go ahead and assign it a normal prior centered on zero, not the best choice in the world, but certainly a neutral one. There are no new obstacles in coding this. I'll show you the code here: just add bP times P and then add a prior for the new beta coefficient in the model (a rough sketch follows below). You won't be surprised to see that the posterior distribution of this new coefficient, bP, is resoundingly negative, and it's a whopping huge effect. The x-axis here is still on the log-odds scale, so almost minus two log-odds on average. That's a devastating reduction in survival.
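Here's roughly what that predator model might look like, a sketch rather than the slide's exact code. The coding of the pred column and the width of the prior on bP are my assumptions.

```r
# dummy variable for predators: 1 if predators present, 0 otherwise
dat$P <- ifelse(d$pred == "pred", 1L, 0L)   # assumes the factor levels are "no"/"pred"

mSTP <- ulam(
    alist(
        S ~ dbinom(D, p),
        logit(p) <- a[tank] + bP * P,
        bP ~ dnorm(0, 0.5),                 # neutral prior centered on zero (width assumed)
        a[tank] ~ dnorm(a_bar, sigma),
        a_bar ~ dnorm(0, 1.5),
        sigma ~ dexp(1)
    ), data = dat, chains = 4, cores = 4, log_lik = TRUE
)

precis(mSTP)    # bP should be strongly negative; compare sigma here with sigma from mST
```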
But of course you could see that from the graph, right? You could see it with your eyes. This gives us more, because it gives us a measure of the precision, and even the mildest plausible effect is quite negative; even if it were only minus one, that would be a big effect. But let's look at something else here. Let's compare the implied posterior predictions of the two models. What I've done here is plot, on the horizontal, the probability of survival from the model without predators, the first multi-level model that I fit after the break, and on the vertical, the model we just did, which is also a multi-level model but with the dummy variable for predators added to it. Again, the blue points are the tanks where predators were absent, and the red points are the tanks where predators were present. The diagonal is the line of equality. You'll see that the models make extremely similar predictions across all the tanks. There are some that are a little bit different, and we could inspect that and think hard about why, but that's not the point I want to make with this. The point is that the models make extremely similar predictions, and that's a testament to how flexible these multi-level models are. They often find structure in the data even when you haven't told them the structure is there.

Okay, let's also look at the sigmas that were estimated here, and you can see there's been a big change. Even though the predictions are extremely similar, adding the predator treatment explicitly has had a big effect on the estimated variation in the population of tanks. The red density on the right is the sigma posterior distribution from the model with predators, and the blue one is from the first multi-level model, the one where the model is ignorant of the predator treatment. What has happened here is that the model with predators is able to explain variation among tanks using the predator treatment, and so the variation left for the alphas to explain is smaller, quite a lot smaller, as you can see. This is a very common thing with multi-level models: as you start to add in treatments or other variables that are associated with the outcome, the variation among the multi-level estimates, the alphas in this model, gets smaller. One consequence of this is that in more richly structured multi-level models, like the ones we'll meet next week, it's often quite useful to start with a model that has none of the treatments and other things that interest you, because it's a way to assess the structure of the variation at the different levels of the model, and that is often extremely helpful for understanding the size of the effects that you're seeing.

Okay, I want to wrap up the tadpole example for now. Multi-level models are distinguished by having a model of the unobserved population. We can't see the population of tadpoles, we can't see the population of cafes; it may not even really exist, except as some statistical device that we use to help our estimates. It's a kind of machine memory, but it has a huge and positive effect. It helps us learn about the observed units, and it does that in two ways. First, it uses the data more efficiently, so that each unit can actually tell you something about the other units in the data. Each cafe the coffee golem visited told it something about the other cafes as well, and that's a more efficient use of its experience than being ignorant of the idea that cafes are somewhat similar to one another, that they're the same type of thing. Likewise,
the tanks in the tadpole data are different, but they're also alike. Obviously there's not a real population of buckets full of tadpoles in the world, but if we treat them as if they were from some statistical population, we can learn about them more efficiently, and that simultaneously reduces overfitting, which helps us make better predictions. These types of estimates in multi-level models, the alphas in the tadpole example, are often called varying effects. One way you can think about this is that the estimates vary across units, or that they're unit-specific, partially pooled estimates. What matters is the use of partial pooling, through the device of the population model that's run at the same time as the observation model. There's a bunch of confusing terminology in this literature, but if you insist upon seeing the model specification, of the type I've been showing you in this course, you can always demystify what was actually done in any particular study.

Okay, there are still uninspected variables in the DAG in the bottom right of the slide, like density D and size G. We used density already in the model, but only as the number of tadpoles that could survive; it could have other effects as well. And size, this is the size of the tadpoles, could have big effects, because large tadpoles possibly have better defenses against predators. We're not going to deal with those in this lecture. I'm going to leave those for your fun in a homework problem.

To end this lecture, I want to talk about superstitions instead. There's a lot of superstition in statistics, and I'm very sympathetic to the fact that it exists, because humans are a superstitious species; it just seems to be how our cognition works. But there are some risks to that, and in the case of varying effects, the superstitions lead people not to use varying effects as often as they should, and then they give up statistical power. I want to talk about three superstitions which are, in my experience, the most common.

The first is that the units you're stratifying the varying effects by must have been sampled at random, rather than being some aspect of the design of your study. This is false. Think about the coffee golem. Of course, the coffee golem is not real data, but that somehow makes it better, because we have the generative process for it and the statistical analysis. The cafes could be an exhaustive population of cafes in Prague; the golem is still going to learn more efficiently if it uses a multi-level model in its little rock brain than a non-multi-level model. It doesn't matter whether the units were sampled at random. Lurking behind this fallacy is the idea that random is some real thing in the world that people do. At the scale that people live, everything is deterministic, and random is just an epistemological statement. When we sample units at random, what that means is that we've shielded ourselves from knowledge of what actually determined any particular unit ending up in our sample. It's purely epistemology, a statement about our own blindness. This is not a criterion by which you decide which type of estimator to use. Typically, partially pooled estimates are going to be superior to non-pooled or completely pooled estimates, just because they reduce overfitting.

The second fallacy, or superstition, is that the number of units must be large. This is also not true. I will say that it's practically true in the case of non-Bayesian estimation techniques, but it is not true for Bayesian techniques, and
you saw this again in the coffee golem example. We could do partial pooling with one cafe already and get a population estimate, and then as soon as we visited the second cafe, with just one more data point, we could already do multi-level estimation. So that's not an obstacle at all, and sometimes this is useful and realistic: just because you don't have a lot of data, that's not an excuse to use the wrong scientific model.

The third: often you'll hear it said that these varying effect estimates assume that the population effects have a Gaussian distribution. This fallacy arises from a very natural observation, so I can understand why people think it. When you read the model formula, you see a normal distribution assigned to them, and if you think of them the way you think about outcomes, maybe you think that their frequency distribution will be Gaussian. But of course, in neither case, not for parameters like the alphas, nor for outcome variables, does assigning a Gaussian distribution to them mean that their frequency distribution is Gaussian. It just means that the prior, or the distribution of the residuals in the case of outcomes, is modeled as Gaussian. The posterior distribution is free to have any shape it needs to if there's conflict between the prior and the likelihood. You've already seen an example of this, in fact, from lecture 10, and I copy it on the slide here. You may remember I did this example with the UC Berkeley graduate admissions data, where I showed you that a confound could mask discrimination in structures of that type. I had simulated differences in ability; those are the U values on the horizontal axis of the graph on the right. And I had done it in a binary fashion: I just created people with value zero, who were average, and people with value one, who were excellent, really highly qualified individuals, I think I said in the lecture. Then we ran a Bayesian model that has U in it and assumes some effects, that is, that the U value an individual has will help them get admitted and also lead them to apply to different departments, and we estimated the underlying distribution of Us. As you see in the posterior distributions, even though I assigned a Gaussian prior to the Us, and you see it there, Normal(0, 1), the model identified the clusters and had no problem doing that. The prior is just a prior; the posterior can take a different shape.

Okay, varying effects are a really good default, and I believe that the majority of the time, when scientists are doing some kind of regression problem with repeat observations and clusters, varying effects are usually what they need, the best thing for them to use. If you don't use them, it's not the end of the world; they're better, but in many cases they're not essential. That said, there are practical difficulties in using them, and so part of teaching you how to use these models is not just how to build and interpret them, but also how to make them run correctly. There are complexities that have to do with writing the code and not with writing the model. What I mean by that is that the same mathematically expressed statistical model can be coded different ways, and those different ways of coding it can really matter. Starting in the next lecture, next week, we're going to look at this. We're going to look at two issues in particular. First, how to code multi-level models that have more than one type of cluster at the same time. For example, in the trolley problem data, we have stories and we have participants, and we'd like to include them both at the same time. How can we
have the model hold two populations inside it and structure them appropriately together in the response predictions? And second, how do we make these models sample efficiently? We're going to come back to skateboarding and divergent transitions again, to think about basically deforming the skate park to make the machine run better. Okay, that has been week 6 and your introduction to multi-level models. Starting next week, we'll have a full week of multi-level models, looking at more complicated things like multiple population types and varying slopes and other sorts of fun things you may or may not have heard of. That'll provide a strong foundation for getting into a bigger world of models that have models within models; that is, the multi-level technique opens a big horizon for us. I'll see you there.