Welcome back. I'm going to get right back into it, because we were in the middle of a unit last time. We were talking about tadpole mortality, and I used it to introduce the first multilevel model. This is a binomial model at the top, so I'll run you through it again. S sub i is the number of tadpoles that survive in tank i, and it's binomially distributed. N sub i is the density of tadpoles that were alive at the beginning of the period of observation, so that's the maximum number of observable survivors. And p sub i is what we're modeling; we want to make inferences about it, we're going to get a posterior distribution for it: the probability of an individual tadpole in tank i surviving. We put a logit link on this and attach it to a linear model, in this case the simplest possible linear model. It's just a lone parameter, a different intercept for each tank. We assign them all a common prior, and this prior is an adaptive prior, a varying intercept prior: instead of having fixed values inside of it, it has parameters. So this prior represents inference about the population of tanks: the average tank has a mean log-odds of survival alpha, and the standard deviation among tanks is sigma. We're going to learn those values from the data while we're estimating the actual survival rates in each tank. And what arises from this, magically, well not so magically, just logically actually, is pooling and shrinkage, which is what I was about to explain to you: what it is, and why we're interested in it. Here's how you fit it in map2stan. If you want to look at the raw Stan code for this, which I encourage you to start doing as soon as you start to feel comfortable with it, the Stan code looks very similar. This modeling language I'm teaching you is basically the same, just with different names for the densities, in a lot of different software packages, so you're learning something that will last you your whole life. Absolutely: when you are in a retirement home running Markov chains, you will be using similar languages, I assume, because this is as close as you can get in algorithmic form to the mathematical definition of the model. Any questions about this structure? Yeah. How do we get to see the Stan model? You type stancode, give it the fit map2stan model, and it will dump it out for you. I think I talked about that in the Markov chain Monte Carlo chapter; if you're interested in doing that, you can always get it, it's in there. What map2stan does is take that alist, translate it into Stan code, and pass it all off to Stan; Stan produces a bunch of samples, they come back, and I wrap it all up in a nice R object that you can interact with, like you're used to interacting with stuff, but all of the components are still in there. If you do str on these fit models, like str(m12.2) here, you'll see all the components; there's a bunch of shit in there, thank you, they call it cruft in programming, but that's just programmer language for shit. But no, it's all in there; there's Stan code and it'll spit it out, and it's not so different, you'll see that the model block looks almost identical to that. Okay, and there are model types which you're not going to be able to define in map2stan, so you'll have to do some raw hacking in Stan; it's not that hard.
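To make that concrete, here's roughly what the map2stan call and the Stan-code dump look like in R with the rethinking package. This is only a sketch along the lines of the book's m12.2; the exact prior choices and sampler settings here are my assumptions, so check them against the book before leaning on them.

    library(rethinking)
    data(reedfrogs)
    d <- reedfrogs
    d$tank <- 1:nrow(d)                        # index variable: one intercept per tank

    m12.2 <- map2stan(
        alist(
            surv ~ dbinom( density , p ),      # S_i ~ Binomial( N_i , p_i )
            logit(p) <- a_tank[tank],          # logit link, one intercept per tank
            a_tank[tank] ~ dnorm( a , sigma ), # adaptive prior: population of tanks
            a ~ dnorm(0,1),                    # prior for the population mean (log-odds)
            sigma ~ dcauchy(0,1)               # prior for the among-tank standard deviation
        ),
        data=d , iter=4000 , chains=4 )

    stancode(m12.2)   # dump the raw Stan code that map2stan wrote for you
    str(m12.2)        # all the components of the fit object, cruft included

The model block in the dumped Stan code reads almost line for line like the mathematical definition above, which is the point.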
Okay, so here's what happens, and this is the slide that I ended on last week. What I'm showing you across the horizontal here are the different tanks; there are 48 tanks in this data set. On the vertical I'm showing you the proportion surviving, for two different sets of estimates. The blue points are the raw fixed estimates: these are what you get if you just took the number of surviving tadpoles in each tank and divided it by the density of tadpoles originally alive in that tank. They're just the raw empirical survival rates. And then the open points are the posterior means that we get from the multilevel model, the alpha-tank estimates, one for each of the 48 tanks. I'll tell you what that horizontal dashed line is in just a second. First I want to draw your attention to the phenomenon here, which is that there's a gravitational force induced by that dashed line: all of the open points have drifted towards it, relative to the blue points that they are paired with. So there are two estimates here for each tank. One comes from a traditional fixed effects model, that's the blue estimate, and the open point is the multilevel estimate. The question is why they are different, and why they show the pattern that they do. This pattern is called shrinkage; you can think of it as the open points having all shrunk towards that horizontal dashed line, relative to the blue points. You see that? It's not just the general attraction to the line; the pattern of shrinkage here is informative about what's going on. Shrinkage results from the pooling of information across tanks. The tanks vary in the amount of data in them, because there are different numbers of tadpoles in them, so there's more and less evidence in different tanks. And when we learn about the whole population, we learn about the plausibility of different intercepts for different tanks, because they will be more and less plausible depending on what we learned about the whole population, and this means we can do better than the raw empirical average for each tank. That's the pooling thing I explained on Thursday with cafes, right, remember the cafes example: it's not just the data from the Paris cafe that can help you get a good estimate of what happens at the Paris cafe; the data from other cafes also help, because you're learning about the population, and you have a finite sample for the Paris cafe, and you want to augment that with the data from the other cafes. But how much you should augment it depends upon the variation among cafes, and that's what's going on in these models. What's cool about it is that the logic of that, and exactly how to do it optimally, is taken care of for you.
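If you want to redraw that comparison yourself from the fit above, the recipe is short. This is a sketch, assuming the m12.2 fit from the previous block; the plotting details are mine, not the book's exact figure code.

    post <- extract.samples(m12.2)

    # raw no-pooling estimates: survivors divided by starting density, one per tank
    d$propsurv.raw <- d$surv / d$density

    # partial-pooling estimates: posterior mean intercept per tank, back-transformed
    d$propsurv.est <- logistic( apply( post$a_tank , 2 , mean ) )

    plot( d$propsurv.raw , ylim=c(0,1) , pch=16 , col="blue" ,
          xlab="tank" , ylab="proportion surviving" )
    points( d$propsurv.est )                            # open points: multilevel estimates
    abline( h=logistic(mean(post$a)) , lty=2 )          # dashed line: population mean alpha
    abline( h=sum(d$surv)/sum(d$density) , col="red" )  # red line: raw grand mean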
All you have to do in Bayesian inference, and this is going to sound weird, is set up the assumptions: assume there's a population, make some parameters for its shape, and then set up the logical relationships between the population and the individual tanks. Then you don't have to be infinitely clever and do the calculus to figure out what goes on; the Markov chain does it for you and figures out the implications of the assumptions. Next week, when we talk about measurement error, this is going to be a huge bonus, and it's going to save us a ton of anxiety, because you don't have to be infinitely clever to figure out the implications of your assumptions; probability theory does it for you. You just have to make the assumptions. And if you're like me, your intuitions about probability theory are terrible, absolutely terrible, and it's because you're human: you did not evolve to do probability theory. So this is a huge thing to rely upon. And then, of course, the issue is developing intuitions after the fact, schooling your intuitions based upon the logical implications that arise. The fact that all of these open points have shrunk towards the horizontal dashed line relative to the blue points is a logical implication of the assumptions, so now let's think about why that is and what it means. The horizontal dashed line is the population mean; that's alpha, the posterior mean of alpha in the adaptive prior. And the red line that I've imposed on here now is the raw empirical mean: if you just take the number of surviving tadpoles across the whole data set and divide that by the number of tadpoles initially alive in the whole data set, you get the red line. It's not the same as the posterior mean of alpha. Why?
Because some of the tanks have more tadpoles in them than others. The ones that are smaller suffer more sampling variation; they represent the population less well. So if you just average naively across all of the tanks, without observing the fact that there's an imbalance in sample size among them, you commit this cardinal sin in biology called pseudoreplication; the biologists know that horrible word. The consequence is that you make the wrong inference, because there's heterogeneity among the tanks in their survival rates, and when there's variation and you just treat them as if they were all the same tank, you get the wrong inference about the actual population mean. The population mean is not the raw empirical mean; it's something you must infer in light of all the imbalanced sampling and everything else. And again, yes, this is terrifying: you'd have to be infinitely clever to figure all this out without the aid of, well, probability theory, which does it for you as long as you set up the assumptions. Assume there's a population, assume that each tank is sampled from that population, and then probability theory automatically tells you the logical implication of that, which is that the population mean, the statistical population mean among tanks, need not be the raw empirical mean across the whole data set. Does that make some sense? And so the shrinkage is towards the dashed line and not the red line. Now, they'll often be pretty close: if you have a big data set and the data are well balanced, they're going to be pretty close together. But there are cases like this one where they're not that close, and that's why I like this as a teaching data set: you can really see the difference. In particular, big tanks have lower survival on average; you can kind of see it in the data. In the big tadpole tanks there's a crowding effect. So on the left-hand side of this graph we have tanks with few tadpoles as the initial density, call them small tanks; on the far right we have large tanks; and there's density dependence, so you get more mortality in the high-density tanks. If you just do the naive averaging, the big tanks get an undue weight in what you think the average is; the raw average survival rate in the whole data set is dominated by the big tanks, because those are more tadpoles. That's what's going on, and that's why the red line is below the dashed line. Does that make some sense? This is hard to get; in fact it may take you a couple of weeks to actually wrap your head around it. You'll do some homework not fully understanding this, and then it'll come together. I think that was the first hand I saw. You can also sort of see that the shrinkage is less in a large tank than in a small one, as a result of the superior sample size? Yes, thank you, I've got a couple of slides coming up about exactly that; that's where I'm headed next. Other questions, yeah: how was the population mean generated so that it doesn't have this problem, isn't it also an average across all the tanks? You mean how was the horizontal dashed line inferred? It's not dragged down, because of the parametric structure of the model, which is a horrible answer, but alpha, the meaning of that parameter, is the average of the statistical population of tanks. So say you had a population of exactly two tanks, and one had twice as many tadpoles as the other. Now you want to say, okay, we sampled intercepts from a population; nature sampled mortality rates into
these tanks, and each tank only gets one vote in the inference of the whole population. So you've got to account for the fact that they're imbalanced, and not just pool them as if they were all the same. It's sort of like, if I'm interested in something like the speed of mammals, and, I need some examples, weasels; weasels and woodpeckers are all over my Twitter feed today, so you know what I'm talking about, I've seen a hundred copies of that photo, and Putin was on its back, and I don't know what's happening right now, but some of you don't know what I'm saying, and that's just normal. But, no, so, weasels and woodpeckers have different average locomotor speeds, I'll assert. If you measure a bunch of weasels, you get an estimate for weasels, and you measure a woodpecker just once, and there's a different amount of sampling variation in the first place. And if you pooled all that data and asked what's the average speed of vertebrates, you'd get a terrible estimate, because your estimate would be dominated by the weasel measurements. Is that making some sense? So if you wanted to say, on average across vertebrates, what's the average travel speed, you'd want to account for the fact that you've got way more data for weasels, and that's what this model does. And it does it because it's estimating the population while it estimates the individual intercepts for each tank, and the individual intercepts for each tank are using the data from that tank, pooled with a little bit of information from the whole population. How much? Well, there's a formula for it, but I'm not teaching it to you, because the model figures it out automatically. When there's more data in a particular tank, the error of the estimate in that tank is smaller, so you augment it with less information from the population. Tanks with very few tadpoles, and I'm going to explain this on the next slide, shrink much more towards the population mean, because there's more sampling variation, because there are fewer individuals, so you get a less precise estimate from them. But there are some examples coming; I've got like four slides coming that are supposed to unpack that point. There's another hand, yeah. I'm just curious about the fact that, with the average survival, when you're sampling these things, the intercepts seem to cluster above 80%. Yeah, and that has to do with the covariates that we're not modeling. In the full data set there are a bunch of covariates, and if you take these models and start adding in the covariates, the experimental treatments, you'll remove a lot of this unexplained variation from the population. So the presence and absence of predators is creating those two clusters that you're seeing with your naked eye. I forget whether the predators were dragonfly larvae, but I think they were. I'm just curious, is it a problem that this is being assumed to be a symmetric distribution, when it's not? Remember maximum entropy: when you say we assign a Gaussian prior to this population, you're not assuming it's symmetric in its empirical realization. You're saying, I want to estimate its mean and variance, and that's all you're claiming; that's the only information contained in the Gaussian distribution. And this is deeply weird, because you tend to think about its shape as being something about empirical realizations, but probability theory is epistemological. That probably makes no sense yet.
So what if it were like an exponential distribution? Okay, so say you put an exponential prior on a parameter. Is that a claim that the most plausible value is zero? No, it's a claim that all the values are greater than zero and they have some mean, and there's no other information, in the sense of information theory, embodied in that distribution. In a Gaussian, the only information embodied in it is the mean and variance, and you're making no claims about anything additional. It's just that, whatever those other moments are, if you don't want to make any claim about them, the most conservative distribution you can use is a symmetric one. But it's not a claim that the population really is symmetric. So I guess it's not a problem that... Well, it could be a problem; you could probably do better. Okay. Yeah, but, I think what your question is getting at, and you can tell me if I'm wrong here, is: if we assume that the population is Gaussian, does that force the distribution of these estimates to be Gaussian? It does not. You'll see it if you plot them out; they're not Gaussian around the mean, even on the log-odds scale. In fact, we're only assuming Gaussian on the log-odds scale, which is way different from the probability scale anyway. But it doesn't force it. They could nearly all be on one side; well, they can't all be on one side of the mean, but most of them can be. And it still makes no difference, because the population parameters are distinct from the individual ones. Yeah, so it's epistemological. This is a weird thing, and it may take years to sink in, and you don't ever even need to become comfortable with it, but it may take years. Probability theory is always about states of information. And sometimes those states of information make really good predictions about things that arise from sampling, but they needn't necessarily. Probability theory is incredibly successful even in cases where it makes crazy assumptions about things. As I said before, linear regression, the geocentric model of statistics, is unreasonably successful given how unbelievably bad its assumptions are. I mean, if you read the list of assumptions, you'd think, this is useless, I could never use this thing. So why is it so damn useful? Because it's just an epistemological claim: if all I care about in a measurement is its mean and variance, here's a model to track the change in the mean as a function of predictors. And it does that job really well. It makes terrible empirical predictions in most cases, because it's not an ontological model. Does that make some sense? It's tricky. This isn't to say we shouldn't be striving to do better; we should always be striving to do better; there's no single right model. Now we're back to, like, week one and my philosophical sermon; the multiplicity of models is on the horizon here. But this is a tricky thing. Anyway, is that like half an answer? Is it good enough? Okay. This is tricky stuff. It took me a long time to wrap my head around this too, because I had the same decades of scarringly bad stats courses as everybody else did. As I told you, I wanted to give you guys the stats course that I always wanted to have but never had, and I'm still trying to figure out what that course is. But the way things are usually taught, it's all about sampling distributions of observable frequencies.
But if that were true, then no classical statistical method could actually work, because they don't actually predict observable frequencies very well at all. You can assign a Gaussian distribution to all kinds of stuff that's nothing like Gaussian and still get really good inferences out of those models, and it's because they're just epistemological maximum entropy devices; that's what they are. They're information-processing machines, and so they can work even when the assumptions are purely epistemological ones. They're about: here's a machine, it starts with this state of information, these assumptions; what does it learn from the data? And we get advice from them. Anyway, I can try this again on Thursday when you've, like, slept on it, or had a glass of wine and thought about this; it's tricky. But think back to: in a maximum entropy sense, what's the information contained in this distribution? And the answer is often really surprising and minimal. Gaussians are like that; they're funny that way. Okay, let me get to shrinkage. So now we'll get to your thing. Oh wait, sorry. Yes, it's the mean of the imaginary statistical population we have inferred, absolutely. Yeah, yeah, absolutely, it's not the empirical average. It's, yeah, but that's not quite right either. Yeah, I'll go with that. I haven't even been drinking and I'll go with that; that sounds fine. It's hard because there are a bunch of averages. It's the same thing when I talk: you guys ask me questions and I hear myself back and I'm like, oh my God, did I say that? You have averages of averages. Later this week we're going to have distributions of functions; you're going to love me. So, okay. Yes, shrinkage. It really makes these models worthwhile. I'm pragmatic, like you guys, believe it or not; I'm mainly interested in making inferences about nature, and what I want are good estimates. And multilevel models use this pooling to give you better estimates: shrinkage makes better estimates than the raw empirical means. That seems like a paradox; in fact, I'll show you in a little bit, it's called Stein's paradox in statistics. And like most paradoxes, it's just a paradox of intuition. When the Bayesian model violates your intuition, it's nearly always your intuition that is wrong. It can also be a bad model; but if you believe the assumptions and you disagree with the conclusions, then you're the one who's wrong. So, let's look at the shrinkage, and I'll show you that it's better in a moment. You get more shrinkage when the tanks are small, when there's less data per cluster. What I mean is, there's more movement of the open points away from the blue points in the small tanks than in the big tanks on the far right. Why? Because there's less evidence per tank. So the naive posterior distribution, like in a fixed effects model, for the intercept of each small tank has a bigger standard deviation than it would in the larger tanks; there's less certainty there. So the Bayesian model automatically augments that estimate with more information from the population, and that drags it closer to the grand mean, the average tadpole in the average tank, as Bonnie said. Does that make sense? Likewise, in the large tanks there's more evidence, so the posterior standard deviation for that intercept parameter would be smaller in a naive fixed effects model, and the Bayesian model augments that estimate less with the population, because you're more certain about it.
So, again, back to your Paris and Berlin cafes: if you initially visited the Berlin cafe a ton of times, then the first time you visit the Paris cafe, you have a lot of confidence that it's like the Berlin cafe, because there isn't a lot of experience from the Paris cafe yet. But eventually you've got so much data from the Paris cafe that the population, the average in the population of cafes, doesn't matter at all; you hardly augment anything. That'll happen in these models too: eventually you can get so much data for a cluster that the population does nothing for you, or almost nothing. Nevertheless, if you have some clusters with very little data, the clusters with a lot of data are going to help you a ton to estimate them, because they give you information, and of course the medium-sized ones are intermediate. Notice also that the further a tank is from the population mean, the more it moves towards the mean. That's because, remember, there's a distribution of effects, and the more extreme estimates are less likely according to what we've learned about the distribution in the population, its mean and variance. So they're less plausible according to the posterior distribution, and they get pulled in further, out of the tail of that statistical distribution. Does that make some sense? It's like any situation where, say, you put three tadpoles in a tank: notice that we've got three tanks here on the far left where all the tadpoles survived, where the blue dots are up at one. That's probably sampling variation, right? You wouldn't want to infer from that that if we put 100 tadpoles in those tanks, they would all live. There's some real chance of death in those tanks, but since there are so few tadpoles, they all got lucky; nature flipped their coins and they all came up heads. Does that make sense? In bigger tanks you see fewer of those, I mean there is one large tank like that, but you see fewer, because sampling variation gives you extreme outcomes like that less often. And they get shrunk away from that extreme, because, yeah, such extremes aren't very plausible in the population. So here's where I think I already got to this slide by accident, answering your questions. You do have to be careful when comparing the intercept alpha in a varying effects model to an intercept named the same thing in a fixed effects model, because they don't mean the same thing now. In the varying effects model, alpha is a feature of the population you're estimating; it's not an empirical feature of any particular unit in the data. And so nearly always what happens, and this is why you want to watch out for it, is that the standard deviation of this intercept parameter in a varying intercept model is going to be a lot wider. Why? Because it covaries massively, it's highly correlated in the posterior distribution with all the intercepts; there are lots of combinations of alpha, the alpha tanks, and the standard deviation sigma that are equally plausible given the data.
And so if you just look at the marginal posterior distribution, like in the graph above on this slide, for alpha, which is shown in blue in the varying intercept model, it'll look like you don't know where it is. But if you take that posterior and combine it with the samples for the individual alpha tanks, you'll see that the joint uncertainty is much smaller; the marginal standard deviation alone is a lot wider. This is the lesson I've been harping on forever: marginal posteriors lie, they're lying dogs. Don't look at just one; if the predictions depend upon a bunch of parameters, then you've got to combine them to see what your certainty is like. I've shown a bunch of examples like this, so I don't need to go over it too long. This is a weird thing, because as soon as we get to varying slopes the same thing will happen, but then it's beta coefficients: people see that their slope is "not significant" and they freak out, and then they go back to the fixed effects model because they want to publish, and that's a completely wrong inference. The slope can be just as important, or even more important; it's just that you're only looking at part of the effect now, and that's what you want to think about here: you're only looking at part of the intercept, and this part of the intercept has a wide standard deviation, and the other parts might too if you look at them marginally; you might have to do pairs plots to figure it out. But this is the guts of the machine, and this is why people like this course make you open it up and fight with the carburetor and see how the engine runs. Does this make some sense? You just have to be careful about that; you should expect this to happen. It's a different parameter, it means something different, so you can't really compare it; just because it's named the same thing doesn't mean it's the same kind of parameter, even though in this case we called both of them alpha. The further from the mean, the more shrinkage you get, because the model is skeptical of extreme values: it thinks it's more plausible that those extreme values are due to sampling variation and the finite amount of data per cluster, rather than some genuine feature of that tank. Also, as a consequence, if there's less data in some particular cluster, a tank in this case, you get more shrinkage, because it's more plausible that the extreme value is just noise, and you need more information from the population to get a better estimate; likewise, with more data you get less shrinkage. Yeah, hand back there. So the intercept in the varying effects model is the mean of the intercepts of each tank, in this case? I don't know if I quite understood that statement, but roughly: the intercept alpha is the mean of the statistical population of tanks, and the sample size imbalance is all taken account of automatically, exactly. This is the awesome thing about Bayesian inference: it means you don't have to be clever, which is, I mean, usually people advertise it like it's a super clever thing to do, but actually I want to advertise it as the opposite. If, like me, you feel like you're not clever and nature is a lot smarter than you, then what you can do is make assumptions. You can say, I think these things come from some common distribution; the variation among these tanks is not infinite, right, so it makes sense to try to estimate it from the data. Once you make that assumption and assign parameters to it, the model takes care of all of the logic, the logical implications of imbalance among the tanks and everything, just because probability theory is counting up logically all the
ways these different things can happen. You don't even have to be able to do the counting yourself anymore, because we have clever robots to do it for us. That's a wonderful thing. Maybe that's bad, though, whether the clever robots are being used for the good of humanity; alright, we won't have that conversation. Other questions before I move on? It's how we conserve nature, species, yeah, I know, another conversation to have sometime after a beer. Okay. So this is often called pooling as well; shrinkage is the phenomenon you see in the comparison between the estimates, and the general statistical phenomenon is called pooling. This poster here is just to help you remember it; jokes help you remember things: pool, or the terrorists win, the usual way to motivate Americans to do anything. So what's going on here is that we're pooling information from all the tanks into a population, and then information is getting doled out, by Bayesian inference, by probability theory, to tanks where it's needed most, to improve the individual estimates. And all this is done simultaneously through the joint posterior distribution of all the parameters, so each tank informs the estimates of the other tanks. This goes back to the amnesia thing from Thursday that I launched this topic with: pooling is the result of just remembering information and learning as you go, so that if you remember the last cafe you were at, that gives you a prior for the new cafe you've arrived at, but you quickly update that prior with your experience of the new cafe. And by the way, the order you visited the cafes in is irrelevant, it's arbitrary, so you should simultaneously use your new data from the cafe you just visited to update the previous posterior for the other cafes. These models do exactly that; they do it without you having to be clever enough to know how to do it; you just define the assumptions. And what we get from this is better estimates. This is a famous thing in statistics. I said there's a non-Bayesian version of this, because there's a tradition of using multilevel models in frequentist statistics; they call it empirical Bayes, because the estimators end up looking very much like the Bayesian solutions, although they were derived completely differently. The person usually associated with this, and rightly so, is Charles Stein, who recently celebrated his ninety-something birthday; I think he's emeritus at Stanford. Stein has this very famous paper, famous only in statistics, with the horrible title "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution". What it means is just a finding about pooling, and it's called Stein's paradox: you can do better than the mean, if you want to predict the future, and you do better in this very paradoxical way, by using data from other clusters. You can make better predictions about the cluster of interest by pooling it with data from the other clusters, and this strikes people as really strange. A commonplace use of this is in baseball statistics: you're trying to figure out which rookie to bid for, on baseball teams, and the analytics of baseball are really well worked out. So do you use just that player's data, that finite sample? Maybe they're just an outlier who did really well in training camp, and so you get some skepticism from the population distribution of rookies about how much you should pay for this person. That said, there are some people who are truly outliers, famously so in baseball, and so pooling can hurt you in those cases where
there are true outliers who are super extreme, like orders of magnitude out. I think baseball doesn't furnish as many examples of that, but there's a famous case in cricket. I don't know if anybody here has ever played cricket; it's a strange sport, very entertaining though, and they play it in the Commonwealth countries and they're crazy about it in some places. So there's this famous Australian cricketer whose name I forget for the moment; if anybody here is Australian, you know who I'm talking about. I know who you're talking about, yeah, but I don't remember either. Yell it out if you remember; I'm embarrassed that I can't remember. He's in the video games and everything, yeah, exactly, he's huge. You go into Australian homes and people just have portraits of the guy in their house; he's a national hero, and he was an order of magnitude, I'm not kidding you, better than any other cricketer who's ever lived, worldwide. He's just amazing. But you need a lot of data about someone before you're sure they truly are an outlier. The other players are perfectly Gaussianly distributed, and then there's this Australian guy who's just truly way out there, anchoring the end of the scale. I can't remember his name. Bradman, that's it. I was in Australia a long time ago, and this weird guy was in everybody's house, and I'm like, who is this guy? Oh, you don't know the Bradman? They call him the Bradman; he's a big deal. He's a pretty amazing athlete, really extraordinary, and he hated being famous too; he's an interesting guy. Anyway, I just put this up here to give you some of the history. Stein's paradox is interesting, and there's this great paper which gives you some of the culture of statistics around it, by Efron and Morris from 1977, called "Stein's Paradox in Statistics"; it's on the internet, search for it, it's a cool paper. They look at some empirical examples where these pooling estimators help you a lot to get better estimates. Here's a famous case: a study of toxoplasmosis in El Salvador, I think done in the early seventies, with more and less data in different places. It's this classic hazard in estimation: there's imbalance in sampling, so you can't just take naive averages across things, but you really care about the variation too; you want good, precise estimates for every locale in El Salvador. So when you look at these pooling estimators, this is the way they're often plotted; I'll explain it real quick, just because it's a motivating example. What you're looking at up there on the top are the naked estimates, the raw empirical estimates for each district in El Salvador, each site where they got data, and notice that they're scattered quite a lot. That sort of shelf that's going up is showing you the standard deviation of the estimate: when the standard deviation is high, there wasn't a lot of data from that place, from that town; when it's small, there was a lot of data. Urban centers have very small standard deviations, because you're really sure what the rate is there; in small villages you're not so sure, because you could only sample a few adults. Then what happens in the bottom is that we get what are called the James-Stein estimators; for the sake of this discussion you can think of those as the posterior means you'd get from the Bayesian multilevel model, although this is a frequentist analysis that arrives at the same logic. They're pulled towards the inferred mean, and the more extreme ones shrink more, because they're less plausible, and shrinkage is also proportional to the sample size. So this one that's far out on the left had a very large standard deviation because it
was from a small locale, and they got a really extreme estimate; it gets shrunk to be much more reasonable. And what we know from simulation is that this improves estimates on average, because a lot of the scatter in the top part of that graph is just sampling variation and imbalance in data across sites. Does this make some sense? So all of this is really about overfitting again; overfitting is the specter of our lives. Think back to Ulysses' compass, from however many weeks ago, where I made fun of Occam's razor, because Occam's razor really gives you advice in only one direction: if you had two models with equally good predictions, prefer the simpler, that's Occam's razor, basically. It doesn't give you any advice about trading off predictive accuracy against complexity, which is unfortunately the problem we usually have. That's the point of the Ulysses' compass idea: you need to trade things off, you've got to choose which hazard to sail closer to, and ideally you'd like to navigate between them and not kill any sailors. So this is trading off overfitting and underfitting. One way you can think about this is that varying effects are adaptive regularization; they're solving the overfitting problem. We use regularizing priors because we don't like to overfit, and I introduced those already; but these aren't just regularizing priors, they learn the prior, they learn the amount of regularization from the data. So you can legitimately think of them that way as well, and this is why they do a good job: they tune the compass from the data itself. In reasonably sized samples you get a lot of information about how skeptical you should be, given the variation across units. So think about it this way: you could have a model in which you don't even distinguish among the different clusters, and you just use the grand mean to make a prediction for every cluster. This would be like taking that red line I had on the tadpole tank graph, which is just the naive total survival rate across all the tanks, and using it to predict each tank. This is maximum underfitting: you're doing a terrible job of predicting every tank, but it's the simplest model possible. The other end is the fixed effects model, the amnesiac model, which is maximum overfitting, because now you only get to use the data from each individual tank, ignoring all the data from the others. It's a complex model, and it maximally overfits, because you're trying to estimate a whole batch of parameters with tiny bits of data for each. And the adaptively regularizing model, the varying effects model, tries to learn how far between these extremes it should be, and it tries to learn that from the data. It won't be perfect, but in most cases simulation tells us it does a better job, and that's why I argued this deserves to be the default form of regression. Which doesn't mean you have to do it; it just means it would be better to start with the idea that you should do it, and then back out of it when it turns out you don't need it, rather than the other way around. So, really quick, I'm not going to spend much time on this, because this is a section of the book where you can simulate this over and over again if you want: simulate tadpole mortality and then fit the varying effects model. What's good about the simulation is that you know the true survival rate for each tank, you know everything, because you plugged it in there, but your model doesn't. From the model's perspective, and, a bunch of people wrote to me over the weekend about the zero-inflated models, because I did a horrible job there,
they rightly wrote to me, but it didn't change anything; it was meant to be a bonus, but I understand now that it just confuses people, so I'll try to do better in the future. But the whole point of the simulation is to validate that this machinery is working, and you can only do that when you know the true parameter values. To the model, simulated data looks the same as real data; it has no idea; models are stupid, they can't tell the difference between dummy data and real data, it all looks the same to them, so it's a true test. We've got 60 ponds now, with a range of sample sizes: 5, 10, 25 and 35 tadpoles, in 15 ponds at each of those sizes. In this table there are a bunch of ponds, I'm only showing you the first 10, and all of these are small ones, with only 5 tadpoles to start with. There's some true log-odds of survival in each pond, which has come from a population, and then we simulate the number surviving from a binomial distribution. In some of them, as it happens, they all live, and some of that variation is due to the variation in these true survival rates. Then we get estimates from different models. There's the no-pooling estimate, which is your fixed effects model, the amnesiac model, and you notice we get extreme ones: when they all survive, the estimate is that there's a 100% probability of survival, which is pretty naive. Then there are the partial-pooling estimates, which are the varying effects estimates. And then we have the true values, which are just the logistic transform of the true log-odds, and then we can compare the accuracy. So let me quickly show you what happens; this is one particular simulation, but it's a representative one, I'm not trying to lie to you here, and I give you all the code to do this as many times as you like; sit down with your favorite beverage and just run it over and over again. The absolute error is on the vertical: that's the absolute value of the difference between the parameter estimate, the posterior mean, and the true value, which we know because we simulated it. Blue are the raw proportions, the fixed effects estimates, the maximally overfitting ones, and the open points again are the multilevel estimates. These bars on here show you the averages, to help you figure out what's going on: the blue bar is the average raw error for the ponds in this size category, and the dashed black line is the average multilevel error. What I want you to see, first of all, is that the average multilevel error is lower: it does better, because the pooled estimates have done better in the simulation trials. That's nearly always true, especially for small ponds, because there isn't a lot of data in them, so the population can give you a lot of information to help. For large ponds you get almost nothing. There are 15 large ponds in the sample as well, with 35 tadpoles in them, and when you have 35 tadpoles you get a really good estimate of the actual probability of survival in that pond, so there's almost no difference there; the pooled estimates are still a tiny bit better, the line is below, but almost nothing. This is a general feature, and I'm going to come back to this point later today: even if you don't get any benefit from shrinkage, you might want to learn about the population, so you can extrapolate to new clusters; so you may want to estimate the variation anyway. And the sizes in between are intermediate; the benefits diminish as the amount of data goes up, though there's still information flowing from the right-hand side of the graph to the left-hand side, out of those big ponds, to improve the estimates on the left.
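The simulation itself is only a few lines. Here's a sketch of the kind of thing the book's code does; the population values a and sigma and the sampler settings are illustrative placeholders, not necessarily the book's exact numbers.

    # true population of ponds
    a <- 1.4 ; sigma <- 1.5 ; nponds <- 60
    ni <- as.integer( rep( c(5,10,25,35) , each=15 ) )   # 15 ponds at each sample size
    a_pond <- rnorm( nponds , mean=a , sd=sigma )        # true log-odds of survival per pond
    dsim <- data.frame( pond=1:nponds , ni=ni , true_a=a_pond )

    # simulate survivors, then compute the no-pooling (raw proportion) estimates
    dsim$si <- rbinom( nponds , prob=logistic(dsim$true_a) , size=dsim$ni )
    dsim$p_nopool <- dsim$si / dsim$ni

    # fit the varying intercepts model to the simulated data
    m_sim <- map2stan(
        alist(
            si ~ dbinom( ni , p ),
            logit(p) <- a_pond[pond],
            a_pond[pond] ~ dnorm( a , sigma ),
            a ~ dnorm(0,1),
            sigma ~ dcauchy(0,1)
        ),
        data=dsim , iter=10000 , warmup=1000 )

    # partial-pooling estimates and absolute error against the known truth
    est_a_pond <- apply( extract.samples(m_sim)$a_pond , 2 , mean )
    dsim$p_partpool <- logistic( est_a_pond )
    dsim$p_true <- logistic( dsim$true_a )
    nopool_error   <- abs( dsim$p_nopool   - dsim$p_true )
    partpool_error <- abs( dsim$p_partpool - dsim$p_true )

Run it a few times with your favorite beverage: the partial-pooling error is usually smaller, and the gap is biggest for the 5-tadpole ponds.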
And in the multilevel model, is alpha a weighted mean? Yes, but it's weighted. In the simplest, like, Gaussian-Gaussian models, which this is not an example of, there's a formula for it, sketched just below, that shows you how to do the weighting; if you dig into the literature you'll find that formula. What you do is weight by something called the precision, which is the inverse of the variance: if there's a very precise estimate for a particular cluster, then it gets a big weight when you try to figure out the population. What's tricky about this is that information is flowing in both directions, because it's a joint probability distribution: the individual estimates for each cluster now depend upon the population estimate, but the population estimate depends upon them. So yeah, try to do that in your head; that's why we have math. It's a weighted mean, that's correct, but the second-order intuition is that there's information flowing in both directions, so it's not exactly a simple weighted mean. Does that help? Okay. Can you say something about that point, that you might want to learn something about the population: what about if you have a categorical variable, maybe it's an experimental condition, would you pool across conditions? So question one is: what if your categorical thing is an index that goes over different treatments? In most cases, no, I wouldn't do it, because of this issue: in statistics there's a term called exchangeability, which in the simplest case basically means, are the subscripts irrelevant? Can you just exchange the ordering among them without changing the information? In those cases, when you have exchangeable things, pooling is almost always an aid. And you could code your treatments so that they're exchangeable, but you probably know more about them; there probably are actual interventions, and you know something about them that explains them. If you just wanted to measure the variation across the treatments, then yes, by all means, number them 1, 2, 3, 4, 5, plug that into a varying intercepts model, and what you're doing is a hierarchical analysis of variance across the treatments, and that's okay, but I bet you can do better. I mean, I often start, even when I've got a bunch of predictor variables, with just a varying intercepts model with all the different kinds of clusters in the model, I'll show you how to do more than one kind of cluster in a little bit, just to figure out where the action is in the data, at what level the variation lives. That's a Bayesian analysis of variance with shrinkage, and it's often really useful as a first go. But then your experiment tells you the model structure you're aiming for, and then you start adding in predictors, and you could probably do better by saying, like, treatment 3 was hot, something like that, than by just pooling. That said, we'll get to varying slopes, and I bet this question will come back, and then you will ask it. Okay, was that a hand? That was out of the room, okay, that was like sign language or something, sorry. Good? No?
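And since I promised a sketch of that weighting: for the simple Gaussian-Gaussian case, the classic partial-pooling estimate for a cluster is a precision-weighted average of that cluster's own mean and the population mean. This is just the textbook formula written out in R; the numbers in the example calls are made up to show the behavior, and remember that the binomial tadpole model has no closed form like this, which is why the Markov chain does the work instead.

    # partial-pooling estimate for one cluster j in the Gaussian-Gaussian case:
    # a precision-weighted average of the cluster mean and the population mean
    shrink_mean <- function( ybar_j , n_j , sigma_y , alpha , sigma_a ) {
        w_cluster <- n_j / sigma_y^2    # precision of the cluster's own data
        w_pop     <- 1 / sigma_a^2      # precision contributed by the population
        ( w_cluster*ybar_j + w_pop*alpha ) / ( w_cluster + w_pop )
    }

    # lots of data: the estimate barely moves from the raw cluster mean of 0.9
    shrink_mean( ybar_j=0.9 , n_j=100 , sigma_y=1 , alpha=0.5 , sigma_a=0.3 )  # ~0.86

    # three observations: the estimate shrinks most of the way to the population mean of 0.5
    shrink_mean( ybar_j=0.9 , n_j=3   , sigma_y=1 , alpha=0.5 , sigma_a=0.3 )  # ~0.59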
Okay, so let's add more varying intercepts. Often there's more than one type of cluster in the data. In the tadpole data the only real structure is that there are different tanks or ponds that tadpoles are found in, but often the raw observations fall into different kinds of categories at the same time; this is a routine thing. So I want to give you an example of this, using a data set you're already familiar with, and show you how to specify what's called a cross-classified varying intercept model, where you partition the variation in the data, you're learning about the populations of effects in all of the cluster types at once, and you get shrinkage in all of them. In other words, it's still adaptive regularization, if you prefer to think of it that way: you're putting regularizing priors on offsets for different kinds of categories in your data, and the model learns how much to regularize them from the variation that's present. So let's think back to the chimpanzees data. The raw observation is a pull, right, a zero/one pull of the left-hand lever, so we can do a varying intercept model on chimpanzees, and we'll do that. But there are also experimental blocks, and the blocks were sessions in which they brought in the chimpanzees and had them do a set of trials; chimpanzees get bored with this pretty fast, there are only so many food items they want, I think. Since the blocks happened at different times on different days, there may be unmeasured covariates that affected the behavior of all the chimps before they were brought in to do the experiment, so in one of the blocks maybe everybody is sulking, or something like that. This happens, it really does; weather and all kinds of unmeasured things can mess around with your experiments, so there's a long tradition in biology and the social sciences of thinking about experimental blocks, temporal correlations that arise from unmeasured covariates. So we want to do both of these things. What's interesting about this is that it's not nested: every chimpanzee is found in every block, so they're cross-classified; it's not a hierarchical data structure. If it were hierarchical, the model would look the same, it really would; sometimes software wants to make you code it differently, but it's nearly always the same model underneath. So here's what it looks like. Well, first let's just do a varying intercept model on chimps, to ease you into this. It's a little bit more expansive, and I'd like to use this chance to introduce you to another common convention for coding these, where we take grand alpha out of the adaptive prior and put it into the linear model. It's exactly the same model. Why? Because of how Gaussian distributions work: if you subtract the mean from them and center them on zero, as long as you add the mean back in at some point, the model makes the same predictions. You can add and subtract the mean from a Gaussian distribution whenever you want; pull it out and put it up in the linear model, and it's the same Gaussian distribution as if you'd left it inside. So you'll sometimes see them written this way, and you'll see in a moment that doing it this way can prevent a kind of mistake, non-identifiable models. So we get adaptive regularization on actors, this variation across actors, and then we estimate the variance for them. Does this make sense? And everything moves, all the intercepts shift, yeah, because they mean something different now, because they don't have the mean in them; but if you add the mean back to them, it's the same posterior. But
you're absolutely right: when you do it this way, the alpha actors are offsets from alpha; when you do it the other way, they already include alpha, so you don't have to add alpha to them. But you're right, they make the same predictions. I'm just highlighting for you alpha sub actor, bracketed by actor, the index of the actor variable in that case. Here's the cross-classified model, again highlighting the new bits in blue. Up in the linear model we add an alpha sub block now, one for each block; for case i, block[i] is an index variable for the alpha blocks in this experiment, the six daily sessions in which they ran all the trials. So whereas alpha sub actor is an offset for that actor, all of that actor's observations get offset by that amount on the log-odds scale, every observation in a particular block is going to get an offset as well, and we combine these in the linear model. And we make it another adaptive regularizing prior, by giving it a free parameter, sigma sub block, that we're going to estimate from the data, just like before. So you can have as many of these, within pragmatic reason, as you like, whatever your cluster structure is, and it can be really complicated. The traditional example in the social sciences is that you have questions, in exams, in students, in classrooms, in schools, in districts, in states, in countries, and people analyze models like that all the time in education research; there are a lot of varying intercepts in those models, and we'll get to slopes in a moment. What it does is handle a lot of cross-classification at different levels. Does that make some sense? Because there are different students all taking the same test, but not all students take all the same tests, and not all tests have the same questions in them, so you're figuring out where the variation in the outcomes lives. It's a fancy analysis of variance with shrinkage. And students also transfer schools, and classrooms, which makes it fun. You guys with me for a moment? Yeah, question: so your alpha actors, your estimates, won't necessarily be normal, right, or end up centered at zero, or is that also not necessarily true? Not necessarily. If you just take the posterior means for the alpha actors and average them, there's no guarantee it'll be zero, because you could have some outlier for whom there's a lot of data, and then the mean of the posterior means won't even be close to zero. You could have a Bradman; there could be a Bradman chimp; in fact, in this data we do have a Bradman, right? Oh, maybe that's the wrong thing to associate with the Bradman, because the Bradman did everything right; well, who's to say that chimp was doing anything wrong, really. Does that make sense? I'm very sympathetic to this being confusing, and it is weird, but the epistemological assumption about the distribution of effects is used to improve those estimates; it doesn't force them to look Gaussian, especially if there's imbalance in sampling or you've got something that's particularly extreme. Shrinkage moves them, but they can still be very skewed; in fact, in this data set they are. This is a good example, because most of the chimps are right-handed, and then you've got those few individuals who really strongly prefer the left lever. And the shrinkage is happening on the logit scale? Yes, it is, on the parameter scale. Okay, so how do you fit this? Just as you might think: I'm going to add a term in the linear model, and add another adaptive prior for block along with sigma block. The only
caveat here is that Stan doesn't like any variable called block, so I've had to call this block underscore num; block is a reserved word in Stan. This always tricks me, because I come back to this data set, use the name block, and then there's an error: you can't use the word block. Well, excuse me. There's another one like that too. Anyway, eventually Stan will train me not to use its reserved words, but for the moment, it'll tell you: you'll get a syntax error saying you can't use that word, and that's literally what it says. Okay, so, highlighting both of them now, you've got two sets of varying intercepts, they're both in the linear model, and it's nice to have the adaptive priors centered on zero, so that you don't accidentally make two grand alpha parameters and then find you can't identify either of them, which could otherwise happen. You with me? Alright, so what happens? Very quickly, you can compare the posterior distributions for both of the sigmas now, and they're both on the log-odds scale, so they're comparable. There's lots of variation among actors, we saw that before; in the graph at the bottom here, the black curve is the marginal posterior distribution for sigma actor, and that's a standard deviation of about two. This is on the log-odds scale, and, for example, a normal distribution with mean zero and standard deviation two gives you log-odds that cover the whole probability space, basically, once you transform back to the probability scale; it's a lot of variation. Remember, log-odds of four is basically "always", so two standard deviations out is 95 percent, so 95 percent of the samples from this implied prior are between minus four and four. That's a big range, and you knew that before from the data. There's not a lot of variation among blocks; instead, notice that sigma block is crowded up against the zero boundary. There's certainly some evidence of variation among blocks, because the chimps did behave differently on average across blocks, whether from sampling variation or something else, but not much. Most of the action here is in individual handedness preferences, and in one of the treatment effects, the side of the table with more food on it. And you can just see it if you look at these dot charts up here: in the top part of the dot chart you're seeing the actor estimates, and these are all offsets from zero, because they haven't got alpha added to them yet, which is another chance to say: if you want the individual intercepts, you add samples of alpha to each of those, and they're negatively correlated with one another, so those error bars are going to shrink when you compute the individual intercepts; the joint uncertainty is smaller than that. And then for the blocks, you can just see there's not a lot going on: there was this one first block where there was a little bit more pulling of the right-hand lever, and the last block had a little bit more pulling of the left-hand lever, but there's nothing really exciting there, which is good news. I mean, the experiment was good, right; you don't want block variation. Does this make sense?
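In code, adding the second cluster type really is just one more term in the linear model plus one more adaptive prior. This sketch follows the structure of the book's chimpanzees model; the fixed priors and iteration counts are from memory, so treat them as placeholders rather than the canonical values.

    data(chimpanzees)
    d <- chimpanzees
    d$block_num <- d$block      # 'block' is a reserved word in Stan, so rename it

    m_chimp <- map2stan(
        alist(
            pulled_left ~ dbinom( 1 , p ),
            # grand mean a pulled out of the priors; both sets of intercepts are offsets from zero
            logit(p) <- a + a_actor[actor] + a_block[block_num] +
                        (bp + bpc*condition)*prosoc_left,
            a_actor[actor]     ~ dnorm( 0 , sigma_actor ),   # adaptive prior: population of actors
            a_block[block_num] ~ dnorm( 0 , sigma_block ),   # adaptive prior: population of blocks
            c(a,bp,bpc) ~ dnorm(0,10),
            sigma_actor ~ dcauchy(0,1),
            sigma_block ~ dcauchy(0,1)
        ),
        data=d , warmup=1000 , iter=6000 , chains=4 )

    precis( m_chimp , depth=2 )   # shows both sigmas plus all the actor and block offsets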
Yeah, a question about whether this is the grand mean: we're not seeing the empirical grand mean here; alpha is the mean of the statistical population, which is why I tried to distinguish between alpha and the grand mean. It is not simply the average across all the chimps here. Actually, in this experiment they're going to be really similar, because everything is balanced; this is a nice experiment, and all the chimps pulled levers exactly the same number of times in every block, so they should be really similar. They won't be exactly the same, but they'll be pretty similar. They won't be exactly the same because you've still got that crazy actor number two, and the model is pretty skeptical that most chimpanzees are going to be like that. But quickly, let's talk about effective parameters. Remember, varying effects are regularizing priors, so what shrinkage does is make those individual intercepts less able to fit the sample, and ironically, not ironically, paradoxically, in the violation-of-intuition sense, if you tune that exactly right, you get better estimates, for exactly the same reason as always: overfitting is bad, and the sample is not what you want to generalize to. It's a tricky issue; in this case the sample refers to each chimpanzee, or each block. So let's do the quick model comparison here. Think about the model with both kinds of varying intercepts, actor and block: it has 18 parameters, but it ends up with about 11 effective parameters. It gets to shave off about seven of them, because the block parameters do almost nothing for overfitting; their standard deviation is really small, less than one at the posterior median, and so that's like putting an adaptive regularizing prior on all of those with a standard deviation like 0.2, so they can't move, right, they're stuck. It's a very informative prior, but it was learned from the data in this case. Make some sense? Meanwhile, the learned prior for the actors is looser; it has a standard deviation of like 2.5 or something at the median, so it's a lot looser. So the actor-only model loses fewer parameters as a proportion: it goes down from 11 actual parameters to about 8 effective ones, because they're not redundant parameters; there's less regularization going on, because there's more variation among actors, and we need those parameters statistically. And this is a tie, which is what I want to show you: the WAIC values are off by one point, and you should never get excited about that. They're the same model, and they'll make the same predictions. So this is a case where, what have you learned? You always learn more from comparing models in a set than from selecting one. It would be fine to fall back to the actor-only model because it's simpler, but you want to be able to say: yeah, we added block in and it makes exactly the same predictions, and I know why, because there's almost no posterior variation across blocks. The caveat here about these adaptive regularizing priors, of course, is that there isn't one particular value being plugged in; it's averaged over the whole posterior, and that's what's great about the Markov chains and everything. But still, heuristically, you can think about it as if you're plugging in the posterior median; it'll help you think about it, but really it's the whole distribution, averaged over, with the information cascading up.
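The comparison described just above is the usual compare() call. A sketch, assuming an actor-only fit alongside the actor-plus-block fit from the previous block; m_actor is my hypothetical name for it.

    # actor-only version: drop the block offsets and sigma_block
    m_actor <- map2stan(
        alist(
            pulled_left ~ dbinom( 1 , p ),
            logit(p) <- a + a_actor[actor] + (bp + bpc*condition)*prosoc_left,
            a_actor[actor] ~ dnorm( 0 , sigma_actor ),
            c(a,bp,bpc) ~ dnorm(0,10),
            sigma_actor ~ dcauchy(0,1)
        ),
        data=d , warmup=1000 , iter=6000 , chains=4 )

    compare( m_actor , m_chimp )
    # pWAIC is the effective number of parameters: the actor-plus-block model has 18
    # actual parameters but far fewer effective ones, because sigma_block is tiny and
    # the block offsets are strongly regularized; the WAIC difference is about a point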
Yeah, if it only makes a little bit of sense, that's fine. I think it's plausible that no human being really understands probability theory; we just develop motor habits that let us use it successfully. We can't really do probability theory in our heads; we need the pencil and paper and the computers and everything else. We're the masters of this thing, but it's a prosthetic: it lets us do thinking we can't do by ourselves. That's why we need machines, just like any number of other tools, mental prosthetics that help us think and expand our short-term memory. In this case, I think Bayesian inference is a way of seeing the implications of your assumptions in the light of data, and it's invaluable for that. I find it humiliating all the time, because it tells me my intuitions are pretty awful, but I value that. That's what science is about: it's the profession of humiliating yourself in public, and getting over the insecurity of that.

Okay, let's talk in the last 15 minutes about posterior predictions, and that will set us up for Thursday, when we do varying slopes and extend all this logic to any kind of parameter, so that even the effects of treatments can be pooled. We'll do that on Thursday. For now, let's focus on this gnarly issue of what we might do with estimates of the variation. Even when we don't have tons of data per cluster, we're still estimating that alpha and that sigma, and that gives us inference about the variation. So now imagine you want to generalize to some new tadpole ponds. You don't get to use the alphas for the individual ponds in this model, because those ponds won't be present next year; they've all dried up and all the tadpoles have been eaten. There will be some new ponds or tanks, or some new students and new classrooms, whatever it is you're clustering on. But the variation you've estimated among those effects, estimated from these data, can be used in forecasting, and it's invaluable for that. This is a subtle issue: with multi-level models, posterior prediction depends on choosing what's called a level of focus. Which parameters you get to use in forecasting depends on what you're trying to forecast. Sometimes it makes sense to use all those individual alpha-tank parameters from the varying-intercept model, because the cluster identities are persistent. In political science this is nearly always what they get to do, because nation states aren't born and don't die all that fast within the life of a single human, though sometimes they do; I was born during the Cold War, so I remember there used to be a wall in Germany, and it came down when I was in high school. So in political science you can fit a model with features of the political systems of particular nations, or segments of nations, and then use those varying intercepts to predict next year's data; for forecasting, that makes sense. In biology, I assert, this is rarely the case, and in other areas of social science it may not be the case either; it depends, and within the same data set you still might want to do both things. But in predicting for new clusters, you can't use the varying intercepts from the clusters in the sample, because it wouldn't make any sense: you're not going to see that tadpole tank again, you're not going to see that classroom again, and you're not going to see that individual chimpanzee in another experiment.
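For the new-ponds case, a minimal sketch of what that forecasting looks like, assuming the tadpole fit from earlier is the object m12.2 with population parameters a and sigma (names as in the book): you simulate brand-new ponds from the learned population, rather than reusing any observed tank's own intercept.

```r
post <- extract.samples(m12.2)   # posterior samples of a, sigma, a_tank[...]

# one simulated new pond per posterior sample: draw its log-odds of survival
# from the learned population distribution, then convert to probability
logodds_new <- rnorm(length(post$a), mean=post$a, sd=post$sigma)
p_new <- logistic(logodds_new)

mean(p_new)   # expected survival in a never-before-seen pond
PI(p_new)     # how much new ponds are expected to vary around that
```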
But the variation among the tadpole tanks, or among the actors, is informative about how well calibrated you can expect future predictions to be, using the other parameters. So you get to use all the other parameters; you just don't get to use the varying intercepts. Makes sense? So I'm going to give you some examples now of the different kinds of posterior predictive distributions you can generate from a model like this. In the book I go through the code pretty slowly; that section got about twice as long as I originally planned. I have a page budget and I'm way over it, I think I'm 200 pages over it actually. So you're going to want to sit down with that section as well, but I want to motivate it for you here in lecture, show you what the graphs look like, and make sure you understand the contrast between what they mean, so that you know when you want to use each of them. Often you might want to do all of them for a study, because they tell you different things, and there's no single perfect way to present and communicate the uncertainty in your inferences; it depends on what you know and on the audience you're speaking to.

So here is my obligatory reminder, which is partly frustrating for everybody, students and instructors included, because we're put in the situation of trying to give you generally useful advice without knowing what you're working on. We can use examples in class, but those examples won't be exactly like the data you eventually have in hand or the problems you work on. And the nature of horoscopes, as I assume you know, is that they can only seem plausibly useful because they're hopelessly vague: they say bland things that apply to everybody. "It's a good time to make a financial investment." Okay, might be true. Stats advice is kind of like this, in the sense that I can tell you true things; I'm not making it up like the horoscope writers, although they're not exactly making it up either, they really believe it, they have charts. They're not lying, and I'm not lying to you either: I'm telling you things I think are true, but my advice is often bland, and I have to show you a bunch of different ways to do things because I'm trying to cover the bases. I know it's frustrating; it's frustrating for me too, but that's why it's like that.

So with that said, let me give you some frustratingly vague advice. Let's take the chimpanzee example and think first about predictions for the same clusters, meaning predictions for these particular chimpanzees. I'm not going to go through that in detail, because those predictions are no different from before: you can use link and you can use sim, and they'll take account of the varying intercepts. And if you don't have those functions at your disposal, you know the model; these are still just parameters, you plug them in and go. If you're making predictions for individual chimps in the sample, you don't need the grand alpha and sigma, because those are features of the population, and you're talking about this specific chimpanzee: number two, bless her heart, always pulls the left lever. You've got her intercept on the log-odds scale, and on the probability scale it says she basically always pulls the left lever, with a tiny interval.
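In code, that same-actor (retrodiction) case might look like the following, a minimal sketch assuming the actor-plus-block fit is called m12.5 and uses the block_num variable from earlier; the choice of actor 2 and block 1 is just an example.

```r
# prediction grid: the four treatments for one particular actor in one block
d.pred <- list(
    prosoc_left = c(0, 1, 0, 1),   # prosocial option on the left?
    condition   = c(0, 0, 1, 1),   # partner present?
    actor       = rep(2L, 4),      # chimp number two, the left-puller
    block_num   = rep(1L, 4)
)

p.link <- link(m12.5, data=d.pred)   # posterior samples of p, using her own intercept
apply(p.link, 2, mean)               # mean probability of pulling left, per treatment
apply(p.link, 2, PI)                 # and the percentile intervals
```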
But what about new chimpanzees? Now you've got choices to make. Oh, and I should have said when I showed this slide: the first part is same actors, which is what I just described. For new actors, this is a special kind of counterfactual prediction: you're trying to make forecasts for cases that are not in the data, so it's not retrodiction. There are different choices you can make, and I'm going to show you all three of these, because they're all useful in different contexts and they show the uncertainty in different ways. The first is what we call the average actor. You can generate predictions for the average actor, which means using alpha: imagine an actor whose log-odds of pulling the left lever is exactly alpha. There's still uncertainty, because we don't know alpha for sure and we need to propagate that, but we can do it that way, and it's something that often makes sense to people; it's the average statistical chimpanzee. Then you can compute something slightly different, called the marginal actor. Marginal actor means you simulate a bunch of actors from the distribution defined by alpha and sigma, I'll show you this, and then you average over that variation, and there's always a lot more variation there than for the average actor alone. The average actor is an actor you're claiming sits right at the mean, and that's different from asking: say I got a bunch of new chimpanzees, what's the spread of their behavior going to look like? For that you need to use sigma as well, and I'll show you what that looks like; it's one way to appreciate how variable actors are. And then I want to show you what I think is a superior way to display variation across actors, which is to just plot a bunch of trend lines for a bunch of simulated actors, and let the scatter of the trend lines stand in for the shaded confidence regions. It communicates a lot more, and it's a lot easier to see, I think, so I want to motivate you to try that instead.

Here's what the three look like, and again, apologies, the code is in the book. There are no new tricks in this code, you've seen it all before, but like all new code it's initially frightening; you just have to take your time with it and bug me about it. On the left we've got the average actor. What I've done here is use the posterior distribution of alpha, so we're still averaging over the posterior; we don't know the intercept for sure, and that generates a lot of uncertainty. The treatment effects are still in here, so you see that zigzag we had before: when there are two pieces of food on the left, they go for the left lever more often, but they don't care whether another chimpanzee is there to receive the other piece of food. The shaded region shows you the uncertainty in the parameters used here, alpha and the beta coefficients for the treatments. The middle panel is the marginal actor. You'll see that the gray shaded region takes up nearly the whole plot now, and that's because there's a lot of variation in these preferences across chimpanzees. When you simulate a bunch of individual chimpanzees, get their trajectories, and then average over them, the predictive distribution of how often they'll pull the lever covers a really wide range, because you might get more actor-number-twos; they're not super common, but the possibility creates a lot of heterogeneity in what could happen. Nevertheless, deep in there, there's still the Charlie Brown zigzag. Maybe I should ask who knows Charlie Brown; the zigzag that used to be on Charlie Brown's shirt is in these data, it's what happens across the treatments, and it's still in there, but it's hard to see; you can just about make it out at the bottom of the shaded region.
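A minimal sketch of the average-actor and marginal-actor calculations, done by hand from the posterior samples rather than with link; this assumes the actor-only fit is called m12.4 (the book's name for it) and simply leaves any block effect out of the linear model.

```r
post <- extract.samples(m12.4)   # samples of a, bp, bpc, sigma_actor, a_actor

# probability of pulling left for a given actor offset, one value per posterior sample
p_pull <- function(prosoc_left, condition, actor_offset) {
    logodds <- post$a + actor_offset +
               (post$bp + post$bpc*condition)*prosoc_left
    logistic(logodds)
}

pl <- c(0, 1, 0, 1)   # prosocial option on the left?
co <- c(0, 0, 1, 1)   # partner present?

# average actor: an actor sitting exactly at the population mean (offset = 0)
p.avg <- sapply(1:4, function(i) p_pull(pl[i], co[i], 0))
apply(p.avg, 2, mean); apply(p.avg, 2, PI)

# marginal actor: a freshly simulated actor for every posterior sample,
# so the variation described by sigma_actor is averaged over as well
offsets <- rnorm(length(post$a), 0, post$sigma_actor)
p.marg <- sapply(1:4, function(i) p_pull(pl[i], co[i], offsets))
apply(p.marg, 2, mean); apply(p.marg, 2, PI)   # much wider intervals than p.avg
```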
So what I prefer is on the right. This is sampling 50 actors: for each of them, just once, we sample an alpha offset from the posterior distribution, and then we calculate what that actor will do across all the treatments, given those fixed parameter values. Then we do it again: sample from the posterior all the parameters we need, simulate another actor's trajectory across the treatments with fixed parameter values, and so on. It recapitulates what's going on in the marginal-actor panel, and quite well: you can see there's a lot of variation in handedness, but you can also see that it isn't interacting much with the zigzag pattern. The zigzag is there; it's a really reliable effect. Chimps are attracted to the side of the table with more pieces of food on it, but they don't seem able to figure out that they're not going to get that other piece of food, or at least they don't change their behavior on account of it. Does that make some sense?

So you're pulling individual samples out of the map2stan fit for just a random actor? Exactly, I just extract the samples from the model; now you've got samples for all the parameters, and you can treat each row of samples as an actor. Not by taking the first alpha you get, though; you've got to simulate the actor using alpha and sigma, that's the key thing. Take a particular row of samples from the posterior, take that row's alpha and sigma, draw one random normal from them, just one: there's your actor's offset. Added to that row's alpha, it gives you the actor's log-odds of pulling left when all the treatments are zero, and then you calculate the treatment effects as usual, using that row's beta coefficients.

So if your random intercepts are skewed relative to the normal model, like with cricketers, where it's possible to have a really good player but pretty unlikely to find a terrible one who is still a professional cricketer, does that mean you need to update the prior? Well, if they're skewed, then use a skewed prior; you can do that. Although with the cricketers, the distribution is empirically Gaussian, except for Bradman; basically they're Gaussian, it's incredible. There are reasons for that, and this is a thought experiment where you're bumping up against epistemology, making claims about samples, but if you want to say it's skewed, then put that into the prior: use a skewed prior, or a fat-tailed prior. You can use Cauchy distributions for the varying effects too; they work fine for shrinkage. The Cauchy is a special case of the Student-t distribution, you may know, the one with a single degree of freedom. Those are fat-tailed distributions, and they'll give you somewhat different inferences, because they represent a particular claim, when you're fitting the data, that this is a really fat-tailed population with a lot of outliers in it. Maybe there are, or maybe there are covariates that explain those outliers instead; that's up to you and your science.

Yeah, I'm just wondering about those outliers: won't sigma be large when an outlier is there, since it has to fit in? It depends on how many there are. If there's only one Bradman, sigma won't be very large, and the model will shrink Bradman, unless you have his whole career's worth of data, in which case he'll barely shrink at all. But sigma will still be small, because one player has almost no effect on the total distribution when there are thousands of other players in the data set; you'll mostly just get a Gaussian curve around them. It all depends on the details.
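Going back to the right-hand panel for a moment, the row-by-row procedure just described might look like this, a minimal sketch again assuming the actor-only fit m12.4 and using 50 posterior rows, one simulated actor per row.

```r
post <- extract.samples(m12.4)

pl <- c(0, 1, 0, 1)   # prosocial option on the left?
co <- c(0, 0, 1, 1)   # partner present?

plot(NULL, xlim=c(1, 4), ylim=c(0, 1), xaxt="n",
     xlab="treatment", ylab="proportion pulled left")
axis(1, at=1:4, labels=c("0/0", "1/0", "0/1", "1/1"))

for (i in 1:50) {
    # one actor per posterior row: draw a single offset with that row's sigma,
    # then trace that actor's trajectory across the four treatments
    offset_i <- rnorm(1, 0, post$sigma_actor[i])
    logodds  <- post$a[i] + offset_i + (post$bp[i] + post$bpc[i]*co)*pl
    lines(1:4, logistic(logodds), col=col.alpha("black", 0.25))
}
```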
So, to echo what I said at the beginning of the course: Bayesian inference is just logic, and so it's garbage in, garbage out. Logic is amazing because it often schools our intuitions, and our intuitions are terrible, especially about math; at least mine are. Nevertheless, there's no way out of the trap of being responsible for making the models. Bayesian inference is a theory of what to do with models once you've got them, and that theory is basically: use logic to grind through all the ways the data could have arisen, given your assumptions. That's great, and it's super useful, but, as I already joked once today, it's not a substitute for science; you've still got to get your model from some better foundation. And this is the worst part of it, I think, and since we've got one minute left you're getting my philosophy of science: people develop this naive attitude of, oh, I can come up with any hypothesis I want, run an experiment on it, and if the p-value is less than 5%, it's true. No: if your hypotheses are garbage, you're still going to get some p-values under 5%, and they'll almost all be false positives. It's like the medical testing thing; hypothesis generation is the most crucial step of science, and statistics does nothing to fix it. If people in, let's pick on a field, mainstream social psychology just say, let's make a treatment where we turn on the air conditioner and another where we don't, and then ask people how much they like the president, and they run endless studies like this, psychologists back me up on this, tons of these things get run, then what happens is that some of them produce asterisks and those get published, and they're almost all false positives, because when almost everything you test is false, almost every positive indication will be a false positive. It's like the prenatal screening example; Paul's been scouring the internet for medical statistics to make examples out of. People get prenatal screening now for genetic disorders like Down syndrome, which are rare in the tested population; most tested fetuses don't have the condition, but positive signals get acted on anyway, which is a tragedy, because lots of healthy fetuses end up aborted on the basis of those tests. The things being screened for, which are horrible, mostly don't exist in the population being tested, and I think the same is true of hypotheses in social psychology: the true ones mostly don't exist.
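That base-rate argument can be made concrete with a toy calculation; the numbers here are purely hypothetical, chosen just to show the logic, not estimates for any real field.

```r
# hypothetical rates, for illustration only
prior_true <- 0.01   # fraction of tested hypotheses that are actually true
power      <- 0.80   # chance a true hypothesis yields p < 0.05
alpha      <- 0.05   # chance a false hypothesis yields p < 0.05 anyway

p_signif <- prior_true*power + (1 - prior_true)*alpha   # P(p < 0.05)
prior_true*power / p_signif   # P(true | p < 0.05), about 0.14: most "findings" false
```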