So we've got some good GLM-ing to do today, so I'd like to get right into it. When I left you before, we had fit our first binomial models with actual predictor variables in them. And here are the three models we're going to compare. The top model is the intercept-only model for the chimpanzee prosociality data. Remember, the outcome variable is pulling the left lever on this weird table. And then two other models: 10.2 just has whether the left lever was attached to two pieces of food or one — that's what prosocial left means. Then the model that matches the design of the experiment, 10.3, has an interaction with condition, which is whether or not there's another animal present at the other end of the table to receive the extra piece of food that gets teleported to that end of the table when you pull that lever. So let's do the model comparison. What I want you to see here, and I think it's fairly obvious from the weights, is that 10.2 does substantially better than 10.3. It's not impossible that the interaction matters, but three-quarters of the weight is getting assigned to 10.2. And you can see that the standard error is actually pretty small, so if you double it, it still barely overlaps the other model. So 10.2 looks pretty clearly better, and the interaction doesn't really help prediction here, given these data and these models. Let's look at the coefficients on 10.3 to check and make sure we know what's going on. The hypothesis is that when the other individual is present, they will pull the pro-social lever more, and you can see the marginal posterior distribution of the interaction effect is all over the place. It's slightly negative, but there's not much signal. And I leave it as an exercise to the student — although we're going to plot the posterior predictions later — to verify that when you push out predictions, and thereby deal with the covariances among these parameters, it doesn't change that story. The chimps basically don't care if there's another animal present, but they do care about which side of the table has more food, and you can see that in the coefficient bp, for which side the pro-social option is on. And yes, they do pull the left lever when it's attached to more pieces of food, even though they're not getting it. Just so no one thinks I'm trying to say chimps are stupid: chimps are really clever, but it's very easy in cross-species psychology to make another species look dumb, right, because you designed an experiment in which you would be smart enough to figure it out. The fact that they're not is not evidence they're stupid, right? I'm sure if I put you guys in the forest with a group of chimpanzees, you'd be the dumbest chimpanzee that ever lived. You'd be terrible at finding fruit. You wouldn't know how to make a nest out of branches. It would be a total mess. So I'm not trying to be a cross-species chauvinist here. This is just what happens when you design experiments that human children are good at. Other apes are bad at them. That doesn't mean they're bad animals, right? It's just how it goes. Okay, so now let's talk about interpreting effects in binomial models — logistic regressions. There are lots of conventional ways of reading these coefficients. And you folks know my mantra by now: I think interpreting coefficients is always really hazardous.
You get good at a particular set of models and particular kinds of data and you can do it, but you've got to be careful not to get cocky, because you can trick yourself. And I'm going to keep showing you examples of places where you can trick yourself. Remember, just with ordinary linear models and interactions, it's already hard, because the effect of changing a predictor variable depends upon more than one parameter in any model with interactions. And you can have strong correlations among parameters that you can't see in the table of coefficients — those tables only show marginal posterior distributions, so you miss the correlations among them. With generalized linear models, it gets even worse, because now everything potentially interacts. And in particular, there's this base rate effect, and I want to spend a little time warning you about it now. So one way to think about this is that in these kinds of models, the parameters are on a relative effect scale, because they live in the linear model space, independent of the other terms in it. Those beta coefficients are relative effects that ignore things like the intercept and where it's located. Because of ceiling and floor effects, the absolute effect on the prediction scale of a change in a predictor depends upon the intercept as well as the slope coefficient on that term. And so there's this distinction I want to push on you guys between relative shark and absolute penguin. And the penguin's better. I was just trying to come up with a metaphor here; I never really finished developing it. Sharks and penguins. Okay, you'll remember it, right? That's all that matters. So parameters are on the relative effect scale — that's the shark. Again, I didn't quite finish it. Predictions are on the absolute effect scale; they account for base rate, like the penguin. I don't know. All that matters is you remember absolute penguin. Okay, so both of these, the relative scale and the absolute scale, are valuable in interpretation. You just shouldn't mix them up. I'll give you a couple of examples of problems here. It's quite common, though, for people to quote relative effects, because you can just take single coefficients in a logistic regression, and if you exponentiate them, you get the proportional change in odds. Those are often reported as odds ratios. Now, if an odds ratio is greater than one, that means increasing the predictor increases the response, and decreasing the predictor decreases the response. But that ignores the base rate. It ignores the ceiling and floor effects. So you don't actually get a sense, on an absolute scale, of how important it is to know that predictor in a natural system. I'm going to give you an example on the next slide, but to set up the punch line: I think there's a preference, especially in the medical literature, for relative effects — for relative shark — because it makes things seem way more important than they are. Neglecting the base rate can make an effect seem really, really important, and that gets you published. You can have an odds ratio of two — that's a 100% increase — but it could still have almost no practical effect on anybody. So you have to be careful about these things. Here's a real-world example that I'm sure some of you have heard of.
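(Before the example, here's the shark/penguin distinction in miniature — a minimal sketch with made-up numbers, showing how the same relative effect implies very different absolute effects depending on the base rate set by the intercept:)

```r
# Made-up numbers: the same coefficient (relative effect) gives very different
# absolute changes in probability depending on the intercept (the base rate).
logistic <- function(x) 1 / (1 + exp(-x))
b <- log(2)                            # odds ratio exp(b) = 2, a "100% increase" in odds

logistic(-6.9 + b) - logistic(-6.9)    # baseline probability ~0.001 -> absolute change ~0.001
logistic(0 + b)    - logistic(0)       # baseline probability 0.5    -> absolute change ~0.17
```

Same odds ratio, wildly different practical importance.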
It keeps coming up, frequently in the UK, at least. So it is a medical fact that being on birth control pills increases the probability of lots of bad things. This is not news to anybody in the audience, at least anybody who is female. Nevertheless, I'll jump to the end: pregnancy is way worse for you than all of those things. I don't mean that as an anti-natalist point. Pregnancy is the most dangerous thing a female ape will ever do, right, in terms of endogenous mortality risk. So these other side effects of birth control pills are minor compared to the thing they're preventing. Anyway, that's not the punch line I'm actually getting to, but it's worth keeping in mind as I set this up. The actual numbers run something like this. One in 1,000 women in the UK will develop potentially lethal, quite seriously complicating blood clots — these are typically in the legs — in their lifetime. Three in 1,000 women on birth control will develop these blood clots. That is a 200% increase in risk, because an additional two women per 1,000 are going to get these horrible blood clots, which are very bad things. As a consequence of this fact, British physicians were required by law to start telling women that birth control could kill them. And a lot of women went off birth control, and as a consequence, many more of them died from complications during pregnancy. This is the thing that arises from base rate neglect, from focusing on relative risk rather than absolute risk. You have to take it all into account to give good medical advice. This is a famous example now in evidence-based medicine, which is my favorite phrase, right? It implies that there's another kind of medicine that is not based on evidence. Yes — that's the kind that we get. Anyway, the actual absolute change in probability is only 0.002. It's a minuscule thing. You should be worried more about traffic than about this particular risk, even before you balance it against the fact that pregnancy is much more dangerous than the blood clots. Does this help clarify the distinction between relative shark and absolute penguin? Yeah. Again, I have to finish developing that analogy; I just liked it and wanted to try it out. So, if you're giving that example and you're talking about it, what do you present? Do you say there's a 200% increase, or...? Present both. I would present both. Once you learn the absolute effect — that a woman going off the pill reduces her probability of a clot by only 0.002 — it suddenly seems a lot less serious, right? And you probably also want to report that, by the way, going off birth control dramatically increases the probability you'll get pregnant, and that pregnancy is way more dangerous on average. So both things matter. Again, we're back at my horoscope problem. In this particular case, I have a lot of advice about exactly what to say, because we know a lot of the context. In other cases, it'll be different. And that's what my next slide is about: both matter. So here's one of my pet peeves. You'll often hear this quote, and it's in one of my son's books that I read to him at bedtime, actually — a cool book about sharks. And there's this one page that drives me crazy, because it says more people die from bee stings than shark bites. And this page launches me into a statistical rant that my son just endures. But it's like, okay, I believe that.
Yeah, deer kill more people per year than sharks; I believe that. The hippo number seems crazy, but it's possible. I have actually almost been killed by a hippo. I used to do fieldwork in Africa, and I'll tell you, they're the real deal. They are the gangsters of that ecology. But here's the thing: the background to this number is that it ignores exposure. Right? When you're in the water, you're much more likely to die from a shark bite than a bee sting. Most of the time, you're not in the water. This is the rant my son just nods along to. Nevertheless, when you are exposed, the relative risk from the shark is much higher; the bee-sting comparison doesn't tell you the information you actually want. So I put this up here as a counterexample to the previous slide, to say it depends on the context. The information you'd like to deliver to people is that when you're in the water, say near San Diego, sharks are dangerous. They don't kill that many people, but they do kill people. So learn to watch out for them, understand a little bit of shark body language, and think about relative shark. Sharks are relatively dangerous in the water. Out of the water, they're not very dangerous — unless you're on stage at the Super Bowl halftime show, perhaps. There is some risk of being killed by a shark there. But anyway, does this make sense, what I'm trying to say here? Okay. Yeah, that was the only part of the Super Bowl I watched this year, actually. It was satisfying. The sharks made it for me.

Okay, so let's get back to regression. Here's the model we've been working with. Let's try to push some predictions out of this thing. The basic strategy is the same as in previous models; you just have to attend to the new bits of machinery. The steps are the same, there are just some new pieces in here. In particular, we've got this link function, so there's an intermediate step between the linear model and the likelihood. You calculate the value of the linear model, but then you have to transform it with a function before you can plug it into the likelihood. That function is the inverse link — the inverse of the link function itself. In the case of a logit link, the inverse is the logistic function. So you compute the value of the linear model — I'll show this on the next slide — and then you plug it into the logistic function. Then you get a predicted probability that you could put into a binomial simulation or just make inferences from directly. This is one of the frustrating and satisfying things about logistic regression, or binomial models in general: we want our inferences on the probability scale, but the data are not probabilities. We want to know things about proportions of outcomes, on a probability scale, but the data are counts. And this is typical of generalized linear models. I said this before, and I want to emphasize it again: the data are measured on a different scale than the things we're going to make inferences about. We're making inferences about log odds and probabilities, but what we observe are never those things. They're not observable. Proportions aren't observed, ever. You calculate them from data, but you don't see them or measure them directly. And that's, again, satisfying because it meets nature where it's at — it does it the right way — and frustrating because it's cognitively hard for everyone. You get good at it, though, trust me.
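(Here's that round trip between scales in miniature — a tiny sketch, using a made-up probability:)

```r
# Tiny sketch of the link / inverse-link round trip, with a made-up probability.
p <- 0.3
log_odds <- log(p / (1 - p))          # logit link: probability -> log-odds
p_back   <- 1 / (1 + exp(-log_odds))  # inverse link (logistic): log-odds -> probability
p_back                                # 0.3 again
```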
You do this a few times, and it's not so bad. But I just want to warn you about it. So that's what this inverse link is accomplishing: getting us back out to the outcome scale. Well, it gets us to the parameter scale, the probability scale, and then we can go to outcomes from there. So what does this look like? Remember, you've got parameter values; let's just think about the MAP estimates for the moment. You plug those MAP values into this line of R-ish pseudocode here. I just want to show you the symmetry between this line of R code, two-thirds of the way down the slide, and the linear model definition in the mathematical version of the model. That's always the same kind of thing. But now, logit of p equal to the linear model implies that p is the logistic of the linear model. So when you write the code, you're going to do it that way. Does this make some sense? When you actually do it, you'll see: you get probabilities out of it. The logistic of the linear model is always constrained to the zero-one interval. That's pretty much the main reason we use it — well, it's a little better than that: it turns out the logit link is the maximum entropy choice in this case. Logistic regression is sometimes called the maximum entropy classifier. So what I want you to learn heuristically from this is: your model has a logit link, so you define the logit of the probability of a success as equal to a linear model. What this implies is that the probability, when you want to compute predictions later, is equal to the logistic of the linear model, which undoes the conversion. All right, so what do predictions look like? The code to produce this plot is in the book. It's got no surprises; it's the same kind of stuff you've been doing before, but I do encourage you to work through it, because the tricky part now is the logistic step. However, link, sim, and ensemble work for these GLM models just as easily as they do for the others. You may not even notice there's a difference, but it's important to me that you know that deep inside the machine there's this inverse link thing going on now. Technically, those Gaussian models have a link function too. It's called the identity link. It just says that mu is equal to the linear model. That's what identity link means. So link and sim and ensemble handle all these link functions, because technically they treat every model as having a link. So what you're looking at on this graph are the posterior predictions superimposed on the raw data. Each blue trend is an individual chimpanzee in the data. They're averages. There are four combinations of the two treatment variables — which side of the table the pro-social option is on, left or right, and the condition, whether there was a partner at the other end of the table or not — so four kinds of treatment settings. Each chimpanzee experiences 18 trials in each of those four. That's how the data are structured. So these blue lines show, for individual chimpanzees, their average proportion of left pulls across those treatments. Now, the first thing you notice, of course, is the zigzagging — though actually the first thing you probably notice is that the predictions are terrible. There's a tremendous amount of heterogeneity among individuals. The black line is the posterior mean proportion pulled left, with the 95% interval of that probability. It picks up the zigzag that you're seeing. And where does that come from?
When the pro-social option is on the left, they pull left more. That's what it is. The line goes up when the pro-social option is on the left, and back down when the pro-social option is on the right, because then they're pulling right more. The bottom of the graph is pulling right; the top of the graph is pulling left. That's where the zigzagging comes from. It comes from the fact that they're attracted to food, even if they're not getting it. But there's essentially no effect at all — or a slightly negative one — of the other individual being present. And then the major story, the thing you get from looking at these posterior prediction checks that you never get from tables of coefficients, is that the real story here is heterogeneity among individuals. Individuals have strong handedness preferences. In fact, chimpanzees, like humans, are majority right-handed, but there are left-handed individuals. And there is one individual, which you may notice at the top of this graph, who only ever pulled the left lever — constitutionally incapable of pulling anything else. This is actor number two, and actor number two is special. We're going to turn to actor number two in a moment and talk about how to deal with situations like this, because it reveals — it's one of the things I like about this data set for teaching — something interesting about GLMs. We're about to get there. Questions about this? Make sense? Ready to keep going? All right. One thing you might wonder: we fit these models with map, but I told you the posterior distributions in GLMs may not resemble multivariate normal. This is a case where they do. For these models, the quadratic approximation is actually excellent, at least for the models we've fit so far. And I'm just showing you that this code here is enough: you can actually just pass a map fit into map2stan, and it sucks the data and formula out — because that's all it needs — and then runs the Markov chain. And then I'm showing you this table of coefficients is the same: each of the marginal posteriors is close to Gaussian, and they're multivariate Gaussian as well. This is a nice, central-limit-theorem-safe place. It works really well in these cases. That said, we're going to move next into a case where it isn't Gaussian, and I'll show you what happens. To do this, we're going to crawl forward and add handedness, because there's a lot of heterogeneity among the individual chimpanzees, and controlling for it may let us see the treatment effects better. This is a thing we're going to come back to later today as well, for a different data set. So how are we going to do this? I want to take this opportunity to introduce you to a notational convention for handling categorical variables with multiple intercepts — one we're going to use with multi-level models as well. I think I mentioned this way back in chapter 5, when we introduced dummy variables: the conventional way is that some category gets assigned to the intercept, and all the remaining categories get dummy variables made for them. The way I prefer to do it instead is to have a vector of intercepts, where each category in the data gets its own. And one convention for notating this is to put a little bracket on the parameter: alpha bracket actor sub i, which means the alpha for the actor value on case i. This corresponds to there being a column in your data set called actor, and it has integers in it.
And those integers are the indices of a vector of parameters. I know you're staring at me like, huh? I'll show you what's going on. So this is what I'm talking about. In the first part, we define a unique intercept for each actor. That is, there's an alpha 1 and an alpha 2 and an alpha 3 and an alpha 4 and an alpha 5 and an alpha 6 and an alpha 7 in this data set, because there are seven chimpanzees. So alpha sub actor[i] means: take the ith value of the actor column, use that value as an index, and give me that parameter out of the vector. And what that looks like in the data is, if you just type d$actor — remember, the dollar sign means extract, in R lingo, so extract the actor column out of the data frame d — this is what you get. For every row in the data there's an actor ID, and this is the chimpanzee on that particular trial. The numbers in brackets on the left are like the i values; they tell you which case you're at. So for the first case, you get alpha sub 1, because the first value of actor is 1. For a while you'd be making predictions with the first intercept parameter, and then actor switches to 2 and you use the second one. I just prefer to do it this way because it's cognitively easier, and you don't have to remember which category you aliased out or any of that stuff. There's a symmetry among the categories now. And this is how we're going to do multi-level models too, because it's the conventional way to do multi-level models. Does this make some sense? It'll only really make sense when you do it, like everything else in this class — and even then maybe not, but that's okay. You've got years of science ahead of you. Now, the prior: in this notation, alpha sub actor means the whole vector, and every element of it is assigned the same normal prior. So every one of those intercepts gets the same weakly regularizing prior. Okay? All right. Oh, and I should say too: on the log-odds scale, zero means a coin flip. So if we have a normal prior with standard deviation 10 on an alpha intercept, that's really wide: it's nearly flat over the whole relevant range of log odds, out to 5 and minus 5 — and remember, 5 essentially means always, and minus 5 essentially means never, in log odds. So this is very weakly regularizing. I actually prefer normal(0, 1) very often when I do modeling in these contexts. Okay. In the code — I'll highlight it in red — you just put brackets by the parameter symbol, and what the software interprets from this is: oh, you want a vector of alphas. How long? Let me count up how many unique values of actor there are. And that's what it does: it makes a vector of intercepts for you. You can do this in map too; it'll work the same way. So, do you have to make sure ahead of time that R knows actor is an integer, not a factor? I'm going to start having fun with this if I get to say it again. Is the actor a factor? Would you have to say the actor's a factor? Could it be an actor with a factor in it? Now I'm trying to make it rhyme — I've been reading Dr. Seuss to my son, so I'm really trying to rhyme. At least three of you have been sending me videos lately, because of last week's lectures. But anyway, what was the question? Is actor a factor? If it weren't a bunch of integers, it wouldn't run, because then it's text. And in fact, the factor data type in R is a pox. It's really dangerous. I don't know what your opinion of it is.
Underneath, at a latent level, factors are actually integers that are mapped onto a list of labels. It's so easy to get confused, and it's tragic when it happens. I prefer never having factor data types in your data frame. If you're really careful you can handle it, but it trips people up a lot. If actor is a factor, convert it to an integer. A little bit later today I'm going to show you a function that does this: it finds all the unique values and maps them to integers. It's your job to make sure the data are well groomed. What I was going to say is, in the next example later today there's a convenience function in the rethinking package called coerce_index, which you can give a factor and it makes it into a set of unique integers starting at one, because this is a thing that I do all the time. Factors are a pox. They're just a disaster waiting to happen; that's what they are. Did I answer your question? If the actor is a factor... no, the rhyming is coming back. If it is a factor, you make it into an index. I'll show you how to do that. Where was I? We highlight it like that, and what you get out of the estimation is a[1], a[2], and so on through a[7] — seven individuals, one intercept for each of them. These are intercepts, and each gets used for its own cases in the data to make predictions. They're the baseline rates of pulling left — the left-handedness, in a sense, on the log-odds scale — of each actor when both predictor variables are 0, the baseline condition. Does this make some sense? Notice, here's actor 2. Actor 2 is the one that always pulled the left lever. The posterior mean here is 10.6. I told you 5 essentially means always; so what does 10 mean? That means forever. After the earth has been swallowed by the sun, this chimpanzee will still be pulling the left lever. But notice, look at the highest posterior density interval over there: it goes up to 20. Really, really high. What's going on here? Notice the n_eff is smaller too — that's a symptom of it. The posterior distribution for this parameter is not Gaussian, so it gets sampled less efficiently, and that shows up in the n_eff as well. What's happening? This is an important thing to realize about GLMs: there are ceiling and floor effects, and they affect the shape of the posterior distribution. So I'm showing you in the graph the marginal posterior distribution for actor number 2's intercept. It's reliably above 0, so we're really confident that actor 2 likes pulling the left lever. It's left-handed. We're about as confident about that as we possibly could be. But the data do not discriminate among any of the values of this parameter that are high enough to ensure it will nearly always pull the left lever. Infinity will make the same predictions as 10. The likelihood function doesn't discriminate among those parameter values. And this is a consequence of the ceiling and floor effects — a consequence of the change in geometry between the prediction space and the parameter space. This is a commonplace thing about GLMs to keep in mind. As I say at the bottom of the slide, if you rely upon the data to do all the driving, it might drive off a cliff, because quite often the data don't discriminate among the options. But you may have information that lets you do so. You know that infinity is not the right value for this parameter. They're probably not that left-handed. The other effects could nudge them off it: if you piled enough food on the other side of the table, we could probably get actor number 2 to start pulling the other lever.
That would be an experiment to try. So this is a really common issue, and in non-Bayesian fitting of logistic regressions, people use regularization to deal with it. If you have perfectly flat priors, you get this infinitely long flat region of the likelihood: once you get past about 5 in this graph, it would just be flat forever, because the likelihood of the data conditional on any value of this parameter higher than, say, 10 is the same — it makes exactly the same predictions. The only thing that makes the posterior decline out here is the prior. I'm not saying that means we know it declines in exactly this way; we don't. But it does let you fit the model, which is a nice thing. It makes the model converge. And very simple logistic regressions can exhibit problems when they have flat priors. Here's my favorite case. This is in the book too, but I wanted to highlight it for you because it's really this easy; it happens all the time. And we're going to fit it with glm, just so you don't trust the built-in tools either — this is a general feature of all golems, that they have these problems. It's a simple data set I just built with the rep function up there, and this is what the data frame looks like. We've got an outcome y; half of the values are 0 and half are 1. It's going to be a logistic regression: we're going to predict y using x, which is some predictor. And what I want to show you is that x = -1 is associated with y = 0, except in one case, number 10, and x = 1 is associated with y = 1, except also in that one case. So there's almost a perfect association between the x's and the y's — not perfect, but almost perfect. When you run this model, that's what you get for estimates. And I think I've trained you guys well enough that you would not use those estimates, right? Look at the standard deviations on those posterior distributions. What's going on here? The data don't discriminate; the model wants to make the slope infinite, basically. It's happy to make the slope infinite, because there's an almost perfect relationship between the two variables, and the iteratively reweighted least squares method that glm uses just freaks out in this circumstance. But you know better. You can fix this really easily with a tiny bit of regularization on the regression coefficient, and it clears right up. I show you how to do that in the book: take this model, put it in map with a weakly regularizing prior on the beta coefficient, and you get sensible estimates right away. So you just have to be wary. GLMs are like this, I'm afraid. That's just how they are. But you guys are going to be skilled, and you'll handle it. You just can't trust the machine. So, what do we learn now? Well, handedness: the predictions get a whole lot better once we've got these intercepts unique to actor. I'm just showing you four of the interesting individuals now, to show you the heterogeneity among them. The zigzag is in there for some of the individuals. Actor number three, for example, is definitely responding to the food, but not to the presence of a partner on the other side of the table. Actor five we predict really well now. But all these intercepts do is move the level of the zigzag, right? They don't change its shape across actors. We'd need an interaction — different slopes for each of them — to do that. And we'll do that when we get to multi-level models.
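(For reference, here's roughly what the unique-intercepts model looks like in code — a sketch, assuming the chimpanzees data frame from the rethinking package and following the book's form of the interaction linear model:)

```r
# Sketch of the unique-intercepts model, assuming the chimpanzees data with an
# integer column actor and 0/1 columns prosoc_left, condition, pulled_left.
library(rethinking)
data(chimpanzees)
d <- chimpanzees
d2 <- d[, c("pulled_left", "prosoc_left", "condition", "actor")]  # keep only used columns

m_actor <- map2stan(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a[actor] + (bp + bpC * condition) * prosoc_left,
    a[actor] ~ dnorm(0, 10),   # same weakly regularizing prior for every intercept
    bp  ~ dnorm(0, 10),
    bpC ~ dnorm(0, 10)
  ),
  data = d2
)
precis(m_actor, depth = 2)     # depth = 2 shows the whole vector of intercepts
```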
But this still hasn't changed the predictions about the zigzag shape. Controlling for the heterogeneity in handedness has not revealed some hidden impact of the interaction between the other individual being present and which side had two pieces of food on it. It could have, though, right? It could have been a masking effect: the noise from handedness could have hidden what was going on in the experiment. But it didn't in this case. Make sense? Before we leave the chimpanzee example, I want to show you running exactly the same model in a different way. For any logistic regression, you can take all of the rows that have exactly the same values of the predictors and aggregate them into a single row, just by summing up the outcome across those rows. And there's a nice function in R to do this; it's called aggregate. So here we're aggregating the outcome variable, pulled left, conditional on the values of these predictor variables: pro-social left, condition, and actor. For any set of rows in which those are all the same, we sum up the outcome variable and make a single row that holds the sum. I'll show you what this output looks like in a second. You can run exactly the same model with the aggregated data and you'll get the same posterior distribution. And it's often much more convenient, because you get a shorter data set. The thing is, it's still a binomial model. So what do the data look like? Here's what the original data look like, the logistic form — just showing you the relevant columns. On the far right we have the outcome variable, pulled left, zeros or ones. And then there are a bunch of different rows: in some of them the pro-social option is on the left, in some it's not, and then condition is whether or not the other individual is present. In the first 24 rows here, the other individual isn't present and it's actor number one, but we can scroll down through the whole data set. There it goes: 144 rows, each of them with a 0 or 1 outcome. But a lot of them have replicated predictor values. So this is an exploded representation of the data, and we can collapse it down with aggregate. When you do that, this is what aggregate makes. I'm just showing you the first two actors, but each actor now gets exactly four rows in the data set, because there are only four combinations of the treatment variables for each actor. So these are actor one's four rows, these are actor two's four rows, and then actor three gets four and actor four gets four and so on. Much shorter data set now. And just because of the way aggregate works, it calls the new column x. I haven't figured out how to change that, but you can rename it later. This column is now the summed pulls of the left-hand lever out of the 18 trials; they get aggregated together there. How do I know it's 18? Because I know the structure of the data. In different data sets it will obviously be something different. So now we construct the binomial regression in a slightly different way. We push in the aggregated version of the data, and now you get an 18 in the likelihood function where there was a one before, because now what you've observed is a number of successes — a number of left-lever pulls out of 18 trials, instead of out of one. It's the same kind of thing, just a different data representation. Makes sense? Yeah? Is that a hand? Yeah. So can you put a vector there, if you have a different number of trials?
Yes, we're going to do that next. Great question: can you put a vector there? Yes, you can put data there, and we're going to do exactly that — we'll put a variable there in the next example. So work through this to verify for yourself that you get the same posterior distribution. It's the same model, just a different representation. We're not losing any data, right? We're not losing any information at all. It's just aggregated differently, represented differently. It looks different, though. Is the aggregated form better? Well, it depends. In some cases the aggregated form is better; in some cases it's not. When we get to multi-level models and we want to deal with over-dispersion, I'll bring this back up — or if I forget, wave your hand vigorously and remind me — because how you aggregate, or don't, affects your flexibility about where you can model over-dispersion as coming from. So we'll worry about that when we get there. But speaking of over-dispersion: count data are pretty much always over-dispersed. Yeah, you look like you have a question. So by aggregating, you assume there's no effect of order, in terms of how the data are collected? That's right. If you had a data set where you expected that the chimpanzees, for instance, might learn — that no matter how much food there is over there, I'm not going to get it by pulling levers — then, if there were a learning element, you might not want to aggregate. That's right, it might be like that. The comment was, in case my computer didn't pick it up: what if order mattered, so the individual trials vary conditional on something you might know? Then you need to keep the data exploded, and in that case the predictors wouldn't aggregate together like this. There is an order column that goes with this data set, and it doesn't have a detectable effect, which is why I haven't pushed it as part of the analysis. I kind of wish it did — that would be fun — but it doesn't. All right. Let's look at another aggregated example before we move on to Poisson regression. This is a famous data set in statistics. It gets used a bunch; some of you will have seen it before. If not, it's famous, so it deserves to be shown, and I'm just going to show a standard analysis. This comes from a paper published, I believe, in Science in 1974, and it's graduate acceptance and rejection decisions for the largest PhD programs at UC Berkeley in 1973, anonymized for the protection of the departments. You may be able to guess what they are as we go. The paper arose because UC Berkeley internally audited these admissions decisions and raised an alarm: it looked like there was severe evidence of gender discrimination in graduate admissions, that women were being admitted to graduate programs across campus at much lower rates than men. That's not actually true in these data, but the statistical mistake is worth going through and doing an autopsy on. The paper that was published explained it, and it's become a standard teaching example of the consequences. Now, this isn't to say that there isn't discrimination in graduate education, because I think there is, right — the kind of drip, drip, drip of all sorts of stuff in places where white guys run everything. But in the admissions decisions, the opposite appears, is what I want you to see. And that's still true in these programs: being female helps you get admitted to the program.
Once you're in, then you've got other problems. I know I'm not delivering news flashes here. But it's just important to say — the anthropological injunction — that just because there's no problem here doesn't mean there are no problems. Okay. So these data, which I showed on the previous slide, are quite simple. Here's the whole data frame. It's an aggregated binomial data set. We've got six departments, two rows for each: one row for the male applications and one for the female applications, with the counts of applications on each row. The outcome variable is going to be admit: we want to predict the number admitted out of the total number of applications, and the total number of applications is just the sum of admit and reject. You see that? These are binomial trials. We're going to assume that every application is independent of all the others, which is a bad assumption, by the way, because if anybody has sat on a graduate admissions committee, or knows something about it, it's not actually that way. There are cohort size goals, and sometimes you'll admit mediocre candidates just because you need to fill a cohort. People do that. We don't do it in my department, but there are departments that I know do. And sometimes there are wonderful people you don't admit because the cohort is full. So these decisions aren't independent of one another. But this first-level analysis will get the point across anyway; there's still stuff to figure out about how this works. Right, so let's set up the model. At this point it's just a regular old binomial model. The first thing we do, up top, is take the gender variable — remember, it's entered as the text values male and female, and text doesn't work in the model, so you've got to convert it to numbers. We make a dummy variable called male, one or zero. And then there's this column, applications. It's data, and it becomes the number of trials, and it varies across rows — monstrously so, actually. The variation in the number of applications across departments is huge. Some departments get a lot of people applying, mainly psychology, and other departments, like physics, get very few. So there's huge heterogeneity, and as we're going to see, that co-varies with other stuff and explains how the mistake in inference arises here. We fit two models. In the top one, we put in the male dummy variable, so the log odds of an application being admitted depend on the gender of the applicant; in the bottom one, we leave it out. Then we can do a model comparison and get some assessment of the overfitting risk. You guys with me so far? The only new thing — to follow up on Katrina's question, can you put a vector in there? — there's your vector. It works great. Do the comparison: two models, and model 10.6 is doing way better, and that's the one with the male dummy variable in it. This is one of those slam-dunk cases that we haven't seen much of in this class. Yes, it's more flexible — it's got an extra parameter, and this is one of those cases where WAIC essentially just counts parameters, two versus one — but the in-sample fit is so much better that it's hard to imagine this is just overfitting from adding the one term. And the standard error of the difference is over there: even accounting for that big standard error, if you double it, it's a long way from the models being equal.
So not much overfitting risk here, right? Gender almost certainly does affect the probability that an application was admitted in these data, the way we've analyzed them. So now we look at the estimates to figure out the direction of that effect, and now we come back to relative effects. Think about logit-link models: if you exponentiate the beta coefficient, you get the proportional change in odds. If that's greater than one, it means you multiply the odds of the event by that amount when you increase the predictor by one unit. So in this case, if you change an application's label from female to male, the model says there's an 84% increase in its odds of admission in these data. And that's true across all departments. This is the correct description of what's going on across all of the applications in the whole data set: male applications are accepted more. That's absolutely true — the odds are 84% higher. Nevertheless, there's no evidence of gender discrimination. We're coming to that in a second. You guys with me so far? Yeah? All right. This is a relative effect size, and I don't think it's misleading us here, but it makes the effect sound big, right? Like a huge deal. To get to absolute effects, we're going to crawl that way a little bit — but the absolute effect is going to vary hugely across departments, so we'll have to deal with that issue. First, let me remind you how to get to the absolute scale, how to interpret these coefficients. If you want to compute predictions at the MAP values, you can just plug them into the logistic function. So the prediction for a female application is just the logistic of minus 0.83: the probability of admission, on average across all departments, is about 30%. See that? Then for a male application you use the linear model again — the intercept plus the beta coefficient times 1 now, instead of 0 — and that's about 45%. So there's about a 15-percentage-point difference, which is not tiny, but that's the absolute scale: the MAP difference, on average, is about 15 percentage points. Does that make sense, how to do this? You want to be able to do both: the relative shark and the absolute penguin are both worth computing. Okay. And also, to remind you: to get the full contrast here, the full posterior distributions, you can just extract samples and plug them in directly as well. You could also use link to do this, but I want to show you the roll-your-own version of what link does. When you roll your own, you extract the samples yourself and then plug them into the logistic function. And R loves vectors: when you put vectors into this, it gives you vectors of answers. First it adds each row of a and bm together, so you get a vector of sums of the same length; then that gets passed into logistic. Logistic is also vectorized, so it returns a vector of the logistic transform of all those sums, and you get a vector of posterior probabilities of a male application being admitted. Same for the female applications. Then we compute the contrast just by subtracting one from the other, and now we have the posterior distribution of the contrast — the predicted difference in probability of admission between a male and a female application. You with me? Okay. And then I'm just summarizing with quantile here. The median is about 0.14 — that's what we got before — and it's definitely positive, right? We don't get anywhere close to zero here.
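(Here, for reference, is roughly what that roll-your-own contrast looks like — a sketch, assuming the gender model above, 10.6, was fit with an intercept named a and a male coefficient named bm:)

```r
# Roll-your-own version of link: extract samples, push them through the inverse link.
library(rethinking)
post <- extract.samples(m10.6)                 # posterior samples of a and bm
p_admit_female <- logistic(post$a)             # logistic(a + bm * 0)
p_admit_male   <- logistic(post$a + post$bm)   # logistic(a + bm * 1)
diff_admit <- p_admit_male - p_admit_female    # posterior of the contrast
quantile(diff_admit, c(0.025, 0.5, 0.975))     # median around 0.14
```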
Lots of evidence that, on average across all departments, male applications are admitted more — and that is absolutely the truth in these data. And finally on this issue, here's what this looks like graphically. On the left are the odds ratios, the relative risk: the posterior distribution of the exponentiated coefficient for male, showing you the posterior distribution of the odds ratio. On the right is the posterior distribution of the contrast between the probability that a male application is admitted and the probability that a female application is admitted — that's what we just calculated on the previous slide. Both of these things might be important to compute, depending on the situation you're in. This is relative shark on the left and absolute penguin on the right. Make sense? Yeah — at least until you leave the room it makes sense. I know how this goes; then it evaporates from your brain. Okay. Posterior validation check. Here's the part where we see this is not such a great model. We're going to contrast the raw data with the model predictions. The raw data are shown in blue for each department, and cases in the data set are just plotted across the horizontal axis, which means the first two are department A, three and four are department B, and so on. The first case for each department is the male applications, the second the female applications. I've labeled the first department there to help you out — remember that. And I've connected data points from the same department with lines. Again, the code to do this is in the book, so if you want to make something like this, you can. The posterior predictions are shown with these open circles: that's the posterior MAP prediction of the probability of admission for that case in the data, and the little line segments are the 95% interval of that prediction. Then the little cross marks account for sampling variation from the binomial on top of that — that's like what you did using sim. The crosses are from sim; the circles and intervals are from link. And I think you can see there is heterogeneity in the uncertainty of the predictions across cases. Why? Because the sample size varies. In departments with few applications there's more uncertainty about what to predict, and you see that in the spread between the crosses being bigger for some departments than others. When there aren't very many applications available, you're not sure — there's lots of sampling variation. Where there are a lot of applications, you'd be really sure you'd see something close to the average. That's the idea. There are few coin flips for some of these cases, because there are very small numbers of applications. But you get the same zigzag: the model thinks that in every department, females will on average be admitted less often. But the truth is there are only two departments in which female applications were admitted less often. You see? And there are some departments, like the first two, where being a female applicant is a big advantage. Those departments, by the way, are engineering and physics — I happen to know that about this data set. So what's going on? Why was the statistical model fooled? The actual effect is being masked by the heterogeneity in admission rates among departments. Also, departments behave differently conditional on the gender of the applicant. Physics and engineering want more female applicants, and they also get extraordinarily good ones.
And they nearly always accept them. It's like this effect that some of you who play chess competitively may know: female chess players are extremely good, because it's not a comfortable social environment for female chess players to be in, so you've got to be unbelievably excellent to break through the barrier of discrimination. As a consequence of those discrimination effects, the distribution among those who persist is way non-Gaussian and skewed. I think this happens in engineering as well: the women who break through are among the best, as a consequence of having to break through. Anyway, there's not great evidence of that in these data, but it's a phenomenon that shows up in performance evaluations. So what's actually going on? Let's fit the model where we put in the intercepts. Overall admission rates vary across departments, so we can put in unique intercepts just like we did with the chimpanzees. Now we have a vector of intercepts, one for each department, so the two rows in the data set from the same department get the same parameter. Set up the model the same way. In the top line up there, I'm showing you: we make a department ID variable, and I use this function coerce_index. Department is a factor — a nasty factor — and I'm showing you the two columns of the data set that are relevant. coerce_index takes that factor of letters, A through F, and converts it to the nice integers 1 through 6, and those can go into map or map2stan and work great. So here we're doing a map fit, and map will treat this bracket notation the same way map2stan does. It says: oh, you want a vector of parameters with that name; I can do that; let me count the unique values in your index variable. So what happens? Well, let's do the full model comparison now. I fit two new models, one with the gender variable, one without. The top-ranked model now is 10.8 — that's the one without the gender variable, but with the individual intercepts for each department. But you'll notice the Akaike weight is split basically equally between the two. So I would interpret this as: gender probably matters, but it's somewhat overfit in the model that includes it, so the best estimate is probably somewhere between what we see in that model and zero, accounting for the overfitting. But let's look at the estimates from model 10.9, which has both the intercepts for each department and the coefficient for male. Notice that the marginal posterior distribution for bm is now negative, and reliably so, as a consequence of what your eyes already told you about the data — which is that, in fact, most departments don't care. Most departments have pretty much flat lines; we'll look at the data again in a moment. But there are a few departments — and it's those departments that get very few female applicants — that accept more than half of the female applications they do get. So there's this big effect there. Let's look at the data again. On top are the posterior predictions, now with the intercepts for each department, and now the model gets the right inference in its predictions: there's this very weak uptick, with admissions expected to be slightly higher for females. It's not tracking the first couple of cases very well, though. Why? You guys can probably tell me: because the model assumes that the effect of an application being male is the same in every department, right? And that's almost certainly not true.
Departments behave differently because they have different gender ratios in their programs, and they're trying to fix that; admission decisions are heavily weighted by those things. I am a member of certain grad programs, which I will not name, in which there are a whole lot of female applications and it's very rare to get a male application, and the bias goes in the other direction. And the university is pushing that, actually. We've reached that point in certain fields, like communications — which I'm not a member of, but communications is the most extreme case; it's 99% female. And the departments on the left in this data set are like that in reverse: they've got gender ratios skewed toward males, and they're favoring female applicants. The other departments basically don't care. But here's the thing: the vast majority of women applying to grad programs at UC Berkeley in 1973 were applying to the departments on the right, and it's very hard to get into those departments. Notice the overall admission rates are really low — in fact, for department F, they're nearly zero. And that is social psychology, department F. David, you had a question? Does the model do something where, regardless of the number of total applicants per department, it weights each department equally? Or does it understand that department D, for instance, has 100 total applicants and department A has 500, so that when it estimates the bm coefficient, it takes sample size into account? If I understand your question, the answer is yes: the binomial model is taking the sample size in each row into account automatically, because all of that goes into the likelihood. And that's because we didn't convert things to proportions — we kept the number of trials. In departments where there are fewer trials, there's less information, and the posterior distribution reflects that automatically. Just because of Bayes' theorem. So in figuring out the effect of gender, is it weighting each applicant the same? Yes — every applicant gets the same effect in this model. When we do multi-level models and there's heterogeneity to be inferred, it won't be that way. We can come back to this issue; we can do better than this, actually. There's clearly heterogeneity, and we're going to reanalyze this model in a couple of weeks with varying slopes, so we can deal with the fact that departments respond differently to the gender of applicants. Everything can vary, not just intercepts — you can make every parameter in your model vary, and that's probably a good idea here. So we'll do that in a couple of weeks, and cool stuff will happen: something I call multi-dimensional shrinkage. I know you weren't expecting that. It's not as catchy as relative shark. Okay, so just to see what's going on here, the autopsy on this: why did we get tricked the first time? We got tricked because there was a correlation between an application coming from a woman and the overall admission rate of the department. Look at the number of applications from women here: very few up here — only eight to department B, only 19 to department A — and hundreds to the departments down here. And the departments down here have really low admission rates; social psychology was taking around 5% of men and women alike. No evidence of gender bias — it's just that women preferred to apply to the departments that were hard to get into.
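(For reference, a sketch of the department-intercepts model described above — 10.9 — assuming the UCBadmit data frame from the rethinking package:)

```r
# Sketch of the department-intercepts model, assuming UCBadmit has columns
# dept, applicant.gender, admit, applications.
library(rethinking)
data(UCBadmit)
d <- UCBadmit
d$male    <- ifelse(d$applicant.gender == "male", 1, 0)  # gender dummy variable
d$dept_id <- coerce_index(d$dept)                        # factor -> integers 1..6

m_dept <- map(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a[dept_id] + bm * male,
    a[dept_id] ~ dnorm(0, 10),
    bm ~ dnorm(0, 10)
  ),
  data = d
)
precis(m_dept, depth = 2)
```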
So if you ignore the department heterogeneity, you reach the wrong answer. Does that make sense? This is called Simpson's paradox in statistics. You don't really need to burrow into that; just remember that you've got to think about ways you could be tricked. You can't trust your models. And if there's a lot of heterogeneity across different clusters in the data, it's very easy to be fooled this way. Very, very easy. You can be fooled in the opposite direction, too. So you have to be careful. Okay. Let me summarize binomial GLMs, and then we'll see if we have enough time to do Poisson models before you go. We use binomial GLMs to predict counts that have some fixed, known maximum. That maximum can vary across cases, like it did in the example we just did, but the idea is it's binomial when you know the number of trials. Use a logit link: it's conventional and it's the best default choice. There are other choices, and if you have a good reason to use a different one, you'll know — you'll be in a situation where you've read enough about it that you know you want it. Distrust MAP estimation. Often it works unreasonably well, as I showed you, but if you get a posterior distribution pushed up against the ceiling or floor — at log odds of around 5, or down around minus 5 — then you probably want to go to Markov chains, or some other method that doesn't make assumptions about the shape of the posterior distribution, so you're not fooled. Regularization is really important for getting these models to converge, because really strong effects produce flat regions of the likelihood where the data just don't discriminate among different values of the parameter. You can't let the data do all the driving. You just can't. I know it sounds like a horrible, anti-scientific thing to say, but conditional on models, the data don't discriminate among options, even when you already know some of those options are stupid. The data and the model don't know that, so you can do better. You're like the backseat driver: let the data drive, but you've got to intervene. If the data's been drinking, you've got to grab the wheel. That's how it goes. That's a horrible metaphor. Wow. Katrina's shaking her head. You know me by now, Katrina — that's not the worst thing I've ever said. Okay, so as usual, my mantra applies: it's really important to focus on predictions and not parameters. Really important, especially because of the ceiling and floor effects that are going to happen.

Okay. So now let's come back to our exponential family. We want to talk about Poisson models now; that'll be the last model type for this week. To ease you into it: Poisson distributions are just a special case of the binomial — people may have said this to you before. They have the same maximum entropy constraints, but we get to use them when we don't know the number of trials. The Poisson is the special shape where the variance and the expected value are the same, and you get that from the binomial when there is a very, very large number of trials and the probability of success in each is really, really low. Then you get this special shape. So let me demonstrate that to you through a simulation example. We're looking at, remember, our flies dying in test tubes, or our graduate students passing prelims in their first year, which is what we have up here. Here's a binomial distribution: numbers out of 10 that pass their prelims. The mean is about 4; the variance is 2.4. As we decrease the rate of success, it moves the mean to the left.
And then we also vastly increase the number of graduate students, and in proportion decrease the probability of passing prelims. So it's like there's a vast army of graduate students who are admitted to teach our intro freshman courses. This is how some English departments work. It's not a nice thing to say, but it's true. Most of them are never going to make it to the PhD, because there's a bottleneck imposed there. Then you get a distribution that is technically binomial, but the mean and the variance approach one another and eventually converge to the same value. So I'm showing you here a probability of passing your prelims of about 1% out of 200 graduate students: the mean is 2.8, the variance is about 2.8. If it's again 1% and there are 500 students now, they're even closer: 7.07 and 7.02. And out of 900 students, again, they're almost exactly equal, within simulation variance. These distributions are examples of Poisson distributions. There's nothing magical about them; they're just binomial distributions where the observed values never get anywhere close to the theoretical maximum. Most events are failures. Not sad at all. I'm going to bum out Jason over there. You're going to pass, don't worry. Lots of examples: you could have a number of trials like 6,000, and you'd need a very low probability of success, but still the variance equals the mean. And this gives us something called the Poisson distribution, which has a very nice mathematical form that requires only one parameter to describe its shape, because now the mean and the variance are the same. This parameter is usually called lambda, and it's related to the exponential rate that was present in the previous graphs; you can think of it as the expected count, or as the average number of successes per unit time. So we use these when we have counts where there's no clear upper limit; we don't know it. So what could this be? It's like mortality data: you're counting deaths. A famous example, I think I have this on the next slide, is the number of soldiers killed in the Prussian army by donkey kicks, famous as one of the earliest data-analytic applications of Poisson regression. And a very important problem: livestock kicked people, and it was a big deal. And in that case we don't know how many soldiers were exposed to potential death from donkey kicks; we don't have the exposure variable. But we can still do good statistics with this by estimating the rate at which it happens in different contexts. This is named after Siméon Denis Poisson, who did a bunch of cool stuff in mathematics. I love the cover of this book, by the way. I'm going to get a photo of myself looking like that to hang up for you; I haven't found those pants yet. So that aside. Some great, classic applications of Poisson regression: soccer goals per game. I'm sure there are soccer fans here. Soccer is a great game, but man, it's not high scoring. Most of the time it's just really hunky guys who run around and sweat a lot. So it's fun to watch, but there are not a lot of balls going into nets. Poisson distributions work great for that. Fissions per unit time in uranium: almost exactly Poisson distributed, independent fission events. A really large amount of matter that could decay, but the probability of any atom decaying in any unit of time is really small. Photons hitting a detector: same kind of story. And then here are the famous soldiers in the Prussian army killed by horse kicks. It was horses, I guess. Deaths per year.
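If you want to see that convergence for yourself, here is a minimal simulation sketch; the trial counts and success rates below are made up for illustration, not the numbers on the slides:

```r
# As the number of trials grows and the success probability shrinks,
# the binomial's mean and variance converge -- that's the Poisson limit.
set.seed(10)
for (N in c(10, 200, 1000, 10000)) {
    p <- 2 / N                           # hold the expected count near 2
    y <- rbinom(1e5, size = N, prob = p)
    cat(N, "trials: mean =", round(mean(y), 2),
        ", variance =", round(var(y), 2), "\n")
}
```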
And, closer to home for many of you, DNA mutations: mutations per strand per generation are approximately Poisson distributed for all these same reasons. There's a very large number of sites that could mutate, and most of them don't, because mutation rates are low. This makes some sense. So that gives you some heuristics for knowing when you're in one of these situations. Now let's do a data-analytic example; there's just enough time, I think, to do this. This is a paper that was published by Michelle Kline, down there. Some of you know her, I think. Michelle's a postdoc at ASU, and she works in Fiji and studies the cultural evolution of oceanic societies. This is a paper she did, a kind of historical analysis, interested in long-term technological evolution: the complexity of toolkits in island societies and the size of their populations. The background theory, very quickly, is that large human populations can retain and develop more technology. It turns out there's some evidence of drift: you can lose technology. If the only person who knows how to make a bow and arrow dies before teaching someone else, you lose bows and arrows. Happens. Happens in graduate programs all the time, not with bows and arrows, but with, like, statistics. Right? The last quantitative anthropologist leaves your department. Yeah, so we won't talk about that. But I guess we will at some point. We won't talk about that. So there's a drift effect, but there's also the effect that all attempts to learn a technology have some error variance in them, and it's only rare individuals who innovate something that makes it better. So the more people who are trying, the more positive mutations you can get. This is exactly like genetic evolution, which is often mutation limited: big populations evolve more efficiently because there's more opportunity to get positive mutations. The same thing can happen in human cultural evolution. And so Michelle is trying to see if she can find a whiff of evidence of this in an oceanic tools dataset. And unfortunately this is all we have to work with; this is the whole dataset right here. But hey, you know, you go to war with the data you have, not the data you wish you had. The caveat here is that you might think the sample size is 10, because there are 10 islands here: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. But the sample size is not really 10, because there are a lot of count events that go into these tool totals. It's not the traditional idea of what sample size is; in count models it's not transparent. It's not like a simple regression model where you can break these things apart. This will make a little more sense when I get to exposures in Poisson models. So we're interested in the association with the complexity of the toolkit; we're going to be focusing on the total tools column. That's the total number of distinct tool types. These are classes of tools, and some of them are kinds of kayaks and things like that, all kinds of fancy farm implements and so on: bows and arrows, knives, fish hooks, cool things like that. We're interested in the association between total tools and the logarithm of population, the magnitude of the population size. We use the logarithm as a measure of magnitude, and as the population size grows in magnitude, we expect a relationship between total tools and log population. And then there may be this moderating effect of contact. Some of these island societies historically had really well-reticulated trade networks with other islands, so they might get more technology just because of that.
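As a rough sketch of what we're working with, assuming the Kline data bundled with the rethinking package (with culture, population, contact, and total_tools columns); the model itself comes a little further below, after the priors are set up:

```r
library(rethinking)
data(Kline)
d <- Kline
d   # ten rows, one island society per row: culture, population, contact, total_tools
```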
And we're going to return to this in the last week and do a spatial autocorrelation model with it, where we actually look at island dyads and the transfer of tools between them. We'll punt on that for now. So, yeah: question, is Hawaii here the state or just the island? This is historical Hawaiian culture that we're talking about, not contemporary Hawaii. Multiple islands, the whole kingdom, yes. All right. So here's the anatomy of a Poisson GLM. Let me step through it. I think you can probably make sense of this already, because you guys are getting pro at this, right? And it doesn't matter what the likelihood is called; I could call it foo and you would know it's a likelihood. It's got some parameter in it, right? And we could call that TARDIS, and you attach that to a linear model and you go to town, right? But these are the conventions. So the outcome variable here is total tools; we notate that the same way as before. The only parameter we need to establish a link for is the expected number of tools for each society, lambda. We use a log link because the expected count has to be positive, zero or greater, and the log link will do that. This means we're assuming an exponential relationship between the value of the linear model and lambda. Remember, right? Lambda equals the exponentiation of that whole linear model. And inside the linear model there's nothing new here: we have log population with a coefficient for it, a dummy variable for contact rate, whether an island had high contact with its neighbors or not, and then an interaction term to deal with any moderating influence. For example, you might think that having a small population isn't as bad if you have high contact. And that's an interaction hypothesis, right? You guys with me? I love this data set, by the way. It's tiny, so it's easy to think with. Obviously we want more data, but it's nice to have small data sets for teaching, since you can see it all on the same screen. Okay, and then of course we've got to have priors. And these are ye olde regularizing priors. Alpha is flat, because we have no idea where intercepts end up, right? Let the intercept land where it may; let the slopes guide it. Then we regularize the slopes, so the model is tame. There's very little data here, so actually my preference would be to use even stronger regularizing priors than these; these aren't all that strong. I think I've told some of you before: a Gaussian prior centered on zero with a standard deviation of one is the posterior distribution you would get if you had observed one Gaussian observation at zero in a previous analysis that started with a flat prior. So it's very weak; it's the amount of evidence equivalent to one observation at zero. Nevertheless, it helps you fit models, right? I would prefer something smaller, like 0.1, in this case, because there's not a lot of data and there's overfitting risk here, right? Of course, you're not going to get more islands. At least, you might have to wait a long time to get more oceanic islands. But it's okay. We set up the variables we need ahead of time: construct log pop. This is always a good idea. You can get away with skipping it in some cases, but it's always a good idea to construct your transformed variables before you plug them into the model. Then construct a dummy variable for contact and call it contact high. All implied jokes are intended. Then we fit the map model.
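Here is a sketch of that fit, along the lines of what's on the slide, assuming the Kline data frame d loaded above; the parameter names (a, bp, bc, bpc) and the exact prior widths are just the conventions described, not anything definitive:

```r
# Construct the transformed predictors ahead of time
d$log_pop      <- log(d$population)
d$contact_high <- ifelse(d$contact == "high", 1, 0)

m_tools <- map(
    alist(
        total_tools ~ dpois(lambda),
        log(lambda) <- a + bp*log_pop +
            bc*contact_high + bpc*contact_high*log_pop,
        a ~ dnorm(0, 100),            # effectively flat intercept
        c(bp, bc, bpc) ~ dnorm(0, 1)  # weakly regularizing slopes
    ),
    data = d
)
precis(m_tools)
```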
No surprises here. Log of lambda is the linear model. You can break linear models across lines; in fact, it's often a good idea. Always put the plus at the end of a line, not at the start. Some of you who are strong in R will know why: if you put the plus at the start of the next line, R decides the previous line was already complete and then chokes on the plus; the continuation prompt eats it. This is maddening. Absolutely maddening. You might curse the New Zealand programmers who designed it this way, but it's just how it is. Always put the plus at the end of a line, not at the start. You've seen this vector notation for priors before: I'm saying all three of those parameters get that prior. That's what that means. Okay. Fitting works great here, no problems, and I'm showing you the marginal posterior distributions. Here's another case where you have to be wary of marginal estimates. If you just look at the marginal posteriors for bc, the main effect of contact, and for the interaction of contact with log population, it looks like neither matters much: they're really close to zero and they straddle both sides by a huge amount. That is not true. Contact matters here, and the interaction might matter, because these two things are highly correlated. I'm going to show you that on the next slide. It's clear population matters, right? It's positive and has a tight confidence interval. Log population is strongly associated with the complexity of toolkits across oceanic societies. Let me show you what's going on. Notice I put up the correlations here to show you that these two, bc and bpc, are almost perfectly negatively correlated with one another. That's why you get tricked by the marginal posterior distribution. Remember, I'm trying to program you to distrust coefficients. If you look at the pairs plot, you can really see this: see that beautiful little wedge along the diagonal for those two parameters. If one is small, the other is big. So even though each one marginally straddles zero, when you combine them and multiply them by the right predictor values, they may affect prediction a lot. You want to construct predictions to see that. How do you do that? Here, you could use link to do this, but I'm showing you how to do it yourself. Extract the samples and just plug the linear model in, but now it's not logistic, because the inverse link is exp, which undoes a log. Logistic undoes a logit transform; exp undoes a log transform. It's the reverse operation. The inverse link here is exp, so you put the linear model into the exp function and you get the expected count out: lambda-high is for islands with a high contact rate, lambda-low for islands with a low contact rate. Then we can construct the contrast, the posterior distribution of the difference in expected total tools between islands with high and low contact. Notice I plugged in an 8 up there; that's the log population, just as an example. You'd want to do this across a range of values, and we'll do that in a slide in a moment. What I'll show you here is that 95% of the posterior probability is above zero. So the model is actually saying, on the prediction scale, that islands with high contact have more tools. But you couldn't see that in the marginal posterior, because the marginal posterior lies to you. It doesn't show you the covariation. It just doesn't. This is, again, William Thomson's tide machine. These models are like tide machines: you can't just look at the gears at the bottom and figure out what it says about the tides on top. You've got to look at the top.
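Here is a minimal sketch of doing that by hand rather than with link, assuming the m_tools fit from the sketch above; the 8 is the example log-population value just mentioned:

```r
post <- extract.samples(m_tools)

# Inverse link is exp(): push each posterior sample of the linear model
# back to the count scale, for high- and low-contact islands at log pop = 8
lambda_high <- exp(post$a + post$bc + (post$bp + post$bpc) * 8)
lambda_low  <- exp(post$a + post$bp * 8)

# Posterior distribution of the difference in expected tool counts
diff_tools <- lambda_high - lambda_low
mean(diff_tools > 0)   # proportion of the posterior above zero
```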
You tune the gears. So if you look at this distribution on the bottom left, this is the posterior distribution of the difference in expected number of tools between high-contact and low-contact islands. It's not all above zero, but this is good evidence that contact rate is associated with more complex toolkits, controlling for population size. And then, as a consequence, here I'm showing you the joint posterior distribution of the two parameters, with the bars on the axes showing the margins. You get tricked by the margins, because see where zero is? The average is near zero and it's wide on both sides, and the same for the other one: near zero, wide on both sides. But when this one's big, the other one's really small, and when this one's really big, the other one's really small. So you get tricked. They contain similar information; that's why you get this pattern. Okay. Now I'm making you wonder about every paper you've ever read, hopefully, right? Because many of them contain only tables of coefficients. You can go really wrong with this stuff. Yeah. We should pay a large group of statisticians to audit the journals Science and Nature for stuff like this, actually. I think it'd be fun. I'd only do it after 10 years, though. But anyway, I've got only a couple of minutes, so I'll finish this example, and that'll be plenty for you to do your homework. So we do the model comparison. All I want to say about this is that WAIC is constructed on the prediction scale, so it automatically takes account of the correlations in the posterior distribution, and it doesn't get fooled the way the coefficient table can trick you. Right? And what the model comparison says is that log population really matters a lot. Contact might matter. And there's some evidence of an interaction, the moderating effect of contact on the effect of log population. So you can see the highly-ranked models contain both log pop and contact, but the top-ranked model doesn't contain the interaction. See that? But you get a lot of model weight for both. This is what the predictions look like. The horizontal axis here is log population, so that's the magnitude of the population, and total tools, on the natural scale, is plotted on the vertical. So here's the actual raw data, the different societies. This is Hawaii up here in the upper right: really big, with a really complicated tool set. I mean, a really extraordinary tool set, actually. And contact: open circles mean low contact. So Hawaii had low contact. Why? Because it was out in the middle of nowhere in the Pacific. That's the technical term in anthropology. No, it was just discovered really late, as a consequence, because it's really hard to discover. It took forever. The other oceanic islands were discovered really rapidly, because people just get blown into them in that part of the Pacific. And the filled circles are the high-contact islands. Then I'm showing you the two prediction envelopes for the expected number of tools and the 95% interval of that expectation: the blue line and the blue region around it are for high-contact islands at given log population sizes, and then there's a corresponding envelope for low contact as well. And you can take Hawaii out, by the way, and it barely changes the estimates; there's almost no effect. Hawaii looks like an outlier, but actually it's dead in the middle of the prediction, so it's not exerting much leverage.
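Here is a sketch of that leverage check, assuming the m_tools fit and the d data frame from the earlier sketches (the culture column is part of the Kline data):

```r
# Refit without Hawaii and compare the coefficients side by side
d_no_hawaii <- d[d$culture != "Hawaii", ]

m_tools_nh <- map(
    alist(
        total_tools ~ dpois(lambda),
        log(lambda) <- a + bp*log_pop +
            bc*contact_high + bpc*contact_high*log_pop,
        a ~ dnorm(0, 100),
        c(bp, bc, bpc) ~ dnorm(0, 1)
    ),
    data = d_no_hawaii
)
coeftab(m_tools, m_tools_nh)
```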
For contact there's less evidence, but the evidence is consistent with contact mattering; it's just hard to say how important it is. Okay. With that, I've got extra stuff that I'm not going to get to, so let me very quickly get to your homework slide. Hang on. And also I want to show somebody an illuminated manuscript. Yeah, there. I said I would put one in here. There you go, just for you. All right, your homework. What I didn't get to was exposure: how to do exposures. There's a section at the end of the chapter, and I encourage you to read it. Sometimes the observation window for a Poisson variable differs across cases, and you'll want to cope with that. It's easy; it's called an exposure, or an offset. Take a look at it. You don't need it for your homework, though. For your homework, you're going to analyze pirate eagles. Bald eagles are bastards, as I was explaining to my office mates earlier: big bald eagles nearly always steal food from smaller eagles. Seriously. Bald eagles, symbol of America. Sorry, everyone. So you're going to analyze a behavioral ecology data set on eagles pirating food from other eagles, and that will help you practice your binomial regression skills. There's also a cool salamander density data set from Northern California, the Del Norte salamander, and you'll practice Poisson regression with that. Have fun with it, and I'll see you guys next week.
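Since exposures only got a passing mention, here is a hypothetical sketch of what an offset looks like, with made-up counts and observation windows (nothing to do with the homework data sets):

```r
library(rethinking)

# Counts collected over observation windows of different lengths:
# adding log(days) to the linear model with no coefficient attached
# (an "offset") turns lambda into a per-day rate times the exposure.
sim_d <- data.frame(
    y    = c(12, 25, 3, 30, 7),   # made-up event counts
    days = c(7, 14, 2, 21, 5)     # made-up window lengths
)
sim_d$log_days <- log(sim_d$days)

m_offset <- map(
    alist(
        y ~ dpois(lambda),
        log(lambda) <- log_days + a,   # offset term: no parameter on log_days
        a ~ dnorm(0, 10)
    ),
    data = sim_d
)
exp(coef(m_offset)["a"])   # implied events per day
```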