We're going to resume with generalized linear models, but let me give you a little bit of motivation to get back into the boring math. So these are fireflies. They're not real fireflies, they're simulated fireflies. Many of you may know that fireflies in nature do this amazing thing: they synchronize their flashes. This is most famous in Southeast Asia, but it also happens in the Appalachians in North America. The whole forest will just flash at once. It's very spooky if you're hiking through. And obviously there's no parliament of the fireflies. This is a simulation where you see them slowly synchronizing into a rhythmic, discrete phenomenon without any central coordination. What's happening beneath this is that each firefly has a little clock, a little physiological clock, and when it hits 12, the firefly flashes. But each firefly adjusts its clock forward a little bit whenever a firefly near it flashes. This is all that's required to get perfect synchrony and hold it in such a system. And lots of systems do this. Your heartbeat does this. Lots of cell signaling works this way, right? Your heart, when it beats, you've got a bunch of muscle cells that all contract at once. When they get out of rhythm, how do they get it back? They use a system like this. And why am I telling you this? Because we're studying discrete phenomena now, and the fun thing about discrete phenomena in nature is that underneath they are never discrete. The system that generates the discrete, regular, integer-like behavior is internally not like that. Just like these wonderful GLMs that I'm teaching you, right? Under the hood they're definitely not discrete and simple. Another example that I like: the asteroid belt. You've heard of it? Yeah. Our solar system has this belt of asteroids. The interesting thing about it is that it's actually a number of belts close together, like Saturn's rings, which is a similar phenomenon for a similar reason. There are gaps, really significant gaps.
And these gaps in the asteroid belt occur at even integer ratios, resonances, of Jupiter's orbit. It's like, you know, God spiked the integers here. There's something special about two-to-one and three-to-one. No asteroids can live at those even resonances. Why? Well, let me give you the quick and unsatisfying version, because this is not an astronomy class. Imagine you're in a swing. Remember when you were a kid, those swings that you sit in and rock, right? And Jupiter is a very large parent behind you who can push you every time you come back. If Jupiter always pushes you at the same point in your swing, you end up going higher every time. If Jupiter keeps taking a step forward and back and pushing you at a different point in your swing, it'll dampen. So this is the way that Jupiter pushes asteroids out of orbits. It pushes them out of orbit when it gets to push them at the same point in the orbit every time. That's what's called a resonance. So if you're at, for example, a three-to-one, that means the asteroid goes around the sun three times for every one time Jupiter goes around. And every time the asteroid passes near Jupiter, the asteroid will be at the same place in its orbit, and eventually it gets boosted out of its orbit as a consequence. Other places, where it's not in some integer resonance, you can stay in your orbit and live there and be found as an asteroid by some little monkey on a planet with a big lens. Yeah. And that is this cool thing. I find this stuff really exciting because I'm a real nerd. This is just an example. Nature is really full of discrete integer phenomena, but the systems that generate them are not discrete integer phenomena. They're complicated.
And so when we turn to these statistical models, like this logit monster that we're working with, I would just like you to see, amid all the bother in this thing, that yeah, there are these nice integer things we're counting at the top. Why can't the bottom be nice and integer-like too? Because nature isn't like that. This is maybe annoying to you, but so is figuring out the asteroid belt and everything else. This is just how it is. Fireflies, all of this stuff: discrete, regular behavior in nature is not the product of discrete, regular systems. It's the product of some continuous interacting dynamical system underneath. And this is much simpler than Jupiter resonances, really. Yeah. We don't have to solve gravity models. So let's come back to the logit model. Hopefully that has inspired you to put up with a little bit of the mechanics here. To remind you what we're doing: we're modeling this chimpanzee pro-sociality experiment. The outcome variable is pulled-left, whether the chimpanzee pulls the left lever, and we've got a couple of things to predict that with. We're dealing with getting sensible priors for a logistic regression, just to set it up. What I showed you at the end last time was how to think about the slope, and I'll show you that on the next slide to remind you, so you don't have to remember. And now we're also thinking about treatments. There are four treatments in this experiment. Let me remind you: there can be a partner at the table, another chimpanzee who might receive the other piece of food, or not; and the extra piece of food can be on the left or the right. So you get four treatments, because there are four combinations of those two binary things. We want to measure unique log odds for each of those. So we set up the model this way, with treatment as an index variable. I had just finished justifying this alpha ~ dnorm(0, 1.5), because, what's the lesson? I'll show you this picture again on the next slide.
A flat prior on the logit scale is definitely not flat on the probability scale. It's about as unflat as you could possibly get, in fact. And now we've got the same problem for treatments. With treatments, usually what we're thinking of is the differences between them. That's our scientific target of interest. So when we do a prior predictive simulation on treatment effects, what you want to look at in the prior predictive is the distribution of differences implied by your prior. So let's do this now. We take this model and we do simulations. All the code to do this is in the book, so if there's any mystery, look there. But really all it is is this wonderful function called rnorm. Yeah, you may have heard of it, right? Dances through your dreams. You just use rnorm to sample some parameter values and then look at differences. And that's it. The code to do this is in the book. It is dead simple. To remind you, on the left is simulating the intercept alpha. The lesson from the end of the previous lecture was that if we use something like Normal(0, 10), which you might think of as pretty flat and neutral, on the probability scale it is definitely not, because it puts nearly all the mass on 0 and 1. And that's not what we want to start with. If we use Normal(0, 1.5), we get something that's approximately flat on the probability scale. Now on the right we're looking at these treatment effects. It's a very similar thing, but now we're looking at differences. So the horizontal axis now is the prior difference between any two randomly selected treatments. It doesn't matter which two, because they're all exchangeable in the prior. Yeah, 1, 2, 3, 4 are just random names. Right? Joe Bob, Mary Sue. They're just treatments, and they're all exchangeable. So we just subtract one from the other and we plot the density of that difference. It's got to be between 0 and 1 on the probability scale. Right?
Because the biggest difference is 1: one treatment could be it never happens, the other treatment could be it always happens. So your biggest difference is 1, and your smallest difference is 0. Yeah? This is the absolute difference, right? We're taking the absolute value of the difference. And the black density is the Normal(0, 10) prior. Again, you get this thing where, if you use something wide on the logit scale, then on the probability scale your prior says that the treatments are all either identical to one another or completely different. And that's not what you want to assume, I think. Yeah? Now somebody will come to me with an example where that is exactly the right thing to assume. But in this case, it's not. In this case, we want a prior that says the treatments probably aren't that different from one another. Why? Because we've done a lot of experiments like this in primatology, and we know it's hard to get chimpanzees to do things. So it's not going to have that big of an effect. Something like a standard deviation of a half gives you differences which are nearer 0. Right? So to get differences near 0, you need a tighter scale parameter on the prior. Does this make some sense? Yeah? Exciting? Not as exciting as fireflies, I know. Fireflies are better. You can watch that later. But I wanted to say, with that, the firefly sim is from Nicky Case, who has the most amazing website full of interactive simulations of that sort. The URL was on the slide. During your lunchtime, I recommend touring Nicky's website. Okay. So let's run this model finally. You know how to run these models. Stan's going to do all the hard work while you have a cup of coffee, right? Actually, almost all the time will be in compiling. It'll take 30 seconds to compile, then it'll execute in half a second, because it's not that big a data set. We ran four chains here, and here's the summary.
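As a sketch of that prior predictive simulation, here is the same idea in Python rather than the book's R (rnorm becomes numpy's normal draws; the priors, Normal(0, 10) versus Normal(0, 1.5) for the intercept and Normal(0, 0.5) for treatments, follow the lecture, but the code itself is just an illustration, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(2023)

def inv_logit(x):
    """Map log-odds to probability."""
    return 1.0 / (1.0 + np.exp(-x))

n = 10_000

# Intercept: a wide prior on the logit scale piles mass near 0 and 1 ...
p_wide = inv_logit(rng.normal(0.0, 10.0, n))
# ... while Normal(0, 1.5) is roughly flat on the probability scale.
p_flat = inv_logit(rng.normal(0.0, 1.5, n))

# Treatment differences: draw two exchangeable treatment values and look
# at the implied absolute difference on the probability scale.
diff_wide = np.abs(inv_logit(rng.normal(0, 10, n)) - inv_logit(rng.normal(0, 10, n)))
diff_tight = np.abs(inv_logit(rng.normal(0, 0.5, n)) - inv_logit(rng.normal(0, 0.5, n)))

# The wide prior says treatments are either identical or completely different;
# the tighter prior keeps most differences near zero.
print(round(float(diff_wide.mean()), 2), round(float(diff_tight.mean()), 2))
```

Plotting the densities of `diff_wide` and `diff_tight` reproduces the shape of the right-hand panel on the slide.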
You're going to get seven chimpanzee parameters and four treatment parameters. I'm going to plot these in a second, so don't feel like you have to get it all from this table, but I'll show you what to expect from this output. Chimpanzees are numbered 1 to 7, and each posterior mean is on the logit scale, remember. So this is the log-odds handedness preference, basically, of each chimpanzee across all of the treatments. A number above 0 means they pull the left lever more than chance; below 0 means they pull the left lever less than chance. Yeah. And you can see there's a tendency towards right-handedness, because that's how chimpanzees are, just like people. Although, also like people, you can train them to any handedness you like, including ambidexterity. It's no problem. So look at number 2. I feel a great deal of affection for chimpanzee number 2 in this experiment. Chimpanzee number 2 knows what chimpanzee number 2 wants, and that is the left lever, right? And then there are four treatments, and we'll plot those up in a second to get an idea of what's going on. These are the average log-odds deviations, after you account for an individual's handedness, for each treatment: how each treatment nudges individuals around. Okay. Let's look at the individual difference parameters. Here we extract the samples, and then we transform them to the probability scale using the inverse logit. That's at the top. And then we can just dump this into precis and plot it. precis will plot any rectangular grid of numbers you want to give it, right? So there's number 2. We call number 2 Lefty, right? If you look at number 2's data in this data set, number 2 never pulled the right lever in any treatment. As I said, number 2 knows what number 2 wants, and that is the left lever. And the other individuals have some more variation.
You can see there's a slight tendency towards right-handedness among these seven chimpanzees, but two of the individuals are definitely left-handed. Number 6 probably has almost no handedness preference, really, in this sample. Do these parameters make sense to you? Why do we care about this? We don't actually care about handedness, but it adds noise, right? It's harder to see the treatment effects if you don't estimate these individual effects, which are just adding measurement error to the treatments. Does this make sense? It's not technically a confound in the backdoor-criterion sense of a confound, but it makes measurement harder, and that's why we care about these individual-effect parameters. The backdoor criterion doesn't tell you that you have to control for handedness, but controlling for handedness gives you a more precise estimate. Is that clear? Yeah? Okay. We don't need the backdoor criterion because this is an experiment: we decided where the food was, and all of that, right? So now we look at treatment effects. For the codes on the left, R is right, L is left, N means no partner, P means partner. Those are the four combinations. Yeah? So what's going on here? These are the effects, and you have to wrap your mind around the meaning of this experiment and try to figure out what's happening in this thing. We're going to try to do this, but before we try, I want to remind you that this is an example of how hard it is to figure out what happened in an experiment by reading individual parameters. Even in something this simple, it's already hard, right? Your mind is burning. I know mine is. Of course, I'm standing next to the blackboard, and the radiation makes me dumber. So what's happening? Let's think about the contrasts of interest. The contrasts of interest are the interactions.
So we want to know: it isn't enough to conclude that the chimpanzees have pro-social intent just because they choose the pro-social option. You need them to choose the pro-social option more when there's a partner present. Does that make sense? Because they could just be choosing the pro-social option more because, ooh, there's more food on that side of the table. Yeah? And that actually seems to be what's going on in this experiment. So how do you figure that out? Well, let's do the easy part first. Let's look at the two treatments with L in front. When the pro-social option is on the left, we expect higher estimates, right, because they're attracted to the pro-social option, and that's true. But there isn't much difference between the partner being present and absent. You see that in the second and fourth rows of this table. For the other two treatments, we expect the bias to go the other direction, because now those two sweet grapes, I think they were grapes, were on the right-hand side, so we expect them to prefer pulling on the right-hand side, and they do. There's more of a difference here, but no statistically strong difference. It's incredibly uncertain. So again, there just isn't much of an interaction effect to write home about here. Partner presence or absence doesn't have a big effect. Which side the pro-social option is on has a big effect on chimp behavior. However, it's not nearly as big as the handedness effects, which are most of the variation in the data. Now, that's all hard, I think, just to read from that graph. It's difficult. It's much easier if you just plot the stuff on the outcome scale. This is my mantra. I know you're sick of me saying this, but I will just say it forever: if you want to understand the behavior of a statistical model, you should make it behave on the outcome scale.
Push the posterior distribution out through the model onto the prediction scale, and then it's nearly always much, much easier to understand what's happening than if you interpret the parameters directly. So this is the raw data. I'm going to start here. This is what the raw data look like, with some added lines to help you see the structure of interest. Each group going left to right is an individual actor, and I've taken each actor's pulls across all blocks and averaged them to a single point, just as a representation. And the different treatments are shown here; you can see the key on the far left. An open point means there was no partner present; a filled point means a partner was present. And then we've got these two groups connected by lines, depending on whether the pro-social option was on the left or the right. Does that make sense? You see that? And as you scan across, skip over actor number two. Again, actor number two knows what actor number two wants. For the others, you can see that there is one group which consistently deviates up for every individual, which means more pulls on the left, and that's when the pro-social option is on the left. So there's a handedness preference, but even individuals who are right-handed tend to pull the left lever more when there's more food on that side of the table. That's what separates those two groups of points with the lines in them. Does that make sense? Yeah? But you'll notice that the lines are pretty horizontal in most cases. If the lines tilted more, that would mean there was an interaction with partner presence: you'd get a big change in pulls when you added a partner, and you don't see that for the most part. The lines tip around a little, but there's no consistent pattern of the partner pulling things in a particular direction. That's the raw data, right? So now you're thinking, well, why do we run a stats model?
Well, you have to deal with the finite sample size and all the potential confounds and measurement issues. The raw data is not a substitute for the stats model; it's a complement. Now let's look at the posterior predictions. What does the model think after you've trained it on this sample? This is what it thinks. Same structure, same meaning, just all in black now instead of blue. Notice that every chimpanzee has the same partner effect. Why? Because that's what the model says. The model doesn't allow chimps to vary in how they respond to the partner treatment. We will do that in a couple of weeks. I'll come back to this data set and we'll let everything vary by actor. But we're not going to do that today. Today it's a simple logistic regression. You can see the model sees a lot of evidence that there are more pulls left when the pro-social option is on the left, but there's really not much effect of the partner. In fact, if anything, on average, adding the partner reduced the tendency to pull left. Make sense? Yeah. So that's the conclusion of this. I want to add two things. Look at actor two's predictions. You notice that in reality, actor two never pulled anything but the left lever. The model is not sure that, if we ran the experiment forever, that would always be true. You'll notice the posterior distribution allows the possibility that, in the future, actor number two might pull the right lever once in a while. It's different from the raw data. That's because the model is learning from a finite sample. So even though the data has no variation in it, the posterior distribution does. Make sense? The other thing to note, about the meaning of this: the experiment doesn't say that these chimps wouldn't care about the partner in any context. It's just in this experiment. Human kids, by contrast, care incredibly about the presence of the partner. You do this with five-year-old kids, and adding the partner changes everything.
And then they always choose the pro-social option. But chimpanzees are unresponsive. That doesn't mean chimpanzees are assholes. These are scientific terms. It just means they don't respond to these things in the same way, and that's part of what this experiment is probing. Okay. If you want to do model comparison with these ulam models, for the time being at least, you're going to have to add this log_lik=TRUE, that's the log-likelihood-equals-true argument. That adds to the Stan model some extra code that computes the log probability of each observation as the model gets sampled. There's a box in the chapter to show you how that actually works and to connect it back to all of the background in the model comparison chapter about the calculations we need to do. Once you have those, they get spit out of the chain, and then calculating WAIC or the Pareto-smoothed leave-one-out cross-validation is almost instantaneous, because all the calculations have already been done while Stan was running. But it's optional, because sometimes it's a really giant grid, and if you don't need it, you might exhaust the memory in your computer. It can be quite big. Think of it this way: there's one probability for every observation in the data set, times the number of samples you take. So you get a giant matrix of probabilities. If you take 10,000 samples, it's 10,000 times the number of observations. So if you run a data set with, say, 20,000 observations, which is not so unusual these days, that's 20,000 times 10,000. It's a big matrix. So if you don't need it, don't add it, right? But once you do that, you can use the compare tools just as before. Here I'm using LOO, which will give you essentially identical results to WAIC. And what I'm comparing is the model we did before to this model at the top, where there are no interactions. And how do I know there are no interactions?
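To see why that log-likelihood matrix can exhaust memory, here's the back-of-the-envelope arithmetic from the numbers above (20,000 observations, 10,000 posterior samples; the 8 bytes per double-precision value is a standard assumption, and the exact memory Stan or rethinking uses will differ in detail):

```python
n_obs = 20_000        # observations in the data set
n_samples = 10_000    # posterior samples kept across chains

cells = n_obs * n_samples        # one log-probability per observation per sample
gigabytes = cells * 8 / 1e9      # 8 bytes per double-precision value

print(cells, gigabytes)          # 200 million cells, about 1.6 GB
```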
Because I've created two new index variables: one, side, which is just the side the pro-social option was on, and the other, cond, which is the condition for whether the partner is present or not. There's no interaction here because there's not a unique parameter for each of those combinations, right? There's one parameter for left, one for right, one for partner present, one for absent, but there's no interaction effect. Yeah, does that make sense? So then we can compare this model's expected out-of-sample accuracy to the previous one, just to confirm what we got from looking at the posterior distribution on the previous slides: the interaction isn't doing any predictive work in these data, because these models are tied. That's the way you should read this table. These models are indistinguishable on the prediction scale. Does that make sense? Yeah? Okay. So you'll get used to logistic regressions and binomial regressions pretty quickly through some examples. One of the tensions in interpretation is that you can talk about effect sizes in at least two important ways, and they're both useful and important. So I'm going to give you five minutes on why you need both and how they're described. I call these the relative and absolute effect scales. What does this mean? When you're looking at differences between parameters on the log-odds scale, these are relative differences. Why? Because you're not talking about the probability that the event happens; it's just the relative difference if you adjust some predictor value, ignoring all the other predictors. So if you're on the log-odds scale in a linear model, you can just talk about relative differences, holding everything else constant. But if you want to predict the rate at which the event happens in the world, now you've got to go to the absolute scale, and now ceiling and floor effects happen.
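The two codings can be sketched like this, in Python for illustration (the names side and cond follow the lecture; the 1-based values mimic the index-variable style of the models, though the exact coding in the book's R code may differ):

```python
# The four treatments are the combinations of two binary factors:
# prosocial side (R or L) and partner (N = none, P = present).
treatments = [("R", "N"), ("L", "N"), ("R", "P"), ("L", "P")]

# Interaction model: one unique parameter index per combination.
treatment_index = {t: i + 1 for i, t in enumerate(treatments)}

# No-interaction model: two separate indices, so the side effect and the
# partner effect each get one parameter, with no combination-specific term.
side_index = {t: 1 if t[0] == "R" else 2 for t in treatments}
cond_index = {t: 1 if t[1] == "N" else 2 for t in treatments}

print(treatment_index, side_index, cond_index)
```

The interaction model can assign the L/P cell its own deviation; the side/cond model forces that cell to be the sum of a left effect and a partner effect, which is exactly what "no interaction" means here.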
What we call the base rate matters: how frequently the thing happens. And so there's this common thing you'll see in lots of scientific papers: proportional odds. You can report relative effects by just exponentiating a parameter or a difference in parameters. This is a relative effect measure, and it can seem really, really big even when the absolute effect is small. This is the famous thing. So, in the example we've just worked through, for the difference between treatments 4 and 2, the proportional odds is this calculation here, which ends up being 0.9. What does that mean? It means 90% of the previous odds. Whatever the previous odds of pulling the left lever were, if you switch from treatment 2 to treatment 4, you expect the odds to fall by 10%. This is the relative effect scale. If that was not totally clear, I'm sympathetic; it's a little bit confusing, but you'll see this all the time in articles. And the risk with this is, of course, that incredibly unimportant things can seem super important on the relative scale. Why? Because of base rate effects. So let me walk you through a parable here that will hopefully anchor this in your mind so you'll remember it. I call this the parable of absolute shark, I mean, sorry, relative shark and absolute penguin. Let's focus on the shark part of this story to begin. Sharks are terrible animals. I'm sorry. No, I mean, I don't want them to be exterminated, but I don't want to be near one either. Australian beachgoers know what I'm talking about. So people are scared of sharks. And you'll see posters made all the time, like this one, which are trying to tell people: don't worry about sharks. Look, if you tally up annual mortality from different wild animals in the world, sharks kill very few people. And this is true. Some of these are quite humorous, right? Deer kill, what's that, 130 people annually.
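The proportional-odds calculation looks like this as a sketch (Python in place of the book's R; a2 here is a made-up log-odds value, and a4 is constructed so the odds ratio comes out to exactly the 0.9 from the slide, so this shows the mechanics, not the real posterior means):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

a2 = 0.35                  # hypothetical posterior mean, treatment 2 (log-odds)
a4 = a2 + math.log(0.9)    # constructed so exp(a4 - a2) is exactly 0.9

odds_ratio = math.exp(a4 - a2)               # relative effect: 90% of prior odds
prob_change = inv_logit(a4) - inv_logit(a2)  # absolute effect on probability

print(round(odds_ratio, 2), round(prob_change, 3))
```

Note the asymmetry: the odds ratio is 0.9 regardless of the base rate, while the absolute probability change depends on where a2 sits on the curve.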
I don't know about these exact numbers, but it's easy to believe that deer kill more people than sharks. Why? Because people are not aquatic, for the most part. There's an exposure effect that comes from other parts of the model. If you were going to model this, this is the base rate effect: your exposure to deer is higher. Even though each individual deer is less dangerous than each individual shark, the total mortality from deer can be higher. That's an absolute effect, right? This is the absolute measurement scale. The absolute danger of a shark, conditioning on the distribution of humans on the planet being mostly on land, is that very few people die from shark attacks every year. But what if you were a penguin? Now the absolute risk from sharks is really severe, and you're much more worried about them than about deer. Yeah? So this is relative shark, absolute penguin. I see hippos at the bottom here. Having done fieldwork in Africa myself, I would like to testify that, yes, hippos are the most dangerous animal on the planet, and you should never go anywhere near one. They're absolutely terrifying. Okay, after that little bit of promotion for hippos. So here's the parable of relative shark and absolute penguin. We're going to think about relative effects. Relative effects are useful, absolutely. You'll see them in the epidemiological literature, misused quite a lot, I think, because they can make really tiny risks seem huge. Why? Because if there's a very rare disease and something you do or eat doubles the risk of that disease, that doesn't make the disease common. You're still probably not going to get it, even though there's a doubling of risk. That's a relative effect. And this turns out to be true even of things that are kind of counterintuitive, like lung cancer from smoking. Lung cancer is rare, but smoking vastly increases your risk of it.
Something like three-quarters of all lung cancer is caused by smoking. But lung cancer is still rare, even if you smoke. Smoking is bad mainly because of heart disease. It's bad; I'm not saying it's great, but it's not mainly the lung cancer. But you really need relative effects to do causal inference, because you've got to transport results to contexts in which there are different base rates. And so this is absolute penguin. The absolute penguin is absolutely delicious and is absolutely worried about sharks. Why? Because the penguin doesn't live on the land. The danger of sharks, conditioning on being in the water, is what you care about in that case. So neither relative nor absolute effects are the only thing you care about in all contexts. You need to think about both, and just be careful. An example of this base rate effect for us: we don't live in the water, so we don't worry about sharks, right? Unless, again, you're an Australian beachgoer; they do worry about sharks. And poisonous things; everything in Australia is poisonous or deadly, right? But we do worry about lots of public health risks, and our newspapers are full of all kinds of alarmist headlines. So here's a famous case from the United Kingdom, not so long ago, I think the early 2000s, where there was a study of the way that a certain kind of oral contraceptive, a birth control pill, increased the rate of fatal blood clots in women. This paper came out, and it turns out that, on average, these blood clots develop in the absence of birth control in one in a thousand women. Actually it wasn't even that high; I've taken off some zeros just so the counting is easier here. It's more like one in 10,000. And for women on birth control it was instead three in a thousand.
This is a 200% increase in the rate of blood clots, and that's what was reported in the newspapers, and a lot of women stopped taking their birth control and subsequently got pregnant as a consequence. Your risk of dying from pregnancy is vastly higher than three in a thousand. Sorry, I shouldn't be saying things like this; it's up to you. You knew that. You're all biologists, so you knew that. So there's an irony here: the miscommunication of risk resulted in a lot of maladaptive behavior on the part of the public. The relative scale says 200%; on the absolute scale it's a change of 0.002 in probability. Two in a thousand. Does that make sense? This is the distinction. This happens any time the condition of interest, like a shark attack, has a really low base rate: a change in relative risk has almost no effect on the absolute scale. Does this make sense? But for a common thing, the same relative risk change can be incredibly important. So things you can do to make traveling by car safer are absolutely worth doing. That would include, by the way, wearing a helmet. If you think wearing a helmet while you're biking is a good idea, and it is, wearing a helmet while you're riding in a car is an even better idea. You're welcome. Good weekend thought for you. There was somebody at UC Davis who used to study this, so I get that anecdote from them. Anyway, let's move on. New data set. Remember the penguin; it will help you out of these traps. Just keep the penguin in mind. Are you a penguin or a shark today? That's what you want to think about. So let's look at another data set to exercise binomial regression. Binomial regressions come in two major flavors. The first is the logistic regression flavor I just showed you, where the outcomes are decomposed into zero/one trials, often called Bernoulli trials. In that form it's usually called a logistic regression.
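The blood-clot numbers, as arithmetic, using the simplified one-in-a-thousand rates from the story:

```python
base_rate = 1 / 1000     # clot rate without the pill (simplified from 1 in 10,000)
pill_rate = 3 / 1000     # clot rate on the pill

relative_increase = (pill_rate - base_rate) / base_rate  # the headline number
absolute_increase = pill_rate - base_rate                # the penguin's number

print(f"{relative_increase:.0%} relative, {absolute_increase:.3f} absolute")
```

The same 200% relative increase applied to a common condition, say a base rate of 0.3, would be an absolute change of 0.6, which is why you have to ask about the base rate before reacting to a relative-risk headline.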
Mathematically, it's the same as this other flavor, in which the data are just arranged in a different way. I would call this an aggregated binomial. Usually it's just called binomial, but I want to distinguish between the two. And before I get into the details and the example, I want you to know that these are the same kind of model. It's just the way the data are coded, and you can translate between the two whenever you like. There's really no difference at all, and you'll get the same posterior distribution in both cases, whether you break the data apart into zero/one trials or instead aggregate them. On the next slide I'll show you what aggregated data look like. The data set we're going to use is a historical data set, famous in statistics and nowhere else but statistics. It's the 1973 graduate admissions data from the University of California, Berkeley. The legend here, and I tell you a little bit about this in the text, is that a dean at UC Berkeley was worried that they might get sued for gender discrimination. And so he asked a statistician. This is one of the rare cases in which a dean contacted a statistician before doing anything: to look at the graduate admissions data for the university and see if there was any evidence of bias in admissions by gender. And this is a very interesting data set, because a high-profile publication came out of it, and it illustrates a number of very important statistical principles. This data set is built into the rethinking package. I call it UCBadmit: UCB, UC Berkeley. The Bears, yeah. And let's call in the statisticians and see what happens in this data set. This is the whole data set, right here. It's a bunch of binomial trials, or Bernoulli trials. Each application is like a coin flip, and there's a probability that the candidate is admitted. We want to condition that probability on things we might know about the candidate.
In this case, just their gender. Yeah, which is what we're focusing on. It's what the dean was focused on. And there are one, two, three, four, five, six departments, anonymized to be A through F. You might be able to figure out what they are later; there are clues in the data set. And then there's an applicant gender column, self-identified applicant gender from the application. And then we get to the aggregation. The admit column is the total number in that combination of department and gender who were admitted, who were offered admission. This is not whether they accepted the offer; it's whether they were offered admission. And then the reject column is the complement of this. It's all of the applications that were rejected. And then the last column is just the total of the previous two. It's the number of trials. You could take these data and disaggregate them into a bunch of zero/one trials, but it would be a really long table, because there would be a row for every application. And how many rows would you get? It would be the sum of the last column. I don't know how many that is, but it's a lot. No, that's fine. Your computer can handle it. But it's a lot. So this is a nice compact way. If you take all the cases that share the same covariate values and put them together, aggregate them into the same row, you get a shorter table. You can run exactly the same model either way. It's up to you. Does this make sense so far? So this is the aggregated form. And here's the model we're going to use. Let's just start by considering a model that has just an index for the applicant's gender. And the way I'm coding this is male is one and female is two. The thing to notice now is, when we do the model specification, in the binomial up top there's an N now, whereas previously there was a 1 there. This is the number of trials.
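Here's why the two codings give the same posterior. A toy sketch in Python (the lecture's models are in R; the counts here are made up): the aggregated binomial log-likelihood and the sum of the disaggregated Bernoulli log-likelihoods differ only by the log binomial coefficient, which doesn't involve the probability parameter at all.

```python
import math

# Toy aggregated row with made-up numbers: 10 applications, 4 admitted,
# evaluated at some candidate probability p.
n, k, p = 10, 4, 0.35

# Aggregated binomial log-likelihood for this row.
log_binom = (math.log(math.comb(n, k))
             + k * math.log(p) + (n - k) * math.log(1 - p))

# Disaggregated version: one 0/1 Bernoulli row per application.
rows = [1] * k + [0] * (n - k)
log_bern = sum(math.log(p if y else 1 - p) for y in rows)

# The difference is exactly log C(n, k), a constant in p, so the
# posterior for p is identical under either coding.
print(log_binom - log_bern, math.log(math.comb(n, k)))
```

Because that constant doesn't depend on p, it cancels when you normalize the posterior, which is the sense in which the two forms are "the same model."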
Now there's a variable number of trials on each row, and so that's data. We put a variable from the data table in this spot. It's the number of applications, that final column in the data frame. And then the rest of this is the same. You've got the same problems with the priors. You can specify them the same way. Do a prior predictive simulation. Everything else works out the same way. We run this model and get the marginal posterior distributions of these two gender intercepts. Remember, one is male, two is female. This is log-odds. So again you're going to exercise your psychic powers and try to peer into the log-odds space and figure out what this means. Lower numbers have lower probability on the outcome scale. Males had minus 0.22 and females minus 0.83. So males had a higher average rate of admission in the sample. So this looks like some kind of bias on the first cut. Is it? Let's think about the penguin. So we want to calculate the relative and absolute effect sizes here. So here's the exercise. We extract the posterior distribution. The first thing I do is calculate the difference in the a parameters. This is the difference in log-odds between a male and female candidate. And then I calculate the difference in probability of admission. That's the absolute measure. So shark, penguin, right? Shark, penguin. Remember that. And then I can summarize using precis. So the average posterior difference in admission is about 0.61 log-odds units, between 0.5 and 0.7, which is a statistically reliable advantage for male candidates in this dataset. Yeah, no doubt about it. And then on the probability scale, it's about 0.14, somewhere between 0.12 and 0.16, which I consider to be a really big advantage. That's a huge advantage. It's like, if you could switch your gender, and you can now, I don't know about then, but now, you would get a 14% advantage in admission. That seems like worth doing. So hang on, the story unravels a little bit here.
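In the lecture this contrast is computed in R with extract.samples and precis; here's the same shark/penguin arithmetic sketched in plain Python. The "posterior draws" below are simulated stand-ins whose means are set to match the intercepts above, so the spreads are an assumption for illustration only.

```python
import math
import random

def inv_logit(x):
    """Map log-odds to probability."""
    return 1 / (1 + math.exp(-x))

random.seed(1)
# Stand-ins for posterior draws of the two gender intercepts (log-odds);
# means chosen to match the lecture's -0.22 (male) and -0.83 (female).
a_male = [random.gauss(-0.22, 0.05) for _ in range(10_000)]
a_female = [random.gauss(-0.83, 0.05) for _ in range(10_000)]

# Shark: relative scale, the difference in log-odds.
diff_logodds = [m - f for m, f in zip(a_male, a_female)]
# Penguin: absolute scale, the difference in admission probability.
diff_prob = [inv_logit(m) - inv_logit(f) for m, f in zip(a_male, a_female)]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(diff_logodds), 2), round(mean(diff_prob), 2))
```

The point of doing the subtraction draw by draw is that the contrast inherits the full posterior uncertainty, on whichever scale you compute it.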
But I think that's a really big effect. So I'm not saying that's a small effect. That's a huge effect. Very few things, like extracurricular activities, have equally large effects. Doing your undergrad at Harvard would have as big an effect, but nothing else, right? Okay. Now let's do the posterior validation check. So we push the posterior distribution out through the model and we get predictions. And let's compare them to the raw data. And that's what this is. This ugly graph is courtesy of the postcheck command in my rethinking package, which, since it works for any ulam model, always looks ugly. I'm sorry. It's just how it is. You've got to make your own. There's nothing aesthetically pleasing about this, I know that. It's just there to motivate you to make your own good graph. Okay. So what are we looking at? Blue is raw data. Black is model-based predictions. The open circle in each case is the posterior mean probability of admission for each case in the data. What's a case in the data? That's a combination of department and gender. So case one, I've labeled them here, is males from department A. Case two is females from department A. And then we get departments B, C, D, E, and F, always male, female, male, female, because that's how the data came. Yeah. And then you'll see the posterior predictions. They go down as you shift from male to female, which is that thing we just measured on the previous slide. Males have a consistent advantage in this sample. But at the department level, nearly always the opposite is true. And you see this? In almost every department, all except one, female applicants were admitted at a higher rate than male applicants. How can this be possible? Well, because of the magic of statistics, this can be possible. You have to think for a second about what's happened here. So, here come the DAGs. You were waiting. You knew they were coming back, right? There's a backdoor path into gender, right?
So, from gender to admissions rate. And in these data, well, there are probably a bunch of backdoor paths, actually. And in this data, one of them is department. Departments have different overall admissions rates. Some departments accept a lot of applicants relative to the size of their applicant pool, like physics. Physics programs get a very small number of applicants, and they take, like, half of them. Social psychology gets lots and lots of applicants, and they take fewer than 10% of them. Conditional on applying, it's much, much harder to win admission to a social psychology program than to a physics program. Now, different people apply to these different programs. And one of the things that's different about those people, on average, is their gender. Lots of other things too, like which state they're from, and lots of other things. So there's one path that we'll focus on here, this backdoor path. If we draw a DAG, we've got a direct arrow from G to A, which is from gender to admission. But there's this backdoor path through department: gender influences which department you apply to. Influences. There's a statistical association, but it's got to go in that direction, because the department doesn't influence your gender. This is the magic of DAGs. They have arrows in them, and the science tells you which way the arrows point. And then department influences your probability of admission, because some departments just don't take very many people. The interpretation of these two paths is completely different legally. One of them is discrimination, and UC Berkeley gets sued. And the other lets them off the hook. So the dean is praying for the backdoor in this case, which is what the statisticians find in this example. I would say there are other backdoors, though. And in your homework, I will have you explore the fact that there are definitely other backdoors. You can augment this DAG.
I promise you that it is hard to make interpretations of what's really going on in these data. That's what I want to convince you of in the homework for this week. So we add another vector of parameters to this model, department. Now for each department we get an average department offset: what's the average admissions rate in each department? We estimate those with the deltas, again with the same Normal(0, 1.5) priors. And before I show you what happens, I know you're in total suspense, let's think about what the previous model really asked. Remember some time ago, probably before Christmas, I had this metaphor that regression models are like hostile oracles or genies. You have to be really, really careful exactly what question you ask them, because they will answer it very literally. They understand your questions in an extremely literal fashion, and they're evil in this regard. And that is true here. You have to realize that the statistical question this first model is asking is not what that direct path is, the arrow from G to A. It's asking: what is the average probability of admission for females and males across all departments? It's asking for the total causal effect of gender, not the discrimination effect. Does this make sense? It's the total effect, through the whole graph. So the total causal effect is what's answered by this model. If you wanted the total causal impact of changing gender, it would be this, because all paths are in play if you do that. Makes sense? But if you want just that one path, the direct arrow, which is what the dean wanted, you've got to use a different model. You've got to close the back door. And so that's what we do with this model. This model asks: what is the average difference in probability of admission for females and males within departments? Conditioning on the department's overall admissions rate, what are the average differences between males and females?
They're different statistical questions, and they correspond to two equally valid, depending upon your purpose, causal questions. They're both causal questions, but they're different ones. The left is the total causal effect of gender. The right is the isolated, direct causal effect of gender, which the dean interprets as discrimination. Yeah? And so we stratify by department with this model. This is just the code for the model that was on the previous slide, the right-hand model. Now we get some estimates. There's a bunch of delta parameters in here. There are six of them, because there are six departments. All I want you to notice about these is that some of them are high. Departments one and two have log-odds of admission greater than zero, which means they accept more than half of their applicants on average. The majority of applications win admission. I think that's electrical engineering and physics, if I remember the original dataset before it was anonymized. I've just de-anonymized it. Sorry, electrical engineering. And then as you go down, we get to department six, which I think is social psychology in this dataset, where the admissions rate is actually very low and most applications are rejected, because they don't have enough slots. They just don't have enough slots for the interest in the field. Now look at the a parameters as before. They're basically the same. There's not much to write home about here. They're both about minus 0.5. There's a little bit of difference, but notice that it breaks in the other direction now. Males are one. They have a lower average rate of admission than females do, which is what we saw on the raw data plot. But it's very small, actually. It's not much of an advantage. And it's only in a couple of departments that females have a big advantage in this dataset.
So again, the shark and penguin calculations, the same code, just to repeat it. On the relative scale, the difference between males and females, conditioning on department, this is the direct path now, is on average minus 0.1. It varies between about minus 0.2 and a little tiny bit above 0. So it's maybe slightly negative, but it's small. On the probability scale, it's on average a 2% disadvantage for males in the sample. But again, it overlaps 0 a little bit. So if there is any advantage for females in this dataset, it's very small. But the direction flipped. So this is a famous statistical phenomenon called Simpson's Paradox, which is when you add a new variable to the model and some previous effect reverses direction. And it can happen endlessly, in fact. You can construct systems in which you can just keep adding variables, and the effect will keep flipping. And it has to do with backdoor paths, and DAGs explain how this happens. How you interpret this reversal depends critically upon the causal model. It could be a spurious reversal, or it could not be, and it depends upon your question. Simpson's Paradox is purely a statistical phenomenon. It isn't a causal phenomenon. So you've got to keep the two separate. In this case, the reversal tells us something about a particular path in the model. So why does this happen? This happens because, as you figured out, females apply to different departments than males on average, and they tend to apply to the departments that are hardest to get into. And so across the whole graduate program at UC Berkeley in the 70s, it's true that the total causal effect of switching your gender to female would hurt your admissions. But within any particular department, it would probably not affect it at all. Does that make sense? So the total causal effect is definitely discriminatory. The system is discriminatory. But the application reviewers are not.
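You can see the whole reversal with nothing but arithmetic. Here's a minimal made-up version of the Berkeley pattern in Python, with two hypothetical departments; all rates and application counts are invented for illustration:

```python
# Within each department women are admitted at a HIGHER rate, but women
# mostly apply to the harder department, so the marginal rates reverse.
admit_rate = {("easy", "M"): 0.50, ("easy", "F"): 0.62,
              ("hard", "M"): 0.08, ("hard", "F"): 0.15}
applications = {("easy", "M"): 800, ("easy", "F"): 200,
                ("hard", "M"): 200, ("hard", "F"): 800}

# Expected admit counts per cell (no sampling needed to see the paradox).
admits = {cell: applications[cell] * admit_rate[cell] for cell in admit_rate}

def marginal_rate(gender):
    total = sum(n for (dept, g), n in applications.items() if g == gender)
    admitted = sum(a for (dept, g), a in admits.items() if g == gender)
    return admitted / total

# Men come out ahead overall, despite doing worse in every department.
print(marginal_rate("M"), marginal_rate("F"))
```

The marginal comparison mixes together who-applies-where with how-applications-are-judged; stratifying by department separates the two.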
They're different interpretations. One is Weberian discrimination, if you will, the total effect. Anybody else? It's Weber. There was this guy named Weber. He founded sociology. Anybody? One person. Two people. Thank you. Sometimes I wonder if I'm hallucinating this stuff that I learned in graduate school. Weber, he was real. He existed. And the other is this direct, agentic version of discrimination, where there's a sexist person who's evaluating applications. And both can exist in these systems, but if you want to do an intervention, you have to figure out which is going on. And again, I have a homework problem for you using a different data set, not this one, but a much more recent one with the same structure, in which I'd like you to explore the same phenomena and also consider other paths, because there are other variables in play here that create confounds. And as you'll see, I think it's very hard to make causal inferences from these analyses of these samples, actually, because they're not experimental, and everything's confounded up. But in this case, to summarize, we've got an indirect pathway, which is very strong in these data sets. Typically, different departments, different disciplines, have very different rates of acceptance. There's much weaker evidence of any direct effect. This doesn't mean there's no discrimination, because there is in a system like this. If we resurrect Weber and ask him if there's discrimination in the system, he'll say yes, there's discrimination in the system. Female candidates are disenfranchised from graduate education as a consequence of this, but it's not because of any animus from an application reviewer. It's just the sociology of the system. But if you're going to do an intervention that fixes that, you need more slots in social psychology. That's the lesson from this data set. All right, that's enough of my sermon.
Again, there's a homework problem I hope you'll find interesting, which is structured similarly. Okay. I want to spend the remaining 10 minutes telling you about Poisson GLMs. And these are like binomial models. The Poisson distribution is a binomial distribution in which the number of trials is unknown and very large. There's some huge population of little particles out there, and some event could happen to any of them, but the probability that the event happens during the period of observation is very small. And so in that case, you can describe the shape of the binomial distribution with one number. We usually use lambda, and lambda is the average number of events. So the expected value of a Poisson distribution, E(y), is lambda, and the variance is also lambda. One parameter describes both the mean and the variance. It's a very fun distribution. Nature is full of Poisson distributions, actually. This is another one of these maximum entropy distributions. It pops up all over the place. So just very quickly, some examples. What's distributed in a Poisson way? Soccer goals. Soccer goals are typically only zero or one, right? Does it get higher than that in league play? Sorry. Fission events, nuclear decay, photons striking a detector, DNA mutations. There are lots of sites where a mutation could happen, and the probability at any one site is very low. And this famous data set, which is actually in the rethinking package now: Prussian soldiers killed by horse kicks. It's the first published analysis using a Poisson regression, done by a Prussian state statistician in the late 1800s. I know, it's very... These are the things that states need to regulate. Prussia needed to regulate. The next war was on the horizon, and it had some horses. Anyway, so... The distribution is named after this fellow on the left, Siméon Denis Poisson. But it was also derived by de Moivre, independently. It's been derived a bunch of times.
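You can check the "binomial with huge n and tiny p" claim numerically. A short Python sketch comparing the two probability mass functions when n is large, p is small, and lambda = n * p:

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k events in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Huge number of "trials", tiny per-trial probability, lambda = n * p = 2.
n, p = 100_000, 2 / 100_000
for k in range(5):
    print(k, round(binom_pmf(k, n, p), 6), round(poisson_pmf(k, 2.0), 6))
```

The two columns agree to several decimal places, which is why one parameter, lambda, is enough when the trial count is unknown but enormous.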
So, I mean, in fact, the rule in statistics is that whoever a distribution, or anything else, is named after, it was discovered first by somebody else. So it's named after the person on the left, and it was discovered by the person on the right. That's just how it goes. The data set I want to use as an example, oceanic tool complexity, comes from an area of research that I care a lot about, and that is cultural evolution, technological evolution. So this is an analysis published by Dr. Michelle Kline, who's at Simon Fraser University. Michelle does ethnographic field work in oceanic societies, mainly in Fiji. She's interested in social evolution in Oceania as a set of natural experiments for studying cultural evolution. So she has this paper where she goes through and finds toolkit complexities, measured in a number of different ways, for historical oceanic societies. And there's an underlying theoretical model here which predicts that cultural complexity should be proportional to the logarithm of the population size, the order of magnitude of the population size. So we want to evaluate this. Our outcome variable here is going to be the total number of unique tool types that were historically present. Remember, this is Stone Age technology. And then we've got the historical population size, measured at the same ethnographic present in which those toolkits are measured. And we've got another variable, the contact rate, which might moderate that. Some of these islands, even though they have small populations, interact frequently through trade with bigger societies, and so that might moderate the influence of population size on them, because they get tools from trade. So we're interested in whether the complexity of the toolkit is proportional to the magnitude of the population, and in the effect of contact rate. So we're going to make a Poisson GLM to do this, and it looks a lot like all the other GLMs.
Total tools is our outcome, with Poisson at the top, and then our lambda is going to be conditioned on things we know about each case i. And then our link is a log link now. What does the log link do? This is the other really common link function in GLMs. The log link ensures that a parameter is positive. So if you put a log link on something like lambda, lambda has to be positive, because it's an expected number of things. It's an expected count, so it's got to be greater than zero. If you put a log link on it, it will be. Why? Because you get lambda back by applying the inverse function, and that means exponentiating. If you exponentiate any real number, you get a positive number. I'll show you this on the next slide. And then we have a linear model, because these are GLMs. We're still in geocentric land here. We have an index variable CID, which is the contact rate ID, one or two for low and high contact rates. And we've got an average, on the log scale, number of tools for each contact rate, and then a slope for each contact rate, times the log population size. And then we've got priors to determine. So the sermon on the priors is coming. Yeah, we're definitely going to have to finish this on Monday. That's okay. Let me ground you well in the log link here. So what's the goal of the log link? The goal is to make sure your parameter is always positive. Otherwise it'll explode. You can't have a negative expected count. So all the inverse of the log link, exponentiation, does is map all negative real numbers to the zero-to-one interval. If you exponentiate anything between negative infinity and zero, it's between zero and one afterwards. E to the zero is one. That's why you think of one as the middle pivot point on a log scale. And then all of the positive real numbers get mapped to the interval from one to infinity. So the scaling is really explosive with a log link. It's exponential. Literally. No metaphor there.
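The link-and-inverse-link relationship is a one-liner worth playing with yourself. A small Python sketch of the mapping just described:

```python
import math

# The linear model can produce any real number; exponentiating it (the
# inverse of the log link) always yields a strictly positive lambda.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    lam = math.exp(x)              # inverse link: linear scale -> count scale
    print(x, lam, math.log(lam))   # log() recovers the linear-model scale

# The pivot: negative reals land in (0, 1), zero maps to exactly 1, and
# positive reals land in (1, infinity), growing literally exponentially.
print(math.exp(-3), math.exp(0), math.exp(3))
```

Notice the asymmetry: the entire negative half of the real line gets squeezed into (0, 1), while each unit step to the right multiplies lambda by e.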
It's literally exponential. And so on the left we've got the scale of our linear model on the horizontal axis, as the value x of your linear model changes, and we've got the outcome measurement scale shown on the vertical. We exponentiate that line on the left and get this exponential curve. And zero on the left scale is mapped to one on this scale. That bold connecting line I've tried to draw is where you go from zero to one. So you get this massive compression of small numbers into this tiny range, and then the other half of the real number line is mapped to the interval from one to infinity. This is a really difficult measurement scale to think about and have intuitions about. And so you know what I'm about to recommend. That's right: simulation. You should simulate to understand the implications of these link functions. So let's figure out some reasonable priors. For priors in log-link models, at least, I find you really need to do prior predictive simulation to see the implications, because of this explosive scaling when you exponentiate things. So let's think about a rudimentary Poisson regression where we've just got an alpha, which is the log expected count. You'd think it would be easy, then, to put a prior on this. So let's try something benign, again, you know this is not benign, because I've already done this joke before, right, something benign like Normal(0, 10), and let's see what happens.
How do we simulate from this prior? We use rnorm. We just sample a bunch of draws with mean zero and standard deviation ten, and we convert them to lambda by exponentiating them. That's the inverse link. The first thing I've done is plot that distribution, in black, on the left here. Doesn't look very flat, does it? And in the text below I've calculated the mean of this distribution. Since everybody here knows scientific notation, you look at that number, and even though you can't say it, you know it's really big. The reason is that the tail of this black distribution goes on for a very long time. This is the explosive scaling of exponentiating a distribution of numbers. So something like a Normal with mean three and standard deviation 0.5 gives you this nice hill, which is actually within the possible outcome space that we'd see in these archaeological and historical ethnographic data sets. So it's a much better kind of prior here. It doesn't fit the sample yet; it's going to move when we fit the sample. But at least it's within the outcome space. The prior mean is not in the billions, or whatever that is, ten to the twelfth. The thing is, this is 9.6 times ten to the twelfth at the bottom. That's a lot of tools. It's like an island covered in tools. There's no place left for people; they're just sleeping on tools. Okay. Slopes are equally hard to intuit, so you can simulate the same way. Again, the code to do this is in the book. On the left I've got slopes on log population, centered. So we calculate log population, then we center it, or standardize it in this case. So a log population of zero doesn't mean zero log population; it means the average log population in the sample. That way we can think about the priors a little more easily. And if we put a Normal(0, 10) prior on this and simulate, you'll see what that implies: explosive scaling as you move away from the mean, either up or down. The direction doesn't bother me; I want to let the data decide whether it goes up. But it's the
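The intercept simulation just described is done with rnorm and exp in R; here's the equivalent sketch in Python, so you can see the explosive mean for yourself (the sample sizes and seed are arbitrary):

```python
import math
import random

random.seed(10)
N = 100_000

# Prior predictive distribution of lambda = exp(alpha) under two priors.
flat_prior = [math.exp(random.gauss(0, 10)) for _ in range(N)]    # Normal(0, 10)
tight_prior = [math.exp(random.gauss(3, 0.5)) for _ in range(N)]  # Normal(3, 0.5)

flat_mean = sum(flat_prior) / N
tight_mean = sum(tight_prior) / N

# A lognormal's mean is exp(mu + sigma^2 / 2): astronomical for (0, 10),
# about exp(3.125), roughly 22.8 expected tools, for (3, 0.5).
print(f"{flat_mean:.3e}", round(tight_mean, 1),
      round(math.exp(3 + 0.5**2 / 2), 1))
```

The Normal(0, 10) prior, which looks harmless on the linear-model scale, implies an absurd expected count after exponentiation, while Normal(3, 0.5) stays in a plausible range of tool counts.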
explosiveness of it that's the problem. You expect that you're going to get a billion tools again, really rapidly, with this kind of scaling. And on the left, something much tighter, like Normal(0, 0.2), is much flatter on the outcome scale now. So there's again this intuition with GLMs: if it's flat on the linear-model scale, it's not going to be flat on the outcome scale. I'm sorry, it's just reality. It's like the fireflies or the asteroids or anything else. It's just how it is. But you have the power of prior predictive simulation to sort these things out. So, tool models. Now we can run these. Nothing surprising in the code. dpois is R's name for the Poisson density, by the way. In the model formula you can use the R function name or you can just write the actual name of the distribution; we'll understand both. And we're going to run two models. The top model there is the intercept-only model, and then the model of interest, the scientific model of interest here, is the model at the bottom. And we can compare the two. The only reason I'm doing this model comparison, the intercept-only model is not interesting to us as a candidate to select, is to show you that once you get into GLMs, especially Poisson GLMs, the effective number of parameters typically has very little to do with the actual parameter count. This is the law in GLMs. All that nice relationship in Gaussian linear regression, where the effective number of parameters, which remember is a measure of your overfitting risk, is very strongly correlated with the actual parameter count, doesn't hold in a GLM. Why? It's ceiling and floor effects, where the data lie near a ceiling or near a floor. If you've got an outcome near zero, then the model can't be flexible in that range, or rather, there's a bunch of parameter values that will all still predict zero in that range. So where the data lie, whether they're near a ceiling or
floor, affects the flexibility of the model and your overfitting risk. And that's just a thing that doesn't happen in the Gaussian case, because it's got the same measure everywhere, and so you don't have these hazards. So I think this is cool, but it's also sort of annoying. PSIS and WAIC will still happily measure your overfitting risk there, in pPSIS and pWAIC. And I want you to see that the more complicated model, m11.10, which is the one with all the extra parameters, actually has less overfitting risk than the model with one parameter. This is not wrong. It's actually pretty common with Poisson models. You just have to sort of roll with it. Reality is way more fun than you ever anticipated, right? This is just how it goes. But it's a consequence, a well-understood consequence, of the way the log link generates predictions. So I'm going to put this up, and I'm going to stop here. We'll pick up with this on Monday. There's a Poisson example in your homework, so you should read this section and finish it, but I will definitely do justice to it on Monday. And now I skip all the slides that I will do on Monday. Sorry. There are going to be cats on Monday, definitely. Come on Monday. Censored cats. And here's your homework. You've got three problems, two new data sets, multiple good DAGs. One of these new data sets is like the admissions data. It's in rethinking version 1.83, so you need to update. These are data from the Netherlands Organization for Scientific Research: scientific grant awards. I think you will enjoy this data set a huge amount. And then please come back next week to learn about the cats. All right, thank you, and have a good weekend.