We're going to pick up talking about GLMs, and I also want to get into models which are technically GLMs but aren't usually taught as such — in particular, ordered logit models, which are really, really useful in psychology, so it'll be a bonus for some of you. They are GLMs in the sense that nearly any statistical model can be represented as a GLM; it's kind of a running joke in statistics, actually: everything's a GLM. That's because all a GLM means, as you've learned, is that you've got some outcome distribution and you hook up linear models to its parameters to determine its shape. So everything's a GLM, because it's a very permissive definition, and then the label doesn't mean much. What it does mean is that when you learn one kind of fancy GLM, like a binomial regression, you are learning concepts which will transfer to the other monsters, as I call them, as we go.

Okay, we finished on this slide before. I had gotten to the punchline with the UC Berkeley admissions data: there's a phenomenon in these data where there's a reversal of the association between one of the predictors and the outcome when you add another predictor. In the case of the UC Berkeley admissions data, when you add department, so that you get a unique intercept for each department, the direction of the association between being a male applicant and the probability of admission reverses, on average. It shrinks nearly to zero, actually, but it reverses — almost all of its mass is on the other side of zero, and it's very, very small.

Simpson's Paradox is a very general phenomenon. It shows up all over the place, and like most paradoxes, there's no paradox at all. Paradox just means it violates our intuition. That wasn't its original definition, but that's what it means now, right? If it surprises you, people call it a paradox. So it's easy to manufacture paradoxes, because the world is very surprising to humans. You can think of Simpson's Paradox as this reversal of a trend when you add an additional predictor variable.

There are tons of examples. Here's a website which goes over all kinds of examples in organizations. I put this up because I think it's funny to plot things with neuroticism as an axis, but neuroticism here means anxiety, concern about things, right? It's usually a good trait in organizations; it's not a bad thing. And my neuroticism score, by the way, is really off the axis — super neurotic. It's a good trait in academics: it makes you double-check your statistics. But this is the general pattern: when you just look at the whole population, without looking at any subgroups in it, the relationship between the two variables — here, neuroticism and salary — is positive. But within each educational subgroup (this is real data from a particular organization that this person works in; the dots are not normalized), neuroticism actually reduces your salary, and the least neurotic individuals at each educational level are the highest paid. On average, right? There's tons of scatter; most of the variation in these data isn't explained by neuroticism or education.

So this happens all the time, the phenomenon of trend reversal. And it doesn't tell you what's true, right? You still have to have some background information about the variables. It could be that there's a real confound here, and the reversal in trend is telling you what's really going on. That's almost certainly the case in the UC Berkeley admissions data.
What's really going on is that men and women applied to very different programs — at least they did in 1973. Most of what's going on in those data is that some programs are really hard to get admitted to, no matter what your gender, and it just turns out that women mainly applied to those programs. Social psychology was the last department on that graph: social psychology at Berkeley gets lots and lots of applications and takes less than 10% of them. And then there are other programs which are, conditional on applying, much easier to get into, like physics and mechanical engineering. Many, many fewer people applied to those in 1973, but applicants sometimes had a coin-flip chance of getting accepted. Of course, that's conditional on thinking you have a chance in the first place, so there's a lot that's missing from that analysis. But in that case, the reversal in trend is a real thing. It's telling us something about the causation, about what makes some applications get accepted or not.

But it's not always the case. Remember, I talked about colliders earlier in the course. The reversal of trend could be caused by a collider, like controlling for marriage status — remember that example? You got a reversal of trend, but it was spurious. So Simpson's Paradox could lead you astray as much as it could give you a clue about what's really going on. You've got to think carefully about the causation in it. Just because the trend reverses, it isn't like, oh, now we really know what's going on. No — you still have to think about your causal path diagram. There's no easy way out of this problem.

Simpson's Paradox is worth keeping in mind, though, and having on your top-five list of things to worry about, because it catches people out in the literature all the time. As an applied statistician, when I read journals, I constantly see papers that have not worried about this problem — running aggregate regressions even in cases which are structurally exactly like the UC Berkeley example, which again is famous; it's in lots of textbooks. Here's one from very recently, from 2015, just two years before this lecture: "Gender contributes to personal research funding success in the Netherlands." These data are structurally just like the UC Berkeley admissions data, but they're about funding from the Dutch equivalent of, if you're in the United States, the National Science Foundation — if you're thinking of Europe as a whole, think of the ERC, but this is a competition only within the Netherlands. They analyzed it just like my first analysis of the Berkeley admissions data, and that was enough to get them a PNAS paper. However, these data show exactly the same phenomenon as the UC Berkeley admissions data. These are the data from the paper, and I've structured them exactly like the UC Berkeley admissions data. It's the same kind of thing, and the problem, as you might expect, is that male and female applicants apply to different sections, and different sections have radically different funding rates. It's exactly the same problem, but they did not control for that. And then, of course, you get a letter following up in PNAS pointing out: hey, have you heard of Simpson's Paradox? Ah, peer review. Does peer review work? I don't know. Now, to be fair, peer review is always going to be leaky, right?
It's always going to be leaky, so maybe we shouldn't look to journals to tell us much about the quality of papers. That's my opinion. But anyway, just to say: this is a real, live problem in science. People accidentally make these mistakes all the time, even in exactly the same situation. This is good news, by the way — all the evidence says funding in the Netherlands is fair, at least at this level; there may be inequities elsewhere. But conditional upon application, there's no evidence of any gender bias in acceptance rates. Well, there's a kind of epiphenomenal gender bias, from the fact that different genders apply to different sections, and that's something to worry about, because it does affect the aggregate science that gets done. But it's not discrimination in the narrow sense — maybe discrimination in some broader sense of the word.

Okay, one last thing to say about binomial GLMs before I try to summarize them for you. With binomial GLMs — and in fact this is true of all GLMs, but especially binomial GLMs, because they have both a ceiling and a floor effect — you really, really need priors. These are not optional now. These models will radically misbehave in the absence of priors. Maximum likelihood estimation with GLMs is extremely dangerous. It's like driving without a seatbelt: maybe on most trips you're okay, you get to the grocery store, but occasionally you're going to be really glad you had that seatbelt on.

Here's a simple example to show how easily this happens. I'm using the glm function in R, which you can think of as doing a binomial regression with flat priors. That's what it would be like if you programmed it in map and left all the priors out of your binomial model — you get the same misbehavior, in fact. I believe that's an exercise; actually, it's in the book like that. This is a rigged data set where the implied effect size is really strong: there's one predictor, and the relationship between the outcome and the predictor is almost, but not quite, perfect. So the model gets extremely excited, and you can see the consequence of this excitement in the estimates at the bottom. The intercept, which is alpha in our case, is some number, and its standard deviation is huge. This is on the log-odds scale. Now remember, a log-odds of four means essentially "always." So what's three thousand? That's a great confidence interval — I like that, it's really attractive. And then the beta coefficient on x: the MAP estimate is about 11, but again the standard deviation is absurdly large.

What is happening here? Well, remember, once the log-odds get above four, the model predicts "always." It doesn't matter — there's almost no difference in prediction between four and four thousand. There's a little bit, but not much; it makes a minuscule difference in predictions, and there's certainly almost no difference between a thousand and two thousand. The likelihood function cannot distinguish what's going on based upon the data in a GLM. Let me say that again, because it should terrify you: the likelihood function alone cannot tell you what's going on in a GLM if the effects are strong, because you hit the floor or the ceiling. You can only die once — that's the problem. This salamander can only die once.
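Here's a minimal sketch of that kind of rigged example, with assumed toy data (the book's version differs in details): glm, with its implicit flat priors, blows up, while even a weak prior stabilizes the estimates.

    # assumed toy data: the predictor separates the outcome almost perfectly
    y <- c(0, 0, 0, 0, 1, 1, 1, 1)
    x <- c(1, 2, 3, 4, 6, 7, 8, 9)
    m_flat <- glm(y ~ x, family = binomial)  # flat priors = maximum likelihood
    summary(m_flat)  # huge coefficients, gigantic standard errors, a warning

    library(rethinking)
    m_reg <- map(
        alist(
            y ~ dbinom(1, p),
            logit(p) <- a + b*x,
            a ~ dnorm(0, 10),  # weak prior: just says infinity is impossible
            b ~ dnorm(0, 10)
        ),
        data = list(y = y, x = x)
    )
    precis(m_reg)  # sane, stable estimates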
Once any one factor has pushed you against the floor or the ceiling, the data alone can't tell you what a reasonable effect size is. And that's what happens here: the effect is strong, the model can't tell how strong, and so it just keeps cruising out, because the likelihood is flat all the way out to infinity and there's no prior. It thinks every parameter value is equally plausible before the data, and so your model does exactly what you told it to do. It looks like misbehavior, but it's just its programming.

So you need a prior here, and you don't need a strong prior. You just need a prior that says infinity is impossible — that's a good start. In fact, I would say for beta coefficients, like the ones I've been putting in the examples, you want a prior centered on zero that makes anything above five or six extremely unlikely. That's a really good place to start. Regularization was a good idea with linear regressions, but here it's really mandatory for getting sensible, stable estimates. This is a huge problem: binomial models — logistic regressions — in the presence of flat priors are engines for overestimating effect sizes. And I think this is one of the things that contributes to the lack of replication in many fields. If you're trying to farm overestimates of effect sizes, it's a great idea not to use priors, but if you're trying to do replicable science, I think you probably want to use them. Anyway, that's enough of my sermons. Actually, it's not enough of my sermons — you'll get more. This is what you're paying for.

Okay, summary slide. These are nice to have, and I'm going to try to do one for each of the different GLM types. Binomial GLMs are for predicting counts with a fixed maximum. Counts always have a minimum of zero — you can't have a negative count — so just saying something is a count means it has a minimum of zero. But here there's also a known maximum: you do binomial regression when you know the maximum possible observation before you see the data. That's what makes it a binomial regression. So it's like you know the number of animals, and the thing you're trying to predict is how many lived. When you don't know the number of animals, and you have a count of how many died, that's not a binomial regression, because you don't know N. Then you have to do something else, and that's what we're going to do next. There are lots of things you could do; I'll show you one option that's reasonable.

It's conventional to use the logit link, but often there are better choices if you know something about the structure of your process. You can get away from the linear model — what I call generalized linear madness — and then the link function will build itself for you, and you don't have to use the convention. I apologize for not having an example of that here, but one way to think about this is survival analysis, if any of you have done that. This is a family of regressions where you're predicting, as you might expect, survival, but it's used for anything where there's a latency to an event. You're observing some cohort of things — they could be organisms, they could be anything; it could be the latency until kids stop playing in an experiment. You stop the experiment at some point, and there are still a bunch of entities — animals, children — who have not yet done the event. Then there are some who have, and you need to analyze that data in the aggregate.
The right way to do that is a family of methods called survival analysis. They often look like binomial GLMs, or some kind of GLM, but they have special link functions that nominate themselves just from thinking about: okay, there's a latency to the event — what's the probability that I've waited this long and the thing has not happened yet? If you write that expression down, you get your link function straight away; you don't have to invent it ad hoc. Those methods are extremely important, and I feel guilty that they're not in the course. But if you get to something like that, come see me and I can help you with it. We just did one of these in my department for somebody's field data, because we get situations like this all the time with births. You've got a cohort of people: what's the latency to first birth after they get married? Some of them have not yet given birth and others have, and you've got to use all the data to estimate the rate. That's a survival analysis. You don't just want to count up the number who have given birth and who haven't and run a binomial regression — you'll get the wrong answer.

Okay. You should distrust MAP estimation for GLMs. Often it works out okay, but again, this is like the seatbelt: usually you don't need your seatbelt, but that's not a reason not to wear it. So you should be skeptical. You're on the public's dime, right? The taxpayer is going to want you to use the public's funds wisely. It's not okay to do the second-best thing just because it usually doesn't matter. You don't want to stand up in front of the taxpayer and give that excuse.

Okay. Interpreting these models is tricky. I think generating predictions out of the model is the best way to go, and I'm going to keep giving you examples of this with GLMs to reinforce it — it's the tide prediction engine. The thing about these models is that the relationship between parameters and outcomes is no longer transparent and linear, and so generating predictions is the thing to do. After a while your intuitions get schooled and you can read the gears, but you still shouldn't get overconfident about it.

Okay, the next model type is a special case of the binomial, and let me bring you into it by showing you a binomial distribution. This is a binomial distribution where the probability of success on each trial is 0.014 — very small, a little over 1% — but there are 200 coins to flip, 200 trials. I've simulated a bunch of data sets, something like 10,000 binomial samples, and plotted the distribution of the number of successes. Notice that the observed maximum is only 11. That's the largest number of successes ever observed, because the probability of success on each trial is really small. We don't see the theoretical maximum here — you never get anywhere near 200. This makes sense. What I want you to notice is that if you calculate the mean number of successes, the expected value, it's 2.84 — that's empirical, from the simulation; you could also get it from theory — and the variance is almost exactly the same value. This is an example of what happens to the binomial distribution whenever the number of trials is really big and the probability of success on each trial is really small: the mean and the variance are the same. And this is what's called the Poisson distribution.
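A short simulation (numbers assumed to match the slide) shows the signature:

    # many binomial samples: 200 trials, success probability 0.014
    y <- rbinom(1e4, size = 200, prob = 0.014)
    mean(y)  # about 2.8
    var(y)   # about 2.8 as well -- mean equals variance, the Poisson signature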
Poisson — you don't have to say it like you're French. So here's another example. Now, out of 500 trials — we've got more trials, so the mean shifts up — the mean is 7 now, and the variance is 7. And again, with 900 trials, the mean goes up to almost 13 — call it 13 — and the variance is 13 as well. These are all the Poisson distribution. And it applies in cases where you have vanishingly small probabilities of success, like 2 times 10 to the minus 4, but 6,000 trials. The mean is only 1.2, very small. It's mostly zeros that you observe — the most common single outcome is one, but you observe a lot of zeros — and the variance is the same as the mean. Two more examples of the same case. So, interesting, you may say. Or not.

What's nice about this is that it gives us a way to model counts, or rates, of quote-unquote successes when we don't know N. As long as we can say the potential number of successes is very, very large, we can model the observed events without knowing N at all. There's some massively big population of things that something could happen to, and in any particular window of observation during our study, most of those entities won't experience the event. In those cases, the maximum entropy distribution is the Poisson. And nature loves maximum entropy, remember? Nuclear decay is an example of this; I'll give you more examples in a minute as we get into it.

So this is the Poisson distribution, and we're going to write it like that. I had an instructor in grad school who called it the fish distribution — poisson being French for fish. Yeah, it's funny, isn't it? Some jokes don't age well; that one's good. All right, the expected value of the Poisson distribution is conventionally called lambda, which is the rate: the expected number of successes per unit of observation. That unit is up to you to define. It could be an hour, it could be a mile — time and distance look the same to the Poisson distribution — it could be a unit of area, if you're doing quadrats in an ecological survey. The number of monkeys in a section of forest wouldn't be Poisson distributed, because monkeys group; but antisocial monkeys would be Poisson distributed. And if you throw tacks on the floor, draw squares, and count up the number of tacks in each square, that's Poisson distributed — this is something I once did with undergrads; they hated me. But it works. The fact that it has a single parameter makes it an attractive option for modeling, but this isn't something we do for the sake of convenience: it's a natural consequence of this small slice of possible binomial distributions that the variance is equal to the mean. That's why we do it.

Okay, the history of this goes pretty far back. It's named after, of course, Siméon Denis Poisson, pictured here — I love this painting — but like most things in math, it is not named after the person who first used it. That's what's called Stigler's law: everything in math is named after someone who didn't invent it.
The first person we know to have actually analyzed and used this distribution in applied statistics is Abraham de Moivre, who — to my knowledge, which is imperfect — was also the first person to derive a central limit theorem for the Gaussian distribution, among lots of other really important things in mathematics. This was during the period when French mathematicians were doing all the hot stuff in probability theory, prior to the German period when all the German mathematicians were doing all the hot stuff; it switches.

Commonplace examples where the Poisson distribution is a fantastic starting place — you may be able to do more, the more you know — include things like soccer goals. The potential number of goals in a particular game is theoretically very high, but as you know, if you are a soccer fan, the realized number of goals is vanishingly small. I think the most common score worldwide is 0-0, or maybe 0-1 — very small. Poisson distributions do a great job at predicting soccer matches. Again, you could do better if you had more information, but it's a good maximum entropy start. Fission events are a famous case, from theoretical principles — alpha particle decay, for example. You've got a lot of atoms, say a mole of uranium, and you want to say how many alpha decay events to expect in a second. They're Poisson distributed, because the decay events are independent of one another — it's like the tacks on the floor. Analogously: photons striking a detector, DNA mutations, and, in a famous historical example, soldiers killed by horses.

This is, I think, potentially the first real applied statistical use of the Poisson distribution in an analysis. I've copied the data table on the right from a book, The Seven Pillars of Statistical Wisdom, which is a kind of history of statistical principles and their discovery. This is Bortkiewicz's data. Bortkiewicz was a statistician retained by the Prussian army. Horses are dangerous — per capita, more dangerous than cars, judging by the historical data. The only other data I've seen like this is from New York City: people used to get around New York City on horses, and there are lots of statistics about that. Every day back then, more people per capita were killed by horse kicks than are now killed by cars in New York City. Horses are ornery animals, they're excitable, and things go wrong. So if you're a Prussian general and you want to reduce the needless deaths of your soldiers to excitable horses, you might call in the statisticians. I don't think any progress was made on stopping the problem, but the horse-kick deaths are Poisson distributed — that's the point of it; there's some structure there to be understood as well. Anyway, I just wanted to give you an idea.

All right, a quick example to show you how to do this. I'm going to take you through the core of a published analysis. This is from my colleague Michelle Kline, who's now an assistant professor at Simon Fraser. Michelle does fieldwork in Fiji, and she's interested in the evolution of oceanic societies — that's what she studies. This is a paper where she's testing a cultural evolution hypothesis: a bunch of models in cultural evolution predict that
there should be a relationship between the magnitude of a population and the complexity of its culture, and she's testing these ideas with empirical data. What we've got here is all the oceanic societies you can get archaeological or historical data for: the complexity of their material culture — their toolkits is what she's coding up; historical data on their population sizes — sometimes this is census data, often it's archaeological estimates; and an estimate of their contact rate with their neighbors, high or low. Some islands are really close to one another — Tonga is really close to lots of other societies — while Hawaii is pretty far from all the others. So Hawaii is down here: low contact, really big population. And then there are two measures of the complexity of the toolkit; I'm only going to talk about the first one today, though the second is analyzed in the paper as well: the total number of distinct functional tools the society manufactures.

Okay, so two questions of interest. First: is the complexity of the toolkit proportional to the magnitude of the population? Magnitude means the log of the population. And then the second-order prediction: does contact with other islands moderate the impact of population? You might think connectivity gives you an effectively larger population size — you don't have to invent or sustain all these tools yourself; you could just take them from Tonga, for example. I'm excited by theories like this. You don't have to be, but you will be excited by Poisson regression, because it will help you in your work.

Here's what the model looks like. It looks like all the other GLMs, but now we put Poisson where we used to put binomial or normal. There's a single parameter that determines its shape, lambda. So the count of tools for society i is expected to be Poisson distributed with a rate — an average, an expected value — lambda for that case. We're going to model it with a log link, so that the log of the rate is a linear model. This is why these are sometimes called log-linear models — there are other things called log-linear models, but this is one of them. What that implies, and I'll show you on the next slide, is that the predictors are multiplicatively related: if you undo the log by exponentiating the linear model, what you get is e to the alpha, times e to the (beta-P times log population), times e to the (beta-C times contact), and so on. So you get this multiplicative relation, where each factor multiplies the effects of the others. Poisson models are multiplicative models, even though there's a linear model inside them. I'll show you what this looks like on the prediction scale in a second: it produces exponential scaling with a predictor. And then we have some priors; these are regularizing priors.

So: total tools, that's our outcome variable; lambda is the expected number of tools for case i; and here's the log link. Here's what the log link looks like. On the horizontal axis is your predictor x, and on the vertical is the log rate of events — the log rate of tools, in this case. That's the scale our linear model lives on, and on that scale the linear model is a straight line as we change the value of the predictor. That's what this is showing you.
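You can check the multiplicative relation in a few lines of R (toy numbers assumed):

    # the log link implies multiplicative effects
    a <- 1 ; b <- 0.3 ; P <- 1000
    exp(a + b*log(P))  # lambda from the linear model
    exp(a) * P^b       # the same number: e^a times P^b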
Now, when we project this back through the link, we undo it: we exponentiate the straight line, and this is what you get when you exponentiate a straight line — an exponential curve. There's a massive compaction of the space in the lower parts, down at the bottom, and a massive expansion of the higher parts. So this compacting thing happens again. This is exponential scaling, and you have to wonder about it, because things can't be exponential forever, right? Eventually x gets so big that the exponential relationship has to break down. So you might want to think about that. There's a homework problem I'd like you to do over the break, about hurricanes, that exemplifies this problem extremely well. Things can't be exponential forever.

Okay, I'll go over this slide quickly — it's in the book — just to say that if you undo the log link, what that implies is that the linear model gets exponentiated; that's how you get the original parameter back. The log link is conventional for scale parameters like this. It's not your only choice, but it's a choice with really good mathematical properties, and that's why it's conventional. The rest is the good old linear model — you're pros at this now, right? You love these things. Again, if you have a real theory, you can do better than this.

How do we fit these models? As you might expect. Here's the data prep: we construct the log population by taking the log of the population. I like to do my transformations before I run the model, because then you don't have to constantly redo the calculation inside the model and slow your computer down — just do the transformation once. Then a dummy variable for whether contact is high: contact_high is one if contact equals high, zero otherwise. I'm using map here; the map2stan code looks identical, and I encourage you to compare them when you run this code yourself. total_tools is the outcome; dpois is the R function for the Poisson distribution; log lambda, there's your linear model; and here are the priors. Notice the trick: you can create a vector of three priors with the c function in R, which gives them all the same prior. Good. You love it, right? Exciting — you're learning new stuff now. I'm sure all of you have data sitting on your hard drives that is begging to be analyzed with a Poisson distribution. Everybody does. The world is just full of Poisson distributions — it's like the Gaussian; we shouldn't be surprised that Poisson distributions are everywhere.
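Here's a sketch of the model from the slide, close to the book's m10.10 (prior scales assumed to match the book's):

    library(rethinking)
    data(Kline)
    d <- Kline
    d$log_pop <- log(d$population)                       # transform once, outside the model
    d$contact_high <- ifelse(d$contact == "high", 1, 0)  # dummy variable
    m10.10 <- map(
        alist(
            total_tools ~ dpois(lambda),
            log(lambda) <- a + bp*log_pop + bc*contact_high +
                           bpc*contact_high*log_pop,
            a ~ dnorm(0, 100),
            c(bp, bc, bpc) ~ dnorm(0, 1)  # one line, same prior for all three betas
        ),
        data = d
    )
    precis(m10.10)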
Okay, here's what happens in this model. This is a chance for me to show you something general about regression — it happens in all kinds of regressions, but in this analysis it looms very large — which is that marginal estimates can mislead you about what the model really predicts. This is a common feature with interaction effects, and there's an interaction effect in this model; sorry, I didn't emphasize that very well before. We've got a main effect of log population, a main effect of contact, and the interaction between the two, because we want to know whether contact moderates — can make up for — the fact that you have a small population. We want the impact of population to depend upon your contact rate, so we put an interaction in. One of the things that happens when you have interactions is that the effect of each predictor now depends upon two parameters: the effect of population depends upon the main effect and the interaction, and the effect of contact likewise. So if you just look at the marginal distribution of the main effects, or of the interaction effect, you can't tell what's going on in the model. I emphasize this because people do it in journals all the time.

Here's what happens in this case. What I'm showing you is the precis output and the caterpillar plot, or whatever we're supposed to call these things — dot-and-whisker — what are these called? Forest? Forest plot. I like that, that's nice. Thank you — the forest plot. We'll ignore the intercept, like we always do; it's not personal. We have bp, the main effect of log population: it's positive and pretty precisely estimated, so it seems for sure there's something going on with population. Then bc, the main effect of contact: it's negative, but the standard deviation is very big relative to its distance from zero — it doesn't look like there's much going on as a main effect of contact. The interaction between log pop and contact, bpc, is slightly positive. It isn't as wildly uncertain as contact, but it still overlaps zero; there's a lot of probability mass on both sides. The usual thing people do in these cases — if you were doing significance testing, which you definitely shouldn't do, but if you were — you would conclude that neither bc nor bpc is significant, so we should drop those terms: contact has no effect. You would be wrong. The reason is that the predictions that come out of contact depend upon both of those parameters, and those parameters are highly correlated with one another — negatively correlated: when one's big, the other's small. The aggregate effect of contact is that it definitely predicts more tools.

Let me show you what this looks like in the posterior distribution. When you do the pairs plot on this posterior, there's alpha, bp, and here are the two contact parameters: the main effect of contact and the interaction of contact with log pop. They have a correlation of negative 0.99 in the posterior distribution. There's a lot of joint information. This is what it looks like — a nice little ridge. So you can't just look at the marginal of one of those, ignore the other, and say, oh, there's no information. The model states there's a trade-off in prediction. Look at the marginal of bc, for example, which is the horizontal axis here — this is what you were looking at on the forest plot, as I've now learned it's called. The marginal is as if you were standing down here and looking this way along the graph: you see a hill that overlaps zero on both sides quite a lot, and you go, yeah, nothing going on. Then you walk around to the other side and look at bpc, and it overlaps zero a lot on both sides too, and you go, yeah, nothing going on there either. But if you look top-down, you see there's a lot going on: when bpc is positive, the other one is negative. Something here is identified — something is known — and there is an effect of contact.

How do you know that? Let's push predictions out of the model. The predictions are what matter. This is the problem with looking at the gears of the machine: we're looking at the gears, and you could be forgiven for not having any idea what they mean for the behavior of the machine.
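Here's a sketch of that computation, close to the book's code (a log population of 8 is assumed as the example value):

    post <- extract.samples(m10.10)
    # expected tools at log-population 8, high versus low contact
    lambda_high <- exp(post$a + post$bc + (post$bp + post$bpc)*8)
    lambda_low  <- exp(post$a + post$bp*8)
    diff <- lambda_high - lambda_low
    sum(diff > 0) / length(diff)  # about 0.95 of the mass is above zero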
There's a section in the book where I walk you through all the R code to do this, and I encourage you to run it. All we do is take those posterior samples and push them out as predictions: we construct the expected lambda for an island with high contact and the expected lambda for an island with low contact, at an example population size, and then we look at the difference. That's what I'm showing you in the left plot at the bottom of this slide. This is the posterior distribution of the difference between the expected number of tools on a high-contact island and on a low-contact island, and 95.27 percent of this probability mass is above zero. There's a lot of evidence that islands with high contact have more tools — but you couldn't see this by looking at each isolated parameter, because an isolated parameter does not generate a prediction; only combinations of parameters do. It's a little bit mystical — does it make sense? These machines are complicated, but you can deal with the complication if you always push predictions out of the machine.

Okay, model comparison backs this up. We've got the original model, the one with the interaction, m10.10, and we can also fit one without the interaction, m10.11. That one does slightly better, but I'd call it no way to tell them apart: one gets about 0.6 of the weight, one gets about 0.4, and the difference and the standard error of the difference are about the same. So there's not really strong evidence of an interaction — if there is one, it's weak — but both contact and log population are worth keeping in the model, according to this, despite the fact that we have something like a dozen data points. There's almost no data here; it's an interesting thing. Then log pop only, the null model, and contact only: you get this ranking, and only the top two models get very much weight at all; the log-pop-only model gets a little bit. WAIC, remember, looks at the outcomes — it judges predictions — and so it automatically deals with all that covariance business in the way the parameters interact. That's a nice feature of it: you won't get tricked by gazing at the marginals of each parameter.

So let's look at the prediction ensemble. If you take the full set of models and pass it to ensemble — remember ensemble, where we use the weights to mix predictions — this is what you get. There's not much of an interaction here, but there's definitely a main effect of contact. The first thing you see is a really strong effect of log population: this is the exponential scaling of the expected number of tools on the vertical against log population on the horizontal. Then there are two trends, one for high-contact islands and one for low-contact islands. The actual data points are shown as dots: the filled ones are high-contact islands, and the unfilled — hollow, is that what I should call them? — the hollow points are the low-contact islands. If there were a strong interaction, what would that look like? The slopes, if you will — slope is a terrible word for an exponential scale — would be different; there would be a different tilt to the two trends. There's not much evidence of that. Which is to say, if there is an interaction, it's small. This makes sense. That's the wondrous power of the Poisson GLM.
To try and summarize that for you: these are models for counts without an obvious upper bound. The log link is customary. This means your linear model is a model of magnitudes: logarithms are the magnitudes of a measure, so log population is the magnitude of the population — the exponent on the population size. So think of your linear model as living on the magnitude scale, because it's on the log scale; that might help you reason about it. You need to be wary of exploding predictions, because of the exponential scaling: when you undo the transform — well, exponentials are exponential, and things can get big very, very fast. So you need to worry about that.

Focus on predictions, as always, not parameters. I know, this is my course, and I'll just keep saying it. Reviewers will ask you to do the wrong thing. They'll say, show us the parameters. Fine — put them in the supplemental. What you don't want your readers to do, when you're presenting your results, is to read isolated coefficients and think they understand how the model works. It's not true. Editors and reviewers will pressure you to do the wrong thing, in my experience. Give them a table — there's nothing wrong with giving them a table — but show them the predictions. That's the way to help people understand what's going on.

Something you need to look at — it's not in the lecture, but it's in the book — is the use of an offset to adjust for what's called the exposure: the duration or distance of observation. In this example we don't have that, but it's ordinary with data like this that some cases have longer observation windows than others. For one case you might have observed the area for a month, and for the next only a week, but you've got counts which are plausibly Poisson distributed. How can you put them in the same model? The answer is: it's easy. There's this thing called an exposure term, and all it is, is that you add the log duration of observation to the linear model — no coefficient, just the log duration. I show you in the text why this is exactly the right thing to do; it's actually a simple concept. And there's a computed example with monasteries that you may find entertaining, which will help bring the lesson home.
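A sketch in the spirit of that monastery example (all numbers assumed): one monastery is recorded daily, the other weekly, and the log-exposure offset puts them on the same per-day scale.

    # toy data: manuscripts completed, one monastery daily, one weekly
    num_days <- 30
    y1 <- rpois(num_days, lambda = 1.5)      # daily counts, rate 1.5/day
    num_weeks <- 4
    y2 <- rpois(num_weeks, lambda = 0.5*7)   # weekly totals, rate 0.5/day
    d <- data.frame(
        y = c(y1, y2),
        days = c(rep(1, num_days), rep(7, num_weeks)),
        monastery = c(rep(0, num_days), rep(1, num_weeks))
    )
    d$log_days <- log(d$days)
    m <- map(
        alist(
            y ~ dpois(lambda),
            log(lambda) <- log_days + a + b*monastery,  # offset: no coefficient
            a ~ dnorm(0, 100),
            b ~ dnorm(0, 1)
        ),
        data = d
    )
    precis(m)  # a and b now describe rates per day, comparable across monasteries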
Okay. Predictions from these models tend to be underdispersed relative to the data, which means there's lots of residual variance — some unmodeled heterogeneity. That gives us clues about where to go next; multilevel models help a lot with that. There are lots of other count distributions which generalize the lessons here, and I'm not going to tell you about them, because I want to spend a little time on ordered logit. The multinomial and the geometric are some of the most common. There are also lots of mixtures to help you cope with heterogeneity — the beta-binomial and the gamma-Poisson. These are binomial and Poisson regressions where each case has an intercept drawn from some other distribution, so it's really a multilevel model. We'll come back to this when we do multilevel models in the new year, when you come back fresh.

Okay, so GLMs are kind of monsters, in the classical sense of a monster: a monster has features of multiple animals, right? We can take features of the simplest kinds of GLMs — link functions, outcome distributions, and the ways we build linear models — and hook them together in ways that help us model strange measurement situations. This is what I call, in the book, monsters and mixtures. You can get more complicated GLMs — they're still GLMs, but they look a bit monstrous. And I want to show you, in the remaining time today, what I think is the most useful of these monstrous things. It's mainly useful in the social sciences, where we're often asking people to give us some subjective rating of their internal state — the psychologists are familiar with this, sociologists and anthropologists as well — but you also sometimes get data like this in the natural sciences. The book chapter also has a bunch of stuff about mixtures; I don't think I'll have time to talk about that today, but we'll come back to that theme.

All right, ordered categories. What are ordered categories? These are measures like you get when you ask people: how much do you like this class? Don't answer — but your answer must be an integer from one to seven. It's called a Likert scale. How important is the income of a potential spouse, one to seven? There's a big literature on questions like this, for some reason, and people get very inflamed by them. How often do you see bats around Leipzig? This might be the first time that question has ever been asked, but there are bats around Leipzig — there's a group of people that tracks them in the city — and the answers might be: never, sometimes, frequently.

The thing all these kinds of answers have in common is that they're discrete outcomes; there's a minimum and a maximum, and there's a defined order to them. Two is always bigger than one, three is bigger than two, and so on; sometimes is bigger than never. But the distances between the categories are mysterious. How much more often is sometimes than never? How much more than one is two on this scale? The answer is not one, unfortunately — it's some subjective amount, and it's not the same as the distance between two and three. The amount of liking this class that's required to get you from, say, four to five might be really different from the amount required to get you from one to two. It probably doesn't take much teaching quality to get you from one to two — minimal satisfaction — but to get you from five to six, or six to seven, I have to really work. That's the subjectivity of these scales. These things are super common. We call them ordered categories, and they're difficult to model: they're not continuous, and they're not counts. Good times.

So what do we do with these things? A really robust and useful solution is called the ordered logistic model — ordered logit; I'm going to call it that in a few slides. Essentially this is a categorical GLM, like a multinomial or a binomial, but more than one thing can happen. The things that can happen are the categories you could observe: never, sometimes, often; one, two, three, four, five, six, seven. Those are discrete outcomes. But we're going to use a fancy link function — "fancy" being the technical term in statistics — to impose an ordering on them, so that they're always in the same order, but without assuming that the distance on some subjective latent scale is always the same to get you from one category to the next. That's the magic of it. This is our Frankenstein monster, sort of. We've got ten minutes; I think I can just get you to the structure of these things.
There's lots of extra stuff in the book section on this; you should go through and run the R code to build up the plots I'm about to show you, so you'll understand what's going on.

Let me tell you about the data case I'm going to use. Philosophers, for some reason, are fascinated by what are called trolley problems. I don't know if any of you are familiar with these things. Imagine, if you will, a trolley track — we have lots of them here in Leipzig. There's a trolley track, and there's a speeding, out-of-control trolley. That never happens here, obviously; here they're stuck, not moving — that's the usual problem. It's going down track A, and for some reason, tied to this track are five people. We don't know why — it's a philosophy story; they all start like this. And there's a side track the trolley could be diverted down, with only one person tied to it. Again, I don't know why. Philosophers think of these things — is it the opium? I'm sorry, my friends are philosophers; they don't use opium. But you happen to be standing next to a control switch — yes, you — and it's set to A. Your decision is whether to flip it to B. If you flip it to B, that one person dies, but five people get to live. If you don't do anything, five people die and one person lives. What do you do? The question that's usually posed is actually not what would you do; rather: say somebody flips it to B — how morally permissible is that action? And yes, you guessed it: on a Likert scale from one to seven.

What makes this an experiment is that you can vary lots of the details of the story, to try to decode the intuitive principles people use to make these subjective moral judgments. Unsurprisingly, normal, psychologically healthy people reason nothing like philosophers — which isn't to say that philosophers are psychologically unhealthy; they're just trained, trained into a certain view of what a logical moral structure would be, and people don't agree with it. There's cross-cultural variation, but there's also a lot of cross-cultural generality to this.

Let me give you just two more examples, very quickly. Here's a version of the same kind of trolley problem. Again, here's our speeding, out-of-control trolley. It's going to pass under an observation bridge — it's not going to knock the wall down; it's going to pass harmlessly through. And you — or say the protagonist in our story, in red there — are standing on top of this wall, eating a sandwich, as this is happening. And yes, there are five people tied to the track, of course; I don't know why. You were just watching, eating your sandwich. There you go. There's a very large man standing next to you on this overpass, which has no railing, for safety, and you know — because you're very clever — that if you push this fellow in front of the tram, he's heavy enough to stop it. I'm not making this up; this is in the published literature, though I don't know who was originally responsible for the story, or whether they would stand up and raise their hand. And again, Likert scale: how morally permissible is it to push the man? As you can expect, this pushes the answers way down on the Likert scale.

One more version of this. Again, here's our trolley, speeding like in the first version, but now it's set to go down the
side track and kill one person instead of five. You're standing next to the switch, and the question is: how morally permissible is it NOT to pull the lever? It's weird, right? But these are the kinds of games philosophers like to play. You've just inverted the first version — technically it's the same, but to a human being it feels really different, and people answer differently as a consequence. Again, inspect your intuitions; there's no right answer to these. That's why it's fun stuff. People have very strong emotional reactions to these stories, and the empirical literature on this aims at getting at principles.

There's a lot of evidence behind the idea that there are at least three principles — these are the three big ones — that explain a lot of the variation in how people respond to these little anecdotes. First is action: harm caused by action is worse than the same harm caused by inaction. Doing something that causes harm is somehow worse than not doing something and letting harm happen. That's called the action principle. Second, the intention principle: harm intended as a means to a goal is worse than the same harm foreseen as a side effect of the goal. I should give you an example of this. Do fossil fuel companies intend global warming to happen — do they need global warming to happen to make their profits — or is it just a side effect of making their profits? Both are bad, right? But it would be even worse if they needed the warming so they could get more oil. You see — that's the intention issue. Third, contact: harm caused by physical contact is worse than the same harm without physical contact. This is the pushing-the-big-man-off-the-bridge sort of thing; people find that really offensive. There's a variant where there's a lever that opens a trap door under his feet, and people don't think that's as bad. On average, right? I think all these things are awful, really — it's a horrible universe. Anyway.

These three stories get coded: the first one has action, but not intention or contact; the second has all three — it's really bad, people really hate that one; you felt it when I showed it — and the last one has none of them. People answer differently to all of them, and our goal is to analyze these data. You have, in the rethinking package, data from the Cushman et al. experiments — a large number of experiments. There are 331 individuals in the data and 30 different scenarios: trolley problems mixing and matching all the different things, a bunch of logical combinations — they sat down and thought hard of weird stories that would have these different features. Good times. Then there was a website, lots of people went to it and did the full set, and so there are almost 10,000 responses. You've also got participant age and sex, and other things like education level and cultural background. We're not going to focus on all the predictors; I just want to give you an idea of how to do an ordered logit with these kinds of data, because these data have a really inconvenient distribution.
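The data ship with the rethinking package; a quick look, using the package's simplehist helper:

    library(rethinking)
    data(Trolley)
    d <- Trolley  # 9930 responses: response, action, intention, contact, ...
    simplehist(d$response, xlab = "response")  # spikes at 1 and at 4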
If we look at "how permissible," from one to seven — this is the aggregate sample, all 9,930 responses plotted as a histogram — you can see that people really like number four. Number four is the shrug, right? "I don't know, I'm hungry." Everybody who's done experiments with real humans has had this experience. Animals do it too — if you've done experiments with monkeys, they get impatient real fast; about thirty seconds in, they're done with the experiment. But number one is also popular; there's lots of condemnation in this data set. And then it's flat everywhere else. So this is not normal, it is not binomial, it is not Poisson. What is it? Well, you need some ability to just model the histogram, and that's what ordered logistic regression does. I'll say that again: it's a tool for modeling any arbitrary histogram like this, and for modeling how that histogram distorts when you change a predictor variable. That's what it is.

To give you an idea of these distortions: at the top is the total sample. Then we look at the cases where there's action in the scenario — it looks a bit different; there's more condemnation; the ones go up and the sevens go down. In cases where intention is present, a lot more ones, flat everywhere else. There are always fours, though — the people who are just answering four to everything, I think. And contact is very offensive: ones are now the most common single response. So we want some way of modeling, as we add and subtract the dummy variables that indicate these different principles, how this histogram morphs. And this is the ordered logit model. I'll go three minutes over, since we started three minutes late — is that okay with everybody? Okay.

So here we're just looking again at the aggregate distribution, and our goal is to make something called the log cumulative odds link. Let me talk you through what this means; this is the trick that gives us the ordering. The first thing we do is take this histogram — frequency against response — and convert it to the cumulative proportion of each response. What does that mean? Out of the total sample of almost 10,000 responses, what proportion of them are ones? That's the number right here — about 0.15; about 15 percent of them are ones. Then, what's the cumulative proportion of twos? That means: what proportion are two or less? The answer is a little over 0.2 — because there aren't as many twos — 0.22 or something like that. That gives us the next point. And so on: what proportion are three or less, four or less, all the way up to seven or less, where the answer is always one, because seven is the maximum response.

Then we build our link function out of this. Our link function will be the log cumulative odds — not the log odds, not the logit you've grown to love — the log cumulative odds. It looks exactly the same, it's the same formula, but the p inside it is not the plain probability; it's the cumulative probability, these numbers right here. The odds are always the probability of something over the probability of something else. The cumulative odds of a response are the cumulative probability — the probability of that response or anything less — over one minus that probability.
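Here's that computation on the Trolley data, essentially the book's version:

    # from histogram to log cumulative odds
    pr_k <- table(d$response) / nrow(d)  # proportion of each response
    cum_pr_k <- cumsum(pr_k)             # cumulative proportions
    logit <- function(x) log(x / (1 - x))
    round(logit(cum_pr_k), 2)            # the last one is Inf: log(1/0)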
That's what you get on the far right. The log cumulative odds of response one is almost minus two, and so on up through six. The log cumulative odds of seven is infinity, because there's a zero on the bottom — it's one over zero, and the log of that is not a finite number. So this is our link function now, the log cumulative odds, and our linear model will be on the log cumulative odds scale. Now it looks like a binomial — a logistic regression. The book lays all this out mathematically and gives you the code to make these graphs, and I really encourage you to step through it in your R console and compute it out, to get a sense of what's going on. This isn't that complicated, but it looks monstrous — like a hippogriff, or a Frankenstein monster — when you first see it. The log of the cumulative odds — of the probability that the observed value is less than or equal to one of the possible response values — we're going to set equal to an intercept for that particular response level, plus a linear model. Of course the inverse link is just the logistic, and it is a logit model, but instead of probabilities, they're cumulative probabilities.

So what we get on this graph: this is our cumulative proportion of each response, plotted across the bottom, and these gray bars are the cumulative probabilities of each observed type. Now, in order to do statistics, we need the discrete probability of each of those types, not the cumulative probability. Remember how likelihood functions work: you want to say, I observed a three, what's the probability of a three? — not the probability of a three or less. Bayes' formula needs the probability of a three. So where do we get that? From this graph, you could do it graphically: if you know the height up to there, you just subtract off the height of the previous one, and you get the interval. And once you've got that one, you just need that much, and that's the probability of each of them. These orange line segments are those calculations — the discrete probabilities; you just subtract each cumulative probability from the next one. Your computer will happily and beautifully do this for you, but it's not hard to do, and again, there's R code in the chapter to walk you through it. It's not rocket surgery, as they say.

Okay, let me show you what this looks like before I let you go. There are lots of conventions for writing these models. I tend to do it this way, calling this an Ordered distribution, with the log cumulative odds link, where the p's are now cumulative probabilities. Really, this is just a categorical model — sometimes you'll see "categorical" written here with the same link function — but I think "ordered" cues the reader into what's being done, so it's a nice convention. It's just a categorical model with a funny link. In code, the way you'll see it, there's this distribution called dordlogit, which builds the funny link in, and you give it a linear model — which here I call phi, and it's set to zero in this example — and then a vector of the cut points. The cut points are the intercepts for each of the levels that make up the histogram.
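Here's a sketch of that intercepts-only model, close to the book's version (starting values assumed):

    m11.1 <- map(
        alist(
            response ~ dordlogit(phi, c(a1, a2, a3, a4, a5, a6)),
            phi <- 0,
            c(a1, a2, a3, a4, a5, a6) ~ dnorm(0, 10)
        ),
        data = d,
        start = list(a1 = -2, a2 = -1, a3 = 0, a4 = 1, a5 = 2, a6 = 2.5)
    )
    precis(m11.1)  # a posterior, with uncertainty, for each cut point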
The cut points are the intercepts for each of the levels that make the histogram. So each level from one to seven now gets its own intercept (strictly, six cut points for seven levels, since the cumulative probability of the top level is always one), which gives you the overall shape, and then the linear model morphs that shape as the predictors change. See, my hands moving just teaches you everything, right? I don't have time to go into all the detail of that. The way to think about it: in the model that has no predictors, the model just learns, not basically but exactly, the histogram, because it's got a parameter, an intercept, for each response level. That's what you're seeing here: a vector of what are conventionally called cut points, and each of them fits the cumulative proportion of one of the response values. You can calculate this by hand, because I did exactly that to make the graphs earlier, but this does it with Bayesian inference, so you get a posterior distribution. There's uncertainty about each cut point, because you have a finite sample, so it's better than just calculating it directly.

And then the magic comes when you add in a linear model. Now we take each intercept and subtract some linear model from it: the log cumulative odds of (y ≤ k) become α_k − φ, where φ is the linear model. The subtraction is what makes a positive coefficient skew the distribution in the right direction, towards higher values; I talk about this in the book. The linear model looks just like your other linear models, except there's no intercept. Why? Because the intercepts are the alphas: there's a different intercept for each level, they're called cut points, and they've already been fit. I understand this is a lot, so you're going to have to work through the section in the book to get it. But otherwise it's exactly the same, and all the stuff you've learned about linear models up to this point applies to building these models as well (a code sketch of this model follows below). The linear model creates distortions in the outcome distribution.

So let me show you one last thing before you go: what these outcomes look like, and what I mean when I say it distorts the histogram. What you're looking at here are the predictions of a model where we put in the different principles, action, contact, and intention, as dummy variables. So we're predicting the subjective responses of each participant to each narrative about trolleys and people dying. Here we're looking at all the narratives where action is zero and contact is zero, and asking what happens when we change intention from zero to one. When it's zero, this is how many ones there are, that's how many twos, threes, fours, fives, sixes, and sevens. It's a histogram, yeah? And the shading indicates the uncertainty; we have a lot of data, so the model is super confident about where the cut points are. Then if you shift intention to one, notice all the lines go up. What does that mean? It means there are more ones and fewer sevens: it squeezes everything at the top and reallocates probability towards the bottom, because people are more upset now. That's what the linear model is doing: it's morphing the histogram. The same thing happens in the other cases. When we turn on action, everything's more upsetting from the get-go, and adding intention makes it even worse; you're getting lots of ones now. And then with contact, the pushing-the-big-man-off-the-bridge sort of situation: lots of ones, lots of upset people who quit the experiment immediately after that. Okay, I appreciate your indulgence in letting me go over during our room reallocation to drive this point home.
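Here is a sketch of that model with the predictors added, again in the style of the chapter's code; the coefficient names are mine, and the second half just summarizes at the posterior means rather than sampling the full posterior, to keep it short.

```r
# ordered logit with the three principles as dummy-variable predictors
m2 <- map(
    alist(
        response ~ dordlogit(phi, c(a1, a2, a3, a4, a5, a6)),
        phi <- bA * action + bI * intention + bC * contact,
        c(bA, bI, bC) ~ dnorm(0, 10),
        c(a1, a2, a3, a4, a5, a6) ~ dnorm(0, 10)
    ),
    data = d,
    start = list(a1 = -2, a2 = -1, a3 = 0, a4 = 1, a5 = 2, a6 = 2.5)
)

# implied histogram for action = 0, contact = 0, at the posterior means,
# comparing intention = 0 against intention = 1
est  <- coef(m2)
cuts <- est[c("a1", "a2", "a3", "a4", "a5", "a6")]
p_resp <- function(intention) {
    phi   <- est["bI"] * intention           # bA, bC drop out: both zero
    cum_p <- c(plogis(cuts - phi), 1)        # Pr(y <= k) for k = 1..7
    cum_p - c(0, cum_p[-7])                  # difference: discrete probs
}
round(rbind(`intention 0` = p_resp(0),
            `intention 1` = p_resp(1)), 2)   # mass shifts toward the 1s
```

A negative bI makes phi negative when intention is one, which raises every cumulative probability, and that's exactly the "all the lines go up" pattern: probability gets reallocated towards the low, upset end of the scale.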
If you model these data with a binomial regression, you get utter nonsense. Or a Gaussian, for that matter, which unfortunately is the convention in psychology, I believe: Gaussian regressions on these data. The reason is that you can't get even close to the distribution of the data. Not even close. So this is our ordered logit model. It's not perfect: the data are the blue, the black are the posterior predictions, and it's over-predicting fours for this particular case. That's because in the aggregate sample there's an excess of fours, but not in this particular case. I chose the most embarrassing prediction for this model; this is the worst it does, over-predicting some fours for this very upsetting case of intention plus contact. Here's the binomial model for the same thing. Again, the blue is the data, the black is the prediction. The binomial model is always going to try to be symmetric. It gets squished up against the boundaries, but it wants to be symmetric, because it tends towards a Gaussian as your sample gets big. So the binomial is wrong, and the Gaussian is even worse, because it ignores the minimum and the maximum entirely. In traditional statistical terms, the consequence is that you get massively inflated Type I error: your alpha isn't 0.05, it's more like 0.3. There have been simulation studies of this (a small sketch of the idea follows at the end). There's a very recent paper from John Kruschke which examines exactly this problem, and he's on a big drive about it now. He's actually a psychologist, unlike me. I'm just a spectator on psychology; I can sit back with my popcorn and watch the drama happen. Tell me more about the replication crisis! And anthropology? Don't get me started; it's darkness there. To be clear, it's not that we've fixed it all, but psychology deserves a lot of credit, all joking aside, for taking it seriously. Anyway, John Kruschke has a great recent paper out about what ordered logit regressions can do for modeling Likert data, and I encourage you to go look at it; it's up on his website.

Okay, thank you for your indulgence; I apologize for going a little bit over. These are my monastery slides that I had a dream of getting to, but of course didn't. For homework I recommend you do exactly two problems, two and only two, no more, so you have a holiday: one from chapter 10, problem 10H3, and one from chapter 11, problem 11H1. They're not terribly hard; one's binomial, one's Poisson. The first is pirating eagles, natural history data. Eagles do pirate: American bald eagles mainly pirate fish from other eagles; that's how they get most of their fish. Yeah, it's naughty. A good American symbol. The other one is hurricanes: do hurricanes with female names kill more people? That's the question. You'll have fun with these, and they'll teach you how to do these regressions. We'll resume on January 3rd, in, I hope, our traditional room, and we'll start with chapter 12, multilevel models. You don't need to worry about reading it ahead of time. Okay, all right, thank you all, and happy holidays to y'all.
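And, as promised, here is the flavor of those simulation studies of inflated Type I error, in a little sketch of my own construction (this is not code from the Kruschke paper, just the same idea): two groups share the same latent mean, so any "significant" difference a t-test finds in the ordinal responses is a false positive; unequal latent spread plus uneven thresholds is all it takes.

```r
# false positives from treating ordinal data as metric: two groups with
# the SAME latent mean, different spread, cut into 7 uneven levels
set.seed(1)
cuts7 <- c(-0.5, 0.5, 1.0, 1.5, 2.0, 2.5)        # uneven cut points
to_ordinal <- function(x) findInterval(x, cuts7) + 1L
false_pos <- replicate(1e4, {
    a <- to_ordinal(rnorm(50, mean = 0, sd = 1))
    b <- to_ordinal(rnorm(50, mean = 0, sd = 2))  # same latent mean!
    t.test(a, b)$p.value < 0.05                   # "significant"?
})
mean(false_pos)   # far above the nominal 0.05
```

Under a setup like this, the rejection rate should come out far above the nominal five percent, on the order of the 0.3 figure mentioned above: the Gaussian machinery mistakes a difference in how the thresholds chop up the two latent distributions for a difference in means.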