This week, we're going to continue forward with the generalized linear model strategy, generalizing it a little bit to cope with new model forms. My ambition, before I strap myself onto the train track here, is to spend all of today on ordered logistic regression, which, for some of you, will be really interesting. It's like payback to the social scientists in the room, because you have to model data like this a lot and it's terrifying. So we'll spend all day on that. For the rest of you, it'll be a great illustration of the general flexibility of these approaches, and you'll probably find some situations to use it as well. On Thursday, I want to give you a brief introduction to mixture models and one of the most useful varieties of mixture models, zero-inflated models, which again is meant to be an exemplar of a class of options available to you, but also to be useful in its own right. And I'd really like to start on the conceptual content of multilevel models on Thursday as well, so that we can spend more quality time with the code aspect next week. I know for some of you it's like next week is when the course is really beginning, like you've been waiting forever for this multilevel model stuff to begin, and we're getting there, but all of this foundation is necessary. I can't think of any other way to do it. So, let me show you some monsters to start with. In the upper left is the Minotaur. Do you guys remember that? What was the story again? The king offended some god somehow and was cursed... no, his wife became sexually attracted to a bull, right, and made it with a bull, and then gave birth to the Minotaur. Is that right? I know, it's like no one wants to admit that they know the story. It's vaguely awful, right, but something like that.
So we get this hybrid of a human with a bull head and unimaginable strength and a penchant for eating youths, I guess. In the upper right, that's the griffin from European folklore: the front part is a monstrous hawk and the back part is a lion. And in the bottom left, the Maori have a bunch of traditions about fierce beasts that live in nature called taniwha, and taniwha come in different forms, but what they all have in common is that they're amalgamations of different beasts: part serpent and part hawk and part lizard and all kinds of other bits of creatures pieced together. And there are lots of Maori legends about fighting these beasts and the horrors they bring. And then one of my favorite legends, shown here in a great cartoon I found, is the traditional Hawaiian legend of Nanaue, who was a shark man: half shark, half man, with a ravenous shark mouth in his back and gills and stuff. And here we see the beginning of the folklore, where the primordial shark man impregnates this woman and then leaves, and later she has a son who ends up being a ravenous hybrid shark man. It's quite a story. Why am I showing you these things? It's a feature of monstrous things in human psychology that they're hybrid compositions, right? They're monstrous partly because they're not any one thing, whereas nature is supposed to have these ideal types. It's supposed to be one thing or another. Make up your mind, please. And monsters are compelling, in the psychological theory of folklore it's thought, because they violate that basic condition of natural kinds. They're instead amalgamations. They're monstrous because they're hybrids. But their hybrid nature gives them, well, vigor. It gives them monstrous, terrible powers and abilities. We're going to be worried more about monstrous robots. Has anyone here watched Scrapheap Challenge? It's a BBC series. It's great in a horrifying way.
And in Scrapheap Challenge, very quickly: you take teams of crazy people, usually men, and you give them a scrap heap, which is the British term for a junkyard, although the stuff in it is not all junk, as you quickly learn. And then they're given a common challenge. In my favorite episode, they were supposed to make jet cars. I imagine they signed very lengthy liability waivers before doing this. And just to show you, they try out different prototypes, and they create these monstrous hybrid engines, right? We'll take a car and let's put a jet on it. Yeah, that sounds fun. And at least it lets you sell advertising space. So we're going to be interested in model types which are, in some sense, like jet cars. They cobble together different pieces of the simpler models we've seen so far. And once built, they let us do things, actually useful things, unlike this. They let us deal with inconvenient measurement scales and with mixtures: observations we can measure that are actually mixtures of multiple processes. In order to do that, we need to hybridize, if you will, different model types within the same probability distribution. So I want to give you some exemplars of how to do that. I think of these things as monsters-and-mixtures models. In the statistical literature, they really do call many of these models mixtures. The monsters thing is just my appellation for some of them that aren't usually called mixtures, but they are in spirit. They're all hybrid model types that cobble together structurally different capabilities. And they obey the generalized linear model strategy, even if they're not by strict definition generalized linear models, because the distributions may not be of the exponential family, but we don't care. The issue is that the distributions are, in some sense, maximum entropy. That is, they obey the constraints on the outcome that we put into them and nothing else.
So I'm going to show you today an example of a monster, the most common of which is used to model a kind of measurement called an ordered category. I'll explain why it's monstrous in a moment and how we can cope with it. Ranks, I'm not going to have time to talk about. Rank data is terrible, absolutely terrible. There are two things to say about ranks, and I'll never mention them again. First: never collect rank data if you can avoid it. You don't want your primary measurement to be ranks. What's the problem with ranks? You've got to predict the whole vector of ranks simultaneously, because they're exclusive. If somebody's number one, none of the other things can be number one. So you can no longer treat the cases separately. If you've got a bunch of individuals and you had somebody rank them on some scale, man, now you've got to predict the whole vector of ranks simultaneously out of your model. There are model types to do this, but you don't want to go down that road, at least not without me. So the best thing is not to collect rank data. The second thing is: don't transform data that's not ranked into ranks. There's a tradition of telling people to do this for some reason, and I would like to discourage you from doing that. If you find yourself in a situation like this, come to me; there are alternatives. If you must deal with rank data, there are ways to deal with it. It's annoying, but it's doable. And definitely don't transform things into ranks. Ordered categories are harder to avoid and easier to work with, so we're going to focus on those. On Thursday, we're going to talk about mixtures. In mixture models, we blend together different stochastic processes in the same model, typically hierarchically within the model, but not necessarily. And there are a bunch of cases where, for example, the mean across cases may vary and come from some other distribution, and we may want to estimate the distribution of means in the population.
And that will create what we call overdispersion in the outcomes at the top level. Such mixture models are really useful for dealing with heterogeneity that we don't have predictors to explain. I'm going to punt on dealing with this until next week, because we're going to use multilevel models to handle it. Multilevel models are a very flexible engine for doing this, so we'll get there as well. Multilevel models do have a hierarchical nesting of stochastic processes. So on Thursday, instead, we're going to focus on a simpler mixture case: a family of mixtures known as zero-inflation models, or hurdle models, which work in much the same way. I'll postpone explaining what that means. But data are routinely zero-inflated, so this is a very useful case to have, both in the social sciences and the natural sciences. Okay. Ordered categories. What are they? The social sciences are full of data like this, where the only way you can measure people's attitudes is to ask them vague questions. Language is a wonderful thing. It's great that you can talk to people; you can get a lot of data very cheaply from them this way. Unfortunately, the measurement scale that it comes out on is very frustrating. Let me explain why. I may ask you something like, on a course eval... I don't think you have to do course evals for this class, because it's a graduate course, which means no one cares what you think. No, I care, but the administration apparently doesn't, because they don't require them. Mainly because most of you aren't paying tuition, right? That's how it works. Anyway, I might ask you how much you like this class on a scale of one to seven. Now, for measurements like this, you could probably come up with a number. Say you come up with a number like four, right? That's about my attitude about it, too; I'm very self-critical, right? Or lots of other questions. More seriously: how important is the income of a potential spouse?
Lots of papers are published on outcome distributions from data like this. Or: how often do you see bats in Davis? I want to include the natural scientists in this. You can get lots of good informal natural-history species occurrence data this way. There are all these smartphone apps now trying to get citizen scientists to help with species occurrence. There's this good website, iNaturalist, that has an app you can download. The data is coarse, but it's not just presence and absence; you can do a little bit better by asking people in categories like never, sometimes, frequently, right? How often do you see bats in Davis? Depending on what hours you keep, obviously. But you're out late pretty much a lot, right? Lots of bats in Davis. And I once read a paper about the depths harbor seals dive to. This was an old paper, I think from the 60s, and they had a depth meter, but it wasn't great, so it was only reliable in really crude categories. And you'll take it; it's better than no data, right? So now what do you do with it? All of these measurements are odd. The underlying quantity is continuous within some range, so it's fair to call them quantitative. But they're kind of discrete too, in a weird way. So, for example, on the classic one-to-seven scale, people pick one, two, three, four, five, six, or seven, and usually you forbid anything in between. But even if you allow things in between, the distances, the spacing between these numbers, is not necessarily the same. Take how much you like the course, for example. How much more do you have to like the course to go from a one to a two? It might not be very much to get you from "I detest the course" to "I hate it," right? But going from, like, a four to a five might be much harder.
It might be much harder to go from "I'm kind of indifferent" to starting to like it. So the spacing between the numbers is not uniform across the scale, unlike with real numbers. With real numbers, the spacing from one to two to three is always the same. That's how we construct the real number line. That's a nice feature. With these sorts of measurement scales, there's no guarantee that's true, because there's something going on in the psychology of the respondent, or in the physics of your depth gauge, that is compressing some underlying space with uniform spacing into this oddly packed measurement scale you've got now. And that is hidden from us. If we knew all that stuff, we'd have the raw measurement to begin with. But we don't have that. So these are the inconvenient features of ordered categories, if you want to think about it that way. They are discrete outcomes. Typically we'll work with discrete outcomes like never/sometimes/always or 1, 2, 3, 4, 5, 6, 7. They have a definite minimum and maximum, so you already know constraints about the outcome. And they have an ordering: two is greater than one, and three is greater than two, and so on. Sometimes is greater than never, right? Frequently is greater than sometimes. So there's your ordering. However, the distances are non-uniform, so you can't treat them just like a regular old discrete outcome. And they're not counts. This is important to realize. It looks like counts, but they're not a count of anything. They're points on some continuous underlying scale, like excitement about the course. [Question from the audience.] The question was: doesn't it matter how the scale is presented? Oh yeah. I mean, this is the horror of doing work with people, right? I said language is great because it's cheap, but then there are lots of little framing effects.
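Since that non-uniform-spacing point is easy to gloss over, here is a tiny numeric sketch. The course's code is in R, but the idea is language-agnostic, so this is Python; the responses and the "latent" spacing below are entirely made up for illustration. Doing arithmetic on the raw 1-to-7 codes silently assumes equal spacing between adjacent responses; a different (hypothetical) latent spacing gives a different mean for the very same data.

```python
# Five hypothetical course ratings on a 1-7 scale.
responses = [1, 4, 4, 5, 7]

# Equal-spacing assumption: code k means the value k on a real line.
uniform = {k: float(k) for k in range(1, 8)}

# A made-up non-uniform latent spacing: 1-2-3 are psychologically close,
# but 4-5 and 6-7 are big jumps (as the lecture describes).
latent = {1: 1.0, 2: 1.2, 3: 1.5, 4: 4.0, 5: 6.0, 6: 6.5, 7: 9.0}

mean_uniform = sum(uniform[r] for r in responses) / len(responses)  # 4.2
mean_latent = sum(latent[r] for r in responses) / len(responses)    # 4.8
```

Same responses, different implied averages; that gap is exactly the information an ordered logistic model refuses to assume away.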
When there aren't real incentives on the table to be honest or get the answer right, lots of little adjustments in the wording matter. It's why political polling is so manipulable, right? Same reason. Nevertheless, we seem to somehow learn something about people's attitudes, so there's something coming through the noise. I don't know if that makes the spacing more equal or not, though. I don't think it does. And the spacing may vary across subjects. That's something we can talk about after I introduce the model types; we can make everything vary. Was there a hand? Or was that just the room? Okay. So these things are hard to model. They're not continuous. They're not counts. None of the stuff we did last week is really perfect. You can analyze these data types with Gaussian models, by the way, and I don't think that's horrible. But you just have to keep in mind what you're giving up. You're giving up everything on the predictive scale, right? Because a Gaussian model is going to predict Gaussian-shaped outcomes, and these outcomes are not Gaussian-shaped. I'm going to show you some raw data in a moment and convince you of that. So what's on the predictive scale? The AIC and WAIC family of methods, p-values, if you're so inclined: those are all things that are defined over sampling distributions, and those depend upon the likelihood having some maximum entropy relationship to the constraints you know about the outcome. But you can treat the models epistemologically. They are, in my opinion, always kosher if all you're interested in is the mean and variance of a measurement. Then they give useful information about how the mean and the variance respond conditional on your predictors. But on the predictive scale, they suck. I mean, you have to be really careful about what's going on. So that's the caveat.
So if you want to do better and deal with some of the nonlinearity, like the boundaries and so on, the common solution for these ordered-category types of data is to use an ordered logistic regression. Some people call it ordinal regression; the terms mean the same thing, at least in my mind. This is one of those cases: we've got a monstrous measurement scale, so we're going to make a somewhat monstrous model. I'm going to walk you very slowly today through this monstrous model. The purpose isn't to scare you, but to show you the assembly: there's a strategy that's gone into building this thing up, and it's composed of basic GLM pieces we've already seen. And these models are really useful. They're a workhorse in the social sciences, and really useful in plenty of other fields. So we want to get to them. And for those of you who've seen Pacific Rim: to fight monsters, we created monsters of our own. That's the idea here. Those of you who don't have small children, you haven't seen that movie, and you shouldn't. It's an unbelievably dumb movie. But it's kind of like The Matrix, if you'll excuse my digression for a second. When I watched that, I said, that's the dumbest movie. And then they made two more, even dumber. And I was like, can we have more? So the fact that something is stupid doesn't mean it's not entertaining, and the fact that these models are monstrous doesn't mean they're not useful. Okay, let me introduce the data context we're going to work with today. These data come from a very large battery of narrative attitude experiments done with people like yourselves, actually an international sample, that investigate moral reasoning. Moral reasoning is one of the obsessions of cognitive philosophy.
People vary in their moral attitudes, but there are definitely strong common attractors worldwide. And people are interested in the unconscious principles that guide moral reasoning. What I mean by that is: given a story in which a behavior has taken place, people can make snap intuitive judgments, which they're very strongly emotionally committed to, about whether the behavior was morally appropriate, whether it was morally permissible. And then if you ask them to explain that, it becomes very difficult. People like ourselves are very good at confabulating explanations to justify anything, right? That's how you get into grad school. But most of the world over, people say, no, it's just the right thing, right? And so it's an interesting empirical project to try and figure out what those intuitive principles are. There's a famous tradition of using something called trolley problems to investigate this, and we're going to work with a trolley problem data set, the big one. Let me introduce you to the classic trolley problem. All right, what you're looking at, top-down up there, is a trolley, a streetcar in North America, right? I think they're the same thing. And there's a track, and the trolley is moving along on path A. It can keep going that direction. And on the path it's headed down, there are five... you know, I think in the traditional story they were Girl Scouts, and they're, like, lashed to the rail. No, seriously, it's horrible. It's like someone didn't like their cookies. Anyway, there are five innocent people who've done nothing wrong, and they're lashed to the rail because there's a wicked villain with a handlebar mustache cackling off on the side or something. And if nothing happens, this trolley's going to keep going and run over these five people and kill them.
But there's a side track, Track B, which has only one person lashed to it, because there's a slightly more diminutive evil villain with a handlebar mustache who's only managed to lash one person down or something. I don't know, it's a strange story. The point is: there's one innocent person down Track B and five innocent people down Track A. That's what matters in the story. And there's a switch, and you are standing by the switch, or somebody is; you don't necessarily need to be the protagonist, but I'd like you to imagine you are. You're the protagonist. You're standing by the switch, and it's set so that the trolley will go down Track A, and you can pull that switch and make the trolley go down Track B. Will you do so? This is a hard problem, right? Because obviously, given the forced choice, you'd rather only one person die than five, but you'd rather someone else pull that switch. It still seems creepy to most people to pull the switch. And this is part of trying to uncover what it is about moral reasoning that makes this awkward. In the usual version of the presentation, the subject reads a narrative, maybe combined with a picture like this, and then they're told the protagonist pulls the switch, or they're just asked the question: how morally permissible is it to pull this lever? I should call it a lever, sorry. And you're given a scale from one to seven, and you're supposed to pick one of these numbers, where one means it's never morally permissible and seven means you should always do it. And people vary a lot in what they select here, right? So I'll leave you to think for a second. I'm going to give two more of these scenarios on the next slides, and you think about what you'd do. You don't have to shout it out. Usually there's only one person who's, like, a three. I'm an anthropologist, so I'm basically committed to always answering four on these. It depends. Are my kin hungry?
Okay, so let's do the next one. That was Machiavellian, if some of you got the joke there; I'm an anthropologist, you know how we are. So the scenarios vary, and now you'll see how this is an experiment to uncover principles. In this version, the protagonist is the red person standing on an overpass, so the trolley is going to pass under this elevated walkway. And of course our villain strikes again, and there are five innocent people who are lashed to the track on the other side, and they're going to die if someone doesn't intervene. As luck would have it, there's a large individual standing next to you. Sorry. This is a famous one, though. There's a person standing on the walkway right next to you with a mass large enough to stop the trolley and save the lives. If this person falls in front of the trolley, it will kill him, but it will grind the trolley to a halt. So this poor fellow here will be killed, but he will give his life, and it will save the five individuals. Oh yeah, I've got an animation there. Right, so now: how morally permissible is it to push the man? One to seven. Keep your answers to yourselves. Your neighbors will judge you. Alright. Next scenario: back to the basic setup, five on Track A, one on Track B, but now the switch is set so that if you don't do anything, only one dies. And now the question is: how morally permissible is it to not pull the lever? Right. Now you're thinking, well, I wouldn't pull it, but it's logically exactly the same as the first scenario I showed you. In fact, all three of these are logically equivalent. All this shows is that people aren't logical, and you already knew that. But notice that people view this last one really differently than the first, even though the outcome is exactly the same.
So, spoiler alert: in the first one, people find it morally weird to pull the lever, even though it saves four lives, right? Only four, because one person is always going to die regardless of what happens. And in the last one, people are really comfortable letting the trolley hit the one person. Same outcome, right? But different moral feelings about it. There are different principles that seem to be at work here, and more than a century of interest in these weird scenarios, literally, they go back a long way, has uncovered three dimensions that people think govern part of the moral intuitions happening unconsciously in these ratings. Different people care more and less about these different principles, and uncovering that variation is of interest as well, and how it changes across the lifespan. The first of these principles is called the action principle, which, succinctly stated, is: harm caused by action is morally worse than the same harm caused by inaction. That distinguishes the first trolley example from the third. Right? In one, the protagonist has to do something. How morally permissible is it to do this thing that results in the death of a person? That makes people uncomfortable. In the third one: how morally permissible is it to do nothing, which results in the death of one person? And that's more acceptable to almost everybody. But again, if you're an anthropologist, it's four. Straight down the line. Four, four, four, four, four. Because it depends. But most people's intuitions haven't been corrupted by anthropology. The intention principle: harm intended as the means to a goal is worse than the same harm foreseen as just a side effect. So this is like the villain principle, right? Hurting someone by accident isn't as bad as hurting someone instrumentally in order to get what you want. And note that the distinction isn't that you wanted to hurt the person. This isn't about sadism.
It's that you're going to profit, and your profiting means that some people must be hurt; that's what's very bad, right? If it's merely some side effect of the goal instead, that is, the people don't have to be harmed for you to get what you want, then that's not seen as so bad. Right. So, to give you a quick distinction: if you're a bad person and you burn down someone's house because you want their land, that hurts them and you profit from it. If instead you do something to get some land next to theirs and it accidentally burns down their house, their house burning down wasn't necessary for you to profit. It was still bad. You're still at fault, and that makes you a bad person, right? But it's not nearly as bad as the first case. You see the distinction? Makes some sense? Yeah. And then finally, there's a contact principle, which you can think of as a special case of the action principle: harm caused by physical contact is worse than the same harm without physical contact. So touching the man on the bridge, that is the scenario that really offends people, the idea of pushing the guy. And how can you distinguish contact from no contact? There's a version of that scenario where there's a lever that drops a platform out from under him, and people find that somehow less offensive. Right. I'm serious. I mean, it's all horrible. It's all horrible. Yeah. Anyway. Okay. So let me try to summarize these. We'll recode these by action, intention, and contact. These will be the dummy variables we use to do the analysis. We have a big data set; it's part of the rethinking package. It's almost 10,000 responses of people putting one-to-seven ratings on these sorts of stories. That's how many people have taken these experiments now. And each different story can be coded by whether it involves action, intention, and/or contact.
So the first one has action, but there's no intention to do harm. The harm is merely a side effect. The one person doesn't have to die; it's just a side effect of saving the five. Saving the five is good. Right. That's different from the second case, where all three are fulfilled. You have to touch the guy to push him off. That's an action. And his death is necessary to save the five lives; it's not just a side effect. Right? Why? Because his mass is what stops the trolley. Right? Yes, I know some of you can't stop. You're trying not to laugh because it makes you a horrible person, right? It's like, how morally permissible is it to laugh at the second scenario? It's okay. We're in the same place here. These are awful stories. They're all awful. And then the third one has none of these. There's no action taken; in fact, you just let what was going to happen happen anyway, because the trolley's going to go down Track B. You don't intend for that person to die to save the five, even though you want the five to live, and that's why you don't take the action. And there's no contact with the person. So it doesn't have any of them. You see the distinction? And there are a large number of different stories that cognitive philosophers have invented that recombine these things and also move away from trolleys. They're not all about trolleys. There are medical stories. There's one crazy one in this data set about an aquarium or something, and you've got to stop it from leaking. Just bizarre stuff. I think there's one with a dialysis machine. Very inventive things. When they present these scenarios, you don't get them all at once, able to compare them; they do them in a sequence, a randomized sequence, different for each subject. In the data set, you'll find an order variable that gives the order in which that particular subject was presented each question. Does order matter? It has an effect, I think.
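The three scenarios and their dummy coding can be laid out compactly. This is a hypothetical sketch in Python (the actual Trolley data in the rethinking package is in R and uses its own column names); the point is just that each story reduces to three 0/1 indicators.

```python
# Coding of the three lecture scenarios by the three moral principles.
# 1 = the principle is implicated in the story, 0 = it is not.
# Scenario labels are mine, chosen to match the lecture's descriptions.
scenarios = {
    "pull the lever":       {"action": 1, "intention": 0, "contact": 0},
    "push the large man":   {"action": 1, "intention": 1, "contact": 1},
    "don't pull the lever": {"action": 0, "intention": 0, "contact": 0},
}

# The second scenario is the only one implicating all three principles.
all_three = [name for name, d in scenarios.items() if sum(d.values()) == 3]
```

These indicators become the predictors in the ordered logistic regression; the outcome is the 1-to-7 permissibility rating.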
It has a weak effect. People get fatigued and they just start choosing four. At least a lot of people do. So there's an order effect, but since individuals get the questions in different orders, you can partial it out. So, on the right there, I'm showing you what the aggregate data look like, just as a histogram: the ratings across all questions and all subjects, how permissible it is, one to seven, and then the frequency of those responses. You'll see there's a big spike at four. So lots of people are like, I don't know. But most of the data is not at four; people have opinions about this. These come from a large battery of experiments done by Fiery Cushman and his colleagues. Cushman, I think, started off as a philosopher and became a psychologist; he's still kind of in between fields like that. The data are from 331 individuals, 30 scenarios, 9,930 responses. Scenarios are the different kinds of trolley stories. So we're interested, for now, just in how responses vary with action, intention, and contact. In your homework, I'm going to ask you to analyze these data a little more deeply and consider effects like age and gender and individual variation, which matter a lot. Absolutely. Age is a huge effect, and gender's not such a huge effect, but it's a very consistent cross-cultural effect on this. I won't spoil the story with what it is, though. You can break it down. So at the top there, we've got the full histogram, not conditioned on any of the predictors. And then at the bottom, I break it apart. In the lower left: stories where action is implicated, so the action dummy variable is coded as one. There's a shift. You notice that we lose some of the sevens, right? We've lost some permissibility. Things that involve action are on average less permissible, but it's complicated. The change in the distribution here is really odd. This is the thing that makes these measurements so monstrous, right?
And now we turn on intention. We get a big spike here at one when the person's death is intentional, when it's instrumental in saving the lives of the five. That makes it a lot worse. And then contact is the worst of all, you can sort of see. It really pulls all this other stuff down, and you get a lot more data there at one. Lots of individuals, by the way, flip-flop between ones, fours, and sevens, ignoring things in between, which is a good example of how the implied psychological spacing is different between different units. One, four, and seven are focal points on the scale, and people are attracted to them, but they do choose the others, as you can see. But it's much easier to move from four to five than from four to seven, right? It's a smaller psychological distance. So let's start building up a probability distribution for these outcomes. This is the part where it gets a little monstrous, but we'll go slow and you will understand this. The traditional solution, and there are other solutions, but they're really all the same strategy, so once you learn this one, you can quickly learn the others, is to use something called a log cumulative odds link, and I'll explain over the next several slides what that means. It's a link function. You've seen link functions before, and this is just a special kind of link function that helps us usefully model weird distributions like this. Basically, our goal, to talk functionally for a second before we get into the mathematics of it, is to describe the histogram that we see of these data, just to have a mathematical re-description, a set of parameters that basically re-describe the histogram, and then to model how those parameters change conditional on predictors, so that we can morph the histogram. That's what these models do.
They fit the histogram, which you can already see, but then they model how, as predictor variables change, the histogram changes shape, and that's why they're useful. So we start with the raw data here on the left. You can re-describe these data using a cumulative distribution. That's what I'm showing you here in the middle. All that means is we start with each response and we ask, what proportion of the data are that response or lower? That's what a cumulative distribution means. I'll say that again. For each response, we ask, what proportion of the data are that response or lower? So for response value one, a little more than 10% of the data have response value one. It's hard to see from this distribution, but if you sum up the total number of responses here, there are about 1,300 ones. The proportion of ones is a little over 0.1. So that's why the cumulative distribution there starts a little above 0.1. When we go to two, we add the extra amount of twos on top of that. That's what makes it cumulative, so we're crawling up. Does this make sense? We haven't worked with cumulative distributions in here, but I know you guys have seen them before. And so this is also why, for example, there's a bigger jump at four, because there are a lot of fours. So we get a big jump at four. And then all the way up. And seven is always at one, because it's the maximum value. So the cumulative distribution at seven is free, right? You always know it. It's the maximum value, so it's got to have a cumulative proportion of one. You with me so far? So we're going to define our link function over this cumulative distribution, because it makes it easier. We get a degree of freedom removed here. We get the top one for free. So we need one less parameter. And that makes it convenient. And we're going to work with this, as you might expect, on the log odds scale, for exactly the same reason as before.
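To make the cumulative re-description concrete, here's a minimal Python sketch. The lecture's actual examples use R and the rethinking package, and these counts are made up for illustration, not the real trolley data:

```python
# Turn a histogram of ordered responses into cumulative proportions.
# Hypothetical counts for responses 1..7; the real data has 9,930 responses.
counts = [1300, 600, 500, 4000, 1200, 900, 1430]
total = sum(counts)  # 9930

cum = []
running = 0
for c in counts:
    running += c
    cum.append(running / total)  # proportion of responses <= k

# The last entry is always 1.0: the cumulative proportion of the
# maximum value is free, which is why we need one less parameter.
print(cum)
```

The last entry coming out as exactly 1.0 is the "degree of freedom removed" mentioned above.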
The logit transform is easier to work with, and it lets us plug a linear model in here. I'll take a question in a second. So we take these cumulative proportions and we logit transform each of them to get a log odds cumulative proportion of the data at each value. And that's what I'm showing you on the far right. So this is now on the log odds scale, exactly like your annoying eagles homework that you're working with right now. You're loving that. And it's the same scale, but now it's cumulative odds, not just discrete odds, which is what you're working with in your homework. Yeah? Question? I was going to ask, would you always do it this way? You don't have to. There are other approaches, but this is the conventional one. It's the one you'll see almost always. You don't have to, but this is really convenient. It solves a number of computational problems for us. Was there another question out there somewhere? No. Have you guys got this? This is the weirdest part of it, because at this point it's like, why am I doing this? It's like, I know where we're going, we're going to the beach, and this is the dark-woods part of the trip, right? How people invent this stuff is a whole other story, and the process of discovery is one thing, but solutions often seem weird, and we keep using them because they're useful, right? So we're going to put a link function on these cumulative odds. So here's the link I've just defined implicitly. This expression here on the slide, on the left here, is the log cumulative odds. Remember, odds are just the probability of something happening over the probability it doesn't happen. And the log odds just means you take the log of that. That's all it is. And this is what you were working with before. This is the logit transform. But these probabilities are cumulative now, which means we've got some outcome, y sub i. Oh wait, that's the cumulative log odds.
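The logit step can be sketched the same way; the cumulative proportions below are hypothetical stand-ins for the middle panel of the figure:

```python
import math

def logit(p):
    # log odds: log(p / (1 - p))
    return math.log(p / (1.0 - p))

# Hypothetical cumulative proportions for responses 1..6.
# Response 7 is omitted: its cumulative proportion is 1, and logit(1)
# is +infinity, so it never enters the model.
cum = [0.13, 0.19, 0.24, 0.64, 0.76, 0.86]
log_cum_odds = [logit(p) for p in cum]
print(log_cum_odds)  # increasing, because the cumulative proportions are
```

Note the transformed values stay ordered: the logit is monotone, so the ordering of the cumulative proportions carries over to the log odds scale.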
We've got some outcome, y sub i, which is our response, the name of the variable in the data set, for case i. And we want to know the proportion of the data where that's less than or equal to k, where k is some particular value 1 to 7, right? It's a particular observable value. So this gives us the cumulative proportion of the data up to value k. And then we divide by 1 minus that probability, so that's the cumulative odds, and then we take the log of it. That's the link function we're going to use. Oh yeah, and k is the category. And then we're going to say that this is equal to some linear model. So here's the GLM strategy again. And we're going to model this as a function of predictors, right? Somebody's brow furrowed. That was a mean look you gave me. I know it's not meant to be, but it really was. A look to kill. Hey, I just report the models. I don't make them. I've made some, but I'm not teaching them in this course. They would be truly horrifying. But yeah, we're not going to do experience-weighted attraction. Some of you know that I've written experience-weighted attraction papers. If I were a sadist, we would be doing those. Okay, so the inverse link is the logistic again, because it's the same link function. It's still a logit. You just have to remember that the probability inside of it is cumulative. And so now what we get out of this, when you're predicting something from the model, is that the probability that any particular case is less than or equal to some particular response value is defined by the logistic of the linear model. Exactly as before. There are different linear models, though, potentially for every outcome k. That's why the k subscript is there. It will make some sense in a second why we need this. So how do we get the likelihood out of this? Now let's focus on the graph at the right for a moment. These are just the cumulative proportions. This is the cumulative distribution of responses.
So these probabilities of y i less than or equal to k are the heights of these gray bars that I've put on the graph so far. You guys see that? That's all they are. That's how they're defined. What we're interested in to get likelihoods, though, is discrete probabilities. We don't want the probability that some value is less than or equal to k, for k 1 to 7. We want the probability that some observed value is equal to 1 or 2 or 3 or 4. We need discrete probabilities to have a likelihood function, so that we can fit the model. Remember, that's how Bayes' rule works. We need a likelihood such that for case number 3, if 3 was observed, we get the likelihood of that 3. So that's what I mean by discrete likelihood. We need the probability of a 3, not the probability of 3 or less. Does that make sense? So how do we get that out of this? Well, you just subtract. We want the orange thing, the probability that y i is equal to k. And this is just the probability that y i is less than or equal to k minus the probability that y i is less than or equal to k minus 1. This is the monster part, right? This is the taniwha, the devourer of men, right? So these are the orange bars that I've imposed on it. And they stack up to 1, but they don't overlap. So these orange bars, line segments, are the likelihoods that we're after. We just have to calculate them conditional on the parameters of the model, and your computer will happily do that for you. You guys with me so far? Conceptually? Right now we're just doing the conceptual work for this. OK. There are a number of similar conventions for denoting these models. I'm going to try to use a transparent one. You may see people do it a different way. These model forms aren't as conventionalized as standard GLMs, but you'll recognize them. It's just important to be able to recognize them. So here's a convention that I like. We say that our response for case i is distributed as an ordered distribution.
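The subtraction step is just this, with hypothetical cumulative values standing in for the gray bars:

```python
# Recover discrete probabilities from cumulative ones:
# Pr(y = k) = Pr(y <= k) - Pr(y <= k - 1).
cum = [0.13, 0.19, 0.24, 0.64, 0.76, 0.86, 1.0]  # hypothetical, k = 1..7

probs = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, 7)]
# These are the "orange bars": the discrete likelihoods for each response.
print(probs)
```

The differences telescope, so the discrete probabilities automatically sum to one: that's the "stack up to 1, but don't overlap" property.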
And what we put into this is a vector of cumulative probabilities for each response, this vector p. And then we define the elements of p, the p sub k. They're the cumulative probabilities of each of these responses. And then we just define our link function: the log cumulative odds of those is equal to some linear model. And here I'm just going to make it an intercept. There's an alpha sub k for every possible outcome k, except, as I'll show you in a moment, the highest one. Why? Because the highest one is free. And we'll talk about its implied value later, but we don't have to include it in the model. And then you need some priors. These are lazy priors, because they're the same for each one, even though we know the alphas are ordered, so there's ordering information we're leaving out, but you can get away with this. These are essentially flat. These are Normal(0, 10) priors. In code form, it's going to look a little bit different, because you don't want to have to process this link function yourself. All that line-segment subtraction that was two slides ago, you'd have to do all that in the R code. And there's nothing particularly difficult about it, but you don't want to have to do it inside a map model. So instead, we use a likelihood function that does it all for you. It has the link function built in. And that's dordlogit here. And so the way you're going to model this is: the response is distributed as an ordered logit. phi is going to be the rest of the linear model that's going to come later. That will contain predictors. For now, it's just a placeholder. And then a vector of intercepts, which is what's been presented there, alpha 1 through 6. And alpha 7 we get for free. We already know it. It's equal to infinity, right? Because the logistic of infinity is 1, and that's what the highest level has to be. I'll say that again, because some of you were like, huh? If you take the value 1 and you do the logit transform on it, what's the answer? Infinity.
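Here's a rough Python analogue of what a dordlogit-style likelihood does internally: inverse-logit each cutpoint, pin the top at 1, then subtract. The cutpoints are hypothetical, and phi is subtracted from each cutpoint, which is the sign convention discussed later in the lecture:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ord_logit_prob(k, phi, cutpoints):
    # Pr(y = k) given linear model phi and ordered cutpoints on the
    # log cumulative odds scale. The top cumulative probability is
    # fixed at 1, because logit(1) = +infinity.
    cum = [inv_logit(a - phi) for a in cutpoints] + [1.0]
    lower = cum[k - 2] if k >= 2 else 0.0
    return cum[k - 1] - lower

cuts = [-1.9, -1.2, -0.7, 0.7, 1.2, 1.8]  # hypothetical, for a 1..7 scale
probs = [ord_logit_prob(k, 0.0, cuts) for k in range(1, 8)]
print(probs)  # a proper distribution over the seven responses
```

Six cutpoints plus the free top category give a full set of seven discrete probabilities, which is exactly the likelihood the sampler needs.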
So I know, you're like, infinity? I've never seen that. Neither have I. But in math, infinity is great. It solves problems. So alpha 7 we get for free, because we know the cumulative proportion of the responses that are 7 or lower must be 1, because 7 is the maximum value. So we don't need to fit that part of the cumulative distribution. We get it for free. All the others are the ones we fit. So that's why 7 is missing. Does that make sense? That's the only thing that matters. But try it sometime. Do logit of 1 and see what it is. And then we assign some basically flat priors here, weakly regularizing priors. For these models, you really need initial values for the intercepts, or it's going to have a hard time getting started. Sometimes you get lucky. They don't have to be right. They just have to be ordered, right? And why? Because you know that alpha 1 has lower log odds than alpha 2, than alpha 3, than alpha 4, right? You want to preserve that ordering. So just give them ordered inputs. And this is a kind of default vector. There's a box in the notes where I show you how to calculate these from the raw histogram of the data. You can just convert the cumulative proportions to the logit scale and get these log odds values out as start values, if you want. But these will work for all the examples in the book. Don't be too finicky about it. You guys with me so far? Yeah? I know that this is a lot to take in. I told you these were monsters, right? But they're really useful. And after you've used them a few times, they're not bad at all. So this fits no problem, although I will say you're going to wait longer for these models to fit than the others. The posterior distributions for these always have strong correlations in them. Why? Because if you move one of the alphas, all of the other alphas must move, because they have joint implications for the histogram of the data. There are strong correlations among the posterior distributions of these parameters, right?
You can't move one without changing the plausibility of the others, right? That's the idea. So they carry joint information very strongly. And that means fitting can be slower. You'll see this whether you fit it by Markov chains or you fit it here with gradients. It's going to take a little bit longer, not very long though. This is the first time you'll have to wait. Not long enough even for a cup of coffee, but you'll notice. It won't be like before, where it takes longer for your screen to refresh than for model fitting to finish. But so far in this course it's been that way. It's also true that this data set is one of the biggest we've worked with so far, and that's going to slow it down too, but it's not just that. It's also the model type. This data set has almost 10,000 cases in it. That's the biggest we've used. You guys, yeah, it makes sense? You with me? Okay. The precis output you get for these models is almost always pretty useless, at least for the intercepts. So we've got an intercept-only model right now that implies six intercepts. These are on the log cumulative odds scale. So if you logistic each of these, you get the cumulative proportion of the data that's expected to be at or below each of these values. That's something you could have calculated from the raw histogram. But now you have confidence intervals around these things too, right? So it takes into account the sample size. Right? What do these machines do? Remember, they're basically machines. They start with the prior and they update conditional on the data. So there's a ton of data here, and the prior is completely washed out long before you even get halfway through this data set. So the standard deviations are very small, right? There's a lot of precision about the overall data and how to re-describe it on the log cumulative odds scale.
That said, these things aren't very useful to you, unless you read log cumulative odds in your spare time, because it's not a scale that you're very familiar with, right? But I just wanted to show you what it looks like. Fitting these models in Stan looks very similar, with map2stan. Stan does a great job with these models. Again, it'll be slower than the others, but there's no real obstacle. I just want to show you notationally. If you prefer to do your homework directly in Stan instead of map, that's great. dordlogit, map2stan translates that into the ordered distribution that Stan uses. And instead of using that c function in R and then a bunch of individually named intercepts, you can just create a vector parameter, here called cutpoints. You can call it anything you want. As I always joke, call it Pickle TARDIS, whatever you like. But cutpoints will help you remember what it is, because these are the cut points between the different outcome values. And you just need to tell it it's supposed to be a vector of parameters, and it'll figure out how long it needs to be from the number of unique values in the response. So it'll find seven unique values in there and know you need six parameters. It'll make six cutpoint parameters. Again, you've got to give them start values, or you're likely to spend a long time trying to initialize your model. So go ahead and give them ordered start values. Not as essential for Stan; Stan is savvier than map on this. And then when you run it, by default, it doesn't show you vector parameters. It'll become clear why next week, when we do multi-level models. You can have hundreds of vector and matrix parameters in a multi-level model, and usually I don't want it to show them all on my screen. So by default, it'll only show you the fixed effects. So if you want to see them, you set depth=2, and it'll show you the six.
These are exactly the same estimates, at the same level of precision, as on the previous screen. Just fit by Markov chain this time. Yeah? Okay. You should try these examples at home. I wanted to say here, by the way, notice that the number of effective samples, n_eff, varies across the parameters. That's okay, and that's normal. Some of them are harder to sample from than others. And you could say that they're more constrained by the posterior distributions of the other ones as well. It's not a problem. A lot of you in your homework from last week got freaked out because n_eff was not equal to the actual number of iterations. You thought there was something wrong? No. In fact, you should expect the number of effective samples to be less than the number of samples you take. That's not a big deal at all. You need convergence, and you need enough effective samples so that you can make effective inference about the shape of the distribution. But n_eff being low doesn't necessarily mean anything is wrong. You could sample the hell out of that chain and eventually get as many samples as you need. That's how people use JAGS, right? Just sample the hell out of that thing. MCMCglmm is definitely that case, right? It's like, I took 500,000 samples and thinned down to a thousand or something like that, right? Yeah? Say you have a scale from 1 to 7 and one of the values never gets chosen? Then it drops out. It won't break, yeah. It'll just drop out, right? It's effectively not there. If you define a scale of 1 to 7 and no one ever says 7, you effectively have a scale from 1 to 6. What if it's meaningful? You can't recover it. No one ever chose it, so, yeah. It's a tough problem, right? You can assign a parameter to it, but I'll tell you what the estimate will be, right? The prior will keep it from collapsing to a singularity, so you can put it in there if you really want to, but it'll drop out of the automated tools.
So automated tools for doing these models, like ours, just drop absent outcome values. You're right, that may not be what you want, but there's no variation in it, so it's hard to know what to do with it, right? People never use it. Questions about this? If you took these data and treated them as unordered categories, or took unordered data and treated it as ordered, what bad things would happen in each of those two scenarios? The question was: if you treated this as unordered, or if you had unordered data and you treated it as ordered, what bad things could happen? It's hard to say in the abstract. Yeah, that's a good question. Well, I think I'll be able to give you a more satisfying answer to that when we get to constructing the rest of the linear model. You're going to see that the ordering here gives us a really elegant and simple way to get predictor variables in here, because as a predictor variable increases, we want to traverse the responses. And so the ordering makes the modeling easy. When it's unordered, go back to the end of last week, the chapter from last week where I do multinomial models. That's a categorical, unordered model. And you can think of this as a special case of that where we impose ordering. This is easier. In the case of multinomial models, you've got a million choices to make, which is why I didn't teach them. They're very confusing. There's a lot of freedom, and the linear model for every potential outcome can be completely different from every other. And sometimes Calvin does models like that, sitting right back there, because sometimes you need it. When you have an example exactly like this, I can do better. Like, if we're looking at a particular data context, I can give you more useful advice. Yeah, I think there would be a way to pore through the parameter estimates, with the parameters meaning completely different things, and figure out the same answer.
If you treated this as multinomial, yeah, I think you'd probably get the same effective inference, but it'd be much harder to arrive at it. Because the ordering lets us know that if you add action, you've got to traverse across the values in order. And that saves us a lot of parameters. We need many fewer parameters. In a multinomial model, you need a parameter for every level. For every predictor you put in, if you want it to affect the probability of any particular outcome appearing, you've got to have a unique coefficient for that level. And that's annoying, but sometimes necessary. Does that make some sense? Good question. All right, back to the intercept estimates. If you want to interpret these, you can just plug them into the logistic. So coef, remember, extracts the MAP values from the posterior distribution. And if you logistic each of those log cumulative odds values, you get the cumulative proportion of the data up to each response. And you already had these, in a sense. These are just averages from the data, but now we're going to start adding predictors, and we'll have distributions like this with the whole posterior distribution. And that's what you get from this, too. Remember, it's not just the MAP that's of interest, but the whole posterior distribution, the relative plausibility of every combination of parameters conditional on the data. With me so far? And then, remember, alpha 7 is missing because it's off at infinity on the log odds scale, meaning the logistic of infinity is one. It is. Try it. Just for fun. So back on the graph, this is what it looks like. We can plot these out. And you see it matches the empirical distribution fine, because that's all we've done so far. We've re-described the data in the form of log cumulative odds parameters, which seems like, why did we do this? Well, because we're building towards something a little bit more useful than this. But even with what we've got from this, we've got uncertainty about it, right?
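That interpretation step can be sketched in a few lines of Python. The intercept values below are hypothetical, just the kind of thing the R coef call in the lecture would hand back:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical MAP intercepts on the log cumulative odds scale,
# for responses 1..6 of a 7-point scale.
alphas = [-1.92, -1.27, -0.72, 0.25, 0.89, 1.77]

cum_props = [logistic(a) for a in alphas]
print(cum_props)  # cumulative proportion of the data at or below each value
# alpha 7 never appears: logistic(+infinity) = 1, so it's free.
```

Because the logistic is monotone, ordered intercepts always map back to ordered cumulative proportions between zero and one.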
Because the sample size is taken into account in the whole distribution. The MAP estimates match the data exactly, but there's a distribution around each of them, right? Because there's more data at some responses than others, you can get different levels of uncertainty for different intercept parameters. And that's useful information, right? It tells you where the data are informative and where you might want to focus future sampling. Okay, you with me? Yeah? All right. Let's add some predictors now. So now what we do is we take our link function there, and we say that the log cumulative odds is equal to some linear model. We've got the alpha sub k's as before; for each possible response k, we've got one of these equations. And now we're going to subtract a common linear model from them. This linear model I'm calling phi sub i, for case i. And that's going to be equal to some linear model like beta times x sub i, where beta is a regression coefficient like we've done before, and x is a predictor of some kind. This will be like our action codes when we get there in a second. Now, why are we subtracting? I've got a long explanation of this in the notes. The quick version is that if a predictor increasing leads to people choosing higher values, what you want to do is take the parameter that's the cut point for those high values and move it down in log odds. And that's why the minus sign is there. I'll say that again. So say I'm asking you how much you like this ice cream. You're always sitting there, so you'll be my example. There's an international audience for these lectures. You're famous. So I ask you how much you like ice cream and you say five. And I say, okay, I'm going to put some nuts on it now. How much do you like it now? I'm assuming you like nuts on your ice cream, mind you. So now it's a six. In order to describe what has happened, the dummy variable added nuts to the ice cream.
What it's done to those intercept parameters, to the cumulative log odds, is you need to move them down, so that you're more likely to end up with a higher response. Because that will assign more cumulative probability to the high values, if you shift all the intercepts down. Because then more mass ends up in the high part of the distribution. I know this is like sorcery, and for those of you watching at home, I'm moving my hands in a very explanatory way. Everyone's nodding. Absolutely. Tell your friends later what they're missing out on. Actually, we have no chairs, so... oh, there's one. Does that make at least enough sense? I repeat it in the notes. This is the tricky thing about it, and it's exactly just that: as the predictor increases, if you want to increase the probability of high values, you have to subtract the linear model, because you want to move those cut points down so there's more probability mass of responses in the high values. It's a weird thing, but it's a necessary thing, and it makes it work out. If you add instead, it just flips the sign on the coefficients. You get the same inferences, but now it's a little bit frustrating, because when adding nuts increases how much you like it, you get a negative beta coefficient. So the linear model gets subtracted just so you can interpret the parameters: a positive beta coefficient will mean increased responses, a negative beta coefficient will mean decreased responses. That arises only if you subtract the linear model. In the context of the trolley data, phi is going to be more complicated. In these phis there's no separate intercept, because there's already a unique intercept for every response; there's a different linear model for every response. Now there's a beta coefficient for action and our dummy variable for action, which is capital A sub i; a coefficient for intention and our dummy variable for intention; and then a coefficient for contact and a dummy variable for contact. You can investigate interactions.
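A quick numerical check of the subtraction convention, with hypothetical cutpoints and a made-up "nuts" effect:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

cuts = [-1.9, -1.2, -0.7, 0.7, 1.2, 1.8]  # hypothetical cutpoints, 1..7 scale

def response_probs(phi):
    # Subtracting the linear model: logit Pr(y <= k) = alpha_k - phi.
    cum = [inv_logit(a - phi) for a in cuts] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, 7)]

plain = response_probs(0.0)  # no nuts on the ice cream
nuts = response_probs(1.0)   # positive beta: every cutpoint moves down...

# ...so probability mass shifts toward the high responses.
print(nuts[6] > plain[6], nuts[0] < plain[0])
```

A positive phi lowers every cumulative cutpoint, so the probability of a seven goes up and the probability of a one goes down, which is exactly why the minus sign makes positive coefficients readable as "higher responses".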
There are actually very strong interactions: having more than one of these things in a story makes it even more offensive. You guys with me? A little bit? Is that a happy squint? Oh, you just can't see. The thing about being an instructor is you're trying to read the audience's body language, so all your nodding is very meaningful to me. You're happy to nod, right? Okay. Wave your hands vigorously if you have questions about this. We're doing great on time, so we don't have to rush here. Alright, so how do we get this into code? Well, you just set phi to this linear model, exactly as you've done before, and it works out. Also define priors for the coefficients. If you don't define priors, you're in flat land, and flat land is bad. Remember that? And you don't necessarily need start values for the coefficients. They can get sampled from the priors. There's no ordering among them. So you're back where you were before. Does this look okay? Yeah? Should I leave it here for a second in case questions bubble up? There's lots of confused looks and lots of beard stroking and stuff like that going on. Alright. You do need start values for the intercepts, and they're down there at the bottom, but the beta coefficients, the three parameters we've added, don't necessarily need start values, because they get sampled from the priors, just like in all the homework you've done so far, and that'll be fine. There's no ordering implied among these beta coefficients, so that's why we don't need to worry about start values for them. You can use start values; that's perfectly fine. If you want to, set them all to start at zero. That's typically rational and a useful place to start. But you don't have to. Okay. So now let's apply this to the data interpretation. If you thought defining the model was annoying, wait until you try to interpret it.
Now we're definitely in Thomson's tide machine territory: a machine that predicts the behavior of the tides, basically. I'm going to try to walk you through useful ways to plot the predictions, because the coefficients are very frustrating now. They're like the little gears at the bottom of the tide machine. Reading them by themselves is hard. But that's where the information is, and how the machine functions. The machine's internal state is just those things, and the predictions are an implication of them. So we've got to push these through. We're going to get a posterior distribution of parameter estimates out of this, like all the other models. And then we need to push them out onto the prediction scale in order to interpret them, just like before. But now the calculations get more annoying. So we're going to spend the rest of the time today on that. Let's fit three models. The first one, which we've already fit, 11.1, is the intercepts-only model, six parameters. 11.2 is the one we just defined. It's a main-effects model, with action, intention, and contact as main effects only, no interactions. So three more parameters: six plus three is nine parameters in the second model. Then in 11.3 we're going to interact action with intention, and contact with intention. We can't interact action and contact, because they're mutually exclusive. Contact, as a special code, is a kind of action, right? Contact always implies action. So, just the way I've coded it, contact is necessarily the interaction of the two. It's just a special coding. So you can think of them as being mutually exclusive. So we've got two two-way interactions to worry about: stories where there's both action and intention, and stories where there's both contact and intention, and how much worse or better are those. They're worse, in a way. Combining these things is non-additively worse to people, at least most people.
There are people like me that just do four all the way down, right? But most people don't. So we fit these models. There are no surprises in how to do that. And let's just look quickly at the coefficient tables, just to convince you how difficult it would be to figure out what these models do purely from coefficients. You can get a little bit out of it, and we'll do as much as we can, but then we're going to move to graphs. So remember, the top part there is the intercepts. And rarely do you care about those, but you need them to generate predictions. Remember, when all the predictors are set to zero, those intercepts give you the histogram of the data. Right? I'll say that again. When all the predictors are set to zero, those intercepts give you the histogram of the data, or rather the posterior distribution of the histogram of the data. And then all these coefficients are adjustments to that. They nudge. They shift the whole histogram. They squeeze it towards the bottom or push it towards the top, and therefore amass more or less probability mass in the high or low values. So let's look at the middle model here real quick, 11.2. You'll notice that all the coefficients are negative. What you can read from that, in a pure main-effects model, and it's still safe to read because it's like an old linear regression, is that all of these things make the action less morally permissible. They lead people to choose lower responses. Yeah? By how much? Well, that's all on the cumulative log odds scale, so your guess is as good as mine. There are ceiling and floor effects, so you can't say on the absolute scale what happens. When you start combining stories that have multiple dummy variables set at one, they get added together, and then you can't simply add them on the outcome scale, because there's a floor, right? You can't go any lower than one. Eventually something is so offensive that everybody answers one, and it can't go any lower.
Which leads me to a funny anecdote, which I will put in here. There's this famous cross-cultural study of mate preferences by David Buss and his colleagues. Some of you will know this study. It's a bunch of questionnaires where they ask thought experiments: on a scale of one to seven, or maybe they used a one to five scale, it doesn't matter, one to something, how important are these various qualities in a potential mate? And one of them was chastity, which is a funny word in English. So they asked this in Sweden, and people were like, can I give a number lower than one? And they're like, no, you have to choose one. So they chose one. And basically everybody in the Swedish sample chose one for the importance of chastity. It was different in other places. So the scales do constrain. There's the floor. Question? What if a predictor increases the likelihood of some responses and decreases others? Well, this model won't handle that. You could do it if you had different beta coefficients for each level. You could absolutely do it. So if you have a problem like that, come to me and I'll show you how to do it. It's not a problem. We need Stan to do it, but we can do it. You just define a different beta for every k that you want. So then you can have a special beta coefficient at four, for instance. People do that a lot. So it's not a problem. But we're not doing that here. There's a constant effect on the cumulative log odds scale. Then there's the interaction model, and it's really hard to figure out what's going on there. The interactions are negative, but what exactly does that mean? Because we're multiplying things. You can probably guess here, since all the dummy variables are zero or one, that the interactions make things worse. People think combinations make things even worse. Especially contact and intention. Contact and intention is supervillain territory, right there. So we do the model comparison at the bottom.
We compare 11.1, 11.2, 11.3. The refresh argument there is just so it updates you on its progress. This might take a while: there are 10,000 observations, and WAIC, remember, is computed over predictions. So go get a cup of coffee while it's doing this — it's good for you. Talk to your lab mates. Watch Left Shark again, something like that. I should have said watch my old lectures. So, notice that there's a lot of data here, and what happens is very typical: even small differences in predictive accuracy lead to big differences in model weight. The model with the interactions does a lot better than the others, despite having more parameters than the others. A whole lot better — it's really no contest. The difference in WAIC between model 11.3 and 11.2 is 160.8 units of deviance, and the standard error on that difference is 25 or 26. So there's uncertainty, absolutely, but there's a lot of data, and even small improvements in predictive performance can be discerned really accurately with huge samples. Okay, so how do we plot these? Yes — somebody still remembers these, right? The classic way to do it. I don't know if I'm licensed to use that image; fair use. So, posterior prediction is now a vector of probabilities. A prediction now is not, you know, something happened, what was the probability of that? It's the probability of all the responses — the whole vector of likelihoods that is the output of the density here. It predicts the distribution of responses now. This is a complicated thing to think about, and there are a number of different conventions for plotting these models. I'm going to show you a really common one that I think is pretty useful — it's the one I use when I'm trying to understand these models. There are these kind of stacked plots where you put a predictor on the horizontal axis.
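Before we get to the plots, a quick arithmetic note on that "no contest" WAIC comparison — the claim is just that the difference is many of its own standard errors away from zero. Using the numbers from the lecture (difference 160.8, standard error roughly 25.5):

```python
d_waic = 160.8   # WAIC difference between models 11.3 and 11.2 (from lecture)
se = 25.5        # standard error of that difference ("25 or 26")

z = d_waic / se
print(round(z, 1))  # the difference is roughly six standard errors from zero
```

A difference of six-plus standard errors is why even the extra parameters of the interaction model can't save the simpler ones here.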
In this case there are only two possible predictor values, so I'm showing you a case where we set action and contact to zero, and we look at the contrast between intention at zero and intention at one. The vertical axis is probability, between zero and one, and the blue lines there are posterior distributions of the cut points on the probability scale. There are six of them, because they divide the probability mass into regions of response, and I've labeled the spaces in between with the responses from one to seven. So this area down here at the very bottom is the probability of a one, out of the whole probability mass on the vertical axis. When intention is zero it's a little bit thinner, and you notice the line slopes up, so when intention is one it's greater. There's a lot of data, so these posterior distributions are pretty tight, but I've plotted the gray kind of fuzzy area: that's a bunch of cut points sampled from the posterior distribution. The code to do this is in the notes. When you go and look at it, you'll see it just takes a bunch of samples, computes the cut points at each predictor value for a given sample, plots a line, takes the next sample, and so on, using transparency so you can see the fuzziness. We've done some of this before — with these complex posterior distributions, you need something like this. You can see the slant more when you get up to the more plausible values, like four — there are more fours. The consequence of this is that when intention is zero, you've got more probability mass up at seven. So turning on intention makes things less permissible, because it increases the mass on lower values — it squeezes the distribution down. Does that make sense, at least for the moment? Let me show you some more examples; I hope it'll make more sense then.
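The sample-a-line-and-overlay procedure described above can be sketched in a few lines. This is an illustrative stand-in, not the course code: the "posterior samples" here are just the mean cutpoints plus a little noise, and the intention coefficient is made up; the real version draws samples from the fitted model and plots each set of cumulative probabilities as a faint transparent line.

```python
import math
import random

random.seed(2)
inv_logit = lambda x: 1.0 / (1.0 + math.exp(-x))

# Pretend posterior: mean cutpoints plus a little Gaussian noise per sample.
# (Made-up numbers; the real thing uses samples from the fitted model.)
mean_cuts = [-1.9, -1.2, -0.7, 0.2, 0.9, 1.8]
b_intention = -0.7   # hypothetical coefficient for intention

for s in range(5):                                 # a few posterior "samples"
    cuts = [a + random.gauss(0, 0.05) for a in mean_cuts]
    for intention in (0, 1):
        phi = b_intention * intention
        # Cumulative probabilities at this predictor value; in the plot each
        # list below is one faint line, and transparency shows the spread.
        cum = [inv_logit(a - phi) for a in cuts]
        print(s, intention, [round(c, 2) for c in cum])
```

With a negative intention coefficient, every cumulative line is higher at intention = 1 than at intention = 0 — that's the upward slope in the plot, i.e., mass redistributed from seven down to the lower responses.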
So, triptych time — I love triptychs. In this case you could do even more, but we've done interactions in these models, so to understand the interactions we need plots where along the horizontal we vary one predictor but set the other predictors to different values, so we can see what happens to the slopes. Remember, this is like the triptych stuff we did back in the interactions chapter. On the left is the plot that was on the previous slide — I've just squeezed it so I can fit three plots on the slide. Same picture as before: turning on intention when action and contact are zero creates a mild shift down in the average rating. The lines tilting up means the average rating gets lower. I'll say that again: the lines are tilted up, which means the average rating gets smaller, which means you get more probability mass at small values. The top value can't get more probable when the lines tilt up — you've got to take probability mass from seven and redistribute it down to the others, and that's what's happening in this graph, because the lines go up. You can see the scatter here in the posterior distribution. There's a lot of data; if you want to convince yourself this looks too precise, take 20% of the data set and rerun the model — it's a lot faster — then rerun the same plotting code, and you'll get much wider confidence bands on these cut points. Just as an exercise, you should try that out. That's what you'll inevitably see with your own data: it won't look this nice, because you won't have 10,000 cases. Probably — I shouldn't say that, maybe you will, but you probably won't. Middle case: we turn action on and off, and it pushes these cut points up relative to everything being off, which is what you get at the far left of the graph over here. That means the average permissibility rating is lower with action by itself than without turning anything else on. And then when we turn intention on, there's an interaction now:
they're both on, and the lines are steeper, because it's extra bad when they're both in the story together. It's not just that the big guy who was pushed off the bridge had to die to save the other people — it's that you did something to make it happen rather than let it happen. There's a version of that horrible bridge story where he's going to fall because he's clumsy, but you've got a chance to save him and you just let him drop. That makes people uncomfortable, but not nearly as much, because it doesn't involve action; it involves inaction. And then finally we get to the worst case: we turn on contact. Now the lines are very steep, and you get a lot more probability mass down at one — it's the most popular option now. It's very impermissible when there's intent and contact. That's the original big-guy-falling-off-the-bridge story, the most obnoxious one: we have to touch the guy, and his death is intentional, instrumental in saving the lives of the other people. Is this action also? Yes, it is, but contact always implies action in the way this is coded. Contact implies action, so you can't turn them both on in the way I've coded the data. When you look at the data set, you'll see there's an alternative contact coding where it's not mutually exclusive like that. Then you can re-run the model that way, with a three-way interaction, and you'll get the same inferences. I wanted to let you guys run the model without needing a three-way interaction — it's for your own psychological health. But as an exercise for the masochistic, I left the original variable in there so that you can run the three-way interaction. You'll get the same inferences, but there'll be more parameters. I know it's a tricky thing, but often recoding your data can make the model form easier. It's part of stats-fu — high-level stats-fu. Question: if you didn't ask about some permutation of action, contact, and intention — if you're missing one of the combinations — could you still infer what it might have been
from the data? It depends. In this case, no. I mean, your model will make predictions for it, probably, but you won't get an interaction estimate. If you have no data at the interaction and you have an interaction model, then you can't measure the interaction, and you'll get a really wide posterior distribution, because there's no actual case in the data that combines them all at the same time — there's no information, so you basically get the prior back for the interaction term. But if it's the main-effects model, you're fine, because that model assumes there are no interactions — everything's additive — and the model will make predictions for the cases where the predictors are combined. They'll probably be wrong — in this case they would be — but it'll give you a posterior distribution. I think your general question is one I can answer like this: if there's no information in the data about a parameter, what do you get back? You get the prior. Does that make sense? That's why it's often nice to compare the prior to the posterior distribution. Next week I'm going to start showing you more examples of that, where we compare the prior to the posterior so you can see exactly what the model has learned from the data. It's often a useful thing to do, so we'll start doing examples like that. These models have been easy enough that the priors get washed out, but there will be cases later on, with multilevel models, where the prior doesn't get completely washed out, so we'll want to look at that. Okay — I feel like we're almost out of oxygen, but I'm almost done. Let's look at what this looks like on the predictive scale. I want to show you that these are still just models, and they make assumptions. In particular, these models assume there's a constant effect of action, intention, and contact on the log-cumulative-odds scale, regardless of what response value you're
at. We already talked about how this is unlikely to be the case, and you can see the consequence of it in the posterior predictive check here: it does a pretty bad job of predicting fours. In the histogram of the data where all the predictors are zero, there's a big spike at four. But when you turn everything on, like I'm doing here — we turn on intention and contact — the actual data, shown in blue, slams all the way over to the ones, and all those fours get subtracted out, basically; fours are no more numerous than threes. But the model expects the black bars, because it still thinks you're going to get this spike at four — it's still using the alpha that was estimated for level four when all the predictors were zero. This is a bad prediction. So this is a case where, if you really care about getting the predictions right, you need the ability to adjust: you need a different beta coefficient for the fours, so that they can get slammed down when you turn on intention and contact. I don't think it's essential here, though. It's always about posterior predictive checks. There's always something wrong with your model — there has to be, and if you don't see it, you're not looking hard enough. I guarantee you, these are golems, and they will wreck Prague; you just have to figure out when and why. But you may not care, because you're still getting the right inference. There's no reason to think this is leading you astray: turning on intention and contact shifts the distribution towards the left, and people think it's less permissible. If that's the inference you're after, then this is valid — you're fine. But there are always imperfections. Nevertheless, you're doing way better than just coercing this into, say, a binomial model. You could treat the response as if it were a count and put it in a binomial model — try it for fun, it will fit no problem — but remember, binomial distributions
have their own implied constraints: there's a constant expected value across trials, and there are only binary outcomes, which we sum across. As a consequence, there's a very constrained set of shapes the distribution can take — the black lines here show you — and it's terrible compared to what the data actually look like. Ordered categorical data can take on really weird shapes, anything within the family of constraints: a defined minimum and maximum, over discrete values. That's why we use this weird cumulative log-odds scale that copes with it. Does this make some sense? Gaussian would be even worse, because it would predict negative numbers — a Gaussian distribution would predict negative responses here, guaranteed. And again, if you're only interested in the mean and variance, that's not necessarily the end of the world; you're going to get the right qualitative inference out of the model, but there are things you can't do with it. Okay, let me try to sum up these ordered logit "GLMs". I put GLM in quotes because these aren't technically the classic definition of a GLM, but they're in the same spirit: we've got a likelihood function with a parameter or parameters, and we attach linear models to them through a link function. So this is a GLM in that sense, but it's not a classic GLM, because this isn't a member of the exponential family. But we don't care — it's the right distribution. Now, MAP estimation can be hard here: you need to choose good starting values for the cut points, but as long as they're ordered, generally you're fine. I give you some more advice on this in the notes. There's a nice function called polr — proportional odds logistic regression, which is just another name for this model type — in the MASS library in R. It works quite well, but it assumes flat priors. With a lot of data like this, your priors don't matter anyway, unless you use crazy priors, and then they'll matter; but for anything not insane, the priors are going to wash out. You
still need to be wary of flat priors, though, because these are GLMs, and they have ceiling and floor effects, and so sometimes the data don't discriminate, and you may need priors to do regularization. But Stan does great with these models. It's going to be slower, but go outside and throw a frisbee or something while your model is running. There's also this other version of this called ordered probit — it's very common in econometrics to use ordered probits instead of ordered logits. They're essentially the same; the difference is that probit uses the cumulative Gaussian distribution. So instead of the logistic as the cumulative distribution, which is what we're using here, they use the probit. But the shapes are so similar that if the choice matters, you're in trouble — if that matters, you've got other problems. I think it never matters, but it does change the exact estimates you get. Almost any old cumulative distribution will work, and if you have good reasons to use one or the other, then you'll know. That's what I'm going to say about that. Other questions about this? I'll let you guys go — you'll have questions next time. Come back on Thursday and we'll resume here, talking about mixture distributions.
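You can see how little the logit-versus-probit choice matters with a quick numerical comparison: the logistic CDF is nearly indistinguishable from a Gaussian CDF whose input is rescaled by the classic factor of roughly 1.70. A self-contained sketch (standard library only, no SciPy):

```python
import math

def logistic_cdf(x):
    """Logistic CDF, the 'logit' link's inverse."""
    return 1.0 / (1.0 + math.exp(-x))

def gaussian_cdf(x):
    """Standard normal CDF via the error function — the 'probit' curve."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Rescale the probit input by ~1.70 and the two curves nearly coincide.
scale = 1.70
xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(logistic_cdf(x) - gaussian_cdf(x / scale)) for x in xs)
print(round(max_gap, 4))  # the largest disagreement anywhere is under 0.01
```

A maximum disagreement of under one percentage point of probability, anywhere on the curve, is why inferences from ordered logit and ordered probit almost always agree.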