Okay, we're going to get started. Today is entirely about one type of outcome variable, which I know sounds like the most boring introduction to a lecture ever. But to make this a little more exciting for you, maybe: this is one of the most common data types in the behavioral sciences, and one of the most commonly mistreated data types in the behavioral sciences. So this is a very practical lecture, and compared to the material we just finished, it's not any harder, if that sounds encouraging. You will draw the rest of the owl, basically, and it will be fine. But this is a really important set of stuff. Everybody, whether you're in the social sciences or the biological sciences, encounters data of this type eventually. What is this? It's ordered categories. A categorical variable is the kind of variable you've been dealing with since very early on in the course: discrete classes of things, labels of things. It could be age groups or genders or income brackets, any number of things. When categories are ordered, they're not exchangeable. You can't just assign index values to them at will. So, like, when we constructed gender categories at various points in the course, we could assign one and two to any particular category we liked, and you could make up some story, like you're counting the X chromosomes, if you like. But, you know, sex and gender aren't the same thing, so that's not a great cover story. The point is, we just assign index values at will and move on. It doesn't matter; it doesn't affect the analysis. With ordered categories, it really matters, because you can't shuffle them around. The idea is that they're monotonically ordered in some important way. These could be categories of completed education. Or how important something is to you, if you're asked that sort of response question: how important is the income of a potential spouse, right?
You could answer anywhere from one, not at all, all the way to ten, it's the most important thing ever, right? There's an ordering there that cannot be broken up. If I ask you a question like, how often do you see bats? All right, not a question that is asked very often here, I guess. But where I used to live, this was actually a survey question you would get sometimes. So never, infrequently, sometimes have an order, and they cannot be shuffled at will. This last example is from a published paper that I collected years and years ago: the depth of a harbor seal dive, shallow, middle, deep. There's information there that you might use to understand harbor seal behavior, but it's not metric, right? A bunch of different depths all count as deep. But it's ordered: deep is deeper than middle, and middle is deeper than shallow. Does this make sense? Okay, the world is full of this stuff because of the way we measure it. It has ordering and it has categories, and they come together. So if you want to think about the constraints on an ordered categorical variable: they are discrete outcomes, like counts, but they're not counting anything. They're just indexing categories, but these indexes have to go up in order. There's a defined minimum and maximum. It's a bit arbitrary what the minimum and maximum are, but it's convenient to start the minimum at one. You can start it anywhere as long as you index them consistently; it's fine to index from zero. Some of you know there are programming languages that index from zero, and then there are programming languages that index from one, and it's totally arbitrary, actually. You just have to be consistent. If you switch programming languages very often, like I do, it's a nightmare. You forget which language you're using, and suddenly, you know, vectors start at one. Like, damn you, R. Can't you start at zero like a mature language? And no.
So you just have to figure it out, right? It's like switching spoken languages, human languages, right? No, no, I'm speaking this language now; the grammar is different here. The defined order has to stay, and the important thing we need to model with these variables is that the distances between categories are not constant. They can change quite a lot. The distance between shallow and middle could be very different from the distance between middle and deep. What do I mean by distance? The underlying metric change in the system you're measuring that's required to transition from one category to another. These things would be difficult to model if you had to come up with a system all on your own. Lucky for you, people have figured out very effective ways to model these variables, and I will introduce you to those ways today. And we need to deal with these variables on both sides of the equation. As an outcome variable, there's a particular way to deal with them that observes the fundamental fact that the distances between consecutive categories are variable, so we need to model those variable distances. And on the right-hand side, as a predictor variable, we're going to have to deal with essentially the same problem, because it's not metric. It's not exactly the same problem, but it's related. So we're going to start with an example dealing with the left-hand side problem, if you want to call it that: the outcome variable version of the problem. And this leads to a kind of generalized linear model known by various names, but I'm going to call it an ordered logistic regression. Sometimes it's called an ordinal regression. And, yeah, this is one of my examples of making a monster by hacking together different pieces of GLMs. So let me introduce you to the data set first, and you'll see why we need ordered logistic regression for this kind of data.
So, I know many of you are going to be familiar already with a category of philosophical puzzles called trolley problems. Yeah? No? If the answer is no, you're in for a treat. You're about to see the weirdest corner of the behavioral sciences, I think. What you're looking at here is a top-down view of a speeding trolley. Everybody here knows trolleys, right? Yeah, we live in Leipzig; you dodge them all the time. And they're not very speedy here, are they? So I shouldn't talk, someone's listening. So this thing is going down the track labeled by this arrow A. And it turns out some supervillain or something, these stories are never perfectly clear, has, for some reason, lashed five people to the track in front of it. I don't know if they're asleep or whatever; the story is never perfectly clear. It's a philosophical problem; you're just supposed to accept the premise. So some supervillain has lashed five people to a track, and the trolley is speeding towards them. They will all be killed if the trolley strikes them. There's another track: there's a switch in the track ahead that goes to track B, where there's only one person lashed to the track. Who sets these things up, right? And you are the protagonist, or the antagonist, depending on how you choose, and you're standing next to the switch control, which is currently set so the trolley will keep going straight on path A. If you pull the switch, it will switch to B and kill one person. So here are your choices. Don't blurt out what you want to do, please; just keep it to yourself. You can not pull the switch, in which case five people will die for sure. If you pull the switch, one person will die for sure and those five people will live. And the way these data are collected is, this story is explained to individual people, and then they answer the question: how morally permissible is it to pull the lever?
And you're supposed to answer on a scale from one to seven. Choose a number, one through seven, for how morally permissible this is. Seven means always: it's always permissible, under all circumstances; there's no other detail you could add that would change my mind; you should always pull the lever. One would be never: you should never pull the lever. And then, you know, in between; you get to decide what in between means. This is an ordered categorical variable. It's measuring something substantial; people have really strong opinions about these trolley problems, sometimes surprisingly strong opinions. Let me show you another trolley problem. These come in big flavors, and in the data set I'm going to show you there are actually huge numbers of recombinations of the factors that make these things up. So, second version: now you're looking at a side view of the speeding trolley, and that black thing is not a brick wall it's going to hit, it's an overpass, like a footbridge that goes over the trolley. And that's the protagonist on top in red, standing next to a switch, of course, well, no, sorry, there's no switch in this example. You're standing there, and then there are the requisite five individuals who are doomed, on the other side. And now there is a large individual next to you, say, you know, the Rock, Dwayne Johnson, right, or Dwayne the Rock Johnson, sorry, one of the two. I think his legal name is Rock now. And you could push the Rock off of this bridge, and he is such an extraordinary human being that his mass would stop the trolley and spare the lives of the five people. But he would be killed in the process. He's not that super, right? And now we ask you: how morally permissible is it to push the Rock off of the bridge, on a scale from 1 to 7? This feels different, doesn't it?
I shouldn't have put the Rock in there, because maybe you have opinions about his movies or something. They're excellent, by the way. But let's do a third example. Again, the same setup as the first one I showed you: a trolley speeding down the track, five people on path A, one on path B. But now the switch is set such that the trolley will veer off on track B and kill the one individual instead of the five. And again I ask you: how morally permissible is it to not pull the lever? Which has the same implied outcome, the same logical outcome, as the first version. To a philosopher, this is identical. To a human being, it's incredibly different. And that's why the trolley problems are asked: on some purely logical basis, this is the same question as the first one I opened with; the same number of people will die. But in this case, you don't take action to get that consequence, and in the first one, you take action to get exactly the same outcome. People feel really different about this. Yeah, respect your intuitions; it feels completely different suddenly. And so there's a big literature on this. These are trolley problems, but they get at real moral intuitions people have. People actually disagree quite often about the permissibility of these actions. And so that's what this is: an empirical literature meant to get at moral intuition, how it's constructed, how it develops in individuals. One way the literature is organized is around three principles which seem to explain a lot. And these aren't things that you can get people to say in an interview. These are underlying factors that researchers came up with, meant to explain the variation. And the data set I'm going to use as an example is designed to probe these principles. It mixes and matches them; it constructs a large number of stories.
I think there are 18 different stories in the data set, whole trolley problems that mix and match these different kinds of principles. And the principles are: the action principle, harm caused by action is worse than the same harm caused by inaction. That's the distinction between the first and the third scenario I showed you. Intention posits that harm intended as a means to a goal is worse than the same harm foreseen as a side effect of a goal. So what is this about? Well, I'll have a summary on the next slide to help you get some clarity on what this means. But the idea is that if you intend for someone to be harmed in order to accomplish some action that is maybe morally permissible otherwise, that's bad, right? Whereas if it's just a side effect, consider, for example, pollution. Corporations pollute. If the pollution is merely a side effect of a profit motive, that's not nearly as bad as if they intended to pollute. It suddenly gets a lot worse, yeah? They really wanted those kids to suffer the harm of breathing bad air. But if it's merely a side effect, it's not that it's okay, but it's not as bad. And most legal systems care about intent, actually; this is an old interest. Contact is the final one. This is like a subset of action: you can't have contact without action, but you can have action without contact. We had that in the first scenario. You pulled the lever; you didn't have to shove anyone. But in the second one, you had to touch the Rock and push him off the bridge. And this is another principle that seems to be different from pure action alone. Okay, so let me try to summarize these three things and give you a sense of them. We have three features: action, intention, and contact. The first story has action, but no intention and no contact. You don't intend for the one individual on track B to die; it's a side effect of saving the other five.
That one individual doesn't actually have to die to save the other five; they just happen to be on the track. Does that make sense? So there's no intent. And there's no contact, because you're not touching anybody; you're just touching the switch. But there's definitely action, because you have to move the switch. In the second case, all three are live. There's action: you have to do something to save the five individuals. There's intent, because the Rock's death is required; it's instrumental in saving, I should stop saying the Rock. The anonymous hefty individual. The anonymous Hawaiian professional wrestler, now movie star, has to perish in order to save the other five. And there's contact, because the way this story is set up, you have to touch him. You could have a version of the story where you take the contact out: you just have some lever you pull, and there's a trap door, and he falls. And these versions are done; you can come up with any kind of crazy story. And finally, the last scenario has none of them active. And people tend to see this last one as pretty benign: it's permissible to do nothing, and then the least terrible thing happens if I do nothing. That's okay, people see that, but we have to probe these things. So the data set, just called Trolley in the rethinking package, is a bunch of questionnaire data. It's a very large sample, lots of age groups, 330 individuals, 30 different scenarios. Sorry, I said 18 before; it's 30 different scenarios. What's a scenario? It's a different mix and match of these features, plus some story about what's going on. It's not always trolleys. There are stories involving ambulances and lots of other things, lung machines, all kinds of stuff. They're not all literally trolleys like this. And you need variation in that, because maybe people have particular feelings about trolleys, right?
And so you need to mix and match things to deal with that. So there are different scenarios, and there are almost 10,000 responses in this data set. So there's a lot of data to deal with here. And our question is how responses vary in association with these principles. You want to think about it causally: when I change a scenario to add or subtract one of these principles, leaving everything else the same, what do we predict happens to the moral intuitions? That's what we're trying to probe in data sets like this. And we care about this because of legal policy, and autonomous vehicles are a kind of trolley problem issue right away. There's a literature on this as well. Driverless cars: how many pedestrians should they kill? This is serious policy stuff going on all the time, right? And there are lots of other variables too, like age, gender, the individual, there are repeat measures on individuals, and education. Lots of stuff that makes more variation in addition to the treatments. Now, these data, if you just plot them as a histogram, look like count data, but they're not; one to seven are the different ratings. And this is averaging across all the scenarios, all the data. You'll see a few things that are quite typical of ordered categorical data, at least human-response ordered categorical data. There will often be a spike in the middle. Number four is sort of the "I don't know" answer, and it's a very popular response in these things. If you want to be safe, bet four. Four, four, four. It's also true that the distribution can often be quite flat; people use the whole range up. That's the way it is. It does not look like a binomial random variable or a Poisson random variable. It's not a count and it doesn't behave like a count. It can take any shape, actually. It's just an arbitrary histogram, depending upon the nature of the stimulus you give people and how they feel. And we can break this down.
So again, the top histogram on this slide is the one I just showed you. And then I've broken the data down. The one on the far left averages across the scenarios where action is present, and you'll see it's different from the total in that some of the mass has been shifted from the high categories to the low. Shifted a little bit; still a lot of fours. Next, intent: when intent is present, you'll see there are a lot more ones. There are a lot more people saying no, never, it's never permissible. And then with contact you'll see there's a big change, right? Suddenly a lot of ones, but also a lot of people answering seven. People have different moral intuitions about these things. So what does an ordered logistic, or ordered logit, model look like? This is essentially a categorical model, which is like a binomial model but with more than two categories. But the link function is this weird thing called a cumulative odds link, or the log cumulative odds link. Let me explain how this works in pictures first, and then we'll develop it up into a GLM. On the left here, we've got all the data, shown as a histogram. And in our model, we want to describe this histogram on the logit scale. Just re-describe it. We're going to use a parameter for each level in the histogram, each of the possible outcomes. Actually, we don't need that many parameters; we need one fewer than the number of outcomes. So we're going to need six parameters, as you'll see as we go. The way we think about this is as a cumulative distribution, which means we start on the left and count up the proportion of the total data set that is each response or lower. That's what a cumulative distribution is. So the first one here, the center graph I've just revealed, is the cumulative proportion of each response. About 15% of the sample is one or less. There's nothing less than one, so that's just the proportion of ones.
A little over 20% of the sample is two or less. About 40% of the sample is three or less, and so on, all the way up to seven. And seven, since it's the maximum category, is always one. It always has cumulative value one, because it's a cumulative distribution. If I ask you what proportion of responses are seven or less, you don't even need to see the data set; you know the answer is one, because seven is the maximum outcome. Does this make sense? This is just constructing a cumulative distribution. We've done no statistics yet; we've just re-described the same histogram. What we're going to do then is transform these cumulative proportions in the middle onto the logit scale, which means we take the logit of each: the log odds. So what are log odds? Odds are the probability of something over the probability of everything else. The log odds are the log of that ratio. And if the probability we put in is a cumulative proportion, we get log cumulative odds instead of ordinary log odds. I know this is tons of stuff to keep in your head; there are more slides coming. But this last graph here is just the logit of the middle one; it's transformed now. And remember, on the logit scale, zero is 50%. That's why zero on the vertical axis of the far-right graph corresponds to 50% in the middle one. Yeah? Okay. I'm going to say all of that again in a series of slides, showing you some notation now. So we're building this log cumulative odds link model. On the right here, I'm showing you the cumulative proportions in the sample again. About 15% of the sample is one or less. A little over 20% is two or less. 40% is three or less. And so on, up to seven, where all the values are seven or less, because seven is the maximum. Mathematically, this thing I showed you here is the log cumulative odds link, or what we call the cumulative log odds.
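To make the re-description concrete, here is a small sketch of exactly this construction, written in Python for illustration (the course's own code is R); the counts are made up, not the real trolley data:

```python
import math

# Hypothetical response counts for categories 1..7 (made up, not the real data)
counts = [1300, 900, 1800, 2300, 900, 1500, 1230]
total = sum(counts)

# Cumulative proportion Pr(y <= k), built by summing left to right
cum = []
running = 0
for c in counts:
    running += c
    cum.append(running / total)

# Log cumulative odds: the logit of each cumulative proportion.
# The top category is always 1, its logit is infinite, so it gets no
# parameter: seven categories need only six numbers.
def logit(p):
    return math.log(p / (1 - p))

cutpoints = [logit(p) for p in cum[:-1]]
```

Note that the cutpoints come out in increasing order automatically, because cumulative proportions can only go up and the logit is monotone.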
This is the log of a probability over one minus that probability. That's an odds; it's the definition of odds. And the probabilities here are the probability of some particular response value, y i, being less than or equal to some value k, where k is between 1 and 7. So it's like you're asking, what's the cumulative probability of a 6? Well, it's the proportion of values that are 6 or less. So k in that case is 6, and you need the probability that an observed value y i is less than or equal to 6. We have to estimate that from the data; that's our statistical problem. Then we take the log of the odds and we've got log cumulative odds. So this is structurally exactly like logistic regression. In a logistic regression, the probabilities in the odds are discrete probabilities: the probability of some particular category. Here, they're the probability of that category or any lower category in the ordering. I know this is weird, but it makes it all magically work. We're going to hit a point where suddenly, boom, you see why the madness pays off. Coming up with this on your own, you would never do it. It's like lots of simple technologies. Bows and arrows seem simple, right? Or maybe they don't. But they're not; archaeologists know this. It took forever for bows and arrows to evolve. Try making one sometime. Try making an arrow. Hard enough. So this is like that: it'll look simple after the fact, but it's not. So yes, k is the category. And this thing, phi, is where the action is going to be. We're going to have a linear model attached to it. We're going to use predictors and a series of parameters to describe these cumulative probabilities. So our linear model is linked to these cumulative probabilities, not to the discrete ones. But that's enough; we can still do all the work that way, it turns out.
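Written out, the intercept-only version of the link from the slide is (using alpha sub k for the cutpoint belonging to response k; this is my reconstruction of the slide's notation):

```latex
\log \frac{\Pr(y_i \le k)}{1 - \Pr(y_i \le k)} = \alpha_k, \qquad k \in \{1, \dots, 6\}
```

and solving for the cumulative probability gives back the familiar logistic function:

```latex
\Pr(y_i \le k) = \frac{\exp(\alpha_k)}{1 + \exp(\alpha_k)}
```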
There's just lots of fiddling that goes on inside the model that makes it look monstrous, but you can think of it as a logistic regression on a cumulative scale. And usually when you use it, you don't even have to remember that it's like that, because all the machinery is taken care of for you. As a consequence, to remind you that this is just like a logistic regression in its essence: if you solve that top expression for the probability of y i less than or equal to k, you get the logistic function back. It looks just like the logistic we've used before, because it's the same link; it works the same way. Your computer is going to do all of that logisticing for you, but that's what's going on inside. So now let me take this picture on the right and show you the bits of the math on the left in various colors. These gray bars going up to each of the cumulative proportions, those are the probabilities of y i less than or equal to k on the left. For every k value, there's a different one of these gray bars; that's what those proportions are. Does that make sense? That's all they are; they're just proportions. When we run a statistical model, though, we need the probability of each discrete y i, excluding all the others, because remember, we need the probability of the data we saw. That's how we do stats, at least if you're Bayesian: the probability of the data you saw. And to get those, we've got to subtract adjacent gray bars from one another, to get what's left over, and that's what these orange bars are. We want the probability that y i equals k, not that it's less than or equal to k. So to get that, we take adjacent cumulative probabilities and subtract them, and then you get those little orange bars. Those are the discrete probabilities of each individual outcome. Does that make sense?
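The subtraction of adjacent gray bars is one line of arithmetic. A sketch in Python, using cumulative proportions roughly as read off the slides (illustrative numbers, not fitted values):

```python
# Cumulative proportions Pr(y <= k) for k = 1..7 (illustrative)
cum = [0.13, 0.22, 0.33, 0.56, 0.70, 0.85, 1.00]

# Discrete probability of each response: subtract adjacent cumulative
# values, p_k = Pr(y <= k) - Pr(y <= k-1); the first one needs no subtraction
p = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]
```

The discrete probabilities necessarily sum to one, because the subtraction telescopes back to the top cumulative value.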
And I know at this point you're saying, but I already had that when I started. Yeah, yeah, but you wouldn't have been able to model the whole thing with a linear model. The whole reason this literature uses the cumulative link is that it manages the fact that the categories are all bound together in an ordered way. You establish the ordering by using the cumulative link, and then you don't have to do any extra work to keep them ordered. It's just magically there for you. It's a very clever trick. And when you run it on your computer, you won't even notice this anymore. In fact, I could teach you this whole lesson without telling you how the model works, and it would work the same way; you'd just run a command. But you know you're here to suffer, right? Get all the details; it makes you a better person. So we've done all that cumulative link stuff so that the ordering is preserved. The cumulativeness preserves and establishes the order, but then we've got to get the discrete bits by subtracting adjacent cumulative probabilities from one another. And that's what these orange little line segments are. Yeah? Okay. So if you try to write this model down in the model notation we've used before, it looks something like this horrible thing on the left-hand part of this slide. Because all that fiddling I just showed you, the cumulative link and then subtracting adjacent categories to get the discrete probabilities, is just a bunch of little algebraic transformations. It's just arithmetic, really. And if you wanted to write it all down in detail, like here, you could do it. No one ever does this. I'm just showing it to you so you know that there are steps here, and it's totally algorithmic, and it gets written into the code in the algorithms that do this stuff. But you could do it all by hand.
Obviously, I wrote a function in my package to do all this, so it's just an algorithm. But I want to show you that it's just a categorical distribution up top. That's all it is. The categories are unordered in principle, but we force them to be ordered, because we construct this vector of probabilities p, which has the ordering built into it. We're imposing that because we know the nature of the data. So what we'll usually run is going to look more benign. There'll be a distribution function, dordlogit, for, you know, distribution of the ordered logit, which has all of that fiddling inside it. It does it all for you, and all you specify is two things. The first argument is some linear model that has predictors and parameters; it's the same old GLM stuff you've always done. In this first example, it's going to be set to zero, because we're not going to have any predictor variables. And you're saying, wait, if it's zero, then what's going on? Well, you don't have an intercept in an ordered logit model. You have a lot of intercepts. If you have seven categories, you have six intercepts, and they're usually called cutpoints. Why six? Because that's how you get the histogram back: you need a unique intercept for the log cumulative proportion of each response type. You're just re-describing the histogram. You don't need the last one, because you know it's 100%; the top category is just the leftover amount from all the others. I'll show you this as we go. We're going to run a model now, and I'm going to back-transform the estimates so that you understand them and see the intercepts. So right now we're only going to be working with intercepts, and in the code I call them cutpoints, because cutting the cumulative probability up is what they do. So yeah, this dordlogit function, if you want to look inside it at the R prompt, does all that stuff down there. It just twiddles with indices.
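To show the two-argument structure, here's a sketch, in Python rather than R, of the arithmetic a dordlogit-style function performs. This is not the actual rethinking source, just the cumulative-link-then-subtract logic described above:

```python
import math

def inv_logit(x):
    """The logistic function: undoes the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def dordlogit(k, phi, cutpoints):
    """Probability that the response equals category k (1-indexed), given a
    linear model value phi and an increasing list of cutpoints. A sketch of
    the arithmetic, not the rethinking package's actual implementation."""
    K = len(cutpoints) + 1  # number of categories

    def cum(j):  # cumulative probability Pr(y <= j)
        if j < 1:
            return 0.0
        if j >= K:
            return 1.0  # the top category is always 100%
        return inv_logit(cutpoints[j - 1] - phi)

    # discrete probability = difference of adjacent cumulative probabilities
    return cum(k) - cum(k - 1)
```

With phi set to zero, this just re-describes the histogram from the cutpoints, which is exactly the intercept-only model.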
It's not rocket science, it's just annoying science: the most annoying, unglamorous, shuffling-indices-around sort of stuff. Okay, if you run this model, and this is one of the largest data sets we've used so far, I think, it'll take a minute maybe, and you'll feel bored. You'll be like, oh, stats are so slow. It takes a minute to run this model with 10,000 data points. You'll feel spoiled, right? That's okay; you live in a good time. When I started running Markov chains, you expected a week. You'd set it running and go on vacation. Come back, it crashed. That was sort of how it went. It's better now. You're living in a better time, aside from this whole global warming thing. I should stop talking. So anyway, you run this, and you know that zero up there means there are no predictors. But cutpoints is a vector, and we assign priors to each of them on the logit scale, because we're still on the logit scale. So think about the prior problem and what log odds mean, right? Four means basically always; minus four means basically never. You're still on a logit scale, so everything you learned about logistic regression still applies to thinking about priors in this space. Then we run this model, no problem, and we spit out the cutpoints. That's the only thing in the model; there are six of them. And this is totally uninterpretable, right? They're just some numbers. It looks like science, but it doesn't make any sense yet. These are log cumulative odds; they're still on the logit scale. We can interpret them by converting back. So let's do that. We just need the inverse logit; we undo the link function. We're using the logistic: inverse logit means logistic. So if we apply the inverse logit to that mean column, we get these numbers, which are cumulative proportions of each outcome.
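The back-transformation is one function call per cutpoint. A sketch in Python; the cutpoint values below are illustrative numbers in the same ballpark as those quoted in the lecture, not the actual fitted posterior means:

```python
import math

def inv_logit(x):
    """The logistic function, undoing the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative posterior means for the six cutpoints (not actual fitted values)
cutpoint_means = [-1.92, -1.27, -0.72, 0.25, 0.89, 1.77]

# Back-transform to cumulative proportions of each outcome
cum_props = [inv_logit(a) for a in cutpoint_means]
```

These come out near 0.13, 0.22, 0.33, 0.56, 0.70, 0.85, matching the cumulative proportions read off the slides.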
So let's compare them to the picture again, so that you see this model has successfully done the very boring task of measuring the proportion of the sample giving each outcome type. What has it done more than that, though? Well, let me talk you through this first. You'll see that cutpoint one is about 0.13, which is the proportion of the sample that answered one. Cutpoint two is 0.22, which is the proportion of the sample that is two or less. Cutpoint three is a little over 0.3. That looks a little different, yeah, 0.33; well, these are posterior means, sorry. So we'll get to what's different; this is what it's done beyond re-describing the sample. But it's close. And then four is about 0.56, again not exactly the sample value, and on up to 0.70 and 0.85. Why doesn't it match exactly? Because it's a finite sample, and we took the posterior mean and converted it. There's posterior uncertainty about where the cutpoints are, and there's a real distribution here. The cutpoints are about the population, and the histogram is the sample, and those are different things. On the log cumulative odds scale, you've got a distribution, and if you take any single point in that distribution and convert it, you don't reproduce the sample exactly. But the posterior distribution describes the whole population. Now, adding predictor variables. What we've got to do is think about all those cutpoints as alphas. They're like intercepts: there's an alpha sub k for each cutpoint k. And we're going to have this phi thing now, which is any old arbitrary linear model, and we're going to subtract it from all of the intercepts, all the cutpoints, in these cumulative odds equations. Why do we subtract it?
This is an interesting fact about how these linear models work. It's about the direction we want the probability mass to move. We want the cumulative probabilities to move down when ratings go up — it's like reallocating mass — so the linear model has to push the cumulative mass in the opposite direction from the way you want the ratings to go. I'll show you this in the picture when we get to it; you'll have to see it in the picture to really understand why it's doing this. For now, just accept that if you did it the other way, it would still work, but it would mean that negative coefficients were associated with increases in ratings. If you do it this way, positive coefficients are associated with increases in ratings. That's all it does. It makes things cognitively easier for you at the end, and when I show you the outcome predictions, you'll understand why it needs to be this way, okay? So you can use any linear model, and I have the general example here at the top: phi sub i equals beta times x sub i. Notice there's no intercept, because you already have the intercepts — they're the cut points. Don't put another intercept in this linear model, please. It won't kill the model; you'll just have a redundant parameter, which will slow down sampling. The cut points are the intercepts, and there's a unique one for every level k. It's just the predictors that apply to all of them and move the mass around, as you'll see in a moment. Does this make enough sense to keep going? I know this is the draw-the-whole-owl part, yeah. So what we're going to do with the trolley data: our phi is going to be a linear combination of indicator variables for action, contact, and intent. And we're going to be interested in interactions as well, so we're going to layer in a previous lesson — interactions of action with intent and of contact with intent.
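Here's a small sketch of that mechanic in Python (hypothetical cut points and a made-up phi; the lecture's models are in R). Because phi is subtracted from every cut point, a positive phi lowers every cumulative probability, which pushes mass toward higher outcome values and raises the average rating:

```python
import math

def inv_logit(x):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logit_probs(cutpoints, phi):
    """Category probabilities for a cumulative logit model:
    logit(Pr(y <= k)) = alpha_k - phi. Subtracting phi means a
    positive phi pushes mass toward higher outcome values."""
    cum = [inv_logit(a - phi) for a in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

cuts = [-1.9, -1.2, -0.7, 0.2, 0.9, 1.8]  # illustrative cut points
p0 = ordered_logit_probs(cuts, 0.0)
p1 = ordered_logit_probs(cuts, 1.0)

mean0 = sum((k + 1) * pk for k, pk in enumerate(p0))
mean1 = sum((k + 1) * pk for k, pk in enumerate(p1))
assert mean1 > mean0  # positive phi -> higher average rating
```

If you flipped the sign and added phi instead, the model would fit just as well; the coefficients would simply come out with their signs reversed, which is harder to read.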
So, the coefficients: a sub i is an indicator variable for action, c sub i is an indicator variable for contact, and each of those gets an ordinary coefficient, beta sub A or beta sub C. And then — you can see it on the slide, and it's a horrible thing to say out loud — capital Roman I with a lowercase italic i subscript is the indicator variable for intent, and the symbol in front of it is not a parameter but a linear model of its own. This is how you do interactions; remember, I introduced interactions to you this way. So I just write a second linear model for that coefficient. It has a coefficient beta sub I, which you can think of as the main effect of intent, and then there are two interaction coefficients, where a and c appear again. Right. If you multiply this out, you get the product terms: an a-times-i term and a c-times-i term with corresponding coefficients. But I find this way easier to think about, and I tend to write my models this way because it's easier to read. Otherwise you've got one really long linear model with a bunch of junk in it. Yeah. But it's up to you — it's totally up to you, you just have to be right about stuff. It's fine to use mental prosthetics like this. Some of my models have like 20 linear models in them, but they all substitute into one another, and it all works out. Okay. Repeating the trolley model at the top: you write this into ulam exactly as it looks, no problem. You just have this capital B sub I, and then you define that linear model and write the other pieces in. Give all the other symbols priors; otherwise ulam will complain that it doesn't know what a symbol is. You've got to give it a prior. And the cut points get priors on the logit scale. Then you run the chains. Again, this will take about a minute, and you'll feel really bored — this is taking so long, right?
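The substitution trick can be sketched like this (coefficient values are made up; the point is only that writing the intent coefficient as its own linear model is algebraically identical to the multiplied-out version with product terms):

```python
def phi_nested(A, C, I, bA, bC, bI, bIA, bIC):
    """Written the way the lecture writes it: the coefficient
    on intent is itself a linear model, B_I = bI + bIA*A + bIC*C."""
    B_I = bI + bIA * A + bIC * C
    return bA * A + bC * C + B_I * I

def phi_flat(A, C, I, bA, bC, bI, bIA, bIC):
    """The same model multiplied out into product terms."""
    return bA * A + bC * C + bI * I + bIA * A * I + bIC * C * I

# Made-up coefficients, all negative as in the fitted trolley model.
coefs = dict(bA=-0.7, bC=-0.3, bI=-0.3, bIA=-0.4, bIC=-1.2)

# The two forms agree for every combination of the indicators.
for A in (0, 1):
    for C in (0, 1):
        for I in (0, 1):
            assert abs(phi_nested(A, C, I, **coefs)
                       - phi_flat(A, C, I, **coefs)) < 1e-12
```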
It'll be different from the previous models you've run, though, because with the previous Hamiltonian Monte Carlo models, you spent all the time waiting for it to compile, and then it would run in like 0.3 of a second — boom, done. This one will take about the same amount of time for both, a little bit longer. You'll notice it sampling. But still, it goes very fast, considering you're getting the full posterior, with no Gaussian approximation, for a dataset of 10,000 observations. So what happens? I'm going to suppress the cut points, because they're not interesting — there's a bunch of cut points, and you know what they do. They give you the histogram, in a sense. A sort of average histogram. What the coefficients do is tell you how that histogram gets distorted — it's like morphing — when you add or subtract a feature from a scenario. So we can look at these. And again, this is an interaction model, so you know what I'm going to say: you can't interpret what's going on by looking at the coefficients one at a time, because every prediction depends on more than one coefficient. That's how interaction models are. But you can probably get a guess from the signs — they're all negative — that all of these things lead to disapproval. Right? They lead to lower response values, and lower response values mean less morally permissible. Whenever you add any of these features to a scenario, people disapprove more — they give smaller responses. Make sense? Not everybody does this, by the way. Some individuals in this sample always answer four. Right? Some people always say seven. There it is. But most people show variation, and they care about these features, especially when there's contact. So, plotting these at the bottom, you see zero on the far right. They're all negative, but the interaction of intent with contact is really negative.
It's the worst thing in the world, right? You killed the guy, and you needed him to die to save the other people. It's the worst possible combination. Okay. So you've got to plot these things to understand them. These models are complicated — I introduced them as monsters, but they're necessary monsters. There are lots of options for plotting models like this, and you'll see a lot of conventions. I'm going to show you the one that I personally find most useful. It generalizes to lots of circumstances, and it'll also help me explain how the linear model works to move the mass around. The thing to realize about posterior predictions from this model is that a prediction is not a point. It's not predicting a single thing. It's predicting a distribution. When you ask this model to predict what will happen when you set the predictors to certain values, it gives you a probability for every observable value, right, from 1 to 7. And that's a distribution. So the posterior prediction is a vector. It's not, you know, "30% chance this one thing will happen", and it's not a metric distribution of particular values with a mean. It's a vector — sometimes called a simplex, for those of you who've done a bunch of this stuff before. So we have to plot that vector to understand what the model thinks is going to happen. That's what I've done here. In this plot, on the horizontal axis, we've got the two possible values of intent. The blue points are the data. I've set this up so that we're only looking at scenarios where action and contact are absent, and we're just looking at what happens when you add intent. So on the left, you've got all the most boring scenarios — no action, no intent, no contact — and the blue points show you the distribution of responses in the data for those. Does that make sense? And on the right, we've got all the scenarios where only intent is active, and the blue points again show the distribution.
And then the black lines are — I think — 50 samples from the posterior distribution. I've connected the two sides so you can see the change, right? You can't actually have intent at 0.5; the connection is just to show you the correspondence, and how much each line moves, whether it goes up or down. So you can see the model is describing the sample. Not exactly, right? You don't want a model to exactly describe your sample, otherwise you'd just use the sample. But it is describing the changes in an accurate way. And what you'll see is that the lines all tilt up. Why do the lines tilt up? Because when all the cumulative lines go up, you squeeze probability off the top of the scale and reallocate it to the bottom. If the lines tilt up, the mean goes down. I'll say that again: if the lines tilt up, the mean goes down, right? Because there's more mass at the bottom. You've taken mass from 7 and reallocated it to 1 and everything else below 7, yeah? And that makes the average response go down, because there's more probability mass on the lower values. Does that make sense? And this is why there was that weird thing where we subtracted the linear model — it's just to make it go the right direction. You've got to move the mass down. Now let's look at the other three plots, to give you a sense of this and how it helps interpretation. Oh — what I wanted to say about this, sorry, is that it's the gaps, the spaces, that you want to point your eyes at. The cut points determine where those lines are, but the gaps are where your attention should be. That's the mass — that's how probable the model thinks each outcome is. It's the gap, the space in between. So at the very bottom there's this space labeled 1, and that's the probability mass given to outcome 1, and then 2, 3, 4, 5, 6, 7 all the way up. Does it make sense? Okay, the other combinations — now we look at the interaction effect.
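The tilting-up logic is easy to check numerically. A sketch, using illustrative cumulative proportions: if every cumulative line moves up, the implied mean response must go down (in fact the mean works out to exactly 7 minus the sum of the six cumulative probabilities):

```python
def mean_from_cumulative(cum):
    """Mean outcome implied by cumulative probabilities
    Pr(y <= k) for k = 1..6; category 7 gets the remainder."""
    cum = list(cum) + [1.0]
    p = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, 7)]
    return sum((k + 1) * pk for k, pk in enumerate(p))

before = [0.13, 0.22, 0.33, 0.56, 0.70, 0.85]  # illustrative values
after = [c + 0.05 for c in before]             # every line tilts up

assert mean_from_cumulative(after) < mean_from_cumulative(before)
# Equivalent closed form: mean = 7 - sum of the cumulative probabilities.
assert abs(mean_from_cumulative(before) - (7 - sum(before))) < 1e-9
```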
So from the first plot over there, you conclude that the main effect of intent is: if there's intent to harm involved, that's bad. It leads to lower judgments, right? The lines tilt up, which means people disapprove more — they give lower response values. In the middle, we're looking at scenarios where action is present and then we're adding intent. We go from scenarios with no intent to scenarios with intent, and the lines tilt up again — they tilt up more, right, more steeply. So there's an interaction between action and intent that makes it even worse, even more morally objectionable than before. And finally, the pushing-the-man-off-the-footbridge scenario on the far right is the case where we've got contact. Action is set to zero here because of the way the data is coded: contact implies action. The action variable here means action without contact, and contact means action with contact. It's just the way it's coded in the data set — that's why action is set to zero at the top. You won't find any scenarios in this data set coded so that both action and contact are one; they're mutually exclusive in the coding, because contact always implies action, right? So again, same story: the lines tilt up, but they tilt up a lot more now. You'll notice there's a lot more probability mass at 1 — that's the big shift. Lots of people suddenly start disapproving when you add contact to a scenario, especially if there's intent. Yeah. You'll see a lot of these sorts of plots in policy journals, and sometimes in political science journals, because they have response data like this. They'll ask people their preferences for policies and various things, and you've got ordered outcomes, and you need to describe the model output in some way. That's where I learned this visualization — from people who do that sort of thing. But you'll also see other conventions. You'll see people just plot histograms and such. Okay.
Let me try to summarize this and simultaneously transition to the next exciting bit, which I can do a little faster: ordered categorical predictors. We're going through all this fuss because the spaces are different. With the outcome variable, it was an ordered category: it's much harder to get from, say, six to seven, or from one to two, than it is to go from three to four. Three and four are really similar values; two and three are really, really similar values. The gaps change. The amount you have to worsen the story to move people between adjacent values changes as you move across the scale. That's what ordered categorical variables are like. And they're like that on the other side of the equation too. Just because you make one a predictor doesn't mean it suddenly has a different nature. It still has those gaps in it. If you had a response variable like this and you wanted to use it to predict behavior — say we're going to predict how people actually behave, so we put them in a real trolley problem or something like that, like they do in The Good Place (there's an episode — anybody watch that?) — then you would have to respect these gaps, this non-metric nature of the variable, to make good predictions. Luckily we can do that, but it takes a different sort of form. In the notes I give you all the algorithmic details; in lecture I'm going to move quickly and just deliver the concepts. So I'm going to use the same data set, and we're going to keep the outcome as it is, but we're going to add another predictor now: a demographic predictor, education. Education in this data set is completed educational level. At the bottom of this slide, I'm showing you all the unique values of the education variable in this data set.
And they range, you know, from elementary school — there are some individuals who have only completed elementary school in this data set — all the way up to my education level, which I guess is a graduate degree. Is that what I have? It's a graduate degree. But you've got intermediate levels too. One that's going to be important to our story here is "some college". There are a lot of individuals in this sample who have some college. What does "some college" mean? It means you were in college when you responded to this questionnaire. And there are a lot of people with some college in this sample because, you know, it was an online thing you could click on and do, and that's how it worked out. It turns out that's an important category here. So here's our general strategy: each of these levels is going to get a unique parameter, and we're going to have to establish an order among them. There's a natural ordering here that you have to respect, right? You have to have finished elementary school to go to secondary school — I think, in most places, yeah? — and you have to have finished secondary school to go to college. So there's an ordering. This is a cumulative, monotonic idea. And if we were to just put education in as an ordinary metric variable, treating it as metric, you could maybe get away with that. We'll look at what that does. You could maybe get away with it, but it ignores the discrete, ordered-category nature of the variable. It assumes that every additional level of education has the same marginal effect on your tendency to respond. That's very unlikely, so we at least want to consider the possibility that it's different. But we are assuming it's monotonic, meaning that each additional level adds or subtracts in the same direction. We're going to retain that monotonic assumption: if any unit of education makes you less morally permissive, then every unit does — it keeps going up or keeps going down.
But the size of the gaps can change. Okay. So we're going to have all these parameters, and the sum of these parameters will be the total maximum effect of education. At the highest level of education, there'll be a parameter which tells us the biggest possible effect. Then we'll have all these little sub-parameters which tell us each gap, and we can look at those independently and learn stuff. And the trick I'm going to show you — if you want to assign a prior and not go insane in the process — is to code it in a particular way so that you can use this cool thing called the Dirichlet distribution. That's what I want to show you how to do. It also lets me introduce you to another mathematician, which is always my hobby in this class, right? So here's how you do it in cartoon form. We've got a linear model phi — you remember phi — and there's a bunch of other stuff in it, like the predictors we already put in when we ran the model a little while ago. What we're going to add are these little delta parameters. Say you have an individual who has only completed the first increment of education, which in this data set is elementary school. That individual gets delta one added to their linear model, because they completed the first level of education. And that's it. Say there's another individual who has also completed the second level of education, middle school in this data set. They get delta one and they get delta two: elementary school had an effect, that's delta one; middle school had an effect, that's delta two. Does this make sense? And this just goes on as far as you need. How many can you have? You can have 100 deltas if you want, right? The only limitation is your computer, your data set, and the bounds of information theory. Yeah. In our case, we have seven increments, and the idea is we sum over all the little deltas, right there in the linear model.
And this is a single predictor, but it implies a bunch of parameters — that's the key. It's just like how we dealt with the ordered outcome. In practice, the way we handle this is we factor out the total size of the effect, right? The deltas could be any size: if education has a big effect, the deltas could be big. To make this more interpretable, and so we can set the priors more easily, we factor out the sum of all the deltas and call it beta sub E, for education in this case. That's the maximum possible effect. There will be individuals who've completed graduate degrees; their linear models will have the sum of all the deltas in them. Whatever that sum is, we just call it a parameter — beta sub E — the maximum effect of education. And now what are the deltas? Since we've factored out this maximum, each little delta is now a proportion of the maximum total. So all the deltas sum to 1, and we've got this beta thing in front. This also means we can lose a delta, right? Because since they all sum to 1, we get one delta for free. And where did that free delta go? It became beta. So we've got the same number of parameters we had before, but now we can think about priors correctly. We can set a prior on the maximum effect — something like: given background knowledge, the most a graduate degree is going to do for a person like this is move them a couple of points; that's the prior knowledge. Or maybe you know nothing and you need to assign something diffuse. Regardless. And then for the priors on the deltas, you can make them all the same if you want. You don't have to set a magnitude; you just have to say, relative to one another, how big or small they are. And it turns out there's a very nice distribution for that. It's called the Dirichlet — at the bottom down here. It's also just a great word to get to say.
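Here's a sketch of that decomposition in Python (the delta and beta values are made up; the real model estimates them). The key property: the deltas are proportions that sum to 1, so a person at the top level gets exactly the maximum effect beta sub E, and each increment moves the effect in the same direction:

```python
def education_term(level, bE, deltas):
    """Monotonic ordered-predictor term: bE is the maximum effect
    (all increments completed); deltas are proportions of that
    maximum and must sum to 1. `level` counts completed increments."""
    assert abs(sum(deltas) - 1.0) < 1e-9
    return bE * sum(deltas[:level])

# Seven made-up increments that sum to 1 (illustrative only).
deltas = [0.20, 0.15, 0.10, 0.15, 0.05, 0.15, 0.20]
bE = -0.30  # hypothetical maximum effect of education

assert education_term(0, bE, deltas) == 0.0            # nothing completed
assert abs(education_term(7, bE, deltas) - bE) < 1e-9  # full maximum effect

# Monotonic: with bE < 0, every increment pushes the term down,
# but the increments need not be equal in size.
effects = [education_term(k, bE, deltas) for k in range(8)]
assert all(b <= a for a, b in zip(effects, effects[1:]))
```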
And so the Dirichlet distribution is a distribution for probability distributions. I'll say that again: it's a distribution for probability distributions, over discrete outcomes. It has one argument, alpha, which is actually a vector, and the length of alpha tells the distribution how many different events could possibly happen. When you sample from a Dirichlet distribution, what you get are probabilities, one for each of the possible categories. This is a workhorse distribution in machine learning — it's absolutely everywhere. It's the generalization of the beta distribution, for those of you who know the beta distribution. The beta distribution is also a distribution of probabilities, but there are only two events in a beta distribution. The Dirichlet is unbounded: you could have a million categories in principle. You won't, but you could. We'll have seven — actually six, because we get one for free, right? So here's where I give you the history lesson on who this fellow was. Dirichlet was a German mathematician with a French name — there's some interesting history there. He married Felix Mendelssohn's sister, among lots of other interesting things, and Gauss was one of his teachers, so he comes from a good mathematical pedigree. He did a bunch of stuff — this distribution is actually one of his more minor things — but he's the one who really described it and understood it. So, as I said, its shape is determined by a vector of n parameters. Let me show you what it looks like, to give you some intuition about how it works. In this case we have a seven-dimensional Dirichlet distribution as an example, right? Our seven levels of education. And the prior is going to say what the relative importances of completing each level are, right?
So if any particular level is high, then when you complete that one, you get a big jump — or decrement — in your moral attitudes, depending on the sign. So we set every value in alpha to 2 — that's what I mean by alpha equals 2 up top — and we sample from that a bunch of times, and I'm showing you the distributions that come out. I connect all the dots with lines to show you that they go together — they're a set; each connected set is a single sampled distribution. And the one in bold is just to help you see an example. What I want you to understand about this is that when you set all the alphas equal, that isn't saying you think all the probabilities are the same. What it means is that you have no reason to think any of them is bigger or smaller than any of the others. That's what it means — that's what the prior actually says. Sometimes that results in them being nearly the same, but most samples, as you see here, do not give you distributions where they're all the same — especially when the alphas are small numbers. Right? And if you increase alpha, you force the mass so the probabilities are more and more similar. That's why I show you on this slide, top middle, alpha set to 4 — you see it starts to contract. I'm using the same random number seed for each of these, so you'll see some ghost similarity across the plots from that seed. We're squeezing all the mass together into a smaller and smaller region, but the probabilities are still not equal — they just get more and more similar as we go. By the time you get to alpha 64, right, that's a prior that says: I'm really, really confident that they're all super similar.
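You can simulate this yourself without any special libraries. A sketch: a Dirichlet draw is just independent Gamma draws normalized to sum to one, and averaging the spread (max minus min) over many draws shows how larger alphas concentrate the samples around the uniform distribution (the seed here is arbitrary):

```python
import random

def dirichlet_sample(alpha, rng):
    """Draw one sample from a Dirichlet distribution by
    normalizing independent Gamma(alpha_j, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def avg_spread(alpha, n, rng):
    """Average (max - min) across n sampled distributions:
    a rough measure of how unequal the probabilities are."""
    total = 0.0
    for _ in range(n):
        p = dirichlet_sample(alpha, rng)
        total += max(p) - min(p)
    return total / n

rng = random.Random(2025)
one_sample = dirichlet_sample([2.0] * 7, rng)
assert abs(sum(one_sample) - 1.0) < 1e-9  # each draw is a distribution

# Equal alphas don't force equal probabilities; larger alphas
# just concentrate the draws around the uniform distribution.
assert avg_spread([64.0] * 7, 200, rng) < avg_spread([2.0] * 7, 200, rng)
```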
If any one of these levels has an effect, then they all have about the same effect — that's what that prior means. We're going to use alpha equals 2 as a start here, but again, this is domain knowledge. You can make them unequal too: you just give the different alpha values different numbers, and you can pile up the mass wherever you like. But you understand these priors by simulating from them, like this. Okay. If you want to use this in ulam, go home and stare at this code, because there's some advanced notation here. If you just want to run it, you can copy it as-is. Also, the brms R package automates all of this; it has a little bit of notation that does this stuff for you. But my goal is to show you how the sausage is actually made, right? Why it is so delicious. So this is all you need, and it shows you that you can do lots of fancy things in ulam. In this case, we've got this Dirichlet prior, and we declare on the left that the deltas are a type of object called a simplex, meaning all the values in it have to sum to one. It's a very convenient kind of data type. Then I do this index-twiddling thing, appending a zero to the front of the vector — this is all explained in the text if you care about it. You don't have to care about it. But this is what your computer is doing: it's constructing the vector so that it can sum over it. So what happens? When we run this model, we get estimates. I left out the interaction effects between intention and action and contact, just to make the example a little simpler. But look at the first parameter in this table, bE. That's the maximum education effect in the data set. Notice that it's negative, right? So individuals who completed a graduate degree morally disapprove more. Education makes you more judgy, right? People are less permissive the more educated they get, in this sample.
Now, it's not true of every individual who's more educated, but on average — this is picking up the average effect. Notice also that the magnitude of the maximum education effect is less than half the size of the effect of changing the story. So the treatment effects are really much bigger than the maximum demographic effect of education, on average. Yeah? So the educational differences aren't as important a part of the story, but they are noise, and you end up refining the other effects as a consequence of modeling them. Then, in this giant pairs plot on the rest of the slide, I'm showing you the deltas. On the left they're hard to interpret — they're different amounts — but I want you to see that some are bigger than others. In particular, the main action is on SCol here — what is that? — some college. Lots of individuals have some college, and some college tells you nothing: there's no effect on your judgments of having completed some college. It's like a ghost educational level — it doesn't do any work in this dataset. So this makes sense. You can see how it's piled up on the left: some college is the small one at 0.05, and it could be as small as 0.01. It's a smaller effect than all the others, for sure. You see that in the pairs plot too: if you look vertically from SCol, it's bunched up against 0. Yeah, not much evidence that having completed some college has any independent incremental effect on your judgments. And that explains — this is the last thing I'll tell you before I let you go — what happens if you run the model with education as a metric variable. At the top I'm repeating the model with the ordered-category version of education; in the middle of the slide I show you an ordered logistic model where we put in education as a metric variable, just a normalized metric variable. Normalized meaning I rescaled it so that the lowest level of education is 0 and the highest level of education is 1.
That means the coefficient in front of it can be interpreted the same way as beta sub E from the previous model, right? It's the maximum effect of having completed all the levels of education. We run it as an ordinary regression term now, and now you get a much smaller estimate of the effect, and it overlaps zero, whereas the previous one didn't. Why? Because of the some-college effect. The some-college effect means this is not a metric variable, and if you treat it as one, the linear treatment gets dampened out, right? The correlation is lower over all the aggregate values. Cool? Yeah? Exciting? No? I love this — this is so exciting. Okay. Onward to the homework. I've had a very busy week — I was in Berlin for two days, for example — so I have not done your homework problems. I've written them up, but I have not done them myself, and I have learned from past me that it is very dangerous to assign homework that I have not done myself. So, you're welcome: I'm going to do that homework later today, and then I'll put it up online, possibly this weekend, since I know you're eager for it. Next week, on Monday, we'll start fresh and excited, finally doing multilevel models. Have a good weekend — I'll see you then.