All right, welcome everybody. This is the last week of lectures before Christmas break. We'll resume in the new year, so what we get done this week will excite you for the material to come in the new year, I think. After a period of philosophical lectures, we're going to return to practical skills and learn new model types and some new ways of plotting predictions from models. To remind you where we've been and where we're going: in the first part of the course, you got an introduction to Bayesian inference. Then you acquired a bunch of skills for fitting simple Bayesian models. They were all linear regressions, which is, as I keep saying, more than 95% of all applied statistics. Always wrong, rarely bested. And remember the geocentric model? It's hard to beat it, but it's always wrong. That's just how it is. And this week, we're going to try to beat it. I'll show you that it can be beat when you use the available information in a savvy way. What we're moving into now is to fully embrace the tide prediction engine metaphor that I introduced in an earlier lecture. To remind you: statistical models are machines that process information for us. The parameters are states of the machine, but they're not the predictions that we're interested in. The states of the machine have to be adjusted so that the machine makes good predictions. But reading the states of the machine is not actually what you want to do. What you want to do is read the predictions of the machine, to understand what the machine is going to do and how it behaves. There's this temptation to gaze deeply at the states of the machine, that is, at the parameters, as interpretation. And in the simplest models, you can do that and get away with it.
In the simplest linear regressions with no interaction effects, it is typically safe to gaze at the gears of the tide prediction engine, which would be the bottom row shown here, and figure out what the tides will be. Because the machine is so simple and you're an expert, you can do it. That time is over, as of today. It was actually over last week with interactions: it was already the case that you couldn't stare at the parameter values and make any sense out of the model. Now it's really over, because we're going to move into generalized linear models. Our goal will be to connect a linear model to some outcome variable which is not plausibly normally distributed. What do I mean by that? I mean we have additional information about the constraints on the variable. Before we've seen the actual values of the data, just what we know about the measurement itself means that it will not be Gaussian. So if we use that information, we can do better. You can get more information out of the data about the underlying process of interest. I quip here that it would be better to ditch the linear model too, by which I mean, if you had a real theory, it's unlikely it would look like a linear regression model. So that should be your goal. Every time you write down a linear model, I think we should be embarrassed. But it's OK to be embarrassed in public in science. That's basically what a scientific career is: a long series of public embarrassments. And that's OK. It really is. But the linear model is a placeholder for a real theory. It's just an engine for measuring partial associations. And a lot of good work can be done that way. But it's not a substantive theory. So I want to get that in here. The generalized linear model goal is to hook up a linear model to some other kind of distribution. But this is just a stopping point on the way to a really good statistical model of a phenomenon, which gets rid of the linear model as well.
We're going to model multivariate relationships and non-linear responses with these sorts of models. And these models are the building blocks of multilevel models. We make multilevel models by hitching together multiple GLMs, actually embedding them in one another. That's how multilevel models, hierarchical models, random effects models grow out of this strategy. Measurement error models, and lots of the other things we'll do in the new year, will come from hitching generalized linear models to one another in strategic ways. Here's our strategy, and I'm going to review these three points on the slides to follow. Our first step is to pick an outcome distribution. Then we model its parameters using what are called links to the linear models. And then finally, as you might expect, what is always our goal in Bayesian inference? It's to compute the posterior. There is one kind of estimator in Bayesian inference, one to rule them all: the posterior distribution. Then you process the posterior distribution in lots of different ways to make decisions, but the posterior distribution is a stopping point on the way to that. Step one: pick an outcome distribution. The point of the lecture on Friday was to explain to you that there are principled ways to pick outcome distributions. You don't have to use maximum entropy, but I strongly encourage it because it's a highly conservative, risk-averse way to pick an outcome distribution. It spreads probability as flatly as possible, given the information constraints you believe you know about the variable. That strategy, which is what chapter nine in the textbook is about, leads to all of the conventional choices in generalized linear modeling. But it's not the conventional way to do it.
It's kind of the physics way to do it, but there are lots of other ad hoc ways of picking outcome distributions that produce the same choices somehow, because they turned out to be effective. I think that's a wonderful thing about applied mathematics: lots of different sets of axioms and points of departure lead to the same answer quite often. Which gives me pause about thinking that any one of the perspectives is the right one, but other people read it differently than I do. So: if you're modeling distances and durations, these are outcome variables which are positive real values. They're displacements from some point of reference, so you can think of them as zero or larger. That's what durations and distances are. So when I ask what the constraints on a distance or duration are: if you told me you measured distances or durations, I know that all the values must be positive reals. They have to be. Even before you've seen the data, that's when you make this choice. You don't want to look at the values and say, ah, that looks like a something distribution. That's cheating, right? That's a recipe to overfit and do something very bad. I call that histomancy in the book. There's a snide rethinking box about this. There are a bunch of snide rethinking boxes in the book, and one of them is about histomancy, which I call the dark art of picking outcome distributions by gazing at the histogram of your outcome variable. That is a very, very bad idea. But it's taught, unfortunately. You've probably been taught it. You want to use information independent of the actual values of the data. That's why you don't want to use the histogram. You want to use the constraints on the variable given to you by the way you measure it to determine what a maximum entropy distribution would be. So with distances and durations, if all you know about a variable is that it has some positive average displacement, the maximum entropy distribution is the exponential.
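Stated compactly, that claim is the standard maximum-entropy result (chapter nine sketches the proof):

```latex
% Among all densities p(y) on [0, \infty) with a fixed mean \mu,
% the entropy
%   H[p] = -\int_0^\infty p(y)\,\log p(y)\,dy
% is maximized by the exponential density:
p(y) \;=\; \frac{1}{\mu}\, e^{-y/\mu}, \qquad y \ge 0, \qquad \mathbb{E}[y] = \mu .
```

So the only information the exponential encodes is "positive, with some average displacement"; every other positive distribution with that mean smuggles in additional assumptions.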
I talk about this in the book. The exponential distribution is, in some sense, the foundational distribution of this big family of statistical modeling distributions called, you guessed it, the exponential family. You can build up all the others starting with the exponential, and there's a figure in chapter nine about this. You get the gamma by adding exponentials together. The gamma is also constrained to be positive real, but it carries one more piece of information than the exponential's average displacement. The gamma is also a survival distribution. Lots of processes are gamma distributed in nature. For example, age of onset of disease is often gamma distributed. It's the distribution you get if a bunch of things have to break before something happens. Counts are what we're gonna focus on today. The count distributions that are the workhorses in modeling are really all versions of the multinomial distribution. We're gonna work with the binomial and the Poisson distribution, which are special cases of the multinomial. And in the book, I also talk about the geometric distribution as well. These are also maximum entropy distributions. When does the binomial arise? Well, I'll tell you that when we get to that example. Then on Friday, we're gonna talk about models that I call monsters. There are kinds of measures which by their nature are very inconvenient to model. But nevertheless, when you think about the way you measure them, you can construct a generalized linear model of them. Rank data is like this. Rank data is when you take some quantitative measure and then you just rank the values, so you just know the ordinal position of them. Data like that is a nightmare to work with. It really is awful, but it can be done. What you want to do is just not rank things, to be honest. Just go back to the original data. Don't rank. Ranking loses information.
But sometimes, you know, someone gives you rank data and tells you to analyze it. That's all you've got. Ordered categories are what we're gonna look at on Friday, because it's extremely common, at least in psychology, to work with things called Likert scales. I never know how to pronounce that name, Lick-ert or Like-ert. Do you know what I'm talking about? Thank you. So with Likert scales, you don't have a more fundamental measure. What you elicit from the participants is some rating from one to seven, conventionally. And then you've got to model the distribution of those responses. Lots of important things are studied this way. But that variable is distributed in a highly inconvenient fashion. We will build up a model of that that works much, much better than linear regression, which is unfortunately the convention in psychology, I think, for analyzing them. And then mixtures. I'll give you a taste of mixtures on Friday too. These are cases where you've got a single set of measures, but multiple processes produce the lump of numbers. This'll make more sense when I give you examples on Friday. It's a mixture because the histogram, if you will, of values is a mixture of different processes, and you don't know which of the values is produced by which process. But you can nevertheless still model the aggregate distribution. This happens a lot in science, and I'm gonna give you an example on Friday. Okay. The second step is link functions. This is actually the easiest thing, but it's a little bit weird. In the linear regressions we've done so far, the strategy has been to take the parameter for the mean, mu, and hitch it up to our linear model. And this is easy because the outcome variable, y here, and the parameter for the mean, mu, have the same units. Whatever the measurement scale is on y, mu has the same measurement scale, right?
So if y is centimeters of height, then mu is also centimeters of height, yeah? They're on the same measurement scale. This is the only model type like this. There is almost nothing else in statistics that is benign in this way. But don't panic, there's no problem, right? What we're gonna work with today are what are called binomial models, like the globe tossing model, the first model you met in this course, back when I was still calling models golems. Remember that time? A halcyon age at the beginning of the course, when the homework's easy, right? We're going back to that model today. And the thing is, y is now a count, and actually I have some animation for that. Y is a count, and what is p? It's a probability; it has no units. Probabilities are dimensionless, right? Because, if you construct them from frequencies, the units divide out. They have no units on them. So now we need some way to connect these things. The units on our parameter can't be counts of whatever it is we observe; it has to be dimensionless. And so we need something called a link function. That question mark there between p and our linear model is what happens to connect these things. What is the hitch that is gonna make this work? And the very effective approach in statistics is to use something called a link function. Here I just anonymize it, call it f for function. You're used to this, right? And we have to choose it so that it scales things right, so that the unconstrained space of the linear model gets put on the proper space for that parameter. In this case, the proper space for the parameter is zero to one, because that is the proper domain for a probability, right? The average probability of this binomial process, of each trial producing a count, has to be bounded between zero and one.
So as I show you on this graph here, you have some x variable on the horizontal axis, it gets plugged into the linear model, and then some inverse function applied to the linear model has to produce only valid values inside the proper domain. We have to choose this function f so that that's true. And all this hard work has been done for you; there are very good choices for these things. So this is called the link function, and I'll show you how to use it when we get to the context of the model. All right, third and final step, the part that your computer does for you: you do all the hard work of programming this model and then it will happily sample from the posterior distribution for you, right? The thing about generalized linear models, though, is that the search through the posterior distribution is typically harder. So if you're using some optimization approach like MAP estimation, more things can go wrong, and you have to be a little bit more careful. The interpretation of these models is harder, because now the relationship between the gears in the tide prediction engine and the predictions of the tides is not as transparent. But that's okay, you've already learned how to force predictions out of your models. I made you do that back when it wasn't necessary. You're welcome, and now it's necessary. So you're already skilled at this. The link functions matter, so we're going to think about the impact those have on inference. And the quadratic approximation that you were using up to this point in the course often works with GLMs, but not always. In fact, sometimes it's fantastically bad. I'm going to give you an example, maybe at the end of today's lecture if I get to it, but if not, at the start of Friday. So it's really safest to rely on MCMC. You can build up steam, as it were, using MAP estimation, but for the estimates you publish, I would encourage you to use Markov chains, just to be sure.
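To make that question mark concrete: the inverse of the link function takes the unconstrained value of the linear model and squashes it into the parameter's proper domain. Here's a minimal sketch in Python (the course's own code is in R; this is just to show the idea), using the inverse logit we'll adopt today:

```python
import math

def inv_logit(z):
    """Inverse of the logit link: maps any real number into (0, 1).

    Written in two branches so that very large |z| doesn't overflow exp().
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

# Whatever value the linear model produces, the result is a valid probability:
for linear_model_value in (-20.0, -1.0, 0.0, 1.0, 20.0):
    p = inv_logit(linear_model_value)
    assert 0.0 < p < 1.0

# And the mapping is monotone, so ordering on the linear scale is preserved:
assert inv_logit(-1.0) < inv_logit(0.0) < inv_logit(1.0)
```

Any smooth, monotone function from the real line onto the interval (0, 1) could play this role; the logit is the one that falls out of the binomial's own mathematics, as the lecture discusses.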
Because there's no guarantee that the posterior distribution looks anything like a multi-dimensional Gaussian. It could look like anything, even if all your priors are Gaussian. It could be all over the place. Okay, last thing: I want to say two things about generalized linear models, all of them, that are different from the models up to this point in the course. You have to keep these in mind, and you'll see examples in the applied work that you do in the course. The first thing is that there are ceiling and floor effects on the outcomes. For the binomial, there are both; for unconstrained counts, like Poisson counts, there are still floor effects. All of these have some boundary they can't go below or above on the outcome space, unlike a Gaussian, which is in principle infinite in both directions. In practice, of course, it isn't, right? We modeled heights as Gaussian, and heights don't go below zero, right? But we didn't have anybody in our sample near that boundary, so we didn't worry about it. In principle, height is not Gaussian either if you get near zero. If you're one cell, you don't get shorter than one cell. Yeah. And the thing about ceiling and floor effects is that they change the way parameters operate depending upon where you are in the outcome space. That is, a change in a predictor doesn't induce the same constant change on the outcome scale. All of the predictors matter at the same time, and they affect one another. So there's not a constant impact. And one way to think about this: it's not a bad thing, it's necessary. You want models to do this, because that's the way the world works. So think about it this way. Say you had salamanders. I used to work at a university where lots of people studied salamanders, which is why there are examples like this in my course. Not so many salamanders in Germany yet.
We should import them, but there's probably already a government agency ready to stop you for saying that. So say we're modeling the probability salamanders survive given different spring temperatures. The thing about survival is that if it's really, really cold, then there may be lots of other threats to the salamander which don't matter, because it's gonna die anyway, right? But those threats would be incredibly relevant in the middle temperature range, where it's a coin flip whether it survives or not. There, things like whether it gets enough food or whether predators are around are great predictors, and things you wanna control for. So the impact of some other variable, a mediating variable of survival like the presence of predators, will depend upon, will interact with, the temperature. Because you can't die twice, right? That's the basic puzzle. And that's the way biology works. So you want a model that respects this fact about how things work. And the same is true on the other end: eventually it gets hot enough that it doesn't matter if there are predators, you're going to burst into flames, or dehydrate, as is the case with salamanders, right? Does this make sense? So this is not a bad thing about these models. It's not a reason to fall back on linear regression; it's a reason to leave linear regression behind. Because this is the way the world actually works. The way I talk about this is that everything interacts in generalized linear models, and everything interacts in the world, if you push your system far enough towards a floor or a ceiling. Because you can only die once, right? So to think about this mathematically: with linear regression, there's this wonderful and benign fact that if you want to know the rate of change in the mean mu for a unit change in x, you take the partial derivative of mu with respect to x, and it's beta, right? The beta coefficient is the partial derivative of mu with respect to x. Wonderful, isn't it?
Absolutely fantastic. In logistic regression, which you'll meet today, it is not like that at all. In logistic regression, and I'll teach you this in the upcoming slides, the mathematical expression we're going to use for p, the probability of a success on any given trial, is this thing, which is the logistic model. You'll notice there's a linear model in here, and it appears twice in the formula. Yeah, good times. But this is the most rational way to set up these models, actually. This is maximum entropy, in fact. So this is a great way to set it up. And if you take the partial derivative of this thing with respect to x, which I leave as an exercise for the student, or Wolfram Alpha if you prefer, you get this thing. And what is that thing? Well, it's actually related to a well-known hyperbolic function that appears in the real world in physics; I think it's the shape of a suspended cable. But that's not what you need to know about it. What you need to know is that the linear model is still in there. It doesn't go away. So no matter what that linear model is, the whole thing will appear in the rate of change of p with respect to a change in any of the predictors. And the reason is the ceiling and floor effects. Where you are on the temperature range, whether it's really hot or really cold, affects the marginal impact of a change in any of the other predictors. Right, you can only die once. Yeah, so it makes sense? Okay, so how do you solve that problem? Well, you plot predictions at different base rates, as it were, and I'll have some examples of that in the worked examples. Okay, we're gonna work with the binomial distribution. If you're gonna learn one type of regression in addition to Gaussian regression, it should be binomial regression. This is an extremely common sort of data, count data.
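In symbols: with p equal to the logistic of alpha plus beta x, the derivative with respect to x works out to beta times p times (1 minus p). The beta is still there, but it's multiplied by p(1 − p), which collapses toward zero near the floor and the ceiling. A quick numerical check in Python (the alpha and beta values here are arbitrary, just for illustration):

```python
import math

def inv_logit(z):
    """Logistic function: converts log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

alpha, beta = 0.5, 1.2  # hypothetical parameter values, not from any real fit

def p(x):
    return inv_logit(alpha + beta * x)

def slope_analytic(x):
    # dp/dx = beta * p * (1 - p): the "constant" effect beta is modulated
    # by how close p sits to its floor (0) or ceiling (1)
    return beta * p(x) * (1.0 - p(x))

def slope_numeric(x, h=1e-6):
    # central finite difference, to confirm the analytic formula
    return (p(x + h) - p(x - h)) / (2.0 * h)

# The analytic and numeric slopes agree across the range:
for x in (-4.0, 0.0, 4.0):
    assert abs(slope_analytic(x) - slope_numeric(x)) < 1e-6

# Near the ceiling, the same beta buys you almost nothing:
assert slope_analytic(8.0) < 0.01 * slope_analytic(0.0)
```

That last assertion is the ceiling effect in miniature: a unit change in x shifts the probability a lot in the middle of the range and almost not at all once p is pressed against one.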
Anytime there is a fixed number of trials n, and a number of discrete outcomes can happen in each of those trials, and you count up one of those kinds of outcomes, we call them successes. In our globe tossing, it was water: how many waters did we count? Then the data are really well modeled as binomial. I say really well modeled because there could be correlations between trials, but if you don't have any information about those correlations, the maximum entropy distribution is the binomial, because the binomial is the only distribution that actually makes no assumptions about the correlations. Anything else will put information about correlations in there that you don't actually have. So that's why we use the binomial. The reason it's the right choice is that it's the most conservative distribution you could use, given the pre-data constraints, the pre-observation constraints, on the outcome variable. Remember, in chapter 9 there's a box where I talk about the maximum entropy nature of the binomial. Okay, so when we write these models down, just like in the conventions we've been using so far, the outcome variable here will be a count, often called the successes. But if it's deaths, then maybe you don't want to call them successes. It could be a mortality count. So some sensitivity, as always, is required in describing your model. The number of trials n is the maximum value of y that you could ever observe. It's the number of trials, the number of individuals, the number of salamanders exposed to threat, right? Things like that. n could be the population size of salamanders, and y the mortality count, something like that. And then p is the probability that any of the individual trials is a quote-unquote success. Yeah. So an interesting thing about the binomial is that, like all of the distributions other than the Gaussian, the mean and the variance scale together.
If you increase the average, the expectation of y, you also change its variance. It doesn't simply increase: as p grows, the variance goes up first and then comes down again, because of the nature of the binomial. But they vary together. And these are the formulas for them, which you don't need to memorize. I just put them up there to show you that they contain the same things. So if you change n, they both change. If you change p, they both change. This was not true of the Gaussian. In the Gaussian, we had two different parameters for the mean and the variance. They had nothing to do with one another, so you could ignore the variance, essentially. But that won't be true with any other model type. The variance scales with the mean. And again, that's good, because that's how your data behave as well. Counts have to behave this way. Absolutely have to. So if you use a model that doesn't behave this way, you're risking unnecessary error. Okay, we need a link function. I've already hinted at what it's gonna be: we're gonna do logistic regression. So y and p are on different scales. This is just to summarize what I told you before: y is a count, p is a probability. We need to model p as a function of our predictor variables, and our goal is to bound p in the zero-one interval. So graphically, you can think about it this way. We've got some predictor variable x. It's on the real number line, or whatever measurement scale it's been on. What we're gonna do is project it, and the linear model is gonna live on a space that we call the log-odds space. That's the y-axis on this graph. Log-odds space is also continuous, centered on zero, and it goes infinitely down and infinitely up. So our linear model is linear in the log-odds space. What are log odds? The odds are the probability something happens over the probability it doesn't happen. Those are the odds. And the log odds are the logarithm of that. So why the log odds?
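For reference, the formulas on the slide are mean = n p and variance = n p (1 − p). A quick illustration in Python (the course itself uses R; this is just arithmetic) that they share the same ingredients, and that the variance rises and then falls as p sweeps from the floor to the ceiling:

```python
def binom_mean(n, p):
    """Expected count for a binomial with n trials, success probability p."""
    return n * p

def binom_var(n, p):
    """Variance of that count: built from the same n and p as the mean."""
    return n * p * (1.0 - p)

n = 10
assert binom_mean(n, 0.5) == 5.0 and binom_var(n, 0.5) == 2.5

# Sweep p from floor to ceiling: the variance goes up, peaks at p = 0.5,
# then comes back down. There's no separate variance parameter to tune.
for p, expected_var in ((0.1, 0.9), (0.3, 2.1), (0.5, 2.5), (0.7, 2.1), (0.9, 0.9)):
    assert abs(binom_var(n, p) - expected_var) < 1e-9

# Change n and both the mean and the variance move together:
assert binom_mean(20, 0.5) == 10.0 and binom_var(20, 0.5) == 5.0
```

Contrast this with the Gaussian, where mu and sigma are free to move independently; here there is no knob you can turn to change the variance without also changing the mean.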
The reason is that the log odds are the fundamental parameter of a binomial distribution. Again, there's a box in chapter nine that proves this. You don't need to care about the proof; you can just take it as if the angels descended and gave you a stone tablet that said "the log odds." But it actually is proved in that box, so you don't have to take my word for it. Math is nice, unlike science, in that you can prove things. So then we transform this log-odds space onto the outcome space through the inverse function of the log odds. You just invert the function, and this is the logistic function, what's called the logistic transform. And then it rescales. I've drawn these horizontal lines to show you how the even packing of the log-odds space lands on the outcome space: it's very uneven. Extreme values end up very close together on the probability space. Values around zero end up near probability one half, like a coin flip. A log odds of zero is probability one half. And larger values in both directions get you closer to zero in the negative direction or closer to one in the positive direction. But the space gets compacted, so that a log-odds change of one unit has a different effect on probability depending on how far from zero you are. You get used to this very fast when you're working with these models. So think about it this way: this is p, our probability scale, over there, and this is our linear model scale. They're on different measurement scales. And after you do the transformation, the linear model is not linear anymore. That's one reason we call these nonlinear models. Does this make sense? There'd be other reasons too. Okay. So very quickly, and this is covered in the book: what does this imply about p? We're gonna write these models like this at the top: logit of p is equal to alpha plus beta x, or whatever your linear model is. What does that mean about p?
Say we have to calculate p to make predictions. Well, algebra to the rescue: solve for p. Logit of p just means the log of p over one minus p. p over one minus p is the odds, and logit means log odds. It's just a goofy word for log odds. I think there's an endnote in the book where I explain the history of this term, if you're interested and have time to kill. So now we've got an expression, and we just solve for p using your secondary school algebra, right? You remember that stuff? Yeah. And you get this answer, which is the thing I showed you before. This is the logistic function. It also arises in ecology as a fundamental function of how populations grow under constraint. You may have seen it before: bacteria in petri dishes grow logistically. All kinds of things grow logistically. Here, that's not the justification; it arises from this transformation of the spaces. From your perspective as an applied scientist, what you wanna know are some reference values. It's actually very easy to work on the log-odds scale and school your intuitions; you just have to remember some references. A log odds of zero is a probability of a half. That's easy to remember; that's your anchor point. When the linear model has a value of zero, the expected outcome is a half. That's a coin flip, right? As you go up, a log odds of one is a probability of 0.73: about 73% of the trials will be successes now. We've gone from 0.5 to 0.73. By the time you're up to a log odds of three, you're at 95%, 0.95. And four means destiny. Log odds above four are pressed up really tightly against the ceiling, and minus four is pressed up really tightly against the floor. This is an important thing to think about when working with models like this, especially if you want to establish priors.
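Those reference values are easy to verify with the logistic function itself, here in Python for illustration:

```python
import math

def inv_logit(z):
    """Logistic function: converts log-odds back to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# The reference values worth memorizing:
assert inv_logit(0.0) == 0.5              # log odds 0  -> coin flip
assert round(inv_logit(1.0), 2) == 0.73   # log odds 1  -> about 0.73
assert round(inv_logit(3.0), 2) == 0.95   # log odds 3  -> about 0.95
assert round(inv_logit(4.0), 2) == 0.98   # log odds 4  -> pressed against the ceiling
assert round(inv_logit(-4.0), 2) == 0.02  # log odds -4 -> pressed against the floor
```

Notice how the step from 0 to 1 moves the probability by 0.23, while the step from 3 to 4 moves it by only about 0.03: the same one-unit change on the log-odds scale, compacted near the boundary.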
So you think about what's a reasonable effect size, before you see the data. A log odds of eight is unreasonable, right? That means always, always, always. And a log odds of minus eight means never, never, never. So a uniform prior on a logistic regression is madness, absolute madness. It'll put all of the probability mass outside any reasonable effect size, yep? This is very important, and I'll show you some examples in later lectures of the kind of madness that uniform priors create in a logistic regression. Okay, so this is our logit link. It looks like that. Where does this thing come from? I've already told you it's the natural link in a sense: it is actually present in the mathematics of the binomial distribution. It's the fundamental parameter of the binomial distribution, and you can look at the overthinking box on pages 279 to 280. You'll sometimes see other links for binomial models. These are less common in general, even though the logit link was the last of these three to be invented. The other two common ones are the probit and the complementary log-log. I think the complementary log-log was first. This was Fisher in the 20s, who used it in a toxicology experiment, I think. I'm hazy on this; this was graduate education for me, which was last century, so I've forgotten these things. But it's still used in toxicology fairly often, I think for historical reasons. And then the probit model: the probit is the cumulative normal. The logit is named after the probit; it rhymes with it. The logit came last, but it has a lot of mathematical advantages compared to these two, and so it's much more common in applied statistics. The probit is still common in economics, again for historical reasons, I think. Okay, let's actually do some data analysis now, which is why you're actually taking this class. I'm gonna use the most famous binomial regression that I know of.
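You can quantify that madness directly. Take a flat prior on the log-odds scale, say uniform on (−8, 8) (my illustrative choice of interval), push an even grid of its values through the logistic, and see how much of the prior implies near-certain outcomes. A sketch in Python:

```python
import math

def inv_logit(z):
    """Logistic function: log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# An even grid standing in for a "flat" prior on the log-odds scale...
grid = [-8.0 + 16.0 * i / 10000 for i in range(10001)]
probs = [inv_logit(z) for z in grid]

# ...and the fraction of it implying nearly deterministic outcomes:
extreme = sum(1 for p in probs if p < 0.05 or p > 0.95) / len(probs)
assert extreme > 0.6  # well over half the "flat" prior says almost-never or almost-always
```

So a prior that looks innocently uninformative on the linear-model scale is, on the outcome scale, a strong statement that the event almost always or almost never happens.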
This is a very common teaching example, because it's both real and it illustrates many important things about applied statistics. These data are built into the rethinking package, of course, called UCBadmit. These are data from 1973 graduate program applications and acceptance rates at six different programs at UC Berkeley. These were the largest departments in 1973. The largest departments now are different, right? But in 1973, these were them. The departments have been anonymized to protect the innocent, but you will probably be able to guess, in some vague sense, which programs are which. I've just called them A through whatever the sixth letter of the alphabet is. And the reason this is a famous historical example is that one particular dean, whose name shall also be held anonymous to protect the innocent, was worried. This isn't a bad story: the dean was worried that they might get sued for gender discrimination, because in the 70s in the US, institutions were getting sued for gender discrimination. It was a new time, a brave new world. Institutions were worried about this, and there was an honest dean who said, okay, we should look closely at our processes and find out. So, to the rescue, he called the stats department. It's probably the first and last time a dean has ever called the stats department for help. No, I joke, I don't know. When I was at Davis, the dean called me for help with statistical analyses, so maybe deans do this a lot. So we're gonna look at these data and retrace the steps of the statisticians in trying to figure out whether there was evidence of gender discrimination in graduate admissions in 1973 at UC Berkeley. Here's what the data look like. This is all the data, right here on your screen. There are 12 rows, two rows for each department. Department A is the first two rows. Then we have an applicant gender column. We've grouped together all of the male applications on the first row.
There were 825, if you look on the far right. 512 of them were admitted, 313 were rejected. The second row is the female applications to the same department. There were 108 applications, 89 admitted, 19 rejected. And so on down through them all. These are counts. These are not Gaussian-distributed variables. Our goal is to model something you can't see in this table, and that is the probability of admission, conditional on the gender of the applicant. Does that make sense? This is a binomial regression. It's the ur-binomial regression; it exemplifies the whole method. All right, here's what it looks like. This is our model. The mathematical version is in the lower right. The number admitted on each row i is distributed binomially with n sub i trials. And I show you in the code what that means: it's the number of applications. Each application is a trial, and there's some process which admits or rejects it. It's a committee. At least it used to be a committee; now it may be a computer. I actually don't know how these things work now at Berkeley. And then there's a probability p on each row i that each trial is admitted. We're going to model that with a logit link to a linear model: some intercept alpha, and then a coefficient beta m times m sub i, where m sub i is whether the application is male or not. A dummy variable, a zero-one dummy variable. With me? Okay. And then some priors, which I will talk about on Friday. I will loop back to this and talk about simulating prior predictions from this model, but I want to keep it as simple as possible today. Okay, so we're going to fit this. The code makes sense? Yeah. We're using map, but the map2stan code looks the same, except you change map to map2stan at the top. And it works the same way. I encourage you to run it both ways and compare.
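To make the model structure concrete, here is a pure-Python sketch of the likelihood this binomial regression works with, using only the department A counts quoted above. The parameter values passed in at the end are illustrative (near the per-row empirical rates), not the actual MAP estimates from the lecture:

```python
import math

def inv_logit(x):
    # Logistic inverse link: log-odds -> probability.
    return 1 / (1 + math.exp(-x))

def binom_loglik(admit, n, p):
    # Log of the Binomial(admit | n, p) probability, via log-gamma.
    return (math.lgamma(n + 1) - math.lgamma(admit + 1) - math.lgamma(n - admit + 1)
            + admit * math.log(p) + (n - admit) * math.log(1 - p))

# Department A counts quoted above: (admitted, applications, male dummy m)
rows = [(512, 825, 1), (89, 108, 0)]

def log_likelihood(alpha, beta_m):
    # logit(p_i) = alpha + beta_m * m_i, summed over rows.
    return sum(binom_loglik(a, n, inv_logit(alpha + beta_m * m))
               for a, n, m in rows)

# Values near the empirical admission rates fit better than alpha = beta_m = 0:
print(log_likelihood(1.54, -1.05) > log_likelihood(0.0, 0.0))   # -> True
```

Fitting the model just means searching for the alpha and beta m that make this (plus the log priors) as large as possible, which is what map does by optimization and map2stan does by sampling.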
Look at the Markov chain samples compared to the map estimates and think about it. In this case, they're very, very similar. That's not always true, but in this case, they are. We're going to fit two models. We're going to fit a model that has the male dummy variable; that's the top one, model 10.6. And then we're going to fit one that just has the intercept, model 10.7, where alpha is the average log odds of admission across departments. So if it's zero, then it's a coin flip whether an individual gets admitted. Okay. What happens? Here's the model comparison using WAIC. For model 10.6, which, remember, is the model that has the male dummy variable, there's a lot of evidence it's better. It gets all of the Akaike weight. The difference in WAIC is about 90 units, which is pretty big. And the standard error of the difference is big too, about 20. So there's a lot of uncertainty about exactly how good our predictions would be on the next class of applications, next year's applications. That's the way you want to think about it. But nevertheless, there's a lot of evidence that having the male dummy variable in the model improves predictions. You can see that maybe better on the graph, how far apart they are. Okay, let's look at the predictions. That's important too. Well, okay, first: proportional change in odds. Let's think about interpreting these parameter estimates. And remember the tide prediction engine. We're going to stick with this issue for a few slides now; I think there's some added value here. It's very common in binomial models with logit links to interpret the coefficients, the slope parameters like beta m here, by exponentiating them. When you do this, you take them off the log scale and they're on the odds scale again. This is called the proportional odds interpretation. And it's not silly.
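The Akaike weights mentioned here come from a simple transformation of the WAIC differences. A hedged sketch, using the roughly 90-unit gap quoted above rather than a real fit:

```python
import math

# Akaike weights from WAIC differences (smaller WAIC = better expected prediction).
# Only differences matter, so set the better model to 0.
dwaic = {"m10.6": 0.0, "m10.7": 90.0}
rel = {m: math.exp(-0.5 * d) for m, d in dwaic.items()}
total = sum(rel.values())
weights = {m: r / total for m, r in rel.items()}
print(round(weights["m10.6"], 6))   # -> 1.0: essentially all the weight
```

A 90-unit gap translates to a relative weight of exp(-45) for the losing model, which is why m10.6 carries effectively all the weight despite the 20-unit standard error.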
But I want to talk you through this to help you understand that there are pitfalls in inference from it. You're gazing at the gears of the tide prediction engine, and there are still ceiling and floor effects, so you're not getting an unbiased view of what's going on in the actual system. In this case, the map value of bm is 0.61, and its 89% interval is 0.51 to 0.71. If we exponentiate 0.61, we get 1.84. That is the proportional odds. That means: whatever the odds of a female application being admitted are, a male application has 1.84 times those odds. I'll say that again. Whatever the odds of a female application getting admitted to a graduate program are, a male application has 1.84 times those odds. Or 184% of the odds of being admitted, which is higher. In 1973, at UC Berkeley, across all these departments. Yes? The thing about this estimate is that it's a relative estimate, what's called a relative effect size. And bear with me for a few slides now as I go through my rants about relative effect sizes. Relative effect sizes can sound a lot bigger than they are on the outcome scale. When you say 184%, that sounds large, so that sounds bad; it sounds like a policy intervention is required. That's the first thing to say. But how much of an advantage is it really? You can't tell, because of the ceiling and floor effects. Where alpha is will affect the actual impact of that beta coefficient. If almost everybody's getting admitted anyway, then it doesn't matter, because you're already against the ceiling. If almost no one's getting admitted, it doesn't matter very much either, because hardly any applications get through anyway, and so it's not a big advantage.
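Here is the exponentiation, plus a small demonstration of why the same odds ratio means different things at different baselines, the ceiling and floor effects just described. The baseline probabilities in the loop are hypothetical:

```python
import math

beta_m = 0.61                   # MAP slope quoted above
odds_ratio = math.exp(beta_m)   # relative (proportional-odds) effect
print(round(odds_ratio, 2))     # -> 1.84

# The same odds ratio implies very different probability changes,
# depending on where the baseline (alpha) puts you:
for base_p in (0.01, 0.30, 0.99):
    base_odds = base_p / (1 - base_p)
    new_odds = odds_ratio * base_odds
    print(base_p, "->", round(new_odds / (1 + new_odds), 3))
```

Near the floor (0.01) or the ceiling (0.99), multiplying the odds by 1.84 barely moves the probability; in the middle, it moves it a lot.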
But if alpha is in the middle, so that lots of applications are accepted and lots are rejected, then it can be a huge advantage to be male. Does that make sense? Okay, so that's the distinction between, or at least what I call the distinction between, relative and absolute effects. And I'll tell you the parable of relative shark and absolute deer to help you remember this. So think about relative shark and absolute deer. I'll explain these animals in a moment. Relative shark is the slide we just did. Relative shark exponentiates the beta coefficients inside the linear model of a logistic regression and then publishes that number, saying this is the effect. And then it's really hard for you to interpret what's going on, because of the ceiling and floor effects. Relative effects tend to exaggerate the importance of a predictor. In this case, I don't think they do; I'll show you the absolute effect in a moment, and you can judge for yourself. But there are famous cases in applied statistics where quoting the relative risk greatly exaggerates the absolute risk. This is really good for scaring people in the public health literature. If you have a really rare disease, you're pushed up against the floor, and the relative risk can be huge. It could be 300%: you could exponentiate the coefficient and get a value of three, so you have three times the proportional odds of contracting the disease. But if almost nobody in the population gets the disease, it still doesn't matter much to you; you should still conduct your day the same way. This is like the risk of being the victim of a terrorist attack in a country like this one. That's the last thing you should be worrying about. You should be worrying about cars. Cars, the number one killer, cars. That's what you should be worrying about. Sorry, but this is a common problem.
So we worry about this, and that's why you want to calculate both the relative effect, relative shark, and the absolute effect, absolute deer. We need both of them. We still need relative shark because, for causal inference and prediction outside the sample, you actually want the coefficient. You need to plug that in, and then you can change the other parameters too and make inferences outside of context. Where does absolute deer come from? Deer are like cars, but they're animals, in the sense that deer kill many more people per year, at least in the high-latitude places where deer live, than sharks do. Why? Because there are a lot more deer, and they're terrestrial, like people. So there's an exposure effect going on here, and deer are actually much more dangerous. If you're going to be worried about death from an animal, you should worry a lot more about deer than sharks. And this is why I say the deer is the absolute threat. The shark is a relative threat. What do I mean? If you're in the water, sharks are a lot more dangerous than deer. Yeah? I worked really hard on this metaphor, and I hope you enjoy it. I was trying to help you remember this distinction, and it's important. So remember this slide: absolute deer, relative shark. Sharks kill about five people annually. I should have gone with absolute hippo, maybe, but I thought that was a stretch and a bit too far. Hippos are really dangerous. Having worked in Africa myself, I can tell you that they are not joking around. Okay, risk communication. There's a famous example of this problem where, if all you report is the relative risk, you can wildly mislead people about the absolute risk. You can always rely on the Daily Mail to do things wrong. You know this newspaper. The Daily Mail did a lot of damage by reporting the relative risk from a particular study in Great Britain.
It turns out that one in a thousand women who are not on conventional birth control pills, the standard kind, two kinds of hormones, taken monthly, will develop blood clots that are potentially fatal. Typically they happen in the lower leg; you may have heard about these things. And three out of a thousand women on birth control develop these blood clots. These blood clots are extremely rare in the general adult population, but they're a lot more common if you're on birth control. And there are reasons for this, actually; the link here is causal in this case. So the relative risk is a 200% increase in blood clots. That's what the Daily Mail reported. It was on the news, all over the newsstands. Lots of women stopped taking birth control. And then they got pregnant, and it turns out that that's a lot more dangerous. Now, some of you are thinking, so I shouldn't say things like that. You should have children; they're wonderful. My son is the best thing that ever happened to me. But I didn't have to gestate, so it's easy for me to say. Jokes about pregnancy aside, pregnancy is medically much more dangerous than being on birth control. And so there was a much greater increase in mortality as a consequence of people going off birth control than staying on it, despite the 200% increase in risk of these blood clots. The relative risk misleads because, on the absolute scale, the change in probability between the two groups is 0.002: you're looking at an increase of two women out of 1,000. That's the absolute effect, that's absolute deer. Relative shark is 200%; absolute deer is 0.002. This is among the last things you should worry about. You should be worrying about fish and chips and what they do to your heart, in this population. Okay, that aside, that's a real historical example.
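The arithmetic of the two scales, as a minimal sketch:

```python
baseline = 1 / 1000    # clot risk for women not on the pill
on_pill = 3 / 1000     # clot risk for women on the pill

relative_increase = (on_pill - baseline) / baseline   # the headline number
absolute_increase = on_pill - baseline                # the decision-relevant number

print(f"{relative_increase:.0%}")     # -> 200%  (relative shark)
print(round(absolute_increase, 3))    # -> 0.002 (absolute deer)
```

Both numbers describe exactly the same data; only the absolute one tells you what the risk means for any individual.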
So how do we get absolute predictions in this example, in the UC Berkeley data? You just push the predictions out of the model, just like you've been doing so far. Gazing at the coefficients is just looking at the internal states of the machine; instead, let's take the map values and compute predictions. Of course we need the full distribution of predictions, but for a start you can just take the map values and understand what's going on. You apply the inverse link function, which in this case is called the logistic, and that's the formula with an exponential on the top and the bottom. It puts the linear model on the probability scale: the logistic function takes a continuous range of values and squashes it into the zero-one interval, in exactly the way that's right for our model. So the logistic of just the intercept gives you the prediction for an average female application, because the male dummy is zero. And the logistic of the intercept plus the coefficient gives you the prediction for an average male application. Now we're on the probability scale; we have absolute predictions. And those are about 0.30 and 0.45. So the difference in probability is about 15%. Make sense? So you see there's added value. This is interpretable, and you understand the advantage now. Lots of both are getting admitted, but a lot more of the male applications, about 15% more. Now we want the uncertainties. I'm going to go fast through this, but it's in the book. We calculate the posterior distribution of these effects by just putting all the samples into the logistic function, exactly as you see. Put all the samples in post, and then use the post object inside the logistic function. It works the same as before. Now you get posterior distributions of predictions, as we always want, and we can plot them. Relative shark is on the left: the odds ratio for a male application. You can see it's centered around a little over 1.8.
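A sketch of this inverse-link calculation. The intercept value here is an assumption: alpha of about -0.83 is chosen to reproduce the 0.30 baseline quoted above, since only the slope 0.61 is quoted in the lecture:

```python
import math

def inv_logit(x):
    # The logistic inverse link: exp(x) / (1 + exp(x)).
    return 1 / (1 + math.exp(-x))

alpha, beta_m = -0.83, 0.61   # approximate MAP values (alpha assumed)

p_female = inv_logit(alpha)          # male dummy m = 0
p_male = inv_logit(alpha + beta_m)   # male dummy m = 1
print(round(p_female, 2))            # -> 0.3
print(round(p_male, 2))              # -> 0.45
print(round(p_male - p_female, 2))   # -> 0.14, the absolute difference
```

The same three lines, applied to every posterior sample instead of the map values, give the full posterior distribution of the absolute effect.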
But between about 1.6 and 2.1. And then absolute deer on the right: the difference in probability between a male application and a female application, centered on 0.14. It's always positive; the advantage is always greater than about 0.1 but less than about 0.18. Make sense? You want to compute both. Okay, but you will not leave with this story. You will leave with a better story. Now, you should always check the posterior predictions against the data. You want to know how the model fits badly, because all models fit badly in some way. Why? Because you don't want to overfit. You could put in a parameter for every department and fit the data perfectly, but that's not our goal. So let's look at the posterior predictions. This is the output of the postcheck function in rethinking, which is the dumb automated way to compute posterior predictions. It just uses the link function. It takes each case in the data set and makes a plot like this. What we're looking at is each case in the data set along the bottom. Department A is cases one and two, department B is cases three and four, and so on. Each column is a particular combination of department and applicant gender. And on the vertical axis, we have the outcome scale: p, the probability of admission of each application. The model predictions are the open circles, and the plus marks around them are the prediction intervals. This is what comes out of link if you use the link function in rethinking. It's what we just computed on the previous slide, actually: absolute deer, this is absolute deer. And then the blue points, with the lines between them, are the actual data: the observed proportions. This is a really terrible model. It misses every department. Absolutely every department.
And even more interesting than that, you'll notice that there are only two departments, C and E, where female applicants have a lower rate of admission than males. In all the others, female applicants are admitted at a higher rate than the males. You see that? The bars are going up. So what has gone wrong here? You see the model always predicts a drop: it predicts that the second point in each pair, the female applicants, is lower, because the model thinks that female applications have a lower admission rate. It's a terrible model. This is why you check your predictions. Don't publish this model. So what's going on here? The overall admission rates vary a lot across departments. You can see that on this slide: they're all over the place. Not all departments are equal in their admission rate. Some are extremely selective, and some less so. Some get a lot of applications and so accept a smaller fraction of them, and some get very few applications and accept more of their pool. Graduate programs vary in that way. We dumped them all together as if they were the same in their average admission rates and assigned them all the same alpha. Let's fix that problem, because that is the problem here, as you'll see. Let's create a unique intercept for each department, so that each department can have its own average admission rate, while we still estimate the average difference between male and female applicants within departments. So we're going to let departments have different base rates of admission, which means we have to give them different intercepts. This is the model to do that. We make subscripts. We love to make subscripts in stats, all over the place. So now we make alpha conditional on department. For each department on row i, there's now an index for the department, between one and six, and you get a vector of alpha parameters, one through six.
We did this, well, I didn't do it in lecture, but it's at the end of chapter five, where you do categorical variables. The way to think about it: now there's a different alpha for every department, and we're going to estimate it with the data from that department. In plain English, the previous model, the one on the left, asks the question: what are the average probabilities of admission for females and males across all departments? There's nothing about departments inside the model at all. It ignores departments; it just pools all the applications into one giant heap and treats them the same. So: across all departments, ignoring departments, what are the average probabilities of admission for males and females? The model successfully answered that question. It did not lie to you. But you actually wanted to ask a different question. I tricked you into pretending, for a moment, that it was answering a different question. That's my fault. The question we want to answer is: what is the average difference in probability of admission for females and males within departments? And that's what the model on the right answers, because for each department we get its own rate of admission, alpha, and then we have the average difference, beta m, which doesn't depend on department. That's why it's an average difference: it's averaged across departments, but it's now a within-department difference, because each department gets its own level to start from. You with me? The way to see what's going on in the code: you'll see there's an a inside the model, and it's bracketed with dept_id. What's dept_id? It's a new variable we make, and I show it on the left here. Each department has a name, A through F, and we just give it an index number, one through six.
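A minimal, hypothetical sketch of the index-coding idea outside the rethinking package; the names dept_id, alpha, and beta_m just mirror the model description above, and the zero values are placeholders for estimated parameters:

```python
# One intercept per department, one shared coefficient for the gender dummy.
depts = ["A", "B", "C", "D", "E", "F"]
dept_id = {d: i for i, d in enumerate(depts)}   # A -> 0, ..., F -> 5

alpha = [0.0] * len(depts)   # vector of per-department intercepts (to be estimated)
beta_m = 0.0                 # average within-department male/female difference

def linear_model(dept, male):
    # logit(p) = alpha[dept_id[dept]] + beta_m * male
    return alpha[dept_id[dept]] + beta_m * male

print(dept_id["F"])   # -> 5
```

The bracketed a[dept_id] in the R model is exactly this lookup: each row fetches its department's own intercept, while every row shares the single beta m.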
So now there's a vector of just six alpha parameters, and we pick out the right one for each department. That's the way you construct it. What map and map2stan will do is look at dept_id, see that it has six unique values in it, make a vector of parameters called a with positions one through six, and estimate each of them using the data from each department. Again, this is at the end of chapter five, where we talk about categorical variables. And now, two new models. There's model 10.8 at the top, which has unique base rates of admission for each department but ignores the gender of the applicant. And then 10.9, which has both: unique intercepts for each department and the gender of the applicant. Let's run the tournament again. Suspense, foreshadowing. And what happens? Is 10.8 the winner now? Well, first of all, you should see that 10.8 and 10.9 beat the pants off of 10.6 and 10.7. It's no contest; look at the difference in WAIC. That's because most of the variation in this data set, if there was one thing you wanted to know that would help you predict the chances of an application, comes from the department. It's the most important piece of information. Everything else is just mopping up little trivial bits of probability. The most important fact is which department you apply to, because the differences in average admission rates between departments are huge; they cover almost the whole range of possibilities, as you saw in the previous graph. That's why 10.8 and 10.9 do so much better. But it still may be that there's a gender effect, and that's what we're looking at now. Notice that the top two models, 10.8, which doesn't have the male dummy variable in it, and 10.9, which does, are about tied.
You can't tell the difference between these two models; they do about equally well. I would encourage you to read that as: the gender of the applicant matters, but the model that includes it probably overfits a little bit. That's the way I encourage you to read ties like this. It's not that you want to say the gender of the applicant doesn't matter; it just doesn't matter a huge amount, and including it probably produces a little overfitting. But the evidence here is that it matters, just not a lot. It's a tie between the two. Remember, AIC and WAIC estimate predictive accuracy and overfitting; they don't tell you what the truth is. So let's look at the predictions from model 10.9 and see what it says. When you look at the precis output for 10.9, you'll notice there's now a vector of a parameters, one for each department: a1 is department A, then department B, and so on down. These are their average rates. You'll see they go all the way from a log odds of 0.68, which is quite positive: this department accepts most of its applicants. I think it's physics, actually, in the real data. They get very few applicants and accept most of them, at least in '73 they did; I don't know what it's like now. And then you go down. The next was engineering, I think. And then we get into the negative zone, the selective departments, which are the social sciences and humanities. They receive way more applications and accept way fewer of them. We get all the way down to minus 2.6, and remember, minus three is about a 5% chance of acceptance. Get good at reading log odds. You'll dream in log odds before long. It's a very natural scale once you're used to it. When people quote probabilities to me, I immediately convert them to the log-odds scale. I do; it's just a tic. Log odds are good to add, you can reason with them, you can do math fast with them.
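A few landmarks help with reading log odds at a glance; here is the conversion for some round values:

```python
import math

def inv_logit(x):
    # Log-odds -> probability.
    return 1 / (1 + math.exp(-x))

# Rough landmarks for reading log odds at a glance:
for log_odds in (-3, -1, 0, 1, 3):
    print(log_odds, round(inv_logit(log_odds), 2))
```

Minus three is about 5%, minus one about 27%, zero is exactly 50%, and the positive values mirror these. With those anchors, the precis output reads naturally.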
Okay, bm is the target of our inference. It's minus 0.1, with a standard deviation of 0.08, and most of its mass is below zero. So what does that mean? Male applications are very slightly disadvantaged on average within departments. Across departments, in general, male applicants are hugely advantaged, but that's because of the departments they apply to: the departments they apply to tend to accept most applications. Does this make sense? This is why this is a famous data set. It's fun to think with. So here's what the predictions look like. With the model that includes the male dummy variable on the top, you'll see the model is now doing a much better job. It actually touches the data now. It doesn't capture all the variation, but we don't want it to. We don't want to overfit; we want to measure the effects we're interested in. And then the previous model is on the bottom, so you can compare: bad model on the bottom, better model on the top, which at least gets the order of the predictions right. Department A, I think, was physics, or maybe mechanical engineering, I forget. And that's the one where there's a huge advantage to being a female applicant, but they had very few female applicants and accepted nearly all of them. There are selection effects in who applies to these programs, so the applicant pools aren't all the same. That's why the analysis of cases like this is so complicated. Okay, that's all the time we have for today. We're going to pick up with the explanation on Friday. This is an example of a famous thing in stats called Simpson's paradox, and I will leave the explanation of the generality of this phenomenon until Friday morning. Thank you all for indulging me, and I'll see you next time.