Good morning. Welcome back. I want to go right into this. We've got a lot of exciting new things to get motivated. We're going to move conceptually at a slow rate this morning, and that will set up a bunch of new types of models that you'll learn in the next several lectures, through this week and the next, but we need to get all of our philosophy in place first, and that's the first thing I want to do today. So put statistics out of your mind for a second and instead imagine you've got five buckets positioned equidistant from you, and then, in a pile at your feet, a bunch of pretty little pebbles, each of which has been painted with a number. There are a hundred of these pebbles, and I show only about a dozen here on the screen, because I got tired of drawing pebbles, but each of them is numbered individually. They're unique pebbles, right? Each has an identity. But to us they're all going to be exchangeable pebbles, because we're just interested in what happens when we toss these pebbles, one at a time, into buckets at random. So I want you to imagine you could toss these pebbles one at a time in such a way that each pebble has an equal chance of landing in any of the five buckets. If you miss a bucket, you go pick it up and throw again, something like that. Eventually all 100 pebbles end up in the buckets, distributed somehow, and then you count them, and what you get is a distribution of counts of pebbles in the buckets. My question to you is: what are these distributions like? What types of distributions are really common, and what types are really rare? We can approach this intuitively; maybe I can motivate it for you. Let's think about extreme distributions first.
So there's only one way that you can get all the pebbles in bucket one, and that's to have all the pebbles in bucket one, right? There's no other arrangement of individual pebbles which will give you this. Likewise for the other extreme: you can get all 100 pebbles in bucket five, or in any of the other buckets. So there are five unique distributions which have all the pebbles in one particular bucket. Probably you won't ever see such a distribution, but it's possible. Then there are a bunch of distributions which can happen in a bunch of different ways: they look the same, but the individual pebble numbers are exchanged across buckets. So consider this distribution here, where we have 5, 22, 12, 37, and 24 pebbles arranged in the buckets. I hope that sums to a hundred; someone check, and if it doesn't, let's not talk about it. It's supposed to sum to a hundred. We could take a pebble from bucket two and exchange it with a pebble from bucket three without changing the distribution of counts. But it's a different arrangement, and it would arise through a different order of tosses. And then I ask you: how many ways could you realize this distribution? Well, that's what we're going to solve today. And what I assert is that this very problem is the principle behind Bayesian inference: some distributions can arise in vastly many more ways than other distributions, and those are the distributions that Bayesian inference gives us. This is a principle called maximum entropy that I want to explain to you. It's a principle bigger than Bayesian inference, one that justifies Bayesian inference, and it's going to help us do modeling going forward. So let's put aside arithmetic for a second. I don't really like arithmetic; I'm not very good at it. But I like algebra, right? It's more fun. If there's an actual digit, I always like to replace it with an x. Makes me happier.
So let's replace all those actual integers with n's. We have n1, n2, n3, n4, n5, which are the bucket counts. Let's talk about the properties of these counts. At some point in your education you learned, and then healthily forgot, the fact that there's a formula for the number of different arrangements of the pebbles that will give you these same counts. It comes from combinatorics; you probably learned it in secondary school and then never used it, right? Here it is. It's called the multiplicity, and it's just combinatorics: the number of unique ways to realize the distribution n1, n2, n3, n4, n5 is capital N factorial (where N is the number of pebbles, in this case a hundred) divided by the product of the factorials of each count: W = N! / (n1! n2! n3! n4! n5!). Again, at some point in secondary school you learned this, and you were like, I'm never going to use this, this is crazy. Well, now you get to use it, and I'll tell you what: it's important. It's the foundation of statistical inference. So let me give you some intuition. This is a very powerful result, because this thing gets big really fast as the n's get equal, and I want to give you a motivation for that. Let's think about an extreme distribution again, but now let's consider only ten pebbles, because if it were a hundred, the numbers would, as you'll see, take up the whole slide. So let's do ten; the lesson will be fine with ten, and with a hundred it's even more extreme. There's only one way to get all the pebbles in bucket three, right? Intuition delivers that for you, and the formula on the previous slide will also deliver it: there's only one arrangement that makes that happen. Now, how many ways do you think there are for the next distribution?
We're going to take one pebble from bucket three and move it to bucket two, and take another pebble from bucket three and move it to bucket four, so that we've got one pebble in bucket two, eight in bucket three, and one in bucket four. How many different arrangements of individual pebbles do you think can make that distribution? Maybe somebody's seen this lecture before, but just give an order of magnitude. The answer is that there are 90 ways to do this. I've also uploaded the slides, so you can check. It's a massively bigger number of ways, and this is just going to accelerate. We're going to keep playing this game. Maybe you don't find it exciting, but people have really bad intuitions about combinatorics, and I'm trying to justify why statistics works. This is the justification, I think, for why statistics works: these numbers go up so fast. We're going to again take two pebbles from the middle bucket and move them out to the sides, so now we've got two pebbles each in buckets two and four, and six in bucket three. How many now? Now it's over a thousand. So basically we're getting an order of magnitude increase every time we distribute a couple of pebbles outward: many, many more arrangements realize the distribution. Indulge me; I like this, this is fun. We're going to keep going. Now we distribute a couple more and put them on the extremes, so the counts are 1, 2, 4, 2, 1. How many unique arrangements of pebbles can produce this distribution? The answer is 37,800 different ways, and there's nothing special about that number except that it is really big. It's hugely bigger than the previous one; every one of these is massively bigger, an order of magnitude bigger in fact, than the one before it. So we've got one more step: we can make this flatter yet. So now, here in the bottom middle of this slide,
I show you the distribution where there are two pebbles in each bucket, the flattest, most distributed arrangement we can possibly give the pebbles, and we finally reach a maximum. The number of ways you can realize this distribution is 113,400, and there is no other distribution of the pebbles which has more ways to be realized than this one. This is a general principle of statistical inference: distributions which are flat can be realized in many, many more unique ways, and this is why we bet on them. They have high entropy. You may remember, from last week or the week before, that when we talked about calculating KL divergences, or distances, flat distributions are closer to other distributions. Remember the story that Earth was closer to Mars than Mars was to Earth? This is another property of these things: flat distributions can be realized in a huge number of ways, they're less surprised when the distribution turns out to be different, and their divergences to other distributions are smaller. These things then become really good foundations for statistical inference, because they distribute the possibilities as widely as possible. So let me show you what happens here. This is a way to actually derive the information entropy formula; it's nothing more than the multiplicity. There's a box in chapter 10 where I show you the mathematics, if you're interested, but here's the pure intuitive version. We had this W thing before, the multiplicity, the number of ways to get these n's. Let's imagine we take the log of that multiplicity and then divide it by the number of pebbles. This is like a per-pebble magnitude of ways; we've normalized across the number of pebbles. And it turns out there's a very good approximation for this quantity.
It comes from an approximation I'm sure some of you know, called Stirling's approximation for factorial logarithms, and it gives you this: (1/N) log W is approximately the negative sum, over all buckets i, of (n_i/N) log(n_i/N). This should look eerily familiar. It is the information entropy formula, and this is one way to derive it. You're like, okay, that's very nice, Richard, but what are you getting at? Information entropy is just this thing: the normalized logarithm of the number of ways to realize a distribution. That's what information entropy is, and it's maximized when the distribution is flat; flatter distributions have higher entropy. That's all it is. There's nothing magical about it; it's just counting. I want to use this to draw things all the way back to the beginning of the course, and then give us a way to go forward, to think about using many different kinds of outcome distributions in our models. But we want them to have this property: each is the distribution that is as flat as possible consistent with the constraints that we put in, given what we know scientifically about the data before we see it. This perspective on statistical inference is due to a large number of people, but it's most centrally associated with one man, the American physicist Edwin T. Jaynes, seen here in his Navy uniform as a young man. Jaynes published a lot on the maximum entropy principle, which is connected very closely to Bayesian inference. The principle is really just that the distribution with the largest entropy is the distribution most consistent with our stated assumptions; if you choose any other distribution to characterize your state of knowledge, you will be implicitly adding other information. You don't know what it is, but you've smuggled some unknown bits of information into your distribution. So the argument was: lay out all the constraints, the things you think you know before the data arrive, and then solve for the distribution that's as flat as possible under those constraints. Then you're doing the best you possibly can; you're honestly characterizing your ignorance. That's the maximum entropy principle, and it just comes from what we saw with the pebbles: those are the distributions that can arise the largest number of ways. There are lots of conceptual things I think are nice about this, and I'll give you examples as we go forward. For parameters, it gives us a way to understand the meaning of a prior: what are the constraints that make a prior legitimate, and what is the information content of a prior distribution? For observations as well, it gives us a way to understand the likelihood on top. When I introduced Gaussian distributions weeks ago (I think it was even before Christmas, wasn't it?) I gave you this argument that there's a maximum entropy interpretation of the Gaussian: if all you know about a measure is that it has finite variance, you should choose a Gaussian to characterize it. Well, I don't know if you "should" or not, but the distribution consistent with that information, and containing no other information, is the Gaussian. The Gaussian is just the flattest distribution possible for a measure on the whole real number line that has finite variance. There is no other distribution with the same variance that is flatter. I know it seems weird; if you look at chapter 10, I've got a proof of this, a little box with some integrals.
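To make the pebble counting concrete, here is a small sketch you can run yourself (in Python, for convenience, rather than the course's R; the function names are my own). It computes the multiplicity W for the ten-pebble distributions from the slides, then checks the Stirling claim that (1/N) log W approaches the entropy of the normalized counts as N grows:

```python
from math import lgamma, log, prod, factorial

def multiplicity(counts):
    """Number of unique pebble arrangements realizing these bucket counts:
    W = N! / (n1! n2! ... nk!)."""
    N = sum(counts)
    return factorial(N) // prod(factorial(n) for n in counts)

# The ten-pebble distributions from the slides, from peaked to flat:
for counts in [(0, 0, 10, 0, 0), (0, 1, 8, 1, 0), (0, 2, 6, 2, 0),
               (1, 2, 4, 2, 1), (2, 2, 2, 2, 2)]:
    print(counts, multiplicity(counts))
# The flat distribution has the most ways: 113,400.

def entropy(counts):
    """Information entropy of the normalized counts: -sum p * log(p)."""
    N = sum(counts)
    return -sum((n / N) * log(n / N) for n in counts if n > 0)

def log_w_per_pebble(counts):
    """(1/N) log W, computed with log-gamma so large N is no problem."""
    N = sum(counts)
    log_w = lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)
    return log_w / N

# Stirling's approximation: (1/N) log W approaches the entropy as N grows.
for scale in (1, 10, 100):
    counts = [2 * scale] * 5  # flat distribution with N = 10 * scale pebbles
    print(sum(counts), log_w_per_pebble(counts), entropy(counts))
```

With ten pebbles the two quantities are still far apart, but by a thousand pebbles the per-pebble log multiplicity is within about one percent of log 5, the entropy of the flat distribution.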
You'll love it, right? And it turns out that Bayesian updating, what we've been doing in this course, is a special case of this principle. You can start with the constraints on the variables, that is, whether they have to be positive, or whether they're bounded by some maximum, any kind of constraint you like. You can input the data as constraints too, because the data put Dirac delta functions on values. Those of you who know Dirac delta functions will love this; if you don't, don't worry about it. It just means you put a probability spike on a value. You feed all that in, and you get the posterior distribution out by solving the maximum entropy problem. So Bayesian updating is just a special case of this larger inference framework. I'm not saying we're going to do it this way; it's just that you want to understand that what you're doing when you solve for the posterior distribution is getting the distribution that is as flat as possible consistent with the data. That's what the posterior distribution that Bayesian inference gives you is: the flattest distribution possible consistent with the constraints and the data. No other distribution could be flatter and still be consistent with the information you put into it. So it's the highest entropy answer. And why is that good? Doesn't entropy sound bad? No, it's exactly the opposite, because it means your distance to the truth is smaller. That's the bet you get from maximizing entropy. Okay, this is the church of entropy this morning. One way to think about this, though, is that it's deflationary. There's nothing magic about statistics. It's saying: well, junk that can happen lots of ways?
We're going to bet on that. So: you folks threw a bunch of pebbles into some buckets, and now I've got to bet on the distribution. I'm going to bet it's pretty even. Why? Because no matter what happens, an even distribution is bound to arise. That's all that statistical inference is doing. We don't know what's going to happen; we put a tiny sliver of scientific information into our model, and then, for everything that's left, we bet on entropy. Everything else is just betting on entropy. Isn't it majestic? There's no access to truth here, nothing except betting on stuff that can happen lots of ways. That's all it is. But that's amazing, because it works incredibly well, even though it sounds colossally stupid, right? So here's my summary slide of what I just tried to say, but I want to use it to motivate moving forward to other distributions. It's probably intuitive to you, and if it's not, play around with it on your R command line: if we're going to maximize this function, information entropy, by choosing the values inside the vector p, it has its highest value when all the p's are equal. You can't do any better than that; make them equal and entropy is highest. This is why I gave you that homework problem with the birbs (that was not a typo, by the way, the birbs): there was one island where the birbs were equally frequent, and that island had the highest entropy. That's how you maximize entropy: you make things equally likely, which minimizes your surprise. But sometimes there are constraints which prevent us from making all the p's equal, and then what happens? Then we get the flattest thing possible consistent with those constraints. What might those constraints be? They could be constraints like: the variance is known, or the mean, or the average logarithm, or any number of things. And depending upon the constraints you input, there will be some different distribution that maximizes entropy. This is what we did, actually, way back in the beginning of the course, I think in week one, when I took you through the garden of forking data. We drew some marbles, then we imagined all the alternative draws you could get from the bag, and we tried to say how likely the thing we got was, by counting up all the different paths through the garden. That is entropy maximization, and the probability distribution you get from that exercise is a maximum entropy distribution. So they're actually pretty easy to derive in principle. It's tedious to do the counting, and there are compressed mathematical ways to do it, but if all you do is count up all the ways that stuff can happen and use that as your probability distribution, then you're using a maximum entropy distribution. There's nothing magical about it, and all of the familiar probability distributions of applied statistics are maximum entropy under some set of constraints. Now, you can use them in the wrong circumstances, where the constraints don't match, but under some set of constraints they are maximum entropy distributions. To give you an idea: the uniform is the one I already asserted. If the only constraint is that there's some real value within an interval, the maximum entropy distribution, the so-called maxent distribution, is the uniform. That's like your birb homework, right? If instead we have some real value and it has finite variance, then the maxent distribution is the Gaussian. There's no flatter distribution that could give you that. Again, in chapter 10 there's a whole section where I try to motivate this for you and draw some pictures of alternative distributions that have the same variance but aren't Gaussian.
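You can also check the "no flatter distribution with the same variance" claim numerically, without the integrals. This sketch (mine, not the book's proof) compares the standard closed-form differential entropies of three distributions, each tuned to have variance one:

```python
from math import log, pi, e, sqrt

sigma2 = 1.0  # fix the variance at 1 for all three distributions

# Gaussian with variance sigma^2: entropy = 0.5 * log(2 * pi * e * sigma^2)
h_gaussian = 0.5 * log(2 * pi * e * sigma2)

# Uniform on an interval of width w: variance = w^2 / 12, entropy = log(w).
w = sqrt(12 * sigma2)
h_uniform = log(w)

# Laplace with scale b: variance = 2 * b^2, entropy = 1 + log(2 * b).
b = sqrt(sigma2 / 2)
h_laplace = 1 + log(2 * b)

print(h_gaussian, h_laplace, h_uniform)
# The Gaussian has the largest entropy of the three, as maxent promises.
```

The ordering never changes as you vary the variance: with the variance pinned down, the Gaussian always comes out on top.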
That should give you some idea about what's going on. If you have binary outcomes, say counting coin tosses, and there's a fixed probability of each outcome across trials, then the maxent distribution is the binomial, which is what we're going to play with more today. That's also the globe tossing distribution we had originally, and the marble drawing distribution: it's maxent. There's no other distribution which is flatter than the binomial, and why? Because the binomial is the distribution that just counts the paths, and any other distribution you use is smuggling some other kind of constraint in. Maybe that constraint is legitimate, but we'd have to figure out what it is. And then, I've been using exponential distributions all course for scale parameters like standard deviations, but I've mainly brushed aside questions about why; I'm now ready to reveal why I like them. Exponential distributions have this nice property that they have a very clear maximum entropy constraint: if all you're going to say about a parameter is that it's a non-negative real with some mean value, then the exponential contains only that information. So it's very clear what it means: you can set the average magnitude of that scale parameter as a prior and then use the exponential. That's nice. It doesn't mean you have to use an exponential, but it has a very clear interpretation and some nice properties. Okay, so: generalized linear models. These are the larger family of geocentric regression models that the linear regressions we've been using are members of. The Gaussian outcome model is a special case of this linear strategy, where we have some probability distribution for an observable outcome variable, and we want to connect a linear model to the mean of that distribution somehow. I call this the generalized linear modeling strategy. It's a scam; it's unreasonably effective given how geocentric it is. It works amazingly well.
It really has no business working as well as it does. The general strategy: first, we pick some outcome distribution. How? I'll talk about that, and you won't be surprised to hear that maximum entropy is the principle I think is reasonable. Then you model the parameters of that distribution using weird things called links. I'll explain what those are. What do links do? Well, they link the distribution to some linear model. And then of course step three; in Bayesian inference you always know what step three is, right? Compute the posterior distribution. Or step three could be "???" and step four is "compute the posterior distribution", right? So this is a very powerful approach. You can do all kinds of fancy things, multivariate relationships and nonlinear responses, lots of stuff, with the same basic strategy. I would say 99% of applied statistics is just generalized linear models. And often, even if you don't want to play this game and you have some scientifically derived model, when you write it down it'll turn out to be a generalized linear model by accident. Why? Because of maximum entropy: real generative processes also generate maximum entropy distributions. So you end up with a GLM in an unreasonably large number of circumstances, even when you don't want one. It's a bizarre thing. I have some colleagues who are also statisticians, and people come to them with stats problems and ask: what sort of model should I use? The colleague will say, well, I don't know, no details yet, but I bet it's a GLM, right? Let's just start there. You probably want a GLM, and that's usually the case. So how do we pick an outcome distribution? Nearly all the outcome distributions we're going to use are exponential family distributions, and the reason is this thing called the exponential family.
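As a concrete sketch of that three-step recipe, here is a tiny simulated Poisson GLM with a log link (a hypothetical example of mine, not from the lecture slides; the parameter values are invented). The log link is what connects the linear model to the distribution's mean while keeping that mean positive:

```python
import math
import random

random.seed(1)

# Step 1: outcome distribution for a count variable -> Poisson.
# Step 2: link a linear model to its mean with a log link:
#   log(lambda_i) = a + b * x_i, so lambda_i = exp(a + b * x_i) > 0 always.
a_true, b_true = 0.5, 1.2  # invented generative parameters

def poisson_draw(lam):
    """Draw one Poisson count by inverting the cumulative distribution."""
    u, k, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

xs = [random.uniform(-1, 1) for _ in range(1000)]
lams = [math.exp(a_true + b_true * x) for x in xs]  # inverse link
ys = [poisson_draw(lam) for lam in lams]            # simulated counts

# Whatever the linear model says, the log link keeps every mean positive:
print(min(lams), sum(ys) / len(ys))
```

Step three, computing the posterior for a and b, is exactly what we've been doing all along; only the outcome distribution and the link are new.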
They're all maximum entropy under some set of constraints. There are distributions which are not maximum entropy under any known constraints, and they're weird. You might want to use one in some circumstance, but they don't tend to have much value in statistics; people don't use them very much, and I think there's good reason, because they don't have maximum entropy interpretations. All of these exponential family distributions arise from natural processes. With the Gaussian, remember, I tried to teach you that thing about the football pitch: if people move left and right at random, the distribution of positions of players will eventually be Gaussian. Lots of processes aggregate up this way, and the same is true for all the others: the binomial, the exponential, and, I think at the beginning of next week, gamma distributions. All of them arise from natural processes that are well understood, so we see them in nature all the time, and we shouldn't be surprised. The same is true for power laws, which are also members of this family; they too have maximum entropy constraints. Okay, and my note at the bottom: you should resist this thing I see people doing all the time, which I call histomancy. So histomancy is: you get some data set, you've got to figure out a probability distribution for the outcome, so you plot it, and you look at the histogram, and then you kill a chicken; sorry, a Greek sacrifice, right?
And then you try to divine from the histogram what the distribution should be. You should never do this. It doesn't make sense under any framework; there is no statistical paradigm that makes it permissible at all. You want to use knowledge of your constraints in the first place to figure it out. Just in a pragmatic sense, there's no statistical framework in which the aggregate histogram of the outcomes, unconditional on everything else, has to have any particular distribution at all. You could have some variable which is perfectly Gaussian after you condition on the predictors, but there's no theorem which tells you that the whole population, mixed together, has to be Gaussian, right? So it doesn't make sense in any paradigm, but people do it all the time; in fact, I've seen it taught. Just don't do it. It's much easier to use principles instead of peering and guessing. So let me give you a quick introduction to some of these distributions, so that you've got some vocabulary and know their general shapes. What we're going to do, starting today and through the next few weeks, is build GLMs with these different outcome distributions, and you're going to see why they're useful and how they're connected to scientific processes. It's just an extension of what you've already been doing, and you've already got all the tools you need. The core member of the exponential family is, you guessed it, the exponential. The exponential is everybody's favorite distribution, because it's got exactly one parameter and a really nice, soothing shape. The exponential is the distribution that has the same proportional rate of change across its whole shape; well, it's exponential, that's what exponential means. The lambda parameter is a rate, and the mean of this distribution is one over lambda.
It's one over the rate. Generatively, the exponential can arise from a machine that has a number of parts, a machine like a body, say, with a bunch of parts, where if one of those parts breaks, the machine stops working. So think about washing machines or dishwashers, or if you prefer, think about fruit flies; fruit flies are machines too, right? If the fruit fly's heart stops, or the dishwasher's heart stops (what do dishwashers have? They don't have hearts, they have things like pumps), then the machine doesn't work anymore and you can't wash your dishes. So if there's a bunch of parts inside the washing machine, and each of them has some chance of breaking on any particular day, then the waiting time until the washing machine stops will be exponentially distributed. This is a very interesting fact: generatively, you see these exponential, or approximately exponential, distributions all the time as a result. Now, if you count events arising from an exponential waiting process, say you've got a bunch of fruit flies (and I don't mean to pick on fruit flies, but, you know, I did biology; we had lots of flies in tubes, it's like a pastime, and I also used to live in California, where fruit flies practically run the state: set out a glass of wine and soon there will be a fruit fly), so say we're counting fruit flies ascending to heaven, and within some fixed window of observation those mortality events, ascendance events, arise with exponential waiting times, and we count them up. It turns out the distribution of mortality counts is binomial; it's like coin flips. You've got so many flies, each one either does or does not ascend, and you count them up, and you get a binomial distribution, which is the maximum entropy distribution for binary events with some constant expected value. There's no other distribution that is consistent with those constraints and distributes the probability more evenly. Think about this.
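The "machine with parts" story is easy to simulate (a sketch of mine with invented hazard numbers): each part independently has a small daily chance of breaking, and we record the day the first one goes. With a constant hazard the waiting times are geometric, which for small daily hazards is essentially the exponential: memoryless, with mean one over the total rate.

```python
import random

random.seed(2)

def days_until_breakdown(n_parts=5, daily_hazard=0.002):
    """Simulate a machine: each day, each part breaks with a small
    probability; return the day on which the first part fails."""
    day = 0
    while True:
        day += 1
        if any(random.random() < daily_hazard for _ in range(n_parts)):
            return day

waits = [days_until_breakdown() for _ in range(20000)]
mean_wait = sum(waits) / len(waits)
print(mean_wait)
# The mean waiting time is roughly 1 / (n_parts * daily_hazard) = 100 days,
# exactly the one-over-the-rate behavior of the exponential.
```

Notice that more parts, or a higher per-part hazard, just raises the total rate; the shape of the waiting-time distribution stays the same.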
It's a bit counterintuitive, and there's a section of chapter 10 where I also prove to you that the binomial is maximum entropy, because the binomial doesn't look very flat, right? But that's because there's a constraint that the expected value is high, so it's got to bunch up against the ceiling. Most of the flies die eventually; all flies go to heaven. "All Flies Go to Heaven", it was a Disney movie, right? Okay. There's this other distribution that we're going to use, I think starting on Friday; I'm going to spend all of Friday on it, it's one of my favorites. It's called the Poisson distribution, or if you're speaking English, you can just say "Poisson". There are two ways to think about getting it. If you start with a binomially distributed random variable, but the probability of any particular success is very low and there are a huge number of trials, which here means there are a lot of flies and the flies have really long lifespans, then when you count mortality events among the flies, the counts will have a Poisson distribution. Or "Poisson"; I should stop sounding pretentious. Alternatively, you can just count exponential events that have a low rate, and you'll also get a Poisson. So the Poisson distribution is a special case; it's a count distribution related to the exponential, and again, it arises in nature all the time, and we're going to do good stuff with it. Also: think about the time to the event in the exponential. How long did you wait before your dishwasher broke, right?
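That low-probability, many-trials limit is easy to check numerically before we move on (a sketch; the particular numbers are arbitrary). With ten thousand trials and a tiny success probability, the binomial probabilities and the Poisson probabilities with the same mean are nearly indistinguishable:

```python
from math import comb, exp, factorial

# Binomial with many trials and a tiny success probability, versus the
# Poisson with the same mean lam = n * p.
n, p = 10_000, 0.0003
lam = n * p  # mean count of 3.0

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

worst = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(20))
print(worst)  # the largest pointwise gap between the two pmfs is tiny
```

Push n higher while shrinking p to hold the mean fixed, and the gap keeps shrinking; in the limit, only the Poisson is left.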
And you're recording times to death for fruit flies. If you start adding those waiting times together, say two things have to break before the dishwasher stops working, or you want to know the amount of time before a certain number of fruit flies have died, then those waiting times are distributed in another way: the gamma distribution. The gamma distribution is the distribution of waiting times, or distances, in which multiple things have to happen before the event of interest occurs. Gamma distributions are also really common in natural phenomena, and the gamma is also a maximum entropy distribution. For example, age of onset of cancer is gamma distributed. Why? Well, no one knows for sure, but it's plausibly because there are lots of cellular defense mechanisms, and all of them have to fail before the cancer can get going. It's like a bunch of locks in the cell trying to stop it, and the gamma distribution is the distribution of the waiting time until all of those things fail. And so it is, at least for humans, that age of onset of cancer is gamma distributed. If you get a gamma distribution with a really large mean, it converges to a normal distribution, and now we're back home: we've got normals again. There are lots of other ways to get a normal; this isn't the only way. All roads lead to normal.
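Here's a quick simulation of that "several things must fail" story (a sketch; the rates and counts are invented): summing exponential waiting times gives a gamma, and a gamma with a large shape starts to look normal.

```python
import random
import statistics

random.seed(3)

def gamma_wait(k, rate):
    """Waiting time until k independent exponential failures have all
    occurred: the sum of k exponential waits, a gamma-distributed time."""
    return sum(random.expovariate(rate) for _ in range(k))

# Gamma(shape=k, rate) has mean k/rate and variance k/rate^2.
waits = [gamma_wait(5, 0.1) for _ in range(20000)]
print(statistics.mean(waits))   # near 5 / 0.1 = 50
print(statistics.stdev(waits))  # near sqrt(5) / 0.1, about 22.4

# With a large shape, the gamma becomes nearly symmetric, like a normal:
big = [gamma_wait(200, 0.1) for _ in range(5000)]
print(statistics.mean(big) - statistics.median(big))  # small gap
```

With shape 5 the simulated waiting times are visibly right-skewed; with shape 200 the mean and median nearly coincide, which is the convergence toward the normal mentioned above.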
You're stuck in normal; it's like an absorbing state. So what's the point of this? I don't expect you to memorize it. I just want to show you that there's a consilience to all of this: there are generative processes which link together all the distributions we use in stats, and each of them is principled, based on the constraints on the variable we're counting. All the rest of the shape of the distribution comes from maximum entropy. It comes from betting on things that can happen lots of ways; those are the things that are more likely to happen, in proportion to the number of ways they can happen. And that's all it is. It doesn't mean these distributions are correct; it's the betting part of statistical inference that they arise from. Okay, I've got half an hour left, so now let's do some actual statistics, building on this. I showed you this before, right? The tide prediction engine, or maybe a different one. This is Lord Kelvin's tide prediction engine, and I put it in here again because when we get to generalized linear models, this metaphor is very potent. What is this metaphor about again?
This is a mechanical computer, and there's a certain part of it that is the prediction of when the tides will come, and then there's all this stuff at the bottom, which is just calculating junk that allows the computation to work. The bottom parts are your parameters; the top, in a model, is the prediction space you're interested in. This is true even with Gaussian models: as soon as we had interaction effects, things got really hard, and we had to start doing those triptych plots, remember? Lots of things to do to understand the model. With generalized linear models, you're absolutely wedded to this prediction perspective if you want to understand what's going on. These things are like tide prediction engines: the relationship between the bottom layer and the top layer is nonlinear now, and there are a bunch of intermediate parts which are hard to have intuitions about. But you can understand these models, as long as you resist the urge to understand the parameters directly. That sounds bizarre, but you want to understand the prediction space; you understand the parameters by looking at their effects on prediction. So how do we build these things? I just went through the sermon on maximum entropy: you pick an outcome distribution. This turns out to be pretty easy in practice. You're not going to be solving some Lagrangian optimization problem like an economist, right? That's how you find maxent distributions, you usually use Lagrangian methods, but you don't have to do that. You just need to think about what, before the data have arrived, you already know about the outcome variable by its very nature. So for example a count variable, which is what we're going to start with (count models are the most useful generalized linear models): count variables are integers starting at zero, guaranteed.
So there are no negative counts, right? A difference can be negative, but a count can't. So from the very beginning you know things about the variable before you've seen the values in the variable, and that constrains the distributions that make sense for it. And so the count distributions that arise are the Poisson, the binomial, and then the multinomial, which is just an extension of the binomial to more than two event types, and the geometric, which I'll talk about later. The geometric is like the exponential, but for discrete counts. We're going to work with count models this week, and then starting next week I'm going to talk about models that I call monsters. They're monsters because you glue together different kinds of distributions with special link functions to do very useful things for kinds of data that arise naturally. The most common sort are things like ranks or ordered categories. There are psychologists in the audience, yeah? If you want to reveal yourselves: Likert scales, is that how you say it, Likert scale? Yeah, okay, someone's nodding. I've never known, and Likert was a person, right? Okay, it's not a location or something, like Likert, Pennsylvania. So there are these things in psychology called Likert scales, which are ordinal integer scales, and they're not metric, though. Typically you'll ask somebody how happy they are today on a scale from one to seven, something like that, right? And what it takes to get a person from one to two might be very different than what it takes to get them from six to seven. And so while they're ordinal, they're not metric: the distances between the different values are not constant. And so these things are really nasty to model. I think usually people just wave their hands and use a Gaussian, right? But we're going to do better. I'm going to show you how to do much, much better than that. But we're going to build a monster to do it, right? That's how you fight monsters.
You make monsters. I'll also show you how to do the same thing on the prediction side. Sometimes you have predictor variables which are ordered categories, and the same thing applies: the distances between the units are not constant, and you don't want to treat them as metric. You can do the same thing on the right-hand side of the equation, so I'll show you both, but this will be next week. And then mixture distributions, which will provide a transition for us into multilevel models. These are cases where we take, usually, a count distribution like a binomial or a Poisson, and then we model one of its parameters as emerging from a distribution where there's heterogeneity. These are called mixture models, and they're really useful. They're super useful, and they bear a lot of resemblance to multilevel models. So I'm going to show you some examples of these mixture distributions, and then, following immediately on their heels, we'll start really doing multilevel modeling, because they'll provide a nice conceptual introduction. Okay, step two in generalized linear models: there's this thing called a link. So consider the Gaussian linear regression pictured here. You're familiar with this now, right? You see it in your dreams, and your dreams are just full of dancing linear regressions. Yeah. Linear regression is super benign, and that's the reason I started the course with it, well, after the globe tossing model, because it has a very special property which no other generalized linear model has: the scientific measurement units on the outcome variable and the parameter for the mean are the same. So for example when we had the height model, height was measured in centimeters. In that example, mu also has units of centimeters, because it's the mean height. Unfortunately, this is not true for any other generalized linear model.
So we didn't notice that there was any kind of friction or problem to solve here. We didn't need this thing called a link. What is the link, and what problem does it solve for us? The much more typical case is something like a binomial model, like the globe tossing model. I'll show a build-up here. You want to connect a linear model to the parameter p, which is the probability of success on any given trial. p is a probability. What are the units on a probability? There are none, right? The heads shaking in the audience are right. Exactly, it's unitless. Probabilities are unitless; all the units have divided out. Right, you folks remember doing scientific notation and balancing your units once upon a time? Yeah, the units cancel out in a probability. But your outcome is a count, and it has units: a count of something, people, fruit flies, something like that. It has units on it. So now the units aren't the same, and we've got to have something that connects the parameters to the outcome scale. And as a consequence of this, the domains that are legitimate on the parameters inside the linear model are not going to be the same as on the outcome. And so the usual thing here is we want to connect this linear model, alpha plus beta x, a typical linear model, to p_i. But we can't just say that p_i equals that, because a linear model can be any real value, but a probability can't: it's bound between zero and one. So we need some function to put in there, where that question mark is, to make it so that this thing obeys the laws of physics. This thing is called the link function, and what we're going to do is wrap the parameter p in some function.
I'll say what it is when we get to that part of the lecture. It constrains it to the right shape: we say some function of the probability is linear. There's some transformation we can do to the probability so that it is linear in these other parameters. I know this is weird, but bear with me. It'll make total sense. It will. You'll love it. Okay, the third step, of course, is you compute the posterior. You know how to do this. With generalized linear models, searching is harder. Ordinary least squares can be used, but actually it tends to be pretty fragile. There's this thing called generalized least squares, which is used a lot in non-Bayesian inference. We're just going to use Markov chains, because we also want to have priors in here, and we want to get a really good approximation and not worry about it. That's why Markov chains were introduced last week. One of the fun things that happens with generalized linear models is that suddenly all of the variables interact with one another. Even if you don't explicitly put an interaction effect inside your linear model, just a bunch of additive terms, they will interact on the outcome scale. Why? Because that's how nature is, right? This is not some bizarre statistical accident. It's a necessary consequence of modeling the natural phenomenon. So let me try to give you an example. Remember, I was a biologist, so my examples are things like lizards and dandelions and stuff. Sorry, I'm a primatologist, so I have to make fun of myself all the time, right? So imagine you're trying to understand the habitat preferences of some real animal, like a reptile. With reptiles, if it gets really cold, their probability of survival is very low. But it can get really hot, and they can live under really hot temperatures, right?
So this is like Australia. Australia right now has been at something like 45 degrees centigrade for three weeks. Yeah, I don't know, all the humans are fleeing on rafts or something, and the lizards will be fine. Anyway, what I want you to see is that on the probability scale, on the vertical here, eventually things get cold enough that you're dead no matter what, right? You can't die twice. This is the basic fact about the physics of mortality. And so if any one variable is going to kill a lizard, then it doesn't matter what the values of the other variables are. That's an interaction effect: when the effect of a predictor variable depends upon the values of the others. And that necessarily arises from what I call ceiling and floor effects in probability outcomes. If it's sufficiently cold, I don't care how much food you give the lizard, it's going to die, right? It's an interaction effect where the effect of feeding is conditional on the temperature. Yeah, and again, that's not a statistical accident; it arises from the phenomenon. You want your model to do this. You absolutely want it to do this. There's a box in the book where I try to show you mathematically the consequence, if you'd like to think about linear regressions this way. If you just think about the rate of change in the mean of a linear regression with respect to any old slope, that means you take the partial derivative of mu with respect to beta, it's a constant. It is the beta coefficient. That's why linear regressions are so nice. Do this with any generalized linear model?
Oh yeah, the chain rule kicks in. You'll be taking that derivative for a couple of minutes, or you can use your computer, and you get a much less nice expression. So for example, in a logistic regression, which is what we're going to do starting today and through Friday, this is the equation for a logistic regression: p equals this thing. I'll explain this to you in a few minutes. If you take the partial derivative of p with respect to x, you get this thing on the right, and you're like, what is that thing? Yeah, well, that's the rate of change in probability as you change x, and the whole linear model is still there. You see it, it has not gone away. That's why everything matters. That interaction arises as a consequence of the compression of the scale. Okay, so let's actually move into doing some good work here. We're going to work with the binomial distribution and model some counts of events. This is like the globe tossing thing from the beginning of the course. What is the binomial distribution for? It's counts of some specific event out of n possible trials. So n is the maximum value it could take, and zero is the minimum value it could take, and there's some constant expected value, conditional on the predictor variables, right? If you change the predictor variables, you get a different constant expected value, but for any specific set of predictor variables, the binomial assumes that there's a constant expected value, and under those conditions the maximum entropy distribution is the binomial. It's the distribution you get by counting the paths through the garden of forking data. So there are two parameters in a binomial: n, which is the number of trials, the number of coin flips, and p, which is the probability of success on any given trial.
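Back to that derivative for a moment: written out for the logistic case, with the same alpha-plus-beta-x linear model, it is

```latex
p = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)},
\qquad
\frac{\partial p}{\partial x}
  = \frac{\beta \exp(\alpha + \beta x)}{\left(1 + \exp(\alpha + \beta x)\right)^{2}}
  = \beta \, p \, (1 - p)
```

The rate of change in probability depends on p itself, so the effect of x is largest near p equal to one half and vanishes at the floor and ceiling. That is the compression effect that makes everything interact on the outcome scale.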
I know you're familiar with this already, because we worked with it at the beginning of the course. The expected value of a binomial is n times p, the number of trials times the probability of success on any given trial, and the variance is np(1 - p). The variance and the mean are not independent anymore, like they were in the Gaussian. And in general, the Gaussian is the only distribution you will work with in your life where the mean and the variance are independent. In most other cases, if the mean gets bigger, the variance gets bigger. It's not quite that simple for the binomial, because if the mean gets really big, you bunch up against one, right? So when is the variance maximized in a binomial? When p is a half. Yeah, those of you who work with genetics and stuff, you're familiar with this, right, with diversity and disease and things. The ecologists know this too. But the lesson I want you to get is that that case where mu and sigma are independent, in a Gaussian, is a really rare circumstance. It doesn't happen with anything else. Okay, so we're going to plug in a linear model and attach it to p. How do we do this? We need a link function. Let me motivate this link function for you. So on the horizontal on this graph, I've got some predictor variable x, and we're going to attach some slope to it, and it's going to be linearly related on some scale that's called the log odds. What are log odds? Well, the odds are p over one minus p, and the log odds is the log of that. That's exactly what it is. And it turns out that if you do this, there's a very nice property.
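The binomial moments from a moment ago, and the claim that the variance peaks at p equal to one half, are easy to check numerically. A minimal Python sketch (the course itself uses R, so this is just an illustration, with n = 9 trials picked arbitrarily):

```python
def binomial_moments(n, p):
    """Closed-form mean and variance of a Binomial(n, p) distribution."""
    return n * p, n * p * (1 - p)

# Mean and variance for n = 9 trials at a few values of p
for p in (0.1, 0.5, 0.9):
    mean, var = binomial_moments(9, p)
    print(f"p = {p}: mean = {mean:.2f}, variance = {var:.2f}")

# The variance n p (1 - p) is maximized when p is one half
grid = [i / 100 for i in range(1, 100)]
p_max = max(grid, key=lambda p: binomial_moments(9, p)[1])
print(f"variance is largest at p = {p_max}")
```

Notice the mean and variance move together: both are controlled by p, which is exactly the dependence between mu and sigma that the Gaussian does not have.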
This is the conventional link. There's a very nice mapping onto the probability scale, where x is linear on the log odds scale. So our whole linear model is defined on the log odds scale, and it will then be constrained to the zero-to-one probability interval on the outcome scale. And in Chapter 10, I show you that this is not some ad hoc assumption. This relationship arises from the maximum entropy derivation of the binomial distribution, which is a really cool thing, I think. So in machine learning, they call this the maximum entropy classifier. They don't call it binomial regression. It's the same thing; these are different literatures, with different derivation histories. So analytically, let me show you what this looks like. This is the binomial model, the way we're going to write them: y_i is distributed binomially with number of trials n and probability of success p_i on each trial i. Then we write this link function, logit. Logit means log odds. The log odds of p_i is equal to some linear thing, alpha plus beta x. So p is the probability scale, over there on the right-hand graph, and this linear model thing is the log odds scale, on the left-hand graph, and they're connected through this logit function. And what is the logit function? It's log odds. So let me show you what it looks like. It really is just log odds. Remember I said the odds are p over 1 minus p. Those are the odds, right? Anybody here do betting and gambling? You shouldn't, shame on you. But if you do, you know all about odds, right? Odds are really handy to think in. If you measure stuff in odds, you can use Bayes' formula intuitively really fast, because you do these multiplications and make the odds adjustments. It's like I'm teaching you how to gamble. Don't listen to me. So, the log odds are just the log of the odds. That's all it is. That's what the logit function is, and we're saying that's linear. So how do you get back to the probability scale?
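Written out cleanly, the model on the slide is (with y_i the count observed on trial i):

```latex
y_i \sim \operatorname{Binomial}(n, p_i)
\qquad
\operatorname{logit}(p_i) = \log \frac{p_i}{1 - p_i} = \alpha + \beta x_i
```

Inverting that second equation for p_i is exactly the algebra that comes next.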
You just use algebra and solve for p, and then you get this thing, which we're going to call the inverse logit. But it's also the logistic function. If you're an ecologist, you know this as the logistic growth function, right? It shows up in all kinds of cases. So this is the conventional link in a binomial GLM, because it has maximum entropy properties and lots of good mathematical properties. For users like yourselves, you want some intuition about how it works, because this log odds scale is a metric scale, and you want to relate it to the probability scale. So this is a graph to help you understand this. On the horizontal axis, we've got the outcome scale of probability: zero on the left, the event never happens; one on the right, the event always happens; 0.5 in the middle, it happens equally often, yes and no. On the vertical scale, we've got this alien log odds thing, right? On the log odds scale, zero is 0.5 on the probability scale. So that's your anchor point: a log odds of zero is an equal chance on the outcome scale. But as the log odds get smaller, the probability goes down towards zero, and as they get bigger, it goes up towards one. On the log odds scale you can go to minus infinity, and you can go up to positive infinity, but the probabilities will stop at zero and one. There's this compression effect between the two. So I want you to understand that you need some anchor values to think about it. A log odds of one is about three-fourths of the time, and minus one is about one-fourth of the time. Yeah, close enough for government work, right?
A log odds of three is 95 percent of the time, and minus three is 5 percent of the time. A log odds of four is always; a log odds of minus four is never. A log odds of five is really always; a log odds of minus five is really, seriously, I'm serious this time, absolutely never going to happen. So this turns out to be really important for defining priors; we'll get to that in a second. When you put a prior on the log odds scale, you need to interpret it on the outcome scale, and that's tricky. But you already know how to do it, because I taught you how to do prior predictive simulations, right? So we're going to do that for this, and avoid all the disasters that could arise. Okay, this is just a summary slide for my logit link lesson. We use this thing because it's the natural link inside the probability formula. The log odds is, in a sense, the fundamental parameter of the binomial distribution. Again, there's this box on pages 313 and 314 where I show you this: without any assumption, this weird link function arises naturally in the derivation of the binomial distribution. There are other cases, though, where you want a different link, and those are equally justified by the natural processes you're modeling. Common ones would be the probit, which is very common in economics, because economists always want to do things differently, I think they're cool kids, right, Jeff? And the complementary log-log. There are big, legitimate literatures which use these links. I am not going to use them in lecture here, but that's not because I think badly of them. I just wanted to let you know. If you've got a scientific model, you can nearly always derive the link automatically, and I'll have an example of that when we get to the Poisson models. I'll show you an actual scientific model where the link function emerges just from the basic science. Okay, I've got ten minutes, though, and I want to talk about chimpanzees. I always want to talk about chimpanzees, right?
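Those anchor values are worth burning in, and they're easy to compute for yourself. A small Python sketch (the course itself uses R; this is just an illustration):

```python
import math

def inv_logit(x):
    """Map a log odds value back to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))

# Anchor values from the lecture: 0 -> 0.5, +/-1 -> roughly 3/4 and 1/4,
# +/-3 -> roughly 0.95 and 0.05, +/-4 and beyond -> essentially always or never
for lo in (-5, -4, -3, -1, 0, 1, 3, 4, 5):
    print(f"log odds {lo:+d} -> probability {inv_logit(lo):.3f}")
```

The table this prints makes the compression effect concrete: moving from log odds 4 to 5 barely changes the probability, while moving from 0 to 1 changes it a lot.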
But especially now. So let's get an example data set to motivate this. You're only going to learn these things through action, right, really processing some data. So in the rethinking package there's a data set that comes from a published experiment looking at the prosocial tendencies of our close relatives, the chimpanzees. Here's the setup of the experiment. What you're looking at on the left is my bad drawing of the experimental apparatus, and on the right you've got photos of what it actually was. With these, I think I can explain it to you. What I want you to see on the left is this: imagine that you're a chimpanzee sitting at the close end of the table, looking out across the table, and there are these two levers in front of you, one on the left and one on the right. If you reach out and grab one of those levers and pull it towards you, it will make the weird accordions in the middle of the table expand out. There are two trays attached to that accordion, and there may or may not be food in those trays. So there are two options. On one side of the table, on the left in this case, there's only food on your side, and the dish on the other side is empty. If you pull this one, food comes to you, and an empty dish goes to the other side of the table, where a conspecific may be sitting. That conspecific has no levers. They are helpless. They're at your mercy. And if you pull the other side, there's a so-called prosocial option, where there's food on both ends. If you pull that one, both of you get a snack. These may be grapes; chimpanzees will do anything for a grape. Yeah, maybe not anything, but a lot. They really like grapes. So what we're interested in is whether chimps care about this distinction. What's tricky about doing an experiment like this, of course, is that it's not enough just to do the experiment as this table is set up.
That's because they may just be attracted to more food and pull the right-hand side, in this case, because there's more food on that side. Even though they only get one of the items, they might pull the right-hand side because, well, there's more food there. If you've hung out with a chimpanzee, you realize the risk of this, right? Or a human child, very similar: they always point to the bigger pile. And so one of the experimental treatments is to remove the partner from the other end. It's not just that we're interested in whether or not they pull the right-hand lever in this case, or rather whether they pull the lever that's associated with the prosocial option, because that'll be counterbalanced left and right. Chimpanzees are handed, like people; most of them are right-handed. And so you have to adjust for handedness. But you also want to know the difference, the interaction effect: you want to know whether they pull the prosocial option more when there's another individual at the other end. Does it make sense? I think it's a clever experiment. This is cool. And the chimps got a lot of grapes, so it's all good. So, to summarize: two conditions. There's the partner condition and the alone condition. In the partner condition, the other individual is at the far end; in the alone condition, the far end of the table is empty. Two options: the prosocial and the asocial option, which are counterbalanced left and right across trials. Each focal chimpanzee does a bunch of different trials on different days. And then there are two outcomes you can observe: they pull the left lever or the right lever. We want to predict this outcome as a function of the condition, that is, the total treatment that the individual found themselves in on that trial, so that we can figure out whether chimps prefer the left lever when the partner is present and prosocial is on the left. This is an interaction effect. So I get to teach you binomial regression and interactions all at once.
Yeah. Here's how we're going to code it. Let's take all the possible treatment combinations and make them into an index variable. So there are four possible distinct unordered treatments, and we're going to number them one to four. Number one: the prosocial option is on the right, and there's no partner at the other end of the table. Number two: the prosocial option is on the left, and there's no partner. Three: it's on the right, and there's a partner. Four: it's on the left, and there's a partner. These are the four different possible treatment combinations, and we want to estimate the tendency to pull the left lever in each of these and use that to figure out if there's an interaction effect, right? So, the linear model on the left: the only novel part of this model is the binomial part and this logit thing. The rest is ye olde linear model, right? I've got a vector of alpha parameters, one for each actor. You're going to see this is super important. There are repeat measures on actors, and actors have handedness preferences, so alpha is measuring handedness. It's like adjusting for the back door through handedness, is what you're doing in this case. And then we have a vector of four beta parameters, one for each treatment. I leave the priors to be determined, because there are several slides about that coming up. Does this make sense? Yeah. Oh, I wanted to say here: I'm going to write Binomial(1, p) in this course, but sometimes you'll see that written as Bernoulli(p). The Bernoulli distribution is just a binomial with one trial. It's the binomial named after a Swiss mathematician; I think Bernoulli was Swiss, someone check. Okay, how do we do priors? Priors in GLMs behave in very counterintuitive ways, and so I think the only responsible thing to do is prior predictive simulation, to see what the implications of the priors are on the outcome scale, before you fit the model to the data.
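As a sketch of that coding step (in Python rather than the course's R; the factor names prosoc_left and condition mirror the chimpanzees data in the rethinking package, where each is a 0/1 indicator):

```python
def treatment_index(prosoc_left, condition):
    """Combine two 0/1 factors into a single 1..4 treatment index.

    prosoc_left: 1 if the prosocial option is on the left lever
    condition:   1 if a partner is present at the far end
    """
    return 1 + prosoc_left + 2 * condition

# 1: prosocial right, no partner    2: prosocial left, no partner
# 3: prosocial right, partner       4: prosocial left, partner
for condition in (0, 1):
    for prosoc_left in (0, 1):
        print(f"prosoc_left={prosoc_left}, condition={condition} "
              f"-> treatment {treatment_index(prosoc_left, condition)}")
```

An index variable like this lets the model estimate one parameter per treatment, instead of forcing an additive structure on the two factors.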
So let's consider the basic, skeletal version of the binomial regression, where the linear model is just some alpha parameter, some intercept, and this will be the average log odds of the outcome. That's all alpha is going to mean: the average log odds across all trials. What kind of prior do we want to set on that? So let's say we put a Gaussian prior on this thing. That makes sense: alpha is on the real number line, so we can assign a Gaussian to it. And zero makes sense for the mean, because zero means a half. So if you want to locate the prior at a half, so that the event is neither more common nor less common than chance, that makes sense. But what about the scale? I put this omega in here to say that's our choice. We have to pick an omega. What happens when you pick omega? So let's say we pick something seemingly benign, like 10. That'd be weakly regularizing in a kind of linear regression, depending upon the scale that you use. Let's do prior predictive simulation with this. All the code to do this is in the book, and you know how to do prior predictive simulations already. What happens is what I show you here. On the bottom axis of this plot, we have the probability scale, that is, the outcome scale of the event. This is the prior probability that the chimpanzee pulls the left lever, from that model. The black density curve is the prior where you assign alpha a Normal(0, 10). This looks really strange. Why is it like this?
Because a Gaussian distribution with a standard deviation of 10 has huge amounts of mass outside of log odds 3 or minus 3. It's piling up almost all the prior probability, saying either it never happens or it always happens. Do you love this? I can tell you, I was loving this. Even though zero is the point of highest probability in the Gaussian, most of the mass is out in the tails, outside the extreme log odds intervals, because remember, four means always and minus four means never, right? And a Gaussian with a standard deviation of 10, think about that: that's the standard deviation, so 95 percent of the mass is between 20 and minus 20. Yeah, so almost all of it is extreme. So what happens when you transform it to the probability scale is you get these spikes at zero and one. This prior thinks either it always happens or it never happens, and that's not what you want to assume, I think, right? Probably not what you want to assume. This is a very bad default. It can get worse. Lots of people who run Bayesian binomial regression models will put a standard deviation of a hundred in there, and this just makes it really absurd. This prior is not harmless. It can do a lot of harm, actually. So you want to use something that's actually sensible. I'm going to adopt this convention. It's just about as flat as you can possibly get. I think we probably would want to regularize a little more than this, but I'm going to adopt the heuristic position of just having something flat on the probability scale, and that would be a normal with a standard deviation of 1.5. Then it's basically flat on the probability scale. Yeah, but it's definitely not flat on the log odds scale; it's pretty concentrated around the middle. Prior predictive simulation lets you suss this out. You can figure these things out. Does this make some sense? Are you terrified? Don't be terrified. You just have to simulate. There are tools to get you out of the terror. Okay, it's 11 o'clock.
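The prior predictive simulation just described can be sketched in a few lines. This is Python rather than the book's R code, and the 0.01/0.99 cutoffs are just one arbitrary way to summarize how much prior mass piles up near the boundaries:

```python
import math
import random

random.seed(2)

def inv_logit(x):
    """Map a log odds value back to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))

def prior_predictive(sd, n=10_000):
    """Draw alpha from Normal(0, sd) and push each draw through the inverse logit."""
    return [inv_logit(random.gauss(0, sd)) for _ in range(n)]

# Compare the seemingly benign Normal(0, 10) prior with the flatter Normal(0, 1.5)
for sd in (10, 1.5):
    p = prior_predictive(sd)
    extreme = sum(1 for x in p if x < 0.01 or x > 0.99) / len(p)
    print(f"alpha ~ Normal(0, {sd}): share of prior mass with p < 0.01 or p > 0.99: {extreme:.2f}")
```

Run it and you see the point of the lecture: the wide prior concentrates most of its mass at the always/never extremes on the probability scale, while the standard deviation of 1.5 spreads the mass nearly flat across it.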
So I should stop. When we return on Friday, I will continue this, and we will talk about how we get priors on the slopes as well, through the same sort of prior predictive simulation. And then we'll actually model chimpanzees pulling levers, and that part will be rapid. We've got to do all of this scaffolding first, right? All right. Thank you for your time. I'll see you on Friday.