Hello, everyone. Welcome back. Before we get into the material, some quick tips. I think, hopefully, everybody has gotten Stan installed by now. The two recurring issues that are problematic are these two. On Windows, there's this issue with getting the toolchain installed, maybe your C compiler, and the problem aspect seems to be setting the PATH correctly. The PATH is some text sitting somewhere so that R can find where the compiler is, right? Yeah, it's on here. I mean, I appreciate that, but that path-editor tool is just a path editor. Just run the Rtools installer with administrator rights: right-click on it and run it as administrator, and then it sets everything correctly. Unfortunately, by default, when you get to the end of the quote-unquote wizard (those of you at home, I'm doing air quotes; I don't know why they're called wizards on machines, although it sounds amazing), there's this checkbox to set the PATH variable, and it's unchecked by default for some crazy reason. So you just have to check it. If you run it with administrator rights and check that box, you don't need the path-editor thing. So far, that's worked for everybody.

For OS X, for reasons I don't understand, I only know how to fix it: for some of you, R thinks you still have g++ and keeps calling it, and you don't, because all the new Xcode installations have a way better compiler, clang. Some people insist on calling it C-Lang; I like the sound of clang. So my solution is just to trick your computer by creating a symbolic link. If you execute those three lines one at a time in Terminal, it'll work. It has worked for everybody so far, right? There are a few people here who can verify that this has worked. You do have clang on your system, and this just tricks your computer so that every time it calls g++, it's actually calling clang. So those are the two things that will get you going, and I think almost everybody has gotten Stan working.

The other thing, and this is minor news: over the weekend I updated my rethinking package. I work slowly on an accumulated pile of bug fixes until I decide to publish them. None of these bug fixes are critical, but there are two to take note of. There is this weird issue with the ensemble function: if you put in a model that had nearly zero weight, sometimes it would do something weird, but only if the models were entered in a certain order. Bugs like this are the worst. But I eventually figured out a way to replicate it, figured out what it was, and fixed it. And then there is this thing that pops up for some people in some seasons of the year where map doesn't like linear models that are just a single symbol, basically. That should work better now too; I think I flagged all the weird cases so that they get detected correctly. There are other patch notes that don't necessarily matter to you, but if you're curious, I code in public on GitHub, because I believe sunlight encourages rigor. And this is a great habit when you're doing your own coding: post it up on GitHub, because people learn from your code and you learn from other people's code. It's a good habit to get into, and GitHub is very easy to start on.
It's pretty easy to get going with. So that's where it all is. Okay, let's get into the content for this week. Last week ended with what we could call my sermon on the multiplicity. The sermon on the multiplicity is about how maximum entropy is a general framework for doing Bayesian inference, more general than the Bayesian updating we started the course with. It's just a matter of counting up all the ways different events could arise, conditional on our assumptions, and then betting on the things that can happen more ways. And that's probability theory; that's really all there is to it. The multiplicity is the combinatoric expression that counts up all the ways things can happen, and information entropy is a measure of that multiplicity. So when we maximize the entropy of the distribution, we're betting on the thing that can happen the most ways. That's really all logical probability theory is.

Now, I go through all that to say that there's nothing mystical about information entropy, or entropy in general. Nature produces these distributions too, because there are vastly, vastly more ways to produce the high-entropy distributions than other things. That's why they're called high entropy: it means there are lots of ways to produce them. That's really all it means. So any mystical stuff you were once polluted by, being taught that entropy had to do with order and disorder and stuff like that, that's wrong, or rather it's at best superstitious. Really we're just betting on the stuff that can happen the largest number of ways. There are no guarantees. And the intuition behind it isn't the reason we use it or how we justify it. The intuition nominated it and guided its development, but we continue to use these methods because they work. And they're not the only methods that work; other paradigms also work. This one, I like it because it's logical, so you can do a lot of inspection of the methods in the realm of pure thought, figure out under what conditions they will work, and figure out how to improve them, if you have a mathematical bent to your mind or something like that. The others, and I hesitate to say illogical, because it's not that they're illogical, they have an alternative logic to them, those methods often actually also work. But they're harder to inspect in the realm of pure thought, because they're not optimal in the small world. The maximum entropy Bayesian approach is optimal in the small world: conditional on your model, there's no better way to use the information to learn faster about the true process. And that's nice. So that sermon on the multiplicity is really connecting information theory and AIC and those other things to what we started with at the beginning.

So what is this thing on your screen, then? Well, this takes us further into all of this. We're going to continue now with expanding the types of models we're interested in. One of the things that arises from getting past the linear models we've been working with for half the course now is that the connection between the guts of the machine, the gears and levers that make it all turn and work, and the outcomes, the predictions it makes, gets more complicated in the sorts of models we're going to start learning this week. But we really need these models, so we get a lot from them. To take a historical example, because I love historical examples, this is a tide prediction machine from an 1879 design.
There were two other designs before it by William Thomson, who was later known as Lord Kelvin. The Thomsons were a lineage of scientists, the first lineage, I think, that were knighted and ennobled for being scientists, and named after the Kelvin, the river that runs by their estate. That's where the name comes from, and now we have the temperature scale, kelvin, named after them as well. So Kelvin did a bunch of interesting things, and one of them was making tide prediction engines, which were mechanical computers. They weren't general-purpose computers, because they couldn't compute just anything, but they could compute tides. That's what they were for. There are all of these gears, interlocking with various levers and pulleys, and the whole thing is organized, based upon analytical mathematics, such that it predicts the tides in a certain part of the world. But the connection between the outputs of the machine, the predictions of the level of the tide at a certain time, and all the states and positions of the gears inside is very mysterious. You have to know a lot about the operation of the machine to make it work.

So why do we care about this? I'd love to make models for predicting the tide; that would be fun in another course. But generalized linear models are like these tide prediction machines in the sense that the outcome scale and the parameter scale are really different spaces now. It isn't like the linear models of old, the Gaussian outcome models, where the parameters have the same units as the outcomes. There, there's basically a one-to-one mapping between changes in parameters in the linear model and changes in the mean prediction. There's a reason such models are still common and easy to fit to data. But there are lots of reasons to use models with different kinds of likelihoods, and in those cases, the grinding of any one gear inside doesn't tell you much about what the predictions will be on the outside. They all matter together. They all interact. All of these gears and levers and everything else down here in the bottom of this machine interact to produce a prediction for the tide. So you have to process them together, and that's doom in models like this. Really, doom. I've been telling you it's doom all along, but you haven't believed me, because you could get away without it. I said it was doom, but you were like, "I can read these tables. Screw him." And that was fine. But now you'll see that I'm not so crazy. Well, just a little bit.

So we're going to move past what I jokingly call the tyranny of Gauss, and the tyranny of Gauss is the tyranny of the Gaussian distribution. The Gaussian distribution is the foundation of what used to be called parametric statistics; parametric basically means Gaussian in that usage. In the old days, computers were not so, well, maybe they didn't actually exist in the old days. So if you wanted to do rigorous analytical statistics, you had to use Gaussian distributions to do it, because Gaussian distributions are mathematically convenient. They add: they're additive and they stay Gaussian forever, and that's nice. And epistemologically, using a Gaussian distribution as a likelihood is perfectly fine as long as you're only interested in the mean and standard deviation of some measure. Then it's just an epistemological model and there's nothing wrong with that. That's the maximum entropy interpretation of the Gaussian distribution.
If all you're really willing to say about some collection of values is their mean and variance, then the Gaussian distribution is the distribution most consistent with that information, and that's fine. However, if you're interested in prediction, Gaussian distributions are often easy to beat, because often we know more about the constraints on an outcome variable than that, and sometimes we throw information away when we use only the Gaussian distribution instead. So there's this, I have a little bit of animation with a tear for Gauss there. Took me about 15 minutes to figure that out.

There are two categories of abuse of the Gaussian distribution that arise here. And I should say, before I explain these two, that obviously Gauss is not responsible for this. It's the tyranny of Gauss, but he would not have approved of it; he used other distributions as well. These two forms of abuse arise historically from the inertia of curricula, I think. You can't really blame individuals for this. We're all swept up in the collective curriculum of how stats is taught and practiced, what software will let us do, and what our advisors know how to use and will approve of.

The first is coercion. Coercion is transforming data so that it becomes Gaussian. I would like to encourage you not to do this. We can do better: we can work with things on the scale you actually measured them on, almost always. And there are lots of compromises that arise from coercion. The most hazardous one, the one really to worry about, is when you take counts, transform them to proportions, and then model those as Gaussian. Never, never do this. Why? Because it throws away information: it throws away the sample size. One out of two and ten out of twenty are the same proportion, but you should have a lot more confidence that the second one is actually a half. (I'll put a tiny sketch of this at the end of this rant.) So this is bad news, and there are lots of fields where this is the standard thing. Zooarchaeology, oh my god. In zooarchaeology, the main way to analyze assemblages of animal bones is to construct proportions of different specimens and then run linear regression, and this throws away so much information. So please, please, never do this. I'm going to teach you this week how to do it while keeping the sample size information in. It's easy to do.

The other option is surrender. That's where people realize, okay, I can't really use a Gaussian likelihood here, I should be using something else, but all I've got are these various randomization tests, permutation tests of various sorts, things like Mann-Whitney U tests and Spearman rank correlations. All of those statistical procedures had their place in time. There was a time when that was sort of the best you could do when you gave up on so-called parametric statistics, which meant Gaussian. Now we should bury these things with honors and move on. I don't see why we ever need a Wilcoxon or a Mann-Whitney U ever again. There was a time when they made sense, but we're going to learn better things to do now, and you don't have to sacrifice to the null hypothesis. We can do better than that. What I'm going to pick on here is Mantel tests. Mantel tests in biology are everywhere, and if you ask me some other time, I can justify why that's a problem.
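Here's that tiny sketch I promised about proportions and sample size. It's my own illustration, not code from the slides: with a flat prior, the posterior for the underlying probability behind one-out-of-two is far wider than for ten-out-of-twenty, even though the observed proportion is identical.

```r
# same observed proportion, very different amounts of information
# (flat Beta(1,1) prior, so the posterior is Beta(1 + successes, 1 + failures))
p <- seq(0, 1, length.out = 200)
plot(p, dbeta(p, 1 + 10, 1 + 10), type = "l", ylab = "posterior density")  # 10 of 20
lines(p, dbeta(p, 1 + 1, 1 + 1), lty = 2)                                  # 1 of 2: much wider
```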
I know I've sermonized on this to some of you before, but in the last week of the course I'll show you how to do spatial autocorrelation models, which do what Mantel tests are used for, but in a way that actually makes sense. Mantel tests don't estimate the correlations; they just test that it's zero. What we want to know is how similar these two communities are. We don't need to test the null hypothesis. But community ecology does a ton of this Mantel test stuff, and I don't think it's productive. So we'll try to get past coercion and surrender. And again, I don't want to blame individuals; I want to blame history. The technological advance of statistical methods is faster than the tenure cycle; that's basically how it works. But we can do better now.

Generalized linear models are the beginning of a whole range of different methods. They're not all generalized linear models, but this is our gateway drug: freedom from the tyranny of Gaussian likelihoods, and freedom from the tyranny of randomization tests as a fallback strategy. But with this comes a lot more choices and responsibilities, so we need principles. All that maximum entropy stuff from last week gives us a principle by which to select likelihoods. There are other principles as well, but we're going to review this one quickly here.

As I said, historically there were times when you had a computer like the one on the left there. That was a computer that was maybe a thousandth as powerful as your iPhone, but it took up half a room and generated infinitely more heat. Computing has come very far since then. With a computer like that, you were happy to do a big matrix inversion so you could fit a least-squares regression or something. Now your iPhone is vastly more powerful than the computers that sent people to the moon, absolutely so. And I have a friend who runs R on an iPhone, which definitely voids your warranty, I believe. So you can do a lot of stuff. I started programming when I was young on a VIC-20, which is a computer nobody should know anymore, but the VIC-20 had 8K of RAM or something like that. So this thing blows it out of the water. It's really amazing. And then we have, I think, the pinnacle of technology, the Roomba. There's a whole "hack your Roomba" scene; Google it. You can do some amazing stuff with a Roomba. It's a robot, it's very inexpensive, and it's got a pretty sophisticated computer in it. You can install Linux on it and give it a new brain. So hack your Roomba; it's pretty awesome. I'm serious, the personal robots are coming. And I want to get some drones for data collection, actually. The NSA shouldn't be the only ones that have them.

Okay, so what are generalized linear models? Our goal is still to connect a linear model to some outcome variable. We've had that goal for a while now. Before we move forward with it, I just want to remind you that it would be great if we were working in an area where we had sufficient domain knowledge to get rid of the linear model too. We should always be a little embarrassed by linear models. I use them a lot too, but I'm now trying to feel embarrassed every time I publish something with a linear model, because mechanistically it's crazy, right? We just add these things together, multiply by some coefficients, and call it nature. But it works. They're, as I say, unreasonably useful given how goofy their assumptions are.
But we're going to continue with this unreasonably useful linear modeling strategy. The problem now is a non-Gaussian likelihood function, or let's say the Gaussian likelihood is just a special case. It's not even a basal case, really; it's just one of many choices. What I want to show you this week is that there's a general principle, the maximum entropy principle as I'm going to teach it to you, that lets us choose likelihood functions that are consistent with the information we think we know about the outcome variable. The Gaussian is just one of the special cases that arises from it, and all the others are parallel to it, symmetric to it. But the other cases are harder to think about, and that's why we do the Gaussian first: not because it's in any sense special or basal, but because it's easiest.

So the general strategy with a generalized linear model: first we pick an outcome distribution, and I'm going to say something about this on the next slide. Then we model the parameters in the likelihood function by linking them to a linear model somehow. This is easy in the case of the Gaussian, because one of the parameters of the Gaussian distribution is the mean, mu. But most probability distributions don't have a parameter for the mean, so we're going to have to get a bit more creative about this; there's a tiny numeric illustration of that point just below, and I'll spend a lot of time today on the problem. And then finally we compute the posterior. You guys are pros at this now, because you have Stan on your computers, right? So you can fit things quite well, and this will not be that big of an obstacle. But what's going to arise now is that posterior distributions can be highly non-Gaussian. Even in cases where all your priors are Gaussian, you can get posterior distributions which are substantively non-Gaussian. And it's because of the tide machine phenomenon: there's not a one-to-one mapping between the position of any particular gear and the prediction of the tide; there's a non-linear transformation between the two. I'm going to show you pictures of what happens in these cases to help you understand it, but it manifests itself in routinely non-Gaussian posterior distributions, and that's okay. It's just why we need MCMC, in general, to make things work.

Okay, what do we get out of all this bother? We can model multivariate relationships with non-linear responses. And hey, nature is non-linear. Once something's dead, it can't get more dead. Dead is an outcome, right? So you poison a dead animal, it doesn't die again. So it's not an additive model, and that's how things are. I know it's a silly example, but it's absolutely true: once there's enough of a toxin to kill all the frogs in a pond, they're dead, and you can't kill them more. And that's how nature works. So we want predictive models that can cope with those sorts of things. And later on, when we get to multilevel models in a couple of weeks, these generalized linear models are the building blocks of them. We assemble more complicated models from these pieces. So this is the component strategy, and then it will be turtles all the way down when we get further along.
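Here's that tiny numeric illustration, using the binomial we'll meet again shortly. The numbers are made up: the distribution is parameterized by a number of trials and a per-trial probability, and the mean isn't a parameter at all; it's the product of the two.

```r
# the binomial has no "mu": its mean is the number of trials times the probability
n <- 10 ; p <- 0.3                      # made-up trial count and per-trial probability
mean(rbinom(1e5, size = n, prob = p))   # close to 3, i.e. n * p
```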
So to remind you, step one: pick an outcome distribution. The traditional likelihoods used in generalized linear models are all members of a family called the exponential family of distributions. I'm going to show you some of them in a couple of slides. The important thing to know is that all of the members of this family arise from natural processes, because they have maximum entropy interpretations for routine kinds of constraints. There'll be a slide about that a couple of slides down. Nature makes these things because, given certain transformations of things that happen in the real world, there are vastly more ways to produce these distributions than others. Just like the Gaussian in the soccer field example: remember, the Gaussian distribution arises not because it's the only thing that could happen, but because there are vastly more ways for it to happen than anything else. So this lets us select likelihood distributions from first principles, based upon what we assume about the outcome variable before we've seen the data. Again, there'll be examples of what this means in a little bit. And I want to say before I move on, this gives you all the same choices you get if you read the classic textbook on generalized linear models, where it's pretty ad hoc; they just use intuitive selections of the different likelihood distributions. But it gives you all the same choices, which is kind of interesting, that intuition can lead to the same choices that I'm justifying here on a general principle of inference, a principle that also replicates Bayesian updating.

The thing I want to warn you about at the bottom here is what I jokingly call histomancy, which is also a thing built into old biometric curricula. The idea is that you pick your likelihood, decide whether you get to use a Gaussian likelihood, by testing whether your outcome variable is Gaussian. Do not do this. There are a couple of reasons. The first is that it's sort of irrelevant what the manifest variable looks like, because all linear regression actually assumes is that the errors are Gaussian, not that the whole aggregate dataset is. Why? Because there are a bunch of predictor variables that change the means of the individual cases. The only Gaussian part of it is the residual; the residual is what's Gaussian. So those tests where people plot it up, look at the QQ plot, or do what I call the vodka test, the Kolmogorov-Smirnov test (you should only do it when you're drunk), to see if it's Gaussian: that is just not the right test, even in terms of what the model wants. It doesn't matter, because the outcome is a mixture. And second, it doesn't rely upon any kind of principle of inference. Actual outcomes are mixtures of a bunch of individual cases generated by different values, different states the system was in at the time. So there's no reason the aggregation of the raw outcome variable should look like anything in particular; you can't conclude anything from that. I'll show a little simulation of this point below. But this is a tradition which was not invented in statistics. It was something homegrown in biology, I think, and maybe psychology, and it spread from there. I know some of you have been told to do this by your committees, and so I'm giving you some counter-authority: you can wave the little flag that Dr. McElreath said you should never do this, because it's histomancy. Just leave them to squint at you and send them my way. And again, I don't blame individuals. There's inertia, and scientists have other work to do. So I'm very sympathetic to that.
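Here's the little simulation I mentioned, my own sketch rather than anything from the slides, showing why the raw outcome needn't look Gaussian even when a Gaussian likelihood is exactly right: only the residuals are assumed Gaussian.

```r
# a binary predictor shifts the mean, so the aggregate outcome is a bimodal mixture
set.seed(1)
x <- rep(0:1, each = 500)
y <- rnorm(1000, mean = 2 + 3 * x, sd = 1)
hist(y, breaks = 40)                # bimodal: fails any "is it Gaussian?" eyeball test
hist(y - (2 + 3 * x), breaks = 40)  # the residuals, which really are Gaussian
```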
But no statistician likes this method, so we need principles instead. All right, just a quick reminder on the sermon on the multiplicity. The maximum entropy perspective is due in large part to Jaynes, although it exploded into a big area of research; he did the most to get it going. We use it because it gives us the distribution, the family of distributions, most consistent with our assumptions while importing no other information. The other way to say it is that it's the most conservative distribution consistent with the stated information. Bayesian updating, as I asserted before, is a special case of maximizing entropy, although in that context it's usually called minimum cross-entropy. What you're getting is the posterior distribution that has the least divergence (that's an information-theoretic term, remember?) from the prior, while still being consistent with the data. So the posterior distribution is the distribution that has changed the least from the prior, in information-theoretic terms, after seeing the data. It's conservative learning, but it gets all the information out of the data and the model. And you can recover all of Bayesian updating from this maximum entropy perspective. But it also does more, because you can put in moment constraints, which is what we're going to do with our outcome variables.

So let me give you an idea of what's going on. Typically you think about the constraints on an outcome variable, or any kind of random variable, and then find the maximum entropy distribution that corresponds to them. Last week I showed you a couple of examples; I kind of proved to you, in rhetorical fashion, that the Gaussian is maximum entropy for certain constraints and the binomial for others. Here is a summary to remind you of what we did before. If all you know about a random variable is that it's a real value within some interval, then the maximum entropy distribution is uniform. If you use any other distribution to represent your uncertainty about the values of that variable, you're importing some other constraint. You're assuming something else, and you can figure out what that other assumption is. If the constraints are just that it's real-valued and there's a finite variance, the maximum entropy distribution turns out to be Gaussian. You don't have to know what the variance is; you just have to assert that it's finite, and then the Gaussian distribution is the thing to bet on. And remember, entropy is, what would they say, location invariant, so the mean doesn't matter. If you shift the Gaussian distribution around, it doesn't change the entropy; that's why we're not talking about the mean. You get the mean for free, basically. And for binary events with fixed probability, which is what we ended with last week, you get the binomial distribution, which, remember, we derived kind of ad hoc at the beginning of the course just by counting up marbles given our assumptions. That's still all we're doing. All maximum entropy does is count the ways things can happen according to our assumptions, and the distribution that can be realized the largest number of ways is the one we bet on. Here it's the binomial. But it's also the one you get just by brute-force combinatorics, by counting everything up.

Great question. The question was, what would be an example of something where you don't have finite variance? A Cauchy distribution has infinite variance. There's a whole Lévy family of distributions that have infinite variance. What happens with those distributions is that at any moment you could sample a value that completely changes the mean of the sequence so far; they have really thick tails. That's one way to think about them.
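A quick simulation sketch (mine, not from the slides) of what that looks like: the running mean of Gaussian samples settles down, while the running mean of Cauchy samples keeps getting yanked around by extreme values.

```r
set.seed(1)
n <- 1e4
plot(cumsum(rcauchy(n)) / (1:n), type = "l", ylab = "running mean")  # never settles down
lines(cumsum(rnorm(n)) / (1:n), col = "blue")                        # converges to zero
```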
Power laws are the classic example that people work with a lot, power-law distributions. In the practical real world it's not infinite; it's just big enough that at any moment you could sample a value that overwhelms the empirical mean so far, so the mean doesn't converge to anything, and that's the consequence. If the mean is convergent over time, then you have finite variance. Does that help a little bit? Yeah.

The Cauchy is a favorite one of Bayesian statisticians because, for us, it works fine. If you need sufficient statistics, then it's a disaster. And so in the old days, when baby boomer statisticians fought the Bayesian wars, the Bayesians and the anti-Bayesians, the Cauchy distribution was like the glove that would get thrown on the ground. Like, oh yeah, I've got a Cauchy distribution for you, let's see what you can do with that. I think it was just infinite silliness, but it's fun to read the history, because they were really nasty in the statistical journals. Oh my god. Maybe I'll post some of it up for you guys sometime. But statisticians, by and large, are over this for the most part, because we think of these as just little robots, right? And are you really going to fight over a family of robots? No. They're all robots. Anyway, the history of statistics is actually pretty nasty; it's interesting. As I mentioned to some of you before, since I'm off on this tangent: in biology, which is the field I mainly think in, the foundations are very secure. We don't argue about the foundations of things very much; we argue about applications. In statistics, everybody uses linear regression, but everyone disagrees about what it means, so it's like a flip of that. You scratch any topic in statistics, ask about basic advice, and people will argue for days about it. Andrew Gelman's blog is great like this: in the comment threads, every time he makes a post about basic regression advice, the thread explodes. You hit the foundations instantly and people are arguing. But post something like a horseshoe prior, and crickets, nobody's talking about it at all, because everyone's like, yeah, I can see that being useful. I think that's curious. It feels really different from my work in evolutionary biology.

The final case, and I'm going to spend a little bit of time on this on the next slide, is the exponential. As you might guess, these distributions are all members of the so-called exponential family (well, not the uniform, but the others are), and the exponential is kind of a basal case. In terms of a generative model, we can also generate all the others starting with random variables kicked out by an exponential distribution, so I want to spend some time presenting these distributions that way, to develop some intuition. When do you get an exponential? If all you want to say about the random variable is that the values are non-negative reals, think of them as positive values, and that there is some mean, then the exponential is the maximum entropy distribution. Any other distribution would assume extra stuff. Yeah, Cody? What about the gamma? That's kind of a great question. Where was the gamma?
The gamma will make an appearance in a little bit, but the short answer is that you get gammas by adding exponentials. Okay, so let me tell you generatively about the exponential. For every one of these maximum entropy distributions there's an epistemological story, which has to do with constraints, and a generative story, which has to do with actual manipulations of physical processes that will produce them. The two are related, obviously, and when you get deep into this stuff like me, it's nice to learn both sides of it. With the Gaussian, the generative story was the soccer field that I gave you early in the course. What about for an exponential? Well, let's think about a machine or an organism, a washing machine or a squirrel, whatever. There are a bunch of components to these things. Sorry, squirrels have components; don't look at me like that, squirrels do have components. I did vertebrate zoology, I dissected a squirrel, and I'll tell you they definitely have components. But if any component breaks, the machine breaks. This is a thing about washing machines: often cheap washing machines last longer than expensive washing machines. This is a fact in the consumer appliance sector, because expensive washing machines have a bunch of extra doodads to break, and then the thing doesn't work anymore. Simple washing machines don't have the doodads, and they last a long time.

So here's a simple generative case that's going to give us an exponential distribution of failure times, of life spans, which is observed very commonly in machines and organisms. Assume a constant chance of failure of any component on any given day of the year. We can run 10,000 simulations, over 10,000 washing machines or squirrels, with one line of R code: we just repeat something 10,000 times and take the minimum, that is, the first day that any one component breaks. n equals one is a machine with one component, and we record the days from today to the end of the year. If you run it at n equals one, you get the graph down here in the bottom left, with our one-component machine. Notice that the life spans are spread across the whole year: the machines are dying evenly across the year, because with one component there's a uniform distribution of breakage days. Makes sense. Now let's add more components. If you change that n equals one to n equals three, you get this distribution of failure dates instead. It's not quite exponential yet, but you can sort of see where we're going: things break sooner, because it only takes one component to break and then the whole thing's broken, so the distribution falls. And by the time we're up to ten components, it's almost exactly exponential. So exponential distributions do appear all over the real world, as a consequence of generative processes resembling this one. And it turns out that for these particular transformations, the only information preserved about the underlying process is its mean, the mean failure time, and what results empirically is an exponential, because there are vastly, vastly, vastly more ways, in the multiplicity sense, to get this than any other distribution of failure times. It's not magic, it's just combinatorics, which may seem like magic, I appreciate that. So now we have our friend the exponential; the exponential is kind of the basic case here.
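Here's a minimal sketch of that simulation. The code on the slide may differ in details, but the idea is the same: each component fails on a uniformly random day, and the machine dies on the day its first component fails.

```r
# lifespan of a machine = day its first component breaks (uniform breakage days)
sim_lifespans <- function(n_components, n_machines = 1e4)
  replicate(n_machines, min(runif(n_components, min = 0, max = 365)))

hist(sim_lifespans(1))    # one component: lifespans roughly uniform over the year
hist(sim_lifespans(3))    # three components: piling up toward early failures
hist(sim_lifespans(10))   # ten components: very nearly exponential
```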
You can use the exponential in models as an outcome, just as it's notated there. It has one parameter, usually called lambda, which is the rate of failure, and the R code at the bottom of the graph is the density function in R, which is dexp. If you count events emerging from an exponential process, you end up with a binomial distribution: a certain number of events coming out of the process, like deaths. We'll go through this story again a little later today. We've already talked a lot about the binomial. We're going to learn, if not today then on Thursday, the Poisson distribution, which is a special case of the binomial. It has the same maximum entropy interpretation, but it's the special case where there is a very, very large number of trials and a very, very low probability of success on any particular trial. In that case you only need one parameter to describe the shape of it, which simplifies the statistics a lot, and the Poisson distribution is very useful for modeling counts without any clear upper bound.

There are different ways to transform between these distributions. Going back to the exponential: if you add exponential deviates together, you get another distribution constrained to the positive reals, called the gamma distribution, which is also a fundamental distribution of displacements, like the exponential, a time until an event. One way to think about when gamma distributions arise is when more than one thing has to break. You've got components of a machine or organism, and if any one component of a subsystem breaks, the whole subsystem breaks, but multiple subsystems have to break before the whole machine fails; then you have gamma-distributed latency. Age of onset of cancer is almost perfectly gamma distributed, almost certainly for this reason: a bunch of cellular repair mechanisms have to break, and then an immune system defense has to break, and if all those things break in the right way, then cancer can get going. But you have a bunch of defense lines, and the longer you live, the greater the chance that those things will all have broken, and so this distribution arises in age of onset of cancer quite reliably.
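A little sketch of the "adding exponentials gives a gamma" claim (the rate and the number of stages are made up for illustration): if failure happens in three exponential stages, the total waiting time is gamma distributed.

```r
set.seed(1)
waits <- replicate(1e4, sum(rexp(3, rate = 0.5)))     # three exponential stages in sequence
hist(waits, breaks = 50, freq = FALSE)
curve(dgamma(x, shape = 3, rate = 0.5), add = TRUE)   # the matching gamma density
```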
Is that a hand? Yeah. The question is, what if you observe those components, if there are variables for those subsystems that we think we can measure? Sure. So you're describing the situation where there are lots of complicated things going on underneath, for the cancer cells. That's right: if you had measurements on those underlying subsystems, you could bootstrap yourself up to the gamma distribution. You wouldn't need to just make it an assumption; you could get it to emerge from the data. If you're just starting with age of onset and you don't have those underlying measurements, then you can use maximum entropy as the appeal. Empirically, at this point, the gamma is kind of the standard thing described in that literature. So the gamma distribution and the exponential are both really common for this reason. They're good distributions for displacements, durations or distances offset from some reference point in time or space, and they arise a lot in nature as a consequence. And if you take a gamma distribution and increase the mean to be really large, it converges to a normal. In fact, lots of these things will converge to a normal. The binomial can converge to a normal, as can the Poisson: the binomial has to be far from an edge, and the Poisson converges if the mean is large. So the normal is almost at the end of this road. It's the limiting distribution that lots of things evolve towards as you add fluctuations together. But these other distributions are equally important in nature, and we need them as well.

Okay, so to summarize: usually what we want, as practical applied statisticians, is some way to map features of what we've measured onto these likelihood functions. So let me give you a basic key to that. If you have distances and durations, the exponential and gamma are your first go-to ideas. There may be other distributions to use; the geometric is a count distribution that also applies to distances and durations. These are part of applied statistics called survival analysis or event history analysis: survival analysis in biology, because usually it's about survival, and in the social sciences it's usually called event history analysis, because it's about events in people's lives, like divorce and marriage and stuff like that. Time to get hired; recidivism in convicts is an event history model; time to finish your dissertation is an event history model, and I did some work on that. All of these are described by exponential and gamma processes. Time to tenure is a gamma process; it fits a gamma distribution really closely, actually.

Okay, counts: the Poisson and the binomial. We're going to talk about those and have in-depth examples of them this week. At the end of Thursday I'll mention the multinomial, and there's a section at the end of the chapter that presents some computational examples of it. The multinomial is a generalization of the binomial to more than two kinds of events. It's really useful as a classifier, sometimes called the maximum entropy classifier for that reason. But it's very much like the tide machine, so when we get there I'll say some things about that: it's easy to use and hard to interpret, unless you push predictions out of it, and then you can make sense of it, as usual. And then the geometric distribution is also a common and useful count distribution that we won't spend time on, but again, in the notes there's a section with a computed example of it.

Next week we're going to do what I call monsters, which are models that are kind of cobbled together from bits of GLMs to handle weird measurement scales.
In the social sciences, for example, there are these ordered categories, which people work with a lot, because you give people questionnaires that say, how much do you like ice cream, and they give you a number from one to seven, and you're like, what is that number, right? So I'm going to show you how to cobble together what I call a statistical monster that models those things very effectively, the ordered categorical model, but that will be next week. Ranks and ordered categories are weird measures because they're transformations of more basic measures, and they lose information. We'll also look at mixture models next week, cases where you mix together different distributions at different levels of the model. These will be our first multilevel models, although I won't necessarily describe them that way, and they're ways to get heterogeneity of process into the model. We're going to focus on zero-inflated processes next week, I think, because those are the most useful: cases where certain outcomes, like zeros, can be produced by multiple processes, and you want to model all of them. The most familiar case of this, for an ecologist, is if you're walking transects and trying to count owls or something like that. There could be zero owls there, or you could be bad at counting owls, right? Until you get good at finding them, some of the zeros are measurement error, and you've got to account for that. That's what we call zero inflation.

Yeah, David? The question is, this slide says pick an outcome distribution: for the same type of data, say duration data, would you treat it differently as a predictor than as an outcome? The outcome distribution is for the outcome variable; if a duration is a predictor, that's fine as it is. Okay, question answered.

Okay, let's get to the hard part here. Here's the new stuff. As I said, the relationship between the predictions of the model, the scale of the outcomes, and the parameters inside the linear model is going to change with most of these other distributions. The Gaussian is easy; let me remind you why. It's because the units on the outcome y and on the mean mu are the same. If one is centimeters, the other is centimeters. So you have a parameter on the same measurement scale as the outcome variable. This is a luxury that is about to end, and you will never see it again until you use a Gaussian model again. It's hugely convenient: this is the tide machine where there's a one-to-one mapping from each gear to some unique prediction, unlike the real tide machine. What we deal with instead in generalized linear models, in the typical case, is some distribution like, say, a binomial, where the outcome is a count, zero or a positive integer, but none of the parameters is the mean count. You have a parameter n for the number of trials, and you have a parameter p for the probability of success on any given trial, but there's no parameter for the mean. The mean is the product of n times p; that's the expected value. And usually we like to link our linear model to p, the probability of success on any given trial. So now your parameters are on some probability scale (that's what we're going to spend time figuring out today, what that scale will be) and the outcomes are counts, so they're not the same. That's why I put the question mark here: p sub i is something related to that linear model, but it can't be equal to it, because it's a
probability, and the linear model is an unbounded real number: it can go from negative infinity to positive infinity. This graph over here is meant to show you that as you change a predictor x, the linear model can go below zero or above one, and probabilities can't do that. So what do we do instead? Well, we do what we always do in math when we have a problem like this: we just make up a function. We're going to say some function of p is equal to the linear model, and it's our job to figure out what function is useful. That function is going to be called a link function, because it links the linear model to the parameters, the parameters that describe the shape of the likelihood, and constrains them to the right space.

Okay, so how do you choose these? You can use canonical or natural links; that's the classical way to do this. What we mean is, yeah, sorry, from my childhood there was this TV show, which was a form of animal abuse, where they dressed up chimpanzees as spies. It's like a precursor to Archer, but a lot cleaner, and also crueler. Anyway, it's on YouTube; go look it up sometime, the intro sequence alone is weird. Anyway, that's why it was there. So canonical or natural links are link functions that you get by factoring the likelihood function. All the exponential family distributions can be factored in a standard way, and then there will be this term with the parameter in it that has a function nominated for you, and that's the so-called canonical link. We don't care about those, is what I want to say, and the reason is that the canonical link is often bad: it's often hard to work with, famously so in the case of the exponential and gamma. There the canonical link is the so-called inverse link, where you take the linear model and compute one over the linear model, and that gives you the mean. But that doesn't constrain it to be positive, and means still need to be positive for exponentials and gammas, so it doesn't solve the problem we want to solve. So the canonical link is not actually a good principle to work with. As usual in modeling, you just make assumptions, see if they work well, discard the ones that don't, and use your intuition to come up with new ones that might work, and then test those.

The most common workhorse link functions, and we're going to focus on these and how to interpret them, are these two. When you want to constrain a parameter to the zero-one interval, or any bounded continuous interval actually, because you could rescale it to a different interval, you use what's called a logit link. If you want to constrain some parameter to positive real values, then the log link is the standard workhorse. The log link works great, but it's got some drawbacks. Let me spend a little time now, before we get into a modeling example, introducing you to these two link functions and showing you what they do to the linear model space, and then we'll have examples where we work through them and interpret predictions and such.

So again, we use a logit link when our goal is to map some linear model onto the zero-one interval, like a probability in a binomial model. Let's start with the graph shown on the left of this slide. On the horizontal we've got some predictor x, just some predictor you're going to use, and on the vertical we've got the space the linear model lives on. In the logit link, the units of the linear model are log-odds. That's what logit actually means, log odds; I'll show you on the next slide in detail,
and that's your model, alpha plus beta x. The logit link maps that linear model onto the probability space by squeezing different parts of the linear model in different ways. So what I've done here is apply the inverse link function, which I'll also explain on the next slide, so that we go from the linear model over here to the same function in the outcome space: the implications of that particular linear model on the probability scale, over here, as x changes. When x is at very low values here, your linear model is at about minus two, and the probability, if you follow the lines, ends up down here, which is a pretty low value, I don't know, 5%, 10%, something like that, 5% is more like it. Notice that when the log-odds are zero, you're at one half. So that's your anchoring point: log-odds of zero means one half. You'll see why on the next slide, when I show you the log-odds formula. Then as the log-odds increase, you approach a probability of one, but with diminishing returns; you can never quite squeeze out that last tiny bit of probability. And likewise in the other direction: as the log-odds get increasingly negative, toward negative infinity, you approach zero probability. That's the log-odds space.

In notation, you write these models this way: the outcome y is binomially distributed with n trials, the probability of success on case i is p sub i, and we say that the logit, the log-odds, of that probability is equal to a linear model of our choosing. What does this actually imply about the definition of p? That's what we want to know. Notice the p's are the scale on the graph on the right, and the linear model scale is the scale on the graph on the left, and there's this squeezing that goes on between them, as you can see. So what is actually being implied here? Logit of p i just means the log of p i over one minus p i. It is literally the log odds: the odds are the probability something happens over the probability it doesn't happen. Colloquially, odds often just means probability, but in Vegas it means the probability it happens over the probability it doesn't, and that's what it's going to mean here. So the log-odds are equal to the linear model; that's what the logit transform is. What does this assumption say about p? Well, let's solve for p; that's how you figure out what it means. I'm confident you guys can do this. You should get out a napkin sometime later today, with your cup of coffee, do a little algebra, and verify for yourself that if you solve that expression up there for p, you get this expression, which is actually the logistic function, just like logistic growth in ecology: simple growth with a carrying capacity gives you logistic growth in yeast in a petri dish. And that's what creates this compression, equal compression on both ends. So we say that the inverse link function is the logistic: if the link function is the logit, then you apply the logistic function to the linear model, and that's how you calculate the probability of success in the model. That's what's going on computationally in your computer, and it's something you will do when you process parameter estimates from a table: the numbers are in log-odds units, so you've got to plug them into the logistic function to get probabilities back out. We'll have examples of that to help solidify it. Are you guys with me so far?
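To make the notation concrete, here is roughly what such a model looks like in the rethinking package's map() style. The data here are simulated just for illustration (the variable names x, trials, y and the priors are made up), not a model from this lecture.

```r
library(rethinking)

# fake data: trials per row, success probability rising with x on the log-odds scale
set.seed(1)
d <- data.frame(x = seq(-2, 2, length.out = 20), trials = 20)
d$y <- rbinom(20, size = d$trials, prob = plogis(-0.5 + 1 * d$x))

m <- map(
  alist(
    y ~ dbinom(trials, p),     # binomial outcome: y successes in 'trials' tries
    logit(p) <- a + b * x,     # the logit link: log-odds of p are linear in x
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 1)
  ),
  data = d
)
precis(m)                      # a and b are reported on the log-odds scale
```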
This is just meant to be the introduction; you're not going to get it until you do it. You've got to get in there and do the kung fu. You can't just watch Jackie Chan do the kung fu; you've got to do it yourself. I actually love working with logit link models, because when you work with log-odds you get really good at them, they become natural to you, and they're always scaled in this particular way. It's nice that if you know the log-odds, you can map them onto probabilities easily, and this helps in choosing regularizing priors, because you can say that a beta coefficient with a log-odds value of 4 is pretty unlikely; that would mean it basically makes the thing always happen. You can do calibrations like that with log-odds. It's easier than even Gaussian models, where you have to account for the measurement scale and decide whether to standardize the damn thing and all of that. It's easier in log-odds space, ironically. Log-odds of 0 means a coin flip. Then as you increase: log-odds of 1 is about three-fourths of the time, minus 1 is about one-fourth of the time, 3 is about 95% of the time, and log-odds of minus 3 is about 5% of the time, going the other direction. By the time you get to log-odds of 4, it means pretty much always, and log-odds of minus 4 is pretty much never. Log-odds of 5 is definitely always, and minus 5 is definitely never. You can think of it that way. And this is going to come up again, because it's easy to push these linear models to values like 100, and if you put that into a logistic you get "definitely always, pretty much forever, always", and that makes flat regions in the likelihood: for large values of the linear model, the data are indeterminate about the parameters. Everybody hits this; it's a classic problem with fitting these models, so we're going to look at some examples of what happens.

Let's look at the log link now, quickly. Having seen the previous slides, I think you'll get this right away. Same kind of thing: on the left-hand graph here is our linear model, and the scale on the vertical is the log of the measurement. We've had some original measurement, like meters, and we've logged it. What the log link implies is an exponential relationship between the value of the predictor and the mean of the outcome. And it really is exponential, because the inverse of the log is to raise e to that value, and it creates an exponential growth process. This is very useful, as long as x doesn't vary over huge ranges. Nothing is exponential forever; eventually it has to stop. I have this joke in another course of mine where I say peacocks' tails can't keep growing: peacocks are not the size of Jupiter. At some point growth reaches a limit; you just can't take in any more energy, or you collapse under your own weight. So over a really big range of whatever predictor, these log-linear models, as they're called, imply exponential growth forever, and that can't be right. So you do have to be careful about this. There was this paper on hurricanes that came out last year, which I may make fun of later, where this problem arose. It's about female-named hurricanes being more dangerous than male-named hurricanes; you may have seen that paper. Yeah, it's a tragedy of a paper. Anyway, they have a log-linear model and they make this prediction that if hurricane Andrew had been named hurricane Alice or something, 100,000 more people would have died, and it's like, no, that's the log link being stupid right there; it just can't be right. Log links are very useful, but you do have to be careful: with all models, they can do nonsense.
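A quick numeric check of both inverse links, using base R (plogis is the logistic function, i.e. the inverse of the logit):

```r
plogis(c(-4, -1, 0, 1, 4))   # logit link: log-odds -> probability
# 0.018 0.269 0.500 0.731 0.982  (the calibration points mentioned above)
exp(c(-1, 0, 1, 2, 10))      # log link: linear model -> positive mean
# 0.37  1.00  2.72  7.39  22026  (exponential, so be careful over big ranges)
```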
Here's something just to show you, since we haven't been using it this way, though there's nothing wrong with it: you can attach a linear model to the standard deviation of a Gaussian distribution. Maybe you think the variance changes in response to a certain predictor; that can happen, a lot of processes are like that. Then you need to constrain sigma to be positive, and putting a log link on it will do that, because it implies that sigma is the exponential of the linear model. Those are often very useful models, and internally, in Markov chains, if you constrain a parameter to be positive, this is often what's implied. So the log link is quite useful on parameter scales too.

Okay, last step: compute the posterior. Throughout the content starting this week, and to the end of the course, I'm going to alternate between using MAP estimation with quadratic approximation and using the Markov chains you guys are running now, to show you when each applies. Sometimes MAP estimation is great, but in general it's never safe for these models, because you've got what I'm going to introduce to you next: ceiling and floor effects. These arise through the squeezing that happens going from the linear model space to the outcome space; it squeezes different regions differently, and that creates, well, fun things for us. Interpretation gets harder, because it's like Thomson's tide machine.

So let me attempt to give you an intuitive reason why I say everything interacts in these models, all the predictors, even if your linear model has only main effects. In practice, over some ranges of the predictor values, the predictors necessarily interact with one another, and it's because of this squeezing effect from the linear model to the outcome space. You can think of this as floor and ceiling effects. In something like modeling a probability of survival, it can't go below zero and it can't go above one. Say we have a predictor for some organism that likes warm temperatures, a standardized temperature, right, and as it gets warmer, toward the chili pepper on the slide, probability of survival gets higher. But it eventually reaches a ceiling, where getting warmer isn't going to make things any better, until it gets so hot it kills it. Eventually you can't survive any more than you already are, and you get diminishing returns. It's bound to happen that way, because surviving is surviving: eventually conditions are so good that giving the organism even more help isn't going to make it live more. And in the other direction as well: say this is a salamander, which is what I had in mind when I drew this, and the temperature drops. Salamanders don't like the cold, so probability of survival starts to drop. Eventually it's dead, and you can keep making it colder and colder and it's still dead. There's a minimum temperature it can survive above, and that's common with lots of organisms. There used to be lots of lab experiments like that, where scientists spent their time trying to find what it takes to kill various organisms. There are textbooks full of this stuff. I always imagine the macabre lab technicians: how many goldfish did you kill today? Two hundred; they don't like acid, it turns out. Just macabre stuff from previous centuries.

Anyway, does it make sense to you guys how this arises? Real data behave this way, so you want the models to behave this way. This is not an affliction of this form of modeling; it needs to work this way in order to be useful. But it makes the parameters harder to interpret, because the parameters are on a linear scale and the effects are not.
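Here's a small sketch of that point (the numbers are made up): the same change in a predictor moves the probability a lot near one half and hardly at all near the ceiling, even though the model contains no interaction term.

```r
# change in probability for a small step in x, under a logit link
p_change <- function(a, b = 0.5, x = 0, dx = 0.1)
  plogis(a + b * (x + dx)) - plogis(a + b * x)

p_change(a = 0)   # near p = 0.5: about 0.012
p_change(a = 4)   # near the ceiling: about 0.0008, same slope b, tiny effect
```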
The consequence of that is that it induces interactions. What I mean is this: an interaction is any case in which the effect of changing a predictor depends upon the values of the other predictors. I'll say it again — an interaction is any case in which the effect of changing a predictor depends upon the values of the other predictors. You can put in explicit interactions with interaction terms, and you'll still want to do that with these models, but you always get interactions in some ranges, whenever you're near a ceiling or a floor. If the temperature is very cold, then it doesn't matter that you take the food away — the thing was going to die anyway — so there will be no effect of starving the salamanders. (This is horrible; I should have realized how cruel this would sound before I started.) Feeding it more when it's warm will help it survive even better, but the effect will be smaller than it would be in the middle range. Mathematically, you can think about it this way, if you prefer. With a linear regression, the effect of changing x is just the partial derivative of the linear model with respect to x, and that turns out to be just the beta coefficient. That's the classic thing about linear regressions — it's so nice — and it arises because there's a one-to-one mapping between the linear model and the outcome. In logistic regression, here's the implied definition of p: it's the logistic of the linear model. If you take the partial derivative of p with respect to x, you get that thing over there — beta over two times one plus the hyperbolic cosine of the linear model, that is, β / (2(1 + cosh(α + βx))). The whole linear model remains inside this thing. It doesn't matter if you don't know what a hyperbolic cosine is — although they're cool; it's the shape of a hanging string, a catenary, actually. The point is that all the parameters are still in there: no matter how big your linear model is, all of it appears in the denominator. So all the parameters always matter — all the predictors always matter — for the effect of changing any one predictor on the outcome. And again, you kind of want that; otherwise the model wouldn't behave right on the outcome scale. But it makes interpretation a little more difficult. We'll work through examples with plotting.
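A quick numerical check of that derivative, with made-up values for α, β, and x — the hyperbolic-cosine form is the same thing as β·p·(1−p), which is largest when p is near one half and shrinks toward the floor and ceiling:

```r
# dp/dx for p = logistic(a + b*x) equals b / (2*(1 + cosh(a + b*x))) = b*p*(1-p)
a <- 0.5; b <- 1.5; x <- 0.3
p <- plogis(a + b * x)
eps <- 1e-6
numerical <- (plogis(a + b * (x + eps)) - plogis(a + b * (x - eps))) / (2 * eps)
cosh_form <- b / (2 * (1 + cosh(a + b * x)))
prod_form <- b * p * (1 - p)
c(numerical, cosh_form, prod_form)   # all three agree
```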
Alright, so here's our game plan, starting today and cruising on through Thursday: we're going to work applied examples of count models, mainly binomial and Poisson. Next week is monsters and mixtures, which will transition us into multilevel models; we'll do varying intercepts and slopes and probably start Gaussian processes. Then in week 10 we'll do measurement error and missing data, which are special applications of the strategies we've covered so far — cases I love, because they blur the lines between what's data and what's a parameter. In the Bayesian approach they're all the same stuff: probability distributions. If you have partial information about a measurement, then it has a probability distribution, and it works the same way as everything else. You can always replace a data point with a probability distribution if you're uncertain about its exact value, and basically everything works the same way. We're going to do cases like that, including measurement error — and the extreme case, a missing data point, is total measurement error: the measurement went so badly that you don't even have it. You would think you can't do anything in that case, but I will convince you that we can still do something, because you still know things about it. It's a measurement from a variable whose other cases you did actually measure, so you have information about the missing values, because you have assumptions about the distribution of that variable. We'll exploit that when we get there.
Okay, let me take you, in the remaining twenty minutes, towards logistic regression and our first detailed worked case, and then I'll give you a couple more logistic regression cases on Thursday, just to populate your mind with a few varied examples. As a reminder of the relationship: if we count up exponential events, we get a binomial. Let me go through that conceptual exercise now, just to drive home that this is not a mystical thing. Fruit flies in a vial — this is pretty macabre — fruit flies in a vial, or graduate students finishing their qualifying exams. We want to know, in a cohort of graduate students, how many will finish their qualifying exams in the first year — which sounds unreasonable; sorry, that came out wrong — prelims like mine, which were horrible and cruel. So: the number who finish their prelims in their first year. Before we get to the gray area on this graph, let's assume the waiting times follow an exponential distribution. It works better with fruit flies in vials, because fruit flies really do die exponentially — they're like hard drives that way. Each grad student completing their prelims is a vertical bar on this graph, and there are ten of them in this example. The ones in the white region — in the first year, the first term — are the successes, and all the ones in gray are not counted as successes. Sorry, that sounded bad; maybe I should go back to fruit flies — is that more reassuring? In the bottom graph I've just added the exponential distribution; it curves down faster because the rate of completion is faster — well, actually it's exactly the same here, sorry, you were squinting like "are they different?" — that's the next slide. So the process is exactly the same across different cohorts of ten individuals: sometimes four will finish in the first term, sometimes only two. It's going to vary, because they're random samples from the exponential distribution — ten random samples from it. If we run this trial with a bunch of cohorts of ten, we'll get a bunch of counts in the white area, and if we plot the distribution of those counts — over, say, ten thousand sets of ten — we get the distribution on the right. It turns out this is exactly a binomial distribution, for all the maximum entropy reasons I went through last week and earlier today; that's exactly why it emerges naturally — there's nothing magic about it. The lambda here is the rate parameter of the exponential distribution; it's one half in this example. And if you lower the rate — here's a process where they're finishing slower, so the exponential is straighter — then you get a binomial distribution that's piled up against zero, because very often none of the fruit flies die in the first hour, and none of the grad students finish their prelims in the first year. That would be sad — so I have to write those prelims later this year, don't I?
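Here's a small simulation of that exercise, with the rate set to one half as on the slide — exponential waiting times, counted within a fixed window, give binomial counts:

```r
set.seed(7)
lambda <- 0.5        # exponential rate of finishing (per year), as on the slide
n_cohorts <- 1e4     # number of cohorts
cohort_size <- 10    # ten individuals per cohort
counts <- replicate(n_cohorts, {
  waits <- rexp(cohort_size, rate = lambda)  # waiting time to finish, one per individual
  sum(waits < 1)                             # how many finish within the first year
})
# The probability any one individual finishes within the window is 1 - exp(-lambda),
# so the counts should match a Binomial(10, 1 - exp(-lambda)) distribution.
p <- 1 - exp(-lambda)
sim_prop <- as.numeric(table(factor(counts, levels = 0:10))) / n_cohorts
round(rbind(simulated = sim_prop, binomial = dbinom(0:10, size = 10, prob = p)), 3)
```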
If instead it happens really fast, then almost all of the fruit flies die in the first hour most of the time, and most of the grad students finish their prelims in the first year, and you get a binomial distribution that's piled up against the maximum. Does this make sense? I'm trying to give you some intuition connected to natural processes. Sometimes, if you want to get fancy, it's worth modeling this exponential process underneath your binomial data, because there can be different censoring at the time interval of observation — your counts are over different durations. We'll talk a little about that on Thursday.
So here's the binomial distribution again. You're familiar with it, but I'm trying to give it some new paint. It models counts of a specific event out of n possible trials. So if there are 10 fruit flies or 10 grad students, anywhere between zero and 10 of them could be observed to have the event happen to them in the interval. There can be more than two kinds of events; your categorization just counts one of them — that's all it means; lots of other things could be happening. We have our count, which is usually called "successes", although I think that's a horrible term — it depends on what you consider a success; fruit flies dying is not really a success for them. There's the number of trials — think of that as the number of coin flips, or the number of marbles you pull out of the bag. And the probability of success, p, is what we're going to attach our linear model to. Someone asks whether you could treat, say, months as the trials, with offspring as the event. That's a hard problem: for humans, obviously, most counts are one — well, I guess two, maybe three — and there'd be lots of autocorrelation between months in a human sample, so I'd want to know more about the biology; I'd be nervous about that. So there isn't one very specific sense of what a trial is — it's easier to talk about with an example. Again, this is my horoscope thing: I think it's very hard in general to say what the trials are or what the outcomes are until you have a data context. Within a data context you can give precise and useful answers to these questions; outside of it, it's horoscopes — Mercury is somewhere and it's an auspicious time to start a business partnership — it sounds good, and that's the best I can do. Anyway, along those lines, important things to know about the binomial: the expected count is n times p, and the variance scales with the mean, so you don't have a separate parameter for the variance around the mean. This is a very important thing about count variables: all count distributions are like this. The variance inflates with the mean, and that's true in nature as well — the bigger the magnitude of a count, the more uncertainty there is around the mean. That's a normal thing about counts.
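Just to make that concrete with a couple of illustrative values (nothing from the slides, only the formulas): the variance np(1−p) has no free dispersion parameter, and it grows along with the expected count np.

```r
# E[count] = n*p and Var[count] = n*p*(1-p): the spread grows with the mean.
n <- 10
p <- c(0.05, 0.2, 0.5)
cbind(p, mean = n * p, variance = n * p * (1 - p))
```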
This is actually kind of embedded, intuitively, in the innate human counting system, which is logarithmic. It seems that human kids are born with logarithmic counting, and it's present in all human languages: one, few, many — that's logarithmic. People intuitively, even if they're not numerate beyond one-few-many, understand that "many" is vaguer than "few" and "few" is vaguer than "one", so the precision goes down as the magnitude goes up, and kids get that innately. The real number line is something you have to painfully force into someone's cortex over time; the idea that you can take any number and add one and get another number doesn't seem to be innate — that kind of precise counting system is harder to come by. But this logarithmic scaling between mean and magnitude is intuitive for people, so I'm trying to activate the logarithmic part of your brain here: one, few, many — think of it that way.
Okay, so we use this to model counts of a specific event out of n possibilities in n trials. The goal is to model the probability, usually as a function of some predictor variables. When n equals 1, this is usually called logistic regression — although there's a lot of slop about the term, and sometimes "logistic regression" just means you have a logit link; it can mean that as well. So we're going to go through some examples. We'll do a logistic regression example first, where the outcome is zero/one and we use a logit link. Then we'll do an aggregated binomial example — that will almost certainly spill over to Thursday — where the outcomes are counts over some larger number of trials: you aggregate together a bunch of zero/one trials. You can do that, and as long as the predictor values are the same across the aggregated cases it'll work fine; you can run the same model on these different forms of the data, and I'm going to show you that it's often quite convenient to do so. Along the way, in the context of the data examples, I'll try to teach you some important things about working with these models. You've already heard about ceiling and floor effects; the other big one is relative versus absolute effects, and I'll punt on saying what that means until we get to the case where I can show you how it manifests.
Okay, here's the data context we're going to work with. We'll start today — I've got about ten minutes to explain the data situation — and then we'll work hard on it on Thursday, go through the code, make it work, and interpret it. The data are already in the rethinking package; they come from a set of behavioral experiments done with captive chimpanzees living in social groups — cases where all the chimps knew one another and hung out affiliatively. These are behavioral experiments meant to test how prosocial chimpanzees are, in contexts in which people are extremely prosocial. People in a similarly tight-knit community would pretty much always give the extra food — you'd probably give your own food away; you guys are so good. So here's the setup. That's my cartoon on the left — I was trying to do better than this; all I really got right was the hair parting. All chimps have that hairstyle; it's true — if you work with them, it's absolutely true. Here's the idea: you're a chimpanzee looking down this table, and that's your social partner at the other end. In one of the conditions there's another individual at the other end; in the other condition there's not. So there's a partner condition and a control condition in which there's no other individual — you're just by yourself at the table. In both conditions the table is set up the same. There are two levers you can reach, one on the left and one on the right. If you pull a lever, it expands this accordion device in the middle of the table — you can see it a little better in these photographs — and there are two little dishes. On one side of the table there's food in your dish and nothing in the other dish; on the other side there's food in both dishes. So if you pull the right lever, in this example, these two trays expand out and both individuals get a piece of food — grapes and things like that, healthy things.
And if you pull the left lever, in this example, only your dish gets a piece of food and the other individual gets nothing. So here's the question. If you do this with people, really young kids always pull the side of the table that gives food to both individuals. There's no cost to you, so there's no altruism here; it's just concern for the other individual that's being tested. You're not giving up anything — we're just measuring whether they're attending to the other individual at all. So we want to know whether, when the other individual is present, the chimpanzee is more likely to pull the prosocial lever. That's what we're interested in. Here's the verbal rundown of the experiment. There are two conditions: partner and alone. There are two options: you can pull the pro-social lever or the a-social lever; the pro-social lever delivers two pieces of food, the a-social lever one. And there are two outcomes — this is the outcome variable — left lever or right lever. We have predictors for condition and option: one predictor tells us whether it's a partner trial or whether the chimp is alone, and the other tells us whether the left lever is the pro-social one or not, and then we predict the pulling of the left lever. That's the way the data are coded. Does that make sense? So what we want to do is predict the outcome — whether they pull the left lever — as a function of condition and which side the pro-social option is on, and this implies an interaction. We want to know if chimps prefer the left lever when the partner is present and the pro-social option is on the left: the effect of the pro-social option on pulling that lever should depend upon whether another chimpanzee is present, and that implies an interaction effect. Does that make sense? I'll show you the model in a second, and you'll see how this gets instantiated.
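Since the data ship with the rethinking package, here's a quick look — just the loading step; the variable names below (pulled_left, prosoc_left, condition) are the ones used in the package's chimpanzees data frame.

```r
library(rethinking)
data(chimpanzees)
d <- chimpanzees
str(d)
# pulled_left : the 0/1 outcome -- did the chimp pull the left lever?
# prosoc_left : was the pro-social (two-food) option on the left?
# condition   : was a partner present (1) or was the chimp alone (0)?
```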
So here's our model. It addresses the question: do chimps prefer the left lever — or rather, how much more do they pull the left lever — when the partner is present and the pro-social option is on the left? In fact, we want the interaction of the two, the dependency. Focus on the linear model line. Notice we've got our logit link, same as before: we're saying that p sub i is the logistic of this linear model. We've got an intercept. Then there's a predictor for whether the pro-social option is on the left or not, while the outcome is whether the left lever was actually pulled. That predictor gets a main effect, which is the change in log odds of pulling the left lever when the left side has two pieces of food. Why might they respond to that? Purely because chimps like food: when they see a side of the table with two pieces of food on it, they may pull that lever. In fact — just to give you a preview — that's what's going to happen. Chimps are attracted to big piles of food even when the experiment is set up so that they don't get the pile; they still point at the big pile. There's a hilarious video online where they set up the experiment that way — you train the chimps so that the pile of food they point at goes to the other individual — and they can't stop themselves from pointing at the big pile. There's one chimp who, as soon as she points at the big pile, smacks herself, because she just can't inhibit it. It's like Homer Simpson. Anyway, why be interested in this? Humans are pretty good at inhibition in that regard; I think it's one of the species differences. So alpha is just the baseline — how much you like to pull the left lever regardless; it's just the intercept. Then beta sub P is the change in log odds when the pro-social option is on the left — it could go up just because there's more food on that side, and it will. And then we've got some additional effect when the other individual, the partner, is present: the C is for condition, some additional change in the log odds of pulling the left, pro-social side. Notice there's no main effect of the other individual being present, because there's no theoretical reason to expect that just because another individual is present you will want to pull the left-hand lever, so we don't put that term in the model. As an exercise, after you've run through the examples this week, I encourage you to put that main effect in and see what its estimate is — go ahead — but on theoretical grounds I suggest we shouldn't put it in the model: there's no reason to think the chimpanzee on the lever-pulling side will want to pull the left lever just because another individual is present. Alright, and then some weakly regularizing priors.
At the top I'm showing you how to fit, in map, just the intercept-only model, as a basic example. These models are defined the same way as before — I'm using map here, although map2stan looks the same; you just swap in map2stan and the code does a bunch of extra stuff, but from your perspective it's very similar. You change dnorm to dbinom; the 1 is the number of trials, because there's one lever pull on each row of the data set; and logit(p) is your linear model. There's one parameter in this model, and there's no sigma, because p determines both the mean and the variance of the binomial process. That's the intercept model. Then there are two other models, to show you the structure. Model 10.2 just has the pro-social-left option in it; this is the model that asks whether the chimp likes to pull the lever attached to more food, whether or not they get it. (The answer is yes, by the way — they do.) And model 10.3 is the research question, the reason the experiments were run: is there an interaction of substance between the condition and having the pro-social option on the left, such that it results in more left-lever pulls? It could be negative, so we've got to check — we'll do the model comparison and check the estimates. Someone asks: when would the number of trials not be one? On Thursday we're going to redo this example in aggregated style, and then n will be 18, actually. After that we'll do a case where n varies row by row in the data, because there are different numbers of trials in every row, and by then I think I will have populated your mind with all the common cases — except, of course, the case where n is what we're modeling. There's a whole useful family of models for that: mark-recapture analysis, which the biologists here will have heard of. I love me some mark-recapture; it's a great family of models and hugely useful in wildlife conservation and game management. There, n is the population size, and what you're counting are things like captures and deaths and recaptures, and n is the target of inference: you don't know it, but it turns out you can get a posterior distribution for it. We're not going to do that in this class, but I just want to say: sometimes what's data and what's a parameter depends upon what you have — that's one way to think about it.
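For reference, here's roughly what those three model definitions look like in map() — a sketch following the chapter's naming (m10.1 through m10.3), with the weakly regularizing priors mentioned above:

```r
# Intercept-only model: one lever pull per row, so the number of trials is 1.
m10.1 <- map(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a,
    a ~ dnorm(0, 10)
  ),
  data = d
)

# Add the pro-social-left option: does more food on the left attract pulls?
m10.2 <- map(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a + bp * prosoc_left,
    a ~ dnorm(0, 10),
    bp ~ dnorm(0, 10)
  ),
  data = d
)

# The research question: does the pro-social-left effect depend on a partner being present?
m10.3 <- map(
  alist(
    pulled_left ~ dbinom(1, p),
    logit(p) <- a + (bp + bpC * condition) * prosoc_left,
    a ~ dnorm(0, 10),
    bp ~ dnorm(0, 10),
    bpC ~ dnorm(0, 10)
  ),
  data = d
)

# Model comparison across the three structures.
compare(m10.1, m10.2, m10.3)
```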
Okay, so with that, I'm just going to put up the next slide so we can resume right here when you come back on Thursday. We're going to interpret this model; I'll teach you how to plot the implied predictions from it, do model comparison with it, and think it through. We're going to keep using all the same tools we've used so far, so in a sense you're only learning a few new things — link functions and likelihoods — and all the previous stuff still holds. But it's going to get trickier. So with that, I'll see you on Thursday.