This very modest river is the River Kelvin in Glasgow, Scotland. Most people will recognize the name Kelvin as a temperature scale: zero kelvin is absolute zero, no heat at all, no molecular motion. The scale gets its name from Baron Kelvin, William Thomson, a Scottish engineer and inventor who devised many things, including tide prediction engines like the one shown here. A tide prediction engine is a mechanical computer that uses an array of gears, pulleys, and other mechanisms to chart out, as you can see on the left of the image, a prediction of the tides over the coming days. This was of extreme importance for all kinds of commerce, because the tides really matter when you're moving things on the ocean. We can make tide prediction engines out of Legos now, as you can see here, because we understand their mechanisms really well. But when Baron Kelvin invented them, they were wonders: computers that did their calculations without error, infallible in a sense. What I want you to see about these machines is that when you look at the positions of the gears, the bits of the machine that do the calculations, on the right of this slide, you can't see the tides at all. But that's the part doing the thinking, as it were, the part of the golem that's doing the thinking. The output, the predictions on the left, are the things that people can understand, and those predictions depend upon all of the gears and pulleys and positions of the different bits of the machine. Generalized linear models, which we met in the previous lecture and will explore more in this one, are a lot like Baron Kelvin's tide prediction engines. They have a lot of gears and pulleys inside them: the parameters, the link functions, and the other bits that make them move and generate useful predictions. We must resist the urge to interpret those pieces, the parameters and such, directly. Instead, we have to push them out as predictions in order to understand what these models mean. So to remind you: in a generalized linear model, the expected value of the outcome distribution of the observable variable is some function of an additive combination of parameters, not a direct linear function of them. The consequence of this is that uniform changes in the predictor, like the x-axis value on the graph on the right, do not result in uniform changes in the prediction. Instead, depending upon the absolute value of the x-axis variable, you get more or less of a change in the outcome. For example, looking at the animation on the right: if the x-value were very low, like minus two, and you increased it by one unit to minus one, the probability of the event wouldn't change very much in this example. But if you were at zero and increased it by one unit to one, it would change much more. That's a general feature of generalized linear models, these tide prediction engines that produce useful custom nonlinear predictions but whose internal mechanisms are often quite confusing if we stare too hard at them. Another feature of generalized linear models is that all the predictor variables interact. If any one predictor pushes the prediction toward zero or one in this example, then the effect of any other predictor will be smaller. In contrast, if the x-value here were near zero, then a change in any other predictor variable would have a larger effect on the outcome. A quick numerical check of this is sketched below.
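To make that concrete, here is a minimal numerical check (not from the lecture; the slope of 2 is just an illustrative value), using the rethinking package's inv_logit:

```r
library(rethinking)
# logit(p) = 2*x : the same one-unit change in x produces
# very different changes in probability, depending on where you start
inv_logit(2*(-1)) - inv_logit(2*(-2))  # from x = -2 to x = -1: about 0.10
inv_logit(2*1)    - inv_logit(2*0)     # from x =  0 to x =  1: about 0.38
```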
All of this affects not only the expected values but, since we're doing Bayesian modeling here, the whole distribution of the predictions, including the uncertainty. The uncertainty in generalized linear models is often highly asymmetric, and we need to pay attention to that as well. The good news is that all the tools you've learned up to this point in the course work the same for generalized linear models. You just have to adapt your understanding to them; your habits are already adapted. Let's pick up with the example from the previous lecture. At the end of that lecture, I had just introduced confounds to this example of gender discrimination in graduate school admissions. We should imagine them, because it's very likely that confounds are present in this system. Here's one I'd like you to consider. One common cause of both choice of department and whether someone is admitted is ability, that is, their qualifications. Ability influences choice of department because applicants are wagering that they need some threshold of qualifications to be a credible candidate for admission to some departments; some departments are harder to gain admission to, that's just a fact. Ability also matters because, of course, higher-ability applicants with better qualifications, as judged by the people rating the applications, are more likely to be admitted. In the dataset we have, there are no measures of applicant ability. We don't have access to their applications, so we can't even get indirect information about it. This is an unobserved confound, but it's still present in the scientific model. The question is: what does it do to our inference if the DAG I've redrawn on the screen here is true? Let's think it through. The first thing we want to do, following our workflow of drawing the owl, is to make a generative simulation, to make sure we really understand what we mean by the unobserved confound of ability influencing both choice of department and admission. So here on this slide is yet another generative simulation, very similar to the one in the previous lecture. Let me walk through it. Again, the first thing we do is simulate the variables that have no parents, that is, no arrows entering them. In this case, that's G and U. We start with G: we sample genders 1 and 2 at random for the 2000 imaginary applicants in this simulation. Then we simulate their unobserved ability values, and for simplicity I'm going to make these Bernoulli variables. This is rbern, the random Bernoulli function, generating 2000 values, and the probability of 0.1 means that 10% of them will take the value 1. What I want that to mean is that 10% of applicants are high ability and the rest are average ability, and we're going to give a bonus in admission to individuals of high ability, those with U values equal to 1. Next, we can simulate the choice of department, because we've simulated its parents, G and U. For simplicity, in this synthetic simulation there are just two departments, 1 and 2. As in the previous lecture, gender 1 tends to apply to department 1 and gender 2 tends to apply to department 2. What I've changed now is that individuals of gender 1 who are of high ability, with U values equal to 1, have a 50% chance of applying to department 2.
The way to think about this example is that department 2 is the discriminatory department: individuals of gender 1 are discriminated against in admission there. But when they're of high ability, they think they can overcome that, and so they're more likely to apply. Now we have to be more subtle about the acceptance rates in this simulation, because we need different acceptance rates for individuals of different ability. Consider the first matrix here, accept_rate_u0. This is the matrix of acceptance rates for individuals of average ability, with U equal to 0. You can see these matrices displayed better at the bottom of this slide; look at the one in the lower left. Rows are departments and columns are genders. In the first row, department 1, both genders have the same admission rate; there's no discrimination there. But department 2 is the discriminatory department, and in that department gender 1 is at a 20-percentage-point disadvantage relative to gender 2. Then there's the acceptance matrix for individuals with high ability, the unobserved U values equal to 1; that's the other blue matrix at the bottom. In department 1 the rates are again equal between genders, but at double the probability of being admitted. In department 2, both genders have a higher probability of being admitted, but there's still discrimination. Let's see what happens. We simulate data from that process (a compact sketch of the whole simulation appears below) and push the simulated data into the statistical models we developed in the previous lecture. Remember, these statistical models are designed to access two different estimands: the first one here is the total effect of gender, and the second, the direct effects of both gender and department. What I want you to understand, and you'll see, is that the second model is now confounded; the first is not. The first model still gets the estimate right: it estimates the total effect of gender, which is that gender 1 is admitted less. You can see that in the precis output at the top. These are on the log-odds scale, so remember that minus 2 is a value closer to zero probability relative to around minus 0.8, or let's call it minus 1 for ease of comparison; more negative numbers mean lower probabilities. But model 2 is confounded. There's discrimination in the simulation, but it doesn't show up well in the inferences of this model. It's going to be hard to see just from the precis output, so let's look at the contrasts, as always. We want to compute contrasts and not rely upon gazing at the gears of the tide machine, as it were. So let's compute the contrasts for department 1 and department 2. What that means is we compute the probability that individuals of gender 1 are admitted to each department and the probability that individuals of gender 2 are admitted, and we compare them. Here I'm doing it on the log-odds scale, but it's still a contrast; we could convert these to probabilities if we wanted to. We do it for department 1 in blue and department 2 in red, with the code there on the left of the slide. What I want you to see is that it doesn't show any difference. There's a very slight disadvantage to gender 1, which we're thinking of as women in this example. But you know from the simulation that department 2 is discriminatory and department 1 isn't. So why do these departments look the same in the analysis?
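Before we answer that, here is a compact sketch of the generative simulation just described. It follows the lecture's description; the exact acceptance rates are illustrative values chosen to match the pattern on the slide (equal rates in department 1, a 20-point penalty for gender 1 in department 2, doubled department 1 rates for high ability):

```r
library(rethinking)
set.seed(17)
N <- 2000                                  # number of applicants
G <- sample(1:2, size=N, replace=TRUE)     # gender, sampled at random
u <- rbern(N, 0.1)                         # unobserved ability: 10% high (u=1)
# gender 1 applies to dept 1, unless high ability (then 50% chance of dept 2);
# gender 2 mostly applies to dept 2
D <- rbern(N, ifelse(G==1, u*0.5, 0.8)) + 1
# acceptance rate matrices [dept, gender], for average and high ability
accept_rate_u0 <- matrix(c(0.1, 0.1, 0.1, 0.3), nrow=2)
accept_rate_u1 <- matrix(c(0.2, 0.3, 0.2, 0.5), nrow=2)
# simulate admission for each applicant, using the rate matching their ability
p <- sapply(1:N, function(i)
    ifelse(u[i]==0, accept_rate_u0[D[i], G[i]], accept_rate_u1[D[i], G[i]]))
A <- rbern(N, p)
```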
What's happened is that the confound of ability has actually hidden the discrimination in this example. Let's unpack this a bit; some of you have already guessed what's going on. This is a case of collider bias. When we stratify on D in model 2, this opens a non-causal path through the unobserved confound U. You can look at the DAG there and see it: D is a collider on the path G -> D <- U -> A. We can estimate the total causal effect of G because when we run model 1 we don't stratify by D, so we don't open the confounding path through U; U doesn't matter. Yes, the causal relationship between D, choice of department, and admission is always confounded by the unobserved confound. But that's not our estimand, and so we can still estimate the total causal effect of G even though the relationship between D and A is confounded. When we stratify by D, though, we turn on a confound by accident, and in this particular simulation that has the effect of hiding discrimination. Here's a more intuitive way to think about it. In this example, high-ability gender 1 individuals, women if you will, apply to the discriminatory department anyway, because they know they're high ability and think they can overcome the discrimination. And they're right. Gender 1 individuals in that department, department 2 in this example, are higher ability on average than the gender 2 individuals in that department, even though they're discriminated against. Because they're higher ability on average, they're likely to get admitted; in fact, most of them are, and their admission rates are quite high. Their high ability in essence compensates for the discrimination. So when we run a statistical model that ignores the ability differences among individuals, it masks the evidence of discrimination. In the previous lecture I had promised to talk more about the policing example, but I ran out of time and decided to spill it over into this lecture, so let me say something about that now. The assault-by-police structure is very similar to the graduate school admissions structure. To remind you, in this scenario we also have a mediating path. We're interested in whether individuals of a particular ethnicity, represented here by X, are assaulted by the police due to direct discrimination. Another way they could be assaulted more by police is by being stopped more often; that's a different kind of discrimination, and that's the Z variable here. There will also probably be confounds, very similar to the ability confound, but in this case things like acting suspicious. Think of it this way: there are two ways you might be stopped by the police. The police might be discriminatory and tend to stop individuals of particular ethnicities, or you could be acting suspicious and really deserve to be stopped. Of course, deserving to be stopped doesn't mean deserving to be assaulted, but the police may also be more likely to assault individuals who are acting suspicious or have criminal intent or are carrying weapons. That's the U variable in the DAG on this slide: it's another path to being stopped by the police, and another path to being assaulted, other than ethnicity. This is structurally like the ability differences in the admissions example I just finished.
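Since both examples share this structure, it can help to check the back-door arguments mechanically. The dagitty R package (not used in the lecture) can compute adjustment sets; a minimal sketch for the admissions DAG, assuming the arrows as drawn on the slide:

```r
library(dagitty)
g <- dagitty("dag{ G -> D ; G -> A ; D -> A ; U -> D ; U -> A }")
# total effect of G: no back-door paths into G, so no adjustment needed
adjustmentSets(g, exposure="G", outcome="A", effect="total")   # {}
# direct effect of G: stratifying by D opens the path through U,
# so U must be in the adjustment set too
adjustmentSets(g, exposure="G", outcome="A", effect="direct")  # { D, U }
```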
Back to the policing case: the consequence of this structure is that when you only have data on individuals stopped by police, the data set itself has already been conditioned on the collider Z. I'll say that again. When you only have data on individuals stopped by the police, which is the norm, since we don't have data on all the individuals who weren't stopped, then the data set you have is already conditioned on the collider Z. Therefore the confounding effect of variables like U, which we don't have, is already baked into all of your analyses, and so data on police stops are nearly always confounded by a lack of data on who wasn't stopped. If you're interested in this topic, there's a great paper, cited at the bottom of the slide, by Knox et al. (2020), which carefully walks through these examples and discusses empirical strategies for trying to understand police behavior in these scenarios. But you have to think through the causal assumptions very carefully. The answers, as always, are not in the data alone; you have to put a lot of causal thinking in to get any causal information out. Okay, back to graduate school admissions. Of course, if we could measure the confound, we could add it to the model, and it might be illuminating to see how we'd do that. So suppose we could measure U. In the synthetic data simulation, of course, we can, because we simulated it, so let's play that game for a moment so you can see what the model structure would look like. Let's add the unobserved confound U to the data set, so that it is in effect observed, and now we have a new model. What I want you to see is that in order to make an inference about the direct effect of G on A, we need to block the non-causal path through U, and that means, as you know from your skills with the elemental confounds, we need to condition on U: we need to stratify by U. So that's what we do. We stratify by U by adding it to the linear model, and here I've introduced a coefficient bUA, the effect of U on A, multiplied by U. Just an ordinary linear model, like you've been building for weeks now. The little trick I'm going to add is a special bit of prior information: the effect of ability should be positive, so I'm going to constrain the parameter bUA to positive values only. You don't have to do this, but it's a better prior, and I wanted to show you an example of how to do it (a sketch is below). We run this model, we extract the samples (the code at the bottom of the code box on this slide shows you how), and compute the contrasts on the log-odds scale. What you see is that you get the right answer now: department 2 is discriminatory against gender 1, and that's why the red density is shifted lower. The thick densities on this plot are the new ones, from the unconfounded model we ran on this slide, and the thin ones are the previous ones, confounded by ignoring the unobserved ability differences. Okay, but of course in real research contexts we can't observe these unobserved confounds. That's why they're called unobserved confounds. We know they exist, but we don't have an easy way to measure them; sometimes it's practically impossible. So what can we do instead? One strategy, of course, is to do an experiment. If you can do an experiment and randomize your way out of confounding, that's a great idea.
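Here is a minimal sketch of what that model could look like in ulam, assuming the simulated variables from the sketch above. The positivity constraint on bUA is written here with ulam's constraints argument; the exact priors are illustrative:

```r
# hypothetical sketch: U treated as observed, added to the admissions model
datU <- list(A=A, D=D, G=G, u=u)
mGDu <- ulam(
    alist(
        A ~ bernoulli(p),
        logit(p) <- a[G, D] + bUA*u,   # stratify by U via the bUA*u term
        matrix[2,2]:a ~ normal(0, 1),
        bUA ~ normal(0, 1)             # constrained positive below
    ), data=datU, chains=4, cores=4,
    constraints=list(bUA="lower=0"))
precis(mGDu, depth=3)
```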
Experiments are fantastic inventions, but there are lots of contexts, especially in the social and behavioral sciences, where experiments are practically impossible or unethical. In this case, of course, we can't randomize individuals' genders, and we can't randomize which departments they apply to; we can't force choices of that sort. We could try to blind the applicants' gender in the application process, but that can be very difficult in cases like these. Applications to graduate school are very detailed things, with CVs and often letters of recommendation, and it's practically impossible to successfully blind the gender of the applicant in those situations. There are other structurally similar situations where you possibly can, orchestra auditions being a relatively famous example, but in this context it may not be feasible. If you can't do experiments, that doesn't mean you're out of luck, though. There are other options which help a lot, and I want to talk about two of them. The first is sensitivity analysis, and the second is trying to measure some proxies of the confound: we can't measure the confound itself, but we still might be able to get some information about it and partially de-confound our inference. Let me run through these. Sensitivity analysis tries to answer the question: what are the implications of what we don't know? You know that Bayesian inference measures the implications of the data for some unknown process, but simultaneously it measures the implications of how much we don't know about things. That's why it represents knowledge with distributions. So when we don't know a variable, we can represent that lack of knowledge in the model, and then the things we do know, the things we have measured that go into the model, will have implications for the things we don't know, and the things we don't know will have implications for the things we've measured. I know that sounds confusing, like a bit of probabilistic wizardry, so I'm going to run through an example and show you. Effectively, in the simplest case of sensitivity analysis, we assume the confound exists, we add it to our statistical model even though we haven't measured it, and we model its consequences, assuming different strengths of influence for the confound. We don't know how strong the confound is, and since we haven't measured it, we can't infer its strength. But what we can do is ask: how strong must the confound be to substantively change our conclusions? Let's see how that would be done. Think back to our DAG and our statistical model. This just repeats what you've done before, but we're going to add U to the model now, because it's in the DAG. One part of the statistical procedure here is estimating how acceptance works, that is, modeling the acceptance decisions, and there are three things that influence them: gender, department, and differences in ability, which we're representing with the letter U. So we add to the logistic regression, the Bernoulli GLM, a term in the linear model representing the influence of ability U_i on an individual's chances of being admitted, and I'm going to stratify this by gender, because it may act differently for the different genders.
Simultaneously, there are other models in the system that U also affects. Choice of department is also affected by differences in ability, and again we can model this with a Bernoulli logistic regression. I'm going to create a new zero/one outcome variable to model as a Bernoulli: whether the individual chose to apply to department 2. That's what the parenthetical (D_i = 2) means; it takes the value 1 when true and 0 when not. We model that as a Bernoulli outcome with a new probability q_i (p was already taken), assign it a logit link, and give it a linear model, where the delta vector of parameters, one for each gender, gives each gender's baseline tendency, on the log-odds scale, to apply to department 2 instead of department 1, and the gamma parameters, again one for each gender, are the influence of ability on that decision to apply to department 2. Now in code form. The stuff at the top of the code box is just prep. What I want you to focus on is that I'm actually adding coefficients as data in this example, because this is a sensitivity analysis. The beta and gamma parameters, represented here as b and g, can't be estimated, because we haven't measured U; we're imagining it exists. So we're going to feed in effect sizes for b and g. We don't believe in any particular values; we can vary them and see the consequences of changing the assumptions about how strong the different effects are. In this particular case, I'm going to make them conform to the synthetic simulation from earlier in this lecture. The beta effects, remember, are the influences of ability differences on admission, and I'm going to make those equal and strong, value 1 for both genders. The gamma effects, the effects of ability on the decision to apply to department 2, I make 1 and 0, so that gender 1 is influenced by differences in ability and gender 2 is not. The last part, at the very bottom of the code block, may be a bit confusing: since we haven't observed the U values, they're not data, they're just a long vector of parameters. You may remember that in earlier lectures I've said that in Bayesian inference, the only difference between data and a parameter is that one is observed and one isn't, and here you can see where that comes home. U would be data, and we would have fed it into the model, if we had measured it. Since we haven't, it's just a parameter, a long vector of parameters, one U value for every application in this example. So there can be a lot of them, in fact as many as there are observations, but that's fine, as you'll see. I'm assigning them a normal(0,1) prior, because the scale is arbitrary; since we haven't observed them, we can't figure out the scale at all, so we just fix it with a standard normal prior. And that's just a prior: if the actual ability differences have a different distribution, that's okay, because the ability differences that are eventually inferred can have whatever distribution they have. We can run this on the synthetic data, where we know the right answer, and that might be interesting. But what if we run it on the real data as well? That's what I'm doing here.
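Here is a sketch of the sensitivity model in ulam, matching the description above. Note that b and g enter as data, not parameters, and u is declared as a vector of N parameters; the variable names are assumptions for illustration, reusing the synthetic simulation from earlier:

```r
dat <- list(
    A=A, D=D, G=G, N=N,
    D2=ifelse(D==2, 1, 0),  # indicator (D_i = 2) for the department model
    b=c(1, 1),              # ASSUMED effect of ability on admission, per gender
    g=c(1, 0))              # ASSUMED effect of ability on applying to dept 2
mGDu_sens <- ulam(
    alist(
        # A model: admissions
        A ~ bernoulli(p),
        logit(p) <- a[G, D] + b[G]*u[i],
        matrix[2,2]:a ~ normal(0, 1),
        # D model: choice of department
        D2 ~ bernoulli(q),
        logit(q) <- delta[G] + g[G]*u[i],
        delta[G] ~ normal(0, 1),
        # unobserved ability: one parameter per application
        vector[N]:u ~ normal(0, 1)
    ), data=dat, chains=4, cores=4)
```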
In the script that goes along with this lecture, I show you both ways, on the synthetic data and on the real data; what we're looking at here is the real data. The thing to note is that I've converted this to a logistic regression with Bernoulli outcomes, which means I had to take the UCB admissions data and put it into what's called long format, with a zero/one outcome on each row, each application being one row. You end up with more than four thousand rows in the data set, but that's okay, and the code to do it is in the script for this week. The model is just like what we talked about before. In the real data example, it's department 1 that is the department of interest, because it's the one that stands out with a seeming preference for applications from women: women who applied to that department were accepted around 80% of the time, the vast majority of them, but not very many women applied to it. The question is whether this is a result of the mechanism involving ability differences. That is, suppose it's a discriminatory department (I won't name any fields, I don't want to embarrass any particular fields, but we all know there are some with a reputation for being hostile to women, with hostile cultures), and high-ability women decide to apply anyway, because they love the subject. That's department 1, or department A as it's called in the original data set, and that's why I'm focusing on it in the D model in the middle of the formula list here. We add the effects of U in both places, the A model and the D model, just as I explained before. What you see on the right are the results. These are the female minus male contrasts just for department A, the department of interest. The thin density is the original one from the previous lecture, and the thick one is the new one. This is a sensitivity analysis, so we don't know that any of this is true. What we've learned is that if the unknown confound acts in the directions we've assumed, with the effect sizes we've input, then with these data the admissions advantage of women in that department is essentially erased, or at least greatly reduced. It now straddles zero much more than it did before. It could still be large, it could still be positive, but it could now also be negative. And if we made the effect stronger, we could possibly push it entirely negative; that's an exercise you may want to try. Why does it have that effect? Well, we can take the actual ability estimates out of the model. Remember, the U values are parameters in this model, and there are a lot of them: over 4,000 ability parameters estimated. Here we're only looking at the ones for department A, the object of our interest, and in that case there are a little over 900 individuals who applied; it was one of the most popular departments on campus in the 1970s. What I'm showing you here, on the horizontal, is just the applications, sorted by the posterior mean ability estimates from the model, with applications from men colored blue and applications from women colored red. What's happened is that the applications from women are, on average, estimated to have higher ability than the average application from men.
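The ability estimates plotted here come from the posterior means of the u parameters. A minimal extraction sketch, with model and variable names as in the sketches above:

```r
post <- extract.samples(mGDu_sens)
u_mean <- apply(post$u, 2, mean)   # posterior mean ability per application
# sort applications by estimated ability, coloring by gender (1 = red/women)
o <- order(u_mean)
plot(u_mean[o], col=ifelse(G[o]==1, "red", "blue"),
     xlab="application (sorted)", ylab="posterior mean ability")
```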
On the far left are the applications from men who were rejected; that's the reason the model assigns them a low estimate of ability. Remember, we can't see ability: the model figures it out from whether the applications were accepted or rejected, in combination with the sub-model that chooses departments. So for the applications from women in this example, the model says: well, they must have been higher ability, because that's the only reason they would have applied to this department. That's what we see on the far right: those are all the accepted women, the 80% or so who applied to this department. This is consistent with the hypothetical scenario, but the model has essentially inferred it just from the minimal assumption about the effect sizes of the parameters we put in. So let me do some summary. This is sensitivity analysis, and it's a broader topic than I just showed you, but the goal is always to figure out the logical implications of the things we don't know. Sensitivity analysis sits somewhere in between pure simulation, where you're making up everything, and pure analysis. Instead it says: there are aspects of the system we haven't measured and can't estimate, so let's try a variety of assumptions and see what consequences they have for our inferences. For example, a very common approach is to vary the strength of the confound over some range of values, not just one, and show how the results change as a consequence, reporting that whole sequence of fits. Or you could do the opposite, in a sense: vary the other effects in the model and estimate the confound strength required. That's also a possibility. The truth is that in much research, confounds persist; it's not plausible that there are no confounds in most observational systems, and we shouldn't pretend otherwise if we can't find a way to design around them. That doesn't mean we should give up. Confounded estimates are still useful, especially if we can report how strong the confounds would need to be to reverse a result or erase it completely. Okay, let's talk about the other thing I mentioned: another option is to try to de-confound your inference with some information about the confound. What I want you to see in the DAG on this slide is that the variable U now has some new descendants, some children, T1 and T2. These are test scores we might learn about the applicant, or any number of other things, letters of recommendation or such: information about the applicant which gives us information about their qualifications, their ability, as I've been saying. It turns out you can use this information, if you're lucky and it's high quality, to infer U with error, but infer it well enough that you can de-confound your inference about the effect of gender on admission. Let me walk you through a quick example to give you an idea how that works. But before I do, I want to emphasize that what you can't do is just add the test scores to the GLM as if this were one big regression model. Instead, you need to obey the structure shown in the DAG, so it's a more subtle problem. Think about the sub-models here. We've got a model for admissions just as before, and there's really nothing new there; just as in the previous example, we add U, the ability differences, to the logistic regression for admissions decisions at the top. Then we have a model for the test scores.
Imagine we wanted to predict test scores; we could use the DAG to do that, and the model for it is shown at the bottom of the slide. Each test score is a normal sample with the individual's ability as the mean and some precision of the test, tau I'm calling it here. The abilities are all unknown, and I'm just going to assign them a normal(0,1) distribution, a standardized distribution, because of course their scale is arbitrary; they're latent variables. The good news is that we can run these two models simultaneously, and when we do, we get some really nice power, if the tests are informative, that is, if they're sufficiently reliable descendants of the ability differences U. I'm showing you a simulated example here, because we don't really have test data for the UC Berkeley admissions data. At the very top of the code box I simulate three different tests, and I've given them different precisions, if you will. The first one is highly precise: it reports back the individual's U with a small standard deviation of only 0.1. I made the others a little worse. Then in the ulam code we have the admissions model, which you'll recognize, nothing new there, and then we have the model for the test scores, for U and T, and I don't think there's anything necessarily confusing about it. Note that I've allowed tau to vary among the different tests, just to show you that's possible. Note also, at the bottom of the slide, that this model doesn't always sample very efficiently, depending upon the simulation. Later in the course we'll learn some tricks to make models like this sample more efficiently, but I ask your forgiveness for now; I want to skip past that technical issue. Okay, this is what happens, and I think this is pretty cool. On the left I'm showing you the contrast plot again, and this will look familiar, because the thin density is the one we've seen before from the synthetic simulation: when we ignore the confound, we underestimate the amount of discrimination in department 2 in the synthetic example. But now, when we use the test scores in the way I showed on the previous slide, we get a better estimate, much closer to the true amount of discrimination. We also get estimates for the abilities, of course, because they're parameters in the model. There are 2,000 U values being estimated in this model, and I've plotted them on the right of this slide, with applications from women in red and applications from men in blue. On the horizontal I show the true value (in any real example we wouldn't know the true value of someone's qualifications or ability, but here, since we simulated the data, we do), and on the vertical I show the posterior mean for each individual. You'll see that even though the prior distribution for U is normal(0,1), a bell shape, the posterior collection of U values is definitely not bell shaped. The model has found the clusters at 0 and 1 for the two kinds of individuals, and it's also found the right proportion: about 10% of the sample has been assigned a value around 1. This is just a symptom of the fact I keep repeating: priors are just priors. The parameter estimates themselves are free to take distributions other than the prior. And that goes for data as well: the likelihood, the probability of the data, is a prior for the residuals of the observations, but the actual residuals can have a different distribution. That's the way Bayesian inference works.
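Here is a compact sketch of the two simultaneous models just described, in the same hedged spirit: the test precisions, priors, and names are illustrative, reusing the synthetic variables from the earlier sketches:

```r
# simulate three tests as noisy measures of u (precisions are illustrative)
T1 <- rnorm(N, u, 0.10)   # highly precise test
T2 <- rnorm(N, u, 0.50)
T3 <- rnorm(N, u, 0.25)
datT <- list(A=A, D=D, G=G, N=N, T1=T1, T2=T2, T3=T3)
m_proxy <- ulam(
    alist(
        # admissions model, with ability
        A ~ bernoulli(p),
        logit(p) <- a[G, D] + bUA*u[i],
        matrix[2,2]:a ~ normal(0, 1),
        bUA ~ normal(0, 1),
        # measurement models: each test is a noisy descendant of u
        T1 ~ normal(u, tau[1]),
        T2 ~ normal(u, tau[2]),
        T3 ~ normal(u, tau[3]),
        vector[N]:u ~ normal(0, 1),
        vector[3]:tau ~ exponential(1)
    ), data=datT, chains=4, cores=4,
    constraints=list(bUA="lower=0"))
```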
Okay, one more note about this model. I think it's the first model in this course that has more parameters than observations, and if you've had a traditional statistics education, that's supposed to be impossible, right? You're told you can never do this. Well, you can, and we just did, and it worked: it de-confounded our inference. The advice that you can't have more parameters than data points only applies to very simple sorts of regression models. The relationships among the parameters and the observations in the model we just ran are more structured than that. But you don't have to be clever and understand the structure to know when you can do this. Instead, just remember: the model comes from the scientific structure, from the DAG or the more sophisticated causal model that you draw, and that determines how many parameters your model should have. It's not your sample that does. I'll say that again: it's the scientific model that determines how many parameters you have, not your sample. When you're doing Bayesian inference, there are no guarantees that you can infer what you want to infer, but it's the scientific structure that gives you your model structure. Okay, I appreciate this is all a bit dizzying. One way to think about GLMs is that what we've started doing in these last two examples is gluing together multiple regression models into larger simultaneous-equation models, which can do much more capable things in terms of dealing with confounds, sensitivity analyses, and so on. These machines are a bit complicated, and you can get a bit lost in them, like Charlie here on the screen, but with a little experience you start to really appreciate and understand what we're doing. Really, we're just drawing out the consequences of our causal assumptions in statistical language, and sometimes that means multiple regression models running at the same time in the same Markov chains, which lets you do much more interesting things to address your scientific questions. With that, let's take a break. That's been a lot of work. I encourage you to look back on the previous slides, make some notes, take note especially of the things that were confusing, and maybe review those. Then take a walk, have a cup of coffee, and when you come back, I will be here. Okay, let's talk about technology. It's a fascinating thing about people, about humans, that all human societies depend upon fire, but in the richest human societies most people don't know how to make it, not without some sort of device like a butane lighter. What you see here is a contemporary Polynesian man showing you a traditional way to make fire, by rubbing sticks together and then using coconut fibers (not really hair, of course) to light the fire. Different societies have lots of different techniques, depending upon the materials present, and this kind of technology is of course very simple; children can learn it very rapidly. But it's extremely difficult to invent on your own. If you were simply lost on a Polynesian island, it would probably take you a long time to figure that stuff out. We're not going to talk about making fire today, but we are going to talk about technology in this second half of the lecture. It's a big question in anthropology why, first of all, technological complexity tends to increase over time, and why it's so different in different places and times. We're going to enter into this literature with a particular data set, a small and interesting one that focuses on oceanic tool complexity.
Not techniques for making fire using coconuts, that is, but things like canoes and hammers and fish hooks and many of the other specialized technologies that oceanic peoples, Melanesians and Polynesians, used to colonize the Pacific, with only the raw materials they found on the islands they colonized. The data set is shown in its entirety on the right-hand side of this slide, and our estimand, our focus, is the causal relationship between population size and the number of distinct functional tools found in each island culture historically. There are two causal variables we're going to be interested in, and I'll explain them a little more in the subsequent slides. If you're interested in this literature, I give the citation to the paper that first published and analyzed this data set at the bottom: Kline and Boyd (2010). So here's a conceptual scientific model to get us motivated. The technological complexity in a society is influenced mainly by two processes. The first is innovation: somebody has to invent things, and as I asserted just a couple of minutes ago, it's quite hard to invent stuff; most people don't do it. If you get lost in the woods or on a Polynesian island, it's going to be hard to invent a way to make fire. It's much easier to learn it, and that's a feature of nearly all human technology. So on this slide I've drawn two different influences on innovation rate. The first is population size: the more people there are, the more likely it is that someone generates an innovation. The second is contact: you can also get innovations by meeting other people who have invented things themselves. Innovations then lead, with some more development and ingenuity, to functional tools, the things we find in archaeological records and that ethnographers count. At the same time, some tools fall out of use, or the only person who knows how to make them dies prematurely, and this leads to loss of tools over time. The thing about loss of tools is that its rate is proportional to the number of tools, not necessarily to the population. We can make this into a conceptual DAG, and this is, I think, the messiest DAG we've had so far in this course; it looks like a quilt. Focus your attention on the population in the lower left, because that's the causal variable we're most interested in, and what we'd like to do is infer the causal effect of population size on the tools in the lower right. Using your powers of the back-door criterion, you can analyze this graph with your eyes; try to do it, I know you can find all the back-door paths. Remember, a back-door path means there's an arrow entering P, so those are the only candidates. In this case there can only be one, and it's a back-door path through location, as I've drawn it on this DAG. That is, location is a confound between population and tools. What do I mean by that? The adjustment set to get the causal effect of population on tools here is to stratify by location. Oceanic societies which are near one another might have similar tool sets just because they are near one another: they may have similar materials and needs, and so generate similar tool counts, and there could be other geographical causes of similarity in technological complexity. We also want to stratify here by contact, that is, by how much trade societies had, because that's another source of tools. It's not a confound, though; you'll see there's no back-door path out of population through contact. Instead, contact is a competing cause of tools.
So if we stratify by contact, we can probably get higher precision on the causal estimate of population on tools. Okay, let's do it; let's make a statistical model. Note that tool count is not a binomial variable like in the admissions example earlier, and the reason is that there's no maximum. We don't know how many functional tools are possible in human societies; it's a seemingly infinite number, actually. So we're going to need a different kind of generalized linear model here, one called a Poisson generalized linear model (you don't have to say Poisson with a French accent if you don't want to). The Poisson distribution is actually a special case of the binomial distribution: it's a binomial with a very high maximum that will never be realized and a very low probability of each success. The way to think about that in this case is that there are many, many possible technologies in human societies, but in any particular society only a very few are realized, and that's a classic Poisson situation. When we work with the Poisson distribution, we still need a link function, because the outcome is a count variable and can't be negative, so we can't directly hook up the expected value of the number of tools to a linear model. The habitual link function in this case is the log link. What does that mean? I'll show you on the right of this slide. We have some count outcome y_i, like the number of tools (it could be a number of fish), and the Poisson distribution has one parameter, lambda, usually called the expected count, sometimes called a rate; it's the expected value of the outcome variable y_i. Then in the linear model, we say that the logarithm of that expected value is equal to the linear model; sometimes these models are called log-linear for this reason. What this effectively does is enforce positivity, if you will. The expected value of a count must be a positive number, so lambda must be positive, but the linear model can be negative. If we exponentiate any real number, though, the result is always positive: e to any real number is always greater than zero. So the log link is just a convenient way of making sure lambda is positive. One consequence of this, which I'm going to unfold on the next slides, is that it induces exponential scaling. The exp function is exponential, so you get exponential scaling of the things inside it, of the components of the linear model, and this can be a bit surprising sometimes. So let's think through its effect on the priors; you can get quite explosive prior relationships as a consequence of the log link. I'll show you. Here again is just the core bit of a log-linear model, a Poisson regression: the logarithm of lambda_i (remember, lambda is the expected value of the outcome variable) is just equal to some intercept alpha; don't worry about any slopes just yet. Let's start by considering a seemingly default prior distribution: alpha has a normal distribution with a mean of 0 and standard deviation of 10. If we simulate from that prior, what I want you to see is that the expected number of tools is actually shockingly large. The density I'm showing you on this slide is just that normal(0,10), but exponentiated; this is what happens as a consequence of the log link. You'll see that it spikes just above 0, so that's the mode, and maybe that's not what you intended in the first place.
But then there's this really long tail of huge values, and in fact the mean of this distribution is very large because of that very long tail. This is literally a log-normal distribution, and log-normal distributions can have very long right tails. So you have to be careful with Poisson regressions, log-linear models, and be sure to look at the consequences of your prior choices on the outcome scale, like this. Another choice, which we're going to use in this particular example, would be something like normal(3, 0.5). That sounds crazy, but we arrive at it through prior predictive simulation. Seeing what this gives us (the same thing again, with lambda shown on the horizontal axis of this graph), there is a very wide range of prior expectations, but it expects there always to be some tools, and all human societies have some tools, and the mean is around 20 now, a more sensible number. As soon as we add slopes, a regression parameter on some predictor x, to the log-linear model (in the upper right, we're now looking at the logarithm of lambda_i equal to alpha plus beta times x_i), things can get worse. On the left I show you some wide default priors, with the slope beta distributed as normal(0,10), and what this produces is that almost all the prior probability mass is on explosive relationships between the x variable and the expected count: curves shooting up into space, as it were, and soon entire Polynesian societies are covered in nothing but tools, with no room for people to live anymore. Obviously this is not a good prior. On the right I show you some much more modest, less explosive relationships between a standardized x variable on the horizontal and the expected counts. There's a long section, several pages in the book, where I talk through these examples as well, and if you're interested it's worth going through that slow set of pages to get a better feel for this. Okay, the statistical models don't look so different from the previous ones; we just make a few small changes. On the right of this slide I've written the mathematical version of the model we want to run. At the top we have our outcome Y_i, and this is a Poisson model, so Y_i is the number of tools, and it has a Poisson distribution with some expected value lambda. Then we have our linear model, in which I have an intercept stratified by contact, the letter C, so each contact level, high or low in this example, has its own intercept, and the same for the slopes: each contact rate has its own slope. And then I'm using the logarithm of population size instead of raw population size. The reason is that it's really not plausible that there are no diminishing returns to population size: as populations get bigger and bigger, each individual you add does not contribute at a constant probability of an innovation. The logarithm here is a backdoor way, if you will, to induce diminishing returns. If you don't do this, you get a really explosive relationship with population size, and it also doesn't fit the data well; that's an experiment you may want to try for yourself, taking the logarithm off population size in this model. And I'm using the priors that I simulated out on the previous slide. On the left we have the code, and at the top of this I just load the data and do some scaling: I standardize the log population size, code high contact as index 2 and low contact as index 1, and then I make a little trimmed dat list for the ulam models.
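Before fitting, here is the kind of prior predictive check described above, as a minimal sketch on the outcome scale:

```r
# prior predictive check for the intercept-only Poisson model
n <- 1e4
a_wide   <- rnorm(n, 0, 10)   # seemingly harmless "default" prior
a_narrow <- rnorm(n, 3, 0.5)  # the prior arrived at by simulation
mean(exp(a_wide))             # implied mean tool count: absurdly enormous
mean(exp(a_narrow))           # around 20 tools, a sensible number
dens(exp(a_narrow), xlab="prior expected number of tools (lambda)")
```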
Then I fit two different ulam models, and I want to show you what model comparison can do here. Because we are interested in using population size to predict tool counts, we don't want to use model comparison here to choose a model, but it's important to give you a feel for what it looks like when there are important predictors in models, and what that does to the differences in criteria like PSIS, which we're going to use here. The other thing is that sometimes you want to compute the PSIS score for a model just because you want to find outliers, and there are going to be a couple of outliers in this case. So there are two models here. The first is just an intercept: it estimates the overall rate across all islands, the average number of tools. The second is the regression model, where we stratify the intercept and the slope by contact, with log population size as the predictor. We compare using PSIS, and the model that includes population size and contact is better, substantially better. However, there's still a lot of variation: the standard error of the difference is not so different from the difference between the models. If you look at the dPSIS column there, that shows you the differences, and then the dSE; if you double that standard error, that's like a 95% interval, and you'll see that it's pretty big relative to the size of the difference between the models. That doesn't mean population size doesn't matter; it just means there's a lot of other variation in these data which isn't explained by population size. The thing I really want to focus your attention on in this example are the Pareto k warnings that you see in the output here. If you remember from the overfitting lecture, the Pareto k values output by PSIS are related to the importance of each data point to the posterior distribution, and we use them as measures of the leverage of each data point. When some individual points have very high Pareto k values, greater than 0.5, we should expect that they've had some large influence on the posterior distribution, and therefore it's going to be very hard to predict how well the model performs out of sample, because there are these outlier points in our sample. Even if you're not interested in doing prediction, just in inference, it's good to use those Pareto k values, those warnings, to think about the structure of your model as it relates to the inference. So here's an example in the oceanic tools case. On the left I've got a bunch of code, a wall of code I know, but the chunk at the very top of the code box just extracts the Pareto k values from the PSIS function. Then I plot the data, and I've scaled the points' thickness by their leverage, by their importance to the shape of the posterior distribution. And then I've plotted the posterior predictive distributions, like we've been doing since the beginning of the course, with high contact islands as red points and a red shaded compatibility region, and low contact islands in blue. What I think you see here is that Hawaii and Tonga are the outliers, in a sense, but especially Hawaii. Hawaii is the largest historical oceanic society, much, much larger in population size than the others, and also the most technologically complicated, and it's really dragging that blue curve around quite a lot. That's the main source of these Pareto k warnings.
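Extracting those Pareto k values looks something like this (a sketch; the model names are placeholders for the intercept-only and full models, fit with log_lik=TRUE so PSIS can be computed, and dat is the trimmed data list from above):

```r
compare(m_intercept, m_full, func=PSIS)   # dPSIS and dSE columns as discussed
PSISk <- PSIS(m_full, pointwise=TRUE)$k   # one Pareto k per island
# scale plotting symbols by leverage, to spot outliers like Hawaii
cex_k <- 1 + (PSISk - min(PSISk)) / (max(PSISk) - min(PSISk))
plot(dat$P, dat$T, cex=cex_k,
     xlab="log population (std)", ylab="total tools")
```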
If we look at this on the natural scale on the horizontal, instead of the log scale, you can really appreciate what Hawaii is doing here: it's bending that blue curve down all by itself. And of course there are no red islands out on the far right of this graph to inform where the high contact trend should be. Hawaii had low contact, because it was out by itself in the Pacific; there are no other oceanic islands near Hawaii. If we zoom in on the graph on the right, you can see some other weird things about this model fit. The first is that the expected values for the two different sets of islands, with different contact rates, actually cross one another as population size increases. At low population sizes, the low contact islands have fewer tools, which makes sense according to the theory: they receive fewer innovations because they do less trade. But then this model says that as population size increases, that relationship inverts for some reason, and now the high contact islands are expected to have less complicated tool kits. Now, of course, the whole story with the high contact islands is that there's a really huge uncertainty region out there; the pink region on this graph is the whole upper half of the graph, so the expected value is highly uncertain. This model actually has no idea where the expected value for high contact islands, the red points, should be at high population sizes. But still, the model allows this reversal, which seems odd. Then, if we zoom in on the far left, at zero population, there's another weird thing about this model: it allows a population of zero people to have more than zero tools. I'll say that again: this model allows a population of zero people to have more than zero tools. You can see that for the high contact islands; that's the solid regression trend rather than the dashed one (the solid one goes with the red points), and it intercepts the vertical axis around 20 tools, not at zero. That's got to be wrong: if you have no people, you have no tools. This is a scientific context, in essence, where one of the parameters comes for free, the intercept: these regression lines must go through the (0, 0) point. There are a couple of immediate ways to improve this model. The first would be to use some kind of robust model. I talked about this before with the divorce data, where I showed you how to use a Student-t regression to deal with outliers. In this case it's not a normal model, so we wouldn't use a Student-t; instead we'd use something called a gamma-Poisson model, also sometimes called a negative binomial, and there's an example in the book using exactly this data set. The second thing we could do is use a more principled scientific model instead of a generalized linear model, and that's what I want to show you to end this lecture. So let's go back to the conceptual box-and-arrow model of this situation. I had asserted before that larger populations generate more innovations, that contact and trade also generate innovations through borrowing, that these things increase the number of distinct functional tools in human societies, and that there's then a loss process, as tools are forgotten or replaced. What if we took this box-and-arrow model and actually made it into a model? Instead of using a regression, we just try to make a scientific model out of it. So let's do that; here's a start.
With these sorts of box-and-arrow models, what you typically want to start with is modeling the change per unit time, so that's what we're going to do here. Let me annotate this equation a bit and explain the pieces to you. Starting on the left, we have Delta T, which just represents the change per unit time (you can define the unit of time any way you like; it just ends up changing what the parameters mean) in the number of tools in a society. This is equal to some parameter alpha, the per capita innovation rate, times capital P, which again is population. Alpha is a parameter we need to estimate; capital P is data. Then we raise P to the power beta, and what beta does is establish diminishing returns; an economist would call this an elasticity. If beta is less than one, then each additional person contributes fewer innovations. So that first term on the right-hand side of the equals sign is the innovation term. Then we subtract from the change in tools the rate of loss. There's a parameter for the rate of loss, gamma, and it's proportional to the number of tools: you can only lose tools if you have tools, so the more tools you have, the more you can lose in any particular unit of time. Now, of course, this model could be more complicated. You could make loss interact with population size; there are lots of things you could do. But I want to start with this example just to prime your imagination: we don't have to start by thinking about regressions, we can actually make scientific models that meet the causal assumptions of the context we're working in. We also want to stratify our parameters alpha and beta by contact rate, so there's a little bit of regression thinking back in this, and I just add the C subscripts to alpha and beta as we had done in the previous models. Okay, but how do we turn this into a statistical model? Our statistical model doesn't have a Delta T in it. Well, you can think about this as being a dynamic process, and at some point the innovation rate, the term with the population size P in it, and the loss rate, the term with T in it, are going to be equal, and therefore this whole thing is going to equal zero. When Delta T equals zero, the population has reached an equilibrium, or steady state, number of tools. Of course, no real population will actually stay fixed there, because this is a stochastic process, but it would be the mean of a distribution that the population orbits around over time, until its population size changes, or the parameters change. So if we want to find the expected number of tools for a society with a particular population size, we just need to set this equation equal to zero and solve for T. If we do that, and call the equilibrium value T-hat (the little circumflex over the T), we get the expression here, and we can use it directly in the statistical model. This is now lambda, the expected number of tools. Pretty cool, huh? So we can put that right into our model code. The only new bit is that in the line for lambda there's actually no link function at all. Instead, we're enforcing lambda to be positive by making sure the parameters are positive, and they need to be just because of what they mean: all the parameters in this particular model are rates, so they have to be positive real numbers.
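In symbols, the model just described is Delta T = alpha_C * P^(beta_C) - gamma * T, and setting Delta T = 0 gives the equilibrium T-hat = alpha_C * P^(beta_C) / gamma, which goes in as lambda with no link function. Here is a sketch of the ulam code, close to the version in the book (priors illustrative):

```r
data(Kline)
d <- Kline
dat2 <- list(
    T = d$total_tools,
    P = d$population,                      # raw population, not logged
    C = ifelse(d$contact=="high", 2, 1))   # contact index
m_innov <- ulam(
    alist(
        T ~ dpois(lambda),
        # equilibrium tools: alpha*P^beta/gamma, with no link function
        lambda <- exp(a[C]) * P^b[C] / g,
        a[C] ~ dnorm(1, 1),   # exponentiated inside the model, so unconstrained
        b[C] ~ dexp(1),       # strictly positive priors for the rates
        g ~ dexp(1)
    ), data=dat2, chains=4, cores=4)
```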
There are two general strategies in statistical modeling for making sure parameters are positive. The first is just to declare the parameters as ordinary continuous real numbers, but exponentiate them inside the model when you use them, and you'll see I'm doing that here for alpha: in the lambda line I exponentiate a[C], and that makes sure it's positive, because the exponential of a[C] is always a positive number. The second is to use the prior to constrain the parameter to be positive, and I do this for b and g, so you can see how that works: I used exponential priors for those, and the exponential prior is a strictly positive prior; it only has probability mass on the positive reals. You can run this model, it fits no problem, the chains mix really nicely, and you get estimates. This is one of those cases where you see the tide engine again: it's impossible to interpret these parameters directly, because this model is highly nonlinear and all the parameters act together to produce the predictions. There's no way you can look at this table of coefficients and figure out what's going on, so let's push out predictions. The code to produce the graphs I'm going to show you is in the book, so take a look if you're interested in the details, but there are no new tricks there. It's the same stuff we've been doing since the beginning of the course: you extract the samples, and since you know the model, you can generate expected values for any combination of the parameters and the data, and you make these graphs just that way. What I'm showing you on the right are the new posterior predictive distributions, for both high contact islands in red and low contact islands in blue, using this model, the so-called innovation/loss model. You'll see now that the expected values, the two black lines, the solid one and the dashed one, do not cross at higher values, and the trends make a lot more sense. Hawaii is still exerting a lot of leverage in this model, though. This comparison is just to help you see what we had before: the new innovation/loss model on the left and the generalized linear model on the right. What we haven't done is dealt with location as a confound; that is something we'll do in a future lecture, I promise you. Okay, this has been a long lecture, so let's sum things up. This lecture and the previous one have had as their goal to introduce you to generalized linear models for count variables. Count variables are things that are zero or positive integers: events that happen, and that we count. The way we choose a probability distribution for an observed count variable is to think about the constraints on that variable before you see its realized values, and the argument, which I presented only very briefly at the end of the previous lecture but which gets a whole chapter in the book, is maximum entropy. That is, counts have particular constraints, and the flattest distributions that respect those constraints are these familiar regression distributions, the binomial and the Poisson, and their extensions, like the multinomial and categorical, which are the equivalent versions of binomial and logistic regression when there are more than two kinds of events. Robust regressions are also an important part of this topic; these are the beta-binomial and gamma-Poisson regressions, and there are fully worked examples of those in the text as well. Robust regressions are nearly always a better choice than the plain binomial and Poisson regressions. But when we get to multilevel models later on, I'll talk about this issue again,
because multilevel models actually tend to accomplish the same thing as these robust regressions, just without using the beta and gamma distributions to do the robustness. So examples are to come. Okay, that has been lecture 10 and week 5 of Statistical Rethinking 2022. In the next week, we're going to do some more generalized linear things, but yet more elaborate, and all the elaborations have as their spirit to better meet the scientific goals of each analysis. So I'll see you there.