It's great to welcome you all to the Social and Behavioral Research Branch's seminar series, where we highlight scholars who are conducting really innovative research at the intersection of genetics, the social and behavioral sciences, and health. And it's a really great pleasure to introduce you today to our distinguished speaker, Dr. Jeremy Freese. Dr. Freese received his doctorate in sociology from Indiana University in 2000 with a special concentration on survey and experimental methodology. Following this work, he went on to be faculty at the University of Wisconsin-Madison, then a Robert Wood Johnson Scholar in Health Policy at Harvard University, and then the Ethel and John Lindgren Professor of Sociology at Northwestern. And currently, he's a professor of sociology at Stanford University. There his research focuses on the relationship between social differences and individual differences, including work that connects biology, psychology and social processes. Dr. Freese is also the co-leader of the Health Disparities Working Group for the Stanford Center for Population Health Sciences. He's also principal investigator on a wide variety of shared data resources, some of which I'm sure will be mentioned today. But today, he's also going to share with us his perspective on using polygenic risk scores to better integrate genetics and the social sciences. So join me in welcoming Dr. Freese.
Thank you very much. Do I have to use this mic as well as the wireless mic? No. In fact, that would be bad. Okay. Great. All right. Thank you very much. I am excited to be here today. I apologize: coming from the West Coast, I thought I was able to stack up several different stops and events and things, which is great on one hand, but I've had a little bit of travel misadventure. So if I look a little rumpled as I visit you, that is what is going on.
So I'm going to talk about a variety of different projects today, which involve a host of different collaborators, which I'll show here. It also, of course, involves funding from different sources, including the National Institute on Aging for the Wisconsin data that I'll talk about today, NICHD for Add Health work and some other research, as well as the National Science Foundation. Okay. I will give this talk in a kind of catechismic format. What I mean by catechismic is I'm going to put up some questions, and then I will talk about those questions so that we can follow along more easily with things. But just to make sure that we're on the same page, what do I mean when I talk about a complex social outcome, right? We'll talk about complex trait genetics, so what do I mean when I say complex? What I want to emphasize with a complex social outcome is just this, and we won't go into those diagrams, but those are related to a couple of papers that I've written on how to think about integrating genetic and social causes. We're especially going to talk about educational attainment, for example. In my field, as was mentioned, I am in a sociology department. Educational attainment is the single most studied non-demographic variable in sociology because there's a lot of sociology that deals both with education, like what causes educational attainment or correlates of educational attainment, and then what implications it has over the life course. And when you talk to people, especially interdisciplinary audiences, about educational attainment, people will often talk very much like educational attainment is an individual choice, which it kind of is, right? Or that educational attainment is sort of like a trait, in the way that a psychological or a health outcome is a trait. And it kind of has features of that as well. And then it also has this logic of external response.
That is to say that there are people who would like to do better in school than what they do. There are people who would like to get into better colleges than they get into, or, for that matter, to continue their education further than they're able to. So it's not simply a choice or a trait or an outcome. And we have to keep that in mind when we study it. And yet, nevertheless, we can talk about, and this is the difficult thing, how we might think about and use genetic differences when we talk about a complex social outcome. And so why might we think that something like educational attainment is associated with genetic differences? The first line of evidence for this comes from twin studies. At one point, Amelia Branigan and I collected all of the twin correlations that exist regarding educational attainment. And the backbone of a twin study is the idea that monozygotic or identical twins are going to be more correlated than dizygotic or fraternal twins. So we collected all of those. And from those monozygotic and dizygotic twin correlations, you can back out what's called a heritability estimate. And so if you look across all of these different countries and samples, while some samples are different in ways, you can back out from this an average heritability estimate of 0.4. That is to say, an interpretation of that would be that 40% of the variance in educational attainment is explained by genetic differences, given the assumptions of behavioral genetic models. But now with that, behavior genetics has even characterized certain recurrent patterns as being laws. There are laws of behavior genetics. And one of them, so you'll see here from a widely cited article, the second law of behavior genetics is that the effect of being raised in the same family is smaller than the effect of genes.
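As a minimal sketch of how a heritability estimate can be backed out of twin correlations, the classic Falconer approximation is shown below. This is an assumption for illustration: the talk does not say which estimator was used, and published behavior genetic models are usually fitted structural models rather than this shortcut. The function name and the two input correlations are hypothetical, chosen only so that the genetic and shared-environment components both land near 0.4.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer-style variance decomposition from twin correlations.

    r_mz: trait correlation among monozygotic (identical) twin pairs
    r_dz: trait correlation among dizygotic (fraternal) twin pairs
    Returns (h2, c2, e2): heritability, shared environment, non-shared environment.
    """
    h2 = 2 * (r_mz - r_dz)   # MZ pairs share about twice the segregating genes of DZ pairs
    c2 = 2 * r_dz - r_mz     # similarity left over after the genetic part
    e2 = 1 - r_mz            # whatever makes even identical twins differ
    return h2, c2, e2


# Hypothetical correlations, not values from any study.
h2, c2, e2 = falconer_ace(r_mz=0.8, r_dz=0.6)
print(f"h2={h2:.2f} c2={c2:.2f} e2={e2:.2f}")
```

The three components sum to one by construction, which is why a large shared-environment estimate has to come at the expense of the other two.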
Now that comes from this: if you do a twin model, you can decompose what's called a shared and a non-shared environment. And this is equating that estimate of a shared environment with a family effect. Now I could do a whole thing on why that is not a great interpretation. But just taking it for what it is, the idea is that the shared environment component is usually very small compared to heritability, so much so that they would call it a law. So small, in fact, that often when you're doing a model, it simply drops out. And you estimate it as being zero for all kinds of traits. But what we found is that educational attainment, which as I mentioned is in my neck of the woods the most studied variable, is actually an exception to this law of behavior genetics. As I say, these blue squares are the estimate of the shared environment from a behavior genetics model for educational attainment. And you'll see that the shared environment estimate is roughly the same as the heritability estimate. Educational attainment is, in fact, the only trait for which this has been documented this robustly: a trait for which you have simultaneously a heritability estimate in the 0.4-ish sort of range and a shared environment in the 0.4 range as well. That is to say that educational attainment in some ways is different in this respect. The way to think about what this is going to mean is that siblings are more similar to each other in educational attainment than they would be in something like personality traits from these models. There's an unusual concordance in educational attainment, which is, depending on your background, maybe not that surprising. But we can even see in these models that it behaves unusually. But of course, we're just talking about educational attainment. Of course, we can see that genes and environments are going to interact.
And we've known that and talked about that for a very, very long time. A famous model of this, though, in terms of how we are going to get at or think about how genes and environments interact, is this idea that the heritability of cognitive ability, for example, varies by socioeconomic status. This is, in fact, a quite old idea. Well, old in the sense that it dates to the early 1970s. That's sort of the first article trying to show this. But it got a lot of attention starting in 2003, with this dramatic finding. Sometimes people will use the phrase "draw out." The idea would be that advantaged environments are stronger at drawing out the potential of the child. And so heritability is actually a sign of progress, or of a good or an enriched environment. And in low SES environments, one does not see that same level of heritability. It's been a very evocative idea. But then a couple of things have happened in this literature more recently when people have tried to look at this work. One thing is that people have not been finding a giant effect like this in other samples. But then also, there's been raised this idea of a kind of American exceptionalism. In this case, the dark blue or black dots are showing a larger effect than the red dots. That is to say that people have actually not found evidence of this so much in other countries, Australia, the UK, Scandinavian countries, than they have in the United States. And there have been different theories for why that might be the case: the US welfare state, greater socioeconomic diversity in the United States relative to some place like Scandinavia. So we decided, with some collaborators, a couple of economists at Northwestern University, to look at this with respect to Florida. We're going to be making a few different stops at different destinations.
Today, our first is the state of Florida, because what we were looking at with this is administrative records from the state of Florida. So these economists and some others have put together administrative records, matching birth records and school records for every child born in the state of Florida who went to public schools over like a six or seven year period. So an enormous sample and also, relative to twin studies, not as selected a sample as is normal. Twin studies often have issues, especially with recruitment of low socioeconomic status individuals into those studies. But when you have administrative records, you have effectively everybody that is in the birth records and school system. Now, there's not anywhere on your birth certificate that says if you're an identical twin or a fraternal twin. But we can tell a twin from birth records because they're going to have the same birth date and the same mother listed on the birth certificate. And we can tell whether they're same-sex or opposite-sex pairs. And every opposite-sex pair is dizygotic. Whereas with the same-sex pairs, the way that it works out is that about half the people who are in same-sex pairs are identical twins and about half are fraternal twins, roughly. I didn't say that quite right, but let's just move on. But what we can do then is we can look, and this is going to be an intraclass correlation coefficient, but just so that we get the idea of the hypothesis, remember the idea is that you see a fanning out with bigger, better environments. And so what we would expect to see here is these things diverging as we move up the levels of mother's education. But in fact, that's not what we observe. And having administrative data makes this the largest study that has ever done this.
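One way to see why same-sex versus opposite-sex pairs carry information about heritability even without zygosity is the mixture sketch below. It is a simplification under stated assumptions, not the method of the Florida study: it assumes same-sex pairs are a mix of identical and fraternal twins with a known identical share `p_mz`, that fraternal twins correlate equally whether same-sex or opposite-sex, and that means and variances are the same across groups so correlations mix linearly. The function names and example numbers are hypothetical.

```python
def implied_mz_correlation(r_same_sex, r_opposite_sex, p_mz=0.5):
    """Solve r_same_sex = p_mz * r_mz + (1 - p_mz) * r_dz for r_mz,
    taking the opposite-sex correlation as the estimate of r_dz."""
    r_dz = r_opposite_sex
    return (r_same_sex - (1 - p_mz) * r_dz) / p_mz


def implied_heritability(r_same_sex, r_opposite_sex, p_mz=0.5):
    """Falconer-style h2 = 2 * (r_mz - r_dz), using the implied r_mz."""
    r_mz = implied_mz_correlation(r_same_sex, r_opposite_sex, p_mz)
    return 2 * (r_mz - r_opposite_sex)


# Hypothetical test-score correlations, not results from the study.
print(implied_mz_correlation(0.7, 0.6))
print(implied_heritability(0.7, 0.6))
```

Under these assumptions, the fanning-out hypothesis would show up as the gap between same-sex and opposite-sex correlations growing with mother's education.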
But both for math and reading tests and for younger and older children, we don't see any evidence of this ourselves. So, Florida is certainly socioeconomically diverse. It's slightly more socioeconomically diverse than the United States at large. And so, really, we're not sure about that simple version of this theory at all from this. And we'll be interested to see how replication in the future works. Even having said that though, this underlying idea, which I've called in at least one place the pervasive environmental reinforcement hypothesis, is a powerful idea. And I think it's probably a true idea. And I think the issue is that these sorts of twin designs just provide a very blunt idea that all of heritability is gonna work in one way or another way or something like that. It's likely. But the core idea here, right, is the idea of a gene-environment correlation that could be pervasive over the life course. And we can in fact think about this maybe even more easily if we think about other talents than cognitive abilities. But we can see where early displays of ability can lead both to people's choices into environments and to environmental responses that lead to later displays of ability, right? So a well-documented example would be the children who show early aptitude at reading. They read more, right? They get more out of reading. They form more of an identity as a reader. And so as a result, they're getting more practice at reading. And so it wouldn't be surprising from that alone that you could imagine a gap in reading emerging in later displays. And we can think about this in all kinds of other domains, this idea of environments reinforcing one another. Now, we're talking here about twin designs, but of course, there's long been this movement toward having molecular genetic data and interest in this. And we're gonna move to the polygenic score.
But we should make clear why we are talking about polygenic scores. Why don't we just look at individual genetic variants? And it depends on the audience how familiar they are with what the answer to that is going to be. But there was a period where people did exactly this. And there were many headlines as a result of that, finding a link between a single genetic variant and violence and delinquency, right? For that matter, ideas about biology strongly governing voter turnout based on a single genetic variant. A lot of this work was using the Add Health data set, which is a great data set. They were pioneering in having genetic data available, but they started by having six genetic variants available. And all kinds of associations were shown with those six genetic variants in the Add Health data. Close to home, for example, the American Journal of Sociology has a paper published in 2008, 2009 or so, right? With a single genetic variant; they're calling this DRD2 risk. The genetic variant in question is the familiar Taq1A variant, which is not actually in, but is near, the DRD2 gene. So this is a dopamine receptor genetic variant. But what you can see here, right, is that the table is presenting effectively a 15 percentage point difference, depending on whether or not you have this single allele. A 15 percentage point difference, a difference so large that if you imagine all of the social reasons that lead to white-black differences in educational attainment, the idea would be that the effect of the single genetic variant would be larger than all of those if you look across the table. When you kind of think about it, that doesn't really necessarily pass muster as plausible. And so perhaps what happened next is not necessarily that surprising. This is from the data set that's used.
The idea would be that if you're homozygous on C, you have a 50-some percent chance of going to college, whereas if you're homozygous on T, you only have a 30-some percent chance of going to college. I put this, as you might notice, in Carolina blue because it's the Add Health sample. But we tried early on to replicate this using the Wisconsin data that I'll talk about in a minute; Wisconsin will be in red as a demonstration. And we can see that we didn't find any evidence of this in our effort to replicate it. And of course that's the story that happened with so much of this. We talk about it now, I don't know how broadly, but in the social science genomics world, we talk about this as the candidate gene era. We talk about it in kind of the hushed tones of a Halloween ghost story, as if we had a flashlight on our faces, because so much failed to replicate subsequently. And in fact, we can even see recently, right, with the UK Biobank, people have produced results based on 500,000 individuals, looking across a genome-wide assay. This is the Taq1A variant in question. And this is basically just somewhere in the middle of a long list of non-significant hits at this point. So even with 500,000, there's not any evidence that this variant, which in the Add Health sample looked like it had a 15 percentage point effect, has any effect whatsoever on educational attainment once we get to a really large sample, right? I should mention that I think the scalding effects of the candidate gene era on this research also probably show themselves in other guises, with respect to wariness about how things are going to turn out with various kinds of epigenetic measures and various kinds of biomarker measures. There are people who have jumped in with both feet on this.
There are other people, myself included, who are very reticent just because we've been hurt before, as they would say, by these different measures, and we're not sure what is gonna be hype and what is going to be true. That makes the fact that I'm gonna talk about polygenic scores here in an enthusiastic way more meaningful, because I do feel like we have reason to think that this works, and I'll explain why, okay? So let's think about it: can a polygenic score be developed for educational attainment, right? And if you're familiar with the history of GWAS at all, there was a period in time where there was a question whether this was really going to work at all, right? The early GWAS studies were not finding much of anything. And when the SSGAC, the group of economists forming the first big study of this, started to try to assemble cases on educational attainment, there were people in fact who I think thought that they were insane, or certainly that they were not going to find anything, that this was a really big long shot. Even people who were giving their data to the study thought so, because obviously we all know that educational attainment is very complex, including the people who are doing these studies. And yet, nevertheless, the reason that educational attainment was so promising is that medical studies routinely ask about educational attainment. They're not going to be doing anything with it in the GWAS framework otherwise, but it's routinely collected and available in a very large number of studies. And people can report educational attainment very accurately, and they can report it very quickly compared to those of us who try to do psychological measures alongside. So there's a lot going for educational attainment as something that you can imagine assembling a large sample with.
In their first effort, when they got over six figures, to 100,000, they found three hits with educational attainment. What I'm showing here is from their Nature article from 2014. And at that point, they had nearly 300,000 individuals and they had 71 hits, and their three initial hits all replicated. So what I mean by a hit in GWAS is this: because you're doing a genome-wide study, you've got a million or more different genetic variants. As a result of that, you use a highly stringent standard of correction, a p-value on the order of 10 to the negative eighth power. Something that passes that is considered a significant hit. They also do some cross-validation with replication samples, but we won't get into that. But what they found are 71 different hits along the genome. And you'll see from this that they're spread all across the genome. They're not concentrated in any one location or anything of that nature. Now, what I can't show you, but what this group has presented, so I feel comfortable talking about it, even for YouTube, is that they have amassed at this point, with UK Biobank and with an agreement with 23andMe, over a million in their sample. And now that they have over a million, just imagine all of this shifting up over the bar. So now they have over a thousand hits across the genome. And in terms of out-of-sample predictive power, including in the Wisconsin data that I'm gonna be talking about later, the predictive power of this score is a little bit less than, but about the same as, being able to predict a child's educational attainment knowing the educational attainment of one of their parents. So if you know either the mother or the father, not both, that's about the same predictive value as what the polygenic score has now that it's over a million individuals, which in my neck of the woods is a pretty good amount of predictive information.
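The two pieces of arithmetic behind this passage can be sketched as follows: the multiple-testing correction that produces a threshold near 10 to the negative eighth, and the polygenic score itself as a weighted sum over variants. This is a generic illustration, not the consortium's pipeline. The Bonferroni form of the correction, the function names, the variant labels, and the weights are all assumptions made up for the example; real scores use GWAS effect estimates across very many variants and usually adjust for correlation between nearby variants.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def polygenic_score(weights, allele_counts):
    """Weighted sum of effect-allele counts (0, 1, or 2 per variant).

    weights: variant id -> per-allele effect estimate from a GWAS
    allele_counts: variant id -> number of effect alleles this person carries
    Variants missing from allele_counts contribute zero.
    """
    return sum(w * allele_counts.get(variant, 0) for variant, w in weights.items())


# With about a million tests, 0.05 / 1,000,000 gives the conventional 5e-8 cutoff.
print(bonferroni_threshold(0.05, 1_000_000))

# Hypothetical weights and genotype for three made-up variants.
weights = {"variant_a": 0.02, "variant_b": -0.01, "variant_c": 0.015}
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
print(polygenic_score(weights, person))
```

Because each weight is tiny, no single variant matters much; the predictive signal only appears once thousands of small contributions are added up.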
It's not any kind of crazy determinism that some people think you're gonna get into when you talk about genetics, but it's a meaningful signal regarding educational attainment, right? Now in terms of how we're going to use this score, one of the things that is a big issue in this field is that these polygenic scores and these GWAS studies are almost entirely concentrated among European populations. There are now more GWAS studies of East Asian populations, particularly Han Chinese populations, than there were before, but still it's a very white field. And so the question then is, can you apply polygenic scores developed from a European ancestry population to other populations? Can you just go into a dataset like Add Health or HRS and generate a polygenic prediction for everybody? And there are great reasons to be very cautious about doing this. A particularly striking example, at least for me, is this paper that was published in Human Genetics in 2017. They're using some simulation of data from a GWAS study of height. And what they show here is what would happen if you had calculated polygenic scores for these different populations using a European polygenic score for height, the score people would use for height. It is the case that if you look at an African descended population, the score will be correlated with height within that population, right? But if you just looked at the polygenic score, and the African population is in red here, you would think that there's almost no overlap in the polygenic score between the European descended population and the African descended population, right? Even though anthropological evidence makes abundantly clear that there's not an actual height difference between these two populations, right?
So in other words, the scores may predict within a population, and they're gonna predict more weakly within a non-European population, but the idea that somehow you're then gonna be able to go and look at differences between populations, that is a source of extreme nervousness, right? And it's especially nerve-racking when we start talking about things that are much more morally freighted than height is, but height provides a good example of the underlying problem, right? For this reason, at this point, even research with samples that are nationally representative will almost certainly do its analyses only on the European descended, the white respondents, right? Okay. So nevertheless, can we be confident that these scores actually work? That is to say, that these scores are capturing causal variation, right? And like I said, it has to be understood in the context of other things that have been tried, other things where social scientists were told by medical experts at the time that this is gonna work, and then when applied to our samples, it didn't really work. So polygenic scores, why might we think that this works? Why is there so much enthusiasm for this, right? For this, we're gonna have a little foray into the state of Wisconsin. I wanna describe the Wisconsin data that I will be using. I'll do that by way of talking about, I don't know, this is an increasingly dated reference for audiences, right? But this really was, honest to God, at one point a thing, right? This TV show, Happy Days, was the number one show in the United States for three years in the 1970s, during its like seven or eight seasons. Happy Days takes place in Milwaukee, right? Milwaukee is in Wisconsin, just as first steps, right? It follows the adventures of Richie, and Potsie, and Ralph, and an early example of a sitcom breakout star, the Fonz, up there in the corner. But as you see, it takes place in the fictional Jefferson High in Wisconsin.
And it even happens that Richie, Potsie, and Ralph were graduating seniors in 1957 in the state of Wisconsin, which is exactly the sample that I am talking about, by amazing coincidence. That is to say, in 1957, in nearly all schools in Wisconsin, the seniors were given a survey, right? And one third of those respondents became the Wisconsin Longitudinal Study, right? Originally they were just surveyed for purposes of studying college aptitude sorts of things. Then it became sort of the leading study of social mobility for status attainment models, if you've ever heard of that, in sociology, and since then it's become a study of aging and health. But it's the one third sample, right? And so that means if you think about Richie, and Potsie, and Ralph, those three students, you could imagine one of those three, so Richie, for example, might be in the WLS, right? The Fonz, incidentally, would not be in the WLS because the Fonz did not graduate from high school, right? He was a couple years older. He did get his GED, incidentally, at the end of season three, but that would not be good enough to get him into the Wisconsin Longitudinal Study, right? So that is our data. The core sample of the WLS is based on about 10,000 people. The rates of survey attrition from the WLS over the years have been actually very good compared to other longitudinal studies, but of course you do have survey attrition, and for that matter, people start to die and such. For the DNA sample, we first collected saliva used for DNA in 2006 or so, and then we tried in 2011, 2012 to collect saliva from people who hadn't given it or whose samples weren't very good, and we have saliva from about 9,000 people, including 2,048 pairs, right? So the WLS is a graduate from the high school graduating class of 1957 and a randomly selected sibling. So if you remember Happy Days, Richie had a sibling, Joanie.
Joanie could have been selected into the sample as well. They could both be in our sample and they could be a sibling pair. Okay. So that's our data. Let's actually take up a side question. We're getting back to the effectiveness of polygenic scores, but can we actually see ancestral structure just in the white WLS respondents? There's a graph that I have seen probably a dozen or more times in different talks in this area. Has anybody seen this before, right? Some people know it? Well, we'll see. So this graph was an early example from 2008 of someone taking, actually it was not at that time genome-wide data, but it was data from a lot of different genomic markers, doing a principal components analysis, and then laying the first two dimensions of that principal components analysis over a map, right? And what that showed is that you could locate respondents by their place in the components, and this is where the respondents actually were. If you spun it around, you could sort of reproduce a map of Europe based on genetic markers, right? Now, if we think about that with the state of Wisconsin, we imagine, well, we could take our genome-wide data and we could do a factor analysis and we could put it over the state of Wisconsin, right? That's probably not a very good idea, right? Because white Wisconsin respondents are all migrants from somewhere on this map, from different regions on this map, right? So that itself wouldn't make that much sense, but in the Wisconsin Longitudinal Study, we've asked people what country in Europe their ancestors are from, right? And so what we can do is take respondents who say both of their parents were from a particular country in Europe, and then we can look at this. Now they didn't locate exactly what city or whatever, but we do have the country of origin.
When we do that, what we can see, so red is where the parents are both British, and blue over there is where the parents are both Polish. What you see is that you can actually reproduce this in the WLS. You can see an east to west or west to east axis consistent with where these countries are located in Europe, based on the first two principal components in the WLS data. Now we can also start to ask questions about whether we can get other information relevant to evaluating self-reports in surveys from genomic information. For example, in the state of Wisconsin, there is a Native American reservation that's coterminous with a county in Wisconsin. That high school was not included in the Wisconsin Longitudinal Study, but there are other survey respondents in the WLS who report Native American ancestry. Now a thing about reporting Native American ancestry on a survey, which some of you may be familiar with, is that there are a couple of reasons why those reports can be especially problematic. One is that there are some people who believe they have Native American ancestry because of some kind of familial lore, when perhaps that is not the case. And then it's also the case that there's a subset of survey respondents who really wanna assert that they are American versus some other descent, and so they will report that as well. But we can think to ourselves, well, we have a sibling report. And if they're full siblings, and of course having genetic data, we know they're full siblings regardless of what they might tell us, full siblings certainly have the same ancestors, right? So they have the same ancestry, and they could either disagree on this question or they could agree, right? And so we can look at that. And so this is, I should say, a distance from the center; don't look at the vertical axis, this is just jitter.
But in other words, this is the center of those principal components, moving out to the most genetically distinct people out there in the sample. And if we do that, we can see from this that the orange dots are cases where the siblings disagree and the blue dots are cases where the siblings agree. We can see in the WLS sample that in fact, when the siblings disagree, they're much more likely to be indistinguishable from respondents who don't report any Native American ancestry than the respondents whose siblings agree. Yes, okay. Now let's get back to this question of whether we can feel confident that the polygenic score is capturing causal variation, right? And well, we can just look at the sibling correlation, which hopefully is going to, okay. There was a great scatter plot here and it's gone. But I think we will otherwise be okay. The correlation is about 0.53 between the two siblings in their polygenic score. But this is what I want to show. Instead of just looking at the relationship between the polygenic score and educational attainment, we can estimate, and it's a simple sort of thing but it's intuitively very clear, a between-within model. That is to say that we'll take the polygenic score and we'll make it actually two variables, right? One variable will be the average polygenic score of the siblings and the other will be the deviation from that average, right? So first of all, if it were the case that the polygenic score was just noise in the WLS, then there wouldn't be any between or within effect, right? But if it were the case that the polygenic score was just picking up either ancestral confounding or some kind of confounding that was not actually causal, we would expect that there are gonna be differences between the pairs, right? But there won't actually be differences once you look within pairs. So in other words, we'll have an effect here but we won't have any effect there, right?
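The logic of the between-within model can be sketched with simulated sibling pairs. Everything here is a hypothetical illustration rather than the WLS analysis: the data are simulated, the function names and parameter values are invented, and `confound` stands in for any family-level factor (ancestry, family environment) that tracks the family's average score without being caused by it. Because the pair average and the deviation from it are uncorrelated by construction, each coefficient can be computed with its own simple regression slope.

```python
import random
from statistics import fmean


def simulate_pairs(n_pairs, beta, confound, seed=0):
    """Simulated sibling pairs as [(pgs, education), (pgs, education)].

    beta: true causal effect of the score on education
    confound: how strongly a family-level environment tracks the family's genetics
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        family_genetics = rng.gauss(0, 1)
        family_env = confound * family_genetics + rng.gauss(0, 1)
        sibs = []
        for _ in range(2):
            # Half shared, half individual: sibling score correlation of about 0.5.
            pgs = (family_genetics + rng.gauss(0, 1)) / 2 ** 0.5
            education = beta * pgs + family_env + rng.gauss(0, 1)
            sibs.append((pgs, education))
        pairs.append(sibs)
    return pairs


def slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def between_within(pairs):
    """Return (between, within) coefficients for the score."""
    between, within, outcome = [], [], []
    for (p1, y1), (p2, y2) in pairs:
        pair_mean = (p1 + p2) / 2
        for p, y in ((p1, y1), (p2, y2)):
            between.append(pair_mean)      # average score of the sibling pair
            within.append(p - pair_mean)   # this sibling's deviation from it
            outcome.append(y)
    return slope(between, outcome), slope(within, outcome)


for confound in (0.0, 0.3):
    b, w = between_within(simulate_pairs(20000, beta=0.3, confound=confound))
    print(f"confound={confound}: between={b:.2f} within={w:.2f}")
```

In this setup the within coefficient stays near the true effect either way, while the between coefficient is inflated only when family-level confounding is present, which is the contrast the sibling comparison is meant to expose.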
And so when we do this in WLS, right? This is the magnitude of the effect of the polygenic score on years of education that we observe between pairs. So the question is, is that gonna be similar to that, or is that gonna be closer to zero, okay? We all got that, right? In fact, this is what we observe, right? No attenuation whatsoever, in fact, in the polygenic score in the between or within, right? And that to me, for one, until I had the data and could run something myself, I was maybe not ready to accept it, but it's very hard to explain, and this has been replicated in other samples even before we showed it in WLS, how you would otherwise have a within-pair effect of the score. That is to say, what this means, just to be clear, is that the sibling with the higher polygenic score is more likely to be the sibling that has more educational attainment, right? If we look at the siblings who are kind of in the lower half of the distribution of the magnitude of the difference, so the sibling scores are pretty close, then that probability is gonna be pretty close to 50-50. If we look at the full siblings who are kind of in the upper half, having the bigger difference, it's about a 70-30 guess. That is to say, the sibling with the higher score will have the higher education, if they differ, 70% of the time. It's not a small difference, right? If you were using it to gamble on kids' education, that would be a big advantage, right? Now we could also see the same between and within with some other measures we could look at. In the Wisconsin Longitudinal Study, we have a measure of their test scores when they were in high school, right? It was Wisconsin state testing. They gave tests in freshman and junior year of high school, and those are all matched to the respondents. And we can see, again, we have a very consistent between and within effect for cognitive test scores. This is using the educational attainment score to predict cognition.
We could also see, right? We could also see that if you look at the children of the WLS respondents, the respondents reported a roster of their children and whether their children had a college degree. We could also see that the polygenic score predicts both between and within families, right? So the sibling with a higher polygenic score is more likely to have a higher proportion of their children that they report finishing college. Now one could ask as well, is the between and within always the same, and that sort of thing. There was a finding, a couple of papers have come out that have reported that the polygenic score for educational attainment is associated with the number of children that people have. And this has in fact been used to make an argument, the effect is small, and so the argument would be that it corresponds to like a year of educational attainment over like a thousand years or something like that. But there's this idea that this is associated with fertility. In fact, in the Wisconsin data, while it doesn't attenuate all the way to zero, you can see when we use between-within there, and this is negative because higher scores are associated with fewer children, that within families that's actually not the case. So the studies that have been published so far have not had within-family data to show this. And when we look at the within-family data, we do not see evidence of there being a relationship between the score and fertility itself. Okay, so then what can we do once we have that polygenic score? And I'm just gonna roll out some applications that we've been talking about with WLS. One is we can look at differences among schools, right? And so a question when one looks at school effects is, okay, well, more kids from this school go to college than kids from that school. But does that reflect selection going into the school?
Does that reflect just differences between the students? Are there actual differences between schools? And we can look at this. And so the model that I'm gonna show is gonna have Add Health and Wisconsin data. We're gonna have a bunch of different regression slopes for the different schools, is what I'm going to show. And what we found, right, it's even more kind of marked in Add Health. In Wisconsin, it gets a little bit patterned because of the fact that the focal respondents are at least high school graduates. There's some truncation that we have to kind of work with. And this is probably the more striking demonstration of the pattern. What we see is we certainly see big intercept differences. That is to say that people who have the same polygenic score for educational attainment can have vastly different probabilities of going to college, right, or different ultimate educational attainment. But what we don't see, and this is closer to this, I think, when one corrects for the attenuation, is great evidence of differences across schools in what you could think about as the returns to the polygenic score for educational attainment, right? So in other words, there are clear differences of some overall mean level effect between schools. There's less evidence, and in fact maybe a clear demonstration that there are not differences, of schools being more efficient or effective, or whatever metaphor one would want to use, in the relationship between the score and educational attainment, right? Now can we use the polygenic score to estimate genetic confounding in the relationship between parental SES and educational attainment? That is to say, right, there's a big literature showing the clear relationship between parents' socioeconomic status, parents' education and such, and children's educational attainment, right?
But that literature has always been dogged by the idea, well, maybe this is just genes causing educational attainment and not parental SES itself. And so with the polygenic score, can we sort of get at that confounding? Now, so you might think about it, okay, well, we can do a regression of child education on parental education, and then we'll put in the polygenic score as a control variable, and we'll see how much the coefficient attenuates. Now if you just do that, the coefficient doesn't really attenuate very much, right? But then you think, well, that score is, like I said, I mean even now it only predicts 10%, and the earlier version only predicted four or five percent. That's not really a very fair test. You have to think about what the degree of random measurement error is, right? And so we've done a little bit with this, but you really have to involve an idea of what you think the structure of that measurement error is. Even if you just think it's random measurement error, you have to have an idea, well, how much signal do we think is in that versus the total amount of noise in that signal? But just to kind of show you what it looks like, zero attenuation would mean that there's no reason to think that there's any genetic confounding whatsoever. A hundred percent would mean that, oh, it looks like the whole thing is genetic confounding, right? You could say, what's it gonna look like? Imagine, predict. It turns out that it varies a lot depending on what the underlying assumption of variance is, right? But even under the strongest assumptions that you can imagine in terms of how weak the signal is, it is no more than 70% attenuation, and under the weakest assumptions, maybe around 30%.
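To make the dependence on the measurement-error assumption concrete, here is a rough sketch of one textbook way to do such an adjustment: a classical errors-in-variables correction for a noisy control variable. It is not necessarily the method used in the work described; the function name, the standardized-correlation inputs, and every number below are illustrative assumptions. `reliability` stands for the assumed share of the observed score's variance that is true signal.

```python
def parental_coefficient_attenuation(
    r_parent_child, r_parent_score, r_score_child, reliability
):
    """Share of the parent-child education association removed by the score.

    All variables are treated as standardized. The observed score is assumed
    to be true signal plus independent random noise, with `reliability` the
    signal share of its variance (an assumption, not an estimate).
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    denom = reliability - r_parent_score ** 2
    if denom <= 0:
        raise ValueError("reliability too low for these correlations")
    # Coefficient on parental education after controlling for the
    # error-corrected score (two-regressor normal equations).
    adjusted = (
        reliability * r_parent_child - r_parent_score * r_score_child
    ) / denom
    return 1 - adjusted / r_parent_child


# Made-up correlations, chosen only to show the sensitivity to reliability.
for rel in (1.0, 0.5, 0.25):
    share = parental_coefficient_attenuation(0.4, 0.2, 0.3, rel)
    print(f"assumed reliability {rel:.2f}: {share:.0%} attenuation")
```

With the same observed correlations, lowering the assumed reliability raises the estimated attenuation, which is why the answer moves so much with the assumption about error.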
I probably think that the real answer is gonna prove to be somewhere in here, but we'll see. The answer, in other words, is that yes, there is some confounding of that association, but no, the relationship is not entirely confounded, right? But between those, it's gonna depend a lot on assumptions about error, okay? All right, let me, well, I'll mention it. So this paper, which got a lot of attention, in Psychological Science, by collaborator Dan Belsky, they used the Dunedin cohort, so these are people based in New Zealand, and they showed in fact that the polygenic score was associated with leaving New Zealand, so people with a higher polygenic score were more likely to leave the state of, or, not the state, the country of New Zealand for elsewhere. And so that raises the question of, well, we've got people from Wisconsin, is it associated with leaving the state of Wisconsin? And in fact it is, we can show evidence that it is. I grew up on a farm, so I feel like I could study this part of it, and I had thought that the relationship might be stronger for rural areas, but that's not actually the case, we have not found evidence of that. If anything, there might be more out of Milwaukee than out of rural areas, but there is a clear relationship there. And then what we've been trying to do, so this is sort of tentative, but I mean, the thing about educational attainment is that it's associated with so many different post-education outcomes, right, virtually everything, this is why it's so widely studied.
And when you think about data sets, something like Add Health, or the Wisconsin data set, or the Health and Retirement Study, large-scale survey projects, right, they tend to be assemblages of a lot of different interests, as a result of which they measure a lot of different things, a large number of different life domains. They may not measure any of them all that super well, but they measure a lot of different sorts of domains, a lot of which then show themselves to be associated with educational attainment. And we just kind of take it for granted that educational attainment is gonna be associated with things that we study. It's more interesting when it's not. And if we just think about this in just a very simple, like first month of regression class sort of framework, right, we might imagine, and this is what's so different about a propensity score, and I think that for social scientists it really takes some getting one's head around, in a way that I think the field collectively is going to have to figure out. Like, certainly we're all used to dealing with antecedent variables, but the idea of something that is developed fully as a predictive measure of something that we have observed, right, that is very, very different. And the way that you'll see this manifested is people lapse into wanting to talk about that polygenic score as something, like you'll see there are some economists who've done papers essentially equating that with ability, right, and just using it as a proxy for ability, right? But there are ways that you could measure something like ability, and that's not the score, right? The score is everything that predicts. The closest analog I can think of is, like, imagine if you had a data set where you had people's credit score, right?
So you have the credit score, but you don't know otherwise how that credit score was made, it's like it just came from God, or in our case came from the people who did our genotyping, whatever, you get the score, right? And so then it would be a puzzle to figure out, well, what are the things that we have in our data set that are associated with it, and how does that score work? And in that process of figuring that out, you would learn a fair bit about credit default, right? And the causes of default and things, and that's possibly the promise of this, right? But just from a simple causal perspective, the idea of a propensity score, right? It could be the case that any effect of that propensity score, of the things that go into that propensity score, on things that happen after educational attainment is fully mediated by educational attainment, right? It could even be the fact that educational attainment is a cause, right? And the second model we would all be familiar with would be the idea that, well, the propensity score contains things, and those things cause educational attainment and also those other later-life outcomes, right? And then we would know from regression class, okay, well, that means that if you look at the association between the propensity score and the outcome and then you put educational attainment in the model, here, the correlation, right? It's gonna go to zero conditional on educational attainment, right? And there it's not gonna be attenuated much at all, right, conditional on educational attainment. And we have those sorts of two lines of expectation, right? We're just applying that here. And so if we look at things like occupational prestige or psychological well-being, if I would have had more room on the slide here, I might have put that migration measure in from Wisconsin, what we will see is this pattern of, okay, well, it's basically all in the education variable, right?
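The two lines of expectation just described can be checked with a small regression sketch. This is only an illustration under stated assumptions: the helper names and the toy data are made up, and the two scenarios are stylized extremes (the outcome depends only on education, or only on the score), not results from WLS.

```python
from statistics import mean


def _cross(x, y):
    """Sum of cross-products of deviations from the means."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))


def slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    return _cross(x, y) / _cross(x, x)


def slope_controlling(x, z, y):
    """Coefficient on x in a regression of y on x and z (with intercept)."""
    sxx, szz, sxz = _cross(x, x), _cross(z, z), _cross(x, z)
    return (szz * _cross(x, y) - sxz * _cross(z, y)) / (sxx * szz - sxz ** 2)


def share_attenuated(score, education, outcome):
    """Share of the score-outcome slope that disappears once education is added."""
    return 1 - slope_controlling(score, education, outcome) / slope(score, outcome)


# Toy data (made-up numbers): education is related to, but not a copy of, the score.
score = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
education = [11.0, 10.0, 13.0, 12.0, 15.0, 14.0]

# Expectation 1: the outcome depends on the score only through education.
mediated_outcome = [2 * e for e in education]

# Expectation 2: the outcome depends on the score directly, not on education.
direct_outcome = [3 * s for s in score]

print(share_attenuated(score, education, mediated_outcome))  # full attenuation
print(share_attenuated(score, education, direct_outcome))    # no attenuation
```

Real outcomes would fall between these extremes, and where each outcome falls is the quantity being compared across domains.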
So there is a relationship between that polygenic score and occupational prestige, but it goes away when you put education in the model. Because if you look, and we looked across various domains, that educational attainment polygenic score is significantly associated with a whole lot of things, right? But that's really not that surprising, right? Because educational attainment is associated with all these things, that is what we would expect if we had enough sample, right? We see the same thing with psychological well-being, but that's not what we see with every outcome, right? So for example, if you look at being a non-smoker, there is some attenuation, but it's hardly all attenuated after that. So there are some outcomes for which there's not actually very much attenuation. And our conjecture, but we're at early stages of this, is trying to use this attenuation across a large set of outcomes to try to figure out what this polygenic score is and how to understand it, right? And so, this is gonna be too tough to explain in the time we have left, because I was supposed to leave a little bit of time for questions, so I'm just gonna go right past this, I think. The basic thing that we are kind of observing, you'll just have to take my word for what this graph means, is that it's especially things like health behaviors that have very little attenuation, and then things that are really kind of associated with long-run attainment, in many cases, are the things that have the very complete, almost full attenuation, right? But we're gonna skip this. And I just wanna conclude by saying, what do I think are going to be the big directions in this area, how are polygenic scores gonna be used? I think there's gonna be four big fronts at which they're used.
What I mean here is to say the polygenic scores pose a puzzle, like I was saying, this credit score puzzle, and there's that process of figuring out what that polygenic score ultimately means, how we can decompose it. The polygenic score, to be clear, is not just cognition. We have a very good measure of cognition in the Wisconsin Longitudinal Study. That explains on the short side of half of the association, something like 40%, so it's not the case that the polygenic score is just a proxy for cognition. It is abundantly clear that that is not the case. It does have promise as a statistical control if we can figure out how to adjust for it. It also, on this front, has great promise for people interested in RCTs of educational processes, for being able, perhaps, to increase the power of those studies, because a lot of those interventions are very expensive and every data point is valuable. We can think about it as a moderator of different kinds of associations in an interactional framework. And then we can also think, and this is a benefit of sibling data, about natural experiments. So economists, although maybe doing a little bit less of this than they were five or 10 years ago, have this great hunt for exogenous variation. And so you'll see papers where some kind of weird policy change in Manitoba is latched upon and studied because it provides this great natural experiment. But an appealing aspect of genetics, when you're used to seeing that, is that all of us with full siblings are in effect natural experiments, right? Because we have the random variation between ourselves and our siblings. So that offers the possibility of a great amount of power going forward for that design. Great. That's my question. Okay? Yeah. Yeah. Yeah, so, and that's an issue for this design, right? Because if there's sex differences, you would see that. So we had two pieces of leverage on that. Three, actually. One is just for the scores, we just used within-sex standardized scores.
We eliminated any mean difference that way. We also put it all in a multi-level model where we had sex in the model. So we could get at it that way. There is, especially in reading, a difference where girls do better on reading tests. So same-sex girl pairs are doing better than same-sex boy pairs in that respect. And then the third thing that we did, which I'm sure was eminently clever, has just gone out of my brain at the moment, but perhaps it will come back. Yeah, yeah, and so here would be the broad front. So when these effects first started coming, there would be the idea, like, is this just all gonna be cognition, right? That's clearly not the case, but it's partly the case, right? So in other words, what I mean by cognition are either differences in knowledge itself or other ways of measuring learning. Things measured by standardized tests. There's also a big front in, I mean, and this by itself should indicate that parts of the social science literature should take help wherever they can get it, which is that in social science, there's a thing called non-cognitive skills that you may have heard of. I mean, just kind of get your head around that as a phrase, because it's a thing, but like, it's very vague, right? And so, if you take something like attempts to measure self-regulation or conscientiousness, there are reasons to think that that is part of it. Maybe 10 or more percent of that could be decomposed by that. And I mean, one of the things that's appealing to me about, well, the other thing just to mention would be, there's also the possibility of things that actually are pure environmental effects but are working through the parent that's in the polygenic score, right? So that's a possibility. That's gonna be very hard to untangle, I think, right now, but it's on the horizon.
But for those of us who work in survey measurement, one of the reasons non-cognitive skills has had difficulty getting more precision is that these things are very difficult to measure at scale, right? They're difficult to measure in the first place, like, do you tend to make short-sighted decisions that screw up your life? You can't really ask that in a direct kind of way, right? And you can't like send two marshmallows in the mail to people, you know, or whatever things people do, right? And so there's the idea that if we have, you know, genomic data, so in my other hats that were mentioned, I'm in the survey world, and straight survey methodology, as those of us who do it know, is weird for a scientific instrument: it's not getting better, and is in fact getting worse, right? Because we're having more difficulty getting people to respond and such. We're using administrative data as a big hope for that. But then another thing is, if we can use this, if we recognize it as a partial cause, we're not being determinist or anything, but as a partial cause or a proxy, we may be able to get leverage on things like non-cognitive traits in ways that we don't currently have. So I think that's the real promise for this area. Did you have a follow-up? Yeah, yeah, yeah, yeah. Yeah, so there's just a general cause for concern about mortality selection as a biasing effect for looking at these things. That is to say that, so for example, in our study, the modal birth year of a Wisconsin Longitudinal Study respondent is 1939, and so they would have had to survive to be 67 to be in our DNA sample, right? A fair number of people, especially men, are not alive at age 67. So that's something to adjust for. Some work has been done on this in HRS. It's clearly associated with mortality. In our data, it's currently in the right direction, but not significant, but I think that that's just because we're not yet powered enough for it, yeah.
Yeah, yeah, yeah. Well, so this is a source of really big interest. Like if this worked, this would be a blockbuster, right? So the problem with it is the question of pleiotropy, right? That is to say, the idea that for most things that we would want to use it as an instrument for, we can't satisfy the exclusion restriction. That's to say that if you imagine looking at the effect of educational attainment on mortality, right? You wouldn't have grounds for ruling out that the same genetic variants causing educational attainment have some non-educational pathway effect on mortality, right? And so the question is, if that is simply going to be the case, that is to say, if the only way that instrumental variables are effective is kind of the way you'd think about it if you have an econometric background, this will not work for that reason. Now, the question is whether or not, this is a very different instrumental variables problem, because in a usual instrumental variables problem, there's not the idea of like 100,000 weak instruments that can be put in, right? So this is still developing; this is very much a work in progress. And one of the ideas that is getting a lot of traction, and I'm skeptical of it, would be that you can essentially imagine trying to use each of those individual weak instruments as instruments, and then looking at whether they have an ultimate effect distribution consistent with pleiotropy not being a problem, right? Which takes on ideas from meta-analysis, right? Because it's almost like a publication bias problem, because pleiotropy would cause certain distortions that are like publication bias. So I'm not sure about that particular technique, but the fact that people are thinking in that direction makes me not willing to rule out of hand that that will eventually work, but I think that it's going to require an innovation, in that you can't just use off-the-shelf IV techniques.
Yeah, in that, yeah, yeah, yeah, yeah. Well, so the larger question there is there being a lot of underlying change in educational attainment. I mean, I think that's really mentally useful for thinking about what a polygenic score would be, right? And it moves us away from any kind of deterministic notion, right? Because you've got parents and children where you've got a big mean difference, and clearly there's no genetic difference, right? And so it raises, I mean, there are big, profound underlying questions about social mobility here, right? Like in terms of the extent to which, when you have like a voluntary migrant population, right? And so you have a characteristic upward mobility and some people showing less mobility. How does that associate with the polygenic score, right? You can imagine a world where that is very predictive, right? You know, because two siblings in an immigrant family will often have very different outcomes, or you could imagine it being much more stochastic, right, than that. And so I think it gives leverage for that kind of way of thinking about interactions. And you could also imagine, well, maybe it's the case that for something like enclaves, you know, does settling in an enclave lead to a difference in the sort of polygenic effect? So I think there are a lot of interesting questions, right? And the fact that there's this sort of either, you know, upward mobility or varying mobility or changing positions of migrants over time actually offers a lot of leverage, should scores work well. Yeah, okay, thank you very much. Thank you.