Good afternoon, everyone. I want to start by wishing you a happy new year. My name is Vence Bonham. I'm the Acting Deputy Director at the National Human Genome Research Institute, and I'm pleased to introduce our lecture series. I want to welcome you to the 30th lecture of the NIH Genomics and Health Disparities Lecture Series, which began in May of 2015. The series aims to highlight the opportunities for genomics research to address health disparities and advance health equity. The series is cosponsored by four institutes at NIH: the National Heart, Lung, and Blood Institute; the National Institute of Diabetes and Digestive and Kidney Diseases; the National Institute on Minority Health and Health Disparities; and the National Human Genome Research Institute. We're also pleased that our collaborator and cosponsor is the Office of Minority Health and Health Equity at the Food and Drug Administration. Speakers are chosen by the institutes that sponsor this program, with an eye to making sure they bring a conversation about health disparities and health equity as reflected in genetics and genomics. The series approaches these issues from the different perspectives of basic science, population genomics, translational, clinical, and social science research, and policy issues of importance to health equity, health disparities, and genetics and genomics. We are pleased to have Dr. Genevieve Wojcik joining us this afternoon, and my colleague Dr. Jamil Scott, Senior Scientific Program Analyst at NHGRI, will introduce Genevieve.
Good afternoon. Dr. Genevieve Wojcik is a genetic epidemiologist and assistant professor at the Johns Hopkins Bloomberg School of Public Health.
Her research focuses on understanding the role of ancestry in genetic risk and developing solutions to address health inequities for diverse and admixed populations, as well as genetic susceptibility to infectious disease. Her most recent work explores the interaction of genetic ancestry and the environment in admixed populations and the downstream consequences for genetic risk prediction. Dr. Wojcik is a member of multiple NIH consortia, including the Population Architecture using Genomics and Epidemiology (PAGE) study, the Clinical Genome Resource (ClinGen), and the Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium. Following Dr. Wojcik's talk, Mr. Bonham will facilitate the discussion. Virtual attendees, please submit your questions at any time to the Q&A box. Those in person, please approach the microphones after the lecture; there are two in this room. Now please join me in welcoming Dr. Wojcik.
Hi, everyone in the room and everybody online. First, I want to thank the NIH for inviting me to speak with you today. I'm going to go right in, because I worry I'm going to run over a little bit. My research program is based on a fundamental disconnect in our research: we know that in the United States there's a disproportionate burden of chronic disease on minoritized groups, and yet the majority of our genetic work is done within groups of European ancestry and European descent. So where the burden lies is not where we're focusing a large amount of our work. When I think about how to address this gap and where we need to look, I'm an epidemiologist by training, so I always refer back to DAGs and conceptual frameworks. Really, it's about how we model the systems. We've had a large push for big data over the last decade or two, and that's been very fruitful. Now that we have the data, how can we refocus and ask questions to better see what's going on?
To look at this, what we're going to do throughout this talk is walk through this framework using evidence from my own work and my group's work, as well as work from colleagues in the field at large. When we ask these questions in genetics, you really just think about the relationship between the genetics and the trait. That's the very clear-cut relationship we're thinking about: if this, then that. But what's actually happening is something way more complex. You have population genetics affecting the distribution of the actual genetics you're measuring. You have racial and ethnic identity in the mix as well, with the sociopolitical context of that. You have other individual risk factors, from lifestyle to biomarkers, as well as more macro-scale social and structural determinants of health. So there's a lot going on. When we're looking for this association between genetics and outcome, it's important for us to think about among whom we are looking. And if we're looking at relative risk, relative to whom? That's interesting to think about from a population perspective.
When it comes to the data, we can take a look at what we see. What I'm showing you here is from the GWAS Catalog: published genome-wide association studies. These are large-scale genetic studies looking at the association between individual markers and traits. What I'm showing you here is the mean sample size. I think many people are used to seeing the total number of participants over time, which has increased. But I think it's also worthwhile to look at the mean sample size, because that's really a rate-limiting step when it comes to discovery: how much statistical power do you have? What you're seeing here is the annual mean sample size in the dashed lines, and the cumulative knowledge base in the solid line.
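To make the distinction between total participants and mean sample size per study concrete, here is a minimal sketch. The records are invented toy numbers, not GWAS Catalog data, and the function name `annual_summary` is a hypothetical helper introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (publication year, ancestry category, sample size).
# These numbers are invented for illustration and are not GWAS Catalog data.
studies = [
    (2016, "European", 50_000),
    (2016, "Non-European", 8_000),
    (2017, "European", 150_000),
    (2017, "Non-European", 9_000),
    (2017, "Non-European", 7_000),
    (2018, "European", 400_000),
    (2018, "Non-European", 10_000),
    (2018, "Non-European", 6_000),
]

def annual_summary(records):
    """Return {(year, category): (total participants, mean sample size per study)}."""
    sizes = defaultdict(list)
    for year, category, n in records:
        sizes[(year, category)].append(n)
    return {key: (sum(ns), sum(ns) / len(ns)) for key, ns in sizes.items()}

summary = annual_summary(studies)
for (year, category), (total, mean_n) in sorted(summary.items()):
    print(f"{year} {category:<13} total={total:>7,} mean per study={mean_n:>9,.0f}")
```

In this toy data the total number of participants in the second category doubles between the first and last year while the mean per study does not move, which is the pattern the mean-sample-size view is meant to expose.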
What you're seeing is that over the past five or six years, there's been a meteoric rise in the mean sample size, with much, much larger studies being published. This is largely driven by the UK Biobank and similar large-scale biobanks. You can see they're colored here by how the GWAS Catalog categorizes these participants, and so you see this big jump in the European populations. But if you zoom in on everybody else, it's relatively stagnant. So even if the total number of participants is increasing, the sample size, and so how much statistical power you have per study, is relatively the same. So we're not doing as well as we could be in this space. It's important to think about not only the total number of participants but, per study and per question, what your power is like with this big data.
So I try to think about how we got here. It's important, when you get to a point, to understand the systems in place for how you got there. And if I try to peel back layer after layer after layer, it's turtles all the way down. By that I mean there's a bias baked into the system that permeates every single step. Over the past few years, a lot of this has been improved upon. But if you look at what we see with genomic health inequities and go back one step to translation, the basic work and our follow-up are often done in these large-scale biobanks, which are majority white or of European ancestries. If you go further back than that, a lot of the discovery methods we use are, out of the box, made for a single homogeneous population and therefore not made to account for the large level of diversity in our populations. Again, this has been changing somewhat in the last several years, with methods that incorporate diversity becoming more standard.
And then think about who is sampled and how: historically, even the genotyping arrays were mainly based on one population at a time, mostly of European ancestries. And who we include in our studies in these consortia is not very diverse. Again, these things are changing, but it's a reactionary process to an inequity that exists. It all boils down to what we accept as the default in our research, because it seems that for a lot of things we accept a relatively homogeneous white population as the default, and then we try to catch up with everybody else later on.
This imbalance in genome-wide association studies is exacerbated in polygenic scores. What I'm showing you here is published polygenic scores. Polygenic scores are a risk measurement where you're summing genetic risk across the entire genome, and the PGS Catalog curates these. What you're seeing in the first panel, on the left-hand side, is, of all the published polygenic scores, what proportion include just those categorized as European or of European ancestries. You can see most of them are those solid blue boxes. There are some that are mostly European but maybe include other people, which are the plus signs there. In the middle, what I'm showing you are populations I focus on in my work: all the published PGS that include anybody identified as African American, African Caribbean, or Hispanic or Latino. That is definitely not reflective of global demographics, and it's definitely not reflective of the US demographics you see in the last panel, on the right-hand side. So you have these inequities and disparities in your GWAS, and since polygenic scores are often built upon GWAS summary statistics, it's even more exacerbated moving downstream from that.
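As a minimal sketch of what summing genetic risk across the genome means in practice, a polygenic score can be written as a weighted sum of effect-allele counts, with the weights taken from GWAS summary statistics. The variant IDs, weights, and genotypes below are invented for illustration, and treating a missing variant as zero copies is a simplifying assumption of this sketch.

```python
# Hypothetical GWAS summary statistics: variant ID -> per-allele effect size.
# The IDs and weights are invented for illustration only.
weights = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30}

# Hypothetical genotypes: number of effect alleles (0, 1, or 2) per variant.
genotypes = {
    "person_1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "person_2": {"rs_a": 0, "rs_b": 2, "rs_c": 0},
}

def polygenic_score(dosages, effect_sizes):
    """Sum effect size times effect-allele count over the variants in the score.

    Assumption of this sketch: a variant missing from `dosages` counts as 0.
    """
    return sum(beta * dosages.get(variant, 0) for variant, beta in effect_sizes.items())

scores = {person: polygenic_score(d, weights) for person, d in genotypes.items()}
for person, value in scores.items():
    print(person, round(value, 3))
```

Because the weights come from whichever sample the GWAS was run in, any imbalance in that sample is carried directly into every score computed this way.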
So we have this lack of representation of global diversity in our genetic scores. Why do we care about diversity? I think there are a lot of reasons, but from a methods standpoint, it's partially because we know through many studies, including the work of Alicia Martin and the work I'm showing you here from Yi Ding and Bogdan Pasaniuc, how badly polygenic scores do in other populations. Here, what they did is train polygenic scores in the UK Biobank White British sample and then apply them to their biobank in LA. What you see, across the different groups as they categorized them, is that each dot is a person, showing how well the risk score did in that person: individual-level accuracy. What's important to note is that there is a large amount of heterogeneity, both between these groups, as expected, and within these groups, so you see overlap in the distributions. So yes, there are differences between these groups, but when you discretize group membership and look at the accuracy, there's also a large amount of heterogeneity within them. It's important to have broad representation in your model not only between the categories but within the categories as well.
Okay, so I'm going to walk us through an example. We know that PRS perform differently by racial and ethnic group, which is largely how we've been framing it here. For this research question, we want to know how this is further complicated by recent admixture, when you have a lot of heterogeneity within a single group. And further, how is that complicated by heterogeneity in the environmental influences on these outcomes? What happens when, even within a group, you have a large amount of heterogeneity on both the genetic side and the environment side?
And how does that influence how we would model genetic risk? Again, I'm an epidemiologist by training, so I think about population definitions. When we think about these pathways for polygenic scores, often we think we only really care about the PRS-to-outcome path, and a lot of the method development that's been conducted tries to account for what else is going on by severing the relationship between population genetics, in terms of allele frequencies and linkage disequilibrium, and the polygenic score. But there are other relationships at play here that might introduce mediators or confounders that can be important for how we look at genetic risk. Remember, the goal for a polygenic risk score, and for relative risk, is that you want the assumption of exchangeability to be met. That means you want to find the counterfactual, which essentially says: if I want to compare an individual to a reference population, I want them to be comparable in everything except the measure I'm looking at, which is the risk score. So it's important for us to know whether the reference we're using is actually comparable, or whether we're introducing unknown factors into that comparison.
All right, so we're going to build out the DAG here. We're going to use BMI as an example. Now, BMI is not really an informative health-related trait; it's not really clinically relevant on the genetic side of things. But it does give us a really good example to work through, with a larger sample size, of a trait that is influenced by both genetics and environment, and differently between groups, as you'll see. So again, it's not meant to be clinically relevant to the story, but rather to illustrate what could happen with a very commonly measured outcome. Okay.
So the first relationship we're going to look at is within Hispanic/Latino individuals. We're going to focus on one group that is often assumed, in the genetics space, to be homogeneous, to be all the same at the level that's applied to other groups, and we're going to see whether that's true or not. We're looking again at the relationship between ancestry and SNPs, and then group membership, further substructure within this group, and ancestry itself. For this, I work a lot in the PAGE study. This is a long-running NHGRI study focused on minoritized groups; it started in 2008 and has been ongoing since then. This is from PAGE II, which ran from about 2013 to 2018. What I'm showing you here is a principal components analysis. On the left-hand side, every dot is a person, colored by how they self-identify; no clustering has been applied. To orient you: toward the right you have the African inferred ancestry component being pulled out, at the top it's East Asian, to the left and back is the Indigenous Americas component, and at the bottom it's Europe. What you can see right off the bat is that there is no real way to cluster these individuals into discrete groups that are meaningful for genetics. There's no clear cut point where I can say anybody on this side is one group and anybody on that side is a different group, because it's a continuous distribution. What I'm showing you on the right-hand side looks beyond PC1 and PC2, which explain the most variation, all the way to PC5, and what you can see is that within all of these groups there is a spectrum of diversity, where you don't have clear clusters but rather people being pulled out along numerous axes of diversity.
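For readers who want to see how principal components like these are obtained, here is a minimal sketch on simulated data. It is not the PAGE pipeline: the two source populations, allele frequencies, and sample sizes are all invented, and `principal_components` is a hypothetical helper. Each simulated person draws a continuous admixture proportion, so the first component tracks a gradient rather than discrete clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated allele frequencies for two hypothetical source populations.
n_variants = 1000
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, n_variants), 0.05, 0.95)

# Each simulated person has a continuous admixture proportion, not a discrete label.
n_people = 200
prop_a = rng.uniform(0, 1, n_people)
person_freq = prop_a[:, None] * freq_a + (1 - prop_a[:, None]) * freq_b
genotypes = rng.binomial(2, person_freq)  # allele counts, shape (people, variants)

def principal_components(g, n_components=5):
    """Standardize each variant, then project people onto the top components via SVD."""
    g = g.astype(float)
    centered = g - g.mean(axis=0)
    sd = centered.std(axis=0)
    standardized = centered[:, sd > 0] / sd[sd > 0]  # drop monomorphic variants
    u, s, _ = np.linalg.svd(standardized, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]

pcs, explained = principal_components(genotypes)

# PC1 should follow the continuous admixture proportion.
pc1_corr = abs(np.corrcoef(pcs[:, 0], prop_a)[0, 1])
print("PC1 vs admixture proportion, |correlation|:", round(float(pc1_corr), 2))
```

Real analyses typically also prune variants in linkage disequilibrium and handle relatedness before computing PCs; those steps are omitted here to keep the sketch short.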
And certainly, because these samples of minoritized groups within the United States are admixed, and recently admixed, you see this large amount of heterogeneity within one group. So you see this diversity. I'm focusing on Hispanic/Latino groups. Typically, when you look at these studies, participants are categorized by racial and ethnic categories, usually a combination of OMB categories. So we're going to look at just the Hispanic and Latino individuals, and these are all self-identified. When we think about genomic research, it's important to think about what we actually mean when we say this is a group. Is this a group the way we think other groups are groups? Does it meet the assumptions we would make in terms of homogeneity? We can look at the PCs again. Here you have everybody who identified as Hispanic/Latino, and you see that they run the breadth of diversity across multiple axes; this is just PC1 and PC2. Now, within PAGE, a number of studies did allow people to further identify themselves. Within the Caribbean, people could further identify their ancestry as Puerto Rican, Cuban, or Dominican; within the continental US, Mexican; and then, for numbers' sake, Central America was combined and South America was combined. When you look at that, you do see some substructure: there are differences in where people fall on these spectrums based on where their ancestry comes from. These are all people who were sampled in the United States. So when we say this is a single group, which we usually do to ensure some level of homogeneity in the distribution of the genetics, that assumption is not really met; there isn't any kind of homogeneity within a group this diverse. Another way for us to look at this is an admixture plot.
Each line is a person, colored by the proportion of their genome that is assigned to each inferred ancestry component. For this run there were five components; you can see them on the right: inferred ancestry components of Africa, Europe, the Americas, Oceania, and Asia. Again, this is a very subjective estimate. These are not the truth about what actually existed in history, but rather artifacts of the data, of who went into the models. If you had a different subset of individuals, you'd probably see slightly different things. But this is what we're looking at for the five components. What you see already, even when we think about this not as multiple PCs but just as admixture components, is a wide range of proportions. On the left-hand side, the African American groups have on average about 80% of the African component and 20% of the European, which is consistent with prior research. On the right-hand side, with the Hispanic/Latino individuals, you see a wide range across at least three components, if not more. If you let people further self-identify, which is the bottom panel, you see differences. I think it's important to note that if I picked one single person out of this and saw their proportions, I could not tell you how they self-identify, but there are differences on average that can be informative for how we understand their genetics and what that means for their genetic risk. That's an important distinction: for a single person, you can't really tell from these data how they self-identify, but we're looking at populations and group-level dynamics, where it can be informative.
Okay, so we've already established that because of history, putting it lightly, there are differences in the genetic ancestry and backgrounds of these groups.
So let's move on to another relationship. Now we're going to look at a confounding relationship, which means the factor has to be associated with the exposure, which is the genetics, and also with the outcome, which in this case is BMI. So do we see a relationship between group membership and BMI in the literature? We can rely on colleagues' work; we don't have to do it ourselves, and there's a wealth of information about this. These are some studies that have been done in the Hispanic Community Health Study/Study of Latinos, published over the past decade. What they found is that between these groups there are differences in BMI distributions as well as weight gain trajectories, so both in a cross-sectional look at the trait and over time. And, to make it more relevant to health, there are also differences in the burden of cardiovascular disease risk factors between these populations, in both men and women. So we can rely on the epidemiological evidence to establish this relationship between group membership and the outcome of interest.
So we've done that; we have all these relationships that have been shown through different evidence bases, including our own work. Now we're going to ask: does group membership modify the relationship between these SNPs and BMI, and how? And what happens when it's compounded into a polygenic score? For this, we applied a published polygenic score for BMI to PAGE. This is from Khera et al., published in 2019, and it was trained in groups of European ancestry. That's important for relevance, but the question is what happens to performance across the different groups, in terms of ancestry as well as environment.
What we found is a positive relationship, on the left-hand side, between one of these admixture components, here the inferred Amerindigenous ancestry, and the score. You have this positive correlation where, if you have more of this ancestry component, you have a higher score on average. Now, there are really two reasons why this could be, and we want to find out which one it is. One is that this is a confounder: it introduces bias and therefore needs to be adjusted out. The other option is that this is actually informative for our risk model. We want to know which one it is, because you don't want to adjust everything out if it could be informative for your prediction model. Again, the point of these models is prediction of the trait. So what we can do is look at different models. On the right-hand side, what I'm showing you is the incremental R squared for the unadjusted score. Then you adjust for the top three principal components, and it goes up. That shows us that this is not a true relationship with the trait; it is some sort of confounder that needs to be adjusted for in the model. But what's interesting is that if you adjust only for this one proportion of ancestry, being very specific and explicit about what you're adjusting for, the accuracy is even better. This reinforces the notion that when you build prediction models and think about what you include versus what you don't, it's important to think through every one of the terms and what it's doing. Are you throwing the baby out with the bathwater by adjusting for everything, or are some things relevant for your outcome? And this differs based on your question.
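The incremental R squared comparison described above can be sketched as follows. This is an illustration on simulated data, not the analysis from the talk: the effect sizes, the single simulated ancestry proportion, and the helper names `r_squared` and `incremental_r2` are assumptions of the sketch. Here the simulated score shifts with ancestry while the trait does not, so adjusting for ancestry raises the score's incremental R squared.

```python
import numpy as np

def r_squared(y, *covariates):
    """R^2 of an ordinary least-squares fit of y on an intercept plus covariates."""
    x = np.column_stack([np.ones(len(y)), *covariates])
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ coef
    return 1 - resid.var() / y.var()

def incremental_r2(y, score, *base):
    """Gain in R^2 from adding the score to a base model."""
    return r_squared(y, *base, score) - r_squared(y, *base)

# Simulated data; every number here is an assumption made for illustration.
rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(20, 70, n)
ancestry = rng.uniform(0, 1, n)        # one inferred ancestry proportion
genetic = rng.normal(0, 1, n)          # part of the score that tracks the trait
score = genetic + 2.0 * ancestry       # the score also shifts with ancestry
trait = 0.5 * genetic + 0.02 * age + rng.normal(0, 1, n)

unadjusted = incremental_r2(trait, score, age)
adjusted = incremental_r2(trait, score, age, ancestry)
print("incremental R^2, base = age:           ", round(float(unadjusted), 3))
print("incremental R^2, base = age + ancestry:", round(float(adjusted), 3))
```

Whether adjustment helps or hurts depends on whether the adjusted variable is also related to the trait; if it were, adjusting it out could remove predictive signal, which is the trade-off raised in the talk.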
If you want to distill things to something independent of PCs, independent of background variation, then maybe you don't care that you lose predictive power; you just want everything adjusted out. But if you want better predictive power, you want to keep it in. There's more for us to look at. When it comes to a polygenic score, it's often about relative risk, so you don't just look at how well the model fits; you want to know how people are situated relative to other people. What I'm showing you here is that each dot is a person, and I'm showing their change in ranking before and after adjustment. If you're above the dashed line, that means that after adjustment for ancestry your rank, or your risk, was upgraded. If you're underneath the dashed line, that means that after adjustment your risk was downgraded. What you can see is that there's an inflection point, a pivot, around the average proportion of the inferred ancestry. There are two important points from this figure that I want you to take away. One is that there is differential bias within a single group, based on ancestry, depending on whether you adjust for it. Those with low proportions of this Amerindigenous ancestry had their risk underestimated beforehand, and those with high proportions had their risk overestimated. The other thing to think about is that often you assume a certain level of homogeneity within your sample, and so when you're modeling to the mean, that's okay: you don't have a lot of variance, so people at the edges aren't too far from the mean, and it won't make much of a difference. Now look at these very heterogeneous groups, and at the large diversity within a single group.
It becomes problematic when you model things to just this one mean, because you can see the tails have big variation in how much things can matter. So it's important to note, again, that it's inherent in the methods that you always want to model things to the population mean, but it's not just about mean differences; it's about variance differences as well. Something to think about. Then you can say, all right, Gen, I really don't care about rankings; we only care about people who are at the extremes of the distribution. What I'm showing you here is: okay, we take the top decile of this polygenic score and call that high risk, we take the bottom decile and call that low risk, and everybody in the middle, that's fine, you're normal; we don't really care where you are. Then you look before and after adjustment. What you see is that the majority of people stay where they were. Of course: if 80% of your distribution is in the normal category, it's not going to move around that much. But about 5% move up and 5% move down. And the middle panel recapitulates what we saw in the previous figure, where those with upgraded risk are in the lowest quintiles of this ancestry component and those with downgraded risk are in the highest. So again you have this relationship between ancestry and how your risk is misestimated by this risk score.
Okay, so now we're going to look between group memberships. We've talked a lot about what happens with ancestry differences and what that means for the risk score. What I'm showing you here is stratified further by how people self-identify within these Hispanic/Latino groups, looking at Central American, Cuban, Dominican, Mexican, Puerto Rican, and South American, with unadjusted and adjusted models.
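The re-ranking and decile reclassification discussed above can be sketched on simulated data. This is an illustration only: the score, the ancestry proportion, and the helpers `residualize`, `ranks`, and `risk_category` are assumptions of the sketch, and adjustment is done here by simple linear residualization, which may differ from the approach used in the talk.

```python
import numpy as np

# Simulated data; all values are assumptions made for illustration.
rng = np.random.default_rng(2)
n = 1000
ancestry = rng.uniform(0, 1, n)                     # inferred ancestry proportion
raw_score = rng.normal(0, 1, n) + 2.0 * ancestry    # score correlated with ancestry

def residualize(score, covariate):
    """Remove the linear effect of the covariate from the score."""
    x = np.column_stack([np.ones(len(score)), covariate])
    coef, *_ = np.linalg.lstsq(x, score, rcond=None)
    return score - x @ coef

def ranks(values):
    """Rank from 0 (lowest) to n - 1 (highest); ties are not expected here."""
    return np.argsort(np.argsort(values))

def risk_category(values):
    """Label the bottom decile -1 (low), the top decile 1 (high), the rest 0."""
    low, high = np.quantile(values, [0.1, 0.9])
    return np.where(values >= high, 1, np.where(values <= low, -1, 0))

adjusted_score = residualize(raw_score, ancestry)
rank_change = ranks(adjusted_score) - ranks(raw_score)   # > 0 means risk upgraded
category_change = risk_category(adjusted_score) - risk_category(raw_score)

# People with a low ancestry proportion tend to move up after adjustment.
corr = np.corrcoef(rank_change, ancestry)[0, 1]
share_upgraded = np.mean(category_change > 0)
share_downgraded = np.mean(category_change < 0)
print("correlation of rank change with ancestry:", round(float(corr), 2))
print("share upgraded:", share_upgraded, "share downgraded:", share_downgraded)
```

The rank changes sum to zero by construction, since every upgrade for one person is a downgrade for someone else; what matters is that the direction of the change lines up with the ancestry proportion.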
And what you can see is that both before and after adjustment there are differences in how well this polygenic score predicts BMI. Maybe this is because of ancestry differences in general; we saw those before in the admixture plots and the PCA. It could also be because of environmental context. So it's important for us to pick it apart and see what's actually going on, because if we don't know what's going on, we don't know what we should be aiming for, how to improve this, how much we can actually do, and what data we need to include in these models. Okay.
So, getting a more complete picture: we've now established this relationship between group membership and the relationship between the SNPs and BMI. What I'm showing you here again is model performance, so adjusted R squared, separated by these groups. What's important for us to note is that there's different performance, but this is an integrated model of the PRS in different populations. What you see is it parsed out into the polygenic score, adjusted for ancestry, and the base model of age, sex, and study. We can realign it to see what portion of the variance is explained by the base model of age, sex, and study, and what is explained by the polygenic score itself. What you see, first of all, is that those two components are not consistent between the groups. That may be partly due to the sampling structure of who's involved; it could be due to other dynamics that we're going to pick apart a little further. But it's important to note that the Mexican and Mexican American individuals had the lowest performance for just the base model of age, sex, and study.
But the Dominican group had the lowest for the polygenic score, and this is consistent with ancestry differences, in that there is a larger proportion of African ancestry in that group than in the others. So we want to think about whether this is just a matter of adjustment. If we see this relationship, do we want to keep adding more and more into the model and just adjust it out? That's standard practice, I think, for a lot of model building: just add in more things that are relevant. But what happens when we add this group membership to the model is that the performance actually drops. So we want to make sure that we keep the predictive ability and don't lose information by adjusting out this group membership. What I'm showing you here, in the bar charts at the top left, is the different quintiles of the polygenic score. What you can see is that there is an overrepresentation of Puerto Rican individuals in the highest quintiles of risk and an underrepresentation in the lowest quintiles. Their entire distribution, which you can see at the bottom here, is shifted to the right, with a higher genetic risk overall. You can standardize and stratify by this, and it'll shift everybody in the same direction. Again, there are two options for whether this is informative or not: either it's some sort of bias due to confounding, or it's a real relationship with the outcome. In this case, it's a real relationship: this one population, in this sample, actually has a higher BMI on average. Now, it's important to note again that this is not something inherently true of these populations.
This is one sample with a very particular sampling structure in terms of where participants came from. The PAGE study is a consortium of consortia and draws from all over the country, and a large portion of the Puerto Rican individuals were actually sampled from New York City, while a lot of the other groups were not. So there are differences in where you live and what that means. We'll dissect that a little further on a later slide, but it's worth noting that even when we do these studies and pick things apart to find these dynamics, a large part of what goes on is due to your study design: whom did you actually sample? We have these large studies, but even with the big data we have, we don't even approximate full catchment of our population; there's still some selection going on.
Okay, so it's not a matter of adjustment. This also depends, when we think about what's going on, on whether it's genetics, social context, or environment. It's important to think about the heritability of the trait: how genetic is it in the first place? What I'm showing you here compares and contrasts BMI and height. What we did is ask, for the trait itself, how much of its variation is explained by the base model, again age, sex, and study. Then I add both principal components and, in this case, ethnic identities within the Hispanic/Latino groups, and that's the full model. Then I take out either the PCs or the social context of these ethnic labels and see how much information I lose in explaining the variability of the trait. What you see is that for BMI, if I remove the social labels, I lose a lot more of the explanatory power of the model, explaining less variation, than if I remove the PCs. Right.
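The drop-one comparison described above, removing either the PCs or the group labels from a full model and seeing how much explained variation is lost, can be sketched on simulated data. Nothing here reproduces the PAGE results: the two simulated traits, the single PC, the three hypothetical group labels, and the helper names are assumptions chosen so that one trait is driven mostly by the label and the other mostly by the PC.

```python
import numpy as np

def r_squared(y, columns):
    """R^2 of an ordinary least-squares fit of y on an intercept plus the given columns."""
    x = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    return 1 - (y - x @ coef).var() / y.var()

# Simulated data; all effect sizes are assumptions made for illustration.
rng = np.random.default_rng(3)
n = 3000
age = rng.uniform(20, 70, n)
group = rng.integers(0, 3, n)                        # hypothetical self-identified label
labels = [(group == k).astype(float) for k in (1, 2)]  # indicator columns for the label
pc = rng.normal(0, 1, n) + 0.5 * group               # a PC that partly tracks the label

# One trait driven mostly by the label (context), one mostly by the PC.
context_trait = 1.0 * group + 0.2 * pc + rng.normal(0, 1, n)
genetic_trait = 0.2 * group + 1.0 * pc + rng.normal(0, 1, n)

def loss_when_dropped(y):
    """Drop in R^2 from the full model when PCs or labels are removed."""
    full = r_squared(y, [age, pc] + labels)
    return {
        "drop_pcs": full - r_squared(y, [age] + labels),
        "drop_labels": full - r_squared(y, [age, pc]),
    }

context_loss = loss_when_dropped(context_trait)
genetic_loss = loss_when_dropped(genetic_trait)
print("label-driven trait:", context_loss)
print("PC-driven trait:   ", genetic_loss)
```

Because the PC and the labels are correlated, each can partly stand in for the other, so the two losses do not add up to the total explained variation; the comparison is only about which removal costs more.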
And this is not the case for height, where if you lose the social labels, you lose some information but not a ton. And if you remove the PCs, you lose a ton of information, right? The model is not as informative. And what this shows us is that, you know, height is a very heritable trait; genetics just matters more. And in that case what you see, of course, is that social context does not matter as much as the actual genetic architecture of it. Whereas when we think about a trait like BMI, where the heritability is usually estimated at about half, at least half, of what it is for height, the social labels actually mean a lot more than any kind of continuous diversity with PCs. Right. And so it's important for us to know sort of what we think the upper bounds are. This is just supposed to show, basically, that this is going to be trait specific; it's going to be population specific. There's no one rule that's going to sort of be blanket for every kind of situation that we're looking at. Okay. Well, you know, what is being captured by these polygenic scores in the first place? Right, so polygenic scores are often built upon genome-wide association studies, which are essentially millions of correlations, right? You're looking at millions of correlations across the genome, which will of course pick up biology, but will also pick up a lot of other things that are different between the samples in that study. So in the polygenic score, we want to see, okay, what is explaining the variation in the score itself, so not in the trait but in the score. What we see is, overall, you know, BMI explains about 3% of the variation in the score, which is consistent with what we saw before for the R squared. And ancestry explains, for this specific ancestry component, about 18% of the variation in the polygenic score. So we stratify by group membership, how people self-identify, with the Mexican and Puerto Rican groups here.
What you see is that there's a difference, right? So BMI explains a little bit less in the Mexican or Mexican American individuals, and a little bit more in the Puerto Rican individuals, but ancestry makes a bigger difference in terms of what is actually explanatory in this. So ancestry doesn't actually make a big difference for this component in the Puerto Rican individuals. It's probably just due to the different genetic backgrounds of these groups, but also how that relates to the trait. And then you think, okay, well, you know, we know this from this part; we can further stratify by an environmental variable, right? So say, okay, based on environmental context, does it actually also show different things? And so what we show here is, in the Mexican individuals, whether you've ever smoked or you've never smoked in your lifetime, how much of the PGS is explained by ancestry and BMI is relatively the same, right? It's relatively consistent. But if you look at the Puerto Rican individuals, the proportion of the variance of the risk score is better explained by ancestry in the ever smokers than in the never smokers. And this is not to say that smoking alters your ancestry or anything like that, but it's more to say, okay, something is going on here with substructure even within these groups that results in different environmental contexts that could be related to how your risk score is being estimated and what it actually is capturing. And it's important to pick that apart because, remember, a polygenic score again is millions and millions of correlations just being added together. And it doesn't pick up just causal relationships but anything that came along for the ride for that particular study. Okay. So, these are considerations to make when you're comparing your training and testing sets.
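Editor's illustration, not part of the talk: the nested-model comparison described for BMI and height (fit a full model, then drop either the PCs or the group labels and see how much R squared is lost) can be sketched roughly as below. This is a minimal sketch on synthetic data, not the PAGE analysis. The function names `r_squared` and `r2_lost_when_dropping`, the effect sizes, and the two made-up traits are all assumptions introduced for illustration, and the study covariate from the talk's base model is left out for brevity.

```python
import numpy as np

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def r2_lost_when_dropping(base, pcs, labels, y):
    """Return (full R^2, R^2 lost without PCs, R^2 lost without labels)."""
    full = r_squared(base + pcs + labels, y)
    lost_without_pcs = full - r_squared(base + labels, y)
    lost_without_labels = full - r_squared(base + pcs, y)
    return full, lost_without_pcs, lost_without_labels

# Synthetic data only: every number below is invented for illustration.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(50, 10, n)
sex = rng.integers(0, 2, n).astype(float)
group = rng.integers(0, 3, n)                                # hypothetical self-identified group
labels = [(group == g).astype(float) for g in (1, 2)]        # dummy coding, group 0 is the reference
pcs = [rng.normal(0, 1, n) + 0.5 * group for _ in range(2)]  # PCs that partly track group

noise = rng.normal(0, 1, n)
# Hypothetical trait driven mostly by the group label (social context).
label_driven = 0.02 * age + 1.0 * (group == 2) + 0.1 * pcs[0] + noise
# Hypothetical trait driven mostly by the first PC (continuous ancestry).
pc_driven = 0.02 * age + 0.1 * (group == 2) + 1.0 * pcs[0] + noise

for name, y in [("label-driven trait", label_driven), ("PC-driven trait", pc_driven)]:
    full, lost_pcs, lost_labels = r2_lost_when_dropping([age, sex], pcs, labels, y)
    print(f"{name}: full R2={full:.3f}, "
          f"lost without PCs={lost_pcs:.3f}, lost without labels={lost_labels:.3f}")
```

On this synthetic data, the first trait loses more R squared when the labels are dropped and the second loses more when the PCs are dropped, which mirrors the direction of the BMI versus height contrast described in the talk. A real analysis would use estimated PCs, include study as a covariate, and attach confidence intervals to the differences.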
Alright, so we've looked at this relationship; we found the group membership to smoking to BMI, that sort of possible relationship, for this. But what about more macro-scale things to look at, right? Now we've looked at lifestyle at that individual level. What about social and structural determinants of health? So, what are social and structural determinants of health? These are loosely categorized by the CDC into these five categories, looking at education; healthcare; your environment, in terms of the built environment, your neighborhood and what's around; the people that you interact with, your social and community context; and then economic stability. And so there's sort of these broad different ranges of these more macro-scale factors that can influence human health. So if we think about, you know, how that relates to genetics, you know, how much of that overlaps with differences in genetics and how much does not? So, we know that the estimated ancestry proportions actually differ based on geographic area within a single group. So this is an older study from 23andMe, looking at, for this one, just Hispanic/Latino individuals in the 23andMe database as of 2015, I believe. And what you can see is those who identify as being Hispanic or Latino in Louisiana have more African ancestry, which is what you'd probably expect. Those who identify as being Hispanic/Latino in Texas have more of this Indigenous ancestry, and then you see more of the European ancestry maybe in Kentucky and Tennessee. And so you see the differences by geography; we have slightly different ancestry compositions, right? So thinking about a confounder, we can think, okay, we see differences in the ancestry; do we also see differences in the social and structural determinants of health? So this is a county structural racism measurement model that was published a few years ago now.
And what it does is it created a composite score, mainly looking at Black and white disparities in the social and structural determinants of health. And it included measures such as a housing dissimilarity index, a school dissimilarity index, graduation ratios, poverty ratios, access to health care, really across those five domains. And what they found is that a higher county structural racism score was associated with higher BMI in Black individuals and lower BMI in white individuals. So you see this different relationship, this interaction between race and how structural racism affects their health, and this was even after accounting for county income, which actually showed that more money means a lower BMI on average. So what you have here is sort of two legs of that relationship, where you have both an association between genetic ancestry and geographic location, and then between the social and structural determinants of health and location. And so you have this sort of confounding factor with BMI as well. And so, you know, again we can rely on previous work done in the non-genetic space for this, looking at Hispanic/Latino groups with BMI as an outcome or a trait to look at. They were really interested in immigration. Right, so people come to this country for many different reasons at different times. And they were interested in the process of acculturation, right? So does acculturation make a difference with how your BMI trajectory is, or obesity levels in general? What they found was there was actually no association between level of acculturation and obesity within these samples; this is again the Hispanic Community Health Study/Study of Latinos. And so what they found is it's really just how long you're in the country, right? How long you're exposed to the obesogenic environment of the United States. Right, so the longer you're in this country, the more of an effect it has on BMI.
Okay, and this is sort of consistent across different groups, with some variability, but really it's a relationship that's been verified. So we have these social and structural determinants of health that align with these group memberships and geography and the outcome, and it all gets very complicated. So I was, you know, honored to work with Lindsay Fernandez-Rhodes, who led this paper here, published in 2021. And what they wanted to do was integrate the polygenic score into this model of immigration and BMI and what's happening. And one of the things we wanted to do was stratify by the group membership and see what's going on, and I'm going to walk you through this. It's kind of a big table. The first thing to know is that, you know, as we saw before with the R squared, the model fit for this PRS and a lot of other factors is different by different groups. They included way more variables in their models, which is why the R squared is just bigger in general. But we see similar trends in terms of the polygenic score. So you see differences between groups. If we look at the main effect of the polygenic score itself by background, even after adjustment, we see differences, right? So you see different effect sizes of the polygenic score across a number of different models that were built up successively into bigger and bigger models. But it's important for us to focus on model four, in which the interaction term was introduced. And so if we look at model four and look at the interaction term between a polygenic score for BMI and age at immigration, each one of these categories is compared to those who immigrated to the United States past the age of 20, as adults. And what you can see, first of all, you know, one thing to note is that, again, it's different by different group.
So we see that, for example, in the Cuban individuals, for those that came to the United States between birth and five years of age, compared to those who came at over 20 years of age, their polygenic score is actually more informative. It does better, right? It has a larger effect size in this interaction term. And so it's more predictive. Now, if we do the exact same comparison of those who were brought to this country as small children, compared to those who came as adults, we see the opposite effect in the South American individuals, right? And then again, you see a difference for the Mexican individuals when stratified, and what I really want to hammer home here is not the actual effect estimates themselves, but rather that the same environmental variable that you measured can mean very different things in different groups based on their cultural contexts. People come to this country, again, for different reasons at different times due to different sociopolitical contexts. And it's important for us, when we're expanding our genetic models to include these contexts, to better understand them and have some expertise. So it's not a matter of just plugging more and more data in, but really being thoughtful about what you're putting in and understanding the dynamics at play. Okay. So, you know, in all, it really just gets a bit complicated, right? There's more you can add on here to this conceptual framework, the DAG; it's a bit unwieldy. But it's not to say that you can't do this, but rather that it's important to think about, right? You can't include all the factors, but you can thoughtfully think about a lot of them. So, you know, one of the major barriers that I think about for how we do this is one of the more foundational things, which is the imprecision when describing our study populations, right? We're very imprecise in general.
This is one of the few studies that allowed people to further self-identify beyond the Hispanic/Latino label. Often, you know, what we saw when they asked about race is that these people don't actually put race down, because it's not applicable to how they self-identify. And so there's a lot of imprecision as to how we capture this in our studies. And one of the things when we think about health equity is, you know, how can you have any accountability to assess equity if you can't measure who people are, right? If you can't describe them. And so how do you know the study populations without allowing them the opportunity to describe themselves to you and give you some context? So, this includes a lot of different contexts, including race, ethnicity, ancestry, geography, and other demographics. But I focus on genetics here. So what I'm looking at here is the GWAS Catalog, and I am not doing this to pick on the GWAS Catalog; they did what they had to do and I understand it. So here's how they define their groups. This is from a 2018 paper. And, you know, when they defined these different groups to have accountability for who's included in genome-wide association studies, they had to combine a number of different constructs. They had to combine genetic ancestry, geography, nationality, and race. And so you see this in the definitions, where here, for European, you have people who have been described by the authors as European, Caucasian, or white, or maybe on a national level, Dutch. And then they also did some computational metrics to cluster individuals as well. And so here you have sort of this hodgepodge of whatever was available to the curators to help create this accountability. And so it's very imprecise, but you know, again, you work with what you have. And it would be helpful for us as a field to have a bit more consistent measures of how we describe people. So this was tackled somewhat within the NASEM report that came out last spring, which I was honored to be part of.
It's available online with some pretty cool tools if you want to check it out. But basically it designed some guiding principles with specific recommendations of how we can do better in this space. You know, one thing that was really important for this, I think, is that, again, there's no blanket approach. There was no one answer given; it's all about what your question is, you know, what you have access to, and to really be specific on that front, for future research as well as your current research. Okay. Now, you know, I want to say that there is a cost to granularity. Right. It's not just me saying you need more granularity, you need to be more specific, because there's a heavy cost for what happens. And the cost is the sample size. Right. If I keep on cutting my data up into smaller and smaller bits, and start stratifying by all these different factors, eventually I get to such small sample sizes that I can't actually find anything. Right. And so in this day of big data, it's important to know that even though the number of participants in these diverse groups is larger, they're still proportionally much smaller than the European ancestry populations, and if you need more granularity, it just gets smaller and smaller. So it's sort of a balance here. And to that extent, you know, it is a trade-off between the power and the precision of your question and your study. Again, I'm an epidemiologist, so we teach a lot about population definitions. So we think about what is the target population, what is your source population, what is your study population. And again, I want to hammer home the point that even in this age of big data, where you have millions and millions of people in your studies, we do not even approach the level needed to not have to worry about representativeness or bias, right, in terms of selection bias.
In your study: who is in the study, you know, when was your study, things like that. And for this you need to have precision in your question, right, especially when you look at polygenic scores. You're no longer looking for discovery of specific loci, but rather at what's happening across an entire population. So it's important to think about this trade-off between power and precision when thinking about your questions. What does this all mean? What do we do looking forward? And so, you know, one thing that I think I will always sort of harp on, which is sort of hindered by the practicalities of it, is that it does seem a bit ridiculous that we pool people from across two different continents, with very diverse histories, into one group and say they're the same. They're one group, they're homogeneous; it seems a bit ridiculous, both in terms of it just sounding silly on its face, but also scientifically it's not very valid, right? The other thing I want to talk about is that admixed individuals and groups are not just the sum of their parts. Often in the methods development space we think, okay, well, maybe, you know, if we want to look at the risk in admixed groups, all we need to do is say, okay, we're going to model them as sort of a summation of all their bits and parts, look at these different haplotype tracts and then sum them together, and that'll be a valid estimate. And the truth of the matter is, that's part of it, but also they are communities and populations on their own, with different dynamics that need to be appropriately modeled. The other thing is that, you know, the development of these risk models is complex: it requires expertise for the populations at hand, it requires expertise for the outcomes, for the environmental context.
And this does require a very cross-disciplinary approach. We can't all do it all, right? There's just no way, especially in academic science; we've all specialized to such a degree that when you want these integrated models, you really need to reach across the aisle to different disciplines to get the best work together. And so, you know, as we move forward with these genetic risk models, these integrated models, for precision health and hopefully to sort of target health inequities, it's important to have that cross-disciplinary approach. Okay, so what do we need? Again, you know, often we think about very simplistic sort of models of "if genetics, then outcome." And we're used to thinking at that really micro scale for the genetics space. But there are models for how to look at the full system, right? This is an older model from 2006, adapted here in the slide, where you see that, you know, it's a socio-behavioral-biology nexus in this multidimensional space: a very long name for basically saying everything is connected. It's everything everywhere all at once, but you have these sort of nested hierarchies, right? We have the genomic substrate at the bottom, and it goes more and more macro in scale within the body. And then once you get to the stream level here, anytime you go above it, it's above the individual level, right? You're looking at the micro level, meso level, macro level, and then global level. It's going to be an increasing challenge for us moving forward, having to reconcile these two levels and how they are modeled, right? Because below, we often think about genetic similarity when it comes to genetics and genomics, right, looking at how similar people are to each other, but then above the water line, sort of, we have it more as a social construct, right? So you're thinking about these things along the lines, in the United States at least, of racial and ethnic groups and how that works with social context.
So, how we reconcile that will be sort of a large open question in the next few years, as we're bringing these big data initiatives together and trying to reconcile these different streams of data. You know, again, it's really important for us to know that there are models. This is an adapted socio-ecological model, looking at things from the micro scale of genetics and genomics up to the individual level, interpersonal relationships, neighborhood levels, and societal relationships. And again, this is sort of becoming increasingly important as we move from the era of GWAS, in which individual loci are the point, that is, discovery and trait mapping, to characterizing population-level distributions for genetic risk. Right, so you're moving away from biology and mechanism and more towards what happens overall. And in this case you need to have a more macro-scale view of what's going on in your populations, especially as many of the works that we do now are not limited to a certain geographic area or context, but are rather sort of whatever we can pull together to get bigger numbers. But understanding what's happening at these different scales can help us pose better questions and do better studies. So, again, you know, I think this is the epidemiologist's sort of lament here, which is that big data is necessary for what we do but it's not sufficient, right? It's a necessary cause of discovery and what we're doing, but it's not enough. You know, we have sort of unprecedented scale and power to look at things, but it's important to see that if you pool everyone together and you sort of average everything out, you can actually obscure meaningful substructure in what's going on that can be really interesting, I think especially when we think about health disparities. It's important to define the lens through which you're framing the question.
Right, because depending on how you look at the data, from what angle, you might have different answers, and it's largely because you've asked different fundamental questions, and so having that precision is sort of important for us moving forward. So, you know, again, I wanted to sort of get back up to this high level, and I think this is what I spend a lot of time in my work trying to do, which is that when we model human health, many of us are very computational and we make a large number of assumptions in our models, and a lot of that comes from what we accept as the default, right? Often in genetics we have whatever is there to do the first thing, and then we sort of try to fill in the gaps afterwards, and it's largely due to these biases in what we accept as the default: in the questions we ask, in who we include in our research, both in terms of participants and the workforce, and in which systems we even choose to model, right? And you see this in terms of a large number of outcomes that might be overrepresented in some minoritized groups but are understudied at a national level. You see this in terms of when they made the first genotyping arrays; they were mostly built on European ancestry backbones, right, and that was sort of the way it was for years. And in terms of, you know, what we even include for these biobanks, right: where the big biobanks are, what populations they draw from, and who can be in those samples; it sort of has this default baked into it. So again, if we think about equity, I think it's not enough to sort of go with the default and then, you know, try to backfill things later on. You know, trickle-down theory doesn't work for many things, and that includes research, right? It doesn't really work. And so we need to think about what the default is and challenge those systems. Now, I also wanted to point out workforce diversity, so this is the GWAS Catalog.
As I said before, the participants in our genetic studies are not representative of the global population here. They're not representative of the US population at all, but they are, sadly, representative of us. It looks like our demographics, and I know that I didn't go into this for selfless reasons at all; it's purely selfish reasons. And I'm sure other people did as well. And so it's important for us to think about what systems are in place that produced this, and how we can change them. Because if we sort of go back to the default and the systems in place, we're just going to keep on doing the same thing over and over again. I think if we want to achieve these goals of sort of figuring out what's going on with health disparities and providing solutions for health equity, we need to fix things in our own yard as well. The last question I'm going to leave you with is, as we move towards precision health, it's important to ask: precise for whom? You know, I've spent a lot of time at different institutions that have different focuses on public health versus precision health and precision medicine. And it's important to think about, you know, as we have this revolution and we move at, you know, light speed to make these developments, who are we leaving behind? And is that acceptable to us? Right. And that includes both the technology as well as the methods that we develop. And I wouldn't be here without my collaborators. The PAGE study has been a lovely home for me to do a lot of work over the last almost 10 years. Lindsay Fernandez-Rhodes led the lovely acculturation and immigration study, and she's at Penn State. And then my lab, which has been very helpful and a joy to be with. So thank you. I'm happy to take any questions in person or online. A great talk. Thank you so much, and please, those in the audience here, come to the mic with questions, and those online, send in your questions.
So I'm going to start with a question, and it's really in the context of all your work, Jen, and your work on the consensus study and the concept of genetic similarity. If you were talking to a junior researcher who is starting out their career, who has read the consensus study report, and they're trying to make some decisions and understand how to think about different approaches as it relates to their work, and their work is in the context of health disparities, what advice would you give that person? Yeah, it's a hard one for health disparities. So I think one thing that we need to do better as a field, and I think the report has actually been adopted much more readily by more junior folks who are learning and trying to do better actively, it's really about what your question is anchored in. And so if your question is anchored in access, you know, is this genetic test relevant for all of the diversity that's present, then maybe, you know, when you think about the descriptors, it's not really about disparities in terms of racial context but really the representation in the panels and the sequences that go into defining these panels and defining these tests. But say you're looking at a specific disparity that is quantified along these sort of bounds of social constructs like race and ethnicity.
Then I think it behooves you to sort of think, first of all, do we think there's actually a role for genetics in this, and how do you thoughtfully align how you look at this with the question you have at hand? And that includes maybe only including individuals that are relevant to that group, and if you do go along racial or ethnic lines, making sure your samples are actually those racial or ethnic groups, not some sort of amalgamation of genetic ancestry groups that sort of recapitulate these labels but are actually fundamentally different in terms of how they're categorized. Which is basically just to say that it really depends on your question, and on making sure you're very thorough: okay, for the question that I'm asking, do I really need to look at race or ethnicity, social context? Do I really need to look at genetics? And how do I make sure that they're aligned with each other in a way that is meaningful and not just using whatever is there? Yeah, if you can, but I mean, it's hard if you're junior, but whatever you can do. Thank you. Yeah. All right, first question. Very cool talk. Thank you. I have two extremely related questions. I think I missed this early on. Yeah. In the PAGE project, what are the sample sizes you're working with? Yeah, I'm sorry about that. Yeah. The total sample here was about 50,000 individuals; about 22,000 individuals self-identified as Hispanic/Latino, about 18,000 as African American, and smaller groups of Asian, Native American, and Native Hawaiian. Okay, because I was curious; there were some slides where there were some unadjusted and adjusted R squared comparisons. Yeah. I was wondering what the statistical resolution of those, you know, bars was on there.
So for some of them they do overlap in terms of their resolution; for the effect sizes they did overlap somewhat. For the R squared, though, because it's still a large enough sample size, the R squared is pretty tight in terms of the confidence intervals there. And so those are actually different in terms of the performance. And it's hard to sort of see, you know, there's not really an easy way to estimate these sorts of measures to see how different they are, but at least for the correlations and everything, with numbers in the tens of thousands, you have enough to see that they're different. Yeah. Other questions? Online. This question asks you to return to your discussion of PRS changes and immigration status, and it reads: "It is not clear why PRS changes with immigration status. We do not expect SNPs to change in such short periods of time. Can you elaborate?" Yeah, yeah. So the way I interpret it is not that the biology changes; the mechanism doesn't change. What does change is how much the biology is allowed to matter. What does change is what matters for the environment, so heritability is not a static measure. It is about a balance of environment and genetics, and when you change the context, you change that balance between what matters for genetics and what matters for environment. And so for immigration status, what you're seeing is that, you know, in certain groups, maybe the environmental side of things is more consistent across individuals, and therefore there's less variability, and therefore genetics explains more of the variability. Or it could be sort of the other way around. So I don't really think of it in terms of the biology changing, but rather how much of the variance in the outcome we can actually explain with genetics versus environment. So I don't think that the biology changes for that part. And that was great.
And very thought-provoking. So, a question: in the PAGE study, do you get the sense that if you had enough of the sociocultural and environmental information, you wouldn't need the race information at all? Is it possible to essentially collect enough sociocultural information that these other, sort of more distant proxies just don't really matter? Yes and no. So I think, you know, when we think about racial categories, often we're bringing things down to sort of white versus Black, right? Now, when we think about race in terms of the OMB categories, often they sort of conflate this social construct based on physical characteristics with national origin as well. So, one of the other groups I didn't talk about, which is sort of the other side of the same coin in terms of the Hispanic/Latino groups, is that you have the Asian group here, and Asian, again, it's a large geographic area you're covering; it's over half of the global population in one group. And so it's not that if you had all the social determinants of health, you could get rid of it. But you could focus, I think, away from these maybe broad-stroke racial categories and maybe focus on how that aligns with more granularity in terms of geography and what people are exposed to, and that might again be along the lines of different groups or different communities in the United States. You know, even within a certain geographic area, for the resolution that we have in this data, even within that sort of granularity you still would have substructure based on these different groups that you need that information for. And so it's sort of a yes and no; it sort of depends. Yeah. Go ahead. We'll take another question from online; this is on social and structural determinants.
Yes, and it reads: "For BMI, the emphasis in this talk seems to be on ancestry and heritability. Ancestry could be a surrogate for a number of social determinants, as you discuss, including environmental exposures. Did you replicate your findings using independent data sets?" Yes. So one of the things we did: the PAGE study is four different studies combined. It is the Hispanic Community Health Study, it is the Women's Health Initiative, it is the Multiethnic Cohort, and it is the BioMe Biobank. So, these are all different studies in terms of the times that they were collected, the way people were recruited, and sort of what populations they draw from, and these findings were consistent across all the different studies. I didn't have time to show you in depth, but there is consistency between them, and we're in the process right now of replicating it with All of Us for some of the data that's available, but not all the data that we need is available right now for All of Us, but we're trying. So I have an additional question: you showed two different models or frameworks with regard to bringing in different social contexts. How do you bring diverse disciplinary expertise into the research, to ask the questions and interpret the data? I think it comes down to the money; it all comes down to money. I mean, I think, you know, we like to think that in science we're about the truth and everything, but it's capitalism like everything else, and so, you know, it is very difficult to get sustainable funding across different disciplinary sciences in a way that is not exploitative.
When it comes to geneticists reaching out to ethicists or sociologists, it may be, "Can you just, you know, put your name on this grant, and I'll provide salary, and I'll listen to you?" Or, you know, to have it sort of be equal from the get-go, it's very hard to find these, and I don't know of any really good routes to do this, because you want to have people coming into the room with equal standing and equal respect, and that is very difficult to do, given the current funding mechanisms and the current economy of academic research. So having sort of better, and better publicized, and better rewarded avenues for that research would be great. You know, it's not just about the money coming from the NIH or other institutions, but also that your institutions themselves reward you for promotion, right? You're always thinking about that. And so it's the whole sort of economic system rewarding that, which I think right now is very difficult to do. So with that, I can say thank you so much for such a great talk. Thank you for having me. Thank you, everyone, for attending in person and online today.