Can you all please sit down, take your seats, and be quiet, please. So it's my pleasure to introduce today's keynote speaker, Loic Yengo, who is an associate professor in statistical genetics at the University of Queensland. It's been an absolute highlight of my career to be able to recruit Loic, now nearly eight years ago, and to see his career flourish. It's a real privilege, I think, in academia to hire people who are much smarter than yourself, and it's also for me a great pleasure to now have Loic as both a collaborator and a friend. Loic, floor to you.

Thank you very much, Peter, for the very kind introduction. I should definitely get you a beer, more than one. Okay, so it's a real pleasure for me to be here. I'm extremely excited. I have to say, yesterday the program was fantastic and I didn't even need to fight my jet lag. I was just carried along by all the science and the great ideas floating around the room. So today I'll be telling you about a trait that we all know and love, to paraphrase what Arcus said yesterday. It's a trait that has been the flagship trait of quantitative genetic studies for more than a century now, and one of the reasons is that it's a trait that is very easy to measure and it has a large heritability. So I'll be talking about some of the recent findings we've learned about its architecture, and towards the end of my talk I will highlight some unresolved questions and controversies. But before that, I just wanted to answer a question that I've had over the past few days, actually ever since we published that paper last year. So no, I'm not retiring just yet, and if I do, it's not going to be in DC; it's going to be somewhere on the Sunshine Coast. I'll invite you there sometime. But more seriously, in this talk I will start with a sort of general reflection about what it means for height to be a model trait. It's a very vast question, but I'll try to sketch a few ideas there.
I'll then talk about some of our recent discoveries, which will be the bulk of the presentation, and towards the end I'll brush over some unresolved questions and maybe share with you more of a wish list of what we can learn together. So why is height a good model? As I said at the beginning, it's easy to measure and it has a large heritability; that's probably what makes it appealing to study. But what is interesting here is that, besides heritability, its genetic architecture pretty much resembles that of any other complex trait. This is a picture from Zeng et al. 2021; I think Naomi showed that picture yesterday as well. As a consequence, we hope that what we learn when studying height will somehow generalize to other complex traits. But in practice, what exactly does it mean to generalize? What are the things that we hope will generalize? I guess the straightforward answer to that question is mostly methods, but here I'll try to go a bit further and ask: can large genetic studies of height also improve, more generally, gene mapping of other traits? I'll argue that the answer to that question is maybe. If you think about it in the context of methods like MTAG, where you can borrow information from genetically correlated traits, you can imagine that having very large studies of height could also improve mapping for other traits.
I think large genetic studies of height could also improve our general understanding of biological mechanisms, and I have a particular experiment in mind, which is the village experiment that was mentioned yesterday a couple of times. The reason why I think height could be a good candidate for this is because of effect sizes. In this sort of typical village experiment, we are ascertaining participants at the tail of the polygenic score distribution. But when you think about it, how much contrast you have at the tail of the polygenic score distribution is a function of the prediction accuracy, and so for many existing experiments and attempts to answer questions using that framework, I believe we've been mostly underpowered, and height could play an interesting role there. The third point is a bit of a leap, right, so don't quote me on that: height will not directly improve our ability to discover new medicines for cancer. But I just wanted to flag that there have been, in the literature, examples of models or studies that have thought about using height as a way to prioritize new medicines, by looking at variants that both increase and decrease expression, for example loss-of-function and gain-of-function variants. But besides thinking about which properties of height, or of genetic studies of height, can generalize to other complex traits, you can also think about it in somewhat different terms: what hypotheses have we been able to test or generate using height as a model? Among the tested hypotheses, I think maybe the most important one has been that complex traits are polygenic and that their heritability can be mapped to single-variant resolution. That was hypothesized in the Yang et al. 2010 paper, way before we had enough data to actually test that hypothesis, and I guess, with our work last year, we've been able to get very close to, and to
validate, that hypothesis. As we start getting more genetic data on height, we can also test hypotheses about natural selection, for example by looking at how it acts on changing and modifying the allele frequencies of trait-associated variants. Just another example: height has also been used as a way to answer a very broad question about which forces explain the loss of prediction accuracy for polygenic scores between populations. Height has also helped generate new hypotheses. By new, I don't mean that those things were not known before; I mean that because we started observing them with height, that prompted questions about how much they generalize to other traits. One of them is allelic heterogeneity. We talked about it yesterday, but I think for height it's quite interesting, because the very early GWASs of height in 2010 already flagged this interesting property that the height-associated variants tend to be clustered next to each other in the genome. That is something that we continue to see, and I will cover it in the remainder of my talk. There is another interesting hypothesis about the presence of rare non-coding variants with large effects. I guess this is maybe mostly driven by technology, because now we have large samples with whole genome sequences, but it's interesting that in this [unclear] paper they flag this particular example where you have this UTR variant that has an effect that is larger than that of any coding variant. How much that generalizes to other traits is also interesting, and something we should try to investigate in the future. And finally, another hypothesis that I think was prompted by studying height is to what extent we saturate our biological interpretation of GWAS findings as we increase sample sizes, and this is something I will cover toward the end of my talk. So, just to summarize this part: the
genetic architecture of height largely resembles that of other traits. Height has been mainly used as a model for gene mapping and method development. I think at this point in time it's unclear whether height will remain the flagship trait for some of our post-GWAS mechanistic studies; I believe it's still a good candidate for that, given the point I made earlier about effect sizes. And finally, height has enabled complex hypotheses to be tested empirically and has also led to generating new ones.

So in the second part of this talk, I'm going to cover some of the recent discoveries we've made, and it's mostly going to be based on our GWAS of height published last year with five million participants. But before that, this slide is mostly to highlight that this has been a long journey. You don't get to five million just like that, right? It's been a journey. Before the GWAS era, what we knew about the architecture of height came mostly from family-based studies. We had estimates of heritability between 0.7 and 0.8, and those studies were mostly done in European ancestry populations. I'm going fast on this timeline, but from the early days of GWAS to 2022, which are the results I'm going to cover in detail in the next few slides, what we've learned is essentially that we can map that heritability to single-variant resolution, going from detecting about 27 loci to now over 12,000 loci explaining pretty much all the SNP-based heritability.
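The effect-size point recalled in the summary above (how much contrast you get at the tails of a polygenic score distribution depends on prediction accuracy) can be illustrated with a short sketch. This is not from the talk: it assumes the score and the trait are bivariate normal and both standardized, and the function name `tail_contrast` and the R² values in the loop are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def tail_contrast(r2: float, tail_fraction: float) -> float:
    """Expected trait difference, in trait SD units, between people in the
    top and bottom `tail_fraction` of a polygenic score explaining `r2` of
    trait variance. Assumes score and trait are bivariate normal."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be in [0, 1]")
    if not 0.0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5]")
    std_normal = NormalDist()
    # Score threshold (in SD units) that cuts off the upper tail.
    threshold = std_normal.inv_cdf(1.0 - tail_fraction)
    # Mean of a standard normal truncated above the threshold.
    mean_score_in_tail = std_normal.pdf(threshold) / tail_fraction
    # The expected trait value is sqrt(r2) times the score; the bottom
    # tail mirrors the top tail, hence the factor of 2.
    return 2.0 * (r2 ** 0.5) * mean_score_in_tail


# Illustrative (assumed) prediction accuracies, sampling the top and bottom 5%.
for r2 in (0.05, 0.15, 0.45):
    print(f"R2 = {r2:.2f}: contrast = {tail_contrast(r2, 0.05):.2f} SD")
```

Under these assumptions the contrast scales with the square root of R², so a score that is nine times more accurate gives three times the separation between the two sampled groups.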
So on this timeline I just wanted to highlight a few milestones that I think are important. One of them was the constant update of estimates of SNP-based heritability; I think that has played a major role in sustaining the momentum in GWAS discovery. Two other points: the UK Biobank, which I think stands on its own as a milestone, and, similar to that, the collaboration with direct-to-consumer companies, which has also contributed to boosting sample sizes. So let's now talk about the mother of all GWASs. Okay, no reaction; everybody is expecting me to be that cocky. So this study was with five million participants, heavily European-biased, but that's no news for GWASs. I just wanted to highlight that this study is still commendable because we have more than a million participants whose ancestries are outside of Europe. We mostly focused on this panel of SNPs called the HapMap 3 SNPs. Panels are good and bad, but I think maybe what's interesting about this panel is that it contains variants that are largely common in all of these ancestry groups. These are some of the questions we asked in that study. We wanted to discover height-associated variants and characterize their genomic distribution. We wanted to estimate how much variance they explain and compare that with the missing heritability. We did some prediction work, but in the interest of time I will not cover this in the presentation. And then we quantified the level of saturation of GWASs at different levels.
So the first thing we did was to run GWASs in each of those groups. This is a bar plot showing the number of hits we detected in each group, and they pretty much mirror the sample sizes. Then, before combining everything, we compared effect sizes between those ancestry groups, and what we found is that they were largely consistent, in a way that made it sensible to use a fixed-effect meta-analysis to move forward. After running this fixed-effect meta-analysis, we identified about 12,000 variants, and we asked what their genomic distribution is. So we developed this measure of density. It's a per-SNP measure: we're asking, for each SNP, how many other independent associations share its particular location. When we do that, we see that on average across the genome each of those 12,000 variants tends to share its location with two other independently associated variants. In particular, we flagged this locus on chromosome 15 that has a density of about 25 independent associations within just a 200 kb window. This locus was known before to show some allelic heterogeneity, but we wanted to go a bit further and try to characterize what patterns could have led to that high density. So we tested a few hypotheses. Sorry for the busy slide, but essentially panels A, B, C and D are asking a simple question: how much of that density can be due to untagged rare haplotypes at that particular locus? After running simulations, haplotype analyses, etc., the conclusion is that it's not the most plausible explanation of that high density. But interestingly, that locus was also shown to harbor this variable number tandem repeat, which also showed association with height. In particular, if you quantify the variance in that copy number variation, the VNTR variation, explained by the 25 variants at the locus, well, we have a pretty
good tagging. I think in South Asians we reach about 80% of that variable number tandem repeat variation explained by those 25 variants. But nevertheless, this bottom right panel is showing that if you add those SNPs on top of the VNTR, you still explain more variation. So the interpretation here is that high density is in part a reflection of untagged, more complex variation, but it's not just that; it's also evidence that we have a lot of heterogeneity happening there. This was work done in collaboration with Po-Ru Loh and Ronen Mukamel at Harvard. The second question we asked about high density is to what extent it reflects enrichment of genes that are known to lead to extreme skeletal growth phenotypes. We had a list of just under 500 genes, and by and large we found a significant enrichment of high density near the genomic regions where those genes are. This is significant over and above null distributions from genes that are randomly sampled but matched on length, and the right panel is essentially showing a very clear monotonic relationship between density and enrichment. I couldn't finish this section on architecture without showing this kind of plot. This is something we showed in the paper: the relationship between frequency and effect sizes for the 12,000 variants. But I also wanted to flag a follow-up analysis we've done in a subsequent paper published sometime this year, where we conditioned on those 12,000 variants and ran an exome-wide association study in the UK Biobank. The point there was to detect variants that were independent from the ones we identified last year, and we also wanted to see how far those new signals were from what we identified last year. So in this plot, on the x-axis you have the allele frequencies. We went all the way down to 10 to the minus 4, and that allows
us to detect very large-effect coding variants. Interestingly, each of those dots is colored by its proximity to the height loci that we identified last year, and what we found is that the vast majority of them are sitting within just about 100 kb. So that's also confirming that even for the missing heritability, at least the rare variant heritability, there is a concentration near those loci. Once we identified those 12,000 variants, we defined height-associated loci by essentially extending small windows of about 35 kb around each of those SNPs. By doing so, we can partition the genome into variation that lies within those loci versus outside. When you partition the genome this way and you perform a partitioned SNP-based heritability analysis, we found that, even beyond the HapMap 3 SNPs that we analyzed, 100 percent of the heritability in European ancestry populations is contained in those loci. In other ancestry groups it doesn't reach 100, although standard errors are what they are, but it's interesting that more than 90 percent of that common-variant SNP-based heritability is contained in those loci. That was quite surprising and exciting, and we thought, okay, but how far does it go? Because we had common variants, and we have some idea that some rare variants can go there too, but is that a pattern that is generally observed, and have we essentially uncovered all the loci containing all the heritability of height? We tried to answer that partially in the paper last year by repeating this analysis while lowering our inclusion threshold: not just one percent, we went down to one in a thousand, and that also confirmed the pattern. But I'm now showing an unpublished result that we generated earlier this year, which is based on whole genome sequence data released in the UK Biobank. This is a partitioned analysis, so I'm just going to
walk you through it. So this is height, based on 154 unrelated participants. We have essentially four MAF groups, from 10 to the minus 4 to 10 to the minus 3, up to 50 percent. Within each of those MAF bins we have LD groups; don't pay attention to the LD groups, just focus on the colors. The blue colors are the height loci. So we do confirm the pattern that essentially all the heritability, from common variants down to lower-frequency variants, is contained in those loci. But what I want to draw your attention to is that for the rarest bin, it seems like this pattern starts to break down a little bit: the enrichment is not as strong. This is essentially telling us that there is causal genetic variation for height outside those loci. The question to which I don't have the answer now is how far away from those loci the variation that generates that heritability is. I don't know at the moment. So now I'm going to talk about saturation of GWASs. The way we approached that was essentially by revisiting some of our previous GWASs and down-sampling our large study, and by doing so we generated new sets of independent associations. We first looked at what happens to the number of variants that you detect as you increase sample size. As you increase sample size, you do increase the number of SNPs that you detect; that's the blue curve. But consistent with that enrichment, or clustering, of height variants, we found that as soon as you start defining loci with, say, a window of 100 kb, you clearly see saturation, meaning that the new variants you detect when you increase sample size are within the loci you pretty much identified before. We then wanted to see how that translates in terms of saturation of enrichment for biological pathways or gene sets. Apologies for this very busy slide, but what we did here is that we first of all used two methods
called DEPICT and MAGMA; these are methods for quantifying enrichment in functional pathways. What we did then is that we clustered the pathways into 20 clusters of pathways. It's a very high-level analysis, but the bottom line is that if, for example, we focus on DEPICT, as we increase sample size, as we go from this row to this row, what we see is that the clusters of pathways that we implicate with a sample size of 130,000 are pretty much the same as the ones we implicate with 5 million. So this is essentially telling us that saturation, at least conditional on the kind of biology we know at this point in time, doesn't need 5 million participants. So just to summarize: our big GWAS identified 12,000 independent associations clustered within about 7,000 non-overlapping loci, and those cover about 21 percent of the genome. What's special about it is that more than 90 percent of the common SNP-based heritability is contained in those loci. I haven't shown any of the prediction results, but interestingly we do find that our best polygenic score outperforms the parental average; I think that was notable enough to be underlined in the paper. And we've now reached saturation in terms of pathway and gene set prioritization. I'm not sure how I'm doing with time. I'm just jumping to the last... three, ten minutes? All right, I'll be quick. So, just about the unresolved questions; I have two slides on this. One is the still-missing heritability. I think we can explain about 70 percent of height variance with whole genome sequence data, which still leaves about 10 or 15 percent, if you consider the heritability from twin studies to be 85 percent, a difference that is not explained. What could the gap be here? I think yesterday Arcus was underlining that twin studies can be confounded and, for example, can also be inflated because of something like common environmental effects. So what we've been able to do is use an
orthogonal design based on segregation variance in a large number of sibling pairs. These are unpublished results, but what we found with this analysis is that if you use segregation variance, and you can therefore tease apart the part coming from shared environmental effects, we get an estimate very close to the one we had from pedigree-based, twin-based studies, that is. So maybe it's not so much the common environmental effect that is causing the problem; maybe it's something else. This is an open question. Other questions we would like to answer in the future: what are the biological mechanisms underlying height-causing genetic variation? This is a very broad question, but essentially what we need, and what we don't have at this point in time, is a lot of omics data on growth-relevant tissues. I think we will definitely need more of that, and I believe that will also help fast-track some of the functional studies for other traits. The missing heritability I've talked about; I think we definitely need larger samples with whole genome sequence data, and larger family studies, ideally not just sibling pairs, because that will allow us to potentially tease apart the contribution from non-additive genetic variance. And finally, fine mapping. I think we haven't talked about it much so far; maybe today we'll talk more about it, but we definitely need to get at the causal variants behind those associations, and that will require new models, new data, and definitely more diverse genetic samples. So this is my last slide. On the genetic architecture, I'm probably not going to repeat myself; I'll just jump to the last point, which is whether height has a future as a model for complex traits. I believe yes, because the advantage in terms of sample size will remain for the foreseeable future, and there's a lot to do in terms of new biological translation there. But there is a downside. I think with biobanks
we have lots of traits available to us all the time, and this sort of special status of height as a model trait is somehow undermined in this context. So with that, I thank you for your attention. I just want to thank all my colleagues; this work, which I brushed over very quickly in 25 minutes, involved a lot of collaborators, so I just want to thank them, the funding bodies, and the department. Thank you so much.

So Loic, that was amazing. My question is about the rare variants that are outside of the height hits, and thinking of that in the sort of constraint framework, right? Those would be variants that hit a gene where there are no common variants for height. Do you expect those to be sort of like core genes, or genes that are intolerant to mutation? Because that's interesting, right, that suddenly you find new things, new genes or new places in the genome that aren't influenced by common variation. Are they special? Should we seek those out?

So it's a great question; thanks, Michelle, for the question. First of all, I don't have the answer to that question, but I'll tell you what I need before I start investigating it. I'll need to know how far this sort of residual heritability is from the loci, because when you think about it, the way we define loci is quite arbitrary. We used a 35 kb window, which we found was quite extraordinarily small for the common variants, but maybe for the rare variants we need to go to 100 kb, and if we go 100 kb away and we capture that heritability, then it sort of changes the nature of the investigation. But I'll definitely look into that.

In explaining the 10% still-missing heritability, you didn't mention something that you've worked on yourself, which is even rarer variants. So do you think that they explain some of the missing 10%?

Thanks, Mike, for the question. So yeah, I couldn't cover everything; 25 minutes is not a lot, but I'm
not complaining, I'm just saying. But yeah, the evidence for the extreme, rare tail of the frequency distribution is quite hard to get. I think we have estimates of just about two or three percent, but who knows if that's confounded by stratification; I think that's the unknown there. It's possible. I think we have this interesting conundrum whereby estimates from segregation variance using either sibling pairs or more complex pedigrees lead to somewhat different estimates, and the estimates from more complex pedigrees are supposedly not capturing non-additive genetic variance. So it would be great, first of all, to try and replicate that in a different data set. I think the only data point we have for that observation came from deCODE, so we first of all need to replicate that, and if it holds, yeah, maybe that gap could come from non-additive variance. But I need to see the data.

What does the distribution of the variants say with respect to identification or not of the 462 skeletal genes? And my second question, in a way, is far more trivial: if you added more variance, if height itself was less accurately measured, which is the case for most traits, how much of your observations would really stick? Meaning that some of the observations are probably more stable than others, if you just added more variance to that.

Right. So I have to say I missed the first part of your question. You were asking about genes, the actual identification of genes that are relevant for height? Yeah, so I think it's a problem that holds for most complex traits, right: how do you go from the associations to the genes? And I think one of the reasons why I believe height has not been as well served as other complex traits is because, as I mentioned very briefly at the end of my talk, we are lacking QTL data, eQTL data, or some molecular data on
tissues that are relevant for height, like chondrocytes or growth plates. We have done some experiments, and we actually published a paper earlier this year on this, but I think we'll need a much, much larger investigation of that. So at this point in time, most of the evidence is coming from the exome-based analyses, where you have a more straightforward connection, and we have hundreds of those genes, and the OMIM ones. There's more to say about OMIM, but your second point was about, I'm trying to remember, the quality of the data, the variance of the phenotype. Yeah, I guess it depends, right? If you think about this variance as coming from measurement error, well, I guess what it will do is essentially just lower the heritability; it won't necessarily change the associations per se. But it's interesting that height is also involved in a lot of things like participation in studies or assortative mating. So in a way, if being in a study is somehow related to your height, and essentially correlated with other behavioral traits, then maybe we'll be picking up things that may not be relevant for the process of growing up. But that's just me extrapolating.

I've got a question. Hi. I wanted to draw you on height as a model for disease, and on conclusions that we could draw, or how we should take forward analyses of disease, based on your results. For example, I think everything's much harder with a binary trait, so the types of work you've done with the whole genome sequence data I'm not sure will translate, but some of your work with pathways may translate.

Yeah, thanks for the question [unclear]. There's a lot to be said there. I think one point is, if you think about the model in terms of developing methods that can be applied to anything else, yeah, I guess a lot of what we've seen
here can be applied. But I was thinking maybe beyond that: conceptually, height could be an interesting model for diseases. I was talking this morning with Greg about this heterogeneity. This is a concern in consortia, when you aggregate data from different sources and you wonder how much the signals we're picking up are reflecting subtypes of the disease, etc. I think it's somewhat overlooked, at least as far as I know and from some discussions I had with Peter, that height could be an interesting model for disease heterogeneity too. If you think about different body parts, there are many ways to achieve your height, and maybe by developing methods to better dissect that heterogeneity, that could potentially inform things we could translate to diseases. But yeah, that's, I think, all I can do on that question.

Hi, Magnus Nordborg, Gregor Mendel Institute. So you mentioned fine mapping and identifying causal sites. What do you actually mean by that? I've done a lot of this stuff, and there's no high-throughput way, really, right? I mean, you can map it down to a region, you can identify causal genes if you can knock them out or something like that. But if you actually want to get the causal sites, that involves making constructs and inserting them, and the limiting factor is not only the sheer number you have here, but also that it only works if the variants have enough of an effect that you're going to detect it experimentally. And you can't do the kind of... I mean, I work in plants; you can do this experiment.

So yeah, thanks for the great question. I guess that was the part of my talk where I was listing unresolved questions, right, and I guess I was also trying to justify why I'm not retiring. So one of the next big things we are planning, and have started doing following this study, is: okay, now we have these loci, okay, we have great
associations, but can we identify the causal variation, the causal sites underlying these associations? We have lots of new methods out there to do that. One of the challenges is the sample size, because you really need large samples to be able to break down LD, and so we are working towards much bigger samples to try and achieve that. But I think it's not going to be enough. I believe we also have to think about maybe some framework shift in how we think about fine mapping. I'm happy to be very elusive now, just to draw attention for coffee discussions afterwards. But eventually what this will mean is that we want to be able to make a hypothesis that this region, if you were to CRISPR it in a cell line, will have a particular effect on growth. For example, growth plate proliferation could be a great endpoint. If we've done a good job at fine mapping, I hope we'll be able to see this kind of phenotype at the cellular level.

All right, thank you so much. Okay, so yesterday Alexander introduced this by showing us a pile of turtles, and yesterday we spent most of the time talking about the lower parts of that pile, looking at organisms and molecular phenotypes. In this session we're going to talk about the top parts of the pile, looking at genetic architecture at sort of higher levels. For me there are sort of two questions here, and I think the prompts we sent out were geared at this. One: by studying trait variation at these higher levels, can we actually learn something about the lower levels? And then the second thing: are there aspects of genetic architecture, or relevant things, which only become apparent or only become relevant at these higher levels? So we're talking about things like indirect effects. And then the overarching question, and this is something that will hopefully go into the discussion, is what other study designs and data types we need to
collect in order to answer these questions? We have three excellent speakers addressing different aspects of this. It's a pleasure to introduce our first speaker, Bogdan Pasaniuc.
Thank you so much for the invitation, and thank you so much for the amazing lineup of speakers. I'm deeply honored, and equally terrified, to have you guys in front of me; in many ways it feels like passing the qualifying exam as a grad student, and statgen 101. So I'll talk a little bit about polygenic risk scores and the impact of genetic architecture on how we think about polygenic risk scores. The north star of the talk today, the vision my talk will come from, is how we port these things into medicine: how do we take these predictions that we have, and what's the impact of genetic architecture as we think about taking these predictions towards one patient at a time, let's say, to improve health outcomes? I probably don't have to spend too much time exemplifying the fact that polygenic risk scores have emerged as one of the key tools for precision medicine, because for all these complex traits they could potentially identify high-risk cases or improve disease prediction. But the message of my talk, which I'll get back to at the end, is that if we want to implement polygenic risk scores in medicine, we need to make sure that these predictions are well calibrated, because in medicine, in humans, we need to make sure that we do no harm before we implement any new biomarkers for any new diseases. Here's an example of miscalibration, in which for a given population, labeled here in blue, our prediction, which let's say might be PGS-based, might predict higher risk for individuals who are not as high risk, and that causes health inequities if we predict them that way. And I challenge us to think about this calibration, or this implementation in
humans, and how extremely difficult it is, particularly in the US, in which genetic ancestry, or any type of ancestry, genetic or self-reported, impacts genetic architecture, or rather genetic architecture varies by ancestry, both in terms of the average risk for a given disease in a given population and in the distribution of the risk. Because of that, when we think about identifying individuals, especially in a calibrated manner, all those genetic architecture parameters will directly impact, one disease at a time, how we think about, let's say in this particular case, identifying patients that may have an absolute risk tenfold or more above the average in the population, because that's what we would want to do in medical settings. So obviously the biggest elephant in the room here, and I'm showing one of the beautiful figures from Alicia's work, who is in the audience, is that if we train predictors in European ancestry individuals, they'll perform poorly in non-European ancestry individuals. That raises a fundamental question, which we touched upon before: do the causal effects themselves vary or not across ancestries? Without going into too much detail, we can think about partitioning discrepancies in genetic architecture as follows: the architectures are completely different by ancestry; or the causal variants and genes are the same across ancestries but have different effects because of all sorts of interactions; or, even when those are the same, the frequencies and LD, the way we tag the causal variants, will be very different. As you might expect, this is a question that has been studied since the inception of statistical genetics as a field. Here I'm just listing, from Sasha's... this is just a highlight of recent work from the past four or five years that leveraged large-scale GWAS data to pretty much say that there is some kind of [unclear]
in the effects, but it varies depending on the methodology that we use, varies by trait, and so on and so forth. So, motivated by this, we tried to get to the fundamental question of whether genetic effects are the same or not. One of the big confounders of all these studies is that we're not just comparing the genetics, we're comparing all the other context as well, and in many ways all these studies are inherently confounded by looking at different populations altogether. To overcome that, we made use of an interesting aspect of admixed individuals, recently admixed genetic ancestry groupings, for example African Americans within the US, who have the feature that their genome is a mosaic of ancestry segments that we can computationally track to figure out where they're coming from. By pulling those apart computationally, we can ask whether genetic effects are similar or not within the same individuals, within the same population, but on these different segments of ancestry within such individuals. Obviously we did that in a polygenic setup; I'm happy to talk to you about the details, but it turns out that the nice statistical toolkit that uses mixed models to estimate variance parameters can be extended to this setup, and you can estimate a parameter that gives you the covariance, or the similarity, of the causal effects in admixed individuals. Long story short, we found very high consistency. From PAGE, which is a consortium from NHGRI, to the UK Biobank, all the way to All of Us, across all the traits that we looked at, we see a very high consistency in the effects. So in many ways we think that causal effects, at least for common variants, with respect to polygenic scores, are largely similar, and the lack of portability really comes from differences in frequencies and LD. Now you might ask yourself how that occurs: how do the
differences in frequencies and linkage patterns across genetic diversity cause such low portability in polygenic risk scores? In order to tackle that, one interesting finding, which has essentially been around in population genetics for 50 or 60 years but is more and more deeply appreciated as we get data from these medically linked biobanks, is that genetic diversity sits on a continuum in humans. This is lovely work from last year in Science, from Lewis et al., on the left side, where they took individuals from a biobank within New York City, from Mount Sinai. You see there in gray dots the individuals from the biobank: they span the entire genetic continuum across all the different genetic ancestry sources. When we do the same thing at UCLA in our biobank, we observe a similar type of continuum, in which clustering individuals into these broad ancestry groupings is a very poor approximation of how genetic diversity in humans will impact polygenic risk scores. Just to orient you, our knowledge about polygenic risk score portability comes from this idea that we take this continuum of diversity in humans, we ascribe it to these clusters, and then we essentially make statements about all the individuals in a given cluster. Of course that causes issues, and I'll just highlight three of them. The first one is that we are obfuscating diversity within any of these clusters. At UCLA, if you just focus on the Asian ancestry cluster there, we find a lot of diversity that will be missed and will not be present when we try to understand the coupling of genetic diversity with polygenic risk; for example, you see there all sorts of different groupings of individuals that will all be given the same label. Another problem is that, particularly for admixed individuals, the cluster boundaries for this genetically inferred ancestry depend on the method of choice and the reference panel, and we often find the case in real data
where pairs of individuals are very similar, but one sits inside and one outside of these clusters, so they will be ascribed a different error, if you will. And then finally, a huge chunk of individuals in the data are really hard to cluster into any of these groups: anywhere from four to ten percent of all the individuals will be completely left out from the characterization of polygenic risk scores, because their genetic ancestries are so complex that they cannot be assigned to these reference maps. Motivated by this, we thought about how to integrate a continuum way of looking at the impact of genetic ancestry with genetic architecture for predicting complex traits. I'm not going to go into the details, but you can show, with relatively straightforward assumptions, that if an individual is sampled d units away from a training population, and these d units can be defined analytically, then the correlation between the predictor that comes from the PGS for that individual and their true genetic liability has this form there, which depends linearly on the distance. So now we can go back to the data, drop all the clusterings altogether, and put every individual on this genetic ancestry distance. What do we find in the real data? This is data from the UCLA biobank; every dot there is one individual, and on the y axis is the accuracy of the prediction for that particular individual. We find a remarkable linear relationship that explains a huge amount of the variation. Obviously there are patterns there that tell us that linearity doesn't explain everything, but to a first approximation the decay in performance is linear. If we take the ancestry groupings again on this plot, you would see that there's way more variation across individuals than across these genetically inferred ancestry groupings. This is not just a feature of one population or another; it holds in all the genetically inferred ancestry groupings that we looked at, all
the way from the European ancestry individuals all the way to the unclassified folks. So this is really saying that there's no such thing as homogeneous populations, and that we can use the genetic diversity in humans to personalize, or to find groups of individuals across these different genetic ancestry clusters that might have similar performance. So to a large extent we think that we can understand pretty well how genetic diversity affects the accuracy. But what do we think about the actual polygenic risk scores themselves? Here I'm showing you a plot of the polygenic risk score for height on the left side, at the bottom, and on the top is the actual measured height, and they're all plotted as a function of genetic distance. On the right side you see a different phenotype, neutrophils, which I'm showing as a kind of comparison. What you see there is that they track really well up to some point, after which they go in completely different directions, which you can think about as maybe being signal, but probably the most reasonable explanation here is that they're biased predictors and they should not be used past some given genetic distance. And this raises these types of statements that appear on Twitter. I cannot tell you what those claps mean, but hopefully you can appreciate them: that we should generate more data before we start using and comparing polygenic risk scores across different genetic ancestry groupings, or across different categories of genetic distance. So genetic ancestry is just one of the many contexts, and in many ways we've understood it much better in the past five or six years. But in general, whenever we apply these polygenic risk scores, genetic ancestry is one of the contexts but there are many others, and I highlight here recent work from folks in the audience showing that the prediction varies across these
non-genetic contexts like age, sex and socioeconomic status, which essentially tells us that maybe the genetic architecture varies across these contexts, and that's why we should be accounting for it. Motivated by this work, we asked whether this is just one example or a fundamental component of polygenic scoring in humans. So we went back again to All of Us and UK Biobank, we assessed as many traits as we could, and the long story short is that this is highly pervasive. On this plot here you see this differential performance of polygenic risk scores. You have multiple polygenic risk scores there on the columns; contexts are on the top, anywhere from age all the way to whether you wear glasses or not, and you see that the performance varies drastically across these different contexts. Whether you wear glasses or not tells you something about whether the prediction is going to be accurate or not, and obviously this is fundamentally problematic if you want to apply these methodologies directly in the data. This is UK Biobank. If we focus on the US, which is way more diverse than this biobank, if you look in All of Us, the problem is way more dramatic. I'm not going to go through all the numbers there, but hopefully you can appreciate there's way more coloring and way more numbers in this side of the matrix, really showcasing that the more diverse... Don't touch my laptop, guys. Hello? Hello? Fantastic, perfect. Yeah, so hopefully you can appreciate that that's All of Us. So we see a way more problematic impact of context, I should say, in more diverse data sets. I told you about a lot of potential problems, so how do we move forward? I'm also willing to propose a solution. So the solution that I would propose for you today, and I'm curious how we all think about it,
is that, before we take a PGS, a prediction, and try to apply it in a patient population, we calibrate it first. So assume that you have a calibration data set that has the same characteristics as the patient population that you want to apply it to, and you also have the measured phenotype. Before you think, oh, this is unrealistic, I'll guide you through the fact that this has already happened. At least in the US, this is a plot of all the biobanks that exist in the medical system and beyond. eMERGE is the group in green there in the middle. These data sets are different from UK Biobank in the sense that they are patient data from medical systems in the US that have the stated goal of returning results to their patients, so in many ways there cannot be a better calibration data set than the data of the patients within that medical system. That's definitely true for UCLA. This is the biobank that I told you a little bit about, which captures the diversity in Los Angeles both in terms of genetics and in terms of medical records. Just to give a pointer on how this may work, consider the case of LDL, which may have differential performance in the prediction across two contexts. Assume that we take all the individuals for whom we predict an LDL of 120, let's say. Maybe in context one we're worse in accuracy, and that will lead to wider confidence ranges in context one than in context two. So in context one we may say the prediction is 120 plus or minus 40, as opposed to a narrower range in context two. Obviously this is just a toy example; we can do it in a more sophisticated way with some very easy fitting of the data, which I'm not going to get into, but by fitting all these contexts together we can actually disentangle the role of different contexts, and maybe that's one way to answer the question that we had yesterday. So for example, for LDL, wearing glasses is not a
relevant context, because it just tags age. So wearing glasses should not be thought of as a context, because it's just a proxy for age, and once you account for age in the context, wearing glasses goes away, which makes a lot of sense. We can integrate all that together, and then at the end we can really look at the prediction across all the individuals, and that gives us a way to think about which traits we might need to adjust a lot in a data set like All of Us, versus traits that are essentially ready to go, if you will, where the context doesn't matter. So I'll stop here, having hopefully convinced you that causal effects, at least for common variants, with respect to polygenic scores, are highly similar across ancestries; that frequency and LD are a big source of confounding that we need to adjust for; and that we should start thinking about a continuum. And hopefully I made a pitch for biobanks that are embedded in medical systems as a primary source for fixing some of these problems, if you will. So thank you so much for listening. I acknowledge all the funding and all the people that have done a lot of the work, and in particular [unclear], which has a stated goal to diversify genomics. Thank you so much.
[Name unclear] asks: patients in health system biobanks may not represent the full patient population, and this can affect both relative and absolute risks. What are your thoughts on this?
Yeah, that's a fantastic question, and that's one of the caveats: the patients within a medical system are not a representative sample of the population as a whole, and we need to deal with that. The other problem is that, when we think about absolute risk, our estimates of prevalence or incidence come from a concept of self-reported race and ethnicity. That's what epidemiologists have done in the past, and we need to include those somehow in our models. But overall I think that's a challenge
that we need to take on, and maybe that's also motivating new data sets that are more representative across, let's say, the US population or other types of populations, and that are also collecting more context-specific data that we can use to estimate those in an essentially unbiased manner.
Bogdan, so this is probably a misunderstanding on my part, but I'm trying to square the fact that first you showed that the prediction accuracy is very well predicted by the genetic distance, and then you showed that contexts have a large effect, which to me would imply that the [unclear] very plausible. So I'm just wondering if you could help me out here.
That's a fantastic question, and we're grappling with that question as well. I think one of the hypotheses is that there's a mixing of all these contexts across all these ancestries, and somehow we're integrating that out when we look at the linearity across genetic diversity. But even when you look at the linearity, there are aspects of it that are nonlinear, which may reflect this context-specific effect, like a subtle point or a secondary point in those plots about the decay of accuracy across genetic distance.
[unclear] mixed across ancestries, in effect.
Yeah, and part of it is also the limited data that we're looking at. In that data set we're looking at the UCLA Biobank, and in some ways, because those are all patients that already entered the system, we don't have other types of diversity, so the limited context also obfuscates some of these findings. So again, maybe I'll pose it as: we need more data sets that are looking at more diverse contexts, if you will, beyond the genetic ancestries.
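As a rough illustration of the relationship discussed in this exchange, the sketch below fits a straight line to per-individual accuracy against genetic distance from the training data. Everything in it is an assumption made for illustration: the data are simulated, and the names `fit_line`, `r_squared`, `distance` and `accuracy`, as well as the slope and noise values, are hypothetical rather than taken from the talk.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Simulated data only: accuracy decays linearly with distance, plus noise.
rng = random.Random(0)
distance = [rng.uniform(0.0, 1.0) for _ in range(500)]
accuracy = [0.8 - 0.5 * d + rng.gauss(0.0, 0.03) for d in distance]

intercept, slope = fit_line(distance, accuracy)
fit = r_squared(distance, accuracy, intercept, slope)
print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={fit:.3f}")
```

With real data, structure left in the residuals of such a fit is one place where context effects of the kind raised in the question could show up.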
Well, very brilliant talk, Bogdan. I was curious about your quite elegant analysis using admixed populations. You managed to estimate the correlation of causal effects for admixed individuals living in the UK and in the US as well, right? So did you see any difference in those genetic correlations across those different environments? I guess we're still talking about westernized populations, but do you see a difference, and would you expect a difference if you were to do that in a different population?
That's a fantastic question. We did not see a difference: we saw highly consistent results whether you look at admixed individuals within UK Biobank or All of Us, with restriction to African-European admixture. However, we also looked at the consistency between the causal effects in, let's say, Africans versus Europeans, as different continents, and we saw a much bigger drop. So the correlation in the causal effects, using similar methodology, for Europeans versus Africans is much lower than for European versus African segments within admixed individuals. One explanation for that is this G-by-E type of thing. Another explanation is that we're not doing such a great job of correcting for... you know, when we look cross-continent or across different groups.
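The calibration idea from the talk above (the same predicted LDL carrying a wider range in one context than in another) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the speaker's method: the function names `context_intervals` and `predict_with_interval`, the context labels, and the numbers are hypothetical, and the spread is taken as the root-mean-square residual in a calibration set with an approximate normal 95% range.

```python
from collections import defaultdict
from math import sqrt


def context_intervals(calibration, z=1.96):
    """Half-width of an approximate 95% range for each context.

    `calibration` holds (context, predicted, measured) triples from a
    calibration data set; the spread is the root-mean-square residual,
    which assumes residuals are roughly normal and centred on zero.
    """
    residuals = defaultdict(list)
    for context, predicted, measured in calibration:
        residuals[context].append(measured - predicted)
    widths = {}
    for context, res in residuals.items():
        mean_sq = sum(r * r for r in res) / len(res)
        widths[context] = z * sqrt(mean_sq)
    return widths


def predict_with_interval(predicted, context, widths):
    """Return (low, high) around a point prediction for one context."""
    half = widths[context]
    return predicted - half, predicted + half


# Toy calibration set (made-up numbers): predictions are noisier in
# "context_1" than in "context_2".
calibration = [
    ("context_1", 120.0, 100.0), ("context_1", 120.0, 140.0),
    ("context_2", 120.0, 110.0), ("context_2", 120.0, 130.0),
]
widths = context_intervals(calibration)
for ctx in sorted(widths):
    low, high = predict_with_interval(120.0, ctx, widths)
    print(f"{ctx}: 120 with range {low:.1f} to {high:.1f}")
```

The same predicted value of 120 comes out with a range about twice as wide in the noisier context, which is the point of the toy LDL example in the talk.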
Okay, I think in the interest of time we're going to move on to our next speaker. The second speaker is Genevieve Wojcik from Johns Hopkins University.
I can't see the screen, sorry, I'm too short for this. Where's that exit? Oh, it's all one slide deck. Yeah, brilliant. Okay, do I use this one? It's easier because I'm shorter. Okay, all right. So thanks for having me today. Just a heads up that I'm an epidemiologist, a geneticist as well, but I come from an epidemiology background, and so what I'm going to present to you today is how I work through these problems and how that affects the research questions that we can ask. I wouldn't be a real epidemiologist if I didn't show you some sort of vague pseudo-DAG. What I'm showing you here is the relationship that we normally think of: we think about genetics and how it relates to a trait, and most of the time we think of this as the core question. Down further? Okay. Further, more? Okay, all right. So we think about the genetics versus the trait, but there are a lot of other things going on: how genetic ancestry intersects with the environment, and what that means for our genetic risk for human health. We have population genetics, in terms of different allele frequencies and LD patterns, that affect our genetic scores or variant-level risks, and then this is confounded by group memberships, the sociopolitical context in which they exist, how that affects social and structural determinants of health, and then of course biomarkers and individual-level risk factors. So there's a lot going on here, and I think most of the time we either try to ignore a lot of this or we just try to adjust it out, when in reality I think it's a really fundamental way of thinking that becomes increasingly relevant as we move towards polygenic scores. This is from social epidemiology, this idea of the society-behavior-biology nexus. This
is from 2006, and what you're seeing here is a different framework for looking at how we incorporate different levels. You have at the bottom this genomic substrate, the riverbed, and you go more and more macro in scale as you go up. Once you get above the surface, you're going above individual-level factors: micro-level factors, meso- and macro-level factors, and then global-level factors. They all have effects on human health, and so we're just part of it. I think one of the fundamental tensions that we have here is how we reconcile the two sides of things. We have the top, in which social constructions such as race or ethnicity or other group membership might be most relevant, and then we have what we capture in our methods to do this. When we think about context, I think it's important to note there are many different scales of context. This is looking at biomarkers and lifestyle all the way up to these societal factors as we go up in scale, and they all matter. I think part of the disconnect that we have is that we have moved from GWAS, in which maybe the prime motivation was discovery and mechanism, to polygenic scores, which are more about population-level dynamics, where we're capturing not just the biology but basically a bunch of correlations that we need to account for across the entire genome. And so as we move from GWAS to polygenic scores, the rest of this context matters, and how macro a scale it is, moving beyond individuals. I'm going to walk you through some examples of what this means in terms of our data in the next few slides. We're going to use two anthropometric traits as model traits, as described earlier today, and we're going to look at height and BMI: two traits very easily measured in large numbers, but with different heritabilities, so different contributions of genes to what matters more. So does genetic similarity
matter more when looking at differences in distributions between these different groups, or does racial or ethnic identity matter more, the social construct of those identities? What you can see here is the correlation between these traits and PCs one through five. For height there's a strong negative relationship in this data set between height and PC one, then positive for PC three, et cetera. So you see this differential relationship by PC for these different traits, and it's noticeable that the traits don't show the same patterns in relation to ancestry. Now we can also look at differences in the average height and average BMI within, this is the PAGE study, racial and ethnic groups. Here we have AA, which is African and African American, Asian, and the other groups in our study population; you have different average heights and different average BMIs, and so you want to tease apart: is this because of genetic differences, or is it because of the social context being different? One way to do this is to look at how much of the trait variance is explained. We fit a full model where the base model includes age, sex and study within the PAGE study, because PAGE is a consortium of consortia. So there's a base model, and then we're adding both the PCs and these racial and ethnic categories as factors into the model, basically saying we're going to throw everything in there, and then we're going to take one out at a time and see where the biggest loss of information occurs. What you can see is that for both BMI and height, looking at these broad racial and ethnic categories, genetics matters more: when you take out the racial and ethnic categories, you don't lose a lot of information that can't be captured by PCs. And this doesn't really sit well with our knowledge of these traits in terms of how heritable they are and what's at play in terms of the context,
and I would argue that part of this is because of the granularity that you're looking at in these populations: they're very, very broad, as came up earlier today. So it requires precision about who we're looking at to get a better idea of what matters. One example is looking within Hispanic/Latino individuals. Within PAGE we have better granularity for how individuals self-identify. These include the groups in the Caribbean, where we have Cuba, Dominican Republic and Puerto Rico, and then in the continental Americas we go from Mexico to Central America and South America. Unfortunately Central America and South America had to be combined, given the granularity of the data and the numbers needed to get power. But we see differences in the average age-adjusted BMI, and you can see there are differences in the average distribution here, so again it's hard to know: is it due to genetics? Keep in mind that most of these people have recent admixture from three or more groups, so they run the whole spectrum of human diversity. So is it due to genetics, environment, or both? When we ask whether this concept of group membership and ethnicity matters more than PCs and genetic similarity, we recapitulate what we know about the heritability of the trait. For BMI, if you remove these ethnic categories, you lose a lot more information than if you just remove the PCs, which means that this is a construct that matters more for looking at the distribution of this trait. For height it doesn't matter as much: you lose the PCs, you lose a lot of the information, and that tracks with what we know about the heritability of the trait itself. Okay, so moving forward to an actual polygenic score, looking at the genetics of it, here I'm showing you the performance of a BMI polygenic score within PAGE, stratified by these different groups, and again you see that there's differential performance for the adjusted R
squared, including a base model of age, sex and study once again, where you see the Mexican and Mexican American individuals. So what is actually going on here? A lot of times when we present these fully expanded models, we're really covering up what each individual term means. So we can look at the incremental R squared for each: both the polygenic score, which has been adjusted for [unclear] ancestry, which is what accounts for the substructure there, and then the base model. What you see here is that if we actually realign these individuals by group, the base model itself explains different amounts of variation. This is not unexpected; this is confirmed with epidemiological data, where you have basic factors that will explain different amounts of variation depending on the study population from which they are being captured and the prevalence of those factors. What I think is noticeable is that the polygenic score is also different: it performs differently across these groups. Often what we do is pool all these individuals together and give them one metric for how well the score is doing, when in reality they have varying accuracy across these different groups. Another thing to note is that the base model itself is different in terms of how much information it's giving you relative to the polygenic score. You see that in the Mexican American individuals it's actually the base model that's doing the least, not the risk score, while for individuals who identify as having Dominican Republic ancestry the score does the worst, which is what we expect with the highest proportion of [unclear]. So it matters to have this precision. Now, is it a matter of just adjusting it out? Do we just adjust it out and have another parameter in our model to account for this diversity? The answer is no: in that model our accuracy drops, and the reason is that there are real relationships between the distributions of this risk score and
the outcome within these study populations. I'm showing you here, on the top left-hand side, that Puerto Rican individuals are overrepresented in the top quintiles of the polygenic score; you can see that the whole distribution is shifted. On the right-hand side you can see what the effect size difference is for that top quintile before and after adjustment for group membership, if you have everybody combined together. What you can see is that for some groups the effect size actually goes up, so it does a better job of discriminating these different groups if you were looking at it in a binary way, and for some it does worse. So again, it's not one size fits all, and it's not a nuisance parameter you just throw in and hope to adjust things out; it's really fundamental for defining what group you're actually looking at, the context. We can also look at this in terms of environmental context. This is work that was led by Lindsay Fernández-Rhodes, who is at the back of the room right now. What we did was look at the polygenic score for BMI to see how it performed in an expanded model and how that changed based on measures of immigration. What I want to point out is that it is also stratified by the same groups I had before, but this is in the Hispanic Community Health Study/Study of Latinos, with a larger model, which is why the R squareds are larger. What you can see off the bat confirms that the model fit is different by background, even after adjustment for a lot of different factors and ancestry. The effect size of the actual PRS also differs by background, even after adjustment, and you see pretty substantial differences in effect sizes between the different groups. And then what I want to point out here is the addition of an environmental context. This is the age at immigration for these individuals. This has already been
adjusted for their current age, but you're looking at what age they were when they came to the United States, relative to people who came above the age of 20. And what you're seeing here is that this one variable, which is very easy to standardize in that it's age at immigration, a sort of cut-and-dried variable, means different things in different groups. It does different things in different groups because immigration occurs for different reasons in these different groups, at different times and in different contexts, both where people immigrate from as well as where they immigrate to, right? And so the context is different: a single environmental variable that we measure means very different things, again, in different groups. And you see this in other factors. It's common in environmental epi, where you'll look at pollution, right? So air pollution, say you measure it in LA and in New York City and you compare different levels: it's the same monitor, the same technology used to measure it, but it means very different things, because in New York City the areas with higher socioeconomic status on average have more air pollution than those with less, and in LA the areas with lower socioeconomic status actually have higher air pollution. And so again, the same environmental variable, the same measure, the same technology, means something very different. And so this goes down to these basic epidemiological concepts: what is your study population, how does that reflect your source population, and how does that reflect your target population? One thing I think to note is that we're getting larger and larger data sets, but we're not reaching numbers that actually approximate the full population, so we still have to worry about bias. Just because we have large numbers does not mean we get away from bias, and because of this we need to have precision in our question. And so I understand this is a trade-off now, right, between 
power and precision; they're the two sides of the coin here. And this matters for GWAS, it matters for all things. So what I'm showing you here, as a complementary way of looking at our data, is the average sample size of studies that are published. This is from the GWAS Catalog from 2012 to 2022, and it's colored by group; the solid lines are the cumulative trends over time and the dashed lines are yearly trends. What you see is that the average sample size has increased dramatically for European ancestries, and this is largely due to biobanks being made more available. But for everybody else we're not doing very great, right? The sample sizes on average are much smaller, which is a rate-limiting step when it comes to discovery and therefore translation to other downstream steps. And so again, this is this trade-off between precision and power, and it gets even more dire if you try to make it more and more precise in terms of your study population and their context. 
Now this has consequences for your polygenic scores themselves, right? The polygenic scores are often built off of GWAS, and what I'm showing you here is that, of all the published polygenic scores in the PGS Catalog earlier this year, the vast majority only include European ancestry samples. So the solid blue on the left-hand side is the proportion that included only European ancestries, and the ones with the pluses included other people as well. So you have a lot of resources now for further work looking at polygenic scores and how they work in populations, but it's not for everyone. This is any of the polygenic scores that included anybody who identified as African American, Afro-Caribbean or Hispanic/Latino, and it's a very, very small number of scores. So if we want to look at this data in the US population, this is on the right-hand side, and we're in trouble, because we just don't have the scores to work with. So we're moving towards precision medicine and precision health at a sort of breakneck speed, but we're leaving a large number of people behind again, which is what we've done time and time again. And so how did we get here? I swear I had this in my slides before the presentation yesterday, but if we look at these genomic health inequities that we see, it's turtles all the way down, and by turtles I mean this Eurocentric bias, right? It permeates every step of what we're doing: who we sample, how we sample them, discovery, and then translation of these results. And so what it really comes down to, I think, at the base of it, is what we accept as the default and what we consider afterthoughts, right? We're just patching up the faulty foundation later. You know, we're method developers and we're physicians, so we all make assumptions, right? But I think it's different to make assumptions than it is to accept a default, and the default permeates the questions we ask, who we include, and the systems 
we model. And so if we think about equity, we have to rethink the foundation to begin with, to make sure that everything that results from it is also equitable. I'm going to breeze through this because I don't have much time, but what do we need as we move forward? I'm always going to say data, right? Data for genetics, data for phenotypes, and data for environmental contexts across these different scales. This includes diversity of the participants, the populations from which they come, and these different contexts. And I think what's really important here is that this requires an incentive structure, right? This requires time and money. It is not incentivized to think about these questions deeply, because it requires collaborations across fields. We're all very used to team science, but maybe we're not used to team science that incorporates people from outside of our genetics sphere, which I think is becoming more and more important as we move forward. And so we need to do that moving forward, to actually look at how these risk scores might perform in different groups and even how much genetics matters across these different contexts. And then lastly, one thing to think about is what systems are in place and how we can change them. This is from the GWAS Catalog a couple of years ago, and what you can see here are the demographics of the GWAS Catalog participants. They don't reflect the world population, they do not reflect the US population, but sadly, if you look around the room, they kind of resemble us, which I think is a little embarrassing. I think most of us didn't go into this for ourselves, but we need to do better in terms of the diversity of the participants and the populations that we work with. And so let me leave you with a thought as we move towards this precision health for genomics: who is it actually precise for? And a lot of acknowledgments, and I'll end since I'm 
out of time, so thank you. Your talk and Bogdan's talk: do you have a sense of how much effort we should put into sampling genetic diversity versus environmental diversity? I mean, obviously that's kind of an open-ended question. Yeah, it's like picking your favorite child. You know, I think we have to do something even before that, which is that there's no point in sampling increased genetic diversity if we don't have methods that account for the full spectrum and allow for the full spectrum, in the way that Bogdan showed before, where you have individuals who are on class and who are arbitrary. So that's more of a foundational question. And I know it's NHGRI, so obviously genetics is the most important part of this, but this requires collaboration between the institutes as well, I think, and so possibly getting this sort of collaboration, not just being on us but also on the institutions. So again, that's not an answer to your question, because I can't choose, but it's everything. That's fair enough. And a question from online, from Timothy Rabin. This is really about cost, and maybe this gets to one of your points. Some Eurocentric biobanks, for example UK Biobank, are free to use or relatively free, and some more diverse biobanks, for example All of Us, cost more money to use. Do you think that the cost of accessing the data is a barrier to the sort of applications you're describing? Yes and no. It definitely is a barrier in terms of people doing this research, especially for groups that are not as well funded, both in the United States and globally. But I also think that there is a lot of work being done in much smaller datasets that are not in biobanks that is very valuable, and it's being done in a careful way, like the study I showed before by Lindsay, where the sample sizes are much smaller but you can get a really good idea of what's going on, and working with those 
collaborators is much lower cost and should be encouraged as well. Great. Okay, so I think we have to move on to our next speaker. Our last speaker in this session is Abdel Abdellaoui from the Amsterdam University Medical Center. Okay, thank you. Thank you, Ian, and thank you, NIH, for inviting me to be part of this amazing crowd of people. I'm going to tell you something about genome-wide association studies. It didn't take long before we realized that sample sizes are key in increasing genetic signals, so appropriately the sample sizes have more than tripled over the past five years. But we are all gathered here to try to understand the content of those signals. So what we did here: we downloaded all of the GWAS summary statistics from the GWAS Catalog, thousands of summary statistics, and did some QC filtering. The strongest filter was on sample size: studies had to have at least 50,000 participants in order to get some decent estimates of genetic correlations between traits, because we made a very big genetic correlation matrix with all pairwise genetic correlations between all traits, and then did a principal component analysis on that matrix to see what the largest patterns of variation are in all of these GWASs that we are conducting. These five explain the most variation, I think around 35 percent of the variation, and this first principal component explained 10 percent of the variance. We called it cognition and socioeconomic status, but the strongest loadings are actually for GWASs of educational attainment and income. Well, mostly educational attainment; for income there aren't that many GWASs, but that shows a genetic correlation of larger than 0.9 with educational attainment. And this is a very interesting signal that's so ubiquitous across all these GWASs. And if you look at processes that influence the genetic makeup of a population, so things like migration, assortative mating, 
fecundity (how many children people have), these signals are most strongly associated with those kinds of outcomes. But what do these signals contain exactly? I mean, we know there are no genetic variants that encode for your diploma or the money in your bank account. These genetic effects travel through very low-level biological processes that we talked about all day yesterday, and those influence traits, and those traits influence each other and influence what environment you are exposed to, and that environment influences you and those traits and perhaps those biological processes, and all of that gets captured in the GWAS signal. So the stack of turtles that Zander showed us yesterday, that's been mentioned a couple of times today: well, that figure right there, that is our stack of turtles, and this paper helped expose the turtle on top. This was not what we set out to do. We tried to look at the geographic distribution of polygenic scores, so we collected GWASs on a wide range of traits: physical, mental health, behavior, personality. Any GWAS that we could get our hands on that was big and did not include the UK Biobank, we used to build polygenic scores for the 450,000 individuals that fall in the European ancestry cluster in the UK Biobank, and then we looked at the geographic clustering. So on the x-axis you see (can I get that arrow?), on the x-axis you see Moran's I, a measure of geographic clustering, of spatial autocorrelation. Before controlling for ancestry, a lot of the polygenic scores showed substantial geographic clustering. A lot of that looked like those beautiful geographic distributions from the PCs; those reflect ancestry differences within the country. Often they are in line with geographic barriers or cultural barriers; here you often see the difference between Wales, England and Scotland. We're not sure why many of these polygenic scores showed these distributions. It could be a lot of the things that Bogdan 
talked about, or maybe there are actual differences between those ancestries, or the GWASs did not control well enough for population stratification. But that's not what we were interested in, so we controlled for the first 100 principal components, and then we saw all of the geographic clustering drop except for one polygenic score, and that was educational attainment. When you plot that on a map, you see the same map as the socioeconomic differences in the country. So you see the Townsend index next to the polygenic score distribution, and those black lines are coal mining regions. That coal mining industry collapsed between the 20s and the 80s of the past century: a lot of joblessness in those areas, a lot of environmental stressors. It's very different to live in those regions than in the rest of Great Britain. And I see half of this figure dropped, not sure why, but what it shows is that people that were born inside the coal mining regions have a lower polygenic score on average than the rest of the country. And this figure actually shows the same, but longitudinally. We split them up into different birth cohorts. You see there's a red line, from the people born in the coal mining regions, that drops faster than the blue lines. And this green line here, that comes from this red line, and they show consistently a higher polygenic score than the rest of the country: these are the people that migrated out of the coal mining regions. So there's sort of a brain drain going on that is detectable on the genetic level and that is increasing these regional differences. And this is another way to show the effect of migration. We compared birthplace with current address, and you see that the variance explained by region increases when you look at current address compared to birthplace. And for the principal components, so those older ancestry patterns, those decrease. So you have those old geographic patterns of ancestry; those get broken 
up by these more recent migrations that are driven by socioeconomic forces. And just two weeks ago this preprint came out that did exactly the same analysis in the Estonian Biobank, about 180,000 genotyped people, about 20% of their population, and they see exactly the same thing happening. The geographic clustering increases because of migration for the polygenic score of educational attainment, while for all of the PCs the geographic clustering decreases, so those older ancestry patterns get broken up. And if you plot the polygenic score for educational attainment in Estonia, you see that this is largely driven by these two university towns, Tartu and Tallinn. They also looked at a lot of polygenic scores: we looked at 33 and they looked at more than 160, I think, and they saw the same outlier, the one for educational attainment. And here on the x-axis you see the correlation between the other trait and educational attainment, and the higher the genetic correlation of the other trait with educational attainment, the stronger the geographic clustering becomes. We saw the same thing in UK Biobank. So here on the x-axis you have the measure for geographic clustering and on the y-axis the absolute genetic correlation with educational attainment; with intelligence, that's about half of that signal, here on top. And so this GWAS on educational attainment captures this collection of traits that influence how well you do in school or how well you do in society, and all of these traits get clustered non-randomly. As I said earlier, it's very different to live in a poorer region of the country than in a richer region, especially in the UK, which is the most unequal country in northern Europe. And here you see the geographic distribution of what is clearly an environmental variable, the number of fast food restaurants in the region, and that shows the same distribution as that polygenic score. So you get genetic effects and environmental effects clustering 
geographically, and they're hard to disentangle. So in this paper we looked at whether these gene-environment correlations extend beyond the family. There were, before this, many studies that showed that polygenic scores predict the family environment as well as genetic effects, and we looked at that here. So that's passive gene-environment correlation: you get born into a family, and the parents pass down their genetics but also the rearing environment. But what we also saw was that as these siblings grow up, the sibling with the higher polygenic score is more likely to move to a region with healthier environmental influences, so that's active gene-environment correlation. What we also did: we ran GWASs on more than 50 traits related to physical and mental health and a lot of other stuff, and controlled for where people were born or where they migrated to, or both, and we saw that the genetic correlation with educational attainment decreased after controlling for geographic region for almost all of these traits, most strongly for BMI and substance use, so traits related to what you put in your body. And this figure shows a similar effect. If you focus on height here, what you see is that on an individual level the polygenic score for educational attainment explains about 1% of the individual differences in height, and the polygenic score for height explains 21% of the individual differences in height. It's a pretty good polygenic score, but not the one that Loic talked about. But when you take the regional average of height and of the polygenic scores, then suddenly educational attainment outperforms the polygenic score for height, because on a regional level this polygenic score also predicts these regional environmental differences. So this is the actual map of the height polygenic score, and then the map of phenotypic height, and those are not the same. And here's that polygenic score for educational attainment again, and you see that if it was all genetics, then people 
from those coal mining regions should be the tallest in the country, but now they seem to be among the shortest. So these environmental influences that cluster with the genetic effects associated with educational attainment influence a very wide range of traits, and also educational attainment itself. If you control for the region of birth or of current address for educational attainment, you see the SNP-based heritability substantially decrease, and that is, I think, because we as a society make these genetic effects stronger. We reward certain genetic propensities for the traits that we value in society with a better environment, and leave the people that lack those propensities, or have a lower propensity, with a worse environment. So if you see that a child does well in school, he gets sent to a better school with better teachers and is more likely to grow up and find a better job and be able to afford to live in a better neighborhood, and in that neighborhood he's more likely to meet a spouse who has the same genetic and economic luck. Assortative mating also shows the strongest effects for these signals. Over time this could make society more unequal and also make our jobs harder; it makes studying genetic effects more difficult, because these genetic effects and environmental influences are very difficult to tease apart. Just controlling for the region that people live in is a very crude proxy of someone's social environment. So we have to think carefully about how we apply these polygenic signals. This is one of the more controversial applications of polygenic scores: screening embryos in IVF treatments for their polygenic risk. This is the first baby that has been born from such a procedure. Her name is Aurea. It says here: Aurea doesn't know it yet, but her birth was very special. She is the world's first PGT-P baby, meaning that she is statistically less likely than the rest of us to develop a genetic disease or 
disorder throughout her life. And this is from the frequently asked questions section of the website of the company that assisted in her birth. It says: does Genomic Prediction screen purely cosmetic traits? No, we only provide risk scores for polygenic traits related to disease. And the other question is: does Genomic Prediction Clinical Laboratory screen embryos for increased intelligence? And the answer is no. So I took GWASs for all of the traits that they screened their embryos for, so you have brain disorders, heart diseases, cancers, inflammatory bowel disease, and type 1 and type 2 diabetes, and I computed the genetic correlation with a whole range of non-disease traits: anthropometric traits like height and BMI, reproductive traits, substance use, some psychological dimensions, and educational attainment and IQ. And of course it's no surprise that these genetic correlations are significant all over the place, because there is a lot of pleiotropy, which was discussed yesterday at length. But we're starting to see that this pleiotropy does not happen on a strictly biological level only, but also through our environment, through the social environment specifically, through the way we organize our society and the way we create social inequality. That explains a substantial part of these genetic correlations. So I want to thank my wonderful colleagues. I'm very happy to work with them; some of them are sitting in this room. So thank you very much, and thank you for your attention. Thank you. You briefly mentioned assortative mating. I was just wondering, to what degree is this geographic clustering just tracking the degree of assortative mating across those different traits? Yeah, I'm not sure to what degree exactly, but I am pretty sure that those two are related, because if they cluster, they're also more likely to meet each other. So yeah, we're planning to look at how much of assortative mating is influenced by this geographic proximity. Yeah, that's something 
we have to look at empirically, but it's a very good question. Those two are closely linked, I think. I have a quick question. I was wondering, given the educational systems in the UK and Estonia, how do you think that these results would translate to, like, the US, where the educational system is very different in terms of educational attainment and who has access to higher education? Yeah, that's a good question. I suspect that it's going to look similar. A lot of the migration here is also related to education, either education or other economic reasons, but that's something we'll have to look at empirically. I would love it if someone that has access to these big datasets in the US could have a look at it. Peter? Yes, so a lot of your results show there's a gene-environment correlation, as you call it, or covariance, and that if you have population studies like GWAS, that influences your estimates and the interpretation of your associations. But what you did mention is that there are experimental designs to break that up, if you look within families for example, and people obviously have been doing this and will do more of it, and you can see for those traits enormous differences in effect sizes at the within-family versus population level. So that's quite interesting, I find. Yeah, I think so too. Yeah, that's true, and they see a similar decrease in the heritability of traits that are closely related to educational attainment, but they also see for other traits that their genetic correlation with educational attainment drops when you look within family, similar to when we control for region, which is effectively a within-region GWAS. Hi, Lindsay Fernández-Rhodes, Penn State University. Thank you, Abdel. I really appreciated that you for the first time brought up a life course perspective by looking at birthplace and then current address. When you plotted it out, you could look at the 
various categories and how certain groups would be differentially affected. Have you done something similar looking at socioeconomic status or educational attainment of a parent compared to the individual once they reach adulthood, and compared those, further deepening your life course understanding of what's going on? No, I haven't done that. It's a bit more challenging to do that in UK Biobank because there are not a lot of parent-child observations, but Michel Nivard has looked at it in MoBa, who have collected parents and their children, and yeah, he's going to sit here later; maybe you want to comment on it, I don't know what you found. I think you're right, we should do a lot more of that, and when you bring in parents, but also parents of parents, there's a whole new set of association tests you can do, right? Thinking about what Peter said, you can do within-sibling association, and you could also do within-sibling association stratified by region, so there are combinations and crosses of those kinds of things you can do for regional differences. Rosa Cheesman did some really cool work on that in MoBa, where they have parents and offspring, like 100,000 trios basically, where you do both: controlling for gene-environment correlation using the parents, but then you can also do the interaction with regional things. And as we all know, estimating either correlation or interaction just risks the other one mucking everything up. Yeah. We have a 10-minute break, so back at 10:57.