So, welcome, all, to the Computational Biology Seminar. Today I would like to welcome Zoltán Kutalik, who is assistant professor at the Division of Biostatistics of the Institute of Social and Preventive Medicine at CHUV and also a group leader at the Swiss Institute of Bioinformatics. Zoltán was trained as an applied mathematician and obtained his PhD in 2006 at the School of Computing Science at the University of East Anglia and the Computational Microbiology Lab of the Food Research Institute in Norwich, UK. Then from 2006 to 2010 he did a postdoc at the University of Lausanne in the Department of Medical Genetics; in 2011 Zoltán became junior lecturer at the same department, and in 2013 he became assistant professor. His main research interest lies in developing statistical methods integrating various omics data in order to better understand genetic disease, in the context of what we call GWAS, genome-wide association studies, and he has been part of many international collaborations as an analyst and principal investigator on different GWAS efforts on obesity, hypertension and other traits. So today Zoltán will explain to us what we can learn from analyzing metadata from those GWAS. So Zoltán, the floor is yours.
Thanks for your introduction, and thanks for the invitation. Many thanks to those who actually came rather than just watching virtually through a screen; we are geeky bioinformaticians and tend to like watching things on screens rather than coming, so thanks a lot for coming. Today I decided not to center my talk around a research topic, but around a data type. This data type is what I call metadata. I will explain to you what metadata is, and the first 15 or so slides will basically set the stage for it.
So instead of directly jumping to what meta-analysis is and what genome-wide association studies are, I start with one simple example of one small cohort here in Lausanne, which we have in the Lausanne hospital at CHUV, where about five and a half thousand individuals have been genotyped, meaning that a large part of their genome has been identified with genotyping technologies; it doesn't matter how. At the same time these people have also been extensively phenotyped at the hospital. These are really serious examinations over hours and hours, so these people are kind of heroes for us. Having this genetic data, we can run genome-wide association scans. The way we run the scans is that we model the outcome trait, which is denoted by Y on the left-hand side of the slide, as a function of different covariates. If we look at the phenotype being, for example, BMI, as you can see in the data column, the covariates are typically variables that influence BMI and are non-genetic factors: age, gender, diet, physical activity and so on. These we can all dump in here as covariates, and we can also put in other genetic factors. Then there is the second term on the right-hand side of this equation, which denotes one single SNP's additively coded effect. Its effect size is beta, meaning that each additional copy of a risk allele increases or decreases your outcome trait by a certain amount, and this amount is this beta G. So this is what we call the effect size. Everything else is called error: what we can't explain, what we don't know about or what we haven't measured.
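The equation on the slide is not part of the transcript. One way to write down the model just described, assuming hypothetical symbols that are not from the talk ($C$ for the matrix of covariates, $\gamma$ for their effects, $G$ for the allele count of the tested SNP, $\varepsilon$ for the error term), would be:

```latex
Y = C\gamma + \beta_G \, G + \varepsilon, \qquad G \in \{0, 1, 2\}
```

Here $\beta_G$ is the per-allele effect size referred to in the talk; the exact notation on the speaker's slide may differ.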
So we have the model, and we have the data that we obtain, for example, in our hospital; similar data have been obtained in many other hospitals around the world. If we just look at one data set and compare the model to our data, we can test for every single SNP the hypothesis of whether this effect size beta G is significantly different from zero. That's the simplest model that you can imagine in any sort of statistical setting. We do hypothesis testing and we estimate this effect. More visually, this is the way it's done for every polymorphism. Take this example in FTO: it's an intronic polymorphism in the FTO gene on chromosome 16. This has the largest ever reported effect on obesity, on BMI. Every population can be split up into three groups: you can carry two A alleles, one A allele or zero A alleles, and these three groups can be compared phenotypically. On the y-axis we see the distribution of the body mass index of the individuals in the three groups, and we can make the observation that, for example, there's a trend of increasing BMI as you have more and more A allele copies. Then you can fit a regression line, and you get an estimate that every copy of the A allele that you carry will increase your BMI by about 0.7. It's a slight overestimation; in reality the true value is probably more like 0.3, so practically it's 0.3 units. So the people on the TT end versus the AA end differ here in CoLaus by more than one BMI unit. The effect is relatively small, but if you combine it with other effects it can be larger. We also have an estimate of how much confidence we have in this.
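The per-SNP fit described above can be sketched in a few lines. This is a minimal illustration only, assuming no covariates and using made-up toy numbers rather than CoLaus data; the function name `fit_additive_snp` is hypothetical.

```python
import math

def fit_additive_snp(genotypes, phenotypes):
    """Ordinary least squares of a phenotype on an allele count (0, 1 or 2).

    Returns (beta, standard_error, t_statistic) for the slope, i.e. the
    estimated change in the phenotype per additional allele copy.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    rss = sum((y - intercept - beta * g) ** 2
              for g, y in zip(genotypes, phenotypes))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return beta, standard_error, beta / standard_error

# Toy data, invented for illustration: allele counts and BMI values.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_bmi = [25.1, 24.9, 25.4, 25.2, 25.7, 25.5]
beta, se, t = fit_additive_snp(toy_genotypes, toy_bmi)
print(f"beta={beta:.3f}  se={se:.3f}  t={t:.2f}")
```

The standard error returned here plays the role of the confidence band around the regression line: a small value relative to the slope means the data are hard to reconcile with a zero effect.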
This is marked by the shaded area. If it is very narrow, it means we're very sure that the effect estimate is the right one, and if the shaded area is very large around the line, it means that we have little confidence in the estimator and we don't have enough statistical evidence to reject the hypothesis that the effect is actually zero and the three groups have identical means. Then these models are tested. Basically you can stare at these plots a million times, for a million different polymorphisms; here I just picked one of them. You scan through the genome and to each of these plots you can assign a p-value. On the next slide you can see how to visualize these p-values in two fashions. In the first, each dot is a polymorphism, plotted on the x-axis according to its chromosomal location and on the y-axis according to the minus log 10 association p-value. So if you have a large peak like the FTO that I just showed you, here in a much larger study than ours, which found FTO to be extremely strongly associated with BMI, you see where it lies on chromosome 16, and then you can zoom in and you can even look at gene annotation and so on. So this is the first type of plot. It looks like a Manhattan skyline: there is a baseline level of skyscrapers and there are a couple of huge ones that emerge in this plot, so here you see about 32 skyscrapers. The second type of plot is a quantile-quantile plot, and it's a bit more complicated. Here you don't care about the position of the polymorphism; you just plot the association strength against what you would expect if there were no association whatsoever.
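The quantile-quantile comparison can be sketched as follows. This is a minimal illustration, assuming the common convention of taking i/(n+1) as the expected i-th smallest p-value under the null (the mean of the i-th order statistic of n uniform values); the function name `qq_points` is hypothetical and the input p-values are invented.

```python
import math

def qq_points(p_values):
    """Pair each observed p-value with its expectation under no association.

    Returns a list of (expected, observed) pairs on the -log10 scale,
    ordered from the most significant p-value to the least significant.
    """
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10(rank / (n + 1)), -math.log10(p))
            for rank, p in enumerate(observed, start=1)]

# Toy p-values, invented for illustration.
for expected, observed in qq_points([0.5, 0.001, 0.25]):
    print(f"expected={expected:.2f}  observed={observed:.2f}")

# With one million tests, the smallest p-value expected by chance alone
# is about one in a million, i.e. roughly 6 on the -log10 scale.
print(-math.log10(1 / (1_000_000 + 1)))
```

Points that rise above the diagonal (observed larger than expected) correspond to more significant findings than chance alone would produce.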
So as I mentioned, we obtain 1 million p-values, one for each of the 1 million SNPs, and these 1 million p-values are ordered. The most significant in this study, for example, was 10 to the minus 25, and you compare it to what the most significant p-value would be if there were no associations in your study, which is the x-axis position of this dot, around 10 to the minus 6, because we are testing 1 million variants. So if you see a deviation from the line on this quantile-quantile plot, it shows that the distribution of the p-values that you observed in your study differs significantly from the uniform distribution, and you have more significant findings than you would see just by chance in a completely arbitrary trait. Now this is a cumulative Manhattan plot over all the studies. Here you can see how it changes over time, and you see all the GWAS studies that have ever been published in the literature in roughly the past 10 years. You see that after a while you can't see anything, and that's the point of the whole graph: initially there were very few studies, and there was a big boom around 2008 when everybody started to conduct these studies and it became really, really cheap. It costs about 100 francs now to genotype an individual, so, you know, the bigger cost is at the hospitals where they have to phenotype these individuals. You see a major peak on chromosome 6 in the HLA region, and otherwise you see that there are hits everywhere all over the genome; every particular segment of the genome is associated with some trait. So if it keeps going on, if I show the same slide in 10 years or 5 years, you won't even see this mishmash of different trait groups. So if you look at this graph and the previous one, then it's very successful: we find many, many variants, and these studies in the past 10 years have discovered an enormous number of genetic associations.
If you look at one particular trait, body mass index (as was mentioned, I'm interested in obesity), the latest study, which was published two months ago, found about 100 genome-wide significant loci. You see here about 32 dots; this is the typical Manhattan plot. A very successful study, but the downside, or the other side of the coin, is that even cumulatively, if you take these 100 together, they explain not even 3% of the variability of BMI in the population. So they don't explain much. We can learn a lot about biology, but you can see that the top one here, which is the FTO variant, explains one third of a percent, extremely tiny; it translates roughly into a 0.34 BMI unit difference for people who carry more risk alleles. And you can see that initially the curve grows quite well, but then it starts to plateau, and from around 50 loci up to 100 the effect seems to stagnate. Maybe if you continue even longer it starts to really flatten out, meaning that every additional discovery explains only a very, very small additional variance of the phenotype. You can look at it more positively. On the next slide we are basically grouping the individuals by how many risk alleles they carry. This is the very unfortunate group on the right-hand side, at the right extreme, who carry more than 104 risk alleles. These people are of course just a tiny fraction of the whole population, a few percent, but they have an average BMI of 30.5. If you look at the others, who are the rather fortunate ones, they have 77 or 78 risk alleles and their BMI is about 27.5, so it's four units of BMI difference depending on how many risk alleles you carry.
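Grouping individuals by the number of risk alleles they carry can be sketched as below. This is an unweighted count, which is an assumption: the talk does not say whether the slide used weighted or unweighted scores. Function names and the toy data are hypothetical.

```python
from collections import defaultdict

def risk_allele_count(genotypes):
    """Total number of risk alleles; each genotype is coded 0, 1 or 2."""
    return sum(genotypes)

def mean_bmi_by_score_bin(individuals, bin_width=2):
    """Average BMI per bin of risk allele count.

    `individuals` is a list of (genotypes, bmi) pairs. Bins are labelled by
    their lower edge, e.g. with bin_width=2 the bin 4 covers counts 4 and 5.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for genotypes, bmi in individuals:
        score = risk_allele_count(genotypes)
        lower_edge = (score // bin_width) * bin_width
        totals[lower_edge][0] += bmi
        totals[lower_edge][1] += 1
    return {edge: total / count
            for edge, (total, count) in sorted(totals.items())}

# Toy data, invented for illustration: three SNPs per person.
toy_individuals = [
    ([0, 0, 1], 24.0),
    ([1, 0, 0], 26.0),
    ([2, 1, 1], 28.0),
    ([1, 2, 2], 30.0),
]
print(mean_bmi_by_score_bin(toy_individuals))
```

With roughly 100 SNPs instead of three, the same grouping gives the kind of comparison shown on the slide between carriers of few and many risk alleles.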
This is already defined at birth. So if you look at the positive side, carrying 104 risk alleles means that on average you are heterozygous for each of these hundred discovered SNPs, and already this makes your BMI increase by practically two units compared to the population average. So we can actually tell something about at least a subgroup of the population. It's typically like with rare variants: rare variants also have a large effect and they define a very small group who are at risk. This is the same, in that many common variants together also define a small risk group who are at high risk of obesity.
Okay, so this was what I wanted to say in general about genome-wide association studies. These studies are typically conducted in hundreds of thousands of individuals and discover very tiny effects, and because of the way they're conducted, the data type I'm going to talk about is this metadata. It comes from these studies: each individual study provides us, for each polymorphism, with an effect size estimate, the sample size, the allele frequency, a standard error of the effect size, the p-value and the quality of the data. You can have something which you inferred, or which you measured but with poor quality, and we can quantify this. They can also provide us with the phenotypic variance in each genotype group. So these typical statistics are available for a large number of cohorts, and they have not been used for anything else apart from looking at the p-value and saying: okay, hey, this is a significant p-value, it's a new discovery, let's publish it. All the rest has been largely ignored. So what I will talk to you about is how you can recycle this data, how you can look at these kinds of values in dozens of cohorts and what we can learn from them. I will structure my talk around five different things you can do with these statistics.
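The per-SNP statistics listed above can be pictured as one record per polymorphism per cohort. The sketch below is illustrative only: the class and field names are hypothetical and do not correspond to any consortium's file format. It also shows how the p-value follows from the effect size and its standard error under a two-sided normal (Wald) test, which is an assumption about how a cohort would compute it.

```python
import math
from dataclasses import dataclass

@dataclass
class SnpSummary:
    """Summary statistics one cohort might report for one polymorphism."""
    snp_id: str
    effect_size: float          # per-allele effect (beta)
    standard_error: float       # standard error of beta
    sample_size: int
    allele_frequency: float
    imputation_quality: float   # 1.0 for a well-measured genotype

    def z_score(self):
        return self.effect_size / self.standard_error

    def p_value(self):
        """Two-sided p-value assuming beta / se is standard normal under the null."""
        return math.erfc(abs(self.z_score()) / math.sqrt(2.0))

# Invented values for illustration.
example = SnpSummary("rs_hypothetical", 0.30, 0.05, 5500, 0.40, 0.98)
print(example.z_score(), example.p_value())
```

The point of the talk is that everything in such a record except the p-value tends to go unused once the association has been published.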
The first one: if you have two SNPs that are associated with a trait, for example obesity, what you can do on top of this, using external data, is calculate the cumulative effect of these two SNPs, basically a multivariate model instead of a univariate model, without going back to the cohorts and asking them to redo the analysis; you can do it centrally from already available univariate data. This is to detect allelic heterogeneity, meaning multiple independent or semi-independent SNPs in the same genetic region or in the same gene influencing the same phenotype. So this is topic number one. Number two is imputation. You have the same two SNPs associated with obesity, but you ask a question about another SNP which has not been part of any of these studies: what is its association with obesity, knowing that there are many other SNPs for which you know the association, but not for this one? This is done through imputation. The third topic is gene-gene interactions. By knowing the associations of these two SNPs, can we tell something about their interaction? Do they interact with each other in influencing the phenotype? The fourth topic is whether we can learn something about parent-of-origin effects. We know that if you carry, for example, an A allele, it may increase your obesity, but the question is whether, if you carry an A allele which was inherited from your father, like in this case, your BMI is different from the scenario where the allele was inherited from your mother. These two individuals have the exact same genotype, but they inherited their alleles from different parents: one is maternal, the other is paternal. Does it make a difference for the outcome measure, for example for obesity? And we can answer this question without knowing anything about the parents; that's the beauty of it.
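For the first topic above, the step from univariate to multivariate effects can be illustrated with a textbook identity: when genotypes and phenotype are standardized, the joint effects equal the inverse of the SNP correlation matrix applied to the marginal effects. The sketch below works this out for two SNPs. It is not the speaker's estimator, it ignores sampling noise in effects and correlations, and the function names are hypothetical.

```python
def joint_effects_two_snps(marginal_1, marginal_2, correlation):
    """Joint (multivariate) effects of two SNPs from their marginal effects.

    Assumes standardized genotypes and phenotype, so that
    marginal = R @ joint with R = [[1, r], [r, 1]]; this inverts R by hand.
    """
    if abs(correlation) >= 1.0:
        raise ValueError("perfectly correlated SNPs cannot be separated")
    determinant = 1.0 - correlation ** 2
    joint_1 = (marginal_1 - correlation * marginal_2) / determinant
    joint_2 = (marginal_2 - correlation * marginal_1) / determinant
    return joint_1, joint_2

def variance_explained(marginal_1, marginal_2, correlation):
    """Fraction of phenotypic variance explained jointly by the two SNPs."""
    joint_1, joint_2 = joint_effects_two_snps(marginal_1, marginal_2, correlation)
    return marginal_1 * joint_1 + marginal_2 * joint_2

# Invented numbers: two SNPs with correlation 0.5 and marginal effects
# 0.25 and 0.20 (in standard deviation units).
print(joint_effects_two_snps(0.25, 0.20, 0.5))
print(variance_explained(0.25, 0.20, 0.5))
```

In this toy case the joint effects are smaller than the marginal ones because each marginal estimate partly picks up the signal of the correlated neighbour; the correlation itself can come from any reference panel, which is why no individual-level cohort data are needed.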
And the final topic is inferring other things than just simple associations. If we have a list of SNPs that are associated with obesity, we can ask: do people tend to find mates who are more similar to them with respect to this set of SNPs? Meaning, do you find your mating partner based on the genotype of this person? I will tell you more about it when we get there; this is quite recent work and it's still in progress. So these are the five topics, and in the next half an hour or so I will cover these five, or as many as I can.
Allelic heterogeneity I first came across with a collaborator, Mehdi Tafti, who is working in this building, when we were looking at genetic markers that predispose you to narcolepsy. Narcolepsy is a sleep disorder: practically, you can fall asleep any time of the day, anywhere, without being tired or having any other problem. The major discovery, which was made several decades ago, is in the HLA region: the DQB1*0602 haplotype. You don't need to know what it is, but there's a haplotype which predisposes you to narcolepsy. About a quarter, so one quarter of all of us in this room, carry this haplotype and still don't have narcolepsy. So the question was what there is in our genome, or somewhere else, which protects us from narcolepsy. When we ran the search we scanned the whole genome and we didn't find anything. So then we decided: okay, let's look again at the HLA locus itself, because that's what everybody ignored, since, yeah, we know the HLA, there is DQB1, you don't need to look there.
So we looked at the same region and we found that there are several other alleles, most importantly the 0603 allele. The last highlighted line in this table is a protective one: it decreases your chance of being narcoleptic roughly five-fold if you carry this HLA DQB1 allele on top of your 0602 allele. So practically you have a risk allele, but you carry an additional allele on your other chromosome, and this other allele is actually saving you from developing narcolepsy. So at the same locus there was a second signal, which was protective, on top of the first risk signal.
The second experience that I had with allelic heterogeneity was with carbohydrate-deficient transferrin. This is a complicated term; what it's used for is as an alcohol marker. Practically, if they measure this value for you, they can tell whether you have a long-term history of alcohol abuse, so whether you are an alcoholic or not. You can associate this level with the different genotypes in the genome, and our most significant association in the whole genome fell into the SRPRB gene, which didn't tell us anything, but it was lying right next to a gene called transferrin, which makes a lot of sense because in the end CDT is some sort of complicated transferrin. But we didn't really understand why the signal was not in this gene but somewhere else nearby. So what we did is a stepwise selection, where we tried to find the best predictors in this region that would best describe the association of this variant, or actually the association of this region, with the levels of CDT. What we found is that it directly gave us a model of three SNPs. During the stepwise selection the original SNP was dropped and three new SNPs came up, as these three best explained the association with CDT levels, meaning that the original SNP was just tagged by a linear combination of these three, which are highlighted in green, and these are all non-synonymous
changes in the transferrin gene, which makes us very confident that this has a real meaning. These three actually explain more than 50% more of the variance of CDT than the top hit which was found earlier. There are many examples in the literature, some of which I was luckily involved in, where there really are different independent contributions of variants lying near each other, or even in the same gene, that contribute to the trait. That's why we came up with a method which can use, genome-wide, this metadata. Here this is the effect size, the univariate effect sizes reported by each study, plus it uses a correlation matrix, which is just a SNP-by-SNP correlation matrix in the region. You can easily get it from anywhere; there are many public data sets available from which you can estimate correlations between SNPs. The formula is based on just these two quantities. The m and n are very simple: m is the number of markers in the region and n is the sample size from which you gathered your data, so typically this is large. This formula gives an estimator of how much variance is explained by the locus, and it also gives you an estimate of what the multi-SNP model looks like, or what the causal SNP can be. When we applied it to real data on height we found many interesting loci. This was probably one of the most exciting ones, because it has now been rediscovered by the recent Nature Genetics paper on height as the most allelically heterogeneous locus. In this gene, if you only look at the top hit and its LD partners, you only see some intronic associations, but if you look at the multivariate analysis, it discovers many non-synonymous and synonymous changes which are associated with height, and there are novel sequencing studies that have shown even more of the non-coding part of the gene being very important.
So let's now switch to summary statistic imputation. So, as I mentioned to you, this
is the nightmare of a GWAS analyst: there is a causal variant, which is marked in red on this plot, but you don't measure many, many of the variants in the region, you just measure these blue ones, and you would somehow like to infer what the association of the other variants would be if you had data on them. Obviously you don't, but what you have is external data on sequenced individuals. So you use a large population panel, like the 1000 Genomes panel, from which you can infer correlations between SNPs without any phenotypes. This is a large sample and it is sequenced, but you don't have any phenotypes for it. What you also have is an extremely large amount of cohort data where you have genotypes, but only measured for a small proportion of SNPs. Typically in GWAS we measure half a million to a million SNPs, while sequencing studies obviously discover tens of millions of variants. We would like to know what would happen if all our studies had been sequenced; of course it costs a lot of money, so we can't do it. So for this we run an imputation, where we infer for our cohort what kind of haplotype mosaics would give rise to our actual individuals, and this way you can fill in the missing information. Okay, so this is the trick. You can apply it to data, and what you see is that you get many, many rare variants, and many of these rare variants are actually terribly imputed. You impute them, so you somehow guess them in your cohort, but with terrible accuracy. There are about 23 million of these very rare polymorphisms that you can impute in your cohort, but only about 6% of them are actually useful for you; all the rest is just noise. In general, on the right-hand side is the plot of the distribution of the imputation quality, and here you see that the ones above 0.7 represent a quarter of the markers that we imputed and all the rest is quite useless; all the rest, the 75% of the data that we imputed,
are useless. Still, it gives many more millions of markers that are very important and that we can impute very accurately. If you're lazy you can do something simpler. What I described to you before was that you have your data, and several SNPs, like the third SNP here, are not measured, which is why there are question marks. Then you use reference sequence data, from which you can impute them in a probabilistic fashion, and then you run your association study and you have new associations for a much denser set of markers, almost as if you had sequence data. Well, if you're lazy you don't need to care about imputing the genotype data itself; you directly start with association summary statistics. So you have here the effect size of each of the measured SNPs in your cohort, and you use the reference database to impute the association summary statistics directly, without caring about imputing the genotypes. Why do we do this? Because it's much, much faster, much easier and doesn't take much memory. So this is typically the lazy people's solution, and we are quite lazy. What we came up with is a fairly simple formula. If this red variant in the middle is the one you want to impute, and you know the association results for these crosses, then simply by knowing the correlations among these markers and the association summary statistics for the crosses, you can calculate what the association statistic would be for the unmeasured red variant. So it's all great and we can apply it to data. You can even make it more complicated, in the sense that for each cohort you can use a different reference population to impute the summary statistics and then meta-analyze them together. But the bottom line is that typically what we find for human height is that there are several dozen loci where, if you look at the
less dense set, that is HapMap (the red dots here are the original genotypes), you clearly see an association, but if you now look at the much denser 1000 Genomes set, basically as if you had sequenced them, we can impute it and we see new variants appearing with much stronger association effect sizes. These are probably the ones that have been tagged by some of the red ones that are the top hits, but now we discover what is actually the causal variant. We managed to find which individual variants, often much rarer ones, are driving the association signal, which are the causal ones and which are just consequences.
OK, so now gene-gene interaction. This is a topic where either most people don't care about it and don't believe in it, or people become obsessed with it and try at all costs to find gene-gene interactions. It is an extremely difficult topic, but if you find some, usually you publish it in Nature. So either you get nothing or you publish it in Nature; it's really a thin line between the two. That has been done, for example, for gene expression: people found a few associations and published them late last year. In the basic setup we have our outcome trait, which is Y, and we model it with the genotype, G in this case, and alpha is the effect size, how much this SNP has an effect on the outcome trait. In an interaction study you have two SNPs, and you also have the interaction of the two SNPs, which is H: simply, if you multiply these genotypes element by element, you get this H. But I guess you don't care about formulas that much. You can visualize it very simply if you look at two SNPs; one is called rs9747-something and the other is rs155-and-so-on, and this is still the FTO seen on the first slide. You now split your population into three subgroups: those who are AA for this first SNP, those who are heterozygous for it, and
the last ones, who are homozygous for the other allele of this rs9747 SNP. In these three subgroups you now run an association study with the second SNP. In the first plot on the left side you see a negative trend: the more A alleles you carry for the second SNP, the lower your BMI, but only in the AA group for the first SNP. In the middle panel you see no association; it doesn't matter how many A alleles you carry, the BMI is flat, all the same. And in the final plot, the more A alleles you carry, the higher your BMI is. Of course I just made up this data; it doesn't exist in reality because it's too beautiful to be true. But the point is that this is a typical interaction between the first and the second SNP, meaning that if you stratify your samples according to the genotypes of the first SNP, then the slopes which show the association of the second SNP with your outcome trait, which is BMI in this case, will be different. This is the gene-gene interaction, or SNP-SNP interaction: the first SNP is interacting with the second SNP. Most of the cohorts won't provide this kind of data; they would have to provide a million times a million effect sizes and so on, and it's just too much to share, to upload and to download, so it's not feasible. But what cohorts do provide can be looked at like this: the left panel is the extreme scenario of an association in a subgroup with 0% G allele frequency for the first SNP, and the right-hand side panel is an association in a subgroup where the G allele frequency is 100%. Now what if we have a population where the G allele frequency is 10% and another population where the G allele frequency is 30%? In the population with the lower G allele frequency you expect the slope to be more negative than in the cohort where it is higher. So what we will do, and I directly jump to this slide, is look at the allele frequency of each cohort, so
here each dot is a cohort and the size of the circle is proportional to the size of the cohort. If there is an interaction between the first SNP and the second SNP, then what we want to see is a slope when you regress the beta, so basically the effect size of the second SNP, onto the allele frequency of the first SNP. So that is the trick here: in this case, the higher the allele frequency of the first SNP, the lower the effect size of the second SNP. If you find such associations, where the allele frequency of one SNP is very well correlated with the effect size of the other SNP, that actually means an interaction between the two SNPs. We can use that trick because each of the dots here is a study contributing to our GIANT consortium, so if you have enough studies you can gather quite strong evidence for such interactions, and you don't need to go back to the cohorts; that is the beauty of it. With the formula you can estimate the interaction effect by just knowing the simple marginal associations, which are the betas and the standard errors, and knowing the allele frequency of the other SNP. So just knowing allele frequencies genome-wide and effect sizes genome-wide can help you to estimate the interaction between two variants. On the downside, obviously you need a SNP whose allele frequency varies; in this case, for this SNP, the allele frequency varies a lot, from 23, even 20%, up to 47%. You must have broad variability in the allele frequency, otherwise you wouldn't be able to see such a correlation between allele frequency and the effect size of another SNP. So this test can only be done for SNPs which have a variable allele frequency across our cohorts; typically these are the SNPs that have a high FST value. FST measures this: if the frequency of the SNP is very different across different European populations, then this SNP is a good candidate for testing as an interaction. So in our case we
had 300,000 samples, but the actual effective sample size is 300,000 times this FST value, which is typically between 0.1% and 2-3%. So although you have 300,000 samples, it's at best roughly equivalent to having just 10,000 samples with actual genetic data, where you run the test in your actual cohort. That's why it's often not very appealing, except for a few hundred SNPs where you actually have more power than a very large study. So there's a downside to it, and so far we are trying to replicate the associations that came out, but we are not sure whether anything will survive.
Parent-of-origin effects, very briefly. This is about which parent you inherited your alleles from. We are typically looking at heterozygous individuals; the heterozygous individuals fall into two groups, one that got the A allele from the mother and the other that got the A allele from the father, and these two groups may have distinct phenotypes. The reason for this is that some regions, for example what you inherited from your mother, may be methylated, meaning that there might be a methyl group attached to your DNA, and this methyl group attracts many different molecules which eventually shut down the expression of the gene. So genomic imprinting is achieved by this monoallelic regulation of gene expression, shutting down the expression of one copy of the gene. That's why it makes a difference whether you have a methyl group or not: if you don't have the methyl group, the expression will not be shut down; otherwise you have a high chance that the region will be methylated, and whatever variants you have in this region will have no effect. So that's why it matters whether you got the allele from the mother or from the father, and that's why we looked at what we can say about this in our cohorts.
I now skip one slide, because we can directly go and look at the distributions of an outcome trait. Imagine this is BMI and we now have four genotype groups: those who carry two A alleles, those who are AB heterozygous and got the B allele paternally, those who got the B allele maternally, and the fourth group is the usual homozygous BB. So we would have four groups instead of three if we knew where the alleles were coming from. The distributions of these four groups are depicted here, and you see that the green curve and the blue curve have vastly different mean values, meaning that probably there is some parent-of-origin effect going on, because those who inherited the B allele paternally have a much higher phenotype than those who inherited the B allele maternally. The way to detect such an effect in our cohort, where we don't know who is paternal-AB heterozygous, is to note that the distribution of the heterozygous group must still be wider: the spread of the phenotype distribution is increased compared to the homozygous groups. The reason for this is that the heterozygous group is made up of two heterogeneous subgroups with distinct mean values, and that's why the composite distribution has a wider variability, and that's how we're going to pick it up. So we don't need the parental information; it's enough if the heterozygous group has a larger phenotypic variance, and that's what we're practically testing. We were actually lucky enough to find two examples. One of them was our top hit in our genome-wide scan, and we found one in the KCNK9 gene on chromosome 8; if you look only at methylated regions, KCNK9 is the major discovery, where we see opposite effects of the maternal and the paternal copy when you inherit the same allele. The effect size is fairly large, so when we replicated it in family studies, what we saw is that the effect is roughly comparable to the effect of FTO
which was the largest genetic effect, but the effects go in opposite directions depending on which parent you got the allele from. We then looked at gene expression, and gene expression shows a similar pattern of opposite maternal and paternal effects. So we discovered two new SNPs, basically, that contribute to BMI depending on which parent you got the allele from.
I showed that five different research questions can be asked of such large cohort information, where you have all the summary statistics and no actual individual data, but for large numbers of cohorts. I hope I managed to convince you that it's actually worth recycling large available genetic data and genome-wide association data to answer questions very different from the ones they were originally generated for. I would like to thank the GIANT consortium and also CoLaus. The summary statistics imputation was done in collaboration with colleagues from UCLA, and the parent-of-origin effect work with Imperial College. Here in Lausanne we are now doing the same with my postdoc on parent-of-origin effects in gene expression. The allelic heterogeneity work is a somewhat older study, which was also done in collaboration with GSK and most of the team in Lausanne. Thanks very much for your attention.