Very briefly, I'm leading a statistical genetics group in Lausanne. We are a group of people with mixed biology and biostatistics backgrounds, usually with strong programming skills, and what we are doing is trying to elucidate the genetic basis of complex human diseases. You will see that both presentations are very much centered on this goal. We are not typical users of Bayesian methods, so you will see that these are maybe not the classical applications you usually see, but you will see that the Bayesian framework is very flexible and can be twisted in many ways to aid genetic discoveries and also causal inference techniques. The first presentation, which I will give in the first hour, will be about how we can improve the discovery of genetic associations, of genetic markers associated with complex traits. I will exemplify it through the genetic basis of human longevity, basically trying to predict how long you're going to live from your DNA sequence. As you can imagine, this is very challenging, and since lifespan has a relatively minor genetic basis, it's also quite a difficult task. That's why we need help, and we need to inform these genetic association studies by using other association studies available at hand. In the second presentation, I will tell you more about how genetic associations can then be leveraged to estimate the causal effects of risk factors on outcomes. Here again, we will look at complex human traits, mostly heritable diseases, and how we can establish the contribution of non-genetic risk factors to these diseases. And of course, feel free to interrupt me anytime. Best is if longer questions come at the end, but if there is something unclear, please really stop me, don't let it boil in you. So the first part of the presentation will be telling you what a standard genome-wide association study is.
So GWAS, genome-wide association study, the four-letter abbreviation. I will then show you how we can improve these classical association studies by building priors for the association strength of each genetic variant on a particular outcome, and how these priors can eventually help the Bayesian association scan. Throughout, I will use lifespan as the example, to show how we can identify new lifespan-associated loci thanks to a smarter Bayesian approach, borrowing strength from other association studies of traits related to lifespan. Then I will conclude with some follow-up experiments. So genome-wide association studies essentially model an outcome trait. Imagine here, for example (I always like to show data), the data includes body mass index, a general measure of obesity. The normal range is between 20 and 25; between 25 and 30 people are classified as overweight, and above 30 as obese. But we keep the continuous outcome, and we model it as a function of various environmental factors. We call them covariates because they are not our central interest, at least in genetic association studies. You can see some examples on the right-hand side, I hope you can see my mouse: factors such as age, sex, diet, physical activity, and a bunch of others all impact our body mass index. These are sort of nuisance parameters, but the more of them we can account for, the more we can reduce the variance of the error term, which is very useful for improving statistical power. The second component of this model is the genotype data, and here I just give one example of one part of the genotype. These are genetic markers in the genome, single nucleotide polymorphisms, so single-variant changes. Does anybody know what a SNP is? Or maybe raise your hand if you don't, because then I would explain it a little bit more. Okay.
All right, okay. So a SNP is a single nucleotide variant. You can imagine the long stretch of DNA sequence of a person, and in this long stretch there are four different nucleotides, A, C, G or T, that constitute the DNA sequence. 99.8% of it is identical between any pair of human beings, but the remaining part, which is variable across individuals, is the very interesting part, which might be associated with complex human traits. For example, at a given genomic location a person might inherit an A allele from the father and a T allele from the mother; together these constitute a genotype. And we can count, for example, how many T alleles a person inherited in total: it can be zero, one or two. A person can inherit zero T alleles, one T allele (coming from only one parent), or two T alleles (one from each parent). That's how these genotypes can be coded. At each genomic location, and you can imagine there are at minimum about 10 million such variable locations in the genome that can be examined, the typical genome-wide association scan goes variant by variant. So you can imagine this model being fitted about 10 million times, and each time you're really interested in the effect of that genetic marker on the outcome trait. That effect size is what GWAS is after. The rest of the model can, of course, be more sophisticated: family designs or other structures based, for example, on geographic location, which are kinds of structured noise, plus the additional classical noise term, which is independent between individuals. So this is the model we are fitting and testing, and we are really focusing on the estimate of the genetic effect. For example, here is one very clear example where the variant is close to the gene called FTO, and here you can see the three genotype groups that exist in the population.
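The variant-by-variant model just described can be sketched in a few lines. This is a minimal illustration in Python with NumPy on simulated data (the effect size, covariates, and sample size are made up, not the actual study values): regress the trait on an additively coded genotype plus covariates, and read off the genotype's effect estimate and standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated covariates (age, sex) and a genotype coded 0/1/2
age = rng.uniform(20, 70, n)
sex = rng.integers(0, 2, n).astype(float)
geno = rng.binomial(2, 0.4, n).astype(float)  # additive allele count

# Simulated BMI: a small per-allele effect plus covariates and noise
bmi = 22 + 0.05 * age + 0.7 * geno + rng.normal(0, 3, n)

# One GWAS regression: intercept + covariates + genotype
X = np.column_stack([np.ones(n), age, sex, geno])
coef, *_ = np.linalg.lstsq(X, bmi, rcond=None)
beta_hat = coef[-1]                  # estimated per-allele effect

# Standard error of the genotype effect from the usual OLS formula
resid = bmi - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
print(beta_hat, se)
```

In a real scan this regression is repeated for every variant in the genome, keeping only the genotype coefficient and its standard error each time.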
This is data from a small study in Switzerland, in Lausanne, a cohort called CoLaus. Here we classify people according to which alleles they inherited from their parents at this particular location. Every SNP, every single nucleotide marker, has an identifier, rs followed by a long number; that's not very interesting. The point is that if we split people according to how many alleles they inherited from their parents (the maximum is two, one from the mother and one from the father), we can see that the BMI values of those groups are slightly different. If we fit the regression line, we can see that each additional allele increases BMI by about 0.7 units. And 0.7 units is quite a lot: it means that people carrying just one extra allele are on average more than one and a half kilos heavier. These people in general have comparable environments, and one key difference between them (of course not the only one) is really how many alleles they carry. In this linear regression you can estimate not only the slope but also the standard error, and then you can come up with a P value. This is very frequentist: no prior knowledge, no prior assumptions, totally uniform across the whole genome; every SNP has the same chance to be picked up. And then we get a P value which is not even nominally significant. This is quite discouraging, because this is a sample of almost 6,000 people, this variant has the largest common-variant effect on obesity, and we still can't pick it up. Now, if you correct for a couple of covariates, as I mentioned, like age, sex, physical activity and diet, then this can actually become significant, roughly around 10 to the minus four. If you do it genome-wide in a much larger sample, this is the result of such a scan. This result is often visualized by Manhattan plots, and in a Manhattan plot every dot is a SNP, a marker in the genome, and its x-axis location...
It shows where they are: on which chromosome they sit, on which arm. For example, here you see the p arm and the q arm of chromosome nine, and the loci are labeled by the nearest gene. You can see there are real association towers, single-nucleotide variants in a given locus that are associated. The strongest one, as I just mentioned, is near the gene FTO. It's probably not the gene FTO itself which is implicated, but some other gene nearby. The point is that on the y-axis what we show is the association strength, the minus log10 P value of the association between the variant and obesity, so you can see how strongly different genomic locations are associated with obesity. These peaks look very strong, but actually each variant explains only a very, very small fraction of the variability in BMI. For example, the FTO variant itself explains only a third of a percent of BMI variation between individuals, so a single variant alone is really limited in terms of predictive power. Since we do roughly a million independent tests, we need to correct for that, and typically a Bonferroni correction ensures that we are controlling the family-wise error rate: the probability of making even one mistake if we select and report these SNPs is below 5%. It's a very stringent criterion. With this we can very stringently identify loci that are associated, for example, with obesity in this case, but you need enormous sample sizes to pass this threshold, because the effect sizes are very, very small.
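The multiple-testing correction above is simple enough to sketch: the usual genome-wide significance cutoff of 5×10⁻⁸ is just 0.05 divided by roughly a million independent tests, and a two-sided P value comes from the normal approximation to the effect estimate over its standard error. The numbers plugged in below are illustrative, not the actual study values.

```python
import math

n_tests = 1_000_000            # ~1M independent tests genome-wide
alpha_fw = 0.05                # target family-wise error rate
threshold = alpha_fw / n_tests # Bonferroni: 5e-8, the usual GWAS cutoff

def gwas_pvalue(beta_hat, se):
    """Two-sided P value from an effect estimate and its SE."""
    z = beta_hat / se
    # erfc(|z|/sqrt(2)) equals 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

# A strong effect measured precisely passes the threshold...
print(gwas_pvalue(0.7, 0.12) < threshold)
# ...while the same effect with a large SE does not even reach 0.05
print(gwas_pvalue(0.1, 0.12) < 0.05)
```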
So that's the problem with genome-wide association studies: huge samples are needed because the traits are very, very polygenic, meaning that it's not only one genetic variant associated with the trait, but really hundreds or rather thousands of different genetic markers across the genome, each of which can modulate the trait by a tiny fraction. For example, here the FTO variant changes weight by just a kilo or a kilo and a half. Variants impacting height, even the largest ones, have an effect of just a few millimeters, but we have thousands of them: current studies show about 12,000 independent variants contributing to human stature, human height. So the problem is, how can we get more associations than this? Obviously, if we lower the P-value threshold we can get more associations, but then we will get many more false positives, so we don't necessarily want to do that. The other option is to increase the sample size, which will obviously yield more associations eventually, if there are more to find. But we were thinking about something else: going Bayesian. Before jumping to that, I will first tell you how these genetic association studies can still be useful, even when done in this naive fashion. The genetic linear model models the outcome trait, imagine again body mass index, as a function of the genetic matrix. So basically it's a big matrix, imagine it like an Excel table. Actually, sorry to interrupt with this, can everybody put up their hands if you don't know what a matrix multiplication is? If you multiply a matrix and a vector, for example, do you know how to do this? Basically just to gauge properly, not to lose people. Okay, nobody put up their hands. Okay, that's good; either you're shy or you're really well versed, very good. Okay, so here basically we have a big matrix. So this is genetic data.
Each row of this matrix represents an individual and each column is a genetic marker. What we very quickly realize is that we have millions of genetic markers but we don't have millions of samples in a single cohort: the Lausanne cohort has only 6,000 people, and even a larger cohort like UK Biobank has about half a million. So we typically have many, many more columns than rows, and we can't just do a simple regression estimating each variant's effect, because we have too many variants. Instead, we use a random effect model. We don't really care about each individual SNP's effect size on the trait; what we care about is their cumulative effect on the trait. In this model we put a prior on these effects. A priori, we don't know whether an allele would increase or decrease a trait; also, we could swap alleles at a location and count the T allele rather than the A allele, and then the effect would just be multiplied by minus one. Because of that, on average these effects are expected to be zero. What we are really interested in is the variance, because the variance is proportional to the sum of the squared effects, which is exactly the variance explained by this model. So we assume that these effects are independent of each other across the genome and have a particular variance, and instead of trying to estimate every single SNP's effect, we just estimate the variance of this effect distribution. It's what's called an empirical Bayes approach: we don't fix a prior distribution with all its parameters; in this case we fix the distribution's shape, but we use the model to estimate the prior parameter that maximizes the data likelihood. So the way... yep.
I have a question. When you assume that the SNPs are independent, you cannot use SNPs that are in linkage then? So you should take SNPs that are distant from each other. Exactly. So what we do here is that you can throw any SNP into this model; the assumption only means that the effects are independent of each other. Just because two SNPs are in linkage disequilibrium, they can each still contribute to an outcome trait, and those effects can be independent of each other, as if we were fitting a multivariable regression: even if two variables are correlated, they will each have their own estimated effect. Their estimation errors will be correlated, but the effect sizes need not be. But it's a very good point: these models can be extended, and indeed we can do this in a stratified fashion. For example, SNPs that have many LD partners, that is, SNPs correlated with many other SNPs, are allowed one sigma, and those with fewer LD partners are allowed a different, typically larger, sigma. If a SNP has many LD partners, then its signal is probably distributed across the different correlated SNPs, and because of that each individual SNP is expected to have a smaller effect than SNPs that are sort of lonely; rare SNPs especially typically have larger effects. So we can do it in a stratified fashion, splitting the genome into parts with fewer or more LD partners and estimating these effects per stratum. People have done this, and we see that indeed the lonelier variants have larger effects. Variants in coding regions are also expected to have larger effects, and so on.
So you can stratify this analysis, if you have large enough samples, to have a prior which differs across classes of SNPs, of genetic markers. Here, for simplicity, I assume very naively that every SNP has the same expected effect variance, and then the model becomes very, very simple. If we want the variance of the outcome, it follows, without going into the details, from the variance-covariance structure of the true effects, where this is the genetic data matrix (just to remind you, individuals times SNPs, individuals times genetic variants), plus of course the error term itself, which has a diagonal covariance structure. Since we assume independence between the effects (that's very important), their covariance is simply sigma-squared-G times the identity, so it can be pulled out of the equation. Then we divide by M and multiply by M, and what's very interesting is that G times G-transpose divided by M is the kinship matrix. Basically, it's an individuals-times-individuals matrix, and the entry in row five, column ten tells us the genetic similarity between individuals five and ten. If this value is higher, it means these people have a higher relatedness: for siblings this value would be about half, for identical twins one, for second-degree relatives a quarter, and so on. So this is the so-called kinship matrix, and we can substitute it in. What's very handy is that the number of markers times the expected per-marker variance is actually the total heritability.
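The kinship construction just described can be sketched on simulated genotypes (toy dimensions, made-up allele frequencies): standardize each genotype column, form K = G Gᵀ / M, and note that the implied phenotypic covariance is then h²K + (1 − h²)I, as on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 500   # individuals x SNPs (toy sizes)

# Genotypes 0/1/2; columns are standardized before computing kinship
freqs = rng.uniform(0.05, 0.5, m)
G = rng.binomial(2, freqs, size=(n, m)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)

# Kinship (genetic relatedness) matrix: K = G G^T / M
K = G @ G.T / m

# Under the random-effect model, Var(y) is proportional to
# h2 * K + (1 - h2) * I for a trait with heritability h2
h2 = 0.3
V = h2 * K + (1 - h2) * np.eye(n)
print(np.diag(K).mean())   # diagonal averages 1 for standardized columns
```

With unrelated individuals the off-diagonal entries hover near zero; related pairs (siblings, twins) would show up as the larger entries the talk mentions.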
So it's very convenient for us, because it means that if I just calculate the phenotypic similarity between each pair of individuals, it is proportional to the kinship times the heritability, plus one minus the heritability times an identity matrix, which makes it very convenient to fit the likelihood model. We then basically have a single unknown parameter in this case; of course, if we stratify the genome, we have one parameter per group of SNPs, which tells us the contribution of each group of SNPs to the phenotypic trait variance. We can also use Haseman-Elston regression, which is faster but has a slightly higher standard error. Basically, what it does is take the phenotypic similarities, or rather the squared differences between phenotypes, and regress them onto the kinship matrix; the slope of that regression is proportional to the heritability. So this is all very useful. We can even use relatively small sample sizes to estimate heritability: in this sample, BMI heritability is estimated at 0.2, so about 20% of BMI variance is explained by these genetic markers. If you look at another obesity-related trait, waist-to-hip ratio adjusted for BMI, that's about 10% heritability. Since then people have been doing this in larger samples, and in a pretty large sample of about half a million people we get about 30% heritability for BMI and about 55% for height. This is very handy because it tells us what kind of predictive accuracy we could eventually achieve with such data if we just boost the sample size enough.
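Here is a toy version of Haseman-Elston regression as described. I use the cross-product form (regressing yᵢyⱼ on Kᵢⱼ over all distinct pairs), which estimates h² directly for a standardized trait; the squared-difference form mentioned in the talk gives an equivalent slope up to sign. All data are simulated with a known heritability, so this is a sketch, not the actual analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, h2 = 500, 1000, 0.5

# Standardized genotypes and a polygenic phenotype: y = G @ beta + e
freqs = rng.uniform(0.05, 0.5, m)
G = rng.binomial(2, freqs, size=(n, m)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)
beta = rng.normal(0, np.sqrt(h2 / m), m)       # tiny per-SNP effects
y = G @ beta + rng.normal(0, np.sqrt(1 - h2), n)
y = (y - y.mean()) / y.std()

K = G @ G.T / m                                # kinship matrix

# Haseman-Elston: regress phenotype cross-products y_i * y_j on
# kinship K_ij over all distinct pairs; the slope estimates h2
iu = np.triu_indices(n, k=1)
prod, kin = np.outer(y, y)[iu], K[iu]
kin_c = kin - kin.mean()
h2_est = kin_c @ (prod - prod.mean()) / (kin_c @ kin_c)
print(h2_est)
```

With these toy sizes the estimate is noisy, which illustrates the point in the talk that Haseman-Elston trades some precision for speed compared to the full likelihood model.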
So this works very well when we have large sample sizes, and one trait where we typically don't have large sample sizes is human lifespan, because for that you would need a cohort where people are all dead and genotyped, so that we could associate age at death with the genetic markers they carried; this association could reveal whether particular genetic markers predispose to dying earlier. Since we obviously don't have such a cohort, one option is to use parental death as a proxy. Actually, what we really do is use the participant's genotype as an approximation for the parental genotype, and then essentially pretend to run a GWAS, an association scan, in the parents. In the UK Biobank and many, many other cohorts, participants are asked whether their parents are alive and, if they are dead, at what age they died. With this we can boost the sample size quite a bit, but it is still not enough. So we realized it's much more interesting to build some priors. Just some background on the genetics of life expectancy: from twin studies it was expected to be about 20-30% heritable, but many studies since suggest this was largely overestimated; it's much more realistic now to assume somewhere around 5% to 10%. There is also very strong assortative mating: people tend to couple with others who are going to live to a similar age. There are two reasons: one is that couples share a lifestyle, and the other is that people tend, from the start, to pick partners who are genetically predisposed to live to a similar age. It's very interesting because, of course, we don't know the genetics of a partner when we pick them, but we end up living to a much more similar age as our partners compared to random pairs in the population.
Okay, this was just in parenthesis. Earlier genetic studies focused mostly on extremes: looking at people who live extremely long, over 85 or 90, comparing them to general population individuals, and they identified the APOE locus to be associated with longevity, plus FOXO3 and EBF1. But these findings have rarely been replicated, apart from APOE, which keeps appearing recurrently. It's the same variant, very close to one which predisposes to Alzheimer's disease and to high LDL levels. The variant is very pleiotropic; pleiotropic means that it has an impact on multiple human traits. So the same variant, the very same allele, which is increasing our LDL levels, the bad lipids, and predisposing us to develop Alzheimer's disease, is also making us live shorter. This made us think that maybe longevity is more or less just an accumulation of bad alleles that predispose us to different diseases. In newer studies, when the first release of UK Biobank appeared, there were about 116,000 white British samples on which a scan was run; FOXO3 and EBF1 did not replicate, but APOE did, plus a new locus, CHRNA3/5, which I will talk more about later. So we decided that maybe this is now a large enough sample that we could do something smarter than just running a simple GWAS scan. The basic idea is to build smart priors, priors that take into account other studies from which we can borrow strength. So how could we estimate the effect of a SNP on lifespan? We assume that these SNPs affect lifespan simply by acting through different risk factors: for lifespan, you can imagine that obesity, diabetes, dyslipidemia, and different cardiovascular diseases all contribute to a shorter lifespan. So suppose we know the effect of a SNP on, for example, body mass index.
And if we know the effect of body mass index on lifespan, then of course the expected effect of the SNP on lifespan would be beta-one times alpha-one, the product of the two effects. Now there can be additional risk factors, and then it would be beta-one times alpha-one plus beta-two times alpha-two, because the SNP might be pleiotropic and exert its effect on lifespan not only through a single trait but through multiple traits. In summary, we assume that the effect of the SNP on lifespan a priori should be the sum of the products of these effects, summed over all the different risk factors. That's the key idea: if we know these effects, and these come from typically very, very large studies on different risk factors, then we can use them to inform us about what the effect of a SNP on lifespan might be, assuming the SNP has no additional effect on lifespan otherwise. At least this way we can pick up SNPs that impact lifespan through detectable and well-defined risk factors. I will talk more in the second presentation about how to estimate these causal effects; for the moment, just imagine that they are known. From epidemiological studies these have been estimated over several decades with varying accuracy and success, but I will tell you later how we can actually estimate the effect of these risk factors on human lifespan. For example, we see that education is something which extends lifespan: every year you spend in education extends your lifespan by about a year. So the time you supposedly waste at university, you actually get back, which is good news. Although it doesn't extend beyond bachelor level, so doing a PhD certainly doesn't extend your lifespan; maybe it shortens it, actually.
And the biggest killer is of course smoking, which has the biggest impact, and the best thing you can do is quit smoking: if you ever smoked and you then quit, the two effects cancel each other out. The intensity of smoking, how many cigarettes you smoke a day, adds an extra burden on lifespan. As you can see, most diseases, even psychiatric diseases and the classical dyslipidemia and cardiometabolic diseases, all decrease lifespan to different extents. And the last one, for example, obesity: every kilo decreases your lifespan by about two months. Just as a side note, this means carrying a kilo all your life; if you just gain a kilo during the Christmas period because you eat too much and then shed it afterwards, it will have no impact on lifespan. Okay, so let's assume that these risk factors' effects on lifespan are known. Then we can plug these back in here, and we just need genome-wide association studies linking SNPs to the different risk factors. That's what we do in our Bayesian scan. So we run a GWAS on lifespan, which gives us an estimate of each SNP's effect on lifespan, gamma-hat-i for SNP i, and of course that's just an estimator of the true effect, with some variance. Sorry, Zoltan, there was a question, and I had the same from Animesh in the chat: on the previous slide, when you sum over the risk factors, you assume that they're independent? Exactly, that's a very good point. Here, exactly for that reason, we did a multivariable Mendelian randomization, so we estimate these effects in a multivariable fashion, taking all the risk factors together. These are not the causal effects of each risk factor alone on lifespan, but the multivariable effect of the risk factor on lifespan, accounting for all the other risk factors. So indeed, it's a very important point.
These risk factors are correlated, and individual causal effects might look bigger just because the different risk factors are correlated, so we need multivariable causal effect estimates on lifespan. The betas are of course univariable effects, and those are fine; but to add these products up, the alphas, these causal effects, need to be multivariable, or we would need to choose independent risk factors, which we don't do. Thanks, very good question; I will keep an eye on the chat more often. Okay, so now we have these priors; I call them mu. We assume that these multivariable causal effects, the alphas, are known, and the betas we estimate from other GWA studies. Just a reminder: beta-ij is the effect of SNP i on trait j, on risk factor j, and we sum over all the risk factors. That's our prior mean. But of course, we are not that sure about the prior, because it is itself just an estimate with its own variance. Since these betas are all noisy estimates of the actual causal effect of SNP i on trait j, and we know the variance of each estimator, we also know the variance of this prior: it is the sum of the variances weighted by the squared alphas, the multivariable causal effects. Of course, in reality we can be a bit milder and allow that it's not only through these risk factors that the SNP impacts the trait, so we can even inflate this variance: we can add some arbitrary constant, we can double it, whatever; it depends on how strongly we believe that we captured most of the lifespan effects with the traits we looked at. If you know you only looked at a small set of traits, you typically want to increase this tau.
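Putting the prior together as described: the prior mean for one SNP is the alpha-weighted sum of its risk-factor effects, and the prior variance propagates the estimation noise of the betas with squared alpha weights, optionally inflated by tau. Every number below is made up purely for illustration; the real alphas come from multivariable Mendelian randomization and the betas from external GWASs.

```python
import numpy as np

# Hypothetical multivariable causal effects (alpha) of three risk
# factors on lifespan, e.g. BMI, smoking, education (made-up values)
alpha = np.array([-0.3, -0.6, 0.4])

# For one SNP: its estimated effects (beta-hat) on the three risk
# factors from external GWASs, with their standard errors
beta_hat = np.array([0.04, 0.00, 0.02])
beta_se = np.array([0.01, 0.01, 0.01])

tau = 1.0   # optional inflation of the prior variance (>= 1)

# Prior mean: sum of mediated effects, mu_i = sum_j alpha_j * beta_ij
mu = alpha @ beta_hat
# Prior variance: propagated estimation noise, weighted by alpha^2
var = tau * (alpha**2) @ (beta_se**2)
print(mu, var)
```

Increasing tau above 1 widens the prior, encoding the belief that the chosen risk factors do not capture all routes from the SNP to lifespan.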
And since we were confident that we used a large set of traits (in our study we had about 60 different traits to choose from), we were pretty sure that we captured most of it. So we have a null hypothesis that the true effect is zero, and an alternative hypothesis that this effect comes from a normal distribution whose mean is exactly as expected from these mediated effects through the different risk factors, with some confidence around this prior. So we have two models and we want to compare how likely one model is against the other given the data we observe, and the data we observe is really just the association estimate with its variance, estimated of course from finite samples. So the Bayes factor is a very obvious thing to look at: it is the probability of the data given the alternative model divided by the probability of the data given the null model. In our case, the data is the estimated effect of the SNP. The problem with lifespan GWAS is that the effects are very small, and because of the large noise these estimates are often difficult to distinguish from zero; that's where we hope the prior will help. Since we have really simple models, model zero is trivial: its parameter is just that the effect is zero. The other model says that this gamma parameter comes from the normal prior distribution, so the total probability of the data under model one is obtained by integrating out gamma over this distribution: we plug the prior distribution in, take the probability of the data given gamma, and integrate the product over all possible values of gamma. Here I just replaced the general formula with the actual values for our case.
That's what we do, and in our case it's very simple, because the observed data distribution and the prior distribution are both normal. Here is just a reminder of the normal probability density function. The null model requires no integration at all: you just plug in a value of zero. And the alternative is an integral of two Gaussians, which integrates out to a Gaussian as well, and that is the final probability of the data given the alternative model defined by the prior. You can see that the variance has two parts: the original estimation error, which comes from the data, and a second part, which comes from our confidence in the prior, and they are simply summed. And here we plug in our prior estimate, which now uses a lot of data from other genetic association studies for the given SNP. Just a reminder: i always refers to SNP i, so strictly speaking this Bayes factor should carry an i for each SNP. So for example, the null hypothesis tells us to expect the effects to come from a distribution centered at zero. The alternative hypothesis for a given SNP might be, for example, that we expect the SNP to be lifespan-increasing, in which case this mu will be positive. It means that we see this allele and, because of its effects, for example increasing education levels, decreasing LDL levels, maybe increasing HDL levels, maybe decreasing systolic blood pressure, maybe decreasing obesity, through all these positive impacts we expect it to also increase lifespan. That's why we have a positive prior mean, with its own variance. And let's say we now observed a particular effect size, say two, and we compare the likelihoods of the null versus the alternative; that's basically what the Bayes factor is doing.
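The Gaussian-conjugate Bayes factor described above fits in a few lines: since both the likelihood and the prior are normal, the marginal under the alternative is a normal with the two variances summed, so no numerical integration is needed. The numbers are illustrative only, not values from the study.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_factor(beta_hat, se, mu, prior_var):
    """BF comparing H1 (effect ~ N(mu, prior_var)) vs H0 (effect = 0).

    Integrating the Gaussian prior against the Gaussian likelihood
    gives a Gaussian marginal whose variance is se^2 + prior_var.
    """
    p_alt = normal_pdf(beta_hat, mu, se**2 + prior_var)
    p_null = normal_pdf(beta_hat, 0.0, se**2)
    return p_alt / p_null

# A SNP whose observed effect agrees in sign with its prior mean
print(bayes_factor(beta_hat=0.02, se=0.01, mu=0.015, prior_var=1e-4))
# The same observation under a zero-centred prior of equal width
print(bayes_factor(beta_hat=0.02, se=0.01, mu=0.0, prior_var=1e-4))
```

The first call yields a larger Bayes factor than the second: an informative prior mean pointing the same way as the data strengthens the evidence, which is exactly how the prior is meant to help borderline SNPs.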
This is coming from the GWAS study on lifespan, and this is coming from all the other studies from which we want to borrow strength. Now, there's an extra complication, which I probably won't have much time to go into, but you will often hear this — since I'm not Bayesian myself — frequentists are very much used to p-values, and the general claim is that if you give a very large prior to many, many SNPs — for example, I give an enormous positive prior to all SNPs in the genome — then of course whenever, by chance, I observe a relatively large positive effect, my prior will really boost it and my posterior will be very, very large. In that case, you might pick up seemingly significant associations, but you wonder how dependent they are on the prior. So that's exactly what we wanted to examine here: if I were to generate, for example, a GWAS with fully null effects — basically taking a phenotype which has no genetic basis, or just taking random numbers for every individual instead of the actual lifespan value — and I run a GWAS on that, then of course that GWAS will give me effect size estimates which are totally meaningless. I can still plug these in and I will still get Bayes factors. And the question is how different those Bayes factors will be compared to the Bayes factors that I observed with my real GWAS. That's exactly what we did. We basically generated 1,000 null genome-wide association studies — we generated 1,000 traits which are not correlated at all with anything in the genome — and we calculated the Bayes factors for those. So now imagine that for a million genetic markers, we have 1,000 null Bayes factors generated. And then we compare the two against each other.
I might jump a few slides for the moment and come back to this. So this was the distribution — the red one — of the Bayes factors that we observed in our lifespan study. You see some Bayes factors are gigantic. And this black one is the distribution of the Bayes factors that we get when we generated these 1,000 null GWASs. So if there is no association in the genome whatsoever, this is the distribution of Bayes factors that we expect to see. If I see a Bayes factor somewhere here, this is nothing out of the ordinary, but clearly Bayes factors in this range are way, way larger than we expect if there were no association whatsoever. So at least when all hypotheses are null — when none of the SNPs is associated with the trait — we know exactly what kind of Bayes factor distribution to expect, and we can compare that to what we actually observed. That gave us some confidence, and based on that we can generate p-values for each SNP. Because basically, when you want to sell anything Bayesian — let's say you claim a SNP that had a very high Bayes factor — many of the journals want to see a p-value. And this is a nice way to associate a p-value with your observed Bayes factor. But now the difference is that we rank our SNPs based on the strength of the Bayesian evidence, the Bayes factor size, and then we check what kind of p-value it translates to. Maybe just some words about how we can be a bit smarter than just simulating. I just said that we ran 1,000 GWASs, but running 1,000 GWASs takes a lot of time, because every time you run an association — let's say in half a million samples, with more than a million markers, typically 10 to 15 million markers — and then you repeat it 1,000 times, it really burns a lot of CPU.
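This calibration can be sketched by reusing the Gaussian Bayes factor from before and drawing null effect estimates as pure noise, in place of actually rerunning 1,000 GWASs; all parameter values below are illustrative:

```python
from statistics import NormalDist
import random

def bayes_factor(beta_hat, se, prior_mean, prior_var):
    """Gaussian Bayes factor: H1 effect ~ N(prior_mean, prior_var) vs H0: 0."""
    like_h1 = NormalDist(prior_mean, (se**2 + prior_var) ** 0.5).pdf(beta_hat)
    like_h0 = NormalDist(0.0, se).pdf(beta_hat)
    return like_h1 / like_h0

def null_bayes_factors(n, se, prior_mean, prior_var, seed=1):
    """Bayes factors for effect estimates that are pure noise (true effect 0)."""
    rng = random.Random(seed)
    return [bayes_factor(rng.gauss(0.0, se), se, prior_mean, prior_var)
            for _ in range(n)]

def empirical_p(observed_bf, null_bfs):
    """Fraction of null Bayes factors at least as large as the observed one
    (with the usual +1 correction so the p-value is never exactly zero)."""
    exceed = sum(1 for b in null_bfs if b >= observed_bf)
    return (exceed + 1) / (len(null_bfs) + 1)

nulls = null_bayes_factors(1000, se=0.02, prior_mean=0.03, prior_var=0.0004)
p = empirical_p(50.0, nulls)  # a very large observed Bayes factor
```

A Bayes factor that sits well inside the black null distribution maps to a large p-value; one far out in the tail maps to a small one.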
And there's no point doing it — one can be smarter. Basically, we can replace these gammas: if you divide the effect estimates by the corresponding standard errors, that gives a Z statistic, and under the null it should come from a standard normal distribution. Now we're looking at the Bayes factors that we generate under the null, and what is nice is that we just need to plug in these Z statistics — we know they should come from a standard normal distribution, since they are the effect sizes divided by standard errors — and all we need to do is evaluate this quantity. After a few pages of algebra we can actually derive it, and we save a lot of time by deriving these probabilities. This led to at least a hundredfold speed-up. So we can very quickly get around the problem: we don't need to actually generate null GWAS statistics, and we can integrate out, because we know these Zs come from a standard normal; these probabilities can be evaluated very quickly. So it's all good: we can now generate Bayes factors and, on top of that, we can assign p-values to these Bayes factors. That's how we compare the two and that's how we get the p-value. Now, what do we get? If we actually run a GWAS and look at the p-values that we get from the Bayes factors, not surprisingly, the top two hits are the same top two hits found by the classical GWAS — though the classical GWAS only finds this one. Just a side note: this APOC1/APOE region is a region with a very large linkage disequilibrium block, so there are really many variants that are closely correlated, and it's sometimes difficult to identify which gene might be the causal gene. Clearly, APOE is one of the strongest candidates.
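Coming back to that shortcut: since null Z-scores are standard normal, the tail probability of the null Bayes factor distribution can be evaluated by integrating over z rather than by permuting phenotypes. The grid integration below is a sketch standing in for the closed-form algebra mentioned in the talk, with illustrative parameters:

```python
from statistics import NormalDist

def bayes_factor(beta_hat, se, prior_mean, prior_var):
    """Gaussian Bayes factor: H1 effect ~ N(prior_mean, prior_var) vs H0: 0."""
    like_h1 = NormalDist(prior_mean, (se**2 + prior_var) ** 0.5).pdf(beta_hat)
    like_h0 = NormalDist(0.0, se).pdf(beta_hat)
    return like_h1 / like_h0

def null_bf_tail_prob(bf_obs, se, prior_mean, prior_var, zmax=10.0, steps=20000):
    """P(BF >= bf_obs) under the null, integrating z ~ N(0, 1) on a grid.

    Each null z implies a null effect estimate z * se, so no phenotype
    permutations or repeated GWAS runs are needed.
    """
    std = NormalDist()
    dz = 2.0 * zmax / steps
    total = 0.0
    for k in range(steps + 1):
        z = -zmax + k * dz
        if bayes_factor(z * se, se, prior_mean, prior_var) >= bf_obs:
            total += std.pdf(z) * dz
    return total

p_null = null_bf_tail_prob(50.0, se=0.02, prior_mean=0.03, prior_var=0.0004)
```

The same answer the 1,000-GWAS simulation approximates by Monte Carlo is obtained here deterministically, which is where the speed-up comes from.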
And it makes a lot of sense that genetic markers that predispose us to higher lipid levels, higher LDL levels, will lead to earlier coronary artery disease or stroke that can eventually lead to death. Or, if we live through this without any stroke or major cardiac event, the same variant will also predispose us to develop Alzheimer's disease, and then people tend to live shorter as well. So this variant has a double hit: it tries to kill us earlier by increasing lipid levels early on, and it has a secondary hit — even if you survive the first, you might die earlier because of Alzheimer's disease. So it's highly pleiotropic, and that's why it's picked up so strongly. The other one actually has a very simple explanation. This variant is in a nicotinic acetylcholine receptor, and if we have polymorphisms there, you are predisposed to much more smoking, because you appreciate nicotine much more: when nicotine binds, it basically releases much more pleasure in the brain. So for people who carry the highly functional variant, nicotine excites the receptor much more, they find more pleasure in nicotine and smoking, and then of course they tend to die earlier. An average person who smokes about a pack of cigarettes a day dies roughly 10 to 15 years earlier. Carrying this variant thus shortens lifespan through smoking. But if you carry this variant and never smoked in your life, it will not change your propensity for earlier death at all. This really shows that these markers are associated with longevity, but if the mediating disease or environmental factor is not present, they will not have an impact on longevity. We also find another lipid-related gene — even two of these — and there is the very well-known obesity gene.
So in total, thanks to the Bayesian scan, where our priors really gave a boost to discovery, we find about 16 different genetic markers: we can now discover many more loci associated with human longevity at a 5% FDR — false discovery rate — level. Here's more of a list, but also just to show: here you see the prior, what kind of Z statistic we expect to see based on the related disease studies, and here is the Z statistic we observed in the UK Biobank association summary statistics. So we see that they actually agree in direction, but typically the observed ones are larger, meaning that the effects we see are not fully explained by the risk factors that we examined — which is perfectly normal. If you look at, for example, the APOE region here, we have a massive effect, but among the risk traits we didn't even include Alzheimer's disease — we really included more cardiovascular traits. So we don't observe everything, but it still gives us a good indication of useful candidates. Obesity is different: the FTO variant's effect on lifespan is entirely explained by the mediated effect of this variant through obesity, so here it is a perfect match. But for the majority of the traits, the effect on lifespan actually exceeds the prior — for example, for the nicotinic acetylcholine receptor variant linked to smoking, the effect that we observe is far larger than what the priors would suggest. And here you can also appreciate that these effects are minor: even if you carry this APOE epsilon-4 variant, it will shorten life by only about five months. So these are not massive effects — each is a couple of months — and of course the chance of carrying the risk allele at multiple locations at once is very, very low.
So probably no one in this classroom carries even four or five of them at the same time. These effects rarely add up, and there are really only very few individuals in whom all these risk factors are present. So what do we see as the effect of these SNPs? We can look at the pleiotropic effects and see which traits they might mediate lifespan through. For example, if you look at the bottom SNP — this APOE variant on chromosome 19 — you can see it has this very large life-decreasing effect. And it's life-decreasing because it increases LDL cholesterol, which also leads to coronary artery disease. But, for example, it lowers a little bit the chance of type 2 diabetes, while it increases triglycerides and decreases HDL levels. So we see that it has a negative impact on many, many different diseases, but not on type 2 diabetes. And then you can see some other traits which are more education-driven: because they decrease education, eventually they lead to lower lifespan. Of course it's not education itself that lowers lifespan; education is just a very, very good proxy for socioeconomic status. And socioeconomic status is a very strong indicator of how long people live, because people with higher education tend to earn more — and unfortunately, this being British data, access to the health system is not the same for people who earn more versus less. People who earn more have access to private care; they might be more educated about taking care of their health, reducing overweight, moving more, and so on and so forth. So it's really capturing a multitude of effects and not just a single one. Here you can see CHRNA5, the nicotinic acetylcholine receptor, whose effect massively impacts smoking but also increases schizophrenia risk.
So you can see the effects are often very pleiotropic, impacting multiple traits — for example here, many small effects of a single SNP eventually lead to a life-shortening effect. And a couple of these SNPs, at least five, have not been picked up by genome-wide association studies for these other traits, because their effect on each and every one of them is minute, but cumulatively they lead to a life-shortening effect. So most of the lifespan SNPs are very pleiotropic SNPs, impacting many different factors. Another way to get more confidence that these SNPs are real and indeed lead to shorter lifespan is to look at their association with age. If you look at a cohort such as the UK Biobank, think about someone who is very old in the cohort: if you take individuals who are above 75 years old, part of the people of that age are already dead, and another part don't participate in the cohort because they're not fit enough anymore to come and take part in a very lengthy questionnaire, follow-up, and so on — often you have to travel to get to the nearest assessment centre. So if you look at individuals who are older and still participate, they have to be pretty fit; it's a very selected subset of people. Whereas among younger participants, practically nobody of that age is dead, and there are very, very few who are unfit to participate. So as we stratify the cohort toward older people, these are really much healthier people. They are healthier maybe because of a healthier lifestyle, but also because they may have had luckier genetics for particular diseases. So imagine that we have a life-shortening variant and it has 30% frequency in the full population.
So when we look at people who are young, there is no selection — all of them are here. But if this variant shortens lifespan, then among the fit it must have a lower allele frequency than among the unfit, and we have selected into this cohort only the fit people, because those are the ones who tend to participate in such studies. This is called the healthy volunteer bias. So the allele frequency of such markers — how frequently the risk allele appears — is correlated with age: in older participants the risk allele has a lower frequency, while in young people there was no selection, so it has the proper full frequency, in this case 30%. And this frequency goes down as we look at older and older people in the cohort. That's what we've actually seen. Here, for these 16 markers that we discovered to be associated with lifespan, on the x-axis we show the effect on lifespan — the largest effect is the APOE variant — and on the y-axis we show how much the frequency changes with every year of age. So if there is a decrease in frequency with age, we see those SNPs low here, and if it's an increase, we see them high. A life-shortening allele should thus show decreasing frequency with age together with a negative effect on lifespan. And we see a pretty nice correlation: except for three SNPs, these discovered markers follow the pattern that if they are detected to have a positive, lifespan-extending effect, they tend to have an increased frequency in the older group, and if they decrease lifespan, they have a decreased frequency in the older study participants. So this is very much in line with what we expect to see. As a final slide, we also looked at the 16 loci and tried to identify which gene might be the causal gene.
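The selection arithmetic behind this healthy volunteer bias can be sketched directly; the participation rates below are made-up numbers, not estimates from the study:

```python
def observed_allele_freq(base_freq, carrier_participation,
                         noncarrier_participation=1.0):
    """Allele frequency seen among participants when carriers of a
    life-shortening allele are less likely to survive and participate."""
    carriers = base_freq * carrier_participation
    noncarriers = (1.0 - base_freq) * noncarrier_participation
    return carriers / (carriers + noncarriers)

# Older age strata imply stronger selection against risk-allele carriers,
# so the observed frequency drifts below the population value of 30%.
freqs = [observed_allele_freq(0.30, s) for s in (1.0, 0.9, 0.8, 0.7)]
```

With no selection (rate 1.0) the observed frequency equals the population frequency; as selection strengthens with age, the observed frequency falls, which is exactly the age trend plotted on the y-axis.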
Because of course these are just SNPs, but we sometimes need to identify genes, and the mechanism through which these SNPs exert, for example, a lifespan-shortening effect. One general consideration is that these SNPs might be changing the expression level of certain nearby genes; these are called cis-eQTLs. If they are modifying gene expression — for example in the brain; think especially of CHRNA5, the nicotine receptor, which is of course very much brain-centred — then, if the gene's expression is higher in those who are more predisposed to smoke, the expression of this gene might be causally linked to, for example, smoking frequency or initiation, or eventually to lifespan. In the second half I'll talk about this causal inference technique, called Mendelian randomization, abbreviated here as MR. We could identify three genes whose expression in brain might potentially causally reduce lifespan. And for one of them we actually found support in a mouse study, where we measured the expression level of this gene in 72-day-old mice and looked at the median lifespan of these mice as a function of their gene expression — and we see a very strong negative correlation. So it means that increased expression of this gene in mouse prefrontal cortex — exactly what we predicted from the human data — is strongly negatively correlated with the lifespan of the mice. This is some confirmation that, starting from a polymorphism identified in a human study as associated with lifespan, we can move to the gene thanks to expression QTL studies.
So we can now link to the actual function of the gene in a very particular tissue, and follow this up in a model organism where we can see it really clearly — obviously you will never find a human cohort that measured the gene expression of people at a given age as a proxy or predictor of their lifespan. That's why we need to move to animal studies. So in conclusion: we find many novel loci thanks to this Bayesian approach, because we build smart, very informative priors. We can also assign p-values to boost confidence that these are not just happening by chance — that they would not be seen in any random association study merely because we put a particular structure on the priors, or because the priors are too strong. Many of the associated variants are heavily involved in lipid transport and lipid metabolism. The discovered loci were very, very pleiotropic, and we could identify quite different mechanisms and pathways through which these effects eventually lead to reduced lifespan. A follow-up experiment has shown that the expression level of one of these genes might be relevant as a good biomarker of accelerated aging: if the expression level of a gene at a given age is already elevated, that gene might be a proxy of accelerated aging. Similar or related studies have looked a lot at methylation levels, which we didn't do much in this study. Methylation levels are a very nice footprint of different abuses in life — for example smoking or alcohol use — which are visible in our methylation profile, the same as physical activity. For general fitness there is a particular methylation profile, and methylation probes correlated with these lifelong exposures to positive and negative impacts can also be predictive of human lifespan.
So this is a very active area of research — finding early biomarkers of accelerated aging — and these SNPs might serve a role here. Okay, if you're interested in more, the paper can be looked up, and otherwise I would like to thank the collaborators: the team at the University of Edinburgh, at the EPFL and at the University of Lausanne, and also the lead author, Aaron, who ran almost all the analyses while working in my group. I also thank my full group, and I'm happy to discuss more about this. Okay, so I will change gears, and instead of focusing on genetic associations I will focus more on identifying causal effects between risk factors and mostly cardiometabolic outcomes — but it can be any complex disease. And of course we will use genetics for this; I will not abandon that part. An important point I already mentioned is that sometimes you don't care that much about individual SNPs' effects on an outcome, and then we just put some reasonable prior on the distribution of the effects of genetic markers overall on a trait. The same goes if you're interested in other quantities, such as the heritability of a trait, which I mentioned in the first talk. And in this talk I will argue that if our centre of attention is on the causal effect of one trait on another, it is again probably a good idea not to care too much about individual SNP effects, because these can then be integrated out. It might sound mystical; I will get down to the details. First I will introduce you a little bit to causal inference and how it differs from simple correlation or simple regression techniques. The central technique I will be using here is called Mendelian randomization, an instrumental variable method which I will describe in full detail, and then I will show you one particular subtype, an advanced Mendelian randomization method which uses Bayesian priors on the effects of the SNPs on the exposure and on the outcome.
And I will show you how this method works, what the key underlying model behind it is, what kind of results we can get, and how we can improve causal inference through Bayesian priors on SNP effects. So probably many of you have seen this slide — it's fairly popular. It correlates chocolate consumption, in kilograms per year per capita for a given country, against Nobel prize winners per 10 million inhabitants of that country. And of course we often show this because Switzerland is topping the chart, leading both categories: we eat a lot of chocolate, and this country also boasts many per capita Nobel prize winners. The correlation is very striking, 0.8 — though of course, in this publication they chose the countries carefully to reflect this nicely and not deviate too much from the correlation. The question is, of course: does eating chocolate make us smarter? Obviously it doesn't. So many times, when we observe a correlation between a risk factor and a clinical outcome, it is just correlation and not causation. And the reason is that there are many, many hidden confounding factors. A confounder is a factor which impacts both your X and your Y variables. Just because you regress Y on X, it doesn't give you a causal effect estimate of X on Y — you could have run the regression of X on Y, the other direction, and it would give you the exact same p-value, with the effect sizes just the inverse of one another. So regression will never tell you whether two traits are really causally related to one another; very often there is nothing more than correlation — of course, with some additional covariates that can be included. In this Nobel prize and chocolate consumption example, the GDP of a country is a very important confounder.
The higher the GDP of a country, the more it can of course invest in research — even if two countries invest the same proportion, the richer one invests more in absolute terms, and a richer country can also invest a higher proportion of its GDP in research, which eventually leads to more Nobel prizes. There is no rocket science about it; there is no such thing as smarter people in one country or the other. Higher GDP also allows people to earn enough to spend more on luxury products such as chocolate, so chocolate consumption increases — it's as simple as that. There can probably be additional confounding factors, but the key here is that there is no causal relationship in either direction; there is just a simple confounding factor that produces a very high correlation between the two. Sometimes the correlation does reflect causation: for example, if you look at the correlation between BMI and diabetes, the correlation between these two traits mostly reflects a causal effect from BMI to diabetes. This has been confirmed by many epidemiological and interventional studies: decreasing BMI through a different diet reduces diabetes prevalence and incidence. More precisely, in such interventional studies, if we look at BMI and physical activity, we see that increased physical activity tends to reduce BMI — or at least the fat-related part of BMI — but also that people with higher BMI tend to do less physical activity. So it's a bidirectional causal relationship, some sort of vicious circle. And it can be much more complicated still: take for example BMI and education level — and as students you have probably spent many years, or at least months, on education — the two are highly negatively correlated.
Socioeconomic status is a clear confounder here: people with higher socioeconomic status can of course afford to go to university to achieve higher education, and they can also afford to buy healthier food, which has less impact on body mass index; they can afford access to sports that might otherwise be expensive to pursue, which also eventually leads to lower body mass index, and so on. There is also a causal effect from education level to BMI — a very clear one: people who are more educated are also more educated about the beneficial impact of physical activity and of a healthier diet, and as a consequence they have lower BMI. Additionally — and this is highly debated — there is probably not much effect, if any, from BMI to education level; I would rather confidently say there is probably zero effect, although several studies show a negative causal effect, which is somehow biased by not appropriately accounting for parental socioeconomic status. So as you can see, correlation does not imply causation, and actually getting causal effects is much more complicated than just looking at a correlation or a linear regression estimate. Mendelian randomization is a technique which attempts to derive causal effects using instrumental variables. The key here is that we are not looking just at the two traits, or potential covariates and so on, to estimate the causal effect — we are using additional factors: these instruments, which are strongly correlated with the exposure. From here on I will often refer to risk factors as exposures and to diseases as outcomes.
So the genetic markers are correlated with the exposure, and they can be used as instruments, because if we look at the effect of a genetic marker on the outcome, and there is no direct effect from this genetic marker on the outcome but all of its effect is mediated through the exposure — I'm using obesity and diabetes as an example — then, as you've seen before for the Bayesian GWAS, the effect of the genetic marker on the exposure times the causal effect of the exposure on the outcome gives the total effect of the genetic marker on the outcome. Again, there is a problem if the genetic marker has an additional direct pleiotropic effect. As I mentioned, and as you could see in the earlier heat map, pleiotropy happens very often; it's a frequent phenomenon, very pervasive in complex trait genetics. So we can already suspect that this condition often won't hold. We also require that the genetic marker is not correlated with any confounder of the exposure-outcome relationship. So there are many ifs here. What's easy is that the instrument has to be correlated with the exposure — that's easy to verify — but the other two conditions are very difficult to check and are often violated. For simplicity, if we assume for the moment that they are not violated, then very simply the causal effect times the exposure effect equals the outcome effect. If this really holds, all we need to do is take the outcome effect and divide it by the exposure effect, and this ratio is one estimator of the causal effect. The advantage of genetic studies is that we don't have just one genetic marker associated with a given exposure, with a risk factor — we often have hundreds of them. So for each of them, we can ask what causal effect estimate this marker would give. So what about SNP i?
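That single-instrument ratio can be sketched as follows; the numbers are illustrative, and the standard error is the usual first-order (delta method) approximation:

```python
def wald_ratio(beta_outcome, se_outcome, beta_exposure):
    """Single-instrument MR (Wald) ratio: the causal effect is the SNP's
    outcome effect divided by its exposure effect, with a first-order
    (delta method) standard error."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se

# A SNP raising the exposure by 0.20 and the outcome by 0.05
# implies a causal effect of 0.05 / 0.20 = 0.25.
est, se = wald_ratio(beta_outcome=0.05, se_outcome=0.01, beta_exposure=0.20)
```

Note how a weak instrument (small exposure effect) inflates the standard error, which is why strong, robust instruments are required.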
So the i-th instrument — the i-th genetic marker associated with the exposure — gives a causal effect estimate calculated as the ratio of SNP i's effect on the outcome divided by SNP i's effect on the exposure. And for each estimate we have its standard error, so we know how confident we can be about each individual estimate. Then, looking at the distribution of these different estimates, we can use the mean, the median, or a weighted mean, where the weighted mean weights these estimates by the inverse of the squared standard error — this is called inverse-variance weighting. This technique combines multiple estimates in an optimal way, optimal in the sense that the final weighted-mean estimator is unbiased and has the smallest possible variance — and of course we want an estimate to be as precise as possible. This is a very appealing property, but we can also use the median or the weighted mode. And whenever you hear about median and mode, the first thing that should come to mind is that you use these kinds of estimators when you have huge outliers in the data, which can really drive the mean far away, whereas the median and mode are much more robust to big outliers. And outliers, in our case, really mean very pleiotropic variants. How it works in practice is that we plot the effects of each instrumental variable — each genetic marker robustly associated with the exposure — with the SNPs' effects on the outcome, such as diabetes, on the y-axis, and their effects on the exposure on the x-axis. If we fit a regression line forced through the origin, the zero-zero point, the slope of this regression line is the same as the inverse-variance weighted combination, the Mendelian randomization causal effect estimate. This is very handy — for example, these are real associations.
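The inverse-variance weighted combination above can be sketched like this, with toy effect sizes; the per-SNP standard error uses the same first-order approximation as the single-instrument ratio:

```python
def ivw_estimate(beta_exp, beta_out, se_out):
    """Inverse-variance weighted mean of per-SNP Wald ratios.

    Each ratio beta_out/beta_exp has approximate SE se_out/|beta_exp|,
    so its inverse-variance weight is (beta_exp/se_out)**2. This is the
    same value as the slope of a zero-intercept weighted regression of
    beta_out on beta_exp.
    """
    ratios = [bo / be for be, bo in zip(beta_exp, beta_out)]
    weights = [(be / so) ** 2 for be, so in zip(beta_exp, se_out)]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)

# Three instruments all consistent with a true causal effect of 0.25
beta_exp = [0.20, 0.10, 0.05]
beta_out = [0.050, 0.025, 0.0125]
se_out = [0.01, 0.01, 0.01]
effect = ivw_estimate(beta_exp, beta_out, se_out)
```

The weights make the better-measured instruments (large exposure effect relative to outcome noise) dominate the combined estimate.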
So here — if you still remember from the first hour — this is the FTO variant, the largest effect on BMI, and you can see that this SNP also has a proportional effect on type 2 diabetes; when you fit the slope, the slope is about 0.25, if I remember correctly, more or less. So this is the estimated causal effect. But you can see, for example, this variant which is way off the line, and these ones and those ones — these might be suspicious variants with pleiotropic effects, and that's why they are off the line. It means that this particular SNP does not only have an effect on type 2 diabetes through impacting BMI, but might have an additional direct effect of maybe 0.01 strength, which is basically its deviation from the line. So pleiotropy is indeed a problem. More generally, there can be many kinds of pleiotropy — this is work by Liza and Nino, two brilliant PhD students in my group; Nino is now in Exeter. This is the general model: we have direct genetic effects on the exposure — those are the gamma effects. We always want to use genetic markers which really just directly impact the exposure, but genetic markers can also have an additional pleiotropic, direct effect on the outcome, and they can have direct effects on a confounder, which then has an effect on both X and Y. And this is really dangerous, if you think about it, because if I use a genetic marker which impacts the confounder, then when we form this ratio — remember, the causal effect estimate is just the effect estimate of this marker on the outcome divided by its effect on the exposure — the effect of this marker on the outcome would be gamma-U times q-Y, and the effect on the exposure gamma-U times q-X, so the ratio of the two would be q-Y divided by q-X.
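That cancellation can be made concrete with assumed toy effect values; the true causal effect of X on Y is taken to be zero here, so any non-zero ratio is pure bias:

```python
def wald_ratio_for_confounder_snp(gamma_u, q_x, q_y):
    """A SNP acting only through confounder U (and no true X -> Y effect).

    Its exposure effect is gamma_u * q_x and its outcome effect is
    gamma_u * q_y, so the MR ratio collapses to q_y / q_x — a biased
    'causal' estimate that says nothing about X's effect on Y.
    """
    beta_exposure = gamma_u * q_x
    beta_outcome = gamma_u * q_y
    return beta_outcome / beta_exposure

# Whatever the SNP's effect on the confounder, the ratio is always q_y / q_x
estimates = [wald_ratio_for_confounder_snp(g, q_x=0.4, q_y=0.2)
             for g in (0.01, 0.05, 0.2)]
```

The gamma-U term cancels, so every confounder-acting instrument reports the same misleading slope — which is exactly why such instruments bias the fitted regression line.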
That's a problem, because we would then use the ratio of these two confounder effects as an estimator for the X-to-Y causal effect, which is totally wrong. So that is the biggest danger with these causal inference techniques: using variants that have only an indirect effect on the exposure, acting through something else. Of course, if Q-Y is zero it will not lead to any bias, but if Q-Y is non-zero and it is a true confounder, then the slope estimate will be heavily biased towards the Q-Y over Q-X ratio. Now, there is of course another problem: Mendelian randomization also assumes that there is no reverse causation. So these three red arrows are really problematic, and that is what we are going to model, all in one go. First of all, these gammas are genetic effects, direct genetic effects acting directly on X, directly on U, directly on Y. What we assume here is that these effects are independent of each other, which may not be true, but even when it is not, these violations usually happen in a very special case, when you use a parental trait of X. For example, if your exposure is BMI and the other trait is parental BMI, then of course these two are correlated, so that is problematic, but otherwise this is typically not a big issue. So instead of estimating each of the SNP effects, as is done in Mendelian randomization, where we really need an accurate estimate of each genetic marker's effect on the exposure and the outcome and we select only those markers that are significant enough to be considered, here we treat these as nuisance parameters. We don't care about the actual effects. What we really want to estimate are these two causal effects and maybe the confounder effects. So all of these are nuisance parameters, and we model them with a sort of spike-and-slab distribution; instead of a formula, I will just describe it on the next slide.
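This bias is easy to reproduce. Below is a minimal simulation (with invented effect sizes) in which every instrument acts on both the exposure and the outcome only through a confounder U: the fitted MR slope then recovers the Q-Y/Q-X ratio, even though the true causal effect is zero in this setup.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 1000                            # instruments acting only through U
gamma_u = rng.normal(0.0, 0.1, m)   # SNP -> confounder effects
qx, qy = 0.4, 0.25                  # confounder -> exposure / outcome effects

# True X -> Y causal effect is zero; all SNP effects are mediated by U.
beta_x = gamma_u * qx               # marginal SNP effects on the exposure
beta_y = gamma_u * qy               # marginal SNP effects on the outcome

# Zero-intercept regression slope (the usual MR fit, unweighted for simplicity)
slope = np.sum(beta_x * beta_y) / np.sum(beta_x**2)
print(slope)   # equals qy/qx = 0.625, not the true causal effect (0)
```

Adding estimation noise to the betas blurs the picture but does not remove the bias: the slope still converges to Q-Y/Q-X as the instruments get stronger.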
So what we assume is that a part of the genome, actually a large part, here I put 90% of the genome, has zero effect on a trait. I just didn't put it at exactly zero; I put some variance around it so that it is visible on the plot. So those are the zero-effect SNPs, 90%, and about 10% of the SNPs have an actual, non-zero effect on X, and they come from this wider distribution. The mixture of these two is what we will be observing, so it will still be a mixture distribution, and this really tiny bump tells us that there is a small fraction of the genome which is actually impacting the trait. There have been many methods estimating this, and for complex traits this proportion of non-null SNPs ranges between about one and ten percent most of the time, maybe fifteen percent at maximum. So that's really great, because we can now assume this spike-and-slab prior for all the direct effects, and these will be independent of each other. Now, I just drew this directed graph, and it can be turned into equations. So we are modeling X as impacted by U, by Y, and by the direct genetic effects. Of course, this G times gamma-U term shouldn't be added separately, because it is already included in U, so U is already in there, plus some error. The same for Y: it is also impacted by U with coefficient Q-Y, and it is impacted by X, because X impacts Y in a causal fashion, and there are the direct genetic effects, and there is the confounder with the gamma-U effects going in, plus some error. So these are the three variables that we are modeling here in this graph. Now the problem is that U is unobserved; we don't know what U is, so this equation is just not available to us.
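A quick sketch of what sampling from such a spike-and-slab distribution looks like; the 90/10 split and the heritability value are just the illustrative numbers from the slide, not estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 100_000      # number of SNPs
pi = 0.10        # proportion of non-null (slab) SNPs, the polygenicity
h2 = 0.25        # total variance contributed by the non-null effects

# Per-SNP effect variance chosen so the slab effects sum to h2 in expectation
sigma = np.sqrt(h2 / (pi * m))

in_slab = rng.random(m) < pi                           # spike vs slab membership
gamma = np.where(in_slab, rng.normal(0.0, sigma, m), 0.0)

print(in_slab.mean())       # close to 0.10
print(np.sum(gamma**2))     # close to h2 = 0.25
```

A histogram of `gamma` shows exactly the picture on the slide: a huge spike at zero and a barely visible wider bump from the 10% of causal SNPs.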
So what we can do is just substitute U into the other equations, and then we can express X simply as: part of X comes through indirect genetic effects through U, plus effects that go through Y and come back, plus direct genetic effects acting directly from G to X, plus some error. And Y works the same way: it also has effects acting through U, effects coming directly from G, and the effect coming through X. The advantage of this formulation is that the confounder no longer appears in the equations. We still have the effects of the genetic markers on the confounder, which we don't know, but we will not estimate these for each SNP separately; we will just use a prior distribution for them, and we will only estimate, basically, the proportion of SNPs that have an indirect effect on X through something else. So in this simplified model, now without U, if we multiply each side by the transposed genetic markers and divide by N, we get the marginal estimates. These are the GWAS estimates, the simple marginal linear model estimates of SNP k on X, which is the simplest form of linear regression when both X and G are normalized to zero mean and unit variance and there are no other variables in the regression. That is what we are mimicking here, and that is what we get from genome-wide association studies: the effect of SNP k on X and the effect of SNP k on Y, both estimated from GWAS, and they don't even need to be estimated from the same sample.
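To illustrate this "multiply by G-transpose over N" step, here is a toy locus (simulated standardized genotypes with an assumed autoregressive LD pattern, all parameters invented): the marginal GWAS effects come out as the local LD matrix times the multivariable causal effects, plus sampling noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5000, 50                     # samples, SNPs in one locus

# Simulated genotypes with an AR(1)-style LD structure (r = 0.8^distance)
R_true = 0.8 ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
G = rng.normal(size=(n, m)) @ np.linalg.cholesky(R_true).T
G = (G - G.mean(axis=0)) / G.std(axis=0)      # standardize each SNP

# One causal SNP with a multivariable effect of 0.3; all others are null
beta = np.zeros(m)
beta[10] = 0.3
x = G @ beta + rng.normal(0.0, 1.0, n)        # the trait

# Marginal (univariable GWAS) effects: multiply the trait by G^T, divide by n
beta_marg = G.T @ x / n

# Theory: marginal effects ~ local LD matrix times multivariable effects,
# so the single causal effect is smeared over its correlated neighbours.
R_hat = G.T @ G / n
print(np.max(np.abs(beta_marg - R_hat @ beta)))   # small: sampling noise only
```

Note how `beta_marg` is non-zero not just at SNP 10 but at all SNPs correlated with it, which is exactly the smearing described next.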
So you can see that to get from the actual trait values, let's say BMI, to the SNP effects on BMI, you just need to multiply both sides by this genotype vector, and then all these constructs of G times gamma become SNP k transposed times G times gamma, which is nothing else than the local correlation structure around SNP k. That is rho-k, a vector listing all the correlation values between SNP k and the other SNPs nearby, and the same holds for all the other terms. Since all these gammas come from a spike-and-slab distribution, the marginal effect sizes, the ones coming from GWAS, which are now bivariate effect sizes, the effects of the same SNP on X and on Y, can be modeled as the sum of effects through U, direct effects on X, and effects through Y, and each of these is again a random effect times the local correlation structure. So, skipping a few steps, what is really key here is that we have these multivariable effects, drawn from the spike-and-slab distribution, multiplied by the local LD structure: this is the correlation between SNP k and SNP j, and this is the multivariable causal effect of SNP j on trait X, and we sum these up, and that gives the marginal effect. So really, the way to imagine it is that if you have a true causal effect at a given locus, its multivariable effect will be smeared across the whole region. Now, if I take another SNP and want to estimate its marginal effect, the simple univariable GWAS effect on a trait, it will be the sum of the multivariable effects times the correlations between the causal variants and my interrogated variant k. So if these are the causal effects at a locus, here I am just looking at one genetic region, the darker the blue, the larger the negative effect, and the darker the red, the larger the
positive effect. So if these are the actual causal effects at a given genomic location, and wherever it is white those SNPs have zero effect, and this is my LD structure, meaning this SNP is perfectly correlated with my SNP k, so that is probably SNP k itself, you can see that there will be some highly correlated SNPs, then less correlated ones, some negatively correlated ones, and some with completely zero correlation with the SNP k I am interrogating. Then the product of the two, which we are calculating here, is just multiplying this value with that value, and whenever there is a zero it contributes nothing, so we only count the places where both are non-zero: where there is a non-zero correlation and a non-zero causal effect. We sum up these contributions, and that gives this Z-X-k, the marginal effect of SNP k on X. What we want, of course, since these are nuisance parameters and we don't care about the individual effects at all, is just to get rid of them; all we care about is their distribution. We have already assigned a prior to the multivariable SNP effects, the spike and slab, a mixture of null and non-null effects. We can also model the local LD structure. This is how a typical LD structure looks: this is the correlation between a given fixed SNP k and all the other SNPs in the region, so this is SNP k itself, and you can see there are some other SNPs in very high LD, very high correlation, with this focal SNP, and then the correlation dies off with distance; if you move away, the correlation here is very, very low. These are just actual estimates of correlation, so whenever you see a correlation, an LD value, that you can download from any database, it is a mixture of a null LD, a zero LD, measured with noise, because the
reference panels are of finite size, plus a true LD component. And this is interesting, because the local LD then also looks like a kind of spike-and-slab mixture, and the multivariable causal effects of SNPs across the genome also look like a spike and slab. So we need the product of two different Gaussian mixtures, which, when each is modelled as a spike and slab, becomes a product of two Gaussians times a product of two binomials: this is the product of the two binomials, and this is the product of the two Gaussians. Now, the probability density function of the product of two Gaussians is a Bessel function of the second kind, which is quite complicated as a PDF and very cumbersome to work with. That is why we switch from the probability density function to the characteristic function of the variable. I guess you are not all aware of what a Fourier transform and a characteristic function are; can you just put your hand up if you don't know what it is? Because then I could just skip this part if everybody knows what a Fourier transform and a characteristic function are. Okay, I just wanted to check whether you are awake, because I guess it is not common knowledge. So basically, the trick here is this: you have heard about probability density functions, basically the histogram of a random variable. The characteristic function is a transformation of this: it is the expectation of e to the power of i, where i is the square root of minus one, times t times X, and that is the value of the characteristic function at t. It might seem completely pointless to define a new function like this, but it is actually very handy, because for many distributions the characteristic function is much, much simpler than the PDF, and if you want to add up multiple independent variables, you see, if it is X1
plus X2, it will simply be the product of the two characteristic functions. And for example, the characteristic function of this modified Bessel function of the second kind is much, much simpler than its PDF. Getting to the actual PDF is very cumbersome, but getting to the characteristic function of this Z-k, which contributes to the marginal effect, is way simpler; it is just of this form. So that is why it is very handy: eventually we can work out the characteristic function of the marginal effect sizes, which is a product of three relatively simple characteristic functions, and then at the very end we just need to back-transform it, take the inverse, and that is the most cumbersome part. So this calculation is super easy and very fast, and we just need a fast way to back-transform, and just trust me that there are really fast methods, the fast Fourier transform, which can calculate these back-transformations in a speedy manner. So we move away from the probability density function to something else to make the formulas much simpler, to get the final likelihood function, and then in the very last step we invert the characteristic function to get back to the PDF. It might not make much sense to you now, but believe me, it really accelerates the computation a lot. So now we have a final formula for the probability density function: we have basically integrated out all the nuisance parameters, the individual SNP effects. We just put a prior on them, allowed it its own variance and its own mixing proportions, and estimated those, but we did not estimate the individual SNP effects on the outcome at all. And that is a huge advantage, because it gives us much more power: we don't waste our energy on estimating each individual SNP effect, and we can use all SNPs in the genome to estimate these causal effects, forward,
backward, and the confounder effects. So what really happens when there is a confounder U, as I allowed? If some SNPs act directly through U, they will give rise to a different slope when we look at the SNP effects on the exposure versus the outcome. And what MR does is blindly fit a single slope to the exposure-versus-outcome effects, even though some of these are confounder-related SNPs and these are true SNPs with a direct effect on X; you can see that the direct-effect slope is much smaller than these two. We just blindly mix everything together, because of course we don't know which SNP acts through a confounder, since we don't observe the confounder. All this sophisticated likelihood fitting is doing, graphically, is fitting multiple slopes to this big cloud of points: one slope here, one slope there, another slope for the reverse causation, and another group with a null effect on everything, and afterwards guessing which slope might be the confounder slope and which is the causal effect slope. So what do we get, why do we do all this sophistication? Because we get very interesting estimates, not only for the causal effects but for many other parameters. Here we simulated data on 50,000 samples, where X has both direct and indirect genetic effects, and Y likewise, and of course the key parameters are the causal effect of X on Y and the causal effect of Y on X. In this simulation setting we had no reverse causal effect and a 0.3 causal effect from X to Y, and that is why it is 0.3 here. And these are other interesting parameters that were not really the point of our investigation, but they come out as something additional which is actually very useful. So pi-X represents the polygenicity of the exposure; it means
what fraction of the genome seems to be associated with the exposure. Here it is on a natural logarithm scale, so roughly one in every hundred SNPs, or maybe up to five percent, is associated with the exposure. Then there is the proportion of SNPs directly, causally associated with the outcome; this is the heritability of the exposure, which we set at 25 percent; that is the heritability of the outcome, which we set at 20 percent, these are direct heritabilities; and these are the confounder effects, which we set at around 0.16 and 0.11. What is nice is that when we fit this model, while integrating out all the individual SNP effects, we can directly estimate heritability and estimate how polygenic the trait is, basically the larger this pi value, the more polygenic the trait, the larger the fraction of the genome having a direct causal effect on it, and most importantly, the model estimates both causal effects, going from X to Y and from Y to X. When you look at the many other methods designed to estimate the causal effect from X to Y, you can see that many of them underestimate the causal effect depending on the setting, and none of them is really accurate in most settings. When we use a different, null setting, where the causal effect is zero in both directions, X to Y and Y to X, but confounder effects are still present, we don't recover those confounder effects very accurately, but what is key is that we still get back a correct zero causal effect, while other methods generally claim a significantly non-zero causal effect. Then we can introduce other violations: we can put in a strong effect acting through the confounder which is negative, so we use opposite-signed Q-X and Q-Y, while we have a positive 0.3 causal effect but an opposite-signed confounder effect. So when Q-Y does not have the same sign as Q-X, one negative and one positive, the methods are misled, because, as you can imagine,
the ratio of Q-Y to Q-X will be negative: there is a cloud of points around the positive slope and another cloud of points around a negative slope due to the confounder, and the mixture of this positive and negative slope leads to a massive underestimation of the true causal effect for most of the methods in the literature, while we get back pretty much what we are supposed to see. In the same way, when there is a strong negative confounder effect, these effects will be massively underestimated. So we convinced ourselves, at least, that in many simulation settings where these Mendelian randomization assumptions are violated, where there is a heritable confounder, or a reverse causal effect, or simply pleiotropic effects, the existing methods can be massively driven by SNPs that do not act directly on X but act either directly on U or directly on Y. That is a problem, and we seem to be able to solve it. What is also interesting, and I briefly mentioned genetic correlation: you can just look at the correlation between two traits, which gives you an idea of how similar they are; they might be causally related or share a confounder. A slightly different quantity is the genetic correlation, where you only look at the genetic part of the traits and ask how correlated their genetic bases are; in the same way, you can ask how correlated their environmental components are. We looked at 13 trait pairs, and for every pair we calculated the genetic correlation, and with this model we can also estimate what genetic correlation we expect to see for X and Y based on the model, and that agrees very nicely with the observed genetic correlation. Then we turned to the really most interesting part, which is running this method for every trait pair. For example, looking at BMI: what is the causal effect of BMI on systolic blood pressure? You can see we get back a nice positive effect, which
has been confirmed many, many times; the effect of BMI on type 2 diabetes, again as expected and as in the literature; the effect of smoking on myocardial infarction, a very strong causal effect; and the positive, sort of protective, effect of education on coronary artery disease and on diabetes: higher education also decreases smoking, and so on, so many of these beneficial effects of higher education are also picked up by our method. It was also very interesting that our method, of course, also estimates the confounders. For example, when we look at birth weight and type 2 diabetes, many methods show a significant causal effect between how heavy you were at birth and whether, or how early, you developed type 2 diabetes. We observe no causal effect, a non-significant negative effect with a large confidence interval, and our model also implied that there is a confounder with same-signed effects on both type 2 diabetes and birth weight. And when we look at a large number of traits that could be potential confounders, parental obesity seems to be one of them: parental obesity increases birth weight, and it also increases the predisposition of the offspring to type 2 diabetes. So it is quite interesting that even causal models seem to be confused by this, assuming there is a causal effect going from birth weight to diabetes, while we managed to identify that it is driven by a confounder, and upon scanning through traits for what the confounder could be, we identified typical candidate confounder traits, a lot of morbidity-related phenotypes, fat-free mass but also fat mass. Another exciting example is HDL levels and systolic blood pressure, where we see a negative causal effect: higher HDL is decreasing systolic blood pressure, so it is another beneficial effect of high
density lipoprotein, which classical Mendelian randomization does not pick up. Our method also picks up that there is a confounder effect, and actually the authors of this paper contacted us: they told us that they had seen our paper and that they had suspected a positive confounder of this relationship, and in their paper they show that it is actually alcohol consumption, which has a causal effect on both traits. And there are of course many others, for example height decreasing systolic blood pressure, which is what we expect to see, and there remains to be discovered another trait confounding this relationship with equally-signed positive effects on both. If you are interested in more, this was published about a year ago. Again, I would like to acknowledge Liza and Ninon, who did the lion's share of this work, all the simulations and the real-trait application results. And with this I finish, and I am happy to listen to your questions.
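To make the characteristic-function trick from the middle of the talk concrete: the actual model involves Bessel-type densities, but the principle can be shown with plain Gaussians in a toy sketch of my own (invented parameters, not the speaker's code). The characteristic function of a sum of independent variables is the product of their characteristic functions, and a numerical inverse Fourier transform recovers the density.

```python
import numpy as np

def cf_gauss(t, var):
    """Characteristic function of a zero-mean Gaussian: E[exp(i t X)]."""
    return np.exp(-0.5 * var * t**2)

# CF of X1 + X2 for independent X1 ~ N(0,1), X2 ~ N(0,2): just a product
t = np.linspace(-40.0, 40.0, 4001)
phi_sum = cf_gauss(t, 1.0) * cf_gauss(t, 2.0)

# Invert numerically: pdf(x) = (1/2*pi) * integral of exp(-i t x) phi(t) dt
x = np.linspace(-5.0, 5.0, 201)
dt = t[1] - t[0]
pdf = np.real(np.exp(-1j * np.outer(x, t)) @ phi_sum) * dt / (2.0 * np.pi)

# The sum is N(0, 3), so compare against its known density
target = np.exp(-x**2 / 6.0) / np.sqrt(6.0 * np.pi)
print(np.max(np.abs(pdf - target)))   # tiny numerical error
```

In the real method the final inversion is done with a fast Fourier transform rather than this direct quadrature, which is what makes the likelihood evaluation fast.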