Welcome back from the break. It's my pleasure to announce the speakers of the next short course. It's co-taught by Florence Demenais and Emmanuelle Bouzigon. Emmanuelle will go first, but I'll introduce them in reverse order.

Florence Demenais is a founding member of the network that underlies this series of summer schools. She is a global authority in genetic epidemiology, as demonstrated, for example, by the fact that she was president of the International Genetic Epidemiology Society in 2008. She has won several major prizes in the field: the INSERM Prix de Recherche Clinique in 2003 and an award in epidemiology from the French Academy of Sciences in 2010. She has been director of research for genetic variation in human diseases at INSERM for 40 years now. We are very happy to have her in our network, as one of the pillars that has supported our efforts for years. And we are also happy that she will present here today, together with Emmanuelle Bouzigon, who is also at INSERM and also a leading expert in statistical genetics and genetic epidemiology. She is a member of the Board of Directors of the French Federation of Human Genetics and a board member of the French Society of Human Genetics, as treasurer. They are both world-renowned experts in genome-wide association studies, and that's what they are going to talk about today. We are very happy to have you here. The floor is yours, Emmanuelle.

Thank you, Casting. So I begin. Florence and I will present GWAS and post-GWAS strategies to uncover the genetic mechanisms underlying multifactorial diseases. The talk is organized in two parts: I will present the GWAS part and Florence will present the post-GWAS part. In my presentation, I will give the principle of GWAS, then imputation and meta-analysis, two tools that contributed to the identification of trait-associated genetic variants.
Then I will show you how the integration of functional annotation of disease loci, using eQTL or epigenetic data, can help identify genes involved in disease susceptibility.

So what is GWAS? The goal of GWAS is to detect association between genetic polymorphisms and a trait, which may be qualitative, such as a disease status (asthma, diabetes or obesity), or quantitative, such as blood glucose level or BMI; we can also study time to disease onset. GWAS relies on specific genetic variants called SNPs, for single nucleotide polymorphisms. SNPs are DNA sequence variations occurring when a single nucleotide at a given position in the genome is altered. In GWAS, we perform association analysis with a panel of SNPs that are adequately spaced along the genome to capture most of the linkage disequilibrium between SNPs. Linkage disequilibrium between SNPs is, in fact, correlation between SNPs.

So what do we measure? We observe genotype data, we observe a disease phenotype, and we test association between this phenotype and the markers. The idea is that if the SNPs are adequately spaced along the genome, they can capture unobserved markers, that is, ungenotyped markers that are in reality the disease susceptibility loci, thanks to linkage disequilibrium, the correlation between SNPs.

GWAS were enabled by advances in high-throughput genotyping technology. With microarrays, we are now able to have access to more than 500,000 to one million SNPs genotyped in a single individual.

When you receive GWAS data, you receive a file that is not clean. So the first step is to perform quality control on both the individuals included in your genotype file and the markers. First, you check the samples. You compute what we call the call rate of individuals, that is, the proportion of successfully genotyped SNPs.
Subjects with a call rate below a threshold such as 95 or 97% are excluded. You will check heterozygosity: you compute global heterozygosity and specific heterozygosity for chromosome Y and chromosome X. You will search for cryptic and unexpected relatedness among the individuals of your study sample. And if you analyze family data, you will search for Mendelian errors too. Finally, using genotype data, you will perform a principal component analysis to correct for population structure. This PCA is done using a set of uncorrelated SNPs, around 80,000 SNPs that are genotyped in your data and are also available in reference panels such as HapMap. This principal component analysis allows you to detect outliers and correct for cryptic structure. The principal components extracted from this PCA can be used to adjust for stratification in the statistical tests that you perform in the GWAS analysis.

Indeed, population stratification is the main issue in GWAS. It is important to take it into account even within populations of European ancestry. As shown in this slide, there is a close correspondence between the principal components of genetic variation in Europe and the geographical map.

Once you have done the quality control on subjects, you can do the quality control on markers. Again, you compute the proportion of successful calls per marker, the call rate of SNPs, and you exclude SNPs if the missingness is above 3% or 5%. You test for Hardy-Weinberg equilibrium. Hardy-Weinberg equilibrium is, in fact, a fixed relation between allele and genotype frequencies. You check for any deviation from Hardy-Weinberg equilibrium in controls and exclude SNPs that show a significant deviation. And finally, you compute the minor allele frequency of each SNP.
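The marker-level checks just described (call rate, Hardy-Weinberg equilibrium in controls, minor allele frequency) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0/1/2/None genotype coding, and the default thresholds for minor allele frequency and the Hardy-Weinberg p-value are assumptions, not values given in the talk (only the 3% to 5% missingness range is). The Hardy-Weinberg check here is a simple 1-df chi-square goodness-of-fit test; exact tests are often preferred in practice.

```python
import math

def call_rate(genotypes):
    """Proportion of successful calls; None marks a failed genotype call."""
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)

def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    return min(freq_a, 1.0 - freq_a)

def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def snp_passes_qc(genotypes, control_counts,
                  max_missing=0.05, min_maf=0.01, hwe_alpha=1e-6):
    """Apply the three marker filters. Genotypes are coded 0/1/2 or None.

    control_counts is the (AA, AB, BB) count tuple in controls only.
    min_maf and hwe_alpha are illustrative defaults, not from the talk.
    """
    counts = [sum(1 for g in genotypes if g == k) for k in (0, 1, 2)]
    if sum(counts) == 0:
        return False
    missing = 1.0 - call_rate(genotypes)
    maf = minor_allele_frequency(*counts)
    _, hwe_p = hwe_chi2_test(*control_counts)
    return missing <= max_missing and maf >= min_maf and hwe_p >= hwe_alpha
```

For example, `hwe_chi2_test(30, 40, 30)` compares the observed counts with the 25/50/25 expected under Hardy-Weinberg proportions for an allele frequency of 0.5.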
So once you have cleaned your data, you have a genotype file and a phenotype file, and you merge both to finally perform the GWAS. The statistical method that can be used in GWAS depends on the study design. Do we have access to case-control data or family data? Are we interested in a binary trait, a quantitative trait, or time to disease onset? The standard methods used in GWAS are regression models, which can be logistic, linear or linear mixed models. These regression models have the advantage that they can include more than one predictor in the regression equation. In fact, when you test the effect of a SNP on the disease status, you can adjust this effect: you can incorporate in the regression model different covariates, such as lifestyle and environmental covariates that are known to influence the disease status. And you incorporate the genetic principal components to adjust for population substructure. Then you test the effect of the SNP on the disease status by performing a Wald test.

As in GWAS we test 500,000 SNPs, we have to use a stringent threshold to declare a signal genome-wide significant. Usually the threshold that is used is a p-value for the SNP effect of 5 × 10⁻⁸. Of note, the majority of GWAS are single-marker analyses: you consider one SNP at a time. We will come back to that in Florence's presentation.

Once the GWAS has been performed, you check for deviation between the observed and expected distributions of the SNP-disease association statistics. You compute what we call a genomic inflation factor, defined as the ratio of the median of the empirical, observed distribution of the test statistic to the expected median. And you can use this inflation factor to correct the standard error of your SNP effect and to recompute the statistical test of your GWAS.
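As a minimal sketch of the genomic inflation factor just described, assuming each SNP has a Wald statistic that follows a chi-square distribution with 1 degree of freedom under the null: the function names are hypothetical, and leaving the standard errors unchanged when the factor is below 1 is a common convention assumed here, not something stated in the talk.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def wald_chi2(beta, se):
    """Wald test statistic for one SNP effect: (estimate / standard error)^2."""
    return (beta / se) ** 2

def genomic_inflation_factor(chi2_stats):
    """Ratio of the observed median test statistic to the expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def inflate_standard_error(se, lam):
    """Genomic-control correction: scale the standard error by sqrt(lambda).

    Assumption: no correction is applied when lambda is below 1.
    """
    return se * math.sqrt(max(lam, 1.0))
```

Dividing each chi-square statistic by the inflation factor is equivalent to recomputing the Wald test with the inflated standard error.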
To present GWAS results, the easy way is a visualization method called a Manhattan plot, where you plot minus log10 of the p-value of the statistical test of the SNP effect along the genome. Each dot represents a statistical test, and you look at the results that reach the genome-wide significance threshold. What is often observed in associated regions is that the lead SNP at a given associated locus is accompanied by many adjacent SNPs that also show association, because of the LD structure, the linkage disequilibrium with the lead SNP.

The power of GWAS relies on two main characteristics: screening at a high level of resolution and a large sample size. To increase the SNP density, we can use imputation. To increase sample size, we can use a statistical tool called meta-analysis. So we can act on these two characteristics to increase the power of GWAS.

Imputation is done to increase SNP density. The strategy is to impute alleles for all missing SNPs, then analyze them as if they had in fact been genotyped. It is also a way to combine, through meta-analysis, different studies that have been genotyped using different arrays, that is, studies that have different SNPs genotyped across the genome. To perform imputation, you need access to a reference panel, such as the 1000 Genomes Project or the Haplotype Reference Consortium. The principle of genotype imputation is to use a sparse set of genotypes, the GWAS genotyping, and the haplotype structure of existing samples, such as the 1000 Genomes samples, to infer data for samples with sparse markers. In the first step, you identify regions of chromosomes shared between the study samples with sparse genotypes and individuals from the haplotype reference panel.
Then each study sample is phased, and its haplotypes are modeled as a mosaic of those in the haplotype reference panel. Two programs are mainly used to impute genotype data, IMPUTE and Minimac. Both are hidden Markov model-based methods. With IMPUTE and Minimac you can, in fact, impute the whole genome, but there are also specific programs to impute the HLA region. You can do the imputation on your own server, or you can use a free genotype imputation service such as the Michigan Imputation Server or the Sanger Imputation Server.

Once you have done the imputation, you have access to the same data on the whole genome in different studies, so you can combine results across studies, increasing the power for detecting significant results and decreasing the uncertainty of the estimated genetic effects. The statistical method to use in meta-analysis depends on the type of available data: the statistical model used to perform the GWAS, or the summary statistics shared by the collaborative groups. As we have seen, the statistical models mainly used in GWAS are regression models, so you will probably have access to the regression coefficient and standard error of the SNP effect on the disease. So you can compute the meta-analysis using a fixed-effect model, what we call the inverse-variance weighted model, or a random-effects model. The fixed-effect model assumes that the observed effects are estimates of a single effect. The average effect is computed by weighting each study's regression coefficient by the inverse of its sampling variance. This method has been extended to the random-effects model by adding a between-study variance to the sampling variance. This random-effects model allows the effect to vary across studies.
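The two meta-analysis models just described can be sketched as follows for a single SNP, given one regression coefficient and one standard error per study. The function names are hypothetical, and the between-study variance is estimated here with the DerSimonian-Laird moment estimator, which is an assumption: the talk does not say which estimator is used.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted average of per-study SNP effects."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)

def random_effects_meta(betas, ses):
    """Random-effects meta-analysis with a between-study variance tau^2.

    tau^2 uses the DerSimonian-Laird estimator (an assumed choice).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta_fixed, _ = fixed_effect_meta(betas, ses)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(weights, betas))
    c = total - sum(w ** 2 for w in weights) / total
    k = len(betas)
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    # Add tau^2 to each sampling variance, then reweight.
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    re_total = sum(re_weights)
    beta = sum(w * b for w, b in zip(re_weights, betas)) / re_total
    return beta, math.sqrt(1.0 / re_total), tau2
```

When the studies agree closely, tau^2 is estimated as zero and the two models give the same result; when they disagree, the random-effects standard error is larger.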
So in the following slides, I will give you some examples of how imputation and meta-analysis can increase the potential of GWAS to detect variants associated with asthma. One of the first meta-analyses performed worldwide was carried out in the context of the GABRIEL consortium, an EU-funded consortium that gathered more than 20 studies, including only individuals of European ancestry. This meta-analysis included more than 10,000 asthmatic cases and more than 16,000 controls. This GWAS, performed in 2010, was in fact done with genotyped data, not imputed data, and led to the identification of six asthma loci.

Several years later, a multi-ancestry meta-analysis of asthma was performed in the context of the TAGC consortium. In this meta-analysis, populations of European ancestry were analyzed and combined with populations of African, Japanese and Latino ancestry. More than 20,000 asthmatic subjects were analyzed and more than 118,000 controls were included. This is the largest meta-analysis of asthma GWAS that had been performed, and it identified more than 800 SNPs associated with asthma at the genome-wide significance level.

Another way to increase the sample size for GWAS is to use biobanks. On this slide, I just indicate some of the biobanks that are available worldwide; many more exist. One of the most used biobanks in GWAS is the UK Biobank. Last year, a GWAS of asthma was performed using UK Biobank data. But when you use a biobank, you have to keep in mind that often you consider a single population. In the UK Biobank data for asthma, only subjects of European ancestry were analyzed. Almost 40,000 asthmatic cases were analyzed, with more than 300,000 controls. This large study led to the identification of 61 distinct asthma loci.

So what did GWAS bring? GWAS were successful in identifying many new loci associated with many traits or diseases.
This is illustrated with this graph, retrieved from the GWAS Catalog, where each dot represents an association signal: a SNP found associated with a given disease at the genome-wide significance level. In the GWAS Catalog, almost 200,000 variant-trait associations are available. So GWAS bring new biological hypotheses regarding the pathophysiological mechanisms underlying disease. GWAS have also shown that distinct diseases or traits thought to be biologically unrelated share associations with genetic variants more often than expected by chance. We call that pleiotropy. We illustrate that with an example from asthma. The authors, members of the TAGC consortium, searched for overlap between asthma loci and GWAS Catalog signals by doing an enrichment analysis. This analysis identified a significant overlap of asthma loci with loci underlying immune-related diseases, which was expected for a disease like asthma, but also with other diseases with an inflammatory component, such as cardiovascular, neuropsychiatric or cancer diseases.

Finally, what is the biological function of the variants that have been found through GWAS? Indeed, once a GWAS has been performed, additional steps are required to identify the causal variant and the target gene in a given genomic region. But this identification can be difficult, because previous studies have shown that more than 90% of GWAS variants map to non-coding regions, and half of them map, in fact, to intergenic regions. So it may be difficult to identify which gene and which genetic variant is involved in the susceptibility to a given disease. Moreover, when we look at a regional plot, we can see that many genes may underlie the association in a given region. So in such a region, what is the causal SNP? What is the target gene?
Studies have shown that GWAS variants mostly reside in gene regulatory regions. So the incorporation of functional annotation to interpret GWAS results is a useful tool. The functional annotations that we can incorporate are eQTL data, expression quantitative trait loci data, so expression data; many databases with expression data are available. We can also use regulatory elements, such as epigenetic marks. I will give you different examples of how to incorporate such functional annotation.

One way is to integrate GWAS and expression data. Among the different publicly available databases with expression and eQTL data, there is the Genotype-Tissue Expression (GTEx) project, which was launched in 2010. The aim of this project is to build a catalog of genetic effects on gene expression across a large number of tissues. The specificity of this project is that it gathers many tissues: gene expression is measured in different tissues from the same donors. eQTL analysis has been performed for 49 tissues. We have to keep in mind when using these data that the donors from whom these tissues were extracted are deceased persons, more often male, with an age of more than 16 years, and most of them are of European descent. The GTEx project has found that more than 90% of common variants show nominal association with the expression level of at least one gene in at least one tissue. They have also found that gene expression is tissue- and cell-specific, and that more than one third of genes show sex-biased expression in at least one tissue. So when we want to use expression data, we have to pay attention to the tissue we choose to examine with regard to the disease we are studying.
So one way to integrate GWAS and eQTL results is to perform colocalization analysis. The principle of colocalization analysis is to estimate the posterior probability that the same variant is causal in both the GWAS and the eQTL study, while accounting for the uncertainty due to linkage disequilibrium. The statistical methods are mainly based on a Bayesian framework and, of note, these methods are applied to summary association statistics: you don't need to have access to the raw data. The Bayesian method derives the posterior support for each of five hypotheses describing the possible associations of the region with the GWAS trait and the eQTL. Either there is no causal variant for either trait in the region you are interested in; or there is a causal variant for the disease association, the GWAS result, only; or a causal variant for the gene expression association, the eQTL result, only; or there are two distinct causal variants in the region, one for each trait; or, what is more interesting, there is a single causal variant common to both traits, and we call that colocalization. We used such a colocalization method to explore the association that was found with asthma in the 5q31 region in the TAGC meta-analysis of GWAS.
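To make the five hypotheses concrete, here is a minimal sketch of how posterior probabilities can be derived from summary statistics alone, in the style of approximate-Bayes-factor colocalization methods. It is not necessarily the exact method used in the study mentioned: the function names, the prior standard deviation of the effect, and the prior probabilities `p1`, `p2` and `p12` are all assumptions, and the sketch assumes at most one causal variant per trait in the region.

```python
import math

def wakefield_abf(beta, se, prior_sd=0.2):
    """Approximate Bayes factor for association versus no association at one SNP.

    prior_sd is an assumed prior standard deviation of the true effect.
    """
    v = se ** 2
    w = prior_sd ** 2
    r = w / (v + w)
    z = beta / se
    return math.sqrt(1.0 - r) * math.exp(0.5 * z * z * r)

def colocalization_posteriors(abf_gwas, abf_eqtl, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of the five hypotheses, in this order:

    [no causal variant, GWAS only, eQTL only,
     two distinct causal variants, one shared causal variant].
    Both lists hold per-SNP Bayes factors for the same SNPs, same order.
    Works on the linear scale, so very large statistics could overflow.
    """
    s1 = sum(abf_gwas)
    s2 = sum(abf_eqtl)
    s12 = sum(a * b for a, b in zip(abf_gwas, abf_eqtl))
    support = [
        1.0,
        p1 * s1,
        p2 * s2,
        p1 * p2 * (s1 * s2 - s12),  # causal SNPs differ between traits
        p12 * s12,                  # the same SNP is causal for both
    ]
    total = sum(support)
    return [s / total for s in support]
```

With a strong signal at the same SNP in both studies, most of the posterior mass goes to the last hypothesis; if the two signals peak at different SNPs, it goes to the fourth one instead.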
Here, on the top right, you have in that region the lead SNP of the asthma GWAS; on the bottom you have the lead SNP of the eQTL analysis, in fact the lead SNP for the expression level of the NDFIP1 transcript. By doing colocalization, we showed that this latter SNP, the eQTL lead SNP, has the highest colocalization posterior probability, so it is probably the causal SNP in that region. This SNP is thus an eQTL for NDFIP1, which has a negative regulatory function in IgE-dependent mast cell activity, a clear mechanism in allergic asthma.

Different methods can be used to integrate eQTL data. In the context of the TAGC consortium, they used a conditional analysis to dissect a complex region associated with asthma, the 17q12-q21 region. In that region, two main association signals were identified. One was found in the whole sample, and this signal was located in the ERBB2-PGAP3 region. The second signal was located in the ORMDL3 region but was found in the pediatric subgroup, meaning it was found for childhood-onset asthma. The ORMDL3 region is located 180 kb from the first region. So in that paper they performed conditional analysis using the lead asthma SNPs and the peak eQTL SNPs for the gene transcripts in blood and lung tissue. They found that the lead pediatric asthma SNP accounts for the association of the peak eSNP with the ORMDL3 transcript in blood, while
the asthma SNP found in the whole population accounts for the association of the peak eSNP with the PGAP3 transcript in lung. So this analysis shows that asthma-associated SNPs in these two blocks may affect asthma risk through the expression of different genes in different tissues.

Finally, functional annotation of disease-associated loci can also be done by integrating genetic and epigenetic data, and we show you how the integration of DNA variation and epigenetic data helped pinpoint a gene involved in the combined asthma-plus-rhinitis phenotype. Epigenetic mechanisms, such as DNA methylation, histone modification or microRNAs, are involved in the regulation of gene expression, and it has been shown that epigenetics plays a role in immune response and thus potentially in asthma. Among the different epigenetic mechanisms, there is imprinting. Genomic imprinting can be defined as follows: the effect of an allele at a given locus on the disease differs according to which parent transmitted this allele. For example, if the allele was transmitted by the father, it can have an effect, whereas if it is transmitted by the mother, it has no effect on the disease. The converse can also be found.

So in this study, we searched for genetic variants associated with asthma and allergic rhinitis. First, we performed a genome-wide linkage analysis in European families and identified a region that was linked to the asthma-plus-rhinitis phenotype under a parent-of-origin effect. Second, we fine-mapped this region and performed association analysis with SNPs in that region, in a discovery sample and a replication sample that only included families. Finally, we selected the SNPs associated with the asthma-plus-rhinitis phenotype under a parent-of-origin effect, and for these SNPs we tested association with DNA methylation in the same French-Canadian families that participated in step 2, and we performed causal inference tests to investigate whether the association
between the SNP and the phenotype can be explained by DNA methylation.

So first, we performed linkage analysis of asthma and rhinitis and identified a region located on chromosome 4 that was linked to the asthma-plus-rhinitis phenotype. We observed increased evidence for linkage of the DNA markers with disease when taking into account a parent-of-origin effect, in that case transmission of risk from the father. Then, in that region, we performed association analysis of asthma plus rhinitis with more than 1,000 SNPs. Once again, this association analysis was performed by taking into account the parent-of-origin effect, and we identified, through the pooled analysis of the EGEA and SLSJ families, that a marker located near the MTNR1A gene is associated with the asthma-plus-rhinitis phenotype. What we identified is that the paternally transmitted G allele of the SNP rs10009104 is associated with the asthma-plus-rhinitis phenotype.

So at the third step, we conducted association analysis of DNA methylation with this particular allele under an imprinting model, using some of the French-Canadian families. We found an association of this allele, when transmitted by the father, with methylation at a particular CpG site, and there was no association when it was transmitted by the mother. So by performing this association analysis of DNA methylation, we were able to identify a paternally transmitted G allele that is associated with the DNA methylation level at the CpG site. We also tested the association of the CpG site with the disease, and we found that this methylation is associated with the disease.

So finally, to explore whether DNA methylation mediates the association between the SNP and the phenotype, we conducted a causal inference test by checking four conditions. First, we showed that the G allele at the SNP is associated with the phenotype. Then we showed that this SNP is associated with the level of methylation at the CpG
site, even when adjusting for the phenotype. At the third step, we showed that DNA methylation is associated with the phenotype, and this association is kept when adjusting for the genotype. Finally, the last test we performed was to check whether the G allele of the SNP is associated with the disease when conditioning on methylation, and what we showed is that we no longer see the association when we adjust for the level of methylation. So by doing this causal inference test, we have shown that the effect of the paternally transmitted G allele on the asthma-plus-rhinitis phenotype is mediated by differential DNA methylation.

So by integrating genetic and epigenetic data, our study has demonstrated that a methylated CpG site within the MTNR1A gene mediates the effect of a paternally transmitted genetic variant on asthma and allergic rhinitis comorbidity. The MTNR1A gene is of particular interest in asthma because it is expressed in B and T cells and it is involved in melatonin action, and melatonin has immunomodulatory effects in allergic diseases. So MTNR1A is a good candidate gene for asthma and rhinitis. So now Florence will continue the presentation.

So I am going now to talk about the post-GWAS strategies. As Emmanuelle said, what did GWAS bring? They brought many loci associated with many traits, and all these loci are listed in the GWAS Catalog. An important finding of GWAS is that SNPs or genes are common to various traits more often than expected, suggesting pleiotropy. GWAS also generated new biological hypotheses on new pathophysiological mechanisms. GWAS have brought a wealth of information, but GWAS are based on single-SNP analysis, as Emmanuelle told you, so SNPs with low marginal effects that interact with other genetic or environmental factors may be missed. Another main finding is that the SNPs identified explain only part of the disease heritability, and the biological functions of the disease-associated
variants are mostly unknown. Emmanuelle gave you two examples where it was possible to infer the causal variant. It is progressing right now, but in many instances we don't know what the target gene or the causal variant is. So there is a need to develop strategies and methods that allow integrating biological knowledge, such as pathway-based or network-based methods, and to test for gene-gene and gene-environment interactions. And what can be the use of these GWAS? I will show at the end that GWAS summary statistics can be used for disease risk prediction, and this is very topical right now.

So in my presentation I will show that integrating GWAS outcomes and pathway information can lead to the identification of pathways enriched in disease genes, which provides some biological insight and also facilitates the detection of gene-gene interactions. Then, network analysis of GWAS outcomes can lead to the identification of interconnected gene modules influencing the disease. And finally, I will present polygenic risk scores, which can improve disease risk prediction.

The principle of pathway analysis is to integrate the GWAS summary statistics with pathways. As input, you have your single-SNP results, the p-values from your GWAS. The next step is to map SNPs to genes using the genome assembly and SNP positions, then to compute a gene statistic, then to map your genes to pathways, for example Gene Ontology categories, to compute a pathway statistic, which is an enrichment score, and then to compute the significance of your enrichment score by permutation.

For mapping SNPs to genes, the most common way is to use location: to map SNPs to genes if they lie within the gene boundaries, or to extend the location to 20 kb or 50 kb from the gene boundaries. We can also assign SNPs based on the linkage disequilibrium pattern, or, more and more now, it is proposed to map SNPs to genes using functional annotation, but this is sometimes
quite difficult to do at the genome-wide level.

To compute a gene statistic, various methods have been proposed. We can take the best SNP statistic within the gene, but, as you know, because of the LD pattern, we have to compute an empirical p-value by permutation. It is also possible to combine the different SNP statistics in a gene, for example to take all the SNPs in a gene, or the best SNPs in a gene, or a set of SNPs meeting a given threshold, but then we also have to correct for the correlation between SNPs; there is a package called VEGAS which can do that. It is also possible to use the raw data.

In terms of pathways, we can have access to many databases. The most commonly used is the Gene Ontology, which is a collection of controlled vocabularies. The issue is that it is a kind of tree, with many Gene Ontology categories connected to each other, so there is a high redundancy of gene sets. So in some packages we have to cut this tree at a given level, for example level 4 of the Gene Ontology database. Another well-known pathway database is KEGG, which is made of functional and metabolic pathways indicating the relationships between genes. And there is also MSigDB, the Molecular Signatures Database, which aggregates data from Gene Ontology and KEGG but also other databases such as BioCarta and Reactome.

The pathway statistic is an enrichment score. You have your genome-wide SNPs, which have been mapped to genes; here in red are the genes belonging to a given pathway. You rank your genes according to the strength of their association with the disease, going from the most strongly associated to the least associated. Then, if the gene is part of the pathway, the enrichment score is increased, and the increase is proportional to the gene statistic, whereas if the gene does not belong to the pathway, here for example P3 does not belong to the pathway, so
the enrichment score is decreased and this decrease is proportional to the number of genes which do not belong to the pathway so you do that along the list of genes you have your enrichment score which is either increasing or increasing and it reaches a maximum and you stop when it reaches the maximum and the genes driving the enrichment of the pathway are called leading genes so but there are two issues the enrichment score is influenced by the size of the pathways the number of genes in the pathway but also the non-independence of genes for example we know that there are gene cluster genes which are located close to each other because they have the same function for example in the case of the HLA region there is a second issue which is multiple testing like for the steps we have many pathways which are tested so there is a multiple testing due to the number of pathways tested but also the non-independence between pathways because pathways share genes there are genes which belong to a certain number of types of pathways so we have to permutation to compute empirical p-value so we have to generate the null distribution of end response score using permutation for example 100,000 permutation or more we can normalize the end response score by computing by removing the mean and dividing by standard deviation obtained under the null distribution and we can compute the empirical significance and then do multiple testing correction by computing the first discovery rate there are different types of permutation which can be done the gold standard is to use phenotype permutation so if you have access to the raw data it's possible to do it but you need to have access to the phenotypic and xenotypic data and very often people it's difficult to share this data so we don't have to have access we can't have access to this data and also it's highly time consuming we could do permutation of SNP statistic so this is computationally efficient but this does not preserve the linkage of 
disequilibrium among SNPs or the correlation among functionally related genes which are located close to each other on the genome. We can also permute gene statistics, which is also computationally efficient, but it ignores the variation in the number of SNPs assigned to a gene. Comparisons of these types of permutation have been made by simulation, and they show that the gene statistic permutation does not perform well, there is an inflated type 1 error rate, whereas the SNP permutation provides results more similar to the phenotype permutation. There is a very interesting method which has been proposed by Cabrera and colleagues, which is called circular genomic permutation, and that method allows you to preserve the linkage disequilibrium pattern and the clustering of genes. Indeed, they consider the genome as being circular: SNPs are ordered according to their genomic position, and you can generate circular genomic permutations by rotating the single-SNP statistics with respect to their genomic location. Postdoctoral students in my lab, Myriam Brossard and Amaury Vaysse, compared this type of permutation, the circular genomic permutation, to phenotype permutation and SNP permutation, and they found that circular permutation was performing much better than the SNP permutation and provided results similar to the phenotype permutation. So this is the type of method we are now using.
I will now present an example. We applied this pathway-based analysis to two melanoma GWAS, one conducted in a French dataset, the MELARISK consortium, which included 4,000 individuals, and one in a US dataset, the MD Anderson Cancer Center (MDACC) study dataset, which included almost 1,000 individuals. There were some common association signals on the genome between the two GWAS, but we wanted to see what was going on at the pathway level. The data were imputed using HapMap 3, using imputation as Emmanuel talked about before. The SNPs were assigned to
20,000 genes, so there were 20,000 genes with SNPs, then to Gene Ontology categories, and finally there were 316 GO categories in one dataset and 319 in the other dataset. We found GO categories which met a false discovery rate of less than 5%: 28 in MELARISK, 27 in MDACC, and five of them were shared by the two datasets. Three of these pathways were known to be involved in melanoma. The first is response to light stimulus: melanoma is a cutaneous cancer and we know that UV sun exposure is very important, so this pathway is known to be associated with melanoma. Also regulation of mitotic cell cycle: there is a well-known gene, CDKN2A, which belongs to that pathway and which is known to predispose to melanoma. Induction of programmed cell death is also known to be associated with melanoma. But at the time, two of the pathways were not known to be associated with melanoma: cytokine activity and oxidative phosphorylation. What was interesting is that cytokine activity relates to the immune response, and we know now that there is a success of the development of immunotherapy in melanoma, so we found that this finding was interesting. As for oxidative phosphorylation, the recent GWAS of melanoma also now point to genes involved in that pathway.
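As an illustration of the circular genomic permutation described above, here is a minimal Python sketch. It is not the speakers' implementation: the pathway score used here (the mean of the best SNP statistic per gene) is a simplified stand-in for the enrichment score, and all function names, the toy SNP statistics and the gene-to-SNP mapping are hypothetical. The only point it tries to show is that rotating the position-ordered SNP statistics keeps neighbouring values together, which is how the local LD pattern and the clustering of genes are preserved under the null.

```python
import random


def circular_shift(stats, offset):
    """Rotate position-ordered SNP statistics by `offset`, treating the genome as a circle."""
    offset %= len(stats)
    return stats[offset:] + stats[:offset]


def pathway_score(stats, gene_to_snps, pathway):
    """Simplified pathway score: mean over pathway genes of the best SNP statistic per gene."""
    best = [max(stats[i] for i in gene_to_snps[gene]) for gene in pathway]
    return sum(best) / len(best)


def empirical_p(stats, gene_to_snps, pathway, n_perm=1000, seed=0):
    """Empirical p-value of the pathway score under circular genomic permutation."""
    rng = random.Random(seed)
    observed = pathway_score(stats, gene_to_snps, pathway)
    hits = 0
    for _ in range(n_perm):
        # A random non-zero rotation; the SNP-to-gene assignment stays fixed.
        offset = rng.randrange(1, len(stats))
        permuted = circular_shift(stats, offset)
        if pathway_score(permuted, gene_to_snps, pathway) >= observed:
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 20 SNPs ordered by position, signal concentrated in SNPs 0-3.
stats = [5.0, 6.0, 7.0, 8.0] + [1.0] * 16
gene_to_snps = {"A": [0, 1], "B": [2, 3], "C": [10, 11], "D": [15, 16]}

p = empirical_p(stats, gene_to_snps, pathway=["A", "B"])
print(f"empirical p-value for pathway A+B: {p:.3f}")
```

In a real analysis the statistics would cover the whole genome and the number of permutations would be much larger, as mentioned in the talk; the toy example only has 19 distinct rotations.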
Pathway analysis can provide information on the pathways which are biologically relevant in the disease, but it can also be used to detect gene-gene interaction. If you test gene-gene interaction at the genome level, and I will show you that in the slide, the threshold to declare a gene-gene interaction is on the order of 10 to the minus 12 or minus 15. You know that for single-SNP analysis it was 10 to the minus 8; here you have to be even more stringent. So it's very difficult to detect gene-gene interaction at the genome-wide level if you do that analysis by itself, but if you do it within a pathway-based analysis framework, it will facilitate the detection of gene-gene interaction. So we did that in these two datasets, MELARISK and MDACC. We tested for SNP-SNP interaction using the INTERSNP software: you have a logistic model in which you have the SNP 1 effect, the SNP 2 effect, plus an interaction term. We selected the SNP pairs with a p-value for interaction at the 10 to the minus 4 level, we followed up these SNP pairs in the MDACC dataset, and then we performed a meta-analysis of the interaction effects. We evaluated the statistical significance using a hierarchical procedure to correct for multiple testing. Because the SNPs are correlated, the SNP pairs are also correlated, and so we have to compute the number of independent SNP pairs. For that we used an extension of the Meff approach. This approach has been developed for single SNPs, and we extended it to the correlation matrix of the interaction terms.
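To give a feel for the orders of magnitude mentioned above, the following Python sketch computes plain Bonferroni thresholds for single-SNP tests and for all pairwise SNP-SNP tests. The figure of one million SNPs is an assumption chosen for illustration, not a number from the study, and the function names are hypothetical. It also treats all tests as independent, which is the situation the effective-number correction is meant to improve on.

```python
def n_pairs(n_snps):
    """Number of distinct SNP pairs that can be formed from n_snps SNPs."""
    return n_snps * (n_snps - 1) // 2


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold for n_tests tests at family-wise level alpha."""
    return alpha / n_tests


N_SNPS = 1_000_000  # assumed number of SNPs, for illustration only

single_snp = bonferroni_threshold(N_SNPS)
pairwise = bonferroni_threshold(n_pairs(N_SNPS))

print(f"single-SNP threshold: {single_snp:.1e}")
print(f"SNP-pair threshold:   {pairwise:.1e}")
```

Under these assumptions the single-SNP threshold is about 5e-8 and the pairwise threshold is about 1e-13, which is why restricting the search to SNP pairs within a few pathways relaxes the threshold so much.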
Now, for single SNPs, it does a principal component analysis of the LD matrix and it uses the principal components to compute the effective number of independent SNPs. Here we did the same to compute the effective number of independent SNP pairs. Then we computed the significance threshold for a given pathway using the Bonferroni correction, by dividing the 5% level by the effective number of SNP pairs for a given gene pair, summed over all gene pairs within the pathway. So this is the pathway threshold. Then, because we had five pathways which were significant in the two datasets, we computed the overall threshold, which is the threshold for the pathway divided by the number of pathways tested. This is to show you the gain of using that strategy as opposed to a genome-wide interaction study. If we had to consider all possible SNP pairs across the genome, the threshold would be 5 times 10 to the minus 14. If we had to consider all the SNP pairs over the five pathways, it would be 10 to the minus 9; with the effective number of SNP pairs over the five pathways, 2.5 times 10 to the minus 8; and if we use the hierarchical correction, we have a threshold for the p-value between 3 times 10 to the minus 7 and 7 times 10 to the minus 9, which varies according to the size of the pathway. By using that procedure we were able to detect a significant SNP pair over the 5 million SNP pairs tested, in the regulation of mitotic cell cycle GO category. The SNPs belong to two genes, TERF1 and AFAP1L2. As you can see on this slide, the risk associated with the AFAP1L2 gene increases when the genotype at the TERF1 gene is CT, whereas it decreases when the genotype at the TERF1 gene is CC. So it's called a flip-flop, or inverse, pattern of interaction. These two genes have particular relevance. Indeed, TERF1 is the telomeric repeat binding factor; it's part of the shelterin complex, which is a complex of
proteins which are at the telomeres, at the ends of the chromosomes, and are involved in telomere biology, which is known to play a major role in aging but is also associated with many cancers; at the time it was an emerging process in melanoma. The other gene is an actin filament associated protein, and it promotes the growth of cancer cells. Interestingly, we found a paper describing a screening experiment of 12,000 proteins which reported in fact an interaction between these two proteins, thus confirming the result of our strategy.
This is to show that it's very interesting to use complementary strategies to discover the genes in a complex disease such as melanoma. Indeed, these are the genes of the shelterin complex I talked about. There are six genes involved in the shelterin complex. Some of these genes were discovered by exome sequencing in familial melanoma, so only in familial aggregation of melanoma: these are TERF2IP, ACD and POT1. TERF1 was found by the pathway and gene interaction analysis, and these genes also interact with a gene which was discovered by GWAS. So this shows that by combining different strategies, exome sequencing, GWAS, pathway and gene interaction analysis, you are able progressively to disentangle the mechanisms underlying a complex disease.
The second strategy is network analysis. In this analysis you can integrate biological knowledge, for example a human protein-protein interaction network, with GWAS results, single-SNP p-values, to find subnetworks, gene modules, enriched in association signals. I will present an example for asthma, using the data from the GABRIEL consortium. This was done by Yuanlong Liu, who was previously a student in the MLPM network. The rationale for this type of network analysis is guilt by association: genes that are interconnected in a network are thought to be more likely to share similar function and thus to be involved in the disease process. So the goal is to find subnetworks enriched in disease
associated genes. We used GWAS outcomes from the GABRIEL asthma consortium that Emmanuel has already presented. We focused on childhood-onset asthma. There were 18 childhood-onset asthma studies, and we decided to split them into two datasets, one dataset with nine studies and another dataset with the remaining nine studies. There were about 3,000 cases and 3,000 controls in each set. We did this because we wanted, as I will show you, to be able to use these two datasets separately in order to find consistent results. The meta-analysis included millions of SNPs, which were imputed using the HapMap 2 panel. The protein network which was used was the PINA platform, which integrates six public curated databases of protein-protein interactions; altogether there are 17,000 proteins and 166,000 interactions.
The overview of the network analysis is as follows: assign SNPs to genes to get a gene-based p-value; then load these gene-based p-values on the network, so as to weight the PINA network; then search within this scored network for subnetworks, so search for gene modules within each dataset; then select the gene modules which are shared by the two datasets and intersect these gene modules; then evaluate the significance by permutation. We assigned SNPs to genes according to their location, as for the pathway analysis, but it would be possible in the future to also use functional annotation and eQTLs. For computing the gene statistic we used the best p-value, but we corrected that value to take into account the gene length and the LD pattern, using the circular genomic permutation I presented before. Sometimes it takes time, because you have to compute 100,000 or more permutations, and Yuanlong has developed an exact algorithm, implemented in fastCGP, which is available on GitHub and which can be used very easily to get the corrected p-value. To select the subnetworks we used the dense module search algorithm which has been proposed by Jia and colleagues, and for that we have
to compute the module score. The module score is the sum of the gene weights, the gene weight being derived from the p-value, divided by the square root of the number of genes. You start with a seed gene in your network, you add the direct interactors to the module, and you compute the module score each time. If the module score is increased by at least 10%, you continue, and if it's less than 10%, you stop. You do that for each seed, I mean you use as a seed every gene you have in your gene network, and so you have many modules at the end. We did that in dataset 1 and we did that in dataset 2, but as I told you there are many modules which are generated in each set. So we computed all pairwise module similarities and we selected the modules which were most similar across the two datasets, so modules showing consistent results across the two datasets. We selected the 10 most similar modules and then we intersected these selected modules in order to have a consistent subnetwork from both datasets. We computed the module significance by permutation: the significance of the module score compared with the background network, and also the significance of the association of the module with asthma, by doing circular genomic permutation as before.
What did we find? We identified a network including 91 genes which was significantly associated with asthma at the 10 to the minus 5 level. This module included 19 genes which were associated with asthma at the nominal p-value in both datasets; of these, 13 were already known and 6 were new genes. As you see on this slide, this network has a very interesting structure. There was a core network, and in this core network all the genes in red are very well-known genes associated with asthma at three different loci: some on chromosome 17 that Emmanuel has presented, GSDMA, GSDMB, ORMDL3; another one on chromosome 9, IL33; and some others on chromosome 2, IL18RAP. So all these known genes were
connected to the APP gene, the amyloid beta protein gene, which is known to be a gene predisposing to Alzheimer's disease. I will discuss this finding later on, because we found it interesting. Besides this core subnetwork, there were five peripheral subnetworks which were all connected to the core network and which were also functionally meaningful, and some of the genes, in blue, are new genes which have a biological function relevant for asthma.
We were struck indeed by this finding that the known asthma genes were all connected to the APP gene. When we searched the literature, we found that indeed there are several lines of evidence showing a relationship between asthma and Alzheimer's disease. Epidemiological studies have reported an increased risk of Alzheimer's disease in patients with asthma. There are some genetic factors involved in Alzheimer's disease which are immune-related or involved in inflammatory processes. There is a genetic signature found in a mouse model of AD which indicates that the two main pathways are neuronal and immune response. And an asthma drug was found to have a beneficial impact in an AD mouse model. So all these lines of evidence indicate that there may be some pleiotropy between asthma and Alzheimer's disease, and studies are now showing that immune-related mechanisms are important in both psychiatric and neurodegenerative diseases.
Then we did a functional analysis using the DAVID package in order to find functionally related modules. We found five functionally related modules: the immune response, in black, which is spread out across the whole network, and three other functional modules which indeed co-localized exactly with the subnetworks identified. For example, in red, this is a chemokine cluster, which is also relevant for asthma; they are all here. In green, a validation cluster, which is also known to play a role in asthma, a role in the epithelial barrier, and in that cluster there are some genes, for example MICB and
MICA, which have indeed been found later on to be associated with asthma; and also the zinc finger transcription factors, which may play a role in gene regulation.
So the network analysis allows us to identify new candidate genes that were not detected by the GWAS, but they need to be confirmed by follow-up studies; as I told you, some of these genes have now been confirmed. The core asthma genes were connected to APP, showing pleiotropy, and the functional analysis revealed 14 clusters which are biologically meaningful for asthma and generate new hypotheses that can be further validated by functional studies.
Finally, I will discuss how GWAS summary statistics can be used to compute polygenic risk scores and how they can contribute to predicting disease risk. The polygenic risk score is a weighted sum of the risk alleles an individual carries, the weight being the SNP effect size estimated by a previous GWAS. So the PRS for individual i is the sum, over SNPs j, of the allele count for individual i at SNP j times beta j, beta being the effect size estimated from the GWAS. The classical method used to compute a PRS is called clumping and thresholding. You have to sum over independent SNPs, so you have to do some clumping in order to eliminate the SNPs which are in linkage disequilibrium, and it is not known which set of SNPs should be used: either the SNPs over the whole genome, or the SNPs meeting the genome-wide significance level, or an intermediate number of SNPs. So people have proposed to try different thresholds in order to select the PRS providing the best prediction accuracy. But other methods have also been proposed, based on penalized regression, to take into account the LD between SNPs; this is implemented in the lassosum software. There is also a Bayesian approach which is broadly used, which is the LDpred software, and extensions of LDpred have recently been proposed to integrate functional annotations.
The first use of the PRS was indeed a long time ago, by Purcell, for schizophrenia. At the time there were no SNPs meeting
the genome-wide significance level, so he assumed that schizophrenia was rather polygenic, due to a large number of SNPs with small effects. He constructed a PRS in a first schizophrenia study and tested the PRS in a second one, and was able to establish an association between that PRS and schizophrenia. PRS were also used to detect pleiotropy: a PRS constructed using GWAS results from schizophrenia was found to be associated with bipolar disorder, showing a common genetic contribution between schizophrenia and bipolar disorder.
But now PRS are mainly used to predict risk. One of the first examples which led to some interesting results was one computed for coronary heart disease, which showed that the PRS was able to identify individuals with a high risk equivalent to that of a monogenic mutation. In that study the PRS was derived using GWAS summary statistics and a reference panel using 1000 Genomes European data. They computed 31 candidate scores based on the clumping and thresholding method but also on LDpred. Then they computed these 31 PRS in the UK Biobank phase 1 data, including 120,000 individuals, and they chose the best PRS, the one providing the maximum area under the curve. Then they tested that PRS in the UK Biobank phase 2 and they assessed the association of the best PRS with the disease. As you see in this slide, to get a good PRS you have to have a large dataset for your GWAS and a large dataset for your testing. They computed the odds ratio at different thresholds: they divided the PRS into percentiles, and when they looked at the top 1% as compared to the remaining 99%, they found an odds ratio of 4.8, which was highly significant. As you see, the odds ratio is increasing depending on the threshold you use to select your individuals according to their PRS value.
Recently there have been many, many studies; it's a very hot topic, and many people have computed PRS for a large number of diseases. This was done with lung function to predict the risk of COPD. COPD is chronic
obstructive pulmonary disease; it is defined by a low FEV1 over FVC. These are measures of lung function: FEV1 is the forced expiratory volume in one second, FVC is the forced vital capacity. They used very large GWAS data from SpiroMeta, which is an international consortium, and they merged that with the UK Biobank. They were using GWAS data on 7 million SNPs. They derived their PRS in one dataset using the lassosum method. There were 1.5 million SNPs for the FEV1 PRS and 1.2 million for the other lung function measure, and when they combined the PRS from the two measures, FEV1 and FEV1 divided by FVC, they had a PRS of 2.5 million SNPs. They tested that PRS in nine population-based and case-control studies. They computed the odds ratio for COPD for each decile versus the first decile, and they observed an odds ratio for the last decile of 7.9 as compared to the first decile in European ancestry individuals, and 4.9 in non-European individuals. Indeed, the GWAS are conducted mainly in European subjects, so the PRS may provide not such good results in non-European populations, and efforts have to be made to get more data in non-European ancestry individuals. At the bottom, this is to show that PRS can improve disease prediction, but other risk factors, classical clinical risk factors, have to be taken into account; in that case it was age, sex and smoking. So in green the PRS, in blue the risk factors, in red the PRS plus risk factors. As you can see, the AUC is always better when you use both the PRS and the risk factors.
So, to conclude on the use of PRS. In disease prediction, for prevention strategies: for example, for COPD we can reinforce smoking cessation in those with a high PRS. In treatment choice: for example, it has been shown that there is a bigger benefit of statins in subjects with a high genetic risk for cardiovascular disease. But a lot of development has to be made; this is one example, but right now it's not well known how PRS
can be used in the choice of treatment. PRS can also be used to refine the penetrance of high-risk variants. But PRS also has limitations, as I pointed out. It depends on the dataset in which it was computed, so on the ethnicity and the environmental and lifestyle factors of the individuals, and it may not be applied right away to another population. Careful evaluation of the value of the PRS is needed before using it in personalized medicine, and it's also probably important to combine PRS with other clinical and environmental risk factors.
So, in conclusion, GWAS data provide a wealth of information. GWAS data coupled with biological information can provide insight into the biological mechanisms, and PRS may contribute to disease prevention and therapeutic intervention, but right now they need to be used with caution. I acknowledge the various groups which have been involved in the studies of Emmanuel and myself, and our sponsors. Thank you very much.
We thank you, Emmanuel and Florence, for this presentation. That was wonderful. Now it's time for questions. Are there any questions from the network? Please raise your hand. So there's one: Giovanni, please go ahead.
Hi, thank you again for the talk. The question that I have is probably very basic, but you mentioned that you often make testable predictions that need to be validated. What does the validation process usually look like? Is it in vitro validation? Is it additional population studies?
Right now the validation is mainly statistical. It's true that it could also be done using experiments, but it's statistical: to know if it improves the prediction, or computing an odds ratio. It's done like that, at a statistical level.
Thank you. Thank you. There is a question from Slido, I think it is for Emmanuel: "Thanks for the talk. Is the distribution of GWAS hits between coding and non-coding regions different than expected by chance?" That was posted during Emmanuel's part of the talk, so it's probably addressed to you.
Sorry, there was some work in the building, so it's difficult
to hear, but I think I heard the question. Yes, in fact, the fact that we found association signals more in non-coding regions was really a new result. It was not expected, because we were thinking of Mendelian diseases, where the effect of a variant is due to a change in the protein structure. In GWAS, in fact, we found the signals in non-coding regions, but in regulatory regions, so that these SNPs will alter the gene expression level and not the structure.
Good, thank you. Is there another question from the network? Yes, Lukas, please go ahead.
Thank you very much for your very interesting talk. I just have a question out of curiosity. As you mentioned, the quality and the generalizability of the results depend on the amount and the quality of the data that you collect, and all the biobanks and data centers I know of are in developed countries. So I wanted to know if you know of any initiative from third world countries that are generating data of this kind.
Yes, I can answer that question. Indeed, right now in the US, because of the PRS, they realize it cannot be applied well to non-European populations, so they are making efforts now to develop research, at least in the US. These are admixed populations, but they are developing efforts in the US to get more and more data in African Americans, Latinos and so on, so in admixed populations. But I guess there will also be some effort developed, perhaps, in Africa. There is a lot of effort, for example, in Asia: in China and in Japan there are very important efforts for building biobanks, so Asian populations are covered. The issue is for Africa, but efforts are being developed in collaboration with UK or US collaborators.
Thank you. I don't see a new question on Slido. Is there another one from the network? If not, then I may take the opportunity and ask Florence one question. When you perform a network analysis, you described this principle very well, and you do not find anything, it could be that
either the network is of too poor quality, that you don't know the network structure well enough, or it could also be that your genetic sample, like the number of individuals, is too small to find a strong association. So there may be datasets where you don't find a network GWAS hit now, but over time, as network data gets better or as genetic datasets get larger, you suddenly discover something. I wonder whether you have ever seen such a scenario, where over time, with the dataset growing or the network data getting better, you suddenly get a network GWAS hit that you didn't observe in the first place.
I did not observe that, but I think what you say, yes, I think so. Of course, if you increase your biological knowledge, the more information you will have. Indeed, as you know, Yuanlong used another method, SigMod, which he applied to STRING, for example. Here we used protein-protein interaction, but there are other networks which include protein interaction, gene expression, text mining. So the more information you use, the better chance you have, for example, to find a network. But you also have to use accurate information. The issue is that text mining is an interesting method, but sometimes the relationship between genes is not so accurate. So you have to increase the information, but you also have to be careful about using accurate information. So there is a balance between size and accuracy.
That's very right, yes, I agree. Good. Are there further questions? Not on Slido, not in the network. Then I take the opportunity to thank both of you for a wonderful presentation. Thank you very much, Emmanuel and Florence.