So for me, it's good morning, but I'm sure it's good afternoon and good evening to many of you. Roderick, thank you so much, you and your colleagues, for inviting me and for a very gracious introduction. I'm really delighted. I have to say, perhaps unlike many of you, when ENCODE began I was uncertain as to how it might impact our own work. And I can tell you that, beginning with some simple observations we made, till today we are bigger and bigger users of ENCODE than of probably any other resource. I think it's really a remarkable piece of foresight by a number of people that has led to ENCODE being so important in the work of not only our lab but many, many others as well.

So I'm just going to give you an overview. It's not a very long time, which is as it should be, and I'm going to try to convince you that we are now at really a new stage, where we can begin to understand not only regulatory control at the level of the sequence, which many of you do, but how we can use what you do to go from molecules to the explanation of complex traits. For a very long time, as you know, the history of GWAS has been such that we've mapped, but mapping is not the same thing as finding the specific components and explaining those components.

So I'm going to begin with a problem that many of you know. It's really about a hundred years ago that Ronald Fisher first proposed the idea that complex traits, multifactorial traits, were not in violation of Mendel's rules but in fact arose from Mendelian behavior, the additive behavior of alleles across many, many genes. The fact that many genes collaborated to produce a phenotype was considered a very radical idea in those days. But by 1920 (the work was not published in time because of World War One), Altenburg and Hermann Muller had some beautiful work dissecting a complex trait in Drosophila called truncate wing into its multiple components, one on the X chromosome and two on autosomes. He also showed, or rather they
showed, that the X-chromosome component was independent of the sex-determining factor. So a hundred years later, with the profusion of GWAS studies, we now know that Fisher was correct: there are genes and factors that determine the additive behavior of traits across almost every part of the genome, spread more or less uniformly in proportion to the size of chromosomes. And so the question is how this comes about, and what really is its molecular basis.

In order to contextualize it, more for my sake and for the way in which our work has proceeded, I'm going to give you initially a somewhat simpler example, which came about through a study of Hirschsprung's disease, a phenotype of aganglionosis. I'm not going to go through all of the slides in detail; you will have them and can do that when you have time. Hirschsprung's disease is essentially a disease of the absence of enteric ganglia. It's an aganglionosis, a disorder of the gut in which a portion of the gut is not innervated; it doesn't have neurons. Here are the layers of the gut, in transverse section, and it is the myenteric and submucosal plexuses, which comprise the enteric nervous system, that are missing in a portion of the gut and sometimes in the entire gut. The question is what the developmental and genetic reasons are for why this cell type fails to develop. There are many forms, short and long, depending on how much of the gut is affected. It has a variable frequency: about one in five thousand in individuals of European ancestry, about twice that frequency in Asians, and it appears to be relatively rare in Africans, though those data are still not very good. It's been known to be multifactorial for a very long time.
It has about a 4% recurrence risk and an altered sex ratio, and what drew me to this subject a long time ago is that it had significant associations with single-gene traits like Waardenburg syndrome, or Shah-Waardenburg syndrome, and with chromosomal changes such as Down syndrome: trisomy 21 individuals have a 50-fold increased risk of Hirschsprung disease. There are many other complex forms as well, and there are more or less Mendelian forms of the disease in animal models, in the mouse, the rat and the horse.

What really intrigued us in studying this disease (and I think I'm missing a slide here, but anyway) came out of the mapping we did. In 2005 Eileen Emison in my lab identified what was, to us, a remarkable finding. We knew that the tyrosine kinase-encoding gene RET had coding mutations in about 15% of families. Yet even though we could show linkage to that region in multiplex families, we couldn't find a coding variant. As you can imagine, at that time not all of the sequences were available, but we eventually identified a noncoding element that showed high sequence identity and was conserved across most vertebrates, all the way to zebrafish. And in fact it had a sequence change that segregated in families; this is the segregation ratio, with a very high frequency. We eventually showed that this was an enhancer, an enhancer that bound SOX10, and the change was a simple substitution. It was very frequent. These are control frequencies: 24% in Europe, nearly absent in Africa, about 45% in Asia. And it was an enhancer that was specific to dorsal root ganglia as well as the gut. So we were excited, because this was the first example of a common polymorphism disrupting a noncoding element that was a gut-specific enhancer and leading to loss of this cell type. We now know this from many other studies.
Some, published just last year, show that there are many kinds of genetic elements that lead to Hirschsprung. There are at least four distinct enhancers that do it, in RET and in type 3 semaphorins. There are two dozen coding genes in which there are rare mutations that lead to this disease, as well as nine very large CNVs, one of which is of course trisomy 21. Here are the frequencies in cases and controls, but I want you to focus on the odds ratio. The enhancers, taken cumulatively, when we count the number of pathogenic alleles, have an odds ratio of 4.5 and explain about 44 percent of the risk. Coding genes and CNVs contribute too, but they're much rarer in the population. They have much higher individual risk, but by virtue of being rarer they explain much less of the disease.

So there are all of these components, and our question was whether these components are unrelated or, if not, how they are related to one another. Through a number of studies, all published in the last few years, we now know the answer for at least 10, at least 11, of the factors listed here.
They are related to each other through a gene regulatory network. This has been done in both mouse cells and human cells, in mouse models of Hirschsprung's disease, in human patients and in human embryos. The data are very robust that a series of transcription factors in fact controls the expression of two genes in these neural crest cell derivatives. One is RET, where there are these three transcription factors. And three transcription factors, with two in common, GATA2 and SOX10, control the related endothelin receptor type B, related in function in what it does. Both of them are very strongly epistatic. But there are other features, other genes: its receptor, its ligand, its co-receptor, its signal termination. For all of these, the genes encoding these proteins are shown in red, which means they have coding mutations. RET has three noncoding mutations in three enhancers. But the importance of the gene regulatory network is that it emphasizes the importance of the noncoding genome. These genes regulate each other transcriptionally, and so there must be enhancers through which they act. Here's RET itself.
I don't have to tell this audience that there are various kinds of data we can borrow from ENCODE, as well as doing some studies ourselves. There are a variety of enhancers in the topologically associating domain that contains RET. There are specific ones that we've identified by virtue of conservation and by knowing that they exist in ENCODE, and then we've shown by luciferase assays that they can act, at least in vitro, as cis-regulatory elements, as enhancers. There are 32 that we know of at RET, ten of which not only contain a sequence variant that's associated with Hirschsprung disease in the population but also show allelic differences, with the two alleles having different luciferase activities. I'm going to speak a little bit about the three that are starred. They exist in three LD domains, so we know that they are genetically more or less independent, and we know that they bind the three transcription factors that I spoke about: they bind GATA2, they bind SOX10 and they bind RARB. So here are three transcription factors that regulate the activity of three enhancers that regulate the activity of RET.

So this is a very pleasing example. There are still features of this that we don't know. We know that the gene regulatory network is far bigger, and we're doing a whole set of studies, genetic studies by gene perturbation, using siRNAs and CRISPR, to find out what all of the members are. The point I want to make is that this has allowed us to think through the genesis of the disease, and even though we know what we know, we still have a long way to go. For example, sequence changes may lead to compromised enhancer function. That may be insufficient to cause disease or a phenotype.
It may require changes in local chromatin structure and function, so we need to measure those. That will then affect gene expression; we know that the variants in the three enhancers do affect RET gene expression. That then changes protein expression, not exactly proportionally but at least in a one-to-one manner. That leads to changes in the gene regulatory network, which we've already demonstrated; that leads to changes in cellular function and eventually to phenotypic variation. So we've got to follow this chain, this cascade of molecular events. The way we are now doing this is by collaborating with my colleague at NYU, Jef Boeke, to use big DNA technology. That is, we're going to model a 150 kb RET locus, where we can put in the enhancers we want and the human coding sequence, with the sequence variants that we want to follow through all of these aspects, to find out what it really takes to lose these cells.

Well, Hirschsprung is fine, but we are all interested to see how far we can push this paradigm. So, as I promised in my abstract, I'll tell you a little bit about using these kinds of systems to understand a more complex system, primarily blood pressure.
Much of the work here is the work of Dongwon Lee, who got his PhD with Mike Beer, whom many of you are familiar with, did a postdoc over a number of years with me, both at Hopkins and NYU, and now has an independent position at Boston Children's and Harvard. He is the person who drove this through both Mike Beer's lab and mine, and it in fact incorporates other ideas in the field; there has been similar work by Christina Leslie and others. I'm not going to describe the model here except to say it's a machine learning model. It uses support vector machines to try to distinguish the sequence features of regions that we believe or know are enhancers from those that are not. The trick here is to consider enhancer sequences through a library of k-mers and to attach weights to them. The k-mers can have gaps, and that in fact is the defining feature of the method for these cis elements. What a model such as this does, and it is essentially a model of how enhancers can be defined in any given tissue, is that it allows us to predict what happens when there are sequence alterations. Again, this is published work, which is why I'm not going to discuss it any more. When we have a sequence variant, we have an objective way of saying whether it's going to compromise enhancer function in any way, and that can then be addressed using a whole variety of in vivo, in vitro and other methods.

So we started this with the heart. The initial work is published, on human cardiac enhancer maps. You might think of this as taking data from ENCODE and other studies and making combined maps. The reason to do this is that the empirical observations of enhancers are, of course, not comprehensive. There's variation in signal intensity, there's variation in the depth of sequencing across studies, and there are protocol differences in the
types of assays: we may use DNase-seq for some and ATAC-seq for others, and even in DNase-seq there are various different protocols. And of course there's the stochastic nature of the chromatin state during recovery of the samples for analysis. So here's a summary of all of the data. We used five samples, some from us and three from ENCODE. The CREs were defined as 600-nucleotide elements, with a collection of 11-mers allowing mismatches. We observed about 160,000 such cis elements, but having a model we could also predict 88,000 that should have been observed but were not, and of course we have to test whether they are real or not. The interesting thing is that these CREs that were predicted and not quite observed, observed in only one sample or with weaker signals, were relatively more cell type specific, and they tended to map more to repeat sequence, which is why I think mapping them can often become difficult. But there's a whole variety of data that we used to show that they are, en masse, real.

What we also showed by having this k-mer signature is that a whole variety of cardiac-expressed transcription factors were the ones that would potentially bind these elements. We identified 33 and 34 such factors that belong to 35 different transcription factor families. But importantly, we discovered, or rediscovered, all of the key cardiac transcription factors that are known, that is, the GATAs, the MEFs, the MITFs and such. We are now working on this with an additional number of samples, both males and females, and with geographically different samples, that is, cardiac geography: looking at the right atria, the ventricles, the septum and so on. But that's for another day.

I also want to point out that these maps are interesting not only because they give us a global view.
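As a rough illustration of the gapped k-mer idea described in the talk (a library of k-mers with gaps, each carrying a weight, used to score a sequence and to compare two alleles), here is a minimal Python sketch. It is not the published gkm-SVM implementation: the function names, the default of 7 informative positions per 11-mer and the toy weights are all assumptions made for illustration. In a trained model the weights would come from the support vector machine.

```python
from itertools import combinations


def gapped_kmers(seq, l=11, k=7):
    """Return every gapped k-mer in seq.

    Each window of length l is reduced to k informative positions;
    the other l - k positions are masked with '-' (the gaps).
    """
    seq = seq.upper()
    features = []
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in combinations(range(l), k):
            kept = set(keep)
            features.append(
                "".join(base if i in kept else "-" for i, base in enumerate(window))
            )
    return features


def sequence_score(seq, weights, l=11, k=7):
    """Sum the weights of all gapped k-mers in seq (unknown ones count as 0)."""
    return sum(weights.get(f, 0.0) for f in gapped_kmers(seq, l, k))


def allele_score_difference(ref_seq, alt_seq, weights, l=11, k=7):
    """Score of the alternate allele minus score of the reference allele."""
    return sequence_score(alt_seq, weights, l, k) - sequence_score(ref_seq, weights, l, k)


# Toy example: hypothetical weights and a tiny window (l=4, k=3).
toy_weights = {"ACG-": 1.0, "AC-T": 0.25, "A-GT": 0.5}
print(gapped_kmers("ACGT", l=4, k=3))
print(allele_score_difference("ACGT", "ACTT", toy_weights, l=4, k=3))
```

A negative difference in this toy setting means the alternate allele loses weighted features relative to the reference, which is the sense in which a variant would be predicted to compromise enhancer function.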
Of course it takes work, but take any specific locus, like the most important sodium channel gene for cardiac arrhythmias, SCN5A, with mutations that lead to Brugada syndrome and to QT-interval modulation. We have identified from this kind of work a series of CRE variants that we believe are causal, through a variety of assays that we've done, shown here on the right. They're causal in the sense that they can recapitulate gene expression in the heart, taken from human tissue. And of course newer data, for example on allele-specific expression, have been very helpful in discriminating these specific signals. So, at least in my mind, this is proof of principle that for specific genes we can not only map the enhancers and find sequence variants but also identify which ones are the most likely causal ones.

What I'm going to do in the last ten minutes or so is very quickly try to tell you where these kinds of maps may lead with respect to blood pressure. We've recently done this analysis with Dongwon, and you're now looking at four blood-pressure-relevant tissues. There are many tissues that contribute to blood pressure, which is why studying its genetics becomes complicated. Here we've taken data from ENCODE that come from the adrenal gland and from the heart. We also have data from the heart, as I showed you, from the tibial artery and from the kidney; these are data sets we've generated. Here I've listed the ones that are available, and at the bottom are the ones we've used for making maps, after a QC procedure that I'm going to describe. There are two important features here that I should point out. When we combine data sets, as we did in the previous example, we gain much greater power, because we get a larger number of enhancers and a larger number of higher-quality enhancers. And there's actually added benefit in having different types of data sets. We've found in a limited
analysis that having DNase-seq and ATAC-seq, and also having, say, H3K27 acetylation data, that is, the different molecular ways in which we can define cis elements, in fact adds to the power of the heritability analysis that I will describe.

The problem with many data sets is that we really need to do some assessment of their relative quality, and quality can differ for a whole variety of reasons, technical reasons such as some tissues keeping well after autopsy and others not. This method is from Seong Kyu Han and Dongwon Lee, and it rests on the fact that there's variation in signal intensity in the CREs that are observed. One of the things you can do is rank them, and if you rank them you can take various percentile sets of them, in units of thousands or of 5,000, and fit these kinds of machine learning models to each set. And of course there's no doubt that for signals that are very intense we can learn the features much better than for signals that are far weaker. So this is the kind of scoring that's been done, which is the reason why we selected only some of the data sets. Here I'm showing you only one tissue, the adrenal. On the x-axis are the ranked peak subsets in units of 10,000, so you have the first 50,000 peaks, the next 50,000 peaks, the third and so on, and here, with cross-validation, is the area under the curve. What you see is that some data sets in fact behave far, far better than others, and there could be a variety of reasons for that. For this analysis we've really arbitrarily taken the first 50,000 peaks. We've now done this for the adrenal, for the tibial artery, to give us some idea of the endothelial component, for the heart and for the kidney.

So here's a summary of those data. Here are the four tissues, here's the number of cis elements that could be identified and here's the total number. Of course, as all of you know,
there's overlap between them, which is why they won't just add up to 500,000. Here's the typical length, which turns out to be more or less the same, and here's the genome coverage. For each individual tissue we get coverage of about 10% of the genome in these cis elements, but combined it's about 20%. So you've got to remember that we are using information on one-fifth of the genome to try to understand what genetic variation means. And here are some clues as to how we've done these annotations.

The first thing to notice was actually quite interesting, as you can imagine. This is just a complex Venn diagram of the four tissues to show overlaps in where the enhancers, the cis elements, lie. You will notice that the ones on the top are the four classes found only in one tissue, tissue specific at this point. They each make up roughly somewhere from 8 to, say, 16 or 17%, and there are these four numbers. Then there is this class of 16.5% that's common to all four. You will note that the number found in pairs and triples is actually much smaller. So it looks like these cis elements are either tissue specific or they are common, that is, much more widespread. Now I understand that when you look at tens of tissues some of this will subtly change.

So the question is, how much do these cis elements contribute to blood pressure heritability?
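The Venn-diagram summary described above can be sketched in a few lines of Python: given, for each cis element, the set of tissues in which it is open, assign it to a tissue-specific class, to the pairs-and-triples class or to the class common to all four, and report the fraction in each. This is only an illustrative sketch. The class labels, function names and toy input are hypothetical, and it assumes that elements have already been merged across tissues so that each one has a well-defined tissue set.

```python
from collections import Counter

# The four blood-pressure-relevant tissues named in the talk.
TISSUES = ("adrenal", "heart", "artery", "kidney")


def overlap_class(open_in):
    """Classify one cis element by the set of tissues in which it is open."""
    tissues = set(open_in)
    if not tissues or not tissues <= set(TISSUES):
        raise ValueError(f"unexpected tissue set: {sorted(tissues)}")
    if len(tissues) == 1:
        return f"{next(iter(tissues))} only"
    if len(tissues) == len(TISSUES):
        return "all four"
    return "pairs and triples"


def class_fractions(elements):
    """Fraction of elements falling in each overlap class."""
    counts = Counter(overlap_class(e) for e in elements)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy input: each entry is the tissue set of one (hypothetical) element.
toy_elements = [
    {"heart"},
    {"artery"},
    {"heart", "kidney"},
    {"adrenal", "heart", "artery", "kidney"},
]
print(class_fractions(toy_elements))
```

With real peak calls, the per-element tissue sets would typically come from intersecting the four peak files; that step is outside this sketch.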
So we took all polymorphic 1000 Genomes variants with a frequency greater than 1%. We got UK Biobank summary statistics, which we can use for both systolic and diastolic blood pressure; this is over 350,000 people. What one can then do is a partitioned heritability analysis. I'm not going to go into detail, but this has now turned out to be a very reliable, standard way in which we can estimate the fractional contribution to heritability of any class of elements. I'm only going to talk about systolic blood pressure today; diastolic results are very similar. The phenotypic heritability of systolic blood pressure is about 25% or 30%. The SNP heritability that we can measure for that phenotype is about 14%, about half of that. Of course, for the other half of the heritability there are a variety of explanations, but as you know, most GWAS studies, when done across the genome, recover about half of the phenotypic heritability.

So here are the tissues combined. We can explain about two-thirds of the heritability; this is a fraction of the total heritability. In all of the slides I'm going to show you, here is the heritability, then the standard error and then an enrichment. Enrichment in this sense: we are explaining two-thirds of the heritability by using 23% of all sites, so that's an enrichment of close to three-fold. And here are the p-values; that's highly significant. So those are the overall results.

Now it looked to us, in examining the data, that much of the signal was actually coming from the artery, and this was examined by taking these data and breaking them up into seven groups. There are the four that are individually, uniquely open.
They are marked in red. Then there is the artery together with other tissues, in purple; the non-arterial tissues, in green; and then all tissues, in yellow. These are the same figures that you've seen before. We can now take these categories and repeat the LD score enrichment analysis. What you find is that overall, of course, the results don't change, but the really significant results now come from those that are in this arterial segment. That's only 4% of all SNPs, and they explain 17% of the total heritability; this is highly significant. This is the common part. Remember that the classification I showed you in the previous slide is of non-overlapping segments of cis elements. So the artery contributes a significant amount, and the common ones contribute a significant amount. The non-arterial tissues of course contribute, with comparable enrichment, but it's much less significant because they explain much less of the heritability.

There is a further thing that we could do. You remember I told you that, having a machine-learned model, you can predict what happens with the sequence variants that we've considered. On the left is the figure that I showed you; on the right are those same cis elements with the same variants, but using only those predicted by this method called deltaSVM, which is just a score differential. It's just a difference in the SVM score that speaks to which variants are likely to have an effect. What you find is that the effect of the common class, that is, cis elements that are found everywhere, is actually small, and most of the effect is from individual tissues. That suggests that even when these cis elements are open in multiple tissues, they exert their effect on blood pressure, at least in this example, primarily through only one of those tissues, such as the heart or the kidney. This is summarized here, again with the same kind of data.
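The enrichment figures quoted for the partitioned heritability analysis are ratios: the fraction of heritability explained by an annotation divided by the fraction of SNPs that fall in it. A minimal Python sketch of that arithmetic, using the numbers stated in the talk, is below. The function name is hypothetical, and the sketch does not reproduce the LD score regression itself, only the final ratio.

```python
def fold_enrichment(heritability_fraction, snp_fraction):
    """Fraction of heritability explained divided by fraction of SNPs used."""
    if not 0 < snp_fraction <= 1:
        raise ValueError("snp_fraction must be in (0, 1]")
    return heritability_fraction / snp_fraction


# All four tissues combined: about two-thirds of heritability from 23% of sites.
combined = fold_enrichment(2 / 3, 0.23)

# Arterial segment: 17% of heritability from 4% of SNPs.
arterial = fold_enrichment(0.17, 0.04)

print(f"combined: {combined:.1f}-fold, arterial: {arterial:.2f}-fold")
```

The first ratio is what the talk rounds to "close to three-fold"; the second shows why the arterial class stands out despite containing far fewer SNPs.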
The slide shows fractions of SNPs, and this is less than 1% in each of these cases. By the way, overall we now do better: we are using about 5% of the SNPs, not 23%, and we can explain about a third of the heritability. What you find is fractional contributions that are highly significant, of which the arterial component is the most significant. Heart only and adrenal only of course contribute, but much less significantly. So it suggests that much of systolic blood pressure heritability, at least as estimated here, likely acts through an endothelial component. This is not a radically new idea, even though geneticists have said that most of the evidence is through the renal system.

I want to end by saying that these kinds of maps can be constructed and used from all the other epigenetic marks that people study. I think this is going to be important because eventually it will allow us to build a quantitative epigenetic regulatory code, which has turned out to be very important and will turn out to be even more important for the study of complex diseases. I have a whole host of people to thank. I've spoken about Dongwon's work and that of his recent postdoc Seong Kyu, but many other people have helped with collection of tissues, collection of families and analysis. And there's no doubt none of this would have been possible in the absence of ENCODE and, recently, also the data from GTEx. So, thank you very much.

Thanks a lot, Aravinda, for your great talk on the genetic regulatory code. We have a few questions already in the Q&A. I forgot to mention that you can ask questions while the speaker is speaking, and people can upvote them, so that if we need to prioritize questions we will do it according to people's preferences.
So I'll start with the first question, by Ross Hardison, which is the following. When you use a sequence-based method like gkm-SVM for quality control, it seems like the assumption is that direct binding to motifs is higher quality. Are you concerned that you may be removing peaks or data sets that reveal indirect binding, for example when a transcription factor is part of a complex?

Yeah, I don't know whether the assumption is exactly what you ask, Ross. I think the question is what the sequence specificity is. It doesn't matter whether the binding happens individually or, as of course many factors do, in a cooperative way. I think the main assumption here is that you have to have a sequence-specific signature. The sequence-specific signature doesn't have to be a uniform logo. It could have gaps; it could have many, many sequences that contribute to that recognition. But this is a way of learning what that sequence signature is. So there is some merit to what you say, but it's not as extreme as one might think.

Thanks. The second question is from a participant. Yes, okay, Ross is actually happy with your good answer. I guess you can see the chat.

Yes, I can see them, so I can answer some of these questions, though I don't want to take over your role. One is whether enhancers are intergenic or intragenic.
I actually haven't looked at it in a while, but of course the vast majority of the genome is intergenic, and that's where most of them lie. They do lie in intronic regions as well, and some of them in fact overlap some exons. But the vast majority, I think, are going to be intergenic, and then some in introns.

The other question is whether the enriched heritability with the combined enhancers for blood pressure is due to them being more likely to be conserved. No, I don't think so. I think one of the things we've learned, and I'm looking forward to hearing Paul Flicek speak a little later, is that there are enhancers that clearly act as very strong enhancers in the human that are not conserved, at least not in the mouse. Even in the Hirschsprung example that I mentioned, you could take that human enhancer and put it in the mouse or even in zebrafish and get it to work in assays in very specific cell types, so we know it can act as an enhancer. But many are not conserved, and that's presumably simply because of the turnover of these sites.

Yes. Again, I will just emphasize this: please use the Q&A to ask questions. There are some participants who are asking questions in the chat. So let me promote the question that Jorge Ferrer put into the chat. Thank you. Do you have any information on whether the arterial signal is predominantly driven by open chromatin variants in endothelial cells, smooth muscle or other cell types?

Right now we don't; those are precisely the kinds of experiments we are doing. We strongly suspect, based on a variety of other physiological data, that it's endothelial, but nobody has really done this level of experiment.
So the benefit of ENCODE is that we can do these first sets of analyses to make our hypotheses more precise. I don't know what the statistical power considerations are, but given that more phenotypic data will be collected, we can imagine having a canonical set of tissues that we look at first, to say: here's a complicated trait, and which tissues contribute to it? By the way, we've done other experiments as both positive and negative controls. When we look at arrhythmias, the signal comes only from the heart in exactly this kind of experiment. So we know that the analyses can be specific, but we need much more experience. Once you get to a particular tissue, there are increasing levels of single-cell and single-nucleus data, which I think are also going to turn out to be useful to say which cell type. And there, I can tell you, even for the classical case, Hirschsprung disease (I'm sorry, I'm going back and forth), the canonical thought has been that the disease arises from loss of intestinal neurons. It's now quite clear that the defect in neurons is also mirrored by defects in glia, and in fact there's a sense that the major defect may be even earlier, in the progenitor cells for this phenotype. So I really expect these kinds of studies to clarify where exactly the disease starts.

So then, a few questions are still in the Q&A, I don't know if you want to respond. Venkat Malladi asks about the expression of, oh, I'm sorry.

Yeah, so the question is whether we considered expression of the TFs. Yes.
I don't think any statistical or computational method is going to work without being constrained by real data. Just as having a set of standards, a set of enhancers that we know work as enhancers, helps us to improve the model, there's no doubt that expression of transcription factors has been very important for us to say these are probably the factors that act. And then of course we have to do specific experiments.

So there's another question here: deltaSVM is based on machine learning. What do you think is the best training set? Are you concerned that, depending on the training set, you get slightly different results?

I think there's actually no doubt about that. I don't know that we have enough to know exactly what the best training set is, but our limited experience says that one of the best is going to be allele-specific expression. You can imagine why: with allele-specific expression we can control the number of reads, and in a heterozygote, even for an otherwise rare variant, there can be enough data in that individual to show the robustness of a signal. So I think that's probably going to turn out to be the best. But we've used others; we've used, for example, not eQTLs but DNase QTLs and other kinds of data. I think we don't have enough experience to be able to say that the results won't vary, but I would think that as the data improve we can answer that question much better.

We still have time for a couple of questions. While we wait for other participants, I have one question for you regarding Hirschsprung disease. You identified a number of protein-coding genes, enhancers and copy number variants that were related to the disease. You did not mention long noncoding RNAs.
I mean, it's about the function of long non-coding RNAs. You don't find any, and I'd like to ask you a few things: whether you didn't find any because they are not..., whether this is specific to this disease, or whether in general you see long non-coding RNAs playing a really small role in most...

Yeah, so, one observation that is very robust, and has been bothering at least us for at least a year, is this. We have done parallel studies in the mouse, where we have far greater control, but we have studies in human embryos as well, collected through a resource in the UK called the HDBR that many of you know. I think one of the things that's really very interesting is that RET deficiency leads to changes in many, many transcription factors. So not only does it affect many genes, but it affects many transcription factors. And so we've been puzzling over this. We know the result is true; the question is, how does a receptor tyrosine kinase do this? I think we've eventually come up with the hypothesis that there is probably a long non-coding RNA involved. We know that RET affects its own transcription factor, SOX10, and there's at least some evidence that some of the SOX family of transcription factors associate very actively with a number of long non-coding RNAs.
So I think it's going to turn out, as we look at the interactions between these genes, that long non-coding RNAs will rightfully have to be considered as a global way by which they can change the regulation of many genes.

Mm-hmm. So, if there are no more questions, I would just like to remind people that they can post questions in the Slack channel. At the end of the session we'll have the live Zoom with the speakers, where people can actually ask them questions. I just want to ask you a very general question, related to the question that has been asked of you. In terms of the particular cell types in which these are active, what do you see as the role of single-cell analysis, and of projects like the Human Cell Atlas, in the identification of the genetic basis of disease?

So, you know, like many others, we are also doing so. My lab's philosophy so far is that whatever we do, we do in Hirschsprung disease first, because if we can't explain it in that system, we don't know that it's even biologically going to be tractable.
I think the single-cell expression data has been, of course, very, very significant, because the total neuronal diversity in the gut is not really very well known. With ATAC-seq data, at least in trying to define enhancers using those kinds of open chromatin assays, together with looking at, say, K27 acetylation, we have begun. And I think our initial data suggest that we should be able to get cell-type-specific enhancers. The problem, I think, is going to be how many we can resolve at this stage of the technology. That is, finding the top hundred may be, you know, very easy. And that probably is limited by the actual nature of the single cells, or single nuclei, that we can extract. We always think, even in the mouse, that we can dissociate cells and get a random sample of the whole tissue, and we just don't know it. So I think this is just going to be a technical matter we've got to solve: do we sample enough cells, and do we sample enough of even the rarer cell types, so that we can get, not for expression but for open chromatin, specific data on literally thousands of enhancers? We don't know that yet. That needs to be solved.