Thank you very much for the opportunity. It is a real pleasure to be here. I'd like to tell you about the insights that we've gained from the fly modENCODE integration, and also follow up a little bit on the theme that Mark set up, reflecting on how that has given us insight into human biology and specifically human disease. So what I'm going to present today is the result of the analysis working group and the data analysis center, which, as you can see here, are very friendly and working closely together. And this has been really a tremendous adventure, I would say, through integrative genomics. The folks doing the work are sitting here in the audience: Ben Brown, for example, Peter Park, Peter Kharchenko, Matt Eaton, Dave MacAlpine, Eric Lai, Steve Henikoff, Casey Brown, and Nicolas Nègre from Kevin White's group, as well as, of course, Lincoln Stein, Gos Micklem, and the DCC, to name a few. They're contributing from very different classes of biological elements, but at the same time, each of these folks has really come together to contribute to the integrative analysis. So what I'm showing here is the work of many, many people.
I would like to, again, use the same slide that Elise showed earlier to tell you how we're thinking about all these different classes of elements, with a lot of different assays coming together and giving us different glimpses of the underlying biological story of the genome. And you've also heard from the data coordination center (this pointer doesn't really work) how even putting all these data sets together has been a tremendous challenge. This is the 1,000 or so data sets that were linked through the Science papers for immediate download as of December 2010, after a lot of work from the DCC. And this is the total tally as of today.
You can see that the number of data sets has actually doubled across the different types of regulatory elements. Mark already did a great job introducing all these classes of elements: mRNAs, non-coding RNAs, microRNAs, siRNAs, piRNAs, also comparative transcriptomics; looking at variants of chromatin as well as nucleosome turnover; looking at histone modifications and chromosomal proteins; understanding the complexes of replication, origins of replication, replication timing, differential replication; and then transcription factors. So when you start simply overlapping all these elements together, what you end up with is this astonishing picture: if you started out with, say, the protein-coding annotation of the Drosophila melanogaster genome, you would cover about 20% of the genome. And as you start adding small RNAs, 3' UTRs, non-coding RNAs, Pol II, transcription factor (TF) binding, insulators, additional bound proteins, Polycomb domains, origins of replication, enhancer and promoter states, transcribed states, heterochromatic states, and introns, you see what fraction of the genome each of these elements covers by itself. And as you start piling them onto each other, you end up with more than 75% of the genome, or nearly 85% of the genome, covered, for both the overall genome and the conserved genome shown in red here, where you see an even higher fraction of the genome being covered. So what we've gone from is about 20% of the genome being quote-unquote interpretable, based on overlap with protein-coding regions, to nearly 80% or 90% using these diverse assays. And instead of just covering each base of the genome once, what we see is that multiple assays overlap these regions in multiple ways. So for example, you see that about 5% of the genome is covered by more than 14 different regulators and 65% of the genome is covered by at least one.
And similarly here, you can see the number of overlapping transcripts and the number of overlapping classes of chromatin elements. And then if you pile all these elements together, you have about 50% of the genome covered by at least four assays and 30% of the genome covered by at least eight assays. So this is not just a painting of the genome; this is a multiple painting in many colors. For example, look at this region here, where we have several protein-coding and non-coding transcripts sitting on the right and this large unannotated region on the left. The moment you overlay modENCODE datasets on it, you see an increase in the coverage of these coding regions, where a lot of small RNAs and non-coding RNAs are lighting up and coming from these transcripts. But you can also see vast regions of bound proteins, for example, in the middle of these genes, in regulatory elements, as well as in the middle of this large intergenic region, which now has a new gene model sitting right in it, with many different regulatory elements annotated by modENCODE. So what we're looking at here is really much of what was promised by this encyclopedia of both coding and non-coding DNA elements, with the emphasis on these large non-coding regions. And, Elise, just to clarify, I see the timer went for 25 minutes, is that correct? Okay, just checking.
So where do we go from here? We can certainly pile all these elements together and count the amount of the genome that we're covering, but what I'd like to tell you about today is what we gain by putting all these elements together across these different data types.
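The coverage tallies quoted above (the fraction of the genome covered by at least one, four, or eight assays) can be sketched as a simple depth count. This is a minimal illustration on a toy genome, not the analysis group's pipeline; the function name `coverage_fractions`, the half-open interval convention, and the three toy tracks are all assumptions made for the example.

```python
from collections import Counter

def coverage_fractions(genome_length, tracks):
    """Return {k: fraction of bases covered by at least k tracks}.

    tracks: list of tracks; each track is a list of half-open (start, end)
    intervals. A base counts once per track, however many intervals of
    that track overlap it.
    """
    depth = [0] * genome_length
    for intervals in tracks:
        covered = set()
        for start, end in intervals:
            covered.update(range(max(0, start), min(genome_length, end)))
        for pos in covered:
            depth[pos] += 1
    histogram = Counter(depth)
    fractions = {}
    for k in range(1, len(tracks) + 1):
        at_least_k = sum(n for d, n in histogram.items() if d >= k)
        fractions[k] = at_least_k / genome_length
    return fractions

# Toy 20-base genome with three hypothetical annotation tracks.
toy_tracks = [
    [(0, 4)],            # stand-in for protein-coding exons
    [(2, 10)],           # stand-in for transcription factor binding
    [(3, 6), (12, 16)],  # stand-in for a chromatin state
]
toy_fractions = coverage_fractions(20, toy_tracks)
```

A per-base list is only practical for toy sizes; on a real genome one would more likely sweep over sorted interval endpoints, but the quantity being reported is the same.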
So for example, we can start annotating coding and non-coding genes and distinguishing different classes of transcripts by overlapping the transcriptional information across different types of transcripts in the cell, and also overlaying that with evolutionary signatures, the patterns of change of these regions. If we look here at this very small new transcript that sits between these previously annotated genes, you see that this single transcript contains, within a single exon, two independent small peptides on the order of 20 amino acids each, which we would never be able to recognize unless we had both the extremely precise transcriptional evidence and the comparative evidence showing us that the patterns of change here precisely match what you would expect for protein-coding regions. You can also go within extremely well-studied transcripts, such as this heat shock response element or this Xist orthologue that determines X chromosome dosage, and discover new transcribed regions within them that fold into well-defined structures that are evolutionarily conserved. And you can go within protein-coding exons and discover microRNA genes that overlap the protein-coding regions, encoding both amino acids and short regulatory RNAs that can target downstream genes. You can also discover, downstream of conserved stop codons, regions of overlapping function: here this region serves as the 3' UTR for this particular translation termination, but you also have an alternative translation termination that simply reads through the stop codon and translates these additional regions. We found 300 of those examples in the fly genome and an additional four examples in the human genome, which again started out from the model organisms before we knew they also existed in human.
But beyond these protein-coding regions, we'd like to also annotate non-coding elements, and in particular we've been working with Gary Karpen and with Peter Kharchenko from Peter Park's lab to annotate chromatin regulatory elements such as enhancers, promoters, and a diversity of different classes of regulatory regions. As you heard in the previous talk, DNA is wrapped around nucleosomes, each of which is made up of eight histone proteins, each of which has a long tail of amino acids that can undergo post-translational modifications. There is a large number of these modifications, creating many distinct combinations of histone marks, which are very difficult to interpret because you end up with a large number of genome-wide tracks mapping the locations of these modifications across the genome. But we and others have developed algorithms for learning the hidden chromatin states that are responsible for the observed combinations of chromatin marks; we can learn these completely de novo across the genome and only then overlap them with existing functional elements. So in collaboration with Peter Park, as well as Jason Ernst from my group, we set out to annotate the chromatin states of the Drosophila genome. First of all, this is a picture by Peter Park that shows how each of these chromatin marks is in fact very strongly positionally biased with respect to different classes of genes, origins of replication, specific insulator proteins, or specific transcription factors. You can see here that each combination of chromatin marks perhaps uniquely defines each one of these regions in a very consistent way across the whole genome. So we can use that, for example, to discover new and sometimes surprising classes of elements.
For example, we can use that to define promoter signatures based on the presence of, for example, H3K36, H2B, and H3K4 trimethylation, which is associated with both promoter regions and transcribed regions. And you can also use that to define new classes of elements. For example, we found a new class of H3K36 monomethylation marks that are associated with replication origins, and then collaborated with Dave MacAlpine to map those across the genome. You can also systematically learn combinations of these modifications, as I mentioned earlier. This is learning not just the specific combinations but also the intensity of each mark in these combinations, and you end up with nine different chromatin states. The first we can simply refer to as active transcription start sites, where you can see a high intensity of H3K4 trimethylation and, to a lower degree, H3K4 dimethylation and H3K9 acetylation; this state is extremely enriched for TSS-proximal regions. You can define active exons and elongation elements, active introns, enhancers, as well as intergenic regions, based again on specific combinations of these chromatin marks and specific intensities with respect to each other. There's a specific chromatin state that defines male X genes, which contains H4K16 acetylation and is specifically enriched for this class of genes. You can define Polycomb-repressed elements that are marked by H3K27 trimethylation, heterochromatic elements that are marked instead by H3K9 trimethylation, and other basal and repressed elements in the genome. So this now gives us a handle for going off and annotating the genome using these large classes. But we can also go further and define more discrete states, where specific combinations of marks are defined regardless of the intensity. And we've used that to define 30 different chromatin states, which correspond to these nine states as you see here.
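The idea of discrete states defined by which marks are present, regardless of intensity, can be illustrated by binarizing each mark and grouping genomic bins by their presence/absence pattern. This sketch is a deliberately simplified stand-in: it is not the hidden-state learning algorithm mentioned in the talk, and the threshold, the signal values, and the function name `combination_states` are assumptions chosen for illustration.

```python
from collections import Counter

def combination_states(bins, threshold=1.0):
    """Label each genomic bin by the set of marks whose signal passes the
    threshold, ignoring how far above the threshold the signal is."""
    labels = []
    for signal in bins:
        present = frozenset(m for m, value in signal.items() if value >= threshold)
        labels.append(present)
    return labels

# Hypothetical per-bin signal for three histone marks.
toy_bins = [
    {"H3K4me3": 5.0, "H3K9ac": 2.0, "H3K27me3": 0.1},  # strong promoter-like bin
    {"H3K4me3": 1.5, "H3K9ac": 1.2, "H3K27me3": 0.0},  # weaker, same combination
    {"H3K4me3": 0.0, "H3K9ac": 0.2, "H3K27me3": 3.0},  # Polycomb-like bin
    {"H3K4me3": 0.1, "H3K9ac": 0.0, "H3K27me3": 0.3},  # no marks pass
]
toy_labels = combination_states(toy_bins)
toy_state_counts = Counter(toy_labels)
```

The first two bins differ in intensity but receive the same label, which is the distinction between intensity-aware and combination-only states drawn above.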
And then we intersected those with a very large array of functional elements, such as nucleosome solubility, hot spots, nucleosome turnover, different classes of insulators, different classes of histone deacetylases, hot regions, early origins (origins of replication that fire early), regions of origin recognition complex binding, as well as different classes of transcription factors. What's really astonishing is that these histone modifications alone can pick out each of these classes of elements in different states, suggesting that chromatin is encoding a much more diverse array of functions than previously thought. It is not just encoding active and inactive regions; it is instead encoding a vast array of different classes of annotations.
So we can now go beyond simply annotating regions of the genome. Before, we looked at both coding and non-coding transcripts, and then we looked at regulatory regions; we can now start connecting these regions together to piece together the regulatory networks that Mark Gerstein alluded to earlier. The first thing that we can do is learn the hierarchy of this network by combining transcription factors and microRNAs, and then I'm going to switch to these regions of high occupancy. So just looking at the physical regulatory network, namely which transcription factor is physically contacting which target gene, we again find a hierarchical structure where most of the links are pointing down and only a small number of links are pointing up, if you arrange the regulators in this particular way.
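Once regulators have been arranged into levels, the statement that most links point down and few point up reduces to a count over edges. The sketch below assumes the level assignment is already given; how the levels were derived is not described in the talk, and the regulator names, levels, and edges are hypothetical.

```python
def edge_directions(levels, edges):
    """Count edges pointing down, up, or within a level.

    levels: dict mapping regulator -> integer level (0 = top of hierarchy).
    edges: iterable of (source, target) pairs in the physical network.
    """
    counts = {"down": 0, "up": 0, "same": 0}
    for source, target in edges:
        if levels[target] > levels[source]:
            counts["down"] += 1
        elif levels[target] < levels[source]:
            counts["up"] += 1
        else:
            counts["same"] += 1
    return counts

# Hypothetical four-regulator network with one feedback edge (D -> A).
toy_levels = {"A": 0, "B": 1, "C": 1, "D": 2}
toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "A")]
toy_counts = edge_directions(toy_levels, toy_edges)
```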
And a very interesting picture emerges if you include microRNAs, shown in red surrounding these transcription factors: the feedback from the bottom layers of the hierarchy to the top layers is predominantly happening through microRNA regulators, which are increasingly targeted by the bottom layers of the hierarchy and increasingly targeting the top layers, which is rather surprising. This is also found if you study the specific structural motifs of this regulatory network, these recurrent patterns of connectivity, where you see cascades of transcription factors targeting each other and then feedback coming back through a microRNA layer, just like we see here. So both at the low level of the network, where you study the specific patterns of connectivity, and at the high level of the network, where you study the overall hierarchical layout, you can see this feedback of regulatory information from transcription factors through microRNAs to other transcription factors, namely the master regulators that Mark was talking about earlier.
So again, just like in human and in worm, in fly you find these regions of very high occupancy by multiple transcription factors. If you look at the average number of transcription factors found with each of the regulators that was profiled in the Drosophila genome, you see that only a small number of factors are binding alone, and most of the factors are binding with another six or sometimes 10 partners. This is the median number of partners at the locations bound by each of these transcription factors, suggesting that there are some regions in the genome that are just very, very widely bound. That's where the name high occupancy target (HOT) regions comes from, and in fact this term was coined by Kevin White in the Drosophila genome.
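The median number of binding partners per factor can be computed directly once each bound region is represented as the set of factors found there. This is a toy sketch; the region sets and the factor names `TF1` to `TF4` are hypothetical, and real data would first require merging overlapping peaks into regions.

```python
from statistics import median

def median_partners(regions):
    """For each factor, the median number of other factors bound at the
    regions where it binds.

    regions: list of sets, one set of factor names per bound region.
    """
    partner_counts = {}
    for bound in regions:
        for factor in bound:
            partner_counts.setdefault(factor, []).append(len(bound) - 1)
    return {factor: median(counts) for factor, counts in partner_counts.items()}

# Hypothetical bound regions.
toy_regions = [
    {"TF1"},
    {"TF1", "TF2", "TF3"},
    {"TF1", "TF2", "TF3", "TF4"},
    {"TF2", "TF3"},
]
toy_partners = median_partners(toy_regions)
```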
What's interesting here is that we can now bring in these different classes of functional elements to help annotate and understand these hot regions. The first thing that we can do is look at regulatory motifs. What we can ask is: given a particular complexity of a transcription factor binding site, which tells us the number of other factors that are binding with it, are the regulatory motifs more enriched to the right or to the left of this class? And what we're finding is that most of the time there is actually a depletion of these regulatory motifs, suggesting that regions of increased complexity are less likely to contain regulatory motifs: as more and more transcription factors are binding, you are less and less likely to bind to your specific motif, and therefore you increase the non-specific binding. You can also overlay that with the chromatin state annotations that we had earlier, and what you find is that specific chromatin states are enriched for either hot or cold regions of transcription factor binding. A striking finding comes if you overlap that with regions of origin recognition complex binding: as you increase the complexity of a transcription factor binding site, you also increase, in an almost linear way, the likelihood of binding of the origin recognition complex, suggesting an intricate interplay between replication and transcription factor binding that was previously unappreciated. Now you can look within these regions of high occupancy and search for regulatory motifs that are specifically enriched within them, and you end up with a large number of specific regulatory motifs that are predominantly found within these high occupancy target regions and that do not match other existing regulators, suggesting that perhaps a different class of binding factors may be guided to these regions in a sequence-specific way, then enabling this non-specific binding by additional regulators.
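The comparison of motif content against binding-site complexity amounts to grouping sites by complexity and computing the fraction that contain the factor's own motif. The sketch below uses invented sites purely to show the bookkeeping; the same grouping could be applied to any per-site yes/no annotation, such as overlap with another bound complex.

```python
from collections import defaultdict

def motif_rate_by_complexity(sites):
    """Fraction of binding sites containing the motif, per complexity value.

    sites: iterable of (complexity, has_motif) pairs, where complexity is
    the number of factors bound at the site.
    """
    totals = defaultdict(int)
    with_motif = defaultdict(int)
    for complexity, has_motif in sites:
        totals[complexity] += 1
        if has_motif:
            with_motif[complexity] += 1
    return {c: with_motif[c] / totals[c] for c in sorted(totals)}

# Hypothetical sites: motif presence falls as complexity rises.
toy_sites = [
    (1, True), (1, True), (1, False),
    (5, True), (5, False),
    (10, False), (10, False),
]
toy_rates = motif_rate_by_complexity(toy_sites)
```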
And we're finding a very interesting story that mirrors that in the human genome. When we study the interplay of transcription factors, regulatory motifs, and chromatin, we find first of all that transcription factors show distinct chromatin preferences: the different transcription factors shown here are matching different classes of chromatin states, and even though they're all matching regions of open chromatin, they're differentially matching promoters, poised promoters, enhancers, weak enhancers, a class of enhancers that is in fact lacking histone modification marks, and so on and so forth. Now when we look at the regulatory motif preferences for each of these factors, we do find that the motifs for these transcription factors are enriched in the regions of binding, which is reassuring. But you see additional binding beyond these motifs: you find that the motifs are, just like we saw in fly, depleted: amongst all regions of TF binding, they're depleted in the regions of high occupancy, suggesting that in addition to the specific binding there is some non-specific binding happening within these regions. Particularly surprising were these three chromatin states that lacked histone modifications, which showed abundant binding but no non-specific binding. So in order for these regions to be promiscuous and bind without a motif, they require the histone modifications; open chromatin is not enough, suggesting that open chromatin alone is not sufficient information and that you perhaps need chromatin regulators to recognize these marks to enable the non-specific binding. The state preferences also predict the pairwise transcription factor co-occurrence patterns that we observed in the fly: if you correct for the chromatin state preferences, you remove a lot of the TF co-occurrence that we observed across human, fly, and worm.
So we can use this information now to build predictive models of gene regulation. Before, we looked at physical regulatory networks, of which transcription factor is physically contacting what target gene, or at least what upstream region of a target gene; we can now start building functional regulatory networks by integrating all that information together. We're looking at motif instances that are conserved across different species and therefore more likely to be functional. We're integrating with that the physical evidence of binding using chromatin immunoprecipitation. We're also using correlation information between transcription factors and their target genes, both in terms of their chromatin marks and in terms of their gene expression patterns. And then we're putting all of that into a learning framework that predicts, given the vector of information across all of these different patterns, whether a particular transcription factor is in fact targeting a particular target gene. We can use that to define functional enrichments across the target genes of the same transcription factor: what we're finding is that, depending on what gene expression cluster you are in and what transcription factor is targeting you, you're much more likely to have the same function. We can see these functional enrichments across different lines of evidence here, and we can use that information to predict the likely functions of genes that were previously unannotated.
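To make the idea of a learning framework over an evidence vector concrete, here is a minimal sketch that combines three features per regulator-target pair (conserved motif present, ChIP binding present, expression correlation) in a logistic regression trained by plain gradient descent. The talk does not say which learning method was used, so logistic regression is a swapped-in assumption, and the feature values, labels, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights and bias by batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def edge_probability(weights, bias, x):
    """Predicted probability that the regulator targets the gene."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical evidence vectors: [conserved_motif, chip_bound, expr_correlation]
toy_features = [
    [1, 1, 0.9], [1, 1, 0.7], [0, 1, 0.8],   # assumed true edges
    [0, 0, 0.1], [1, 0, 0.0], [0, 0, -0.2],  # assumed non-edges
]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_weights, toy_bias = train_logistic(toy_features, toy_labels)
toy_scores = [edge_probability(toy_weights, toy_bias, x) for x in toy_features]
```

In practice one would evaluate such a model on held-out pairs rather than on the training examples; the toy data here only shows how the separate lines of evidence enter one prediction.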
For example, by observing that genes targeted by specific regulators are involved in cellular respiration, we can predict that additional genes that were previously unannotated are also involved in cellular respiration. If we do that in a cross-validation framework, we have very strong predictive value for several of these annotation terms, enabling us to predict more than 1,000 new functions for previously unannotated genes in the Drosophila genome. We can also predict regulators based on the stage of development at which they're acting, looking here at embryo, larva, pupa, and adult at different days or hours of development. You can see here that at specific branch points the expression of these regulators changes just as the expression of their target genes changes, predicting the action of specific regulators at specific branch points. And we can also develop predictive regulatory models that use this targeting information from the functional network to predict the expression level of target genes based on the expression level of the corresponding regulators. This is, for example, the true expression pattern for the Groucho gene, and these are five of its predicted regulators. We can learn a function for each of these edges that predicts, as a linear function, the expression level of that gene at each stage of development, and we can compare that to a prediction based on a randomized network, and we see that indeed the randomized network doesn't do very well at all. We can do that for a very large number of genes. Here I'm showing the top 1,000 genes, or at least sampling from the top 1,000 genes that are the best predicted, and you can see both negative correlations, for genes whose expression is predicted based on repressors, and positive correlations, for genes whose expression is predicted based on activators.
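The linear model of target expression described above can be sketched as ordinary least squares over the regulators' expression across developmental stages. The example below uses two invented regulators rather than five, noise-free synthetic data, and hand-written normal equations; the fitting procedure actually used in the work is not specified in the talk, so this is only an assumed illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_expression_model(regulators, target):
    """Least-squares weights: one per regulator, then an intercept."""
    rows = [list(values) + [1.0] for values in zip(*regulators)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_expression(weights, regulators):
    rows = [list(values) + [1.0] for values in zip(*regulators)]
    return [sum(w * v for w, v in zip(weights, row)) for row in rows]

# Hypothetical expression across five stages for an activator and a repressor.
activator = [1.0, 2.0, 3.0, 4.0, 5.0]
repressor = [2.0, 1.0, 0.0, 1.0, 2.0]
target = [2.0 * a - 1.0 * r + 0.5 for a, r in zip(activator, repressor)]
toy_model = fit_expression_model([activator, repressor], target)
toy_prediction = predict_expression(toy_model, [activator, repressor])
```

The fitted weight is positive for the activator and negative for the repressor, mirroring the positive and negative correlations mentioned above. A randomized-network baseline would refit with regulators drawn at random and compare the prediction error.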
So how does all that translate now to interpreting human disease? In the ENCODE project we've similarly mapped these chromatin states using combinations of chromatin marks across numerous human cell lines to define different classes of enhancer, promoter, and transcribed regions, enabling us to look at any region of the human genome and at a glance observe its activity patterns across different cell types. What's really exciting here is that we can now use correlations between these activity patterns, just like we did in the fly, to link together not just enhancers to their target genes but also enhancers to their likely trans regulators that are sitting upstream of them, by studying these vectors of activity for gene expression, for chromatin, for regulator motif enrichment, and for transcription factor expression across the different cell types. And we can use that to predict activators as well as repressors for each of the cell types, based on the joint action of the regulator motifs, the expression of the transcription factor, and the activity of the chromatin state. So we can define a number of activators and repressors for each of the cell types, and we can now use these predicted regulatory regions in each of the cell types, and the predicted links between the transcription factors and these regulatory regions, both downstream in terms of what enhancer is targeting what gene and upstream in terms of what regulator is targeting what enhancer, to start interpreting disease association studies.
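Linking an enhancer to a candidate target gene by correlating activity vectors across cell types can be sketched with a plain Pearson correlation. The activity and expression values below are invented, the gene names are placeholders, and the real analysis combines several kinds of vectors (chromatin, expression, motif enrichment), which this sketch does not attempt.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def link_enhancer(enhancer_activity, gene_expression):
    """Return the best-correlated gene and all correlation scores."""
    scores = {gene: pearson(enhancer_activity, expr)
              for gene, expr in gene_expression.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical enhancer state (1 = active) and expression in five cell types.
toy_enhancer = [1, 0, 1, 0, 1]
toy_expression = {
    "geneA": [5.0, 1.0, 6.0, 1.0, 4.0],
    "geneB": [2.0, 2.0, 2.5, 2.0, 1.5],
}
toy_best, toy_link_scores = link_enhancer(toy_enhancer, toy_expression)
```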
For example, we find that the top-scoring region for a genome-wide association study for systemic lupus erythematosus has 18 SNPs that are genome-wide significant, six of which fall specifically within the GM enhancers that we have defined using these chromatin states. If you look within one of them, in this particular example, you find that the SNP that is associated with the disease phenotype is disrupting a predicted causal motif for the ETS-1 regulator, therefore resulting in inactivation of this particular enhancer and likely changing the expression of the downstream gene, which is predicted to be a target of this region based on our activity profiles. We have automated this, so we can now do that for any one region. We can basically read off our regulatory map that ETS-1 is a predicted activator of GM cell lines and that GFI-1 is a predicted repressor of K562. In this particular case, the disease-associated variant is creating a motif for the repressor GFI-1, which is then predicted to repress the activity of this enhancer region and therefore lead to inactivation of this particular gene and to the disease phenotype. And we can also, of course, leverage information from comparative studies, across 29 mammals in this particular case, where the specific SNPs that are disrupting conserved instances of regulatory motifs are much more likely to be associated with disease.
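Whether a SNP disrupts or creates a motif instance can be estimated by scoring the reference and alternate alleles against a position weight matrix and comparing log-odds scores. The matrix below is an invented four-position example, not the real ETS-1 or GFI-1 motif, and the uniform background frequency and function names are assumptions.

```python
import math

def log_odds(seq, pwm, background=0.25):
    """Sum of per-position log2(probability / background) for seq under pwm."""
    return sum(math.log2(column[base] / background)
               for base, column in zip(seq, pwm))

def motif_change(ref_seq, alt_seq, pwm):
    """Alternate minus reference score; negative means the motif is weakened."""
    return log_odds(alt_seq, pwm) - log_odds(ref_seq, pwm)

# Hypothetical 4-position motif preferring G, G, A, A.
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
toy_disruption = motif_change("GGAA", "GGTA", toy_pwm)  # SNP at position 3: A -> T
toy_creation = motif_change("GGTA", "GGAA", toy_pwm)    # reverse change: T -> A
```

A variant that lowers the score of an activator motif and one that raises the score of a repressor motif would both be candidates for reduced enhancer activity, which is the reasoning applied to the two examples above.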
We've automated this process in a tool that anyone can use, called HaploReg, where you can go and mine the entire ENCODE database of regulatory annotations across specific regulatory motifs, binding of specific regulators, DNase hypersensitivity across 80 different cell types, and the chromatin state annotation maps across the nine cell types, as well as conserved elements across the 29 mammals and, of course, the coding and non-coding gene annotations, in order to interpret, for every SNP that's associated with the disease, which of the neighboring SNPs might be responsible for the disease phenotype.
So overall, what I want to leave you with is that you can use this type of information to annotate coding and non-coding regions, to annotate chromatin regulatory elements, to define networks of regulators, their targets, and their downstream genes, and to build predictive models of gene regulation. Putting all that together in the example of human disease, you can use that to annotate non-coding SNPs and link them both to the upstream transcription factors that bind them and to the target genes that they regulate. Ultimately our goal is to be able to use that information to systematically annotate human disease, and what that requires is a systematic understanding of gene regulation, where we can predict for every coding or non-coding mutation in the genome exactly what the functional implications are likely to be. That's the goal of the ENCODE project. What we're doing now is comparing fly and worm and, of course, comparing those to human. We've done a lot of work in defining these orthologs, and tomorrow you'll hear a lot more about how each of these stories plays out when you compare flies, worms, and human.
So ultimately I believe that model organisms can be extremely powerful for understanding the relationship between genotype and phenotype, because we can study at a systems level the effect of these functional elements and the selective pressures on trait-associated regions, and also, given the powerful genetics and the short time spans, we can use them for systematic mutations as well as drug screening. Again, this has been a wonderful collaboration with the entire analysis working group across Drosophila and worm. I acknowledged some of the key contributors at the beginning of the talk, and this is the full set of authors here for our integrative paper. You can see the stars here marking a very large number of equal contributors, because this has really been an incredibly collaborative team effort, and again a set of PIs here that are all equal contributors. So I'll stop there and take questions.
So this sort of gets to the importance of doing functional analyses, and so Barbara Wold and others specifically looking at transcription factors.
Can you speak into the mic?
I'll have to lower myself.
To our left.
So anyway, the point is that transcription factor binding seems to be excessive, right? There are a lot of sites in the genome (I'm not talking about hot regions, I'm talking about tens of thousands of sites), but only a few of them actually seem to regulate gene expression. So what I'm specifically wondering about is, in terms of using SNPs for doing the analysis of human disease, how much that confounds the analysis, and do we need to have depletions of the transcription factors and an analysis of how that affects transcription before we can actually do the linkage to human disease?
You have a fantastic point. As you'll notice, in the next phase of ENCODE there are several technologies that have been funded for systematically validating the functional consequences of regulatory elements, and we have been involved in the development of one of those, in collaboration with Tarjei Mikkelsen, where we can now test thousands of enhancers that are designed from scratch. This uses plasmids, certainly, and therefore the native chromatin context might not be the same, but we can now test the effect of individual mutations on the expression of downstream genes in reporter assays. What we're finding is that disrupting the causal motifs that we're predicting here will in fact disrupt the expression of downstream genes in a reporter assay, suggesting that those specific motifs are responsible for controlling enhancer activity. And in fact, if you make neutral mutations within the binding site itself, then you maintain activity of the downstream gene. If you shuffle the binding site, or if you make a mutation that changes the high-information-content bases, then you disrupt downstream activity. So I think this is one type of technology where you can test individual regulatory elements in isolation. I think what we're going to see in the future, and there are a lot of technologies underway for doing that, is ways to massively test elements in their native chromatin context, to integrate them into the genome, and I think that's in a way one of the powers of the model organisms, that you can do that systematically. So I think we're in for many surprises on the technological end. Every few years we think, wow, that was a great few years, but I think looking forward we have much more to expect.