Thanks, Gene. I'd like to thank you all for being here today and welcome you on behalf of NHGRI, the National Human Genome Research Institute, and also on behalf of ENCODE. Hopefully this will be a useful experience for you. We have a couple of important objectives. We'd like to tell you about ENCODE, what it is and how it can be used. But also, very importantly, this goes two ways: we'd like to hear from you all. What are ways that the resource, the presentation of the resource, or the presentation of these outreach efforts could be made more useful to the community? You're here as a group that can help us hear more about that. This picture at the bottom here is the ENCODE consortium, or some fraction of it. My heart goes out to these people, who work very hard to develop all of the data and do all of the analysis that you're going to be hearing a lot about. This large group of people has worked very hard to create this, and I want to thank them. Going off to the right are program colleagues of mine from the National Human Genome Research Institute. Elise Feingold was at the beginning of the project and started it out, and Dan Gilchrist, my colleague, is here with us at the meeting today. So I'm going to tell you a little bit about the ENCODE resource: what the rationale for it is, what it is and how it's made, a little bit about how it can be used to study human disease (that's not all it can be used for, but it's one important application), and a little bit about how to access ENCODE materials. To me, ENCODE in some ways makes me think of music. It has collections of lots of different things, and they all work together. There's not one thing that matters or one thing that's important; depending on what you're doing, you may care more about one instrument or another, or one kind of music, and the same kind of thing takes place in ENCODE. So the reason for ENCODE, I think, is quite clear.
With the completion of the Human Genome Project, it became obvious that understanding much of the genome is difficult. Understanding which portions of the genome are doing something of value is difficult. We have a genetic code that tells us about protein-coding regions, but that's a very small part of the genome; it's very sparse, right? So how do we find the regulatory regions? How do we find out what they're doing? How to identify them is another problem. And this is important because we know that non-coding DNA does a lot of important things. The vast majority of common disease associations, and the heritability of those associations, is imputed to lie in non-coding regions. So that's important. There are Mendelian disorders that are caused by genetic variation in non-coding regions. So by those two measures, among others, non-coding DNA is important. But it's not easy to study, and this functional information could be used to interpret disease, gene regulation, mechanisms of mutation, and so forth, if we had a better understanding of it. ENCODE is one project in this area; a lot of individual investigators are also working to get a better understanding of these issues. Here's a graphic way to see the problem. If you just take a snippet of the genome, not a random snippet, but a snippet, and look at it quickly, it's difficult for most people to get information out of it. Now, to some extent this is an exaggeration, right? We all know that this is partially a solved problem. You can have maps, like at the bottom of the slide, where you can mark exons and mark variants that are associated with disease, or you can do the same thing by color-coding the sequence. That gives you information you didn't get just from looking at the sequence. But I would argue that this is still pretty sparse in terms of information content, even though it's very valuable.
So the approach that ENCODE is taking, and a lot of individual investigators are doing this as well, is to build richer maps so that we might have more information. Here's how this might work. Now you can see the same GWAS variant in the context of a gene; this comes from non-ENCODE work. And from ENCODE data, you can see that there are biochemical signatures in this region, and we'll explain what they might look like. It turns out that this is the signature for a candidate regulatory region. So that tells you that this variant could lie in a regulatory region. ENCODE data can tell you other things because it's cell-type specific. You can see that this information is preferentially coming from a particular cell type, so that may give you some idea of which cell type is important in this process. And while this variant appears to lie within a particular gene (you can see that easily from a sequence alignment), there are ways to predict which gene it might actually work on. This is, of course, predictive, not known with certainty. In this case, the prediction is that this variant, if it's a regulatory variant, is actually working to control the neighboring genes. So being able to build these stories moving forward would help us much better appreciate the role of genetic variants in phenotypic traits. The goals of ENCODE are really simple to state; how to get there is another story. We're trying to identify all of the candidate functional elements in the genome. We're also trying to share this resource freely with the community. That really means unrestricted access, not controlled access. There's no login, there's no purchase; you simply go to the website and use whatever you want. People are using this to look at the genetic basis of disease, gene regulation, how mutation occurs, and so forth, but any purpose that you'd like to use it for is allowed. So the ENCODE resource is built upon decades of research on gene regulation.
For years, people have known that different biochemical signatures are associated with gene regulation, transcription, and post-transcriptional processes. These have been converted into genome-wide assays, and the results are used to reverse engineer, so to speak, predictions about how the genome is working. You can make predictions about where genes are; that's probably the simplest part. What are the transcripts that are encoded within those genes? And then what are the regulatory elements, such as promoters, enhancers, and other distal regulatory elements, that might be controlling those genes? So a big shout-out to the gene regulation community that has developed these techniques over many years. ENCODE data are very cell-type specific, which is both an advantage and a drawback. It's a drawback because you need to collect the data in many different cell fates and cell states, unlike, to a first approximation, DNA sequencing. But it also gives you richer information, because the answers that you get back tell you something about which cell types are involved. And if you look at differentiation over time, totipotent cells start off with one set of candidate regulatory elements. Some of those are extinguished over time, new elements appear to be turned on in progenitor cells, and then other elements turn on in fully differentiated cells. So these changes in regulatory element signatures tell us about cell fate. To date, ENCODE has done a lot of useful things that I think help the community. We're sharing thousands of data sets. Today they're shared without any embargo; there's no waiting a year until the consortium has had a chance to publish on them, say. They're shared through unrestricted access, the equivalent of GEO; there's no signing up for dbGaP or buying an account or anything to access the data.
The data are uniformly processed to a large extent, which minimizes the chance that cross-data-set comparisons turn up differential effects that aren't real but are artifacts of the analysis. We're also sharing the software that we've either produced or used in the project, so that others can use it and also to make things transparent. We're working with other projects and other individual investigators to make the data more interoperable, talking about standards and how things are communicated. One of the things that we share are ENCODE publications, and there are hundreds that have come from the consortium. But, perhaps more important for this audience, there are over a thousand publications from outside of ENCODE that have used ENCODE data. Some of these look at human disease, some look at basic biology, and they cover a wide spectrum of diseases; I think that attests to the translational value of the resource. So in summary, we're trying to create a catalog of all the functional elements of the genome. This is freely shared with the community. The foundation for this is established techniques and methodologies that have been used for years in gene regulation studies. And these maps can be used to make predictions about genome function, which can help you in other work. So now we're going to take a quick peek at how you might use ENCODE to study the role of genetic variation in human disease. Here I'm going to give you a few use cases, some ideas that I've gleaned from the literature. This is not the sum total of what can be done. The major idea here is that ENCODE supports hypothesis generation and refinement. High-throughput data are useful in identifying hypotheses to test.
Some of these uses are predicting causal variants for traits or diseases; predicting target genes for either regulatory elements or genetic variants; predicting cell types that are affected by regulatory elements or variants; and predicting upstream regulators, that is, understanding the pathways of how these systems work. So, prediction of causal variants. To some, this is an obvious problem; to others, it's not obvious that this is an issue. But if one does a study such as a GWAS, you get a locus, but there are other variants that are in LD. So that first step alone doesn't tell you what the causal variant is. ENCODE-type annotations can be used as one fine-mapping strategy. You've also got the issue that multiple variants may work on the same trait or disorder, so you might not be trying to find the causal variant, but rather a group of them. ENCODE data can help you figure that out. A recently published study looked at ENCODE and Common Fund Roadmap Epigenomics data and found that many GWAS findings overlap with these annotations, which suggests that this approach could be used. If one is not familiar with these kinds of assays, the simplest way to get started is with a couple of pre-computed resources. First, I'll tell you about HaploReg, from Manolis Kellis's group at MIT, which can help you sort through ENCODE and Roadmap data. With HaploReg, you can, step one, put in genomic coordinates or SNP IDs, or a list of them, and click Submit, and it will return pre-computed analysis of ENCODE data. This is just the first step of the kind of result that you get returned. It'll show you, for both the lead SNP and others that are in LD, what kinds of ENCODE and Roadmap annotations are found in that area. That can help guide you in deciding what experiment to do next. Similarly, Mike Snyder's group and Mike Cherry's group, both at Stanford, developed RegulomeDB.
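As a concrete illustration of the fine-mapping idea just described, keeping only the LD variants that overlap a candidate regulatory annotation, here is a minimal sketch. All variant IDs, coordinates, and annotations below are invented for illustration; a real analysis would use actual ENCODE or Roadmap interval files.

```python
# Toy fine-mapping filter: keep only LD variants that fall inside a
# candidate regulatory annotation. All data here are made up.

# Variants in LD with a lead GWAS SNP: (variant_id, chromosome, position)
ld_block = [
    ("rs_lead", "chr1", 1_050_200),
    ("rs_a",    "chr1", 1_050_900),
    ("rs_b",    "chr1", 1_062_400),
    ("rs_c",    "chr1", 1_075_100),
]

# Candidate regulatory elements: (chromosome, start, end, annotation)
elements = [
    ("chr1", 1_050_000, 1_051_000, "DNase hypersensitive site"),
    ("chr1", 1_074_800, 1_075_500, "enhancer-like signature"),
]

def overlapping_annotations(chrom, pos, elements):
    """Return the annotations of every element containing this position."""
    return [name for (c, start, end, name) in elements
            if c == chrom and start <= pos < end]

# Prioritize variants that fall inside a candidate element.
prioritized = {vid: overlapping_annotations(chrom, pos, elements)
               for vid, chrom, pos in ld_block
               if overlapping_annotations(chrom, pos, elements)}
print(prioritized)  # rs_b drops out; the other three overlap an element
```

The same filtering idea scales to real interval files; tools like bedtools do the intersection step genome-wide.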
RegulomeDB analyzes ENCODE data, Roadmap data, and some other data sources in a different way. Again, you can put in genomic coordinates or SNP IDs and submit them, and initially you're presented with a score: the likelihood that the genomic region you're looking at has regulatory potential. Then, of course, you can drill down, see the individual underlying data, and see why that assessment came about. As Gene said at the beginning, and I meant to say at the beginning but forgot, please don't worry about writing down URLs or detailed explanations. All of these slides are going to be shared, so that information will be available to everybody. There's also the ENCODE cis-element browser, which has a number of different functions for human data and mouse data. In this example, you can again put in a genomic coordinate and it will find candidate regulatory elements. In this case, it finds them by looking for DNase hypersensitive sites, and it will also tell you about candidate upstream regulators by looking for motifs that are found in those regions. Now, prediction of target genes is also an interesting problem. GTEx, ENCODE, and other projects are finding that it's not unusual for regulatory sites not to work on the nearest gene; they work on some gene that's further away. So if one finds a piece of DNA that you think is important and it's not protein coding, and you just guess that the nearest gene is its target, the growing story is you'll be wrong at least half the time, okay? And if we look at, for instance, predictions based on GWAS studies, a lot of times the regulatory region is quite distant from the gene. In a practical sense, what this could mean is that you might have some variant of interest and start studying the nearest gene, but perhaps there's a better candidate that one could work with, a better prediction.
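To make the proximity pitfall concrete, here is a toy sketch contrasting nearest-gene assignment with a confidence-ranked target prediction. The gene names, positions, and scores are invented; in practice the confidence scores would come from a prediction resource rather than being hand-assigned.

```python
# Toy illustration of why "nearest gene" can mislead: rank candidate
# target genes of a regulatory element by a prediction confidence score
# instead of by genomic distance. All values are invented.

element_pos = 2_000_000  # position of a candidate regulatory element

# (gene, TSS position, predicted-target confidence in [0, 1])
candidates = [
    ("GENE_NEAR", 2_010_000, 0.15),
    ("GENE_MID",  2_250_000, 0.05),
    ("GENE_FAR",  2_600_000, 0.85),
]

nearest = min(candidates, key=lambda g: abs(g[1] - element_pos))
best_predicted = max(candidates, key=lambda g: g[2])

print("nearest gene:       ", nearest[0])         # GENE_NEAR
print("best predicted gene:", best_predicted[0])  # GENE_FAR
```

With proximity as the only filter you would pick GENE_NEAR; the confidence-based prediction points 600 kb away instead, which matches the speaker's point that distal targets are common.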
That could change your research project: instead of spending a few months studying one gene and then the next, where your only filter was proximity, if you had principled ways to say which was the likely candidate, you could come up with a better answer. You can do this by using the Regulatory Elements database. Terry Furey's lab developed this, and it works a number of different ways. One of the things you can do is put in genomic coordinates, and it will again report DNase hypersensitive sites in that region, candidate regulatory regions. If you click on any of them, it gives you additional information, and one of the things it will tell you is whether there's a prediction for what gene is the likely target; it will give you a confidence estimate and what that target gene might be. The ENCODE cis-element browser, which I pointed out before, can also do this. At the bottom of these slides, you can see that you're going to be hearing more about this in workshop session three, for example, so this is just a very quick overview. But again, you can put in a gene name and this tool will return the candidate regulatory elements that might be linked to that gene. The one that I've highlighted here is the one I showed in that initial example of a common control region that affected many genes. I chose that example because in mouse that's been very well worked out: the region has been knocked out, and it does in fact control these genes. If you go to ENCODE human data, you could make the same prediction without having known any of that. The last use case I'll tell you about is predicting target cell types. Again, to some people it's surprising that this is something you would need to do; to others it's intuitive. But if you think about it, most human diseases affect several cell types. Sometimes they affect several cell types directly. Sometimes it's more subtle.
For instance, sometimes the affected cell type is a developmental niche for the cell type that actually shows the dramatic effect. There are people who lack lymphocytes because their thymus, the niche where lymphocytes develop, is impaired. So the affected cell type is not where the gene's effect appears. And sometimes, of course, the etiology of a disease is not well understood. I'm old enough to remember when it was a wild and crazy idea that type 1 diabetes, a metabolic disorder, might involve the immune system. We now know it's an autoimmune disorder, but for some fraction of the diseases that we work on today, we don't know the cell types. So again, in the Regulatory Elements database, if you call up different elements, in some cases those elements appear to be active in only some cell types; other elements appear to be broadly utilized. That could give you some information about the cell type that's affected. Similarly, if you look at HaploReg and RegulomeDB, which I talked about briefly, when you see the underlying evidence, if for instance you mouse over something in HaploReg, it tells you the cell types that the data came from. That could guide your hypothesis, especially if something is known about the disease. If you think Alzheimer's, okay: neurons, macrophages, glial cells. And your hit is for glial cells, right? Then that helps you refine things. It's more difficult to use if nothing is known about the disease. And if we look in the ENCODE cis-element browser and, for instance, look at gene expression, you can type in a mouse or human gene name. It might be that the gene you think is important in that disease is expressed in a restricted number of cell types; some other genes are broadly expressed. That again could guide your hypothesis as to what's going on.
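This kind of cell-type reasoning can be made quantitative by intersecting trait-associated variants with cell-type-specific regulatory elements and counting hits per cell type. A minimal sketch follows; all positions and cell-type labels are invented, and a real analysis would use genome-wide element sets plus a proper enrichment test (permutation or hypergeometric) rather than raw counts.

```python
# Toy cell-type enrichment: intersect trait-associated variant positions
# with regulatory elements annotated by cell type and count the overlaps.
# All numbers are made up for illustration.

# Variant positions associated with a trait (one chromosome, for brevity)
variants = [120, 340, 905, 1430, 1720, 2210]

# Regulatory elements active in each cell type: (start, end) intervals
elements_by_cell_type = {
    "glial":      [(100, 200), (900, 1000), (1700, 1800)],
    "neuron":     [(300, 400)],
    "macrophage": [(5000, 5100)],
}

def hits(variants, intervals):
    """Count variants that fall inside any of the given intervals."""
    return sum(any(s <= v < e for s, e in intervals) for v in variants)

for cell_type, intervals in elements_by_cell_type.items():
    n = hits(variants, intervals)
    print(f"{cell_type}: {n}/{len(variants)} variants in elements")
```

Here the glial elements capture the most variants, which in a real study would nominate glia as a candidate cell type, exactly the Alzheimer's-style reasoning described above.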
A more principled way to do this is to intersect genetic variants that are linked to a disorder with regulatory elements and then ask whether those regulatory elements are enriched in particular cell types. The code for doing this is shared at the URL on the slide. Then you can ask, in a more principled way, which of several candidate cell types might be involved in a particular disorder or process. So in summary, I think the main thing that ENCODE can help with is hypothesis generation and refinement, and this can be used to predict causal variants, to predict target genes, and to predict cell types and upstream regulators. All of the examples I've shown you are examples of how you could use this with genetic results. If you had epigenetic results or an epigenetic cohort, you could do many of the same things with ENCODE data; of course, if you're skilled in that, then you know that some additional caveats apply. In the same way, I've shown germline examples, but one could do all of this with somatic variants; again, additional caveats apply there. So now I'm going to quickly run through how you can access ENCODE materials. You're going to hear a lot more about this in workshops two, one, and five at least. The ENCODE portal, ENCODEproject.org, that's the one URL I would want you to remember, ENCODEproject.org, has a lot of this information. For instance, you can access the data under either Search or Matrix. If you go to Matrix, there's a nice representation, which I've crudely trimmed so you can't see most of it. It shows you the kind of data that's offered: data types, projects that the data come from, and so forth. And you can use facets to narrow down the subset of the data that you want to see. I want to emphasize the data policy. As I said at the beginning, this is freely shared. So if the data are there, you can use them. There's no restriction, no embargo or anything.
We do ask the standard scientific thing: cite the project. As with any other data source, if you use something, cite it. For reasons of transparency and reproducibility, it's also good to share accession numbers. If you share the accession numbers of what you did, then your peers can tell what you did and how you did it. Okay. New to the ENCODE portal is search by region. You can put in a genomic coordinate or SNP ID, find ENCODE data that overlap with that region, and visualize them. That's a quick way to get to the data for those of you who want to see tracks or download elements. There's an ENCODE Encyclopedia; you're going to be hearing about this, and it'll be the focus of workshop one, all right? It has a range of things, from annotations very close to the data up to high-level analysis. Ground-level annotations, close to the data, are things like where transcription factors are bound. Middle-level annotations put the data together to find things like where promoters and enhancers are and how you can visualize them. And top-level annotations put together many different types of data to generate inferences; you'll be hearing about that especially in workshop five. ENCODE publications: again, I would point out the community publications to you, because they are examples of how people outside of ENCODE are using the data to study human disease, to study basic biology, and to produce software tools. But there are also ENCODE publications that illustrate what was done and how the data can be used. We're very interested in sharing the standards for how things are done, and this is done for the same two reasons: for transparency, so that you know what was done and how it was done and can evaluate for yourself how credible it is; and also because there are many people in the field, including some that are newer to it, who are wondering, how should I be doing this?
We hold this out as one way that you can do it, but of course it's for each group to decide how to do it; we don't dictate standards to anybody. So you can click through and see the different types of standards and experiment documents that we share. We're sharing the software that we're using in the project, again so that people can understand what was done and how it was done, or so that people doing this kind of work can use the software if they want. Some of this software was created by ENCODE, and some was created outside of the project but we use it heavily, so both are important to know about and we share both kinds. And lastly, I'd like to give a shout-out to our partners in the International Human Epigenome Consortium. ENCODE is a member of this, as is the NIH Common Fund project Roadmap Epigenomics. A bunch of like-minded projects are collecting epigenomic data across many cell types and sharing them through different mechanisms. The IHEC data portal and project summary are available; there's lots of useful data there, including a subset of the ENCODE data. So I want to finish here with a list of URLs that I don't expect you to read or write down now. Again, all of these slides are shared, and you can click on the links rather than write them down. We share, for instance, tutorials from the project, including the results of this meeting, which will get posted online, analysis tools in the portal, and then IHEC resources. I'd highlight the ENCODE mailing list: if you want to hear regular updates on the project, you can join this list. So I'll conclude by reminding you that the goal of ENCODE is to find all of the candidate functional regions in the genome. This is an aspirational goal; I think there's no way to know when it's done or how it's done. And we're freely sharing the results that we find as a resource with the community.
And ENCODE data are already being used in studies of human disease and human biology, and they can be used in any way that anyone can come up with. I'll stop here by thanking the people in the ENCODE consortium first. Here's another representation of some of them. I have the privilege of working with smart people every day and hearing great ideas, and it makes my job really fun. And my program colleagues at NHGRI: Elise and Peter Good started the ENCODE project long before I had anything to do with it. Dan Gilchrist and I work on the project now along with Elise, and Dan is here today; I encourage you to say hi to Dan as well. They're great colleagues and people to work with. And of course, all of this work, like any other bit of science, is built on years of study that came before it: lots of work on gene regulation, lots of work on the genetic basis of human disease. So I'll stop there and take any questions, if anybody has any, or we can move on to the next talk. Okay, so now what I'd like to do is introduce our first scientific speaker, our keynote speaker. Oh, we have one question, thank you. [Question] Regarding the future of ENCODE, what does NHGRI plan to do? What kind of focus or emphasis? [Answer] Sure, some of this is public record and some of this is yet to be decided, and I can tell you about the public record part. We've recently accepted applications for a new round of ENCODE. They haven't been reviewed, and funding decisions haven't been made. But the applications we accepted were for new mapping centers, sort of like the production centers that we have now; computational analysis centers, an ongoing activity that we have now in ENCODE; and then a new type of activity, functional characterization centers. We also have applications for those, in addition to a data coordinating center and a data analysis center. So there's a good chance that, moving forward, there'll be more data production in ENCODE, more mapping.
There'll be releases of data through the DCC, and analyses. And then we might have additional information about the function of different genomic regions. The past iterations of ENCODE did not focus on functional analysis of the mapping results; the thought was that that was within the purview of the community. The idea was to do the mapping and come up with raw material for others to work on. But we've seen a need for us to be involved in this as well, and that's why that's come in. Thank you.