Okay, we're going to go ahead and get started. I'd like to begin by thanking you all and welcoming you to our ENCODE users meeting. It's an honor to have you all here. I'm Mike Pazin, and I'm here on behalf of the National Human Genome Research Institute. We're part of the U.S. National Institutes of Health, and we're the funding agency that organizes ENCODE. I'd like to start with a couple of logistical announcements. First of all, the break area is down the hall and to the left; they're going to put up signs later, and it will be open all day. People can just walk down there if you want a coffee in the middle of a session, so if I'm boring you, you can step right out. And they will be replenishing the water that's on the tables during lunch and during the afternoon sessions, so there will be more. All right. I'd like to thank you all for being here today, and I'd like to tell you our two main objectives for this meeting. First, we'd like to tell you about the ENCODE resource: what it is, how it can be used, and why we think it is an important thing at NIH. Second, and very important, we'd like to hear from people who are using the resource, or are interested in using it, about what is working for them, what changes would help, and how to make things better. So we're trying to get those two functions going at the same time. What I'm going to do first is tell you a little bit about the ENCODE resource: its history, its goals, and its approaches. From there, I'll move on and tell you rather quickly about high-level ways that you could use ENCODE for different kinds of research. I'm going to focus on examples from human disease, but I think ENCODE goes beyond that. And then I'll finish up by telling you a little bit about how you can access ENCODE materials. I'll go through the second and third parts rather quickly.
The reason for this, of course, is that that's the point of the workshop: you'll be hearing a lot of this in detail from other presenters later. So I'd like to begin by telling you that functional genomics is very important to NHGRI, the National Human Genome Research Institute, and it's important to us for a number of different reasons. One that I would highlight is that we know non-coding regions are very important for disease, and non-coding regions are not well studied by a lot of approaches. If you look at the vast majority of findings from GWAS studies, or the vast majority of heritability imputed from those studies, most of it appears to lie in non-coding regions. Also, if you look anecdotally at individual diseases, there are examples like Fragile X, a Mendelian disorder where essentially all of the heritability lies in one non-coding variant, or ALS, where the largest amount of heritability associated with a single variant is a non-coding variant; that's more than all of the other variants known at this point put together. So non-coding DNA is very important, but non-coding DNA is not easily studied. The ENCODE Consortium is one group that's working on this. The way the ENCODE Consortium works, you can see in purple that there are data production groups, a number of them focused on different data types. There's a data coordination center, and you'll be hearing from people at the data coordination center today, that accessions the data and then allows it to be shared. There's a data analysis center and an analysis working group that together take the lead on the low-level analysis, the uniform processing of the data. We also have computational analysis groups and technology development groups. And together, you can see on the bottom of the slide the output: the encyclopedia, gene models, chromatin states, element IDs, and regulatory regions.
So I'd like to quickly tell you about what I think are some of the highlights of ENCODE's accomplishments. The ENCODE Consortium is sharing thousands of data sets. They're generally shared through GEO, in an unrestricted manner. The data today are released pre-publication, without any embargo. The data are high quality and contain replicates, and we uniformly process the data to facilitate comparison from one experiment to the next. We also put a lot of effort into sharing software, both through our data portal and through standard resources like GitHub. We work hard at data interoperability, both within the consortium and with our partners, for instance IHEC, the International Human Epigenome Consortium, where we're trying to set up common standards and ontologies. We've also recently developed an unrestricted-access consent to allow people to participate in human research and share their data through an unrestricted-access mechanism. But mainly, in terms of accomplishments, I'd like to tell you about ENCODE publications. In addition to hundreds of publications that have come from the consortium, there are at least a thousand publications that have come from what we call the community: people outside of ENCODE, who don't receive ENCODE funding, yet are using the data in their publications. About a third of these publications are studies of human disease, which I think attests to the high translational value of the resource. As you might expect, a large number of different disorders are being studied using ENCODE; cancer, allergy, autoimmunity, inflammatory disorders, and neurologic and psychiatric disorders are high up on that list. So ENCODE stands for the Encyclopedia of DNA Elements. And the aspirational goal of ENCODE, though I don't know that this is achievable or that one could ever tell if we got there, is to identify all of the candidate functional elements in the genome.
And the twin goal that goes with this, without which the first goal would be irrelevant, is to share that resource freely and make it available to the biomedical community so that it can be widely used. We think this resource is very useful in studies of the genetic basis of human disease. We also think it's useful in studies of gene regulation, but of course it's freely available, so however anyone wants to use it is fine with us. Part of the rationale for having this project is that reading the human genome is not an easy thing to do. If one is interested in protein-coding regions, we have the genetic code, which is very successful for about 1% of the human genome in terms of understanding what it might do. But for the regulatory part of the genome, we don't have a corresponding regulatory code where one can just look at the sequence and figure out which parts are functional and what it is that they're doing. One can use sequence conservation, and that can help to identify candidate functional elements, but even when that works, it doesn't tell you what those elements do, or when and where they function. So we think it's very important to have unbiased experimental data collection in order to figure out what the rest of the genome is up to. In a nutshell, this is the problem. This is one two-millionth of the human genome, and if one stares at a pattern like this, I would imagine for most of us, though there is the occasional person with an unusual cognitive gift, it's just a string of letters. Now, of course, I'm exaggerating here; we have a way around this. We use maps and annotation to add understanding, so that we can learn something from this. Here the exons are colored in blue, and a region with a variant associated with allergy and asthma is colored in red, or you can look at the picture below and see the same kind of chart.
So now you get extra information by having the maps and the annotation, but I'd argue that this still doesn't get you very far: you don't know how this region is involved in asthma and allergy. So our philosophy, like that of many others, is that richer maps would provide additional information and add more insight. If you take a cartoon of the same area, showing a gene and a red arrow pointing to a candidate variant associated with human disease, you can add in some ENCODE data, and from that you can see in the red box that some of these tracks show signals that cluster in one area; that's a candidate regulatory element. Moreover, if you look at that, it turns out that the signal is particularly strong in one cell type, Th2 cells. So that could tell you something about where this element might be working. Moreover, this element lies within a particular gene, RAD50, right? So the standard thing might be to say that this element is connected to the RAD50 gene, but using ENCODE one can make forecasts, and the forecasts are that this element appears to be working through the neighboring genes. In the course of the workshop, you'll hear about how you can make these kinds of predictions. Of course, predictions need to be experimentally tested. ENCODE is built on decades of gene regulation research. Mechanistic studies have found lots of signatures that are connected with gene regulation, and in many cases, causality has been established between them. What ENCODE is doing, like a lot of other projects, is asking: can we reverse engineer this? Can we take these signals that are associated with gene regulation and use them to predict where regulation may be happening? So we're doing a variety of genomic assays looking at chromatin structure, transcriptomes, and proteins binding to DNA and RNA.
And from this, we're inferring where genes may be, the transcripts that may be coming from those genes, and also the regulatory elements you can see at the bottom of the cartoon. So we're collecting lots and lots of RNA data: long RNA and short RNA, with long RNA telling us about mRNAs and short RNA telling us about microRNAs, and depending on the kind of non-coding RNA, it could be in either group. All of the RNA data that we collect is cell specific. This is important in terms of what you can get out of ENCODE; it gives you extra information that goes beyond what you get from germline genetic variants. One of the projects of ENCODE, GENCODE, is manually curating transcripts that come from the RNA data. We're also looking very carefully at chromatin structure, which has been intimately connected to gene regulation for a number of years. We're collecting data on DNase hypersensitivity, histone modifications, and DNA methylation. Primarily, this tells us where regulatory regions, enhancers, and promoters are. Again, the data are cell specific, so it gives us information about what cell type the events are happening in. We're also collecting information about nucleic acids and where proteins are bound to them. We're doing this with DNA-binding proteins, especially transcription factors, and that tells us a lot about transcriptional regulation, transcriptional start sites, and where regulatory proteins and regulatory elements are. We're also doing this, and this is a new thing with the current phase of ENCODE, with proteins that are bound to RNA. And of course, RNA-binding proteins tell us about splicing, translational regulation, and stability. So in summary, what is the ENCODE resource for us? It's a freely shared catalog of candidate regulatory elements. ENCODE is built upon years of study of gene regulation, so I think it's a very solid foundation.
And one can use ENCODE maps to make predictions about gene regulation or the role of genetic variation in human disease. So with that, I'll transition to the next part: how could one use ENCODE to understand the role of genetic variants in human disease? I think the best way to start is to think about some high-level use cases, all right? If one has candidate variants associated with a disease, one can make predictions about what the causal variants are, what the target genes for those variants are, what cell type is involved, and what the mechanism of regulation might be. What I'm going to talk about today is from the standpoint of having genetic data, but if you have epigenetic data and epigenetic cohorts, you can do a lot of the same analysis, with some different caveats that come into play. Similarly, I'm going to talk about the case of germline genetic data, but a lot of this would also work with somatic variants. So if we start with predicting causal variants: if one has variants associated with a trait or disease, those are not necessarily the variants that cause that trait or disease. This happens for a number of reasons, shown on the slide. First, it may be that multiple variants are in linkage disequilibrium, so statistically, those variants all have the same score; one can't tell which is better or worse. It may also be that the causal variant was not tested in the experiment, in which case there's no assessment of how likely that variant, the true causal variant, is to be important. And finally, there may be more than one variant at work. A lot of the time we talk about finding the causal variant, but in fact, multiple variants may be important.
So I think ENCODE is important and has something to say here, as do other epigenomics projects like IHEC and Roadmap Epigenomics, because what's been found in the course of study is that many disease-associated variants lie within parts of the genome that are highlighted as candidate regulatory elements, as you can see from this slide. That suggests that knowing about these candidate functional elements might help you understand something about disease. Now, the simplest way to get at this, and I'm going to go through it very quickly because you'll be hearing about it in workshop session two, is to use tools that have pre-computed analyses. HaploReg, from Manolis Kellis's lab at MIT, will accept as inputs things like SNP IDs or genomic coordinates, and it will quickly return a list of ENCODE findings for those entries. Similarly, there's a tool called RegulomeDB, developed by Mike Cherry's and Mike Snyder's labs, which you'll also be hearing about in workshop session two. Again, you can enter VCF files, lists of SNP IDs, or genomic coordinates, and it will first return, at arrow number three, a score assessing how likely each entry is to be a regulatory element, and then also detailed information. So you can quickly get information out of ENCODE without knowing a lot about how ENCODE is set up or how gene regulation works. Another feature RegulomeDB has is its GWAS database: a list of curated GWAS studies done by other groups, presented as hyperlinks. For each one, you can click to see the list of variants, then click through to see which study associated that variant with the trait or disease and what the evidence is that it is or is not a regulatory element. I'd also like to call your attention to something you'll be seeing in session four: Fangyua from ENCODE has developed the ENCODE cis-element browser, which has a number of different functions.
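To give a feel for what these lookup tools are doing at their core, the basic operation is an interval overlap: given a variant position, find the candidate regulatory elements that contain it. This is a minimal sketch; the element coordinates and labels are invented for illustration, not real ENCODE data.

```python
# Sketch: flag variants that fall inside candidate regulatory elements,
# the kind of overlap lookup tools like RegulomeDB or HaploReg perform.
# Elements use half-open intervals (start inclusive, end exclusive),
# as in BED files. All coordinates here are made up.

def overlapping_elements(variant_pos, elements):
    """Return the (start, end, label) elements containing variant_pos."""
    return [e for e in elements if e[0] <= variant_pos < e[1]]

# Toy candidate elements on one chromosome: (start, end, label)
elements = [
    (1000, 1500, "candidate enhancer"),
    (5000, 5600, "candidate promoter"),
]

print(overlapping_elements(1250, elements))  # inside the enhancer
print(overlapping_elements(3000, elements))  # no overlap -> []
```

A real analysis would do this per chromosome and per cell type, over tens of thousands of elements, but the logic is the same.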
And one of the things it can do is find elements that flank a gene of interest; you can also adjust the window that you're going to search, so it can quickly find regulatory elements in the vicinity of a region of interest. And perhaps my favorite is looking at actual browser tracks; depending on what one knows about particular genes and gene regulation, it can be very helpful to look at the tracks themselves. You'll be hearing some of this in session one of the workshops today. So another important use case is predicting target genes. This may seem counterintuitive to some people: you have a variant, it sits on a piece of DNA, so of course you know the gene. That works pretty well for variants that lie in coding regions. But what's been found in ENCODE, and I think GTEx has similar findings, is that very often a regulatory variant works on a gene that's not the closest one to it, as you can see from this study from John Stamatoyannopoulos's lab and Greg Crawford's lab. The same holds true of variants linked with human traits and diseases: often the forecasts are that they work on a gene that's at some distance. So if one just says, I have this variant, I know that the neighboring gene is the target, one will be wrong, perhaps a lot of the time. Here's a concrete example of a variant that's associated with platelet count, and you can see the arrow going from it to a gene that's five genes away, JAK2. If one's research program moved through the genes one at a time, it might be many years before one got to the forecast target gene. In part based on regulatory signatures, and in part based on DNA-DNA interactions from ENCODE data, one can predict that this variant might be associated with JAK2. And you can see with this example, chosen for this reason, how this is transformative.
Instead of studying RCL1, involved in RNA 3' processing, and figuring out how that's involved in platelet count, we already know JAK2 is a tyrosine kinase in a well-studied pathway, and there are small molecules that regulate this pathway. That doesn't prove that this is the correct connection, but boy, was it transformative to be able to start with this high on one's list. There are a number of ways to make these kinds of predictions. One is through the regulatory elements database developed by Terry Furey's group. And I should point out that these slides have URLs on them; we're sharing the slides, so please don't feel compelled to write down URLs, okay? This tool will allow you to start with a genomic coordinate and find the regulatory elements, or to start with a gene of interest. If you start with a genomic coordinate range, it will tell you, here are candidate regulatory elements, as hyperlinks, all right? And if you click on one, such as the one with the red arrow, it gives you additional information, and one of the pieces of information it gives you, as you can see with the red arrow here, is a prediction of a gene that that element is associated with. And I think we have links to a handout on how to do this. You can also do this with the ENCODE cis-element browser; you'll hear about it in session four. You can start with the gene and ask what elements might be linked to it, and as in the example that I showed you at the beginning of the talk, you can make the same prediction through this method. We've also published a table of these data, and you can sort it by either the promoter start site or by the regulatory element start site, and you can scan the table or compute on it to make the same kinds of predictions. The ENCODE query tool examines the same data. So that's how the example that I showed you at the beginning came out.
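As a sketch of what computing on such a published element-to-gene table might look like, here is a minimal example. The table layout, coordinates, and most gene names are invented for illustration (JAK2 is the gene from the example above); a real table would have more columns and genome-scale rows.

```python
# Sketch: querying a flattened element-to-gene link table, as one might
# do with the published table described above. All rows are toy data.

links = [
    # (element_chrom, element_start, predicted_target_gene)
    ("chr9", 5_050_000, "JAK2"),
    ("chr9", 5_050_000, "INSL6"),
    ("chr5", 131_990_000, "IL13"),
]

def predicted_targets(chrom, start, table):
    """All genes the table links to the element at (chrom, start)."""
    return sorted(g for c, s, g in table if c == chrom and s == start)

print(predicted_targets("chr9", 5_050_000, links))  # -> ['INSL6', 'JAK2']
```

Sorting the same rows by element start versus by gene lets you scan either from the element side or from the gene side, which is the point of publishing the table in both orders.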
And this example was chosen in part because it's well known in mouse that this area is a regulatory region for these neighboring genes; knockout studies and transgenic studies have shown that. And you can make the same prediction for human from ENCODE data without knowing that finding. So I'd like to now move on to predicting target cell types. Sometimes this is surprising to people: why would you need to predict the target cell type for a disease? First of all, we know a lot of diseases affect more than one cell type. Think about, for instance, heart disease: you've got endothelial cells, smooth muscle cells, macrophages, liver; lots of cell types are affected. So if you have a genetic variant that's important in that disorder, you don't know which of the things on that list it acts in. Second, there's the idea that the defect doesn't have to be in the cell with the pathology. For instance, there are people who have a defect in their adaptive immune system, and in some cases, the answer is not defective lymphocytes; rather, it's a defective niche where those lymphocytes would have formed. Depending on how broadly one thinks about hypotheses, that may or may not have been a cell type one would consider. And finally, disease etiology is not always well known. I'm old enough to remember a time when it was considered a wild and crazy idea that type 1 diabetes might involve the immune system. Now it's textbook teaching, but I'm sure there are other things about disease that we don't know today that in 20 years will be well understood. So looking at these things from first principles can help you find what's going on. If one looked in the regulatory elements database at different regulatory elements, one of the things it reports is the activity of each candidate element by cell type, and in this example, the red arrow points to a very cell-type-specific element.
Getting this kind of information can help you narrow down which cell types might be at work. Similarly, if you look in RegulomeDB or in HaploReg, which again you'll be hearing about in workshop session two, they report not just the evidence that something is an element, but also what cell types that evidence comes from, and that can again help you figure out what cell type might be important. The cis-element browser, which you'll be hearing about in workshop session four, likewise reports information on cell specificity. So if one thinks a particular target gene is at work and one finds that that target gene is expressed very strongly in a small number of cell types and not in others, that can help you generate a hypothesis about what the affected cell type might be. And I think we have a handout on this. A more principled way to do this, not a workshop event, is to look for enrichment of genetic variants associated with a trait or disorder in candidate regulatory elements, as was done in this publication from John Stamatoyannopoulos's lab, now three years ago. By doing this, one can take a digestive disorder, Crohn's disease, known at the time to involve inflammatory cells, but if you didn't know that, you could look at this and ask: are the variants enriched in cells of the gut? Are they enriched in cells of the immune system? The answer here is that they're enriched in cells of the immune system. Now, something people often ask is, what if my cell type is not in ENCODE? Clearly this all works better if the exact cell type and state happens to be in ENCODE, and you can see that Th17 is the best hit; it's known that Th17 cells infiltrate the gut during episodes of Crohn's disease. But if you cover that up, you can also see Th1 cells. If you cover that up, you can see that all of the blood cells are enriched, and you can see in purple that the gut cells are not.
So you can see that even without the ideal cell, under some circumstances you will be able to distinguish between groups of cell types. I don't think the exact cell type always has to be present in ENCODE in order to get meaningful information. And the last thing I'd like to tell you about ENCODE data is that it may even be possible to use it to cast a wider net for genetic variants. From the same publication, enrichment was seen in heart samples for an electrical signaling trait in heart. If you look along the x-axis, the p-value for the GWAS threshold has been changed, and what you can see is that by relaxing the threshold, there's still very strong enrichment in heart cells. What you can't see is that the number of associated variants increases from on the order of 10 to on the order of 200 as one relaxes the p-value, yet there's still enrichment in the correct cell type. So if one had the throughput to test additional hypotheses, one could say, I'd like to look at that larger group; there might be informative variants in it, and then one can test that larger group in perhaps a more principled way. So in summary, I think the main use of ENCODE here is hypothesis generation. I would not go to ENCODE and say, from this website and that website, I know this variant causes this disorder in this cell type. But you can use it to make testable predictions about causal variants, cell types, affected genes, and even mechanism. So now I'm going to very quickly fly through how you would access ENCODE materials, because again, this is going to be a major part of the workshop; you're going to be hearing a lot about this. And again, the slides are shared, so you have access to all of the URLs. A big part of workshop one will be accessing the ENCODE portal. We have our data standards shared through the portal; this is a screenshot from the ENCODE portal.
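The enrichment idea in the last two slides can be sketched as a simple fold-enrichment calculation: compare the fraction of trait-associated variants landing in a cell type's candidate elements to the fraction of background variants that do. The counts below are toy numbers, and a real analysis like the published one would also assess significance against carefully matched background variant sets.

```python
# Sketch: fold enrichment of trait-associated variants in a cell type's
# candidate regulatory elements, relative to background variants.
# All counts are toy numbers for illustration.

def fold_enrichment(hits_trait, total_trait, hits_bg, total_bg):
    """(fraction of trait variants in elements) /
    (fraction of background variants in elements)."""
    return (hits_trait / total_trait) / (hits_bg / total_bg)

# e.g. 30 of 100 trait variants fall in Th17 elements, versus
# 500 of 10,000 background variants (5%): roughly 6-fold enriched
print(fold_enrichment(30, 100, 500, 10_000))
```

Repeating this per cell type and ranking the results is what produces plots like the Crohn's disease one, where Th17 comes out on top; relaxing the GWAS p-value threshold just changes which variants go into the numerator.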
encodeproject.org is the URL. And from the methods section you can see where our data standards are; we share them for two reasons that are important to us. One is transparency, so that you know what was done and how it was done, and you can decide for yourself what you think of the evidence. The other is that you may be doing related experimental work and may wonder about using similar guidelines, and of course we're happy to share that. We share our software through the portal and also through more standard mechanisms such as GitHub; again, from our portal, you can drop down to software tools. We're sharing two different kinds of software, and there are lots of ways to split it, right? Some of the software has been developed by the consortium, and we try to explain clearly when that's the case. In other cases, there's software developed by others in the community that we've found to be very useful; we want to call attention to it and give it a shout-out by putting it on the portal as well. We have ways that you can download and visualize ENCODE data; that, I would say, is the meat and potatoes of the portal, as it were, and you'll be hearing about it in workshop session one. Our portal is at encodeproject.org, and you can use faceted browsing, which you'll be hearing about, to gate on the type of data that you're interested in, depending on the way your research program is thinking about what's happening. You can then, for instance, move to visualizing tracks or downloading data. I want to comment quickly on the data policy. Since the beginning of this ENCODE phase, this is how the data policy works: the data are released as soon as they have passed quality control. They're immediately shared; there's no embargo. That means if you can see the data, you can immediately use it and publish on it. You don't have to figure out how long you have to wait or when you're allowed to use the data.
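For scripted access, the same faceted search is exposed by the portal as a REST API: adding format=json to a search URL returns results as machine-readable JSON. The sketch below only builds such a query URL, without making a network request; the facet names shown (type, assay_title) are examples, so check the portal's REST API documentation for the full list.

```python
# Sketch: building an ENCODE portal search URL that returns JSON
# instead of the HTML page. Facet names used here are examples.
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org/search/"

def encode_search_url(**facets):
    """Compose a portal search URL with JSON output requested."""
    params = dict(facets, format="json")
    return BASE + "?" + urlencode(params)

# e.g. all DNase-seq experiments, as JSON
url = encode_search_url(type="Experiment", assay_title="DNase-seq")
print(url)
```

Fetching that URL with any HTTP client then gives you the same result set you would see through faceted browsing, which makes bulk downloads scriptable.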
We have what we're calling the encyclopedia in-house. It's found on the annotations, or genome annotations, page, and you'll be hearing about it from Michael Purcaro, from Zhiping Weng's lab, in workshop session four. What we have here are a number of simplified tools and a number of more complex annotations. We're trying to be everything to everybody: we want every detailed analysis that we've ever done to be available if it's useful to somebody, but we're also trying to make simple tools available for anybody, regardless of what they may be trying to do. In the top half of this genomic annotations page, we have the simplified tools. You can, for instance, visualize data, and you can download a number of what we think are the most important human elements; we're going to do this soon for mouse as well. Further down the page, you can find links upon links to many, many detailed analyses that have been done for many specialized reasons. We also share publications: publications that come from the consortium, which are ENCODE-funded, and also publications that come from the community, at the lower red arrow. These are publications we found where people without ENCODE funding have used ENCODE data. If, out of the goodness of your heart, you're doing some of this work and would like to share it, send me an email; these are actually quite challenging to find, because "encode" is a common English word, right? There were three million "encode" publications before the project started, and it's getting worse. Again, the community publications attest to the translational value of the resource. I'd also like to comment on our partners in the International Human Epigenome Consortium. IHEC has a data portal that's very useful; it has data from a number of different projects. I encourage you to go check it out: you can find links to raw data, and you can find published data from a number of different projects.
I'll caution you that in ENCODE we're behind on getting all of our data onto the IHEC portal; if you're looking for ENCODE data, only about one third of it is on the IHEC portal. We hope to fix that soon. But a number of different consortia from different continents are involved in IHEC. I'm not going to go through this, because again, we're sharing the slides, but here are a number of useful URLs for accessing ENCODE data and data from our partners; a very handy thing to have. And I would point out that these slides are shared, so if they're useful to you for teaching or for sharing with colleagues, they're going to be on the internet, available for that purpose. So in summary, I'd like to say that the two main goals of ENCODE are to create this resource of candidate functional elements and then to share that resource with the biomedical community. We think the data are very useful for the study of basic biology and disease, and one reason for saying this is seeing the large number of publications that are using ENCODE data. I would like to thank my colleagues in the consortium; here's a photo from our recent consortium meeting. And also my colleagues at NHGRI: Elise Feingold, who's the scientific manager of the project; Dan Gilchrist, who's also here today; and Peter Good, who's been a long-term member of the project. So I'll stop there and get us set for our keynote speaker. I'll be around for the whole meeting if people have other questions, but I'm assuming we don't have any right now. Fung has an announcement.