So I will get started. I will give an overview, and then Joe will take over to describe how we made the encyclopedia. In case you have not heard of the ENCODE Consortium, the full name is the Encyclopedia of DNA Elements. This is a group photo of the consortium at the end of phase three, taken at the Salk Institute, and as you can see, it is a pretty big consortium.

So what do we do? We have a few goals. First, we aim to catalog all functional elements in the human and mouse genomes. We also want to develop freely available resources for the research community, and we want to study human as well as mouse and other model organisms; ENCODE used to include worm and fly in the past. In the end, we want to produce resources through data generation, data analysis, and a data repository, so that our results can benefit the broader research community.

Here is a brief history of the ENCODE project. It started in 2003 with a pilot phase, followed by phase two and phase three. The pilot phase used microarrays, and phase two started to use whole-genome sequencing. Phase three finished just a few months ago, and we are now in phase four, which started in February of 2017.

In order to be a good community resource, we aim to do a few things that make ENCODE usable for the community. First, we rapidly release all the data pre-publication through encodeproject.org, the ENCODE portal; as soon as a data set is deemed to be of high enough quality, it is released through the portal. Second, we make all the software tools and analysis pipelines open source and put them on GitHub so that anybody can download and use them freely. We have also worked very hard to establish data standards, quality control metrics, and analytical tools, so that we can evaluate the quality of each data set and flag those that may have issues with any of the metrics or standards.

Today we will talk about the ENCODE Encyclopedia, which is a compilation of the results we obtained from analyzing ENCODE data. Being a community project, we really care about how useful the data are. NHGRI, specifically Mike Pazin, a program officer at NHGRI, spends a lot of time cataloging how many papers cite ENCODE data. Here is a graph separating those papers into two groups: in blue are publications by ENCODE members, and in red are publications by the outside, non-ENCODE community. As you can see, more and more outside users have found that ENCODE data benefit their research.

So what kinds of publications use ENCODE data? Here is a breakdown of around 1,750 community publications using ENCODE data. You can see that a huge chunk of them use ENCODE data to try to understand human diseases, roughly the same number use it to understand basic biology, and a sizeable number use it to develop novel methods and software. Because the number one application of ENCODE data is human disease, that is why we come here and hope to facilitate this application. You can see all kinds of diseases: cancer, autoimmune, neurological, human genetic diseases, and cardiovascular. So that is why we are here.

Here is the website for the ENCODE portal I mentioned before; everything we produce is freely available through encodeproject.org. So there are a few advantages of working in a consortium, and the first is the data sets.
There are lots of data sets and they are publicly available, so everyone has access to them. Inside the consortium, we also try to coordinate the data production efforts so that many assays are performed on the same set of cells and tissues; that way the data sets are nice and neat and you can more easily perform integrative analysis. If you use public data from elsewhere, it may be a hodgepodge of different cell types and different numbers of assays, which is not as easy for integrative analysis. We also spend a lot of time defining and implementing uniform data processing pipelines and quality control, and the Data Coordination Center spends a lot of effort collecting and curating the metadata for ENCODE data sets, making sure they are correct, along with the standards. All of this ensures uniform processing and accurate metadata for integration.

Being in a consortium is also a little different from an individual research-lab project, because a lot of the decisions are made by working groups. How do working groups decide things? They get on the phone, so we have lots of conference calls, which is both good and bad: you can voice your opinions if you want to play a bigger role in a working group, but it does take time to get on those calls and contribute.

So what is new in ENCODE 4? There are a couple of new things. One is that we can take samples from the community, process them through the mapping centers and the characterization centers, produce the data, and feed them into the whole corpus of ENCODE data. The second is that we aim to take community data generated outside ENCODE and, through the Data Coordination Center, add them to the ENCODE collection as well. Unlike previous ENCODE phases, which did not have characterization centers, we now have a whole set of them. There are eight mapping centers; these are the PIs and the kinds of data each of them produces. The characterization centers include these five; I think three additional ones have been approved that I have not yet added to this list. As I mentioned, I am co-leading the Data Analysis Center with Mark Gerstein from Yale, and the Data Coordination Center is headed by Mike Cherry from Stanford. In addition to these centers, there are also computational analysis projects that are more loosely connected with data production; there are six of them for ENCODE 4.

Let me summarize a little bit about phase three ENCODE data production. As I mentioned, you can go to the ENCODE portal, encodeproject.org, to get all the data. I also want to mention that the Roadmap Epigenomics project, which has ended, has all of its data available on the ENCODE portal as well. In total there are 9,000 data sets in ENCODE, and during phase three we produced close to 4,000 data sets on human and just above 1,000 data sets on mouse. There is also a collaboration between ENCODE and the GTEx Consortium: we collect human tissue samples through the GTEx route and then process them in ENCODE. There are four donors, two males and two females, and multiple tissues, as you can see in this matrix.
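As a concrete illustration of getting at the data I just mentioned on encodeproject.org, here is a minimal sketch that queries the portal's public JSON search interface from Python. The endpoint shape and the parameter names used here (assay_title, biosample_ontology.term_name, limit) are assumptions based on the portal's search URLs and may differ between metadata releases, so treat this as a starting point rather than an official client.

```python
# Minimal sketch of programmatic access to the ENCODE portal's public JSON
# search interface. Parameter names (assay_title, biosample_ontology.term_name)
# are assumptions and may vary across portal metadata releases.
import requests

PORTAL = "https://www.encodeproject.org"

def search_experiments(assay_title, biosample, limit=10):
    """Return released Experiment records matching an assay and biosample."""
    params = {
        "type": "Experiment",
        "status": "released",
        "assay_title": assay_title,                 # e.g. "TF ChIP-seq" (assumed field name)
        "biosample_ontology.term_name": biosample,  # e.g. "K562" (assumed field name)
        "format": "json",
        "limit": limit,
    }
    resp = requests.get(f"{PORTAL}/search/", params=params,
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    return resp.json()["@graph"]

if __name__ == "__main__":
    for exp in search_experiments("TF ChIP-seq", "K562"):
        target = exp.get("target")
        label = target.get("label") if isinstance(target, dict) else target
        print(exp["accession"], label)
```

The same style of query works for any of the coordinated cell types and assays mentioned above, which is what makes the matrix-style data production convenient for integrative analysis.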
One notable part of the mouse component of ENCODE 3 is Bing Ren's project at UCSD, whose team, with members at several other universities, harvested mouse tissues across developmental time points from embryonic day 10.5 to adult, and also across tissues. This is valuable because these kinds of tissues are difficult to obtain for humans. Each of these tissues was assayed for eight histone marks, DNA methylation by whole-genome bisulfite sequencing, and RNA-seq, so it is a pretty rich data set.

Brenton Graveley's project worked on RNA binding proteins with a multi-pronged approach. They performed eCLIP for a large number of RNA binding proteins in two cell lines, K562 and HepG2; HepG2 is a liver hepatocyte cell line, and K562 is a red-blood-cell-derived cell line. They also performed 70 RNA Bind-n-Seq in vitro experiments to determine the binding specificity of each RNA binding protein, and they performed RIP-seq for over 30 RBPs across three cell types. Here is a genome browser shot of their eCLIP data for the RBP RBFOX2, and here is the input; you can see the signal is much higher than the input.

So that is an overview, and today we will focus on the ENCODE Encyclopedia. As I said earlier, it is like a summary of all the analysis results. All the raw data, actually everything, are deposited at the ENCODE portal, and they are run through uniform processing pipelines. The pipelines output a number of things: for example, for DNase-seq data you get DNase hypersensitive sites; for histone mark ChIP-seq data you get peaks, and so on and so forth; for the RNA binding proteins we just mentioned, you also get peaks. These outputs from the uniform processing pipelines constitute the ground level of the ENCODE Encyclopedia.

Today we will focus on a particular component of the integrative level of the Encyclopedia, called the registry of candidate regulatory elements, which is an integration of all these raw data and the ground-level annotations to produce regions of the genome that are candidate regulatory elements. Some of them are like promoters, some are like enhancers, some are bound by CTCF, some are linked to target genes, and so on. We have also built a custom visualization tool called SCREEN that can display the registry and the underlying data. Next, Joel is going to talk about how this registry is made, and after that Michael will give a live demo of how SCREEN works, so make sure you get on the wifi so that you can follow along with the tutorial on SCREEN.
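To make the idea of the registry a little more concrete before Joel goes into detail, here is a conceptual sketch of how one accessible region might be assigned a promoter-like, enhancer-like, or CTCF-bound group from its signals. This is an illustrative toy, not the actual ENCODE registry pipeline: the thresholds, the z-score framing, and the 2 kb promoter window are assumptions made only for the example.

```python
# Conceptual sketch (NOT the actual ENCODE registry pipeline) of grouping a
# DNase-accessible region by histone-mark and CTCF signals plus TSS distance.
# All thresholds and the promoter window are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    chrom: str
    start: int
    end: int
    dnase: float       # accessibility signal (e.g. a z-score)
    h3k4me3: float     # promoter-associated mark
    h3k27ac: float     # enhancer-associated mark
    ctcf: float        # CTCF binding signal
    tss_distance: int  # distance to nearest annotated TSS, in bp

def classify(c: Candidate, high: float = 1.64, promoter_window: int = 2000) -> str:
    """Assign a rough regulatory group to one accessible region."""
    if c.dnase < high:
        return "low-signal"
    if c.h3k4me3 >= high and c.tss_distance <= promoter_window:
        return "promoter-like"
    if c.h3k27ac >= high:
        return "enhancer-like"
    if c.ctcf >= high:
        return "CTCF-only"
    return "DNase-only"

# Example: a distal accessible region with strong H3K27ac -> "enhancer-like"
print(classify(Candidate("chr1", 1_000_000, 1_000_300,
                         dnase=3.2, h3k4me3=0.4, h3k27ac=2.8,
                         ctcf=0.2, tss_distance=45_000)))
```

The real registry integrates many cell types and uses carefully calibrated signal statistics; the point of this sketch is only to show the kind of per-region decision that turns ground-level peaks and signals into candidate promoters, enhancers, and CTCF-bound elements, which is what SCREEN then lets you browse.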