With that, I'll begin. I wanted to open the workshop by setting up the subsequent talks, but also by talking about the ENCODE data flow. Mike Payson this morning talked about many of the resources that are available to search and query ENCODE data; I want to talk about how the data flow from different members of the ENCODE Consortium to you, and also give a short introduction to the workshops coming up today and tomorrow. Within the ENCODE Consortium, there are production labs that generate the primary experimental data: the ChIP-seq data, DNase-seq data, RNA-seq data. There are also analysis groups that take this information and perform analyses to identify genomic regions of interest in the human and mouse genomes. All of this information, including the data files and the metadata for those assays and files, is submitted to the Data Coordination Center (DCC). The DCC takes that information with a primary goal of data processing: we process a large percentage of the data files in a uniform way, and we review the metadata with the labs. We also develop the data portal. The ENCODE data portal is the central source for ENCODE data, and this information can then be used by many other resources and researchers to do their work. Following me, you'll hear from Pauline from the UCSC Genome Browser, who will talk about how you can visualize ENCODE data there and what additional features they've added. Emily from Ensembl will talk about additional ways to search ENCODE data at Ensembl. Jill will talk about RegulomeDB and HaploReg, two web-based tools that allow you to annotate your variants. We submit all the processed data to GEO, and Seth and Ben will talk about the uniform processing pipelines.
Once those pipelines are run on all the data, the results will also be available from GEO. Finally, Fung will moderate a session with four speakers who will talk about how different members of the ENCODE Consortium have integrated the data to identify annotations, and have also developed tools that you can use to find these annotations in your own data sets. So that is the general data flow from ENCODE, and all the data are publicly available. The ENCODE portal, to reiterate, is the central source for ENCODE data: the primary experimental data and the consortium-generated analysis data. It's also a hub for project information, such as data standards, publications, and the pipelines that are run. One of our main goals is to provide high-quality metadata, and as I progress through the talk I hope to show you why that is important and how high-quality metadata can help you find the data that you want. The goal of this workshop is for you to be able to find the information you want on the ENCODE portal: information about the project itself and, the main reason we're here, the ENCODE data you're interested in. I'll show you how to download the data files you find, how we've instituted a one-click way to visualize the data at the UCSC Genome Browser, and, for those who are interested, how all the information you see on the website is available programmatically via the REST API. I'll end with that, and the exercises include examples and links so you can learn more. If that's something you're interested in, we can help answer questions as you work through those exercises as well.
First, the ENCODE portal organization. This is a screenshot, faded out, with the main elements called out. The menu bar is broken into Data, which is where you find the core of the ENCODE portal, all the data generated by the consortium; Methods, which has software and pipelines; About ENCODE, which is general project information; and finally Help documentation. There is a quick search box, a section with recent news and updates, and a quick-help panel in the left corner with easy links to get you where you want to go, whether that's browsing for data, a search example, or the ENCODE annotations. The project documentation is spread over several of the menu items. Methods has links to the experimental standards. You can find publications by ENCODE Consortium members, and also see how other researchers in the community have used ENCODE data: there is a section of community publications describing how researchers who are not funded by ENCODE have used the data. Finally, once the videos for this users' meeting are up, we'll link the talks to the agenda on the Tutorials page; previous tutorials are linked from that page as well. To emphasize what we want to be able to do: here is the list of publications. Many of them have their supplementary information attached, so you can download the supplementary info for a paper directly. A subset of the papers also have links to data sets, which in turn link to files that you can download from the ENCODE portal. The last piece of consortium information I want to point out at the portal is the antibody characterizations.
There is a large effort to make sure that the antibodies being used in the IP-based assays are of high quality. All the information about how an antibody performs in westerns and other secondary characterization methods is captured; this slide shows an shRNA knockdown demonstrating that the knockdown is effective, because you no longer see signal from that antibody. All of this is available at the portal under the Data > Antibodies tab. So if you're thinking about doing a ChIP-seq assay, you can look at which antibodies have been characterized by the ENCODE Consortium. There is a large number that consortium members felt did not meet the standards needed to pursue a genome-wide ChIP-seq assay, so they chose not to, but they make this information available on the web to communicate that to other researchers. Now, on to finding ENCODE data. I'm not going to do a live demo, but the slides I'm showing have a walkthrough in the exercises linked off the portal, so you can follow along with that handout. I'll describe what to do, and if you run into problems, raise your hand and someone will come help. The portal URL is encodeproject.org. There are two ways to find ENCODE data: you can browse under the Data menu by clicking on Assays, or you can do a free-text search. I'll start with a search example: type "skin" into the search box. You'll see a list of results for everything that matches the string "skin" on the ENCODE portal, in different categories of information: the experiments, the biosamples (the biological material that was assayed), and the publications and web pages on the portal that match that term.
Go ahead and click on Experiments, and that sends you to a list of 170 experiments that match the word "skin." You can see in the results that the first one is "skin of body," an exact match. For some of the other results it is not immediately obvious why "skin" matches the text you're seeing, but they are all cell types that make up your skin: melanocytes are here, I think, and there are keratinocytes further down the list. What the search is doing is following the relationships between the biological samples; I'll go into what we use to get those relationships a little further on. What the "skin" example shows you is that this is not a straight text match. Because we use ontologies to describe the biological material, there are relationships between the different biological concepts, and those relationships are searched when you do a free-text search. Now to browsing ENCODE data. Under the Data tab there's an Assays link. Clicking on it shows you the full list of ENCODE experiments that have been released by the ENCODE Consortium, about 4,800 experiments. At this point you can just take a look at what's available to you. On the left-hand side of the page is a set of categories with descriptors in them. These are facets, and they can be used to filter the search results down to the exact set of ENCODE assays you're interested in looking at. Now, I really hate shopping. Like, I really, really hate shopping, but I need to buy clothes. So when we were setting up the portal, one of the examples we talked about was the website Zappos.
I don't know if you've ever used Zappos, but essentially you can narrow down the hundreds of thousands of items they have and get to the items you want in three clicks. Shopping websites have figured out how to get you to the items you want, and buying them, in the fastest way possible, and that's essentially the philosophy we took with the genomic data. There are very many descriptors by which you can find genomic data, so we worked with the production labs and the analysis groups within the ENCODE Consortium to identify a core set of metadata that is essential to helping researchers, first, find the data you want, and second, giving you enough information to interpret the results you're seeing. All the information on the left-hand side is metadata that describes the assays. What can you do with it? You can see the full list of the different types of metadata we have. Selecting one item in a category narrows the results immediately. Selecting two items in the same category effectively works as an OR: you get the union of the two. Selecting an item in a different category acts as an AND: it joins the two categories. Going back to our skin example, you can select skin under Organ; I think it says "skin of body." Once you've done that (I don't have a screenshot of this step), it narrows you down to the same 170 assays you got by free-text search. One of our goals in putting an Organ category there is that you may not know which specific cell types, which subsection of the organ, were used. Our goal is to describe the biological material that was assayed with the most specific concept possible.
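The facet semantics just described, where repeated values within one category act as OR and values across categories act as AND, carry over directly to the portal's search URLs, which you can assemble in code. Here is a minimal sketch; note that the field names `assay_title` and `biosample_ontology.organ_slims` are my assumptions about the portal's query parameters, not something stated in the talk, so check them against a URL you build through the facets.

```python
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org/search/"

def build_search_url(facets):
    """Build a faceted search URL from a dict of facet -> value(s).

    Repeating one facet field (a list of values) acts as OR;
    combining different facet fields acts as AND.
    """
    pairs = [("type", "Experiment")]
    for field, values in facets.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            pairs.append((field, value))
    return BASE + "?" + urlencode(pairs)

# Hypothetical facet fields: RNA-seq OR ChIP-seq, AND organ = skin of body.
url = build_search_url({
    "assay_title": ["RNA-seq", "ChIP-seq"],
    "biosample_ontology.organ_slims": "skin of body",
})
```

The simplest way to verify the field names is to click the facets in the browser and compare the URL the portal produces with the one this helper builds.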
It could be a fibroblast of lung, or a fibroblast of foot skin, even of left foot skin. But you want to be able to find the full set even without knowing which specific terms were used to describe the biological material. I've talked about searching and filtering; you can use both at the same time. Taking the skin example: search for skin and you get the same results; select Assays and you get 170. If you're only looking for RNA-seq assays from adult skin samples, select RNA-seq and then adult, and you have the 10 experiments in the ENCODE data corpus that are of interest to you. So in a handful of clicks we've gone from 4,800 experiments, a completely overwhelming number, to the 10 that are of interest to you based on the biological question you want addressed. And what can you do with that? Great, I have a list of 10. What can I do with it? The first thing is to look at the data that have been generated. If you click on the Visualize link, we automatically generate a track hub that connects to the UCSC Genome Browser for you to evaluate. Pauline will talk more about the mechanics of configuring tracks and using the UCSC Genome Browser, but what we provide at the ENCODE portal is a one-click way to visualize the information associated with the assays. Next, say you've looked at the results and these are the files you want. Maybe you want the raw data to process yourself, so that it's uniformly processed alongside the data from your lab, and you want links to download the files. Once you click the Download link, a little pop-up window explains what to expect when downloading the files.
It gives you a small command you can use. If you only have a small subset of files, you can download to your laptop, but you can also take this file, move it to your server or your high-performance compute cluster, and download the files there. I've mentioned metadata, and I've talked about finding assays in groups. Now I want to step into more detail per experiment, per assay, about what type of metadata you should expect when you're visualizing the files, downloading the data, or, the third option I haven't talked about yet, looking at a single assay on the web page itself. What we wanted to do was create a metadata model that reflects how the assays and the analyses are done. An experiment can have one or many biological replicates, which can have one or many technical replicates. Each biological/technical replicate combination has a raw data file, which is acted upon by various software and pipelines to create a processed file; I have a BAM here as an example. That is then potentially combined with data from a second replicate, or with data from a control experiment, for further analysis, and is used to generate peak calls: regions of the genome thought to contain a binding site for a specific transcription factor. That's a general schematic of what you can expect. I'm going to spend a little more time on what assay information we capture as part of the biological replicates. When we talked with the production labs and the consortium, and with other researchers (many of us at the DCC come from a molecular biology background), it was clear that there are reagents that are reused across different assays.
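The batch-download step described above, where you take the provided file list to a server or cluster, can be sketched in plain Python. This assumes the download manifest is a plain-text file with one URL per line (which is how I read the talk's description); the accession in the test and comments is a hypothetical placeholder, not a real file.

```python
import os
import urllib.request

def file_name_from_url(url):
    """Derive a local file name from a download URL, dropping any query string."""
    return os.path.basename(url.split("?", 1)[0])

def download_from_manifest(manifest_path, dest_dir="."):
    """Download every URL listed, one per line, in a files.txt-style manifest.

    Skips blank lines and saves each file under its URL basename.
    Returns the list of file names written.
    """
    saved = []
    with open(manifest_path) as fh:
        for line in fh:
            url = line.strip()
            if not url:
                continue
            name = file_name_from_url(url)
            urllib.request.urlretrieve(url, os.path.join(dest_dir, name))
            saved.append(name)
    return saved
```

On a compute cluster you'd call `download_from_manifest("files.txt", "/scratch/encode")`; for a handful of files on a laptop, the portal's own command works just as well.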
So we identified the reagents that are commonly reused across the wide range of assays a lab does. First, biosamples: you get one sample of material and maybe do 10 assays on it, but you want to make sure those 10 assays are linked to the specific tissue obtained by your lab. Second, antibodies: you buy a lot of an antibody and use it to ChIP that factor in, say, six cell types, or a hundred; you want that antibody referred to in exactly the same way in all the assays you're doing. Third, libraries: many times you'll go back to the same library and re-sequence it for various reasons, and we wanted to make sure we can link the right files to the right libraries that were sequenced. And finally, files: everyone takes FASTQs and runs them through multiple permutations of, to use the ChIP-seq example, a ChIP-seq pipeline with different peak callers, different methods used to identify where a factor may bind in the genome. We took this set of information and worked with the labs to identify the specific metadata that describes it; this list is really just a subset of the metadata we have. And then we accessioned them. Because you want to be able to refer to a unique item very explicitly, we assigned accessions to each of these items. Biosamples have biosample accessions. There are donor accessions. There are antibody accessions for the specific lot that is purchased, because there is potential lot-to-lot variation, so we accession the specific lot. Libraries are accessioned, and every file has a unique accession, so that when you run 10 or 100 files through a processing step, you know what your inputs were and can recreate your outputs.
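The reagent-and-accession model described above can be sketched as a few linked records, where each reusable item carries its own accession so it can be referenced unambiguously across assays. All the accession strings below are made-up placeholders in the general style of portal accessions, not real identifiers, and the field names are illustrative rather than the portal's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Library:
    """A sequencing library, reusable across sequencing runs."""
    accession: str                 # hypothetical, e.g. "ENCLB000AAA"
    biosample_accession: str       # links back to the tissue/cell sample
    file_accessions: List[str] = field(default_factory=list)

@dataclass
class Experiment:
    """An assay linking an antibody lot to one or more libraries."""
    accession: str
    antibody_lot_accession: str    # accessioned per lot, not per product
    libraries: List[Library] = field(default_factory=list)

# Ten assays on one biosample all point at the same biosample accession,
# so they stay linked to the specific material the lab obtained.
exp = Experiment(
    accession="ENCSR000AAA",
    antibody_lot_accession="ENCAB000AAA",
    libraries=[Library("ENCLB000AAA", "ENCBS000AAA", ["ENCFF000AAA"])],
)
```

The point of the structure is exactly what the talk describes: re-sequencing a library adds a file accession to an existing `Library` rather than creating a new, ambiguous copy of it.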
Going back to the biological samples I mentioned earlier: there are very many ways to describe a specific biological input. Somebody may call a sample a lung fibroblast; somebody else may call it a fibroblast obtained from lung. There is so much richness in human language that it is difficult to make sure everyone says exactly the same thing the same way. But when you're doing integrative analysis, you want to know that the files you're looking at are comparable to other similar types of information. A very simple example: you want to know the sex of the biosamples you obtained. One lab may capture that information as "male" and "female," all lower case; another lab may capture it as capital M and capital F. When you try to combine all that information computationally, it's difficult, because while a human reading it can assume that M is male (or maybe it's mixed, I don't know) and F is female, a computer program can't make those assumptions. This process of cleaning genomic data can take a significant amount of time. What we've done, especially for the biosample descriptions, is use publicly developed ontologies to describe that information. We selected a set of ontologies to describe assays, biological tissue materials, cell types, cell lines, and applied chemical treatments, with the idea that once multiple projects use the same set of ontologies, you get instant interoperability. We've worked with the Roadmap Epigenomics project to map their biological samples to the same set of ontologies, and this applies to any assays you do: if you describe your assays using the same set of ontologies as the ENCODE data, you can have instant interoperability with the ENCODE data as well.
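The data-cleaning problem described here, "M" versus "male" versus "F" versus "female," is the kind of thing a small controlled-vocabulary mapping handles. A sketch, assuming for illustration that "M" in the inputs really does mean male (the talk's point is precisely that a program cannot know this on its own, which is why unmapped values should fail loudly for human review):

```python
# Map the many ways labs record sex onto one controlled vocabulary.
SEX_ALIASES = {
    "m": "male", "male": "male",
    "f": "female", "female": "female",
    "mixed": "mixed", "unknown": "unknown",
}

def normalize_sex(raw):
    """Map a free-text sex annotation to a controlled term.

    Raises ValueError on anything unrecognized, so ambiguous records
    get routed to a human instead of being silently guessed.
    """
    key = raw.strip().lower()
    try:
        return SEX_ALIASES[key]
    except KeyError:
        raise ValueError(f"unrecognized sex annotation: {raw!r}")
```

This is the trivial case; the ontologies the talk describes solve the same problem for far messier vocabularies like tissue and cell-type names, where a hand-built alias table would never keep up.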
So what exactly is an ontology? Just a show of hands: is anyone familiar with what an ontology is? Okay, not too bad. In brief, an ontology is a set of terms that have relationships to each other, and all the relationships must be true. Because all the relationships must be true, you can make inferences about terms that are related multiple steps apart. In general, very specific ontology terms are called child terms, and the more general terms are called parent terms. Just as you can make inferences about a child and a grandparent, you can make inferences between a child term and a parent term three links away, based on the relationships. This is also why you cannot assert the relationship "a chromosome is part of a nucleus": if that relationship existed, it would imply that a mitochondrial chromosome is part of a nucleus, which is not true. We use this information to power the searches we did at the beginning, the search for skin, and to populate the facets, so that when you facet on the skin term you get the set of assays done against keratinocytes or melanocytes. This slide emphasizes that point: when different projects have annotated at different levels of the ontology, you can still search for skin and find Roadmap data and ENCODE data that can be considered together. So, going back to the portal itself: I've talked about how you can find sets of experiments, and a little about the metadata available to describe those experiments. If you're interested in the specific details of one experiment, you should expect to see the metadata I talked about on the web page itself.
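The parent/child inference just described can be illustrated with a toy term graph. The terms and edges below are illustrative only, not copied from any real ontology release; real ontologies also distinguish relation types (is-a, part-of), which this sketch collapses into a single parent link.

```python
# Toy parent links: each term maps to its more general parent terms.
PARENTS = {
    "keratinocyte": ["epidermal cell"],
    "melanocyte": ["epidermal cell"],
    "epidermal cell": ["skin of body"],
    "skin of body": ["organ"],
}

def ancestors(term):
    """Return every term reachable by walking parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

A free-text search for "skin" can then match keratinocyte assays even though the word "skin" appears nowhere in their labels, because `"skin of body" in ancestors("keratinocyte")` holds through the chain of relationships; this is the inference the talk describes, several links removed.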
This example here, ENCSR823VEE, is just a screenshot if you want to take a look on your laptops. The top is a summary of the general setup of that assay. The second section of the page covers how the assay was done: how the libraries were constructed, any protocols, and any additional documentation about how the assay was performed. Then comes the specific biological replicate information, with links to the biosamples that were used. Toward the bottom of the page you start seeing the data processing information: the list of files, the raw data, and the processed files underneath. Right above that is a graphical representation of each step in the data processing pipeline used to generate the processed files; it states the version of the software and what inputs were used to create what outputs. What we really want is a level of transparency about how the file you're looking at was generated, and to clearly indicate the data provenance, so that if you're really interested in recreating this information, you can go and do it yourself. On this page you can also download single files. Maybe you only want the FASTQs from a single experiment: you can click on that experiment and download the files from there. So far I've talked about web-based access to the ENCODE data. All the information you see on the website is also available programmatically as JSON objects. If you put "?format=json" after the URL, you get the data structure that is used to generate the web page itself. If you're interested in programmatically querying the ENCODE portal, this is how you would do it, and this is how programmatic access to the ENCODE data fits in.
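The `?format=json` trick just described can be wrapped in a tiny helper. This sketch relies only on the convention stated in the talk, appending `format=json` to a page URL, with `&` used when the URL already carries a query string; the experiment URL in the comment is the one shown on the slide.

```python
import json
import urllib.request

def json_url(url):
    """Append format=json, using & if the URL already has a query string."""
    sep = "&" if "?" in url else "?"
    return url + sep + "format=json"

def get_json(url):
    """Fetch the JSON representation of an ENCODE portal page.

    Example (network access required):
        get_json("https://www.encodeproject.org/experiments/ENCSR823VEE/")
    """
    with urllib.request.urlopen(json_url(url)) as resp:
        return json.load(resp)
```

The returned dictionary is the same structure the portal renders into the web page, so anything you can see on the experiment page, the summary, replicates, file lists, and pipeline steps, can be read out of it by key.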
This is the Data Coordination Center database; it contains all the metadata and links to the data files. Here is the metadata in JSON format. All of that information is available as JSON, and we actually use the same information to generate the web pages. You can query the metadata using our REST API; this URL links to more information about the REST API, and the very last page of the exercises has additional links to sample code and other modules you may need if you're going to write it in Python. But basically, you can add "?format=json" to the end of pretty much any web page to see the JSON used to generate it. The examples we show you are in Python, but you can query the database via the REST API in any language; we know people who have tried it in PHP, Perl, and C. And this slide makes the point that if you can construct a URL via the facets, you can copy that URL and programmatically query the search results as well; that's what this example shows. What I want to end with is putting the Data Coordination Center in context with other genomic resources. There are the genome browsers, there's UniProt, there are model organism databases. Our goal is to empower and facilitate the research done by the general scientific community; it's a role we have in the life cycle of data generation. There are publications, there are computational analyses, there are experimental data generated by various consortia and labs. What the DCC does, like other genomic resources, is take this information and put data structure on it. I talked about our metadata having a structure: we've basically taken genomic assay protocols and put them into a structured, queryable format.
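Copying a facet URL and querying it programmatically, as the slide describes, might look like the following. The `searchTerm` parameter and the top-level `"@graph"` results list reflect my understanding of the portal's JSON responses, not something stated in the talk, so verify both against the REST API documentation the exercises link to.

```python
import json
import urllib.request

# A search URL copied from the browser's address bar, with format=json added.
SEARCH = ("https://www.encodeproject.org/search/"
          "?type=Experiment&searchTerm=skin&format=json")

def accessions(search_json):
    """Pull experiment accessions out of a portal search response.

    Assumes results live under the top-level "@graph" list, with one
    "accession" field per result.
    """
    return [item["accession"] for item in search_json.get("@graph", [])]

def run_search(url=SEARCH):
    """Execute the search (network access required) and return accessions."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return accessions(json.load(resp))
```

Once you have the accessions, each one can be fetched individually the same way, which is how a script would walk from a faceted search down to the per-experiment file lists.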
There are other groups taking facts from the literature, tidbits from publications, and putting those into structured data. We organize this information to make data accessible, organized, and easy to find, so that it can help you in hypothesis generation, for instance with the ENCODE elements, the genomic regions identified by the project. All the tools we'll be talking about for the rest of the workshop are really there to empower you to find the information that will help accelerate your research, so that you can validate some of these findings, which then feed back into other genomic resources to help somebody else. That's my little soapbox on how you should view genomic databases. I'll end with some things coming in the future. Tomorrow we'll talk about the ENCODE uniform processing pipelines; if you go to the Pipelines page, you'll start seeing pipeline information as we release the pipelines. Michael Purcaro will talk about the genomic annotations and what you can find there as well. And coming soon: we've talked about querying the ENCODE data via the metadata, and during this workshop you'll hear how to query the ENCODE data via target genes or genomic regions. We've been working with other members of the consortium to integrate that information into the portal, so that instead of three clicks to get a list of assays based on the metadata, you can enter the genomic coordinates you're interested in and find the two or three data sets that might be of interest to you. So be on the lookout for that. And this is the ENCODE Data Coordination Center. We talk with the labs and make sure that ENCODE data are accessible, so don't hesitate to come find us during the meeting. Mike is the PI, the boss man; I'm the project manager, and I make sure the DCC runs day to day.
Gene and Seth are data wranglers; they're here, and everyone on the team can answer pretty much any question. You can always email us or tweet at us. Our code is publicly available, and that's where the pipeline code is as well. So with that, I'll end; I talked longer than I expected. For the next 15 minutes, please explore the exercises and use them. I can go through any other demo if you want me to, and I'll open it up to questions too.