 Okay, good morning. I'm Mike Cherry from Stanford University, and my group runs the Data Coordination Center for the ENCODE Project. Unfortunately there's too much to cover today, so I won't be giving you an interactive demo of this, but hopefully I'll point you to enough information that you'll be able to do that on your own. We hope that we've done enough as far as creating tutorials and such that you'll be able to do that. So what I'm talking about right now is data accessibility, the processing that we do at the DCC, and an introduction to the standards that the data is put through. A little later after me, actually not that long from now, you'll hear about how to interpret some of the data, as well as the analysis that comes at the very end. Okay, so the goal of what I'm telling you today is that you'll learn, at least in concept, how to download data, where to find the data, the features of our portal, why we create metadata, maybe not everything the metadata is about, as well as a really cursory introduction to the processing pipelines that we've created. Okay, so as you've seen, ENCODE is a collection of assays of characteristics of the genome. These assays, associated together, are used to define where promoters and enhancers and such might be. The assays themselves are just measuring various characteristics of the genome, such as accessibility of the DNA and where particular proteins are bound. So I want to focus my talk on a topic that came up in this particular issue of Nature: the reproducibility of research. Of course this is a big problem, and we all want to make sure that the data we provide is standardized and reproducible. Nature put together a collection of articles on this that have come out over the past couple of years. 
There is a URL at the bottom; you probably can't read it, but I think if you Google that very top line, you'll find this issue. So some of the big concerns pointed out here: the antibodies are a problem, and I'll say more about this; you need quality control and particular measures of confidence; you want to make sure that there's appropriate replication designed into the experiment; the data needs to be accessible and not hidden away somewhere; and there need to be standardized, unique identifiers for the data sets themselves. Okay, so I'll start in on the accessibility of the data. As has been mentioned a couple of times, encodeproject.org is the main portal for the project. You can basically find everything the project has; it's a very, very transparent project. Everything that we have is available from this website. This includes documentation of our standards and publications, as Mike had already mentioned. If you go to this website now, really quickly before I move on, you'll see that there are four main menus at the top. You can get at the data that we provide. You can understand what we're doing through the documentation we provide, and lists of publications. And in particular, the menu on the farthest right there is Help. We've been putting together large sets of tutorials and guides. We certainly need to develop that more, but I think you can get into the project pretty quickly by going there. In particular, as Fung mentioned, there was a users meeting last summer, a three-day meeting with lots of tutorials and presentations. You'll find videos of those there. They're not your common little five-minute tutorials; it's actually the whole talk. So you'll hear from Jill and the others in an hour format about what they actually have done, in a much richer way than we'll get in the short presentations today. So I encourage you to go there. 
And we'll also have the videos and everything from this meeting go up there. And of course, there's always a search box for getting information. And if you really just want to get in there quickly and don't know where to start, there's a quick help down at the bottom. Okay, so the main look of the page: if you go in and click on Data, then go to Assays, you'll get this page. This is the starting view. You'll see all of the experiments that have been collected by us. I did miss one little thing up here. Sorry. The total. As of today we have over 5,000 experiments available. That's far more than 5,000 files, because of the replicates and such. This includes a large number of biosamples, the biological material that has been included. And of course, there are hundreds of terabytes of files available for download from our Amazon instance. As of last week, we've incorporated the Roadmap Epigenomics metadata into this site. It was quite a project to synchronize the metadata: we've taken the Roadmap metadata and incorporated it into our site, and had to do a lot of work to map it to the standards that we use for ENCODE itself. And that reminds me to say that at the very bottom of this page, you have the ability to switch between projects. You'll find a Projects heading at the very bottom, and you can click there to see just the Roadmap experiments. By default, you come in and it's the ENCODE experiments. So this is called a faceted search, which is basically a way of filtering. We have large amounts of metadata, very rich, with hundreds of fields available for each experiment. And what we've done on the site is create these facets, sort of pre-computed filters. If you click on these numbers there, like on a shopping site, you reduce the total number of matches as you go forward. And so in this example here, we've clicked it down to a smaller set. 
So there are only six experiments. At this point, or any point along the way, you can click on the download button. It doesn't actually download the files immediately. It gives you the URLs that you can use to retrieve those files, because you may have many files, and you probably don't necessarily want to download all of those to your Mac or whatever. With the URLs, you can download from any command line. We also give you a short summary of the metadata associated with those files. There is also a Visualize button, and you can use track hubs to display these particular experimental results on the Santa Cruz browser as well. Okay, so that's the biggest feature of the ENCODE portal as far as getting at the data. This is normally where we'd rather spend a lot of time on an interactive session, but I encourage you to explore that, and hopefully the tutorials are there. If you need assistance from the DCC in any particular way, there's a help desk email. I certainly encourage you all to tell us anything: if it works, if it doesn't work, how can we help you? As you dig deeper into the individual experiments, you'll see that we've accessioned basically everything, so everything about the experiments has been accessioned. In this case, an antibody has been used in an experiment, and so the antibody itself has a record. The biosamples have records, and these are all accessioned. So we've accessioned things by categories. It's basically shortcuts so that we can remember these license plate numbers. It's ENC at the beginning, then there's a type code, like FF for files, then numbers and then letters. All the accession numbers look this way, so they're easy to find. A particular aspect of this is that you can share the accession number with somebody else. 
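As a sketch of what you can do with that list of URLs (this is my own illustration, not an official ENCODE script; the manifest filename and its one-URL-per-line layout are assumptions), a few lines of Python are enough to fetch every file from the command line:

```python
# Sketch: fetch every file listed in the URL manifest that the portal's
# download button produces.  Assumes one URL per line in "files.txt";
# this is an illustration, not an official ENCODE tool.
import os
import urllib.request

def filename_from_url(url):
    """Last path component of the URL, e.g. ENCFF000ABC.fastq.gz."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def download_manifest(manifest_path, dest_dir="."):
    os.makedirs(dest_dir, exist_ok=True)
    with open(manifest_path) as fh:
        for line in fh:
            url = line.strip()
            if not url.startswith("http"):
                continue  # skip blank lines and any non-URL header lines
            target = os.path.join(dest_dir, filename_from_url(url))
            if not os.path.exists(target):  # crude resume support
                urllib.request.urlretrieve(url, target)
```

From a shell, something like `xargs -n 1 curl -O -L < files.txt` does the same job.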
The accession number may go to a file, but it's also associated with the metadata for that file. So you really only have to give somebody the accession number, and then they can retrieve whatever you want them to. You can also share a list of accession numbers, and that allows them to download the same files that you had. So you don't need to share some complicated file name or directory path or such; you just need the accession number itself. So I've talked a little bit about a few of these already. We need to have high standards for the information that comes in. We want to have replicates designed into the experiments themselves. And part of the standards is to create metadata, and this is really an amazing part of the ENCODE project: the laboratories have committed to sharing the information about their experiments. Those of you in the wet lab, you know that you write down in your lab books how you made the solution, where you bought your products, how you did the extraction of your DNA, what speed you ran the centrifuge at, all this kind of information. Normally when you read a paper, you won't find this information, right? They'll say, oh, we did it with a method that we previously published here. You go to that paper, and it says, oh, we used this method that was mentioned on this website, but it references a supplemental site, and it was done with changes not published, right? So how are you supposed to reproduce that experiment? Read their minds? Call them up? The student has already left. The lab notebooks have been lost, right? So there's a commitment from the ENCODE production labs that they provide us all this information. It's a requirement, okay? So this is really incredible, right? It's just part of the process of ENCODE that all this metadata is transmitted to us, made available, and stored in a database. You can retrieve it all. 
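Those license plate style identifiers follow a fixed shape, which makes them easy to validate or grep out of a document. A minimal sketch (the type codes in the comment are from memory and may be incomplete):

```python
import re

# ENCODE accessions: "ENC" + a two-letter type code + three digits +
# three uppercase letters, e.g. ENCFF001ABC for a file.  Type codes
# I recall include FF (file), SR (experiment), BS (biosample) and
# AB (antibody); this list is illustrative, not exhaustive.
ACCESSION = re.compile(r"^ENC[A-Z]{2}\d{3}[A-Z]{3}$")

def is_accession(text):
    """True if text looks like an ENCODE accession number."""
    return ACCESSION.match(text) is not None
```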
Built into the standardization is that the experiments, the particular assays, have to match a standardized methods approach, and for many of them this includes replicates. So in this particular experiment, you see the same biosample; two different libraries were made from that biosample, and each one of those libraries was sequenced independently. At the bottom there, what it's showing is an accession number for each of the FASTQ files associated with those replicates. So you can download the replicates independently of everything else. Again, the transparency of sharing all the data together. Again, I really have to congratulate all the labs for their commitment to doing this. It's a lot of extra work. There are standards for the quality of the data and the read depth of the data, and of course they're providing all the metadata. It's typically something that most labs wouldn't want to do; here it's just a requirement of the project. The whole point of ENCODE, of course, is to create the highest quality standard of results, of metadata, of all the processing. So the point is that you really can't get better data than what you can get out of ENCODE. It's just not going to be there. We often get asked, can we incorporate our data into the ENCODE portal? And the issue is the level of these standards. When people say they want to do that, they don't approach us; they approach the labs, typically, and the labs will tell them what they have to do, and to date we've had maybe no more than one person say, oh yeah, I can do that. It's a really high bar that's put on the ENCODE labs. So moving forward with some of the other aspects of this, we want to have control of the confidence and reliability of samples, and part of that goes to the pipelines that are created. 
So again, whether you're on the dry side or not, you might read in a paper that the data was processed in a particular way, using a method published somewhere else, with changes: we substituted this version of that, and all this. So you often don't really know what processing was involved; or they say "processed" and give you some name for a pipeline, and you don't actually know what goes on inside of it. ENCODE avoids this pipeline black box problem. We want to have everything completely transparent. So the pipelines have been defined, and are run by my group on each data type that's submitted to the project. And to do this, each one of the data-producing centers has come together with the others creating the same type of data to define a pipeline to be used. Now, this may not necessarily be the trendiest pipeline that's out there, and of course pipeline analysis itself is ongoing research, but what's been agreed upon is a very high quality, standard pipeline, and it's used for everything of that particular data type. This avoids the problem where, to compare data between different labs, you would have to understand very specifically how each lab processed the data to know whether it's really appropriate to compare the data one-for-one. All ENCODE data of a particular data type can be compared, because it's all been run the same way. It has the same standards for production, so everything is really the same. And if you want to understand the pipeline, you can look at a page that describes the pipeline itself; each one of the boxes tells you a little information about the step and what software was used. And of course, you have metadata for the sample and for the files, and you also have metadata for the processing of the files. So say on a particular experiment you want to look at the results. 
You can see the processing that was done on that file here. In this case, you see these duplicate boxes, one on top of the other: the replicates involved. Each replicate was processed and then joined at the end, so each replicate's processing is also shown here. And what it's showing in the upper right there is that if you click on one of the boxes, you'll get more detail about what happened underneath: you'll see what versions of the programs were used, and more detail flowing into there. We track all of this. This is really rich metadata that's provided. And as with the data coming in and flowing out of the pipelines, if an integrated analysis has been done on these files, we have the results of those analyses coming back into the portal, and the metadata associated with what was done in the computational labs is also included. So the very last thing here is the antibodies. If any of you have worked with antibodies, you know it's a little bit of magic how they're made. It's a huge problem, because the reproducibility of antibodies from one bunny to the next can be quite variable. And so ENCODE, again standardizing everything to really an extreme level, requires a lot of extra validation work to go into that. So what you'll see for a particular antibody here that has been determined to work is the Western blot provided. How many times do you look at a paper where they're talking about an antibody, and they show you their Western blot, why they think that antibody works? The ENCODE labs do that for you. So this is a huge amount of extra work, and there's quite a bit of grumbling because it is so much extra work, but it's a requirement of the project. I don't know what the exact numbers are, but a huge number of the antibodies just never pass; it's really a small number that do. They really have to process lots of them to get the few that work. 
So I hope you'll see that because of the standards and the transparency and the agreement between the labs, both the wet labs and the dry labs, ENCODE really provides you with the best possible data that you can get anywhere. And that allows you to compare your results with it, knowing that if you find some variation between your data and the ENCODE data, you can explore into the metadata and find out what might be happening there. We've hopefully reduced the variables associated with all this. And then the last two things. Since this is an advanced workshop, and for some of the things you'll hear a little bit later, if you want to start processing data coming out of ENCODE, we have created programmatic access into the portal. This is described a little bit in the Help section, under tutorials. There is an API which allows you to get at all of the metadata. Everything in the database is transparent. The facets expose only a small amount of that metadata; the facets are created as a user interface, for usability for people, but with a computer you can get at everything. There are examples of Python scripts that you can use to get into these files. So if you're a programmer, or can twist the arm of somebody who is, you can really get at all of this information in ENCODE, pulling it back as you want. And the little example here, if you look at the slides online when they're available: you basically just add a little tag at the end saying, give me this information in JSON format. So everything that you see on the page, the information used to build the page, is actually available as JSON. When you do a search and you see a long list of experiments there, you can get at this from the JSON itself. I really skimmed over the way that we standardize information and use ontologies. 
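That little tag at the end is a `format=json` query parameter. As a sketch of what programmatic access looks like (the search fields in the commented example are my own illustration, not a prescribed query), you can build the same request the web page runs using nothing but the Python standard library:

```python
# Sketch: ask the portal for JSON instead of HTML by adding format=json.
# The search parameters in the commented example are illustrative only.
import json
import urllib.parse
import urllib.request

BASE = "https://www.encodeproject.org"

def search_url(**params):
    """Build a /search/ URL that returns JSON instead of a web page."""
    params.setdefault("format", "json")
    return BASE + "/search/?" + urllib.parse.urlencode(params)

def fetch_json(url):
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires network): list accessions of matching experiments.
# results = fetch_json(search_url(type="Experiment", assay_title="ChIP-seq"))
# for record in results["@graph"]:
#     print(record["accession"])
```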
And so using the ENCODE standards for the metadata has allowed us to incorporate information from other projects, because we're using ontologies. For example, the Uberon ontology, which covers anatomy, plus cell line ontologies, assay ontologies and such, are all used, and that gives a standard way of going between experiments as well as projects. And we're starting to incorporate other types of information, as Mike and Elise tell us we should. And this is all coming out of the portal. And so the last thing is the group. We've really got a great group of people at Stanford there. The second row are the wranglers and the assistant curators that handle all this information. They in a sense have to touch all of the data coming out of the labs. They work with the labs to help them remember that particular fields of metadata need to be incorporated. There's a lot of effort really going into matching the design of the experiment, the types of replicates, how the metadata is described. And then in the bottom row, we have the folks that build the pipelines and maintain the software we use within the group for analyzing the data and pushing things back and forth. Two great people that have created our website, a really fantastic pair there, as well as the folks that do the UI, the system managers, and then the secretary for the group. Thank you.