Well, thank you all for coming this afternoon. My name is Seth, and Ben and I will teach you today about the ENCODE Uniform Processing Pipelines that we've developed at the DCC. We have a link here on setting up the environment for you to actually run these pipelines yourselves, so if you haven't already done that, you might want to click there and do it. You'll miss my introductory remarks while you focus on creating your account, but I guess that's okay. Before I start, I want to tell you that this is a very interactive session. We have the luxury of having two hours together, and my colleagues from the DCC are here and will be circulating around to help you. So if you get stuck at any point, please just raise your hand and someone will help you. The people who will be helping are my colleague Gene Davidson, who's waving at you from the back; Mike Cherry, who just introduced the session; and Yuri Hong here in the front. You'll meet Ben in a minute, because he'll be up here on the stage. And we're also joined by two colleagues from DNAnexus: Joe Dale, also in the back, and George Asimenos as well. So we have a lot of people here who can help you. We want this to be very much a workshop and not a lecture, so if at any point you're hung up and you want to ask me a question or Ben a question, or just have somebody come and help you, please just raise your hand. As for network issues, we'll just have to do our best. As Mike said, if you have multiple devices on the network right now, maybe shut down the ones you're not using to free up a few more addresses. Okay, so with that, I'll begin with a few leading questions, just to give us a chance to know who you are and what you'd like to learn here. Just with a show of hands, I'm going to ask some fairly random questions. How many of you have downloaded ENCODE data and intersected it with your own data? Okay, that's good. How many of you have already implemented at your institute a software pipeline based on what ENCODE has done? Okay, good. How many of you believe that you could repeat an ENCODE analysis, starting from FASTQs, to generate an IDR-thresholded set of peaks? Okay, that's a few, that's good. How many of you want to repeat the ENCODE analysis on your own data? That's great, because that's why we're here. And just a general question: how many of you have found in the past that you needed to access ENCODE data, but you found it difficult or you just didn't know where to start? Okay, some hands, that's okay too. We want to help with all of these. All right, thank you for that. So I've been writing pipeline code for several months now, and everything looks like a pipeline to me. You're actually in a pipeline yourselves, and I'm going to put this workshop in the context of the pipeline running through this meeting. Yesterday, Yuri showed you the ENCODE portal, which is really a way to access ENCODE data. Then Pauline and Emily showed you the UCSC Genome Browser and the Ensembl Browser: once you've accessed ENCODE data, how do you visualize it? And then Emily and Jill showed you some tools for actually interpreting ENCODE data. Right now, Ben and I are going to talk about how the processed data are actually generated through these processing pipelines. And then in the next workshop, you'll get to explore some advanced analysis methods from Michael and Yanli and Luca and Camden.
So that's where we are right now in the overall scheme of the workshops: talking about how the analysis data in ENCODE are generated. You might know that the DCC, the Data Coordinating Center, delivers all of the ENCODE data to you through the ENCODE portal. All of that data actually lives in the cloud, in an Amazon S3 bucket, and when you download files from the portal, they actually come out of this bucket. You can use the ENCODE portal to search and find that data, both the primary data and also the processed data. The DCC also delivers, through the portal's metadata, the information about the transformations from sample to library to primary data that helps you justify your interpretation of ENCODE data in biological terms. These are things like antibody specificity in a ChIP-seq experiment, the properties of the DNA libraries that were sequenced, protocol documents; all of that is available through the ENCODE portal. It's the metadata that describes the experiments that ENCODE has done. We're pretty good at that, but one thing that really has not been as well described, I think, is the transformation from primary data to processed data, and that's what the pipelines do. They take the primary data from the experiments and turn it into something that you might visualize on the UCSC Genome Browser or that you might take into subsequent statistical analyses of your own. To try to improve the transparency and the repeatability of this process, we started this project at the DCC of deploying the consortium-defined ENCODE processing pipelines. For the project we set the following goals, and I'm going to go through them now, because really the figure of merit for this workshop is whether, at the end, you feel that we have met some of these goals. This is what we're trying to accomplish. First, deploy the consortium-defined processing pipelines for four key experiment types: ChIP-seq, RNA-seq, DNase-seq, and whole-genome bisulfite sequencing for DNA methylation measurements. At the DCC, we are the ones who actually use the pipelines to generate the standard set of peak calls, quantitations, and methylation calls that make up the core set of analysis files from ENCODE, so we're going to use those pipelines to produce those. We're going to capture the metadata to make clear exactly what software we used, what versions, what parameters we ran it with, what inputs were used, and so forth. And then, of course, we'll capture, accession, and share with you the outputs of those pipelines. And here's the key, and this is why we're here and why we have this workshop today: we could have done all of that and it could have been a black box to you. We didn't want it to be that way. We want to deliver exactly the same pipelines in a form that absolutely anyone can run, either on their own data or against ENCODE data, in a scalable way. Whether you have one experiment or thousands of experiments, we want it to be a tractable installation. So those are the goals we set for ourselves for this pipeline development project, and they can be summed up by the words here at the bottom. We want the pipelines to be replicable. We want the provenance of the files generated by the pipeline, and of the software that's used, to be transparent. We want them to be relatively easy to use, and we want them to be scalable, because you may have one or ten or a hundred experiments to run, and we have 5,000 or 10,000.
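As a concrete illustration of that first point, finding the primary and processed data behind the portal, here is a minimal sketch in Python of how you might pull the processed files for one experiment through the portal's JSON interface; the download links it prints resolve to the Amazon bucket just mentioned. This is not part of the workshop materials, and the accession and filters are placeholders to adapt to your own case.

    import requests

    PORTAL = "https://www.encodeproject.org"
    accession = "ENCSR000AKP"   # placeholder experiment accession

    # Any portal object can be retrieved as JSON.
    experiment = requests.get(
        f"{PORTAL}/experiments/{accession}/",
        params={"format": "json"},
        headers={"Accept": "application/json"},
    ).json()

    # List the released, processed outputs (skip the raw FASTQs).
    for f in experiment.get("files", []):
        if f.get("status") == "released" and f.get("file_format") != "fastq":
            print(f["accession"], f.get("output_type"), f.get("file_format"))
            # f["href"] is a path like /files/ENCFF.../@@download/...;
            # requests.get(PORTAL + f["href"]) follows the redirect to the bucket.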
Okay, so those are the goals for this pipeline development project at the DCC. When we thought about how we would accomplish this, one of the questions we had to decide was where to deploy the pipelines. You probably have access to a high-performance computing cluster; we do too at Stanford, of course, and some of you might have your own. One possibility, of course, is to deploy this as a set of scripts that you could run on your own machines. We didn't choose that as our first deployment platform; instead, we chose to deploy these pipelines on the cloud. This is a summary of our thinking and how we reached the decision of where to deploy these pipelines first. What I've done here in this table is summarize some of the considerations for the different platform delivery choices. I wanted to see whether it's hard to develop the pipelines, whether it's hard to share them, whether it's hard for you to run them, whether they're elastic, which is this idea of being able to run them on just one experiment or on many hundreds, whether the provenance of exactly what was run, what inputs were used, and what software was used is clear, and how much it costs to run them. And for us, the cost also involves the cost of development. Okay, so if we were to deploy this as a set of scripts, maybe in a tarball that you downloaded and brought up on your cluster, it would be hard for us to develop, because we would have to write a lot of infrastructure software to run the pipeline steps that can run in parallel at the same time. It's somewhat difficult to share, too, because you need to install it on your cluster and you might not have the same versions of software; I don't know if you've ever had that problem of trying to install something when you don't have the right prerequisite software versions. It's also hard to run, because shell scripts sometimes just don't work. It's not particularly elastic unless you have an enormous cluster, and by elastic, I mean I want to be able to throw a thousand experiments at this pipeline and have it run in the same amount of time that it takes to run one. Provenance is somewhat more difficult because you might be running different software, or a different version of the software, than we are. And the cost is sort of obscure, because clusters are oftentimes subsidized and it's not always clear exactly how much it costs to run something. So you might say, well, just use a containerizing technology like Docker, so you can deliver a binary image that just runs. That improves some of these metrics, but not really all of them. However, we thought, what if we could just make this stuff run on the web, so that you come to a website and all the software is already ready to run on the cloud, on some compute cluster that you have nothing to do with? Then it actually is really easy to share, because everybody has a web browser; that's really all you need, that plus your data. It's very easy to run for the same reason. And it's very elastic because, in fact, as we'll tell you later, all of the compute that stands behind these pipelines is Amazon Web Services, which is the same compute that serves up Amazon Prime videos when you binge-watch TV series or whatever.
So the compute is a non-issue; there's plenty of capacity to run all of these experiments. The provenance is excellent because it's exactly the same software that we run to produce the standard output from ENCODE, and it's the software that you can run too. The cost maybe appears different, because you know exactly how much it costs, since this is a commercial web-based platform. So we chose to deploy the pipelines on this web-based cloud platform first. We know that people have their own clusters and that you want to be able to run this software on your own cluster as well, and so there will be subsequent deployments that will allow you to do that. Even today, all of the software that runs these pipelines is completely open source, and it can be adapted to run on your cluster as well. We've just chosen this platform called DNAnexus to deploy the pipelines first, and we actually run the pipelines there as well. So you'll hear a lot more today about DNAnexus and how the pipelines run there, but I just wanted to spend a couple of minutes talking about why we chose to deploy the pipelines there first. Yes? So I forgot a metric; what was the one that I forgot? Okay, very good. So the question is confidentiality. Yeah, that's a good point. And it so happens that this platform is compliant for PHI, so you can actually bring data that cannot be shared publicly to your workspace and run the software on the data, because we never see the data. The data reside in your project and it's secure; you bring the software, the compute, to your data. We'll talk more about exactly what that means later, but yes, you're right: if you have clinical data or other data that can't be made public, this environment can handle that. Yes, it can. So that's a good question. Okay, so I'm going to stop there and we're going to start the live demo portion of the presentation. I just want to reiterate that this is a workshop and we want this to be interactive, so please raise your hand if at any point you get lost or you have a question, and then we'll stop and answer your question or one of us will try to help you.
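One note before the live demo, for anyone who would later rather script against the platform than click through the website: DNAnexus also has a Python SDK, dxpy, and pipelines there can be launched programmatically. The sketch below is only an illustration under assumptions; the project ID, workflow ID, folder, and input names are placeholders and do not reflect the actual interface of the ENCODE workflows, which the demo will walk through.

    import dxpy

    PROJECT = "project-xxxx"     # placeholder: your DNAnexus project ID
    WORKFLOW = "workflow-xxxx"   # placeholder: a copy of one of the pipelines

    # Upload the reads into your project; the data stay inside that project.
    r1 = dxpy.upload_local_file("rep1_R1.fastq.gz", project=PROJECT,
                                wait_on_close=True)
    r2 = dxpy.upload_local_file("rep1_R2.fastq.gz", project=PROJECT,
                                wait_on_close=True)

    # Workflow inputs are keyed by "<stage>.<input name>"; these names are made up.
    workflow = dxpy.DXWorkflow(WORKFLOW, project=PROJECT)
    analysis = workflow.run(
        {"0.reads1": dxpy.dxlink(r1), "0.reads2": dxpy.dxlink(r2)},
        project=PROJECT, folder="/runs/rep1",
    )
    analysis.wait_on_done()      # block until every stage has finished
    print(analysis.describe()["state"])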