Okay, if I can have your attention please, we're going to get started in just a couple of minutes. Before we start, I'd like to make two quick announcements. One is that you've heard us talk sometimes about community publications. These are publications by people who are not funded by ENCODE that use ENCODE data, and we find and track those manually at NHGRI. So if, out of the goodness of your heart, you're aware of any of these and want to send them to us, thank you; that's very helpful.

The second thing I'd like to tell you about is something preliminary, but you're all here, so it seems timely and a good thing to talk about. Earlier someone asked about what's happening next within ENCODE. There are RFAs that were out, as many of you are likely aware, and we're in the process of seeing what comes next. You may have heard in some of the public discussion that two ideas have arisen recently. We're interested in the idea of community data, data that was not produced through ENCODE funding, being shared through the ENCODE DCC, if people want to do that. We're also interested in the idea of the consortium processing samples that come from the community. There are people who are experts in particular cell types, diseases, samples and so forth who are interested in having their samples processed by the consortium. That's something to think about. This next round of the consortium is not yet formed, so I don't know if this is going to happen, let alone how, and I can't promise it. But we're all here, and it's as good a time as any to start thinking about it. One of the challenges we would face is this: if we have data coming from other people, how would that data be processed uniformly? Because we know that if you process data through different pipelines, that sometimes creates artifacts, differential signals that are due simply to the processing rather than to changes in the data.

So with that, I'm going to move us on to our next featured session. Seth Stratton from the ENCODE DCC is going to tell you about the ENCODE pipelines. This is very near and dear to my heart, an idea that came from the DCC, not from NHGRI. By sharing these pipelines, we increase the transparency of the project, we make it easier for everybody to know what's being done, and we also offer them to people who want to use them in their own work. So now Seth's going to tell you about that. Thanks a lot.

Okay. So my name is Seth and I work for the ENCODE Data Coordinating Center, or DCC, here at Stanford. This is a workshop, so it will be very interactive, although I have a few introductory slides to get us warmed up. Because this is interactive, we have a lot of expert helpers on the floor who are going to help you actually run these pipelines in your own account. Before I begin, I wanted to introduce them; look around, they're mostly at the back. People who are standing, raise your hands and I'll tell you their names. Starting in the back corner here is Marcus. Gene is there, Esther is here, Marcus, and Nana is there. Tim, Ethan, Beck, Aditi, and there's Katrina, Mike in the middle. Chai is here, there's Cricket, Ben, Forrest, Alpha, and Jason. So that's a lot of helpers, and they all know the exercise that we're going to go through today.
So they'll be watching over your shoulder, and if you get hung up they might even come and offer their help. But if you need their help, please raise your hand. Just quickly, a thumbs up: how many of you have your DNAnexus account set up, ready to go? Okay, that's a lot, that's great, terrific. If you don't, you might want to do that now; it only takes a few minutes. If you do, you can actually follow along and do the pipeline runs that we're going to be showing today.

First, a little bit of motivation behind what we're trying to accomplish. In sessions yesterday and today, you've seen a lot of ways to access ENCODE data. You've seen, of course, the portal that we build, but you've also seen presentations on RegulomeDB, HaploReg, of course the Genome Browser, Factorbook, Epilogos, the ENCODE annotation tool from Weng's lab. A naive reaction to that might be: couldn't life be simpler? Why do we have all these ways of looking at ENCODE data? In fact, there's a really good reason, and it's that this is highly multi-dimensional data. A comparative or correlative approach is going to take many different axes through that data, so it's appropriate that we have lots of ways to slice that large corpus.

This is roughly the flow of data that populates those tools. I like Zhiping's term "ground level" data here: the data produced from each individual experiment that gets integrated later. There's primary data in the form of reads, because these are all next-generation sequencing based assays, and those reads get transformed into this ground-level data, the things you're familiar with, like histone modification tracks and TF binding sites and so forth. That's then distributed out to these sites.

Now, this rich ecosystem of ways to access the data is a good thing; it's not a needless complication. A rich ecosystem of analysis methodologies, on the other hand, is not necessarily a good thing, for the reason Mike gave at the beginning: I might have my way, you might have your way, there might be the old way we used to do it and the new way we're going to do it. When we have that, we can end up with this, and that tends to break the way we do comparative analysis downstream. So the idea I'm trying to get across is that while we have a diversity of data producers in ENCODE (several labs do ChIP-seq, for example, and several labs do the same assays on different factors), we want consistency: a common methodology that's repeatable and reusable for doing this initial transformation of primary data into ground-level data. We need this for ENCODE for the reasons I just gave, but you might also be interested in repeating exactly those transformations on the data from your own experiments, because if you do, your results are guaranteed to be comparable to ENCODE's results.

So that's the motivation behind what we're going to look at today: how we have implemented these common analysis methodologies, and how we have deployed them in a way that allows you to run them on your own data. Okay, so it's a workshop; what are our goals? First I'm going to introduce the analysis pipelines that we've built, and then we're going to get right into actually running some. I have some sample data sets that we can use today.
We're going to run a transcription factor ChIP-seq experiment from Richard Myers's lab at HudsonAlpha; that's a ZBED1 ChIP in K562. Then we're going to run the long RNA-seq pipeline on a total RNA experiment from human tissue. At that point we will have launched the pipelines, and they take a while to run. What I've made is chromosome 21 extracts, so they run a little faster than if we were running the experiments on the whole genome; they should take about 45 minutes to an hour. So we have to get these going quickly so that we'll have time at the end to actually visualize the results. While we're waiting, I'll talk a little more about the inputs and outputs of the pipelines and what you can expect when they finish. And when they do finish, we'll work together to visualize the outputs of the pipeline runs at the UCSC Genome Browser. What I would like as many of you as possible to take home is the ability to replicate these ENCODE analyses on your own data or on ENCODE data.

Okay, so the technical axis, if you want, of an ENCODE experiment can be understood as a series of transformations: you have a sample that has perhaps been treated or perturbed in some way, a library is prepared, and primary data are generated in the form of sequencing reads. One thing that our group, the Data Coordinating Center, is funded to do within the ENCODE Consortium is to gather all of the metadata that describe these transformations. We're sort of the materials-and-methods section for ENCODE. That's distributed through the ENCODE portal, which you saw yesterday, and it includes all sorts of information about how the experiments were done, characterization of the libraries that were produced, characterization of the antibodies that were used, and so forth. A lot of the effort we contribute to the consortium is in this area of wrangling: getting all of this metadata and primary data from the ENCODE production labs. All of the production labs' data and metadata come to us and get distributed through the portal to you.

We also, of course, deliver ENCODE data. All of the data that ENCODE production groups generate is submitted to us; we store it on the cloud, it's linked into the metadata database, and you get these relationships between files, primary data transformed into processed data, and all those files are freely available for download from the ENCODE portal. So we do a lot of work to document the transformations from sample to primary data; that's the metadata about ENCODE experiments. But we also wanted to document the transformation from primary data to processed data. So we started a project at the DCC, this pipelines project, whose output we'll be demonstrating today, with the following goals. We needed to process all of the ENCODE data in a uniform and comparable way. These are all pipelines that have been defined by working groups within the consortium: people interested in RNA-seq got together and defined the RNA-seq pipeline, and similarly for ChIP-seq and so forth. We wanted to deploy these defined pipelines for these data types and use them to generate all of the standard ENCODE peak calls and quantitations and so forth that you download.
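Everything the portal shows is also exposed as JSON over a REST API, so that metadata and those file relationships can be pulled programmatically. Here is a minimal sketch in Python; the accession is borrowed from the FASTQ folder name that comes up later in this tutorial and stands in for any experiment accession, and the field names are assumptions worth verifying against a live response:

```python
import requests

# Any ENCODE portal object can be fetched as JSON by asking for format=json.
# ENCSR286PCG is used here as an example accession; substitute your own.
url = "https://www.encodeproject.org/experiments/ENCSR286PCG/"
resp = requests.get(url, params={"format": "json"},
                    headers={"Accept": "application/json"})
resp.raise_for_status()
exp = resp.json()

print(exp["assay_term_name"])  # e.g. "ChIP-seq"
for f in exp.get("files", []):
    # Each embedded file record carries its format and a download path.
    print(f["accession"], f.get("file_format"),
          "https://www.encodeproject.org" + f["href"])
```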
And so what you get from the ENCODE portal is the primary data generated by the labs and the processed data generated through these defined pipelines at the DCC. I mentioned that we capture all the metadata about how experiments are done, but there's also metadata about processing: what software we used, what software versions, what parameters, what inputs were used for each individual run, what outputs were produced, and so forth. We capture, accession, and distribute all of that through the portal.

And then here's the reason you are here, and this was the extra thing that I think makes this a more interesting project: we wanted to deliver exactly the same code in a form that absolutely anyone could run, in a scalable way, on just one experiment or, as we run it, on thousands of experiments. By exactly the same code, I don't mean I gave you a tarball that you expand on your computer, where maybe you use some of the dependencies already on your computer and maybe the versions are different and so forth. I don't know if you've been through that before, but we wanted a way for you to run exactly what we run, and we needed a platform that would support that. We considered several different options and ultimately decided to deploy these processing pipelines to the cloud with a web-based interface. We're in Silicon Valley, everything's on the cloud, right? So this is cloud-based analysis. All the compute is actually done on Amazon Web Services, on virtual machines on AWS, and the interface is provided by this platform called DNAnexus. DNAnexus gives us a lot of user interface and data management tools that we don't have to write ourselves, so we can focus on writing the data analysis code, and when we deploy it, the infrastructure comes along.

All right, this slide just summarizes some of the considerations: whether it was hard or easy to develop, hard or easy to share, hard or easy for you to take a first bite at the pipelines the first time you run them, and the elasticity, whether it can be run on just one experiment or many. In the end we decided on this cloud-based platform, which I'm going to show you today. The code is all open source and freely available; the URL for our GitHub is here. If you look into the code, you'll see a little bit in there that's specific to the way we've deployed to this platform, but it's all Python, it's all very standard, and you'll recognize lots of software tools that you might already use integrated into those scripts.

Okay, end of introductory slides. This is a schema of the transcription factor ChIP-seq pipeline that we're going to run today. I thought I'd show you this picture before we jump in so you have a sense of all the steps that are going to run. Like many bioinformatics pipelines that start with reads, there's a mapping step that produces mapped reads. We derive signal tracks from that, the continuous tracks you're probably familiar with from the genome browser. We call peaks. The transcription factor pipeline has a pseudo-replication strategy in the middle here that allows us to compare replicates later. This experiment has been biologically replicated, so it has been repeated on two independent samples, and I have the data for those two replicates.
Peaks are called on those replicates, and then we use a statistical framework called IDR to compare them and determine which peaks are reproducible in the two replicates. It looks like a lot of steps, but I think most of them are familiar to you: you map, you call peaks, and then you compare your replicates to figure out what's real signal. We also have a histone modification ChIP-seq pipeline that differs from this a little, in that we don't use the IDR framework for histone ChIP, because the width of the peaks makes an IDR analysis difficult. So there's just a very simple replicate concordance step in the histone ChIP-seq pipeline, but otherwise we map and generate signal tracks and so forth in the same way.

This is a summary of the software tools integrated into the pipeline that do these steps. We use BWA, we use Picard, we use samtools, we use MACS2 to generate the signal tracks, SPP is the peak caller, IDR is the replicate concordance framework, and for histone modifications, MACS2 is actually our peak caller. As I said, we overlap the two replicates to determine the replicable peaks, and we hope to have some sort of IDR-like framework for histone replicates in the future. The input files are simple: FASTQs. The outputs are not simple, and we'll talk about them as we go through today, but you'll get signal tracks, you'll get peaks, and you'll get processed peak lists that have been compared replicate to replicate. There are also quite a few quality control metrics generated by the pipeline. These are super important, because these are the numbers that actually allow you to compare the quality of your ChIP and the depth of your sequencing to ENCODE standards. There's a standard cutoff for each of these metrics that all of the ENCODE labs have to meet in order for an experiment to appear on the portal, and you'll have access to those same metrics for your experiments if you run this pipeline.

Now, let's actually get into it. I'm going to switch away from my slides and go directly to the TF ChIP-seq PDF file. Bring that up on your computer, because I'm going to go through it step by step with you. That file should look something like this. All right, at this point, anybody who wants to run the pipeline: have you got the TF document up? Give me a thumbs up so I can see if we're all together, okay? Anybody having trouble, raise your hand and we'll get you started. We're going to pause here and make sure that everybody who wants to follow along has this up. So raise your hand if you need help just getting to this; it's okay, we'll help you get there. If you're wondering where to start: encode2016.org, the website for the meeting, has a link on it for workshop materials. Scroll down to our workshop; this PDF is linked off of that. So, encode2016.org; if you want to follow along and do this, you need this up on your computer, and I'll show you how. To get to that PDF, if you haven't gotten to it yet: encode2016.org will bring you to this page. Tutorials and workshop materials is right here in the middle; click on that. Scroll down to our session, which is right here, workshop session four. And over here there'll be links. We're going to start with the ChIP-seq transcription factor tutorial.
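Before the click-through continues, a quick aside to make that tool list concrete. Here is a rough sketch of the shape of the commands those stages wrap for a single-end replicate; the file names and flags are illustrative, not the exact ENCODE parameters, which live in the pipeline source on GitHub:

```python
import subprocess

def run(cmd):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(cmd, shell=True, check=True)

# Map single-end reads with BWA (aln/samse, the classic BWA mode for short
# ChIP-seq reads), then convert, filter, and sort with samtools.
run("bwa aln -t 8 hg19.fa rep1.fastq.gz > rep1.sai")
run("bwa samse hg19.fa rep1.sai rep1.fastq.gz | samtools view -bS - > rep1.raw.bam")
run("samtools view -b -F 1804 -q 30 rep1.raw.bam > rep1.filt.bam")
run("samtools sort -o rep1.bam rep1.filt.bam")

# MACS2 generates the signal tracks, and serves as the peak caller on the
# histone pipeline (the TF pipeline calls peaks with SPP instead).
run("macs2 callpeak -t rep1.bam -c ctl1.bam -f BAM -g hs -n rep1 --outdir peaks/")
```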
So, back to the tutorial: click on that link right there and it should download the PDF to your computer. All right, how many more have now got it? Thumbs up if you've got the TF PDF. That's great, terrific. It may take a while to download because we're all downloading the PDF at once, and it's got a lot of images in it, so it might take a minute; it will get to your computer eventually. Okay, please raise your hand if you want me to wait while you get yourself squared away before we begin. We want to wait over here. Okay, we're going to wait just another couple of minutes. Get that PDF. Okay, here we go.

So this is a TF ChIP-seq experiment. A link here in the PDF will take you to the experiment page at the ENCODE portal if you're interested in the experimental details. Here's a summary of the steps: we're going to find the pipeline software and copy it into your DNAnexus account, we're going to populate it with the example input files, and we're going to actually run it, watch its progress, and then visualize the output. Okay, you have your account; that's step one. Step two is to create a new project in your account. Once you have your DNAnexus account up on your screen, you should be able to create a new project with this green button here, new project. So do that now, and give your project some sort of useful name, like chip-demo or something like that. Let's get that done now so you've got a place to put all the pipeline files. Thumbs up if you've created a new project. Okay, right, your friends are getting ahead of you already; this is a race. Oh, and another good person to ask is the person next to you who just raised their hand: how'd you do that?

So the question, I think, was whether DNAnexus will ever be free. Let's see, we actually have some people from DNAnexus here today. Would you guys work for free? No, they don't. I don't work for free either. I know, it's almost free, but almost is not really free, right? There's a big difference there. Nothing is free.

Okay, so thumbs up, who's got their project created? Or, more importantly, raise your hand if you need me to wait here while you create your project. Okay, we're ready, all right, we're gonna go. So now is the time to actually add the ChIP-seq pipeline software and sample data. In the project you just created, there will be a green button called add data. Click on that add data button and you'll get a pop-up that'll ask whether you want to add the data from your computer, which is how you would upload something from your computer, or transfer from your server. What we're going to do is add from another project, because I already have all the software and example data loaded up in another DNAnexus project, so you can copy it directly into your account. So click on "from another project", and that'll give you a list of projects that you have access to. If you don't see the ENCODE Uniform Processing Pipelines project, type ENCODE into the search box here and you should see it come up; click on that ENCODE Uniform Processing Pipelines project. Okay, if you've got the ENCODE Uniform project, thumbs up. That's pretty good. Anybody want me to wait? We're gonna wait for just a minute. So get that ENCODE Uniform Processing Pipelines project open.
I see some helping going on, so I'm gonna wait for just another minute, another second. Okay, thumbs up, who's here? Who's on this page here? Okay, all right. So, to add the software and the sample data for the ChIP-seq pipeline, click on the box next to ChIP-seq. That selects that folder from this public project, where all of the ENCODE pipelines will always be available; this is where you'll always get your ENCODE pipelines as we update them and so forth. All of the test data is inside this folder as well. So if you just click on ChIP-seq, that's all you have to do: click on that checkbox there and then click on add data, right? Over here, we've selected the ChIP-seq folder by clicking on that checkbox, and then clicked on the green add data button. Thumbs up, who's done that? Okay, you guys are rocking on, that's good. You'll see something that looks like this, a little progress bar; you can hit close, okay? So who wants me to wait here? Okay, great.

What you've done now, after you click close, is transfer all of the files and all of the resources necessary to run the ENCODE ChIP-seq pipeline into your DNAnexus project; you've got it, all right? The example that we're going to run today is a TF ChIP-seq experiment, and we're going to map it to hg19, the hg19 human assembly. So what you want to do here is click on this ENCODE TF ChIP-seq hg19 workflow; that's what step 10 is. Click on that ENCODE TF ChIP-seq hg19 workflow, and after you do, at least the top of it should look something like this. Time out; I'm going to pause. How many are here? Some people, and not so many. Okay, that's cool, we're going to wait. If you need help, wave your hand.

So this is just the top of it; if you've got it open, you'll see a long list of stages. Each of those horizontal rows is a processing stage in this workflow, so this is the schematic representation of all of the steps that are going to run. What we're going to do next is populate those stages with inputs, which will instantiate this workflow as an actual analysis run. The workflow is a reusable template that you can use to start any TF ChIP-seq analysis run, but we'll give it a unique name and tell it where we want the outputs, and that will be an instantiation of this workflow as an analysis run. All right, so who's with me? Who's here? Raise your hand if you want me to keep pausing so you can work some more. We're going to pause just another minute. Okay, that's cool, yep. And nothing will go wrong if you work ahead. Please, ask a question.

Okay, so the question is about GEM, which I mentioned as a peak caller. We're not switching to GEM; we'll add GEM. SPP is a peak caller that produces highly stable peak lists that work very well with our IDR framework. GEM is also an outstanding peak caller and we want to be able to generate GEM peaks as well, but for this first rev we implemented just one peak caller, and it was the one that worked best with IDR. So GEM will be added soon. Okay, so helpers, how are we doing? We're good? Okay, terrific, that's great. So here we are: you have your workflow, let's name it. In this example you can name it anything you want, but I named it human ZBED1, the name of the factor that was actually ChIPped in this experiment.
So type that in this box here in the upper left of your workflow. Again, we're instantiating this workflow as an analysis run; that was step 11 in your protocol. After you name it, set your output folder. This just tells the platform where you want it to put your outputs, and it's this button here that says set output folder; click on that. Now I'm gonna create a new folder where the results will be stored. I can't show both on the screen at the same time, but after you click on set output folder, you'll see this pop-up here. Click on the folder that has a plus sign on it to make a new folder that will house your outputs, and give it a name. I called mine "human ZBED1 results" and then clicked plus to create that folder. So again, the overview: we're instantiating this workflow with its various inputs to start a real analysis run. It's a reusable workflow, a reusable template, but we're instantiating it with inputs to start an actual run.

All right, thumbs up if you've created your output folder. That's great. Raise your hand if you want me to wait. Okay, cool, we're rolling. You've created an output folder, so your workflow should now look something like this: you should have a name in the upper left and an output folder here, yes? Okay, cool. Now we're actually going to add data to this workflow. Step 15, if you're following along in the PDF: we're going to tell it where to find the FASTQs. We're going to select this reads1 box in the map-replicate-one stage. The pipeline takes single-end as well as paired-end sequencing runs. The example today is single-end sequencing, so you only have one FASTQ, with the read 1s for replicate one. I'm sorry that "read" and "replicate" both start with R and both have the number one in them; that makes it a little confusing. We're talking about the read 1s for replicate one. If you click on this reads1 button, a new window will open which allows you to navigate to your input files. It should look something like this, all right? Who's with me so far? We have this list of files here, okay? That's cool. Who wants me to wait? Okay, we'll wait a second. There are a few ways you can get completely turned around at this point, so I'm going to make sure everybody's okay before I move on. Raise your hand if you need help, if you're stuck. Okay, yeah, great.

While you guys are working, I'll answer these excellent questions. The question was: I don't have FASTQs, I've got BAMs. I already mapped my reads, because I map them my way, and I want to put those into your pipeline. Well, first of all, that's not the ENCODE pipeline anymore, because the ENCODE pipeline takes FASTQs, and I'm not just being pedantic about that; it actually matters. Remember that forest of my-way, your-way, his-way, her-way at the beginning? Mapping is a super important step in this process, and if you do it differently from the way we do it, I can't guarantee that the outputs are going to be the same. All right, so that was the pedantic answer. The real answer is yes, of course you can do it. It's very simple: you just delete the mapping step and you start out with a BAM file. But you're not running the ENCODE pipeline anymore once you've done that.
You're running a piece of it, which is not a bad thing, but I hope you see what I mean: the ENCODE pipeline really means beginning to end. Sure, well, yes; I don't have time to go through them all, but we can talk afterwards if you want. Okay, so thumbs up, who's here? Who's ready to keep rolling? Hands up if you'd like me to pause or need more help. Okay, we got one pause back in the back corner. Give me a thumbs up, Cricket, when you guys are ready to roll. Oh, I thought you said you had a question. No, no, sorry, I was being overly responsive. Yes: download files from, yes. Anything that has a URL, absolutely anything that has a URL; the transfer happens over the network, you don't have to bring it through your machine.

Okay, so we're gonna keep moving. You've chosen your read 1 now for replicate one, which is this file; it starts with R1, which here stands for replicate one. So you've chosen that file, and when you go back to your workflow, you should see that this reads1 is now filled in with that file name, okay? Thumbs up if you see that. Good. Repeat that process for replicate two, the control for replicate one, and the control for replicate two. You know how to do this already, so you're on your own: click on the reads1 boxes on each of those mapping stages, for map rep 2, map control 1, and map control 2, and add each of the individual FASTQs that start with R2, C1, and C2 respectively.

Okay, so the question is: if I have a hundred ChIP-seq experiments, do I have to click through this by myself every time? Absolutely not. As you might imagine, I run the ChIP experiments for ENCODE, thousands of them, through this pipeline. It's all accessible programmatically through scripts. I don't use the user interface; this is what you might use for one or two, but it's all accessible programmatically as well. All right, so populate your reads1 for those four mapping steps: rep 1, rep 2, control 1, and control 2. If you get hung up, raise your hand and someone will help you.

Okay, so, three replicates: we do all pairwise IDR comparisons, and we pick the longest list that passes this IDR framework. I do not have a three-replicate pipeline pre-built for you here, but I can build you one; if you'll give Ben your card or your email, we can get that to you. The pipeline does run on three replicates.

All right, so you need to navigate into this folder here that has the ZBED1 FASTQs. There are also FASTQs for a mouse H3K9ac experiment, for another human MAFK experiment, a CTCF experiment. You have to navigate into this folder, ENCSR286PCG, in order to get this list of four FASTQs. These are the only four FASTQs that work for this experiment, because they're the four FASTQs that were created by that experiment. It'll be tricky, because there is a screen where you get a list of twenty or so files and you won't know which one to choose, because a bunch of them will start with R1. You need to navigate into this specific folder to get these specific four FASTQs, which are the FASTQs you want for this experiment. Okay, who has all four mapping steps configured? Pretty good, all right.
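Since the programmatic route just came up: the same launch you're doing by clicking can be scripted with the dxpy Python bindings. This is a hedged sketch; the project and workflow IDs, file names, and stage input keys are all placeholders to look up in your own project, not guaranteed names:

```python
import dxpy

PROJECT = "project-xxxx"    # placeholder: your project ID
WORKFLOW = "workflow-xxxx"  # placeholder: the ENCODE TF ChIP-seq hg19 workflow

def find_file(name):
    """Look up a file by name in the project and return a DNAnexus link."""
    result = dxpy.find_one_data_object(classname="file", name=name,
                                       project=PROJECT, zero_ok=False)
    return dxpy.dxlink(result["id"])

wf = dxpy.DXWorkflow(WORKFLOW, project=PROJECT)
analysis = wf.run(
    {   # keys are "<stage index>.<input name>"; the input names here are
        # illustrative; check the workflow's own description for the real ones
        "0.reads1": find_file("rep1.fastq.gz"),
        "1.reads1": find_file("rep2.fastq.gz"),
        "2.reads1": find_file("ctl1.fastq.gz"),
        "3.reads1": find_file("ctl2.fastq.gz"),
    },
    project=PROJECT, folder="/ZBED1_results", name="human ZBED1")
print("launched analysis:", analysis.get_id())
```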
Anybody want me to wait while they continue to work? Okay, we're gonna wait just a minute. When you're done, you should see something like this: your workflow has been named, you've given it an output folder, and you've added the replicate 1 FASTQ, the replicate 2 FASTQ, the matched control for replicate 1, and the matched control for replicate 2. Four FASTQs, each into a reads1 input, because this is single-end sequencing, all right? Who has this all filled in, all four? Okay, who wants me to wait? We're gonna wait just a minute.

For those of you who have all this set up: all the other inputs, including the hg19 indexed reference for BWA, the chrom.sizes file, the .as files for the bigBeds that the pipeline makes, all the other support files, have been pre-configured in this workflow, so we don't have to specify anything else. All we have to do is specify the inputs. Of course, you can go in and change those if you wanted to map to, for example, GRCh38 or something else, but for today this pipeline has been pre-configured to run on hg19. Yes? Yes, every time you load the pipeline, if you load it the way we did today, selecting that entire ChIP-seq folder, you'll get everything that's in there. You could copy the individual workflows if you wanted to, and all of the required inputs that have been pre-configured will be transferred automatically. You don't have to transfer the example data if you don't want it, but if you click the whole ChIP-seq folder, you'll get everything. It'll always be there; test data is part of our distribution.

Okay, so raise your hand if you don't want me to move on. Okay, can we get, Ethan, can you help right here? Oh, it's a double negative, sorry, did I get it wrong? Okay, we're moving on, that's for sure. All right, so this is the easy part. You're done: click run as analysis. Thumbs up if you've clicked run as analysis. Okay. You have now launched the ENCODE TF ChIP-seq pipeline on DNAnexus. That's really all there is to it. You'll believe me when you see outputs, but once you click on that, it's running.

Let me tell you a little of what's happening in the background, because it's sort of interesting. On Amazon Web Services, the same cloud compute framework that serves you your Netflix shows when you binge-watch, this TF ChIP-seq analysis is now running. Everybody, and there's maybe 100 of you, just clicked run as analysis, so there's a bunch of instances coming up on Amazon Web Services that are starting out running BWA, and then they're gonna run samtools, and they're gonna run SPP, and then a bunch of other stuff too, all encapsulated in these scripts that we've written. And you could do this through your lousy DSL connection at home or wherever, because no data have actually moved between your computer and any server. All of this has happened in the cloud. You could close your laptop right now and come back to it later; it runs in the cloud, it doesn't need you anymore. You are now superfluous; the computers have taken over. If you have the FASTQs on your laptop, then you do have to upload them to the platform, but that does not cost money; it's the compute that costs money. Okay.
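And if your own FASTQs are sitting on your laptop, that upload can be scripted too. A minimal sketch with dxpy, with the project ID and folder as placeholders:

```python
import dxpy

# Upload a local FASTQ into a project folder. The upload itself is free;
# it's the compute (and storage) that cost money.
dxfile = dxpy.upload_local_file("my_reads.fastq.gz",
                                project="project-xxxx",  # placeholder
                                folder="/inputs",
                                show_progress=True)
print("uploaded as", dxfile.get_id())
```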
So you have a tab now called monitor, and you should see there the job for the analysis that you just launched. You can click on this little plus sign to see all the sub-jobs that are running. Do that and tell me if you see it. Okay, cool. That shows all of the sub-steps running in this pipeline. It will take on the order of 45 minutes to complete, and that should give us a little time at the end to do visualization once it's done. Okay. Hands up if you're stuck, if it's not working and you'd like it to work and you want help. Okay, right here. Oh, Esther's there, sorry.

Yes, a question. Most TF experiments, with something on the order of 20 million reads per replicate, four replicates, two experiment replicates and two replicated controls, that entire experiment end to end should cost about $20, everything, from FASTQ to replicated peaks. Yeah, so let's say you realize you populated with the wrong FASTQs and you want to stop the job. If you select that job, there'll be a red button called terminate, which you can click to stop the run, and then go back and populate a new workflow, if you made a mistake and put the wrong FASTQ in or something like that. Yes: this display here is about as close as you can get. Now, there is a special way of running the pipeline so that you can actually SSH into the workers that are running each step, and then you can really see everything that's going on. I'm not gonna show that today, but you can do it.

Okay, what I'd like to do now is actually get the RNA-seq pipeline going, because not everybody ChIPs; some people do RNA-seq. So let's leave this running and switch to the RNA-seq example. You need to bring up the RNA-seq PDF; it's a step-by-step guide just like this one, accessible in exactly the same way as the TF ChIP-seq step-by-step. Go to encode2016.org, that link in the middle about the workshop materials, scroll down to our session, and click on the PDF for the RNA-seq pipeline. Let's get that started now, too. It might take a minute to download the PDF because everyone is doing it at the same time, but it looks like this when you've got it. Raise your hand if you're having trouble getting to this PDF. Yeah? Okay, I'm sorry; apparently the link in my slide deck is broken, so do what I said just now: go to encode2016.org, go to the workshop materials, scroll down to our session, and the link is there. Okay, thumbs up if you have the RNA-seq step-by-step PDF up on your screen and you're ready to roll. That's not very many, so we'll wait a second. Raise your hand if you can't get to the RNA-seq PDF and need help; keep your hand up until someone comes and helps you.

Okay, that's a good question. The question was: do we have pipelines for comparing two or many experiments? No. What we're focused on is the analysis steps that produce what Zhiping calls this ground-level, per-experiment data: peak sets, transcript quantitations, signal tracks. Once that's all created in a comparable way, you can go back and begin to use other tools to do cross-experiment comparisons, but that's not what we have implemented here. Thumbs up if you're ready to go forward with RNA-seq. Let's do it. So this is very similar.
We're going to first copy the RNA-seq pipeline and its supporting files into your project. You don't have to make a new project; you can use the one you've already got. Go back to your project and click add data, just like you did before. Click from another project, and in exactly the same way as before, find the ENCODE Uniform Processing Pipelines project. I'm gonna pause here and let you get to that point. So, just like before: go back to your project, add data, upper left-hand corner, green button with a plus; find the ENCODE Uniform Processing Pipelines project, the public project, which is the same place you got the ChIP-seq folder from; and click from another project. Click on the ENCODE Uniform Processing Pipelines project. If you're stuck, raise your hand so someone can come and help. Okay, who got back to the ENCODE Uniform Pipelines project? Yes? All right, who wants me to wait a minute before I move forward? Okay, I got some requests to wait; we're gonna wait. Okay, here we go.

All right, so you are now back at the ENCODE Uniform Processing Pipelines project. Last time you clicked on the ChIP-seq folder; this time click on the long RNA-seq folder, this checkbox next to long RNA-seq, and click add data, down below right. You did the same thing with ChIP-seq; do it with long RNA-seq this time. It's a little tricky: if you highlight, you might expand it instead. You want this button right here, the checkbox next to long RNA-seq, and add data. Need help? Raise your hand. If you're ready to go forward, thumbs up. Okay, let's go. So you've clicked add data, you see this progress bar that shows the data have been copied to your project, click close, go back to your project, and you should now see a folder called long RNA-seq. You can click on that to see this list of applets and workflows in your long RNA-seq folder. Who sees this in their long RNA-seq folder? All right, anybody need help, raise your hand. Okay, let's keep rolling.

Just like you selected the ChIP-seq workflow, select the ENCODE RNA-seq long pipeline, one paired-end replicate: the checkbox next to it. Select that checkbox and click run analysis, the same way we did it for ChIP. Select the pipeline by clicking this box here and then click on run analysis. Who's there? Who wants me to stop and wait? Okay, so these are in your long RNA-seq folder in your project. Got it, great. Yeah, I didn't do a complete screenshot; this is a zoom-in there. Just like you did for the ChIP-seq pipeline, click on that workflow and we'll instantiate it with inputs so that it can run. Give it a name; I called this RNA-test, or Tim did, because he wrote it. And set an output folder in the same way you did for ChIP-seq. So give it a name and create a new output folder in exactly the same way you did for ChIP. After you've named it and selected your output folder, give me a thumbs up so I know we can keep rolling. Okay, I see some thumbs. Hands up if you need help at this stage. I'm seeing one time-out. Okay, we're gonna pause for just a minute. This is exactly what you did for ChIP: you gave it a name, you created a new folder for the outputs, and you associated the pipeline run with that folder. Raise your hand if you're stuck. Thumbs up if you've named your workflow, specified your output folder, and are ready to move forward. Okay, who wants me to wait longer?
Okay, we're rolling. Just like with ChIP, you want to populate the FASTQs for the input. In your workflow you'll see a box called "read 1 of paired-end FASTQ file", and right below it will be read 2. This example comes from paired-end sequencing, so we have two FASTQs, read 1s in one FASTQ and read 2s in the other, and we're gonna put one FASTQ in each of these boxes. Click on the first one, read 1, and you'll see a long list of files. First, go over here on the left and open up, with this little caret, the long RNA-seq folder, then the examples folder, and click on the input folder, okay? It's important to do that to get this short list of FASTQs; otherwise the platform tries to guess, offering every FASTQ in your project, and you see too many. So expand using these little carets here, drill down through long RNA-seq, through examples, into the input folder, and you should see this list here. What you want to choose is the chromosome 21 read 1 FASTQ .gz. We're just running chromosome 21 today, because that runs in about an hour and that's about all the time we have. Yes, it was just chromosome 21. And yes, this should be comparable.

No, that's a great question; I'm glad you asked, because I didn't make that clear. The question was: if I'm running it on everything, do I have to run this once per chromosome? No, because your FASTQ has unmapped reads; with a FASTQ, you don't know which chromosome a read goes to. The only reason this is chromosome 21 is that I took the whole FASTQ from the experiment, mapped it, pulled out the reads from chromosome 21, and turned them back into a FASTQ for this example.

So, on the cluster, does it actually split up the jobs? No, it does not. However, the mapping occurs on a 32-core instance and it saturates all 32 cores, so that's actually even better than breaking it up into per-chromosome jobs.

Like the source code, for example? Right, the question is: how do I see all the different parameters that are actually being run in the software packages? You look in the source. Some of those parameters are surfaced as settable parameters in the workflow run, but not all of them, and the reason is that this is the ENCODE pipeline: as soon as you change one parameter, it's something different from what we run. So you have to go to the source code to see exactly which parameters we're feeding to these different programs. Yes, it will say that, and it'll also say the version of the software that was used, and which software packages were used for each step.

Okay, so thumbs up if you have selected this read 1 chromosome 21 FASTQ. Okay. Hands up if you still need help, if you want me to pause. Okay, we're gonna go. So that was read 1; let's add the second FASTQ of that read pair. Click on "read 2 of paired-end FASTQ" and go back to that list of files by expanding these folders with those carets, the same way you did before; get into this input folder, and this time select the chromosome 21 dash R2. Thumbs up if you've already done that. All right, good. So you have now specified both the read 1s and the read 2s for the STAR alignment step in the pipeline. I'm gonna go to step 18 of the PDF, in case you're following along.
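For the curious, the chromosome 21 extraction described a moment ago (map everything, pull out one chromosome, turn it back into FASTQs) can be reproduced with standard samtools commands. A sketch, assuming a coordinate-sorted BAM from the full paired-end mapping; file names are placeholders:

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

run("samtools index full.bam")                        # region queries need an index
run("samtools view -b full.bam chr21 > chr21.bam")    # keep only chr21 alignments
run("samtools sort -n -o chr21.qsort.bam chr21.bam")  # group mates by name
# Convert back to paired FASTQs; unpaired and other reads go to /dev/null.
run("samtools fastq -1 chr21-R1.fastq.gz -2 chr21-R2.fastq.gz "
    "-0 /dev/null -s /dev/null chr21.qsort.bam")
```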
So this has actually been pre-configured for GRCh38; the ChIP-seq example was on hg19, but this one is pre-configured for GRCh38, so that's the assembly we're going to map to. And because you have fulfilled all of the required inputs, and you see that's automatically filled in here, the only required inputs were your FASTQs, you can now click run as analysis. So do that: click run as analysis. Thumbs up if you've clicked run as analysis. Okay, very good. You have now instantiated the ChIP-seq pipeline and the ENCODE RNA-seq pipeline; they're both running on the cloud. The ChIP-seq pipeline will be finished enough to look at in probably about 10 minutes, so we're gonna take a break now. Go outside, get a drink of water or whatever, come back in 10 minutes and we'll see if these jobs have finished.

All right, so we will start up again here in a minute or so. Helpers, go around and see how far the TF pipeline has run for some of the people. Check the monitor on your DNAnexus account and let's see how far you've gotten on the TF workflow run. Let's just sample a few. In SPP; we have a vote for in SPP, good. How long has it been in SPP? How many minutes? Just two minutes in SPP? Twelve in SPP, that's cool. Okay, that gives me an idea of where we are. Click on the monitor tab and then click on whatever you named your analysis. Sorry, there's two ways to get to it: you can click on this and it'll give you a graph of all the different sub-stages, or you can click on this little minus here and get the list.

Okay, so while the TF pipeline is finishing up, let me go into a little more detail about what the pipelines actually produce. They produce a lot of files. Some are intermediate files that are saved because, for various reasons, you might want to get deeply into the internals of how the pipeline runs. But because there are so many files, I wanted to call your attention to the basic or key files that you need to be aware of. When you're just starting with the pipelines, these are the files to pay attention to. For TF ChIP-seq and histone ChIP-seq, we have peak calls and we have signal tracks. The signal tracks are these continuously varying tracks that you see on the browser; they're what almost everybody uses in their slides when they talk about ChIP-seq, and they show the tag density along the genome. So you're familiar with those. You get signal tracks for TF and signal tracks for histone. You can get a signal track for rep 1, a signal track for rep 2, a signal track for the pool, and some other signal tracks which I won't talk about. But if you're just starting with the pipelines, which one should you look at? The pooled, control-normalized signal tracks will probably be the first thing you want to look at. That track takes both replicates, combines them, subtracts the background from the control, and plots that control-normalized tag density, and you can interpret the scale on it as a fold increase in tag count over the control at that position. So those are the pooled, control-normalized signal tracks, for both TF and histone ChIP. For long RNA-seq, the plus- and minus-strand signal from uniquely mapped reads is a good place to start.
There are other files output from the pipeline, of course, but that's a good one to start with. How do I find these? I've identified them here on the slide as the stage in the workflow and analysis that you run, then a colon, then the name of the output. So you'll see that your pooled, control-normalized signal is called pooled signal; I'll show you how to find it in just a second. And then you're interested in the peak calls.

Let me take a moment to explain the difference between peaks and signal. A lot of times people show a signal track and say, see, there's a peak. That's right, but there's also another set, a BED file of peaks, or peak calls. Peak calls are the output of the statistical framework that we apply to the signal, which asks: are these excursions that you see in the signal really significant or not? When you take into account the rest of the signal, and you take into account the control, is this excursion in the track real, statistically significant? Those are called peak calls. You'll find those in the pipeline as peak calls, and in the TF ChIP-seq pipeline, the IDR framework outputs what's called an optimal set. The optimal set of peaks is a good place to start to see those actual peak calls. The BED file is just features along the genome that have a start coordinate and a stop coordinate; it says right here is where we think that excursion in the signal track is significant. For RNA-seq, as I said, you have your signal track, and the pipeline also produces quantitation files.

Now, there are some special-purpose files. As I told you, a lot of files are generated by the pipeline, and some of them you only need once you're really deep into it. There are special-purpose files of sort of intermediate-level expertise, if you want to say it that way, that you might be interested in. The optimal set of peak calls includes pooled data, whereas another set, called the conservative set, only considers peaks that appear in both true replicates without pooling; that's the conservative set. There are, as I said, signal tracks for each replicate alone, without all the data pooled. There are also signal tracks that, instead of being expressed as fold signal over the control, are p-values, as in a test of the null hypothesis that the signal is no different from the control. And the histone ChIP-seq pipeline also produces a set of files. I'm not gonna go through any more of them, but now you have this slide, so when you go back and look at these pipelines you can say, ah yes, there was this list of outputs I was supposed to pay attention to; they're here. The ones we're going to visualize today are just these right here.

Another important category of outputs I mentioned before are the quality control metrics. These are sometimes stored in files, but they are numbers or plots that we generate as part of the pipeline run to evaluate properties of the library, to get some sense of whether the ChIP worked or not, and also to quantify the degree of replicate concordance. I've summarized them for you here. This slide is just meant to be something you can look back at to get the references for where the math behind these is defined. But in short: we are concerned with read depth, and there are standards that all ENCODE production labs conform to for their ChIP-seq experiments in terms of read depth.
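One note to file away for when the peak calls come back: TF peaks are distributed in the BED narrowPeak format, ten tab-separated columns. A minimal way to peek at one in Python; the file name is a placeholder, and the column layout follows the published narrowPeak spec:

```python
import gzip

# narrowPeak is BED6+4: chrom, start, end, name, score, strand,
# signalValue, pValue (-log10), qValue (-log10), summit offset.
with gzip.open("optimal_peaks.narrowPeak.gz", "rt") as fh:
    for line in fh:
        (chrom, start, end, name, score, strand,
         signal, p, q, summit) = line.rstrip("\n").split("\t")
        print(f"{chrom}:{start}-{end}  enrichment={signal}  -log10(q)={q}")
```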
Back to the QC metrics: there are estimates of library complexity. There's an analysis performed in the ChIP-seq pipeline called cross-correlation, which takes advantage of the asymmetric distribution of 5' ends around the binding site of a TF peak or a histone modification; that's the cross-correlation. There's a plot for that, and some numbers that summarize what's in the plot; those are all generated. Again, there are guidelines on the ENCODE portal that all of the ENCODE production labs follow with respect to these metrics, and because the pipeline generates them, you can compare these metrics for your experiment as well.

For replicate concordance in the TF ChIP-seq pipeline, you're interested in two numerical statistics: the IDR rescue ratio and the self-consistency ratio. The self-consistency ratio just asks, do I tend to see the same peaks in both of my replicates? So, are they self-consistent? Oh, sorry, that's not right: the self-consistency ratio is an expression of, if I split the signal from one replicate, do I recover the same peaks in both of those splits? And the rescue ratio is, if I take all of the signal and put it together, if I pool the reads, which peaks do I get that I didn't get when I considered the replicates by themselves? If I build up the signal, what new peaks do I see? The reproducibility test is just a pass/borderline/fail; it's the bottom line, was this a good experiment or not. The references here will tell you more about what those metrics mean, but a good graphical summary of some of the metrics I just described is in this Landt et al. paper in Genome Research, which you should definitely read. It's all about the QC metrics that ENCODE has adopted for ChIP-seq experiments. It gives you an idea of what we mean when we talk about a low-complexity library: a library that has lots of different fragments versus a library that appears to have been generated just by PCR amplification, with 5' ends all lined up. And this idea that the 5' ends of ChIP fragments are asymmetrically distributed around the peak, that's the cross-correlation analysis, and the paper shows what the plots look like in good and bad experiments. All of that is calculated by the pipeline, and you can read about it in that paper.

So, I showed you the schema for the TF pipeline that we ran. I didn't show you a schema for the RNA-seq pipeline, because I wanted to get things rolling, but I'll show you now. The RNA-seq pipeline looks like this: it starts with FASTQs; you've done this, right? You populated the mapping step with the FASTQs, and now what's running is this: signal tracks are being generated, and a quantification step is run. For a multi-replicate RAMPAGE experiment, IDR is the replicate concordance framework, and there's a statistic called MAD that's calculated for long RNA-seq. The pipeline you're running today does not include TopHat. You'll see, when you look on the ENCODE portal, that there are many RNA-seq experiments in which we've also mapped with TopHat and accessioned those results, because there are reasons why you might want mappings generated by TopHat versus STAR; but in the pipeline you're running today, you're only mapping with STAR. Go back to your DNAnexus monitor tab: whose TF ChIP-seq pipeline is done? Give me a thumbs up if you're done. I got one; somebody's done. Anybody else? Where is it?
Call out the step where it is if you're not done. Okay, somebody's in MACS. All right, so I'm going to show you one thing before we collect the results from the pipeline. No, no, that's everything; that's the full run. I think it's $2 or something, or a dollar, or maybe less than a dollar. Yeah, the full TF ChIP-seq pipeline takes about seven hours, or less than that. Tim, what's a typical run time for a long RNA-seq experiment? Yes, ten hours. It's not simply charged per hour, because the different steps use different amounts of compute resources, so you have to look at the cost as a total per-experiment cost. Not ahead of time, because, you know, you don't really know what you're going to get. Say again? Okay, okay. If you see that your pipeline has failed, which is possible, go back and look at the inputs and ask one of the helpers for a hand; for example, maybe you gave it the RNA-seq input for a ChIP-seq experiment.

So I want to show you what the output of the pipeline looks like on the ENCODE portal. The TF ChIP-seq experiment that you are running today in your account on DNAnexus, I have already run at the DCC, and accessioned all of the results, using exactly the same code that you're running today, although I access it programmatically rather than through the graphical interface. So I wanted to show you what it looks like when I accession all of those files. When you navigate to the experiment page for that particular ZBED1 ChIP-seq experiment, as you've seen, the portal gives you some metadata about the experiment itself and about the replicate structure; we have two replicates in this experiment. And then this is a summary of the connectivity between all the different files. Files are yellow ovals and processing steps are blue rectangles. On the portal, you can get access to the QC metrics that the pipeline calculates through these little green QC bubbles attached to their respective files. It looks a little complicated because it is; the TF ChIP-seq pipeline does a lot. This is the mapping step here, and you can see reads being transformed into alignments through that mapping step. Those alignments then go through a signal generation step and a peak calling step; the peaks that are generated are matched with their replicate and run through IDR to produce the optimal and conservative peak sets that I told you about before. Every ChIP-seq experiment on the portal that we've run through the pipeline looks like this. RNA-seq experiments have a graph of file connectivity as well.

If you are interested in details about the processing steps, you can click on one of these blue rectangles and see metadata about that step. For example, I clicked on, this is replicate two, this is the FASTQ whose chromosome 21 reads you just mapped, and you can see the version of BWA that was used for the mapping here. The list of other software tools used in the pipeline is here, and you get the steps in the pipeline that were run, and so forth. You can actually download each of these individual files to your laptop just by clicking on the file itself, and you'll get metadata about the file. And if you click on one of these little green bubbles, you'll get the QC metrics that were actually produced.
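To make the two replicate-concordance metrics from a few slides back concrete, here is the arithmetic as defined in the ENCODE/IDR guidelines, with illustrative peak counts:

```python
# N1, N2: IDR peak counts from the pseudo-replicates of each true replicate.
# Nt:     IDR peak count from comparing the two true replicates.
# Np:     IDR peak count from pseudo-replicates of the pooled reads.
N1, N2, Nt, Np = 30712, 28104, 27000, 33021  # illustrative numbers

self_consistency_ratio = max(N1, N2) / min(N1, N2)
rescue_ratio = max(Np, Nt) / min(Np, Nt)

# Guideline: both ratios under 2 pass, one over 2 is borderline,
# both over 2 fail the reproducibility test.
print(f"self-consistency ratio = {self_consistency_ratio:.2f}")
print(f"rescue ratio           = {rescue_ratio:.2f}")
```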
And so you don't have to go back to DNA Nexus to see anything that we have run and accessioned to the ENCODE portal; that's all there. But you'll get the same information for your experiment when you run it on DNA Nexus. And so you can see a little bit of what Mike Pazin was talking about at the very beginning of this session: if we begin to work with people outside of the consortium who want to contribute their data to ENCODE, we need these numbers for your experiments too, and these pipelines are the way to generate them for your experiment. Okay. I'd be happy to take questions at this point. If your TF ChIP-seq pipeline is done, give me a thumbs up. Okay, I'm seeing some that are done. Okay, hands up if your TF ChIP-seq pipeline is still running. Okay, most of them are still running. So we'll let that roll a little bit longer, and I'll show you what an RNA-seq experiment looks like. Okay, the question is, are there pipelines for CLIP-seq and RNA Bind-n-Seq? Not at this moment. However, the production labs that do those experiments are deploying those pipelines to DNA Nexus as well, so you'll get those pipelines soon. I don't know exactly when, but you will get them, and they'll run just like this. That's not true for every data type in ENCODE, but it is true for the ones that you asked about. So that's good. Right, exactly. Yeah, in DCC speak, you see the metadata for the pipelines, but not the pipelines themselves. So just like a ChIP-seq experiment, this is the experiment page for the RNA-seq experiment that you're running, and you can see the graph of files that are produced. I'll zoom out so I can get them on one screen. Again, you have reads. In this case, there are multiple FASTQs because it's paired-end and there are sequencing runs that are pooled. You see an alignment step, a signal generation step, and a quantification step. This is exactly what's running in your DNA Nexus account for RNA-seq right now. Okay, so yell "bingo" when your TF pipeline is done. Now, how many of you have found this representation of your pipeline run, this Gantt-chart-looking thing? You get to this on your Monitor tab: if you click on the analysis name that you gave here, you see this sort of Gantt chart that represents all the different stages, and it actually shows the dependency of one stage on another. You can see that mapping happened first; this is time going from left to right. Mapping happened first, and then the filtration and the cross-correlation analysis depended on those. For rep one, it started up as soon as rep one's mapping was complete, and for rep two, it started up as soon as rep two's was complete. All this scheduling I didn't have to write, because it's all part of the platform, which is another reason why we chose to deploy the pipelines this way. So when I ran this, my ENCODE Peaks step took 11 minutes and my SPP step took 24 minutes. Whose SPP has been going longer than 24 minutes? Okay, how close are you to 24? Give me some numbers. 16, 22, 16, in the 20s. Yeah, 17, okay, great. So we'll let it run for just a little bit longer. Yes, I can explain more about the pseudo-replication strategy. I don't have a graph. Okay, so pseudo-replication is a fancy word for splitting all the reads in half. Now, you can apply that splitting to the true biological replicates.
So you can take rep one and split it, and by split I just mean choose, without replacement, half the reads. Split the reads, map them independently, call peaks on them independently, and then compare those peak lists. That would be comparing two pseudo-replicates of replicate one. You can do the same for replicate two, and you can do the same for all the reads pooled together. So you take all the reads, you pool them, you've got a pooled read set, and now you pseudo-replicate that, right? And then you call peaks on those two. So each of those two pseudo-replicates of the pool contains information from both replicates, right? And what we do is we actually run the IDR replicate comparison on all of those pseudo-replicate pairs: the pseudo-replicates for each true replicate and the pseudo-replicates for the pool. That's why, if you click on SPP peaks, it'll expand out to nine runs of SPP, if I remember exactly how many there were. So what we're doing is calling peaks on the true replicates, the pool, the pseudo-replicates of rep one, the pseudo-replicates of rep two, and the pseudo-replicates of the pool, and then all of those pseudo-replicate comparisons are fed into IDR. Because what you want to know is: do I get replicable peaks from my pool by pooling the signal? If you look at the signal track for your true replicate one, there might be a little excursion that just barely misses being significant. And if you saw exactly that same excursion in your second replicate, but there too it was just below significance, then when you pool those together, as long as there's not too much noise in the control, those will now be significant excursions in the signal track and can be called as a peak from the pool. So the pseudo-replication strategy is, from the standpoint of splitting signal, very simple: just a matter of partitioning the reads into two halves, calling peaks on each half, and then subjecting those two peak sets to an IDR analysis as if they were true replicates. And in the end, in the TF pipeline, what you get is these two peak sets, which I called the conservative set and the optimal set. The conservative set is for when, for whatever reason, you don't want this pseudo-replication strategy on the pool; you just want to know which strong peaks do I see in both true replicates. That's the conservative set, so you're being conservative. The optimal set, however, is: if through this pooling and pseudo-replication strategy you could rescue more peaks, then you get those peaks in the optimal set. And that's why there's a QC metric called the rescue ratio, which expresses how successful the IDR was in rescuing peaks through this pooling and pseudo-replication strategy.
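Here is a toy sketch, in Python, of the splitting step just described: partition a read set into two halves chosen without replacement, for each true replicate and for the pool. It is purely illustrative; the real pipeline operates on BAM files, and all the names here are made up.

```python
# Toy sketch of pseudo-replication: split reads into two halves, without
# replacement. Real pipelines do this on BAMs; lists stand in for reads here.
import random

def pseudoreplicate(reads, seed=0):
    """Shuffle, then partition into two non-overlapping halves."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rep1_reads = [f"rep1_read{i}" for i in range(10)]  # stand-ins for real reads
rep2_reads = [f"rep2_read{i}" for i in range(10)]

pr1a, pr1b = pseudoreplicate(rep1_reads)                   # pseudo-reps of rep 1
pr2a, pr2b = pseudoreplicate(rep2_reads)                   # pseudo-reps of rep 2
pool_a, pool_b = pseudoreplicate(rep1_reads + rep2_reads)  # pseudo-reps of pool
# Each pair would then be peak-called independently and compared with IDR.
```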
All right, so even before we begin to visualize outputs, if you'll go to this tab that says Manage, you'll see the folder that you created for the output of your analysis run. And if you click on the caret for that folder, there'll be subfolders inside that have captured the output. And even though the pipeline hasn't completed, if you select this folder called ENCODE MACS2 inside the output folder that you created for your pipeline run, you should see a list of files that look something like this. So click the caret next to your output folder, expand to this list that says ENCODE BWA, ENCODE MACS2, SPP, and IDR, click on this folder, and you should see this list of files. Thumbs up if that's what you see. Okay, terrific. So this right here, pool.fc_signal.bw, that is your pooled fold-change-over-control signal track. It's done. We can visualize that now; we don't have to wait until all the rest of the peak calling is finished. We can pull this right out of the pipeline and visualize it. Select the checkbox next to that pool.fc_signal.bw and click Download. What we're gonna do here is generate a URL to that file that you just created by running the pipeline, and then we're gonna go over to UCSC, paste that URL in, and visualize it on the UCSC Genome Browser. So to do that, click on the checkbox next to pool.fc_signal.bw and click Download. I'll scroll back up for a second. All right, so what we're doing here is navigating into your output folder, right? I'm under the Manage tab, I'm in my output folder, which I've expanded by clicking on this little caret, and then I've clicked on this ENCODE MACS2 folder in order to see the files that were generated by that stage. If you see this, give me a thumbs up so I know where we are, okay? Hands up if you want me to stop or you need help. You will only see this if your pipeline has completed the step called ENCODE Peaks. If you haven't completed ENCODE Peaks, then you haven't made this file yet, so you can't see it. And the reason I'm pushing ahead a little bit is because I wanna get to visualization before we run out of time. So raise your hand if your pipeline is still in the ENCODE Peaks step. Okay, all right, some of you. All right, I'm sorry about that; maybe you can catch up. Select this and click Download in order to generate a URL. How many of you see this window here where you've generated a URL? Thumbs up if you see this window, okay? All right, so highlight that URL. In this example, there are two URLs, but don't worry about that. Get your URL, highlight it, and copy it, Control-C or whatever. And then open a new tab to the UCSC Genome Browser, genome.ucsc.edu. Once you've copied that URL and opened genome.ucsc.edu, give me a thumbs up so I know where we are. Good, good. This side of the room is doing better than this side of the room. I'm sorry, that's just the way it is. Maybe they're responding better, maybe it's the sun, I don't know. Maybe they're getting more help. If you're on this side, all you need to do is go to genome.ucsc.edu, and what we're gonna do is add a custom track through this My Data menu. All right, so hold up your hand to stop me if you are still working on getting to this place. Okay, I've got a couple of stops, so I'm gonna go back and review a little bit. Remember what we did: we went under the Manage tab in your project, we drilled down through the folder that you used as your output folder, we selected the results, ENCODE MACS2, we clicked on the checkbox, and we clicked on Download to generate a URL to that file. Then we copied that URL to the clipboard, and in a new tab or new window in your web browser we went to genome.ucsc.edu. Yeah, you're getting ahead.
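As an aside, if you would rather script the URL generation than click Download in the web UI, here is a minimal sketch using the dxpy client library. The file ID is a placeholder, and the exact call and its arguments are an assumption you should verify against the dxpy documentation.

```python
# Hedged sketch: generate a shareable download URL for a pipeline output file
# with dxpy instead of the web UI. "file-xxxx" is a placeholder ID, and the
# exact arguments should be checked against the dxpy docs.
import dxpy

bigwig = dxpy.DXFile("file-xxxx")  # placeholder: your pool.fc_signal.bw file ID
url, _headers = bigwig.get_download_url(preauthenticated=True)
print(url)  # this is the URL you would paste into the UCSC custom-track box
```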
For now, let's just look at that one signal track: because not everybody has finished the peak calling, I'm not gonna add the peaks track yet. I'm just gonna add the signal track, because that's already finished for most people's pipelines. If you need me to stop or you need help, please raise your hand. Okay, we're gonna go. So you have in this window the UCSC Genome Browser, genome.ucsc.edu. Click on My Data, and under My Data you're gonna see Custom Tracks. So click on Custom Tracks under My Data on the UCSC Genome Browser and you should see this window here, Add Custom Tracks. If you're at Add Custom Tracks, give me a thumbs up. Okay, good. And I'm just gonna give you a second to get there. We're about to paste the URL that we copied from DNA Nexus into this window here. Once you get to this window, make sure that we're on the right genome assembly: make sure that you're on mammal, and human, and that the assembly is hg19. Make sure that you've selected the right genome assembly, because that's the assembly that we mapped our experiment to. So the way we got here is genome.ucsc.edu, we clicked on My Data, then Custom Tracks, and that gave us this window here. If you have this window, where you're ready to paste the URL, give me a thumbs up, please. Yeah, don't worry about the fact that I'm pasting two URLs here; that's for when the pipeline is absolutely 100% complete. We're just gonna look at the signal track, because most people are finished with that. Paste your URL into this box here. And if you can't paste it in or you're having trouble, hold up your hand for help. I don't see any hands. Okay, right here, Beck. Right here, Gene. Hold up your hand if you want someone to help you get your URL pasted into UCSC. All right, if you've pasted your URL into UCSC, click Submit. Who wants me to wait? We'll get there. Okay, after you click Submit, you'll see this window here. Double-check that you're on the human hg19 assembly and click Go. Who's on this page right here? Thumbs up if you're on or past this page. Okay, great. Hands up if you want me to wait so someone can help you. Okay, click Go. I did not see something that looks like this yet. But you should see the UCSC Genome Browser on Human February 2009. Are you at least here on the browser page? Thumbs up if you're at least to this point. Yeah, okay. Down here, scroll down. You may see different tracks, because you may have tracks already open in UCSC that you've looked at before. Scroll down, and under Custom Tracks your ZBED1 fold-change-over-control signal should be there. Select Full so that you can see the full signal track. So go down, scroll down below the track display, and for this fold-change-over-control signal that you added as a custom track, click on that dropdown and click Full, and then click Submit. Full and then, sorry, Refresh, not Submit. And then go to chromosome 21. So say bingo when you get a signal track that looks sort of like this. Some people got it. Chromosome 21. And if you want to go to exactly where we are here, go to these coordinates; I have them a little bit larger here, and they're in your PDF if you still have it open. Go to this location right here by typing exactly this into the box at the top, above the tracks. So chromosome 21; I won't read the coordinates out here. You can also just search in that same box for the superoxide dismutase 1 gene, SOD1; you can just type that in here as well. Thumbs up if you see that signal track and you've gotten to some sensible location on chromosome 21. Yeah, raise your hand if you need help getting there. Some hands.
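For reference, UCSC will accept either the bare bigWig URL or a one-line track definition that sets the track name and display mode up front, which saves the dropdown clicks described above. Here is a small sketch assembling such a line; the URL is a placeholder for the download link you generated on DNA Nexus, and the track name is just an example.

```python
# Sketch: build a UCSC custom-track line for a bigWig signal file. The URL is
# a placeholder for the download link generated on DNA Nexus. Pasting a line
# like this sets the track name and "full" visibility in one step.
url = "https://example.com/pool.fc_signal.bw"  # placeholder, not a real link

track_line = (
    'track type=bigWig name="ZBED1 fc signal" '
    f'visibility=full bigDataUrl={url}'
)
print(track_line)  # paste into the Add Custom Tracks box on genome.ucsc.edu
```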
If you can't read exactly what the chromosome position is, just go to SOD1, you can just search for it, and then zoom out 3x and you'll see this. So you should see hg19 at the top. You might see other tracks, because you may have different tracks already open in your UCSC Genome Browser, but you should also see this ZBED1 fold-change-over-control track. Well, it'll be called something else for you, I think; I named it ZBED1 in this example. But set this fold-change-over-control signal track to Full and you should see this. Hold your hand up if you're not here and you need help finding it. Yeah, yours will be called pool.fc_signal; I renamed it for the example. I skipped the step, actually, where we name it. No, the track that you're looking at here is the pooled, control-normalized signal for this experiment, all right? Now, if you're at this SOD1 locus and you look across here, you see that there's lots of signal in this region of chromosome 21. There is an excursion in the signal track here that's higher than all the rest, and you might ask the question, well, is that a statistically significant peak? Or, more importantly, is this one a statistically significant peak? Is this one? Is that? That's what the IDR optimal peak set will answer for you. The instructions for getting that up in your browser are in the PDF. I've added it here. It's a little track, but it's super important. There it is. Right here, that block right there, is the peak that the pipeline called for this signal. Notice that there are no other peaks in this region. That is not a peak. That is not a peak. None of this is a peak, except for this signal here, because it survived the IDR framework: it was observed in both replicates consistently. Help or question? Help. This is the pool.fc_signal. Oh, this one here. It's an optimal narrowPeak bigBed. Okay, now we're going to do the bottom-line poll. So, if you tried to run the pipeline but couldn't get here, raise your hand. If you tried to run the pipeline and you did get here, raise your hand. Okay, congratulations. You just ran the ENCODE TF ChIP-seq pipeline on real data. Now you can do it yourself for your experiments. Give yourselves a hand. Yay. Yes. Okay. We'll do a little Q&A for the last few minutes. Yeah, I'm happy to take questions. And the first question was, okay, once my pipeline finishes, how do I get to those peak calls? That's in your PDF. If you go to the TF PDF and look at step 28, it'll show you how to get the optimal narrowPeak calls, and you can generate a URL for that file and paste it into the browser. So, the question is: if I zoom out a hundred times from the SOD1 gene, I see some regions of signal that are much higher but don't get called in the narrowPeak file; they are not narrow, they're a little fatter. Can you tell us why? Are these artifacts, or are they just not called by the pipeline? So, there are a lot of ways a signal can fail to rise above the level of significance; a wide signal, for example. When you zoom way out, what you're seeing is a smoothed signal across the whole segment of the chromosome. We don't call peaks that wide; that wouldn't be useful, right? Peaks that we call are punctate regions where there's good evidence that, in this case, ZBED1 actually bound in that sample. And so what you really have to do is zoom in until you're at roughly this scale, the, you know, one-gene or five-gene or ten-gene scale, until you can start to resolve the peak calls.
You'll always be able to see the signal, but you have to zoom in until there are only a few genes on screen before you begin to see the peak calls. And again, this data set is only chromosome 21. Yes, question. So, the question is, can you change the color of the tracks on the Genome Browser? Absolutely you can. I don't have a slide, but I could bring it up here, hold on. Below the tracks, there's a button that says Configure, and that allows you to set all sorts of different parameters. We will be coloring the tracks that we export to UCSC from the portal later, but right now they're just black and white; the pipeline just generates black-and-white tracks. Yes. No. The question was, on UCSC, if the transcript is on the Crick strand instead of the Watson strand, can you actually flip it around so it looks like it's going the way you want it to go? The answer is no. Yes, question. So, the question is, do we have documentation for the pipeline? Oh, a Docker implementation, very good. So, the question is, do we have a Docker implementation of the pipeline? No, we do not. However, that's the next thing we're gonna do. Docker is a containerization technology that allows you to put arbitrarily complicated software, with lots and lots of steps and lots and lots of dependencies, into a container that has everything in it, and then you can run that container on any computer, anywhere, as long as you have the right prerequisite software. So that's another way people deliver reusable software, and we'll do that. So, the question was, what's the difference between the conservative and the optimal set? The answer is related to what I said before about pseudo-replication. The conservative set are peaks that are observed in the true biological replicates: in order to make it into the conservative set, a peak has to be observed in both reps as a significant peak. However, if you pool the signal from the two replicates together, you might have two signals that are weak but absolutely identical. In each individual replicate, they don't rise above significance, but they're clearly the same signal. You pool them together, and the signal rises to significance. That peak now, if it passes IDR, can get through to the optimal set. So that's the difference between the conservative set and the optimal set. Most people use the optimal set, and I can explain to you exactly why: there's no funny business. It really is just that instance in which you look at the signal track and you're like, man, I know there's something there, but it's just below the line, and I look at my other replicate and it's there too. Those can go in the optimal set. Most people use the optimal set. Yeah, you would use the conservative set if you really only wanted those very strong signals that were significant in both true replicates. Okay, that's it for the pipelines workshop. Something that I wanted to say in the beginning and forgot to, so I'll say it now: on behalf of the DCC, the organizers of this event, thank you all very much for coming to Stanford, and welcome, and thanks for coming to our workshop. We hope very much that you'll be able to take some of the skills that you learned today home and actually use these pipelines. encode-help@lists.stanford.edu is the help desk for ENCODE.
All of those email messages come to our group, and if you have any questions at all about the pipelines, or anything else that the DCC has presented, or anything about ENCODE, you can send them to that address and we will all, as a group, see your message. So let us know if we can help, and thanks. I'd like to thank Seth very much for leading us through live demos. They're always tricky, and that was elegantly done. Kudos to Seth. Nothing crashed. And thanks to all of our friends from DNA Nexus here, and to everyone who ran around and helped, and to all of you for your patience while we ran around and helped. So we have a little bit of a break, 15 minutes. I think there are some cookies and coffee. We'll come back; we have four shorter demos this afternoon, and then that's almost it for today. Then the fun begins: tonight at the reception, we will have the live help desk. So if you still have lingering questions about what you've done in the last couple of hours, we'll have a table set up just for pipeline questions. Feel free to find one of us then, and of course the ENCODE help email address is a great resource. If you get home and you're like, wait, what did I do? Feel free to email us. Thank you very much. Have a nice break. Go stretch your legs. We'll be back in 15 minutes.